David E. Keyes
Center for Computational Science
Old Dominion University
Institute for Computer Applications in Science & Engineering
NASA Langley Research Center
Institute for Scientific Computing Research
Lawrence Livermore National Laboratory
Algorithms for Cluster Computing Based on Domain Decomposition
David E. Keyes
Center for Computational Science
Old Dominion University
Institute for Computer Applications in Science & Engineering
NASA Langley Research Center
Institute for Scientific Computing Research
Lawrence Livermore National Laboratory
Algorithms for Cluster Computing Based on Domain Decomposition
Plan of Presentation
Introduction (5)
Imperative of domain decomposition and multilevel methods for terascale computing (5)
Basic and “advanced” domain decomposition and multilevel algorithmic concepts (14)
Vignettes of domain decomposition and multilevel performance (13)
(with a little help from my friends…)
Agenda for future research (6)
Terascale Optimal PDE Simulations (TOPS) software project (14)
Conclusions (4)
Terascale simulation has been “sold”
Environment
global climate
groundwater flow
Lasers & Energycombustion ICF modeling Engineeringstructural dynamicselectromagnetics Experiments expensive Chemistry & Bio
materials modelingdrug design Experiments controversial Applied
Physics
radiation transport
hydrodynamics Experiments prohibited Scientific
Simulation Experiments dangerous However, it is far from proven! To meet expectations, we need to handle problems of multiple physical scales.
Large platforms have been provided
‘97 ‘98 ‘99 ‘00 ‘01 ‘02 ‘03 ‘04 ‘05 ‘06 100+ Tflop / 30 TB Time (CY) Capability 1+ Tflop / 0.5 TB Plan Develop Use 30+ Tflop / 10 TB Red 3+ Tflop / 1.5 TB Blue 10+ Tflop / 4 TB White 50+ Tflop / 25 TB ASCI program of the U.S. DOE has roadmap to go to 100 Tflop/s by 2006
www.llnl.gov/asci/platforms Sandia Los Alamos Livermore Livermore
T3E “horizontal” aspects “vertical” aspects message passing, shared memory threads register blocking, cache blocking, prefetching Must run on physically distributed memory units connected by message-passing network, each serving one or more processors with multiple levels of cache
Building platforms is the “easy” part
Domain decomposition also “natural” for software engineering Fortunate that mathematicians built up its theory in advance of requirements! Algorithms must be
highly concurrent and straightforward to load balance
latency tolerant
cache friendly (good temporal and spatial locality)
highly scalable (in the sense of convergence)
Domain decomposed multilevel methods “natural” for all of these
Decomposition strategies for Lu=f in
Consider the implicitly discretized parabolic case Operator decomposition
Function space decomposition
Domain decomposition
Operator decomposition
Iteration matrix consists of four multiplicative substeps per timestep
two sparse matrix-vector multiplies
two sets of unidirectional bandsolves
Parallelism within each substep
But global data exchanges between bandsolve substeps
Consider ADI
Function space decomposition
System of ordinary differential equations
Perhaps are diagonal matrices
Perfect parallelism across spectral index
But global data exchanges to transform back to physical variables at each step
Consider a spectral Galerkin method
Domain decomposition
Solve by a Krylov method
Matrix-vector multiplies with
parallelism on each subdomain
nearest-neighbor exchanges, global reductions
possible small global system (not needed for parabolic case) = Consider restriction and extension operators for subdomains, , and for possible coarse grid,
Replace discretized with
Comparison
(Of course, domain decomposition can be interpreted as a special operator or function space decomposition) Operator decomposition (ADI)
natural row-based assignment requires all-to-all, bulk data exchanges in each step (for transpose)
Function space decomposition (Fourier)
Natural mode-based assignment requires all-to-all, bulk data exchanges in each step (for transform)
Domain decomposition (Schwarz)
Natural domain-based assignment requires local (nearest neighbor) data exchanges, global reductions, and optional small global problem
Theoretical Scaling of Domain Decomposition for Three Common Network Topologies
With tree-based (logarithmic) global reductions and scalable nearest neighbor hardware:
optimal number of processors scales linearly with problem size (scalable, assumes one subdomain per processor)
With 3D torus-based global reductions and scalable nearest neighbor hardware:
optimal number of processors scales as three-fourths power of problem size (almost scalable)
With common bus network (heavy contention):
optimal number of processors scales as one-fourth power of problem size (not scalable)
bad news for conventional Beowulf clusters, but see 2000 Bell Prize “price-performance awards” using multiple NICs per Beowulf node!
Optimal polynomials of lead to various preconditioned Krylov methods
Scale recurrence, e.g., with , leads to multilevel methods
The most basic idea in iterative methods
Evaluate residual accurately, but solve approximately, where is an approximate inverse to A
A sequence of complementary solves can be used, e.g., with first and then one has
Multilevel Preconditioning
smoother Finest Grid First Coarse Grid coarser grid has fewer cells (less work & storage) Restrictiontransfer from fine to coarse grid Recursively apply this idea until we have an easy problem to solve A Multigrid V-cycle Prolongationtransfer from coarse to fine grid
Schwarz Preconditioning
Let Boolean rectangular matrix extract the subset of : Let The Boolean matrices are gather/scatter operators, mapping between a global vector and its subdomain support Given A x = b , partition x into subvectors, corresp. to subdomains of the domain of the PDE, nonempty, possibly overlapping, whose union is all of the elements of
Iteration Count Estimates From the Schwarz Theory
Ο(P1/3) Ο(P1/3) 1-level Additive Schwarz Ο(1) Ο(1) 2-level Additive Schwarz Ο((NP)1/6) Ο((NP)1/4) Domain Jacobi (=0) Ο(N1/3) Ο(N1/2) Point Jacobi in 3D in 2D Preconditioning Type Krylov-Schwarz iterative methods typically converge in a number of iterations that scales as the square-root of the condition number of the Schwarz-preconditioned system In terms of N and P, where for d-dimensional isotropic problems, N=h-d and P=H-d, for mesh parameter h and subdomain diameter H, iteration counts may be estimated as follows:
Schur Preconditioning
Given a partition
Condense:
Let M be a good preconditioner for S
Then is a preconditioner for A
Moreover, solves with may be done approximately if all degrees of freedom are retained
Schwarz polynomials
This leads to “Hybrid II” in Smith, Bjorstad & Gropp: Polynomials of Schwarz projections that are combinations of additive and multiplicative may be appropriate for certain implementations
We may solve the fine subdomains concurrently and follow with a coarse grid (redundantly/cooperatively)
Comments