Real world problems have parallelism and locality:
Many objects operate independently of others.
Objects often depend much more on nearby than distant objects.
Dependence on distant objects can often be simplified.
Scientific models may introduce more parallelism:
When a continuous problem is discretized, temporal domain dependencies are generally limited to adjacent time steps.
Far-field effects may be ignored or approximated in many cases.
Many problems exhibit parallelism at multiple levels
Example: circuits can be simulated at many levels, and within each there may be parallelism within and between subcircuits.
Not all problems solved on parallel machines involve simulation, but many do
Basic Kinds of Simulation
Discrete event systems:
Examples: “Game of Life,” logic level circuit simulation.
Particle systems:
Examples: billiard balls, semiconductor device simulation, galaxies.
Lumped variables depending on continuous parameters:
ODEs, e.g., circuit simulation (Spice), structural mechanics, chemical kinetics.
Continuous variables depending on continuous parameters:
PDEs, e.g., heat, elasticity, electrostatics.
A given phenomenon can be modeled at multiple levels.
Many simulations combine more than one of these techniques.
Outline
discrete continuous Discrete event systems
Time and space are discrete
Particle systems
Important special case of lumped systems
Ordinary Differential Equations (ODEs)
Lumped systems
Location/entities are discrete, time is continuous
Partial Different Equations (PDEs)
Time and space are continuous
Discrete Event Systems
Discrete Event Systems
Systems are represented as:
finite set of variables.
the set of all variable values at a given time is called the state.
each variable is updated by computing a transition function depending on the other variables.
System may be:
synchronous: at each discrete timestep evaluate all transition functions; also called a finite state machine.
asynchronous: transition functions are evaluated only if the inputs change, based on an “event” from another part of the system; also called event driven simulation.
Example: The “game of life:”
Also known as Sharks and Fish #3:
http://www.cs.berkeley.edu/~demmel/cs267/Sharks_and_Fish/
Space divided into cells, rules govern cell contents at each step
Parallelism in Sharks and Fish (Recap)
P4 P1 P2 P3 P5 P6 P7 P8 P9 Repeat
compute locally to update local system
barrier()
exchange state info with neighbors
until done simulating The simulation is synchronous
use two copies of the grid (old and new).
the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in old grid.
simulation proceeds in timesteps-- each cell is updated at every step.
Easy to parallelize by dividing physical domain
Locality is achieved by using large patches of the ocean
Only boundary values from neighboring patches are needed.
Synchronous Circuit Simulation
edge crossings = 6 edge crossings = 10 Circuit is a graph made up of subcircuits connected by wires
Component simulations need to interact if they share a wire.
Data structure is irregular (graph) of subcircuits.
Parallel algorithm is timing-driven or synchronous:
Evaluate all components at every timestep (determined by known circuit delay)
Graph partitioning assigns subgraphs to processors (NP-complete)
Determines parallelism and locality.
Attempts to evenly distribute subgraphs to nodes (load balance).
Attempts to minimize edge crossing (minimize communication).
Nodes and edges in graph may be weighted
Optimal graph partitioning is NP-complete (define for non-CS audience)
Good graph partitioning techniques exist
More on this in a later lecture
Asynchronous Simulation
Synchronous simulations may waste time:
Simulate even when the inputs do not change,.
Asynchronous simulations update only when an event arrives from another component:
No global time steps, but individual events contain time stamp.
Example: Game of life in loosely connected ponds (don’t simulate empty ponds).
Example: Circuit simulation with delays (events are gates changing).
Example: Traffic simulation (events are cars changing lanes, etc.).
Asynchronous is more efficient, but harder to parallelize
In MPI, events are naturally implemented as messages, but how do you know when to execute a “receive”?
Scheduling Asynchronous Circuit Simulation
Conservative:
Only simulate up to (and including) the minimum time stamp of inputs.
May need deadlock detection if there are cycles in graph, or else “null messages”.
Example: Pthor circuit simulator in Splash1 from Stanford.
Speculative (or Optimistic):
Assume no new inputs will arrive and keep simulating.
May need to backup if assumption wrong.
Example: Timewarp [D. Jefferson], Parswec [Wen,Yelick].
Optimizing load balance and locality is difficult:
Locality means putting tightly coupled subcircuit on one processor.
Since “active” part of circuit likely to be in a tightly coupled subcircuit, this may be bad for load balance.
Summary of Discrete Even Simulations
Model of the world is discrete
Both time and space
Approach
Decompose domain, i.e., set of objects
Run each component ahead using
Synchronous: communicate at end of each timestep
Asynchronous: communicate on-demand
Conservative scheduling – wait for inputs
Speculative scheduling – assume no inputs, roll back if necessary
Particle Systems
Particle Systems
A particle system has
a finite number of particles.
moving in space according to Newton’s Laws (i.e. F = ma).
time is continuous.
Examples:
stars in space with laws of gravity.
electron beam and ion beam semiconductor manufacturing.
atoms in a molecule with electrostatic forces.
neutrons in a fission reactor.
cars on a freeway with Newton’s laws plus model of driver and engine.
Many simulations combine particle simulation techniques with some discrete event techniques (e.g., Sharks and Fish).
Forces in Particle Systems
External force
ocean current to sharks and fish world (S&F 1).
externally imposed electric field in electron beam.
Nearby force
sharks attracted to eat nearby fish (S&F 5).
balls on a billiard table bounce off of each other.
Van der Wals forces in fluid (1/r6).
Far-field force
fish attract other fish by gravity-like (1/r2 ) force (S&F 2).
gravity, electrostatics, radiosity.
forces governed by elliptic PDE. force = external_force + nearby_force + far_field_force Force on each particle decomposed into near and far:
Parallelism in External Forces
External forces are the simplest to implement.
The force on each particle is independent of other particles.
Called “embarrassingly parallel”.
Evenly distribute particles on processors
Any even distribution works.
Locality is not an issue, no communication.
For each particle on processor, apply the external force.
Parallelism in Nearby Forces
Need to check for collisions between regions *often called “domain decomposition,” but the term also refers to a numerical technique. Nearby forces require interaction and therefore communication.
Force may depend on other nearby particles:
Example: collisions.
simplest algorithm is O(n2): look at all pairs to see if they collide.
Usual parallel model is decomposition* of physical domain:
O(n2/p) particles per processor if evenly distributed.
Parallelism in Nearby Forces
Communicate particles in boundary region to neighbors Challenge 1: interactions of particles near processor boundary:
need to communicate particles near boundary to neighboring processors.
surface to volume effect means low communication.
Which communicates less: squares (as below) or slabs?
Parallelism in Nearby Forces
Example: each square contains at most 3 particles See: http://njord.umiacs.umd.edu:1601/users/brabec/quadtree/points/prquad.html
Challenge 2: load imbalance, if particles cluster:
galaxies, electrons hitting a device wall.
To reduce load imbalance, divide space unevenly.
Each region contains roughly equal number of particles.
Quad-tree in 2D, oct-tree in 3D.
Parallelism in Far-Field Forces
Implement by rotating particle sets.
Keeps processors busy
All processor eventually see all particles Far-field forces involve all-to-all interaction and therefore communication.
Force depends on all other particles:
Examples: gravity, protein folding
Simplest algorithm is O(n2) as in S&F 2, 4, 5.
Just decomposing space does not help since every particle needs to “visit” every other particle.
Use more clever algorithms to beat O(n2).
Far-field Forces: Particle-Mesh Methods
1) Particles are moved to mesh (scatter)
2) Solve mesh problem
3) Forces are interpolated at particles (gather) Based on approximation:
Superimpose a regular mesh.
“Move” particles to nearest grid point.
Exploit fact that the far-field force satisfies a PDE that is easy to solve on a regular mesh:
FFT, multigrid (described in future lecture)
Accuracy depends on the fineness of the grid is and the uniformity of the particle distribution.
Data structures: a 1D array of particles (each with a location)
a 2D (or 3D) array representing the mesh
Scatter: store particle value (say mass) at nearby mesh point
Gather: read forces from mesh, using particle location as index (as in scatter)
Comments