Outline Distributed Memory Architectures
Topologies
Cost models
Distributed Memory Programming
Send and Receive
Collective Communication
Historical Perspective Early machines were:
Collection of microprocessors
bi-directional queues between neighbors
Messages were forwarded by processors on path
Strong emphasis on topology in algorithms
Network Analogy To have a large number of transfers occurring at once, you need a large number of distinct wires
Networks are like streets
link = street
switch = intersection
distances (hops) = number of blocks traveled
routing algorithm = travel plans
Properties
latency: how long to get somewhere in the network
bandwidth: how much data can be moved per unit time
limited by the number of wires
and the rate at which each wire can accept data
Components of a Network Networks are characterized by
Topology - how things are connected
two types of nodes: hosts and switches
Routing algorithm - paths used
e.g., all east-west then all north-south (avoids deadlock)
Switching strategy
circuit switching: full path reserved for entire message
like the telephone
packet switching: message broken into separately-routed packets
like the post office
Flow control - what if there is congestion
if two or more messages attempt to use the same channel
may stall, move to buffers, reroute, discard, etc.
Properties of a Network Diameter is the maximum shortest path between two nodes in the graph.
A network is partitioned if some nodes cannot reach others.
The bandwidth of a link in the is: w * 1/t
w is the number of wires
t is the time per bit
Effective bandwidth lower due to packet overhead
Bisection bandwidth
sum of the minimum number of channels which, if removed, will partition the network Routing and control header
Data payload
Error code
Trailer
Topologies Originally much research in mapping algorithms to topologies
Cost to be minimized was number of “hops” = communication steps along individual wires
Modern networks use similar topologies, but hide hop cost, so algorithm design easier
changing interconnection networks no longer changes algorithms
Since some algorithms have “natural topologies”, still worth knowing
Linear and Ring Topologies Linear array
diameter is n-1, average distance ~n/3
bisection bandwidth is 1
Torus or Ring
diameter is n/2, average distance is n/4
bisection bandwidth is 2
Used in algorithms with 1D arrays
Meshes and Tori 2D
Diameter: 2 * n
Bisection bandwidth: n
2D mesh 2D torus Often used as network in machines
Generalizes to higher dimensions (Cray T3D used 3D Torus)
Natural for algorithms with 2D, 3D arrays
Hypercubes Number of nodes n = 2d for dimension d
Diameter: d
Bisection bandwidth is n/2
0d 1d 2d 3d 4d
Popular in early machines (Intel iPSC, NCUBE)
Lots of clever algorithms
Greycode addressing
each node connected to d others with 1 bit different 001 000 100 010 011 111 101 110
Trees Diameter: log n
Bisection bandwidth: 1
Easy layout as planar graph
Many tree algorithms (summation)
Fat trees avoid bisection bandwidth problem
more (or wider) links near top
example, Thinking Machines CM-5
Butterflies Butterfly building block
Diameter: log n
Bisection bandwidth: n
Cost: lots of wires
Use in BBN Butterfly
Natural for FFT O 1 O 1 O 1 O 1
Evolution of Distributed Memory Multiprocessors Direct queue connections replaced by DMA (direct memory access)
Processor packs or copies messages
Initiates transfer, goes on computing
Message passing libraries provide store-and-forward abstraction
can send/receive between any pair of nodes, not just along one wire
Time proportional to distance since each processor along path must participate
Wormhole routing in hardware
special message processors do not interrupt main processors along path
message sends are pipelined
don’t wait for complete message before forwarding
Performance Models
PRAM Parallel Random Access Memory
All memory access free
Theoretical, “too good to be true”
OK for understanding whether an algorithm has enough parallelism at all
Slightly more realistic:
Concurrent Read Exclusive Write (CREW) PRAM
Latency and Bandwidth Time to send message of length n is roughly
Topology irrelevant
Often called “a-b model” and written
Usually a >> b >> time per flop
One long message cheaper than many short ones
Can do hundreds or thousands of flops for cost of one message
Lesson: need large computation to communication ratio to be efficient Time = latency + n*cost_per_word
= latency + n/bandwidth Time = a + n*b a + n*b << n*(a + 1*b)
Example communication costs a and b measured in units of flops, b measured per 8-byte word Machine Year a b Mflop rate per proc CM-5 1992 1900 20 20
IBM SP-1 1993 5000 32 100
Intel Paragon 1994 1500 2.3 50
IBM SP-2 1994 7000 40 200
Cray T3D (PVM) 1994 1974 28 94
UCB NOW 1996 2880 38 180
SGI Power Challenge 1995 3080 39 308
SUN E6000 1996 1980 9 180
More detailed performance model: LogP L: latency across the network
o: overhead (sending and receiving busy time)
g: gap between messages (1/bandwidth)
P: number of processors
People often group overheads into latency (a, b model)
Real costs more complicated
(see Culler/Singh, Chapter 7) P M P M os or L (latency)
Implementing Message Passing Many “message passing libraries” available
Chameleon, from ANL
CMMD, from Thinking Machines
Express, commercial
MPL, native library on IBM SP-2
NX, native library on Intel Paragon
Zipcode, from LLL
…
PVM, Parallel Virtual Machine, public, from ORNL/UTK
MPI, Message Passing Interface, industry standard
Need standards to write portable code
Rest of this discussion independent of which library
Will have detailed MPI lecture later
Implementing Synchronous Message Passing Send completes after matching receive and source data has been sent
Receive completes after data transfer complete from matching send source destination
1) Initiate send send (Pdest, addr, length,tag) rcv(Psource, addr,length,tag)
2) Address translation on Pdest
3) Send-Ready Request send-rdy-request
4) Remote check for posted receive tag match
5) Reply transaction
receive-rdy-reply
6) Bulk data transfer
time
data-xfer
Comments