Newest Viewed Downloaded

Distributed Memory Computers and Programming

Distributed Memory Computers and Programming

Outline Distributed Memory Architectures Topologies Cost models Distributed Memory Programming Send and Receive Collective Communication

Historical Perspective Early machines were: Collection of microprocessors bi-directional queues between neighbors Messages were forwarded by processors on path Strong emphasis on topology in algorithms

Network Analogy To have a large number of transfers occurring at once, you need a large number of distinct wires Networks are like streets link = street switch = intersection distances (hops) = number of blocks traveled routing algorithm = travel plans Properties latency: how long to get somewhere in the network bandwidth: how much data can be moved per unit time limited by the number of wires and the rate at which each wire can accept data

Components of a Network Networks are characterized by Topology - how things are connected two types of nodes: hosts and switches Routing algorithm - paths used e.g., all east-west then all north-south (avoids deadlock) Switching strategy circuit switching: full path reserved for entire message like the telephone packet switching: message broken into separately-routed packets like the post office Flow control - what if there is congestion if two or more messages attempt to use the same channel may stall, move to buffers, reroute, discard, etc.

Properties of a Network Diameter is the maximum shortest path between two nodes in the graph. A network is partitioned if some nodes cannot reach others. The bandwidth of a link in the is: w * 1/t w is the number of wires t is the time per bit Effective bandwidth lower due to packet overhead Bisection bandwidth sum of the minimum number of channels which, if removed, will partition the network Routing and control header Data payload Error code Trailer

Topologies Originally much research in mapping algorithms to topologies Cost to be minimized was number of “hops” = communication steps along individual wires Modern networks use similar topologies, but hide hop cost, so algorithm design easier changing interconnection networks no longer changes algorithms Since some algorithms have “natural topologies”, still worth knowing

Linear and Ring Topologies Linear array diameter is n-1, average distance ~n/3 bisection bandwidth is 1 Torus or Ring diameter is n/2, average distance is n/4 bisection bandwidth is 2 Used in algorithms with 1D arrays

Meshes and Tori 2D Diameter: 2 * n Bisection bandwidth: n 2D mesh 2D torus Often used as network in machines Generalizes to higher dimensions (Cray T3D used 3D Torus) Natural for algorithms with 2D, 3D arrays

Hypercubes Number of nodes n = 2d for dimension d Diameter: d Bisection bandwidth is n/2 0d 1d 2d 3d 4d Popular in early machines (Intel iPSC, NCUBE) Lots of clever algorithms Greycode addressing each node connected to d others with 1 bit different 001 000 100 010 011 111 101 110

Trees Diameter: log n Bisection bandwidth: 1 Easy layout as planar graph Many tree algorithms (summation) Fat trees avoid bisection bandwidth problem more (or wider) links near top example, Thinking Machines CM-5

Butterflies Butterfly building block Diameter: log n Bisection bandwidth: n Cost: lots of wires Use in BBN Butterfly Natural for FFT O 1 O 1 O 1 O 1

Evolution of Distributed Memory Multiprocessors Direct queue connections replaced by DMA (direct memory access) Processor packs or copies messages Initiates transfer, goes on computing Message passing libraries provide store-and-forward abstraction can send/receive between any pair of nodes, not just along one wire Time proportional to distance since each processor along path must participate Wormhole routing in hardware special message processors do not interrupt main processors along path message sends are pipelined don’t wait for complete message before forwarding

Performance Models

PRAM Parallel Random Access Memory All memory access free Theoretical, “too good to be true” OK for understanding whether an algorithm has enough parallelism at all Slightly more realistic: Concurrent Read Exclusive Write (CREW) PRAM

Latency and Bandwidth Time to send message of length n is roughly Topology irrelevant Often called “a-b model” and written Usually a >> b >> time per flop One long message cheaper than many short ones Can do hundreds or thousands of flops for cost of one message Lesson: need large computation to communication ratio to be efficient Time = latency + n*cost_per_word = latency + n/bandwidth Time = a + n*b a + n*b << n*(a + 1*b)

Example communication costs a and b measured in units of flops, b measured per 8-byte word Machine Year a b Mflop rate per proc CM-5 1992 1900 20 20 IBM SP-1 1993 5000 32 100 Intel Paragon 1994 1500 2.3 50 IBM SP-2 1994 7000 40 200 Cray T3D (PVM) 1994 1974 28 94 UCB NOW 1996 2880 38 180 SGI Power Challenge 1995 3080 39 308 SUN E6000 1996 1980 9 180

More detailed performance model: LogP L: latency across the network o: overhead (sending and receiving busy time) g: gap between messages (1/bandwidth) P: number of processors People often group overheads into latency (a, b model) Real costs more complicated (see Culler/Singh, Chapter 7) P M P M os or L (latency)

Implementing Message Passing Many “message passing libraries” available Chameleon, from ANL CMMD, from Thinking Machines Express, commercial MPL, native library on IBM SP-2 NX, native library on Intel Paragon Zipcode, from LLL … PVM, Parallel Virtual Machine, public, from ORNL/UTK MPI, Message Passing Interface, industry standard Need standards to write portable code Rest of this discussion independent of which library Will have detailed MPI lecture later

Implementing Synchronous Message Passing Send completes after matching receive and source data has been sent Receive completes after data transfer complete from matching send source destination 1) Initiate send send (Pdest, addr, length,tag) rcv(Psource, addr,length,tag) 2) Address translation on Pdest 3) Send-Ready Request send-rdy-request 4) Remote check for posted receive tag match 5) Reply transaction receive-rdy-reply 6) Bulk data transfer time data-xfer

Showing 1 - 20 of 25 items Details

Name: 
4network
Author: 
N/A
Company: 
N/A
Description: 
Distributed Memory Computers and Programming
Tags: 
send | data | tag | messag | processor | bandwidth | receiv | network
Created: 
1/20/1997 7:06:50 AM
Slides: 
25
Views: 
1
Downloads: 
1
Rating: 
0


> Comment



Share this presentation
|

Comments

Share this presentation:

|
Sitemap