Newest Viewed Downloaded

CS 267 Tricks with Trees James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr05

CS 267 Tricks with Trees James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr05

Outline

2 2 A log n lower bound to compute any function in parallel Reduction and broadcast in O(log n) time Parallel prefix (scan) in O(log n) time Adding two n-bit integers in O(log n) time Multiplying n-by-n matrices in O(log n) time Inverting n-by-n triangular matrices in O(log n) time Inverting n-by-n dense matrices in O(log n) time Evaluating arbitrary expressions in O(log n) time Evaluating recurrences in O(log n) time Solving n-by-n tridiagonal matrices in O(log n) time Traversing linked lists Computing minimal spanning trees Computing convex hulls of point sets

A log n lower bound to compute any function of n variables

Assume we can only use binary operations, one per time unit After 1 time unit, an output can only depend on two inputs Use induction to show that after k time units, an output can only depend on 2k inputs After log2 n time units, output depends on at most n inputs A binary tree performs such a computation

Broadcasts and Reductions on Trees

Parallel Prefix, or Scan

y[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-1 If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final value

Mapping Parallel Prefix onto a Tree - Details

1) Get values L and R from left and right children 2) Save L in a local register Lsave 3) Pass sum L+R to parent 1) Get value S from parent (the root gets 0) 2) Send S to the left child 3) Send S + Lsave to the right child Up-the-tree phase (from leaves to root) By induction, Lsave = sum of all leaves in left subtree Down the tree phase (from root to leaves) By induction, S = sum of all leaves to left of subtree rooted at the parent

E.g., Fibonacci via Matrix Multiply Prefix

Fn+1 = Fn + Fn-1 Can compute all Fn by matmul_prefix on [ , , , , , , , , ] then select the upper left entry Slide source: Alan Edelman, MIT

Adding two n-bit integers in O(log n) time

c[-1] = 0 … rightmost carry bit for i = 0 to n-1 c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] ) ... next carry bit s[i] = ( a[i] xor b[i] ) xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1] 1 1 0 1 1 1 … 2-by-2 Boolean matrix multiplication (associative) = C[i] * C[i-1] * … C[0] * 0 1 … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers We want their sum s = a+b = s[n]s[n-1]…s[0] Challenge: compute all c[i] in O(log n) time via parallel prefix Used in all computers to implement addition - Carry look-ahead

Other applications of scans

There are several applications of scans, some more obvious than others add multi-precision numbers (represented as array of numbers) evaluate recurrences, expressions solve tridiagonal systems (unstable!) implement bucket sort and radix sort to dynamically allocate processors to search for regular expression (e.g., grep) Names: +\ (APL), cumsum(Matlab), MPI_SCAN Note: 2n operations used when only n-1 needed

Multiplying n-by-n matrices in O(log n) time

k =1 n For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j) cost = 1 time unit, using n^3 processors For all (1 <= I,j <= n) C(i,j) = S P(i,j,k) cost = O(log n) time, using a tree with n^3 / 2 processors

Inverting triangular n-by-n matrices in O(log2 n) time

A 0 C B -1 = A 0 -B CA B -1 -1 -1 -1 If T is 1-by-1 return 1/T else … Write T = A 0 C B In parallel do { invA = TriInv(A) invB = TriInv(B) } … implicitly uses a tree newC = -invB * C * invA Return invA 0 newC invB Fact: Function TriInv(T) … assume n = dim(T) = 2m for simplicity time(TriInv(n)) = time(TriInv(n/2)) + O(log(n)) Change variable to m = log n to get time(TriInv(n)) = O(log2n)

Inverting Dense n-by-n matrices in O(log n) time

2 i=1 n i=1 n 1) Compute the powers A2, A3, …,An-1 by parallel prefix cost = O(log2 n) 2) Compute the traces sk = trace(Ak) cost = O(log n) 3) Solve Newton identities for coefficients of characteristic polynomial cost = O(log2 n) 4) Evaluate A-1 using Cayley-Hamilton Theorem cost = O(log n) Lemma 1: Cayley-Hamilton Theorem expression for A-1 via characteristic polynomial in A Lemma 2: Newton’s Identities Triangular system of equations for coefficients of characteristic polynomial, matrix entries = sk Lemma 3: sk = trace(Ak) = S Ak [i,i] = S [li (A)]k Csanky’s Algorithm (1976) Completely numerically unstable

Evaluating arbitrary expressions

Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and / Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity Sketch of (modern) proof: evaluate expression tree E greedily by collapsing all leaves into their parents at each time step evaluating all “chains” in E with parallel prefix

Evaluating recurrences

xi = fi(xi-1) = (ai * xi-1 + bi )/( ci * xi-1 + di ) can be written as xi = numi / deni = (ai * numi-1 + bi * deni-1)/(ci * numi-1 + di * deni-1) or numi = ai bi * numi-1 = Mi * numi-1 = Mi * Mi-1 * … * M1* num0 demi ci di deni-1 deni-1 den0 Can use parallel prefix with 2-by-2 matrix multiplication Let xi = fi(xi-1), fi a rational function, x0 given How fast can we compute xn? Theorem (Kung): Suppose degree(fi) = d for all i If d=1, xn can be evaluated in O(log n) using parallel prefix If d>1, evaluating xn takes W(n) time, i.e. no speedup is possible Sketch of proof when d=1 Sketch of proof when d>1 degree(xi) as a function of x0 is di After k parallel steps, degree(anything) <= 2k Computing xi take W(i) steps

Summary of tree algorithms

Lots of problems can be done quickly - in theory - using trees Some algorithms are widely used broadcasts, reductions, parallel prefix carry look ahead addition Some are of theoretical interest only Csanky’s method for matrix inversion Solving general tridiagonals (without pivoting) Both numerically unstable Csanky needs too many processors Embedded in various systems CM-5 hardware control network MPI, Split-C, Titanium, NESL, other languages

Showing 1 - 15 of 15 items Details

Name: 
3c_prefix
Author: 
David E. Culler
Company: 
N/A
Description: 
CS 267 Tricks with Trees James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr05
Tags: 
time | log | parallel | prefix | comput | tree | evalu | use
Created: 
1/20/1997 7:06:50 AM
Slides: 
15
Views: 
6
Downloads: 
0
Rating: 
0


> Comment



Share this presentation
|

Comments

Share this presentation:

|
Sitemap