CS 267Tricks with Trees James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr05
CS 267Tricks with Trees James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr05
Outline
2 2 A log n lower bound to compute any function in parallel
Reduction and broadcast in O(log n) time
Parallel prefix (scan) in O(log n) time
Adding two n-bit integers in O(log n) time
Multiplying n-by-n matrices in O(log n) time
Inverting n-by-n triangular matrices in O(log n) time
Inverting n-by-n dense matrices in O(log n) time
Evaluating arbitrary expressions in O(log n) time
Evaluating recurrences in O(log n) time
Solving n-by-n tridiagonal matrices in O(log n) time
Traversing linked lists
Computing minimal spanning trees
Computing convex hulls of point sets
A log n lower bound to compute any function of n variables
Assume we can only use binary operations, one per time unit
After 1 time unit, an output can only depend on two inputs
Use induction to show that after k time units, an output can only depend on 2k inputs
After log2 n time units, output depends on at most n inputs
A binary tree performs such a computation
Broadcasts and Reductions on Trees
Parallel Prefix, or Scan
y[j] = x[0] + x[1] + … + x[j] for j=0,1,…,p-1 If “+” is an associative operator, and x[0],…,x[p-1] are input data then parallel prefix operation computes
Notation: j:k mean x[j]+x[j+1]+…+x[k], blue is final value
Mapping Parallel Prefix onto a Tree - Details
1) Get values L and R from left and right children
2) Save L in a local register Lsave
3) Pass sum L+R to parent
1) Get value S from parent (the root gets 0)
2) Send S to the left child
3) Send S + Lsave to the right child Up-the-tree phase (from leaves to root)
By induction, Lsave = sum of all leaves in left subtree
Down the tree phase (from root to leaves)
By induction, S = sum of all leaves to left of subtree rooted at the parent
E.g., Fibonacci via Matrix Multiply Prefix
Fn+1 = Fn + Fn-1 Can compute all Fn by matmul_prefix on
[ , , , , , , , , ]
then select the upper left entry Slide source: Alan Edelman, MIT
Adding two n-bit integers in O(log n) time
c[-1] = 0 … rightmost carry bit
for i = 0 to n-1
c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] ) ... next carry bit
s[i] = ( a[i] xor b[i] ) xor c[i-1] for all (0 <= i <= n-1) p[i] = a[i] xor b[i] … propagate bit
for all (0 <= i <= n-1) g[i] = a[i] and b[i] … generate bit
c[i] = ( p[i] and c[i-1] ) or g[i] = p[i] g[i] * c[i-1] = C[i] * c[i-1]
1 1 0 1 1 1
… 2-by-2 Boolean matrix multiplication (associative)
= C[i] * C[i-1] * … C[0] * 0
1
… evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers
We want their sum s = a+b = s[n]s[n-1]…s[0]
Challenge: compute all c[i] in O(log n) time via parallel prefix
Used in all computers to implement addition - Carry look-ahead
Other applications of scans
There are several applications of scans, some more obvious than others
add multi-precision numbers (represented as array of numbers)
evaluate recurrences, expressions
solve tridiagonal systems (unstable!)
implement bucket sort and radix sort
to dynamically allocate processors
to search for regular expression (e.g., grep)
Names: +\ (APL), cumsum(Matlab), MPI_SCAN
Note: 2n operations used when only n-1 needed
Multiplying n-by-n matrices in O(log n) time
k =1 n For all (1 <= i,j,k <= n) P(i,j,k) = A(i,k) * B(k,j)
cost = 1 time unit, using n^3 processors
For all (1 <= I,j <= n) C(i,j) = S P(i,j,k)
cost = O(log n) time, using a tree with n^3 / 2 processors
Inverting triangular n-by-n matrices in O(log2 n) time
A 0
C B -1 = A 0
-B CA B -1 -1 -1 -1 If T is 1-by-1
return 1/T
else
… Write T = A 0
C B
In parallel do {
invA = TriInv(A)
invB = TriInv(B) } … implicitly uses a tree
newC = -invB * C * invA
Return invA 0
newC invB Fact:
Function TriInv(T) … assume n = dim(T) = 2m for simplicity
time(TriInv(n)) = time(TriInv(n/2)) + O(log(n))
Change variable to m = log n to get time(TriInv(n)) = O(log2n)
Inverting Dense n-by-n matrices in O(log n) time
2 i=1 n i=1 n 1) Compute the powers A2, A3, …,An-1 by parallel prefix
cost = O(log2 n)
2) Compute the traces sk = trace(Ak)
cost = O(log n)
3) Solve Newton identities for coefficients of characteristic polynomial
cost = O(log2 n)
4) Evaluate A-1 using Cayley-Hamilton Theorem
cost = O(log n) Lemma 1: Cayley-Hamilton Theorem
expression for A-1 via characteristic polynomial in A
Lemma 2: Newton’s Identities
Triangular system of equations for coefficients of characteristic polynomial, matrix entries = sk
Lemma 3: sk = trace(Ak) = S Ak [i,i] = S [li (A)]k
Csanky’s Algorithm (1976)
Completely numerically unstable
Evaluating arbitrary expressions
Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately
Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and /
Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity
Sketch of (modern) proof: evaluate expression tree E greedily by
collapsing all leaves into their parents at each time step
evaluating all “chains” in E with parallel prefix
Evaluating recurrences
xi = fi(xi-1) = (ai * xi-1 + bi )/( ci * xi-1 + di ) can be written as
xi = numi / deni = (ai * numi-1 + bi * deni-1)/(ci * numi-1 + di * deni-1) or
numi = ai bi * numi-1 = Mi * numi-1 = Mi * Mi-1 * … * M1* num0
demi ci di deni-1 deni-1 den0
Can use parallel prefix with 2-by-2 matrix multiplication
Let xi = fi(xi-1), fi a rational function, x0 given
How fast can we compute xn?
Theorem (Kung): Suppose degree(fi) = d for all i
If d=1, xn can be evaluated in O(log n) using parallel prefix
If d>1, evaluating xn takes W(n) time, i.e. no speedup is possible
Sketch of proof when d=1
Sketch of proof when d>1
degree(xi) as a function of x0 is di
After k parallel steps, degree(anything) <= 2k
Computing xi take W(i) steps
Summary of tree algorithms
Lots of problems can be done quickly - in theory - using trees
Some algorithms are widely used
broadcasts, reductions, parallel prefix
carry look ahead addition
Some are of theoretical interest only
Csanky’s method for matrix inversion
Solving general tridiagonals (without pivoting)
Both numerically unstable
Csanky needs too many processors
Embedded in various systems
CM-5 hardware control network
MPI, Split-C, Titanium, NESL, other languages
Comments