= Series of inner product (dot product) operations
Performance as n Increases
Reason:Matrix B Gets Too Big for Cache
Computing a row of C requires
accessing every element of B
Block Matrix Multiplication
= Replace scalar multiplication
with matrix multiplication
Replace scalar addition with matrix addition
Recurse Until B Small Enough
Comparing Sequential Performance
First Parallel Algorithm
Partitioning
Divide matrices into rows
Each primitive task has corresponding rows of three matrices
Communication
Each task must eventually see every row of B
Organize tasks into a ring
First Parallel Algorithm (cont.)
Agglomeration and mapping
Fixed number of tasks, each requiring same amount of computation
Regular communication among tasks
Strategy: Assign each process a contiguous group of rows
Communication of B
A A B C A A B C A A B C A A B C
Communication of B
A A B C A A B C A A B C A A B C
Communication of B
A A B C A A B C A A B C A A B C
Communication of B
A A B C A A B C A A B C A A B C
Complexity Analysis
Algorithm has p iterations
During each iteration a process multiplies(n / p) (n / p) block of A by (n / p) n block of B: (n3 / p2)
Total computation time: (n3 / p)
Each process ends up passing(p-1)n2/p = (n2) elements of B
Isoefficiency Analysis
Sequential algorithm: (n3)
Parallel overhead: (pn2)Isoefficiency relation: n3 Cpn2 n Cp
This system does not have good scalability
Weakness of Algorithm 1
Blocks of B being manipulated have p times more columns than rows
Each process must access every element of matrix B
Ratio of computations per communication is poor: only 2n / p
Parallel Algorithm 2(Cannon’s Algorithm)
Associate a primitive task with each matrix element
Agglomerate tasks responsible for a square (or nearly square) block of C
Computation-to-communication ratio rises to n / p
Elements of A and B Needed to Compute a Process’s Portion of C
Comments