Newest Viewed Downloaded

Chapter 11 Matrix Multiplication

Parallel Programming in C with MPI and OpenMP Michael J. Quinn

Chapter 11 Matrix Multiplication

Outline

Sequential algorithms Iterative, row-oriented Recursive, block-oriented Parallel algorithms Rowwise block striped decomposition Cannon’s algorithm

Iterative, Row-oriented Algorithm

 = Series of inner product (dot product) operations

Performance as n Increases

Reason: Matrix B Gets Too Big for Cache

Computing a row of C requires accessing every element of B

Block Matrix Multiplication

 = Replace scalar multiplication with matrix multiplication Replace scalar addition with matrix addition

Recurse Until B Small Enough

Comparing Sequential Performance

First Parallel Algorithm

Partitioning Divide matrices into rows Each primitive task has corresponding rows of three matrices Communication Each task must eventually see every row of B Organize tasks into a ring

First Parallel Algorithm (cont.)

Agglomeration and mapping Fixed number of tasks, each requiring same amount of computation Regular communication among tasks Strategy: Assign each process a contiguous group of rows

Communication of B

A A B C A A B C A A B C A A B C

Communication of B

A A B C A A B C A A B C A A B C

Communication of B

A A B C A A B C A A B C A A B C

Communication of B

A A B C A A B C A A B C A A B C

Complexity Analysis

Algorithm has p iterations During each iteration a process multiplies (n / p)  (n / p) block of A by (n / p)  n block of B: (n3 / p2) Total computation time: (n3 / p) Each process ends up passing (p-1)n2/p = (n2) elements of B

Isoefficiency Analysis

Sequential algorithm: (n3) Parallel overhead: (pn2) Isoefficiency relation: n3  Cpn2  n  Cp This system does not have good scalability

Weakness of Algorithm 1

Blocks of B being manipulated have p times more columns than rows Each process must access every element of matrix B Ratio of computations per communication is poor: only 2n / p

Parallel Algorithm 2 (Cannon’s Algorithm)

Associate a primitive task with each matrix element Agglomerate tasks responsible for a square (or nearly square) block of C Computation-to-communication ratio rises to n / p

Elements of A and B Needed to Compute a Process’s Portion of C

Algorithm 1 Cannon’s Algorithm

Showing 1 - 20 of 30 items Details

Name: 
Chapter11
Author: 
Michael J. Quinn
Company: 
Oregon State University
Description: 
Chapter 11 Matrix Multiplication
Tags: 
algorithm | block | process | each | communication | matrix | parallel | b22
Created: 
12/21/2001 12:01:37 AM
Slides: 
30
Views: 
1
Downloads: 
0
Rating: 
0


> Comment



Share this presentation
|

Comments

Share this presentation:

|
Sitemap