Uniprocessor Optimizations and Matrix Multiplication
Uniprocessor Optimizations and Matrix Multiplication
Outline
Parallelism in Modern Processors
Memory Hierarchies
Matrix Multiply Cache Optimizations
Idealized Uniprocessor Model
Processor names bytes, words, etc. in its address space
These represent integers, floats, pointers, arrays, etc.
Exist in the program stack, static region, or heap
Operations include
Read and write (given an address/pointer)
Arithmetic and other logical operations
Order specified by program
Read returns the most recently written data
Compiler and architecture translate high level expressions into “obvious” lower level instructions
Hardware executes instructions in order specified by compiler
Cost
Each operations has roughly the same cost
(read, write, add, multiply, etc.)
Uniprocessors in the Real World
Real processors have
registers and caches
small amounts of fast memory
store values of recently used or nearby data
different memory ops can have very different costs
parallelism
multiple “functional units” that can run in parallel
different orders, instruction mixes have different costs
pipelining
a form of parallelism, like an assembly line in a factory
Why is this your problem?
In theory, compilers understand all of this and can optimize your program; in practice they don’t.
What is Pipelining?
A B C D 6 PM 7 8 9 T
a
s
k
O
r
d
e
r Time 30 40 40 40 40 20 Dave Patterson’s Laundry example: 4 people doing laundry
wash (30 min) + dry (40 min) + fold (20 min) In this example:
Sequential execution takes 4 * 90min = 6 hours
Pipelined execution takes 30+4*40+20 = 3.3 hours
Pipelining helps throughput, but not latency
Pipeline rate limited by slowest pipeline stage
Potential speedup = Number pipe stages
Time to “fill” pipeline and time to “drain” it reduces speedup
Limits to Instruction Level Parallelism (ILP)
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
The hardware and compiler will try to reduce these:
Reordering instructions, multiple issue, dynamic branch prediction, speculative execution…
You can also enable parallelism by careful coding
Outline
Parallelism in Modern Processors
Memory Hierarchies
Matrix Multiply Cache Optimizations
Memory Hierarchy
on-chip cache registers datapath control processor Second level cache (SRAM) Main memory
(DRAM) Secondary storage (Disk) Tertiary storage
(Disk/Tape) Speed (ns): 1 10 100 10 ms 10 sec
Size (bytes): 100s Ks Ms Gs Ts Most programs have a high degree of locality in their accesses
spatial locality: accessing things nearby previous accesses
temporal locality: reusing an item that was previously accessed
Memory hierarchy tries to exploit locality
Processor-DRAM Gap (latency)
µProc
60%/yr. DRAM
7%/yr. 1 10 100 1000 1980 1981 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 DRAM CPU 1982 Processor-Memory
Performance Gap:(grows 50% / year) Performance Time “Moore’s Law” Memory hierarchies are getting deeper
Processors get faster more quickly than memory
Y-axis is performance
X-axis is time
Latency
Cliché:
Not e that x86 didn’t have cache on chip until 1989
Experimental Study of Memory
time the following program for each size(A) and stride s
(repeat to obtain confidence and mitigate timer resolution)
for array A of size from 4KB to 8MB by 2x
for stride s from 8 Bytes (1 word) to size(A)/2 by 2x
for i from 0 to size by s
load A[i] from memory (8 Bytes) Microbenchmark for memory system performance
Memory Hierarchy on a Sun Ultra-IIi
L2: 2 MB, 36 ns
(12 cycles) L1: 16K, 6 ns
(2 cycle) Mem: 396 ns
(132 cycles) Sun Ultra-IIi, 333 MHz L2: 64 byte line 8 K pages See www.cs.berkeley.edu/~yelick/arvindk/t3d-isca95.ps for details L1: 16 byte line Array size
Actual performance of a simple program can be a complicated function of the architecture
Slight changes in the architecture or program change the performance significantly
To write fast programs, need to consider architecture
We would like simple models to help us design efficient algorithms
Is this possible?
We will illustrate with a common technique for improving cache performance, called blocking or tiling
Idea: used divide-and-conquer to define a problem that fits in register/L1-cache/L2-cache
Outline
Parallelism in Modern Processors
Memory Hierarchies
Matrix Multiply Cache Optimizations
Note on Matrix Storage
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 4 8 12 16 1 5 9 13 17 2 6 10 14 18 3 7 11 15 19 Column major Row major A matrix is a 2-D array of elements, but memory addresses are “1-D”
Conventions for matrix layout
by column, or “column major” (Fortran default)
by row, or “row major” (C default)
Optimizing Matrix Addition for Caches
Dimension A(n,n), B(n,n), C(n,n)
A, B, C stored by column (as in Fortran)
Algorithm 1:
for i=1:n, for j=1:n, A(i,j) = B(i,j) + C(i,j)
Algorithm 2:
for j=1:n, for i=1:n, A(i,j) = B(i,j) + C(i,j)
What is “memory access pattern” for Algs 1 and 2?
Which is faster?
What if A, B, C stored by row (as in C)?
Using a Simple Model of Memory to Optimize
Key to algorithm efficiency f * tf + m * tm = f * tf * (1 + tm/tf * 1/q) Key to machine efficiency Assume just 2 levels in the hierarchy, fast and slow
All data initially in slow memory
m = number of memory elements (words) moved between fast and slow memory
tm = time per slow memory operation
f = number of arithmetic operations
tf = time per arithmetic operation << tm
q = f / m average number of flops per slow element access
Minimum possible time = f* tf when all data in fast memory
Actual time
Larger q means Time closer to minimum f * tf
Simple example using memory model
s = 0
for i = 1, n
s = s + h(X[i]) Assume tf=1 Mflop/s on fast memory
Assume moving data is tm = 10
Assume h takes q flops
Assume array X is in slow memory So m = n and f = q*n
Time = read X + compute = 10*n + q*n
Mflop/s = f/t = q/(10 + q)
As q increases, this approaches the “peak” speed of 1 Mflop/s To see results of changing q, consider simple computation
Warm up: Matrix-vector multiplication
= + * y(i) y(i) A(i,:) x(:) {implements y = y + A*x}
for i = 1:n
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
Warm up: Matrix-vector multiplication
m = number of slow memory refs = 3n + n2
f = number of arithmetic operations = 2n2
q = f / m ~= 2
Matrix-vector multiplication limited by slow memory speed {read x(1:n) into fast memory}
{read y(1:n) into fast memory}
for i = 1:n
{read row i of A into fast memory}
for j = 1:n
y(i) = y(i) + A(i,j)*x(j)
{write y(1:n) back to slow memory}
Comments