GPU programming
Valentin Churavy
October 30th, 2019
Youtube Video Part 1
Youtube Video Part 2
Levels of parallelism in hardware
Instruction-Level Parallelism
Data-Level Parallelism
SIMD/Vector
GPUs
Thread-Level Parallelism
Instruction-Level Parallelism
Instruction-level parallelism is used by your compiler and CPU to speed up serial programs. You are not expected to write code that considers ILP; to illustrate what the compiler and hardware do, the following code snippets are written in a very explicit, low-level Julia dialect that approximates one RISC instruction per line.
function f(A, x)
    i = length(A)
    @label Loop
    a = A[i]             # Load
    c = a + x            # Add
    A[i] = c             # Store
    i = i - 1            # Decrement
    i > 0 && @goto Loop
    return A
end
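For reference, the same operation in ordinary high-level Julia is just an in-place broadcast; the explicit dialect above is only used to expose the individual instructions (the name f_broadcast is chosen here for illustration):

# Idiomatic equivalent of f(A, x): add x to every element of A in place.
f_broadcast(A, x) = A .+= x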
The same code in RISC-V would have been:
...
Loop: fld    f0, 0(x1)    # load A[i]
      fadd.d f4, f0, f2   # f4 = A[i] + x (x is held in f2)
      fsd    f4, 0(x1)    # store the result back to A[i]
      addi   x1, x1, -8   # decrement the pointer by 8 bytes (one Float64)
      bnez   x1, Loop     # branch back while x1 is not zero
...
What are the data dependencies in this loop?
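To make them explicit, here is the loop body again with its dependency chains annotated (a sketch of the answer):

a = A[i]              # depends on i           (load)
c = a + x             # depends on a           (load -> add)
A[i] = c              # depends on c and i     (add -> store)
i = i - 1             # depends on i           (loop-carried dependency)
i > 0 && @goto Loop   # depends on the new i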
Pipeline Scheduling and loop unrolling
Latency
Load latency: 1 cycle
Float arithmetic latency: 2 cycles
Integer arithmetic latency: 0 cycles
@label Loop
a = A[i]              # Cycle 1
# Stall               # Cycle 2
c = a + x             # Cycle 3
# Stall               # Cycle 4
# Stall               # Cycle 5
A[i] = c              # Cycle 6
i = i - 1             # Cycle 7
i > 0 && @goto Loop   # Cycle 8
With our given latencies and issuing one operation per cycle, we can execute the loop in 8 cycles. By reordering we can execute it in 7 cycles. Can we do better?
@label Loop
a = A[i]              # Cycle 1
i = i - 1             # Cycle 2
c = a + x             # Cycle 3
# Stall               # Cycle 4
# Stall               # Cycle 5
A[i+1] = c            # Cycle 6
i > 0 && @goto Loop   # Cycle 7
By reordering the decrement we can hide the load latency.
How many cycles are overhead: 2
How many stall cycles: 2
How many cycles are actually work: 3
In order to improve the performance of this code we want to reduce the overhead of the loop relative to the work. One technique compilers use is loop unrolling. Unrolling replicates the loop body multiple times and changes the loop exit condition accordingly. This requires a second, non-unrolled copy of the loop to handle iteration counts that are not a multiple of the unrolling factor (a sketch of such a cleanup loop follows the unrolled code below).
Note: A[i+1] is free since its address can be precomputed relative to A[i].
Unrolling by a factor of 4
@label Loop
a = A[i]
c = a + x
A[i] = c
a1 = A[i-1]
c1 = a1 + x
A[i-1] = c1
a2 = A[i-2]
c2 = a2 + x
A[i-2] = c2
a3 = A[i-3]
c3 = a3 + x
A[i-3] = c3
i = i - 4
i > 4 && @goto Loop
By unrolling with a factor of 4, we have reduced the overhead to 2 cycles (ignoring stalls for now). Note that the address of A[i-3] can be precomputed relative to A[i] and is therefore free on most architectures.
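As mentioned above, the unrolled loop exits with up to 4 elements left over. A minimal sketch of a cleanup loop, in the same explicit dialect:

@label Cleanup
i <= 0 && @goto Done  # nothing left to clean up
a = A[i]              # Load
c = a + x             # Add
A[i] = c              # Store
i = i - 1             # Decrement
@goto Cleanup
@label Done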
Do we still have stalls?: Yes
How many cycles are overhead: 2
How many stall cycles: 12
How many cycles are actually work: 12
@label Loop
a = A[i]
# Stall
c = a + x
# Stall
# Stall
A[i] = c
a1 = A[i-1]
# Stall
c1 = a1 + x
# Stall
# Stall
A[i-1] = c1
a2 = A[i-2]
# Stall
c2 = a2 + x
# Stall
# Stall
A[i-2] = c2
a3 = A[i-3]
# Stall
c3 = a3 + x
# Stall
# Stall
A[i-3] = c3
i = i - 4
i > 4 && @goto Loop
Can we re-order to reduce stalls?
@label Loop
a = A[i]
a1 = A[i-1]
a2 = A[i-2]
a3 = A[i-3]
c = a + x
c1 = a1 + x
c2 = a2 + x
c3 = a3 + x
A[i] = c
A[i-1] = c1
A[i-2] = c2
A[i-3] = c3
i = i - 4
i > 4 && @goto Loop
How many cycles are overhead (this includes stalls): 2
How many cycles are actually work: 12
This is also called interleaving, and one should note that we have started to cluster similar operations together. Instead of expressing operations that are inherently data-parallel in a serial manner and expecting the compiler and the underlying architecture to pick up the slack, we can also express the data-parallelism explicitly. The two big avenues for doing so are explicit vector programming and GPU programming.
Data-parallelism
Not all programs are data-parallel, but many in scientific computing are, and this has driven the introduction of hardware specialised for data-parallel operations. As an example, many modern CPUs include vector extensions that enable Single-Instruction-Multiple-Data (SIMD) programming.
SIMD (explicit vectorized)
Terms: Each vector element is processed by a vector lane.
using SIMD
A = rand(Float64, 64)
T = Vec{4, Float64}
x = 1.0
for i in 1:4:length(A)
    a = vload(T, A, i)
    c = a + x
    vstore(c, A, i)
end
Stalls occur only per instruction, not per element.
Reduced overhead: we process 3 * <4 x Float64> (load, add, store) per iteration with only 2 overhead instructions (excluding stalls), so the overhead is amortized across 4 elements.
Note:
We can remove stalls similarly to what we did for the serial code (see the sketch below):
pipelining
interleaving and unrolling
Latencies will be higher than for the corresponding scalar instructions
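For example, the vectorized loop can be unrolled by a factor of 2 and interleaved to hide load latencies, analogous to the serial case. A sketch, assuming length(A) is a multiple of 8:

using SIMD
A = rand(Float64, 64)
T = Vec{4, Float64}
x = 1.0
for i in 1:8:length(A)
    a  = vload(T, A, i)       # issue both loads before the adds
    a1 = vload(T, A, i + 4)
    c  = a + x
    c1 = a1 + x
    vstore(c,  A, i)
    vstore(c1, A, i + 4)
end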
How do we handle branching?
Translating serial code to a vector processor is tricky if there is data- or index-dependent control flow. Some architectures (see the NEC Aurora VX) have support for vector predication, and there are also masked load and store instructions for SIMD on Intel CPUs. In general, though, one has to do a manual transformation that computes both sides of the branch and then merges the results together.
A = rand(Int64, 64)
for i in 1:length(A)
    a = A[i]
    if a % 2 == 0
        A[i] = -a
    end
end
using SIMD
A = rand(Int64, 64)
T = Vec{4, Int64}
for i in 1:4:length(A)
    a = vload(T, A, i)
    mask = a % 2 == 0        # calculate mask
    b = -a                   # if branch
    c = vifelse(mask, b, a)  # merge results
    vstore(c, A, i)
end
GPU (implicit vectorized)
Instead of using explicit vectorization, GPUs change the programming model: the programmer writes a kernel which operates on each element of the data. In effect, the programmer writes a program that is executed for each vector lane. It is important to remember that the hardware itself still operates on vectors (CUDA calls this the warp size, and it is 32 elements).
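As a sketch of what such a kernel can look like in Julia, here is the same element-wise add written against the CUDA.jl package (the package choice and the name gpu_add! are assumptions made here, not taken from the slides):

using CUDA

# Each GPU thread handles one element of A; the hardware still executes
# the threads in groups of 32 (one warp).
function gpu_add!(A, x)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(A)        # guard against excess threads
        @inbounds A[i] += x
    end
    return nothing
end

A = CuArray(rand(Float64, 64))
@cuda threads=32 blocks=2 gpu_add!(A, 1.0)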
At this point please refer to the lecture slides