The Different Flavors of Parallelism

Chris Rackauckas

September 25th, 2020

Youtube Video Link

Now that you are aware of the basics of parallel computing, let's give a high level overview of the differences between different modes of parallelism.

Lowest Level: SIMD

Recall SIMD, the idea that processors can run multiple commands simultaneously on specially structured data. "Single Instruction Multiple Data". SIMD is parallelism within a single core.

High Level Idea of SIMD

Calculations can occur in parallel in the processor if there is sufficient structure in the computation.

How to do SIMD

The simplest way to do SIMD is simply to make sure that your values are aligned. If they are, then great, LLVM's autovectorizer pass has a good chance of automatic vectorization (in the world of computing, "SIMD" is synonymous with vectorization since it is taking specific values and instead computing on small vectors. That is not to be confused with "vectorization" in the sense of Python/R/MATLAB, which is a programming style which prefers using C-defined primitive functions, like broadcast or matrix multiplication).

You can check for auto-vectorization inside of the LLVM IR by looking for statements like:

%wide.load24 = load <4 x double>, <4 x double> addrspac(13)* %46, align 8
; └
; ┌ @ float.jl:395 within `+'
%47 = fadd <4 x double> %wide.load, %wide.load24

which means that 4 additions are happening simultaneously. The amount of vectorization is heavily dependent on your architecture. The ancient form of SIMD, the SSE(2) instructions, required that your data was aligned. Now there's a bit more leeway, but generally it holds that making your the data you're trying to SIMD over is aligned. Thus there can be major differences in computing using a struct of array format instead of an arrays of structs format. For example:

struct MyComplex
  real::Float64
  imag::Float64
end
arr = [MyComplex(rand(),rand()) for i in 1:100]

100-element Vector{MyComplex}:
 MyComplex(0.2909922234025015, 0.3684950106942778)
 MyComplex(0.8472446809037655, 0.006968441771929124)
 MyComplex(0.64473688548435, 0.6170048129814039)
 MyComplex(0.7649895042144097, 0.21118628327755284)
 MyComplex(0.09828428127119948, 0.8911722917090261)
 MyComplex(0.7013104972119288, 0.6919119919289437)
 MyComplex(0.03699872532650539, 0.6199229856501254)
 MyComplex(0.08872906353145926, 0.41807863284690083)
 MyComplex(0.799149704642992, 0.7690758210036304)
 MyComplex(0.8655930872101948, 0.14390155509968594)
 ⋮
 MyComplex(0.4350083338952444, 0.7340662960063089)
 MyComplex(0.3206046086043499, 0.8323448952314209)
 MyComplex(0.49035716359283066, 0.6289988313173449)
 MyComplex(0.5479276742084399, 0.35110554541749484)
 MyComplex(0.0725023494926782, 0.41181259238514123)
 MyComplex(0.8565561915612465, 0.21515468620970168)
 MyComplex(0.9347514813926294, 0.8390188712545646)
 MyComplex(0.3546123381871167, 0.0825287740696451)
 MyComplex(0.8833688135912466, 0.9502963901464433)

is represented in memory as

[real1,imag1,real2,imag2,...]

while the struct of array formats are

struct MyComplexes
  real::Vector{Float64}
  imag::Vector{Float64}
end
arr2 = MyComplexes(rand(100),rand(100))

MyComplexes([0.7982897891192476, 0.10131368582135536, 0.5911646727035349, 0
.19038486942362787, 0.39677916953917425, 0.997198980647344, 0.2992456670926
271, 0.6660084172683108, 0.13094376216734016, 0.6410671410944007  …  0.5097
127637251634, 0.5281832610849995, 0.5755606696141649, 0.44008368624584615, 
0.3082898405005182, 0.312266424394561, 0.5137240249568831, 0.81673244284726
84, 0.3763769061014014, 0.6612349369390583], [0.8398041780959282, 0.2640267
831106272, 0.44813363658739946, 0.2691673306349801, 0.2252604759193444, 0.7
645729990386371, 0.17376316805978187, 0.11018365773987959, 0.39701791386872
7, 0.8740016425067744  …  0.9639070083856647, 0.47041058355104926, 0.837932
5281057264, 0.7569447195838326, 0.25308998676152694, 0.3815510136712087, 0.
5285521433191064, 0.14776563368719464, 0.3966436880965202, 0.43314513571219
54])

Now let's check what happens when we perform a reduction:

using InteractiveUtils
Base.:+(x::MyComplex,y::MyComplex) = MyComplex(x.real+y.real,x.imag+y.imag)
Base.:/(x::MyComplex,y::Int) = MyComplex(x.real/y,x.imag/y)
average(x::Vector{MyComplex}) = sum(x)/length(x)
@code_llvm average(arr)

; Function Signature: average(Array{Main.MyComplex, 1})
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:5 within `average`
define void @julia_average_70604(ptr noalias nocapture noundef nonnull sret
([2 x double]) align 8 dereferenceable(16) %sret_return, ptr noundef nonnul
l align 8 dereferenceable(24) %"x::Array") #0 {
top:
  %jlcallframe1 = alloca [4 x ptr], align 8
  %0 = alloca [2 x double], align 16
; ┌ @ reducedim.jl:982 within `sum`
; │┌ @ reducedim.jl:982 within `#sum#935`
; ││┌ @ reducedim.jl:986 within `_sum`
; │││┌ @ reducedim.jl:986 within `#_sum#937`
; ││││┌ @ reducedim.jl:987 within `_sum`
; │││││┌ @ reducedim.jl:987 within `#_sum#938`
; ││││││┌ @ reducedim.jl:329 within `mapreduce`
; │││││││┌ @ reducedim.jl:329 within `#mapreduce#928`
; ││││││││┌ @ reducedim.jl:337 within `_mapreduce_dim`
; │││││││││┌ @ reduce.jl:426 within `_mapreduce`
; ││││││││││┌ @ indices.jl:494 within `LinearIndices`
; │││││││││││┌ @ abstractarray.jl:98 within `axes`
; ││││││││││││┌ @ array.jl:194 within `size`
               %1 = getelementptr inbounds i8, ptr %"x::Array", i64 16
               %.size.sroa.0.0.copyload = load i64, ptr %1, align 8
; ││││││││││└└└
; ││││││││││ @ reduce.jl:428 within `_mapreduce`
            switch i64 %.size.sroa.0.0.copyload, label %L31 [
    i64 0, label %L8
    i64 1, label %guard_pass99
  ]

L8:                                               ; preds = %top
; ││││││││││ @ reduce.jl:429 within `_mapreduce`
            store ptr @"jl_global#70614.jit", ptr %jlcallframe1, align 8
            %2 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
            store ptr @"jl_global#70615.jit", ptr %2, align 8
            %3 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
            store ptr %"x::Array", ptr %3, align 8
            %4 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 3
            store ptr @"jl_global#70616.jit", ptr %4, align 8
            %5 = call nonnull ptr @ijl_invoke(ptr nonnull @"jl_global#70613
.jit", ptr nonnull %jlcallframe1, i32 4, ptr nonnull @"-Main.Base.mapreduce
_empty_iter#70612.jit")
            call void @llvm.trap()
            unreachable

L31:                                              ; preds = %top
; ││││││││││ @ reduce.jl:433 within `_mapreduce`
; ││││││││││┌ @ int.jl:83 within `<`
             %6 = icmp sgt i64 %.size.sroa.0.0.copyload, 15
; ││││││││││└
            br i1 %6, label %guard_pass109, label %guard_exit115

L110:                                             ; preds = %guard_exit120,
 %guard_exit115, %guard_pass109, %guard_pass99
; └└└└└└└└└└
; ┌ @ essentials.jl:11 within `length`
   %.size12.sroa.0.0.copyload = phi i64 [ %.size12.sroa.0.0.copyload.pre, %
guard_pass109 ], [ 1, %guard_pass99 ], [ %.size.sroa.0.0.copyload, %guard_e
xit115 ], [ %.size.sroa.0.0.copyload, %guard_exit120 ]
   %7 = phi <2 x double> [ %14, %guard_pass109 ], [ %13, %guard_pass99 ], [
 %19, %guard_exit115 ], [ %24, %guard_exit120 ]
; └
; ┌ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciM
LBook/_weave/lecture06/styles_of_parallelism.jmd:4 within `/` @ promotion.j
l:432
; │┌ @ promotion.jl:400 within `promote`
; ││┌ @ promotion.jl:375 within `_promote`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ float.jl:239 within `Float64`
       %8 = sitofp i64 %.size12.sroa.0.0.copyload to double
; │└└└└
; │ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciM
LBook/_weave/lecture06/styles_of_parallelism.jmd:4 within `/` @ promotion.j
l:432 @ float.jl:494
   %9 = insertelement <2 x double> poison, double %8, i64 0
   %10 = shufflevector <2 x double> %9, <2 x double> poison, <2 x i32> zero
initializer
   %11 = fdiv <2 x double> %7, %10
; │ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciM
LBook/_weave/lecture06/styles_of_parallelism.jmd:4 within `/`
; │┌ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/Sci
MLBook/_weave/lecture06/styles_of_parallelism.jmd:3 within `MyComplex`
    store <2 x double> %11, ptr %sret_return, align 8
    ret void

guard_pass99:                                     ; preds = %top
; └└
; ┌ @ reducedim.jl:982 within `sum`
; │┌ @ reducedim.jl:982 within `#sum#935`
; ││┌ @ reducedim.jl:986 within `_sum`
; │││┌ @ reducedim.jl:986 within `#_sum#937`
; ││││┌ @ reducedim.jl:987 within `_sum`
; │││││┌ @ reducedim.jl:987 within `#_sum#938`
; ││││││┌ @ reducedim.jl:329 within `mapreduce`
; │││││││┌ @ reducedim.jl:329 within `#mapreduce#928`
; ││││││││┌ @ reducedim.jl:337 within `_mapreduce_dim`
; │││││││││┌ @ reduce.jl:431 within `_mapreduce`
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %12 = load ptr, ptr %"x::Array", align 8
             %13 = load <2 x double>, ptr %12, align 8
             br label %L110

guard_pass109:                                    ; preds = %L31
; ││││││││││└
; ││││││││││ @ reduce.jl:444 within `_mapreduce`
; ││││││││││┌ @ reduce.jl:277 within `mapreduce_impl`
             call void @j_mapreduce_impl_70633(ptr noalias nocapture nounde
f nonnull sret([2 x double]) %0, ptr nonnull %"x::Array", i64 signext 1, i6
4 signext %.size.sroa.0.0.copyload, i64 signext 1024)
; └└└└└└└└└└└
  %14 = load <2 x double>, ptr %0, align 16
; ┌ @ essentials.jl:11 within `length`
   %.size12.sroa.0.0.copyload.pre = load i64, ptr %1, align 8
   br label %L110

guard_exit115:                                    ; preds = %L31
; └
; ┌ @ reducedim.jl:982 within `sum`
; │┌ @ reducedim.jl:982 within `#sum#935`
; ││┌ @ reducedim.jl:986 within `_sum`
; │││┌ @ reducedim.jl:986 within `#_sum#937`
; ││││┌ @ reducedim.jl:987 within `_sum`
; │││││┌ @ reducedim.jl:987 within `#_sum#938`
; ││││││┌ @ reducedim.jl:329 within `mapreduce`
; │││││││┌ @ reducedim.jl:329 within `#mapreduce#928`
; ││││││││┌ @ reducedim.jl:337 within `_mapreduce_dim`
; │││││││││┌ @ reduce.jl:435 within `_mapreduce`
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %15 = load ptr, ptr %"x::Array", align 8
; ││││││││││└
; ││││││││││ @ reduce.jl:436 within `_mapreduce`
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %16 = getelementptr inbounds [2 x double], ptr %15, i64 1
; ││││││││││└
; ││││││││││ @ reduce.jl:435 within `_mapreduce`
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %17 = load <2 x double>, ptr %15, align 8
; ││││││││││└
; ││││││││││ @ reduce.jl:436 within `_mapreduce`
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %18 = load <2 x double>, ptr %16, align 8
; ││││││││││└
; ││││││││││ @ reduce.jl:437 within `_mapreduce`
; ││││││││││┌ @ reduce.jl:24 within `add_sum`
; │││││││││││┌ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/Sci
MLBook/SciMLBook/_weave/lecture06/styles_of_parallelism.jmd:3 within `+` @ 
float.jl:491
              %19 = fadd <2 x double> %17, %18
; ││││││││││└└
; ││││││││││ @ reduce.jl:438 within `_mapreduce`
; ││││││││││┌ @ int.jl:83 within `<`
             %.not141142 = icmp sgt i64 %.size.sroa.0.0.copyload, 2
; ││││││││││└
            br i1 %.not141142, label %guard_exit120, label %L110

guard_exit120:                                    ; preds = %guard_exit120,
 %guard_exit115
            %value_phi61145 = phi i64 [ %21, %guard_exit120 ], [ 2, %guard_
exit115 ]
            %20 = phi <2 x double> [ %24, %guard_exit120 ], [ %19, %guard_e
xit115 ]
; ││││││││││ @ reduce.jl:439 within `_mapreduce`
; ││││││││││┌ @ int.jl:87 within `+`
             %21 = add nuw nsw i64 %value_phi61145, 1
; ││││││││││└
; ││││││││││┌ @ essentials.jl:917 within `getindex`
             %22 = getelementptr inbounds [2 x double], ptr %15, i64 %value
_phi61145
             %23 = load <2 x double>, ptr %22, align 8
; ││││││││││└
; ││││││││││ @ reduce.jl:440 within `_mapreduce`
; ││││││││││┌ @ reduce.jl:24 within `add_sum`
; │││││││││││┌ @ /home/chrisrackauckas/github-runners/demeter4-23/_work/Sci
MLBook/SciMLBook/_weave/lecture06/styles_of_parallelism.jmd:3 within `+` @ 
float.jl:491
              %24 = fadd <2 x double> %20, %23
; ││││││││││└└
; ││││││││││ @ reduce.jl:438 within `_mapreduce`
; ││││││││││┌ @ int.jl:83 within `<`
             %exitcond.not = icmp eq i64 %21, %.size.sroa.0.0.copyload
; ││││││││││└
            br i1 %exitcond.not, label %L110, label %guard_exit120
; └└└└└└└└└└
}

What this is doing is creating small little vectors and then parallelizing the operations of those vectors by calling specific vector-parallel instructions. Keep this in mind.

Explicit SIMD

The following was all a form of loop-level parallelism known as loop vectorization. It's simply easier for compilers to reason at the array level, prove iterates are independent, and automatically generate SIMD code from that. This is not necessary, and compilers can produce SIMD code from non-looping code through a process known as SLP supervectorization, but the results are far from optimal and the compiler requires a lot of time to do this calculation, meaning that it's usually not a pass used by default.

If you want to pack the vectors yourself, then primitives for doing so from within Julia are available in SIMD.jl. This is for "real" performance warriors. This looks like for example:

using SIMD
v = Vec{4,Float64}((1,2,3,4))
@show v+v # basic arithmetic is supported
@show sum(v) # basic reductions are supported

v + v = <4 x Float64>[2.0, 4.0, 6.0, 8.0]
sum(v) = 10.0
10.0

Using this you can pull apart code and force the usage of SIMD vectors. One library which makes great use of this is LoopVectorization.jl. However, one word of "caution":

Most performance optimization is not trying to do something really good for performance. Most performance optimization is trying to not do something that is actively bad for performance.

Summary of SIMD

Communication in SIMD is due to locality: if things are local the processor can automatically setup the operations.
There's no real worry about "getting it wrong": you cannot overwrite pieces from different parts of the arithmetic unit, and if SIMD is unsafe then it just won't auto-vectorize.
Suitable for operations measured in ns.

Next Level Up: Multithreading

Last time we briefly went over multithreading and described how every process has multiple threads which share a single heap, and when multiple threads are executed simultaneously we have multithreaded parallelism. Note that you can have multiple threads which aren't executed simultaneously, like in the case of I/O operations, and this is an example of concurrency without parallelism and is commonly referred to as green threads.

Last time we described a simple multithreaded program and noticed that multithreading has an overhead cost of around 50ns-100ns. This is due to the construction of the new stack (among other things) each time a new computational thread is spun up. This means that, unlike SIMD, some thought needs to be put in as to when to perform multithreading: it's not always a good idea. It needs to be high enough on the cost for this to be counter-balanced.

One abstraction that was glossed over was the memory access style. Before, we were considering a single heap, or an UMA style:

However, this is the case for all shared memory devices. For example, compute nodes on the HPC tend to be "dual Xeon" or "quad Xeon", where each Xeon processor is itself a multicore processor. But each processor on its own accesses its own local caches, and thus one has to be aware that this is setup in a NUMA (non-uniform memory access) manner:

where there is a cache that is closer to the processor and a cache that is further away. Care should be taken in this to localize the computation per thread, otherwise a cost associated with the memory sharing will be hit (but all sharing will still be automatic).

In this sense, interthread communication is naturally done through the heap: if you want other threads to be able to touch a value, then you can simply place it on the heap and then it'll be available. We saw this last time by how overlapping computations can re-use the same heap-based caches, meaning that care needs to be taken with how one writes into a dynamically-allocated array.

A simple example that demonstrates this is. First, let's make sure we have multithreading enabled:

using Base.Threads
Threads.nthreads() # should not be 1

using BenchmarkTools
acc = 0
@threads for i in 1:10_000
    global acc
    acc += 1
end
acc

The reason for this behavior is that there is a difference between the reading and the writing step to an array. Here, values are being read while other threads are writing, meaning that they see a lower value than when they are attempting to write into it. The result is that the total summation is lower than the true value because of this clashing. We can prevent this by only allowing one thread to utilize the heap-allocated variable at a time. One abstraction for doing this is atomics:

acc = Atomic{Int64}(0)
@threads for i in 1:10_000
    atomic_add!(acc, 1)
end
acc

Atomic{Int64}(10000)

When an atomic add is being done, all other threads wishing to do the same computation are blocked. This of course can have a massive effect on performance since atomic computations are not parallel.

Julia also exposes a lower level of heap control in threading using locks

const acc_lock = Ref{Int64}(0)
const splock = SpinLock()
function f1()
    @threads for i in 1:10_000
        lock(splock)
        acc_lock[] += 1
        unlock(splock)
    end
end
const rsplock = ReentrantLock()
function f2()
    @threads for i in 1:10_000
        lock(rsplock)
        acc_lock[] += 1
        unlock(rsplock)
    end
end
acc2 = Atomic{Int64}(0)
function g()
  @threads for i in 1:10_000
      atomic_add!(acc2, 1)
  end
end
const acc_s = Ref{Int64}(0)
function h()
  global acc_s
  for i in 1:10_000
      acc_s[] += 1
  end
end
@btime f1()

50.971 μs (7 allocations: 576 bytes)

SpinLock is non-reentrant, i.e. it will block itself if a thread that calls a lock does another lock. Therefore it has to be used with caution (every lock goes with one unlock), but it's fast. ReentrantLock alleviates those concerns, but trades off a bit of performance:

@btime f2()

64.062 μs (7 allocations: 576 bytes)

But if you can use atomics, they will be faster:

@btime g()

19.861 μs (7 allocations: 576 bytes)

and if your computation is actually serial, then use serial code:

@btime h()

3.070 ns (0 allocations: 0 bytes)

Why is this so fast? Check the code:

@code_llvm h()

; Function Signature: h()
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:26 within `h`
define void @julia_h_71894() #0 {
top:
  %"jl_global#71898.jit.promoted" = load i64, ptr @"jl_global#71898.jit", a
lign 32
  %0 = add i64 %"jl_global#71898.jit.promoted", 10000
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:29 within `h`
  store i64 %0, ptr @"jl_global#71898.jit", align 32
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:30 within `h`
  ret void
}

It just knows to add 10000. So to get a proper timing let's make the size mutable:

const len = Ref{Int}(10_000)
function h2()
  global acc_s
  global len
  for i in 1:len[]
      acc_s[] += 1
  end
end
@btime h2()

2.769 ns (0 allocations: 0 bytes)

@code_llvm h2()

; Function Signature: h2()
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:3 within `h2`
define void @julia_h2_71942() #0 {
top:
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:6 within `h2`
; ┌ @ refvalue.jl:59 within `getindex`
; │┌ @ Base.jl:49 within `getproperty`
    %.x = load i64, ptr @"jl_global#71946.jit", align 64
; └└
; ┌ @ range.jl:904 within `iterate`
; │┌ @ range.jl:681 within `isempty`
; ││┌ @ operators.jl:379 within `>`
; │││┌ @ int.jl:83 within `<`
      %0 = icmp slt i64 %.x, 1
; └└└└
  br i1 %0, label %L34, label %L18.preheader

L18.preheader:                                    ; preds = %top
  %"jl_global#71951.jit.promoted" = load i64, ptr @"jl_global#71951.jit", a
lign 32
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:8 within `h2`
  %1 = add i64 %"jl_global#71951.jit.promoted", %.x
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:7 within `h2`
  store i64 %1, ptr @"jl_global#71951.jit", align 32
;  @ /home/chrisrackauckas/github-runners/demeter4-23/_work/SciMLBook/SciML
Book/_weave/lecture06/styles_of_parallelism.jmd:8 within `h2`
  br label %L34

L34:                                              ; preds = %L18.preheader,
 %top
  ret void
}

It's still optimizing it!

non_const_len = 10000
function h3()
  global acc_s
  global non_const_len
  len2::Int = non_const_len
  for i in 1:len2
      acc_s[] += 1
  end
end
@btime h3()

4.920 ns (0 allocations: 0 bytes)

Note that what is shown here is a type-declaration. a::T = ... forces a to be of type T throughout the whole function. By giving the compiler this information, I am able to use the non-constant global in a type-stable manner.

One last thing to note about multithreaded computations, and parallel computations, is that one cannot assume that the parallelized computation is computed in any given order. For example, the following will has a quasi-random ordering:

const a2 = zeros(nthreads()*10)
const acc_lock2 = Ref{Int64}(0)
const splock2 = SpinLock()
function f_order()
    @threads for i in 1:length(a2)
        lock(splock2)
        acc_lock2[] += 1
        a2[i] = acc_lock2[]
        unlock(splock2)
    end
end
f_order()
a2

10-element Vector{Float64}:
  1.0
  2.0
  3.0
  4.0
  5.0
  6.0
  7.0
  8.0
  9.0
 10.0

Note that here we can see that Julia 1.5 is dividing up the work into groups of 10 for each thread, and then one thread dominates the computation at a time, but which thread dominates is random.

The Dining Philosophers Problem

A classic tale in parallel computing is the dining philosophers problem. In this case, there are N philosophers at a table who all want to eat at the same time, following all of the same rules. Each philosopher must alternatively think and then eat. They need both their left and right fork to start eating, but cannot start eating until they have both forks. The problem is how to setup a concurrent algorithm that will not cause any philosophers to starve.

The difficulty is a situation known as deadlock. For example, if each philosopher was told to grab the right fork when it's available, and then the left fork, and put down the fork after eating, then they will all grab the right fork and none will ever eat because they will all be waiting on the left fork. This is analogous to two blocked computations which are waiting on the other to finish. Thus, when using blocking structures, one needs to be careful about deadlock!

Two Programming Models: Loop-Level Parallelism and Task-Based Parallelism

As described in the previous lecture, one can also use Threads.@spawn to do multithreading in Julia v1.3+. The same factors all apply: how to do locks and Mutex etc. This is a case of a parallelism construct having two alternative programming models. Threads.@spawn represents task-based parallelism, while Threads.@threads represents Loop-Level Parallelism or a parallel iterator model. Loop-based parallelization models are very high level and, assuming every iteration is independent, almost requires no code change. Task-based parallelism is a more expressive parallelism model, but usually requires modifying the code to be explicitly written as a set of parallelizable tasks. Note that in the case of Julia, Threads.@threads is implemented using Threads.@spawn's model.

Summary of Multithreading

Communication in multithreading is done on the heap. Locks and atomics allow for a form of safe message passing.
50ns-100ns of overhead. Suitable for 1μs calculations.
Be careful of ordering and heap-allocated values.

GPU Computing

GPUs are not fast. In fact, the problem with GPUs is that each processor is slow. However, GPUs have a lot of cores... like thousands.

An RTX2080, a standard "gaming" GPU (not even the ones in the cluster), has 2944 cores. However, not only are GPUs slow, but they also need to be programmed in a style that is SPMD, which stands for Single Program Multiple Data. This means that every single thread must be running the same program but on different pieces of data. Exactly the same program. If you have

if a > 1
  # Do something
else
  # Do something else
end

where some of the data goes on one branch and other data goes on the other branch, every single thread will run both branches (performing "fake" computations while on the other branch). This means that GPU tasks should be "very parallel" with as few conditionals as possible.

GPU Memory

GPUs themselves are shared memory devices, meaning they have a heap that is shared amongst all threads. However, GPUs are heavily in the NUMA camp, where different blocks of the GPU have much faster access to certain parts of the memory. Additionally, this heap is disconnected from the standard processor, so data must be passed to the GPU and data must be returned.

GPU memory size is relatively small compared to CPUs. Example: the RTX2080Ti has 8GB of RAM. Thus one needs to be doing computations that are memory compact (such as matrix multiplications, which are O(n^3) making the computation time scale quicker than the memory cost).

Note on GPU Hardware

Standard GPU hardware "for gaming", like RTX2070, is just as fast as higher end GPU hardware for Float32. Higher end hardware, like the Tesla, add more memory, memory safety, and Float64 support. However, these require being in a server since they have alternative cooling strategies, making them a higher end product.

SPMD Kernel Generation GPU Computing Models

The core programming models for GPU computing are SPMD kernel compilers, of which the most well-known is CUDA. CUDA is a C++-like programming language which compiles to .ptx kernels, and GPU execution on NVIDIA GPUs is done by "all steams" of a GPU doing concurrent execution of the kernel (generally, without going into more details, you can of "all streams" as just meaning "all cores". More detailed views of GPU execution will come later).

.ptx CUDA kernels can be compiled from LLVM IR, and thus since Julia is a programming language which emits LLVM IR for all of its operations, native Julia programs are compatible with compilation to CUDA. The helper functions to enable this separate compilation path is CUDA.jl. Let's take a look at a basic CUDA.jl kernel generating example:

using CUDA

N = 2^20
x_d = CUDA.fill(1.0f0, N)  # a vector stored on the GPU filled with 1.0 (Float32)
y_d = CUDA.fill(2.0f0, N)  # a vector stored on the GPU filled with 2.0

function gpu_add2!(y, x)
    index = threadIdx().x    # this example only requires linear indexing, so just use `x`
    stride = blockDim().x
    for i = index:stride:length(y)
        @inbounds y[i] += x[i]
    end
    return nothing
end

fill!(y_d, 2)
@cuda threads=256 gpu_add2!(y_d, x_d)
all(Array(y_d) .== 3.0f0)

true

The key to understanding the SPMD kernel approach is the index = threadIdx().x and stride = blockDim().x portions.

The way kernels are expected to run in parallel is that they are given a specific block of the computation and are expected to write a kernel which only on that small block of the input. This kernel is then called on every separate thread on the GPU, making each CUDA core simultaneously compute each block. Thus as a user in such a SPMD programming model, you never specify the computation globally but instead simply specify how chunks should behave, giving the compiler the leeway to determine the optimal global execution.

Array-Based GPU Computing Models

The simplest version of GPU computing is the array-based programming model.

A = rand(100,100); B = rand(100,100)
using CUDA
# Pass to the GPU
cuA = cu(A); cuB = cu(B)
cuC = cuA*cuB
# Pass to the CPU
C = Array(cuC)

100×100 Matrix{Float32}:
 23.863   24.6708  21.8725  24.9592  …  25.5908  24.8065  23.3334  23.2618
 23.8834  26.802   21.4275  27.4445     25.188   22.96    25.195   22.7696
 26.5255  25.3208  23.5035  26.5674     26.8677  26.0738  27.4633  24.7782
 27.6369  26.6943  22.1478  25.5366     26.3152  28.0094  26.2136  23.6394
 27.0894  27.6087  25.898   28.1268     27.6206  27.4285  26.278   26.354
 28.007   26.6298  23.8174  27.5912  …  28.7796  26.8446  26.3072  24.7452
 24.7889  24.4185  22.8833  23.9211     24.9408  25.5077  24.8007  23.3158
 26.9175  27.7795  24.5265  27.1167     28.3928  26.4243  26.7679  26.4475
 24.666   25.9953  23.9776  26.7592     27.9948  26.3762  26.4127  24.1617
 24.9728  24.7309  22.3619  25.0544     26.8657  25.5179  24.3965  22.806
  ⋮                                  ⋱                             
 26.0683  28.1808  22.6948  26.8333     26.7836  26.695   27.3185  25.6781
 25.0293  24.6561  23.8644  25.4392     25.7634  24.6031  23.614   24.2342
 26.7512  29.4749  25.5901  27.5804     29.1763  26.8834  27.4234  27.0407
 26.1518  26.2094  22.2973  26.8741     27.2914  26.8728  26.9596  24.2723
 24.6687  24.507   21.6686  23.4551  …  26.7577  23.7801  24.5952  23.3696
 23.608   26.5848  24.7448  26.9084     27.7693  26.0858  24.6728  25.3241
 27.7392  30.1099  27.199   29.3512     31.9734  28.5199  29.0834  26.3768
 24.9474  24.8082  22.7529  25.3389     26.645   24.7477  25.2643  24.0791
 24.8861  25.2811  21.1914  25.1092     25.0373  24.1505  22.4526  23.4989

Let's see the transfer times:

@btime cu(A)

8.414 μs (13 allocations: 39.43 KiB)
100×100 CuArray{Float32, 2, CUDA.DeviceMemory}:
 0.831262    0.106157  0.890854   …  0.588661   0.732195   0.558258
 0.237244    0.660139  0.0742421     0.0722447  0.13501    0.0120616
 0.478647    0.922106  0.739655      0.755478   0.237986   0.0165197
 0.106338    0.838481  0.453943      0.550406   0.188748   0.675155
 0.872202    0.253869  0.785         0.33358    0.97072    0.262895
 0.00475533  0.741666  0.817871   …  0.659075   0.947027   0.328754
 0.177475    0.567137  0.429338      0.804983   0.519245   0.913473
 0.963553    0.74966   0.613477      0.593409   0.0786322  0.205396
 0.173934    0.861131  0.60095       0.432321   0.115463   0.806317
 0.986987    0.442394  0.0678642     0.300989   0.694183   0.985045
 ⋮                                ⋱                        
 0.737683    0.446508  0.0198677     0.813603   0.281466   0.261781
 0.860455    0.538938  0.340972      0.074264   0.548628   0.927958
 0.991592    0.653379  0.659511      0.738531   0.530366   0.615387
 0.923365    0.942472  0.650108      0.542761   0.0512892  0.399623
 0.424593    0.743127  0.867381   …  0.0436126  0.449309   0.896628
 0.12659     0.630646  0.665181      0.155365   0.949248   0.566103
 0.741278    0.354367  0.654834      0.535503   0.63824    0.0254407
 0.0531912   0.366502  0.622288      0.507267   0.755967   0.0325127
 0.934019    0.597875  0.0436832     0.904097   0.641195   0.6505

@btime Array(cuC)

15.221 μs (8 allocations: 39.23 KiB)
100×100 Matrix{Float32}:
 23.863   24.6708  21.8725  24.9592  …  25.5908  24.8065  23.3334  23.2618
 23.8834  26.802   21.4275  27.4445     25.188   22.96    25.195   22.7696
 26.5255  25.3208  23.5035  26.5674     26.8677  26.0738  27.4633  24.7782
 27.6369  26.6943  22.1478  25.5366     26.3152  28.0094  26.2136  23.6394
 27.0894  27.6087  25.898   28.1268     27.6206  27.4285  26.278   26.354
 28.007   26.6298  23.8174  27.5912  …  28.7796  26.8446  26.3072  24.7452
 24.7889  24.4185  22.8833  23.9211     24.9408  25.5077  24.8007  23.3158
 26.9175  27.7795  24.5265  27.1167     28.3928  26.4243  26.7679  26.4475
 24.666   25.9953  23.9776  26.7592     27.9948  26.3762  26.4127  24.1617
 24.9728  24.7309  22.3619  25.0544     26.8657  25.5179  24.3965  22.806
  ⋮                                  ⋱                             
 26.0683  28.1808  22.6948  26.8333     26.7836  26.695   27.3185  25.6781
 25.0293  24.6561  23.8644  25.4392     25.7634  24.6031  23.614   24.2342
 26.7512  29.4749  25.5901  27.5804     29.1763  26.8834  27.4234  27.0407
 26.1518  26.2094  22.2973  26.8741     27.2914  26.8728  26.9596  24.2723
 24.6687  24.507   21.6686  23.4551  …  26.7577  23.7801  24.5952  23.3696
 23.608   26.5848  24.7448  26.9084     27.7693  26.0858  24.6728  25.3241
 27.7392  30.1099  27.199   29.3512     31.9734  28.5199  29.0834  26.3768
 24.9474  24.8082  22.7529  25.3389     26.645   24.7477  25.2643  24.0791
 24.8861  25.2811  21.1914  25.1092     25.0373  24.1505  22.4526  23.4989

The cost transferring is about 20μs-50μs in each direction, meaning that one needs to be doing operations that cost at least 200μs for GPUs to break even. A good rule of thumb is that GPU computations should take at least a millisecond, or GPU memory should be re-used.

Summary of GPUs

GPUs cores are slow
GPUs are SPMD
GPUs are generally used for linear algebra
Suitable for SPMD 1ms computations

Xeon Phi Accelerators and OpenCL

Other architectures exist to keep in mind. Xeon Phis are a now-defunct accelerator that used X86 (standard processors) as the base, using hundreds of them. For example, the Knights Landing series had 256 core accelerator cards. These were all clocked down, meaning they were still slower than a standard CPU, but there were less restrictions on SPMD (though SPMD-like computations were still preferred in order to heavily make use of SIMD). However, because machine learning essentially only needs linear algebra, and linear algebra is faster when restricting to SPMD-architectures, this failed. These devices can still be found on many high end clusters.

One alternative to CUDA is OpenCL which supports alternative architectures such as the Xeon Phi at the same time that it supports GPUs. However, one of the issues with OpenCL is that its BLAS implementation currently does not match the speed of CuBLAS, which makes NVIDIA-specific libraries still the king of machine learning and most scientific computing.

TPU Computing

TPUs are tensor processing units, which is Google's newest accelerator technology. They are essentially just "tensor operation compilers", which in computer science speak is simply higher dimensional linear algebra. To do this, they internally utilize a BFloat16 type, which is a 16-bit floating point number with the same exponent size as a Float32 with an 8-bit significant. This means that computations are highly prone to catastrophic cancellation. This computational device only works because BFloat16 has primitive operations for FMA which allows 32-bit-like accuracy of multiply-add operations, and thus computations which are only dot products (linear algebra) end up okay. Thus this is simply a GPU-like device which has gone further to completely specialize in linear algebra.

Multiprocessing (Distributed Computing)

While multithreading computes with multiple threads, multiprocessing computes with multiple independent processes. Note that processes do not share any memory, not heap or data, and thus this mode of computing also allows for distributed computations, which is the case where processes may be on separate computing hardware. However, even if they are on the same hardware, the lack of a shared address space means that multiprocessing has to do message passing, i.e. send data from one process to the other.

Distributed Tasks with Explicit Memory Handling: The Master-Worker Model

Given the amount of control over data handling, there are many different models for distributed computing. The simplest, the one that Julia's Distributed Standard Library defaults to, is the master-worker model. The master-worker model has one process, deemed the master, which controls the worker processes.

Here we can start by adding some new worker processes:

using Distributed
addprocs(4)

This adds 4 worker processes for the master to control. The simplest computations are those where the master process gives the worker process a job which returns the value afterwards. For example, a pmap operation or @distributed loop gives the worker a function to execute, along with the data, and the worker then computes and returns the result.

At a lower level, this is done by Distributed.@spawning jobs, or using a remotecall and fetching the result. ParallelDataTransfer.jl gives an extended set of primitive message passing operations. For example, we can explicitly tell it to compute a function f on the remote process like:

@everywhere f(x) = x.^2 # Define this function on all processes
t = remotecall(f,2,randn(10))

remotecall is a non-blocking operation that returns a Future. To access the data, one should use the blocking operation fetch to receive the data:

xsq = fetch(t)

Distributed Tasks with Implicit Memory Handling: Distributed Task-Based Parallelism

Another popular programming model for distributed computation is task-based parallelism but where all of the memory handling is implicit. Since, unlike the shared memory parallelism case, data transfers are required for given processes to share heap allocated values, distributed task-based parallelism libraries tend to want a global view of the whole computation in order to build a sophisticated schedule that includes where certain data lives and when transfers will occur. Because of this, distributed task-based parallelism libraries tend to want the entire computational graph of the computation, to be able to restructure the graph as necessary with their own data transfer portions spliced into the compute. Examples of this kind of framework are:

Tensorflow
dask ("distributed tasks")
Dagger.jl

Using these kinds of libraries requires building a directed acyclic graph (DAG). For example, the following showcases how to use Dagger.jl to represent a bunch of summations:

using Dagger

add1(value) = value + 1
add2(value) = value + 2
combine(a...) = sum(a)

p = delayed(add1)(4)
q = delayed(add2)(p)
r = delayed(add1)(3)
s = delayed(combine)(p, q, r)

@assert collect(s) == 16

Once the global computation is specified, commands like collect are used to instantiate the graph on given input data, which then run the computation in a (potentially) distributed manner, depending on internal scheduler heuristics.

Distributed Array-Based Parallelism: SharedArrays, Elemental, and DArrays

Because array operations are a standard way to compute in scientific computing, there are higher level primitives to help with message passing. A SharedArray is an array which acts like a shared memory device. This means that every change to a SharedArray causes message passing to keep them in sync, and thus this should be used with a performance caution. DistributedArrays.jl is a parallel array type which has local blocks and can be used for writing higher level abstractions with explicit message passing. Because it is currently missing high-level parallel linear algebra, currently the recommended tool for distributed linear algebra is Elemental.jl.

MapReduce, Hadoop, and Spark: The Map-Reduce Model

Many data-parallel operations work by mapping a function f onto each piece of data and then reducing it. For example, the sum of squares maps the function x -> x^2 onto each value, and then these values are reduced by performing a summation. MapReduce was a Google framework in the 2000's built around this as the parallel computing concept, and current data-handling frameworks, like Hadoop and Spark, continue this as the core distributed programming model.

In Julia, there exists the mapreduce function for performing serial mapreduce operations. It also work on GPUs. However, it does not auto-distribute. For distributed map-reduce programming, the @distributed for-loop macro can be used. For example, sum of squares of random numbers is:

@distributed (+) for i in 1:1000
  rand()^2
end

One can see that computing summary statistics is easily done in this framework which is why it was majorly adopted among "big data" communities.

@distributed uses a static scheduler. The dynamic scheduling equivalent is pmap:

pmap(i->rand()^2,1:100)

which will dynamically allocate jobs to processes as they declare they have finished jobs. This thus has the same performance difference behavior as Threads.@threads vs Threads.@spawn.

MPI: The Distributed SPMD Model

The main way to do high-performance multiprocessing is MPI, which is an old distributed computing interface from the C/Fortran days. Julia has access to the MPI programming model through MPI.jl. The programming model for MPI is that every computer is running the same program, and synchronization is performed by blocking communication. For example, let's look at the following:

using MPI
MPI.Init()

comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)
size = MPI.Comm_size(comm)

dst = mod(rank+1, size)
src = mod(rank-1, size)

N = 4

send_mesg = Array{Float64}(undef, N)
recv_mesg = Array{Float64}(undef, N)

fill!(send_mesg, Float64(rank))

rreq = MPI.Irecv!(recv_mesg, src,  src+32, comm)

print("$rank: Sending   $rank -> $dst = $send_mesg\n")
sreq = MPI.Isend(send_mesg, dst, rank+32, comm)

stats = MPI.Waitall!([rreq, sreq])

print("$rank: Received $src -> $rank = $recv_mesg\n")

MPI.Barrier(comm)

> mpiexecjl -n 3 julia examples/04-sendrecv.jl
1: Sending   1 -> 2 = [1.0, 1.0, 1.0, 1.0]
0: Sending   0 -> 1 = [0.0, 0.0, 0.0, 0.0]
1: Received 0 -> 1 = [0.0, 0.0, 0.0, 0.0]
2: Sending   2 -> 0 = [2.0, 2.0, 2.0, 2.0]
0: Received 2 -> 0 = [2.0, 2.0, 2.0, 2.0]
2: Received 1 -> 2 = [1.0, 1.0, 1.0, 1.0]

Let's investigate this a little bit. Think about having two computers run this line-by-line side by side. They will both locally build arrays, and then call MPI.Irecv!, which is an asynchronous non-blocking call to listen for a message from a given rank (a rank is the ID for a given process). Then they call their sreq = MPI.Isend function, which is an asynchronous non-blocking call to send a message send_mesg to the chosen rank. When the expected message is found, MPI.Irecv! will then run on its green thread and finish, updating the recv_mesg with the information from the message. However, in order to make sure all of the messages are received, we have added in a blocking operation MPI.Waitall!([rreq, sreq]), which will block all further execution on the given rank until both its rreq and sreq tasks are completed. After that is done, each given rank will have its updated data, and the script will continue on all ranks.

This model is thus very asynchronous and allows for many different computers to run one highly parallelized program, managing the data transmissions in a sparse way without a single computer in charge of managing the whole computation. However, it can be prone to deadlock, since errors in the program may for example require rank 1 to receive a message from rank 2 before continuing the program, but rank 2 won't continue to program until it receives a message from rank 1. For this reason, while MPI has been the most successful large-scale distributed computing model and almost all major high-performance computing (HPC) cluster competitions have been won by codes utilizing the MPI model, the MPI model is nowadays considered a last resort due to these safety issues.

Summary of Multiprocessing

Cost is hardware dependent: only suitable for 1ms or higher depending on the connections through which the messages are being passed and the topology of the network.
The Master-worker programming model is Julia's Distributed model
The Map-reduce programming model is a common data-handling model
Array-based distributed computations are another abstraction, used in all forms of parallelism.
MPI is a SPMD model of distributed computing, where each process is completely independent and one just controls the memory handling.

The Bait-and-switch: Parallelism is about Programming Models

While this looked like a lecture about parallel programming at the different levels and types of hardware, this wide overview showcases that the real underlying commonality within parallel program is in the parallel programming models, of which there are not too many. There are:

Map-reduce parallelism models. pmap, MapReduce (Hadoop/Spark)
- Pros: Easy to use
- Cons: Requires that your program is specifically only mapping functions f and reducing them. That said, many data science operations like mean, variance, maximum, etc. can be represented as map-reduce calls, which lead to the popularity of these approaches for "big data" operations.
Array-based parallelism models. SIMD (at the compiler level), CuArray, DistributedArray, PyTorch.torch, ...
- Pros: Easy to use, can have very fast library implementations for specific functions
- Cons: Less control and restricted to specific functions implemented by the library. Parallelism matches the data structure, so it requires the user to be careful and know the best way to split the data.
Loop-based parallelism models. Threads.@threads, @distributed, OpenMP, MATLAB's parfor, Chapel's iterator parallelism
- Pros: Easy to use, almost no code change can make existing loops parallelized
- Cons: Refined operations, like locking and sharing data, can be awkward to write. Less control over fine details like scheduling, meaning less opportunities to optimize.
Task-based parallelism models with implicit distributed data handling. Threads.@spawn, Dagger.jl, TensorFlow, dask
- Pros: Relatively high level, low risk of errors since parallelism is mostly handled for the user. User simply describes which functions to call in what order.
- Cons: When used on distributed systems, implicit data handling is hard, meaning it's generally not as efficient if you don't optimize the code yourself or help the optimizer, and these require specific programming constructs for building the computational graph. Note this is only a downside for distributed data parallelism, whereas when applied to shared memory systems these aspects no longer require handling by the task scheduler.
Task-based parallelism models with explicit data handling. Distributed.@spawn
- Pros: Allows for control over what compute hardware will have specific pieces of data and allows for transferring data manually.
- Cons: Requires transferring data manually. All computations are managed by a single process/computer/node and thus it can have some issues scaling to extreme (1000+ node) computing situations.
SPMD kernel parallelism models. CUDA, MPI, KernelAbstractions.jl
- Pros: Reduces the problem for the user to only specify what happens in small chunks of the problem. Works on accelerator hardware like GPUs, TPUs, and beyond.
- Cons: Only works for computations that be represented block-wise, and relies on the compiler to generate good code.

In this sense, the different parallel programming "languages" and features are much more similar than they are all different, falling into similar categories.