Optimizing Serial Code

Chris Rackauckas
September 3rd, 2019

Youtube Video Link Part 1

Youtube Video Link Part 2

At the center of any fast parallel code is a fast serial code. Parallelism is made to be a performance multiplier, so if you start from a bad position it won't ever get much better. Thus the first thing that we need to do is understand what makes code slow and how to avoid the pitfalls. This discussion of serial code optimization will also directly motivate why we will be using Julia throughout this course.

Mental Model of a Memory

To start optimizing code you need a good mental model of a computer.

High Level View

At the highest level you have a CPU's core memory which directly accesses an L1 cache. The L1 cache has the fastest access, so things which will be needed soon are kept there. However, it is filled from the L2 cache, which itself is filled from the L3 cache, which is filled from the main memory. This brings us to the first idea in optimizing code: using things that are already in a closer cache can help the code run faster because the data doesn't have to be queried for and moved up this chain.

When something needs to be pulled directly from main memory this is known as a cache miss. To understand the cost of a cache miss vs standard calculations, take a look at this classic chart.

(Cache-aware and cache-oblivious algorithms are methods which change their indexing structure to optimize their use of the cache lines. We will return to this when talking about performance of linear algebra.)

Cache Lines and Row/Column-Major

Many algorithms in numerical linear algebra are designed to minimize cache misses. Because of this chain, many modern CPUs try to guess what you will want next in your cache. When dealing with arrays, it will speculate ahead and grab what is known as a cache line: the next chunk in the array. Thus, your algorithms will be faster if you iterate along the values that it is grabbing.

The values that it grabs are the next values in the contiguous order of the stored array. There are two common conventions: row major and column major. Row major means that the linear array of memory is formed by stacking the rows one after another, while column major puts the column vectors one after another.

Julia, MATLAB, and Fortran are column major. Python's numpy is row-major.

A = rand(100,100)
B = rand(100,100)
C = rand(100,100)
using BenchmarkTools
function inner_rows!(C,A,B)
  for i in 1:100, j in 1:100
    C[i,j] = A[i,j] + B[i,j]
  end
end
@btime inner_rows!(C,A,B)
7.569 μs (0 allocations: 0 bytes)
function inner_cols!(C,A,B)
  for j in 1:100, i in 1:100
    C[i,j] = A[i,j] + B[i,j]
  end
end
@btime inner_cols!(C,A,B)
2.422 μs (0 allocations: 0 bytes)
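The column-major layout can also be inspected directly, without any timing. The following is a minimal sketch using only Base functions; the matrix `M` and the name `linear` are made up for illustration.

```julia
# A small 2x3 matrix: two rows, three columns.
M = [1 2 3;
     4 5 6]

# `vec` exposes the underlying linear memory order without reordering it.
# Column-major storage walks down each column before moving to the next one.
linear = vec(M)

# The second element in memory is row 2 of column 1, not column 2 of row 1.
second_in_memory = M[2]

# Strides give the memory step for moving one index along each dimension:
# 1 element to move down a row, 2 elements (one full column) to move across.
memory_strides = strides(M)
```

Because the row index is the one with stride 1, the fastest loop (the innermost one) should run over rows within a column, which is what inner_cols! does.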

Lower Level View: The Stack and the Heap

Locally, the memory is composed of a stack and a heap. The stack requires a static allocation: it is ordered. Because it's ordered, it is very clear where things are in the stack, and therefore accesses are very quick (think instantaneous). However, because this is static, it requires that the size of the variables is known at compile time (to determine all of the variable locations). Since that is not possible with all variables, there exists the heap. The heap is essentially a stack of pointers to objects in memory. When heap variables are needed, their values are pulled up the cache chain and accessed.

Heap Allocations and Speed

Heap allocations are costly because they involve this pointer indirection, so stack allocation should be done when sensible (it's not helpful for really large arrays, but for small values like scalars it's essential!).

function inner_alloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = [A[i,j] + B[i,j]]
    C[i,j] = val[1]
  end
end
@btime inner_alloc!(C,A,B)
234.630 μs (10000 allocations: 625.00 KiB)
function inner_noalloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
end
@btime inner_noalloc!(C,A,B)
2.425 μs (0 allocations: 0 bytes)

Why does the array here get heap-allocated? The compiler isn't able to prove/guarantee at compile-time that the array's size will always be a given value, and thus it allocates the array on the heap. @btime tells us this allocation occurred and shows us the total heap memory that was taken. Meanwhile, the size of a Float64 number is known at compile-time (64 bits), and so it is stored on the stack and given a specific location that the compiler will be able to directly address.
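One rough way to see which sizes are fixed by the type alone is to ask Julia about the types themselves. This is only a sketch: `isbitstype` reports whether a type is plain, fixed-size data, which is a reasonable proxy for "can live on the stack" but not a guarantee about where the compiler actually places a value. The variable names are made up for illustration.

```julia
# The size of a Float64 is a property of the type: always 8 bytes (64 bits).
scalar_size = sizeof(Float64)

# A tuple of three Float64s also has a size fixed by its type (3 * 8 bytes).
tuple_size = sizeof(NTuple{3,Float64})
tuple_is_bits = isbitstype(NTuple{3,Float64})

# A Vector{Float64} is different: arrays of length 1 and length 2 share the
# same type, so the type alone cannot tell the compiler how big the data is.
same_type = typeof([1.0]) == typeof([1.0, 2.0])
vector_is_bits = isbitstype(Vector{Float64})
```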

Note that one can use the StaticArrays.jl library to get statically-sized arrays and thus arrays which are stack-allocated:

using StaticArrays
function static_inner_alloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = @SVector [A[i,j] + B[i,j]]
    C[i,j] = val[1]
  end
end
@btime static_inner_alloc!(C,A,B)
2.422 μs (0 allocations: 0 bytes)

Mutation to Avoid Heap Allocations

Many times you do need to write into an array, so how can you write into an array without performing a heap allocation? The answer is mutation. Mutation is changing the values of an already existing array. In that case, no free memory has to be found to put the array (and no memory has to be freed by the garbage collector).

In Julia, functions which mutate their first argument are conventionally marked with a ! at the end of the name. See the difference between these two equivalent functions:

function inner_noalloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
end
@btime inner_noalloc!(C,A,B)
2.421 μs (0 allocations: 0 bytes)
function inner_alloc(A,B)
  C = similar(A) # allocates a fresh output array on every call
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
  C # return the result so the caller can use it
end
@btime inner_alloc(A,B)
7.349 μs (2 allocations: 78.17 KiB)

To use this effectively, the ! version assumes that the caller has already allocated the output array and passes it in as the output argument. If that is not true, then the caller needs to allocate it manually first. The goal of that interface is to give the caller control over the allocations, allowing them to manually reduce the total number of heap allocations and thus increase the speed.
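A common way to offer both interfaces is to write the mutating version once and define the allocating version as a thin wrapper around it. The sketch below assumes hypothetical names `add!` and `add`; they are not from the lecture.

```julia
# Mutating version: writes into a caller-provided output array C.
function add!(C, A, B)
    for j in axes(A, 2), i in axes(A, 1)   # columns outer, rows inner
        C[i, j] = A[i, j] + B[i, j]
    end
    return C
end

# Allocating convenience wrapper: allocates the output, then reuses add!.
add(A, B) = add!(similar(A), A, B)

A = [1.0 2.0; 3.0 4.0]
B = ones(2, 2)

# The caller who cares about allocations creates C once and reuses it.
C = zeros(2, 2)
result = add!(C, A, B)

# The caller who does not care can use the allocating form.
fresh = add(A, B)
```

With this layout, a loop that calls add! many times pays for the output array only once, while one-off callers still have a convenient form.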

Julia's Broadcasting Mechanism

Wouldn't it be nice to not have to write the loop there? In many high level languages this is simply called vectorization. In Julia, we will call it array vectorization to distinguish it from the SIMD vectorization which is common in lower level languages like C, Fortran, and Julia.

In Julia, if you use . on an operator it will transform it to the broadcasted form. Broadcast is lazy: it will build up an entire .'d expression and then call broadcast! on the composed expression. This is customizable and documented in detail. However, to a first approximation we can think of the broadcast mechanism as a mechanism for building fused expressions. For example, the Julia code:

A .+ B .+ C;

under the hood lowers to something like:

map((a,b,c)->a+b+c,A,B,C);

where map is a function that just loops over the values element-wise.

Take a quick second to think about why loop fusion may be an optimization.

Think about what would happen if you did not fuse the operations. We can write that out as:

tmp = A .+ B
tmp .+ C;

Notice that if we did not fuse the expressions, we would need some place to put the result of A .+ B, and that would have to be an array, which means it would cause a heap allocation. Thus broadcast fusion eliminates the temporary variable (colloquially called just a temporary).

function unfused(A,B,C)
  tmp = A .+ B
  tmp .+ C
end
@btime unfused(A,B,C);
9.778 μs (4 allocations: 156.34 KiB)
fused(A,B,C) = A .+ B .+ C
@btime fused(A,B,C);
5.510 μs (2 allocations: 78.17 KiB)

Note that we can also fuse the output by using .=. This is essentially the vectorized version of a ! function:

D = similar(A)
fused!(D,A,B,C) = (D .= A .+ B .+ C)
@btime fused!(D,A,B,C);
3.504 μs (0 allocations: 0 bytes)
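To make the fusion concrete, the following sketch writes the same computation three ways on small vectors: the fused in-place broadcast, the hand-written loop it is roughly equivalent to, and the `@.` macro, which adds the dots for you. The array names are made up for illustration.

```julia
A = [1.0, 2.0]
B = [10.0, 20.0]
C = [100.0, 200.0]

# Fused, in-place broadcast: one pass over the data, no temporary array.
D = similar(A)
D .= A .+ B .+ C

# Roughly what the fused broadcast does: a single loop writing into the output.
E = similar(A)
for i in eachindex(E, A, B, C)
    E[i] = A[i] + B[i] + C[i]
end

# The @. macro dots every operator and the assignment in the expression.
F = similar(A)
@. F = A + B + C
```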

Note on Broadcasting Function Calls

Julia allows for broadcasting the call () operator as well. .() will call the function element-wise on all arguments, so sin.(A) will be the elementwise sine function. This will fuse in Julia like the other operators.
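As a small sketch of the dotted call syntax, the example below mixes broadcast function calls with broadcast operators and also applies the dot to a user-defined scalar function. The helper `square_plus_one` is a hypothetical name introduced only for this example.

```julia
x = [0.0, 0.5, 1.0]

# Dotted calls and dotted operators fuse into a single elementwise loop.
y = sin.(x) .^ 2 .+ cos.(x) .^ 2

# Any scalar function can be broadcast, including ones you define yourself.
square_plus_one(v) = v^2 + 1     # hypothetical helper, defined on scalars only
z = square_plus_one.(x)
```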

Note on Vectorization and Speed

In articles on MATLAB, Python, R, etc., this is where you will be told to vectorize your code. Notice from above that there isn't a performance difference between writing loops and using vectorized broadcasts. This is not abnormal! The reason why you are told to vectorize code in these other languages is because they have a high per-operation overhead (which will be discussed further down). This means that every call, like +, is costly in these languages. To get around this issue and make the language usable, someone wrote and compiled the loop for the C/Fortran function that does the broadcasted form (see numpy's Github repo). Thus A .+ B's MATLAB/Python/R equivalents are calling a single C function to generally avoid the cost of function calls and thus are faster.

But this is not an intrinsic property of vectorization. Vectorization isn't "fast" in these languages, it's just close to the correct speed. The reason vectorization is recommended is because looping is slow in these languages. Because looping isn't slow in Julia (or C, C++, Fortran, etc.), loops and vectorization generally have the same speed. So use the one that works best for your code without a care about performance.

(As a small side effect, these high level languages tend to allocate a lot of temporary variables since the individual C kernels are written for specific numbers of inputs and thus don't naturally fuse. Julia's broadcast mechanism is just generating and JIT compiling Julia functions on the fly, and thus it can accommodate the combinatorial explosion in the number of choices by only compiling the combinations that are necessary for a specific code.)

Heap Allocations from Slicing

It's important to note that slices in Julia produce copies instead of views. Thus for example:

A[50,50]
0.4309718283801406

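returns a new value rather than a reference into A (a single-element index like this yields a scalar, while a range slice such as A[1:5,1:5] allocates a whole new array). This is for safety, since if the result pointed to the same memory then writing to it would change the original array. We can demonstrate this by asking for a view instead of a copy.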

@show A[1]
E = @view A[1:5,1:5]
E[1] = 2.0
@show A[1]
A[1] = 0.8732593203756598
A[1] = 2.0
2.0

However, this means that @view A[1:5,1:5] did not allocate an array. (It does allocate a pointer if the escape analysis is unable to prove that it can be elided. This means that in small loops there will be no allocation, while if the view is returned from a function, for example, it will allocate the pointer, ~80 bytes, but not the memory of the array. It is thus O(1) in cost with a relatively small constant.)
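The difference between a copying slice and a view can be checked side by side. This is a minimal sketch on a small array of zeros; the names `M`, `S`, and `V` are made up for illustration.

```julia
M = zeros(3, 3)

# A range slice copies the data: writing to the copy leaves M untouched.
S = M[1:2, 1:2]
S[1, 1] = 1.0
after_slice_write = M[1, 1]

# A view shares memory with M: writing to the view writes into M itself.
V = @view M[1:2, 1:2]
V[1, 1] = 2.0
after_view_write = M[1, 1]

# The view remembers which array it points into.
shares_parent = parent(V) === M
```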

Asymptotic Cost of Heap Allocations

Heap allocations have to locate and prepare a space in RAM that is proportional to the amount of memory that is requested, which means that the cost of a heap allocation for an array is O(n), with a large constant. As RAM begins to fill up, this cost dramatically increases. If you run out of RAM, your computer may begin to use swap, which is essentially RAM simulated on your hard drive. Generally when you hit swap your performance is so dead that you may think that your computation froze, but if you check your resource use you will notice that it has actually just filled the RAM and started to use the swap.

But think of it as O(n) with a large constant factor. This means that for operations which only touch the data once, heap allocations can dominate the computational cost:

using LinearAlgebra, BenchmarkTools
function alloc_timer(n)
    A = rand(n,n)
    B = rand(n,n)
    C = rand(n,n)
    t1 = @belapsed $A .* $B
    t2 = @belapsed ($C .= $A .* $B)
    t1,t2
end
ns = 2 .^ (2:11)
res = [alloc_timer(n) for n in ns]
alloc   = [x[1] for x in res]
noalloc = [x[2] for x in res]

using Plots
plot(ns,alloc,label="=",xscale=:log10,yscale=:log10,legend=:bottomright,
     title="Micro-optimizations matter for BLAS1")
plot!(ns,noalloc,label=".=")

However, when the computation takes O(n^3), like in matrix multiplications, the high constant factor only comes into play when the matrices are sufficiently small:

using LinearAlgebra, BenchmarkTools
function alloc_timer(n)
    A = rand(n,n)
    B = rand(n,n)
    C = rand(n,n)
    t1 = @belapsed $A*$B
    t2 = @belapsed mul!($C,$A,$B)
    t1,t2
end
ns = 2 .^ (2:7)
res = [alloc_timer(n) for n in ns]
alloc   = [x[1] for x in res]
noalloc = [x[2] for x in res]

using Plots
plot(ns,alloc,label="*",xscale=:log10,yscale=:log10,legend=:bottomright,
     title="Micro-optimizations only matter for small matmuls")
plot!(ns,noalloc,label="mul!")

Still, using a mutating form is never worse and is always at least a little bit better.
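In practice the mutating form pays off when the same output buffer is reused across many calls. The sketch below uses `mul!` from the LinearAlgebra standard library on tiny matrices; the loop count and matrix values are arbitrary choices for illustration, not a benchmark.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [0.0 1.0; 1.0 0.0]

# Allocate the output once, outside of the loop.
C = similar(A)

for _ in 1:3
    # mul! overwrites C with A*B in place; no new matrix is created here.
    mul!(C, A, B)
end

# The allocating form gives the same values but creates a new array each call.
C_alloc = A * B
```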

Optimizing Memory Use Summary

  • Avoid cache misses by reusing values

  • Iterate along columns

  • Avoid heap allocations in inner loops

  • Heap allocations occur when the size of things is not proven at compile-time

  • Use fused broadcasts (with mutated outputs) to avoid heap allocations

  • Array vectorization confers no special benefit in Julia because Julia loops are as fast as C or Fortran

  • Use views instead of slices when applicable

  • Avoiding heap allocations is most necessary for O(n) algorithms or algorithms with small arrays

  • Use StaticArrays.jl to avoid heap allocations of small arrays in inner loops

Julia's Type Inference and the Compiler

Many people think Julia is fast because it is JIT compiled. That is simply not true (we've already shown examples where Julia code isn't fast, but it's always JIT compiled!). Instead, the reason why Julia is fast is because of the combination of two ideas:

  • Type inference

  • Type specialization in functions

These two features naturally give rise to Julia's core design feature: multiple dispatch. Let's break down these pieces.

Type Inference

At the core level of the computer, everything has a type. Some languages are more explicit about said types, while others try to hide the types from the user. A type tells the compiler how to store and interpret the memory of a value. For example, if the compiled code knows that the value in the register is supposed to be interpreted as a 64-bit floating point number, then it understands that slab of memory as an IEEE 754 double: 1 sign bit, 11 exponent bits, and 52 fraction bits.

Importantly, it will know what to do for function calls. If the code tells it to add two floating point numbers, it will send them as inputs to the Floating Point Unit (FPU) which will give the output.
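The idea that a type is just an interpretation of bits can be checked from within Julia itself. The sketch below assumes the standard IEEE 754 double layout (1 sign bit, 11 exponent bits, 52 fraction bits) and uses only Base functions; the variable names are made up for illustration.

```julia
# The raw 64 bits that store the Float64 value 1.0, as a string of 0s and 1s.
bits = bitstring(1.0)

# Split the bits according to the IEEE 754 double-precision layout.
sign_bit = bits[1:1]
exponent = bits[2:12]     # 11 bits, stored with a bias of 1023
fraction = bits[13:64]    # 52 bits

# The stored exponent of 1.0 is exactly the bias, since 1.0 = 1.0 * 2^0.
stored_exponent = parse(Int, exponent; base = 2)

# The same 64 bits read as an unsigned integer give a very different value.
as_integer = reinterpret(UInt64, 1.0)
```

Nothing about the memory changes between the last two interpretations; only the type, and therefore the instructions used to operate on it, differs.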

If the types are not known, then... ? So one cannot actually compute until the types are known, since otherwise it's impossible to interpret the memory. In languages like C, the programmer has to declare the types of variables in the program:

void add(double *a, double *b, double *c, size_t n){
  size_t i;
  for(i = 0; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}

The types are known at compile time because the programmer set them in stone. In many interpreted languages, like Python, types are checked at runtime. For example,

a = 2
b = 4
a + b

when the addition occurs, the Python interpreter will check the object holding the values and ask it for its types, and use those types to know how to compute the + function. For this reason, the add function in Python is rather complex since it needs to decode and have a version for all primitive types!

Not only is there runtime overhead from the checks in function calls due to not being explicit about types, there is also a memory overhead, since it is impossible to know how much memory a value will take when that is a property of its type. Thus the Python interpreter cannot statically guarantee exact unchanging values for the size that a value would take in the stack, meaning that the variables are not stack-allocated. This means that every number ends up heap-allocated, which hopefully begins to explain why this is not as fast as C.

The solution in Julia is somewhat of a hybrid. The Julia code looks like:

a = 2
b = 4
a + b
6

However, before JIT compilation, Julia runs a type inference algorithm which finds out that a is an Int, and b is an Int. You can then understand that if it can prove that a+b is an Int, then it can propagate all of the types through.

Type Specialization in Functions

Julia is able to propagate type inference through functions because, even if a function is "untyped", Julia will interpret this as a generic function over possible methods, where every method has a concrete type. This means that in Julia, the function:

f(x,y) = x+y
f (generic function with 1 method)

is not what you may think of as a "single function", since given inputs of different types it will actually be a different function. We can see this by examining the LLVM IR (LLVM is Julia's compiler, the IR is the Intermediate Representation, i.e. a platform-independent representation of assembly that lives in LLVM that it knows how to convert into assembly per architecture):

using InteractiveUtils
@code_llvm f(2,5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
define i64 @julia_f_3079(i64 signext %0, i64 signext %1) #0 {
top:
; ┌ @ int.jl:87 within `+`
   %2 = add i64 %1, %0
   ret i64 %2
; └
}
@code_llvm f(2.0,5.0)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
define double @julia_f_3081(double %0, double %1) #0 {
top:
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
   ret double %2
; └
}

Notice that when f is the function that takes in two Ints, Ints add to give an Int and thus f outputs an Int. When f is the function that takes two Float64s, f returns a Float64. Thus in the code:

function g(x,y)
  a = 4
  b = 2
  c = f(x,a)
  d = f(b,c)
  f(d,y)
end

@code_llvm g(2,5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `g`
define i64 @julia_g_3083(i64 signext %0, i64 signext %1) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ int.jl:87 within `+`
    %2 = add i64 %0, 6
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:7 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ int.jl:87 within `+`
    %3 = add i64 %2, %1
    ret i64 %3
; └└
}

g on two Int inputs is a function that has Ints at every step along the way and spits out an Int. We can use the @code_warntype macro to better see the inference along the steps of the function:

@code_warntype g(2,5)
MethodInstance for g(::Int64, ::Int64)
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(g)
  x::Int64
  y::Int64
Locals
  d::Int64
  c::Int64
  b::Int64
  a::Int64
Body::Int64
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::Int64
└──      return %5
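The inferred return type can also be queried programmatically instead of being read off the @code_warntype printout. The sketch below redefines f and g as in the text so that it is self-contained, and uses `Base.return_types`, an unexported Base helper, to ask inference what each specialization returns.

```julia
f(x, y) = x + y

function g(x, y)
    a = 4
    b = 2
    c = f(x, a)
    d = f(b, c)
    f(d, y)
end

# Ask the inference engine for the return type of each specialization.
# The result is a vector with one entry per matching method.
int_case = Base.return_types(g, (Int, Int))
mixed_case = Base.return_types(g, (Float64, Int))

# The actual values agree with what inference predicted.
int_result = g(2, 5)
mixed_result = g(2.0, 5)
```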

What happens on mixtures?

@code_llvm f(2.0,5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
define double @julia_f_3542(double %0, i64 signext %1) #0 {
top:
; ┌ @ promotion.jl:422 within `+`
; │┌ @ promotion.jl:393 within `promote`
; ││┌ @ promotion.jl:370 within `_promote`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ float.jl:159 within `Float64`
       %2 = sitofp i64 %1 to double
; │└└└└
; │ @ promotion.jl:422 within `+` @ float.jl:409
   %3 = fadd double %2, %0
   ret double %3
; └
}

When we add an Int to a Float64, we promote the Int to a Float64 and then perform the + between two Float64s. When we go to the full function, we see that it can still infer:

@code_warntype g(2.0,5)
MethodInstance for g(::Float64, ::Int64)
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(g)
  x::Float64
  y::Int64
Locals
  d::Float64
  c::Float64
  b::Int64
  a::Int64
Body::Float64
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::Float64
└──      return %5

and it uses this to build very efficient assembly code because it knows exactly what the types will be at every step:

@code_llvm g(2.0,5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `g`
define double @julia_g_3545(double %0, i64 signext %1) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ promotion.jl:422 within `+` @ float.jl:409
    %2 = fadd double %0, 4.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ promotion.jl:422 within `+` @ float.jl:409
    %3 = fadd double %2, 2.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:7 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ promotion.jl:422 within `+`
; ││┌ @ promotion.jl:393 within `promote`
; │││┌ @ promotion.jl:370 within `_promote`
; ││││┌ @ number.jl:7 within `convert`
; │││││┌ @ float.jl:159 within `Float64`
        %4 = sitofp i64 %1 to double
; ││└└└└
; ││ @ promotion.jl:422 within `+` @ float.jl:409
    %5 = fadd double %3, %4
    ret double %5
; └└
}

(notice how it handles the constant literals 4 and 2: it converted them at compile time to reduce the algorithm to 3 floating point additions).

Type Stability

Why is the inference algorithm able to infer all of the types of g? It's because it knows the types coming out of f at compile time. Given an Int and a Float64, f will always output a Float64, and thus it can continue with inference knowing that c, d, and eventually the output are Float64. Thus in order for this to occur, we need the type of the output of our function to be directly inferable from the types of the inputs. This property is known as type-stability.

An example of breaking it is as follows:

function h(x,y)
  out = x + y
  rand() < 0.5 ? out : Float64(out)
end
h (generic function with 1 method)

Here, on an integer input the output's type is randomly either Int or Float64, and thus the output type is unknown:

@code_warntype h(2,5)
MethodInstance for h(::Int64, ::Int64)
  from h(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(h)
  x::Int64
  y::Int64
Locals
  out::Int64
Body::UNION{FLOAT64, INT64}
1 ─      (out = x + y)
│   %2 = Main.rand()::Float64
│   %3 = (%2 < 0.5)::Bool
└──      goto #3 if not %3
2 ─      return out
3 ─ %6 = Main.Float64(out)::Float64
└──      return %6

This means that its output type is Union{Int,Float64} (Julia uses union types to keep the types still somewhat constrained). Once there are multiple choices, those need to get propagated through the compiler, and all subsequent calculations have to handle the result being either an Int or a Float64.

(Note that Julia has small union optimizations, so if this union is of size 4 or less then Julia will still be able to optimize it quite a bit.)
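A simple way to repair this kind of instability is to make every branch return the same type. The sketch below contrasts the unstable function with one possible stable variant; the names `h_unstable` and `h_stable` are hypothetical, and always converting to Float64 is just one reasonable choice of fix.

```julia
# Unstable: the return type depends on a runtime coin flip.
function h_unstable(x, y)
    out = x + y
    rand() < 0.5 ? out : Float64(out)
end

# Stable: the result is always converted, so the type depends only on inputs.
function h_stable(x, y)
    out = x + y
    Float64(out)
end

# Compare what inference concludes for Int inputs.
unstable_type = first(Base.return_types(h_unstable, (Int, Int)))
stable_type = first(Base.return_types(h_stable, (Int, Int)))

stable_value = h_stable(2, 5)
```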

Multiple Dispatch

The + function on numbers was implemented in Julia, so how were these rules all written down? The answer is multiple dispatch. In Julia, you can tell a function how to act differently on different types by adding type annotations to the arguments. For example, let's make a function that computes 2x + y on Int and x/y on Float64:

ff(x::Int,y::Int) = 2x + y
ff(x::Float64,y::Float64) = x/y
@show ff(2,5)
@show ff(2.0,5.0)
ff(2, 5) = 9
ff(2.0, 5.0) = 0.4
0.4

The + function in Julia is just defined as +(a,b), and we can actually point to that code in the Julia distribution:

@which +(2.0,5)
+(x::Number, y::Number) in Base at promotion.jl:422

To control at a higher level, Julia uses abstract types. For example, Float64 <: AbstractFloat, meaning Float64s are a subtype of AbstractFloat. We also have that Int <: Integer, while both AbstractFloat <: Number and Integer <: Number.
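These subtype relations can be checked directly with the <: operator, and methods can be attached at any level of the hierarchy. In the sketch below, `describe` is a hypothetical function introduced only to show which method gets picked.

```julia
# Subtype relations stated in the text.
float_is_abstractfloat = Float64 <: AbstractFloat
int_is_integer = Int <: Integer
abstractfloat_is_number = AbstractFloat <: Number
integer_is_number = Integer <: Number

# Methods defined on abstract types apply to all of their subtypes.
describe(x::Integer) = "integer"
describe(x::AbstractFloat) = "float"
describe(x::Number) = "number"      # catches every other kind of number

d_int = describe(2)          # Int <: Integer
d_float = describe(2.0)      # Float64 <: AbstractFloat
d_rational = describe(1//2)  # Rational is neither, so the Number method is used
```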

Julia allows the user to define dispatches at a higher level, and the version that is called is the most strict (i.e. most specific) version that is correct. For example, right now with ff we will get a MethodError if we call it on a Float64 and an Int because no such method exists:

ff(2.0,5)
ERROR: MethodError: no method matching ff(::Float64, ::Int64)

Closest candidates are:
  ff(::Float64, !Matched::Float64)
   @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3
  ff(!Matched::Int64, ::Int64)
   @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2

However, we can add a fallback method to the function ff for two numbers:

ff(x::Number,y::Number) = x + y
ff(2.0,5)
7.0

Notice that the fallback method still specializes on the inputs:

@code_llvm ff(2.0,5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `ff`
define double @julia_ff_3664(double %0, i64 signext %1) #0 {
top:
; ┌ @ promotion.jl:422 within `+`
; │┌ @ promotion.jl:393 within `promote`
; ││┌ @ promotion.jl:370 within `_promote`
; │││┌ @ number.jl:7 within `convert`
; ││││┌ @ float.jl:159 within `Float64`
       %2 = sitofp i64 %1 to double
; │└└└└
; │ @ promotion.jl:422 within `+` @ float.jl:409
   %3 = fadd double %2, %0
   ret double %3
; └
}

It's essentially just a template for what functions to possibly try and create given the types that are seen. When it sees Float64 and Int, it knows it should try and create the function that does x+y, and once it knows it's a Float64 plus an Int, it knows it should create the function that converts the Int to a Float64 and then does addition between two Float64s, and that is precisely the generated LLVM IR on this pair of input types.

And that's essentially Julia's secret sauce: since it's always specializing its types on each function, if those functions themselves can infer the output, then the entire function can be inferred and generate optimal code, which is then optimized by the compiler and out comes an efficient function. If types can't be inferred, Julia falls back to a slower "Python" mode (though with optimizations in cases like small unions). Users then get control over this specialization process through multiple dispatch, which is then Julia's core feature since it allows adding new options without any runtime cost.

Any Fallbacks

Note that f(x,y) = x+y is equivalent to f(x::Any,y::Any) = x+y, where Any is the maximal supertype of every Julia type. Thus f(x,y) = x+y is essentially a fallback for all possible input values, telling it what to do in the case that no other dispatches exist. However, note that this dispatch itself is not slow, since it will be specialized on the input types.
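The claim that an untyped definition is really a method on Any can be checked by looking at the method's signature. This is a small sketch; `k` is a hypothetical function name, and `sig` is the field of a Method object that stores its signature.

```julia
# An "untyped" definition.
k(x, y) = x + y

# Its single method has the signature (Any, Any).
m = first(methods(k))
signature_is_any = m.sig == Tuple{typeof(k), Any, Any}

# Calls still get specialized per concrete input types at compile time,
# so each of these runs its own compiled version of the same method.
int_result = k(2, 5)
float_result = k(2.0, 5.0)
rational_result = k(1//2, 1//2)
```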

Ambiguities

The version that is called is the most strict version that is correct. What happens if it's impossible to define "the most strict version"? For example,

ff(x::Float64,y::Number) = 5x + 2y
ff(x::Number,y::Int) = x - y
ff (generic function with 5 methods)

What should it call on ff(2.0,5) now? ff(x::Float64,y::Number) and ff(x::Number,y::Int) are both more strict than ff(x::Number,y::Number), so one of them should be called, but neither is more strict than the other, and thus you will end up with an ambiguity error:

ff(2.0,5)
ERROR: MethodError: ff(::Float64, ::Int64) is ambiguous.

Candidates:
  ff(x::Number, y::Int64)
    @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3
  ff(x::Float64, y::Number)
    @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2

Possible fix, define
  ff(::Float64, ::Int64)
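Following the suggested fix, the ambiguity disappears once a method exists that is more specific than both candidates. The sketch below uses a fresh hypothetical function `amb` so that it stands alone, and the body chosen for the tie-breaking method is an arbitrary assumption made for illustration.

```julia
# Two methods that overlap on (Float64, Int) without either being more specific.
amb(x::Float64, y::Number) = 5x + 2y
amb(x::Number, y::Int) = x - y

# Tie-breaker: strictly more specific than both of the methods above.
# Its body is an arbitrary choice for this example.
amb(x::Float64, y::Int) = 5x - y

tie_broken = amb(2.0, 5)     # uses the (Float64, Int) method
first_only = amb(2.0, 5.0)   # only the (Float64, Number) method applies
second_only = amb(2, 5)      # only the (Number, Int) method applies
```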

Untyped Containers

One way to ruin inference is to use an untyped container. For example, the array constructors use type inference themselves to know what their container type will be. Therefore,

a = [1.0,2.0,3.0]
3-element Vector{Float64}:
 1.0
 2.0
 3.0

uses type inference on its inputs to know that it should be something that holds Float64 values, and thus it is a 1-dimensional array of Float64 values, or Array{Float64,1}. The accesses:

a[1]
1.0

are then inferred, since this is just the function getindex(a::Array{T},i) where T which is a function that will produce something of type T, the element type of the array. However, if we tell Julia to make an array with element type Any:

b = ["1.0",2,2.0]
3-element Vector{Any}:
  "1.0"
 2
 2.0

(here, Julia falls back to Any because it cannot promote the values to the same type), then the best inference can do on the output is to say it could have any type:

function bad_container(a)
  a[2]
end
@code_warntype bad_container(a)
MethodInstance for bad_container(::Vector{Float64})
  from bad_container(a) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(bad_container)
  a::Vector{Float64}
Body::Float64
1 ─ %1 = Base.getindex(a, 2)::Float64
└──      return %1
@code_warntype bad_container(b)
MethodInstance for bad_container(::Vector{Any})
  from bad_container(a) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(bad_container)
  a::Vector{Any}
Body::ANY
1 ─ %1 = Base.getindex(a, 2)::ANY
└──      return %1

This is one common way that type inference can break down. For example, even if the array is all numbers, we can still break inference:

x = Number[1.0,3]
function q(x)
  a = 4
  b = 2
  c = f(x[1],a)
  d = f(b,c)
  f(d,x[2])
end
@code_warntype q(x)
MethodInstance for q(::Vector{Number})
  from q(x) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3
Arguments
  #self#::Core.Const(q)
  x::Vector{Number}
Locals
  d::ANY
  c::ANY
  b::Int64
  a::Int64
Body::ANY
1 ─      (a = 4)
│        (b = 2)
│   %3 = Base.getindex(x, 1)::NUMBER
│        (c = Main.f(%3, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %6 = d::ANY
│   %7 = Base.getindex(x, 2)::NUMBER
│   %8 = Main.f(%6, %7)::ANY
└──      return %8

Here the type inference algorithm quickly gives up and infers to Any, losing all specialization and automatically switching to Python-style runtime type checking.
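The usual remedy is to give the container a concrete element type. The sketch below compares an abstractly typed vector with a concretely typed one; `first_elem` is a hypothetical helper, and converting both entries to Float64 is an assumption that only makes sense when losing the Int is acceptable.

```julia
# Abstract element type: each element could be any kind of Number.
x_abstract = Number[1.0, 3]

# Concrete element type: every element is stored as a Float64.
x_concrete = Float64[1.0, 3]

abstract_eltype_is_concrete = isconcretetype(eltype(x_abstract))
concrete_eltype_is_concrete = isconcretetype(eltype(x_concrete))

# Inference on indexing can only be as precise as the element type.
first_elem(v) = v[1]
abstract_inferred = Base.return_types(first_elem, (Vector{Number},))
concrete_inferred = Base.return_types(first_elem, (Vector{Float64},))
```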

Type definitions

Value types and isbits

In Julia, types which can be fully inferred and which are composed of primitive or isbits types are value types. This means that, inside of an array, their values are the values of the type itself, and not a pointer to the values.

You can check if the type is a value type through isbits:

isbits(1.0)
true

Note that a Julia struct which holds isbits values is isbits as well, if it's fully inferred:

struct MyComplex
  real::Float64
  imag::Float64
end
isbits(MyComplex(1.0,1.0))
true
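The "fully inferred" condition matters: a struct whose fields have abstract types is no longer isbits, because each field has to be stored as a pointer to a value of unknown size. The struct names below are hypothetical and exist only for this comparison; `isbitstype` is the type-level counterpart of isbits.

```julia
# Concrete field types: the layout is exactly two Float64s, 16 bytes.
struct ConcretePair
    a::Float64
    b::Float64
end

# Abstract field types: the fields could hold any Real, so they are boxed.
struct AbstractPair
    a::Real
    b::Real
end

# A type parameter keeps the struct generic while each instance stays concrete.
struct ParamPair{T<:Real}
    a::T
    b::T
end

concrete_is_bits = isbitstype(ConcretePair)
abstract_is_bits = isbitstype(AbstractPair)
param_is_bits = isbitstype(ParamPair{Float64})
concrete_size = sizeof(ConcretePair)
```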

We can see that the compiler knows how to use this efficiently since it knows that what comes out is always Float64:

Base.:+(a::MyComplex,b::MyComplex) = MyComplex(a.real+b.real,a.imag+b.imag)
Base.:+(a::MyComplex,b::Int) = MyComplex(a.real+b,a.imag)
Base.:+(b::Int,a::MyComplex) = MyComplex(a.real+b,a.imag)
g(MyComplex(1.0,1.0),MyComplex(1.0,1.0))
MyComplex(8.0, 2.0)
@code_warntype g(MyComplex(1.0,1.0),MyComplex(1.0,1.0))
MethodInstance for g(::MyComplex, ::MyComplex)
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2
Arguments
  #self#::Core.Const(g)
  x::MyComplex
  y::MyComplex
Locals
  d::MyComplex
  c::MyComplex
  b::Int64
  a::Int64
Body::MyComplex
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::MyComplex
└──      return %5
@code_llvm g(MyComplex(1.0,1.0),MyComplex(1.0,1.0))
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `g`
define void @julia_g_3803([2 x double]* noalias nocapture noundef nonnull sret([2 x double]) align 8 dereferenceable(16) %0, [2 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %1, [2 x double]* nocapture noundef nonnull readonly align 8 dereferenceable(16) %2) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %3 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 0
; ││└
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3 within `+` @ promotion.jl:422 @ float.jl:409
    %unbox = load double, double* %3, align 8
    %4 = fadd double %unbox, 4.000000e+00
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %5 = getelementptr inbounds [2 x double], [2 x double]* %1, i64 0, i64 1
; └└└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `+` @ promotion.jl:422 @ float.jl:409
    %6 = fadd double %4, 2.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:7 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `+` @ float.jl:409
    %unbox2 = load double, double* %5, align 8
    %7 = bitcast [2 x double]* %2 to <2 x double>*
    %8 = load <2 x double>, <2 x double>* %7, align 8
    %9 = insertelement <2 x double> poison, double %6, i64 0
    %10 = insertelement <2 x double> %9, double %unbox2, i64 1
    %11 = fadd <2 x double> %8, %10
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:2 within `+`
; ││┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3 within `MyComplex`
     %12 = bitcast [2 x double]* %0 to <2 x double>*
     store <2 x double> %11, <2 x double>* %12, align 8
     ret void
; └└└
}

Note that the compiled code simply works directly on the double pieces. We can also keep this concrete without pre-specifying that the values always have to be Float64 by using a type parameter.

struct MyParameterizedComplex{T}
  real::T
  imag::T
end
isbits(MyParameterizedComplex(1.0,1.0))
true

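Note that MyParameterizedComplex{T} is a concrete type for every T: the parameterized definition is shorthand for a whole family of types, one per choice of T. The fields are concretely typed, and the type is isbits, whenever T is itself a concrete isbits type such as Float64 or Float32.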
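
The distinction between the family and its members can be checked directly. The following is a small sketch with a hypothetical struct ParamPair that mirrors the definition above; it uses only Base's isconcretetype and isbitstype:

```julia
# Hypothetical parametric struct mirroring the one above.
struct ParamPair{T}
  x::T
  y::T
end

isconcretetype(ParamPair)            # false: the unparameterized name is a family of types
isconcretetype(ParamPair{Float64})   # true
isbitstype(ParamPair{Float64})       # true
isbitstype(ParamPair{Float32})       # true
sizeof(ParamPair{Float32})           # 8: two 4-byte floats stored inline
isbitstype(ParamPair{Real})          # false: the fields would have an abstract type

# The parameter is picked up from the constructor arguments.
typeof(ParamPair(1.0f0, 2.0f0))      # ParamPair{Float32}
```

The ParamPair{Real} case is the one to avoid in performance-sensitive code: it is a legal type, but its fields are no longer concretely typed.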

Base.:+(a::MyParameterizedComplex,b::MyParameterizedComplex) = MyParameterizedComplex(a.real+b.real,a.imag+b.imag)
Base.:+(a::MyParameterizedComplex,b::Int) = MyParameterizedComplex(a.real+b,a.imag)
Base.:+(b::Int,a::MyParameterizedComplex) = MyParameterizedComplex(a.real+b,a.imag)
g(MyParameterizedComplex(1.0,1.0),MyParameterizedComplex(1.0,1.0))
MyParameterizedComplex{Float64}(8.0, 2.0)
@code_warntype g(MyParameterizedComplex(1.0,1.0),MyParameterizedComplex(1.0,1.0))
MethodInstance for g(::MyParameterizedComplex{Float64}, ::MyParameterizedCo
mplex{Float64})
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizin
g.jmd:2
Arguments
  #self#::Core.Const(g)
  x::MyParameterizedComplex{Float64}
  y::MyParameterizedComplex{Float64}
Locals
  d::MyParameterizedComplex{Float64}
  c::MyParameterizedComplex{Float64}
  b::Int64
  a::Int64
Body::MyParameterizedComplex{Float64}
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::MyParameterizedComplex{Float64}
└──      return %5

See that this code automatically works and compiles efficiently for Float32 as well:

@code_warntype g(MyParameterizedComplex(1.0f0,1.0f0),MyParameterizedComplex(1.0f0,1.0f0))
MethodInstance for g(::MyParameterizedComplex{Float32}, ::MyParameterizedCo
mplex{Float32})
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizin
g.jmd:2
Arguments
  #self#::Core.Const(g)
  x::MyParameterizedComplex{Float32}
  y::MyParameterizedComplex{Float32}
Locals
  d::MyParameterizedComplex{Float32}
  c::MyParameterizedComplex{Float32}
  b::Int64
  a::Int64
Body::MyParameterizedComplex{Float32}
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::MyParameterizedComplex{Float32}
└──      return %5
@code_llvm g(MyParameterizedComplex(1.0f0,1.0f0),MyParameterizedComplex(1.0f0,1.0f0))
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `g`
define [2 x float] @julia_g_3815([2 x float]* nocapture noundef nonnull rea
donly align 4 dereferenceable(8) %0, [2 x float]* nocapture noundef nonnull
 readonly align 4 dereferenceable(8) %1) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:3 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %2 = getelementptr inbounds [2 x float], [2 x float]* %0, i64 0, i64 0
; ││└
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:3 within `+` @ promotion.jl:422 @ float.jl:409
    %unbox = load float, float* %2, align 4
    %3 = fadd float %unbox, 4.000000e+00
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:3 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %4 = getelementptr inbounds [2 x float], [2 x float]* %0, i64 0, i64 1
; └└└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:4 within `+` @ promotion.jl:422 @ float.jl:409
    %5 = fadd float %3, 2.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
7 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
; │┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:2 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %6 = getelementptr inbounds [2 x float], [2 x float]* %1, i64 0, i64 0
; ││└
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:2 within `+` @ float.jl:409
    %unbox1 = load float, float* %6, align 4
    %7 = fadd float %unbox1, %5
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:2 within `+`
; ││┌ @ Base.jl:37 within `getproperty`
     %8 = getelementptr inbounds [2 x float], [2 x float]* %1, i64 0, i64 1
; ││└
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:2 within `+` @ float.jl:409
    %unbox2 = load float, float* %4, align 4
    %unbox3 = load float, float* %8, align 4
    %9 = fadd float %unbox2, %unbox3
; ││ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jm
d:2 within `+`
; ││┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.j
md:3 within `MyParameterizedComplex`
     %unbox4.fca.0.insert = insertvalue [2 x float] zeroinitializer, float 
%7, 0
     %unbox4.fca.1.insert = insertvalue [2 x float] %unbox4.fca.0.insert, f
loat %9, 1
     ret [2 x float] %unbox4.fca.1.insert
; └└└
}

It is important to know that if any field of a type lacks concrete type information, then the type cannot be isbits: the size of that field is not known in advance, so the compiled code has to handle it through a pointer instead of storing the value inline. For example:

struct MySlowComplex
  real
  imag
end
isbits(MySlowComplex(1.0,1.0))
false
Base.:+(a::MySlowComplex,b::MySlowComplex) = MySlowComplex(a.real+b.real,a.imag+b.imag)
Base.:+(a::MySlowComplex,b::Int) = MySlowComplex(a.real+b,a.imag)
Base.:+(b::Int,a::MySlowComplex) = MySlowComplex(a.real+b,a.imag)
g(MySlowComplex(1.0,1.0),MySlowComplex(1.0,1.0))
MySlowComplex(8.0, 2.0)
@code_warntype g(MySlowComplex(1.0,1.0),MySlowComplex(1.0,1.0))
MethodInstance for g(::MySlowComplex, ::MySlowComplex)
  from g(x, y) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizin
g.jmd:2
Arguments
  #self#::Core.Const(g)
  x::MySlowComplex
  y::MySlowComplex
Locals
  d::MySlowComplex
  c::MySlowComplex
  b::Int64
  a::Int64
Body::MySlowComplex
1 ─      (a = 4)
│        (b = 2)
│        (c = Main.f(x, a::Core.Const(4)))
│        (d = Main.f(b::Core.Const(2), c))
│   %5 = Main.f(d, y)::MySlowComplex
└──      return %5
@code_llvm g(MySlowComplex(1.0,1.0),MySlowComplex(1.0,1.0))
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `g`
define void @julia_g_3834([2 x {}*]* noalias nocapture noundef nonnull sret
([2 x {}*]) align 8 dereferenceable(16) %0, [2 x {}*]* nocapture noundef no
nnull readonly align 8 dereferenceable(16) %1, [2 x {}*]* nocapture noundef
 nonnull readonly align 8 dereferenceable(16) %2) #0 {
top:
  %gcframe2 = alloca [8 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe2, i
64 0, i64 0
  %3 = bitcast [8 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %3, i8 0, i64 64, i1 true)
  %4 = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe2, i64 0, i64 6
  %5 = bitcast {}** %4 to [2 x {}*]*
  %6 = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe2, i64 0, i64 4
  %7 = bitcast {}** %6 to [2 x {}*]*
  %8 = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe2, i64 0, i64 2
  %9 = bitcast {}** %8 to [2 x {}*]*
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #9
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %10 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %10, align 8
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
   %11 = bitcast [8 x {}*]* %gcframe2 to i64*
   store i64 24, i64* %11, align 16
   %12 = getelementptr inbounds [8 x {}*], [8 x {}*]* %gcframe2, i64 0, i64
 1
   %13 = bitcast {}** %12 to {}***
   %14 = load {}**, {}*** %tls_pgcstack, align 8
   store {}** %14, {}*** %13, align 8
   %15 = bitcast {}*** %tls_pgcstack to {}***
   store {}** %gcframe2.sub, {}*** %15, align 8
   call void @"j_+_3836"([2 x {}*]* noalias nocapture noundef nonnull sret(
[2 x {}*]) %7, [2 x {}*]* nocapture nonnull readonly %1, i64 signext 4)
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
   call void @"j_+_3837"([2 x {}*]* noalias nocapture noundef nonnull sret(
[2 x {}*]) %9, i64 signext 2, [2 x {}*]* nocapture readonly %7)
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
7 within `g`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:2 within `f`
   call void @"j_+_3838"([2 x {}*]* noalias nocapture noundef nonnull sret(
[2 x {}*]) %5, [2 x {}*]* nocapture readonly %9, [2 x {}*]* nocapture nonnu
ll readonly %2)
   %16 = bitcast [2 x {}*]* %0 to i8*
   %17 = bitcast {}** %4 to i8*
   call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 derefer
enceable(16) %16, i8* noundef nonnull align 16 dereferenceable(16) %17, i64
 16, i1 false)
   %18 = load {}*, {}** %12, align 8
   %19 = bitcast {}*** %tls_pgcstack to {}**
   store {}* %18, {}** %19, align 8
   ret void
; └
}
struct MySlowComplex2
  real::AbstractFloat
  imag::AbstractFloat
end
isbits(MySlowComplex2(1.0,1.0))
false
Base.:+(a::MySlowComplex2,b::MySlowComplex2) = MySlowComplex2(a.real+b.real,a.imag+b.imag)
Base.:+(a::MySlowComplex2,b::Int) = MySlowComplex2(a.real+b,a.imag)
Base.:+(b::Int,a::MySlowComplex2) = MySlowComplex2(a.real+b,a.imag)
g(MySlowComplex2(1.0,1.0),MySlowComplex2(1.0,1.0))
MySlowComplex2(8.0, 2.0)
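
One way to spot such fields before benchmarking is to inspect the declared field types. The sketch below is only an illustration: the helper has_concrete_fields and the three structs are hypothetical names mirroring the three cases above, and the helper relies only on Base's fieldtypes and isconcretetype.

```julia
# Hypothetical structs mirroring the untyped, abstractly typed, and concretely typed cases.
struct Untyped
  a
  b
end

struct AbstractlyTyped
  a::AbstractFloat
  b::AbstractFloat
end

struct ConcretelyTyped
  a::Float64
  b::Float64
end

# Hypothetical helper: true when every declared field type is concrete.
has_concrete_fields(T) = all(isconcretetype, fieldtypes(T))

fieldtypes(Untyped)                   # (Any, Any): an untyped field is declared as Any
has_concrete_fields(Untyped)          # false
has_concrete_fields(AbstractlyTyped)  # false
has_concrete_fields(ConcretelyTyped)  # true
```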

Here are the timings:

a = MyComplex(1.0,1.0)
b = MyComplex(2.0,1.0)
@btime g(a,b)
19.063 ns (1 allocation: 32 bytes)
MyComplex(9.0, 2.0)
a = MyParameterizedComplex(1.0,1.0)
b = MyParameterizedComplex(2.0,1.0)
@btime g(a,b)
19.494 ns (1 allocation: 32 bytes)
MyParameterizedComplex{Float64}(9.0, 2.0)
a = MySlowComplex(1.0,1.0)
b = MySlowComplex(2.0,1.0)
@btime g(a,b)
98.400 ns (5 allocations: 96 bytes)
MySlowComplex(9.0, 2.0)
a = MySlowComplex2(1.0,1.0)
b = MySlowComplex2(2.0,1.0)
@btime g(a,b)
729.865 ns (14 allocations: 288 bytes)
MySlowComplex2(9.0, 2.0)

Note on Julia

Note that, because of these properties (type specialization, value types, etc.), the number types, even ones such as Int, Float64, and Complex, are all themselves implemented in pure Julia! Thus even basic pieces can be implemented in Julia with full performance, provided one uses the features correctly.
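
As a small illustration using only Base functions, Complex{Float64} can be inspected just like the user-defined structs above:

```julia
z = Complex(1.0, 2.0)

typeof(z)                  # Complex{Float64}
fieldnames(typeof(z))      # (:re, :im): an ordinary two-field struct
isbitstype(typeof(z))      # true: a value type, like MyComplex
sizeof(typeof(z))          # 16: two Float64 fields stored inline
```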

Note on isbits

Note that a type which is a mutable struct will not be isbits. This means that a mutable struct will be a pointer to a heap-allocated object, unless it is short-lived and the compiler can elide its construction. Also, note that operations on isbits types compile down to bit operations from pure Julia, which means that these types can directly compile to GPU kernels through CUDAnative without modification.
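
The difference between the two kinds of struct can be sketched as follows. The names FrozenPoint and MovablePoint are hypothetical and exist only for this example:

```julia
# Hypothetical immutable and mutable versions of the same layout.
struct FrozenPoint
  x::Float64
  y::Float64
end

mutable struct MovablePoint
  x::Float64
  y::Float64
end

isbitstype(FrozenPoint)    # true: stored by value
isbitstype(MovablePoint)   # false: instances are referenced, not stored inline

# A mutable instance has identity: p and q name the same object.
p = MovablePoint(1.0, 2.0)
q = p
q.x = 10.0
p.x                        # 10.0, the change made through q is visible through p
```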

Function Barriers

Since functions automatically specialize on their input types in Julia, we can use this to our advantage in order to make an inner loop fully inferred. For example, take the code from above but with a loop:

function r(x)
  a = 4
  b = 2
  for i in 1:100
    c = f(x[1],a)
    d = f(b,c)
    a = f(d,x[2])
  end
  a
end
@btime r(x)
4.763 μs (300 allocations: 4.69 KiB)
604.0

Here, the loop variables are not inferred, because the element type of x is abstract, and thus the loop is really slow. However, we can force a function call in the middle so that the callee specializes on the concrete types and the inner loop becomes type stable:

s(x) = _s(x[1],x[2])
function _s(x1,x2)
  a = 4
  b = 2
  for i in 1:100
    c = f(x1,a)
    d = f(b,c)
    a = f(d,x2)
  end
  a
end
@btime s(x)
297.660 ns (1 allocation: 16 bytes)
604.0

Notice that this algorithm still doesn't infer:

@code_warntype s(x)
MethodInstance for s(::Vector{Number})
  from s(x) @ Main ~/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.j
md:2
Arguments
  #self#::Core.Const(s)
  x::Vector{Number}
Body::ANY
1 ─ %1 = Base.getindex(x, 1)::NUMBER
│   %2 = Base.getindex(x, 2)::NUMBER
│   %3 = Main._s(%1, %2)::ANY
└──      return %3

since the output of _s isn't inferred, but while it's in _s it will have specialized on the fact that x[1] is a Float64 while x[2] is an Int, making that inner loop fast. In fact, it will only need to pay one dynamic dispatch, i.e. a multiple dispatch determination that happens at runtime. Notice that whenever functions are inferred, the dispatching is static since the choice of the dispatch is already made and compiled into the LLVM IR.
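
The function barrier pattern can be reproduced in a self-contained form. In this sketch add2 is a hypothetical stand-in for the lecture's f, kernel and barrier are hypothetical names, and x is assumed to be a Vector{Number} holding a Float64 and an Int, matching the types reported above:

```julia
# Hypothetical stand-in for f, so that the block runs on its own.
add2(x, y) = x + y

# The kernel receives the two elements as arguments, so it is compiled
# for their concrete types (here Float64 and Int).
function kernel(x1, x2)
  a = 4
  b = 2
  for i in 1:100
    c = add2(x1, a)
    d = add2(b, c)
    a = add2(d, x2)
  end
  return a
end

# The barrier: the abstractly typed elements are looked up once, and a
# single dynamic dispatch selects the specialized kernel.
barrier(x) = kernel(x[1], x[2])

x = Number[1.0, 3]
eltype(x)        # Number: abstract, so x[1] cannot be inferred from the container
typeof(x[1])     # Float64
barrier(x)       # 604.0
```

Each loop iteration adds 1.0 + 2 + 3 to a, so 100 iterations starting from 4 give 604.0, in agreement with the output above.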

Specialization at Compile Time

Julia code will specialize at compile time if it can prove something about the result. For example:

function fff(x)
  if x isa Int
    y = 2
  else
    y = 4.0
  end
  x + y
end
fff (generic function with 1 method)

You might think this function has a branch, but in reality Julia can determine whether x is an Int or not at compile time, so it will actually compile it away and just turn it into the function x+2 or x+4.0:

@code_llvm fff(5)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `fff`
define i64 @julia_fff_3928(i64 signext %0) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
8 within `fff`
; ┌ @ int.jl:87 within `+`
   %1 = add i64 %0, 2
   ret i64 %1
; └
}
@code_llvm fff(2.0)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `fff`
define double @julia_fff_3930(double %0) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
8 within `fff`
; ┌ @ float.jl:409 within `+`
   %1 = fadd double %0, 4.000000e+00
   ret double %1
; └
}

Thus one does not need to worry about over-optimizing since in the obvious cases the compiler will actually remove all of the extra pieces when it can!
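
For comparison, the same behavior can be written with multiple dispatch instead of an explicit type check. This is only a sketch of an alternative style, with hypothetical names branchy, offset, and dispatchy; it is not a claim that one form compiles better than the other:

```julia
# Same structure as fff above: the branch depends only on the type of x.
function branchy(x)
  if x isa Int
    y = 2
  else
    y = 4.0
  end
  x + y
end

# Dispatch-based version: the choice is expressed through method selection.
offset(::Int) = 2
offset(::Any) = 4.0
dispatchy(x) = x + offset(x)

branchy(5), dispatchy(5)       # (7, 7)
branchy(2.0), dispatchy(2.0)   # (6.0, 6.0)
```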

Global Scope and Optimizations

This discussion shows how Julia's optimizations all apply at function specialization time. Thus calling Julia functions is fast. But what about doing something outside of a function, like directly in a module or in the REPL?

@btime for j in 1:100, i in 1:100
  global A,B,C
  C[i,j] = A[i,j] + B[i,j]
end
727.965 μs (30000 allocations: 468.75 KiB)

This is very slow because the types of A, B, and C cannot be inferred. Why can't they be inferred? Well, at any time in the dynamic REPL scope I can do something like C = "haha now a string!", and thus it cannot specialize on the types currently existing in the REPL (since asynchronous changes could also occur). It therefore falls back to doing a type check at every single function call, which slows it down. The moral of the story: Julia functions are fast, but its global scope is too dynamic to be optimized.
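
The usual remedy is to move the loop into a function and pass the arrays as arguments, so that the loop body is compiled for their concrete types. A minimal sketch, where add_into! is a hypothetical name and the 100x100 Float64 matrices are an assumption matching the loop bounds above:

```julia
A = rand(100, 100)
B = rand(100, 100)
C = zeros(100, 100)

# Inside the function, C, A, and B are arguments with known concrete types,
# so the loop no longer depends on untyped globals.
function add_into!(C, A, B)
  for j in axes(A, 2), i in axes(A, 1)
    C[i, j] = A[i, j] + B[i, j]
  end
  return C
end

add_into!(C, A, B)
C == A + B   # true
```

The globals are still untyped here; what changes is that they are read once at the call site, and everything inside add_into! runs on specialized code.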

Summary

  • Julia is not fast because of its JIT; it's fast because of function specialization and type inference

  • Type stable functions allow inference to fully occur

  • Multiple dispatch works within the function specialization mechanism to create overhead-free compile time controls

  • Julia will specialize generic functions on the concrete types of their arguments

  • Making sure values are concretely typed in inner loops is essential for performance

Overheads of Individual Operations

Now let's dig even a little deeper. Everything the processor does has a cost. A great chart to keep in mind is this classic one. A few things should immediately jump out to you:

  • Simple arithmetic, like floating point addition, is super cheap: ~1 clock cycle, which is a fraction of a nanosecond on a multi-GHz processor.

  • Processors do branch prediction on if statements. If the code goes down the predicted route, the if statement costs ~1-2 clock cycles. If it goes down the wrong route, then it will take ~10-20 clock cycles. This means that predictable branches, like ones with clear patterns or usually the same output, are much cheaper (almost free) than unpredictable branches.

  • Function calls are expensive: 15-60 clock cycles!

  • RAM reads are very expensive, with reads from the caches progressively cheaper the closer the cache is to the core.
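
The branch prediction point can be probed with a small experiment. This is a sketch under stated assumptions: sum_above is a hypothetical function, the data sizes are arbitrary, and the measured times depend on the machine. The compiler may also turn the if into branch-free code, in which case the two timings will be close:

```julia
using Random

# Sum only the entries at or above a threshold; the `if` is the branch of interest.
function sum_above(v, threshold)
  s = zero(eltype(v))
  for x in v
    if x >= threshold
      s += x
    end
  end
  return s
end

Random.seed!(1234)
shuffled = rand(1:256, 1_000_000)   # branch outcome looks random
sorted_v = sort(shuffled)           # branch outcome is false, then always true

# Call once first so that compilation is not included in the timings.
sum_above(shuffled, 128)
sum_above(sorted_v, 128)

t_shuffled = @elapsed sum_above(shuffled, 128)
t_sorted = @elapsed sum_above(sorted_v, 128)
println("shuffled: ", t_shuffled, " s, sorted: ", t_sorted, " s")
```

Both calls compute the same sum over the same values; only the order, and therefore the predictability of the branch, differs.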

Bounds Checking

Let's check the LLVM IR on one of our earlier loops:

function inner_noalloc!(C,A,B)
  for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
end
@code_llvm inner_noalloc!(C,A,B)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `inner_noalloc!`
define nonnull {}* @"japi1_inner_noalloc!_3939"({}* %function, {}** noalias
 nocapture noundef readonly %args, i32 %nargs) #0 {
top:
  %stackargs = alloca {}**, align 8
  store volatile {}** %args, {}*** %stackargs, align 8
  %0 = load {}*, {}** %args, align 8
  %1 = getelementptr inbounds {}*, {}** %args, i64 1
  %2 = load {}*, {}** %1, align 8
  %3 = getelementptr inbounds {}*, {}** %args, i64 2
  %4 = load {}*, {}** %3, align 8
  %5 = bitcast {}* %2 to {}**
  %arraysize_ptr = getelementptr inbounds {}*, {}** %5, i64 3
  %6 = bitcast {}** %arraysize_ptr to i64*
  %arraysize = load i64, i64* %6, align 8
  %arraysize_ptr4 = getelementptr inbounds {}*, {}** %5, i64 4
  %7 = bitcast {}** %arraysize_ptr4 to i64*
  %8 = bitcast {}* %2 to double**
  %9 = bitcast {}* %4 to {}**
  %arraysize_ptr7 = getelementptr inbounds {}*, {}** %9, i64 3
  %10 = bitcast {}** %arraysize_ptr7 to i64*
  %arraysize_ptr12 = getelementptr inbounds {}*, {}** %9, i64 4
  %11 = bitcast {}** %arraysize_ptr12 to i64*
  %12 = bitcast {}* %4 to double**
  %13 = bitcast {}* %0 to {}**
  %arraysize_ptr21 = getelementptr inbounds {}*, {}** %13, i64 3
  %14 = bitcast {}** %arraysize_ptr21 to i64*
  %arraysize_ptr26 = getelementptr inbounds {}*, {}** %13, i64 4
  %15 = bitcast {}** %arraysize_ptr26 to i64*
  %16 = bitcast {}* %0 to double**
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
3 within `inner_noalloc!`
  br label %L2

L2:                                               ; preds = %L27.split.us.s
plit.us.split.us.1, %top
  %indvar = phi i64 [ 0, %top ], [ %indvar.next.1, %L27.split.us.split.us.s
plit.us.1 ]
  %value_phi = phi i64 [ 1, %top ], [ %264, %L27.split.us.split.us.split.us
.1 ]
  %17 = shl nuw nsw i64 %indvar, 3
  %18 = mul i64 %arraysize, %indvar
  %19 = add nsw i64 %value_phi, -1
  %arraysize5 = load i64, i64* %7, align 8
  %inbounds6 = icmp ult i64 %19, %arraysize5
  %20 = mul i64 %arraysize, %19
  %arrayptr41 = load double*, double** %8, align 8
  %arraysize8 = load i64, i64* %10, align 8
  %21 = mul i64 %arraysize8, %19
  %arrayptr1943 = load double*, double** %12, align 8
  %arrayptr1943224 = bitcast double* %arrayptr1943 to i8*
  %arraysize22 = load i64, i64* %14, align 8
  %arraysize27 = load i64, i64* %15, align 8
  %inbounds28 = icmp ult i64 %19, %arraysize27
  %22 = mul i64 %arraysize22, %19
  %arrayptr3345 = load double*, double** %16, align 8
  %arrayptr3345217 = bitcast double* %arrayptr3345 to i8*
  %inbounds6.fr = freeze i1 %inbounds6
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   br i1 %inbounds6.fr, label %L2.split.us, label %oob

L2.split.us:                                      ; preds = %L2
   %arraysize13 = load i64, i64* %11, align 8
   %inbounds14 = icmp ult i64 %19, %arraysize13
   %inbounds14.fr = freeze i1 %inbounds14
   br i1 %inbounds14.fr, label %L2.split.us.split.us, label %L2.split.us.sp
lit

L2.split.us.split.us:                             ; preds = %L2.split.us
   %inbounds28.fr = freeze i1 %inbounds28
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   br i1 %inbounds28.fr, label %L2.split.us.split.us.split.us, label %L2.sp
lit.us.split.us.split

L2.split.us.split.us.split.us:                    ; preds = %L2.split.us.sp
lit.us
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
3 within `inner_noalloc!`
  %smin = call i64 @llvm.smin.i64(i64 %arraysize8, i64 0)
  %23 = sub i64 %arraysize8, %smin
  %smax = call i64 @llvm.smax.i64(i64 %smin, i64 -1)
  %24 = add nsw i64 %smax, 1
  %25 = mul nuw nsw i64 %23, %24
  %umin = call i64 @llvm.umin.i64(i64 %arraysize, i64 %25)
  %smin156 = call i64 @llvm.smin.i64(i64 %arraysize22, i64 0)
  %26 = sub i64 %arraysize22, %smin156
  %smax157 = call i64 @llvm.smax.i64(i64 %smin156, i64 -1)
  %27 = add nsw i64 %smax157, 1
  %28 = mul nuw nsw i64 %26, %27
  %umin158 = call i64 @llvm.umin.i64(i64 %umin, i64 %28)
  %exit.mainloop.at = call i64 @llvm.umin.i64(i64 %umin158, i64 100)
  %.not196 = icmp eq i64 %exit.mainloop.at, 0
  br i1 %.not196, label %main.pseudo.exit, label %ib24.us.us.us.preheader

ib24.us.us.us.preheader:                          ; preds = %L2.split.us.sp
lit.us.split.us
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc!`
  %min.iters.check = icmp ult i64 %exit.mainloop.at, 12
  br i1 %min.iters.check, label %scalar.ph, label %vector.memcheck

vector.memcheck:                                  ; preds = %ib24.us.us.us.
preheader
  %29 = mul i64 %arraysize22, %17
  %uglygep = getelementptr i8, i8* %arrayptr3345217, i64 %29
  %scevgep = getelementptr double, double* %arrayptr3345, i64 %exit.mainloo
p.at
  %scevgep218 = bitcast double* %scevgep to i8*
  %uglygep219 = getelementptr i8, i8* %scevgep218, i64 %29
  %scevgep220 = getelementptr double, double* %arrayptr41, i64 %18
  %scevgep220221 = bitcast double* %scevgep220 to i8*
  %30 = add i64 %exit.mainloop.at, %18
  %scevgep222 = getelementptr double, double* %arrayptr41, i64 %30
  %scevgep222223 = bitcast double* %scevgep222 to i8*
  %31 = mul i64 %arraysize8, %17
  %uglygep225 = getelementptr i8, i8* %arrayptr1943224, i64 %31
  %scevgep226 = getelementptr double, double* %arrayptr1943, i64 %exit.main
loop.at
  %scevgep226227 = bitcast double* %scevgep226 to i8*
  %uglygep228 = getelementptr i8, i8* %scevgep226227, i64 %31
  %bound0 = icmp ult i8* %uglygep, %scevgep222223
  %bound1 = icmp ugt i8* %uglygep219, %scevgep220221
  %found.conflict = and i1 %bound0, %bound1
  %bound0229 = icmp ult i8* %uglygep, %uglygep228
  %bound1230 = icmp ult i8* %uglygep225, %uglygep219
  %found.conflict231 = and i1 %bound0229, %bound1230
  %conflict.rdx = or i1 %found.conflict, %found.conflict231
  br i1 %conflict.rdx, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %vector.memchec
k
  %n.vec = and i64 %exit.mainloop.at, 124
  %ind.end = or i64 %n.vec, 1
  %32 = add nsw i64 %n.vec, -4
  %33 = lshr exact i64 %32, 2
  %34 = add nuw nsw i64 %33, 1
  %xtraiter = and i64 %34, 7
  %35 = icmp ult i64 %32, 28
  br i1 %35, label %middle.block.unr-lcssa, label %vector.ph.new

vector.ph.new:                                    ; preds = %vector.ph
  %unroll_iter = and i64 %34, 9223372036854775800
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %
vector.ph.new
  %index = phi i64 [ 0, %vector.ph.new ], [ %index.next.7, %vector.body ]
  %niter = phi i64 [ 0, %vector.ph.new ], [ %niter.next.7, %vector.body ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %36 = add i64 %20, %index
   %37 = getelementptr inbounds double, double* %arrayptr41, i64 %36
   %38 = bitcast double* %37 to <4 x double>*
   %wide.load = load <4 x double>, <4 x double>* %38, align 8
   %39 = add i64 %21, %index
   %40 = getelementptr inbounds double, double* %arrayptr1943, i64 %39
   %41 = bitcast double* %40 to <4 x double>*
   %wide.load232 = load <4 x double>, <4 x double>* %41, align 8
; └
; ┌ @ float.jl:409 within `+`
   %42 = fadd <4 x double> %wide.load, %wide.load232
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %43 = add i64 %22, %index
   %44 = getelementptr inbounds double, double* %arrayptr3345, i64 %43
   %45 = bitcast double* %44 to <4 x double>*
   store <4 x double> %42, <4 x double>* %45, align 8
   %index.next = or i64 %index, 4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %46 = add i64 %20, %index.next
   %47 = getelementptr inbounds double, double* %arrayptr41, i64 %46
   %48 = bitcast double* %47 to <4 x double>*
   %wide.load.1 = load <4 x double>, <4 x double>* %48, align 8
   %49 = add i64 %21, %index.next
   %50 = getelementptr inbounds double, double* %arrayptr1943, i64 %49
   %51 = bitcast double* %50 to <4 x double>*
   %wide.load232.1 = load <4 x double>, <4 x double>* %51, align 8
; └
; ┌ @ float.jl:409 within `+`
   %52 = fadd <4 x double> %wide.load.1, %wide.load232.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %53 = add i64 %22, %index.next
   %54 = getelementptr inbounds double, double* %arrayptr3345, i64 %53
   %55 = bitcast double* %54 to <4 x double>*
   store <4 x double> %52, <4 x double>* %55, align 8
   %index.next.1 = or i64 %index, 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %56 = add i64 %20, %index.next.1
   %57 = getelementptr inbounds double, double* %arrayptr41, i64 %56
   %58 = bitcast double* %57 to <4 x double>*
   %wide.load.2 = load <4 x double>, <4 x double>* %58, align 8
   %59 = add i64 %21, %index.next.1
   %60 = getelementptr inbounds double, double* %arrayptr1943, i64 %59
   %61 = bitcast double* %60 to <4 x double>*
   %wide.load232.2 = load <4 x double>, <4 x double>* %61, align 8
; └
; ┌ @ float.jl:409 within `+`
   %62 = fadd <4 x double> %wide.load.2, %wide.load232.2
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %63 = add i64 %22, %index.next.1
   %64 = getelementptr inbounds double, double* %arrayptr3345, i64 %63
   %65 = bitcast double* %64 to <4 x double>*
   store <4 x double> %62, <4 x double>* %65, align 8
   %index.next.2 = or i64 %index, 12
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %66 = add i64 %20, %index.next.2
   %67 = getelementptr inbounds double, double* %arrayptr41, i64 %66
   %68 = bitcast double* %67 to <4 x double>*
   %wide.load.3 = load <4 x double>, <4 x double>* %68, align 8
   %69 = add i64 %21, %index.next.2
   %70 = getelementptr inbounds double, double* %arrayptr1943, i64 %69
   %71 = bitcast double* %70 to <4 x double>*
   %wide.load232.3 = load <4 x double>, <4 x double>* %71, align 8
; └
; ┌ @ float.jl:409 within `+`
   %72 = fadd <4 x double> %wide.load.3, %wide.load232.3
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %73 = add i64 %22, %index.next.2
   %74 = getelementptr inbounds double, double* %arrayptr3345, i64 %73
   %75 = bitcast double* %74 to <4 x double>*
   store <4 x double> %72, <4 x double>* %75, align 8
   %index.next.3 = or i64 %index, 16
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %76 = add i64 %20, %index.next.3
   %77 = getelementptr inbounds double, double* %arrayptr41, i64 %76
   %78 = bitcast double* %77 to <4 x double>*
   %wide.load.4 = load <4 x double>, <4 x double>* %78, align 8
   %79 = add i64 %21, %index.next.3
   %80 = getelementptr inbounds double, double* %arrayptr1943, i64 %79
   %81 = bitcast double* %80 to <4 x double>*
   %wide.load232.4 = load <4 x double>, <4 x double>* %81, align 8
; └
; ┌ @ float.jl:409 within `+`
   %82 = fadd <4 x double> %wide.load.4, %wide.load232.4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %83 = add i64 %22, %index.next.3
   %84 = getelementptr inbounds double, double* %arrayptr3345, i64 %83
   %85 = bitcast double* %84 to <4 x double>*
   store <4 x double> %82, <4 x double>* %85, align 8
   %index.next.4 = or i64 %index, 20
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %86 = add i64 %20, %index.next.4
   %87 = getelementptr inbounds double, double* %arrayptr41, i64 %86
   %88 = bitcast double* %87 to <4 x double>*
   %wide.load.5 = load <4 x double>, <4 x double>* %88, align 8
   %89 = add i64 %21, %index.next.4
   %90 = getelementptr inbounds double, double* %arrayptr1943, i64 %89
   %91 = bitcast double* %90 to <4 x double>*
   %wide.load232.5 = load <4 x double>, <4 x double>* %91, align 8
; └
; ┌ @ float.jl:409 within `+`
   %92 = fadd <4 x double> %wide.load.5, %wide.load232.5
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %93 = add i64 %22, %index.next.4
   %94 = getelementptr inbounds double, double* %arrayptr3345, i64 %93
   %95 = bitcast double* %94 to <4 x double>*
   store <4 x double> %92, <4 x double>* %95, align 8
   %index.next.5 = or i64 %index, 24
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %96 = add i64 %20, %index.next.5
   %97 = getelementptr inbounds double, double* %arrayptr41, i64 %96
   %98 = bitcast double* %97 to <4 x double>*
   %wide.load.6 = load <4 x double>, <4 x double>* %98, align 8
   %99 = add i64 %21, %index.next.5
   %100 = getelementptr inbounds double, double* %arrayptr1943, i64 %99
   %101 = bitcast double* %100 to <4 x double>*
   %wide.load232.6 = load <4 x double>, <4 x double>* %101, align 8
; └
; ┌ @ float.jl:409 within `+`
   %102 = fadd <4 x double> %wide.load.6, %wide.load232.6
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %103 = add i64 %22, %index.next.5
   %104 = getelementptr inbounds double, double* %arrayptr3345, i64 %103
   %105 = bitcast double* %104 to <4 x double>*
   store <4 x double> %102, <4 x double>* %105, align 8
   %index.next.6 = or i64 %index, 28
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %106 = add i64 %20, %index.next.6
   %107 = getelementptr inbounds double, double* %arrayptr41, i64 %106
   %108 = bitcast double* %107 to <4 x double>*
   %wide.load.7 = load <4 x double>, <4 x double>* %108, align 8
   %109 = add i64 %21, %index.next.6
   %110 = getelementptr inbounds double, double* %arrayptr1943, i64 %109
   %111 = bitcast double* %110 to <4 x double>*
   %wide.load232.7 = load <4 x double>, <4 x double>* %111, align 8
; └
; ┌ @ float.jl:409 within `+`
   %112 = fadd <4 x double> %wide.load.7, %wide.load232.7
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %113 = add i64 %22, %index.next.6
   %114 = getelementptr inbounds double, double* %arrayptr3345, i64 %113
   %115 = bitcast double* %114 to <4 x double>*
   store <4 x double> %112, <4 x double>* %115, align 8
   %index.next.7 = add nuw i64 %index, 32
   %niter.next.7 = add i64 %niter, 8
   %niter.ncmp.7 = icmp eq i64 %niter.next.7, %unroll_iter
   br i1 %niter.ncmp.7, label %middle.block.unr-lcssa, label %vector.body

middle.block.unr-lcssa:                           ; preds = %vector.body, %vector.ph
   %index.unr = phi i64 [ 0, %vector.ph ], [ %index.next.7, %vector.body ]
   %lcmp.mod.not = icmp eq i64 %xtraiter, 0
   br i1 %lcmp.mod.not, label %middle.block, label %vector.body.epil

vector.body.epil:                                 ; preds = %vector.body.epil, %middle.block.unr-lcssa
   %index.epil = phi i64 [ %index.next.epil, %vector.body.epil ], [ %index.unr, %middle.block.unr-lcssa ]
   %epil.iter = phi i64 [ %epil.iter.next, %vector.body.epil ], [ 0, %middle.block.unr-lcssa ]
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %116 = add i64 %20, %index.epil
   %117 = getelementptr inbounds double, double* %arrayptr41, i64 %116
   %118 = bitcast double* %117 to <4 x double>*
   %wide.load.epil = load <4 x double>, <4 x double>* %118, align 8
   %119 = add i64 %21, %index.epil
   %120 = getelementptr inbounds double, double* %arrayptr1943, i64 %119
   %121 = bitcast double* %120 to <4 x double>*
   %wide.load232.epil = load <4 x double>, <4 x double>* %121, align 8
; └
; ┌ @ float.jl:409 within `+`
   %122 = fadd <4 x double> %wide.load.epil, %wide.load232.epil
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %123 = add i64 %22, %index.epil
   %124 = getelementptr inbounds double, double* %arrayptr3345, i64 %123
   %125 = bitcast double* %124 to <4 x double>*
   store <4 x double> %122, <4 x double>* %125, align 8
   %index.next.epil = add nuw i64 %index.epil, 4
   %epil.iter.next = add i64 %epil.iter, 1
   %epil.iter.cmp.not = icmp eq i64 %epil.iter.next, %xtraiter
   br i1 %epil.iter.cmp.not, label %middle.block, label %vector.body.epil

middle.block:                                     ; preds = %vector.body.epil, %middle.block.unr-lcssa
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
  %cmp.n = icmp eq i64 %exit.mainloop.at, %n.vec
  br i1 %cmp.n, label %main.exit.selector, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %vector.memcheck, %ib24.us.us.us.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %ib24.us.us.us.preheader ], [ 1, %vector.memcheck ]
  br label %ib24.us.us.us

ib24.us.us.us:                                    ; preds = %ib24.us.us.us, %scalar.ph
  %value_phi2.us.us.us = phi i64 [ %134, %ib24.us.us.us ], [ %bc.resume.val, %scalar.ph ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %126 = add nsw i64 %value_phi2.us.us.us, -1
   %127 = add i64 %20, %126
   %128 = getelementptr inbounds double, double* %arrayptr41, i64 %127
   %arrayref.us.us.us = load double, double* %128, align 8
   %129 = add i64 %21, %126
   %130 = getelementptr inbounds double, double* %arrayptr1943, i64 %129
   %arrayref20.us.us.us = load double, double* %130, align 8
; └
; ┌ @ float.jl:409 within `+`
   %131 = fadd double %arrayref.us.us.us, %arrayref20.us.us.us
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %132 = add i64 %22, %126
   %133 = getelementptr inbounds double, double* %arrayptr3345, i64 %132
   store double %131, double* %133, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
; ┌ @ range.jl:901 within `iterate`
   %134 = add nuw i64 %value_phi2.us.us.us, 1
; └
  %.not197 = icmp ult i64 %value_phi2.us.us.us, %exit.mainloop.at
  br i1 %.not197, label %ib24.us.us.us, label %main.exit.selector

main.exit.selector:                               ; preds = %ib24.us.us.us, %middle.block
  %value_phi2.us.us.us.lcssa = phi i64 [ %exit.mainloop.at, %middle.block ], [ %value_phi2.us.us.us, %ib24.us.us.us ]
; ┌ @ range.jl:901 within `iterate`
   %.lcssa = phi i64 [ %ind.end, %middle.block ], [ %134, %ib24.us.us.us ]
; └
  %135 = icmp ult i64 %value_phi2.us.us.us.lcssa, 100
  br i1 %135, label %main.pseudo.exit, label %L27.split.us.split.us.split.us

main.pseudo.exit:                                 ; preds = %main.exit.selector, %L2.split.us.split.us.split.us
  %value_phi2.us.us.us.copy = phi i64 [ 1, %L2.split.us.split.us.split.us ], [ %.lcssa, %main.exit.selector ]
  br label %L5.us.us.us.postloop

L27.split.us.split.us.split.us:                   ; preds = %ib24.us.us.us.postloop, %main.exit.selector
; ┌ @ range.jl:901 within `iterate`
   %136 = add nuw nsw i64 %value_phi, 1
; └
  %indvar.next = or i64 %indvar, 1
  %137 = shl nuw nsw i64 %indvar.next, 3
  %138 = mul i64 %arraysize, %indvar.next
  %arraysize5.1 = load i64, i64* %7, align 8
  %inbounds6.1 = icmp ult i64 %value_phi, %arraysize5.1
  %139 = mul i64 %arraysize, %value_phi
  %arrayptr41.1 = load double*, double** %8, align 8
  %arraysize8.1 = load i64, i64* %10, align 8
  %140 = mul i64 %arraysize8.1, %value_phi
  %arrayptr1943.1 = load double*, double** %12, align 8
  %arrayptr1943224.1 = bitcast double* %arrayptr1943.1 to i8*
  %arraysize22.1 = load i64, i64* %14, align 8
  %arraysize27.1 = load i64, i64* %15, align 8
  %inbounds28.1 = icmp ult i64 %value_phi, %arraysize27.1
  %141 = mul i64 %arraysize22.1, %value_phi
  %arrayptr3345.1 = load double*, double** %16, align 8
  %arrayptr3345217.1 = bitcast double* %arrayptr3345.1 to i8*
  %inbounds6.fr.1 = freeze i1 %inbounds6.1
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   br i1 %inbounds6.fr.1, label %L2.split.us.1, label %oob

L2.split.us.1:                                    ; preds = %L27.split.us.split.us.split.us
   %arraysize13.1 = load i64, i64* %11, align 8
   %inbounds14.1 = icmp ult i64 %value_phi, %arraysize13.1
   %inbounds14.fr.1 = freeze i1 %inbounds14.1
   br i1 %inbounds14.fr.1, label %L2.split.us.split.us.1, label %L2.split.us.split

L2.split.us.split.us.1:                           ; preds = %L2.split.us.1
   %inbounds28.fr.1 = freeze i1 %inbounds28.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   br i1 %inbounds28.fr.1, label %L2.split.us.split.us.split.us.1, label %L2.split.us.split.us.split

L2.split.us.split.us.split.us.1:                  ; preds = %L2.split.us.split.us.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:3 within `inner_noalloc!`
  %smin.1 = call i64 @llvm.smin.i64(i64 %arraysize8.1, i64 0)
  %142 = sub i64 %arraysize8.1, %smin.1
  %smax.1 = call i64 @llvm.smax.i64(i64 %smin.1, i64 -1)
  %143 = add nsw i64 %smax.1, 1
  %144 = mul nuw nsw i64 %142, %143
  %umin.1 = call i64 @llvm.umin.i64(i64 %arraysize, i64 %144)
  %smin156.1 = call i64 @llvm.smin.i64(i64 %arraysize22.1, i64 0)
  %145 = sub i64 %arraysize22.1, %smin156.1
  %smax157.1 = call i64 @llvm.smax.i64(i64 %smin156.1, i64 -1)
  %146 = add nsw i64 %smax157.1, 1
  %147 = mul nuw nsw i64 %145, %146
  %umin158.1 = call i64 @llvm.umin.i64(i64 %umin.1, i64 %147)
  %exit.mainloop.at.1 = call i64 @llvm.umin.i64(i64 %umin158.1, i64 100)
  %.not196.1 = icmp eq i64 %exit.mainloop.at.1, 0
  br i1 %.not196.1, label %main.pseudo.exit.1, label %ib24.us.us.us.preheader.1

ib24.us.us.us.preheader.1:                        ; preds = %L2.split.us.split.us.split.us.1
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
  %min.iters.check.1 = icmp ult i64 %exit.mainloop.at.1, 12
  br i1 %min.iters.check.1, label %scalar.ph.1, label %vector.memcheck.1

vector.memcheck.1:                                ; preds = %ib24.us.us.us.preheader.1
  %148 = mul i64 %arraysize22.1, %137
  %uglygep.1 = getelementptr i8, i8* %arrayptr3345217.1, i64 %148
  %scevgep.1 = getelementptr double, double* %arrayptr3345.1, i64 %exit.mainloop.at.1
  %scevgep218.1 = bitcast double* %scevgep.1 to i8*
  %uglygep219.1 = getelementptr i8, i8* %scevgep218.1, i64 %148
  %scevgep220.1 = getelementptr double, double* %arrayptr41.1, i64 %138
  %scevgep220221.1 = bitcast double* %scevgep220.1 to i8*
  %149 = add i64 %exit.mainloop.at.1, %138
  %scevgep222.1 = getelementptr double, double* %arrayptr41.1, i64 %149
  %scevgep222223.1 = bitcast double* %scevgep222.1 to i8*
  %150 = mul i64 %arraysize8.1, %137
  %uglygep225.1 = getelementptr i8, i8* %arrayptr1943224.1, i64 %150
  %scevgep226.1 = getelementptr double, double* %arrayptr1943.1, i64 %exit.mainloop.at.1
  %scevgep226227.1 = bitcast double* %scevgep226.1 to i8*
  %uglygep228.1 = getelementptr i8, i8* %scevgep226227.1, i64 %150
  %bound0.1 = icmp ult i8* %uglygep.1, %scevgep222223.1
  %bound1.1 = icmp ugt i8* %uglygep219.1, %scevgep220221.1
  %found.conflict.1 = and i1 %bound0.1, %bound1.1
  %bound0229.1 = icmp ult i8* %uglygep.1, %uglygep228.1
  %bound1230.1 = icmp ult i8* %uglygep225.1, %uglygep219.1
  %found.conflict231.1 = and i1 %bound0229.1, %bound1230.1
  %conflict.rdx.1 = or i1 %found.conflict.1, %found.conflict231.1
  br i1 %conflict.rdx.1, label %scalar.ph.1, label %vector.ph.1

vector.ph.1:                                      ; preds = %vector.memcheck.1
  %n.vec.1 = and i64 %exit.mainloop.at.1, 124
  %ind.end.1 = or i64 %n.vec.1, 1
  %151 = add nsw i64 %n.vec.1, -4
  %152 = lshr exact i64 %151, 2
  %153 = add nuw nsw i64 %152, 1
  %xtraiter.1 = and i64 %153, 7
  %154 = icmp ult i64 %151, 28
  br i1 %154, label %middle.block.unr-lcssa.1, label %vector.ph.new.1

vector.ph.new.1:                                  ; preds = %vector.ph.1
  %unroll_iter.1 = and i64 %153, 9223372036854775800
  br label %vector.body.1

vector.body.1:                                    ; preds = %vector.body.1, %vector.ph.new.1
  %index.1 = phi i64 [ 0, %vector.ph.new.1 ], [ %index.next.7.1, %vector.body.1 ]
  %niter.1 = phi i64 [ 0, %vector.ph.new.1 ], [ %niter.next.7.1, %vector.body.1 ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %155 = add i64 %139, %index.1
   %156 = getelementptr inbounds double, double* %arrayptr41.1, i64 %155
   %157 = bitcast double* %156 to <4 x double>*
   %wide.load.1254 = load <4 x double>, <4 x double>* %157, align 8
   %158 = add i64 %140, %index.1
   %159 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %158
   %160 = bitcast double* %159 to <4 x double>*
   %wide.load232.1255 = load <4 x double>, <4 x double>* %160, align 8
; └
; ┌ @ float.jl:409 within `+`
   %161 = fadd <4 x double> %wide.load.1254, %wide.load232.1255
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %162 = add i64 %141, %index.1
   %163 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %162
   %164 = bitcast double* %163 to <4 x double>*
   store <4 x double> %161, <4 x double>* %164, align 8
   %index.next.1256 = or i64 %index.1, 4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %165 = add i64 %139, %index.next.1256
   %166 = getelementptr inbounds double, double* %arrayptr41.1, i64 %165
   %167 = bitcast double* %166 to <4 x double>*
   %wide.load.1.1 = load <4 x double>, <4 x double>* %167, align 8
   %168 = add i64 %140, %index.next.1256
   %169 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %168
   %170 = bitcast double* %169 to <4 x double>*
   %wide.load232.1.1 = load <4 x double>, <4 x double>* %170, align 8
; └
; ┌ @ float.jl:409 within `+`
   %171 = fadd <4 x double> %wide.load.1.1, %wide.load232.1.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %172 = add i64 %141, %index.next.1256
   %173 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %172
   %174 = bitcast double* %173 to <4 x double>*
   store <4 x double> %171, <4 x double>* %174, align 8
   %index.next.1.1 = or i64 %index.1, 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %175 = add i64 %139, %index.next.1.1
   %176 = getelementptr inbounds double, double* %arrayptr41.1, i64 %175
   %177 = bitcast double* %176 to <4 x double>*
   %wide.load.2.1 = load <4 x double>, <4 x double>* %177, align 8
   %178 = add i64 %140, %index.next.1.1
   %179 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %178
   %180 = bitcast double* %179 to <4 x double>*
   %wide.load232.2.1 = load <4 x double>, <4 x double>* %180, align 8
; └
; ┌ @ float.jl:409 within `+`
   %181 = fadd <4 x double> %wide.load.2.1, %wide.load232.2.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %182 = add i64 %141, %index.next.1.1
   %183 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %182
   %184 = bitcast double* %183 to <4 x double>*
   store <4 x double> %181, <4 x double>* %184, align 8
   %index.next.2.1 = or i64 %index.1, 12
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %185 = add i64 %139, %index.next.2.1
   %186 = getelementptr inbounds double, double* %arrayptr41.1, i64 %185
   %187 = bitcast double* %186 to <4 x double>*
   %wide.load.3.1 = load <4 x double>, <4 x double>* %187, align 8
   %188 = add i64 %140, %index.next.2.1
   %189 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %188
   %190 = bitcast double* %189 to <4 x double>*
   %wide.load232.3.1 = load <4 x double>, <4 x double>* %190, align 8
; └
; ┌ @ float.jl:409 within `+`
   %191 = fadd <4 x double> %wide.load.3.1, %wide.load232.3.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %192 = add i64 %141, %index.next.2.1
   %193 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %192
   %194 = bitcast double* %193 to <4 x double>*
   store <4 x double> %191, <4 x double>* %194, align 8
   %index.next.3.1 = or i64 %index.1, 16
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %195 = add i64 %139, %index.next.3.1
   %196 = getelementptr inbounds double, double* %arrayptr41.1, i64 %195
   %197 = bitcast double* %196 to <4 x double>*
   %wide.load.4.1 = load <4 x double>, <4 x double>* %197, align 8
   %198 = add i64 %140, %index.next.3.1
   %199 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %198
   %200 = bitcast double* %199 to <4 x double>*
   %wide.load232.4.1 = load <4 x double>, <4 x double>* %200, align 8
; └
; ┌ @ float.jl:409 within `+`
   %201 = fadd <4 x double> %wide.load.4.1, %wide.load232.4.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %202 = add i64 %141, %index.next.3.1
   %203 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %202
   %204 = bitcast double* %203 to <4 x double>*
   store <4 x double> %201, <4 x double>* %204, align 8
   %index.next.4.1 = or i64 %index.1, 20
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %205 = add i64 %139, %index.next.4.1
   %206 = getelementptr inbounds double, double* %arrayptr41.1, i64 %205
   %207 = bitcast double* %206 to <4 x double>*
   %wide.load.5.1 = load <4 x double>, <4 x double>* %207, align 8
   %208 = add i64 %140, %index.next.4.1
   %209 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %208
   %210 = bitcast double* %209 to <4 x double>*
   %wide.load232.5.1 = load <4 x double>, <4 x double>* %210, align 8
; └
; ┌ @ float.jl:409 within `+`
   %211 = fadd <4 x double> %wide.load.5.1, %wide.load232.5.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %212 = add i64 %141, %index.next.4.1
   %213 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %212
   %214 = bitcast double* %213 to <4 x double>*
   store <4 x double> %211, <4 x double>* %214, align 8
   %index.next.5.1 = or i64 %index.1, 24
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %215 = add i64 %139, %index.next.5.1
   %216 = getelementptr inbounds double, double* %arrayptr41.1, i64 %215
   %217 = bitcast double* %216 to <4 x double>*
   %wide.load.6.1 = load <4 x double>, <4 x double>* %217, align 8
   %218 = add i64 %140, %index.next.5.1
   %219 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %218
   %220 = bitcast double* %219 to <4 x double>*
   %wide.load232.6.1 = load <4 x double>, <4 x double>* %220, align 8
; └
; ┌ @ float.jl:409 within `+`
   %221 = fadd <4 x double> %wide.load.6.1, %wide.load232.6.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %222 = add i64 %141, %index.next.5.1
   %223 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %222
   %224 = bitcast double* %223 to <4 x double>*
   store <4 x double> %221, <4 x double>* %224, align 8
   %index.next.6.1 = or i64 %index.1, 28
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %225 = add i64 %139, %index.next.6.1
   %226 = getelementptr inbounds double, double* %arrayptr41.1, i64 %225
   %227 = bitcast double* %226 to <4 x double>*
   %wide.load.7.1 = load <4 x double>, <4 x double>* %227, align 8
   %228 = add i64 %140, %index.next.6.1
   %229 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %228
   %230 = bitcast double* %229 to <4 x double>*
   %wide.load232.7.1 = load <4 x double>, <4 x double>* %230, align 8
; └
; ┌ @ float.jl:409 within `+`
   %231 = fadd <4 x double> %wide.load.7.1, %wide.load232.7.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %232 = add i64 %141, %index.next.6.1
   %233 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %232
   %234 = bitcast double* %233 to <4 x double>*
   store <4 x double> %231, <4 x double>* %234, align 8
   %index.next.7.1 = add nuw i64 %index.1, 32
   %niter.next.7.1 = add i64 %niter.1, 8
   %niter.ncmp.7.1 = icmp eq i64 %niter.next.7.1, %unroll_iter.1
   br i1 %niter.ncmp.7.1, label %middle.block.unr-lcssa.1, label %vector.body.1

middle.block.unr-lcssa.1:                         ; preds = %vector.body.1, %vector.ph.1
   %index.unr.1 = phi i64 [ 0, %vector.ph.1 ], [ %index.next.7.1, %vector.body.1 ]
   %lcmp.mod.1.not = icmp eq i64 %xtraiter.1, 0
   br i1 %lcmp.mod.1.not, label %middle.block.1, label %vector.body.epil.1

vector.body.epil.1:                               ; preds = %vector.body.epil.1, %middle.block.unr-lcssa.1
   %index.epil.1 = phi i64 [ %index.next.epil.1, %vector.body.epil.1 ], [ %index.unr.1, %middle.block.unr-lcssa.1 ]
   %epil.iter.1 = phi i64 [ %epil.iter.next.1, %vector.body.epil.1 ], [ 0, %middle.block.unr-lcssa.1 ]
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %235 = add i64 %139, %index.epil.1
   %236 = getelementptr inbounds double, double* %arrayptr41.1, i64 %235
   %237 = bitcast double* %236 to <4 x double>*
   %wide.load.epil.1 = load <4 x double>, <4 x double>* %237, align 8
   %238 = add i64 %140, %index.epil.1
   %239 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %238
   %240 = bitcast double* %239 to <4 x double>*
   %wide.load232.epil.1 = load <4 x double>, <4 x double>* %240, align 8
; └
; ┌ @ float.jl:409 within `+`
   %241 = fadd <4 x double> %wide.load.epil.1, %wide.load232.epil.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %242 = add i64 %141, %index.epil.1
   %243 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %242
   %244 = bitcast double* %243 to <4 x double>*
   store <4 x double> %241, <4 x double>* %244, align 8
   %index.next.epil.1 = add nuw i64 %index.epil.1, 4
   %epil.iter.next.1 = add i64 %epil.iter.1, 1
   %epil.iter.cmp.1.not = icmp eq i64 %epil.iter.next.1, %xtraiter.1
   br i1 %epil.iter.cmp.1.not, label %middle.block.1, label %vector.body.epil.1

middle.block.1:                                   ; preds = %vector.body.epil.1, %middle.block.unr-lcssa.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
  %cmp.n.1 = icmp eq i64 %exit.mainloop.at.1, %n.vec.1
  br i1 %cmp.n.1, label %main.exit.selector.1, label %scalar.ph.1

scalar.ph.1:                                      ; preds = %middle.block.1, %vector.memcheck.1, %ib24.us.us.us.preheader.1
  %bc.resume.val.1 = phi i64 [ %ind.end.1, %middle.block.1 ], [ 1, %ib24.us.us.us.preheader.1 ], [ 1, %vector.memcheck.1 ]
  br label %ib24.us.us.us.1

ib24.us.us.us.1:                                  ; preds = %ib24.us.us.us.1, %scalar.ph.1
  %value_phi2.us.us.us.1 = phi i64 [ %253, %ib24.us.us.us.1 ], [ %bc.resume.val.1, %scalar.ph.1 ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %245 = add nsw i64 %value_phi2.us.us.us.1, -1
   %246 = add i64 %139, %245
   %247 = getelementptr inbounds double, double* %arrayptr41.1, i64 %246
   %arrayref.us.us.us.1 = load double, double* %247, align 8
   %248 = add i64 %140, %245
   %249 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %248
   %arrayref20.us.us.us.1 = load double, double* %249, align 8
; └
; ┌ @ float.jl:409 within `+`
   %250 = fadd double %arrayref.us.us.us.1, %arrayref20.us.us.us.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %251 = add i64 %141, %245
   %252 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %251
   store double %250, double* %252, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
; ┌ @ range.jl:901 within `iterate`
   %253 = add nuw i64 %value_phi2.us.us.us.1, 1
; └
  %.not197.1 = icmp ult i64 %value_phi2.us.us.us.1, %exit.mainloop.at.1
  br i1 %.not197.1, label %ib24.us.us.us.1, label %main.exit.selector.1

main.exit.selector.1:                             ; preds = %ib24.us.us.us.1, %middle.block.1
  %value_phi2.us.us.us.lcssa.1 = phi i64 [ %exit.mainloop.at.1, %middle.block.1 ], [ %value_phi2.us.us.us.1, %ib24.us.us.us.1 ]
; ┌ @ range.jl:901 within `iterate`
   %.lcssa.1 = phi i64 [ %ind.end.1, %middle.block.1 ], [ %253, %ib24.us.us.us.1 ]
; └
  %254 = icmp ult i64 %value_phi2.us.us.us.lcssa.1, 100
  br i1 %254, label %main.pseudo.exit.1, label %L27.split.us.split.us.split.us.1

main.pseudo.exit.1:                               ; preds = %main.exit.selector.1, %L2.split.us.split.us.split.us.1
  %value_phi2.us.us.us.copy.1 = phi i64 [ 1, %L2.split.us.split.us.split.us.1 ], [ %.lcssa.1, %main.exit.selector.1 ]
  br label %L5.us.us.us.postloop.1

L5.us.us.us.postloop.1:                           ; preds = %ib24.us.us.us.postloop.1, %main.pseudo.exit.1
  %value_phi2.us.us.us.postloop.1 = phi i64 [ %value_phi2.us.us.us.copy.1, %main.pseudo.exit.1 ], [ %263, %ib24.us.us.us.postloop.1 ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %255 = add i64 %value_phi2.us.us.us.postloop.1, -1
   %inbounds.us.us.us.postloop.1 = icmp ult i64 %255, %arraysize
   br i1 %inbounds.us.us.us.postloop.1, label %ib.us.us.us.postloop.1, label %oob

ib.us.us.us.postloop.1:                           ; preds = %L5.us.us.us.postloop.1
   %inbounds9.us.us.us.postloop.1 = icmp ult i64 %255, %arraysize8.1
   br i1 %inbounds9.us.us.us.postloop.1, label %ib10.us.us.us.postloop.1, label %oob15.split.us

ib10.us.us.us.postloop.1:                         ; preds = %ib.us.us.us.postloop.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %inbounds23.us.us.us.postloop.1 = icmp ult i64 %255, %arraysize22.1
   br i1 %inbounds23.us.us.us.postloop.1, label %ib24.us.us.us.postloop.1, label %oob29.split.us.split.us

ib24.us.us.us.postloop.1:                         ; preds = %ib10.us.us.us.postloop.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %256 = add i64 %139, %255
   %257 = getelementptr inbounds double, double* %arrayptr41.1, i64 %256
   %arrayref.us.us.us.postloop.1 = load double, double* %257, align 8
   %258 = add i64 %140, %255
   %259 = getelementptr inbounds double, double* %arrayptr1943.1, i64 %258
   %arrayref20.us.us.us.postloop.1 = load double, double* %259, align 8
; └
; ┌ @ float.jl:409 within `+`
   %260 = fadd double %arrayref.us.us.us.postloop.1, %arrayref20.us.us.us.postloop.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %261 = add i64 %141, %255
   %262 = getelementptr inbounds double, double* %arrayptr3345.1, i64 %261
   store double %260, double* %262, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc!`
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    %.not.not.us.us.us.postloop.1 = icmp eq i64 %value_phi2.us.us.us.postloop.1, 100
; │└
   %263 = add nuw nsw i64 %value_phi2.us.us.us.postloop.1, 1
; └
  br i1 %.not.not.us.us.us.postloop.1, label %L27.split.us.split.us.split.us.1, label %L5.us.us.us.postloop.1

L27.split.us.split.us.split.us.1:                 ; preds = %ib24.us.us.us.postloop.1, %main.exit.selector.1
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    %.not.1 = icmp eq i64 %136, 100
; │└
   %264 = add nuw nsw i64 %value_phi, 2
; └
  %indvar.next.1 = add nuw nsw i64 %indvar, 2
  br i1 %.not.1, label %L38, label %L2

L2.split.us.split.us.split:                       ; preds = %L2.split.us.split.us.1, %L2.split.us.split.us
  %value_phi.lcssa246 = phi i64 [ %value_phi, %L2.split.us.split.us ], [ %136, %L2.split.us.split.us.1 ]
  %arraysize8.lcssa240 = phi i64 [ %arraysize8, %L2.split.us.split.us ], [ %arraysize8.1, %L2.split.us.split.us.1 ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %inbounds.us.us.not = icmp eq i64 %arraysize, 0
   br i1 %inbounds.us.us.not, label %oob, label %ib.us.us

ib.us.us:                                         ; preds = %L2.split.us.split.us.split
   %inbounds9.us.us.not = icmp eq i64 %arraysize8.lcssa240, 0
   br i1 %inbounds9.us.us.not, label %oob15.split.us, label %oob29.split.us.split.us

oob29.split.us.split.us:                          ; preds = %ib10.us.us.us.postloop, %ib.us.us, %ib10.us.us.us.postloop.1
   %value_phi251 = phi i64 [ %value_phi.lcssa246, %ib.us.us ], [ %value_phi, %ib10.us.us.us.postloop ], [ %136, %ib10.us.us.us.postloop.1 ]
   %.us-phi104 = phi i64 [ 1, %ib.us.us ], [ %value_phi2.us.us.us.postloop, %ib10.us.us.us.postloop ], [ %value_phi2.us.us.us.postloop.1, %ib10.us.us.us.postloop.1 ]
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %errorbox3044 = alloca [2 x i64], align 8
   %errorbox3044.sub = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox3044, i64 0, i64 0
   store i64 %.us-phi104, i64* %errorbox3044.sub, align 8
   %265 = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox3044, i64 0, i64 1
   store i64 %value_phi251, i64* %265, align 8
   call void @ijl_bounds_error_ints({}* %0, i64* nonnull %errorbox3044.sub, i64 2)
   unreachable

L2.split.us.split:                                ; preds = %L2.split.us.1, %L2.split.us
   %value_phi.lcssa245 = phi i64 [ %value_phi, %L2.split.us ], [ %136, %L2.split.us.1 ]
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %inbounds.us.not = icmp eq i64 %arraysize, 0
   br i1 %inbounds.us.not, label %oob, label %oob15.split.us

oob15.split.us:                                   ; preds = %ib.us.us.us.postloop, %L2.split.us.split, %ib.us.us, %ib.us.us.us.postloop.1
   %value_phi252 = phi i64 [ %value_phi.lcssa246, %ib.us.us ], [ %value_phi.lcssa245, %L2.split.us.split ], [ %value_phi, %ib.us.us.us.postloop ], [ %136, %ib.us.us.us.postloop.1 ]
   %.us-phi69 = phi i64 [ 1, %ib.us.us ], [ 1, %L2.split.us.split ], [ %value_phi2.us.us.us.postloop, %ib.us.us.us.postloop ], [ %value_phi2.us.us.us.postloop.1, %ib.us.us.us.postloop.1 ]
   %errorbox1642 = alloca [2 x i64], align 8
   %errorbox1642.sub = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox1642, i64 0, i64 0
   store i64 %.us-phi69, i64* %errorbox1642.sub, align 8
   %266 = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox1642, i64 0, i64 1
   store i64 %value_phi252, i64* %266, align 8
   call void @ijl_bounds_error_ints({}* %4, i64* nonnull %errorbox1642.sub, i64 2)
   unreachable

L38:                                              ; preds = %L27.split.us.s
plit.us.split.us.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc!`
  ret {}* inttoptr (i64 140125612802056 to {}*)

oob:                                              ; preds = %L5.us.us.us.po
stloop, %L2.split.us.split, %L2.split.us.split.us.split, %L5.us.us.us.postl
oop.1, %L27.split.us.split.us.split.us, %L2
  %value_phi253 = phi i64 [ %value_phi.lcssa246, %L2.split.us.split.us.spli
t ], [ %value_phi.lcssa245, %L2.split.us.split ], [ %value_phi, %L5.us.us.u
s.postloop ], [ %136, %L5.us.us.us.postloop.1 ], [ %value_phi, %L2 ], [ %13
6, %L27.split.us.split.us.split.us ]
  %.us-phi52 = phi i64 [ 1, %L2.split.us.split.us.split ], [ 1, %L2.split.u
s.split ], [ %value_phi2.us.us.us.postloop, %L5.us.us.us.postloop ], [ %val
ue_phi2.us.us.us.postloop.1, %L5.us.us.us.postloop.1 ], [ 1, %L27.split.us.
split.us.split.us ], [ 1, %L2 ]
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %errorbox40 = alloca [2 x i64], align 8
   %errorbox40.sub = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox
40, i64 0, i64 0
   store i64 %.us-phi52, i64* %errorbox40.sub, align 8
   %267 = getelementptr inbounds [2 x i64], [2 x i64]* %errorbox40, i64 0, 
i64 1
   store i64 %value_phi253, i64* %267, align 8
   call void @ijl_bounds_error_ints({}* %2, i64* nonnull %errorbox40.sub, i
64 2)
   unreachable

L5.us.us.us.postloop:                             ; preds = %ib24.us.us.us.
postloop, %main.pseudo.exit
   %value_phi2.us.us.us.postloop = phi i64 [ %value_phi2.us.us.us.copy, %ma
in.pseudo.exit ], [ %276, %ib24.us.us.us.postloop ]
   %268 = add i64 %value_phi2.us.us.us.postloop, -1
   %inbounds.us.us.us.postloop = icmp ult i64 %268, %arraysize
   br i1 %inbounds.us.us.us.postloop, label %ib.us.us.us.postloop, label %o
ob

ib.us.us.us.postloop:                             ; preds = %L5.us.us.us.po
stloop
   %inbounds9.us.us.us.postloop = icmp ult i64 %268, %arraysize8
   br i1 %inbounds9.us.us.us.postloop, label %ib10.us.us.us.postloop, label
 %oob15.split.us

ib10.us.us.us.postloop:                           ; preds = %ib.us.us.us.po
stloop
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %inbounds23.us.us.us.postloop = icmp ult i64 %268, %arraysize22
   br i1 %inbounds23.us.us.us.postloop, label %ib24.us.us.us.postloop, labe
l %oob29.split.us.split.us

ib24.us.us.us.postloop:                           ; preds = %ib10.us.us.us.
postloop
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc!`
; ┌ @ essentials.jl:14 within `getindex`
   %269 = add i64 %20, %268
   %270 = getelementptr inbounds double, double* %arrayptr41, i64 %269
   %arrayref.us.us.us.postloop = load double, double* %270, align 8
   %271 = add i64 %21, %268
   %272 = getelementptr inbounds double, double* %arrayptr1943, i64 %271
   %arrayref20.us.us.us.postloop = load double, double* %272, align 8
; └
; ┌ @ float.jl:409 within `+`
   %273 = fadd double %arrayref.us.us.us.postloop, %arrayref20.us.us.us.pos
tloop
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc!`
; ┌ @ array.jl:1024 within `setindex!`
   %274 = add i64 %22, %268
   %275 = getelementptr inbounds double, double* %arrayptr3345, i64 %274
   store double %273, double* %275, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc!`
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    %.not.not.us.us.us.postloop = icmp eq i64 %value_phi2.us.us.us.postloop
, 100
; │└
   %276 = add nuw nsw i64 %value_phi2.us.us.us.postloop, 1
; └
  br i1 %.not.not.us.us.us.postloop, label %L27.split.us.split.us.split.us,
 label %L5.us.us.us.postloop
}

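Notice the icmp ult comparisons against %arraysize and the branches into the oob blocks that call ijl_bounds_error_ints: this is bounds checking. (The getelementptr inbounds instructions are only the address computations; the inbounds keyword there is an LLVM annotation about the pointer arithmetic, not a runtime check.) Julia, like most high-level languages, enables bounds checking by default so that the user cannot index outside of an array. Indexing outside of an array is dangerous: writing to memory beyond the end of the array can silently corrupt unrelated data or crash the program with a segfault. Thus Julia throws an error: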

A[101,1]
ERROR: BoundsError: attempt to access 100×100 Matrix{Float64} at index [101, 1]
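
As a small sketch of what this default buys you (the variable names caught, inside, and outside are illustrative and not part of the lecture code), the error is an ordinary exception that can be caught, and Base's checkbounds can answer the same question without throwing:

```julia
A = rand(100, 100)

# Out-of-bounds indexing raises a BoundsError instead of touching stray memory.
caught = try
    A[101, 1]
    false                # not reached: the line above throws
catch err
    err isa BoundsError  # true when the failure was the bounds check
end

# checkbounds(Bool, A, i, j) reports whether an index is valid without throwing.
inside  = checkbounds(Bool, A, 100, 1)
outside = checkbounds(Bool, A, 101, 1)
```

This is the test that the compiled code performs on every indexing operation unless it is told otherwise.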

In tight inner loops, we can remove this bounds checking process using the @inbounds macro:

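function inner_noalloc_ib!(C,A,B)
  # @inbounds removes the bounds checks for every index in the loop body,
  # so the caller must guarantee that C, A, and B are at least 100×100.
  @inbounds for j in 1:100, i in 1:100
    val = A[i,j] + B[i,j]
    C[i,j] = val[1]
  end
end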
@btime inner_noalloc!(C,A,B)
2.424 μs (0 allocations: 0 bytes)
@btime inner_noalloc_ib!(C,A,B)
2.341 μs (0 allocations: 0 bytes)
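
Because @inbounds turns an out-of-range index into undefined behavior rather than an error, one common pattern is to validate the shapes once, outside the loop, and only then drop the per-access checks. The following is a minimal sketch of that idea; inner_noalloc_safe! is a hypothetical name, not a function from this lecture, and it assumes the three arrays are meant to share the same axes:

```julia
function inner_noalloc_safe!(C, A, B)
    # One up-front check stands in for the per-access checks that @inbounds removes.
    axes(A) == axes(B) == axes(C) ||
        throw(DimensionMismatch("C, A, and B must have the same axes"))
    # Loop over the arrays' own axes instead of a hard-coded 1:100.
    @inbounds for j in axes(A, 2), i in axes(A, 1)
        C[i, j] = A[i, j] + B[i, j]
    end
    return C
end

A = rand(100, 100); B = rand(100, 100); C = similar(A)
inner_noalloc_safe!(C, A, B)
```

The check costs a handful of integer comparisons per call, which is negligible next to the 10,000 additions in the loop, so the loop body keeps the benefit of @inbounds without relying on the caller to pass arrays of the right size.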

SIMD

Now let's inspect the LLVM IR again:

@code_llvm inner_noalloc_ib!(C,A,B)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `inner_noalloc_ib!`
define nonnull {}* @"japi1_inner_noalloc_ib!_3975"({}* %function, {}** noal
ias nocapture noundef readonly %args, i32 %nargs) #0 {
top:
  %stackargs = alloca {}**, align 8
  store volatile {}** %args, {}*** %stackargs, align 8
  %0 = load {}*, {}** %args, align 8
  %1 = getelementptr inbounds {}*, {}** %args, i64 1
  %2 = load {}*, {}** %1, align 8
  %3 = getelementptr inbounds {}*, {}** %args, i64 2
  %4 = load {}*, {}** %3, align 8
  %5 = bitcast {}* %2 to {}**
  %arraysize_ptr = getelementptr inbounds {}*, {}** %5, i64 3
  %6 = bitcast {}** %arraysize_ptr to i64*
  %arraysize = load i64, i64* %6, align 8
  %7 = bitcast {}* %2 to double**
  %arrayptr21 = load double*, double** %7, align 8
  %8 = bitcast {}* %4 to {}**
  %arraysize_ptr4 = getelementptr inbounds {}*, {}** %8, i64 3
  %9 = bitcast {}** %arraysize_ptr4 to i64*
  %arraysize5 = load i64, i64* %9, align 8
  %10 = bitcast {}* %4 to double**
  %arrayptr822 = load double*, double** %10, align 8
  %11 = bitcast {}* %0 to {}**
  %arraysize_ptr10 = getelementptr inbounds {}*, {}** %11, i64 3
  %12 = bitcast {}** %arraysize_ptr10 to i64*
  %arraysize11 = load i64, i64* %12, align 8
  %13 = bitcast {}* %0 to double**
  %arrayptr1423 = load double*, double** %13, align 8
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
3 within `inner_noalloc_ib!`
  br label %L2

L2:                                               ; preds = %L25, %top
  %indvar = phi i64 [ %indvar.next, %L25 ], [ 0, %top ]
  %value_phi = phi i64 [ %671, %L25 ], [ 1, %top ]
  %14 = add nsw i64 %value_phi, -1
  %15 = mul i64 %arraysize, %14
  %16 = mul i64 %arraysize5, %14
  %17 = mul i64 %arraysize11, %14
  %18 = mul i64 %arraysize5, %indvar
  %19 = add i64 %18, 100
  %scevgep34 = getelementptr double, double* %arrayptr822, i64 %19
  %scevgep32 = getelementptr double, double* %arrayptr822, i64 %18
  %20 = mul i64 %arraysize, %indvar
  %21 = add i64 %20, 100
  %scevgep30 = getelementptr double, double* %arrayptr21, i64 %21
  %scevgep28 = getelementptr double, double* %arrayptr21, i64 %20
  %22 = mul i64 %arraysize11, %indvar
  %23 = add i64 %22, 100
  %scevgep26 = getelementptr double, double* %arrayptr1423, i64 %23
  %scevgep = getelementptr double, double* %arrayptr1423, i64 %22
  %bound0 = icmp ult double* %scevgep, %scevgep30
  %bound1 = icmp ult double* %scevgep28, %scevgep26
  %found.conflict = and i1 %bound0, %bound1
  %bound036 = icmp ult double* %scevgep, %scevgep34
  %bound137 = icmp ult double* %scevgep32, %scevgep26
  %found.conflict38 = and i1 %bound036, %bound137
  %conflict.rdx = or i1 %found.conflict, %found.conflict38
  br i1 %conflict.rdx, label %L5, label %vector.body.preheader

vector.body.preheader:                            ; preds = %L2
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %24 = getelementptr inbounds double, double* %arrayptr21, i64 %15
   %25 = bitcast double* %24 to <4 x double>*
   %wide.load = load <4 x double>, <4 x double>* %25, align 8
   %26 = getelementptr inbounds double, double* %arrayptr822, i64 %16
   %27 = bitcast double* %26 to <4 x double>*
   %wide.load39 = load <4 x double>, <4 x double>* %27, align 8
; └
; ┌ @ float.jl:409 within `+`
   %28 = fadd <4 x double> %wide.load, %wide.load39
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %29 = getelementptr inbounds double, double* %arrayptr1423, i64 %17
   %30 = bitcast double* %29 to <4 x double>*
   store <4 x double> %28, <4 x double>* %30, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %31 = add i64 %15, 4
   %32 = getelementptr inbounds double, double* %arrayptr21, i64 %31
   %33 = bitcast double* %32 to <4 x double>*
   %wide.load.1 = load <4 x double>, <4 x double>* %33, align 8
   %34 = add i64 %16, 4
   %35 = getelementptr inbounds double, double* %arrayptr822, i64 %34
   %36 = bitcast double* %35 to <4 x double>*
   %wide.load39.1 = load <4 x double>, <4 x double>* %36, align 8
; └
; ┌ @ float.jl:409 within `+`
   %37 = fadd <4 x double> %wide.load.1, %wide.load39.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %38 = add i64 %17, 4
   %39 = getelementptr inbounds double, double* %arrayptr1423, i64 %38
   %40 = bitcast double* %39 to <4 x double>*
   store <4 x double> %37, <4 x double>* %40, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %41 = add i64 %15, 8
   %42 = getelementptr inbounds double, double* %arrayptr21, i64 %41
   %43 = bitcast double* %42 to <4 x double>*
   %wide.load.2 = load <4 x double>, <4 x double>* %43, align 8
   %44 = add i64 %16, 8
   %45 = getelementptr inbounds double, double* %arrayptr822, i64 %44
   %46 = bitcast double* %45 to <4 x double>*
   %wide.load39.2 = load <4 x double>, <4 x double>* %46, align 8
; └
; ┌ @ float.jl:409 within `+`
   %47 = fadd <4 x double> %wide.load.2, %wide.load39.2
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %48 = add i64 %17, 8
   %49 = getelementptr inbounds double, double* %arrayptr1423, i64 %48
   %50 = bitcast double* %49 to <4 x double>*
   store <4 x double> %47, <4 x double>* %50, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %51 = add i64 %15, 12
   %52 = getelementptr inbounds double, double* %arrayptr21, i64 %51
   %53 = bitcast double* %52 to <4 x double>*
   %wide.load.3 = load <4 x double>, <4 x double>* %53, align 8
   %54 = add i64 %16, 12
   %55 = getelementptr inbounds double, double* %arrayptr822, i64 %54
   %56 = bitcast double* %55 to <4 x double>*
   %wide.load39.3 = load <4 x double>, <4 x double>* %56, align 8
; └
; ┌ @ float.jl:409 within `+`
   %57 = fadd <4 x double> %wide.load.3, %wide.load39.3
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %58 = add i64 %17, 12
   %59 = getelementptr inbounds double, double* %arrayptr1423, i64 %58
   %60 = bitcast double* %59 to <4 x double>*
   store <4 x double> %57, <4 x double>* %60, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %61 = add i64 %15, 16
   %62 = getelementptr inbounds double, double* %arrayptr21, i64 %61
   %63 = bitcast double* %62 to <4 x double>*
   %wide.load.4 = load <4 x double>, <4 x double>* %63, align 8
   %64 = add i64 %16, 16
   %65 = getelementptr inbounds double, double* %arrayptr822, i64 %64
   %66 = bitcast double* %65 to <4 x double>*
   %wide.load39.4 = load <4 x double>, <4 x double>* %66, align 8
; └
; ┌ @ float.jl:409 within `+`
   %67 = fadd <4 x double> %wide.load.4, %wide.load39.4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %68 = add i64 %17, 16
   %69 = getelementptr inbounds double, double* %arrayptr1423, i64 %68
   %70 = bitcast double* %69 to <4 x double>*
   store <4 x double> %67, <4 x double>* %70, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %71 = add i64 %15, 20
   %72 = getelementptr inbounds double, double* %arrayptr21, i64 %71
   %73 = bitcast double* %72 to <4 x double>*
   %wide.load.5 = load <4 x double>, <4 x double>* %73, align 8
   %74 = add i64 %16, 20
   %75 = getelementptr inbounds double, double* %arrayptr822, i64 %74
   %76 = bitcast double* %75 to <4 x double>*
   %wide.load39.5 = load <4 x double>, <4 x double>* %76, align 8
; └
; ┌ @ float.jl:409 within `+`
   %77 = fadd <4 x double> %wide.load.5, %wide.load39.5
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %78 = add i64 %17, 20
   %79 = getelementptr inbounds double, double* %arrayptr1423, i64 %78
   %80 = bitcast double* %79 to <4 x double>*
   store <4 x double> %77, <4 x double>* %80, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %81 = add i64 %15, 24
   %82 = getelementptr inbounds double, double* %arrayptr21, i64 %81
   %83 = bitcast double* %82 to <4 x double>*
   %wide.load.6 = load <4 x double>, <4 x double>* %83, align 8
   %84 = add i64 %16, 24
   %85 = getelementptr inbounds double, double* %arrayptr822, i64 %84
   %86 = bitcast double* %85 to <4 x double>*
   %wide.load39.6 = load <4 x double>, <4 x double>* %86, align 8
; └
; ┌ @ float.jl:409 within `+`
   %87 = fadd <4 x double> %wide.load.6, %wide.load39.6
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %88 = add i64 %17, 24
   %89 = getelementptr inbounds double, double* %arrayptr1423, i64 %88
   %90 = bitcast double* %89 to <4 x double>*
   store <4 x double> %87, <4 x double>* %90, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %91 = add i64 %15, 28
   %92 = getelementptr inbounds double, double* %arrayptr21, i64 %91
   %93 = bitcast double* %92 to <4 x double>*
   %wide.load.7 = load <4 x double>, <4 x double>* %93, align 8
   %94 = add i64 %16, 28
   %95 = getelementptr inbounds double, double* %arrayptr822, i64 %94
   %96 = bitcast double* %95 to <4 x double>*
   %wide.load39.7 = load <4 x double>, <4 x double>* %96, align 8
; └
; ┌ @ float.jl:409 within `+`
   %97 = fadd <4 x double> %wide.load.7, %wide.load39.7
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %98 = add i64 %17, 28
   %99 = getelementptr inbounds double, double* %arrayptr1423, i64 %98
   %100 = bitcast double* %99 to <4 x double>*
   store <4 x double> %97, <4 x double>* %100, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %101 = add i64 %15, 32
   %102 = getelementptr inbounds double, double* %arrayptr21, i64 %101
   %103 = bitcast double* %102 to <4 x double>*
   %wide.load.8 = load <4 x double>, <4 x double>* %103, align 8
   %104 = add i64 %16, 32
   %105 = getelementptr inbounds double, double* %arrayptr822, i64 %104
   %106 = bitcast double* %105 to <4 x double>*
   %wide.load39.8 = load <4 x double>, <4 x double>* %106, align 8
; └
; ┌ @ float.jl:409 within `+`
   %107 = fadd <4 x double> %wide.load.8, %wide.load39.8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %108 = add i64 %17, 32
   %109 = getelementptr inbounds double, double* %arrayptr1423, i64 %108
   %110 = bitcast double* %109 to <4 x double>*
   store <4 x double> %107, <4 x double>* %110, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %111 = add i64 %15, 36
   %112 = getelementptr inbounds double, double* %arrayptr21, i64 %111
   %113 = bitcast double* %112 to <4 x double>*
   %wide.load.9 = load <4 x double>, <4 x double>* %113, align 8
   %114 = add i64 %16, 36
   %115 = getelementptr inbounds double, double* %arrayptr822, i64 %114
   %116 = bitcast double* %115 to <4 x double>*
   %wide.load39.9 = load <4 x double>, <4 x double>* %116, align 8
; └
; ┌ @ float.jl:409 within `+`
   %117 = fadd <4 x double> %wide.load.9, %wide.load39.9
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %118 = add i64 %17, 36
   %119 = getelementptr inbounds double, double* %arrayptr1423, i64 %118
   %120 = bitcast double* %119 to <4 x double>*
   store <4 x double> %117, <4 x double>* %120, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %121 = add i64 %15, 40
   %122 = getelementptr inbounds double, double* %arrayptr21, i64 %121
   %123 = bitcast double* %122 to <4 x double>*
   %wide.load.10 = load <4 x double>, <4 x double>* %123, align 8
   %124 = add i64 %16, 40
   %125 = getelementptr inbounds double, double* %arrayptr822, i64 %124
   %126 = bitcast double* %125 to <4 x double>*
   %wide.load39.10 = load <4 x double>, <4 x double>* %126, align 8
; └
; ┌ @ float.jl:409 within `+`
   %127 = fadd <4 x double> %wide.load.10, %wide.load39.10
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %128 = add i64 %17, 40
   %129 = getelementptr inbounds double, double* %arrayptr1423, i64 %128
   %130 = bitcast double* %129 to <4 x double>*
   store <4 x double> %127, <4 x double>* %130, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %131 = add i64 %15, 44
   %132 = getelementptr inbounds double, double* %arrayptr21, i64 %131
   %133 = bitcast double* %132 to <4 x double>*
   %wide.load.11 = load <4 x double>, <4 x double>* %133, align 8
   %134 = add i64 %16, 44
   %135 = getelementptr inbounds double, double* %arrayptr822, i64 %134
   %136 = bitcast double* %135 to <4 x double>*
   %wide.load39.11 = load <4 x double>, <4 x double>* %136, align 8
; └
; ┌ @ float.jl:409 within `+`
   %137 = fadd <4 x double> %wide.load.11, %wide.load39.11
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %138 = add i64 %17, 44
   %139 = getelementptr inbounds double, double* %arrayptr1423, i64 %138
   %140 = bitcast double* %139 to <4 x double>*
   store <4 x double> %137, <4 x double>* %140, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %141 = add i64 %15, 48
   %142 = getelementptr inbounds double, double* %arrayptr21, i64 %141
   %143 = bitcast double* %142 to <4 x double>*
   %wide.load.12 = load <4 x double>, <4 x double>* %143, align 8
   %144 = add i64 %16, 48
   %145 = getelementptr inbounds double, double* %arrayptr822, i64 %144
   %146 = bitcast double* %145 to <4 x double>*
   %wide.load39.12 = load <4 x double>, <4 x double>* %146, align 8
; └
; ┌ @ float.jl:409 within `+`
   %147 = fadd <4 x double> %wide.load.12, %wide.load39.12
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %148 = add i64 %17, 48
   %149 = getelementptr inbounds double, double* %arrayptr1423, i64 %148
   %150 = bitcast double* %149 to <4 x double>*
   store <4 x double> %147, <4 x double>* %150, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %151 = add i64 %15, 52
   %152 = getelementptr inbounds double, double* %arrayptr21, i64 %151
   %153 = bitcast double* %152 to <4 x double>*
   %wide.load.13 = load <4 x double>, <4 x double>* %153, align 8
   %154 = add i64 %16, 52
   %155 = getelementptr inbounds double, double* %arrayptr822, i64 %154
   %156 = bitcast double* %155 to <4 x double>*
   %wide.load39.13 = load <4 x double>, <4 x double>* %156, align 8
; └
; ┌ @ float.jl:409 within `+`
   %157 = fadd <4 x double> %wide.load.13, %wide.load39.13
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %158 = add i64 %17, 52
   %159 = getelementptr inbounds double, double* %arrayptr1423, i64 %158
   %160 = bitcast double* %159 to <4 x double>*
   store <4 x double> %157, <4 x double>* %160, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %161 = add i64 %15, 56
   %162 = getelementptr inbounds double, double* %arrayptr21, i64 %161
   %163 = bitcast double* %162 to <4 x double>*
   %wide.load.14 = load <4 x double>, <4 x double>* %163, align 8
   %164 = add i64 %16, 56
   %165 = getelementptr inbounds double, double* %arrayptr822, i64 %164
   %166 = bitcast double* %165 to <4 x double>*
   %wide.load39.14 = load <4 x double>, <4 x double>* %166, align 8
; └
; ┌ @ float.jl:409 within `+`
   %167 = fadd <4 x double> %wide.load.14, %wide.load39.14
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %168 = add i64 %17, 56
   %169 = getelementptr inbounds double, double* %arrayptr1423, i64 %168
   %170 = bitcast double* %169 to <4 x double>*
   store <4 x double> %167, <4 x double>* %170, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %171 = add i64 %15, 60
   %172 = getelementptr inbounds double, double* %arrayptr21, i64 %171
   %173 = bitcast double* %172 to <4 x double>*
   %wide.load.15 = load <4 x double>, <4 x double>* %173, align 8
   %174 = add i64 %16, 60
   %175 = getelementptr inbounds double, double* %arrayptr822, i64 %174
   %176 = bitcast double* %175 to <4 x double>*
   %wide.load39.15 = load <4 x double>, <4 x double>* %176, align 8
; └
; ┌ @ float.jl:409 within `+`
   %177 = fadd <4 x double> %wide.load.15, %wide.load39.15
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %178 = add i64 %17, 60
   %179 = getelementptr inbounds double, double* %arrayptr1423, i64 %178
   %180 = bitcast double* %179 to <4 x double>*
   store <4 x double> %177, <4 x double>* %180, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %181 = add i64 %15, 64
   %182 = getelementptr inbounds double, double* %arrayptr21, i64 %181
   %183 = bitcast double* %182 to <4 x double>*
   %wide.load.16 = load <4 x double>, <4 x double>* %183, align 8
   %184 = add i64 %16, 64
   %185 = getelementptr inbounds double, double* %arrayptr822, i64 %184
   %186 = bitcast double* %185 to <4 x double>*
   %wide.load39.16 = load <4 x double>, <4 x double>* %186, align 8
; └
; ┌ @ float.jl:409 within `+`
   %187 = fadd <4 x double> %wide.load.16, %wide.load39.16
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %188 = add i64 %17, 64
   %189 = getelementptr inbounds double, double* %arrayptr1423, i64 %188
   %190 = bitcast double* %189 to <4 x double>*
   store <4 x double> %187, <4 x double>* %190, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %191 = add i64 %15, 68
   %192 = getelementptr inbounds double, double* %arrayptr21, i64 %191
   %193 = bitcast double* %192 to <4 x double>*
   %wide.load.17 = load <4 x double>, <4 x double>* %193, align 8
   %194 = add i64 %16, 68
   %195 = getelementptr inbounds double, double* %arrayptr822, i64 %194
   %196 = bitcast double* %195 to <4 x double>*
   %wide.load39.17 = load <4 x double>, <4 x double>* %196, align 8
; └
; ┌ @ float.jl:409 within `+`
   %197 = fadd <4 x double> %wide.load.17, %wide.load39.17
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %198 = add i64 %17, 68
   %199 = getelementptr inbounds double, double* %arrayptr1423, i64 %198
   %200 = bitcast double* %199 to <4 x double>*
   store <4 x double> %197, <4 x double>* %200, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %201 = add i64 %15, 72
   %202 = getelementptr inbounds double, double* %arrayptr21, i64 %201
   %203 = bitcast double* %202 to <4 x double>*
   %wide.load.18 = load <4 x double>, <4 x double>* %203, align 8
   %204 = add i64 %16, 72
   %205 = getelementptr inbounds double, double* %arrayptr822, i64 %204
   %206 = bitcast double* %205 to <4 x double>*
   %wide.load39.18 = load <4 x double>, <4 x double>* %206, align 8
; └
; ┌ @ float.jl:409 within `+`
   %207 = fadd <4 x double> %wide.load.18, %wide.load39.18
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %208 = add i64 %17, 72
   %209 = getelementptr inbounds double, double* %arrayptr1423, i64 %208
   %210 = bitcast double* %209 to <4 x double>*
   store <4 x double> %207, <4 x double>* %210, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %211 = add i64 %15, 76
   %212 = getelementptr inbounds double, double* %arrayptr21, i64 %211
   %213 = bitcast double* %212 to <4 x double>*
   %wide.load.19 = load <4 x double>, <4 x double>* %213, align 8
   %214 = add i64 %16, 76
   %215 = getelementptr inbounds double, double* %arrayptr822, i64 %214
   %216 = bitcast double* %215 to <4 x double>*
   %wide.load39.19 = load <4 x double>, <4 x double>* %216, align 8
; └
; ┌ @ float.jl:409 within `+`
   %217 = fadd <4 x double> %wide.load.19, %wide.load39.19
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %218 = add i64 %17, 76
   %219 = getelementptr inbounds double, double* %arrayptr1423, i64 %218
   %220 = bitcast double* %219 to <4 x double>*
   store <4 x double> %217, <4 x double>* %220, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %221 = add i64 %15, 80
   %222 = getelementptr inbounds double, double* %arrayptr21, i64 %221
   %223 = bitcast double* %222 to <4 x double>*
   %wide.load.20 = load <4 x double>, <4 x double>* %223, align 8
   %224 = add i64 %16, 80
   %225 = getelementptr inbounds double, double* %arrayptr822, i64 %224
   %226 = bitcast double* %225 to <4 x double>*
   %wide.load39.20 = load <4 x double>, <4 x double>* %226, align 8
; └
; ┌ @ float.jl:409 within `+`
   %227 = fadd <4 x double> %wide.load.20, %wide.load39.20
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %228 = add i64 %17, 80
   %229 = getelementptr inbounds double, double* %arrayptr1423, i64 %228
   %230 = bitcast double* %229 to <4 x double>*
   store <4 x double> %227, <4 x double>* %230, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %231 = add i64 %15, 84
   %232 = getelementptr inbounds double, double* %arrayptr21, i64 %231
   %233 = bitcast double* %232 to <4 x double>*
   %wide.load.21 = load <4 x double>, <4 x double>* %233, align 8
   %234 = add i64 %16, 84
   %235 = getelementptr inbounds double, double* %arrayptr822, i64 %234
   %236 = bitcast double* %235 to <4 x double>*
   %wide.load39.21 = load <4 x double>, <4 x double>* %236, align 8
; └
; ┌ @ float.jl:409 within `+`
   %237 = fadd <4 x double> %wide.load.21, %wide.load39.21
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %238 = add i64 %17, 84
   %239 = getelementptr inbounds double, double* %arrayptr1423, i64 %238
   %240 = bitcast double* %239 to <4 x double>*
   store <4 x double> %237, <4 x double>* %240, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %241 = add i64 %15, 88
   %242 = getelementptr inbounds double, double* %arrayptr21, i64 %241
   %243 = bitcast double* %242 to <4 x double>*
   %wide.load.22 = load <4 x double>, <4 x double>* %243, align 8
   %244 = add i64 %16, 88
   %245 = getelementptr inbounds double, double* %arrayptr822, i64 %244
   %246 = bitcast double* %245 to <4 x double>*
   %wide.load39.22 = load <4 x double>, <4 x double>* %246, align 8
; └
; ┌ @ float.jl:409 within `+`
   %247 = fadd <4 x double> %wide.load.22, %wide.load39.22
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %248 = add i64 %17, 88
   %249 = getelementptr inbounds double, double* %arrayptr1423, i64 %248
   %250 = bitcast double* %249 to <4 x double>*
   store <4 x double> %247, <4 x double>* %250, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %251 = add i64 %15, 92
   %252 = getelementptr inbounds double, double* %arrayptr21, i64 %251
   %253 = bitcast double* %252 to <4 x double>*
   %wide.load.23 = load <4 x double>, <4 x double>* %253, align 8
   %254 = add i64 %16, 92
   %255 = getelementptr inbounds double, double* %arrayptr822, i64 %254
   %256 = bitcast double* %255 to <4 x double>*
   %wide.load39.23 = load <4 x double>, <4 x double>* %256, align 8
; └
; ┌ @ float.jl:409 within `+`
   %257 = fadd <4 x double> %wide.load.23, %wide.load39.23
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %258 = add i64 %17, 92
   %259 = getelementptr inbounds double, double* %arrayptr1423, i64 %258
   %260 = bitcast double* %259 to <4 x double>*
   store <4 x double> %257, <4 x double>* %260, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %261 = add i64 %15, 96
   %262 = getelementptr inbounds double, double* %arrayptr21, i64 %261
   %263 = bitcast double* %262 to <4 x double>*
   %wide.load.24 = load <4 x double>, <4 x double>* %263, align 8
   %264 = add i64 %16, 96
   %265 = getelementptr inbounds double, double* %arrayptr822, i64 %264
   %266 = bitcast double* %265 to <4 x double>*
   %wide.load39.24 = load <4 x double>, <4 x double>* %266, align 8
; └
; ┌ @ float.jl:409 within `+`
   %267 = fadd <4 x double> %wide.load.24, %wide.load39.24
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %268 = add i64 %17, 96
   %269 = getelementptr inbounds double, double* %arrayptr1423, i64 %268
   %270 = bitcast double* %269 to <4 x double>*
   store <4 x double> %267, <4 x double>* %270, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    br label %L25

L5:                                               ; preds = %L5, %L2
    %value_phi2 = phi i64 [ %670, %L5 ], [ 1, %L2 ]
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %271 = add nsw i64 %value_phi2, -1
   %272 = add i64 %271, %15
   %273 = getelementptr inbounds double, double* %arrayptr21, i64 %272
   %arrayref = load double, double* %273, align 8
   %274 = add i64 %271, %16
   %275 = getelementptr inbounds double, double* %arrayptr822, i64 %274
   %arrayref9 = load double, double* %275, align 8
; └
; ┌ @ float.jl:409 within `+`
   %276 = fadd double %arrayref, %arrayref9
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %277 = add i64 %271, %17
   %278 = getelementptr inbounds double, double* %arrayptr1423, i64 %277
   store double %276, double* %278, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %279 = add nuw nsw i64 %value_phi2, 1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %280 = add i64 %value_phi2, %15
   %281 = getelementptr inbounds double, double* %arrayptr21, i64 %280
   %arrayref.1 = load double, double* %281, align 8
   %282 = add i64 %value_phi2, %16
   %283 = getelementptr inbounds double, double* %arrayptr822, i64 %282
   %arrayref9.1 = load double, double* %283, align 8
; └
; ┌ @ float.jl:409 within `+`
   %284 = fadd double %arrayref.1, %arrayref9.1
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %285 = add i64 %value_phi2, %17
   %286 = getelementptr inbounds double, double* %arrayptr1423, i64 %285
   store double %284, double* %286, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %287 = add nuw nsw i64 %value_phi2, 2
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %288 = add i64 %279, %15
   %289 = getelementptr inbounds double, double* %arrayptr21, i64 %288
   %arrayref.2 = load double, double* %289, align 8
   %290 = add i64 %279, %16
   %291 = getelementptr inbounds double, double* %arrayptr822, i64 %290
   %arrayref9.2 = load double, double* %291, align 8
; └
; ┌ @ float.jl:409 within `+`
   %292 = fadd double %arrayref.2, %arrayref9.2
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %293 = add i64 %279, %17
   %294 = getelementptr inbounds double, double* %arrayptr1423, i64 %293
   store double %292, double* %294, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %295 = add nuw nsw i64 %value_phi2, 3
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %296 = add i64 %287, %15
   %297 = getelementptr inbounds double, double* %arrayptr21, i64 %296
   %arrayref.3 = load double, double* %297, align 8
   %298 = add i64 %287, %16
   %299 = getelementptr inbounds double, double* %arrayptr822, i64 %298
   %arrayref9.3 = load double, double* %299, align 8
; └
; ┌ @ float.jl:409 within `+`
   %300 = fadd double %arrayref.3, %arrayref9.3
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %301 = add i64 %287, %17
   %302 = getelementptr inbounds double, double* %arrayptr1423, i64 %301
   store double %300, double* %302, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %303 = add nuw nsw i64 %value_phi2, 4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %304 = add i64 %295, %15
   %305 = getelementptr inbounds double, double* %arrayptr21, i64 %304
   %arrayref.4 = load double, double* %305, align 8
   %306 = add i64 %295, %16
   %307 = getelementptr inbounds double, double* %arrayptr822, i64 %306
   %arrayref9.4 = load double, double* %307, align 8
; └
; ┌ @ float.jl:409 within `+`
   %308 = fadd double %arrayref.4, %arrayref9.4
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %309 = add i64 %295, %17
   %310 = getelementptr inbounds double, double* %arrayptr1423, i64 %309
   store double %308, double* %310, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %311 = add nuw nsw i64 %value_phi2, 5
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %312 = add i64 %303, %15
   %313 = getelementptr inbounds double, double* %arrayptr21, i64 %312
   %arrayref.5 = load double, double* %313, align 8
   %314 = add i64 %303, %16
   %315 = getelementptr inbounds double, double* %arrayptr822, i64 %314
   %arrayref9.5 = load double, double* %315, align 8
; └
; ┌ @ float.jl:409 within `+`
   %316 = fadd double %arrayref.5, %arrayref9.5
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %317 = add i64 %303, %17
   %318 = getelementptr inbounds double, double* %arrayptr1423, i64 %317
   store double %316, double* %318, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %319 = add nuw nsw i64 %value_phi2, 6
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %320 = add i64 %311, %15
   %321 = getelementptr inbounds double, double* %arrayptr21, i64 %320
   %arrayref.6 = load double, double* %321, align 8
   %322 = add i64 %311, %16
   %323 = getelementptr inbounds double, double* %arrayptr822, i64 %322
   %arrayref9.6 = load double, double* %323, align 8
; └
; ┌ @ float.jl:409 within `+`
   %324 = fadd double %arrayref.6, %arrayref9.6
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %325 = add i64 %311, %17
   %326 = getelementptr inbounds double, double* %arrayptr1423, i64 %325
   store double %324, double* %326, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %327 = add nuw nsw i64 %value_phi2, 7
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %328 = add i64 %319, %15
   %329 = getelementptr inbounds double, double* %arrayptr21, i64 %328
   %arrayref.7 = load double, double* %329, align 8
   %330 = add i64 %319, %16
   %331 = getelementptr inbounds double, double* %arrayptr822, i64 %330
   %arrayref9.7 = load double, double* %331, align 8
; └
; ┌ @ float.jl:409 within `+`
   %332 = fadd double %arrayref.7, %arrayref9.7
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %333 = add i64 %319, %17
   %334 = getelementptr inbounds double, double* %arrayptr1423, i64 %333
   store double %332, double* %334, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %335 = add nuw nsw i64 %value_phi2, 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %336 = add i64 %327, %15
   %337 = getelementptr inbounds double, double* %arrayptr21, i64 %336
   %arrayref.8 = load double, double* %337, align 8
   %338 = add i64 %327, %16
   %339 = getelementptr inbounds double, double* %arrayptr822, i64 %338
   %arrayref9.8 = load double, double* %339, align 8
; └
; ┌ @ float.jl:409 within `+`
   %340 = fadd double %arrayref.8, %arrayref9.8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %341 = add i64 %327, %17
   %342 = getelementptr inbounds double, double* %arrayptr1423, i64 %341
   store double %340, double* %342, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %343 = add nuw nsw i64 %value_phi2, 9
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %344 = add i64 %335, %15
   %345 = getelementptr inbounds double, double* %arrayptr21, i64 %344
   %arrayref.9 = load double, double* %345, align 8
   %346 = add i64 %335, %16
   %347 = getelementptr inbounds double, double* %arrayptr822, i64 %346
   %arrayref9.9 = load double, double* %347, align 8
; └
; ┌ @ float.jl:409 within `+`
   %348 = fadd double %arrayref.9, %arrayref9.9
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %349 = add i64 %335, %17
   %350 = getelementptr inbounds double, double* %arrayptr1423, i64 %349
   store double %348, double* %350, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %351 = add nuw nsw i64 %value_phi2, 10
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %352 = add i64 %343, %15
   %353 = getelementptr inbounds double, double* %arrayptr21, i64 %352
   %arrayref.10 = load double, double* %353, align 8
   %354 = add i64 %343, %16
   %355 = getelementptr inbounds double, double* %arrayptr822, i64 %354
   %arrayref9.10 = load double, double* %355, align 8
; └
; ┌ @ float.jl:409 within `+`
   %356 = fadd double %arrayref.10, %arrayref9.10
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %357 = add i64 %343, %17
   %358 = getelementptr inbounds double, double* %arrayptr1423, i64 %357
   store double %356, double* %358, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %359 = add nuw nsw i64 %value_phi2, 11
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %360 = add i64 %351, %15
   %361 = getelementptr inbounds double, double* %arrayptr21, i64 %360
   %arrayref.11 = load double, double* %361, align 8
   %362 = add i64 %351, %16
   %363 = getelementptr inbounds double, double* %arrayptr822, i64 %362
   %arrayref9.11 = load double, double* %363, align 8
; └
; ┌ @ float.jl:409 within `+`
   %364 = fadd double %arrayref.11, %arrayref9.11
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %365 = add i64 %351, %17
   %366 = getelementptr inbounds double, double* %arrayptr1423, i64 %365
   store double %364, double* %366, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %367 = add nuw nsw i64 %value_phi2, 12
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %368 = add i64 %359, %15
   %369 = getelementptr inbounds double, double* %arrayptr21, i64 %368
   %arrayref.12 = load double, double* %369, align 8
   %370 = add i64 %359, %16
   %371 = getelementptr inbounds double, double* %arrayptr822, i64 %370
   %arrayref9.12 = load double, double* %371, align 8
; └
; ┌ @ float.jl:409 within `+`
   %372 = fadd double %arrayref.12, %arrayref9.12
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %373 = add i64 %359, %17
   %374 = getelementptr inbounds double, double* %arrayptr1423, i64 %373
   store double %372, double* %374, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %375 = add nuw nsw i64 %value_phi2, 13
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %376 = add i64 %367, %15
   %377 = getelementptr inbounds double, double* %arrayptr21, i64 %376
   %arrayref.13 = load double, double* %377, align 8
   %378 = add i64 %367, %16
   %379 = getelementptr inbounds double, double* %arrayptr822, i64 %378
   %arrayref9.13 = load double, double* %379, align 8
; └
; ┌ @ float.jl:409 within `+`
   %380 = fadd double %arrayref.13, %arrayref9.13
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %381 = add i64 %367, %17
   %382 = getelementptr inbounds double, double* %arrayptr1423, i64 %381
   store double %380, double* %382, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %383 = add nuw nsw i64 %value_phi2, 14
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %384 = add i64 %375, %15
   %385 = getelementptr inbounds double, double* %arrayptr21, i64 %384
   %arrayref.14 = load double, double* %385, align 8
   %386 = add i64 %375, %16
   %387 = getelementptr inbounds double, double* %arrayptr822, i64 %386
   %arrayref9.14 = load double, double* %387, align 8
; └
; ┌ @ float.jl:409 within `+`
   %388 = fadd double %arrayref.14, %arrayref9.14
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %389 = add i64 %375, %17
   %390 = getelementptr inbounds double, double* %arrayptr1423, i64 %389
   store double %388, double* %390, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %391 = add nuw nsw i64 %value_phi2, 15
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %392 = add i64 %383, %15
   %393 = getelementptr inbounds double, double* %arrayptr21, i64 %392
   %arrayref.15 = load double, double* %393, align 8
   %394 = add i64 %383, %16
   %395 = getelementptr inbounds double, double* %arrayptr822, i64 %394
   %arrayref9.15 = load double, double* %395, align 8
; └
; ┌ @ float.jl:409 within `+`
   %396 = fadd double %arrayref.15, %arrayref9.15
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %397 = add i64 %383, %17
   %398 = getelementptr inbounds double, double* %arrayptr1423, i64 %397
   store double %396, double* %398, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %399 = add nuw nsw i64 %value_phi2, 16
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %400 = add i64 %391, %15
   %401 = getelementptr inbounds double, double* %arrayptr21, i64 %400
   %arrayref.16 = load double, double* %401, align 8
   %402 = add i64 %391, %16
   %403 = getelementptr inbounds double, double* %arrayptr822, i64 %402
   %arrayref9.16 = load double, double* %403, align 8
; └
; ┌ @ float.jl:409 within `+`
   %404 = fadd double %arrayref.16, %arrayref9.16
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %405 = add i64 %391, %17
   %406 = getelementptr inbounds double, double* %arrayptr1423, i64 %405
   store double %404, double* %406, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %407 = add nuw nsw i64 %value_phi2, 17
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %408 = add i64 %399, %15
   %409 = getelementptr inbounds double, double* %arrayptr21, i64 %408
   %arrayref.17 = load double, double* %409, align 8
   %410 = add i64 %399, %16
   %411 = getelementptr inbounds double, double* %arrayptr822, i64 %410
   %arrayref9.17 = load double, double* %411, align 8
; └
; ┌ @ float.jl:409 within `+`
   %412 = fadd double %arrayref.17, %arrayref9.17
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %413 = add i64 %399, %17
   %414 = getelementptr inbounds double, double* %arrayptr1423, i64 %413
   store double %412, double* %414, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %415 = add nuw nsw i64 %value_phi2, 18
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %416 = add i64 %407, %15
   %417 = getelementptr inbounds double, double* %arrayptr21, i64 %416
   %arrayref.18 = load double, double* %417, align 8
   %418 = add i64 %407, %16
   %419 = getelementptr inbounds double, double* %arrayptr822, i64 %418
   %arrayref9.18 = load double, double* %419, align 8
; └
; ┌ @ float.jl:409 within `+`
   %420 = fadd double %arrayref.18, %arrayref9.18
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %421 = add i64 %407, %17
   %422 = getelementptr inbounds double, double* %arrayptr1423, i64 %421
   store double %420, double* %422, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %423 = add nuw nsw i64 %value_phi2, 19
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %424 = add i64 %415, %15
   %425 = getelementptr inbounds double, double* %arrayptr21, i64 %424
   %arrayref.19 = load double, double* %425, align 8
   %426 = add i64 %415, %16
   %427 = getelementptr inbounds double, double* %arrayptr822, i64 %426
   %arrayref9.19 = load double, double* %427, align 8
; └
; ┌ @ float.jl:409 within `+`
   %428 = fadd double %arrayref.19, %arrayref9.19
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %429 = add i64 %415, %17
   %430 = getelementptr inbounds double, double* %arrayptr1423, i64 %429
   store double %428, double* %430, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %431 = add nuw nsw i64 %value_phi2, 20
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %432 = add i64 %423, %15
   %433 = getelementptr inbounds double, double* %arrayptr21, i64 %432
   %arrayref.20 = load double, double* %433, align 8
   %434 = add i64 %423, %16
   %435 = getelementptr inbounds double, double* %arrayptr822, i64 %434
   %arrayref9.20 = load double, double* %435, align 8
; └
; ┌ @ float.jl:409 within `+`
   %436 = fadd double %arrayref.20, %arrayref9.20
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %437 = add i64 %423, %17
   %438 = getelementptr inbounds double, double* %arrayptr1423, i64 %437
   store double %436, double* %438, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %439 = add nuw nsw i64 %value_phi2, 21
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %440 = add i64 %431, %15
   %441 = getelementptr inbounds double, double* %arrayptr21, i64 %440
   %arrayref.21 = load double, double* %441, align 8
   %442 = add i64 %431, %16
   %443 = getelementptr inbounds double, double* %arrayptr822, i64 %442
   %arrayref9.21 = load double, double* %443, align 8
; └
; ┌ @ float.jl:409 within `+`
   %444 = fadd double %arrayref.21, %arrayref9.21
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %445 = add i64 %431, %17
   %446 = getelementptr inbounds double, double* %arrayptr1423, i64 %445
   store double %444, double* %446, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %447 = add nuw nsw i64 %value_phi2, 22
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %448 = add i64 %439, %15
   %449 = getelementptr inbounds double, double* %arrayptr21, i64 %448
   %arrayref.22 = load double, double* %449, align 8
   %450 = add i64 %439, %16
   %451 = getelementptr inbounds double, double* %arrayptr822, i64 %450
   %arrayref9.22 = load double, double* %451, align 8
; └
; ┌ @ float.jl:409 within `+`
   %452 = fadd double %arrayref.22, %arrayref9.22
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %453 = add i64 %439, %17
   %454 = getelementptr inbounds double, double* %arrayptr1423, i64 %453
   store double %452, double* %454, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %455 = add nuw nsw i64 %value_phi2, 23
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %456 = add i64 %447, %15
   %457 = getelementptr inbounds double, double* %arrayptr21, i64 %456
   %arrayref.23 = load double, double* %457, align 8
   %458 = add i64 %447, %16
   %459 = getelementptr inbounds double, double* %arrayptr822, i64 %458
   %arrayref9.23 = load double, double* %459, align 8
; └
; ┌ @ float.jl:409 within `+`
   %460 = fadd double %arrayref.23, %arrayref9.23
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %461 = add i64 %447, %17
   %462 = getelementptr inbounds double, double* %arrayptr1423, i64 %461
   store double %460, double* %462, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %463 = add nuw nsw i64 %value_phi2, 24
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %464 = add i64 %455, %15
   %465 = getelementptr inbounds double, double* %arrayptr21, i64 %464
   %arrayref.24 = load double, double* %465, align 8
   %466 = add i64 %455, %16
   %467 = getelementptr inbounds double, double* %arrayptr822, i64 %466
   %arrayref9.24 = load double, double* %467, align 8
; └
; ┌ @ float.jl:409 within `+`
   %468 = fadd double %arrayref.24, %arrayref9.24
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %469 = add i64 %455, %17
   %470 = getelementptr inbounds double, double* %arrayptr1423, i64 %469
   store double %468, double* %470, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %471 = add nuw nsw i64 %value_phi2, 25
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %472 = add i64 %463, %15
   %473 = getelementptr inbounds double, double* %arrayptr21, i64 %472
   %arrayref.25 = load double, double* %473, align 8
   %474 = add i64 %463, %16
   %475 = getelementptr inbounds double, double* %arrayptr822, i64 %474
   %arrayref9.25 = load double, double* %475, align 8
; └
; ┌ @ float.jl:409 within `+`
   %476 = fadd double %arrayref.25, %arrayref9.25
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %477 = add i64 %463, %17
   %478 = getelementptr inbounds double, double* %arrayptr1423, i64 %477
   store double %476, double* %478, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %479 = add nuw nsw i64 %value_phi2, 26
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %480 = add i64 %471, %15
   %481 = getelementptr inbounds double, double* %arrayptr21, i64 %480
   %arrayref.26 = load double, double* %481, align 8
   %482 = add i64 %471, %16
   %483 = getelementptr inbounds double, double* %arrayptr822, i64 %482
   %arrayref9.26 = load double, double* %483, align 8
; └
; ┌ @ float.jl:409 within `+`
   %484 = fadd double %arrayref.26, %arrayref9.26
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %485 = add i64 %471, %17
   %486 = getelementptr inbounds double, double* %arrayptr1423, i64 %485
   store double %484, double* %486, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %487 = add nuw nsw i64 %value_phi2, 27
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %488 = add i64 %479, %15
   %489 = getelementptr inbounds double, double* %arrayptr21, i64 %488
   %arrayref.27 = load double, double* %489, align 8
   %490 = add i64 %479, %16
   %491 = getelementptr inbounds double, double* %arrayptr822, i64 %490
   %arrayref9.27 = load double, double* %491, align 8
; └
; ┌ @ float.jl:409 within `+`
   %492 = fadd double %arrayref.27, %arrayref9.27
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %493 = add i64 %479, %17
   %494 = getelementptr inbounds double, double* %arrayptr1423, i64 %493
   store double %492, double* %494, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %495 = add nuw nsw i64 %value_phi2, 28
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %496 = add i64 %487, %15
   %497 = getelementptr inbounds double, double* %arrayptr21, i64 %496
   %arrayref.28 = load double, double* %497, align 8
   %498 = add i64 %487, %16
   %499 = getelementptr inbounds double, double* %arrayptr822, i64 %498
   %arrayref9.28 = load double, double* %499, align 8
; └
; ┌ @ float.jl:409 within `+`
   %500 = fadd double %arrayref.28, %arrayref9.28
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %501 = add i64 %487, %17
   %502 = getelementptr inbounds double, double* %arrayptr1423, i64 %501
   store double %500, double* %502, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %503 = add nuw nsw i64 %value_phi2, 29
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %504 = add i64 %495, %15
   %505 = getelementptr inbounds double, double* %arrayptr21, i64 %504
   %arrayref.29 = load double, double* %505, align 8
   %506 = add i64 %495, %16
   %507 = getelementptr inbounds double, double* %arrayptr822, i64 %506
   %arrayref9.29 = load double, double* %507, align 8
; └
; ┌ @ float.jl:409 within `+`
   %508 = fadd double %arrayref.29, %arrayref9.29
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %509 = add i64 %495, %17
   %510 = getelementptr inbounds double, double* %arrayptr1423, i64 %509
   store double %508, double* %510, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %511 = add nuw nsw i64 %value_phi2, 30
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %512 = add i64 %503, %15
   %513 = getelementptr inbounds double, double* %arrayptr21, i64 %512
   %arrayref.30 = load double, double* %513, align 8
   %514 = add i64 %503, %16
   %515 = getelementptr inbounds double, double* %arrayptr822, i64 %514
   %arrayref9.30 = load double, double* %515, align 8
; └
; ┌ @ float.jl:409 within `+`
   %516 = fadd double %arrayref.30, %arrayref9.30
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %517 = add i64 %503, %17
   %518 = getelementptr inbounds double, double* %arrayptr1423, i64 %517
   store double %516, double* %518, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %519 = add nuw nsw i64 %value_phi2, 31
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %520 = add i64 %511, %15
   %521 = getelementptr inbounds double, double* %arrayptr21, i64 %520
   %arrayref.31 = load double, double* %521, align 8
   %522 = add i64 %511, %16
   %523 = getelementptr inbounds double, double* %arrayptr822, i64 %522
   %arrayref9.31 = load double, double* %523, align 8
; └
; ┌ @ float.jl:409 within `+`
   %524 = fadd double %arrayref.31, %arrayref9.31
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %525 = add i64 %511, %17
   %526 = getelementptr inbounds double, double* %arrayptr1423, i64 %525
   store double %524, double* %526, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %527 = add nuw nsw i64 %value_phi2, 32
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %528 = add i64 %519, %15
   %529 = getelementptr inbounds double, double* %arrayptr21, i64 %528
   %arrayref.32 = load double, double* %529, align 8
   %530 = add i64 %519, %16
   %531 = getelementptr inbounds double, double* %arrayptr822, i64 %530
   %arrayref9.32 = load double, double* %531, align 8
; └
; ┌ @ float.jl:409 within `+`
   %532 = fadd double %arrayref.32, %arrayref9.32
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %533 = add i64 %519, %17
   %534 = getelementptr inbounds double, double* %arrayptr1423, i64 %533
   store double %532, double* %534, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %535 = add nuw nsw i64 %value_phi2, 33
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %536 = add i64 %527, %15
   %537 = getelementptr inbounds double, double* %arrayptr21, i64 %536
   %arrayref.33 = load double, double* %537, align 8
   %538 = add i64 %527, %16
   %539 = getelementptr inbounds double, double* %arrayptr822, i64 %538
   %arrayref9.33 = load double, double* %539, align 8
; └
; ┌ @ float.jl:409 within `+`
   %540 = fadd double %arrayref.33, %arrayref9.33
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %541 = add i64 %527, %17
   %542 = getelementptr inbounds double, double* %arrayptr1423, i64 %541
   store double %540, double* %542, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %543 = add nuw nsw i64 %value_phi2, 34
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %544 = add i64 %535, %15
   %545 = getelementptr inbounds double, double* %arrayptr21, i64 %544
   %arrayref.34 = load double, double* %545, align 8
   %546 = add i64 %535, %16
   %547 = getelementptr inbounds double, double* %arrayptr822, i64 %546
   %arrayref9.34 = load double, double* %547, align 8
; └
; ┌ @ float.jl:409 within `+`
   %548 = fadd double %arrayref.34, %arrayref9.34
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %549 = add i64 %535, %17
   %550 = getelementptr inbounds double, double* %arrayptr1423, i64 %549
   store double %548, double* %550, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %551 = add nuw nsw i64 %value_phi2, 35
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %552 = add i64 %543, %15
   %553 = getelementptr inbounds double, double* %arrayptr21, i64 %552
   %arrayref.35 = load double, double* %553, align 8
   %554 = add i64 %543, %16
   %555 = getelementptr inbounds double, double* %arrayptr822, i64 %554
   %arrayref9.35 = load double, double* %555, align 8
; └
; ┌ @ float.jl:409 within `+`
   %556 = fadd double %arrayref.35, %arrayref9.35
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %557 = add i64 %543, %17
   %558 = getelementptr inbounds double, double* %arrayptr1423, i64 %557
   store double %556, double* %558, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %559 = add nuw nsw i64 %value_phi2, 36
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %560 = add i64 %551, %15
   %561 = getelementptr inbounds double, double* %arrayptr21, i64 %560
   %arrayref.36 = load double, double* %561, align 8
   %562 = add i64 %551, %16
   %563 = getelementptr inbounds double, double* %arrayptr822, i64 %562
   %arrayref9.36 = load double, double* %563, align 8
; └
; ┌ @ float.jl:409 within `+`
   %564 = fadd double %arrayref.36, %arrayref9.36
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %565 = add i64 %551, %17
   %566 = getelementptr inbounds double, double* %arrayptr1423, i64 %565
   store double %564, double* %566, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %567 = add nuw nsw i64 %value_phi2, 37
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %568 = add i64 %559, %15
   %569 = getelementptr inbounds double, double* %arrayptr21, i64 %568
   %arrayref.37 = load double, double* %569, align 8
   %570 = add i64 %559, %16
   %571 = getelementptr inbounds double, double* %arrayptr822, i64 %570
   %arrayref9.37 = load double, double* %571, align 8
; └
; ┌ @ float.jl:409 within `+`
   %572 = fadd double %arrayref.37, %arrayref9.37
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %573 = add i64 %559, %17
   %574 = getelementptr inbounds double, double* %arrayptr1423, i64 %573
   store double %572, double* %574, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %575 = add nuw nsw i64 %value_phi2, 38
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %576 = add i64 %567, %15
   %577 = getelementptr inbounds double, double* %arrayptr21, i64 %576
   %arrayref.38 = load double, double* %577, align 8
   %578 = add i64 %567, %16
   %579 = getelementptr inbounds double, double* %arrayptr822, i64 %578
   %arrayref9.38 = load double, double* %579, align 8
; └
; ┌ @ float.jl:409 within `+`
   %580 = fadd double %arrayref.38, %arrayref9.38
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %581 = add i64 %567, %17
   %582 = getelementptr inbounds double, double* %arrayptr1423, i64 %581
   store double %580, double* %582, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %583 = add nuw nsw i64 %value_phi2, 39
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %584 = add i64 %575, %15
   %585 = getelementptr inbounds double, double* %arrayptr21, i64 %584
   %arrayref.39 = load double, double* %585, align 8
   %586 = add i64 %575, %16
   %587 = getelementptr inbounds double, double* %arrayptr822, i64 %586
   %arrayref9.39 = load double, double* %587, align 8
; └
; ┌ @ float.jl:409 within `+`
   %588 = fadd double %arrayref.39, %arrayref9.39
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %589 = add i64 %575, %17
   %590 = getelementptr inbounds double, double* %arrayptr1423, i64 %589
   store double %588, double* %590, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %591 = add nuw nsw i64 %value_phi2, 40
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %592 = add i64 %583, %15
   %593 = getelementptr inbounds double, double* %arrayptr21, i64 %592
   %arrayref.40 = load double, double* %593, align 8
   %594 = add i64 %583, %16
   %595 = getelementptr inbounds double, double* %arrayptr822, i64 %594
   %arrayref9.40 = load double, double* %595, align 8
; └
; ┌ @ float.jl:409 within `+`
   %596 = fadd double %arrayref.40, %arrayref9.40
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %597 = add i64 %583, %17
   %598 = getelementptr inbounds double, double* %arrayptr1423, i64 %597
   store double %596, double* %598, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %599 = add nuw nsw i64 %value_phi2, 41
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %600 = add i64 %591, %15
   %601 = getelementptr inbounds double, double* %arrayptr21, i64 %600
   %arrayref.41 = load double, double* %601, align 8
   %602 = add i64 %591, %16
   %603 = getelementptr inbounds double, double* %arrayptr822, i64 %602
   %arrayref9.41 = load double, double* %603, align 8
; └
; ┌ @ float.jl:409 within `+`
   %604 = fadd double %arrayref.41, %arrayref9.41
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %605 = add i64 %591, %17
   %606 = getelementptr inbounds double, double* %arrayptr1423, i64 %605
   store double %604, double* %606, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %607 = add nuw nsw i64 %value_phi2, 42
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %608 = add i64 %599, %15
   %609 = getelementptr inbounds double, double* %arrayptr21, i64 %608
   %arrayref.42 = load double, double* %609, align 8
   %610 = add i64 %599, %16
   %611 = getelementptr inbounds double, double* %arrayptr822, i64 %610
   %arrayref9.42 = load double, double* %611, align 8
; └
; ┌ @ float.jl:409 within `+`
   %612 = fadd double %arrayref.42, %arrayref9.42
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %613 = add i64 %599, %17
   %614 = getelementptr inbounds double, double* %arrayptr1423, i64 %613
   store double %612, double* %614, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %615 = add nuw nsw i64 %value_phi2, 43
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %616 = add i64 %607, %15
   %617 = getelementptr inbounds double, double* %arrayptr21, i64 %616
   %arrayref.43 = load double, double* %617, align 8
   %618 = add i64 %607, %16
   %619 = getelementptr inbounds double, double* %arrayptr822, i64 %618
   %arrayref9.43 = load double, double* %619, align 8
; └
; ┌ @ float.jl:409 within `+`
   %620 = fadd double %arrayref.43, %arrayref9.43
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %621 = add i64 %607, %17
   %622 = getelementptr inbounds double, double* %arrayptr1423, i64 %621
   store double %620, double* %622, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %623 = add nuw nsw i64 %value_phi2, 44
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %624 = add i64 %615, %15
   %625 = getelementptr inbounds double, double* %arrayptr21, i64 %624
   %arrayref.44 = load double, double* %625, align 8
   %626 = add i64 %615, %16
   %627 = getelementptr inbounds double, double* %arrayptr822, i64 %626
   %arrayref9.44 = load double, double* %627, align 8
; └
; ┌ @ float.jl:409 within `+`
   %628 = fadd double %arrayref.44, %arrayref9.44
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %629 = add i64 %615, %17
   %630 = getelementptr inbounds double, double* %arrayptr1423, i64 %629
   store double %628, double* %630, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %631 = add nuw nsw i64 %value_phi2, 45
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %632 = add i64 %623, %15
   %633 = getelementptr inbounds double, double* %arrayptr21, i64 %632
   %arrayref.45 = load double, double* %633, align 8
   %634 = add i64 %623, %16
   %635 = getelementptr inbounds double, double* %arrayptr822, i64 %634
   %arrayref9.45 = load double, double* %635, align 8
; └
; ┌ @ float.jl:409 within `+`
   %636 = fadd double %arrayref.45, %arrayref9.45
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %637 = add i64 %623, %17
   %638 = getelementptr inbounds double, double* %arrayptr1423, i64 %637
   store double %636, double* %638, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %639 = add nuw nsw i64 %value_phi2, 46
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %640 = add i64 %631, %15
   %641 = getelementptr inbounds double, double* %arrayptr21, i64 %640
   %arrayref.46 = load double, double* %641, align 8
   %642 = add i64 %631, %16
   %643 = getelementptr inbounds double, double* %arrayptr822, i64 %642
   %arrayref9.46 = load double, double* %643, align 8
; └
; ┌ @ float.jl:409 within `+`
   %644 = fadd double %arrayref.46, %arrayref9.46
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %645 = add i64 %631, %17
   %646 = getelementptr inbounds double, double* %arrayptr1423, i64 %645
   store double %644, double* %646, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %647 = add nuw nsw i64 %value_phi2, 47
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %648 = add i64 %639, %15
   %649 = getelementptr inbounds double, double* %arrayptr21, i64 %648
   %arrayref.47 = load double, double* %649, align 8
   %650 = add i64 %639, %16
   %651 = getelementptr inbounds double, double* %arrayptr822, i64 %650
   %arrayref9.47 = load double, double* %651, align 8
; └
; ┌ @ float.jl:409 within `+`
   %652 = fadd double %arrayref.47, %arrayref9.47
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %653 = add i64 %639, %17
   %654 = getelementptr inbounds double, double* %arrayptr1423, i64 %653
   store double %652, double* %654, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
   %655 = add nuw nsw i64 %value_phi2, 48
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %656 = add i64 %647, %15
   %657 = getelementptr inbounds double, double* %arrayptr21, i64 %656
   %arrayref.48 = load double, double* %657, align 8
   %658 = add i64 %647, %16
   %659 = getelementptr inbounds double, double* %arrayptr822, i64 %658
   %arrayref9.48 = load double, double* %659, align 8
; └
; ┌ @ float.jl:409 within `+`
   %660 = fadd double %arrayref.48, %arrayref9.48
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %661 = add i64 %647, %17
   %662 = getelementptr inbounds double, double* %arrayptr1423, i64 %661
   store double %660, double* %662, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `inner_noalloc_ib!`
; ┌ @ essentials.jl:14 within `getindex`
   %663 = add i64 %655, %15
   %664 = getelementptr inbounds double, double* %arrayptr21, i64 %663
   %arrayref.49 = load double, double* %664, align 8
   %665 = add i64 %655, %16
   %666 = getelementptr inbounds double, double* %arrayptr822, i64 %665
   %arrayref9.49 = load double, double* %666, align 8
; └
; ┌ @ float.jl:409 within `+`
   %667 = fadd double %arrayref.49, %arrayref9.49
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
5 within `inner_noalloc_ib!`
; ┌ @ array.jl:1024 within `setindex!`
   %668 = add i64 %655, %17
   %669 = getelementptr inbounds double, double* %arrayptr1423, i64 %668
   store double %667, double* %669, align 8
; └
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
6 within `inner_noalloc_ib!`
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    %.not.not.49 = icmp eq i64 %value_phi2, 51
; │└
   %670 = add nuw nsw i64 %value_phi2, 50
; └
  br i1 %.not.not.49, label %L25, label %L5

L25:                                              ; preds = %L5, %vector.body.preheader
; ┌ @ range.jl:901 within `iterate`
; │┌ @ promotion.jl:521 within `==`
    %.not.not24 = icmp eq i64 %value_phi, 100
; │└
   %671 = add nuw nsw i64 %value_phi, 1
; └
  %indvar.next = add i64 %indvar, 1
  br i1 %.not.not24, label %L36, label %L2

L36:                                              ; preds = %L25
  ret {}* inttoptr (i64 140125612802056 to {}*)
}

If you look closely, you will see things like:

%wide.load24 = load <4 x double>, <4 x double> addrspace(13)* %46, align 8
; └
; ┌ @ float.jl:395 within `+`
%47 = fadd <4 x double> %wide.load, %wide.load24

What this is saying is that it's loading and adding 4 Float64s at a time! This feature of the processor is known as SIMD: single instruction, multiple data. If certain primitive floating point operations, like + and *, are done in succession (i.e. no bounds checks between them!), then the compiler can lump them together into vector instructions that the processor executes on multiple values at once. Since clock speeds have stopped improving while transistors have kept getting smaller, this "lumping" has been a big source of speedups in computational mathematics even though an individual + or * hasn't gotten faster. Thus to get full speed we want to make sure this is utilized whenever possible, which essentially just amounts to writing type-inferred loops with no branches or bounds checks in the way.
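As a minimal sketch of a loop that meets these conditions, consider the elementwise addition below. The function name add_simd! and the 100×100 sizes are assumptions for illustration and are not the lecture's inner_noalloc_ib!. The size check is done once up front so that @inbounds is safe, and @simd tells the compiler it is allowed to vectorize the loop.

```julia
# Hypothetical helper (not from the lecture): C = A + B, elementwise.
function add_simd!(C, A, B)
  # Check sizes once, outside the loop, so the loop body needs no bounds checks.
  @assert size(A) == size(B) == size(C)
  # @inbounds removes the per-access bounds checks; @simd permits vectorization.
  @inbounds @simd for i in eachindex(A, B, C)
    C[i] = A[i] + B[i]
  end
  C
end

A = rand(100, 100)
B = rand(100, 100)
C = similar(A)
add_simd!(C, A, B)
```

Running @code_llvm add_simd!(C, A, B) should then show vector types such as <4 x double> in the loop body, though the exact width depends on the CPU and Julia version.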

FMA

Modern processors have a single operation that fuses the multiplication and the addition in the operation x*y+z, known as a fused multiply-add or FMA. Note that FMA has less floating point roundoff error than the two operation form, because the result is rounded once instead of twice. We can see this intrinsic in the resulting LLVM IR:

@code_llvm fma(2.0,5.0,3.0)
;  @ floatfuncs.jl:439 within `fma`
define double @julia_fma_3976(double %0, double %1, double %2) #0 {
common.ret:
; ┌ @ floatfuncs.jl:434 within `fma_llvm`
   %3 = call double @llvm.fma.f64(double %0, double %1, double %2)
; └
;  @ floatfuncs.jl within `fma`
  ret double %3
}

The Julia function muladd will automatically choose between FMA and the original form depending on the availability of the routine in the processor. The MuladdMacro.jl package has a macro @muladd which rewrites arithmetic expressions into nested muladd calls. For example, x1*y1 + x2*y2 + x3*y3 can be rewritten as:

muladd(x1,y1,muladd(x2,y2,x3*y3))

This reduces the linear combination from 5 arithmetic operations (3 multiplications and 2 additions) to just 3. FMA operations can also be SIMD'd.
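The following is a small sketch of both points, using only Base functions. The input values are assumptions chosen for illustration: the first set is exactly representable so that the plain and nested forms agree exactly, and the second shows the single rounding of fma.

```julia
# Assumed illustrative values, exactly representable in Float64.
x1, x2, x3 = 1.0, 2.0, 3.0
y1, y2, y3 = 4.0, 5.0, 6.0

plain = x1*y1 + x2*y2 + x3*y3                  # 3 multiplications + 2 additions
fused = muladd(x1, y1, muladd(x2, y2, x3*y3))  # 1 multiplication + 2 muladds

# Single rounding: 0.1*10.0 rounds to exactly 1.0, so the two-step form loses
# the small remainder, while fma rounds only the final result and keeps it.
two_step = 0.1*10.0 - 1.0         # 0.0
one_step = fma(0.1, 10.0, -1.0)   # 2.0^-54, about 5.55e-17
```

Note that fma always computes the fused result (falling back to software if needed), while muladd is free to pick whichever form is faster, so only fma should be relied on for the rounding behavior.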

Inlining

All of this would go to waste if function call costs of 50 clock cycles were interrupting every single +. Fortunately these function calls disappear during the compilation process due to what's known as inlining. Essentially, if the function call is determined to be "cheap enough", the actual function call is removed and the code is basically pasted into the function caller. We can force a function call to occur by telling it to not inline:

@noinline fnoinline(x,y) = x + y
finline(x,y) = x + y # Can add @inline, but this is automatic here
function qinline(x,y)
  a = 4
  b = 2
  c = finline(x,a)
  d = finline(b,c)
  finline(d,y)
end
function qnoinline(x,y)
  a = 4
  b = 2
  c = fnoinline(x,a)
  d = fnoinline(b,c)
  fnoinline(d,y)
end
qnoinline (generic function with 1 method)
@code_llvm qinline(1.0,2.0)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
4 within `qinline`
define double @julia_qinline_3979(double %0, double %1) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
7 within `qinline`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:3 within `finline`
; │┌ @ promotion.jl:422 within `+` @ float.jl:409
    %2 = fadd double %0, 4.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
8 within `qinline`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:3 within `finline`
; │┌ @ promotion.jl:422 within `+` @ float.jl:409
    %3 = fadd double %2, 2.000000e+00
; └└
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
9 within `qinline`
; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
:3 within `finline`
; │┌ @ float.jl:409 within `+`
    %4 = fadd double %3, %1
    ret double %4
; └└
}
@code_llvm qnoinline(1.0,2.0)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
11 within `qnoinline`
define double @julia_qnoinline_3981(double %0, double %1) #0 {
top:
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
14 within `qnoinline`
  %2 = call double @j_fnoinline_3983(double %0, i64 signext 4)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
15 within `qnoinline`
  %3 = call double @j_fnoinline_3984(i64 signext 2, double %2)
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
16 within `qnoinline`
  %4 = call double @j_fnoinline_3985(double %3, double %1)
  ret double %4
}

We can see now that it keeps the function calls:

%4 = call double @j_fnoinline_3985(double %3, double %1)

and this is slower in comparison to what we had before (but it still infers).

x = 1.0
y = 2.0
@btime qinline(x,y)
19.505 ns (1 allocation: 16 bytes)
9.0
@btime qnoinline(x,y)
24.110 ns (1 allocation: 16 bytes)
9.0

Note that if we ever want to go the other direction and tell Julia to inline as much as possible, one can use the macro @inline.

Summary

  • Scalar operations are super cheap, and if they are cache-aligned then more than one will occur in a clock cycle.

  • Inlining a function will remove the high function call overhead.

  • Branch prediction is pretty good these days, so keep branches out of super tight inner loops but don't worry all too much about them elsewhere.

  • Cache misses are expensive, and they get more expensive the further out in the memory hierarchy the data has to be fetched from.
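To make the point about branches in tight inner loops concrete, here is a hedged sketch comparing a loop with a data-dependent if against one that uses ifelse, which evaluates both arguments and selects one without branching. The function names and the test vector are assumptions for illustration, and the compiler may already turn the first form into a branch-free select on its own.

```julia
# Branching version: the `if` in the loop body depends on the data.
function sum_positive_branch(v)
  s = zero(eltype(v))
  for x in v
    if x > 0
      s += x
    end
  end
  s
end

# Branch-free version: `ifelse` selects a value instead of jumping,
# which leaves the loop body in a form that is easy to vectorize.
function sum_positive_ifelse(v)
  s = zero(eltype(v))
  @inbounds @simd for i in eachindex(v)
    s += ifelse(v[i] > 0, v[i], zero(eltype(v)))
  end
  s
end

v = [1.0, -2.0, 3.0, -4.0, 5.0]
sum_positive_branch(v), sum_positive_ifelse(v)
```

Because @simd allows the additions to be reordered, the two versions can differ in the last bits for general floating point inputs; the small integer-valued inputs here sum exactly either way.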

Note on Benchmarking

Julia's compiler is smart. This means that if you don't try hard enough, Julia's compiler might get rid of your issues. For example, it can delete branches and directly compute the result if all of the values are known at compile time. So be very careful when benchmarking: your tests may have just compiled away!

Notice the following:

@btime qinline(1.0,2.0)
1.552 ns (0 allocations: 0 bytes)
9.0

Dang, that's much faster! But if you look into it, Julia's compiler is actually "cheating" on this benchmark:

cheat() = qinline(1.0,2.0)
@code_llvm cheat()
;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
2 within `cheat`
define double @julia_cheat_4013() #0 {
top:
  ret double 9.000000e+00
}

It realized that 1.0 and 2.0 are constants, so it did what's known as constant propagation, and then used those constants inside of the function. It realized that the solution is always 9, so it compiled the function that... spits out 9! So it's fast because it's not computing anything. So be very careful about propagation of constants and literals. In general this is a very helpful feature, but when benchmarking this can cause some weird behavior. If a microbenchmark is taking only around a nanosecond or less, check and see if the compiler "fixed" your code!
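One way to keep the compiler from folding the inputs is to read them from a Ref at run time. The sketch below repeats the qinline definition from above so that it is self-contained; the names xref, yref and honest are assumptions for illustration.

```julia
finline(x,y) = x + y
function qinline(x,y)
  a = 4
  b = 2
  c = finline(x,a)
  d = finline(b,c)
  finline(d,y)
end

# Literals: the whole call can be folded to the constant 9.0 at compile time.
cheat() = qinline(1.0,2.0)

# Values stored in a Ref are only known at run time, so the compiler
# has to emit the loads and the additions.
const xref = Ref(1.0)
const yref = Ref(2.0)
honest() = qinline(xref[], yref[])

cheat(), honest()
```

With BenchmarkTools the same idea is usually written as @btime qinline($(Ref(1.0))[], $(Ref(2.0))[]). It is also likely that the earlier @btime qinline(x,y) timing includes the cost of reading the non-constant globals x and y, which would explain its one allocation, so interpolating with $ is the fairer comparison.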

Conclusion

Optimize your serial code before you parallelize. There's a lot to think about.

Discussion Questions

Here are a few discussion questions to think about performance engineering in scientific tasks:

  1. What are the advantages of a Vector{Array} vs a Matrix? What are the disadvantages? (What's different?)

  2. What is a good way to implement a data frame?

  3. What are some good things that come out of generic functions for free? What are some things you should watch out for with generic functions?