Internals (Private)

Octavian._matmul_serial!Method

matmul_serial!(C, A, B[, α = 1, β = 0])

Calculates C = α * (A * B) + β * C in place.

A single threaded matrix-matrix-multiply implementation. Supports dynamically and statically sized arrays.

Organizationally, matmul_serial! checks the arrays properties to try and dispatch to an appropriate implementation. If the arrays are small and statically sized, it will dispatch to an inlined multiply.

Otherwise, based on the array's size, whether they are transposed, and whether the columns are already aligned, it decides to not pack at all, to pack only A, or to pack both arrays A and B.

source
Octavian.divide_blocksMethod

divide_blocks(M, Ntotal, _nspawn, W)

Splits both M and N into blocks when trying to spawn a large number of threads relative to the size of the matrices.

source
Octavian.find_first_acceptableMethod

findfirstacceptable(M, W)

Finds first combination of Miter and Niter that doesn't make M too small while producing Miter * Niter = num_cores(). This would be awkard if there are computers with prime numbers of cores. I should probably consider that possibility at some point.

source
Octavian.matmul_sizesMethod

Checks sizes for compatibility, and preserves the static size information if given a mix of static and dynamic sizes.

source
Octavian.reset_bcache_lock!Method

resetbcachelock!()

Currently not using try/finally in matmul routine, despite locking. So if it errors for some reason, you may need to manually call reset_bcache_lock!().

source
Octavian.solve_block_sizesMethod

solveblocksizes(::Val{T}, M, K, N, α, β, R₂, R₃)

This function returns iteration/blocking descriptions Mc, Kc, and Nc for use when packing both A and B. It tries to roughly minimize the cost

MKN / (Kc * W) + α * MKN / Mc + β * MKN / Nc

subject to constraints

Mc - M ≤ 0
Kc - K ≤ 0
Nc - N ≤ 0
Mc * Kc - L₁ₑ ≤ 0
Kc * Nc - L₂ₑ ≤ 0

That is, our constraints say that our block sizes shouldn't be bigger than the actual dimensions, and also that our packed A (Mc × Kc) should fit into the first packing cache (generally, actually the L₂, and our packed B (Kc × Nc) should fit into the second packing cache (generally the L₃).

Our cost model consists of three components:

  1. Cost of moving data in and out of registers. This is done (M/Mᵣ * K/Kc * N/Nᵣ) times and the cost per is (Mᵣ/W * Nᵣ).
  2. Cost of moving strips from B pack from the low cache levels to the highest cache levels when multiplying Aₚ * Bₚ. This is done (M / Mc * K / Kc * N / Nc) times, and the cost per is proportional to (Kc * Nᵣ). α is the proportionality-constant parameter.
  3. Cost of packing A. This is done (M / Mc * K / Kc * N / Nc) times, and the cost per is proportional to (Mc * Kc). β is the proportionality-constant parameter.

As W is a constant, we multiply the cost by W and absorb it into α and β. We drop it from the description from here on out.

In the full problem, we would have Lagrangian, with μ < 0: f((Mc,Kc,Nc),(μ₁,μ₂,μ₃,μ₄,μ₅)) MKN/Kc + α * MKN/Mc + β * MKN/Nc - μ₁(Mc - M) - μ₂(Kc - K) - μ₃(Nc - N) - μ₄(McKc - L2) - μ₅(KcNc - L3)

0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₁ - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₂ - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₃ - μ₅ * Kc
0 = ∂L / ∂μ₁ = M - Mc
0 = ∂L / ∂μ₂ = K - Kc
0 = ∂L / ∂μ₃ = N - Nc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * Nc

The first 3 constraints complicate things, because they're trivially solved by setting M = Mc, K = Kc, and N = Nc. But this will violate the last two constraints in general; normally we will be on the interior of the inequalities, meaning we'd be dropping those constraints. Doing so, this leaves us with:

First, lets just solve the cost w/o constraints 1-3

0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Nc
0 = ∂L / ∂Nc = -β * MKN / Nc² - μ₅ * Kc
0 = ∂L / ∂μ₄ = L₁ₑ - Mc * Kc
0 = ∂L / ∂μ₅ = L₂ₑ - Kc * Nc

Solving:

Mc = √(L₁ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₂ₑ)
Kc = √(L₁ₑ) * √(L₂ₑ) / √(L₁ₑ * β + L₂ₑ * α)
Nc = √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ)
μ₄ = -K * √(L₂ₑ) * M * N * α / (L₁ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))
μ₅ = -K * √(L₁ₑ) * M * N * β / (L₂ₑ^(3 / 2) * √(L₁ₑ * β + L₂ₑ * α))

These solutions are indepedent of matrix size. The approach we'll take here is solving for Nc, Kc, and then finally Mc one after the other, incorporating sizes.

Starting with N, we check how many iterations would be implied by Nc, and then choose the smallest value that would yield that number of iterations. This also ensures that Nc ≤ N.

Niter = cld(N, √(L₂ₑ) * √(L₁ₑ * β + L₂ₑ * α) / √(L₁ₑ))
Nblock, Nrem = divrem(N, Niter)
Nblock_Nrem = Nblock + (Nrem > 0)

We have Nrem iterations of size Nblock_Nrem, and Niter - Nrem iterations of size Nblock.

We can now make Nc = Nblock_Nrem a constant, and solve the remaining three equations again:

0 = ∂L / ∂Mc = -α * MKN / Mc² - μ₄ * Kc
0 = ∂L / ∂Kc = -MKN / Kc² - μ₄ * Mc - μ₅ * Ncm
0 = ∂L / ∂μ₄ = L₂ₑ - Mc * Kc

yielding

Mc = √(L₁ₑ) * √(α)
Kc = √(L₁ₑ) / √(α)
μ₄ = -K * M * N * √(α) / L₁ₑ^(3 / 2)

We proceed in the same fashion as for Nc, being sure to reapply the Kc * Nc ≤ L₂ₑ constraint:

Kiter = cld(K, min(√(L₁ₑ) / √(α), L₂ₑ / Nc))
Kblock, Krem = divrem(K, Ki)
Kblock_Krem = Kblock + (Krem > 0)

This leaves Mc partitioning, for which, for which we use the constraint Mc * Kc ≤ L₁ₑ to set the initial number of proposed iterations as cld(M, L₁ₑ / Kcm) for calling split_m. # approximate ceil

Mbsize, Mrem, Mremfinal, Mblocks = split_m(M, cld(M, L₁ₑ / Kcm), StaticInt{W}())

Note that for synchronization on B, all threads must have the same values for Kc and Nc. K and N will be equal between threads, but M may differ. By calculating Kc and Nc independently of M, this algorithm guarantees all threads are on the same page.

source
Octavian.split_mMethod
split_m(M, Miters_base, W)

Splits M into at most Miters_base iterations. For example, if we wish to divide 517 iterations into roughly 7 blocks using multiples of 8:

julia> split_m(517, 7, 8)
(72, 2, 69, 7)

This suggests we have base block sizes of size 72, with two iterations requiring an extra remainder of 8 ( = W), and a final block of 69 to handle the remainder. It also tells us that there are 7 total iterations, as requested.

julia> 80 * 2 + 72 * (7 - 2 - 1) + 69
517

This is meant to specify roughly the requested amount of blocks, and return relatively even sizes.

This method is used fairly generally.

source