Benchmarks

All benchmarks were run on an Apple M2 Max, macOS 26, single-threaded, with Julia 1.12.5 and AppleAccelerate.jl v0.6.0. Times are the minimum of 5 trials.

Run the full benchmark suite with:

```sh
julia --project=test/bench test/bench/run_benchmarks.jl
```

Array Operations

Performance comparison of vDSP array operations against their Julia Base equivalents. Source: bench_array.jl.

Unary Math Functions

Transcendental functions show the biggest gains: vDSP is roughly 7× faster than Base for sin/cos on Float64 and nearly 19× faster on Float32:

| Op | Type | N | vDSP (μs) | Julia (μs) | Speedup |
|------|---------|---------|-----------|------------|---------|
| exp | Float64 | 100,000 | 171 | 440 | 2.6× |
| log | Float64 | 100,000 | 191 | 652 | 3.4× |
| sin | Float64 | 100,000 | 152 | 1,051 | 6.9× |
| cos | Float64 | 100,000 | 160 | 1,083 | 6.8× |
| sqrt | Float64 | 100,000 | 42 | 97 | 2.3× |
| exp | Float32 | 100,000 | 60 | 464 | 7.8× |
| log | Float32 | 100,000 | 80 | 542 | 6.8× |
| sin | Float32 | 100,000 | 54 | 1,012 | 18.8× |
| cos | Float32 | 100,000 | 55 | 1,035 | 18.9× |
| sqrt | Float32 | 100,000 | 40 | 98 | 2.5× |
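As a usage sketch, the vDSP path goes through AppleAccelerate.jl's array-level overloads. A minimal example, assuming the package's AppleAccelerate.sin method for arrays (macOS only):

```julia
using AppleAccelerate   # macOS only; wraps Accelerate's vectorized math

X = rand(Float32, 100_000)

y_base = sin.(X)                   # Julia Base: scalar sin broadcast over X
y_accel = AppleAccelerate.sin(X)   # Accelerate: one call over the whole array

@assert y_base ≈ y_accel
```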

Reductions

sum/maximum/minimum are 1.1–1.2× faster. Note: dot is slower via vDSP because Julia's LinearAlgebra.dot already uses the Accelerate-forwarded BLAS:

| Op | Type | N | vDSP (μs) | Julia (μs) | Speedup |
|---------|---------|-----------|-----------|------------|---------|
| sum | Float64 | 1,000,000 | 128 | 143 | 1.1× |
| maximum | Float64 | 1,000,000 | 128 | 136 | 1.1× |
| minimum | Float64 | 1,000,000 | 127 | 134 | 1.1× |
| sum | Float32 | 1,000,000 | 62 | 77 | 1.2× |
| maximum | Float32 | 1,000,000 | 62 | 71 | 1.1× |
| minimum | Float32 | 1,000,000 | 63 | 71 | 1.1× |

Binary Element-wise Ops

Addition and multiplication are memory-bandwidth-bound for Float64; vDSP is faster for Float32 at large sizes:

| Op | Type | N | vDSP (μs) | Julia (μs) | Speedup |
|------|---------|-----------|-----------|------------|---------|
| vadd | Float64 | 1,000,000 | 354 | 360 | 1.0× |
| vmul | Float64 | 1,000,000 | 343 | 340 | 1.0× |
| vadd | Float32 | 1,000,000 | 133 | 167 | 1.3× |
| vmul | Float32 | 1,000,000 | 87 | 167 | 1.9× |

Benchmark environment

Julia reference uses map(Base.f, X) for unary ops and @inbounds @simd loops for binary/compound ops. Source: bench_array.jl.
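The reference implementations can be sketched as follows (the helper names julia_unary and julia_binary! are illustrative, not taken from the benchmark script):

```julia
# "Julia" column reference kernels: map for unary ops,
# an @inbounds @simd loop for binary ops.
julia_unary(f, X) = map(f, X)

function julia_binary!(out, f, X, Y)
    @inbounds @simd for i in eachindex(out, X, Y)
        out[i] = f(X[i], Y[i])
    end
    return out
end

X, Y = rand(Float32, 1_000), rand(Float32, 1_000)
Z = julia_binary!(similar(X), +, X, Y)
```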

Dense Linear Algebra

Performance comparison of Apple Accelerate vs OpenBLAS. The dense benchmark script loads OpenBLAS first, then switches to Accelerate via LBT (libblastrampoline), so both are measured in the same process. Source: bench_dense.jl.
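The in-process switch relies on libblastrampoline: loading AppleAccelerate redirects the active BLAS/LAPACK backend. A sketch of the ordering (the comments describe typical output, not exact strings):

```julia
using LinearAlgebra
BLAS.get_config()       # reports the default LBT backend (OpenBLAS)

# ... OpenBLAS timings are collected here ...

using AppleAccelerate   # repoints LBT at Accelerate's BLAS/LAPACK
BLAS.get_config()       # now reports Accelerate
```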

GEMM (mul!) — GFLOPS (higher is better)

Accelerate is 6–14× faster for matrix multiply, with the largest gains for Float32 due to AMX/NEON support:

| Type | N | OpenBLAS (GFLOPS) | Accelerate (GFLOPS) | Speedup |
|---------|-------|-------------------|---------------------|---------|
| Float64 | 64 | 17 | 180 | 10.6× |
| Float64 | 256 | 31 | 251 | 8.0× |
| Float64 | 1,024 | 36 | 258 | 7.2× |
| Float64 | 4,096 | 36 | 241 | 6.6× |
| Float32 | 64 | 62 | 406 | 6.5× |
| Float32 | 256 | 70 | 889 | 12.7× |
| Float32 | 1,024 | 73 | 1,029 | 14.1× |
| Float32 | 4,096 | 73 | 906 | 12.4× |
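The GFLOPS figures follow the standard 2N³ flop count for an N×N matrix multiply. A simplified sketch of the measurement (the real loop lives in bench_dense.jl):

```julia
using LinearAlgebra

function gemm_gflops(::Type{T}, N; trials = 5) where {T}
    A, B, C = rand(T, N, N), rand(T, N, N), zeros(T, N, N)
    mul!(C, A, B)  # warm-up so compilation is not timed
    t = minimum(@elapsed(mul!(C, A, B)) for _ in 1:trials)
    return 2N^3 / t / 1e9  # an N×N GEMM costs ~2N³ flops
end

gemm_gflops(Float32, 256)
```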

Factorizations (Float64) — time in μs (lower is better)

| Operation | N | OpenBLAS (μs) | Accelerate (μs) | Speedup |
|-----------|-------|---------------|-----------------|---------|
| LU | 256 | 639 | 285 | 2.2× |
| LU | 1,024 | 24,639 | 15,471 | 1.6× |
| LU | 2,048 | 182,654 | 92,604 | 2.0× |
| QR | 512 | 8,193 | 4,559 | 1.8× |
| QR | 2,048 | 407,071 | 200,736 | 2.0× |
| Cholesky | 256 | 359 | 114 | 3.1× |
| Cholesky | 1,024 | 13,647 | 3,801 | 3.6× |
| SVD | 256 | 16,459 | 8,423 | 2.0× |
| SVD | 512 | 100,127 | 40,355 | 2.5× |
| SVD | 1,024 | 625,263 | 208,055 | 3.0× |

Linear Solve (A\b) — time in μs (lower is better)

| Type | N | OpenBLAS (μs) | Accelerate (μs) | Speedup |
|---------|-------|---------------|-----------------|---------|
| Float64 | 256 | 647 | 323 | 2.0× |
| Float64 | 1,024 | 24,789 | 7,310 | 3.4× |
| Float64 | 2,048 | 175,949 | 50,714 | 3.5× |
| Float32 | 256 | 448 | 176 | 2.5× |
| Float32 | 1,024 | 14,337 | 3,626 | 4.0× |
| Float32 | 2,048 | 95,772 | 21,360 | 4.5× |

Benchmark environment

LinearAlgebra.jl v1.12.0 (OpenBLAS 0.3.29). OpenBLAS is benchmarked before AppleAccelerate is loaded; Accelerate is benchmarked afterward, once AppleAccelerate has forwarded BLAS to Accelerate via LBT. Source: bench_dense.jl.

Sparse Linear Algebra

Performance comparison of Apple Sparse Solvers vs SuiteSparse (CHOLMOD/UMFPACK). The sparse benchmark script runs SuiteSparse first (before loading AppleAccelerate), so SuiteSparse uses OpenBLAS internally, then loads AppleAccelerate and re-runs with Apple's sparse solvers. Source: bench_sparse.jl.
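The problem setup can be sketched as follows (the diagonal shift is an illustrative way to keep the random system well-conditioned; the actual matrices come from bench_sparse.jl):

```julia
using SparseArrays, LinearAlgebra

N = 1_000
A = sprandn(N, N, 0.01) + 10I   # density 0.01, shifted so the solve is well-posed
b = randn(N)

y = A * b    # sparse matrix-vector multiply
x = A \ b    # sparse LU solve (UMFPACK until AppleAccelerate is loaded)
```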

Sparse Matrix-Vector Multiply (density=0.01)

SuiteSparse CSC SpMV is 2.5–4.6× faster due to its simpler data layout:

| Type | N | Apple (μs) | SuiteSparse (μs) | Ratio |
|---------|--------|------------|------------------|-------|
| Float64 | 1,000 | 35 | 10 | 0.30× |
| Float64 | 10,000 | 2,998 | 728 | 0.24× |
| Float64 | 50,000 | 76,061 | 30,615 | 0.40× |
| Float32 | 1,000 | 35 | 10 | 0.29× |
| Float32 | 10,000 | 2,938 | 644 | 0.22× |

QR Factorize + Solve

SuiteSparse LU (\) is faster for Float64. Apple QR wins for Float32 at every size tested (up to 1.8×):

| Type | N | Apple (μs) | SuiteSparse (μs) | Speedup |
|---------|-------|------------|------------------|---------|
| Float64 | 500 | 3,057 | 2,610 | 0.85× |
| Float64 | 2,000 | 148,372 | 98,608 | 0.66× |
| Float64 | 5,000 | 2,481,285 | 1,834,568 | 0.74× |
| Float32 | 1,000 | 11,798 | 16,460 | 1.40× |
| Float32 | 2,000 | 70,343 | 124,074 | 1.76× |
| Float32 | 5,000 | 1,307,779 | 1,800,863 | 1.38× |

Cholesky Factorize + Solve

Apple Cholesky is faster at N=5000, SuiteSparse faster at smaller sizes:

| Type | N | Apple (μs) | SuiteSparse (μs) | Speedup |
|---------|-------|------------|------------------|---------|
| Float64 | 500 | 1,957 | 1,546 | 0.79× |
| Float64 | 2,000 | 52,412 | 46,820 | 0.89× |
| Float64 | 5,000 | 344,766 | 505,765 | 1.47× |
| Float32 | 2,000 | 42,128 | 33,195 | 0.79× |
| Float32 | 5,000 | 273,444 | 339,105 | 1.24× |

Benchmark environment

SparseArrays.jl v1.12.0 (SuiteSparse 7.8.3). Matrices have density 0.01. Source: bench_sparse.jl.

Benchmark limitations

These benchmarks use random sparse matrices (sprandn) which lack the structure found in real-world problems (e.g., banded, block-diagonal, or mesh-derived sparsity patterns). The matrix sizes tested (N up to 5,000–50,000) are also modest by sparse solver standards. Performance on structured problems from applications like FEM, circuit simulation, or graph analysis may differ significantly.

FFT

Pre-planned FFT performance comparing Apple vDSP against FFTW, both single-threaded. Source: bench_fft.jl.
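Both libraries separate planning from execution: a plan is built once and then reused for every transform of that shape. On the FFTW side this looks like the following (assumes the FFTW.jl package; the vDSP plans go through AppleAccelerate.jl's own plan API, not shown):

```julia
using FFTW

x = randn(ComplexF64, 4_096)
p = plan_fft(x; flags = FFTW.MEASURE)   # plan once, up front
y = p * x                               # reuse the plan per transform
```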

Complex 1D FFT

vDSP and FFTW are closely matched. vDSP is notably faster at N=4096; at larger sizes FFTW has a slight edge:

| Type | N | vDSP (μs) | FFTW (μs) | Speedup |
|------------|-----------|-----------|-----------|---------|
| ComplexF64 | 1,024 | 7.0 | 7.1 | 1.01× |
| ComplexF64 | 4,096 | 20.5 | 34.1 | 1.67× |
| ComplexF64 | 65,536 | 718 | 653 | 0.91× |
| ComplexF64 | 1,048,576 | 18,767 | 17,208 | 0.92× |
| ComplexF32 | 1,024 | 3.1 | 3.1 | 1.01× |
| ComplexF32 | 4,096 | 11.5 | 18.4 | 1.61× |
| ComplexF32 | 65,536 | 329 | 323 | 0.98× |
| ComplexF32 | 1,048,576 | 4,891 | 4,753 | 0.97× |

Real FFT

For Float32, vDSP is faster at larger sizes (up to 2.2×). Float64 rfft favors FFTW:

| Type | N | vDSP (μs) | FFTW (μs) | Speedup |
|---------|---------|-----------|-----------|---------|
| Float64 | 1,024 | 3.8 | 1.6 | 0.42× |
| Float64 | 65,536 | 360 | 209 | 0.58× |
| Float32 | 4,096 | 7.0 | 7.6 | 1.09× |
| Float32 | 65,536 | 115 | 169 | 1.47× |
| Float32 | 262,144 | 539 | 1,166 | 2.16× |

Complex 2D FFT

Nearly identical performance:

| Type | Size | vDSP (μs) | FFTW (μs) | Speedup |
|------------|---------|-----------|-----------|---------|
| ComplexF64 | 64×64 | 47 | 47 | 1.00× |
| ComplexF64 | 256×256 | 4,659 | 4,708 | 1.01× |
| ComplexF32 | 64×64 | 18 | 22 | 1.19× |
| ComplexF32 | 256×256 | 321 | 325 | 1.01× |

Benchmark environment

FFTW.jl v1.10.0. Both use pre-planned transforms. Source: bench_fft.jl.