Research

What the L1 cache is really doing to your finance kernels

Abstract. Quantitative finance loves to talk about models and rarely about memory. But the same factor model or option pricer that looks elegant on a whiteboard spends most of its wall-clock time waiting for data to arrive from the right level of the memory hierarchy. qhpc_cache is an empirical study of four representative numerical-finance kernels — Cholesky factorization, Monte Carlo simulation, GARCH estimation, and general matrix multiply (GEMM) — instrumented with PAPI hardware counters on an AMD EPYC node of Indiana University's BigRed200. The goal was narrow and deliberately unglamorous: measure what the L1 cache actually does, and only make claims I can trace back to a counter reading.

Keywords: memory hierarchy, arithmetic intensity, reuse distance, roofline, PAPI.

1. Motivation, or: the model is not the bottleneck

Here is the uncomfortable thing about numerical finance. You can derive the cleanest possible estimator, and then watch it run at five percent of a CPU's theoretical throughput because every inner-loop access misses cache and stalls on DRAM. The machine does not care that your covariance matrix is positive definite. It cares whether the bytes you need next are already close.

The quantity that decides this is arithmetic intensity — the ratio of floating-point work to memory traffic:

$I = \frac{\text{FLOPs}}{\text{bytes moved from memory}} \quad \left[\frac{\text{flop}}{\text{byte}}\right].$

A kernel with high $I$ reuses each loaded byte many times and can run near the processor's peak FLOP rate. A kernel with low $I$ is memory-bound: it is throttled by bandwidth, and no amount of clever vectorization saves it. The Roofline model [4] makes this precise — attainable performance is

$P \le \min\!\left(P_{\text{peak}},\; \beta \cdot I\right),$

where $P_{\text{peak}}$ is peak compute and $\beta$ is peak memory bandwidth. Everything in this study is really an argument about where each finance kernel lands under that roofline.

2. The four kernels (and why these four)

I picked kernels that show up, in some disguise, in almost every quant pipeline:

Cholesky factorization ( $A = LL^\top$ ). The workhorse for sampling correlated returns and solving the normal equations. Flop count $\approx n^3/3$ [2], with a beautiful blocked structure that should be cache-friendly if you let it be.
Monte Carlo simulation. The engine of derivatives pricing and risk [5]. Embarrassingly parallel, but its memory behavior depends entirely on whether your path state fits in cache.
GARCH(1,1) estimation. The standard volatility model [3]. A fundamentally sequential recursion, $\sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2$ , which is a very different beast from the matrix kernels.
GEMM ( $C \leftarrow \alpha AB + \beta C$ ). The most-optimized kernel in computing history [6], and the yardstick everything else is measured against. $2n^3$ flops, and the textbook example of a kernel that can be made compute-bound through blocking.

The point of the spread is contrast: two dense-linear-algebra kernels that can be tuned to high intensity, one MC kernel whose behavior is data-layout-dependent, and one stubbornly sequential recursion.

3. Method

All runs were on a single AMD EPYC (Zen 2) node of BigRed200. I read hardware counters directly with PAPI [1] — PAPI_L1_DCM (L1 data cache misses), PAPI_L2_DCM, PAPI_TOT_INS, and PAPI_FP_OPS — and computed a measured miss ratio

$m_{L1} = \frac{\text{L1 data misses}}{\text{L1 data accesses}}.$

For each kernel I swept problem size so the working set crossed the cache capacity boundaries on purpose. The interesting physics happens exactly when the working set stops fitting in a given level — that is where reuse distance [the number of distinct addresses touched between two accesses to the same address] starts exceeding cache capacity, and the miss curve has a knee.

Every configuration was pinned, repeated, and logged with a run ID. If a number appears in the report, there is a counter dump behind it. This was the entire discipline of the project: no plot without provenance.

4. What I found

A few results were predictable. Several were humbling.

GEMM and blocked Cholesky behave. Once tiled to fit the L1/L2 working set, both reuse each loaded block $O(\text{block size})$ times, push arithmetic intensity up, and ride near the compute roofline. This is the Goto–van de Geijn story [6] reproduced on a finance pretext, and it worked as advertised.

Naive Monte Carlo is quietly memory-bound. When per-path state is small and laid out contiguously, MC is compute-bound and scales beautifully. But the moment path state or per-step lookup tables spill out of L1, the miss ratio jumps and the kernel falls off the roofline — despite being "embarrassingly parallel." Parallelism does not rescue a kernel that is starving for bandwidth; it just lets more cores starve simultaneously.

GARCH is a different animal entirely. The recursion is sequential and its per-step working set is tiny, so it basically lives in L1 and is bound by dependency latency, not bandwidth. The cache barely matters; the critical path does. This was the kernel that reminded me a single "performance" framing does not fit all of finance.

The headline, stated honestly: three of these four kernels are decided by memory behavior, not flop count, and the one exception (GARCH) is decided by latency, not flops either. Almost none of it is about the arithmetic the textbook emphasizes.

5. Limitations (the part I will not hide)

This is an undergraduate empirical study, not a systems-conference paper. Single architecture, single node, reference (not production-tuned) kernel implementations. The GARCH numbers in particular are sensitive to compiler autovectorization, which I did not exhaustively control. Reuse-distance estimates are inferred from miss curves rather than measured with a full memory trace. I would treat the directions as solid and the exact magnitudes as architecture-specific.

6. Where this goes

The natural next step is the one that connects to everything else I work on: take the Monte Carlo branch and make it cache-aware by construction — restructure path state and RNG streams so the hot working set stays resident — then carry the same locality lens into quantum and hybrid Monte Carlo for derivatives pricing, which is exactly the bridge I am building in Prof. Hong's lab. A kernel that respects the memory hierarchy on a classical core is the right mental model for respecting the much scarcer coherence budget on a quantum one.

If there is one thing to take away: before you optimize the model, measure what the cache is doing. The machine has opinions, and a hardware counter will tell you them for free.

References

Terpstra, D., Jagode, H., You, H., & Dongarra, J. (2010). Collecting Performance Data with PAPI-C. Tools for High Performance Computing.
Golub, G. H., & Van Loan, C. F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.
Bollerslev, T. (1986). Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics, 31(3).
Williams, S., Waterman, A., & Patterson, D. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4).
Glasserman, P. (2003). Monte Carlo Methods in Financial Engineering. Springer.
Goto, K., & van de Geijn, R. (2008). Anatomy of High-Performance Matrix Multiplication. ACM Transactions on Mathematical Software, 34(3).

Code: github.com/pbathuri/finance-cache-hpc