
Do sparse factors survive a regime change?

Abstract. The "factor zoo" problem in empirical asset pricing is that researchers have proposed hundreds of characteristics that supposedly predict returns, and most of them are noise wearing a lab coat [4]. Machine learning helps by selecting a sparse subset that actually carries signal [3]. RegimeFactorZoo asks the follow-up question that matters for anyone who has to trade the thing: does the sparse set a model selects in calm markets stay selected when the volatility regime flips? This is a build log and a short methods paper, written on fully public data so anyone can reproduce it with make data && make run.

Keywords: factor models, regularization, time-series cross-validation, regime stability, reproducibility.

1. The problem with the zoo

Start from the cross-sectional regression at the heart of empirical asset pricing. For asset $i$ at time $t$, with a vector of firm characteristics $x_{i,t}$, we model the excess return as

$r_{i,t+1} = x_{i,t}^\top \beta + \varepsilon_{i,t+1}.$

The Fama-French program [1] says a handful of factors — market, size, value, later profitability and investment — explain most of the cross-section. Cochrane's 2011 presidential address [2] reframed the whole field around one question: which characteristics provide independent information once you control for the others? The honest answer turned out to be "far fewer than the literature claims" [4]. Harvey, Liu, and Zhu showed that once you correct for multiple testing, most published factors should never have cleared the bar.

So the modern move is sparsity. Use a penalized regression — Lasso [5] — that drives most coefficients to exactly zero:

$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2N}\sum_{i,t}\bigl(r_{i,t+1} - x_{i,t}^\top \beta\bigr)^2 \;+\; \lambda \lVert \beta \rVert_1.$

The $\ell_1$ penalty is the whole trick: its constraint region has corners on the axes, so the solution tends to land on them, which sets the corresponding coefficients to exactly zero. You get genuine variable selection rather than just shrinkage. Gu, Kelly, and Xiu [3] showed this family of methods convincingly beats the textbook linear model out of sample, and Baba-Yara's recent work pushes it into a sparse-Bayesian framing [6] where the prior itself encodes "most factors are zero."
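To make the "exactly zero" claim concrete, here is a minimal sketch of cyclic coordinate descent on the objective above, in plain Python. It is an illustration under stated assumptions, not the project's estimator: the function names, the tiny orthogonal design matrix, and the value of lam are all made up for the example. Each coordinate update is a soft-threshold, and that operator is where the zeros come from.

```python
def soft_threshold(rho, lam):
    """Closed-form minimiser of the one-coordinate Lasso problem."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0  # the flat part of the operator: weak signals are zeroed out


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2N) * sum((y - X b)^2) + lam * ||b||_1 by cyclic updates.

    X is a list of N rows (each a list of p characteristics); y has N returns.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue
            # covariance of column j with the residual that excludes j's own fit
            rho = sum(c * (e + c * beta[j]) for c, e in zip(col, resid)) / n
            new_bj = soft_threshold(rho, lam) / z
            delta = new_bj - beta[j]
            if delta != 0.0:
                resid = [e - c * delta for c, e in zip(col, resid)]
                beta[j] = new_bj
    return beta


# Toy data (hypothetical): three orthogonal characteristics, four observations.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Returns load strongly on column 0 (2.0), weakly on column 1 (0.1), not on column 2.
y = [2.0 * r[0] + 0.1 * r[1] for r in X]

beta_hat = lasso_coordinate_descent(X, y, lam=0.5)
print(beta_hat)
```

On this toy design the strong coefficient is shrunk from 2.0 to about 1.5, while the weak and absent ones come back as exactly 0.0 rather than merely small. In practice a library solver is the sensible choice; scikit-learn's Lasso minimises the same 1/(2N)-scaled objective, with its alpha parameter playing the role of $\lambda$.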

2. The question nobody loves answering

Here is what bugs me about most of these results: they are reported on the full sample. But a sparse model that selects momentum and value on average over 1963–2024 might be selecting momentum in bull markets and value in crashes, and the average is hiding a regime switch.

So RegimeFactorZoo's contribution is a conditioning step. Define a market state by the volatility regime — concretely, split the sample by VIX percentile (and by the 2008 and 2020 structural breaks). Re-estimate the sparse model within each regime, and compare the selected supports. Let $S_{\text{calm}}$ and $S_{\text{stress}}$ be the sets of nonzero-coefficient factors in low- and high-volatility regimes. The question becomes a set-overlap question:

$J = \frac{|S_{\text{calm}} \cap S_{\text{stress}}|}{|S_{\text{calm}} \cup S_{\text{stress}}|}.$

If $J \approx 1$, sparse selections are regime-stable and the zoo's verdict is robust. If $J$ is small, then "the factors that matter" is a regime-dependent statement, and any strategy built on a single full-sample selection is quietly mis-specified.
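The conditioning step and the overlap statistic can be sketched in a few lines of plain Python. Everything specific here is an assumption for illustration: the 80th-percentile cutoff, the nearest-rank percentile rule, the VIX values, and the factor names and coefficients are all hypothetical, and the helper names are mine rather than the repository's.

```python
import math


def label_regimes(vix, pct=80):
    """Label each observation 'stress' if VIX is above its pct-th percentile.

    Uses the nearest-rank percentile; pct=80 is an illustrative cutoff.
    """
    ordered = sorted(vix)
    k = max(0, math.ceil(pct * len(ordered) / 100) - 1)
    cutoff = ordered[k]
    return ["stress" if v > cutoff else "calm" for v in vix]


def support(coefs, tol=0.0):
    """Names of factors whose fitted coefficient is nonzero (beyond tol)."""
    return {name for name, c in coefs.items() if abs(c) > tol}


def jaccard(a, b):
    """J = |A intersect B| / |A union B| for two selected supports."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention (assumption): two empty supports count as identical
    return len(a & b) / len(union)


# Hypothetical monthly VIX levels.
vix = [12, 14, 13, 35, 16, 15, 40, 18, 17, 11]
labels = label_regimes(vix)

# Hypothetical Lasso coefficients fitted separately within each regime.
calm_coefs = {"momentum": 0.4, "value": 0.0, "size": 0.1, "profitability": 0.0}
stress_coefs = {"momentum": 0.0, "value": 0.3, "size": 0.2, "profitability": 0.0}

S_calm = support(calm_coefs)
S_stress = support(stress_coefs)
J = jaccard(S_calm, S_stress)
print(labels, S_calm, S_stress, J)
```

In this made-up case the calm regime selects momentum and size, the stress regime selects value and size, and only size is shared, so $J = 1/3$. That is the pattern a full-sample fit would blur into "momentum, value, and size all matter."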

3. The WRDS rug-pull (and why the project got better)

Original plan: pull CRSP and Compustat from WRDS, build everything from raw security data. Then I discovered my WRDS master account does not have access over the summer. Mild panic, followed by a strictly better design.

I rebuilt the data layer on fully public sources: the Kenneth French Data Library for factor returns, the Open Source Asset Pricing dataset (Chen & Zimmermann) [7] for 200-plus replicated anomaly portfolios, yfinance for benchmark prices, and FRED for the VIX and macro series. The payoff is not just access — it is reproducibility. A WRDS-based project cannot be re-run by anyone without a WRDS seat. Mine clones and runs from a fresh machine. For a project whose entire point is "stop trusting unreproducible factor claims," depending on a paywalled black box would have been its own punchline.

4. The pipeline

The architecture is deliberately boring, which is the compliment it deserves:

Data layer. Pull FF3/FF5/momentum from Ken French; pull the OSAP signal panel; align to a monthly grid; cache to parquet.
Replication. Reproduce the Fama-French three-factor and momentum models, and validate the factor returns against the published series before trusting anything downstream.
Models. OLS, then Lasso / Ridge / ElasticNet, then gradient-boosted trees (XGBoost), then a sparse-Bayesian factor model in PyMC — a deliberate ladder from interpretable to flexible.
Evaluation. Walk-forward time-series cross-validation, never random K-fold. This matters more than people admit: shuffling time-series data leaks the future into the training set and produces backtests that are basically fiction. You train on $[t_0, t_k]$, test on $(t_k, t_{k+1}]$ so the two windows never share an observation, and roll forward.
Regime layer. The VIX-percentile split and the support-overlap analysis above.
Realism overlay. A transaction-cost charge of 5 bps per unit turnover, because a Sharpe ratio that ignores trading costs is a marketing number.
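Two of the steps above are mechanical enough to sketch directly: the walk-forward splitter and the cost overlay. This is a minimal illustration rather than the project's code. The function names and window sizes are hypothetical, the training window is assumed to expand from the first observation, and turnover is assumed to be the sum of absolute weight changes per period; only the 5 bps charge comes from the text.

```python
def walk_forward_splits(n_obs, min_train, test_size):
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Every test index is strictly later than every training index,
    so no future observation can leak into the fit.
    """
    start = min_train
    while start + test_size <= n_obs:
        train_idx = list(range(0, start))
        test_idx = list(range(start, start + test_size))
        yield train_idx, test_idx
        start += test_size  # roll forward by one test block


def net_returns(gross, turnover, cost_bps=5.0):
    """Subtract a proportional trading cost from each period's gross return.

    turnover is in units of portfolio value traded (assumed definition);
    cost_bps is charged per unit of turnover.
    """
    rate = cost_bps / 10_000
    return [g - rate * t for g, t in zip(gross, turnover)]


# Ten monthly observations, at least four to train on, two per test block.
splits = list(walk_forward_splits(n_obs=10, min_train=4, test_size=2))
for train_idx, test_idx in splits:
    print("train", train_idx[0], "to", train_idx[-1], "| test", test_idx)

# Hypothetical gross returns and turnover for two periods.
net = net_returns(gross=[0.010, -0.004], turnover=[0.5, 2.0])
print(net)
```

With these sizes the splitter produces three folds, testing on observations 4 to 5, 6 to 7, and 8 to 9 in turn. scikit-learn's TimeSeriesSplit offers a comparable expanding-window splitter if a library version is preferred. On the cost side, a period with turnover of 2.0 gives up 10 bps, which is enough to matter when the gross edge is itself measured in basis points.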

5. Status and honest scope

The data layer and the Fama-French replication are the foundation; the ML ladder and the regime analysis are the active build toward a public v1.0 with a one-page memo. I am writing this mid-project on purpose — the point of the site is to show work in motion, not to pretend research arrives fully formed. When v1.0 ships, the headline number will be that support-overlap $J$ across regimes, and I have promised myself to report it whether or not it flatters the hypothesis.

6. Why this is the project I care most about

It sits exactly on the intersection I am building a career around: real statistics, real code, real markets, and a reproducibility standard borrowed from the systems world. It is also the artifact behind the substantive note I plan to send Prof. Fahiz Baba-Yara, whose In Search of Sparsity [6] is the most direct intellectual parent of the regime-stability question. If the factors do not survive a regime change, that is not a failure of the project — that is the finding.

References

[1] Fama, E. F., & French, K. R. (1993). Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics, 33(1).
[2] Cochrane, J. H. (2011). Presidential Address: Discount Rates. Journal of Finance, 66(4).
[3] Gu, S., Kelly, B., & Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. Review of Financial Studies, 33(5).
[4] Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1).
[5] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1).
[6] Baba-Yara, F. (2026). In Search of Sparsity: Bayesian Sparse Factor Models and the Factor Zoo. Working paper.
[7] Chen, A. Y., & Zimmermann, T. (2022). Open Source Cross-Sectional Asset Pricing. Critical Finance Review.

Code: github.com/pbathuri/RegimeFactorZoo