Do sparse factors survive a regime change?
Abstract. The "factor zoo" problem in empirical asset pricing is that researchers have proposed hundreds of characteristics that supposedly predict returns, and most of them are noise wearing a lab coat [4]. Machine learning helps by selecting a sparse subset that actually carries signal [3]. RegimeFactorZoo asks the follow-up question that matters for anyone who has to trade the thing: does the sparse set a model selects in calm markets stay selected when the volatility regime flips? This is a build log and a short methods paper, written on fully public data so anyone can reproduce it with make data && make run.
Keywords: factor models, regularization, time-series cross-validation, regime stability, reproducibility.
1. The problem with the zoo
Start from the cross-sectional regression at the heart of empirical asset pricing. For asset $i$ at time $t$, with a vector of firm characteristics $x_{i,t}$, we model the excess return as

$$r_{i,t+1} = x_{i,t}^\top \beta + \varepsilon_{i,t+1}.$$

The Fama-French program [1] says a handful of factors (market, size, value, later profitability and investment) explain most of the cross-section. Cochrane's 2011 presidential address [2] reframed the whole field around one question: which characteristics provide independent information once you control for the others? The honest answer turned out to be "far fewer than the literature claims" [4]. Harvey, Liu, and Zhu showed that once you correct for multiple testing, most published factors should never have cleared the bar.
So the modern move is sparsity. Use a penalized regression, the Lasso [5], that drives most coefficients to exactly zero:

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{NT} \sum_{i,t} \left( r_{i,t+1} - x_{i,t}^\top \beta \right)^2 + \lambda \lVert \beta \rVert_1 .$$

The $\ell_1$ penalty is the whole trick: its constraint region has corners on the axes, so the solution tends to land on them, and you get genuine variable selection rather than just shrinkage. Gu, Kelly, and Xiu [3] showed this family of methods convincingly beats the textbook linear model out of sample, and Baba-Yara's recent work pushes it into a sparse-Bayesian framing [6] where the prior itself encodes "most factors are zero."
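To make the "exactly zero" claim concrete, here is a minimal sketch of the Lasso solved by cyclic coordinate descent on simulated data. Everything in it is illustrative: the function names, the simulated panel, and the value of the penalty are assumptions, not the project's code or data. The point is only that the soft-threshold step sets small coefficients to exactly zero instead of shrinking them toward it.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything inside [-t, t] becomes exactly 0."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n  # per-feature mean square
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the fit with feature j's contribution removed.
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j / n
            # The threshold is lam / 2 because the squared loss is not halved.
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    return beta

# Simulated data (hypothetical): 20 candidate factors, only two carry signal.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[[0, 3]] = [0.5, -0.4]
y = X @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat = lasso_cd(X, y, lam=0.1)
support = sorted(np.flatnonzero(beta_hat).tolist())
print("selected factors:", support)
print("coefficients on selected factors:", beta_hat[support].round(3))
```

The surviving coefficients come out slightly smaller in magnitude than the true values, which is the shrinkage the penalty buys alongside selection. In practice a library solver would replace the hand-rolled loop; note that libraries differ in how they scale the loss, so a given penalty value is not portable across them.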
2. The question nobody loves answering
Here is what bugs me about most of these results: they are reported on the full sample. But a sparse model that selects momentum and value on average over 1963–2024 might be selecting momentum in bull markets and value in crashes, and the average is hiding a regime switch.
So RegimeFactorZoo's contribution is a conditioning step. Define a market state by the volatility regime: concretely, split the sample by VIX percentile (and by the 2008 and 2020 structural breaks). Re-estimate the sparse model within each regime, and compare the selected supports. Let $S_{\text{low}}$ and $S_{\text{high}}$ be the sets of nonzero-coefficient factors in the low- and high-volatility regimes. The question becomes a set-overlap question:

$$J = \frac{\lvert S_{\text{low}} \cap S_{\text{high}} \rvert}{\lvert S_{\text{low}} \cup S_{\text{high}} \rvert}.$$

If $J \approx 1$, sparse selections are regime-stable and the zoo's verdict is robust. If $J$ is small, then "the factors that matter" is a regime-dependent statement, and any strategy built on a single full-sample selection is quietly mis-specified.
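A small sketch of the conditioning step, under two assumptions that are mine and not settled by the text: the regime cut is a single full-sample VIX percentile (nearest-rank), and the overlap score is the Jaccard index (intersection size over union size). The VIX values and factor names below are made up for illustration and are not results.

```python
import math

def regime_labels(vix, cutoff_pct=50):
    """Label each period 'high' if VIX is above the sample percentile, else 'low'."""
    ranked = sorted(vix)
    # Nearest-rank percentile: the smallest value with at least cutoff_pct% at or below it.
    k = max(0, math.ceil(cutoff_pct / 100 * len(ranked)) - 1)
    threshold = ranked[k]
    return ["high" if v > threshold else "low" for v in vix]

def jaccard(a, b):
    """Intersection size over union size; defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)

# Hypothetical monthly VIX readings.
vix = [12, 14, 35, 13, 40, 28]
labels = regime_labels(vix)
print(labels)

# Hypothetical supports, standing in for the nonzero sets from per-regime fits.
s_low = {"mkt", "mom", "value"}
s_high = {"mkt", "value", "profitability", "lowvol"}
overlap = jaccard(s_low, s_high)
print(f"support overlap: {overlap:.2f}")
```

Here two factors are shared out of five selected overall, so the overlap is 0.40. A percentile computed on the full sample uses information from the whole period, which is fine for describing regimes after the fact but would need a rolling threshold in anything traded live.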
3. The WRDS rug-pull (and why the project got better)
Original plan: pull CRSP and Compustat from WRDS, build everything from raw security data. Then I discovered my WRDS master account does not have access over the summer. Mild panic, followed by a strictly better design.
I rebuilt the data layer on fully public sources: the Kenneth French Data Library for factor returns, the Open Source Asset Pricing dataset (Chen & Zimmermann) [7] for 200-plus replicated anomaly portfolios, yfinance for benchmark prices, and FRED for the VIX and macro series. The payoff is not just access — it is reproducibility. A WRDS-based project cannot be re-run by anyone without a WRDS seat. Mine clones and runs from a fresh machine. For a project whose entire point is "stop trusting unreproducible factor claims," depending on a paywalled black box would have been its own punchline.
4. The pipeline
The architecture is deliberately boring, which is the compliment it deserves:
- Data layer. Pull FF3/FF5/momentum from Ken French; pull the OSAP signal panel; align to a monthly grid; cache to parquet.
- Replication. Reproduce the Fama-French three-factor and momentum models, and validate the factor returns against the published series before trusting anything downstream.
- Models. OLS, then Lasso / Ridge / ElasticNet, then gradient-boosted trees (XGBoost), then a sparse-Bayesian factor model in PyMC — a deliberate ladder from interpretable to flexible.
- Evaluation. Walk-forward time-series cross-validation, never random K-fold. This matters more than people admit: shuffling time-series data leaks the future into the training set and produces backtests that are basically fiction. You train on data through month $t$, test on the next $h$ months, and roll forward.
- Regime layer. The VIX-percentile split and the support-overlap analysis above.
- Realism overlay. A transaction-cost charge of 5 bps per unit turnover, because a Sharpe ratio that ignores trading costs is a marketing number.
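Two of the steps above are easy to pin down in a few lines: the walk-forward split and the cost overlay. The sketch below assumes an expanding training window (a fixed-length rolling window is the other common choice) and a portfolio that starts from cash; the function names and the toy numbers are hypothetical, not the project's code.

```python
def walk_forward_splits(n_obs, min_train, test_size):
    """Yield (train_idx, test_idx) pairs; every test index comes after every train index."""
    start = min_train
    while start + test_size <= n_obs:
        train_idx = list(range(0, start))                  # expanding window
        test_idx = list(range(start, start + test_size))   # the next block of periods
        yield train_idx, test_idx
        start += test_size

def net_returns(gross, weights, cost_bps=5.0):
    """Charge cost_bps per unit of turnover against each period's gross return.

    weights[t] is the portfolio held during period t. Turnover is the sum of
    absolute weight changes from the previous period, starting from all cash.
    """
    prev = [0.0] * len(weights[0])
    out = []
    for r, w in zip(gross, weights):
        turnover = sum(abs(new - old) for new, old in zip(w, prev))
        out.append(r - cost_bps * 1e-4 * turnover)
        prev = w
    return out

# Ten periods, at least four to train on, test two at a time.
splits = list(walk_forward_splits(n_obs=10, min_train=4, test_size=2))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx}")

# Two periods, two assets: buy 50/50, then rotate fully into the first asset.
net = net_returns(gross=[0.01, 0.02], weights=[[0.5, 0.5], [1.0, 0.0]])
print([round(x, 4) for x in net])
```

Each of the two rebalances turns over one full unit of the portfolio, so each period gives up 5 bps: 1.00% becomes 0.95% and 2.00% becomes 1.95%.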
5. Status and honest scope
The data layer and the Fama-French replication are the foundation; the ML ladder and the regime analysis are the active build toward a public v1.0 with a one-page memo. I am writing this mid-project on purpose — the point of the site is to show work in motion, not to pretend research arrives fully formed. When v1.0 ships, the headline number will be that support-overlap across regimes, and I have promised myself to report it whether or not it flatters the hypothesis.
6. Why this is the project I care most about
It sits exactly on the intersection I am building a career around: real statistics, real code, real markets, and a reproducibility standard borrowed from the systems world. It is also the artifact behind the substantive note I plan to send Prof. Fahiz Baba-Yara, whose In Search of Sparsity [6] is the most direct intellectual parent of the regime-stability question. If the factors do not survive a regime change, that is not a failure of the project — that is the finding.
References
[1] Fama, E. F., & French, K. R. (1993). Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics, 33(1).
[2] Cochrane, J. H. (2011). Presidential Address: Discount Rates. Journal of Finance, 66(4).
[3] Gu, S., Kelly, B., & Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. Review of Financial Studies, 33(5).
[4] Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1).
[5] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1).
[6] Baba-Yara, F. (2026). In Search of Sparsity: Bayesian Sparse Factor Models and the Factor Zoo. Working paper.
[7] Chen, A. Y., & Zimmermann, T. (2022). Open Source Cross-Sectional Asset Pricing. Critical Finance Review.