Pradyot Bathuri

Do sparse factors survive a regime change?

Abstract. The "factor zoo" problem in empirical asset pricing is that researchers have proposed hundreds of characteristics that supposedly predict returns, and most of them are noise wearing a lab coat [4]. Machine learning helps by selecting a sparse subset that actually carries signal [3]. RegimeFactorZoo asks the follow-up question that matters for anyone who has to trade the thing: does the sparse set a model selects in calm markets stay selected when the volatility regime flips? This is a build log and a short methods paper, written on fully public data so anyone can reproduce it with make data && make run.

Keywords: factor models, regularization, time-series cross-validation, regime stability, reproducibility.

1. The problem with the zoo

Start from the cross-sectional regression at the heart of empirical asset pricing. For asset $i$ at time $t$, with a vector of firm characteristics $x_{i,t}$, we model the excess return as

$$r_{i,t+1} = x_{i,t}^\top \beta + \varepsilon_{i,t+1}.$$
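
For reference, the same model can be written in stacked form. This is a standard restatement rather than anything specific to this project, and the symbols $N$ (number of asset-period observations), $K$ (number of characteristics), $r$, and $X$ are notation I am introducing here. Stack the returns $r_{i,t+1}$ into $r \in \mathbb{R}^{N}$ and the rows $x_{i,t}^\top$ into $X \in \mathbb{R}^{N \times K}$. Assuming $X^\top X$ is invertible, the unpenalized least-squares benchmark is

$$
r = X\beta + \varepsilon, \qquad
\hat{\beta}_{\text{OLS}} = \arg\min_{\beta}\; \frac{1}{2N}\lVert r - X\beta \rVert_2^2 = (X^\top X)^{-1} X^\top r.
$$

With hundreds of correlated characteristics, $X^\top X$ is badly conditioned and every coefficient comes out nonzero, which is the practical reason a selection step is needed.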

The Fama-French program [1] says a handful of factors — market, size, value, later profitability and investment — explain most of the cross-section. Cochrane's 2011 presidential address [2] reframed the whole field around one question: which characteristics provide independent information once you control for the others? The honest answer turned out to be "far fewer than the literature claims" [4]. Harvey, Liu, and Zhu showed that once you correct for multiple testing, most published factors should never have cleared the bar.
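
To make the multiple-testing point concrete, here is a minimal sketch of two generic corrections, Bonferroni and Benjamini-Hochberg. It is an illustration of the idea, not a reproduction of the procedure in [4], and the p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Indices of tests that clear the Bonferroni bar alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Rank hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its threshold.
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])


# Hypothetical p-values for eight candidate factors (not real results).
p_values = [0.001, 0.008, 0.02, 0.03, 0.04, 0.045, 0.2, 0.5]

naive = [i for i, p in enumerate(p_values) if p <= 0.05]
print("naive 5% test:     ", len(naive), "factors pass")
print("Bonferroni:        ", len(bonferroni(p_values)), "factors pass")
print("Benjamini-Hochberg:", len(benjamini_hochberg(p_values)), "factors pass")
```

On these invented inputs, six factors pass the naive 5% test, but only one survives Bonferroni and two survive Benjamini-Hochberg. That is the shape of the argument: the bar moves with the number of hypotheses tested.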

So the modern move is sparsity. Use a penalized regression — Lasso [5] — that drives most coefficients to exactly zero:

$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{2N}\sum_{i,t}\bigl(r_{i,t+1} - x_{i,t}^\top \beta\bigr)^2 \;+\; \lambda \lVert \beta \rVert_1.$$

The $\ell_1$ penalty is the whole trick: its constraint region has corners on the coordinate axes, so the solution tends to land on them, and you get genuine variable selection rather than just shrinkage. Gu, Kelly, and Xiu [3] showed this family of methods convincingly beats the textbook linear model out of sample, and Baba-Yara's recent work pushes it into a sparse-Bayesian framing [6] where the prior itself encodes "most factors are zero."
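
A small sketch shows the difference between selection and shrinkage. It assumes an orthonormal design, meaning $\frac{1}{N} X^\top X = I$, which real characteristics do not satisfy. Under that assumption the Lasso objective above separates by coordinate, and each coefficient is the OLS estimate passed through a soft-threshold at $\lambda$. Ridge with penalty $\frac{\lambda}{2}\lVert\beta\rVert_2^2$ is included for contrast. The factor names and coefficient values are hypothetical.

```python
def soft_threshold(b, lam):
    """Lasso coefficient under an orthonormal design: shrink by lam, clip at 0."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0


def ridge_shrink(b, lam):
    """Ridge coefficient in the same setup with penalty (lam / 2) * b**2."""
    return b / (1.0 + lam)


# Hypothetical OLS coefficients (illustrative values, not estimates).
ols = {"market": 0.80, "size": -0.05, "value": 0.30, "noise_1": 0.02}
lam = 0.10

lasso = {name: soft_threshold(b, lam) for name, b in ols.items()}
ridge = {name: ridge_shrink(b, lam) for name, b in ols.items()}

# The selected support is the set of coefficients that are exactly nonzero.
selected = [name for name, b in lasso.items() if b != 0.0]
print("lasso support:", selected)
print("ridge nonzeros:", sum(b != 0.0 for b in ridge.values()))
```

Any coefficient whose OLS magnitude is below $\lambda$ is set to exactly zero by the soft-threshold, while ridge scales every coefficient down and zeroes none of them. With correlated regressors there is no closed form and the solution is found iteratively, but the same thresholding step sits inside coordinate-descent solvers.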

2. The question nobody loves answering

Here is what bugs me about most of these results: they are reported on the full sample. But a sparse model that selects momentum and value on average over 1963–2024 might be selecting momentum in bull markets and value in crashes, and the average is hiding a regime switch.

So RegimeFactorZoo's contribution is a conditioning step. Define a market state by the volatility regime: concretely, split the sample by VIX percentile (and by the 2008 and 2020 structural breaks). Re-estimate the sparse model within each regime, and compare the selected supports. Let $S_{\text{calm}}$ and $S_{\text{stress}}$ be the sets of nonzero-coefficient factors in the low- and high-volatility regimes. The question becomes a set-overlap question:

$$J = \frac{|S_{\text{calm}} \cap S_{\text{stress}}|}{|S_{\text{calm}} \cup S_{\text{stress}}|}.$$

If $J \approx 1$, sparse selections are regime-stable and the zoo's verdict is robust. If $J$ is small, then "the factors that matter" is a regime-dependent statement, and any strategy built on a single full-sample selection is quietly mis-specified.
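
The conditioning step and the overlap measure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's pipeline code. The 80th-percentile cutoff, the function names, the VIX values, and the coefficient values are all hypothetical, and treating two empty supports as $J = 1$ is a convention I chose for the sketch.

```python
def label_regimes(vix, pct=0.8):
    """Label observations 'stress' at or above the pct-quantile of VIX, else 'calm'."""
    ranked = sorted(vix)
    # Nearest-rank quantile; the 0.8 default is an illustrative assumption.
    cutoff = ranked[min(len(ranked) - 1, int(pct * len(ranked)))]
    return ["stress" if v >= cutoff else "calm" for v in vix]


def support(coefs, tol=0.0):
    """Set of factor names whose coefficient magnitude exceeds tol."""
    return {name for name, b in coefs.items() if abs(b) > tol}


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


# Hypothetical VIX readings, one per period.
vix = [12, 14, 13, 35, 16, 15, 42, 18, 17, 20]
labels = label_regimes(vix)

# Hypothetical Lasso coefficients fitted separately within each regime.
calm = {"mom": 0.4, "value": 0.0, "size": 0.1, "profitability": 0.2}
stress = {"mom": 0.0, "value": 0.3, "size": 0.0, "profitability": 0.25}

s_calm, s_stress = support(calm), support(stress)
J = jaccard(s_calm, s_stress)
print("stress periods:", labels.count("stress"))
print("only calm:", sorted(s_calm - s_stress))
print("only stress:", sorted(s_stress - s_calm))
print("J =", J)
```

In this invented example only profitability is selected in both regimes, so $J = 1/4$. One caveat on the labelling: a percentile computed over the full sample uses future information, so a tradable version would need an expanding or rolling cutoff.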

3. The WRDS rug-pull (and why the project got better)

Original plan: pull CRSP and Compustat from WRDS, build everything from raw security data. Then I discovered my WRDS master account does not have access over the summer. Mild panic, followed by a strictly better design.

I rebuilt the data layer on fully public sources: the Kenneth French Data Library for factor returns, the Open Source Asset Pricing dataset (Chen & Zimmermann) [7] for 200-plus replicated anomaly portfolios, yfinance for benchmark prices, and FRED for the VIX and macro series. The payoff is not just access — it is reproducibility. A WRDS-based project cannot be re-run by anyone without a WRDS seat. Mine clones and runs from a fresh machine. For a project whose entire point is "stop trusting unreproducible factor claims," depending on a paywalled black box would have been its own punchline.

4. The pipeline

The architecture is deliberately boring, which is the compliment it deserves:

5. Status and honest scope

The data layer and the Fama-French replication are the foundation; the ML ladder and the regime analysis are the active build toward a public v1.0 with a one-page memo. I am writing this mid-project on purpose: the point of the site is to show work in motion, not to pretend research arrives fully formed. When v1.0 ships, the headline number will be that support-overlap $J$ across regimes, and I have promised myself to report it whether or not it flatters the hypothesis.

6. Why this is the project I care most about

It sits exactly on the intersection I am building a career around: real statistics, real code, real markets, and a reproducibility standard borrowed from the systems world. It is also the artifact behind the substantive note I plan to send Prof. Fahiz Baba-Yara, whose In Search of Sparsity [6] is the most direct intellectual parent of the regime-stability question. If the factors do not survive a regime change, that is not a failure of the project — that is the finding.


References

  1. Fama, E. F., & French, K. R. (1993). Common Risk Factors in the Returns on Stocks and Bonds. Journal of Financial Economics, 33(1).
  2. Cochrane, J. H. (2011). Presidential Address: Discount Rates. Journal of Finance, 66(4).
  3. Gu, S., Kelly, B., & Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. Review of Financial Studies, 33(5).
  4. Harvey, C. R., Liu, Y., & Zhu, H. (2016). …and the Cross-Section of Expected Returns. Review of Financial Studies, 29(1).
  5. Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58(1).
  6. Baba-Yara, F. (2026). In Search of Sparsity: Bayesian Sparse Factor Models and the Factor Zoo. Working paper.
  7. Chen, A. Y., & Zimmermann, T. (2022). Open Source Cross-Sectional Asset Pricing. Critical Finance Review.

Code: github.com/pbathuri/RegimeFactorZoo