Pradyot Bathuri

Auditing a multi-terabyte filesystem without melting the laptop that audits it

Abstract. disk-archival-toolkit answers a question every researcher with too many run outputs eventually faces: what is on this disk, what can move to cold storage, and can I decide that without loading the whole inventory into RAM? The toolkit streams multi-million-row filesystem inventories, classifies files into storage tiers, and emits a budget-aware archival manifest — all in bounded memory. The interesting part is not the file I/O. It is the constraint: the dataset is larger than memory, so the streaming requirement dictates the entire architecture.

Keywords: streaming algorithms, external memory, bounded-memory processing, storage tiering.

1. The constraint is the design

The naive version of this tool is three lines of pandas: load the inventory, sort by size, decide. It works beautifully until the inventory describes a multi-terabyte working set with tens of millions of rows, at which point "load the inventory" is the bug. You cannot hold $N$ rows in memory when $N$ rows exceed memory. This is the classic external-memory setting [1]: you must process the data in passes, keeping only $O(1)$, or at worst $O(\log N)$, state relative to the length of the stream.

So the design target is a hard inequality. With $M$ bytes of memory and an inventory of $N \gg M$ rows, every operation must run in working memory

$$O(M) \ll O(N),$$

streaming the rows past a small, fixed-size set of accumulators rather than materializing them. Sorting becomes external merge sort; "top-K largest directories" becomes a bounded heap; tier statistics become running aggregates updated one row at a time. The tool never holds more than a window of the stream, so it audits a disk far larger than the machine doing the auditing.
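As a rough illustration of the bounded heap and running aggregates described above, here is a minimal Python sketch. It is not taken from the toolkit: the CSV column names (`path`, `size_bytes`, `age_days`), the helper names, and the tier thresholds in `classify_tier` are all assumptions made for the example. It ranks individual inventory entries; ranking directories would additionally require rows that are already aggregated or grouped per directory.

```python
import csv
import heapq
import io
from collections import defaultdict


def classify_tier(size_bytes, age_days):
    """Hypothetical tier rule (assumed thresholds, not the toolkit's)."""
    if age_days > 365:
        return "cold"
    if age_days > 30:
        return "warm"
    return "hot"


def read_inventory(handle):
    """Lazily yield (path, size_bytes, age_days) from a CSV file object.

    The column names are assumptions for this sketch.
    """
    for rec in csv.DictReader(handle):
        yield rec["path"], int(rec["size_bytes"]), int(rec["age_days"])


def audit_stream(rows, k=3):
    """Single pass over the rows.

    State is a k-item min-heap plus one [count, bytes] pair per tier,
    so memory depends on k and the number of tiers, not on the row count.
    """
    largest = []  # min-heap of (size, path); the root is the smallest kept
    tiers = defaultdict(lambda: [0, 0])  # tier -> [file_count, total_bytes]
    for path, size, age in rows:
        item = (size, path)
        if len(largest) < k:
            heapq.heappush(largest, item)
        elif item > largest[0]:
            # Evict the smallest kept entry in O(log k).
            heapq.heapreplace(largest, item)
        stats = tiers[classify_tier(size, age)]
        stats[0] += 1
        stats[1] += size
    top_k = sorted(largest, reverse=True)
    return top_k, {tier: tuple(v) for tier, v in tiers.items()}


if __name__ == "__main__":
    # Tiny in-memory stand-in for a real inventory file.
    demo = io.StringIO(
        "path,size_bytes,age_days\n"
        "/a,100,400\n"
        "/b,300,10\n"
        "/c,200,40\n"
        "/d,50,500\n"
    )
    top_k, tier_stats = audit_stream(read_inventory(demo), k=2)
    print(top_k)
    print(tier_stats)
```

Because `read_inventory` is a generator, the same `audit_stream` call should work unchanged on a file opened with `open(path, newline="")`, reading one row at a time. A full sort of the inventory would still need something like an external merge sort, which this sketch does not cover.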

2. What it actually produces

Three outputs, each computable in a single (or bounded) pass:

3. Why I built it, and what it taught me

It started as pure self-defense: HPC research generates absurd volumes of run outputs, and I needed to reclaim space without spending a weekend hand-auditing directories or, worse, OOM-killing the audit script halfway through. But it turned into a clean lesson in the same principle that runs through my cache research and my RAG-serving work: when the data is bigger than memory, the memory hierarchy is the architecture. You do not get to think about the algorithm first and memory second. The bound comes first, and the algorithm is whatever survives it.

That is a less glamorous sentence than "AI agent" or "quantum optimizer," but it is the one that has paid off the most often. Streaming is not a trick you reach for when things get big — it is the honest default once you stop assuming RAM is free.


References

  1. Vitter, J. S. (2001). External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys, 33(2).
  2. Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, 1(2).

Code: github.com/pbathuri/disk-archival-toolkit