
Auditing a multi-terabyte filesystem without melting the laptop that audits it

Abstract. disk-archival-toolkit answers a question every researcher with too many run outputs eventually faces: what is on this disk, what can move to cold storage, and can I decide that without loading the whole inventory into RAM? The toolkit streams multi-million-row filesystem inventories, classifies files into storage tiers, and emits a budget-aware archival manifest — all in bounded memory. The interesting part is not the file I/O. It is the constraint: the dataset is larger than memory, so the streaming requirement dictates the entire architecture.

Keywords: streaming algorithms, external memory, bounded-memory processing, storage tiering.

1. The constraint is the design

The naive version of this tool is three lines of pandas: load the inventory, sort by size, decide. It works beautifully until the inventory describes a multi-terabyte working set with tens of millions of rows, at which point "load the inventory" is the bug. You cannot hold $N$ rows in memory when those $N$ rows exceed memory. This is the classic external-memory setting [1]: you must process the data in passes, keeping only $O(1)$, or at worst $O(\log N)$, state in memory relative to the length of the stream.
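Processing in passes can be illustrated with external merge sort, the out-of-core replacement for "sort by size". This is a minimal sketch, not the toolkit's actual code: the function name, the `(size_bytes, path)` row shape, and the `chunk_rows` parameter are assumptions made for illustration, and it presumes paths contain no newline characters.

```python
import heapq
import tempfile
from itertools import islice


def external_sort_by_size(rows, chunk_rows=100_000):
    """Yield (size_bytes, path) tuples in ascending size order.

    `rows` is any iterable of (size_bytes, path) tuples (hypothetical
    row shape). At most `chunk_rows` rows are held in memory at once.
    """
    rows = iter(rows)
    run_files = []
    try:
        # Pass 1: cut the stream into sorted runs and spill each to disk.
        while True:
            chunk = sorted(islice(rows, chunk_rows))
            if not chunk:
                break
            run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
            for size, path in chunk:
                run.write(f"{size}\t{path}\n")
            run.seek(0)
            run_files.append(run)

        def read_run(f):
            # Parse one spilled run back into (size, path) tuples, lazily.
            for line in f:
                size, path = line.rstrip("\n").split("\t", 1)
                yield int(size), path

        # Pass 2: k-way merge; only one pending row per run is in memory.
        yield from heapq.merge(*(read_run(f) for f in run_files))
    finally:
        for f in run_files:
            f.close()
```

With `chunk_rows` set to what fits comfortably in RAM, peak memory is one chunk during the first pass and one buffered row per run during the merge. A real tool would also need to cap the number of simultaneously open run files, for example by merging in several rounds.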

So the design target is a hard inequality. With room in memory for $M$ rows and an inventory of $N \gg M$ rows, every operation must keep its working state $S$ within

$S = O(M) \ll N,$

streaming the rows past a small, fixed-size set of accumulators rather than materializing them. Sorting becomes external merge sort; "top-K largest directories" becomes a bounded heap; tier statistics become running aggregates updated one row at a time. The tool never holds more than a window of the stream, so it audits a disk far larger than the machine doing the auditing.
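The bounded heap and the running aggregates can share a single pass. The sketch below is illustrative only: `summarize`, the `(size_bytes, path)` row shape, and the chosen statistics are assumptions, and it presumes each row already carries the size of the file or directory it names.

```python
import heapq


def summarize(rows, k=3):
    """Single pass over (size_bytes, path) rows (hypothetical row shape).

    Returns (stats, top_k): running aggregates plus the k largest rows.
    State is one small dict and a heap of at most k entries.
    """
    stats = {"count": 0, "total_bytes": 0, "max_bytes": 0}
    heap = []  # min-heap; heap[0] is the smallest of the current top k
    for size, path in rows:
        # Running aggregates: constant state, updated once per row.
        stats["count"] += 1
        stats["total_bytes"] += size
        stats["max_bytes"] = max(stats["max_bytes"], size)

        # Bounded heap: fill up to k entries, then evict the smallest
        # entry whenever a larger row arrives.
        if len(heap) < k:
            heapq.heappush(heap, (size, path))
        elif heap and size > heap[0][0]:
            heapq.heapreplace(heap, (size, path))
    return stats, sorted(heap, reverse=True)
```

Fed the four rows `(10, "/runs/a")`, `(500, "/runs/b")`, `(42, "/runs/c")`, `(7, "/runs/d")` with `k=2`, this returns a count of 4, a total of 559 bytes, and `/runs/b` and `/runs/c` as the two largest entries. Memory stays at $O(k)$ however many rows stream past; per-tier statistics would need one such dict per tier, which is still constant state.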

2. What it actually produces

Three outputs, each computable in a single pass or a bounded number of passes:

A tier classification. Each file is bucketed — hot, warm, cold — from streamed signals like size, age, and path heuristics. This is a per-row map with no global state, so it is trivially streamable.
A budget-aware archival manifest. Given a target to reclaim (say, "free 2 TB by moving files to an external disk or Drive"), the tool selects an archival set that meets the budget while preferring cold, large, safely-movable files. This is the part with real decision logic: a bounded-memory pass that maintains just enough state to make a greedy, budget-respecting selection.
An execution plan. The moves themselves, emitted as a manifest you review before anything is touched. Nothing is deleted; the toolkit proposes, a human disposes. (Given that the alternative is a script with rm -rf and optimism, the manifest-first design is the entire safety story.)
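One way the first two outputs could fit together is sketched below. It is not the toolkit's real logic: `FileRow`, the age-only thresholds in `classify`, the `max_candidates` cap, and the largest-first greedy rule are all assumptions chosen to keep the example small.

```python
import heapq
from typing import NamedTuple

DAY = 86_400  # seconds


class FileRow(NamedTuple):  # hypothetical inventory row
    path: str
    size_bytes: int
    mtime: float  # seconds since the epoch


def classify(row, now, warm_days=30, cold_days=180):
    """Per-row tier from age alone; thresholds are illustrative."""
    age_days = (now - row.mtime) / DAY
    if age_days >= cold_days:
        return "cold"
    if age_days >= warm_days:
        return "warm"
    return "hot"


def select_for_archive(rows, budget_bytes, now, max_candidates=10_000):
    """Pick cold files, largest first, until budget_bytes is reached.

    One pass keeps only the max_candidates largest cold files in a
    min-heap, so memory is bounded by max_candidates, not by the
    number of rows. Returns (manifest, reclaimed_bytes).
    """
    heap = []
    for row in rows:
        if classify(row, now) != "cold":
            continue
        item = (row.size_bytes, row.path)
        if len(heap) < max_candidates:
            heapq.heappush(heap, item)
        elif heap and item > heap[0]:
            heapq.heapreplace(heap, item)

    # Greedy step: take the largest candidates until the budget is met.
    manifest, reclaimed = [], 0
    for size, path in sorted(heap, reverse=True):
        if reclaimed >= budget_bytes:
            break
        manifest.append((path, size))
        reclaimed += size
    return manifest, reclaimed
```

The function only proposes: it returns a list of `(path, size)` pairs and the total they would free, and moves nothing. If the retained candidates cannot cover the budget, `reclaimed` comes back smaller than `budget_bytes`, which the caller can detect and report.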

3. Why I built it, and what it taught

It started as pure self-defense: HPC research generates absurd volumes of run outputs, and I needed to reclaim space without spending a weekend hand-auditing directories or, worse, OOM-killing the audit script halfway through. But it turned into a clean lesson in the same principle that runs through my cache research and my RAG-serving work: when the data is bigger than memory, the memory hierarchy is the architecture. You do not get to think about the algorithm first and memory second. The bound comes first, and the algorithm is whatever survives it.

That is a less glamorous sentence than "AI agent" or "quantum optimizer," but it is the one that has paid off the most often. Streaming is not a trick you reach for when things get big — it is the honest default once you stop assuming RAM is free.

References

[1] Vitter, J. S. (2001). External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys, 33(2).
[2] Muthukrishnan, S. (2005). Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, 1(2).

Code: github.com/pbathuri/disk-archival-toolkit