
Making retrieval stop lying: agentic RAG for legal and property diligence

Abstract. Legal and property diligence is a retrieval problem dressed as a reading problem. The documents are long, repetitive, and adversarially boring, and the cost of summarizing the wrong clause is not a bad grade; it is a bad deal. This is a build log for an agentic retrieval-augmented generation (RAG) pipeline over Indian real-estate and M&A documents, built with LangGraph for orchestration and Qdrant for vector search, and served with vLLM tuned for AMD MI300X. The thesis throughout: the model is rarely the problem. The retrieval is the problem, and most of the engineering is making retrieval refuse to lie.

Keywords: retrieval-augmented generation, vector search, agentic orchestration, faithfulness, legal NLP.

1. Why RAG, and why it fails

A large language model asked to review a hundred-page sale deed from memory will hallucinate, because it does not have the document; it has a vibe of documents like it. Retrieval-augmented generation [1] fixes the input side: split the document into chunks, embed each chunk, retrieve the chunks relevant to a question, and condition the model on those rather than on its parametric memory.

Retrieval works by embedding text into vectors and ranking each chunk embedding $d$ by its similarity to the query embedding $q$. The standard score is cosine similarity:

$\text{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert\, \lVert d \rVert}.$

Qdrant indexes these vectors with HNSW [4], an approximate nearest-neighbor graph index, so search over millions of chunks stays sub-linear instead of scanning everything. So far, so textbook. The failure mode is subtle: the top-$k$ chunks by cosine similarity are semantically close to the question but may be the wrong instance, such as the indemnity clause from the wrong party or the encumbrance on the wrong survey number. The model then summarizes the retrieved-but-wrong clause with total confidence, and you have built a machine that is wrong faster.
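To make the failure mode concrete, here is a minimal sketch in plain Python that ranks toy chunks by the cosine score above. The three-dimensional vectors, chunk IDs, and party labels are all invented for illustration; a real pipeline would take embeddings from a model and search Qdrant's HNSW index rather than scanning a list.

```python
import math

def cosine_similarity(q, d):
    """sim(q, d) = (q . d) / (||q|| * ||d||); returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)

def top_k(query_vec, chunks, k=2):
    """Brute-force ranking: score every chunk, keep the k highest."""
    scored = [(cosine_similarity(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical 3-d "embeddings"; real ones have hundreds of dimensions.
CHUNKS = [
    {"id": "indemnity-seller", "party": "seller", "vector": [0.90, 0.10, 0.05]},
    {"id": "indemnity-buyer", "party": "buyer", "vector": [0.88, 0.12, 0.06]},
    {"id": "stamp-duty", "party": "buyer", "vector": [0.05, 0.20, 0.95]},
]

# Stands in for the embedding of "what does the buyer indemnify?"
query = [0.90, 0.11, 0.05]
results = top_k(query, CHUNKS, k=2)
for score, chunk in results:
    print(f"{chunk['id']:<18} party={chunk['party']:<7} sim={score:.4f}")
```

With these toy numbers both indemnity chunks score almost identically, and the seller's clause edges out the buyer's. Similarity alone has no way to tell which party the question was about, which is the gap the rest of the design tries to close.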

2. The design move: make the agent argue with the retriever

The fix is to stop treating retrieval as a single shot. LangGraph [3] lets you model the pipeline as a graph of steps with state and conditional edges, so the agent can:

1. Decompose the diligence question into sub-queries (parties, dates, encumbrances, obligations) rather than one fuzzy query.
2. Retrieve per sub-query, with metadata filters (document type, party, survey/parcel ID) layered on top of vector similarity. This is hybrid retrieval, because pure semantic search ignores the structured handles that legal documents are full of [2].
3. Verify retrieved chunks against the sub-query before generation: a cheap check that the chunk actually contains the entity asked about, not just something that rhymes with it.
4. Abstain when retrieval confidence is low, instead of generating. An "I could not find this" is worth ten confident wrong answers in a diligence context.
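The four steps can be sketched as ordinary control flow. This is plain Python, not LangGraph, and is not the project's actual code: the `Chunk` fields, the toy corpus, the precomputed scores, and the `ABSTAIN_THRESHOLD` value are all hypothetical. In a graph-based version, each function would roughly correspond to a node and the abstain branch to a conditional edge.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_type: str
    party: str
    survey_no: str
    score: float  # stand-in for the similarity a vector store would return

# Toy corpus with precomputed scores; a real system would query a vector store.
CORPUS = [
    Chunk("The Seller shall indemnify the Buyer against title defects.",
          "sale_deed", "seller", "112/3", 0.91),
    Chunk("Mortgage in favour of the bank over Survey No. 112/4.",
          "encumbrance_cert", "seller", "112/4", 0.89),
    Chunk("No encumbrance is recorded on Survey No. 112/3.",
          "encumbrance_cert", "seller", "112/3", 0.84),
]

ABSTAIN_THRESHOLD = 0.80  # hand-picked for illustration, not calibrated

def retrieve(filters, k=3):
    """Hybrid step: apply metadata filters first, then rank by similarity."""
    hits = [c for c in CORPUS
            if all(getattr(c, field) == value for field, value in filters.items())]
    return sorted(hits, key=lambda c: c.score, reverse=True)[:k]

def verify(chunk, required_entity):
    """Cheap gate: the chunk must literally mention the entity asked about."""
    return required_entity.lower() in chunk.text.lower()

def answer_sub_query(filters, required_entity):
    """Retrieve, verify, then either return evidence or abstain."""
    verified = [c for c in retrieve(filters) if verify(c, required_entity)]
    if not verified or verified[0].score < ABSTAIN_THRESHOLD:
        return {"status": "not_found", "evidence": None}
    return {"status": "found", "evidence": verified[0].text}

# Decomposition is hard-coded here; in the pipeline a model step would produce it.
sub_queries = {
    "encumbrance on 112/3": ({"doc_type": "encumbrance_cert", "survey_no": "112/3"}, "112/3"),
    "encumbrance on 112/9": ({"doc_type": "encumbrance_cert", "survey_no": "112/9"}, "112/9"),
}
report = {name: answer_sub_query(f, entity) for name, (f, entity) in sub_queries.items()}
for name, result in report.items():
    print(name, "->", result["status"])
```

Run as-is, the first sub-query returns evidence and the second reports `not_found`, because no chunk survives the filter for survey number 112/9. The substring check in `verify` is deliberately crude; it only illustrates where a verification gate sits in the flow.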

That fourth step is the whole personality of the system. A diligence tool that occasionally says "not found, go look manually" is trustworthy. One that always answers is a liability with a progress bar.

3. Serving: vLLM on MI300X

Inference runs on vLLM tuned for AMD MI300X accelerators. vLLM uses PagedAttention to manage the KV cache in fixed-size blocks and keep throughput high under concurrent requests [5]. This is where my HPC habits paid off: the same "where does the working set live and what is the memory bottleneck" instinct from the cache-research project applies directly to getting acceptable tokens-per-second out of a large model on a specific accelerator. Long legal documents mean long contexts, and long contexts mean heavy KV-cache pressure; paged attention is the memory-hierarchy trick that makes it tractable, and recognizing that as a memory-hierarchy problem rather than a model problem saved a lot of flailing.
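A back-of-envelope calculation shows why long contexts turn into a memory problem. The sketch below uses the standard KV-cache size formula (keys plus values, per layer, per KV head, per token). The model shape, the 32,000-token context, and the 16-token block size are assumptions chosen only to make the arithmetic concrete; they are not the configuration used in this project.

```python
import math

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values (the factor 2) for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def blocks_needed(seq_len, block_size=16):
    """Paged allocation: fixed-size blocks, allocated as the sequence grows."""
    return math.ceil(seq_len / block_size)

# Hypothetical model shape with 16-bit cache entries (bytes_per_elem=2).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 80, 8, 128
SEQ_LEN = 32_000   # a long contract plus prompt, in tokens (assumed)
BLOCK_SIZE = 16    # tokens per cache block (assumed)

per_token = kv_cache_bytes(1, N_LAYERS, N_KV_HEADS, HEAD_DIM)
total = kv_cache_bytes(SEQ_LEN, N_LAYERS, N_KV_HEADS, HEAD_DIM)
n_blocks = blocks_needed(SEQ_LEN, BLOCK_SIZE)

print(f"per token:        {per_token / 1024:.0f} KiB")
print(f"per request:      {total / 1024**3:.2f} GiB")
print(f"blocks allocated: {n_blocks}")

# With paging, unused capacity is confined to the last, partially filled block.
unused_slots = n_blocks * BLOCK_SIZE - SEQ_LEN
print(f"unused token slots in the last block: {unused_slots}")
```

Under these assumptions a single request needs roughly 320 KiB of cache per token, or just under 10 GiB for the full context, so a handful of concurrent long-document requests can dominate accelerator memory. Allocating in small blocks rather than one contiguous maximum-length buffer is what keeps that from being reserved up front.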

4. Honest scope

This was built as a hackathon-grade submission, not a deployed legal product, and I want to be precise about that because legal-tech overclaiming is genuinely harmful. It demonstrates the architecture — decomposition, hybrid retrieval, verification, abstention, efficient serving — on a real document corpus. It is not a substitute for a lawyer, it has not been validated against professional diligence on a controlled benchmark, and the abstention thresholds were tuned by hand rather than calibrated. What it is: a credible argument that the right unit of engineering for legal AI is the retrieval contract, not the model.

5. What I took from it

Two things transfer to everything else I build. First, that retrieval faithfulness — does the system condition on the right evidence — is a measurable property you can engineer toward, and the same standard shows up in my research work as "no plot without provenance." Second, that the unglamorous components (metadata filters, verification gates, abstention) carry more of the reliability than the model choice does. The model is the easy part now. Making it only speak when it has the right document in hand is the job.

References

[1] Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS.
[2] Karpukhin, V., et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. EMNLP.
[3] LangChain (2024). LangGraph: Building Stateful, Multi-Actor Applications with LLMs. Documentation.
[4] Malkov, Yu. A., & Yashunin, D. A. (2018). Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs. IEEE TPAMI.
[5] Kwon, W., et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP.

Code: github.com/pbathuri/legal-document-intelligence