The Legal AI 'Context Window' Trap: Why Long Contracts Are Breaking Your Review Tools — and What to Do About It

Here is what the demo never shows you: paste a 180-page master service agreement — complete with six schedules, two exhibits, a data processing addendum, and a statement of work — into your AI contract review tool, and watch what happens to the indemnification analysis...

By Andy Armstrong | The Legal Stack | June 9, 2026

The Problem Nobody Wants to Admit at the Pitch Meeting

The output looks fine. Confidently formatted. Nicely structured. Completely missing the carve-out buried in Schedule 4 that guts the limitation of liability you just told your client they had.

This is the context window trap, and it is quietly producing legally dangerous work product across commercial transactions, M&A due diligence, and supply chain contracting right now. The vendors know about it. Most legal ops buyers do not test for it. And practitioners are jury-rigging workarounds that deserve to be documented honestly.

What Actually Goes Wrong with Long Documents

Context windows in large language models are measured in tokens, sub-word units that average roughly three-quarters of an English word. Most commercially deployed legal AI tools today operate with effective context windows somewhere between 100,000 and 200,000 tokens, though frontier models push higher. A 100,000-token window sounds generous until you load a complex M&A transaction document. A single acquisition agreement with reps, warranties, disclosure schedules, and ancillary documents can run 300 to 500 pages, which at ordinary drafting density runs well past 100,000 tokens. You have already blown the budget before you get to the exhibits.
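The arithmetic is worth doing once by hand. The sketch below is a back-of-envelope estimate, not a tokenizer: the words-per-page and tokens-per-word figures are assumptions chosen for illustration, and the 8,000-token reserve for the prompt and the model's answer is likewise hypothetical.

```python
# Rough token-budget estimate for a long agreement.
# Both constants are illustrative assumptions, not measurements.
WORDS_PER_PAGE = 450     # assumed density of a single-spaced contract page
TOKENS_PER_WORD = 1.3    # assumed average for English prose; legal text may differ


def estimate_tokens(pages: int) -> int:
    """Approximate token count for a document of the given page length."""
    return round(pages * WORDS_PER_PAGE * TOKENS_PER_WORD)


def fits(pages: int, window_tokens: int, reserve_tokens: int = 8_000) -> bool:
    """True if the document plus a reserve for prompt and output fits the window."""
    return estimate_tokens(pages) + reserve_tokens <= window_tokens


for pages in (180, 300, 500):
    print(pages, estimate_tokens(pages), fits(pages, 100_000), fits(pages, 200_000))
```

Under these assumptions a 180-page MSA already overruns a 100,000-token window, and a 500-page deal document overruns a 200,000-token one. Swap in your own page density before relying on the numbers.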

The failure modes are specific and worth naming precisely:

Truncation without warning. The most dangerous pattern. The tool processes what it can and silently ignores the rest. You get an analysis that looks complete. Defined terms established late in a document — or in an exhibit — get missed. A limitation of liability that is modified in a side letter never gets read. The tool confidently summarizes the governing law clause based on the boilerplate in Section 18 while missing the forum selection carve-out in Exhibit B.
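The alternative to silent truncation is to fail loudly. The following is a minimal sketch of a pre-flight check, assuming the document has already been split into named parts; count_tokens, check_budget, and ContextOverflowError are hypothetical names, and the whitespace word count stands in for whatever tokenizer the real model uses.

```python
class ContextOverflowError(Exception):
    """Raised instead of silently dropping text that does not fit."""


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace word count. Replace with the model's tokenizer.
    return len(text.split())


def check_budget(parts: dict[str, str], budget: int) -> dict[str, int]:
    """Return per-part token counts, or raise naming the parts that would be cut."""
    counts = {name: count_tokens(body) for name, body in parts.items()}
    total = sum(counts.values())
    if total > budget:
        running, at_risk = 0, []
        for name, n in counts.items():  # parts are assumed to be in document order
            running += n
            if running > budget:
                at_risk.append(name)
        raise ContextOverflowError(
            f"{total} tokens exceeds budget of {budget}; "
            f"parts that would be truncated or dropped: {at_risk}"
        )
    return counts
```

The point of the design is that the error names the schedule or exhibit that would have been lost, which is exactly the signal a reviewer never gets from a tool that truncates quietly.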

Context degradation midstream. Even within the technical window, transformer-based models do not use every position in their context equally well. The "lost in the middle" problem, documented by Liu et al. in "Lost in the Middle: How Language Models Use Long Contexts" and subsequently confirmed across multiple enterprise deployments, means that material in the middle of a long document receives systematically less reliable processing than content near the beginning or end. For a 90-page supply contract, this means the operational SLAs buried in Article 7 may be mischaracterized or omitted even when they technically fall within the window.
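Position sensitivity can be probed directly. The sketch below plants a known clause at several depths in filler text and records whether the tool's answer recovers it. The ask callable is an assumption: it stands for whatever wrapper you write around the tool under test, taking a document and a question and returning the answer text. The function names and the substring check are illustrative only.

```python
from typing import Callable, Sequence


def plant(filler: Sequence[str], needle: str, depth: float) -> str:
    """Insert the needle paragraph at a relative depth (0.0 = start, 1.0 = end)."""
    idx = round(depth * len(filler))
    paragraphs = list(filler[:idx]) + [needle] + list(filler[idx:])
    return "\n\n".join(paragraphs)


def sweep(
    ask: Callable[[str, str], str],
    filler: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Map each depth to whether the expected string appears in the answer."""
    return {
        d: expected.lower() in ask(plant(filler, needle, d), question).lower()
        for d in depths
    }
```

A result that passes at 0.0 and 1.0 but fails at 0.5 is the pattern to look for. A substring match is a blunt grader, so treat failures as prompts for manual review rather than as a score.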

Cross-reference collapse. Complex commercial agreements are not linear documents. They are networks of defined terms, incorporated schedules, and conditional provisions that modify each other. When an AI tool processes a document in chunks — a common architectural workaround — it loses the thread between a definition established in Article 1 and its practical application in Schedule 3. The analysis of any single provision becomes decontextualized. You get a technically accurate description of what Section 12.4 says in isolation while missing entirely that Section 12.4 is subordinated to the carve-outs in Exhibit C.
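One way to make this failure visible is to check, before a chunk is sent for analysis, whether it points at anything that is not travelling with it. This is a sketch under simplifying assumptions: the regular expression only recognizes references of the form "Section 12.4" or "Exhibit C", labels are matched exactly, and the function names are hypothetical.

```python
import re

# Matches references such as "Section 12.4", "Article 1", "Schedule 3", "Exhibit C".
REF = re.compile(r"\b(Section|Article|Schedule|Exhibit)\s+([A-Z]|\d+(?:\.\d+)*)\b")


def references(text: str) -> set[str]:
    """Return the set of unit labels that the text refers to."""
    return {f"{kind} {ident}" for kind, ident in REF.findall(text)}


def dangling(chunks: dict[str, str], in_context: set[str]) -> dict[str, set[str]]:
    """For each chunk being sent, list referenced units that are not being sent."""
    report = {}
    for label in in_context:
        missing = references(chunks[label]) - in_context - {label}
        if missing:
            report[label] = missing
    return report
```

If Section 12.4 opens with "Subject to Exhibit C" and only Section 12.4 is in context, the report names Exhibit C as missing. Real drafting uses looser phrasing ("the preceding clause", "as set out above"), which this pattern will not catch.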

Which Tool Categories Are Most Exposed

Not all legal AI tooling fails the same way here. The exposure varies by architecture.

Clause extraction tools — the earlier generation of contract review software built on fine-tuned classification models rather than generative AI — are actually more predictable in their failure. They cannot analyze what they cannot classify, but they will not fabricate confident-sounding analysis of content they never processed. That predictability has real value.

The more dangerous category is the current generation of generative AI contract review platforms that produce narrative summaries, risk assessments, and issue lists. These tools are most likely to produce fluent, authoritative-sounding output that is materially incomplete. The legal professional reading a clean five-paragraph summary has no visible signal that 40 pages were never analyzed. The output is not marked as partial. It is just presented.

RAG-based architectures (retrieval-augmented generation, where the tool chunks the document and retrieves relevant sections to answer specific queries) help but do not solve the problem. Retrieval quality depends heavily on whether the query language maps onto how the relevant provision is actually drafted. Ask about "limitation of liability" and you may not retrieve the provision titled "aggregate exposure cap." The system is only as good as the retrieval, and retrieval on dense legal language is still brittle.

How Legal Ops Should Be Testing Before Procurement

If you are evaluating contract review tools and this problem has not come up in your vendor conversations, that is itself a finding.

A practical procurement test: take three of your organization's most complex executed agreements — ideally an MSA with full schedules, a supply contract with multiple SOWs, and a transaction document from a recent deal — and run them through any tool you are evaluating. Then have a human reviewer audit the output against the actual document, specifically testing:

Are all schedules and exhibits reflected in the analysis?
Are cross-references between sections accurately tracked?
Does the tool flag or acknowledge document length constraints in its output?
Does the summary of each key provision survive comparison to the actual drafting?
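The first question on that list can be partly automated as a screening step ahead of the human audit. The sketch below compares the schedules and exhibits named in the contract against those named anywhere in the tool's output. It assumes title-case labels such as "Schedule 4" or "Exhibit B"; the function names are hypothetical.

```python
import re

# Title-case labels only; all-caps headings would need normalizing first.
UNIT = re.compile(r"\b(Schedule|Exhibit|Annex|Appendix)\s+([A-Z]|\d+)\b")


def units(text: str) -> set[str]:
    """Attachment labels mentioned in the text."""
    return {f"{kind} {ident}" for kind, ident in UNIT.findall(text)}


def coverage(contract_text: str, analysis_text: str) -> dict[str, list[str]]:
    """List attachments the contract names and those the analysis never mentions."""
    expected = units(contract_text)
    seen = units(analysis_text)
    return {"expected": sorted(expected), "missing": sorted(expected - seen)}
```

A mention in the output does not prove the attachment was analyzed, but an attachment that is never mentioned at all is a strong signal that it was not. Treat the "missing" list as where the human reviewer should start.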

Ask vendors directly: what is your effective context limit for contract review? What happens when a document exceeds it — does the tool warn the user, truncate silently, or chunk and retrieve? Where does the tool's confidence calibration break down on long documents?

Vendors who answer these questions clearly and specifically are more trustworthy than vendors who redirect to benchmark performance on standard-length agreements.

What Practitioners Are Actually Doing Today

The workarounds being used in sophisticated legal ops environments are pragmatic and worth documenting. Most practitioners running complex deal work have adopted a document segmentation protocol — manually or programmatically splitting long agreements into logical units (operative agreement, each schedule, each exhibit) and running review on each component separately. This reduces context pressure but requires a downstream synthesis step that is usually done by a human, not the AI tool.
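For the programmatic version of that split, a minimal sketch follows. It assumes attachments begin with an all-caps heading at the start of a line, such as "SCHEDULE 1" or "EXHIBIT B"; that convention, the label "Operative Agreement", and the function name are assumptions, and real documents will need a pattern tuned to their own formatting.

```python
import re

# An attachment heading: all-caps keyword at line start, then a number or letter.
HEADING = re.compile(
    r"^(SCHEDULE|EXHIBIT|ANNEX|APPENDIX)\s+([A-Z]|\d+)\b.*$", re.MULTILINE
)


def segment(text: str) -> dict[str, str]:
    """Split a contract into the operative agreement and each attachment."""
    starts = [
        (m.start(), f"{m.group(1).title()} {m.group(2)}")
        for m in HEADING.finditer(text)
    ]
    bounds = [(0, "Operative Agreement")] + starts
    ends = [pos for pos, _ in bounds[1:]] + [len(text)]
    parts = {}
    for (start, label), end in zip(bounds, ends):
        body = text[start:end].strip()
        if body:  # skip an empty operative section if the text opens with a heading
            parts[label] = body
    return parts
```

Each returned part can then be reviewed on its own. The synthesis across parts remains a separate step, and if two attachments share a label the later one overwrites the earlier in this sketch.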

Some firms are maintaining a separate review pass specifically for defined terms and cross-references — a quasi-manual ontology check that maps how key definitions flow through the document before trusting any AI analysis of substantive provisions.
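A first draft of that defined-terms map can be generated mechanically and then corrected by hand. The sketch assumes definitions follow the common pattern of a quoted, capitalized term followed by "means", "shall mean", or "has the meaning"; other drafting styles will be missed, and the function names are hypothetical.

```python
import re

# A quoted capitalized term followed by a defining verb phrase.
DEFINITION = re.compile(
    r'[“"]([A-Z][A-Za-z ]+)[”"]\s+(?:means|shall mean|has the meaning)'
)


def defined_terms(parts: dict[str, str]) -> dict[str, str]:
    """Map each defined term to the first part in which it is defined."""
    found = {}
    for label, body in parts.items():
        for term in DEFINITION.findall(body):
            found.setdefault(term, label)
    return found


def usage_map(parts: dict[str, str]) -> dict[str, dict]:
    """For each defined term, record where it is defined and where else it is used."""
    result = {}
    for term, home in defined_terms(parts).items():
        pattern = re.compile(rf"\b{re.escape(term)}\b")
        used = [
            label
            for label, body in parts.items()
            if label != home and pattern.search(body)
        ]
        result[term] = {"defined_in": home, "used_in": used}
    return result
```

The entries to look at first are terms defined in a schedule or exhibit but used in the operative agreement, since those are the definitions most likely to be out of view when a provision is analyzed in isolation.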

A smaller number of legal ops teams are building custom preprocessing pipelines that extract and annotate cross-references before the document hits the AI layer, essentially pre-linking the network structure that generative models would otherwise have to infer.
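What such a pipeline might look like, in miniature: build a graph of which parts refer to which, then for any provision under review collect everything connected to it so that the whole bundle is loaded together. This is a sketch of one possible approach, not a description of any particular firm's pipeline. Links are treated as two-way on the assumption that a carve-out often cites the clause it modifies rather than the other way round; the reference pattern and function names are illustrative.

```python
import re
from collections import deque

REF = re.compile(r"\b(Section|Article|Schedule|Exhibit)\s+([A-Z]|\d+(?:\.\d+)*)\b")


def build_graph(parts: dict[str, str]) -> dict[str, set[str]]:
    """Undirected links between parts: A and B are linked if either cites the other."""
    graph = {label: set() for label in parts}
    for label, body in parts.items():
        for kind, ident in REF.findall(body):
            target = f"{kind} {ident}"
            if target in parts and target != label:
                graph[label].add(target)
                graph[target].add(label)
    return graph


def review_bundle(graph: dict[str, set[str]], start: str) -> list[str]:
    """Breadth-first walk: every part directly or indirectly linked to the start."""
    seen, queue = [start], deque([start])
    while queue:
        for neighbour in sorted(graph.get(queue.popleft(), ())):
            if neighbour not in seen:
                seen.append(neighbour)
                queue.append(neighbour)
    return seen
```

If Exhibit C says "Notwithstanding Section 12.4" and also points to Schedule 3, the bundle for Section 12.4 pulls in both, even though Section 12.4 itself mentions neither. On a heavily cross-referenced agreement the bundle can grow to most of the document, which is a useful finding in its own right.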

None of this is elegant. All of it adds workflow steps that partially undermine the efficiency case for AI review in the first place.

The Conclusion That Should Make You Uncomfortable

The context window trap is not an edge case. It is the standard condition for the contracts that matter most — the long, complex, multi-party agreements where a missed carve-out or a misread indemnity cap has real financial consequences. The tools that are most likely to fail on these documents are also the tools most likely to produce output that looks like it worked.

The responsible posture for legal practitioners right now is not to avoid AI contract review — the efficiency gains on standard commercial paper are real and worth capturing. It is to be explicit about which document categories require human verification layers regardless of what the tool outputs, and to treat any AI summary of a complex agreement as a starting hypothesis rather than a finished work product.

The vendors will solve this. Context windows will grow. Architectures will improve. But the gap between what these tools can reliably do today and what they are being asked to do in production legal workflows is material, and practitioners who do not understand the specific failure modes are the ones most likely to be embarrassed by them.

Andy Armstrong covers legal technology and AI governance for The Legal Stack.