Vol. III · No. 128 Independent LegalTech Analysis Wednesday, June 17, 2026

The Legal Stack

The Legal AI 'Silent Degradation' Problem: Why Your Contract Review Tool Performs Worse on Friday Afternoon Than Monday Morning — and Nobody Is Measuring It

There is a quiet scandal developing in legal AI procurement, and almost nobody in legal operations is talking about it yet. Your contract review platform almost certainly performs differently depending on when you use it. Not differently in ways that crash dashboards or trigger support tickets — but differently in ways that matter: slower inference, softer extraction confidence, more hallucinated clause citations, subtler misreadings of materiality thresholds. The tool still runs. The green uptime dot still glows. But the output quality has silently degraded, and you have no contractual right to know.

This is the silent degradation problem. And it is going to cause a significant legal malpractice conversation before most law firms have even begun instrumenting for it.

What Is Actually Happening Under the Hood

Large language model inference is not a fixed-cost operation. When AI vendors serve hundreds or thousands of concurrent requests — Friday afternoon being a peak period across U.S. time zones, as every firm rushes to close the week — infrastructure managers make real-time decisions about compute allocation. They throttle batch sizes. They route to lower-capacity GPU clusters. Some vendors, under competitive pricing pressure, have quietly shifted to quantized model variants during peak load: smaller, faster, cheaper versions of their flagship model that sacrifice some precision for throughput.

None of this violates your vendor's terms of service. None of it shows up in the SLA. The SLA almost certainly guarantees uptime: typically something like 99.5% availability, defined as the API responding within some latency threshold. It says nothing about the quality of the answer delivered within that threshold. A response returned in 800 milliseconds can be meaningfully less accurate than one that takes four seconds. You are buying availability. You are not buying performance consistency.

For consumer applications, this is a nuisance. For contract review in an M&A diligence context, it is a material risk exposure.

The Procurement Benchmark Illusion

The problem compounds at the point of purchase. When your legal ops team evaluated three AI contract review platforms six months ago, the vendor demos and benchmark evaluations were conducted in controlled conditions. Dedicated compute environments. Low concurrency. The model was, essentially, on its best behavior.

This mirrors a problem the cybersecurity industry confronted a decade ago with penetration testing tools: performance in a sandboxed evaluation environment told you almost nothing about behavior under adversarial production conditions. The legal AI industry has not yet had its equivalent reckoning, but the structural conditions are identical.

The vendors know this. Several of the major legal AI platforms, including contract lifecycle management tools built on OpenAI, Anthropic, and proprietary foundation models, run on shared infrastructure where enterprise customers coexist with smaller accounts. Nothing technical guarantees that a Friday 4 PM diligence review receives the same inference priority as a Monday 10 AM demo for a prospective Fortune 500 customer. And there is every commercial incentive to favor the demo.

Legal Ops Is Not Instrumenting for This

Here is the uncomfortable part for legal operations leaders: even where degradation is occurring, almost no firm is building the measurement infrastructure to detect it.

Quality assurance in legal AI deployments is almost universally defined in terms of catch rate on known test sets. Did the tool find the change-of-control clause? Did it flag the GDPR data transfer provision? These benchmarks are run at onboarding and occasionally at renewal. They are not run continuously in production. They are almost never run at 4:45 PM on a Friday against a real document from an active deal.

The firms that are ahead of this — and there are a handful — are beginning to implement what I'd call inference quality monitoring: randomized injection of documents with known ground-truth outputs into the live production workflow, with automated comparison of AI outputs against those ground truths over time. This is expensive to build. It requires legal operations teams to think more like ML infrastructure engineers than contract administrators. But it is the only way to detect silent degradation before it causes a problem that a client or a court notices first.
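To make the idea concrete, here is a minimal sketch of what such a monitor might look like. It is an illustration under stated assumptions, not a description of any firm's or vendor's actual system: the names Canary, CanaryResult, score_canary, and maybe_run_canary are hypothetical, and review_fn stands in for whatever call your contract review tool exposes, assumed here to return a list of clause labels.

```python
import random
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class Canary:
    """A test document whose correct extraction result is already known."""
    doc_id: str
    text: str
    expected_clauses: Set[str]


@dataclass
class CanaryResult:
    doc_id: str
    checked_at: datetime  # when the check ran, so results can be grouped by time later
    recall: float         # share of expected clauses the tool actually found
    spurious: int         # clauses the tool reported that are not in the ground truth


def score_canary(canary: Canary, found: Set[str], checked_at: datetime) -> CanaryResult:
    """Compare the tool's output for one canary against its ground truth."""
    hits = canary.expected_clauses & found
    if canary.expected_clauses:
        recall = len(hits) / len(canary.expected_clauses)
    else:
        recall = 1.0  # nothing to find, so nothing was missed
    spurious = len(found - canary.expected_clauses)
    return CanaryResult(canary.doc_id, checked_at, recall, spurious)


def maybe_run_canary(
    canaries: List[Canary],
    review_fn: Callable[[str], Iterable[str]],  # hypothetical wrapper around the vendor tool
    rng: random.Random,
    now: datetime,
    injection_rate: float = 0.05,  # assumed default: roughly 1 in 20 workflow slots
) -> Optional[CanaryResult]:
    """With probability injection_rate, send one random canary through the live tool."""
    if not canaries or rng.random() >= injection_rate:
        return None
    canary = rng.choice(canaries)
    found = set(review_fn(canary.text))
    return score_canary(canary, found, now)
```

Calling maybe_run_canary once per real document submitted, and storing each CanaryResult with its timestamp, yields a running record of recall and spurious findings that can later be compared across hours and days. A production version would also need to keep canaries indistinguishable from real traffic, which this sketch does not attempt.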

The Mata v. Avianca hallucination case from 2023 was a visible, dramatic failure. Silent degradation produces invisible, incremental failures — missed indemnification caps, slightly wrong governing law flags, materiality thresholds misread by a percentage point. These do not make headlines. They make malpractice claims.

What Minimum Performance Transparency Should Look Like

Vendors are not going to offer this voluntarily. Legal ops directors, GCs, and law firm CIOs need to demand it contractually. Here is my minimum floor for what vendor SLAs should contain in 2026:

Inference quality metrics, not just uptime. Vendors should commit to maintaining a specified accuracy rate on a disclosed, reproducible benchmark test set — measured in production, not in demo environments, across all hours of the business day.

Infrastructure transparency disclosures. Any use of quantized model variants, reduced-parameter inference, or dynamic compute allocation that deviates from the configuration used during procurement evaluation should require written notice to enterprise customers.

Time-stratified performance reporting. Monthly reports should show performance metrics segmented by time of day and day of week — not just aggregate uptime. If Friday afternoon inference quality is statistically distinguishable from Monday morning quality, that is a disclosure obligation, not a trade secret.

Audit rights for inference configuration. Enterprise legal clients should have the contractual right to request independent technical verification that the model version and inference configuration in production matches the configuration represented at procurement.

None of these asks is technically unreasonable. Every serious AI infrastructure operator already collects this data internally. The question is whether your vendor contract gives you the right to see it.
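As a rough illustration of the time-stratified reporting ask, the sketch below groups pass/fail results on a benchmark set by day and half-day, then applies a standard two-proportion z-test to check whether two buckets are statistically distinguishable. The function names, the AM/PM bucketing, and the input shape of (timestamp, passed) pairs are my own assumptions for the example. A real report would need enough samples per bucket for the normal approximation to hold and would have to account for comparing many buckets at once.

```python
import math
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def bucket(ts: datetime) -> str:
    """Label a timestamp by weekday and half-day, e.g. 'Fri PM'."""
    half = "AM" if ts.hour < 12 else "PM"
    return f"{DAYS[ts.weekday()]} {half}"


def stratify(records: Iterable[Tuple[datetime, bool]]) -> Dict[str, Tuple[int, int]]:
    """Turn (timestamp, passed) pairs into {bucket: (passes, total)}."""
    counts = defaultdict(lambda: [0, 0])
    for ts, passed in records:
        entry = counts[bucket(ts)]
        entry[0] += int(passed)
        entry[1] += 1
    return {name: (c[0], c[1]) for name, c in counts.items()}


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) under H0: both buckets share one pass rate."""
    rate_a, rate_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # identical all-pass or all-fail buckets: no evidence of a gap
    z = (rate_a - rate_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value
```

With invented numbers, 190 of 200 benchmark items passing on Monday morning against 170 of 200 on Friday afternoon gives a z-score of about 3.3 and a p-value well below 0.01. A gap of that size is the kind of difference the disclosure obligation above is meant to surface.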

The Accountability Gap Is Going to Close — One Way or Another

The EU AI Act's high-risk system provisions are beginning to bite on legal and compliance automation applications. The FTC has shown increasing appetite for enforcement against AI performance misrepresentation. State bar ethics opinions on AI-assisted legal work — including New York's 2024 guidance — are converging on a supervision standard that implicitly requires lawyers to know when their tools are underperforming.

Legal ops teams that are still treating AI tool procurement like SaaS procurement are operating in a framework that is at least two years behind both the technology and the regulatory direction. The vendors are not going to tell you your tool performs worse on Friday afternoon. You are going to need to find out for yourself — and then make it a contractual right.

The silent degradation problem will not stay silent much longer. The only question is whether your firm detects it before a client or a regulator does.


Andy Armstrong writes about legal technology and AI governance for The Legal Stack.