By Neil Cameron

I attended Legalweek’s 2026 session ‘AI in the Courtroom: A Mock Argument on Generative AI for Document Review’ with considerable interest – and left with considerable concern. The mock judge ruled that GenAI review was defensible. The basis: validation statistics. Recall. Precision. The familiar numbers. With respect, that is precisely the problem.

No court has yet issued an opinion on the use of generative AI for document review and production. This session saw the plaintiff and defense counsel go head-to-head arguing for and against the use of Relativity aiR for Review in their case.

The speakers were Cristin Traylor, senior director at Relativity; Elizabeth Marie Gary, an associate at Morgan Lewis & Bockius; Michelle Newcomer, eDiscovery counsel at Kessler Topaz Meltzer & Check; and Judge Andrew Peck, who is now senior counsel at DLA Piper.

The mock judge’s role was inevitably played by Judge Peck who is, of course, the author of da Silva Moore v. Publicis Groupe, the opinion that gave TAR its judicial legitimacy and which this publication has covered extensively. Nobody in this space has done more to bring rigorous thinking to the validation of AI-assisted review. It is precisely because of that record that his mock ruling – that validation statistics were sufficient to approve GenAI review – deserves respectful but direct scrutiny. I would argue that the statistics he accepted measure retrieval. They do not measure interpretation. Those are not the same thing, and the gap between them is where the risk lives.

Recall and precision tell you whether the right documents ended up in the production set. They do not tell you whether the AI’s summaries were faithful, whether its thematic clusters were meaningful, whether its privilege rationales were correct, or whether the framing choices it made upstream – which witnesses looked important, which arguments got developed, what never made it to a human eye – were sound. In a GenAI-assisted workflow, those interpretive steps are not a side effect of the process. They are the process.
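The arithmetic behind those two numbers is simple, which is part of the point. Here is a minimal sketch (the document IDs are invented for illustration, and this is not any vendor’s validation code): recall and precision compare set membership and nothing else, so a production with perfect scores could still rest on unfaithful summaries or wrong privilege rationales.

```python
# Illustrative only: recall and precision are pure set arithmetic.
# They score which documents were produced, not why, and not whether
# any AI-generated summary or privilege rationale was correct.

def recall_precision(produced: set, relevant: set) -> tuple:
    """recall    = relevant docs produced / all relevant docs
       precision = relevant docs produced / all docs produced"""
    true_positives = len(produced & relevant)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(produced) if produced else 0.0
    return recall, precision

# Hypothetical review: the AI's production set vs. the ground-truth set.
produced = {"DOC-001", "DOC-002", "DOC-003", "DOC-004"}
relevant = {"DOC-001", "DOC-002", "DOC-003", "DOC-005"}

r, p = recall_precision(produced, relevant)
print(f"recall={r:.2f} precision={p:.2f}")  # 0.75 each: 3 of 4 on both axes
```

Nothing in that calculation looks inside a document, which is exactly the gap the argument above identifies.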

The panel’s answer to this appears to be: trust the validation stats, and trust the lawyers to verify what the AI tells them. That is not a validation framework. That is a hope.

The panel also spent considerable time arguing about the disclosability of prompts. With respect, that is a bit like debating how many angels can dance on the head of a pin – when we already know that an identical prompt submitted to the same LLM an hour later can produce materially different results. The question is not whether prompts should be disclosed. The question is whether the outputs they generate can be trusted, reproduced, and defended. That question went largely unasked.

The framework the profession actually needs is not complicated. If you are using any AI tool that operates within established TAR methodology, carry on – the validation architecture already exists and the courts have accepted it. If you are using GenAI for research, argument formation, or case strategy, that is internal work product and the existing professional responsibility framework applies. But if you are using GenAI to determine which specific documents within a larger set are discoverable or privileged – if the AI is making or shaping production decisions – then you face a straightforward choice: (a) follow TAR process and rules, (b) read every document yourself, or (c) use GenAI and be able to demonstrate that it worked with at least the accuracy and reliability that TAR was required to prove before the courts accepted it. That proof does not yet exist for most deployed systems. Until it does, the third option is not a choice. It is a gamble.

The profession is not validating GenAI review. It is letting GenAI mark its own homework, and calling the grade a validation.

Speaking to Legal IT Insider for our recent report ‘From TAR to HAR – Are GenAI Discovery Tools Ready for Forensic Scrutiny?’, Professor Maura Grossman – whose foundational work on TAR gave the profession the empirical spine it needed for da Silva Moore – noted that we have gone straight to step two (deployment) without completing step one (independent validation). The TREC Legal Track was not created because TAR sounded plausible. It was created because Jason Baron, Maura Grossman, Gordon Cormack and others insisted on blind, independent benchmarking before the profession staked its credibility on the technology. No equivalent exercise exists for GenAI. Not yet.

The doctrinal frontier, as Baron has noted, remains largely untested. No briefs. No sworn declarations. No independent benchmarking. Just recall figures, commercial confidence, and crossed fingers.

Something will go wrong. When it does, this mock courtroom session will look less like a milestone, and more like a warning that nobody heeded.

See also:

From TAR to HAR Report: Are GenAI Discovery Tools Ready for Forensic Scrutiny? Read it here now!

The post Comment: Legalweek’s GenAI mock courtroom may be the warning nobody heeded appeared first on Legal IT Insider.
