One of legal's hottest startups is helping lawyers finally answer: Is the AI's work any good?
Legal AI startups say their tools can absorb routine work. Crosby is releasing a benchmark to test whether lawyers should trust them.
- Billions of dollars are riding on the promise that artificial intelligence can absorb legal work.
- Crosby, a tech-driven law firm, built a benchmark to measure how well models negotiate contracts.
- Redline Bench is meant to help lawyers answer whether they can trust the technology's work.
Legal technology wants its vibe-coding moment. But first, it has to prove the tools can think like a lawyer.
Taking up the task is Crosby, a startup-meets-law-firm that sells basic legal services to companies, including Cursor and Rogo. On Wednesday, it released the Redline Bench, a tool built to measure how well artificial intelligence models perform real-world legal tasks, starting with contract review.
Software engineers have spent the past few years watching these systems get shockingly good at writing code and debugging errors. Now legal tech companies are chasing a similar prize: artificial intelligence that can review contracts, spot risks, and haggle over terms faster and cheaper than lawyers.
But law has a problem that coding does not, says Ryan Daniels, a former in-house lawyer turned Crosby founder. "It's really hard to define 'good' or 'bad,'" he said.
Models can write code that either runs or breaks. Legal work is a murkier target. A sales contract can be edited, or "redlined," in lots of defensible ways, Daniels explains. A change that one lawyer sees as prudent, another might call too aggressive.
That ambiguity has become a headache for companies racing to automate legal work, from the scrappy neofirms to the model labs themselves. Anthropic has spent the past few months courting in-house lawyers with tools built for them. That push has been closely watched by investors. Earlier this year, Anthropic's new legal plugin stirred a sell-off in legal tech stocks.
Benchmarks are one of the main ways companies track progress. The labs building frontier models use them as stress tests, measuring whether a new system is better at tasks than the last one.
Coding has hundreds of benchmarks for evaluating models. But the legal industry still lacks a shared way to answer the question: Is the AI's work any good?
Crosby has been working on a new yardstick. The company pulled its engineers and lawyers into a tactical unit called Crosby Intelligence to build agents for Crosby's law firm and a benchmark to grade them against. That team includes engineer Sharan Ramjee, who worked on transformer models to sniff out fraud at Stripe, and Ross Weiser, a lawyer who joined from elite law firm Sullivan & Cromwell.
Crosby also partnered with Micro1, a company that helps model-makers recruit expert workers, to find more lawyers who could help define what counts as good legal work.
To build the benchmark, senior lawyers simulated software deals and marked the contract changes they considered most important at each stage of the negotiation. Those changes were turned into weighted criteria.
When Crosby runs a new test, it gives models the same contracts and asks them to make their own edits. Then a panel of three judges compares these redlines with the lawyer-built rubric. The judges vote pass or fail on each item, and the final score shows how often the models made the kinds of edits that lawyers considered important.
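The scoring described above can be sketched in a few lines of Python. This is an illustration only, not Crosby's code: the article does not say how the three judges' votes are combined or how the weights enter the final number, so the majority-vote rule, the weighted-fraction formula, and every name here (`Criterion`, `passed`, `redline_score`) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item drawn from a lawyer-marked contract change (hypothetical shape)."""
    description: str
    weight: float                    # importance assigned by the lawyers
    votes: tuple[bool, bool, bool]   # pass/fail vote from each of the three judges


def passed(criterion: Criterion) -> bool:
    # Assumption: an item passes when at least two of the three judges vote pass.
    return sum(criterion.votes) >= 2


def redline_score(criteria: list[Criterion]) -> float:
    # Assumption: the score is the weight of passed items over the total weight.
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weighted criteria")
    earned = sum(c.weight for c in criteria if passed(c))
    return earned / total


# Made-up rubric for illustration; not data from the benchmark.
rubric = [
    Criterion("Cap liability at fees paid", 3.0, (True, True, False)),
    Criterion("Narrow the indemnity clause", 2.0, (False, False, True)),
    Criterion("Add mutual termination right", 1.0, (True, True, True)),
]
score = redline_score(rubric)
print(f"{score:.1%}")
```

Under these assumptions, a model that misses one heavily weighted edit loses more ground than one that misses a minor edit, which is the point of weighting the criteria.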
Redline Bench will be made public so any lab can put its models through Crosby's paces. Crosby also plans to regularly release reports tracking how major models compare.
The first release of the Redline Bench put ChatGPT 5.5 at the top of the heap, with a score of 50.5%, meaning the model's redlines matched about half of the edits that lawyers prioritized. Gemini 3.5 Flash followed at 45.1%, and Claude Opus 4.8 scored 44.4%.
Crosby was able to test Anthropic's highly capable new model, Fable 5, only once before Anthropic pulled it off the shelves. The results were promising, with a score of 47.3%. When access is restored, Crosby will run the benchmark again and update the results.
Crosby isn't the only company trying to measure how the models stack up. Harvey, one of the best-funded legal startups, has released benchmarks for case law research and contract review.
Anthropic and OpenAI also build their own benchmarks to measure performance on real-world tasks. But Daniels said those results can be hard to trust. Over time, the labs tune their systems to perform well on their own tests, he said.
The stakes are bigger than a scoreboard. Billions of investment dollars are riding on the promise that artificial intelligence can lower legal bills and absorb work that used to pile up on the general counsel's desk.
Lawyers will only use the tools if they trust them. Crosby wants to give them a reason to.
Read the original article on Business Insider