This lawsuit raises the overlooked issue concerning legal databases charging people for money for content that they don’t really own. Anyone who’s been to law school is trained to use Westlaw on a limited academic license, paid for with prohibitive students tuitions, in order to complete research that ultimately is in the public’s interest. Student access to Westlaw opens many doors for employment in the legal trade, because a significant number of law firms cannot afford to pay for Westlaw access, so they rely on interns’ academic access. These paid databases are hurting the public the most. There is no copyright per se on judgments or legislation, or citations. Filed proceedings, except under non-publication orders, are considered public information, which means they should be accessible. For free.
Legal information and trial data shouldn’t be kept behind a paywall
We contend that all legal databases should be free for the public, and by extension free to train large language models. I believe that in this day and age when legislation changes often, with numerous reforms undergoing in a whole brand new world to look forward to, the public should be equipped for free with all the tools traditionally at lawyers disposal, no less to figure out their rights and duties, and become better citizens.
Monetizing a large language model however, that sifts through all this data and answers your questions like a lawyer, is justified, because it will save us thousands of hours of research (if it works at all), will improve access to justice, and will root out frivolous lawsuits. Thomson Reuters is training their own LLM on data they don’t really own, but simply aggregate into databases, they call “proprietary”. And now they want to stop competing LLM’s from training on said databases. For example, CanLII in Canada and some of Soquij also have proprietary elements but remain free for the public for what matters most, namely jurisprudence and caselaw by provision. The parts that are not free yet, such as dockets access, shouldn’t be monetized either. Dockets info is completely free in states like California and in the UK.
It feels like Thomson Reuters wants to stall innovation and monopolize legal LLM’s
Indeed, Thomson Reuters is accusing Ross Intelligence of unlawfully copying content from its legal-research platform Westlaw to train a competing artificial intelligence-based platform. A decision by U.S. Circuit Judge Stephanos Bibas sending the case to a jury sets the stage for what could be one of the first trials related to the “unauthorized” use of data to train AI systems.
When you pay Westlaw a salty hourly fee to access its databases, nothing precludes you from copying this information at will for whatever purpose you need it for, which evidently includes training LLM’s. If anything, there should be more LLM’s training on Westlaw’s databases.
This is very different from tech companies such as Meta Platforms, Stability AI and Microsoft-backed OpenAI facing lawsuits from authors, visual artists and other copyright owners over the use of their work to train the companies’ generative AI software. Authors, artists and copyright owners actually own copyright over their works that have been used without their consent. The same cannot be said about Thomson and Reuters. Nobody gave them a license to make those databases, because a licence is not required in the first place. In theory, anyone or a bot can make such databases by compiling publicly available information.
The issue revolves mainly around the “headnotes” which summarize points of law in court opinions. These are citations extracted from the opinions themselves. They are something of an extremely detailed bullet point deconstruction of the legal analysis. Students do that every day. Another thing about the headnotes, handy as they are, you do need a bot to go through all of them, because they end up taking up more space than the entire judgment. I don’t agree that they are proprietary. I tend to agree with the defendant that they are fair use.
Ross said that the Headnotes material was used as a “means to locate judicial opinions,” and that the company did not compete in the market for the materials themselves. Thomson Reuters responded that Ross copied the materials to build a direct Westlaw competitor.
The court decided to leave up to the jury to decide fair use and other questions, including the extent of Thomson Reuters’ copyright protection in the headnotes. He noted that there were factors in the fair-use analysis that favored each side. The judge said he could not determine whether Ross “transformed” the Westlaw material into a “brand-new research platform that serves a different purpose,” which is often a key fair use question.
Yes, but it is not the only factor. Fair use analysis would apply if Westlaw had copyright over the headnotes to begin with. I think the headnotes are already fair use in a sense if we accept that judgments and papers are protected by copyright in theory, even though unenforceable in practice. I don’t see why you would need to prove transformative use when training models on someone else’s fair use material in a context where there is no economic right in the core content to begin with. It is indeed an interesting case.
“Here, we run into a hotly debated question,” judge Bibas said. “Is it in the public benefit to allow AI to be trained with copyrighted material?”
I would answer the question with a resounding: YES.