OpenAI is facing lawsuits over copyrighted materials it uses to train ChatGPT

NPR | By Michel Martin

Published August 18, 2023 at 4:09 AM CDT

MICHEL MARTIN, HOST:

AI tools like ChatGPT scrape millions of pages from the internet - news articles, books, blog posts. But is it legal? NPR tech reporter Bobby Allyn has learned that the New York Times is considering a lawsuit that asks that question. And he's here with us now to tell us more. Good morning, Bobby.

BOBBY ALLYN, BYLINE: Good morning, Michel.

MARTIN: So let's start with this - why is an AI tool like ChatGPT collecting so much data?

ALLYN: Yeah. Well, tools like ChatGPT only exist really because they're vacuuming up a staggering amount of data from the web really at all times. It's trained on the data of the internet, right? We're talking millions, likely billions of pages. Anything it can find is sucked up and used to make AI chatbots smarter. But from a legal standpoint, Michel, there's a big issue. And it's this - all this data that's being scraped has been scraped without permission.

MARTIN: So do the operators of chatbots have to ask for permission before hoovering up somebody's data?

ALLYN: Yeah, you know, it really depends. For a lot of the internet, no, it doesn't. But, you know, when it starts scanning and processing work that is copyrighted, it does get trickier. We're talking books, poems - anything that is published online and someone owns the rights to. Now, I talked to Daniel Gervais about this. He leads the intellectual property program at Vanderbilt University, and he studies generative AI.

DANIEL GERVAIS: So the machines are making a copy of the material before they process it. That could be copyright infringement.

ALLYN: Gervais says what is produced at the other end - so the output - could also be copyright infringement.

MARTIN: What are the consequences of that?

ALLYN: The consequences could be quite serious. A court could order that ChatGPT's prized possession, its data set, be completely destroyed since it contains copyrighted material. A court could fine a company $150,000 per infringement. Gervais says a successful copyright lawsuit has the potential of really bankrupting a company, since we're talking about millions and millions of instances of infringement.

GERVAIS: It's a sword that's going to hang over the heads of those companies for several years unless they negotiate a solution.

MARTIN: You know, it would seem that someone would have thought of this before now. I mean, it's not exactly a secret that this is the way these chatbots work. Are solutions being talked about? Are they being negotiated?

ALLYN: Yeah, in some instances they are - in other instances, no, right? I mean, some publishers are trying to hammer out licensing deals with OpenAI behind closed doors so that publishers get paid. Others are not playing so nice. Comedian Sarah Silverman is suing OpenAI for processing her memoir without her permission. Getty Images is suing the maker of a tool called Stable Diffusion over use of its photos that they said was illegal. And I recently learned that the New York Times is considering suing OpenAI for using its stories and archives without permission and without any compensation.

MARTIN: So before we let you go, what kind of defense does your reporting indicate that tech companies like OpenAI will likely be making?

ALLYN: Yeah, they're expected to use something called fair use doctrine. And to really boil that down, fair use law allows someone or a company to use copyrighted material without consent as long as certain conditions are met - for instance, if it's used for teaching or research or criticism or news reporting. You know, this law is intended to encourage freedom of expression, but there are real limits on it. For instance, the Supreme Court has said that if copyrighted material is used to make something new and that new thing competes with the original copyrighted work, that is not fair use. And that's the position of the New York Times here and many other publications, that ChatGPT is spitting out stuff that's becoming a replacement for its own stories - for reading articles on the New York Times website. And obviously that's a big problem if your company relies on readers and clicks and advertising dollars.

MARTIN: That is NPR's Bobby Allyn. Bobby, thank you.

ALLYN: Thanks, Michel.

Transcript provided by NPR, Copyright NPR.

NPR transcripts are created on a rush deadline by an NPR contractor. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.