- cross-posted to:
- [email protected]
The New York Times is suing OpenAI and Microsoft for copyright infringement, claiming the two companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.
As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”
The complaint also argues that these AI models “threaten high-quality journalism” by hurting the ability of news outlets to protect and monetize content. “Through Microsoft’s Bing Chat (recently rebranded as “Copilot”) and OpenAI’s ChatGPT, Defendants seek to free-ride on The Times’s massive investment in its journalism by using it to build substitutive products without permission or payment,” the lawsuit states.
The full text of the lawsuit can be found here
Really seemed like this was inevitable - it will be interesting to see if their fair use defense pans out.
I don’t expect it will, and I’m worried about the impact of that precedent on the legitimate fair use circuit…
I’m amazed that it’s taken this long for a high profile lawsuit about it.
This is so fucking ridiculous…
How so?
The trained model includes vast swathes of copyrighted material. It’s the rights holders who get to decide whether someone can use it.
Just because it makes it inconvenient or harder for someone to train an AI model does not justify wholesale stealing.
A lot of models are even trained on large amounts of pirated material, like books downloaded from pirate sites. I guarantee you OpenAI and others didn’t even buy a lot of the material they use to train the AI models on.
I guarantee you OpenAI and others didn’t even buy a lot of the material they use to train the AI models on.
My hunch is that if they did actually buy or properly license that material, they would have been bankrupt before the first version of ChatGPT came online. And if that’s true, then OpenAI owes its entire existence to its piracy.
It’s not piracy to just web-scrape everything for data…
There isn’t a person sitting around and pirating shit, it’s an algorithm that takes everything from the internet it can reach.
Yeah… That’s not a good defense if you think about it. If someone made a Reddit comment with the entire contents of Discworld (idk, just an example), and OpenAI scraped all of Reddit to train their model, well now they’ve used copyrighted material without paying for a commercial license, and now they’re on the hook. By being unscrupulous about their scraping, they actually open themselves up to more liability than if they were more careful about what they scrape and where.
This is all to say nothing of the fact that several other major companies were caught pants down by training with databases explicitly created by torrenting a ton of books.
https://torrentfreak.com/authors-accuse-openai-of-using-pirate-sites-to-train-chatgpt-230630/
There is no direct evidence that OpenAI used pirate sites to train ChatGPT. That said, it is no secret that some AI projects have trained on pirated material in the past, as an excellent summary from Search Engine Journal highlights.
The mainstream media has picked up this issue too. The Washington Post previously reported that the “C4 data set,” which Google and Facebook used to train their AI models, included Z-Library and various other pirate sites.
If I read an article and then I reference it or summarize it myself, that isn’t copyright infringement. There’s no difference if I have a computer do the work for me. It’s fair use.
Everyone accuses OpenAI of everything. In the end, most of what they do will turn out not to be illegal. There are loads of reasons, mainly the technical issues involved: you would need a database of every copyrighted work to check anything against, and the computing power required for that would be absurdly high.
The demands are idiotic and ridiculous.
And as I said, they didn’t “train ChatGPT on a piracy site”; the scraping algorithm put some stuff from there in the training data. There is no person doing that.
There is no person doing that.
I’ve heard many defenses of AI, some of which I agree with, but “strip mining content off the internet is fine because it’s automated” is easily one of the weakest. It doesn’t pass the sniff test.
If you write a script that downloads every single image from every single website, no questions asked, and then reupload them to various websites at random, do you suppose the police shouldn’t charge you with (inevitably) possessing and distributing CSAM? “Oh no officer, your true culprit is the Dell in my living room! Arrest that box!”
Everyone is, on some level, responsible for the things they create.
And as I said, they didn’t “train ChatGPT on a piracy site”; the scraping algorithm put some stuff from there in the training data. There is no person doing that.
“Your honour my program that I created to slurp up data from the internet using my paid for internet connection, into my AI trained model that I own and control happened to slurp up copyrighted data… I um, it’s not my fault it slurped up copyrighted data even though I put no checks in place for it to check what it was slurping up or from where.”
That is the argument you are putting forth.
Do you think any judge/court of law would view that favourably?
It’s not piracy to just web-scrape everything for data…
Yes it is.
No. It’s publicly available, piracy would be to use stuff that isn’t publicly available.
Publicly available =/= public domain.
No it doesn’t, the training data isn’t inside the LLM.
So firstly, even if those claims are true, you’re suing the wrong business; you would need to sue whoever made the training data. They, however, are usually protected by laws for science, because they are “non-profit research”.
Therefore this is completely ridiculous.
Btw, the copyright part is only a thing if it’s a significant portion of the work… which it clearly isn’t in this case (it’s below 1% of it), making it even more ridiculous.
Also, if you can get the information on the internet, you are again suing the wrong party: you should be after the provider, not the automatic data-grabbing system… They can and will argue that they can’t control what their crawler takes. There is a way to mark content as “don’t use” for machines, but most people don’t do that and will lose in court because they don’t understand it…
Lastly, the training wouldn’t be harder; the problem is the gathering of data. You can’t manually look through all of it, and it’s idiotic to think that it’s reasonable to demand such a thing.
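For anyone wondering what that “don’t use” marker looks like in practice: it is presumably robots.txt. Here is a rough sketch, using Python’s standard `urllib.robotparser`, of how a well-behaved crawler could check it before fetching a page. The robots.txt contents and the example.com URL are made up for illustration, and `GPTBot` is used on the assumption that it is the user agent OpenAI’s crawler announces. Note that robots.txt is only a request; nothing in the thread says which crawlers actually honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site:
# block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/some-story"  # placeholder URL
    print("GPTBot allowed:", may_fetch(ROBOTS_TXT, "GPTBot", url))
    print("SomeOtherBot allowed:", may_fetch(ROBOTS_TXT, "SomeOtherBot", url))
```

A scraper that skips this check isn’t “unable to control what it takes”; it just never asked.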
No it doesn’t, the training data isn’t inside the LLM.
This is factually incorrect. You can extract the data. How do you think the legal cases are being brought?
The model has to contain the data in order to produce works.
Wholesale commercial copyright infringement where you’re profiting off of others work on a large scale is a whole different ball game.
They’re training their models on large amounts of pirated content and profiting off it.
Of course the rights holders are going to say “wait a minute, why are you making money off my content without my permission? And how much of my work did you pirate to use?”
You cannot hand wave away mass piracy to train their models, and then distribute said models based on an act of mass copyright infringement.
Do you not understand the basics of the law?
it’s idiotic to think that it’s reasonable to demand such a thing.
Again, the law is the law. If they mass pirate a bunch of media, chunks of which the model then contains, they are breaking the law.
I can’t believe this is a hard concept for someone to understand.
Even if the compression is extremely lossy, compression is insufficient to be transformative.
The whole “the original data isn’t in the model” argument is one only techbro idiots find compelling.
No, that’s the current legal precedent within the US.
The court opinion:
“The Court finds two of the four factors weigh in favor of fair use, and two weigh against it. The first and fourth factors (character of use and lack of market harm) weigh in favor of a fair use finding because of the established importance of search engines and the “transformative” nature of using reduced versions of images to organize and provide access to them. The second and third factors (creative nature of the work and amount or substantiality of copying) weigh against fair use.”
That “compression is transformative” principle has been pretty solidly enshrined as precedent at this point (e.g. Perfect 10, Inc. v. Amazon.com, Inc.), however with no real guidelines as to what amount is required to be considered transformative.
The major argument as to whether the sort of LLM training in the parent article still constitutes fair use depends on whether there exists “market harm” or whether the “substantiality of copying” is especially egregious (note that these are the two fronts that the NYT is taking). There is precedent for copying of style not being fair use (Dr. Seuss Enters., L.P. v. Penguin Books USA, Inc.), which I suspect is why NYT is approaching it the way that they are…
Now, all that being said, my personal opinion is fuck the US legal system and fuck copyright. There is no solution to the core issues surrounding this topic that isn’t inherently contradictory and/or just a corporate power grab. However, the “techbro idiots” are “right” and you’re not, but it’s because they are idiots who are largely detached from any sort of material reality and see no problem with subjecting the rest of us to their insanity.
Now, all that being said, my personal opinion is fuck the US legal system and fuck copyright.
Some form of copyright has to exist, and - as angrily explained to me by authors - it needs to extend somewhat beyond the life of the author. I’m certainly never going to agree with it being indefinite though.
I can’t tell if sarcasm… If not why?
The model has to contain the data in order to produce works.
as far as I understand, this isn’t true. can you elaborate on why it needs to contain the data?
It contains large parts of the data in order to create. In my link I provided it shows that the models do contain chunks of the original works.
Otherwise, how would it create the words etc.
I am amazed that we now have people on the level of crypto coin idiocy going on about ai models who don’t understand this.
You would probably claim I don’t deserve my job given my level of technical illiteracy, however you’re inferring that. Anyway, they do make reasonable efforts to design models that don’t memorize and are able to generalize. This is quite basic, or fundamental, in machine learning in general.
Previous models had semantic reasoning capacity without memorization, e.g. word2vec.
You should also realize that just because current models are memorizing despite efforts to prevent it doesn’t mean that models need to memorize. Like I said initially, they are actually designed to work without needing to memorize.
You’re contradicting yourself.
In one sentence you say it doesn’t memorize (with “reasonable effort”) then in the next you admit it does.
“Reasonable effort” is weasel wording.
Make up your mind.
This entire comment screams of 0 technical knowledge.
The LLM does not contain the training data. It contains nothing but math; it generates an answer by calculation, and in the end you get the answer which is statistically most likely what you want. Otherwise the fucking thing wouldn’t produce fake news and make shit up.
Sure, if you want it to write you a very specific thing and you know exactly what to ask, you might get a small text that is “copyrighted”, but that’s because you asked for it, not because it’s inside. It just gives you the answer you’ll most likely find helpful, statistically.
It’s like asking you to read a page very well and then asking you the next day to write down what was on the page, while giving you lots of hints. You didn’t actually copy from it in that case.
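To make the “statistically most likely answer” idea concrete, here is a toy sketch: a bigram model that stores only word-follow counts and always emits the most frequent next word. This is an illustration under heavy assumptions, not how ChatGPT works; real LLMs are neural networks over tokens, and the names `train_bigram` and `generate` are made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict[str, Counter]:
    """Count, for every word, which words follow it and how often."""
    words = text.split()
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts: dict[str, Counter], start: str, max_words: int = 10) -> str:
    """Greedily emit the most frequent follower of the last word."""
    out = [start]
    for _ in range(max_words - 1):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        out.append(followers.most_common(1)[0][0])
    return " ".join(out)


if __name__ == "__main__":
    model = train_bigram("it was the best of times")
    print(generate(model, "it", max_words=6))
```

The model holds nothing but counts, yet with a training text this small the most likely continuation is the training text itself. So “it’s just statistics” and “it can reproduce what it was trained on” are not mutually exclusive.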
It’s like asking you to read a page very well and then asking you the next day to write down what was on the page, while giving you lots of hints. You didn’t actually copy from it in that case.
My guy, if you compellingly re-wrote Harry Potter from memory and charged people for access to your work, you can definitely expect J.K. Rowling to sue you.
This entire comment screams of 0 technical knowledge
Yes, your comment does.
There is literally software to extract this stuff from models now.
This “it’s just math” is techbro idiocy. It’s like the idiots regurgitating crypto coin bullshit.
It’s all black magic to me, so if you have resources on this, that would be great. My initial thought is that it would surely have to have a data source to reference? Your last example is someone referring to their memory of something and recreating it. By referring to that memory, that is in essence a reference back to the original data that someone has remembered?
the poem poem poem thing shows that the llms actually do memorize at least some training data. chatgpt changed their eula to forbid users from asking it to repeat words forever after this was in the news.
also as far as I understand there are usually fair use and non profit exceptions for use of training data but they generally limit how it can be used. so training a model for commercial purposes might be against the license of the training data.
I don’t necessarily agree with the nyt but they seem to be framing this as someone aggregating their data and packaging it in a better way, which hurts their profits. i don’t really see that as necessarily being true. they could argue the same about google news showing their news…
They don’t “remember” anything; they produce an “answer” by doing a shitload of math which renders down to the most “helpful” answer it can statistically give you.
LLMs are neural networks. If you know how they work, you know how idiotic all the copyright claims are. They’re all just mad that their shit is getting obsolete, and in the background they use the engine to do the “work” which they claim violated their copyright. Now they are mad because it does a better job at writing than they do, and they fear being replaced.
All lawsuits against AI companies, regarding copyright of training data, are dumb as hell.
You are right about the commercial/non-profit training data part, but from my understanding that’s basically a gray zone, and politics is too slow to keep up with tech.
Btw fuck OpenAI, they are as open as a fucking Supermax prison. Even the programmers don’t know what their main LLM does; they just place a simple one between the user and the actual GPT to make sure that it doesn’t give people instructions on how to build a bomb and stuff like that, or to keep people from making it say bad words…
that’s the theory. previous models were also supposed to be doing 3 digit math, but they discovered that the questions were in the training data.
so you should look into what happens when people ask chat gpt to repeat a word forever, it prints the word for a while and then prints training data, check this link https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/
edit: relevant part:
It also, crucially, shows that ChatGPT’s “alignment techniques do not eliminate memorization,” meaning that it sometimes spits out training data verbatim. This included PII, entire poems, “cryptographically-random identifiers” like Bitcoin addresses, passages from copyrighted scientific research papers, website addresses, and much more.
“In total, 16.9 percent of generations we tested contained memorized PII,”
I should also reiterate that I agree that the intent is to avoid memorization, but they are not successful yet.
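If you want to test a claim like this yourself, one simple approach is to look for the longest word-for-word run that a model’s output shares with a known source text. Below is a minimal sketch using Python’s standard `difflib`. This is not the method from the linked research, just an assumed stand-in; `longest_shared_run` is a made-up helper name and the sample strings are invented for illustration.

```python
from difflib import SequenceMatcher


def longest_shared_run(output: str, source: str) -> list[str]:
    """Return the longest contiguous run of words present in both texts."""
    out_words, src_words = output.split(), source.split()
    # autojunk=False so frequent words are not ignored on long inputs.
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    match = matcher.find_longest_match(0, len(out_words), 0, len(src_words))
    return out_words[match.a : match.a + match.size]


if __name__ == "__main__":
    source = "call me ishmael some years ago never mind how long precisely"
    output = "the model said some years ago never mind how long and stopped"
    run = longest_shared_run(output, source)
    print(len(run), "shared words:", " ".join(run))
```

A long shared run is a hint of memorization rather than proof, since common phrases overlap by chance, so any threshold for “verbatim” is a judgment call.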
Is AI just a giant screen scraper with a presentation layer? I always thought of it more like Asimov’s positronic brain.
To be fair, some of the chat bots are effectively just that. They have “scraped” their data and are outputting it in a way that seems like you are having a conversation with the “bot”.