OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series
A new research paper laid out ways in which AI developers should try to avoid showing that LLMs have been trained on copyrighted material.
Sure, but then it’s only fair that these companies pay up when they train an AI. Schools don’t get to copy entire book transcripts off the Internet for lessons. They can’t pirate documentaries. And for higher education, the student pays tuition to learn. Students can pirate textbooks, but textbooks alone aren’t enough to learn a field of study.
If we’re going to use human analogies for AI, then it should be limited in the same ways. The companies have to buy any books or media, or use material that is explicitly in the public domain with respect to copyright law – you could post a transcript of it online in front of the most litigious lawyer, and nothing would happen.
A couple of things. First, there is no way to prove they are using “pirated content”. There are web scrapers that go through the internet and scrape everything. This is going to include discussions, articles, blog posts, and video transcripts of many people discussing copyrighted content. Because of this, the AI can give you a reasonable analysis of a book without ever having read the book.
Everything online is public. You cannot force someone to pay for seeing your Reddit post. Second, I bet they have large databases of all the books in the public domain. This includes a very large corpus of text.
This alone is probably enough to train their AI. Beyond that, presumably, they could pay for books. Textbooks, fiction, biographies, etc. They could pay for these and pump them into the system.
If I were them, personally, I would probably just find large torrents with all the books. Or write some automated script to pull from libgen. But there’s no real way of proving how they did this and there’s no real way of proving what content the AI was trained on.
It would create incredible legal liability for the company. If the authors and publishing studios caught wind of that, the AI companies would be sued into oblivion. Think about how intense media companies get about pirating when it’s just for pleasure or entertainment – if you’re using it to turn a profit, you’re legally fucked.
Having no chain of custody for knowing what the AI was trained on sounds like typical cost cutting, until you realize this means they can’t detect or identify another AI’s output. They’ll quickly become garbage models.
Well, obviously it’s a massive legal liability. However… seemingly legitimate, serious companies with large legal departments have been known to do legally dangerous things before. Apple pushed updates that throttled older iPhones with aging batteries, which nudged people toward buying new phones, and it ended up paying out fines. Volkswagen cheated on its emissions tests and, if I remember correctly, people went to prison over it. I wouldn’t put it past OpenAI to be doing illegal things for short-term benefit to their long-term detriment.
Not saying it’s happening, but I’m saying it’s possible and it’s hard for me or you to prove that it’s happening.
I’m sure they have good internal controls for what goes in the model. I’m guessing the information is very tightly controlled, for above reasons. I’m not sure what you mean by another AI’s output though.
I think they probably have criteria for what’s used to train it, but they don’t keep a list of what material was used. I believe they’ve said in the past they don’t have that information.
For another AI – these models fall apart when they’re trained on AI generated content, after a few generations. If they have no way of discerning if content is AI generated or not, they’re going to have a ticking time bomb. At some point the models will heavily degrade in quality because of it. The question I guess is what % of training material can be AI generated before it causes problems.
This does mean however that AI generated material can never become a substantial % of all the content out there. Whenever there’s too much, the algorithms will fall apart, and probably not recover until that content falls below a certain % again.
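To make the degradation point a bit more concrete, here is a toy Python sketch. It is only an illustration under stated assumptions, not how any real LLM is trained: the "model" is just a table of token frequencies, each generation is fitted to a finite sample drawn from the previous generation, and the function name and every parameter (`vocab_size`, `sample_size`, `human_fraction`, and so on) are made up for the example. The one thing it shows is that a rare token that fails to show up in a sample is gone for good unless fresh human-written data is mixed back in.

```python
import random
from collections import Counter


def simulate_generations(vocab_size=50, sample_size=200, generations=20,
                         human_fraction=0.0, seed=0):
    """Return the number of distinct tokens known to each generation.

    Hypothetical toy model: every name and default here is an assumption
    made for illustration.
    """
    rng = random.Random(seed)

    # "Human" data: a long-tailed (Zipf-like) distribution over tokens.
    human_tokens = list(range(vocab_size))
    human_weights = [1.0 / (rank + 1) for rank in range(vocab_size)]

    # Generation 0 is the human distribution itself.
    tokens, weights = human_tokens, human_weights
    support_sizes = [vocab_size]

    n_human = round(sample_size * human_fraction)
    n_model = sample_size - n_human

    for _ in range(generations):
        # Training set = output of the previous model + fresh human data.
        sample = rng.choices(tokens, weights=weights, k=n_model)
        sample += rng.choices(human_tokens, weights=human_weights, k=n_human)

        # "Train" the next model: it only knows tokens it has actually seen.
        counts = Counter(sample)
        tokens = sorted(counts)
        weights = [counts[t] for t in tokens]
        support_sizes.append(len(tokens))

    return support_sizes


if __name__ == "__main__":
    print("AI-only training data:", simulate_generations(human_fraction=0.0))
    print("Half human data:      ", simulate_generations(human_fraction=0.5))
```

With `human_fraction=0.0` the count of distinct tokens can only stay flat or shrink from one generation to the next, because nothing outside the previous model's output ever enters the training set. Raising `human_fraction` lets lost tokens come back, which is the "what % is too much" question in miniature. The exact numbers depend on the seed and sample size, so treat them as a demonstration of the mechanism rather than an estimate of any real threshold.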
I will look into this. I feel like that’s quite an oversight. Perhaps it’s easier to just tell the public otherwise because of the legal questions we are discussing. I would have kept everything in storage so I could retrain updated models, or what have you, on the same data.
I think it’s an interesting thing you bring up. There will be a sort of distinction in the corpus of human works. Pre ~2023 and Post ~2023. All work before that time will more or less be legitimate and you can use it for training data. Afterwards it will all be tainted.
Honestly the implications go further than that. For one, I don’t trust that there is a human behind any comment I see online anymore. Especially in topics and areas that I feel are likely to be astroturfed - like politics.
Very possible. I think they don’t want to keep things in storage because then they indisputably need to pay for it.
Agreed on the human element too. The Reddit protests were eye opening for me because of the supposed “pro Reddit/anti mod” crowd that showed up as a vocal minority. They popped out of nowhere, and in some cases they were verified as AI bots.