NYT is suing OpenAI & Microsoft for alleged copyright infringement; some quick thoughts
Unpaywalled article, the lawsuit.
(I don’t have a law degree, and this is not legal advice; my only relevant background is a US copyright law course I took many years ago.) I’ve read most of the lawsuit and skimmed the rest. Some quick thoughts on the allegations:
Memorisation: when ChatGPT outputs text that closely copies original NYT content, this is clearly a copyright infringement. I think it’s clear that OpenAI & Microsoft should be paying everyone whose work their LLMs reproduce.
Training: it’s not clear to me whether training LLMs on copyrighted content infringes copyright under current US law. I think lawmakers should introduce regulation making it an infringement, but I don’t think courts should find infringement under the laws as they stand (although I might not be familiar with all the relevant case law).
Summarising news articles found on the internet: copyright protects expression, not facts. If you read about something in a NYT article, the knowledge you gained isn’t protected by copyright, and you’re free to share it. If an LLM summarises text it has lawful access to and merely conveys the same facts, I think that doesn’t violate copyright, or might be fair use. The damage NYT alleges from Bing is damage Wikipedia also causes by citing facts and linking the source. To the extent LLMs don’t preserve the wording or the creative structure, copyright provides no protection; and some preservation of the structure might be fair use.
Hallucinations: ChatGPT hallucinating false information and attributing it to NYT falls outside copyright law, but seems bad and damaging. I’m not sure what existing law covers this, but even if nothing does, it’d be great to see regulation making AI companies liable for the full range of damage their products cause, including attributing statements to people who never made them.
This is great news. I particularly agree that legislators should pass new laws making it illegal to train AIs on copyrighted data without the consent of the copyright owner. This is beneficial from at least two perspectives:
If AI is likely to automate most human labor, then we need to build systems for redistributing wealth from AI providers to the rest of the world. One previous proposal is the robot tax, which would offset the harms of automation borne by manufacturing workers. Another popular idea is a Universal Basic Income. Following the same philosophy as these proposals, I think the creators of copyrighted material ought to be allowed to name their price for training AI systems on their data. This would distribute some AI profits to a larger group of people who contributed to the model’s capabilities, and it might slow or prevent automation in industries where workers organize to deny AI companies access to training data. In economic terms, automation would then only occur if the benefits to firms and consumers outweigh the costs to workers. This could reduce concentration of power via wealth inequality, and slow the takeoff speeds of GDP growth.
For anyone concerned about existential threats from AI, restricting the supply of training data could slow AI development, leaving more time for work on technical safety and governance which would reduce x-risk.
I think previous counterarguments against this position are fairly weak. Specifically, while I agree that foundation models which are pretrained to imitate a large corpus of human-generated data are safer in many respects than RL agents trained end-to-end, I think that foundation models are clearly the most promising paradigm over the next few years, and even with restrictions on training data I don’t think end-to-end RL training would quickly catch up.
OpenAI appears to lobby against these restrictions. This makes sense if you model OpenAI as profit-maximizing. Surprisingly to me, even OpenAI employees who are concerned about x-risk have opposed restrictions, writing “We hope that US policymakers will continue to allow this area of dramatic recent innovation to proceed without undue burdens from the copyright system.” I wonder if people concerned about AI risk may have been “captured” by industry on this particular issue, meaning that people have unquestioningly supported a policy because they trust the AI companies which endorse it, even though the policy might increase x-risk from AI development.
Curious why this is being downvoted. I think legislators should pass laws which have positive consequences. I explained the main reasons why I think this policy would have positive consequences. Then I speculated that popular beliefs on this issue might be biased by profit motives. I did not claim that this is a comprehensive analysis of the issue, or that there are no valid counterarguments. Which part of this is norm-violating?
I’d also be curious to know why (some) people downvoted this.
Perhaps it’s because you imply that some OpenAI folks were captured, and maybe some people think that that’s unwarranted in this case?
Sadly, the more-likely explanation (IMO) is that policy discussions can easily become tribal, even on LessWrong.
I think LW still does better than most places at rewarding discourse that’s thoughtful/thought-provoking and resisting tribal impulses, but I wouldn’t be surprised if some people were doing something like “ah he is saying something Against AI Labs//Pro-regulation, and that is bad under my worldview, therefore downvote.”
(And I also think this happens the other way around as well, and I’m sure people who write things that are “pro AI labs//anti-regulation” are sometimes unfairly downvoted by people in the opposite tribe.)
Why do you think that it should be illegal to train an AI on content that’s copyrighted but publicly accessible? Should it also be illegal for people to learn from copyrighted material? If not, what are the relevant differences between humans and AIs? Are there possible future situations in which you think it would be OK for AIs to learn from copyrighted material?
Imagine, e.g., a situation where somehow it’s turned out that the AIs we make are broadly comparable to humans in cognitive abilities, and they “live” alongside us in something like the way you see in some science fiction where there are humans and somewhat-human-like robots, and they learn in something like the way humans do. Would you then want humans able to learn from any materials that have been published, while the AIs have to learn only from material that has explicit “yes, AIs can learn from this” permissions attached? You may well feel that this sort of scenario is wildly improbable, and you’d probably be right, but if you would want AIs able to learn from the same things as humans in that scenario but not in more-probable ones, what is it about these scenarios that makes the difference?
I’d suggest looking at this from a consequentialist perspective.
One of your questions was, “Should it also be illegal for people to learn from copyrighted material?” This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It’s almost a Kantian perspective—“What would happen if we universalized this principle?” But I don’t think that’s a good heuristic for AI policy. For just one example, I don’t think AIs should be given constitutional rights, but humans clearly should.
My other comment explains why I think the consequences of restricting training data would be positive.
I don’t say that the same policies must necessarily apply to AIs and humans. But I do say that if they don’t then there should be a reason why they treat AIs and humans differently.
Why?
If a law treats people a certain way, there must be a reason for that, because people have rights.
But if a law treats non-people a certain way, there doesn’t need to be any reason for that. All that is required is that there be good reasons for what consequences the law has for people.
There does not seem to be any reason why the default should be to treat AIs and humans the same way (or to treat AIs in any particular way).
I think “humans are people and AIs aren’t” could be a perfectly good reason for treating them differently, and didn’t intend to say otherwise. So, e.g., if Mikhail had said “Humans should be allowed to learn from anything they can read because doing so is a basic human right and it would be unjust to forbid that; today’s AIs aren’t the sort of things that have rights, so that doesn’t apply to them at all” then that would have been a perfectly cromulent answer. (With, e.g., the implication that to whatever extent that’s the whole reason for treating them differently in this case, the appropriate rules might change dramatically if and when there are AIs that we find it appropriate to think of as persons having rights.)
Humans can’t learn from any materials that NYT has published without paying NYT or otherwise getting a permission, as NYT articles are usually paywalled. NYT, in my opinion, should have the right to restrict commercial use of the work they own.
The current question isn’t whether digital people are allowed to look at something and learn from it the way humans are allowed to; the current question is whether for-profit AI companies can use copyrighted human work to create arrays of numbers that represent both the work process behind the copyrighted material and the material itself, by adjusting these numbers to increase the likelihood that specific operations on them produce the copyrighted material. These AI companies then use these extracted work processes to compete with the original possessors of these processes. [To be clear, I believe that further refinement of these numbers to make something that also successfully achieves long-term goals is likely to lead to no human or digital consciousness existing or learning or doing anything of value (even if we embrace some pretty cosmopolitan views; see https://moratorium.ai for my reasoning on this), which might bias me towards wanting regulation that prevents big labs from achieving ASI until safety is solved, especially policies that support innovation, startups, etc.: anything that has benefits without risking the existence of our civilisation.]
If in the specific case of NYT articles the articles in question aren’t intended to be publicly accessible, then this isn’t just a copyright matter. But the OP doesn’t just say “there should be regulations to make it illegal to sneak around access restrictions in order to train AIs on material you don’t have access to”, it says there should be regulations to prohibit training AIs on copyrighted material. Which is to say, on pretty much any product of human creativity. And that’s a much broader claim.
Your description at the start of the second paragraph seems kinda tendentious. What does it have to do with anything that the process involves “arrays of numbers”? In what sense do these numbers “represent the work process behind the copyrighted material”? (And in what sense if any is that truer of AI systems than of human brains that learn from the same copyrighted material? My guess is that it’s much truer of the humans.) The bit about “increase the likelihood of … producing the copyrighted material” isn’t wrong exactly, but it’s misleading and I think you must know it: it’s the likelihood of producing the next token of that material given the context of all the previous tokens, and actually reproducing the input in bulk is very much not a goal.
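The next-token point can be illustrated with a deliberately crude toy: a bigram counter, nothing like a real transformer, with all names made up for illustration. Training only adjusts statistics for predicting the next token given the preceding context, yet text that repeats often enough in the corpus becomes reproducible almost verbatim as a side effect, without bulk reproduction ever being a goal.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Conditional distribution over the next token, given the previous one."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# A tiny corpus in which one phrase repeats many times
# (analogous to widely syndicated article text).
corpus = ("the cat sat . " * 50 + "the dog ran . ").split()
model = train_bigram(corpus)

# After "the", the model overwhelmingly predicts "cat" (50/51):
# the repeated phrase is effectively memorised, even though training
# only ever adjusted next-token statistics.
print(next_token_probs(model, "the"))
```

Greedily sampling from such a model regurgitates the repeated phrase, which is the mechanism behind the memorisation allegations, even though the objective at every step was only local next-token prediction.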
It may well be true that all progress on AI is progress toward our doom, but it’s not obviously appropriate to go from that to “so we should pass laws that make it illegal to train AIs on copyrighted text”. That seems a bit like going from “Elon Musk’s politics are too right-wing for my taste and making him richer is bad” to “so we should ban electric vehicles” or from “the owner of this business is gay and I personally disapprove of same-sex relationships” to “so I should encourage people to boycott the business”. In each case, doing the thing may have the consequences you want, but it’s not an appropriate way to pursue those consequences.
If the NYT is paywalled, how did the training of ChatGPT have access to it? If OpenAI negotiated terms with NYT for access, the question is then whether ChatGPT violates those terms.
I guess NYT spits out unpaywalled articles to search engines (to get clicks, expecting that search engines’ users won’t have access to the full texts), but getting unpaywalled HTML doesn’t mean you can use it however you want. OpenAI did not negotiate terms prior to scraping NYT, according to the lawsuit. I believe the NYT terms prohibit commercial use without acquiring a license; I think the lawsuit mentions a standard cost of around $10 per article if you want to circulate it internally within your company.
The NYT paywall didn’t do anything if JavaScript was disabled.
EDIT: I’ve noticed recently that NYT articles are cut off before the end now, even without JavaScript. I wonder if the timing of this paywall upgrade is related to the lawsuit?
My feeling is that it could have been fair use as long as LLMs were just research projects, but then OpenAI started selling theirs as a product without changing their working model at all, and if you’re commercialising the model, it’s another story. Not sure where open models would fall here, but I still reckon it’s probably copyright infringement, since you’re not using the data only internally. Would like to hear an expert’s opinion on this.
The problem is that this is a new kind of case: it completely destroys the business model of these websites if you can have an AI agent visit them and then relay a summary to you, since it denies them the clicks (and ad impressions) they need to fund themselves. At which point, odds are they’ll just lean on paywalls even harder if they’re not protected from this.
I could imagine something like a defamation lawsuit? But it would probably have to focus on a specific case, not the general possibility of it? Again, hard to guess, this is all unexplored territory and new questions that never needed to be asked until now.
Can someone familiar with LLMs comment on how ChatGPT can memorize so many NYT texts so well?
I see a case against punishing here, in general. Consider me asking you “What did Mario say?”, and you answering, in private:
“Listen to this sentence—even though it might well be totally wrong: Mario said xxxx.”,
or, even more in line with the ChatGPT situation,
“From what I read, I have the somewhat vague impression that Mario said xxxx—though I might mix this up, so you may really want to double-check.”
Assume Mario has not said xxxx. We still have a strong case for not, in general, punishing you for the above statement. And even if I acted badly in response to your message, so that someone gets hurt, I’d see, a priori, the main blame to fall upon me, not you.
The parallels to the case of ChatGPT[1] suggest extending a similar reservation about punishment to our current LLMs.
Admittedly, pragmatism is in order. If an LLM’s hallucinations, despite warnings, end up creating entire groups of people attacking others over false statements, it may be high time to rein in the AI. But that should not be the default response to false attributions, not as long as the warning is clear and obvious: do not trust it at all as of yet.
In addition to knowing today’s LLMs hallucinate, we currently even get a “ChatGPT can make mistakes. Consider checking important information.” right next to its prompt.
ChatGPT isn’t a substitute for a NYT subscription. It wouldn’t work at all without browsing, and with browsing enabled it would probably get blocked, both by NYT via its user agent and by OpenAI’s “alignment.” Even if it doesn’t get blocked, it would be slower than skimming the article manually, and its output wouldn’t be trustworthy.
OTOH, NYT can spend pennies to have an AI TLDR at the top of each of their pages. They can even use their own models, as semanticscholar does. Anybody frugal enough to prefer the much worse experience of ChatGPT would not have paid NYT in the first place: you can bypass the paywall trivially.
In fact, why don’t NYT authors write a TLDR themselves? Most of their articles are not worth reading. Isn’t the lack of a summary an anti-user feature to artificially inflate their offering’s volume?
NYT would, if anything, benefit from LLMs potentially degrading the average quality of the competing free alternatives.
The counterfactual version of GPT4 that did not have NYT in its training is extremely unlikely to have been a worse model. It’s like removing sand from a mountain.
The whole case is an example of rent-seeking post-capitalism.