For years, researchers have trained machine learning systems on
whatever data they could find. People mostly haven’t cared about this
or paid attention, I think because the systems hadn’t been very good.
Recently, however, some very impressive systems have come out,
including ones that answer questions,
complete
code, and
generateimagesfrom prompts.
Because these are so capable a lot more people are paying attention
now, and there are big questions around whether it’s ok that these
systems were trained this way. Code that I uploaded to GitHub and the
writing that I’ve put into this blog went into training these models:
I didn’t give permission for this kind of use, and no one asked me if
it was ok. Doesn’t this violate my copyrights?
The machine learning community has generally assumed that training
models on some input and using it to generate new output is legal, as
long as the output is sufficiently different from the input. This
relies on the doctrine of “fair use”, which
does not require any sort of permission from the original author as
long as it is sufficiently “transformative”. For example, if I took a
book and replaced every instance of the main characters name with my
own I doubt any court would consider that sufficiently transformative,
and so my book would be considered a “derivative work” of the original
book. On the other hand, if I took the words in the book and
painstakingly reordered them to tell a completely unrelated story,
there’s a sense in which my book was “derived” from the original one
but I think it would pretty clearly be transformative enough that I
wouldn’t need any permission from the copyright holder.
These models can be used to create things that are clearly derivative
works of their input. For example, people very
quickly realized that Copilot would complete the code for Greg Walsh’sfast
inverse square root implementation verbatim, and if you ask any of
the image generators for the Mona Lisa or Starry Night
you’ll get something close enough to the original that it’s clearly a
knock-off. This is a major issue with current AI systems, but it’s
also a relatively solvable one. It’s already possible to slowly check
that the output doesn’t excessively resemble any input, and I think
it’s likely they’ll soon figure out how to do that efficiently. On the
other hand, all of the examples of this I’ve seen (and I just did some
looking)
have been people trying to elicit plagiarism.
The normal use case is much more interesting, and more controversial.
While the transformative fair use justification I described above is
widely assumed within the machine learning community as far as I can
tell it hasn’t been tested in court. There is currently a large
class action lawsuit over Copilot, and it’s possible this kind of
usage will turn out not qualify. Speculating, I think it’s pretty
unlikely that the suit will succeed, but I’ve created a prediction
market on it to gather information:
Aside from the legal question, however, there is also a moral or
social question: is it ok to train a model on someone’s work without
their permission? What if this means that they and others in their
profession are no longer able to earn a living?
On the second question, you could imagine someone creating
a model where they used only data that was either in the public domain
or which they’d purchased appropriate licenses for. While that’s
great for the particular people who agree and get paid, a much larger
number would still be out of work without compensation. I do think
there’s potentially quite a bad situation, where as these systems get
better more and more people are unable to add much over an automated
system, and we get massive technological
unemployment. Now, historically worries here proved unfounded, and
technology has consistently been much more of a human complement than
human substitute. As the saying goes, however, that was also the case
for horses until it wasn’t. I think a Universal
Basic Income is probably the best approach here.
On the first question, learning from other people’s work without their
consent is something humans do all the time. You can’t draw too
heavily on any one thing you’ve seen without following a complex set
of rules about permission and acknowledgement, but human creative work
generally involves large amounts of borrowing. These machine learning
systems are not humans, but they are fundamentally doing a pretty
similar thing when they learn from examples, and I don’t see a strong
reason to treat their work differently here. Because these systems
don’t currently understand how much borrowing is ok we do need to
apply our own judgment to avoid technologically-facilitated
plagiarism, but the normal case of creating something relatively
original that pulls from a wide range of prior work is fine for us to
do with our brains and should be equally ok for us to do with our
tools.
Machine Learning Consent
Link post
For years, researchers have trained machine learning systems on whatever data they could find. People mostly haven’t cared about this or paid attention, I think because the systems hadn’t been very good. Recently, however, some very impressive systems have come out, including ones that answer questions, complete code, and generate images from prompts.
Because these are so capable a lot more people are paying attention now, and there are big questions around whether it’s ok that these systems were trained this way. Code that I uploaded to GitHub and the writing that I’ve put into this blog went into training these models: I didn’t give permission for this kind of use, and no one asked me if it was ok. Doesn’t this violate my copyrights?
The machine learning community has generally assumed that training models on some input and using it to generate new output is legal, as long as the output is sufficiently different from the input. This relies on the doctrine of “fair use”, which does not require any sort of permission from the original author as long as it is sufficiently “transformative”. For example, if I took a book and replaced every instance of the main characters name with my own I doubt any court would consider that sufficiently transformative, and so my book would be considered a “derivative work” of the original book. On the other hand, if I took the words in the book and painstakingly reordered them to tell a completely unrelated story, there’s a sense in which my book was “derived” from the original one but I think it would pretty clearly be transformative enough that I wouldn’t need any permission from the copyright holder.
These models can be used to create things that are clearly derivative works of their input. For example, people very quickly realized that Copilot would complete the code for Greg Walsh’s fast inverse square root implementation verbatim, and if you ask any of the image generators for the Mona Lisa or Starry Night you’ll get something close enough to the original that it’s clearly a knock-off. This is a major issue with current AI systems, but it’s also a relatively solvable one. It’s already possible to slowly check that the output doesn’t excessively resemble any input, and I think it’s likely they’ll soon figure out how to do that efficiently. On the other hand, all of the examples of this I’ve seen (and I just did some looking) have been people trying to elicit plagiarism.
The normal use case is much more interesting, and more controversial. While the transformative fair use justification I described above is widely assumed within the machine learning community as far as I can tell it hasn’t been tested in court. There is currently a large class action lawsuit over Copilot, and it’s possible this kind of usage will turn out not qualify. Speculating, I think it’s pretty unlikely that the suit will succeed, but I’ve created a prediction market on it to gather information:
Aside from the legal question, however, there is also a moral or social question: is it ok to train a model on someone’s work without their permission? What if this means that they and others in their profession are no longer able to earn a living?
On the second question, you could imagine someone creating a model where they used only data that was either in the public domain or which they’d purchased appropriate licenses for. While that’s great for the particular people who agree and get paid, a much larger number would still be out of work without compensation. I do think there’s potentially quite a bad situation, where as these systems get better more and more people are unable to add much over an automated system, and we get massive technological unemployment. Now, historically worries here proved unfounded, and technology has consistently been much more of a human complement than human substitute. As the saying goes, however, that was also the case for horses until it wasn’t. I think a Universal Basic Income is probably the best approach here.
On the first question, learning from other people’s work without their consent is something humans do all the time. You can’t draw too heavily on any one thing you’ve seen without following a complex set of rules about permission and acknowledgement, but human creative work generally involves large amounts of borrowing. These machine learning systems are not humans, but they are fundamentally doing a pretty similar thing when they learn from examples, and I don’t see a strong reason to treat their work differently here. Because these systems don’t currently understand how much borrowing is ok we do need to apply our own judgment to avoid technologically-facilitated plagiarism, but the normal case of creating something relatively original that pulls from a wide range of prior work is fine for us to do with our brains and should be equally ok for us to do with our tools.
Comment via: facebook, mastodon