Take scraping. Companies like Clearview will tell you that scraping is legal under copyright law. They’ll tell you that training a model with scraped data is also not a copyright infringement. They’re right.
I was reading the article and thinking “suck a dick, AI companies” but then it mentions the EFF and ALA filed against the class action. I have found those organizations to be generally reputable and on the right side of history, so now I’m wondering what the problem is.
They don’t want copyright power to expand further. And I agree with them, despite hating AI vendors with a passion.
For an understanding of the collateral damage, check out How To Think About Scraping by Cory Doctorow.
I love Cory’s writing, but while he does a masterful job of defending scraping, and makes a good argument that in most cases, it’s laws other than Copyright that should be the battleground, he does, kinda, trip over the main point.
That is that training models on creative works and then selling access to the derivative “creative” works that those models output very much falls within the domain of copyright - on either side of a grey line we usually call “fair use” that hasn’t been really tested in courts.
Let’s take two absurd extremes to make the point. Say I train an LLM directly on Marvel movies, and then sell movies (or maybe movie scripts) that are almost identical to existing Marvel movies (maybe with a few key names and features altered). I don’t think anyone would argue that is not a derivative work, or that it falls under “fair use.” However, if I used literature to train my LLM to be able to read, and used that to read street signs for my self-driving car, well, yeah, that might be something you could argue is “fair use” to sell. It’s not producing copy-cat literature.
I agree with Cory that scraping, per se, is absolutely fine, as is even re-distributing the results in some ways that are in the public interest or fall under “fair use”, but it’s hard to justify the slop machines as not a copyright problem.
In the end, yeah, fuck both sides anyway. Copyright was extended too far and used for far too much, and the AI companies are absolute thieves. I have no illusions this type of court case will do anything more than shift wealth from one robber baron to another, and it won’t help artists and authors.
I think you’re failing to differentiate between a work, which is protected by copyright, vs a tool which is not affected by copyright.
Say I use Photoshop and Adobe Premiere to create a script and movie which are almost identical to existing Marvel movies. I don’t think anyone would argue that is not a derivative work, or that falls under “fair use”.
The important part here is that the subject of this sentence is ‘a work which has been created which is substantially similar to an existing copyrighted work’. This situation is already covered by copyright law. If a person draws a Mickey Mouse and tries to sell it then Disney will sue them, not their pencil.
Specific works are copyrighted and copyright laws create a civil liability for a person who creates works that are substantially similar to a copyrighted work.
Copyright doesn’t allow publishers to go after Adobe because a person used Photoshop to make a fake Disney poster. This is why things like BitTorrent can legally exist despite being used primarily for copyright violation. Copyright laws apply to people and the works that they create.
A generated Marvel movie is substantially similar to a copyrighted Marvel movie and so copyright law applies to it. A diffusion model is not substantially similar to any copyrighted work by Disney and so copyright laws don’t apply here.
@FauxLiving @Jason2357
I take a bold stand on the whole topic:
I think AI is a big Scam ( pattern matching has nothing to do with !!! intelligence !!! ).
And this Scam might end like the Dot-Com bubble in the late 90s ( https://en.wikipedia.org/wiki/Dot-com_bubble ), including the huge economic impact, because too many people have invested in an “idea”, not in a proven technology.
And as with the Dot-Com bubble, once the AI bubble has been cleaned up, Machine Learning and Vector Databases will stay forever ( maybe some other parts of the tech too ).
Neither needs copyright changes, because they will never try to be one solution for everything. Like a small model to transform text to speech … like a small model to translate … like a full text search using a vector db to index all local documents …
Like a small tool to summarize text.
I agree, and I think your points line up with Doctorow’s other writing on the subject. It’s just hard to cover everything in one short essay.
Ahhh, it makes more sense now. Thank you!
Let’s give them this one last win. For spite.
I disagree with the EFF and ALA on this one.
These were entire sets of writing consumed and reworked into poor data without respecting the license to them.
Honestly, I wouldn’t be surprised if copyright wasn’t the only problem here, but intellectual property as well. In that case, the EFF probably has an interest in that instead. Regardless, I really think it needs to be brought through court.
LLMs are harmful, full stop. Most other Machine Learning mechanisms use licensed data to train. In the case of software as a medical device, such as image analysis AI, that data is protected by HIPAA and special care is already taken in order to use it.
My guess is that the EFF is mostly concerned with the fact this is a class action and also worried about expanding copyright in general.
AI coding tools are using the exact same backends as AI fiction writing tools, so it would hurt the fledgling vibe coder profession (which according to proper software developers should not be allowed to exist at all).
The same goes for the Internet Archive - if scraping is illegal, then the Internet Archive is as well.