1) Schools primarily use public domain knowledge for education. It's rarely your private blog post being used, mostly to teach people how to write blog posts.
2) There's no attribution, no credit. Public academia is heavily based (at least theoretically) on acknowledging every single paper you built your thesis on.
3) There's no payment. In school (whatever level) somebody's usually paying somebody for having worked to create a set of educational materials.
Note: as above, this is all very theoretical. There are huge amounts of corruption in academia and education. Between Vice and Virtue, who wants to watch the Virtue Squad solve crimes? What's sold in America? Working hard and doing your honest 9 to 5? Nah.
1) If your blog posts are private, why are they on publicly accessible websites? Why not put them behind a paywall of some sort?
2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.
3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?
> 1) If your blog posts are private, why are they on publicly accessible websites? Why not put them behind a paywall of some sort?
If I grow apple trees in front of my house and you come and take all the apples, then turn up at my doorstep trying to sell me apple juice made from the apples you nicked, the fact that I chose not to build a tall fence around my apple trees doesn't mean you had the right to do it. Public content is free for humans to read, not free for corporations to build paid content generation services on, taken without my knowledge or permission.
> 2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.
You are making this kind of argument: "How much is a drop of gas? Nothing. Right, could you fill my car drop by drop?"
If we have technology that can charge for producing bullshit on an industrial scale by recombining sampled works of others, we are perfectly capable of keeping track of the sources used for training and generative diarrhoea.
> 3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?
All of these responses were such high quality that there's really no need to add anything. I especially like the apple argument about a product in your front yard. You still have no basis to take them from my front yard.
If there were an equivalent of what a lot of other sites have (gems, gold, ribbons), I'd give you one. I've got a lot of gems; I'll send you an admittedly teeny heliodor, tourmaline, or peridot at cost if you want one. The gemstone market's junk lately with the economy.
You're both just repeating the "you wouldn't download an apple" argument. In the context of the Internet, you're voluntarily sending the user an apple and expecting them to not do various things to it, which is unreasonable. Nothing is taken. If it were, your website would be completely empty.
Remember, Copying Is Not Theft. Copyright law is just a temporary monopoly meant to economically incentivize you. Nothing more.
BTW, pro-AI countries do differentiate between private and public posts. If it's public, it's legally fair game to train on it. If it's private, you need a license to access it. So it does matter. Also see: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
Schools use books that were paid for, and library lending falls under PLR (Public Lending Right, in the UK), so authors of books used in schools do get compensated. Not a lot, but they do. AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff. Fuck that lot.
> AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff.
Funnily enough, they do understand that having your own product used to build a competing product is uncool; they just don't care unless it's happening to them.
> What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example [...] using Output to develop models that compete with OpenAI.
If you think going to school to get an education is the same thing as training an LLM then you are just so misguided. Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity. This is not what training an LLM does.
LLMs don’t memorize everything they’re trained on verbatim, either. It’s all vectors behind the scenes, which is comparable to how the human brain works. It’s all just strong or weak connections in the brain.
The output is what matters. If what the LLM creates isn’t transformative, or public domain, it’s infringement. The training doesn’t produce a work in itself.
Besides that, how much original creative work do you really believe is out there? Pretty much all art (and a lot of science) is based on prior work. There are true breakthroughs, of course, but they’re few and far between.
Some people memorize verbatim. Most LLM knowledge is not memorized. Easy proof: source material is in one language, and you can query LLMs in tens to a hundred plus. How can it be verbatim in a different language?
These "some people" would not fall under the "normal people" that I specifically mentioned. But you go right ahead and keep thinking they are normal so you can make caveats on an internet forum.
I think this is tricky because of course this is okay most of the time. If I produce a search index, it's okay. If I produce summary statistics of a work (how many words starting with an H are in John Grisham novels?), that's okay. Producing an unofficial guide to the Star Wars universe is okay. "Processing" and "produce content" are, I think, too vague.
You should be able to judge whether something is a copyright violation based on the resulting work. If a work was produced with or without computer assistance, why would that change whether it infringes?
It helps. If the question is whether there is infringement or not, and it comes out that you were looking at a photograph of the protected work while working on yours (or using any other type of "computer assistance"), do you think this would not make for a more clear-cut case?
That's why clean room reverse engineering and all of that even exists.
As a normative claim, this is interesting; perhaps this should be the rule.
As a descriptive claim, it isn't correct. Several lawsuits relating to sampling in hip-hop have hinged on whether the sounds in the recording were, in fact, sampled, or instead, recreated independently.
This is interesting from the legal point of view, because AI service providers like OpenAI give you "rights" to the output produced by their systems. E.g. see the "Content" section of https://openai.com/policies/eu-terms-of-use/
Given that output cannot be produced without input, and models have to be trained on something, one could argue that the original IP owners have a reasonable claim against people and entities who use their content without permission.