If someone is crawling content in *any* way and aren't respecting robots.txt, th...

ke88y · on July 18, 2023

Your first two paragraphs are your opinion. Unless you're a SCOTUS judge or powerful Senator, I'm not sure that your opinion matters much. And I do agree with your opinion, btw, so you're in good company, but I think the courts mostly vehemently disagree with this position...

Your third through fifth paragraphs misunderstand the central question. I can say whatever I want in a license. Whether or not those clauses are enforceable is a very different question. If OpenAI and GitHub say that using open source code/content is fair use, then the terms of the copyright license aren't necessarily relevant.

E.g., if I say that my website can only be used for non-commercial purposes and then run for office and NYT publishes a copy of a two-page racist screed on my website, then I can try to sue them for violating CC-by-... But it won't matter and I will lose. The terms of the license are not relevant since NYT clearly has a very strong fair use case. That's the whole point of fair use -- that the content can be used in ways that the rights holder does not consent to!

WorldMaker · on July 18, 2023

The advice in the article linked here is "you should post copyright info if you don't want to be scraped/crawled for LLM work". That already presumes that fair use does not apply. (I certainly don't think LLM efforts constitute fair use, and the article already presumes it doesn't apply, which is relevant context for the discussion we are having.)

It certainly is my opinion that robots.txt is the equivalent of hanging up a "No Robots Allowed" sign in a well known place on a fence around your property. It is also my opinion that companies that don't respect that are doing themselves a disservice. I don't think I need to be a SCOTUS judge or powerful Senator for those opinions. Those opinions are those of an internet user from the 1990s who was around for some of the oldest robots.txt debates. The meaning of robots.txt doesn't change no matter what kind of crawler/scraper you are building and no matter how "fair use" you think your use case is. I certainly think that whether you respect posted signs or are oblivious to them is a choice, but I also think (as we agreed in the 1990s) it is fair to call trespassing crawlers that ignore robots.txt "evil" and to take further actions to block them.
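For what it's worth, reading the sign is not hard for a crawler. Here is a minimal sketch, assuming Python's standard `urllib.robotparser`; the robots.txt contents, the `ExampleLLMBot` user agent, the `may_fetch` helper, and the example.com URLs are all made up for illustration and don't describe any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would fetch this file
# from the site before requesting anything else.
ROBOTS_TXT = """\
User-agent: ExampleLLMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The made-up LLM crawler is told to stay out of the whole site.
    print(may_fetch("ExampleLLMBot", "https://example.com/post/1"))
    # Any other crawler is only kept out of /private/.
    print(may_fetch("SomeOtherBot", "https://example.com/post/1"))
    print(may_fetch("SomeOtherBot", "https://example.com/private/notes"))
```

A crawler that skips a check like this before each request is the one walking past the sign.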

As for other opinions, it's not solely "my" opinion that US copyright law is automatic and thus needs to be proven when copyright doesn't apply.

> When is my work protected?

> Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device.

https://www.copyright.gov/help/faq/faq-general.html

We could get into the weeds of specific court cases, but in general they trend towards "you don't need any specific type of notice since 1978 when automatic protection went into effect" and "you don't need to know trivia about copyright law to be protected as long as you make 'best effort' from your own point of view". I really don't think it is a stretch for someone not a SCOTUS judge or powerful Senator to suggest that "the burden of proof falls that a document needs to be proven not copyrighted" because copyright and copyright protection since 1978 is "the default status" (in the US).

ke88y · on July 18, 2023

> I don't think I need to be a SCOTUS judge or powerful Senator for those opinions.

The world isn't fair. Unless you think you can imminently effect political or judicial changes, you need to ACT in the world according to how the law ACTUALLY WORKS, not how you think it morally OUGHT to work.

> As for other opinions, it's not solely "my" opinion that US copyright law is automatic and thus needs to be proven when copyright doesn't apply.

You've completely lost the plot here. This is completely irrelevant to my last two posts.

Automatic copyright -- or fuck, even specific terms saying "you owe me a billion dollars if you use this as training data" -- doesn't help you if training is fair use. Fair use is an affirmative defense to copyright violation. And these companies are arguing that using copyrighted data is fair use. We'll see. But if using text as training data turns out to be fair use, then you'll just be wasting your money and in some jurisdictions possibly risking counter-suit if you try to enforce your copyright.

The question is not whether you have copyright. You do. The question is whether the courts will grant model trainers and users an affirmative defense against your copyright claims.

WorldMaker · on July 18, 2023

I never suggested robots.txt was "fair". I suggested it is a long-established cultural convention, and that ignoring it is wrong for moral reasons that have nothing to do with legal ones. Morals/ethics and laws are different frameworks (though related, naturally, and hopefully interacting).

I don't think I've lost the plot when talking about automatic copyright. The article's suggestion that copyright notices should be more useful than robots.txt only makes sense if you assume that copyright applies in general and LLM scraping is not fair use. Following that assumption to its logical conclusion, automatic copyright is a legal problem for all crawling/scraping by the LLM companies, and they are at drastic legal risk for it. That is on top of the already huge moral/ethical risk they take in doing that and ignoring things like robots.txt files.

"LLM crawlers should ignore robots.txt and instead look for CC licenses and copyright notices" is only good moral "advice" if you think model trainers will eventually lose the "fair use" argument. And from that standpoint, given automatic copyright protection, it is both ethically and legally wrong to suggest that stuff without META tags or visible copyright notices is "fair game"/"uncopyrighted".

In the case that we are presuming "fair use" can apply to this data, then all arguments about which copyright licenses to claim are indeed facile, because it doesn't matter, but that's not an interesting argument to have here. If robots.txt isn't the answer and copyright claims aren't the answer: what is? At that point, yes, all legal protections cease. If they've already proven to be a bad actor (ethically speaking) willing to ignore robots.txt, what hope do you have that they won't ignore llms-please-ignore-this-site.txt or X-LLM: Please Ignore headers or anything else?

There's not really a discussion to have in that case, just a judgment to make that whatever those companies are doing is evil and unethical. (Call that a strong opinion if you like, but that's how ethics works in the court of public opinion.)