robots.txt is not the answer for LLMs

PreInternet01 · on July 18, 2023

> Not all LLMs use crawlers and identify themselves

Yeah, exactly why I want to block them.

> This is not a sustainable solution as the number of crawlers continues to grow

Well, I want zero crawlers to access my content, so that seems pretty sustainable?

> An ‘all or nothing’ approach is unacceptable

To whom? For me 'nothing' works? I do not want your crawler to access my content for any reason whatsoever. That's not too hard, is it? I block your UA, I block your IP range, done, right?

> Robots.txt is all about managing crawling while the copyright discussion is all about how the data is used

Potato, potato. Don't crawl me, don't use my data for AI, or anything. Does that affect my search engine ranking? Don't care.

> Reinventing the wheel

> The meta tag is the way

> Foolproof solution

Ah, OK, so anything but your totally imagined META solution won't work. Good luck with that!

ke88y · on July 18, 2023

The article is making the point that you need to use copyright -- not robots.txt -- to enforce restrictions on use of your content. I think that's probably correct -- robots.txt is a request. Violating that request is probably not a violation of the CFAA and certainly doesn't entail extra copyright protection. Gripe all you want and black hole IP ranges all you want; your content will get crawled.

TBH, I'm not sure there is a way to enforce this preference on the open internet if crawlers are willing to violate robots.txt. (There are, of course, practical ways that mostly involve not putting your work on the open internet; eg, a custom sign-up flow with all content behind a login page will do the trick for any site that isn't sufficiently popular.)

So, per the article, use copyright instead. And if you do not want your content used to train an AI, then you need to find a way to clearly communicate to the bots that your work is under copyright; otherwise, they'll use your content for training and you'll have to sue them post-facto. Which is all good and well, but again, the premise is that you don't want your content used for training in the first place!

Which brings us to the operative question: is it true that CC BY-ND and CC BY-NC-ND prevent the use of data by LLMs? In fact, is a licenses which explicitly disallows use of content in LLM training is effective? I'm not sure that this is clear yet; e.g., if it turns out that using content in training data turns out to be cut-and-dry fair use then do restrictive licenses have any practical effect?

WorldMaker · on July 18, 2023

If someone is crawling content in any way and aren't respecting robots.txt, that's a violation of the intent of robots.txt. It may be a "friendly request", but it's never the requestor doing something wrong if the requestee simply (badly) ignores the signs marked "beware of dog" or "no robots allowed". (Really the big thing missing here with robots.txt is that the internet doesn't have guard dogs to try to put some teeth painfully into your skin if you violate private property. Lack of guard dogs to enforce them isn't lack of signs and some of these companies need to get over themselves that they think they are better than posted signs.)

Plus, if those signs aren't enough, websites are published materials and all content under current US laws and court precedents is considered that published materials fall under copyright unless noticed otherwise. We've moved to a "prove that there is no applicable copyright" system, not a "prove that there is" system. It's not up to me to make sure that my copyright statements are clear and obvious enough, it's up to you to prove that you find a public domain or license statement of some kind (such as CC licenses including CC0; CC0 exists because there needed to be an explicit declaration of "no really, this is public domain" under current copyright regimes). This kind of crawling seems completely oblivious to how current copyright law works and seems to me to be wantonly asking for more lawsuits (which again, given current evidence, crawlers would lose).

> is it true that CC BY-ND and CC BY-NC-ND prevent the use of data by LLMs?

Yes. Easily. LLMs are a derivative process, LLMs fail all the requirements for CC ND clauses. LLMs are also commercial processes and being used for commercial interests and just as easily fail CC NC clauses.

I would even go far as to say that LLMs are bad at attribution, often lie about it, especially in situations of near direct quotes, and easily fail most reasonable interpretations of the CC BY clause.

It seems cut-and-dry to me. I don't think there is a lot of gray area.

ke88y · on July 18, 2023

Your first two paragraphs are your opinion. Unless you're a SCOTUS judge or powerful Senator, I'm not sure that your opinion matters much. And I do agree with your opinion, btw, so you're in good company, but I think the courts mostly vehemently disagree with this position...

Your third through fifth paragraphs misunderstand the central question. I can say whatever I want in a license. Whether or not those clauses are enforceable is a very different question. If OpenAI and GitHub say that using open source code/content is fair use, then the terms of the copyright license aren't necessarily relevant.

E.g., if I say that my website can only be used for non-commercial purposes and then run for office and NYT publishes a copy of a two page racist screed on my website, then I can try to sue then for violating CC-by-... But it won't matter and I will lose. The terms of the license are not relevant since NYT clearly has a very strong fair use case. That's the whole point of fair use -- that the content can be used in ways that the rights holder does not consent to!

WorldMaker · on July 18, 2023

The advice in this article linked here is "you should post copyright info if you don't want to be scraped/crawled for LLM work". That already presumes that fair use does not apply. (I certainly don't think LLM efforts constitute fair use, the article context here already presumes it doesn't apply, which is relevant context to the discussion that we are having.)

It certainly is my opinion that robots.txt is the equivalent of hanging up a "No Robots Allowed" sign in a well known place on a fence around your property. It is an opinion that companies that don't respect that are doing themselves a disservice. I don't think I need to be a SCOTUS judge or powerful Senator for those opinions. Those opinions are one of an internet user from the 1990s including being around for some of the oldest robots.txt debates. The meaning for robots.txt doesn't change no matter what kind of crawler/scraper you are building and no matter how "fair use" you think your use case is. I certainly think that whether or not you are respectful of posted signs or oblivious to them is an option, but I also think (as we agreed in the 1990s) it is also fair to call trespassing crawlers that ignore robots.txt "evil" and to take further actions to block them.

As for other opinions, it's not solely "my" opinion that US copyright law is automatic and thus needs to be proven when copyright doesn't apply.

> When is my work protected?

> Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device.

https://www.copyright.gov/help/faq/faq-general.html

We could get into the weeds of specific court cases, but in general they trend towards "you don't need any specific type of notice since 1978 when automatic protection went into effect" and "you don't need to know trivia about copyright law to be protected as long as you make 'best effort' from your own point of view". I really don't think it is a stretch for someone not a SCOTUS judge or powerful Senator to suggest that "the burden of proof falls that a document needs to be proven not copyrighted" because copyright and copyright protection since 1978 is "the default status" (in the US).

ke88y · on July 18, 2023

> I don't think I need to be a SCOTUS judge or powerful Senator for those opinions.

The world isn't fair. Unless you think you can imminently effect political or judicial changes, you need to ACT in the world according to how the law ACTUALLY WORKS, not how you think it morally OUGHT to work.

> As for other opinions, it's not solely "my" opinion that US copyright law is automatic and thus needs to be proven when copyright doesn't apply.

You've completely lose the plot here. This is completely irrelevant to my last two posts.

Automatic copyright -- or fuck, even specific terms saying "you owe me a billion dollars if you use this as training data" -- doesn't help you if training is fair use. Fair use is an affirmative defense to copyright violation. And these companies are arguing that using copyrighted data is fair use. We'll see. But if using text as training data turns out to be fair use, then you'll just be wasting your money and in some jurisdictions possibly risking counter-suit if you try to enforce your copyright.

The question is not whether you have copyright. You do. The question is whether the courts will grant model trainers and users an affirmative defense against your copyright claims.

WorldMaker · on July 18, 2023

I never suggested robots.txt was "fair", I suggested it is a cultural convention long established at this point and ignoring it is wrong for moral reasons nothing to do with legal reasons. Morals/ethics and laws are different frameworks (though, related, naturally, and hopefully interacting).

I don't think I've lost the plot when talking about automatic copyright. The article's suggestion that copyright notices should be more useful than robots.txt only makes sense if you assume that copyright applies in general and LLM scraping is not fair use. Following that assumption to its logical conclusion then automatic copyright is a legal problem for all crawling/scraping by the LLM companies and they are at drastic legal risk for that. Beyond the already huge issues with their moral/ethical risk in doing that and ignoring things like robots.txt files.

"LLM crawlers should ignore robots.txt and instead look for CC licenses and copyright notices" is only good moral "advice" if you think model trainers will eventually lose the "fair use" argument and from that standpoint it is both ethically and legally wrong from the automatic copyright protection standpoint to suggest stuff without META tags or visible copyright notices is "fair game"/"uncopyrighted".

In the case that we are presuming "fair use" can apply to this data, then all arguments about which copyright licenses to claim is indeed facile, because it doesn't matter, but that's not an interesting argument to have here. If robots.txt isn't the answer and copyright claims aren't the answer: what is? At that point, yes, all legal protections cease. If they've already proven to be a bad actor (ethically speaking) willing to ignore robots.txt, what hope do you have that they won't ignore llms-please-ignore-this-site.txt or X-LLM: Please Ignore headers or anything else?

There's not really a discussion to have in that case, just a judgment to make that whatever those companies are: that's evil and unethical. (Call that if you like a strong opinion, but that's how ethics works, in the court of public opinion.)

JohnFen · on July 18, 2023

> The meta tag is the way

No, it is not. Doing this with a meta tag suffers from the same fundamental problem as using a mechanism like robots.txt has: it depends on the goodwill of the crawler to obey it. It depends on trust.

That trust is not justifiable when it comes to search engine crawlers and robots.txt, since plenty of crawlers ignore it. It's hard to see how LLM crawlers would be more trustworthy.

ke88y · on July 18, 2023

They may not be more trustworthy, but if using content in training data is not fair use then you'll at least have a stronger legal case. I don't think meta tags are necessary, but including the meta tag and a copyright in the text makes it very hard for the LLM trainer to claim ignorance (not that this is a defense anyways, but again, it gives you a strictly stronger hand and the brazenness may impact damages.)

JohnFen · on July 18, 2023

My personal IANAL assessment is that using content in this way does not, in fact, violate copyright law as long as that content is not emitted verbatim by the LLM. So I don't think that copyright law is helpful here unless/until new law is created to cover this use case.

And there is no good technological solution, either. It's a very tough spot to be in. Right now, I've done the only thing that people who are concerned can do -- I've removed all of my content from the public web entirely.

That's how it will be until a better solution presents itself.

ke88y · on July 18, 2023

I generally agree. Until the SD and Copilot cases are settled, there's a huge risk that training is Fair Use.

amiga-workbench · on July 18, 2023

Are we gonna have to start putting highly expensive proof of work checks on pages just to make this kind of scraping non-viable?

bitwize · on July 18, 2023

robots.txt is like attempting to remind the Terminator about Asimov's Three Laws.