Hacker News

Does anyone have a problem with it, so long as the material it was trained on was used with explicit permission/licensing and not potentially in violation of copyright?

That's where the line is for it to be suspect IMO.




This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other people's works at zero cost and use them to build a commercial product without compensation.

And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.


There has to be a reasonable context here. Even if it's trained on proprietary code, it rarely inserts that code directly in a way that reflects how the code was actually used.

Obviously licensing needs to be respected, and it shouldn't be hard to solve that problem. But 99.9% of code isn't some unique algorithm; it's gluing libraries together and setting up basic structures.

Most of the examples I've seen don't line up with the reality of code-completion tools. Code is rarely valuable when broken up into small parts.

Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.


> It can't be "fair use" to take other peoples' works at zero cost, and use it to build a commercial product without compensation.

You just described open source software.

That's the whole heart of this lawsuit, and equally Copilot. It was trained on OSS which is explicitly licensed for free use.


Ok, you got me, that wording was lazy on my part. But that's a really bad take of yours:

> It was trained on OSS which is explicitly licensed for free use.

That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.

The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.


Yeah, and I think that's fair re: licensing. Curious to see how it pans out.

Also, re: your edit, not quite. They require you to release modified source under certain conditions, if you make modifications to it. If everybody using GPL code had to release their own code to the world, every company's code would already be public. There's more nuance than that; the GNU site covers a lot of it (https://www.gnu.org/licenses/gpl-faq.en.html#UnreleasedMods)

The AGPL is the one that enterprises won't touch with a ten-foot pole, due to its more restrictive terms and the additional conditions (including network use) under which you'd have to open-source your own code.


Most companies building commercial products on top of FOSS are obeying the license requirements. (I have been through due diligence reviews where we had to demonstrate that, for each library/tool/package.)

The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).


> That's the whole heart of this lawsuit, and equally Copilot. It was trained on OSS which is explicitly licensed for free use.

Most open-source software is not licensed for free use. MIT and GPL, the two most common licenses, both require attribution.


A FOSS license does not mean "do whatever you want". The GPL, for example, requires all derived work to also be licensed under a GPL-compatible license.


Permissive licensing is almost beside the point, because only a minority of code is licensed so permissively that you can do whatever you like with it. Far more code lets you do what you like only within the scope of the license. The GPL, for example: do with it what you like, so long as any derivative work is also GPL.


I guess I'm just afraid that it might not be as good as it is that way.

It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.

In those cases however the output space is so vast that plagiarism is very unlikely.

With code, not so much.


GPT-3 and Stable Diffusion might not copy things exactly, but they certainly do copy "style". There are many articles like this:

https://hyperallergic.com/766241/hes-bigger-than-picasso-on-...

The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.


I think the prompt "GPT-3, tell me what the lyrics for the song Stan by Eminem is" is very likely to output copyrighted material. The same copyrighted material is, of course, already republished without permission on google.com.


There are literally thousands of years of artwork that fall under the public domain; the idea that the dataset isn't big enough to make good images without copyright infringement and attribution laundering is frankly laughable.


My guess is that it's not so much about the amount of available data as about how accessible it is. Scraping the internet seems to be one of the preferred ways of gathering vast amounts of text and images in particular.

Telling apart what is and isn't public domain is not a trivially automatable task.

If you rely only on curated libraries of vetted public-domain content, you don't get anywhere near the expected amount of variability and diversity.



