This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other people's works at zero cost and use them to build a commercial product without compensation.
And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.
There has to be reasonable context here. Even if the model is trained on proprietary code, it rarely inserts that code verbatim in any way that resembles how it was originally used.
Obviously, licensing needs to be respected, and it shouldn't be hard to solve that problem. But 99.9% of code isn't some unique algorithm; it's gluing libraries together and setting up basic structures.
Most of the examples I've seen don't line up with the reality of code completion tools. Code is rarely valuable when broken up into its small parts.
Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.
Ok you got me, that wording was lazy on my part. But that's a really bad take on yours:
> It was trained on OSS which is explicitly licensed for free use.
That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses come with specific requirements and restrictions on use, and Copilot explicitly ignores those requirements, thus violating the license agreement.
The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.
Yeah, and I think that's fair re: licensing. Curious to see how it pans out.
Also, re: your edit, not quite. The GPL requires you to release modified source only under certain conditions, namely if you distribute your modifications. If everybody using GPL code had to release their code to the world, every company's code would currently be public. There's more nuance than that; the GNU site covers a lot of it (https://www.gnu.org/licenses/gpl-faq.en.html#UnreleasedMods)
AGPL is the one that enterprises won't touch with a ten-foot pole, due to its more restrictive terms and the broader set of conditions under which you'd have to open source your own code.
Most companies building commercial products on top of FOSS are obeying the license requirements. (I have been through due diligence reviews where we had to demonstrate that, for each library/tool/package.)
The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code verbatim, without the accompanying license.