
Correct legally, morally, or both?

Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others; the DMCA claim in particular strikes me as strong-ish at first glance.

Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.




Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.

I am more hesitant to release code on GitHub under any license now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source-available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?

If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.

That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they knew their competitors could ignore their licenses, slurp it up, and regurgitate it at an incomprehensible scale.


> Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.

I strongly disagree. There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.

> I've considered open sourcing some of my product's components under a source available but otherwise proprietary license

What's the point of that? This isn't useful to anyone. The fact you even consider it shows you don't understand open source. I'm sure you happily use open source code yourself though.


> There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.

I actually agree. However, this is not what's happening. Copilot effectively removes copyright from FLOSS code, but doesn't touch proprietary software. FLOSS loses its teeth against the corporations.


I'm the author of about a dozen popular AGPL and GPL projects, but please tell me how I don't understand open source.

The purpose of releasing source-available but proprietary code is that users can learn from it and integrate with it; making it available lets anyone see how it works. The only reason I even considered making the source available is the balance between 1) needing to eat and 2) valuing open source enough to risk #1.

Please take your condescension elsewhere.


> There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source

There is a ton of innovative stuff that is not open source. I don't see what open source has to do with innovation.


I played around with creating a modified MIT license on my GitHub that explicitly forbids Copilot and other such systems; I thought I might update my projects to it, because I strongly dislike the data collection. I'm not a lawyer though.

Is there a GitHub terms of agreement that covers Copilot?


They claim it is fair use, therefore they can bypass copyright (and therefore license terms).

The code being on GitHub has not been brought up as a factor yet (by GitHub/Microsoft). AFAIK they could use code from other places with that logic; they just don't need to.


I find your comment a bit perplexing, perhaps you can help me understand.

Why do you want to release code on GitHub with an oppressive license? What's the motivation for you, and what's the benefit for anyone else in it being released?

The size of code fragments being generated with these AI tools is, as far as I can tell, extremely small. Do you think you could even notice if your own implementation of sqrt, comments and all, wound up in Excel?


The point of copyleft licenses (which I assume are what you mean by "oppressive") is to subvert copyright in order to incentivize others to share their code, by providing them with something to build on if they return the favor. You cannot possibly call these licenses oppressive, since the default state under copyright is that you are not allowed to do much at all (at least when it comes to copying). In fact, copyleft licenses allow you to do much, much more than your average corp-lawyer-approved proprietary license.

The problem (or A problem) with Copilot is that it tries to sidestep those licenses, purportedly allowing you to build upon the work of others without giving anything back, even if the work you are building on was published on the explicit condition that what you create with it should also be shared in the same way. While the great AI tumbler makes the legal copyright infringement argument complicated by giving you lots of small bits from lots of different sources, it really does not change the moral situation: you are explicitly going against the wishes of the people who are enabling you to do what you are doing.

Beyond copyleft, this kind of disregard for other people's wishes also applies to attribution, even with more liberal licenses. Programming is already a field where proper attribution is woefully lacking; we don't need to make it worse by introducing processes where it becomes much harder, if not impossible, to tell who contributed to the creation.

Now, I am all for maximum code sharing. I'm all for abolishing copyright entirely and letting everyone build what they want without being shackled by so-called intellectual property. But that is not what Microsoft is doing with Copilot. What they have created is a one-way funnel from OSS to proprietary software. If Microsoft had initially trained Copilot on their own proprietary sources, this would have been seen very differently. But they did not. Because the way Microsoft "loves open source" is not that of a mutually beneficial symbiotic relationship but that of an abuser who loves taking advantage of whatever they can while giving as little back as they can get away with.


How does one make the leap from "a source available but otherwise proprietary license" to a copyleft license? As I understand the terms, perhaps in too limited a way, a proprietary license is never one in which others are free to build on the code or incorporate any part of it into their own works, and a source available proprietary license is just publishing source that no-one can use.

As for whether Copilot's morally wrong or not - I don't think copyright as a concept makes any sense at the level of the trivial, where Copilot _should_ be acting. If Copilot regularly reproduces sizeable portions of code from a single origin _without_ careful and deliberate guidance, I'd agree that there's a problem here. As I understand it though, that's not happening.

By its very nature of being published, code from OSS is funnelled into proprietary codebases by humans performing a similar task to Copilot: reading available code and using that to evolve an understanding of how to produce software. I like to think we do it at a deeper level than Copilot, but the general effect is the same: the code I write, like the words I write, is heavily influenced by all the code I've read over the years.

If I wind up using a few words from your comment, down the line, because some turn of phrase you used struck me as a good way to say something, do you think I've morally wronged you?


It's a pity I can up-vote only once. This nails it!


I'm fine with Copilot, but I think all rightsholders should be allowed to decide if they want their code training it or not. And that should be opt-in, not opt-out.

(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)

> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.

That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.


>I look at class actions as having two purposes . . . to require that the defendant stops doing something

That's my point. Many of the class members don't want the company to stop doing this.

I have code on GitHub, and Copilot is a useful tool. I don't care if my code was used to train the model. Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things. The bottom line is, if I'm a coder with code on Github and I like Copilot, this suit is a huge net negative.

Even more importantly, I want to see the next version of Copilot that will be created by some other company, and then the next version after that. I want development to continue in this area at a high velocity. This suit does nothing but put giant screeching brakes on that development, and that is just a shame.


If the lawsuit goes through, it's not likely that Copilot would disappear, but there would be a checkbox to opt in your code. You could check it, and your code would be used to train the model.

I have some code on GitHub as well and would not want it to be used in training, not by Microsoft nor by any other company. It is under a GPL license to ensure that any derived use is public and not stripped of copyright and locked into a proprietary codebase, and Copilot is pretty much the 100% opposite of this.


If this lawsuit is successful, I doubt it will change anything at all. Microsoft will just pay the damages as a cost of doing business and continue what they are doing. Maybe they will add an opt-out.


Continuing to do damage would mean higher fees next time.


If you don't feel like you're being represented, you're free to choose not to be a member of class in the lawsuit.


> If you don't feel like you're being represented, you're free to choose not to be a member of class in the lawsuit.

I think you missed this part:

> Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things.


why would it be meaningless?

seems like a great opportunity for Microsoft to alter Copilot so that getting your code scanned is opt-in, and to make adding licensing and attribution to outputs mandatory

I know you said you're OK with it as is, but many aren't, so if I'm a coder, this suit represents a big net positive for me, being a way to reduce the probability of someone laundering my code away without proper attribution or license attention


That I did, thanks for pointing it out. Phone posting does that.


Hypothetically, if I wanted to learn how to code by studying open source examples on GitHub, should I have to go ask permission of each rightsholder to learn from their code? I agree that, if Copilot is based on a model that overfits to output the exact same code it read, the lawsuit has merit (and Copilot is not really ML), but the idea of ML is that the model doesn’t memorize specific answers, it learns internal rules and higher-level representations that can output a valid result when given some input. Very much like me, the coder, would output valid code when given a use case description, after studying a lot of open source examples of that. Should most programmers just be paying rights to all publishers of code they have studied?


> the idea of ML is that the model doesn’t memorize specific answers, it learns internal rules and higher-level representations

that's the idea, yeah, and it would've been great if that's how copilot worked all the time

as for the whataboutism, if developers copied copyrighted code, the rights holder has the right to go after them, too, if they so choose

the rights holder could also choose to go after only big companies that violate licenses egregiously, if they so choose

you know, common sense and nuance


For a long time, Microsoft has used software licenses to reap profits from Windows and Office, the two products that enabled Microsoft to capture near-monopolies in their respective markets.

Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.

Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.


I don’t think this lawsuit would hinder innovation but it would greatly change it and who owns it.

If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.

If an AI model isn’t the joint property of the people whose IP it learned from, then it’s a great way to build extractive business models, because the raw resource is mostly free. This will incentivise larger, more centralised entities.

Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)


The problem of jointly-owned AI is that the actual value of a particular contribution to the training set is not particularly easy to calculate. We can't tie a particular model weight back to individual training set examples, nor can we take an output that's a statistical mix of two different peoples' work and trace it back to them.

With current technology, the only licensing model we can offer is "give us your training set example, we'll chuck a few pennies at you out of credit sales and nothing more". We can't even comply with CC-BY because the model can't determine who to attribute to.


If the resource is free and non-rivalrous, what is being extracted?


The resource is not "free": it is provided under a license that attempts to lay out the terms the entity using the resource must comply with in order to get the benefit of using their product; just because this is a non-monetary compensation doesn't mean it is "free".


Authors of code (open source or otherwise) hold a copyright in that code. The purpose of the license agreement is to set out the terms on which the authors will permit others to take actions that would otherwise infringe copyright.

Using code, photographs, documents, or other material to train a model isn't copyright infringement. The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.

Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.


> The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.

How do they not make copies? Do you know how a computer works? Ever heard of RAM? (At least the German Urheberrecht recognizes this clearly: You can't do any processing on any data with the help of a computer without at least making temporary local copies, so there are exceptions to some rules. I'm quite sure common law copyright also recognizes this!)

Also the claim that this is not a derivative work is actually one of the disputed claims here…

> Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.

Exactly, it's all copyrighted! That's why you can't use it for whatever you like. That's the whole point of copyright.

As a result this means that whoever wants to exploit that work in said way needs to buy (or get otherwise) a license!

Nobody said that feeding AI with properly licensed work would be problematic. Only the original creators need to get their fair cut from the outcome of such a process.


you clearly don't understand how machine learning works. if machine learning on copyrighted data becomes illegal, then much of our infrastructure will go down, because most of it uses machine learning. the first thing that will affect many people is probably google search


I believe this is the core point of the lawsuit: is Copilot really creating code from what it learned (which happens, by some weird glitch, to mimic the source code), or is it just a big overfitting model that learned to encode and memorize a large number of answers and spit them out verbatim when prompted?

I think that losing this lawsuit has much more serious consequences for Copilot than just having to connect to a list of millions of potential copyright owners - it would mean the model behind it is essentially a failure.

Personal opinion: the real situation lies somewhere in the middle. From what I’ve seen, I think Copilot has some ability to actually generate code, or at least adapt and connect unrelated code pieces it remembers to respond to prompts - but I also believe it just “remembers” (i.e., has a close-to-lossless encoding of the input) how to do some operations and spits them out as part of the response to some prompts.

I hardly think the lawsuit will really explore this discussion, but it sounds like a great investigation into what DL models like transformers actually learn. For all I know, it might even give insight into how we learn. I have no reason to believe that humans don’t use the same strategy of memorising some operations and learning how to adjust them “at the edges” to combine them.
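The memorization question is at least partly measurable. A crude toy sketch (my own illustration, nothing to do with Copilot's actual internals): tokenize a generated sample and the training corpus, then find the longest run of tokens they share verbatim. A generalizing model should mostly produce short shared runs; close-to-lossless recall shows up as long ones.

```python
# Toy sketch, not Copilot's internals: probe "memorization" by finding the
# longest run of tokens a generated sample shares verbatim with the corpus.
# Short runs suggest generalization; long runs suggest verbatim recall.
def longest_shared_run(sample_tokens, corpus_tokens):
    best = 0
    n = 1
    while True:
        corpus_ngrams = {tuple(corpus_tokens[i:i + n])
                         for i in range(len(corpus_tokens) - n + 1)}
        sample_ngrams = {tuple(sample_tokens[i:i + n])
                         for i in range(len(sample_tokens) - n + 1)}
        if not (sample_ngrams & corpus_ngrams):
            return best  # no shared n-gram of length n; the max was n - 1
        best = n
        n += 1

# e.g. longest_shared_run("x b c d y".split(), "a b c d e".split()) == 3
```

Real memorization studies are fancier (suffix arrays over the whole training set, normalized by sample length), but even this crude longest-shared-run number would separate "statistical mix of many sources" from "regurgitated a whole function".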


I don't think that anybody will try to answer the philosophical question of whether what this machine does has anything to do with human reasoning.

In the end it's just a machine. It's not a person. So trying to anthropomorphize this case makes no sense from the get go.

Looking at it this way (and I guess this is the right way to look at it from the law standpoint) Copilot is just a fancy database.

It's a database full of copyrighted work…

How this database (and its query system) works from the technical viewpoint isn't relevant. It just makes no difference, as by law machines aren't people. End of story.

But should the court (to my extreme surprise) rule that what MS did was "fair use", then the flood gates of "fairuseify through ML"[1] would be open. Given the history of copyright and/or other IP laws in the US, this just won't happen! The US won't ever accept that someone would be allowed to grab all Mickey Mouse movies, put them into some AI, and start to create new Mickey Mouse movies. That's unthinkable. Just imagine what this would mean. You could "launder" any copyrighted work just by uploading and re-querying it from some "ML-based database system". That would be the end of copyright. This just won't happen. MS is going to lose this trial. There is no other option.

The only real question is how severe their loss will be. They surely also used AGPLv3 code for training. Thinking this through to the end with all consequences would mean that large chunks of MS's infrastructure and all supporting code, which means more or less all of Azure, which means more or less all of MS's software, would need to be offered in (usable!) source form to all users of Copilot. I think this won't happen. I expect the court to find a way to weasel out of this consequence.

[1] https://web.archive.org/web/20220121020414/fairuseify.ml/


Holy cow you are right.


> Morally I think this class action is dead wrong. This is how innovation dies.

This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.


I disagree. The more entrenched a practice is, like training AI models on media content, the less willing a court is going to be to take that practice away.


that seems like a machiavellian way of avoiding The People deciding the issue for themselves via their government representatives, and it'd just make things harder when the court takes the practice away anyways



