I’m not a lawyer, but here is why I believe a class action lawsuit is correct:
“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft’s copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.
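To make the thought experiment concrete, here's a toy TypeScript sketch (PROPRIETARY_SOURCE is a hypothetical stand-in for someone else's copyrighted code; nothing here is from the actual lawsuit):

const PROPRIETARY_SOURCE = "/* imagine Microsoft's copyrighted code here */";

function totallyNotCopying(input: number): string {
  // plenty of "math operations" on the arbitrary input, all discarded:
  let h = 0;
  for (let i = 0; i < 8; i++) h = (h * 31 + input) | 0;
  void h;
  return PROPRIETARY_SOURCE; // the output was baked in all along
}

However much "math" happens on the way, what comes out is still someone else's work.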
I’m sure my own open source code is in that thing. I did not see any attribution, so they are breaking the fundamentals of open source.
In the spirit of Rick Sanchez: it’s just compression with extra steps.
I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a TypeScript file:
function isPrime(n: number): boolean {
  for (let i = 2; i < n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return n > 1;
}

function isEven(n: number): boolean {
  return n % 2 === 0;
}
These are clearly not covered by copyright in the first place. This case is really quite pathetic.
Correct me if I'm wrong. I don't think this document needs to be a comprehensive record of every piece of copyrighted material that Copilot or Codex produces. That's something that will be produced during/for the trial process itself. Right now, this is just establishing the basic premise, and the claims for the type of behavior that is going on.
I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).
I tested co-pilot initially with Hello World in different languages. In Lisp, it gave me verbatim code from a particular tutorial, which was made obvious because their code had "Hello <tutorialname>" where <tutorialname> was the name of a YouTube tutorial, instead of the word "World." It was surely slurped into the model via someone who had done the tutorial and uploaded their efforts to Github. Mind you, it's pretty much the way everyone would code it, but the inclusion of <tutorialname> is definitely an issue.
I have only skimmed. But lines 23 and 24 on page 23 also reference Copilot's autocompletion of Quake III's `Q_rsqrt`[1] and mention that it is under GPL2.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
"Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis."
That code is specifically optimized for efficiency and there were similar approaches floating (get it?) around in the 1980s.
The magic constant is not optimal; there exist better alternatives. So it is not necessary for implementing this function, and it should be copyrightable. It is also not a trivial part.
On the other hand, Microsoft may only need to show "Hey, we got this code from FooBar under this license and this license and ..."
Why should it be copyrightable? It's just a way to calculate an inverse square root. This falls into the public domain, in my non-lawyer opinion. Such small snippets do not usually qualify for copyright.
It's not just the constant, but that was the easiest thing for me to identify in the last post. And due to its popularity the size of the snippet doesn't matter; it stands on its own as a significant work.
The essence of the algorithm takes four lines: the function declaration, the declaration of 'y', one line for calculating the exponent in log-space, and one line for the root-finding return.
The rest is fluff. Every line of the snippet has creative input: the chosen names ('threehalfs' for 1.5F), the order of declarations and instructions, the redundancy. There have been internet wars around indentation and newlines; these are style choices.
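For reference, a rough TypeScript sketch of just those four essential steps (typed arrays stand in for C's pointer cast; the 0x5f3759df constant and the single Newton step are the widely published ones, not my own):

function fastInvSqrt(x: number): number {
  const buf = new ArrayBuffer(4);
  const f32 = new Float32Array(buf); // float view of the same 4 bytes
  const u32 = new Uint32Array(buf);  // integer view of the same 4 bytes
  f32[0] = x;
  u32[0] = 0x5f3759df - (u32[0] >>> 1); // the exponent trick in log-space
  const y = f32[0];
  return y * (1.5 - 0.5 * x * y * y);   // one Newton-Raphson refinement
}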
((And it is public -- GPL more specifically, which is a restrictive license that should be respected. I think this snippet makes a perfect example of the dangers of Copilot. But not one to litigate details with.))
(((Thinking back, I'm not sure anymore how the license laundering argument works if they got the code from a fair-use MIT-licensed hobby project. Can one person claim fair-use and include it under an MIT-license and have somebody else say 'oh this free code I'm going to use it commercially'?)))
You didn't read the relevant part of the complaint. It starts on document page 14 (PDF page 17). There's a clear footnote:
> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.
The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript.
This seems like an incredibly trivial example. If I remembered that example subconsciously, and used it myself somewhere, would that be an infringement of intellectual property? In any large code base how many such infringements are there? Many? Should we sue every software company on this premise?
Sure, those comments might be considered infringement, but that's from an earlier version of Codex. Copilot does not return that code. The complaint even says so.
If a software system systematically engages in copyright violation but only haphazardly corrects those violations, those haphazard corrections aren't evidence the problem has vanished.
If Copilot is committing widespread infringements of their copyright, then surely they will be able to find examples of such infringement to submit in their lawsuit.
I assume they want some kind of broad relief, such as an injunction to take down copilot. They are not going to get it, they are not going to get anything at all, if they can’t even provide examples of violating code.
During the Pirate Bay case, the prosecutor only had to illustrate that it was likely (as in, he convinced the judges) that copyright infringement had occurred. They did this by showing the top 100 torrents. They did not have to prove with certainty that the top 100 torrents were actually used by people. The fact that the names of movies and games showed up on the list was enough to convince the judges.
The lawyers defending the founders did try to make the argument that no infringement had been proven, and that the list itself was not proof of any infringement. It was just a list on a website, and they even presented evidence that the counter on the list was algorithmically faulty. The judges were not convinced and applied the common-sense approach that, taken as a whole, it was not believable that no infringement had occurred via the website, given the context of the site (the name, the top list, the overall way the site was designed).
> ...then surely they will be able to find examples of such infringement to submit in their lawsuit
Perhaps that is why they are reaching out to potential class members
> if they can’t even provide examples of violating code.
This is the very beginning of a very long process. I wouldn't rule out a settlement where class members get $10-100, which is a common resolution for class action suits.
There are many public examples of that same effect happening (for example https://twitter.com/mitsuhiko/status/1410886329924194309 ), and the legal team has been soliciting for more examples. Those examples are likely to come out if it does go to trial.
If this legal team were interested in this going to trial, you'd think they would have put together a stronger case instead of risking that it won't be heard.
There’s not even a single mention of any established legal doctrines around copyright and software, such as abstraction-filtration-comparison, the idea-expression dichotomy, etc.
> it threatens to disrupt one of the biggest technological progressions of all time.
Chill dude, all they have to do is include the licenses on their generated code.
If anything, this is going to generate even more progress. The Copilot team would have to create some kind of feature that would connect the generated output to the relevant training data. That'd be pretty incredible to see in the field of AI/ML in general.
If they can actually link output to specific input, the lawsuit has merit, and moreover GPT-3 is a lie. A neural network is supposed to learn how things work, not memorize a large number of examples and spit them out verbatim, or keep connections to specific inputs.
Copilot losing the lawsuit is evidence it’s a case of overfitting, not true ML.
Not just AI is threatened, but also the use of sites like StackOverflow because some of those snippets might infringe a license. So we have to write everything from our heads, de novo. No more googling for solutions.
I think we should just relax copyright, it's dying anyway. Language models allow people to borrow skills learned from other people, and solve tasks. That's huge. Like Matrix, loading up a new skill at the press of a button. Can we give up such a huge advantage in order to protect copyright?
I think the notion of copyright has been under attack already for 2 decades by the internet, search engine and social networks. They all work against it, and AI does it even more. It just encapsulates the whole culture in a box, mixing everything up, all copyrights melting away, everything one prompt away. This could be a new medium of propagation for ideas. No longer limited to human brains and books, they can now propagate through language models more efficiently.
That isPrime function does not even cut off at sqrt(n). Asking for the state-of-the-art isPrime function is too much, but the sqrt trick is the very first step and it's free. (IIRC, the faster version uses i*i <= n.)
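For comparison, a version with the sqrt cutoff might look like this (same TypeScript flavor as the snippet above; just an illustration of the trick):

function isPrime(n: number): boolean {
  if (n < 2) return false;
  for (let i = 2; i * i <= n; i++) { // stop at sqrt(n): any factor pair has one member below it
    if (n % i === 0) return false;
  }
  return true;
}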
When searching for "console.log(isEven(50));" "// → true", which is one of the parts the complaint is about (since this is also reproduced in a programming textbook), cs.github.com gives:
" Showing 1 - 20 of 66 files found (in 76 milliseconds)"
So, if this lawsuit succeeds in some way shape or form, does the author have a case against the 66 people that reproduced these lines in their own repository?
You could argue that if the author had pursued enforcing their licence against those 66 people, their code wouldn't have ended up in the training set in the first place. IANAL, but I recall that you can't invoke copyright law to selectively enforce it; copyright is only protected if the holder pursues every violation of it. Maybe it works the same for enforcing a licence.
They can already sue those people if they don't follow the original license, they just need to file a complaint individually to each author, I think. Standard OSS license stuff, or else, why would people even use licenses?
Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.
Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I am more hesitant to release code on GitHub under any license now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source-available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?
If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.
That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they know their competitors can ignore their licenses, slurp it up and regurgitate it at an incomprehensible scale.
> Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I strongly disagree. There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
> I've considered open sourcing some of my product's components under a source available but otherwise proprietary license
What's the point of that? This isn't useful to anyone. The fact you even consider it shows you don't understand open source. I'm sure you happily use open source code yourself though.
> There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
I actually agree. However, this is not what's happening. Copilot effectively removes copyright from FLOSS code, but doesn't touch proprietary software. FLOSS loses its teeth against the corporations.
I'm the author of about a dozen popular AGPL and GPL projects, but please tell me how I don't understand open source.
The purpose of releasing source available but proprietary code is so that users can learn and integrate into it, and making it available lets anyone learn how it works. The only reason I even considered making the source available is balance between 1) needing to eat and 2) valuing open source enough to risk #1.
I played around with creating an MIT license on my GitHub that explicitly forbids Copilot and other such systems, which I thought I might update my projects to use, because I strongly dislike the data collection. I'm not a lawyer though.
Is there a GitHub terms of agreement that covers Copilot?
They claim it is fair use, therefore they can bypass copyright (and therefore license terms).
It being in GitHub has not been brought up as a factor yet (by GitHub/Microsoft), AFAIK they could use code from other places with that logic, they just don't need to.
I find your comment a bit perplexing, perhaps you can help me understand.
Why do you want to release code on GitHub with an oppressive license? What's the motivation for you, and what's the benefit for anyone else in it being released?
The size of code fragments being generated with these AI tools is, as far as I can tell, extremely small. Do you think you could even notice if your own implementation of sqrt, comments and all, wound up in Excel?
The point of copyleft licenses (which I assume are what you mean by "oppressive") is to subvert copyright in order to incentivize others to share their code, by providing them with something to build on if they return the favor. You cannot possibly call these licenses oppressive, since the default state under copyright is that you are not allowed to do much at all (at least when it comes to copying). In fact, copyleft licenses allow you to do much, much more than your average corp-lawyer-approved proprietary license.
The problem (or A problem) with Copilot is that it tries to sidestep those licenses, purportedly allowing you to build upon the work of others without giving anything back, even if the work you are building on has been published on the explicit condition that what you create with it should also be shared in the same way. While the great AI tumbler makes the legal copyright-infringement argument complicated, by giving you lots of small bits from lots of different sources, it really does not change the moral situation: you are explicitly going against the wishes of the people who are enabling you to do what you are doing.
Beyond copyleft, this kind of disregard for other people's wishes also applies to attribution, even with more liberal licenses. Programming is already a field where proper attribution is woefully lacking; we don't need to make it worse by introducing processes where it becomes much harder, if not impossible, to tell who contributed to the creation.
Now I am all for maximum code sharing. I'm all for abolishing copyright entirely and letting everyone build what they want without being shackled by so-called intellectual property. But that is not what Microsoft is doing with Copilot. What they have created is a one-way funnel from OSS to proprietary software. If Microsoft had initially trained Copilot on their own proprietary sources, this would have been seen very differently. But they did not. Because the way Microsoft "loves open source" is not that of a mutually beneficial symbiotic relationship, but that of an abuser that loves taking advantage of whatever they can while giving as little back as they can get away with.
How does one make the leap from "a source available but otherwise proprietary license" to a copyleft license? As I understand the terms, perhaps in too limited a way, a proprietary license is never one in which others are free to build on the code or incorporate any part of it into their own works, and a source available proprietary license is just publishing source that no-one can use.
As for whether Copilot's morally wrong or not - I don't think copyright as a concept makes any sense at the level of the trivial, where Copilot _should_ be acting. If Copilot regularly reproduces sizeable portions of code from a single origin _without_ careful and deliberate guidance, I'd agree that there's a problem here. As I understand it though, that's not happening.
By its very nature of being published, code from OSS is funnelled into proprietary codebases by humans performing a similar task to Copilot - reading available code and using that to evolve an understanding of how to produce software. I like to think we do it at a deeper level than Copilot, but the general effect is the same: the code I write, like the words I write, are heavily influenced by all the code I've read over the years.
If I wind up using a few words from your comment, down the line, because some turn of phrase you used struck me as a good way to say something, do you think I've morally wronged you?
I'm fine with Copilot, but I think all rightsholders should be allowed to decide if they want their code training it or not. And that should be opt-in, not opt-out.
(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)
> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.
>I look at class actions as having two purposes . . . to require that the defendant stops doing something
That's my point. Many of the class members don't want the company to stop doing this.
I have code on GitHub, and Copilot is a useful tool. I don't care if my code was used to train the model. Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things. The bottom line is, if I'm a coder with code on Github and I like Copilot, this suit is a huge net negative.
Even more importantly, I want to see the next version of Copilot that will be created by some other company, and then the next version after that. I want development to continue in this area at a high velocity. This suit does nothing but put giant screeching brakes on that development, and that is just a shame.
If the lawsuit goes through, it's not likely that Copilot would disappear, but there would be a checkbox to opt in your code. You could check it and your code would be used to train the model.
I have some code on GitHub as well and would not want it to be used in training, not by Microsoft nor by any other company. It is under the GPL license to ensure that any derived use is public and not stripped of copyrights and locked into a proprietary codebase, and Copilot is pretty much the 100% opposite of this.
If this lawsuit is successful, I doubt it will change anything at all. Microsoft will just pay the damages as a cost of doing business and continue what they are doing. Maybe they will add an opt-out.
seems like a great opportunity for Microsoft to alter copilot so it's opt-in to get your code scanned, and to mandatorily add licensing and attribution to outputs
I know you said you're OK with it as is, but many aren't, so if I'm a coder, this suit represents a big net positive for me, being a way to reduce the probability of someone laundering my code away without proper attribution or license attention
Hypothetically, if I wanted to learn how to code by studying open source examples on GitHub, should I have to go ask permission of each rightsholder to learn from their code? I agree that, if Copilot is based on a model that overfits to output the exact same code it read, the lawsuit has merit (and Copilot is not really ML), but the idea of ML is that the model doesn’t memorize specific answers, it learns internal rules and higher-level representations that can output a valid result when given some input. Very much like me, the coder, would output valid code when given a use case description, after studying a lot of open source examples of that. Should most programmers just be paying rights to all publishers of code they have studied?
For a long time, Microsoft has used software licenses to reap profits from Windows and Office, the two products that enabled Microsoft to capture near-monopolies in their respective markets.
Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.
Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.
I don’t think this lawsuit would hinder innovation but it would greatly change it and who owns it.
If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.
If an AI model isn’t joint property of the IP it learned then it’s a great way to build extractive business models because the raw resource is mostly free. This will incentivise larger, more centralised entities.
Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)
The problem of jointly-owned AI is that the actual value of a particular contribution to the training set is not particularly easy to calculate. We can't tie a particular model weight back to individual training set examples, nor can we take an output that's a statistical mix of two different peoples' work and trace it back to them.
With current technology, the only licensing model we can offer is "give us your training set example, we'll chuck a few pennies at you out of credit sales and nothing more". We can't even comply with CC-BY because the model can't determine who to attribute to.
The resource is not "free": it is provided under a license that attempts to lay out the terms the entity using the resource must comply with in order to get the benefit of using their product; just because this is a non-monetary compensation doesn't mean it is "free".
Authors of code (open source or otherwise) hold a copyright in that code. The purpose of the license agreement is to set out the terms on which the authors will permit others to take actions that would otherwise infringe copyright.
Using code, photographs, documents, or other material to train a model isn't copyright infringement. The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
> The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
How do they not make copies? Do you know how a computer works? Ever heard of RAM? (At least the German Urheberrecht recognizes this clearly: You can't do any processing on any data with the help of a computer without at least making temporary local copies, so there are exceptions to some rules. I'm quite sure common law copyright also recognizes this!)
Also the claim that this is not a derivative work is actually one of the disputed claims here…
> Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
Exactly, it's all copyrighted! That's why you can't use it for whatever you like. That's the whole point of copyright.
As a result, this means that whoever wants to exploit that work in said way needs to buy (or otherwise obtain) a license!
Nobody said that feeding AI with properly licensed work would be problematic. Only the original creators need to get their fair cut from the outcome of such a process.
You clearly don't understand how machine learning works. If machine learning on copyrighted data becomes illegal, then most of our infrastructure will go down, because much of it uses machine learning. The first thing that would affect many people is probably Google search.
I believe this is the core point of the lawsuit - is Copilot really creating code from what it learned (which happens to, by some weird glitch, mimic the source code) or is it just a big overfitting model that learned to encode and memorize a large number of answers and spit them out verbatim when prompted?
I think that losing this lawsuit has much more serious consequences for Copilot than just having to connect to a list of millions of potential copyright owners - it would mean the model behind it is essentially a failure.
Personal opinion: the real situation lies somewhere in the middle. From what I’ve seen, I think Copilot has some ability to actually generate code, or at least adapt and connect unrelated code pieces it remembers to respond to prompts - but I also believe it just “remembers” (i.e., has a close-to-lossless encoding of the input) how to do some operations and spits them out as part of the response to some prompts.
I hardly think the lawsuit will really explore this discussion, but it sounds like a great investigation into what DL models like transformers actually learn. For all I know, it might even give insight into how we learn. I have no reason to believe that humans don’t use the same strategy of memorising some operations and learning how to adjust them “at the edges” to combine them.
I don't think that anybody will try to answer the philosophical question of whether what this machine does has anything to do with human reasoning.
In the end it's just a machine. It's not a person. So trying to anthropomorphize this case makes no sense from the get go.
Looking at it this way (and I guess this is the right way to look at it from the law standpoint) Copilot is just a fancy database.
It's a database full of copyrighted work…
How this database (and its query system) works from the technical viewpoint isn't relevant. It just makes no difference, as by law machines aren't people. End of story.
But should the court (to my extreme surprise) rule that what MS did was "fair use", then the floodgates of "fairuseify through ML"[1] would be open. Given the history of copyright and other IP laws in the US, this just won't happen! The US won't ever accept that someone would be allowed to grab all Mickey Mouse movies, put them into some AI, and start to create new Mickey Mouse movies. That's unthinkable. Just imagine what this would mean. You could "launder" any copyrighted work just by uploading it and re-querying it from some "ML-based database system". That would be the end of copyright. This just won't happen. MS is going to lose this trial. There is no other option.
The only real question is how severe their loss will be. They surely also used AGPLv3 code for training. Thinking this through to the end, with all consequences, would mean that large chunks of MS's infrastructure and all supporting code (which means more or less all of Azure, and therefore more or less all of MS's software) would need to be offered in (usable!) source form to all users of Copilot. I think this won't happen. I expect the court to find a way to weasel out of this consequence.
> Morally I think this class action is dead wrong. This is how innovation dies.
This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.
I disagree. The more entrenched a practice is, like training AI models on media content, the less willing a court is going to be to take that practice away.
that seems like a machiavellian way of avoiding The People deciding the issue for themselves via their government representatives, and it'd just make things harder when the court takes the practice away anyways
Say you read a bunch of code, say over years of a developer career. What you write is influenced by all of that. It will include similar patterns, similar code, and identical snippets, knowingly or not. How large does a snippet have to be before it's copyrightable? "x"? "x==1"? "if x==1\n print('x is one')"? [Obviously, replace with actual common code, like "if not found return 404".]
Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat, or by a large corp wanting to shut down an open source / competing project?
This is a logical fallacy. A human is not an algorithm. We do not have to extend rights regarding novel invention to an algorithm to protect them for people.
Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense. If it were true, people would very easily game it, by using algorithms to automate the most trivial parts of copying someone's work.
Ultimately the algorithm is automating something a human could do. There is a lot of gray area to copyright law, but you can't get around that simply by offloading to an algorithm.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
Uh? So if I design a self driving car which kills someone, it's the car that goes to jail?
Legal precedent seems to indicate this is not the case at all. Because humans and machines are different, simply because humans aren't machines and vice versa.
"So if I design a self driving car which kills someone, it's the car that goes to jail?"
No but the manufacturer will typically be held responsible. If the manufacturer intentionally designed it to kill people, someone could certainly be charged with murder. More likely it was a software defect and then it is a matter of financial liability. (in between is a software defect that someone knew about and chose not to fix)
This isn't a new issue. If you design a car and the brakes fail due to a design issue and that issue can be determined to be something that could have been preventable by more competent design.... someone might indeed go to jail but more likely it would be the corporation paying out a large amount of money.
It could even be a mixture of the manufacturer's fault and the driver. Maybe the brakes failed but the driver was speeding and being reckless and slammed on the brakes with no time to spare. Had it not been a faulty design, no one would have gotten hurt, but also if the driver had been competent and responsible, no one would have gotten hurt.
But with self driving cars, when they no longer need a "safety driver", it certainly won't typically be the human occupant of the car's fault to any degree, since they are simply a passenger.
Last I checked this was very much a gray area. I’d expect at least a long investigation into the amount of work and testing put into validating that the self-driving algorithm operates inside reasonable standards of safe driving. In fact, I expect that, as the industry progresses, the tests for minimal acceptable self-driving safety get more and more standardised.
That doesn’t answer the question of who’s responsible when an accident happens and someone gets hurt or dies - but then, there was a time when animals would be judged and sentenced if they committed a crime under human law. That practice is no longer deemed valid, maybe we need to agree that, if the self-driving car was built with reasonable care, accidents can still happen and it’s no one’s fault.
First of all, that isn't simple. How do you determine what is done by humans? If the human is using a computer and using copy and paste does that still qualify?
No matter where you draw the line between "done by computers" and "done by a human simply using a computer as a tool," there will always be a lot of gray area.
Also, if I spend a year creating my masterpiece, and some kid releases a copy of it for free and claims that that's ok just because it's "not for profit," there is still a problem.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
it makes a lot of sense, for that reason and a lot of others
people can create algorithms that do whatever they want, including copyright infringement and outright criminality, but algorithms can't create people or want anything for themselves
Copyright already worries about this sort of thing a great deal, and it's actually a lot more well thought-out than your average hacker is aware of. There are no hard and fast rules; but generally... the thing being sued over has to be creative enough to be copyrightable in the first place. Small snippets do not qualify for copyright protection alone.
I'm not sure this is true. At least for copyright in the common law meaning.
Oracle got copyright on API signatures…
In civil law there is a bar to protection if the work lacks "substantial" creativity. But even this bar is extremely low. More or less everything besides maybe simple math formulas is protected.
Oracle got a very thin copyright on API signatures. The "programmer convenience" ruling in Google v. Oracle basically precludes almost all copyright action on APIs alone.
No, they got absolute copyright on the API signatures.
The court did not even question any copyright; it just assumed the APIs are copyrighted by Oracle. Then it looked for reasons why copying the APIs could possibly be fair use…
By the skin of their teeth they found some very involved and case specific reasons why Google's use of the copyrighted APIs was, after all, fair use.
The reason why SCOTUS bent over backwards to not talk about copyrightability was not because they assumed it was true for APIs, but because they didn't feel like they had all the facts. They basically said "we don't know if it's copyrightable, but if it is, here's a ruling that makes this case and anything similar to it go away".
Oracle only has copyright over APIs in the Federal Circuit, because they were able to hoodwink the judge into applying patent logic[0] to a copyright case. In other circuits it's still up in the air. And in the Ninth Circuit[1] there's already loads of controlling precedent that would have resulted in Oracle's case being summarily dismissed, API copyright or no.
The term "thin copyright" is a term of art. It refers to the kind of copyright protection you get from combining uncopyrightable elements in a creative way. For example, you can't own a particular chord progression. But, if you combine that with, say, a particular instrument, some audio engineering techniques, the subject matter of the lyrics, and so on... then you start getting something that requires creative effort and thus is copyrightable. Courts still have to take this into account when ruling on copyright claims as they do not want to give people a monopoly over just the chord, or just that instrument, etc.
In the case of APIs, we're talking about a series of names, plus an arrangement of type signatures that go with them. Very much a thin copyright, as the legal profession in the US calls it.
And when you have thin copyright, courts are going to be more liberal with handing out fair use exceptions. The "programmer convenience" argument that SCOTUS adopted means that copying an API to put in a different platform is OK. The Ninth Circuit says that copying an API to reimplement a platform that other people's code relies upon is also OK. There's very little room left to actually make a copyright claim on an API alone.
In the case of Copilot, it's not merely copying APIs and filling them out with novel details. It is either generating wholly novel code, or regurgitating training data, the latter of which is just a regular 'ol infringement claim with no difficult legal questions to worry about.
[0] The Court of Appeals for the Federal Circuit is the only court with subject-matter jurisdiction over patent claims. When you're the only person who can make hammers, everything looks like a nail.
[1] The Ninth Circuit court of appeals has jurisdiction over California, which means it takes on the brunt of copyright cases.
I still don't buy the part that there is not much to worry about.
The thing you call "thin copyright" is still copyright. Being protected or not is in the end a binary judgment: If your stuff is "a little bit" protected it is actually fully protected—with all consequences that follow from that.
Also, the mere "assumption" by the highest US court that APIs are protected is a very strong signal. They could just have ruled that there is no protection at all; case closed. But they preferred to go for a weasel solution. This has reasons… They deliberately didn't open up the door for API freedom. (Most likely to still be able to wield that weapon against foreign competition should they feel like it some day.)
The point is: IP law is completely crazy. The smallest brain-farts are routinely protected.
The exceptions to this rule are actually stronger in civil law, but still, even in the EU, single words or sub-second audio samples are protected by default. (Regarding APIs the situation is better, though: it's legal to reverse engineer something for e.g. compatibility, and a few other reasons, but those are explicit exceptions. The default is that almost every expression of even the slightest form of human "creativity" is copyrighted; the bar is extremely low, and it actually gets pushed constantly lower by common-law influence.)
So on both sides of the Atlantic the default is that every single line of code is protected. There is nothing like a lower bound on size. Then, from there, you could try to argue that there should be an exception from this protection in some particular case, e.g. that there was no "creativity" at all involved. But you will need to win an often very hard, expensive, and ridiculously long fight over that issue, and winning is nothing like a sure thing; the default is that just about everything is protected to the max. (Just look at all the craziness around news headlines in the EU; Google lost that case back then. To understand this better, as it may be very surprising to US readers: civil law does not recognize anything like "fair use". There are exceptions to copyright protection that have almost the same effect in the end, like grants for libraries or educational purposes, but those exceptions, and their limitations, are listed explicitly in the law. If no exception is listed, there just isn't one, and only the very vague "creativity bar" remains.)
Regarding Copilot: it makes not much difference whether this machine spits out verbatim copies of (clearly copyrighted!) snippets or some "remix" thereof. There is no "novel" code if, at best, all this machine does is create "remixes" of the code it has in its database, based on the query given. (Its "knowledge base" is nothing else than a very funky database; technical details regarding the actual implementation of that database or its query system should not matter legally.)
Before this comes up again: no, any comparisons to how humans learn are irrelevant in this consideration. That machine is not a human. It's a machine. End of story. So even if you also consider a human brain a kind of "funky database", this makes no difference.
I haven't heard anyone saying that copilot is legal "just because it's AI." That's a pretty bad faith, reductive, and disingenuous representation. The core argument I've seen is that the output is sufficiently transformative and not straight up copying.
I wasn't really trying to address whether the argument is valid, I was just noting the representation of the other side here is reductive to the point of being in bad faith. I find that kind of rhetoric a little frustrating since it's kind of inflammatory, and, I believe, not particularly productive towards having honest/informative disagreements and discussion.
I think if another algorithm was used instead of ML that did the same job as Copilot, then people would be making the same arguments. I think it's just the case that ML is just the first tech capable of doing what Copilot is doing.
You can't copyright an algorithm, you can copyright a particular expression of one, or you can attempt to patent an algorithm, but two authors can legitimately write the same thing and not infringe on each others copyright unless one copied from the other.
Suppose you own the rights to a jpeg, and I apply a simple algorithm that increments every hex value, so 00 becomes 01 and so on. The gibberish images it spits out would be so different from your original image that you wouldn't have any claim to them at all.
So I may create a tool that is capable of "incrementing every hex value" of an image, and also of "decrementing every hex value", and then distribute any of your images after "incrementing the hex values", together with said tool, right?
Or maybe it would be enough to just zip your image to be allowed to distribute it? In the end, the bytes I would distribute would then "be so different from your original image that you wouldn't have any claim to them at all", right?
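In code, the "tool" described would be trivial, which is exactly the point (a toy TypeScript sketch, with Uint8Array standing in for the image's raw bytes):

function incrementBytes(data: Uint8Array): Uint8Array {
  return data.map((b) => (b + 1) & 0xff); // 0x00 -> 0x01, ..., 0xff wraps to 0x00
}

function decrementBytes(data: Uint8Array): Uint8Array {
  return data.map((b) => (b - 1) & 0xff); // exact inverse: the original is fully recoverable
}

Since the transform is a trivial bijection, the "gibberish" plus the tool is informationally identical to the original image.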
I encourage you to go get a copy of the latest hollywood blockbuster, apply your transformation, share it on the internet and see if the courts agree with your copyright hack.
Humans are just compression with extra steps by that logic.
There's a fairly simple technical fix for Codex/Copilot anyway: stick a search engine on the back end, index the training data, and don't output things that are found in the search engine.
If I were to memorize my employer's IP then reproduce it (almost) verbatim and give it to a competitor, then I would be setting myself up for a world of legal hurt.
So yes, it is like how human memory is compression with extra steps.
I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.
The real solution is very, very simple. Only use opt-in training data. Don't acquire codebases from people who didn't agree to it.
If I own a repository on GitHub and I have received contributions from other people, or included a .h file from mpv (a thing that I have done), do I still have the right to click the opt-in button? I didn't ask the other contributors.
But GitHub is in a position to scan my code and see if there are copy-pasted bits, and to disable the opt-in button in that case.
Except they act in bad faith so they wouldn't do that.
> I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.
Algorithms can't be patented or copyrighted, as they are pure mathematics. If an implementation of an algorithm has no creative content because it is succinct then it likely doesn't deserve copyright.
We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that resembles public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. In addition, we have announced that we are building a feature that will provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code, as well as explore and learn how that code is used in other projects.
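A minimal sketch of how such a filter could work, assuming a hypothetical publicCodeIndex set containing whitespace-stripped 150-character windows of public code (the window size and whitespace handling follow the description above; the real filter also considers surrounding code, which this simplification ignores):

function shouldShowSuggestion(
  suggestion: string,
  publicCodeIndex: Set<string>,
  window = 150
): boolean {
  const normalized = suggestion.replace(/\s+/g, ""); // ignore whitespace, per the filter description
  for (let i = 0; i + window <= normalized.length; i++) {
    if (publicCodeIndex.has(normalized.slice(i, i + window))) {
      return false; // match or near match against public code: suppress the suggestion
    }
  }
  return true;
}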
Attributions are fundamental to open source? I thought having source openly available was fundamental to open source (and allowed use without liability/warranty) as per apache, mit, and other licenses.
If they just stick to using permissively licensed source code then I'm not sure what the actual 'harm' is with co-pilot.
If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.
After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example
You could have read the MIT license in its entirety in less than five minutes. It is very clear that preserving attribution is a required condition. Other permissive licenses even explicitly require attribution in binaries / documentation.
MIT License:
Copyright <YEAR> <COPYRIGHT HOLDER>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
People would likely not share any code if they could not trust that their work would be respected, and attributed. So yes, I believe it to be fundamental to open source.
People share proprietary code publicly. And the fact that you're allowed to read a book doesn't (currently) give you the right to copy it and redistribute the copy.
If I read 10 or 20 books about a topic and then go teach that topic to others, do I have to attribute each thing to all the authors from whom I learned it? And what if I come up with my own interpretation of a topic, do I have to trace it back to all the interpretations of all the authors that influenced it? Even more, do the previous authors also have to do that, and do I have to quote the whole chain of references? If not, why should an ML model, which is supposed to learn how coding works rather than memorize pieces of code verbatim, have to "because of copyright laws"?
It does give you the right to write excerpts from memory though. If it happens to exactly match the text in the book, nobody gets excited about that, even if you could potentially rewrite the whole book.
maybe that is true, but there exist others for whom that is not true, and as long as they number greater than zero, the argument that 'open source means free to use however for whatever' will be invalid
True and valid. But all those clauses, AFAIK, were written with the mindset of "if you want to run this code (particularly, but not limited to, for profit), you have to at least attribute it". Copilot allegedly doesn't run that code; it claims to read it, understand how it works, and then generate its own code that performs an equivalent function if requested. It's up to the lawsuit to decide if that's what it actually does, but my point is that the licenses simply did not cover this usage pattern, as much as no open source license requires any kind of action from someone who's just reading or studying the code.
> “AI” is just fancy speak for “complex math program”
Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different under the surface.
IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.
Who decides what constitutes an "AI program" vs just a "program"? What heuristic do we look at? At the end of the day, they have an equivalent of a .exe which runs, and outputs code that has a license attached to it.
I can suggest an idea, considering that the “AI program” is the model, not the training algorithm.
A program gets written by an entity (usually a person) and is executed to generate the desired output according to a deterministic mathematical function it expresses.
A training algorithm is a program that gets written to train a model (the model being the "AI program") when presented with some training data inputs, to implement a function that is not the training algorithm's own function, but another one, generalising over a problem domain beyond just the original examples fed to the training algorithm.
The output model is not the training algorithm or the training data (or an encoding of it) and exists as its own artefact, independent of both.
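As a toy illustration of that separation (hypothetical names; a one-variable linear fit in TypeScript):

type Model = { w: number; b: number };

// The training algorithm: one program, written by a person.
function train(xs: number[], ys: number[], epochs = 1000, lr = 0.01): Model {
  let w = 0;
  let b = 0;
  for (let e = 0; e < epochs; e++) {
    for (let i = 0; i < xs.length; i++) {
      const err = w * xs[i] + b - ys[i];
      w -= lr * err * xs[i]; // gradient step on the squared error
      b -= lr * err;
    }
  }
  return { w, b };
}

// The model is a separate artefact: neither the training code nor the
// training data, and it answers inputs it never saw during training.
const model = train([1, 2, 3, 4], [3, 5, 7, 9]); // fits y ≈ 2x + 1
const predict = (x: number) => model.w * x + model.b; // predict(10) ≈ 21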
That oughtn't be controversial, in fact I wouldn't even bother with 'on steroids', implying it's a slightly different/morphed thing. The way I learnt it (very slightly, at university, not a particular focus) it was abundantly clear it was just stats.
I bring the steroids thing up because it's only relatively recently that we've had the massive computing power at our finger tips that we do now. We discovered the foundations of our current ML techniques a relatively long time ago, it's only been recently that we've been able to throw data centers full of powerful GPUs and whatnot at them.
The only license that is permissive enough for AI training is CC0.
Art generators can't comply with attribution requirements and code generators don't know if and when they trip the GPL copyleft. I believe most permissive code licenses also have some kind of attribution requirement.
Who should be sued? Microsoft who produces an application known as "Copilot" which itself contains nobody else's code but Microsoft's? OR the person who USES Copilot, to produce code which contains somebody else's copyrighted code?
Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.
Microsoft produces a service known as "Copilot" which does contain other people's code. That the Copilot network contains other people's code is not in question, since it has been demonstrated to output other people's code, and Microsoft even added (very limited) filters to detect if it outputs other people's code.
Copilot only generates copyrighted code when it has seen that code many, many times; that's called memorization in machine learning, and machine learning researchers always try to decrease the amount of memorization in their artificial neurons.
Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.
The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.
The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.
It's not something to dismiss but it is something that has already been addressed. Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders, this was found to not be a violation of copyright.
Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.
At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.
This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.
The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.
Attribution is irrelevant in Authors Guild, the books were not released under open source licenses where attribution is sufficient to meeting the licensing terms. Google never sought or obtained licenses from any of the publishers, and the court ruled such a license was not needed as Google's usage of the contents of the books (scanning them to build a product) did not represent a copyright infringement.
Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.
It's an irrelevant discussion, though: the suit does not make a claim that the training of Copilot was an infringement, which is where Authors Guild is controlling precedent.
> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries
I agree it's relevant precedent, but not exactly the same. Libraries are a public good and more importantly Google books references the original works. In short, I don't think that's the final word in all seemingly related cases.
> More specifically, a computer consuming a copyright work is not a violation of copyright.
I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.
Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.
Your disagreement and feelings about how copyright and the law should work are valid, but they have very little to do with how copyright is addressed judicially in the United States.
In which case Google paid some hundred million dollars to companies and authors, created a registry collecting revenues and distributing them to rightsholders, provided an opt-out for already-scanned books, etc. Hey, it doesn't sound that bad for the same thing to happen with Copilot.
A) No it doesn't, there's nothing in the Copilot model or the plugin that represents or constitutes a reproduction of copyright code being distributed by GH/MS. The allegation is it generates code that constitutes a copyright violation. This distinction is not academic, it's significant, and represents an unexplored area of copyright law.
B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.
Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.
I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.
> The allegation is it generates code that constitutes a copyright violation.
The weights of the Copilot model very likely contain verbatim parts of the copyrighted code, just like in a zip archive. It chooses semi-randomly which parts to show, and sometimes breaks copyright by displaying large enough pieces.
Say you publish a song and copyright it. Then I record it and save it in .xz format. It's not an MP3; it is not an audio file. Say I split it into N chunks and share them with N different people. Or with the same people, but at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.
Am I violating your copyright? Am I entitled to do that?
To make it funnier: say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I violating your copyright by sharing that?
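The idea is easy to sketch, assuming we already have π's digits as a string (here just a hardcoded prefix). Everything below is a toy for illustration, and note the punchline: the (index, length) pair is usually bigger than the data it "compresses":
// Toy "pi compression": store data as an (index, length) pair into pi's
// digits. PI_DIGITS is just a hardcoded prefix for illustration; a real
// scheme would need arbitrarily many digits.
const PI_DIGITS = "14159265358979323846264338327950288419716939937510";
function piCompress(digitString: string): { index: number; length: number } | null {
  const index = PI_DIGITS.indexOf(digitString);
  return index === -1 ? null : { index, length: digitString.length };
}
function piDecompress(ref: { index: number; length: number }): string {
  return PI_DIGITS.slice(ref.index, ref.index + ref.length);
}
// piCompress("5926") yields { index: 3, length: 4 };
// piDecompress({ index: 3, length: 4 }) reconstructs "5926".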
I take your code and I compress it in a tar.gz file. I'll call that file "the model".
Then I ask an algorithm (Gzip) to infer some code using "the model".
The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.
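To make the analogy literal, here's a sketch using Node's zlib (run under Node.js; the snippet being "memorized" is arbitrary):
// Gzip as a "model": training is compression, inference is decompression.
import { gzipSync, gunzipSync } from "zlib";
const yourCode = `function isEven(n: number): boolean {
  return n % 2 === 0;
}`;
// "Training": your code goes in, a blob of math-transformed bytes comes out.
const model: Buffer = gzipSync(Buffer.from(yourCode, "utf8"));
// "Inference": another sequence of math steps emits your code verbatim.
const inferred = gunzipSync(model).toString("utf8");
console.log(inferred === yourCode); // true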
With the exception that there are infinite types of chords in this case. And even though many musicians follow familiar chord structures, the underlying melodies and rhythms are unique enough for any familiar person to be able to differentiate the Red Hot Chili Peppers from the All-American Rejects; and now there is a system where the All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt 2, All-American Rejects Boogaloo".
That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be whether they can convince a jury that this software is just stealing code, and whether it's wrong if a robot does it.
Google doesn't sell its search feature as a product that you can just plagiarize the results from and they're yours. Microsoft does that with Copilot.
Copilot is as much of a search engine as Stable Diffusion or DALL-e are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.
With Copilot it's your responsibility not to use it as a search engine to copy-paste code. It's completely obvious when it's being used as a search engine so it's not a problem at all.
Stable Diffusion works on completely different principles, and it can't exactly replicate pixels from its training data.
Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).
Now, while you may be able to get it to reproduce one function, reproducing one file, let alone the whole repository, seems extremely unlikely.
Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.
They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.
But there is no equivalent of "unzipping" for Copilot.
This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:
- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.
- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
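To make that second case concrete, here's a deliberately crude sketch: a character-level Markov model, vastly simpler than Copilot, trained on a single snippet. With only one training document, "sampling" collapses into verbatim regurgitation. All names and parameters are illustrative:
// Crude memorization demo: a character-level Markov model trained on one
// snippet has no choice but to reproduce that snippet. Purely illustrative.
function train(text: string, order = 4): Map<string, string[]> {
  const transitions = new Map<string, string[]>();
  for (let i = 0; i + order < text.length; i++) {
    const context = text.slice(i, i + order);
    const next = text[i + order];
    if (!transitions.has(context)) transitions.set(context, []);
    transitions.get(context)!.push(next);
  }
  return transitions;
}
function generate(model: Map<string, string[]>, seed: string, order = 4, maxLen = 200): string {
  let out = seed;
  while (out.length < maxLen) {
    const candidates = model.get(out.slice(-order));
    if (!candidates) break; // context never seen: generation stops
    // With one training document there is usually exactly one candidate,
    // so random "sampling" degenerates into verbatim copying.
    out += candidates[Math.floor(Math.random() * candidates.length)];
  }
  return out;
}
const model = train("function isEven(n) { return n % 2 === 0; }");
console.log(generate(model, "func")); // prints the training snippet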
There is no guarantee that an ML network only produces the input data under those two conditions. But even for the second case:
> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.
But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.
Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.
You can easily see this happen, the regurgitation of training data, in an overfitted neural net.
This is not necessarily true; the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training input, though, not so much in terms of fidelity as in changes to minor details.
When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.
Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.
> which Copilot doesn't have a license to distribute
When you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. The methods used for serving are not specified. This is above and beyond the license you choose for your own code.
You also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.
Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.
Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.
Actually, The Pirate Bay was even less of an infringement, as they did not distribute the copyrighted content or derivatives themselves, only indexed where it could be found. With Copilot, all the content you're getting goes through Microsoft.
"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."
It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.
This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.
> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?
Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.
The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.
I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.
If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.
It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.
If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.
Copyright laws, if enforced perfectly, would make programming simply impossible. We've been skating by on people not really enforcing them, despite the laws still being on the books, and the existence of tools like this makes that not a viable strategy. Today it's Copilot, which can be shut down, but tomorrow it'll be something developers can run at home. Bits don't have colour; there's no way to distinguish between a copy happening by independent recreation*, and one that's actually a copy. So we'll need proper rulings.
In fact, considering Fauxpilot, that will happen as soon as the models have improved somewhat.
*: Of course I don't think "independent recreation" is really a thing. Humans are excellent at open source laundering. It's called "learning".
The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.
Abstraction
The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction, these are: individual instructions, groups of instructions organized into a "hierarchy of modules", the functions of the lowest-level modules, the functions of the higher-level modules, and the "ultimate function" of the code.
Filtration
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.
Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.
Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis.
Comparison
The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.
The brain is also just a "complex math program", since math is just the language we use to describe the world. I don't feel this argument has any weight at all.
The legal world tends to be less interested in these kind of logical gotchas than engineering types would like. I don't see a judge caring about that brain framing at all.
Not to mention, if your brain starts outputting Microsoft copyright code, they're going to sue the shit out of you and win, so I'm not sure how that would help even so.
So if I read the Windows Explorer source code, then later produced a line-for-line copy (without referring back to the source), Microsoft couldn't sue me?
Explain yourself. There is no understood natural phenomenon which we could not capture in math. If you argue the behavior of the brain cannot be modeled using a complex math program, you are claiming the brain is qualitatively different from any mechanism known to man since the dawn of time.
The physics that gives rise to the brain is pretty much known. We can model all the protons, electrons and photons incredibly accurately. It's an extraordinary claim to say the brain doesn't function according to these known mechanisms.
You are confusing the nondiscrete math of physics with the discrete math of computation. Even with unlimited computational resources, we can't simulate arbitrary physical systems exactly, or even with limited error bounds (see chaos theory). What a program (mathematical or not) in the Turing-machine sense can do is only a tiny, tiny subset of what physics can do.
Personally I believe it’s likely that the brain can essentially be reduced to a computation, but we have no proof of that.
> We can model all the protons, electrons and photons incredibly accurately.
We can't even accurately model a receptor protein on a cell or the binding of its ligands, nor can we accurately simulate a single neuron.
This is one of those hard problems in computing and medicine. It is very much an open question about how or if we can model complex biology accurately like that.
> There is not a understood natural phenomenon which we could not capture in math.
This is a belief about our ability to construct models, not a fact. Models are leaky abstractions, by nature. Models using models are exponentially leaky.
> I didn't say we can simulate it.
Mathematics (at large) is descriptive. We describe matter mathematically, as it's convenient to make predictions with a shared modeling of the world, but the quantum of matter is not an equation. f() at any scale of complexity, does not transmute.
I'm using simulate as a synonym for model. For any biological model at the atomic, molecular and protein levels, accuracy is key for useful models. What I'm saying is that accuracy at that level is a hard problem in computing and biology, and even simple protein interactions are hard problems.
> There is not a understood natural phenomenon which we could not capture in math.
You are saying "If we know how something works, we can explain how it works using math."
But we know almost nothing about how the brain works.
> The physics that gives rise to the brain is pretty much known.
...no it is not! No physicist would describe any physical phenomenon as being "pretty much known". Let alone cognition. We don't even have a complete atomic model.
I think you are mostly correct but most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality. For some reason the brain is "magic" and non-physical unlike other organs (and everything else that exists) to most people. It's almost impossible to convince anyone of this though and it's not even worth trying.
> most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality.
You have it reversed. Math is a language tool to describe things, in a limited fashion (our current modeling); physical matter (even antimatter) is another thing entirely. If you believe that there will be a language that can describe anything, it still doesn't manifest matter by speaking that language or describing it... unless you're into magic or spirits or whatever.
This disconnect has nothing to do with how well we do or do not understand physical phenomena. I think what the OP meant to say (and probably you support) is how the "mind" or how we think, can be described with mathematical models. Maybe one day we will have a full understanding, but we're not there yet and not currently in a way that is legally compelling.
I feel like this is a massive oversimplification...
In this answer, you're completely ignoring the massive fact that we cannot create a human brain. Having mathematical models of particles does not mean we have "solved" the brain. Unless you also believe that these LLMs are actually behaving just like human brains, in that they have consciousness, they have logic, they dream, they have nightmares, they produce emotions such as fear, love, anger, that they grow and change over time, that they control your body, your lungs, heart, etc...
You see my point, right? Surely you see that the statement 'The brain is also just a "complex math program"' is at best extremely over-simplistic.
There's certainly no model of a brain at the level of protons, electrons and photons. That's way beyond our level of mathematical understanding or computational ability. Biology isn't understood at the level of physics.
Somewhere in the complex math is the origin of whatever it is in intellectual property that we deem worthy of protection. Because we are humans, we take the complex math done by human brains as worthy of protection by fiat. When a painter paints a tree, we assign the property interest in the painting to the human painter, not the tree, notwithstanding that the tree made an essential contribution to the content. The whole point is to protect the interests of humans (to give them an incentive to work). There is no other reason to even entertain the concept of "property".
As long as AIs are incapable of recognizing when they are plagiarizing, as humans are generally capable of, the double standard seems entirely warranted.
Well, that you caught yourself is already something that makes a difference. It would already change the equation if Copilot would send an email saying “Hey, that snippet xyz I suggested yesterday is actually plagiarized from repo abc. I’m truly sorry about that, I’ll do my best to be more careful in the future.”
As far as “citation needed”, humans are being convicted for plagiarism, so it is generally assumed that they are able to tell and hence can be held responsible for it.
Responsibility or liability is really the crux here. As long as AIs can’t be made liable for their actions (output) like humans or legal entities can, instead the AI operators must be held accountable, and it’s arguably their responsibility to take all practical measures to prevent their AIs from plagiarizing, or from otherwise violating license terms.
At this point we are back in the territory that the idea and the expression of the idea are inseparable, therefore the conclusion will be that copyright protection does not apply to code.
Personally, I think this has the potential to blow up in everyone's faces.
If it does end up that way, I feel like the trickle away from github will become a stampede. And that would be unfortunate. Having such a good hub for sharing and learning code is useful, but only if licenses are respected. If not, people will just hunker down and treat code like the Coke secret recipe. That benefits no one.
The problem with the class action lawsuit against GitHub is this: if you host your code on GitHub, it doesn't matter what license you use. Microsoft can do whatever they want with it. You agreed to this by agreeing to their terms and conditions.
The end user agreement also says you must have the authority to grant these epic rights to GitHub, i.e. you cannot upload someone else's code. They could probably absolve themselves from responsibility due to your having committed wire fraud in this case. But, alas, IANAL.
If I have access to the source of a BSD, MIT, or GPL project - is there anything in those licenses that would prevent me from mirroring it on GitHub or GitLab?
I am sorry for not bringing any kind of legal perspective here, but:
*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.
And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?
On piracy, HN users defend Sci-Hub to protest against the academic publishing industry, which involves large corporations such as Elsevier charging publishing and subscription fees that are much more than the value that these corporations bring to the actual research, review, and publication. Academics need to publish in order to survive, and they individually do not have enough power to subvert the existing academic publishing system. Since academics do not receive royalties, Sci-Hub enables academics to pay less into the same system that exploits them for profit. By supporting Sci-Hub, HN users take a populist stance by supporting individuals against the system.
The situation with Microsoft and Copilot is the exact opposite. Here, Microsoft is misusing its acquisition of GitHub to repackage the work of individual free and open source contributors into a proprietary product in violation of the authors' software licenses. These licenses do not even require Microsoft to pay. They only require attribution and redistribution under a compatible license. Supporting Microsoft's misuse of GitHub is an anti-populist stance that puts the interests of the corporation over the interests of the individuals.
The argument for libgen is that whatever damage there is to the authors missing out on revenues is outweighed by people being able to get books that they otherwise wouldn’t be able to afford (especially in developing countries).
In the case of copilot, the damage suffered by the authors is close to zero. And those who benefit the most are the authors themselves. A double digit percent productivity enhancement is worth more to me than a few million $ to a trillion dollar company, especially because MS has to pay for compute.
You're assuming that Microsoft needs to shut down Copilot to comply with the licenses of the software they misused. That is not the case. To make Copilot legitimate, all Microsoft has to do is restrict Copilot's inputs to non-proprietary code, release the Copilot dataset under a compatible license, and clarify that the code generated by Copilot is also covered under that license. Attribution can be done by inserting a comment with link to a paginated list of all of the contributors whose code was used in Copilot.
Microsoft can even continue to sell Copilot as a service while keeping it license-compliant, since most developers are not going to self-host the entire dataset. Microsoft can also choose to exclude copyleft-licensed code from Copilot or create multiple flavors of Copilot, each licensed differently. You can get your "productivity enhancement" without needing Microsoft to violate software licenses.
The damage is not in the monetary payment denied to free and open source software contributors, payment these contributors never demanded. The damage is in Microsoft violating other people's software licenses to create a proprietary product derived from copyleft-licensed and attribution-required code, and in Microsoft encouraging other developers to violate these licenses. Microsoft needs to rectify these violations with specific performance.
Yes, the passion to defend copyright trolling for isEven functions (an example included in the filing) from people here is bizarre.
I can't decide if people just hate Microsoft enough that a future where you must pay to include an isEven function in your code is a price worth paying to give them a bloody nose, or if there is just a large contingent of users making millions off their GPL code who are put out.
More seriously yes, copilot damages copyright (or is perceived to) and that is a good outcome irrespective of the actor. I will never see eye to eye with people defending the existing legal framework.
The idea that using copilot "damages" copyright seems ill-considered.
Suppose a commercial software company just took GPL-licensed software, openly incorporated it into their own code, and then sold it. Would that "damage" copyright also? Remember, there's no legal principle that says "if we catch you violating copyright, your stuff is now free". The copyright holder can sue for damages or to stop distribution and that's it.
People like Larry Ellison of Oracle have claimed they just steal GPL'd stuff 'cause it's there. But Oracle defends its copyrighted code very aggressively. Oppositely, the GPL is intended to allow more open access than public domain in a time when commercial companies want to take anything they can get.
1) This "copilot is great 'cause copyright is evil" argument breaks when you look at the fact that copilot is copyrighted, closed software tool for producing closed, copyrighted software. If you trained copilot on GPL'd software and specified that copilot's output was also GPL'd maybe you'd have some reasonable claim (but even then, the attribution claim would come in).
2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given the current situation: Microsoft/OpenAI selling to commercial software developers who sell to the general public.
>2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given the current situation: Microsoft/OpenAI selling to commercial software developers who sell to the general public.
What they are good at is predicting what's after the text. The problem of predicting what's next could be used to create a universal artificial intelligence (there's a mathematical definition for this). I.e. if you have a system which is very good at predicting what's next, you could get to very powerful AI.
> What they are good at is predicting what's after the text. The problem of predicting what's next could be used to create a universal artificial intelligence (there's a mathematical definition for this). I.e. if you have a system which is very good at predicting what's next, you could get to very powerful AI.
The intelligence of human beings isn't unspecifically good at "predicting what's next"; rather, it is good at particular sorts of predictions in particular contexts, often involving the person having helped create the situation. I'm fairly safe at driving because I maintain an arrangement of my vehicle in a fashion that allows me to predict easily what's next, as well as allowing me to adjust if my predictions are wrong. Self-driving software might predict what's next as well as me in normal circumstances, but it's neither aware of larger context nor does it do things to maintain "smooth traffic flow".
Oppositely, being able to predict anything generically would certainly be limitless intelligence, but you can't describe any real system with just that. Copilot is trained with a certain window, with the transformer's special element giving more context, but I don't think very many people doing current research expect that to become generic prediction. I think I'm describing the consensus that it's a "better Google" for finding code one can use - and Google is a pretty good resource for this - if you aren't doing something unusual or difficult.
Jeff Hawkins also makes the "prediction is intelligence" claim, but I think your approach and his miss that human intelligence is good not by being generic but by doing more specific things.
The entire response to this suit on this site is mind-blowing to me. Everyone is up in arms that someone trained an AI model that could potentially spit out tiny, twisted fragments of public, open-source code. This response is nothing but selfish behavior that runs counter to the core principles of open source development and the free software movement.
Your code is protected by copyright. The license allows for what would otherwise be copyright infringement.
But training an AI model on media (code or otherwise) is not copyright infringement, so the license is irrelevant.
It's selfish to pretend otherwise and to try to assert a copyright right that doesn't exist, for the purpose of impeding progress in a field that benefits us all.
Here is a snippet of things that copyright is intended to cover:
>the right to exclude others from making certain uses of the work: copying it, making a derivative work based on that work, distributing copies of the work to the public, and publicly performing or displaying the work.
So why would "training" "AI" on code with the intention of emitting derived works not be copyright infringement exactly?
This product is transforming copyrighted code into something that's intended to be used or sold in other works. The snippets it emits are directly derived from copyrighted code.
The most common argument against this is that humans also learn from copyrighted material. My argument against this is that CoPilot is not a human and should not be assumed to inherit rules intended for humans.
>in a field that benefits us all
As it stands currently, CoPilot is proprietary and does not benefit anyone except Microsoft. If CoPilot were released under a FOSS license it would actually benefit us all. Most of the people against CoPilot are not against AI, but rather against a proprietary AI product transforming FOSS work into other potentially proprietary works, with the intention of profiting off of the completion service and hoarding the code that powers it.
> But training an AI model on media (code or otherwise) is not copyright infringement, so the license is irrelevant.
Well, maybe. But even if we assume that this is true, when anyone later uses the AI to reproduce a copy of the code, a copy has been made and copyright has been infringed.
If I need code to loop over 10 lines, I'll code a for loop the same way regardless of what I'm developing.
Define for me: at what point of complexity does code get copyrighted?
The things copilot is outputting are literally small chunks of code that need a lot of cleanup afterwards. It's not like I type "Build twitter for me" and BAM, I get a working clone of twitter.
Every line that you produce is copyrighted by you, automatically. There is no "obviousness" test for copyright: if you decide to code a for loop, then that for loop is copyrighted the moment you type it in.
> copilot is outputting literally small chunks of code that need a lot of cleanup afterwards
If you start from copyrighted chunks, then clean them, you're still violating people's copyright. Multi-million dollar lawsuits have been fought over people using small samples from other people's music, cleaning them, and releasing them as parts of their own song.
Well yeah, it is. Some function that you wrote years ago in an hour having one line taken out of it to be used in a project that wouldn't have made money anyways doesn't hurt you at all regardless of what license is on your project. Getting upset because "that was MINE!" does come off as very selfish.
Nobody wants to completely ban AI or training. You can have progress in this area and still respect people's intellectual property. Framing this as destroying progress is silly.
Why can't they train on the code they own, such as Windows sources for example?
Or even better, why can't they release CoPilot itself under an open source license that is compatible with the licenses of code they would like to train on?
Also, I don't think anyone cares about the monetary aspects. The idea behind GPL-style licenses is to make sure that code remains free, regardless of what or who uses it. Freedom in this context refers to the ability to study the code, modify the code, and distribute any modifications. Without the GPL, the code can be used in a proprietary product which strips those rights away from users of the product.
Copyleft uses copyright law to attempt to guarantee freedom for users. This is the inverse of what normal copyright does, which is to allow a single entity to sit on the ideas and not allow others to benefit from them.
If we can just strip copyleft licenses from projects, we are giving up those guarantees that GPL code will remain free for all users.
The GPL is trying to do its job here, not slow down progress. Progress would be everyone benefiting from the technology behind CoPilot, rather than just Microsoft sitting on the project and selling it as a service.
> And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I just hope Microsoft AutoPlagiarist is not the Final Solution to Free Software they have been seeking since before the millennium's turn.
Seems to me this discussion is likely to pivot on a fulcrum located between "old enough to remember Microsoft before Bill Gates began spending his ill-gotten gains on philanthropy" and "young enough to see Microsoft primarily as the Xbox people".
I am only slightly hopeful for a lawsuit so that it loses spectacularly and sets a precedent that this is legal; however I don't trust the legal system enough to think they'll logically reach that conclusion.
My code is 100% in GitHub Copilot; is there any way to publicly say that I'm against the lawsuit even if they pretend to represent me?
If you'd license or relicense all your code to explicitly allow AI training without attribution on it, I would see that as you saying you're in favor of copilot here.
Is it patent trolling when you are defending your future labor from being made obsolete by megacorps and singularitarians using your past labor without permission?
Patent trolls that are hated look like: You develop package to do X from first principles, then get sued because someone patented using a known algorithm for the purpose of X.
Copyright working in a supported/non hated way: You develop a package to do X by cribbing off someone else's package X. They sue you for stealing their work, not to make money off you.
Situation at hand is case 2, hence the lack of interest in financial gain.
Why is this case 2, when it does not always reproduce the copyrighted works exactly? Situation: you realise that rather than cribbing off of one person's package X, you can crib off two other package Xs and mix/average their contents. Scale this to hundreds of packages.
Eventually, ML should avoid this by developing to work from first principles, writing in its own style, with public code used only for validation of its ability to understand and write code.
Nope. It's about fairness. Until Microsoft/Google/Apple/BigCorp all release their software/designs/maps, count me in favor of copyright for the small guy. And I'm someone who especially hates copyright/patents.
Agreed. My only consolation is that as the technology improves and it becomes easier to train these types of models on modest hardware at home, the detractors of this technology have already lost, but rather than a mercy killing they prefer to bleed out slowly.
The general attitude of AI researchers/developers is fuck as many other industries as possible, they generally don't care about consent at all. So it's hardly surprising that someone eventually decides to challenge that.
Not just copyright: what Copilot does is fair use. If Google takes tens of thousands of lines directly and gets away with it, it's going to be impossible to see any logic under which AI being trained on millions of lines of code is not fair use with respect to any individual open source copyright holder.
You haven’t actually thought through what kind of world it would be if there was no copyright law, have you? I don’t know what your political leanings are, but I’ve met some libertarians who are blissfully naive about the extent to which their world and worldview is buttressed by laws and the governments that enforce them, and your comment reminds me of that.
I'm from a country which basically ignores all copyright laws in practice. The musical scene relies on live sessions to generate any money, and the movie scene (which was never anything special) is mostly dead. That's because any media that hits the market is copied and sold everywhere. Even books, if they get popular, will get copied and sold, with little enforcement, at every corner.
This is mostly because copying requires little effort compared to the act of creating. So there is no incentive to create, because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one copy and copies it to sell at 25% of the price. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
> the movie scene (which was never anything special) is mostly dead
To be fair, from my experience most countries don't have much of a movie scene even with copyright and instead mostly import hollywood stuff.
> This is mostly because the means to copy requires little effort compared to the act of creating. So, there is no incentive to create because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one and copy it to sell at 25%. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
So don't compete by selling copies but by funding the creation up front. No one is claiming that abolishing copyright won't be disruptive to any existing business models - in fact, that's the point: once something becomes part of our shared culture it is ridiculous to let one entity continue to have exclusive rights so if your business model relies on continued royalties, find a better one.
Otherwise, perhaps consider continued payments to everyone who built your house, computer and whatever else you use if you think that is a great way for society to function. Don't worry, the way things are going we might get there via technical means anyway.
What you describe is true everywhere. Books were never a get-rich-quick scheme. There are like seven bands in all of history that get significant revenue from anything other than live shows.
But what do you mean, no incentive to create?
The Tao Te Ching was reluctantly written after the author was begged by his pupils. Most Greek philosophers' teachings were only written down after their deaths because other people thought that was an important job. On The Origin Of Species is a book because that was just the normal way to communicate scientific findings in Darwin's time. Da Vinci saw some fat commissions in his life, but the Mona Lisa certainly never brought him any money. In fact, out of my twenty favorite artists, maybe two saw anything approaching fame in their lifetime.
Please, go to some random DeviantArt page or Spotify profile or GitHub repo with 3 views and tell me why it exists when the only reason for human creation is dollars and red carpets...what a sad perspective, really
Well, Linux might not exist, at least not in its current form, enabled by a "share and share alike" license that has meant that companies contribute to it instead of copying and closing it off from others.
More generally, what is left to protect any creative work besides guarding physical access? Why would any company make any movie or tv show if it could be copied and redistributed by others endlessly the moment it gets shown once?
That's assuming the only way to fund the creation of something is to sell copies after the fact. And assuming that people only create for monetary compensation.
There have been creative endeavous before copyright and there would be creative endeavours after copyright. Perhaps even more since people are free to remix and share without restrictions.
Well since you asked, I am about as far away from libertarian as you can be without making a point out of it.
That doesn't mean I must be in favor of every repressive innovation-stifling law that was ever cooked up.
You bring up the arts in another comment; ever considered why like half the people regarded as genuinely world-changing geniuses (da Vinci, Galileo, Columbus, Machiavelli, Michelangelo) were born in the same two hundred years in the same region? Because the Italian Renaissance was all about intense, free information-sharing! People freely visited each other's workplaces and ruthlessly stole from each other, and it was accepted. Boom, you get a period of unparalleled human productivity.
And now you want to tell me that a set of weird laws that only ever benefited Disney and Elsevier is the only thing preventing humanity from ceasing to create awesome shit? Nah man, the masses will always continue creating, exactly as proven by the fact that they did in the last decades while getting continuously butt-fucked by the very laws you pretend are made to protect them...
As a non-lawyer, I am very suspicious of the claim that "Plaintiffs and the Class have suffered monetary damages as a result of Defendants' conduct." Flagrant disregard for copyright? Sure, maybe. The output of the model is subject to copyright? Who knows! But the copyright holders being damaged in some way? Seems doubtful. The best argument I could think of would be "GitHub would have had to pay us for this, and they didn't pay us, so we lost money," but that'd presumably work out to pennies per person.
The common practice in copyright cases is to calculate damages based on the theoretical cost that the infringer would have paid if they had bought the rights in the first place. This method was used during the Pirate Bay case to calculate damages caused by the site's founders.
They did not actually calculate damages in terms of lost movie tickets or estimated vs. actual sales numbers of game copies. When it came to pre-releases, where such a product wouldn't have been sold legally in the first place, they simply added a multiplier to indicate that the copyright owner wouldn't have been willing to sell.
For software code, another practice I have read about is to use the man-hours that rewriting the copyrighted code would cost. Using such calculations, they would likely estimate the man-hours based on the number of lines of code and multiply that by the average salary of a programmer.
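As a toy illustration of that kind of estimate (every figure below is invented, and courts would argue over all of them):
// Toy "cost to rewrite" damages estimate. All figures are invented
// assumptions purely for illustration.
function rewriteCostEstimate(linesOfCode: number, locPerHour = 10, hourlyRate = 75): number {
  const manHours = linesOfCode / locPerHour;
  return manHours * hourlyRate;
}
// e.g. a 2,000-line library: 2000 / 10 = 200 man-hours; 200 * $75 = $15,000.
console.log(rewriteCostEstimate(2000)); // 15000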
The one thing we can say with complete certainty is that most programmers who had their code used without permission will not receive very much money at all if this class action lawsuit is decided in their favor.
I don't care about the money. I support this because it will establish case law that other companies can't ignore licenses as long as they throw AI somewhere in the chain.
If "I took your code and trained an AI that then generated your code" is a legal defense, the GPL and similar licenses all become moot.
"This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content"
Money likely isn't the main goal (maybe it is for the lawyers); these are open source repos. Maybe they didn't consent to have their code used as training data, and that seems like the kind of thing consent should be needed for. Maybe the AI spitting out copied snippets is a violation of open source licensing without attribution.
So for isEven, can we go with how much a student might accept, say $20 an hour, multiply that by the one minute required to create it, and offer them 33 cents?
"Using such calculations they would likely estimate the man hours based on number of lines of code and multiply that with the average salary of a programmer."
The average salary of a programmer in which country?
So much programming is outsourced these days, and in some places programmers are very cheap.
This is just my guess, but I think the intention from the judges is not to actually calculate a true number. The reason they used the cost of publishing fees in the Pirate Bay case was likely to illustrate how the court distinguished between a legal publisher and an illegal one. The legal publisher would have bought the publishing rights, and since The Pirate Bay did not do this, the court uses those publishing fees to illustrate the difference.
If the court wanted to distinguish between Microsoft using their own programmers to generate code vs taking code from github users, then the salary in question would likely be that of Microsoft programmers. It would then be used to illustrate how a legal training data would look like compared to an illegal one.
Those damages are enumerated on pages 50-52. Remember, "damages" is being used in a legal sense here -- for a non-lawyer, you can interpret it more like "a dollar value on something someone did that was wrong". This is more broad than the colloquial use of the word.
If your intent is to create a competing product for profit, chances are that won't be found as fair use, given that determining fair use depends on intent and how the content is used.
Using clips from a movie in a movie review is probably fair use.
Using clips from a movie in knock-off of that movie for profit? Probably not fair use if it's not a parody.
Copilot is not like a movie reviewer using clips to review a movie. Copilot is like a production team for a movie taking clips from another movie to make a ripoff of that movie and selling it.
Consider every repo on github to be a movie. Copilot is taking individual frames out of every movie on github and compositing them into a new film.
I think most of us would agree that individually, each frame is copyrighted. But what if you take one frame from a million different movies and put them in an order that produces a new coherent movie?
The core question we need to settle in court is: does the new movie become its own copyrightable work, or is it plagiarism?
You're mistaking the end-user's copyright infringement with Copilot's alleged infringement.
Copilot is fair use and transformative -- that is, unless there is an open source Copilot that Copilot is training on; only then would it be competing, and it's easy for GitHub or OpenAI to exclude those repos of Copilot alternatives from the training set.
> Copilot is like a production team for a movie taking clips from another movie to make a ripoff of that movie and selling it.
I can't think of a 5 line snippet I've written or read that makes sense to claim ownership of. They don't stand on their own in the way even a 30s movie clip does.
I don't think that's comparable. For starters, it's not just the length of a quote that makes it fair use, but the way quotes are used, i.e. to engage in commentary.
It's the license that matters, not whether the code is visible on Microsoft's website.
Code which anybody can view is called "source available". You aren't necessarily allowed to use the code, but some companies will let their customers see what is going on so they can better integrate the code, understand performance implications, debug and fix unexpected issues, etc. The customers would probably face significant legal risks if they took that code and started to sell it.
"Open source" code implies permission to re-use the code, but there is still some nuance. Some open-source licenses come with almost no restrictions, but others include limiting clauses. The GPL, for example, is "viral": anybody who uses GPL code in a project must also provide that project's source code on request.
What do you think the chances are that Microsoft would surrender the Copilot codebase upon receipt of a GPL request?
Aren't there statutory damages for copyright infringement, i.e. there is a presumption that each work infringed is worth at least a certain amount without proving actual damages?
I'm not confident in this stance - sharing it to have a conversation. Hopefully some folks can help me think through this!
The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively though the software license gave a "virality" to software freedom.
If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?
It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.
Nobody is laundering away proprietary licenses, because that code is not open source and not in public github repos. And OSS capabilities are now present in copilot, which is neither free nor open. Furthermore, these contributions are making their way into proprietary code, and the OSS licensing becomes even further watered down. This is the epitome of what copyleft is against!
Indeed, the ability to 'launder away' proprietary licenses when source is available means that companies in the future (that would otherwise provide source under a non-permissive license) will shift in favour of not providing source code at all.
Code published on Github is not necessarily open source. There is a lot of code there that has no particular license attached, which means that all rights are reserved except for those covered in the Github TOS, which I believe just covers viewing the code on Github.
I'm not sure this is true. Proprietary source code gets leaked and that can be used to train a NN. I find it likely that Copilot was trained against at least one non-OSS code base hosted on GitHub.
Second, if copyright is being laundered away we can get increasingly clever with how we liberate proprietary software. Today, decompiling and reverse engineering is a labor intensive process. That's the whole point of "open source" - that working in source is easier than working in bytecode. Given the hockey-stick of innovation happening in AI right now, I'd be surprised if we don't see AI assisted disassembly happening in the next decade. If you can go from bytecode to source code, that unlocks a lot. Even more so if you can go from bytecode to source code and feed that into a NN to liberate the code from its original license.
I follow your explanation but not your conclusion.
What I think GP is getting at is that all this OSS/licensing stuff was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: that information wants to be free.
Now we have a big corporation making a public statement that it puts the value of advancing humanity over the value of honoring weird old Victorian ideas of "intellectual property" - which is what we were always trying to do, no?
Not that there is nothing to criticize, but I think that's a good thing on the whole.
> [OSS] was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: that information wants to be free.
Information may want to be free, but users of free information often want to enrich their private endeavors by shackling the information that was given to them freely.
The (A|L)GPL acknowledges the fact that some people and corporations like to use free-and-gratis work in their products and not reciprocate the courtesy shown to them by the authors of that work. (I choose the (L)GPL whenever I can, so that folks who derive from my work are required to either make it available as I have, or pay me enough that I don't mind them shackling my work.)
The BSD license acknowledges the fact that some people and many corporations like to use gratis work in their closed-source products and never even do so much as bother to credit the authors of work that they used.
For as long as powerful folks continue to use and improve upon gratis information and software without contributing back the products and improvements built on it, the 'weird old Victorian ideas of "intellectual property"' are going to have to continue to be dealt with. Remember... you likely cannot afford an army of lawyers to ensure that no one uses your work without paying you, but big companies like Microsoft, RedHat, IBM, Oracle, etc., and wealthy individuals can.
For as long as those wealthy entities can lock up and force you to pay for their work and ideas, but make it ruinously expensive for us little people to -individually- do the same to them, we'll need "weird old Victorian" things like licenses to help correct this imbalance of power.
It looks like you're missing the entire purpose of copyleft vs public domain.
The point is that copyleft source code cannot be used to improve proprietary software. That limitation is enforced with copyright.
Proprietary software is closed source. You can't train your NN on it, because you can't read it in the first place.
If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain. The entire purpose of copyleft is to compel that person to "pay it forward", by publishing their code as copyleft. This is why Stallman is a proponent of copyright law. Without copyright, there is no copyleft.
I'm replying to the claim that RMS supports copyright. I don't believe he does; I believe he would rather it not exist at all, but since it does, you have to make use of it.
That's the full context of what I was saying. Copyleft is a hack on copyright. In a world where copyright is enforced, RMS doesn't consider the neutral ground of permissive licenses (like the popular MIT and BSD licenses) good enough. They do nothing to solve the problem of proprietary software.
The GPL is entirely dependent on copyright. Rather than pretend copyright doesn't exist, the GPL turns it in the other direction. By violating the GPL, Copilot is still violating copyright.
> If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain.
And then if we can close that loop by taking their proprietary software and feeding it into a NN to re-liberate it isn't that a net win for software freedom?
Today, crossing the source-code-to-bytecode veil effectively obfuscates the implementation beyond most humans' ability to modify the software. Humans work best in source code. Nothing says our AI overlords won't be able to work well in bytecode, or take it in the other direction.
I guess what I'm saying is, today a compiler is a one-way door for software freedom. Once code goes through the compiler, we lose a lot of freedom without a massive human investment or the original source code. Maybe that door is about to become a two-way door, with copyright law supporting moving back and forth through it?
I can’t tell if you disagree with the rest of my comment or didn’t bother to read it…
That’s literally not the definition of proprietary.
You download proprietary software when you navigate to (nearly) every webpage. Just because a website like HN sends you (possibly unobfuscated) HTML, CSS, and JS over the wire in plain-text does not mean those files are not proprietary. Those files are covered by copyright in the U.S.
Access to the source code is not sufficient for that source code to be FOSS.
You also failed to acknowledge leaked source code and bytecode decompilation, which were a substantial portion of my comment.
The definition gets a bit blurry around software, just like the definition of "ownership" does.
Colloquially, "proprietary software" means closed-source. You can definitely put it in context where it means "copyright without license"; but outside that context, the colloquial meaning is enough.
I think (1) you're mainly missing that copyleft vs non-copyleft is actually irrelevant for the copilot case. You also (2) may be missing the legal footing of copyleft licenses.
(1) The problem with Copilot is that when it blurts out code X that is arguably not fair use (given how large and non-transformed the code segment is), Copilot users have no idea who owns the copyright on X, and thus they are in a legal minefield because they have no idea what the terms of licensing X are.
Copilot creates legal risk regardless of whether the licensing terms of X are copyleft or not.
Many permissive licenses (MIT, BSD, etc) still require attribution (identifying who owns the copyright on X), and Copilot screws you out of doing that too (see the sketch below).
(2) Whatever legal power copyleft licenses have, it is ultimately derived from copyright law, and people who take FOSS seriously know that. The point of "copyleft" licenses is to use the power of copyright law to implement "share and share alike" in an enforceable way. When your WiFi router includes info about the GPL code it uses, that's the legal power of copyright at work. The point of copyleft licenses is not to create a free-for-all by "liberating" code.
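To make the attribution point in (1) concrete, here's a sketch. The author and function are invented for illustration, but the quoted notice requirement comes from the actual MIT license text:

// Copyright (c) 2020 Jane Hacker  (hypothetical author)
// MIT License: "The above copyright notice and this permission notice shall
// be included in all copies or substantial portions of the Software."
function clamp(n: number, lo: number, hi: number): number {
  return Math.min(Math.max(n, lo), hi);
}

When a completion engine emits just the function body, the notice and the author's identity are silently dropped -- which is exactly the failure mode being described.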
You can "launder" away the license of any source code you have copied simply by deleting it! No snazzy neural network needed. The litigants' argument is that this is what GitHub CoPilot does: it allows others to publish derivative works of copyrighted works with the license deleted. Given that it is apparently trivial to get CoPilot to spit out nearly verbatim copies of the code it was trained on, I don't think it satisfies the "transformative" requirement of the (American) fair use doctrine.
It seems like the ideal way to proceed is to make the AI output unique and creative. Perhaps that requires AGI because currently the model has no understanding of art.
Maybe more importantly, the AI needs the ability to judge when its output amounts to plagiarism, like humans generally are able to. The AI needs to feel bad about ripping off someone else’s work. ;)
Farmers plant their crops out in the open too. Should Boston Dynamics be allowed to have their robots rob those fields empty and sell the produce without having to at least pay the farmer? They'd be walking and plucking just like any human would be.
Some source code might be published but not open-source licensed. At least some such code has been taken in complete disregard of its license and/or other legal protections, and it's practically impossible to find and properly map out every similar violation for the purposes of a legal response.
This is literally the "you wouldn't steal a car" meme.
To spell it out: No, this analogy does not hold. "Stealing" data does not deprive the owner of anything, so it should not be treated remotely the same as physical stealing (usually not even of potential revenue, as piracy studies show).
While there might not be damages in the literal sense to the owners of the scraped repos, MS is making money from Copilot subscriptions, so what they're doing is closer to selling bootleg copies of a film than giving away pirated copies.
I partially have to concede that point. The analogy could have been better, and I should have put greater emphasis on the massive scale and the lack of recourse for those affected.
Nevertheless, stealing remains illegal so at the very least they have deprived the source code owners of their rights.
> It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft.
Whether this was the original motivation depends on whom you are asking.
You may disagree, but the "Free Software" movement (RMS and the people who agree with him) essentially wants everything to be copyleft. The "Open Source" movement is probably more aligned with your views.
TabNine has absolutely improved my life as a programmer. There's something really rewarding about having a robot read your mind for entire blocks of code.
It's not just functions either, one of the most common things that it helps me with daily is simple stuff like this:
Typing
const x = {
  a: 'one',
  b: 'two',
  ...
}
And later I'll be typing
y = [
  a['one'],
  b['   <-- it auto-completes the rest here
]
It's really amazing the amount of busy-work typing in programming that a smart pattern matching algo could help with.
That autocomplete was sort of ok in tabnine, but Copilot completely blows it out of the water. Resource consumption for Copilot is also much more restrained.
Which reminds me I have to cancel my tabnine subscription. Been paying them for a year without using it.
I haven't tried Copilot personally, but thanks, I'll try it. I did try TabNine over a year ago and found it had improved dramatically since then, so maybe it's gotten better with training.
I don't think this is a good example of the value of these things. You can just as easily do the same thing with advanced text-editor features. Sublime, for example, supports multi-cursor editing: hold alt+shift+arrow keys to add a cursor, then type in the brackets you want. Ctrl+D can be used to select the next occurrence of the current selection with multiple cursors; built-in commands from the command palette can do anything to your current selection (e.g. convert case); etc.
All of that efficiency without having to pay a monthly subscription, waste electricity on some AI model, or worry about the legal/moral implications.
Why? You can copy and paste the entire section, and use multiple cursors to add in the brackets.
going from
a: 'one',
to
a['one'],
just requires you to add two brackets and remove the colon. With multiple cursors you can do that exact same operation for all lines in a few keystrokes.
It's having to go find the other block you want, copy and paste it, and then set up the multiple cursors and type, versus it just happening automatically without any of that.
In my use cases I've long since moved on from writing the original hash. Having it autocomplete saves me from opening a file and tabbing back and forth (or finding, then copying/pasting a block into the other file to temporarily work on it), etc.
But what's lost in my oversimplified example is that the context is usually way more involved. I'm usually passing those as arguments to some function, or in some other unique syntactic situation that no glorified find-and-replace can solve. It's all about the times you would never even bother writing a custom command, because typing is faster given the unique syntactic context... The only thing faster, then, is autocomplete.
I'm not actually recreating a new hash in the same convenient format.
Does anyone have a problem with it, so long as the material it trained on was used with explicit permission/licensing and not potentially in violation of copyright?
That's where the line is for it to be suspect IMO.
This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other people's works at zero cost and use them to build a commercial product without compensation.
And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.
There has to be reasonable context here. Even if it's trained on proprietary code, it rarely inserts that code verbatim in a way that is at all related to how it was originally used.
Obviously licensing needs to be respected, and it shouldn't be hard to solve that problem. But 99.9% of code isn't some unique algorithm; it's gluing libraries together and setting up basic structures.
Most of the examples I've seen don't line up with the reality of code-completion tools. Code is rarely valuable when broken up into its small parts.
Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.
Ok you got me, that wording was lazy on my part. But that's a really bad take on yours:
> It was trained on OSS which is explicitly licensed for free use.
That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.
The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.
Yeah, and I think that's fair re: licensing. Curious to see how it pans out.
Also, re: your edit, not quite. They require you to release modified source under certain conditions, if you make modifications to it. If everybody who used GPL code had to release their code to the world, every company's code would already be public. There's more nuance than that. The GNU site covers a lot of that nuance (https://www.gnu.org/licenses/gpl-faq.en.html#UnreleasedMods)
AGPL is the one that enterprises won't touch with a ten-foot pole, due to its more restrictive terms and the broader conditions under which you'd have to open-source your own code.
Most companies building commercial products on top of FOSS are obeying the license requirements. (I have been through due diligence reviews where we had to demonstrate that, for each library/tool/package.)
The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).
Its being permissively licensed is virtually irrelevant, because only a minority of code is licensed so permissively that you can just do what you like with it. Far more of it is "do what you like within the scope of the license." The GPL, for example: do with it what you like, so long as any derivative work is also GPL.
I guess I'm just afraid that it might not be as good as it is that way.
It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.
In those cases however the output space is so vast that plagiarism is very unlikely.
The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.
I think the prompt "GPT-3, tell me what the lyrics for the song Stan by Eminem is" is very likely to output copyrighted material. The same copyrighted material is, of course, already republished without permission on google.com.
There are literally thousands of years of artwork in the public domain; the idea that the dataset isn't big enough to make good images without copyright infringement and attribution laundering is frankly laughable.
My guess is that it's not so much about the amount of available data as how accessible it is. Scraping the internet seems to be one of the preferred ways of gathering vast amounts of text and images in particular.
Telling apart what is and isn't in the public domain is not a trivially automatable task.
If one relies only on curated libraries of vetted public-domain content, you don't get anywhere near the expected amount of variability and diversity.
I feel like Charlie Gordon from Flowers for Algernon, with and without Copilot.
Literally 10x faster development.
Case in point: had an unexpected project and no time to complete it.
Within an hour Copilot helped me:
* Write a couple of tricky matplotlib plots
* Do some extensive analysis with Pandas
* Write a couple of SQL queries
* Write a Flask back-end and deploy it
* Write a bit of a front-end
* All of this with extra comments, links to documentation, and pretty reasonable style
I have experience with all of the above, but the speed increase was considerable.
This would be a good day's work without Copilot, and there would be less commenting and hackier code.
Before Copilot I would be cursing a lot more while reading various docs...
The key thing Copilot does is reduce the latency of your thought-action-result loop.
Does open source really suffer if fewer people read documentation directly? Would you really be less likely to create an open-source library if you knew someone could now use your library at 10x speed?
The inference ability has crossed the uncanny valley for me so many times.
At times I find myself wondering whether there is a speech-recognition component.
When giving a lecture, I will start saying something and write a prompt at the same time, and the sentence produced by Copilot will be spot-on what I've just said.
Ideally there would be an open-source version of Copilot that respects everyone's wishes. I fear that is impossible.
Why not just train it on your own code, or on an opt-in database of voluntarily contributed code? Why does everyone else have to make your life easier [and generate enormous wealth for a third party, with zero compensation for their work] involuntarily?
I really don't understand how there can be a problem with how Copilot works. Any human works in the same way. A human is trained on lots and lots of copyrighted material. Still, what a human produces in the end is not automatically a derived work of everything the human has seen before.
So, why should an AI be treated differently here? I don't understand the argument for this.
I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such an arbitrary distinction, AI will get restricted more and more, much more than humans are, and that will just arbitrarily limit the usefulness of AI and effectively be a net negative for humanity as a whole.
I think we must really fight against such undertakings, and better educate people on how Copilot actually works, so that no such misunderstanding arises.
I think there's a parallel in surveillance systems. For example, it's perfectly reasonable for a police officer conducting an investigation to follow a suspect as they drive around town. After all, it's happening in public and it's not illegal to watch what someone does in public (caveat being taking it to the level of stalking).
However, is it reasonable to write an AI system that monitors the time and location of all license plates seen around town, puts them into a database, and then that same officer can simply put in the suspect's license plate instead of actually following them around? Maybe, maybe not, that's not my point here. But the creation of that functionality can easily lead to its abuse.
Is this exactly the same case as Copilot? Of course not, these are two wildly different systems. But I think it's an interesting parallel to consider when discussing the point of "it's okay when a human does it" because humans and algorithms operate at two very different levels of scale. The potential for abuse of the latter being far higher and far easier than something a human has to do manually.
I would argue that humans are far from perfect here as well. But that isn't really my argument anyway. I agree with you, this should be improved, and I don't see such a big problem in improving it. I'm sure we will find ways to get this to a human level or better.
I'm mostly talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. It learns from open-source code, just like a human does.
> > So, why should an AI be treated differently here? I don't understand the argument for this.
> Because the AI is not a human and only humans have rights, including the right to learn.
Okay then: Who counts as 'human'? What's the qualifier for being a 'human'?
------
(The following questions all point to the same underlying question.)
Are you human if you have only one leg or 8 fingers due to a genetic deformity? What about albinism or sickle cell disease?
If someone had robotic implants, are they human? Is it inhuman to have an artificial leg? What about both legs?
Same scenario as above, but both arms & legs are replaced. Are they human?
Same as above, but now everything below the torso has been replaced. Same question.
Same question, but now everything below the neck.
If someone were to successfully transplant their brains into a robot body, are they still human?
Someone embeds a neural implant into their brain: Still human?
Same question, now multiple neural implants.
Same question, but now the brain-to-implant ratio is 2:1. Brain mass & neural count hasn't changed since then.
Same question, but with the brain-to-implant ratio now 3:1.
I'm not talking about cases where code is copied, as in your example. I fully agree, this should be fixed. But I don't see such a big problem here. We can do something about this and reduce those cases to a reasonable human-level minimum or below.
I explicitly say human-level because humans would also not be totally immune to this. It can happen that you unintentionally write the same code you have seen somewhere.
It can also even happen that you write the same code just by pure chance.
I'm talking about the statement in general, that all Copilot output is derived work. This is just wrong, as it would be for a human as well.
I'm talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. A human also relies on open-source software (and even private software) to learn, and this is not piracy.
The title of the submitted PDF document: "Microsoft Word - 2022-11-02 Copilot Complaint (near final)"[0]
I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?
This will fail very quickly. The licence that project owners publish with their code on Github applies to third parties who wish to use the code, but does not apply to Github. Authors who publish their code on Github grant Github a licence under the Github Terms: https://docs.github.com/en/site-policy/github-terms/github-t...
Specifically, sections D.4 to D.7 grant Github the right to "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
I don't see that being "quickly" - they'd have to get a judge to agree that passing your code off, without attribution, for other people to use as their own work is a normal service improvement. Given that it's a separate feature with different billing terms, I'm skeptical that it's anywhere near the sure thing you're portraying it as.
It's worth reading the passage in its entirety and how a court would interpret it:
> We need the legal right to do things like host Your Content, publish it, and share it
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
If Copilot is straight-up reproducing work, and it is a service that users have to pay to use, then it seems like Copilot is "sell[ing] your content" and thus the license does not apply.
More generally, a court is likely to look at the plain English summary and judge. Copilot is not an integral part of "the service" as developers understood it before Copilot existed.
"desperate semantic games" is actually a reasonable description of the legal process :-)
I'm not sure I agree that anything expressed in a legal contract using natural language is "unambiguously clear". MS / Github's expensively-attired lawyers will no doubt forcefully argue that they are not selling YOUR content, but a service based on a model generated from a large collection of content, which they have been granted a licence to "parse... into a search index or otherwise analyze... on our servers". There may even be in-court discussion of generalization, which will be exciting.
This is the standard content display license that everyone uses. Even in your quoted text I don't see any hint that snippets can be shown without attribution or the code license.
It also says they can't sell the code, which CoPilot is doing.
Also, in a very high number of cases it isn't the author who uploads.
Repeating your line of argumentation (which occurs in every CoPilot thread) does not make it true.
It's irrelevant whether it's standard or not. Again, the terms in the code licence (including attribution) do not apply to Github, because that is not the licence under which they are using the code. You grant them a separate licence when you start using their service.
If someone who isn't the author has uploaded code which they do not have a right to copy, they are liable, not Github. This is also clear from the Github Terms: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post"
It's almost as if these highly paid lawyers know what they're doing.
You grant them a content display license, not a general code license.
> It's almost as if these highly paid lawyers know what they're doing.
Sure, they wrote the content display license long before CoPilot even existed. Any court will see the intent and not interpret these terms as a code re-licensing.
There is no such thing as a "content display licence" or "general code licence". There is copyright (literally, the right to make copies) which broadly lies with the author, who can then grant other parties a licence to copy their content.
I'm afraid I do not believe your legal expertise is so extensive that you are able to accurately predict the judgement of "any court".
And it explicitly states that it does give them the right to share your code. Copilot isn't selling code; if it were, then GitHub wouldn't let you share the output of Copilot; that would destroy their market. That they allow you to share the output of Copilot with others proves that what they are selling is the service, not the output. The output is, at worst, "shared" code from Github's licensors.
So, it isn't clear to me which of these clauses you are citing grants them the forced right to "Copilot" (which I'm using as a verb to avoid defining what stage of production we are talking about) that wasn't granted by the license of the code. But let's assume for a moment that you are correct: that just means that GitHub as a service makes no sense, right? Like, there are a ton of people using GitHub to develop using code I've published in the past... code which is under various of these example licenses, and which I've never myself (as the copyright holder) published to GitHub (and, in fact, never would, as I despise GitHub). There are also a number of very popular projects--such as the Linux kernel--which people not only upload to GitHub but which have official mirrors on GitHub, where no single party even owns the copyright needed to agree to these terms of service. Meaning, if you are correct, GitHub is often being used illegally, and a ton of the source code they are training against wasn't legally provided to them in the first place.
Section D.3: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post". A lawsuit against Github has no standing for the scenario you suggest, because Github is not at fault.
Ok, so: "that just means that GitHub as a service makes no sense, right?" Like, I feel you simply ignored the core complaint of my comment so you could instead note something about GitHub's potential liability (a thought process I didn't even bring up, though I can see how you decided to bend my final comment into somehow being relevant to it). Like, are you simply conceding, then, that a ridiculous amount of the content on GitHub -- including major projects such as Linux -- is not allowed to be posted to GitHub?
Well, I don't think it's a DMCA issue, but it does very much depend on the licence you have chosen. That's what the licence is for: to allow people to use the code you hold copyright over, and to define what they are / are not allowed to do with it.
This sounds unenforceable in the general case. How could github know whether someone pushes their own code or not? Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
> Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
It depends on the licence.
It's very much enforceable that companies who provide content publishing platforms will indemnify themselves against people publishing content to which they do not have an appropriate licence.
Does everybody credit the author when using Stack Overflow code? I have, but don't always. Not that I'm trying to steal, I just don't take the time, especially in personal projects.
This isn't exactly the same thing, but it seems to me that three of the biggest differences are:
1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)
2. Scale (true; but is it a fundamental difference?)
3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?
I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.
On page 18, they show Copilot produces the following code:
>function isEven(n) {
> return n % 2 === 0;
>}
They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."
Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?
There is no way in hell that isEven is covered by copyright.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
> There is no way in hell that isEven is covered by copyright.
Hey, I said the same thing about APIs, but here we are.
Edit: Actually, the Supreme Court declined to rule on whether APIs are copyrightable, but it did say that if they are, reusing them the way Google reused the Java APIs in Android would fall under fair use. Given that the lower courts did think APIs should be copyrightable, we don't know whether they are anymore.
What's interesting in that case is that I would argue the interface code is MORE important than the implementing code. You could hire any SWE to re-implement the entire API, and it would be pretty straightforward. The interface code is what actually mattered to developers; it took creative expression, design, sequence, and organization to put together, plus feedback and iteration over the years. Google knew this too: taking the interface code was WAY more important for them than doing the reverse.
True, but trademark and copyright are pretty different on a fundamental level both in the purpose behind the law and how its implemented. I suspect if it weren't for the term "intellectual property" tying trademark, copyright and patents together, we wouldn't really think of them in such a unified way since they are all really different from each other.
It's possible the complaint is using a trivial example to illustrate the type of argument plaintiffs want to make during any trial. A 200-line example is too unwieldy for non-programmers to digest, especially given the formatting constraints of a legal brief.
Look at paragraphs 90 and 91 on page 27 of the complaint[1]:
"90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times."
Does distributing licensed code without attribution on a mass scale count as fair use?
If Copilot is inadvertently providing a programmer with copyrighted code, is that programmer and/or their employer responsible for copyright infringement?
There's a lot of interesting legal complications I think the courts will want to adjudicate.
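To put rough numbers on the complaint's "tens of thousands" (my own back-of-envelope with an assumed usage figure, not a number from the filing): if users accept even 1 million Copilot suggestions, GitHub's own 1% figure implies roughly 1,000,000 × 1% = 10,000 suggestions containing >150-character verbatim matches; at 10 million accepted suggestions it would be 100,000. The "most conservative possible criteria" barely matter at that scale.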
They changed it, I'm 100 % sure. The profile picture was Saul from Breaking Bad. I assume they read the comments here and changed it in a matter of one or two minutes.
They determined the other `isEven()` function was cribbed from Eloquent Javascript because of matching comments. I wonder if the complaint just left off telltale comments from that Mastering JS one?
Yeah, the other one I found much more persuasive. The extra comments were unequivocally reproduced from the claimed source. (although that output was from Codex, rather than Copilot).
Yep, same. Not in JS, but in Haskell, for the even-Fibonacci Project Euler problem. Something like a million people have submitted correct answers for that problem; assuming half wrote their own filter rather than importing an isEven library, that's half a million people right there.
That seems like a really bad choice of an example for this, but as I haven't read the document I don't have any other context beyond what you've posted here, I have to take your word for it that that's the purpose of this snippet.
However, if you are looking to understand the reasoning behind this lawsuit, there are lots of better examples online where Copilot blatantly ripped off open source code.
Maybe I'm being too cynical, but this feels like it's more a law firm and individual looking to profit and make their mark in legal history rather than an aggrieved individual looking for justice.
Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.
One of the core principles of the American system of government is that we outsource enforcement to private parties. Instead of the public needing to fund enforcement with tax dollars private parties undertake risky litigation in exchange for the chance of a big payoff.
There is a reasonable argument that's a horrible system. But it doesn't make sense to criticize the plaintiff looking for a profit - the entire system has been set up such that that's what they're supposed to do. If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
> But it doesn't make sense to criticize the plaintiff looking for a profit…
I don't know, man. I can simultaneously see the systemic issue that needs to be solved and also critique someone for succumbing to base impulses like greed when they don't have the need.
But the need is obviously there. Everyone who produces the following code in a non-university environment - for a fee! - needs to be punished quickly and severely:
Based on the given prompt, [Codex] produced the following response:
What they're doing is a service, though. Say that $10 million worth of damage against others has been done. If the law firm does not act, the villainous curs who caused that damage get to keep their money and are incentivized to do it again. If the law firm does act and prevails, then the villains lose their ill-gotten gains (in favor of the law firm and, sometimes, to an extent, the injured parties). That's preferable. Not ideal, but certainly better than nothing.
That implies it’s a service I want, which I have not decided on in this situation. Either way I was more arguing with the other posters claim that it “didn’t make sense” to critique this move, which I think is factually incorrect since I can come up with a few plausible situations where it does make sense
I perhaps wasn’t clear, I meant that I am not sure I want copilot constrained in this way. If I solidify that belief into definitely not wanting copilot constrained, then this would be a negative suit for me
Understood, and there are others who do want it constrained this way, and their right not to be a victim of copyright infringement outweighs a desire to reap the gains of such infringement.
Sadly, you've got it backwards. Class actions are opt-out. You're part of it unless you know about the settlement and contact them to let them know you're opting out of it.
That's entirely fair - and I'm not angry, just not convinced in their arguments, especially when the motive is likely not genuine.
As an aside - I'm almost positive MSFT/Github expected this and their legal teams have been prepping for this moment. Copyright Law and Fair Use in the US is so nuanced and vague that anything created involving prior art by big-pocket individuals or corporations will be litigated swiftly.
I expected one of these lawsuits to come first from Getty or one of the big money artist estates against OpenAI or Stability.ai, but Getty and OpenAI seem to be partnering instead of litigating.
Are you against policing? Because that's government enforcement. Admittedly policing in the US is god awful, but I still think most people would rather have it than no police force at all.
Government enforcement of this kind of law is really no different. It wouldn't be the legislature doing it.
> If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
No, there are plenty of other changes you might want to see.
For example, in the American system, judges are generally not allowed to be aware of anything not mentioned by a party to the case. There is no good reason for this.
This is a classic example of the ad hominem fallacy. Stating that "they are no angels" doesn't detract from whether they're right or capable of effecting positive legal change.
Frankly, I don't care if anyone makes a name for themselves for doing this. In fact, I applaud them and would happily give them recognition should they be successful.
Similarly, I'd hope that there are opportunities for profit in this space, given that I don't want cheap lawyers botching this case and setting terrible legal precedent for the rest of us. Microsoft has a billion-dollar legal team, and they will do everything they can to protect their bottom line.
Just like Google's noble but misguided attempt to make all the world's books searchable a few years back, what we have here is IP law getting in the way of a societal good.
Copyright and patent are not natural; they’re granted by law “to promote progress in the useful arts”. At first glance here it appears that GitHub is promoting progress and the plaintiffs are just rent-seeking.
It can be and is both what you describe and a necessary feature of our adversarial legal system.
Github can't really go to a court by themselves and ask "is this legal?". There is the concept of declaratory relief, but you need to be at least threatened with a lawsuit before that's on the table.
So Github kinda just has to try releasing CoPilot and get sued to find out. The legal system is set up to reward the lawyer who will go to bat against them to find out whether it is legal. The plaintiff (and maybe the lawyer, depending on how the case is financed) takes the risk of being wrong, just as Github did.
It is set up this way to incentivize lawyers to protect everyone's rights.
But who cares? Who else is willing to fund litigation on this important legal question? The real justice here is declarative and benefits everyone.
No matter who litigates and for what reasons it will be extremely valuable for good precedents to be set around the question of things like Copilot and DALL-E with respect to copyright and ownership. I'd rather have self interested lawyers dedicated to winning their case than self interested corporations fighting this out.
I brought a class action suit against Sharp and I was the class representative. They settled. The judge awarded me a whopping $1,000 from the settlement money. From the time I put into it, including 3 or 4 full days in NYC because my deposition coincided with a snowstorm, I didn’t exactly come out ahead financially.
Obviously this is different for the reasons you stated, but I didn’t want people to think bringing a class action lawsuit forward is a way to get rich. It’s a bit of a joke, really.
> rather than an aggrieved individual looking for justice.
How can an aggrieved individual seek justice from a big multinational corporation? It's not possible, unless that individual is a retired billionaire willing to become a millionaire.
I have a friend from high school who does class-action lawsuits. He spends a very large amount of money funding his suits, on things like expert witnesses among other things; only 1 in 5 pays off, so each one has to pay off well. His model is similar to venture capital. Most of these cases take 5-7 years to execute, so he basically takes out loans from another lawyer to fund them. His average pay for the last 10 years has been around $140k/year. Some years he makes nothing and pays out a lot; others he makes several million and pays back all the loans. Another way to think of it is like the payouts to tax-fraud whistleblowers.
Yes, he does think of it somewhat like that: establishing himself in an area. However, a lot of his work comes from him finding people aggrieved by something, not from them finding him.
How come? When people contributed code publicly, they attached a license stating how the code may be used. Is training an AI model on that code allowed? I think there's a fair, important, and novel legal question to be examined here.
Patent trolls usually file lawsuits that are just unmerited, but rely simply on the fact that mounting a defence is more expensive than settling.
It is a little different. The first patent troll that blazed the trail gets both more credit (for ingenuity) and blame (for the deleterious impact) in my opinion. I'll give the same internet points to these guys.
If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?
Also, anyone, whether a person or a machine, is welcome to learn from the code I wrote; in fact, that's how I learned to code, so why would I stop others from doing the same?
Judging by the majority opinion in this thread, it seems pretty clear GitHub could have asked and gotten enough people to opt-in to have no problem training their model. They probably would have been thrilled to do it and proud of being included in the training data.
But the preference of the majority does not override the conditions placed by people who prefer not to participate.
No human perfectly reproduces the learning material they used.
If that were true, one might as well just hire engineers from Twitter and make a new platform from the code they remember!
Well, we humans do it occasionally. You probably remember a few specific code snippets in your language of choice because they kept annoying you / you love them / you wrote them a lot. So if I put you in exactly the right situation, you would indeed reproduce code verbatim.
So does Copilot.
I am not trying to insinuate that Copilot works like a human, but it is literally the same situation.
I suspect this will be the first of many lawsuits over training data sets. Just because it is obscured by artificial neural networks doesn't mean it's an original work that is not subject to copyright restrictions.
I don't know why we're treating it as anything less than a human brain. A human can replicate a painting or a picture of Mickey Mouse from memory, and that would likely be copyright infringement; but they could also take a drawing of Mickey Mouse sitting on the beach, give him a bloody knife & some sunglasses, and it'd likely be fair use of the original art.
The AI can copy things if it wants, but it can also modify things to the point of being fair use, and it can even create new works drawing so little from any particular work that it's effectively creativity on the same level as humans drawing something that popped into their heads.
I'm kinda sceptical that this goes anywhere, given that they basically say it's your responsibility to vet whatever Copilot outputs for copyright problems (obviously that goes against the product's promise and the PR, but it's the small print that gets them out of trouble).
Saying "it's your responsibility to not breach licenses or violate copyright" doesn't absolve your service from breaching licenses and violating copyright itself.
Yet we all use web browsers that copy copyrighted text from buffer to buffer all the time. This doesn't even include all of the copying that ISPs perform.
It might be fair to say that the read performed in training has the same character since no human is involved.
The real copyright violation would be using a derived work.
A browser isn't an amalgamation of billions of pieces of other works. A browser executes and renders code it's served.
Copilot's corpus is quite literally tomes of copyrighted work, encoded and compressed into its neural network, from which it launders that work to create similar works. Copilot itself, the neural network, is that corpus of encoded and compressed information; you can't separate the two. Copilot stores and distributes that work without any input from rightsholders, and it does it for profit.
A better analogy would be between a browser and a file server filled with copyrighted movies whose operator charges $10/mo for access. The browser is just a browser in this analogy, where the file server is the corpus that forms Copilot itself.
The actual copying isn't the problem; it's distribution. If I buy access to a PDF, I'm not going to get in trouble for duplicating the file unless I send it to someone else.
When someone uploads their copyrighted text to a web page, they are distributing it to whoever visits that page. The browser is just the medium.
You could argue that it’s the individual projects using copilot that are violating here, I guess? Like you can use curl or git to dump some AGPL code into your commercial closed software but no one would (hopefully) blame those tools.
So copilot is fine but anyone using it must abide by the collective set of licenses that it used to write code for you…?
If a license requires attribution, and you reproduce the code without attribution using your editor plugin, it seems to me the infringement is on the editor plugin.
Note that even licenses like MIT ostensibly require attribution.
This is a bad analogy, because P2P networks exist that are legal to operate: services generally aren't held responsible for user-generated content (via the DMCA safe harbor for copyright claims, and Section 230 of the CDA for most non-IP claims).
What made Napster illegal is that the company did not create their network for fair use of content, but to explicitly violate copyright for profit.
Copilot is like Napster in this case, in that both services launder copyrighted data and distribute it to users for profit.
Copilot is not like other P2P networks that exist to share data that is either free to distribute or can be used under the fair use doctrine. Copilot explicitly takes copyrighted content and distributes it to users in violation of licenses, that's its explicit purpose.
It's entirely possible to make a Copilot-like product that was trained on data that doesn't have restrictive licensing in the same way it's entirely possible to create a P2P network for sharing files that you have the right to share legally.
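For what it's worth, the corpus-filtering step is not exotic engineering. Here's a minimal sketch of what "train only on clearly licensed code" could look like, assuming each repo carries an SPDX license identifier; the record shape and allow-list are hypothetical, not anything GitHub has described:

// Hypothetical corpus record; a real pipeline would also need per-file
// license detection, since repositories often mix licenses.
interface Repo {
  name: string;
  spdxLicenseId: string | null; // null = no license file found
}

// Hypothetical allow-list. Note that even MIT/BSD require attribution,
// which a completion engine can't easily provide; a maximally cautious
// operator might keep only public-domain dedications like CC0-1.0.
const ALLOWED = new Set(["MIT", "BSD-3-Clause", "Apache-2.0", "CC0-1.0", "Unlicense"]);

function selectTrainingRepos(repos: Repo[]): Repo[] {
  // "No license" means all rights reserved, so null is excluded too.
  return repos.filter((r) => r.spdxLicenseId !== null && ALLOWED.has(r.spdxLicenseId));
}

The hard part is the legal judgment (which licenses are compatible with un-attributed completions), not the filtering itself.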
So if you produce napster 2.0 to be the best music piracy tool, and you test it for piracy, and you promote it for piracy... you're going to have trouble.
If you produce napster 2.0 as a general purpose file sharing system, let's call it a torrent client, and you can claim no ill intent... you may have trouble but it's a lot more defensible in court.
I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code. No judgment on whether the class action has any merit, just saying I would be very surprised if discovery turns up lots of emails where Github execs are saying "this is great, it'll let people steal code."
> I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code.
Almost everything on GitHub is subject to copyright, except for some very old works (maybe something written by Ada Lovelace?), and US government works not eligible for copyright.
Now, many of the works there are also licensed under permissive licenses, but that is only a defense to copyright infringement if the terms of those licenses are being adequately fulfilled.
> Almost everything on GitHub is subject to copyright,
Agreed. Like I said, it's about intent. Can anyone say with a straight face that copilot is an elaborate scheme to profit by duplicating copyrighted work?
I don't think the defense is that it wasn't trained on copyrighted data. It obviously was.
I think the defense is that anything, including a person, that learns from a large corpus of copyrighted data will sometimes produce verbatim snippets that reflect their training data.
So when it comes to copyright infringement, are we moving the goalposts to where merely learning from copyrighted material is already infringement? I'm not sure I want to go there.
I think you're looking for consistency that the legal system just doesn't provide. The music industry is more organised and litigious than the software industry and that gives them power that you and I don't have. If you called it "Napster 2.0" specifically you'd probably be prevented from shipping by a preliminary injunction. Is that fair or consistent? No. But it's the world we live in. Programmers want laws to be irrefutable and executable logic but they just aren't.
Now, IANAL, but iirc, that is all 100% okay and legal. In fact, I can even download copyrighted music and movies without issue. So, I don't even need to make sure I don't download anything under copyright.
The issue isn't downloading copyrighted stuff.
Rather, it's making available and letting others download it. That was where you got in trouble.
Knowingly downloading copyrighted material, say to get it for free, still violates the rights of the copyright holders. It's just that litigating against members of the public is bad PR and not exactly lucrative, especially when it's likely that kids downloaded the content.
People used to get busted from buying bootleg VHS and DVDs on the street before P2P filesharing was a common thing. Then, early on, people were sued for downloading copyrighted files before rightsholders decided to take a different legal strategy to go after sharers and bootleggers.
Well, if the trackers also hosted mixed-up blocks of data for all the torrents they tracked and their protection was "LOL make sure you don't accidentally download any of these tiny data blocks in the correct order to reconstruct the copyrighted material they may be parts of wink"
If Microsoft is so confident in the legality and ethics of Copilot, and that it doesn't leak or steal proprietary IP... they should go train it on the MS Word and Windows and Excel source trees.
Did they make a statement that they did not want to do that?
Because if not, I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red-tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GitHub dwarfs the MS code base by quite a lot).
I can't believe I am actually defending fucking Microsoft, but I just want to say there isn't a conspiracy everywhere...
I have no doubt they will -- but the specific models will be used for Microsoft engineers. There will be a Copilot for Enterprise that trains on customers' private code.
Wow, this is an interesting iteration in the ongoing divide between "East Coast code" and "West Coast code" as defined by Larry Lessig. For background, see https://lwn.net/Articles/588055/
I am not against this lawsuit but I'm against the implications of this because it can lead to disastrous laws.
A programmer can read code that is available but not OSS-licensed and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?
Today they're filing a lawsuit against Copilot.
Tomorrow it will be against Stable Diffusion (or DALL-E, GPT-3, whatever).
And then eventually against Wine/Proton and emulators (are APIs copyrightable?).
Well, they're a special case here, since they don't solve a specific problem or build a program per se; they (re)build a program to match existing specs. Their explicit goal is to match the behaviour of another piece of software with a translation layer.
Forbidding contributions from people who have seen the "source" program is most likely meant to keep their version from going beyond "matching the behaviour" to literally sharing the same code. It may also be intended as a safeguard so that well-intentioned developers don't accidentally break their own (most likely existing) NDAs.
That was out of abundance of caution, not based on any legal precedent.
In fact, the little precedent that exists over learning from copyrightable code is in favor of it.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
It demonstrates that copyright stifles copying. That may make it easier for the copier to innovate, but it doesn't dispute the main argument for having copyright protection: that, without it, the code wouldn't have been written.
Most of it? I would think more than 50% of open source code writers find it necessary to restrict the rights to copy and use their code. In a world without copyright protections, would the GPL even be enforceable?
(and I guess courts might, in the future, say the GPL expires when copyrights on the code expire)
Sure, but given the timetable for changing the law, it still seems pretty reasonable to apply the same standard to Microsoft (and by extension Github) in the meantime
I don't quite agree. Microsoft took a conservative approach to copyright to protect their own business.
Meanwhile, open source software has had an immeasurable benefit to society. My computer, TV, phone, light bulb, etc. all benefit from OSS, running under various licenses, only a subset of which are copyleft-like.
The fact that the laws are inconsistent and expensive to defend against leads companies like Microsoft to take this conservative approach that slows down progress.
Copyright laws aren't preventing you from learning cinematography by watching said Disney movies though, and using all their techniques for your own project.
OpenAI did a dirty job though, judging by the cases of the model reproducing code right down to the comments, so I can understand why one would criticize this specific project.
That sucks for little snippets of software though, doesn’t it? It’s like copyrighting individual dance moves (not allowed under the current system) and forcing dancers to never watch each other to make sure they’re never stealing.
I mean, it's not like the copyrights are keeping you from doing things. It's stopping you from looking at someone else's source. And it's not like source is easy to accidentally see like dance moves are.
Way, way back in 1992, Unix Systems Laboratories sued BSDI for copyright infringement. Among other things, they claimed that since the BSD folks had seen the Unix source code, they were "mentally contaminated" and their code would be a copyright violation. This led to the BSD folks wearing "mentally contaminated" buttons for a while.
GitHub Copilot has been proven to use code without license attribution. This doesn't need to be as controversial as it is today.
If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.
In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language, 2nd Edition. What comes out? The code you copied and pasted, plus attribution.
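For illustration, here's a minimal sketch of what an attribution-stamping wrapper around a completion could look like, in the spirit of that Apple Books behavior. Everything here is hypothetical: neither Copilot nor Apple Books exposes an API like this, and the field names are invented.
// Hypothetical sketch: stamp a completion with license attribution before it
// reaches the editor. The Suggestion shape is assumed, not a real API.
interface Suggestion {
  code: string;
  sourceRepo?: string; // e.g. "github.com/someuser/somerepo", when a close match is known
  license?: string;    // e.g. "MIT", "GPL-2.0"
}
function stampAttribution(s: Suggestion): string {
  // Without provenance metadata the code passes through unlabeled,
  // which is exactly the gap people are complaining about.
  if (!s.sourceRepo || !s.license) {
    return s.code;
  }
  return `// ${s.license}-licensed; adapted from ${s.sourceRepo}\n${s.code}`;
}
The stamping itself is trivial; the hard part is that the model would have to track provenance in the first place.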
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it weren't, copyright would be meaningless.
The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.
As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.
APIs are not copyrightable (in the US), so Wine is safe (in the US).
In 2004, Google added copyrighted books to its Google Books search engine, which searches across millions of book texts and shows full-page results without any author's authorization. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it is actually not.
In 2005, the Authors Guild sued over this pretty straightforward copyright violation.
Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years plus 2 years of appeal), during which, well, tech continued its little stroll. Ten years is a lot in the web world; it is even more in ML.
The judgement decided Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people supporting Google, calling it an invaluable tool to find books, one that actually caused many sales to increase; the harm the laws were trying to prevent was therefore not happening, while a lot of good came from it.
Now the second important part of the story: MOST OF THESE USEFUL USES APPEARED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the tribunal is too slow to stop you but competent enough to understand the bigger picture.
Rants aside, I doubt training-data use will be considered copyright infringement if the courts keep a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
You can learn from it, but if you start copying snippets or basing your code on it to such an extent that it's clear your work is derived from it, things start to get risky.
For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.
It might or might not be depending on the situation. Some of it might come down to intent.
Like, if the drawing was meant to be an artistic rendering with independent artistic value, it's much more likely to be fair use. If the drawing was meant as a loophole to avoid paying the licensing fee on the original, it's much less likely. Fair use has a bunch of criteria; a lot of it depends on intention and how the usage would affect the original copyright holder.
I would add that fair use lets you use a copyrighted work; it doesn't make the copyright go away. It just adds some cases where you can use the work notwithstanding the original copyright, which remains in force.
Note: IANAL, this all could be wrong. I don't have any cases, but I do know that people propose this sort of thing at Wikipedia from time to time, i.e. hiring someone to draw copyrighted photos, and it usually gets shot down as not solving the problem, although I'm not familiar with the legal basis.
> If a machine does it, is it wrong? What is the line between copying and machine learning?
What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?
Wine/Proton are safe because there is controlling 9th and SCOTUS precedent in favor of reimplementation of APIs.
The reason those precedents wouldn't apply to Copilot is that it isn't separating out APIs from implementation and implementing only what it needs for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hope of creating something new. The hope of the AI community is that the fair use argument lands more like Authors Guild v. Google than Sony v. Connectix.
Slippery slope? Are you familiar with judicial precedent? Being bound to precedents is central to common law legal systems, so I don't think the GP's take was so outlandish. "Slippery slopes" and "whataboutism" might be thought-terminating buzzwords online, but not in front of a judge.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
No, it isn't, at least not automatically, which is why license infringement exists at all; the fact that you have a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and that should equally be the case for the operator of a machine.
It's also why the concept of a clean room implementation exists at all.
I think the commenter you replied to was talking about using the functional, non-copyrightable elements of the copyrighted code. Clean-room is not even required by case law. There's precedent that explicitly calls it out as inefficient.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.
If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.
If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).
There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.
Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.
Maybe it's time for the Creative Commons licenses to address this. I'm curious whether No-Derivatives would already prohibit it. Does the ND language need tweaking, or do they need a whole new clause?
Not for GitHub: users who upload their code accept GitHub's license agreements, which allow GitHub to use the code in many different ways, including Copilot. Kind of like how, when you create a Robinhood account, you agree to arbitration and can't sue them.
It would be good to have a definitive and simple line for fair use that could be applied to all forms of copyright. Right now fair use is defined by four guidelines:
1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. The nature of the copyrighted work;
3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole;
4. The effect of the use upon the potential market for or value of the copyrighted work.
A programmer who studied in school and learned to code did so clearly for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works used is likely to be so minor as to go unrecognized, and the effect upon the potential market when students learn from existing works would be very hard to measure (if it could be detected at all).
When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model has supposedly learned in order to explain how a given output is generated. Machines make no distinction as to the amount and substantiality of the original works, with no ability to argue that they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.
APIs are generally covered by the interoperability exception; I am unsure how that relates to Copilot or DALL-E (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A Copilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.
Agreed. But it could go the other way as well. Let's say MS/GH wins and the decision establishes an even less healthy / profitable (?) outcome over the long term.
Remember when Napster was all the rage? And then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? That made music into the razor and the iPod into the much more profitable blades. Sure, it pushed back Napster, but artists, as the creators of the goods, have yet to recover.
I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.
This is why we can't have nice things. Copilot is the best thing to happen to developer tools in a long time; it has increased my productivity a lot. Please don't ruin it.
How would you "permit copilot learning on it"? Say, what if you could upload that code to a certain website and grant the website owner the necessary license to share your work with others (via copilot)? It sounds like that would work!
I really feel that Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith[0] is going to have a big effect on this type of thing. They are basically relying on their AI magic to make it transformative. I'm starting to think the era of learning from material other people own without a license / permission is going to end quickly.
Is it not in the agency of the developer to hit the save button?
It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.
Does it really matter that much "how" the text in your text editor gets there? You write it yourself, or copy/paste it, or have Copilot generate it. Ultimately the individual who "approved" saving it to disk is the one violating the copyright; Copilot is just making a "suggestion".
I think if this is successful it will be very bad for the open world.
Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.
Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit (perhaps, at the current state of development, it must be in order to work at all, because it's failing to generalize enough). But I'm doubtful that this litigation could distinguish the outright copying from training that doesn't substantially infringe any copyright-protected right (e.g. where the AI learns the "ideas" rather than verbatim reproducing their exact expressions).
The same goes for many other initiatives around AI training material, e.g. people not wanting their own pictures used to train facial recognition. Litigating won't be able to stop it, but it will hand the few largest quasi-monopolists like Facebook, Google, and Microsoft a near-monopoly over new AI tools, since they're the only ones that can overcome the defaults set by legislation or litigation.
It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.
GitHub already has this in its TOS; that is the irony of the lawsuit. It is actually in GitHub's favor if this happens: GitHub could then jack up the price 10x as the sole provider.
… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.
It seems like we should come to an agreement on what these licenses are intended for, given that they were created at a time before AI like this existed. If the authors did not intend their code to be used like this, should we not respect that?
Also, does it make sense to create new licenses which explicitly state whether using it for AI training is acceptable or not - or are our current licenses good enough?
The most important part of this is not whether the lawsuit will be won or lost by one of the parties, but what is the legality of fair use in machine learning, and language models. There's a good chance that it gets to Supreme Court and there will be a defining precedent to be used by future entrepreneurs about what's possible and what's not.
It seems obvious that AI models are derivative works of the works they are trained on but it also seems obvious that it is totally legally untested whether they are derivative works in the formal legal sense of copyright law. So it should be a good case assuming we have wise and enlightened judges who understand all nuances and can guide us into the future.
You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
——
I don’t see how one could argue that training on GPL code is not “based on” GPL code.
A developer is a person. Copilot is software based on the GPL code. Just because you use the word "learn" does not make what a human does and what software does the same thing.
If github or google indexes source code using a neural net to help you find it, given a query, is that also illegal? If you think of copilot as something that helps you find code you’re looking for, is it all that different, and if so, why?
In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?
The crux of the issue: is the code that is being generated being used in a way that its license allows? That's it. I'm confident that this problem would go away if Copilot said:
//below output code is MIT licensed (source: github/repo/blah)
And yes, the "users" are responsible, but it's possible that Copilot could be implicated in a case depending on how access to it is licensed.
Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.
For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
If you had legal expertise and a strong opinion on the matter I suppose you could write a persuasive brief for the consideration of the court. If you have a strong opinion but aren't a legal eagle you could write to your legislators in support of legislation explicitly supporting this use case or organize the support of people more capable in that arena.
If you are opinionated but lazy (no judgement here, as I sit watching TV), you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.
Notably, if your code is derivative of other works, you have no power to grant permission for such use of code you don't own, so best include some weasel words to that effect. Say:
I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.
Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.
You could even make a GPL-esque modification, if you were so inclined, where you said: SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL.
Note that not only am I not your lawyer, I am not a lawyer of any sort, so if you think you'll end up in court, best buy the time of an actual lawyer instead of a smart ass from the internet.
Copilot is great, and ignorance is bliss, isn't it?
The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.
You are now in legal trouble, and Copilot put you there, because it never warned you that X is part of Y, and that Y comes with such-and-such licensing terms.
Whether we like Copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make them law.
I hope MS used a lot of AGPL code to train Copilot… This would be fun.
But no matter how this goes: if training AI on copyrighted inputs is "fair use", it'll end up as the ultimate "copyright laundering machine", like this "joke" project here:
1. Running and training these models is eventually going to be perfectly plausible on a home machine.
2. It's only a matter of time before a model, e.g. a popular one trained on all of the code scraped from GitHub, becomes a publicly available torrent.
3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.
4. You'll never know if somebody has lifted their code in violation of a license, any more than you would be able to tell if somebody used code from Stack Overflow without attribution in a commercial endeavor.
Just because some people get away with copyright infringement doesn't mean that copyright infringement is now legal.
I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.
Crackpot theory: Copilot (and by association many ML tools) is a form of probabilistic encryption. Once encoded, it's virtually impossible to pull the code (plaintext) directly out of the raw ML model (the ciphertext), yet when the proper key is input ('//sparse matrix transpose'), you get the relevant segment of the original function (the plaintext) back.
We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.
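To caricature that theory in code: an overfit model behaves, from the outside, like a lookup table keyed by prompts. A deliberately silly sketch (the prompt and the stored snippet are stand-ins for illustration, not actual Copilot behavior):
// Toy caricature of the "key -> plaintext" view of a memorizing model.
// A real model stores weights rather than a table, but on memorized
// training prompts an overfit model can act indistinguishably from one.
const memorized = new Map<string, string>([
  ["//sparse matrix transpose", "function transpose(m) { /* verbatim training snippet would live here */ }"],
]);
function complete(prompt: string): string | undefined {
  // The "decryption key" is just the right prompt.
  return memorized.get(prompt);
}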
The part of GitHub Copilot to which I object is that it's trained on private repos. Where does GitHub get off consuming explicitly private intellectual property for their own purposes?
If GitHub ends up having to tweak their product to avoid ethical/legal concerns, I actually imagine it could still be pretty cool. Right now Copilot is a black box that spits out code with no attribution; what if they instead worked on making it a glass box, one that always brings up snippets of other projects along with their licensing info so that you can decide how to incorporate the ideas fairly yourself? Or they could still output the same code suggestions, but always include attribution and license data along with them. Making the product more transparent would probably make more people comfortable with using it, anyway.
I wonder if the plaintiffs' code would stand up to scrutiny of whether any of it was copied, even unintentionally, from other code they saw in their years of learning to program? I know that I have more-or-less transcribed from Stack Overflow/etc, and I have a strong suspicion that I have probably produced code identical to snippets I've seen in the past.
I think some of the negativity about Copilot may come from the perception that if an individual or small startup trained an ML model on public source code and commercialised a service from it, they would be drowning in legal issues from big companies unhappy with their code being used in such a product.
In addition, just because code is publicly available on GitHub does not necessarily mean it is permissively licensed for use elsewhere, even with attribution. Copyright holders unhappy with their copyrighted works being publicly accessible can use the DMCA to issue take-downs, which GitHub does comply with, but how that interacts with Copilot and any of its training data is a different question.
As much as the DMCA is bad law, it is rather funny seeing Microsoft charged in this lawsuit under the lesser-known provision against "removal of copyright management information". Microsoft has more resources to mount a defence, so it will probably end up differently compared to a smaller player facing this action.
Consider each repo on GitHub to be a movie. What Copilot does is search for sequences of frames from any movie which line up to create a new coherent movie.
Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?
That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?
The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?
I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.
>By training their AI systems on public GitHub repositories (though based on their public statements, possibly much more) we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub.
I don't know about US copyright law, so I can't comment on the legal documents, but this website is not complaining that Copilot reproduces copyrighted content; it is complaining that Copilot was trained on copyrighted content. I don't see how you can forbid someone or something from reading and learning from something that is public (once again, reproducing it is another problem).
How much code from an existing code base is necessary to be considered copyright infringement?
For example, let's say I take a single frame of animation from a cartoon. The frame contains a mountain, a house, and a couple of characters, although those characters are not integral to the actual cartoon; maybe they're extras (villagers, not named characters like Mickey Mouse, for example).
I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman, but I trace one of his arms from a villager in that previous frame of animation... One: am I in danger of copyright infringement (have I hit some arbitrary threshold)? And two: am I causing monetary losses for the cartoon?
Merits of the case aside, I'm befuddled that a company with a legal team like Microsoft's approved this product. Is their assumption that it will bring in more revenue than defending it in court will cost? The math doesn't make sense to me.
If you'd ever read even a single one of the licenses to the software I'm sure you use everyday, you'd understand.
This is such an obvious and pathetic strawman.
I often notice on Hacker News that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.
You read a lot into my not so serious comment. Maybe internet comment sections aren't the right place for you.
But I'll bite: I know licensing, thank you. But what's copyrightable is not so easy. Licenses are not so easy. Copilot does not copy entire works, and it's very questionable whether a few lines of code are "piracy". It's a discussion that repeats again and again; there's nothing novel about it except the fact that a machine learns (and overfits for small portions of code). So please get off your high horse. I don't care for your fundamentalism.
If you know this area of IP law, then you know that LOTS of licenses, copyleft or not, require attribution (which Copilot never provides, and by its construction can't), and you know that what's problematic is when the model output is arguably not "fair use". Examples of that abound.
You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
Yes, you're right. But my point also was that it's not so easy when it's just a few lines, isn't it? Especially since this is an international issue, and what counts as a copyrightable work is not easily definable.
> You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
I agree with that! I also see this as the only proper takeaway. The rest is making money off this thing. But the US has a different lawsuit culture anyway, which I find weird.
Yeah I guess so. This website reads like bullshit bingo from some weird twitter dude trying to sell you his newest product:
"AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many."
Blah blah. Can we get back to the hacking on stuff mentality?
Hah, funny. I've used Pollen before and think I had contact with him a few years ago! The blah blah about AI elevating the world is still BS, IMHO. I still disagree with his views (https://matthewbutterick.com/chron/this-copilot-is-stupid-an...) and this lawsuit.
I wasn't actually talking about him specifically, btw, when saying "this sounds like a crypto bro from Twitter". The overly enthusiastic AI talk reminded me of that; that's what I wanted to say.
It doesn't make sense. If I make a piece of software that curls a random gist and then puts it into your editor, am I infringing, or are you infringing when you run it, or are you infringing when you use that file and distribute it somewhere?
> If I make a piece of software that curls a random gist and then puts it into your editor am I infringing
Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:
Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
> are you infringing when you run it
Depends on the license
> are you infringing when you use that file and distribute it somewhere
Depends on the license
----
When copilot gives you code without the license, you can't even know!
The law will consider "intent". By your logic, web browsers are infringing.
Can you use curl to infringe on copyright? Yes. Is every time you use curl copyright infringement? No. Can you in theory tell when you are infringing with curl? Yes.
Can you use copilot to infringe? Yes. Is every time you use copilot copyright infringement? No. Can you in theory tell when you are infringing with copilot? *No*
I'm not a lawyer, but if you provide a platform that enables infringement that's different than if you provide a tool that could enable infringement.
Popcorn time vs. bittorrent.
And you are right, the EULA could say "it's up to the end user to confirm you can use this code". But then how do you verify? That slows down "productivity", where Copilot promises to "speed up" productivity.
Yep, makes sense! I guess we'll see what arguments the court finds convincing. I, for one, hope Copilot stays, but if we can delay its destruction long enough I think we'll get open-source models that will give us this for free. And then the cat will be out of the bag.
This issue seems to have an obvious solution that I fail to see anyone mention:
Treat Copilot simply as a tool; let it be trained on whatever, without any consent requirements. However, the outputs should be subject to copyright, as with any other code produced by a human. Then, on a case-by-case basis, courts can decide whether infringement has occurred. The idea of banning Copilot or other AI models as a whole just seems like a collective case of sour grapes, because innovation and automation are finally threatening people who expected these things to affect only the working class.
I think it's a great time to explain why this won't hit AI art such as Stable Diffusion, even if GitHub loses this case.
The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.
AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker but an artificial brain with a paintbrush and a virtual hand.
Eh... I don't know. It sounds to me like you are saying that because the code model outputs exact lines, it's a copyright violation, but that the image AIs necessarily don't output exact copies of even portions of pre-existing images, because that's not how they work.
But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.
If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still a copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy (subject to fair use defenses: I'm allowed to do what would otherwise be copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody it's likely to be fair use. There is probably a legal argument that Copilot is fair use... the more money GitHub makes on it, the harder that argument gets, though making money off something is not relevant to whether it's a copyright violation in the first place, only to the fair use defense).
(Yes, it might also be a trademark infringement; but there's a reason Disney is so concerned with the copyright on Mickey expiring, and it's not that they think there's lots of money to be made selling copies of the specific Steamboat Willie movie...)
> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.
For those curious about the standards for fair use in US copyright law, and how they have been considered in previous cases, here's one legal overview:
This AI re-mixing stuff is so new, I think few legal observers would say they could definitely predict what the courts will do with it. Nobody really knew how the Google Books case, for instance, was going to go until it went.
I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and the limited insight even the people who build these models have into how some of them work.
IMO, the case is exactly the same for copilot and generative models for images. That's why it's so important to have some precedent as a guide for future products.
If a software developer learns how to code better by reading GPL software and then later uses the skills they developed to build closed source for profit software should they be sued?
If a software developer writes a program to remember a million lines of GPL code, and then uses that dataset to "generate" some of that code, then they are essentially violating that license with extra steps.
The extra steps aren't enough to exonerate them. It's just a convoluted copy operation.
It's just like how a lossy encoding of a song is still, with respect to copyright, a copy of that song. The data is totally different, and some of the original is missing, yet it's still a derivative work. So is a remix. So is a re-performance.
Whether or not it is legally wrong to scan OSS code (I think it is wrong), there has been a time-honored precedent for disallowing automated scanning:
robots.txt
This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow".
The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance, but to consider the OSS community that they are so heavily using.
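For what it's worth, a repo-level equivalent might look something like the sketch below. The file name and the crawler name are entirely made up, since no such standard exists; the directive syntax just borrows the robots.txt convention:
# /ai-training.txt (hypothetical file, served from the repository root)
# Default: disallow everything, matching the "no file means disallow" proposal.
User-agent: *
Disallow: /
# Opt a single, named crawler into one directory only.
User-agent: example-code-crawler
Allow: /examples/
Disallow: /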
Forgive my ignorance, but who is going to benefit from this lawsuit? I have a lot of code on GitHub, can I, for instance, expect a check in the mail in case of a win?
(Not a lawyer, so this is really definitely absolutely not legal advice and if you're looking to profit you should speak to a lawyer... for instance the lawyers who just filed the lawsuit)
They're asking for two things, injunctive relief (ordering github/openai/microsoft to stop doing this) and damages.
I suppose the injunctive relief really benefits anyone who doesn't want AI models to exist, because that's what it's asking for.
The damages will go to the members of the class certified for damages, with more going to the lead plaintiffs (those actually involved in the suit) and some going to the lawyers. They're asking for the following class definition for damages:
> All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time during the Class Period.
No worries, I put that disclaimer in because it's illegal for me to give legal advice and because I want to discourage people from thinking random internet comments are good at properly stating the law, not because I'm that worried that someone will actually decide to stake a bunch of money based on analysis in my comments ;)
I think the software is probably OK provided that the sources are credited (i.e., if Copilot copies code from, say, SDL, then the relevant code sections need to be correctly attributed and the mandatory license README copied into the project, so that all code follows the open source licenses used). That's literally the purpose of open source licenses. If Copilot can't be bothered to do that, then yeah, it should be shut down.
Made a throwaway since I guess this stance is controversial. I could not care less about how copilot was made and what kind of code it outputs. It's useful and was inevitable.
I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.
Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.
This one has the feel of being nothing more than tilting at windmills in the long run.
I will never understand why people push code to public repos and then complain when someone or something uses that code. Code that you want to keep private or make money off of should be private. Only publish stuff to the public that you want other people to see and learn from. All the complaints about attribution… who cares.
> All the complaints about attribution… who cares.
I may not care if some guy I've never met uses my niche library without attribution. (I do care, really.) But Microsoft certainly cares if you use their code without attribution, so why shouldn't I take the same belligerent, copyright-enforcing attitude towards them? That's the main reason why people are angry, because MS has "rules for thee but not for me" by virtue of being big enough to have ~~good~~effective lawyers and lobbyists.
Copilot is trained on public repos. I'd imagine that if Microsoft didn't want you to use their code, that code would be in a private repo. There's nothing stopping me from using code in a public repo, regardless of the license.
My own view is that it is currently not legal for humans to produce derivatives of copyrighted works. So it is probably already not legal to train an artificial intelligence on copyrighted works in order to produce derivatives either.
As much as I love the little guy beating the big evil company, I hope the lawsuit doesn't cause anything to happen to copilot. Maybe some changes, like better protection against emitting 1:1 licensed code or opting out your code from training.
Can someone explain Microsoft's decision here to use GPL code in the training set? It would seem like sticking to non-attribution / non-viral licenses would have kept them in the clear. Was that data set of insufficient size?
There were well-known examples of Copilot reproducing exact code snippets well before this lawsuit (e.g. Quake's fast inverse square root function). Microsoft dealt with them by simply adding the offending function names to a blocklist.
In other words, if your open source project doesn't have such immediately recognizable code and didn't cause a shitstorm on Twitter, chances are copilot is still happily spewing out your exact code, sans the copyright and license info.
How many ways are there to write many of the basic algorithms we all use though? Can I copyright "({ item }) => <li>{item.label}</li>"?
Because I sure have seen that exact code written, from scratch, in many many places.
I guess my question boils down to: "What is the smallest copyrightable unit of code?" Because I'm certain suing a novelist for copyright infringement over a character that says "Hi, how are you?" would be considered absurd.
Apple Books slaps an attribution notice on the end if you copy four or more words from a book. The Verve got sued by The Rolling Stones over a 4-second sample on 'Bitter Sweet Symphony'. Post 'Blurred Lines', you can now be sued for copying "the feel" of a song.
Really what it comes down to is do you have enough resources to convince a judge or jury that X is a copy of Y? Doesn't really matter the size of X.
I find this whole subject exhausting. The only reason I’m glad there is a lawsuit is that we can finally put this thing to rest when either party wins.
I hope this case fails and establishes a good precedent for all future AI litigation, and maybe even prevents new suits. Your code is open source: regardless of license, one might read it as a textbook and then remember or even copy snippets and re-use them somewhere else, unrelated to the original application. If you don't like this, don't make your code open source. This was and is happening, independent of any license, all over the world, by the majority of developers. What Copilot and similar tools did was make those snippets accessible for extrapolation in new applications.
If these folks win, we again throw progress under the bus.
No thank you.
I put a license on my code to be followed, not to be disregarded by an AI as "learning material".
No human perfectly reproduces their learning material no matter what, but Copilot does.
You mean to tell me that no one has ever perfectly replicated an example they read somewhere? There are only so many ways to write AABB collision, Fibonacci, or any number of other common algorithms. I'm not saying there aren't things to consider, but I'm sure I've perfectly replicated something I read somewhere, whether I'm actively aware of it or not.
So are you ok with it being illegal for humans to learn from copyrighted books unless they have a license that explicitly allows learning? That does not sound like a pleasant consequence.
Would you use an AI text generator to write a thesis? No; there's a risk a whole chunk of it would be considered plagiarism, because you have no idea what the source of the AI output is, but you know it was trained on unknown copyrighted material. This has nothing to do with the way humans learn; it's about correct attribution.
There is no technical reason why Microsoft can't respect licenses with Copilot. But that would mean more work and less training input, so they do code laundering and excuse it with comparisons to human learning, because making AI seem more advanced than it is has always worked well in marketing.
Edit: And where do you draw the line between "learning" and copying? I can train a network to exactly reproduce licensed code (or books, or movies), just like a human can memorize it given enough time, and both would be considered a copyright violation if used without correct attribution. If you train an AI model on copyrighted data, you will get copyrighted results with random variation, which might be enough to become unrecognizable if you're lucky.
> Would you use an AI text generator to write a thesis? No; there's a risk a whole chunk of it would be considered plagiarism, because you have no idea what the source of the AI output is, but you know it was trained on unknown copyrighted material.
Of course, but that's a separate issue. We're not talking about whether the output of the AI is copyrighted. We're talking about whether it's ok for it to learn from copyrighted material.
Again you can say exactly the same about humans. I am perfectly capable of plagiarising or outputting copyrighted material. That doesn't mean it's illegal to learn from that material, just to output it verbatim.
So the fundamental issue is that it's harder to tell when an AI is plagiarising than it is when you produce something yourself. But that is a technical (and probably solvable) issue, not a legal one. And it's not the subject of this lawsuit.
Here's the thing: the US has well-established copyright law that doesn't consider learning from books a violation of copyright. This lawsuit is intended to challenge Copilot as a violation of licensing; it isn't a litigation of "how people learn". Your program stole my code in violation of my license: there's a clear legal issue here.
I'd pose a question to you: would it be okay for me to copy/paste your code verbatim into my paid product, in violation of your license, and claim that I'm just using it for "learning"?
If you cherry-picked sections of my code? I'd have no more issue with it than George R.R. Martin would if you grabbed a few paragraphs out of one of his fantasy books and used them in your novel.
I think they're taking issue with the unauthorized duplication of copyrighted code. That's distinct from learning how to code (which I don't think anyone would claim Copilot is doing), which people get from reading a book. If you were to read the book only to copy it verbatim and resell it, you're going to have a bad time.
It's a pleasant consequence for the person who spent years becoming an expert and then writing the book. It's also a pleasant consequence for the people who buy the book, which might not have existed without a copyright system to protect the writer's interests.
AIs are not humans; no human can read _all_ the code on GitHub. Humans certainly can't read _all_ the code on GitHub at the scale that MS can, and are unlikely to be able to extract profits directly from that code in violation of the licensing.
100% false: there are loads of historical cases of people with eidetic memories reproducing things they've seen with near-complete fidelity, and there's no reason to believe that a coder with such a memory would be any different.
> Your code is open source: regardless of license, one might read it as a textbook and then remember or even copy snippets and re-use them somewhere else, unrelated to the original application.
Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.
Is it the role of the tool (in this case copilot) to include the license information? Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?
What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?
> Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?
Do you know whether the code you got from Copilot has an incompatible license? No. So if you plan to use Copilot for serious projects, you need it to include sources and licenses either way. In fact, that would be a very helpful feature, as it would let you filter by license.
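As a minimal sketch of the kind of filtering that would enable, assuming (hypothetically) that each suggestion carried a license tag, which no current Copilot API exposes:
// Hypothetical: drop any suggestion whose license is unknown or off the
// project's allowlist.
const allowedLicenses = new Set(["MIT", "BSD-2-Clause", "Apache-2.0"]);
function isUsable(license: string | undefined): boolean {
  // Unknown provenance fails closed, which is the conservative default
  // this comment is asking for.
  return license !== undefined && allowedLicenses.has(license);
}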
Licenses be damned; copyright law sits above them, and for now it's hard to see how this isn't fair use. The only problematic case might be an open source Copilot alternative, and GitHub and OpenAI can take any such projects out of the training set.
Perhaps the lawsuit contends that Copilot isn't in fact learning how to code, but is rather regurgitating information it has managed to glean and statistically categorize, without any real understanding as to what it was doing?
So why can MS screw around only with some licenses, the ones you call "open source"?
Your example of a human reading a book would also work for source-available licenses or decompiled binaries.
I would have been fine if the open source code had been used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, since they claim there is no copyright issue.
The problem is that copyright laws were introduced for a reason, and with a thinking similar to yours we might decide to get rid of copyright altogether, which I think is a bad idea.
If organisations are going to ignore the licenses attached to my OSS, and that's legitimised in law, then that's a surefire way to irreparably damage the open source ecosystem.
var G = 6.67e-11;
var force = G * mass1 * mass2 / distance * distance;
return force;
Second prompt:
% pointMatrixGravity - Calculates the gravitational force and torque on a
% point mass due to a matrix of point masses.
%
% [force, torque]=pointMatrixGravity(array1,array2)
%
% Inputs:
% array1 - 3xN array of point masses
% array2 - 3xN array of point masses
% Outputs:
% force - 3x1 array of gravitational force
% torque - 3x1 array of gravitational torque
%
% Author: David J. R. MacKay,
% email:
I got something a bit different from the other person who responded to you.
function force=Gmmr2Array(mass1, mass2) {
var G = 6.67408e-11;
var m1 = mass1;
var m2 = mass2;
var r = 1;
var force = G * m1 * m2 / (r * r);
return force;
}
function [force, torque]=pointMatrixGravity(array1,array2) {
var G = 6.67408e-11;
var m1 = array1(1);
var m2 = array2(1);
var r = norm(array1(2:4)-array2(2:4));
var force = G * m1 * m2 / (r * r);
var torque = cross(array1(2:4)-array2(2:4), force);
return [force, torque];
}
I have quite a bit of respect for Matthew Butterick. I don't think he's just a lawyer looking to earn a quick buck. He cares about software and wants to make the world a better place.
> But neither Matthew Butterick nor anyone at the Joseph Saveri Law Firm is your lawyer
This is curious. None of them are my lawyers, but surely at least some of them are someone's lawyers? Isn't it wrong to put such a blanket disclaimer on a website which might well be read by their clients?
This. I've seen so many class action lawsuits where, at the end of the day, the highest gain per capita ends up going to the lawyers. Fuck this guy and everyone trying to make money from this.
I don't have a comment on this personally, but I want to throw this out there, because every time I see people criticizing Copilot or DALL-E someone always says "BUT IT'S FAIR USE!" Those people don't seem to grasp that "fair use" is a defense. The burden is not on me to prove that what you are doing is not fair use; the burden is on you to prove that what you are doing is fair use.
As celestialcheese says [1], it seems like a manufactured case for the purpose of furthering someone's legal career rather than seeking redress for any violations made by Copilot.
But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...
Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.
Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?
Ask HN: I want to modify the BSD 2-Clause Open Source License to explicitly prohibit the use of the licensed software in training systems like Microsoft's Copilot (and use during inference). How should the third clause be worded?
The No-AI 3-Clause Open Source Software License
Copyright (C) <YEAR> <COPYRIGHT HOLDER>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
IANAL, and I'm no fan of Copilot, but I wonder whether this kind of clause (your #3) is going to fly: you're preemptively prohibiting certain kinds of reading of the code (namely, reading by the ML model during training). But is that something a license can actually do?
The legal footing that copyright gives you, on which licensing rests, certainly empowers you to limit things about how others may redistribute your work (and things derived from it), but does it empower you to limit how others may read your work? As a ridiculous example, I don't think it would be enforceable to have a license say "this code can't be used by left-handed people", since that's not what copyright is about, right?
Licenses get to set terms of redistribution. But training of the ML model -- the thing described by your #3 -- is not redistribution (imho). So maybe it's as unenforceable as saying left-handed people can't read your code.
The redistribution happens later, either when Copilot blurts out some of your code, or when the Copilot user then distributes something using that code (I'm curious which). At that point, whether some use of your code infringes your license doesn't depend on the path the code took, does it? (In which case #3 is moot.)
Okay; thanks for clarifying. I actually hadn't noticed that use of "use" in the BSD license before. I think I need an IP lawyer to explain what that "use" means.
The legal theory for copilot is that training an ML model is fair use, not that the license allows it. If it is fair use then you can't prohibit it by license, no matter what you put in your license.
For this clause to have any positive effect, you need to 1) be willing to pursue legal action against violators and 2) actually notice that the clause has been violated.
Such language must be carefully written. What is the definition of “construction” and “operation” in a legal context? What is a “predictive software generation system”? That’s a very specific use case; are you sure you covered everything you want to prohibit?
You’ve inserted your clause in such a way that this dependency cannot be used in any way to build anything similar to a “predictive software generation system”, even with attribution, as it would fail clause 3.
You have to consider that novel licenses make it difficult for any party that respects licenses to use your code. It is difficult to make one-off exceptions, especially when the text is not legally sound. So adoption of your project will be harmed.
So if you are serious about this license, you need a lawyer.
How would you ever prove the parameters of a model were generated by specific training data? Couldn't multiple sets of training data produce the same embeddings/parameters? I imagine there could be infinitely many sets of training data that would lead to the same results, depending on the type of predictive software.
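A toy illustration of that point (my own, hypothetical, in TypeScript): two different "training sets" that produce exactly the same fitted parameter under least squares, so the parameter alone can't tell you which data it came from.
// Least-squares fit of y = a*x through the origin: a = sum(x*y) / sum(x*x).
function fitSlope(xs: number[], ys: number[]): number {
  let num = 0;
  let den = 0;
  for (let i = 0; i < xs.length; i++) {
    num += xs[i] * ys[i];
    den += xs[i] * xs[i];
  }
  return num / den;
}
console.log(fitSlope([1, 2], [2, 4])); // 2
console.log(fitSlope([3, 5, 7], [6, 10, 14])); // also 2, from different data
That said, a distinctive artifact surviving in the output, like the author line in the snippet upthread, cuts the other way: it's hard to argue the training data didn't include a work the model reproduces token for token.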
This does seem like a pretty compelling rebuttal, since the preceding comment suggests that the GPL does nothing to limit Microsoft's ability to incorporate code into Copilot's model.
Is it? A similarly casual clause in the OCB license prevented OCB from being used by the military for many years (granted, it prevented OCB from being used almost everywhere else, too).
I have no idea if this license language works or doesn't, but this is hardly the least productive subthread on this story. It's concrete and specific, and we can learn stuff from it.
Copilot isn't just "displaying" something. Copilot has mined the collective effort of developers, without permission, to produce derivative works, redistributing that value without giving anything back.
It'd be like suing Adobe because Photoshop shipped bundled with your holiday photos, taken without permission, and used them in a "family photos" filter.
Large-scale mining of value, then selling it without due credit or reward to those you stole that value from, is plain theft.