I’m not a lawyer, but here is why I believe a class action lawsuit is correct:
“AI” is just fancy speak for “complex math program”. If I make a program that’s simply given an arbitrary input and then, through math operations, outputs Microsoft’s copyrighted code, am I in the clear just because it’s “AI”? I think they would sue the heck out of me if I did that, and I believe the opposite should be true as well.
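To make the thought experiment concrete, here's a toy TypeScript sketch (PROPRIETARY_SOURCE is a hypothetical stand-in for someone else's copyrighted code; nothing here is from the actual lawsuit):

const PROPRIETARY_SOURCE = "/* imagine Microsoft's copyrighted code here */";

function totallyNotCopying(input: number): string {
  // plenty of "math operations" on the arbitrary input, all discarded:
  let h = 0;
  for (let i = 0; i < 8; i++) h = (h * 31 + input) | 0;
  void h;
  return PROPRIETARY_SOURCE; // the output was baked in all along
}

However much "math" happens on the way, what comes out is still someone else's work.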
I’m sure my own open source code is in that thing. I did not see any attribution, so they are breaking the fundamentals of open source.
In the spirit of Rick Sanchez: it’s just compression with extra steps.
I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a TypeScript file:
function isPrime(n: number): boolean {
  for (let i = 2; i < n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return n > 1;
}

function isEven(n: number): boolean {
  return n % 2 === 0;
}
These are clearly not covered by copyright in the first place. This case is really quite pathetic.
Correct me if I'm wrong. I don't think this document needs to be a comprehensive record of every piece of copyrighted material that Copilot or Codex produces. That's something that will be produced during/for the trial process itself. Right now, this is just establishing the basic premise, and the claims for the type of behavior that is going on.
I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).
I tested co-pilot initially with Hello World in different languages. In Lisp, it gave me verbatim code from a particular tutorial, which was made obvious because their code had "Hello <tutorialname>" where <tutorialname> was the name of a YouTube tutorial, instead of the word "World." It was surely slurped into the model via someone who had done the tutorial and uploaded their efforts to Github. Mind you, it's pretty much the way everyone would code it, but the inclusion of <tutorialname> is definitely an issue.
I have only skimmed. But lines 23 and 24 on page 23 also reference Copilot's autocompletion of Quake III's `Q_rsqrt`[1] and mention that it is under GPL2.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
"Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis."
That code is specifically optimized for efficiency and there were similar approaches floating (get it?) around in the 1980s.
The magic constant is not optimal; there exist better alternatives. So it is not necessary for implementing this function, and it should be copyrightable. It is also not a trivial part.
On the other hand, Microsoft may only need to show "Hey, we got this code from FooBar under this license and this license and ..."
Why should it be copyrightable? It's just a way to calculate an inverse square root. This falls into the public domain, in my non-lawyer opinion. Such small snippets do not usually qualify for copyright.
It's not just the constant, but that was the easiest thing for me to identify in the last post. And due to its popularity the size of the snippet doesn't matter; it stands on its own as a significant work.
The essence of the algorithm takes four lines: the function declaration, the declaration of 'y', one line for calculating the exponent in log-space, and one line for the root-finding return.
The rest is fluff. Every line of the snippet has creative input: the chosen names ('threehalfs' for 1.5F), the order of declarations and instructions, the redundancy. There have been internet wars around indentation and newlines; these are style choices.
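For reference, a rough TypeScript sketch of just those four essential steps (typed arrays stand in for C's pointer cast; the 0x5f3759df constant and the single Newton step are the widely published ones, not my own):

function fastInvSqrt(x: number): number {
  const buf = new ArrayBuffer(4);
  const f32 = new Float32Array(buf); // float view of the same 4 bytes
  const u32 = new Uint32Array(buf);  // integer view of the same 4 bytes
  f32[0] = x;
  u32[0] = 0x5f3759df - (u32[0] >>> 1); // the exponent trick in log-space
  const y = f32[0];
  return y * (1.5 - 0.5 * x * y * y);   // one Newton-Raphson refinement
}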
((And it is public -- GPL more specifically, which is a restrictive license that should be respected. I think this snippet makes a perfect example of the dangers of Copilot. But not one to litigate details with.))
(((Thinking back, I'm not sure anymore how the license laundering argument works if they got the code from a fair-use MIT-licensed hobby project. Can one person claim fair-use and include it under an MIT-license and have somebody else say 'oh this free code I'm going to use it commercially'?)))
You didn't read the relevant part of the complaint. It starts on document page 14 (PDF page 17). There's a clear footnote:
> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.
The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript.
This seems like an incredibly trivial example. If I remembered that example subconsciously, and used it myself somewhere, would that be an infringement of intellectual property? In any large code base how many such infringements are there? Many? Should we sue every software company on this premise?
Sure, those comments might be considered infringement, but that's from an earlier version of Codex. Copilot does not return that code. The complaint even says so.
If a software system systematically engages in copyright violation but only haphazardly corrects those violations, those haphazard corrections aren't evidence the problem has vanished.
If Copilot is committing widespread infringements of their copyright, then surely they will be able to find examples of such infringement to submit in their lawsuit.
I assume they want some kind of broad relief, such as an injunction to take down copilot. They are not going to get it, they are not going to get anything at all, if they can’t even provide examples of violating code.
During the Pirate Bay case, the prosecutor only had to illustrate that it was likely (as in, he convinced the judges) that copyright infringement had occurred. They did this by showing the top 100 torrents. They did not have to prove with certainty that the top 100 torrents were actually used by people. The fact that the names of movies and games showed up on the list was enough to convince the judges.
The lawyers defending the founders did try to make the argument that no infringement had been proven, and that the list itself was not proof of any infringement. It was just a list on a website, and they even presented evidence that the counter on the list was algorithmically faulty. The judges were not convinced and applied the common-sense approach that, taken as a whole, it was not believable that no infringement had occurred via the website, given the context of the site (the name, the top list, the overall way the site was designed).
> ...then surely they will be able to find examples of such infringement to submit in their lawsuit
Perhaps that is why they are reaching out to potential class members
> if they can’t even provide examples of violating code.
This is the very beginning of a very long process. I wouldn't rule out a settlement where class members get $10-100, which is a common resolution for class action suits.
There are many public examples of that same effect happening (for example https://twitter.com/mitsuhiko/status/1410886329924194309 ), and the legal team has been soliciting for more examples. Those examples are likely to come out if it does go to trial.
If this legal team were interested in this going to trial, you'd think they would have put together a stronger case instead of risking that it won't be heard.
There’s not even a single mention of any established legal doctrines around copyright and software, such as abstraction-filtration-comparison, the idea-expression dichotomy, etc.
> it threatens to disrupt one of the biggest technological progressions of all time.
Chill dude, all they have to do is include the licenses on their generated code.
If anything, this is going to generate even more progress. The Copilot team would have to create some kind of feature that would connect the generated output to the relevant training data. That'd be pretty incredible to see in the field of AI/ML in general.
If they can actually link output to specific input, the lawsuit has merit, and moreover GPT-3 is a lie. A neural network is supposed to learn how things work, not memorize a large number of examples and spit them out verbatim, or keep connections to specific inputs.
Copilot losing the lawsuit is evidence it’s a case of overfitting, not true ML.
Not just AI is threatened, but also the use of sites like StackOverflow because some of those snippets might infringe a license. So we have to write everything from our heads, de novo. No more googling for solutions.
I think we should just relax copyright, it's dying anyway. Language models allow people to borrow skills learned from other people, and solve tasks. That's huge. Like Matrix, loading up a new skill at the press of a button. Can we give up such a huge advantage in order to protect copyright?
I think the notion of copyright has been under attack already for 2 decades by the internet, search engine and social networks. They all work against it, and AI does it even more. It just encapsulates the whole culture in a box, mixing everything up, all copyrights melting away, everything one prompt away. This could be a new medium of propagation for ideas. No longer limited to human brains and books, they can now propagate through language models more efficiently.
That isPrime function does not even cut off at sqrt(n). Asking for the state-of-the-art isPrime function is too much, but the sqrt trick is the very first step and it's free. (IIRC, the faster version uses i*i <= n.)
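For comparison, a version with the sqrt cutoff might look like this (same TypeScript flavor as the snippet above; just an illustration of the trick):

function isPrime(n: number): boolean {
  if (n < 2) return false;
  for (let i = 2; i * i <= n; i++) { // stop at sqrt(n): any factor pair has one member below it
    if (n % i === 0) return false;
  }
  return true;
}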
When searching for "console.log(isEven(50));" "// → true", which is one of the parts the complaint is about (since this is also reproduced in a programming textbook), cs.github.com gives:
" Showing 1 - 20 of 66 files found (in 76 milliseconds)"
So, if this lawsuit succeeds in some way shape or form, does the author have a case against the 66 people that reproduced these lines in their own repository?
You could argue that if the author had pursued enforcing their licence against those 66 people, their code wouldn't have ended up in the training set in the first place. IANAL, but I recall that you can't invoke copyright law to selectively enforce it; copyright is only protected if the holder pursues every violation of it. Maybe it works the same for enforcing a licence.
They can already sue those people if they don't follow the original license, they just need to file a complaint individually to each author, I think. Standard OSS license stuff, or else, why would people even use licenses?
Legally a copyright claim seems weak, but they didn't assert one. Some of their claims look stronger than others. The DMCA claim in particular strikes me as strong-ish at first glance, though.
Morally I think this class action is dead wrong. This is how innovation dies. Many of the class members likely do not want to kill Copilot and every future service that operates similarly. Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I am more hesitant to release code on GitHub under any license now. Even outside of GPL-esque terms, I've considered open sourcing some of my product's components under a source-available but otherwise proprietary license, but if Microsoft won't adhere to popular licenses like the GPL, why would they adhere to my own licensing terms?
If my licenses mean nothing, why would I release my work in a form that will be ripped off by a trillion dollar company without any attribution, compensation or even a license to do so? The incentives to create and share are diminished by companies that won't respect the terms you've released your creations under.
That's just me as an individual. Thinking in terms of for-profit companies, many of them would choose not to share their source code if they know their competitors can ignore their licenses, slurp it up and regurgitate it at an incomprehensible scale.
> Innovation dies when creators can't create without someone ripping off their work against the terms they release it under.
I strongly disagree. There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
> I've considered open sourcing some of my product's components under a source available but otherwise proprietary license
What's the point of that? This isn't useful to anyone. The fact you even consider it shows you don't understand open source. I'm sure you happily use open source code yourself though.
> There would be more innovation if code couldn't be copyrighted or kept secret. See: all of open source.
I actually agree. However, this is not what's happening. Copilot effectively removes copyright from FLOSS code, but doesn't touch proprietary software. FLOSS loses its teeth against the corporations.
I'm the author of about a dozen popular AGPL and GPL projects, but please tell me how I don't understand open source.
The purpose of releasing source available but proprietary code is so that users can learn and integrate into it, and making it available lets anyone learn how it works. The only reason I even considered making the source available is balance between 1) needing to eat and 2) valuing open source enough to risk #1.
I played around with creating an MIT license on my GitHub that explicitly forbids Copilot and other such systems, which I thought I might update my projects to use, because I strongly dislike the data collection. I'm not a lawyer though.
Is there a GitHub terms of agreement that covers Copilot?
They claim it is fair use, therefore they can bypass copyright (and therefore license terms).
It being in GitHub has not been brought up as a factor yet (by GitHub/Microsoft), AFAIK they could use code from other places with that logic, they just don't need to.
I find your comment a bit perplexing, perhaps you can help me understand.
Why do you want to release code on GitHub with an oppressive license? What's the motivation for you, and what's the benefit for anyone else in it being released?
The size of code fragments being generated with these AI tools is, as far as I can tell, extremely small. Do you think you could even notice if your own implementation of sqrt, comments and all, wound up in Excel?
The point of copyleft licenses (which I assume are what you mean by "oppressive") is to subvert copyright in order to incentivize others to share their code, by providing them with something to build on if they return the favor. You cannot possibly call these licenses oppressive, since the default state under copyright is that you are not allowed to do much at all (at least when it comes to copying). In fact, copyleft licenses allow you to do much, much more than your average corp-lawyer-approved proprietary license.
The problem (or A problem) with Copilot is that it tries to sidestep those licenses, purportedly allowing you to build upon the work of others without giving anything back, even if the work you are building on has been published on the explicit condition that what you create with it should also be shared in the same way. While the great AI tumbler makes the legal copyright-infringement argument complicated, by giving you lots of small bits from lots of different sources, it really does not change the moral situation: you are explicitly going against the wishes of the people who are enabling you to do what you are doing.
Beyond copyleft, this kind of disregard for other people's wishes also applies to attribution, even with more liberal licenses. Programming is already a field where proper attribution is woefully lacking; we don't need to make it worse by introducing processes where it becomes much harder, if not impossible, to tell who contributed to the creation.
Now I am all for maximum code sharing. I'm all for abolishing copyright entirely and letting everyone build what they want without being shackled by so-called intellectual property. But that is not what Microsoft is doing with Copilot. What they have created is a one-way funnel from OSS to proprietary software. If Microsoft had initially trained Copilot on their own proprietary sources, this would have been seen very differently. But they did not. Because the way Microsoft "loves open source" is not that of a mutually beneficial symbiotic relationship, but that of an abuser that loves taking advantage of whatever they can while giving as little back as they can get away with.
How does one make the leap from "a source available but otherwise proprietary license" to a copyleft license? As I understand the terms, perhaps in too limited a way, a proprietary license is never one in which others are free to build on the code or incorporate any part of it into their own works, and a source available proprietary license is just publishing source that no-one can use.
As for whether Copilot's morally wrong or not - I don't think copyright as a concept makes any sense at the level of the trivial, where Copilot _should_ be acting. If Copilot regularly reproduces sizeable portions of code from a single origin _without_ careful and deliberate guidance, I'd agree that there's a problem here. As I understand it though, that's not happening.
By its very nature of being published, code from OSS is funnelled into proprietary codebases by humans performing a similar task to Copilot - reading available code and using that to evolve an understanding of how to produce software. I like to think we do it at a deeper level than Copilot, but the general effect is the same: the code I write, like the words I write, are heavily influenced by all the code I've read over the years.
If I wind up using a few words from your comment, down the line, because some turn of phrase you used struck me as a good way to say something, do you think I've morally wronged you?
I'm fine with Copilot, but I think all rightsholders should be allowed to decide if they want their code training it or not. And that should be opt-in, not opt-out.
(And refusing to opt in shouldn't have to mean switching to a new hosting platform.)
> Beyond that, the class members aren't likely to get much if any money. The only party here who stands to clearly benefit is the attorneys.
That's the case in pretty much any class action. I look at class actions as having two purposes: to require that the defendant stops doing something, and to fine the defendant some amount of money. Sure, individual class members will see very little of that money, but I look at it as a way of hurting a company that has done people wrong. Hopefully they won't do that anymore, and other companies will be on notice that they shouldn't do those bad things either. Of course, sometimes monetary damages end up being a slap on the wrist, just something a company considers a cost of doing business.
>I look at class actions as having two purposes . . . to require that the defendant stops doing something
That's my point. Many of the class members don't want the company to stop doing this.
I have code on GitHub, and Copilot is a useful tool. I don't care if my code was used to train the model. Sure, I personally could opt out of the suit, but that would be utterly meaningless in the grand scheme of things. The bottom line is, if I'm a coder with code on Github and I like Copilot, this suit is a huge net negative.
Even more importantly, I want to see the next version of Copilot that will be created by some other company, and then the next version after that. I want development to continue in this area at a high velocity. This suit does nothing but put giant screeching brakes on that development, and that is just a shame.
If the lawsuit goes through, it's not likely that Copilot would disappear, but there would be a checkbox to opt in your code. You could check it and your code would be used to train the model.
I have some code on GitHub as well and would not want it to be used in training, not by Microsoft nor by any other company. It is under the GPL license to ensure that any derived use is public and not stripped of copyrights and locked into a proprietary codebase, and Copilot is pretty much the 100% opposite of this.
If this lawsuit is successful, I doubt it will change anything at all. Microsoft will just pay the damages as a cost of doing business and continue what they are doing. Maybe they will add an opt-out.
seems like a great opportunity for Microsoft to alter copilot so it's opt-in to get your code scanned, and to mandatorily add licensing and attribution to outputs
I know you said you're OK with it as is, but many aren't, so if I'm a coder, this suit represents a big net positive for me, being a way to reduce the probability of someone laundering my code away without proper attribution or license attention
Hypothetically, if I wanted to learn how to code by studying open source examples on GitHub, should I have to go ask permission of each rightsholder to learn from their code? I agree that, if Copilot is based on a model that overfits to output the exact same code it read, the lawsuit has merit (and Copilot is not really ML), but the idea of ML is that the model doesn’t memorize specific answers, it learns internal rules and higher-level representations that can output a valid result when given some input. Very much like me, the coder, would output valid code when given a use case description, after studying a lot of open source examples of that. Should most programmers just be paying rights to all publishers of code they have studied?
For a long time, Microsoft has used software licenses to reap profits from Windows and Office, the two products that enabled Microsoft to capture near-monopolies in their respective markets.
Now, Microsoft is violating other people's software licenses to repackage the work of numerous free and open source software contributors into a proprietary product. There is nothing moral about flouting the same type of contract that you depend on every day, for the sake of generating more money.
Either the entire Copilot dataset needs to be made available under a license that would be compatible with the code it was derived from (most likely AGPLv3), or Windows and Office need to be brought into the commons. Microsoft cannot have it both ways without legal repercussions.
I don’t think this lawsuit would hinder innovation but it would greatly change it and who owns it.
If an AI model is the joint property of all the people who contributed IP to it, it’s a pretty hugely democratic and decentralizing force. It also will incentivise a huge amount of innovation on better, richer data sources for AI.
If an AI model isn’t joint property of the IP it learned then it’s a great way to build extractive business models because the raw resource is mostly free. This will incentivise larger, more centralised entities.
Much of the most interesting data comes from everyday people. A class action precedent is probably good for society and good for innovation (particularly pushing innovation on the edge/data collection side)
The problem of jointly-owned AI is that the actual value of a particular contribution to the training set is not particularly easy to calculate. We can't tie a particular model weight back to individual training set examples, nor can we take an output that's a statistical mix of two different peoples' work and trace it back to them.
With current technology, the only licensing model we can offer is "give us your training set example, we'll chuck a few pennies at you out of credit sales and nothing more". We can't even comply with CC-BY because the model can't determine who to attribute to.
The resource is not "free": it is provided under a license that attempts to lay out the terms the entity using the resource must comply with in order to get the benefit of using their product; just because this is a non-monetary compensation doesn't mean it is "free".
Authors of code (open source or otherwise) hold a copyright in that code. The purpose of the license agreement is to set out the terms on which the authors will permit others to take actions that would otherwise infringe copyright.
Using code, photographs, documents, or other material to train a model isn't copyright infringement. The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
> The person operating the model is not violating the exclusive rights of the copyright author: they are not making copies or derivative works.
How do they not make copies? Do you know how a computer works? Ever heard of RAM? (At least the German Urheberrecht recognizes this clearly: You can't do any processing on any data with the help of a computer without at least making temporary local copies, so there are exceptions to some rules. I'm quite sure common law copyright also recognizes this!)
Also the claim that this is not a derivative work is actually one of the disputed claims here…
> Any other result means that all AI development based on training models is going to grind to a screeching halt, because essentially all training material—text, pictures, recordings—is copyrighted.
Exactly, it's all copyrighted! That's why you can't use it for whatever you like. That's the whole point of copyright.
As a result, this means that whoever wants to exploit that work in said way needs to buy (or otherwise obtain) a license!
Nobody said that feeding AI with properly licensed work would be problematic. Only the original creators need to get their fair cut from the outcome of such a process.
You clearly don't understand how machine learning works. If machine learning on copyrighted data becomes illegal, then most of our infrastructure will go down, because much of it uses machine learning. The first thing that would affect many people is probably Google search.
I believe this is the core point of the lawsuit - is Copilot really creating code from what it learned (which happens to, by some weird glitch, mimic the source code) or is it just a big overfitting model that learned to encode and memorize a large number of answers and spit them out verbatim when prompted?
I think that losing this lawsuit has much more serious consequences for Copilot than just having to connect to a list of millions of potential copyright owners - it would mean the model behind it is essentially a failure.
Personal opinion: the real situation lies somewhere in the middle. From what I’ve seen, I think Copilot has some ability to actually generate code, or at least adapt and connect unrelated code pieces it remembers to respond to prompts - but I also believe it just “remembers” (i.e., has a close-to-lossless encoding of the input) how to do some operations and spits them out as part of the response to some prompts.
I hardly think the lawsuit will really explore this discussion, but it sounds like a great investigation into what DL models like transformers actually learn. For all I know, it might even give insight into how we learn. I have no reason to believe that humans don’t use the same strategy of memorising some operations and learning how to adjust them “at the edges” to combine them.
I don't think that anybody will try to answer the philosophical question of whether what this machine does has anything to do with human reasoning.
In the end it's just a machine. It's not a person. So trying to anthropomorphize this case makes no sense from the get go.
Looking at it this way (and I guess this is the right way to look at it from the law standpoint) Copilot is just a fancy database.
It's a database full of copyrighted work…
How this database (and its query system) works from the technical viewpoint isn't relevant. It just makes no difference, as by law machines aren't people. End of story.
But should the court (to my extreme surprise) rule that what MS did was "fair use", then the floodgates of "fairuseify through ML"[1] would be open. Given the history of copyright and other IP laws in the US, this just won't happen! The US won't ever accept that someone would be allowed to grab all Mickey Mouse movies, put them into some AI, and start to create new Mickey Mouse movies. That's unthinkable. Just imagine what this would mean. You could "launder" any copyrighted work just by uploading it and re-querying it from some "ML-based database system". That would be the end of copyright. This just won't happen. MS is going to lose this trial. There is no other option.
The only real question is how severe their loss will be. They surely also used AGPLv3 code for training. Thinking this through to the end, with all consequences, would mean that large chunks of MS's infrastructure and all supporting code (which means more or less all of Azure, and therefore more or less all of MS's software) would need to be offered in (usable!) source form to all users of Copilot. I think this won't happen. I expect the court to find a way to weasel out of this consequence.
> Morally I think this class action is dead wrong. This is how innovation dies.
This legal challenge is coming one way or another. I think it’s better to get it out of the way early. At least then we will know the rules going forward, as opposed to being in some quasi-legal gray area for years.
I disagree. The more entrenched a practice is, like training AI models on media content, the less willing a court is going to be to take that practice away.
that seems like a machiavellian way of avoiding The People deciding the issue for themselves via their government representatives, and it'd just make things harder when the court takes the practice away anyways
Say you read a bunch of code, say over years of a developer career. What you write is influenced by all of that. It will include similar patterns, similar code, and identical snippets, knowingly or not. How large does a snippet have to be before it's copyrightable? "x"? "x==1"? "if x==1\n print('x is one')"? [Obviously, replace with actual common code, like "if not found return 404".]
Do you want to be vulnerable to copyright litigation for code you write? Can you afford to respond to every lawsuit filed by a disgruntled wingbat, or by a large corp wanting to shut down an open source / competing project?
This is a logical fallacy. A human is not an algorithm. We do not have to extend rights regarding novel invention to an algorithm to protect them for people.
Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense. If it were true, people would very easily game it, by using algorithms to automate the most trivial parts of copying someone's work.
Ultimately the algorithm is automating something a human could do. There is a lot of gray area to copyright law, but you can't get around that simply by offloading to an algorithm.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
Uh? So if I design a self driving car which kills someone, it's the car that goes to jail?
Legal precedent seems to indicate this is not the case at all. Because humans and machines are different, simply because humans aren't machines and vice versa.
"So if I design a self driving car which kills someone, it's the car that goes to jail?"
No but the manufacturer will typically be held responsible. If the manufacturer intentionally designed it to kill people, someone could certainly be charged with murder. More likely it was a software defect and then it is a matter of financial liability. (in between is a software defect that someone knew about and chose not to fix)
This isn't a new issue. If you design a car and the brakes fail due to a design issue and that issue can be determined to be something that could have been preventable by more competent design.... someone might indeed go to jail but more likely it would be the corporation paying out a large amount of money.
It could even be a mixture of the manufacturer's fault and the driver. Maybe the brakes failed but the driver was speeding and being reckless and slammed on the brakes with no time to spare. Had it not been a faulty design, no one would have gotten hurt, but also if the driver had been competent and responsible, no one would have gotten hurt.
But with self driving cars, when they no longer need a "safety driver", it certainly won't typically be the human occupant of the car's fault to any degree, since they are simply a passenger.
Last I checked this was very much a gray area. I’d expect at least a long investigation into the amount of work and testing put into validating that the self-driving algorithm operates inside reasonable standards of safe driving. In fact, I expect that, as the industry progresses, the tests for minimal acceptable self-driving safety get more and more standardised.
That doesn’t answer the question of who’s responsible when an accident happens and someone gets hurt or dies - but then, there was a time when animals would be judged and sentenced if they committed a crime under human law. That practice is no longer deemed valid, maybe we need to agree that, if the self-driving car was built with reasonable care, accidents can still happen and it’s no one’s fault.
First of all, that isn't simple. How do you determine what is done by humans? If the human is using a computer and using copy and paste does that still qualify?
No matter where you draw the line between "done by computers" and "done by a human simply using a computer as a tool," there will always be a lot of gray area.
Also, if I spend a year creating my masterpiece, and some kid releases a copy of it for free and claims that that's ok just because it's "not for profit," there is still a problem.
> Differentiating between a human and a machine simply because one "is not an algorithm" doesn't make a lot of sense.
it makes a lot of sense, for that reason and a lot of others
people can create algorithms that do whatever they want, including copyright infringement and outright criminality, but algorithms can't create people or want anything for themselves
Copyright already worries about this sort of thing a great deal, and it's actually a lot more well thought-out than your average hacker is aware of. There are no hard and fast rules; but generally... the thing being sued over has to be creative enough to be copyrightable in the first place. Small snippets do not qualify for copyright protection alone.
I'm not sure this is true. At least for copyright in the common law meaning.
Oracle got copyright on API signatures…
In civil law there is a bar to protection if the work lacks "substantial" creativity. But even this bar is extremely low. More or less everything besides maybe simple math formulas is protected.
Oracle got a very thin copyright on API signatures. The "programmer convenience" ruling in Google v. Oracle basically precludes almost all copyright action on APIs alone.
No, they got absolute copyright on the API signatures.
The court did not even question any copyright; it just assumed the APIs are copyrighted by Oracle. Then it looked for reasons why copying the APIs could possibly be fair use…
By the skin of their teeth they found some very involved and case specific reasons why Google's use of the copyrighted APIs was, after all, fair use.
The reason why SCOTUS bent over backwards to not talk about copyrightability was not because they assumed it was true for APIs, but because they didn't feel like they had all the facts. They basically said "we don't know if it's copyrightable, but if it is, here's a ruling that makes this case and anything similar to it go away".
Oracle only has copyright over APIs in the Federal Circuit, because they were able to hoodwink the judge into applying patent logic[0] to a copyright case. In other circuits it's still up in the air. And in the Ninth Circuit[1] there's already loads of controlling precedent that would have resulted in Oracle's case being summarily dismissed, API copyright or no.
The term "thin copyright" is a term of art. It refers to the kind of copyright protection you get from combining uncopyrightable elements in a creative way. For example, you can't own a particular chord progression. But, if you combine that with, say, a particular instrument, some audio engineering techniques, the subject matter of the lyrics, and so on... then you start getting something that requires creative effort and thus is copyrightable. Courts still have to take this into account when ruling on copyright claims as they do not want to give people a monopoly over just the chord, or just that instrument, etc.
In the case of APIs, we're talking about a series of names, plus an arrangement of type signatures that go with them. Very much a thin copyright, as the legal profession in the US calls it.
And when you have thin copyright, courts are going to be more liberal with handing out fair use exceptions. The "programmer convenience" argument that SCOTUS adopted means that copying an API to put in a different platform is OK. The Ninth Circuit says that copying an API to reimplement a platform that other people's code relies upon is also OK. There's very little room left to actually make a copyright claim on an API alone.
In the case of Copilot, it's not merely copying APIs and filling them out with novel details. It is either generating wholly novel code, or regurgitating training data, the latter of which is just a regular 'ol infringement claim with no difficult legal questions to worry about.
[0] The Court of Appeals for the Federal Circuit is the only court with subject-matter jurisdiction over patent claims. When you're the only person who can make hammers, everything looks like a nail.
[1] The Ninth Circuit court of appeals has jurisdiction over California, which means it takes on the brunt of copyright cases.
I still don't buy the part that there is not much to worry about.
The thing you call "thin copyright" is still copyright. Being protected or not is in the end a binary judgment: If your stuff is "a little bit" protected it is actually fully protected—with all consequences that follow from that.
Also, the mere "assumption" by the highest US court that APIs are protected is a very strong signal. They could just have ruled that there is no protection at all; case closed. But they preferred to go for a weasel solution. This has reasons… They deliberately didn't open up the door for API freedom. (Most likely to still be able to wield that weapon against foreign competition should they feel like it some day.)
The point is: IP law is completely crazy. The smallest brain-farts are routinely protected.
The exceptions to this rule are actually stronger in civil law, but still, even in the EU, single words or sub-second audio samples are protected by default. (Regarding APIs the situation is better, though: it's legal to reverse engineer something for e.g. compatibility, and a few other reasons, but those are explicit exceptions. The default is that almost every expression of even the slightest form of human "creativity" is copyrighted; the bar is extremely low, and it actually gets pushed constantly lower by common-law influence.)
So on both sides of the Atlantic the default is that every single line of code is protected. There is nothing like a lower bound on size. Then, from there, you could try to argue that there should be an exception from this protection in some particular case, e.g. that there was no "creativity" at all involved. But you will need to win an often very hard, expensive, and ridiculously long fight over that issue, and winning is nothing like a sure thing; the default is that just about everything is protected to the max. (Just look at all the craziness around news headlines in the EU; Google lost that case back then. To understand this better, as it may be very surprising to US readers: civil law does not recognize anything like "fair use". There are exceptions to copyright protection that have almost the same effect in the end, like grants for libraries or educational purposes, but those exceptions, and their limitations, are listed explicitly in the law. If no exception is listed, there just isn't one, and only the very vague "creativity bar" remains.)
Regarding Copilot: it makes not much difference whether this machine spits out verbatim copies of (clearly copyrighted!) snippets or some "remix" thereof. There is no "novel" code if, at best, all this machine does is create "remixes" of the code it has in its database, based on the query given. (Its "knowledge base" is nothing else than a very funky database; technical details regarding the actual implementation of that database or its query system should not matter legally.)
Before this comes up again: no, any comparisons to how humans learn are irrelevant in this consideration. That machine is not a human. It's a machine. End of story. So even if you also consider a human brain a kind of "funky database", this makes no difference.
I haven't heard anyone saying that copilot is legal "just because it's AI." That's a pretty bad faith, reductive, and disingenuous representation. The core argument I've seen is that the output is sufficiently transformative and not straight up copying.
I wasn't really trying to address whether the argument is valid, I was just noting the representation of the other side here is reductive to the point of being in bad faith. I find that kind of rhetoric a little frustrating since it's kind of inflammatory, and, I believe, not particularly productive towards having honest/informative disagreements and discussion.
I think if another algorithm was used instead of ML that did the same job as Copilot, then people would be making the same arguments. I think it's just the case that ML is just the first tech capable of doing what Copilot is doing.
You can't copyright an algorithm, you can copyright a particular expression of one, or you can attempt to patent an algorithm, but two authors can legitimately write the same thing and not infringe on each others copyright unless one copied from the other.
Suppose you own the rights to a jpeg, and I apply a simple algorithm that increments every hex value, so 00 becomes 01 and so on. The gibberish images it spits out would be so different from your original image that you wouldn't have any claim to them at all.
So I may create a tool that is capable of "incrementing every hex value" of an image, and also of "decrementing every hex value", and then distribute any of your images after "incrementing the hex values", together with said tool, right?
Or maybe it would be enough to just zip your image to be allowed to distribute it? In the end, the bytes I would distribute would then "be so different from your original image that you wouldn't have any claim to them at all", right?
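In code, the "tool" described would be trivial, which is exactly the point (a toy TypeScript sketch, with Uint8Array standing in for the image's raw bytes):

function incrementBytes(data: Uint8Array): Uint8Array {
  return data.map((b) => (b + 1) & 0xff); // 0x00 -> 0x01, ..., 0xff wraps to 0x00
}

function decrementBytes(data: Uint8Array): Uint8Array {
  return data.map((b) => (b - 1) & 0xff); // exact inverse: the original is fully recoverable
}

Since the transform is a trivial bijection, the "gibberish" plus the tool is informationally identical to the original image.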
I encourage you to go get a copy of the latest hollywood blockbuster, apply your transformation, share it on the internet and see if the courts agree with your copyright hack.
Humans are just compression with extra steps by that logic.
There's a fairly simple technical fix for Codex/Copilot anyway: stick a search engine on the back end, index the training data, and don't output things that are found in the search engine.
If I were to memorize my employer's IP then reproduce it (almost) verbatim and give it to a competitor, then I would be setting myself up for a world of legal hurt.
So yes, it is like how human memory is compression with extra steps.
I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.
The real solution is very, very simple. Only use opt-in training data. Don't acquire codebases from people who didn't agree to it.
If I own a repository on GitHub and I have received contributions from other people, or included a .h file from mpv (a thing that I have done), do I still have the right to click the opt-in button? I didn't ask the other contributors.
But GitHub is in a position to scan my code and see if there are copy-pasted bits, and to disable the opt-in button in that case.
Except they act in bad faith so they wouldn't do that.
> I don't think that would work very well because there are not infinite ways to succinctly solve most programming problems. In fact the majority of solutions will look exactly the same.
Algorithms can't be patented or copyrighted, as they are pure mathematics. If an implementation of an algorithm has no creative content because it is succinct then it likely doesn't deserve copyright.
We built a filter to help detect and suppress the rare instances where a GitHub Copilot suggestion contains code that resembles public code on GitHub. You have the choice to turn that filter on or off during setup. With the filter on, GitHub Copilot checks code suggestions with its surrounding code for matches or near matches (ignoring whitespace) against public code on GitHub of about 150 characters. If there is a match, the suggestion will not be shown to you. In addition, we have announced that we are building a feature that will provide a reference for suggestions that resemble public code on GitHub so that you can make a more informed decision about whether and how to use that code, as well as explore and learn how that code is used in other projects.
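A minimal sketch of how such a filter could work, assuming a hypothetical publicCodeIndex set containing whitespace-stripped 150-character windows of public code (the window size and whitespace handling follow the description above; the real filter also considers surrounding code, which this simplification ignores):

function shouldShowSuggestion(
  suggestion: string,
  publicCodeIndex: Set<string>,
  window = 150
): boolean {
  const normalized = suggestion.replace(/\s+/g, ""); // ignore whitespace, per the filter description
  for (let i = 0; i + window <= normalized.length; i++) {
    if (publicCodeIndex.has(normalized.slice(i, i + window))) {
      return false; // match or near match against public code: suppress the suggestion
    }
  }
  return true;
}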
Attributions are fundamental to open source? I thought having source openly available was fundamental to open source (and allowed use without liability/warranty) as per apache, mit, and other licenses.
If they just stick to using permissively licensed source code then I'm not sure what the actual 'harm' is with co-pilot.
If they auto-generate an acknowledgement file for all source repos used in co-pilot, and then asked clients of co-pilot to ship that file with their product, would that be enough? Call it "The Extended Github Co-Pilot Derivative Use License" or something.
After five minutes of googling I'm still not sure if using MIT code requires an attribution, but many people claim it does, see https://opensource.stackexchange.com/a/8163 as one example
You could have read the MIT license in its entirety in less than five minutes. It is very clear that preserving attribution is a required condition. Other permissive licenses even explicitly require attribution in binaries / documentation.
MIT License:
Copyright <YEAR> <COPYRIGHT HOLDER>
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
> A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
People would likely not share any code if they could not trust that their work would be respected, and attributed. So yes, I believe it to be fundamental to open source.
People share proprietary code publicly. And the fact that you're allowed to read a book doesn't (currently) give you the right to copy it and redistribute the copy.
If I read 10 or 20 books about a topic and then go teach that topic to others, do I have to attribute each thing to all the authors from whom I learned it? And what if I come up with my own interpretation of a topic, do I have to trace it back to all the interpretations of all the authors that influenced it? Even more, do the previous authors also have to do that, and do I have to quote the whole chain of references? If not, why should an ML model, which is supposed to learn how coding works rather than memorize pieces of code verbatim, have to "because of copyright laws"?
It does give you the right to write excerpts from memory though. If it happens to exactly match the text in the book, nobody gets excited about that, even if you could potentially rewrite the whole book.
maybe that is true, but there exist others for whom that is not true, and as long as they number greater than zero, the argument that 'open source means free to use however for whatever' will be invalid
True and valid. But all those clauses, AFAIK, were written with the mindset of "if you want to run this code (particularly, but not limited to, for profit), you have to at least attribute it". Copilot allegedly doesn't run that code; it claims to read it, understand how it works, and then generate its own code that performs an equivalent function if requested. It's up to the lawsuit to decide if that's what it actually does, but my point is that the licenses simply did not cover this usage pattern, as much as no open source license requires any kind of action from someone who's just reading or studying the code.
> “AI” is just fancy speak for “complex math program”
Not really? It's less about arithmetic and more about inferencing data in higher dimensions than we can understand. Comparing it to traditional computation is a trap, same as treating it like a human mind. They're very different under the surface.
IMO, if this is a data problem then we should treat it like one. Simple fix - find a legal basis for which licenses are permissive enough to allow for ML training, and train your models on that. The problem here isn't developers crying out in fear of being replaced by robots, it's more that the code that it is reproducing is not licensed for reproduction (and the AI doesn't know that). People who can prove that proprietary code made it into Copilot deserve a settlement. Schlubs like me who upload my dotfiles under BSD don't fall under the same umbrella, at least the way I see it.
Who decides what constitutes an "AI program" vs just a "program"? What heuristic do we look at? At the end of the day, they have an equivalent of a .exe which runs, and outputs code that has a license attached to it.
I can suggest an idea, considering that the “AI program” is the model, not the training algorithm.
A program gets written by an entity (usually a person) and is executed to generate the desired output according to a deterministic mathematical function it expresses.
A training algorithm is a program that gets written to train a model (the model being the "AI program") when presented with some training data inputs, to implement a function that is not the training algorithm's own function, but another one, generalising over a problem domain beyond just the original examples fed to the training algorithm.
The output model is not the training algorithm or the training data (or an encoding of it) and exists as its own artefact, independent of both.
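As a toy illustration of that separation (hypothetical names; a one-variable linear fit in TypeScript):

type Model = { w: number; b: number };

// The training algorithm: one program, written by a person.
function train(xs: number[], ys: number[], epochs = 1000, lr = 0.01): Model {
  let w = 0;
  let b = 0;
  for (let e = 0; e < epochs; e++) {
    for (let i = 0; i < xs.length; i++) {
      const err = w * xs[i] + b - ys[i];
      w -= lr * err * xs[i]; // gradient step on the squared error
      b -= lr * err;
    }
  }
  return { w, b };
}

// The model is a separate artefact: neither the training code nor the
// training data, and it answers inputs it never saw during training.
const model = train([1, 2, 3, 4], [3, 5, 7, 9]); // fits y ≈ 2x + 1
const predict = (x: number) => model.w * x + model.b; // predict(10) ≈ 21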
That oughtn't be controversial, in fact I wouldn't even bother with 'on steroids', implying it's a slightly different/morphed thing. The way I learnt it (very slightly, at university, not a particular focus) it was abundantly clear it was just stats.
I bring the steroids thing up because it's only relatively recently that we've had the massive computing power at our finger tips that we do now. We discovered the foundations of our current ML techniques a relatively long time ago, it's only been recently that we've been able to throw data centers full of powerful GPUs and whatnot at them.
The only license that is permissive enough for AI training is CC0.
Art generators can't comply with attribution requirements and code generators don't know if and when they trip the GPL copyleft. I believe most permissive code licenses also have some kind of attribution requirement.
Who should be sued? Microsoft who produces an application known as "Copilot" which itself contains nobody else's code but Microsoft's? OR the person who USES Copilot, to produce code which contains somebody else's copyrighted code?
Using Copilot is a bit like using a shotgun, can be very illegal depending on what you shoot at. Creating and distributing the app Copilot is like creating and selling a shotgun.
Microsoft produces a service known as "Copilot" which does contain other people's code. That the Copilot network contains other people's code is not in question, since it has been demonstrated to output other people's code, and Microsoft even added (very limited) filters to detect if it outputs other people's code.
Copilot only generates copyrighted code when it has seen that code many, many times; that's called memorization in machine learning, and machine learning researchers always try to decrease the amount of memorization in their artificial neurons.
Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
This is kinda smug, because it overcomplicates things for no reason, and only serves as a faux technocentric strawman. It just muddies the waters for a sane discussion of the topic, which people can participate in without a CS degree.
The AI models of today are very simple to explain: it's a product built from code (already regulated, produced by the implementors) and source data (usually works that are protected by copyright and produced by other people). It would be a different product if it hadn't used the training data.
The fact that some outputs are similar enough to source data is circumstantial, and not important other than for small snippets. The elephant in the room is the act of using source data to produce the product, and whether the right to decide that lies with the (already copyright protected) creator or not. That's not something to dismiss.
It's not something to dismiss but it is something that has already been addressed. Authors Guild v Google. Google Books is built upon scanning millions of books from libraries without first gaining permission from copyright holders, this was found to not be a violation of copyright.
Building a product on top of copyright works that does not directly distribute those works is legal. More specifically, a computer consuming a copyright work is not a violation of copyright.
At the time the suit was launched, Google search would only display snippet views. The very nature presents the attribution to the user, enabling them to separately obtain a license for the content.
This would be more or less analogous to Copilot linking to lines in repositories. If Copilot was doing that, there wouldn't be much outrage.
The fact that they are producing the entire relevant snippet, without attribution and in a way that does not necessitate referencing the source corpus, suggests the transgression is different. It is further amplified by the fact that the output itself is typically integrated in other copyrighted works.
Attribution is irrelevant in Authors Guild, the books were not released under open source licenses where attribution is sufficient to meeting the licensing terms. Google never sought or obtained licenses from any of the publishers, and the court ruled such a license was not needed as Google's usage of the contents of the books (scanning them to build a product) did not represent a copyright infringement.
Attribution is mentioned in this filing because such attribution would be sufficient to meet the licensing terms for some of the alleged infringements.
It's an irrelevant discussion, though: the suit does not make a claim that the training of Copilot was an infringement, which is where Authors Guild is controlling precedent.
> Authors Guild v Google. Google Books is built upon scanning millions of books from libraries
I agree it's relevant precedent, but not exactly the same. Libraries are a public good and more importantly Google books references the original works. In short, I don't think that's the final word in all seemingly related cases.
> More specifically, a computer consuming a copyright work is not a violation of copyright.
I don't agree with this way of describing technology, as if humans weren't responsible for operating and designing the technology. Law is concerned with humans and their actions. If you create an autonomous scraper that takes copyrighted works and distributes them, you are (morally) responsible for the act of distributing them, even if you didn't "handle" them or even see them yourself.
Neither of the important aspects – remixing and automation – is novel, but the combination is. That's what we should focus on, instead of treating AI as some separate anthropomorphized entity.
Your disagreement and feelings about how copyright and the law should work are valid, but they have very little to do with how copyright is addressed judicially in the United States.
In which case Google paid some hundred million dollars to companies and authors, created a registry collecting revenues and distributing them to rightsholders, provided an opt-out for already-scanned books, etc. Hey, it doesn't sound that bad for the same thing to happen with Copilot.
A) No it doesn't, there's nothing in the Copilot model or the plugin that represents or constitutes a reproduction of copyright code being distributed by GH/MS. The allegation is it generates code that constitutes a copyright violation. This distinction is not academic, it's significant, and represents an unexplored area of copyright law.
B) "parts of" copyright works are not themselves sufficient to constitute a copyright violation. The violation must be a substantial reproduction. While it's up to the court to determine if the alleged infringements demonstrated in the suit (I'm sure far more will be submitted if this case moves forward) meet this bar, from what I've seen none of them have.
Historically the bar is pretty high for software, hundreds or thousands of lines depending on use case. A purely mechanical description of an operation is not sufficient for copyright, you cannot copyright an implementation of a matrix transformation in isolation no matter what license you slap on the repo. Recall that the recent Google v Oracle case was litigated over tens of thousands of lines of code and found to be fair use because of the context of those lines.
I've yet to see a demonstrated case of Copilot generating code that is both non-transformative and represents a significant reproduction of the source work.
> The allegation is it generates code that constitutes a copyright violation.
The weights of the Copilot model very likely contain verbatim parts of the copyrighted code, just like in a zip archive. It chooses semi-randomly which parts to show, and sometimes breaks copyright by displaying large enough pieces.
Say you publish a song and copyright it. Then I record it and save it in .xz format. It's not an MP3; it is not an audio file. Say I split it into N chunks and share them with N different people. Or with the same people, but at N different dates. Say I charge them $10 a month for doing that, and I don't pay you anything.
Am I violating your copyright? Am I entitled to do that?
To make it funnier: say instead of the .xz, I "compress" it via π compression [1]. So what I share with you is a pair of π indices and data lengths for each of them, from which you can "reconstruct" the audio. Am I violating your copyright by sharing that?
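The idea is easy to sketch, assuming we already have π's digits as a string (here just a hardcoded prefix). Everything below is a toy for illustration, and note the punchline: the (index, length) pair is usually bigger than the data it "compresses":
// Toy "pi compression": store data as an (index, length) pair into pi's
// digits. PI_DIGITS is just a hardcoded prefix for illustration; a real
// scheme would need arbitrarily many digits.
const PI_DIGITS = "14159265358979323846264338327950288419716939937510";
function piCompress(digitString: string): { index: number; length: number } | null {
  const index = PI_DIGITS.indexOf(digitString);
  return index === -1 ? null : { index, length: digitString.length };
}
function piDecompress(ref: { index: number; length: number }): string {
  return PI_DIGITS.slice(ref.index, ref.index + ref.length);
}
// piCompress("5926") yields { index: 3, length: 4 };
// piDecompress({ index: 3, length: 4 }) reconstructs "5926".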
I take your code and I compress it in a tar.gz file. I'll call that file "the model".
Then I ask an algorithm (Gzip) to infer some code using "the model".
The algorithm (gzip) just learned how to code by reading your code. It just happened to have it memorized in its model.
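To make the analogy literal, here's a sketch using Node's zlib (run under Node.js; the snippet being "memorized" is arbitrary):
// Gzip as a "model": training is compression, inference is decompression.
import { gzipSync, gunzipSync } from "zlib";
const yourCode = `function isEven(n: number): boolean {
  return n % 2 === 0;
}`;
// "Training": your code goes in, a blob of math-transformed bytes comes out.
const model: Buffer = gzipSync(Buffer.from(yourCode, "utf8"));
// "Inference": another sequence of math steps emits your code verbatim.
const inferred = gunzipSync(model).toString("utf8");
console.log(inferred === yourCode); // true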
With the exception that there are infinite types of chords in this case. And even though many musicians follow familiar chord structures, the underlying melodies and rhythms are unique enough for any familiar person to be able to differentiate the Red Hot Chili Peppers from the All-American Rejects; and now there is a system where the All-American Rejects hit a few buttons and a song is generated (using audio samples of "Under the Bridge") that sounds like "Under the Bridge pt 2, All-American Rejects Boogaloo".
That's why it's actionable and why there is meat on the bone for this case. The real issue is going to be whether they can convince a jury that this software is just stealing code, and whether it's wrong if a robot does it.
Google doesn't sell its search feature as a product that you can just plagiarize the results from and they're yours. Microsoft does that with Copilot.
Copilot is as much of a search engine as Stable Diffusion or DALL-e are, which is to say they aren't at all. If you want to compare it to a search engine, despite it being a tortured metaphor, the most apt comparison is not to Google, but to The Pirate Bay if TPB stored all of their copyrighted content and served it up themselves.
With Copilot it's your responsibility not to use it as a search engine to copy-paste code. It's completely obvious when it's being used as a search engine so it's not a problem at all.
Stable Diffusion works on completely different principles, and it can't exactly replicate pixels from its training data.
Ok, cool. Presumably that is because it’s smart enough to know that there is only one (public) solution to the constraints you set (like asking it to reproduce licensed code).
Now, while you may be able to get it to reproduce one function, reproducing one file, let alone the whole repository, seems extremely unlikely.
Just to be clear: I cannot prove that they have used my code, but for the sake of argument, let's assume so.
They would have directly used my code when they trained the thing. I see it as an equivalent of creating a zip-file. My code is not directly in the zip file either. Only by the act of un-zipping does it come back, which requires a sequence of math-steps.
But there is no equivalent of "unzipping" for Copilot.
This is a generative neural network. It doesn't contain a copy of your code; it contains weightings that were slightly adjusted by your code. Getting it to output a literal copy is only possible in two cases:
- If your code solves a problem that can only be solved in a single way, for a given coding style / quality level. The AI will usually produce the same result, given the same input, and it's going to be an attempt at a solution. This isn't copyright violation.
- If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
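To make that second case concrete, here's a deliberately crude sketch: a character-level Markov model, vastly simpler than Copilot, trained on a single snippet. With only one training document, "sampling" collapses into verbatim regurgitation. All names and parameters are illustrative:
// Crude memorization demo: a character-level Markov model trained on one
// snippet has no choice but to reproduce that snippet. Purely illustrative.
function train(text: string, order = 4): Map<string, string[]> {
  const transitions = new Map<string, string[]>();
  for (let i = 0; i + order < text.length; i++) {
    const context = text.slice(i, i + order);
    const next = text[i + order];
    if (!transitions.has(context)) transitions.set(context, []);
    transitions.get(context)!.push(next);
  }
  return transitions;
}
function generate(model: Map<string, string[]>, seed: string, order = 4, maxLen = 200): string {
  let out = seed;
  while (out.length < maxLen) {
    const candidates = model.get(out.slice(-order));
    if (!candidates) break; // context never seen: generation stops
    // With one training document there is usually exactly one candidate,
    // so random "sampling" degenerates into verbatim copying.
    out += candidates[Math.floor(Math.random() * candidates.length)];
  }
  return out;
}
const model = train("function isEven(n) { return n % 2 === 0; }");
console.log(generate(model, "func")); // prints the training snippet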
There is no guarantee that an ML network only produces the input data under those two conditions. But even for the second case:
> If 'your' code has actually already been replicated hundreds of times over, such that the AI was over-trained on it. In that case it's a copyright violation... but how come you never went after the hundreds of other violations?
Replication is not a violation if the terms of the license are followed. Many open source projects are replicated hundreds of times with no license violation - that doesn't mean that you can now ignore the license.
But even if they did violate the license, that doesn't give you the right to do it too. There is no requirement to enforce copyright consistently - see e.g. mods for games which are more often than not redistributing copyrighted content and derivatives of it but usually don't run into trouble because they benefit the copyright owner. But try to make your own game based on that same content and the original publisher will not handle it in the same way as those mods. Same for OSS licenses: The original author does not lose any rights to sue you if they have ignored technical license violations by others when those uses are acceptable to the original author.
Neural nets can and do encode and compress the information they're trained on, and can regurgitate it given the right inputs. It is very likely that someone's code is in that neural net, encoded/compressed/however you want to look at it, which Copilot doesn't have a license to distribute.
You can easily see this happen, the regurgitation of training data, in an overfitted neural net.
This is not necessarily true; the function space defined by the hidden layers might not contain an exact duplicate of the original training input for all (or even most) of the training inputs. Things that are very well represented in the training data probably have a point in the function space that is "lossy compression" level close to the original training input, though, not so much in terms of fidelity as in changes to minor details.
When I say encoded or compressed, I do not mean verbatim copies. That can happen, but I wouldn't say it's likely for every piece of training data Copilot was trained on.
Pieces of that data are encoded/compressed/transformed, and given the right incantation, a neural net can put them together to produce a piece of code that is substantially the same as the code it was trained on. Obviously not for every piece of code it was trained on, but there's enough to see this effect in action.
> which Copilot doesn't have a license to distribute
When you upload code to a public repository on github.com, you necessarily grant GitHub the right to host that code and serve it to other users. The methods used for serving are not specified. This is above and beyond the license you choose for your own code.
You also necessarily grant other GitHub users the right to view this code, if the code is in a public repository.
Host that code. Serve that code to other users. It does not grant the right to create derivative works of that code outside the purview of the code's license. That would be a non-starter in practice; see every repository with GPL code not written by the repository creator.
Whether the results of these programs is somehow Not A Derivative Work is the question at hand here, not "sharing". I think (and I hope) that the answer to that question won't go the way the AI folks want it to go; the amount of circumlocution needed to excuse that the not actually thinking and perceiving program is deriving data changes from its copyright-protected inputs is a tell that the folks pushing it know it's silly.
Actually, The Pirate Bay was even less of an infringement, as they did not distribute the copyrighted content or derivatives themselves, only indexed where it could be found. With Copilot, all the content you're getting goes through Microsoft.
"We need the legal right to do things like host Your Content, publish it, and share it. You grant us and our legal successors the right to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video.
This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program."
It's served under the terms of my licenses when viewed on GitHub. Both attribution and licenses are shared.
This is like saying GitHub is free to do whatever they want with copyrighted code that's uploaded to their servers, even use it for profit while violating its licenses. According to this logic, Microsoft can distribute software products based on GPL code to users without making the source available to them in violation of the terms of the GPL. Given that Linux is hosted on GitHub, this logic would say that Microsoft is free to base their next version of Windows on Linux without adhering to the GPL and making their source code available to users, which is clearly a violation of the GPL. Copilot doing the same is no different.
> It is not directly using your code any more than programmers are using print statements. A book can be copyrighted, the vocabulary of language cannot. A particular program can be copyrighted, but snippets of it cannot, especially when they are used in a different context.
So what? Why shouldn't we update the rules of copyright to catch up to advances in technology?
Prior to the invention of the printing press, we didn't have copyright law. Nobody could stop you from taking any book you liked, and paying a scribe to reproduce it, word for word, over and over again. You could then lend, gift, or sell those copies.
The printing press introduced nothing novel to this process! It simply increased the rate at which ink could be put to pages. And yet, in response to its invention, copyright law was created, that banned the most obvious and simple application of this new technology.
I think it's entirely reasonable for copyright law to be updated, to ban the most obvious and simple application of this new technology, both for generating images, and code.
> Your code is not in that thing. That thing has merely read your code and adjusted its own generative code.
Completely incorrect. False dichotomy. It's widely known that AI can and does memorize things just like humans do. Memorization isn't a defense to violating copyright, and calling memorization "adjusting a generative model" doesn't make it stop being memorization.
If you memorized Microsoft's code in your brain while working there and exfiltrated it, the fact that it passed through your brain wouldn't be a defense. Substituting "generative model" for "brain" and the fact that it's a tool used by third parties doesn't change this.
It is essentially a weighted sum of your code and other copyright holders' code. Do not let the mystique of AI fool you. Copilot does not learn, it glues.
If I read JRR Tolkien and then go and write a fantasy novel following an unexpected hero on his dangerous quest to undo evil, I haven't infringed, even if I use some of Tolkien's better turns of phrase.
Copyright laws, if enforced perfectly, would make programming simply impossible. We've been skating by on people not really enforcing them, despite the laws still being on the books, and the existence of tools like this makes that not a viable strategy. Today it's Copilot, which can be shut down, but tomorrow it'll be something developers can run at home. Bits don't have colour; there's no way to distinguish between a copy happening by independent recreation*, and one that's actually a copy. So we'll need proper rulings.
In fact, considering Fauxpilot, that will happen as soon as the models have improved somewhat.
*: Of course I don't think "independent recreation" is really a thing. Humans are excellent at open source laundering. It's called "learning".
The AFC test is a three-step process for determining substantial similarity of the non-literal elements of a computer program. The process requires the court to first identify the increasing levels of abstraction of the program. Then, at each level of abstraction, material that is not protectable by copyright is identified and filtered out from further examination. The final step is to compare the defendant's program to the plaintiff's, looking only at the copyright-protected material as identified in the previous two steps, and determine whether the plaintiff's work was copied. In addition, the court will assess the relative significance of any copied material with respect to the entire program.
Abstraction
The purpose of the abstraction step is to identify which aspects of the program constitute its expression and which are the ideas. By what is commonly referred to as the idea/expression dichotomy, copyright law protects an author's expression, but not the idea behind that expression. In a computer program, the lowest level of abstraction, the concrete code of the program, is clearly expression, while the highest level of abstraction, the general function of the program, might be better classified as the idea behind the program. The abstractions test was first developed by the Second Circuit for use in literary works, but in the AFC test, they outline how it might be applied to computer programs. The court identifies possible levels of abstraction that can be defined. In increasing order of abstraction, these are: individual instructions, groups of instructions organized into a "hierarchy of modules", the functions of the lowest-level modules, the functions of the higher-level modules, and the "ultimate function" of the code.
Filtration
The second step is to remove from consideration aspects of the program which are not legally protectable by copyright. The analysis is done at each level of abstraction identified in the previous step. The court identifies three factors to consider during this step: elements dictated by efficiency, elements dictated by external factors, and elements taken from the public domain.
The court explains that elements dictated by efficiency are removed from consideration based on the merger doctrine which states that a form of expression that is incidental to the idea cannot be protected by copyright. In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright.
Eliminating elements dictated by external factors is an application of the scènes à faire doctrine to computer programs. The doctrine holds that elements necessary for, or standard to, expression in some particular theme cannot be protected by copyright. Elements dictated by external factors may include hardware specifications, interoperability and compatibility requirements, design standards, demands of the market being served, and standard programming techniques.
Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis.
Comparison
The final step of the AFC test is to consider the elements of the program identified in the first step and remaining after the second step, and for each of these compare the defendant's work with the plaintiff's to determine if the one is a copy of the other. In addition, the court will look at the importance of the copied portion with respect to the entire program.
The brain is also just a "complex math program", since math is just the language we use to describe the world. I don't feel this argument has any weight at all.
The legal world tends to be less interested in these kind of logical gotchas than engineering types would like. I don't see a judge caring about that brain framing at all.
Not to mention, if your brain starts outputting Microsoft copyright code, they're going to sue the shit out of you and win, so I'm not sure how that would help even so.
So if I read the Windows Explorer source code, then later produced a line-for-line copy (without referring back to the source), Microsoft couldn't sue me?
Explain yourself. There is no understood natural phenomenon which we could not capture in math. If you argue the behavior of the brain cannot be modeled using a complex math program, you are claiming the brain is qualitatively different from any mechanism known to man since the dawn of time.
The physics that gives rise to the brain is pretty much known. We can model all the protons, electrons and photons incredibly accurately. It's an extraordinary claim to say the brain doesn't function according to these known mechanisms.
You are confusing the nondiscrete math of physics with the discrete math of computation. Even with unlimited computational resources, we can't simulate arbitrary physical systems exactly, or even with limited error bounds (see chaos theory). What a program (mathematical or not) in the Turing-machine sense can do is only a tiny, tiny subset of what physics can do.
Personally I believe it’s likely that the brain can essentially be reduced to a computation, but we have no proof of that.
> We can model all the protons, electrons and photons incredibly accurately.
We can't even accurately model a receptor protein on a cell or the binding of its ligands, nor can we accurately simulate a single neuron.
This is one of those hard problems in computing and medicine. It is very much an open question about how or if we can model complex biology accurately like that.
> There is not a understood natural phenomenon which we could not capture in math.
This is a belief about our ability to construct models, not a fact. Models are leaky abstractions, by nature. Models using models are exponentially leaky.
> I didn't say we can simulate it.
Mathematics (at large) is descriptive. We describe matter mathematically, as it's convenient to make predictions with a shared modeling of the world, but the quantum of matter is not an equation. f() at any scale of complexity, does not transmute.
I'm using simulate as a synonym for model. For any biological model at the atomic, molecular and protein levels, accuracy is key for useful models. What I'm saying is that accuracy at that level is a hard problem in computing and biology, and even simple protein interactions are hard problems.
> There is not a understood natural phenomenon which we could not capture in math.
You are saying "If we know how something works, we can explain how it works using math."
But we know almost nothing about how the brain works.
> The physics that gives rise to the brain is pretty much known.
...no it is not! No physicist would describe any physical phenomenon as being "pretty much known". Let alone cognition. We don't even have a complete atomic model.
I think you are mostly correct but most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality. For some reason the brain is "magic" and non-physical unlike other organs (and everything else that exists) to most people. It's almost impossible to convince anyone of this though and it's not even worth trying.
> most people don't like this explanation and choose to believe in magic or spirits or whatever instead of physical reality.
You have it reversed. Math is a language tool to describe things, in a limited fashion (our current modeling); physical matter (even antimatter) is another thing entirely. If you believe that there will be a language that can describe anything, it still doesn't manifest matter by speaking that language or describing it... unless you're into magic or spirits or whatever.
This disconnect has nothing to do with how well we do or do not understand physical phenomena. I think what the OP meant to say (and probably you support) is how the "mind" or how we think, can be described with mathematical models. Maybe one day we will have a full understanding, but we're not there yet and not currently in a way that is legally compelling.
I feel like this is a massive oversimplification...
In this answer, you're completely ignoring the massive fact that we cannot create a human brain. Having mathematical models of particles does not mean we have "solved" the brain. Unless you also believe that these LLMs are actually behaving just like human brains, in that they have consciousness, they have logic, they dream, they have nightmares, they produce emotions such as fear, love, anger, that they grow and change over time, that they control your body, your lungs, heart, etc...
You see my point, right? Surely you see that the statement 'The brain is also just a "complex math program"' is at best extremely over-simplistic.
There's certainly no model of a brain at the level of protons, electrons and photons. That's way beyond our level of mathematical understanding or computational ability. Biology isn't understood at the level of physics.
Somewhere in the complex math is the origin of whatever it is in intellectual property that we deem worthy of protection. Because we are humans, we take the complex math done by human brains as worthy of protection by fiat. When a painter paints a tree, we assign the property interest in the painting to the human painter, not the tree, notwithstanding that the tree made an essential contribution to the content. The whole point is to protect the interests of humans (to give them an incentive to work). There is no other reason to even entertain the concept of "property".
As long as AIs are incapable of recognizing when they are plagiarizing, as humans are generally capable of, the double standard seems entirely warranted.
Well, that you caught yourself is already something that makes a difference. It would already change the equation if Copilot would send an email saying “Hey, that snippet xyz I suggested yesterday is actually plagiarized from repo abc. I’m truly sorry about that, I’ll do my best to be more careful in the future.”
As far as “citation needed”, humans are being convicted for plagiarism, so it is generally assumed that they are able to tell and hence can be held responsible for it.
Responsibility or liability is really the crux here. As long as AIs can’t be made liable for their actions (output) like humans or legal entities can, instead the AI operators must be held accountable, and it’s arguably their responsibility to take all practical measures to prevent their AIs from plagiarizing, or from otherwise violating license terms.
At this point we are back in the territory that the idea and the expression of the idea are inseparable, therefore the conclusion will be that copyright protection does not apply to code.
Personally, I think this has the potential to blow up in everyone's faces.
If it does end up that way, I feel like the trickle away from github will become a stampede. And that would be unfortunate. Having such a good hub for sharing and learning code is useful, but only if licenses are respected. If not, people will just hunker down and treat code like the Coke secret recipe. That benefits no one.
The problem with the class action lawsuit against GitHub is this: if you host your code on GitHub, it doesn't matter what license you use. Microsoft can do whatever they want with it. You agreed to this by agreeing to their terms and conditions.
The end user agreement also says you must have the authority to grant these epic rights to GitHub, i.e. you cannot upload someone else's code. They could probably absolve themselves from responsibility due to your having committed wire fraud in this case. But, alas, IANAL.
If I have access to the source of a BSD, MIT, or GPL project - is there anything in those licenses that would prevent me from mirroring it on GitHub or GitLab?
I am sorry for not bringing any kind of legal perspective here, but:
*Jesus Christ*, I hope I live long enough to see copyright die. Here we are at the cusp of a new paradigm of commanding computers to do stuff for us, right at the beginning of the first AI development which actually impresses me.
And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I am also deeply disappointed in HackerNews; where is that deep hatred of patent trolls and smug satisfaction whenever something gets cracked or pirated now?
On piracy, HN users defend Sci-Hub to protest against the academic publishing industry, which involves large corporations such as Elsevier charging publishing and subscription fees that are much more than the value that these corporations bring to the actual research, review, and publication. Academics need to publish in order to survive, and they individually do not have enough power to subvert the existing academic publishing system. Since academics do not receive royalties, Sci-Hub enables academics to pay less into the same system that exploits them for profit. By supporting Sci-Hub, HN users take a populist stance by supporting individuals against the system.
The situation with Microsoft and Copilot is the exact opposite. Here, Microsoft is misusing its acquisition of GitHub to repackage the work of individual free and open source contributors into a proprietary product in violation of the authors' software licenses. These licenses do not even require Microsoft to pay. They only require attribution and redistribution under a compatible license. Supporting Microsoft's misuse of GitHub is an anti-populist stance that puts the interests of the corporation over the interests of the individuals.
The argument for libgen is that whatever damage there is to the authors missing out on revenues is outweighed by people being able to get books that they otherwise wouldn’t be able to afford (especially in developing countries).
In the case of copilot, the damage suffered by the authors is close to zero. And those who benefit the most are the authors themselves. A double digit percent productivity enhancement is worth more to me than a few million $ to a trillion dollar company, especially because MS has to pay for compute.
You're assuming that Microsoft needs to shut down Copilot to comply with the licenses of the software they misused. That is not the case. To make Copilot legitimate, all Microsoft has to do is restrict Copilot's inputs to non-proprietary code, release the Copilot dataset under a compatible license, and clarify that the code generated by Copilot is also covered under that license. Attribution can be done by inserting a comment with link to a paginated list of all of the contributors whose code was used in Copilot.
Microsoft can even continue to sell Copilot as a service while keeping it license-compliant, since most developers are not going to self-host the entire dataset. Microsoft can also choose to exclude copyleft-licensed code from Copilot or create multiple flavors of Copilot, each licensed differently. You can get your "productivity enhancement" without needing Microsoft to violate software licenses.
The damage is not in the monetary payment denied to free and open source software contributors, payment these contributors never demanded. The damage is in Microsoft violating other people's software licenses to create a proprietary product derived from copyleft-licensed and attribution-required code, and in Microsoft encouraging other developers to violate these licenses. Microsoft needs to rectify these violations with specific performance.
Yes, the passion to defend copyright trolling for isEven functions (an example included in the filing) from people here is bizarre.
I can't decide if people just hate Microsoft enough that a future where you must pay to include an isEven function in your code is a price worth paying to give them a bloody nose, or if there is just a large contingent of users making millions off their GPL code who are put out.
More seriously yes, copilot damages copyright (or is perceived to) and that is a good outcome irrespective of the actor. I will never see eye to eye with people defending the existing legal framework.
The idea that using copilot "damages" copyright seems ill-considered.
Suppose a commercial software company just took GPL-licensed software, openly incorporated it into their own code, and then sold it. Would that "damage" copyright also? Remember, there's no legal principle that says "if we catch you violating copyright, your stuff is now free". The copyright holder can sue for damages or to stop distribution and that's it.
People like Larry Ellison of Oracle have claimed they just steal GPL'd stuff 'cause it's there. But Oracle defends its copyrighted code very aggressively. Oppositely, the GPL is intended to allow more open access than public domain in a time when commercial companies want to take anything they can get.
1) This "copilot is great 'cause copyright is evil" argument breaks when you look at the fact that copilot is copyrighted, closed software tool for producing closed, copyrighted software. If you trained copilot on GPL'd software and specified that copilot's output was also GPL'd maybe you'd have some reasonable claim (but even then, the attribution claim would come in).
2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given the current situation: Microsoft/OpenAI selling to commercial software developers who sell to the general public.
>2) So far, these tools are "better search" schemes, not actual intelligence. Sure, many find them very useful. But given this, the (voluntary or involuntary) providers of data ought to get credit/benefit for/from this phenomenon, along with the tool creators. Especially given the current situation: Microsoft/OpenAI selling to commercial software developers who sell to the general public.
What they are good at is predicting what's after the text. The problem of predicting what's next could be used to create a universal artificial intelligence (there's a mathematical definition for this). I.e. if you have a system which is very good at predicting what's next, you could get to very powerful AI.
> What they are good at is predicting what's after the text. The problem of predicting what's next could be used to create a universal artificial intelligence (there's a mathematical definition for this). I.e. if you have a system which is very good at predicting what's next, you could get to very powerful AI.
The intelligence of human beings isn't unspecifically good at "predicting what's next"; rather, it is good at particular sorts of predictions in particular contexts, often involving the person having helped create the situation. I'm fairly safe at driving because I maintain an arrangement of my vehicle in a fashion that allows me to predict easily what's next, as well as allowing me to adjust if my predictions are wrong. Self-driving software might predict what's next as well as me in normal circumstances, but it's neither aware of larger context nor does it do things to maintain "smooth traffic flow".
Oppositely, being able to predict anything generically would certainly be limitless intelligence, but you can't describe any real system with just that. Copilot is trained with a certain window, with the transformer's special element giving more context, but I don't think very many people doing current research expect that to become generic prediction. I think I'm describing the consensus that it's a "better Google" for finding code one can use - and Google is a pretty good resource for this - if you aren't doing something unusual or difficult.
Jeff Hawkins also makes the "prediction is intelligence" claim, but I think your approach and his miss that human intelligence is good not by being generic but by doing more specific things.
The entire response to this suit on this site is mind-blowing to me. Everyone is up in arms that someone trained an AI model that could potentially spit out tiny, twisted fragments of public, open-source code. This response is nothing but selfish behavior that runs counter to the core principles of open source development and the free software movement.
Your code is protected by copyright. The license allows for what would otherwise be copyright infringement.
But training an AI model on media (code or otherwise) is not copyright infringement, so the license is irrelevant.
It's selfish to pretend otherwise and to try to assert a copyright right that doesn't exist, for the purpose of impeding progress in a field that benefits us all.
Here is a snippet of things that copyright is intended to cover:
>the right to exclude others from making certain uses of the work: copying it, making a derivative work based on that work, distributing copies of the work to the public, and publicly performing or displaying the work.
So why would "training" "AI" on code with the intention of emitting derived works not be copyright infringement exactly?
This product is transforming copyrighted code into something that's intended to be used or sold in other works. The snippets it emits are directly derived from copyrighted code.
The most common argument against this is that humans also learn from copyrighted material. My argument against this is that CoPilot is not a human and should not be assumed to inherit rules intended for humans.
>in a field that benefits us all
As it stands currently, CoPilot is proprietary and does not benefit anyone except Microsoft. If CoPilot were released under a FOSS license it would actually benefit us all. Most of the people against CoPilot are not against AI, but rather against a proprietary AI product transforming FOSS work into other potentially proprietary works, with the intention of profiting off of the completion service and hoarding the code that powers it.
> But training an AI model on media (code or otherwise) is not copyright infringement, so the license is irrelevant.
Well, maybe. But even if we assume that this is true, when anyone later uses the AI to reproduce a copy of the code, a copy has been made and copyright has been infringed.
If I need code to loop over 10 lines, I'll code a for loop the same way regardless of what I'm developing.
Define for me: at what point of complexity does code get copyrighted?
The things copilot is outputting are literally small chunks of code that need a lot of cleanup afterwards. It's not like I type "Build twitter for me" and BAM, I get a working clone of twitter.
Every line that you produce is copyrighted by you, automatically. There is no "obviousness" test for copyright: if you decide to code a for loop, then that for loop is copyrighted the moment you type it in.
> copilot is outputting literally small chunks of code that need a lot of cleanup afterwards
If you start from copyrighted chunks, then clean them, you're still violating people's copyright. Multi-million dollar lawsuits have been fought over people using small samples from other people's music, cleaning them, and releasing them as parts of their own song.
Well yeah, it is. Some function that you wrote years ago in an hour having one line taken out of it to be used in a project that wouldn't have made money anyways doesn't hurt you at all regardless of what license is on your project. Getting upset because "that was MINE!" does come off as very selfish.
Nobody wants to completely ban AI or training. You can have progress in this area and still respect people's intellectual property. Framing this as destroying progress is silly.
Why can't they train on the code they own, such as Windows sources for example?
Or even better, why can't they release CoPilot itself under an open source license that is compatible with the licenses of code they would like to train on?
Also, I don't think anyone cares about the monetary aspects. The idea behind GPL-style licenses is to make sure that code remains free, regardless of what or who uses it. Freedom in this context refers to the ability to study the code, modify the code, and distribute any modifications. Without the GPL, the code can be used in a proprietary product which strips those rights away from users of the product.
Copyleft uses copyright law to attempt to guarantee freedom for users. This is the inverse of what normal copyright does, which is to allow a single entity to sit on the ideas and not allow others to benefit from them.
If we can just strip copyleft licenses from projects, we are giving up those guarantees that GPL code will remain free for all users.
The GPL is trying to do its job here, not slow down progress. Progress would be everyone benefiting from the technology behind CoPilot, rather than just Microsoft sitting on the project and selling it as a service.
> And we are fucking bickering about how we were cheated out of $0.00034 because our repo from 2015 might have been used for training.
I just hope Microsoft AutoPlagiarist is not the Final Solution to Free Software they have been seeking since before the millennium's turn.
Seems to me this discussion is likely to pivot on a fulcrum located between "old enough to remember Microsoft before Bill Gates began spending his ill-gotten gains on philanthropy" and "young enough to see Microsoft primarily as the Xbox people".
I am only slightly hopeful for a lawsuit so that it loses spectacularly and sets a precedent that this is legal; however I don't trust the legal system enough to think they'll logically reach that conclusion.
My code is 100% in GitHub Copilot; is there any way to publicly say that I'm against the lawsuit even if they pretend to represent me?
If you'd license or relicense all your code to explicitly allow AI training without attribution on it, I would see that as you saying you're in favor of copilot here.
Is it patent trolling when you are defending your future labor from being made obsolete by megacorps and singularitarians using your past labor without permission?
Patent trolls that are hated look like: You develop package to do X from first principles, then get sued because someone patented using a known algorithm for the purpose of X.
Copyright working in a supported/non hated way: You develop a package to do X by cribbing off someone else's package X. They sue you for stealing their work, not to make money off you.
Situation at hand is case 2, hence the lack of interest in financial gain.
Why is this case 2, when it does not always reproduce the copyrighted works exactly? Situation: you realise that rather than cribbing off of one person's package X, you can crib off two other package Xs and mix/average their contents. Scale this to hundreds of packages.
Eventually, ML should avoid this by developing to work from first principles, writing in its own style, with public code used only for validation of its ability to understand and write code.
Nope. It's about fairness. Until Microsoft/Google/Apple/BigCorp all release their software/designs/maps, count me in favor of copyright for the small guy. And I'm someone who especially hates copyright/patents.
Agreed. My only consolation is that as the technology improves and it becomes easier to train these types of models on modest hardware at home, the detractors of this technology have already lost, but rather than a mercy killing they prefer to bleed out slowly.
The general attitude of AI researchers/developers is fuck as many other industries as possible, they generally don't care about consent at all. So it's hardly surprising that someone eventually decides to challenge that.
Not just copyright: what Copilot does is fair use. If Google takes tens of thousands of lines directly and gets away with it, it's going to be impossible to see any logic under which AI being trained on millions of lines of code is not fair use with respect to any individual open source copyright holder.
You haven’t actually thought through what kind of world it would be if there was no copyright law, have you? I don’t know what your political leanings are, but I’ve met some libertarians who are blissfully naive about the extent to which their world and worldview is buttressed by laws and the governments that enforce them, and your comment reminds me of that.
I'm from a country which basically ignores all copyright laws in practice. The musical scene relies on live sessions to generate any money, and the movie scene (which was never anything special) is mostly dead. That's because any media that hits the market is copied and sold everywhere. Even books, if they get popular, will get copied and sold, with little enforcement, at every corner.
This is mostly because copying requires little effort compared to the act of creating. So there is no incentive to create, because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one copy and copies it to sell at 25% of the price. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
> the movie scene (which was never anything special) is mostly dead
To be fair, from my experience most countries don't have much of a movie scene even with copyright and instead mostly import hollywood stuff.
> This is mostly because the means to copy requires little effort compared to the act of creating. So, there is no incentive to create because you wouldn't make a living out of it. Imagine spending two years writing a book and someone buys one and copy it to sell at 25%. He can make a profit at a lower threshold than you, so you as a creator cannot compete.
So don't compete by selling copies but by funding the creation up front. No one is claiming that abolishing copyright won't be disruptive to any existing business models - in fact, that's the point: once something becomes part of our shared culture it is ridiculous to let one entity continue to have exclusive rights so if your business model relies on continued royalties, find a better one.
Otherwise, perhaps consider continued payments to everyone who built your house, computer and whatever else you use if you think that is a great way for society to function. Don't worry, the way things are going we might get there via technical means anyway.
What you describe is true everywhere. Books were never a get-rich-quick scheme. There are like seven bands in all of history that get significant revenue from anything other than live shows.
But what do you mean, no incentive to create?
The Tao Te Ching was reluctantly written after the author was begged by his pupils. Most Greek philosophers' teachings were only written down after their deaths because other people thought that was an important job. On The Origin Of Species is a book because that was just the normal way to communicate scientific findings in Darwin's time. Da Vinci saw some fat commissions in his life, but the Mona Lisa certainly never brought him any money. In fact, out of my twenty favorite artists, maybe two saw anything approaching fame in their lifetime.
Please, go to some random DeviantArt page or Spotify profile or GitHub repo with 3 views and tell me why it exists when the only reason for human creation is dollars and red carpets...what a sad perspective, really
Well, Linux might not exist, at least not in its current form, enabled by a "share and share alike" license that has meant that companies contribute to it instead of copying and closing it off from others.
More generally, what is left to protect any creative work besides guarding physical access? Why would any company make any movie or tv show if it could be copied and redistributed by others endlessly the moment it gets shown once?
That's assuming the only way to fund the creation of something is to sell copies after the fact. And assuming that people only create for monetary compensation.
There have been creative endeavous before copyright and there would be creative endeavours after copyright. Perhaps even more since people are free to remix and share without restrictions.
Well since you asked, I am about as far away from libertarian as you can be without making a point out of it.
That doesn't mean I must be in favor of every repressive innovation-stifling law that was ever cooked up.
You bring up the arts in another comment; ever considered why like half the people regarded as genuinely world-changing geniuses (da Vinci, Galileo, Columbus, Machiavelli, Michelangelo) were born in the same two hundred years in the same region? Because the Italian Renaissance was all about intense, free information-sharing! People freely visited each other's workplaces and ruthlessly stole from each other, and it was accepted. Boom, you get a period of unparalleled human productivity.
And now you want to tell me that a set of weird laws that only ever benefited Disney and Elsevier is the only thing preventing humanity from ceasing to create awesome shit? Nah man, the masses will always continue creating, exactly as proven by the fact that they did in the last decades while getting continuously butt-fucked by the very laws you pretend are made to protect them...
As a non-lawyer, I am very suspicious of the claim that "Plaintiffs and the Class have suffered monetary damages as a result of Defendants' conduct." Flagrant disregard for copyright? Sure, maybe. The output of the model is subject to copyright? Who knows! But the copyright holders being damaged in some way? Seems doubtful. The best argument I could think of would be "GitHub would have had to pay us for this, and they didn't pay us, so we lost money," but that'd presumably work out to pennies per person.
The common practice in copyright cases is to calculate damages based on the theoretical cost that the infringer would have paid if they had bought the rights in the first place. This method was used during the Pirate Bay case to calculate damages caused by the site's founders.
They did not actually calculate damages in terms of lost movie tickets or estimated vs. actual sales numbers of game copies. When it came to pre-releases, where such a product wouldn't have been sold legally in the first place, they simply added a multiplier to indicate that the copyright owner wouldn't have been willing to sell.
For software code, another practice I have read about is to use the man-hours that rewriting the copyrighted code would cost. Using such calculations, they would likely estimate the man-hours based on the number of lines of code and multiply that by the average salary of a programmer.
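As a toy illustration of that kind of estimate (every figure below is invented, and courts would argue over all of them):
// Toy "cost to rewrite" damages estimate. All figures are invented
// assumptions purely for illustration.
function rewriteCostEstimate(linesOfCode: number, locPerHour = 10, hourlyRate = 75): number {
  const manHours = linesOfCode / locPerHour;
  return manHours * hourlyRate;
}
// e.g. a 2,000-line library: 2000 / 10 = 200 man-hours; 200 * $75 = $15,000.
console.log(rewriteCostEstimate(2000)); // 15000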
The one thing we can say with complete certainty is that most programmers who had their code used without permission will not receive very much money at all if this class action lawsuit is decided in their favor.
I don't care about the money. I support this because it will establish case law that other companies can't ignore licenses as long as they throw AI somewhere in the chain.
If "I took your code and trained an AI that then generated your code" is a legal defense, the GPL and similar licenses all become moot.
"This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content"
Money likely isn't the main goal (maybe it is for the lawyers); these are open source repos. Maybe they didn't consent to have their code used as training data, and that seems like the kind of thing consent should be needed for. Maybe the AI spitting out copied snippets is a violation of open source licensing without attribution.
So for isEven, can we go with how much a student might accept, say $20 an hour, multiply that by the one minute required to create it, and offer them 33 cents?
"Using such calculations they would likely estimate the man hours based on number of lines of code and multiply that with the average salary of a programmer."
The average salary of a programmer in which country?
So much programming is outsourced these days, and in some places programmers are very cheap.
This is just my guess, but I think the intention from the judges is not to actually calculate a true number. The reason they used the cost of publishing fees in the Pirate Bay case was likely to illustrate how the court distinguished between a legal publisher and an illegal one. The legal publisher would have bought the publishing rights, and since The Pirate Bay did not do this, the court uses those publishing fees to illustrate the difference.
If the court wanted to distinguish between Microsoft using their own programmers to generate code vs taking code from github users, then the salary in question would likely be that of Microsoft programmers. It would then be used to illustrate how a legal training data would look like compared to an illegal one.
Those damages are enumerated on pages 50-52. Remember, "damages" is being used in a legal sense here -- for a non-lawyer, you can interpret it more like "a dollar value on something someone did that was wrong". This is more broad than the colloquial use of the word.
If your intent is to create a competing product for profit, chances are that won't be found as fair use, given that determining fair use depends on intent and how the content is used.
Using clips from a movie in a movie review is probably fair use.
Using clips from a movie in knock-off of that movie for profit? Probably not fair use if it's not a parody.
Copilot is not like a movie reviewer using clips to review a movie. Copilot is like a production team for a movie taking clips from another movie to make a ripoff of that movie and selling it.
Consider every repo on github to be a movie. Copilot is taking individual frames out of every movie on github and compositing them into a new film.
I think most of us would agree that individually, each frame is copyrighted. But what if you take one frame from a million different movies and put them in an order that produces a new coherent movie?
The core question we need to settle in court is: does the new movie become its own copyrightable work, or is it plagiarism?
You're mistaking the end-user's copyright infringement with Copilot's alleged infringement.
Copilot is fair use and transformative -- that is, unless there is an open source Copilot that Copilot is training on; only then would it be competing, and it's easy for GitHub or OpenAI to exclude those repos of Copilot alternatives from the training set.
> Copilot is like a production team for a movie taking clips from another movie to make a ripoff of that movie and selling it.
I can't think of a 5 line snippet I've written or read that makes sense to claim ownership of. They don't stand on their own in the way even a 30s movie clip does.
I don't think that's comparable. For starters, it's not just the length of a quote that makes it fair use, but the way quotes are used, i.e. to engage in commentary.
It's the license that matters, not whether the code is visible on Microsoft's website.
Code which anybody can view is called "source available". You aren't necessarily allowed to use the code, but some companies will let their customers see what is going on so they can better integrate the code, understand performance implications, debug and fix unexpected issues, etc. The customers would probably face significant legal risks if they took that code and started to sell it.
"Open source" code implies permission to re-use the code, but there is still some nuance. Some open-source licenses come with almost no restrictions, but others include limiting clauses. The GPL, for example, is "viral": anybody who uses GPL code in a project must also provide that project's source code on request.
What do you think the chances are that Microsoft would surrender the Copilot codebase upon receipt of a GPL request?
Aren't there statutory damages for copyright infringement, i.e. there is a presumption that each work infringed is worth at least a certain amount without proving actual damages?
I'm not confident in this stance - sharing it to have a conversation. Hopefully some folks can help me think through this!
The value of copyleft licenses, for me, was that we were fighting back against the notion of copyright. That you couldn't sell me a product that I wasn't allowed to modify and share my modifications back with others. The right to modify and redistribute transitively though the software license gave a "virality" to software freedom.
If training a NN against a GPL licensed code "launders" away the copyleft license, isn't that a good thing for software freedom? If you can launder away a copyleft license, why couldn't you launder away a proprietary license? If training a NN is fair use, couldn't we bring proprietary software into the commons using this?
It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft. Tools like copilot seem to be an exceptionally powerful tool (perhaps more powerful than the GPL) for liberating software.
Nobody is laundering away proprietary licenses, because that code is not open source and not in public github repos. And OSS capabilities are now present in copilot, which is neither free nor open. Furthermore, these contributions are making their way into proprietary code, and the OSS licensing becomes even further watered down. This is the epitome of what copyleft is against!
Indeed, the ability to 'launder away' proprietary licenses when source is available means that companies in the future (that would otherwise provide source under a non-permissive license) will shift in favour of not providing source code at all.
Code published on Github is not necessarily open source. There is a lot of code there that has no particular license attached, which means that all rights are reserved except for those covered in the Github TOS, which I believe just covers viewing the code on Github.
I'm not sure this is true. Proprietary source code gets leaked and that can be used to train a NN. I find it likely that Copilot was trained against at least one non-OSS code base hosted on GitHub.
Second, if copyright is being laundered away we can get increasingly clever with how we liberate proprietary software. Today, decompiling and reverse engineering is a labor intensive process. That's the whole point of "open source" - that working in source is easier than working in bytecode. Given the hockey-stick of innovation happening in AI right now, I'd be surprised if we don't see AI assisted disassembly happening in the next decade. If you can go from bytecode to source code, that unlocks a lot. Even more so if you can go from bytecode to source code and feed that into a NN to liberate the code from its original license.
I follow your explanation but not your conclusion.
What I think GP is getting at is that all this OSS/licensing stuff was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: that information wants to be free.
Now we have a big corporation making a public statement that it puts the value of advancing humanity over the value of honoring weird old Victorian ideas of "intellectual property" - which is what we were always trying to do, no?
Not that there is nothing to criticize, but I think that's a good thing on the whole.
> [OSS] was a cautious attempt to assert a radical idea into an atmosphere of extreme secrecy: that information wants to be free.
Information may want to be free, but users of free information often want to enrich their private endeavors by shackling the information that was given to them freely.
The (A|L)GPL acknowledges the fact that some people and corporations like to use free-and-gratis work in their products and not reciprocate the courtesy shown to them by the authors of that work. (I choose the (L)GPL whenever I can, so that folks who derive from my work are required to either make it available as I have, or pay me enough that I don't mind them shackling my work.)
The BSD license acknowledges the fact that some people and many corporations like to use gratis work in their closed-source products and never even do so much as bother to credit the authors of work that they used.
For as long as powerful folks continue to use and improve upon gratis information and software without contributing back the products and improvements built on it, the 'weird old Victorian ideas of "intellectual property"' are going to have to continue to be dealt with. Remember... you likely cannot afford an army of lawyers to ensure that no one uses your work without paying you, but big companies like Microsoft, RedHat, IBM, Oracle, etc., and wealthy individuals can.
For as long as those wealthy entities can lock up and force you to pay for their work and ideas, but make it ruinously expensive for us little people to -individually- do the same to them, we'll need "weird old Victorian" things like licenses to help correct this imbalance of power.
It looks like you're missing the entire purpose of copyleft vs public domain.
The point is that copyleft source code cannot be used to improve proprietary software. That limitation is enforced with copyright.
Proprietary software is closed source. You can't train your NN on it, because you can't read it in the first place.
If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain. The entire purpose of copyleft is to compel that person to "pay it forward", by publishing their code as copyleft. This is why Stallman is a proponent of copyright law. Without copyright, there is no copyleft.
I'm replying to the claim that RMS supports copyright. I don't believe he does; I believe he would rather it not exist at all, but since it does, you have to make use of it.
That's the full context of what I was saying. Copyleft is a hack on copyright. In a world where copyright is enforced, RMS doesn't consider the neutral ground of permissive licenses (like the popular MIT and BSD licenses) good enough. They do nothing to solve the problem of proprietary software.
The GPL is entirely dependent on copyright. Rather than pretend copyright doesn't exist, the GPL turns it in the other direction. By violating the GPL, Copilot is still violating copyright.
> If someone takes your open source code and incorporates it into their proprietary software, then they are effectively using your work for their private gain.
And then if we can close that loop by taking their proprietary software and feeding it into a NN to re-liberate it isn't that a net win for software freedom?
Today, crossing the source-code-to-bytecode veil effectively obfuscates the implementation beyond most humans' ability to modify the software. Humans work best in source code. Nothing says our AI overlords won't be able to work well in bytecode, or take it in the other direction.
I guess what I'm saying is, today a compiler is a one-way door for software freedom. Once code goes through the compiler, we lose a lot of freedom without a massive human investment or the original source code. Maybe that door is about to become a two-way door, with copyright law supporting moving back and forth through it?
I can’t tell if you disagree with the rest of my comment or didn’t bother to read it…
That’s literally not the definition of proprietary.
You download proprietary software when you navigate to (nearly) every webpage. Just because a website like HN sends you (possibly unobfuscated) HTML, CSS, and JS over the wire in plain-text does not mean those files are not proprietary. Those files are covered by copyright in the U.S.
Access to the source code is not sufficient for that source code to be FOSS.
You also failed to acknowledge leaked source code and bytecode decompilation, which were a substantial portion of my comment.
The definition gets a bit blurry around software, just like the definition of "ownership" does.
Colloquially, "proprietary software" means closed-source. You can definitely put it in context where it means "copyright without license"; but outside that context, the colloquial meaning is enough.
I think (1) you're mainly missing that copyleft vs non-copyleft is actually irrelevant for the copilot case. You also (2) may be missing the legal footing of copyleft licenses.
(1) The problem with Copilot is that when it blurts out code X that is arguably not fair use (given how large and non-transformed the code segment is), Copilot users have no idea who owns the copyright on X, and thus they are in a legal minefield because they have no idea what the terms of licensing X are.
Copilot creates legal risk regardless of whether the licensing terms of X are copyleft or not.
Many permissive licenses (MIT, BSD, etc) still require attribution (identifying who owns the copyright on X), and Copilot screws you out of doing that too (see the sketch below).
(2) Whatever legal power copyleft licenses have, it is ultimately derived from copyright law, and people who take FOSS seriously know that. The point of "copyleft" licenses is to use the power of copyright law to implement "share and share alike" in an enforceable way. When your WiFi router includes info about the GPL code it uses, that's the legal power of copyright at work. The point of copyleft licenses is not to create a free-for-all by "liberating" code.
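To make the attribution point in (1) concrete, here's a sketch. The author and function are invented for illustration, but the quoted notice requirement comes from the actual MIT license text:

// Copyright (c) 2020 Jane Hacker  (hypothetical author)
// MIT License: "The above copyright notice and this permission notice shall
// be included in all copies or substantial portions of the Software."
function clamp(n: number, lo: number, hi: number): number {
  return Math.min(Math.max(n, lo), hi);
}

When a completion engine emits just the function body, the notice and the author's identity are silently dropped -- which is exactly the failure mode being described.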
You can "launder" away the license of any source code you have copied simply by deleting it! No snazzy neural network needed. The litigants' argument is that this is what GitHub CoPilot does: it allows others to publish derivative works of copyrighted works with the license deleted. Given that it is apparently trivial to get CoPilot to spit out nearly verbatim copies of the code it was trained on, I don't think it satisfies the "transformative" requirement of the (American) fair use doctrine.
It seems like the ideal way to proceed is to make the AI output unique and creative. Perhaps that requires AGI because currently the model has no understanding of art.
Maybe more importantly, the AI needs the ability to judge when its output amounts to plagiarism, like humans generally are able to. The AI needs to feel bad about ripping off someone else’s work. ;)
Farmers plant their crops out in the open too. Should Boston Dynamics be allowed to have their robots rob those fields empty and sell the produce without having to at least pay the farmer? They'd be walking and plucking just like any human would be.
Some source code might be published but not open-source licensed. At least some such code has been taken in complete disregard of its license and/or other legal protections, and it's practically impossible to find and properly map out every similar violation for the purposes of a legal response.
This is literally the "you wouldn't steal a car" meme.
To spell it out: No, this analogy does not hold. "Stealing" data does not deprive the owner of anything, so it should not be treated remotely the same as physical stealing (usually not even of potential revenue, as piracy studies show).
While there might not be damages in the literal sense to the owners of the scraped repos, MS is making money from Copilot subscriptions, so what they're doing is closer to selling bootleg copies of a film than giving away pirated copies.
I partially have to concede that point. The analogy could have been better, and I should have put greater emphasis on the massive scale and the lack of recourse for those affected.
Nevertheless, stealing remains illegal so at the very least they have deprived the source code owners of their rights.
> It seems like the end goal of copyleft was to fight back against copyright, not to have copyleft.
Whether this was the original motivation depends on whom you are asking.
You may disagree, but the "Free Software" movement (RMS and the people who agree with him) essentially wants everything to be copyleft. The "Open Source" movement is probably more aligned with your views.
TabNine has absolutely improved my life as a programmer. There's something really rewarding about having a robot read your mind for entire blocks of code.
It's not just functions either, one of the most common things that it helps me with daily is simple stuff like this:
Typing
const x = {
  a: 'one',
  b: 'two',
  ...
}
And later I'll be typing
y = [
  a['one'],
  b['   <-- it auto-completes the rest here
]
It's really amazing the amount of busy-work typing in programming that a smart pattern matching algo could help with.
That autocomplete was sort of ok in tabnine, but Copilot completely blows it out of the water. Resource consumption for Copilot is also much more restrained.
Which reminds me I have to cancel my tabnine subscription. Been paying them for a year without using it.
I haven't tried Copilot personally, but thanks, I'll try it. I did try TabNine over a year ago and found it had improved dramatically since then, so maybe it's gotten better with training.
I don't think this is a good example of the value of these things. You can just as easily do the same thing with advanced text-editor features. Sublime, for example, supports multi-cursor editing: hold alt+shift+arrow keys to add a cursor, then type in the brackets you want. Ctrl+D can be used to select the next occurrence of the current selection with multiple cursors; built-in commands from the command palette can do anything to your current selection (e.g. convert case); etc.
All of that efficiency without having to pay a monthly subscription, waste electricity on some AI model, or worry about the legal/moral implications.
Why? You can copy and paste the entire section, and use multiple cursors to add in the brackets.
going from
a: 'one',
to
a['one'],
just requires you to add two brackets and remove the colon. With multiple cursors you can do that exact same operation for all lines in a few keystrokes.
It's having to go find the other block you want, copy and paste it, and then set up the multiple cursors and type, versus it just happening automatically without any of that.
In my use cases I've long since moved on from writing the original hash. Having it autocomplete saves me from opening a file and tabbing back and forth (or finding, then copying/pasting a block into the other file to temporarily work on it), etc.
But what's lost in my oversimplified example is that the context is usually way more involved. I'm usually passing those as arguments to some function, or in some other unique syntactic situation that no glorified find-and-replace can solve. It's all about the times you would never even bother writing a custom command, because typing is faster given the unique syntactic context... The only thing faster, then, is autocomplete.
I'm not actually recreating a new hash in the same convenient format.
Does anyone have a problem with it, so long as the material it trained on was used with explicit permission/licensing and not potentially in violation of copyright?
That's where the line is for it to be suspect IMO.
This is what I hope comes out of the lawsuit. If a company wants to sell an AI model, they need to own all of the training data. It can't be "fair use" to take other people's works at zero cost and use them to build a commercial product without compensation.
And maybe models trained on public data should be in the public domain, so that AI research can happen without requiring massive investments to obtain the training data.
There has to be reasonable context here. Even if it's trained on proprietary code, it rarely inserts that code verbatim in a way that is at all related to how it was originally used.
Obviously licensing needs to be respected, and it shouldn't be hard to solve that problem. But 99.9% of code isn't some unique algorithm; it's gluing libraries together and setting up basic structures.
Most of the examples I've seen don't line up with the reality of code-completion tools. Code is rarely valuable when broken up into its small parts.
Even copying a full codebase is rarely enough to draw value from… there’s way more to a software business than the raw code. But that’s a different problem.
Ok you got me, that wording was lazy on my part. But that's a really bad take on yours:
> It was trained on OSS which is explicitly licensed for free use.
That's not what the lawsuit is about. It's not about money, it's about licensing. OSS licenses have specific requirements and restrictions for using them, and Copilot explicitly ignores those requirements, thus violating the license agreement.
The GPL, for example, requires you to release your own source code if you use it in a publicly-released product. If you don't do that, you're committing copyright infringement, since you're copying someone's work without permission.
Yeah, and I think that's fair re: licensing. Curious to see how it pans out.
Also, re: your edit, not quite. They require you to release modified source under certain conditions, if you make modifications to it. If everybody who used GPL code had to release their code to the world, every company's code would already be public. There's more nuance than that. The GNU site covers a lot of that nuance (https://www.gnu.org/licenses/gpl-faq.en.html#UnreleasedMods)
AGPL is the one that enterprises won't touch with a ten-foot pole, due to its more restrictive terms and the broader conditions under which you'd have to open-source your own code.
Most companies building commercial products on top of FOSS are obeying the license requirements. (I have been through due diligence reviews where we had to demonstrate that, for each library/tool/package.)
The same cannot be said for Copilot: there have been prior examples here on HN showing that it can emit large chunks of copyrighted code (without the license).
Its being permissively licensed is virtually irrelevant, because only a minority of code is licensed so permissively that you can just do what you like with it. Far more of it is "do what you like within the scope of the license." The GPL, for example: do with it what you like, so long as any derivative work is also GPL.
I guess I'm just afraid that it might not be as good as it is that way.
It's a bit like how GPT-3, Stable Diffusion and all those generative models use extensive amounts of copyrighted material in training to get as good as they do.
In those cases however the output space is so vast that plagiarism is very unlikely.
The interesting thing is that the names get explicitly attached to these styles. It isn't exactly a copyright issue, but I'm sure it will get litigated regardless.
I think the prompt "GPT-3, tell me what the lyrics for the song Stan by Eminem is" is very likely to output copyrighted material. The same copyrighted material is, of course, already republished without permission on google.com.
There are literally thousands of years of artwork in the public domain; the idea that the dataset isn't big enough to make good images without copyright infringement and attribution laundering is frankly laughable.
My guess is that it's not so much about the amount of available data as how accessible it is. Scraping the internet seems to be one of the preferred ways of gathering vast amounts of text and images in particular.
Telling apart what is and isn't in the public domain is not a trivially automatable task.
If one relies only on curated libraries of vetted public-domain content, you don't get anywhere near the expected amount of variability and diversity.
I feel like Charlie Gordon from Flowers for Algernon, with and without Copilot.
Literally 10x faster development.
Case in point: had an unexpected project and no time to complete it.
Within an hour Copilot helped me:
* Write a couple of tricky matplotlib plots
* Do some extensive analysis with Pandas
* Write a couple of SQL queries
* Write a Flask back-end and deploy it
* Write a bit of a front-end
* All of this with extra comments, links to documentation, and pretty reasonable style
I have experience with all of the above, but the speed increase was considerable.
This would be a good day's work without Copilot, and there would be less commenting and hackier code.
Before Copilot I would be cursing a lot more while reading various docs...
The key thing Copilot does is reduce the latency of your thought-action-result loop.
Does open source really suffer if fewer people read documentation directly? Would you really be less likely to create an open-source library if you knew someone could now use your library at 10x speed?
The inference ability has crossed the uncanny valley for me so many times.
At times I find myself wondering whether there is a speech-recognition component.
When giving a lecture, I will start saying something and write a prompt at the same time, and the sentence produced by Copilot will be spot-on what I've just said.
Ideally there would be an open-source version of Copilot that respects everyone's wishes. I fear that is impossible.
Why not just train it on your own code, or on an opt-in database of voluntarily contributed code? Why does everyone else have to make your life easier [and generate enormous wealth for a third party, with zero compensation for their work] involuntarily?
I really don't understand how there can be a problem with how Copilot works. Any human works in the same way. A human is trained on lots and lots of copyrighted material. Still, what a human produces in the end is not automatically a derived work of everything the human has seen before.
So, why should an AI be treated differently here? I don't understand the argument for this.
I actually see quite some danger in this line of thinking, that there are different copyright rules for an AI compared to a human intelligence. Once you allow for such an arbitrary distinction, AI will get restricted more and more, much more than humans are, and that will just arbitrarily limit the usefulness of AI and effectively be a net negative for humanity as a whole.
I think we must really fight against such undertakings, and better educate people on how Copilot actually works, so that no such misunderstanding arises.
I think there's a parallel in surveillance systems. For example, it's perfectly reasonable for a police officer conducting an investigation to follow a suspect as they drive around town. After all, it's happening in public and it's not illegal to watch what someone does in public (caveat being taking it to the level of stalking).
However, is it reasonable to write an AI system that monitors the time and location of all license plates seen around town, puts them into a database, and then that same officer can simply put in the suspect's license plate instead of actually following them around? Maybe, maybe not, that's not my point here. But the creation of that functionality can easily lead to its abuse.
Is this exactly the same case as Copilot? Of course not, these are two wildly different systems. But I think it's an interesting parallel to consider when discussing the point of "it's okay when a human does it" because humans and algorithms operate at two very different levels of scale. The potential for abuse of the latter being far higher and far easier than something a human has to do manually.
I would argue that humans are far from perfect here as well. But that isn't really my argument anyway. I agree with you, this should be improved, and I don't see such a big problem in improving it. I'm sure we will find ways to get this to a human level or better.
I'm mostly talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. It learns from open-source code, just like a human does.
> > So, why should an AI be treated differently here? I don't understand the argument for this.
> Because the AI is not a human and only humans have rights, including the right to learn.
Okay then: Who counts as 'human'? What's the qualifier for being a 'human'?
------
(The following questions all point to the same underlying question.)
Are you human if you have only one leg or 8 fingers due to a genetic deformity? What about albinism or sickle cell disease?
If someone had robotic implants, are they human? Is it inhuman to have an artificial leg? What about both legs?
Same scenario as above, but both arms & legs are replaced. Are they human?
Same as above, but now everything below the torso has been replaced. Same question.
Same question, but now everything below the neck.
If someone were to successfully transplant their brains into a robot body, are they still human?
Someone embeds a neural implant into their brain: Still human?
Same question, now multiple neural implants.
Same question, but now the brain-to-implant ratio is 2:1. Brain mass & neural count hasn't changed since then.
Same question, but with the brain-to-implant ratio now 3:1.
I'm not talking about cases where code is copied, as in your example. I fully agree, this should be fixed. But I don't see such a big problem here. We can do something about this and reduce those cases to a reasonable human-level minimum or below.
I explicitly say human-level because humans would also not be totally immune to this. It can happen that you unintentionally write the same code you have seen somewhere.
It can also even happen that you write the same code just by pure chance.
I'm talking about the statement in general, that all Copilot output is derived work. This is just wrong, as it would be for a human as well.
I'm talking about the statement "[Copilot] relies on unprecedented open-source software piracy". This is just wrong. A human also relies on open-source software (and even private software) to learn, and this is not piracy.
The title of the submitted PDF document: "Microsoft Word - 2022-11-02 Copilot Complaint (near final)"[0]
I've noticed this a lot and it's quite funny seeing what the actual filename of the document was. Does this just get included as metadata by default when you export to PDF?
This will fail very quickly. The licence that project owners publish with their code on Github applies to third parties who wish to use the code, but does not apply to Github. Authors who publish their code on Github grant Github a licence under the Github Terms: https://docs.github.com/en/site-policy/github-terms/github-t...
Specifically, sections D.4 to D.7 grant Github the right to "to store, archive, parse, and display Your Content, and make incidental copies, as necessary to provide the Service, including improving the Service over time. This license includes the right to do things like copy it to our database and make backups; show it to you and other users; parse it into a search index or otherwise analyze it on our servers; share it with other users; and perform it, in case Your Content is something like music or video."
I don't see that being "quickly" - they'd have to get a judge to agree that passing your code off, without attribution, for other people to use as their own work is a normal service improvement. Given that it's a separate feature with different billing terms, I'm skeptical that it's anywhere near the sure thing you're portraying it as.
It's worth reading the passage in its entirety and how a court would interpret it:
> We need the legal right to do things like host Your Content, publish it, and share it
> This license does not grant GitHub the right to sell Your Content. It also does not grant GitHub the right to otherwise distribute or use Your Content outside of our provision of the Service, except that as part of the right to archive Your Content, GitHub may permit our partners to store and archive Your Content in public repositories in connection with the GitHub Arctic Code Vault and GitHub Archive Program.
If Copilot is straight-up reproducing work, and it is a service that users have to pay to use, then it seems like Copilot is "sell[ing] your content" and thus the license does not apply.
More generally, a court is likely to look at the plain English summary and judge. Copilot is not an integral part of "the service" as developers understood it before Copilot existed.
"desperate semantic games" is actually a reasonable description of the legal process :-)
I'm not sure I agree that anything expressed in a legal contract using natural language is "unambiguously clear". MS / Github's expensively-attired lawyers will no doubt forcefully argue that they are not selling YOUR content, but a service based on a model generated from a large collection of content, which they have been granted a licence to "parse... into a search index or otherwise analyze... on our servers". There may even be in-court discussion of generalization, which will be exciting.
This is the standard content display license that everyone uses. Even in your quoted text I don't see any hint that snippets can be shown without attribution or the code license.
It also says they can't sell the code, which CoPilot is doing.
Also, in a very high number of cases it isn't the author who uploads.
Repeating your line of argumentation (which occurs in every CoPilot thread) does not make it true.
It's irrelevant whether it's standard or not. Again, the terms in the code licence (including attribution) do not apply to Github, because that is not the licence under which they are using the code. You grant them a separate licence when you start using their service.
If someone who isn't the author has uploaded code which they do not have a right to copy, they are liable, not Github. This is also clear from the Github Terms: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post"
It's almost as if these highly paid lawyers know what they're doing.
You grant them a content display license, not a general code license.
> It's almost as if these highly paid lawyers know what they're doing.
Sure, they wrote the content display license long before CoPilot even existed. Any court will see the intent and not interpret these terms as a code re-licensing.
There is no such thing as a "content display licence" or "general code licence". There is copyright (literally, the right to make copies) which broadly lies with the author, who can then grant other parties a licence to copy their content.
I'm afraid I do not believe your legal expertise is so extensive that you are able to accurately predict the judgement of "any court".
And it explicitly states that it does give them the right to share your code. Copilot isn't selling code; if it were, then GitHub wouldn't let you share the output of Copilot; that would destroy their market. That they allow you to share the output of Copilot with others proves that what they are selling is the service, not the output. The output is, at worst, "shared" code from Github's licensors.
So, it isn't clear to me which of these clauses you are citing grants them the forced right to "Copilot" (which I'm using as a verb to avoid defining what stage of production we are talking about) that wasn't granted by the license of the code. But let's assume for a moment that you are correct: that just means that GitHub as a service makes no sense, right? Like, there are a ton of people using GitHub to develop using code I've published in the past... code which is under various of these example licenses, and which I've never myself (as the copyright holder) published to GitHub (and, in fact, never would, as I despise GitHub). There are also a number of very popular projects--such as the Linux kernel--which people not only upload to GitHub but which have official mirrors on GitHub, where no single party even owns the copyright needed to agree to these terms of service. Meaning, if you are correct, GitHub is often being used illegally, and a ton of the source code they are training against wasn't legally provided to them in the first place.
Section D.3: "If you're posting anything you did not create yourself or do not own the rights to, you agree that you are responsible for any Content you post". A lawsuit against Github has no standing for the scenario you suggest, because Github is not at fault.
Ok, so: "that just means that GitHub as a service makes no sense, right?" Like, I feel you simply ignored the core complaint of my comment so you could instead note something about GitHub's potential liability (a thought process I didn't even bring up, though I can see how you decided to bend my final comment into somehow being relevant to it). Like, are you simply conceding, then, that a ridiculous amount of the content on GitHub -- including major projects such as Linux -- is not allowed to be posted to GitHub?
Well, I don't think it's a DMCA issue, but it does very much depend on the licence you have chosen. That's what the licence is for: to allow people to use the code you hold copyright over, and to define what they are / are not allowed to do with it.
This sounds unenforceable in the general case. How could github know whether someone pushes their own code or not? Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
> Is it a license violation to push someone's FOSS code to github because the author didn't sign up with GH?
It depends on the licence.
It's very much enforceable that companies who provide content publishing platforms will indemnify themselves against people publishing content to which they do not have an appropriate licence.
Does everybody credit the author when using Stack Overflow code? I have, but don't always. Not that I'm trying to steal, I just don't take the time, especially in personal projects.
This isn't exactly the same thing, but it seems to me that three of the biggest differences are:
1. Stack Overflow code is posted for people to use it (fair enough, but they do have a license that requires attribution anyway, so that's not an escape)
2. Scale (true; but is it a fundamental difference?)
3. People are paying attention in this case. Nobody is scanning my old code, or yours, but if they did, would they have a case?
I dunno. I'm more sympathetic to visual artists who have their work slurped up to be recapitulated as someone else's work via text to image models. Code, especially if it is posted publicly, doesn't feel like it needs to be guarded. I'm not saying this is correct, just saying that's my reaction, and I wonder why it's wrong.
On page 18, they show Copilot produces the following code:
>function isEven(n) {
> return n % 2 === 0;
>}
They then say, "Copilot’s Output, like Codex’s, is derived from existing code. Namely, sample code that appears in the online book Mastering JS, written by Valeri Karpov."
Surely everyone reading this has written that code verbatim at some point in their lives. How can they assert that this code is derived specifically from Mastering JS, or that Karpov has any copyright to that code?
There is no way in hell that isEven is covered by copyright.
"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."
> There is no way in hell that isEven is covered by copyright.
Hey, I said the same thing about APIs, but here we are.
Edit: Actually, the Supreme Court declined to rule on whether APIs are copyrightable, but it did say that if they are, reusing them the way Google reused the Java APIs in Android would fall under fair use. Given that the lower courts did think APIs should be copyrightable, we don't know whether they are anymore.
What's interesting in that case is that I would argue the interface code is MORE important than the implementing code. You could hire any SWE to re-implement the entire API, and it would be pretty straightforward. The interface code is what actually mattered to developers; it took creative expression, design, sequence, and organization to put together, plus feedback and iteration over the years. Google knew this too: taking the interface code was WAY more important for them than doing the reverse.
True, but trademark and copyright are pretty different on a fundamental level both in the purpose behind the law and how its implemented. I suspect if it weren't for the term "intellectual property" tying trademark, copyright and patents together, we wouldn't really think of them in such a unified way since they are all really different from each other.
It's possible the complaint is using a trivial example to illustrate the type of argument plaintiffs want to make during any trial. A 200-line example is too unwieldy for non-programmers to digest, especially given the formatting constraints of a legal brief.
Look at paragraphs 90 and 91 on page 27 of the complaint[1]:
"90. GitHub concedes that in ordinary use, Copilot will reproduce passages of code verbatim: “Our latest internal research shows that about 1% of the time, a suggestion [Output] may contain some code snippets longer than ~150 characters that matches” code from the training data. This standard is more limited than is necessary for copyright infringement. But even using GitHub’s own metric and the most conservative possible criteria, Copilot has violated the DMCA at least tens of thousands of times."
Does distributing licensed code without attribution on a mass scale count as fair use?
If Copilot is inadvertently providing a programmer with copyrighted code, is that programmer and/or their employer responsible for copyright infringement?
There's a lot of interesting legal complications I think the courts will want to adjudicate.
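To put rough numbers on the complaint's "tens of thousands" (my own back-of-envelope with an assumed usage figure, not a number from the filing): if users accept even 1 million Copilot suggestions, GitHub's own 1% figure implies roughly 1,000,000 × 1% = 10,000 suggestions containing >150-character verbatim matches; at 10 million accepted suggestions it would be 100,000. The "most conservative possible criteria" barely matter at that scale.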
They changed it, I'm 100 % sure. The profile picture was Saul from Breaking Bad. I assume they read the comments here and changed it in a matter of one or two minutes.
They determined the other `isEven()` function was cribbed from Eloquent Javascript because of matching comments. I wonder if the complaint just left off telltale comments from that Mastering JS one?
Yeah, the other one I found much more persuasive. The extra comments were unequivocally reproduced from the claimed source. (although that output was from Codex, rather than Copilot).
Yep, same. Not in JS, but in Haskell, for the even-Fibonacci Project Euler problem. Something like a million people have submitted correct answers for that problem; assuming half wrote their own filter rather than importing an isEven library, that's half a million people right there.
That seems like a really bad choice of an example for this, but as I haven't read the document I don't have any other context beyond what you've posted here, I have to take your word for it that that's the purpose of this snippet.
However, if you are looking to understand the reasoning behind this lawsuit, there are lots of better examples online where Copilot blatantly ripped off open source code.
Maybe I'm being too cynical, but this feels like it's more a law firm and individual looking to profit and make their mark in legal history rather than an aggrieved individual looking for justice.
Programmer/Lawyer Plaintiff + upstart SF Based Law Firm + novel technology = a good shot at a case that'll last a long time, and fertile ground to establish yourself as experts in what looks to be a heavily litigated area over the next decade+.
One of the core principles of the American system of government is that we outsource enforcement to private parties. Instead of the public needing to fund enforcement with tax dollars private parties undertake risky litigation in exchange for the chance of a big payoff.
There is a reasonable argument that's a horrible system. But it doesn't make sense to criticize the plaintiff looking for a profit - the entire system has been set up such that that's what they're supposed to do. If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
> But it doesn't make sense to criticize the plaintiff looking for a profit…
I don't know, man. I can simultaneously see the systemic issue that needs to be solved and also critique someone for succumbing to base impulses like greed when they don't have the need.
But the need is obviously there. Everyone who produces the following code in a non-university environment - for a fee! - needs to be punished quickly and severely:
Based on the given prompt, [Codex] produced the following response:
What they're doing is a service, though. Say that $10 million worth of damage against others has been done. If the law firm does not act, the villainous curs who caused that damage get to keep their money and are incentivized to do it again. If the law firm does act and prevails, then the villains lose their ill-gotten gains (in favor of the law firm and, sometimes, to an extent, the injured parties). That's preferable. Not ideal, but certainly better than nothing.
That implies it’s a service I want, which I have not decided on in this situation. Either way I was more arguing with the other posters claim that it “didn’t make sense” to critique this move, which I think is factually incorrect since I can come up with a few plausible situations where it does make sense
I perhaps wasn’t clear, I meant that I am not sure I want copilot constrained in this way. If I solidify that belief into definitely not wanting copilot constrained, then this would be a negative suit for me
Understood, and there are others who do want it constrained this way, and their right not to be a victim of copyright infringement outweighs a desire to reap the gains of such infringement.
Sadly, you've got it backwards. Class actions are opt-out. You're part of it unless you know about the settlement and contact them to let them know you're opting out of it.
That's entirely fair - and I'm not angry, just not convinced in their arguments, especially when the motive is likely not genuine.
As an aside - I'm almost positive MSFT/Github expected this and their legal teams have been prepping for this moment. Copyright Law and Fair Use in the US is so nuanced and vague that anything created involving prior art by big-pocket individuals or corporations will be litigated swiftly.
I expected one of these lawsuits to come first from Getty or one of the big money artist estates against OpenAI or Stability.ai, but Getty and OpenAI seem to be partnering instead of litigating.
Are you against policing? Because that's government enforcement. Admittedly policing in the US is god awful, but I still think most people would rather have it than no police force at all.
Government enforcement of this kind of law is really no different. It wouldn't be the legislature doing it.
> If you're angry about it lobby for either no rules or properly funded government enforcement of rules.
No, there are plenty of other changes you might want to see.
For example, in the American system, judges are generally not allowed to be aware of anything not mentioned by a party to the case. There is no good reason for this.
This is a classic example of the ad hominem fallacy. Stating that "they are no angels" doesn't detract from whether they're right or capable of effecting positive legal change.
Frankly, I don't care if anyone makes a name for themselves for doing this. In fact, I applaud them and would happily give them recognition should they be successful.
Similarly, I'd hope that there are opportunities for profit in this space, given that I don't want cheap lawyers botching this case and setting terrible legal precedent for the rest of us. Microsoft has a billion-dollar legal team, and they will do everything they can to protect their bottom line.
Just like Google's noble but misguided attempt to make all the world's books searchable a few years back, what we have here is IP law getting in the way of a societal good.
Copyright and patent are not natural; they’re granted by law “to promote progress in the useful arts”. At first glance here it appears that GitHub is promoting progress and the plaintiffs are just rent-seeking.
It can be and is both what you describe and a necessary feature of our adversarial legal system.
Github can't really go to a court by themselves and ask "is this legal?". There is the concept of declaratory relief, but you need to be at least threatened with a lawsuit before that's on the table.
So Github kinda just has to try releasing CoPilot and get sued to find out. The legal system is set up to reward the lawyer who will go to bat against them to find out whether it is legal. The plaintiff (and maybe the lawyer, depending on how the case is financed) takes the risk of being wrong, just as Github did.
It is set up this way to incentivize lawyers to protect everyone's rights.
But who cares? Who else is willing to fund litigation on this important legal question? The real justice here is declarative and benefits everyone.
No matter who litigates and for what reasons it will be extremely valuable for good precedents to be set around the question of things like Copilot and DALL-E with respect to copyright and ownership. I'd rather have self interested lawyers dedicated to winning their case than self interested corporations fighting this out.
I brought a class action suit against Sharp and I was the class representative. They settled. The judge awarded me a whopping $1,000 from the settlement money. From the time I put into it, including 3 or 4 full days in NYC because my deposition coincided with a snowstorm, I didn’t exactly come out ahead financially.
Obviously this is different for the reasons you stated, but I didn’t want people to think bringing a class action lawsuit forward is a way to get rich. It’s a bit of a joke, really.
> rather than an aggrieved individual looking for justice.
How can an aggrieved individual seek justice from a big multinational corporation? It's not possible, unless that individual is a retired billionaire willing to become a millionaire.
I have a friend from high school who does class-action lawsuits. He spends a very large amount of money funding his suits, on things like expert witnesses among other things; only 1 in 5 pays off, so each one has to pay off well. His model is similar to venture capital. Most of these cases take 5-7 years to execute, so he basically takes out loans from another lawyer to fund them. His average pay for the last 10 years has been around $140k/year. Some years he makes nothing and pays out a lot; others he makes several million and pays back all the loans. Another way to think of it is like the payouts to tax-fraud whistleblowers.
Yes, he does think of it somewhat like that: establishing himself in an area. However, a lot of his work comes from him finding people aggrieved by something, not from them finding him.
How come? When people contributed code publicly, they attached a license stating how the code may be used. Is training an AI model on that code allowed? I think there's a fair, important, and novel legal question to be examined here.
Patent trolls usually file lawsuits that are just unmerited, but rely simply on the fact that mounting a defence is more expensive than settling.
It is a little different. The first patent troll that blazed the trail gets both more credit (for ingenuity) and blame (for the deleterious impact) in my opinion. I'll give the same internet points to these guys.
If Kasparov uses chess programs to be better at chess maybe we can use copilot to be better developers?
Also, anyone, whether a person or a machine, is welcome to learn from the code I wrote; in fact, that's how I learned to code, so why would I stop others from doing the same?
Judging by the majority opinion in this thread, it seems pretty clear GitHub could have asked and gotten enough people to opt-in to have no problem training their model. They probably would have been thrilled to do it and proud of being included in the training data.
But the preference of the majority does not override the conditions placed by people who prefer not to participate.
No human perfectly reproduces the learning material they used.
If that were true, one might as well just hire engineers from Twitter and make a new platform from the code they remember!
Well, we humans do it occasionally. You probably remember a few specific code snippets in your language of choice because they kept annoying you / you love them / you wrote them a lot. So if I put you in exactly the right situation, you would indeed reproduce code verbatim.
So does Copilot.
I am not trying to insinuate that Copilot works like a human, but it is literally the same situation.
I suspect this will be the first of many lawsuits over training data sets. Just because it is obscured by artificial neural networks doesn't mean it's an original work that is not subject to copyright restrictions.
I don't know why we're treating it as anything less than a human brain. A human can replicate a painting or a picture of Mickey Mouse from memory, and that would likely be copyright infringement; but they could also take a drawing of Mickey Mouse sitting on the beach, give him a bloody knife & some sunglasses, and it'd likely be fair use of the original art.
The AI can copy things if it wants, but it can also modify things to the point of being fair use, and it can even create new works drawing so little from any particular work that it's effectively creativity on the same level as humans drawing something that popped into their heads.
I'm kinda sceptical that this goes anywhere, given that they basically say it's your responsibility to vet whatever Copilot outputs for copyright problems (obviously that goes against the product's promise and the PR, but it's the small print that gets them out of trouble).
Saying "it's your responsibility to not breach licenses or violate copyright" doesn't absolve your service from breaching licenses and violating copyright itself.
Yet we all use web browsers that copy copyrighted text from buffer to buffer all the time. This doesn't even include all of the copying that ISPs perform.
It might be fair to say that the read performed in training has the same character since no human is involved.
The real copyright violation would be using a derived work.
A browser isn't an amalgamation of billions of pieces of other works. A browser executes and renders code it's served.
Copilot's corpus is quite literally tomes of copyrighted work, encoded and compressed into its neural network, from which it launders that work to create similar works. Copilot itself, the neural network, is that corpus of encoded and compressed information; you can't separate the two. Copilot stores and distributes that work without any input from rightsholders, and it does it for profit.
A better analogy would be between a browser and a file server filled with copyrighted movies whose operator charges $10/mo for access. The browser is just a browser in this analogy, where the file server is the corpus that forms Copilot itself.
The actual copying isn't the problem; it's distribution. If I buy access to a PDF, I'm not going to get in trouble for duplicating the file unless I send it to someone else.
When someone uploads their copyrighted text to a web page, they are distributing it to whoever visits that page. The browser is just the medium.
You could argue that it’s the individual projects using copilot that are violating here, I guess? Like you can use curl or git to dump some AGPL code into your commercial closed software but no one would (hopefully) blame those tools.
So copilot is fine but anyone using it must abide by the collective set of licenses that it used to write code for you…?
If a license requires attribution, and you reproduce the code without attribution using your editor plugin, it seems to me the infringement is on the editor plugin.
Note that even licenses like MIT ostensibly require attribution.
This is a bad analogy, because P2P networks exist that are legal to operate: services generally aren't held responsible for user-generated content (via the DMCA safe harbor for copyright claims, and Section 230 of the CDA for most non-IP claims).
What made Napster illegal is that the company did not create their network for fair use of content, but to explicitly violate copyright for profit.
Copilot is like Napster in this case, in that both services launder copyrighted data and distribute it to users for profit.
Copilot is not like other P2P networks that exist to share data that is either free to distribute or can be used under the fair use doctrine. Copilot explicitly takes copyrighted content and distributes it to users in violation of licenses, that's its explicit purpose.
It's entirely possible to make a Copilot-like product that was trained on data that doesn't have restrictive licensing in the same way it's entirely possible to create a P2P network for sharing files that you have the right to share legally.
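For what it's worth, the corpus-filtering step is not exotic engineering. Here's a minimal sketch of what "train only on clearly licensed code" could look like, assuming each repo carries an SPDX license identifier; the record shape and allow-list are hypothetical, not anything GitHub has described:

// Hypothetical corpus record; a real pipeline would also need per-file
// license detection, since repositories often mix licenses.
interface Repo {
  name: string;
  spdxLicenseId: string | null; // null = no license file found
}

// Hypothetical allow-list. Note that even MIT/BSD require attribution,
// which a completion engine can't easily provide; a maximally cautious
// operator might keep only public-domain dedications like CC0-1.0.
const ALLOWED = new Set(["MIT", "BSD-3-Clause", "Apache-2.0", "CC0-1.0", "Unlicense"]);

function selectTrainingRepos(repos: Repo[]): Repo[] {
  // "No license" means all rights reserved, so null is excluded too.
  return repos.filter((r) => r.spdxLicenseId !== null && ALLOWED.has(r.spdxLicenseId));
}

The hard part is the legal judgment (which licenses are compatible with un-attributed completions), not the filtering itself.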
So if you produce napster 2.0 to be the best music piracy tool, and you test it for piracy, and you promote it for piracy... you're going to have trouble.
If you produce napster 2.0 as a general purpose file sharing system, let's call it a torrent client, and you can claim no ill intent... you may have trouble but it's a lot more defensible in court.
I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code. No judgment on whether the class action has any merit, just saying I would be very surprised if discovery turns up lots of emails where Github execs are saying "this is great, it'll let people steal code."
> I would find it a big stretch to say Github's intent here is to illegally distribute copyrighted code.
Almost everything on GitHub is subject to copyright, except for some very old works (maybe something written by Ada Lovelace?), and US government works not eligible for copyright.
Now, many of the works there are also licensed under permissive licenses, but that is only a defense to copyright infringement if the terms of those licenses are being adequately fulfilled.
> Almost everything on GitHub is subject to copyright,
Agreed. Like I said, it's about intent. Can anyone say with a straight face that copilot is an elaborate scheme to profit by duplicating copyrighted work?
I don't think the defense is that it wasn't trained on copyrighted data. It obviously was.
I think the defense is that anything, including a person, that learns from a large corpus of copyrighted data will sometimes produce verbatim snippets that reflect their training data.
So when it comes to copyright infringement, are we moving the goalposts to where merely learning from copyrighted material is already infringement? I'm not sure I want to go there.
I think you're looking for consistency that the legal system just doesn't provide. The music industry is more organised and litigious than the software industry and that gives them power that you and I don't have. If you called it "Napster 2.0" specifically you'd probably be prevented from shipping by a preliminary injunction. Is that fair or consistent? No. But it's the world we live in. Programmers want laws to be irrefutable and executable logic but they just aren't.
Now, IANAL, but iirc, that is all 100% okay and legal. In fact, I can even download copyrighted music and movies without issue. So, I don't even need to make sure I don't download anything under copyright.
The issue isn't downloading copyrighted stuff.
Rather, it's making available and letting others download it. That was where you got in trouble.
Knowingly downloading copyrighted material, say to get it for free, still violates the rights of the copyright holders. It's just that litigating against members of the public is bad PR and not exactly lucrative, especially when it's likely that kids downloaded the content.
People used to get busted from buying bootleg VHS and DVDs on the street before P2P filesharing was a common thing. Then, early on, people were sued for downloading copyrighted files before rightsholders decided to take a different legal strategy to go after sharers and bootleggers.
Well, if the trackers also hosted mixed-up blocks of data for all the torrents they tracked and their protection was "LOL make sure you don't accidentally download any of these tiny data blocks in the correct order to reconstruct the copyrighted material they may be parts of wink"
If Microsoft is so confident in the legality and ethics of Copilot, and that it doesn't leak or steal proprietary IP... they should go train it on the MS Word and Windows and Excel source trees.
Did they make a statement that they did not want to do that?
Because if not, I would offer the very mundane explanation that the Copilot team probably just couldn't be bothered hitting up the other software teams and jumping through 3,046 internal red-tape compliance steps to make their product 0.001% better (I am pretty sure the code base of all of GitHub dwarfs the MS code base by quite a lot).
I can't believe I am actually defending fucking Microsoft, but I just want to say there isn't a conspiracy everywhere...
I have no doubt they will -- but the specific models will be used for Microsoft engineers. There will be a Copilot for Enterprise that trains on customers' private code.
Wow, this is an interesting iteration in the ongoing divide between "East Coast code" and "West Coast code" as defined by Larry Lessig. For background, see https://lwn.net/Articles/588055/
I am not against this lawsuit but I'm against the implications of this because it can lead to disastrous laws.
A programmer can read code that is available but not OSS-licensed and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?
Today they're filing a lawsuit against Copilot.
Tomorrow it will be against Stable Diffusion (or DALL-E, GPT-3, whatever).
And then eventually against Wine/Proton and emulators (are APIs copyrightable?).
Well, they're a special case here, since they don't solve a specific problem or build a program per se; they (re)build a program to match existing specs. Their explicit goal is to match the behaviour of another piece of software with a translation layer.
Forbidding contributions from people who have seen the "source" program is most likely meant to keep their version from going beyond "matching the behaviour" to literally sharing the same code. It may also be intended as a safeguard so that well-intentioned developers don't accidentally break their own (most likely existing) NDAs.
That was out of abundance of caution, not based on any legal precedent.
In fact, the little precedent that exists over learning from copyrightable code is in favor of it.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
It demonstrates that copyright stifles copying. That may make it easier for the copier to innovate, but it doesn't dispute the main argument for having copyright protection: that, without it, the code wouldn't have been written.
Most of it? I would think more than 50% of open source code writers find it necessary to restrict the rights to copy and use their code. In a world without copyright protections, would the GPL even be enforceable?
(and I guess courts might, in the future, say the GPL expires when copyrights on the code expire)
Sure, but given the timetable for changing the law, it still seems pretty reasonable to apply the same standard to Microsoft (and by extension Github) in the meantime
I don't quite agree. Microsoft took a conservative approach to copyright to protect their own business.
Meanwhile, open source software has had an immeasurable benefit to society. My computer, TV, phone, light bulb, etc. all benefit from OSS, running under various licenses, only a subset of which are copyleft-like.
The fact that the laws are inconsistent and expensive to defend against leads companies like Microsoft to take this conservative approach that slows down progress.
Copyright laws aren't preventing you from learning cinematography by watching said Disney movies though, and using all their techniques for your own project.
OpenAI did a dirty job though, judging by the cases of the model reproducing code right down to the comments, so I can understand why one would criticize this specific project.
That sucks for little snippets of software though, doesn’t it? It’s like copyrighting individual dance moves (not allowed under the current system) and forcing dancers to never watch each other to make sure they’re never stealing.
I mean, it's not like the copyrights are keeping you from doing things. It's stopping you from looking at someone else's source. And it's not like source is easy to accidentally see like dance moves are.
Way, way back in 1992, Unix Systems Laboratories sued BSDI for copyright infringement. Among other things, they claimed that since the BSD folks had seen the Unix source code, they were "mentally contaminated" and their code would be a copyright violation. This led to the BSD folks wearing "mentally contaminated" buttons for a while.
GitHub Copilot has been proven to use code without license attribution. This doesn't need to be as controversial as it is today.
If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.
In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language, 2nd Edition. What comes out? The code you copied and pasted, plus attribution.
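For illustration, here's a minimal sketch of what an attribution-stamping wrapper around a completion could look like, in the spirit of that Apple Books behavior. Everything here is hypothetical: neither Copilot nor Apple Books exposes an API like this, and the field names are invented.
// Hypothetical sketch: stamp a completion with license attribution before it
// reaches the editor. The Suggestion shape is assumed, not a real API.
interface Suggestion {
  code: string;
  sourceRepo?: string; // e.g. "github.com/someuser/somerepo", when a close match is known
  license?: string;    // e.g. "MIT", "GPL-2.0"
}
function stampAttribution(s: Suggestion): string {
  // Without provenance metadata the code passes through unlabeled,
  // which is exactly the gap people are complaining about.
  if (!s.sourceRepo || !s.license) {
    return s.code;
  }
  return `// ${s.license}-licensed; adapted from ${s.sourceRepo}\n${s.code}`;
}
The stamping itself is trivial; the hard part is that the model would have to track provenance in the first place.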
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it weren't, copyright would be meaningless.
The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.
As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.
APIs are not copyrightable (in the US), so Wine is safe (in the US).
In 2004, Google added copyrighted books to its Google Books search engine, which searches across millions of book texts and shows full-page results without any author's authorization. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it is actually not.
In 2005, the Authors Guild sued over this pretty straightforward copyright violation.
Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years plus 2 years of appeal), during which, well, tech continued its little stroll. Ten years is a lot in the web world; it is even more in ML.
The judgement decided Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people supporting Google, calling it an invaluable tool to find books, one that actually caused many sales to increase; the harm the laws were trying to prevent was therefore not happening, while a lot of good came from it.
Now the second important part of the story: MOST OF THESE USEFUL USES APPEARED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the tribunal is too slow to stop you but competent enough to understand the bigger picture.
Rants aside, I doubt training-data use will be considered copyright infringement if the courts keep a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
You can learn from it, but if you start copying snippets or basing your code on it to such an extent that it's clear your work is derived from it, things start to get risky.
For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.
It might or might not be depending on the situation. Some of it might come down to intent.
Like, if the drawing was meant to be an artistic rendering with independent artistic value, it's much more likely to be fair use. If the drawing was meant as a loophole to avoid paying the licensing fee on the original, it's much less likely. Fair use has a bunch of criteria; a lot of it depends on intention and how the usage would affect the original copyright holder.
I would add that fair use lets you use a copyrighted work; it doesn't make the copyright go away. It just adds some cases where you can use the work notwithstanding the original copyright, which remains in force.
Note: IANAL, this all could be wrong. I don't have any cases, but I do know that people propose this sort of thing at Wikipedia from time to time, i.e. hiring someone to draw copyrighted photos, and it usually gets shot down as not solving the problem, although I'm not familiar with the legal basis.
> If a machine does it, is it wrong? What is the line between copying and machine learning?
What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?
Wine/Proton are safe because there is controlling 9th and SCOTUS precedent in favor of reimplementation of APIs.
The reason those precedents wouldn't apply to Copilot is that it isn't separating out APIs from implementation and implementing only what it needs for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hope of creating something new. The hope of the AI community is that the fair use argument lands more like Authors Guild v. Google than Sony v. Connectix.
Slippery slope? Are you familiar with judicial precedent? Being bound to precedents is central to common law legal systems, so I don't think the GP's take was so outlandish. "Slippery slopes" and "whataboutism" might be thought-terminating buzzwords online, but not in front of a judge.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
No, it isn't, at least not automatically, which is why license infringement exists at all; the fact that you have a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and that should equally be the case for the operator of a machine.
It's also why the concept of a clean room implementation exists at all.
I think the commenter you replied to was talking about using the functional, non-copyrightable elements of the copyrighted code. Clean-room is not even required by case law. There's precedent that explicitly calls it out as inefficient.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.
If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.
If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).
There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.
Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.
Maybe it's time for the Creative Commons licenses to address this. I'm curious whether No-Derivatives would already prohibit it. Does the ND language need tweaking, or do they need a whole new clause?
Not for GitHub: users who upload their code accept GitHub's license agreements, which allow GitHub to use the code in many different ways, including Copilot. Kind of like how, when you create a Robinhood account, you agree to arbitration and can't sue them.
It would be good to have a definitive and simple line for fair use that could be applied to all forms of copyright. Right now fair use is defined by four guidelines:
1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
2. The nature of the copyrighted work;
3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole;
4. The effect of the use upon the potential market for or value of the copyrighted work.
A programmer who studied in school and learned to code did so clearly for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works used is likely to be so minor as to go unrecognized, and the effect upon the potential market when students learn from existing works would be very hard to measure (if it could be detected at all).
When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model has supposedly learned in order to explain how a given output is generated. Machines make no distinction as to the amount and substantiality of the original works, with no ability to argue that they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.
APIs are generally covered by the interoperability exception; I am unsure how that relates to Copilot or DALL-E (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A Copilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.
Agreed. But it could go the other way as well. Let's say MS/GH wins and the decision establishes an even less healthy / profitable (?) outcome over the long term.
Remember when Napster was all the rage? And then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? That made music into the razor and the iPod into the much more profitable blades. Sure, it pushed back Napster, but artists, as the creators of the goods, have yet to recover.
I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.
This is why we can't have nice things. Copilot is the best thing to happen to developer tools in a long time; it has increased my productivity a lot. Please don't ruin it.
How would you "permit copilot learning on it"? Say, what if you could upload that code to a certain website and grant the website owner the necessary license to share your work with others (via copilot)? It sounds like that would work!
I really feel that Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith[0] is going to have a big effect on this type of thing. They are basically relying on their AI magic to make it transformative. I'm starting to think the era of learning from material other people own without a license / permission is going to end quickly.
Is it not in the agency of the developer to hit the save button?
It seems like GitHub Copilot can spit out copyrighted works all day but the person running the text editor has to "choose" which Copilot output to actually save/commit/deploy.
Does it really matter that much "how" the text in your text editor gets there? You write it yourself, or copy/paste it, or have Copilot generate it. Ultimately the individual who "approved" saving it to disk is the one violating the copyright; Copilot is just making a "suggestion".
I think if this is successful it will be very bad for the open world.
Large platforms like github will just stick blanket agreements into the TOS which grant them permission (and require you indemnify them for any third party code you submit). By doing so they'll gain a monopoly on comprehensively trained AI, and the open world that doesn't have the lever of a TOS will not at all be able to compete with that.
Copilot has seemed to have some outright copying problems, presumably because it's a bit over-fit (perhaps, at the current state of development, it must be in order to work at all, because it's failing to generalize enough). But I'm doubtful that this litigation could distinguish the outright copying from training that doesn't substantially infringe any copyright-protected right (e.g. where the AI learns the "ideas" rather than verbatim reproducing their exact expressions).
The same goes for many other initiatives around AI training material, e.g. people not wanting their own pictures used to train facial recognition. Litigating won't be able to stop it, but it will hand the few largest quasi-monopolists like Facebook, Google, and Microsoft a near-monopoly over new AI tools, since they're the only ones that can overcome the defaults set by legislation or litigation.
It's particularly bad because the spectacular data requirements and training costs already create big centralization pressures in the control of the technology. We will not be better off if we amplify these pressures further with bad legal precedents.
GitHub already has this in its TOS; that is the irony of the lawsuit. It is actually in GitHub's favor if this happens: GitHub could then jack up the price 10x as the sole provider.
… & of course we again ask Microsoft's GitHub to start respecting FOSS licenses, cooperate with the community, & retract their incorrect claim that their behavior is “fair use”.
It seems like we should come to an agreement on what these licenses are intended for, given that they were created at a time before AI like this existed. If the authors did not intend their code to be used like this, should we not respect that?
Also, does it make sense to create new licenses which explicitly state whether using it for AI training is acceptable or not - or are our current licenses good enough?
The most important part of this is not whether the lawsuit will be won or lost by one of the parties, but what is the legality of fair use in machine learning, and language models. There's a good chance that it gets to Supreme Court and there will be a defining precedent to be used by future entrepreneurs about what's possible and what's not.
It seems obvious that AI models are derivative works of the works they are trained on but it also seems obvious that it is totally legally untested whether they are derivative works in the formal legal sense of copyright law. So it should be a good case assuming we have wise and enlightened judges who understand all nuances and can guide us into the future.
You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified it, and giving a relevant date.
b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to “keep intact all notices”.
c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it.
——
I don’t see how one could argue that training on GPL code is not “based on” GPL code.
A developer is a person. Copilot is software based on the GPL code. Just because you use the word "learn" does not make what a human does and what software does the same thing.
If github or google indexes source code using a neural net to help you find it, given a query, is that also illegal? If you think of copilot as something that helps you find code you’re looking for, is it all that different, and if so, why?
In this case, wouldn’t the users of copilot be the ones responsible for any copyrighted code they may have accessed using copilot?
The crux of the issue: is the code that is being generated being used in a way that its license allows? That's it. I'm confident that this problem would go away if Copilot said:
//below output code is MIT licensed (source: github/repo/blah)
And yes, the "users" are responsible, but it's possible that Copilot could be implicated in a case depending on how access to it is licensed.
Stable diffusion has this same problem btw, but in visual arts "fair use" is even murkier.
For code, if you could use the code and respect the license, why wouldn't you? Copilot takes away that opportunity and replaces it with "trust us".
If you had legal expertise and a strong opinion on the matter I suppose you could write a persuasive brief for the consideration of the court. If you have a strong opinion but aren't a legal eagle you could write to your legislators in support of legislation explicitly supporting this use case or organize the support of people more capable in that arena.
If you are opinionated but lazy (no judgement here, as I sit watching TV), you could add a notation at the top of your repos explicitly supporting the usage of your code in such tools as fair use.
Notably, if your code is derivative of other works, you have no power to grant permission for such use of code you don't own, so best include some weasel words to that effect. Say:
I SUPPORT AND EXPLICITLY GRANT PERMISSION FOR THE USAGE OF THE BELOW CODE TO TRAIN ML SYSTEMS TO PRODUCE USEFUL HIGH QUALITY AUTOCOMPLETE FOR THE BETTERMENT AND UTILITY OF MY FELLOW PROGRAMMERS TO THE EXTENT ALLOWABLE BY LICENSE AND LAW. NOTHING ABOUT THIS GRANT SHALL BE CONSTRUED TO GRANT PERMISSION TO ANY CODE I DO NOT OWN THE RIGHTS TO NOR ENCOURAGE ANY INFRINGING USE OF SAID CODE.
Years from now when such cases are being heard and appealed ad nauseam a large portion of repos bearing such notices may persuade a judge that such use is a desired and normal use.
You could even make a GPL-esque modification, if you were so inclined, where you said: SO LONG AS THE RESULTING TOOLING AND DATA IS MADE AVAILABLE TO ALL.
Note that not only am I not your lawyer, I am not a lawyer of any sort, so if you think you'll end up in court, best buy the time of an actual lawyer instead of a smart ass from the internet.
Copilot is great, and ignorance is bliss, isn't it?
The situation that this lawsuit is trying to save you from is this: (1) copilot blurps out some code X that you use, and then redistribute in some form (monetized or not); (2) it turns out company C owns copyright on something Y that copilot was trained on, and then (3) C makes a strong case that X is part of Y, and that your use of X does not fall under "fair use", i.e. you infringed on the licensing terms that C set for Y.
You are now in legal trouble, and Copilot put you there, because it never warned you that X is part of Y, and that Y comes with such-and-such licensing terms.
Whether we like Copilot or not, we should be grateful that this case is seeking to clarify some things that are currently legally untested. Microsoft's assertions may muddy the waters, but that doesn't make them law.
I hope MS used a lot of AGPL code to train Copilot… This would be fun.
But no matter how this goes: if training AI on copyrighted inputs is "fair use", it'll end up as the ultimate "copyright laundering machine", like this "joke" project here:
1. Running and training these models is eventually going to be perfectly plausible on a home machine.
2. It's only a matter of time before a model, e.g. a popular one trained on all of the code scraped from GitHub, becomes a publicly available torrent.
3. People will be able to just run it locally as an integrated plug-in in JetBrains or VS Code.
4. You'll never know if somebody has lifted their code in violation of a license, any more than you would be able to tell if somebody used code from Stack Overflow without attribution in a commercial endeavor.
Just because some people get away with copyright infringement doesn't mean that copyright infringement is now legal.
I don't think 1-3 matter at all. The point is that GitHub is selling a tool that can commit copyright infringement. This lawsuit is trying to get them to pay the consequences for the infringement that they have enabled.
Crackpot theory: Copilot (and by association many ML tools) is a form of probabilistic encryption. Once encoded, it's virtually impossible to pull the code (plaintext) directly out of the raw ML model (the ciphertext), yet when the proper key is input ('//sparse matrix transpose'), you get the relevant segment of the original function (the plaintext) back.
We've even seen this with stable diffusion image generation, where specific watermarks can be re-created (decrypted?) deterministically with the proper input.
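To caricature that theory in code: an overfit model behaves, from the outside, like a lookup table keyed by prompts. A deliberately silly sketch (the prompt and the stored snippet are stand-ins for illustration, not actual Copilot behavior):
// Toy caricature of the "key -> plaintext" view of a memorizing model.
// A real model stores weights rather than a table, but on memorized
// training prompts an overfit model can act indistinguishably from one.
const memorized = new Map<string, string>([
  ["//sparse matrix transpose", "function transpose(m) { /* verbatim training snippet would live here */ }"],
]);
function complete(prompt: string): string | undefined {
  // The "decryption key" is just the right prompt.
  return memorized.get(prompt);
}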
The part of GitHub Copilot to which I object is that it's trained on private repos. Where does GitHub get off consuming explicitly private intellectual property for their own purposes?
If GitHub ends up having to tweak their product to avoid ethical/legal concerns, I actually imagine it could still be pretty cool. Right now Copilot is a black box that spits out code with no attribution; what if they instead worked on making it a glass box, one that always brings up snippets of other projects along with their licensing info so that you can decide how to incorporate the ideas fairly yourself? Or they could still output the same code suggestions, but always include attribution and license data along with them. Making the product more transparent would probably make more people comfortable with using it, anyway.
I wonder if the plaintiffs' code would stand up to scrutiny of whether any of it was copied, even unintentionally, from other code they saw in their years of learning to program? I know that I have more-or-less transcribed from Stack Overflow/etc, and I have a strong suspicion that I have probably produced code identical to snippets I've seen in the past.
I think some of the negativity about Copilot may come from the perception that if an individual or small startup trained an ML model on public source code and commercialised a service from it, they would be drowning in legal issues from big companies unhappy with their code being used in such a product.
In addition, just because code is publicly available on GitHub does not necessarily mean it is permissively licensed for use elsewhere, even with attribution. Copyright holders unhappy with their copyrighted works being publicly accessible can use the DMCA to issue take-downs, which GitHub does comply with, but how that interacts with Copilot and any of its training data is a different question.
As much as the DMCA is bad law, it is rather funny seeing Microsoft charged in this lawsuit under the lesser-known provision against "removal of copyright management information". Microsoft has more resources to mount a defence, so it will probably end up differently compared to a smaller player facing this action.
Consider each repo on GitHub to be a movie. What Copilot does is search for sequences of frames from any movie which line up to create a new coherent movie.
Individually, each frame is protected by the copyright of the movie it belongs to. But what happens if you take a million frames from a million different movies and just arrange them in a new way?
That's the core question here. Is the new movie a new copyrightable work, or is it plagiarizing a million other works at once? Is it legal to use copyrighted works in this way?
The other question is if it is right to use copyrighted works this way. Is this within the spirit of open source software? Or is this just a bad corporation taking advantage of your good will?
I'm not sure where I stand on this, it's a complicated problem for sure. Definitely interested to see how this plays out in court.
>By training their AI systems on public GitHub repositories (though based on their public statements, possibly much more) we contend that the defendants have violated the legal rights of a vast number of creators who posted code or other work under certain open-source licenses on GitHub.
I don't know about US copyright law, so I can't comment on the legal documents, but this website is not complaining that Copilot reproduces copyrighted content; it is complaining that Copilot was trained on copyrighted content. I don't see how you can forbid someone or something from reading and learning from something that is public (once again, reproducing it is another problem).
How much code from an existing code base is necessary to be considered copyright infringement?
For example, let's say I take a single frame of animation from a cartoon. The frame contains a mountain, a house, and a couple of characters, although those characters are not integral to the actual cartoon; maybe they're extras (villagers, not named characters like Mickey Mouse, for example).
I draw a picture of a lake with a cabin next to it, then start to draw a frontiersman, but I trace one of his arms from a villager in that previous frame of animation... One: am I in danger of copyright infringement (have I hit some arbitrary threshold)? And two: am I causing monetary losses for the cartoon?
Merits of the case aside, I'm befuddled that a company with a legal team like Microsoft's approved this product. Is their assumption that it will bring in more revenue than defending it in court will cost? The math doesn't make sense to me.
If you'd ever read even a single one of the licenses to the software I'm sure you use everyday, you'd understand.
This is such an obvious and pathetic strawman.
I often notice on Hacker News that people don't seem to understand anything about free or open-source software outside of the pragmatics of whether they can abuse the work for free.
You read a lot into my not so serious comment. Maybe internet comment sections aren't the right place for you.
But I'll bite: I know licensing, thank you. But what's copyrightable is not so easy. Licenses are not so easy. Copilot does not copy entire works, and it's very questionable whether a few lines of code are "piracy". It's a discussion that repeats again and again; there's nothing novel about it except the fact that a machine learns (and overfits for small portions of code). So please get off your high horse. I don't care for your fundamentalism.
If you know this area of IP law, then you know that LOTS of licenses, copyleft or not, require attribution (which Copilot never provides, and by its construction can't), and you know that what's problematic is when the model output is arguably not "fair use". Examples of that abound.
You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
Yes, you're right. But my point also was that it's not so easy when it's just a few lines, isn't it? Especially since this is an international issue, and what counts as a copyrightable work is not easily definable.
> You don’t need any fundamentalism to know that copilot’s output carries huge and untested legal risk. If this lawsuit clears some of this up, that’s a big win for everyone.
I agree with that! I also see this as the only proper takeaway. The rest is making money off this thing. But the US has a different lawsuit culture anyway, which I find weird.
Yeah I guess so. This website reads like bullshit bingo from some weird twitter dude trying to sell you his newest product:
"AI needs to be fair & ethical for everyone. If it’s not, then it can never achieve its vaunted aims of elevating humanity. It will just become another way for the privileged few to profit from the work of the many."
Blah blah. Can we get back to the hacking on stuff mentality?
Hah, funny. I've used Pollen before and think I had contact with him a few years ago! The blah blah about AI elevating the world is still BS, IMHO. I still disagree with his views (https://matthewbutterick.com/chron/this-copilot-is-stupid-an...) and this lawsuit.
I wasn't actually talking about him specifically, btw, when saying "this sounds like a crypto bro from Twitter". The overly enthusiastic AI talk reminded me of that; that's what I wanted to say.
It doesn't make sense. If I make a piece of software that curls a random gist and then puts it into your editor, am I infringing, or are you infringing when you run it, or are you infringing when you use that file and distribute it somewhere?
> If I make a piece of software that curls a random gist and then puts it into your editor am I infringing
Depends on the license. If it's MIT and you serve the license, no, you are not infringing at all. A trimmed version of MIT for the relevant bits:
Permission is hereby granted [...] to any person obtaining a copy of this software [...] to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, [...] subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
> are you infringing when you run it
Depends on the license
> are you infringing when you use that file and distribute it somewhere
Depends on the license
----
When copilot gives you code without the license, you can't even know!
The law will consider "intent". By your logic, web browsers are infringing.
Can you use curl to infringe on copyright? Yes. Is every time you use curl copyright infringement? No. Can you in theory tell when you are infringing with curl? Yes.
Can you use copilot to infringe? Yes. Is every time you use copilot copyright infringement? No. Can you in theory tell when you are infringing with copilot? *No*
I'm not a lawyer, but if you provide a platform that enables infringement that's different than if you provide a tool that could enable infringement.
Popcorn time vs. bittorrent.
And you are right, the EULA could say "it's up to the end user to confirm you can use this code". But then how do you verify? That slows down "productivity", where Copilot promises to "speed up" productivity.
Yep, makes sense! I guess we'll see what arguments the court finds convincing. I, for one, hope Copilot stays, but if we can delay its destruction long enough I think we'll get open-source models that will give us this for free. And then the cat will be out of the bag.
This issue seems to have an obvious solution that I fail to see anyone mention:
Treat Copilot simply as a tool; let it be trained on whatever, without any consent requirements. However, the outputs should be subject to copyright, as with any other code produced by a human. Then, on a case-by-case basis, courts can decide whether infringement has occurred. The idea of banning Copilot or other AI models as a whole just seems like a collective case of sour grapes, because innovation and automation are finally threatening people who expected these things to affect only the working class.
I think it's a great time to explain why this won't hit AI art such as Stable Diffusion, even if GitHub loses this case.
The crux of the lawsuit's argument is that the AI unlawfully outputs copyrighted material. This is evident in many tests with many people here and on Twitter even getting verbatim comments out of it.
AI art, on the other hand, is not capable of outputting the images from its training set, as it's not a collage-maker but an artificial brain with a paintbrush and a virtual hand.
Eh... I don't know. It sounds to me like you are saying that because the code model outputs exact lines, it's a copyright violation, but that the image AIs necessarily don't output exact copies of even portions of pre-existing images, because that's not how they work.
But I don't think copyright on visual images actually works like that, that it needs to be an exact copy to infringe.
If I draw my own pictures of Mickey Mouse and Goofy having a tea party, it's still a copyright infringement if it is substantially similar to copyrighted depictions of Mickey Mouse and Goofy (subject to fair use defenses: I'm allowed to do what would otherwise be copyright infringement if it meets a fair use defense, which is also not cut and dried, but if it's, say, a parody it's likely to be fair use. There is probably a legal argument that Copilot is fair use... the more money GitHub makes on it, the harder that argument gets, though making money off something is not relevant to whether it's a copyright violation in the first place, only to the fair use defense).
(Yes, it might also be a trademark infringement; but there's a reason Disney is so concerned with the copyright on Mickey expiring, and it's not that they think there's lots of money to be made selling copies of the specific Steamboat Willie movie...)
> There is actually no percentage by which you must change an image to avoid copyright infringement. While some say that you have to change 10-30% of a copyrighted work to avoid infringement, that has been proven to be a myth. The standard is whether the artworks are “substantially similar,” or a “substantial part” has been changed, which of course is subjective.
For those curious about the standards for fair use in US copyright law, and how they have been considered in previous cases, here's one legal overview:
This AI re-mixing stuff is so new, I think few legal observers would say they could definitely predict what the courts will do with it. Nobody really knew how the Google Books case, for instance, was going to go until it went.
I also don't know if I would anthropomorphize ML to that degree. It's a poor metaphor and isn't really analogous to a human brain, especially considering our current understanding, or lack thereof, of the brain, and the limited insight even the people who build these models have into how some of them work.
IMO, the case is exactly the same for copilot and generative models for images. That's why it's so important to have some precedent as a guide for future products.
If a software developer learns how to code better by reading GPL software and then later uses the skills they developed to build closed source for profit software should they be sued?
If a software developer writes a program to remember a million lines of GPL code, and then uses that dataset to "generate" some of that code, then they are essentially violating that license with extra steps.
The extra steps aren't enough to exonerate them. It's just a convoluted copy operation.
It's just like how a lossy encoding of a song is still, with respect to copyright, a copy of that song. The data is totally different, and some of the original is missing, yet it's still a derivative work. So is a remix. So is a re-performance.
Whether or not it is legally wrong to scan OSS code (I think it is wrong), there has been a time-honored precedent for disallowing automated scanning:
robots.txt
This is exactly what is needed for source code, and the default (no robots.txt) should be "disallow".
The fact that the Web has considered this moral issue should be a strong hint for the AI people not to take a purely legal stance, but to consider the OSS community that they are so heavily using.
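For what it's worth, a repo-level equivalent might look something like the sketch below. The file name and the crawler name are entirely made up, since no such standard exists; the directive syntax just borrows the robots.txt convention:
# /ai-training.txt (hypothetical file, served from the repository root)
# Default: disallow everything, matching the "no file means disallow" proposal.
User-agent: *
Disallow: /
# Opt a single, named crawler into one directory only.
User-agent: example-code-crawler
Allow: /examples/
Disallow: /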
Forgive my ignorance, but who is going to benefit from this lawsuit? I have a lot of code on GitHub, can I, for instance, expect a check in the mail in case of a win?
(Not a lawyer, so this is really definitely absolutely not legal advice and if you're looking to profit you should speak to a lawyer... for instance the lawyers who just filed the lawsuit)
They're asking for two things, injunctive relief (ordering github/openai/microsoft to stop doing this) and damages.
I suppose the injunctive relief really benefits anyone who doesn't want AI models to exist, because that's what it's asking for.
The damages will go to the members of the class certified for damages, with more going to the lead plaintiffs (those actually involved in the suit) and some going to the lawyers. They're asking for the following class definition for damages:
> All persons or entities domiciled in the United States that, (1) owned an interest in at least one US copyright in any work; (2) offered that work under one of GitHub’s Suggested Licenses; and (3) stored Licensed Materials in any public GitHub repositories at any time during the Class Period.
No worries, I put that disclaimer in because it's illegal for me to give legal advice and because I want to discourage people from thinking random internet comments are good at properly stating the law, not because I'm that worried that someone will actually decide to stake a bunch of money based on analysis in my comments ;)
I think the software is probably OK provided that the sources are credited (i.e., if Copilot copies code from, say, SDL, then the relevant code sections need to be correctly attributed and the mandatory license README copied into the project, so that all code follows the open source licenses used). That's literally the purpose of open source licenses. If Copilot can't be bothered to do that, then yeah, it should be shut down.
Made a throwaway since I guess this stance is controversial. I could not care less about how copilot was made and what kind of code it outputs. It's useful and was inevitable.
I'm 1000% on team open source and have had to refer to things like tldrlegal.com many times to make sure I get all my software licensing puzzle pieces right. Totally get the argument for why this litigation exists in the present.
Just saying in general my friends I hope you have an absolutely great day. Someone will be wrong on the internet tomorrow, no doubt about it. Worry about something productive instead.
This one has the feel of being nothing more than tilting at windmills in the long run.
I will never understand why people push code to public repos and then complain when someone or something uses that code. Code that you want to keep private or make money off of should be private. Only publish stuff to the public that you want other people to see and learn from. All the complaints about attribution… who cares.
> All the complaints about attribution… who cares.
I may not care if some guy I've never met uses my niche library without attribution. (I do care, really.) But Microsoft certainly cares if you use their code without attribution, so why shouldn't I take the same belligerent, copyright-enforcing attitude towards them? That's the main reason why people are angry, because MS has "rules for thee but not for me" by virtue of being big enough to have ~~good~~effective lawyers and lobbyists.
Copilot is trained on public repos. I'd imagine that if Microsoft didn't want you to use their code, that code would be in a private repo. There's nothing stopping me from using code in a public repo, regardless of the license.
My own view is that it is currently not legal for humans to produce derivatives of copyrighted works. So it is probably already not legal to train an artificial intelligence on copyrighted works in order to produce derivatives either.
As much as I love the little guy beating the big evil company, I hope the lawsuit doesn't cause anything to happen to copilot. Maybe some changes, like better protection against emitting 1:1 licensed code or opting out your code from training.
Can someone explain Microsoft's decision here to use GPL code in the training set? It would seem like sticking to non-attribution / non-viral licenses would have kept them in the clear. Was that data set of insufficient size?
There were well-known examples of Copilot reproducing exact code snippets well before this lawsuit (e.g. Quake's fast inverse square root function). Microsoft dealt with them by simply adding the offending function names to a blocklist.
In other words, if your open source project doesn't have such immediately recognizable code and didn't cause a shitstorm on Twitter, chances are copilot is still happily spewing out your exact code, sans the copyright and license info.
How many ways are there to write many of the basic algorithms we all use though? Can I copyright "({ item }) => <li>{item.label}</li>"?
Because I sure have seen that exact code written, from scratch, in many many places.
I guess my question boils down to: "What is the smallest copyrightable unit of code?" Because I'm certain suing a novelist for copyright infringement over a character that says "Hi, how are you?" would be considered absurd.
Apple Books slaps an attribution notice on the end if you copy four or more words from a book. The Verve got sued by The Rolling Stones over a 4-second sample on 'Bitter Sweet Symphony'. Post 'Blurred Lines', you can now be sued for copying "the feel" of a song.
Really what it comes down to is do you have enough resources to convince a judge or jury that X is a copy of Y? Doesn't really matter the size of X.
I find this whole subject exhausting. The only reason I’m glad there is a lawsuit is that we can finally put this thing to rest when either party wins.
I hope this case fails and establishes a good precedent for all future AI litigation, and maybe even prevents new suits. Your code is open source: regardless of license, one might read it as a textbook and then remember or even copy snippets and re-use them somewhere else, unrelated to the original application. If you don't like this, don't make your code open source. This was and is happening, independent of any license, all over the world, by the majority of developers. What Copilot and similar tools did was make those snippets accessible for extrapolation in new applications.
If these folks win, we again throw progress under the bus.
No thank you.
I put a license on my code to be followed, not to be disregarded by an AI as "learning material".
No human perfectly reproduces their learning material no matter what, but Copilot does.
You mean to tell me that no one has ever perfectly replicated an example they read somewhere? There are only so many ways to write AABB collision, Fibonacci, or any number of other common algorithms. I'm not saying there aren't things to consider, but I'm sure I've perfectly replicated something I read somewhere, whether I'm actively aware of it or not.
So are you ok with it being illegal for humans to learn from copyrighted books unless they have a license that explicitly allows learning? That does not sound like a pleasant consequence.
Would you use an AI text generator to write a thesis? No; there's a risk a whole chunk of it would be considered plagiarism, because you have no idea what the source of the AI output is, but you know it was trained on unknown copyrighted material. This has nothing to do with the way humans learn; it's about correct attribution.
There is no technical reason why Microsoft can't respect licenses with Copilot. But that would mean more work and less training input, so they do code laundering and excuse it with comparisons to human learning, because making AI seem more advanced than it is has always worked well in marketing.
Edit: And where do you draw the line between "learning" and copying? I can train a network to exactly reproduce licensed code (or books, or movies), just like a human can memorize it given enough time, and both would be considered a copyright violation if used without correct attribution. If you train an AI model on copyrighted data, you will get copyrighted results with random variation, which might be enough to become unrecognizable if you're lucky.
> Would you use an AI text generator to write a thesis? No; there's a risk a whole chunk of it would be considered plagiarism, because you have no idea what the source of the AI output is, but you know it was trained on unknown copyrighted material.
Of course, but that's a separate issue. We're not talking about whether the output of the AI is copyrighted. We're talking about whether it's ok for it to learn from copyrighted material.
Again you can say exactly the same about humans. I am perfectly capable of plagiarising or outputting copyrighted material. That doesn't mean it's illegal to learn from that material, just to output it verbatim.
So the fundamental issue is that it's harder to tell when an AI is plagiarising than it is when you produce something yourself. But that is a technical (and probably solvable) issue, not a legal one. And it's not the subject of this lawsuit.
Here's the thing: the US has well-established copyright law that doesn't consider learning from books a violation of copyright. This lawsuit is intended to challenge Copilot as a violation of licensing; it isn't a litigation of "how people learn". Your program stole my code in violation of my license: there's a clear legal issue here.
I'd pose a question to you: would it be okay for me to copy/paste your code verbatim into my paid product, in violation of your license, and claim that I'm just using it for "learning"?
If you cherry-picked sections of my code? I'd have no more issue with it than George R.R. Martin would if you grabbed a few paragraphs out of one of his fantasy books and used them in your novel.
I think they're taking issue with the unauthorized duplication of copyrighted code. That's distinct from learning how to code (which I don't think anyone would claim Copilot is doing), which people get from reading a book. If you were to read the book only to copy it verbatim and resell it, you're going to have a bad time.
It's a pleasant consequence for the person who spent years becoming an expert and then writing the book. It's also a pleasant consequence for the people who buy the book, which might not have existed without a copyright system to protect the writer's interests.
AIs are not humans; no human can read _all_ the code on GitHub. Humans certainly can't read _all_ the code on GitHub at the scale that MS can, and are unlikely to be able to extract profits directly from that code in violation of the licensing.
100% false: there are loads of historical cases of people with eidetic memories reproducing things they've seen with near-complete fidelity, and there's no reason to believe that a coder with such a memory would be any different.
> Your code is open source: regardless of license, one might read it as a textbook and then remember or even copy snippets and re-use them somewhere else, unrelated to the original application.
Yes, but attribution should still be given. Just because you don't copy-paste someone else's creation doesn't mean you're licensed to use it.
Is it the role of the tool (in this case copilot) to include the license information? Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?
What if, instead of a tool, you had a random consultant do some work, and it was found out that he asked a ton of stuff on Stack Overflow and copied the CC-BY-SA 4.0 answers into his work? What if it was then found out that one of those answers was based on copying something from the Linux kernel? Who is responsible for doing the license check on the code before releasing the product?
> Or is it the responsibility of the organization using the code to make sure that it wasn't copied from somewhere?
Do you know whether the code you got from Copilot has an incompatible license? No. So if you plan to use Copilot for serious projects, you need it to include sources and licenses either way. In fact, that would be a very helpful feature, as it would let you filter by license.
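As a minimal sketch of the kind of filtering that would enable, assuming (hypothetically) that each suggestion carried a license tag, which no current Copilot API exposes:
// Hypothetical: drop any suggestion whose license is unknown or off the
// project's allowlist.
const allowedLicenses = new Set(["MIT", "BSD-2-Clause", "Apache-2.0"]);
function isUsable(license: string | undefined): boolean {
  // Unknown provenance fails closed, which is the conservative default
  // this comment is asking for.
  return license !== undefined && allowedLicenses.has(license);
}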
Licenses be damned; copyright law sits above them, and for now it's hard to see how this isn't fair use. The only problematic case might be an open source Copilot alternative, and GitHub and OpenAI can take any such projects out of the training set.
Perhaps the lawsuit contends that Copilot isn't in fact learning how to code, but is rather regurgitating information it has managed to glean and statistically categorize, without any real understanding as to what it was doing?
So why can MS screw around only with some licenses, the ones you call "open source"?
Your example of a human reading a book would also work for source-available licenses or decompiled binaries.
I would have been fine if the open source code had been used to create an open model, or if MS had put its ass on the line and also trained the model on all the GitHub code, since they claim there is no copyright issue.
The problem is that copyright laws were introduced for a reason, and with a thinking similar to yours we might decide to get rid of copyright altogether, which I think is a bad idea.
If organisations are going to ignore the licenses attached to my OSS, and that's legitimised in law, then that's a surefire way to irreparably damage the open source ecosystem.
var G = 6.67e-11;
var force = G * mass1 * mass2 / distance * distance;
return force;
Second prompt:
% pointMatrixGravity - Calculates the gravitational force and torque on a
% point mass due to a matrix of point masses.
%
% [force, torque]=pointMatrixGravity(array1,array2)
%
% Inputs:
% array1 - 3xN array of point masses
% array2 - 3xN array of point masses
% Outputs:
% force - 3x1 array of gravitational force
% torque - 3x1 array of gravitational torque
%
% Author: David J. R. MacKay,
% email:
I got something a bit different from the other person who responded to you.
function force=Gmmr2Array(mass1, mass2) {
var G = 6.67408e-11;
var m1 = mass1;
var m2 = mass2;
var r = 1;
var force = G * m1 * m2 / (r * r);
return force;
}
function [force, torque]=pointMatrixGravity(array1,array2) {
var G = 6.67408e-11;
var m1 = array1(1);
var m2 = array2(1);
var r = norm(array1(2:4)-array2(2:4));
var force = G * m1 * m2 / (r * r);
var torque = cross(array1(2:4)-array2(2:4), force);
return [force, torque];
}
I have quite a bit of respect for Matthew Butterick. I don't think he's just a lawyer looking to earn a quick buck. He cares about software and wants to make the world a better place.
> But neither Matthew Butterick nor anyone at the Joseph Saveri Law Firm is your lawyer
This is curious. None of them are my lawyers, but surely at least some of them are someone's lawyers? Isn't it wrong to put such a blanket disclaimer on a website which might well be read by their clients?
This. I've seen so many class action lawsuits where, at the end of the day, the highest gain per capita ends up going to the lawyers. Fuck this guy and everyone trying to make money from this.
I don't have a comment on this personally, but I want to throw this out there, because every time I see people criticizing Copilot or DALL-E someone always says "BUT IT'S FAIR USE!" Those people don't seem to grasp that "fair use" is a defense. The burden is not on me to prove that what you are doing is not fair use; the burden is on you to prove that what you are doing is fair use.
As celestialcheese says [1], it seems like a manufactured case for the purpose of furthering someone's legal career rather than seeking redress for any violations made by Copilot.
But I like to put on my conspiracy hat from time to time, and right now is one such time, so let's begin...
Though the motivations behind this case are uncertain, what is certain is that this case will establish a precedent. As we know, precedents are very important for any further rulings on cases of a similar nature.
Could it be the case that Microsoft has a hand in this, in trying to preempt a precedent that favors Copilot in any further litigation against it?
Ask HN: I want to modify the BSD 2-Clause Open Source License to explicitly prohibit the use of the licensed software in training systems like Microsoft's Copilot (and use during inference). How should the third clause be worded?
The No-AI 3-Clause Open Source Software License
Copyright (C) <YEAR> <COPYRIGHT HOLDER>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in
the documentation and/or other materials provided with the
distribution.
3. Use in source or binary forms for the construction or operation
of predictive software generation systems is prohibited.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
IANAL, and I'm no fan of Copilot, but I wonder whether this kind of clause (your #3) is going to fly: you're preemptively prohibiting certain kinds of reading of the code (namely, reading by the ML model during training). But is that something a license can actually do?
The legal footing that copyright gives you, on which licensing rests, certainly empowers you to limit things about how others may redistribute your work (and things derived from it), but does it empower you to limit how others may read your work? As a ridiculous example, I don't think it would be enforceable to have a license say "this code can't be used by left-handed people", since that's not what copyright is about, right?
Licenses get to set terms of redistribution. But training of the ML model -- the thing described by your #3 -- is not redistribution (imho). So maybe it's as unenforceable as saying left-handed people can't read your code.
The redistribution happens later, either when Copilot blurts out some of your code, or when the Copilot user then distributes something using that code (I'm curious which). At that point, whether some use of your code infringes your license doesn't depend on the path the code took, does it? (In which case #3 is moot.)
Okay; thanks for clarifying. I actually hadn't noticed that use of "use" in the BSD license before. I think I need an IP lawyer to explain what that "use" means.
The legal theory for copilot is that training an ML model is fair use, not that the license allows it. If it is fair use then you can't prohibit it by license, no matter what you put in your license.
For this clause to have any positive effect, you need to 1) be willing to pursue legal action against violators and 2) actually notice that the clause has been violated.
Such language must be carefully written. What is the definition of “construction” and “operation” in a legal context? What is a “predictive software generation system”? That’s a very specific use case; are you sure you covered everything you want to prohibit?
You’ve inserted your clause in such a way that this dependency cannot be used in any way to build anything similar to a “predictive software generation system”, even with attribution, as it would fail clause 3.
You have to consider that novel licenses make it difficult for any party that respects licenses to use your code. It is difficult to make one-off exceptions, especially when the text is not legally sound. So adoption of your project will be harmed.
So if you are serious about this license, you need a lawyer.
How would you ever prove the parameters of a model were generated by specific training data? Couldn't multiple sets of training data produce the same embeddings/parameters? I imagine there could be infinitely many sets of training data that would lead to the same results, depending on the type of predictive software.
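A toy illustration of that point (my own, hypothetical, in TypeScript): two different "training sets" that produce exactly the same fitted parameter under least squares, so the parameter alone can't tell you which data it came from.
// Least-squares fit of y = a*x through the origin: a = sum(x*y) / sum(x*x).
function fitSlope(xs: number[], ys: number[]): number {
  let num = 0;
  let den = 0;
  for (let i = 0; i < xs.length; i++) {
    num += xs[i] * ys[i];
    den += xs[i] * xs[i];
  }
  return num / den;
}
console.log(fitSlope([1, 2], [2, 4])); // 2
console.log(fitSlope([3, 5, 7], [6, 10, 14])); // also 2, from different data
That said, a distinctive artifact surviving in the output, like the author line in the snippet upthread, cuts the other way: it's hard to argue the training data didn't include a work the model reproduces token for token.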
This does seem like a pretty compelling rebuttal, since the preceding comment suggests that the GPL does nothing to limit Microsoft's ability to incorporate code into Copilot's model.
Is it? A similarly casual clause in the OCB license prevented OCB from being used by the military for many years (granted, it prevented OCB from being used almost everywhere else, too).
I have no idea if this license language works or doesn't, but this is hardly the least productive subthread on this story. It's concrete and specific, and we can learn stuff from it.
Copilot isn't just "displaying" something. Copilot has mined the collective effort of developers, without permission, to produce derivative works, redistributing that value without giving anything back.
It'd be like suing Adobe because Photoshop shipped bundled with your holiday photos, taken without permission, and used them in a "family photos" filter.
Large-scale mining of value, then selling it without due credit or reward to those you stole that value from, is plain theft.