
I read most of the complaint. The only examples of supposed copyright infringement are isEven and isPrime functions. Here's what Copilot gives me in a TypeScript file:

  function isPrime(n: number): boolean {
    for (let i = 2; i < n; i++) {
      if (n % i === 0) {
        return false;
      }
    }
    return n > 1;
  }
  
  function isEven(n: number): boolean {
    return n % 2 === 0;
  }
These are clearly not covered by copyright in the first place. This case is really quite pathetic.



Correct me if I'm wrong, but I don't think this document needs to be a comprehensive record of every piece of copyrighted material that Copilot or Codex produces. That's something that will be produced during/for the trial process itself. Right now, this is just establishing the basic premise and the claims about the type of behavior that is going on.

I think they intentionally picked (literal) textbook examples because they're short and easy for non-experts to grasp and have some understanding of. But I don't think we've seen any of the code from the respective J. Doe's yet, and I would assume we would in the trial (possibly in addition to more cases).


I tested Copilot initially with Hello World in different languages. In Lisp, it gave me verbatim code from a particular tutorial, which was made obvious because their code had "Hello <tutorialname>" where <tutorialname> was the name of a YouTube tutorial, instead of the word "World." It was surely slurped into the model via someone who had done the tutorial and uploaded their efforts to GitHub. Mind you, it's pretty much the way everyone would code it, but the inclusion of <tutorialname> is definitely an issue.

So it isn't too hard to prove the case.


I have only skimmed. But lines 23 and 24 on page 23 also reference Copilot's autocompletion of Quake III's `Q_rsqrt`[1] and mention that it is under GPL2.

[1] https://news.ycombinator.com/item?id=27710287


I doubt that anything other than Carmack's comments are covered. See the "Filtration" section:

https://en.wikipedia.org/wiki/Abstraction-Filtration-Compari...

"In computer programs, concerns for efficiency may limit the possible ways to achieve a particular function, making a particular expression necessary to achieving the idea. In this case, the expression is not protected by copyright."

"Finally, material that exists in the public domain can not be copyrighted and is also removed from the analysis."

That code is specifically optimized for efficiency and there were similar approaches floating (get it?) around in the 1980s.


The magic constant is not optimal; better alternatives exist. That particular constant is not necessary to implement this function, so it should be copyrightable. It is also not a trivial part.

On the other hand, Microsoft may only need to show "Hey, we got this code from FooBar under this license and this license and ..."


Why should it be copyrightable? It's just a way to calculate the inverse square root. This falls under the public domain, in my non-lawyer opinion. Such small snippets usually do not qualify for copyright.


It's not just the constant, but that was the easiest part for me to identify in the last post. And due to its popularity, the size of the snippet doesn't matter; it stands on its own as a significant work.

The essence of the algorithm takes 4 lines: function declaration, declaration of 'y', one line for calculating the exponent in log-space, one line for returning the root finding.

The rest is fluff. Every line of the snippet has creative input with the chosen names ('threehalfs' for 1.5F), the order of declarations and instructions, the redundancy. There have been internet-wars around indentations and newlines, these are style choices.
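For readers who have not seen the algorithm, the core described above can be sketched roughly as follows. This is a hedged TypeScript illustration, not the Quake III source: the name fastInvSqrt and the DataView scratch buffer are assumptions made here, because JavaScript has no pointer cast for reinterpreting a float's bits. Only the widely known constant 0x5f3759df and the single Newton step come from the original technique.

```typescript
// Scratch buffer so the same 32 bits can be read as a float or as an integer.
// (Illustrative helper; the C original uses a pointer cast instead.)
const scratch = new DataView(new ArrayBuffer(4));

function fastInvSqrt(x: number): number {
  // Reinterpret the float32 bits of x as an unsigned 32-bit integer.
  scratch.setFloat32(0, x);
  const bits = scratch.getUint32(0);

  // Halving and negating the exponent in log-space gives a first guess.
  scratch.setUint32(0, 0x5f3759df - (bits >>> 1));
  const y = scratch.getFloat32(0);

  // One Newton-Raphson step refines the guess toward 1 / sqrt(x).
  return y * (1.5 - 0.5 * x * y * y);
}

console.log(fastInvSqrt(4)); // close to 0.5, but not exact
```

The result is only an approximation for positive finite inputs; the naming, ordering, and comments in any given implementation are separate from these few arithmetic steps.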

((And it is public -- GPL more specifically, which is a restrictive license that should be respected. I think this snippet makes a perfect example of the dangers of Copilot. But not one to litigate details with.))

(((Thinking back, I'm not sure anymore how the license laundering argument works if they got the code from a fair-use MIT-licensed hobby project. Can one person claim fair-use and include it under an MIT-license and have somebody else say 'oh this free code I'm going to use it commercially'?)))


You didn't read the relevant part of the complaint. It starts on document page 14 (PDF page 17). There's a clear footnote:

> Due to the nature of Codex, Copilot, and AI in general, Plaintiffs cannot be certain these examples would produce the same results if attempted following additional trainings of Codex and/or Copilot.

The offending solution from the AI included extra lines that are reasonably understood to come straight from Eloquent JavaScript:

    console.log(isEven(50));
    // → true
    console.log(isEven(75));
    // → false
    console.log(isEven(-1));
    // → ??


This seems like an incredibly trivial example. If I remembered that example subconsciously, and used it myself somewhere, would that be an infringement of intellectual property? In any large code base how many such infringements are there? Many? Should we sue every software company on this premise?


Sure, those comments might be considered infringement, but that's from an earlier version of Codex. Copilot does not return that code. The complaint even says so.


If software systematically engages in copyright violation but only haphazardly corrects those violations, those haphazard corrections aren't evidence the problem has vanished.


If Copilot is committing widespread infringements of their copyright, then surely they will be able to find examples of such infringement to submit in their lawsuit.

I assume they want some kind of broad relief, such as an injunction to take down Copilot. They are not going to get it, and they are not going to get anything at all, if they can’t even provide examples of violating code.


During the Pirate Bay case, the prosecutor only had to illustrate that it was likely (as in, convince the judges) that copyright infringement had occurred. They did this by showing the top 100 torrents. They did not have to prove with certainty that the top 100 torrents were actually used by people. The fact that the names of movies and games showed up on the list was enough to convince the judges.

The lawyers defending the founders did try to make the argument that no infringement had been proven, and that the list itself was not proof of any infringement. It was just a list on a website, and they even presented evidence that the counter algorithm behind the list was faulty. The judges were not convinced and applied the common-sense approach that, taken as a whole, it was not believable that no infringement had occurred via the website given the context of the site (the name, the top list, the overall perspective of how the site was designed).


> ...then surely they will be able to find examples of such infringement to submit in their lawsuit

Perhaps that is why they are reaching out to potential class members

> if they can’t even provide examples of violating code.

This is the very beginning of a very long process. I wouldn't rule out a settlement where class members get $10-100, which is a common resolution for class action suits.


In filing a lawsuit you make plausible allegations and claims; it is not the place to present evidence.


There are many public examples of that same effect happening (for example https://twitter.com/mitsuhiko/status/1410886329924194309 ), and the legal team has been soliciting for more examples. Those examples are likely to come out if it does go to trial.


If this legal team was interested in this going to trial, you'd think they would have put together a stronger case instead of risking that it won’t be heard.

There’s not even a single mention of any established legal doctrines around copyright and software, such as abstraction-filtration-comparison, the idea-expression dichotomy, etc.


It's a complaint, not a brief.


I'm all for punishing GitHub Copilot if it produces copyrighted code, but this isEven example is absolutely trivial and has no right to be copyrighted.


Moreover, if this case wins, it threatens to disrupt one of the biggest technological progressions of all time.

AI/ML will change every field just as the Internet and smartphones did. It doesn't show any indication of peaking, either.

If the US chooses the wrong path here, we'll only tie our hands behind our backs. Other countries won't be so foolish.

We should be able to train on any media a child could see, hear, or read.


> it threatens to disrupt one of the biggest technological progressions of all time.

Chill dude, all they have to do is include the licenses on their generated code.

If anything, this is going to generate even more progress. The Copilot team would have to create some kind of feature that would connect the generated output to the relevant training data. That'd be pretty incredible to see in the field of AI/ML in general.


If they can actually link output to specific input, the lawsuit has merit and more, GPT-3 is a lie. A neural network is supposed to learn how things work, not memorize a large number of examples and spit them out verbatim - or keep connections to specific inputs.

Copilot losing the lawsuit is evidence it’s a case of overfitting, not true ML.


Mind you, not all outputs are the result of overfitting.


They already have that feature and you can turn it on


Ah great, then it should be on and with no way to disable it.

Of course, end users could just strip the license/attribution off their generated output, but that's a different story.


Not just AI is threatened, but also the use of sites like StackOverflow because some of those snippets might infringe a license. So we have to write everything from our heads, de novo. No more googling for solutions.

I think we should just relax copyright; it's dying anyway. Language models allow people to borrow skills learned from other people and solve tasks. That's huge. Like in The Matrix, loading up a new skill at the press of a button. Can we give up such a huge advantage in order to protect copyright?

I think the notion of copyright has already been under attack for two decades from the internet, search engines and social networks. They all work against it, and AI does it even more. It just encapsulates the whole culture in a box, mixing everything up, all copyrights melting away, everything one prompt away. This could be a new medium of propagation for ideas. No longer limited to human brains and books, they can now propagate through language models more efficiently.


The copyright- and IP-obsessed USA in particular won't go down this path.

Otherwise they would create the ultimate "copyright laundry machine".[1]

I'm very sure at least Hollywood and the big music labels would not like that… ;-)

[1] https://news.ycombinator.com/item?id=33459967


That isPrime function does not even stop at sqrt(n). Asking for a state-of-the-art isPrime function is too much, but the sqrt trick is the very first step and it's free. (IIRC, the faster version uses i*i <= n as the loop condition; i*i < n would wrongly report perfect squares such as 4 and 9 as prime.)
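A minimal sketch of that square-root cutoff, in TypeScript to match the snippet at the top of the thread. The name isPrimeSqrt and the integer guard are assumptions added here for illustration; this is not Copilot output and not code from the complaint.

```typescript
// Trial division that stops at sqrt(n): if n = a * b with a <= b, then a * a <= n,
// so any composite n has a divisor i with i * i <= n.
function isPrimeSqrt(n: number): boolean {
  // Guard (an assumption of this sketch): only integers >= 2 can be prime.
  if (!Number.isInteger(n) || n < 2) {
    return false;
  }
  // Note the <=: with a strict < the loop would skip i = 2 for n = 4.
  for (let i = 2; i * i <= n; i++) {
    if (n % i === 0) {
      return false;
    }
  }
  return true;
}

console.log(isPrimeSqrt(97)); // true
console.log(isPrimeSqrt(49)); // false
```

Compared with looping up to n, this does on the order of sqrt(n) divisions in the worst case, which is why it is usually the first optimization people reach for.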


Searching cs.github.com for "console.log(isEven(50));" "// → true", which is one of the snippets the complaint is about and which is also reproduced in a programming textbook, gives:

" Showing 1 - 20 of 66 files found (in 76 milliseconds)"

So, if this lawsuit succeeds in some way shape or form, does the author have a case against the 66 people that reproduced these lines in their own repository?


You could argue that if the author pursued enforcing their licence over those 66 people their code wouldn't have ended up in the training set in the first place. IANAL but I recall that you can't invoke copyright law to selectively enforce it, copyright is only protected if the holder pursues every violation of it. Maybe it works the same for enforcing a licence.


> I recall that you can't invoke copyright law to selectively enforce it, copyright is only protected if the holder pursues every violation of it

IIRC, that is wrong. What you are describing is trademarks, not copyright.


Thanks, I wasn't sure


They can already sue those people if they don't follow the original license, they just need to file a complaint individually to each author, I think. Standard OSS license stuff, or else, why would people even use licenses?


And who enforces that?


The owners of the original work


Those are quite copyrightable, in the same way that rangeCheck() is copyrighted.



