I am not against this lawsuit but I'm against the implications of this because it can lead to disastrous laws.
A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong? What is the line between copying and machine learning? Where does overfitting come in?
Today they're filing a lawsuit against copilot.
Tomorrow it will be against Stable Diffusion (or DALL-E, GPT-3, whatever).
And then eventually against Wine/Proton and emulators (are APIs copyrightable?).
Well, they are a special case here, however, since they don't solve a specific problem or build a program per se, but instead (re)build a program to existing specs. Their explicit goal is to match the behaviour of another piece of software with a translation layer.
Forbidding contributions from people who have seen the "source" program is most likely meant to keep their version from drifting from "matching behaviour" to "behaving like", as in containing the same code. It might also be intended as a safeguard so that well-intentioned developers do not accidentally break their own (most likely existing) NDAs.
That was out of abundance of caution, not based on any legal precedent.
In fact, the little precedent that exists over learning from copyrightable code is in favor of it.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
It demonstrates that it stifles copying. That may make it easier for the copier to innovate, but doesn’t dispute the main argument for having copyright protection: that, without the protection of copyright, the code wouldn’t have been written.
Most of it? I would think >50% of open source code writers find it necessary to restrict the rights to copy and use their code. In a world without copyright protections, would the GPL be legal?
(and I guess courts might, in the future, say the GPL expires when copyrights on the code expire)
Sure, but given the timetable for changing the law, it still seems pretty reasonable to apply the same standard to Microsoft (and by extension GitHub) in the meantime.
I don't quite agree. Microsoft took a conservative approach to copyright to protect their own business.
Meanwhile, open source software has had an immeasurable benefit to society. My computer, TV, phone, light bulb, etc. all benefit from OSS under various licenses, and only a subset use a copyleft-like license.
The fact that the laws are inconsistent and expensive to defend against leads companies like Microsoft to take this conservative approach that slows down progress.
Copyright laws aren't preventing you from learning cinematography by watching said Disney movies though, and using all their techniques for your own project.
OpenAI did a dirty job though, judging by the cases of the model reproducing code verbatim down to the comments, so I can understand why one would criticize this specific project.
That sucks for little snippets of software though, doesn’t it? It’s like copyrighting individual dance moves (not allowed under the current system) and forcing dancers to never watch each other to make sure they’re never stealing.
I mean, it's not like the copyrights are keeping you from doing things. It's stopping you from looking at someone else's source. And it's not like source is easy to accidentally see like dance moves are.
Way, way back in 1992, Unix Systems Laboratories sued BSDI for copyright infringement. Among other things, they claimed that since the BSD folks had seen the Unix source code, they were "mentally contaminated" and their code would be a copyright violation. This led to the BSD folks wearing "mentally contaminated" buttons for a while.
GitHub Copilot has been proven to use code without license attribution. This doesn't need to be as controversial as it is today.
If you're using code and know that it will be output in some form, just stick a license attribution in the autocomplete.
In fact, did you know this is what Apple Books does by default? Say, for example, you copy and paste a code sample from The C Programming Language, 2nd Edition. What comes out? The code you copied and pasted, plus attribution.
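To make the "stick a license attribution in the autocomplete" idea concrete, here is a minimal sketch in Python. It is not how Copilot or Apple Books actually work. It assumes a hypothetical index of known licensed snippets, and it only catches verbatim matches modulo whitespace. The names `Source`, `AttributionIndex`, and `with_attribution` are made up for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Source:
    """Where a known snippet came from (hypothetical record format)."""
    repo: str
    license: str


def normalize(code: str) -> str:
    # Collapse all whitespace so trivial reformatting does not hide a match.
    return " ".join(code.split())


def fingerprint(code: str) -> str:
    # Hash of the normalized text; equal hashes mean a verbatim match
    # (ignoring whitespace), not a "similar" snippet.
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()


class AttributionIndex:
    """Maps fingerprints of known snippets to their source (an assumption:
    a real system would need such an index built from the training data)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, Source] = {}

    def add(self, code: str, source: Source) -> None:
        self._by_hash[fingerprint(code)] = source

    def lookup(self, code: str) -> Optional[Source]:
        return self._by_hash.get(fingerprint(code))


def with_attribution(suggestion: str, index: AttributionIndex,
                     comment_prefix: str = "#") -> str:
    # Prepend an attribution comment only when the suggestion matches
    # a known snippet; otherwise return it unchanged.
    source = index.lookup(suggestion)
    if source is None:
        return suggestion
    header = f"{comment_prefix} Source: {source.repo} (license: {source.license})"
    return f"{header}\n{suggestion}"
```

Near-verbatim output (renamed variables, reordered statements) would slip past an exact hash like this, which is part of why the "just attribute it" fix is harder than it sounds.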
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
If a human programmer reads someone else's copyrighted code, OSS or otherwise, memorizes it and later reproduces it verbatim or nearly so, that is copyright infringement. If it weren't, copyright would be meaningless.
The argument, so far as I understand it, is that Copilot is essentially a compressed copy of some or all of the repositories it was trained on. The idea that Copilot is "learning from" and transforming its training corpus seems, to me, like a fiction that has been created to excuse the copyright infringement. I guess we will have to see how it plays out in court.
As a non-lawyer it seems to me that stable diffusion is also on pretty shaky ground.
APIs are not copyrightable (in the US), so Wine is safe (in the US).
In 2004, Google added copyrighted books to its Google Books search engine, which searches the text of millions of books and shows full-page results without any author's authorization. Any sane lawyer of the time would have bet on this being illegal because, well, it most certainly was. And you may be shocked to learn that it is actually not.
In 2005 the Authors Guild sued over this pretty straightforward copyright violation.
Now an important part of the story: IT TOOK 10 YEARS FOR THE JUDGEMENT TO BE DECIDED (8 years + 2 years of appeal) during which, well, tech continued its little stroll. Ten years is a lot in the web world; it is even more for ML.
The judgement decided Google's use of the books was fair use. Why? Not because of the law, silly. A common error we geeks make is to believe that the law is like code and that it is an invincible argument in court. No, the court was impressed by the array of people who were supporting Google, calling it an invaluable tool to find books that actually caused many sales to increase, and therefore the harm the laws were trying to prevent was not happening while a lot of good came from it.
Now the second important part of the story: MOST OF THESE USEFUL USES HAPPENED AFTER THE LITIGATION STARTED. That's the kind of crazy world we are living in: the laws are badly designed and badly enforced, so the way to get around them is to disregard them for the greater good, and hope the court is not competent enough to be fast, but not so incompetent that it fails to understand the bigger picture.
Rants aside, I doubt training data use will be considered copyright infringement if the courts have a mindset similar to the one they had in 2005-2015. Copyright laws were designed to preserve the authors' right to profit from copies of their work, not to give them absolute control over every possible use of every copy ever made.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
You can learn from it, but if you start copying snippets or base your code on it to such an extent that it's clear your work is based on it, things start to get risky.
For comparison, people have tried to get around copyright of photos by hiring an illustrator to "draw" the photo, which doesn't work legally. This situation seems similar.
It might or might not be depending on the situation. Some of it might come down to intent.
Like if the drawing was meant to be an artistic rendering with independent artistic value, it's much more likely to be fair use. If the drawing was meant to be a loophole to avoid paying the licensing fee on the original, it's much less likely. Fair use has a bunch of criteria, and a lot of it depends on intention and how the usage would affect the original copyright holder.
I would add that fair use lets you use a copyrighted work, it doesn't make the copyright go away, just adds some cases where you can use the work notwithstanding the original copyright, but the original copyright is still there.
Note: IANAL, this all could be wrong. I don't have any cases. I do know that people propose this sort of thing at Wikipedia from time to time, i.e. hiring someone to draw copyrighted photos, and it usually gets shot down as not solving the problem, although I'm not familiar with the legal basis.
> If a machine does it, is it wrong? What is the line between copying and machine learning?
What is the difference between a neighbor watching you leave your home to visit the local grocery store and mass surveillance? Where do you draw the line?
Wine/Proton are safe because there is controlling 9th and SCOTUS precedent in favor of reimplementation of APIs.
The reason why those wouldn't apply to Copilot is because they aren't separating out APIs from implementation and just implementing what they need for the goal of compatibility or "programmer convenience". AI takes the whole work and shreds it in a blender in the hopes of creating something new. The hope of the AI community is that the fair use argument is more like Authors Guild v. Google rather than Sony v. Connectix.
Slippery slope? Are you familiar with judicial precedent? Being bound to precedents is central to common law legal systems, so I don't think the GP's take was so outlandish. "Slippery slopes" and "whataboutism" might be thought-terminating buzzwords online, but not in front of a judge.
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use.
No it isn't, at least not automatically, which is why license infringement exists at all. The fact that you have a brain doesn't change that and never has. If you reproduce someone's code you can be in hot water, and that should also be the case for the operator of a machine.
It's also why the concept of a clean room implementation exists at all.
I think the commenter you replied to was talking about using the functional, non-copyrightable elements of the copyrighted code. Clean-room is not even required by case law. There's precedent that explicitly calls it out as inefficient.
More important, the rule urged by Sony would require that a software engineer, faced with two engineering solutions that each require intermediate copying of protected and unprotected material, often follow the least efficient solution (In cases in which the solution that required the fewest number of intermediate copies was also the most efficient, an engineer would pursue it, presumably, without our urging.) This is precisely the kind of “wasted effort that the proscription against the copyright of ideas and facts . . . [is] designed to prevent.” (Sony v. Connectix)
> A programmer can read available but not OSS-licensed code and learn from it. That's fair use. If a machine does it, is it wrong?
My (extremely amateur) understanding is that what is meant by "learn from it" is one of the hinge points of the legal question.
If a programmer reads licensed code and reproduces it verbatim or near-verbatim in a project with a conflicting license, that becomes a legal problem in certain circumstances.
If a programmer reads the same code and gets an idea to implement something different, that's less troublesome (or at least, if it is troublesome it's in a different area; if the idea was related to a patentable process, then other questions arise, but I'm even less qualified to speak to that area of law).
There's nothing special about copy/paste buttons that make them the only way you can infringe copyright.
Fair use doesn't automatically kick in just because someone uses what they took/copied as part of a larger artifact; it's a really complicated legal line.
Maybe it's time for the Creative Commons licenses to address this. I'm curious whether No-Derivatives would already prohibit this. Does the ND language need tweaking? Or do they need a whole new clause?
Not for GitHub: users who upload their code accept GitHub's license agreements, which allow it to use the code in many different ways, including Copilot. Kind of like how when you create a Robinhood account you agree to arbitration and can't sue them.
It would be good to have a definitive and simple line for fair use that could be applied to all forms of copyright. Right now fair use is defined by four guidelines:
The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes
The nature of the copyrighted work
The amount and substantiality of the portion used in relation to the copyrighted work as a whole
The effect of the use upon the potential market for or value of the copyrighted work.
A programmer who studied in school and learned to code did so clearly for an educational purpose. The nature of the work is primarily facts and ideas, while expression and fixation are generally not what the school is focusing on (obviously some copying of style and implementation could occur). The amount and substantiality of the original works used is likely to be so minor as to be unrecognizable, and the effect of the use upon the potential market when students learn from existing works would be very hard to measure (if it could be detected at all).
When a machine does this, are we going to give the same answers? Its purpose is explicitly commercial. Machines operate on expression and fixation, and the operators can't extract the idea that a model should have learned in order to explain how a given output is generated. Machines make no distinction about the amount and substantiality of the original works, and their operators have no way to argue that they intentionally limited their use of the original work. And finally, GitHub Copilot and other tools like it do not consider the potential market of the infringed work.
APIs are generally covered by the interoperability exception. I am unsure how that is related to Copilot or DALL-E (and the like). In the Oracle v. Google case the court also found that the API in question was neither an expression nor a fixation of an idea. A Copilot that only generated header code could in theory be more likely to fall within fair use, but then the scope of the project would be tiny compared to what exists now.
Agreed. But it could go the other way as well. Let's say MS / HB win and the decision establishes an even less healthy / profitable (?) outcome over the long term.
Remember when Napster was all the rage? And then Jobs and Apple stepped in and set an expectation for the value of a song (at 99 cents)? And that made music into the razor and the iPod the much more profitable blades. Sure, it pushed back Napster, but artists, as the creators of the goods, have yet to recover.
I'm not saying this is the same thing. It's not. Only noting that today's "win" is tomorrow's loss. This very well could be a case of be careful what you wish for.