The Changelog: Software Development, Open Source - We ask a lawyer about GitHub Copilot (Interview)

Episode Date: September 8, 2021

This week we're bringing JS Party to The Changelog. Nick Nisi and Christopher Hiller had an awesome conversation with Luis Villa, co-founder and General Counsel at Tidelift. They discuss GitHub Copilot and the implications of an AI pair programmer and fair use from a legal perspective.

Transcript
Starting point is 00:00:00 What's up? Welcome back. This week, we're bringing JS Party to The Changelog. Nick Nisi and Christopher Hiller had an awesome conversation with Luis Villa, co-founder and general counsel at Tidelift, talking about GitHub Copilot and the implications of an AI pair programmer and fair use from a legal perspective. Of course, big thanks to our partners, Linode, Fastly, and LaunchDarkly. We love Linode. They keep it fast, and they keep it simple. Get $100 in credit at linode.com. This episode is brought to you by Gitpod. Gitpod lets you spin up fresh, ephemeral, automated dev environments in the cloud in seconds.
Starting point is 00:00:51 And I'm here with Johannes Landgraf, co-founder of Gitpod. Johannes, GitHub made a big announcement recently with Codespaces, validating that it is now time for dev teams to consider what automated dev environments can do for them. What do you have to say to that? I'd say, welcome to the party, GitHub and Microsoft. No, honestly, we were very excited because it validated to the developer community what we have been pioneering over the last years,
Starting point is 00:01:13 that developer environments need to be automated and ephemeral. We are now at the right place and the right time to move software development to the cloud for everybody, not just for developers working for the Googles, Facebooks, or Shopifys, who left local development behind several years ago. Gitpod is open source and provisions cloud-powered dev environments for every development team on GitHub, GitLab, and Bitbucket. You can access your developer environments via upstream VS Code running on your desktop or in the browser, and soon also all JetBrains IDEs.
Starting point is 00:01:42 Very cool. If this gets you excited, learn more and get started for free at gitpod.io. Gitpod is free for individual developers for 50 hours a month, can be self-hosted, and is available for every developer today. Again, gitpod.io. This is JS Party, a weekly celebration of JavaScript and the web. Tune in live on Thursdays at 1 p.m. Eastern, 10 a.m. Pacific. Watch the show live on YouTube at youtube.com slash changelog or subscribe at jsparty.fm. All right, y'all, it's party time. Hello and welcome to JS Party. I'm your host this week, Nick Nisi. Ahoy, ahoy. Ahoy, ahoy. And with me is Chris, aka Boneskull. Boneskull, what's up?
Starting point is 00:02:39 What's up, Nick? Yay. Yeah, welcome to the show. Very excited about our topic today. And on that note, I want to introduce our special guest, and that is Luis Villa. Luis, how is it going? It's going pretty well. I mean, I'm really excited. You know, Chris is just clearly so excited to talk to a lawyer. Like I'm just always really glad when I see that enthusiasm. Oh yeah. Definitely. It's infectious, isn't it? Are we allowed to say it's infectious now? Like it seems like that's, you know, one of those words that 2020 has ruined for us. I didn't even think of that until you brought it up. So on my side, I think so.
Starting point is 00:03:18 So tell us a little bit about yourself. So I'm a former programmer. I got a CS degree in the last millennium and worked in open source for a while. I got involved in open source while I was in college, actually originally hacking on the Lego Mindstorms, the very first generation Lego Mindstorms. And also just as a Linux user, right? Like this idea that people were building an entire operating system together on the internet was like, but I was also a political science major. And so I was very interested in this overlap of politics and power and computing, right? It started off as just like, I was interested in politics and I was interested in computers. And I really thought those two weren't related.
Starting point is 00:03:58 And then by the end of the nineties was like, oh yeah, actually these are like super related. Right. And yeah, so I worked at a startup called Ximian, which worked on the GNOME Linux desktop. Yeah, and basically then after that I was like, actually law school sounds like it would be fun. Pro tip, kids: law school, not fun. But yeah, and then since then I've worked at a series of places: at Mozilla as an attorney, where I worked on the Mozilla Public License revision, version 2.0. I worked at a big law firm for a while, working, among other things, on the Google Oracle lawsuit for Google, where fair use came up quite a bit, which is something we'll talk about today. And now I'm at a company called Tidelift, where we are trying to make open source better for everyone by helping build a sort
Starting point is 00:04:47 of economic and payment loop so that maintainers get paid to do all the sort of not fun parts of maintenance. They're not fun, but they're really important for businesses and enterprises. So like, we're trying to close that loop as a business. But yeah, I'm a copyright nerd at heart. And so I think that's sort of why I'm here today. That's awesome. And this kind of ties into a meeting I had just before this, where I was talking to our interns and really talking about how like so much of the software that we use and so many of the big companies that we see and work with and use their products are built on all of this open source software.
Starting point is 00:05:29 And it's really hard out there for open source developers. And I was kind of evangelizing that. And so it's really exciting that you're working on making the lives of open source developers much easier. So thank you for that. We're not here to talk about open source too much today. Like specifically, we want to talk about Copilot and kind of get into that. So Boneskull, you want to maybe explain what Copilot is and kind of get us going with that? Yeah. So if you're not aware of what it is, essentially it's kind of like an AI-assisted autocomplete on steroids or something like that. And your IDE may have autocomplete suggestions. Oh, like this is the name of the function and you hit tab to complete it. But it's a lot more than that. It does AI things to try to give you more code. It tries to kind of like if you write a comment that says this function does that,
Starting point is 00:06:28 it can try to write the function for you. It's good at automatically completing like boilerplate. And so, yeah, right now, GitHub Copilot is like, it's in a, is it called a closed beta? I'm not sure. You have to like sign up and maybe they'll let you in. But yeah, so it's not generally available yet. Yeah. To that note, when you sign up, which I did on the first day, through that process, they ask you, because right now it's only available as a VS Code extension. Right. And they ask you about your Visual Studio Code usage. And I answered as honestly as I could,
Starting point is 00:07:01 which is I never opened that app. So I don't have an invite. Luis, are you in that? So I do have an invite. I think it's fair to disclaim here. I mentioned that my first job out of college was at Ximian. Those of you who are real old school open source will know that Ximian was founded by Nat Friedman, who's the CEO of GitHub. So I may have gotten my invite a back way.
Starting point is 00:07:28 I didn't get one at first because my IDE these days is Word mostly, which is a Microsoft product, but it's not VS Code. So I admit I got an invite purely because I wanted to troll people on Twitter by trying to see if I could get VS Code to write a license. But I admit time has not been on my side, so I haven't done that project yet. So yeah, the back door was there for me. Now, I haven't talked about this much, but I felt like that was a sort of appropriate thing to lean on him for. So yeah. Yeah, I mean, I think it's fascinating, right? I mean, there's the lawyer side of me, but I do want to say,
Starting point is 00:08:02 like, boy, some of the examples of code coming out of it are simultaneously amazing and also, very much occasionally, the like, boy, you know, the robots are not coming for us anytime soon. Right. Like I saw somebody used it to auto-generate a function about calendars. And it was like, oh yeah, months are 30 days long and years are 365 days long. And apparently nobody's trained it on that blog post, Falsehoods Programmers Believe About Time, or believe about dates or whatever. Nobody's trained it on that yet. Or maybe it's just not heavily weighted enough.
Starting point is 00:08:43 But, you know, simply as a technical matter. Right. It'll generate code, but that code is not necessarily correct. You've got to double check. Though I have seen some amazing examples of it really filling in some. And it's one of those things. I mean, one of the things that I'm, you know, with my like business hat on, I think there are these really fascinating questions about where it goes from here, right? Because I assume, again, haven't talked to anybody at GitHub about this, but like, this is the kind of thing that once you get it in place, the ways you
Starting point is 00:09:15 can leverage it are really interesting, right? Like, how does it know about third party APIs? You know, because it could, right? Like right now it only knows about third, as best as from what I've seen on the internet of people playing with it, you know, it seems to know about third-party APIs just by reading other people's source code, right? But you could see, like, I bet GitHub's partnerships team is thinking like, how can we integrate this like intelligently with third-party APIs, right? Or like security, for example, this is showing my age programmer wise, my experience of security issues, I'm thinking purely about C based
Starting point is 00:09:52 like string parsing kind of stuff, right? Instead of, you know, I guess with Java, it's like cross site stuff and things like that, right? But as I, you know, from the examples I've seen online, it seems like the AI is still like, if there are a lot of bad code examples in the code base, which out there in the wild there are, it's going to replicate some of those security fails. And what are they doing to train it to avoid some of those security fails? I don't know, but I think that's going to be really interesting. Again, with my nerd hat on as opposed to my lawyer hat on, we can get to the lawyer bits in a second,
Starting point is 00:10:27 but I just think that's really cool, right? Yeah, right now it's just like, I mean, it's pretty smart, but it's not that smart, and it just outputs things. Maybe in the future it would be cool if something like that could just look at your code and be like, you know what, this is wrong. And it would look at what you're trying to do, and it would compare it against known good implementations of that thing you're trying to do, and it would alert you to problems. Like, that would
Starting point is 00:10:56 be cool too. And so, you know, there's a lot of places it could go in the future, and that's going to be interesting. So we've kind of explained what it does and then a little bit of what it doesn't really do. It's still operating, its data set, its training set, is just source code. It uses, like, GPT-3, I believe. Yeah, OpenAI, whatever is behind the scenes for them. Yeah, and so it's just like dealing with a lot of text. But so when Copilot came out and people started playing with it, and then on Twitter, you see that you can give it certain prompts, and it will actually generate code that may have been, I mean, it's kind of obvious that it's getting the code that it's writing from GitHub. And so that's anything on GitHub, right? And so a lot of people were kind of upset about this. So like, Luis, why do you think people were upset about it? You know, boy, that's actually like a deceptively complicated question, right? Because I think there's so many layers of
Starting point is 00:12:05 you know, people are upset for like very business reasons, right? Like, what if this code that's created is accidentally copyright infringing for my company, right? So like, there's, you know, I've heard, maybe they're apocryphal by now, right, but like I've definitely heard at least some CTOs, VPs saying, can't use this in our company's code base until there's a little more legal clarity, right? So that's one reason people are a little angry. Like, I think there's some sense that maybe GitHub was being a little sloppy about that, right? So that's one source of concern. Another source of concern is simply just the emotional, like, you know, authors feel ownership over their code, right? Like that's very deeply felt for a lot of people, certainly not for everybody, but for a lot of people, for a lot of authors. And that's not unique to code, right? Some musicians who get sampled, they're like, oh, this is so awesome. Like my music is being reused, right? Like, you know, Nine Inch Nails, their stuff got reused by, I've never actually said this out loud, so I may be mispronouncing,
Starting point is 00:13:15 Lil Nas X, right? Like his stuff sampled from an old Nine Inch Nails track and like Nine Inch Nails was like, cool, I finally have a number one hit, right? Whereas like a lot of other sample, a lot of other musicians that get sampled are like taking it to court. Right. Literally. So there's that like emotional component. And of course, there's like this added component of some people placed their code under licenses that are explicitly reciprocal. Right. The idea is that if you use part of my code, you've also got to share with the world your code. And, you know, the common name for those is copyleft, though I think reciprocal in this case really captures something important that copyleft doesn't necessarily convey, right? The idea is that there's supposed to be a sharing and sharing alike. And so a lot of people
Starting point is 00:13:58 who deliberately chose to put their code under that license, you know, were pretty frustrated about that. And so, you know, all those things sort of layered on top of each other to produce some pretty negative responses. Yeah. Kind of stepping back before we dig more into that, to like the legal side of it, I just wanted to disclaim that I haven't used it yet. I have used a similar tool, I think, called TabNine, which was kind of doing a similar AI completion thing, but I don't think it was like completing full functions. It was more like, oh, I see you're naming a variable like this. This is a very common variable name or something, and it would auto-complete it. I ended up turning it off because it was more noisy than helpful often, but it's, I mean, I'm sure it's growing and
Starting point is 00:14:41 getting better and all of that. But from a, like an outsider's perspective, looking at it, there are definitely like good and bad that I've seen from it. When I first saw it, it was like, whoa, this is like, this is just amazing. And not necessarily thinking, you know, it's going to take my job tomorrow or anything like that. But it was like, wow, this could really help, you know, if I'm staring at a blank file, how can I get going with something? And like, it seemed like a very good way can I get going with something? And like,
Starting point is 00:15:05 it seemed like a very good way to just kind of get something on the canvas, so to speak, to get going, and whether or not it's correct, you can kind of tweak it from there and it'll learn and get better over time. It was also impressive that it didn't go like the route of some other AI stuff that's come out over the years. Like Microsoft Tay is very much coming to mind. It's good that it's not just immediately like going that route with, you know, I don't know, very racist code or something like that. But it's so far been pretty positive like that. And then at the same time, I've seen like really kind of, I'll just say like dumb examples of it, where somebody like auto-completed, you know,
Starting point is 00:15:44 like an about me page and it auto-completed to like the about me, including like the Twitter handles of like a GitHub employee and stuff like that. So it's showing that it's literally just copy paste at that point, but it is kind of in an intelligent way. So it's, it's like straddling this line of like really simple and really complicated and really impressive that I think is an interesting place to be. But of course, this is the early days. So it's going to continue learning and going from there. I mean, you know, Nick, something that I was realizing as I was preparing to talk to you guys today, actually, a lot of the I mean, even before like because tab
Starting point is 00:16:22 nine, there were a few other things like tab9. There's also just been IDE autocomplete for a long time, right? Of various sorts, right? Like it knows what kind of code base you're working in. And it'll, it's one thing when it autocompletes. I mean, we've had, you know, our brackets get matched automatically for ages in Emacs, right? Yeah. But there's also been more sophisticated stuff that will read documentation and try to guess it like, of millions of dollars of attorney's fees on whether or not API like function names essentially and lot of this code is copyrightable and fair use doesn't apply, it's not entirely clear that even those simple, like I'm going to auto-complete the function name
Starting point is 00:17:32 from the standard library, like some of those same arguments that Oracle used apply there, right? And it's actually been sort of interesting and honestly a little frustrating for me. Some of the same people who came out strongly in favor of fair use when it was Google saying, yeah, re-implementation should be fair use. Like basically when it was Oracle stuff getting copied, everybody was like, hell yeah, copying is awesome. And now when it's GPL stuff, like I get the emotional valence there, right? But from a lawyer perspective, like GPL is a copyright license and Oracle's, you know, grungy, terrible, every lawyer hates it, terms of service or, you know, standard EULA around their code. From a copyright perspective, those are both copyright licenses, right?
Starting point is 00:18:21 Courts don't, you know, courts aren't in the business of saying, oh, yes, but we really like Richard Stallman and we really don't like Larry Ellison. So therefore, one of these is fair use and the other isn't, right? Like there's been some, to me, sort of frustrating inconsistency about people who until a month ago were like big fair use proponents. We can get into the nuances of that because it is really complicated. Like the question of fair use proponents. We can get into the nuances of that because it is really complicated. Like the question of fair use and machine learning is in fact a really complicated one. And anyone who tells you that it's black and white, like courts don't know what machine learning is. So like the idea that you can say, oh yeah, this is definitely fair use or definitely not fair use. There's so much gray area in there.
Starting point is 00:19:05 We could go on about that for hours, but I'll pause and let you guys get in another question edgewise. Let's actually break right there and we'll come back after the break and talk about that. Yeah, it's potentially terrifying just thinking about how the technical aptitude of a court could potentially decide the fate of software. And that's terrifying. More and more startups are using Retool to focus their time on their core product. And that's exactly why they launched Retool for Startups. This is a program that gives early stage founders free access to a lot of the software needed for great internal tooling. And Retool's worked with thousands of startups.
Starting point is 00:19:55 And the trend line they noticed was technical founders spending tons of time building internal tools. That means at this critical stage, these founders were distracted from their core product. The goal is simple, make it 10 times faster to build the admin panels, CRUD apps, and the dashboards most early stage teams need. And Retool has bundled together a year of free access to Retool with over $160,000 in partner discounts to save you money while building Retool apps with common integrations like AWS, MongoDB, Brex, and Segment. There is so much you can do with Retool. You can use these free credits to build tools that join product and billing data into a single customer view, tools that convert manual workflows into fully featured apps for your team, or tools that help non-technical teammates get access to your database to read and write data, analyze, and query. These are just a few examples.
Starting point is 00:20:44 Learn more, apply, and join lightning demos at retool.com slash startups. Again, retool.com slash startups. So Luis, you mentioned that fair use is kind of a, it can be a gray area around this sort of thing. Can you go into a bit more about, I know this is a thing that continually comes up in trials is, is this fair use or not? And so where do you think something like Copilot lands and why? So let me first just start for those who aren't copyright nerds in the audience, though why you'd be listening to us if you're not, I'm not sure, but we'll start with, you know, a little bit like, so fair use is this very American concept, right? It's not present in a lot of other legal systems around the world, that copyright should be bounded, right?
Starting point is 00:22:00 That, yeah, of course, we give authors a lot of rights. It's very explicitly in the Constitution that we give them rights in order to promote the progress of the country, right? Like that's literally the phrase. There used to be a copyright blog called Promote the Progress, right? Like the idea was that this was something that you gave to authors, and in exchange, they made everybody better off, right? And so in part, because of that sort of founding intuition, first the US court system, and then eventually it was transferred from the court sort of, to put it a little bit in programming terms, the courts prototyped fair use, right? They sort of made it up on the fly when they ran into some problems. And then the Congress sort of took those ideas that had been floating around the courts for several decades at that point, maybe even almost 100 years at that point. And Congress sort of refactored it and said, like, this is how we're going to steal, you know, a phrase from one court here and a phrase from another court here. We're going to put it together into one refactoring,
Starting point is 00:23:11 make it part of the law, and then judges will sort of go on elaborating and clarifying that. And so this was done in the transition of, if I'm remembering my timeline correctly, the transition of fair use into actual statutory law, written down law, as opposed to judges sort of making this up on the fly, happened in the 50s. And so very much the examples, if you go back to what was Congress talking about, they were talking about things like teachers, right? So if you want to use a few minutes of a movie in a classroom to teach some point, fair use protects that, right? Fair use says the copyright holder can't just unilaterally block that. Or if you're another sort of canonical example is literary criticism. If you want to quote a paragraph of a book in order to prove a point about like, this author is an asshole or this author, then you can do that. And fair use allows you to do that. Fair use does, you know, Chris, to your point about it seems like it comes up in court a lot. The whole thing about fair use is that it is sort of I know it when I see it. Right. There are some guidelines. There's a four factor test that everybody applies. But the fourth factor is sort of like, yeah, and whatever else we want to throw in at the time, right? And that's because the whole point of it is like, you know what? People may be doing something new and different and innovative with this stuff. And we don't want that to be, we don't want authors to be able to block that if we think it's a good idea, right? Like that's sort of the core of it, right? So we think literary criticism
Starting point is 00:24:43 is a good idea. And important there to note that it's criticism, right? Like one of the core of it, right? So we think literary criticism is a good idea. And important there to note that it's criticism, right? Like one of the reasons that we have this fair use established is because if copyright holders could block that kind of use, then you would only have positive reviews of books, right? Like the authors would be able to say like, I didn't really like that review, take it down. And they'd be able to use copyright to block that. And so baked in, like from the very beginning of fair use, one of the things is you really don't want authors to be able to of safety valves for like, this is really important. You know, similarly, news reporting, fair use is used all the time, right? To be able to say, look, you know, here's a 10 second snippet of this politician's ad, for example. And we think that this politician's ad is misleading. Let me use this 10 second snippet to like set the stage
Starting point is 00:25:42 for this discussion of why this politician is misleading you, right? The creators of that advertisement have copyright in that 10 second snippet, right? They could, if copyright didn't have the escape hatch of fair use, they could use copyright to take down that news segment, right? Because they obviously don't like that the news reporter is saying the politician's lying to you, here's why. And so copyright originally existed for all that kind of stuff, right? But of course, it was written in the 50s. It literally has no concept of software, much less, you know, machine learning, right? And so there's all this settled stuff, like academics don't sue over fair use very much because that stuff was all settled a hundred years ago. Right. Whereas software, like it comes up quite a bit because
Starting point is 00:26:31 in fact we have no idea, right? Like we really don't, Congress has never really weighed in on this courts only weigh in on it once in a blue moon, right? Like a lot of the arguing in Google Oracle over is this re-implementation of the API a fair use? I mean, one of the key cases in that was Lotus v. Borland, which was about whether the drop-down menus on like an x86 PC in like black and green on your CRT screen, like a court found that those menus were a way of operating the spreadsheet in the 1980s. And like, so here we were 30 years later, and we're trying to like one of the things, Nick, you mentioned about judges, you know, how much tech do they know? We were trying to tell them like, well, look, the API is sort of like a menu in a spreadsheet in the 1980s.
Starting point is 00:27:27 And like, how is any reasonable person supposed to like find a reason? So yeah, tons of gray area. And that's where fair use then comes in is this question of has come up a lot recently in these questions of, we joke that you could write an entire casebook. Casebook is like a legal equivalent of sort of intro to programming kind of thing, right? Where you take snippets, ironically, through fair use of these cases, and you say, well, here's what a judge found about this. You, budding lawyer, should learn about the law by reading what the judge says about this. And we often joke that you could write an entire casebook about copyright just through the lawsuits that Google has been involved in. Right.
Starting point is 00:28:12 Because they, like, early 2000s, they were just like, you know what? We're going to scrape the whole web. It's going to be great. We're going to organize all the world's knowledge. And like average people were like, yay, organizing the world's knowledge. And scan all the books, scan all the books, scan all the porn. Like there's a whole line of cases that are about Google image search and this one porn company, where the guy really, just really didn't like Google or search engines in general. Perfect 10. There's a whole line of cases about Perfect 10. Yeah, Google Book Search,
Starting point is 00:28:45 another huge one, right, where they literally just said, we're going to scan all the books. And if you don't like it, and the authors didn't like it, there was a big extensive lawsuit from a bunch of authors and the US courts eventually found, no, you know what, like, we understand you, the authors aren't happy about this, but this is so transformative. Transformative use is a concept that is not found anywhere in the statute authors wouldn't have conceived of, didn't have a business in, and it's something really radically new and different. And, you know, the book publishers were not in the business of creating something like Google Book Search. And that's where the analogy starts getting really obvious, right? Like we as code authors, were we in the business of creating something like Copilot? The trend so far in the US has been that machine learning typically is so transformative.
Starting point is 00:29:57 Definitely not always, but almost always in the handful of cases that courts have really considered machine learning, courts have tended to find it transformative, fair use. And that typically involves the training sets then? Yeah, yeah. The training sets are really where there's a clear copyright. Copyright is really a set of rights, which includes things like the right to copy it, the right to redistribute it, the right to make reproductions. And so the first step of training, right, is this idea like, we're going to scan all the training set, right? And you are making a reproduction there,
Starting point is 00:30:34 right? And you're making a reproduction for commercial purposes, which doesn't always matter, but sometimes matters. And so, yeah, you've just copied the whole thing, right? I mean, Google Book Search is a copy in a very literal sense of all the world's books. And at least the training set, right, that initial training of presumably OpenAI was not like pinging GitHub's API for code snippets at every point during the training, right? Presumably they vacuumed it all down and then did their training on it. So a copy was made. So a copyright infringement has occurred unless fair use defends that, right? And courts have generally found, and there's some good policy reasons for this, right? One of my favorite papers in this area is a paper, I'll share a link with you guys so that you can put it on the,
Starting point is 00:31:19 but it's a paper about how fair use is actually really important to building equitable training sets. Because we know that like a lot of training sets, this're going to have to use fair use because you literally can't buy a like racially diverse training set. Like it just doesn't exist because Getty and like all these other photo services are like actually have all kinds of biases, right? So you're going to have to deliberately construct and you're going to have to rely on fair use for that. And you know, if you rely only on things you can buy a license to, you're just to have to rely on fair use for that. And, you know, if you rely only on things you can buy a license to, you're just introducing all sorts of biases into your training data set. Now, of course, there are, as we know, you know, any of you have followed sort of
Starting point is 00:32:14 artificial intelligence policy discussions. There's all kinds of other ways you can introduce bias, right? But like fair use is one of the good tools, one of the few good tools we have to remedy that in like the AI space more generally. Again, that hasn't come up. Nick, as you mentioned, like Microsoft Tay or whatever, like those kinds of – I haven't seen any egregious examples. I'm actually really curious. My first job out of college, with Ximian, was as a QA guy. And I am deadly curious what kind of QA they did around race and gender and things like that. Because an obvious use case is something like, Copilot, build me a gender selector dropdown. And that's, like, fraught. That's a super complicated, you know, that's one of those things that it turns out is all kinds of fraught. Right. And I have no idea. I haven't seen any particularly bad examples of that.
Starting point is 00:33:13 Maybe I just haven't looked hard enough, but. Yeah. And obviously I haven't used it yet. So everything I've gleaned about it has been just from mostly like tweets and a few articles here and there. But one that comes to mind in particular is Cassidy Williams from Netlify. She does like a live stream coding thing every week. And at least one week she did like a, you know, showing off Copilot on the stream thing and was specifically trying to make it be biased about something. So, she was writing comments in Spanish. I could be misremembering, but maybe like a gender dropdown was like an example of that.
Starting point is 00:33:49 And overall, just like gleaning from her tweets about it, it was overall like pretty positive. It wasn't going to any dark places with that. And so like, that's probably where like a lot of secret sauce comes in to really take the training data and make it into something that is not only usable, but is also like ethically in bounds. I can't wait. One of these days, somebody is going to write a guide to regression testing your AI, right? That's going to have whole chapters on like, so, okay, you've regression tested your AI and it basically does what you want. Now let's test it and see how racist it is, right? Because at some level you do, like because we can't entirely peer inside the black box,
Starting point is 00:34:32 there's going to be, right, like we have to do. If GitHub didn't screw that up, they must have deliberately had some people, you know, poking at it with exactly those kinds of examples that Cassidy was trying. You know, which does actually get us to one of the interesting, I think one of the things that might come out a little bit in our conversation is a little bit of frustration at times around, there are really interesting legal
Starting point is 00:34:56 and policy questions around this. Most of the discussion online was not really about the interesting, you know, so much of it was emotional, frustrated, which I get, right? Like, I mean, I have, I'm a big proponent of copyleft. Like somebody challenged my like copyleft bona fides on Twitter. And I was like, I literally don't think I can fit all the copyleft licenses that I've advised on into one tweet. And there aren't that many of them. Right. So like, I get it, right? Like there is, I do believe in reciprocity is an important part of how we build software. But at the same time, we've also always, like a lot of old school copyleft folks have also been
Starting point is 00:35:38 old school fair use folks. And so like, this is a little bit of, I think there was some tension there and some frustration on both sides of that discussion. Right. Yeah. Which came through. Welcome to arguing on the Internet. What's up, party people? This episode is brought to you by Sentry.
Starting point is 00:36:03 Sentry just shipped their SDK for Next.js. Now, in your Next.js apps, you can capture errors, measure performance, manage releases, configure suspect commits, and automatically upload source maps to view unminified JavaScript and TypeScript with zero-ish configuration. You can get your events enriched with device data, breadcrumbs created for outgoing HTTP requests, release health for tracking crash-free users and sessions, and automatic performance monitoring for both the client and the server. Check for a link in the show notes for details of this release.
Starting point is 00:36:33 JS Party listeners new to Sentry get the team plan for free for three months when you sign up at Sentry.io and use the code PARTYTIME because, hey, it's party time, y'all. Taking like maybe a step back from this and kind of thinking about this from like a software perspective and specifically like a software license perspective. Maybe I'm misstating like the overall argument around that. But is it considered fair use because it's just training off of like potentially like certainly licensed code and not necessarily like running it? And if it were doing something to like run it, would that change the way that it might be perceived? So you do two steps if you're trying to figure out, like, is there some sanctionable copying here, right? And so first is, was there copying at all,
Starting point is 00:37:40 right and that kind of thing doesn't come up you don't see that come into court very often or at least not in particularly dramatic high profile ways, because usually it's pretty easy to just compare this, compare that, like pretty much the same, right? That's sort of step one of your analysis. And here, it's important to distinguish that there's two possible stages at which copyright infringement could have occurred, right? There's did GitHub infringe people's copyright? And are people who are using Copilot infringing the copyright, right? And the answers to those may be different.
Starting point is 00:38:17 Like I tend to think that the answer is in both cases, there's no infringement, right? Like that's sort of my bottom line, but it's important to like distinguish between those two because you can see a world, right? Like there's definitely arguments to be made that GitHub is infringing, but the user of Copilot is not, right? And so let me get into a little bit why that is, right? So when you're looking at fair use, I mentioned the sort of transformative concept, but before we get to that, there's these
Starting point is 00:38:45 four rules, and I can't believe, maybe I've had a longer morning than I thought. I should normally be able to rattle them off, but we'll go through them one by one. And I will really try to remember the fourth one by the time I get to it. So one is like the nature of the taking, right? So like, are you doing this for like some kind of societally advancing purpose or not, right? And this is where things like teachers get much more of a flexibility than like a rival book publisher, right? Another is the how much did you copy. And so like, it is one thing. So this is one of the key ways in which how GitHub copied and how a Copilot user might copy is very different, because Copilot undoubtedly at some point in the process copied the whole thing. Right. And so a court looks differently at did you copy the whole thing versus did you copy, you know, one function fragment out of a giant, you know. This came up in the Oracle Google trial because Google, well, really Apache, but we'll say Google for simplicity, really only copied like one type of thing, right? Like they copied, I used to have these numbers right on top of my brain.
Starting point is 00:40:00 It's a really good sign that I don't remember them exactly anymore. But like it's basically like 10, lines of API names and, you know, function names, but they didn't copy the other several million lines of the implementation, right? Like, they carefully re-implemented them, right? And so a court will look at that and say, oh, it looks like actually not much of this was copied. Right. But this is where it gets a little complicated. Right. And again, why does this go to courts to decide these things is there's a famous case about the biography of Gerald Ford, who's like our nation's most boring president, essentially. Right? Except that he like pardoned Richard Nixon, right? So he wrote a biography and a magazine got their hands on like an advance copy of the biography. And they basically reprinted the part about him pardoning Nixon. And the court was like, let's be honest here. Nobody is buying that book for any reason other than to read the
Starting point is 00:41:06 part about pardoning Nixon, because otherwise, who cares? It's Gerald Ford, right? So the court said, well, like, even though only a very small part was taken, like as a percentage of the book, it's still not a fair use, because that was really the core of the value of the book, right? And Oracle tried to make a similar argument, which was, OK, well, yeah, you know, you didn't copy 95 percent of Java, but you did copy the most valuable part, which is the API. And so this is where, you know, a copilot user is going to get. I mean, I think they said in their white paper, which I recommend everybody read. And again, I'll send you guys a link because it's pretty short and pretty interesting. They say something like in their internal testing, something like 0.1% of suggestions actually matched back to like when they did a sampled thing.
Starting point is 00:41:59 Only like 0.1% of suggestions looked like they were copied from another source, right? The other 99.9% were original. Original in the sense of like being created by the machine learning. By the AI. Yeah, by the AI. It's still sort of weird to talk about things being created by an AI, right? So if you're trying to reimplement some competitor's API, I probably wouldn't use Copilot, right? Because then it's going to, the output is going to look like you took the heart of this other person's thing, right? It's
Starting point is 00:42:31 probably going to start auto-suggesting code that looks a lot like their implementation, if it's an open source implementation, right? So like if you're, if there's like a GPL implementation of something and you want to write an MIT implementation of it, like I suspect Copilot, I haven't seen anybody try this yet. Right. But I suspect Copilot is going to start doing things that look a lot like the original implementation and then you're going to have a problem. But because one of the tests for fair use is how much of it did you take, if you end up with, like, a five-line fragment out of somebody's GPL code that's, like, a hundred thousand
Starting point is 00:43:05 lines of, you know, or like, I mean, what's the Linux kernel these days? Like six, seven million lines of code, right? Like, if you end up like that, a court's just going to laugh that out of court, right? They're just going to say, like, if somebody comes after you on a GPL claim, for you as a user of Copilot. Again, that's different from GitHub, right? Because GitHub presumably copied, you know, the entirety of the Linux kernel, right? So we've got the, what was the nature of the taking? You know, how much did you take? Another thing that courts are going to look at is the commercial impact of this copying. And so like, you know, again, GitHub, like potentially big, since they're copying the
Starting point is 00:43:43 whole thing and they're a big corporate competitor, like possibly some big. For you as a user of Copilot, like, you were not, like, that company was not trying to sell you five lines of code. They weren't trying to license five lines of code to you. And you weren't looking to buy five lines of code, right? You were just gonna write it yourself anyway. And so a court, again, is going to look somewhat skeptically at, and this is, you know, something we know from the Google Book Search case, right? That a court's going to say like, well, you all weren't selling snippet search of your book. So like, you know, you're not, in fact, if anything, this is a key difference from Google Book Search to Copilot. The court found in that
Starting point is 00:44:25 case that actually this is going to help you sell more books, right? Because people are going to find books, there's a limitation on how much gets shown, so, you know, and there's a buy button right there. There's no equivalent to that in, uh, you could see maybe GitHub will do something like, by the way, it looks like you copied this from the Linux kernel. Click here to sponsor, to do GitHub Sponsors. That might be a little tacky, right? But you could see that as a thing that they could do in the future, perhaps. So that's sort of the basic analysis, right, of how much got taken? Was it really important stuff that got taken?
Starting point is 00:45:00 What was the commercial impact? Is it something new and bold and different that wasn't going to happen anyway? And I think looking at all those, I have a really hard time seeing that a court is going to say that this was not a fair use, right? Because it's so different. The impact is so small. Like the emotional impact is real, right? And I don't want to downplay that. Like as authors.
Starting point is 00:45:21 But again, the whole point of fair use is sometimes authors are pissed and we ignore that, like, as a policy matter. And, you know, by the way, I should say again, this is all in the U.S. The EU's got different sets of rules about this. And I really think one of the interesting things that is under-discussed, that I would love to see more of, is commentary from European Union lawyers, Japanese lawyers, like, 'cause we don't, I don't think we have as good a sense yet of what that would look like in other places, other legal regimes. I'm getting the idea here that essentially, you know, you put your code on GitHub. Number one, there's like a terms of service that says GitHub can use your code, right?
Starting point is 00:46:05 Yep. Okay. You don't always put your code on GitHub, right? Right. But that's presumably what they used as their training set, right? I think I even saw some suggestion that they also looked at other repositories as well because OpenAI scans the entire web. So I did see some suggestion.
Starting point is 00:46:21 I don't know if it's been confirmed, but I did see some suggestion they would have looked at other. There are other repositories now. But at this point, the fraction of the world's code that is on GitHub is large. Right. So it's probably mostly GitHub. Speaking of, it seems like the license is irrelevant. All right.
Starting point is 00:46:38 I don't know. I mean, I think it's important to say, right, I've been wearing my lawyer hat this whole call so far. Right. And there's a whole other ethical, like, is it legal? I mean, like I said, I think the answer is probably pretty clearly yes. It's possible to be legal and still be a, right, excuse me. I don't know, are we a family podcast here or, uh, all right, great. Good. Jared's not around. Or maybe I am. Yeah, you can still be a jerk, right? And I certainly think GitHub talked in that white paper that I mentioned earlier that they are implementing.
Starting point is 00:47:15 I don't know where this is at. I don't know if it's already rolled out or anything. But they mentioned that they're going to try to implement some kind of, by the way, it looks like this probably is not original, probably came from this. Putting aside whether or not that's legally necessary, you know, in terms of like not being a jerk, like hooray, GitHub should not be jerks, right? Like they're an 800 pound gorilla. And I think maybe in their rollout of this, I think maybe one of the things here is they didn't reckon with the like emotional like, you know, the heft that they carry. They've been really good. Like, I think I'm not a Microsoft apologist.
Starting point is 00:47:51 Like, I literally got into open source in part because I was convinced that Microsoft was evil. So like, personally irritated by, like, the Bill Gates, you know, image rehabilitation campaign. Like the guy has all this money to give to charity because he like operated an abusive monopoly. Like, that's why he has so much money. So it's nice that he gives it away. But like, let's not forget that first part. Right. So I'm not a Microsoft apologist, but like I think GitHub and Microsoft past few years have mostly done really well by open source. A little laurel resting, right? A little too comfortable here, and didn't fully understand, like, didn't
Starting point is 00:48:25 fully think through how much this would really, you know, emotionally piss people off, even if the lawyers, like, even if the lawyers gave it a full thumbs up, right? Yeah, I think that that clout that they've built up in open source over the last couple of years probably should help, like, give them some leeway in figuring this out. But definitely, you know, working to figure that out, figure out where the emotions are coming from and things like that. But do you think that maybe like some of this murkiness is just caused by, it's like dealing with code itself? Whereas like with the books example, you know, it's code to scan books and to open books up and create this product around that. This is using code to look at code to suggest code, like it's all just one thing. It's layers
Starting point is 00:49:12 of indirection and layers of like, it's layers on layers and layers and it's layers. It's murky, because we don't really even know how much of like, I mean, this is one of these big sort of meta trends, right? Is that copyleft has been, you know, somewhat in decline, and more certainly in the JavaScript community, right? There's essentially no copyleft. So there's this sort of like sense of like, yeah, look, I mean, I sometimes call it almost like car exhaust, right? Like putting code on GitHub is like this thing that sort of happens accidentally by way of doing the thing that you actually want to do. Right. And so like, I mean, a lot of people aren't even entirely sure,
Starting point is 00:49:51 like how much is copyright really a motivator for, especially in a SaaS-y world, like how much is copyright even, it doesn't have the same kind of motivational role that the copyright, at least in the US really assumes, right? Like, it's like, why even is this stuff copyrighted? Because we're just going to throw it on GitHub under a license that, by the way, we never enforce anyway, right? Like, that's actually, there's a different discussion, maybe a different day about like, MIT and BSD require people to acknowledge, right? They require like, yeah, you got to ship this license text. And so if
Starting point is 00:50:27 Copilot violates the GPL, Copilot also violates MIT and BSD, right? Because those attribution clauses are, like, part of the license. And we often pretend in the JavaScript world, I shouldn't say we there, but like certainly my observation is that we often pretend that like MIT and BSD basically are just public domain, right? Like nobody complies with these notice requirements pretty much. And everybody's just sort of agreed that like, you know what, that's fine. Like that's not what I'm here for anyway. Right.
Starting point is 00:50:59 So yeah. And the law hasn't really like, there's both this like layers of technical indirection and layers of just like, how much do we actually care about copyright in this space right now? Anyway. I mean, I imagine there's some people who are looking at this and they're all upset and they're thinking, gee, there has to be some sort of way for me to create code and put it out there. But, you know, GitHub and Microsoft can't take it and make something like this with it. And then I'm thinking, well, hmm, that sounds a whole lot like the ethical open source movement that wants to place restrictions on things.
Starting point is 00:51:36 And so it's kind of like talking out both sides of your mouth, because, you know. Yeah, I mean, absolutely. There's absolutely some sort of like, oh, you can use it however you want, but not like that. That I just like, like that makes me cringe, right? Because that is not the,
Starting point is 00:51:53 Matthew Garrett, who's a former FSF board member, has written really eloquently on this in a long blog post on like, his vision of software freedom is very much about tearing down the copyright system, right? Like in his view, he thinks that like he got into free software in part because the whole
Starting point is 00:52:13 idea was more people should have access to more source code and we should have fewer restrictions on how it's used, right? And GPL was a tool to get to that end. But in his blog post, he talks about how we should be cheering on something that helps break down some of those barriers, right? Again, I tend to, as a longtime card-carrying FSF member until the recent leadership stuff, you know, I mean, I agree, right? Like, I think it's actually really interesting. A lot of these copyleft licenses have clauses in them that very specifically say this is limited by fair use, right? If there's a fair use and that's sort of redundant, right? Like in some sense, like you don't have to write that into the license because it's already
Starting point is 00:52:56 part of the law, right? So it's sort of like you're sort of, it's belts and suspenders, but like it was also, we put those in there, and I say we because literally I helped put some of those into these licenses, we put those in there because it was a statement that like fair use is important to us as like an ethical concern, not just a legal concern. And so to see some people being like, oh yeah, I love fair use when I get to fair use things, but like when you're fair using my stuff... For those of you on the podcast who can't see my facial expressions, there's a lot of hand-waving and grimacing right now. I'd love if you could share that blog post URL. Yeah, I'll do that. So we can provide it
Starting point is 00:53:40 to our listeners. Yeah, will do. Yeah, there's been a ton of fun writing about this. I can definitely send a few links. A former member of the European Parliament from the Pirate Party wrote a really good thing about it from an EU perspective. Because the EU, interestingly, actually did reform their copyright laws a couple of years ago. I think I can say this without being too political, but Congress in the U.S., not real effective at like passing laws. Right. Especially when there are like big lobbying companies involved, which a lot of the tech companies are these days. The EU passed a rule that specifically said that machine learning is interestingly opt out.
Starting point is 00:54:18 So you can in the EU write a license that says you can't use this stuff for machine learning. But you have to explicitly say you can't use this stuff for machine learning. Are we going to have robots.txt inside of our Git repos now? I bet it's coming. In fact, actually, there's a W3C working group on exactly that. I forget the name, but I'll send a link to the working group. One of the fun things about this for me is learning, because my day job is very much, like, not machine learning, I have not stayed super in touch with it. There's this project called EleutherAI.
Starting point is 00:54:52 I don't know if I'm pronouncing it right. But that's like an entire, it's like an, it's not like a, it is an open source GPT model. And they have not just the model, but it is trained. They, by hook or by crook, they got some GPU hours to train it. And they built a whole data set, which, by the way, includes a lot of open source code. And they specifically included open source code before Copilot. They specifically included open source code because they wanted to do open source. They, like, part of their vision is open source code completion.
Starting point is 00:55:21 And so, you know, that's out there. This W3C working group on like robots.txt but for code is out there. A lot of cool stuff that I found out about from the sort of mini furor about this. So I'm more optimistic. Like besides, you know, I mentioned there's like the legal thing. There's the sort of, are you a jerk thing. And there's also this like policy layer of like, do we really like what AI is doing to centralize power with companies that can scrape a lot of data and have the GPU cycles to like do training on that data. And I think one of the, another strong reason why we should be strongly in favor of fair use in our community is that the weaker we make fair use, the more AI becomes a game that can be played only if you have a strong legal team, right? Like the position that a lot of people are taking around this would shut down EleutherAI.
Starting point is 00:56:25 Now there's still this question of how do you get the GPU cycles? Because those are not cheap. So like maybe the answer is it's going to be centralized anyway. But like as long as there's sort of green shoots of people doing open AI, like we should be really, really worried about what clamping down on fair use for training might mean for those folks. Yeah, we should probably wrap it up there. But this was fascinating. And it just kind of shows that there is a lot more nuance to this than just kind of the immediate emotional reaction that comes out of seeing a potentially transformative use of AI like this.
Starting point is 00:57:05 And there's a lot to think about. And yeah, I tend to agree with you. I think overall, it'll be a very good thing and will be good for software engineers going forward. It's not going to replace us, I don't think yet, but it will be just fascinating to see how this grows and changes and how we grow and change to adapt to it at the same time. Oh, absolutely, right? I mean, we used to have t-shirts that, you know, shut up or
Starting point is 00:57:31 I'll replace you with a very small shell script. And, you know, now it's gonna be shut up or I'll replace you with a very expensive GPU cycle. And like, it can be potentially so empowering for programmers, not just VS, hopefully not in the long-term future, not just VS Code users, not just GitHub users. Hopefully there's a brave new future where access to that is democratized. We'll see, right? It's going to be interesting. Definitely not a problem that's going away, that's for sure. Yeah.
Starting point is 00:58:00 Part of that democratization should be to release a Vim extension so I don't have to use VS Code to try it out. Yeah, I was going to say Emacs, but that's another conversation for another chat. We'll have this battle offline. Luis, thank you so much for coming on and talking to us and giving us these amazing insights. Yes, thank you very much. Yeah, my pleasure. Happy to talk legal geekery with you guys anytime. That's it for this special crossover
Starting point is 00:58:30 episode of JS Party here on The Changelog. Thank you so much to Nick Nisi and Christopher Hiller for being such awesome panelists on JS Party. Also, of course, huge thanks to Luis Villa
Starting point is 00:58:39 for bringing all that wisdom. And later this week, we're shipping another episode of The Changelog talking with Cory Wilkerson from GitHub about their transition to GitHub Codespaces,
Starting point is 00:58:47 the making of it, and a lot of fun things about what they're doing with that platform. And on deck after that, we're talking to Adam Jacob about open source business models and a lot of fun stuff
Starting point is 00:58:56 around building a software business. If you're not subscribed yet, now's a good time. Subscribe at changelog.fm and everywhere you listen to podcasts. The galaxy brain move is, of course, to get the master feed at changelog.com slash master.
Starting point is 00:59:09 Special thanks to our partners, Linode, Fastly, and LaunchDarkly. Also, thanks to Breakmaster Cylinder for making all of our awesome beats. That's it for this episode. We'll see you next time.
