Algorithms + Data Structures = Programs - Episode 35: SkyNet is Here!

Episode Date: July 23, 2021

In this episode, Conor and Bryce talk about GitHub Copilot and more.

Date Recorded: 2021-06-30
Date Released: 2021-07-23

Lenovo Thinkpads
GitHub Copilot
GPL License
SkyNet
TLA
Mark Harris
LEWG ISO C++ GitHub...

Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons — Attribution 3.0 Unported — CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8

Transcript
Starting point is 00:00:00 Oh yeah, so what are we going to call that episode? This is going to be cut into like three episodes and we're naming an episode that people have listened to two episodes ago. I don't know. I mean, Skynet maybe? Skynet is here. Skynet is here. Skynet is here.
Starting point is 00:00:17 Yeah, I think that's got to be the name. Welcome to ADSP, the podcast episode 35 recorded on June 30th, 2021. My name is Conor, and today with my co-host Bryce, of, of, uh, of comedy gems prior to, uh, to your joining that are on my opacity. Okay. We will, I will have fun listening to those. But hang on. I want to see the, did you get it? No, no, no. So my, my haircut's at 2:05, uh, which is.
Starting point is 00:01:01 Oh, that's right. It's, it's after today. Well, okay. Then I still want to see your hair because it's going to be the last time I get to see it super long before you go back to regular Conor hair. So turn your camera on. Yeah, my camera is on.
Starting point is 00:01:13 So I'm in Edge right now because Chrome is just like, I can click on the Google Meet thing and then the screen is frozen. So I think this one's using the camera, so it won't let Edge... So now if I do that... Hey, there we go. Hey, there you are. Well, it sounds like you need a new computer, uh, or a new browser. I probably do. I probably do need a, um... Just making sure, yes, I am recording. All right, we're good to go.
Starting point is 00:01:48 Yes, I probably do need a new computer. This laptop is, ooh, 2017. My laptop's 2017. It works perfectly fine. I have a Lenovo 25-year anniversary edition, which is like a limited series run. What brand did you say? It's Lenovo. Oh, I thought you said Lenova.
Starting point is 00:02:09 I might have said Lenova. That's not how that company is pronounced. Although that company doesn't exist anymore. Didn't they get acquired by IBM or something? No, you are... The ThinkPad division of IBM, which made great laptops, was sold to Lenovo. Not a. Lenovo.
Starting point is 00:02:39 And they made a 25-year anniversary edition ThinkPad, which has the classic ThinkPad keyboard. And I bought one. And I'm quite happy with it. That's good. They've got that little, uh, track thingy, red dot in the middle? Yes, it does. Yes, it does. It's got my trackball. You're very excited about it. Not only the trackball, but when they ship it to you, they shipped you the different trackball heads from all the different eras of ThinkPads so that you could use whichever one you wanted.
Starting point is 00:03:07 Why are you so excited about it and all the different versions that you got? Nostalgia, man, nostalgia. How can you not be excited about nostalgia? Those good old school ThinkPads, those are good laptops. They could physically beat up other laptops. I'm not sure that's a good thing. Did you use the little, what is it called?
Starting point is 00:03:28 Trackball versus a touchpad? Oh, yeah. Well, because the Lenovo ThinkPad W510 that I've used for most of my career, it is on its second screen. And at some point during the course of the many times that I've opened that sucker up, I think I did something to the mouse pad that caused... long story short, the mouse pad no longer works. Only the trackball works. So yeah, I'm pretty familiar with it. That's one way to learn. One way to learn. Break the touchpad. So what do you think about this GitHub Copilot thing? I don't even know. What is this?
Starting point is 00:04:14 Do you live under a rock? Are you not on Twitter 24-7 like I am? I mean, I use GitHub for my day job. So if there's something that happened to GitHub, I thought I would definitely... Audience, you're in for a treat, because Conor is about to get exposed to GitHub Copilot, um, for the first time, live on this podcast. I am going to send you the link, and then I'm just gonna sit... Wait, where's the chat thingy?
Starting point is 00:04:48 Google keeps on changing... We're recording this in Google Meet. You know what? I think maybe we're done with Google Meet. I think maybe the fact that your laptop can no longer join it and that I cannot find the chat thing. All right.
Starting point is 00:05:06 I just Twitter DM'd you the link. God, you know, I liked it back in the day when the way that you communicated with everybody was text message. Whereas like today, like I have one group of friends that communicates with me over Twitter DMs. That's probably most people. Then I got another group that does it over Slack, another group that does it over Discord. Then Hana uses Facebook Messenger with me. That's the only person that I use Facebook Messenger with,
Starting point is 00:05:35 which is a little bit convenient because if I get a Facebook Messenger notification, I know that it's Hana. And then I got two people that use Signal and like one person that uses WhatsApp. It's just like, it's too many things. It's too many things. All right. All right, are you looking at this? I literally have to go to my phone because, like, Chrome, I don't know what Chrome is doing, but Chrome, uh, did... Oh yeah, I sent you a link. Your AI pair programmer. Yeah. Yeah. So for those who are listening, GitHub Copilot is... Is this a joke?
Starting point is 00:06:12 It is not a joke. It is not a joke. It is a very real thing. GitHub Copilot is an AI that suggests code for you while you're typing out code. I haven't tried it out yet. I'm just hating on it on the internet,
Starting point is 00:06:32 so I don't know much about it, but the little examples on their website show you type out a comment or you start typing out the signature for a function, and it does its AI magic and goes and figures out what the function should be based on that. I don't,
Starting point is 00:06:48 uh, this seems like, uh, this seems like a, could be going, could go terribly wrong. Uh, like what is this based on?
Starting point is 00:06:58 Is this based on Stack Overflow or like, no, it's, no, that's the, that's the best part. It's trained on code on GitHub, and they say in their disclaimer
Starting point is 00:07:07 that it's trained on this large database of code and that usually it won't do this, but sometimes it will suggest a snippet of code that is exactly taken from some of the code that it was trained on. And so people on Twitter very rapidly figured out, oh, hey, what are the licensing implications of that? If your AI pair programmer suggests some GPL licensed code, like, is your code now GPL licensed? So yeah, this is, uh, yeah, I feel like they, you know,
Starting point is 00:07:50 they talk about, you know, biased algorithms, which is really just biased data, and like, this seems like it's just gonna... I feel like there's probably more bad code on GitHub than there is good code, I feel like. And I can say that because I have a lot of garbage code that I put on GitHub that I'm just like, oh, we need to put this somewhere. It's just, you know, the world doesn't need to see it, but I need to put this somewhere and it's going to go on GitHub. And if you look at some of the code that I put on from years ago, it's just awful. I think Odin Holmes too, he, in one of his talks mentions that like all of the code on
Starting point is 00:08:23 his GitHub is just awful. Oh, yeah. I have code on my GitHub from like 2011 when I was a little squirt. And like that code is. Yeah, that's that's rough. That is rough code. Yeah. So like this, it just seems like a terrible, a terrible idea.
Starting point is 00:08:40 It's like, where could we find some really low quality code? And let's train our AI to write code like that. It gets better. One of the examples on their, like, launch page shows an example of a function that is dealing with quantities of money, and it's using floating point numbers to represent quantities of money. For those who may be unfamiliar, floating point is not a suitable data type to store quantities of money, because floating point can start to lose information, and like, the last thing you want is to lose a few cents when you're computing financial transactions. Because it may not seem like much, but over time it starts adding up, and then, like, bad things happen. They've got little, uh, what do they call them, um, testimonials. GitHub Copilot
Starting point is 00:09:48 discovered that a test file I was working on was missing a specific test and suggested and wrote the test for me. What? That doesn't sound right. Like, you were just in a file and then Copilot was like, hey, are you missing this test? I wrote it. Like, that doesn't, something doesn't add up here. I, what day is it? Maybe it's not an AI. Maybe it's actually just some guy in a room. "I'm impressed by how GitHub Copilot seems to know exactly what I want to type next."
Starting point is 00:10:24 Well, that's concerning. That's not impressive. We've got a problem here. You should be checking the back of your neck to see if you got something plugged into you, man. Somebody wrote somewhere something to the effect of, well, at least the rise of Skynet will be well documented. Skynet is a reference to the general artificial intelligence in the Terminator series of movies that destroys a large chunk of humanity by launching all the nukes
Starting point is 00:11:00 and then tries to destroy the rest of humanity by sending Arnold Schwarzenegger back in time to kill some people from the 80s. The 3% of our audience that didn't know appreciates that explanation. Although, what percentage of our audience do you think actually doesn't know what Skynet is? Well, so, you know, I think that it's always important to give context and to not make assumptions about what people know. One of the places where this comes up so frequently for me is acronyms. And I'm super bold about it. TLAs?
Starting point is 00:11:38 Yeah, TLA. So there's a great story. One of the people that Conor works with, Mark Harris, um, Mark has been at NVIDIA for, um, oh, for many years, well over a decade, maybe close to two decades. Yeah, it's like 17 or 18. He was one of the people responsible for popularizing CUDA and wrote a lot of literature about it. Coined the term GPGPU, actually. Yeah, he did. And when I started at NVIDIA, him and I had an interaction. We were talking about something. And I sent him an email and I said, well, what about UVM? UVM stands for Unified Virtual Memory, which is like the internal acronym for what we call CUDA Unified Memory. And he sent me back an email that just said,
Starting point is 00:12:38 you shouldn't use TLAs. And I sent him an email saying, what is TLA? And he sent me back an email that simply said three-letter acronym. And I just, that was just it. I was just like, wow, you got me. You got me good. So if somebody uses an acronym on Twitter or something, like I'm pretty bold about like, I'll just go and ask them. Like I was reading a twed, a twed, a thread.
Starting point is 00:13:09 Lenovo and a twed. A thread from somebody. And they used the acronym TC. And I was like, I don't, I don't know what TC is. I think there's like a TC39, which is maybe some JavaScript committee. And to me, TC in the ISO sense can mean technical committee, but I'm like, that's not what they mean. So I'm like, I better ask, and I just asked.
Starting point is 00:13:33 They meant total compensation. But it was just like, and I get it. It's a tweet. You've got a limited number of characters. You've got to abbreviate sometimes. And so I don't fault anybody for that. But yeah, acronyms are not my favorite thing. You'll notice that I go out of my way to avoid saying LEWG, which is the name of the C++ committee's library evolution group. I instead say Library Evolution
Starting point is 00:14:05 because LEWG is a term that you only understand... Library Evolution Working Group is what it stands for. You only understand that acronym if like you're on the committee. And likewise, like I'm not a big fan of, I'm not a big fan of the degree to which the C++ committee uses esoteric numbers for things. Somebody will be like, oh, well, we should ask SG16 what that is. And if you're not on the C++ committee, SG16 does not have meaning to you.
Starting point is 00:14:41 And it's just like a series of random letters and numbers. That sentence is a lot more accessible if you say, oh, we should ask the Text and Unicode Study Group what that means, or even just like the text study group what that means. Um, and like even the name of the C++ committee, WG21, you know, people send an email about like something, something WG21, like just say C++ for the C++ committee. And, you know, we do use paper numbers for a lot of things and the numbers are useful. Like it's super useful to have like a reference number for stuff like papers. But one of the common, the common requests whenever there's some discussion about a paper, if somebody doesn't put the title of the paper in the email subject line or they don't mention the title of the paper when they're talking about it, somebody will almost inevitably be like, what paper are we talking about? I don't know what that number references.
Starting point is 00:15:41 I don't know what the subject is that we're discussing. And does it mean you have to be a little bit more verbose to expand these things out? Sure. Like typing Library Evolution instead of L-E-W-G, like, yeah, it's a few more characters. Um, but like I usually don't mind those extra characters. It's also fine too because Copilot's gonna be here soon for just our regular text and we're not going to have to type anything. Yeah. I wonder,
Starting point is 00:16:10 are we so hostile to Copilot because Copilot is actually a bad idea or because like some part of us deep down is like, Oh, only humans can do this. And like, there's some deep down guttural response. That's like, Oh, we got to protect our job security.
Starting point is 00:16:28 Nah. Well, my reaction is just, I mean, it's the same reaction that I have to, you know, regular society and algorithms. It's like, oh, we're going to train this AI to do this based on the last 30 years of data. Well, let's take a look back and see how humans did for the last 30 years. And the answer is terribly. And so we're going to train these AIs to have all the embedded bias and prejudice and oppression, all that stuff. In their defense though, I saw,
Starting point is 00:17:01 what's the name of the Timnit, the AI bias researcher who was fired from Google recently? The one with the paper? Yeah. I do not know that individual's name. But yes, I know. Yeah. I think it's, hang on, hang on, hang on. I'll look it up.
Starting point is 00:17:22 I'll look it up. I know it. I just can't pronounce it. Timnit, Timnit Gebru. I'm sure I'm butchering that. And I'm terribly sorry. She tweeted something mentioning the fact that GitHub Copilot happened to cite one of her papers on AI bias, like very prominently cited in one of the pages on their site that described how it worked. And her remark was something like, hey, you know,
Starting point is 00:17:55 this paper got me fired from one company and there's another company that's, you know, citing it. So at the very least, I think they're aware of that. And while we are, you know, trashing on GitHub Copilot a bit, I do feel that we're perhaps being a little bit unfair. I mean, I think it is a very cool technology and a lot of people have put a ton of work into it. And I think that there is a lot of potential there. I saw an amazing talk a few years ago about this thing that I think was called like angelic programming. And the idea was that you would write, somehow, an expression of constraints or requirements, and then the programming language or the system would figure out what the code was supposed to do based on that. And so it's not that I think that this is a bad idea. It's just that I think that there's a lot of challenges and open questions, and potential for abuse and harm, under the hood here. Like the whole licensing question around it really scares me.
Starting point is 00:19:09 And maybe it shouldn't, maybe like I'm overreacting, but I think like we're sort of in untested waters there for what exactly that means. And yeah, like the quality of what you'll get out of it, you know, it's probably been trained on a lot of code that contains security vulnerabilities and, you know, memory safety violations and stuff like that. And so you've got to think about that.
Starting point is 00:19:39 But that said, I mean, I think tools like this are probably the future of how we'll program. And so, of course, you know, this being the first big one to be launched, you know, it's going to have some problems. And we're going to probably all be skeptical about it for a while. And probably it'll lead to some, you know, some bad PRs and some mistakes. But who knows where it'll be like 10, 20 years from now. I mean, I think it's a super smart move by GitHub to invest in something like this.
Starting point is 00:20:15 I think it's really brilliant usage of their platform and the data that they have available through their platform. I would, I really would like to see them, though, spend some cycles making GitHub search not so terrible. Like, I get it, cool, you guys made an AI that can pair program with me. That's great. But all that I would really like is to be able to search for C++ expressions on GitHub that are not simple identifiers and get coherent results. Like, please, pretty please, I would like to be able to search for,
Starting point is 00:20:59 you know, some C++ expression that contains non-alphanumeric characters and have it not just filter out all the non-alphanumeric characters. Yeah. So please, please, folks. And do some, like, deduplication because, like, inevitably you search for something and then, like, there's 40 pages of, like, you know, page number one to page number 40 is all the same library that's just been forked a thousand times. Um, you know, this actually makes me think of, um, so my predecessor in Library Evolution, the former Library Evolution chair, Titus Winters,
Starting point is 00:21:36 um, he's a huge believer in software engineering tooling. Um, I'd say he's one of the leading voices and advocates for the power of tools and automation in software engineering and how that can change how you write code. So the basic premise is like, hey, making breaking changes is hard. But imagine if we had tools that could automatically, you know, refactor code so that we could make breaking changes. Or, you know, evaluating what the impact of the change will be is hard. So imagine if we had, you know, tools that gave us great insight and visibility into our code bases. And so at Google, they have this code search engine, which every time I've seen it being used has just been absolutely fabulous.
Starting point is 00:22:40 And every now and then on the C++ committee, there's been a time when the committee's asked some question like, you know, well, it would be nice if we could do X, but, you know, if we did that, maybe there'd be these edge cases that we'd have to worry about. Or like, you know, maybe a better example, like, let's deprecate X. And then somebody's like, oh, but X is still used. And then like, how do you disprove that? And then inevitably, one of the Google people on the committee will go start typing on their keyboard and they'll be like, I just searched our entire code base and I found 100 uses of X.
Starting point is 00:23:17 And relative to the size of our code base, that means that it's essentially not used anywhere. And just that ability of being able... Like, imagine that you're in a meeting talking about some design decision that you're going to make. And just imagine being able to go and just search, like, the code base... Like, not all code in the world, but like a representative size of code to be able to search it and answer some query about, like, how frequently is this pattern used? How frequently is this function called with this type of arguments? Like just imagine being able to get answers to that very rapidly. It would
Starting point is 00:23:54 totally change how we would make decisions about software evolution. Yeah, that's a super powerful tool. Amazon has something similar. I don't think it's as polished as Google's, but. NVIDIA has something similar called Envy Grok. And yeah, it's not as polished as Google. But what I really want is I really want the open source version of that. And there used to be some open, Google used to run some open version of this called like Google Code Search or something. Maybe it's still around.
Starting point is 00:24:27 I seem to recall using it at some point in the past. But what I really want, you know, ultimately what we really want is we want to be able to go and search through GitHub and like do these queries on GitHub. But it's just not really possible and feasible today with the limitations of their search API. Yeah. Yep. Well, so wait, I'll say one more thing, and then we need to figure out what to call this little 30-minute episode. Uh, so I tried to sign up for this Copilot, because what I want to see is, if I type into
Starting point is 00:25:02 my C++ program a comment that says, sum the elements of this vector, put a little pointy arrow to the vector above. Will it give me a for loop or will it give me a std::accumulate? That's what I want to know. And if it gives me a std::accumulate, sign me up. I'll have code.
Starting point is 00:25:24 I'll have this. What about range-based for loop? Is range-based for loop acceptable? No. Oh, no, you said an accumulation pattern. Yeah. Okay, I meant like, I was thinking like more for-each pattern. Thanks for listening. We hope you enjoyed and have a great day.
