Algorithms + Data Structures = Programs - Episode 35: SkyNet is Here!
Episode Date: July 23, 2021
In this episode, Conor and Bryce talk about GitHub Copilot and more.
Date Recorded: 2021-06-30
Date Released: 2021-07-23
Lenovo ThinkPads, GitHub Copilot, GPL License, SkyNet, TLA, Mark Harris, LEWG ISO C++ GitHub...
Intro Song Info
Miss You by Sarah Jansen https://soundcloud.com/sarahjansenmusic
Creative Commons - Attribution 3.0 Unported - CC BY 3.0
Free Download / Stream: http://bit.ly/l-miss-you
Music promoted by Audio Library https://youtu.be/iYYxnasvfx8
Transcript
Oh yeah, so what are we going to call that episode?
This is going to be cut into like three episodes and we're naming an episode that people have
listened to two episodes ago.
I don't know.
I mean, Skynet maybe?
Skynet is here.
Skynet is here.
Skynet is here.
Yeah, I think that's got to be the name.
Welcome to ADSP: The Podcast, episode 35, recorded on June 30th, 2021.
My name is Connor, and today with my co-host Bryce, of, of, uh, of comedy gems prior to, uh, to your joining that are on my opacity.
Okay.
We will, I will have fun listening to those.
But hang on.
I want to see the, did you get it?
No, no, no.
So my, my haircut's at 2:05, uh, which is.
Oh, that's right.
It's, it's after today.
Well, okay. Then I still want to see your hair
because it's going to be the last time
I get to see it super long
before you go back to regular Connor hair.
So turn your camera on.
Yeah, my camera is on.
So I'm in Edge right now
because Chrome is just like
I can click on the Google Meet thing
and then the screen is frozen.
So I think this one's using the camera, so it won't let Edge. So now if I do that... hey, there we go.
Hey, there you are. Well, it sounds like you need a new computer, uh, or a new browser.
I probably do. Something... I probably do need a, um, just making sure, yes, I am recording.
All right, we're good to go.
Yes, I probably do need a new computer.
This laptop is, ooh, 2017.
My laptop's 2017.
It works perfectly fine.
I have a Lenovo 25-year anniversary edition, which is like a limited series run.
What brand did you say?
It's Lenovo.
Oh, I thought you said Lenova.
I might have said Lenova.
That's not how that company is pronounced.
Although that company doesn't exist anymore.
Didn't they get acquired by IBM or something?
No, you are...
The ThinkPad division of IBM, which made great laptops, was sold to Lenovo.
Not a.
Lenovo.
And they made a 25-year anniversary edition ThinkPad, which has the classic ThinkPad keyboard.
And I bought one.
And I'm quite happy with it.
That's good. They've got that little, uh, track thingy, red dot in the middle?
Yes, it does. Yes, it does. It's got my trackball.
You're very excited about it.
Not only the trackball, but when they shipped it to you, they shipped you the different trackball heads from all the different eras of ThinkPads so that you could use whichever one you wanted.
Why are you so excited about it and all the different versions that you got?
Nostalgia, man, nostalgia.
How can you not be excited about nostalgia?
Those good old school ThinkPads,
those are good laptops.
They could physically beat up other laptops.
I'm not sure that's a good thing.
Did you use the little, what is it called?
Trackball versus a touchpad?
Oh, yeah.
Well, because the Lenovo ThinkPad W510 that I've used for most of my career, it is on its second screen. And at some point during the course of the many times that I've opened that sucker up, I think I did something to the mouse pad that caused... long story short, the mouse pad no longer works. Only the trackball works. So yeah, I'm pretty familiar with it.
That's one way to learn. One way to learn.
Break the touchpad.
So what do you think about this GitHub Copilot thing?
I don't even know.
What is this?
Do you live under a rock?
Are you not on Twitter 24-7 like I am?
I mean, I use GitHub for my day job.
So if there's something that happened to GitHub, I thought I would definitely...
Audience, you're in for a treat, because Connor is about to get exposed to GitHub Copilot, um, for the first time, live on this podcast.
I am going to send you the link, and then I'm just going to sit. Wait,
where's the chat thingy?
Google keeps on changing their...
We're recording this in Google Meet.
You know what?
I think maybe we're done with Google Meet
I think maybe
the fact that your laptop can no longer join it
and that I cannot find the chat thing.
All right.
I just Twitter DM'd you the link.
God, you know, I liked it back in the day when the way that you communicated with everybody
was text message.
Whereas like today, like I have one group of friends that communicates with me over
Twitter DMs.
That's probably most people. Then I got
another group that does it over Slack, another group that does it over Discord. Then Hana uses
Facebook Messenger with me. That's the only person that I use Facebook Messenger with,
which is a little bit convenient because if I get a Facebook Messenger notification, I know that
it's Hana. And then I got two people that use signal and like one person that uses whatsapp it's just like it's too many things it's too many things
all right all right are you looking at this i literally have to go to my phone because like
chrome i don't know what chrome is doing but chrome uh did oh yeah i sent you a link
your ai pair programmer. Yeah.
Yeah.
So for those who are listening,
GitHub Copilot is... Is this a joke?
It is not a joke.
It is not a joke.
It is a very real thing.
GitHub Copilot is an AI
that suggests code for you
while you're typing out code.
I haven't tried it out yet.
I'm just hating on it on the internet,
so I don't know much about it,
but the little examples on their website show
you type out a comment
or you start typing out the signature for a function,
and it does its AI magic
and goes and figures
out what the function should be based on that.
I don't,
uh,
this seems like,
uh,
this seems like a,
could be going,
could go terribly wrong.
Uh,
like what is this based on?
Is this based on stack overflow or like,
no,
it's,
no,
that's the,
that's the best part.
It's trained on code on GitHub
And they say in their disclaimer
That it's trained on this large database of code
And that usually it won't do this
But sometimes it will suggest
A snippet of code
That is exactly taken from some of the code
That it was trained on.
And so people on Twitter very rapidly figured out, oh, hey, what are the licensing implications of that?
If your AI pair programmer suggests some GPL-licensed code, like, is your code now GPL licensed?
So yeah, this is, uh... yeah, I feel like, you know, they talk about biased algorithms, which is really just biased data. And this seems like it's just going to... I feel like there's probably more bad code on GitHub than there is good code.
And I can say that because I have a lot of garbage code that I put on GitHub
that I'm just like, oh, we need to put this somewhere.
It's just, you know, the world doesn't need to see it, but I need to put this somewhere
and it's going to go on GitHub.
And if you look at some of the code that I put on from years ago, it's just awful.
I think Odin Holmes too, he, in one of his talks mentions that like all of the code on
his GitHub is just awful.
Oh, yeah.
I have code on my GitHub from like 2011 when I was a little squirt.
And like that code is.
Yeah, that's that's rough.
That is rough code.
Yeah.
So like this, it just seems like a terrible, a terrible idea.
It's like, where could we find some really low-quality code? And let's train our AI to write code like that.
It gets better. One of the examples on their launch page shows a function that is dealing with quantities of money, and it's using floating-point numbers to represent quantities of money. For those who may be unfamiliar, floating point is not a suitable data type to store quantities of money, because floating point can start to lose information. And the last thing you want is to lose a few cents when you're computing financial transactions, because it may not seem like much, but over time it starts adding up, and then bad things happen.
They've got little, uh, what do they call them? Um, testimonials.
"GitHub Copilot discovered that a test file I was working on was missing a specific test and suggested and wrote the test for me."
What? That doesn't sound right. Like, you were just in a file and then Copilot was like, hey, are you missing this test?
I wrote it.
Like, that doesn't, something doesn't add up here.
I, what day is it?
Maybe it's not an AI.
Maybe it's actually just some guy in a room.
I'm impressed by how GitHub Copilot seems to know exactly what I want to type next.
Well, that's concerning.
That's not impressive.
We've got a problem here.
You should be checking the back of your neck to see if you got something plugged into you, man.
Somebody wrote somewhere something to the effect of, well, at least the rise of Skynet will be well documented.
Skynet is a reference to the general artificial intelligence in the Terminator series of movies
that destroys a large chunk of humanity
by launching all the nukes
and then tries to destroy the rest of humanity
by sending Arnold Schwarzenegger back
in time to kill some people from the 80s. The 3% of our audience that didn't know appreciates
that explanation. Although, what percentage of our audience do you think actually doesn't know
what Skynet is? Well, so, you know, I think that it's always important to give context and to not make assumptions about what people know.
One of the places where this comes up so frequently for me is acronyms.
And I'm super bold about it.
TLAs?
Yeah, TLA.
So there's a great story.
One of the people that Connor works with, Mark Harris...
Mark has been at NVIDIA for, oh, for many years, well over a decade, maybe close to two decades.
Yeah, it's like 17 or 18.
He was one of the people responsible for popularizing CUDA and wrote a lot of literature about it.
Coined the term GPGPU, actually.
Yeah, he did. And when I started at NVIDIA,
him and I had an interaction. We were talking about something. And I sent him an email and I said,
well, what about UVM? UVM stands for Unified Virtual Memory, which is like the internal acronym for what we call CUDA Unified Memory. And he sent me back an email that just said,
you shouldn't use TLAs. And I sent him an email saying, what is TLA?
And he sent me back an email that simply said three-letter acronym.
And I just, that was just it.
I was just like, wow, you got me.
You got me good.
So if somebody uses an acronym on Twitter or something,
like I'm pretty bold about like, I'll just go and ask them.
Like I was reading a twed, a twed, a thread.
Lenovo and a twed.
A thread from somebody.
And they used the acronym TC.
And I was like, I don't, I don't know what TC is.
I think there's like a TC39, which is maybe some JavaScript committee.
And to me, TC in the ISO sense can mean technical committee,
but I'm like, that's not what they mean.
So I'm like, I better ask, and I just asked.
They meant total compensation.
But it was just like, and I get it.
It's a tweet.
You've got a limited number of characters.
You've got to abbreviate sometimes.
And so I don't fault anybody for that. But yeah, acronyms are not my favorite thing. You'll notice
that I go out of my way to avoid saying LEWG, which is the name of the C++ committee's library
evolution group. I instead say "library evolution,"
because LEWG is a term that you only understand...
Library Evolution Working Group is what it stands for.
You only understand that acronym if like you're on the committee.
And likewise, like I'm not a big fan of,
I'm not a big fan of the degree to which the C++ committee
uses esoteric numbers for things.
Somebody will be like, oh, well, we should ask SG16 what that is.
And if you're not on the C++ committee, SG16 does not have meaning to you.
And it's just like a series of random letters and numbers.
That sentence is a lot more accessible if you say, oh, we should ask the text and Unicode study
group what that means, or even just, like, the text study group what that means. And even the
name of the C++ committee, WG21. You know, people send an email about, like, something, something WG21. Like, just say C++
for the C++ committee. And, you know, we do use paper numbers for a lot of things and the numbers
are useful. Like it's super useful to have like a reference number for stuff like papers.
But one of the common, the common requests whenever there's some discussion about a paper, if somebody doesn't put the title of the paper in the email subject line or they don't mention the title of the paper when they're talking about it, somebody will almost inevitably be like, what paper are we talking about?
I don't know what that number references.
I don't know what the subject is that we're discussing.
And does it mean you have to be a little bit more verbose to expand these things
out? Sure. Like, typing "library evolution" instead of L-E-W-G, yeah, it's a few more characters.
But I usually don't mind those extra characters.
It's also fine, too, because Copilot's going to be here soon for just our regular text, and we're
not going to have to type anything.
Yeah.
I wonder,
are we so hostile to Copilot because Copilot is actually a bad idea or
because like some part of us deep down is like,
Oh,
only humans can do this.
And like,
there's some deep down guttural response.
That's like,
Oh, we got to protect our job security.
Nah.
Well, my reaction is just, I mean, it's the same reaction that I have to, you know, regular society and algorithms. It's like, oh, we're going to train this AI to do this based on the last 30 years of data.
Well, let's take a look back and see how humans did for the last 30 years.
And the answer is terribly.
And so we're going to train these AIs
to have all the embedded bias and prejudice
and oppression, all that stuff.
In their defense though, I saw,
what's the name of the Timnit, the AI bias researcher who was fired from Google recently?
The one with the paper?
Yeah.
I do not know that individual's name.
But yes, I know.
Yeah.
I think it's, hang on, hang on, hang on.
I'll look it up.
I'll look it up.
I know it.
I just can't pronounce it.
Timnit, Timnit, Gebru.
I'm sure I'm butchering that.
And I'm terribly sorry.
She tweeted something mentioning the fact that GitHub Copilot happened to cite one of her papers on AI bias, like very prominently cited in one of the pages on their
site that described how it worked. And her remark was something like, hey, you know,
this paper got me fired from one company and there's another company that's, you know, citing
it. So at the very least, I think they're aware of that. And while we are, you know, trashing on GitHub Copilot a bit, I do feel that we're perhaps being a little bit unfair. I mean,
I think it is a very cool technology and a lot of people have put a ton of work into it. And
I think that there is a lot of potential there.
I saw an amazing talk a few years ago about this thing that I think was called, like, angelic programming. And the idea was that you would write, somehow, an expression of constraints or requirements.
And then, like, the programming language or the system would figure out what the code was supposed to do based on that. And so, like, it's not that I think that this
is a bad idea. It's just that I think that there's a lot of challenges and open questions
and potential for abuse and harm under the hood here. Like, the whole licensing question around it really scares me.
And maybe it shouldn't, maybe like I'm overreacting,
but I think like we're sort of in untested waters there
for what exactly that means.
And yeah, like the quality of what you'll get out of it,
you know, it's probably been trained on a lot of code
that contains security vulnerabilities
and, you know, memory safety violations and stuff like that.
And so you've got to think about that.
But that said, I mean, I think tools like this
are probably the future of how we'll program.
And so, of course, you know, this being the first big one to be launched, you know, it's going to have some problems.
And we're going to probably all be skeptical about it for a while.
And probably it'll lead to some, you know, some bad PRs and some mistakes.
But who knows where it'll be like 10, 20 years from now.
I mean, I think it's a super smart move by GitHub
to invest in something like this.
I think it's really brilliant usage of their platform
and the data that they have available through their platform.
I would, I really would like to see them,
though, spend some cycles making GitHub search not so terrible.
Like, I get it, cool, you guys made an AI that can pair program with me.
That's great. But all that I would really like
is to be able to search for C++ expressions on GitHub that are not simple identifiers
and get coherent results. Like, please, pretty please, I would like to be able to search for,
you know, some C++ expression that contains non-alphanumeric characters and have it not just filter out all the non-alphanumeric characters.
Yeah.
So please, please, folks.
And do some, like, deduplication because, like, inevitably you search for something
and then, like, there's 40 pages of, like, you know, page number one to page number 40
is all the same library that's
just been forked a thousand times. Um, you know, this, this actually makes me think of, um, so my,
my, my predecessor and library evolution, the former library evolution chair, Titus Winters,
um, he's a huge believer in, uh, in software engineering tooling. Um, I'd say he's one of the leading voices and advocates for the power of
tools and automation in software engineering and how that can change how you write code.
So the basic premise is like, hey, making breaking changes is hard. But imagine if we
had tools that could automatically, you know, refactor code so that we could make breaking
changes. Or, you know, evaluating how, what the impact of the change will be is hard. So imagine
if we had, you know, tools that gave us great insight and visibility into our code bases.
And so at Google, they have this code search engine,
which from every time I've seen it being used has just been absolutely fabulous.
And every now and then on the C++ committee, there's been a time when the committee's asked some question like,
you know, well, it would be nice if we could do X, but, you know, if we did that,
maybe there'd be these edge cases that we'd have to worry about. Or like, you know,
maybe a better example, like, let's deprecate X. And then somebody's like, oh, but X is still used.
And then like, how do you disprove that?
And then inevitably, one of the Google people on the committee will go start typing in their
keyboard and they'll be like, I just searched our entire code base and I found 100 uses
of X.
And relative to the size of our code base, that means that it's essentially not used
anywhere.
And just that ability of being able... Like, imagine that
you're in a meeting talking about some design decision that you're going to make. And just
imagine being able to go and just search, like, the code base... Like, not all code in the world,
but like a representative size of code to be able to search it and answer some query about, like,
how frequently is this pattern used? How frequently is this function called with this
type of arguments? Like just imagine being able to get answers to that very rapidly. It would
totally change how we would make decisions about software evolution. Yeah, that's a super powerful
tool. Amazon has something similar. I don't think it's as polished as Google's, but.
NVIDIA has something similar called Envy Grok.
And yeah, it's not as polished as Google.
But what I really want is I really want the open source version of that.
And there used to be some open, Google used to run some open version of this called like
Google Code Search or something.
Maybe it's still around.
I seem to recall using it at some point in the past.
But what I really want, you know, ultimately what we really want is we want to be able to go and search through GitHub and like do these queries on GitHub.
But it's just not really possible and feasible today with the limitations of their search API.
Yeah.
Yep.
Well, so wait, I'll say one more thing, and then we need to figure out what to call this
little 30-minute episode.
Uh, so I tried to sign up for this Copilot, because what I want to see, if I type into
my C++ program a comment that says,
"sum the elements of this vector,"
put a little pointy arrow to the vector above,
will it give me a for loop or will it give me a std::accumulate?
That's what I want to know.
And if it gives me a std::accumulate,
sign me up.
I'll have code.
I'll have this.
What about a range-based for loop? Is a range-based for loop acceptable?
No.
Oh, no, you said an accumulation pattern. Yeah, okay, I meant, like, I was thinking more of a for-each pattern.
Thanks for listening. We hope you enjoyed and have a great day.