Programming Throwdown - Squashing bugs using AI and Machine Learning
Episode Date: February 18, 2020. The best part of hosting Programming Throwdown is reading emails from people who listened to this show before they had any coding experience and went on to land jobs in tech. Thanks so much for inspiring us with your stories. My second favorite part of hosting the show is hearing about so many awesome programming tools and resources, often when they are just starting out. DeepCode is one of these amazing resources. DeepCode is a static analysis tool that looks at your code and, using AI trained on all the code in GitHub (!!!), finds common mistakes and offers suggestions on how to resolve them. I am a heavy user of static analysis tools, and yet DeepCode was still able to find real issues in one of my Python projects above and beyond pylint and mypy. Best of all, it's completely free to use for open source projects! Give it a shot and let us know what you think! Show notes: https://www.programmingthrowdown.com/2020/02/episode-99-squashing-bugs-using-ai-and.html ★ Support this podcast on Patreon ★
Transcript
Discussion (0)
Programming Throwdown, Episode 99: Squashing Bugs Using AI and Machine Learning, with Boris Paskalev.
Take it away, Jason.
Hey! This is a really, really
fascinating topic. I think there's been a lot of interest around using AI to sort of help developers.
We've all had apps, even super popular apps like Uber and the Google app and other apps
where they have bugs and they crash and things like that. So even the people who are the best at this have tons of issues.
And so it's becoming a really interesting area.
And we're so fortunate we have Boris Paskalev, the CEO and co-founder of DeepCode,
to kind of sit down and roundtable how we can make software better.
So Boris, why don't you kind of start off by telling us, like, a little bit about your background.
You know, what was the idea or the inception of DeepCode and, you know, kind of at a high level of what you guys are doing right now.
Sounds good. So, welcome, everyone. Thank you for having me.
So, my name is Boris Paskalev. I was originally born in Bulgaria.
And I did my bachelor's and master's in computer science in Boston, and then worked with a number of different companies as a developer. Later on, I did an executive MBA, so I started moving into project management, program management, and many different things in that space. And yeah, I'm still working in technology,
but I'm coding less and less, as you can imagine.
So about DeepCode, obviously, the idea is that it started in 2016,
and it's a spin-off from ETH Zurich,
which is the number one tech university in Europe.
I usually call it the MIT of Europe when I'm joking around with the guys.
Nice.
But yeah, the idea is that my other two co-founders,
Professor Martin Vechev and Dr. Veselin Raychev,
so they spent about 12 years together researching the topic of learning from big code,
so program analysis and how we can really apply powerful machine learning algorithms on top of code. And they've published many, many papers and won lots of awards in the space. And after our CTO, Veselin Raychev, finished his PhD, pretty much the idea was clear that we had to start this as a new platform that should revolutionize how software development works over time. So that's really the passion behind it.
Cool. So let's unpack a lot of that.
So you were a software developer, and then at some point you went to a business school.
What sort of inspired that?
Oh, that's an interesting question.
I think it's a consequence of many, many events.
But slowly, as a software developer,
I started kind of project managing the software I was dealing with,
kind of delivering it to the end users
and kind of working on the end-to-end,
trying to actually get specifications from them,
understand how we test it,
obviously making sure there are fewer and fewer bugs, making sure that we fulfill the right requirements. And kind of in my head, slowly,
I started building this overall picture that it's not just the software, but it's an end-to-end
process. So I started project managing some of those, then they started giving me larger projects
with more people. And then all of a sudden, the project became, hey, build software that actually runs this robotized production line. And that's how I ventured outside of the software-for-software space. And then eventually I just had projects like, can you open our development office in Switzerland? And that's how I kind of moved into non-software projects. And eventually I said, well, if I want to do that successfully, I need to learn a bit more about that side of the world. And you can say that's the executive MBA.
Cool, that makes sense. So, you know, this is something I've always wondered: at what level do you not need to have a technical background? For example, does a line manager need to be technical? A line engineering manager? Does the director need to be technical? And at what point do you say, okay, this has now fully transitioned into project management, process management, from a business perspective?
That is a very polarized topic. I mean, as a techie, I obviously believe that everybody
should have technical understanding and background just because this is where you learn basic logic, let's put it that way.
Start thinking more like a machine than a person, to put it that way.
And I think everybody should have a little bit of that.
I mean, clearly, when you're going into the arts, etc., there is quite a big range where you might not need that. But for anything where you're working in the business world, you need some kind of a technical background, be it pure math, which can actually lead to the same way of thinking. But with that said, there are many exceptions. I had many colleagues that didn't actually have any technical background. One of my very first technical managers, a brilliant developer, actually has a PhD in literature, right?
So, yeah.
So there are plenty of exceptions to the rule.
Yeah, totally.
And we have a lot of folks listening who maybe started listening when the show started, I don't know, Patrick, what, eight years ago or something? And they have all sorts of different degrees and backgrounds. And one of the awesome things about this area is that it's so easy to pick up; you don't need to have a really long apprenticeship. Actually, I was reading something that if you're a hand surgeon, it's actually one of the hardest professions to pick up. And so in software, you can do it in, you know, I mean, it's a craft, so you have to develop it like any craft, but you can at least get started quickly, regardless of your background.
Yeah, I think we just actually hired a junior software developer that had zero background
in software development.
Like a couple of years ago he said, hey, I just want to be a developer.
And he started picking it up.
And I think with modern languages and frameworks, it is getting easier and easier to get into
it, which is great.
And there's a lot of tools that can help you along the way.
Cool.
So let's go into deep code.
So you're talking about this person, these two folks who were doing some academic research on code analysis.
And one of the things I think that's hard to grasp for a lot of people is how do you do analysis on code, right?
I mean, you can imagine, I guess you have to get rid of all the white space, right?
But it's this sort of really complicated language.
It doesn't have clear parts of speech.
So a lot of NLP techniques, it's not obvious how they would work here.
And so how does one even begin to sort of programmatically understand a C++ file?
Like how does that actually work?
It is a very, very good question.
And the short answer is it is very hard.
I bet.
But you start with extremely good developers.
That's the number one thing.
I think both of our other co-founders,
they're really like, they're language agnostic.
They can literally do anything.
They understand the languages inside out.
Because the majority of the problem is program analysis.
I mean, it's a pretty old discipline that's not very sought after today, but it really kind of gives you the core of what it is.
So let me just walk you high level to the process of what it is.
So first you start with parsers.
So for each language there is a parser.
So we actually use one of the standard parsers,
maybe small changes here and there, but that's about it.
And then a lot of tools pretty much end up here.
They're just minor adjustments
because the parser gives you an abstract syntax tree or AST,
and then people work on it.
But the ASTs are not rich enough.
They don't really represent everything that you need in the program,
as you said, very complex.
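As a rough illustration of what a parser hands you, here is a sketch using Python's standard `ast` module (DeepCode's own parsers aren't public, so this only shows the general shape of an abstract syntax tree):

```python
import ast

# Parse a tiny program into an abstract syntax tree (AST).
source = "total = price * quantity"
tree = ast.parse(source)

# The AST captures the syntactic structure: an assignment whose
# value is a binary operation over two names.
assign = tree.body[0]
print(type(assign).__name__)        # Assign
print(type(assign.value).__name__)  # BinOp
print(assign.targets[0].id)         # total

# But it stops at syntax: it does not know the types of `price`
# and `quantity`, where they were defined, or whether `total` is
# ever read again, which is why richer representations are needed.
```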
So this is where kind of the magic starts.
We actually use proprietary solvers that actually extract every single semantic fact in the program,
every single interaction, every single function, every single variable, every single object,
and we build the relationship between them in terms of who is calling who, how is it changing,
are you casting something to something else, etc., etc. So this actually goes into a graph index that represents this whole interaction of the code. And this is kind of the key piece, right? And this graph has to be pretty much in a machine-learnable format, right? So then you can apply machine learning on top of it. So that's kind of where most of the magic is: to create a machine learning representation for code, which pretty much does not exist today. Or if it exists, as you said, it's treating the code as a string, as text, which obviously drops all the semantic facts that are interesting about code.
And after you have this machine learning representation, then it's all about speed, efficiency,
to learn from every single fix ever made,
every single line of code that exists out there,
and then apply machine learning algorithms to extract specific facts.
This actually brings you to this knowledge base of everything that has happened: how people have fixed different things, whether there is consensus on how to fix them, or whether people are fixing them in totally different ways. Then you can actually assign probabilities of whether a problem is likely to be fixed in a specific way versus another.
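The counting idea at the end can be sketched in a few lines; the issue and fix labels below are made up purely for illustration, not DeepCode's actual taxonomy:

```python
from collections import Counter, defaultdict

# Toy "knowledge base" of fixes mined from many repositories.
# Each observation pairs an issue pattern with the fix applied to it.
observed_fixes = [
    ("null-deref", "add-none-check"),
    ("null-deref", "add-none-check"),
    ("null-deref", "early-return"),
    ("off-by-one", "use-len-minus-1"),
]

# Count how often each fix was applied to each issue...
counts = defaultdict(Counter)
for issue, fix in observed_fixes:
    counts[issue][fix] += 1

# ...and turn the counts into per-issue probabilities.
def fix_probabilities(issue):
    total = sum(counts[issue].values())
    return {fix: n / total for fix, n in counts[issue].items()}

print(fix_probabilities("null-deref"))
# add-none-check has probability 2/3, early-return 1/3
```

The real system learns over a semantic graph rather than string labels, but the payoff is the same: given a detected issue, rank candidate fixes by how often developers chose them.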
Got it.
Okay, so let me see if I can unpack this.
So similar to the grammar parsers, if folks listening remember those from grade school: you would have a sentence like, the man went to the supermarket, and you'd have to diagram that sentence.
And you would separate out the subject and the verb and you'd draw this little graph to represent that sentence.
You can do the same thing with code.
And then, I guess, that by itself isn't enough.
So can you explain a little bit what has to happen to turn that syntax tree
into a machine learnable format?
So in other words, why couldn't the machines just learn on the syntax tree?
What's missing there?
So it's missing a lot of the interactions and the depth.
For example, say you have an object and you put it into an array, right, and then you actually get this object out of the array. The abstract syntax tree will not be able to tell you if it's the same object, right? So the
idea is that you have to have a much, much, much larger depth and naturally tracking
every single thing that happens. Like you need inter-procedural analysis, points-to-analysis,
type-state analysis,
may-versus-must analysis. So there is
a very wide range of different things
that abstract syntax trees
will not be able to give you.
And that's why the types of issues, or the representations you can actually build on them, are much more simplistic. And then if you build rules on top of them, your accuracy level is much lower.
So you get lots of false positives.
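A minimal Python sketch of the array example: the runtime can confirm object identity, but the AST of the same code records only syntax, which is the gap that points-to and inter-procedural analysis fill:

```python
import ast

# At runtime, Python can tell us the object that comes back out of a
# list is the very same object that went in:
obj = {"id": 1}
arr = []
arr.append(obj)
retrieved = arr[0]
print(retrieved is obj)  # True

# A plain AST of the same code, however, only records the syntax.
# `arr[0]` is just a Subscript node; nothing links it back to the
# append() call, so a rule written against the AST alone cannot know
# that the two expressions denote the same object.
tree = ast.parse("retrieved = arr[0]")
print(type(tree.body[0].value).__name__)  # Subscript
```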
Got it. That makes sense.
So, similar to... you've talked in the past about type inference, and how, well, it probably started way before Haskell, but Haskell, I think, is where it started to become popular. And then you saw Scala with it, and now almost every language has it: TypeScript has it, Python has it. And so what they're doing is they're looking at the program flow,
and they're saying, you know, x equals 3, a equals x, and so therefore a is also an integer.
And so they're kind of tracing through. And you're saying that you need to do that
kind of as a pre-processing step
to actually look at the runtime flow
in addition to the syntax,
and all of that goes into some,
let's say, connection structure, some type of graph.
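A toy version of that tracing, written against Python's `ast` module; real inference engines handle control flow, functions, and far more, so this only illustrates the propagation step:

```python
import ast

# Toy forward type propagation: walk simple assignments in order and
# propagate inferred types from constants through name-to-name copies.
def infer_types(source):
    types = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            target = node.targets[0].id
            value = node.value
            if isinstance(value, ast.Constant):
                # x = 3: x takes the constant's type
                types[target] = type(value.value).__name__
            elif isinstance(value, ast.Name) and value.id in types:
                # a = x: a gets whatever type x was inferred to have
                types[target] = types[value.id]
    return types

print(infer_types("x = 3\na = x"))  # {'x': 'int', 'a': 'int'}
```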
That is correct.
And the key part here is that
we're actually doing this statically, right?
So we don't have to actually build the code.
We don't even have to compile it and make sure that it runs. We can actually track this purely by understanding the structure. This is where the program analysis comes in.
Got it. Okay, cool. That makes sense. And then at that point, it's just ingesting as much code as you can,
looking at zillions of GitHub
pull requests that say, you know,
fix array out of bounds error.
There's probably, you know, 800,000 of those in GitHub.
And so you can kind of look at all of those
and see what the common structures are.
That is correct, because when you actually convert it
to this internal representation that we have,
then that becomes purely language agnostic, right?
That's kind of the best part.
So the only thing that is language specific is the parsing, and then the representation
is language agnostic.
So we don't care how the developer wrote it.
We actually care what it does and how it actually does it.
Got it.
Do you look at the variable names, or is that just too much noise?
So we actually have metadata that we actually track back to what it is,
because that's the idea that when we understand the problem,
we can actually point to where it's coming from and what it does.
But ultimately, we don't care what the variable name is.
It could be anything.
Yeah, that makes sense.
I mean, I can imagine just semantic errors.
You say miles equals feet times 10 and say, okay, that's an issue.
But I feel like, as you said, it's just too hard.
There's just too many different names for things.
Yeah, so there is a different tool that we released, back during the research years, that does de-obfuscation. It actually tries to predict very accurately what the best name for each variable is, so you can de-obfuscate the code and actually make it human readable.
Yeah, de-obfuscate it.
Yeah, so in that case, we actually have pretty good heuristics, based pretty much on machine learning, on how a specific variable is used and what it does, and then understanding how people name it, and then understanding what the right name should be.
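As a toy illustration of predicting names from usage (JSNice itself learns probabilistic models from huge amounts of real code; the single hand-written heuristic below is only to convey the idea):

```python
import ast

# Toy name predictor: look at how a variable is used and suggest a
# more readable name. Real systems learn these associations instead
# of hard-coding them.
def suggest_name(source, var):
    for node in ast.walk(ast.parse(source)):
        # A variable assigned from len(...) is probably a length/count.
        if (isinstance(node, ast.Assign)
                and isinstance(node.targets[0], ast.Name)
                and node.targets[0].id == var
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id == "len"):
            return "length"
    return var  # no evidence: keep the obfuscated name

print(suggest_name("a = len(items)", "a"))  # length
```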
That is super cool.
Yeah. Yeah, is that something that people can just
try out? Can they just upload a CPP file and see what the de-obfuscator suggests?
I think that was mainly for JavaScript. It's JSNice and Nice2Predict. Those are the two
tools that actually do that.
And they're free to use, so
anybody can upload anything there and
use it.
Oh, very cool. You have to check that out.
So you said JSNice?
JSNice, yes.
JSNice. Cool. Yeah, we'll add it to the show notes. So, okay, let's spiral back a bit to the sort of product and market here.
So how do folks, you know, in general, how do folks find bugs now? Especially, you know, I mean, we've all found bugs, you know, in our school projects and things like that.
But when you're in these enormous software engineering teams, so imagine you're building the Amazon app or something.
How do developers go about finding bugs in these sort of monolithic projects?
Wow, it really depends. I mean, there's ultimately two major ways.
The human way, like when you're using humans to do it,
and some kind of automated ways.
Human way is obviously a range,
like while coding, like yourself,
or when you're doing peer programming,
you have your counterpart actually say,
hey, what the hell is that?
So that's standard, right?
The peer code reviews, I mean,
code reviews are pretty common today.
So that's one of them, I think most common ways
that people will actually identify issues.
And then unit and functional testing that you've built
or you're building as you're developing.
Then the more, sometimes I call it old school, but it's actually still pretty popular today: actual QA testing or QA processes.
There's many different ones.
Lots of them are still human-based.
And then the very final one is
your actual customers or users testing it
and saying, hey, this broke.
What the hell?
What should I do with it?
So I think this is kind of the range
of the human identification of bugs.
And there are obviously many automated ways coming in as well: static analysis, automatic test coverage or test generation, then the wider range of formal verification or fuzzing. Then there's also compliance testing that could be happening; for specific industries, there's a specific set of rules that you have to test for, and those can uncover issues. Obviously, there's dynamic analysis as well. That's pretty big these days. And there are many areas where you can catch things by just simulating the runtime of the
program, or actually running it live while the users are performing operations.
Yeah, that makes sense. I think as you move along that continuum, you add sort of more and more, I guess, risk or more side effects, right?
So if a bug makes it all the way out to the end user, it can be totally catastrophic.
Like you break the login page for your app.
You might never recover from that.
Actually, I know companies, like small companies, that broke the login page and literally never recovered.
And you go all the way to the other end of the spectrum where there's a static analysis tool
so that before you've even hit save on the file,
you know that there's an issue.
And so that's obviously anything you can catch there
adds huge value to your time and to de-risking the project.
Correct.
So ultimately, the belief is yes, the earlier you catch it in the development cycle, it is much cheaper.
I think there is an exponential growth.
A couple of studies have shown that if you're catching it after it goes to production, it costs you like thousands of times more than actually catching it during development time.
So ultimately, most tools are actually focusing
to try to get it as early as possible.
So the developers not only identify them
as they create the problem,
but they actually can learn from the solution
because it is fresh in your mind.
You say, oh, there is a problem here.
Oh, that actually makes sense.
Or if you have something that can even help you with the explanation, then it's even easier for the developer to figure out: oh yeah, that is the problem, makes sense, I know how to fix it. And you're pretty much never going to make that mistake again, which is the beauty of it all.
Yeah, that makes sense. It's so true: if you have to fix something, especially if you're under the gun because it's affecting production, and maybe it's two months after you wrote it,
you're probably not going to retain anything meaningful.
And then you could just make the same mistake again later.
Yeah.
Or it's very likely that somebody else had to fix it,
not the original developer.
Oh, that's true too.
Yeah.
So, in your entire career of being an engineer, leading teams, leading projects here and abroad, or, depending on where you're listening, abroad and then in Europe, what's the biggest, most intense horror story? What's the bug that really kept everyone up at night? Something that terrified you?
That's a fair question.
I mean, as a good example,
actually, it literally kept us up at night.
We had, this is back in the early days of Vistaprint.
It was the holiday season.
It was like pretty much like 10 times more volume
than a standard day, right?
And you have to process millions of orders.
So like production went down, like literally,
and that's very costly because a huge amount of orders
that pretty much cannot go through.
So we actually had like, yeah, pages went up,
like we had to wake up like three in the morning plus,
and then we went to the office,
and I think we spent something like two to three days
because what happened is that we actually overflowed one of our production facilities, so they could not really take any more orders.
So we really had to create the fake production facilities in a matter of like a day.
And we didn't know exactly originally where the problem is.
It took us some time to actually figure it out.
And then we said, OK, that's a really serious problem.
And then we have to continue fixing it.
And the fix was pretty bad, because we had to make a new production facility. So we spent a couple of days in the office non-stop. It was quite an interesting experience, and we ate a lot of pizza.
I'm sure you built some camaraderie, and I'm sure you never want to do it again.
That is for sure, yes. I mean, there was a lot of SQL back then. We had to write a lot of SQL migration scripts to catch that. And even today, that is not the most bulletproof language to do things in.
Yeah. I mean, I think, so Patrick, it'd be cool to hear from you too.
But I have a couple. One, we were working with this vendor who is providing basically an internal tool for us.
But the internal tool, this is a terrible design decision,
the internal tool is using the same fleet of machines as our production site.
And basically they deployed the internal tool
and about a month later
the entire site went down
and we were trying to figure out what it was
it was really hard to sort of find
do some root cause analysis
and what we ended up finding
was this one file
where, so folks
out there might know
in C++ if you append to a string and you do this many, many times, it's actually not a big deal.
Because C++ will just allocate a big chunk of memory and then start filling it up slowly.
But in other languages, and I think this was in Visual Basic, every time you append to a string, it makes a copy of the string. I think Java does this
too. It makes a copy of the string that's big enough for the two strings you're trying to stick
together and then puts both of them there. And so this entire file was just thousands and thousands
of append string, append string, append string commands. And, you know, the logic was all fine; logically, it was doing the right thing. And for a while, it didn't matter. But then the memory fragmentation just eventually caused all the machines to start underperforming. And it's one of these things where once something starts slowing down, more requests come in, people just start hitting refresh, which then causes it to slow down more and blow up.
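A back-of-the-envelope model of the copy-on-append behavior described above shows why it blows up: if every append copies the whole string so far, the total work grows quadratically with the number of appends.

```python
# Toy model of copy-on-append string building: each append of one
# character copies the entire string accumulated so far.
def chars_copied_naive(n):
    copied = 0
    length = 0
    for _ in range(n):
        copied += length + 1  # copy old string plus the appended char
        length += 1
    return copied

# Doubling the number of appends roughly quadruples the copying:
print(chars_copied_naive(1_000))  # 500500
print(chars_copied_naive(2_000))  # 2001000

# In Python, the idiomatic fix is to collect pieces and join once,
# which copies each character only a constant number of times:
pieces = [str(i) for i in range(5)]
print("".join(pieces))  # 01234
```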
And I just remember spending just so many hours, same thing, just eating pizza.
I think we worked almost like 30 hours straight.
We were sleeping in the office.
And then finally we figured out it was this internal tool.
And then we just deleted it and went home.
Nice.
Yeah, I don't know that I have any horror stories where I stayed up 30 hours. I guess that sounds pretentious; not trying to be. I mean, tons of bugs, like you mentioned. We had one where we were dereferencing a pointer that wasn't set correctly in C++, and of course the data you get is just sort of garbage data. But it, you know, mostly worked until it didn't. It was mostly zeros, and then occasionally it would not be zeros and would break stuff. And it was very hard to debug because, you know, 99% of the time it would work correctly, because you would dereference an address that was just set to zeros.
But then occasionally, it would dereference an address that would have something else in it,
that would be some old, not overwritten data,
and then the program would crash.
Yeah, I remember.
This is a classic.
Yeah, I remember similar things,
just like after two hours of runtime,
it would just randomly crash.
It always comes out something like that.
I think actually the hardest things for me lately
are just issues with the data.
And these are things where I think there's a lot of tooling that still needs to be built.
So, I mean, one example I'm thinking of is someone made a bug in a system where they had a day and they needed to convert it to day of week.
And somehow they messed that up.
And so there was no Sunday.
So basically all the Sundays were set to Saturday.
And so there was twice as many Saturdays as you'd expect, no Sundays.
And that just caused total havoc in all of the downstream systems.
But again, it's not something you can find pretty easily.
It's something kind of inherent in the data.
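A hypothetical reconstruction of that kind of bug in Python (the clamp below is invented for illustration, not the actual code): a day-name table that is one entry short, plus a defensive `min`, silently turns every Sunday into a Saturday.

```python
import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday"]  # oops: the table is one entry short

def day_name_buggy(d):
    # isoweekday() is 1..7; clamping to 6 silently maps Sunday (7)
    # into Saturday's slot instead of failing loudly.
    return DAYS[min(d.isoweekday(), 6) - 1]

def day_name_correct(d):
    return d.strftime("%A")

sunday = datetime.date(2020, 2, 16)  # a Sunday
print(day_name_buggy(sunday))    # Saturday (the bug: no Sundays left)
print(day_name_correct(sunday))  # Sunday
```

Downstream consumers would see twice as many Saturdays and no Sundays, exactly the havoc described, with no crash anywhere to point at.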
Yeah, that's actually one of the most common errors that we reported in 2019: date-time formatting. There are like hundreds of different ways to actually make a mistake with those, and they're obviously hard to figure out. And that's why we have a whole category of issues that actually detects that, which is pretty cool.
Yeah, totally. I think
anytime you can stay in UTC time, just stay in UTC. Don't print the date until maybe the very end.
AM/PM is another big issue in that space. I think I've had a couple of issues like this myself, where you say, okay, it's a datetime, and you forget that you're actually using the 12 hours only, and then you actually don't keep track of AM/PM at all.
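In Python's `strptime`, for example, the trap looks like this: `%I` parses a 12-hour clock, and unless you also capture `%p`, the AM/PM information never reaches the result.

```python
from datetime import datetime

# Parsing a 12-hour time correctly requires both %I and %p:
correct = datetime.strptime("07:30 PM", "%I:%M %p")
print(correct.hour)  # 19

# If the AM/PM marker is dropped upstream and only "%I:%M" is parsed,
# the result silently lands in the morning:
clock_only = datetime.strptime("07:30", "%I:%M")
print(clock_only.hour)  # 7  (was this morning or evening? no way to tell)
```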
Yeah, yeah.
We had an issue where, I think, there's something about an extra hour. I don't remember the details, but we have an anomaly...
Oh, that's right, yeah. So we have an anomaly detection system, and to this day, twice a year, when there's a daylight savings shift, there's more or less traffic than we expect that day, and an alarm goes off.
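The anomaly is real: on a daylight-saving transition, the local day is not 24 hours long. A sketch using Python's `zoneinfo` (3.9+, assuming the time-zone database is available), for the 2020 spring-forward in Los Angeles:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

tz = ZoneInfo("America/Los_Angeles")
start = datetime(2020, 3, 8, tzinfo=tz)  # local midnight, Mar 8
end = datetime(2020, 3, 9, tzinfo=tz)    # local midnight, Mar 9

# Same-zone subtraction compares wall-clock times and claims 24 hours...
print(end - start)  # 1 day, 0:00:00

# ...but in elapsed (UTC) time, the spring-forward day is 23 hours,
# exactly the once-a-year skew that trips traffic alarms.
elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
print(elapsed == timedelta(hours=23))  # True
```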
Cool.
Sorry, go ahead.
I was going to say, with daylight savings, we always had issues that during that week, you're going to show up to a meeting and half of the people will not be there, just because they're in a different time zone that's in a different daylight savings. That's always a funny one.
Yeah, totally. I'm waiting for the day when they get rid of that. I personally like it to be lighter later. And there was actually, over here in California, they were going to pass a "we should get rid of it," but then, in typical politician fashion, they said, but each country decides how they want to implement it. So now each country decides if they want to use it or not, so it's still going to have some countries drag along.
Oh, geez.
Hey guys, I'm going to jump in for a minute here with a word from this episode's sponsor. We're happy
again to have educative.io, an online learning resource. And since the last time we came to you
talking about them, they've changed things up a little. They have a brand new option. Instead of
buying courses one by one and selecting what you want, they now offer, they still offer that,
but they also offer the ability to get a subscription where you can, for the length
of your subscription, access any of the courses. So all the same courses we talked about before,
you can now access for sort of a flat monthly or per time period fee. And because of their
sponsorship, they've agreed to give us, at educative.io/programmingthrowdown, 10% off either a single month's purchase or the subscription.
That's right, you get 10% off pretty much anything in the store. So I was looking at the courses after we talked last time, and a thing that I guess I missed is that they have a number of courses that are actually just completely free if you want to try them. So they have free previews, you know, little parts of many of their courses, but they also have a number of languages from scratch, so C++ and Python from scratch, and those courses are actually free. So not only
are they a great way to learn
or advance your knowledge in those languages,
but it's a great way to actually check out the platform first
because spending any money,
I mean, I don't like spending money.
So not spending money
and still being able to check something out
always alleviates a little bit of the sort of nervousness
around doing it.
And this is a great way to check out the platform,
these from scratch courses. And there's also another course I found on here that's the
practicing for programming interviews. So Jason, you do interviews at your company.
Do you recommend people practice programming before they show up to your interview?
Yeah, totally. The cool thing about this is, when you're kind of writing in this sort of environment,
they kind of give you some setup
and then you write some code
and then you can even kind of validate
that you've done the right thing.
And that kind of loop of writing something,
especially in this case,
unless you're sort of using an editor
and then coming back,
but if you're writing it in the editor, you're kind of writing that as if you're writing it on a whiteboard.
And then you're getting this instant feedback like, OK, you got it right.
You need to try again.
And that's really going to kind of give you that muscle memory that you need so that when you go to a whiteboard interview, you kind of are a little bit more prepared.
I think the from-scratch courses are really solid. And the other part of it is, you know, it's a different way to learn. There might be some folks out there who can do the lecture thing; for me, it never really resonated. Something like this, where it's hands-on, is really good for people like me. And it's awesome that they have this free course. So even if you're, you know, a Python guru, but you think this might be a good way to learn something else, you could dive through the Python course, or you can pick a different language and dive through that, totally for free. And then, if that looks like the kind of learning model that's really going to work for you, you know, you kind of know that without having to spend any money.
I was trying to think about it. I guess when I was learning to program, I mostly did it from books, before even really getting on the Internet, or there being Internet. I mean, I guess there probably was, but we didn't really use it. So the closest thing I could think of to this was: did you ever do vimtutor?
No, I didn't.
Oh, so if you've ever used Vim, one of the things, I guess vi is the same thing, I'm not really good about the difference between the two, is that if you open it, sometimes at the bottom it will say... I forget, there's some command you can type, and it will basically walk you through a document, which also tells you how to edit, and you move up and down in the document, and it's almost like interactive fiction about learning how to use Vim.
Oh yeah, Emacs has something similar.
Oh, okay.
And I was thinking, actually, this kind of is like that, but, you know, of course like 100 times better or whatever. But when you originally said that, I thought you were thinking about, they actually used to have these, I keep wanting to say choose-your-own-adventure. They're about the same form factor as those choose-your-own-adventure books, but it's not; you read it, you know, left to right. But you get to a point and there's basically a little puzzle, and, you know, you're supposed to solve the puzzle, and then somehow there's some way to validate you got it right, and then you keep reading. So the idea is, it's kind of on the honor system, but you would be reading, this is just a paperback book, and then it would be like, oh, put in this BASIC code, and then edit it to solve the puzzle. And that's basically how I got started.
Yeah, so I think this format is obviously catching on in a lot of different places, and it's really awesome to see. This is a learning resource I think will really resonate with people: the ability to be almost anywhere and interactively read and write the code, not necessarily just listening to a lecture and trying to figure out, how do I make it go 50% faster, or how fast can I go before I can't understand the person anymore? Not that I've ever done that before. But yeah, I think this is a great format for learning.
Yeah, this is awesome. Guys, check it out: educative.io slash programming throwdown. That's going to get you a 10% discount. That's also going to let them know that this was a good slot for them, that they were a good sponsor for us. So it helps us out, and it also helps you out. And check it out; also let us know. Send us comments, send us email, let us know what you think of the courses, and we can relay that back to the folks at Educative and give them feedback so they can keep improving.
All right.
Thanks for the sponsorship.
And let's get back to interviewing.
All right.
I want to talk more about how DeepCode works.
We talked about from the technical side.
But as a developer, what's sort of that experience like?
So there are multiple ways to use DeepCode; that's the idea. We have a public API and a command line interface; you can really hook it up anywhere you want. But the most standard usage, as we spoke about earlier, is as part of the IDE. We've already released VS Code and Atom plugins, so you can directly get the suggestions as you code. The next stage is obviously after you commit the code, and we actually have integrations with GitHub, Bitbucket, and GitLab, where we have a bot that automatically comments on pull requests. It will tell you, hey, you're doing these new things; by the way, you're introducing this issue, please look at it. So in this case, it gives you a diff analysis on the new things.
Obviously, at the latest stage,
you can actually add it to any CI or CD pipelines or QA,
let's put it that way.
So it really depends on
the workflow that you're actually working in,
but you can use it pretty much anywhere.
That makes sense. So from a usability standpoint, it sounds similar to other static analysis tools like ShellCheck or Pylint or Flake8 or one of these things.
I mean, yeah, from the workflow perspective, that's absolutely the case. The big difference is that we actually run pretty much in real time, so we complete the analysis in seconds, like one second for an average piece of code, or even less.
So that allows you to actually have the IDE piece, because for a lot of linters, et cetera, when they run in the IDE, it takes time to actually get results. So speed is one of our main focuses. Everything has to be real time for the experience to be correct, and for people not to say, okay, I have to wait now for 15 minutes to get my results; hopefully I get an e-mail. Great. The amount of time you save is wasted by waiting.
Yeah. Are you running
some machine learning in the Visual Studio Code extension,
or are you making an RPC with the code to a server?
Yeah, it's a server-side analysis.
So pretty much we have a cloud server that can do that.
For larger companies, you can also have the on-premise server
that actually runs it locally if you're fully behind a firewall, for example.
Got it. Okay, that makes sense.
So, and we'll talk about the pricing and all that later, but if you're a student or you're working on a hobby project, you just turn this on and you're good to go. If you're working at, say, a military research lab or something, then maybe it might not be the best idea to turn this on, and so what you want to do is reach out to Boris, get a contract for your company, and get something on-premises.
That is correct.
And then when you install it in the IDE,
you can just select which server you actually want to get
the analysis happening on, either the cloud-based,
SaaS, or your own.
Cool, that makes sense.
And so what languages are supported?
Is it
pretty much everything or are you starting with a few key languages?
So we started with kind of the top languages. So we have Java, JavaScript,
TypeScript, Python. So this is live today. This month we're going to release C and C++,
very likely.
Hey, hey, hey, good news!
And yeah,
we're going to be adding a couple more languages this year.
Pretty much our main focus has been
to kind of perfect, making sure that we really
rock on all languages that are live,
rather than just add every single language out there
and be just average.
And we built the platform architecture to do that.
So also now adding new languages is extremely fast for us,
which will enable the addition of new languages as well.
Plus, at some point, there is the vision to open source it
so people coming up with new languages,
which happens a lot these days, can just add it as well.
Cool, that makes sense.
So anything that's machine learning-based,
you have to deal with four possible outcomes, right? True positive means a person made a mistake you've seen a thousand times; you tell them about it, they fix it, that's awesome. True negative is everything else your program is not telling people about, the things that are fine, right? But let's talk a little bit about the false situations. If your program tells someone to fix something, but they disagree, what do they do there? Is there a way to give you feedback in the IDE, or do they turn that one off? Is there like a slash-slash ignore comment or something that they can put?
Absolutely. So yeah, we definitely have a slash-slash deepcode ignore, which will ignore the specific instance on the line before or after. And you can actually highlight which specific issue you're ignoring, because sometimes you can have more than one on a single line of code. And same thing, you can actually provide feedback. The way our platform is built, once you provide feedback, we can very quickly, I'm talking within minutes, adjust the knowledge base and either expand the rule or split it into multiple rules, so it actually gets accurate. So that's the beauty of the platform: you can very quickly and very easily ingest this user feedback and push the fix to everybody else immediately.
Cool. That's so awesome.
How are the errors surfaced?
So can you kind of walk me through that?
So the machine learning system says this, I guess, syntax flow tree, we've seen this pattern before and we don't like it.
How does that turn into something actionable in sort of an English format where someone
can read that and grok it?
So this is one of the nice add-on engines that we have, which gives you semantic explanations of the problem. This is where the symbolic AI comes in. Since we have metadata for the original piece of code, we can define gaps and say, hey, the function that you're calling, and we tell you which function; if you hover the mouse over it, it will show you the line of code and the function that you're calling, is getting user input, right? And you can actually see what the object with the user input is, and this input flows into some kind of execution on the back end, for example. When you hover over it, we'll show you where it is, and we can tell you, hey, this user input is not being sanitized by the time it gets to the execution, which can cause some kind of denial of service or path traversal or many different undesirable things executing on your back end. But that's the idea. We have a human-readable explanation that semantically points to each object of the problem, and you can use the code base to point you through it, so you can actually walk it through and understand the flow.
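To make the kind of flow Boris describes concrete, here is a minimal sketch, not DeepCode output, of an unsanitized source-to-sink path and its sanitized counterpart; the function names are made up for illustration:

```python
import shlex

def build_command_unsafe(user_input: str) -> str:
    # Tainted flow: user input is concatenated straight into a shell
    # command with no sanitizer between source and sink -- the pattern
    # a data-flow analyzer would flag as command injection.
    return "ls " + user_input

def build_command_safe(user_input: str) -> str:
    # A sanitizer on the path (shlex.quote) escapes shell
    # metacharacters, breaking the tainted flow.
    return "ls " + shlex.quote(user_input)
```

With input like `; rm -rf /`, the first function yields an injectable command string, while the second yields a harmlessly quoted argument.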
So am I right to say there's sort of this machine learning step
where you are doing this almost unsupervised problem
of looking at GitHub PRs and trying to find patterns of mistakes, right?
But then to convert that into some sort of
like slot-based language generation,
there has to be like a person in the loop there, right?
So is there someone on your side
who's looking at the most common errors
and then trying to find a symbolic representation of that?
Yeah, so we have semi-supervised learning on this second part, which allows us to create categories of problems, right? So we have a data-flow category, for example, saying, hey, you have user input, it flows into something, and it's not sanitized. So what happens here? We define this category, and then, with a very small seed of examples, our machine learning actually identifies thousands of objects that are doing the same thing. So you provide, say, 10 or 20 examples of user-input functions, and then our core automatically says, hey, there are actually 6,000 different user inputs that I found. So that becomes your category, with 6,000 potential user inputs, right? Then you say, what are the things, or where, something can actually get executed on the back end, and that becomes the category of those problems. And then you look for sanitizers, and you can have hundreds of thousands of sanitizers. And any combination of those three can lead to a data-flow problem, right? So with very little effort from a user, you end up with pretty much millions of combinations of problems that you detect. And this is the benefit: you don't have to write individual rules. Instead of writing a million rules, within five minutes you've created a category that represents millions of potential problems.
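A toy sketch of this "few seeds, many rules" idea follows. Everything here is hypothetical: the seed sets and the look-alike heuristic are crude stand-ins for the learned expansion, chosen only to show the shape of the combination:

```python
SEED_SOURCES = {"input", "sys.stdin.readline", "request.args.get"}
SINKS = {"os.system", "subprocess.run", "eval"}
SANITIZERS = {"shlex.quote", "html.escape"}

def expand_sources(seeds: set, candidates: set) -> set:
    # Stand-in for the learned look-alike step: accept any candidate
    # call whose trailing name segment matches that of a known seed.
    tails = {s.split(".")[-1] for s in seeds}
    return {c for c in candidates if c.split(".")[-1] in tails}

def flow_is_flagged(path: list) -> bool:
    # A call path is flagged when it starts at a source, ends at a
    # sink, and no sanitizer appears anywhere in between.
    return (path[0] in SEED_SOURCES
            and path[-1] in SINKS
            and not any(step in SANITIZERS for step in path[1:-1]))
```

Every (source, sink) pair with no sanitizer between them is a distinct detectable problem, so a handful of seeds per category multiplies into thousands of combinations without writing individual rules.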
Oh, that's super interesting. So I think it sounds a little similar to sort of these lookalike, you know, fingerprint models that they do for spam detection.
So, for example, someone will mark an email as spam. And so then what will happen is a system will add that to a list of seed spam emails.
And then afterwards, there's some lookalike system that says, does this new email share
a lot of the same properties as my spam email?
If so, then let's sort of add it to the spam category.
And so it's almost like an active learning type thing.
So in your case, from the GitHub pull requests, you can infer that passing the string directly into this function was not sanitized, and here's a pull request that adds a sanitization wrapper around the string.
And you can seed sort of, let's say, a new problem with that,
with a few of those examples,
and then do some lookalike type thing to find, as you said,
like all of the adjacent pull requests
that are most similar in nature to that one.
And then if that process gave you pretty high signal, so there wasn't a lot of false positives,
then you say this is sort of like a good concept.
And then you can roll that out.
Yeah.
And the add-on point to this is that while you create a category, since we have the knowledge
base and the history of Git in memory,
let's put it that way, it automatically tells you how many people have fixed such problems,
how many people are vulnerable to this problem today, and you get this real time pretty much.
So as you kind of create a category, you automatically see, is that an important category?
And you can even look into the vulnerabilities and say, okay, am I seeing any false positives
or not?
So you can very quickly check if it's accurate category
and it should be pushed to production.
That allows, again, within 10 minutes,
you can actually create millions of rules
compared to what a current rule-based system looks like.
Very cool. So can you tell me a little bit about the scale on the machine learning side? You're ingesting what sounds like just an unbelievable amount of data. Do you have web mirrors running 24/7 scraping GitHub? Is this something that runs on thousands and thousands of machines in the cloud? Give me an idea, roughly, of the scale and scope of this effort.
So that's the beauty of it all: we have built very lean pipelines, and we don't have to scrape. We've obviously read GitHub once, but once you actually receive the repository history and convert it into our internal representation, which is considerably smaller than the GitHub representation, then you just have to look at the new changes to this repository. So we don't have to re-grab everything else, right? And our internal representation is pretty efficient. So we're talking, let's say, for all of Java, a couple of terabytes; it's not going to be much larger than that. And analyzing, let's say, all the Java code out there will take you less than a day. Everything happens quite quickly.
Wow, so just a couple of terabytes for all Java code on GitHub.
Yeah, and this is not only the tip,
this is all the history.
What we do is, pretty much, we have an in-memory semantic index of every single version of every single repository out there on GitHub. Per language, obviously.
Wow, that's amazing. I mean, that's just a treasure trove of information.
That's what you need in order to actually extract the knowledge of the development community automatically, right?
Otherwise, if that's not efficient, and if you cannot index it and grab it really fast, then you're kind of dead in the water, right? It would take, like, months to actually figure anything out.
Yeah.
That's one of the reasons why we chose to do static analysis without requiring a build or compiling anything, because clearly, if you want to compile something from, like, five years ago, it's unlikely you'll be able to actually build it on your own.
Yeah.
Yeah.
And then the time too,
to build all of that code is extraordinary.
Cool.
So tell me, first of all, for folks who are listening who are in college or high school, or who are working on open source projects: what is available for them, and what's the price like?
Everything for free is the short answer.
So we are 100% free for anything open source.
Our motto is pretty much, hey, we are learning from the open source community,
pretty much everything.
We want to actually encourage the open source community to get better,
be more dynamic, because we'll learn even more.
So it's fully free. So anything in the Cloud,
GitLab, Bitbucket, GitHub, it's free.
You just log in with your account for the Git
and all your repositories you can scan,
you can scan anything open source.
For educational purposes, the same.
We've actually worked with a couple of universities already,
even some of them are developing open source add-ons, which is great as part of their master thesis.
Nice.
And I think the first one that started doing that was San Jose State University.
So we're very happy with the collaboration there.
And it is free for them because that's the idea.
We believe that students, they start using things
and over time,
as they go into the industry,
they say,
hey, I actually need that commercially.
And our tool is pretty good
for people that are learning things.
That's the main thing, because we explain the problems and we provide examples of how other people have fixed them in totally different settings.
So it's a great learning tool as well.
And actually, we're also free for small companies: if you have fewer than 30 developers, pretty much, you can just start using it and nobody will ever bother you.
Cool, this is awesome. We've been super fortunate to have you on the show, to have ZenHub on the show, and a few other folks, CircleCI on the show, and almost all of these, actually every one of these products,
is free for students. So if you're doing, let's say, a senior design project,
you're a CS student or a computer engineering student, you have a senior design project,
and it's something that's going to take you three, four, or five months from start to finish. So it's
not a trivial project. You're working with a team.
You could have continuous integration.
You can have some project management.
And you can have the deep code static analysis totally for free.
And it's really going to give you students out there a feel for what it's like to work on a team in industry, which is really amazing.
That's what allows students to get into software development much more easily these days: all the tooling helps them be faster.
Yeah, totally. I'm sure this is personalized,
but just at a high level,
how does it work for businesses?
Let's say someone works at,
not a,
let's say, bigger than a 30-person
business, but not a
huge, giant conglomerate.
So they work in, say, a mid-sized business,
Macy's or something, as an example.
How can they go about getting deep code
on-premises?
And what does that kind of
look like?
So the on-premise stuff,
it happens through Docker container.
So you get your custom Docker container.
It takes about 10 to 15 minutes to set up.
And then you have your own deep code on-premise server.
And then you have the exact same integration,
the exact same benefits as the SaaS option.
And the cost there is ranging depending on your size, I would say between $30 to $50 per developer per month. Clearly, we offer a 30-day free trial so people
can experience it, test it with their code base, see what it is before they have to purchase. So
there is a risk-free trial there as well.
Cool. Yeah, that makes sense. So now we'll jump into a little bit about the company. So
is DeepCode in, you mentioned ETH Zurich, is DeepCode based out of there, the company?
Yeah, we are literally five minutes away from the Zurich ETH main campus.
So we are a Zurich-based company.
Cool, yeah, I've been there before.
It's very expensive, if I remember correctly.
But it's beautiful.
Yeah, I went there for a conference one time.
And yeah, I'm probably going to get this wrong.
You can correct me.
But I think I just got a hamburger and it was like $8 or something.
That sounds like a cheap hamburger.
Okay, yeah, it's probably more than that.
But this was probably like 10 years ago.
But yeah, Zurich is beautiful.
I remember we took the train to, I want to say Bern, but it's been a long time. Some place where we did some skiing, and it was gorgeous.
Well, Bern is the capital, so likely you went somewhere else for the skiing.
Oh, okay. I think maybe we did Bern one day, and then another day... oh, actually, yeah, we went to Jungfrau, like, it's the tallest point.
Jungfraujoch?
Yeah, yeah, Jungfraujoch, okay, there we go. And then I think somewhere around there, there was this... all I really remember is it was downhill skiing, but it was kind of a very natural downhill. It was literally down this mountain, and it just kind of kept going and going. It was gorgeous.
Yeah, there are so many skiing options here. And yes, the Alps have so many different runs, off-piste and on-piste, and very natural indeed.
Yeah, it's awesome.
So everyone's over there.
So folks are interested in jobs that would be over there.
There's not, like, remote work or anything like that at the moment?
We do have two people doing remote work, but mostly that's when you're working on the integrations, like open source things that we deliver. Most of the core work is pretty much happening here, because it is very dynamic; it's changing so fast that it's very hard to do the coordination. Over time, clearly, that will expand, so we can have specific areas that could be worked on remotely, but today there's too much talking and interaction with the team. So yeah, we like to be close to each other.
How big is the company right now?
We're 15 people, continuously growing. As I said, now we're getting roughly one new person per month.
Wow.
Yeah, so it's nice and growing, and very exciting because of that.
Wow. I mean, so you'll almost double next year.
Yes, and we did double last year. But that's kind of the definition of a startup: if you don't do at least 100 percent year over year, then you're not considered a startup anymore. So you have to.
That's true.
Yeah.
Well, nowadays it's unbelievable.
Actually, you're seeing a lot of IPOs nowadays, like very recently.
But it seemed as though no one was going to IPO.
And so it's like, okay, this company is worth $60 billion.
Is it really a startup?
I mean, come on.
Yeah, exactly.
Yeah. So what about internships? If folks want to come there for a summer, is it kind of too early for that, or are you doing internships?
So what we do: we do a lot of master's student theses here. We usually run at least three at a time. Most of them are from ETH, but we've had some students coming from other schools who officially become part of ETH just for the master's thesis, and they actually do it here. So there are definitely options if people want to do research specifically related to what we do, usually on top of our platform, because that enables quite a lot of interesting topics.
So we do a lot of those.
Cool.
Cool.
That makes sense. So as far as skills, it sounds like you're looking for definitely people who do program analysis.
Also, it sounds like maybe some graph convolution or like some deep learning folks would be useful.
We have some deep learning, yes.
It's for some of the new services that we're preparing for.
It's been happening, so we've had a number of GPUs here running and making noise.
All right.
Cool.
But, yes, so program analysis, most of the back end is in C++, obviously.
But, yeah, pretty much core developers as well, and front end. We have a pretty strong team, so there's always opportunities there as well.
Cool, and what's it like to work at DeepCode?
So what is something that kind of makes DeepCode unique
as far as a place to work?
Oh, well, it's definitely an exhilarating experience.
You can think about it as sprinting
while drinking from a fire hose.
So yeah, there are new things happening daily. You cannot even plan your day easily, because there are always new things.
So internally, again, the team is highly motivated
and then kind of amazing experts in the space.
And they keep on innovating new solutions and methods
that oftentimes even researchers actually spend years on
without much success.
So a lot of those things we don't even have time to publish anymore,
but it is required.
And then from an external perspective, again,
like seeing all the major companies or open source frameworks using us,
it's pretty nice.
I mean, this is pretty much what keeps us working late at night,
as well as having developers, unprompted, saying, hey, that's amazing, I love this, you're building something amazing. So yeah, that makes it lots of fun.
Cool. That's awesome. Yeah, I bet it's
intense in the beginning
because you have to figure out, really, the size of the market you're in.
And so I have a friend who started a company a couple of years back.
And he says it's just huge ups and downs.
It's this rollercoaster rocket ship where you're just getting thrown all over the place.
Yeah.
But it's amazing because you're really doing cutting-edge stuff.
We work very closely with ETH Zurich, so there's a lot of research coming from there, and new things come up all the time, which is great.
That's awesome. Have you thought about... you know, I got something the other day from GitHub saying that one of these old projects that I open-sourced a long time ago has some security vulnerability.
It was actually in a dependency of the project.
And so GitHub emailed me.
And then about a week later, GitHub actually sent me a pull request updating that dependency.
Have you thought about basically you could do almost like an outbound marketing approach
where you just ping random people on GitHub and say, your code has this error, you can fix it, you know?
So we've done small tests, like we've actually filed pull requests in Mozilla and some other places,
specifically in the security space, that's obviously well regarded,
and pretty much all of the interactive projects were accepted.
So yes, we've tested it, and we are indeed looking into making it more scalable as well.
Very cool. That sounds totally awesome.
Yeah, I mean, I'm definitely going to try this.
So actually, one question.
If someone installs the extension, so on my laptop, I do a bunch of hacking on random things for fun,
but I also have a real day job
where they don't want the source code
going out of the off premises, right?
So how can I sort of reconcile that with this extension?
Is there an easy way to turn it on and off?
Should I have two copies of Visual Studio Code?
How would I do that?
So that's a common request from our users. What happens is, for each project that you open, you actually get a pop-up asking, are you okay with this code being transmitted and analyzed? And you can clearly say no for the projects you don't care about.
I got it.
So really, any project. When you first install it, if you have three open projects, you're going to get three separate pop-ups, one for each, saying, do you want to do this?
Same thing if you close one and open a new one; you're going to get it again, because that's a common concern, and it has to be fresh in your mind. You always have to make the decision: do you want to do it or not? You can always fully disable it if you know, like, I'm going to be working only on my proprietary stuff for a day.
Yeah, that makes sense. Very cool. I will check this out. I think Patrick's going to have to wait until you make the C++ version.
Actually, yeah, now you're doing Java, right?
Or is it still C++?
No, mostly C++ still.
Oh, is it?
Okay.
Got it.
So Patrick, I'll ping you later this month.
All right.
Wow, it does move fast.
Wow, that's quick.
Cool.
So tell people how they can reach out to you. These would be students, people who want to get this installed at their workplace. What's a good way to reach out to you? And then also, separately, what's a good way for people to just passively follow what you're doing, maybe on social media or somewhere else?
Yeah, so to reach me, my personal email is boris@deepcode.ai. You can also just go to the deepcode.ai website; there is a live chat there. Well, not always live, but if we are up and running it'll be live; otherwise we'll come back to you over email. On social media, our Twitter handle is DeepCodeAI, same for LinkedIn, and I'd say with Twitter and LinkedIn you'll be getting most of the news about us. We also have a Medium page where our developer advocate pushes a lot of cool articles and examples, like top bugs and specific vulnerabilities. So that's pretty nice.
And now we actually have a YouTube channel for tutorials,
like how do you set up VS Code, for example,
how do you use VS Code?
How do you actually set up the on-premise versions as well?
So all those you can have kind of a quick walkthrough.
So it should answer most of your questions.
Very, very cool.
And you said, just to recap, there's a command line option for folks who aren't using VS Code, one that they would install through apt or something.
That is correct. You can just use the command line; you can literally point it at a folder to analyze all the code there.
Very cool. Thank you so much Boris. This is awesome.
I personally am enriched by this. I'm going to try it out. I'm really excited to see what comes up.
Let me know.
All right, cool, sounds good. We really appreciate your time. And you folks at home, you should check this out; this sounds amazing. If you're on your drive and you didn't catch some of the URLs and things like that, we're going to post it all in the show notes.
So you can just, from iTunes or whatever podcast app,
you can just tap over to the description.
There's a link to the show notes.
You can get all the info from there.
But thanks again, Boris.
This is awesome.
And have fun.
Do some skiing.
And we'll reach out to you after we've tried this out.
Sounds good.
Thank you very much, guys.
Appreciate it.
The intro music is Axo by Binar Pilot.
Programming Throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license.
You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide attribution to Patrick and I and share alike in kind.