Programming Throwdown - Squashing bugs using AI and Machine Learning

Episode Date: February 18, 2020

The best part of hosting Programming Throwdown is reading emails from people who listened to this show before they had any coding experience and went on to land jobs in tech. Thanks so much for inspiring us with your stories. My second favorite part of hosting the show is hearing about so many awesome programming tools and resources, often when they are just starting out. DeepCode is one of these amazing resources. DeepCode is a static analysis tool that looks at your code and, using AI trained on all code in GitHub (!!!), finds common mistakes and offers suggestions on how to resolve them. I am a heavy user of static analysis tools, and yet DeepCode was still able to find real issues in one of my Python projects above and beyond pylint and mypy. Best of all, it's completely free to use for open source projects! Give it a shot and let us know what you think! Show notes: https://www.programmingthrowdown.com/2020/02/episode-99-squashing-bugs-using-ai-and.html

Transcript
Starting point is 00:00:00 Programming Throwdown, episode 99: Squashing Bugs Using AI and Machine Learning, with Boris Paskalev. Take it away, Jason. Hey. This is a really, really fascinating topic. I think there's been a lot of interest around using AI to sort of help developers. We've all had apps, even like super popular apps like the Uber and the Google app and other apps, where they have bugs and they crash and things like that. So even the people who are the best at this have tons of issues. And so it's becoming a really interesting area. And we're so fortunate we have Boris Paskalev, the CEO and co-founder of DeepCode, to kind of sit down and roundtable how we can make software better.
Starting point is 00:01:02 So Boris, why don't you kind of start off by telling us, like, a little bit about your background. You know, what was the idea or the inception of DeepCode and, you know, kind of at a high level, what you guys are doing right now. Sounds good. So, welcome, everyone. Thank you for having me. So, my name is Boris Paskalev. I was originally born in Bulgaria. And I did my bachelor's and master's in computer science in Boston and then worked with a number of different companies as a developer. Later on in my life I did an executive MBA, so I started moving into project management, program management and many different things in that space. And yeah, and I'm still working in technology,
Starting point is 00:01:46 but I'm coding less and less, as you can imagine. So about DeepCode, obviously, the idea is that it started in 2016, and it's a spin-off from ETH Zurich, which is the number one tech university in Europe. I usually call it the MIT of Europe when I make fun with the guys. Nice. But yeah, the idea is that my other two co-founders, Professor Dr. Martin Vechev and Dr. Veselin Raychev,
Starting point is 00:02:14 so they spent about 12 years together researching the topic of learning from big code, so program analysis and how we can really apply powerful machine learning algorithms on top of code. And they've published many, many papers, lots of awards in the space. And after our CTO, Veselin Raychev, finished his PhD, pretty much the idea was clear that we have to start that as a new platform that should revolutionize how software development works over time. So that's really the passion behind it. Cool. So let's unpack a lot of that. So you were a software developer, and then at some point you went to a business school. What sort of inspired that? Oh, that's an interesting question.
Starting point is 00:03:02 I think it's a consequence of many, many events. But slowly, as a software developer, I started kind of project managing the software I was dealing with, kind of delivering it to the end users and kind of working on the end-to-end, trying to actually get specifications from them, understand how we test it, obviously making sure there's less and less bugs, making sure that we fulfill the right requirements. And kind of in my head, slowly,
Starting point is 00:03:30 I started building this overall picture that it's not just the software, but it's an end-to-end process. So I started project managing some of those, then they started giving me larger projects with more people. And then all of a sudden, the project became, hey, build software that actually runs this robotized production line. And this is how I ventured outside of the software-for-software space. And then eventually I just had projects like, can you open our development office in Switzerland? And that's how I kind of moved into non-software projects. And eventually I said, well, to really do that successfully, I need to learn a bit more about that side of the world, and that's where the executive MBA came in. Cool, that makes sense. So, you know, this is something I've always wondered: what do you
Starting point is 00:04:15 think about at what level you no longer need to have a technical background? So for example, does a line manager need to be technical? A line engineering manager? Does the director need to be technical? And at what point do you say, okay, this now has transitioned fully into project management, process management from like a business perspective? That is a very polarized topic. I mean, as a techie, I obviously believe that everybody should have technical understanding and background just because this is where you learn basic logic, let's put it that way. Start thinking more like a machine than a person, to put it that way.
Starting point is 00:04:54 And I think everybody should have a little bit of that. I mean, clearly, when you're going into the arts, etc., there is quite a big range where you might not need that. But for anything in the business world, you need some kind of a technical background, be that pure math, which can actually lead to the same way of thinking. But with that said, there are many exceptions. I had many colleagues that didn't actually have any technical background. One of my very first technical managers, a brilliant developer, actually has a PhD in literature, right? So, yeah. So there are plenty of exceptions to the rule. Yeah, totally.
Starting point is 00:05:33 And we have a lot of folks listening who maybe started listening when the show started, I don't know, Patrick, what, eight years ago or something. And they have all sorts of different degrees and backgrounds. And one of the awesome things about this area is that it's so easy to pick up. You don't need to have a really long apprenticeship. Actually, I was reading something that if you're a hand doctor, it's actually one of the hardest professions to pick up. And so in software you can do it in, you know, I mean, it's a craft, so you have to develop it like any craft, but you can at least get started quickly regardless of your background.
Starting point is 00:06:16 Yeah, I think we just actually hired a junior software developer that had zero background in software development. Like a couple of years ago he said, hey, I just want to be a developer. And he started picking it up. And I think with modern languages and frameworks, it is getting easier and easier to get into it, which is great. And there's a lot of tools that can help you along the way. Cool.
Starting point is 00:06:36 So let's go into DeepCode. So you're talking about these two folks who were doing some academic research on code analysis. And one of the things I think that's hard to grasp for a lot of people is how do you do analysis on code, right? I mean, you can imagine, I guess you have to get rid of all the white space, right? But it's this sort of really complicated language. It doesn't have clear parts of speech. So a lot of NLP techniques, it's not obvious how they would work here. And so how does one even begin to sort of programmatically understand a C++ file?
Starting point is 00:07:21 Like how does that actually work? It is a very, very good question. And the short answer is it is very hard. I bet. But you start with extremely good developers. That's the number one thing. I think both of our other co-founders, they're really like, they're language agnostic.
Starting point is 00:07:40 They can literally do anything. They understand the languages inside out. Because the majority of the problem is program analysis. I mean, it's a pretty old type of thing that it's not very sought after today, but it really kind of gives you the core of what it is. So let me just walk you high level to the process of what it is. So first you start with parsers. So for each language there is a parser.
Starting point is 00:08:01 So we actually use one of the standard parsers, maybe small changes here and there, but that's about it. And then a lot of tools pretty much end up here. They're just minor adjustments because the parser gives you an abstract syntax tree or AST, and then people work on it. But the ASTs are not rich enough. They don't really represent everything that you need in the program,
Starting point is 00:08:23 as you said, very complex. So this is where kind of the magic starts. We actually use proprietary solvers that actually extract every single semantic fact in the program, every single interaction, every single function, every single variable, every single object, and we build the relationship between them in terms of who is calling who, how is it changing, are you casting something to something else, etc. So this actually goes into a graph index that represents all the interactions of the code, and this is
Starting point is 00:08:54 kind of the key piece, right? And this graph index has to be pretty much in a machine-learnable format, right, so then you can apply machine learning on top of it. So that's kind of where most of the magic is: to create a machine learning representation for code, which pretty much does not exist today, or if it exists, as you said, it's treating the code as a string, as text, which obviously drops all the semantic facts that are interesting about code. And after you have this machine learning representation, then it's all about speed and efficiency, to learn from every single fix ever made, every single line of code that exists out there,
Starting point is 00:09:32 and then apply machine learning algorithms to extract specific facts. This actually brings you to have this knowledge base of everything that has happened: how people have fixed different things, is there consensus on how to fix them, are there people fixing it in totally different ways. Then you can actually assign probabilities of what problem is likely to be fixed in a specific way versus in another way.
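Editor's sketch: the first two steps Boris describes (a standard parser producing an AST, then extracting "who is calling who" facts into a graph) can be illustrated with Python's built-in `ast` module. This is only a toy under stated assumptions: `SOURCE` and `call_graph` are hypothetical names made up for the illustration, and nothing here reflects DeepCode's proprietary solvers or its actual graph index.

```python
import ast
from collections import defaultdict

# Hypothetical sample program to analyze; it is parsed, never executed.
SOURCE = '''
def load(path):
    return open(path).read()

def count_words(path):
    text = load(path)
    return len(text.split())
'''

def call_graph(source):
    """Map each function name to the set of names it calls."""
    tree = ast.parse(source)  # step 1: parser -> abstract syntax tree
    graph = defaultdict(set)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        # step 2: walk the function body and record one fact per call site
        for node in ast.walk(func):
            if isinstance(node, ast.Call):
                callee = node.func
                if isinstance(callee, ast.Name):         # e.g. load(path)
                    graph[func.name].add(callee.id)
                elif isinstance(callee, ast.Attribute):  # e.g. text.split()
                    graph[func.name].add(callee.attr)
    return dict(graph)

graph = call_graph(SOURCE)
for caller, callees in sorted(graph.items()):
    print(caller, "->", sorted(callees))
```

Even this small graph records something the raw text does not state directly, namely that `count_words` depends on `load`. A real analyzer would add many more kinds of facts (data flow, casts, aliasing) on top of call edges.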
Starting point is 00:09:56 Got it. Okay, so let me see if I can unpack this. So similar to sort of the grammar parsers, if folks listening remember that from grade school, how you would have the sentence, you know, the man went to the supermarket and you'd have to diagram that sentence. And you would separate out the subject and the verb and you'd draw this little graph to represent that sentence. You can do the same thing with code. And then I guess, but that by itself. So can you explain a little bit what has to happen to turn that syntax tree into a machine learnable format?
Starting point is 00:10:36 So in other words, why couldn't the machines just learn on the syntax tree? What's missing there? So it's missing a lot of the interactions and the depth. For example, you cannot track, say, you have an object, you put it into an array, right, and then you actually get this object out of the array. The abstract syntax tree will not be able to tell you if it's the same object, right? So the idea is that you have to have a much, much, much larger depth, actually tracking every single thing that happens. Like you need inter-procedural analysis, points-to analysis,
Starting point is 00:11:06 type-state analysis, may-versus-must analysis. So there is a very wide range of different things that abstract syntax trees will not be able to give you. And that's why the type of issues or representation you actually make on them are much more
Starting point is 00:11:22 simplistic. And then if you build rules on top of them, then your accuracy level, it's much lower. So you get lots of false positives. Got it. That makes sense. So similar to, you've talked in the past on type inference and how, well, it probably started way before Haskell, but Haskell, I think, is where it started to become popular. And then you saw Scala with it,
Starting point is 00:11:43 and now almost everything has it: TypeScript has it, Python has it. And so what they're doing is they're looking at the program flow, and they're saying, you know, x equals 3, a equals x, and so therefore a is also an integer. And so they're kind of tracing through. And you're saying that you need to do that kind of tracing as a pre-processing step, to actually look at the runtime flow in addition to the syntax, and all of that goes into some,
Starting point is 00:12:14 let's say, connection structure, some type of graph. That is correct. And the key part here is that we're actually doing this statically, right? So we don't have to actually build the code. We don't have to actually, we don't even have to compile it, make sure that it runs. We can actually track this purely by actually understanding the structure.
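Editor's sketch: Jason's "x equals 3, a equals x, therefore a is an integer" example can be traced statically, without building or running the program, which is the point Boris makes here. The snippet below is a deliberately tiny illustration using Python's `ast` module; `SOURCE` and `infer_types` are hypothetical names, and it only handles plain `name = value` assignments, nowhere near a real inter-procedural or points-to analysis.

```python
import ast

# Hypothetical program under analysis; it is parsed but never executed.
SOURCE = """
x = 3
a = x
name = "feet"
b = name
c = a
"""

def infer_types(source):
    """Propagate types through simple `name = value` assignments."""
    types = {}
    for stmt in ast.parse(source).body:
        # Only handle `single_name = <expr>`; anything else is ignored.
        if not (isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            continue
        target, value = stmt.targets[0].id, stmt.value
        if isinstance(value, ast.Constant):
            # A literal tells us its type directly.
            types[target] = type(value.value).__name__
        elif isinstance(value, ast.Name) and value.id in types:
            # `a = x`: a inherits whatever was inferred for x.
            types[target] = types[value.id]
    return types

inferred = infer_types(SOURCE)
print(inferred)
```

The AST alone only says "a is assigned the name x"; the extra pass over the statements is what connects `a` back to the integer literal two lines earlier.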
Starting point is 00:12:33 This is where the program analysis comes in. Got it. Okay, cool. That makes sense. And then at that point, it's just ingesting as much code as you can, looking at zillions of GitHub pull requests that say, you know, fix array out of bounds error. There's probably, you know, 800,000 of those in GitHub. And so you can kind of look at all of those and see what the common structures are.
Starting point is 00:12:56 That is correct, because when you actually convert it to this internal representation that we have, then that becomes purely language agnostic, right? That's kind of the best part. So the only thing that is language specific is the parsing, and then the representation is language agnostic. So we don't care how the developer wrote it. We actually care what it does and how it actually does it.
Starting point is 00:13:16 Got it. Do you look at the variable names, or is that just too much noise? So we actually have metadata that we actually track back to what it is, because that's the idea that when we understand the problem, we can actually point to where it's coming from and what it does. But ultimately, we don't care what the variable name is. It could be anything. Yeah, that makes sense.
Starting point is 00:13:36 I mean, I can imagine just semantic errors. You say miles equals feet times 10 and say, okay, that's an issue. But I feel like, as you said, it's just too hard. There's just too many different names for things. Yeah, so there is a different tool that we have released, and that was during the research years, that is deobfuscation. It actually tries to predict very accurately what is the best name for each variable
Starting point is 00:14:02 because you can de-obfuscate it, so you can actually make it human readable. Yeah, de-obfuscate it. Yeah, so in that state, we actually have pretty good heuristics based on pretty much machine learning based on how a specific variable is used and what it does
Starting point is 00:14:19 and then understand how people name it and then understand what the right name should be. That is super cool. Yeah. Yeah, is that something that people can just try out? Can they just upload a CPP file and see what the deobfuscator... I think that was mainly for Java. It's JSNice and
Starting point is 00:14:36 Nice2Predict. Those are the two tools that actually do that. And they're free to use, so anybody can upload anything there and use it. Oh, very cool. You have to check that out. So you said JSNice? JSNice, yes.
Starting point is 00:14:53 JSNice. Cool. Yeah, we'll add it to the show notes. So, okay, let's spiral back a bit to the sort of product and market here. So how do folks, you know, in general, how do folks find bugs now? Especially, you know, I mean, we've all found bugs, you know, in our school projects and things like that. But when you're in these enormous software engineering teams, so imagine you're building the Amazon app or something. How do developers go about finding bugs in these sort of monolithic projects? Wow, it really depends. I mean, there's ultimately two major ways. The human way, like when you're using humans to do it, and some kind of automated ways. Human way is obviously a range,
Starting point is 00:15:35 like while coding, like yourself, or when you're doing pair programming, you have your counterpart actually say, hey, what the hell is that? So that's standard, right? The peer code reviews, I mean, code reviews are pretty common today. So that's one of the, I think, most common ways
Starting point is 00:15:51 that people will actually identify issues. And then unit and functional testing that you've built or you're building as you're developing. The more, sometimes I call it old school, but it's really actually pretty popular still today, actual QA testing or QA processes. There's many different ones. Lots of them are still human-based.
Starting point is 00:16:11 And then the very final one is your actual customers or users testing it and saying, hey, this broke. What the hell? What should I do with it? So I think this is kind of the range of the human identification of bugs. And there's obviously many automated ways that are actually
Starting point is 00:16:25 coming this way: using actual static analysis, automatic test coverage or test generation, then you have the wider range of formal verification or fuzzing, then there's actually some compliance testing that could be happening. So for a specific industry, there's a specific set of rules that you have to test for, and they can uncover issues. Obviously, there's dynamic analysis as well. That's pretty big these days. And there's many areas that you can catch things by just simulating the runtime of the program, or actually running it live while the users are performing operations. Yeah, that makes sense. I think as you move along that continuum, you add sort of more and more, I guess, risk or more side effects, right? So if a bug makes it all the way out to the end user, it can be totally catastrophic.
Starting point is 00:17:14 Like you break the login page for your app. You might never recover from that. Actually, I know companies who, like small companies that broke the login page and literally never recovered. And you go all the way to the other end of the spectrum where there's a static analysis tool so that before you've even hit save on the file, you know that there's an issue. And so that's obviously anything you can catch there adds huge value to your time and to de-risking the project.
Starting point is 00:17:48 Correct. So ultimately, the belief is yes, the earlier you catch it in the development cycle, it is much cheaper. I think there is an exponential growth. A couple of studies have shown that if you're catching it after it goes to production, it costs you like thousands of times more than actually catching it during development time. So ultimately, most tools are actually focusing to try to get it as early as possible. So the developers not only identify them as they create the problem,
Starting point is 00:18:15 but they actually can learn from the solution because it is fresh in your mind. You say, oh, there is a problem here. Oh, that actually makes sense. Or if you have a symbolic AI that can even help you with the explanation, then it's even easier for the developer to figure out, oh yeah, that is the problem, makes sense, I know how to fix it. And you're pretty much never going to make that mistake again, which is the beauty of it all. Yeah, that makes sense. It's so true. If you have
Starting point is 00:18:38 to fix something especially if you're under the gun because it's affecting production and maybe it's two months after you wrote it, you're probably not going to retain anything meaningful. And then you could just make the same mistake again later. Yeah. Or it's very likely that somebody else had to fix it, not the original developer. Oh, that's true too.
Starting point is 00:18:57 Yeah. So what is the, you know, in your entire career of being an engineer, leading teams, leading projects here and abroad or depending on where you're listening, abroad and then in Europe, what's the
Starting point is 00:19:15 biggest, most intense horror story? What's the bug that really kept everyone up at night? Something that terrified you? That's a fair question. I mean, as a good example, actually, it literally kept us up at night.
Starting point is 00:19:35 We had, this is back in the early days of Vistaprint. It was the holiday season. It was like pretty much like 10 times more volume than a standard day, right? And you have to process millions of orders. So like production went down, like literally, and that's very costly because a huge amount of orders that pretty much cannot go through.
Starting point is 00:19:54 So we actually had like, yeah, pages went up, like we had to wake up like three in the morning plus, and then we went to the office, and I think we spent something like two to three days because what happened is that we actually overflowed one of our production facilities so they could not really take any more orders. So we really had to create the fake production facilities in a matter of like a day. And we didn't know exactly originally where the problem is. It took us some time to actually figure it out.
Starting point is 00:20:20 And then we said, OK, that's a really serious problem. And then we have to continue fixing it. And the fix was pretty bad because we had to make a new production facility. So we spent a couple of days in the office non-stop. It was a quite interesting experience, and we ate a lot of pizza. I'm sure you built some camaraderie, and I'm sure you never want to do it again. That is for sure, yes. Yes, I mean, there was a lot of SQL back then. We had to write a lot of SQL migration scripts to catch that. And even today, that is not the most bulletproof language to do things in. Yeah. I mean, I think, so Patrick, it'd be cool to hear from you too.
Starting point is 00:20:59 But I have a couple. One, we were working with this vendor who was providing basically an internal tool for us. But the internal tool, this is a terrible design decision, the internal tool was using the same fleet of machines as our production site. And basically they deployed the internal tool, and about a month later the entire site went down. And we were trying to figure out what it was. It was really hard to sort of
Starting point is 00:21:33 do some root cause analysis, and what we ended up finding was this one file where, so folks out there might know, in C++, if you append to a string and you do this many, many times, it's actually not a big deal, because C++ will just allocate a big chunk of memory and then start filling it up slowly. But in other languages, and I think this was in Visual Basic, every time you append to a string, it makes a copy of the string. I think Java does this
Starting point is 00:22:06 too. It makes a copy of the string that's big enough for the two strings you're trying to stick together and then puts both of them there. And so this entire file was just thousands and thousands of append string, append string, append string commands. And, you know, the logic was all fine. So, you know, logically, it was doing the right thing. And for a while, it didn't matter. But then the memory fragmentation just eventually causes all the machines to start underperforming. And then it's one of these things where once something starts slowing down, more requests come in, people just start hitting refresh, which then causes it to slow down more, and it blows up. And I just remember spending just so many hours, same thing, just eating pizza.
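Editor's sketch: the difference Jason describes, between appending into a buffer that grows in big chunks and building a brand-new string on every append, can be made concrete with a simple cost model that counts characters copied. This is a model, not a benchmark of any real runtime: `copies_immutable` and `copies_doubling` are hypothetical helper names, and the doubling growth policy is an assumption standing in for "allocate a big chunk and fill it up slowly".

```python
def copies_immutable(n_appends, piece_len=1):
    """Characters copied if every append builds a brand-new string."""
    copied, length = 0, 0
    for _ in range(n_appends):
        length += piece_len
        copied += length  # old contents plus the new piece are copied
    return copied

def copies_doubling(n_appends, piece_len=1):
    """Characters copied if the buffer doubles its capacity when full."""
    copied, length, capacity = 0, 0, 1
    for _ in range(n_appends):
        needed = length + piece_len
        if needed > capacity:
            while capacity < needed:
                capacity *= 2
            copied += length  # existing contents move to the bigger buffer once
        length = needed
        copied += piece_len   # the new piece itself is written once
    return copied

for n in (1_000, 10_000):
    print(n, copies_immutable(n), copies_doubling(n))

# The usual way to sidestep the problem in Python: collect the pieces
# and join them once at the end.
pieces = [str(i) for i in range(5)]
joined = "".join(pieces)
print(joined)
```

In this model the copy-on-every-append cost grows with the square of the number of appends, while the doubling buffer stays proportional to it, which is why a file of "thousands and thousands" of appends can be logically correct and still hurt.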
Starting point is 00:22:53 I think we worked almost like 30 hours straight. We were sleeping in the office. And then finally we figured out it was this internal tool. And then we just deleted it and went home. Nice. Yeah. I don't know that I have any horror stories where I just stayed up 30 hours. I guess that sounds pretentious, not trying to be. I mean, tons of bugs like you mentioned. I mean, we had one where we were dereferencing a pointer that wasn't set correctly in C++, and of course the data you get, it was just sort of garbage data. But it, you know,
Starting point is 00:23:26 mostly worked until it didn't. So it was mostly zeros. And then occasionally, it would not be zeros and would break stuff. And it was very hard to debug because, you know, 99% of the time, it would work correctly, because you would dereference an address that was just set to zeros. But then occasionally, it would dereference an address that would have something else in it, that would be some old, not overwritten data, and then the program would crash. Yeah, I remember. This is a classic.
Starting point is 00:23:52 Yeah, I remember similar things, just like after two hours of runtime, it would just randomly crash. It always comes out something like that. I think actually the hardest things for me lately are just issues with the data. And these are things where I think there's a lot of tooling that still needs to be built. So, I mean, one example I'm thinking of is someone made a bug in a system where they had a day and they needed to convert it to day of week.
Starting point is 00:24:24 And somehow they messed that up. And so there was no Sunday. So basically all the Sundays were set to Saturday. And so there was twice as many Saturdays as you'd expect, no Sundays. And that just caused total havoc in all of the downstream systems. But again, it's not something you can find pretty easily. It's something kind of inherent in the data. Yeah, that's actually one of the most common errors
Starting point is 00:24:52 that we reported in 2019: date-time formatting. There's just like hundreds of different ways to actually make a mistake in those, and obviously they're hard to figure out. And that's why we have a whole category of issues that actually detects that, which is pretty cool. Yeah, totally. I think anytime you can stay in UTC time, just stay in UTC. Just don't print the date unless
Starting point is 00:25:18 maybe at the very end. AM/PM is another big issue in that space. I think I've had myself a couple of issues like this, that you say, okay, it's a datetime, and you forget that you're actually using the 12 hours only, and then you actually don't keep track of AM/PM at all. Yeah, yeah. We have an issue where I think there's something about there's an extra hour.
Starting point is 00:25:42 I don't remember the details. Oh, that's right, yeah. So we have an anomaly detection system, and to this day, twice a year when there's a daylight savings shift, there's more or less traffic than we expect that day, and an alarm goes off.
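Editor's sketch: two of the date pitfalls mentioned here are easy to reproduce with Python's standard `datetime` module. The first part shows a 12-hour format without an AM/PM marker silently losing the afternoon on a round trip. The second shows a day-of-week conversion that folds Sunday into Saturday; the `buggy` function is a hypothetical stand-in, since the actual bug from the story is not described in enough detail to reproduce.

```python
from datetime import date, datetime

moment = datetime(2020, 2, 18, 15, 30)  # 3:30 in the afternoon

# 12-hour clock with no AM/PM marker: parsing the text back yields 03:30.
lossy_format = "%Y-%m-%d %I:%M"
lossy = datetime.strptime(moment.strftime(lossy_format), lossy_format)

# Same format with %p keeps track of AM/PM, so the round trip is exact.
safe_format = "%Y-%m-%d %I:%M %p"
safe = datetime.strptime(moment.strftime(safe_format), safe_format)

print(moment, "->", lossy, "(lossy) vs", safe, "(safe)")

# Two day-of-week conventions that are easy to mix up.
sunday = date(2020, 2, 16)
print(sunday.weekday())     # Monday is 0, Sunday is 6
print(sunday.isoweekday())  # Monday is 1, Sunday is 7

def days_seen(dates, to_day):
    """Distinct day-of-week values produced for a list of dates."""
    return {to_day(d) for d in dates}

week = [date(2020, 2, 10 + i) for i in range(7)]  # Monday through Sunday

def buggy(d):
    # Hypothetical bug: clamps the value, so Sunday (7) becomes Saturday (6).
    return min(d.isoweekday(), 6)

print(len(days_seen(week, date.isoweekday)), len(days_seen(week, buggy)))
```

A cheap data check along these lines (one full week of input should produce seven distinct day values) is the kind of tooling that could flag a "no Sundays" problem before it reaches downstream systems.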
Starting point is 00:26:02 Cool. Sorry, go ahead. I was going to say, with daylight savings, we always had issues that during that week you're going to show up to a meeting and half of the people will not be there, just because they're in a different time zone that's in a different daylight savings. That's always a funny one. Yeah, totally. I'm waiting for the day when they get rid of that. Personally, I like it to be lighter later. And so there was actually, over here in California, they were going to pass a... We should get rid of it, but then in typical politicians' way they said each country decides how they want to implement it. So now each country decides if they want to use it or not, so it's still going to have some countries drag along.
Starting point is 00:26:57 oh geez hey guys i'm gonna jump in for a minute here with a word from this episode's sponsor. We're happy again to have educative.io, an online learning resource. And since the last time we came to you talking about them, they've changed things up a little. They have a brand new option. Instead of buying courses one by one and selecting what you want, they now offer, they still offer that, but they also offer the ability to get a subscription where you can, for the length of your subscription, access any of the courses. So all the same courses we talked about before, you can now access for sort of a flat monthly or per time period fee. And because of their sponsorship, they've agreed to give us at educative.io
Starting point is 00:27:46 slash programming throwdown, 10 off either a single month's purchase or the subscription. That's right, you get 10 off pretty much anything in the store. So I was looking at the courses after we talked last time, and one thing that I guess I missed is that they have a number of courses that are actually just completely free if you want to try them. So they have free previews, you know, little parts of many of their courses, but they also have a number of languages from scratch, so C++ and Python from scratch, and those courses are actually free. So not only are they a great way to learn
Starting point is 00:28:26 or advance your knowledge in those languages, but it's a great way to actually check out the platform first because spending any money, I mean, I don't like spending money. So not spending money and still being able to check something out always alleviates a little bit of the sort of nervousness around doing it.
Starting point is 00:28:43 And this is a great way to check out the platform, these from scratch courses. And there's also another course I found on here that's the practicing for programming interviews. So Jason, you do interviews at your company. Do you recommend people practice programming before they show up to your interview? Yeah, totally. The cool thing about this is, when you're kind of writing in this sort of environment, they kind of give you some setup and then you write some code and then you can even kind of validate
Starting point is 00:29:14 that you've done the right thing. And that kind of loop of writing something, especially in this case, unless you're sort of using an editor and then coming back, but if you're writing it in the editor, you're kind of writing that as if you're writing it on a whiteboard. And then you're getting this instant feedback like, OK, you got it right. You need to try again.
Starting point is 00:29:35 And that's really going to kind of give you that muscle memory that you need so that when you go to a whiteboard interview, you kind of are a little bit more prepared. I think the from scratch courses are really solid. And the other part of it is, you know, it's it's a different way to learn. You know, there might be some folks out there who can do sort of the lecture thing. For me, it's it's it never really resonated with me. Something like this where it's hands on is really good for people like me. And it's awesome that they have this free course. So even if you're, you know, a Python guru, um, but, but you think this might be a good way to learn something else, uh, you know, you could dive through the Python course. You can pick a different language and dive through that, uh, totally for free. And then if that looks like the kind of,
Starting point is 00:30:21 you know, learning model that's really going to work for you, um, you know, then you kind of know that without having to spend any money. They didn't really. I was trying to think about it. I guess when I was learning to program, I mostly did it from, like, books before even really, like, getting on the Internet or there being Internet. I mean, I guess there probably was, but we didn't really use it. So the closest I could think of to this was, did you ever do vimtutor? No, I didn't. Oh, so if you ever, you know, have accessed Vim, one of the things that it will recommend, or, I guess vi is the same thing, I'm not really good about the difference between the two. Um, so if you open Vim, sometimes the bottom would be like, I forget, there's some command you can type and it will
Starting point is 00:31:01 basically walk you through, like, a document which also tells you, like, how to edit, and you, like, move up and down in the document, and it's sort of almost like interactive fiction about learning how to use Vim. Oh yeah, Emacs has something similar. Oh, okay. And I was thinking, like, actually this kind of is like that, but, you know, of course, like, 100 times better or whatever. But yeah, you know, when you originally said that, I thought you were thinking about, um, they actually used to have these, um, I keep wanting to say Choose Your Own Adventure. They're about the same form factor as those Choose Your Own Adventure books, but it's not, you read it, you know, left to right. Um, but you get to a point and there's basically like a little puzzle. And, you know, you're supposed to like solve the puzzle.
Starting point is 00:31:46 And then somehow there's some way to like validate you got it right. And then you keep reading. And so the idea is you have to, it's kind of on the honor system. But you would be reading, this is just like a paperback book. And then it would be like, oh, put in this basic code. And then edit it to solve the puzzle. And that's basically how I got started. Yeah. So, I mean, I think this, you know, format is, you know, obviously catching on in a lot of different
Starting point is 00:32:11 places, and it's really awesome to see. Um, this is a learning resource I think will really resonate with people: the ability to be almost anywhere and be able to interactively, you know, read and write the code, not necessarily just, like, listening to a lecture and trying to figure out, like, how do I make it go 50% faster, or how fast can I go before I can't understand the person anymore. Uh, not that I've ever done that before. Um, but yeah, I think this is a great forum for learning. Yeah, this is awesome. Guys, check it out: educative.io slash programming throwdown. That's going to get you a 10% discount. That's also going to let them know that, um, you know, this was a good slot for them, that they were a good sponsor for us. So it
Starting point is 00:32:51 helps us out. Um, it also helps you out. And, uh, check it out. Also let us know, you know, send us comments, send us email, uh, let us know what you think of the courses, and, you know, we can relay that back to the folks at Educative and give them feedback so they can keep improving. All right. Thanks for the sponsorship. And let's get back to interviewing. All right. I want to talk more about how DeepCode works.
Starting point is 00:33:16 We talked about it from the technical side. But as a developer, what's sort of that experience like? So there are multiple ways to use DeepCode. The idea is we have a public API and a command line interface, so you can really hook it up anywhere you want. But the most standard usage, as we spoke about earlier, is as part of the IDE. So we've already released VS Code and Atom plugins, so you can directly just get the suggestions as you code. The next stage is obviously after you commit the code, and we actually have integrations with, uh, GitHub, Bitbucket, and GitLab, uh, where we have a bot that will automatically comment on pull requests and tell you, hey, you guys are doing these new things, and by the way,
Starting point is 00:34:01 you're introducing this issue, please look at it. So in this case, it gives you a diff analysis on the new things. Obviously, at the latest stage, you can actually add it to any CI or CD pipelines or QA, let's put it that way. So it really depends on the workflow that you're actually working in, but you can use it pretty much anywhere. That makes sense. So from a usability standpoint, it sounds similar to other static analysis tools like
Starting point is 00:34:30 ShellCheck or Pylint or Flake8 or one of these things. I mean, yeah, from the workflow perspective, that's absolutely the case. The big difference is that we actually run pretty much in real time, so we actually complete the analysis in, like, seconds, like one second for an average piece of code, or even less. So that allows you to actually have the IDE piece, because with a lot of linters, et cetera, when they run in the IDE, it takes time to actually get results. So speed is one of our main focuses.
Starting point is 00:35:00 Everything has to be real time for the experience to be correct and for the people not to say, okay, I have to wait now for 15 minutes to get my results. Hopefully, I get an e-mail. Great. The amount of time you save is wasted by waiting. Yeah. Are you running some machine learning in the Visual Studio Code extension, or are you making an RPC with the code to a server?
Starting point is 00:35:28 Yeah, it's a server-side analysis. So pretty much we have a cloud server that can do that. For larger companies, you can also have the on-premise server that actually runs it locally if you're fully behind a firewall, for example. Got it. Okay, that makes sense. So if you, and we'll talk about this, actually we should talk about the pricing and all that later, but if you're a student or you're working on a hobby project,
Starting point is 00:35:52 then you just turn this on, you're good to go. If you're working at, you know, if you're working at like a military research lab or something, then maybe, you know, it might not be the best idea to turn this on. And so what you want to do is reach out to Boris and get a contract for your company and get something on premises. That is correct.
Starting point is 00:36:13 And then when you install it in the IDE, you can just select which server you actually want to get the analysis happening on, either the cloud-based, SaaS, or your own. Cool, that makes sense. And so what languages are supported? Is it pretty much everything or are you starting with a few key languages?
Starting point is 00:36:31 So we started with kind of the top languages. So we have Java, JavaScript, TypeScript, Python. So this is live today. This month we're going to release C and C++, very likely. Hey, hey, hey, good news. Good news for all. And yeah, we're going to be adding a couple more languages this year. Pretty much our main focus has been to kind of perfect, making sure that we really rock on all languages that are live,
Starting point is 00:36:56 rather than just add every single language out there and be just average. And we built the platform architecture to do that. So now adding new languages is also extremely fast for us, which will enable the addition of new languages as well. Plus, at some point, there is the vision to open source it so people coming up with new languages, which happens a lot these days, can just add it as well.
Starting point is 00:37:19 Cool, that makes sense. So anything that's machine learning-based, you have to sort of deal with these four possible outcomes, right? You know, so true positive means, you know, a person made a mistake they've seen a thousand times. You tell them about it. They fix it. That's awesome. True negative is, you know, everything else that your program is not telling people about, or, you know, things that are fine, right? Um, but let's talk a little bit about the false situations. So, you know, if your program tells someone to fix something but then they kind of disagree, what do they do there?
Starting point is 00:37:58 Is there a way to give you feedback in the IDE, or do they turn that one off? Is there, like, a, uh, you know, slash slash slash ignore or something that they can put? Absolutely. So yeah, we definitely have slash slash deep code ignore or this ignore, which will pretty much ignore the specific instance in the line before or after. And you can actually highlight
Starting point is 00:38:18 which specific issue you're ignoring, because sometimes you can have more than one on a single line of code. And same thing, you can actually provide feedback. So the way our platform is built, once you provide feedback, we can very quickly, I'm talking within minutes, adjust the knowledge base and either kind of expand the rule or split it into multiple rules so it actually gets accurate. So that's, uh, the beauty of the platform: you can very quickly and very easily ingest this user feedback and kind of really push the fix to everybody else immediately. Cool. That's so awesome.
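The ignore mechanism described here can be pictured with a minimal Python sketch. It assumes a hypothetical marker of the form `// deepcode ignore <RuleName>` (the exact syntax DeepCode accepts is not spelled out in the episode) and only checks the flagged line and the line directly above it. The rule names are made up for illustration.

```python
import re

# Hypothetical marker format, used for illustration only:
#   // deepcode ignore <RuleName>
IGNORE_RE = re.compile(r"(?://|#)\s*deepcode ignore\s+(\w+)")


def filter_findings(source_lines, findings):
    """Drop findings suppressed by an ignore comment on the same or previous line.

    findings is a list of (line_number, rule_name) tuples, 1-indexed.
    """
    kept = []
    for line_no, rule in findings:
        # Look at the flagged line and the line directly above it.
        candidates = source_lines[max(line_no - 2, 0):line_no]
        suppressed = any(
            match.group(1) == rule
            for text in candidates
            for match in IGNORE_RE.finditer(text)
        )
        if not suppressed:
            kept.append((line_no, rule))
    return kept
```

Because the marker names a specific rule, a second issue on the same line still gets reported, which matches the point about a single line carrying more than one issue.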
Starting point is 00:38:50 How are the errors surfaced? So can you kind of walk me through that? So the machine learning system says this, I guess, syntax flow tree, we've seen this pattern before and we don't like it. How does that turn into something actionable in sort of an English format where someone can read that and grok it? So this is one of the kind of nice add-on engines that we have, which gives you kind of semantic explanations of the problem. This is where the symbolic AI comes in. Uh, so as we have metadata on the original piece of the code, we can kind of
Starting point is 00:39:30 define gaps and say, hey, the function that you're calling, and we kind of tell you which function, if you actually hover with the mouse over it, it will actually show you the line of code and the function that you are calling, right, is getting user input, right? And you can actually see what the object with the user input is. And this input flows into some kind of an execution on the back end, for example, and when you hover over it, we'll show you where it is. And we can tell you, hey, this user input is not being sanitized by the time it gets to the execution, uh, which can actually cause some kind of a denial of service or path traversal or many different undesirable things executing on your backend.
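To make the "user input is not sanitized by the time it gets to the execution" case concrete, here is a small Python sketch of a path traversal problem and one possible fix. The directory name and both function names are hypothetical, and this is only an illustration of the bug class, not output from DeepCode.

```python
import posixpath  # POSIX path rules, so the example behaves the same on any OS

BASE_DIR = "/srv/app/uploads"  # hypothetical directory the app may serve files from


def resolve_unsafe(user_input):
    # User input reaches the file path with no sanitization step.
    return posixpath.join(BASE_DIR, user_input)


def resolve_safe(user_input):
    # Sanitizer: normalize the path, then refuse anything outside BASE_DIR.
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, user_input))
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes the upload directory")
    return candidate
```

With an input such as `../../etc/passwd`, the unsafe version builds a path that climbs out of the upload directory, while the safe version raises an error before the path is ever used.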
Starting point is 00:40:11 But that's the idea. So we actually have a human-readable explanation that semantically points to each object of the problem. And you can use a code base to point you through it so you can actually walk it through and understand the flow. So am I right to say there's sort of this machine learning step where you are doing this almost unsupervised problem of looking at GitHub PRs and trying to find patterns of mistakes, right?
Starting point is 00:40:42 But then to convert that into some sort of like slot-based language generation, there has to be like a person in the loop there, right? So is there someone on your side who's looking at the most common errors and then trying to find a symbolic representation of that? Yeah, so we have a semi-supervised learning on this second part,
Starting point is 00:41:04 which allows us to create categories of the problems, right? So we have a data flow category, for example, saying, hey, you have a user input and it flows into something and it's not sanitized. So what happens here: we define this category, uh, and then with a very small seed of examples, right, um, our machine learning actually identifies thousands of objects that are actually doing the same thing. So you provide three examples, or 10, 20 examples, of user input functions. And then our core automatically says, hey, there are actually 6,000 different user inputs here that I found. So that becomes your category with 6,000 potential user inputs, right? Then you
Starting point is 00:41:42 actually say, what are the things or where something actually can get executed on the backend, and that becomes the category of those problems. And then you look for sanitizers, and you can have like thousands of, or hundreds of thousands of sanitizers. And any combination of those three can actually lead to a data flow problem, right? So with very little effort from a user,
Starting point is 00:42:03 you end up having with pretty much millions of combinations of problems that you detect. And this is kind of the benefit that you don't have to write individual rules. Instead of writing a million rules, within five minutes, you actually created a category that actually represents millions of potential problems. Oh, that's super interesting. So I think it sounds a little similar to sort of these lookalike, you know, fingerprint models that they do for spam detection. So, for example, someone will mark an email as spam. And so then what will happen is a system will add that to a list of seed spam emails. And then afterwards, there's some lookalike system that says, does this new email share a lot of the same properties as my spam email? If so, then let's sort of add it to the spam category.
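The source, sink, and sanitizer categories described above can be sketched in a few lines of Python. The set members below are hypothetical stand-ins (in the episode, the sets are grown automatically from a small seed), and the flow check is a deliberately simplified model, not DeepCode's analysis.

```python
# Hypothetical category members; in the episode these sets are grown
# automatically from a small seed of examples.
SOURCES = {"request.args.get", "input"}          # functions returning user input
SINKS = {"os.system", "open"}                    # places where data gets executed or used
SANITIZERS = {"shlex.quote", "secure_filename"}  # functions that clean the data


def is_data_flow_problem(flow):
    """flow is the ordered list of function names a value passes through."""
    tainted = False
    for name in flow:
        if name in SOURCES:
            tainted = True
        elif name in SANITIZERS:
            tainted = False
        elif name in SINKS and tainted:
            return True
    return False


def covered_combinations():
    # Each source/sink pair is a problem the single category can describe,
    # without anyone writing a rule for that specific pair.
    return len(SOURCES) * len(SINKS)
```

The design point is that one category definition stands in for every source and sink pairing, so growing either set multiplies the number of detectable problems without any new hand-written rules.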
Starting point is 00:42:59 And so it's almost like an active learning type thing. So in your case, someone says, hey, you know this, or from the GitHub pull request, you can infer that passing the string directly into this function was not, you know, sanitary. And so here's a pull request that adds a sanitization wrapper around the string. And you can seed sort of, let's say, a new problem with that,
Starting point is 00:43:29 with a few of those examples, and then do some lookalike type thing to find, as you said, like all of the adjacent pull requests that are most similar in nature to that one. And then if that process gave you pretty high signal, so there wasn't a lot of false positives, then you say this is sort of like a good concept. And then you can roll that out. Yeah.
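The lookalike idea in this analogy can be sketched with a simple set-overlap score. This is an assumption-heavy illustration: the feature names, the Jaccard similarity measure, and the 0.5 threshold are all made up here, and nothing in the episode says DeepCode works this way.

```python
def jaccard(a, b):
    """Share of features two items have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def lookalikes(seeds, candidates, threshold=0.5):
    """Return candidate names whose features resemble at least one seed."""
    return [
        name
        for name, features in candidates.items()
        if any(jaccard(features, seed) >= threshold for seed in seeds)
    ]
```

For example, a seed pull request described by the features `adds_wrapper`, `string_arg`, and `exec_call` would match a candidate that shares all three plus one extra feature, and would not match a docs-only change.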
Starting point is 00:43:56 And the add-on point to this is that while you create a category, since we have the knowledge base and the history of Git in memory, let's put it that way, it automatically tells you how many people have fixed such problems, how many people are vulnerable to this problem today, and you get this real time pretty much. So as you kind of create a category, you automatically see, is that an important category? And you can even look into the vulnerabilities and say, okay, am I seeing any false positives or not? So you can very quickly check if it's accurate category
Starting point is 00:44:28 and it should be pushed to production. That allows, again, within 10 minutes, you can actually create millions of rules compared to what a current rule-based system looks like. Very cool. So can you tell me a little bit about the scale on the machine learning side? Like you're ingesting just an
Starting point is 00:44:45 unbelievable, it sounds like just an unbelievable amount of data. Um, do you have sort of like web mirrors running 24/7 scraping GitHub? I mean, is this something that runs on thousands and thousands of machines in the cloud? Um, give me an idea roughly of the scale and the scope of this effort. So that's the beauty of it all. We have built very lean pipelines and we don't have to scrape. We've obviously read GitHub once, but once you actually receive the repository history
Starting point is 00:45:18 and convert it into our internal representation which is considerably smaller than the GitHub representation then you actually just have to look at the new changes into this repository internal representation, which is considerably smaller than the GitHub representation, then you actually just have to look at the new changes into this repository. So we don't have to kind of re-grab everything else, right? And then our internal representation is pretty efficient. So we're talking, let's say, for all of Java, you're going to be talking like a couple of terabytes.
Starting point is 00:45:40 It's not going to be much larger than that. And analyzing, let's say, all the Java codes out there, it will take you less than a day. Everything happens quick in a run. Wow, so less than a terabyte for all Java code on GitHub. Yeah, and this is not only the tip, this is all the history. What we do, we pretty much use in memory,
Starting point is 00:46:02 we have a memory index, a semantic index of every single version of every single repository out there in GitHub. Per language, obviously, we do it. Wow, that's amazing. I mean, that's just a treasure trove of information. That's what you need in order to actually extract the knowledge of the development community automatically, right? Otherwise, if that's not efficient and if you cannot index it and grab it really fast, then you're kind of dead in the water, right? It would take, like, months to actually figure anything out.
Starting point is 00:46:32 Yeah. That's one of the reasons why we chose to do static analysis without requiring to build or compile anything, because clearly if you want to compile something from, like, five years ago, like, it's unlikely you'll be able to actually build it on your own.
Starting point is 00:46:46 Yeah. Yeah. And then the time too, to build all of that code is extraordinary. Cool. So tell me a little bit about, well, first of all,
Starting point is 00:46:56 so folks who are listening, who are, you know, in college and high school, or they're working on open source projects, what is available for them, and what's the price like? Everything for free is the short answer. So we are 100% free for anything open source.
Starting point is 00:47:14 Our motto is pretty much, hey, we are learning from the open source community, pretty much everything. We want to actually encourage the open source community to get better, be more dynamic, because we'll learn even more. So it's fully free. So anything in the Cloud, GitLab, Bitbucket, GitHub, it's free. You just log in with your account for the Git and all your repositories you can scan,
Starting point is 00:47:35 you can scan anything open source. For educational purposes, the same. We've actually worked with a couple of universities already, even some of them are developing open source add-ons, which is great as part of their master thesis. Nice. And I think the first one that started doing that was San Jose State University. So we're very happy with the collaboration there. And it is free for them because that's the idea.
Starting point is 00:48:01 We believe that students, they start using things and over time, as they go into the industry, they say, hey, I actually need that commercially. And our tool is pretty good for people that are learning things. That's the main thing
Starting point is 00:48:14 because we explain the problems, we provide examples of how other people have fixed it in totally different settings. So it's a great learning tool as well. And actually, we're also free for small companies, like if you have less than 30 developers, pretty much
Starting point is 00:48:28 you can just start using it and nobody will ever bother you. Cool, this is awesome. We've been super fortunate to have you on the show, have ZenHub on the show, and a few other folks on the show, CircleCI on the show, and almost all of these,
Starting point is 00:48:44 actually every one of these products is free for students. So if you're doing, let's say, a senior design project, you're a CS student or a computer engineering student, you have a senior design project, and it's something that's going to take you three, four, or five months from start to finish. So it's not a trivial project. You're working with a team. You could have continuous integration. You can have some project management. And you can have the DeepCode static analysis totally for free.
Starting point is 00:49:15 And it's really going to give you students out there a feel of what it's like to work on a team in industry, which is really amazing. That's what allows students to get into software development much easier these days, because all the tooling helps them be faster. Yeah, totally. I'm sure this is personalized, but just at a high level, how does it work for businesses?
Starting point is 00:49:42 Let's say someone works at, not a, let's say, bigger than a 30-person business, but not a huge, giant conglomerate. So they work in, say, a mid-sized business, Macy's or something, as an example. How can they go about getting DeepCode
Starting point is 00:49:58 on-premises? And what does that kind of look like? So the on-premise stuff, it happens through a Docker container. So you get your custom Docker container. It takes about 10 to 15 minutes to set up. And then you have your own DeepCode on-premise server.
Starting point is 00:50:17 And then you have the exact same integration, the exact same benefits as the SaaS option. And the cost there ranges depending on your size, I would say between $30 and $50 per developer per month. Clearly, we offer a 30-day free trial so people can experience it, test it with their code base, see what it is before they have to purchase. So there is a risk-free trial there as well. Cool. Yeah, that makes sense. So now we'll jump into a little bit about the company. So is DeepCode in, you mentioned ETH Zurich, is DeepCode based out of there, the company? Yeah, we are literally five minutes away from the Zurich ETH main campus.
Starting point is 00:51:06 So we are a Zurich-based company. Cool, yeah, I've been there before. It's very expensive, if I remember correctly. But it's beautiful. Yeah, I went there for a conference one time. And yeah, I'm probably going to get this wrong. You can correct me. But I think I just got a hamburger and it was like $8 or something.
Starting point is 00:51:27 That sounds like a cheap hamburger. Okay, yeah, it's probably more than that. But this was probably like 10 years ago. But yeah, Zurich is beautiful. I remember we took the train to, I want to say Bern, but it's been a long time. Some place, and we did some skiing, and it was gorgeous. Well, Bern is the capital, so likely you went somewhere else for skiing. Oh, okay, okay. I think maybe we did Bern one day and then another day, oh, actually, yeah, we went to Jungfraujoch,
Starting point is 00:51:58 like, it's the tallest point. Jungfraujoch, yeah. Yeah, Jungfraujoch, okay, there we go. And then I think somewhere around there, there was this, all I really remember is it was a downhill skiing, but it was kind of a very natural downhill. And it was literally down this mountain, and it just kind of kept going and going. It was gorgeous. Yeah, there's so many skiing options here. And yes, the Alps are like, yeah, so many different, like off-piste, non-piste, and long-piste, and very natural indeed.
Starting point is 00:52:30 Yeah, it's awesome. So everyone's over there. So folks are interested in jobs that would be over there. There's not like remote work or anything like that at the moment? We do have two people that are doing remote work, but mostly this is when you're working on kind of the integrations, like open source things that we actually deliver. Most of the core is pretty much happening here. Because it is very dynamic, it's changing so fast that it's very hard to do the coordination. Over time, clearly, that will expand, so we can have specific areas that could be, uh, worked on offline. But today there's too much talking, interactions
Starting point is 00:53:12 with the team. So yeah, we like to be close to each other. How big is the company? Right now we're 15 people, continuously growing. As I said, now we're getting roughly about one new person per month. Uh, wow. Yeah, so it's, uh, it's nice and growing and very exciting because of that. Yeah, wow. I mean, so you'll almost double next year. Yes, and we did double last year. But that's kind of the definition of a startup: like, if you don't do, like, 100 percent at least year over year, then you're not considered a startup anymore. So you have to. That's true. Yeah. Well, nowadays it's unbelievable.
Starting point is 00:53:49 Actually, you're seeing a lot of IPOs nowadays, like very recently. But it seemed as though no one was going to IPO. And so it's like, okay, this company is worth $60 billion. Is it really a startup? I mean, come on. Yeah, exactly. Yeah. So what about internships? If folks want to come there for a summer, is it kind of too early for that, or are you doing internships? So what we do, we do a lot of
Starting point is 00:54:14 master's student theses here. We usually run at least three at a time. Most of them are from ETH, but we've gotten some students that are coming from another school that officially become part of ETH just for the master's thesis and they actually do it. So there are definitely options if people want to do research specifically that's related to what we do and usually on top of our platform because that enables quite a lot of interesting topics. So we do a lot of those. Cool. Cool. That makes sense. So as far as skills, it sounds like you're looking for definitely people who do program analysis.
Starting point is 00:54:51 Also, it sounds like maybe some graph convolution or like some deep learning folks would be useful. We have some deep learning, yes. It's for some of the new services that we're preparing for. It's been happening, so we've had a number of GPUs here running and making noise. All right. Cool. But, yes, so program analysis, most of the back end is in C++, obviously. But, yeah, pretty much core developers as well, front end as well.
Starting point is 00:55:23 We have a pretty strong team, so there's always opportunities there as well, front end as well. We have a pretty strong team, so there's always opportunities there as well. Cool, and what's it like to work at DeepCode? So what is something that kind of makes DeepCode unique as far as a place to work? Oh, well, it's definitely an exhilarating experience. You can think about it as sprinting while drinking from a fire hose.
Starting point is 00:55:47 So yeah, there's daily new things happening. You cannot even plan your day easily enough because there's new things. So internally, again, the team is highly motivated and then kind of amazing experts in the space. And they keep on innovating new solutions and methods that oftentimes even researchers actually spend years on without much success. So a lot of those things we don't even have time to publish anymore,
Starting point is 00:56:08 but it is required. And then from an external perspective, again, like seeing all the major companies or open source frameworks using us, it's pretty nice. I mean, this is pretty much what keeps us working late at night, as well as having developers kind of unpromptedly saying, hey, that's amazing. I love this.
Starting point is 00:56:31 You're building something amazing. So that's yeah, that makes it lots of fun. Cool. That's awesome. Yeah, I bet it's intense in the beginning because you don't, you have to figure out really like the size of the market you're in.
Starting point is 00:56:47 And so I have a friend who started a company a couple of years back. And he says it's just huge ups and downs. It's this rollercoaster rocket ship where you're just getting thrown all over the place. Yeah. But it's amazing because you really like cutting edge stuff. We work very closely with ETH Zurich, so there's a lot of research coming from there, and new things come up all the time, which is great. That's awesome. Have you thought about, um, you know, I got something the other day
Starting point is 00:57:18 from GitHub saying one of these old projects that I open sourced a long time ago, um, has some security vulnerability. It was actually in a dependency of the project. And so GitHub emailed me. And then about a week later, GitHub actually sent me a pull request updating that dependency. Have you thought about basically you could do almost like an outbound marketing approach where you just ping random people on GitHub and say, your code has this error, you can fix it, you know? So we've done small tests, like we've actually filed pull requests in Mozilla and some other places,
Starting point is 00:57:54 specifically in the security space, that's obviously well regarded, and pretty much all of the interactive projects were accepted. So yes, we've tested it, and we are indeed looking into making it more scalable as well. Very cool. That sounds totally awesome. Yeah, I mean, I'm definitely going to try this. So actually, one question. If someone installs the extension, so on my laptop, I do a bunch of hacking on random things for fun, but I also have a real day job
Starting point is 00:58:25 where they don't want the source code going off premises, right? So how can I sort of reconcile that with this extension? Is there an easy way to turn it on and off? Should I have two copies of Visual Studio Code? How would I do that? So that's a common request by our users. So what happens for each project that you open,
Starting point is 00:58:43 you actually get a pop up saying, are you okay for this code to be transmitted and analyzed? And you can clearly say no for the project you don't care about. I got it. So really, any project, so when you first install it, you have three open projects, you're going to get three separate pop ups for each one saying, do you want to do this? Do you want to do this? Do you want to do this?
Starting point is 00:59:02 Same thing if you close and open a new one, you're going to get it again, because that's, yeah, a common concern, and, uh, it has to be fresh in your mind. You have to always make the decision, saying, do you want to do it or not. Uh, you can always clearly disable it fully if you know that, like, I'm going to be working only on my proprietary stuff for, uh, for a day. Yeah, that makes sense. Very cool. Yeah, I will check this out. I think Patrick's gonna have to wait until you make the C++ version. Actually, yeah, now you're doing Java, right? Or is it still C++? No, mostly C++ still.
Starting point is 00:59:33 Oh, is it? Okay. Got it. So Patrick, I'll ping you later this month. All right. Wow, it does move fast. Wow, that's quick. Cool.
Starting point is 00:59:42 So tell people how they can reach out to you. So these would be, you know, students, people who want to get this installed at their workplace. What's a good way to reach out to you? And then also, what's a good way separately for people to just kind of passively follow what you're doing, maybe on social media or somewhere else? Yeah, so reaching me, my personal email is boris at deepcode.ai. Uh, you can also just go to the deepcode.ai website. There is a live chat there, well, not always live, but if we are up and running, it'll be live; otherwise we'll come back to you over email. Um, on social media, obviously, our Twitter handle is deepcodeai, uh, same for LinkedIn, and I'd say with Twitter and LinkedIn you'll be getting most of the news about us. Uh, we also have a Medium page where our developer advocate actually pushes a lot of cool articles and examples, like top bugs, specific vulnerabilities. So that's pretty nice. And now we actually have a YouTube channel for tutorials,
Starting point is 01:00:49 like how do you set up VS Code, for example, how do you use VS Code? How do you actually set up the on-premise versions as well? So all those you can have kind of a quick walkthrough. So it should answer most of your questions. Very, very cool. And you said, just to recap, there's a command line for folks who aren't using VS Code.
Starting point is 01:01:07 There's a command line option that they would install through app or something. That is correct. You can just use the command line, and you can literally point to a folder to analyze all the code there. Very cool. Thank you so much, Boris. This is awesome. I personally am enriched by this. I'm going to try it out. I'm really excited to see what comes up. And let me know. All right, cool. Sounds good. We really appreciate your time. And yeah, you folks at home, you should check this
Starting point is 01:01:36 out. This sounds amazing. If you're on your drive and you didn't catch, uh, some of the, you know, the URLs and things like that, we're going to post it all in the show notes. So you can just, from iTunes or whatever podcast app, you can just tap over to the description. There's a link to the show notes. You can get all the info from there. But thanks again, Boris. This is awesome.
Starting point is 01:01:56 And have fun. Do some skiing. And we'll reach out to you after we've tried this out. Sounds good. Thank you very much, guys. Appreciate it. The intro music is Axo by Binärpilot. Programming Throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license.
Starting point is 01:02:26 You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide attribution to Patrick and I and share alike in kind.
