The Changelog: Software Development, Open Source - source{d} turns code into actionable insights (Interview)

Episode Date: January 16, 2019

Adam caught up with Francesc Campoy at KubeCon + CloudNativeCon 2018 in Seattle, WA to talk about the work he's doing at source{d} to apply Machine Learning to source code, and turn that codebase into... actionable insights. It's a movement they're driving called Machine Learning on Code. They talked through their open source products, how they work, what types of insights can be gained, and they also talked through the code analysis Francesc did on the Kubernetes code base. This is as close as you get to the bleeding edge and we're very interested to see where this goes.

Transcript
Starting point is 00:00:00 Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com. We move fast and fix things here at ChangeLog because of Rollbar. Check them out at Rollbar.com. And we're hosted on Linode cloud servers. Head to Linode.com slash ChangeLog. This episode is brought to you by Linode, our cloud server of choice. It is so easy to get started with Linode. Servers start at just five bucks a month. We host ChangeLog on Linode cloud servers and we love it. We get great 24-7 support. Zeus-like powers with native SSDs. A super fast
Starting point is 00:00:32 40 gigabit per second network and incredibly fast CPUs for processing. And we trust Linode because they keep it fast. They keep it simple. Check them out at linode.com slash changelog. From Changelog Media, you're listening to the Changelog, a podcast featuring the hackers, the leaders, and the innovators of software development. I'm Adam Stachowiak, Editor-in-Chief here at Changelog. Today, I'm at KubeCon, CloudNativeCon, talking to Francesc Campoy. We talk about the work he's doing at source{d} to apply machine learning to source code and turn that code base into actionable insights. It's a movement they're driving called machine learning on code.
Starting point is 00:01:19 We talk through their open source products, how they work, what types of insights can be gained, and we also talk through the code analysis Francesc did on the Kubernetes code base. This is as close as you get to the bleeding edge. And I'm very interested to see where this goes. You're at source{d} now. Give me the breakdown of what source{d} is. Is it formed around open source? Everything right now is open source, right? So nothing you have
Starting point is 00:01:48 is a paid product at least yet. Everything we do is open source. We are working on an enterprise edition for one of the products, and basically the whole idea is that everything will keep on being open source except for one thing that allows our product to work distributed. So to back up a little bit and give a little bit of context, what we do at source{d}, our tagline is machine learning for large scale code analysis. OK. I like that.
Starting point is 00:02:17 I like that tagline. We worked on that tagline for quite a while. I'm happy you like it. It's succinct, and it gets the message across. Yeah. That's the whole point. The whole idea behind it is when you're writing code, normally people think about the fact that you write code,
Starting point is 00:02:33 then you build it, and you ship something. And what you ship is what matters, right? Source code is just a way to get there. And what we realize is that actually it's a huge and very, very deep source of data. When you have a Git repository, you can actually see what's happened there since the beginning of time until now. You can actually analyze trends. You can see so much stuff in there.
Starting point is 00:02:58 So what we did is create this engine product that basically provides a SQL interface, so you can find things in a Git repository. You can do things like find commit messages with this text or whatever. But you can actually go even deeper than that and go into, actually, I want to see the content of the file. I want to parse it. I want to extract the function names. I want to extract the strings or whatever, right?
Starting point is 00:03:23 So there's a bunch of different projects that make this possible. And basically, every single one of those projects is completely open source. And I created a product which is called the Engine, which is putting all those together in a nice way to use. Little binary, you get started, and everything just works.
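(The SQL-over-Git idea Francesc describes can be sketched in miniature. A real deployment would query the engine's Git-backed tables over its MySQL interface; here an in-memory SQLite table with an invented schema stands in for that, purely for illustration.)

```python
import sqlite3

# Toy stand-in for the engine's commit data; the real engine exposes
# actual Git history over a MySQL-compatible interface, and this
# schema is invented for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("a1", "alice", "fix: handle nil pointer in scheduler"),
        ("b2", "bob", "add e2e tests for volume plugin"),
        ("c3", "alice", "fix: typo in README"),
    ],
)

# "Find commit messages with this text", as plain SQL.
rows = conn.execute(
    "SELECT hash, message FROM commits "
    "WHERE message LIKE '%fix%' ORDER BY hash"
).fetchall()
for commit_hash, message in rows:
    print(commit_hash, message)  # matches a1 and c3
```

(The point of the real engine is that tables like this are not loaded by hand; they are materialized from the Git repositories themselves.)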
Starting point is 00:03:41 And then the other side of things that we're doing, so that is what we call the code as data, seeing source code as a source of data. And the other part is ML on Code. And ML on Code is the part that I've been going around talking about, because it's super exciting. The whole idea is learning stuff from source code. One of the things that you can learn, for instance,
Starting point is 00:04:01 is, say, to predict a token in a program if you're given the rest. Right? I give you a Go program, and I just remove one variable name from somewhere, and you need to predict it. You train a neural network to do this, and eventually we'll be able to do this quite correctly. Now, what we try to do is not to predict the missing pieces of a program, because in general, programs do not have missing pieces. But what we can see is that if what we predict and what you wrote is very, very different,
Starting point is 00:04:36 and even more than that, what you wrote, we know that is unpredictable, what we can tell you is that probably that's a bug. So this is a slightly complicated way of doing it. But what this detects is copy-paste errors. When you copy a section of code and paste it somewhere else, and you modify a bunch of things, but probably when you're checking for the error,
Starting point is 00:04:59 that's not what you wanted. And you're checking for the previous error or something like that. That happens all the time. I know it happens to me all the time. And with this, you're actually able to detect it directly. Building something that would use static analysis for that, it is possible, but it's really hard, because static analysis deals with syntaxes and grammar and stuff like that,
Starting point is 00:05:25 but not really with the semantics of the program. I like this idea of when you're writing code, there's two things. There's what you say and what you mean. And when those two things differ, that's when you have a bug. When you're saying, oh, actually, that's not what I meant, sorry.
Starting point is 00:05:41 You need to fix it. What we're trying to do is apply machine learning to see what you meant and compare it to what you said and see whether we can find bugs in there. And that's super interesting, super powerful, and we are doing a lot in that, but that is more like the future.
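(The "predictable versus surprising token" idea can be illustrated with a deliberately tiny stand-in. The real approach Francesc describes uses a trained neural network; the bigram counter below is only a toy showing the shape of the idea: flag a token whose probability under the model is very low.)

```python
from collections import Counter, defaultdict

# "Train" a tiny bigram model on token streams from a toy corpus.
corpus = [
    "if err != nil { return err }",
    "if err != nil { return nil }",
    "if err != nil { log ( err ) }",
]
follows = defaultdict(Counter)
for snippet in corpus:
    toks = snippet.split()
    for prev, cur in zip(toks, toks[1:]):
        follows[prev][cur] += 1

def surprising_tokens(snippet, threshold=0.2):
    """Flag tokens far less likely than what usually follows."""
    toks = snippet.split()
    flagged = []
    for prev, cur in zip(toks, toks[1:]):
        counts = follows[prev]
        if not counts:
            continue  # never seen this context; nothing to compare
        p = counts[cur] / sum(counts.values())
        if p < threshold:
            flagged.append(cur)
    return flagged

# A copy-paste slip: checking `err` but returning a different variable.
print(surprising_tokens("if err != nil { return foo }"))
```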
Starting point is 00:06:00 Currently, the cool thing that we just released yesterday was an analysis of the Kubernetes code base. Yeah. And that's a pretty lengthy, data-filled blog post. Great job on that. I mean, the cool thing is that there's so much data in there, right? Like, you have almost 2 million lines of code. And we've been working on that project since 2014, I think.
Starting point is 00:06:24 Right. So that's a lot of data in there. And we were able to find things like, oh, how many exported functions are there? And how did that grow over time? We saw that from version 1.0 to 1.4, the API grew by four times the size, which is bad. Like, if you kept on doing that, it means that by now...
Starting point is 00:06:53 So it went from 4,000 to 16,000. If we kept on going at the same pace, we would have something like around 100,000 endpoints. That is not maintainable. It's too complex. No contributor would be able to think about Kubernetes as a thing. You had to split it into pieces, right? So yeah, we were able to see all of these things, super interesting, and
Starting point is 00:07:11 the whole idea behind that article was to give back to the community. When you tell the community, hey, you're doing great, you're maturing, and you can tell, and the innovation is somewhere else, which means that the APIs are really good. All of this data, I mean, it's not newsworthy,
Starting point is 00:07:30 I'd say, right? Because it's nothing crazy new. But it's just confirmation through data that this feeling that Kubernetes is doing well is actually accurate. Right. You also were able to account for the different languages within Kubernetes. So it shows where there are declines or growth or, you know, even for developers who are thinking about transitioning to a different language, like just identifying where some of their future value for their career could be. Yeah, there's a lot of indications around that. Or even, as you mentioned, contributors and healthy growth and things like that. Those are all indicators like, well, people are here at this conference, 8,000 now versus 4,000 last year in Austin. What that shows is a significant sign of investment and betting on Kubernetes.
Starting point is 00:08:15 So understanding that it is healthy, in fact, based on true data, that's amazing. The cool thing is I will open source the way I did all of this analysis, but it is literally just a bunch of SQL and a bunch of Python. It's not that complicated. I mean, I'm not a good Pythonista. Let's not go there. But I'm not really good at writing Python. I had to learn a lot.
Starting point is 00:08:38 But still, it's actually pretty straightforward. When you say, for instance, I want to count all the languages that I have, basically what you're doing is like, okay, so give me all the files and I'm going to use this language function that classifies and tells you what language it is and classify that. Easy. And now give me the... I think we are around at 72,000
Starting point is 00:08:58 commits on the Kubernetes codebase. So I'm going to do it every 1,000. So every 1,000 commits, find how many you have, and just create the plot. So it's actually very straightforward, but the information we got from that was super interesting. I shared it with Kris Nova and Joe Beda,
Starting point is 00:09:14 and they were very interested in checking it out. They found a little bug, because apparently I'm not very good at reading. So instead of millions of lines of code on the PR that was sent to many analysts, I said billions. Oh, gosh. There's a difference there. Yeah, so Joe Beda is calling me Mr. Billions, which is awful.
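(The sampled language-count analysis can be sketched as follows. In the real analysis the per-commit file lists come from SQL queries against the engine, and its language classifier is far smarter than matching file extensions; everything below is invented toy data.)

```python
from collections import Counter

# Toy per-commit snapshots: (commit index, file paths at that commit).
# In the real analysis this would come from the engine's full Git history,
# sampled every 1,000 commits as described in the interview.
snapshots = [
    (0, ["main.go", "util.go"]),
    (1000, ["main.go", "util.go", "api.go", "deploy.sh"]),
    (2000, ["main.go", "util.go", "api.go", "deploy.sh", "setup.py"]),
]

# A crude extension-based classifier, standing in for the engine's
# language-detection function.
EXT_TO_LANG = {".go": "Go", ".sh": "Shell", ".py": "Python"}

def language_counts(paths):
    langs = Counter()
    for p in paths:
        for ext, lang in EXT_TO_LANG.items():
            if p.endswith(ext):
                langs[lang] += 1
    return langs

# One data point per sampled commit; these rows would feed the plot.
for idx, paths in snapshots:
    print(idx, dict(language_counts(paths)))
```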
Starting point is 00:09:36 So the example you're sharing here, at least with the analysis, is an open source project. Yeah, all of this is open source, and not only open source, but also Apache V2. What I mean is analyzing an open source code base. So maybe give an example of, say, how this applies to enterprises. Maybe somebody that's got their internal code. I know most things are open source,
Starting point is 00:09:58 but we're building our own products, and those products tend to be behind the scenes and the things we touch tangentially through dependencies or contributions are in the open source world. So how does sourced engine, right? Sourced engine? FRANCESC CAMPOY FLORES- Sourced engine, yeah. ADAM STACHOWIAK- So how does that
Starting point is 00:10:12 apply in a case where it's my own code? How do I run on my own code bases? FRANCESC CAMPOY FLORES- So you can run it exactly the same way. The cool thing is that since the beginning, we've developed all of the tools we've built to be run on-prem, right? Because, I mean, I used to work at Google. If you go to a Googler and you tell them, hey, we're going to be sending your source code over the network to some random server, they're going to be like, there's no way you're doing that, right? So we knew that source code
Starting point is 00:10:40 is a very, very delicate piece of data. So everything can run on-prem. Everything runs on Docker. So you can even have a Helm chart and just start everything up very easily. And everything is open source. So nothing wrong with that. Would you do this on a laptop? Would you do this at the server level? I mean, it depends on the amount of code you have.
Starting point is 00:10:59 If you're doing it on a laptop, yeah, it's going to take some time. I was running it on the analysis that I did for the- ADAM STACHOWIAK- So this is highly process intensive. FRANCESC CAMPOY FLORES- Yeah. I mean, it's pretty large, because it's big data, right? So the analysis that I did for the Kubernetes code base, I was running on an instance on Google Cloud Platform with, I think it was 96 cores.
Starting point is 00:11:22 So you know, a pretty large instance. And yeah, the analysis of counting all of the languages for all of the commits over time took around 10 minutes. So it's not that bad, actually. But if you're trying to do this for a very large thing, 96 cores is going to be maybe enough at the beginning. But eventually, you want to have it distributed. And that's where basically we're saying,
Starting point is 00:11:44 once you need more than one node, then it's enterprise edition and we should talk. Because the whole idea is that we want to give as much as possible to the open source community. Especially the engine can be a really powerful way to obtain data for all of the research part of machine learning, right? There's a lot of people doing research,
Starting point is 00:12:04 and they need data sets. The fact that they will be able to generate those data sets by running SQL queries that they already know very well, that's super powerful. So we want to make sure that they get access to that. But for larger companies that want to do analysis, the interesting thing is that those metrics that we came up with, you can tweak them, right?
Starting point is 00:12:24 And we are going to come up with a catalog of the kind of metrics that you should be looking at. So for instance, if you're saying, I'm going to be moving to cloud native, Cloud Native Computing Foundation, I'm going to go cloud native. Cool. What are the things that you should be looking at?
Starting point is 00:12:42 Well, you should have a Dockerfile. You should have continuous integration. You should have continuous deployment. All of these things, nowadays, they're in the source code. So we can analyze those things and give you a little bit of an idea of, if you're going towards being cloud native, how far away are you from getting there? And also, what are the things that you should be changing?
Starting point is 00:13:03 What pieces of the source code should be worked on in order to get there. So that is super useful because basically the whole idea is that it brings visibility to processes like going cloud native or adopting inner source or adopting DevOps. Lots of people talk about, oh, we're going to be doing DevOps.
Starting point is 00:13:21 What does that mean, right? There's actually clear things in the source code. ADAM STACHOWIAK- Please answer that. FRANCESC CAMPOY FLORES- Oh, yeah. I mean, how many hours do we have now? Doing DevOps, it can be many, many things. But the beginning is, well, you're going to need to have clear observability.
Starting point is 00:13:37 You're going to have metrics. You're going to have a lot of different things that then you're going to be fetching into some systems that will allow you to understand what your system is doing. But all of the observability things, again, it's source code. When you think about infrastructure as code, where do you find that? Source code.
Starting point is 00:13:54 So we keep on putting more and more stuff inside of Git repositories. And what we're trying to do is, sure, that's great. But now let's analyze it. Let's use that data we put in there to try to understand what's going on. The cool thing about being SQL, because I was actually, and I'm still thinking about offering a GraphQL thing because Git repositories are trees,
Starting point is 00:14:16 and once you parse code, you get a tree. So everything's trees. GraphQL for trees is great. But the fact that it's SQL, it allows you to mix it with other data sets. So you have like Looker or Power BI or things like this where you can have many data sets and do a query across many different databases.
Starting point is 00:14:35 Imagine doing something where you're saying, okay, so I'm gonna say inner source. The whole goal under inner source is really make sure you break the silos in a company and that everybody collaborates with each other, right? So like the Google style, Facebook style, even though the inner source term was created at PayPal. In order to do this, what you need to do, in order to measure how well you're doing, the whole idea is that you need to first know who is in each team. Unfortunately, that is not in your Git data set, right?
Starting point is 00:15:03 So you're going to need to mix it with some other dataset, HR dataset or whatever it is. So Looker or Power BI or I think that even Tableau, they would allow you to do these kind of things. You can look up the repo URL on GitHub even if it's a private GitHub repo as well. Yeah, yeah, yeah. So the cool thing is that...
Starting point is 00:15:21 Because you have teams at the org level, so you could look up, not so much by the repo, but by the repo URL. Yeah, you could even check it. Yeah, yeah, yeah. So the thing is that all of that, that is the GitHub API. So GitHub, we work with any Git repository.
Starting point is 00:15:35 So a lot of the concepts that we work with are Git for now. So that's why, for instance, the organization is a GitHub or GitLab concept. You could expose it from a different data set. Just download the whole thing, put it in MySQL, and that's it. You can do that too. And that's actually really powerful because you can then mix it with, if you have financial data or things like this,
Starting point is 00:15:56 you can try to see correlations. One that I like is the correlation of, if we're writing, can we correlate the number of commits with the money we're making? Are developers, when developers write a lot of code, is it good? Or maybe it's bad, and you should stop your developers
Starting point is 00:16:15 from writing more code. Or is there no relationship at all, and it just doesn't matter? Yeah, yeah, yeah. Interesting. So all of these things, like, once you expose all of that data, our idea is data analysts and data scientists,
Starting point is 00:16:26 they're going to be able to do really cool stuff with that. It sounds, though, like Sourced Engine is, let's maybe use an analogy. I'm a painter. I want to paint a painting. But it sounds like Sourced Engine is just a brush. I still got to apply the right kind of colors and understand color theory. So it's a tool to get there, but it's not the recipe to get there. Yeah, I would say, following that metaphor, I would say that sourced engine,
Starting point is 00:16:52 it wouldn't even be like the brush or anything. It would actually be deeper than that. It would be like the thing that makes the paint for you, right? It's like you're going to be extracting all of this data, and then with the data, you're going to be painting something. You're going to be creating your dashboard. You're going to be proving your point, right? Like data and statistics, those aren't new.
Starting point is 00:17:16 We've been using statistics to prove our point for quite a while. So the idea is that for data analysts, if you tell a data analyst, oh, yeah, you should use Git in order to find this. So let me explain to you how Git, let's say, Git log to start with, right? Like Git log works, and how branches work, and how commits work, and what is a merge commit,
Starting point is 00:17:38 all these things. The data analyst probably stopped listening to you like five minutes ago. So the idea of exposing all of these concepts in a way that data analysts understand is actually really powerful. Because data analysts, also data scientists, and also machine learning scientists.
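(The cross-dataset join described a moment ago, mixing Git history with an HR or team dataset, can be sketched with two toy SQLite tables. Both schemas are invented for illustration; in practice one side would come from the engine and the other from a BI tool's data source.)

```python
import sqlite3

# Toy example: which teams are committing to which repositories?
# Schemas and data are invented; the commits side would normally come
# from the engine, the hr_teams side from an HR system.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (repo TEXT, author_email TEXT);
CREATE TABLE hr_teams (email TEXT, team TEXT);
INSERT INTO commits VALUES
  ('payments', 'alice@bank.example'),
  ('payments', 'bob@bank.example'),
  ('frontend', 'alice@bank.example');
INSERT INTO hr_teams VALUES
  ('alice@bank.example', 'Core Banking'),
  ('bob@bank.example', 'Mobile');
""")

# Cross-team contributions per repo: a crude inner-source signal.
rows = conn.execute("""
    SELECT c.repo, t.team, COUNT(*) AS n
    FROM commits c JOIN hr_teams t ON c.author_email = t.email
    GROUP BY c.repo, t.team
    ORDER BY c.repo, t.team
""").fetchall()
for repo, team, n in rows:
    print(repo, team, n)
```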
Starting point is 00:17:54 So is the interface a config file of queries? Is it a dashboard of queries? What's the interface to sit down and get something done? So the interface, there's a couple ways of doing it. One is the source engine itself is SQL, an SQL interface. We do have a playground that allows you to have a list of all of the tables, understands well what a tree is, and shows it better
Starting point is 00:18:21 than what you see with a traditional SQL client. But I think that the best way to do this is actually with Jupyter. Jupyter notebooks work incredibly well. That's what I've been using. Because then it allows you to, you know, you have your text describing what you're measuring. Then you have your SQL query that is sent. Then you use the result with a little bit of Python. And then you generate your graph, and all of this stays in the same place.
Starting point is 00:18:48 That's what I've been using, and it's a really great experience, to be honest. I like that better. But if you want to use the MySQL client from your command line, that works here. It's just MySQL. So anything that works with MySQL works with the engine. So the example you've talked us through is,
Starting point is 00:19:03 and I'd love to see if you want to go into some of the data that you pull back for Kubernetes. Oh, yeah. If you want to share some of that. But I also want to mention that that's an example that you've done. What's an example of, say, someone who's a customer, an adopter of your open source, and they're using Source Engine in ways that you weren't?
Starting point is 00:19:23 Share some of the imagination of users. We've been working with especially large banks. And large banks are really interesting because they have an incredible amount of source code. And that incredible amount of source code goes from their cloud-native Kubernetes, Docker Compose, stuff like that, and COBOL. They go all the way back to having COBOL. When you tell them that they're going to be able to measure the technical debt for them, it's like, oh yes, let's do this. Because they're all about the debt, you know.
Starting point is 00:20:02 No, but once you tell them, it's like, oh, you know, many banks, they do not really even know how much code they have, right? There's so much of it that when you tell them, okay, so how much COBOL do you have? How many lines of code? Right. How many lines of code of COBOL do you think you have? And they're like, in between 100,000 and maybe half a million.
Starting point is 00:20:22 And it's like, well, if you're going to put some budget to go and rewrite that in something more modern, good luck with that estimation. So the idea is that we're going to be able to bring all of this data to them, so they're going to be able to make informed decisions. There was counting lines of code per language, which for us is literally a group by query.
Starting point is 00:20:46 Like, it's super simple to do. For them, it was like, this is actually really interesting. The other change that lots of banks want to do is going back to the inner sourcing, right? Large banks have many IT groups all around their organization, and they want them to work together well. And the first piece is to figure out who is doing what, what resembles what, how much code duplication you have. Like, we have a thing that analyzes code duplication, not character by character, but rather extracting
Starting point is 00:21:18 the abstract syntax tree and modifying a couple of things. So it's actually a very smart way of figuring out whether two pieces of code are very similar, right? They're so similar that if you saw them next to each other, you would say you need to refactor them and just write one function, right?
Starting point is 00:21:34 We're able to detect these automatically. And this helps a lot because if you imagine you're like the CTO of a bank and they tell you, it's like, okay, so you have this code base that dates from the 60s, and please put it on the cloud, right? That's hard. That is a harsh thing to ask of anyone. So the idea of being able to tell them, well, actually, of all of this source code, let's see which parts are going to be the easiest ones,
Starting point is 00:22:04 like this MTA, modernizing traditional applications, which is not really cloud native, but, you know, we can make it be cloud native. We can make it run on Kubernetes. And then what is the COBOL that, you know, that's going to be an interesting challenge to migrate, right? So having a view of all of this by just running a couple of queries is really powerful. The other option is literally running, like, I'm going to be very helpful and say that you could run a really huge bash script calling Git very, very often, and maybe you will get something similar. But it would take hours instead of seconds.
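(The AST-based duplicate detection described above can be sketched for Python code using the standard library's own ast module. The engine's universal ASTs do this across languages; this sketch only shows the core trick of comparing structure after discarding identifier names, so renamed copies still match.)

```python
import ast

def shape(code):
    """Sequence of AST node types with identifier names stripped,
    so two renamed copies of the same code get the same shape."""
    return [type(node).__name__ for node in ast.walk(ast.parse(code))]

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
# Same logic, every name changed: a classic copy-paste duplicate.
b = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc"
# Genuinely different code.
c = "def first(xs):\n    return xs[0]"

print(shape(a) == shape(b))  # True: renamed copy detected
print(shape(a) == shape(c))  # False: different structure
```

(A production version would also normalize constants, compare subtrees rather than whole functions, and score similarity instead of requiring exact equality.)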
Starting point is 00:22:35 Yeah. I think the thing I'm trying to drive out here is that clearly you can pull back a lot of intelligence. Oh, yeah. If you know what you're looking for, right? So it seems like maybe some consulting is involved there, or at least the right kind of teams in place that know how to ask those questions of being a data analyst, for example.
Starting point is 00:22:53 We're not that interested in the consulting side of things. Well, not so much as a company, but it seems like the intermediary there is someone who knows how to use Sourced Engine. There's going to be service integrators. There's going to be people not only installing the thing, but also... So you're kind of building out an economy even almost. Yeah.
Starting point is 00:23:07 With the final product, you'll eventually have an enterprise version of it. And or also enable others to make sense of data. Yeah. Nowadays, you have many consultants that are helping with these kinds of tasks. Right. What we're building is a super powerful tool for those consultants. Right. So, and then, I mean, they're going to be able to run it internally and keep it as a dashboard.
Starting point is 00:23:28 So, you know, it's observability. It's all about seeing where we are right now and seeing where we want to go and how to get there. Once you get there. There haven't been a lot of tools on this front. I mean, maybe, I mean. Observability on source code, not really, no. I mean, at this level, what you're doing, it kind of reminds me of, you probably know his name.
Starting point is 00:23:48 Felipe, I believe, worked at Google, would do these things. Felipe Hoffa. His post, I mean, this seems almost like you were inspired by the work he's done. I actually wrote a blog post that was like the one that Felipe worked on. But actually, I'm pretty sure that mine got more views. So, hey, Felipe. No, but like the idea was I was trying to analyze on all of the source code that we had on BigQuery, analyzing which is the most common package name or which is the most common package we import and stuff like that.
Starting point is 00:24:22 And it was everything. It was cool to do, but also regular expressions everywhere. Our idea is that it's kind of similar to that. But imagine that it's a better interface. Instead of saying, oh, I'm going to have like, oh, find the function names. Well, it's going to be func.star space, then something that starts with a letter, whatever, like that's a pain to write. And also, what if now you don't have a go function, but you have a go method? Actually, that will not work anymore, right? So what we're doing is instead allowing
Starting point is 00:24:57 you to extract the tokens that you care about. So we work with this concept that we call universal abstract syntax trees. And the whole idea is that it's an abstract syntax tree, so the result of parsing a program. But it allows you to extract things by using annotations. And those annotations are universal, right? So say a function is a function no matter what programming language you have, right? An identifier, same thing. Strings, same thing. So if you want to extract the function names,
Starting point is 00:25:27 what you need to do is basically use the UAST function. You pass the content. You pass what language you want to use. And then you just pass something that it's an XPath thing that basically says the function names. Same thing will work for Go, for Python, for Java, for no matter what programming language you're trying to use. So that is the kind of power that, yeah, you
Starting point is 00:25:46 could try to get with an incredibly complex regular expression. Like, I would love to see the regular expression that does the same thing, if someone has the time to write it. ADAM STACHOWIAK- Just for fun, right? FRANCESC CAMPOY FLORES- Yeah, yeah. Just for pain. But that would take a super long time. And even once you're done, if you
Starting point is 00:26:02 ask the person that wrote it, are you completely sure that this covers all the cases? Probably the answer is no. Yeah. So we are making it a much more reliable and easy way to extract the same information that you could get in some other ways.

This episode is brought to you by Clubhouse. One of the biggest problems software teams face is having clear expectations set in an environment where everyone can come together to focus on what matters most, and that's creating software and products their customers love. The problem is that software out there trying to solve this problem is either too simple and doesn't provide enough structure or it's too complex and becomes very overwhelming. Clubhouse solves all these problems. It's the first project management platform for software teams that brings everyone together. It's designed from the ground up to be a developer first product in its DNA, but also simple and intuitive enough that all teams can enjoy using it.
Starting point is 00:27:08 With a fast, intuitive interface, a simple API, and a robust set of integrations, Clubhouse seamlessly integrates with the tools you use every day and gets out of your way. Learn more and get started at clubhouse.io slash changelog. Our listeners get a bonus two free months after your trial ends. Once again, clubhouse.io slash changelog. And by Raygun.
Starting point is 00:27:31 Raygun recently launched their application performance monitoring service, APM as it's called. It was built with a developer and DevOps in mind, and they are leading with first-class support for .NET apps and also available as an Azure app service. They have plans to support .NET Core followed by Java and Ruby in the very near future. And they've done a ton of competitive research between the current APM providers out there.
Starting point is 00:27:55 And where they excel is the level of detail they're surfacing. New Relic and AppDynamics, for example, are more business oriented, where Raygun has been built with developers and DevOps in mind. The level of detail it provides in traces allows you to actively solve problems and dramatically boost your team's efficiency when diagnosing problems. Deep dive into root cause with automatic linkbacks to source for an unbeatable issue resolution workflow. This is awesome. Check it out. Learn more and get started at raygun.com slash apm.

Let's give a prescription then for those listening out there thinking, I mean, I'd love to find some intelligence, love to come in on a Monday morning with greater intelligence to my code
Starting point is 00:28:51 base. Give some examples of how someone listening to this, whether they're in a larger team, smaller team, their own project, whatever. What's a good prescription for getting up and running? So I would say that the best place to start is like, you go to sourced.tech and check the engine, download it. It's a little binary and it only has one dependency, which is Docker.
Starting point is 00:29:15 So probably you already have it on your computer. I was curious why on the Mac OS installation process, you didn't use Homebrew. I know it's just a binary. I found you just put it in your bin folder, but I was wondering why. Oh, we don't have Homebrew yet, but we'll get there. I was like, the process to install seems so simple. It's like probably a simple Homebrew recipe as well. I need to work on that, but, you know, it's been a busy week.
Starting point is 00:29:33 But just doing any Mac OS install, I'm always, you know, expecting a Homebrew process or something specific to the way a language installs certain things. There's an issue somewhere to implement that. Sorry for pulling that off. Oh, no worries. No, but so once you have that binary, however you install it,
Starting point is 00:29:51 the idea is that you can just run something like source SQL, right? And now you are inside of a SQL client and you are querying all of the code that you found, all of the Git repositories that you found from the directory where you were, right? So now the cool thing is that you can start by doing things like, you know, count the commits that you have per month
Starting point is 00:30:11 or something like that. That is actually very interesting because you can see how much the team has been working over time. Or you can count the number of lines or things like this. These seem like pretty simple things, but even those are actually going to show weird things. For instance, for Kubernetes, I was like, I'm going to count the number of lines of code.
Starting point is 00:30:34 It goes up, right? Sure. But it also goes down eventually. And it is really weird, because it goes down by a lot of lines of code. And I started looking around. So I had some fun deletes and stuff. Yeah, and I was like, what is going on with this, right?
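The simple counting queries he mentions, like commits per month, can be sketched with Python's sqlite3 standing in for the engine's SQL client. The table and column names below (`commits`, `commit_author_when`) are assumptions modeled loosely on a gitbase-like schema, not its exact layout:

```python
# Sketch of a "commits per month" query. sqlite3 stands in for the
# source{d} engine's SQL interface; the schema is an assumption.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (commit_hash TEXT, commit_author_when TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?)",
    [
        ("a1", "2018-10-03"),
        ("b2", "2018-10-21"),
        ("c3", "2018-11-05"),
    ],
)

# Group commits by month -- the same shape of query you would run
# against the engine's commits table.
rows = conn.execute(
    """
    SELECT strftime('%Y-%m', commit_author_when) AS month, COUNT(*) AS n
    FROM commits
    GROUP BY month
    ORDER BY month
    """
).fetchall()
print(rows)  # [('2018-10', 2), ('2018-11', 1)]
```

Plotting the resulting counts over time is exactly where the odd spikes and drops he describes show up.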
Starting point is 00:30:49 So there's actually, no matter what data set it is, you're going to be finding a lot of cool stuff because those are organic data sets, right? We keep on committing all the time and we're going to make mistakes. You're going to see from time to time, like the number of files goes up by like thousands and then goes down again. And then you look at that, it was like, huh, someone vendored the dependencies that they were not supposed to. Right? Like all of these things are...
Starting point is 00:31:10 Yeah, like, whoops. But yeah, all of these things, you're able to see more information. And the thing is that as soon as you start playing with this, at least in my experience, the more answers you get, the more questions you get. Right? Like, okay, so I saw this, but what happened with this thing? Or you can also find things like
Starting point is 00:31:27 something that, it's a really cool game, I'm going to be open sourcing it soon. It's, have you ever heard about the Degrees of Bacon? Oh, yes. Kevin Bacon, yes. Kevin Bacon, yeah. Six Degrees of Kevin Bacon. Six Degrees of Kevin Bacon. So you can do the Six Degrees of Kevin Bacon, but
Starting point is 00:31:44 on Git, trying to figure out. So for me to say, I don't know, someone famous in the Go community, Rob Pike, how many degrees are there? So for me, I edited a file that was edited by someone else, that edited another file that was edited by someone else that was edited by Rob Pike, something like that.
Starting point is 00:32:04 You can actually extract that information from Git, right? So you can do, like, you can extract really useful insights for your business, but also you can build pretty cool games. So that's the thing. It's like, have fun with it. It's data. So if you've ever done any kind of data analysis, I mean, it's called data exploration for a reason. You do not necessarily know what you're going to be finding,
Starting point is 00:32:26 but that's the whole game, right? You're going to be able to extract some things. And then if you're actually interested in some specific metrics, check out the Kubernetes blog post that we wrote, where you're going to have all of the different queries that were run, and you can run the same thing for you and see, for instance, the trends on what programming languages you're using. How are they growing?
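That language-trend idea can be roughly mimicked by counting file extensions across snapshots of the tree at different points in history. The snapshots here are invented; a real run would list each commit's files through the engine:

```python
# Rough sketch of a language-trend metric: file counts per extension
# at two points in history (invented sample snapshots).
from collections import Counter
from pathlib import PurePosixPath

snapshots = {
    "2017-01": ["cmd/main.go", "pkg/util.go", "build.py"],
    "2018-01": ["cmd/main.go", "pkg/util.go", "pkg/api.go", "hack/gen.py"],
}

def by_extension(paths):
    """Count files per extension as a crude language proxy."""
    return Counter(PurePosixPath(p).suffix for p in paths)

trend = {month: by_extension(paths) for month, paths in snapshots.items()}
print(trend["2017-01"][".go"], trend["2018-01"][".go"])  # 2 3
```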
Starting point is 00:32:45 Are you using more Go than before? Or maybe you're using more Java or Python? All of these things are going to appear very clearly on your graphs. What do you say to maybe some for-profit, say, SaaS competitors to say this rough idea, which is basically data-driven intelligence in a development team.
Starting point is 00:33:05 So look at our code base, our repository, learn some insights. What do you know about other competitors, and how do you see Sourced moving forward, coming from OpenCore, eventually going to have your own products and different ways you can sustain financially? I mean, you are a company, so eventually you've got open source, but that's only going to last so long until you actually have to create some products to generate revenue.
Starting point is 00:33:29 Where will you be at in this space? So it is hard to answer because we are somewhere in between many different fields. There are some companies that do metrics, like software metrics, but the thing is that the software metrics they provide are the software metrics they provide. That's it, right?
Starting point is 00:33:48 So you cannot tweak them. Yeah, like you can get like... You have no visibility into... Yeah, like you can choose software metrics. Some of them might be really interesting. Like you can do like lines of code and stuff like that, number of commits. But also you can do things like cyclomatic complexity, right?
Starting point is 00:34:03 Like cyclomatic complexity is a really cool concept, but probably doesn't apply to you. What you want is things like, actually, what I care about is how many comments do I have per function? Are all my functions correctly commented or not? Those things, probably what you want to do is express exactly what you want. And that's why
Starting point is 00:34:25 I think that what we're building is something that many of the companies that compete with us, they could be powered by us, really. That's what I was thinking. It's almost as if you're building their future tools. If they've done what they've done and they've gotten maybe, say, two or three years into their business, but they don't have the tooling, they may actually retrofit their business to essentially become a service provider on top of Source Engine, for example. If they're interested in doing that, talk to us. That is the thing, right?
Starting point is 00:34:50 What we're building is... So Source essentially is a standard. Source Engine at least could become an open source standard for data intelligence and code bases. Yeah. The idea is we want to extract data from source code, right? Right. The most common way of storing source code is Git.
Starting point is 00:35:07 The most common way of analyzing data is SQL. So we just put them together. And that is our first product, but we actually built it to extract information that then we can use to train models and do machine learning, right? We believe that many people are interested in doing that kind of thing,
Starting point is 00:35:24 and we want them to do it. Because at the end of the day, if we end up being successful, our code review tool, which is called Lookout, will provide an opportunity to write analyzers, right? To basically classify a piece of code: does this contain some specific thing or not? Does this contain a bug or not? Does this contain a lint error or something like that, right? So those can be done with completely traditional tooling, like linters and stuff like that. But also we believe that many of them will need machine learning.
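The analyzer idea can be sketched as a function that takes a change and yields review comments. This is a hypothetical shape for illustration only; Lookout's real analyzers speak a gRPC protocol and can be backed by either traditional linters or learned models:

```python
# Hypothetical analyzer sketch: classify changed lines and emit review
# comments. The Comment shape and rule are invented for illustration;
# they are not Lookout's actual API.
from dataclasses import dataclass

@dataclass
class Comment:
    line: int
    text: str

def todo_analyzer(changed_lines):
    """Classic lint-style analyzer: flag TODO markers left in a change."""
    return [
        Comment(i, "TODO left in code; open an issue instead?")
        for i, line in enumerate(changed_lines, start=1)
        if "TODO" in line
    ]

comments = todo_analyzer(["x = 1", "# TODO: handle errors", "return x"])
print([c.line for c in comments])  # [2]
```

An ML-backed analyzer would keep the same interface and simply swap the rule for a model's prediction.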
Starting point is 00:35:58 We cannot build all of those things. We're building the platform so other people can build on top of us. So what you're talking about is a product called Lookout. It's in beta right now. You can request a demo, obviously, if you wanted to. So Source Engine is in beta. I think that Source Lookout is in alpha, probably. It says here on your site beta.
Starting point is 00:36:18 Really? Yeah. That's probably a mistake. Sign up for the beta. I see it right here. I'll talk to my team. I'm pretty sure that is an alpha normally. Sign up for the beta for the Kubernetes.
Starting point is 00:36:33 The source engine beta, yes. I'm pretty sure Lookout is still alpha. But anyway, it's still also, again, completely open source. You can check it out, run it on your project, etc. We do not think that running the engine as a SaaS, as software as a service, makes much sense, because people
Starting point is 00:36:54 do not want to send their code to random servers. But the source code analysis, sorry, the code review, assistive code review, we want to make that a SaaS. So eventually you will be able to just add as a GitHub application that just reviews your code. We've done that for all of our projects
Starting point is 00:37:11 and it works really well. It's able to warn you about, hey, this piece of code is suspiciously similar to that piece of code in that dependency. Did you copy paste it? Or maybe you should be calling that function, right? There's a lot of really good hints on what you should be doing,
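A crude version of that "suspiciously similar" hint can be sketched with Jaccard similarity over token 3-grams. This only illustrates the flavor of the check; it is not how Lookout actually detects duplicated code:

```python
# Naive copy-paste detector: Jaccard similarity over token 3-grams.
# Illustration only -- not Lookout's real similarity algorithm.
def shingles(code, n=3):
    """Set of n-token windows from whitespace-tokenized code."""
    toks = code.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def similarity(a, b):
    """Jaccard similarity between the shingle sets of two snippets."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "for i in range ( n ) : total += vals [ i ]"
pasted   = "for i in range ( n ) : total += vals [ i ]"
other    = "while queue : node = queue . pop ( )"

print(similarity(original, pasted) > 0.9, similarity(original, other) < 0.1)
# True True
```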
Starting point is 00:37:32 and we want to have more and more on that. And those probably will have eventually a SaaS version that you can just click a button, install it in your repositories in your GitHub or GitLab, and that's it. For many people, the people that really care about deep analysis of large code bases, they tend to also not want
Starting point is 00:37:50 to share their source code. So for that, it doesn't make that much sense to have a SaaS for the engine. So if folks sign up for the beta, what can they expect? You know, what's... Sorry, alpha. I'm correcting that. That's what I was trying to do, is tee up the fact that it's sort of an early release.
Starting point is 00:38:08 Maybe you're even looking for feedback. Yeah. So that's the whole point: we are trying to get people to use the product, file issues, let us know what they think. File issues for things that are not going to work, but also for things that they would like to do, right? This is a pretty young project. We released it two months ago, I want to say, something like that. So it's pretty early on. And the idea is that we're going to be working with really large companies to try to make it as good as possible.
Starting point is 00:38:37 But at the same time, we also want to have the input from the community, because they have different needs, right? So we don't want to end up having something that targets only large companies but is pretty useless for developers. We want to build something that everyone can get something from. Large companies are going to have some specific analysis and some specific things; that's what our enterprise edition will have. But also, our free edition will always be free.
Starting point is 00:39:07 We want people to make sure that that becomes as good as possible. And also, if you feel like it, contribute. It's written in Go, and it's a really cool project. We use a lot of open source. We use Pilosa, which is for making indexes on SQL. We use Vitess, which is the thing
Starting point is 00:39:24 that YouTube created to sit between their Python code and MySQL. So we grabbed all of the SQL parsing and stuff like that from there. We use a regular expression library whose name I forgot, but it's also open source. So we are open source. We use everything in open source.
Starting point is 00:39:47 And for now, we are analyzing also open source. So, you know, open source everywhere. I was just thinking about that. Now, any future plans for any sort of list you've got running right now for future blog posts of different analysis on different source code? Or have you got any requests? So we've gotten a couple of requests, yeah, absolutely. What did you call it? Request for analysis?
Starting point is 00:40:07 Request for analysis. Yeah, that's a good name. So we did this analysis on Kubernetes. And as soon as we did it, there were some people saying, oh, what about if you do it for the competitors of Kubernetes, right? So Cloud Foundry, stuff like that. People want to see how mature they are, stuff like that. I think that the next analysis that I want to do,
Starting point is 00:40:30 I want to do it in a different language. So Kubernetes was mostly Go. I want to do it for TensorFlow because it's also a huge community and it's a different language, mostly Python. Lots of C too, I think. So trying to figure that out and probably in that analysis,
Starting point is 00:40:48 when I'm going to open source that, the six degrees of, and it's obviously going to be six degrees of, what's his name? I forgot. Dean, one of the creators of TensorFlow. Like Jeff Dean, that's it, Jeff Dean. He's one of the big creators of everything machine learning related at Google.
Starting point is 00:41:09 That's him behind it. So yeah, if you're a contributor to Kubernetes, how many degrees away are you from Jeff Dean? I think it could be an interesting thing to do. Yeah. Yeah. Also, if you have ideas on how to analyze this data from different axes, also super interested with that.
Starting point is 00:41:24 So if you have follow-up questions or projects that you would like to see analyzed, yeah, let us know. We're going to be working on those, trying to get one per month at least. Because we've seen a lot. It's probably really good for growth. Yeah, it's really good for growth, really good for adoption, and also really good for us. Because really good to see whether every analysis that we want to do, whether it's doable or not. So there were some things that, you know, like silly example,
Starting point is 00:41:52 EastUpper was not supported. So now we're going to be supporting EastUpper. You become a user too. Yeah, I am the user. It's also QA. You're QAing your product essentially by doing some good exercise. You know, developer relations, customer zero, all of those things. I still keep on doing those things.
Starting point is 00:42:08 It's very, very useful. So if people have ideas, let me know. One thing I love too, just to mention your website, I love when community is in the main nav of open source based companies because far too often it's like, where
Starting point is 00:42:23 is the community? Who is the community who is the community how is it represented and how can i talk to the contributors too often it's just too many clicks are hard to find out yeah no who's involved in the team how can i talk to somebody how can i get what's my on-ramp you know i'm i get questions maybe it's a 101 i prefer and right here you have community and the second one down has talked to us. Yeah, we deeply care about community. There's some really active, we have a very active Slack community with a bunch of different channels.
Starting point is 00:42:54 Machine learning is one of them, super active. People are there talking about what they want us to build and stuff like that. We also have language analysis. If you're a language analysis geek, we are doing a lot of really cool stuff. The number of conversations that I've had about like Rust weird things or even Lisp or how to parse COBOL and stuff like that, it is really cool. Like I'm a language nerd. I love different
Starting point is 00:43:17 languages and I'm having lots of fun because of that. So yeah, even if you're not necessarily interested on what we're building with, which is this analysis, this analysis engine, and you're interested just on some of the details, I think there's a lot that you can learn from that. The concept of universal abstract syntax tree is being used by other engineers to do things like security analysis of source code, things like this.
Starting point is 00:43:41 So check it out and join us and let us know what you think. And if you're working on something, it's always good. We have our mailing list, biweekly mailing list, that was supposed to come out today, but there was no way at the time to write it, or Victor, head of community. And in that mailing list,
Starting point is 00:44:00 we always have at the end of the mailing list, we have a highlight on someone from the community that has done something cool right so we really really care about community yeah join us it's it's a good community and i'm sure that's probably the the way you hire too is probably from yeah we've been hiring members we've been hired through that at least this one way by the way we are hiring that's a good. Yeah, so we are trying to figure out, like, we have engineers that have been hired through this. We have also people hired through, they wrote about us,
Starting point is 00:44:33 about like, oh, I've discovered this, wrote a blog post, and now they're going to be joining us soon. So yeah, like, we're definitely hiring for so many different people. Machine learning experts, language analysis experts, people in product management, people in developer relations. And the team is distributed, I assume? The team is very distributed.
Starting point is 00:44:53 The CEO is remote, just to give you an idea. So the CEO is in Lisbon. We have people in Seattle, San Francisco, Madrid, London, and then somewhere in France, somewhere in Poland, somewhere in Russia, somewhere in Ukraine, somewhere in many places. So the good thing is, all these jobs you have open, all of these jobs are
Starting point is 00:45:11 worldwide? Most of them, except there's a couple of them that are actually specifically for San Francisco, but all of them are completely
Starting point is 00:45:19 distributed, so you can work from wherever you feel like. Well, Francesc, it's been a pleasure to talk with you. Thank you. I've known you for years, but just never really had a chance to sit down and have a conversation with you.
Starting point is 00:45:30 This is the first time. It's kind of a bummer, actually, but good at the same time. Let's just make it not the last time. That's right. Let's make it not the last time. Basically, you know, we're looking for feedback. We're looking for participation. So just go check out source.tech and then
Starting point is 00:45:46 find the community and join us. Cool. Thank you. All right, man, thank you so much for your time. Appreciate it. All right, thank you for tuning in for this episode of the Changelog. If you enjoyed the show, do us a favor: go into iTunes, rate the podcast, leave us a rating or review. Go into Overcast and favorite it. Tweet a link to it. Share it with a friend. Of course, thank you to our sponsors and our partners, Linode, Clubhouse, and Raygun. Also, thank you to Fastly, our bandwidth partner.
Starting point is 00:46:18 Head to Fastly.com to learn more. And we move fast and fix things around here at Changelog because of Rollbar. Check them out at Rollbar.com slash changelog. And we're hosted on Linode cloud servers, linode.com slash changelog. Also, special thanks to our friends at the Cloud Native Computing Foundation for bringing us to KubeCon + CloudNativeCon. It was awesome to be there. If you want to hear more episodes like this, subscribe to our master feed at changelog.com slash master or go into your podcast app and search for ChangeLog Master. You'll find it.
Starting point is 00:46:51 Subscribe. Get all of our podcasts in a single feed as well as some extras that only hit the master feed. Thanks again for listening. We'll see you soon.
