The Changelog: Software Development, Open Source - source{d} turns code into actionable insights (Interview)
Episode Date: January 16, 2019
Adam caught up with Francesc Campoy at KubeCon + CloudNativeCon 2018 in Seattle, WA to talk about the work he's doing at source{d} to apply Machine Learning to source code, and turn that codebase into... actionable insights. It's a movement they're driving called Machine Learning on Code. They talked through their open source products, how they work, what types of insights can be gained, and they also talked through the code analysis Francesc did on the Kubernetes code base. This is as close as you get to the bleeding edge and we're very interested to see where this goes.
Transcript
Bandwidth for ChangeLog is provided by Fastly. Learn more at Fastly.com.
We move fast and fix things here at ChangeLog because of Rollbar.
Check them out at Rollbar.com.
And we're hosted on Linode cloud servers. Head to Linode.com slash ChangeLog.
This episode is brought to you by Linode, our cloud server of choice.
It is so easy to get started with Linode. Servers start at just five bucks a month.
We host ChangeLog on Linode cloud
servers and we love it. We get great 24-7 support. Zeus-like powers with native SSDs. A super fast
40 gigabit per second network and incredibly fast CPUs for processing. And we trust Linode
because they keep it fast. They keep it simple. Check them out at linode.com slash changelog.
From Changelog Media, you're listening to the Changelog,
a podcast featuring the hackers, the leaders,
and the innovators of software development.
I'm Adam Stachowiak, Editor-in-Chief here at Changelog.
Today, I'm at KubeCon, CloudNativeCon, talking to Francesc Campoy. We talk about the work he's doing at source{d} to apply machine learning to source code and turn that code base into
actionable insights. It's a movement they're driving called machine learning on code.
We talk through their open source products, how they work, what types of insights can be gained,
and we also talk through the code analysis Francesc did on the Kubernetes code base.
This is as close as you get to the bleeding edge.
And I'm very interested to see where this goes.
You're at Sourced now.
Give me the breakdown of what Sourced is.
Is it formed
around open source? Everything right now is open source, right? So nothing you have
is a paid product at least yet. Everything we do is open source. We are
working on an enterprise edition for one of the products and basically the whole
idea is that everything will keep on being open source except for
one thing that allows our product to work distributed. So to back up a little bit and give a little bit of context,
what we do at source{d}, our tagline
is machine learning for large scale code analysis.
OK.
I like that.
I like that tagline.
We worked on that tagline for quite a while.
I'm happy you like it.
It's succinct, and it gets the message across.
Yeah.
That's the whole point.
The whole idea behind it is when you're writing code,
normally people think about the fact that you write code,
then you build it, and you ship something.
And what you ship is what matters, right?
Source code is just a way to get there.
And what we realize is that actually it's
a huge and very, very deep source of data.
When you have a Git repository, you can actually see what's happened there since the beginning of time until now.
You can actually analyze trends.
You can see so much stuff in there.
So what we did is create this Engine product that basically provides a SQL interface.
So you can find things in your Git repositories.
So you can do things like find commit messages with this text or whatever.
But you can actually go even deeper than that and go into,
actually, I want to see the content of the file.
I want to parse it.
I want to extract the function names.
I want to extract the strings or whatever, right?
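To make that concrete, a minimal sketch of such a query could look like the following, assuming the engine's MySQL-compatible interface and a commits table with commit_hash and commit_message columns; the schema, database name, and connection settings here are illustrative and may differ from the real thing.

    # Illustrative only: table and column names are assumptions for this sketch.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    with conn.cursor() as cur:
        # Find commit messages that mention a given word.
        cur.execute(
            "SELECT commit_hash, commit_message FROM commits "
            "WHERE commit_message LIKE %s LIMIT 20",
            ("%deprecate%",),
        )
        for commit_hash, message in cur.fetchall():
            print(commit_hash, message.splitlines()[0])
    conn.close()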
So there's a bunch of different projects
that make this possible.
And basically, every single one of those projects
is completely open source.
And we created a product which is called
the Engine, which puts all those together
in a way that's nice to use.
Little binary, you get started, and everything just works.
And then the other side of things that we're doing,
so that is what we call the code as data,
seeing source code as a source of data.
And the other part is ML on Code.
And ML on Code is the part that I've been going around talking about,
because it's super exciting.
The whole idea is learning stuff from source code.
One of the things that you can learn, for instance,
is, say, to predict a token in a program if you're given the rest
Right? I give you a Go program, and I just remove one variable name from somewhere, and you need
to predict it. You train a neural network to do this, and eventually it will be able to do this quite
correctly. Now, what we try to do is not to predict the missing pieces
of a program, because in general, programs
do not have missing pieces.
But what we can see is that if what we predict
and what you wrote is very, very different,
and even more than that, what you wrote,
we know that is unpredictable, what we can tell you
is that probably that's a bug.
So this is a slightly complicated way of doing it.
But what this detects is copy-paste errors.
When you copy a section of code and paste it somewhere else,
and you modify a bunch of things,
but you miss one, so when you're checking for the error,
that's not the one you wanted.
You're checking for the previous error
or something like that.
That happens all the time.
I know it happens to me all the time.
And with this, you're actually able to detect it directly.
Building something that would use static analysis for that, it is possible, but it's really
hard, because static analysis deals with syntaxes and grammar and stuff like that,
but not really with the semantics of the program.
I like this idea of when you're writing code,
there's two things.
There's what you say and what you mean.
And when those two things differ,
that's when you have a bug.
When you're saying,
oh, actually, that's not what I meant, sorry.
You need to fix it.
What we're trying to do is apply machine learning
to see what you meant
and compare it to what you said
and see whether we can find bugs in there.
And that's super interesting, super powerful,
and we are doing a lot in that,
but that is more like the future.
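As a rough illustration of that comparison step, and not source{d}'s actual model, here is a small Python sketch: assume some trained model already gives you a probability distribution over candidate tokens at each position, and flag the spots where what was actually written is far less likely than what the model would have predicted.

    # Illustrative sketch only: assumes a model that yields a probability
    # distribution over candidate tokens at each position of interest.
    from typing import Dict, List, Tuple

    def suspicious_tokens(
        positions: List[Tuple[str, Dict[str, float]]],
        threshold: float = 0.01,
    ) -> List[Tuple[str, str]]:
        """Return (written, predicted) pairs where the written token is very
        unlikely under the model, i.e. 'what you said' differs a lot from
        'what you probably meant'."""
        flagged = []
        for written, distribution in positions:
            predicted = max(distribution, key=distribution.get)
            if distribution.get(written, 0.0) < threshold and predicted != written:
                flagged.append((written, predicted))
        return flagged

    # Toy usage: the model strongly expects `err` but the code says `errOld`,
    # the classic copy-paste slip described above.
    example = [("errOld", {"err": 0.92, "errOld": 0.003, "e": 0.05})]
    print(suspicious_tokens(example))  # [('errOld', 'err')]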
Currently, the cool thing that we just released yesterday
was analysis of the Kubernetes code base.
Yeah.
And that's a pretty lengthy, data-filled blog post.
Great job on that.
I mean, the cool thing is that there's so much data in there, right?
Like, you have almost 2 million lines of code.
And we've been working on that project since 2014, I think.
Right.
So that's a lot of data in there.
And we were able to find things like, oh,
so how many exported functions are there?
And how did that grow over time?
We saw that from version 1.0 to 1.4,
the API grew to four times the size, which is bad.
Like, if you kept on doing that, think about what it would mean by now.
So it went from 4,000 to 16,000.
If we kept on going at the same pace, we would have something like around 100,000 endpoints.
That is not maintainable.
It's too complex.
No contributor would be able to think about Kubernetes
as a thing. You had to split it into pieces,
right? So yeah, we were able to
see all of these things, super interesting, and
the whole idea behind that article was
give
back to the community.
When you tell the community, hey, you're doing
great, you're maturing,
and you can tell, and
the innovation is happening somewhere else,
which means that the APIs are really good. All of this data, I mean, it's not newsworthy,
I'd say, right? Because there's nothing crazy new. But it's just confirmation through data that this
feeling of Kubernetes is doing well, it is actually accurate. Right. You also were able to account for
the different languages within Kubernetes. So it shows where there's declines or growths or, you know, even for developers who are thinking about transitioning to a different language, like just identifying where some of their future value for their career could be.
Yeah, there's a lot of indications around that.
Or even, as you mentioned, contributors and healthy growth and things like that.
Those are all indicators like, well, people are here at this conference,
8,000 now versus 4,000 last year in Austin.
What that shows is a significant sign of investment and betting on Kubernetes.
So understanding that it is healthy, in fact, based on true data, that's amazing.
The cool thing is I will open source the way I did all of these analyses, but it
is literally just a bunch of SQL and a bunch of Python.
It's not that complicated.
I mean, I'm not a good Pythonista.
Let's not go there.
But I'm not really good at writing Python.
I had to learn a lot.
But still, it's actually pretty straightforward.
When you say, for instance, I want to count all the languages that I have, basically what you're
doing is like, okay, so give me all the files
and I'm going to use this language function
that tells you what language it is,
and classify by that. Easy.
And now give me the...
I think we are at around 72,000
commits on the Kubernetes
codebase. So I'm going to do it every
1,000. So every 1,000 commits,
find how many you have,
and just create the plot.
So it's actually very straightforward,
but the information we got from that was super interesting.
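A minimal sketch of that language count, assuming the engine's MySQL interface exposes a files table and the LANGUAGE() function described above; the exact names are assumptions, so check the engine's documentation for the real schema.

    # Illustrative only: files table and LANGUAGE() signature are assumed.
    import pandas as pd
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    langs = pd.read_sql(
        "SELECT LANGUAGE(file_path, blob_content) AS lang, COUNT(*) AS files "
        "FROM files GROUP BY lang ORDER BY files DESC",
        conn,
    )
    print(langs)
    # Repeating a query like this at every 1,000th commit and stacking the
    # results gives the per-language series you can plot over time.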
I shared it with Kris Nova and Joe Beda,
and they were very interested in checking it out.
They found a little bug, because apparently I'm
not very good at reading.
So instead of millions of lines of code on the PR that was sent for the analysis,
I said billions.
Oh, gosh.
There's a difference there.
Yeah, so Joe Beda is calling me Mr. Billions, which is awful.
So the example you're sharing here, at least with the analysis,
is an open source project.
Yeah, all of this is open source, and not only open source, but also Apache V2.
What I mean is analyzing an open source code base.
So maybe give an example of, say,
how this applies to enterprises.
Maybe somebody that's got their internal code.
I know most things are open source,
but we're building our own products,
and those products tend to be behind the scenes
and the things we touch tangentially through dependencies
or contributions are in the open source world.
So how does sourced engine, right?
Sourced engine?
Sourced engine, yeah.
So how does that
apply in a case where it's my own code?
How do I run it on my own code bases?
So you can run it exactly the same way.
The cool thing is that since the beginning,
we've developed all of the tools we've built to be run on-prem,
right? Because, I mean, I used to work at Google. If you go to a Googler and you tell them,
hey, we're going to be sending your source code over the network to some random server,
they're going to be like, there's no way you're doing that, right? So we knew that source code
is a very, very delicate piece of data. So everything can run on-prem.
Everything runs on Docker.
So you can even have a Helm chart and just start everything up very easy.
And everything is open source.
So nothing wrong with that.
Would you do this on a laptop?
Would you do this at the server level?
I mean, it depends on the amount of code you have.
If you're doing it on a laptop, yeah, it's going to take some time.
I was running it on the analysis that I did for the-
So this is highly process intensive.
Yeah.
I mean, it's pretty large, because it's big data, right?
So the analysis that I did for the Kubernetes code base,
I was running on an instance on Google Cloud Platform
with, I think it was 96 cores.
So you know, a pretty large instance.
And yeah, the analysis of counting all of the languages
for all of the commits over time took around 10 minutes.
So it's not that bad, actually.
But if you're trying to do this for a very large thing,
96 cores is going to be maybe enough at the beginning.
But eventually, you want to have it distributed.
And that's where basically we're saying,
once you need more than one node,
then it's enterprise edition and we should talk.
Because the whole idea is that we
want to give as much as possible to the open source community.
Especially the engine can be a really powerful way
to obtain data for all of the research part of machine
learning, right?
There's a lot of people doing research,
and they need data sets.
The fact that they will be able to generate those data sets
by running SQL queries that they already know very well,
that's super powerful.
So we want to make sure that they get access to that.
But for larger companies that want to do analysis,
the interesting thing is that those metrics
that we came up with, you can tweak them, right?
And we are going to come up with a catalog of the kinds
of metrics that you should be looking at.
So for instance, if you're saying,
I'm going to be moving to cloud native,
the Cloud Native Computing Foundation way,
I'm going to go cloud native.
Cool.
What are the things that you should be looking at?
Well, you should have a Dockerfile.
You should have continuous integration.
You should have continuous deployment.
All of these things, nowadays, they're in the source code.
So we can analyze those things and give you a little bit of an idea of,
if you're going towards being cloud native,
how far away are you from getting there?
And also, what are the things that you should be changing?
What pieces of the source code should be worked on
in order to get there.
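A hedged sketch of that kind of readiness check, again assuming the engine's SQL interface and a files table with file_path and repository_id columns; the file patterns are just examples of artifacts a team might look for.

    # Illustrative readiness check: which repositories carry the artifacts you
    # would expect from a cloud-native setup. Schema names are assumptions.
    import pymysql

    CHECKS = {
        "Dockerfile": "%Dockerfile",
        "CI config": "%.travis.yml",          # or whichever CI system you use
        "Kubernetes manifest": "%deployment.yaml",
    }

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    with conn.cursor() as cur:
        for label, pattern in CHECKS.items():
            cur.execute(
                "SELECT COUNT(DISTINCT repository_id) FROM files WHERE file_path LIKE %s",
                (pattern,),
            )
            (count,) = cur.fetchone()
            print(f"repositories with a {label}: {count}")
    conn.close()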
So that is super useful because basically the whole idea
is that it brings visibility to processes
like going cloud native or adopting inner source
or adopting DevOps.
Lots of people talk about,
oh, we're going to be doing DevOps.
What does that mean, right?
There's actually clear things in the source code.
Please, answer that.
Oh, yeah.
I mean, how many hours do we have now?
Doing DevOps, it can be many, many things.
But the beginning is, well, you're
going to need to have clear observability.
You're going to have metrics.
You're going to have a lot of different things
that then you're going to be feeding into some systems
that will allow you to understand
what your system is doing.
But all of the observability things, again, it's source code.
When you think about infrastructure as code,
where do you find that? Source code.
So we keep on putting more and more stuff inside of Git repositories.
And what we're trying to do is, sure, that's great.
But now let's analyze it.
Let's use that data we put in there to try to understand what's going on.
The cool thing about it being SQL...
because I was actually,
and I'm still thinking about offering a GraphQL thing
because Git repositories are trees,
and once you parse code, you get a tree.
So everything's trees.
GraphQL for trees is great.
But the fact that it's SQL,
it allows you to mix it with other data sets.
So you have like Looker or Power BI or things like this
where you can have many data sets and do a query
across many different databases.
Imagine doing something where you're saying,
okay, so I'm gonna say inner source.
The whole goal of inner source is really
to make sure you break the silos in a company
and that everybody collaborates with each other, right?
So like the Google style, Facebook style, even though the inner source term was created at PayPal.
In order to do this, what you need to do, in order to measure how well you're doing, the whole idea is that you need to first know who is in each team.
Unfortunately, that is not in your Git data set, right?
So you're going to need to mix it with some other dataset,
HR dataset or whatever it is.
So Looker or Power BI or I think that even Tableau,
they would allow you to do these kind of things.
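To make the mixing concrete, here is a small sketch of joining commit counts from the engine with a hypothetical HR export (an hr_teams.csv with author and team columns); both the table names and the CSV file are assumptions used only for illustration.

    # Illustrative join of Git data with a non-Git data set.
    import pandas as pd
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    commits = pd.read_sql(
        "SELECT commit_author_name AS author, COUNT(*) AS commits "
        "FROM commits GROUP BY commit_author_name",
        conn,
    )
    teams = pd.read_csv("hr_teams.csv")  # hypothetical HR export: author, team

    per_team = commits.merge(teams, on="author").groupby("team")["commits"].sum()
    print(per_team.sort_values(ascending=False))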
You can look up the repo URL on GitHub
even if it's a private GitHub repo as well.
Yeah, yeah, yeah.
So the cool thing is that...
Because you have teams at the org level,
so you could look up, not so much by the repo,
but by the repo URL.
Yeah, you could even check it.
Yeah, yeah, yeah.
So the thing is that all of that,
that is the GitHub API.
So GitHub, we work with any Git repository.
So a lot of the concepts that we work with are Git for now.
So that's why, for instance,
the organization is a GitHub or GitLab concept.
You could expose it from a different data set.
Just download the whole thing, put it in MySQL, and that's it.
You can do that too.
And that's actually really powerful because you can then mix it with,
if you have financial data or things like this,
you can try to see correlations.
One that I like is the correlation of,
if we're writing, can we correlate the number of commits
with the money we're making?
Are developers, when developers write a lot of code,
is it good?
Or maybe it's bad,
and you should stop your developers
from writing more code.
Or is there no relationship at all,
and it just doesn't matter?
Yeah, yeah, yeah.
So all of these things, like,
Interesting.
Once you expose all of that data,
our idea is data analysts and data scientists,
they're going to be able to do really cool stuff with that.
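As a toy sketch of that commits-versus-revenue question: once both series are lined up in pandas, the correlation itself is one line. The numbers below are placeholders, not real data, and the example claims nothing about what the answer would be.

    # Toy example with made-up numbers: align commits per month with revenue
    # per month and look at the correlation. A value near zero would suggest
    # no relationship at all.
    import pandas as pd

    commits_per_month = pd.Series({"2018-01": 410, "2018-02": 380, "2018-03": 450, "2018-04": 300})
    revenue_per_month = pd.Series({"2018-01": 1.2, "2018-02": 1.1, "2018-03": 1.4, "2018-04": 1.3})
    print(commits_per_month.corr(revenue_per_month))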
It sounds, though, like Sourced Engine is, let's maybe use an analogy.
I'm a painter.
I want to paint a painting.
But it sounds like Sourced Engine is just a brush.
I still got to apply the right kind of colors and understand color theory.
So it's a tool to get there, but it's not the recipe to get there.
Yeah, I would say, following that metaphor, I would say that sourced engine,
it wouldn't even be like the brush or anything.
It would actually be deeper than that.
It would be like the thing that makes the paint for you, right?
It's like you're going to be extracting all of this data.
And then with the data, you're going to be painting something.
You're going to be creating your dashboard.
You're going to be proving your point, right?
Like data and statistics, those aren't new.
We've been using statistics to prove our point for quite a while.
So the idea is that for data analysts,
if you tell a data analyst,
oh, yeah, you should use Git in order to find this.
So let me explain to you how Git, let's say,
Git log to start with, right?
Like Git log works, and how branches work,
and how commits work, and what is a merge commit,
all these things.
The data analyst probably stopped listening to you
like five minutes ago.
So the idea of exposing all of these concepts
in a way that data analysts understand
is actually really powerful.
Because data analysts, also data scientists,
and also machine learning scientists.
So is the interface a config file of queries?
Is it a dashboard of queries?
What's the interface to sit down and get something done?
So the interface, there's a couple ways of doing it.
One is the source{d} engine itself, which is a SQL interface.
We do have a playground that allows
you to have a list of all of the tables,
it understands well what a tree is, and shows it better
than what you'd see with a traditional SQL client.
But I think that the best way to do this is actually with Jupyter.
Jupyter notebooks works incredibly well.
That's what I've been using.
Because then it allows you to, you know, you have your text describing what you're measuring.
Then you have your SQL query that is sent. Then you use the result with a little bit of Python.
And then you generate your graph,
and all of this stays in the same place.
That's what I've been using,
and it's a really great experience, to be honest.
I like that better.
But if you want to use the MySQL client
from your command line, that works too.
It's just MySQL.
So anything that works with MySQL works with the engine.
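A typical notebook cell for that workflow might look like the sketch below: send a SQL query to the engine, load the result into pandas, and plot it. Column names such as commit_author_when are assumptions about the schema, not confirmed ones.

    # Illustrative Jupyter cell: commits per month, queried over MySQL and
    # plotted with pandas (matplotlib needs to be installed for the plot).
    import pandas as pd
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    df = pd.read_sql(
        "SELECT YEAR(commit_author_when) AS year, MONTH(commit_author_when) AS month, "
        "COUNT(*) AS commits FROM commits GROUP BY year, month ORDER BY year, month",
        conn,
    )
    df["commits"].plot(title="Commits per month")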
So the example you've talked us through is,
and I'd love to see if you want to go into some of the data
that you pull back for Kubernetes.
Oh, yeah.
If you want to share some of that.
But I also want to mention that that's an example that you've done.
What's an example of, say, someone who's a customer,
an adopter of your open source,
and they're using Source Engine in ways that you weren't expecting?
Share some of the imagination of users.
We've been working with especially large banks.
And large banks are really interesting because they have an incredible amount of source code.
And that incredible amount of source code goes from their cloud-native Kubernetes, Docker Compose, stuff like that, and COBOL.
They go all the way back to having COBOL.
When you tell them that they're going to be able to measure the technical
debt for them, it's like, Oh yes, like let's do this.
Because they're all about debt, you know.
No, but once you tell them, it's like, oh, you know,
like many banks, they do not really even
know how much code they have, right?
There's so much of it that when you tell them, okay, so how much COBOL do you have?
How many lines of code?
Right.
How many lines of code of COBOL do you think you have?
And they're like, in between 100,000 and maybe half a million.
And it's like, well, if you're going to put some budget
to go and rewrite that in something more modern,
good luck with that estimation.
So the idea is that we're going to be able to bring
all of this data to them,
so they're going to be able to make informed decisions.
There was counting lines of code per language,
which for us is literally a group by query.
Like, it's super simple to do.
For them, it was like, this is actually really interesting.
The other change that lots of banks want to do is going back to the inner sourcing, right?
They want, large banks, they have many IT groups all around their organization, and they want them to work together well.
And the first piece is to figure out who is doing what, what
resembles what, how much code duplication you have.
Like, we have a thing that analyzes code duplication,
not character by character, but rather extracting
the abstract syntax tree, modifying a couple of things.
So it's actually a very smart way
of figuring out whether two pieces of code
are very similar, right?
They're kind of so similar
that if you saw them next to each other,
you would say you need to refactor them
and just write one function, right?
We're able to detect these automatically.
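source{d}'s detector works on their universal ASTs across languages; as a tiny stand-in for the idea, here is a Python-only sketch that parses two snippets, strips the identifier names out of the trees, and then compares the normalized results.

    # Stand-in sketch, not source{d}'s implementation: two functions count as
    # near-duplicates if their syntax trees match once names are normalized.
    import ast

    class Normalize(ast.NodeTransformer):
        def visit_Name(self, node):
            return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

        def visit_arg(self, node):
            node.arg = "_"
            return node

        def visit_FunctionDef(self, node):
            self.generic_visit(node)
            node.name = "_"
            return node

    def normalized(src: str) -> str:
        return ast.dump(Normalize().visit(ast.parse(src)))

    a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
    b = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"
    print(normalized(a) == normalized(b))  # True: same shape, different names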
And this helps a lot because if you imagine
you're like the CTO of a bank
and they tell you, it's like, okay,
so you have this code base that dates from
the 60s, and please put it on the cloud, right? That's hard. That is a harsh thing to ask of
anyone. So the idea of being able to tell them, well, actually, out of all of this source code, let's
see which parts are going to be the easiest ones, like this MTA, modernizing traditional
applications, which is not really cloud native, but, you know, we can make it be cloud native.
We can make it run on Kubernetes. And then what is the COBOL that, you know, that's going to be an interesting challenge to migrate, right?
So having a view of all of this by just running a couple of queries is really powerful.
The other option is literally running, like, I'm going to be very helpful and say that
you could run a really huge bash script calling Git very, very often, and maybe you will get
something similar.
But it would take hours instead of seconds.
Yeah.
I think the thing I'm trying to drive out here is that clearly you can pull back a lot
of intelligence.
Oh, yeah.
If you know what you're looking for, right?
So it seems like maybe some consulting is involved there,
or at least the right kind of teams in place
that know how to ask those questions, like a data analyst, for example.
We're not that interested in the consulting side of things.
Well, not so much as a company,
but it seems like the intermediary there
is someone who knows how to use Sourced Engine.
There's going to be service integrators.
There's going to be people not only installing the thing, but also.
So you're kind of building out an economy even almost.
Yeah.
With the final product, you'll eventually have an enterprise version of it.
And or also enable others to make sense of data.
Yeah.
Nowadays, you have many consultants that are helping with these kind of tasks.
Right.
What we're building is a super powerful tool for those consultants.
Right.
So, and then, I mean, they're going to be able to run it internally and keep it as a dashboard.
So, you know, it's observability.
It's all about seeing where we are right now and see where we want to go and how to get
there.
Once you get there.
There haven't been a lot of tools on this front.
I mean, maybe, I mean.
Observability on source code, not really, no.
I mean, at this level, what you're doing, it kind of reminds me of, you probably know his name.
Felipe, I believe, worked at Google, would do these things.
Felipe Hoffa.
His post, I mean, this seems almost like you were inspired by the work he's done.
I actually wrote a blog post that was like the one that Felipe worked on.
But actually, I'm pretty sure that mine got more views.
So, hey, Felipe.
No, but like the idea was I was trying to analyze all of the source code that we had on BigQuery,
analyzing which is the most common package name or which is the most common package we import and stuff like that.
And it was everything.
It was cool to do, but also regular expressions everywhere.
Our idea is that it's kind of similar to that.
But imagine that it's a better interface.
Instead of saying, oh, I'm going to have like, oh, find the function names.
Well, it's going to be func, dot star, space, then something that starts with a letter, whatever,
like that's a pain to write. And also, what if now you don't have a go function, but you have a go
method? Actually, that will not work anymore, right? So what we're doing is instead allowing
you to extract the tokens that you care about. So we work with this concept that we call
universal abstract syntax trees.
And the whole idea is that it's an abstract syntax tree, so the result of parsing a program.
But it allows you to extract things by using annotations.
And those annotations are universal, right? So say a function is a function no matter what programming language you have, right?
An identifier, same thing.
Strings, same thing.
So if you want to extract the function names,
what you need to do is basically use the UAST function.
You pass the content.
You pass what language you want to use.
And then you just pass something that it's an XPath thing that
basically says the function names.
Same thing will work for Go, for Python, for Java,
for no matter what programming language you're trying to use.
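Strung together over the engine's SQL interface, that extraction could look something like the sketch below; the exact UAST() signature and the XPath-style selector are assumptions modeled on the description above, so check the engine's documentation for the real names.

    # Illustrative only: UAST()/LANGUAGE() signatures and the XPath query are
    # assumptions; the point is one selector covering every language.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", database="gitbase")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT file_path, "
            "       UAST(blob_content, LANGUAGE(file_path, blob_content), "
            "            '//FunctionGroup//Identifier') AS function_names "
            "FROM files LIMIT 10"
        )
        for path, function_names in cur.fetchall():
            print(path, function_names)
    conn.close()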
So that is the kind of power that, yeah, you
could use it with an incredibly, like,
I would love to see the regular expression that
does the same thing if someone has the time to write it.
Just for fun, right?
Yeah, yeah.
Just for pain.
But that would take super long time.
And even once you're done, if you
ask the person that wrote it, are you completely sure
that this covers all the cases? Probably the answer is no.
Yeah.
So we are making it a much more reliable and easy way to extract the same information that you could in some other ways.
This episode is brought to you by Clubhouse.
One of the biggest problems software teams face is having clear expectations set in an environment where everyone can come together to focus on what matters most, and that's creating software and products their customers love. The problem is that software out there trying to solve this problem is either too simple and doesn't provide enough structure or it's too complex
and becomes very overwhelming. Clubhouse solves all these problems. It's the first project
management platform for software teams that brings everyone together. It's designed from the ground
up to be a developer first product in its DNA, but also simple and intuitive enough
that all teams can enjoy using it.
With a fast, intuitive interface, a simple API,
and a robust set of integrations,
Clubhouse seamlessly integrates
with the tools you use every day and gets out of your way.
Learn more and get started at clubhouse.io slash changelog.
Our listeners get a bonus two free months after your trial ends.
Once again, clubhouse.io slash changelog.
And by Raygun.
Raygun recently launched
their application performance monitoring service,
APM as it's called.
It was built with developers and DevOps in mind,
and they are leading with first-class support
for .NET apps, and it's also available as an Azure App Service.
They have plans to support .NET Core, followed by Java and Ruby, in the very near future.
And they've done a ton of competitive research between the current APM providers out there.
And where they excel is the level of detail they're surfacing.
New Relic and AppDynamics, for example, are more business oriented, where Raygun has been built with
developers and DevOps in mind. The level of detail it provides in traces allows you to actively solve
problems and dramatically boost your team's efficiency when diagnosing problems. Deep dive
into root cause with automatic linkbacks to source for an unbeatable issue resolution workflow.
This is awesome. Check it out. Learn more and get started at raygun.com
slash apm.
Let's give a prescription then for those listening out there thinking, I mean, I'd love to find some intelligence, love to come in on a Monday morning with greater intelligence about my code
base.
Give some examples of how someone listened to this, whether they're in a larger team,
smaller team, their own project, whatever.
What's a good prescription for getting up and running?
So I would say that the best place to start is like,
you go to source.tech and check the engine, download it.
It's a little binary and it only has one dependency,
which is Docker.
So probably you already have it on your computer.
I was curious why, on the Mac OS installation process,
you didn't use Homebrew.
I know it's just a binary.
I found you just put it in your bin folder, but I was wondering why.
Oh, we don't have Homebrew yet, but we'll get there.
I was like, the process to install seems so simple.
It's like probably a simple Homebrew recipe as well.
I need to work on that, but, you know, it's been a busy week.
But just doing any Mac OS install, I'm always, you know,
expecting a Homebrew process or something specific to the way a language
installs certain things.
There's an issue somewhere to implement that.
Sorry for pulling that off.
Oh, no worries.
No, but so once you have that binary,
however you install it,
the idea is that you can just run something like
source SQL, right?
And now you are inside of a SQL client
and you are querying all of the code that you found,
all of the Git repositories that you found
from the directory where you were, right?
So now the cool thing is that you can start by doing things like,
you know, count the commits that you have per month
or something like that.
That is actually very interesting
because you can see how much the team has been working over time.
Or you can count the number of lines or things like this.
These seem like pretty simple things,
but even those are actually going to show weird things.
For instance, for Kubernetes, I was like,
I'm going to count the number of lines of code.
It goes up, right?
Sure.
But it also goes down eventually.
And it is really weird, because it goes down
by a lot of lines of code.
And I started looking around.
So I had some fun deletes and stuff.
Yeah, and I was like, what is going on with this, right?
So there's actually, no matter what data set it is, you're going to be finding a lot of cool stuff because those are organic data sets, right?
We keep on committing all the time and we're going to make mistakes.
You're going to see from time to time, like the number of files goes up by like thousands
and then goes down again.
And then you look at that, it was like,
huh, someone vendored the dependencies that they were not supposed to.
Right?
Like all of these things are...
Yeah, like, whoops.
But yeah, all of these things, you're able to see more information.
And the thing is that as soon as you start playing with this,
at least in my experience,
the more answers you get, the more questions you get.
Right?
Like, okay, so I saw this, but what happened with this thing?
Or you can also find things like
something that, it's a really
cool game, I'm going to be open sourcing it soon.
It's, have you ever heard about
the degrees of Bacon?
Oh, yes. Kevin Bacon, yes.
Kevin Bacon, yeah. Six Degrees of Kevin Bacon.
Six Degrees of Kevin Bacon. So you can do the
Six Degrees of Kevin Bacon, but
on Git,
trying to figure out.
So for me to say, I don't know, someone
famous in the Go community, Rob Pike, how many degrees
are there?
So for me, I edited a file that was edited by someone else,
that edited another file that was edited by someone else that
was edited by Rob Pike, something like that.
You can actually extract that information from Git, right?
So you can do, like, you can extract really useful insights for your business,
but also you can build pretty cool games.
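A small sketch of that game: build a graph that links authors to the files they have touched, then walk it breadth-first between two people. The edge list below is a toy; in practice you would pull the (author, file) pairs out of the engine with a query like the earlier ones.

    # Toy "degrees of separation" over Git history: authors connect through
    # the files they have both edited.
    from collections import deque

    edges = [
        ("alice", "parser.go"), ("bob", "parser.go"),
        ("bob", "server.go"), ("carol", "server.go"),
        ("carol", "runtime.go"), ("rob", "runtime.go"),
    ]

    graph = {}
    for author, path in edges:
        graph.setdefault(author, set()).add(path)
        graph.setdefault(path, set()).add(author)

    def degrees(start: str, target: str) -> int:
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            node, dist = queue.popleft()
            if node == target:
                return dist // 2  # each author-file-author hop is one degree
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
        return -1

    print(degrees("alice", "rob"))  # 3 in this toy graph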
So that's the thing.
It's like, have fun with it.
It's data.
So if you've ever done any kind of data analysis, I mean, it's called data exploration for a reason.
You do not necessarily know what you're going to be finding,
but that's the whole game, right?
You're going to be able to extract some things.
And then if you're actually interested on some specific metrics,
check out the Kubernetes blog post that we wrote
where you're going to have all of the different queries that were run,
and you can run the same thing for yourself
and see, for instance, the trends on what programming languages are you using.
How are they growing?
Are you using more Go than before?
Or maybe you're using more Java or Python?
All of these things are going to appear very
clearly on your graphs.
What do you say
to maybe some for-profit, say, SaaS competitors chasing this
rough idea, which is basically data-driven intelligence in a
development team.
So look at our code base, our repository, learn some insights.
What do you know about other competitors, and how do you see Sourced moving forward,
coming from OpenCore, eventually going to have your own products
and different ways you can sustain financially?
I mean, you are a company, so eventually you've got open source,
but that's only going to last so long
until you actually have to create some products
to generate revenue.
Where will you be at in this space?
So it is hard to answer
because we are somewhere in between many different fields.
There are some companies that do metrics,
like software metrics,
but the thing is that the software metrics they provide
are the software metrics they provide.
That's it, right?
So you cannot tweak them.
Yeah, like you can get like...
You have no visibility into...
Yeah, like you can choose software metrics.
Some of them might be really interesting.
Like you can do like lines of code and stuff like that,
number of commits.
But also you can do things like cyclomatic complexity, right?
It's a really cool concept, but probably doesn't apply to you.
Like what you want is things like, actually, what
I care about is how many comments
do I have per function?
Like, are my functions correctly commented or not?
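As a plain-Python stand-in for that kind of custom metric, the sketch below counts what share of functions in a file carry a docstring; a real pipeline would ask the same question through the engine's UASTs so it works for any language, not just Python.

    # Stand-in for a "comments per function" style metric: the share of
    # functions in a Python file that have a docstring.
    import ast

    def documented_ratio(source: str) -> float:
        tree = ast.parse(source)
        funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
        if not funcs:
            return 1.0
        documented = sum(1 for f in funcs if ast.get_docstring(f))
        return documented / len(funcs)

    sample = '''
    def good():
        """Explains itself."""
        return 1

    def bad():
        return 2
    '''
    print(documented_ratio(sample.replace("\n    ", "\n")))  # 0.5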
Those things, probably what you want to do
is express exactly what you want. And that's why
I think that what we're building is something that many of the companies that compete with us,
they could be powered by us, really. That's what I was thinking. It's almost as if you're
building their future tools. If they've done what they've done and they've gotten maybe, say,
two or three years into their business, but they don't have the tooling, they may actually
retrofit their business to essentially become a service provider
on top of Source Engine, for example.
If they're interested in doing that, talk to us.
That is the thing, right?
What we're building is...
So Source essentially is a standard.
Source Engine at least could become
an open source standard for data intelligence on code bases.
Yeah.
The idea is we want to extract data from source code, right?
Right.
The most common way of storing source code is Git.
The most common way of analyzing data is SQL.
So we just put them together.
And that is our first product,
but we actually built it to extract information
that then we can use to train models
and do machine learning, right?
We believe that many people are interested
in doing that kind of thing,
and we want them to do it. Because at the of the day if we if we end up being successful our
code review tool which called lookout it will provide an opportunity to write analyzers right
to basically classify a piece of code as does this contain some specific thing or not so does
this contain a bug does it not this contain lint error or not? So does this contain a bug? Does it not?
Does this contain lint error or something like that, right?
So those can be done with completely traditional tooling,
like linters and stuff like that.
But also we believe that many of them will need machine learning.
We cannot build all of those things.
We're building the platform so other people can build on top of us.
So what you're talking about is a product called Lookout.
It's in beta right now.
You can request a demo, obviously, if you wanted to.
So the Sourced Engine is in beta.
I think that Source Lookout is in alpha, probably.
It says here on your site beta.
Really?
Yeah.
That's probably a mistake.
Sign up for the beta.
I see it right here.
I'll talk to my team.
I'm pretty sure that is an alpha normally.
Sign up for the beta for the Kubernetes.
The source engine beta, yes.
I'm pretty sure Lookout is still alpha.
But anyway, it's still also, again, completely open source.
You can check it out, run it on your project, etc.
We do want to, we do not think that
running the engine as a
SaaS, as a software as a service
makes much sense because people
do not want to send their code to random servers.
But the source code
analysis, sorry, the code
review, assistive code review,
we want to make that a SaaS. So
eventually you will be able to just add it
as a GitHub application that just reviews your code.
We've done that for all of our projects
and it works really well.
It's able to warn you about,
hey, this piece of code is suspiciously similar
to that piece of code in that dependency.
Did you copy paste it?
Or maybe you should be calling that function, right?
There's a lot of really good hints
on what you should be doing,
and we want to have more and more on that.
And those probably will have eventually a SaaS version
that you can just click a button,
install it in your repositories in your GitHub or GitLab,
and that's it.
For many people,
the people that really care about deep analysis of large code bases,
they tend to also not want
to share their source code. So for
that, it doesn't make that much sense to have a SaaS
for the engine.
So if
folks sign up for the beta, what can they expect?
You know, what's...
Sorry, alpha. I'm correcting that.
That's what I was trying to do, is tee up the fact that it's sort of an early release.
Maybe you're even looking for feedback.
Yeah.
So that's the whole point is we are trying to get people to use the product, file issues, let us know what they think.
File issues for things that are not going to work, but also for things that they would like it to do, right?
This is a pretty young project.
We released it two months ago, I want to say, something like that.
So it's pretty early on.
And the idea is that we're going to be working with really large companies to try to make it as good as possible.
But at the same time, we also want to have the input from the community
because they have different needs, right?
So we don't want to end up having something that targets only large companies, but it's pretty useless for developers.
We want to build something that everyone can get something from.
Large companies, they're going to have some specific analysis and some specific things
that that's what our enterprise edition will have.
But also our free edition will always be free.
We want people to make sure that that
becomes as good as possible.
And also, if you feel like it, contribute.
It's green and go.
It's a really cool project.
We use a lot of open source.
We use Pilosa, which is for making indexes on SQL.
We use Vitess, which is a Google thing
that YouTube created
to sit between their Python code and MySQL.
So we grabbed all of the SQL parsing and stuff like that from there.
We use regular expressions from, I forgot the name of the library,
but yeah, no, I totally forgot the name of the library,
but it's also open source.
So we are open source.
We use everything in open source.
And for now, we are analyzing also open source.
So, you know, open source everywhere.
I was just thinking about that.
Now, any future plans, any sort of list you've got running right now for future blog posts of different analyses on different code bases?
Or have you got any requests?
So we've gotten a couple of requests.
Yeah, absolutely. What did you call it?
Would you call it request for analysis?
Request for analysis.
Yeah, that's a good name.
So we did this analysis in Kubernetes.
And as soon as we did it, there were some people saying, oh, what about if you do it
for the competitors of Kubernetes, right?
So Cloud Foundry, stuff like that.
Like, people want to see how mature they are, stuff like that.
I think that the next analysis that I want to do,
I want to do it in a different language.
So Kubernetes was mostly Go.
I want to do it for TensorFlow
because it's also a huge community
and it's a different language, mostly Python.
Lots of C too, I think.
So trying to figure that out
and probably in that analysis,
when I'm going to open source that,
the six degrees thing,
and it's obviously going to be six degrees of,
what's his name?
I forgot.
Dean, one of the creators of TensorFlow,
like Jeff Dean, that's it, Jeff Dean.
He's one of the big creators of everything machine learning related at Google.
That's him behind it.
So yeah, if you're a contributor to Kubernetes,
how many degrees away are you from Jeff Dean?
I think it could be an interesting thing to do.
Yeah.
Yeah.
Also, if you have ideas on how to analyze this data from different axes,
also super interested with that.
So if you have follow-up questions or projects that you would like to see analyzed, yeah, let us know.
We're going to be working on those, trying to get one per month at least.
Because we've seen a lot.
It's probably really good for growth.
Yeah, it's really good for growth, really good for adoption, and also really good for us.
Because really good to see whether every analysis that we want to do,
whether it's doable or not.
So there were some things that, you know, like silly example,
EastUpper was not supported.
So now we're going to be supporting EastUpper.
You become a user too.
Yeah, I am the user.
It's also QA.
You're QAing your product essentially by doing some good exercise.
You know, developer relations, customer zero, all of those things.
I still keep on doing those things.
It's very, very useful. So if people
have ideas, let me know.
One thing I love too,
just to mention your website,
I love when community is in the main
nav of open source
based companies because
far too often it's like, where is the community? Who is the community? How is it represented? And how can I talk to the contributors? Too often it's just too many clicks, or it's hard to find out.
Yeah.
No, who's involved in the team? How can I talk to somebody? How can I get... what's my on-ramp, you know? I get questions, maybe it's a 101. I prefer... and right here you have Community, and the second one down is Talk to Us.
Yeah, we deeply care about community.
There's some really active,
we have a very active Slack community
with a bunch of different channels.
Machine learning is one of them, super active.
People are there talking about what they want us to build
and stuff like that.
We also have language analysis.
If you're a language analysis geek,
we are doing a lot of really cool stuff.
The number of conversations that I've had about like Rust weird things or even Lisp or how to
parse COBOL and stuff like that, it is really cool. Like I'm a language nerd. I love different
languages and I'm having lots of fun because of that. So yeah, even if you're not necessarily
interested in what we're building, which is this analysis engine,
and you're interested just in some of the details,
I think there's a lot that you can learn from that.
The concept of universal abstract syntax tree
is being used by other engineers
to do things like security analysis of source code,
things like this.
So check it out and join us
and let us know what you think.
And if you're working on something, it's always good.
We have our mailing list, biweekly mailing list,
that was supposed to come out today,
but there was no time for me to write it,
or for Victor, our head of community.
And in that mailing list,
we always have at the end of the mailing list,
we have a highlight on someone from
the community that has done something cool, right? So we really, really care about community.
Yeah, join us. It's a good community. And I'm sure that's probably the way you hire too, is probably from...
Yeah, we've been hiring... we've hired members through that, at least in this one way.
By the way, we are hiring.
That's a good point.
Yeah, so we are trying to figure out,
like, we have engineers that have been hired through this.
We have also people hired through, they wrote about us,
about like, oh, I've discovered this, wrote a blog post,
and now they're going to be joining us soon.
So yeah, like, we're definitely hiring
for so many different people.
Machine learning experts, language analysis experts,
people in product management, people in developer relations.
And the team is distributed, I assume?
The team is very distributed.
The CEO is remote, just to give you an idea.
So the CEO is in Lisbon.
We have people in Seattle, San Francisco, Madrid, London,
and then somewhere in France, somewhere in Poland, somewhere in Russia, somewhere in Ukraine,
somewhere in many places
So the good thing is, all these jobs you have open...
All of these jobs, worldwide, most of them. Except there's a couple of them that are actually specifically for San Francisco, but all of them are completely distributed, so you can work from wherever you feel like.
Well, Francesc,
it's been a pleasure
to meet with you.
Thank you. I've known you for years, but just never
really had a chance to sit down and have a conversation with you.
This is the first time. It's kind of a bummer, actually,
but good at the same time.
Let's just make it not the last time.
That's right. Let's make it not the last time.
Basically, you know, we're looking
for feedback. We're looking for participation.
So just go check out
source.tech and then
find the community and join us.
Cool. Thank you.
All right, man, thank you so much for your time. Appreciate it.
All right, thank you for tuning in for this episode of the Changelog. If you enjoyed
the show, do us a favor: go into iTunes, rate the podcast, leave us a rating or review, go into
Overcast and favorite it.
Tweet a link to it.
Share it with a friend.
Of course, thank you to our sponsors and our partners, Linode, Clubhouse, and Raygun.
Also, thank you to Fastly, our bandwidth partner.
Head to Fastly.com to learn more.
And we move fast and fix things around here at Changelog because of Rollbar.
Check them out at Rollbar.com slash changelog.
And we're hosted on Linode cloud servers, linode.com slash changelog.
Also, special thanks to our friends at Cloud Native Computing Foundation for bringing us to KubeCon Cloud NativeCon.
It was awesome to be there.
If you want to hear more episodes like this, subscribe to our master feed at changelog.com slash master or go into your podcast app and search for ChangeLog Master.
You'll find it.
Subscribe.
Get all of our podcasts in a single feed as well as some extras that only hit the master feed.
Thanks again for listening.
We'll see you soon.