The Changelog: Software Development, Open Source - Sourcegraph the 'Google for Code' (Interview)
Episode Date: August 26, 2016
Beyang Liu, the CTO and co-founder of Sourcegraph, joined the show to talk about the backstory of Sourcegraph, how it works, how they're aiming to be the 'Google for Code', ideas around offline support for code search, how it's licensed, and their new software license called Fair Source.
Transcript
I'm Beyang Liu, and you're listening to The Changelog.
Welcome back, everyone. This is The Changelog, and I'm your host, Adam Stacoviak.
This is episode 217, and today, Jared and I are talking to Beyang Liu.
Beyang is the CTO and co-founder of Sourcegraph, and Sourcegraph is aiming to be the Google for code.
We talked about the backstory of Sourcegraph, how it works, ideas around offline support,
how it's licensed, which led us to talk about their new software license called Fair Source.
We have two sponsors today, Linode and Datalayer, a one-day event organized by our friends at Compose.
Learn more at datalayer.com.
Our first sponsor of the show is our friends at Linode,
cloud server of choice here at Changelog.
Get a Linode cloud server up and running in seconds.
Head to linode.com slash changelog to get started.
Choose your flavor of Linux, resources, and node location.
Plans start at just $10 a month.
You get full root access, run VMs, run containers.
You can even manage your Linodes
from the comfort of terminal using Linode CLI. They've got SDKs in Python, Perl, PHP, Ruby,
JavaScript, Node.js, so you can hack away on your Linodes with their API. Take advantage of add-ons
like backups, node balancers, DNS manager, and more. Again, use our code CHANGELOG20 for $20 in credit with unlimited uses. Tell your friends.
Head to linode.com slash changelog, and now on to the show.
All right, we're back.
We've got Beyang Liu here from Sourcegraph.
Jared, we like to trudge through open source, right?
And not just open source, but the details of it, the functions, the language, and see where their use cases are.
And this is exactly what Sourcegraph does. So, Beyang's here, obviously, to tell us about his company, but also all the cool open source
they're doing at Sourcegraph.
Yeah. And I feel like Beyang's kind of,
he's been on Beyond Code and then recently he was featured on Go Time and now he's on the
Changelog. So, that's the...
He's making his rounds.
He's like hitting for the cycle.
Yes.
Thank you guys for having me.
It's great to be here.
Well, Beyang, let's begin with your origin story.
I think that, you know, you graduated from Stanford, you've got a unique path to where you're at today.
But aside from working at some cool companies and figuring out some developer problems,
where did things actually begin for you?
Like, how far back do we go to figure out where you got your interest piqued around
open source or around software development?
Well, if you want to go origin story, I guess I should start with my birth.
I was born in China, but I was raised in the Midwest.
I always like to mention that in case there are any Midwesterners out there listening.
You're talking to Midwesterners.
Yeah.
Oh, yeah.
Jared, you're out in Nebraska, right?
I'm in Nebraska, and Adam's in Texas.
So there you go.
Nice.
That's awesome.
No wonder you guys are such nice guys.
Yeah, so I grew up in the Midwest, but I came out to California for high school.
And I think I first got into programming just,
you know, I had to buy a TI-83 graphing calculator for some, I think it was like high school geometry.
Yes.
And I happened to get the version of the calculator
that came with like the 500 page reference manual,
which not all versions come with.
But this thing is like,
like it's got everything you ever would want to know about the TI-83 calculator,
and it includes a section in the back that teaches you how to write
the dialect of BASIC that they have on the calculator.
And so I would, when I was taking the bus back and forth from school,
I would just kind of whip that book out and try to program stuff on the calculator
in my spare time, you know, program some cool animations or some, you know,
automated formula calculators. And that's kind of how I got into it. And I liked it enough that
after that, you know, my school offered a computer science class. I ended up taking that.
Can I stop you for a second, Beyang?
Because I had the TI-86 in high school,
which is pretty much the exact same calculator.
And mine also came with the manual.
But mine came with something that, to me,
was better than the manual,
which was the game Nibbles.
Did you have that one on yours?
No, I did not have Nibbles.
See, now this could have changed the course of your life
because I had Nibbles.
Therefore, I was not going to program anything into that thing.
I just tossed the manual out
and just played Nibbles the entire way to school.
Yeah.
So lucky you.
The TI-86, I think, had a slightly faster processor.
I was always envious of the folks who had that.
Maybe that's why it came with Nibbles stock
and yours didn't.
That's probably why.
Yeah, it just had just enough RAM to run Nibbles.
Exactly.
Anyways, keep going.
Yeah, I had a great teacher
by the name of Mr. Olivares in high school.
He was great at just laying down the facts for computer science.
Ended up kind of loving it, went off to college.
I knew I wanted to do something math and science related.
Computer science just seemed like the perfect marriage between stuff that was theoretically interesting,
but also stuff that would have kind of a real world impact.
So that's kind of how I got into this whole thing.
So you got this calculator, obviously.
And Jared, you mentioned that you had one similar, the TI-86, and Beyang, you got the TI-83.
And Jared, many people that come on this show, their origin stories sometimes begin with gaming. Whereas with Beyang's history, it sounds like,
and correct me if I'm wrong, but it sounds like what you're saying is that
you were really interested in the sciences, which I think most computer scientists are anyways,
but you're kind of interested in sciences, but more importantly, the things
that you can actually implement today and change the world around, versus being
interested in simply just games to get you excited about that. Is it fair to say that, or is that not the truth?
Yeah, you know, I'd like to think I had so noble a mentality back in high school, but to be
honest, I think the reason why I never got into Nibbles or any other calculator game was
I just had no patience for reading through
how to install those things.
And the calculator didn't come with any games pre-installed.
And I Googled some stuff on how do I install...
I think the game that everyone else was playing
was Penguin, which is this Super Mario clone.
And I could just never quite figure out
how to install that on my calculator,
and then I just gave up.
So it was really out of sloth and laziness.
I like that.
Well, laziness means you make a great programmer.
Another question might be,
do you still have the manual?
Do you still have the 500-page manual
lying around by any chance with notes in it?
Yes.
Bookmark and stuff?
Yes, I do.
It's still on my
bookshelf.
Wow, awesome.
Yeah, it sounds like you had kind of a straight and narrow path to where you
are in terms of education and desires. Lots of people are going to change; they're not
sure what they like. Maybe they find out through video games, maybe they find out through
reading books or whatever it happens to be. Other people take completely different course changes in life or in career before they end up being in software. Take us to where we met you. So this
was GopherCon, July 2015. You now have the Sourcegraph thing. Maybe it's a company at
this point, maybe it's just a side project, but you meet us there, you're into Go, and you have this Sourcegraph. Your answer to the most influential open source project for you was
SourceLib, was what you said when we asked you that question. So take us from where you just left off,
bring us all the way back up to the near present, which was July
2015. How'd you get there?
So I went to college, knew I wanted to do something math and science related.
After I took the very first CS class at Stanford, I kind of knew that this is probably the right thing, at least for the next four years.
So I declared the major.
I was fortunate enough to be accepted into a research lab as an undergrad.
Stanford has this great undergraduate research program called CURIS.
And so I landed in Daphne Koller's research lab.
And she was a great mentor.
She eventually became my advisor.
I really got into AI research.
For a while, I thought I was gonna get a PhD in
computer vision or machine learning, something like that. But after doing that
for a while, I kind of decided that industry was probably where I wanted to
be more. And so I started looking around for companies that I thought were doing
interesting things with, you know, large data sets. And
at that point in time, this is, you know, 2011, Palantir was a big presence on the Stanford
campus, and it seemed to me that they were tackling some really interesting problems
with large data sets and doing really impactful things in the world.
So I decided to join them, landed on the commercial side,
which basically works with a lot of companies in industry
to help solve their most important technological
and software-related problems.
And it's kind of there that I got to work closely with
my future co-founder, Quinn Slack.
We'd gone to school together and kind of knew each other from there.
But it was at Palantir where we really, you know,
got to spend some quality time together.
And that was also kind of a tipping point for me because
I think a lot of the roots of Sourcegraph were planted in that experience.
So Quinn and I are both CS majors by background,
so we both kind of had this pain that I think every programmer
feels, which is, man, it seems like it's harder than it should
be to find existing code and reuse it.
It just seems like I'm spending too much time searching the internet,
crawling through random forums, trying to find the answer to how to do this
pretty straightforward thing in code.
And so we felt that kind of day-to-day pain as programmers,
but the experience at Palantir kind of showed us that this is a problem that's not just relevant to programmers now.
It's actually relevant to, you know, say the top leadership at one of the big five banks in the U.S. because what we realized was, you know, right now we're kind of at this point where
software is becoming mainstream.
And what I mean by that is, you know, it used to be that for non-technology companies, you
know, companies that are outside of Silicon Valley, software engineering was
kind of an afterthought or just a small department or
that, you know, they might outsource it to some other firm.
But these days it's becoming more of a core competency.
You know, more and more of the core logic of the business is actually captured in the
logic of code.
And that's what we realized working at Palantir with the types of customers that we were working with.
And what we realized was as painful as it was for us,
the pain was felt 10 times as much outside of Silicon Valley
where companies aren't traditionally steeped
in all the different
processes and principles that we kind of soak up being immersed in the software development
world on how to run an engineering team and what tools to use to find the answers to everyday
questions.
And so we kind of took a step back and were like, hmm, this seems like a solvable problem.
You know, code is just another form of data.
And, you know, at Palantir, we're building all these fancy tools for other sorts of knowledge workers to analyze, you know, their data sets, but the tools that we seem to be using as programmers, both at Palantir and
at some of the customer sites that we're working with, still seem kind of primitive.
I mean, the top two code search utilities today are probably Google Search and Grep.
And Google is just kind of like the all-purpose, you know, fallback.
Like, we have no other recourse.
It's kind of like the Hail Mary.
Like, I hope somewhere someone has written a blog post or an answer out there that answers my question.
And grep is, you know, a great tool.
It's a powerful tool.
But it was written in the 1970s and hasn't really changed much since then,
even as the world of software has evolved around it. So then we kind of got
to thinking about this idea. We didn't start working on it right away. I went back to
school to finish up my master's. Quinn went off and started another company
with some folks from Palantir. And then we kind of serendipitously
met each other
at some house party in San Francisco.
Actually, it might not have been serendipitous.
I later learned that Quinn's
then girlfriend, now wife,
she knew that he was thinking about this problem and she knew that I was going to be at this house party.
So she kind of like orchestrated the whole meeting.
That's interesting.
Which is kind of funny.
But you must send her nice cards for Christmas and stuff.
Yeah, she's great.
But yeah, at the time, you know, it felt like, oh, you know, you're thinking about this as well.
We got to kind of talking and then we started just hacking on this and, you know, spent two years just building it out and testing both the technical side,
which a lot of people didn't actually think could be done initially when we started,
and also the product side, which is how do we actually make this something that people can rely on every day? And that, I think, brings us up to GopherCon 2015.
You know, we were a company at that point by then, but we're still relatively small.
I think we only had a handful of people.
But we had a good amount of traction by then, at least in open source, and it
seemed like, you know, we were definitely on to something. And it was exciting to go to
GopherCon and kind of share the tool that we'd built with the people and kind of see
their reaction.
Yeah, it's interesting that you said a very similar sentiment when we interviewed
you for Beyond Code that you just said here a few minutes ago.
And what you said then in the last summer was in the next 10 to 20 years, every interesting
company is going to become a software company at its core.
And so this seems like an insight that you've had over time and continue to believe to this
day.
Yeah, I really think, I mean, there's been a couple additional points of validation, I think. So, you know, have
you guys seen General Electric's most recent ad campaign? I think they aired it
during the Super Bowl, where they're kind of rebranding themselves: they're a
digital company that happens to do infrastructure.
Yeah, it's like they're
both, like, don't think about them that
way anymore. Now think of them as a software
slash hardware company.
Yeah, exactly.
That really indicates that they're
putting software first.
Another recent news item
was the recent
outage at Delta Airlines
where a software glitch
basically shut down the airline
for, you know, a day or more. And, you know, if we live in a world where, you know, a software bug
like that basically shuts down, like makes it so you can't do business, that means that
even as an airline, you know, you may think your core business is flying planes,
setting prices,
and all of that is done more and more so
in software.
I guess we've gotten this far here so far
with your backstory,
and we've mentioned Sourcegraph a couple times,
even in the intro.
I'm going to have to rewind myself and get upset
because I didn't actually say what Sourcegraph is, but we're getting close to our first break.
But before we go into that break, let's have you break down exactly what Sourcegraph is.
Obviously, you've kind of teed up some of the ideas for which Sourcegraph was built
around, but help our listeners understand.
Then when we come back from the break, we'll go a little further into it.
But what is Sourcegraph?
Sourcegraph is basically global jump to definition,
find references, and documentation lookup
across all the code you use,
whether it's private or public.
And it understands the code at a semantic level.
So that means when you're jumping to a definition
or searching for something,
it knows the difference between a function call
and the occurrence of that
particular name in some random docstring.
So it basically, those are things that programmers do every day.
And it's a tool that helps you answer the most common everyday programming questions
in seconds.
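To make the "semantic level" point concrete, here is a minimal sketch in Python using the standard `ast` module. It is not how Sourcegraph is implemented; the sample source and the helper names `find_calls` and `find_text_matches` are made up for illustration. It contrasts a grep-style text match with a parser-based search that only reports real calls.

```python
import ast

# Hypothetical sample: `parse` is defined once, mentioned in a docstring, and called once.
SOURCE = '''
def parse(text):
    """Split text into tokens."""
    return text.split()

def main():
    """This docstring mentions parse but never calls it."""
    return parse("a b c")
'''


def find_calls(source, name):
    """Return line numbers where `name` is actually called as a function."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        # Only plain calls such as parse(...); attribute calls like text.split() are skipped.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == name):
            lines.append(node.lineno)
    return sorted(lines)


def find_text_matches(source, name):
    """Grep-style search: every line where `name` appears as plain text."""
    return [number
            for number, line in enumerate(source.splitlines(), start=1)
            if name in line]
```

The text search reports the definition line, the docstring mention, and the call; the parser-based search reports only the call.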
There you have it.
Let's take a break then, because we got tons of questions about Sourcegraph, everything
from licensing to what you're open sourcing,
how you choose what to open source, why you even open source,
and maybe some of the perspectives you have around how you license
the different software you have and stuff like that.
And this big idea of being able to be the Google of code, basically.
So let's pause here, take a break.
When we come back, we'll dive a little further in.
If you're focused on the data layer,
there's an awesome conference being put on by our friends at Compose.
Monolithic databases are so 20th century.
Today, teams are using a JSON document store for scale,
a graphing database to create connections,
a message queue to handle messaging,
a caching system to accelerate responses,
a time series database for streaming
data, and a relational database for metrics and more. It can be hard to stay on top of
all your options, and that's why you should attend.
While much talk in developer circles these days focuses on the app layer, not enough
attention is placed on the data layer, and data is the secret ingredient to ensuring
applications are optimized for speed, security, and user experience. Hear talks from GitHub, Artsy, LinkedIn, Meteor, Capital One,
and several startups, including Elemento and DynamiteDB.
Talks range from the Polyglot Enterprise to using GraphQL
to expose data backed by MySQL, Elasticsearch, and more.
The conference is in Seattle on September 28th.
Tickets are just $99, and Changelog listeners get 20% off. Head to
datalayer.com and use the code CHANGELOG when you register.
We're back with
Beyang Liu, CTO of Sourcegraph. And Beyang, before we took the break,
we obviously got an explainer of exactly what
Sourcegraph is, but it goes much deeper
than this.
I'm not sure if you coined this term or not,
or if this was The New Stack or
Susan Hall who wrote
this article, but the title is
Sourcegraph aims to be the
Google for code,
and being a public utility
for all developers out there,
you know,
being able to look up functions and dive into different usages of,
of open source,
whether it's private or public,
help us understand the beginnings of this company,
what this company was founded upon and why you actually built it in the first
place.
As far as the beginnings go,
you know,
it was Quinn and myself in the beginning.
And it really grew out of this itch
that we had ourselves as programmers, which was, we felt that a lot of the code that we were writing
was somehow duplicative. Either, you know, someone in our company had probably already written it,
or there's probably some open source library that we just weren't aware of, or just, you know,
couldn't figure out how to use that might save us a lot of time.
And I think almost every professional programmer
is aware of how often programmers reinvent the wheel every single day.
And we're trying to think about how we could encourage more code reuse.
What was the thing that was preventing us from going out and discovering the pieces of code
that we knew someone somewhere had already written,
but it was just too difficult to find it out?
And so we started thinking about it,
and what it came down to was,
well, look, code is actually really highly structured data.
I come from a machine learning background and natural language processing.
There's a lot of parallels between natural languages and programming languages.
But the difficult thing about natural languages is that
even to construct a simple parse tree from an everyday English sentence,
that's still an open research problem.
Whereas with programming languages,
you have this thing called a compiler or interpreter
that just gives you literally everything
you'd ever want to know about a block of code.
And once you have all that data,
then you ask yourself,
well, can I build a system on top of this
that helps me automate or partially automate the task of finding pieces of code,
of reading through existing pieces of code, and really understanding that piece of code
in a way that lets me use it.
And so that was kind of the itch that we were scratching.
And a couple other points of inspiration for us,
the stuff that we saw inside of Palantir
was definitely something that solidified our belief
that this was not only a problem that programmers everywhere face,
but it was also a problem that was important to leaders of large businesses.
And the other point of inspiration that we took was,
I had previously, you know, done an internship inside of Google, and Google
internally actually has this great utility. If you ever meet a software engineer who works
in the main Google code base and you ask them what they think about Google Code Search,
I guarantee you they will say it's the best thing since sliced bread.
Just ask them how many times they use it every day, how often they have it open in some browser
tab, and they'll tell you 60, 70, even 80% of the time, I just have it open as a reference.
And so seeing the value that that provided inside Google, and also just missing that tool and not seeing it anywhere else, we wanted to enable every individual developer out there
to go and take advantage of this giant corpus of human knowledge
that is open source code and code inside your company
and kind of build on the,
stand on the shoulders of giants, so to speak.
Definitely bit off a big problem in terms of just surface area,
I think, with things to do.
Because even once you have
the analysis done,
you're collecting all the data,
I'm sure you guys have
some sort of crawler or something
that's spanning the different code bases
and finding other pieces of code
that you can go index.
Then you have developers using all these different environments,
their editors, you have how many languages.
Was it ever overwhelming to say,
how can we provide support for all these popular editors
and then across all these languages
to where we're actually going to provide a holistic solution for people?
Yeah, so that was definitely kind of a sticking point in the early days.
And one of the first technical hurdles we had to overcome was,
how do we do this in a manner that's efficient?
How do we make it so that we're not, 10 years from now,
we're still writing the plugin for the umpteenth language that we want to support.
And that kind of leads into the creation of SourceLib,
which is the open source library that powers a lot of the underlying source code analysis
that gives you what you see on Sourcegraph. And the basic idea of SourceLib is,
look, as far as end-user applications are concerned,
applications that want to make code explorable and accessible,
so I'm thinking editor plugins,
things like Google Code Search or Sourcegraph,
most programming languages are basically the same. They all have a way to define things and name them
and reference those things in some other part of code.
So if we can kind of put the data that is the code
in a form where you just capture
kind of like that essential part of it
and it's a kind of common language agnostic schema,
then you can just build your end user application
on top of that single schema.
And then underneath,
you just have to build a bunch of translators
from different languages to that schema.
So that takes it from this problem
of having to build a specific library or plugin
for every combination of editor and language
to, okay, now you just have to build a translator
for every language to the schema,
and then once you have that,
you can build a single application
that understands all those languages at once.
It's like the adapter pattern for languages.
Yeah, exactly.
It takes it from an O of N squared
or O of N times M problem
to an O of N problem.
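The shared-schema idea can be sketched roughly as follows. This is an illustration under stated assumptions, not SourceLib's actual schema or API: the `Definition` fields, the regex-based translators, and every function name here are hypothetical, and a real translator would sit on top of a compiler or parser rather than a regular expression. With N languages and M tools, the per-pair design needs N times M plugins, while the shared schema needs N translators plus M tools.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Definition:
    """One record in the shared, language-agnostic schema (hypothetical fields)."""
    language: str
    name: str
    line: int


def _scan(source, language, pattern):
    """Toy translator body: a regex stands in for a real compiler front end."""
    found = []
    for number, text in enumerate(source.splitlines(), start=1):
        match = re.match(pattern, text)
        if match:
            found.append(Definition(language, match.group(1), number))
    return found


# One translator per language, each emitting the same schema.
TRANSLATORS = {
    "python": lambda src: _scan(src, "python", r"\s*def\s+(\w+)"),
    "go": lambda src: _scan(src, "go", r"\s*func\s+(\w+)"),
}


def list_definitions(files):
    """A single tool that works for every language that has a translator.

    `files` is a list of (language, source_text) pairs.
    """
    results = []
    for language, source in files:
        results.extend(TRANSLATORS[language](source))
    return results


def integrations_needed(languages, tools):
    """Compare the two designs: one plugin per pair vs. a shared schema."""
    return {"per_pair": languages * tools, "shared_schema": languages + tools}
```

Adding a new language here means adding one entry to `TRANSLATORS`; `list_definitions` and any other tool built on `Definition` stay unchanged.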
Right.
So that was where you started.
And so what I would like to find out about
is the schema that it gets translated into. Like, what are the
bits and bobs that you guys need for each of these, the normalized version? And then how do
you store those?
Yeah, so the schema, it's a graph schema. So, you know, the schema is in the name of
the company, Sourcegraph. It's literally like a graph of source code. And so there's kind of three
fundamental concepts in the schema. So one is kind of the AST node. So this is kind of,
once you've parsed the code, this is like the essence of the code. Like once you have the AST,
then you can kind of derive every other fact about the code that you need from that.
And you can also translate it back to text.
It's like the perfect, I guess you could say it's like the natural form of code as data.
And then in addition to the AST node,
the things that really let us kind of build useful features on top of it are two concepts,
a definition and a reference.
So a definition is just a function declaration,
or a class declaration, or a variable definition somewhere
in code.
It's basically anywhere you define a name in code.
So we extract all those, and we produce a unique identifier
for each that's global to all the code in the world.
And then on the other side of the table, you have references.
So references is any time one of the names
that you define in code is referenced.
So it could be a function call.
It could be a type reference.
It could be a package import.
And once you have these two things, definitions
and references, that essentially allows
you to walk the graph of source code.
If you think about the things that we do probably
hundreds of times per day as developers when we're kind of exploring
the data set of the code we're working on, it's following forward and backward links.
It's either jumping to definition or finding references. And that's kind of the bread and
butter of what we do. And that's exactly what the schema allows you to do. The main difference is
that because of that globally unique identifier,
you can now do so across all the code in the world
rather than just the code that's on your local machine.
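A rough sketch of the definition and reference concepts described above, assuming a made-up key layout: `DefKey`, `Ref`, and `CodeGraph` are hypothetical names, not Sourcegraph's data model. The point it illustrates is that once a definition has a key that is unique across repositories, jump-to-definition and find-references become lookups that work across repository boundaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DefKey:
    """Globally unique name for a definition (hypothetical layout)."""
    repo: str   # where the code lives, e.g. a clone URL without the scheme
    unit: str   # package or module inside the repo
    path: str   # name of the function, type, or variable inside the unit


@dataclass(frozen=True)
class Ref:
    """A use of a definition at some location, possibly in another repo."""
    repo: str
    file: str
    line: int
    target: DefKey


class CodeGraph:
    def __init__(self):
        self._defs = {}   # DefKey -> (file, line) where the name is defined
        self._refs = {}   # DefKey -> list of Ref that point at it

    def add_definition(self, key, file, line):
        self._defs[key] = (file, line)

    def add_reference(self, ref):
        self._refs.setdefault(ref.target, []).append(ref)

    def jump_to_definition(self, ref):
        """Forward link: from a use to where the name is defined."""
        return self._defs.get(ref.target)

    def find_references(self, key):
        """Backward link: from a definition to every recorded use."""
        return list(self._refs.get(key, []))
```

Because the key includes the repository, a reference recorded in one repository can resolve to a definition indexed from a different one.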
Which is pretty rad.
So SourceLib, Open Source, MIT license,
SourceLib.org, we'll link it up in the show notes.
That seems like you've opened up
a core piece of your guys' business.
Is that not the case?
Yeah, it's a good library.
I would say, you know, for us,
it's just something that felt like it should be an open standard
because it's going to be useful, I think, for a lot of tools
beyond just Sourcegraph.
We hope that this is one wheel
that people shouldn't have to reinvent
when they're trying to build great tools for developers.
As far as the business case is concerned,
we really think that the value we're going to provide
to companies is scaling this
across the entire open source universe
and the code inside their company
and connecting those two different worlds of code
together. So there's a lot of additional technology that we built around scaling this,
making it super fast across all the code that you might use, that is not in SourceLib.
SourceLib is kind of the analysis primitive.
Also seems like a really nice way, and I know people hate when we use the word leverage, but,
um, when you take advantage of, well, that sounds bad too, but just kind of the open source
spirit, right, where what you have, I mean, especially when you have like an adapter
situation where you have all of these uncommon interfaces, and what makes your guys' end goal
and end product better is the more adapters that you have.
So for instance, you may not have the time or the capacity to write the Elm, what do you call them, analyzers?
Yeah, the language analyzer that would conform it to SourceLib.
But you know the Elm community, when they see, you know, you can use Sourcegraph on GitHub and look at the Go code and see what it does,
they think, oh, I want that.
They're going to get excited.
They might actually build that for you.
And then on the other side, you have your editors or your plugins,
and you could have the same situation there.
Maybe the Atom community says,
why don't we have a Sourcegraph for Atom?
Yeah.
Or not Sourcegraph, but maybe something that adapts to SourceLib for Atom,
and then they can do that.
So it seems like a great business case for what's also beneficial to all of us as open sourcers is,
you know, we don't have to be the only ones building this stuff.
Yeah, exactly. You know, there's so many use cases out there for a library like that, that,
you know, we're not going to be the ones to think of that someone else is going to think of it.
And yeah, that's exactly what happened when we released it.
There were a lot of people in the community that reached out and said,
hey, I want to build out support for this editor or this language.
And it actually helped us on the business side too.
One of the companies that uses Sourcegraph is Twitter.
And we're deployed to Twitter's entire Scala code base.
And there, they reached out at a point
where we didn't even have Scala support.
But one of their engineers wanted this so bad
because he had also been a previous Google engineer,
and so he wanted something kind of like Google Code Search.
And so they actually built out Scala support
as kind of a Hack Week project.
And we kind of took it from there.
So it's great to have the source code of your product
just publicly available.
Because speaking as a programmer myself,
it's just magical when I use a product and then I can go and see kind of how it works internally.
It kind of gets back to, you know, people used to say that like in the old days, you know, with hardware, you know, back in the 70s, you know, you'd buy an old clock radio or something or an old computer and you could just take it apart as a kid and kind of like
figure out, map out how everything kind of worked.
And that's kind of like a magical experience.
You know, today it's not really a thing anymore because, you know, hardware is so complex
and, you know, like some pieces of hardware like even try to prevent you from kind of
taking it apart and seeing how it works.
And I just think as an engineer,
it's just a magical experience where you buy something,
you get a lot of value from it,
and then you can just kind of disassemble it
and peer inside and see how it works.
Yeah.
Or even in this case, make it better.
Yeah.
A lot of times when we open things up,
we can't get them back together again.
But with source code, you could always just git reset dash dash hard and then you're right back.
Exactly. Exactly.
Yeah. Being an open standard, basically, it's an invitation to the community out there that if motivated enough, as Twitter was, as you mentioned with Scala, that in a weekend they
could run a hack or something like that, or a hackathon internally or whatever, and build
out their own piece, and it could possibly actually be adopted into the main repository
or whatever.
But having that motivation, if you're motivated enough and having open source, you're able to build out your own thing based on that or build on top of
it if you wanted to. And it's, it's just like a, an open invitation to do that. I'm kind of curious
though, you know, whenever you search with Sourcegraph or do any of this stuff that you do,
like being able
to search a function or whatever.
What sources are behind Sourcegraph?
Like, what do you comb?
How do you, how does that work?
So we, we crawl a lot of the major open source code repositories.
So GitHub, Bitbucket. Currently we crawl mainly
fully formed code repositories.
In the future we might also want to do
snippets that are
just found in blog posts and Q&A
forums around the internet. But right now it's just
kind of like the go-to places
where most
open source code is hosted.
Did you have to do anything special to get access to
that or blessed API access or anything like that?
Any sort of relationship you have with the code hosts?
No, nothing formal.
So we hit their APIs for some metadata,
but by and large, we mostly just hit the Git API.
So just like Git clone,
that kind of thing.
And that's nice for us because
a lot of companies
don't use
a
well-known code host internally. They just
have a Git repository. And so you can
just point us to any
Git clone URL and we'll be able
to index that code.
So whenever you do that,
are you actually pulling down the full repo?
Walk us through what actually happens
whenever you ping a source,
you pull back the, you know,
the whole scheme of translation you talked about before
with SourceLib.
What happens then?
Like what kind of data do you actually store
about a repository and the code that's in it?
Yeah, so it's all kind of ephemeral.
So if you give us access to your repository,
every time we detect a new commit,
we fetch that commit, we clone the repository,
and then we just run SourceLib as kind of a command line tool
in a Docker container,
and that outputs the data in the schema that we expect, and
then pushes that to an API endpoint in the Sourcegraph web application.
And underneath the hood, that then deserializes that and then stores it in one of several
underlying database systems that we have.
And with SourceLib, the way it's structured is that
it's got this core orchestrator
part, which defines the schema and is responsible for coordinating
the interface between SourceLib and the outside world.
But underneath the hood, it just shells out to a bunch of different command line tools.
We call them toolchains, and each of the toolchains is responsible for translating from a specific language
to the thing that SourceLib expects.
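The orchestrator-plus-toolchains structure described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the record fields, the `TOOLCHAINS` table, and both toy analyzers are hypothetical, and the real system is described as shelling out to separate command line tools rather than calling in-process functions.

```python
import os
from typing import Callable, Dict, List

# Hypothetical common schema: every toolchain emits records shaped like this.
Record = Dict[str, str]
Toolchain = Callable[[str, str], List[Record]]


def python_toolchain(path: str, source: str) -> List[Record]:
    """Toy analyzer: treat each line starting with 'def ' as a definition."""
    records = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("def "):
            name = stripped[len("def "):].split("(")[0].strip()
            records.append({"kind": "def", "name": name, "file": path, "lang": "python"})
    return records


def go_toolchain(path: str, source: str) -> List[Record]:
    """Toy analyzer: handles plain 'func Name(' declarations only, not methods."""
    records = []
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("func "):
            name = stripped[len("func "):].split("(")[0].strip()
            records.append({"kind": "def", "name": name, "file": path, "lang": "go"})
    return records


# The orchestrator only knows which toolchain handles which file extension.
TOOLCHAINS: Dict[str, Toolchain] = {".py": python_toolchain, ".go": go_toolchain}


def analyze(files: Dict[str, str]) -> List[Record]:
    """Dispatch each file to its language toolchain and merge the records."""
    records: List[Record] = []
    for path, source in sorted(files.items()):
        toolchain = TOOLCHAINS.get(os.path.splitext(path)[1])
        if toolchain is not None:  # unsupported languages are skipped
            records.extend(toolchain(path, source))
    return records
```

The point of the design is that adding a language means adding one toolchain that emits the shared schema; nothing downstream of `analyze` has to change.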
You mentioned blog posts potentially being extended to this.
I'm thinking back in the day of microformats.
Is there some sort of spec that you plan on doing
that might extend from SourceLib or whatever,
or some sort of schema to adopt in terms of HTML,
some sort of fragmenting to make that more possible?
Like, hey, you scan any blog or any Medium post or whatever,
and you auto-discover anybody who wants to sort of offer their code
samples up to SourceLib, or sorry,
I guess Sourcegraph,
not SourceLib.
Yeah.
Um,
what's your plan there?
Pre tags.
Simple as that,
I guess.
Yeah.
So, you know, we used to have this thing called Sourceboxes.
That was really cool.
It basically allowed you to embed an interactive code snippet
inside your blog post.
The only problem was the way we implemented it
was this JavaScript thing that you would embed.
So you actually couldn't embed it in a Medium post
or any other blog site
unless you had the ability to post scripts to the site.
So we kind of discontinued that.
But we've been thinking about this a lot.
And I think there's a couple of directions we could take.
If any of your listeners are bloggers, I'd be curious to hear how useful they'd find this.
But so one direction we could take is you give us any snippet of code,
and we'll kind of parse it and emit HTML with links to the documentation
and usage examples of whatever you call on Sourcegraph.
Granted, when you send us the code, you'll have to give us enough context
so that our analyzer can actually figure out
what code your thing is calling.
Like if you just type, you know,
http.NewRequest
and just give us that one-liner,
that's probably not enough context
for us to resolve that to, you know,
the NewRequest function in the standard library.
But if you give us, you know,
the import at the top
and a couple other lines of context,
I think that should be good enough.
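To make the "enough context" point concrete, here is a minimal sketch of why a bare one-liner cannot be resolved while a snippet with its import can. The `resolve` helper and its regular expression are hypothetical, and they only understand the single-line `import "path"` form; a real analyzer would do full parsing and type checking.

```python
import re
from typing import Dict, Optional

# Matches single-line Go imports such as: import "net/http"
IMPORT_RE = re.compile(r'^\s*import\s+"([^"]+)"', re.MULTILINE)


def imported_packages(snippet: str) -> Dict[str, str]:
    """Map each local package name (last path element) to its import path."""
    return {path.rsplit("/", 1)[-1]: path for path in IMPORT_RE.findall(snippet)}


def resolve(snippet: str, qualified_call: str) -> Optional[str]:
    """Resolve 'pkg.Name' to 'import/path.Name', or None without an import."""
    package, _, name = qualified_call.partition(".")
    import_path = imported_packages(snippet).get(package)
    if import_path is None:
        return None  # the snippet does not say which package 'pkg' refers to
    return f"{import_path}.{name}"


one_liner = 'req, err := http.NewRequest("GET", url, nil)'
with_import = 'import "net/http"\n\n' + one_liner
```

Calling `resolve(one_liner, "http.NewRequest")` returns `None`, because `http` could be any package with that name, while `resolve(with_import, "http.NewRequest")` returns `"net/http.NewRequest"`.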
And the other angle we're thinking of coming at it
is we have this Chrome extension now
that you can install in Chrome.
And what it does is as you're browsing code on GitHub,
it hits the Sourcegraph API and gives you jump to def
and find refs and simple search right in the GitHub UI.
And a lot of people really like that.
It also does that in pull requests.
And that's something that's really useful for code review, just like being able to jump to def when you're reading through a large code review is so helpful.
But we're thinking about extending that to code snippets, too, so that if you have the Chrome extension installed,
let's say you come across some post on Stack Overflow that
has a lengthy snippet that references some function,
and now you want to figure out what that function does,
the Chrome extension could link that snippet of code.
So you can just hover over a reference
to see the documentation and
click it to jump straight to where it's defined if you want to go diving into the source.
What exactly, in terms of Sourcegraph the product, and you can help us differentiate free versus paid or open
versus licensed as well, but, like, what is it in terms of how I use it today as a developer?
Is it plugins?
You mentioned the Chrome thing. Do I go to your website?
Give us the lay of the land.
It's, uh, it's free for open source and always will be.
If you're using it inside a company, you can use it for free up to, I think, uh, the limit
is 15 people now.
And after that, there's, uh, kind of, you know, the standard per-seat pricing model.
As far as how you can consume it,
we've actually experimented with a couple ways
you can consume it.
So the most popular way of consuming it
is just going to sourcegraph.com
and using it as a web application
that gives you, you know, global search,
global usage examples.
So you get usage examples pulled from every open source library
that might use a function.
And, you know, a bunch of other stuff that's useful in the application.
The other alternative is some people prefer a native application.
So kind of the same way that, you know, Slack,
the Slack native app is essentially the web application in a native frame.
The Sourcegraph desktop app
is essentially the same experience,
but in a native frame.
But with the added benefit
of direct editor integration.
So if you install a plugin,
it'll add some shortcuts to your editor
that make it super simple
to look up stuff in Sourcegraph.
So as you're coding, Sourcegraph will kind of like preload the documentation and usage examples it thinks are relevant to the code that you're writing.
And so you can quickly alt-tab over and get the answer to, you know, how do I use this function in a split second.
And then there's a Chrome extension,
which if you find yourself reading code on GitHub a lot,
just install it.
I mean, I'm biased, obviously,
but I think it's a magical experience to click into code on GitHub and everything's just linked.
You can hover over for documentation
and click on something.
And even if it's defined in a completely separate repository,
you know, you're there.
What about language support?
Yeah, so language support,
we support officially Java and Go.
We have Python deployed to private beta.
JavaScript's also in private beta,
but we're not confident enough in the quality of those yet
to make those public.
But if you sign up for the beta,
we'll try to get you on as quickly as possible.
And then we have a couple other languages in the pipeline.
And we have Scala inside some companies,
but that's not public yet either.
Well, what's up with that, man?
We've just got to get that. Twitter added Scala support. We're going to get that for the rest of us, right?
Yeah, yeah. The Twitter dev team has been great working with us on that. We're
just kind of going through the process with them right now.
I'm sure there are contractual agreements with that particular customer, I'm sure.
Yeah.
Very good.
I think that helps us understand exactly what exists in terms of how we could use it today.
And I think we're going to tee up our next break.
But I do have a question for you with regards to all the data that you're capturing.
And we should also talk about private source versus open source.
But you're collecting a lot of data.
I'm sure you're well aware of, you know, GitHub's recent push into public data with the BigQuery.
And I'm guessing that Sourcegraph has some overlap there, perhaps.
So let's not answer that now, but let's just take a break and we'll answer it on the other side.
Every Saturday morning, we ship an email called Changelog Weekly. It's our editorialized take on what happened this week in open source and software development.
It's not generated by a machine. There's no algorithms involved.
It's me, it's Jared, hand curating this email, keeping up to date with the latest headlines, links, videos, projects, and repos.
And to get this awesome email in your inbox every single week, head to changelog.com slash weekly and subscribe.
All right, we are back with Beyang Liu and we are talking about Sourcegraph, source code, all that good stuff.
Beyang, we mentioned before the break that you are collecting a lot of data.
Yep.
I like how you think about code as data.
Seems like a very powerful way to think
because you end up with tooling like this.
And recently, GitHub and Google made a big announcement
around BigQuery and GitHub's public data set
where they have added not just the commits and issues,
I believe it was previously.
Yep.
They now actually have full source code snapshots in BigQuery, queryable.
And that was something that has been pretty cool and opened up a lot of opportunities
to answer certain questions amongst open source people like us.
Yep.
I'm thinking that you guys have very similar type data and perhaps there's some opportunities
there with regards to reporting, analysis, what have you.
Can you talk to us about that?
Yeah, totally.
First off, I think it's awesome that GitHub and Google released that data.
It's a really interesting data set, and there have been a lot of great blog posts written about that.
They've been just really interesting to read
about certain patterns you can find in open source.
I think the data that we're collecting or that we're recording is...
The main way that it's different from that is...
My understanding is that the GitHub dump
is basically kind of like a dump of source code as text,
whereas on the back end with Sourcegraph,
we actually go and parse out all the code.
So we store every function definition and method call
and things like that separately
as kind of a distinct node in the graph.
So there are certain operations
that might have a lower false positive rate on top of that data set.
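The difference between code stored as text and code stored as parsed definitions and calls can be illustrated with Python's standard `ast` module. This is only a sketch of the general idea, not Sourcegraph's schema or storage format; the sample source and the tuple layout are made up for the example.

```python
import ast

SOURCE = '''
def parse(text):
    return text.split()

def main():
    note = "call parse later"  # the word also appears inside this string
    return parse("a b")
'''


def definitions_and_calls(source):
    """Return (definitions, calls) as lists of (name, line) tuples."""
    definitions, calls = [], []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            definitions.append((node.name, node.lineno))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by name; method calls like text.split() are skipped.
            calls.append((node.func.id, node.lineno))
    return sorted(definitions, key=lambda d: d[1]), sorted(calls, key=lambda c: c[1])


definitions, calls = definitions_and_calls(SOURCE)

# Plain text search: every line that merely contains the word.
text_hits = [n for n, line in enumerate(SOURCE.splitlines(), 1) if "parse" in line]
```

Here `text_hits` is `[2, 6, 7]`, and line 6 is a false positive because the word only occurs in a string and a comment. The structured view reports one definition of `parse` on line 2 and one call on line 7.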
That having been said, we've thought a little bit about the use case of,
hey, I'm a key open source author or I'm a senior engineer at my company.
I want to go and analyze the code base to see what kind of high-level patterns I can discover.
But at the moment, we're very focused on building for the day-to-day use case of developers,
so helping developers answer the most common everyday questions they have in seconds.
Whereas the type of analysis you would do with that larger kind of data set, in my view, is kind of something that you would kind of do every once in a while as a senior engineer, I think.
Also, you have to be motivated, too, because it costs money.
That's such a huge factor.
But obviously, if you're going to pay per query or pay per size of queries, then you're going to want to think a little closer to what you're actually doing.
It's probably going to be a barrier to that entry,
not so much to pay for it,
but if you had a general question,
you might want to ask BigQuery in this data set,
but generally, you got to be pretty motivated
because you have to pay for it.
Yeah.
It's a disincentive whenever you have, like, pay-per-use
querying, because every time you do it, even if it's a small sum, there's something
in us as humans where we're like, oh, I've got to pay for that? Oh, I'll just figure it out myself. You know,
when you're coding, you want all that access at all times, and you don't want to be thinking, is
this lookup worth it to me?
I guess the other question on the side of that, Jared, is, since BigQuery
obviously is a paid tool and searching the GitHub data set on BigQuery is part of that,
you know, the question for Beyang might be, how do you make it free, for one?
And how do you make it fast like you have?
I think Brian Ketelsen mentioned in the Go Time episode we've talked about a couple of times
on this show so far that he actually had to uninstall
something because it was a little slow, and you're aware
of that, but for the most part it's
pretty fast to get these lookups
back.
Yeah, so
it really, I think, comes down
to how we store the data.
If we were storing all the code in the world
as text,
it would actually be pretty expensive
to kind of comb through all that text and try to
parse it with regular expressions and return answers in a live fashion. But the kind of
high-level way to describe it is we're taking advantage of structure in the data to make the
problem of querying it faster. So one of the reasons that search is a lot
faster is we don't have to index every single token in, you know, a string constant or a doc
string. We can just scope our search to the things that we know are actual
function definitions, and so that reduces the quantity of data that we have to sift through by a lot. And there are other sorts of gains that we can get on the back end because
all the data that's coming in to us is in the SourceLib schema, as opposed to just,
you know, a file with a bunch of text in it.
You still have to be connected to the source, though.
Is there any chance of offline support? I'm thinking of times of, you know, bad latency, or you're on an airplane, times where you don't want to lose that customer.
You don't want to lose Brian.
You know, he's got his Vim open.
He's got his Sourcegraph, and he's your customer.
And now he's like, oh, this is either
too slow right now or it's not available.
These are probably things you guys are thinking about.
Yeah, that's an avenue that we're thinking about with desktop
is just kind of getting the code that you're writing real time
and getting that into Sourcegraph
so that when you pop over to ask a question,
it kind of has the data ready.
But I do think that's a little bit more of a nice to have use case
just because if you're on an airplane programming,
there's no Wi-Fi, then at that point you probably can't even look up
documentation if the documentation is hosted online, or, you know, read the code on GitHub.
So at that point you're kind of in the mode where hopefully you're not having to
rely on external libraries that you don't know as much.
And you're just, you can be,
like I try to, whenever I'm like about to take a plane ride,
I try to think of like, okay,
what's the most kind of like isolated coding task
I could do?
Like the thing that I can just like,
you know, be in the zone for, you know, five hours
and just hit the standard library for, yeah.
Yeah, I'm with that. I'm also against it to a certain degree. Pushback: I moved to the country, and, Adam can attest to this,
I have bad internet, and so I often find times where it's painful to work online. Sometimes I'll
just go completely offline. And so in
those cases it's similar to when you're getting ready for your plane ride.
You know, there are tools, like on the Mac there's an app called Dash, which is a paid app.
Awesome. It's a great one.
And you know, it's a tool that many people are happy to pay for because
it will offline all those docs and make them searchable and stuff. And I used to be swimming in bandwidth, and so I was like, who cares, guys? But,
you know, it's very narcissistic of me, but now that I have the problem
and experience it firsthand, having that is definitely a nice-to-have,
but for some people it could make or break a customer. And so I would say,
think about ways that, even if it's not the global Sourcegraph, right?
If it's not, like, everybody's code,
but it's at least either, like, hot code,
like things I've been looking at recently
or my, you know, local repository stuff.
I think having that would be a really interesting extension
of what you guys currently do.
That's true, yeah.
Actually, most people use
slash code or slash projects
where they keep all their source code
locally. You can even
crawl locally one particular
directory or a set of directories
based on a config.
Totally. That's actually
a great point.
One of the things that...
It's a good feature. Let's do that.
One of the things that we really want to make possible,
you're in the country where the internet is terrible.
Another place that is terrible is the developing world.
There's a lot of people who could become great programmers
and contribute to the global graph of knowledge and software,
but they're kind of hamstrung by poor connectivity.
So just kind of thinking out loud,
one of the things we could do is,
if you have code that you're working on
on your local machine,
Sourcegraph is smart enough to understand
what exactly you're depending on,
because we can actually go and parse the build file
and figure out these are the repositories
that you're using.
And once we have that,
we could just kind of like pre-fetch all the data
for those things and store it locally
and make that accessible.
It'd be kind of like the equivalent of Google Maps,
like save offline maps feature.
So if you know you're going to go to like a zone
of poor connectivity,
or if you just happen to live in one, maybe you could rely on that.
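The "parse the build file, then pre-fetch" idea floated here could look something like the sketch below. Everything in it is an assumption made for illustration: the one-dependency-per-line manifest format, the `fetch` callable standing in for a network request, and the dictionary standing in for local storage.

```python
from typing import Callable, Dict, List


def parse_dependencies(manifest: str) -> List[str]:
    """Read dependency names from a simple one-per-line manifest.

    Blank lines and '#' comments are ignored; a '==version' pin is dropped.
    """
    dependencies = []
    for line in manifest.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            dependencies.append(line.split("==", 1)[0].strip())
    return dependencies


def prefetch(dependencies: List[str],
             fetch: Callable[[str], str],
             cache: Dict[str, str]) -> List[str]:
    """Fetch data for every dependency missing from the local cache.

    Returns the names that had to be fetched, in manifest order.
    """
    fetched = []
    for name in dependencies:
        if name not in cache:
            cache[name] = fetch(name)  # only hit the network for cache misses
            fetched.append(name)
    return fetched


manifest = "requests==2.0\n# dev tools\nflask\n"
cache = {"flask": "previously cached index data"}
newly_fetched = prefetch(
    parse_dependencies(manifest),
    lambda name: f"index data for {name}",
    cache,
)
```

After this runs while online, `cache` holds data for both dependencies, so later lookups can be answered locally, much like the saved offline maps comparison.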
Yeah, it's not an easy problem to solve.
But one thing that I've realized,
and I think this is what you guys are going for with any developer-focused product,
is anytime you can make a developer say,
I love Sourcegraph, you're winning. And every time I
have to go offline and I can still work because of that Dash app, I say to myself, I love this thing.
And so it's rare, right? Most of the time online, everything's fine. But when I have
to use it and it's there for me, that's when you turn normal customers into customers that love your stuff.
So yeah, totally.
And, and you know, you know what, like back when I, I first started like programming on a computer, I remember, you know, in those days I was writing mostly Java and mostly
the standard library.
You could just pull all those documents, you could just pull all that documentation down
and have it on your local machine.
So even if you're going to like someplace where you didn't have the Wi-Fi password, it was all there.
And it was almost, in some ways, a nicer experience because you didn't have the distraction of the internet while you're trying to code.
I feel like these days, so many resources we look at are, you know, kind of in the browser that it's so easy to get off on a tangent.
You know, you're like, you try to look into how to do this one thing and then maybe the same forum post links to this other library.
And, you know, you click on some other link and sooner or later you're like on Hacker News and you're like, how did I get here?
You still have Twitter open in a tab and they have that thing that updates the page title
with the number of notifications you have.
And so you don't have to view it.
You're just there in a tab.
Oh, I have three notifications on Twitter.
And then you're just, it's an hour later.
You haven't done anything.
Speak for yourself, Jared.
Somebody else told me they did that.
Might have been me. No personal experience.
So Beyang, you mentioned that you've got this background in machine learning, that that's a
thing you love, obviously, and Jared mentioned that you're obviously collecting a lot of data. You
think of code as data. It's a cool way to look at this, obviously. So you must have,
not that this isn't a big enough plan, you know, what you're
doing with Sourcegraph, but you must have even bigger plans on top of all this knowledge, this
wealth of knowledge you're ultimately building for the developer community. Can you at all share
the future for us? Like, what's over the horizon? What's something no one knows about
that you can at least tease us with, with what you're thinking about for the future of Sourcegraph?
Yeah, I'm happy to spitball.
I just want to declare up front that, you know,
as of now, we're not working on any sort of machine learning related thing.
Like, as a person with a machine learning background,
it kind of rubs me the wrong way.
You know, a lot of companies say they're doing,
they have some fancy machine learning algorithm,
and really it's just Mechanical Turk underneath
the hood.
I just want to make it clear that Sourcegraph is not doing that.
If and when we do use machine learning, we want to have a very clear use case in mind.
Now that having been said, one of the things that got me really interested in this problem in the first place was, as a person who likes data and thinking about how to model it, the data set of all the code in the world, it's got two properties.
One, it's extremely interesting because it's such a valuable data set and there's so much information that's embedded in it.
And two, it's relatively unexplored.
There's not a lot of tools that are specifically designed for reading and understanding that data.
Most of the tools are optimized for creating the data,
actually writing code.
And so from the get-go, this has been something that's been in the back of our minds.
Just to name a few things that we could do after we've collected the data set, kind of
half-baked ideas.
One is kind of intelligent autocomplete.
So we think of autocomplete as this thing that just cues off of compiler signals
and it gives you a list of all the possible tokens
that you could possibly,
that are syntactically, semantically correct
to use at a given point in a file.
But what if you could actually go beyond that
and suggest a variable name
or suggest a parameter value
based on the surrounding context?
Now, that prediction problem is a lot fuzzier.
You probably won't be able to get that just from heuristics and what the compiler tells you alone.
That's probably something that you want to learn.
Like, okay, I've seen this pattern before in code, this pattern in the AST. And in the past, you know, when I've seen the token, you know, read, for example, and
now this user is calling some function that reads a file or, sorry, writes a file.
And what if they're passing, you know, the wrong value of the permissions flag?
They're setting it to, you know, 0666 instead of 0777.
That's something that I think
there's probably
given enough data, you could
probably learn some interesting patterns there for
what
things to flag to the user that, hey, maybe you're
hitting this API incorrectly because you're
using it in a different way than
the hundred other people
out there
in open source use it.
So that's kind of like one half-baked idea
we have in the back of our minds.
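As a rough illustration of flagging a call that differs from how most other callers use an API, the sketch below counts observed argument values per function and flags rare ones. It is a plain frequency heuristic standing in for the learned model being discussed, and the function name, permission values, counts, and threshold are all made up.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_usage_model(observations: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each argument value is seen for each function."""
    model: Dict[str, Counter] = {}
    for function, argument in observations:
        model.setdefault(function, Counter())[argument] += 1
    return model


def is_unusual(model: Dict[str, Counter], function: str, argument: str,
               threshold: float = 0.05) -> bool:
    """Flag an argument seen in fewer than `threshold` of the observed calls."""
    counts = model.get(function)
    if not counts:
        return False  # no usage data, so nothing to compare against
    share = counts[argument] / sum(counts.values())
    return share < threshold


# Hypothetical corpus: 100 observed calls to a file-writing function.
corpus = [("write_file", "0644")] * 90 + [("write_file", "0600")] * 10
model = build_usage_model(corpus)
```

With this made-up corpus, `is_unusual(model, "write_file", "0777")` is `True` because no observed caller passes that value, while `"0644"` and `"0600"` are both common enough to pass. A real system would need far more context than a single argument value, which is why the discussion frames it as something to learn from data.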
Another problem which is kind of related to that is,
in order to do that prediction problem well,
a sub-problem you kind of have to solve
is the scoring problem.
So in machine learning, the way you'd phrase it is, you know,
given this piece of code,
give me the probability that this piece of code exists
or is valid.
So you give it a likelihood score.
And what that tells you is if you see a piece
of unlikely code, like a piece of code
that your model thinks is like,
oh, that's kind of interesting.
More likely than not, it's an error.
And you can flag that sort of thing.
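The scoring problem can be sketched with a very small language model over code tokens. The example below uses a bigram model with add-one smoothing, chosen only because it fits in a few lines; it is an assumption for illustration, not something the speakers say they built, and the tiny corpus is invented.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Bigram model over code tokens with add-one (Laplace) smoothing."""

    def __init__(self, corpus: List[List[str]]):
        self.contexts = Counter()  # how often each token appears as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair appears
        self.vocab = set()
        for tokens in corpus:
            padded = ["<s>"] + list(tokens)
            self.vocab.update(padded)
            for previous, current in zip(padded, padded[1:]):
                self.contexts[previous] += 1
                self.bigrams[(previous, current)] += 1

    def log_likelihood(self, tokens: List[str]) -> float:
        """Sum of smoothed log probabilities; lower means more surprising."""
        padded = ["<s>"] + list(tokens)
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen tokens
        total = 0.0
        for previous, current in zip(padded, padded[1:]):
            numerator = self.bigrams[(previous, current)] + 1
            denominator = self.contexts[previous] + vocab_size
            total += math.log(numerator / denominator)
        return total


corpus = [["open", "(", "path", ")"]] * 50 + [["close", "(", "handle", ")"]] * 50
model = BigramModel(corpus)

typical_score = model.log_likelihood(["open", "(", "path", ")"])
odd_score = model.log_likelihood(["open", ")", "path", "("])
```

The token sequence that matches the corpus gets a higher log-likelihood than the scrambled one, so sorting code by score would surface the unlikely piece first. A bigram model only sees adjacent tokens, so anything practical would need much richer features.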
So think about running this model.
You train this model on all the code in the world, and you discover kind of associations
like associations of specific words and doc strings and, you know, parameter values and function calls. And then you can actually, once you've trained it, you run it
on all the code in the world and you can kind of give a printout to people saying like, hey, you
know, in addition to the linter errors that you already get, here are some places where, you know,
you might want to think about how you're calling this API or, hey,
senior engineer, one of your jobs is to make sure that the other people on the team aren't
shooting themselves in the foot or incurring a lot of tech debt.
Here's a daily printout of hotspots that you might want to scan that our model kind of
discovered.
So both those ideas are very half baked. Um, haven't
really explored them, uh, uh, seriously yet, but I think, you know, given the structure of this
data set and how novel it is, uh, there's bound to be some great low hanging fruit, uh, in there.
Yeah. Just as an aside, I find it amusing somewhat that you were in research and doing
machine learning, uh, and you left it to,
to get more into the industry side of things.
Yeah.
And you flash forward to 2016 and it's practically the most buzzword term of the
entire industry.
It's like, everybody wants to know, are we doing machine learning?
Do we have any machine learning going on?
So you couldn't actually be more industry right now.
Yeah, you know, I think it's both good and bad. I'm glad that people are interested in machine learning. I think
it can add a lot of value to a lot of products.
Yeah.
But, you know, along with the
good also comes the hype, and it's kind of funny to watch, you know.
Absolutely. Well, let's shift gears
a little bit and let's talk about licensing. So we have a few different projects coming out of Sourcegraph. Of course, we mentioned SourceLib itself, which is MIT licensed. You also have some cool new things like Checkup, which we can talk about in a minute in detail. And you commissioned the creation of a new open source license called Fair Source, and you even hired
a lawyer to write it. Can you give us the background on Fair Source, why it needed to exist,
and what your thoughts are there?
Yeah, totally. So just to be clear, we don't consider Fair
Source open source, and we want to make sure that people understand we're not trying to
pawn Fair Source off as an open source license.
We think it's separate and distinct from open source, but we do think it has a place in the world.
So the reason that we created the Fair Source license is that in open source, you kind of have this problem.
And a lot of companies building open source technology have this problem where, you know,
you want to build out something great,
you know, a utility that people really rely on,
and you want to make the source code publicly available
because it just feels like the right thing to do as a developer.
You know, as a developer, if I'm curious,
I want to be able to kind of peek underneath the hood
and figure out how something works.
Nothing's worse than when you encounter some bug
and the thing that you're using is a black box
and you have no way of fixing at all
or even understanding what's going wrong.
And so we wanted to make the source code publicly available,
but at the same time, we wanted to build a sustainable business on top of this
because we think that this is a really valuable problem we're solving.
It's going to add a lot of value to both technology companies and non-technology companies
alike. And we think that it's fair for people investing time and effort into building these
things to be compensated for the value that they're providing. And when we looked around, the classic kind of way to do this is kind of the dual licensing
model where you release it as open source under some really restrictive license like
GPL or AGPL, and then you have kind of a separate commercial license.
But that just didn't seem like a great fit for us. It also, I mean, if you talk
to lawyers in industry, there's actually a lot of concerns around that, you know, just like, oh,
you know, what if we accidentally, you know, pull in the GPL part of your code base, and we're not
technically paying for it. And it just, there's a lot of like fear, uncertainty and doubt from
the industry side of things.
And we kind of looked around and said,
well, can we kind of take some things from open source and take some things from closed source
and make a license that lets us release the source code publicly,
but at the same time, you know, if a company like Twitter comes along
and wants to use our product,
we can charge them a fair price
for the value that we're providing
to their development team.
And so we kind of looked around,
we asked a bunch of open source contributors,
you know, what they thought about the idea.
We were really worried that we'd get a lot of pushback
from people because I think, you know, a lot of people, and rightly so, they have concerns about companies coming along and trying to cast things as open source that aren't open source.
But what we found among open source authors was actually kind of this latent frustration at the fact that they're kind of investing so many hours of their lives.
You know, a lot of these people have families and kids in addition to day jobs.
And they're investing time and energy into these projects.
And companies are using those projects to build things that make a lot of money.
And the people actually building the underlying technology don't see a penny.
And, you know, that's bad because if you're building something valuable for the world,
you should be able to make a living off of it.
And so, you know, talking to those contributors kind of gave us the confidence to kind of
keep looking around.
And then we ended up meeting this lawyer by the name of Heather Meeker, who I think was involved in drafting
the Mozilla Public License
and a couple of other open source licenses.
She's a lawyer who specializes in open source licensing law,
and she had actually been thinking about this same problem
because she works with a lot of open source contributors
as well, and she heard all the same frustrations,
and it was kind of very serendipitous. We met her through, you know, a mutual friend of the company. And she said, you
know, I would love to take this on as a project. And we said, that would be great. Can you draft
up something simple that we can use to release our source code publicly, but still retain the
ability to build a business on top of it.
And that's kind of how Fair Source was born.
Adam,
we mentioned that Beyang has pretty much hit for the cycle on the
Changelog network,
but he actually hasn't been on Request for Commits yet.
And this sounds like a good topic
for our brand new show with Nadia and Mikeal.
Yeah.
That's with Nadia Eghbal,
right?
She was on the show a couple of weeks back.
Yeah,
she was on the show.
We had her back.
Uh,
we had her on the change log all the way back in January.
Yeah.
And then since then we were,
uh,
we enjoyed talking to her so much and,
we told her if she ever wanted to do a podcast,
uh,
she should come to us.
And she did.
And we've been working with her and Mikeal Rogers, who's the... what is he in the Node Foundation, Adam?
He's something for the Node Foundation. Community manager, that's what it is.
So the entire show is based around the human side of open source: sustainability, licensing, governance, and all such things. I'm sure Nadia and Mikeal have a lot of opinions about Fair Source one way or the other, whereas I do not have very many opinions. Adam, what do you think?
I would say, well, I don't know, you've got some opinions too, but maybe their opinions run deeper.
Yeah, they're probably more informed, whereas mine are just a gut reaction: oh, this is great, or oh, this is terrible.
Yes. I would also say that they actually have Heather Meeker on the list, and I think it would make sense to add someone from Sourcegraph to that conversation too. It hasn't been scheduled yet.
But this story is really fascinating, especially how you said it's not meant to replace open source. It is not open source. And I think that's a good caveat to add even before mentioning it, like you did, because when I first looked at it, I thought that this was a new type of open source. You explaining that portion of it makes it a lot clearer that you're not doing that.
Yeah, it's not open source at all.
Yeah.
So I guess the plan for this license, in your case, was to be the license for your
core application.
That's not open source yet, though, right?
No, it's still a private repository at this point. We want to release the code publicly soon, but there are just some cleanup things that we want to do before we're comfortable sharing the code with the world.
Hopefully in the next couple of weeks, it'll be online. You can look it up on GitHub.
Have you done much discussing or talking out there on the internet anywhere about Fair Source and the motivations
behind it and the plan for it?
Yeah, there's been some interest from open source authors. A journalist from Wired reached out a couple of months back, and my co-founder Quinn
spoke to him, and I think he wrote up an article.
But it's not been kind of a core focus of the company.
Like the main focus right now is just building an awesome product for developers.
It's just, this is just a means for us to release the product in a way that we think is kind of the right way to do it for developers.
Any common myths about this license you want to debunk right now?
Yeah, I think the main myth is that we're trying to cast it
as an open-source license.
It's not open-source.
We've tried to make that clear from day one.
I think maybe it's just the fact that it's called blank source.
People confuse it.
We're not trying to kill open source. We love open source. I personally have gotten a lot of value from open source software. You know, I wouldn't have been able,
as a curious, you know, teenager, I wouldn't have been able to dive into the repositories
that I dove into if things weren't out in the open. I use Linux as my programming environment.
And the worst thing would be for this to lead to the demise of the open source world.
The main goal is just to let us release our code in a way that we think it should be released and also give other
code authors out there a way to actually see some of the value financially in terms of
what they provide to companies that use their software.
And I guess to that end, the license does include a clause: it's actually a parameterized
license, meaning you can call it Fair Source 10 or Fair Source 15.
With Fair Source 15, for example, any company that's below the size of 15 people can use your software
for free.
It's only after they hit that magical limit that
they need to acquire a separate commercial license from you.
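The parameterized idea described above can be sketched in a few lines of Python. This is only an illustration of the concept as described in the conversation, not anything taken from the license text: the class name `FairSourceTerms`, its fields, and the choice to treat a user count exactly equal to the limit as still free are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FairSourceTerms:
    """Hypothetical model of a parameterized license such as "Fair Source 15"."""

    user_limit: int  # the number in the license name, e.g. 10 or 15

    @property
    def name(self) -> str:
        # The parameter is part of the license name itself.
        return f"Fair Source {self.user_limit}"

    def needs_commercial_license(self, user_count: int) -> bool:
        # Assumption: usage is free up to and including the limit,
        # and a separate commercial license is needed beyond it.
        return user_count > self.user_limit


terms = FairSourceTerms(user_limit=15)
small_team = terms.needs_commercial_license(8)    # under the limit
large_team = terms.needs_commercial_license(40)   # over the limit
```

Here `small_team` is `False` and `large_team` is `True`. Where exactly the boundary sits is something the license text decides, not this sketch.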
I'm glad you mentioned that number there, 'cause I was actually thinking about that.
It's in my notes to mention, but I almost forgot. So how in the world do you track that?
When I first read that, I'm thinking that's great,
but how do you track usages of it? You have to kind of operate on this honor system then too.
How do you create a conduit to get paid?
Is it just a known way to pay someone?
Like, how do they say, hey, I'm honest and I've used 16 of the 15 licenses and I've got to pay for license number 16 or something?
How does it work? How do you plan for it to work?
Yeah, so we're not trying to make money off of individual programmers or super small teams
working for mom and pop coding shops or small companies.
The experience at Palantir has taught us that the problem that we're solving is valuable
enough that it's the large and established companies that will pay us a lot of money
to make their development teams more efficient.
And so that's where we think the business is.
As for the rest of the world, we just want to make this accessible
to as many developers as possible because we think we built a tool
that's great for learning and understanding code.
What about from the generalized license perspective? If I use Fair Source 10, for example, how do I enable the companies that use it, once they get to 11, 12, 13 users, to, one, be honest and say, hey, I've used it with 13 or 14 users versus the 10, so I need to pay for a few licenses or whatever?
Yeah.
And then how do I communicate to them how to pay? Is it just on the person who uses the license to figure that out? Or is it something that's actually baked into the license?
Yeah, right now it's kind of the honor system. But the way we think about it is that there's no legitimate company in the world that would willingly violate a software license just so they could save a few bucks a month on a piece of code. And as for the ones who are illegitimate and skirt the law, those are probably not your customers anyways.
Yeah, you're not going to build a giant business off of those people anyways. So those were just some knee-jerk questions I had when I read it. I'm like, okay, so how do you enforce the honor system and how do you get paid? Because great, you've got the license, and great that you actually put that there, but how do you enforce it? Because if you don't enforce it, or at least prescribe how you should operate around it, then no one's going to follow it.
And I was just thinking, is that something that you've thought through?
Is that something you have some suggestions on?
I'm just curious.
That's a smaller subject, though.
Yeah.
I mean, in the future, we think that there can be a more automated mechanism. If we're thinking from the Sourcegraph perspective,
if you're using something like Sourcegraph that
understands the dependencies you start pulling in
through your code, you can have an automated alert that
tells you, hey, you started using this thing that
has a Fair Source license attached to it.
If more than 10 people start using it,
then you should pay this person.
But that's just kind of like vague stuff
that we haven't really built out yet.
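As a rough picture of what such an automated alert might look like, here is a small Python sketch. It is purely hypothetical, since the conversation makes clear nothing like this had been built: the function name, the shape of the dependency-to-license mapping, and the license naming pattern ("Fair Source 10") are assumptions for illustration only.

```python
import re
from typing import Dict, List

# Assumed naming convention: "Fair Source 10", "FairSource-25", and so on.
_FAIR_SOURCE = re.compile(r"^fair\s*source[\s-]*(\d+)$", re.IGNORECASE)


def fair_source_alerts(dependency_licenses: Dict[str, str], user_count: int) -> List[str]:
    """Return one alert per dependency whose Fair Source user limit is exceeded.

    dependency_licenses maps a dependency name to its license name
    (a hypothetical input; a real tool would discover this from the code).
    """
    alerts = []
    for dep, license_name in sorted(dependency_licenses.items()):
        match = _FAIR_SOURCE.match(license_name.strip())
        if match is None:
            continue  # not a Fair Source style license, nothing to check
        limit = int(match.group(1))
        if user_count > limit:
            alerts.append(
                f"{dep}: {user_count} users exceeds the {limit}-user limit "
                f"of {license_name.strip()}"
            )
    return alerts


deps = {"libfoo": "MIT", "searchtool": "Fair Source 10", "tiny": "FairSource-25"}
alerts = fair_source_alerts(deps, user_count=12)
```

With 12 users, only the dependency with the 10-user limit produces an alert. How a real tool would count "users" is left open here, as it is in the conversation.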
Well, Beyang, we're getting low on time,
but we did want to touch on Checkup.
We mentioned it earlier in the call
and want to give you a chance to get that out there.
It's a new piece of open source by Sourcegraph,
and I believe built in collaboration
with a friend of the show, Matt Holt,
of the Caddy web server.
He's also been on the Changelog.
He's also been on Go Time.
So tell us about Checkup: simple uptime monitoring,
distributed self-hosted health checks, and status pages.
What is it, and why is it?
Yeah, so I'll kind of start with the why.
Okay.
The problem that it kind of solves was,
like many web services,
Sourcegraph uses an uptime monitoring service
to make sure that our site is up
and to make sure that someone gets paged when things go down.
And we kind of ran into a couple of pains
that were kind of surprising using the standard uptime
monitoring services.
The biggest pain for us was just like,
it was so hard to use the UI of these things.
Like you'd think it'd be the simplest UI in the world.
Like you got some URLs that you want to hit,
and I put them in and you tell me whether they're green
or not.
But a lot of the UIs just take multiple seconds
to load a single page.
And you're sitting there as an engineer.
Efficiency is the most important thing.
And you're sitting there waiting for a page to load.
You can't help but appreciate the irony
that the thing that's monitoring your site to make sure
that your latencies are below, you know, one second,
is itself taking like seven seconds to load.
That coupled with the fact that there was no way
to programmatically update the endpoints
for a lot of services, or no easy way, I should say.
And that kind of got us thinking like,
well, you know, this is uptime monitoring.
It's not rocket science.
It's actually dirt simple.
Ideally, we should be able to run these things as unit tests.
Wouldn't it be great if I could actually run uptime checks in development?
Obviously you still
need them in production in case some weird production issue comes up.
But a lot of times, you break an endpoint just because you pushed a bug that breaks a page,
and you could catch that in CI or development.
And so we got to thinking, and we really did not want to build this ourselves.
We were like, surely there must be something out there that does this the way we want it to.
But we looked around and just couldn't find anything.
So, you know, Matt Holt is
kind of a friend of Sourcegraph as well.
And we talk to him from time to time.
And he'd had his own frustrations of this sort.
And I'm sure he's heard a lot from folks who use the
Caddy web server.
And so we kind of got talking, and he was like,
I've been thinking about building this thing.
And we're like, well, we'd love to sponsor you.
We would definitely use it.
And so he went and built this library for us
that's also a command line tool.
And what it essentially does is you
can run it as a command line tool, which
means you can run it basically on any EC2 or Google Cloud
instance.
And what it does is just you give it a set of endpoints,
you know, programmatically.
It's some config file that you can version in with your code.
You run this command, it hits all the endpoints,
and then it uploads the data that it records to an S3 bucket.
And then there's a separate command that pulls up a dashboard
that pulls the data from the S3 bucket.
And that's the thing that tells you whether your site is up or down.
And so you can run the uptime command from any EC2 instance or any set of geographically distributed EC2 instances
and pull uptime data from all across the world, push it to an S3 bucket,
and then checking your uptime
is as simple as running a command
to display a dashboard.
And as a side benefit,
because it's so simple,
you can also run it in CI or even development.
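To make the flow described above concrete (a list of endpoints kept with the code, one command that hits them all and records the results), here is a minimal Python sketch. It is not Checkup's actual code or API, which is a Go library and command line tool. The names `CheckResult`, `http_probe`, `run_checks`, and `to_json` are made up for this example, and the results are serialized to a JSON string rather than uploaded to an S3 bucket.

```python
import json
import time
import urllib.request
from dataclasses import asdict, dataclass
from typing import Callable, Iterable, List


@dataclass
class CheckResult:
    url: str
    healthy: bool
    latency_s: float
    detail: str


def http_probe(url: str, timeout: float = 5.0) -> int:
    """Fetch the URL and return its HTTP status code (raises on 4xx/5xx or network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def run_checks(
    urls: Iterable[str],
    probe: Callable[[str], int] = http_probe,
    max_latency_s: float = 1.0,
) -> List[CheckResult]:
    """Hit every endpoint once and record whether it answered in time.

    The probe is injectable so the same function can run against a fake
    in a unit test or CI job, and against real endpoints in production.
    """
    results = []
    for url in urls:
        start = time.monotonic()
        try:
            status = probe(url)
            elapsed = time.monotonic() - start
            healthy = 200 <= status < 400 and elapsed <= max_latency_s
            detail = f"status {status}"
        except Exception as exc:  # any failure to answer counts as down
            elapsed = time.monotonic() - start
            healthy = False
            detail = f"error: {exc}"
        results.append(CheckResult(url, healthy, elapsed, detail))
    return results


def to_json(results: List[CheckResult]) -> str:
    """Serialize results; a real setup might upload this to shared storage."""
    return json.dumps([asdict(r) for r in results], indent=2)
```

In a test or CI run you could pass a fake `probe` that returns canned status codes, and in production call `run_checks` with the default `http_probe` on a schedule from several regions. A separate dashboard step would then read the stored JSON, which is the split between "check" and "display" that the conversation describes.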
I love that.
You got a problem,
maybe you don't have enough time to do it yourself.
Matt Holt has some time
and he also would like to solve said problem.
And a beautiful thing happens.
You know, that's the great new world of open source
where we do have businesses
that are being run around open source
and being successful.
And we can sponsor little things
that can benefit ourselves,
but also benefit the whole community.
So that's really cool.
Yeah, totally.
So that's sourcegraph.github.io slash checkup.
We'll link that one up in the show notes as well.
Very cool.
So is that kind of a thing that's just said and done?
You guys launched it,
or is there actually continued development?
Do you have future plans for checkup,
or is it just it's out there and use it?
Yeah, so it's out there.
It's usable.
It's kind of like a minimal viable tool right now.
So we're actively looking for other open source
contributions.
We've actually been overwhelmed by the sheer amount of interest.
I guess it turns out that a lot of other people
have had similar frustrations.
But people have already submitted pull requests. One person added the ability to check TCP endpoints
as opposed to HTTP endpoints, and we've gotten a lot of other great pull requests as well. But
if you're out there and you're listening and you want to contribute to libraries
like this, then we're open for business. We're happily accepting pull requests.
Yeah.
I've been working with Gerhard Lazu
on the deployment of our new website
and CMS and all that good stuff.
And we have been discussing uptime monitoring.
And he's a DevOps guy.
And so he has opinions
on all the different uptime monitors in the world:
the Pingdoms, the Uptime Robots, the new shiny Apex Ping, which looks interesting.
And one of the things I asked him was, what's the best one? Because I've been using this Uptime Robot thing, which I appreciate. It's free for me and cheap
for many people, so I don't want to really diss it here on the show,
but it's not fitting all my needs. But I use it.
And he's like, I've used all of them.
He's like, I have accounts on nine different uptime monitors.
He's like, they're all subpar in some way and they all fall down in some ways, so I just use them all.
I think he's going to be interested in Checkup.
Yeah.
Yeah.
I would love to hear his thoughts.
Well, Beyang, we would totally ask you the hero question on the show,
except that you've already answered it on Beyond Code.
So instead of doing that,
we're just going to link up your interview on Beyond Code at GopherCon 2015.
But one of the other closing questions we like to ask
is really an invitation to the community.
So from Sourcegraph to the community,
what are the best ways, you know, with your mission,
with what you're doing, with all the things you have going on right now
for the open source community to step in,
to support what you're doing,
or to help you move the ball forward
towards the progress you're trying to make,
whether it's on the company side or on the open source side, what's moving
Sourcegraph forward, and how can the open source community step in and help out?
Yeah. So I think the best thing right now is just to try out Sourcegraph and use it to explore
some open source code. Maybe use it to dive into that repository that you think is really cool, but perhaps a little bit inscrutable or overwhelming right
now.
'Cause really, the reason we made it was to make it easier to dive into unfamiliar
code and to figure out what's going on, what it's doing.
So use it for that.
Hopefully it helps you.
We'd love to hear your feedback, and if you end up liking it, tell your friends,
tell your teammates, and help us spread the word.
What about language support or editor support, or
different areas where, as we talked about earlier in the show, there's cross-pollination or motivation to
look at where you're trying to go? Are there any unturned rocks out there that
you just personally don't have time for? Things that are not on your roadmap, but where the open source community can step in and help out?
Yeah, totally.
So, you know, even for the languages that we do currently support, the tool chains could
always be better.
You know, the Go, JavaScript, Python, TypeScript, those are the languages that we have kind
of work in progress tool chains for.
We'd love to get contributions for that.
If your favorite language happens to be a language that's
not one of those languages, if you reach out,
we'd love to work with you on how to build a tool
chain for that.
It's one of those tasks that I think
building a source code analysis tool chain
seems really fancy, but you just come talk to us.
It's actually pretty straightforward, and you
actually kind of learn a lot about the internals of programming languages and level up as a
programmer when you do so. So if you're interested in any of that, just tweet at us or
shoot us an email, and we're happy to connect and see how we can work together.
On your contact page, it's hi at sourcegraph.com.
Is that a good email for something like that?
That's perfect.
Awesome.
Any closing thoughts from you, Beyang,
for the listeners who've been listening
this whole entire show?
I think it's the hour.
I think we're past.
Are we past time?
14 minutes?
Yeah, we're 14 minutes past.
Wow, okay.
So we're over time by a bit. I haven't even been watching the clock, Jared, these last 14 minutes. Okay, so we're
going on an hour-and-a-half show, roughly. Any closing thoughts for the listeners who've been
hanging on to the end of the show here?
I think I would just say, you know,
I'll speak directly to the listeners who might be a little bit newer to programming, because I didn't start programming really in earnest until the end of high school, beginning of college.
And that's a little bit old for a lot of programmers in the software industry. So if you're a person who's just learning to code and it just seems like there's this huge universe of things
out there that you can never hope to know,
I'd just say just keep going.
Dive into source code.
Learn from the examples of other people.
It's not rocket science at the end of the day,
and once you get out the other end
and you can build stuff on your own,
it's like you've been given magic powers.
A lot of great advice from you as well
in that Beyond Code interview.
I can remember you saying,
you know, what would you go back and change?
And I'll just give a snippet here
because we'll link it up anyways.
But you said, go back and read more source code.
And I thought that was such an interesting answer considering what you do now with Sourcegraph,
because that's pretty much what the tool you built does: read source code
and create some more information based on that, some more logic on top of that.
But we'll link that up.
I thought that was a pretty interesting thing as well, just to go back in
and dive into the open source code out there. And yeah, don't feel like there's a different way to
get it right. Reading source code is probably the best option for learning to
program.
program yep well beyond thank you so much for uh for coming on the show and definitely thank you
for you know you know your your love for open source and your love for productivity for
developers out there and obviously uh all the things that open source and your love for productivity for developers out there.
And obviously all the things
that Sourcegraph and your company is doing
to prosper open source,
but also to give us better tools
to not have to rework every time
or recreate the wheel every time
and to leverage the collective knowledge
out there available in open source
and all these open repositories to
help us make our day-to-day lives a little bit better. And that's obviously a pretty cool thing.
So sourcegraph.com is where you can find Sourcegraph, obviously. GitHub.com slash
Sourcegraph is where you can find a lot of our code. And with that, fellas, let's call this
show done and say goodbye. Thanks so much, Adam and Jared for having me on the show. I really appreciate it.
Love the change log and, uh, keep doing what you're doing.
Awesome. Thanks, man. We appreciate that. Outro Music