The Changelog: Software Development, Open Source - Sourcegraph the 'Google for Code' (Interview)

Starting point is 00:00:00 I'm Byung-Loo, and you're listening to The Changelog. Welcome back, everyone. This is The Changelog, and I'm your host, Adam Stachowiak. This is episode 217, and today, Jared and I are talking to Byung-Loo. Byung is the CTO and co-founder of Sourcegraph, and Sourcegraph is aiming to be the Google for code. We talked about the backstory of Sourcegraph, how it works, ideas around offline support, how it's licensed, which led us to talk about their new software license called Fair Source. We have two sponsors today, Linode and Datalayer, a one-day event organized by our friends at Compose.

Starting point is 00:00:42 Learn more at datalayer.com. Our first sponsor of the show is our friends at Linode, cloud server of choice here at Changelog. Get a Linode cloud server up and running in seconds. Head to linode.com slash changelog to get started. Choose your flavor of Linux, resources, and node location. Plans start at just $10 a month. You get full root access, run VMs, run containers.

Starting point is 00:01:03 You can even manage your Linodes from the comfort of terminal using Linode CLI. They've got SDKs in Python, Perl, PHP, Ruby, JavaScript, Node.js, so you can hack away on your Linodes with their API. Take advantage of add-ons like backups, node balancers, DNS manager, and more. Again, use our code CHANGELOG20 for $20 in credit with unlimited uses. Tell your friends. Head to linode.com slash changelog. And now, on to the show. All right, we're back. We got Byung-Loo here from Sourcegraph. Jared, we like to trudge through open source, right? And not just open source, but the details of it, the functions, the language, and see where their use cases are.

Starting point is 00:02:01 And this is exactly what Sourcegraph does. So, Byung's here, obviously, to tell us about his company, but also all the cool open source they're doing at Sourcegraph. Yeah. And I feel like Byung's kind of, he's been on Beyond Code and then recently he was featured on GoTime and now he's on the Changelog. So, that's the... He's making his rounds. He's like hitting for the cycle. Yes. Thank you guys for having me. It's great to be here. Well, Byung, let's begin with your origin story.

Starting point is 00:02:29 I think that, you know, graduated from Stanford, you got a unique path to where you're at today. But aside from working at some cool companies and figuring out some developer problems, where did things actually begin for you? Like, how far back do we go to figure out where you got your interest piqued around open source or around software development? Uh, well, if you want to go origin story, I guess I should start with, uh, my birth. Uh, I was born in China, but, uh, I was raised in the Midwest. I always like to mention that in case there are any Midwesterners out there listening.

Starting point is 00:03:04 Um, you're talking to Midwesterners. Yeah. Oh, yeah. Jared, you're out in Nebraska, right? I'm in Nebraska, and Adam's in Texas. So there you go. Nice. That's awesome.

Starting point is 00:03:15 No wonder you guys are such nice guys. Yeah, so I grew up in the Midwest, but I came out to California for high school. And I think I first got into programming just, you know, I had to buy a TI-83 graphing calculator for some, I think it was like high school geometry. Yes. And I happened to get the version of the calculator that came with like the 500 page reference manual, which not all versions come with.

Starting point is 00:03:42 But this thing is like, it's got everything you ever would want to know about the TI-83 calculator, and it includes a section in the back that teaches you how to write the dialect of BASIC that they have on the calculator. And so when I was taking the bus back and forth from school, I would just kind of whip that book out and try to program stuff on the calculator in my spare time, you know, program some cool animations or some, you know, automated formula calculators. And that's kind of how I got into it. And I liked it enough that

Starting point is 00:04:18 after that, you know, my school offered a computer science class. I ended up taking that. Can I stop you for a second, Byung? Because I had the TI-86 in high school, which is pretty much the exact same calculator. And mine also came with the manual. But mine came with something that, to me, was better than the manual, which was the game Nibbles. Did you have that one on yours?

Starting point is 00:04:42 No, I did not have Nibbles. See, now this could have changed the course of your life because I had Nibbles. Therefore, I was not going to program anything into that thing. I just tossed the manual out and just played Nibbles the entire way to school. Yeah. So lucky you.

Starting point is 00:04:58 The TI-86, I think, had a slightly faster processor. I was always envious of the folks who had that. Maybe that's why it came with Nibbles stock and yours didn't. That's probably why. Yeah, it just had just enough RAM to run Nibbles. Exactly. Anyways, keep going.

Starting point is 00:05:17 Yeah, I had a great teacher by the name of Mr. Olivares in high school. He was great at just laying down the facts for computer science. Ended up kind of loving it, went off to college. I knew I wanted to do something math and science related. Computer science just seemed like the perfect marriage between stuff that was theoretically interesting, but also stuff that would have kind of a real world impact. So that's kind of how I got into this whole thing.

Starting point is 00:05:51 So you got this calculator, obviously. And Jared, you mentioned that you had a similar one, the TI-86, and Byung, you got the TI-83. And Jared, many people that come on this show, their origin stories sometimes begin with gaming. Whereas Byung's history, it sounds like, and correct me if I'm wrong, but it sounds like what you're saying is that you were really interested in the sciences, which I think most computer scientists are anyways, but more importantly, in the things that you can actually implement today and change the world around, versus being interested in simply just games to get you excited about that. Is it fair to say that, or is that not the truth?

Starting point is 00:06:29 Yeah, you know, I'd like to think I had so noble of a mentality back in high school, but, uh, to be honest, I think the reason why I never got into Nibbles or any other calculator game was I just had no patience for reading through how to install those things. And the calculator didn't come with any games pre-installed. And I Googled some stuff on how do I install... I think the game that everyone else was playing was Penguin, which is this Super Mario clone.

Starting point is 00:07:01 And I could just never quite figure out how to install that on my calculator, and then I just gave up. So it was really out of sloth and laziness. I like that. Well, laziness means you make a great programmer. Another question might be, do you still have the manual?

Starting point is 00:07:19 Do you still have the 500-page manual lying around by any chance with notes in it? Yes. Bookmarks and stuff? Yes, I do. It's still on my bookshelf. Wow. Awesome. Yeah, it sounds like you had kind of a straight and narrow path to where you are in terms of education and desires, and lots of people are going to change. They're not

Starting point is 00:07:35 sure what they like. Maybe they find out through video games, maybe they find out, uh, through reading books or whatever it happens to be. Other people take completely different course changes in life or in career before they end up being in software. Take us to where we met you. So this was GopherCon, July 2015. You now have the Sourcegraph thing. Maybe it's a company at this point. Maybe it's just a side project, but you meet us there, you're into Go, and you have this Sourcegraph. Your answer to the most influential open source project for you was SourceLib, was what you said when we asked you that question. So take us from where you just left off, bring us all the way up to the near present, which was July 2015. How'd you get there? Um, so I went to college, knew I wanted to do something math and science related. After I took the very first CS class at Stanford, I kind of knew that this is probably the right thing, at least for the next four years.

Starting point is 00:08:37 So I declared the major. I was fortunate enough to be accepted into a research lab as an undergrad. Stanford has this great undergraduate research program called CURIS. And so I landed in Daphne Koller's research lab. And she was a great mentor. She eventually became my advisor. I really got into AI research. For a while, I thought I was gonna get a PhD in

Starting point is 00:09:06 computer vision or machine learning, something like that. But after doing that for a while, I kind of decided that industry was probably where I wanted to be more. And so I started looking around for companies that I thought were doing interesting things with, you know, large data sets. And, uh, at that point in time, this is, you know, 2011, um, Palantir, uh, was a big presence on the Stanford campus, and it seemed to me that they were tackling some really interesting problems, uh, with large data sets and doing really impactful things in the world. So I decided to join them, landed on the commercial side,

Starting point is 00:09:50 which basically works with a lot of companies in industry to help solve their most important technological and software-related problems. And it's kind of there that I got to work closely with my future co-founder, Quinn Slack. We'd gone to school together and kind of knew each other from there. But it was at Palantir where we really, you know, got to spend some quality time together.

Starting point is 00:10:19 And that was also kind of a tipping point for me because I think a lot of the roots of Sourcegraph were planted in that experience. So Quinn and I are both CS majors by background, so we both kind of had this pain that I think every programmer feels, which is, man, it seems like it's harder than it should be to find existing code and reuse it. It just seems like I'm spending too much time searching the internet, crawling through random forums, trying to find the answer to how to do this

Starting point is 00:10:56 pretty straightforward thing in code. And so we felt that kind of day-to-day pain as programmers, but the experience at Palantir kind of showed us that this is a problem that's not just relevant to programmers now. It's actually relevant to, you know, say the top leadership at one of the big five banks in the U.S. because what we realized was, you know, right now we're kind of at this point where software is becoming mainstream. And what I mean by that is, you know, it used to be that for non-technology companies, you know, technology companies that are outside of Silicon Valley, software engineering was kind of an afterthought or just a small department or

Starting point is 00:11:48 that, you know, they might outsource it to some other firm. But these days it's becoming more of a core competency. You know, more and more of the core logic of the business is actually captured in the logic of code. And that's what we realized working at Palantir with the types of customers that we were working with. And what we realized was as painful as it was for us, the pain was felt 10 times as much outside of Silicon Valley where companies aren't traditionally steeped

Starting point is 00:12:24 in all the different processes and principles that we kind of soak up being immersed in the software development world on how to run an engineering team and what tools to use to find the answers to everyday questions. And so we kind of took a step back and were like, hmm, this seems like a solvable problem. You know, code is just another form of data. And, you know, at Palantir, we're building all these fancy tools for other sorts of knowledge workers to analyze, you know, their data sets, but the tools that we seem to be using as programmers, both at Palantir and at some of the customer sites that we're working with, still seem kind of primitive.

Starting point is 00:13:13 I mean, the top two code search utilities today are probably Google Search and Grep. And Google is just kind of like the all-purpose, you know, fallback. Like, we have no other recourse. It's kind of like the Hail Mary. Like, I hope somewhere someone has written a blog post or an answer out there that answers my question. And grep is, you know, a great tool. It's a powerful tool. But it was written in the 1970s and hasn't really changed much since then

Starting point is 00:13:45 even as the world of software has evolved around it. So then we kind of got to thinking about this idea. We didn't start working on it right away. I went back to school to finish up my master's. Quinn went off and started another company with some folks from Palantir. And then we kind of serendipitously met each other at some house party in San Francisco. Actually, it might not have been serendipitous. I later learned that Quinn's

Starting point is 00:14:19 then girlfriend, now wife, she knew that he was thinking about this problem and she knew that I was going to be at this house party. So she kind of like orchestrated the whole meeting. That's interesting. Which is kind of kind of funny. But you must send her nice cards for Christmas and stuff. Yeah, she's great. But yeah, at the time, you know, it felt like, oh, you know, you know, you're thinking about this as well.

Starting point is 00:14:44 We got to kind of talking and then we started just hacking on this and, you know, spent two years just building it out and testing both the technical side, which a lot of people didn't actually think could be done initially when we started, and also the product side, which is how do we actually make this something that people can rely on every day? And that, I think, brings us up to GopherCon 2015. You know, we were a company at that point, but we were still relatively small. I think we only had a handful of people. But we had a good amount of traction by then, uh, at least in open source, and, uh, it seemed like, you know, we were definitely on to something. And it was exciting to go to GopherCon and kind of share the tool that we'd built with the people and kind of see

Starting point is 00:15:57 their reaction. Yeah, it's interesting that you said a very similar sentiment when we interviewed you for Beyond Code to what you just said here a few minutes ago. And what you said then, last summer, was in the next 10 to 20 years, every interesting company is going to become a software company at its core. And so this seems like an insight that you've had over time and continue to believe to this day. Yeah, I really think, I mean, there's been a couple additional points of validation, I think. So, you know, have you guys seen General Electric's most recent ad campaign? I think they aired it

Starting point is 00:16:36 during the Super Bowl, where they're kind of rebranding themselves: they're a digital company that happens to do infrastructure. Yeah, it's like they're both. Like, don't think about them that way anymore. Now think of them as a software slash hardware company. Yeah, exactly. That really indicates that they're putting software first.

Starting point is 00:16:56 Another recent news item was the recent outage at Delta Airlines where a software glitch basically shut down the airline for, you know, a day or more. And, you know, if we live in a world where, you know, a software bug like that basically shuts down, like makes it so you can't do business, that means that even as an airline, you know, you may think your core business is flying planes,

Starting point is 00:17:27 setting prices, and all of that is done more and more so in software. I guess we've gotten this far here so far with your backstory, and we've mentioned Sourcegraph a couple times, even in the intro. I'm going to have to rewind myself and get upset

Starting point is 00:17:43 because I didn't actually say what Sourcegraph is, but we're getting close to our first break. But before we go into that break, let's have you break down exactly what Sourcegraph is. Obviously, you've kind of teed up some of the ideas for which Sourcegraph was built around, but help our listeners understand. Then when we come back from the break, we'll go a little further into it. But what is Sourcegraph? Sourcegraph is basically global jump to definition, find references, and documentation lookup

Starting point is 00:18:09 across all the code you use, whether it's private or public. And it understands the code at a semantic level. So that means when you're jumping to a definition or searching for something, it knows the difference between a function call and the occurrence of that particular name in some random docstring.
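
To make the "semantic level" point concrete, here is a minimal sketch in Python using only the standard library's `ast` module. It is not Sourcegraph's implementation; the sample source and the helper names `text_matches` and `call_sites` are made up for illustration. It contrasts a grep-style text match with a search that only reports real function calls.

```python
import ast

# Made-up sample program: the name "parse" appears in a definition,
# in a docstring, in a real call, and in a plain string.
SOURCE = '''
def parse(text):
    """Parse text. Note: parse() is also mentioned in this docstring."""
    return text.split()

def main():
    words = parse("a b c")  # real call
    label = "parse"         # just a string
    return words, label
'''

def text_matches(source, name):
    """Grep-style search: numbers of all lines that contain the name."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if name in line]

def call_sites(source, name):
    """Semantic search: numbers of lines where `name` is actually called."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    ]

print(text_matches(SOURCE, "parse"))  # definition, docstring, call, string
print(call_sites(SOURCE, "parse"))    # only the real call
```

The text search cannot tell the docstring mention or the string literal apart from the call; the parser-backed search can, which is the distinction described above.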

Starting point is 00:18:28 So it basically, those are things that programmers do every day. And it's a tool that helps you answer the most common everyday programming questions in seconds. There you have it. Let's take a break then, because we got tons of questions about Sourcegraph, everything from licensing to what you're open sourcing, how you choose what to open source, why you even open source, and maybe some of the perspectives you have around how you license

Starting point is 00:18:53 the different software you have and stuff like that. And this big idea of being able to be the Google of code, basically. So let's pause here, take a break. When we come back, we'll dive a little further in. If you're focused on the data layer, there's an awesome conference being put on by our friends at Compose. Monolithic databases are so 20th century. Today, teams are using a JSON document store for scale,

Starting point is 00:19:17 a graphing database to create connections, a message queue to handle messaging, a caching system to accelerate responses, a time series database for streaming data, and a relational database for metrics and more, it can be hard to stay on top of all your options, and that's why you should attend. While much talk in developer circles these days focuses on the app layer, not enough attention is placed on the data layer, and data is the secret ingredient to ensuring

Starting point is 00:19:40 applications are optimized for speed, security, and user experience. Hear talks from GitHub, Artsy, LinkedIn, Meteor, Capital One, and several startups, including Elemento and DynamiteDB. Talks range from the Polyglot Enterprise to using GraphQL to expose data backed by MySQL, Elasticsearch, and more. The conference is in Seattle on September 28th. Tickets are just $99, and Changelog listeners get 20% off. Head to datalayer.com and use the code CHANGELOG when you register. We're back with

Starting point is 00:20:18 Byung-Loo, CTO of Sourcegraph. And Byung, before we took the break, we obviously got an explainer of exactly what Sourcegraph is, but it goes much deeper than this. I'm not sure if you coined this term or not, or if this was The New Stack or Susan Hall who wrote this article, but the title is

Starting point is 00:20:38 Sourcegraph aims to be the Google for code and being a public utility for all developers out there, you know, being able to look up functions and dive into different usages of, of open source, whether it's private or public,

Starting point is 00:20:51 help us understand the beginnings of this company, what this company was founded upon and why you actually built it in the first place. As far as the beginnings go, you know, it was Quinn and myself in the beginning. And it really grew out of this itch

Starting point is 00:21:06 that we had ourselves as programmers, which was, we felt that a lot of the code that we were writing was somehow duplicative. Either, you know, someone in our company had probably already written it, or there's probably some open source library that we just weren't aware of, or just, you know, couldn't figure out how to use that might save us a lot of time. And I think almost every professional programmer is aware of how often programmers reinvent the wheel every single day. And we're trying to think about how we could encourage more code reuse. What was the thing that was preventing us from going out and discovering the pieces of code

Starting point is 00:21:48 that we knew someone somewhere had already written, but it was just too difficult to find it out? And so we started thinking about it, and what it came down to was, well, look, code is actually really highly structured data. I come from a machine learning background and natural language processing. There's a lot of parallels between natural languages and programming languages. But the difficult thing about natural languages is that

Starting point is 00:22:16 even to construct a simple parse tree from an everyday English sentence, that's still an open research problem. Whereas with programming languages, you have this thing called a compiler or interpreter that just gives you literally everything you'd ever want to know about a block of code. And once you have all that data, then you ask yourself,

Starting point is 00:22:37 well, can I build a system on top of this that helps me automate or partially automate the task of finding pieces of code, of reading through existing pieces of code, and really understanding that piece of code in a way that lets me use it. And so that was kind of the itch that we were scratching. And a couple other points of inspiration for us, the stuff that we saw inside of Palantir was definitely something that solidified our belief

Starting point is 00:23:12 that this was not only a problem that programmers everywhere face, but it was also a problem that was important to leaders of large businesses. And the other point of inspiration that we took was, uh, I had previously, uh, you know, done an internship inside of Google, and Google internally actually has this great utility. If you ever meet a software engineer who works in the main Google code base and you ask them, uh, what they think about Google Code Search, I guarantee you they will say it's, uh, the best thing since sliced bread. Just ask them how many times they use it every day, how often they have it open in some browser

Starting point is 00:23:52 tab, and they'll tell you 60, 70, even 80% of the time, I just have it open as a reference. And so seeing the value that that provided inside Google and also just missing that tool and not seeing it anywhere else in kind of the every individual developer out there to go and take advantage of this giant corpus of human knowledge that is open source code and code inside your company and kind of build on the, stand on the shoulders of giants, so to speak. You definitely bit off a big problem in terms of just surface area,

Starting point is 00:24:47 I think, with things to do. Because even once you have the analysis done, you're collecting all the data, I'm sure you guys have some sort of crawler or something that's spanning the different code bases and finding other pieces of code

Starting point is 00:25:01 that it can go index. Then you have developers using all these different environments, their editors, you have how many languages. Was it ever overwhelming to say, how can we provide support for all these popular editors and then across all these languages to where we're actually going to provide a holistic solution for people? Yeah, so that was definitely kind of a sticking point in the early days.

Starting point is 00:25:29 And one of the first technical hurdles we had to overcome was, how do we do this in a manner that's efficient? How do we make it so that we're not, 10 years from now, still writing the plugin for the umpteenth language that we want to support? And that kind of leads into the creation of SourceLib, which is the open source library that powers a lot of the underlying source code analysis that gives you what you see on Sourcegraph. And the basic idea of SourceLib is, look, as far as end-user applications are concerned,

Starting point is 00:26:13 applications that want to make code explorable and accessible, so I'm thinking editor plugins, things like Google Code Search or Sourcegraph, most programming languages are basically the same. They all have a way to define things and name them and reference those things in some other part of code. So if we can kind of put the data that is the code in a form where you just capture kind of like that essential part of it

Starting point is 00:26:42 and it's a kind of common language agnostic schema, then you can just build your end user application on top of that single schema. And then underneath, you just have to build a bunch of translators from different languages to that schema. So that takes it from this problem of having to build a specific library or plugin

Starting point is 00:27:08 for every combination of editor and language to, okay, now you just have to build a translator for every language to the schema, and then once you have that, you can build a single application that understands all those languages at once. It's like the adapter pattern for languages. Yeah, exactly.
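
The "one translator per language, one application on top" idea can be sketched roughly as follows. This is an illustration under assumptions, not SourceLib's actual schema or API: the `Def` record, the `TRANSLATORS` table, and the `outline` function are hypothetical names, and the Go translator is a toy regex standing in for a real parser.

```python
import ast
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Def:
    """Hypothetical language-agnostic record: a name defined at a line."""
    language: str
    name: str
    line: int

def python_defs(source):
    """Translator for Python, using the standard library's real parser."""
    return [
        Def("python", node.name, node.lineno)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    ]

# Toy pattern for top-level Go functions; a real translator would use a Go parser.
GO_FUNC = re.compile(r"^func\s+(\w+)\s*\(")

def go_defs(source):
    """Toy translator for Go: a regex stand-in, for illustration only."""
    defs = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = GO_FUNC.match(line)
        if match:
            defs.append(Def("go", match.group(1), lineno))
    return defs

# Adding a language means adding one entry here; applications do not change.
TRANSLATORS = {"python": python_defs, "go": go_defs}

def outline(language, source):
    """One 'application' written against the shared schema only."""
    defs = sorted(TRANSLATORS[language](source), key=lambda d: d.line)
    return [f"{d.line}: {d.name}" for d in defs]

print(outline("python", "def a():\n    pass\n\nclass B:\n    pass\n"))
print(outline("go", "package main\n\nfunc main() {\n}\n"))
```

With N languages and M applications, this layout needs N translators plus M applications, rather than one plugin for every language and application pair.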

Starting point is 00:27:29 It takes it from an O of N squared or O of N times M problem to an O of N problem. Right. So that was where you started. And so what I would like to find out about is the schema that it gets translated into. Like, what are the bits and bobs that you guys need for each of these, the normalized version? And then how do

Starting point is 00:27:52 you store those? Yeah, so the schema, it's a graph schema. So, you know, the schema is in the name of the company, Sourcegraph. It's literally like a graph of source code. And so there's kind of three fundamental concepts in the schema. So one is kind of the AST node. So this is kind of, once you've parsed the code, this is like the essence of the code. Like once you have the AST, then you can kind of derive every other fact about the code that you need from that. And you can also translate it back to text. It's like the perfect, I guess you could say it's like the natural form of code as data. And then in addition to the AST node,

Starting point is 00:28:41 the things that really let us kind of build useful features on top of it are two concepts, a definition and a reference. So a definition is just a function declaration, or a class declaration, or a variable definition somewhere in code. It's basically anywhere you define a name in code. So we extract all those, and we produce a unique identifier for each that's global to all the code in the world.

Starting point is 00:29:16 And then on the other side of the table, you have references. So references is any time one of the names that you define in code is referenced. So it could be a function call. It could be a type reference. It could be a package import. And once you have these two things, definitions and references, that essentially allows

Starting point is 00:29:35 you to walk the graph of source code. If you think about the things that we do probably hundreds of times per day as developers when we're kind of exploring the data set of the code we're working on, it's following forward and backward links. It's either jumping to definition or finding references. And that's kind of the bread and butter of what we do. And that's exactly what the schema allows you to do. The main difference is that because of that globally unique identifier,

Starting point is 00:30:07 you can now do so across all the code in the world rather than just the code that's on your local machine. Which is pretty rad. So SourceLib, Open Source, MIT license, SourceLib.org, we'll link it up in the show notes. That seems like you've opened up a core piece of your guys' business. Is that not the case?
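
The definition and reference concepts described above could be sketched roughly like this. It is a toy in-memory model under assumptions, not the real schema: `DefKey`, `Location`, and `CodeGraph` are hypothetical names, the repository names are made up, and the AST node concept is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class DefKey:
    """Hypothetical globally unique identifier for a definition."""
    repo: str  # made-up repository name, e.g. "github.com/example/lib"
    path: str  # name path inside the repository, e.g. "Client.Get"

@dataclass(frozen=True)
class Location:
    """A position in some file of some repository."""
    repo: str
    file: str
    line: int

class CodeGraph:
    def __init__(self):
        self._defs = {}                 # DefKey -> Location of the definition
        self._refs = defaultdict(list)  # DefKey -> Locations that reference it
        self._ref_at = {}               # Location of a reference -> DefKey

    def add_def(self, key, where):
        self._defs[key] = where

    def add_ref(self, where, key):
        self._refs[key].append(where)
        self._ref_at[where] = key

    def jump_to_definition(self, where):
        """Forward link: from a reference site to where the name is defined."""
        key = self._ref_at.get(where)
        return self._defs.get(key) if key is not None else None

    def find_references(self, key):
        """Backward link: every site, in any repository, that uses the name."""
        return list(self._refs.get(key, []))

# A definition in one repository, referenced from another repository.
graph = CodeGraph()
get = DefKey("github.com/example/lib", "Client.Get")
graph.add_def(get, Location("github.com/example/lib", "client.py", 10))
call = Location("github.com/example/app", "main.py", 42)
graph.add_ref(call, get)

print(graph.jump_to_definition(call))
print(graph.find_references(get))
```

Because the key includes the repository, a reference in one code base can resolve to a definition in another, which is what lets the two lookups work across code bases instead of only on a local checkout.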

Starting point is 00:30:29 Yeah, it's a good library. I would say, you know, for us, it's just something that felt like it should be an open standard because it's going to be useful, I think, for a lot of tools beyond just Sourcegraph. We hope that, you hope that this is one wheel that people shouldn't have to reinvent when they're trying to build great tools for developers.

Starting point is 00:30:50 As far as the business case is concerned, we really think that the value we're going to provide to companies is scaling this across the entire open source universe and the code inside their company and connecting those two different worlds of code together. So there's a lot of additional technology that we built around scaling this, making it super fast across all the code that you might use, that is not in SourceLib.

Starting point is 00:31:18 SourceLib is kind of the analysis primitive. Also seems like a really nice way, and I know people hate when we use the word leverage, but, um, when you take advantage of, well, that sounds bad too, but just kind of the open source spirit, right? Where what you have, I mean, especially when you have like an adapter situation where you have all of these uncommon interfaces, and what makes your guys's end goal and end product better is the more adapters that you have. So for instance, you may not have the time or the capacity to write the Elm, what do you call them, analyzers? Yeah, the language analyzer that would conform it to SourceLib.

Starting point is 00:31:58 But you know the Elm community, when they see, you know, you can use Sourcegraph on GitHub and look at the Go code and see what it does, they think, oh, I want that. They're going to get excited. They might actually build that for you. And then on the other side, you have your editors or your plugins, and you could have the same situation there. Maybe the Atom community says, why don't we have a Sourcegraph for Atom?

Starting point is 00:32:19 Yeah. Or not Sourcegraph, but maybe something that adapts to SourceLib for Atom, and then they can do that. So it seems like a great business case for what's also beneficial to all of us as open sourcers is, you know, we don't have to be the only ones building this stuff. Yeah, exactly. You know, there's so many use cases out there for a library like that, that, you know, we're not going to be the ones to think of, that someone else is going to think of. And yeah, that's exactly what happened when we released it.

Starting point is 00:32:47 There were a lot of people in the community that reached out and said, hey, I want to build out support for this editor or this language. And it actually helped us on the business side too. One of the companies that uses Sourcegraph is Twitter. And we're deployed to Twitter's entire Scala code base. And there, they reached out at a point where we didn't even have Scala support. But one of their engineers wanted this so bad

Starting point is 00:33:18 because he had also been a previous Google engineer, and so he wanted something kind of like Google Code Search. And so they actually built out Scala support as kind of a Hack Week project. And we kind of took it from there. So it's great to have the source code of your product just publicly available. Because speaking as a programmer myself,

Starting point is 00:33:42 it's just magical when I use a product and then I can go and see kind of how it works internally. It kind of gets back to, you know, people used to say that like in the old days, you know, with hardware, you know, back in the 70s, you know, you'd buy an old clock radio or something or an old computer and you could just take it apart as a kid and kind of like figure out, map out how everything kind of worked. And that's kind of like a magical experience. You know, today it's not really a thing anymore because, you know, hardware is so complex and, you know, like some pieces of hardware like even try to prevent you from kind of taking it apart and seeing how it works. And I just think as an engineer,

Starting point is 00:34:25 it's just a magical experience where you buy something, you get a lot of value from it, and then you can just kind of disassemble it and peer inside and see how it works. Yeah. Or even in this case, make it better. Yeah. A lot of times when we open things up,

Starting point is 00:34:42 we can't get them back together again. But with source code, you can always just git reset --hard and then you're right back. Exactly. Exactly. Yeah. Being an open standard, basically, it's an invitation to the community out there that, if motivated enough, as Twitter was, as you mentioned with Scala, in a weekend they could run a hack or something like that, or a hackathon internally or whatever, and build out their own piece, and it could possibly actually be adopted into the main repository or whatever. But having that motivation, if you're motivated enough and having open source, you're able to build out your own thing based on that or build on top of it if you wanted to. And it's just like an open invitation to do that. I'm kind of curious

Starting point is 00:35:36 though, you know, whenever you search with Sourcegraph or do any of this stuff that you do, like being able to search a function or whatever: what sources are behind Sourcegraph? Like, what do you comb? How does that work? So we crawl a lot of the major open source code repositories. So GitHub, Bitbucket. Currently we crawl mainly

Starting point is 00:36:05 fully formed code repositories. In the future we might also want to do snippets that are just found in blog posts and Q&A forums around the internet. But right now it's just kind of like the go-to places where most open source code is hosted.

Starting point is 00:36:22 Did you have to do anything special to get access to that or blessed API access or anything like that? Any sort of relationship you have with the code hosts? No, nothing formal. So we hit their APIs for some metadata, but by and large, we mostly just hit the Git API. So just like Git clone, that kind of thing.

Starting point is 00:36:48 And that's nice for us because a lot of companies don't use a well-known code host internally. They just have a Git repository. And so you can just point us to any Git clone URL and we'll be able

Starting point is 00:37:04 to index that code. So whenever you do that, are you actually pulling down the full repo? Walk us through what actually happens whenever you ping a source, you pull back the, you know, the whole scheme of translation you talked about before with SourceLib.

Starting point is 00:37:19 What happens then? Like what kind of data do you actually store about a repository and the code that's in it? Yeah, so it's all kind of ephemeral. So if you give us access to your repository, every time we detect a new commit, we fetch that commit, we clone the repository, and then we just run SourceLib as kind of a command line tool

Starting point is 00:37:40 in a Docker container, and that outputs the data in the schema that we expect, and then pushes that to an API endpoint in the Sourcegraph web application. And underneath the hood, that then deserializes that and then stores it in one of several underlying database systems that we have. And I guess I could, so with SourceLib, it actually, the way SourceLib is structured is that it's kind of got this core orchestrator part of it, which kind of defines the schema and is responsible for coordinating

Starting point is 00:38:24 the interface between SourceLib and the outside world. But underneath the hood, it just shells out to a bunch of different command line tools. We call them toolchains, and each of the toolchains is responsible for translating from a specific language to the thing that SourceLib expects. You mentioned blog posts potentially being extended to this. I'm thinking back in the day of microformats. Is there some sort of spec that you plan on doing that might extend from SourceLib or whatever,

Starting point is 00:38:54 or some sort of schema to adopt in terms of HTML, some sort of fragmenting to make that more possible? Like, hey, you scan any blog or any Medium post or whatever, and you auto-discover anybody who wants to sort of offer their code samples up to SourceLib, or sorry, I guess Sourcegraph, not SourceLib. Yeah.

Starting point is 00:39:15 Um, what's your plan there? Pre tags. Simple as that, I guess. Yeah. So, you know,

Starting point is 00:39:22 we, we used to have this thing called source boxes. That was really cool. It basically allowed you to embed an interactive code snippet inside your blog post. The only problem was the way we implemented it was this JavaScript thing that you would embed. So you actually couldn't embed it in a Medium post

Starting point is 00:39:39 or any other blog site unless you had the ability to post scripts to the site. So we kind of discontinued that. But we've been thinking about this a lot. And I think there's a couple of directions we could take. If any of your listeners are bloggers, I'd be curious to hear how useful they'd find this. But so one direction we could take is you give us any snippet of code, and we'll kind of parse it and emit HTML with links to the documentation

Starting point is 00:40:14 and usage examples of whatever you call on Sourcegraph. Granted, when you send us the code, you'll have to give us enough context so that our analyzer can actually figure out what code your thing is calling. Like if you just type, you know, http.NewRequest and just give us that one-liner, that's probably not enough context

Starting point is 00:40:37 for us to resolve that to, you know, the NewRequest function in the standard library. But if you give us, you know, the import at the top and a couple other lines of context, I think that should be good enough. And the other angle we're thinking of coming at it is we have this Chrome extension now

Starting point is 00:40:55 that you can install in Chrome. And what it does is as you're browsing code on GitHub, it hits the Sourcegraph API and gives you jump to def and find refs and simple search right in the GitHub UI. And a lot of people really like that. It also does that in pull requests. And that's something that's really useful for code review, just like being able to jump to def when you're reading through a large code review is so helpful. But we're thinking about extending that to code snippets, too, so that if you have the Chrome extension installed,

Starting point is 00:41:26 let's say you come across some post on Stack Overflow that has a lengthy snippet that references some function, and now you want to figure out what that function does, the Chrome extension could link that snippet of code. So you can just hover over a reference to see the documentation and click it to jump straight to where it's defined if you want to go diving into the source. What exactly, in terms of Sourcegraph the product, and you can help us differentiate free versus paid or open

Starting point is 00:41:57 versus licensed as well, but, like, what is it in terms of how I use it today as a developer? Is it plugins? You mentioned the Chrome thing. Do I go to your website? Give us the lay of the land. It's free for open source and always will be. If you're using it inside a company, you can use it for free up to, I think, the limit is 15 people now. And after that, there's kind of, you know, the standard per-seat pricing model.

Starting point is 00:42:26 As far as how you can consume it, we've actually experimented with a couple ways you can consume it. So the most popular way of consuming it is just going to sourcegraph.com and using it as a web application that gives you, you know, global search, global usage examples.

Starting point is 00:42:44 So you get usage examples pulled from every open source library that might use a function. And, you know, a bunch of other stuff that's useful in the application. The other alternative is some people prefer a native application. So kind of the same way that, you know, Slack, the Slack native app is essentially the web application in a native frame. The Sourcegraph desktop app is essentially the same experience,

Starting point is 00:43:10 but in a native frame. But with the added benefit of direct editor integration. So if you install a plugin, it'll add some shortcuts to your editor that make it super simple to look up stuff in Sourcegraph. So as you're coding, Sourcegraph will kind of like preload the documentation and usage examples it thinks are relevant to the code that you're writing.

Starting point is 00:43:37 And so you can quickly alt-tab over and get the answer to, you know, how do I use this function in a split second. And then there's a Chrome extension, which if you find yourself reading code on GitHub a lot, just install it. I mean, I'm biased, obviously, but I think it's a magical experience to click into code on GitHub and everything's just linked. You can hover over for documentation and click on something.

Starting point is 00:44:08 And even if it's defined in a completely separate repository, you know, you're there. What about language support? Yeah, so language support, we support officially Java and Go. We have Python deployed to private beta. JavaScript's also in private beta, but we're not confident enough in the quality of those yet

Starting point is 00:44:27 to make those public. But if you sign up for the beta, we'll try to get you on as quickly as possible. And then we have a couple other languages in the pipeline. And we have Scala inside some companies, but that's not public yet either. Well, what's up with that, man? Twitter added Scala support; we've got to get that for the rest of us, right? Yeah, yeah.

Starting point is 00:44:50 You know, the Twitter dev team has been great working with us on that. We're just kind of going through the process with them right now. I'm sure there are contractual agreements with that particular customer. Yeah. Very good. I think that helps us understand exactly what exists in terms of how we could use it today. And I think we're going to tee up our next break. But I do have a question for you with regards to all the data that you're capturing.

Starting point is 00:45:19 And we should also talk about private source versus open source. But you're collecting a lot of data. I'm sure you're well aware of, you know, GitHub's recent push into public data with BigQuery. And I'm guessing that Sourcegraph has some overlap there, perhaps. So let's not answer that now, but let's just take a break and we'll answer it on the other side. Every Saturday morning, we ship an email called Changelog Weekly. It's our editorialized take on what happened this week in open source and software development. It's not generated by a machine. There are no algorithms involved. It's me, it's Jared, hand curating this email, keeping up to date with the latest headlines, links, videos, projects, and repos.

Starting point is 00:45:59 And to get this awesome email in your inbox every single week, head to changelog.com slash weekly and subscribe. All right, we are back with Byung-Loo and we are talking about Sourcegraph, source code, all that good stuff. Byung, we mentioned before the break that you are collecting a lot of data. Yep. I like how you think about code as data. Seems like a very powerful way to think because you end up with tooling like this. And recently, GitHub and Google made a big announcement

Starting point is 00:46:34 around BigQuery and GitHub's public data set where they have added not just the commits and issues, I believe it was previously. Yep. They now actually have full source code snapshots in BigQuery, queryable. And that was something that has been pretty cool and opened up a lot of opportunities to answer certain questions amongst open source people like us. Yep.

Starting point is 00:47:00 I'm thinking that you guys have very similar type data and perhaps there's some opportunities there with regards to reporting, analysis, what have you. Can you talk to us about that? Yeah, totally. First off, I think it's awesome that GitHub and Google released that data. It's a really interesting data set, and there have been a lot of great blog posts written about that. They've been just really interesting to read about certain patterns you can find in open source.

Starting point is 00:47:29 I think the main way the data that we're recording is different from that is, my understanding is that the GitHub dump is basically a dump of source code as text, whereas on the back end with Sourcegraph, we actually go and parse out all the code. So we store every function definition and method call and things like that separately

Starting point is 00:47:53 as kind of a distinct node in the graph. So there are certain operations that might have a lower false positive rate on top of that data set. That having been said, we've thought a little bit about the use case of, hey, I'm a key open source author or I'm a senior engineer at my company. I want to go and analyze the code base to see what kind of high-level patterns I can discover. But at the moment, we're very focused on building for the day-to-day use case of developers, so helping developers answer the most common everyday questions they have in seconds.
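To make the "code as parsed nodes rather than text" point concrete, here is a minimal sketch in Python. It is not Sourcegraph's or SourceLib's implementation: it uses Python's standard `ast` module as a stand-in analyzer, and the sample source, the `build_graph` helper, and the record shapes are all hypothetical. It shows why a lookup against parsed definitions and calls returns fewer false positives than a plain text match.

```python
import ast

# Hypothetical sample file: the name "new_request" appears in a definition,
# a docstring, a string constant, and one real call.
SOURCE = '''
import http.client

def new_request(method, url):
    """Build a request; see new_request docs."""
    return (method, url)

def main():
    label = "new_request is mentioned in this string"
    return new_request("GET", "https://example.com")
'''


def build_graph(source):
    """Parse source and return (definitions, calls) as separate records.

    definitions maps function name -> line number of its def.
    calls is a list of (called name, line number) for simple name calls.
    """
    tree = ast.parse(source)
    definitions = {}
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            definitions[node.name] = node.lineno
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.append((node.func.id, node.lineno))
    return definitions, calls


definitions, calls = build_graph(SOURCE)

# Plain text search: every line that merely mentions the name.
text_hits = [
    number
    for number, line in enumerate(SOURCE.splitlines(), start=1)
    if "new_request" in line
]

print("definitions:", definitions)
print("calls:", calls)
print("text hits on lines:", text_hits)
```

The text search reports the docstring and the string constant alongside the real definition and call, while the parsed records keep only the definition and the call site, each as its own entry.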

Starting point is 00:48:40 Whereas the type of analysis you would do with that larger kind of data set, in my view, is kind of something that you would kind of do every once in a while as a senior engineer, I think. Also, you have to be motivated, too, because it costs money. That's such a huge factor. But obviously, if you're going to pay per query or pay per size of queries, then you're going to want to think a little closer to what you're actually doing. It's probably going to be a barrier to that entry, not so much to pay for it, but if you had a general question, you might want to ask BigQuery in this data set,

Starting point is 00:49:17 but generally, you got to be pretty motivated because you have to pay for it. Yeah. It's a disincentive whenever you have, like, pay-per-use querying, because every time you do it, even if it's a small sum, there's something in us as humans where we're like, oh, I've got to pay for that? Oh, I'll just figure it out myself. You know, when you're coding, you want all that access at all times, and you don't want to be thinking, is this lookup worth it to me? I guess the other question on the side of that, Jared, is, so since BigQuery

Starting point is 00:49:47 obviously is a paid tool and searching the GitHub data set on BigQuery is part of that, you know, the question for Byung might be, how do you make it free, for one? And how do you make it as fast as you have? I think Brian Ketelsen mentioned, in the GoTime episode we've talked about a couple of times on this show so far, that he actually had to uninstall something because it was a little slow, and you're aware of that, but for the most part it's pretty fast to get these lookups

Starting point is 00:50:12 back. Yeah, so it really, I think, comes down to how we store the data. If we were storing all the code in the world as text, it would actually be pretty expensive to kind of comb through all that text and try to parse it with regular expressions and return answers in a live fashion. But the kind of

Starting point is 00:50:36 high-level way to describe it is we're taking advantage of structure in the data to make the problem of querying it faster. So one of the reasons that search is a lot faster is we don't have to index every single token in, you know, a string constant or a docstring. We can just scope our search to the functions that we know are actual function definitions, and so that reduces the quantity of data that we have to sift through by a lot. And there are other sorts of gains that we can get on the back end because all the data that's coming into us is in the SourceLib schema, as opposed to just, you know, a file with a bunch of text in it. You still have to be connected to the source, though. Is there any chance of, like, offline support? I'm thinking of times of, you know, bad latency, you're on an airplane, times where you don't want to lose that customer.

Starting point is 00:51:33 You don't want to lose Brian. You know, he's got his Vim open, he's got his Sourcegraph, and he's your customer. And now he's like, oh, this is either too slow right now or it's not available. These are probably things you guys are thinking about. Yeah, that's an avenue that we're thinking about with desktop: just kind of getting the code that you're writing in real time

Starting point is 00:51:52 and getting that into Sourcegraph so that when you pop over to ask a question, it kind of has the data ready. But I do think that's a little bit more of a nice-to-have use case, just because if you're on an airplane programming and there's no Wi-Fi, then at that point you probably can't even look up documentation, if the documentation is hosted online, or, you know, read the code on GitHub. So at that point, you know, you're in the mode where hopefully you're not having to

Starting point is 00:52:19 rely on external libraries that you don't uh that you don't know as much. And you're just, you can be, like I try to, whenever I'm like about to take a plane ride, I try to think of like, okay, what's the most kind of like isolated coding task I could do? Like the thing that I can just like, you know, be in the zone for, you know, five hours

Starting point is 00:52:40 and just hit the standard library for, yeah. Yeah. I'm with that. I'm also against it to a certain degree, so, pushback: I moved to the country and, Adam can attest to this, I have bad internet. And so I often find times where it's painful to work online. Sometimes I'll just go completely offline. And so in those cases it's similar to when you're getting ready for your plane ride. You know, there are tools, like on the Mac there's an app called Dash, which is a paid app. Awesome, it's a great one. And, you know, it's a tool that many people are happy to pay for because it will offline all those docs and make them searchable and stuff. So, and I used to be swimming in bandwidth, and so I was like, who cares, guys? But,

Starting point is 00:53:30 you know, it's very narcissistic of me, but now that I have the problem and experience it firsthand, it's definitely a nice-to-have, but for some people it could make or break a customer. And so I would say, think about ways that, even if it's not the global Sourcegraph, right, if it's not everybody's code, it's at least either hot code, like things I've been looking at recently, or my, you know, local repository stuff.

Starting point is 00:53:57 I think having that would be a really interesting extension of what you guys currently do. That's true, yeah. Most people have a /code or /projects directory where they keep all their source code locally. You could even crawl locally one particular

Starting point is 00:54:14 directory or a set of directories based on a config. Totally. That's actually a great point. One of the things that... It's a good feature. Let's do that. One of the things that we really want to make possible, you're in the country where the internet is terrible.

Starting point is 00:54:31 Another place that is terrible is the developing world. There's a lot of people who could become great programmers and contribute to the global graph of knowledge and software, but they're kind of hamstrung by poor connectivity. So just kind of thinking out loud, one of the things we could do is, if you have code that you're working on on your local machine,

Starting point is 00:54:58 Sourcegraph is smart enough to understand what exactly you're depending on, because we can actually go and parse the build file and figure out these are the repositories that you're depending on. Because we can actually go and parse the build file and figure out these are the repositories that you're using. And once we have that, we could just kind of like pre-fetch all the data for those things and store it locally

Starting point is 00:55:14 and make that accessible. It'd be kind of like the equivalent of Google Maps, like save offline maps feature. So if you know you're going to go to like a zone of poor connectivity, or if you just happen to live in one, maybe you could rely on that. Yeah, it's not an easy problem to solve. But one thing that I've realized,

Starting point is 00:55:36 and I think this is what you guys are going for with any developer-focused product, is anytime you can make a developer say, I love Sourcegraph, you're winning. And every time I have to go offline and I can still work because of that Dash app, I say to myself, I love this thing. And so, like, it's rare, right? Most of the time online, everything's fine. But when I have to use it and it's there for me, that's when you turn normal customers into customers that love your stuff. So yeah, totally. And you know what, back when I first started programming on a computer, I remember, you know, in those days I was writing mostly Java and mostly

Starting point is 00:56:16 the standard library. You could just pull all those documents, you could just pull all that documentation down and have it on your local machine. So even if you're going to like someplace where you didn't have the Wi-Fi password, it was all there. And it was almost, in some ways, a nicer experience because you didn't have the distraction of the internet while you're trying to code. I feel like these days, so many resources we look at are in you know, kind of in the browser that it's so easy to kind of get off on a tangent. You know, you're like, you try to look into how to do this one thing and then maybe the same forum post links to this other library. And, you know, you click on some other link and sooner or later you're like on Hacker News and you're like, how did I get here?

Starting point is 00:57:02 You still have Twitter open in a tab and they have that thing that updates the page title with the number of notifications you have. And so you don't have to view it. You're just there in a tab. Oh, I have three notifications on Twitter. And then you're just, it's an hour later. You haven't done anything. Speak for yourself, Jared.

Starting point is 00:57:21 Somebody else told me they did that. Might have been me. No personal experience. So Byung, you mentioned that you've got this background in machine learning, that that's a thing you love, obviously, and Jared mentioned that you're obviously collecting a lot of data. You think of code as data; that's a cool way to look at this, obviously. So, yeah, not that this isn't a big enough plan, you know, what you're doing with Sourcegraph, but you must have even bigger plans on top of all this knowledge, this wealth of knowledge you're ultimately building for the developer community. Can you at all share

Starting point is 00:57:54 the future for us? Like, what's over the horizon? What's something no one knows about that you can at least tease us with, with what you're thinking about for the future of Sourcegraph? Yeah, I'm happy to spitball. I just want to declare up front that, you know, as of now, we're not working on any sort of machine learning related thing. Like, as a person with a machine learning background, it kind of rubs me the wrong way. You know, a lot of companies say they're doing,

Starting point is 00:58:21 they have some fancy machine learning algorithm, and really it's just Mechanical Turk underneath the hood. I just want to make it clear that Sourcegraph is not doing that. If and when we do use machine learning, we want to have a very clear use case in mind. Now that having been said, one of the things that got me really interested in this problem in the first place was, as a person who likes data and thinking about how to model it, the data set of all the code in the world, it's got two properties. One, it's extremely interesting because it's such a valuable data set and there's so much information that's embedded in it.

Starting point is 00:59:07 And two, it's relatively unexplored. There's not a lot of tools that are specifically designed for reading and understanding that data. Most of the tools are optimized for creating the data, actually writing code. And so from the get-go, this has been something that's been in the back of our minds. Just to name a few things that we could do after we've collected the data set, kind of half-baked ideas. One is kind of intelligent autocomplete.

Starting point is 00:59:43 So we think of autocomplete as this thing that just cues off of compiler signals and gives you a list of all the tokens that are syntactically and semantically correct to use at a given point in a file. But what if you could actually go beyond that and suggest a variable name or suggest a parameter value

Starting point is 01:00:04 based on the surrounding context? Now, that prediction problem is a lot fuzzier. You probably won't be able to get that just from heuristics and what the compiler tells you alone. That's probably something that you want to learn. Like, okay, I've seen this pattern before in code, this pattern to AST. And in the past, you know, when I've seen the token, you know, read, for example, and now this user is calling some function that reads a file or, sorry, writes a file. And what if they're passing, you know, the wrong value of the permissions flag? They're setting it to, you know, 0666 instead of 0777.
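As a rough illustration of the "flag calls that differ from how everyone else uses the API" idea, here is a minimal Python sketch. It is an assumption-laden stand-in, not anything Sourcegraph has built: instead of a learned model it uses a plain frequency count, and the call-site data, the `flag_unusual_arguments` helper, and the `min_share` threshold are all invented for the example.

```python
from collections import Counter


def flag_unusual_arguments(call_sites, min_share=0.1):
    """Return call sites whose argument value is rare across the corpus.

    call_sites is a list of (location, value) pairs for one parameter of
    one function. A value is flagged when it accounts for less than
    min_share of all observed call sites.
    """
    counts = Counter(value for _, value in call_sites)
    total = len(call_sites)
    return [
        (location, value)
        for location, value in call_sites
        if counts[value] / total < min_share
    ]


# Made-up corpus: permission flags passed to a file-writing call.
CALL_SITES = (
    [(f"repo{i}/main.py", 0o644) for i in range(18)]
    + [(f"repo{i}/main.py", 0o600) for i in range(18, 21)]
    + [("mine/main.py", 0o777)]
)

flagged = flag_unusual_arguments(CALL_SITES)
for location, value in flagged:
    print(f"{location}: permission {oct(value)} is unusual for this call")
```

A real system along the lines described here would presumably learn from far richer context (surrounding tokens, docstrings, AST shape) rather than from raw value counts; the sketch only shows the shape of the output, a short list of call sites worth a second look.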

Starting point is 01:00:47 That's something that I think there's probably given enough data, you could probably learn some interesting patterns there for what things to flag to the user that, hey, maybe you're hitting this API incorrectly because you're using it in a different way than

Starting point is 01:01:02 the hundred other people out there in open source use it. So that's kind of like one half-baked idea we have in the back of our minds. Another problem which is kind of related to that is, in order to do that prediction problem well, a sub-problem you kind of have to solve

Starting point is 01:01:19 is the scoring problem. So given machine learning, the way you'd phrase it is, you know, given this piece of code, give me the probability that this piece of code exists or is valid. So you give it a likelihood score. And what that tells you is if you see a piece of unlikely code, like a piece of code

Starting point is 01:01:42 that your model thinks is like, oh, that's kind of interesting. More likely than not, it's an error. And you can flag that sort of thing. So think about running this model. You train this model on all the code in the world, and you discover kind of associations like associations of specific words and doc strings and, you know, parameter values and function calls. And then you can actually, once you've trained it, you run it on all the code in the world and you can kind of give a printout to people saying like, hey, you

Starting point is 01:02:15 know, in addition to the linter errors that you already get, here are some places where, you know, you might want to think about how you're calling this API or, hey, senior engineer, one of your jobs is to make sure that the other people on the team aren't shooting themselves in the foot or incurring a lot of tech debt. Here's a daily printout of hotspots that you might want to scan that our model kind of discovered. So both those ideas are very half baked. Um, haven't really explored them, uh, uh, seriously yet, but I think, you know, given the structure of this

Starting point is 01:02:51 data set and how novel it is, there's bound to be some great low-hanging fruit in there. Yeah. Just as an aside, I find it somewhat amusing that you were in research and doing machine learning, and you left it to get more into the industry side of things. Yeah. And you flash forward to 2016 and it's practically the biggest buzzword of the entire industry. It's like, everybody wants to know, are we doing machine learning?

Starting point is 01:03:19 Do we have any machine learning going on? So you couldn't actually be more industry right now. Yeah, you know, I think it's both good and bad. I'm glad that people are interested in machine learning; I think it can add a lot of value to a lot of products. Yeah. But, you know, along with the good also comes the hype, and it's kind of funny to watch. Absolutely. Well, let's shift gears a little bit and let's talk about licensing. So we have a few different projects coming out of Sourcegraph. Of course, we mentioned SourceLib itself, which is MIT licensed. You also have some cool new things like CheckUp, which we can talk about in a minute in detail. You also commissioned the creation of a new open source license called Fair Source, and you even hired a lawyer to write it. Can you give us the background on Fair Source, why it needed to exist

Starting point is 01:04:13 and what your thoughts are there? Yeah, totally. So just to be clear, we don't consider Fair Source open source, and we want to make sure that, you know, people understand we're not trying to pawn Fair Source off as an open source license. We think it's separate and distinct from open source, but we do think it has a place in the world. So the reason that we created the Fair Source license is that in open source, you kind of have this problem. And a lot of companies building open source technology have this problem where, you know, you want to build out something great, you know, a utility that people really rely on,

Starting point is 01:04:50 and you want to make the source code publicly available because it just feels like the right thing to do as a developer. You know, as a developer, if I'm curious, I want to be able to kind of peek underneath the hood and figure out how something works. Nothing's worse than when you encounter some bug and the thing that you're using is a black box and you have no way of fixing at all

Starting point is 01:05:10 or even understanding what's going wrong. And so we wanted to make the source code publicly available, but at the same time, we wanted to build a sustainable business on top of this because we think that this is a really valuable problem we're solving. It's going to add a lot of value to both technology companies and non-technology companies alike. And we think that it's fair for people investing time and effort into building these things to be compensated for the value that they're providing. And when we looked around, the classic kind of way to do this is kind of the dual licensing model where you release it as open source under some really restrictive license like a

Starting point is 01:05:55 GPL or AGPL, and then you have kind of a separate commercial license. But that just didn't seem like a great fit for us. Also, I mean, if you talk to lawyers in industry, there are actually a lot of concerns around that, you know, just like, oh, what if we accidentally pull in the GPL part of your code base, and we're not technically paying for it? There's a lot of fear, uncertainty, and doubt from the industry side of things. And we kind of looked around and said, well, can we kind of take some things from open source and take some things from closed source

Starting point is 01:06:34 and make a license that lets us release the source code publicly, but at the same time, you know, if a company like Twitter comes along and wants to use our product, we can charge them a fair price for the value that we're providing to their development team. And so we kind of looked around, we asked a bunch of open source contributors,

Starting point is 01:06:58 you know, what they thought about the idea. We were really worried that we'd get a lot of pushback from people because I think, you know, a lot of people, and rightly so, they have concerns about companies coming along and trying to cast things as open source that aren't open source. But what we found among open source authors was actually kind of this latent frustration at the fact that they're kind of investing so many hours of their lives. You know, a lot of these people have families and kids in addition to day jobs. And they're investing time and energy into these projects. And companies are using those projects to build things that make a lot of money. And the people actually building the underlying technology don't see a penny.

Starting point is 01:07:44 And, you know, that's bad because if you're building something valuable for the world, you should be able to make a living off of it. And so, you know, talking to those contributors kind of gave us the confidence to kind of keep looking around. And then we ended up meeting this lawyer by the name of Heather Meeker, who I think was involved in drafting the Mozilla Public License and a couple of other open source licenses. She's a lawyer who specializes in open source licensing law,

Starting point is 01:08:14 and she had actually been thinking about this same problem because she works with a lot of open source contributors as well, and she heard all the same frustrations, and it was kind of very serendipitous. We met her through, you know, a mutual friend of the company. And she said, you know, I would love to take this on as a project. And we said, that would be great. Can you draft up something simple that we can use to release our source code publicly, but still retain the ability to build a business on top of it? And that's kind of how Fair Source was born.

Starting point is 01:08:48 Adam, we mentioned that Beyang has pretty much hit for the cycle on the Changelog network, but he actually hasn't been on Request for Commits yet. And this sounds like a good topic for our brand new show with Nadia and Mikeal. Yeah. That's with Nadia Eghbal,

Starting point is 01:09:04 right? She was on the show a couple of weeks back. Yeah, she was on the show. We had her on the Changelog all the way back in January. Yeah.

Starting point is 01:09:12 And then since then, we enjoyed talking to her so much, and we told her if she ever wanted to do a podcast, she should come to us. And she did. And we've been working with her and Mikeal Rogers, who's the... is he the head of

Starting point is 01:09:26 the... what is he in the Node Foundation, Adam? He's something for the Node Foundation. Community manager, that's what it is. So the entire show is based around the human side of open

Starting point is 01:09:42 source and sustainability and licensing and governance and all such things. I'm sure Nadia and Mikeal have a lot of opinions about Fair Source one way or the other, whereas I do not have very many opinions. Adam, what do you think? I would say, well, I don't know, you've got some opinions

Starting point is 01:10:00 too, but maybe their opinions run deeper. Yeah, they're probably more informed, whereas mine are just gut reaction. Yes. Oh, this is great, or oh, this is terrible. Yes. I would also say too that they actually have Heather Meeker on the list, and I think it would make sense, obviously, to add someone from Sourcegraph to that conversation too. It hasn't been scheduled yet, but I mean, this story is really fascinating, especially how you said it's not to replace open source. It's not, it is not open

Starting point is 01:10:30 source. And I think that's a good caveat to add to that even before mentioning it like you did, because some, especially like me, I know that whenever I first looked at it, I thought that this was a new type of open source. And you know, you explaining that portion of it makes it a lot more clear that you're not doing that. Yeah, it's not open source at all. Yeah. So I guess the plan for this license for you in this case was to be a license for your core application.

Starting point is 01:11:01 Is that... that's not open source yet, though, right? No, it's still a private repository at this point. We want to release the code publicly soon, but there's just some cleanup things that we want to do before we are comfortable sharing the code with the world.

Starting point is 01:11:18 Hopefully in the next couple weeks, it'll be online. You can look it up on GitHub. Have you done much discussing or talking out there on the internet anywhere about Fair Source and the motivations behind it and the plan for it? Yeah, there's been some interest from open source authors. A journalist from Wired reached out a couple months back, and my co-founder Quinn spoke to him, and I think he wrote up an article. But it's not been kind of a core focus of the company.

Starting point is 01:11:51 Like the main focus right now is just building an awesome product for developers. It's just, this is just a means for us to release the product in a way that we think is kind of the right way to do it for developers. Any common myths about this license you want to debunk right now? Yeah, I think the main myth is that we're trying to cast it as an open-source license. It's not open-source. We've tried to make that clear from day one. I think maybe it's just the fact that it's called blank source.

Starting point is 01:12:21 People confuse it. We're not trying to kill open source. We love open source. I personally have gotten a lot of value from open source software. You know, as a curious teenager, I wouldn't have been able to dive into the repositories that I dove into if things weren't out in the open. I use Linux as my programming environment. And the worst thing would be for this to lead to the demise of the open source world. The main goal is just to let us release our code in a way that we think it should be released and also give other code authors out there a way to actually see some of the value financially in terms of what they provide to companies that use their software.

Starting point is 01:13:21 And I guess to that end, the license does include a clause saying it's actually a parameterized license, meaning you can call it FairSource 10 or FairSource 15. And what that means is any company that's below the size of 15 people can use your software for free. And it's only after you hit that magical limit that then they need to, to acquire a separate commercial license from you. I'm glad you mentioned that number there. Cause I was actually thinking about that. And it's in my notes to mention, but I almost forgot. So how, how in the world do you,

Starting point is 01:13:58 do you track that? I was thinking like when I first read that, I'm thinking that's great, but how do you track usages of it? You have to kind of operate on this honor system then too. How do you create a conduit to get paid? Is it just a known way to pay someone? Like how do they say, hey, I'm honest and I've used 16 of the 15 licenses and I've got to pay for license number 16 or something. How does it work? How do you plan to work? Yeah, so we're not trying to make money off of individual programmers or super small teams working for mom and pop coding shops or small companies.
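Editor's note: the parameterized limit discussed above ("Fair Source 10", "Fair Source 15") can be illustrated with a minimal sketch. It assumes, following the "once they get to 11" and "more than 10 people" phrasing used elsewhere in this conversation, that use is free up to and including the limit and that a commercial license is needed only beyond it. The function names are hypothetical, and nothing here is part of the license text or of any Sourcegraph tooling.

```python
import re


def parse_user_limit(license_name: str) -> int:
    """Pull the numeric user limit out of a name such as 'Fair Source 10'."""
    match = re.search(r"(\d+)\s*$", license_name)
    if match is None:
        raise ValueError(f"no user limit found in {license_name!r}")
    return int(match.group(1))


def needs_commercial_license(license_name: str, user_count: int) -> bool:
    """Assumption: usage is free up to and including the limit."""
    return user_count > parse_user_limit(license_name)
```

Under that assumption, a 10-person team using something under "Fair Source 10" would owe nothing, while the 11th user would be the point at which a separate commercial license comes into play. As the conversation makes clear, reporting that count is on the honor system.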

Starting point is 01:14:31 The experience at Palantir has taught us that the problem that we're solving is valuable enough that it's the large and established companies that will pay us a lot of money to make their development teams more efficient. And so that's where we think the business is. As for the rest of the world, we just want to make this accessible to as many developers as possible because we think we built a tool that's great for learning and understanding code. What about from the generalized license perspective? Like if I use fair source 10,

Starting point is 01:15:05 for example, how do I enable those who use it? Companies, you know, once they get to 11,

Starting point is 01:15:13 12, 13, how do I enable them to, one, be honest and say, hey, I've used it with,

Starting point is 01:15:19 you know, 13 or 14 users versus the ten, I need to pay for a few licenses or whatever? Yeah. And then how do I communicate to them how to pay? Is it just sort of on the person who uses the license to figure that out?

Starting point is 01:15:31 Or is it something that's actually baked into the license? Yeah, right now it's kind of like the honor system.

Starting point is 01:15:39 But the way we think about it is that there's no legitimate company in the world that would willingly violate a software license just so they could save a few bucks a month on a piece of code. And as for the ones who are illegitimate and skirt the law, they're probably not your customers anyways. Yeah, you're not going to build a giant business off of those people anyways. So those were just some knee-jerk questions I had when I read it. I'm like, okay, so how do you enforce the honor system and how do you get paid? Because great, you got the license and

Starting point is 01:16:16 great that you actually put that there but how do you enforce it because if you don't enforce it or at least prescribe how you should operate around it, then no one's going to follow it. And I was just thinking, is that something that you've thought through? Is that something you have some suggestions on? And I'm just curious. That's a smaller subject, though, but just curious. Yeah. I mean, in the future, we think that there can be a more automated mechanism. If we're thinking from the Sourcegraph perspective,

Starting point is 01:16:46 if you're using something like Sourcegraph that understands the dependencies you start pulling in through your code, you can have an automated alert that tells you, hey, you started using this thing that has a Fair Source license attached to it. If more than 10 people start using it, then you should pay this person. But that's just kind of like vague stuff

Starting point is 01:17:06 that we haven't really built out yet. Well, Beyang, we're getting low on time, but we did want to touch on Checkup. We mentioned it earlier in the call and want to give you a chance to get that out there. A new piece of open source by Sourcegraph and I believe built in collaboration with a friend of the show, Matt Holt,

Starting point is 01:17:25 of the Caddy web server. He's also been on the changelog. He's also been on GoTime. So tell us about CheckUp, simple uptime monitoring, distributed self-hosted health checks, and status pages. What is it, and why is it? Yeah, so I'll kind of start with the why. Okay.

Starting point is 01:17:45 The problem that it kind of solves was, like many web services, Sourcegraph uses an uptime monitoring service to make sure that our site is up and to make sure that someone gets paged when things go down. And we kind of ran into a couple of pains that were kind of surprising using the standard uptime monitoring services.

Starting point is 01:18:09 The biggest pain for us was just like, it was so hard to use the UI of these things. Like you'd think it'd be the simplest UI in the world. Like you got some URLs that you want to hit, and I put them in and you tell me whether they're green or not. But a lot of the UIs just take multiple seconds to load a single page.

Starting point is 01:18:32 And you're sitting there as an engineer. Efficiency is the most important thing. And you're sitting there waiting for a page to load. You can't help but appreciate the irony that the thing that you're monitoring, that's monitoring your site to make sure that your latencies are below, you know, one second, itself is taking like seven seconds to load.

Starting point is 01:18:53 That coupled with the fact that there was no way to programmatically update the endpoints for a lot of services, or no easy way, I should say. And that kind of got us thinking like, well, you know, this is uptime monitoring. It's not rocket science. It's actually dirt simple. Ideally, we should be able to run these things as unit tests.

Starting point is 01:19:14 Wouldn't it be great if I could actually run uptime checks in development, just to make sure that obviously you still need it in production in case some weird production issue comes up. But a lot of times, you break an endpoint just because you pushed a bug that breaks a page that you could catch in CI or development. And so we got to thinking, and we tried really, we really did not want to build this ourselves. We were like, surely there must be something out there that does this the way we want it to. But we looked around and just couldn't find anything.

Starting point is 01:19:44 So, you know, Matt Holt is kind of a friend of Sourcegraph as well, and we talk to him from time to time. And he'd kind of had his own frustrations of this sort. And I'm sure he's heard a lot from folks who use the

Starting point is 01:20:01 Caddy web server. And so we kind of got talking, and he was like, I've been thinking about building this thing. And we're like, well, we'd love to sponsor you. We would definitely use it. And so he went and built this library for us that's also a command line tool. And what it essentially does is you

Starting point is 01:20:18 can run it as a command line tool, which means you can run it basically on any EC2 or Google Cloud instance. And what it does is you give it a set of endpoints, you know, programmatically. It's some config file that you can version in with your code. You run this command, it hits all the endpoints, and then it uploads the data that it records to an S3 bucket.

Starting point is 01:20:40 And then there's a separate command that pulls up a dashboard that pulls the data from the S3 bucket. And that's the thing that tells you whether your site is up or down. And so you can run the uptime command from any EC2 instances or any set of geographically distributed EC2 instances and pull uptime data from all across the world, push it to an S3 bucket, and then checking your uptime is as simple as running a command to display a dashboard.
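Editor's note: the flow described above (a versioned list of endpoints, one command that hits them and stores the results, and a second command that reads the stored results to show what is up or down) can be sketched roughly as follows. This is not Checkup's actual code or API. The endpoint list, the function names, and the use of a local JSON file as a stand-in for the S3 bucket are all assumptions made for illustration.

```python
import json
import time
import urllib.error
import urllib.request

# Hypothetical endpoint list; in practice this would live in a
# config file versioned alongside the code.
ENDPOINTS = [
    {"name": "homepage", "url": "https://example.com/"},
]


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except Exception:
        return 0


def run_checks(endpoints, fetch=http_status):
    """Hit every endpoint once and record status and latency."""
    results = []
    for ep in endpoints:
        start = time.monotonic()
        status = fetch(ep["url"])
        elapsed = time.monotonic() - start
        results.append({
            "name": ep["name"],
            "url": ep["url"],
            "status": status,
            "healthy": 200 <= status < 400,
            "seconds": round(elapsed, 3),
        })
    return results


def store(results, path):
    """Persist results; a local file stands in for the S3 bucket here."""
    with open(path, "w") as f:
        json.dump(results, f)


def summarize(path):
    """Read stored results back and report each endpoint as up or down."""
    with open(path) as f:
        results = json.load(f)
    return {r["name"]: ("up" if r["healthy"] else "down") for r in results}
```

Because `fetch` is a plain function argument, the same check can be pointed at a stub or a local server, which is one way to get the "run it in CI or even development" property mentioned in the conversation.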

Starting point is 01:21:13 And as a side benefit, because it's so simple, you can also run it in CI or even development. I love that. You got a problem, maybe you don't have enough time to do it yourself. Matt Holt has some time and he also would like to solve said problem.

Starting point is 01:21:27 And a beautiful thing happens. You know, that's the great new world of open source where we do have businesses that are being run around open source and being successful. And we can sponsor little things that can benefit ourselves, but also benefit the whole community.

Starting point is 01:21:44 So that's really cool. Yeah, totally. So that's sourcegraph.github.io slash checkup. We'll link that one up in the show notes as well. Very cool. So is that kind of a thing that's just said and done? You guys launched it, or is there actually continued development?

Starting point is 01:22:01 Do you have future plans for checkup, or is it just it's out there and use it? Yeah, so it's out there. It's usable. It's kind of like a minimal viable tool right now. So we're actively looking for other open source contributions. We've actually been overwhelmed by the sheer amount of interest.

Starting point is 01:22:19 I guess it turns out that a lot of other people have had similar frustrations. But people have already submitted pull requests. One person added the ability to check for, you know, TCP endpoints as opposed to HTTP endpoints, and we've gotten a lot of other great pull requests as well. But, you know, if you're out there and you're listening and you want to contribute to libraries like this, then we're open for business. We're happily accepting pull requests. Yeah. I've been working with Gerhard Lazu on the deployment of our new website

Starting point is 01:22:50 and CMS and all that good stuff. And we have been discussing uptime monitoring. And he's a DevOps guy, and so he has opinions on all the different uptime monitors in the world: the Pingdoms, the Uptime Robots, the new shiny Apex Ping, which looks interesting. And one of the things is I asked him, like, what's the best one? Because I've been using this Uptime Robot thing, which I appreciate. It's free for me and cheap

Starting point is 01:23:21 for many people, so I don't want to really diss it here on the show, but it's not the best thing, not fitting all my needs, but I use it. And he's like, I've used all of them. He's like, I have accounts on nine different uptime monitors. He's like, they're all subpar in some way and they all fall down in some ways. Like, I just use them all.

Starting point is 01:23:42 I think he's going to be interested in Checkup. Yeah. Yeah. I would love to hear his thoughts. Well, Beyang, we would totally ask you the hero question on the show, except that you've already answered it on Beyond Code. So instead of doing that, we're just going to link up your interview on Beyond Code at GopherCon 2015.

Starting point is 01:24:03 But one of the other closing questions we like to ask is really an invitation to the community. So from Sourcegraph to the community, what are the best ways, you know, with your mission, with what you're doing, with all the things you have going on right now for the open source community to step in, to support what you're doing, or to help you move the ball forward

Starting point is 01:24:22 towards the progress you're trying to make, whether it's, you know, on the company side or on the open source side, what's moving Sourcegraph forward and how can the open source community step in and help out? Yeah. So I think the best thing right now is just try out Sourcegraph and use it to explore some open source code. You know, maybe use it to dive into that repository that you think is really cool, but perhaps a little bit inscrutable or overwhelming right now. Because, you know, really the reason we made it was to make it easier to dive into unfamiliar code and to figure out what's going on, what it's doing.

Starting point is 01:24:57 So, you know, use it for that. Hopefully it helps you. We'd love to hear your feedback, and if you end up liking it, you know, tell your friends, tell your teammates, and help us spread the word. What about language support or editor support or different areas? We talked earlier in the show about cross-pollination and motivations, to look at where you're trying to go. And, you know, are there any unturned rocks out there that you just personally don't have time for? It's not on your roadmap, but the open source community can step in and help out. Yeah, totally.

Starting point is 01:25:27 So, you know, even for the languages that we do currently support, the tool chains could always be better. You know, the Go, JavaScript, Python, TypeScript, those are the languages that we have kind of work in progress tool chains for. We'd love to get contributions for that. If your favorite language happens to be a language that's not one of those languages, if you reach out, we'd love to work with you on how to build a tool

Starting point is 01:25:52 chain for that. It's one of those tasks where I think building a source code analysis tool chain seems really fancy, but you just come talk to us. It's actually pretty straightforward, and you actually kind of learn a lot about the internals of programming languages and level up as a programmer when you do so. So if you're interested in any of that, just, you know, tweet at us or shoot us an email, and we're happy to connect and see how we can work together. On your

Starting point is 01:26:22 contact page, it's hi at sourcegraph.com. Is that a good email for something like that? That's perfect. Awesome. Any closing thoughts from you, Beyang, for the listeners who've been listening this whole entire show? I think it's the hour.

Starting point is 01:26:39 I think we're past. Are we past time? 14 minutes? Yeah, we're past 14 minutes. Wow, okay. So we're over time by a bit. I haven't even been watching the clock, Jared, this last 14 minutes. Okay, so we're going on an hour and a half show, roughly. Any closing thoughts for the listeners who've been hanging on to the end of the show here? I think I would just say, you know,

Starting point is 01:26:59 I'll speak directly to the listeners who might be a little bit newer to programming because, you know, I was definitely a person once who didn't start programming really in earnest until end of high school, beginning of college. And that's a little bit old for a lot of programmers in the software industry. So if you're a person who's just learning to code and it just seems like there's this huge universe of things out there that you can never hope to know, I'd just say just keep going. Dive into source code. Learn from the examples of other people. It's not rocket science at the end of the day, and once you get out the other end

Starting point is 01:27:45 and you can build stuff on your own, it's like you've been given magic powers. A lot of great advice from you as well in that Beyond Code interview. I can remember you saying, you know, what would you go back and change? And I'll just give a snippet here because we'll link it up anyways.

Starting point is 01:28:01 But you said, go back and read more source code. And I thought that was such an interesting answer considering what you do now with Sourcegraph, because that's pretty much what the tool that you built does: read source code and create some more information based on that, some more logic on top of that. But we'll link that up. I thought that was a pretty interesting thing as well, just to kind of go back in and dive into the open source code out there. And yeah, don't feel like there's a different way to get it right. You know, reading source code is probably the best option for learning to

Starting point is 01:28:33 program. Yep. Well, Beyang, thank you so much for coming on the show, and definitely thank you for, you know, your love for open source and your love for productivity for developers out there. And obviously all the things that Sourcegraph and your company is doing to prosper open source, but also to give us better tools to not have to rework every time

Starting point is 01:28:57 or recreate the wheel every time and to leverage the collective knowledge out there available in open source and all these open repositories to help us make our day-to-day lives a little bit better. And that's obviously a pretty cool thing. So sourcegraph.com is where you can find Sourcegraph, obviously. GitHub.com slash Sourcegraph is where you can find a lot of our code. And with that, fellas, let's call this show done and say goodbye. Thanks so much, Adam and Jared for having me on the show. I really appreciate it.

Starting point is 01:29:26 Love the Changelog, and keep doing what you're doing. Awesome. Thanks, man. We appreciate that. Outro Music

Beyang Liu, the CTO and co-founder of Sourcegraph, joined the show to talk about the backstory of Sourcegraph, how it works, how they're aiming to be the 'Google for Code', ideas around offline support for code search, how it's licensed, and their new software license called Fair Source.
