The Changelog: Software Development, Open Source - GitHub Archive and Changelog Nightly (Interview)

Episode Date: February 27, 2015

Ilya Grigorik joined the show to talk about GitHub Archive, logging and archiving GitHub's public event data, and how he uses Google BigQuery to make querying that data accessible to everyone....

Transcript
Starting point is 00:00:00 Welcome back everyone, this is The Changelog and I'm your host Adam Stacoviak. This is episode 144 and today we're talking to Ilya Grigorik. Ilya is a self-professed internet plumber, as you'll hear on the show today. He works at Google and he basically makes the internet faster, and much more, of course, for those who are fans of Ilya's work. Ilya has a side project called GitHub Archive, which we took some interest in because we wanted to start shipping a daily email, nightly as a matter of fact.
Starting point is 00:00:38 I'm going to break some news real quick, because we're going to tell you later in this episode, but I have to tell you now because I want you to sign up for this email. Stop the show. Go to thechangelog.com slash nightly and sign up right now. It's an email we're shipping called Changelog Nightly, as you can tell from the URL. And this email unearths the hottest repos on GitHub every single night
Starting point is 00:00:58 and drops them in your inbox. It's going to be awesome. You're going to love it. We have some awesome sponsors for today's show: CodeShip, Toptal, and CodeSchool. We'll tell you a bit more about CodeSchool and Toptal later in the show, but our friends at CodeShip, they have released a brand new feature called ParallelCI.
Starting point is 00:01:17 If you want to get faster tests from your builds, you've got to run your builds in parallel. They recently shipped ParallelCI, and now you can split your test commands into up to ten test pipelines. This enables you to run your test suite in parallel and drastically reduce the time it takes to run your builds. They integrate with GitHub and Bitbucket.
Starting point is 00:01:37 You can deploy your code to cloud services like Heroku or AWS and much more. And you can get started today by trying out their free plan, which includes 100 builds a month and five private projects. Or you can use our offer code, thechangelogpodcast, to get a 20% discount on any plan you choose for three months. Head to codeship.com slash thechangelog to get started. And now, on to the show.
Starting point is 00:02:04 All right, everybody, we're back. We've got Ilya Grigorik joining us today. Ilya, you've been on the show before, episode 55, if we go back in time, back when Wynn was around, and that was an awesome show. You were talking about Goliath and EventMachine and other fun stuff. So welcome back to the show, man. How are you? I'm doing great.
Starting point is 00:02:22 Thanks for the invitation. And yeah, it's been a while; I think that episode was back in 2011. We were laughing before the show about how forever ago that was, basically. And that's just crazy. Yeah, SPDY was a brand new thing back then, and now it's been completely replaced. That's right, actually. As of today or yesterday, the new HTTP2 stuff is now official. So we went from this kind of experimental thing that a couple of engineers at Google started to something that's out in the wild and ready to be deployed.
Starting point is 00:02:54 It's pretty amazing. And as you can hear, we also have Jerod Santo on the call as well. So Jerod, say what's up. What's up? What's up? So Ilya, this is sort of a two-part show, right? We've got sort of an announcement from us, which we'll talk about later. But then we have this awesome project that you started, GitHub Archive, and that kind of tees off of Google's BigQuery project.
Starting point is 00:03:18 You've been on the show before, but since it's been a while, back then you were a founder and CTO of PostRank, which has since been acquired by Google. And you now work at Google and it's been all that time since then. How do you introduce yourself now whenever you're on a stage or saying hello to people? So nowadays within Google, I basically work as an internet plumber. And my job is more or less to figure out how to make the internet faster. That's related to spec work for things like Google Chrome, so open standards, also things like HTTP2 and all the rest. So trying to figure out how to make Chrome faster,
Starting point is 00:03:55 how to make Google products faster, because we know that speed helps user retention and just leads to happier users, and also just to make the internet faster as a whole. So it's been a pretty fun gig. I like the name, or the title at least, net plumber. It's a good description of kind of the dirty work that you actually have to do to make it all work.
Starting point is 00:04:17 Yep, yep. Someone's got to do it, right? And how big is your team that you work with there at Google? That's a good question. It's actually a big distributed effort, as you can imagine, because every product within Google is focused on speed. So I work and kind of collaborate with a lot of different teams within Google and then even outside of Google as well.
Starting point is 00:04:37 So we work with mobile carriers because mobile is becoming so important and everything's migrating there. Other vendors, Microsoft, Mozilla, Apple, and all the rest. So it's hard to say what the team is because we don't have just a strictly unified team, but it's hundreds of people. So do you work in an office?
Starting point is 00:04:59 I work in an office, but I do find myself kind of jumping between offices depending on where I have a lot of meetings. But my day-to-day is mostly in Mountain View, California. Gotcha. Well, cool. Didn't you live not in San Francisco before
Starting point is 00:05:15 you were a partner? That's right. Yeah, we were actually in Waterloo, Canada. Then we moved to sunny California. Yeah. Must be better right So you're not dealing with the whole Winter apocalypse or what is it called over there
Starting point is 00:05:29 Snowpocalypse Snowpocalypse yes Feel bad for the people on the east coast Here in sunny Texas it's just around 55 degrees today or something Yep can't complain Well I got sun but I also have like 9 degree sun
Starting point is 00:05:44 So it's not exactly I I guess I can complain. Not as much as them on the East Coast, but I can still complain. So maybe just to paint a little bit of the history for people too to kind of let them know which Ilya we're talking to. I don't think there's that many out there. VimGolf, GitHub Archive, you were a Ruby hero in 2008. You started PostRank can you kind of give a primer of what PostRank was prior to the acquisition of Google and sort of
Starting point is 00:06:09 what sort of made you this internet plumber you are today yeah the PostRank work was actually I'm going to say mostly unrelated to the work that I'm doing before but the idea behind PostRank was to help measure the impact of just social advertising on the web.
Starting point is 00:06:27 And by advertising, I share a link to an open source project, and we want to be able to figure out what did that yield? Was that a good share in the sense that a lot of people click on it? Whom should you approach? Or where should you advertise to make the most of your advertising budget? So we built a collection of tools for marketers, advertisers, and the rest or where should you advertise to kind of make the most of your advertising budget. So we built a collection of tools for marketers, advertisers, and the rest to kind of help them facilitate that whole end-to-end cycle of, we want to invest money into social, and what is the return on investment?
Starting point is 00:06:56 And then I guess that was interesting to Google, who at the time was really kind of expanding their social strategy. So we ended up there and I spent about a year building and rebuilding some of the products that we build within PostRank. And then jumped to kind of this web performance work, which has always been my passion. In the background, the last episode that we did with you guys with Goliath was actually centered around that. We wrote our own HSP server because we found that the performance of some of the existing servers just wasn't up to par. So I've always kind of played that plumber role,
Starting point is 00:07:30 perhaps in the background as an engineer. And then when I got the opportunity to actually focus on it full-time, I decided to jump on it. Cool. As Adam mentioned, you are a Ruby hero, and many people probably remember you from igvita.com, your blog that you wrote on very commonly back then.
Starting point is 00:07:50 Curious if you're still slinging any Ruby inside Google or if you've switched tool sets. I am, but perhaps not kind of day-to-day production projects. So I still have open source projects, certainly a lot of my day-to-day work for, you know, I need a script for this, I need to automate something like that. I still use Ruby, it's my default language to date. Probably the most recent project that I've worked on was actually the HTTP2 Ruby gem. So it's a pure Ruby implementation of the full HTTP protocol, HTTP2 protocol. So that was fun, just kind of roll up my sleeves and work on that. And I should also mention that I got some great contributions from a number of people on GitHub, from the Tokyo community in particular. So they've been helping
Starting point is 00:08:29 out quite a bit. So it was really good. And I actually went to Tokyo last year to talk at the HB2 event. So I met a lot of the Rubyists there that were also doing HB2. So that was really good. Awesome, man. Well, we're here today to talk about a specific project of yours, one that had caught our radar and that I was quite fond of for some time, which is your GitHub Archive project, githubarchive.org. Tell us about that and what it is and kind of where you got the idea for it. Sure. So this is a fun one.
Starting point is 00:09:00 Awesome, man. Well, we're here today to talk about a specific project of yours, one that had caught our radar and that I was quite fond of for some time, which is your GitHub Archive project, githubarchive.org. Tell us about that, what it is, and where you got the idea for it. Sure. So this is a fun one. And I think, as with most every open source project, it starts with a personal itch. And my personal itch was: I love open source, I love following open source, and I'd like to keep on top of what are the interesting projects that are coming up or being released, what are the new issues, so on and so forth. So back in the early days of GitHub, I would just follow a lot of people, right? Once you follow the right people in any particular community, you just observe what they star or what they comment on and all the rest. And that worked well for a while. But with time, as more and more projects and more and more people joined GitHub (and I was subscribing to a thousand-plus projects and people), I found that my stream was actually just being overwhelmed. Like, I went from being
Starting point is 00:09:45 able to check my uh my stream on github.com for like what are the new events once every couple of days to i had to check it once a day because they had a limit i think was like 500 events and it would just kind of scroll off the bottom so i couldn't catch up and then finally it was like every half a day i would have to check it. And clearly that wasn't scaling. So that was a problem. And I figured like, hey, I should figure out how to solve this problem. I specifically wanted to just solve that problem for myself. So I started looking around and realized that GitHub actually provides this API where they show you the latest activity. So this is like somebody opened a pull request, somebody closed an issue, just basically anything and everything.
Starting point is 00:10:25 And if you poll that API, you can actually get the live stream. So based on that, I said, okay, well, fine, I'll just write a crawler that will basically sit there in a loop and log all of that data. And then once I have the data, I can mash it up and answer the question that I actually have. So, nothing like a good yak shave. That was basically the inception of GitHub Archive: I just wrote a little Ruby crawler that sits there and collects this data, logs it into hourly archives, which I store on cloud storage. At the time it was S3, then I moved it to Google Cloud Storage. And I think I started that back in March of
Starting point is 00:11:08 2012 or 2013, one of those years. I think it was 2013. And ever since it's just been logging those hourly archives. And then based on that data I tried to figure out, okay, now you have those JSON payloads, what can you do with it? Because now you actually have a ton of data, which is in itself kind of a problem.
Starting point is 00:11:35 And processing all of that may take some time. And Google at the time just announced BigQuery, which was a project that was actually an internal project called Dremel within Google, which is incredibly popular. And the idea there is you can write kind of SQL-like, you have a SQL-like syntax that you can write, and under the hood, it's actually implemented as a MapReduce job. So you write SQL, it gets translated to MapReduce, and you can run it across massive data sets. And it returns really fast, which was a plus, of course, because you want to have that sort of kind of interactive querying capabilities. And that just became available as BigQuery, so I figured like, hey, this seems like a
Starting point is 00:12:17 perfect fit. So if I just push all of this data into BigQuery, I get the ability to kind of query the data really easily and in a fast way. And two, that data actually becomes, I can make it as a public data set such that other people can come and write queries against it without having to do the import. And the import part turns out to be kind of significant. I think we'll talk about it later because at the time there were some restrictions on how I could store the data. But the gist of it is write a crawler,
Starting point is 00:12:53 log all the data, import the data into BigQuery, and then with BigQuery you have this data set that's easy to queryable. And then finally, after doing all the yak shaving, I wrote a simple Ruby script that just queried BigQuery once a day and sent me two things. The top 10 repos that were open source within the last 24 hours and top 10 by number of stars received. So just which of the repos that received the most attention. And then which repos that are
Starting point is 00:13:21 not new but received the most stars. And that became a newsletter that I also opened up to other people and approximately 1,000 people have signed up for it. And funny thing, I actually, last year, got a report from MailChimp, which is the service that I was using to do the delivery. And I was just looking at it before the show. And it said that the average open rate for those emails, for those campaigns, was about 40%, which is massive, right? Compared to the rest of the industry. So like it was not only were people
Starting point is 00:13:52 signing up, they're actually opening the emails. And the click rates for the links sent in there was about 15%, which is about 10% times, 10 times rather higher than your kind your industry average. So it worked. It worked. I like that. While it worked on me, I was a subscriber, which is kind of how I came across the whole thing. I can't remember who keyed me on to the newsletter immediately or originally, but I had been getting it for months.
Starting point is 00:14:23 And you were scratching your own itch, but you scratched an itch that Adam and I have around these parts, trying to keep up with open source, is to find the new repos. And yeah, every night, I think coming in, something about daily emails where it's like, eventually you kind of get sick of them because you're like, yeah, here it is again, every single day, old faithful. But more often than not, there know, there's good stuff.
Starting point is 00:14:48 There's gems hidden in there. So it was very valuable. It definitely worked. It's interesting that it kind of came, you know, I assume because you're at Google and BigQuery is a Google product, that it was kind of worked the other way around where it was like here they had BigQuery and maybe they asked some of their employees to use this thing, to have some good use cases. But it's kind of interesting that it was kind of organic,
Starting point is 00:15:10 the way that you ended up using BigQuery. Yeah, it was actually pretty lucky timing, because at the time I wasn't even aware that the product was going to be released. I was just kind of struggling to figure out a way how to do this. So either I had to write just my own processing logic to go through all the archives, but then they announced BigQuery and it was like, oh yeah, I heard of this thing called Dremel and Google is very popular, so let me just give it a try. And it turns out that was actually probably one of the best decisions I've made, just because aside from having the ability to do these fast queries, it opened it up to anybody and everybody
Starting point is 00:15:46 to just run arbitrary queries. And one of the benefits of BigQuery is that it actually gives you a free quota. So you do need to have a Google account, but once you sign in, it's a webpage where you just type in your SQL query and you can ask it any question you want. So if you are curious about what is the top starred repo for, I don't know, for a particular user,
Starting point is 00:16:10 you can just write a query and get an answer to that. Or if you're interested in what are the top repos that have the most issues open that are Ruby repos, you can ask that too and you get immediate answers. And that turned out to be very popular with a lot of people because it just kind of feels very easy to approach and start asking these questions so we've seen a lot of kind of really interesting projects built up around it even though originally it was just meant to solve this kind of very narrow problem that i had have any for instances for us of people using the big github archive big query and making cool stuff with it? Oh yeah, oh man, there's so many.
Starting point is 00:16:46 So we actually worked with, when I started this, I also pinged GitHub crew to make sure that this is all good and I'm logging this data and they don't have any kind of issues with it. And to their credit, not only did they say it was all great, but they also helped me to get the word out there. So actually, Brian, who's on their marketing team, has organized the big data challenge, the GitHub data challenge, for the
Starting point is 00:17:13 last three years, where they have a prize or set of prizes at the end. And basically, the idea is like, here's the data, use GitHub Archive, or any other data if you want, and just build an interesting visualization or something that kind of extracts some interesting insights out of the data. So if you guys go to githubarchive.org and you scroll down to the bottom, there's actually a collection of links to various projects and also the blog posts from GitHub where they show the winners. And some of my favorites, I'll just, I guess, pick out a few.
Starting point is 00:17:44 There was one project that was the open source report card. And the idea there was you type in a username, and the report card would be for a particular user. So it would aggregate all of the repos that you worked on and kind of figure out which languages you contribute to, what type of commits. Like, do you typically open issues? Do you do fixed issues?
Starting point is 00:18:02 Do you write code? Do you kind of discuss more? And give you kind of a nice description of the type of work that you do on github so that was kind of cool another project was just showing the geographic distribution so you pick a project or even a language and you can say like where are my contributors coming from are they from us europe new zealand just show me a map which is you know kind of like a simple intuitive thing to where are my contributors coming from? Are they from US, Europe, New Zealand? Just show me a map, which is kind of like a simple intuitive thing to ask,
Starting point is 00:18:29 but it's something that GitHub doesn't provide by itself. But here you just had this kind of third-party tool fill in that gap. Another one that's kind of, and all these projects approach the data from a different angle. So the one was on users, one was on projects. GitHub is an interesting one.
Starting point is 00:18:48 So github.com. It actually provides a really cool visualization for comparing programming languages. So you can see, for example, that if you select Ruby, you can see where it is ranked in terms of number of pull requests or issues or other things on GitHub. So not surprisingly, today JavaScript is at the top
Starting point is 00:19:13 in terms of the number of just commits. Yeah, this one actually made the rounds, I think. It was either last week or even maybe just Monday. This GitHub, that's G-I-T-H-U-T. Somebody had posted to some, whether it was Hacker News or something about the top languages of the year, JavaScript being so massive. And that came across my radar and I saw it and I'm like,
Starting point is 00:19:37 oh, that's pretty cool. I didn't even think that that was using the same data. So yeah, that's been a fun one. And I've seen it pop up a few times because I think it was actually done as a last year's entry. So yeah, that's been a fun one. And I've seen it pop up a few times because I think it was actually done as a last year's entry. So yeah, that one's really cool. Yeah, I guess every year becomes interesting again, right? Because you can see what happens since last year.
Starting point is 00:19:53 Right, yeah. And the great thing about this stuff is they're just leveraging BigQuery under the hood. So every once in a while they rerun the queries, right? So the data is always up to date. They don't have to worry about collecting the data or doing any of the other stuff. And now a word from our sponsor. TopTile is the best place to work as a freelance software developer. If you're freelancing right
Starting point is 00:20:16 now as a software developer and you're looking for a way to work with top clients on projects that are interesting, challenging, and using the technologies you want to use, TopTal might just be the place for you. Working as a freelance software developer with TopTal, your days of searching for high-quality, long-term work and getting paid what you're worth will be over. Let's face it, you're an awesome developer, and you deserve to be compensated like one. Joining TopTal means that you'll have the opportunity to travel the world as an elite freelancer on top of that top talk can help provide the software hardware and support you
Starting point is 00:20:50 need to work effectively no matter where you are head to top top.com slash developers that's t-o-p t-a-l.com slash developers to learn more and tell them the change log sent you. So essentially, GitHub Archive is a snapshot or the big data snapshot of all of GitHub public activity. That's right. So you actually have two ways of interacting with that data if you want. One is you can go and download the raw
Starting point is 00:21:17 archives, the hourly archives, and that just gives you exactly the data as I saw it coming from GitHub. And you can apply anything you want to it. So if you want to, I don't know, put up your own Hadoop cluster or write your own Ruby script to process it, go for it. And the other option is, the more convenient option, is to use the BigQuery interface where you can just write the SQL stuff.
Starting point is 00:21:41 So whichever one fits you best. So if you want to make a GitHub or something like this, you can use the GitHub Archive data set to sort of slice and dice big data coming from GitHub. That's right, yep. Gotcha. It's pretty neat how these artifacts can have such insights after the fact
Starting point is 00:21:59 and just the foresight of collecting the information. I guess this is kind of the whole conceit of big data, right? It's like one person with the foresight of let's collect this data and make it publicly available. Down the road, it opens up all these opportunities and visualizations and insights into the open source community that otherwise wouldn't have been available. Yeah, absolutely.
Starting point is 00:22:17 And I guess one thing that I've learned through this process, and I've seen it happen before as well, is it's very important to make the analysis of the data very cheap and easy. Because that enables a very different type of collaboration and just iteration. Because if it takes you, let's say, half a day to answer a question, you're very limited in the types of questions you can ask of the data.
Starting point is 00:22:44 Whereas if you get a very quick response, you can actually start iterating on your questions. So it's very often that I'll start with a particular question and then be like, oh, well, I didn't expect that. That looks like an outlier. Let me drill in a little bit further. So having the tools,
Starting point is 00:23:01 and this is where the BigQuery stuff really helped. And by the way, there's other projects that can do this sort of thing. There are open source projects. I think Amazon has some of the kind of similar capabilities. The fact that it's BigQuery or not is not the important part. It's just the fact that it allows you to quickly and easily ask questions and get fast answers. And just having that has been incredibly valuable. When you say fast, how fast is fast?
Starting point is 00:23:24 Well, you're processing on the order of, let's see, I think the current data set is on the order of a couple hundred gigs, and you can process all of that in a span of one to ten seconds. So if you write a very complicated query, then it'll take up to ten seconds. In comparison to, say, doing that on your desktop would be a day? Well, yeah. Depends on your desktop.
Starting point is 00:23:50 Yeah, on your desktop. Run-of-the-mill MacBook Pro. It would take like an hour just to read the data off disk, right? Whereas if you have a nice distributed system, you would just read it from many different disks, and that goes a heck of a lot faster. I'm just trying to paint a picture for those out there who are like, what is this BigQuery
Starting point is 00:24:07 and what does he mean by fast? Because an hour or two hours is way slow. Sure. 10 seconds is way fast. So that's a great question. So I guess for context, so BigQuery is the public version of a product that we use internally at Google called Dremel.
Starting point is 00:24:24 And Dremel is used to analyze terabyte-sized data sets, so in multi-terabyte data. And you're leveraging the large computer infrastructure that Google has, and a terabyte of data can be processed in the same order of magnitude, kind of seconds at most minutes, which would take otherwise literally days or weeks on your single computer. I think this is cool.
Starting point is 00:24:50 I think this GitHub archive kind of shows really that internet plumber attitude that you have towards things because what you've done is some of the dirty work, right? And you started off with this itch to scratch. And you know, I have these all the time, and I'm sure developers out there, we always have like this little, ooh, if I could just do this this itch to scratch. And you know, I have these all the time, and I'm sure developers out there, we always have like this little, ooh, if I could just do this, it'd be nice.
Starting point is 00:25:08 And then you follow the thread a little bit and you realize this is like two, three, maybe a week, you know, three days, maybe a week worth of work or whatever threshold that's just like, and you just kind of shelve it. If you would have done that, you know, all these other projects
Starting point is 00:25:22 probably wouldn't exist because you've lowered the bar for them to get to the interesting part, right? I want to visualize the data. You just wanted the email of the repos every night, but all this extra work actually turned into something that we all can use and has made the ecosystem more fruitful because of it. Yeah, I think that sounds about right. That sort of approach does take a bit more at the beginning
Starting point is 00:25:49 because you're required to do more work, like how to make sure that this is accessible, it's usable by other people and all the rest. But in the long run, I definitely think it's kind of a better approach because exactly as you said, it allows other people to leverage that data. And it also allows me to play with the data more because instead of just having that report card for what are the interesting new projects yesterday, I can ask it tons of other questions. Yeah, so it also means that you got something you need to maintain as well, which is kind
Starting point is 00:26:21 of the other side of that coin. So GitHub Archive is not a new project. It's been out there. And GitHub itself changes. I think the API has changed over time. Did that present any difficulties for you, API changes on the GitHub side? It did in some ways. So the trouble here was that,
Starting point is 00:26:40 and this is more of a kind of BigQuery-specific gotcha, when BigQuery was first introduced, you could, in terms of the data schema that you could store within BigQuery, you had to define that upfront. So you would say, you know, these are the columns for all of my records. And you couldn't change that afterwards.
Starting point is 00:27:02 You could create a new data set that had a different schema. But later that actually started causing problems because, as you said, GitHub would, as they would, you know, enhance their product, add new fields, maybe deprecate an old field. And I had a little bit of pain there where even though I was logging all of the raw data, you know, I always had the raw data stored. I would have to kind of massage it into the schema that I froze early on, such that you could run a query against the entire data set. So that did cause a little bit of friction. But then last year, BigQuery actually allowed you to start importing just like JSON payloads, so unstructured data.
Starting point is 00:27:43 And this actually gave me an opportunity to go back and kind of revisit my original implementation. And I switched it after kind of a bunch of back and forth on what's the best way to do it. Earlier this year, or actually exactly on January 1st, I switched to a new model where instead of having every column be fixed, I'm actually fixing a subset of the columns, which I know are stable.
Starting point is 00:28:08 And if you look at the API documentation for the events API in GitHub, they'll tell you that these five columns are fixed. They will always be there. But then take, for example, a pull request versus issue request. Both of those always have an actor, so somebody who's doing the action. And both of those have like a timestamp
Starting point is 00:28:28 and something else. And that's always there. So that's like I mapped that into distinct columns. But then the actual payload of the request or the activity is different for each activity. And that I just store as kind of a JSON blob. So it just requires a little bit more work on people that are writing the queries now
Starting point is 00:28:45 to kind of reach into the JSON data and pull out the fields that they want. But now I don't have this problem at all because GitHub could just change anything they want and I just throw that data into BigQuery and there's no updates on the send. Yeah, I think I ran into that as I was trying to do some of the queries to get the email going. And I'd just like to say that, man, you are fast on the trigger, helping out on the issues. I appreciate how quickly you got back to me on GitHub, helping me out with getting the queries all going. January 1st, so that's right about probably the same time you turned the email off, is it not? Yes, so actually, yeah, on January 1st, after that update went out, I guess I didn't explicitly
Starting point is 00:29:30 turn it off as much as I didn't update the query in my daily run. And then I realized that, whoops, the data schema has changed. Or rather, I stopped logging data into the same table. I had created a new table, and now I'm actually creating daily tables. One of the things we found was because I've been logging data into the same table for now over three years, and we've actually backported some data with GitHub's help, so there's about three and a half years of data there,
Starting point is 00:29:58 you do have a free quota, but it's very easy to exceed that free quota if you're not careful. So that was the reason why we went into this new model where each day and each month is a separate table such that people can experiment a bit more without exceeding their free quota. So what was the thought behind,
Starting point is 00:30:19 I mean, obviously because it just stopped working, you didn't turn the email back on. There were a few cries for help, myself being one of them. I think there's three or four other people on your GitHub account asking what's up with the email. What was your decision to not rewrite that? So as I said, the first two days, or three days before I realized that the email stopped coming in,
Starting point is 00:30:43 because as you said, there's something about daily emails where after a while you start to tune them out. It took me about three days to register, like, oh, right, this is why I'm not seeing it. And then a couple of GitHub issues popped up. And the thought process there was, I guess, twofold. One was, since I've actually started the GitHub Archive newsletter, GitHub came up with their own trending repositories email
Starting point is 00:31:09 that you can sign up to. And I've subscribed to that, and to be honest, I actually don't find it as valuable as the one I implemented, but of course I'm biased because... Hey, we're with you. Otherwise I wouldn't ask you to turn it back on if I was satisfied. What they're doing is basically the same thing, but they just provide fewer repos in there.
Starting point is 00:31:31 I think it's like the top 10. And they also don't separate what are the new repos versus the old repos that got the most activity. So I read both effectively, but I find different things in both repos or in both emails. But at the same time, it was there. So once I realized that the email stopped going out, I actually wondered if anybody would cry about it, if anybody would contact me. So I gave it another couple of days.
Starting point is 00:32:01 And sure enough, as you said, there was a couple of issues that were being opened on the repo. And then at the same time, I guess you guys reached out to me about the work that you guys are doing. And at that point, it kind of became clear that perhaps I should find somebody else to run that project and focus on the infrastructure part, which is kind of to your point earlier, just enable other people to build cool things on top. Well, that's a good segue then, isn't it? Yeah, we obviously have a hand in the bag, so to speak, in terms of shooting out emails and stuff like that. And to our best ability, we try to keep up.
Starting point is 00:32:39 And as Jared mentioned, we used GitHub Archive before and we were like, well, that's a bummer. It's not going on anymore. So it made sense to reach out and see if that was something we could take over. And we've since had some conversation about it. We're launching a new email called Change Log Nightly that will essentially become what GitHub Archive was, the daily emails at least. And working with Ilya, we've transferred the email list. So we're going to work with Ilya on making sure this continues.
Starting point is 00:33:09 So if you're listening to this and you're on that email list and you get an email from us here in the near future, it's the same email list and we'll sort of put out an announcement in addition with this podcast to sort of clear the way in terms of not spamming and stuff like that. This is a collaboration. So yeah. And I'm really excited.
Starting point is 00:33:27 You guys showed me the preview of the email. It looks great. It looks much better designed than what I managed to pull off and in my version. So that's awesome. So if you're listening to this right now, we're in the, we're, we're recording this in the past, but you're going to listen to this in the, in the future. So when you're actually listening to this right now um we're in the we're we're recording this in the past but you're gonna listen to this in the in the future so when you're actually listening to this uh so if you're hearing
Starting point is 00:33:49 my voice right now you can actually go to the changelog.com slash nightly um that may move to nightly.thechangelog.com in the near future but for now it's going to be there you can subscribe now um hopefully jared you can give me a nod or something like that to say for sure we're shooting emails right now we're doing it internally. As Ilya just mentioned, we had shared the design with him now. And Ilya, on the work, I love how there's sort of layers to this onion. You're itched from way back when, all this work with BigQuery, all this work with storing this data. And then now we've come behind you. Would you say that you're a designer, Ilya, or would you say you're not a designer?
Starting point is 00:34:29 Sometimes I pretend to be a designer. I can't say I'm a good one. Play one on TV. So, I mean, I would consider myself a designer. And when we came across this project and taking over the email part of it, I was like, there's a way, I like the data, but there's a way we can visualize it a little differently. And, you know, we're sharing the stars a lot clearer, the up stars for that day a lot more clear.
Starting point is 00:34:54 So when you see this email, you're going to love how it looks. We even went as far as making it have a night theme because we figured if you're going to be, you know, on your phone or on your MacBook at night, if you're in the East Coast central time zones or in the U.S. time zones, it's probably going to be at night. Otherwise it's daytime or something like that for you. But we figured let's ship it with a night theme. So we made it dark.
Starting point is 00:35:20 We may actually offer a day and night theme in the future, but at least for now it's going to ship with a night theme to make your eyes a little bit, you know a little bit easier on the eyes at night. And as Adam said, we've been shipping this just to ourselves over the last couple days. And I texted him, was it last night? I'm like, I'm so happy. It's weird. I'm a total nerd that this makes me so happy to have this back every night.
Starting point is 00:35:45 So we're excited to get it out there and get it in your mailboxes as well. Yeah, so go to the changelog.com slash nightly, sign up. And Ilya, I think we've grown the list a little bit since you handed it to us. It went from just around 900 to I think just a little over 1,000 now. Oh, that's awesome. We've actually grown the list a tiny little bit. So hopefully between this and our changelog weekly email which won't change we'll still share repos in there that's more of
Starting point is 00:36:09 our edits realized um you know highly curated email whereas this one's automated so they sort of sister and brother in that regard where you got nightly which is sort of this constant daily update nightly update and then our changel weekly, which goes out on Saturdays, which is links, videos, top repos that are hitting our radar. We're still sharing that email. Yeah, that's awesome. So for the GitHub mailing list,
Starting point is 00:36:36 or the original mailing list, I've never actually even actively promoted it. It was just one of those things where I had the email, and I think one time I forwarded it to one of my friends because I was like, oh, look, your repo has made the list. And then he asked me for, like, where can I sign up? So after that, I just dropped a link on the githubarchive.org website, and I never actively promoted it,
Starting point is 00:36:59 and yet somehow it gathered 1,000 people. So I'm curious to see where you guys take it, because I agree, I find them incredibly valuable. And I actually think there's a lot of room for kind of experimenting in the space as well. One of the things that I've wanted for a long time and just never got around to it was creating more thematic lists as well. So right now it's just like everything across GitHub, right?
Starting point is 00:37:22 But if I'm particularly interested in, let's say, Ruby or Node or something else, you can imagine just coping it to that, which would be quite cool. I would say, and Jared, you can back me up on this, but I would say that this is definitely a start for us. I think that as we can get more and more interesting data out of what you've been storing in BigQuery and GitHub Archive, I think that I'd love to keep exploring. I think this is just the tip of the iceberg for us because I've already had tons of fun just doing what we've done so far. And I think we'll just keep – from a listener's perspective, if you're listening to this and you love the changelog and you're a member or you're not, we aim to serve the open source community as best we can.
Starting point is 00:38:07 And sometimes that might be shipping really awesome emails. Sometimes that's doing a really awesome podcast. Sometimes that's sharing things on Twitter or a blog or wherever or going to a conference. So this is one of the ways we definitely plan to press hard. Yeah, and we do plan to also open source the repo that runs nightly so you can contribute as well if you're a reader of the email and you want to see a new data point in there or you'd love to have these language specific emails which has definitely been something we've discussed internally but it's a little bit more heavy lifting it's going to be
Starting point is 00:38:41 open source you can hop on there open an issue or fork it and do all that good stuff too. So Elliot, what do you think about Nightly then? What are your initial thoughts and just us taking it over and not having to worry about the burden of the email anymore? I'm happy. I'm super happy. So first of all, I get my emails back, which is great because I've been for the last month and a half I've been relying on the GitHub version. And as I said, you know, those are great, but I don't find that they're as interesting in
Starting point is 00:39:09 many ways. I don't discover as many interesting things. That and just having you guys work on it, I think you'll do a much better long-term job of it. Well, I appreciate that. And, you know, these itches you keep scratching too, let's, I would say say let's figure out a way to keep working together. I know Jared and I will take over and start doing some things, but if you've got a particular email that you want to see go out or dataset pulled from this, then let's work on it. Let's figure out a way to make it happen. Yeah, for sure. I'm really happy to hear that you guys are going to make the
Starting point is 00:39:43 actual code for that open source. And I guess I should mention that all of the GitHub Archive source for the website, for the crawler, and even for the old reports, if you still want them, is online. So if you just go under igregoric.githubarchive.org, that's the repo. And if you find bugs, improvements, all of that stuff is welcome. Awesome. Well, we do have a note in here to talk a little bit about the future of GitHub Archive. If you have future plans or a roadmap, or if you consider it kind of a finished thing as a piece of plumbing, what are your thoughts on that? So I think it's mostly a finished thing in the sense that the crawler is running. It's stable.
Starting point is 00:40:27 I think I've figured out all the bugs there and I'm really happy with it. Like it's been running for years. So that part is good. What I would like to do is maybe go back and revisit how I've imported some of the data into BigQuery. Because as I mentioned, the schema was changing and I had a frozen schema. So some of the fields may not be there that perhaps should have been. So that's just kind of one of those things where I would like to get to it, where I'd like to go back and re-import the old data in the same way that I'm importing the new data now, just to make it all nice and consistent. And if
Starting point is 00:41:04 somebody is interested in taking that on, that would be even better, to be honest. But that's probably the main thing. Otherwise, it's running, it's humming along. I should plug, actually, the BigQuery team. They've given me a lot of support, and they've paid the bills for hosting all that data. So kudos to them.
Starting point is 00:41:23 So there's that. There's no kind of concerns over how much dataudos to them. So there's that. There's no kind of concerns over how much data we're storing. So that's been really good. That's nice right there. I was going to add, I guess, on the tail end of that, is this one of the main ways, is there any other large data sets of GitHub data out there other than this one?
Starting point is 00:41:43 You know what? I'm not sure. I don't think so. Not that I've come across. I keep coming across projects that I'm surprised to find out are using GitHub archive data under the hood because either they grab the archives or they're using BigQuery. But I've not seen other people kind of log it and store it and process it on their own. And in terms of the future, you mentioned your collaboration with GitHub and improving things and things like that.
Starting point is 00:42:07 Are they mutually involved in this to a degree? Is there any sort of interest in this for them in the future? Yeah, I think so. Actually, just last week I was exchanging emails with somebody at GitHub where they're interested in engaging the academic community and actually coming back to the kind of interesting use cases of this data, I've been approached by a number of researchers in various universities
Starting point is 00:42:32 that are using GitHub Archive data for analyzing things like what makes a great open source community, or what patterns do they exhibit, what makes a resilient open source community, so on and so forth. It's kind of like the social dynamics of open source. So there's been a couple of papers published on this stuff using GitHub Archive data. And I think GitHub in particular is interested in getting more of that kind of collaboration with the academic community.
Starting point is 00:42:54 So we're chatting now about potentially exposing additional data sets via BigQuery, because clearly the researchers are already using that interface. So in the future, you may see some additional augmented data become available through GitHub Archive. But we're still kind of working through the details of what that is and how that would work and all the rest. And now, a word from our sponsor. It is time to put the program books away. Put them away, Put them down. And learn by doing with CodeSchool. CodeSchool offers a variety of courses to help you expand your skills and learn new technologies such as JavaScript, Ruby, iOS, Git, HTML, CSS, and many more.
Starting point is 00:43:52 CodeSchool knows that learning the code can be a daunting task. They combine experienced instructors with proven learning techniques to make learning the code educational as well as memorable, giving you the confidence you need to continue past the hurdles. They're always launching new courses on new technologies and offering deep dives on tried and true languages. So if you don't see them you need, suggest a course and they'll build it if there's enough demand. GoSchool also knows that languages are a moving target. They're always updating content to give you the latest and greatest learning resources. You can even try before you buy. Roughly one out of every five courses on CodeSchool is free. This includes introductory classes for Git, Ruby, and jQuery, which allow free members to play full courses with coding challenges included you can also pay as you go one monthly fee gives you full access to every
Starting point is 00:44:33 code school course and if you ever need a breather take a break you can suspend your account at any time don't worry your account history points and badges will all be there when you're ready to pick things up again. Get started on sharpening your skills today at Codeschool.com. Once again, that's Codeschool.com. So while we're talking about, I guess, the future of GitHub Archive, what are some of the ways that the community can step in? We always ask a question like, what's a call to arms for a GitHub archive that doesn't require you to do every single thing? Where can the community step in to help out on this project? So I'd say two things.
Starting point is 00:45:14 One is just go and play with the data. I think that's the best place to start because if you get hooked on that data, and I think it's pretty easy to get hooked because there's so much of it and you can analyze all kinds of stuff, then it just becomes much more interesting in the long run. Like you start thinking of new things you could figure out based on this data. So that's one, just start playing with the data. You can either grab the raw archives, the JSON stuff, and just
Starting point is 00:45:38 do something with it, or you can try the BigQuery approach. And the second one is if you are interested in helping out, the project that I mentioned where it's just about re-importing the old data, that could be interesting. And if that's something that you want to help with, that'd be cool. So is that still an in-progress project then? The re-import?
Starting point is 00:45:57 You're talking about the tables, right? Breaking up one big table into many smaller tables? Right. Yeah. So is that still going on then? Well, it's one of the things that, you know, on my to-do project list of 100 things that I need to yak shave, it's there. It's just a question of when will I get to it? So is there currently an issue open with some, like, guide marks or guidelines for someone to step in and help out there?
Starting point is 00:46:21 No, that's a good point. There isn't. And I will do that. Yeah, I think that would be helpful, especially, you know, I like when project owners, you know, if they haven't asked like that, you know, if they can put something out there because you're going to get a question anyway, someone will start the issue for you if you don't. So might as well give someone some guide rails to follow and that way people can step in.
Starting point is 00:46:43 So if you're wanting to hack on BigQuery or play with breaking up these tables into smaller tables, then Ilya will give you some help on making that happen. Yeah, I'll do that. I'll definitely put something together. So many yaks, so little time. Pretty much, yes. Is there a t-shirt for that? There should be.
Starting point is 00:47:05 That'd be a nice yak shave. Go make a t-shirt for that? Because I like that. There should be. Yeah, that'd be a nice yak shave. Go make a t-shirt about yak shaving. Okay. Another question that we like to ask at the end, and I know you've been lacking your email lately, but you're a guy who has his thumb on the pulse of open source. I think your Twitter account is a good one that I follow, just constantly kind of surfacing cool new projects.
Starting point is 00:47:23 So what are some projects, name one or a couple that are on your radar that are exciting to you these days? These days? Well, so I guess the main open source project that I spend probably most of my time on nowadays is Chromium. So that's definitely something that's very interesting, exciting to me. And I keep learning new things about it. And if you're not familiar, Chromium is the open source version of the Chrome project, the Chrome browser. So there's a Chromium browser which you can build on your own, and then Chrome is kind of the repackaged version
Starting point is 00:47:54 that just adds the Google branding on top of it and a few additional things. So I spend most of my days working on that, trying to figure out what are the things that we need in there to make it faster, or what are the things that we need in there to make it faster or what are the performance regressions, bugs, and so on and so forth. So that's definitely been occupying a lot of my time. And then others are actually HTTP2 related.
Starting point is 00:48:16 So as we mentioned at the beginning of the show, HTTP2 is now officially a thing as of, I guess, yesterday. And now the big push is now that the spec is final and it's stable is to actually have servers support it. If you think about, let's say, the Ruby ecosystem, there's actually not any server that I'm aware of that is HTTP2 compatible at this point. So that's something that I'm thinking about actively right now. If you go to the HTTP2 wiki, if you just search for HTTP2 on Google or use your other favorite search engine, you'll arrive at just kind of a status page for HTTP2. If you click on implementations, there's a list of servers that
Starting point is 00:48:57 are already implemented. And a lot of those could use some help in terms of contributions, testing them, compatibility with other browsers, and all the rest. So if you're interested in that sort of thing, that's definitely something that I would encourage others to play with. Excellent. I think I'll just give you a plug as well, because you won't take it yourself. You also have a book out, High Performance Browser Networking. It's an O'Reilly book. Looks like you can read the entire thing online for free
Starting point is 00:49:28 or buy it for a few bucks. Definitely if you guys are interested in these types of things, HTTP2, XHR improvements, server-sent events, that kind of stuff. I haven't read the book myself, but I've heard people singing its praises on the internets. So just give that a shout out as well. Does HTTP2 require any sort of update to this? Actually, it does.
Starting point is 00:49:51 I do have a section in the book. As you mentioned, the book is online for free, so if you just go to hpbn.co, you'll arrive at a page where you can flip through it. There's an HTTP2 chapter in there, but I wrote that chapter about a year ago. And since then, there's been some protocol changes, kind of like cosmetic changes,
Starting point is 00:50:11 that I need to go back and update. But that's true of any tech book in general, right? The moment you hit publish, it's already out of date. Yeah. We'll definitely link that up in the show notes for people who are interested. Remember you mentioned HTTP2 as well that's like fresh off the press
Starting point is 00:50:28 like as of like basically last night so you know don't expect you to go open go update the book today or anything well I am actually hoping to kind of get it done sooner rather than later because I think there's a lot of interest right now in the community it's like what is this thing how do I make it work? What does it mean for the
Starting point is 00:50:45 servers? What does it mean for my website? So this would be a good time to actually have that out. So I'm hoping that within the next, fingers crossed, couple of weeks, unless I find other Yaks to shave, I'll have that up. We'll have to have you come back and talk HTTP2 because
Starting point is 00:51:01 I have a lot of questions about it and I'm sure you've got a lot of answers. I think that'd be a good time. Yeah, I'd be happy to. Let's do it. Let's get on the books. Four weeks from now. Bam. Gone. Well, it was definitely fun having you on the call today.
Starting point is 00:51:18 We'll definitely enjoy working with you on keeping the emails current, looking awesome on mobile and desktop, frequent, and exploring new frontiers with that as well. So definitely excited about the future of working with you on that part there. You mentioned githubarchive.org will have an update mentioning ChangeLog Nightly. If you're going to subscribe, go to thechangelog.com slash nightly. You can subscribe there. if you're going to subscribe go to the changelog.com slash nightly you can subscribe there when you're listening to this we should be shipping emails so expect um an email like the
Starting point is 00:51:50 next night i think we're shipping what jared at 10 o'clock on central time or eastern time central so 10 p.m central because jared and i live in central and that's like the center of the world to us because it's central right that's right they call the central time zone for a reason it's the center humor arrogance there that's what that is humorous arrogance um that we're going to ship at our time at 10 so if you're on the other side of the world uh we can't help that so it'll be like 10 the afternoon for you or something like that so um that's the changelog afternoonly afternoon doesn't have the same ring to it no no and if you're a fan of the show you know back in episode 141 uh we had a pretty decent announcement that i came on as as a full time employee of this year fledgling company we're building called the change log uh i once worked
Starting point is 00:52:39 for a non-profit uh full-time there and stepped away to pursue the dreams of keeping up uh with open source and serving the open source community so this is now my full-time gig and as part of that we ask our listeners to become members supporting members of the changelog you can go to the changelog.com membership to learn more but right now i'm gonna rattle off a list of i don't know how many but quite a few members that have stepped up in support of the change log. Gabriel Solis. Forgive me if I mispronounce a few names here because some of them do have nine letters like my last name or like Gregorik. Jonathan Loweniski.
Starting point is 00:53:20 Sorry if I messed that one up. Darcy Clark. Mike Oliveri. Todd Ward. Colin Coghill. G Jensen Magnus Endger that's an awesome name Benoit Tijonat I believe that definitely messed that one up that's French though so you can I get a buy on French names Charles Hicks this one I can't even pronounce uh Pena Goddess I'm not even sure how that one sorry about that David can't even pronounce. Penegoddess. I'm not even sure how to pronounce that one. Sorry about that. David, can't see your last name either.
Starting point is 00:53:48 Steven Howes, Brett Weaver, and Jan Novak. All these awesome people have stepped up to make sure the changelog stays around and support us and going full time. So if you want to do it too, the changelog.com slash membership. You got some awesome benefits there. I won't tell you what there are now, but lots of cool stuff on that page there check it out uh ilia thanks again so much for coming on the show working with us on changelog nightly definitely excited about shipping that in the future we've got some awesome sponsors i think to mention as well let's see who those are code ship top towel and code school awesome people so with that let's say goodbye
Starting point is 00:54:23 everybody goodbye And Code School. Awesome people. So with that, let's say goodbye, everybody. Goodbye. Bye.
