The Changelog: Software Development, Open Source - GitHub Archive and Changelog Nightly (Interview)
Episode Date: February 27, 2015
Ilya Grigorik joined the show to talk about GitHub Archive, logging and archiving GitHub's public event data, and how he uses Google BigQuery to make querying that data accessible to everyone....
Transcript
Welcome back everyone, this is The Changelog and I'm your host Adam Stachowiak.
This is episode 144 and today we're talking to Ilya Grigorik.
Ilya is a self-professed internet plumber as you'll hear on the show today.
He works at Google and he basically makes the internet faster,
and much more, of course, for those who are fans of Ilya's work.
Ilya has a side project called GitHub Archive,
which we took some interest in because we wanted to start shipping
a daily email, nightly as a matter of fact.
I'm going to break some news real quick because we're going to tell you
later in this episode, but I have to tell you now
because I want you to sign up for this email.
Stop the show, go to thechangelog.com slash nightly and sign up right now.
It's an email we're shipping now called Changelog Nightly,
as you can tell from the URL.
And this email unearths the hottest repos on GitHub every single night
and drops them in your inbox.
It's going to be awesome.
You're going to love it.
We have some awesome sponsors for today's show,
CodeShip, Toptal, and CodeSchool.
We'll tell you a bit more about CodeSchool and Toptal later in the show,
but our friends at CodeShip,
they have released a brand new feature called Parallel CI.
If you want to get faster tests from your builds,
you've got to run your builds in parallel.
They recently shipped Parallel CI,
and now you can split your test commands
into up to 10 pipelines, 10 test pipelines.
This enables you to run your test suite in parallel
and drastically reduce the time it takes to run your builds.
They integrate with GitHub and Bitbucket.
You can deploy your code to cloud services
like Heroku or AWS and much more.
And you can get started today by trying out their free plan,
which includes 100 builds a month and five private projects.
Or you can also use our offer code, thechangelogpodcast,
to get a 20% discount on any plan you choose for three months.
Head to codeship.com slash the changelog to get started.
And now on to the show.
All right, everybody, we're back.
We've got Ilya Grigorik joining us today.
Ilya, you've been on the show before, episode 55, if we go back in time,
back when Wynn was around, and that was an awesome show.
You were talking about Goliath and Event Machine and other fun stuff.
So welcome back to the show, man.
How are you?
I'm doing great.
Thanks for the invitation.
And, yeah, it's been a while. I think that episode was back in 2011.
It's like we were laughing before the show how forever ago that was, basically.
And that's just crazy.
Yeah, SPDY was a brand new thing back then, and now it's been completely replaced.
That's right, actually. As of today or yesterday, the new HTTP2 stuff is now official.
So we went from this kind of experimental thing that a couple of engineers at Google started
to something that's out in the wild and ready to be deployed.
It's pretty amazing.
And as you can hear, we also have Jerod Santo on the call as well.
So Jerod, say what's up.
What's up?
What's up?
So Ilya, this is sort of a two-part show, right?
We got sort of an announcement from us, which we'll talk about later.
But then we have this awesome project that you started, GitHub Archive, and that kind of tees off of Google's BigQuery project.
You've been on the show before, but since it's been a while, back then you were a founder and CTO of PostRank, which has since been acquired by Google.
And you now work at Google and it's been all that time since then.
How do you introduce yourself now whenever you're on a stage or saying hello to people?
So nowadays within Google, I basically work as an internet plumber.
And my job is more or less to figure out how to make the internet faster.
That's related to spec work for things like Google Chrome,
so open standards, also things like HTTP2 and all the rest.
So trying to figure out how to make Chrome faster,
how to make Google products faster,
because we know that speed helps user retention
and just leads to happier users,
and also just make the internet faster as a whole.
So it's been a pretty fun gig.
I like the name, or the title at least, Net Plumber.
It's a good description of kind of the dirty work
that you actually have to do to make it all work.
Yep, yep.
Someone's got to do it, right?
And how big is your team that you work with there at Google?
That's a good question.
It's actually a big distributed effort, as you can imagine,
because every product within Google is focused on speed.
So I work and kind of collaborate with a lot of different teams within Google
and then even outside of Google as well.
So we work with mobile carriers because mobile is becoming so important
and everything's migrating there.
Other vendors, Microsoft, Mozilla, Apple,
and all the rest.
So it's hard to say what the team is
because we don't have just a strictly unified team,
but it's hundreds of people.
So do you work in an office?
I work in an office,
but I do find myself kind of jumping between offices
depending on where I have a lot of meetings. But my
day-to-day is mostly in
Mountain View, California.
Gotcha.
Well, cool. Didn't you live
outside of San Francisco before
you were acquired? That's right. Yeah, we were actually
in Waterloo, Canada.
Then we moved to
sunny California.
Yeah.
Must be better, right?
So you're not dealing with the whole
winter apocalypse, or what is it called over there?
Snowpocalypse.
Snowpocalypse, yes.
Feel bad for the people on the East Coast.
Here in sunny Texas it's just around
55 degrees today or something.
Yep, can't complain.
Well, I got sun, but I also have like
9-degree sun.
So it's not exactly... I guess I can complain.
Not as much as them on the East Coast, but I can still complain.
So maybe just to paint a little bit of the history for people too to kind of let them know which Ilya we're talking to.
I don't think there's that many out there.
VimGolf, GitHub Archive, you were a Ruby hero in 2008.
You started PostRank. Can you kind of give a primer of what PostRank was
prior to the acquisition
by Google, and sort of
what made you this internet plumber you are today?
Yeah, the PostRank work was actually,
I'm going to say, mostly unrelated to the work
that I'm doing now. But the idea behind
PostRank was to
help measure the impact
of just
social advertising on the web.
And by advertising, I mean, say, I share a link to an open source project,
and we want to be able to figure out what did that yield?
Was that a good share in the sense that a lot of people click on it?
Whom should you approach?
Or where should you advertise to make the most of your advertising budget?
So we built a collection of tools for marketers, advertisers, and the rest to kind of help them facilitate that whole end-to-end cycle of,
we want to invest money into social, and what is the return on investment?
And then I guess that was interesting to Google,
who at the time was really kind of expanding their social strategy.
So we ended up there and I spent about
a year building and rebuilding some of the products that we built within PostRank. And then
jumped to kind of this web performance work, which has always been my passion
in the background. The last episode that we did with you guys with Goliath was actually centered
around that. We wrote our own HTTP server because we found that the performance of some of the existing servers just wasn't up to par.
So I've always kind of played that plumber role,
perhaps in the background as an engineer.
And then when I got the opportunity
to actually focus on it full-time,
I decided to jump on it.
Cool.
As Adam mentioned, you are a Ruby hero,
and many people probably remember you from igvita.com,
your blog that you wrote on very commonly back then.
Curious if you're still slinging any Ruby inside Google or if you've switched tool sets.
I am, but perhaps not kind of day-to-day production projects.
So I still have open source projects, certainly a lot of my day-to-day work for, you know,
I need a script for this, I need to automate something like that.
I still use Ruby, it's my default language to date. Probably the most recent
project that I've worked on was actually the HTTP2 Ruby gem. So it's a pure Ruby implementation of
the full HTTP2 protocol. So that was fun, just kind of roll up my sleeves and work
on that. And I should also mention that I got some great contributions from a number of people on GitHub, from the Tokyo community in particular. So they've been helping
out quite a bit. So it was really good. And I actually went to Tokyo last year to talk at
the HTTP2 event. So I met a lot of the Rubyists there that were also doing HTTP2. So that was really good.
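(For the curious: the http-2 gem is event-driven and transport-agnostic; it speaks the HTTP2 framing layer and leaves the socket to you. A rough sketch along the lines of the gem's README, so treat the exact callback names as an approximation rather than gospel:)

    require 'http/2'

    conn = HTTP2::Client.new
    conn.on(:frame) { |bytes| sock.write(bytes) }  # outbound frames go to your socket

    stream = conn.new_stream
    stream.on(:headers) { |h| p h }                # response headers arrive here
    stream.on(:data)    { |d| print d }            # response body chunks arrive here

    stream.headers({
      ':scheme'    => 'https',
      ':method'    => 'GET',
      ':authority' => 'example.com',
      ':path'      => '/'
    }, end_stream: true)

    conn << sock.read  # feed inbound bytes back into the protocol state machine
    # 'sock' is your own TLS socket (not shown); the gem never opens one for you.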
Awesome, man. Well, we're here today to talk about a specific project of yours,
one that had caught our radar and that I was quite fond of for some time,
which is your GitHub Archive project, githubarchive.org.
Tell us about that and what it is and kind of where you got the idea for it.
Sure.
So this is a fun one.
And I think, as with most every open source project, it starts with a personal itch.
And my personal itch was: I love open source, I love following open source, and I'd like to keep on top
of what are the interesting projects that are coming up or being released, what are the new issues,
so on and so forth. So back in the early days of GitHub, I would just follow a lot of people, right?
Once you follow kind of the right people in any particular community, you just observe what they star or what they comment on and all the
rest. And that worked well for a while. But with time, as more and more projects and more and more
people joined GitHub, I found that, and I was subscribing to like a thousand-plus projects and
people, that my stream was actually just being overwhelmed. Like, I went from being
able to check my stream on github.com for what are the new events once every couple
of days, to I had to check it once a day, because they had a limit, I think it was like 500 events,
and it would just kind of scroll off the bottom so I couldn't catch up. And then finally it was
like every half a day I would have to check it. And clearly that wasn't scaling. So that was a problem.
And I figured like, hey, I should figure out how to solve this problem.
I specifically wanted to just solve that problem for myself.
So I started looking around and realized that GitHub actually provides this API where they show you the latest activity.
So this is like somebody opened a pull request, somebody closed an issue, just basically anything and everything.
And if you pull that API, you can actually get the live stream.
So based on that, I said, okay, well, fine, I'll just write a crawler
that will just basically sit there in a loop and just log all of that data.
And then once I have the data, I can mash it up and answer the question
that I actually needed answered.
So nothing like a good yak shave.
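(To make the idea concrete, a minimal sketch of such a crawler in Ruby. This is not the actual GitHub Archive code; a real crawler would also need authentication, ETag caching, rate-limit handling, and de-duplication of events seen across polls:)

    require 'net/http'
    require 'json'
    require 'zlib'

    loop do
      # The public events feed returns the latest public events as a JSON array.
      events = JSON.parse(Net::HTTP.get(URI('https://api.github.com/events')))

      # Append this batch to the current hour's archive, one JSON event per line.
      archive = Time.now.utc.strftime('%Y-%m-%d-%H') + '.json.gz'
      File.open(archive, 'ab') do |f|
        gz = Zlib::GzipWriter.new(f)
        events.each { |event| gz.puts(JSON.generate(event)) }
        gz.finish # close out this gzip member without closing the file
      end

      sleep 60 # stay well under the API rate limit
    end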
So that was basically the inception of GitHub Archive. I just wrote a little Ruby crawler that sits there and just collects this data, logs it into archives, hourly archives, which I store
all on cloud storage. At the time it was S3, then I moved it to Google Cloud Storage. And I think I started that back in March of
2012 or 2013, one of those years.
I think it was 2013. And ever since it's just been logging
those hourly archives. And then based on that data
I tried to figure out, okay, now you have those JSON
payloads,
what can you do with it?
Because now you actually have a ton of data,
which is in itself kind of a problem.
And processing all of that may take some time.
And Google at the time just announced BigQuery,
which was actually an internal project called Dremel within Google, which is incredibly popular. And the idea there is you have a SQL-like syntax
that you can write, and under the hood, it's actually implemented as a MapReduce job.
So you write SQL, it gets translated to MapReduce, and you can run it across massive data sets.
And it returns really fast, which was a plus, of course, because you want to have that sort
of kind of interactive querying capabilities.
And that just became available as BigQuery, so I figured like, hey, this seems like a
perfect fit.
So if I just push all of this data into BigQuery, I get the ability to kind of query the data
really easily and in a fast way. And two,
that data actually becomes, I can make it as a public data set such that other people can
come and write queries against it without having to do the import. And the import part turns out
to be kind of significant. I think we'll talk about it later because at the time there were
some restrictions on how I could store the data.
But the gist of it is write a crawler,
log all the data, import the data into BigQuery,
and then with BigQuery you have this data set that's easily queryable.
And then finally, after doing all the yak shaving,
I wrote a simple Ruby script
that just queried BigQuery once a day
and sent me two things.
The top 10 repos that were open sourced within the last 24 hours, and the top 10 by number of stars
received. So just which of the new repos received the most attention, and then which repos that are
not new received the most stars. And that became a newsletter that I also opened up to other people
and approximately 1,000 people have signed up for it.
And funny thing, I actually, last year, got a report from MailChimp,
which is the service that I was using to do the delivery.
And I was just looking at it before the show.
And it said that the average open rate for those emails,
for those campaigns, was about 40%,
which is massive, right? Compared to the rest of the industry. So like it was not only were people
signing up, they're actually opening the emails. And the click rates for the links sent in there
was about 15%, which is about 10 times higher than your industry average.
So it worked.
It worked. I like that.
While it worked on me, I was a subscriber,
which is kind of how I came across the whole thing.
I can't remember who keyed me on to the newsletter immediately or originally,
but I had been getting it for months.
And you were scratching your own itch,
but you scratched an itch that Adam and I have around these parts,
trying to keep up with open source, is to find the new repos.
And yeah, every night, I think coming in,
something about daily emails where it's like,
eventually you kind of get sick of them because you're like,
yeah, here it is again, every single day, old faithful.
But more often than not, you know, there's good stuff.
There's gems hidden in there.
So it was very valuable.
It definitely worked.
It's interesting that it kind of came, you know, I assume because you're at Google and BigQuery is a Google product,
that it kind of worked the other way around, where it was like, here they had BigQuery
and maybe they asked some of their employees to use this thing,
to have some good use cases.
But it's kind of interesting that it was kind of organic,
the way that you ended up using BigQuery.
Yeah, it was actually pretty lucky timing,
because at the time I wasn't even aware that the product was going to be released.
I was just kind of struggling to figure out a way to do this.
I figured I'd have to write just my own processing logic to go through all the archives, but then they announced BigQuery and it was like,
oh yeah, I heard of this thing called Dremel in Google that's very popular, so let me just give it a
try. And it turns out that was actually probably one of the best decisions I've made, just because
aside from having the ability to do these fast queries, it opened it up to anybody and everybody
to just run arbitrary queries.
And one of the benefits of BigQuery
is that it actually gives you a free quota.
So you do need to have a Google account,
but once you sign in, it's a webpage
where you just type in your SQL query
and you can ask it any question you want.
So if you are curious about what is the top starred repo for, I don't know, for a particular user,
you can just write a query and get an answer to that.
Or if you're interested in what are the top repos that have the most issues open that are Ruby repos,
you can ask that too and you get immediate answers.
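(For a flavor of what that looks like, here's a sketch in BigQuery's legacy SQL against the original githubarchive timeline table. Table and field names have changed over the years, so check githubarchive.org for the current layout. This one asks for the top 10 Ruby repos by stars, which the old schema recorded as WatchEvents:)

    SELECT repository_name, COUNT(*) AS stars
    FROM [githubarchive:github.timeline]
    WHERE type = 'WatchEvent'
      AND repository_language = 'Ruby'
    GROUP BY repository_name
    ORDER BY stars DESC
    LIMIT 10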
And that turned out to be very popular with a lot of people
because it just kind of feels very easy to approach and start asking these questions. So we've seen a
lot of kind of really interesting projects built up around it, even though originally it was just
meant to solve this kind of very narrow problem that I had.
Any for-instances for us of people using the GitHub Archive BigQuery data set and making cool stuff with it? Oh yeah, oh man, there's so many.
So we actually worked with, when I started this,
I also pinged the GitHub crew to make sure that this is all good
and I'm logging this data
and they don't have any kind of issues with it.
And to their credit, not only did they say it was all great,
but they also helped me to get the word out there.
So actually, Brian, who's on their
marketing team, has organized the big data challenge, the GitHub data challenge, for the
last three years, where they have a prize or set of prizes at the end. And basically, the idea is
like, here's the data, use GitHub Archive, or any other data if you want, and just build an
interesting visualization or something that kind of extracts some interesting insights out of the data.
So if you guys go to githubarchive.org
and you scroll down to the bottom,
there's actually a collection of links to various projects
and also the blog posts from GitHub where they show the winners.
And some of my favorites, I'll just, I guess, pick out a few.
There was one project that was the open source report card.
And the idea there was you type in a username,
and the report card would be for a particular user.
So it would aggregate all of the repos that you worked on
and kind of figure out which languages you contribute to,
what type of commits.
Like, do you typically open issues?
Do you fix issues?
Do you write code?
Do you kind of discuss more?
And give you kind of a nice description of the type of work that you do on GitHub. So that was kind of cool.
Another project was just showing the geographic distribution. So you pick a project or even a
language and you can say, where are my contributors coming from? Are they from US, Europe, New Zealand?
Just show me a map,
which is kind of like a simple intuitive thing to ask,
but it's something that GitHub doesn't provide by itself.
But here you just had this kind of third-party tool
fill in that gap.
Another one that's kind of,
and all these projects approach the data
from a different angle.
So the one was on users, one was on projects.
GitHut is an interesting one.
So that's githut.info.
It actually provides a really cool visualization
for comparing programming languages.
So you can see, for example, that if you select Ruby,
you can see where it is ranked
in terms of number of pull requests
or issues or other things on GitHub.
So not surprisingly, today JavaScript is at the top
in terms of the number of just commits.
Yeah, this one actually made the rounds, I think.
It was either last week or even maybe just Monday.
This GitHut, that's G-I-T-H-U-T.
Somebody had posted to some, whether it was Hacker News
or something about the top languages of the year,
JavaScript being so massive.
And that came across my radar and I saw it and I'm like,
oh, that's pretty cool.
I didn't even think that that was using the same data.
So yeah, that's been a fun one.
And I've seen it pop up a few times
because I think it was actually done as last year's entry.
So yeah, that one's really cool.
Yeah, I guess every year becomes interesting again, right?
Because you can see what happens since last year.
Right, yeah.
And the great thing about this stuff is
they're just leveraging BigQuery under the hood.
So every once in a while they rerun the queries, right?
So the data is always up to date.
They don't have to worry about collecting the data
or doing any of the other stuff. And now a word from our sponsor.
Toptal is the best place to work as a freelance software developer. If you're freelancing right
now as a software developer and you're looking for a way to work with top clients on projects
that are interesting, challenging, and using the technologies you want to use,
Toptal might just be the place for you.
Working as a freelance software developer with Toptal,
your days of searching for high-quality, long-term work and getting paid what you're worth will be over.
Let's face it, you're an awesome developer, and you deserve to be compensated like one.
Joining Toptal means that you'll have the opportunity to travel the world as an elite
freelancer. On top of that, Toptal can help provide the software, hardware, and support you
need to work effectively no matter where you are. Head to toptal.com slash developers, that's T-O-P-T-A-L dot com slash developers,
to learn more, and tell them The Changelog sent you. So essentially, GitHub Archive
is a snapshot or
the big data snapshot of
all of GitHub's public activity.
That's right. So you actually have two ways of
interacting with that data if you want. One is
you can go and download the raw
archives, the hourly archives,
and that just gives you exactly the data
as I saw it coming from GitHub.
And you can apply anything you want to it.
So if you want to, I don't know, put up your own Hadoop cluster
or write your own Ruby script to process it, go for it.
And the other option is, the more convenient option,
is to use the BigQuery interface where you can just write the SQL stuff.
So whichever one fits you best.
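(The raw archives are plain gzipped JSON, one event per line, fetchable by hour. A small Ruby sketch, assuming the data.githubarchive.org URL scheme documented on the site, where the path is year-month-day-hour in UTC:)

    require 'open-uri'
    require 'zlib'
    require 'json'

    # One hour of public GitHub activity.
    URI.open('http://data.githubarchive.org/2015-01-01-15.json.gz') do |file|
      Zlib::GzipReader.new(file).each_line do |line|
        event = JSON.parse(line)
        puts "#{event['type']} on #{event['repo']['name']}" if event['repo']
      end
    end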
So if you want to make a GitHut or something like this,
you can use the GitHub Archive data set
to sort of slice and dice big data coming from GitHub.
That's right, yep.
Gotcha.
It's pretty neat how these artifacts
can have such insights after the fact
and just the foresight of collecting the information.
I guess this is kind of the whole conceit of big data, right?
It's like one person with the foresight of let's collect this data
and make it publicly available.
Down the road, it opens up all these opportunities
and visualizations and insights into the open source community
that otherwise wouldn't have been available.
Yeah, absolutely.
And I guess one thing that I've learned through this process,
and I've seen it happen before as well,
is it's very important to make the analysis of the data
very cheap and easy.
Because that enables a very different type of collaboration
and just iteration.
Because if it takes you, let's say, half a day to answer a question,
you're very limited in the types of questions you can ask of the data.
Whereas if you get a very quick response,
you can actually start iterating on your questions.
So it's very often that I'll start
with a particular question and then be like,
oh, well, I didn't expect that.
That looks like an outlier.
Let me drill in a little bit further.
So having the tools,
and this is where the BigQuery stuff really helped.
And by the way, there's other projects that can do this sort of thing.
There are open source projects.
I think Amazon has some of the kind of similar capabilities.
The fact that it's BigQuery or not is not the important part.
It's just the fact that it allows you to quickly and easily ask questions and get fast answers.
And just having that has been incredibly valuable.
When you say fast, how fast is fast?
Well, you're processing on the order of, let's see,
I think the current data set is on the order of a couple hundred gigs,
and you can process all of that in a span of one to ten seconds.
So if you write a very complicated query,
then it'll take up to ten seconds.
In comparison to, say, doing that on your desktop would be a day?
Well, yeah.
Depends on your desktop.
Yeah, on your desktop.
Run-of-the-mill MacBook Pro.
It would take like an hour just to read the data off disk, right?
Whereas if you have a nice distributed system,
you would just read it from many different disks,
and that goes a heck of a lot faster.
I'm just trying to paint a picture for those out there
who are like, what is this BigQuery
and what does he mean by fast?
Because an hour or two hours is way slow.
Sure.
10 seconds is way fast.
So that's a great question.
So I guess for context,
so BigQuery is the public version of a product
that we use internally at Google called Dremel.
And Dremel is used to analyze terabyte-sized data sets,
so in multi-terabyte data.
And you're leveraging the large computer infrastructure that Google has,
and a terabyte of data can be processed in the same order of magnitude,
kind of seconds at most minutes,
which would take otherwise literally days or weeks
on your single computer.
I think this is cool.
I think this GitHub archive kind of shows really
that internet plumber attitude that you have towards things
because what you've done is some of the dirty work, right?
And you started off with this itch to scratch.
And you know, I have these all the time,
and I'm sure developers out there,
we always have like this little,
ooh, if I could just do this, it'd be nice.
And then you follow the thread a little bit
and you realize this is like two, three days,
maybe a week's worth of work,
or whatever threshold that's just like,
and you just kind of shelve it.
If you would have done that,
you know, all these other projects
probably wouldn't exist
because you've lowered the bar for them to get to the interesting part, right?
I want to visualize the data.
You just wanted the email of the repos every night,
but all this extra work actually turned into something that we all can use
and has made the ecosystem more fruitful because of it.
Yeah, I think that sounds about right.
That sort of approach does take a bit more at the beginning
because you're required to do more work,
like how to make sure that this is accessible,
it's usable by other people and all the rest.
But in the long run, I definitely think it's kind of a better approach
because exactly as you said, it allows other people to leverage that data. And it also allows me to play with the data more because instead of just having
that report card for what are the interesting new projects yesterday, I can ask it tons
of other questions.
Yeah, so it also means that you got something you need to maintain as well, which is kind
of the other side of that coin. So GitHub Archive is not a new project.
It's been out there.
And GitHub itself changes.
I think the API has changed over time.
Did that present any difficulties for you,
API changes on the GitHub side?
It did in some ways.
So the trouble here was that,
and this is more of a kind of BigQuery-specific gotcha,
when BigQuery was first introduced,
you could, in terms of the data schema
that you could store within BigQuery,
you had to define that upfront.
So you would say, you know,
these are the columns for all of my records.
And you couldn't change that afterwards.
You could create a new data set that had a different schema.
But later that actually started causing problems because, as you said, GitHub would, you know, enhance their product, add new fields, maybe deprecate an old field.
And I had a little bit of pain there where even though I was logging all of the raw data, you know, I always had the raw data stored.
I would have to kind of massage it into the schema that I froze early on,
such that you could run a query against the entire data set.
So that did cause a little bit of friction.
But then last year, BigQuery actually allowed you to start importing just like JSON payloads,
so unstructured data.
And this actually gave me an opportunity to go back
and kind of revisit my original implementation.
And I switched it after kind of a bunch of back and forth
on what's the best way to do it.
Earlier this year, or actually exactly on January 1st,
I switched to a new model where instead of having every column be fixed,
I'm actually fixing a subset of the columns,
which I know are stable.
And if you look at the API documentation
for the events API in GitHub,
they'll tell you that these five columns are fixed.
They will always be there.
But then take, for example, a pull request versus an issue.
Both of those always have an actor,
so somebody who's doing the action.
And both of those have like a timestamp
and something else.
And that's always there.
So that's like I mapped that into distinct columns.
But then the actual payload of the request
or the activity is different for each activity.
And that I just store as kind of a JSON blob.
So it just requires a little bit more work
on people that are writing the queries now
to kind of reach into the JSON data and pull out the fields that they want.
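(Concretely, that means using BigQuery's JSON functions to pluck values out of the payload string. A sketch against one of the daily tables; the table naming here is illustrative, so check githubarchive.org for the current layout:)

    SELECT repo.name, JSON_EXTRACT_SCALAR(payload, '$.action') AS action
    FROM [githubarchive:day.events_20150101]
    WHERE type = 'IssuesEvent'
    LIMIT 10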
But now I don't have this problem at all, because GitHub could just change anything they want and I just throw that
data into BigQuery and there's no updates needed on my end. Yeah, I think I ran into that as I was
trying to do some of the queries to get the email going.
And I'd just like to say that, man, you are fast on the trigger, helping out on the issues.
I appreciate how quickly you got back to me on GitHub, helping me out with getting the queries all going.
January 1st, so that's right about probably the same time you turned the email off, is it not?
Yes, so actually, yeah, on January 1st, after that update went out, I guess I didn't explicitly
turn it off as much as I didn't update the query in my daily run. And then I realized that, whoops,
the data schema has changed. Or rather, I stopped logging data into the same table.
I had created a new table, and now I'm actually creating daily tables.
One of the things we found was
because I've been logging data into the same table
for now over three years,
and we've actually backported some data with GitHub's help,
so there's about three and a half years of data there,
you do have a free quota,
but it's very easy to exceed that free quota
if you're not careful.
So that was the reason why we went into this new model
where each day and each month is a separate table
such that people can experiment a bit more
without exceeding their free quota.
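(With daily tables, a query only scans, and only bills for, the days it touches. Legacy BigQuery SQL can still span a range of days with TABLE_DATE_RANGE; again, the table prefix here is illustrative:)

    SELECT type, COUNT(*) AS events
    FROM TABLE_DATE_RANGE([githubarchive:day.events_],
                          TIMESTAMP('2015-01-01'),
                          TIMESTAMP('2015-01-07'))
    GROUP BY type
    ORDER BY events DESC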
So what was the thought behind,
I mean, obviously because it just stopped working,
you didn't turn the email back on.
There were a few cries for help, myself being one of them.
I think there's three or four other people on your GitHub account
asking what's up with the email.
What was your decision to not rewrite that?
So as I said, the first two days,
or three days before I realized that the email stopped coming in,
because as you said, there's something about daily emails
where after a while you start to tune them out.
It took me about three days to register, like,
oh, right, this is why I'm not seeing it.
And then a couple of GitHub issues popped up.
And the thought process there was, I guess, twofold.
One was, since I've actually started the GitHub Archive newsletter,
GitHub came up with their own trending repositories email
that you can sign up to.
And I've subscribed to that, and to be honest,
I actually don't find it as valuable as the one I implemented,
but of course I'm biased because...
Hey, we're with you.
Otherwise I wouldn't ask you to turn it back on if I was satisfied.
What they're doing is basically the same thing,
but they just provide fewer repos in there.
I think it's like the top 10.
And they also don't separate what are the new repos
versus the old repos that got the most activity.
So I read both effectively,
but I find different things in both repos or in both emails.
But at the same time, it was there.
So once I realized that the email stopped going out, I actually wondered if anybody would cry about it, if anybody would contact me.
So I gave it another couple of days.
And sure enough, as you said, there was a couple of issues that were being opened on the repo.
And then at the same time, I guess you guys reached out to me about the work that you guys are doing.
And at that point, it kind of became clear that perhaps I should find somebody else to run that project and focus on the infrastructure part,
which is kind of to your point earlier, just enable other people to build cool things on top.
Well, that's a good segue then, isn't it?
Yeah, we obviously have a hand in the bag, so to speak,
in terms of shooting out emails and stuff like that.
And to our best ability, we try to keep up.
And as Jared mentioned, we used GitHub Archive before
and we were like, well, that's a bummer.
It's not going on anymore.
So it made sense to reach out and see if that was something we could take over.
And we've since had some conversation about it.
We're launching a new email called Changelog Nightly that will essentially become what GitHub Archive was, the daily emails at least.
And working with Ilya, we've transferred the email list.
So we're going to work with Ilya on making sure this continues.
So if you're listening to this and you're on that email list
and you get an email from us here in the near future,
it's the same email list and we'll sort of put out an announcement
in addition with this podcast to sort of clear the way
in terms of not spamming and stuff like that.
This is a collaboration.
So yeah.
And I'm really excited.
You guys showed me the preview of the email.
It looks great.
It looks much better designed than what I managed to pull off
in my version.
So that's awesome.
So if you're listening to this right now, we're recording this in the past,
but you're going to listen to this in the future.
So if you're hearing
my voice right now, you can actually go to thechangelog.com slash nightly. That may move to
nightly.thechangelog.com in the near future, but for now it's going to be there. You can subscribe now.
Hopefully, Jerod, you can give me a nod or something like that to say for sure. We're
shooting emails right now; we're doing it internally.
As Ilya just mentioned, we've shared the design with him now.
And Ilya, on the work, I love how there's sort of layers to this onion.
Your itch from way back when, all this work with BigQuery, all this work with storing this data. And then now we've come behind you.
Would you say that you're a designer, Ilya, or would you say you're not a designer?
Sometimes I pretend to be a designer.
I can't say I'm a good one.
Play one on TV.
So, I mean, I would consider myself a designer.
And when we came across this project and taking over the email part of it, I was like, there's a way, I like the
data, but there's a way we can visualize it a little differently.
And, you know, we're sharing the stars a lot clearer, the up stars for that day a lot more
clear.
So when you see this email, you're going to love how it looks.
We even went as far as making it have a night theme because we figured if you're going to
be, you know, on your phone or on your MacBook at night,
if you're in the East Coast central time zones or in the U.S. time zones,
it's probably going to be at night.
Otherwise it's daytime or something like that for you.
But we figured let's ship it with a night theme.
So we made it dark.
We may actually offer a day and night theme in the future,
but at least for now it's going to ship with a night theme
to make it, you know, a little bit easier on the eyes at night.
And as Adam said, we've been shipping this just to ourselves over the last couple days.
And I texted him, was it last night?
I'm like, I'm so happy.
It's weird.
I'm a total nerd that this makes me so happy to have this back every night.
So we're excited to get it out there and get it in your mailboxes as well.
Yeah, so go to thechangelog.com slash nightly, sign up.
And Ilya, I think we've grown the list a little bit since you handed it to us.
It went from just around 900 to I think just a little over 1,000 now.
Oh, that's awesome.
We've actually grown the list a tiny little bit.
So hopefully between this and
our changelog weekly email which won't change we'll still share repos in there that's more of
our edits realized um you know highly curated email whereas this one's automated so they sort of
sister and brother in that regard where you got nightly which is sort of this constant daily
update nightly update and then our changel weekly, which goes out on Saturdays,
which is links, videos, top repos
that are hitting our radar.
We're still sharing that email.
Yeah, that's awesome.
So for the GitHub mailing list,
or the original mailing list,
I've never actually even actively promoted it.
It was just one of those things where I had the email,
and I think one time I forwarded it to one of my friends
because I was like, oh, look, your repo has made the list.
And then he asked me for, like, where can I sign up?
So after that, I just dropped a link on the githubarchive.org website,
and I never actively promoted it,
and yet somehow it gathered 1,000 people.
So I'm curious to see where you guys take it, because I agree, I find them incredibly valuable.
And I actually think there's a lot of room
for kind of experimenting in the space as well.
One of the things that I've wanted for a long time
and just never got around to it
was creating more thematic lists as well.
So right now it's just like everything across GitHub, right?
But if I'm particularly interested in, let's say, Ruby or Node or something else,
you can imagine just scoping it to that, which would be quite cool.
I would say, and Jerod, you can back me up on this,
but I would say that this is definitely a start for us.
I think that as we can get more and more interesting data
out of what you've been storing in BigQuery and GitHub Archive, I think that I'd love to keep exploring.
I think this is just the tip of the iceberg for us because I've already had tons of fun just doing what we've done so far.
And I think we'll just keep – from a listener's perspective, if you're listening to this and you love the changelog and you're a member or you're not, we aim to serve the open source community as best we can.
And sometimes that might be shipping really awesome emails. Sometimes that's doing a really awesome
podcast. Sometimes that's sharing things on Twitter
or a blog or wherever or going to a conference. So this is one of the ways
we definitely plan to press hard.
Yeah, and we do plan to also open source the repo that runs nightly so
you can contribute as well if you're a reader of the email and you want to see a new data point
in there or you'd love to have these language specific emails which has definitely been
something we've discussed internally, but it's a little bit more heavy lifting. It's going to be
open source; you can hop on there, open an issue, or fork it and do all that good stuff too.
So Ilya, what do you think about Nightly then?
What are your initial thoughts and just us taking it over
and not having to worry about the burden of the email anymore?
I'm happy. I'm super happy.
So first of all, I get my emails back, which is great
because for the last month and a half
I've been relying on the GitHub version. And as I said, you know, those are great, but I don't find that they're as interesting in
many ways. I don't discover as many interesting things. That and just having you guys work on it,
I think you'll do a much better long-term job of it. Well, I appreciate that. And, you know,
these itches you keep scratching too, I would say let's figure out a way to keep working together.
I know Jerod and I will take over and start doing some things, but
if you've got a particular email that you want to see go out
or dataset pulled from this, then let's work on it.
Let's figure out a way to make it happen.
Yeah, for sure. I'm really happy to hear that you guys are going to make the
actual code for that open source.
And I guess I should mention that all of the GitHub Archive source for the website, for the crawler,
and even for the old reports, if you still want them, is online.
So if you just go to github.com/igrigorik/githubarchive.org, that's the repo.
And if you find bugs, improvements, all of that stuff is welcome.
Awesome. Well, we do have a note in here to talk a little bit about the future of GitHub Archive.
If you have future plans or a roadmap, or if you consider it kind of a finished thing as a piece of plumbing, what are your thoughts on that? So I think it's mostly a finished thing in the sense that the crawler is running.
It's stable.
I think I've figured out all the bugs there and I'm really happy with it.
Like it's been running for years.
So that part is good.
What I would like to do is maybe go back and revisit how I've imported some of the data into BigQuery.
Because as I mentioned, the schema was changing and I had a frozen schema. So some of
the fields may not be there that perhaps should have been. So that's just kind of one of those
things where I would like to get to it, where I'd like to go back and re-import the old data in the
same way that I'm importing the new data now, just to make it all nice and consistent. And if
somebody is interested in taking that on,
that would be even better, to be honest.
But that's probably the main thing.
Otherwise, it's running, it's humming along.
I should plug, actually, the BigQuery team.
They've given me a lot of support,
and they've paid the bills for hosting all that data.
So kudos to them.
So there's that.
There's no kind of concerns over how much data we're storing.
So that's been really good.
That's nice right there.
I was going to add, I guess, on the tail end of that,
is this one of the main ways,
is there any other large data sets of GitHub data out there
other than this one?
You know what?
I'm not sure.
I don't think so.
Not that I've come across.
I keep coming across projects that I'm surprised to find out are using GitHub archive data under the hood because either they grab the archives or they're using
BigQuery. But I've not seen other people kind of log it
and store it and process it on their own.
And in terms of the future, you mentioned your collaboration with GitHub and improving things and things like that.
Are they mutually involved in this to a degree?
Is there any sort of interest in this for them in the future?
Yeah, I think so.
Actually, just last week I was exchanging emails
with somebody at GitHub
where they're interested in engaging the academic community
and actually coming back to the kind of interesting use cases of this data,
I've been approached by a number of researchers in various universities
that are using GitHub Archive data for analyzing things like
what makes a great open source community,
or what patterns do they exhibit,
what makes a resilient open source community, so on and so forth.
It's kind of like the social dynamics of open source.
So there's been a couple of papers published on this stuff using GitHub Archive data.
And I think GitHub in particular is interested in getting more of that kind of collaboration
with the academic community.
So we're chatting now about potentially exposing additional data sets via BigQuery, because
clearly the researchers are already using that interface. So in the future, you may see some additional augmented data become available through GitHub Archive.
But we're still kind of working through the details of what that is and how that would work and all the rest.
And now, a word from our sponsor.
It is time to put the programming books away.
Put them away. Put them down.
And learn by doing with CodeSchool.
CodeSchool offers a variety of courses to help you expand your skills and learn new technologies such as JavaScript, Ruby, iOS, Git, HTML, CSS, and many more.
CodeSchool knows that learning to code can be a daunting task. They combine experienced instructors with proven learning techniques to make learning to code educational as well as memorable, giving you the confidence you need to continue past the hurdles.
They're always launching new courses on new technologies and offering deep dives on tried and true languages.
So if you don't see the one you need, suggest a course and they'll build it if there's enough demand.
CodeSchool also knows that languages are a moving target. They're always updating content to give you the latest and
greatest learning resources. You can even try before you buy. Roughly one out of every five
courses on CodeSchool is free. This includes introductory classes for Git, Ruby, and jQuery,
which allow free members to play full courses with coding
challenges included. You can also pay as you go: one monthly fee gives you full access to every
CodeSchool course. And if you ever need a breather, take a break; you can suspend your account at any
time. Don't worry, your account history, points, and badges will all be there when you're ready to pick things up again.
Get started on sharpening your skills today at Codeschool.com.
Once again, that's Codeschool.com.
So while we're talking about, I guess, the future of GitHub Archive, what are some of the ways that the community can step in? We always ask a question like, what's a call to arms for GitHub Archive
that doesn't require you to do every single thing?
Where can the community step in to help out on this project?
So I'd say two things.
One is just go and play with the data.
I think that's the best place to start
because if you get hooked on that data,
and I think it's pretty easy to get hooked
because there's so much of it
and you can analyze all kinds of stuff, then it just becomes much more interesting in the long run.
Like you start thinking of new things you could figure out based on this data. So that's one,
just start playing with the data. You can either grab the raw archives, the JSON stuff, and just
do something with it, or you can try the BigQuery approach. And the second one is if you are
interested in helping out, the project that I
mentioned where it's just about re-importing
the old data, that could
be interesting. And if that's something that
you want to help with, that'd be cool.
So is that still an in-progress
project then? The re-import?
You're talking about the tables, right? Breaking up one big table
into many smaller tables?
Right. Yeah. So is that still going on then?
Well, it's one of the things that, you know, on my to-do project list of 100 things that
I need to yak shave, it's there.
It's just a question of when will I get to it?
So is there currently an issue open with some, like, guide marks or guidelines for someone
to step in and help out there?
No, that's a good point.
There isn't.
And I will do that.
Yeah, I think that would be helpful, especially, you know, I like when project owners, you
know, if they have an ask like that, you know, if they can put something out there
because you're going to get a question anyway, someone will start the issue for you if you
don't.
So might as well give someone some guide rails to follow and that way people can step in.
So if you're wanting to hack on BigQuery or play with
breaking up these tables into smaller tables, then Ilya will give you some
help on making that happen.
Yeah, I'll do that. I'll definitely put something together.
So many yaks, so little time.
Pretty much, yes.
Is there a t-shirt for that? Because I like that.
There should be.
Yeah, that'd be a nice yak shave.
Go make a t-shirt about yak shaving.
Okay.
Another question that we like to ask at the end,
and I know you've been lacking your email lately,
but you're a guy who has his thumb on the pulse of open source.
I think your Twitter account is a good one that I follow,
just constantly kind of surfacing cool new projects.
So what are some projects, name one or a
couple that are on your radar that are exciting to you these days? These days? Well, so I guess the
main open source project that I spend probably most of my time on nowadays is Chromium. So that's
definitely something that's very interesting, exciting to me. And I keep learning new things
about it. And if you're not familiar, Chromium is the open source version
of the Chrome project, the Chrome browser.
So there's a Chromium browser which you can build on your own,
and then Chrome is kind of the repackaged version
that just adds the Google branding on top of it
and a few additional things.
So I spend most of my days working on that,
trying to figure out what are the things that we need in there
to make it faster, or what are the things that we need in there to make it faster
or what are the performance regressions, bugs, and so on and so forth.
So that's definitely been occupying a lot of my time.
And then others are actually HTTP2 related.
So as we mentioned at the beginning of the show,
HTTP2 is now officially a thing as of, I guess, yesterday.
And the big push, now that the spec is final and stable,
is to actually have servers support it. If you think about, let's say, the Ruby ecosystem,
there's actually not any server that I'm aware of that is HTTP2 compatible at this point.
So that's something that I'm thinking about actively right now. If you go to the HTTP2 wiki, if you just
search for HTTP2 on Google or use your other favorite search engine, you'll arrive at just
kind of a status page for HTTP2. If you click on implementations, there's a list of servers that
are already implemented. And a lot of those could use some help in terms of contributions,
testing them, compatibility with other browsers, and all the rest.
So if you're interested in that sort of thing,
that's definitely something that I would encourage others to play with.
Excellent. I think I'll just give you a plug as well,
because you won't take it yourself.
You also have a book out, High Performance Browser Networking. It's an O'Reilly book.
Looks like you can read the entire thing online for free
or buy it for a few bucks.
Definitely if you guys are interested in these types of things,
HTTP2, XHR improvements, server-sent events, that kind of stuff.
I haven't read the book myself,
but I've heard people singing its praises on the internets.
So just give that a shout out as well.
Does HTTP2 require any sort of update to this?
Actually, it does.
I do have a section in the book.
As you mentioned, the book is online for free,
so if you just go to hpbn.co,
you'll arrive at a page where you can flip through it.
There's an HTTP2 chapter in there,
but I wrote that chapter about a year ago.
And since then, there's been some protocol changes,
kind of like cosmetic changes,
that I need to go back and update.
But that's true of any tech book in general, right?
The moment you hit publish, it's already out of date.
Yeah.
We'll definitely link that up in the show notes
for people who are interested.
Remember, you mentioned HTTP2 as well,
that's like fresh off the press,
like as of basically last night. So,
you know, we don't expect you to
go update the book
today or anything. Well, I am actually hoping
to kind of get it done sooner rather than later,
because I think there's a lot of interest right now
in the community. It's like, what is this thing?
How do I make it work? What does it mean for the
servers? What does it mean for my website?
So this would be a good time to actually
have that out. So I'm hoping that
within the next, fingers crossed,
couple of weeks, unless I find other
yaks to shave, I'll have that up.
We'll have to have you come
back and talk HTTP2 because
I have a lot of questions about it and I'm sure
you've got a lot of answers. I think that'd be a good time.
Yeah, I'd be happy to.
Let's do it. Let's get on the books. Four weeks from now.
Bam.
Gone.
Well, it was definitely
fun having you on the call today.
We'll definitely enjoy
working with you on
keeping the emails current,
looking awesome on mobile and desktop, frequent, and exploring new
frontiers with that as well. So definitely excited about the future of working with you on that part
there. You mentioned githubarchive.org will have an update mentioning Changelog Nightly. If you're
going to subscribe, go to thechangelog.com slash nightly. You can subscribe
there. When you're listening to this, we should be shipping emails, so expect an email like the
next night. I think we're shipping, what, Jerod, at 10 o'clock Central time? Eastern time? Central,
so 10 p.m. Central, because Jerod and I live in Central and that's like the center of the world
to us, because it's Central, right? That's right, they call it the Central time zone for a reason; it's the center. Humorous arrogance there, that's what that is, humorous arrogance, that we're
going to ship at our time at 10. So if you're on the other side of the world, we can't help that,
so it'll be like 10 in the afternoon for you or something like that. So that's Changelog
Afternoonly. Afternoonly doesn't have the same ring to it. No, no. And if you're a fan of the show,
you know back in episode 141 we had a pretty decent announcement that I came on as a full-
time employee of this here fledgling company we're building called The Changelog. I once worked
for a non-profit full-time and stepped away to pursue the dreams of keeping up with open
source and serving the open source community. So this is now my full-time gig, and as part of that,
we ask our listeners to become members, supporting members of The Changelog. You can go to
thechangelog.com slash membership to learn more. But right now I'm going to rattle off a list of, I don't
know how many, but quite a few members that have stepped up in support of The Changelog.
Gabriel Solis.
Forgive me if I mispronounce a few names here because some of them do have nine letters like my last name or like Grigorik.
Jonathan Loweniski.
Sorry if I messed that one up.
Darcy Clark.
Mike Oliveri.
Todd Ward.
Colin Coghill. G Jensen. Magnus Endger, that's an awesome name. Benoit Tijonat, I believe, I definitely messed that
one up, that's French though, so I get a bye on French names. Charles Hicks. This one I can't
even pronounce: Penegoddess. I'm not even sure how to pronounce that one. Sorry about that.
David, can't see your last name either.
Steven Howes, Brett Weaver, and Jan Novak.
All these awesome people have stepped up to make sure The Changelog stays around and support us in going full time.
So if you want to do it too, thechangelog.com slash membership.
You got some awesome benefits there.
I won't tell you what they are now, but lots of cool stuff on that page there. Check it out. Ilya, thanks again
so much for coming on the show and working with us on Changelog Nightly. Definitely excited about
shipping that in the future. We've got some awesome sponsors, I think, to mention as well. Let's see who
those are: Codeship, Toptal, and CodeSchool. Awesome people.
So with that, let's say goodbye, everybody.
Goodbye.
Bye.