The Changelog: Software Development, Open Source - GitHub and Google on Public Datasets & Google BigQuery (Interview)
Episode Date: June 29, 2016
Arfon Smith from GitHub, and Felipe Hoffa & Will Curran from Google joined the show to talk about BigQuery — the big picture behind Google Cloud's push to host public datasets, the collaboration between the two companies to expand GitHub's public dataset, adding query capabilities that have never been possible before, example queries, and more!
Transcript
Welcome back everyone. This is the Changelog and I'm your host, Adam Stachowiak. This
is episode 209 and today Jared and I have an awesome show for you. We talked to GitHub
and Google about this new collaboration they have. We talked to Arfon Smith from GitHub, Felipe Hoffa from Google, and Will Curran from Google.
We talked about Google BigQuery,
the big picture behind Google Cloud's push to host public datasets for BigQuery
as the usable front end.
We talked about the collaboration
between Google and GitHub
to host GitHub's public dataset,
adding querying capabilities to GitHub's data
that's never been possible before.
We had three sponsors today,
Toptal, Linode, our cloud server of choice,
and FullStackFest.
Our first sponsor of the show is our friends at Toptal,
the best place to work as a freelance software developer.
If you're freelancing right now,
and you're looking for ways to work with top clients,
work on things that are challenging you,
interesting to you,
technologies you want to use, Toptal is definitely the place for you.
Top companies rely upon Toptal freelancers every single day for their most mission-critical projects.
And at Toptal, you'll be part of a worldwide community of engineers and designers.
They have a huge Slack community, very much like family.
You'll be able to travel, blog on the Toptal engineering blog and design blog, apply for open-source grants.
Head to Toptal.com to learn more. That's T-O-P-T-A-L dot com. Or email me,
Adam at changelog.com, if you prefer a more personal introduction to our friends at Toptal.
And now onto the show.
All right, we're back.
We've got a fun show here.
I mean, Jared, we've got some backstory to tell,
a little bit to kind of tee this up. So back in episode 144, we talked to Ilya Grigorik,
a huge friend of the show.
I mean, we've had Ilya on the show, I think, three times now.
Is that right?
I think that's right.
In fact, we were going to have him on this show as well.
We have three awesome guests,
and we figured we'd let them take the spotlight since they've been highly involved in this project as well as Ilya.
Right. So we got GitHub and Google coming together, Google Cloud specifically, along with Google BigQuery.
A fun announcement around GitHub's data sets, opening those up on BigQuery.
And we use BigQuery actually as sort of a byproduct of previous work from Ilya, which was GitHub Archive, and we worked with him to take over the email that was coming from that. Now we call that Changelog Nightly, so that's kind of interesting.
Yeah, in fact we had a brief hiccup in the transition, but one that we were happy to work around. What they've been doing behind the scenes is making GitHub Archive and the Google BigQuery access to GitHub lots more interesting.
We're going to hear all about that.
Absolutely.
So without further ado, we've got Felipe Hoffa, Arfon Smith, and Will Curran.
Felipe and Will are from Google and Arfon, as you may know, is from GitHub.
So fellows, welcome to the show.
Hi.
Thanks for having me.
Hello there.
Nice to be here.
So I guess maybe just for voices' sake and for the listeners' sake, since we have three
additional people on the show, it's always difficult to sort of navigate voices. Let's
take turns and intro you guys. I got you from top to bottom: Felipe, Arfon, Will. So we'll
go in that order. So, Felipe, kind of give us a brief rundown of who you are and what
you do at Google.
Hello there.
I'm Felipe Hoffa.
I'm a developer advocate specifically for Google Cloud.
And I do a lot of big data and a lot with BigQuery.
And Arfon, how about you, bud?
Yep.
So my name is Arfon Smith and I am GitHub's program manager for open source data.
So it's my job to think about ways in which we can be sort of more proactive about releasing data products to the world.
And this is what we're going to talk about today is a perfect example of that.
Awesome. And Will, how about you?
Yeah. Hi there. This is Will Curran.
I'm a program manager for Google Cloud Platform, and I'm specifically working on the cloud partner engineering team. So my role is in the big data space and storage space to help us do product integrations with different partners and organizations.
The show here in particular is obviously touching back on how we're using GitHub Archive, but then also
how you two are coming together to make public datasets around GitHub available, collecting these
datasets, showing them off. I'm assuming a lot of new API changes around BigQuery. Who wants to
help us share the story of what BigQuery is, catch us up on the idea of it, hosting datasets.
What's happening here?
What's this announcement about?
So we're going to start with what are we doing with GitHub
or what is BigQuery?
Let's start with the big picture, BigQuery.
Public datasets, Will.
This is a big initiative of yours at Google.
GitHub, one of those public datasets.
But give us the big context of what you all are up to
with these public datasets.
It started with Felipe.
He's been working for a while now with the community and different organizations to publish a variety of public data sets.
And we've got a lot of great feedback from both users and data providers, which is that they want more support for public data sets in terms of resourcing and attention
so that they can get more support
for not just hosting those data sets,
but for maintaining them,
which is our biggest challenge right now.
And so we developed a formal program at Google Cloud Platform
to launch a set of data sets
that Felipe had been working on for a while.
And we launched those at GCP Next earlier this year.
And so the program basically provides funds for data providers to host their
data on Google Cloud,
as well as the resources to maintain those datasets over time so that there's
current data.
And so the program allows us to host a much larger number of datasets
and bigger data sets.
And currently, we're focused on growing the available structured
data sets for BigQuery.
But then we'll start adding more binary data sets
to Google Cloud Storage.
As an example, like Landsat data would be a binary data
set that we're looking to onboard.
And then that brings us to this week's announcement
around our GitHub collaboration.
FELIPE HOFFA- I would love to highlight this
about BigQuery.
We can find open data all over the internet.
That is awesome.
But what's special about data shared on BigQuery
is that anyone can go and immediately analyze it.
Everywhere else, you have to start by downloading this data
or by using certain APIs that restrict what you can do.
When people share data on BigQuery,
like, for example, the GitHub archive
that Ilya has been sharing for all this time,
this data is available for immediate querying by anyone,
and you can query it in any way you want.
You can basically run full table scans
that run in seconds without you having to wait
hours or days to download and analyze the data at your home.
Kind of reminds me of the Martian
when the guy's like, hey, I need to do a big analysis
on the trajectory of the orbits and stuff like that.
If anybody's seen the Martian, he's like,
I need supercomputer access.
It seems kind of like supercomputer access to any data set, if that's what you want.
Exactly. Once we have a data set in BigQuery, anyone... you just need to log in. Everyone has a free terabyte every month to query, and has access to basically a supercomputer that's able to analyze terabytes of data in seconds, just for you.
You know, one of the things, and Jared, you can back me up on this, with piggybacking off of Ilya's work with GitHub Archive, and now Changelog Nightly: that email wouldn't be possible without BigQuery, because those queries happen so fast. It takes so much effort on a computer's part
to run those queries on that big a data set.
I mean, that's pretty interesting.
I like that.
Oh, yeah.
So Ilya was the one that started sharing data on BigQuery.
Like he told you in episode 144.
Right.
He was collecting all these files.
He was extracting from GitHub all the logs.
And BigQuery was opening up as from GitHub all the logs. And BigQuery was
opening up as a product at the time.
So he chose BigQuery to share
this dataset. And since
then, we've shared a lot more
datasets in BigQuery. All the
New York City taxi trips, Reddit
comments, Hacker News, etc.
You're able to analyze it. And now
what we're doing with Will is
grow this into a formal program
to get more data, to share more data,
to make it more awesome for everyone.
So those are interesting data sets.
Will, maybe give us a few more interesting ones.
Specifically, that would be cool
for developers and hackers to look at
and perhaps build things with.
Either ones that you guys have currently opened up
since our last show, which was February 2015,
quite a bit ago,
or things that you're hoping to open up
that would be interesting for developers?
One of the ones I like using myself
is the NOAA GSOD data.
I have a lot of interest around
climate change themes and topics.
And what I found interesting with that data set,
and Felipe had done
some great documentation on how to actually leverage that data. You can go right in there and instantly get, in a matter of seconds, the coldest temperatures recorded over time (they've been tracking it back since, I think, the 1920s) and the hottest ones. And immediately you can see the trends that everybody's talking about, where the past decade or so, yeah, we've hit a lot of record temperatures that have not been seen in previous decades. So it's kind of exciting just to be able to pick up a data set like that and validate a lot of the science and news that you're reading.
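For the curious, a sketch of that kind of query, assuming the public NOAA GSOD tables on BigQuery (the table and column names below are from memory and worth double-checking against the dataset itself):

    -- Coldest recorded daily mean temperatures in one year's table (legacy SQL sketch)
    SELECT stn, year, mo, da, temp
    FROM [bigquery-public-data:noaa_gsod.gsod1929]
    WHERE temp < 9999.9  -- 9999.9 is GSOD's missing-value marker
    ORDER BY temp ASC
    LIMIT 10;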
That is interesting. So I guess what I was going to say is: how do you go ahead and get started with that? But maybe we'll save that for near the end of the conversation, once everybody's appetites are sufficiently whetted. Let's talk about the subject at hand, which is this new GitHub data. Since Ilya set up GitHub Archive back in the day,
we've had some GitHub data,
which was specifically around the event's history
and issue comments and whatnot.
But y'all have been working hard behind the scenes,
both Google and GitHub together,
to make it a lot more useful.
So maybe, Arvon, let's hear from you the big news
that you guys are happy to announce today.
So, yeah, as you kind of will be well aware
with the existing GitHub archive,
the GitHub API spews out all these events,
like hundreds and hundreds per second
of public records of things happening on GitHub.
So things like when people push code,
when people star a repo, when orgs are created,
all these kinds of things that happen. And these are just JSON kind of blobs that come out
of the API. And so the GitHub archive has been collecting those for about five years now.
But what we're adding to that is the actual content that these events describe.
So if you had a push event in the past,
so somebody pushing code up to GitHub,
you had to go back to the GitHub API
to go and get the file, for example, a copy of what
was actually changed.
And so what we're actually adding to BigQuery
is a few more tables.
But these tables are really, really big.
So we've got a table full of commits.
So every commit now has the full message that people were,
you know, for the author, the files modified,
the files removed, all the information about the commit
and the source repo.
So that's about 145 million rows. It's
probably more now, it's probably upwards of over 150 million. We've got another table, which has
all the file content. So, you know, all of these projects on GitHub that have an open source
license, the license allows third parties to take a copy of this code and go off and do things with it. That's one of the great things about open source. So there's now a copy of these files in BigQuery tables. And this is the big one: this is about three terabytes of raw data that has the full file contents of the objects that were touched in the repositories on GitHub.
And I'm sure we'll dive into some of the possibilities of what you can do with that.
And then in addition, there's another table,
which basically has a full mapping of all of the files at Git HEAD in the repository: a mapping of all the files and all their paths, joining them to the file content. And there's about two billion of those file paths. So basically we've got this vast network of files, commits, and now also the contents of those files sitting ready to query in BigQuery.
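To make the shape of those tables concrete, here's a minimal sketch joining file paths to file contents, using the smaller sample tables of the published bigquery-public-data:github_repos dataset so the scan stays cheap:

    -- Join the file-path mapping to the actual file contents (legacy SQL)
    SELECT f.repo_name, f.path, c.size
    FROM [bigquery-public-data:github_repos.sample_files] f
    JOIN [bigquery-public-data:github_repos.sample_contents] c
      ON f.id = c.id
    LIMIT 10;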
So I think we're upwards of about a three-terabyte data set here.
And it's the biggest data release that we've ever made.
That's awesome. It sounds like a lot of work.
I'm just sitting here thinking, man, it's a lot of work
even describing it.
I'm sure both sides have put a lot of effort in this.
Can you describe the partnership,
the way you've worked together, the two companies,
and from your perspective,
what all went into making this happen?
Sure, so I'll start,
but I'm sure there's more detail to come
from Felipe as well on this.
So unsung hero of today's call is,
well, two really, Ilya, of course,
but a guy called Sean Pierce,
who works in
the open source office at Google. And so, you know, the desire for data from GitHub is a general request we get from large companies who are doing a lot of open source. So we get that from, you know, Google regularly pulling data to analyze their own open source projects on GitHub.
And so Sean had actually done some early work exploring this,
pulling these commits into BigQuery.
He'd started to kind of build out a pipeline
to help monitor their own open source projects.
But we have pretty good regular conversations with him and the team he's in.
And so I think it just came up in one kind of conversation
back in February.
He was like, hey, by the way, I've been working on this thing.
We have this public data set program that's growing,
and this would make a great data set to have available in BigQuery.
What do you think?
And we jumped at the chance to get involved.
And so it's been a few months in development
to make sure that we're getting,
pipelines are all working,
but most of the kind of lion's share of the work
has been done by Sean on the data pipeline,
which I think runs every week to update this.
But Felipe, can you remind us if that's the case? Yes.
At least today it's set up to run every week. So this snapshot
will be updated every week with the latest
files,
details in GitHub.
I have a quick story
about the partnership.
When I was first approached with this
it was Sean, and I got introduced to Arfon. And one of the first questions I asked when I talked to a data provider about, you know,
is this going to be useful or whatever, given the backlog we have, is: can you send me a sample query that shows how this will be useful to users? And one of the first queries that Arfon sent was the number of times "this should never happen" appears.
And I knew it was going to be fun just working with this data.
And I've just run the query actually after our last load here.
And we're not quite at a million times yet,
but we're getting close.
What do you mean by "that shouldn't have happened"?
That's the number of times in this data set that someone has committed a comment that says "this should never happen".
Ah,
gotcha.
So does it say it in the commit message, or is it actually in the code comments,
in the code?
In the code.
Yeah.
Yeah.
It's like, you know, rescuing every error you can possibly imagine. "This will never happen. This should never happen."
We're almost at a million.
Right, right.
And so you're like, yeah, okay.
But it's in there.
Yeah, there was a thing on Hacker News
a few months ago where this kind of came up; I think somebody demonstrated that. I think they did a search on the GitHub side,
just on our standard search to say, you know, let's see how many times something should never happen.
Now you can do this with kind of looking at particular language types as well, right?
And segment, do much more powerful searches.
So that's one of the things that's kind of fun about the data.
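A query in that spirit, sketched against the sample contents table (a full run would target the complete contents table and scan far more data):

    -- Count sampled files containing the fateful phrase (legacy SQL)
    SELECT COUNT(*) AS hits
    FROM [bigquery-public-data:github_repos.sample_contents]
    WHERE content CONTAINS 'This should never happen';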
That's a great use case.
And I think what I'm excited about this is, especially getting it out to our audience and to the whole developer community, is all these new opportunities and use cases and things that we collectively couldn't know previously.
And we can start to know by people asking different questions that I wouldn't have thought of or you wouldn't have thought of.
So we're going to take a quick break, but on the other side,
what we want to know is like,
what all does this open up?
Obviously there's things that we haven't thought of yet,
but what's the low hanging fruit
that's cool that you can do now?
You can ask these questions now
and you can get answers
that you couldn't previously get.
So I'll just tee that up
and we get on the other side of the break.
We'll talk about it.
Linode is our cloud server of choice. Get up and running in seconds with your choice of Linux distro, resources, and node location. SSD storage, 40 gigabit network, Intel E5 processors. Use the promo code changelog20 for a $20 credit, two months free. One of the fastest, most efficient SSD cloud servers is what we're
building our new CMS on. We love Linode. We think you'll love them too. Again, use the code
changelog20 for $20 credit. Head to linode.com slash changelog to get started.
All right, we're back with quite a crew here talking about big data, Google BigQuery, GitHub, fun stuff.
But in the wings, when we take these breaks,
we often have side conversations,
and it had just occurred to us that everyone on this call
is in a unique place.
Like, for example, Felipe, you're up in the YouTube studios
in New York because you're at a conference up there.
And Arfon, you're in a truck outside of a Starbucks in Canada while you're digital
nomading with your family in your travel trailer, and you've got a super fast internet connection.
And Will, you're where you should be: you're in Seattle, in your home office there, in a Google
studio there. So it's kind of interesting. So Arfon, what's unique about, you know, where you're
at right now, I guess?
Well, it's... well,
the speed of the internet is remarkable. I'm, yeah, as I say, outside a Starbucks with
about a 100 megabit connection, so that's pretty great.
That's unheard of. Yeah, so I can report
that the Canadians have better Starbucks Wi-Fi than the Chicagoans, which is where I've lived for
the last four years. So what else is unique? It's lovely and sunny, but I've only been in Canada for three days,
so I have no idea if it's regularly sunny here.
But, yeah, it's really nice.
And the good thing for us with this scenario for you is that we get to capitalize
on a great recording because you sound great.
Your internet's doing great.
We don't have any glitches whatsoever.
So thanks, Starbucks, for super-fast internet connections in Canada.
Appreciate that. That's sponsored by Starbucks. We probably can't say that, right? We'll have to reach out to their PR department or their marketing department to send them a bill for this show or something like that.
But onto the more fun stuff, though. So Jared teed this up before we went into the break, but big story here: Google BigQuery has been out there, we're aware of it, but now we're able to do more things than we've ever been able to do before.
So let's dive into some of these things.
What are some things you could do now with this partnership,
with this new data set being available there,
the four terabytes or three terabytes of GitHub public data being there?
What can you do now that you couldn't do before?
The beauty is that anyone can do it.
So it's not just me, but anyone's open data.
But just having access,
being able to see
two billion files
to be able to analyze them
at the same time,
it's really, really awesome.
For example,
let's say you are the author
of a popular open source library.
You can go and find
every project that uses it
and not only that they are using it,
but how they're using it.
So you can go and see exactly
what patterns are people using,
what are they doing wrong,
where they are getting stuck.
And you can base your decisions
on the actual code that people are writing.
Yeah, I think the kind of insight
into how software that maybe you maintain
is being used, I think,
is one of the most powerful ones I can think of here.
Because, you know, for example,
say you're wanting to make a breaking change to your API.
Actually, one of the projects I maintain
on behalf of GitHub, the project called Linguist:
we want to change one of the core methods
actually the one that detects the language of the file
we want to change its name and we want to re-architect some of the library
and we know it's a breaking change to the API
and we've had deprecation warnings out for 12 months
but honestly being able to run a query
that sees how many times people are actually using
that API method still helps me as a maintainer understand the downstream impact of my changes.
And currently, that's just not been possible before.
And of course, you can't see what's going on in private source code.
But a lot of this stuff is in open source repos as well.
So being able to have that kind of drill down into source code,
all of the open source code that's on GitHub.
And I mean, for me, the other kind of killer feature is like to be able to do this,
you want to write a regular expression of some kind, right?
And so being able to run regex across four terabytes of data
or three terabytes of data, we should actually figure out what the exact number is.
It increases daily, of course.
But, you know, being able to run a regex against all that data is incredibly powerful
and something that's just not been possible before.
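As a sketch of that kind of regex search (the method name here is hypothetical, not the actual Linguist deprecation):

    -- Ruby files whose source matches a deprecated-call pattern (legacy SQL)
    SELECT sample_repo_name, sample_path
    FROM [bigquery-public-data:github_repos.sample_contents]
    WHERE sample_path LIKE '%.rb'
      AND REGEXP_MATCH(content, r'Linguist\.detect')  -- hypothetical method name
    LIMIT 10;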
Well, a while back we had Daniel Stenberg on the show. He's the author of curl and libcurl, of course, and we asked him at that time: how do you know who your users are? How do you speak to your users and ask them things? And really, he said, I have no idea. I mean, first of all, curl's so popular that it's kind of like SQLite; the world is his users. But he didn't really know how people were using his library. But with something like this,
like you said, it's only the public repos, of course, we wouldn't want to expose the private
repos, big data. But he can actually just go to BigQuery and look for how many people are
including libcurl, right, linking to it in their open source. And not just that, but he can also,
like you said, look at very specific
method signatures
or how they're using it, and
he can get insights. Now,
it's not 100% the truth, because
like we said, he's got way more users than just
open source, but it's at least
a proxy for reality.
Is that fair to say?
I mean, and there's fun things you can do as well. Like, we were sharing some example queries that we've authored as a group,
but of course, you know,
there's unlimited possibilities here.
But, you know, you can also, you know,
look at, you know, most common emojis used
in commit messages and silly stuff like that.
But, you know, so there's less serious things
you can do as well that would also be,
currently be pretty difficult.
But yeah, yeah. I mean, being able to drill down and understand how people are using stuff is extremely, extremely important to many people.
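Going back to the curl example, a hedged sketch of what that usage lookup could look like (sample tables again; only the header name is the real one):

    -- Roughly: how many sampled repos have C files including the curl header?
    SELECT COUNT(DISTINCT sample_repo_name) AS repos
    FROM [bigquery-public-data:github_repos.sample_contents]
    WHERE sample_path LIKE '%.c'
      AND content CONTAINS '#include <curl/curl.h>';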
Actually, one use case that's kind of near and dear to my heart.
I mean, everyone's interested if people are using their stuff, but some people actually have to report that, right? Because maybe one particular use case
that I'm very familiar with is like
people who've received funding to develop software.
So maybe academic researchers who develop code,
you know, they'll have funding maybe
from the National Science Foundation.
And the only thing that matters really to the NSF
is how many people,
like what was the impact of that software?
And, you know, it's really hard to answer that question.
Like how many people are using your stuff?
You can maybe say, oh, well, it's got 400 forks.
Now, you know, I would say anything that's been forked 400 times is pretty popular, but that doesn't actually mean it's being used.
It's a kind of weak signal of usage. Whereas an actual, like, "I can give you the URL of every downstream consumer of my software, and it's been used by, you know, 50 different universities" or whatever. Being able to give people the opportunity to actually report usage is interesting and fun for lots of people,
but actually mission critical for many people as well.
And so we get a lot of requests at GitHub
from specifically researchers
who are trying to demonstrate
how much their stuff is being used.
It's been really hard to service those requests in the past,
but I think we're going to be in a much better position
to do that now.
Another interesting use case, Felipe,
maybe you can speak to this one.
Probably exciting both for white hats and black hats alike
is an
easy way of finding who
and what exactly is using
code that's vulnerable to attack.
Can you speak to that? Yes.
So I'm super excited
about that. Security-wise, if you are able to find and fix a problem in your source code,
that's cool.
But if you're able to find the same pattern,
the same buggy code
or potential vulnerabilities,
with a query,
you will be able to find it
all around GitHub's open source code
and just send patches,
contact the project owners,
open an issue.
But now you're able to do this.
And things get really, really crazy.
So the kind of things you can do.
Like with SQL,
with BigQuery you can write SQL. SQL is powerful, but you can only do some limited amount of operations. You can write regular
expressions. But with BigQuery, we also open the space up to user-defined functions written in JavaScript. For example, there is this JavaScript study code analyzer
called JSHint.
And I'm running it now inside BigQuery
just to analyze other JavaScript code
and see, for example,
find all of the unused variables.
Like, you cannot do that
with a regular expression.
If you try to run this
in your own computer,
it would take hours, days.
But with BigQuery,
you're able to just
actually analyze
the flow of the code.
Are there unused variables?
How are the libraries being used?
So, yeah, it gets really crazy.
I'm getting now to
maybe the boundaries of what we can do with BigQuery, but
I'm really looking forward to what
people will build upon this
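To make the UDF idea concrete, here's a toy sketch using BigQuery's standard SQL JavaScript UDFs; the check below is a trivial stand-in, not JSHint itself:

    -- Toy JS UDF: flag files whose first declared variable never reappears
    CREATE TEMPORARY FUNCTION hasUnusedVar(content STRING)
    RETURNS BOOL
    LANGUAGE js AS r"""
      var m = /var\s+([A-Za-z_$][\w$]*)/.exec(content || '');
      if (!m) return false;
      // If the name occurs only once, it was declared and never used again.
      return content.split(m[1]).length <= 2;
    """;

    SELECT sample_repo_name, sample_path
    FROM `bigquery-public-data.github_repos.sample_contents`
    WHERE sample_path LIKE '%.js' AND hasUnusedVar(content)
    LIMIT 10;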
Let's focus on the security aspect once again, with regards to the black hats. So a naysayer of this type of available data is that now you have a zero-day come out, or, well, let's just call it, yeah, a zero-day is released. And now this enables, you know, whether it's a script kiddie or somebody who's more capable, to go out and not just, you know, fuzz the entire internet for vulnerable things, but to actually know exactly, you know, what line of code in a particular project is taking this input.
And so while people can go out and do pull requests, people can also go out and hack each other.
Do you have concerns about that?
Well, I believe in humanity on one side.
I think there are more good people than bad people.
And usually people, when they are attacking,
they are more focused on particular projects.
On the defense side,
here we're giving the ability to people
that want to make projects stronger.
We're giving them the ability to identify everywhere
where their potential problems are
and harden these open source projects.
That's one of the beauties of open source.
Yes, it makes problems more visible, but by making them visible, you have more eyes looking
at them.
Now, with having all this source code visible in BigQuery, for people that want to look for problems, we are giving them the tool to find them easily and fix them.
Yeah, I mean,
I look at it very much
like it's a tool.
You know, you can use a tool for bad.
You can use it for good.
And if anything,
what this does is it
it ups the ante
or it speeds up the game,
so to speak.
And so both sides can use it.
I would imagine
if you think about believing
in humanity, the good people,
it just takes one person to go out there and write a
program that
can use this
data set, query BigQuery for a specific
string
of code, and
automatically find that across all the repos on GitHub and
open a pull request just notifying them of the vulnerability.
In moments without any user interaction, I think we'll see stuff like that start to pop
up which is pretty exciting.
Exactly.
Like, we always tell people within open source that more eyes means more secure code.
And that benefits a lot of open source projects.
But if you have a very obscure open source project, maybe no one will look at it.
Maybe no one will be looking out to harden your code.
But this gives a lot more people the ability to look into your small project, because they will be just looking everywhere.
Well, just think about it now. Right now we have not so much no eyeballs, but very few eyeballs, because the process to gain such knowledge is difficult. Whereas, you know, with this partnership, this data set available on BigQuery, and all the good stuff, now people have a much easier way to find these insights. And then obviously, you know, knowledge is power.
So in this case, you know, I'm on Felipe's side, Jared. I'm kind of... I'm not the naysayer, so to speak. I'm like, do it, you know. Because I think about, like, in the show that's going to come out after this, we talked to Peter Hedenskog of Wikipedia about sitespeed.io.
We talked a lot about automating reporting of site performance.
And this is similar to your point, Jared, where you said, you know, could we automate
some things where a pull request is automatically opened up?
I think about the automation tools that may be able to take place on the security side to say, okay, here's a vulnerability.
It also opens up another topic I want to bring up, which is not just the GitHub data store, but other data stores or code stores like Bitbucket or GitLab having similar data sets on BigQuery and how that might open up insights to all stores and to all the major stores.
But long story short,
automating those kinds of things
to the open source out there,
that's an interesting topic to me.
So I was going to say,
one actually, if you,
a fun experiment,
is actually don't do this.
I'm not recommending this.
But if you commit a personal access token
from your GitHub profile into a public repo,
from your GitHub profile into a public repo,
you'll get an email from us within about a second
saying we disabled that for you
because you probably didn't want to do that.
Wow.
So I think there's actually like scanning
and making open source more secure
is something that we care a lot about.
We think that's in everybody's interest.
We think software is best when it's open.
And so, but we've all committed stuff accidentally
and had to rewrite history.
And it's just humans are humans.
And so thinking about the things, you know, the tools that we can build to help people stay safe and help their applications stay safe, I think is really, really important.
And so we do that currently for GitHub tokens,
but you could imagine, you know,
I should probably want the same level of service if I commit, you know, I don't know, an Amazon token or a Google Cloud token or whatever
it is, something that exposes me, you know, that's the kind of generically interesting
area to work on.
And so I think, yeah, I think more eyes on open source is, I think, showing how data can be used to make people more secure,
I think will help.
I think this just helps sort of accelerate the progress
of improvements to things like GitHub by making data more open.
One facet of this that we definitely should mention
is that the data set that's provided is not real time. And so when
we talk about zero days or like code that's currently vulnerable, you do have a lag time
between when that snapshot is created. Now, previously you had told us it was two weeks
and now Felipe is telling us is one week. So apparently y'all have gotten better at this
since we even talked last. 50%. Yeah. So that's nice. I mean, I'm curious if there's ever a goal
to make that a nightly thing
or if a week is good enough, or what your thoughts are on that.
I mean, I would love to see... you know, I think an obvious thing to do with big archives of data is to improve the frequency at which they're being refreshed. I would love to see these things get more and more close to live.
Yeah, so I mean,
I think it's how often the job runs.
I think the job takes about
20 hours to run currently.
So we're going to hit a limit
of how quickly the pipeline can run,
but maybe it can be parallelized further.
I don't know, Felipe,
do you recall how long it takes
to do this big import right now?
FELIPE HOFFA- What I can say
is things can only get better.
It's amazing how things just improve while I'm not looking.
It's our current bottleneck
in data warehousing and analytics.
And so you can expect
that all cloud providers are going to be optimizing for that and getting as close to real time as
possible. What does it take, I guess, can someone walk us through the process of, you know, capturing
the data set, whether it dumps down to a file, what's the process? Maybe even Arfon, on your side:
like what inside of GitHub had to change to support this?
Like what new software had to be built?
But walk us through the process of the data becoming available and then actually moving into BigQuery.
What's that process like?
Kind of walk us through all the steps.
So from GitHub's side, actually very little changed.
And I'm probably not the best person to talk to about the process of actually doing the data capture. I mean, we regularly increase API limits for large API customers.
And so I think we did that. But Felipe, do you have more detail on this?
Yeah, let me make a parallel with the story Ilya told you when he was back here February last year.
First, he started looking at the GitHub's public API.
He started logging all of these log messages. And once he had
these files, he had to find a place, one to store them, to analyze
them, and to share them. And the answer was BigQuery.
Now in 2016, we had a similar problem,
just bigger. It starts by taking a mirror of GitHub, using their public API, looking at GitHub's change history. Once you start mirroring this, you have a lot of files. And then the
question becomes, where do I store them?
Where can I analyze them?
Where can I share them with other people?
And that's where Sean Pierce is a superstar: he writes this pipeline to take our mirror of GitHub
and then put it inside BigQuery as relational tables.
That's basically the Google magic in summary.
But yeah, it takes a lot of MapReduces and doing things at Google scale to be able to just say: oh yes, I made a mirror of all of GitHub.
I guess the thing
I'm trying to figure out
is what makes
it take a week?
What's the latency in terms of capturing to querying inside of BigQuery? That's what I'm trying to figure out.
What's the process to get it there?
That's a good story there, but why does it take a week?
Yeah, I think it might take closer to a day.
But it's all about how many machines you have to do this.
You want faster results, you just keep adding machines to it.
Then it becomes a question of how much quota
do you have inside Google versus other projects.
And I hate to keep further compressing
the time, like we're just making changes right now.
But I think we're down to six hours.
Really?
Oh, nice.
What?
The pipeline.
So we had a conversation a week ago,
basically, to tee up this conversation.
It was two weeks then.
Then we thought it was a week today, and now it's six hours.
By the time this show ends, it's going to be real time.
Yeah.
Good job, Will.
Felipe is actually coding right now as we talk.
Sean is a star, but it's all about getting more machine resources for the project, and the more people use this data set, the more important it becomes, and we'll start putting more resources on it. I'm really, really looking forward to what the community will do with this data and the tools that we develop over BigQuery to be able to just analyze the data in place.
So I have a good example, I think, of a question that's currently pretty much impossible
to answer without this data set, if you're interested.
Absolutely.
So I was talking to a researcher about six months ago,
and he was trying to answer the question.
So if you kind of read a 101, a let's-get-started-in-open-source guide:
How do you create a successful open source project?
People will tell you it's very important
that you have good documentation, right?
Like you want to have your API documented.
You want to have a good README.
And he was like, you know what?
I've used software where the documentation is really poor,
but it's still really popular.
And over time I've seen the documentation improve.
So his question was,
is documentation a crucial component
of a project becoming successful,
becoming widely used?
And so to answer that question,
you kind of need a timeline of every commit on the project.
You probably want to know the file path,
what was in the file.
Let's say documentation in GitHub's world
is Markdown, AsciiDoc, reStructuredText.
Even just those three extensions
would probably represent about 95% of all documentation.
And so you can look at what's code and what's docs,
but you can't do that query today. As an individual, you would have to go and pull down, you'd have to git clone, you know, thousands, maybe hundreds of thousands of repos from GitHub, store them locally, then write something that would allow you to programmatically go through all these Git repos, building up all these histories. These histories are now in BigQuery.
So I'm not saying I know exactly how to write that query,
but the data's there, right?
It's possible now to answer this question.
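One building block of that analysis is already easy to sketch: classifying paths as docs versus code (against the sample files table; the extension list mirrors the ones Arfon mentions, and the time dimension would come from joining in the commits table):

    -- Share of files that look like documentation, per repo (legacy SQL)
    SELECT repo_name,
           SUM(IF(REGEXP_MATCH(path, r'\.(md|markdown|adoc|asciidoc|rst)$'), 1, 0)) AS doc_files,
           COUNT(*) AS all_files
    FROM [bigquery-public-data:github_repos.sample_files]
    GROUP BY repo_name
    ORDER BY all_files DESC
    LIMIT 20;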
And I think one of the most exciting things for me
about this data set is, you know,
I think there is still a huge amount to be learned
about how people build software best together.
And I think that's not something that necessarily, you know, the really hard questions I think
are often best answered by people like the, you know, the computational social sciences,
people who study like how people collaborate and they need really, really big data sets
to do these studies. And to date, it's just not really been realistic; you know, GitHub's API is just not designed
to serve those kinds of requests.
It's designed for building applications against.
And so I think we're going to see, you know,
I think we're going to see a huge, huge kind of uptick
in the amount of really data-intensive research
around collaboration and about open source software
and about how people best work together and powered by this data set.
Yeah, that's very exciting.
And as people who are very much invested in watching the open source community do their thing
and tracking it over time.
I'm excited about all the possibilities that are going to be opened up. And I even think of just when GitHub Archive came out, and all of a sudden we started having cool visualizations and charts and graphs, and people putting answers together that we didn't know we could ask questions about. And now we have so much more. Super awesome.
What I think we're going to tee up for our next section is BigQuery itself, because it does seem like a little bit of a
black box from the outside. Like, how do you use it? How do you get started? How long do the queries
take? You know, there's a free tier, there's a paid tier. I'd like to unpack that so that everybody
who's excited about this new data set can, at the end of the show, go start using it and check it out.
So we'll talk about that when we get back.
Our friends at Fullstack Fest are putting on a week-long Fullstack Development Conference
in Barcelona, September 5th through 9th.
The focus of this conference is solving current problems with new and inspiring perspectives.
Head to fullstackfest.com to learn more.
It's a wide range of topics any full-stack developer would enjoy.
Erlang and Elixir, Reactive Programming, HTTP2, GraphQL, NLP-backed bots, Docker, IPFS,
Distributed File System, Serverless Architecture, Unikernels, Elm Architecture,
The Future of JavaScript, ES6, ES7, CSS4, Relay vs. Falcor, Angular, CycleJS, CSP channels, handling interplanetary latencies, WebAssembly, mixing React with 3D, virtual reality, and the physical web.
It's a full stack fest.
Early bird tickets are available until July 15th.
That's coming up soon. At the end of that day, they will no longer be available. Speakers from all over the world, from companies like Twitter, Netflix, Microsoft, Erlang, Shopify. And to top it all off, enjoy a week in sunny Barcelona. Tickets are available for the whole week, only the back-end days, or only the front-end days. So you have your choice to kind of go all week, or go to back-end or front-end days.
And for our listeners, save 75 euros before tax after July 15th if you miss this early bird price.
Use the discount code, the changelog.
Once again, head to fullstackfest.com.
All right, we are back talking about BigQuery,
GitHub, public datasets, all that fun stuff.
Felipe, tell us about BigQuery.
How do you use it?
So BigQuery is a hosted tool by Google Cloud.
So you just go to bigquery.cloud.google.com.
And basically it's there open,
ready for you to use to analyze
any of the open data sets
or to put your own data.
Just in case you're wondering
if it's only for open data, nope.
You can also load your private data
and it's absolutely secure, private, etc.
But with open data, you can just land there and start querying.
Now, you will probably need to have a Google Cloud account.
So if you don't have one, you'll need to follow the process there
to create and open your Google Cloud account.
But then you will be able to use BigQuery to analyze data
and everyone can analyze up to a terabyte every month
without needing a credit card or anything.
So you can choose which data set to start with.
I wrote a guide about how to query the Wikipedia logs.
Those are pretty fun.
But in this case, if we want to analyze GitHub,
we can go to the GitHub tables
to find some interesting queries.
We have the announcement on the GitHub blog,
on the Google Cloud Big Data blog.
I'm writing a Medium post
where I'm collecting all of the other articles
I'm finding around.
So you will want some queries to start with.
And then the question is: what questions do you want to ask? You have these tables that Arfon described at the beginning. One of the most interesting tables is the one with all of the contents of GitHub. So this has all of GitHub's open source files that are not binary and are less than one megabyte, and that table has around 1.7 terabytes of data. And that's a lot, especially if you're using your free quota. If you query that table directly, your free quota will be out immediately. So thinking of that, we created at first a sample table with all that, much smaller. Let me check the size right now; I'll tell you the exact size in a minute.
The thing is you can go to this table
and you can run the same queries
you would run on the full table,
but your allowance,
your monthly terabyte will last much longer.
You can choose to run all your analysis there on the sample
and then bring it back to the mega table.
But it all depends what questions you are asking.
And I also created, this is outside the main project,
but in my private space that I'm sharing,
I created an extract of all of the JavaScript files,
all of the PHP files, Python, Ruby, Java, Go.
So if you're interested in analyzing Java code,
you might be better off starting from my table.
And then you can start asking the questions you might have, or at least start with one of these sample queries.
A couple of things.
Let me interject here.
So all of these things
that Felipe is referencing,
we will have linked up in the show notes.
So if you're listening along
and have the show notes there,
pop them open.
We'll have example queries
and all the posts,
both from GitHub and Google, published around this. So that's probably a good place to
go. You mentioned your monthly allotment or your threshold. I can't remember the exact word, but
your quota. Yes. Let's talk about that. So BigQuery is free up to a certain point,
and then you start paying. And the reason for this example data set, which is smaller, is because if you're just
going to run test queries against the whole GitHub repos data set, you're going to hit
up against that pretty soon.
Can you talk about that?
I think there's some... even as a user, like, we have Changelog Nightly going, and have for
a couple of years now.
We've never gotten charged, so I guess we're inside of our quota,
but I don't have much of an insight
into what all we're doing.
How does the payment work in the quota?
Is it based on how much data you've processed?
Exactly.
So BigQuery is always on,
at least compared to any other tool.
You don't need to think about how many CPUs
or how much RAM
or how many hours
you're running it.
It's just on always.
And then the way
it charges you
is by how much data
you're querying.
So it looks at the tables
you're querying,
specifically at the columns
you're querying
and the size of those columns.
And that's basically the price of a query.
So if a column is one gig or something like that,
or, you know, half a terabyte,
then you're essentially being charged a query
at half a terabyte.
Exactly.
So today the price of a query is $5 per terabyte queried.
So if a column is one gigabyte,
divide $5 by a thousand,
and that's the cost of your query.
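As a concrete illustration of that column-based pricing, a query like this sketch touches only the small path column of the files table, never the multi-terabyte content column, so it costs a tiny fraction of a full contents scan:

    -- Cheap: scans just the path column (legacy SQL)
    SELECT COUNT(*) AS go_files
    FROM [bigquery-public-data:github_repos.files]
    WHERE RIGHT(path, 3) = '.go';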
So assume I'm right.
I got my question asked.
I've used the GitHub examples, the data set or the subset, for my development.
And I have a query here.
In fact, from some of your guys' examples, here's one.
Let's say it's the how-many-times-"this should never happen" one that Will talked about earlier, which it appears pulls from github_repos.sample_files and joins github_repos.sample_contents.
So every time I actually run that in production,
it's going to add up the size of those two particular things and then charge me once per time I hit BigQuery.
Is that right?
Exactly.
Every time you write a query.
In fact, when you write a query before running it, you can see how much data that query will process.
Oh, that's handy.
Yeah, because basically it's a static analysis.
You have the columns you mentioned
from the tables you mentioned,
and then the query knows basically the exact price.
I'm just thinking outside the box
because you all have AdSense and the way people buy ads, that you may actually have a bidding war at some point. Or not so much a bidding war, but you might be able to have something where I want to query these things several times a month, but I have a budget, and I'll query them if it's under this budget. And you might be able to do those queries if said budget is not met. That seems like something in the near future,
especially as we talk about automation around this.
Yeah, so the idea here is to make pricing very, very simple.
If you're able to know the price of your query
before running it,
then you can choose to run it or not.
And it's essentially about,
instead of querying the whole dataset,
instead of querying the full contents table
with that 1.7 terabytes,
let's just query all of the Java files.
So, if someone has not created the extract you need,
maybe the best first step on your analysis
is extracting the data that you want to analyze.
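A sketch of that first extraction step: run it once, write the result to your own destination table in the BigQuery UI, then iterate against the much smaller extract (note this initial pass still scans the big contents table once):

    -- One-time extract of Go files into your own table (legacy SQL)
    SELECT f.repo_name, f.path, c.content
    FROM [bigquery-public-data:github_repos.files] f
    JOIN EACH [bigquery-public-data:github_repos.contents] c
      ON f.id = c.id
    WHERE RIGHT(f.path, 3) = '.go';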
Do you feel like you'll have any pushback at all for, I guess, a higher free threshold for open data sets? Because there's always this sort of push, or this angst, I guess, where if you're doing something for the good of open source, or something that's free for the world, or just analysis, someone is always like: hey, can you make this thing free for open source? Since this show is specifically about this partnership and the GitHub public data set being available, what are your thoughts on the pushback you might get from the listeners who are like: this is awesome, I want to do it, and can I have a higher limit?
So at least what makes me pretty happy is that we are able to offer this monthly quota to everyone.
It doesn't stop after a while; it's not just for the first 10 days. Every month, you will be getting this terabyte back to run analysis. And that's pretty cool on one side. And then, well, if you want to consume a lot of resources, instead of having to wait one month, at least you have the freedom to pay to have even more resources available.
And just to set context,
because I agree like,
you know, in cloud,
we're continually getting feedback
and then just based on competition
to reduce pricing
and make things more optimized
and efficient and cost-effective.
And so where we were just a moment ago, really, without BigQuery, is that in order
to do analysis on any data set, you would have to go find that data.
You would have to download that data and possibly pay some sort of egress.
You'd have to upload it into your own storage on whatever cloud provider you're using.
And there's a cost there.
And then you'd have the consumption for doing any query on it.
So it's a valid question.
But right now, we've already reduced the cost for public users.
And I fully expect that, yeah, people will be asking for more higher limits on querying the data.
And I just expect we'll continue moving and making things cheaper and more efficient for users.
I think the steps you just mentioned there that, you know, just for one, telling people, you know, this is what it actually takes to do this without BigQuery.
And now BigQuery is here.
We've taken so many steps out of the equation.
You've obviously got Google Cloud
behind it, the supercomputer that we talked about in part one of the show, basically,
having access to that. And I think just sort of helping the general public who is going to have
clear interest in this, especially listening to this show, like everyone who listens to this show
is either a developer or is an aspiring developer. So they're listening to this show with
a developer's mind, so to speak.
And so they're thinking, yeah, if I want to use this, how can I use it?
But knowing the steps and knowing the pieces behind the scenes
to make this all possible definitely helps connect the dots for us.
And it's really what's great about working at Google
is this is really in our core mission.
I mean, Google's core mission is to organize the world's information and make it universally
available.
And then so for the public data program, this is a natural extension of that mission within
the cloud organization.
And I see these public data sets plus tools like BigQuery is, and I know this word gets
overused, but it's, you know, democratizing information even further.
You know, we've all been these unknowing or, you know, knowing or involuntary collaborators
in providing public data. And so I like the idea that we all have equal access
in these public data programs, and we're now getting meaningful access to that data. And so like today we're doing a better job
at making the data available for download, right?
It's like cdata.gov, for example.
Like public data is pretty accessible now.
And so I think the next step though,
and going back to that comment I made about meaningful,
is to provide the tools that lower that ramp even further and give all these collaborators meaningful access. You know,
so we're starting with SQL, which, you know, for most developers and marketers is a pretty good level of entry for querying, you know, enormous sets of data.
But, you know, I think we're going to end up with machine learning-powered speech queries, right?
Where Felipe, Arfon, and I
aren't talking about these queries
that you have to construct
and managing your limits on the data.
We're actually telling you just to ask the machine,
the data set, a question.
Let's continue on the practical side of how you get that done.
You mentioned the console, which is where you can
write your queries and test your queries and run them.
There's other ways that you can use BigQuery as well.
Once you have those queries written, for instance,
with ChangeLog Nightly, we're not going into the console
and running that query every night and shipping off an email.
It's all programmatic.
Can you tell us what it looks like from the API side,
like how you use BigQuery not using the console?
Yeah, so BigQuery has a very simple-to-use REST API
for people that want to write code around it.
So now we have a lot of tools that connect to BigQuery.
Tableau is one of the big ones.
In specifically open
data, we have a partnership
with Looker. So some
of our public datasets that we
are hosting with Will have
specifically Looker dashboards
built over them.
I love Redash
for writing dashboards
and that's a dashboard software
that was not created
for BigQuery at all,
but it was open source.
People loved it.
People started sending patches
so it connected to BigQuery.
So now you can use Redash
to analyze BigQuery data.
And that's,
I just love using that one.
The new
Google Data
Studio also. It's a
pretty easy way to just
create dashboards. I'm
sharing one of these dashboards specifically
for GitHub, this
GitHub data set too.
So yeah, you don't need to know SQL.
I just love SQL,
but you can connect it to all kinds of tools
and also to other platforms
like Pandas or R, et cetera.
It's all about this: once you have a REST API, you can just connect to anything.
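For listeners wondering what that programmatic route can look like, here is a minimal sketch, assuming the google-cloud-bigquery Python client; the project ID is a placeholder, and the query runs against one of the public GitHub sample tables:

    # A minimal sketch of calling BigQuery from code, assuming the
    # google-cloud-bigquery Python client and default credentials.
    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project ID

    # Ten most-starred repos in the public sample table.
    query = """
        SELECT repo_name, watch_count
        FROM `bigquery-public-data.github_repos.sample_repos`
        ORDER BY watch_count DESC
        LIMIT 10
    """

    job = client.query(query)   # submits the job over the REST API
    for row in job.result():    # waits for completion, then iterates rows
        print(row.repo_name, row.watch_count)

The same pattern, run on a schedule, is how an email like ChangeLog Nightly can be produced without anyone touching the console.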
One last question on this line of conversation.
We talked about how long it takes
to process, to get the data into
BigQuery. It was two weeks, then it was
a week, and then it was 20 hours, now it's six hours.
How about querying it? We haven't talked about what to expect if we're going to do the GitHub full monty, like this query for emoji used in commit messages, for instance, however many terabytes that covers. Are we talking like three seconds, 30 seconds, minutes? What do we expect?
It depends a lot on what you're doing. Here we're really testing the boundaries of BigQuery. You can go way beyond doing just a grep. You can, I don't know, look at every word in every piece of code, split it, count it, group it, run a regular expression. So some queries will take seconds. I love those. I love being able to go on a stage, start with any crazy idea, code it, and have the results while I'm standing out there. But sometimes there are queries that are more complex, that involve joining two huge tables, where the joins push BigQuery to its limits. When you're reaching the boundaries, it's good to limit how much data you query, for example.
Oh, I have this pretty interesting query that might take two minutes. What if, just to get very quick results, we sample only 10% of that data, or 1%? Things start running a lot faster.
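As a rough sketch of that sampling trick (the query here is a stand-in; note this trims the rows flowing into the expensive later stages rather than the bytes scanned):

    # Hedged sketch: sample ~10% of files up front so the expensive
    # split/group work downstream touches a tenth of the rows.
    from google.cloud import bigquery

    client = bigquery.Client()
    sampled = """
        SELECT word, COUNT(*) AS n
        FROM (
          SELECT content
          FROM `bigquery-public-data.github_repos.sample_contents`
          WHERE RAND() < 0.1          -- keep roughly 1 file in 10
        ), UNNEST(SPLIT(content, ' ')) AS word
        GROUP BY word
        ORDER BY n DESC
        LIMIT 10
    """
    print(list(client.query(sampled).result()))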
But it's really cool. So on one hand, you feel that, oh,
I'm reaching one of the boundaries.
But at the same time, you feel
that, wow, I'm really doing
a lot here.
Let me see if I can run a query now
while we talk. I'll come
back when I get my query.
Felipe, maybe you can multitask. I'm
not sure, but let's test
you out.
Earlier in the show, actually during a break, we talked about some things you have affinities for, what the possibilities of BigQuery and all these datasets being available might offer. And one of them you mentioned was being able to cross-examine datasets. So for example, you had said how weather may affect, I think it might've been pushes to GitHub or pushes to open source or something like that, basically how you're able to capture various large public datasets, like traffic patterns or weather, and relate them to the ability to deploy code or push code to GitHub. But what other ideas do you have, and what are some of your dreams for cross-examining datasets?
So just to answer the question, because I told you I was going to come back with this.
There you go.
I copy-pasted one of the sample queries.
In this case, we are looking at the sample tables, the sample contents. That's basically 30 gigabytes of code. I'm looking only at the Go files, and at the most popular imports for Go. And this query over 30 gigabytes ran in five seconds.
Not too shabby.
That's fast.
Yeah, that's how cool things get.
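To give a flavor of that query, here is a hedged sketch in today's standard SQL, run from Python; the regex only catches single-line import statements (grouped import blocks would need more work), so treat it as an illustration rather than the exact sample query:

    # Sketch of a "most popular Go imports" query over the sample tables.
    # The regex misses grouped `import ( ... )` blocks; illustrative only.
    from google.cloud import bigquery

    client = bigquery.Client()
    go_imports = r"""
        SELECT imp, COUNT(*) AS n
        FROM `bigquery-public-data.github_repos.sample_contents`,
             UNNEST(REGEXP_EXTRACT_ALL(content, r'import\s+"([^"]+)"')) AS imp
        WHERE sample_path LIKE '%.go'
        GROUP BY imp
        ORDER BY n DESC
        LIMIT 10
    """
    for row in client.query(go_imports).result():
        print(row.imp, row.n)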
Yeah, so going back to dreams, just seeing data in BigQuery, seeing people share data here, whet my appetite for joining different datasets. For example, something I ran last year, when I got all of Hacker News inside BigQuery, the whole history of comments and posts, was to see how being mentioned on Hacker News affected the number of stars you got on GitHub.
Oh!
Yes, I can send you that link too.
Or you could also have the public dataset of the ChangeLog, and when we release new shows, see how popular that project might get.
Exactly.
Ooh.
Oh, yeah, that would be cool.
So we can see all these things moving around the world, the pulse of it, and how each one affects the others: Reddit comments, Hacker News comments, Wikipedia page views. And you can see the real effect on code, on what's happening in GitHub code, on the stars, on how things start spreading around, and the ability to link these artifacts, to add weather, like, oh, do people code under good or bad weather?
Right.
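For a sense of the shape such a cross-dataset query can take, here is a heavily hedged sketch: the Hacker News dataset does live on BigQuery, but the githubarchive table and its columns are assumptions for illustration, not the exact study described here:

    # Hedged sketch: relate Hacker News stories that link to GitHub repos
    # with star (WatchEvent) counts from a GitHub Archive year table.
    # Table and column names are assumptions; adapt them to the real schemas.
    from google.cloud import bigquery

    client = bigquery.Client()
    cross = r"""
        WITH hn AS (
          SELECT REGEXP_EXTRACT(url, r'github\.com/([^/]+/[^/?#]+)') AS repo,
                 COUNT(*) AS mentions
          FROM `bigquery-public-data.hacker_news.full`
          WHERE url LIKE '%github.com%'
          GROUP BY 1
        ),
        stars AS (
          SELECT repo.name AS repo, COUNT(*) AS star_count
          FROM `githubarchive.year.2015`
          WHERE type = 'WatchEvent'
          GROUP BY 1
        )
        SELECT repo, mentions, star_count
        FROM hn JOIN stars USING (repo)
        ORDER BY mentions DESC
        LIMIT 20
    """
    for row in client.query(cross).result():
        print(row.repo, row.mentions, row.star_count)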
Let's extend that a bit then.
Another question we have for you is, and this is more for all of you.
This isn't just you, Felipe, but keying off of this topic here,
what would you like the community to do as a result of this?
So you have some pure love for cross-examining data sets, things like that.
And as you can hear, there's a crazy storm here in Houston.
You heard that lightning there.
The hatches are being battened down now.
My wife, she's out there taking care of it.
I got to go join her soon,
so maybe the show will end eventually.
But in between now and then,
what would you like the community to do?
So you got the listening ear of the open source world
hearing you guys talk about this stuff now,
all these data sets being available.
Will, maybe at some point you could talk about some other datasets that might come into play here as well to fuel this fire, but what are your dreams
for this? What do you want the community to do with it?
I'll go.
So I would love to,
I mean, one of my favorite projects that
uses GitHub data,
you know, open source data
from GitHub is
libraries.io, and I know you
had Andrew on a few episodes ago.
So I think there's still a huge opportunity to lower the barrier to entry for people into open source. Part of that is maybe product changes and improvements to GitHub. But there are really interesting projects out there, like First Pull Request and Up For Grabs, with low-hanging-fruit issues that are easy for the community to work on.
I'm convinced that in this dataset are the answers to questions like: what makes a welcoming project for people to come and work on together? Combining that, we've got everything that everybody's ever said to each other, and all of the code that's been written. You can run static analysis tools on that code to look at its quality, maybe how approachable it is.
There's a missing piece right now. Maybe I'm a 20-something CS graduate and I can program like crazy, but I've never participated in open source, and there are lots of these people. Or maybe I'm somebody who's just got my first computer, and I've heard about open source and I want to get stuck in. We're not always connecting the supply, the talent that's out in the world, with the opportunity: projects where everyone wants more contributors. Everybody wants people helping to build software with them together. So I'm really excited to see what the community are going to do around those topics.
Because you think about what Andrew's done with Libraries.io, that's a really good example of stepping in that direction. But this makes possible richer, more intelligent uses of that data, and strengthening the open source ecosystem is where I think the big opportunities are. And ideas are free, right? There's money to be made doing that. If somebody wants to go and build companies that solve that problem, I think that's a genuinely interesting problem to solve.
Yeah, lots of ideas come to mind for me on that. But on the note of Andrew, with Libraries.io he's actually querying GitHub's API directly. So in this case, he could go to BigQuery and get the same data, maybe faster. He might have to pay a little bit for it, but he wouldn't have to hit rate limits or things like that, and he'd actually have a much richer ability to ask questions of GitHub versus the API.
Exactly, yeah.
Cool. Felipe, what about on your side? Or Will, on yours, any dreams?
For me, I like comparing this with the story of Google. Google, for me, is the biggest company built on data. Basically, you need data, tools, and ideas. Data for Google was collecting the whole World Wide Web at that moment. Collecting it was not easy, but you also needed the tools to store it and analyze it. And then you needed ideas. At that time, there were a lot of web search companies that had all this data, a copy of the web, a mirror of the web inside their servers. But the idea that Google had of, hey, let's do PageRank, let's look at the links between pages to rank our searches, that was huge.
So I'm looking at the same thing right now with this and other datasets. We have the tooling. Tooling might be BigQuery. BigQuery gives you the ability to analyze all of this, but you can create tools on top of it. I'm looking forward to seeing more static code analyzers that will run inside BigQuery. You need ideas. That's where I'm looking for the world to bring new ideas, new ways to look at this data that we're making available.
And I'm looking out for data. We're making a lot of data available in BigQuery, and I would love people to share more. That's why we have Will here also to help. If you have an open dataset, if you want to share data, instead of just leaving a file there that takes hours to download and then has to be analyzed on someone's own computer, share it on BigQuery. Then you make it immediately available for anyone to analyze and to join with other datasets. So for me, that's...
Well, since you mentioned Will,
there's definitely one subject
that I wanted to save closer to the end here,
which is talking to you about the datasets that you're...
I mean, this is mostly around the partnership
with GitHub and this dataset,
but what other datasets, as Felipe had mentioned,
what do you have your eyes on?
What hopes do you have there?
Yeah, well, what I'm focused on right now is trying to get datasets that address that accessibility issue I was telling you about earlier, like a lot of the data.gov stuff: Medicare data, census data, some of the climate data. And what I find interesting about this is that this data has been collected for decades, so the schemas around it were designed well before we even thought about big data challenges, much less SQL; these are pre-NoSQL challenges, right? We're talking prior to the seventies. And the challenge here is that a lot of this data is coded, it's truncated, because at the time there were limitations on characters and everything else. So it's getting all that coded data, which is technically available for download by the public but not usable. We're planning on onboarding some of the data from the government catalogs, like the census data, Medicare data, patent data from both the US and Europe, and then some more of the weather-related data. And it's a big challenge, because a lot of this data is decades old and was designed at a time before there was even SQL or big data. It's heavily coded. So the challenge is to decode that data, which requires resources, and then structure it in a way that fits well into BigQuery. And then Felipe can take it from there to the community, construct all sorts of interesting queries, and address that accessibility challenge I was talking about earlier.
Yeah, something around what's going to be stored as part of this. There's obviously some motivation on GitHub's side to do this. So Arfon, feel free to throw a mention in here. But I'm kind of curious, for all three of you, whoever wants to share something about this: how does this open the door for other code hosts? SourceForge from back in the day, I think they're still kicking around, I'm not really sure what their status is, but you've got Bitbucket, you've got GitLab. Obviously, having these kinds of insights is interesting. So does this open up the door for other hosts? Is this something that's a motivation for everyone to do that kind of thing?
Yeah, I'll take a stab at that. I actually think that open source software, wherever it is, is hugely valuable. So I would love to see more open source software made available in a similar way to the way we're releasing this data today with Google. The more the better, as far as I'm concerned. If this were 10 years ago, a lot of open source activity was happening on SourceForge, and there's still stuff up there that's used and still incredibly important. And of course, people are on Bitbucket and GitLab and other hosts as well. So I would love to see more vendors participating in archiving efforts like this.
I think there's more to be done than simply depositing data. We have the way that our API works; Bitbucket has its API; GitLab has its API. There are differences between all the different platforms, even if many of them are using Git or Mercurial at the base level for the code. So I think there are actually really big opportunities to standardize some of the ways in which we describe the data structures that represent not only code, but all of the pieces around it: the community interactions, the comments, the pull requests, all of these things.
And so I'm aware of a few community efforts. There's one called Software Heritage. There's one called FLOSSmole, where they've got, for example, all of the RubyGems stuff in there and a whole bunch of SourceForge data.
I've talked today about empowering the research community around these datasets. One of the issues with doing that right now is that I spend most of my time thinking about GitHub, the data that GitHub hosts, but of course that isn't all of open source. And making sure that it's possible for all of software to be studied is going to be really important going forward.
So yeah, I think there's a bunch of opportunities
there about improving platform interoperability
that I don't think many people are talking about right now.
And I'd love to see some advancement in that,
because I think it's good for the ecosystem at large.
Yeah, I would also like to highlight the technical side. There is a big technical problem, and the question here is: are we able to host all of GitHub's open source code in one place and analyze it in seconds? Well, we just proved that we can. So let's keep bringing data in. Let's keep pushing the limits. But yes, technically we can solve this problem today.
That's a good thing. I mean, obviously, Will, with your help, and Felipe, your abilities to lead this effort, and Arfon, your efforts on the GitHub side of things to be open to this. And I think part of this show is, one, sharing this announcement, but two, opening up an invitation to the developers out there, the people doing all this awesome open source and dreaming about all this awesome open source, an invitation to bring their companies' datasets, if there's open data out there, to BigQuery.
And so I guess, well, what's the first step for something like that? You said
that that's an open door. Obviously, if 10,000 people walk through the door at once, it's not
a good thing because you may not be able to handle it all. But what's the process for someone to
reach out? What's the process to share this open data?
Yeah, so they can contact us. I'm trying to pull it up just so I get it right; it's on the cloud.google.com site under our datasets page. Where is that email? I will give that email to you so you can put it in your accompanying doc. But I would also encourage them to reach out to Felipe on Reddit or on the Medium post and just get a hold of either of us that way.
We'll have that Medium post in the show notes.
So if you've got your app up or whatever.
I just got it.
It's bq-public-data at google.com.
bq-public-data at Google.
Yes. I would like to add that on the technical side, if tomorrow 10,000 people want to open datasets on BigQuery, that's completely possible. Anyone can just go and load data into BigQuery and then make it public. What we're offering with this program is support: having your dataset showcased, and taking care of paying the hosting price. But you can just go and do it yourself. Working with us is cool, but you don't need to go through a manual process. You can go and do it.
That's an excellent point.
And to be clear, you can upload your data and then put ACLs on it to make it public. And then anybody that queries that data, you're not going to be charged for their queries.
Gotcha, that's good then. So you'd mainly go through them if you have a big dataset and you want some extra handholding, so to speak. In that case, email the address mentioned, which we'll also copy down and put in the show notes. But it's possible to do it on your own, as you mentioned, through the BigQuery interface, making it public and not being charged. That's a good thing.
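For the do-it-yourself route, here is a minimal sketch of that ACL step with the Python client; the dataset ID is a placeholder, and the console's sharing settings accomplish the same thing:

    # Hedged sketch: make an existing dataset publicly readable, so anyone
    # can query it (and they, not you, pay for their own queries).
    from google.cloud import bigquery

    client = bigquery.Client()
    dataset = client.get_dataset("my-project.my_open_dataset")  # placeholder ID

    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="specialGroup",
            entity_id="allAuthenticatedUsers",  # any signed-in user can read
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])  # push the new ACL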
Well, let's wrap up, because I know we had a storm. We did have a quick break there because of the storm and my internet outage for about five minutes, so thanks for bearing with that. And listening audience, you probably didn't even hear it, because we do a decent job of editing the show and making things seamless when it comes to breaks like that. But it's time for some closing thoughts, so I'll open it up to everyone, whoever wants to take it: just some closing thoughts on the general things we talked about here today. Anything else you want to mention to the listening audience about what's happening here?
All right, I'll go. Okay, so I'm incredibly excited to see this data out in the public. We talked a lot today about public data, but there's open data, and then there's also useful data, usable data. And I think this is the first time that you've been able to query all of GitHub, and that's an incredible opportunity for studying how people build software, for understanding what it means for projects to be successful. Honestly, I think the most exciting thing for me about this is that the data is now available. It's out there, and I think the possibilities are near limitless. I can't wait to see what the community does with this dataset.
Well, Felipe, anything to add
to close? I would love to add, for anyone analyzing data: it doesn't need to be open data. I love open data, but if you're analyzing data today and you're suffering, waiting for hours to get results, having a team managing a cluster, babysitting a cluster overnight, try BigQuery. Things can be really fast, really simple, and that will free up your time to do way more awesome things.
Awesome.
Well, I can definitely say that we've been enjoying BigQuery. But go ahead, Will, did you have something you wanted to add?
Oh, I just wanted to add to what both Arfon and Felipe were saying around communities.
What I'm really looking forward to is seeing the community participate in developing interesting queries.
And I'm sure there are data sets out there
that are interesting that I'm not aware of.
And I would love to hear about those
and try to get those more accessible.
One more curveball here, just at the end of the show. It occurred to me during the show that over the years of the ChangeLog, we've had a blog, we've had this podcast, we've got an email, and we've talked several times about open data, public data being open sourced on GitHub. And it now occurs to me that all of that effort can be imported either by way of GitHub or just directly into BigQuery. So if you're out there and you've got a dataset you've open sourced on GitHub, go ahead and go to BigQuery, put it there, and make it public there. That way people can actually leverage it, because I can't even count on my hands how many times we've covered open data in all the ways we've talked about on the show today. Putting it on GitHub is great, but making it useful, not that GitHub isn't useful, is putting it on BigQuery and opening it up for everybody. That, to me, seems like the cherry on top.
Obviously, you know, we've got a couple links
we're going to add to the show notes.
We've got this announcement, obviously,
between this partnership and the GitHub data set
being available in this new way.
The blog post being out there, we'll link those up.
So check the show notes listeners for that.
But I just want to say thanks to the three of you for, one,
your efforts in this mission and caring so much, but then, two,
working with us to do this podcast and share the details behind this
announcement, because we're definitely timing the release of this show for all the listeners right around it, if not the same day, then the same time frame, maybe the day after. I know there's been a couple of posts already shared out there, so I'm not sure exactly on perfect timing, but we're aiming for this to be right around the same time as the announcement at CodeConf from GitHub. We're trying to work together to go deeper on this announcement, share the deeper story here, and obviously get people excited about it.
So I want to thank you for working with us on that.
It's an honor to work with you guys like this.
But that's really all we wanted to cover today.
So listeners, thank you so much for tuning in.
Check the show notes for all the details we talked about in this show.
But, fellas, that's it.
So let's say goodbye.
All right.
Thanks very much. It's been really fun to talk in depth about the project today.
So thanks for having me on.
Thank you very much.
I love being here.
I love being able to connect with everyone here at the ChangeLog.
Yeah.
Thanks for having me as well.
It's been a good conversation.
And with that, thanks listeners.
Bye. We'll see you next time. I'm out of here