The Changelog: Software Development, Open Source - GitHub and Google on Public Datasets & Google BigQuery (Interview)

Episode Date: June 29, 2016

Arfon Smith from GitHub, and Felipe Hoffa & Will Curran from Google joined the show to talk about BigQuery — the big picture behind Google Cloud's push to host public datasets, the collaboration between the two companies to expand GitHub's public dataset, adding query capabilities that have never been possible before, example queries, and more!

Transcript
Starting point is 00:00:00 Welcome back everyone. This is the Changelog and I'm your host, Adam Stacoviak. This is episode 209, and today Jared and I have an awesome show for you. We talked to GitHub and Google about this new collaboration they have: Arfon Smith from GitHub, Felipe Hoffa from Google, and Will Curran from Google. We talked about Google BigQuery — the big picture behind Google Cloud's push to host public datasets, with BigQuery as the usable front end.
Starting point is 00:00:34 We talked about the collaboration between Google and GitHub to host GitHub's public dataset, adding querying capabilities to GitHub's data that have never been possible before. We have three sponsors today: Toptal; Linode, our cloud server of choice; and Full Stack Fest.
Starting point is 00:00:50 Our first sponsor of the show is our friends at Toptal, the best place to work as a freelance software developer. If you're freelancing right now and you're looking for ways to work with top clients, work on things that are challenging to you, interesting to you, technologies you want to use — Toptal is definitely the place for you.
Starting point is 00:01:07 Top companies rely upon Toptal freelancers every single day for their most mission-critical projects. And at Toptal, you'll be part of a worldwide community of engineers and designers. They have a huge Slack community, very much like family. You'll be able to travel, blog on the Toptal engineering blog and design blog, apply for open-source grants. Head to Toptal.com to learn more. That's T-O-P-T-A-L.com — or email me, adam@changelog.com, if you prefer a more personal introduction to our friends at Toptal. And now onto the show. All right, we're back.
Starting point is 00:01:47 We've got a fun show here. I mean, Jared, we've got some backstory to tell, a little bit to kind of tee this up. So back in episode 144, we talked to Ilya Grigorik, a huge friend of the show. I mean, we've had Ilya on the show, I think, three times now. Is that right? I think that's right. In fact, we were going to have him on this show as well.
Starting point is 00:02:02 We have three awesome guests, and we figured we'd let them take the spotlight, since they've been highly involved in this project as well as Ilya. Right. So we've got GitHub and Google coming together — Google Cloud specifically, along with Google BigQuery — and a fun announcement around GitHub's datasets, opening those up on BigQuery. And we use BigQuery actually as sort of a byproduct of previous work from Ilya, which was GitHub Archive, and we worked with him to take over the email that was coming from that, and now we
Starting point is 00:02:32 call that change log nightly so that's kind of interesting. Yeah in fact we had a brief hiccup in the transition but one that we were happy to work around is what they've been doing behind the scenes is making GitHub Archive and the Google BigQuery access to GitHub.
Starting point is 00:02:48 Lots more interesting. We're going to hear all about that. Absolutely. So without further ado, we've got Felipe Hoffa, Arvon Smith, and Will Curran. Felipe and Will are from Google and Arvon, as you may know, is from GitHub. So fellows, welcome to the show. Hi. Thanks for having me.
Starting point is 00:03:04 Hello there. Nice to the show. Hi, thanks for having me. Hello there. Nice to be here. So I guess maybe just for voices' sake and for the listeners' sake, since we have three additional people on the show, it's always difficult to sort of navigate voices. Let's take turns and intro you guys. I got you from top to bottom, Felipe, Arvon, Will. So we'll go in that order. So, Felipe, kind of give us a brief rundown of who you are and what you do at Google. Hello there.
Starting point is 00:03:27 I'm Felipe Hoffa. I'm a developer advocate specifically for Google Cloud. And I do a lot of big data and a lot with BigQuery. And Arvon, how about you, bud? Yep. So my name is Arvon Smith and I am GitHub's program manager for open source data. So it's my job to think about ways in which we can be sort of more proactive about releasing data products to the world. And this is what we're going to talk about today is a perfect example of that.
Starting point is 00:03:58 Awesome. And Will, how about you? Yeah. Hi there. This is Will Curran. I'm a program manager for Google Cloud Platform, and I'm specifically working on the cloud partner engineering team. So my role is in the big data space and storage space to help us do product integrations with different partners and organizations. show here in particular is obviously touching back on how we're using GitHub Archive, but then also how you two are coming together to make public datasets around GitHub available, collecting these datasets, showing them off. I'm assuming a lot of new API changes around BigQuery. Who wants to help us share the story of what BigQuery is, catch us up on the idea of it, hosting datasets. What's happening here? What's this announcement about?
Starting point is 00:04:47 So we're going to start with what are we doing with GitHub or what is BigQuery? Let's start with the big picture, BigQuery. Public datasets will. This is a big initiative of yours at Google. GitHub, one of those public datasets. But give us the big context of what you all are up to with these public datasets.
Starting point is 00:05:04 It started with Felipe. He's been working for a while now with the community and different organizations to publish a variety of public data sets. And we've got a lot of great feedback from both users and data providers. is that they want more support for public data sets in terms of resourcing and attention so that they can get more support for not just hosting those data sets, but for maintaining them, which is our biggest challenge right now. And so we developed a formal program at Google Cloud Platform
Starting point is 00:05:39 to launch a set of data sets that Felipe had been working on them for a while. And we launched those at GCP Next earlier this year. And so the program basically provides funds for data providers to host their data on Google Cloud, as well as the resources to maintain those datasets over time so that there's current data. And so the program allows us to host a much larger number of datasets
Starting point is 00:06:04 and bigger data sets. And currently, we're focused on growing the available structure data sets for BigQuery. But then we'll start adding more binary data sets to Google Cloud Storage. As an example, like Landsat data would be a binary data set that we're looking to onboard. And then that brings us to this week's announcement
Starting point is 00:06:24 around our GitHub collaboration. FELIPE HOFFA- I would love to highlight this about BigQuery. We can find open data all over the internet. That is awesome. But what's special about data shared on BigQuery is that anyone can go and immediately analyze it. Everywhere else, you have to start by downloading this data
Starting point is 00:06:45 or by using certain APIs that restrict what you can do. When people share data on BigQuery, like, for example, the GitHub archive that Ilya has been sharing for all this time, this data is available for immediate querying by anyone, and you can query it in any way you want. You can basically run full table scans that run in seconds without you having to wait
Starting point is 00:07:07 hours or days to download and analyze the data at your home. Kind of reminds me of the Martian when the guy's like, hey, I need to do a big analysis on the trajectory of the orbits and stuff like that. If anybody's seen the Martian, he's like, I need supercomputer access. It seems kind of like supercomputer access to any data set if that's what you want exactly once we have a data same in the query anyone like you just need to log in everyone has uh for a free terabyte
Starting point is 00:07:36 every month to query has access to basically a supercomputer that able able to analyze terabytes of data in seconds just for you i know one of the things that uh and jared you can back up on this with uh piggybacking off of ilia's work with github archive and now change all nightlies that that email you know that wouldn't be possible without big query because those those uh you know those queries happen so fast it takes so much effort on a computer's part to get those queries on that big data set. I mean, that's pretty interesting.
Starting point is 00:08:10 I like that. Oh, yeah. So Ilya was the one that started sharing data on BigQuery. Like he told you in episode 144. Right. He was collecting all these files. He was extracting from GitHub all the logs. And BigQuery was opening up as from GitHub all the logs. And BigQuery was
Starting point is 00:08:25 opening up as a product at the time. So he chose BigQuery to share this dataset. And since then, we've shared a lot more datasets in BigQuery. All the New York City taxi trips, Reddit comments, hacker news, etc. You're able to analyze it. And now
Starting point is 00:08:41 what we're doing with Will is grow this into a formal program to get more data, to share more data, to analyze it. And now what we're doing with Will is grow this into a formal program to get more data, to share more data, to make it more awesome for everyone. So those are interesting data sets. Will, maybe give us a few more interesting ones. Specifically, that would be cool for developers and hackers to look at
Starting point is 00:09:01 and perhaps build things with. Either ones that you guys have currently opened up since our last show, which was February 2015, quite a bit ago, or things that you're hoping to open up that would be interesting for developers? One of the ones I like using myself is the NOAA GSOD data.
Starting point is 00:09:18 I have a lot of interest around climate change themes and topics. And what I found interesting with that data set, and Felipe had done some some great documentation on how to uh how to actually leverage uh that data is you can go right in there and instantly get uh you know in a matter of seconds um the coldest temperatures recorded over time that they've been tracking it back since like, I think it was 1920s and the hottest ones.
Starting point is 00:09:47 And immediately you can see the trends that everybody's talking about where, you know, the past decade or so, yeah, we've hit a lot of record temperatures that have not been seen in previous decades. So it's kind of exciting just to be able to like pick up a data set like that and validate a lot of the the science and news that you're reading right that is interesting so well i guess what i was going to say how do you go ahead and get started with that but maybe we'll save that for near the end of the conversation once everybody's appetites are sufficiently whetted let's talk about uh the
Starting point is 00:10:20 subject at hand which is this new github data so we've had since Ilya set up GitHub Archive back in the day, we've had some GitHub data, which was specifically around the event's history and issue comments and whatnot. But y'all have been working hard behind the scenes, both Google and GitHub together, to make it a lot more useful. So maybe, Arvon, let's hear from you the big news
Starting point is 00:10:44 that you guys are happy to announce today. So, yeah, as you kind of will be well aware with the existing GitHub archive, the GitHub API spews out all these events, like hundreds and hundreds per second of public records of things happening on GitHub. So things like when people push code, when people star a repo, when orgs are created,
Starting point is 00:11:07 all these kinds of things already happen. And these are just JSON kind of blobs that come out of the API. And so the GitHub archive has been collecting those for about five years now. But what we're adding to that is the actual content that these events describe. So if you had a push event in the past, so somebody pushing code up to GitHub, you had to go back to the GitHub API to go and get the file, for example, a copy of what was actually changed.
Starting point is 00:11:40 And so what we're actually adding to BigQuery is a few more tables. But these tables are really, really big. So we've got a table full of commits. So every commit now has the full message that people were, you know, for the author, the files modified, the files removed, all the information about the commit and the source repo.
Starting point is 00:12:03 So that's about 145 million rows. It's probably more now, it's probably upwards of over 150 million. We've got another table, which has all the file content. So, you know, all of these projects on GitHub that have an open source license, you know, the license allows us to, you know, third parties to take a copy of this code and go off and, you know know do things with it that's what's kind of one of the great things about about open source so there's now a copy of these files in in bigquery tables and so this is this is the big one this is about three terabytes of uh raw data that has the full file contents um of the of the object that was touched in the repository on GitHub.
Starting point is 00:12:48 And I'm sure we'll dive into some of the possibilities of what you can do with that. And then in addition, there's another table, which basically has a full mapping of where the file, all of the files at kind of Git head in the repository. So like a mapping of all the files at kind of git ahead uh git head in um uh in the repository so like a mapping of all the files and all their paths and joining them to the file content so you can and there's about two billion of those file paths so basically we've got this kind of vast uh kind of network of files commits and now also the contents of those files sitting ready to query in BigQuery. So it's about, I think we're upwards of about
Starting point is 00:13:29 three terabyte data set here. And it's the biggest data release that we've ever made. That's awesome. It sounds like a lot of work. I'm just sitting here thinking, man, it's a lot of work even describing it. I'm sure both sides have put a lot of effort in this. Can you describe the partnership, the way you've worked together, the two companies,
Starting point is 00:13:47 and from your perspective, what all went into making this happen? Sure, so I'll start, but I'm sure there's more detail to come from Felipe as well on this. So unsung hero of today's call is, well, two really, Ilya, of course, but a guy called Sean Pierce,
Starting point is 00:14:04 who works in the open source office at google and so um you know like the desire for data from github is like a generally kind of uh a general request we get from large companies who are doing a lot of open source so we get that from uh you know google regularly pulling data to analyze their own open source projects on GitHub. And so Sean had actually done some early work exploring this, pulling these commits into BigQuery. He'd started to kind of build out a pipeline
Starting point is 00:14:35 to help monitor their own open source projects. But we have pretty good regular conversations with him and the team he's in. And so I think it just came up in one kind of conversation back in February. He was like, hey, by the way, I've been working on this thing. We have this public data set program that's growing, and this would make a great data set to have available in BigQuery. What do you think?
Starting point is 00:15:01 And we jumped at the chance to get involved. And so it's been a few months in development to make sure that we're getting, pipelines are all working, but most of the kind of lion's share of the work has been done by Sean on the data pipeline, which I think runs every week to update this. But Felipe, can you remind us if that's the case? Yes.
Starting point is 00:15:30 At least today it's set up to run every week. So this snapshot will be updated every week with the latest files, details in GitHub. I have a quick story about the partnership. When I was first approached with this and it was Sean and I got introduced to Ar this, and it was Sean, and I got introduced
Starting point is 00:15:47 to Arvon. And one of the first questions I asked when I talked to a data provider about, you know, is this going to be useful or whatever, given the backlog we have is I asked, you know, can you send me a sample query, you know, that shows, you know, how this will be useful to users. And one of the first queries that Arvon sent was a number of times, this should never happen. And I knew it was going to be fun just working with this data. And I've just run the query actually after our last load here. And we're not quite at a million times yet, but we're getting close.
Starting point is 00:16:20 What do you mean by that shouldn't have happened? That's the number of times in this data set that someone has committed a comment that says, What do you mean by that? Shouldn't have happened. That's the number of times in this data set that, that someone has committed a comment that says this should never happen. Ah, gotcha. So it says it in the commit message or is it actually in the code comments in the code?
Starting point is 00:16:36 In the code. Yeah. Yeah. It's like the, you know, rescuing, rescuing every error you can possibly imagine. This,
Starting point is 00:16:44 this, this will never happen. This will never happen. This should never happen. We're almost at a million. Right, right. And so you're like, yeah, okay. But it's in there. Yeah, there was a thing on Hacker News
Starting point is 00:16:55 a few months ago with this kind of came, I think somebody demonstrated that. I think they tried to do a, I think they did a search on the GitHub side, just on our standard search to say, you know, let's see how many times something should never happen. Now you can do this with kind of looking at particular language types as well, right? And segment, do much more powerful searches. So that's one of the things that's kind of fun about the data.
Starting point is 00:17:20 That's a great use case. And I think what I'm excited about this is, especially getting it out to our audience and to the whole developer community, is all these new opportunities and use cases and things that we collectively couldn't know previously. And we can start to know by people asking different questions that I wouldn't have thought of or you wouldn't have thought of. So we're going to take a quick break, but on the other side, what we want to know is like, what all does this open up? Obviously there's things that we haven't thought of yet, but what's the low hanging fruit
Starting point is 00:17:51 that's cool that you can do now? You can ask these questions now and you can get answers that you couldn't previously get. So I'll just tee that up and we get on the other side of the break. We'll talk about it. Linode is our cloud server of choice. Get up and we get on the other side of the break we'll talk about it linode is our cloud server of choice get up and running in seconds with your choice of linux distro resources and node location
Starting point is 00:18:12 ssd storage 40 gigabit network intel e5 processors use the promo code changelog20 for a $20 credit two months free one of the fastest most efficient most efficient SSD cloud servers is what we're building our new CMS on. We love Linode. We think you'll love them too. Again, use the code changelog20 for $20 credit. Head to linode.com slash changelog to get started. All right, we're back with quite a crew here talking about big data, Google BigQuery, GitHub, fun stuff. But in the wings, when we take these breaks, we often have side conversations, and it had just occurred to us that everyone on this call
Starting point is 00:18:53 is in a unique place. Like, for example, Felipe, you're up in the YouTube studios in New York because you're at a conference up there. And Arvon, you're in a truck outside of a starbucks in canada while you're digital nomading with your family in your travel trailer and and you've got a super fast internet connection and will you're where you should be you're in seattle in your home office there and in a google studio there so so it's kind of interesting so arvon what's unique about uh you know where you're at right now i guess um well it's it's um it's well
Starting point is 00:19:27 the speed of the internet is the is remarkable uh i'm in yeah as i say outside starbucks with about 100 megabit connection so that's pretty great that's unheard of yeah so i can report that the canadians have better starbucks wi-fi than the chicagoans which is where i've lived for the last four years so um what else is unique it It's lovely and sunny, but I've only been in Canada for three days, so I have no idea if it's regularly sunny here. But, yeah, it's really nice. And the good thing for us with this scenario for you is that we get to capitalize on a great recording because you sound great.
Starting point is 00:19:57 Your internet's doing great. We don't have any glitches whatsoever. So thanks, Starbucks, for super-fast internet connections in Canada. Appreciate that. That's sponsored by Starbucks. We probably can't say that, right? is whatsoever so thanks starbucks for super fascinating connections in canada appreciate that that's sponsored by starbucks probably can't say that right we'll have to reach out to their pr department or their marketing department to send them a bill for this show or something like that but uh onto the more fun stuff though so jared teed this up before we went into the break but big story here google bigquery has been out there we're aware of it but now we're able to do
Starting point is 00:20:24 more things than we've ever been able to do before. So let's dive into some of these things. What are some things you could do now with this partnership, with this new data set being available there, the four terabytes or three terabytes of GitHub public data being there? What can you do now that you couldn't do before? The beauty is that anyone can do it. So it's not just me, but anyone's open data.
Starting point is 00:20:46 But just having access, being able to see two billion files to be able to analyze them at the same time, it's really, really awesome. For example, let's say you are the author
Starting point is 00:20:59 of a popular open source library. You can go and find every project that uses it and not only that they are using it, but how they're using it. So you can go and see exactly what patterns are people using, what are they doing wrong,
Starting point is 00:21:15 where they are getting stuck. And you can base your decisions on the actual code that people are writing. Yeah, I think the kind of insight into how software that maybe you maintain is being used, I think, is one of the most powerful ones I can think of here. Because, you know, for example,
Starting point is 00:21:36 say you're wanting to make a breaking change to your API. Actually, one of the projects I maintain on behalf of GitHub, the project called Linguist, we want to change one of the projects I maintain on behalf of GitHub the project called Linguist we want to change one of the core methods actually the one that detects the language of the file we want to change its name and we want to re-architect some of the library and we know it's a breaking change to the API and we've had deprecation warnings out for 12 months
Starting point is 00:22:02 but honestly being able to run a query that sees how many times people are actually using that API method still helps me as a maintainer understand the downstream impact of my changes. And currently, that's just not been possible before. And of course, you can't see what's going on in private source code. But a lot of this stuff is in open source repos as well. So being able to have that kind of drill down into source code, all of the open source code that's on GitHub.
Starting point is 00:22:36 And I mean, for me, the other kind of killer feature is like to be able to do this, you want to write a regular expression of some kind, right? And so being able to run regex across four terabytes of data or three terabytes of data, we should actually figure out what the exact number is. It increases daily, of course. But, you know, being able to run a regex against all that data is incredibly powerful and something that's just not been possible before. Well, back we had Daniel Stettenberg on the show.
Starting point is 00:23:03 He's the author of curl and lib curl of course and we asked him at that time how do you know who your users are how do you how do you speak to your users and ask them things and really he said i have no idea i mean first of all curl's so popular that like it's kind of like sqlite like the world is his users um but he didn't really know how people were using his library but But with something like this, like you said, it's only the public repos, of course, we wouldn't want to expose the private repos, big data. But he can actually just go to BigQuery and look for how many people are including Libcurl, right, linking to it in their open source. And not just that, but he can also,
Starting point is 00:23:43 like you said, look at very specific method signatures or how they're using it, and he can get insights. Now, it's not 100% the truth, because like we said, he's got way more users than just open source, but it's at least a proxy for reality.
Starting point is 00:24:00 Is that fair to say? I mean, and there's fun things you can do as well. We're sharing some example queries that we've authored as a group, but of. I mean, and there's fun things you can do as well. Like we were sharing some example queries that we've authored as a group, but of course, you know, there's unlimited possibilities here. But, you know, you can also, you know, look at, you know, most common emojis used
Starting point is 00:24:17 in commit messages and silly stuff like that. But, you know, so there's less serious things you can do as well that would also be, currently be pretty difficult. But yeah, yeah. I mean, being able to drill down and understand how people are using stuff is extremely, extremely important to many people. Actually, one use case that's kind of near and dear to my heart. I mean, everyone's interested if people are using their stuff, but some people actually have to report that, right? Because maybe one particular use case that I'm very familiar with is like
Starting point is 00:24:48 people who've received funding to develop software. So maybe academic researchers who develop code, you know, they'll have funding maybe from the National Science Foundation. And the only thing that matters really to the NSF is how many people, like what was the impact of that software? And, you know, it's really hard to answer that question.
Starting point is 00:25:08 Like how many people are using your stuff? You can maybe say, oh, well, it's got 400 forks. Now, you know, I would say anything that's been forked 400 times is pretty popular, but that doesn't actually mean it's being used. It's a kind of a weak weak weak kind of signal of usage uh whereas an actual like i can show you i can give you the url of every downstream consumer of my software and it's been used by you know 50 different universities or whatever but you know they being able to give people the opportunity to actually report usage is interesting and fun for lots of people, but actually mission critical for many people as well.
Starting point is 00:25:48 And so we get a lot of requests at GitHub from specifically researchers who are trying to demonstrate how much their stuff is being used. It's been really hard to service those requests in the past, but I think we're going to be in a much better position to do that now. Another interesting use case, Felipe,
Starting point is 00:26:07 maybe you can speak to this one. Probably exciting both for white hats and black hats alike is an easy way of finding who and what exactly is using code that's vulnerable to attack. Can you speak to that? Yes. So I'm super excited
Starting point is 00:26:23 about that. Security security-wise, if you are able to fix and find a problem in your source code, that's cool. But if you're able to find the same pattern, the same buggy code or potential vulnerabilities, with the query,
Starting point is 00:26:41 you will be able to find it all around GitHub's open source code and just send patches, contact the project owners, open an issue. But now you're able to do this. And things get really, really crazy. So the kind of things you can do.
Starting point is 00:27:03 Like with SQL, with BigQuery you can write SQL. SQL is powerful, but you can only do some limited amount of operations. You can write regular expressions. But with BigQuery, we also open the space up to user-defined functions written in JavaScript. For example, there is this JavaScript study code analyzer called JSHint. And I'm running it now inside BigQuery just to analyze other JavaScript code and see, for example, find all of the unused variables.
Starting point is 00:27:42 Like, you cannot do that with a regular expression. If you try to run this in your own computer, it would take hours, days. But with BigQuery, you're able to just actually analyze
Starting point is 00:27:55 the flow of the code. Are there unused variables? How are the libraries being used? So, yeah, it gets really crazy. I'm getting now to maybe the boundaries of what we can do with BigQuery, crazy I'm getting now to maybe the boundaries of what we can do with the query but I'm really looking forward to what
Starting point is 00:28:09 people will build upon this let's focus on the security aspect once again with regards to the black hats so a naysayer of this type of available data is that now you have a zero day come out or, well, let's
Starting point is 00:28:28 just call it, yeah, zero days released. And now this enables, you know, whether it's a script kitty or somebody who's more capable can go out and not just, you know, fuzz the entire internet for vulnerable things, but they can actually know exactly, you know, what line of code in a particular project is taking this input. And so while people can go out and do pull requests, people can also go out and hack each other. Do you have concerns about that? Well, I believe in humanity on one side. I think there are more good people than bad people.
Starting point is 00:29:01 And usually people, when they're attacking, they are more good people than bad people. And usually people, when they are attacking, they are more focused on particular projects. On the defense side, here we're giving the ability to people that want to make projects stronger. We're giving them the ability to identify everywhere where their potential problems are and harden these open source project.
Starting point is 00:29:26 That's one of the beauties of open source. Yes, it makes problems more visible, but by making them visible, you have more eyes looking at them. Now, with having all these source codes visible in BigQuery, we are just making people that want to look for problems. We are giving the tool to find them easily in an easy way
Starting point is 00:29:50 and fix them. Yeah, I mean, I look at it very much like it's a tool. You know, you can use a tool for bad. You can use it for good. And if anything, what this does is it
Starting point is 00:30:00 it ups the ante or it speeds up the game, so to speak. And so both sides can use it. I would imagine if you think about believing in humanity, the good people, it just takes one person to go out there and write a
Starting point is 00:30:14 program that can use this data set, query BigQuery for a specific string of code, and automatically find that across all the repos on GitHub and open a pull request just notifying them of the vulnerability. In moments without any user interaction, I think we'll see stuff like that start to pop
Starting point is 00:30:37 up which is pretty exciting. Exactly. Let's say we always tell people within open source that more eyes means more secure code. And that benefits a lot of open source projects. But if you have a very obscure open source project, maybe no one will look at it. Maybe no one will be looking out to harden your code. But this gives a lot more people the ability to look into your skill project because they will be just looking everywhere well just think about it now like right now we have not so much no
Starting point is 00:31:11 eyeballs but very little eyeballs because the process to have such knowledge is difficult whereas uh you know with this partnership this data set available on bigquery and all the good stuff you know now people have a much easier way to, to find these insights. And then obviously, you know, knowledge is power. So in this case, it's, you know, I'm on Bleepy's side, Jared, I'm, I'm kind of a, I'm not the naysayer, so to speak. I'm like, I'm like, do it, you know, cause I think about like in our, in the show that's going to come out after this, we talked to Peter Hedenskog of Wikipedia about site speed.io. We talked a lot about automating reporting of site performance. And this is similar to your point, Jared, where you said, you know, could we automate
Starting point is 00:31:57 some things where a pull request is automatically opened up? I think about the automation tools that may be able to take place on the security side to say, okay, here's a vulnerability. It also opens up another topic I want to bring up, which is not just the GitHub data store, but other data stores or code stores like Bitbucket or GitLab having similar data sets on BigQuery and how that might open up insights to all stores and to all the major stores. But long story short, automating those kinds of things to the open source out there, that's an interesting topic to me. So I was going to say,
Starting point is 00:32:34 one actually, if you, a fun experiment, is actually don't do this. I'm not recommending this. But if you commit, if you commit a public access token from your GitHub profile into a public repo, you'll get an email from us within about a second
Starting point is 00:32:51 saying we disabled that for you because you probably didn't want to do that. Wow. So I think there's actually like scanning and making open source more secure is something that we care a lot about. We think that's in everybody's interest. We think software is best when it's open.
Starting point is 00:33:13 And so, but we've all committed stuff accidentally and had to rewrite history. And it's just humans are humans. And so I think the things that, thinking about the things that, you know, the tools that we can do to improve tools to help people stay safe and help their applications stay safe,
Starting point is 00:33:33 I think is really, really, really, really important. And so we do that currently for GitHub tokens, but you could imagine, you know, I should probably want the same level of service if I commit, you know, I don't know, an Amazon token or a big, you know, a Google Cloud token or whatever it is, something that exposes me, you know, that's the kind of generically interesting area to work on. And so I think, yeah, I think more eyes on open source is, I think, showing how data can be used to make people more secure,
Starting point is 00:34:08 I think will help. I think this just helps sort of accelerate the progress of improvements to things like GitHub by making data more open. One facet of this that we definitely should mention is that the data set that's provided is not real time. And so when we talk about zero days or like code that's currently vulnerable, you do have a lag time between when that snapshot is created. Now, previously you had told us it was two weeks and now Felipe is telling us is one week. So apparently y'all have gotten better at this
Starting point is 00:34:38 since we even talked last. 50%. Yeah. So that's nice. I mean, I'm curious if there's ever a goal to make that a nightly thing or if if a week is is good enough or what your thoughts are on that um i mean i would love to see um you know i think an obvious thing to do with you know you know big archives of data is you know to improve the frequency at which they're they're being refreshed um i would love to see these things get more and more more and more close to live. Yeah, so I mean, I think it's how often the job runs.
Starting point is 00:35:12 I think the job takes about 20 hours to run currently. So we're going to hit a limit of how quickly the pipeline can run, but maybe it can be paralyzed further. I don't know, Felipe, do you recall how long it takes to do this big import right now?
Starting point is 00:35:28 FELIPE HOFFAEYSEN- What I can say is things can only get better. It's amazing how things just improve while I'm not looking. MARK MANDELMANN- It's our current bottleneck in data warehousing and analytics. And so you can expect that all cloud providers are going to be optimizing for that and getting as close to real time as possible. What does it take, I guess, can someone walk us through the process of, you know, capturing
Starting point is 00:35:56 the data set, whether it dumps down to a file, what's the process, maybe even Arvon on your side, like what inside of GitHub had to change to support this? Like what new software had to be built? But walk us through the process of the data becoming available and then actually moving into BigQuery. What's that process like? Kind of walk us through all the steps. So from GitHub side, actually very little changed. And I'm probably not the best person to talk to about the process of actually doing the data capture. I mean, we regularly increase API limits for large API customers.
Starting point is 00:36:34 And so I think we did that. But Felipe, do you have more detail on this? Yeah, let me make a parallel with what the story Ilya told you when he was back here February last year. First, he started looking at the GitHub's public API. He started logging all of these log messages. And once he had these files, he had to find a place, one to store them, to analyze them, and to share them. And the answer was BigQuery. Now in 2016, we had a similar problem, just bigger. It starts by taking a mirror of GitHub, using their public API, looking at GitHub's
Starting point is 00:37:15 change story history. Once you start mirroring this, you have a lot of files. And then the question becomes, where do I store them? Where can I analyze them? Where can I share them with other people? And that's where Sean Pierce is a superstar that writes this pipeline to take one mirror GitHub and then put it inside BigQuery as relational table. That's basically the Google magic in summary.
Starting point is 00:37:45 But yeah, it takes a lot of my producers and doing things at Google scale to be able to just, oh yes, I downloaded,
Starting point is 00:37:53 I made a mirror of all of GitHub. I guess the thing I'm trying to figure out is what makes it take a week? What's the latency in terms of
Starting point is 00:38:03 capturing to querying inside of BigQuery? That's what I'm trying to figure out. What's the latency in terms of capturing to querying inside of BigQuery? That's what I'm trying to figure out. What's the process to get it there? That's a good story there, but why does it take a week? Yeah, I think it might take closer to a day. But it's all about how many machines you have to do this.
Starting point is 00:38:23 You want faster results, you just keep adding machines to it. Then it becomes a question of how much quota do you have inside Google versus other projects. MARK MANDELMANN, And I hate to keep further compressing the time, like we're just making changes right now. But I think we're down to six hours. Really? Oh, nice.
Starting point is 00:38:41 MARK MANDELMANN, What? The pipeline. So we had a conversation a week ago, basically, to tee up this conversation. It was two weeks then. Then we thought it was a week today, and now it's six hours. By the time this show ends, it's going to be real time. Yeah.
Starting point is 00:38:53 Good job, Will. Felipe is actually coding right now as we talk. Sean is a star, but it's all about getting more machine resources for the project and the more people use this data set the more important it becomes well we start putting more resources on it i'm really really looking forward to what the community will do with this data and the tools that we develop over vQuery to to be able to just analyze the data in place. So I have a good example, I think, of a question that's currently pretty much impossible to answer without this data set, if you're interested. Absolutely.
Starting point is 00:39:33 So I was talking to a researcher about six months ago, and he was trying to answer the question. So if you kind of read it 101, let's get getting started in open source. How do you create a successful open source project? People will tell you it's very important that you have good documentation, right? Like you want to have your API documented.
Starting point is 00:39:53 You want to have a good read me. And he was like, you know what? I've used software where the documentation is really poor, but it's still really popular. And over time I've seen the documentation improve. So his question was, is documentation a crucial component of a project becoming successful,
Starting point is 00:40:13 becoming widely used? And so to answer that question, you kind of need a timeline of every commit on the project. You probably want to know the file path, what was in the file. Let's say documentation in GitHub's world is Markdown, ASCII doc, restructured text. Even just those three extensions
Starting point is 00:40:34 would probably represent about 95% of all documentation. And so you can look at what's code and what's docs, but you can't do that query today you have to as an individual you would have to go and pull down you have to get clone you know thousands hundreds of thousands maybe of repos from github store them locally then write something that would allow you to you know programmatically go through all these git repos building up these all these histories um these histories now in histories are now in BigQuery. So I'm not saying I know exactly how to write that query,
Starting point is 00:41:09 but the data's there, right? It's possible now to answer this question. And I think one of the most exciting things for me about this data set is, you know, I think there is still a huge amount to be learned about how people build software best together. And I think that's not something that necessarily, you know, the really hard questions I think are often best answered by people like the, you know, the computational social sciences,
Starting point is 00:41:36 people who study like how people collaborate and they need really, really big data sets to do these studies. And to date, it's just not really realistic for GitHub to, you know, the API for GitHub's API is just not designed to serve those kinds of requests. It's designed for building applications against. And so I think we're going to see, you know, I think we're going to see a huge, huge kind of uptick in the amount of really data-intensive research
Starting point is 00:42:05 around collaboration and about open source software and about how people best work together and powered by this data set. Yeah, that's very exciting. And as people who are very much invested in watching the open source community do their thing and tracking it over time. I'm excited all the possibilities that are going to be opened up and I even think of just when GitHub Archive came out and all of a sudden we started having cool visualizations and charts and graphs and like people putting answers together that that we did we didn't
Starting point is 00:42:39 know we could ask ask questions about and now we have so much more and super awesome. I think we're going to tee up for our next section is BigQuery itself, because it does seem like a little bit of a black box from the outside. Like, how do you use it? How do you get started? How long do the queries take? You know, there's a free tier, there's a paid tier. I'd like to unpack that so that everybody who's excited about this new data set can, at the end of the show, go start using it and check it out. So we'll talk about that when we get back. Our friends at Fullstack Fest are putting on a week-long Fullstack Development Conference
Starting point is 00:43:14 in Barcelona, September 5th through 9th. The focus of this conference is solving current problems with new and inspiring perspectives. Head to fullstackfest.com to learn more. It's a wide range of topics any full-stack developer would enjoy. Erlang and Elixir, Reactive Programming, HTTP2, GraphQL, NLP-backed bots, Docker, IPFS, Distributive File System, Serverless Architecture, Unikernels, Elm Architecture, The Future of JavaScript, ES6, ES7, CSS4, Relay vs. FileCoreJS, Angular, CycleJS, CSP channels, handling interplanetary latencies, web assembly, mixing React with 3D, virtual reality, and the physical web. It's a full stack fest.
Starting point is 00:44:01 Early bird tickets are available until July 15th. That's coming up soon soon at the end of that day they will no longer be available speakers from all over the world from companies like twitter netflix microsoft erlang shopify and to top it all off enjoy a week in sunny barcelona tickets again are available for the whole week only the back-end days or only the back-end days, or only the front-end days. So you have your choice to kind of go all week or go to back-end or front-end days. And for our listeners, save 75 euros before tax after July 15th if you miss this early price. Use the discount code, the changelog.
Starting point is 00:44:39 Once again, head to fullstackfest.com. All right, we are back talking about BigQuery, GitHub, public datasets, all that fun stuff. Felipe, tell us about BigQuery. How do you use it? So BigQuery is a hosted tool by Google Cloud. So you just go to BigQuery.cloud.google.com. And basically it's there open,
Starting point is 00:45:07 ready for you to use to analyze any of the open data sets or to put your own data. Just in case you're wondering if it's only for open data, nope. You can also load your private data and it's absolutely secure, private, etc. But with open data, you can just land there and start query.
Starting point is 00:45:29 Now, you will probably need to have a Google Cloud account. So if you don't have one, you'll need to follow the process there to create and open your Google Cloud account. But then you will be able to use BigQuery to analyze data and everyone can analyze up to a terabyte every month without needing a credit card or anything. So you can choose which data set to start with. I wrote a guide about how to query Wikipedia slots.
Starting point is 00:45:59 Those are pretty fun. But in this case, if we want to analyze GitHub, we can go to the GitHub tables to find some interesting queries. We have the announcement on the GitHub blog, on the Google Cloud Big Data blog. I'm writing a Medium post where I'm collecting all of the other articles
Starting point is 00:46:20 I'm finding around. So you will want some queries to start with. And then start the questions the question is what questions do you want to ask uh you have these tables that our phone described at the beginning one of the most interesting tables is the one with all of the contents of github so this has all of github open source github files that are less than one mega that are not binary at less than one megabyte and that table has around 1.7 terabytes of day and that's a lot especially if you're using your free quota. If you query that table directly, your free quota will be out immediately. So thinking of that, we created at first a sample table with all that much smaller. Let me check the size right now. I have it with me.
Starting point is 00:47:26 But I'll tell you the exact size in a minute. The thing is you can go to this table and you can run the same queries you would run on the full table, but your allowance, your monthly terabyte will last way more. You can choose to run all your analysis there on the sample and then bring it back to the mega table.
Starting point is 00:47:52 But it all depends what questions you are asking. And I also created, this is outside the main project, but in my private space that I'm sharing, I created an extract of all of the JavaScript files, but in my private space that I'm sharing, I created an extract of all of the JavaScript files, all of the PHP files, Python, Ruby, Java, Go. So if you're interested in analyzing Java code, you might be better off starting from my table.
Starting point is 00:48:26 And then you can start asking the questions you might have from at least start with one of these sample queries. A couple of things. Let me interject here. So all of these things that Felipe is referencing, we will have linked up in the show notes. So if you're listening along
Starting point is 00:48:40 and have the show notes there, pop them open. We'll have example queries and all the posts, both from GitHub and Google, published around this. So that's probably a good place to go. You mentioned your monthly allotment or your threshold. I can't remember the exact word, but your quota. Yes. Let's talk about that. So BigQuery is free up to a certain point, and then you start paying. And the reason for this example data set, which is smaller, is because if you're just
Starting point is 00:49:07 going to run test queries against the whole GitHub repos data set, you're going to hit up against that pretty soon. Can you talk about that? I think there's some, even as a user, like we have ChangeLog Nightly going and have for a couple of years now. We've never gotten charged, so I guess we're inside of our quota, but I don't have much of an insight into what all we're doing.
Starting point is 00:49:29 How does the payment work in the quota? Is it based on how much data you've processed? Exactly. So BigQuery is always on, at least compared to any other tool. You don't need to think about how many CPUs or how much RAM or how many hours
Starting point is 00:49:46 you're running it. It's just on always. And then the way it charges you is by how much data you're querying. So it looks at the tables you're querying,
Starting point is 00:49:58 specifically at the columns you're querying and the size of those columns. And that's basically the price of a query. So if a column is one gig or something like that, or, you know, half a terabyte, then you're essentially being charged a query at half a terabyte.
Starting point is 00:50:16 Exactly. So today the price of a query is $5 per terabyte queried. So if a column is one gigabyte, divide $5 by a thousand, and that's the price of your query, the cost of your query. So assume I'm right. I got my question asked.
Starting point is 00:50:35 So I have my, I've used the GitHub examples, the data set or the subset for my development. And I have a query here. In fact, from some of your guys' examples, here's one. Let's say it's the how many times shouldn't it happen one that Will talked about earlier,
Starting point is 00:50:52 which it appears that this thing pulls from githubrepos.samplefiles and it joins githubrepos.samplecontents. So every time I actually run that in production, it's going to add up the size of those two particular things and then charge me once per time I hit the big query. Is that right? Exactly. Every time you write a query. In fact, when you write a query before running it, you can see how much data that query will process.
Starting point is 00:51:26 Oh, that's handy. Yeah, because basically it's a static analysis. You have the columns you mentioned from the tables you mentioned, and then the query knows basically the exact price. I'm just thinking outside the box because you all have AdSense and the way people buy ads
Starting point is 00:51:41 that you may actually have a bidding war at some point or not so much a bidding war but you might be able to have something where i want to query these things several times a month but i have a budget and i'll query them if it's under this budget and you might be able to do those queries if said budget is not met or is is. That seems like something in the near future, especially as we talk about automation around this. Yeah, so the idea here is to make pricing very, very simple. If you're able to know the price of your query before running it,
Starting point is 00:52:15 then you can choose to run it or not. And it's essentially about, instead of querying the whole dataset, instead of querying the full contents table with that 1.7 terabytes, let's just query all of the Java files. So, and if I have not created, if someone has not created the extract you need,
Starting point is 00:52:37 maybe the best first step on your analysis is extracting the data that you want to analyze. Do you feel like you'll have any pushback at all for uh i guess a higher free threshold for open data sets because there's always this um this sort of push or this uh this angst i guess where if you're doing something for the good of open source or something that's free for the world or just analysis that someone is always like, hey, can you make this thing free for open source? And if you're doing, you know,
Starting point is 00:53:13 if you're since where this shows specifically about this partnership and the GitHub public data set being available, what are your thoughts on the pushback you might get from the listeners who are like, this is awesome, I want to do it, and can I have a higher limit? So at least what makes me pretty happy is that we are able to offer this monthly quota to everyone. It's not, it doesn't stop once.
Starting point is 00:53:40 It's not for the first 10 days. You have access to this at least until, I don't know, every month you will be having this terabyte back to run analysis. And that's pretty cool on one side. And then, well, if you want to consume a lot of resources, at least you are able to, instead of having to wait one month, at least you have the freedom to pay for, to have more resources, even more resources available.
Starting point is 00:54:11 And just a context set, because I agree like, you know, in cloud, we're continually getting feedback and then just based on competition to reduce pricing and make things more optimized and efficient and cost-effective.
Starting point is 00:54:25 And so just where we were just a moment ago really was without BigQuery is that in order to do analysis on any data set, you would have to go find that data. You would have to download that data and possibly pay some sort of egress. You'd have to upload it into your own storage on whatever cloud provider you're using. And there's a cost there. And then you'd have the consumption for doing any query on it. So it's a valid question. But right now, we've already reduced the cost for public users.
Starting point is 00:54:59 And I fully expect that, yeah, people will be asking for more higher limits on querying the data. And I just expect we'll continue moving and making things cheaper and more efficient for users. I think the steps you just mentioned there that, you know, just for one, telling people, you know, this is what it actually takes to do this without BigQuery. And now the BigQuery is here. We've taken so many steps out of the equation. You've obviously got Google Cloud behind it, the supercomputer that we talked about in part one of the show, basically, having access to that. And I think just sort of helping the general public who is going to have
Starting point is 00:55:35 clear interest in this, especially listening to this show, like everyone who listens to this show is either a developer or an aspiring developer. So they're listening to this show with a developer's mind, so to speak. And so they're thinking, yeah, if I want to use this, how can I use it? But knowing the steps and knowing the pieces behind the scenes to make this all possible definitely helps connect the dots for us. And what's really great about working at Google is that this is really core to our mission.
Starting point is 00:56:03 I mean, Google's core mission is to organize the world's information and make it universally available. And so for the public data program, this is a natural extension of that mission within the cloud organization. And I see these public data sets, plus tools like BigQuery, as, and I know this word gets overused, but it's, you know, democratizing information even further. You know, we've all been these unknowing, or knowing, or involuntary collaborators in providing public data. And so I like the idea that we all have equal access
Starting point is 00:56:37 in these public data programs, and we're now getting meaningful access to that data. And so today we're doing a better job at making the data available for download, right? Like data.gov, for example. Public data is pretty accessible now. And so I think the next step, though, and going back to that comment I made about meaningful, is to provide the tools that lower that ramp even further and give all these collaborators meaningful access. You know, so we're starting with SQL, which, for most developers and marketers,
Starting point is 00:57:17 is a pretty good level of entry for querying enormous sets of data. But, you know, I think we're going to end up with machine learning-powered speech queries, right? Where Felipe, Arfon, and I aren't talking about these queries that you have to construct while managing your limits on the data. We're actually telling you to just ask the machine, the data set, a question.
Starting point is 00:57:42 Let's continue on the practical side of how you get that done. You mentioned the console, which is where you can write your queries and test your queries and run them. There's other ways that you can use BigQuery as well. Once you have those queries written, for instance, with ChangeLog Nightly, we're not going into the console and running that query every night and shipping off an email. It's all programmatic.
Starting point is 00:58:04 Can you tell us what it looks like from the API side, like how you use BigQuery without using the console? Yeah, so BigQuery has a very simple-to-use REST API for people that want to write code around it. And now we have a lot of tools that connect to BigQuery. Tableau is one of the big ones. Specifically in open data, we have a partnership
Starting point is 00:58:31 with Looker. So some of the public datasets that we are hosting with Will have Looker dashboards built over them. I love Redash for writing dashboards, and that's dashboard software
Starting point is 00:58:48 that was not created for BigQuery at all, but it was open source. People loved it. People started sending patches so it connected to BigQuery. So now you can use Redash to analyze BigQuery data.
Starting point is 00:59:03 I just love using that one. The new Google Data Studio also; it's a pretty easy way to just create dashboards. I'm sharing one of these dashboards specifically
Starting point is 00:59:18 for GitHub, this GitHub data set, too. So yeah, you don't need to know SQL. I just love SQL, but you can connect it to all kinds of tools, and also to other platforms like Pandas or R, et cetera. It's all about, once you have a REST API, you can just connect to anything.
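As a sketch of what that programmatic path can look like, here is roughly how a scheduled job might run a query with the google-cloud-bigquery Python client, which wraps that REST API. The query itself is just an example against the languages table in the public GitHub dataset:

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials come from the environment

# Total bytes of code per language across all of the mirrored repos.
sql = """
SELECT l.name AS lang, SUM(l.bytes) AS total_bytes
FROM `bigquery-public-data.github_repos.languages`,
     UNNEST(language) AS l
GROUP BY lang
ORDER BY total_bytes DESC
LIMIT 10
"""

for row in client.query(sql).result():  # blocks until the job finishes
    print(f"{row.lang}: {row.total_bytes / 1e9:.1f} GB")
```

From there it's ordinary code: a nightly cron job could render rows like these into an email, which is presumably the shape of what ChangeLog Nightly does.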
Starting point is 00:59:38 One last question on this line of conversation. We talked about how long it takes to process, to get the data into BigQuery: it was two weeks, then it was a week, then it was 20 hours, and now it's six hours. How about querying it? We haven't talked about
Starting point is 00:59:56 what to expect if we're going to do the GitHub full Monty, like this query for emoji used in commit messages, for instance, however many terabytes that covers. Are we talking
Starting point is 01:00:11 like three seconds, 30 seconds, minutes? What do we expect? Depends a lot on what you're doing. Here we're really testing the boundaries of BigQuery. You can go way beyond doing just a grep.
Starting point is 01:00:29 You can, I don't know, look at every word in every piece of code, split it, count it, group it, run a regular expression. So some queries will take seconds. I love those. I love being able to just go on a stage, start with any crazy idea, code it, and have the results while
Starting point is 01:00:49 I'm standing out there. But sometimes there are queries that are more complex, that involve joining two huge tables, where you start reaching BigQuery's boundaries. And when you're reaching the boundaries, it's good to limit how much data you query, for example.
Starting point is 01:01:06 Oh, I have this pretty interesting query that might take two minutes. What if, just to get very quick results, we sample only 10% of that data, or 1%? Things start running a lot faster. And it's really cool. On one hand, you feel that, oh, I'm reaching one of the boundaries. But at the same time, you feel that, wow, I'm really doing a lot here.
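The sampling trick Felipe describes can be as simple as a filter on a fingerprint of each row's id, so the sample is deterministic from run to run; a hypothetical sketch:

```python
# Keep roughly 1 row in 100 by fingerprinting the row id. The per-row
# work (regexes, parsing, grouping) then runs over ~1% of the rows, so
# results come back much faster while you iterate on an idea.
sampled_sql = """
SELECT id, content
FROM `bigquery-public-data.github_repos.contents`
WHERE MOD(ABS(FARM_FINGERPRINT(id)), 100) = 0
"""
```

Note that BigQuery still bills by the columns scanned rather than the rows kept, which is one reason the much smaller sample_contents and sample_files tables exist alongside the full ones.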
Starting point is 01:01:35 Let me see if I can run a query now while we talk. I'll come back when I get my query. Felipe, maybe you can multitask. I'm not sure, but let's test you out. Earlier in the show, actually during a break, we talked about some things you have affinities for, what the possibilities of BigQuery and all these datasets being available might offer. And one of them you mentioned was being able to cross-examine datasets. So for example, you had said how weather may affect,
Starting point is 01:02:07 I think it might've been pushes to GitHub or pushes to open source or something like that, but basically how you're able to capture various large public data sets, like traffic patterns and weather, and relate them to the ability to deploy code or push code to GitHub. But what other ideas do you have around this, and what are some of your dreams for cross-examining datasets?
Starting point is 01:02:24 So just to answer the question, because I told you I was going to come back with this. There you go. I copy-pasted one of the sample queries. In this case, we are looking at the sample tables, with the sample contents. This basically has 30 gigabytes of code. I'm looking only at the Go files in this case, and I'm looking at the most popular imports for Go.
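The query itself isn't read out on air, but a rough reconstruction against the sample tables might look something like this; the regular expression is a simplification that only catches single-line imports, not import (...) blocks:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Most popular imports across the Go files in the ~30 GB sample tables.
sql = r"""
SELECT imp, COUNT(*) AS n
FROM (
  SELECT REGEXP_EXTRACT_ALL(content, r'import\s+"([^"]+)"') AS imps
  FROM `bigquery-public-data.github_repos.sample_contents`
  WHERE sample_path LIKE '%.go'
) AS t, UNNEST(t.imps) AS imp
GROUP BY imp
ORDER BY n DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.imp, row.n)
```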
Starting point is 01:02:52 And basically this query over 30 gigabytes ran in five seconds. Not too shabby. That's fast. Yeah, that's how cool things get. Yeah, so going back to dreams: just seeing data in BigQuery, seeing people share data here,
Starting point is 01:03:10 whetted my appetite for how I can join different data sets. For example, something I ran last year, when I got all of Hacker News inside BigQuery, the whole history of comments and posts, was to see how being mentioned on Hacker News affected the number of stars you got on GitHub. Oh. Yes, I can send you
Starting point is 01:03:36 that link too. Or you could also have the public data set of the Change Log, and when we release new shows, see how popular that project might get. Exactly. Ooh. Oh, yeah, that would be cool.
Starting point is 01:03:48 So we can see all these things moving around the world, the pulse of it, and how each one affects the others: Reddit comments, Hacker News comments, the Wikipedia page views. And you can see the real effect on code, on what will be happening on GitHub, on the stars, on how things start spreading around, and the ability to link these artifacts; to add weather, like, oh, do people code under good or bad weather?
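A cross-dataset join of that flavor is just SQL once both sides live in BigQuery. Here is a hypothetical sketch pairing Hacker News submissions that link to GitHub with star (WatchEvent) counts from the GitHub Archive dataset for a single month; the table names and the URL-parsing regex are assumptions, not Felipe's actual query:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = r"""
WITH hn AS (
  -- repos linked from Hacker News submissions
  SELECT REGEXP_EXTRACT(url, r'github\.com/([^/]+/[^/?#]+)') AS repo,
         COUNT(*) AS mentions
  FROM `bigquery-public-data.hacker_news.full`
  WHERE url LIKE '%github.com/%'
  GROUP BY repo
),
stars AS (
  -- star events for one month, from the GitHub Archive dataset
  SELECT repo.name AS repo_name, COUNT(*) AS star_events
  FROM `githubarchive.month.201606`
  WHERE type = 'WatchEvent'
  GROUP BY repo_name
)
SELECT hn.repo, hn.mentions, stars.star_events
FROM hn
JOIN stars ON hn.repo = stars.repo_name
ORDER BY stars.star_events DESC
LIMIT 20
"""

for row in client.query(sql).result():
    print(row.repo, row.mentions, row.star_events)
```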
Starting point is 01:04:13 Right. Let's extend that a bit then. Another question we have for you, and this is more for all of you, not just you, Felipe. Keying off of this topic, what would you like the community to do as a result of this? You have some pure love for cross-examining data sets, things like that. And as you can hear, there's a crazy storm here in Houston.
Starting point is 01:04:41 You heard that lightning there. The hatches are being battened down now. My wife, she's out there taking care of it. I got to go join her soon, so maybe the show will end eventually. But in between now and then, what would you like the community to do? So you got the listening ear of the open source world
Starting point is 01:04:57 hearing you guys talk about this stuff now, all these data sets being available. Will, maybe at some point you could talk about some other data sets that might come to play here as well to fuel this fire, but what are your dreams for this? What do you want the community to do with it? I'll go. So I would love to,
Starting point is 01:05:14 I mean, one of my favorite projects that uses GitHub data, you know, open source data from GitHub is libraries.io, and I know you had Andrew on a few episodes ago. So I think there's still, I think, a huge opportunity to lower the barrier to entry
Starting point is 01:05:32 for people into open source. And so I think part of that is, you know, maybe product changes and improvements to GitHub. But there are really interesting projects out there, like First Pull Request and Up For Grabs, with low-hanging-fruit issues that are easy for the community to work on.
Starting point is 01:05:52 I'm convinced that in this data set are the answers to questions like what makes a welcoming project for people to come and work together. Combining that, we've got everything that everybody's ever said to each other, all of the code that's been written; you can run static analysis tools on that code to look at the quality of that code,
Starting point is 01:06:21 maybe how approachable it is. There's just, I think, a missing piece right now that if I am a 20-something CS graduate and I can program like crazy, but I've never participated in open source and there's lots of these people, or maybe I'm just somebody who's just got my first computer and I've heard about open source and I want to get stuck in.
Starting point is 01:06:42 I think there's a missing piece right now, in that we're not always connecting the supply, in terms of the talent that's out in the world, with the opportunity of projects. Everyone wants more contributors. Everybody wants people helping to build software with them together. And so I'm really excited to see what the community is going to do around those topics. Because you think about what Andrew's done, I mean,
Starting point is 01:07:09 that's a really good example of stepping in that direction. But this enables richer, more intelligent uses of that data for strengthening the open source ecosystem, and that's where I think the big opportunities are. And ideas are free, right? There's money to be made doing that. If somebody wants to go and build companies that solve that problem, I think that's a genuinely interesting problem to solve. Yeah. Lots of ideas come to mind for me on that. But on the note of Andrew, I think with libraries.io,
Starting point is 01:07:49 he's actually querying GitHub's API directly. So in this case, he can actually go to BigQuery and get the same data, maybe faster. He might have to pay a little bit for it, but he may not have to hit rate limits or things like that or just actually have a much richer ability to ask questions of GitHub versus the API. Exactly, yeah.
Starting point is 01:08:10 Cool. What about, Felipe, what about on your side? Or Will, on your side, any dreams? For me, I like comparing this with the story of Google. Google, for me, is the biggest company built on data. Basically, you need data, tools, ideas. Data for Google was collecting the whole World Wide Web at that moment. Collecting it was not easy, but you also needed the tools to store it, analyze it. And then you needed ideas. Like a lot of companies at that time, there were
Starting point is 01:08:39 a lot of web search companies that had all this data, had a copy of the web, a mirror of the web inside their servers. But the ideas that Google had of, hey, let's do page rank. Let's look at the links between pages to rank our searches. That was huge. So I look at the same, I'm looking at the same right now
Starting point is 01:08:59 with this and other datasets. We have the tooling. Tooling might be BigQuery: BigQuery gives you the ability to analyze all of this, but you can create tools above this. I'm looking forward to seeing more static code analyzers that will run inside BigQuery.
Starting point is 01:09:18 You need ideas. That's where I'm looking for the world to bring new ideas, new ways to look at this data that we're making available. And I'm looking out for data. We're making a lot of data available in BigQuery, and I would love people to share more. And that's why we have Will here also to help,
Starting point is 01:09:39 to bring, if you have an open data set, if you want to share data, instead of just leaving a file there for people to download and take hours to download and then analyze on their computer, et cetera. If you share it on BigQuery, then you make it immediately available
Starting point is 01:09:54 for anyone to analyze and then to join with other data sets. So for me, that's... Well, since you mentioned Will, there's definitely one subject that I wanted to save closer to the end here, which is talking to you about the datasets that you're... I mean, this is mostly around the partnership
Starting point is 01:10:15 with GitHub and this dataset, but what other datasets, as Felipe had mentioned, do you have your eyes on? What hopes do you have there? Yeah, well, what I'm focused on right now is trying to get data sets that address that accessibility issue I was telling you about earlier, like a lot of the data.gov stuff: Medicare data, census data, some of the climate data. And what I find interesting about this is that this data has been collected
Starting point is 01:10:43 for decades, and so the schemas around this data were designed well before we even thought about big data challenges, much less SQL; these are pre-SQL challenges, right? We're talking prior to the seventies. And so the challenge here is taking a lot of this data, which is coded, and truncated because at the time there were limitations on characters and everything else, and getting all that coded data, which is technically available for download by the public but not usable. We're planning on onboarding some of the data
Starting point is 01:11:23 from the government catalogs, like the census data, Medicare data, patent data from both the US and Europe, and then some more of the weather-related data. And it's a big challenge, because a lot of this data is decades old and was designed at a time before there was even SQL or big data. So it's heavily coded, and the challenge is to decode that data, which requires resources, and then structure it in a way that fits well into BigQuery. And then Felipe can take it from there to the community and construct all sorts of interesting queries, and address that accessibility challenge I was talking about earlier. Yeah. So we've talked about the data sets that are going to be stored as part of this, and there's obviously some motivation on GitHub's side to do this.
Starting point is 01:12:39 So Arfon, feel free to throw a mention in here on this. But I'm kind of curious, to all three of you, whoever wants to share something about this: how does this open the door for other code hosts? SourceForge from back in the day, I think they're still kicking around, I'm not really sure what their status is, but you've got Bitbucket, you've got GitLab. Obviously, having this kind of insight is interesting. So does this open up the door for other hosts? Is this something that's a motivation for everyone to do that kind of thing?
Starting point is 01:13:13 Yeah, I mean, I'll take a stab at that. I actually think that open source software, wherever it is, is hugely valuable. And so I would love to see more open source software available in a way similar to the way we're releasing this data today with Google. So, you know, the more the better as far as I'm concerned. If this were 10 years ago, a lot of open source activity was happening on SourceForge, and there's still stuff up there that's used and still incredibly important.
Starting point is 01:13:49 And of course, people are on Bitbucket and GitLab and other hosts as well. So I would love to see more vendors participating in archiving efforts like this. And I think there's more to be done than simply depositing data. There's also, you know, the way that our API works: Bitbucket has its API, GitLab has its API.
Starting point is 01:14:15 You know, there's differences between all the different platforms, even if maybe many of them are using Git or Mercurial at the kind of base level for the code. So I think there's actually really big opportunities to standardize some of the ways in which we kind of describe the sort of the data structures that represent not only code, but all of the kind of pieces around it, the community interactions, the comments, the pull requests, all of these things. And so I'm aware of a few community efforts.
Starting point is 01:14:46 There's one called Software Heritage. There's one called Flossmole where they try and they've got, for example, all of RubyGems stuff in there and a whole bunch of SourceForge data. I think, you know, I've talked today about some of the things about, you know, empowering the research community around these data sets. I think one of the issues with doing that right now is, I spend most of my time thinking about GitHub, the data that GitHub hosts,
Starting point is 01:15:13 but of course that isn't all of open source. And I think making sure that it's possible for all of software to be studied, I think is going to be really important going forward. So yeah, I think there's a bunch of opportunities there about improving platform interoperability that I don't think many people are talking about right now. And I'd love to see some advancement in that,
Starting point is 01:15:36 because I think it's good for the ecosystem at large. Yeah, I would like to highlight also the technical side. There is a big technical problem, and the question here is: are we able to host all of GitHub's open source code in one place and analyze it in seconds? Well, we just proved that we can.
Starting point is 01:15:56 So let's keep bringing that in. Let's keep pushing the limits. But yes, technically we can solve this problem today. That's a good thing. I mean, obviously, you know, Will, with your help, and Felipe, your abilities to lead this effort, and Arfon, your efforts on the GitHub side of things to be open to this. And I think part of this show is, one, sharing this announcement, but two, opening up an invitation to the developers out there, to the people out there that are doing all this awesome open source and dreaming about all this awesome open source, having this invitation to bring their company's data sets, if there's open data out there, to BigQuery. And so I guess, Will, what's the first step for something like that? You said that that's an open door. Obviously, if 10,000 people walk through the door at once, it's not
Starting point is 01:16:50 a good thing because you may not be able to handle it all. But what's the process for someone to reach out? What's the process to share this open data? Yeah. So they can contact us. And I'm trying to pull up just so I get the, it's on the cloud.google.com site under our dataset page. They can contact us. Where is that email? I will give that email to you so you can put it in your accompanying doc.
Starting point is 01:17:18 But I would also encourage them to reach out to Felipe on Reddit or on the Medium post and just get a hold of either of us that way. We'll have that Medium post in the show notes. So if you've got your app up or whatever. I just got it. It's bq-public-data at google.com.
Starting point is 01:17:44 Yes, bq-public-data at Google. I would like to add that on the technical side, if tomorrow 10,000 people want to open data sets on BigQuery, that's completely possible. Anyone can just go and load data into BigQuery and then make it public. What we're offering with this program is support: having your data set publicized and showcased, and taking care of paying the
Starting point is 01:18:09 hosting price. But you can just go and do it yourself. Working with us is cool, but you don't need to go through a manual process. You can go and do it. That's an excellent point. And to be clear, you can upload your data and then put ACLs on it to make it public. And then anybody that queries that data, you're not going to be charged for their queries. Gotcha. That's good then.
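For the do-it-yourself route, the ACL step Will mentions looks roughly like this with the Python client, assuming you have already loaded your tables into a dataset; the dataset name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my_project.my_open_data")  # placeholder name

# Append a READER entry for all authenticated users: anyone can then
# query the tables, and their queries are billed to them, not to you.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="specialGroup",
        entity_id="allAuthenticatedUsers",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```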
Starting point is 01:18:36 So you'd mainly go that route if you have a big data set and you want some extra handholding, so to speak: email the address you mentioned, and we'll also copy that down and put it in the show notes. But it's possible to do it on your own, as you mentioned, through the BigQuery interface, making it public and not being charged. That's a good thing. Well, let's wrap up, because as I mentioned, there's a storm here. We did have a quick break there because of the storm and my internet outage for about five minutes. So thanks for bearing with that. And listening audience, you probably didn't even hear it, because we do a decent job of editing the show and making things seamless when it comes to breaks like that. But this is time for some closing thoughts, so I'll
Starting point is 01:19:15 open it up to everyone, whoever wants to take it, for some closing thoughts on the general things we talked about here today. Anything else you want to mention to the listening audience about what's happening here? All right, I'll go. Okay, so I mean, I'm incredibly excited to see this data out in the public. I think we talked a lot today about public data, but, you know, there's sort of open data, but also useful data, usable data.
Starting point is 01:19:41 And I think this is, you know, the first time that you've been able to query all of GitHub, and I think that's an incredible opportunity for studying how people build software, for understanding what it means for projects to be successful. Honestly, I think the most exciting thing for me about this is that the data is now available. It's out there, and I think the possibilities are near limitless. I can't wait to see what the community does with this data set. Well, Felipe, anything to add to close? I would love to add: for anyone analyzing data, it doesn't need to be open data. I love open data, but anyone that's analyzing data today
Starting point is 01:20:25 that is suffering, waiting for hours to get results, having a team, managing a cluster, babysitting a cluster overnight, try BigQuery. Things can be really fast, really simple,
Starting point is 01:20:39 and that will open up your time to do way more awesome things. Awesome. Well, I can definitely say that we've been enjoying BigQuery. But go ahead, Will, you had something you wanted to add? Oh, I just wanted to add to what both Arfon and Felipe were saying around communities. What I'm really looking forward to is seeing the community participate in developing interesting queries. And I'm sure there are data sets out there
Starting point is 01:21:05 that are interesting that I'm not aware of. And I would love to hear about those and try to get those more accessible. One more curveball here, just at the end of the show. It occurred to me during the show that over the years of the Change Log, we've had a blog, we've had this podcast, we've got an email,
Starting point is 01:21:22 and we've talked several times about open data, public data being open sourced on GitHub. And it now occurs to me that all of that effort can now be imported, either by way of GitHub, or just directly into BigQuery. And so if you're out there and you've got a data set you've open sourced on GitHub, go ahead and go to BigQuery and put it there and make it public there. That way people can actually leverage it, because I can't even count on my hands how many times we've covered open data in all the ways we've talked about on the show today. Putting it on GitHub is great, but making it useful, not that GitHub isn't useful, means putting it on BigQuery and opening it up for everybody. That, to me, seems like the cherry on top. Obviously, you know, we've got a couple links
Starting point is 01:22:13 we're going to add to the show notes. We've got this announcement, obviously, between this partnership and the GitHub data set being available in this new way. The blog post being out there, we'll link those up. So check the show notes listeners for that. But I just want to say thanks to the three of you for, one, your efforts in this mission and caring so much, but then, two,
Starting point is 01:22:34 working with us to do this podcast and share the details behind this announcement because we're definitely timing this, the release of this show for all the listeners right around, if not the same day, like the same time frame, maybe the day after. I know there's been a couple posts already shared out there, so I'm not sure exactly on perfect timing, but we're aiming for this to be right around the same time. So announcement at CodeConf for GitHub.
Starting point is 01:22:59 But we're trying to work together to go deeper on this announcement, share the deeper story here, and obviously get people excited about it. So I want to thank you for working with us on that. It's an honor to work with you guys like this. But that's really all we wanted to cover today. So listeners, thank you so much for tuning in. Check the show notes for all the details we talked about in this show. But, fellas, that's it.
Starting point is 01:23:22 So let's say goodbye. All right. Thanks very much. It's been really fun to talk in depth about the project today. So thanks for having me on. Thank you very much. I love being here. I love being able to connect with everyone here at the Change Log.
Starting point is 01:23:35 Yeah. Thanks for having me as well. It's been a good conversation. And with that, thanks listeners. Bye. We'll see you next time. I'm out of here
