Drill to Detail - Drill to Detail Ep.41 'Developing with Google BigQuery and Google Cloud Platform' With Special Guest Felipe Hoffa
Episode Date: October 30, 2017. Mark is joined in this episode by Google Cloud Platform Developer Advocate Felipe Hoffa, talking about getting started as a developer using Google BigQuery along with Google Cloud Dataflow, Google Cloud Dataprep and Google Cloud Platform's machine learning APIs.
Transcript
So my guest this week is Felipe Hoffa, someone whose name many of you will recognize from his posts about Google Cloud Platform on things like Stack Overflow and on Twitter and Reddit and so on, and of course his role as a developer advocate at Google.
So, Felipe, it's really good to have you on the show, and thank you for coming on as a
guest.
Great to be here, finally.
Thank you for inviting me.
Thank you.
So, Felipe, do you want to just kind of give us a bit of a background really for yourself
and how you ended up working at Google and I suppose the route you had coming out of
Chile and over to the States and so on?
Exactly. I grew up in Chile and had most of my professional life there, until seven years ago, when I got my Google interview. I moved to San Francisco and started at Google six years ago as a software engineer.
And then two years later,
someone thought that I would be a great developer advocate.
They invited me.
And yes, I've been doing this since then.
Okay.
And so what is a developer advocate at Google Cloud Platform?
And what do you do there?
What's your kind of role and your focus and so on?
For me, a developer advocate is a software engineer with a license to speak. So basically my job is to tell other software engineers, data scientists, doers, the cool things they can do with our platform. My main topics are big data and especially BigQuery, so my mission is basically: how do I communicate to people that BigQuery is a product they can use, and that it will solve a lot of problems for them?
What would be quite good at the start, really, would be to just paint a picture of the landscape of products that run on Google Cloud around big data and analytics, just at a high level, and then we'll go into some of the details later on.
Yep.
So to start off, let me just go back to my time in Chile, when I was a software engineer there. First, I was pretty impressed when we got the first cloud offerings, just getting virtual machines on Linode and other providers.
So I started using them.
Then I was pretty happy when I saw Amazon getting in the game.
So the startup I was working at, I started developing our services there.
And then one day I realized there was this thing called App Engine from Google
where the whole serving infrastructure was managed.
I was able to just write my scripts,
write my code,
and let Google host it
without me having to worry about that.
That was like, you know,
eight years ago, nine years ago now.
So as a software engineer outside of Google,
I evolved from renting my own VMs to learning a new language.
At that time, I was doing mostly Java and Ruby.
But I saw this App Engine offering, and it really resonated with how I wanted to do things.
So at that time, App Engine worked only with Python, so I learned Python,
and it really helped me to get things done.
And a lot of Google Cloud platform offerings come from that world where the question is,
how do I get Google to do most of the work for me?
I can focus on adding my logic, adding my ideas, not needing infrastructure.
I don't need to worry about infrastructure anymore.
Yeah, definitely.
I mean, that was the thing that struck me.
So I came into the world of this through a startup I'm working at, and I was very much used to running servers on premise. And even when we ran stuff in the cloud before, you know, you were managing VMs, and it was maybe a cluster
of VMs that you could spin up using a kind of a tool that would spin them up and bring them down
and so on. But you were still effectively working with VMs, and you're having to deal with things
like scaling, you're having to deal with, you know, just deal with kind of faults and dealing with kind of capacity and so
on. But I guess a common thread in Google Cloud Platform's products in this area is that typically they're what's called serverless, rather than just infrastructure as a service. I mean, just, I suppose, spell out what that means in terms of scaling, in terms of where the time goes.
And as a developer, you know, what does that mean in terms of
where your focus is in this kind of world, really?
Yeah, so my focus goes into developing the ideas that I have.
Like, I'm extremely lazy, as good software engineers should be.
So we really want to automate everything.
If someone that is not me can automate everything
that can be automated, I want to leverage that.
So the origins of Google Cloud Platform come from that. From the beginning, we wanted to do completely managed solutions.
And as we did that with App Engine, when I was still not working here,
but then I saw, I was still outside when I saw the first announcement for BigQuery,
which, well, as you know, I'm pretty linked to now.
And with BigQuery, we have a similar story.
You have data to analyze.
You can load it into BigQuery.
You can run your queries.
And that's it.
There is nothing to set up.
There are no servers to turn on.
There's not much to tune.
You just have data, you have queries, and this will solve that problem for you.
Okay. I mean, as we talk, really, I want to talk to you later on about porting, say, a data warehouse workload into this environment, and I guess some of the practicalities of that. But also, BigQuery isn't the only database service that Google offers in this area. You've got things like Cloud SQL and this thing called Spanner. I mean, just maybe outline what those are, and how they differ from and how they're similar to BigQuery, really.
Yes.
So, between BigQuery, Cloud SQL, and Spanner. Let's go back to the App Engine world, where I was able to write my scripts and have them completely managed, scaling included.
At the same time, as we were doing services like that, not everyone was ready to jump into this new world.
And a lot of people just wanted VMs, virtual machines. And Google also started offering them for people that just need
raw machines. Now, in big data, we have a similar story. Some people might want to jump to BigQuery
and do everything there, but some might want to keep using the tools that they know, that they're experts in, and also get help running them.
So Cloud SQL solves that problem.
Cloud SQL is our managed either MySQL or Postgres servers.
So if what you need is a MySQL server,
if what you need is a Postgres server,
we can help you run it. We will automate a lot of the infrastructure: backups, scaling, etc. But at the end of the day, what you're getting
is the MySQL or PostgreSQL servers that you probably already know that are compatible with
other things that you use.
So do they autoscale in the same way as things like BigQuery?
I mean, I think, is it the point, again,
that it's completely elastic sort of infrastructure?
Or is it more a case of they run as a managed service,
but they have more limits and so on?
How does that kind of work?
Yes.
So they are a managed service that is not as magic as some products that are just
natively magic.
With Cloud SQL, we try to do our best with the products that already exist.
But they're still MySQL, they're still Postgres, so there are some inherent limitations to those products,
while at the same time being pretty easy to pick up and use
and start building your applications.
Yeah, I mean, I actually use Cloud SQL, because there are some tools out there, some BI tools, like, say, Superset, for example, that don't yet natively connect, in my experience, to things like BigQuery.
So I tend to kind of like ship data into your Cloud SQL environment and then use that to
connect.
So the integration between the two is quite good, but it is effectively kind of like MySQL
running in that environment.
But what about Spanner?
I mean, I know I'm very conscious Spanner isn't your kind of product area, but just
at a high level, there's been a lot of kind of talk about Spanner and maybe some people aren't aware of it.
What is Spanner, really?
And what's the background to that?
So Spanner comes as part of Google's history
of trying to solve our own data problems.
With BigQuery, the question we are trying to solve is: how do we run full table scans, without indexes, for analysis, in an extremely fast way?
Now, with Spanner, what we needed is a database like MySQL,
but that would be able to grow to a Google scale.
When you're growing MySQL, when you have multiple nodes and endless requests, or when you use a NoSQL database instead, you start finding some limitations: either eventual consistency or limits on scaling up. Spanner was our answer to all of these problems. How do we get a SQL database that we can scale massively?
And the most magic part: what people had to decide when they were growing into these new NoSQL databases was, how do we handle eventual consistency? How do we handle partitions in our network, from the famous CAP theorem? And with Spanner, we feel we solved that problem.
In broad terms, I mean, how was that solved? Or was it kind of Google magic that's described in the various white papers that are out there? Was there a key innovation there, do you think?
Yeah, well, one of the key innovations there is TrueTime: the ability to keep all of the servers synchronized, so that time is being accounted for accurately in every server. That's a very hard problem.
That's a very hard problem, isn't it?
It's one of those classic things in
clustered systems and whatever
that actually getting synchronized time between the services
is an unexpectedly hard thing to do, isn't it?
Yes, it's a super, super hard problem
unless you can bring hardware into the mix.
And that's part of the Google magic here,
is our ability to bring atomic clocks
that synchronize all of these servers and give the system an accurate picture of what time it is, and in what order the transactions in a distributed system are happening.
Okay, okay. So what I'd like to do next, really, without going through every product, is to walk through what would be involved in, say, moving an on-premise workload, a data warehouse workload, into BigQuery. I'm not going to get into the details of individual step-by-steps, but some of the conceptual things there, really.
Before we do that, I just thought it would be useful, because some people wouldn't really know: what is Google Cloud Platform? They might know Google from search and from docs and so on. So just maybe take a couple of minutes: what is Google Cloud Platform, how does it relate to the internal stuff Google does, and, in a broad sense, what's the kind of differentiator for it?
Yes. So, well, you know, there are other alternatives to Google Cloud. I've used them too.
At Google, what I've seen that really makes me happy,
even before joining Google, was the ability of Google
to manage to solve problems as magically as possible for you
with App Engine, with BigQuery.
And at the same time, when Google entered this world,
we needed to catch up at first with traditional services
that other clouds were offering.
And we came from a position of pretty awesome internal tech. Our networking abilities are really impressive,
and all of these ideas make us believe that we can be pretty competitive and offer
differentiating features. Yeah. I mean, my experience,
I mean, again,
coming into an environment here
that was based around Google Cloud,
the thing that struck me was,
I suppose Google in a way,
maybe it reflects their business
to consumer B2C kind of background,
but it tends to be products they build.
So that, you know,
BigQuery is a product
that's a managed service.
It's kind of, you know,
it's kind of no ops.
It's very much kind of finished off
and so on. Comparing it to, say, other cloud platforms that are more, I suppose, components for you to put together, it strikes me that the bits we work with, things like Cloud Dataflow and BigQuery, are finished as products and are a more complete solution. It's almost like there's one solution for a problem that's finished off, rather than, I suppose, lots of takes on it from other providers that are half finished, where it's your job then to bring them together. And this fact that it is finished off, that it is a service you can rely on, that it is virtually no-ops, means, taking our example, that we've got a team of engineers that work on this platform, and they're just innovating all the time.
They're not trying to manage servers, they're not trying to scale stuff up.
They're still very technical, and it's still a lot of innovation,
but it's not around trying to keep servers running.
It's around building a service on that
that is itself innovative in terms of a business.
Exactly.
So many of our products come from our own needs of how do you run Google.
The scale of Google, the company, is impressive,
like everything that we do as consumer products.
Then the problem we've always had to solve is how do we run this?
How do we scale internally?
And that's where BigQuery comes from. That's where the ideas for Dataflow come from,
starting with MapReduce.
MapReduce is the paper that started it all.
Exactly.
So let's think about moving the workload over.
So let's imagine that I am a developer or DBA
that's running an on-premise, let's say, for example,
SQL Server or Oracle Data Warehouse.
And somebody says to them, we're now moving this, or can you test out or learn how to start to run
this kind of a workload on a kind of Google Cloud environment instead? And how would that person
approach that? And let's be more specific. If you are a developer, how would you think about moving the workload, moving the tables?
You know, is BigQuery conceptually exactly the same as an on-premise system, or would
you maybe structure the database differently?
Maybe, what would your thoughts be on that as an initial bit of advice to somebody?
So my favorite way of approaching this is asking people what their pain points are. Like, when I go to a conference and I'm speaking in front of everyone, I ask them, who is working on big data? And not a lot of people are sure if they are working with big data or not, depending on the conference. But then if I ask them, who knows SQL? Everyone raises their hand. And then I ask them, can you keep your hand up if you've ever had a query that took hours to run, days to run? And they have seen this pain. This is a pain that they have felt.
So when I can offer them a solution that will take that pain away, where they can just load everything and run queries in seconds, without going through optimizations or committee meetings about what indexes we should or shouldn't add to our database, that's a really good starting point.
If everything is going well for you,
you don't need to switch anywhere.
But a lot of these products only make sense
when you're feeling the pain.
Same thing with Spanner.
If you're doing well with your MySQL server,
Spanner might not be attractive for you.
But then if you know how hard it is to run MySQL at scale,
if you know how hard it is to run a NoSQL database at scale,
when I'm able to offer you a solution
that will take that pain away,
people really start listening.
Okay, okay.
So let's take a concrete example. I've worked at things like gaming companies in the past where, let's imagine, we're getting transactions, bets and so on, coming through, and they're at the point where the volume of these things coming through is more than the traditional kind of relational database can handle. So they're having to deal with, I suppose, a greater throughput now than they used to have, and they're thinking, I've got maybe a table structure running in, kind of like, Oracle or something, that's got dimensions and facts and so on. Would you suggest that they move that across as it is, or does, I suppose, the distributed nature of BigQuery mean that you might think about data modeling differently?
Yes. So a lot of the ideas behind your current designs come from designing around current limitations. All of the data cubes, etc.: the whole idea is to make things easier to compute later, to be able to process them. With BigQuery, that's not a problem anymore. So why keep those restrictions in your design?
So yeah, my favorite way of telling people to start playing with this, like there's nothing better than getting your own hands into it, is to get an export of their data into BigQuery.
If they can get just one huge dump, a huge file into BigQuery, and they can start running queries.
They can feel the difference.
It's a typical one-day-long, one-afternoon-long proof of concept
that converts people to this.
Yeah, yeah.
So BigQuery is a columnar database, isn't it? So I guess if you come from a world of row-based databases or storage, some things are easy and some things are hard. But does the fact that it's a column store mean anything different? And you mentioned a little while ago there are no indexes. So I guess, what kind of queries work well in that environment, what kind of design works well there, and what would be a wrong one, really? I mean, certainly not having indexes is interesting, really.
Yes. So BigQuery's biggest strength, which is also one of its biggest weaknesses, if you want to see it that way, is that BigQuery can process terabytes of data in seconds. Now, it will also process small data requests in seconds, because everything is optimized to just do a full columnar scan. Usually, you want your database to
reply in less than a second.
Yeah, you have a key,
you want your value, that has to take
milliseconds. BigQuery is not that.
BigQuery is not that kind of database.
So,
if you already
have a solution that brings you answers in less than a second, don't switch
to BigQuery.
But then there's all these problems you have, all these queries that are taking you way
longer than a second.
I'm sure you've run all night processes where you come back the next day to see if it failed
or not.
Well, that's the kind of workflow you bring here.
What I was going to say, actually, there was an interesting one, in that you and I had a conversation on Twitter recently where I'd come across a problem, you know, in my day job, where we had problems joining tables. And I think there's a perception sometimes that BigQuery can't do joins; maybe some people coming into BigQuery wouldn't even know that joins could be an issue. I mean, maybe just talk to us a little bit about why joins could be an issue with BigQuery, but also about a strategy for setting up your table structures, and the way you use BigQuery, that means you'll be successful in that. Because you certainly corrected me on that online, and I think there was an interesting kind of story there, really.
Yeah. So BigQuery has the ability to do joins. When we first released the product, it was not
able to do joins; it just ran full column scans. Then a year later, two years later, we had the ability to join with small tables, where we basically copied the small table to every distributed BigQuery node, and then we were able to join it with the big one.
But today, BigQuery has the ability
to join arbitrarily large tables,
and it does a pretty good job at it.
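The small-table strategy Felipe describes is essentially a broadcast join. A toy sketch of the idea in Python follows; the round-robin partitioning and all the names here are illustrative, not BigQuery's actual internals:

```python
# Toy broadcast join: the small table is copied to every "node"
# (here, every partition of the big table), then joined locally.

def broadcast_join(big_rows, small_rows, key, num_nodes=3):
    # Build a lookup from the small table once; every node gets a copy.
    lookup = {row[key]: row for row in small_rows}
    # Split the big table across nodes (round-robin, for illustration).
    partitions = [big_rows[i::num_nodes] for i in range(num_nodes)]
    joined = []
    for partition in partitions:          # each node works independently
        for row in partition:
            match = lookup.get(row[key])  # local hash lookup, no shuffle
            if match is not None:
                joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 20},
          {"user_id": 1, "amount": 5}]
users = [{"user_id": 1, "country": "CL"}, {"user_id": 2, "country": "UK"}]
print(broadcast_join(orders, users, "user_id"))
```

Because each node holds a full copy of the small table, the big table never needs to be shuffled, which is also why the technique only works while the small table stays small.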
Now, sometimes what happens is when you're doing a join,
sometimes you end up writing a SQL query
that creates an exploding join.
And those are really bad.
Like, let's say you're doing a cross join.
You have a 5 million row table
and you do a cross join with itself.
That cross join can produce 25 trillion rows.
That's a bad idea.
You probably don't want to do that.
But with SQL, it's pretty easy to write queries that do that.
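The arithmetic behind that warning is worth spelling out: a cross join of a table with itself produces the square of its row count.

```python
# A cross join pairs every row with every other row, so a table
# cross-joined with itself yields rows ** 2 output rows.
rows = 5_000_000
exploded = rows * rows
print(f"{exploded:,}")  # 25,000,000,000,000 -> 25 trillion rows
```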
How important is it to do things like nested columns?
I mean, again, that's something when you learn, when you read about BigQuery, you hear about
nested columns and so on.
Is that perhaps kind of over-engineering the solution or does that kind of come into it
really as a general day-to-day solution to things?
Yeah.
So I was listening to your interview with Dan earlier.
So yes, nested data,
our ability to do arrays inside SQL,
I find it a very beautiful solution, but one that you still need to wrap your mind around.
Daniel Mintz at Looker wrote an excellent blog post about how beautiful he finds the ability to do nested data in BigQuery. I don't know if you saw that one. Yeah. But it's a great modeling technique: data that should be together, instead of living in different tables, just lives nested inside one row.
One of the typical examples we have here is from...
Well, you know Google Analytics.
Mm-hmm. Yes.
So it's pretty easy for Google Analytics 360 customers
to export their data into BigQuery.
So instead of going through the Google Analytics web UI or API, you can just dump everything in BigQuery and start asking any
query that you may want or join it with your own datasets, which is pretty cool.
Now, if you go and see how this data is modeled, each row represents a session.
And that means you will have a certain number of columns
describing the session, who the user is.
But one column contains multiple rows: each page view hit. And instead of duplicating all of the session-level data around many rows, you can find it compressed into only one row, plus an array with every hit.
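The session-row-plus-hits-array shape Felipe is describing can be sketched in plain Python; flattening it back out is roughly what BigQuery's UNNEST does. The field names below are made up for illustration:

```python
# One row per session; the hits live in an array inside the row,
# instead of being duplicated across many rows.
sessions = [
    {"session_id": "s1", "user": "alice",
     "hits": [{"page": "/home"}, {"page": "/pricing"}]},
    {"session_id": "s2", "user": "bob",
     "hits": [{"page": "/home"}]},
]

def flatten(sessions):
    # Roughly what UNNEST(hits) does: repeat the session-level
    # columns once per element of the array.
    return [{"session_id": s["session_id"], "user": s["user"], **hit}
            for s in sessions for hit in s["hits"]]

for row in flatten(sessions):
    print(row)
```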
And when writing queries, well, that's a pretty good solution, though it still creates problems for people when they are just wrapping their minds around the idea.
Yeah. So with that, I mean,
I mean, I guess this is one thing that leads on leads on to is a lot of people moving a workload onto Google Cloud and BigQuery and so on for various kind of valid reasons would be thinking like dimension load processes.
So in data warehousing, as you kind of know, there's this kind of concept of you've got a dimension table that maybe gets updated and it's joined back to the fact table and so on.
I don't know if you know the answer to this, but would things like nested columns be a solution
for that? Or when someone says to you, I need to replicate this kind of dimension joined to fact
table kind of setup in BigQuery, forgetting the update part at the moment, but how would you
typically approach that really? Or is it completely the wrong thing to think about?
So what I always try to do with my tables
is to design around what queries I want to run.
So I want to optimize my tables for query
because that's where we are going to spend most of the time working.
So if all of my sample queries have a join between three, four tables,
it might be much better if I offer my users a table
that has all that data pre-joined.
Having copies of data, having duplicated data, might make you nervous, but that shouldn't be a big problem if you're able to regenerate these tables at any time; storage is cheap.
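That advice, optimizing tables for the queries you actually run, amounts to materializing the join once at load time. A hypothetical sketch, with made-up tables:

```python
# Denormalize once at load time: pre-join the dimension onto the fact
# table so every query reads a single wide table, with no join at all.
facts = [{"product_id": 1, "qty": 3}, {"product_id": 2, "qty": 1}]
products = {1: {"name": "widget", "category": "tools"},
            2: {"name": "gadget", "category": "toys"}}

# Regenerable at any time from the sources, so duplication is safe.
wide_table = [{**f, **products[f["product_id"]]} for f in facts]
print(wide_table)
```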
And the question now here is: what queries do your users normally run? Let's optimize for those.
Okay, okay. And that's an interesting lead into how someone might get access to the tools to do this, really. So your title is actually, you know, developer advocate, and so it'd be interesting to understand, well,
someone who comes from the on-premise world
that maybe has kind of, I don't know,
SQL access to the command line,
or they have maybe kind of data integration tools
or whatever, what would be the typical toolkit
for somebody who is a developer working with this?
And how can Google help with that?
What do you have in terms of kind of like tooling
and I suppose command line interfaces and so on?
Yeah, so if we focus around BigQuery,
we have at least two different types of users.
On one hand, we have the people with questions, the analysts,
the data scientists.
So they have a tool set to query this data. And on the other hand, we have our heroes
that are the people keeping the pipeline alive: how do I keep fresh data inside BigQuery, so people are able to query it? In the analysts' tool set, the first tool you find here is the BigQuery web UI.
Yes, the ability to open up a browser, log into BigQuery,
and see all the tables you have access to
and start running SQL queries without any more preparation.
That's the first place we go.
Beyond that, I don't know if you are already using the BigQuery Mate Chrome extension?
Yeah, just explain what that is actually.
So one of our customers, users, one of my favorite BigQuery users in LA created this extension because a lot of people in their company are using BigQuery
and sometimes they wanted our UI to do different things. So instead of waiting for us to create
the different features to the UI, Mikhail just started adding his own features and now
this is released as a Chrome extension.
We use it every day here, yeah.
Yeah.
Yeah, I'm impressed.
If you go to the statistics for the extension, it has something like 3,500 active users.
One of the things it was good at as well, I think,
one thing we like about that Chrome add-in, sorry,
is that it tells you, it actually predicts what the price of the query will be.
It will tell you, this query will process
this much data and whatever.
But actually, what's interesting is how many of the queries that we generate are actually free. And BigQuery has got a different charging model, hasn't it, to maybe what people are used to, in that you charge based on queries, is that right? Rather than kind of like what you're storing? I don't know, how does that work?
Yes. So, as BigQuery is a completely managed solution where you don't need to turn on servers or anything, the pricing model is built around the queries you write.
If you're not doing anything with BigQuery,
then the cost is zero.
When you want to write a query, you pay per query.
That's a different mental model to approaching these problems.
But at the end of the day,
a lot of people find out that
it improves their cost a lot.
But it's...
Yeah.
It's kind of interesting.
Certainly, it means that we have to be very cognizant
of making...
In the old world of on-premise databases, doing select star from something was quite straightforward.
But, you know, doing select star from a BigQuery table brings back maybe all the columns.
And how would that affect charging, really?
I mean, how does that affect maybe the SQL you write or your kind of carefulness about bringing back all columns rather than less columns, really?
Yes, so, select star limit 10 is not the most efficient BigQuery query, because basically, we're asking BigQuery to read all of our columns and then just give us 10 results.
There is a free operation for that: if you want to see your table, you can just preview it without running a query.
With that said, the problem you want to solve is always: how do I get to the results I want?
On one hand, BigQuery shouldn't be expensive, but if you have massive tables,
if you have one terabyte table and you're querying that one,
querying a terabyte of data will be a thousand times more expensive than querying a gigabyte, because the cost is linear.
But I suppose the fact that it's a column store means that if you just get the columns you want, suddenly it's a lot cheaper, isn't it? So rather than, in a way, paying for the servers and the storage to store a big table and querying it, you're just paying for the columns that you actually request.
So in some respects, that is better, isn't it?
Exactly.
So what I'm able to do when writing a query
is just to bring the columns that I'm interested in
And just the ability to know the cost of a query before I run it helps a lot with understanding that kind of problem.
So I can limit the cost of my queries in two ways.
One is choosing the columns that I will query
instead of querying all of them,
the typical select star.
And then I can also develop my queries
over samples of data.
For instance, let's say I have one petabyte of logs. The question is: before running all my queries, how do I extract the data I'm interested in from those tables?
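Since on-demand billing is linear in bytes scanned, and a columnar engine reads only the columns a query references, a back-of-the-envelope cost estimate is simple. The $5-per-terabyte figure below was roughly BigQuery's on-demand price around the time of recording; treat it, and the table sizes, as assumptions:

```python
# On-demand pricing is linear in bytes scanned, and only the columns
# referenced by the query are read. Assumed price: $5 per TiB scanned.
PRICE_PER_TIB = 5.00
TIB = 2 ** 40

def query_cost(column_sizes_bytes, selected_columns):
    """Estimate the cost of a query that reads only selected_columns."""
    scanned = sum(column_sizes_bytes[c] for c in selected_columns)
    return scanned / TIB * PRICE_PER_TIB

# Hypothetical 1 TiB table split across three columns.
columns = {"user_id": 200 * 2**30, "url": 700 * 2**30, "ts": 124 * 2**30}

full_scan = query_cost(columns, columns)        # SELECT * reads everything
narrow = query_cost(columns, ["user_id", "ts"]) # reads only two columns
print(f"SELECT *: ${full_scan:.2f}, two columns: ${narrow:.2f}")
# SELECT *: $5.00, two columns: $1.58
```

This is the arithmetic behind both points Felipe makes: projecting fewer columns cuts the bill, and the cost is knowable before the query ever runs.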
Yeah. I mean, actually, we're talking here about charges and costs, but one of the things that really attracted me to the Google platform, when I was, I suppose, transitioning from, say, Oracle, for example, is that as a developer playing with things, learning the technology, it's actually pretty rare for you to incur any costs at all. And, you know, while we've been talking, I've been bringing up my transaction history on Google Cloud here, and there have been many, many months where I've paid like a pound or something. As a developer, I guess you guys have been very good at making it possible for developers to learn this in a way where you're not going to get these big bills. All the tools are there, and you're using an environment that is exactly the same environment as you'd use at work, just with a small data set, and you've got these free tiers as well. Is that correct? Am I correct on that?
Exactly. So everyone has, every month, one terabyte
to run queries for free. You don't even need a credit card for that. And we have a lot of public data sets available. That means that you can log into BigQuery, find any of these public data sets, all of the GitHub history, a copy of Hacker News, a copy of Stack Overflow, the worldwide weather, etc., etc., and just start writing queries, finding out how different this is. Like, oh, suddenly you have a place where you can come and start querying data right away, without any charge at all.
Yeah.
Now, once you get your feet wet, you start loving it.
Like, bringing it into your own life.
I do.
I mean, and I land all my personal data into BigQuery.
I actually have loads of feeds coming in,
and it goes into BigQuery, and so on. Because I think, again, as a developer, if you're learning something, the fact that your environment stays there month in, month out... I mean, a lot of vendors will give you trials for one month, two months, or whatever, and then suddenly it becomes chargeable, or you might have 30-day trials and so on. But the fact that you guys make the environment available, you know, it's pretty hard to incur a cost as a developer at home. I mean, obviously at work it's different.
And that environment stays there as well.
And I think the other thing that's interesting is,
you cover Google Cloud in general,
not just so much BigQuery,
and you've got all these APIs that you can use for things like sentiment analysis and stuff like that.
I mean, tell us a bit about those and what they are
and kind of how easy they are to use, really.
By the way, in BigQuery we also have a free storage tier.
So, yes, the data you're storing there, the data you're querying in BigQuery, it's all
free up to a certain limit.
But that gives you a lot of freedom to play with this.
And then, as part of the Google Cloud offering, we are investing heavily in machine learning.
Machine learning is one of the strengths of Google as a company,
and we want to share our tool set with the world.
So sometimes, when we go outside of structured data,
let's talk about images, let's talk about text.
Let's talk about understanding a lot of data that people,
institutions, companies are collecting
that they are not able to understand.
We want to help them understand that.
You have a huge collection of pictures.
You may have a huge collection of videos.
You might want to transcribe
all of these podcasts to text. You
might want to know what the concepts were, how people are speaking about your product. You might
have a data center full of recordings and you want to understand all of them. We offer you APIs that can help you do exactly that.
If you have a video and you want to know
everything that's happening inside the video,
either the audio or the pictures,
you can put your videos through this API,
extract all of the metadata,
and then store it in places like BigQuery
to just analyze them later.
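The pattern Felipe describes here, run unstructured content through a Cloud ML API and land the result in BigQuery for later analysis, can be sketched in Python. The `google-cloud-language` calls below follow the real client library, but the table name and row layout are assumptions for illustration, and `analyze_and_store` is not called here because it needs GCP credentials.

```python
# Sketch: text -> Cloud Natural Language sentiment -> BigQuery row.

def to_bq_row(doc_id: str, score: float, magnitude: float) -> dict:
    """Shape one sentiment result as a BigQuery streaming-insert row."""
    return {"doc_id": doc_id, "score": round(score, 3), "magnitude": round(magnitude, 3)}

def analyze_and_store(doc_id: str, text: str, table: str = "mydataset.sentiment"):
    # Requires `pip install google-cloud-language google-cloud-bigquery` + credentials.
    # `table` is a hypothetical dataset.table in your default project.
    from google.cloud import bigquery, language_v1
    doc = language_v1.Document(content=text,
                               type_=language_v1.Document.Type.PLAIN_TEXT)
    sentiment = language_v1.LanguageServiceClient().analyze_sentiment(
        request={"document": doc}).document_sentiment
    bigquery.Client().insert_rows_json(
        table, [to_bq_row(doc_id, sentiment.score, sentiment.magnitude)])
```

Once the sentiment scores are sitting in a BigQuery table, they can be joined and aggregated with SQL like any other structured data.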
Yeah.
So actually on that point, one of the things that I found is the APIs are fantastic, and I
use them for sentiment analysis on incoming emails and tweets and all that
kind of stuff.
But one thing I found was it was still a little bit hard to link that to kind of BigQuery.
And I was trying to think about could I maybe create a function that would call the sentiment
API, NLP API?
I mean, is there anything around that, around integration with the APIs in BigQuery that
is coming along that will make it even easier to access those?
Or is this still a case of using the Ruby client or the Python client and so on?
Yes, so there is an impedance mismatch between what you can do with
BigQuery versus what you can do with an API. With BigQuery we can analyze a billion comments
in the next ten seconds, or three seconds. We would kill any API if we sent it
a billion requests per second.
Oh, right, yes.
So yeah, the question there is,
how do we bring things to the scale and speed
that BigQuery has?
Yeah, but it's more, I suppose,
because as you say, it's like an impedance mismatch and so
on. I mean, maybe for someone who is looking to get started, maybe they're a Python developer
or that sort of thing, where would they go, or how would they
start to understand the APIs that are available, and how would they run them?
What would your initial advice be to someone who's looking at that, really?
Yeah, so a nice way to start with APIs, and my teammate Sarah Robinson has a lot of examples of this, is to start listening to events happening in real time. Let's say with a Twitter feed. As you
collect and read each tweet, you can pass them through the API and then store them in BigQuery.
Now, at what speed are you doing this?
You are doing this at the speed that you are getting tweets.
So you're not going over a billion tweets in the next 10 seconds.
You are just collecting data from the outside, going through an API, understanding it, and then storing that data.
So that's a good place to start.
And there you don't have the impedance mismatch.
The API will go at the real speed.
Then you also have Dataflow to extract data
from BigQuery, run any process,
call an API, and bring the data back in. And then if you want to stay in the BigQuery
world, what I do many times is instead of going through the most advanced sentiment analysis API,
you can find cheaper solutions that you can run with SQL.
So, for example, yes,
Google Cloud Sentiment Text Analysis,
it's pretty powerful.
It's really awesome.
But then if you want to run some cheap
sentiment analysis
tools, maybe you can take each word and score it according to a dictionary that tells you
whether words are positive or negative, and you will get a quick answer to how the sentiment has
evolved over time.
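The cheap dictionary-based scoring Felipe describes can be sketched in a few lines. The tiny `LEXICON` below is illustrative only; a real run would load a published word list (AFINN, for example) into a BigQuery table and express the same word-by-word scoring as a SQL join against your text data.

```python
# Sketch: dictionary-based sentiment scoring, the "cheap" alternative to
# calling the full sentiment analysis API for every row.

# Illustrative mini-lexicon: positive words score +1, negative words -1.
LEXICON = {"love": 1, "great": 1, "awesome": 1,
           "bad": -1, "hate": -1, "terrible": -1}

def sentiment_score(text: str) -> int:
    """Sum the lexicon scores of the words in `text` (unknown words count 0)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())
```

Because this is just a lookup and a sum, BigQuery can apply it to billions of comments in one query, which is exactly the scale mismatch with per-request APIs discussed above.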
Okay, okay. So you mentioned Dataflow there. And one thing I have found with the Google world is that Cloud Dataflow is a pretty
integral part of most systems I see. But it's a lot more complex, or appears to be more complex, than say BigQuery. Do you want to maybe just,
at a very high level, paint a picture of what Cloud Dataflow is? And I suppose, you know, as a developer like yourself who's used to BigQuery,
how did you approach learning Dataflow? And how much do you use it really in
real life? Yep. So with Dataflow, we are solving at least three different problems. One of the most important ones
is how we were running
batch analysis systems
versus stream analysis systems.
And there you have
the typical Lambda architecture
that gave you fast,
inaccurate answers
in real time.
And then you were able
to have a batch process later that would give you
the correct results.
So with Dataflow, one of the important problems it's solving is how do we have a unified system
that does both, where you can get correct results and in real time.
And many of these ideas we put together in what we call now the Beam API.
And the Beam API is now an Apache project
and a lot of other systems, like Spark and Flink,
have been adopting these ideas, and we are developing a unified programming
model where you can write to the Beam API and have your problem solved by different runners.
Now that's one problem. The other problem that Dataflow solves is, once you have your pipelines defined in this way, where do
you run them.
So you have awesome open source runners, but you might want a managed solution.
You might want just to deploy your logic somewhere that will scale up, scale down, and take care
of running this.
And Dataflow is our runner for the Beam SDK.
So you can write your programs against the Beam SDK, and then Google can take care of running them
in a managed way. Now the third problem we have here is the people coding.
Is this SDK easy to use?
The first Dataflow API was implemented in Java,
and maybe not everyone likes coding in Java,
especially outside a corporate environment.
So there Google has been moving forward
with our Python Beam SDK, for people that love Python.
And other people that have adopted Dataflow
and the Beam SDK,
for the ability to have Google take care of running all of this,
have also built their own APIs in their own languages. And there you have Spotify that
is using Dataflow extensively, but they also developed an SDK in Scala. Now you can write Dataflow programs
in the Scala language,
and they also added a lot of interesting design decisions
to the SDK from the programmer's viewpoint.
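The unified model Felipe describes can be sketched with the Python Beam SDK he mentions: the same pipeline code runs locally on the DirectRunner or on Cloud Dataflow, selected only through pipeline options. The element logic is kept in a plain function (names and the toy data are illustrative), and `build_and_run` is not invoked here because it assumes `pip install apache-beam`.

```python
# Sketch: a minimal Beam pipeline in the Python SDK. The transforms are the
# same whether the runner is local (DirectRunner) or managed (Dataflow).

def parse_event(line: str):
    """Turn a 'user,score' line into a (user, score) pair."""
    user, score = line.split(",")
    return user.strip(), int(score)

def build_and_run():
    import apache_beam as beam  # DirectRunner by default; Dataflow via options
    with beam.Pipeline() as p:
        (p
         | beam.Create(["alice,3", "bob,5", "alice,2"])   # toy bounded input
         | beam.Map(parse_event)
         | beam.CombinePerKey(sum)   # same code for batch or streaming input
         | beam.Map(print))
```

Swapping `beam.Create` for a streaming source such as Pub/Sub is what turns this batch sketch into a streaming job, without rewriting the transforms in between.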
Okay.
So just before we get on to the last thing we'll talk about,
which is Google Data Studio,
one of the things that you haven't mentioned, but which is a massively useful resource that I think you've been involved in,
is that there's a lot of examples and tutorials and things on the Google website
that take you through getting started with all these things and BigQuery and Dataflow and so on.
I mean, maybe just kind of explain what they are and kind of, I suppose, the content that's in there
and how they might be able to help people to adopt this new technology.
Yep. So here the question for us is how do we help people get started, from zero to becoming experts. I try to write interesting data analysis posts that bring people to realize that we have all of this
data available and that they can do similar analysis pretty quickly. But that's just a part
of telling people that if they learn BigQuery, if they try it out, their reward is pretty good,
especially if you are fascinated by analyzing data.
Then we have all of these offerings,
for example, our code labs
for people that want to start in a guided way.
So you can find many BigQuery code labs,
Dataflow code labs
that will guide you through the whole process
of getting started,
setting up your environment,
and finally arriving at an interesting result.
So if you just search for Google BigQuery Code Lab,
you will find pretty good step-by-step guides into this world.
Another resource that I found quite useful
was Google Cloud Platform's GitHub repository.
And there's a whole bunch of really good examples in there
of, I suppose, linking, for example,
BigQuery to sentiment analysis or things like that.
I mean, that's a useful resource as well,
but I don't think it's very well publicized.
But I found it very useful as well.
Oh, yeah. My teammates like Sarah,
Amy, and many more,
they publish
all of their code
on GitHub.
If you are Googling for those
results, you will find them.
I also try to collect all of these
results on Reddit.
I'm the admin for reddit.com/r/bigquery.
So everything I know about BigQuery,
everything interesting that I find out there,
I collect it there and I let people just upvote and downvote
what they find interesting.
The same with the subreddit Google Cloud.
When things go beyond BigQuery, I also put them on Google Cloud
or the subreddit
App Engine, etc. So if you want to follow this collection of resources, at least you will find
them all collected on Reddit and ready for you to follow. I wasn't aware of that, actually. So I'll
actually take a look at that now. So that's kind of useful. The last thing I want to talk to you about is, I suppose, you've got a tool, Google Data Studio,
which is getting very popular now within the world that I kind of work in.
Google's free, or free-to-use I suppose, BI tool.
Tell us what that is really and what it's for and how people might start to get enabled with that as well.
Yeah, well, Data Studio is one of my new favorite products
in the sense that it just solves a problem for you
and you do not need to care about any infrastructure.
With Data Studio, you can create interactive dashboards
following the Google Drive model.
Like, if you want to create a document
and share it with anyone,
you can open up Google Drive,
you can start your document,
you can make your document public,
you can share it with me
if you want to share something privately.
And with Data Studio, we have a similar model.
Data Studio can connect to multiple data sources,
hundreds of data sources now.
We have this new connector program
so anyone can write a connector for Data Studio.
But it also connects to Google traditional data sources
like Google Analytics, YouTube, and of course, BigQuery.
Now, once you can connect to data sources
with a couple of clicks,
you can start creating visualizations,
adding interactive controls for them,
and once your visualization, your dashboard is ready,
you can share it with anyone you would like to.
Again, following the Google Drive sharing model.
One thing with Google Data Studio
that I was a little surprised about was,
if you connect to, say, BigQuery,
you can't bring in multiple tables and join them.
It's either a single table or a single view
or a SQL statement.
Was that a deliberate design decision
or would that be kind of broadened in the end?
I mean, it seemed a bit of a restriction
that I was surprised about.
Any thoughts on that?
Well, Data Studio is a new product.
So all the features you see there are the initial features.
All right.
Yeah.
Will it get more features?
Probably.
Yeah.
I don't want to spoil the surprises.
Yeah.
But yeah, to start with,
if you want to run joins,
you can run any custom query that you want.
Exactly, yeah.
Things are going to get interesting.
Yeah, good.
I mean, it's a very good tool.
I mean, we kind of use it here
internally for reporting and so on.
And a lot of our customers use it
especially as a common standard,
given the fact it connects to
Google Analytics and so on.
And it's freely available.
It's almost like the Microsoft Access
in our world of kind of,
I suppose, cloud BI.
And there's actually one other tool
that you guys brought out recently,
which I think actually is in beta
at the moment, which is Cloud Data Prep as well.
I don't know if you know about that at all, but maybe if you do, just maybe explain what
that is as well, and we'll finish on that.
Yeah, of course I know about Data Prep.
Okay, I just sprung that on you, yeah.
Yeah, as with Data Studio, the question is always how do we reduce the friction?
Data Studio is amazingly low friction.
With Data Prep, the question is how do you start loading data?
You might have these huge dumps of data
that you have never looked into,
but you're still collecting. Or someone shares some CSV or JSON files with you,
and now you want to load them,
you want to clean up this data.
It's the first step.
The first step for analyzing data
is understanding what data you got.
And with Data Prep,
you have this interactive environment
that will go over a sample of this data, will show you the shape of it, will allow you to
interactively click around, decide how you want to clean it up, what data do you want to keep,
what do you want to split in different columns, or what columns you want to drop.
And while you are designing this recipe for your data, Data Prep will store the recipe and transform it into a Dataflow job, one that you don't need to really care about; it's just the recipe that you interactively build.
And then once you have your recipe ready,
Dataflow is ready to run these jobs,
not only in a sample of your data,
but on as much data as you have,
and it will scale up and down
depending on how complex your data solution is.
Yeah, I mean, I suppose it's a data-wrangling tool, isn't it?
And for me, the thing that was always disproportionately hard
with working on BigQuery was actually just moving data
around.
It was incredibly easy to load data in, to process it,
to do whatever.
But actually creating
a sequence of steps, for example, that would take some data and maybe transform it and
maybe aggregate it and so on, there was no solution for that really apart from scripting
it. And I mean, I don't know whether this is the solution for it, but you know, that was
always a hard bit really. And the fact that Cloud Data Prep connected to BigQuery, for example,
was a massive bonus.
I mean, do you tend to use this tool now sort of day to day or what, really?
Personally, I'm not using it that heavily,
but I know of a lot of people that do so.
And you're touching on another super interesting point,
which is how do we just build our
pipelines. I know, and that's the bit where suddenly you're on your own,
typically. And you know, there are solutions like Airflow and stuff like that, but what's your
solution for this, and how do you build these kind of data pipelines?
I love relying on my data heroes that bring me the data in.
But Airflow is, I would agree,
is one of the most adopted solutions right now by different customers.
We do that, yeah.
Yeah.
That's the one I would recommend people using.
Maybe, I know you're not here to publicize Airflow,
but what is Airflow?
And how, I suppose, in a way, what we're really interested in
is how does it integrate with kind of BigQuery,
or how would you use it with BigQuery?
And what problem does it solve for you, really, or for customers?
Yeah, well, I'm not an Airflow expert,
but I do know that it's one of the best tools
to define how you want to move data around
and run your data pipelines.
I've seen some great work, for example, from WePay
that have described their whole solution
of how they move their live data that lives in MySQL,
how they use Airflow to synchronize
a whole workflow pipeline of moving this data
as quickly as possible to BigQuery for analysis,
for example.
Yeah, I mean, we use it to do aggregation, yeah.
For aggregation, we use it as well.
Yeah, so if you want to do things like that,
like how to keep two databases synchronized
either daily or every 10 minutes,
Airflow is a great place to define those steps.
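The kind of scheduled BigQuery step discussed here can be sketched as an Airflow DAG. This is a hedged illustration, not the WePay setup: the operator and import path come from the Airflow Google provider package and vary across Airflow versions, and the project, dataset, and table names are placeholders. `build_dag` is not called here because it assumes Airflow is installed.

```python
# Sketch: a daily Airflow DAG that runs a BigQuery job to refresh an
# aggregate table, the "define the steps and run them on a schedule" pattern.

AGGREGATE_SQL = """
CREATE OR REPLACE TABLE mydataset.daily_counts AS
SELECT DATE(ts) AS day, COUNT(*) AS n
FROM mydataset.events
GROUP BY day
"""

def build_dag():
    # Requires `pip install apache-airflow apache-airflow-providers-google`.
    import pendulum
    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import (
        BigQueryInsertJobOperator,
    )

    with DAG("bq_daily_aggregate", schedule="@daily",
             start_date=pendulum.datetime(2017, 10, 1, tz="UTC")) as dag:
        BigQueryInsertJobOperator(
            task_id="refresh_daily_counts",
            configuration={"query": {"query": AGGREGATE_SQL,
                                     "useLegacySql": False}},
        )
    return dag
```

Keeping the transformation itself in SQL and letting Airflow handle only the scheduling and dependencies is a common way to get the "synchronized daily or every 10 minutes" behavior mentioned above.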
Just to kind of round up,
so just to recap again,
how do people find you on the internet?
How do people find the content you've been producing
and where do they go to just to kind of get
started with as a developer in this platform?
Well, I'm on Twitter
as Felipe Hoffa.
I put a lot of
content on Medium. I love
collecting
all of these.
For example, this podcast, as soon as
you publish it, it's going straight to
reddit.com.
And for people that have questions,
people that are, if you're running into technical problems,
if you have a programming question,
Stack Overflow has this awesome community.
I'm usually looking there at every question that is posted.
Some I can answer, some I see how the community
just comes out and answers.
And we have the engineers from the product
also working there.
Stack Overflow is just this awesome resource.
It's usually you that's answering the question
at the end there with the correct answer,
which is good.
So it's been fantastic to speak to you, Filipe. So, I mean, it's been fantastic to speak to you, Philippe.
I mean, yeah, it's been good to talk to you, someone I've kind of seen on those places before
and has been so helpful with the answers and so on.
And, yeah, it's been great to speak to you.
And thank you very much for coming on the show.
Oh, it's been great finally meeting you.
And I've been a fan of you on Twitter this whole time.
I love seeing your evolution
through the cloud world.
So yeah, thank you very much for having
me, and let's stay connected.
Okay, thank you.
Thank you.