Drill to Detail - Drill to Detail Ep.64 ‘Google BigQuery, BI Engine and the Future of Data Warehousing’ with Special Guest Jordan Tigani
Episode Date: April 29, 2019

Mark Rittman is joined in this episode by Jordan Tigani, Director of Product Management at Google for Google BigQuery, to talk about the history of BigQuery and its technology origins in Dremel; how BigQuery has evolved since its original release to handle a wide range of data warehouse workloads; and new features announced for BigQuery at Google Next’19 including BI Engine, Storage API, Connected Sheets and a range of new data connectors from Supermetrics and Fivetran.

Modern Data Warehousing with BigQuery (Cloud Next '19)
Modern data warehousing with BigQuery: a Q&A with Engineering Director Jordan Tigani
Introduction to BigQuery BI Engine
BigQuery Storage API Overview
Supermetrics and Fivetran BigQuery connectors
Drill to Detail Ep.2 'Future of SQL on Hadoop', With Special Guest Dan McClary
Drill to Detail Ep.31 'Dremel, Druid and Data Modeling on Google BigQuery' With Special Guest Dan McClary
Transcript
So welcome to Drill to Detail, the podcast about big data and data management in the
cloud, and I'm your host, Mark Rittman.
So we're joined on the show this episode by someone I've been looking to have come on
the show for quite some time now, Jordan Tigani, Engineering Director responsible for Google BigQuery, who comes to us
more or less direct from Google Cloud Next 19 last week, which is Google Cloud's big customer event
that runs each year at the Moscone in San Francisco. So welcome to the show, Jordan,
and thanks for making time to speak to us. Thanks very much for having me on.
Excellent. So Jordan, just to start, introduce yourself to the audience and tell us which kind of products you look after.
And obviously that will explain why I'm so interested.
Sure. So I'm actually the product management director for BigQuery.
I was engineering director and I've recently moved over to the dark side.
But I was one of the first engineers on BigQuery.
I was one of the people that sat around trying to decide what this thing was we were going to build. And what we came up with was BigQuery. And I've been riding this rocket ship for the last, I guess, eight years now. So I know where a lot of the skeletons are hidden in the code, or at least I know most of them. But also, now I'm looking into, how do we make this thing continue to grow? And how do we build it into this great data warehouse of the future?
Fantastic.
So my introduction to BigQuery came through the work I was doing in a previous role when I was in the engineering team at Qubit, who are a Google Cloud partner.
And we were using BigQuery there to store petabytes of consumer-level data, behavioral data, and then use it to drive personalizations and segmentation and so on.
And I was blown away by what a fantastic product it was.
And ever since then, BigQuery has been, I suppose, my database of choice.
So it's great to have you on the show, really.
So what was your route into Google?
How did you end up kind of doing what you're doing, really?
So I started out as an engineer before Google had the cloud. I worked at a bunch of startups and I never wanted to work at a big company again. I ended up interviewing at Google just to practice, but I had some friends that were here. And when I came into Google, there was an energy about the place, and I also loved that it was a bottom-up kind of place where individual
engineers with ideas could make things happen. And that's sort of how BigQuery came about.
Our site director had asked us to build something entirely different. He'd asked us to build a data marketplace, actually. And so several of us engineers were just sitting around the original tech lead's dining room table trying to figure out how we were going to build a data marketplace. And Google didn't have a cloud. So,
you know, we realized that for a data marketplace to work for large data sets,
you don't want to move the data to the compute.
You don't want to just download your data.
You want to actually move the compute to the data.
You want to have something you can do over the data in place
so that you can solve your analytics problems
or whatever the problem is you want to use the data for.
And so we had this great internal tool called Dremel that would sort of operate
over things sort of magically fast. And we're like, okay, well, we can tie these two things
together. We can expose Dremel in the cloud and then we'll get around to doing the data marketplace.
And so I guess eight years later, it's sort of the data marketplace has always been
like two quarters out for the last eight years. Maybe someday it'll happen. But I think it was
just one of those cool things where as engineers, we were empowered to build what we thought was
the right thing to build and nobody stopped us. So it's been an exciting ride and
exciting to be a part of something that now is much larger and very different. In the early days, if BigQuery went down over the weekend, we'd get a page and be like, eh, I'll deal with that on Monday. Chances are nobody will notice. Nowadays, our internal SLOs are, I think, three minutes per month of partial downtime. And that doesn't
give you a whole lot of time to ignore a page. I mean, luckily we have a really awesome SRE team
that helps us maintain that SLA.
But just the ways people are using it are different.
People are building their businesses around it.
And it's kind of an honor,
but it's also scary that, wow, if we mess up, then people's businesses stop working.
Yeah, exactly. I mean, exactly. So, I mean, there's a lot in that intro from you there.
Obviously, there's Dremel, which I think would be great to talk about. There's, I suppose,
the data marketplace and what that means in terms of how you deliver the service, you know,
the underlying data structures and data service there and so on. And, as you say, this is now powering businesses. Certainly looking at Qubit, for example, it was powering Qubit's business and therefore powering a lot of e-commerce businesses that rely on Qubit to do personalization and conversion optimization and so on. And certainly in my time working with it, there was never a problem with BigQuery. We had to optimize our spend and we had to optimize the queries and so on, but the actual underlying service was fantastic.
So what about Dremel? How were you involved in it? Was it already there when you arrived, and what problem was it solving back then?
So I think Dremel was kind of a seminal tool at Google. It grew to the point where I think 80% of the company was using Dremel, 80% of the company was a monthly active user of Dremel. Most of it was indirect usage, but it was still
necessary for how Google ran and how Google was able to be a data-driven company,
despite having, I guess we have eight products now with over a billion users,
and that just generates lots of data. But at the start, Dremel was created by an engineer.
And I always get his name wrong, so I should double check.
It's Andrey Gubarev.
And he created it while he was waiting for his MapReduces to finish.
So he was in the ads team, and his MapReduces were taking a long time, and he realized that there are these data structures used at Google that are very wide, so they have lots and lots of columns, and he only needed a few of those columns, but a normal MapReduce would have to read the whole thing. Column stores were starting to become popular at the time. And so he's like, first step, I'll build a column store. Second step, build a SQL engine.
And then he got an intern, and the intern was helping him, and they made it distributed. And some of the cool things about Dremel were that it built this tree structure for your queries. Databases in general rely on this on-disk data structure called a B-tree, and you can think of Dremel as a computational version of a B-tree. So at the bottom you have the leaves, and each individual leaf is responsible for a small portion of the data. And then as you go up this tree, it's higher levels of aggregation. So the initial version of Dremel was this static tree: you had root mixers, mixer ones, and then you had the shards. And it was extremely fast at doing
scan filter aggregate queries,
which is kind of what it was designed for.
And the problem is, you know, if you want to do queries that are more complex,
you want to do queries that have lots of, you know, subqueries or stages or joins,
you need to do multiple passes in this tree.
And if you think about it, it's a little bit similar to, you know,
when MapReduce was first created. MapReduce allows you to take some fairly simple problems, but apply them over very large data sets.
So you can change things into a map stage and reduce stage.
And so for Dremel, it was sort of like there's a scan and an aggregate stage which are really kind of similar.
But then just like Hadoop, it turns out that most interesting problems you want to solve
actually involve multiple of these stages, multiple passes.
And the original MapReduce didn't have a way to chain those together. And so there's things like, at Google it was called Flume; externally it's called Dataflow or Apache Beam.
Apache Flink also kind of lets you solve some of the same problems
of kind of taking your MapReduces and chaining them together
in a convenient way.
And so Dremel is still using SQL, but Dremel made a major architectural change about six years ago, which went from the static tree to a dynamic tree. So when you start your query, we build up this tree, with aggregations and shuffles at each phase.
And I think one of the kind of unique things about it is it's not just a dynamic tree that gets created at the start of the query.
It's a dynamic tree that actually will change its topology depending on how the query is running.
So, for example, take certain types of joins. Let's say you're joining against a small table, a small amount of data, which is pretty common in data warehousing scenarios. You have a small dimension table and you have your large fact table. To do a join, you basically want to co-locate the keys that are going to match up together. If you're joining on a key like customer ID, every customer ID that is, say, three, you have to put those in the same place, because you have to be able to match them between those two tables.
So the expensive way to do this is you shuffle the data on both sides,
and then you have workers work on ranges of key space.
But a much faster way of doing that, if the one side is small,
is basically you just call it a broadcast join.
You just take the small side
and you broadcast it to every single node.
And the thing is, you don't know
whether it's gonna be large or small ahead of time.
So we have to make these decisions dynamically, based on how much data there is, the cardinality as the query runs, et cetera. And then we can rapidly make those changes.
The other option for that is the cost-based optimizer. With a cost-based optimizer, you have all this information about the query and how you think it's going to run, and you make a decision ahead of time. The problem is, if you make the wrong decision, it's super expensive, and you're stuck with the way it optimized. And one
of the things that we do in Dremel and BigQuery is we try to just dynamically figure out what
the optimal query plan is.
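To make the two join strategies Jordan describes concrete, here is a toy Python sketch, just an illustration of the idea rather than how Dremel or BigQuery actually implement it, contrasting a shuffle-style join, where both sides are partitioned by key, with a broadcast join of a small dimension table:

```python
from collections import defaultdict

fact = [{"customer_id": i % 5, "amount": i} for i in range(20)]      # large fact table
dim = [{"customer_id": i, "name": f"cust-{i}"} for i in range(5)]    # small dimension table
NUM_WORKERS = 3

def shuffle_join(fact, dim):
    """Expensive path: shuffle BOTH sides so matching keys land on the same worker."""
    fact_parts, dim_parts = defaultdict(list), defaultdict(list)
    for row in fact:
        fact_parts[row["customer_id"] % NUM_WORKERS].append(row)
    for row in dim:
        dim_parts[row["customer_id"] % NUM_WORKERS].append(row)
    out = []
    for w in range(NUM_WORKERS):                      # each worker joins only its key range
        lookup = {d["customer_id"]: d for d in dim_parts[w]}
        out += [{**f, **lookup[f["customer_id"]]} for f in fact_parts[w]]
    return out

def broadcast_join(fact, dim):
    """Fast path when one side is small: send the whole dimension table to every worker."""
    lookup = {d["customer_id"]: d for d in dim}       # broadcast copy on each worker
    return [{**f, **lookup[f["customer_id"]]} for f in fact]

assert sorted(shuffle_join(fact, dim), key=lambda r: r["amount"]) == \
       sorted(broadcast_join(fact, dim), key=lambda r: r["amount"])
```

The point of the dynamic tree Jordan describes is that the engine can pick between strategies like these at run time, once it sees how big each side actually turns out to be, rather than committing to one plan up front.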
So this is probably quite a few years ago now that you were working on Dremel and making these decisions.
And there were quite a few kind of, I suppose, potential solutions to running SQL on Hadoop at the time at a kind of interactive speed.
There was Tez, there was Drill and so on.
Did you make your decisions in the knowledge of those? I mean, I know a lot of them came after Dremel or were based on it.
But, you know, presumably you didn't go down those routes.
You went down this route.
What was your kind of thinking around that?
It was sort of interesting.
I guess that's one of the cool things about Google.
Some people think it's cool.
Some people think it's inefficient.
But at Google, there were a number of SQL query engines
to query over large data sets that turned into MapReduces. There was Tatra, Tenzing, PowerDrill, as well as F1 Query; there were all these different engines. And so it was hard to choose which one of these was going to win. And I think one of the nice things about Google is that by letting a lot of these different ideas flower, one of them does win. So you do have competition. They each push each other to be better, faster, more efficient, because you're competing for this internal talent. Sometimes it does seem like Google can be kind of cloistered, or sort of not-invented-here. And in other places, when you have that, what people come up with is often non-optimal because it hasn't really had to compete against anything.
If you're at a big company where they say, we're going to build X, and this is going to be the X that we build, until you get that into the market and other people are using it, you're in a space where you're not competing. So, I don't remember the exact reason that we chose this. I think there was a lot of momentum behind Dremel at that point, and it seemed like the thing that was going to succeed, but it wasn't a foregone conclusion. And then I think Drill basically started breaking ground right after we started, I think about a year later. And we started thinking about, okay, should we work with these folks? And they kind of went off in a slightly different direction. But we decided to keep them at arm's length a little bit, just for focus purposes; we had a lot of other stuff that we were trying to work on.
Okay.
So, I mean, apart from, I suppose, the technology, the sort of general technology and so on, when you look at BigQuery, then there's some kind of fairly big decisions you took there. To me, when I came into this industry from working with more traditional,
I suppose, on-premises data warehouses and Hadoop clusters and so on,
I wouldn't have thought of them at the time,
but they were massively important as to why people were adopting them.
So things like the fully managed part, taking that for example,
what made you go down the route of it being a fully managed service? What led to that, and what was your thinking behind it, really?
Perhaps surprisingly, there was almost no thinking behind it, because at Google, that's just how you build services. Serverless wasn't a term yet, fully managed wasn't a term yet. It's like, okay, well, if we're going to build a service, we want to make it so that people don't have to manage infrastructure, because at Google, when you build services, generally nobody manages infrastructure. So at the time it was obvious. And I think there are some other interesting corollaries
or related things to that that weren't quite as obvious.
One of those is, I think,
one of the key things in BigQuery
is all of the data is in one namespace.
So if I have a data set
and I think you might think it's interesting,
I can share that data set with you
and you can join that with your data.
And partly that comes from our idea of eventually building a data marketplace.
To eventually build a data marketplace, you want to have one name for something. But also it becomes really powerful: we have customers who aggregate analytics data from their users, and they show them dashboards, but for some of their advanced users they want to actually give them direct access to the data. So they can just share the BigQuery data set with them, and then that company can go and query it themselves.
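As a rough illustration of that sharing model, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset and email names are invented for the example. Granting another user read access on a dataset is all it takes for them to query it, or join it against their own data, from their own project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="publisher-project")            # hypothetical project
dataset = client.get_dataset("publisher-project.web_analytics")  # hypothetical dataset

# Add a READER grant for a user in another organisation / project.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@customer.example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

# The grantee can now reference the shared table directly in their own queries,
# for example joining it against data in their own project:
# SELECT ...
# FROM `customer-project.crm.accounts` a
# JOIN `publisher-project.web_analytics.events` e USING (account_id)
```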
Yeah, I mean, we used to do exactly that in my old role. We would provide either aggregations and dashboards, or we'd give them access to the data. They could then access that from their projects and it was completely seamless, really.
And certainly the fully managed part, it doesn't seem that important when you don't have the benefit of it. But then you realise it means you can auto-scale it, it means you can just run things on there and it just works, and that means you can focus on innovation and analytics and all that kind of stuff, really.
So that was, you said serverless there as well.
And I think serverless again was a new concept
that I encountered when I first came across BigQuery.
So maybe just explain what that means
in the context of a data warehouse.
Yeah, sure.
So, you know, a lot of data warehouses,
you know, even cloud ones, in order to run it,
you need to spin up a server or you spin up an instance,
even if it's just a virtual instance.
You often can trace that back to: it's running on this virtual machine somewhere. BigQuery, on the other hand, is this sort of giant multi-tenant service. We have hundreds of thousands of nodes, and your queries get multiplexed across those hundreds of thousands of nodes.
And so we have some nice advantages of scale, but nobody has to worry about instances.
And if any of those underlying instances fail, nobody notices.
In fact, if I looked at our Borg page right now, probably there'd be a dozen in any particular cell that were down, they were restarting.
And queries that were using that would have maybe a 100 millisecond glitch.
And that's the only difference that they would experience.
And we even do things like: BigQuery runs with data in multiple availability zones. So if in one of the zones there's a stuck tablet, or a network switch is down, or one of the other things that can go wrong in running a large-scale service, we will drain that zone and basically move everybody out into their backup cells.
And we do this probably, I don't know, a few times a week, and there's no downtime. It just magically happens. And that's the kind of thing you can do when you're serverless and you're not actually relying on individual VMs. And again, that's just how stuff gets built in general at Google, so we didn't even think there was anything special about it when we did it.
It's just like, okay, this is just what you do.
And sort of like now in retrospect,
like, oh, I guess other people
made different design choices.
So when I first used BigQuery again,
looking back, it was a fantastic service,
but you definitely noticed
when it wasn't a relational database.
And so, like joins: my first crashing encounter with reality on BigQuery was when I tried to join two big tables together, and I suddenly realized that joins weren't really effective on BigQuery. But it's come along a lot since then. So maybe just tell us a little bit about why there was that, I suppose, impedance mismatch at the start, and how the product has evolved to be functionally more like a regular data warehouse now.
Sure.
So I think originally it all just stems from Dremel and logs processing and logs analysis at Google. Basically at Google, we had these giant denormalized protocol buffer logs that everybody operated on. And so one of the key features of Dremel was its ability to use nested and repeated data, to use arrays, because that's just how the data was.
And so it basically was designed so you
can do a single scan over it and get your query results.
So joins were very rare.
In fact, when we were building BigQuery,
we, you know, actually some of the BigQuery engineers
implemented joins in Dremel because they didn't exist.
You know, a year later, I guess in 2012,
the two teams merged.
And so we've been the same team ever since.
But one of the things we realized, and this was several years ago, is that external customers have different expectations than internal Googlers. So the first thing we did was switch to standard SQL, to ANSI standard SQL. We built a new ANSI standard SQL query engine where we adhered to the standard as much as possible, with some additions to it, like nesting and arrays. And then we've also worked really hard on joins. Our join performance has gone up, I think, something like 7x in the last year and a half, looking at some TPC benchmarks.
That came after realizing that data warehousing workloads are generally very join heavy. People have their star schemas, their fact and dimension tables, their third normal form. And we need to be able to do those well if we want to be competitive. So we've been heads down, working on making BigQuery into a fully featured data warehouse.
You know, I think DML came a little bit later,
so being able to do updates and deletes
because that's also very, very important to data warehousing.
So those sort of things.
I remember speaking to Dan McClary about this a while back and talking about joins and so on, and I did wonder whether, in a way, we are using the product in the wrong way when we do that. You talk about meeting customers' demands and what they want to do, but fundamentally, is it still the case that a big wide table is better than ones with joins? Is there still a better way to work with BigQuery?
So fundamentally, what we tell people now is: use the data structures that are most natural for you.
So if what is natural is a denormalized star schema, use a denormalized star schema.
If what's natural is normalizing everything, or using nesting and repeated fields, do that, because sometimes that is more natural. One case where that can be more natural is orders and line items. Often that's represented as two separate tables and you have to do a join. There's a one-to-many relationship, but really an order is one thing. And so if you can nest the line items within the orders, then you're going to have a more compact data model. But in general, use what you're comfortable with, and then tune it as needed. It's sort of like the old saw that premature performance optimization is the root of all bugs, or whatever. Don't prematurely optimize and think you have to denormalize, because normalized queries and join-heavy queries are often quite fast. But that said, if this is something that's really important to make run fast, then hey, maybe denormalizing a little bit or using nesting and arrays can make it even faster.
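To make the orders-and-line-items example concrete, here is a hedged sketch, with invented dataset, table and column names, of what nesting line items inside an orders table looks like in BigQuery standard SQL, run through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# An orders table with line items nested as an ARRAY of STRUCTs, instead of a
# separate line_items table that would need a join.
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.shop.orders` (   -- hypothetical dataset/table
  order_id    STRING,
  customer_id STRING,
  order_date  DATE,
  line_items  ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
)
"""
client.query(ddl).result()

# Querying the nested data: UNNEST flattens the array back out when you need
# per-line-item rows, all in a single scan of one table.
sql = """
SELECT
  o.order_id,
  SUM(li.quantity * li.unit_price) AS order_total
FROM `my-project.shop.orders` AS o, UNNEST(o.line_items) AS li
GROUP BY o.order_id
"""
for row in client.query(sql).result():
    print(row.order_id, row.order_total)
```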
So how tempted have you been in the past to add indexes then to BigQuery?
You know, it's funny, it just came up like half an hour ago. I was talking to one of the eng managers and he's like, when are we getting indexes? One of the important things is, we don't want you to need
to have somebody whose full-time job is to tune BigQuery
for the types of queries that they're running.
We talked to customers and they're like,
I talked to one customer last week who said,
we have a full-time person whose job is to manage the workload management
in their on-prem data warehouse.
It's not even to manage the schemas.
It's just manage the jobs that are running
and that are running faster and slower.
We don't want you to need those types of people.
We don't want to put anybody out of a job.
We also want those people to be working on something that's more rewarding to them than having to sort of babysit things.
And so kind of the idea is no ops.
You shouldn't need to be a DBA.
But we do run into things where, like, okay, primary keys: if we could apply a uniqueness constraint here, we could apply all these optimizations. And so there is some tendency to do some of those types of things, but we try to make sure we do it in a principled way, saying, okay, we're going to get all this value, but also making sure that it's not something that people need.
We don't want you to have to use a primary key here, but hey,
if you want to signal to us that there's this key relationship or
to optimize these joins, you can create this key relationship,
then maybe we will add those.
But then again, maybe we won't.
I would say the day you add indexes is the day that you could.
Yeah, I think it's interesting, I think, about product management, isn't it?
You know, how much you follow.
I mean, you must have every, you know, so many customers saying to you,
you know, add indexes and this kind of stuff.
So they give very good cases to why, you know,
you're doing a single row lookup and so on there.
But, you know, it is the thin end of the wedge,
and you will end up with a whole industry of people
who then become tuning experts at BigQuery.
And the nice thing about it is it just works.
So I always look out for that being the sign of the apocalypse, really,
when you guys add indexes to that.
What about migrations from platforms like, say, Oracle,
Teradata, and Netezza? I mean, presumably you've had some of those happen, but what's the driver behind those, how have those migrations gone, and what have the benefits been?
So I think there's a clear move from on-prem to cloud, and a lot of the on-prem players are losing market share and losing a little bit of relevance to the folks in the cloud. We have tooling that will let you migrate: I think we have a Teradata to BigQuery migration service.
We have Redshift to BigQuery migration service.
Other stuff in the pipeline.
We have other data integration tools.
We bought a company called Cask, and we just released that as Data Fusion so that you can create
data pipelines. Google's also purchased Alooma. So our story right now is a little bit
disconnected, but I think step one has just been like, make sure that we have first-party
options for migration.
And then the next step is we're going to try to tie those together into a coherent whole.
But I think we do see a lot of migrations from other on-prem data warehouses
and even a decent amount from other cloud data warehouses.
Cloud data warehouses often, the migration is faster
because people have already kind of already come to terms
with what it means to be in the cloud.
You know, some of the on-prem to cloud migration
is just getting people comfortable with storing their data in the cloud
and the cloud security controls.
So, for example, HSBC: they are in the middle of a kind of long migration process to us. And one of the reasons that it's taken so long is because they've had to work with the regulators to basically say, okay, this is safe, it's okay to use, this security is good enough for storing unmasked
data in the cloud.
And they've gotten a lot of the European regulators to agree to that.
But it can just sometimes take a while.
And just sometimes it takes people a while just to be internally comfortable with kind
of the level of security and controls you get in the cloud.
And some of it's a little bit of loss of control. It's like, oh, well, if somebody else screws up,
then it impacts me.
But on the other hand, I can guarantee
that we have more security and network security people
and physical security people on staff
than virtually anybody.
And we treat people's security and privacy as incredibly important, because if we screw it up, it could be the end of our business.
Yeah, exactly.
I always used to say that people saying security was an issue with cloud was a bit of an excuse; it was really because of other reasons and so on. And security suddenly stopped being an issue when flexibility and price and so on were so obviously better in the cloud. And yeah, it's interesting to hear about banking and those kinds of clients, because I suppose the perception of BigQuery is that it's very much for cloud-native, maybe web-native companies, but you say banks like HSBC are using it now as well.
Yeah. So, I mean, that was something interesting at Google Cloud Next that was just last week.
There was a real kind of phase change in the types of organizations that we were talking to.
And it was a lot fewer.
They were quite boring, weren't they, if anything?
But rather than being, I suppose, the hot new internet companies,
you say it was very normal kind of household-name companies.
Yeah, no, it was fantastic.
I don't want to use anybody's name who hasn't agreed
to be a referenceable customer yet. But lots of big names
and people who know that cloud is important. They don't want to have to manage their own infrastructure or build their own software infrastructure. And they want to be able to focus on getting analysis results
rather than running servers.
And so they're coming to us and talking to us.
And yeah, I think we've had some really good success so far.
Interesting.
I mean, just one last question on the migration side.
How much is, I mean, have you ever thought about stored procedures?
I mean, obviously we come from the Oracle world.
That was a massive, massive part of what we used to do, I guess more from the transactional side. But have you ever thought about, I suppose, stored procedures in the database and business logic in the database?
So, absolutely. Our general policy is we don't talk about stuff that's coming down the pipeline, but this one is in alpha already, so I will confirm that, yes,
stored procedures are going to be a thing.
Interesting.
I had no idea.
So that's very interesting.
And one of the drivers of that, as you mentioned, is migration. People have their giant scripts that they've run somewhere else, and it creates temporary tables and has some loops and some variables. And currently, as part of the migration process, you have to split those up into individual query statements, and it's super painful. So we'll have stored procedures, and you can basically just run that whole thing. You might have to tweak it a little bit; we're following the ANSI standard as much as possible, but every database engine has their own quirks, so you may have to tweak it, but at least you don't have to do major surgery.
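Stored procedures and scripting were only in alpha at the time of this conversation, so the exact syntax wasn't public; as a rough sketch of the kind of multi-statement script Jordan describes (variables, a temporary table, a loop), assuming the BigQuery scripting syntax that later shipped and reusing the made-up table names from earlier:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A multi-statement script submitted as a single job: variables, a temp table,
# and a loop, roughly the shape of a migrated stored-procedure-style script.
script = """
DECLARE day_offset INT64 DEFAULT 0;

CREATE TEMP TABLE daily_totals (d DATE, total NUMERIC);

WHILE day_offset < 7 DO
  INSERT INTO daily_totals
  SELECT DATE_SUB(CURRENT_DATE(), INTERVAL day_offset DAY),
         SUM(li.quantity * li.unit_price)
  FROM `my-project.shop.orders` AS o, UNNEST(o.line_items) AS li   -- hypothetical table
  WHERE o.order_date = DATE_SUB(CURRENT_DATE(), INTERVAL day_offset DAY);
  SET day_offset = day_offset + 1;
END WHILE;

SELECT * FROM daily_totals ORDER BY d;
"""
job = client.query(script)   # the whole script runs as one query job
for row in job.result():     # results of the final SELECT
    print(row.d, row.total)
```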
Okay, interesting. So let's move on then to Google Next 19. There were some fantastic things announced last week. Maybe
just talk about Google BI Engine. What is that? And what problem is it
trying to solve?
Sure. So the problem it's trying to solve is that a lot of our customers, they have their data in BigQuery and they want to build dashboards
based on the BigQuery data. And, you know, when you load a dashboard, it might do 10 queries
because each one of the little widgets is a query,
and then you might have a whole bunch of users.
And kind of when you do this, it can be slow.
You really want this whole page to render in less than a second.
And so one of our lead engineers, who has worked on BI stuff in the past, he's like, I know how we can do this better. We can build what is effectively an in-memory OLAP engine and do these really fast scans. And we can build it right on top of BigQuery so that the data is all backed by BigQuery; in case we miss the cache, you're always going to get the correct data. But instead of getting results in seconds, we can generally get results in tens of milliseconds. And so that's really the goal of BI Engine: sub-hundred-millisecond queries, high concurrency.
Our billing model is different for that, too. We only charge for the memory that you have, rather than the bytes scanned in the query, so that if you're using it, you can have a hundred or a thousand concurrent users all running dozens of queries, and you're not worried about them running up a huge bill.
So it's super powerful. It's currently tied to Data Studio.
Sorry.
Yeah.
I tried to use it with Looker the other day and I saw there was...
So why is that?
What does Data Studio do or the connection do that means it only works with that for
the time being?
So we have a separate...
It's a separate API currently.
It's a low-latency API that skips some of the stuff that our current asynchronous API has to do. And it's also tuned for BI-style queries, so there's no SQL parsing involved, et cetera. We've talked to some of our partners like Looker and Tableau, and they are going to wait until we have a version that operates over SQL. So the BI Engine memory cache, the adaptive cache, will be available to all SQL queries, but it's not available yet.
Okay, okay.
So you say adaptive cache, then.
I'm used to, I suppose I'm used to two ways of doing this.
One is, one thing we tried to do in the old role
was we looked at maybe using Druid on top of BigQuery,
and that would be a case of loading up the Druid cache,
and then once it's in there, it's fast and so on. I mean, first of all, how conceptually similar is that to what you're doing now? Presumably you have to load your cache up and keep it in sync and so on.
So customers don't have to load anything. The first query will have to scan the underlying data, but that should load the cache.
And if the cache gets full, basically it's not necessarily an LRU. We apply a bunch of heuristics
to figure out what the best stuff to cache is. And so we won't necessarily cache a full table.
We'll only cache the columns and the partitions that are being
used. So you can use it over a large table. Let's say you have 10 gigs of BI engine memory,
and you might have a 100 gigabyte table. It'll likely all fit because, A, it's the compressed size; B, only the columns that you use; and C, only the partitions that you use go in the cache.
So we don't currently have a way to pin data into the cache.
We want to see how well we can figure out things automatically.
But we may end up with the ability for people to say,
hey, I really want this to be in the cache no matter
what.
Okay.
Okay.
And you mentioned, the other way I've seen this being done is through, you mentioned
OLAP servers, and obviously they've got the ability to kind of pre-compute aggregates
and work through hierarchies and so on in a certain way.
Is there anything in the product at the moment or in the pipeline to help with, I suppose, aggregation of data as well?
Not necessarily, no. So basically,
the way we're just approaching that is that our BI engine is so fast that it'll effectively
let you build these aggregations and kind of OLAP aggregations very fast.
So what about another one? I mentioned indexes, too, but the other side of that horror story of optimization is materialized views and query rewrite and so on.
I mean, again, has that been,
and the context of this has been,
when I used to build Looker dashboards,
sometimes I used to think it was so handy if I could just pre-compute some aggregates and have the database rewrite the query to use the aggregates.
I mean, again, is that something that's been thought about with BigQuery?
You know, again, I don't want to pre-announce anything, but I would say that's something that we've thought about.
Okay, that's good. So what about, I mean, obviously, you know, when we used to use BigQuery in my previous role,
it was to receive streaming data via PubSub,
and it was a huge amount of data coming in in real time.
Is BI Engine part of that story around real-time analytics and real-time data ingestion,
or is it very much sort of separate or what really?
Absolutely.
So actually one of the goals of BI Engine, and it's not necessarily tied to it yet, but it will be, is that if you're streaming data into BigQuery, A, queries should still come back in tens of milliseconds, and B, they should have up-to-the-second results. Because part of the reason you want to have these dashboards is you want to be able to make decisions based on what's happening now. I was talking with a customer who oversees a railroad system and wanted to know what is happening right now.
And that's a question that traditional data warehouses, they just can't answer.
What's going on in this IoT network of mine? What's happening right now? And in order to do that, yes, the real-time analysis is super important. And that has to go all the way through the tool chain, from how the data is being generated, where it shouldn't be buffered and batched, to how it's represented in the data warehouse.
And I think that at...
Oh, sorry.
So I think one of the things, I think one of the areas of real differentiation in BigQuery
that we're pushing on is this real-time data warehousing and real-time decision making.
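As a minimal sketch of that ingest path, here is what streaming rows into BigQuery looks like with the Python client's streaming insert API; the project, dataset and field names are invented for the example. Rows sent this way are queryable within seconds rather than waiting for a batch load:

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.telemetry.sensor_readings"   # hypothetical table

rows = [
    {"sensor_id": "s-42", "reading": 0.73,
     "ts": datetime.datetime.utcnow().isoformat()},
    {"sensor_id": "s-17", "reading": 0.91,
     "ts": datetime.datetime.utcnow().isoformat()},
]

# Streaming insert: rows become visible to queries almost immediately,
# so a dashboard query issued right afterwards already reflects them.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)
```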
And it's kind of related to some of the stuff in Dataflow as well. Cloud Dataflow, which is our managed data pipelines tool: in Dataflow, batch and streaming are two sides of the same coin. You can turn a batch into a stream just by changing one parameter. And that's super powerful. The architects of Dataflow and Apache Beam argue that while most people think streaming is a special case of batch, that as the batch size goes to one you get a stream, it's actually the opposite: batch is a special case of streaming. Because if you think about the batches, say you have daily batches, well, that's just a windowing strategy: the window I have on my stream is the daily batch. But you could easily have a sliding window. You could have late-arriving data that gets applied to the window. Once you see batch as part of streaming,
then it kind of opens you up to lots of different ways
of representing your data and processing your data.
And so we want to take that end to end with our data analysis
products.
So from Dataflow streaming to BigQuery,
we want to have a really nice streaming system that
operates in lockstep and at virtually any scale.
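To illustrate the "a batch is just a window" idea, here is a heavily simplified Apache Beam (Python SDK) sketch; the Pub/Sub topic and BigQuery table are placeholders, and a real pipeline would also need schema handling and error handling:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)  # read Pub/Sub as an unbounded source

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "One" >> beam.Map(lambda msg: ("events", 1))
     # A daily "batch" is just a fixed window over the stream; swapping in
     # window.SlidingWindows or adding allowed lateness changes the windowing
     # strategy, not the shape of the pipeline.
     | "Daily" >> beam.WindowInto(window.FixedWindows(24 * 60 * 60))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: {"key": kv[0], "events": kv[1]})
     | "Write" >> beam.io.WriteToBigQuery("my-project:analytics.daily_counts"))
```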
So in my GCP Next talk last week, I showed off a Dataflow pipeline that was sending BigQuery 22 gigabytes per second, which is about almost two petabytes of data per day. BigQuery was processing it, and then you could run a query over it. I was simulating a sensor network, so I injected a row from a bad sensor and showed in a query that it shows up immediately. So I think it's a really powerful area, and
I think it's going to become more important in the future as people start to leverage the value of
this data that's being generated over time. All data really starts with an event and so those
events can be processed as a stream. Anyway, sorry for the rant. It's something that I'm excited about.
It's really interesting.
One thing, actually another announcement you made, or it was made at Next, is a bit of a blast from the past, really.
And it was the BigQuery storage APIs.
And one of the first things that was very interesting about Hadoop
was this idea you could have, you separated your query engine,
your workload, effectively, from the storage.
And you could run all different types of query, like graph queries, or SQL queries on top of Hadoop storage.
And then I noticed that you've got the storage APIs being announced, which seems to me was about,
again, running different workloads on BigQuery storage. And what was the thinking behind that?
And what was the driver behind releasing that feature?
So there are a lot of people doing data lakes these days, where they store their data in object storage in Parquet files or ORC files, and they run Spark jobs and Hive jobs over that data. And the problem with that is that Parquet files are difficult to manage.
And so you have to, where do I store this data?
What's in it?
What's the schema?
How do I make sure that the access control is set correctly?
How do I make sure that it's optimized?
So in general, data comes in over time.
And when data comes in over time,
there's a tendency if you want faster access,
more real-time access to the data,
you end up with smaller and smaller files.
So you have smaller and smaller files,
then you have more files,
and then it's much less efficient to query,
sort of the small files problem.
And I think like Databricks Delta
is trying to solve some of this problem.
And so when you have an actively managed storage system,
you get a lot of benefits from performance.
You get benefits from consistency
because you actually get atomicity of changes to the data.
Let's say you want to delete one row out of a table. Because files in object stores are generally immutable, that involves a file creation and a file deletion. So there's an inconsistent period, whereas if you're in a
managed storage system, you can provide a facade that says, I only want to see the exact version
of the data now or at some time in the past. So BigQuery has what I think is a really top-notch managed storage system
with lots of features like DML,
like automatic kind of data coalescing and optimization.
So what the Storage API does is, if you have a data lake paradigm where you're running from multiple different types of tools, it lets you operate over the BigQuery data as if it's your data lake. It allows you to do some of the same types of things, like filter pushdown and column projection, so you don't actually have to read all of the data in, but it also manages data governance and performance at the same time.
So I'm actually super excited about the storage API because I think that it really kind of turns the data lake idea upside down.
Because one of the things that people don't want to do is they don't want to have to pay twice for their data.
They don't want to have to store their data in their data lake
and their data warehouse.
With different copies, they have to worry about synchronization, access control, drift.
This lets you store one copy of the data in your data warehouse
where you can actively manage it and then operate over it at full speed if you're using Spark, Spark ML, Hive, et cetera.
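As a rough sketch of what reading BigQuery managed storage directly looks like, here is the google-cloud-bigquery-storage Python client with column projection and a filter pushdown; the project, table and column names are placeholders. A Spark or Hive connector would be doing much the same thing under the covers:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Fully qualified table in BigQuery managed storage (hypothetical names).
table = "projects/my-project/datasets/telemetry/tables/sensor_readings"

requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["sensor_id", "reading"],     # column projection
        row_restriction="reading > 0.9",              # filter pushed down to storage
    ),
)

session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,                               # more streams for parallel readers
)

reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):                      # rows decoded as dicts (Avro)
    print(row["sensor_id"], row["reading"])
```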
Interesting.
So another announcement was around BigQuery ML.
And I've been using BigQuery ML myself to do things like basic logistic regressions and so on. And again, the thing that's been interesting with it is how easy it's been to use compared
to other solutions I've seen in the past for doing, I suppose, stats analysis and machine
learning in the database.
And if anybody doesn't know what BigQuery ML is, maybe just kind of recap on what it
is and then what was announced at Next19.
Sure.
So BigQuery ML is machine learning in the database. As I mentioned before,
you want to move the compute to the data rather than the data to the compute. It's being able to
do machine learning without having to move the data and with the convenience of SQL access. So you run a query that looks just like a normal SQL query
to train a model.
So instead of running create table as select,
you run create model as select,
and you tell it which model type you want to run,
and it'll build your machine learning model.
And the output of that will be a model that you can then reference in another query. When you predict or infer based on that model, it's a SQL function, a table-valued SQL function, that you can use as if it was just a normal function. So it's really powerful.
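As a minimal sketch of that flow, with dataset, table and column names invented for illustration, training a logistic regression and then predicting with it looks like this in BigQuery ML:

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL ... AS SELECT: same shape as CREATE TABLE AS SELECT, but the
# result is a trained model instead of a table.
train_sql = """
CREATE OR REPLACE MODEL `my-project.shop.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT
  days_since_last_order,
  total_orders,
  total_spend,
  churned
FROM `my-project.shop.customer_features`
"""
client.query(train_sql).result()   # training runs as a query job

# ML.PREDICT is a table-valued function over the model plus an input query.
predict_sql = """
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
  MODEL `my-project.shop.churn_model`,
  (SELECT customer_id, days_since_last_order, total_orders, total_spend
   FROM `my-project.shop.customer_features`)
)
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```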
It had been limited, however, to two model types, logistic
and linear regression, which operated really well
at high scale.
But many people said that they wanted other model types.
So we've launched matrix factorization.
Matrix factorization lets you do kind of recommender systems.
We launched, what was the other one?
I think it was k-means clustering.
Yes, k-means clustering is now in beta.
Thank you.
And we've got a couple more in alpha.
So in alpha is being able to build neural networks.
And that one is interesting because it's really the first that ties it to Cloud Machine Learning Engine. We now have a commitment between our team and the Cloud AI team that basically everything you can do in Cloud AI and machine learning, we're going to be able to do through BQML. So I think this is really going to be our portal from structured data analysis to machine learning.
So just the last thing I want to talk about in the new features that were launched last week is the data transfer connectors. There were a bunch of partners you now have offering services on the Google Cloud Marketplace, so Fivetran and Supermetrics.
What is that all about, really?
And again, what problem is that solving,
and what features does that add to BigQuery?
So it's really about being able to get data into the database
more efficiently, or into the data warehouse more efficiently. So we have built a capability that allows anybody to build connectors to get data into BigQuery and then to resell that data. Fivetran was the first one we've announced; you can basically, with a few clicks, import data from, I think it was 100 different sources, including Salesforce data and a bunch of other data locations. So it's really about, you know, people have data in
other data sources, data that's locked up in other tools, SaaS applications, services. And we want to be able to pull that data out because there's value in joining that data with your other data
and sort of centralizing the data in the data warehouse.
Okay.
And I mean, I suppose, again, one last thing.
So there was the Data Fusion product.
So again, I was interested to see it's a kind of point-and-click
data integration product.
And you say, actually, Google have bought the company behind it.
I mean, is there anything you can talk about that at all?
Is it a product you're aware of?
So Data Fusion is a really convenient and easy-to-use tool that lets you just drag and drop: you can select your source data service, select your destination data service, which hopefully will be BigQuery but could be anything really, click a button, and it'll generate a Google Dataproc job, which is actually a Spark job under the covers. And it'll run that job on a schedule and do the data migration. You can also specify transformations and schema changes.
And it's got a lot of... We actually, for the beta, we turned off a lot of the features that were in the underlying service that we bought, but expect to see
a lot of interesting things coming from that in the next few months.
Okay.
Okay.
And going back to your point about lots of projects being invested in and then seeing which one succeeds: I've used, in the past, the Trifacta sort of service that's on Google Cloud, and there's Airflow as well.
I mean, is there a particular, I suppose it's different,
I suppose it's different tools for different kinds of jobs, really.
But is there a particular technology or product
that you guys are investing in,
which is the go-to thing for data integration on BigQuery?
Or is it really all of them in different cases?
So Data Fusion, I think, is really the one that we are betting on; this is the one that's going to be our data integration tool of record. We're not going to shut any of the other things down, or even stop investing in them, but this is what we see as going to be our marquee thing going forward. And Composer, which is our managed Airflow, is very different. It has perhaps slightly overlapping use cases, but it requires a lot more understanding of what you're doing. So for most people that want to pull data from one location, or even do CDC, the way we are going to recommend is through Data Fusion.
Okay, fantastic. And one last thing: you've written a book recently. So what's that, well, I guess, what's it about actually? Tell us about the book you've written, and your co-author as well.
Yeah, so the book is in progress.
It's in sort of early release.
I think we've got actually just today two more chapters
were launched into the early release.
So we have now five chapters.
And yeah, I mean, this is, you know,
I think it's called BigQuery: The Definitive Guide.
And, you know, the idea is this is sort of what BigQuery is and how to use it.
I wrote a book five years ago, which was the first book on BigQuery.
And it's horrendously out of date now.
And it's still for sale.
So I think people are still buying it, which I feel really bad about, because it's giving them bad ideas and a bad impression about what BigQuery can do and what it can't do.
So I think that's one of the rationales behind this new book is sort of like, hey, this is
what BigQuery can do now.
You know, BigQuery is now a fully featured data warehouse that can do all these other
additional things.
And so my co-author, Lak, I have a hard time pronouncing his full name, but everybody calls him Lak at Google.
But he's an absolute machine.
He just turns out pages and he's been really instrumental in making this book happen.
And I've been largely so far been along for the ride.
And he's been kind of the driving engine of that.
But he also runs a lot of our training programs at Google, and he has another book, I think on machine learning with GCP.
Fantastic. Well, that'll be my present for the family for Christmas this year. So that'd be excellent. So just to round things up, how do people find out about these new product features that came out recently? Can they maybe see some of the videos and presentations that you did last week?
I mean, where do people go to for that?
Yeah, I mean, so the Google Cloud Next videos
are all up on YouTube.
There's a spotlight session that has a good highlight
from all the data analytics features that were released.
My session on modern data warehousing, I think, gives a good overview of all the things that were going on in BigQuery, just to plug my own session.
But, you know, I think we do have,
you know, we do have release notes
and ways that you can keep up to date
with the things that are coming out.
And one of the problems that, you know,
sometimes we find is that,
we don't do a great enough job of shouting from the rooftops when we make improvements. And so sometimes we talk to customers, even really good customers, and they're like, well, I'm doing this to work around this problem, and we're like, well, we fixed that problem like a year ago, or we added this new feature a year ago, and they haven't started using it. If it's working for them, that's fine, but if they have to do kind of unnatural things because they're not using it, then that's not as fine.
Fantastic. Well, look, Jordan, it's been fantastic speaking to you.
And I really appreciate the detail you've gone into there with how the product works and what's come out recently as well. And yeah, all I can say is it's the database that I use day to day. So, yeah, it's fantastic.
It's great to speak to one of the people behind it.
So Jordan, thank you very much for coming on the show
and appreciate that and have a good rest of the day.
Well, thanks for your time.
This has been fun.