Drill to Detail - Drill to Detail Ep.31 'Dremel, Druid and Data Modeling on Google BigQuery' With Special Guest Dan McClary

Episode Date: June 19, 2017

Mark is joined by returning special guest Dan McClary to talk about data modeling and database design on distributed query engines such as Google BigQuery, the underlying Dremel technology and Capacitor storage format that enables this cloud distributed data warehouse-as-a-service platform to scale to petabyte-size tables spanning tens of thousands of servers, and techniques to optimize BigQuery table joins using nested fields, table partitioning and denormalization.

Transcript
Starting point is 00:00:00 So hello and welcome to another episode of Drill to Detail. And I'm delighted to be joined again by one of the first guests we had on the show back in September last year, Dan McClary from Google. So Dan, it's great to have you back on the show. Why don't you introduce yourself to anyone who's missed the episode back last year? Sure, Mark, thanks for having me. And gosh, it's hard to think it's been all the way since September since we've done this, but I'm glad to be back. For folks who don't know who I am, I'm one of those people who has sort of bounced around the large scale data processing database industry for a while. I was at Oracle for many years. And I've been at Google for the
Starting point is 00:00:45 last little while. I work specifically on BigQuery, which is part of Google Cloud Platform, and then the internal system, which both backs it and serves Google itself, called Dremel. Fantastic. Well, Dan, I mean, obviously, what's been interesting for me, and one of the things that prompted me having you back on the show, is I've been using BigQuery myself a lot now in my current job. And we spoke earlier on about how it's interesting to maybe kind of replay back to you some of the things that I've been finding with it and whether those interpretations are correct or whatever, but also to kind of go back on some of the topics we talked about and have a bit more
Starting point is 00:01:16 of an informed discussion around it, really. So it's good to have you on here. And thanks very much for doing this. But just for some kind of background, really, I mean, the Google Cloud platform that you're working on and Google BigQuery, I've noticed it being very popular within some of the kind of startup world I've been working in, particularly kind of marketing tech and financial tech and so on. You know, just tell us what are these products and why are they particularly popular with this kind of market that I'm seeing at the moment? So, you know, Mark, if I were to try and take a macro view of it, I think one of the things that GCP excels at relative to other cloud platforms is serverless services like App Engine, where you can scale up what you need to run an application without having to make any sort of server management decisions. You don't even deal with the notion of servers.
Starting point is 00:02:20 You simply write your application and you go or you write your query and you go. Now, historically, I think one of the reasons Google excels at this is our own internal experience. Google has been building cloud infrastructure for itself for a very long time. And one of the things I think that has been apparent to us internally is that more people can be more productive, whether that is analyzing data or building applications or serving users, if they can offload much of the responsibility of maintaining infrastructure and then even maintaining services themselves. And it's how we have composed the underpinnings of Google itself. And in most cases, we are passing that on
Starting point is 00:03:05 to Google Cloud Platform customers. I mean, this abstraction of the infrastructure and the fact that the infrastructure takes care of itself, or certainly you guys take care of it for customers, that's something that, coming from the on-premise world, I kind of didn't really get the significance of, really. And I think it's, you know, working in the startup world and seeing how they've been through iterations
Starting point is 00:03:27 of this kind of big data technology over the years and to the point where even though Hadoop technology is very kind of, you know, fault tolerant and so on and so forth, the actual kind of effort involved in maintaining that becomes overwhelming in the end. And this kind of, this ability to kind of take away that complication, everything just auto-scales. It's massively important for these businesses that run at this scale and have to be there all the time.
Starting point is 00:03:50 And if you think about startups or companies that are focused on tasks beyond "we are technology", you have some limited runway and you have a problem to solve or a number of problems to solve, which are domain specific. And to then divide your precious time between that and sort of caring for infrastructure as if it was a collection of pets is maybe not the most effective way to reach your goals. And again, I mean, I think that replicates a lot of what we've found internally, which is that if you're trying to solve a problem, you probably don't need to go and build and maintain a storage system to go and solve the problem in advertising, in email, in document editing. You should solve the problem. And I think it's really heartening that we see so many startups and so many organizations that are focused on their domain problems embrace Google Cloud Platform as a better way to focus on their mission. Exactly, and I think the other thing that's been interesting
Starting point is 00:04:50 I found working with BigQuery and your platform the last kind of six months is, I remember the days of Oracle when I was working mostly with that, and one of the ways that it kind of built up its developer following was that you could go and download all the software from their website on a developer license, and effectively you could play around with the entire stack on your kind of desktop, on your PC, and learn Oracle in those days. And that's what I've been finding is the case with BigQuery and Google Cloud Platform, because of the kind of very generous developer and trial limits you've got. I think I must have paid about 50 pounds in about six months in terms of, you know, usage of this at home, but I've been able to teach myself BigQuery and PubSub and so on using that. And that's a very kind of
Starting point is 00:05:28 enlightened attitude I think you guys have got. Well, I think ultimately we, you know, free trials matter to us and we want to make sure that anybody can access these things. And speaking personally, I think, you know, we want to see more people analyzing data, whether it's in their personal lives or as hobbyists or as part of their business. And, you know, when we speak to another organization or you speak to a large organization, a small organization, obviously, you know, people understand that they're paying for service. However, we want to reach anyone who has an interest in analyzing data, which means we have to lower the cost of entry as well. And much the same way, you know, back in the old days, you could download the Oracle developer license set and play around with it and make a little database and do things. We want to inspire that same level of play and explore, but in a way in which you can do it
Starting point is 00:06:18 from a tablet. You can do it and feel no burden that, oh, I'm going to get a charge and it's going to disincentivize my learning. Exactly, exactly. So let's kind of get into some of the detail now. So you work with BigQuery, and BigQuery is a technology that, again, was a big surprise to me, really, coming from the more, I suppose, the relational world and big data world, very different technologies there. But then BigQuery's based on Dremel, which we'll get onto in a second. It's a kind of different solution to this scaling problem, isn't it? I mean, talk us through the roots of Dremel and what that did and how that worked and how it approaches this in a certain way. So I think one thing, you know, what we say internally is that
Starting point is 00:06:59 BigQuery is Dremel, um, because ultimately when you run a query in BigQuery, it is the Dremel service that handles it for you. Now, to speak about the history of Dremel, for folks who have never read the paper that is relatively well known in database circles, Dremel is a system we've been running at Google for about 10 years. It is Google's enterprise data warehouse. At this point, 80% of Googlers use Dremel on a daily basis for some tasks, whether they know it or not. And effectively, it sort of started with a log processing problem, much the way many things in the sort of big data space did. You know, we have had MapReduce for a very long time internally. We had systems like Tenzing, which were built on top of it. But there was an engineer, Sergey Melnik,
Starting point is 00:07:46 who was working in the Pacific Northwest, and he was thinking about, well, we have all these log files, and they're structured, they're in some senses sort of complex objects. The lingua franca inside Google is a protocol buffer, which is much like a binary JSON object. And he was thinking, you know, it would be really great if we had a better way to analyze some of these logs that was faster and was maybe sort of SQL-oriented, and there were other SQL systems at the time. But he started it; it
Starting point is 00:08:14 was started as basically a 20% project, and he started to play with it a little bit. It got some traction and it attracted some more engineers, and over the course of a couple of years it really gained a lot of traction within the company because it was easy to use. It was exceptionally fast. It was very good at aggregations. It was very good at scans. You know, over time, again, there were decisions made like, wow, we should maybe make it so that this is the data warehouse for the entire company. And that's very much what has happened. Now, this has caused a few things to happen in sort of the evolution of what BigQuery was when it debuted about six years ago and what it has been over the last several years. And a couple of those things are, one, a tremendous focus on making it easy to access data such that we,
Starting point is 00:09:06 I think you and I both know that moving data around, particularly on network systems is death to performance. It is challenging to have to say, oh, I've got to copy this data over there, and then once it's materialized, I've got to read it again. And moving data around is slow death. And so within Google, we've certainly created a culture in which as long as you have the right authorizations to read data, you can read what you need to read,
Starting point is 00:09:30 right? Obviously, there's security and there's privacy filters that are on top of these things. But if I need to go and read a set of data, and I have auth to read it, I can just go and query it. And that's just how it works. And if I need to share some stuff with someone else, I can just share it, and they have rights to it, but they don't have to copy data around. And this has allowed us to centralize the sets of data on which we analyze things and allow us to have a common base of reference. And at the same time, it's really allowed us to involve more and more people within the company in the business of analyzing data to make good decisions. So that was a hallmark of how Dremel itself evolved. And it's a thing that we've exposed very much as a hallmark within BigQuery as well.
Starting point is 00:10:09 Sharing access to tables and data sets is super lightweight, costs nothing, and promotes more and more sharing. The other thing that really sort of began with Dremel is because our logs were so often stored in sort of structured nested formats, the engine first optimized for that and became very, very good at handling complex objects
Starting point is 00:10:37 as opposed to simple record-oriented things. And so we'd have columnar structures. The original file type we used was called ColumnIO, written up in the Dremel paper, and it was very good at handling columnar data, so it was fast for analytical workloads, and then nested columnar data. Now, what this meant, though, is when BigQuery debuted several years ago, it looked very much like Dremel in the sense that it was very, very good at massively parallel scans and filters and aggregations, and it did very well with this sort of nested complex data.
Starting point is 00:11:08 But it was harder at the time to bring, say, a third, fourth, fifth normal form data model to the system and perform as well as it could. One of the areas of sort of major, major investment for us over the last several years has been making the system equally well prepared to handle both the sort of nested complex data, in which you can represent relations but you have, in fact, denormalized the data on disk, and being able to say, hey, pick up your third normal form data model, put it in here, and, you know, multi-stage joins
Starting point is 00:11:45 will perform well, right? You can handle problems of bitemporality, you can handle problems of slowly changing dimensions, in the same way that we have all sort of learned to in good data modeling. And so I think there's an interesting evolution there, in terms of: with Dremel we started by solving a Google problem for Google. And as BigQuery allowed Dremel to meet users in the rest of the world, we realized there was a whole other set of valid use cases that we needed to enhance our ability to handle. And at both the language level and the execution level, it's become a real obsession with us to handle those workloads as well as we've handled,
Starting point is 00:12:26 you know, classically Googley workloads. Yeah, exactly. I mean, to the point where I've been using it just as a replacement for a relational database in some respects, building out small tables, big tables and so on, and it's only at certain points that you start to realize this is different to a relational database. And I'd like to get on to that a little bit later with the kind of data modeling side as well, because one thing I was interested to get from you in this talk is kind of how do you aim your design: do you normalize tables, or do you kind of build it however you want? It would be quite good to cover that later on and try and get, I suppose, the state of the art, really, from you on that, which would be interesting. And I think it's
Starting point is 00:13:01 a good point to cover, just because I think, yeah, there can be a lot of confusion around it, particularly when presented with a new option for data modeling. It's not always clear whether or not you should take it. Yeah, exactly. So something I wanted to talk about was the Dremel engine, which is obviously the query engine behind what you're doing there. Obviously, there are a lot of SQL-on-Hadoop engines, we talked about this before, that claim to kind of be able to scale to thousands of servers, lots and lots of data and so on. But reading through the Dremel paper, the particular way that Dremel splits the work amongst the servers, it puts columns on different nodes and so on. What was the innovation that Google had there that meant you can scale far beyond other solutions for this? Well, I mean, I think in some senses it's twofold, right? One of the things we talked about last September, and I maintain that this is true, is that part of the reason that Dremel is as powerful as it is
Starting point is 00:13:53 is because it relies on Google's infrastructure. It relies on our networking. It relies on our vast fleet of servers, which means that even if you were to take the code for Dremel and deploy it within someone else's data center, it likely wouldn't perform as well because it gains tremendous advantage from our size and scale. Yes.
Starting point is 00:14:13 And is that Jupiter and Borg? I remember reading about them in the paper. Is that what you call them, your network and your sort of things? Yeah. So Borg is our container system. So Kubernetes has gained a lot of popularity in the last couple of years. And Kubernetes you can think of as sort of an evolution in container technology, sort of a descendant of Borg.
Starting point is 00:14:33 Borg is the container infrastructure that is most commonly used inside Google. And so early containerization helped us quite a lot. It allows us to have very light worker processes that are able to do very powerful things and allow us to, again, operate clusters that are massive. Jupiter is the networking system that we have published on and we have talked about. And, yeah, Jupiter has a phenomenal capacity. We don't very often talk publicly about our internal network bandwidths. There was a paper, Jupiter Rising, in which we discuss it. There was a paper YouTube released, I believe, last year in which you can infer some things. What we have last publicly said is that our bisection bandwidth is about a petabit. A network bisection bandwidth of a petabit within our fleet effectively means
Starting point is 00:15:24 all machines can communicate with all other machines at 10 gigs a second. And that's pretty tremendous. And it allows us an ability to think about storage and compute separately. When we were all sort of exploring the Hadoop world and MapReduce, there was this notion, and it's an entirely reasonable notion, that the data is too big to move, so you move the work to the data. Entirely true, until you reach a point at which network bandwidths are fast enough that even if the storage and the compute are in separate racks, they are connected in such a way that they are fast enough to treat as if they were local, right? If they reach that level of speed, then all of a sudden you get this really interesting ability to sort of say, make your storage decision in the way that you choose to, make your compute decision in the way that you choose to, and you can optimize for both independently.
Starting point is 00:16:27 and its difference to things like SQL-on-Hadoop is specifically that phrase, SQL on Hadoop. And I think we've seen this in that space, right? We look at Hive and things that actually use the original Hadoop MapReduce engine to process things. And I think we all know that a lot of that stuff, while powerful, could end up being wickedly slow.
Starting point is 00:16:49 Now, when we look at things like Impala, you know, we look at things like Presto, these engines have been built to sort of remove the reliance on the MapReduce execution engine. Now, Dremel was never built on top of MapReduce inside Google. There were other systems like Tenzing that did that. Part of what made Dremel really powerful in its early days was the fact it was not built on MapReduce. And, you know, having not inherited a legacy of that particular piece of programming or that particular programming model has probably allowed us to move very quickly and to continue to refine stuff. But yeah, I think when you compare it to other SQL-on-Hadoop solutions, or even the
Starting point is 00:17:31 things I think that are very good in the market right now, Presto and things like that, there's still a fundamental question of, you know, the scale and infrastructural quality is a thing that is very difficult to replicate. Exactly, exactly. So, I mean, the theme of this podcast really was to introduce this to the world of people who mainly do data warehousing at the moment. And one of the things that struck me when I came into the world of people using this all the time was that equally as important is the fact that the ETL and all the kind of messaging and stuff around it scales as well at the same time. So if you are going to design a system that is going to rely on BigQuery and its ability to kind of handle bursty processing and so on, the actual fact that, you know, Dataflow and PubSub scale in the same way, that's massively important, isn't it?
Starting point is 00:18:23 It is. And we think that, as organizations, if you deal with bursty data volumes, if you deal with things in which you need to be cost-optimized, things like Dataflow and PubSub can really help manage that. But at the same time, I think one of the things I'm noticing is that known players in the ETL space are realizing that BigQuery is a valuable and important target. And so, I mean, I think, you know, things I've seen recently from Informatica and particularly Informatica Cloud,
Starting point is 00:18:53 like I'm very impressed with their treatment of BigQuery and their interest in understanding how to model data well within transforms and things of that nature. So I think it's nice because you get the ability to say, you know, I have an ETL tool. I love my ETL tool. And you can come and use BigQuery. But you can also say, well, if I'm rethinking what the data warehouse is, maybe I need to rethink what ETL is, right? If I do that, do it in a similarly sort of serverless pattern, I want to do it in a way that is simple and straightforward for me from an execution standpoint, you know, data flow, cloud data prep, PubSub, all of these things kind of allow you to do that as well.
Starting point is 00:19:32 And that's an interesting topic in itself, really. I mean, I think certainly, I mean, I had this conversation with a variety of people really, kind of Maxime Beauchemin and kind of Gwen Shapira and so on, about how ETL is different in these kinds of situations. And I suppose one aspect of that discussion is whether data engineering now, for example, has changed ETL forever in a certain way. But also there's the whole kind of pipeline-based processing as well. I must admit, one of the things that I've found, I suppose, more daunting coming into this world is having to do ETL through things like PubSub and Dataflow.
Starting point is 00:20:05 You know, coming from the world of kind of ETL tools and so on, it's different, really. It's conceptually different. I think I will say, I mean, we look at things like Cloud Dataflow and Apache Beam, and they're conceptually different. And that means that as they're sort of making their way into the world, they are still very programmer-centric, right? I think for you and I, you know, we might gulp a bit, but, like, we can open up a text editor and we can write a job and we can make that go. As much as it was, you know, in the days in which, you know,
Starting point is 00:20:35 our ETL processes were largely, you know, either bash-scripted or written in another programming language, the tooling is... I think we're at the very early days of tooling on these kinds of platforms, right, these kinds of concepts. I think Data Prep is an interesting initial example of what it looks like when a tool begins to deal with these new paradigms as they reflect ETL or ELT or whatever data engineering is ultimately turning into. Exactly. Well, let's get on to the main topic I want to talk to you about, which is data modeling. And so you come from a background, you worked at Oracle before, both of us have worked in data warehousing and so on.
Starting point is 00:21:15 And we're very well kind of versed in this idea of kind of, you know, in a transactional system you build things normalized, and in a data warehouse you have facts and dimensions and so on. And as you said earlier on, the work you've been doing with BigQuery the last few years is such that actually you can model things out in largely the same way you did before, except, and I tweeted this recently, except at certain points when you have to consider the way that BigQuery stores data and so on. So, Dan, as a starter, what's your initial kind of high-level guidance to people who are coming from the data warehouse world and are building tables and structures in BigQuery? What do you generally advise them at the start? Well, you know, in many ways I think it's the reverse of
Starting point is 00:21:54 the sort of old snippy thing to say, which was, you know, normalize till it hurts, denormalize till it works. I mean, I feel like with most data warehouses, that's what you ended up doing. And, you know, my advice to people thinking about migrating to BigQuery or thinking about, well, should I set up a new data warehouse on BigQuery? If your data model well represents your business, because again, the data model in many cases is designed to reflect reality, right? The reality of your business, the reality of your operations. If that data model is one you have strong belief in, you have strong reliance on it, and you're not looking for some excuse to go and re-architect, I always advise people to try and bring it whole cloth. You know, if you have put in the effort and time into thinking about your
Starting point is 00:22:40 data in a fifth normal form, it's worth bringing it and understanding where there may be performance challenges and where there are things in which the modeling itself is actually far more important than the performance itself. I think classic things people point at, slowly changing dimensions, is one of everybody's favorite bugaboos. If you have that, that reflects something about your business which is
Starting point is 00:23:09 important, and that is a thing that you should probably treat in a way that is important to your business. Now, as you begin to sort of reach a point where, like, I need to consider optimization, what are the ways in which I denormalize? Is denormalization the only path? One of the things, particularly when I talk to organizations that are coming from running very large Teradata, Netezza, Oracle warehouses, is often, and I think you and I would experience this as well as ad hoc analysts, I might write a query that I would fire off because of its priority and because of what it was doing. I kind of expected it to take a day or two.
Starting point is 00:23:49 And, you know, nobody liked writing that query, but if that's how you had to run it, that's how you had to run it. Oftentimes, I challenge people who write queries of that sort in a traditional EDW, when they come to BigQuery, to think about modularization, to think about, well, I know you have this one massive query which does this and you're used to it taking 48 hours to run. Can you think about decomposing it and materializing intermediate steps, for two reasons? One, because storage is cheap and simple, it's very, very easy to say, well, you know, I only really need to run the last subquery there, right, or I can run all of these WITH clauses first and iterate on that
Starting point is 00:24:31 final query that I'm interested in, in a way that would allow me more flexibility as an analyst and ultimately might end up being more performant. That's one thing, right? So the notion of, like, well, can you take that big query and maybe think about breaking it down into modules? And in part, could some of those modules help your collaborators, right? If I have four WITH clauses in my query and then the query itself, you know, Mark, do you maybe find value in two of those WITH clauses? Like maybe I should just materialize them. Maybe that's a thing to do, right? Once I've figured it out, maybe I should just go ahead and materialize that for you. But then we get down to the actual question of: should I actually denormalize? When we look at it, we think there are sort of two approaches to denormalization. There's classic big data denormalization where I flatten the whole table and I get a massive number of columns.
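As an aside for readers, here is a minimal sketch of the modularization Dan describes just above, using invented dataset, table and column names (mydataset.daily_sales, customer_id, amount and so on are illustrative, not from the episode). The WITH clauses break one large query into named steps, and a step that collaborators reuse could be materialized into its own table, for example by running just that SELECT with a destination table, rather than recomputing it every time.

```sql
-- Hypothetical decomposition of one large query into named, testable steps.
WITH daily_totals AS (
  SELECT customer_id, DATE(order_ts) AS order_date, SUM(amount) AS total_amount
  FROM mydataset.daily_sales
  GROUP BY customer_id, order_date
),
active_customers AS (
  SELECT customer_id
  FROM daily_totals
  GROUP BY customer_id
  HAVING COUNT(DISTINCT order_date) >= 5
)
SELECT d.order_date,
       COUNT(DISTINCT d.customer_id) AS active_buyers,
       SUM(d.total_amount)           AS revenue
FROM daily_totals AS d
JOIN active_customers AS a USING (customer_id)
GROUP BY d.order_date;

-- If daily_totals is useful on its own, run just that SELECT once with a
-- destination table (e.g. bq query --destination_table=mydataset.daily_totals)
-- and then iterate on the final query against the saved table instead.
```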
Starting point is 00:25:19 And flattening like that is a totally valid way to denormalize. We support it. We like it. We see a lot of it, particularly when we find people coming to BigQuery from Hadoop and Hive, for example. Now, obviously, you know, people say, oh, data duplication. Well, data duplication, when we think about how Codd defined it and talked about it, right, there were two motivations, right? One was obviously you didn't want to have duplicate records, for purposes of logical consistency and correct results. But the other challenge that sort of Codd was addressing there is the fact that, you know, at the time, right, we're talking, you know, in the 70s, you know, bytes were very expensive, right?
Starting point is 00:25:50 But the other challenge that sort of Codd was addressing there is the fact that, you know, at the time, right, we're talking, you know, in the 70s, you know, bytes were very expensive, right? Bytes of persistent media were incredibly expensive, and bytes of persistent media are incredibly inexpensive now. And when we think about sort of, you know, just a flattening denormalization, I think we can remove, at least in BigQuery, we can remove the notion that like, oh, it's going to be very costly to do it from a billing perspective. I think what we have to ask ourselves is, is that going to make it easier for us to analyze data? Is it going to make it easier for everyone to find the columns they need? Is it going to make it easier for you to write your query? And are we going to maintain correct results?
Starting point is 00:26:32 We can think about it just in terms of, like, what is the answer we're pursuing? Is it easier to get to it? Is it correct? The other piece, and I think you've probably experimented with this, just knowing your penchant for digging into these things, is this notion of nested and repeated data. And I think, you know, in my mind, the brilliant twist on it is that it can be the best of both worlds from a performance perspective, which is to say I can maintain a denormalized structure on disk, which means that my queries will be very performant. However, if you take advantage of a nested data structure, you can effectively represent one-to-many relationships inside a single table. And so there are certain situations in which that's actually quite nice. If I have, you know, a dimension that is only ever joined with one fact table, and it's a relatively small dimension,
Starting point is 00:27:26 I mean, relatively; it could be many, many, many rows. Unless that dimension is updated quite often, there's very little need for it to be a separate table. I could preserve the one-to-many relationship, because it's important to the way I think about my problem, it's important to the way I think about my business, by just making it an array inside the fact table. Yeah, and that's an interesting topic, isn't it? I mean, that completely threw me, actually, and I think it's one of those things that appears to be a little bit daunting. And also, I guess, coming from the relational world, you had this concept of nested structures, but they were kind of quite inefficient, really. I mean, maybe just explain, when we talk about a nested, repeating structure, and you mentioned kind of, you know, ColumnIO and Capacitor and Colossus and so on early on, why is it
Starting point is 00:28:13 particularly that nested structures, what are they, first of all, and why is it they're particularly kind of aligned with how BigQuery works? Sure. Well, I think it's fair to talk about sort of the two classes of these things, or maybe three classes of these things: nested structures, repeated structures, and repeated nested structures. You know, a nested structure is simply like saying, you know, that I have a sales table and I have a column which has within it subfields, right? You know, a classic example might be, it's a sales table and I collected your phone number from the sort of, you know, shop loyalty thing. I might actually break that into: phone number is the column name, but within the phone number, I have phone number dot area code, phone number dot number, right? Now, in this sense, it's just nesting. It's just a level of breakdown
Starting point is 00:29:07 within that column. And so when we store that on disk, there is a broader column, which is the phone number column, and then within that we are storing effectively separate columns for both the area code on the record and the number itself. Now, from a query perspective, I think this is actually very easy to reason about, because I effectively just use dot notation to get the things inside the nested structure. Now things get a bit more, I think, mentally challenging, at least for me coming from the traditional RDBMS world, when you think about a repeated structure, which is to say, you know, a silly example of something like cities lived in. There's a user profile table or something like that, and I have, you know, cities lived, and I've lived in more cities than I care to name.
Starting point is 00:29:52 And so rather than having a separate table, which was, you know, ID, you know, city lived, ID, city lived, I could instead have attached to that table of user information, I could have from my row an array of the cities I lived in, you know, Phoenix, Los Angeles, Chicago, San Francisco, so on and so forth. Now, this is really powerful because I've saved myself having to go out and sort of, you know, join with a table. But there's this sort of question of how do we operate on arrays in SQL, which is interesting because I think there have been discussions about it in ANSI.
Starting point is 00:30:29 There are thoughts about it. It's never been codified as this is the way to do it. And we've taken great pains, particularly in our sort of standard SQL dialect, to try and find the most ANSI-like way to do this that we believe is important. But there's a bit of mental reasoning. So if I need every value of that array, I have to unnest the array. But what does it mean to unnest it? Well, effectively, the way to handle this is to do what some people call a lateral join or a limited cross join, right? So if I want to take every city I lived in and get a row for each of those cities plus my user information, effectively what I need to do is do a cross-join between
Starting point is 00:31:10 the one row, which is my user information, and the array values themselves. Now, nobody likes a cross-join, right? It always sounds scary. But if it's a cross-join between one thing and then a limited array, it's actually a very efficient operation. And it can all be done locally, right? So if we think about not wanting to spread data or spread processing around across multiple workers within a distributed system, because it's effectively all local within the bytes I've already read into memory, I can just do it very, very quickly in memory as I'm executing and move on. And so even though I might write cross-join, right, I'm actually doing something that's much more efficient than a real cross-join. So that's one thing. The other thing is, if I can have an array, and if I can have a nested object which has sort of subfields, there is nothing which prevents me from having an array of nested objects, right? And this is how we
Starting point is 00:32:12 get to the notion of having a truly one-to-many relationship represented inside a table. If that cities-lived table that I talked about had many columns, there's no reason that those columns couldn't each become parts of a nested structure, and I could have an array for every element that corresponded to my user ID. Powerful because I've captured that relationship, right? Because I have everything that was in that dimension table collapsed with an array of these objects already attached to the record it's most likely to be joined with. Powerful in that that relationship is still there.
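To make the structures Dan is describing concrete, here is a small sketch against an invented user_profile table (user_id, phone_number, cities_lived are illustrative names, not a real schema): the STRUCT gives the dot-notation nesting, the ARRAY of STRUCTs is the repeated nested field that collapses a one-to-many dimension into the table, and the comma before UNNEST is the "limited cross join" he mentions, applied only to each row's own array elements.

```sql
-- Illustrative schema (all names invented for the example):
--   user_profile(
--     user_id      INT64,
--     phone_number STRUCT<area_code STRING, number STRING>,        -- nested
--     cities_lived ARRAY<STRUCT<city STRING, years_lived INT64>>   -- repeated + nested
--   )

-- Dot notation reaches inside the nested STRUCT.
SELECT user_id, phone_number.area_code, phone_number.number
FROM mydataset.user_profile;

-- UNNEST is the "limited cross join": each row is joined only with the
-- elements of its own array, so it stays local to the bytes already read.
SELECT u.user_id, c.city, c.years_lived
FROM mydataset.user_profile AS u,
     UNNEST(u.cities_lived) AS c
WHERE c.years_lived >= 2;
```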
Starting point is 00:32:50 Powerful because, again, when I'm processing it, I'm not having to necessarily deal with a join that might move data between machines, that might require network, all of these sorts of things, because it's all right there when I have the data in memory to operate on. But challenging, and I think this is a thing we think quite a lot about because we believe very strongly in the SQL that we've produced, particularly for our standard SQL dialect. But we do realize that it is non-obvious
Starting point is 00:33:17 to a lot of people how to write a good query against those sorts of structures. It's interesting. I think that there were two routes I had coming into this particular kind of topic really was, you know, in the place I'm working at the moment, we had a situation where we were trying to join two big tables together and we suddenly hit this issue where, you know, with BigQuery, 99% of the time, you don't need to think about
Starting point is 00:34:13 how the data is stored. But then suddenly at this point, we had to. We had to think about what is the optimal way to store, you know, the data from two tables in such a way that we can return results back without kind of running out of resources. And I had the same thing with something I was doing at home as well, joining a couple of your sample tables together. And again, it all worked. When you scan one table, filter against it, aggregate, it works fine. Two massive tables, and we hit this kind of, I suppose, inherent issue you get with any kind of distributed system. Um, and as you say, it was quite conceptually hard to think about, but actually in the end it wasn't that difficult,
Starting point is 00:34:47 really. And you've got fairly standard things in SQL where you can do, you know, an UNNEST join in SQL. But the thing for me that was also interesting was that Looker, the partner tool that I work with, has support for these as part of its kind of metadata modeling area, and we can just present it back out to the users in the end as a normal kind of join. Yeah, exactly. I think, you know, the things we think about in this space, you know, we're working to really improve a lot of the documentation around this so that people can see more examples of how... That was hard, actually, getting a working example there. That was the hardest bit, but once you've worked it out, it was easy.
Starting point is 00:34:47 Yeah, well, and one of the things we're trying to do is just, you know, we sat down and we wrote a bunch of different kinds of examples so that we can actually really walk people through more of the things that they might want to try and do. And I think, you know, bubbling back up to your sort of top-level question, like what should people think about when designing for BigQuery, data modeling for BigQuery, one of the things I really stress is actually who the end consumer is going to be.
Starting point is 00:35:51 Because you manage to reason your way through how to write good array SQL. I do the same thing. I have people on my team who are not engineers who probably can't. And so the question I ask myself is, is this something where everyone who's going to access it can write the array SQL and it's obvious? If it's not, can I expose a view for them? And is that view a reasonable way to pass things on? And then the sort of further downstream piece of it, and I think, you know, Looker's an exceptional example of this, is working with, you know, partners and tool providers to talk them through the why of this and help them make good decisions around it.
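One way to act on that advice, continuing the invented user_profile example from above: wrap the UNNEST in a view so that analysts who are not comfortable with array SQL see a flat, row-per-city shape instead. Whether the view is created through the web UI, the API or DDL is a detail of your setup; the SELECT is the part that matters.

```sql
-- Hypothetical view that hides the array handling from downstream users
-- and BI tools, which can then query it like an ordinary flat table.
CREATE VIEW mydataset.user_cities_flat AS
SELECT u.user_id,
       u.phone_number.area_code AS area_code,
       c.city,
       c.years_lived
FROM mydataset.user_profile AS u,
     UNNEST(u.cities_lived) AS c;
```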
Starting point is 00:36:25 And I think Looker's been great in terms of jumping on board and seeing the value of it and really working towards it. And particularly with BI partners, we're working to make sure that they understand, yes, it's probably the simplest thing to simply plug in through a driver and treat us like any warehouse, but you'll get outsized advantage if you talk with us a little bit about how you should think about these kinds of structures, because you might be able to provide more scalable BI solutions to your users if you went this way. Okay. Okay. So on that topic, I mean, one of the things about BigQuery is there's no indexes,
Starting point is 00:36:25 and that's kind of, that's obviously massively useful, but is there a purpose in doing things like pre-summarization of data and summary tables and so on within BigQuery? Do you see that happening, or is it kind of pointless? I think people do it quite a lot. I think one of the things we think about is, because we know people do things like summary tables,
Starting point is 00:36:42 are there things we can do in the future to make that faster, simpler, easier? And again, I mean, going back to the notion of like, you know, if I have a very large query in which, you know, I have a bunch of with clauses, like maybe I should, I love the with clause. It's my favorite clause. And I may be the only person who feels that way. But, but, you know, if I were to materialize it, would that be useful to me and to someone else? Similarly, you know, should I make some summary tables? Is that useful to me and to other people? I think one of the things I think is really interesting is the sort of departure from me with my analyst hat on.
Starting point is 00:37:40 I'm using SQL to analyze data, to me as a software provider, right? So if you are a SaaS organization, you're trying to build, you know, a system that serves analytical data to your users. You know, I think you have to treat the system somewhat differently. You can't just sort of run in and say, like, I'm going to bang out some SQL queries and serve the results directly to users. I think you have to think a little bit more around,
Starting point is 00:38:27 well, I have the ability to run massive queries. I have the ability to store and share data of any size very quickly and very cheaply. What's the hierarchy of data sets or tables I might need to serve my users the fastest, right? And summary tables are a huge part of that. Interesting. So what about, um, partitioning? So partitioning is the other part of this, really, as well as storing it in an optimal way. Partitioning sounds easy. It seems to be a little bit more complicated than I expect. And there's also, I guess, an issue about how do you go and re-partition tables that actually aren't partitioned now without it costing lots of money and having lots of downtime and so on. And the team back in the office said, when you speak to Dan McClary, ask him about partitioning our tables.
Starting point is 00:39:03 Because I think it's one of those things where partitioning is interesting, isn't it? And why is it important and why is it sometimes hard to get done in practice on a production system? Well, and in fact, I mean, I think thinking back to my time at Oracle, partitioning was hard there too, right? They charged for partitioning for reasons, right? One of those reasons was it was not simple. So our partitioning story today is evolving. And I think it actually makes some sense to sort of take a trip back to the past a little bit. So inside Google, with Dremel, we didn't support partitioning. It wasn't a thing we did.
Starting point is 00:39:53 Instead, we had relied on a concept we called table sharding. And this is very much what it sounds like. I can have a number of tables that all have a common prefix, but have different suffixes. And in my query, I can choose to operate over a range of suffixes corresponding to a given prefix. So the canonical example would be mylogstable underscore and then some numbers representing the date. And then I could say, oh, well, give me every table with the prefix mylogstable within the predicate that satisfies this date range. Now, BigQuery supported that from, I believe, its outset. And it is a reasonable, if not incredible, way to manage date-oriented data.
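A short sketch of the sharding pattern, reusing the mylogstable name from the conversation (the project, dataset and event_type column are invented). In BigQuery standard SQL the suffix range is expressed with a wildcard table plus the _TABLE_SUFFIX pseudo-column; the older legacy dialect used table functions such as TABLE_DATE_RANGE for the same thing.

```sql
-- Scan only the June 2017 shards of tables named mylogstable_YYYYMMDD.
SELECT event_type, COUNT(*) AS events
FROM `myproject.mydataset.mylogstable_*`
WHERE _TABLE_SUFFIX BETWEEN '20170601' AND '20170630'
GROUP BY event_type;
```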
Starting point is 00:40:13 However, partitioning ends up being more performant. It ends up being more flexible. It ends up being somewhat more limitless, right? So we, in fact, have a limit on the number of shards you can access in a single query, simply because of the metadata burden required to go and fetch them. You reach a point where, if you're going to have thousands of shards, it's like, that's a lot of
Starting point is 00:40:47 metadata I actually have to hold just to satisfy that particular filter clause. Whereas with partitioning, in a well-designed partitioning system, you can have as many partitions as you need. Now, our first go at this was effectively to introduce date range partitioning, because we feel like that is the most common kind of partitioning that you see in a system. And I think when we think about the kinds of data that really start to hit big data scales, right, whether they are store-SKU sales combinations or click data
Starting point is 00:41:21 or a number of these sorts of things, they're almost always time-oriented. So I think in broad agreement, date range partitioning is important. Now, our initial foray into this was to not modify the data at all and to simply say we'll keep a pseudo-column around which corresponds to the date partition in which this row lives. Now, in our SQL dialect, you access this again through a pseudo-column, underscore partition time (_PARTITIONTIME), which is rather inconvenient, particularly if you had a date field in your records already, because you probably had a
Starting point is 00:41:57 date field in your records. And this does create a little bit of a mismatch, right? Because you might have a situation in which you say, I have date-oriented data that I'm loading into the system, I would like you to partition it. But if you don't tell us which partitions the records belong to, we might put them all in the same one: well, you loaded all of that data today, so that's today's partition, right? We admit that this is silly.
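For illustration, this is how that pseudo-column is typically used against a date-partitioned table (the table and column names are invented): the filter on _PARTITIONTIME is what prunes the scan to specific partitions, and the dollar-sign decorator Dan describes next addresses a single partition directly, for example as the target of a load or copy job.

```sql
-- Prune the scan to two days of a date-partitioned table (names invented).
SELECT user_id, SUM(amount) AS spend
FROM mydataset.events
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-06-01') AND TIMESTAMP('2017-06-02')
GROUP BY user_id;

-- A single partition can be addressed directly with the decorator syntax,
-- e.g. mydataset.events$20170601 as the destination of a load or copy job.
```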
Starting point is 00:41:57 The workaround for this today is effectively to address partitions directly, which you can address a partition directly by appending a dollar sign and the partition itself to the end of the table. So you can access any particular partition in a load job or a copy or something like that by saying dollar sign and then the partition date. That said, one of the things that we expect to be able to at least begin testing with customers is proper date-oriented partitioning in which you give me a date column, and we respect that as being the partition time, the sort of more explicit, tell me what the partition key is, We will deal with the partition key. And with that, I mean, I think, you know, speaking to the other part of your question, which is, if I have data that's not partitioned today, what should I do with it? Because that can be unexpectedly quite hard, I suppose, to do, can't it? You've got a massive table.
Starting point is 00:43:19 That table is being loaded into via a stream or something. Actually, to take it offline and repartition it is quite hard, isn't it? Yeah, it can be. I mean, particularly if you figure that, from our perspective, there is no offlining a table, right? Unlike a system where you control the storage subsystem, where you control the internals, and you could say, we're offlining that table because we have to do a maintenance operation. Because BigQuery is service-oriented and because we are serverless, there's no way for you to tell me to take a table offline. And from our perspective, we can't take your table offline because you might need it. We have no way of knowing. You might be streaming into us at 10,000 QPS and that's important. So
Starting point is 00:44:05 I can't offline it. And so we want to be very careful about it. Things we've done previously: if you have data that is sharded by date, we will automatically partition that. There's a command in our CLI which will partition that automatically for you. It will put it into a partitioned table. Now, when we get to a point where we can offer our users proper date-oriented partitions, we'll provide the same feature functionality there from the tool, but we'll also have to find a way to very closely or very simply allow you to point us at a table and say, this is not partitioned, I need a partitioned table, this is the partition key. And making that something that's both simple to do and cost-effective is very important, because I think the challenge right now is if you were
Starting point is 00:44:45 to have a table like that and say, I need to go and write it into individual partitions, you are unfortunately going to run into a situation in which you have to run many queries to pull out the partitions themselves and insert them into the thing. And you can optimize it as much as you can, but there is a known limit in terms of the number of queries you have to run. Yes, exactly, exactly. So, lastly, I want to ask you, actually, because I'm conscious of time for you. So obviously BigQuery for us has been fantastic in this project and we're getting, you know, consistently good response times and so on and so forth. But there still comes a point when someone says, but even that's too slow sometimes. And, you know, you and I remember the days of the kind of relational world
Starting point is 00:44:45 and Oracle and OLAP and so on, where people were getting to expect response times in the kind of subseconds. And there are technologies out there, like, for example, Druid, for example, that people are using to create this kind of like, I suppose, intermediate cache layer to make even those queries kind of run fast. I suppose, what's your sort of thoughts on that? And is that something that you guys are thinking about doing for BigQuery? Or is this the sort of thing that's not part of your agenda? You know, your sort of thoughts on that and is that something that you guys are thinking about doing for bigquery or is this the sort of thing that's not part of your agenda yeah what's your thoughts on that really well so i guess my my answer is somewhat twofold
Starting point is 00:45:12 right and the on the you know the the first part of it is yes we see the demand from bi tools from application providers for subsection um i think the question we ask ourselves is, is BigQuery the right tool to do that? Or are there other better approaches that would help you optimize costs, simplify use, things like that? So in terms of building a free cache layer, there are other products within Google that do this. A data studio, the BI tool, builds itself a cache layer for certain result sets, and that's how it's able to maintain interactive speeds. For things like Druid or maintaining a Reddit store
Starting point is 00:45:54 or maintaining an RDBMS that actually has this data-loaded hot memory for certain queries, entirely valid, and we generally recommend. One of the things we're very excited about is uh you know the notion of our integration with systems like big table um you know things like potentially integrating with cloud spanner um such that users can say if i have something that's very transactional or i have something that's more serving oriented, there will be a GCP database which is going to be appropriate for that that will be queryable by BigQuery. Whether it is today, like in the case of Cloud Bigtable or in the case of something like
Starting point is 00:46:39 Cloud Spanner, which we expect to have happen at some point, we think, again, you shouldn't have to say, well, I need BigQuery to do it. It should instead be a situation of, what is the optimal system for me to do this based on what my engineers need, what my end users need? And so, I mean, I think, yeah, we see Druid pop up. We see people using things like Cloud Bigtable for this. We see users building their own caches
Starting point is 00:47:48 two things. One, how do we make the kinds of queries for which this is likely much faster, without you having to say something explicit? You know, if this is obviously something that's getting run a lot by the customers of your SaaS application, if this is obviously something everyone in your org hits through a BI tool, how do we make it faster for you without you having to do anything?
Starting point is 00:47:48 Does that mean we change our layout on disk? Does that mean we change the way we process the data? Does that mean that we investigate and cache on our own? All on the table, all things we're really interested in. We want to make sure that it's as simple and as transparent as a lot of the other things in the system. But then the other piece of it is making sure that whether it's a third party caching layer or another GCP database, that there's really good interoperability between the two, because there's no reason
Starting point is 00:48:16 you should have to take our word for it. Right? And I think we really are, we want to make sure that people choose the tools that are best for them, whether they are our tools or not. Okay, okay. That's fantastic, Dan. Well, look, it's been great speaking to you. So just to remind anybody on the call, where do people find out about what you're working with? Where do people find out about BigQuery and Google Cloud Platform and so on, really?
Starting point is 00:48:56 Oh, sure. Just search Google, yeah. Really, it's bigquery.cloud.google.com, and that will actually take you directly to BigQuery. One of the silly things I like to do when showing BigQuery to users is saying, oh, well, I'm going to go ahead and spin up an instance of BigQuery. You go to bigquery.cloud.google.com, and it's there. It's there. The web UI is there. Play with it. Kick your tires with it.
Starting point is 00:48:56 I'm going to go ahead and spin up an instance of BigQuery. You go to bigquery.cloud.google.com, and it's there. It's there. The web UI is there. Play with it. Kick your tires with it. Give it a try, read the documentation, try the editor. And we have a host of other tools, but that's the best way to find out, right? It's just go and play. Fantastic.
Starting point is 00:49:55 And just to plug it, because I mentioned earlier in the discussion that we believe strongly in free trials to enable people to learn: BigQuery gives you a terabyte of analysis every month for free and 10 gigabytes of storage. So you can put data in there, you can query data, and there's no impact to you. I mean, effectively as a developer at home it's free, you know, and that's just fantastic, really. I mean, you don't have to worry about the cost of it, it just works, and obviously at a commercial level you pay for it and so on, but that was quite an inspired thing, really. So, one other last thing: there's a good video from the last Google Next, I think in San Francisco, that you and Daniel Mintz did as well about this topic. Is that correct? That's right. And
Starting point is 00:49:55 there's another one from from next as well with uh jordan tagani who who runs our engineering team which i think is really for people who really want to know about the the nuts and bolts and the internal parts of our performance it's a really great deep dive on some of the stuff that we do in the engine that is, I think, interesting and meaningful to people who just want to geek out on, like, what does a big distributed system like that do at Google? Good. Excellent. That's really good. Well, I'll let you go now. It's been great having you on the show again, and thanks very much, and take care, Dan. All right. Thanks very much, Mark. I really enjoyed it.
