Drill to Detail - Drill to Detail Ep.23 ‘Looker, BigQuery and Analytics on Big Data’ With Special Guest Daniel Mintz

Episode Date: March 28, 2017

Mark Rittman is joined by Daniel Mintz from Looker to talk about BI and analytics on Google BigQuery, data modelling on the new generation of cloud-based distributed-data warehousing platforms, and Looker's re-introduction of semantic models to big data analytics developers.

Transcript
Starting point is 00:00:00 So hello and welcome to Drill to Detail, and I'm your host, Mark Rittman. So my guest on this week's show is someone who you may have seen present at the recent Google Next event in San Francisco, alongside one of the first guests we had on the show back in October last year, Dan McClary. So the guest this week is Daniel Mintz, who is the Chief Data Evangelist at Looker. And I'm very pleased to welcome him onto Drill to Detail to talk about Looker, data modeling on Google BigQuery, and where BI tools are going in this new world of cloud and big data. So Daniel, welcome, and thank you for coming on the show, and maybe just introduce yourself properly to the audience. Yeah, my pleasure.
Starting point is 00:00:46 Thanks for having me, Mark. So yeah, I mean, I think maybe I'll start with the funny story of how I came to Looker, because I actually was a Looker customer for a long time before I was a Looker staff member, a Looker employee. My previous role before I came to Looker was working at the media company Upworthy. And when I came on, they were about a year into their history. They were just starting to get serious about data. You know, they didn't really have a data stack to speak of. This was in early 2013. And so BigQuery, uh, excuse me,
Starting point is 00:01:22 Redshift, Amazon Redshift had just come out of beta and was first really available to the public. I kind of looked at our options and quickly settled on that as our data warehouse. But then the next question was, well, what's going to sit on top of our data warehouse? What's our BI tool going to be so that people can actually access the data? Obviously, you know, me and my team, we were comfortable enough using a SQL runner, but most of the folks at Upworthy weren't going to do that. So, you know, I was looking at different BI tools, and I think a co-investor of ours at Upworthy and Lookers said, you know, oh, if you're looking at BI tools, you should check out this little startup out of Santa
Starting point is 00:02:02 Cruz that I've invested in called Looker. I said, sure. So I hopped on the phone with Lloyd Tabb, who's Looker's co-founder, and Kenan Rice, who was the only business guy at the company at that point. And we got on the call. Kenan realized pretty quickly, as a good, smart business guy guy that Lloyd and I spoke the same language. And so he should really get out of the way. And, and so Lloyd, you know, and I started chatting and Lloyd said, well, you know, what Looker is, is we're a platform, we sit on top of, of your database, you tell Looker what your data means, you define, you know, the relationships between your tables and the business logic that you have in LookML, which is our data modeling language, our sort of abstraction of SQL, and then Looker will write the SQL.
Starting point is 00:02:52 And so then you can expose to the business user just a sort of drag and drop interface where they can ask questions of the data themselves without having to rely on an analyst to write SQL for them. And I, being the sort of skeptical analyst that I am, was less than convinced that this was a real thing. Because I, you know, I will say I kind of imagined that that kind of thing had existed, but I'd never really seen it in action. And yet within 10 minutes, Lloyd, you know, had connected Looker to our Redshift cluster and was kind of saying, ask me anything. And so we were, you know, I was asking him questions and he was answering them and showing me things in my data
Starting point is 00:03:32 that I had never seen before, which was a pretty weird experience for someone I had just met to be finding things in our data that I didn't know about. And so, you know, I think that experience of seeing a tool that could really sort of write whatever SQL I needed it to and do so in a performant, clean way that gave everybody in our company the ability to ask questions of the data and get sort of reliable answers pretty quickly was pretty amazing. And so over the next three years, the tool really grew. I will say that when we first signed up for Looker, I think we were customer number 21. And the tool, the promise was there. The tool was not quite where it needed to be yet. There were no dashboards. There were very basic visualizations, no ability to sort of do calculations on data
Starting point is 00:04:28 results that had come back from the database already. So you couldn't, you know, add column A to column B to get column C, everything had to be done in SQL. And so that was, you know, that over the success of three years, the tool really got better and stayed on the trajectory to live up to the promise that sort of Lloyd had laid out in that very first meeting. And so when I was starting to think about what my next move would be, what you know, as I was getting ready to leave Upworthy, I was talking to Lloyd and he said, you know, well, what do you want to do? And I said, I'm, you know, I'm not sure. And he said, well, when you figure it out, tell me. And I said, well, I do miss writing and thinking and talking about how to do data andism like you want to evangelize for data and i said yeah i guess that's a reasonable uh you know summation of what i want to do and he said well we need an evangelist why don't you come be the evangelist at looker um and so so here i am
Starting point is 00:05:36 now a year later fantastic fantastic and and so i mean just as a as a kind of context i mean i i'm using looker day in day out at the moment the. The company I'm doing some work with at the moment is using Looker to build a BI, I suppose, platform or tool on top of kind of Google BigQuery. And it's an interesting kind of exercise to go through because a lot of the, and we'll get onto this later on in the conversation, but a lot of the concepts that you're talking about with looker are fairly kind of like you know established concepts in the world of suppose enterprise bi and crm and so on you know writing sql against a kind of a model and then generating sql against that and abstraction layers and so on but that's something that's very kind of uh up till now fairly unknown in the kind of big data world because of various i guess kind of like differences in that world compared to say
Starting point is 00:06:24 working with small data sets on in in, I suppose, kind of, you know, structured databases. Just tell us a bit about the foundation story of Looker and because what was the kind of thinking behind Looker as a company and the products? And in this new market, what problem is it trying to solve and for who really? Yeah mean i think you know when we look back at the history of bi you know i think you as you said you this isn't brand new a brand new idea uh by any means and when you look you know way back to the 80s and you look at folks like cognos and business objects and microstrategy you know they had this idea. It was there. Now, obviously, the underlying database technology back then was not where it is today, to say the very least, right?
Starting point is 00:07:12 And so, you know, like, first of all, the data volumes that they were dealing with were far, far smaller, orders of magnitude smaller, partially just because rather than, you know, machines generating the data, you actually had people generating the data. So there just wasn't that much of it. But even so, you know, what you saw was as data volumes grew, the appliances couldn't keep up. And so, you know, you did cubing, you did summarization, you did anything that you needed to to get the data small enough so that you could access it and get your answers. But that idea of a sort of rigid model was really core to that sort of first wave of BI, right? This idea that IT is going to own the tool.
Starting point is 00:07:54 They're the ones who are going to own the definitions of the data, the business logic. And so that's great because it means when you ask a question of your tool, you know that the answer is right. It's been vetted. It's not up for grabs. And, you know, and so that was where people were at. But over time, I think, you know, the rigidity, people got frustrated with that rigidity, right? They got frustrated with the summarization, frustrated with the cubing, frustrated with the inability to sort
Starting point is 00:08:21 of be agile and get answers to new questions. And so you see the rise of the second wave tools, which, you know, are mostly visual discovery tools, tools like Hello and Click, you know, where the idea is more around self-service. And I think when we look at the choices that they made, I would say we agree with a lot of them. The idea of self-service is great. The idea of, you know, getting rid of cubes, getting rid of summarization, keeping the sort of higher resolution in the data, we totally agree with that. But I think they maybe threw the baby out with the bathwater. And so throwing out that model, I think, was a real mistake.
Starting point is 00:09:00 Throwing out a shared understanding of what the data means makes it much harder to get real business value from your data. And I don't think that it was a sort of dumb decision that they made. I think they were really constrained by the technology of their time. that were mostly running on these, you know, faster laptops and desktops that people had at the time, but not were sort of disconnected from the warehouse because the warehouse certainly couldn't keep up with the demands of, of an agile workflow. Um, and so, you know, because you lost your connection to the mothership there, you lost this model that, that told you what the data meant. Um, and so that, you know, sort of self-service workbook BI, you know, had real advantages in terms of speed to, you know, to getting answers, but it also exposed you to this possibility that you were getting an answer, you were getting it fast, but it wasn't the right answer. And so with the advent of this new wave of tools, you know, I think really starting with
Starting point is 00:10:09 Redshift, frankly, not that there weren't, you know, amazing MPP databases before then, but they were mostly on premise. They were very expensive, you know, and then you see Redshift come out as a cloud MPP warehouse column or data store, And that really changes the game. And then you see, you know, BigQuery, which had been Google's internal tool for a long time, they repurpose it as a product that can be sold to the market. You see, you know, Vertica be available in the cloud, you see a big wave of SQL on Hadoop tools that are cloud-based. You see Snowflake. So you just see this huge wave of really, really fast, cheap data warehouses. And all of a sudden, that makes it possible to keep the agility that you wanted from that second wave tool,
Starting point is 00:11:01 that self-service idea, but also keep a model that had come from the first wave tool. And so you kind of get the best of both worlds. And that's really where Looker is aiming. We think of it as a third wave where you've got all the agility, all the self-service that you had with the sort of desktop workbook-based tools, but also the shared understanding of the data and the trustworthiness that comes with that, the reliability that comes with that, that you had with the old school BI tools. Yeah. So, I mean, within the world that I'm working at the moment, Looker is the kind of the hot new BI tool, is the hot new kind of startup.
Starting point is 00:11:37 But Looker might not be so well known to, I suppose, people who are more used to using kind of Oracle and Cognos and so on. Who's the typical kind of Oracle and Cognos and so on. Who are they? Who's the typical kind of like customer for Looker? And who are the typical kind of developers and users that are working with the tool? Yeah, I mean, we so we're in terms of companies that are using us, we're up over 800. And, you know, I think, as you said, we are first the first place that we slotted in that people really got Looker very quickly was in the sort of venture-backed tech Silicon Valley group. Not hugely surprisingly, lots of people speak SQL there. They don't have big million-dollar investments in existing stacks that there's a sort of sunk cost there. But more and more we're seeing larger companies, enterprise companies, who have these longstanding tools in place. They're both starting to transition their on-premise data warehouses to the cloud and also starting to consider new, more flexible tools.
Starting point is 00:12:38 And we generally don't see them saying, oh, well, we've, $5 million on this BI tool over the last 20 years. We're going to throw it out and start fresh. That is generally not the way that these things go. It is much more frequently that they, you know, have a new project that they're working on and they want more flexible data access. And so they, rather than continuing to sort of add to the rigid architecture that they had, they start with a new implementation of looker and then you know another team inside the company sees that and says well why why can they ask any question that they want and get an answer and you know in seconds that i want that um and so then you know more they sort of evangelize it internally and more teams start to pick it up yeah i mean
Starting point is 00:13:21 certainly the experience we had in the place i'm at now um was that internally there was use of of tableau for example and so tableau connects to bigquery as you're probably aware you know and it's you can create some pretty nice kind of data visualizations in there and that was typically that was typically the tool that was given to uh i suppose people working um as kind of strategists or analysts or whatever but the limitation with that with that was was a few things so it was very much kind of strategists or analysts or whatever. But the limitation with that was a few things. So it was very much kind of silo-based. And so people would build a report based on a particular topic area. And then typically, they'd need to write a custom piece of SQL
Starting point is 00:13:54 to populate the Tableau extract. Because really, and this is the thing that one of the engineers I wrote for Looker was that you could then actually... Looker was very, very good at retrieving the data live from, say, BigQuery, whereas Tableau was always really going via a cache, you know, and certainly in that respect as a BI tool, we found that kind of Looker was a better SQL engine, for example, than Tableau. But the thing that really kind of made the impact with people I've seen use Looker is this kind of, is the use of the model, is the use of this kind of semantic model where you can just explore your data and there's no kind of like having to join data and so on
Starting point is 00:14:28 there but do you find that's a concept that's sometimes quite hard to explain to people to engineers particularly i mean it's almost like a sort of stockholm syndrome isn't there with a lot of engineers where you're used to you're used to kind of data analysis being hard to try and explain to them it can be easy it must be hard sometimes yeah i mean i um i i would say the people who uh i i tend to think of our users in sort of three big buckets you know there's the data analysts who um who speak sql uh who have had ready access to the data who know the power of being able to answer their own questions with data. And for them, Looker, I would say the biggest advantage of Looker for them is that they spend a lot less time writing one-off queries for their colleagues, right?
Starting point is 00:15:14 They, rather than spending, you know, they send, someone comes with a request and then they come back a day later and they say, so I know you wrote me that query, but it turns out I need six months of data and i need it broken down by state not by country it's the classic thing isn't it in bmf every question you have has another question after it isn't it really for sure and so you know that's just as an analyst myself that's a that's really not a fun use of your time um it just feels like a never there's a never-ending queue it's's always growing. It is, you know, and it just prevents you from doing higher value analysis because you're just stuck answering one-off questions.
Starting point is 00:15:50 And so the analysts like Looker for that reason. But I would say that the people for whom Looker is most sort of life-changing are the people who are very analytically minded. You know, there may be an Excel wizard or a Tableau desktop wizard, but they've always relied on IT or the analyst organization to do the data extraction for them because they don't speak SQL, or maybe they speak just enough SQL to get themselves in trouble, but not enough to get themselves out of trouble. And so all of a sudden they go from having to wait for each new workbook to be built so that they can explore in it,
Starting point is 00:16:25 to being able to access the data directly, right? And that is a radical change for them. You know, they all of a sudden, the sort of speed with which they can iterate goes from, you know, one iteration per day to one iteration per minute, you know, they come up with a new question based on the answer. And, and then they, that gives them an. You know, they come up with a new question based on the answer and then that gives them an answer and then they come up with a new question and they get the answer to that all instantly. And that's just a huge change. And these are the people who really already know how to use good data to make better decisions. And so that's a huge boost to the enterprise as a whole because all of a sudden the people who are analytically minded have access to the data.
Starting point is 00:17:06 And then I think the third group are the sort of more data consumers, the people who need their reports, need their dashboards, but they kind of rely on those and they're not going to do a ton of exploration. But what we do see with those folks is more and more of them get the dashboard, the Looker dashboard in their email. It arrives on Monday morning and they say, huh, that's weird. I wonder why sales were down on Friday. And rather than having to go to an analyst to get an answer to that question, they instead can just click on the dashboard.
Starting point is 00:17:37 And that drills into and brings them to a place where they can start exploring. And whereas, you know, a sort of blank template would be pretty intimidating to them. You know, the thing about Looker is, is that it brings them right into a context that already exists. Someone has already built an exploration and, and answered that first question of sales were down on Friday. And they say, well, maybe I'll, I'll look by country to see if maybe there was a holiday in one country and that's why they were down. So, you know, all they have to do is click one thing to add that new dimension and all of a sudden they've got more information and they can see not just what happened but maybe why it happened yeah absolutely i mean the way i describe it is about two things it's about reducing if you're if you're an analyst
Starting point is 00:18:19 and you're trying to understand data and explore it you need we're looking to reduce the friction involved in that process and certainly certainly by making it everything pre-joined a semantic model um and so on it means that they can follow the train of thought much easier really and explore the data and understand it and that that group you described you know what i would call kind of strategists really they are the ones i think that really benefit out of this um in the past their job has often been like that chess by post in some respects where you'd sort of like write a move down and post it off and a week later you'd get an answer back the difference has been been astounding really for people working with this just being able to kind of like to just explore their data and just kind of i suppose make data-driven decisions
Starting point is 00:18:56 whereas in the past it was it was hard to do that really so um so let's kind of look a little bit into so so for all listeners on this show, certainly those that come from a more traditional background, what you're describing is something very familiar. But there's certain ways that I think Looker have approached this kind of problem that's different to before. And I think one of the ways that's interesting, and it would be good for you to talk us through it, is this concept of LookML and the way that the kind of modeling works. Because as you rightly said, this has been done before, but it's been done in a way that was not all that agile and and probably was a bit out of step with how things are developed now so maybe tell us a bit about what look ml is what is the
Starting point is 00:19:32 problem it was solving again and paint a picture a little bit of what it does and and kind of the approach that you you've got with that really yeah i mean i you know i talked a bunch about this at google next but i think when think when you look at the development of programming languages, you know, there's been enormous progress over the last, you know, really since the beginning of programming languages. But you look at like C, right, which is still runs the world fundamentally, you know, 98% of programs eventually compile down to C and then from there, right? Or I don't know exactly what the stat is, but, you know, fundamentally% of programs eventually compile down to C and then from there, right? Or I don't know exactly what the stat is, but, you know, fundamentally, it is still there at the base.
Starting point is 00:20:09 But most people most of the time are not writing in C. And that's because they mostly don't care about the low level concerns that C makes very easy to control. So things like garbage collection, you know, most of the time, you're okay with the higher level programming languages worrying about memory management, you know. And so folks have invented these higher level languages, these abstraction layers that allow people to not worry about the little stuff so that they can worry about the bigger questions, which is how do I quickly, you know, prototype an application? And that would be a lot faster in Python or Rails than it's going to be in C. And so, you know, when you look at
Starting point is 00:20:51 the parallel evolution on the data side, you know, you see SQL comes about in the same decade as C and still really does fundamentally run the world in terms of how people access data. But the amount of advancement that has been made in SQL is pretty minimal, right? You know, you can look at T-SQL comes out in the late 80s, and then MySQL and Postgres in the mid 90s. And, you know, you get a new version of Postgres or MySQL every so often, and not much more than that. And so we're still at the same level of abstraction, right? We're still worrying about, oh, which dialect am I in? How do I, does this one, you know, use sub queries? Or does it use common table expressions? How does this one do date handling, you know, all of these little things that you just really shouldn't have to worry about.
Starting point is 00:21:40 Computers are great at worrying about those things, and humans have bigger things to worry about. And so what LookML is, is it's an abstraction layer on top of SQL. It still maintains all of the power and the versatility, the provenness of SQL, but it allows you not to worry about things like, oh, what is the foreign key, primary key relationship for those two tables. I will define that once. And then anytime that I want to join those tables, I'll let the computer know that I have defined that join and it will figure out what the correct join logic is. Right. Um, and so that, you know, look, ML is a way to define all of that logic, all that business logic once, and then let looker write, you know, performant clean sequel, uh, on top of that. And one of the amazing things, I was just looking at this, but Looker now, I believe with the addition of Google Cloud Spanner, which they just released two weeks ago, I believe we now support 33 different dialects of SQL.
Starting point is 00:22:39 And the thing about LookML is because it's an abstraction of SQL, it doesn't care what the underlying dialect is. And so we've actually had customers who have built these whole models in LookML, have been on one database, been using one dialect, and then decided that they wanted to switch databases and switch dialects. And that process, which normally would mean rewriting every single line of SQL, is seamless because they just point Looker at the new connection and Looker says, oh, I speak that dialect too. So when I compile my LookML down to SQL, rather than compiling it to SQL dialect A, I will compile to SQL dialect B. And so what, you know, I think my experience with SQL is that there are a bunch of problems with it. You know, it's not reusable. I, I frankly can't even read, um, sequel that I've written, you know, a week or two ago, rather than, than trying to
Starting point is 00:23:30 understand that, that giant block of sequel that I wrote, I just start from scratch cause it's easier. Um, and you know, in reading somebody else's sequel in their own style with their own, you know, syntactic conventions is, is basically impossible. So, um, you know, syntactic conventions is basically impossible. So, you know, I don't, it's not collaborative, it's not reusable. It's not version controlled, because it's not organized the way that, you know, normal programming languages are. And then, you know, it just makes you deal with all these low level things. You know, when I was at the first company where I was learning SQL, there was an order table. And to make sure that you excluded all the failed credit card transactions, you had to say order
Starting point is 00:24:11 status equals completed. And you had to do that on every query that you ran against the orders table. Even though 95% of the time when you ran a query against that table, you didn't want the failed credit card transactions, right? But as a default, if you wanted to exclude them, you had to remember that where clause. And if you didn't, you would all of a sudden think, wow, sales were through the roof, but they weren't. You just were including all these failed transactions.
Starting point is 00:24:40 And so, you know, even for somebody who is very good at writing SQL, it becomes very easy to make stupid mistakes in SQL. And so LookML says, oh, when I explore the orders table, by default, apply the order status equals completed filter. And I can remove it if I want. But since I know 95% of the time I'm going to want it, let's apply that by default, right? And it doesn't make you reshape the underlying data, which is hard. And once your model and your data get intermingled, it gets very difficult to be agile because every change requires reprocessing all your data, reformatting your schemas, all of that. But, you know, by keeping the modeling layer agile and separate from your data, I can say, you know,
Starting point is 00:25:19 order status equals completed. And then, you know, we change something in the application and now it's order status equals complete. But I can make that change in the model. And as soon as I do, it propagates out. And now every query that runs against my, my database, when it hits the orders table, it knows now that it's called order status equals complete instead of completed or whatever it is, right. And so that, that agility is incredibly useful, because today's businesses, you know, the structure of their data is just changing all the time. Yeah, I mean, so it was interesting, again, coming from a history of using tools with these kind of semantic models, abstraction layers and query generation engines. It was interesting that Looker took the approach of, I suppose, making SQL more agile and more abstracted rather than, say, building a graphical interface and saying, look, you know, you just define things like dimensions and metrics and kind of fact tables
Starting point is 00:26:09 and so on. Was there a specific sort of decision to go down the route of sort of like a markup type language as opposed to using graphical tool for this? For sure. That was not an accident. You know, Lloyd Tabb, the founder who I was talking about, you know, he actually he's been around for a while. His, um, he was a, uh, database and languages architect back at Borland. Uh, he was the, the architect on Netscape Navigator Gold. Um, so he's, he's seen a fair amount and, you know, his experience with, uh, GUI based languages that, that allow you to sort of, you know, um, to define relationships and things like that is that they get very unwieldy very quickly. Um, you know, and, and and things like that is that they get very unwieldy very quickly.
Starting point is 00:26:46 You know, and that's been my experience, too, with all kinds of graphical, you know, languages, you know, for the same reason that we don't use cuneiform, that we use actual words, right? Languages are great. They are really good at communicating things. And so, you know, in different kinds of kinds of languages, excel at different things. And, you know, so natural language is great, but it's not extremely precise, I would say, whereas computer languages, programming languages are extremely precise, because they need to be in, you know, an analysis, analytics is really a precise discipline. And so, because SQL, I think part of the reason that that analysts love SQL so much is because it it is structured the way that they think about problems. It forces them to think about the analytic problems they're trying to solve in the right way. And so we didn't want to reinvent that wheel. You know, like I said, SQL has been around for 40 years. It's everywhere. And I think even in the sort of last 10 years where you saw Hadoop start to make a push against SQL,
Starting point is 00:27:51 I think you've pretty quickly seen SQL push back successfully. You've seen all of the columnar MPP data stores which use some version of SQL, and even then all the Hadoop dialects which actually are building SQL on Hadoop structures, whether it's Spark SQL or Impala or Presto or Hive, you know, you see that the Hadoop communities have realized that people really do want to write SQL when they're doing analytics. And so, you know, given that, it's just why reinvent that wheel, right?
Starting point is 00:28:22 The same way that the programmers didn't invent whole new languages that that went around C they invented languages that built on top of C yeah so another thing that what's been quite useful for what we're doing is is certainly within say the e-commerce space and and there's a lot of yeah what we call in in in kind of database world things like many to many relationships and lots of opportunities for things like double counting of aggregates and all this kind of stuff. And one of the things that we found LookML particularly good for is dealing with those things like double counting and the way that you do aggregation within these kind of data sets and so on. Is that something, again, that it was deliberate in there? Is that something you build in, particularly for this
Starting point is 00:29:03 kind of market and the type of data you're working with yeah i mean so i think you're referring specifically to something that we call symmetric aggregates yes which is you know it it when you fan out you know the classic example is like uh you know uh users and orders right you know a user can have multiple orders and so if you just count user the users table you get the correct answer but if you count users and you've joined the orders table now it's going to tell you that there are a lot more users than there actually are. And so you have to do count distinct user ID or whatever it is, right? And what we found with LookML and particularly with its sort of reusability paradigm is that you can actually achieve things with SQL, which you normally wouldn't think were achievable, right? Because SQL is, it has been a write-only language, right? Something that you write, but you never go back and read or elaborate on. I think we've underestimated what it's capable of. And so, you know, particularly around symmetric aggregates, we actually use a function, which I never knew was part of the ANSI SQL standard, but is called sum distinct. You know, everybody's familiar with count distinct,
Starting point is 00:30:09 but it turns out there's actually average distinct and sum distinct as part of the ANSI SQL standard. And so if you know what the primary key is, you can avoid summing its value more than once, or averaging its value more than once, by looking at that primary key. And so this is something that you would never, ever want to write if you were writing SQL by hand. But when LookML is doing it for you, when Looker is doing the SQL writing for you, it becomes very simple. And these things that look pretty complicated actually are not particularly computationally expensive, and so the databases can achieve them quite easily. So symmetric aggregates is a great example, but there's
Starting point is 00:30:53 lots of other things where you can define really complicated logic, like case statements, for example, nested case statements, where again, you never want to try to keep track of those in raw SQL. But in XML, where you can say, here's a set of case statements that defines this one dimension. But then I want to define another dimension, and I want to use the case that I've already defined as part of that definition. And I don't have to rewrite that whole thing. I can just reference it. It makes it much easier to build these complex things in SQL because you're not the one writing the SQL. And so I think that idea of reusability, which is
Starting point is 00:31:32 really core to programming, the DRY is the acronym that they use, don't repeat yourself. You do something once, you define it in one place, and then you reuse it rather than redefining it. And that has a ton of advantages because it allows you to build more complex structures. It also means that when you change something, which is certainly happens a lot in the data world, you change it in one place, and it propagates out into all the other places, you don't have to then go track down every occurrence of some variable, and change it in all those different places, which is a pretty common occurrence when you're looking at a bunch of, you know, SQL recipes and you go, all right, I have to search this whole directory for every occurrence
Starting point is 00:32:11 of this field, because that field has now changed its meaning. And so I have to go replace all of those, find and replace all of them. Yeah. And I'd like to get on in a moment to talk about, I suppose, the specifics of working on BigQuery and how we interact with the database there and so on. But one of the other things that was particularly a plus point for you guys was the fact that all your, I suppose, metadata is stored in Git, or certainly in GitHub repos and that sort of thing. That's an immediate kind of plus with any kind of engineering department now. I mean, again, just tell us about how that works and the use of things like GitHub and your Looker Blocks and that sort of thing. How does that kind of work? Yeah, I mean, so, GitHub, or actually it's Git, and then we generally use GitHub, but you can use whichever Git versioning repository you want.
Starting point is 00:32:56 But, you know, that's been built into our platform since the very beginning, because, again, that's something that engineers figured out a while ago and data analysts have not. And so again, rather than reinventing the wheel, we said, yeah, that's a great idea. It would be very useful to be able to roll back to an earlier state, or understand who changed this definition at what point in the past and why. And those are all things that are deeply integrated into developer workflows, but haven't been part of analyst workflows. And so because LookML is code, but very simple code, code that really makes sense to SQL writers, and not a GUI, it's very easy for us to build in versioning.
Starting point is 00:33:38 And so it's been there since the beginning. Whereas, you know, if you're looking at a GUI-based system, it's much harder to understand how you would do versioning in that context. So, you know, I think that goes to this core idea of let's not reinvent the wheel. Same thing with our visualizations, where Looker runs entirely in your browser; it's web native. And so again, we could have written a library to do our visualizations from scratch. But, you know, D3 is amazing and gorgeous and incredibly flexible, and there's a huge community out there around it. So why wouldn't we make use of visualization libraries that already exist? And so that gives us dynamic, interactive visualizations, in JavaScript, that run right in the browser really beautifully.
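The fan-out and reuse points from a few minutes earlier are easy to make concrete. Here is a small Python sketch against an in-memory SQLite database; the table names, columns, and CASE thresholds are invented for illustration, and Looker's real symmetric-aggregate SQL is more elaborate (as I understand it, it pairs the primary key with the measure inside a SUM DISTINCT rather than using a plain COUNT DISTINCT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, lifetime_value REAL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
    INSERT INTO users  VALUES (1, 100.0), (2, 50.0);
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# The join fans out the users table: user 1 appears once per order,
# so naive aggregates over-count (3 "users", 250.0 of lifetime value).
naive_count, naive_sum = cur.execute("""
    SELECT COUNT(u.id), SUM(u.lifetime_value)
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchone()

# Aggregating DISTINCT primary keys restores the true user count (2).
true_count = cur.execute("""
    SELECT COUNT(DISTINCT u.id)
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchone()[0]

# DRY at the SQL-generation level: define the CASE logic once and
# interpolate it wherever needed, roughly what referencing one LookML
# dimension from another buys you.
ORDER_TIER = "CASE WHEN amount < 10 THEN 'small' ELSE 'large' END"
tiers = cur.execute(
    f"SELECT {ORDER_TIER} AS tier, COUNT(*) FROM orders GROUP BY tier ORDER BY tier"
).fetchall()

print(naive_count, naive_sum, true_count, tiers)
```

The naive join reports 3 users and 250.0 of lifetime value against a true count of 2 users; the point of the generated-SQL approach is that an analyst never has to write (or debug) the distinct-aggregate workaround by hand.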
Starting point is 00:34:23 Fantastic. So, again, the last time I saw you in the flesh was when you presented alongside Dan at the Google Next event, and the two of you did a very good presentation on data modeling on BigQuery and so on. And something that was interesting in there that I'd like to talk to you about is what analytics and BI are like on a platform like Google BigQuery. What do the table structures look like? How do you form queries? And particularly, things like joining and indexes and stuff that we would take for
Starting point is 00:34:54 granted in a more traditional database, that's different, really, isn't it? And Dan opened it quite nicely, and I think you talked about it as well, saying let's be aware that we're working with a distributed data store here, and it's different to a relational database. I mean, tell us a bit about that. What do we have to be aware of when we're doing BI on top of a platform like BigQuery? Yeah, I mean, so I think for anybody who hasn't experienced BigQuery, you should go try it out. I think your first terabyte of processing per month is free, so there's no reason not to just go check it out. I mean, I have to say, across all of the cloud data warehouses, that is one of the things that I love: you don't have to go buy an appliance.
Starting point is 00:35:32 You just log in, and they give you some free credits, and you can play around. And my experience with them is that, pretty uniformly, coming from the world of MySQL and Postgres, they blow my hair back. I just can't believe how fast they are. But BigQuery particularly, it is so simple in terms of the DevOps requirements that you stop worrying about things that you're just used to worrying about. Indexing is a great example. There's no concept of indexing in BigQuery. That's just not something that you have to worry about; they worry about it. Compression is another thing: let Google handle your compression. It's not something that you have to worry about. And in
Starting point is 00:36:13 fact, it's not something you even can worry about. They handle it for you. And so, to Dan's point, you know, I talked about this from the BI side and he talked about it from the schema side and the database side. But I think our joint point was that when you're dealing with this whole new environment, where a lot of the constraints that have historically attached to the ways that you do things are gone, you really need to go back to first principles and ask yourself, why am I doing this? Rather than just assuming, well, I'm going to do it this way because it's always been done this way, right? And so, you know, I think if your goals are around,
Starting point is 00:36:57 you know, simplicity, and accurately representing the real world in your data, and making it easy for other people to understand your schema, then you keep those first principles in mind, rather than saying, oh, well, we do snowflake schemas because we always do snowflake schemas; oh, we do third normal form because that is how we have always done it, right? That is how we've always done it, but we've always done it that way for a reason, and those first principles are the reasons, right? Because of the limitations of the hardware, we had to do it that way; it was the only way you could do it. But that is not the only way you can do it now. And so you really do need to go back and examine those first principles
Starting point is 00:37:37 and understand where they came from, understand a little bit about the history. And frankly, those are questions we get all the time from customers, people who are coming from, you know, a big, heavyweight Oracle data warehouse that they've had for 20 years on premise. And they say, all right, we're ready to move to BigQuery, we're ready to move to the cloud. What do we do? How do we structure our data? What's the right way to structure our data? I mean, I will say that one of the great things about BigQuery is that it's very forgiving. It is so powerful that you can make a lot of little mistakes and they really don't hurt you. You know, the query takes 2.1 seconds to return instead of two seconds to return. So, you know, I think you do start to get up to data volumes where you want to be thoughtful about how you model your data, how you structure it, and I think there are a lot of neat things that BigQuery makes possible to improve that. But, you know, certainly if you're talking about megabytes or gigabytes of data, millions of rows instead of billions of rows, the penalty is very, very small. So you really do have a very forgiving database. Whereas
Starting point is 00:38:49 in the past, if you made those kinds of mistakes, you'd be spinning for 24 hours and never get a result. So do you typically find, in a project that would involve BigQuery and Looker, that there's going to be an, in quotes, optimization or performance optimization phase in those projects? I mean, certainly my finding has been that it just works; the whole thing broadly just works. And I guess in BigQuery you're trying to avoid joins, and there are no indexes, but that's because it's fast anyway. Is that, I mean, in your experience, do the projects typically work okay performance-wise, or is there a process that goes on to optimize the BigQuery table structures and so on? What's your experience on that? Yeah, I mean, I think it really depends on the scale of the data that the customers are
Starting point is 00:39:30 dealing with. So if you're coming out of a Postgres or MySQL or SQL Server environment, the reality is your data can't possibly be big enough to challenge BigQuery, right? BigQuery is going to laugh at your data scale and make quick work of it no matter how badly you model it. If, on the other hand, you're coming out of a Hadoop world where you had a giant data lake and you're trying to make that perform, then you really do start to need to think about how you model your data. And that gets to one of the keys to our architecture, right? Because we're architected without our own proprietary analytic engine, we are reliant on the underlying data warehouses. And so, as a result, we have to work really, really well with them, right? We have to take full advantage of all of their tricks and all their power.
Starting point is 00:40:23 And so, you know, when BigQuery rolled out standard SQL to replace their legacy SQL last summer, that was immediately in our next release, you know, a couple of weeks later. When they rolled out an integration with Google Sheets, where you can use a Google spreadsheet as if it were a table in BigQuery, so you could have a billion-row table in BigQuery and a 50-row mapping table in Google Sheets and join those two together as if they both lived in your data warehouse: next release, you know, that was out in a couple of weeks. You know, support for nested tables, which I talked about a bunch at Google Next. Similarly, they just rolled out for BigQuery
Starting point is 00:41:02 a HyperLogLog implementation, which allows you to do things like approximate count distincts on aggregates and, really importantly, allows you to keep the intermediate state of those aggregates so that you can take advantage of it. So you can initialize a HyperLogLog calculation and keep a running sketch of it. So, for example, we might use that on IP addresses in our internal logs, right? We log something about where a user logs in from; we don't actually need the underlying information, but we do want to know how many different addresses are being used. And so we can keep a running tally, a running count of those distincts, using this intermediate calculation that BigQuery exposes to you. And so things like that are really
Starting point is 00:41:52 powerful ways to sort of improve the performance of BigQuery. And I will say, with BigQuery, the funny thing is that you have to work very hard to get BigQuery to go slow. But BigQuery can get expensive, because BigQuery charges on a per-query basis, so it matters that you take full advantage of the performance optimizations. And then also, you know, we have built things into the tool where Looker will use BigQuery's API to tell you, before you run the query, how much is it going to cost? How many gigabytes are you going to scan? And you can even then set a limit so that people can't accidentally run a runaway query that's going to cost $100 or $1,000. You can do all that right in Looker's connection settings. And, you know, I think that speaks to our really deep, native integration with these tools, that we want to take full advantage of those things. This is not a sort of afterthought.
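As an aside on those HyperLogLog sketches: their key property is that the intermediate state is mergeable, so you can keep a running approximate distinct count without retaining the raw values. Below is a toy, from-scratch illustration in Python. It is emphatically not BigQuery's implementation (BigQuery exposes this through its HLL_COUNT functions), and the register count and correction constants here are just textbook defaults:

```python
import hashlib
import math

class HLLSketch:
    """Toy HyperLogLog-style sketch: mergeable approximate distinct count."""

    def __init__(self, p=10):
        self.p = p              # 2**p registers; more registers = more accuracy
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)  # low p bits pick a register
        w = h >> self.p         # remaining hash bits give a geometric observable
        rank = 1                # position of the lowest set bit in w
        while w & 1 == 0 and rank < 160 - self.p:
            rank += 1
            w >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        """Union of two sketches is just the element-wise max of registers."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:   # small-range (linear counting) correction
            return int(self.m * math.log(self.m / zeros))
        return int(raw)

# Two sketches over overlapping ID ranges; merging estimates the union,
# which is exactly the "running tally" use case described above.
a, b = HLLSketch(), HLLSketch()
for i in range(1000):
    a.add(i)
for i in range(500, 1500):
    b.add(i)
a.merge(b)
print(a.estimate())  # roughly 1500 distinct IDs across both sets
```

Because merging is just an element-wise max, partial sketches can be computed per day (or per shard) and combined later, which is why exposing the intermediate state is so useful.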
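On the cost-control point just mentioned: because the warehouse bills per query by data processed, a BI layer can refuse to run queries whose estimated scan exceeds a cap. Here is a generic, hypothetical sketch of that guardrail idea; the names are invented, this is not Looker's actual implementation, and in practice the estimate would come from the warehouse's dry-run API (BigQuery reports estimated bytes processed for dry-run queries):

```python
class ScanBudgetExceeded(Exception):
    """Raised when a query's estimated scan exceeds the configured cap."""

def enforce_scan_budget(estimated_bytes: int, max_bytes: int) -> bool:
    """Gate a query on its dry-run scan estimate before actually running it."""
    if estimated_bytes > max_bytes:
        raise ScanBudgetExceeded(
            f"query would scan {estimated_bytes / 1e9:.1f} GB, "
            f"over the {max_bytes / 1e9:.1f} GB cap"
        )
    return True

# A 2 TB estimated scan against a 100 GB per-query cap is rejected
# before any money is spent.
try:
    enforce_scan_budget(2 * 10**12, 100 * 10**9)
except ScanBudgetExceeded as exc:
    print(exc)
```

The design point is that the check happens before execution: on a per-bytes-scanned pricing model, a runaway query is a billing problem rather than (only) a performance problem, so the budget belongs at query-submission time.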
Starting point is 00:43:03 It's not that we were architected for one world and now we're trying to figure out how to exist in the new world. This really goes deep into the way that we were architected from the very beginning. And I'd say that that is not to be underestimated. I mean, certainly, again, my experience with tools from the more, I suppose, on-premise, small-data-set world, so for example Tableau or Power BI, or any of those tools, really, that came along with BigQuery support: the issue with those is they're based on having a kind of cache layer, and that really isn't, I mean, I know you've got obviously persistent tables in Looker, but
Starting point is 00:43:38 in general, in that kind of environment, particularly BigQuery, you can't be relying on having a cache there. Well, I mean, why bother? Why bother scaling up and going to the cloud and using this incredibly powerful warehouse, and then use it to pull an extract out and put it on your local server? It's not practical. Another thing, really, is,
Starting point is 00:43:59 and that's a good point you made there about being mindful of the way these things are charged. So if you look, for example, at how certain BI tools work, they will go and retrieve the entire data set first of all, and then they will kind of lay it out for you to select the columns. Whereas BigQuery, because of the way it charges, obviously, charges by the amount of data a query touches and so on. The thing that clinched it for us with Looker was the fact that it was much more economical
Starting point is 00:44:23 in how it retrieved data. It was much more cognizant, really, of how BigQuery works. And I'm saying BigQuery all the time here, but it's a similar story with all of these new-style columnar, you know, hybrid big data systems that run in the cloud now. You've got to be aware of how they charge for things and work along with that. Yeah, we're very cognizant. We only join a table when you actually need a dimension from that table, right? And it results in SQL that you can,
Starting point is 00:44:52 you know, the SQL that Looker's writing is always available right behind a tab. It's just one click, and you can see exactly the SQL that Looker's writing, which is great because you can see, you can debug, right? If you're an analyst and someone says, hey, I don't think that this is quite right, you can look at the SQL immediately and say, oh, you're right.
Starting point is 00:45:10 You know what? We made a little mistake in the model. Let's go fix that. You fix that, you push it to production, they hit refresh on their browser, and it works properly. And so we do not believe in black boxes. We think black boxes and true analysis, real, valuable
Starting point is 00:45:26 analysis don't get along very well together. And so we want to be really transparent about what we're doing, and that means that we have to be clean. You know, sunlight is the best disinfectant. So we want to be clear and transparent about what we're doing, and we make it very visible so that folks can trust what we're doing. Okay, so to wrap up then, I mean, just tell the listeners: how would they get some exposure to Looker? How would they maybe get hold of a trial, or how would they come to understand some of these concepts around LookML? What's the next step if somebody's interested and they want to take this further and understand a bit more, really? Yeah, I mean, we'd love to talk to them. They can come to Looker.com
Starting point is 00:46:05 and sign up for a demonstration or a free trial. We give a free 30-day trial where we'll work with you to set up a Looker instance right on your database so that you can explore your own data and see how powerful it is. Our documentation is fully available on the web publicly, so you can go sort of check out LookML
Starting point is 00:46:25 if you're really interested in the concepts there. But, I really do think, Lloyd actually said this to me in our very first conversation, when he was trying to sell me Looker for the first time. He said, you know, I'm not that good at explaining what Looker is. Let me just show you.
Starting point is 00:46:39 I think that's really our experience: we're getting better at explaining what it is, but there's still nothing more powerful than seeing it in action. Fantastic. Well, look, Daniel, it's been fantastic speaking to you, and I'll also put the link to the video of your presentation with Dan McClary in the show notes as well. I mean, it's a fantastic story, I think, from where we were in the past with 3NF models and so on, through to where we are now. But, you know, it's been great to speak to you. Thank you very much for coming on the show, and yeah,
Starting point is 00:47:08 thank you very much. And it's been great to speak to you. Thanks Mark. Thanks for having me. Cheers. Thanks.
