Drill to Detail - Drill to Detail Ep.64 ‘Google BigQuery, BI Engine and the Future of Data Warehousing’ with Special Guest Jordan Tigani
Episode Date: April 29, 2019

Mark Rittman is joined in this episode by Jordan Tigani, Director of Product Management at Google for Google BigQuery, to talk about the history of BigQuery and its technology origins in Dremel; how BigQuery has evolved since its original release to handle a wide range of data warehouse workloads; and new features announced for BigQuery at Google Next’19 including BI Engine, Storage API, Connected Sheets and a range of new data connectors from Supermetrics and Fivetran.

Modern Data Warehousing with BigQuery (Cloud Next '19)
Modern data warehousing with BigQuery: a Q&A with Engineering Director Jordan Tigani
Introduction to BigQuery BI Engine
BigQuery Storage API Overview
Supermetrics and Fivetran BigQuery connectors
Drill to Detail Ep.2 'Future of SQL on Hadoop', With Special Guest Dan McClary
Drill to Detail Ep.31 'Dremel, Druid and Data Modeling on Google BigQuery' With Special Guest Dan McClary
Transcript
So welcome to Drill to Detail, the podcast about big data and data management in the
cloud, and I'm your host, Mark Rittman.
So we're joined on the show this episode by someone I've been looking to have come on
the show for quite some time now, Jordan Tigani, Engineering Director responsible for Google BigQuery, who comes to us
more or less direct from Google Cloud Next 19 last week, which is Google Cloud's big customer event
that runs each year at the Moscone in San Francisco. So welcome to the show, Jordan,
and thanks for making time to speak to us. Thanks very much for having me on.
Excellent. So Jordan, just to start, introduce yourself to the audience and tell us which kind of products you look after.
And obviously that will explain why I'm so interested.
Sure. So I'm actually the product management director for BigQuery.
I was engineering director and I've recently moved over to the dark side.
But I was one of the first engineers on BigQuery.
I was one of the people that sat around trying to decide what this thing was we were going to build. And what we came up with was BigQuery. And I've been riding this rocket ship for the last, I guess, eight years now. So I know where a lot of the skeletons are hidden in the code, or at least I know most of them. But also, now I'm looking into, how do we make this thing continue to grow? And how do we build it into this great data warehouse of the future?
Fantastic.
So my introduction to BigQuery came through the work I was doing in a previous role when I was in the engineering team at Qubit, who are a Google Cloud partner.
And we were using BigQuery there to store petabytes of consumer-level data, behavioral data, and then use it to drive personalizations and segmentation and so on.
And I was blown away by what a fantastic product it was.
And ever since then, BigQuery has been, I suppose, my database of choice.
So it's great to have you on the show, really.
So what was your route into Google?
How did you end up kind of doing what you're doing, really?
So I started out as an engineer before Google had the cloud. I worked at a bunch of startups and I never wanted to work at a big company again. I ended up interviewing at Google just to practice, but I had some friends that were here. And when I came into Google, there was an energy about the place, and I also loved that it was a bottom-up kind of place where individual
engineers with ideas could make things happen. And that's sort of how BigQuery came about.
Our site director had asked us to build something entirely different. He'd asked us to build a data marketplace, actually. And so several of us engineers were just sitting around the original tech lead's dining room table trying to figure out how we were going to build a data marketplace. And Google didn't have a cloud. So,
you know, we realized that for a data marketplace to work for large data sets,
you don't want to move the data to the compute.
You don't want to just download your data.
You want to actually move the compute to the data.
You want to have something you can do over the data in place
so that you can solve your analytics problems
or whatever the problem is you want to use the data for.
And so we had this great internal tool called Dremel that would sort of operate
over things sort of magically fast. And we're like, okay, well, we can tie these two things
together. We can expose Dremel in the cloud and then we'll get around to doing the data marketplace.
And so I guess eight years later, it's sort of the data marketplace has always been
like two quarters out for the last eight years. Maybe someday it'll happen. But I think it was
just one of those cool things where as engineers, we were empowered to build what we thought was
the right thing to build and nobody stopped us. So it's been an exciting ride and
exciting to be a part of something that now is much larger and very different. In the early days, if BigQuery went down over the weekend, we'd get a page and be like, eh, I'll deal with that on Monday. Chances are nobody will notice. Nowadays, our internal SLOs are, I think, three minutes per month of partial downtime. And that doesn't
give you a whole lot of time to ignore a page. I mean, luckily we have a really awesome SRE team
that helps us maintain that SLA.
But just the ways people are using it are different.
People are building their businesses around it.
And it's kind of an honor,
but it's also scary that, wow, if we mess up, then people's businesses stop working.
Yeah, exactly. I mean, exactly. So, I mean, there's a lot in that intro from you there.
Obviously, there's Dremel, which I think would be great to talk about. There's, I suppose,
the data marketplace and what that means in terms of how you deliver the service, you know,
the underlying data structures and data service there and so on. And, as you say, this is now powering businesses. Certainly looking at Qubit, for example, it was powering Qubit's business and therefore powering a lot of e-commerce businesses that rely on Qubit to do personalization and conversion optimization and so on. And certainly in my time working with it, there was never a problem with BigQuery. We had to optimize our spend and we had to optimize the queries and so on, but the actual underlying service was fantastic.
So what about Dremel? How were you involved in it? Was it already there when you arrived, and what problem was it solving back then?
So I think Dremel was kind of a seminal tool at Google. It grew to the point where I think 80% of the company was using Dremel, 80% of the company was a monthly active user of Dremel. Most of it was indirect usage, but it was still
necessary for how Google ran and how Google was able to be a data-driven company,
despite having, I guess we have eight products now with over a billion users,
and that just generates lots of data. But at the start, Dremel was created by an engineer.
And I always get his name wrong, so I should double check.
It's Andrey Gubarev.
And he created it while he was waiting for his MapReduces to finish.
So he was in the ads team, and his MapReduces were taking a long time, and he realized that there are these data structures used at Google that are very wide, so they have lots and lots of columns, and he only needed a few of those columns, but a normal MapReduce would have to read the whole thing. Column stores were starting to become popular at the time. And so he's like, first step, I'll build a column store. Second step, build a SQL engine.
And then he got an intern, and the intern was helping him, and they made it distributed. And some of the cool things about Dremel were that it built this tree structure for your queries. Databases in general rely on this on-disk data structure called a B-tree, and you can think of Dremel as a computational version of a B-tree. So at the bottom you have the leaves, and each individual leaf is responsible for a small portion of the data. And then as you go up this tree, it's higher levels of aggregation. So the initial version of Dremel was this static tree: you had root mixers, mixer ones, and then you had the shards. And it was extremely fast at doing
scan filter aggregate queries,
which is kind of what it was designed for.
And the problem is, you know, if you want to do queries that are more complex,
you want to do queries that have lots of, you know, subqueries or stages or joins,
you need to do multiple passes in this tree.
And if you think about it, it's a little bit similar to, you know,
when MapReduce was first created. MapReduce allows you to take some fairly simple problems, but apply them over very large data sets.
So you can change things into a map stage and reduce stage.
And so for Dremel, it was sort of like there's a scan and an aggregate stage which are really kind of similar.
But then just like Hadoop, it turns out that most interesting problems you want to solve
actually involve multiple of these stages, multiple passes.
And the original MapReduce didn't have a way to chain those together. And so there's things like, at Google it was called Flume; externally it's called Dataflow or Apache Beam.
Apache Flink also kind of lets you solve some of the same problems
of kind of taking your MapReduces and chaining them together
in a convenient way.
And so Dremel is still using SQL, but Dremel made a major architectural change about six years ago, which went from the static tree to a dynamic tree. So when you start your query, we build up this tree, with aggregations and shuffles at each phase.
And I think one of the kind of unique things about it is it's not just a dynamic tree that gets created at the start of the query.
It's a dynamic tree that actually will change its topology depending on how the query is running.
So, for example, take certain types of joins. Let's say you're joining against a small table, a small amount of data, which is pretty common in data warehousing scenarios. You have a small dimension table and you have your large fact table. To do a join, you basically want to co-locate the keys that are going to match up together. If you're joining on a key like customer ID, every customer ID that is, say, three, you have to put those in the same place, because you have to be able to match them between those two tables.
So the expensive way to do this is you shuffle the data on both sides,
and then you have workers work on ranges of key space.
But a much faster way of doing that, if the one side is small,
is basically you just call it a broadcast join.
You just take the small side
and you broadcast it to every single node.
And the thing is, you don't know
whether it's gonna be large or small ahead of time.
So we have to make these decisions dynamically, based on how much data there is, the cardinality as the query runs, et cetera. And then we can rapidly make those changes.
The other option for that is the cost-based optimizer. With a cost-based optimizer, you have all this information about the query and how you think it's going to run, and you make a decision ahead of time. The problem is, if you make the wrong decision, it's super expensive, and you're stuck with the way it optimized. And one
of the things that we do in Dremel and BigQuery is we try to just dynamically figure out what
the optimal query plan is.
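To make the two join strategies Jordan describes concrete, here is a toy Python sketch, just an illustration of the idea rather than how Dremel or BigQuery actually implement it, contrasting a shuffle-style join, where both sides are partitioned by key, with a broadcast join of a small dimension table:

```python
from collections import defaultdict

fact = [{"customer_id": i % 5, "amount": i} for i in range(20)]      # large fact table
dim = [{"customer_id": i, "name": f"cust-{i}"} for i in range(5)]    # small dimension table
NUM_WORKERS = 3

def shuffle_join(fact, dim):
    """Expensive path: shuffle BOTH sides so matching keys land on the same worker."""
    fact_parts, dim_parts = defaultdict(list), defaultdict(list)
    for row in fact:
        fact_parts[row["customer_id"] % NUM_WORKERS].append(row)
    for row in dim:
        dim_parts[row["customer_id"] % NUM_WORKERS].append(row)
    out = []
    for w in range(NUM_WORKERS):                      # each worker joins only its key range
        lookup = {d["customer_id"]: d for d in dim_parts[w]}
        out += [{**f, **lookup[f["customer_id"]]} for f in fact_parts[w]]
    return out

def broadcast_join(fact, dim):
    """Fast path when one side is small: send the whole dimension table to every worker."""
    lookup = {d["customer_id"]: d for d in dim}       # broadcast copy on each worker
    return [{**f, **lookup[f["customer_id"]]} for f in fact]

assert sorted(shuffle_join(fact, dim), key=lambda r: r["amount"]) == \
       sorted(broadcast_join(fact, dim), key=lambda r: r["amount"])
```

The point of the dynamic tree Jordan describes is that the engine can pick between strategies like these at run time, once it sees how big each side actually turns out to be, rather than committing to one plan up front.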
So this is probably quite a few years ago now that you were working on Dremel and making these decisions.
And there were quite a few kind of, I suppose, potential solutions to running SQL on Hadoop at the time at a kind of interactive speed.
There was Tez, there was Drill and so on.
Did you make your decisions in the knowledge of those? I mean, I know a lot of them came after Dremel or were based on it.
But, you know, presumably you didn't go down those routes.
You went down this route.
What was your kind of thinking around that?
It was sort of interesting.
I guess that's one of the cool things about Google.
Some people think it's cool.
Some people think it's inefficient.
But at Google, there were a number of SQL query engines
to query over large data sets that turned into MapReduces. There was Tatra, Tenzing, PowerDrill, as well as F1 Query; there were all these different engines. And so it was hard to choose which one of these was going to win. And I think one of the nice things about Google is that by letting a lot of these different ideas flower, one of them does win. So you do have competition. They each push each other to be better, faster, more efficient, because you're competing for this internal talent. Sometimes it does seem like Google can be kind of cloistered, or sort of not-invented-here. And in other places, when you have that, what people come up with is often non-optimal because it hasn't really had to compete against anything.
If you're at a big company where they say, we're going to build X, and this is going to be the X that we build, until you get that into the market and other people are using it, you're in a space where you're not competing. So, I don't remember the exact reason that we chose this. I think there was a lot of momentum behind Dremel at that point, and it seemed like the thing that was going to succeed, but it wasn't a foregone conclusion. And then I think Drill basically started breaking ground right after we started, I think about a year later. And we started thinking about, okay, should we work with these folks? And they kind of went off in a slightly different direction. But we decided to keep them at arm's length a little bit, just for focus purposes; we had a lot of other stuff that we were trying to work on.
Okay.
So, I mean, apart from, I suppose, the technology, the sort of general technology and so on, when you look at BigQuery, then there's some kind of fairly big decisions you took there. To me, when I came into this industry from working with more traditional,
I suppose, on-premises data warehouses and Hadoop clusters and so on,
I wouldn't have thought of them at the time,
but they were massively important as to why people were adopting them.
So things like the fully managed part, taking that for example,
what made you go down the route of it being a fully managed service? What led to that, and what was your thinking behind it, really?
Perhaps surprisingly, there was almost no thinking behind it, because at Google, that's just how you build services. Serverless wasn't a term yet, fully managed wasn't a term yet. It's like, okay, well, if we're going to build a service, we want to make it so that people don't have to manage infrastructure, because at Google, when you build services, generally nobody manages infrastructure. So at the time it was obvious. And I think there are some other interesting corollaries
or related things to that that weren't quite as obvious.
One of those is, I think,
one of the key things in BigQuery
is all of the data is in one namespace.
So if I have a data set
and I think you might think it's interesting,
I can share that data set with you
and you can join that with your data.
And partly that comes from our idea of eventually building a data marketplace.
To eventually build a data marketplace, you want to have one name for something. But also it becomes really powerful: we have customers who aggregate analytics data from their users, and they show them dashboards, but for some of their advanced users they want to actually give them direct access to the data. So they can just share the BigQuery data set with them, and then that company can go and query it themselves.
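As a rough illustration of that sharing model, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset and email names are invented for the example. Granting another user read access on a dataset is all it takes for them to query it, or join it against their own data, from their own project:

```python
from google.cloud import bigquery

client = bigquery.Client(project="publisher-project")            # hypothetical project
dataset = client.get_dataset("publisher-project.web_analytics")  # hypothetical dataset

# Add a READER grant for a user in another organisation / project.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@customer.example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

# The grantee can now reference the shared table directly in their own queries,
# for example joining it against data in their own project:
# SELECT ...
# FROM `customer-project.crm.accounts` a
# JOIN `publisher-project.web_analytics.events` e USING (account_id)
```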
Yeah, I mean, we used to do exactly that in my old role. We would provide either aggregations and dashboards, or we'd give them access to the data. They could then access that from their projects and it was completely seamless, really.
And certainly the fully managed part, it doesn't seem that important when you don't have the benefit of it. But then you realise it means you can auto-scale it, it means you can just run things on there and it just works, and that means you can focus on innovation and analytics and all that kind of stuff, really.
So that was, you said serverless there as well.
And I think serverless again was a new concept
that I encountered when I first came across BigQuery.
So maybe just explain what that means
in the context of a data warehouse.
Yeah, sure.
So, you know, a lot of data warehouses,
you know, even cloud ones, in order to run it,
you need to spin up a server or you spin up an instance,
even if it's just a virtual instance.
You often can trace that back to: it's running on this virtual machine somewhere. BigQuery, on the other hand, is this sort of giant multi-tenant service. We have hundreds of thousands of nodes, and your queries get multiplexed across those hundreds of thousands of nodes.
And so we have some nice advantages of scale, but nobody has to worry about instances.
And if any of those underlying instances fail, nobody notices.
In fact, if I looked at our Borg page right now, probably there'd be a dozen in any particular cell that were down, they were restarting.
And queries that were using that would have maybe a 100 millisecond glitch.
And that's the only difference that they would experience.
And we even do things like: BigQuery runs with data in multiple availability zones. So if in one of the zones there's a stuck tablet, or a network switch is down, or one of the other things that can go wrong in running a large-scale service, we will drain that zone and basically move everybody out into their backup cells.
And we do this probably, I don't know, a few times a week, and there's no downtime. It just magically happens. And that's the kind of thing you can do when you're serverless and you're not actually relying on individual VMs. And again, that's just how stuff gets built in general at Google, so we didn't even think there was anything special about it when we did it.
It's just like, okay, this is just what you do.
And sort of like now in retrospect,
like, oh, I guess other people
made different design choices.
So when I first used BigQuery again,
looking back, it was a fantastic service,
but you definitely noticed
when it wasn't a relational database.
And so, like joins: my first crashing encounter with reality on BigQuery was when I tried to join two big tables together, and I suddenly realized that joins weren't really effective on BigQuery. But it's come along a lot since then. So maybe just tell us a little bit about why there was that, I suppose, impedance mismatch at the start, and how the product has evolved to be functionally more like a regular data warehouse now.
Sure.
So I think originally it all just stems from Dremel and logs processing and logs analysis at Google. Basically at Google, we had these giant denormalized protocol buffer logs that everybody operated on. And so one of the key features of Dremel was its ability to use nested and repeated data, to use arrays, because that's just how the data was.
And so it basically was designed so you
can do a single scan over it and get your query results.
So joins were very rare.
In fact, when we were building BigQuery,
we, you know, actually some of the BigQuery engineers
implemented joins in Dremel because they didn't exist.
You know, a year later, I guess in 2012,
the two teams merged.
And so we've been the same team ever since.
But one of the things we realized, and this was several years ago, is that external customers have different expectations than internal Googlers. So the first thing we did was switch to standard SQL, to ANSI standard SQL. We built a new ANSI standard SQL query engine where we adhered to the standard as much as possible, with some additions to it, like nesting and arrays. And then we've also worked really hard on joins. Our join performance has gone up, I think, something like 7x in the last year and a half, looking at some TPC benchmarks.
That came after realizing that data warehousing workloads are generally very join heavy. People have their star schemas, their fact and dimension tables, their third normal form. And we need to be able to do those well if we want to be competitive. So we've been heads down, working on making BigQuery into a fully featured data warehouse.
You know, I think DML came a little bit later,
so being able to do updates and deletes
because that's also very, very important to data warehousing.
So those sort of things.
I remember speaking to Dan McClary about this a while back and talking about joins and so on, and I did wonder whether, in a way, we are using the product in the wrong way when we do that. You talk about meeting customers' demands and what they want to do, but fundamentally, is it still the case that a big wide table is better than ones with joins? Is there still a better way to work with BigQuery?
So fundamentally, what we tell people now is: use the data structures that are most natural for you.
So if what is natural is a denormalized star schema, use a denormalized star schema.
If what's natural is normalizing everything, or using nesting and repeated fields, do that, because sometimes that is more natural. One case where that can be more natural is orders and line items. Often that's represented as two separate tables and you have to do a join. There's a one-to-many relationship, but really an order is one thing. And so if you can nest the line items within the orders, then you're going to have a more compact data model. But in general, use what you're comfortable with, and then tune it as needed. It's sort of like the old saw that premature performance optimization is the root of all bugs, or whatever. Don't prematurely optimize and think you have to denormalize, because normalized queries and join-heavy queries are often quite fast. But that said, if this is something that's really important to make run fast, then hey, maybe denormalizing a little bit or using nesting and arrays can make it even faster.
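To make the orders-and-line-items example concrete, here is a hedged sketch, with invented dataset, table and column names, of what nesting line items inside an orders table looks like in BigQuery standard SQL, run through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# An orders table with line items nested as an ARRAY of STRUCTs, instead of a
# separate line_items table that would need a join.
ddl = """
CREATE TABLE IF NOT EXISTS `my-project.shop.orders` (   -- hypothetical dataset/table
  order_id    STRING,
  customer_id STRING,
  order_date  DATE,
  line_items  ARRAY<STRUCT<sku STRING, quantity INT64, unit_price NUMERIC>>
)
"""
client.query(ddl).result()

# Querying the nested data: UNNEST flattens the array back out when you need
# per-line-item rows, all in a single scan of one table.
sql = """
SELECT
  o.order_id,
  SUM(li.quantity * li.unit_price) AS order_total
FROM `my-project.shop.orders` AS o, UNNEST(o.line_items) AS li
GROUP BY o.order_id
"""
for row in client.query(sql).result():
    print(row.order_id, row.order_total)
```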
So how tempted have you been in the past to add indexes then to BigQuery?
You know, it's funny, it just came up like half an hour ago. I was talking to one of the eng managers and he's like, when are we getting indexes? One of the important things is, we don't want you to need
to have somebody whose full-time job is to tune BigQuery
for the types of queries that they're running.
We talked to customers and they're like,
I talked to one customer last week who said,
we have a full-time person whose job is to manage the workload management
in their on-prem data warehouse.
It's not even to manage the schemas.
It's just manage the jobs that are running
and that are running faster and slower.
We don't want you to need those types of people.
We don't want to put anybody out of a job.
We also want those people to be working on something that's more rewarding to them than having to sort of babysit things.
And so kind of the idea is no ops.
You shouldn't need to be a DBA.
But we do run into things where, like, okay, primary keys: if we could apply a uniqueness constraint here, we could apply all these optimizations. And so there is some tendency to do some of those types of things, but we try to make sure we do it in a principled way, saying, okay, we're going to get all this value, but also making sure that it's not something that people need.
We don't want you to have to use a primary key here, but hey,
if you want to signal to us that there's this key relationship or
to optimize these joins, you can create this key relationship,
then maybe we will add those.
But then again, maybe we won't.
I would say the day you add indexes is the day that you could.
Yeah, I think it's interesting, I think, about product management, isn't it?
You know, how much you follow.
I mean, you must have every, you know, so many customers saying to you,
you know, add indexes and this kind of stuff.
So they give very good cases to why, you know,
you're doing a single row lookup and so on there.
But, you know, it is the thin end of the wedge,
and you will end up with a whole industry of people
who then become tuning experts at BigQuery.
And the nice thing about it is it just works.
So I always look out for that being the sign of the apocalypse, really,
when you guys add indexes to that.
What about migrations from platforms like, say, Oracle,
Teradata, and Netezza? I mean, presumably you've had some of those happen, but what's the driver behind those, how have those migrations gone, and what have the benefits been?
So I think there's a clear move from on-prem to cloud, and a lot of the on-prem players are losing market share and losing a little bit of relevance to the folks in the cloud. We have tooling that will let you migrate: I think we have a Teradata to BigQuery migration service.
We have Redshift to BigQuery migration service.
Other stuff in the pipeline.
We have other data integration tools.
We bought a company called Cask, and we just released that as Data Fusion so that you can create
data pipelines. Google's also purchased Alooma. So our story right now is a little bit
disconnected, but I think step one has just been like, make sure that we have first-party
options for migration.
And then the next step is we're going to try to tie those together into a coherent whole.
But I think we do see a lot of migrations from other on-prem data warehouses
and even a decent amount from other cloud data warehouses.
Cloud data warehouses often, the migration is faster
because people have already kind of already come to terms
with what it means to be in the cloud.
You know, some of the on-prem to cloud migration
is just getting people comfortable with storing their data in the cloud
and the cloud security controls.
So, for example, HSBC: they are in the middle of a kind of long migration process to us. And one of the reasons that it's taken so long is because they've had to work with the regulators to basically say, okay, this is safe, it's okay to use, this security is good enough for storing unmasked
data in the cloud.
And they've gotten a lot of the European regulators to agree to that.
But it can just sometimes take a while.
And just sometimes it takes people a while just to be internally comfortable with kind
of the level of security and controls you get in the cloud.
And some of it's a little bit of loss of control. It's like, oh, well, if somebody else screws up,
then it impacts me.
But on the other hand, I can guarantee
that we have more security and network security people
and physical security people on staff
than virtually anybody.
And we treat people's security and privacy as incredibly important, because if we screw it up, it could be the end of our business.
Yeah, exactly.
I always used to say that people saying security was an issue with cloud was a bit of an excuse; it was really because of other reasons and so on. And security suddenly stopped being an issue when flexibility and price and so on were so obviously better in the cloud. And yeah, it's interesting to hear about banking and those kinds of clients, because I suppose the perception of BigQuery is that it's very much for cloud-native, maybe web-native companies, but you say banks like HSBC are using it now as well.
Yeah. So, I mean, that was something interesting at Google Cloud Next that was just last week.
There was a real kind of phase change in the types of organizations that we were talking to.
And it was a lot fewer.
They were quite boring, weren't they, if anything?
But rather than being, I suppose, the hot new internet companies,
you say it was very normal kind of household-name companies.
Yeah, no, it was fantastic.
I don't want to use anybody's name who hasn't agreed
to be a referenceable customer yet. But lots of big names
and people who know that cloud is important. They don't want to have to manage their own infrastructure or build their own software infrastructure. And they want to be able to focus on getting analysis results
rather than running servers.
And so they're coming to us and talking to us.
And yeah, I think we've had some really good success so far.
Interesting.
I mean, just one last question on the migration side.
How much is, I mean, have you ever thought about stored procedures?
I mean, obviously we come from the Oracle world.
That was a massive, massive part of what we used to do, I guess more from the transactional side. But have you ever thought about, I suppose, stored procedures in the database and business logic in the database?
So, absolutely. Our general policy is we don't talk about stuff that's coming down the pipeline, but this one is in alpha already, so I will confirm that, yes,
stored procedures are going to be a thing.
Interesting.
I had no idea.
So that's very interesting.
And one of the drivers of that, as you mentioned, is migration. People have their giant scripts that they've run somewhere else, and it creates temporary tables and has some loops and some variables. And currently, as part of the migration process, you have to split those up into individual query statements, and it's super painful. So we'll have stored procedures, and you can basically just run that whole thing. You might have to tweak it a little bit; we're following the ANSI standard as much as possible, but every database engine has their own quirks, so you may have to tweak it, but at least you don't have to do major surgery.
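Stored procedures and scripting were only in alpha at the time of this conversation, so the exact syntax wasn't public; as a rough sketch of the kind of multi-statement script Jordan describes (variables, a temporary table, a loop), assuming the BigQuery scripting syntax that later shipped and reusing the made-up table names from earlier:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A multi-statement script submitted as a single job: variables, a temp table,
# and a loop, roughly the shape of a migrated stored-procedure-style script.
script = """
DECLARE day_offset INT64 DEFAULT 0;

CREATE TEMP TABLE daily_totals (d DATE, total NUMERIC);

WHILE day_offset < 7 DO
  INSERT INTO daily_totals
  SELECT DATE_SUB(CURRENT_DATE(), INTERVAL day_offset DAY),
         SUM(li.quantity * li.unit_price)
  FROM `my-project.shop.orders` AS o, UNNEST(o.line_items) AS li   -- hypothetical table
  WHERE o.order_date = DATE_SUB(CURRENT_DATE(), INTERVAL day_offset DAY);
  SET day_offset = day_offset + 1;
END WHILE;

SELECT * FROM daily_totals ORDER BY d;
"""
job = client.query(script)   # the whole script runs as one query job
for row in job.result():     # results of the final SELECT
    print(row.d, row.total)
```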
Okay, interesting. So let's move on then to Google Next 19. There were some fantastic things announced last week. Maybe
just talk about Google BI Engine. What is that? And what problem is it
trying to solve?
Sure. So the problem it's trying to solve is that a lot of our customers, they have their data in BigQuery and they want to build dashboards
based on the BigQuery data. And, you know, when you load a dashboard, it might do 10 queries
because each one of the little widgets is a query,
and then you might have a whole bunch of users.
And kind of when you do this, it can be slow.
You really want this whole page to render in less than a second.
And so one of our lead engineers, who has worked on BI stuff in the past, he's like, I know how we can do this better. We can build what is effectively an in-memory OLAP engine and do these really fast scans. And we can build it right on top of BigQuery so that the data is all backed by BigQuery; in case we miss the cache, you're always going to get the correct data. But instead of getting results in seconds, we can generally get results in tens of milliseconds. And so that's really the goal of BI Engine: sub-hundred-millisecond queries, high concurrency.
Our billing model is different for that, too. We only charge for the memory that you have, rather than the bytes scanned in the query, so that if you're using it, you can have a hundred or a thousand concurrent users all running dozens of queries, and you're not worried about them running up a huge bill.
So it's super powerful. It's currently tied to Data Studio.
Sorry.
Yeah.
I tried to use it with Looker the other day and I saw there was...
So why is that?
What does Data Studio do or the connection do that means it only works with that for
the time being?
So we have a separate...
It's a separate API currently.
It's a low-latency API that skips some of the stuff that our current asynchronous API has to do. And it's also tuned for BI-style queries, so there's no SQL parsing involved, et cetera. We've talked to some of our partners like Looker and Tableau, and they are going to wait until we have a version that operates over SQL. So the BI Engine memory cache, the adaptive cache, will be available to all SQL queries, but it's not available yet.
Okay, okay.
So you say adaptive cache, then.
I'm used to, I suppose I'm used to two ways of doing this.
One is, one thing we tried to do in the old role
was we looked at maybe using Druid on top of BigQuery,
and that would be a case of loading up the Druid cache,
and then once it's in there, it's fast and so on. I mean, first of all, how conceptually similar is that to what you're doing now? Presumably you have to load your cache up and keep it in sync and so on.
So customers don't have to load anything. The first query will have to scan the underlying data, but that should load the cache.
And if the cache gets full, basically it's not necessarily an LRU. We apply a bunch of heuristics
to figure out what the best stuff to cache is. And so we won't necessarily cache a full table.
We'll only cache the columns and the partitions that are being
used. So you can use it over a large table. Let's say you have 10 gigs of BI engine memory,
and you might have a 100 gigabyte table. It'll likely all fit because, A, it's the compressed size; B, only the columns that you use; and C, only the partitions that you use go in the cache.
So we don't currently have a way to pin data into the cache.
We want to see how well we can figure out things automatically.
But we may end up with the ability for people to say,
hey, I really want this to be in the cache no matter
what.
Okay.
Okay.
And you mentioned, the other way I've seen this being done is through, you mentioned
OLAP servers, and obviously they've got the ability to kind of pre-compute aggregates
and work through hierarchies and so on in a certain way.
Is there anything in the product at the moment or in the pipeline to help with, I suppose, aggregation of data as well?
Not necessarily, no. So basically,
the way we're just approaching that is that our BI engine is so fast that it'll effectively
let you build these aggregations and kind of OLAP aggregations very fast.
So what about another one? I mentioned indexes, too, but the other side of that horror story of optimization is materialized views and query rewrite and so on.
I mean, again, has that been,
and the context of this has been,
when I used to build Looker dashboards,
sometimes I used to think it was so handy if I could just pre-compute some aggregates and have the database rewrite the query to use the aggregates.
I mean, again, is that something that's been thought about with BigQuery?
You know, again, I don't want to pre-announce anything, but I would say that's something that we've thought about.
Okay, that's good. So what about, I mean, obviously, you know, when we used to use BigQuery in my previous role,
it was to receive streaming data via PubSub,
and it was a huge amount of data coming in in real time.
Is BI Engine part of that story around real-time analytics and real-time data ingestion,
or is it very much sort of separate or what really?
Absolutely.
So actually one of the goals of BI Engine, and it's not necessarily tied to it yet, but it will be, is that if you're streaming data into BigQuery, A, queries should still come back in tens of milliseconds, and B, they should have up-to-the-second results. Because part of the reason you want to have these dashboards is you want to be able to make decisions based on what's happening now. I was talking with a customer who oversees a railroad system and wanted to know what is happening right now.
And that's a question that traditional data warehouses, they just can't answer.
What's going on in this IoT network of mine? What's happening right now? And in order to do that, yes, the real-time analysis is super important. And that has to go all the way through the tool chain, from how the data is being generated, where it shouldn't be buffered and batched, to how it's represented in the data warehouse.
And I think that at...
Oh, sorry.
So I think one of the things, I think one of the areas of real differentiation in BigQuery
that we're pushing on is this real-time data warehousing and real-time decision making.
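As a minimal sketch of that ingest path, here is what streaming rows into BigQuery looks like with the Python client's streaming insert API; the project, dataset and field names are invented for the example. Rows sent this way are queryable within seconds rather than waiting for a batch load:

```python
import datetime
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.telemetry.sensor_readings"   # hypothetical table

rows = [
    {"sensor_id": "s-42", "reading": 0.73,
     "ts": datetime.datetime.utcnow().isoformat()},
    {"sensor_id": "s-17", "reading": 0.91,
     "ts": datetime.datetime.utcnow().isoformat()},
]

# Streaming insert: rows become visible to queries almost immediately,
# so a dashboard query issued right afterwards already reflects them.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Insert errors:", errors)
```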
And it's kind of related to some of the stuff in Dataflow as well. Cloud Dataflow, which is our managed data pipelines tool: in Dataflow, batch and streaming are two sides of the same coin. You can turn a batch into a stream just by changing one parameter. And that's super powerful. The architects of Dataflow and Apache Beam argue that while most people think streaming is a special case of batch, that as the batch size goes to one you get a stream, it's actually the opposite: batch is a special case of streaming. Because if you think about the batches, say you have daily batches, well, that's just a windowing strategy: the window I have on my stream is the daily batch. But you could easily have a sliding window. You could have late-arriving data that gets applied to the window. Once you see batch as part of streaming,
then it kind of opens you up to lots of different ways
of representing your data and processing your data.
And so we want to take that end to end with our data analysis
products.
So from Dataflow streaming to BigQuery,
we want to have a really nice streaming system that
operates in lockstep and at virtually any scale.
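To illustrate the "a batch is just a window" idea, here is a heavily simplified Apache Beam (Python SDK) sketch; the Pub/Sub topic and BigQuery table are placeholders, and a real pipeline would also need schema handling and error handling:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)  # read Pub/Sub as an unbounded source

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "One" >> beam.Map(lambda msg: ("events", 1))
     # A daily "batch" is just a fixed window over the stream; swapping in
     # window.SlidingWindows or adding allowed lateness changes the windowing
     # strategy, not the shape of the pipeline.
     | "Daily" >> beam.WindowInto(window.FixedWindows(24 * 60 * 60))
     | "Count" >> beam.CombinePerKey(sum)
     | "Format" >> beam.Map(lambda kv: {"key": kv[0], "events": kv[1]})
     | "Write" >> beam.io.WriteToBigQuery("my-project:analytics.daily_counts"))
```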
So in my GCP Next talk last week, I showed off a Dataflow pipeline that was sending BigQuery 22 gigabytes per second, which is about almost two petabytes of data per day. BigQuery was processing it, and then you could run a query over it. I was simulating a sensor network, so I injected a row from a bad sensor and showed in a query that it shows up immediately. So I think it's a really powerful area, and
I think it's going to become more important in the future as people start to leverage the value of
this data that's being generated over time. All data really starts with an event and so those
events can be processed as a stream. Anyway, sorry for the rant. It's something that I'm excited about.
It's really interesting.
One thing, actually another announcement you made, or it was made at Next, is a bit of a blast from the past, really.
And it was the BigQuery storage APIs.
And one of the first things that was very interesting about Hadoop
was this idea you could have, you separated your query engine,
your workload, effectively, from the storage.
And you could run all different types of query, like graph queries, or SQL queries on top of Hadoop storage.
And then I noticed that you've got the storage APIs being announced, which seems to me was about,
again, running different workloads on BigQuery storage. And what was the thinking behind that?
And what was the driver behind releasing that feature?
So there are a lot of people doing data lakes these days, where they store their data in object storage in Parquet files or ORC files, and they run Spark jobs and Hive jobs over that data. And the problem with that is that Parquet files are difficult to manage.
And so you have to, where do I store this data?
What's in it?
What's the schema?
How do I make sure that the access control is set correctly?
How do I make sure that it's optimized?
So in general, data comes in over time.
And when data comes in over time,
there's a tendency if you want faster access,
more real-time access to the data,
you end up with smaller and smaller files.
So you have smaller and smaller files,
then you have more files,
and then it's much less efficient to query,
sort of the small files problem.
And I think like Databricks Delta
is trying to solve some of this problem.
And so when you have an actively managed storage system,
you get a lot of benefits from performance.
You get benefits from consistency
because you actually get atomicity of changes to the data.
Let's say you want to delete one row out of a table. Because files in object stores are generally immutable, that involves a file creation and a file deletion. So there's an inconsistent period, whereas if you're in a
managed storage system, you can provide a facade that says, I only want to see the exact version
of the data now or at some time in the past. So BigQuery has what I think is a really top-notch managed storage system
with lots of features like DML,
like automatic kind of data coalescing and optimization.
So what the Storage API does is, if you have a data lake paradigm where you're running from multiple different types of tools, it lets you operate over the BigQuery data as if it's your data lake. It allows you to do some of the same types of things, like filter pushdown and column projection, so you don't actually have to read all of the data in, but it also manages data governance and performance at the same time.
So I'm actually super excited about the storage API because I think that it really kind of turns the data lake idea upside down.
Because one of the things that people don't want to do is they don't want to have to pay twice for their data.
They don't want to have to store their data in their data lake
and their data warehouse.
With different copies, they have to worry about synchronization, access control, drift.
This lets you store one copy of the data in your data warehouse
where you can actively manage it and then operate over it at full speed if you're using Spark, Spark ML, Hive, et cetera.
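As a rough sketch of what reading BigQuery managed storage directly looks like, here is the google-cloud-bigquery-storage Python client with column projection and a filter pushdown; the project, table and column names are placeholders. A Spark or Hive connector would be doing much the same thing under the covers:

```python
from google.cloud import bigquery_storage_v1
from google.cloud.bigquery_storage_v1 import types

client = bigquery_storage_v1.BigQueryReadClient()

# Fully qualified table in BigQuery managed storage (hypothetical names).
table = "projects/my-project/datasets/telemetry/tables/sensor_readings"

requested_session = types.ReadSession(
    table=table,
    data_format=types.DataFormat.AVRO,
    read_options=types.ReadSession.TableReadOptions(
        selected_fields=["sensor_id", "reading"],     # column projection
        row_restriction="reading > 0.9",              # filter pushed down to storage
    ),
)

session = client.create_read_session(
    parent="projects/my-project",
    read_session=requested_session,
    max_stream_count=1,                               # more streams for parallel readers
)

reader = client.read_rows(session.streams[0].name)
for row in reader.rows(session):                      # rows decoded as dicts (Avro)
    print(row["sensor_id"], row["reading"])
```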
Interesting.
So another announcement was around BigQuery ML.
And I've been using BigQuery ML myself to do things like basic logistic regressions and so on. And again, the thing that's been interesting with it is how easy it's been to use compared
to other solutions I've seen in the past for doing, I suppose, stats analysis and machine
learning in the database.
And if anybody doesn't know what BigQuery ML is, maybe just kind of recap on what it
is and then what was announced at Next19.
Sure.
So BigQuery ML is machine learning in the database. As I mentioned before,
you want to move the compute to the data rather than the data to the compute. It's being able to
do machine learning without having to move the data and with the convenience of SQL access. So you run a query that looks just like a normal SQL query
to train a model.
So instead of running create table as select,
you run create model as select,
and you tell it which model type you want to run,
and it'll build your machine learning model.
And the output of that will be a model that you can then reference in another query. When you predict or infer based on that model, it's a SQL function, a table-valued SQL function, that you can use as if it was just a normal function. So it's really powerful.
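As a minimal sketch of that flow, with dataset, table and column names invented for illustration, training a logistic regression and then predicting with it looks like this in BigQuery ML:

```python
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL ... AS SELECT: same shape as CREATE TABLE AS SELECT, but the
# result is a trained model instead of a table.
train_sql = """
CREATE OR REPLACE MODEL `my-project.shop.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT
  days_since_last_order,
  total_orders,
  total_spend,
  churned
FROM `my-project.shop.customer_features`
"""
client.query(train_sql).result()   # training runs as a query job

# ML.PREDICT is a table-valued function over the model plus an input query.
predict_sql = """
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
  MODEL `my-project.shop.churn_model`,
  (SELECT customer_id, days_since_last_order, total_orders, total_spend
   FROM `my-project.shop.customer_features`)
)
"""
for row in client.query(predict_sql).result():
    print(row.customer_id, row.predicted_churned)
```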
It had been limited, however, to two model types, logistic
and linear regression, which operated really well
at high scale.
But many people said that they wanted other model types.
So we've launched matrix factorization.
Matrix factorization lets you do kind of recommender systems.
We launched, what was the other one?
I think it was k-means clustering.
Yes, k-means clustering is now in beta.
Thank you.
And we've got a couple more in alpha.
So in alpha is being able to build neural networks.
And that one is interesting because it's really the first that ties it to Cloud Machine Learning Engine. We now have a commitment between our team and the Cloud AI team that basically everything you can do in Cloud AI and machine learning, we're going to be able to do through BQML. So I think this is really going to be our portal from structured data analysis to machine learning.
So just the last thing I want to talk about in the new features that were launched last week is the data transfer connectors. There were a bunch of partners you now have offering services on the Google Cloud Marketplace, so Fivetran and Supermetrics.
What is that all about, really?
And again, what problem is that solving,
and what features does that add to BigQuery?
So it's really about being able to get data into the database
more efficiently, or into the data warehouse more efficiently. So we have built a capability that allows anybody to build connectors to get data into BigQuery and then to resell that data. Fivetran was the first one we've announced; you can basically, with a few clicks, import data from, I think it was 100 different sources, including Salesforce data and a bunch of other data locations. So it's really about, you know, people have data in
other data sources, data that's locked up in other tools, SaaS applications, services. And we want to be able to pull that data out because there's value in joining that data with your other data
and sort of centralizing the data in the data warehouse.
Okay.
And I mean, I suppose, again, one last thing.
So there was the Data Fusion product.
So again, I was interested to see it's a kind of point-and-click
data integration product.
And you say, actually, Google have bought the company behind it.
I mean, is there anything you can talk about that at all?
Is it a product you're aware of?
So Data Fusion is a really convenient and easy-to-use tool that lets you just drag and drop: you can select your source data service, select your destination data service, which hopefully will be BigQuery but could be anything really, click a button, and it'll generate a Google Dataproc job, which is actually a Spark job under the covers. And it'll run that job on a schedule and do the data migration. You can also specify transformations and schema changes.
And it's got a lot of... We actually, for the beta, we turned off a lot of the features that were in the underlying service that we bought, but expect to see
a lot of interesting things coming from that in the next few months.
Okay.
Okay.
And going back to your point about lots of projects being invested in and then seeing which one succeeds: I've used, in the past, the Trifacta sort of service that's on Google Cloud, and there's Airflow as well.
I mean, is there a particular, I suppose it's different,
I suppose it's different tools for different kinds of jobs, really.
But is there a particular technology or product
that you guys are investing in,
which is the go-to thing for data integration on BigQuery?
Or is it really all of them in different cases?
So Data Fusion, I think, is really the one that we are betting on; this is the one that's going to be our data integration tool of record. We're not going to shut any of the other things down, or even stop investing in them, but this is what we see as going to be our marquee thing going forward. And Composer, which is our managed Airflow, is very different. It has perhaps slightly overlapping use cases, but it requires a lot more understanding of what you're doing. So for most people that want to pull data from one location, or even do CDC, the way we are going to recommend is through Data Fusion.
Okay, fantastic. And one last thing: you've written a book recently. So what's that, well, I guess, what's it about actually? Tell us about the book you've written, and your co-author as well.
Yeah, so the book is in progress.
It's in sort of early release.
I think we've got actually just today two more chapters
were launched into the early release.
So we have now five chapters.
And yeah, I mean, this is, you know,
I think it's called BigQuery: The Definitive Guide.
And, you know, the idea is this is sort of what BigQuery is and how to use it.
I wrote a book five years ago, which was the first book on BigQuery.
And it's horrendously out of date now.
And it's still for sale.
So I think people are still buying it, which I feel really bad about, because it's giving them bad ideas and a bad impression about what BigQuery can do and what it can't do.
So I think that's one of the rationales behind this new book is sort of like, hey, this is
what BigQuery can do now.
You know, BigQuery is now a fully featured data warehouse that can do all these other
additional things.
And so my co-author, Lak, I have a hard time pronouncing his full name, but everybody calls him Lak at Google.
But he's an absolute machine.
He just turns out pages and he's been really instrumental in making this book happen.
And I've been largely so far been along for the ride.
And he's been kind of the driving engine of that.
But he also runs a lot of our training programs at Google, and he has another book, I think on machine learning with GCP.
Fantastic. Well, that'll be my present for the family for Christmas this year. So that'd be excellent. So just to round things up, how do people find out about these new product features that came out recently? Can they maybe see some of the videos and presentations that you did last week?
I mean, where do people go to for that?
Yeah, I mean, so the Google Cloud Next videos
are all up on YouTube.
There's a spotlight session that has a good highlight
from all the data analytics features that were released.
My session on modern data warehousing, I think, gives a good overview of all the things that were going on in BigQuery, just to plug my own session.
But, you know, I think we do have,
you know, we do have release notes
and ways that you can keep up to date
with the things that are coming out.
And one of the problems that, you know,
sometimes we find is that,
we don't do a great enough job of shouting from the rooftops when we make improvements. And so sometimes we talk to customers, even really good customers, and they're like, well, I'm doing this to work around this problem, and we're like, well, we fixed that problem like a year ago, or we added this new feature a year ago, and they haven't started using it. If it's working for them, that's fine, but if they have to do kind of unnatural things because they're not using it, then that's not as fine.
Fantastic. Well, look, Jordan, it's been fantastic speaking to you.
And I really appreciate the detail you've gone into there with how the product works and what's come out recently as well. And yeah, all I can say is it's the database that I use day to day. So, yeah, it's fantastic.
It's great to speak to one of the people behind it.
So Jordan, thank you very much for coming on the show
and appreciate that and have a good rest of the day.
Well, thanks for your time.
This has been fun.