The Data Stack Show - 186: Data Fusion and The Future Of Specialized Databases with Andrew Lamb of InfluxData
Episode Date: April 24, 2024
Highlights from this week's conversation include:
The Evolution of Data Systems (0:47)
The Role of Open Source Software (2:39)
Challenges of Time Series Data (6:38)
Architecting InfluxDB (9:34)
High Cardinality Concepts (11:36)
Trade-Offs in Time Series Databases (15:35)
High Cardinality Data (18:24)
Evolution to InfluxDB 3.0 (21:06)
Modern Data Stack (23:04)
Evolution of Database Systems (29:48)
InfluxDB Re-Architecture (33:14)
Building an Analytic System with Data Fusion (37:33)
Challenges of Mapping Time Series Data into Relational Model (44:55)
Adoption and Future of Data Fusion (46:51)
Externalized Joins and Technical Challenges (51:11)
Exciting Opportunities in Data Tooling (55:20)
Emergence of New Architectures (56:35)
Final thoughts and takeaways (57:47)
The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week, we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Welcome back to the Data Stack Show.
We're here with Andrew Lamb, who's a staff engineer at Influx.
And Andrew, Costas and I have a million questions for you that we're going to have to squeeze into an hour.
You worked on the guts of some pretty amazing database systems.
But before we dig in, just give us a brief background on yourself.
Hello. Yes, thank you. I'm Andrew. Obviously, I've worked on lots of low-level database systems.
I started my career at Oracle for a while, then I worked on an embedded compiler for a while in a
startup. And then I spent six years at a company called Vertica, which built one of the first
of a wave of big distributed, shared-nothing, massively parallel databases.
I then worked in some various machine learning capacity startups,
which was fun, but not really related to data stuff so much.
And then for the last year, I've been working with Paul Dix, the co-founder and CTO of InfluxData, on the new storage engine for InfluxDB 3.0.
So I've been heads down working on building a new sort of analytic engine focused on time
series. And you've also been working a lot with Data Fusion, right? The open source project.
And there are like a lot of things we can chat about, as Eric mentioned. But something that I'm
really interested to hear from like your experience and perspective, Andrew, because you've been
in this space for a
very long time, is that it feels like we are almost at an inflection point when it comes
like to data systems and how they are built, something that's probably happening for a long
time. But data systems are complex systems and very important systems. So there's always like,
let's say, risk is not exactly, like, something that people want to take when
it comes, like, to their data, right?
So, like, the evolution usually is, like, a little bit slower compared, like, to other
systems.
But it seems like we are reaching that point right now where the way that we build these
systems is going to, like, radically change.
So I'd love to hear from you how we got to this point.
What are, let's say, the milestones that are leading to that,
and what the future is going to be looking like based on what you've seen and what you're seeing
out there. So that's the part that I'm really interested to hear. What about you? What are
some things that you would like to talk about? Yeah, I would love to talk about that. And I'd
love to talk about sort of the role open source software plays in that evolution
and how we got there.
That was something that Influx, I think, always as a company has understood and really valued
is open source and how that's both evolving and then also how you leverage open source
to build the next generation products.
I think it'll fit beautifully.
And I think we can illustrate the story of sort of what we're doing with InfluxDB 3.0
as part of that longer-term
trend. And I think there's lots of interesting things to talk about there.
Sounds great. What do you think, Eric? Let's go and do it.
Well, I'm ready. I'm ready. We need to hit the ground running here because we have so much to
cover. Yeah, let's do it. Okay. I'd love to just level set on what Influx is. I think it's very commonly known as a time series database,
but can you just explain what it is and sort of orient us to where it fits in the world of
databases and what people use it for? Absolutely. Yeah. So I think InfluxDB
is part of a longer term trend in the database world where we went from
sort of monolithic systems, a very small number of very
expensive ones like Oracle or DB2, that kind of stuff, and instead we've seen a proliferation of
databases that are specialized for the particular use case you're using them for. And so
time series databases as a category are relatively new, and I think the reason you end up with specialized
categories is because if you focus on those particular use cases, you can end up building systems that are like 10x better than the general purpose one.
And I think that's approximately the bar you need in order to justify a new system.
So time series databases in particular often deal with lots of really high volume denormalized data.
So that means you're basically having to read CSV files that come in, and they quickly want to be able to query them, like, within milliseconds.
Not, you know, like when I started in the industry too many years ago,
like you did batch loading at night, right?
Like you do it once a day, if you were lucky.
Yeah.
Right.
And now it's like maybe once an hour, maybe once every five minutes, but,
like, time to data is, like, measured in milliseconds.
So that's really important.
Probably being able to query super fast afterwards.
And then
I think also because
it's not part of an application that's been set up
before, you don't necessarily know what's coming from
the client, so you don't upfront specify
what columns are there. So you basically
have to figure it out.
Data just flows in, and the database
has to figure out what columns
to add. And then of course
lots of time series data,
the most recent stuff is really important.
The stuff that's older is much less important,
and making sure your system can efficiently manage
a large amount of data,
even though only part of it really needs
to be dealt with at low latency.
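As a rough sketch of the schema-on-write behavior described here, the snippet below parses two made-up points in InfluxDB's line protocol and accumulates the set of columns it has seen, so a later write can introduce new tag or field columns. The example data and the naive parsing (no escaping, no types) are assumptions for illustration only, not InfluxDB's actual ingest code.

    // A rough sketch of schema-on-write: each incoming line can introduce new
    // tag or field columns, and the database has to discover them as data arrives.
    use std::collections::BTreeSet;

    fn main() {
        let incoming = [
            // Line protocol: measurement,tags fields timestamp
            "cpu,host=server01,region=us-west usage=0.64 1434055562000000000",
            // A later write adds a brand-new tag ("rack") and field ("temp"):
            "cpu,host=server02,region=us-west,rack=r7 usage=0.11,temp=41.0 1434055563000000000",
        ];

        let mut columns: BTreeSet<String> = BTreeSet::new();
        columns.insert("time".to_string());

        for line in incoming {
            let mut parts = line.split(' ');
            let series = parts.next().unwrap_or(""); // "cpu,host=...,region=..."
            let fields = parts.next().unwrap_or(""); // "usage=0.64,temp=41.0"

            // Tag keys become columns (skip the leading measurement name).
            for tag in series.split(',').skip(1) {
                if let Some((key, _value)) = tag.split_once('=') {
                    columns.insert(key.to_string());
                }
            }
            // Field keys become columns too.
            for field in fields.split(',') {
                if let Some((key, _value)) = field.split_once('=') {
                    columns.insert(key.to_string());
                }
            }
        }

        // After the second write the discovered schema has grown.
        println!("discovered columns: {:?}", columns);
    }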
And there's a huge number of examples
with open and closed source, of this type of database.
Obviously, Influx is one, but there's a whole bunch of others like Graphite
and Prometheus and a whole mass of them, right?
There's a whole bunch of closed source ones,
both at the big internet companies,
like Gorilla from Facebook or Monarch from Google.
And Datadog has versions of this, right?
That are all proprietary to their own service.
So it's a whole category.
And Influx is the biggest open source one.
So I want to dig into something you mentioned.
So this evolution, we'll dig into this specifically,
like this evolution of purpose-built databases all combined together.
But in terms of time series specifically,
is the need for those, you said it's relatively new, is that need driven
both by the opportunity that's created because there's technology that can solve some of those
problems that were probably difficult to solve before from a latency standpoint?
And then how much does just the availability and increase in volume of
data that we are able to collect and observe? How do those two factors influence the somewhat
recent importance of time series databases? Yeah, I mean, I think you hit it. I mean,
you understand, right? As we had more and more systems that operated, you need to actually be able to observe them and figure out what they were doing, right?
Which actually turned out to be a major data challenge itself.
But it's a very different data set than like your bank account records,
which is what the traditional transaction system databases were designed for.
And so I think the workload characteristics are quite different.
Can we walk through just one?
I'm going to throw an example out and you tell me if it's a good one or bad one, but I'm just going to throw a use case out that I think would make a
lot of sense for InfluxDB and would love for you to speak to like, well, what are the unique
challenges for that? I'm thinking about like IoT devices. So, you know, think about a factory
that's wired up with potentially hundreds or thousands of different types of sensors.
You mentioned actually the incoming data can change.
And when you think about sensors, installing new sensors, collecting different types of data, that is probably changing intentionally a lot as you measure different things and
install new equipment.
Am I thinking correctly about that use case? Yeah, no, and that's another good example. Like, you know, in a factory,
the rate of change of the software you have installed on your robots is probably different
than the rate of change of software that's deployed in some cloud service right like it
doesn't get deployed every minute, right? Because you've got to, like, go mess around with the robot.
So that's all the more reason why the data format that comes off those is relatively
simple, and you need to be able
to handle that.
time series or IoT is
definitely the classic thing. And the
idea is you dump your data in a time series
database, and now you can
do things like do predictive analytics
or pull the past history of those machines
or the sensors or whatever, and you build models
to predict when they're going to fail or when they need maintenance or,
you know, variety of other, you know, machine learning problems, which is now of course,
the hotness, right. But like to drive any of that stuff, you need data to actually
to train the models on. Yeah, totally. That makes total sense.
Well, switching gears just a little bit. So the use case makes total sense. You've spent a lot of time, though, sort of building the guts of this system, right? And in fact, sort of helped transform the system into the latest iteration of InfluxDB. And I love talking about this because that's unique compared to, you know, say,
ingesting data or, you know, a dedicated analytics tool or, you know, some of these other
things that are maybe moving data or operating on data after it has been loaded
into some sort of database system. So what types of things have you had to think about
as you've architected this system
and interested specifically in the unique nature of time series data?
Yeah, so maybe I should start by just blathering a little bit.
So I don't exactly have an academic database background,
but I've spent a lot of time studying them.
And so I'll just describe a little bit,
like basically all the original time series databases all have approximately the same
property, or approximately all have the same architecture, which is they all are driven by
a data structure called the LSM tree, right? Which is basically just a big, it's like a key value
store effectively, but it's a way to quickly ingest data. And so it helps you with the really
high volume ingest rate, and it lets you sort of move data off the end, right?
As the data gets cold, it's easy to drop it off the end.
But it also requires you effectively to map the data
that comes in into like a key value store.
You got to basically map the identifier
of the time series into a key.
And then in order to do any kind of querying,
you need to basically be able to map your queries to ranges on that key,
which I'm not sure I'm doing a great job explaining that,
but that's like the high level.
Yeah.
Totally tracking.
Yeah.
So what that means is that architecture is phenomenal in a couple of ways.
You can ingest data super fast.
That's great.
You can also look up individual time series,
right?
Like if you know exactly the robot you care about, it can basically calculate the key range that you care about and you can
immediately fetch. And it's like, it's hard to imagine a better architecture to do that.
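As a rough illustration of what's being described, here's a minimal sketch that uses an ordered map as a stand-in for an LSM tree. The series key layout (series identifier followed by timestamp) and the sample data are made up for the example; this is not InfluxDB's actual key encoding.

    use std::collections::BTreeMap;

    fn main() {
        // Stand-in for an LSM tree: an ordered map from key -> value.
        // Assumed key layout for illustration: "<series id>#<timestamp>".
        let mut store: BTreeMap<String, f64> = BTreeMap::new();
        store.insert("factory1.floor2.robot42.pressure#0000000100".into(), 2.1);
        store.insert("factory1.floor2.robot42.pressure#0000000200".into(), 2.3);
        store.insert("factory1.floor2.robot42.pressure#0000000300".into(), 2.2);
        store.insert("factory1.floor2.robot43.pressure#0000000100".into(), 1.9);

        // "I know exactly which robot I care about": compute the key range for
        // that series and time window, then read contiguous keys. This is the
        // query shape these engines are phenomenal at.
        let start = "factory1.floor2.robot42.pressure#0000000100".to_string();
        let end = "factory1.floor2.robot42.pressure#0000000300".to_string();
        for (key, value) in store.range(start..=end) {
            println!("{key} -> {value}");
        }

        // A query across all robots has no narrow key range: you end up walking
        // most of the data, which is why ad hoc analytics is the hard case.
        let avg: f64 = store.values().sum::<f64>() / store.len() as f64;
        println!("average pressure across everything: {avg:.2}");
    }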
Sure. The challenges, I think, tend to come in several ways. So the first one is if you want to do a
query that's not just, I already know which robot I care about, right? Or I want to go do something
like look across all the robots. Then
it's typically very challenging because you
basically have to go walk through all the data to do that.
Likewise, if you have
a large number of individual things like
lots of little tiny robots
or containers or something,
which we would call high cardinality at the time,
that's typically very challenging
with the traditional architecture
because you have the same problem of the index you need
to figure out exactly where to look in the LSM tree
becomes huge and becomes overbearing.
Andrew, can you expand a little bit more on the high cardinality?
I think it's one of the things that comes up, like, very often when
we are talking about time series data. I think people have the concept of what cardinality is
and what high cardinality would mean, but what does it specifically mean when it comes to time
series databases and why it's so important? Yeah. So cardinality just means the number of
distinct values in some ways, right? That's the academic term. And so even if you have a million
rows, if there's only seven distinct values, for example, you only
have data in seven distinct AWS regions, well, that's a different data shape than if you've got a
million rows and they've all got a million individual values, like individual IDs or something, right?
Unique IDs.
And as you might imagine, the way you store that is different.
So the reason it's particularly important for time series is this: the reason it's called a time series database is there's a notion of a time series, which tends to be a set of timestamps, right?
And then measurements of something, values, at that time. But it's not just values and times that kind of hang out in the ether.
They're attached to something which identifies where they came from, right? Like, what is it
a measurement of? It's the measurement of pressure of the robot over time or something, right? But not
just the robot, like this particular robot in this factory on this floor, whatever. And so for a time series
database, the classic ones are typically organized so that they store all the data for a
time series together, and the queries then ask for individual series, right? So, like, I want a dashboard
that shows me, for this robot, what its current pressure value is or something. In order to build
a database that answers that kind of query really quickly, you need some kind of way to quickly figure out, from the robot identifier or whatever you have, where it's stored. Typically the key is something like the robot ID and the timestamp, and so then if you want to do
a range, like tell me something about the robot for the last 15 days, the keys that have that data
are all next to each other in the key space, and so your database can just go look up one place
and just read all the data contiguously. So that's good, right? So that's great. Now you asked,
what's the problem with the high cardinality?
It's that I've been hand-waving and saying there's some index structure that lets you quickly figure out where in the key space the particular robot is.
So in Influx, that thing is called the TSI index.
Other systems have the same notion that it's called something different.
But it basically lets you quickly figure out where in the key range the thing you care about is. And the problem
is typically those indexes
are the same order of magnitude size
as the number of distinct
values in your keys. So if you have a thousand
robots, you've got a thousand entries
in this file, it's fine.
You've got a million
robots or ten million robots or a hundred million robots.
The
index size just keeps growing bigger.
And so then managing that index becomes, that becomes the problem. And that's exactly,
I mean, so that's why most time series databases will tell you don't put high cardinality data in.
So it's not that they can't deal with large numbers of values. They can't deal with large
numbers of individual identifiers
for the time series themselves.
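A toy sketch of why the index grows with cardinality rather than with row count: the maps below stand in for something TSI-like, with one entry per distinct series key no matter how many rows each series has. The key shapes, counts, and "location" values are made up for illustration.

    use std::collections::HashMap;

    fn main() {
        // Stand-in for a series index: maps each distinct series key to
        // wherever its data lives. Its size tracks distinct series, not rows.
        let mut server_index: HashMap<String, u64> = HashMap::new();

        // Low cardinality: 1,000 servers (imagine each reporting millions of rows).
        for server in 0..1_000u64 {
            server_index.entry(format!("host={server}")).or_insert(server);
        }
        println!("1,000 servers        -> {} index entries", server_index.len());

        // High cardinality: 1,000,000 short-lived containers, each its own series.
        let mut container_index: HashMap<String, u64> = HashMap::new();
        for container in 0..1_000_000u64 {
            container_index
                .entry(format!("container_id={container}"))
                .or_insert(container);
        }
        println!("1,000,000 containers -> {} index entries", container_index.len());
        // The second dataset can hold fewer total rows, yet its index is 1,000x
        // larger, and maintaining that index is what starts to dominate.
    }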
And I would assume there's always some trade-off there, right?
That's why we have these problems,
that we have to trade something for something else.
So what do we have to trade off, let's say, in a time series database, if we would like to
also support high cardinality without even having to care about it, right? What would we have to trade
off? Right. I think what you have to trade off is the absolute minimum latency to request a single
time series. The way the traditional time series databases are structured,
it's hard for me to imagine some
way that you could do better.
If what you want is just the particular time
series for some particular range,
you've got that basically in
some memory structure and you've got
a single place where you can look at it and figure
out what range you need to scan.
That's pretty fast.
And it's hard to... I can't, you know, maybe
engineering-wise you can eke some more percentage out of that, but that's not like some fundamental
architectural thing. Like, you look up where you want, it's all one contiguous thing, and you just
go find it. So that's the thing that they're phenomenal at, and so you're going to have to trade off, like, super low latency.
I think you'd still be very good, right? We talk all about this.
But I don't think you're ever going to
be able to do quite as good.
Yeah, and so
that's like the...
So your high cardinality
is really an impact
on latency, right?
As you're operating over the data.
That's the trade-off for the end user?
Well, the trade-off for the user is typically
if you try to put high cardinality data into these systems,
it will tend to cause real trouble.
Like the maintenance of the index will just basically dominate.
Yeah.
The other alternative is you don't put that in and you...
But instead, you now need to, like, every query,
instead of knowing exactly where to look, it has to look
at a much broader swath, so it's just
more work on every query.
Yep, yep. Sorry to interrupt, Kostas.
No, go ahead. I interrupted you
in the first place, so please...
No, no, no. You were asking
some questions.
Well, actually, just to continue on that.
So when we think about this trade-off, especially as it relates to cardinality,
like, what are the... because, I mean, you don't necessarily always have control
over the actual nature of your, I guess, physical data, if you want to say that, right?
And so is the response to this to say,
when you talk about sort of best uses for a time series database, is that limiting it to
certain, you know, time series data that has lower cardinality? Or, like,
if I'm choosing a specific tool to manage time series data, how do I need to think about where to employ
it and where, because of cardinality issues, it might cause issues for me? Yeah.
So, just to be clear, I want to start by saying I haven't been down in the trenches
deploying time series databases for people, so I can give you a high-level answer, but not one
born out of experience. So I think typically what will happen is, you know, where does high
cardinality data come from?
Right.
If what you're doing is tracking, like, servers in your server farm, you'd only
have so many servers, and that probably doesn't change much most of the time.
If what you're tracking is like individual Kubernetes containers, right.
Of which every time, you know, your microservice has however many of them, right?
And they continually get created and destroyed, right?
Like that starts looking like a high cardinality column.
And so I think if you want to use
one of these sort of traditional time series databases,
you have to be careful when you are deciding
what to send into the database and what to store
and how to identify the things you care about.
You know, like probably you don't send the individual container ID, right?
Like you probably have some sort of service identifier instead,
which is low cardinality.
So I think it's not that you can't use traditional
time-series databases for those use cases,
but it means that you have to be much more careful
when you stick data in that you don't, you know,
inadvertently include a high cardinality column
in something that's going to cause trouble.
Yeah.
There's plenty of people who are able to work, you know, make that kind of change.
One other question that I have: this is great, by the way, I don't know if we've talked about time series in this depth before. We talked about a use for time series data being the need to query that data in sort of an immediate way, you know, the queries needing to be more immediate, if you will.
In terms of thinking about when data goes cold or when you need to offload it, that's obviously a decision for the end user to decide what their thresholds are for that.
But generally, if we go back to the IoT device question,
in that case, what would you see for InfluxDB users? When are they offloading the old data?
Are there use cases around the historical analytics around that,
as opposed to getting a ping for potential device failure because an ML model is running on, you know, the data in near real time.
Yeah. So I think you've hit it on the head there. Well, let's see,
let's take a step back. So like the traditional architecture, you have a time series database
that ingests the data quickly and keeps it available for some window of time. I think
typically seven days or 30 days, like either of those would be common times, maybe
two months, but probably not that much longer.
And then if you want to do longer term analytic queries, you tend to have to basically have
a second system, right?
So the time series data gets directed to both the time series database and this other
system, right, that then uses it. That's probably some big distributed analytic database or something.
Right.
Yeah, just doing classic, you know, look-back analytics.
Yeah.
And where you're much less latency sensitive.
So, I mean, obviously that's one mission of InfluxDB 3.0, is that, you
know, it's really inconvenient to have two systems, right.
If you can have one system, or even better, if you have a system where you,
the developer, don't have to direct the
data streams into two places, right? You only have one ingest
pipeline and then maybe I'll talk
a little bit about the new version later.
But you direct it into your time series
database and it eventually ends up as Parquet files on object
store, still with all the low latency
you need.
But at that point, your existing
analytic systems, which probably also
deal with Parquet on object store a lot these days, sort of have a unified view. The other thing
that I wanted to mention about the cost thing is, I think most customers aren't like, oh, I only
want seven days of data, right? Like, they want more, they want much more than 30 days
or whatever. They're typically driven by the cost, right? They just have a certain amount of cost or capacity, and the current time series technology required basically fast
disks and, you know, large amounts of memory, and so that tends to limit the scope.
But I think there was a real need, like, hey, it doesn't make
sense, right? Some of my data is really important to be hot, but, like, I'd love to keep
three years' worth of data, but my expectations on query latency on that are much
lower. You know, it's okay if it takes longer than, like, the stuff that came in
30 minutes ago. Being able to do that in a cost-effective way was another driver for InfluxDB 3.0.
I was just going to say, I want to dig into InfluxDB 3.0. Before we get there, and maybe this is a good bridge, can we zoom out just a little bit and talk? Because you mentioned having to do those in two separate systems.
I mean, there's a lot of overhead involved in that. But zooming out just a little
bit, that really gets at this question of what you said at the beginning, right? Where there's
this trend of sort of breaking apart the monolithic systems into these various database systems,
tons of really good open source tooling out there today, sort of enterprise grade,
I guess you could say.
How do you think about architecting a system where you have multiple databases?
I mean, in some sense, like having a giant monolith is, you know, it certainly decreases the complexity, even though there's tons of trade-offs there.
But how do you think about that?
I mean, we could call it the modern data stack, we could call it, you know, sort of compartmentalizing these pieces using
best-of-breed tooling, but help us think
through that.
So, traditionally, right, since the database
was tightly bound to the storage
system, like, the database was
the source of data, and so if you wanted multiple
databases in your system, you'd actually have to, like,
orchestrate the
data flow between them, right? That's what ETL
is all about, right? Like, you take the data from somewhere, you extract it,
you transform it, you put it somewhere else, right?
So that's the, and so I think as you pointed out,
the problem with doing that is now you want specialized tools
for the specialized job.
You have copies of your data and data pipelines everywhere.
And it's, you know, not just ingest, but like,
then once inside you're calculating whatever you need to calculate.
So I think what's happened over the last five years,
and it's only going to happen more, is that the source of truth for analytics specifically
is moving to Parquet files on object store. I had Ryan Blue from Tabular here the other week.
That's solving a problem that comes out of putting a whole bunch of Parquet files on object store.
But the economics and the compatibility of having your data as Parquet files on object store are, I think, compelling enough that that's where everyone's
going to move to. Like, it's basically, you know, like having an infinite FTP server in the sky,
sorry if I'm dating myself, right? But, like, S3 is basically, you know, infinite
and super cheap. Oftentimes we have terabytes of stuff that are like, oh, we probably don't even need that, but, like,
hey, it's the cost per month, who cares, right?
Like, you know.
Yeah.
And so I think that's going to become the source of truth, right?
And then what you're going to have is a bunch of specialized engines
that are operating and are specialized for whatever your particular use case is
that will read data out of there and potentially write it back.
And obviously, InfluxDB 3.0 is one of those.
But I think basically any database system
that's architected relatively recently,
like that for analytics especially,
that's the trend.
And I think that's only going to accelerate.
And I think that is really good
because now you have your data in one place,
maybe depending on how you feel about Amazon
or the other cloud providers.
But now your tools will now be able to talk to it
in the same place rather than have to have
end-to-end point connections.
Right, like daisy-chaining all these different systems.
Yeah, Costas, I know you have a ton of questions
around this architecture
and have actually thought a ton about it yourself.
So I would love to hear your thoughts as well
on the market moving
towards this. Yeah, 100%. I think it's kind of like a natural evolution. I think in the same way
that if we think about how we were building applications 15, 20 years ago, the monolithic
way of doing it in the early 2000s would never be able to scale to where the industry is today or the market is today.
So we had to become much more modular.
At the beginning, we were talking about, okay, we have the backend, the frontend, and this actually evolved to a much more granular architecture at the end.
That's for the applications, right?
We kind of need something similar
for databases too.
If we want to build, say,
products around data,
we can't afford for the teams
to continue... you know, what they were saying, that
to build, like, a database company, you need, like, 10 years just to get to market,
because you have to build it first.
Like in 10 years, it's like a completely different world out there.
That's why we have so many databases that have died.
Like people are like trying to go and do that,
and then by the time you're ready to go out,
you just cannot go to market anymore.
So if we want to have a similar way of building
and creating value off the top of data,
we need to replicate these kinds of patterns
that software engineering has done for other systems, but now doing it on
data, on databases. Practically, it doesn't matter at the end if it's a distributed
processing system or whatever; at the end they all share the same, let's say, principles
behind them of how they are architected. So it's inevitable to happen. And there's no way that
we can, whatever we say about AI or whatever is going to happen in the next 10 years,
without strong foundations on the infrastructure that manages, creates, and exposes this data,
nothing will happen. It's just how this industry works, how the world works at the end, physics.
Right. So that's, I think, inevitable. The question is how fast it will happen and what
direction. And that's what I would like to talk a little bit with Andrew because Andrew is bringing like a very unique experience here,
in my opinion. First because, okay, he's been like building database systems for a very long time,
so he has seen like the old monoliths to whatever like the state of the art is like today,
but also has experienced, let's say, what it takes to leave the monolith behind
and build something that adopts these new patterns.
So, Andrew, I'd like first to ask you,
what do you think are the main milestones in these past 10-15 years
that got us where we are today,
if you had to pick two or three.
And I think you mentioned something already,
which is the separation of storage with compute,
in a sense, right?
Which I think was what started everything in a way.
But there are others too.
It's not just like that separation.
So tell us in your mind
what you would put there
as like the top three
most important changes that happened.
Yeah, well, so take what I say
with a grain of salt
because it's very focused on analytics
and it's also colored by my background.
But I think the first thing
that was really important
was, like, you know, in the late 2000s, early 2010s, parallel databases actually became a
thing, right? That they weren't some crazy hardware. You know, I don't know if you guys remember, but
like, earlier, if you wanted to do, like, four cores, I mean, you had to buy, like, some fancy Tandem,
like, specialized hardware thing, right, to actually run multiple cores in your database. It was crazy.
and so there was a raft of companies, including Vertica,
that were part of this wave that they went from
single-node machines to actually having commercial
offerings that were
parallel databases. So I think
that was one. They actually figured out parallel databases.
Second one was
columnar storage, as you put it,
for analytics, right?
Before, again, it's actually
often a lot of the same companies,
but it sounds stupid, right?
Like, well, traditional databases store stuff in rows.
Now we're going to store stuff in columns
and that's going to be earth shattering.
Like it doesn't sound all that amazing,
but it ends up being very important
when your workload shifts from being, like, load heavy, right?
Like transaction systems where
you have lots of little things,
lots of little transactions,
but each one touches like a small amount of the data.
When you switch to really caring about analytics or really analyzing large amounts
of data, the workload pattern is very different.
The queries that come in, there's many fewer per second, but each one touches a much broader
swath of the data.
And so organizing your data in columns as a column store really makes that significantly
faster to do
and much more efficient.
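Here's a minimal sketch of the row-versus-column point with made-up data: an analytic query that only touches one column scans far less memory when the data is laid out column by column (struct of arrays) than row by row (array of structs). The table shape and sizes are assumptions for illustration.

    // Row-oriented layout: each record stored together, good for "fetch one order".
    struct OrderRow {
        order_id: u64,
        customer_id: u64,
        amount: f64,
        // ...dozens more columns in a real warehouse table
    }

    // Column-oriented layout: each column stored contiguously, good for scans.
    struct OrdersColumnar {
        order_ids: Vec<u64>,
        customer_ids: Vec<u64>,
        amounts: Vec<f64>,
    }

    fn main() {
        let rows: Vec<OrderRow> = (0..1_000_000)
            .map(|i| OrderRow { order_id: i, customer_id: i % 10_000, amount: 1.0 })
            .collect();

        let columnar = OrdersColumnar {
            order_ids: rows.iter().map(|r| r.order_id).collect(),
            customer_ids: rows.iter().map(|r| r.customer_id).collect(),
            amounts: rows.iter().map(|r| r.amount).collect(),
        };
        assert_eq!(columnar.order_ids.len(), columnar.customer_ids.len());

        // "SELECT SUM(amount) FROM orders" touches one column.
        // Row layout: every OrderRow is dragged through the cache to read one field.
        let total_from_rows: f64 = rows.iter().map(|r| r.amount).sum();

        // Column layout: only the amounts vector is scanned, contiguously, which
        // is also what makes compression and vectorized execution work well.
        let total_from_columns: f64 = columnar.amounts.iter().sum();

        assert_eq!(total_from_rows, total_from_columns);
        println!("sum(amount) = {total_from_columns}");
    }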
And then the third one, since I get three,
is the separation of storage and compute.
Like you said, that I think is often referred to
in database terms as a disaggregated database.
And the distinction is, like, Vertica and similar systems,
like ParAccel, or early versions of Redshift,
were like this too,
were what's called
shared nothing architectures,
which basically meant
you had a compute node
attached to disk
and the distributed database
was sharded effectively internally,
but then the individual nodes
are responsible
for individual parts of the data,
which is great
because that meant
you guaranteed
to have co-located compute.
The problem is
if you ever wanted to change,
like basically, if you wanted to have
elasticity and change the resources you gave to the system, you had to move all the data too,
because the data was tightly bound to the compute. And so that was definitely a challenge of Vertica's,
dealing with that scalability. But I think that's why disaggregated databases became a thing.
Snowflake now seems obvious, but I remember reading the Snowflake paper.
They wrote a paper about their
architecture back then. It sounded crazy, right?
Like, we were
building distributed
databases for these hard, really high-end
enterprise-grade
machines. And they're like, we're going to build a distributed
database on a bunch of crappy VMs from
who knows where, and we're going to build it on
top of this object storage thing, which is
hundreds of times slower than
these fast disks that we're using
in the shared nothing architecture.
But I think they really,
looking back on it, it's clearly prescient.
They were basically the first major
commercial success
to have that architecture.
And it just seems
obvious now, but
that was really revolutionary, I think, at the time.
Yeah, yeah.
It sounded like a crazy idea,
like how are you ever going to build a database like that?
Yeah, 100%.
Okay, so let's talk a little bit about
the evolution of Influx,
because from what I know,
you started with different, let's say, core technology there.
And then at some point, you re-architected and moved to some different technologies, which will get us also talking about Data Fusion here.
So tell us a little bit about the story.
What, first of all, did Influx look like?
What triggered the need to investigate moving to something else,
and what, at the end, made you move into using something like Data Fusion.
Yeah, so I can talk a little bit about the motivation for Influx, the product.
I think Paul is probably the best person to talk about it,
but he's spoken about it publicly,
and it's not surprising when you think about it.
A bunch of the challenges we talked about earlier with time series database, like high
cardinality and having bigger retention intervals and doing sort of more analytics style queries,
all those were long-term asks from InfluxData customers, right?
So Paul was looking for technology that would basically allow him to satisfy them. And I think also, as I understand it,
the market's effectively becoming commoditized, right?
There's a huge number of, you know, Prometheus came out
and a whole bunch of other open source various products
that are basically Influx clones.
I'm sure that everyone would take offense at that, right?
I'm not trying to offend, but like basically
the high level architecture is the same, right?
You've got this in-memory tree and you have some index, and maybe it's better, but at
the core it's all getting commoditized. So I think Paul was looking for, like, where do we bring the architecture to the next level? And to your point earlier, though,
like, building a database, as he knows, because he just did it for, like, eight or ten years, right,
it's a long-term endeavor, right? It's much longer than you ever want, and it takes hundreds of millions of dollars. I have some
slides somewhere that i you know you go through and just look at like pick your database company
and go look at how much money they raised and it's probably you know hundreds of millions of dollars
and it's not that all that money gets spent on engineering of course but a substantial amount
does so i think when paul was casting around like, how am I going to build a new database engine? The only practical way to do it
probably is find things you can reuse so you don't have to build it all from scratch.
And I think he, early on, even in 2020, he wrote a blog post that said Apache Arrow, Parquet,
Data Fusion, Flight, they're game changers for implementing databases.
You know, at that point,
I was like stuck in some engineering management nonsense.
So like, I just basically took it by faith.
Paul said, hey,
these are some great technologies.
I said, as long as you let me code,
and I'll code it,
I'll do it in Rust,
and I'll use whatever
technologies you want.
I just need to get back in engineering.
But I think he was very, like, he was very prescient.
Like, I think he identified those early on.
And then, you know, we've invested to help those ecosystems drive forward.
But I think those are like the core, some of the core technologies.
And the reason they're core is because if you're going to go build an analytic system,
now, it's been done for, like, 20 years.
So, like, we've had two or three waves of commercialization,
and that was built after, like, 10 years of industrial and academic research. So basically
the point is, people have built the same things over and over again, and now I think we
are finally at the point where the patterns are clear. And so, rebuilding it again, you
don't need to rebuild it again, you can actually build on it, and the APIs are the same. So, like, things that are really basically common
to every OLAP engine, and I'm talking everything, like, Vertica had it, Snowflake has it, DuckDB has it,
you know, Data Fusion has it, like, you need some way to represent data in columns both in memory and on disk,
because that's so much more efficient for the analytics workload.
And so that turns into Apache Arrow, which is basically a way to store data in columns in memory, right?
And ways to encode it.
And Parquet is basically the same thing, but for disk.
I'm obviously skipping a bunch of the details, but between those two technologies, now you
have the in-memory processing component, you've got the storage component, and then if you're going to build one
of these database systems, you typically need a network component, right? Because it's either a
distributed system or you've got a client-server setup, so you need some way to send it over the network.
That's what Arrow Flight is. And then, often, right, you don't want to, well, I'll tell you, like, building
something that can run SQL or some language like SQL is a non-trivial endeavor.
But just like building a programming language is a non-trivial endeavor,
but you can still do it.
But basically, you can do it these days because you don't have to build the whole thing from the front end to the intermediate representation to all the optimizations to the code generator.
Because basically, LLVM is a technology for compilers that does all that.
I think that's why you end up with systems like Rust or Swift or Julia.
The reason they can be even made at all these days is because they don't have to go reinvent
all the lower level stuff.
We've got to the same point with SQL engines where you don't need to go reinvent the whole SQL
engine from the bottom up because it's basically all...
We understand how to do it.
The patterns are there. Arrow is basically what you want. Obviously there's things about Parquet
you could improve, but it's pretty good. And so with that type of building blocks, so I should
say Data Fusion is an Apache project for that query engine, right? So it will let you run
SQL queries and a whole bunch of other stuff,
which we can talk about at length,
but I don't want to take over the whole talk.
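For readers who want to see what these building blocks look like in practice, here's a small, hypothetical sketch of using Data Fusion as that reusable engine: register a Parquet file and run SQL against it. The file path, table name, and query are made up, recent crate versions are assumed, and exact API details can differ between Data Fusion releases.

    // Cargo.toml (assumed): datafusion = "<recent>", tokio = { version = "<recent>", features = ["rt-multi-thread", "macros"] }
    use datafusion::error::Result;
    use datafusion::prelude::{ParquetReadOptions, SessionContext};

    #[tokio::main]
    async fn main() -> Result<()> {
        // The part you get off the shelf: a vectorized, multi-threaded SQL
        // engine over Arrow in memory and Parquet on disk.
        let ctx = SessionContext::new();

        // Hypothetical file written by some ingest pipeline.
        ctx.register_parquet("cpu", "data/cpu.parquet", ParquetReadOptions::default())
            .await?;

        // Standard SQL, planned and executed by Data Fusion.
        let df = ctx
            .sql("SELECT host, avg(usage) AS avg_usage FROM cpu GROUP BY host ORDER BY avg_usage DESC")
            .await?;
        df.show().await?;
        Ok(())
    }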
Yeah, so would you say that data fusion
is the equivalent of LLVM, but for database systems?
Yes, that's how I like to think about it.
Okay, okay.
I think that's a great metaphor to use
to give a high-level view of what Data Fusion is trying to do.
So, all right, question now.
I understand the need of going from the pure time series workloads
to a more analytical capability, adding more analytical capabilities.
So it makes sense to go and look into a system that
is built primarily for analytical workloads, and Data Fusion, if I'm not wrong, like, when Andy
started building it, that's what he had in his mind, right? He didn't have time-series data in
his mind. Which means that somehow these things need to be bridged, right?
You still need to work with time-series data.
You can't abandon what you were doing.
So how do you take Data Fusion,
which is like a system for analytical workloads,
and you turn it into the core of a time series database
that does everything that a time series database does and has to do,
but at the same time also has, let's say, analytical capabilities there.
Yeah, so you're absolutely right that there's a bunch of stuff
that's time series specific that's not in Data Fusion, right?
So in fact, in some ways, it's a great story
because you get Data Fusion, not for free exactly, like, you spend a little bit of time, but, you know, Data
Fusion is there and you can work with a bunch of people to make it better. And then, like, the vast
majority of Influx's time, in terms of where engineers spend their time, is on time series
specific things, right? Like, so there's a lot of effort that went into the component that ingests
data really quickly, turns it from whatever this line protocol, CSV kind of stuff is into in-memory
formats, right? Someone's got to parse that quickly, got to stick it in, you've got to handle persistence,
like getting that into Parquet files quickly, and there's a system in the background to, like, merge
them together and whatnot. So there's a huge amount of engineering effort that went into that. And then the systems aspect
of knowing who has what data
and where it is
and all that sort of stuff.
So where Influx spent
a lot of its engineering time
is on those time series
specific features, right?
And then as long as
data fusion is fast enough,
it works.
And so it's kind of cool.
Like early versions of our InfluxDB 3.0 products
only did SQL because we just ran it through data fusion.
However, InfluxDB also has a specialized query language
called InfluxQL, which if you ever worked with time series,
you'll know that SQL is miserable
for a lot of different types of queries.
So InfluxQL is actually a much nicer query language.
So that's another example.
I guess it's very time-series-specific.
But actually what we did was, well, now 3.0 supports InfluxQL,
but we don't have our own whole query engine for InfluxQL.
There's just a front end, right,
that translates InfluxQL into the same intermediate representation
that Data Fusion uses, which are called logical plans.
And then the whole thing, then the rest of it
is shared with the SQL engine.
So
maybe those are some specific examples
of like, when we built the time series
database, it didn't just magically come out
of data fusion. We had to build the time series
pieces around it. But we don't have to
then go build the basic
vectorized analytic
engine, right?
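To illustrate the front-end-over-a-shared-engine idea, the sketch below uses Data Fusion's DataFrame API, which builds the same kind of logical plan that a SQL (or InfluxQL-like) front end hands to the engine. The table, columns, and exact API surface are assumptions for the example; this is not the real InfluxQL planner, and method signatures vary a bit across Data Fusion versions.

    use datafusion::error::Result;
    use datafusion::prelude::{col, lit, ParquetReadOptions, SessionContext};

    #[tokio::main]
    async fn main() -> Result<()> {
        let ctx = SessionContext::new();
        ctx.register_parquet("cpu", "data/cpu.parquet", ParquetReadOptions::default())
            .await?;

        // Imagine a front end parsed a query in some other language and lowered
        // it to this: a filter on a tag, a projection, and a sort by time.
        // That lowering is the language-specific part.
        let df = ctx
            .table("cpu")
            .await?
            .filter(col("host").eq(lit("server01")))?
            .select(vec![col("time"), col("usage")])?
            .sort(vec![col("time").sort(true, false)])?; // ascending, nulls last

        // Everything from here down (optimization, vectorized execution, Parquet
        // reading) is the shared engine; a SQL front end ends up in the same place.
        println!("{}", df.logical_plan().display_indent());
        df.show().await?;
        Ok(())
    }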
If you want to hear a list of the things you need: it needs to be fast, it needs to be multi-threaded,
you need to make sure it's streaming, you need to make sure that it knows how to aggregate stuff
really fast, you have to do it quickly with multiple columns and single columns and all
the different data types, and, oh, there's, you know, five different ways of representing intervals, and...
Yep, yep. Yeah, that makes total sense. I think that's, like, one of, in
my opinion, also the kind of traps that people get into when they are building, like, new
data systems: SQL is the standard, but at the same time it's not great for many things,
and when you get into specialized things, then you're like, okay, oh shit,
let's reinvent and build a new query language, right?
And the problem, and I think that's what is great
with what you are saying about data fusion,
is that now it's much easier for someone to be like,
hey, I can have the standard SQL that I support here,
which makes
it easy for anyone to go and work without having to do the heavy lifting of learning
a new language.
But at the same time, I can also add in my product new dialects or new syntactic sugar
in the existing SQL or whatever without having an overly complicated system there. And that's the equivalent of, let's say,
what the user interface is in a way for applications, the graphical user interface.
But in our world, the equivalent of that is the syntax that we provide to the user.
And I think a big problem in the past, actually, because of the monolithic architecture, right? Like adding something new there pretty much had to go through the whole
monolith. So that was a big no-go. And I think a lot of the value,
let's say, for the user, for the developer experience, comes from how these
systems work. And what Influx does is, I think, an excellent example of that.
Yeah, we have SQL, and we also have
the specialized syntax that we need to go
and easily work with time series data
and be more efficient and more productive,
which is amazing, right?
What, if you have an anecdote,
in a way, about Data Fusion
and its focus on analytical workloads, made your life
harder when you had to work with
time series data, if there was something, right?
The first major challenge is that the time series data
model, at least as shown by... I'm going to talk about
line protocol, which is the InfluxData
version of it. But basically every system I know about that does time series, especially for
metrics, basically has the same data model. They call things differently, but it's basically the
same. So you basically have some identifier that's a string, right? That identifies your
time series, and then you have a value and you have the timestamp. And so Data Fusion is very much
a
relational engine, right?
And so is Apache Arrow. Yes, they have structured
types, but at the end of the day, it's basically tables
of rows, right? And so
the very first challenge is you've got to map
the influx data model back
into
a relational model.
So we did, right?
But that definitely is non-trivial.
And it ends up like, you know,
there's nulls and new columns can appear
and dealing with backfills
and what the update semantics are
and stuff like that.
There's quite a lot of logic
that we had to work out
as opposed to, we just had Parquet files
on disk already and we just query them, right?
Like, it's a lot more complicated.
We actually have this other thing coming in,
and it took a lot of work to figure out
exactly how to map it into the relational model.
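As a concrete, simplified picture of that mapping: a line protocol point with tags, fields, and a timestamp becomes a row in an Arrow record batch, with tags as nullable string columns, fields as nullable value columns, and a dedicated time column; a point that lacks a tag or field simply gets a null. The schema and values below are hypothetical, and this is only a sketch of the general idea, not InfluxDB's actual mapping code.

    use std::sync::Arc;

    use arrow::array::{Float64Array, StringArray, TimestampNanosecondArray};
    use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
    use arrow::error::ArrowError;
    use arrow::record_batch::RecordBatch;

    fn main() -> Result<(), ArrowError> {
        // cpu,host=a,region=us usage=0.5 100
        // cpu,host=b usage=0.9,temp=41.0 200   <-- no region tag, new "temp" field
        let schema = Arc::new(Schema::new(vec![
            Field::new("host", DataType::Utf8, true),     // tag (nullable)
            Field::new("region", DataType::Utf8, true),   // tag (nullable)
            Field::new("usage", DataType::Float64, true), // field (nullable)
            Field::new("temp", DataType::Float64, true),  // field added by the 2nd point
            Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        ]));

        let batch = RecordBatch::try_new(
            schema,
            vec![
                Arc::new(StringArray::from(vec![Some("a"), Some("b")])),
                Arc::new(StringArray::from(vec![Some("us"), None])), // missing tag -> null
                Arc::new(Float64Array::from(vec![Some(0.5), Some(0.9)])),
                Arc::new(Float64Array::from(vec![None, Some(41.0)])), // field absent in row 1
                Arc::new(TimestampNanosecondArray::from(vec![100i64, 200])),
            ],
        )?;

        println!("{} rows, {} columns", batch.num_rows(), batch.num_columns());
        Ok(())
    }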
Yeah.
That's very interesting.
Probably a topic on its own for a whole episode
I'd like to talk about.
But let's switch gears a little bit
because I want to talk a little bit more about
DataFusion and the current state of the project
and
what's next.
So you had,
you engaged in using Data Fusion, and you're
at a pretty mature
point right now, I guess, with how things
look using Data Fusion.
It's also a project that has gained much
more traction. There are many people using it and many contributors giving back. What's next? Because,
okay, a system like LLVM takes a lot of hundreds or, I don't know, thousands of engineering years
to build and maintain. And that's probably true also for something like Data Fusion, right?
What's next there for the project?
What do you see being the part of the project that needs more love?
And where the opportunities are, in your opinion?
I'll talk about the technical challenges and opportunities in a second.
I just wanted to say, at a project level, where I see the
project first.
now we're at the point where there's early
adopters like Influx and there are a couple other early
adopter companies that built products on top of
Data Fusion. So InfluxDB is one of them,
obviously, but there's Greptime, there's a company called
Coralogix. And then
there's a whole other wave. So the early adopters
maybe have been doing this for four years.
Then there's people who maybe in the last year or two
have basically decided to build products on top of it.
And there's a whole new raft of them,
like Synnada and Arroyo.
And there's probably a whole bunch of other ones.
I can't, like C-File, I guess,
was actually one of the early adopters.
But there's a bunch of other sort of startups now
that are starting there because they want to build
some cool analytic thing, right? And they realize
building a whole new SQL engine is probably not
where they can afford to spend their time.
So that's sort of where the adoption
trend is. And I think we're just going to see more and more of that.
This is obviously people I know. It's an open source
project. Anyone can use this.
I'm sure people use it that I don't know about.
100%. And then
the project has been part of this Apache Arrow governance thing,
but we're in the final phases.
I expect it to be finalized in the next week or two.
It'll become its own Apache top-level project,
which for those of you who aren't deep down in the guts of Apache governance,
which you don't need to be.
But it's just basically a recognition that the community is now big enough
and self-sustaining enough that it needs its own place rather than being in Arrow. So, Arrow, it
was wonderful to be incubated there, but it will be its own project very soon. I think that'll help
drive its growth as well. So, yeah, now, technically, yes, I think from my perspective,
Data Fusion now has basically very competitive performance. We just wrote a paper
in SIGMOD, if anyone cares. And our results, if you look at it, querying Parquet files, we're
basically better than DuckDB in some cases, not as good in some cases, but basically approximately
equivalent. And so that's actually pretty good. I mean, the DuckDB guys are very smart. That's
like basically the state of the art in OLAP engines these days. So I feel very good about the core performance, and it has the basics. So, basically,
if you're going to build an analytic system today, you need, like, basic SQL, you need all the,
like, the places that are hard to do, which we've done in Data Fusion. Like, you need the
basics of SQL, you need the basics of all the timestamp functions, which, if you've never built
a database, you don't understand. It doesn't seem like it should be that big a deal, but actually
getting the time and time arithmetic right is just a major undertaking. But we've done that in
Data Fusion. Implementing window functions is also another one that takes a long time that
you might not even appreciate that those are a thing in SQL and so you have to implement them.
Data fusion has that.
And then you need a fast aggregation,
you need fast sorting, you need fast joins.
And then, not just fast,
you need to be able to handle those three major operations if your data size doesn't fit in memory.
So data fusion does all that,
except the one thing it doesn't do
is it doesn't do externalized joins yet.
So if you want to go build a system on top of Data Fusion that was going to compete with other,
like basically going off the complicated enterprise market,
like some enterprise data warehouse thing where they've got 300 joins in their queries,
I'm not sure Data Fusion will do a great job.
Actually, having spent like four years of my life on the Vertica optimizer,
I'm not really sure that any
database system can do a great job on a query
with 300 joins.
Yeah.
In terms of a sophisticated optimizer, it doesn't
have one of those. It has a fine
one, right? It does fine, but it's not going
to work great with hundreds of joins.
Yeah. It doesn't have externalizing joins.
Can you expand
a little bit more on that?
Because probably our audience might not know
what externalized joins means, first of all.
Yeah, so this, yes.
What this really means is can you do these operations
if they don't fit in memory?
And maybe I'll start with the simpler one
where like you just want to sort the data.
Let's do grouping, right?
Like let's say you're trying to like group
on all your different distinct user IDs or something
and you're just calculating some,
whoever clicked the most or something
or how much dollars,
but you have a huge number of individual user IDs.
The way a database will typically do a query like that
is you'll read the rows in,
you'll figure out the group key
and you have sort of like a hash table
that you're keeping track of the current values
for each one of the different users, right?
So as the data flows in,
you find the right place in the hash table,
update the aggregate, and go on.
That's called hash
aggregation. That's basically what
they all do.
The problem is that if you have a huge number of distinct users,
you might not be able to fit the hash table memory.
By the way, there's also all sorts of ways of
making that quick, which is actually
another whole fascinating discussion, but I'm just
talking about externalizing right now.
So you don't typically do it one at a time.
Anyway, I digress.
So the problem is if your hash table doesn't fit memory, what do you do, right?
Well, the first database system you write probably just crashes
and gets killed by Kubernetes,
because it just allocates more memory until the operating system kills it.
The second thing you do is you add a memory limit that at least tracks how big the hash table is.
When it gets too big, it errors the query, so that's better. And then the third, more
sophisticated thing to do is you actually take the hash table, dump its state to disk somehow,
write it to some persistent storage, do that a couple of times based on how much data is coming in,
and then... you've written a bunch of little files to disk, then you
read them back in, merge them, compute the final result, and generate the output.
So that process of taking the state out of main memory and putting it into some other secondary storage that you have more of would typically be called externalizing. Okay, yeah,
stuff like spilling data to the hard drive so you don't want to kill the process
because it runs out of memory.
So Data Fusion doesn't do that for joins.
It does it for sorts and it does it for grouping,
but it does not.
Doesn't do it for joins.
Okay.
So if everyone here is out there
and would like to contribute.
If you want to do it, you know,
that's a whole, yes.
I wouldn't recommend,
it's probably not an afternoon project,
but there's certainly other people who are interested.
But I mean,
I joke about this afternoon project,
but it's,
it still amazes me like working,
Data Fusion is the first time I've worked in an open source project.
You know,
I've used open source software
for a long time,
but like actually running these things,
we're running it.
Like,
I don't really know
why people show up
with contributions. Sometimes
it's clear, I'm working this in my job and I need this feature. That's probably what happened.
I'm pretty confident a bunch of people, they just find it interesting. It's one of the other
beautiful things about working on the internet. There's lots of people. There are a certain
number of people that find database internals interesting and they show up. But I'm pretty
sure the first version of the window functions was written by a guy who, like, I don't think it had anything
to do with his job. I think he just thought it was an interesting challenge, and, you know,
I think he's a manager at B&B or something, like, not at all obviously using Data Fusion. He just
basically pounded out the first version of the window functions. It's still amazing to me,
you know he did a great job.
Hopefully he had a good time. I think he had a good time.
I had a good time. We got a feature
out of it. It's a really
cool little vibe.
That's amazing.
All right. We are
close to the end here and I have to give
the microphone back to Eric.
I think we need multiple more
of these recordings, Andrew.
There are just so many things to talk
and it's such a joy
to listen to you.
It's a lot to learn.
So I'm looking forward to do that in the future
again. Eric,
all yours again.
Andrew, I'm interested to know,
we've gone into such depth on several subjects here,
which has been so wonderful.
Outside of the world of databases and time series
and the subjects that we've talked about,
in the broader landscape of data tooling,
what types of things are most exciting to you?
I think the broader story of
Parquet files on object store,
data lakes,
data tables,
like, whatever people eventually decide to call that,
I think that is an amazing opportunity because now,
you know,
it's not going to make things simpler.
What it's going to do is it's going to let you make all sorts of even more
specialized tools for whatever your particular workload is.
So I think that is very exciting.
And I,
you know,
I'm also super biased, right?
but I think the idea of data fusion is super exciting too, because
as Kostas was talking about before,
previously, if you had an idea because you
hated SQL and you had a better idea of how to do it,
not only do you have to be really
good at figuring out what the UX of that new tool
should be, you also didn't have the
ability
to go do all this
low-level database stuff that we were just talking about. Whereas, you know, some people have that,
like the guy behind Polars, Ritchie, actually, I think, has that, but, like, it's very rare. And so
I think having something like Data Fusion will let people, like, innovate in the query language space or,
you know, that space, without, you know... basically it lowers the cost,
so you can have a great product without having to invest all this time in, like, one of these analytic
engines. So I think that's a super exciting area as well, to see what people build with it.
Yeah, I agree. It is really neat to see, you know, sort of, architectures and technology emerge that encourage all sorts of creativity by,
you know, unbounding people from traditionally what were pretty severe limitations, except for
people like you said, where it's like, you know, pretty rare that you can manage all of these
different things to create something purely novel. I think that's the most positive way to look at
it. You know, another, more boring, economic way to look at it is that it's
just commoditizing these OLAP engines. That's another summary of basically the same thing,
but I think it undersells the value of it because it's not just, yes, it's making it cheaper, but
it's not just the cost comes down. What it really means is there's a whole bunch of things that
become feasible that weren't previously. I think that's where the realization will happen.
Yeah. Very cool. The removal of constraints.
Well, it's going to be certainly
an exciting couple of years. Andrew, this has been
an incredible conversation.
We've learned a ton. I think the audience
has learned a ton.
And yeah, we'd love to have you back on
sometime to cover all the things
that we didn't
get to cover in this episode.
Sounds great. I'd love to.
Thank you very much.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app
to get notified about new episodes every week.
We'd also love your feedback.
You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack,
the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.