Screaming in the Cloud - The Power of Time Series Databases with Paul Dix
Episode Date: October 23, 2019

About Paul Dix
Paul Dix is the creator of InfluxDB. He has helped build software for startups, large companies, and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force Space Command. He is the series editor for Addison Wesley's Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Links Referenced:
Twitter: @pauldix
LinkedIn: https://www.linkedin.com/in/pauldix/
Personal site: pauldix.net
Company site: www.influxdata.com
Transcript
Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn.
This weekly show features conversations with people doing interesting work in the world of cloud,
thoughtful commentary on the state of the technical world,
and ridiculous titles for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
This week's episode of Screaming in the Cloud is sponsored by InfluxData, makers of InfluxDB.
As a part of that sponsorship, they have generously provided one of their co-founders to have this conversation.
Paul Dix, welcome to the show.
Thanks, Corey. Glad to be here.
Thank you for taking the time to entertain my ridiculous nonsense. It's always appreciated. I guess where I want to start on this is, let's begin at the
very start of all of this. You are makers of the premier offering in the world of time series
databases. For those of us whose platonic ideal
of a database is Route 53,
what is a time series database and why might I need one?
Yeah, so, I mean, a time series database
is basically just a database that's optimized
for a specific kind of workload, which is time series data.
Now, the thing that makes time series data different from, say, reference data
that you keep in a relational database is that it's largely an append-only workload, right? You
have new data arriving all the time, you're not updating previous records, and when you query the
data you're frequently querying large ranges of it to compute summaries: what was the min value in these five-minute
increments for the last four hours, or something like that.
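For illustration, a query of exactly that shape, min values in five-minute windows over the last four hours, might look something like this in Flux, the language discussed later in the episode (the bucket and measurement names here are hypothetical):

```flux
from(bucket: "metrics")
  |> range(start: -4h)                           // the last four hours
  |> filter(fn: (r) => r._measurement == "cpu")  // one measurement of interest
  |> aggregateWindow(every: 5m, fn: min)         // min per five-minute window
```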
So you can certainly use other types of databases to store time series data.
You can use relational databases or other NoSQL databases.
But for time series data specifically, there are optimizations that
you can make to deal with the very high rate of ingest, the query workloads, which are very,
very different. And a few other things. One is that it's very common in time series to have
high-precision data that you keep around for a limited window of time. Like, say, I'm going to keep all my raw data for seven days. And then you want to summarize it or downsample it and
keep those summarizations around for longer periods of time, like three months or a year,
whatever. And a good time series database will basically handle that data
management lifecycle for you automatically, evicting
the high-precision data and downsampling the other data.
So the eviction, I think, is actually really interesting from a database perspective when
you think about relational databases.
So in the naive case, if you're going to evict your time series data and you say you want
to keep it around for just a day, the naive way of doing this is:
every time I do a write, I have to delete the oldest data point, right? Because if
I'm ingesting at a fixed rate, then I know for every write that goes in
there's a delete happening. And regular databases aren't designed for this
workload; they actually assume that you
want to keep most of your data around for all time. So deletes actually are expensive. So time
series databases optimize for things like that: the ingest, the eviction of high-precision data,
the downsampling, and the summarizations that you might want to do in real time.
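As a rough sketch of what that lifecycle automation can look like in InfluxDB 2.0, here is a hypothetical Flux downsampling task: raw data lands in a short-retention bucket, and a periodic task writes five-minute summaries into a long-retention bucket (all bucket names and retention periods here are invented for illustration):

```flux
// Runs hourly; summarizes the last hour of raw data into 5-minute means.
option task = {name: "downsample-cpu", every: 1h}

from(bucket: "raw_7d")                           // raw data, kept ~7 days
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> aggregateWindow(every: 5m, fn: mean)
  |> to(bucket: "summaries_1y")                  // summaries, kept ~1 year
```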
So I played a little bit with things like this once upon a time in my first life as a network admin.
We ran Cacti to manage a lot of these things.
If you've never used Cacti, the primary purpose of that software is to sit on your network,
be written in PHP, and be exploited and used as an attack platform for the rest of your network.
It displayed graphs using RRDtool or RRD-based tools that
came out of MRTG and a bunch of other similar products. And it's exactly what you described.
As you look back further in time, the data gets less and less granular under the baseline
assumption that you won't need to have that level of insight and visibility into things
that happened a month ago as you might yesterday. Is that aligned with the same principles?
Yeah, that's similar for the most part. There is an important distinction
with RRD that I like to make, which is when I think of time series data, I think of two types
of data. There's what's called a regular time series, which is samples taken at fixed intervals of time,
like once every 10 seconds or once a minute or once an hour. And then there are
irregular time series, which are basically event streams. That could be
individual requests to an API and their response times, that could be a container
spinning up or shutting down, any sort of an exception in an application, any sort of event.
Now RRD, and I guess its kind-of spiritual successor, which would probably be Graphite,
those are based around storing regular time series data. So regular time series data is basically a
summarization of some underlying distribution or raw event stream or whatever.
So for example, if I store the response time for every request made into my API,
that's an event stream. That's an irregular time series. But I can query that time series and say,
give me the 95th percentile in 10-minute intervals for the last eight hours.
And what you've done there is you've created a regular time series
from the underlying irregular event stream. So essentially, when you think
about putting data into RRD, you're summarizing your data before it ever
goes into the database. And with Influx, what we wanted to do
from the very early stage was to be able
to store the raw event stream
as well as the summarizations.
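As a concrete illustration, the query Paul describes, turning an irregular event stream into a regular series, might look something like this in Flux (the bucket, measurement, and field names are hypothetical):

```flux
from(bucket: "api")
  |> range(start: -8h)                      // the last eight hours
  |> filter(fn: (r) =>
      r._measurement == "http_request" and r._field == "response_time")
  // one 95th-percentile value per 10-minute window
  |> aggregateWindow(
      every: 10m,
      fn: (column, tables=<-) => tables |> quantile(q: 0.95, column: column),
  )
```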
If we go back to when you said you made your first commit
to what became InfluxDB back in 2013,
I didn't notice it at the time, and if I had,
I would have made blistering fun of it.
It's, oh, you're building your own database engine. I know you. You're that guy from Hacker News, come to life. And it sounds
like something that only someone who's deranged would do, except for the fact that it worked.
You've built a successful company. You are a name brand in the time series space, and you were
clearly right about this. I mean, now we have other entrants to the space,
which we'll get to in a little bit. But at the time, did it feel like you were
potentially making a catastrophic mistake?
It really didn't. That's based on my experience from a few other times. So in 2010, I was working for a fintech startup,
and we had to essentially build a quote-unquote time series database or time series solution
for tracking real-time pricing for like 200,000 different financial instruments.
And the solution I built was basically Scala Web Services
using Cassandra as the underlying long-term data store and Redis as a real-time indexing layer.
And then later on, when I built the first product that we built actually, which was a
SaaS platform for real-time metrics and monitoring in the server monitoring space,
I had to build the time series solution again. And this was basically two completely different
problem domains, but the solution was the exact same. In that case, for the first version of this,
I did Scala Web Services on top of Cassandra with Redis. And then I built a next version of that API, which I used Go for.
And this was in, I guess, late 2012 is when I started development on that. So I used Go
and I used LevelDB, which is an open source library that Google had at the time. And I
built this whole thing. So I essentially built a database for that API. And the thing I realized through the process of doing this was that for solving this time
series use case problem, you could use a general purpose database.
You could use Cassandra or you could use MySQL or Postgres or whatever.
But the thing is I had to write this mountain of application code, of web services code
in Scala to make the whole thing work. And my feeling in 2013 was
there's nobody focused on this exclusively. Graphite at that time was largely an abandoned
project. Everybody who was using it was complaining about the fact that it wouldn't scale.
So I thought, okay, nobody's focused on this, but here I am. I've had to solve this problem multiple times in the past few years.
I saw people at large companies trying to solve the problem themselves, and I saw the
monitoring companies doing the same thing.
So I basically thought, here's a need that isn't being served, so I might as well go
do it.
And I think the other important thing is
to think of the timing.
So in 2013, obviously, NoSQL was a big thing,
and you had the different players in the space,
and it wasn't obvious how that was gonna shake out,
although MongoDB was obviously already very popular.
But it wasn't like now. I feel like over the last
two or three years, there's literally
a new time series database or a new database of some kind every other week
that's on the front page of Hacker News.
So I feel like there was a little bit less new database fatigue in 2013
than there is now.
Yeah.
It's one of those areas where there are so many different database options
that, to rip off the ancient JWZ quote, it's, oh, I have a problem.
I'll use regular expressions.
Now I have two problems.
It's, oh, I'll just write my own database.
And invariably in 2019, it feels like that's exactly the wrong direction to go in.
But counterpoint, you folks have recently released Influx 2.
So what's the story behind that?
I mean, I guess, so first off, I should probably say that, you know, I'm a firm believer in what
I call polyglot persistence, which is you use the right tool for the job and not every single
persistence need is the same as others, right? For some, you absolutely will need a relational transactional database, and for other things,
you won't need that. And all of this would be kind of a moot point if relational transactional
databases were infinitely scalable and infinitely performant; then we would just use those. But
that's not the case. You make trade-offs and optimizations based on your needs and the
specific use case. So initially, you know,
with InfluxDB 1.x, that's what that was about. But we started with the database and then we saw that
there were other needs that people had in this time series use case, right? And for time series,
like what I realized is it's an abstraction that works well for solving problems in multiple
domains, right? Server monitoring, I mentioned,
financial market data is one, real-time analytics, but also sensor data of all kinds, be it industrial, oil and gas, wearables, consumer tech, all that kind of stuff. So people had other needs
to solve these problems and to build applications on top of this time series abstraction.
They had to collect it, they had to store it and query it,
they had to process it for either doing ETL for enrichment or for doing monitoring and alerting.
And then finally they had to summarize the data for human consumption either through visualization or reporting or other kinds of things.
So as I built the company,
I raised capital and hired developers to build these other pieces. And we learned a lot over that period of time,
over the last six years. And the thing I realized is what I wanted is a platform that was
easy for developers to use that kind of encapsulated all of that. Right now, in 1.x, we have four separate products.
With 2.0, what we've tried to do is combine them into like one cohesive whole
where there's a single API that is consistent, that's easy to use.
There's a Swagger definition for it.
And there's a user interface on top of it.
And then the last bit with 2.0 is, you know,
with InfluxDB 1, we had a query language that looks very much like SQL. And that's because I
thought it would be easier for people to pick up. And it certainly was. But what we found is there
were more complex like analytics and processing tasks that people wanted to accomplish that they wanted to push
down into the database.
And because they couldn't do that, a common pattern emerged where people would write code
in whatever language they chose, like Python or Ruby or, apparently your favorite, PHP.
And then they would query data out of the database and do some post-processing and then
write data back into the database
so that they could get it back into the tool chain for monitoring it, for visualizing it, and all these other things.
So when we created 2.0, we decided, let's create a new language called Flux, which is not just a query language,
but it's also a query planner, a query optimizer,
and it's a scripting language. So you can push down this kind of complex processing
into the data platform. And as a language, we want it to be Turing complete and generally useful,
but we also want it to be able to pull in data from sources outside of InfluxDB. As I mentioned, I'm a firm believer
in polyglot persistence. And what that means is Influx is great for time series data, but it's
not good for reference data. So we want to be able to pull in data from Postgres or MySQL,
or from any sort of third party API that you want to pull data in from. You could hit GitHub for data that you could mix and match with your time series data.
So basically 2.0 is the realization of collapsing those four components into one cohesive whole
and creating a language that allows people to define really complex analytics and processing
that the platform will just
do for them.
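A hedged sketch of that mix-and-match idea, using Flux's sql package to pull reference data out of Postgres and join it against time series data (the connection string, table, tag, and bucket names are all hypothetical):

```flux
import "sql"

// Reference data from a relational database
sensors = sql.from(
    driverName: "postgres",
    dataSourceName: "postgresql://user:password@localhost/sensordb",
    query: "SELECT sensor_id, location FROM sensors",
)

// Time series data from InfluxDB
readings = from(bucket: "telemetry")
    |> range(start: -1h)
    |> filter(fn: (r) => r._measurement == "temperature")

// Enrich each reading with its sensor's location
join(tables: {reading: readings, ref: sensors}, on: ["sensor_id"])
```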
It seems almost like you're going through Hacker News to some extent and picking all
the terrible ideas at once.
You just mentioned, for example, that you built a new query language called Flux.
Two issues.
One, writing your own language is always one of those things that's fraught with peril,
but based upon what you've demonstrated, I will absolutely extend credence to that, that you're probably doing the
right thing. But I will say that from my experience working in tech for entirely too long, which is
where I guess this bitter cynicism all has root, I find that whenever I have to learn a specific
language to use a particular tool, it means tears before bedtime. And I want
to wind up calling out a bunch of different companies that have done this, but it's unfair
because it seems that every time I've dealt with a specific DSL, you have these problems. I'm still
going to maintain that Kubernetes wrote their own custom DSL called YAML, which is so historically
incorrect. I don't even know where to begin, but that's why I like saying those controversial things. What, I guess, made you decide to do this
in the face of historically terrible experiences with these?
Yeah, by the way, I don't think Kubernetes was original in creating a YAML DSL.
I think they're just copying Spring and Struts, who created a DSL in...
Once upon a time, we wound up adding Jinja to YAML and that would effectively
turn into SaltStack for its configuration language. Again, we're all code terrorists in our own way.
Yeah, yeah. No, I mean, it's absolutely fair. I agree. Generally speaking, why would you create
your own language? There are countless other languages out there. And so, you know, so basically, like,
one option is we just go with SQL, right? Well, one, creating an actual standards-compliant SQL
is really, really hard. It's a lot of work. Two, SQL is not Turing complete. It's actually not
a programming language. It's a declarative query language. Now, Microsoft's version of SQL has
extensions that make it Turing complete. Oracle's version of SQL has the same.
But then again, you're not using standards compliant SQL. And really, even when you get
down to it, every single major database has differences between what their versions of SQL
are. So there's a standard, which is
the lowest common denominator. And then when you get into more powerful query functionality, you
end up getting into the specific database's implementation. And as you mentioned, like,
there's so many tools that have their own languages, basically, like, I think any analytics
tool in existence, whether it's log analytics, user analytics, business intelligence, marketing analytics, they all have their own custom query languages. And that certainly
hasn't stopped them from becoming popular. But let me speak to one thing about our specific journey
to Flux that I think is relevant, which is in 2013, I created InfluxDB with this language, this query language that
looked like SQL, but it wasn't actually SQL. It was different in ways that are actually
frustrating if you're a SQL expert and you try to use it. But at the time, like, tons of people
picked it up because, you know, most people actually aren't writing SQL day to day, they're using their ORM.
I personally had a viewpoint probably in the fall of 2014 that the SQL style of writing
queries was maybe not the best way to work with time series data, which I basically viewed
as just like ordered streams of data coming through.
And I thought a functional style language would actually be the better way to
represent the query style. Now, I was too afraid to make that change at the time. But when we
introduced our processing agent, Kapacitor, which is there for background ETL and monitoring,
alerting, and real-time processing. When we introduced that in September of 2015,
it had a language that was more functional
in nature.
So we, again, made the foolish mistake of creating a language.
And we actually made not just that mistake, but the other mistake, which is we created
a platform that now had two separate languages that were custom.
One for interactive querying and one for background processing. And the language itself, called TICKscript, actually looks like nothing else that you've probably ever seen.
It's very, very strange.
But over the last, was it three and a half, four years, a surprising number of people have actually adopted it.
And a surprising number of people have written very complex TICKscripts despite very serious gaps
in the functionality that it should provide as a language, and some gaps in what I call
developer ergonomics, which is the experience of actually writing code in it and developing and
testing things. But it has this fan base that uses it, and they get a lot of value
out of it. So I thought, well, there must be something
there, because if those people are willing to suffer the pain of using this thing that I can
see all sorts of horrible warts on, there's something worth putting more effort into. And
when we went to create Flux, we wanted something that could be used for background processing as well as interactive querying.
And the choices then were, you know, we knew we couldn't use SQL for the reasons I mentioned.
So at this point, it's either do we use an embedded language like Lua or do we create
our own?
Now, Lua, obviously, like, there are very mature implementations of it.
It would have been way easier to just use that.
My problem is, I don't think Lua has enough popularity or that people are familiar
enough with it.
I think the learning curve is too high for people to adopt Lua.
Like it's just not that easy for regular developers to use.
So the other thing we wanted was to be able to control the tooling
around the language. Ultimately, we want to create an experience that has a
UI in it that allows you to create these Flux scripts without actually writing Flux code,
right? So point-and-click interfaces that describe data flows of the different time
series data you're collecting, that output monitoring and alerting rules
or all sorts of other things.
So we wanted to be able to control the language.
And then the next thing I did was I thought,
okay, if we're gonna do a new language,
it has to be easy to use.
It has to be easy to pick up.
So we intentionally made it look like JavaScript,
which plenty of people hate on JavaScript, but the fact is it's probably the most widely used programming language in the world.
Even people who don't write JavaScript day to day are usually pretty familiar with it. You can look
at the code and kind of understand what's going on. And we said, like, the truth is, like, the learning curve in this thing is going to be the API.
And the API learning curve would be there regardless of if we had written, you know, if we had used Ruby as the starting point, Lua, all of those other things.
Like, the API is the biggest surface area.
The surface area of the actual syntax of the language, that can be covered in 15 minutes by reading a getting started guide that shows you the basic pieces of it.
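To give a feel for that 15-minute surface area, a few lines of Flux showing the deliberately JavaScript-ish syntax: variables, arrow functions, and named arguments (the names here are made up):

```flux
bucket = "telemetry"                    // plain variable binding
lastHour = (b) => from(bucket: b)       // arrow-function definition
    |> range(start: -1h)

lastHour(b: bucket)                     // calls use named arguments
    |> filter(fn: (r) => r._measurement == "cpu")
    |> mean()
```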
So that's the bet we're making.
Obviously, we just launched 2.0 as a cloud product.
The open source build is still in alpha right now.
We've just released a new alpha release.
So it's really too early to tell what's going to happen. But the joke I like to make is,
I'll either be spectacularly wrong or spectacularly right, but there probably won't be a middle ground.
No, it's fair. And I think we're going to see one way or another. It's an interesting space,
and we're seeing a lot of emergence coming out of it, which I guess gets me to one issue that has been a recurring theme on this show,
has been the idea that multi-cloud is generally not a terrific direction to go in, pick a vendor, and go all in.
The challenge is that you're already going to be locked into whatever it is you choose, unless you're spending an awful lot of time working around that to no real benefit.
But understand where that lock-in comes from. From that perspective, if something gets built
on top of Influx, is that fundamentally locked in from a data model perspective? Is there lock-in
being driven from a once you start paying, you never, ever, ever get to leave in an Oracle-esque
model? How does that story play out as far as adoption and implementation go?
Yeah, so we have the open source InfluxDB, which is basically a single server.
You can use that, obviously. It's MIT licensed with no restrictions, so you can do whatever you want with that. If you want to make your own new version of Influx and fork it, go for it.
That's up to you. Our cloud product is basically the
exact same API, the exact same user interface. We don't yet have bulk export and bulk import of data,
but our goal is to have seamless data transitions between open source nodes at the edge and our cloud
product running in whichever provider you choose.
Right now we're just in AWS, but soon we'll be in GCP and Azure.
And ultimately, we don't want to hold your data hostage.
The data model of InfluxDB is simple enough that you can represent it pretty
trivially in any relational database, and you can also
represent it in Cassandra or HBase or whatever.
But basically, that open source component is ideally the thing that gives you the feeling
that you don't have lock-in.
But I agree with you in the sense that once you've invested a certain amount of time and
effort into a piece of infrastructure, and particularly into a provider that's hosting
your data for you, there's kind of, you know, there's lock-in just by virtue of the fact that
you don't want to spend the time and money to move off of it. And the other thing about data
is that it has gravity. It's not free to move from one place to another. So, and particularly
in our use case, we're talking about large amounts of data. So that becomes a thing that you actually have to pay attention to.
So ultimately, people want to feel like there's no lock-in.
But if it comes time to, say, switch cloud providers, are you really going to halt all feature development for six months while you do this lift and shift over to another cloud provider
that provides zero customer value, right?
Like the main thing you need is the threat
of moving to another cloud provider
to give you pricing leverage.
And there are mixed reports
as to how well that actually works,
but it does raise an interesting question.
One of the easiest jobs in the world
has got to be running product strategy at AWS because you're just a post-it note that says yes on it.
There's really no thing that I would put past them building at this point in time.
And to that end, they have announced their own time series offering called Amazon Timestream, which sounds almost like it can manipulate time itself, which it probably should, because it was announced at re:Invent last year and we're about to hit re:Invent this year, and it still hasn't been released. So it's like Influx without those
whole pesky customers. So I don't know what the story there is, but more interesting to this
conversation and germane to what you're doing and what you're building, what is it like when
Amazon enters your market, when they come to crush you, for lack of a better term?
Yeah, so when that got announced
last year, that wasn't actually entirely unexpected. I just knew it was a matter of time, just not when.
You know, it's obviously concerning; that's always the question, like, what if so-and-so
comes to build your product? I mean, the things I take comfort in are basically that Amazon isn't
guaranteed to win every single market it enters.
And it's not necessarily a winner-take-all situation for every single product.
So, for example, Amazon entered Elasticsearch.
They have Elasticsearch hosting.
And by all accounts, they make far more money at it than Elastic does.
But last I checked, Elastic's market cap was pretty big. And they're doing pretty well, despite the fact that Amazon has come for them and is trying to crush them,
right? And has forked their distribution, even though they don't call it a fork.
So I realized that there are things we can still do to try and deliver customer value that's outside
of what Amazon does, right? Like we're not gonna be able to buy server time
and memory and network bandwidth cheaper than they can,
but hopefully we can provide a developer experience
that's better,
we can provide a user experience that's better.
So, you know, when I think about Timestream,
which is their, you know,
their soon to come time series database in the cloud,
that's just one component of
what Influx 2.0 does.
Query and storage is just one piece.
If you were going to try to cobble together what Influx 2 does by yourself, you'd have
to take Kinesis paired with Lambda, paired with Timestream, paired with S3, paired with
some sort of visualization engine.
I forget what Amazon's is, but they certainly have one.
Oh, QuickSight.
And that also has no customers because it's like Tableau, but crappy.
And it turns out that's not the most compelling marketing.
Yeah.
So that's the other thing that I've heard you say and I've heard plenty of other people
say, which is that Amazon competes very fiercely on basically infrastructure and costs and scale and stuff like that.
But they, for one reason or another, just don't see fit to build user experiences that are compelling and UIs that are compelling.
So that's one thing that we continue to invest heavily in: the UI and the API and how those things work together.
Yeah, it seems that in a number of the higher-level differentiated services,
the user experience lacks a certain polish.
And I think that's something that only comes with time, for starters.
But it also seems that it's not a high priority.
And when you're dealing with a tool like this,
where you're going to be spending not inconsiderable amounts of time gazing into it,
that experience should be reasonable and polished. And the idea that someone should be able to go from,
I've never heard of this thing before, to using it effectively should be measured in hours,
not weeks. And I think that that's a lesson that sometimes gets lost.
Yeah. I mean, our goal is to measure it in minutes and hopefully seconds.
Exactly. Pictures are worth a thousand words, as they say, and graphs, on the other hand,
let you figure out exactly how many words
each picture is worth if you wind up
getting your axes and calibration done appropriately.
So what's coming next, if you have anything to share,
as far as what Influx is doing,
what's interesting that people should keep an eye out for?
What does the future hold?
So we recently launched basically monitoring
and alerting features inside our Cloud 2.0 product.
And that basically turns, you know,
Influx 2 into a full monitoring and alerting platform
in addition to a time series database
and an agent that can collect data.
Within Flux the language,
what I'm most excited about
is basically packages, right?
The ability for users of the system
to create their own packages
and share them with other people.
And those packages could be bits of Flux source code.
So they're shared like you would on npm
or RubyGems or crates.io.
And then the other piece is, you know, packages that allow people to share
essentially entire application experiences, which could be dashboards that you see,
it could be drill downs that you can do within your time series data. And all of that is kind
of scoped to the structure and schema of what data looks like inside of InfluxDB. So that packaging thing is something I'm excited about.
And then the last bit is,
obviously like InfluxDB,
all the core components are open source
and we really need to drive towards getting
open source InfluxDB 2.0 into beta.
And for that, what we need is,
basically, the GA of 1.0 of Flux, the language, and we need the compatibility
layer so that users of InfluxDB 1.x can point at InfluxDB 2.0 and work with it as though
it's a 1.x server.
We need the migration tooling.
And then after we're in the beta, it's all about performance, testing, robustness, and
getting to the point where we can get open source InfluxDB 2.0 into general release.
Got it.
Well, it sounds like there's going to be some interesting stuff coming up, and I'm very
curious to see how that winds up manifesting in the marketplace and seeing it in increasing
numbers of environments.
I want to thank you for taking the time to speak with me today.
If people want to learn more about Influx, about you, your sage thoughts on things that
people should and absolutely should not do, where can they find you?
So Influx, you can find it at influxdata.com or on Twitter as @InfluxDB.
And I can be found on Twitter as @pauldix.
Thanks so much for taking the time to speak with us today.
I appreciate it.
Paul Dix, founder of Influx Data, makers of InfluxDB.
I'm Corey Quinn.
This is Screaming in the Cloud.
This has been this week's episode of Screaming in the Cloud.
You can also find more Corey at screaminginthecloud.com or wherever fine snark is sold.