Screaming in the Cloud - The Power of Time Series Databases with Paul Dix

Episode Date: October 23, 2019

About Paul Dix
Paul Dix is the creator of InfluxDB. He has helped build software for startups, large companies and organizations like Microsoft, Google, McAfee, Thomson Reuters, and Air Force ...Space Command. He is the series editor for Addison Wesley’s Data & Analytics book and video series. In 2010 Paul wrote the book Service-Oriented Design with Ruby and Rails for Addison Wesley. In 2009 he started the NYC Machine Learning Meetup, which now has over 7,000 members. Paul holds a degree in computer science from Columbia University.

Links Referenced:
Twitter Username: @pauldix
LinkedIn URL: https://www.linkedin.com/in/pauldix/
Personal site: pauldix.net
Company site: www.influxdata.com

Transcript
Starting point is 00:00:00 Hello and welcome to Screaming in the Cloud with your host, cloud economist Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn. This week's episode of Screaming in the Cloud is sponsored by InfluxData, makers of InfluxDB.
Starting point is 00:00:37 As a part of that sponsorship, they have generously provided one of their co-founders to have this conversation. Paul Dix, welcome to the show. Thanks, Corey. Glad to be here. Thank you for taking the time to entertain my ridiculous nonsense. It's always appreciated. I guess where I want to start on this is, let's begin at the very start of all of this. You are makers of the premier offering in the world of time series databases. For those of us whose platonic ideal of a database is Route 53, what is a time series database and why might I need one?
Starting point is 00:01:12 Yeah, so, I mean, a time series database is basically just a database that's optimized for a specific kind of workload, which is time series data. Now, the thing that makes time series data different than say reference data that you keep in a relational database is that it's largely an append-only workload, right? You have new data arriving all the time, you're not updating previous records, and when you query the data you're frequently querying large ranges of data to compute summaries of what was the min value in these five minute increments for the last four hours or something like that.
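The range-and-summarize query described here ("the min value in these five minute increments for the last four hours") can be illustrated with a minimal Python sketch. It is not how InfluxDB implements it; the function name `windowed_min` and the sample data are assumptions made up for this illustration.

```python
from collections import defaultdict

def windowed_min(points, start, end, window):
    """Return {window_start: min value} for points with start <= t < end.

    points: iterable of (timestamp_seconds, value) pairs, in any order.
    window: bucket width in seconds.
    """
    buckets = defaultdict(list)
    for t, value in points:
        if start <= t < end:
            # Align each timestamp to the start of its window.
            bucket_start = start + ((t - start) // window) * window
            buckets[bucket_start].append(value)
    return {b: min(vs) for b, vs in sorted(buckets.items())}

# Hypothetical samples: one reading every 60 seconds for four hours.
now = 4 * 3600
samples = [(t, (t // 60) % 7 + 1) for t in range(0, now, 60)]

# "Min value in five minute increments for the last four hours."
result = windowed_min(samples, start=0, end=now, window=300)
```

The query touches every point in the range but returns only one value per window, which is the access pattern a time series store is tuned for.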
Starting point is 00:01:52 So you can certainly use other types of databases to store time series data. You can use relational databases or other NoSQL databases. But for time series data specifically, there are optimizations that you can make to deal with the very high rate of ingest, the query workloads, which are very, very different. And a few other things. One, which is it's very common in time series to have your high precision data that you keep around for a limited window of time. Like say, I'm going to keep all my raw data for seven days. And then you want to summarize it or downsample it and say, keep those summarizations around for longer periods of time, like three months or a year, whatever. And a good time series solution database will basically handle that data
Starting point is 00:02:43 management lifecycle for you automatically, evicting the high precision data, downsampling the other data. So the eviction, I think, is actually really interesting from a database perspective when you think about relational databases. So in the naive case, if you're going to evict your time series data and you say you want to keep it around for just a day, the naive way of doing this is that every time I do a write, I have to delete the oldest data point, right? Because if I'm ingesting at a fixed rate, then I know for every write that goes in
Starting point is 00:03:19 there's a delete happening. And regular databases aren't designed for this workload. They actually assume that you want to keep most of your data around for all time. So deletes actually are expensive. So time series databases do things like that. They optimize the ingest, the eviction of high precision data, the downsampling, and the summarizations that you might want to do in real time. So I played a little bit with things like this once upon a time in my first life as a network admin. We ran Cacti to manage a lot of these things. If you've never used Cacti, the primary purpose of that software is to sit on your network,
Starting point is 00:03:56 be written in PHP, and be exploited and used as an attack platform for the rest of your network. It displayed graphs using RRD tool or RRD-based tools that came out of MRTG and a bunch of other similar products. And it's exactly what you described. As you look back further in time, the data gets less and less granular under the baseline assumption that you won't need to have that level of insight and visibility into things that happened a month ago as you might yesterday. Is that aligned with the same principles? Yeah, that's similar for the most part. There is an important distinction with RRD that I like to make, which is when I think of time series data, I think of two types
Starting point is 00:04:38 of data. There's what's called a regular time series, which is samples taken at fixed intervals of time, like once every 10 seconds or once a minute or once an hour. And then there are irregular time series, which are basically event streams. That could be individual requests to an API and their response times, that could be a container spinning up or shutting down, any sort of an exception in an application, any sort of event. Now RRD, and I guess its kind of spiritual successor, which would probably be Graphite, those are based around storing regular time series data. So regular time series data is basically a summarization of some underlying distribution or raw event stream or whatever.
Starting point is 00:05:27 So for example, if I store the response time for every request made into my API, that's an event stream. That's an irregular time series. But I can query that time series and say, give me the 95th percentile in 10 minute intervals for the last eight hours of time. And what you've done there is you've created a regular time series from the underlying irregular event stream. So essentially, when you think about putting data into RRD, you're summarizing your data before it ever goes into the database. And with Influx, what we wanted to do from the very early stage was we wanted to be able
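The ideas discussed so far (raw irregular events, regular summaries derived from them, and a retention window for the raw data) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not InfluxDB's implementation: the function names, the nearest-rank percentile method, and the sample response times are all invented here.

```python
import math
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def downsample(events, interval, pct):
    """Turn irregular (timestamp, value) events into a regular series:
    one percentile value per fixed interval that contains any events."""
    buckets = defaultdict(list)
    for t, value in events:
        buckets[(t // interval) * interval].append(value)
    return [(start, percentile(vals, pct)) for start, vals in sorted(buckets.items())]

def evict(events, now, retention):
    """Keep only raw events newer than the retention window."""
    return [(t, v) for t, v in events if t >= now - retention]

# Hypothetical API response times: (seconds since start, milliseconds).
events = [(3, 120), (41, 95), (200, 310), (610, 80), (615, 90), (1190, 400)]

regular = downsample(events, interval=600, pct=95)  # one point per 10 minutes
recent = evict(events, now=1200, retention=600)     # raw data kept for 10 minutes
```

The `evict` function here simply filters a list. A real engine would presumably avoid that per-point work, since per-write deletes are exactly the cost described above.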
Starting point is 00:06:08 to store the raw event stream as well as the summarizations. If we go back to when you said you made your first commit to what became InfluxDB back in 2013, if I didn't notice it at the time, and if I had, I would have made blistering fun of it. It's, oh, you're building your own database engine. I know you. You're that guy from Hacker News. Come to life. And it sounds like something that only someone who's deranged would do it, except for the fact that it worked.
Starting point is 00:06:36 You've built a successful company. You are a name brand in the time series space, and you were clearly right about this. I mean, now we have other entrants to the space, which we'll get to in a little bit. But at the time, did it feel like you were potentially making a catastrophic mistake? It really didn't. That's based on my experience from a few other times. So in 2010, I was working for a fintech startup, and we had to essentially build a quote-unquote time series database or time series solution for tracking real-time pricing for like 200,000 different financial instruments. And the solution I built was basically Scala Web Services
Starting point is 00:07:25 using Cassandra as the underlying long-term data store and Redis as a real-time indexing layer. And then later on, when I built the first product that we built actually, which was a SaaS platform for real-time metrics and monitoring in the server monitoring space, I had to build the time series solution again. And this was basically two completely different problem domains, but the solution was the exact same. In that case, for the first version of this, I did Scala Web Services on top of Cassandra with Redis. And then I built a next version of that API, which I used Go for. And this was in, I guess, late 2012 is when I started development on that. So I used Go and I used LevelDB, which is an open source library that Google had at the time. And I
Starting point is 00:08:18 built this whole thing. So I essentially built a database for that API. And the thing I realized through the process of doing this was that for solving this time series use case problem, you could use a general purpose database. You could use Cassandra or you could use MySQL or Postgres or whatever. But the thing is I had to write this mountain of application code, of web services code in Scala to make the whole thing work. And my feeling in 2013 was there's nobody focused on this exclusively. Graphite at that time was largely an abandoned project. Everybody who was using it was complaining about the fact that it wouldn't scale. So I thought, okay, nobody's focused on this, but here I am. I've had to solve this problem multiple times in the past few years.
Starting point is 00:09:08 I saw people at large companies trying to solve the problem themselves, and I saw the monitoring companies doing the same thing. So I basically thought, here's a need that isn't being served, so I might as well go do it. And I think the other important thing is to think of the timing. So in 2013, obviously, NoSQL was a big thing, and you had the different players in the space,
Starting point is 00:09:33 and it wasn't obvious how that was gonna shake out, although MongoDB was obviously already very popular. But it wasn't like now. I feel like over the last two or three years, there's literally a new time series database or a new database of some kind every other week that's on the front page of Hacker News. So I feel like there was a little bit less new database fatigue in 2013 than there is now.
Starting point is 00:09:59 Yeah. It's one of those areas where there are so many different database options that, to rip off the ancient JWZ quote, it's, oh, I have a problem. I'll use regular expressions. Now I have two problems. It's, oh, I'll just write my own database. And invariably in 2019, it feels like that's exactly the wrong direction to go in.
Starting point is 00:10:19 But counterpoint, you folks have recently released Influx 2. So what's the story behind that? I mean, I guess, so first off, I should probably say that, you know, I'm a firm believer in what I call polyglot persistence, which is you use the right tool for the job and not every single persistence need is the same as others, right? For some, you absolutely will need a relational transactional database, and for other things, you won't need that. And all of this would be kind of like a moot point if relational transactional databases were infinitely scalable and infinitely performant, then we would just use those. But that's not the case. You make trade-offs and optimizations based on your needs and the
Starting point is 00:11:01 specific use case. So initially, you know, with InfluxDB 1.x, that's what that was about. But we started with the database and then we saw that there were other needs that people had in this time series use case, right? And for time series, like what I realized is it's an abstraction that works well for solving problems in multiple domains, right? Server monitoring, I mentioned, financial market data is one, real-time analytics, but also sensor data of all kinds, be it industrial, oil and gas, wearables, consumer tech, all that kind of stuff. So people had other needs to solve these problems and to build applications on top of this time series abstraction. They had to collect it, they had to store it and query it,
Starting point is 00:11:48 they had to process it for either doing ETL for enrichment or for doing monitoring and alerting. And then finally they had to summarize the data for human consumption either through visualization or reporting or other kinds of things. So as I built the company, I raised capital and hired developers to build these other pieces. And we learned a lot over that period of time, over the last six years. And the thing I realized is what I wanted is a platform that was easy for developers to use that kind of encapsulated all of that. Right now in 1.x, we have four separate products. With 2.0, what we've tried to do is combine them into like one cohesive whole where there's a single API that is consistent, that's easy to use.
Starting point is 00:12:38 There's a swagger definition for it. And there's a user interface on top of it. And then the last bit with 2.0 is, you know, with InfluxDB 1, we had a query language that looks very much like SQL. And that's because I thought it would be easier for people to pick up. And it certainly was. But what we found is there were more complex like analytics and processing tasks that people wanted to accomplish that they wanted to push down into the database. And because they couldn't do that, a common pattern emerged where people would write code
Starting point is 00:13:12 in whatever language they choose, like Python or Ruby or apparently your favorite PHP. And then they would query data out of the database and do some post-processing and then write data back into the database so that they could get it back into the tool chain for monitoring it, for visualizing it, and all these other things. So when we created 2.0, we decided, let's create a new language called Flux, which is not just a query language, but it's also a query planner, a query optimizer, and it's a scripting language. So you can push down this kind of complex processing into the data platform. And as a language, we want it to be Turing complete and generally useful,
Starting point is 00:13:58 but we also want it to be able to pull in data from sources outside of InfluxDB. As I mentioned, I'm a firm believer in polyglot persistence. And what that means is Influx is great for time series data, but it's not good for reference data. So we want to be able to pull in data from Postgres or MySQL, or from any sort of third party API that you want to pull data in from. You could hit GitHub for data that you could mix and match with your time series data. So basically 2.0 is the realization of collapsing those four components into one cohesive whole and creating a language that allows people to define really complex analytics and processing that the platform will just do for them.
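The idea of mixing time series data with reference data held in a relational database can be sketched as follows. This is plain Python rather than Flux, and everything in it is an assumption for illustration: SQLite stands in for Postgres or MySQL, and the `hosts` table, host names, and `enrich` function are hypothetical.

```python
import sqlite3

# Reference data lives in a relational store (SQLite stands in for Postgres/MySQL).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE hosts (host TEXT PRIMARY KEY, team TEXT)")
db.executemany(
    "INSERT INTO hosts VALUES (?, ?)",
    [("web-1", "storefront"), ("db-1", "platform")],
)

# Hypothetical time series points: (timestamp, host, cpu_percent).
points = [(0, "web-1", 41.0), (0, "db-1", 73.5), (10, "web-1", 44.0), (10, "cache-1", 12.0)]

def enrich(points, db):
    """Attach the owning team to each point; unknown hosts get 'unknown'."""
    teams = dict(db.execute("SELECT host, team FROM hosts"))
    return [(t, host, cpu, teams.get(host, "unknown")) for t, host, cpu in points]

enriched = enrich(points, db)
```

The time series store keeps the fast-moving measurements, while the slowly changing facts about each host stay in the relational store and are joined in at query time.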
Starting point is 00:14:47 It seems almost like you're going through Hacker News to some extent and picking all the terrible ideas at once. You just mentioned, for example, that you built a new query language called Flux. Two issues. One, writing your own language is always one of those things that's fraught with peril, but based upon what you've demonstrated, I will absolutely extend credence to that, that you're probably doing the right thing. But I will say that from my experience working in tech for entirely too long, which is where I guess this bitter cynicism all has root, I find that whenever I have to learn a specific
Starting point is 00:15:20 language to use a particular tool, it means tears before bedtime. And I want to wind up calling out a bunch of different companies that have done this, but it's unfair because it seems that every time I've dealt with a specific DSL, you have these problems. I'm still going to maintain that Kubernetes wrote their own custom DSL called YAML, which is so historically incorrect. I don't even know where to begin, but that's why I like saying those controversial things. What, I guess, what made you decide to do this, I guess, in the face of historical terrible experiences with these? Yeah, by the way, I think Kubernetes was, that's not original to create a YAML DSL. I think they're just copying Spring and Struts who created a DSL in...
Starting point is 00:16:01 Once upon a time, we wound up adding Jinja to YAML and that would effectively turn into SaltStack for its configuration language. Again, we're all code terrorists in our own way. Yeah, yeah. No, I mean, it's absolutely fair. I agree. Generally speaking, why would you create your own language? There are countless other languages out there. And so, you know, so basically, like, one option is we just go with SQL, right? Well, one, creating an actual SQL-compliant SQL is really, really hard. It's a lot of work. Two, SQL isn't Turing complete. It's actually not a programming language. It's a declarative scripting language. Now, Microsoft's version of SQL has extensions that make it Turing complete. Oracle's version of SQL has the same.
Starting point is 00:16:52 But then again, you're not using standards compliant SQL. And really, even when you get down to it, every single major database has differences between what their versions of SQL are. So there's a standard, which is the lowest common denominator. And then when you get into more powerful query functionality, you end up getting into this specific database implementation. And as you mentioned, like, there's so many tools that have their own languages, basically, like, I think any analytics tool in existence, whether it's log analytics, user analytics, business intelligence, marketing analytics, they all have their own custom query languages. And that certainly hasn't stopped them from becoming popular. But let me speak to one thing about our specific journey
Starting point is 00:17:38 to Flux that I think is relevant, which is in 2013, I created InfluxDB with this language, this query language that looked like SQL, but it wasn't actually SQL. It was different in ways that are actually frustrating if you're a SQL expert and you try to use it. But at the time, like, tons of people picked it up because, you know, most people actually aren't writing SQL day to day, they're using their ORM. I personally had a viewpoint probably in the fall of 2014 that the SQL style of writing queries was maybe not the best way to work with time series data, which I basically viewed as just like ordered streams of data coming through. And I thought a functional style language would actually be the better way to
Starting point is 00:18:25 represent the query style. Now, I was too afraid to make that change at the time. But when we introduced our processing agent, Kapacitor, which is there for background ETL, monitoring, alerting, and real-time processing, in September of 2015, it had a language that was more functional in nature. So we, again, made the foolish mistake of creating a language. And we actually made not just that mistake, but the other mistake, which is we created a platform that now had two separate languages that were custom.
Starting point is 00:18:59 One for interactive querying and one for background processing. And the language itself, called TICKscript, actually looks like nothing else that you've probably ever seen. It's very, very strange. But over the last, what is it, three and a half, four years, a surprising number of people have actually adopted it. And a surprising number of people have written very complex TICKscripts despite very serious gaps in the functionality that it should provide as a language and some gaps in what I call developer ergonomics, which is the experience of actually writing code in it and developing and testing things. But it has this fan base that uses it, and they get a lot of value out of it. So I thought, well, there must be something
Starting point is 00:19:45 there because if those people are willing to suffer the pain of using this thing that I can see all sorts of like horrible warts on, there's something worth putting more effort into. And when we went to create Flux, we wanted something that could be used for background processing as well as interactive querying. And the choices then were, you know, we knew we couldn't use SQL for the reasons I mentioned. So at this point, it's either do we use an embedded language like Lua or do we create our own? Now, Lua, obviously, like, there are very mature implementations of it. It would have been way easier to just use that.
Starting point is 00:20:25 My problem is, I don't think Lua has enough popularity or that people are familiar enough with it. I think the learning curve is too high for people to adopt Lua. Like it's just not that easy for regular developers to use. So the other thing we wanted was we wanted to be able to control the tooling around the language. Ultimately, we want to create an experience that has a UI in it that allows you to create these Flux scripts without actually writing Flux code, right? So point and click interfaces that describe like data flows of different time
Starting point is 00:21:01 series data that you're collecting that output monitoring and alerting rules or all sorts of other things. So we wanted to be able to control the language. And then the next thing I did was I thought, okay, if we're gonna do a new language, it has to be easy to use. It has to be easy to pick up. So we intentionally made it look like JavaScript,
Starting point is 00:21:27 which plenty of people hate on JavaScript, but the fact is it's probably the most widely used programming language in the world. Even people who don't write JavaScript day to day are usually pretty familiar with it. You can look at the code and kind of understand what's going on. And we said, like, the truth is, like, the learning curve in this thing is going to be the API. And the API learning curve would be there regardless of if we had written, you know, if we had used Ruby as the starting point, Lua, all of those other things. Like, the API is the biggest surface area. The surface area of the actual syntax of the language, that can be covered in 15 minutes by reading a getting started guide that shows you the basic pieces of it. So that's the bet we're making. Obviously, we just launched 2.0 as a cloud product.
Starting point is 00:22:17 The open source build is still in alpha right now. We've just released a new alpha release. So it's really too early to tell what's going to happen. But the joke I like to make is, I'll either be spectacularly wrong or spectacularly right, but there probably won't be a middle ground. No, it's fair. And I think we're going to see one way or another. It's an interesting space, and we're seeing a lot of emergence coming out of it, which I guess gets me to one issue that has been a recurring theme on this show: the idea that multi-cloud is generally not a terrific direction to go in. Pick a vendor and go all in. The challenge is that you're already going to be locked into whatever it is you choose, unless you're spending an awful lot of time working around that to no real benefit.
Starting point is 00:23:04 But understand where that lock-in comes from. From that perspective, if something gets built on top of Influx, is that fundamentally locked in from a data model perspective? Is there lock-in being driven from a once you start paying, you never, ever, ever get to leave in an Oracle-esque model? How does that story play out as far as adoption and implementation go? Yeah, so we have the open source InfluxDB, which is basically a single server. You can use that, obviously. It's MIT licensed with no restrictions, so you can do whatever you want with that. If you want to make your own new version of Influx and fork it, go for it. That's up to you. Our cloud product is basically the exact same API, the exact same user interface. We don't yet have bulk export and bulk import of data,
Starting point is 00:23:54 but our goal is to have seamless data transitions between open source nodes at the edge and our cloud product running in whichever provider you choose. Right now we're just in AWS, but soon we'll be in GCP and Azure. And ultimately, we don't want to hold your data hostage. The data model of InfluxDB is simple enough that you can represent it pretty easily. Trivially, you can represent it in any relational database, and you can also represent it in Cassandra or HBase or whatever.
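To make the portability claim concrete, here is one way the InfluxDB data model (a measurement name, a set of tags, a set of fields, and a timestamp per point) might be laid out in a relational database. This is a sketch under assumptions, not an official schema or export format: the table names, the `write_point` helper, and the restriction to numeric field values are choices made for brevity, and SQLite stands in for any relational database.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE series (
    id INTEGER PRIMARY KEY,
    measurement TEXT NOT NULL,
    tags TEXT NOT NULL,          -- canonical "k=v,k=v" string, sorted by key
    UNIQUE (measurement, tags)
);
CREATE TABLE points (
    series_id INTEGER NOT NULL REFERENCES series(id),
    time INTEGER NOT NULL,       -- timestamp in whatever precision you choose
    field TEXT NOT NULL,
    value REAL NOT NULL,         -- numeric fields only, for brevity
    PRIMARY KEY (series_id, field, time)
);
""")

def write_point(db, measurement, tags, fields, time):
    """Store one point: find or create its series, then add one row per field."""
    tag_key = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    db.execute(
        "INSERT OR IGNORE INTO series (measurement, tags) VALUES (?, ?)",
        (measurement, tag_key),
    )
    (series_id,) = db.execute(
        "SELECT id FROM series WHERE measurement = ? AND tags = ?",
        (measurement, tag_key),
    ).fetchone()
    db.executemany(
        "INSERT INTO points VALUES (?, ?, ?, ?)",
        [(series_id, time, name, value) for name, value in fields.items()],
    )

# Two hypothetical points that belong to the same series (tag order differs).
write_point(db, "cpu", {"host": "web-1", "region": "us-east"}, {"user": 41.0, "system": 7.5}, 1_000)
write_point(db, "cpu", {"region": "us-east", "host": "web-1"}, {"user": 44.0}, 2_000)

rows = db.execute(
    "SELECT time, value FROM points WHERE field = 'user' ORDER BY time"
).fetchall()
```

Sorting the tags before building the series key means the same tag set always maps to the same series, regardless of the order in which the tags were written.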
Starting point is 00:24:28 But basically, that open source component is ideally the thing that gives you the feeling that you don't have lock-in. But I agree with you in the sense that once you've invested a certain amount of time and effort into a piece of infrastructure, and particularly into a provider that's hosting your data for you, there's kind of, you know, there's lock-in just by virtue of the fact that you don't want to spend the time and money to move off of it. And the other thing about data is that it has gravity. It's not free to move from one place to another. So, and particularly in our use case, we're talking about large amounts of data. So that becomes a thing that you actually have to pay attention to.
Starting point is 00:25:08 So ultimately, people want to feel like there's no lock-in. But if it comes time to, say, switch cloud providers, are you really going to halt all feature development for six months while you do this lift and shift over to another cloud provider that provides zero customer value, right? Like the main thing you need is the threat of moving to another cloud provider to give you pricing leverage. And there are mixed reports as to how well that actually works,
Starting point is 00:25:38 but it does raise an interesting question. One of the easiest jobs in the world has got to be running product strategy at AWS because you're just a post-it note that says yes on it. There's really no thing that I would put past them building at this point in time. And to that end, they have announced their own time series offering called Amazon Timestream, which sounds almost like it can manipulate time itself, which it probably should because it was announced at re:Invent last year and we're about to hit re:Invent this year, and it still hasn't been released. So it's like Influx without those pesky customers. So I don't know what the story there is, but more interesting to this conversation and germane to what you're doing and what you're building, what is it like when Amazon enters your market, when they come to crush you, for lack of a better term?
Starting point is 00:26:23 Yeah, so when that got announced last year, that wasn't actually entirely unexpected. I just knew it was a matter of time. It's obviously concerning. That's always the question: what if so-and-so comes to build your product? I mean, the things I take comfort in are basically that Amazon isn't guaranteed to win in every single market it enters. And it's not necessarily a winner-take-all situation for every single product. So, for example, Amazon entered Elasticsearch. They have Elasticsearch hosting.
Starting point is 00:27:02 And by all accounts, they make far more money at it than Elastic does. But last I checked, Elastic's market cap was pretty big. And they're doing pretty well, despite the fact that Amazon has come for them and is trying to crush them, right? And has forked their distribution, even though they don't call it a fork. So I realized that there are things we can still do to try and deliver customer value that's outside of what Amazon does, right? Like we're not gonna be able to buy server time and memory and network bandwidth cheaper than they can, but hopefully we can provide a developer experience that's better,
Starting point is 00:27:33 we can provide a user experience that's better. So, you know, when I think about Timestream, which is their, you know, their soon to come time series database in the cloud, that's just one component of what Influx 2.0 does. Query and storage is just one piece. If you were going to try to cobble together what Influx 2 does by yourself, you'd have
Starting point is 00:27:55 to take Kinesis paired with Lambda, paired with Timestream, paired with S3, paired with some sort of visualization engine. I forget what Amazon's is, but they certainly have. Oh, QuickSight. And that also has no customers because it's like Tableau, but crappy. And it turns out that's not the most compelling marketing. Yeah. So that's the other thing that I've heard you say and I've heard plenty of other people
Starting point is 00:28:19 say, which is that Amazon competes very fiercely on basically infrastructure and costs and scale and stuff like that. But they, for one reason or another, just don't see fit to build user experiences that are compelling and UIs that are compelling. So that's one thing that we continue to invest heavily in is the UI and the API and how those things work together. Yeah, it seems that in a number of the higher level differentiated services, the user experience lacks a certain polish. And I think that's something that only comes with time, for starters. But it also seems that it's not a high priority. And when you're dealing with a tool like this, where you're going to be spending not inconsiderable amounts of time gazing into it,
Starting point is 00:29:04 that experience should be reasonable and polished. And the idea that someone should be able to go from, I've never heard of this thing before, to using it effectively should be measured in hours, not weeks. And I think that that's a lesson that sometimes gets lost. Yeah. I mean, our goal is to measure it in minutes and hopefully seconds. Exactly. Pictures are worth a thousand words, as they say, and graphs, on the other hand, let you figure out exactly how many words each picture is worth if you wind up getting your axes and calibration done appropriately.
Starting point is 00:29:33 So what's coming next, if you have anything to share, as far as what Influx is doing, what's interesting that people should keep an eye out for? What does the future hold? So we recently launched basically monitoring and alerting features inside our Cloud 2.0 product. And that basically turns, you know, Influx 2 into a full monitoring and alerting platform
Starting point is 00:29:57 in addition to a time series database and an agent that can collect data. Within Flux the language, what I'm most excited about is basically packages, right? The ability for users of the system to create their own packages and share them with other people.
Starting point is 00:30:12 And those packages could be bits of Flux source code. So they're shared like you would on npm or RubyGems or crates. And then the other piece is packages that allow people to share essentially entire application experiences, which could be dashboards that you see, it could be drill downs that you can do within your time series data. And all of that is kind of scoped to the structure and schema of what data looks like inside of InfluxDB. So that packaging thing is something I'm excited about. And then the last bit is,
Starting point is 00:30:48 obviously, with InfluxDB, all the core components are open source and we really need to drive towards getting open source InfluxDB 2.0 into beta. And for that, what we need is basically the GA of 1.0 of Flux, the language, and we need the compatibility layer so that users of InfluxDB 1.x can point at InfluxDB 2.0 and work with it as though it's a 1.x server.
Starting point is 00:31:13 We need the migration tooling. And then after we're in the beta, it's all about performance, testing, robustness, and getting to the point where we can get open source InfluxDB 2.0 into general release. Got it. Well, it sounds like there's going to be some interesting stuff coming up, and I'm very curious to see how that winds up manifesting in the marketplace and seeing it in increasing numbers of environments.
Starting point is 00:31:37 I want to thank you for taking the time to speak with me today. If people want to learn more about Influx, about you, your sage thoughts on things that people should and absolutely should not do, where can they find you? So Influx, you can find it at InfluxData or on Twitter as InfluxDB. And I can be found on Twitter as Paul Dix. Thanks so much for taking the time to speak with us today. I appreciate it. Paul Dix, founder of InfluxData, makers of InfluxDB.
Starting point is 00:32:05 I'm Corey Quinn. This is Screaming in the Cloud. This has been this week's episode of Screaming in the Cloud. You can also find more Corey at Screaming in the Cloud dot com or wherever fine snark is sold.
