The Data Stack Show - 82: Databases: The Fun Never Stops with Robert Hodges of Altinity

Episode Date: April 6, 2022

Highlights from this week’s conversation include:

Robert’s background and career journey (2:21)
How studying languages influences database work (5:13)
Why Robert has been working with databases for 40+ years (7:50)
Explaining the ClickHouse database (10:43)
How ClickHouse is able to focus on latency (13:39)
The use cases behind ClickHouse (19:19)
How ClickHouse is different from other databases (25:47)
Why old problems are just now getting addressed (29:04)
How ClickHouse works together with technologies like Kafka (33:03)
When to implement ClickHouse (38:53)
The distance between ClickHouse and the end user (42:24)
New database technologies (47:02)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines. Learn more at rudderstack.com. And don't forget, we're hiring for all sorts of roles. All right, we're going to talk with Robert from Altinity today, a super interesting guy, and he's worked with databases for
Starting point is 00:00:37 almost three decades now. Three decades, that's 30 years. More than that. Probably more than that. Yeah. Oh yeah, four decades. Wow, I'm bad at math. So super interesting guy works on Clickhouse and has some services on type of Clickhouse, which is super interesting. Here's what I'm going to ask. Four decades in databases is a long time. A lot of the people who I've talked to who have worked in databases early in their career eventually move on to do something else. And I want to know why he's stuck with it. I mean, look, databases are great.
Starting point is 00:01:16 I've seen a pattern of you sort of start in databases and then you do something else. So four decades. I mean, that's some staying power. So that's what I'm going to ask. How about you? I want, that's some staying power. So that's what I'm going to ask. How about you? I want to learn more about GitHub. And I think we have the right person to chat about both the technology, like what makes GitHub such a special technology in databases, but also about
Starting point is 00:01:37 the use cases, because like some other technologies like Druid and Pinot, that's in the same space. So it would be great to understand a little bit better why we need this new category of database systems out there and how they are used. Let's dig in and hear from Robert. Robert, welcome to the Data Stack Show. Thanks for giving us some of your time.
Starting point is 00:02:04 Eric, it's great to be here with you and Costas. Awesome. Okay. Give us your background. So you've been working in data and databases specifically for quite a while. So give us just a quick rundown and then what led you to Altinity today. Sure. I started with databases in 1983. I was actually serving in the U.S. military and as a programmer in Germany. And it just happened one day, the unit I worked for, they bought this database. It was called M204. It's a pre-relational database. And they said, okay, learn about this thing and write some apps for it.
Starting point is 00:02:41 And I started doing that. And it was the most interesting software I'd ever worked with. It, because it, you know, sort of it, it had a language, it had APIs, it dealt with data. There were all these interesting ways you could organize things to make things run faster or slower. And then inside it, as I came to learn more about how it worked inside, it was just a fascinating piece of software. So by the time I got out of the military, there was nothing I wanted to do more than go work for this company that designed this database. As it happened, I ended up going back to school at the University of Washington.
Starting point is 00:03:16 So I didn't go there, but I then ended up down in the Bay Area and worked for Sybase for seven years. And that got me completely hooked. It was, I'm not a CS person. I actually grew up studying things like economics and Latin and Japanese, but working at Sybase was kind of like doing a master's in CS. There was just so much technology, so many great people to learn from. And so from that, then pretty much continuously since then, I've worked on databases. And the thing that has kept me hooked on them is just everything interesting in CS shows up in databases sooner or later. So I worked at Sybase. I then worked on a couple other startups.
Starting point is 00:03:57 We're building apps on top of databases. I ran a company that did database clustering for MySQL. We sold that to VMware. I went and lived there for a while. And then what drew me to this was working at VMware, hey, Robert, I know you like databases. You should come check this out. And so eventually I did. And it was so interesting. I thought, OK, I've done interesting things at VMware, but I'm going to get back into the startup world and try this out. And that's how I landed at Alternity.
Starting point is 00:04:43 Very cool. We want to hear a lot about ClickHouse and Alternity, but a couple of questions for you. So first of all, you didn't study computer science in school, but you did study languages. And I'm just interested to know how, how did your study of languages influence your sort of understanding or work with databases? Was there a relationship there? Do you feel like it was helpful? It was pretty tenuous.
Starting point is 00:05:12 I think that I'll tell you, first of all, how languages helped me. They got me my first industry job. And the reason was that I have a master's degree in Japanese studies. And when I got done with that, I was actually a certified translator. There's like an American translator association. I was a certified technical translator of Japanese to English. So I could read things like Nikkei electronics in Japanese. Sybase was entering the Japanese market. They needed somebody to test their software. You have to be able to read Japanese because you have to be able to, if you screw up the data, what'll happen is it'll cause your kanji to become corrupted.
Starting point is 00:05:53 And it doesn't mean you see it, obviously. It just morphs ordinary kanji into characters that haven't been used since the Middle Ages. So they needed somebody who could read Japanese. That's how I got that job and got in at Sybase. So I think in my academic background, I think the most useful things were mathematics. Because even though I didn't major in it, I have kind of a sick interest in discrete mathematics, things like sets, sort of a form of logic, things like that. The other thing is I ended up doing a lot of, in Latin, you spend a lot of time reading people like Cicero. And it turns out that the rhetoric, that the principles of rhetoric that you learn from
Starting point is 00:06:37 somebody like Cicero are incredibly applicable in technical companies because you're always explaining stuff. You're always trying to figure out, hey, who's the audience what do they want to know what do i want to tell them how am i going to do it this is this is cicero is all about this he spent his life doing this he's one of the best people that ever lived at this at at doing this and so i i actually find that i constant this goes back to you know like to reading Latin, but it was not so much the language as the content of what I was reading that then could be directly applied to technical jobs. Fascinating.
Starting point is 00:07:14 Oh, that is so fun. Okay, next question, and we will get to the ClickHouse stuff, but Costas knows that I love to entertain myself and just learning about people. So you've done databases for a long time. And a lot of times you hear about people, it's like, okay, I started in databases and then I kind of like went and did this other thing, right? But you've stuck with it. And you mentioned just a minute ago
Starting point is 00:07:40 that it sort of combines all the things in CS that you like, but like you're so excited about databases and you're still doing it. I love that, but tell us why. Why are you still doing it? Well, two things. One, as I said, there's a huge amount of computer science that just comes together in databases.
Starting point is 00:07:58 It's kind of like operating systems, but I think even more diverse in some ways. So for example, I really like distributed systems. I just like the idea of being able to visualize, you know, things working concurrently on a network. What does it mean for things to fail? What does it mean to, you know, how can we develop algorithms so that we can get work done?
Starting point is 00:08:18 This is fundamental to modern databases because most interesting data problems require more than one node. They just, they're big. So you get this constantly. You end up dealing with very fundamental results like distributed consensus, cap theorem, things like that. These come up in real life in databases. I think the other reason that the databases are fascinating is they have evolved enormously over the time that I've
Starting point is 00:08:46 been working with them. And so there's always new things coming up. Example, when I worked at Sybase, that was during what I call the relational database cathedral building era. You think of people, you think of, you know, if you look at cathedrals in Europe, they went through these phases of building and they were Romanesque cathedrals. They were, you know, if you look at cathedrals in Europe, they went through these phases of building and they were Romanesque cathedrals. They were, you know, Gothic cathedrals and people had different things. Different components. Different components. And they were built out over time in the 90s or late 80s, 90s.
Starting point is 00:09:15 It was relational databases and they were driving things like commerce and transaction processing. So that was really interesting. After a while. Yeah. You know, like, hey, we know everything about that. We understand asset transaction processing. So that was really interesting. After a while, yeah, you know, like, hey, we know everything about that. We understand asset transaction models. But then new things came along, like very, very large volumes of data. So what's next? Well, Hadoop, you know, there's sort of like different ways of thinking about processing data. Now we're in this completely different era where we have huge amounts of data, we want to answer questions about
Starting point is 00:09:46 them really, really fast. So now we have all over the place, data warehouses are popping up that solve this problem. So you just see this constant refreshing of the problems that you're dealing with and sort of new things to attack, new things to learn. That's why I stuck with it. I don't feel any need to work on any other computing problem. Love it. It's so fun. Okay. Last question for me. We've talked about a lot of databases on the show. ClickHouse, I don't know if ClickHouse has come up, but I know that a lot of our listeners, some of them have used it. Some of them have probably heard of it. And then I know there's probably a subset who it's a new term for.
Starting point is 00:10:26 What is ClickHouse and sort of what makes it unique or different as a database? And I mean, you've been doing this for a long time and it really attracted you. So tell us what kind of roots you have. Yeah, I think the, so first of all, let me just tell you what it is, how it's the same as things that have come before and how it's different. So ClickHouse is a SQL data warehouse. That means it's a database. It talks SQL, which is kind of the winning language for managing data. And as a data warehouse, it is designed to scan very large amounts of data and give you answers very quickly. So this is a class of databases that started to develop in the
Starting point is 00:11:07 early 80s and with things like Teradata and what became SybaseIQ. And it has continued over time through the most modern incarnations, things like Snowflake, things like Redshift, BigQuery, which have evolved the technology but are still solving the same fundamental problem. ClickHouse differs in a really fundamental way. One is that it is open source. So unlike most data warehouse technology is proprietary. Snowflake is a great example. What being open source means is that it's accessible to anyone. Any developer can grab it, just stick it on their laptop, develop an application. Moreover, you can use those applications in any way you want. So that gives you this freedom to use it.
Starting point is 00:11:51 The other thing that ClickHouse does that's really interesting is it specializes in low latency response. Moreover, not just low latency, like fast now and then, but guaranteed low latency. So for example, you can build applications where, you know, at P99, you're going to get sub one second response. And that these two properties of the flexibility, and then the fact that it can give fixed response, you know, you know, sort of fixed response time on very large data is, is, is sort of a real game changer and explains why the database is becoming so popular.
Starting point is 00:12:26 So a quick question here from my side, Robert, because you mentioned data warehouses, like Snowflake, for example. And usually when we think about data warehouses, we think about queries that might take hours to answer a question. And it's common. I mean, people, especially like for bots that you don't have like any kind of latency requirement that are like on the second or middle or whatever, like it's fine to do that.
Starting point is 00:12:57 So from where the design key comes does not do that. Actually, it's focuses on how we can have the lowest possible latency in the queries that we are asking. First question, it's a little bit more technical. How do we do that? What trade-offs we have to make there in order to focus more on the latency? Because in engineering, there's always a trade-off that we have to make at some point. So what's the difference technically between ClickHouse and something like Snowflake at the end? That's a great question. And I think you can answer it in two ways. I think they're both relevant.
Starting point is 00:13:39 One is the architecture and the other one is the features. So let's look at features. So ClickHouse was developed originally at Yandex to solve a specific problem. Yandex has a piece of software called Metrica. It's a lot like Google Analytics. And basically people can come in and they can run queries on it through a nice interface and see the traffic on their websites. And just like Google Analytics, they can choose different combinations of, you know, things like where are they coming from? You know, how long do they stay on the site?
Starting point is 00:14:13 So on and so forth. So what you need to do when you're solving that problem, this needs to come back very quickly. Moreover, you cannot pre-compute the data that, you know, sort of pre-aggregate the data, as we would say, in all the possible different ways that people can ask for it. What you need to have is just a piece of software that can take the raw data, just the messages that are arriving from logs, and answer these questions extremely fast. So from the very beginning, ClickHouse was developed to focus on
Starting point is 00:14:47 this problem, have a very, very large table, which has potentially many, many columns of source data and to be able to answer numerical or aggregation questions straight off that raw data set in a very short period of time. As a result, the energy in ClickHouse, many databases that come out, they say, hey, we're SQL compliant. Every single SQL feature you can imagine is we've got it. That's the Postgres story. In ClickHouse, the story is no.
Starting point is 00:15:15 We don't have every single SQL feature. We don't have acid transactions. We don't have a delete command. What we do have is 40 different kinds of hash table implementations inside, each of which is tuned to a specific use case where that hash table organization is going to give us the edge in providing rapid response. So that's a really fundamental difference from other data, you know, from the traditional enterprise data warehouse. It's a product, it's really product-led development that's coming up, starting with this very, very basic use case, and then extending out to other ones and making feature trade-offs along the way.
Starting point is 00:15:55 If that makes sense. The other part that I mentioned is the architectural differences. So ClickHouse has, unlike Snowflake, ClickHouse still has a traditional, what we call shared nothing architecture, where it's basically a set of nodes with attached storage connected by a network. Now, we're working our way toward the model that is used by Snowflake. But in the meantime, this is just extremely fast. And so we see, you know, when we run benchmarks against something like Snowflake, yes, it's a great architecture. Yes, it's, you know, like, you know,
Starting point is 00:16:35 you can store data really cheaply because it's backed by object storage. But, you know, ClickHouse answers these questions in a fraction of a second. And in many cases, Snowflake takes, you know, tens of seconds or even minutes to answer the same question. So there's a real architectural difference. And again, it's sort of, you know, focused on, you know,
Starting point is 00:16:54 sort of delivering the speed and solving this specific set of problems around low latency access to large quantities of data. Yeah, that's some great information that you shared with us right now. You mentioned the features, and you said, for example, there's no delays, right? Or there's no asset guarantees there,
Starting point is 00:17:14 as we see in databases like Postgres. What else do we have to trade off there? Things like joins, for example? Yeah, joins are a great example. And don't get me wrong, there's a transaction model, but it's not asset. So the transaction model is, hey, if I write a block of data, it always shows up and we never get torn blocks. And by block, I mean it could be a chunk of like 100 million rows in a table. But yeah, on joins, absolutely, you know, some real trade-offs.
Starting point is 00:17:46 So ClickHouse by default uses what's called a hash join. So, and that's a join that works very well where you have one table that's very large. And so you're going to scan that table and then you're going to be with the data that you join, you will preload into memory. And then you'll just look at, you know, look it up in memory and see if you got a match. And if you do have a match, you'll pull the extra, the join columns over. So that is
Starting point is 00:18:10 different from a database like Snowflake, which can do merge joins, for example, where you can have instead of just a very, very big table and maybe smaller tables joining with it, you can have very large tables. But the trade-off there is merge joins are great, but they're not fast. Because in order to process these joins and to process complex queries with, you know, like, you know, just arbitrarily join data from many locations, you'll have a long process where you do a join, you then shuffle data around between machines, you do another join, and so on and so forth. So there are real trade-offs here that we have to deal with
Starting point is 00:18:50 and that were part of the design choices for ClickHouse. That's super interesting. So, okay, we talked a little bit about the more technical side of things and the trade-offs there, but usually we make trade-offs because we try to focus on different programs, right? So what are the use cases behind ClickHouse as a technology? Why we make this and how these trade-offs that we talk about, like address? Right. Yeah. So I think that there are an increasingly large number of use cases. The first one was web analytics. And we still see
Starting point is 00:19:27 that, we still see many of the users of ClickHouse pursuing that use case. I'll give you an example, Cloudflare. So Cloudflare is a super successful company. They provide DDoS protection. They shield websites. They provide networking, DNS lookups, things like that. It turns out that if you're a tenant of Cloudflare and you go to the dashboard, the chances are that the data that you're serving, that you're seeing is actually served up from a huge ClickHouse cluster that they maintain. I haven't talked to them recently, but last I looked, it was on the order of 120 nodes. And so everything that's popping up on those dashboards is coming out of ClickHouse. And it's just, you know, sort of rapidly, you know, assembling this data from the sources like
Starting point is 00:20:11 DNS lookups. But there are some other use cases that I think are more interesting because they're completely new. I'll give you a simple example. There's a company called Mux, which is a video content delivery network. They are the folks, among other things, that deliver the video, the streaming video in real time. So they have the content delivery network, which can provide telemetry. They have applications running in the user browsers, which can also send up information about what they're seeing. They combine that in a ClickHouse database so that the people who are operating the Super Bowl live stream can actually look at the metrics in real time and see like, hey, are we having rebuffering problems? Are we seeing content bottlenecks? Are we seeing problems with specific browser types? They can recognize those problems.
Starting point is 00:21:18 They can diagnose the root cause. They can fix them. And then they can go back to the metrics and confirm that they've got a fix. And they can basically do this in the time that the NFL takes a timeout. This is a completely new business. This business was not possible, you know, say 20 years ago with the technology that was available. How is ClickHouse or, let's say, this class of databases different than what we have used to call time series databases? They focus mainly in terms of use case and stuff like observability, for example, because the use case that you described with the video players, it is close to, let's say, the problem
Starting point is 00:22:04 of observability, right? So how they're different or there's an overlap there? Well, there actually is overlap. And it's a great question because when ClickHouse originally started, ClickHouse is really processes a superset of the use cases
Starting point is 00:22:18 of databases like Timescale and Influx. And so the way that Timescale or time series databases work is they just assume that you have a series of measurements that are characterized by time and then have an arbitrary set of attributes attached to them. So it's not the same model as we see in a traditional SQL database, which has a table with columns, so a rectangular format. What ClickHouse does is it solves that same problem, but it just does it in a different way. So first of all, ClickHouse has very efficient support for time. It has multiple data types for it. It also has a wealth of functions that can do transformations. Like when you're doing time
Starting point is 00:23:04 series, a very common thing that you want to do is you want to say, hey, what happened each hour, each day, each week, each month? So time, ClickHouse has functions to just normalize dates to do that kind of bucketing. And you do it straight off the raw data. So you're doing a scan, you can just bucket things and then group by them. The other thing that ClickHouse does is it is a column store. And so even though you don't have quite the same flexibility of just randomly adding parameters to it, ClickHouse allows you to add as many columns as you want. And it provides stunningly efficient compression on it. The compression is really outstanding. And it's not just, they don't just, we use LZ4 and ZSTD, but on top of it, we have what are called codecs, which are ways of transforming the data before it even gets to compression to reduce the
Starting point is 00:23:57 size and get it into something that is more, that's going to compress even more efficiently. So as a result, you can solve the same problems that time series databases do, but you have a database which can also handle much more diverse use cases and doesn't force you to think of everything in terms of that time series mod. Well, that's very interesting. And do you see these database systems, like DeepHouse to, I don't know like yes yes i do i think they this is
Starting point is 00:24:28 just me i believe that the low latency data warehouses products like click house pino and druid which are we're all kind of in the same the the use cases i i mentioned you can do them in you know and and whether whether druid is better for it than ClickHus, well, you know, try your application out. You'll probably be able to tell. But, yeah, so I believe that they're going to take over this model. And I think it's an example. I think, you know, just there's some historical strength behind or sort of basis for that assumption. And that is the fact that over time, we have seen the SQL on top of a relational model has pretty much subsumed most of the use cases that people have for data
Starting point is 00:25:16 management. And so I think this is another case where the time series databases are interesting. But over time, I think databases like ClickHouse that have very good vectorization use all the data warehouse technologies that have developed over the past four decades, take advantage of those, that they will ultimately be able to solve this problem far better than the narrowly focused databases will. That's super interesting. All right. So you mentioned another two technologies
Starting point is 00:25:49 that are similar to ClickHouse. You talked about, you know, and Druid, right? Right. So what's similar between the three and what are the main differences? What's between all three of them? Sure. And I'm going to sort of excuse myself on Pinot
Starting point is 00:26:08 because I haven't used it and haven't really looked into it too deeply. The one that I think is a good comparison and I think I can do it justice is Druid versus ClickHouse. And I think that they are the same in one fundamental way, which is that both Druid and ClickHouse are designed to solve this problem of providing low latency response, no matter how big the data set gets.
Starting point is 00:26:35 That's a very important similarity. So they're framed around that. They frame the problem the same way. Moreover, they also have the idea that a lot of the data is going to be coming off event streams, which are arriving very, very quickly, often millions of events per second. And so they both support columnar storage. They have efficient scanning techniques. They're able to parallelize not just across CPUs, but also across many nodes on a network to deliver these responses. Where they differ is that Druid, when it was originally developed, didn't even support SQL. It did not support joins for a long time, although they've been since added. The other thing is that Druid
Starting point is 00:27:17 has a more complex operational model. So in ClickHouse, you really just have one process. That's the ClickHouse database engine. That's a single process. Looks kind of like a, it's almost like MySQL. You just pop it up and it runs. We also use Zookeeper to keep the cluster coordinated, but that's it. With Druid, you have something like
Starting point is 00:27:42 six different process types, you know, different that serve different purposes. So it's operationally more complex. In that sense, I think ClickHouse is a better architecture. On the other hand, one of the things that is good about ClickHouse is it was built from the start to use, or from very early on, to use object storage as a backing store. And so that is a very good feature of Druid. So there's differences in how they've gone after these problems, but I think what's interesting
Starting point is 00:28:11 and where I give the Druid folks huge credit is they frame the problem the same way. And in the United States, at least, they were one of the first systems that really recognized this problem and recognized that new technology was necessary. Yeah. That's what I wanted to ask you next, because I remember, I think, if I'm not wrong, but like Druid is probably like the first technology that's trying to address these problems and these use cases.
Starting point is 00:28:38 It's been around for quite a while. Yes. Do you feel that the fact that we hear so much and we see more products around these programs today instead of when Droid started, it's just like market conditions and timing kind of reason behind it? Or also because of these operational choices that were made and how hard it was at the end to operate? I'm sorry.
Starting point is 00:29:04 I think what's happening is that people are recognizing the business opportunities around this. So in addition to, you know, there's web analytics, there's content delivery, network management, there is observability, which was, it's an old problem, but this is a new way to solve it. Log analysis. And by that, I mean, service logs, real-time marketing. Let me just give you, in fact, let me give you one more example of a use case that comes from our customer base.
Starting point is 00:29:31 You're going to a website, you have like an ad blocker, which you have an ad blocker and you're going through the website after a few pages, something pops up and said, hey, you know, wouldn't you like to take that off and sign up with us, you know, have a subscription, we can see you visited the website, you know, X number of times or, you know, like in the last hour. Well, that's actually backed by a data warehouse. In fact, in that particular case, it was backed by ClickHouse. And the idea is that you are, you know, that this information is being fed in in real time. Moreover, the data warehouse is able to give an answer about how many times you've been on the page within about 10 milliseconds.
Starting point is 00:30:13 So sufficiently fast that you can apply that knowledge in the time that you render a page. This creates a whole new industry. You know, this is a whole new extension of what people can do with real-time marketing. So, and interacting with customers. This is why there are products that are attacking this problem, because people are starting to understand, hey, one, there's a business opportunity. Two, there's technologies like Druid, like ClickHouse, like Firebolt that begin to solve this. And so you're starting to see people coming into the market and offering solutions for users. It's nice that you mentioned Firebolt. I think Firebolt is also based on ClickHouse, right?
Starting point is 00:30:52 It is. I wrote a blog article about it. And so in our business, our focus is, just to be clear, so our focus is on the real-time applications. ClickHouse is the linchpin, but we're looking around because we're helping people make decisions early on about which technology is right and then how to build the application. So we're super interested in technologies like Firebolt that are emerging. And yes, so Firebolt announced at a webinar that they gave on the Carnegie Mellon database series out on YouTube around December 15th. They said, hey, you know, we've got a new query engine. It's ClickHouse.
Starting point is 00:31:30 Yeah. And I sort of knew this because I know one of the guys who's, Ben Wagner, who's one of their query engineers. And I met him at reInvent. And he said, hey, Robert, there's something you'll be interested in hearing, but I can't tell you what it is. So, but, but come, come see this webinar. And so sure enough, you know, he, he taught, he did this great talk and I ended up writing a blog article, just analyzing what they did and like what it means for ClickHouse, how we can respond and, and also what it means for analytics in general. Yeah. That's super cool. One, one more question from my side, and then I want to give some time. There's never one more. There's never one more.
Starting point is 00:32:10 I'll give it to one more because you said the magic word, which is marketing and analytics in real time. So I'm pretty sure that like... Oh man, that wasn't even what my question was about. Okay. But I have one more question that I want to ask you before I give the microphone to Eric. There is another let's say real set of real-time technologies, which
Starting point is 00:32:35 is distributing brokers like Kasta, right? They build a whole business around real-time data. And outside of the broker itself, which it has like a very specific use of like in the stack of a company, they have also like builds ways to query the data there.
Starting point is 00:32:57 They also like how to create some kind of database on top of it and all that stuff. How is these technologies like ClickHouse or Fino or Druid compare or work together or compete with something like Kafka? How do they work together? These are completely complementary technologies. I know that Kafka talks about, they talk about KSQL and the ability to do queries on event streams. This is significant in some use cases, but I think what we see much more, so we have a getting up toward a couple hundred customers. I would say half of them are using either Kafka or newer versions.
Starting point is 00:33:40 So for example, it's very interesting that a technology called Red Panda, which is a rewrite. So Kafka compatible uses vectorized vector processing to make it extremely fast. So in these applications that we're discussing, you know, like real-time marketing, you've got data coming from multiple upstream services. And what you need is a big pipe that you can just toss it into without worrying about what's going to happen to it at the other end.
Starting point is 00:34:08 That is Kafka. Kafka solved this problem brilliantly by creating this distributed log. And then we're at the other end because in order to, you know, in order to receive the, you know, to actually run meaningful queries on this,
Starting point is 00:34:28 you're going to have to have all the data in one place. Let me give you a concrete example of why you want the data warehouse at the other end. Let's look at security. So we have a bunch of customers that one way or another are dealing with threat analysis and notification about security problems. What happens in a typical problem with security is you notice that some machine is beginning to make DNS requests for a server that you know is infected or a source of malware. So you get an alert that pops, that comes in, it gets stuck into Kafka, stuck into Red Panda, shows up in the data warehouse, somebody alerts on it and says, oh, there's something you got to look at. Well, the next thing you're going to do is the fact that one server is making this call
Starting point is 00:35:14 isn't just, that's an indication that you've got something to look at. In order to figure out what's really going on, you need to have not just that information, but you now need to be able to look at the history of DNS calls for that particular machine, for that particular data center, you know, for that particular type of application, whatever, you know, however you choose to divide it up. Moreover, you want to see the history going back often, you know, days, weeks, or months. So the data warehouse, by having the data in there, allows you to get that initial notification and then do the analysis that's necessary to figure out what's really happening and then do something constructive about it. So these two things are, they work perfectly.
Starting point is 00:35:58 And in fact, a lot of our work, what we're focused on is getting these, you know, ClickHouse is the linchpin to get this to work, but the event stream and then the platform that you run on. And for us, we do an enormous amount of. But we're focused on helping people tie these two things together and then make them work efficiently to build these applications. That's great. Keep asking questions. Go for it. Go for it. Yeah. Another one.
Starting point is 00:36:39 I can tell you, I don't want to bore our listeners, but hey, look, I never get bored with this stuff. And it's, you know what's so cool about this is that, and I think the best part of working in this job is that these are, I talked about cathedral building. These are the new cathedrals. So this generation of cathedrals is solving problems like these, you know, sort of dealing with threats coming in from, you know, from malware and things like that. It's just seeing the creativity that people exhibit in building these applications. Some of them, like the real-time marketing one, where they're popping up the, you know, the thing on the webpage, it's like some of these applications, you have to see them to believe that they're possible. I mean, it's almost like, it's sort of mind expanding to see the way that people use this technology. Okay, so I'm going to jump in, Costas.
Starting point is 00:37:31 So let's step back. And the question I want to ask, and I'll give a little context here. I'll ask the question first and then give a little bit of context. When do you implement ClickHouse? And I'll give a little bit of context there, right? So let's go through the life cycle of a company, you know, who's building, you know, an app, whether that's consumer, B2B or whatever, they're scaffolding out the software that, you know, maybe they have, you know, a Postgres database that's sort of near the back end of the app,
Starting point is 00:38:02 they grow, they have a data warehouse that's doing analytics, et cetera. Maybe they have a data lake. Like in that life cycle, which, you know, is sort of established, you know, to some extent at this point, like you build onto your stack. When do you adopt ClickHouse? And maybe that's like immediately, you know, from your perspective, or maybe there are use cases where ClickHouse is, you know, your perspective or maybe there are use cases where click house is you know better suited for but i just love for you to give our audience a sense of like in the arc of building
Starting point is 00:38:31 a stack from like two guys in a garage who are just clearing the production databases to see what's going on to i'm an enterprise and i need like real-time data happening now the real-time data happening now, the real-time marketing use cases, like when should ClickHouse enter the picture, you know, in sort of the arc of maturity? Sure. I think that's a great question. And there's really a couple of answers, but what we see is two patterns. One is, and this is a pattern that was previously very common, people would outgrow their existing databases. So Postgres and MySQL are super popular. A lot of people, like right now, that if you're building
Starting point is 00:39:15 an application and you need a database, for many developers, the go-to database is becoming Postgres. And so you're collecting, you're collecting your processing transactions. Maybe you're recording DNS requests. You stick them in Postgres. After a while, you're successful. So you have, you start to shard it across more Postgres instances using something like Citus. You realize, hey, you know, this is kind of need a little bit more horsepower. So we're going to add aggregations in there. Again, doing it in Postgres.
Starting point is 00:39:43 After a while, you figure, hey, this is just, we're outgrowing this technology. And that's when you roll in ClickHouse because ClickHouse can literally, because of the trade-offs that we talked about, there are cases where ClickHouse can solve a question or answer a question in a second that would take Postgres literally hours to answer. So that pattern is very common. And in fact, one of the early US users was Clydeflare. That's exactly how they grew into ClickHouse. And they started with Postgres. Many others have done this and sort of grew into ClickHouse over time. Now that people can see these patterns and recognize, hey, I'm going into a domain like I'm doing real-time marketing, I know that the data sets are going to be very large. They can do ClickHouse just right from the start.
Starting point is 00:40:33 And the cool thing about ClickHouse and the other open-source databases is because they're open-source, you can just go get a community build. You can just pop this thing up on your laptop, develop your app. And then the problem that you have is, okay, I've developed the app. Now I just need to get it to run in a production environment. And so that's it. We get customers and see users coming to us from both of those paths, replacing an existing database that they've grown out of, or they built the app. They know ClickHouse is what they need. And the problem there they built the app, they know click house is what
Starting point is 00:41:05 they need. And the problem there is more, okay, how do I get this thing to run in an environment and not have to worry about scaling it, for example, because, hey, they run on our cloud and we do that for them. Okay. Next question on that. Okay. Real-time marketing. What is the distance between the data that is needed and the capability to deliver in real time? So the distance, let's say, between ClickHouse and I can deliver, I have the force power to do this stuff in real time, I still have to actually get the data into a user experience, into the front end of some sort of application in real time. And so I'd love to know, what do you see from ClickHouse users? What's the distance between ClickHouse and sort of that tactual experience? Because that's sort of the other piece of the puzzle, right? In terms of
Starting point is 00:42:11 the data engineering or really even software development piece, right? Like we have the pipelines, we can do this in real time, but we still have to like, whatever it is, there's an app experience, a browser experience, you know, all that sort of stuff. Yeah, that's a great question. Because I think that the pipelines that we've been talking about, that the event streams, that technology is now really robust and I think very well understood. The apps are interesting. So there's a couple of traditional ways that people are bringing that data out and making it available.
Starting point is 00:42:40 One is to use dashboards. Use a tool like Superset, for example, which is open source. Grafana is another one. These allow you to build dashboards where people can quickly see what's going on and basically react to things very, very quickly. both cases, they're kind of fixed, right? So you're, you know, like if a user wants to do, you know, like go in and, you know, sort of tweak graphs and quickly change things about the data, that's harder to do because those products don't support it out of the box. So the other thing that people do is they just build these, they build these front ends. So, you know, they'll use JavaScript or TypeScript, React applications, and then they go straight back to ClickHouse to, you know, to get questions answered. What's now happening, and I think the really interesting development that we're kind of tracking is people are building, you know, you have to build the middle tier. And people were just doing this by hand, you know, like, hey, they'd have React,
Starting point is 00:43:42 and then they'd have Node in the middle there, and they just hand code it. But what's happening is we're starting to see products emerge in that space, like Hasura. That's a product that does GraphKit that'll stick a GraphQL API on top of an existing database. They're adapting it for ClickHouse. We also see things like kube.js. That sticks what looks like a SQL API and builds what we used to call cubes back in the old data warehouse days. That's another middleware tier. And then applications just connect to that.
Starting point is 00:44:15 And so those are the, you know, this middle tier that serves up the data and, you know, gives you some, you know, sort of an indirection layer between the database. That's an emerging topic. And then you have the dashboards and your, you know, typical front-end stacks that are talking to the data. Totally. Super interesting. And we've seen some really interesting architectures around something like Hasura, who basically exposes that API to enable some really interesting things. This is super fun. Those architectures are really fascinating. And I think you're right.
Starting point is 00:44:48 Like circling back to what you said earlier, like it's really fun when all this stuff comes together and a lot of it really is databases and sort of variations. Right. Yeah. And we've, Hasura doesn't have the, we haven't seen Hasura, but we see other people using GraphQL that they've, so we know this is, and we get a lot of questions like, hey, how can I just pop a GraphQL interface down? Because I think this is the other thing to understand is that the challenge that it's not just a technical challenge, but you have dev teams and not everybody can do everything at once. So for example, you have teams that understand the business problem that they're solving, but there's this relatively complex
Starting point is 00:45:31 infrastructure that they have to operate, you know, to, you know, design on and then operate that make these applications work. So I think one of the challenges that dev teams are having is not so much just figuring out the design, but how to get the business part done. You know, that end user experience was going to make the product successful. How to do that without becoming an expert on, say, I don't know, compression in ClickHouse. And how to decide where, hey, you know, if I tweak this compression by 5%, or how do I fix a bug in ClickHouse? That's something that, and so I think, in fact, a lot of what our work is around is we provide that expertise so that the developers can just be developers.
Starting point is 00:46:14 They can figure out how to make it work, but they don't have to worry about necessarily about how to operate it or become, immediately become deep experts in ClickHouse and some of the technologies it connects to. Okay. A non-ClickHouse question for you here, and Costas, would love for you to jump in, because, I mean, this is super fascinating. I think it's very helpful for our listeners, but you are an expert in all things database. Outside of ClickHouse, what are some of the other interesting database technologies that you've seen or that excite you, you know, that have come out in the last couple of years? Like, is there anything, I mean, of course, ClickHouse is exciting, but anything you've seen that sort of piques your interest, you know, in terms of breaking new ground or open source projects?
Starting point is 00:47:00 That's a really great question. I have to say the analytic databases right now are the main thing I'm focused on. I will say SQL implementation, the ability to store data in object storage. But now you're also going to get this real-time behavior and do it in a cloud database. I think that's like the next step in cloud databases. Because if you look at what happened over the last decade, two really significant databases came up. Redshift, that was the first one to bring data warehouses to the cloud. A brilliant idea. And just, you went from potentially having to spend two months to get something installed to about 20 minutes with a credit card. Snowflake, separation of compute and storage. BigQuery has done the same thing. These are really significant. But again, that came at a cost because object storage is really slow. And in fact, low latency is not a focus of either of those products. So Firebolt is doing something really interesting there. And I think kind of laying down a, you know, as I say, laying down the gauntlet to, you know, for other people to try and match their capabilities.
Starting point is 00:48:35 Eric, do you expect for me to share my opinion on that? Absolutely. Actually, I am interested in this because, you know, of course you moved to Starburst, you know, which was both upsetting and exciting for me. But what are you seeing? Because you, you know, sort of federated querying and you're seeing people query all sorts of stuff. Like, what are you seeing out there, Kostas? Yeah, I don't think that my answer will come from my experience of Starburst. Okay, Trino and Presto
Starting point is 00:49:05 and all these technologies, they have been around for a while. It's not something new. The difference is that they were mainly used by quite large enterprise companies. I think what we will see there is, especially now with the data lakes and the lake house paradigm that they will start to be used. They will be used also from let's say smaller companies. So they will become like much more approachable, let's say to the people out there now, exactly what are the use cases and how they compete or not compete with solutions like BigQuery or Snowflake.
Starting point is 00:49:44 That's something that I think it's a big, big conversation for another episode. But there are some stuff that I see happening in the industry and in the technology around databases that I find very interesting and not necessarily product-level kind of technologies right now. But one of the things that I really enjoy is playing around with something, with a project called DuckDB, which is a very interesting approach of building like a columnar database and playing around like we, you know, like use cases
Starting point is 00:50:22 and stuff that you usually would need something like the snowflake to do. It's very interesting. The user experience is brilliant, actually. It's a very, very interesting and very refreshing way of doing things. And another thing that they've done, and that's what I love about the team behind DuckDB is that they really, I have a feeling that they really enjoy what they're doing and they're experimenting a lot. They compiled duckdb.wasp, which means that the whole thing can run inside your browser, which is also a very interesting experiment. So that's one part. And the other thing is anything that has to do with ARO. I love the ARO projects. I think the people there are doing like some amazing job in trying to build,
Starting point is 00:51:08 let's say, a framework of how to represent NARF in memory in a vendor agnostic way called NARData. And I'm paying a lot of attention on these projects and Shades Diaper, like how it's going to impact the database industry, because I think it will. And Eric, if I could also add to this, I think that one of the things that I'm really looking at is actually Kubernetes. And Kubernetes has been around for a long time, but it's only in the last two to three years
Starting point is 00:51:40 that it has really come into its own as a platform for data. There's been a couple of key technology developments, but one is the notion of an operator, which gives you the ability to define resources of your own that are things like distributed databases, like ClickHouse, like whatever database you want to support, Elasticsearch. So I think what, one of the things right now is everybody is very, very focused on operating in the cloud. But as Costa said, there's technologies out there like DuckDB that run just great anywhere. And I think that as people begin to have more and more of these large workloads, it's going to be obvious that, hey, there are times when, yes, it's great to run in the cloud, but for a variety of reasons, you may want to run in your own data centers or
Starting point is 00:52:32 in clouds that you control directly where you do not have an outside vendor controlling things. Kubernetes is the answer to that. It is the portability layer that lets you run not just in the Amazon cloud, say using EKS, but you can now, you know, hey, if you decide, you know, we'd like to, you know, run this on Hetzner or we'd like to run this on Equinix, on Equinix Metal, Kubernetes is a tool that allows you to do that. I am, right now we have a very centralized computing model in the cloud, but it's not going to, and everybody's very focused on that. That is, these pendulums always swing back. And I see this really interesting development. And this is in fact, something that we're very focused on in our company is using Kubernetes and enabling people to run these data warehouses and build these applications wherever they want.
Starting point is 00:53:21 And we already see our customers doing this, using Kubernetes in combination with managed services, for example, to build these very flexible, very portable applications. So I think there's going to be some really interesting things in that area as well, along with this new technology, new specific databases we've talked about. That's a very interesting point you brought there, Robert, and I think Google were trying to address this with Anthos, but I don't
Starting point is 00:53:53 think that these projects went very well. I don't think that they succeeded. Having like, let's say some kind of technology that can run like this hybrid model of like, yeah, you can deploy this onto the cloud. Right. Like on your bare metal. Right, right. But maybe that's also an execution thing, right?
Starting point is 00:54:12 And not like something that has to do with the market itself. But that's going to be, I think, like very interesting to see how it's going to progress. Yeah. I think one of the most interesting developments in Kubernetes is not a technological one so much as the fact that there's now managed kubernetes in so many different environments so all of a sudden because the big challenge with kubernetes and that i think what's what held back adoption is it's not that hard to run applications on kubernetes i think any
Starting point is 00:54:39 any competent developer can do that it's running Kubernetes itself. But as soon as you have distributions that, you know, sort of every cloud provider, you know, large or small has this, you know, so that, you know, take that away. That then makes Kubernetes a practical answer. Plus there's just better technology for running it even on-prem at this point. It's more people know how to do it. There are, you know, companies like SUSE and Red Hat OpenShift that are supplying technology for this. I think we're going to see some really interesting stuff around Kubernetes over the next few years. I totally agree with that. Yeah, I think I would describe that, Robert, as option value. And I think that's becoming a
Starting point is 00:55:21 really important, like, as we think about, you know, it's fun for us to talk about this, you know, all three of us are working to build technology in the data space. And that's great. And we love talking about it, right? The people we serve are trying to build these data stacks. And that's, that's pretty hard. It's actually a lot harder than, you know, probably we say in our marketing materials materials because they have this world of choices and confusing marketing. But one thing that I do see, and I think you make a great point, is the option value and building on something like Kubernetes where the appetite for getting locked in is decreasing at an unbelievable rate.
Starting point is 00:56:00 Right. And people want tools that give them option value, right? Open source is a component from that. Multi-cloud is a component of that. And it's a big deal. And I think it's going to be, you know, the companies that don't sort of adopt an architecture that allows for that option value and flexibility are going to struggle. Now, you know, it's a big market. It's, you know, it's changing rapidly, but in some ways it it's a decade-long change that we're at the beginning of that. But I think that's a great point.
Starting point is 00:56:28 Yeah. And I think that it comes back to what we see is there's just all these different choices that come into where you want to run a particular application. You know, hey, where are your data sources? You know, where's the data coming from? What compliance do you have? What's your data sources? Where's the data coming from? What compliance do you have? What's your cost model? Sort of the economic model behind the application.
Starting point is 00:56:50 And I think that what I'm always amazed at is just how varied the information is that goes into the choices and what different choices companies, both large and small, tend to make. So I think we, it's, you know, the VCs like to call it optionality. I think we're definitely going to see this as a big issue. People will get the systems deployed. In fact, you know, just the obvious case, people get systems deployed and they realize, wow, this is actually kind of pricey up there in Snowflake, you know, and I'm locked in, you know, I'm not going to be coming back. That's the Roach Motel. I'm not coming back out. Yeah. Look, at the end of the day, it is a really fun space to be in, holding some cool stuff. And I have had the privilege of letting the show run long because
Starting point is 00:57:39 Brooks isn't here. So we've gotten to go long, Robert, which is great because Brooks isn't waving the flag saying land the plane, land the plane so I love when I get to do that because these conversations are so wonderful thank you again for giving us some of your time thanks for telling us about ClickHouse and talking about real time this has been really great
Starting point is 00:57:58 yeah and thanks Acosta thanks for the reminder on DB I've absolutely got to go back and check that out. Yeah, this has been really fun. Thank you both. They have done some, like, I think these theories are really having fun. They are doing like some really interesting experiments. I don't know, like, if they have any plans, like productized of like what
Starting point is 00:58:20 they want to do about it, but you see some updates from them and you're like, oh, wow, why did they do that? And it's quite interesting. And it's like, in some ways, like for me, at least it has become like a very useful tool sometimes, like, especially when I have data, like, you know, I have a bunch of like parquet files that I just won't like to create, like it's the easiest way to do it. Like super, super easy.
Starting point is 00:58:46 Yeah. That's why this stuff never, the fun never stops. Maybe we'll put that in the title of the episode. Yeah. The fun never, the fun never stops. Yeah.
Starting point is 00:58:57 Feel free to use it. I, I'm, I'm releasing it into open source. It's a permissive, permissive license. Go to it. I have control over the show
Starting point is 00:59:06 because Brooks isn't here. So done. Consider it done. Databases at Fundever South. Thank you. Thanks, guys. Yeah. And we'll talk again soon.
Starting point is 00:59:14 All right. Bye-bye. That's always a great show. Two things, which I know isn't allowed, but I guess I make the rules so I can kind of break them and Brooks isn't here. So, you know, we can do it.
Starting point is 00:59:29 I'm fascinated. I want to start a separate podcast, not that I would ever do this, but about how people with academic backgrounds that are really different from what they do today, like influence their work today. That's just so interesting to me. And I thought it was fascinating hearing about how his study of language was like very practical, like, okay, I got a job where I was translating Japanese. But then also the rhetoric of studying Latin and reading Cicero influencing the way that he thought about some of the sort of theoretical pieces of databases and their uses and implementations. I loved that. I love that stuff. Maybe I'm just a big nerd and I love learning about people's backgrounds, but that was my big takeaway. I just love hearing about that.
Starting point is 01:00:09 How about you? Yeah. Okay. I have to say that I think that this is also cultural. I think it has a lot to do with like the American culture, but it's a big and very interesting conversation that we can have. And I totally agree with you. It's very fascinating. It's not the first time that we hear from someone to have such an interesting journey until they get into technology, right? Sure. But outside of this, I really enjoyed the conversation today.
Starting point is 01:00:38 And I'm not going to say anything about the technologies that we talked about, the use case and the markets of all that stuff. What I really, really enjoyed was spending like 15 minutes or 60 minutes with someone who is so excited about the stuff. Totally agree. And that is like probably like the best thing that can happen to me as opposed to this show. Like it's so refreshing and like...
Starting point is 01:01:05 So fun. It's still happening. Yep. All right. Well, thank you for joining us. A lot of great shows coming up with guests just like Robert, who are very excited about what they're working on. Eric and Kostas, you're signing off,
Starting point is 01:01:18 and we'll catch you on the next one. We hope you enjoyed this episode of The Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers.
Starting point is 01:01:43 Learn how to build a CDP on your data warehouse at rudderstack.com.
