The Data Stack Show - 244: Postgres to ClickHouse: Simplifying the Modern Data Stack with Aaron Katz & Sai Krishna Srirampur

Episode Date: May 20, 2025

Highlights from this week’s conversation include:

- Background of ClickHouse (1:14)
- PostgreSQL Data Replication Tool (3:19)
- Emerging Technologies Observations (7:25)
- Observability and Market Dynamics (11:26)
- Product Development Challenges (12:39)
- Challenges with PostgreSQL Performance (15:30)
- Philosophy of Open Source (18:01)
- Open Source Advantages (22:56)
- Simplified Stack Vision (24:48)
- End-to-End Use Cases (28:13)
- Migration Strategies (30:21)
- Final Thoughts and Takeaways (33:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 For the next two weeks, as a thank you for listening to the Data Stack Show, RudderStack is giving away some awesome prizes. The grand prize is a LEGO Star Wars Razor Crest 1,023-piece set. They're also giving away Yeti mugs, Anker power banks, and everyone who enters will get a RudderStack swag pack. To sign up, visit rudderstack.com slash TDSS-giveaway. Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to The Data Stack Show.
Starting point is 00:00:36 The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show
Starting point is 00:01:04 week in, week out, and provide you the valuable content. RudderSack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com. Welcome back to the Data Sack Show. We are live in Oakland, California, recording at the Data Council Conference, and we have Sy and Aaron from ClickHouse on the show today. Welcome, gentlemen. Thank you very much. I'm really excited to be here.
Starting point is 00:01:36 All right. Well, give us just a quick background. You've had a pretty incredible journey, so give us a quick background. Sure. I'm happy to start. This is Aaron. We formed ClickHouse Inc, the company around the popular open source database ClickHouse
Starting point is 00:01:50 about four years ago. And it's a venture backed startup, headquartered in Silicon Valley, Delaware corporation and well capitalized. This is model is to take this very popular columnar open source database and offer it as a managed service. As a database, it supports a variety of different use cases,
Starting point is 00:02:08 which I suspect we'll get into. And we launched our managed service, which we call ClickHouse Cloud two years ago, and it's gone very well. There's a lot of market demand for this type of technology. And so we've got over a thousand customers on our managed service, companies like Weights and Biases, Land Chain, Versel, Twilio, Roblox, Sony, Cisco,
Starting point is 00:02:31 and many others, and they're driving great benefits in terms of cost savings and also extremely low latency analytical experiences for their customers. So the company's about 300 employees globally distributed. Over half of our team members are outside of the United States, which also shows up in terms of our customer base and our revenue mix being highly international, with over 50% of both being outside of the Americas. Love to introduce Cy. We acquired Cy's company about 10 months ago, PeerDB, where he was the founder and CEO, and they developed a CDC protocol for moving data
Starting point is 00:03:08 from Postgres into ClickHouse as Postgres emerged as one of the most popular sources of data going into our analytical database. Awesome, very excited to be here and thanks Aaron for the great intro. So I'm Sai and I head up ClickPipes efforts in ClickHouse. So ClickPipes is a native ingestion service which gets data into ClickHouse Cloud.
Starting point is 00:03:30 So at a high level, we make it very easy to stream and get data from various sources like object storage or streaming sources like Kafka and also databases, right? And prior to ClickHouse, I was the CEO and co-founder at PeerDB where we were building a data replication tool with laser focus on Postgres. So the goal was to provide the world's fastest and the easiest way to move data from Postgres to data warehouses, which included ClickHouse. And interestingly, ClickHouse was one of the most adopted in the high traction connector,
Starting point is 00:04:01 which is why I think Aaron acquired PeerDB. And now at ClickHouse, we integrated PRDB already into click house cloud. So you just click a button and like you can start streaming Postgres data into click house and use click house for blazing fast analytics, right? So it's all native. So you don't need to have any external ETL tool to do all of this. It's all in the click house cloud experience. And prior to PRDB, my experience is all in Postgres. So I was working at this database startup called Citus Data, which built a distributed Postgres database and that database got acquired by Microsoft. So
Starting point is 00:04:35 I spent eight years there helping customers implement Postgres. So I've seen all the pain points around Postgres for analytics, which is why I built this company where like, making it easy to move data from Postgres to warehouses. And now I'm working in the other side, which is Clickhouse, which makes like analytics like blazing fast. So I would love to talk about like Postgres, Clickhouse. So yeah. Yeah. So Sai and Aaron, I'm really excited about talking about this Postgres topic
Starting point is 00:05:02 as well, because I think teams hit this wall and they're like, okay, this doesn't work anymore. What do I do? And the thing they don't want to do is have a bunch of different solutions for each thing, right? They want like as few solutions as possible. So I wanna talk about that. Aaron, what's the topic that you wanna hit?
Starting point is 00:05:18 Perhaps we can touch on the, just the diversity of use cases that we're seeing emerge around this type of technology and the convergence of a lot of these specialized databases. And we've seen this now, you know, for the last, let's call it five years, where you have transactional databases like Postgres or MySQL or Mongo. You've got analytical databases like ClickHouse, Apache Druid or Pino, many others. You've got relational databases, vector databases,
Starting point is 00:05:45 and you can kind of see these technologies on a bit of a collision course. And just the overlap between them and what we're hearing from customers around the desire to simplify the database infrastructure to where they can have one or two databases satisfy a lot of these different requirements. Yeah.
Starting point is 00:06:04 Yeah, what about you, Sar? I'd love to talk about Postgres and Clickhouse. And my experiences of what I have seen at Citus, because Citus did build a real-time analytical database. And what were the challenges that we saw building stuff within Postgres, and how we saw customers move to purpose-built databases like Clickhouse we used to hear, like MemSQL also at that time we used to hear, like MemSQL also at that time,
Starting point is 00:06:25 we used to hear like Snowflake, right? So I would love to share those experiences and yeah. Great, awesome. Well, let's dig in, tons to talk about. Yeah, let's do it. Aaron, Sai, welcome to the Data Stack share. Awesome to have you here in person at Data Council. Before we jump into the meat of the share, Aaron,
Starting point is 00:06:43 can you tell us, was there a moment when working on ClickHouse open source, just was there a light bulb moment of, hey, we have something here we can build a company on? How did that happen? Yes, maybe I'll broaden my answer to how I first discovered ClickHouse. So my career started back at some microsystems where I was working on the Java station. And then in early 2002, I joined Salesforce, which is a database company, despite many refer to it as an applications company.
Starting point is 00:07:13 Forms on database, yes. Satisfies a number of use cases, three specifically when I was there. And then I joined Elasticsearch 11 years ago. That was a small startup behind the popular open source search engine. Yeah. And I ran the go-to-market functions there for six years. And then I stepped out during COVID. And in the latter years of my time at Elastic,
Starting point is 00:07:34 we started to observe a number of emerging technologies like ClickHouse, which entered the open source frame. And others would be Druid and Pino. And frankly, at the time, we were really focused on migrating Splunk workloads for logging or offering a managed service to compete with AWS's redistribution of Elasticsearch. And so, you know, at least I frankly kind of dismissed how popular these technologies will be coming, but we started to see some pretty
Starting point is 00:08:00 prominent workloads of companies migrating from Elasticsearch to ClickHouse. And so when I had the ability to step out and kind of observe the broader landscape and inventory, what I thought would be, you know, technologies that just have this growth ascent, but also had this very vibrant community and take a look at, you know, the number of contributors that were helping evolve the technology, ClickHouse just continued to stand out really on its own. And so in early 2021, so this is four years ago, I started working with Yandex,
Starting point is 00:08:34 a Dutch company called Yandex Envy. At the time they were publicly listed on, I believe on the New York Stock Exchange, they were a $30 billion company. They had developed ClickHouse internally to power something called Yandex Metrica, which is the equivalent of Google Analytics. So if you think about web scale analytics, the type of database that would need to sit behind that, where you've got, you know, very high concurrency and very low latency requirements and the creator of ClickHouse, Alexei Mulevitov, and I got to know each other and we started kind of romanticizing about forming a company
Starting point is 00:09:06 around ClickHouse. He named the product and the project. It's short for Clickstream Data Warehouse. And so he was thinking about the data warehouse use case before he even started writing the first line of code. And then in coordination with Yandex, he open sourced it. And it just took off in popularity in companies like Deutsche Bank and Microsoft and Uber, Disney and Comcast on and on, adopted this technology for a diverse set of use cases. And so, you know, we spent about a year
Starting point is 00:09:32 engineering a company around it. It's not the first time this has been done. As many of you know, technologies like Kafka, which was developed inside LinkedIn and was the foundation for the company Confluent or Hadoop, which was developed inside a large internet company and there were companies like Hortonworks and Laudera that were formed around that. So it's a pretty well-known playbook, but the business model that we selected was a little bit different. We can spend some time talking about that later on.
Starting point is 00:09:57 I have a question here and, and this is a selfish question for me, but you mentioned that you were at Elastic and you saw some shifts in the technology that people were using, the architectures, and you kind of dismissed them. And that's a really, that's an easy thing to do because you're so focused on, you know, the problem that you're solving and especially in an environment where someone's changing, which I think a lot of people feel now, especially with a lot of the advances in AI. Do you have some advice on how to maintain an objective view and navigate through understanding technological shifts? Which ones should you pay attention to? Which ones should you not? Well, I think the most important thing to look at is how the applications or the databases are being
Starting point is 00:10:43 implemented and what use cases they're satisfying. Because Elasticsearch was originally designed to be a search engine. So if you needed to add a search bar to a website or you're building a mobile application, you need to search on groceries or DoorDash, for example, those are common use cases. And then people started putting log files into it.
Starting point is 00:11:00 And the beauty about open source is it spawns all of this innovation around it And so you had these other open source projects like log stash for login gestion get created by Jordan or you had Kibana Get created for visualization and that formed the elk stack and the elk stack was a very common Alternative to Splunk for example, which was perceived to be expensive. I mean, because it was, it was a great product. And let's get credit where it's due. So, so then when we were growing the company, we started to say, okay, you've got all these different search applications.
Starting point is 00:11:35 You've got website search, you've got application search, you've got enterprise search, then this was actually before observability was even a term in the industry, you know, I'm dating myself a little bit here, but you had, you had logging, you had metrics and you had APM and the two dominant APM providers were AppDynamics and New Relic and AppDynamics got acquired by Cisco the night before their IPO and New Relic was a very successful public company and Datadog had not yet emerged on the scene. And there had not been this convergence of observability. And Elastic was really, I think, central with those vendors and pulling those
Starting point is 00:12:10 three use cases together. And then people started putting security events into the Elk Stack. And it started to be used as a SIM alternative to something like Splunk Enterprise Security or Arc Site. And so when we were taking the company public, the story told very well, because you have all these different use cases. Each one of those use cases has a huge addressable market. And so investors love that type of growth story. And the reality is, as we all know, it's very difficult to parallel execute product development and distribution when you have all these different use cases and you've got big incumbents who you're trying to disrupt.
Starting point is 00:12:47 And I think Datadog, why they were so successful so early on is they just built this experience for developers to get up and running very quickly in a frictionless way where they could try, explore, deploy an agent, start monitoring their application without ever having to talk to anybody in the sales organization. And that's very different than, you know, a company like Snowflake, for example, which
Starting point is 00:13:11 went very heavy into enterprise sales and marketing, both very successful companies, but approached the problem set in a very different set of ways. And so coming back to your question, when I think when you're looking at these types of technologies, there are some that are very specific, like a vector database. Yep. And there's several that which are very popular. And I don't know if the luster is off this sector or not. I think time will tell.
Starting point is 00:13:34 But what we're hearing from customers is that they want to simplify their infrastructure. If they have a vector search requirement or they need a feature store, it could possibly be the same database that they're using to power their analytical workloads. And so my advice would be, try to take a look at a piece of technology that satisfies a very diverse set of use cases. And I think that's partially why Mongo's been so successful with their Atlas service,
Starting point is 00:13:59 is they really focused on the platform layer versus getting pulled into all of these vertical solution areas. Yep, makes total sense. is they really focused on the platform layer versus getting pulled into all of these vertical solution areas. Yep. Makes total sense. Okay. I want to so much to talk about here, but Cy, we need to get the back story with you. And I love the topic of Postgres because you're seeing a really interesting set of technology develop around Postgres. It's been modified very heavily to the point where now companies have a core value proposition of don't worry like it's actually just post
Starting point is 00:14:29 So give us your backstory and then let's bring PewdV and ClickHouse together No, absolutely So I would say that I'm really lucky that I saw Postgres since it was still very early And it was not a big name, right? Like it was Heroku, I think Heroku was the only like, you know, managed service like, which was there in 2013, 2014. Sure RDS was surely there, but like keeping the hyperscalers apart,
Starting point is 00:14:51 like I think Heroku was the only one. So my history dates back to 2014 at Citus data, where what we were doing is we were building a Postgres database, which could run across multiple machines, right? And one of the bread and butter use cases was Postgres for analytics. And the idea was single node Postgres database
Starting point is 00:15:09 can handle only some of analytics, say like few hundreds of gigs to maybe a terabyte of data, but then you would need more hardware to power these analytical workloads. And the way scientist was built was it was built as an extension. So in Postgres, there is this amazing thing where you can actually extend Postgres to make it more powerful.
Starting point is 00:15:29 So what we did was we extended Postgres so that you could run Postgres across multiple machines. So that was the idea of Citus and very similar to what Aaron said, I was in that passion that Citus can be used for analytics and like, and we've seen customers like use it as well. Right. And but what happened was like, one thing that became a problem over time, right, like as we you know, scale was basically Postgres, like, was built predominantly as a transactional database, right, like basically goes like 30 years. like 30 years. But then we're trying to like, you know, bring analytical capabilities. But analytics is a very hard problem because like you have like stuff like vectorized execution, columnar storage, right? Like, and it has a bunch of things yesterday, like Clickhouse published a blog, which talks about like lazy execution. Yeah, great blog post, by the way, to the listeners, if you haven't read it and go read it, it's awesome. So there are like hundreds of optimizations that, you know,
Starting point is 00:16:22 we needed to do inside this. And the problem was the Postgres ecosystem was a blocker basically, right? And we could go only some, right? And also there were two, you know, big problems we were chasing. One was you were chasing Postgres compatibility, because once you say that, like, you're a Postgres extension, customers expect that like everything works. Like, I mean, prepared statements, correlated software, all theies, all the stuff that Postgres supports. That's why they're using Postgres.
Starting point is 00:16:47 Exactly. And the second big problem is performance. Analytics means they want fast queries. So I feel that we were chasing two big problems, and it was hard to get either. I mean, the argument then was you get best of both worlds. But I feel that customers didn't get the best of both worlds. Right. And which is where we saw like a bunch of customers migrate to, you know, purpose-built databases like Clickhouse. And it was interesting
Starting point is 00:17:15 that like Clickhouse stood out, right? Like you see Cloudflare, right? That was a Citus use case, which migrated to Clickhouse. And at that time, I you not migrate? But now it makes a lot of sense. I feel that like the reason Clickhouse stood out was because of the ethos. Right? Like if you look at it, like Clickhouse and Postgres come from the same ethos of open source. Right? Like both of them have like large like communities.
Starting point is 00:17:40 Right? So we saw this transition where like a of customers were migrating to Clickhouse. And that was the reason I built PeerDB because my bread and butter was Postgres. I was like, how do we make it easy for customers to run workloads? Even though I'm a big Postgres fan, finally it's the customer. We need to build products for customers. So that was the idea. And Clickhouse was one of the early targets that we supported, we supported right? Like at PRDB and it did work. I mean, since the time we launched it in private beta, like, I mean, the traction
Starting point is 00:18:09 was like crazy and we landed like a bunch of like production customers and click house did notice, which is why, you know, the acquisition happened and obviously you build like a company for one and a half year and I was like, okay, should we get acquired? But for me, the main driving force was I do resonate with the philosophy of like Postgres and Clickhouse forming the default data stack. Right? That is what like, you know, drove me and it is paying dividends as well
Starting point is 00:18:35 because since the time we got acquired, you know, we have been growing like, I mean, the growth has been like a hockey stick on you know, how customers are moving data to like, you know, Clickhouse and how they are using both of them together. But that is some history on Postgres and like by click house now. We do need to get you a click house for the... The peer-to-peer is cool, right?
Starting point is 00:18:57 But we do need to get you a click house for the... This is a memorabilia, so like I... Yes, that is true. I have a bunch of click houses. We're gonna take a quick break from the episode This is a memorabilia. data every day and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go. Yeah, Eric. As you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RutterStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server-side, and then send it to our downstream tools. Web, mobile, even server-side,
Starting point is 00:20:02 data to our downstream tools. One of the things about the implementation that has been so common over all the years and with so many RutterStack customers is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set. Yeah, and even with technical tools, Eric, things like Kafka or PubSub, but you don't have to have all that complicated customer data infrastructure. Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools,
Starting point is 00:20:28 head over to rudderstack.com to learn more. Yeah, I think one topic that would be fun because you mentioned this earlier, is talking about this open source thing. So, Bruce, pretend like you're sitting here, you're a founder and you're like, I want to do this thing in data. Tell us about the open source path
Starting point is 00:20:44 and then the non-open source path and what are the like what's the decision? What are the decision points there? Well the first decision point is obvious whether or not you pursue an open source strategy or not and the advice I give to early stage founders who are debating this question is don't. If it's even a debate, just don't do it. It's very tricky.
Starting point is 00:21:05 I mean, there's really a handful of what I think most people would define as successful independent open source companies today. So let's go through them. You've got MongoDB, Confluent, Elastic, Grafana, Clickhouse. And there's obviously more. He's still just on one hand. I did limit the set somewhere. So let's focus on those five for example. All right, Red Hat couldn't stay independent,
Starting point is 00:21:32 they got acquired by IBM. HashiCorp recently got acquired by IBM as well. So, and then you've got a long tail of open source companies who are emerging. And the business model historically is very well known. You get as many people using your technology, you then sell them technical support, which is inherently a flawed business model. And I can talk about why there's all these inherent conflicts with selling technical
Starting point is 00:21:52 support on top of open source, which we can talk about. And then you build an enterprise version or you have proprietary features. And that's typically around things like security and orchestration, alerting, et cetera. You bundle those together. It's typically tied to the size of the environment, which could be memory as a proxy or the number of nodes that a company's running. And you kind of disguise it as subscription revenue.
Starting point is 00:22:15 And it's high margin revenue because your cogs are unlimited because you're not reselling infrastructure like you are in a managed service. But eventually you wanna move your customers to a managed service. But eventually you want to move your customers to a managed service. Those who want to move to the cloud, we're going to talk about hopefully
Starting point is 00:22:30 different deployment models. Because I do believe that there's a research instead of on-prem workloads. Right now, we're seeing that today more than we have in the past five years of companies that are moving back to their own infrastructure, their own data centers. Or they want to self-host or self-manage the software and they don't want to pay the egress costs to move data from their account to yours.
Starting point is 00:22:52 And we saw Confluent acquired WarpStream recently to basically have what we call bring your own cloud. And that's where you decouple the control plane and the data plane. The data plane runs in the customer's accounts. You don't have to pay those networking costs. And so, but coming back to your question around open source, that's the data plane. The data plane runs in the customer's accounts. You don't have to pay those networking costs. And so, but coming back to your question around open source, that's the first question. Fortunately for us, Alexei, with the support of Yandex, had made the decision to open source QuickHouse. So all of a sudden you got a bit of a head start because you've got not a bit
Starting point is 00:23:16 of a head start, a big head start. Right. Right. But you've got a very feature rich database and you've got it in the hands of thousands of companies that are advocating. And then what you have are hundreds of contributors around the world that are advancing the feature set and you could have 10 engineers who are the committers, the core committers to the main branch, but you've got so many other people that are submitting pull requests. So all of a sudden you get this very competitive database technology. So you can go and credibly replace, you know, very advanced technologies like Snowflake and Google BigQuery and Amazon Redshift and Postgres and Elasticsearch, etc.
Starting point is 00:23:51 Then the obvious second question, if the first question is yes, let's open source the technology. So what license do you choose? Yeah. And, you know, historically 10 years ago, when or 11 years ago when I joined Elastic, it was basically Apache or AGPL. Those are the two common open source licenses, pros and cons of each. AGPL being a bit more restrictive, but gives you a bit more protection. Apache being much more permissive, but opens you up to competition.
Starting point is 00:24:15 And then when the hyperscalers emerged, and AWS is probably the most prominent, they started redistributing open source technology as managed services, which is obviously a threat if you're trying to build a company around it, then the server-side license emerged and people started adopting, you know, the elastic license, for example. There's some derivative of that, which has all of the benefits of a traditional open source license, but it restricts somebody offering it as a managed service.
Starting point is 00:24:41 If you fast forward to where we are today, I think that religious war is more or less past. I think companies accept pretty much all of these as commonly accepted open source licenses. Yeah. What do they care about? A, is it a common open source license? B, can they see the source code? C, is it free? Yeah. Pretty much all of these licenses satisfy those three requirements. And so, you know, we're staying with the Apache 2 license for the time being. We think it's in our users' best interests and the growth of the community is exploding. And so we want to, we don't want to do anything to disrupt that. But it's always something that we revisit periodically and say, hey, is this strategically the right decision?
Starting point is 00:25:17 Yeah. Well, let's talk about, Sai, something you mentioned, this vision for a simplified stack of Postgres and Clickhouse, what is that, like explain that vision to us. I mean, that's why the acquisition happened and that's what you're trying to enable for your customers. Great. I think that's a great question.
Starting point is 00:25:34 I think the first thing that we're doing is with PADB, like completely integrated into the Clickhouse cloud, they're making it very easy to continuously move data from Postgres to Clickhouse. This is the change data capture CDC side of things. And now the challenge is like Postgres OLTP workloads are still run like in the terabyte scale. Right. Like now building that experience like, and making it magical,
Starting point is 00:26:00 that's going to be very important. Right. Like, so that was the premise of like PaDB as well, where like we were not a generalized ETL tool. We were like a laser focused replication tool on posters. So that was our main value proposition. And that is helping because now we move, if there is a 30-kb database,
Starting point is 00:26:17 that needs to move to ClickHouse, we can do that in few hours, believe it or not. And other ETL tools, it would take days to you know, days to weeks and most of them would probably be break. Yeah. So then you have to start over. So that is, so that experience you want to make it like magical. So currently I think we are like 50 to 60% there, right?
Starting point is 00:26:37 Like, and we want to, there are a few things where like, there are workloads in Postgres which run at like, you know, over 50,000 transactions per second. And there are caveats around like replication slots, which is like the premise of like change data capture cannot handle, right? So we want to go deeper and see what we can do there. So one of the things we were exploring was logical replication, we do, which lets you consume in-flight transactions, right? So the idea there is it would like, you know, drastically improve throughputs. And here we are talking about customers who run Postgres at like, you know,
Starting point is 00:27:10 that kind of scale, right? Like these are like enterprises. And at Microsoft, I saw that like you had like Adobe, AT&T, FedEx, like who were using Postgres as a new transaction database. Right? And we want to go towards that. And that's very much aligned with the company as well. Right? Like because we are seeing a lot of traction,
Starting point is 00:27:26 as Aaron mentioned, from the upmarket and with that BYOC released as well. So that would be the next focus, where how do we support these enterprise-grade workloads? And obviously that includes, how do we make clickpipes available in BYOC? How do we make clickpipes available across CSPs, which is Azure and GCP, right? So that would be on the Postgres side. We want to go
Starting point is 00:27:48 pretty deep but we also want to expand to new like operational databases so MySQL, MongoDDR ones that we prioritized but the philosophy with which we run click pipes is quality over quantity. I'm not a kind of a person where like we have like 100 data stores and we say that it works. We have so much. The thing is if we build something it has to work. So that is the philosophy with which we are operating at Clickpipes where like any data source we add shoots up on that terabyte scale. So that's the more broader vision of our team and Postgres.
Starting point is 00:28:19 Did you hear that marketing teams? Yes. Can we talk, Aaron you mentioned use cases. So let's talk about just some end-to-end flows that you've seen with your customers, right? And maybe let's talk about their previous architecture, and then they simplify it with Postgres and ClickHouse. But what are they doing?
Starting point is 00:28:37 What is the final product that's being delivered at the end of the pipeline? Yeah, I mean, if we talked about it at a granular level, we'd be here all day, because you talk about fraud detection, sentiment analysis, A-B experimentation, et cetera. So I think you need to up level it to a more broader category, and we can simplify it with three. And the first is where I first observed ClickHouse, which is observability.
Starting point is 00:29:00 So it being a back end database to store and analyze logs, metrics and traces. That's one. The second would be a traditional cloud data warehouse. So people looking for alternatives to, for example, Amazon redshift, Google, BigQuery, or Snowflake. Let's limit the set to those three. That's the second. And then the third would be this broad category of real-time analytics.
Starting point is 00:29:22 And that is, it could be internal or external, but let's say it's typically an externally facing B2B SaaS application. So examples of these could include Ramp, Vantage, Versel, Weights and Biases, their Weave product, Langchain with Langsmith. So these are types of examples where, again, there's a little bit of overlap here,
Starting point is 00:29:43 because some of those are actually exposing observability data, but it's to a customer. So you need to search on your build logs and you need that query to run in a hundred milliseconds. ClickHouse is a great back end for that type of use case. Yep, that makes total sense. And that would include, so you have the build logs that then Vercell or other companies will also offer user facing analytics as well. Does ClickHouse power that? Great question.
Starting point is 00:30:07 Ramp and Vantage would be great examples of that. So a very different persona is the end user. If you're doing expense management or you're actually trying to optimize your cloud costs, so we would be the back end to that type of analytical experience through that SaaS application. Yep. That makes total sense. OK, and so what does a migration look like if you're moving to this stack? So let's say I'm on Redshift.
Starting point is 00:30:34 I have Postgres. I have sort of this architecture. Because migration is the brute. We were talking about this, I think, in another. It's like, their entire company's consulting firms, their bread and butter is making millions of dollars on doing migration. We're also talking, there's employees that like make a whole career of like, essentially they just migrate between things.
Starting point is 00:30:55 Like that's all they do. Yeah. Yeah. Great question. I think that is where like, that is the bowl of click pipes to make like migrations easy. Now I see migrations as, there are two dimensions to migration. One is you have the data migration piece
Starting point is 00:31:09 where you get the data as fast as possible, as reliably as possible to click house. The second is the application migration, basically. So on the data migration side of things, clickpipes would be very helpful. We have native capabilities for Postgres, where you have terabytes of data that you can migrate from Postgres, where like you have like terabytes of data that you can migrate from Postgres,
Starting point is 00:31:27 it would just work out of the box. And second, talking about like these warehouses, like, you know, Redshift, Snowflake, that is something that will be like in our roadmap. So we want to add capabilities where customers can easily migrate like from Redshift and like Snowflake. But the good news is that, right, all of these like, like, warehouses already,
Starting point is 00:31:49 like have capabilities to shove data to like S3 and GCS, right? And ClickPipe supports these capabilities to get from, you know, external storage. And we have customers move, like, you know, moving petabytes of data from like object storage. Right. And so it's like pretty solid on that front. customers moving petabytes of data from object storage. So it's pretty solid on that front. But also the direct migration to even avoid getting data to storage is something that we will be having more in the medium term roadmap.
Starting point is 00:32:15 So this is on the data migration side of things. Now let's come to the application migration side of things. So Clickhouse already supports layer compatibility with Postgres. Right, so you can have a Postgres compatible layer over click house, but my recommendation is to use like the native click house, right? So I don't, I mean, I'm not a believer of like compatibility layers, because we did this at Citus
Starting point is 00:32:38 and we failed, so, right? Because the thing is, people want like native capabilities. Right, once you put a layer, you are like, like inhibiting, you know, the application and the user to use this database in the best way possible. Right. So, so customers typically natively query like click house and it's all ANSI SQL. Right. Like, and you know, very similar to Postgres, right.
Starting point is 00:32:59 And the driver support, the ecosystem, right. Like my sister team, which is, you know, the integrations team has a bunch of like, I mean, they're like, it's a pretty large team who just manage like, you know, drivers and integrations to make it very easy to query ClickHouse, right? So, and this not only ties to like the application layer, but even on the, you know, BI layer, right? Like we have like a native Power BI integration, right?
Starting point is 00:33:21 So we made like, so we made that thing like very simple. So that's the way simple. So there would be some migration effort on the application side, but finally Clickhouse is a SQL based database, it is ANSI SQL, we have a bunch of drivers, that's the way to go about and I'm not a believer of compatibility because I don't think it's going to work and the thing is, users want to use the database in the fullest way possible and you don't want to innovate them from doing it. Yeah. Okay. Sure. Like the compatibility or like migration effort would reduce from like a month to like a two months to one month, but that doesn't matter.
Starting point is 00:33:53 Right? Like looking at the picture two months, you get like a solid product, which is like blazing fast. Yeah. Yeah. Awesome. Well, I know we're at time we can keep going all day, but here inside we've learned a ton. I think our listeners have learned a ton. Again, check out the blog post that ClickHouse published yesterday. I think it was amazing post. And yeah, we'd love to have you back on the show so we can go even deeper on the tech. Sounds great.
Starting point is 00:34:14 Thanks for the invitation. Thanks. And for the Bay Area folks, we're having our first user conference in San Francisco, or for those who want to come to the Bay Area, and we're going to be streaming it as well at the end of May. We'd love to have you. You can find it on our website. We're calling it Open House. Great. And we'll, we'll repeat that on the show closer to the Bay Area, and we're going to be streaming it as well at the end of May. We'd love to have you. You can find it on our website. We're calling it Open House. Great. And we'll repeat that on the show closer to the date.
Starting point is 00:34:29 Great. Thank you. You've been very awesome. Thanks everybody. The Data Stack Show is brought to you by Rudder Stack, the warehouse native customer data platform. Learn more at rudderstac.com
