The Data Stack Show - 244: Postgres to ClickHouse: Simplifying the Modern Data Stack with Aaron Katz & Sai Krishna Srirampur
Episode Date: May 20, 2025

Highlights from this week's conversation include:

- Background of ClickHouse (1:14)
- PostgreSQL Data Replication Tool (3:19)
- Emerging Technologies Observations (7:25)
- Observability and Market Dynamics (11:26)
- Product Development Challenges (12:39)
- Challenges with PostgreSQL Performance (15:30)
- Philosophy of Open Source (18:01)
- Open Source Advantages (22:56)
- Simplified Stack Vision (24:48)
- End-to-End Use Cases (28:13)
- Migration Strategies (30:21)
- Final Thoughts and Takeaways (33:29)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Discussion (0)
For the next two weeks, as a thank you for listening to The Data Stack Show,
RudderStack is giving away some awesome prizes.
The grand prize is a LEGO Star Wars Razor Crest 1,023-piece set.
They're also giving away Yeti mugs and Anker power banks, and everyone who enters will get a RudderStack swag pack.
To sign up, visit rudderstack.com slash TDSS-giveaway.
Hi, I'm Eric Dodds.
And I'm John Wesley.
Welcome to The Data Stack Show.
The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators
and data professionals to learn about new data technologies
and how data teams are run at top companies.
Before we dig into today's episode,
we want to give a huge thanks
to our presenting sponsor, RudderStack.
They give us the equipment and time to do this show
week in, week out, and provide you the valuable content. RudderStack provides customer data
infrastructure and is used by the world's most innovative companies to collect, transform,
and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com.
Welcome back to The Data Stack Show. We are live in Oakland, California, recording at the Data Council conference, and we have
Sai and Aaron from ClickHouse on the show today.
Welcome, gentlemen.
Thank you very much.
I'm really excited to be here.
All right.
Well, give us just a quick background.
You've had a pretty incredible journey, so give us a quick background.
Sure.
I'm happy to start.
This is Aaron.
We formed ClickHouse, Inc., the company
around the popular open source database ClickHouse
about four years ago.
And it's a venture-backed startup,
headquartered in Silicon Valley, a Delaware corporation,
and well capitalized.
The business model is to take this very popular
columnar open source database
and offer it as a managed service.
As a database, it supports a variety of different use cases,
which I suspect we'll get into.
And we launched our managed service,
which we call ClickHouse Cloud, two years ago,
and it's gone very well.
There's a lot of market demand for this type of technology.
And so we've got over a thousand customers
on our managed service, companies like Weights and Biases,
LangChain, Vercel, Twilio, Roblox, Sony, Cisco,
and many others, and they're driving great benefits
in terms of cost savings and also extremely low latency
analytical experiences for their customers.
So the company's about 300 employees globally distributed. Over half of our team members are outside of the United States,
which also shows up in terms of our customer base and our revenue mix being highly international,
with over 50% of both being outside of the Americas.
Love to introduce Sai.
We acquired Sai's company about 10 months ago, PeerDB, where he was the founder and CEO, and they developed a CDC protocol for moving data
from Postgres into ClickHouse as Postgres emerged
as one of the most popular sources of data
going into our analytical database.
Awesome, very excited to be here
and thanks Aaron for the great intro.
So I'm Sai and I head up ClickPipes efforts in ClickHouse.
So ClickPipes is a native ingestion service
which gets data into ClickHouse Cloud.
So at a high level, we make it very easy to stream
and get data from various sources like object storage
or streaming sources like Kafka and also databases, right?
And prior to ClickHouse, I was the CEO and co-founder
at PeerDB where we were building a data replication tool with laser focus on Postgres.
So the goal was to provide the world's fastest and the easiest way
to move data from Postgres to data warehouses, which included ClickHouse.
And interestingly, ClickHouse was one of the most adopted and highest-traction connectors,
which is why I think Aaron acquired PeerDB.
And now at ClickHouse, we've already integrated PeerDB into ClickHouse Cloud. So you just
click a button and you can start streaming Postgres data into ClickHouse and use ClickHouse
for blazing fast analytics, right? So it's all native. You don't need
any external ETL tool to do any of this. It's all in the ClickHouse Cloud experience.
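To give listeners a feel for what a change data capture pipeline like this is doing conceptually, here is a minimal toy sketch (an illustration only, not PeerDB's actual implementation): the pipeline consumes an ordered stream of row-level change events from the source database and applies them, in commit order, to a replica table.

```python
# Minimal sketch of the core loop in a CDC (change data capture) pipeline:
# consume an ordered stream of row-level changes from a source database
# and apply them to a replica. Real tools like PeerDB read these events
# from Postgres's write-ahead log via logical replication; here the
# events are hand-written dicts purely for illustration.

def apply_change(replica: dict, event: dict) -> None:
    """Apply one insert/update/delete event to an in-memory 'table'."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]        # upsert semantics
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")

# A tiny stream of changes, in commit order.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "amount": 10}},
    {"op": "insert", "key": 2, "row": {"id": 2, "amount": 25}},
    {"op": "update", "key": 1, "row": {"id": 1, "amount": 15}},
    {"op": "delete", "key": 2},
]

replica: dict = {}
for event in events:
    apply_change(replica, event)

print(replica)  # {1: {'id': 1, 'amount': 15}}
```

Because events are applied in order, the replica converges to the source's current state, which is what lets an analytical copy in ClickHouse stay continuously in sync.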
And prior to PeerDB, my experience
is all in Postgres. I was working at this database startup called Citus Data, which
built a distributed Postgres database, and that company got acquired by Microsoft. So
I spent eight years there helping customers implement Postgres. So I've seen all the pain
points around Postgres for analytics, which is why I built this company making it easy to move data from Postgres to warehouses.
And now I'm working on the other side, which is ClickHouse,
which makes analytics blazing fast.
So I would love to talk about Postgres and ClickHouse.
So yeah.
Yeah. So Sai and Aaron,
I'm really excited about talking about this Postgres topic
as well, because I think teams hit this wall
and they're like, okay, this doesn't work anymore.
What do I do?
And the thing they don't want to do
is have a bunch of different solutions for each thing, right?
They want like as few solutions as possible.
So I wanna talk about that.
Aaron, what's the topic that you wanna hit?
Perhaps we can touch on the, just the diversity
of use cases that we're seeing emerge around
this type of technology
and the convergence of a lot of these specialized databases.
And we've seen this now, you know, for the last, let's call it five years,
where you have transactional databases like Postgres or MySQL or Mongo.
You've got analytical databases like ClickHouse, Apache Druid or Pinot, many others.
You've got relational databases, vector databases,
and you can kind of see these technologies
on a bit of a collision course.
And just the overlap between them
and what we're hearing from customers around
the desire to simplify the database infrastructure
to where they can have one or two databases
satisfy a lot of these different requirements.
Yeah.
Yeah, what about you, Sai?
I'd love to talk about Postgres and ClickHouse,
and my experiences of what I saw at Citus,
because Citus did build a real-time analytical database.
What were the challenges that we saw building that
within Postgres, and how we saw customers move to purpose-built
databases like ClickHouse. We used to hear about
MemSQL at that time,
and we used to hear about Snowflake, right?
So I would love to share those experiences. Yeah.
Great, awesome.
Well, let's dig in, tons to talk about.
Yeah, let's do it.
Aaron, Sai, welcome to The Data Stack Show.
Awesome to have you here in person at Data Council.
Before we jump into the meat of the show, Aaron,
can you tell us, was there a moment when working on ClickHouse open source,
just was there a light bulb moment of,
hey, we have something here we can build a company on?
How did that happen?
Yes, maybe I'll broaden my answer to how I first discovered ClickHouse.
So my career started back at Sun Microsystems, where I was working on the JavaStation.
And then in early 2002, I joined Salesforce, which is a database company, despite many
referring to it as an applications company.
Forms on database, yes.
Satisfies a number of use cases, three specifically when I was there.
And then I joined Elasticsearch 11 years ago.
That was a small startup behind the popular open source search engine.
Yeah.
And I ran the go-to-market functions there for six years.
And then I stepped out during COVID.
And in the latter years of my time at Elastic,
we started to observe a number of emerging technologies
like ClickHouse, which entered the open source frame.
And others would be Druid and Pinot.
And frankly, at the time, we were really focused on migrating
Splunk workloads for logging or offering a managed service to compete
with AWS's redistribution of Elasticsearch.
And so, you know, at least I frankly kind of dismissed how popular
these technologies would become, but we started to see some pretty
prominent workloads of companies migrating from
Elasticsearch to ClickHouse.
And so when I had the ability to step out and kind of observe the broader
landscape and inventory what I thought were technologies that
just had this growth ascent, but also had this very vibrant community, and take
a look at, you know, the number of contributors that were helping evolve
the technology, ClickHouse just continued to stand out really on its own.
And so in early 2021, so this is four years ago, I started working with Yandex,
a Dutch company called Yandex N.V. At the time they were publicly listed,
I believe on the New York Stock Exchange, and they were a $30 billion company.
They had developed ClickHouse internally to power something called Yandex Metrica,
which is the equivalent of Google Analytics.
So if you think about web scale analytics, the type of database that would need to
sit behind that, where you've got, you know, very high concurrency and very low
latency requirements. And the creator of ClickHouse, Alexey Milovidov, and I got to
know each other and we started kind of romanticizing about forming a company
around ClickHouse. He named the product and the project.
It's short for Clickstream Data Warehouse.
And so he was thinking about the data warehouse use case
before he even started writing the first line of code.
And then in coordination with Yandex, he open sourced it.
And it just took off in popularity in companies
like Deutsche Bank and Microsoft and Uber, Disney and Comcast on and on,
adopted this technology for a diverse set of use cases. And so, you know, we spent about a year
engineering a company around it. It's not the first time this has been done. As many of you know,
technologies like Kafka, which was developed inside LinkedIn and was the foundation for the
company Confluent or Hadoop, which
was developed inside a large internet company, and there were companies like
Hortonworks and Cloudera that were formed around that.
So it's a pretty well-known playbook, but the business model that we
selected was a little bit different.
We can spend some time talking about that later on.
I have a question here, and this is a selfish question for me. But you
mentioned that you were at Elastic and you saw some shifts in
the technology that people were using, the architectures, and you kind of dismissed them.
And that's an easy thing to do, because you're so focused on, you know,
the problem that you're solving, especially in an environment where everything's changing,
which I think a lot of people feel now, especially with a lot of the advances in AI. Do you have some advice on how to maintain an objective view and navigate through
understanding technological shifts? Which ones should you pay attention to? Which ones should you not?
Well, I think the most important thing to look at is how the applications or the databases are being
implemented and what use cases they're satisfying.
Because Elasticsearch was originally designed
to be a search engine.
So if you needed to add a search bar to a website
or you're building a mobile application,
you need to search on groceries or DoorDash, for example,
those are common use cases.
And then people started putting log files into it.
And the beauty about open source
is it spawns all of this innovation around it
And so you had these other open source projects like Logstash, for log ingestion, get created by Jordan, or you had Kibana
get created for visualization, and that formed the ELK Stack. And the ELK Stack was a very common
alternative to Splunk, for example, which was perceived to be expensive. I mean, because it was. It was a great product.
And let's give credit where it's due.
So then when we were growing the company, we started to say, okay, you've
got all these different search applications.
You've got website search, you've got application search, you've got enterprise
search. And this was actually before observability was even a term in the
industry; I'm dating myself a little bit here. But you had logging, you had metrics,
and you had APM. And the two dominant APM providers were AppDynamics and New Relic. AppDynamics
got acquired by Cisco the night before their IPO, New Relic was a very successful public
company, and Datadog had not yet emerged on the scene.
And there had not been this convergence
of observability. And Elastic was really, I think, central with those vendors and pulling those
three use cases together. And then people started putting security events into the Elk Stack. And
it started to be used as a SIM alternative to something like Splunk Enterprise Security or Arc
Site. And so when we were taking the company public, the story told very well, because you have
all these different use cases.
Each one of those use cases has a huge addressable market.
And so investors love that type of growth story.
And the reality is, as we all know, it's very difficult to parallel execute product development
and distribution when you have all these different use cases and you've got big incumbents who you're trying to disrupt.
And I think Datadog, why they were so successful
so early on is they just built this experience
for developers to get up and running very quickly
in a frictionless way where they could try, explore,
deploy an agent, start monitoring their application
without ever having to talk
to anybody in the sales organization.
And that's very different than, you know, a company like Snowflake, for example, which
went very heavy into enterprise sales and marketing, both very successful companies,
but approached the problem set in a very different set of ways.
And so coming back to your question, when I think when you're looking at these types
of technologies, there are some that are very specific, like a vector database.
Yep.
And there are several that are very popular.
And I don't know if the luster is off this sector or not.
I think time will tell.
But what we're hearing from customers is that they want to simplify their infrastructure.
If they have a vector search requirement or they need a feature store, it could possibly be the same database
that they're using to power their analytical workloads.
And so my advice would be,
try to take a look at a piece of technology
that satisfies a very diverse set of use cases.
And I think that's partially why Mongo's been so successful
with their Atlas service:
they really focused on the platform layer
versus getting pulled into all of these
vertical solution areas.
Yep. Makes total sense. Okay. There's so much to talk about here, but Sai, we need to
get the backstory with you. And I love the topic of Postgres, because you're seeing a
really interesting set of technology develop around Postgres. It's been modified very heavily,
to the point where now companies have a core value proposition of, don't worry, it's actually just Postgres.
So give us your backstory, and then let's bring PeerDB and ClickHouse together.
No, absolutely
So I would say that I'm really lucky that I saw Postgres when it was still very early,
and it was not a big name, right?
I think Heroku was the only, you know, managed service
which was there in 2013, 2014.
Sure, RDS was surely there,
but keeping the hyperscalers apart,
I think Heroku was the only one.
So my history dates back to 2014 at Citus Data,
where what we were doing is we were building
a Postgres database
which could run across multiple machines, right?
And one of the bread and butter use cases
was Postgres for analytics.
And the idea was a single-node Postgres database
can handle only so much analytics,
say a few hundred gigs to maybe a terabyte of data,
but then you would need more hardware
to power these analytical workloads.
And the way Citus was built
was it was built as an extension.
So in Postgres, there is this amazing thing where you can actually extend Postgres to
make it more powerful.
So what we did was we extended Postgres so that you could run Postgres across multiple
machines.
So that was the idea of Citus. And very similar to what Aaron said, I had this conviction that Citus could be used for analytics, and we've seen customers use it as well, right? But one thing that became a problem over time, as we scaled, was that Postgres was built predominantly as a transactional database, going back
like 30 years. But then we were trying to bring in analytical capabilities. And analytics is a very hard problem, because you have stuff like vectorized execution, columnar storage,
right? And it needs a bunch of things. Yesterday, ClickHouse published a blog which talks
about lazy execution. Great blog post, by the way; to the listeners, if you haven't read
it, go read it, it's awesome. So there are hundreds of optimizations that, you know,
we needed to do inside Citus.
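To make the row-store versus column-store point concrete, here is a toy sketch (an illustration only, not Citus or ClickHouse code) of why columnar layout helps analytical scans: aggregating one column only touches that column's contiguous values, instead of every field of every row.

```python
# Toy illustration of row-oriented vs column-oriented storage.
# Analytical queries like SUM(amount) need only one column; a columnar
# layout stores that column contiguously, so the scan touches far less
# data than walking every field of every row. (Real engines such as
# ClickHouse add compression and vectorized execution on top of this.)

rows = [
    {"id": 1, "user": "a", "amount": 10.0},
    {"id": 2, "user": "b", "amount": 25.0},
    {"id": 3, "user": "a", "amount": 7.5},
]

# Row store: one record per row, all columns interleaved.
def sum_amount_rowstore(rows):
    return sum(r["amount"] for r in rows)   # visits every record

# Column store: one contiguous list per column.
columns = {
    "id": [1, 2, 3],
    "user": ["a", "b", "a"],
    "amount": [10.0, 25.0, 7.5],
}

def sum_amount_colstore(columns):
    return sum(columns["amount"])           # touches a single array

print(sum_amount_rowstore(rows))    # 42.5
print(sum_amount_colstore(columns)) # 42.5
```

Both produce the same answer; the difference at scale is how much data the query has to read, which is one reason purpose-built columnar engines outrun a row-oriented transactional database on analytics.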
And the problem was the Postgres ecosystem was a blocker, basically, right?
And we could only go so far, right?
And also there were two, you know, big problems we were chasing.
One was, you were chasing Postgres compatibility, because once you say that you're a Postgres extension,
customers expect that everything works.
I mean, prepared statements, correlated subqueries, all the stuff that Postgres supports.
That's why they're using Postgres.
Exactly.
And the second big problem is performance.
Analytics means they want fast queries.
So I feel that we were chasing two big problems,
and it was hard to get either.
I mean, the argument then was you get the best of both worlds.
But I feel that customers didn't get the best of either world, right? And that is where we saw a bunch of
customers migrate to, you know, purpose-built databases like ClickHouse. And it was interesting
that ClickHouse stood out, right? You see Cloudflare, right? That was a Citus use case
which migrated to ClickHouse. And at that time, I did not get why you would migrate. But now it makes a lot of sense.
I feel that the reason ClickHouse stood out was because of the ethos.
Right?
Like if you look at it, ClickHouse and Postgres
come from the same ethos of open source.
Right?
Both of them have large communities.
Right?
So we saw this transition where a bunch of customers were migrating to ClickHouse.
And that was the reason I built PeerDB because my bread and butter was Postgres.
I was like, how do we make it easy for customers to run these workloads?
Even though I'm a big Postgres fan, finally it's the customer;
we need to build products for customers.
So that was the idea. And ClickHouse was one of the early targets that we supported at PeerDB, and it did work.
I mean, since the time we launched it in private beta, the traction
was crazy, and we landed a bunch of production customers, and ClickHouse
did notice, which is why, you know, the acquisition happened. And obviously,
you build a company for a year and a half, and I was like, okay, should we
get acquired?
But for me, the main driving force was
I do resonate with the philosophy of Postgres and
ClickHouse forming the default data stack. Right? That is what,
you know, drove me, and it is paying dividends as well,
because since the time we got acquired, you know, we have been
growing. I mean, the growth has been like a hockey stick in,
you know, how customers are moving data to, you know,
ClickHouse and how they are using both of them together.
But that is some history on Postgres, and
why ClickHouse now.
We do need to get you a ClickHouse for the...
The PeerDB one is cool, right?
But we do need to get you a ClickHouse for the...
This is memorabilia, so like I...
Yes, that is true.
I have a bunch of ClickHouses.
We're gonna take a quick break from the episode to talk about our sponsor, RudderStack. Teams work with customer data every day, and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go.
Yeah, Eric. As you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RudderStack has really been one of my team's secret weapons.
We can collect and standardize data from anywhere, web, mobile, even server-side, and then send it to our downstream tools.
One of the things about the implementation that has been so common over all the years
and with so many RudderStack customers is that it wasn't a wholesale replacement of
your stack.
It fit right into your existing tool set.
Yeah, and it even works with technical tools, Eric, things like Kafka or Pub/Sub, but you don't
have to have all that complicated customer data infrastructure.
Well, if you need to stream clean customer data to your entire stack,
including your data infrastructure tools,
head over to rudderstack.com to learn more.
Yeah, I think one topic that would be fun
because you mentioned this earlier,
is talking about this open source thing.
So, pretend like you're sitting here,
you're a founder and you're like,
I want to do this thing in data.
Tell us about the open source path
and then the non-open source path.
What's the decision?
What are the decision points there?
Well the first decision point is obvious whether or not you pursue an open source strategy
or not and the advice I give to early stage founders who are debating this question is
don't.
If it's even a debate, just don't do it.
It's very tricky.
I mean, there's really a handful of what I think most people would define as successful
independent open source companies today.
So let's go through them.
You've got MongoDB, Confluent, Elastic, Grafana, Clickhouse.
And there's obviously more.
You're still just on one hand. I did limit the set somewhat.
So let's focus on those five for example.
All right, Red Hat couldn't stay independent,
they got acquired by IBM.
HashiCorp recently got acquired by IBM as well.
So, and then you've got a long tail of open source companies
who are emerging.
And the business model historically is very well known.
You get as many people using your technology, you then sell them technical support, which
is inherently a flawed business model.
And I can talk about why there are all these inherent conflicts with selling technical
support on top of open source.
And then you build an enterprise version or you have proprietary features.
And that's typically around things like security and orchestration, alerting, et cetera.
You bundle those together.
It's typically tied to the size of the environment,
which could be memory as a proxy
or the number of nodes that a company's running.
And you kind of disguise it as subscription revenue.
And it's high-margin revenue,
because your COGS are minimal,
because you're not reselling infrastructure
like you are in a managed service.
But eventually you want to move your customers
to a managed service.
Those who want to move to the cloud,
we're going to talk about hopefully
different deployment models.
Because I do believe that there's a resurgence
of on-prem workloads.
Right now, we're seeing that today more than we have
in the past five years of companies that are moving back
to their own infrastructure, their own data centers.
Or they want to self-host or self-manage the software and they don't want to pay the egress
costs to move data from their account to yours.
And we saw Confluent acquired WarpStream recently to basically have what we call bring your
own cloud.
And that's where you decouple the control plane and the data plane.
The data plane runs in the customer's accounts.
You don't have to pay those networking costs.
And so, coming back to your question around open source, that's the first
question. Fortunately for us, Alexei, with the support of Yandex, had made the decision to open
source ClickHouse. So all of a sudden you got a bit of a head start. Not a bit
of a head start, a big head start. Right. But you've got a very feature-rich database, and
you've got it in the hands of thousands of companies that are advocating.
And then what you have are hundreds of contributors around the world that are advancing the feature
set and you could have 10 engineers who are the committers, the core committers to the
main branch, but you've got so many other people that are submitting pull requests.
So all of a sudden you get this very competitive database technology.
So you can go and credibly replace, you know, very advanced technologies
like Snowflake and Google BigQuery and Amazon Redshift and Postgres and Elasticsearch, etc.
Then the obvious second question, if the first question is yes, let's open source the
technology. So what license do you choose? Yeah. And, you know, historically, 10 or
11 years ago, when I joined Elastic, it was basically Apache or AGPL. Those are the two common open source licenses,
pros and cons of each.
AGPL being a bit more restrictive,
but gives you a bit more protection.
Apache being much more permissive,
but opens you up to competition.
And then when the hyperscalers emerged,
and AWS is probably the most prominent,
they started redistributing open source technology
as managed services, which is obviously a threat
if you're trying to build a company around it,
then the Server Side Public License emerged, and people started adopting, you know, the Elastic License, for example.
There's some derivative of that, which has all of the benefits of a traditional open source license,
but it restricts somebody offering it as a managed service.
If you fast forward to where we are today, I think that religious war is more or less
past. I think companies accept pretty much all of these as commonly accepted open source licenses.
Yeah. What do they care about? A, is it a common open source license? B, can they see the source
code? C, is it free? Yeah. Pretty much all of these licenses satisfy those three requirements.
And so, you know, we're staying with the Apache 2 license for the time being. We think it's in our users' best interests and the growth of the community is exploding.
And so we want to, we don't want to do anything to disrupt that.
But it's always something that we revisit periodically and say, hey, is this
strategically the right decision?
Yeah.
Well, let's talk about, Sai, something you mentioned: this vision for a simplified
stack of Postgres and
ClickHouse. What is that? Explain that vision to us.
I mean, that's why the acquisition happened, and that's what you're trying
to enable for your customers.
Great.
I think that's a great question.
I think the first thing that we're doing is, with PeerDB completely integrated
into ClickHouse Cloud, we're making it very easy to continuously move data
from Postgres to ClickHouse.
This is the change data capture (CDC) side of things.
And now the challenge is that Postgres OLTP workloads still
run at the terabyte scale.
Right.
Now, building that experience and making it magical,
that's going to be very important.
Right.
So that was the premise of PeerDB as well, where
we were not a generalized ETL tool.
We were a laser-focused replication tool for Postgres.
So that was our main value proposition.
And that is helping, because now,
if there is a 30-terabyte database
that needs to move to ClickHouse,
we can do that in a few hours, believe it or not.
And with other ETL tools, it would take days to weeks, and most of them would
probably break.
Yeah.
So then you have to start over.
So that experience, you want to make it magical.
So currently I think we are like 50 to 60% there, right?
And there are a few things left. There are workloads in Postgres
which run at, you know, over 50,000 transactions per second, and there are caveats around replication slots,
which are the premise of change data capture, that can't handle that, right? So we want to go
deeper and see what we can do there. So one of the things we were exploring was logical replication
V2, which lets you consume in-flight transactions, right? So the idea there is it
would, you know, drastically improve throughput.
And here we are talking about
customers who run Postgres at, you know,
that kind of scale, right?
These are enterprises.
And at Microsoft, I saw that you had
Adobe, AT&T, FedEx, who were using Postgres
as their main transactional database.
Right? And we want to go towards that.
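As a rough sketch of why replication slots become the bottleneck here (a toy model, not Postgres internals): a slot is essentially a named cursor into the server's ordered change log, identified by log sequence numbers (LSNs), and the server must retain every change past the slot's confirmed position until the consumer acknowledges it, so a consumer that can't keep up with tens of thousands of transactions per second forces the log to grow.

```python
# Toy model of a Postgres logical replication slot: a named cursor into
# an ordered change log (the WAL, positions identified by LSNs). The
# server keeps every entry past the slot's confirmed position until the
# consumer acknowledges it, so a slow consumer forces the log to grow.
# (Illustration only; real slots live inside the Postgres server.)

class ReplicationSlot:
    def __init__(self, name: str):
        self.name = name
        self.confirmed_lsn = 0   # everything <= this has been acknowledged

class ChangeLog:
    def __init__(self):
        self.entries = []        # list of (lsn, change) pairs
        self.next_lsn = 1
        self.slots = []

    def append(self, change: str) -> None:
        self.entries.append((self.next_lsn, change))
        self.next_lsn += 1

    def read(self, slot: ReplicationSlot, limit: int):
        """Return unacknowledged changes for this slot, oldest first."""
        return [e for e in self.entries if e[0] > slot.confirmed_lsn][:limit]

    def confirm(self, slot: ReplicationSlot, lsn: int) -> None:
        slot.confirmed_lsn = lsn
        # Entries acknowledged by *every* slot can finally be discarded.
        min_lsn = min(s.confirmed_lsn for s in self.slots)
        self.entries = [e for e in self.entries if e[0] > min_lsn]

log = ChangeLog()
slot = ReplicationSlot("clickhouse_cdc")
log.slots.append(slot)

for i in range(5):
    log.append(f"change-{i}")

batch = log.read(slot, limit=3)   # consume the first 3 changes
log.confirm(slot, batch[-1][0])   # acknowledge up through LSN 3

print(len(log.entries))  # 2 entries still retained for this slot
```

The faster the consumer reads and confirms (for example, by streaming in-flight transactions rather than waiting for commits), the less the server has to retain, which is the intuition behind the throughput work described above.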
And that's very much aligned with the company as well.
Right? Because we are seeing a lot of traction,
as Aaron mentioned, from upmarket,
and with that BYOC release as well.
So that would be the next focus:
how do we support these enterprise-grade workloads?
And obviously that includes,
how do we make ClickPipes available in BYOC?
How do we make ClickPipes available across CSPs,
which is Azure and GCP, right? So that would be on the Postgres side. We want to go
pretty deep, but we also want to expand to new operational databases, so
MySQL and MongoDB are the ones that we prioritized. But the philosophy with which we run
ClickPipes is quality over quantity. I'm not the kind of person where we
have like 100 data stores and we say that it all works.
The thing is, if we build something, it has to work.
So that is the philosophy with which we are operating at ClickPipes, where any data source we add
has to hold up at that terabyte scale.
So that's the broader vision of our team and Postgres.
Did you hear that marketing teams?
Yes.
Can we talk about that? Aaron, you mentioned use cases.
So let's talk about just some end-to-end flows
that you've seen with your customers, right?
And maybe let's talk about their previous architecture,
and then they simplify it with Postgres and ClickHouse.
But what are they doing?
What is the final product that's being delivered
at the end of the pipeline?
Yeah, I mean, if we talked about it at a granular level,
we'd be here all day, because you talk about fraud detection, sentiment analysis,
A-B experimentation, et cetera.
So I think you need to up level it to a more broader category,
and we can simplify it with three.
And the first is where I first observed ClickHouse, which is observability.
So it being a back end database to store and analyze logs, metrics and traces.
That's one.
The second would be a traditional cloud data warehouse.
So people looking for alternatives to, for example, Amazon Redshift,
Google BigQuery, or Snowflake.
Let's limit the set to those three.
That's the second.
And then the third would be this broad category of real-time analytics.
And that is, it could be internal or external,
but let's say it's typically an externally facing
B2B SaaS application.
So examples of these could include Ramp, Vantage,
Vercel, Weights & Biases with their Weave product,
and LangChain with LangSmith.
So these are types of examples where, again,
there's a little bit of overlap here,
because some of those are actually exposing observability data, but it's to a customer.
So you need to search on your build logs and you need that query to run in a hundred milliseconds.
ClickHouse is a great back end for that type of use case.
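As a rough sketch of how that kind of sub-second, customer-facing log search is typically set up on ClickHouse: the table's `ORDER BY` key puts a tenant ID and timestamp first, so each query only scans that customer's slice of the data. The table name, columns, and the customer ID below are hypothetical, not from the episode.

```python
# Hypothetical DDL and query for a customer-facing build-log search backend
# on ClickHouse. The (customer_id, ts) sort key lets the engine prune data
# so a per-customer, time-bounded search touches only a small fraction of rows.
ddl = """
CREATE TABLE build_logs (
    customer_id UInt64,
    build_id    String,
    ts          DateTime64(3),
    message     String
)
ENGINE = MergeTree
ORDER BY (customer_id, ts)
"""

search = """
SELECT ts, message
FROM build_logs
WHERE customer_id = 42
  AND ts >= now() - INTERVAL 1 HOUR
  AND message ILIKE '%error%'
ORDER BY ts DESC
LIMIT 100
"""

print(ddl.strip())
print(search.strip())
```

Both statements would be sent to a running ClickHouse server; they are shown here only to illustrate the access pattern.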
Yep, that makes total sense.
And that would include, so beyond the build logs, companies like Vercel
will also offer user-facing analytics as well.
Does ClickHouse power that?
Great question.
Ramp and Vantage would be great examples of that.
So a very different persona is the end user.
If you're doing expense management or you're actually trying to optimize your cloud costs,
so we would be the back end to that type of analytical experience through that SaaS application.
Yep.
That makes total sense. OK, and so what does a migration look like if you're
moving to this stack?
So let's say I'm on Redshift.
I have Postgres.
I have sort of this architecture.
Because migration is the brutal part.
We were talking about this, I think, in another episode.
There are entire consulting firms whose bread and butter is making millions of dollars
doing migrations.
We were also talking about how there are employees who make a whole career of it, essentially
they just migrate between things.
That's all they do.
Yeah.
Yeah.
Great question.
I think that is the goal of ClickPipes, to make migrations easy.
Now I see migrations as,
there are two dimensions to migration.
One is you have the data migration piece
where you get the data as fast as possible,
as reliably as possible to click house.
The second is the application migration, basically.
So on the data migration side of things,
ClickPipes would be very helpful.
We have native capabilities for Postgres,
where you have terabytes of data
that you can migrate from Postgres,
and it would just work out of the box.
And second, talking about these warehouses,
like Redshift and Snowflake,
that is something that is on our roadmap.
We want to add capabilities
where customers can easily migrate
from Redshift and Snowflake.
But the good news is that all of these warehouses
already have capabilities to export data to S3 and GCS, right?
And ClickPipes supports ingesting from external storage.
And we have customers moving petabytes of data from object storage.
So it's pretty solid on that front.
But also the direct migration to even avoid getting data
to storage is something that we will be having more
in the medium term roadmap.
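The object-storage path Sai describes can be sketched with ClickHouse's `s3` table function, which reads files (e.g. a Redshift `UNLOAD` or Snowflake export) straight from a bucket. The bucket URL, table name, and file layout below are hypothetical placeholders; a real migration would also pass credentials and match the export format.

```python
def s3_import_sql(target_table: str, bucket_url: str, fmt: str = "Parquet") -> str:
    """Compose a ClickHouse statement that bulk-loads warehouse exports
    sitting in object storage into a ClickHouse table via the s3 table
    function. The statement would be executed against a running server."""
    return (
        f"INSERT INTO {target_table} "
        f"SELECT * FROM s3('{bucket_url}', '{fmt}')"
    )

# Hypothetical export location -- substitute your own bucket and table.
sql = s3_import_sql(
    "events",
    "https://my-bucket.s3.amazonaws.com/redshift_export/*.parquet",
)
print(sql)
```

The same pattern works for GCS via the corresponding table function; globs like `*.parquet` let one statement load an entire multi-file export.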
So this is on the data migration side of things.
Now let's come to the application migration side of things.
So ClickHouse already supports a compatibility layer with Postgres.
So you can have a Postgres-compatible layer
over ClickHouse, but my recommendation is to use
native ClickHouse.
I'm not a believer in compatibility layers,
because we did this at Citus
and we failed.
Because the thing is, people want native capabilities.
Once you put a layer in, you are inhibiting
the application and the user from using the database in the best way possible.
So customers typically query ClickHouse natively, and it's all ANSI SQL.
And it's very similar to Postgres.
And then there's the driver support, the ecosystem.
My sister team, the integrations team,
is a pretty large team that just manages
drivers and integrations
to make it very easy to query ClickHouse.
And this not only ties to the application layer,
but also the BI layer.
We have a native Power BI integration,
so we made that very simple. So there would be some migration effort
on the application side, but finally ClickHouse is a SQL-based database, it is ANSI SQL, we
have a bunch of drivers, and that's the way to go about it. I'm not a believer in compatibility layers
because I don't think they're going to work, and the thing is, users want to use the database
to the fullest, and you don't want to inhibit them from doing that.
Yeah.
Okay. Sure, a compatibility layer might reduce the migration effort from,
say, two months to one month, but that doesn't matter.
Looking at the bigger picture, in two months you get a solid product,
which is blazing fast.
Yeah. Awesome. Well, I know we're at time we can keep going all day, but here inside
we've learned a ton. I think our listeners have learned a ton. Again, check out the blog post that ClickHouse published yesterday.
I think it was amazing post.
And yeah, we'd love to have you back on the show so we can go even deeper on the tech.
Sounds great.
Thanks for the invitation.
Thanks.
And for the Bay Area folks, we're having our first user conference in San Francisco,
or for those who want to come to the Bay Area, and we're going to be streaming it
as well at the end of May.
We'd love to have you.
You can find it on our website. We're calling it Open House.
Great. And we'll repeat that on the show closer to the date.
Great. Thank you.
This has been awesome. Thanks, everybody.
The Data Stack Show is brought to you by Rudder Stack, the warehouse native customer data platform.
Learn more at rudderstack.com