The Infra Pod - Will all future data infra products be built on cloud storage? Chat with WarpStream
Episode Date: October 23, 2023
Ian and Tim invited the co-founders of WarpStream (serverless Kafka built on S3) to talk about the trade-offs of building a new streaming product designed from the ground up to work on top of cloud storage like S3, and what the future implications are.
Transcript
Welcome back to yet another Infra Deep Dive podcast.
As usual, Tim at Essence VC and Ian, let's go.
Today, I'm Ian, as always, actually, to be honest.
I have never changed my name.
I helped Snyk turn into a platform,
and I'm super excited to be joined by Richard and Ryan,
who are building some really incredible new streaming tech.
Richard, please introduce yourself and pass it off to your co-founder
and tell us what you're working on.
Yeah. Hey, everyone. I'm Richie.
Ryan and I have known each other for actually over four years now, I think.
The company we're building right now is called WarpStream Labs.
WarpStream is a Kafka protocol-compatible streaming system that runs directly on top of object storage.
You know, it pretends to look exactly like Kafka, basically from an API perspective,
but there are no local disks anywhere in the system. And the two main reasons we set out to do
that were to reduce interzone networking costs in cloud environments.
I think maybe even more importantly than that,
just to make Kafka a lot easier to manage.
At Datadog, we worked on a system called Husky, which is a columnar store for observability data.
So think like, whatever, I don't know,
Snowflake, but heavily optimized for logs.
And we spent a long time at Datadog
building that system,
migrating everything to it
over the course of three and a half years.
And then when we were done,
we had this cool auto-scaling data lake thing
that was really nice and cost-effective,
but we were still getting paged for Kafka.
Still had issues with brokers running out of disks,
not being able to scale them up fast enough.
Issues with partition balancing and stuff like that.
And it just felt really old, I guess,
compared to the new S3-based system.
So that was somewhat the impetus for the company.
Yeah, this is Ryan, my co-founder and CTO.
Richie and I, before we worked on this
and before Datadog,
were working on another system called WarpTable,
which was a hacky prototype of the thing
that we eventually built at Datadog.
And before that, I tried to convince him to do this, the same basic idea about building Kafka
on top of S3. But he thought it was too boring back then because he hadn't yet experienced the
pain of using Kafka. The joys. Yeah, the pain and the joys. Mostly pain. After three and a half years of interacting with Kafka
in a real production, high-scale environment,
he thought the problem was a lot more interesting.
We co-founded the company in May of this year,
and we've been heads down working on the product ever since.
So what was it, Ryan and Richie,
when you were working on Husky,
was there some insight or some moment where you said,
okay, this is the moment to go build
this company?
What was it that got you over the line
to think that, hey, this is a really good idea?
That's a really good question.
Yeah, I think I can get us started a little bit, because I don't think there's really one answer to this. There's definitely going to be more than one.
I think what made me ready was the fact that we were basically done with the Husky project.
Obviously, it's still being maintained and stuff, but we set out to migrate all of the products using the legacy storage system at Datadog to Husky.
And that finished early in 2023.
We definitely hit our goal there.
And obviously, that makes it a good time to look around and see what's next.
Yeah, I mean, we'd also just, we'd been there a while.
And I think, you know, Ryan and I are both kind of entrepreneurial by nature too. We ended up at Datadog trying to start a database company,
basically an observability company,
and realizing that we had a lot to learn, I think,
about doing a B2B product and stuff.
I mean, I had been working at Uber.
I'd never even worked at a B2B company.
So Datadog was really good to us.
We had a lot of fun there,
but it always felt like a bit of a compromise, right?
Like we wanted to go start our own company, basically.
The amount of things that we did to just avoid introducing Kafka
into new places in our architecture was also extremely frustrating.
Whenever we thought about companies that we might want to start,
a lot of them involved needing something like Kafka.
And we're like, I don't want to actually go figure that out again
from scratch with no help.
A lot of big data problems, if you want to use Kafka for them,
it's not cost effective at all, basically, to use a vendor.
It just is margin destroying for some businesses.
So it just felt like a necessary precursor
for a lot of other interesting problems to be solved, basically.
So I don't know. It just felt like a good time to do it.
So I want to talk about the history you're getting at here, because Husky was definitely the next-gen log management story you guys were building.
I remember there's definitely a lot of history based on the work you guys did with FoundationDB back in the day.
So I would assume the most straightforward path for you guys to start a company would be doing exactly what you did with Husky, basically a log storage type tool.
And looking at the blog post of Husky, you're talking about sketches and better ways to be able to integrate different algorithms directly into storage engines.
So it's a lot more logs and probably some more observability type of data.
But now, looking at the new Kafka, you're sort of just assuming any data streams, right? Just built in a fundamentally different way. Why go after Kafka, and why not just do Husky reborn? What was the thought process for you, like, hey,
this is way more important or this is something we should do instead?
Yeah, that's a totally fair question. I mean, we're joking about fundraising a little bit
before this. I think like that would have been a really easy fundraise.
So I think that would have for sure been an easy path for us.
I just think that, honestly, we spent four years solving that problem,
and I just didn't want to solve it again.
It's really hard for me to get excited about watching the same movie twice, basically.
Whereas the Kafka stuff felt...
I've kind of said this before, too.
I feel like it's kind of understood now.
If you think about what Husky is,
it's basically an observability data lake.
It's a solved problem in the sense that people know how to build it.
I don't think there's a clear leader in the open source space or anything,
but people know how to build these systems now.
You ingest data, you buffer it for a little while,
you dump it into object storage in a
columnar format, you track metadata in some external store, you have a scatter gather query
engine, you add caching, etc, etc. People know how to do this and there's a lot of systems that do
that. Doing quote-unquote low latency streaming on top of a high latency storage medium like S3,
I don't really think that's a super well-solved problem. And so if I'm looking at
the next four to 10 years of my life, that felt a lot more interesting.
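To make that recipe concrete, here is a minimal sketch of just the scatter-gather step, in Go: fan a query out over many immutable columnar files in object storage and merge the partial results. The `queryOne` callback and the count-style aggregation are illustrative assumptions, not Husky's actual engine.

```go
package scatter

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// scatterGather fans a query out across many immutable columnar files in
// object storage and merges the partial results. queryOne stands in for
// downloading (or range-reading) one file and evaluating the query on it.
func scatterGather(
	ctx context.Context,
	files []string,
	queryOne func(ctx context.Context, file string) (int64, error),
) (int64, error) {
	g, ctx := errgroup.WithContext(ctx)
	partials := make([]int64, len(files))

	// Scatter: one concurrent sub-query per file.
	for i, f := range files {
		i, f := i, f // capture loop variables (pre-Go 1.22 semantics)
		g.Go(func() error {
			n, err := queryOne(ctx, f)
			partials[i] = n
			return err
		})
	}
	if err := g.Wait(); err != nil {
		return 0, err
	}

	// Gather: merge the partial aggregates (a simple count/sum here).
	var total int64
	for _, n := range partials {
		total += n
	}
	return total, nil
}
```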
Cool. Let's talk more about this Kafka space, right? Because obviously Kafka has been there
forever. I worked on Kafka actually before, and I saw how Red Panda started back in the day. So
we've seen other entrants in this space before. Now you're coming into it, I think there's
definitely a lot of decisions you can make to go after the space. And your "Kafka is dead" blog post definitely went viral as well. So let's talk about the thought process here. Building a new Kafka, what does it look like? What do you want to make sure you're able to deliver? Because the biggest thing on the front page is cost. You're comparing how much it costs to run different versions of it, and you're 10x cheaper. So it sounds like cost is the number one hook,
but I'm sure there's way more thoughts around this.
We basically talk about two things, cost and ease of operations or ease of use. Those are our two
main value propositions. And cost, I think, is actually a lot more important than I think
people realized in the beginning. Because when you're trying to solve a new and interesting
data problem, especially at the frontiers of what is possible in computing today,
you may self-select out of different kinds of architectures based on how much you think they will cost.
If you're going to be doing streaming analytics on video, you probably wouldn't back that with
Kafka. It just wouldn't function from a cost perspective. You would do something like
writing video files to S3 and then maybe tracking them in Kafka, like pointers to them.
It would be nice to just use Kafka to do it because there's no
differentiating value as the person who's making this video streaming analytics thing on doing the
engineering behind, I'm going to write video files to S3, and then I'm going to track pointers to it
in Kafka. The cost thing is not just about taking existing workloads and lowering their cost it's about
making sure that people starting today with new workloads that maybe are at the edge of what you
would reasonably choose kafka for or even well beyond it can choose the system that they want
based on the features and the ease of use rather than purely about the cost model.
And we think that we can deliver on that.
We think that we can build something that would be essentially as efficient for a high data volume use case like doing stream analytics on video.
We're not building anything in particular for it.
We're just building exactly the correct, from an architecture perspective, way to move a
lot of data through S3 through a Kafka-like interface.
The other side of that is the ease of operations.
Amazon and any other hyperscaler cloud vendors
have put an inordinate amount of engineering time
behind making object storage scalable.
And if you look at the write pattern of Kafka, ignoring the latency for a second, it's not as if Kafka requires
writing a bunch of tiny objects to S3.
I think most developers could imagine
a way that you could make it work.
There's a whole bunch of metadata aspects of it
that you have to think a lot harder about,
but just from the data side,
I think most developers could figure out
a way to do that.
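As a rough illustration of that idea, and not WarpStream's actual design, a writer can buffer records from many topics and partitions in memory and flush them as one larger object once a size or time threshold trips; `put` below is a stand-in for a real S3 PutObject call:

```go
package writer

import (
	"bytes"
	"sync"
	"time"
)

type Record struct {
	Topic     string
	Partition int32
	Value     []byte
}

// Writer batches records from many partitions into one buffer and flushes
// them as a single object, so S3 sees a few large PUTs instead of a flood
// of tiny ones.
type Writer struct {
	mu       sync.Mutex
	buf      bytes.Buffer
	maxBytes int
	put      func(data []byte) error // stand-in for an S3 PutObject call
}

func NewWriter(maxBytes int, flushEvery time.Duration, put func([]byte) error) *Writer {
	w := &Writer{maxBytes: maxBytes, put: put}
	// Time-based flushes bound latency even when write volume is low.
	go func() {
		for range time.Tick(flushEvery) {
			w.Flush() // error handling elided in this sketch
		}
	}()
	return w
}

func (w *Writer) Append(r Record) error {
	w.mu.Lock()
	// Toy encoding; a real format would frame, checksum, and index batches.
	w.buf.WriteString(r.Topic)
	w.buf.Write(r.Value)
	full := w.buf.Len() >= w.maxBytes
	w.mu.Unlock()

	if full {
		return w.Flush() // size-based flush
	}
	return nil
}

func (w *Writer) Flush() error {
	w.mu.Lock()
	if w.buf.Len() == 0 {
		w.mu.Unlock()
		return nil
	}
	data := append([]byte(nil), w.buf.Bytes()...)
	w.buf.Reset()
	w.mu.Unlock()

	// One PUT containing many records across topics and partitions.
	return w.put(data)
}
```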
And once you shift that burden
onto the cloud provider,
you have so, so many more options
about how you build the rest of the system.
Like one thing that you would get rid of right away
is partition rebalancing.
If you're an open source user of Kafka,
you have to have some kind of tool
to move partitions from one broker to another,
which is like the absolute most archaic thing.
When you're coming from other systems
that automatically shuffle data around for you
behind the scenes,
having to think about that for Kafka is just very funny,
which is why most people either use some fancy tool for it
that automates it,
or they pay a vendor like Confluent or Amazon
to do it for them.
But it'd be nice to just never have to think about that again.
That's one thing that WarpStream gives you
is you just run a stateless Docker container
in whatever your container orchestrator of choice is.
We've tested it in both ECS and Kubernetes.
You point it at an S3 bucket and you get Kafka on the other side.
There's no partition rebalancing.
So you can scale it just as easily as you'd scale Nginx.
You add more containers based
on the CPU usage, and that's basically it. All of that ease is provided by the fact that you use
the object storage. Object storage is such a fundamental game changer in terms of the ability
to design high-scale systems that if you're starting today building a new system like
WarpStream, you just don't have another option if you want to build something that works at high scale and is easy to use.
Yeah, I think that's a point I want to emphasize. I don't think it's possible for big data systems
to remain competitive if they're not object storage based. If you're running in a cloud
environment and you're dealing with large volumes of data, you have to move to a completely object
storage based architecture or your system will just not be cost-effective
or effective in other ways long-term.
I don't think you'll be able to essentially keep up.
The cost differences there are too huge
and the economies of scale there,
it doesn't matter how good your software is,
you can't accomplish them any other way
in those environments.
And the primary reason for that is that you have these cloud providers like Amazon investing all the money into object storage to create a great abstraction, one that abstracts away the spinning disks and the complexities of dealing with hardware failures and all these other complexities, and hides it all behind APIs. Is that the primary reason?
That's part of it. There's also
the millions of engineering hours
that have gone into making that extremely reliable
and dealing with hotspots
and basically allowing you to burst temporarily
and all the stuff that they do.
But there's also just some physics to it, right?
If you think about the way S3 works,
the kind of trick there is that
if you're an object storage provider,
you can go out and buy these disks.
They actually get slower every year, but their capacity just grows continuously.
And they have so many disks that are just sitting filled with data that are completely idle, right?
It's data at rest that's not being read, that's not being written.
And basically, the number of disks that you would need to buy to get the amount of IOPS you can get out of S3 or object storage
by just signing up with your personal credit card
and not asking for any quotas,
the number of SSDs you would need to accomplish that
and how much that would cost you, it's ridiculous.
I remember Ryan and I did some cost modeling a while back.
We were trying to estimate how much it would cost,
basically, and what it would take to build a single layer of redundancy
against object storage in cloud environments, just for a couple of hours, and it's unimaginably expensive.
You can't compete with the economies of scale
that they have from those giant arrays of disks.
And also, part of this is just the cloud tax, where you just get hammered on these interzone networking fees.
If you ever have to move data
across availability zones, large volumes of data,
and you don't go through, essentially, the object storage APIs, you get crushed.
To send a gigabyte of data from one availability zone to another in AWS costs the same as storing it in S3 for two months.
So it's like two months of data storage versus copying it over the network once.
It's crazy. It's completely imbalanced in terms of cost.
There are plenty of vendors who are building new systems that are not object storage based, at least not entirely. They have some
local disk layer in front of them because they have
a poorly engineered application that doesn't know how
to work well against object storage. They built it
assuming local disks exist.
There are a bazillion IOPS that you can get on a local disk at a four-kilobyte size, and you don't have to pay for PUT and GET requests.
But those systems just, they'll work at toy scales.
Like if they have a lot of tenants
that are doing a bunch of tiny workloads that don't spike.
But if you want to deliver this to a giant bank
or other Fortune 500 companies
with a diverse range of applications,
a diverse range of sizes of those applications,
and run it in a multi-tenant environment,
you're not going to be able to do that
with local disks effectively.
It doesn't even matter how good you are
at writing software against the local disk.
The economies of scale will crush you,
fortunately or unfortunately.
I don't necessarily think object storage is a good thing in the abstract.
It's just, if you're in the cloud, there's no other way to meet the cost goals,
at least on the analytics side.
If you're on the operational database side,
there are potentially different choices you can make there.
But if you're anywhere close to the analytics side, you just can't do it.
Yeah, I think it's actually bringing up really good points
where if you're fundamentally
designed on top of object storage and there's a cloud,
like the partition rebalancing and all
that kind of stuff that you have to do before,
the reason we had to do it before
is because all the data is sitting on a disk somewhere.
And there's a bunch of limitations,
right? How many topics can you actually write? How many
partitions can you actually write? Because you have ZooKeeper, you have all this stuff, right? I think over time, Kafka has changed quite a bit. And trying to add a bunch of stuff to the existing architecture has always been a challenge; adding exactly-once writes and transactions, all that stuff has been harder. There are huge benefits if you're fundamentally designed differently
with object storage behind the scenes. But there's trade-offs too, right?
Once you have object storage as your main storage, there are actually downsides. What are the trade-offs of being object storage-based? What are the things you have to do more carefully, or that are harder, that you didn't have to deal with in the past?
The obvious answer is there's a latency trade-off that we make, right? We can talk about that in
more detail too, because there's other types of storage besides object storage, but there is a latency
trade-off. You're never going to be able to ensure that something is durable in a millisecond.
You can do that with an NVMe SSD. That's not going to happen when you're using S3.
The other kind of thing you have to think about is that with a lot of these systems, what I see people do is take software that was written for SSDs and lift and shift it into the cloud.
They put it in Kubernetes and then they copy data asynchronously to S3
and page it in when they need it later.
And there's tons of systems that do this.
That works in the sense that you can now store way more data
on a much more limited set of hardware,
but you miss out on a ton of the potential
of what the system could really be capable of.
Really to take advantage of object storage properly,
you basically have to rewrite all the software
from the ground up around object storage.
For example, even just forgetting WarpStream for a second
and looking at Husky,
every design decision made in Husky, if you root-cause it to the bottom, it's because it's S3-based, to the point where there's no local buffering in ingestion either. And everything kind of falls out of that. The file format falls out of that.
The query engine style falls out of that. The data structure, every single data structure in the system
falls out of that. It's the same thing with WarpStream. It's Kafka protocol compatible in the sense that
it implements the semantics and protocol of Kafka, but literally nothing about the internals of it
looks anything like traditional open-source Kafka.
There's very little you can reuse.
You really just have to be willing to start from scratch
and start from first principles.
If I had a disk and every time I tried to write to it,
the P99 was like 400 milliseconds,
what would my software look like?
And so you have to design for massive amounts of parallelism
and minimizing sequential operations in your storage system. You have to design around larger IOs. You have to think about caching a lot more intently. You have to think about the fact that
basically your entire file system is immutable. Metadata always becomes a huge thing. That's,
I think, actually probably the main point of leverage when you're designing around object storage
is how well can you handle the metadata
that's required to make the object store
look like something else, basically.
So for Husky, we use FoundationDB
and with WarpStream, we had to do something
significantly more custom
because of the semantics of the Kafka protocol.
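For a feel of what that metadata has to do, here is a hypothetical sketch, not the actual WarpStream schema, of the mapping that makes immutable S3 objects look like a Kafka partition: each batch of records lives at some byte range inside some shared object, and an index maps Kafka offsets to those ranges so a consumer fetch becomes an S3 range read.

```go
package meta

// BatchRef locates one partition's batch of records inside a shared,
// immutable object in the bucket.
type BatchRef struct {
	ObjectKey   string // S3 key of the (multi-partition) object
	ByteOffset  int64  // where this batch starts inside the object
	ByteLength  int64
	FirstOffset int64 // Kafka offsets covered by the batch
	LastOffset  int64
}

// partitionIndex would live in a transactional metadata store
// (FoundationDB for Husky; something custom for WarpStream).
type partitionIndex struct {
	batches []BatchRef // sorted by FirstOffset
}

// lookup binary-searches for the batch containing a Kafka offset, so a
// fetch at that offset can be served with a single S3 range read.
func (p *partitionIndex) lookup(offset int64) (BatchRef, bool) {
	lo, hi := 0, len(p.batches)-1
	for lo <= hi {
		mid := (lo + hi) / 2
		b := p.batches[mid]
		switch {
		case offset < b.FirstOffset:
			hi = mid - 1
		case offset > b.LastOffset:
			lo = mid + 1
		default:
			return b, true
		}
	}
	return BatchRef{}, false
}
```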
I mean, those are pretty big trade-offs.
You just kind of have to start over, and there's the latency.
The things that you start over with, after you decide you're going to start over, a lot of them aren't unknowns.
If you go read papers about
file systems and databases
and stuff, the tips and tricks
basically that you need to build a system
on top of object storage, they're all there.
They're just kind of left behind
now because they're the
same things that you would use on spinning disks from like the 1980s. The fundamentals are actually
not that different. It's just people have completely forgotten how to write software
in an environment like that, where the latency of an IO is very high and variable. But once you do decide to start over, it's not
as if you have to invent whole new fields in computer science to solve the problems.
A lot of stuff is reusable. What's a good example of a 1980s technique that's been left behind?
It's not specifically about the time period for the whole system. It's just that spinning disks used to be even slower than they are now, and they used to have lower IO bandwidth. You'd get a lot of IOPS and a lot of throughput out of them, but the latency was not great,
especially when you're oversubscribed
in terms of how many IOPS the system has
versus the applications you're running across it.
The latency for an individual IO can be high and variable.
The difference between the median and the P99
might be really large,
which is exactly the way that S3 works.
But it is not the way that an NVMe SSD works.
And most people that are writing software today
assume that if they are going to use the file system,
they can just read and write anywhere
inside their application code,
maybe even without putting the I/O onto a thread pool
or doing it asynchronously.
They're just like, I'm going to make the write happen right here
because the operating system is going to cache
the file system for me,
and I can mostly just not think about it, and just tell the user to make sure that the working set of the application fits in memory, and then nothing bad will ever happen. But what if the way that you design the system is that every I/O is going to be uncached, every I/O is going to take a really long time, and it costs money? It actually costs money to do these I/Os. You have to constrain yourself into that interface, which most people left behind long ago.
Over time, even for spinning disks, the latency went down a little bit. Now it's going the other way again in the last few years, as spinning disks have had to be built around more and more obscure technologies.
Outlier control is a good example, I think.
When I worked on M3DB at Uber,
and we ran it on top of real disks and SSDs,
if it took a second or a couple seconds
to write some data to the disk, the disk was broken.
That machine was getting ripped out immediately
because the system did not work
when individual IOs started taking seconds.
The system's designed for individual IOs to take milliseconds. Whereas that's a completely
normal occurrence when you're programming against object storage. With writes and reads, the extreme outliers will just take a second or two. And that's normal and has to be designed
around. One really dumb example that you can do that's extremely effective, at least with S3,
and I've done this in various things,
is that you can monitor the P99 latency of your reads or writes, basically, and just automatically retry preemptively when you detect that a request is taking too long.
It's not like the data in S3 is on one disk.
It's like there's multiple different sets of disks
that they can read from to answer the same request.
Or you just got unlucky,
you hit one of the thousands of the machines
in the cluster that was slightly overloaded or whatever.
So stuff like that just doesn't happen
with an NVMe SSD
unless you've absolutely worn it to the ground.
And you rewrote it 50 times a day,
every day for years,
and it's starting to malfunction.
It's a thing that we've done.
So don't do that.
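That monitor-and-retry-early trick is often called request hedging. Here is a minimal sketch under assumed names: `fetch` stands in for a single S3 GET attempt, and `hedgeAfter` would be set near the observed P99 so the backup request only fires for the slowest tail of requests.

```go
package hedge

import (
	"context"
	"time"
)

type result struct {
	data []byte
	err  error
}

// hedgedGet races a primary attempt against a delayed backup attempt and
// returns whichever succeeds first.
func hedgedGet(
	ctx context.Context,
	fetch func(context.Context) ([]byte, error),
	hedgeAfter time.Duration, // set near the observed P99 latency
) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels the losing attempt once we return

	results := make(chan result, 2)
	launched := 0
	launch := func() {
		launched++
		go func() {
			data, err := fetch(ctx)
			results <- result{data, err}
		}()
	}
	launch() // primary attempt

	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	var firstErr error
	received := 0
	for {
		select {
		case <-timer.C:
			if launched < 2 {
				launch() // past the tail-latency budget: hedge
			}
		case r := <-results:
			received++
			if r.err == nil {
				return r.data, nil
			}
			if firstErr == nil {
				firstErr = r.err
			}
			if launched < 2 {
				launch() // primary failed outright: retry immediately
			} else if received == launched {
				return nil, firstErr
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}
```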
I mean, you both have been through the wringer,
both running the hardware and building software for running the hardware,
and now you're focused on your new approach using object storage
and what you learned at Datadog building Husky.
I'm curious, there are a lot of different use cases for things like Kafka.
I'm kind of curious to get your perspective
on what use cases this most makes sense
to use for today.
And then also,
are there use cases where it doesn't make any sense?
And is there a path to actually
making it work for there?
And then maybe it's just a matter of building,
a lot of software to write,
or something has to change with the cloud providers,
or something else.
The obvious use case
where it's really great today
is just like, we've been calling it analytics,
just really anything that looks like high volume streams
of telemetry data.
That's the most obvious use case
because you just get hit so hard.
If you're like, I have a bunch of telemetry data,
I need to dump it somewhere, I need to stream it somewhere,
you get whacked twice, because first you just get hammered
on the interzone networking costs.
So you're like, why is my bill so high?
I'm already running this myself.
I already got priced out of all the vendor solutions.
It's like a gigabyte per second is like $1.7 million per year
in interzone networking.
And a lot of people have much bigger workloads than that.
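For intuition on where a number of that magnitude can come from, here is a hypothetical back-of-envelope (my assumptions, not necessarily the ones behind their figure): assume cross-zone traffic is billed around $0.01/GB on each side, so $0.02 per GB crossing, replication copies each gigabyte to two other zones, and consumers spread evenly across three zones read two-thirds of the data cross-zone.

```latex
% replication: 1 GB/s copied to 2 other zones            = 2    GB/s crossing
% consumers:   2/3 of 1 GB/s read from a different zone  = 0.67 GB/s crossing
\left(2 + \tfrac{2}{3}\right)\,\tfrac{\mathrm{GB}}{\mathrm{s}}
  \times \$0.02/\mathrm{GB}
  \times 31{,}536{,}000\,\tfrac{\mathrm{s}}{\mathrm{yr}}
  \approx \$1.68\mathrm{M}/\mathrm{yr}
```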
But then it's also like a gigabyte per second workload
is not a trivially managed, self-hosted Kafka cluster.
And so both of those problems just kind of disappear
if you can offload most of the storage problems to object storage.
So that's where it shines, I think, today.
Long-term retention can be a really good use case too.
There's a bunch of stuff happening right now around open source Kafka
with tiered storage, which is really not the same thing.
And I can get into the difference between what we're doing
and tiered storage too, but that does help a little bit.
And then obviously the place where it's not good
today is just like, I've
yet to find a ton of people
who actually have really solid use cases
for extremely low latency
Kafka, but there are some.
You can get a really well-tuned
Kafka cluster that's over-provisioned.
You can get it down so writes are consistently finishing
under 20 milliseconds, right?
If you're willing to spend money on that.
That's just not a thing you can do with object storage
in basically any cloud environment.
You could do things like run MinIO.
There are other options too.
At some point, we'll offer a low-latency version of WarpStream,
but there's other trade-offs there.
And so that's the place where today it just doesn't work very well.
All right, so I guess we've got to jump into what we call the spicy future.
Very simply, spicy future is to talk about what you believe should happen in the future.
And also we want to just present it as facts, right?
Just like, hey, this is what we believe will happen in the next, you know,
two to three years. And I think in this context,
it'll be interesting to kind of talk about, like,
what do we believe the ecosystem will look like from a data infrastructure
perspective?
Are projects going to be rewritten on top of S3? Do you see every project rewritten on S3, and what are the side effects or
the downstream effects of that happening?
So however you guys want to start, what do you believe the next two, three years should look like?
What do you see happening? There are definitely a lot of things that I think are going to change.
I think for developer infrastructure products, the scale to zero is going to be even more prevalent
than it is today. And the reason for this is not just people being cheap.
I think scale to zero, once you have it everywhere in your stack,
the way that you think about building development environments
and CI environments changes completely.
If you can set up a complete development environment
that has its own copy of all of its database dependencies
and Kafka cluster dependencies
and application server dependencies,
the reliability of CI will greatly increase
relative to the bad old days of everybody sharing
one gigantic staging database,
one gigantic staging Kafka cluster.
And databases, I think, have solved this problem a little bit
better than most other infrastructure pieces
because you could at least afford to create a different schema
in Postgres for each CI run.
But once you get outside of that very narrow universe,
there is not much out there.
You have to build your own hacks around multi-tenancy into things.
That translates also into building multi-tenant systems.
If you can have a multi-tenant
system whose dependencies are also scale-to-zero and multi-tenant, you can build it all the way up the
stack. I think that's going to become table stakes, basically, in most of these products.
Not to talk about WarpStream's competitors too much, but if you look at the competitors' products,
most of them have some fixed cost element that does not go away if you want to have a Kafka cluster.
I think there's other developer-feature-type stuff that will become more prevalent, like branching and snapshots for developer infrastructure.
If you're built on top of S3, this is a lot easier than it is for systems based around local storage. If you think of your system as a copy
on write type model, you can just leave all the old files around and make a much smaller copy of
the metadata for the system to share between the branches or snapshots, so to speak. Snowflake has
had that feature for a long time, and I think it's going to filter its way down into a lot more
developer products. Again, not just because it's cool, but because I think that there are actual use cases for it.
Like if you were going to build in CI, you could branch your Kafka cluster and start it from
a data set that already exists, a snapshot of some data that you're always using in CI,
instead of spending the first 30 seconds of every CI run
writing a bunch of data to Kafka.
It could just already be there right from the beginning.
Or to validate changes against a production database,
you could spin up your application server
against a clone of it, run your tests,
and assert that, hey, I didn't break all of the data
in my database.
I think it'll make development a lot
safer instead of just shipping your stuff
to prod and hoping it works.
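A hypothetical sketch of why that kind of branching is cheap on object storage: because the data files are immutable, forking a branch copies only the small metadata that points at them, never the objects themselves. The names here are illustrative, not any product's API.

```go
package branch

// Branch is a named view over immutable data objects in a bucket.
// The objects themselves are shared between branches and never copied.
type Branch struct {
	Name  string
	Files []string // S3 keys of immutable data files
}

// Fork creates a copy-on-write branch: only the metadata slice is
// duplicated, so forking is O(metadata), not O(data).
func (b *Branch) Fork(name string) *Branch {
	files := make([]string, len(b.Files))
	copy(files, b.Files)
	return &Branch{Name: name, Files: files}
}

// Append records newly written objects on this branch only; the parent
// keeps pointing at its original file set.
func (b *Branch) Append(keys ...string) {
	b.Files = append(b.Files, keys...)
}
```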
Those are my two biggest ones, I think,
related to object storage.
I think
at least the open source implementation of Kafka
has just had a stranglehold
on the streaming industry. If you want
my spiciest take, I think it's been
holding everything back.
I think if you compare what's happened
in the batch and warehousing worlds,
what people have managed to build
basically over the last 10 years,
especially all the modern ones now
that all have these object storage takes
where the storage is solved.
Your job is to now provide value
above the layer of just copying bytes around.
A lot of really interesting
and innovative stuff has happened.
We're just starting to see that in the streaming space.
If you look at just the number of streaming compute engines
and databases that have emerged in the last two to three years,
basically, it's crazy.
I can probably name seven off the top of my head.
And I think that'll keep happening.
I think that'll keep growing.
And I don't think that's because people like necessarily streaming compute tools.
I think it's because they dislike Kafka.
And they dislike programming against Kafka.
SQL has evolved enough now and most people understand at least the basics of a tumbling window or something like that when they're going to do streaming analytics.
And now that you can express
most of those things in SQL,
that's what people want to do a lot of times.
Kafka has to be a pipe for them
in order to write the SQL query that they want.
It doesn't translate into every application,
but so many of them can be solved
with a relatively straightforward SQL query.
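For example, a tumbling-window aggregation that would take a stateful, hand-rolled consumer to build directly against the Kafka client API is a few lines in Flink-style streaming SQL (assuming a hypothetical `clicks` stream with an `event_time` column):

```sql
-- Count events per one-minute tumbling window over a Kafka-backed stream.
SELECT
  window_start,
  window_end,
  COUNT(*) AS events
FROM TABLE(
  TUMBLE(TABLE clicks, DESCRIPTOR(event_time), INTERVAL '1' MINUTE))
GROUP BY window_start, window_end;
```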
And Kafka has
definitely been holding people back because they write an application directly against Kafka.
And it's an absolute nightmare. It's really easy to ship the first basic thing into production.
It is next to impossible to maintain a long-lived, evolving application that has local state that is an aggregation of some of the data in Kafka.
I don't believe there are actually a significant number of applications in the real world that do
interesting things directly against the Kafka protocol. They're either doing something extremely
basic, or they're using Kafka as a pipe to get their data into what is essentially a batch system.
Yeah. And I think the thing, the area that's going to be really, really interesting
is bringing those two worlds basically closer together.
And then also just kind of this idea of streaming batch
that you kind of see a lot of people talking about right now.
Basically, the ability to write an application
that looks like a batch processing job,
but is actually running and updating something
every couple of seconds, basically.
And I think the line between those two things will start to blur.
And I think that'll be a really good thing for everyone.
With our last two minutes here, I'm going to ask the uncomfortable question that I'm
sure you're asked all the time.
Programming against S3, and GCS and Azure, they're all relatively the same stuff.
When someone suggests to you that that's just another way for vendor capture, that you're getting trapped into S3, how do you think about and rationalize it?
Object storage
is the biggest commodity
in the cloud, man. It's like
VMs and then object storage.
You've got to put
servers in your basement at that point
if that's what you're worried about.
The APIs are standardized.
We wrote both Husky and WarpStream.
We use a single library to talk to object storage.
We ran Husky in AWS, GCP, Azure, GovCloud.
It's the same stuff.
Now, Amazon's object storage implementation is the best for sure, but the
other ones work well enough.
I view object storage as an absolute commodity.
I think it's bordering
on the POSIX API at this point.
It's available everywhere.
There's multiple open source implementations.
That one doesn't worry me at all.
Now, with a lot of the other stuff, I do think if you use Spanner, you're getting locked in.
But you know, that's a very different story.
Cool.
Well, we could probably ask a whole lot more questions,
but I think that's actually a great point to kind of segue into the end.
So thanks a lot, guys.
And thanks for being on our pod.
Yeah, thanks for having us.
It was awesome.
Thank you so much.