Orchestrate all the Things - Evaluating the streaming data ecosystem: StreamNative releases benchmark comparing Apache Pulsar to Apache Kafka. Featuring Chief Architect & Head of Cloud Engineering Addison Higham
Episode Date: April 7, 2022
Processing data in real time is on the rise. The streaming analytics market (which, depending on definitions, may just be one segment of the streaming data market) is projected to grow from $15.4 billion in 2021 to $50.1 billion in 2026, at a Compound Annual Growth Rate (CAGR) of 26.5% during the forecast period, as per Markets and Markets. A multitude of streaming data alternatives, each with its own focus and approach, has emerged in the last few years. One of those alternatives is Apache Pulsar. In 2021, Pulsar ranked as a Top 5 Apache Software Foundation project and surpassed Apache Kafka in monthly active contributors. In another episode in the data streaming saga, StreamNative just released a report comparing Apache Pulsar to Apache Kafka in terms of performance benchmarks. We caught up with StreamNative Chief Architect & Head of Cloud Engineering Addison Higham to discuss the report's findings, as well as the bigger picture in data streaming. Article published on VentureBeat.
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Processing data in real time is on the rise.
Historically, most organizations adopting the streaming data paradigm
are doing so driven by use cases such as application monitoring,
log aggregation, and data transformation.
Organizations like Netflix have been early adopters of the streaming data paradigm.
Today, there are more drivers to growing adoption.
According to a 2019 survey,
new capabilities in AI and machine learning,
integration of multiple data streams and analytics
are starting to rival these historical use cases.
The streaming analytics market,
which, depending on definitions, may just be one segment of the overall streaming data market,
is projected to grow from about $15.5 billion in 2021 to over $50 billion in 2026.
Again, historically, there has been a sort of de facto standard for streaming data, Apache Kafka.
Kafka and Confluent, the company that commercializes it,
are an ongoing success story,
with Confluent confidentially filing for IPO in 2021.
In 2019, over 90% of people responding to a Confluent survey
deemed Kafka as mission-critical to their data infrastructure.
However, as successful as Confluent may be,
and as widely adopted as Kafka may be,
one fact remains:
Kafka's foundations were laid in 2008.
A multitude of streaming data alternatives, each with its own focus and approach, has
emerged in the last few years.
One of those alternatives is Apache Pulsar.
In 2021, Pulsar ranked as a Top 5 Apache Software Foundation project and surpassed
Apache Kafka
in monthly active contributors.
In another episode of the Data Streaming Saga, StreamNative just released a report comparing
Apache Pulsar to Apache Kafka in terms of performance benchmarks.
StreamNative was founded in 2019 by some of Pulsar's core contributors and offers a fully
managed Pulsar as a service cloud
as well as professional services.
We caught up with StreamNative Chief Architect
and Head of Cloud Engineering, Addison Higham,
to discuss the report's findings
as well as the bigger picture in data streaming.
I hope you will enjoy the podcast.
If you like my work,
you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
Yes, my name is Addison Higham. I'm the head of product here at StreamNative.
So I've been part of the Pulsar community for almost four and a half years now, but I've been part of StreamNative for about two years. My background is as a software engineer; I was previously the architect of platform and data at a company called Instructure that's in the ed tech space.
And there I solved a lot of different problems around platform technologies and helped to build sort of resilient and scalable infrastructure for that team. And one of the projects I was involved in towards the end of my time there was solving a problem around message bus and messaging technology.
And, you know, I came across Apache Pulsar and got very involved with the community there. That was a really exciting time for me, getting involved in that community; it led to, you know, being really successful with that and then to joining StreamNative, where I had a chance to really get deeper into the technology and was really excited about that. And that's how I ended up here at StreamNative.
Okay, thank you.
I know that the main course, let's say,
for the conversation today is the fact
that you're
about to release a benchmark in which you're comparing Apache Pulsar to Apache Kafka, and we're going to get to that. But actually, before we do, I thought it may be interesting to, well, set the backdrop, let's say, in a way. So to be honest with you, I haven't had any real chance
to chat to people
from the Apache Pulsar community
for a while.
I checked and the last time
it was like three years ago
and a lot has happened since then.
So last time I actually chatted to someone,
it was people from Streamlio, which was the company that was at the time kind of fostering, let's say, the development of Pulsar.
And then a number of things happened. So Streamlio got acquired and then StreamNative came along. And so I was wondering if you could just share a little on the background, let's say, of
how StreamNative came along and what the thinking is actually behind the company and
its goals, let's say.
Yeah, I'm happy to share a little bit on that. So Sijie Guo and Jia Zhai, you know, had both been working with the technology; Sijie was also one of the original creators of Pulsar.
He had previously, you know, worked at Yahoo, had then worked at Twitter where he was using similar technology with BookKeeper, Apache BookKeeper.
And he actually was part of the Streamlio founding team as well.
And, you know, Streamlio had really taken an approach in the market
with initially focusing a bit on Apache Heron
and then pivoting a little bit later to Pulsar.
You know, Sijie had really remained pretty focused
on the Pulsar side of the world.
And as Streamlio kind of wound down for various reasons, he was really still excited to continue on the journey of bringing Pulsar to market and really helping the technology mature, and he co-founded StreamNative along with Jia, with Matteo, who is also from Streamlio, later joining the team as the CTO. With those three together, there's a lot of excitement about the continuing adoption and maturity of Pulsar.
And so the thinking of the company was also a little bit different than Streamlio, in the sense that we're really focused on continuing to foster the community and add polish, you know, better documentation,
training, events. Those are sort of the things that we've really taken on to help grow that community.
And we've really seen that translate into broader adoption, you know, with earlier adoption case
studies like Tencent and Iterable, and now a growing number of adoption stories that are often, you
know, developer-led, driven by some of those people who've maybe felt the pain points of using other streaming technologies
and have been really looking for a more complete solution.
And so we've been really excited to kind of see that community grow that way.
And that's really, you know, our approach and now trying to continue to kind of tell that story to the market.
Okay, yeah. Thanks for the recap. I thought, you know, from the outside, let's say, it looks like that kind of story, but, you know, it's always best to actually hear it from someone on the inside, rather than assume. So thanks for sharing that. Okay.
So I guess then we can move on to the main course,
which is the benchmark itself.
And yeah, I was just wondering if you'd like to, again,
summarize, well, first of all, the rationale.
So what's the thinking behind releasing that benchmark now?
And then how it was set up and the main findings.
Yeah, happy to dig into that. So there's always comparisons that we see between Pulsar and
Kafka. It's one of those common questions we get. And in reality, we are not a company to try and
say, hey, one is better than the other, or that's not kind of the way we approach this problem space in general.
But it is a common question that we receive. There's lots of curiosity about how do we compare these two technologies?
Where are they stronger? Where does one shine more than the other? What makes Pulsar different? And so really a lot of this benchmark is focused around telling that story of, you know, how do they compare from a performance perspective. And very quickly, what we always like to highlight in the key findings of this benchmark report is that, you know, both systems are very capable systems.
For many use cases, they can both achieve very, very high performance, you know, relatively low latency, et cetera.
But Pulsar, you know, has some improvements in certain areas that, from what we tend to see in the market, are becoming more and more common use cases and needs. So one of those primary findings that we like to point out is that Pulsar can deliver higher throughput on relatively the same amount of hardware. So this is really about better efficiency, better utilization, and that's when you have kind of like-for-like durability values. Another number that we're pretty proud of is that Pulsar can have significantly lower, single-digit publish latency, particularly when it comes to using larger numbers of topics or partitions.
And finally, one of the big key benchmark, you know, things we like to point out is that
in the world of streaming, it's not just about how fast you can write the data. You often have
mixed workloads where, you know, you have built up a large backlog of data, maybe because you
just are doing some periodic job that wants to go pick up
data, or maybe you had downtime and you need to recover and read back a large backlog of data.
And so if you can't do that significantly faster, you know, you're going to have issues
in catching up to real time. And so Pulsar has a significant advantage over Kafka in reading that older historical data,
which can be very, very helpful in those sort of mixed workload cases or recovering from downtime and whatnot.
Okay, thanks.
So, yeah, it was also rather clear to me just by skimming through the benchmark, to be honest with you,
that it has a clear focus on measuring performance.
And that makes sense for a number of reasons.
I was wondering if you could possibly compare, let's say, your offering, or actually Apache Pulsar, let's say the straight-up open source version,
to Apache Kafka on other parameters as well.
I know that it's a harder, much harder question to answer, because as opposed to benchmarks in which you have clear metrics, that's a much tougher ask. However, I'm sure that it's something that
you also get asked a lot. So how would you approach answering that question?
Yeah, it's a great question. And I think really, for us, this is often the bigger area of focus. The reality is that most of these systems, in many situations, as mentioned, can behave fairly similarly, but where we really try to differentiate with Pulsar is in those kind of additional features, the management story, what's the developer experience, etc.
And so first, to focus a little bit on the developer side of the world: Pulsar historically was built to replace JMS workloads, so traditional messaging workloads. So its API, its model, is much more message-oriented, more of a pub-sub messaging-oriented system that developers might be more familiar with from something like RabbitMQ or another messaging system in that vein.
And so we really have found, in my experience actually, that developers took to that API
very quickly.
It was really easy to learn, easy to kind of model those problems, but then layered
on some of those additional streaming features.
And so developer experience is a big focus there with that API that's really simple to
use.
You don't have to think about complexities of consumer groups and how many partitions
and whatnot.
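To make that point concrete, here is a minimal sketch of what that message-oriented API looks like with the Pulsar Python client (pulsar-client); the service URL, topic, and subscription names are placeholders:

```python
import pulsar

# Connect to a Pulsar cluster (placeholder URL for a local broker).
client = pulsar.Client('pulsar://localhost:6650')

# Producing: create a producer on a topic and send a message.
producer = client.create_producer('persistent://public/default/my-topic')
producer.send('hello pulsar'.encode('utf-8'))

# Consuming: subscribe with a named subscription; Pulsar tracks the
# cursor per subscription, so there are no consumer groups or manual
# offset management to reason about.
consumer = client.subscribe('persistent://public/default/my-topic',
                            subscription_name='my-subscription')
msg = consumer.receive()
print(msg.data().decode('utf-8'))
consumer.acknowledge(msg)  # individual acknowledgement, JMS-style

client.close()
```

Switching the subscription type (exclusive, shared, failover, key-shared) is a small change on the subscribe call, which is part of what is meant here by layering streaming features onto a simple messaging API.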
There are also a lot of design aspects there where we make it really simple to say, okay, you can have 100,000 topics, and that doesn't come with a really challenging scaling component relative to Kafka, for example. And then also on sort of that platform-needs side,
as I kind of mentioned, that that's, you know, I was looking in my initial adoption of Pulsar
about a platform technology that could be kind of used across our organization. And that's what we
often see with Pulsar is that, or when companies are looking in this space, is they're really
looking for a technology that they can use across all their teams, that they can standardize on. And Pulsar's multi-tenancy, being built into a system that's designed to share those underlying resources, is just super valuable for companies at scale. We regularly interact with companies who are using Kafka and have found that they have just a large sprawl of hundreds or even thousands of Kafka clusters, kind of one per application, and it ends up being not very cost effective to size each one of those and manage utilization across all of them. And a system like Pulsar, with its built-in multi-tenancy, designed to safely share that workload, with built-in quota management, et cetera, really can help companies have a much smaller number of clusters that really efficiently use those resources, and makes it easier to offer it sort of as a service within their company, with one team managing all of that,
et cetera. Likewise, that's where the geo-replication features that are just built in as a single API call are, for a lot of organizations, kind of a game changer: being able to say, okay, this is not a separate tool we need to run. This is built into the system to handle those use cases.
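As a rough illustration of the multi-tenancy and geo-replication points above, here is a sketch using Pulsar's admin REST API from Python. The host, tenant, namespace, and cluster names are made up, and the exact replication endpoint is an assumption that should be checked against the admin API docs for your Pulsar version:

```python
import requests

# Placeholder admin endpoint of one broker or proxy.
ADMIN = 'http://localhost:8080/admin/v2'

# Create a tenant allowed to use two clusters (multi-tenancy: teams
# share one Pulsar deployment instead of one cluster per application).
requests.put(f'{ADMIN}/tenants/acme',
             json={'adminRoles': ['acme-admins'],
                   'allowedClusters': ['us-west', 'us-east']})

# Create a namespace for one team or application within that tenant.
requests.put(f'{ADMIN}/namespaces/acme/orders')

# Enable geo-replication for that namespace by listing the clusters it
# should replicate across (the CLI equivalent is `pulsar-admin
# namespaces set-clusters acme/orders --clusters us-west,us-east`);
# the exact REST path here is an assumption, verify for your version.
requests.post(f'{ADMIN}/namespaces/acme/orders/replication',
              json=['us-west', 'us-east'])
```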
And then finally, for those teams operating it, there is really the aspect of lowering the overhead of maintenance. Pulsar works very much like you want it to. You add additional capacity and it is immediately usable by the cluster. There's no need for additional tooling for rebalancing or for reconfiguring how and where topics live; that concern is handled entirely within Pulsar. And so those sorts of things, beyond just, hey, performance numbers, are really where a lot of our customers and community members who are adopting Pulsar have found the most value.
Okay, thank you.
And I guess, obviously, that user segment, so people and organizations who are existing users of Kafka, who for one reason or another choose to consider
or even migrate eventually to Pulsar are important for you and hence also the benchmark.
And last time I checked, there was no direct, let's say, equivalence or compatibility between those two APIs, but rather there was a sort of, well, in-between middleware, let's say, that translated Kafka API calls to Pulsar API
calls. Is that still the case? Has there been any development, let's say, on that front?
Yeah, it's a great question. So there has been, and one development we're very excited about, that we've been working on, initially actually in partnership with Tencent, is a project we call KOP, which stands for Kafka on Pulsar.
We really see that a lot of Pulsar's innovation is in its architecture and that ability to have separate compute and storage, which really allows for some of its flexibility.
And so because of that innate flexibility, we, starting about 18 months ago, two years ago,
implemented a new kind of lower level API, developer focused API, I should say,
called protocol handlers. And they actually allow you to reuse kind of Pulsar's internals and expose a different API.
You know, it could be anything, but in this case, it's a re-implementation of the Kafka API.
So KOP allows you, within your existing Pulsar cluster, with no separate component and no change in libraries, to have a fully compliant Kafka API that, you know, you can seamlessly interact with. And we've started to see, you know,
some adoption of that and some real excitement around that as a mechanism, not only just for
kind of a migration story, but actually just for long-lived Kafka use cases, right?
Organizations, and we acknowledge that organizations have a large investment in,
you know, that Kafka API applications they've built.
It's a large ask to say, okay, yeah, you're going to get all these benefits, but you have to, you know, migrate to this whole other system and rewrite all your applications.
And so Kafka on Pulsar really is a way for allowing teams to start investing in Pulsar, continue to use their applications for long-term use cases that will still be using the Kafka API.
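As a sketch of what that looks like in practice, a stock Kafka client should be able to point at a KOP-enabled Pulsar cluster unchanged. Here kafka-python is used; the broker address, port, topic, and group names are placeholders and depend on how the KOP listener is configured:

```python
from kafka import KafkaProducer, KafkaConsumer

# An existing Kafka application: nothing here references Pulsar.
# It simply points at the address where the KOP protocol handler
# is listening (placeholder host and port).
producer = KafkaProducer(bootstrap_servers='pulsar-broker:9092')
producer.send('orders', b'{"order_id": 42}')
producer.flush()

consumer = KafkaConsumer('orders',
                         bootstrap_servers='pulsar-broker:9092',
                         group_id='billing',
                         auto_offset_reset='earliest')
for record in consumer:
    print(record.offset, record.value)
    break  # read a single record for the sake of the example
```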
And we're actually really excited that we'll be bringing that into StreamNative offerings with our StreamNative Cloud pretty soon here. So we're very excited about that development, to really help accelerate organizations that are interested in Pulsar but already have some of that investment in Kafka.
Okay, thanks. I was going to ask you about StreamNative Cloud anyway,
but before I do, I have another kind of catching-up question.
So again, last time I checked,
the SQL capabilities in Pulsar were supported through Presto.
And I guess, well, I wonder if that's still the case.
And actually to add to that,
if you could just quickly refer to,
well, transformation capabilities
that you have in terms of streaming support?
Yeah, great, great question.
So yeah, first to touch on the Pulsar SQL side, that's actually enabled via the Trino side of the Presto community. So actually, as we speak, the connector that is used to read data from the storage tier in Pulsar is getting upstreamed into Trino. So we're really excited about that development and about working with that community.
And for a bit of context, that allows for running a query that takes sort of a snapshot of your topic and then querying that data out via SQL, really making it easy if you have some analytical use cases over a stream stored in Pulsar, et cetera. And we've seen quite a bit of adoption of that technology as a way of saying, okay, we don't have to go and store all this data in some other system when we just need to run some ad hoc queries and whatnot.
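For a sense of what an ad hoc query over a topic can look like, here is a hedged sketch using the Trino Python client against a coordinator that has the Pulsar connector configured; the host, catalog name, schema, and topic are assumptions and will depend on your deployment:

```python
import trino

# Connect to a Trino coordinator where the Pulsar connector is
# configured as a catalog named 'pulsar' (placeholder host/catalog).
conn = trino.dbapi.connect(
    host='trino-coordinator',
    port=8080,
    user='analyst',
    catalog='pulsar',
    schema='public/default',  # a Pulsar namespace exposed as a schema
)

cur = conn.cursor()
# Ad hoc query over data stored in a Pulsar topic, with no separate
# warehouse load step required.
cur.execute('SELECT COUNT(*) FROM "orders"')
print(cur.fetchall())
```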
And so that's been really exciting within the community, getting that upstreamed into Trino and making it really easy for those who are already there. And then, to the second part of that, on kind of the transformation side: Pulsar Functions has been part of Pulsar for quite a while now and has become a quite mature, stable aspect of Pulsar that allows for writing functions in a few different languages.
So Java, Go, Python, and that makes it easy for developers to kind of deal with just the business logic, right?
Pulsar Functions handles the inputs, the outputs, and really, you know, you're just getting that sort of Lambda-like API where you can transform messages.
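As a small sketch of that Lambda-like model, here is a Pulsar Function written with the Python SDK; the function expresses only the business logic, and the input and output topics would be supplied when deploying it (the class name and logic here are placeholders):

```python
from pulsar import Function

class UppercaseOrder(Function):
    """Single-message transformation: Pulsar Functions feeds each
    message from the configured input topic(s) into process() and
    publishes the return value to the configured output topic."""

    def process(self, input, context):
        # Business logic only; no producers, consumers, or offsets.
        context.get_logger().info("processing one message")
        return input.upper()
```

Deployment, including which topics to read from and write to, is handled outside the code through Pulsar's functions tooling.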
And we've seen a lot of usage of that, really businesses starting to build their core platforms on top of Pulsar Functions in some cases, which we've been really excited to see.
Okay, thanks. And then you also mentioned, well,
Stream Native Cloud. And I'm assuming that's probably the main revenue driver, let's say,
for the company, since Pulsar itself is open source, and that's a typical scenario that
companies formed around open source products follow.
So I was wondering if you could, again, just share the basics around your cloud offering and whether there are also any value-add services on top of the open source Pulsar on which it's built.
So StreamNative Cloud, yes, as you alluded to, is our fully managed, self-service deployment
of Apache Pulsar via some of StreamNative's additional value adds there. So one of the
unique things about our cloud, though, is that it's a very flexible offering. So we have both
what we call our managed cloud offering,
which runs in the customer's cloud account,
and our hosted cloud offering, which runs in our cloud.
And so there's really a pretty broad range there: a lot of organizations really love managed services, but maybe for compliance reasons, security reasons, et cetera, or maybe just for some use cases where they need to have more visibility and ownership over their data, they may struggle to put that data fully into, you know, a service provider's cloud accounts and hands, right?
With stream native cloud, they don't have to make that choice.
We can support either model.
And the way we do that is we've built a very powerful control plane that allows developers to programmatically create clusters, and it's built on a very flexible data plane, which is where the actual workloads live.
And that's been, for us, a strong product to help support the community.
You know, a lot of what we often see is developers really fall in love with the technology.
And as they look at how to make it a reality, you know, their organizations vary in their stance on sort of managed services, and their adoption really is dependent upon being able to have some of those managed services. And that's been a pretty typical story we've seen repeatedly with distributed cloud. And to speak a little bit to some of those
additional value adds, really where we see is Pulsar, as I kind of alluded to, has a lot of
very strong management features kind of built into the open source.
The geo-replication, the multi-tenancy, tiered storage, all of that is native.
And so when we really look at the value adds, a lot of that is around security compliance as well as around integrations.
And so we have some things there on the authentication and authorization side, with OAuth 2-based authentication, an additional authorization model actually arriving soon, as well as a number of other options for the enterprise security model, you know, things like an audit log, et cetera. And then the other big focus is kind of the ecosystem and integrations. As I kind of mentioned earlier, what we're calling StreamNative Cloud for Kafka is a feature coming very soon to the StreamNative Cloud that will allow for kind of one-click deployment of that KOP component, making it just seamless to also get support for the Kafka API.
And then finally, what I'll highlight here
is some of those integrations also with things
like a SQL engine, so you can get a streaming update
of SQL, very simple UI, nothing else to deploy, et cetera.
And those are really kind of the two big focuses, once again, of where we add value in StreamNative Cloud.
Okay, thanks.
And well, let's wrap up with what may well be the hardest actually question to answer.
So you've already touched upon comparing, let's say, Pulsar with Kafka on a number of fronts.
However, those are obviously not the two only options around.
And I know, having quickly read your bio,
that you've also personally worked with solutions like Flink, for example.
So as quickly and painlessly as possible,
given we don't have an infinite amount of time for that,
how would you say Pulsar positions itself relative to Flink and Spark Streaming and Redpanda? So what are the main differences, let's say? How would you say Pulsar differentiates itself compared to those options?
Yeah.
The Flinks, the Sparks of the world, those streaming compute engines: our goal there, maybe relative to something like Kafka with KSQL, et cetera, is that we're not really focused on trying to build one of those streaming compute engines. Pulsar Functions, for example, is really not designed to handle that entire set of use cases and problems. We're really focused on ease of use and sort of the simple 80% of use cases, single-message transformations, et cetera.
And so what that means for us with those compute engines is that we're really focused on just a great integration story of building a best of breed connector that's very flexible.
And that's in general, when we look at some of those other tools within the streaming ecosystem,
also like Kafka, we're very focused on interoperability
and really being a solution that can scale across
a wide range of different APIs, different technologies,
different integrations.
Really, once again, these streaming technologies
sit sort of central to so many different services.
We think that that's just critical
and that's why we try and position Pulsar
as being a great option for
sitting central to all of those. And we really see them as, you know, partners. On the side of other streaming storage and transports, you mentioned Redpanda, and, you know, we share a lot of interest with them in terms of, okay, how do we solve some of those core Kafka pain points?
But some of those pain points also sit not just in the implementation, but also in the underlying protocol, the underlying flexibility.
And so we see that as not something that's easy to solve just within kind of the limitations of the Kafka API. We support it, but we also have more that we can offer beyond there via Pulsar's APIs, as well as, which I actually didn't mention, other protocol handlers that we support, such as the RabbitMQ flavor of AMQP, as well as MQTT.
So we're really about that model of interop between lots of different protocols while still offering extremely high, consistent performance. So that's kind of how we see ourselves relative to a lot of
those competitors.
Okay, thanks. And one last one, this time really the last one. So if you can share just a sneak peek, let's say, into your roadmap: what's on your agenda for the next year or so, or maybe a couple of years, if you can look that far ahead?
Yeah, so Pulsar is definitely focused a lot
on continuing to stabilize
and add additional exciting features into the community.
One of those things coming up very soon is an API we call the table service, which allows for taking a stream and, within your application, turning that into something like a table view.
And lots more coming there.
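To illustrate the idea, here is a conceptual sketch of that stream-to-table pattern built with the plain Python reader API; it is not the table service API being described, just an illustration of materializing the latest value per key inside an application (topic and URL are placeholders):

```python
import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Read a keyed topic from the beginning and keep only the latest value
# per key: the stream materialized as a table inside the application.
reader = client.create_reader('persistent://public/default/user-profiles',
                              start_message_id=pulsar.MessageId.earliest)

table = {}
while reader.has_message_available():
    msg = reader.read_next()
    table[msg.partition_key()] = msg.data()

print(f'materialized {len(table)} keys')
client.close()
```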
Happy to chat more about that with anybody who's interested in getting involved with the Pulsar community. And on the StreamNative side, we're really excited about StreamNative Cloud for Kafka, as well as continuing the story on the integration with more compute and functions.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.