Orchestrate all the Things - Evaluating the streaming data ecosystem: StreamNative releases benchmark comparing Apache Pulsar to Apache Kafka. Featuring Chief Architect & Head of Cloud Engineering Addison Higham

Episode Date: April 7, 2022

Processing data in real-time is on the rise. The streaming analytics market (which depending on definitions, may just be one segment of the streaming data market) is projected to grow from $15.4 billion in 2021 to $50.1 billion in 2026, at a Compound Annual Growth Rate (CAGR) of 26.5% during the forecast period, per MarketsandMarkets. A multitude of streaming data alternatives, each with its own focus and approach, has emerged in the last few years. One of those alternatives is Apache Pulsar. In 2021, Pulsar ranked as a Top 5 Apache Software Foundation project and surpassed Apache Kafka in monthly active contributors. In another episode in the data streaming saga, StreamNative just released a report comparing Apache Pulsar to Apache Kafka in terms of performance benchmarks. We caught up with StreamNative Chief Architect & Head of Cloud Engineering Addison Higham to discuss the report's findings, as well as the bigger picture in data streaming. Article published on VentureBeat

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Processing data in real time is on the rise. Historically, most organizations adopting the streaming data paradigm have been doing so driven by use cases such as application monitoring, log aggregation, and data transformation. Organizations like Netflix have been early adopters of the streaming data paradigm. Today, there are more drivers to growing adoption.
Starting point is 00:00:29 According to a 2019 survey, new capabilities in AI and machine learning, integration of multiple data streams and analytics are starting to rival these historical use cases. The streaming analytics market, which depending on definitions may just be one segment of the overall streaming data market, is projected to grow from about $15.5 billion in 2021 to over $50 billion in 2026.
Starting point is 00:00:56 Again, historically, there has been a sort of de facto standard for streaming data, Apache Kafka. Kafka and Confluent, the company that commercializes it, are an ongoing success story, with Confluent confidentially filing for IPO in 2021. In 2019, over 90% of people responding to a Confluent survey deemed Kafka as mission-critical to their data infrastructure. As successful as Confluent may be, and as widely adopted as Kafka may be, however,
Starting point is 00:01:25 the fact remains: Kafka's foundations were laid in 2008. A multitude of streaming data alternatives, each with its own focus and approach, has emerged in the last few years. One of those alternatives is Apache Pulsar. In 2021, Pulsar ranked as a top-five Apache Software Foundation project and surpassed Apache Kafka in monthly active contributors.
Starting point is 00:01:48 In another episode of the data streaming saga, StreamNative just released a report comparing Apache Pulsar to Apache Kafka in terms of performance benchmarks. StreamNative was founded in 2019 by some of Pulsar's core contributors and offers a fully managed Pulsar-as-a-service cloud as well as professional services. We caught up with StreamNative Chief Architect and Head of Cloud Engineering, Addison Higham, to discuss the report's findings
Starting point is 00:02:14 as well as the bigger picture in data streaming. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook. Yes, my name is Addison Higham. I'm the head of product here at StreamNative. So I've been part of the Pulsar community for almost four and a half years now, but I've been part of StreamNative for about two years. My background is as a software engineer; I was previously the architect of platform and data at a company called Instructure that's in the ed tech space. And there I solved a lot of different problems around platform technologies and helped to build sort of resilient and scalable infrastructure for that team. And one of the projects I was involved in towards the end of my
Starting point is 00:03:05 time there was solving a problem around message bus and messaging technology. And, you know, I came across Apache Pulsar and got very involved with the community there. And that was a really exciting time for me, to kind of get involved in that community, and it led me to, you know, being really successful with that and then to join StreamNative. I had a chance to really get deeper into the technology and was really excited about that. And that's how I ended up here at StreamNative.
Starting point is 00:03:37 Okay, thank you. I know that the main course, let's say, for the conversation today is the fact that you're about to release a benchmark in which you're comparing Apache Pulsar to Apache Kafka, and we're going to get to that. But actually, before we do, I thought it may be interesting to, well, set the backdrop, let's say, in a way. So to be honest with you, I haven't had any real chance to chat to people
Starting point is 00:04:10 from the Apache Pulsar community for a while. I checked, and the last time was like three years ago, and a lot has happened since then. So last time I actually chatted to someone, it was people from Streamlio, which was the company that was at the time kind of fostering, let's say, the development of Pulsar. And then a number of things happened. So Streamlio got acquired and then StreamNative came along, and so I was wondering if you could just share a little on the background, let's say, of
Starting point is 00:04:47 how StreamNative came along and what the thinking is actually behind the company and its goals, let's say. Yeah, I'm happy to share a little bit on that. So Sijie Guo and Jia Zhai were both, you know... Sijie was also one of the original creators of Pulsar. He had previously, you know, worked at Yahoo, and had then worked at Twitter, where he was using similar technology with Apache BookKeeper. And he was actually part of the Streamlio founding team as well. And, you know, Streamlio had really taken an approach in the market of initially focusing a bit on Apache Heron and then pivoting a little bit later to Pulsar. You know, Sijie had really remained pretty focused
Starting point is 00:05:45 on the Pulsar side of the world. And as Streamlio kind of wound down for various reasons, he was really still excited to continue on the journey of bringing Pulsar to market and really helping the technology mature, and co-founded StreamNative along with Jia, with Matteo later joining the team as the CTO, also coming from Streamlio.
Starting point is 00:06:12 With those three together, really being there to support the continuing adoption and maturity of Pulsar. And so the thinking of the company was also a little bit different than Streamlio, in the sense that we're really focused on continuing to foster the community and kind of polish it with, you know, better documentation, training, events. Those are sort of the things that we've really taken on to help grow that community. And we've really seen that translate into broader adoption, you know, with earlier adoption case studies like Tencent and Herbal, and now a growing number of adoption stories that are often, you know, developer-led, driven by some of those people who've maybe felt the pain points of using other streaming technologies, etc., and have been really looking for a more complete solution.
Starting point is 00:07:13 And so we've been really excited to kind of see that community grow that way. And that's really, you know, our approach, and now we're trying to continue to kind of tell that story to the market. Okay, yeah. Thanks for the recap. I thought, you know, from the outside, let's say, it looks like that kind of story, but, you know, it's always best to actually hear it from someone on the inside, rather than assume. So thanks for sharing that. Okay. So I guess then we can move on to the main course,
Starting point is 00:07:48 which is the benchmark itself. And yeah, I was just wondering if you'd like to, again, summarize, well, first of all, the rationale. So what's the thinking behind releasing that benchmark now? And then how it was set up and the main findings. Yeah, happy to dig into that. So there are always comparisons that we see between Pulsar and Kafka. It's one of those common questions we get. And in reality, we are not a company that tries to say, hey, one is better than the other; that's not kind of the way we approach this problem space in general.
Starting point is 00:08:31 But it is a common question that we receive. There's lots of curiosity about how do we compare these two technologies? Where are they stronger? Where does one shine more than the other? What makes Pulsar different? And so really a lot of this benchmark is sort of focused around kind of telling that story of, you know, how do they compare from a performance perspective. And very quickly, what we always like to highlight in the key benchmark findings, or excuse me, in this benchmark report, is that, you know, both systems are very capable systems. For many use cases, they can both achieve very, very high performance, you know, relatively low latency, et cetera. But Pulsar kind of, you know, has some improvements in certain areas that, from what we tend to see in the market, are becoming more and more common use cases and needs. So some of those primary findings that we kind of like to point out are that Pulsar can deliver higher throughput on relatively the same amount of hardware. So this is really about better efficiency, better utilization, and that's when you have kind of like-for-like
Starting point is 00:09:47 durability values. Another number that we're pretty proud of is that Pulsar can have significantly a lower single digit published latency, particularly when it comes to using larger amounts of topics or partitions. And finally, one of the big key benchmark, you know, things we like to point out is that in the world of streaming, it's not just about how fast you can write the data. You often have mixed workloads where, you know, you have built up a large backlog of data, maybe because you just are doing some periodic job that wants to go pick up data, or maybe you had downtime and you need to recover and read back a large backlog of data. And so if you can't do that significantly faster, you know, you're going to have issues
Starting point is 00:10:36 Okay, thanks. So, yeah, it was also rather clear to me just by skimming through the benchmark, to be honest with you, that it has a clear focus on measuring performance. And that makes sense for a number of reasons. I was wondering if you could possibly compare, let's say, your offering, or actually Apache Pulsar, let's say the straight-up open source version, to Apache Kafka on other parameters as well.
Starting point is 00:11:31 I know that it's a harder, much harder question to ask, to answer because as opposed to benchmarks in which you can clear metrics, that's a much tougher ask. However, I'm sure that it's something that you also get asked a lot. So how would you approach answering that question? Yeah, great. It's a great question. And I think really this is often for us that the bigger area of focus, you know, the reality being most of these systems in many situations as mentioned,
Starting point is 00:12:04 you know, can behave fairly similarly, but where we really try and differentiate with Pulsar is in those kind of additional features, the management story, what's the developer experience, etc. And so, to first focus a little bit on the developer side of the world: Pulsar historically was built to replace like a JMS workload, so traditional messaging workloads. So its API, its model, is much more message-oriented, more of a pub-sub messaging-oriented system that developers might be more familiar with from something like a RabbitMQ or another messaging system kind of in that vein. And so we really have found, in my experience, actually, that developers took to that API very quickly.
Starting point is 00:12:54 It was really easy to learn, easy to kind of model those problems, but then layered on some of those additional streaming features. And so developer experience is a big focus there with that API that's really simple to use. You don't have to think about complexities of consumer groups and how many partitions and whatnot. There's also a lot of design aspects there where we make it really simple to say, okay, you can have 100,000 topics and that's not coming with a really challenging, scaly component
Starting point is 00:13:20 And then also on sort of that platform-needs side, as I kind of mentioned, in my initial adoption of Pulsar I was looking for a platform technology that could be kind of used across our organization. And that's what we often see with Pulsar, or when companies are looking in this space: they're really looking for a technology that they can kind of use across all their teams, that they can standardize on. And Pulsar's multi-tenancy, kind of being a built-in system that's designed to share those underlying resources, is, for companies at scale, super valuable. We regularly interact with companies who are using Kafka and have found that they have just a large sprawl of hundreds or even thousands of Kafka clusters, kind of one per application, and it ends up being not very cost-effective to size each one of those and manage utilization across all of them. And a system like Pulsar, with its built-in multi-tenancy designed to safely kind of share that workload, built-in quota management,
Starting point is 00:14:31 et cetera, really can help companies have a much smaller number of clusters that really efficiently use those resources, and make it easier to offer it sort of as a service within their company, one team managing all of that, et cetera. Likewise, that's where the features of geo-replication that are just built in as a single API call are, for a lot of organizations, kind of a game changer: being able to say, okay, this is not like a separate tool we need to run, this is built into the system to handle those use cases. And then finally, for those teams operating it, there is really the aspect of lowering the overhead of maintenance. Pulsar works very much like you want it to: you add additional capacity and it's immediately usable by the cluster. There's no need for additional tooling for rebalancing, or for kind of reconfiguring where topics live. That concern is handled entirely within Pulsar.
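To give a feel for what built-in multi-tenancy and geo-replication as a single API call look like, here is a hedged sketch using the Pulsar Java admin client, assuming a reasonably recent Pulsar release; the tenant, namespace, and cluster names are hypothetical.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.TenantInfo;

import java.util.Set;

public class MultiTenancySketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // placeholder admin URL
                .build();

        // Multi-tenancy: one shared cluster, many isolated tenants,
        // instead of one Kafka cluster per application.
        admin.tenants().createTenant("payments",
                TenantInfo.builder()
                        .allowedClusters(Set.of("us-east", "us-west"))
                        .build());
        admin.namespaces().createNamespace("payments/transactions");

        // Geo-replication as a single call: replicate the namespace
        // across both clusters.
        admin.namespaces().setNamespaceReplicationClusters(
                "payments/transactions", Set.of("us-east", "us-west"));

        admin.close();
    }
}
```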
Starting point is 00:15:32 built within Pulsar. And so those sorts of things are really where we see, you know, beyond just, hey, performance numbers is really where the, the, a lot of the true value we see for a lot of our customers and community members who are adopting Pulsar have found the most value. Okay, thank you. And I guess, obviously, that user segment, so people and organizations who are existing users of Kafka, who for one reason or another choose to consider or even migrate eventually to Pulsar are important for you and hence also the benchmark. And last time I checked, there was not direct, let's say, equivalence or a compatibility between those two APIs, but rather there was a sort of, well, in between middleware, let's say, that translated Kafka API calls to Pulsar API
Starting point is 00:16:35 calls. Is that still the case? Has there been any development, let's say, on that front? Yeah, it's a great question. So there has been, and one development we're very excited about, which we've been working on initially in partnership with Tencent, is a project we call KOP, which stands for Kafka on Pulsar.
Starting point is 00:16:54 We really see that a lot of Pulsar's innovation is in its architecture and that ability to have separate compute and storage, which really allows for some of its flexibility. And so because of that innate flexibility, we, starting about 18 months ago, two years ago,
Starting point is 00:17:14 implemented a new kind of lower-level API, a developer-focused API, I should say, called protocol handlers. They actually allow you to reuse kind of Pulsar's internals and expose a different API. You know, it could be anything, but in this case, it's a re-implementation of the Kafka API. So KOP allows, within your existing Pulsar cluster, with no different component and no change in libraries, for a fully compliant Kafka API that, you know, you can seamlessly interact with. And we've started to see, you know, some adoption of that and some real excitement around that as a mechanism, not only just for kind of a migration story, but actually just for long-lived Kafka use cases, right? We acknowledge that organizations have a large investment in, you know, the Kafka API applications they've built. It's a large ask to say, okay, yeah, you're going to get all these benefits, but you have to, you know, migrate to this whole other thing, rewrite all your applications. And so Kafka on Pulsar really is a way of allowing teams to start investing in Pulsar while continuing to use their applications, for long-term use cases that will still be using the Kafka API. And we're actually really excited that we'll be bringing that into StreamNative offerings with our StreamNative Cloud pretty soon here. So we're very excited about that development to really help accelerate organizations that are interested in Kafka, or, excuse me, interested in Pulsar, but already have some of that investment in Kafka.
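The practical upshot described here is that an unmodified Kafka client can talk to a Pulsar cluster with the KOP protocol handler enabled. Below is a hedged sketch of that idea; the broker address and port are placeholders and assume KOP has been configured to expose a Kafka-style listener.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KafkaOnPulsarSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Only the bootstrap address changes: it points at a Pulsar broker
        // running the KOP protocol handler (placeholder host and port).
        props.put("bootstrap.servers", "pulsar-broker.example.com:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The Kafka topic is mapped onto a Pulsar topic behind the scenes.
            producer.send(new ProducerRecord<>("orders", "order-1", "created")).get();
        }
    }
}
```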
Starting point is 00:18:11 Okay, thanks. I was going to ask you about StreamNative Cloud anyway, but before I do I have another kind of catching-up question.
Starting point is 00:19:03 So again, last time I checked, the SQL capabilities in Pulsar were supported through Presto. And I guess, well, I wonder if that's still the case. And actually to add to that, if you could just quickly refer to, well, transformation capabilities that you have in terms of streaming support? Yeah, great, great question.
Starting point is 00:19:28 So yeah, first to touch on the Pulsar SQL side, that's actually enabled via the Trino kind of side of the Presto community. And actually, as we speak, the connector that is used to read data from the storage tier in Pulsar is getting upstreamed into Trino. So we're really excited about that development and about working with that community. And yeah, that allows for running a query that takes sort of a snapshot of your topic and then querying that data out via SQL, really making it easy for use cases where you have some analytical needs over some streams stored in Pulsar, et cetera. And we've seen quite a bit of adoption of that technology; it's really a way of saying, okay, we don't have to, you know, go and store all this data in some other system when we just need to run some ad hoc queries and whatnot. And so that's been really exciting within the community, and getting that upstreamed into Trino makes it really easy for those who are already there.
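For a sense of what such an ad hoc query can look like, here is a hedged sketch that goes through the Trino JDBC driver against a pulsar catalog; the coordinator URL, namespace, and topic are hypothetical, and the Trino JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PulsarSqlSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder Trino coordinator; the "pulsar" catalog is the connector,
        // the schema is a tenant/namespace, and the table is a topic.
        String url = "jdbc:trino://localhost:8080/pulsar";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT * FROM pulsar.\"public/default\".orders LIMIT 10")) {
            while (rs.next()) {
                // Each row is a message read back from Pulsar's storage tier.
                System.out.println(rs.getString(1));
            }
        }
    }
}
```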
Starting point is 00:20:46 And then to the second part of that, on kind of the transformation side, Pulsar Functions has been part of Pulsar for quite a while now and has become a quite mature, stable aspect of Pulsar that allows for writing functions in a few different languages, so Java, Go, Python, which makes it easy for developers to kind of just deal with the business logic, right? Pulsar Functions handles the inputs and the outputs, and really, you know, you're just getting that sort of Lambda-like API where you can transform messages.
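Here is a minimal sketch of that "just the business logic" shape, using the Pulsar Functions Java API; the transformation itself is a made-up example, and a real deployment would point it at input and output topics when the function is created.

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

/**
 * A single-message transformation. Pulsar Functions wires up the input and
 * output topics, so the code only expresses the business logic.
 */
public class UppercaseFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        // Hypothetical logic; a real function might enrich, filter, or route.
        return input == null ? null : input.toUpperCase();
    }
}
```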
Starting point is 00:21:20 And we've seen a lot of usage of that. Really, businesses are starting to kind of build their core platforms on top of Pulsar Functions in some cases, which we've been really excited to see. Okay, thanks. And then you also mentioned, well, StreamNative Cloud. And I'm assuming that's probably the main revenue driver, let's say, for the company, since Pulsar itself is open source, and that's a typical scenario that companies formed around open source products follow. So I was wondering if you could, again, just share the basics around your cloud offering and whether there are also any value-add services on top of the open source Pulsar on which it's built. So StreamNative Cloud, yes, as you alluded to, is our fully managed, self-service deployment
Starting point is 00:22:28 of Apache Pulsar with some of StreamNative's additional value adds there. So one of the unique things about our cloud, though, is that it's a very flexible offering. We have both what we call our managed cloud offering, which runs in the customer's cloud account, and our hosted cloud offering, which runs in our cloud. And so there's really a pretty broad range there: a lot of organizations really love managed services, but maybe for compliance reasons, security reasons, et cetera,
Starting point is 00:23:03 can't fully commit to that, or maybe have some use cases where they need more visibility and ownership over their data and may struggle to put that data fully into, you know, a service provider's cloud accounts and hands, right? With StreamNative Cloud, they don't have to make that choice. We can support either model. And the way we do that is we've built a very powerful control plane that allows developers to programmatically create clusters, and it's built on a very flexible data plane, which is where the actual workloads live. And that's been, for us, a strong product to help support the community. You know, a lot of what we often see is developers really fall in love with the technology.
Starting point is 00:23:51 And as they look to how to make it a reality, you know, their organizations vary in how they approach managed services, and their adoption really is dependent upon being able to have some of those managed services. And that's been a pretty typical story we've seen repeatedly with StreamNative Cloud. And to speak a little bit to some of those additional value adds: really, where we see it is that Pulsar, as I kind of alluded to, has a lot of very strong management features kind of built into the open source. The geo-replication, the multi-tenancy, tiered storage, all of that is native. And so when we really look at the value adds, a lot of that is around security and compliance, as well as around integrations. And so we have some things there around the authentication and
Starting point is 00:24:46 authorization side, with OAuth 2-based authentication, an additional authorization model actually arriving soon, as well as a number of other options for the enterprise security model, you know, things like an audit log, et cetera. And then the other big focus is kind of the ecosystem and integrations. And, as I kind of mentioned earlier, what we're calling StreamNative Cloud for Kafka is a feature coming very soon to StreamNative Cloud that will allow for kind of one-click deployment of that KOP component, making it just seamless to
Starting point is 00:25:24 also get support for the Kafka API. And then finally, what I'll highlight here is some of those integrations also with things like a SQL engine, so you can get streaming SQL with a very simple UI, nothing else to deploy, et cetera. And those are really the two big focuses, once again, of where we add value in StreamNative Cloud.
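As an example of what that OAuth 2-based authentication looks like from the application side, here is a hedged sketch using the Pulsar Java client's OAuth 2 client-credentials support; the broker URL, issuer URL, key file, and audience are placeholders of the kind a managed service would hand out.

```java
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.impl.auth.oauth2.AuthenticationFactoryOAuth2;

import java.net.URL;

public class OAuthClientSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar+ssl://broker.example.com:6651") // placeholder
                .authentication(AuthenticationFactoryOAuth2.clientCredentials(
                        new URL("https://auth.example.com/"),          // issuer
                        new URL("file:///path/to/credentials.json"),   // key file
                        "urn:example:pulsar-cluster"))                 // audience
                .build();

        // Create producers and consumers as usual from here.
        client.close();
    }
}
```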
Starting point is 00:25:46 Okay, thanks. And well, let's wrap up with what may actually be the hardest question to answer. So you've already touched upon comparing, let's say, Pulsar with Kafka on a number of fronts. However, those are obviously not the only two options around. And I know, having quickly read your bio, that you've also personally worked with solutions like Flink, for example. So as quickly and painlessly as possible, given we don't have an infinite amount of time for that,
Starting point is 00:26:26 how would you say Pulsar positions itself relative to Flink and Spark Streaming and Redpanda? So what are the main differences, let's say? How would you say Pulsar differentiates itself compared to those options? Yeah. The Flinks, the Sparks of the world, those streaming compute engines: our goal there, maybe relative to something like Kafka with KSQL, et cetera, is that we're not really focused on trying to build
Starting point is 00:27:02 one of those streaming compute engines. Pulsar Functions, for example, is really not designed to handle that entire set of use cases and problems. We're really focused on ease of use and sort of the simple 80% of use cases, single-message transformations, et cetera. And so what that means for us with those compute engines is that we're really focused on just a great integration story, on building a best-of-breed connector that's very flexible. And in general, when we look at some of those other tools within the streaming ecosystem, also like Kafka, we're very focused on interoperability and really being a solution that can scale across
Starting point is 00:27:48 a wide range of different APIs, different technologies, different integrations. Really, once again, these streaming technologies sit sort of central to so many different services. We think that that's just critical, and that's why we try and position Pulsar as being a great option for sitting central to all of those. And we really see them as, you know, partners. On the side of
Starting point is 00:28:12 other streaming storage and transport systems, you mentioned Redpanda, and, you know, we share a lot of interest with them in terms of, okay, how do we solve some of those core Kafka pain points? But some of those pain points sit not just in the implementation, but also in the underlying protocol, the underlying flexibility. And so we see that as not something that's easy to solve just within kind of the limitations of the Kafka API. We support it, but we also have more that we can offer beyond there via Pulsar's APIs, as well as, I actually didn't mention, other protocol handlers that we support, such as the RabbitMQ flavor of AMQP, as well as MQTT. So we're really about that model of interop between lots of different protocols, while still offering extremely high, consistent performance. So that's kind of how we see ourselves relative to a lot of
Starting point is 00:29:16 those competitors. Okay, thanks. And last one, this time really the last one. So if you can share just a sneak peek, let's say, into your roadmap. So what's on your agenda for the next year or so, or a couple of years, maybe, if you can look that far ahead. Yeah, so Pulsar is definitely focused a lot on continuing to stabilize and add additional exciting features for the community.
Starting point is 00:29:45 Some of those things that are coming up very soon include an API we call the table service, which allows for taking a stream and, within your application, turning that into like a table view. And there's lots more coming there. Happy to chat more about that with anybody who's interested in getting involved with the Pulsar community. And on the StreamNative side, we're really excited about StreamNative Cloud for Kafka, as well as continuing the story on the integration with more compute and functions. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
