Orchestrate all the Things - The Ring Zero of real-time data processing: Redpanda scores $50M Series B funding to grow its streaming platform. Featuring CEO / Founder Alex Gallego

Episode Date: February 23, 2022

Maintaining compatibility with the de-facto market standard, while reimplementing and extending it. It's easier said than done, but that's what Redpanda is doing, and it seems to be working. Article published on ZDNet

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Maintaining compatibility with a de facto market standard while reimplementing and extending it. It's easier said than done, but that's what Redpanda is doing and it seems to be working. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration
Starting point is 00:00:23 on Twitter, LinkedIn, and Facebook. Thanks for having me, George. By means of introduction, my name is Alex, and I've been working in stream processing for about 13 years prior to starting Redpanda Data and working on Redpanda, the engine. I sold my previous company in 2016 to Akamai. It was called Concord. You can think of it as a competitor to Apache Flink or Apache Spark. And we built it on top of Mesos. And yeah, that's sort of the executive summary. Okay. Thank you. And then I guess the next thing would be to talk a little bit about Redpanda, the business, like your milestones and your headcount, some clients and use cases, if you'd like to refer to that. And of course, the specifics around the funding. Sure. We finished raising a $50 million round led by Google Ventures with participation from our
Starting point is 00:01:46 existing investors, Lightspeed and Haystack. And so I guess for those that are tuning in on the call and the podcast, Redpanda really was started as kind of a natural evolution of what I thought streaming should be. And in 2017, I guess the background of how we got started, so I can give the context of growth and numbers and fundamentals here, is that I wanted to understand what was the gap between what the hardware could do and what the software could do. And so that was really the genesis. And so in 2017, I kind of gave a talk about, you know, just what is the gap between hardware and software. And so in 2019, I started this company originally for experts by experts. Actually, it was designed for people that were like me, that were streaming experts that wanted something more with the storage.
Starting point is 00:02:50 And I wanted it to be compatible with the entire Kafka ecosystem. And so I wrote Redpanda, the storage engine, before I even decided to build the company. And so the interesting piece, to share the context of the numbers, is that about 40% of our customers are experts of streaming engines. And so they're either Kafka migrations or Pulsar migrations or some other classical technology like that. But what turns out is that the way we built Redpanda, which was a single binary, has no external dependencies on ZooKeeper
Starting point is 00:03:25 or Schema Registry or HTTP proxy, all that, allowed us to onboard net new users to streaming, people that had never even heard about what streaming is because they didn't feel that sort of, I guess, emotionally empowered to go and tackle some of these problems. And so anyways, with that context, when we released Redpanda, it was at first closed source. And in late 2020, yeah, late 2020,
Starting point is 00:03:57 we made it source available, and it's under the same license as CockroachDB. We actually took inspiration from them, which means for the first four years, we're the only company that is allowed to host Redpanda as a service. And after that, it becomes permissively licensed Apache 2. And then in 2021, we really started probably with hundreds of customers. By the middle of the year, we were in the thousands of customers, several thousands of customers.
Starting point is 00:04:29 And we ended the year in hundreds of thousands of Redpanda clusters coming alive. So it's felt like every nut and bolt rattled as we started to take off. And then with people too, we started last year with a little bit less than 20 and we ended with 60. And so it's been incredible growth over the last year. And I guess on the customer front, to contextualize for some of our listeners, is that we, you know, really to my surprise, at first I thought we were going to just capture the Kafka experts and, you know, classical Kafka migrations, but these new kind of users to Redpanda, net new users to streaming, really opened up a bunch of doors. And so with that, we were able to send Redpanda into suborbital, which I find the coolest
Starting point is 00:05:18 use case. It is a Redpanda process sitting in a satellite somewhere in orbit. So IoT, we also help oil and gas pipelines to measure the jitter. And so when you're shipping crude, the pipelines need to be monitored for actual physical jitter of the pipeline in case it blows up physically. And then, but also the classical, you know, fintech and gaming companies and ad tech and kind of web 2.0. Zenly from Snapchat is a public example, and Alpaca, which is a startup in the fintech space. So anyway, what we actually saw was sort of tremendous growth in just about every vertical. So it makes it kind of hard to explain it as just, oh, it's only one particular vertical. It's more about use case.
Starting point is 00:06:06 Hopefully that helps for contextualizing the audience. Yeah, yeah, it does. Thank you very much for sharing that. And I was going to ask, you know, obviously I looked around a little bit, trying to get some of the basics, let's say the key premises around Redpanda, before having this conversation.
Starting point is 00:06:26 And my initial impression was that it seems simple enough. I mean, the key premise, like Kafka compatible, similar kind of business model. I initially had the impression it was open core and software as a service. I guess I was wrong about the former, so it's source available, as you said. However, the rest I think I got right, and you can correct me if I missed something. Would you say that's an accurate description? And if yes, I would like to ask you to elaborate a little bit on, well, simple enough, but I'm sure it wasn't simple, it wasn't easy to execute on this simple idea. So what were the technical underpinnings that enabled you to build a faster Kafka while maintaining compatibility?
Starting point is 00:07:19 That was a really fun question. Remember that I was a principal engineer, and prior to this I was a CTO. So when I started Redpanda, I really wanted to understand why in 2017 I saw such a large gap between hardware and software. And so I literally connected two edge computers at Akamai with a cable, back to back, just to make sure that there was nothing in between these two computers. And I just wanted to measure, like understand what is the fundamental evolution of hardware and did software actually take advantage of modern hardware? And so when you look at existing solutions, they were built for decade-old hardware where spinning disk was the fundamental limitation of the previous platform. The new limitation is actually CPU coordination. And to not get too much into the weeds,
Starting point is 00:08:11 sometimes you really get to reinvent the wheel when the road changes. It's kind of the executive summary here, where if you were to start from scratch, what would you do differently? And what we did differently is that we built a single binary. But to just address your point directly of how would I describe Redpanda, I think it's simple, fast, reliable, and unified. And the last part was actually, I would say it was a discovery. Does that make sense?
Starting point is 00:08:44 Were you going to ask something else, George, or do you want me to continue? No, please go ahead. Okay. So on the unified part, so when we first started, we took a fundamental approach, which is called a thread-per-core architecture. And the idea is,
Starting point is 00:08:59 how do we take advantage of very tall computers? So if you look at the trend in computing, disks are a thousand times faster than they were a decade ago. And computers are 20 to 30 times taller, a.k.a. more cores. And so with this new platform, you would do things fundamentally differently. And so we started Redpanda in C++ with no dependencies, really trying to extract every ounce of performance out of the hardware. And that worked. And we, you know, we onboarded like electric car manufacturers, the largest CDN in the world, some of the largest banks.
Starting point is 00:09:31 But what we discovered, and this is, I think, a slight difference that we really need to do a better job on in our marketing and on our website, is it's true that Redpanda is a streaming platform that is simple, fast, reliable, a.k.a. doesn't lose data. But the last part, unified, is different. And what we learned is that in order for us to make a fundamental shift into how developers actually think about building applications, we needed to give them compatibility with the way developers are building applications. So meaning infinite data retention with things like Amazon S3. And so
Starting point is 00:10:06 I think those, I would say that the premise is simple. It's a simple, fast, reliable engine, but unified lifts up the risk, sort of allows developers to build a new category of applications that they couldn't have built before. Because, you know, for the developer, what it means to have unlimited data retention is that they don't have to worry about disaster recovery and, you know, they now have a backup. They don't have to worry a priori about which other databases or downstream systems they need to materialize. They simply push their data into Redpanda and we transparently tier it. And so it's relatively cost-effective to store even petabytes of data. Hopefully that helps. Yeah, thank you. I had gotten the part about re-implementing everything in C++, and, well, even that alone
Starting point is 00:10:58 I think would have made a difference, but I think you did a nice job of adding to that, well, elaborating on why it wasn't just that, basically. Which leads us to the next question. Earlier, you described a few of your use cases. So I was wondering if you can share with us a little bit about what your use cases look like. Basically, brownfield versus greenfield. The fact that you have Kafka compatibility obviously helps. So I imagine a few of your use cases will be
Starting point is 00:11:33 replacing Kafka, but you also mentioned that you can do some things that were previously not possible. So how many of your use cases are brownfield versus greenfield? Yeah, let me give you color here. Let's start with the brownfield. So by being Kafka API compatible, so I think that we have to give credit to Kafka and to Pulsar and to RabbitMQ. And that's really this huge family of streaming systems that came before Redpanda. I consider Redpanda really the natural evolution, if you will,
Starting point is 00:12:10 if you were an expert in streaming and you were like, okay, well, what is the new bottleneck in computing? And how do I design for that? But what Kafka did is really not, the Kafka broker was a fundamental piece in building the new streaming infrastructure. But what it did is it actually,
Starting point is 00:12:27 the most powerful thing that Kafka did is it created an ecosystem. And so it's the fact that it connects transparently to Spark streaming and to Flink streaming and to Materialize and to MongoDB and to ClickHouse to the extent that when we launched, for example, the partnership with MongoDB or with Materialize or MemgraphDB and all of these other partners,
Starting point is 00:12:49 SingleStore, for example, it just worked. And that is the experience that a lot of our brownfield folks get when we show up to a deal. Let's say we get to influence a decision maker. And so they always are worried. They're like, hey, I have 100,000 lines of code or 150,000 lines of code, right? Like a large body of work that connects with Kafka in a particular way. Let's say they're doing real-time fraud detection, or they're doing real-time online gas pipeline jittering, or they're building an Uber competitor that delivers food like Mr. Yum. Well, actually Mr. Yum is not an Uber competitor, but they work in the restaurant industry.
Starting point is 00:13:25 Anyways. And so what we found is that Kafka compatibility for us was really fundamental. And to the application developer, by the way, there's no code changes. And that perhaps is maybe the most radical thing, or maybe the most shocking thing that a developer experiences, like, hey, it just works. And it's not just that it works for their application.
Starting point is 00:13:50 It just works with the entirety of the ecosystem, whether you plug in TensorFlow. When I tested TensorFlow, it took me 30 minutes and I wrote a blog post in 15 because it just worked. There's often no hero migration. There's no hero story. There's nothing. It's simply a configuration change. And then you pick up and go. And so those are the brownfield. Those are existing
Starting point is 00:14:10 applications, like kind of classical Kafka uses of things like analytics or not necessarily foreground use cases. Now, on the greenfield, on new use cases, there's really two quick categories here. One is the way new applications are being built. So if you look at the way new databases are being built, let's say Materialize is a good example, and MemgraphDB is another good example. I know of three ML companies that are launching that are in stealth mode so far. But where the industry is heading is they want to leverage something like Redpanda or Kafka or some distributed ledger, right? Distributed source of truth, engine of truth. And then they build on top. And so an example is Materialize is a SQL engine on top of the log. And MemgraphDB is a graph database on top
Starting point is 00:15:02 of the log. So that's one particular example for the advanced use case. And then the last one I'll mention on the greenfield is this company called Alpaca. And so they trade everything from cryptocurrencies to derivatives to, you know, basically it's a fintech API. And they're growing super rapidly. And the difference between that use case and the brownfield is that Redpanda sits in the foreground. That is, every time you interact with their API and their application, every single message that goes through
Starting point is 00:15:38 the Alpaca trading framework, it goes through Redpanda. And what we give them is basically this huge compatibility with the ecosystem, but we're so fast that it sort of unlocks new revenue streams for them. So that hopefully could give you some color as to how the classical migrations happen, which is that it just works. And then this new greenfield. And I actually think we're still early in the greenfield because people are just trying to understand right now
Starting point is 00:16:07 what Redpanda is and how it can help them. Okay. Another question I had was about, well, the focus, let's say, of your use cases and the actual layer that you're targeting. And what I mean by that is that I tried to look around a little bit and get a sense of the capabilities, let's say, in your offerings, and I got the impression that it seems like the suggested use is more on the transport layer and not so much as a processing
Starting point is 00:16:41 or transformation layer. For example, I've seen a blog post in which it seems like the suggested use would be to use Redpanda in conjunction with Flink. And I don't think I have seen a SQL interface, for example. So is my impression correct? And if yes, is this an area you're aiming to expand to in the future? So two things here. One is that stream processing is a relatively rich kind of area. It's similar to saying databases. There's many flavors and many capabilities here. So let's break down stream processing into complex stream processing and simple transformations.
Starting point is 00:17:29 So for the simple transformations, let's say a simple transformation is you take a person object and you want to mask all of the private and sensitive information, like your social security number in the US. And so often the problem that people face, and this is how we're advancing the state of the art of streaming for this one-shot transformation, not for complex processing, is: do you read your data from Redpanda and then ping pong it into a separate system like Flink or Spark or whatever it is?
Starting point is 00:18:03 Do you transform it, you mask it, that is you take the social security number and you replace it with XXXXX, and then you save it back into Redpanda? So we developed an in-broker processing engine. We're calling it the Wasm engine, really because at some point (right now it only takes JavaScript) it'll take WebAssembly pretty soon. It is the ability to do these one-shot transformations inside the broker itself, eliminating the data ping pong for the simple cases. And so that's really where I think that Redpanda shines. Now on the complex stream processing side, whether you have SQL or you have, another one of our partners has this really beautiful Python experience.
Starting point is 00:18:51 They're called Deephaven, and you can process and visualize and use all of Python's power locally. It's kind of an interactive notebook in my view. And so there's just this huge ecosystem of mature and innovative companies that are innovating at that level. And I feel that the ring zero, the source of truth, is not a solved problem. And that's really where we shine. And so for the time being, we hope to stay in this space.
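As a rough illustration of the one-shot masking transformation described above (read a record, replace the social security number with a placeholder, write it back), here is a minimal Python sketch. It is not Redpanda's in-broker API, which the transcript says takes JavaScript at this point; the field name `ssn`, the mask string, and the function names are assumptions made up for the example.

```python
import json
import re

# Hypothetical names: the transcript only says "social security number".
SENSITIVE_FIELD = "ssn"
MASK = "XXX-XX-XXXX"
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def mask_record(raw: bytes) -> bytes:
    """Return a JSON-encoded record with the social security number masked."""
    record = json.loads(raw)
    masked = {}
    for key, value in record.items():
        if key == SENSITIVE_FIELD:
            # Mask the dedicated field regardless of its format.
            masked[key] = MASK
        elif isinstance(value, str):
            # Also mask SSN-shaped substrings in other top-level string fields.
            masked[key] = SSN_PATTERN.sub(MASK, value)
        else:
            masked[key] = value
    return json.dumps(masked).encode("utf-8")


def transform(batch):
    """Apply the stateless, one-record-at-a-time transformation to a batch."""
    return [mask_record(raw) for raw in batch]
```

Because each record is handled on its own with no state carried between records, this kind of transformation is a candidate for running next to the data, which is the "no ping pong" point made in the answer.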
Starting point is 00:19:26 And so the alternative is that then we partner with all of these other database companies, whether you're Mongo or SingleStore or Materialize or Deephaven and so on. I think it builds a richer ecosystem. I think having companies that are focused on specific layers yields a better product. Okay. Thanks for the elaboration. And I was also wondering, I think you mentioned as well yourself earlier when you were referring to some of your use cases.
Starting point is 00:19:56 I was wondering about your experience with people using Redpanda to support machine learning use cases and actually real-time machine learning. Do you have any of those? And do you see them often? Do you think that's something that Redpanda is a good fit for? Yeah, actually in fintech. So what's interesting about a lot of the techniques that people have adopted into big data is that a lot of them have been in
Starting point is 00:20:27 use for a really long time in fintech, in particular in the HFT shops, but they were always built before as these one-off solutions, right? That was the secret sauce of this particular HFT shop and so on. I think the trend, the difference in machine learning now is that, first of all, there are these fantastic engines that are being released, like, you know, TensorFlow and so on. What is challenging about those is actually continuously retraining the model and being able to replay it. So for machine learning, I actually think Redpanda is a really good fit. And there are, like I mentioned, I know of one, two, three companies that are in stealth mode, three startups that are in stealth mode, all backed by fantastic investors, tier one investors, that are about to come out of stealth, that are working on this problem itself. And so there's, I think, multiple tiers to this problem.
Starting point is 00:21:23 But where Redpanda fits into this storyline is not on the ML algorithms, right? That's TensorFlow, it's got that covered, and Spark ML has got that covered, and there's, you know, deep Java or whatever it's called, Deep4j I think it's called, this library. So there are these very special layers for the actual machine learning algorithm. What Redpanda brings to the table is a scalable, effectively back pressure valve that allows the machine learning algorithm to replay. So let's be specific with one use case. Fraud detection is kind of this classical example for ML and real-time ML, where let's say that you have a credit score application where you're giving, let's say, you know, men and women a credit card.
Starting point is 00:22:07 And, you know, you had a bias in your credit score application. Now you want to go back and reprocess the entire history. This is where Redpanda really shines. It allows your application to reprocess the entire history of all of your events that led into that decision without changing a single line of code on the application level. And so really what Redpanda is doing is sort of creating a new engine of record that allows these ML algorithms to reprocess the data, have access controls, have back pressure, spill to disk in case you get a ton of load. It's really kind of a centerpiece of the architecture. And then you combine that with TensorFlow and maybe a serving layer. And now you have an entire serving category and a serving
Starting point is 00:22:55 product. So you can think of Redpanda as, you know, I like to call it ring zero. It's kind of the very bottom layer to build a scalable real-time ML company. Okay, thank you. And by the way, you said you are in touch with a few companies working on real-time machine learning that are about to come out of stealth. Well, if you want to send them my way, feel free. I would love to talk to them. I guess we're actually already a bit over time, so probably a good time to wrap up. And to do that, I'd just like to ask you,
Starting point is 00:23:33 where do you think this domain is headed, basically, and how do you see your role in it? I think that, you know, to me, Kafka, the API, is a historical artifact in a positive way. Developers bought into the ecosystem and they built millions of lines of code. I think the future is a different API. I think the future is serverless. I think the future is a protocol that is less heavyweight than the Kafka protocol. And so I think that Redpanda, the company, is a company that can give people both A and B. A is compatibility with this hugely rich ecosystem that is always going to be important. And B is because we're more tied to the market evolution from batch to real time. Today, it happens to be that Kafka is the best,
Starting point is 00:24:32 Kafka API is sort of the best way that we could do that. But I think in the future, it'll be a different API and it'll be a new API that is really designed for the way modern applications are being built. And so that's kind of how I see the story arc for Redpanda, the company. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
