Software at Scale - Software at Scale 18 - Alexander Gallego: CEO, Vectorized

Episode Date: April 27, 2021

Alexander Gallego is the founder and CEO of Vectorized. Vectorized offers a product called Redpanda, an Apache Kafka-compatible event streaming platform that's significantly faster and easier to operate than Kafka. We talk about the increasing ubiquity of streaming platforms, what they're used for, why Kafka is slow, and how to safely and effectively build a replacement.

Previously, Alex was a Principal Software Engineer at Akamai and the creator of the Concord Framework, a distributed stream processing engine built in C++ on top of Apache Mesos.

Apple Podcasts | Spotify | Google Podcasts

Highlights

7:00 - Who uses streaming platforms, and why? Why would someone use Kafka?
12:30 - What would be the reason to use Kafka over Amazon SQS or Google Pub/Sub?
17:00 - What makes Kafka slow? The story behind Redpanda. We talk about memory efficiency in Redpanda, which is better optimized for machines with more cores.
34:00 - Other optimizations in Redpanda
39:00 - WASM programming within the streaming engine, almost as if Kafka were an AWS Lambda processor
43:00 - How to convince potential customers to switch from Kafka to Redpanda?
48:00 - What is the release process for Redpanda? How do they ensure that a new version isn't broken?
52:00 - What have we learned about the state of Kafka and the use of streaming tools?

Transcript

Utsav [00:00]: Welcome, Alex, to an episode of the Software at Scale podcast. Alex is the CEO and co-founder of Vectorized, a company that provides a product called Redpanda. You can correct me if I'm wrong, but Redpanda is basically a Kafka replacement that is a hundred times faster, or at least significantly faster, than Kafka itself. I understand the motivation behind building Redpanda, but I'm fascinated to learn what got you into it and what you learned in the process. Thank you for being here.

Alexander "Alex" Gallego: Yeah, thanks for having me. A pleasure being here. It's always fun to get a chance to talk about the mechanical implementation. The team is pretty large now, so these days I get to do a little bit of code review for the business, and it's always fun to get into the details. I've been in data streaming for a really long time, about 12 years now. The way I got into it was through a startup in New York. I was doing my PhD in crypto, dropped out, and went to work on a startup called Yieldmo, an ad tech company that competed against Google in the mobile market, and that's really how I got introduced to streaming. The name of the game back then was to use Kafka, Apache Storm, and ZooKeeper 3.2, and honestly it was really hard to debug. I think I experienced the entire life cycle of Kafka, from maybe the 0.7 or 0.8 release back in 2011 or so, all the way to my previous job at Akamai, where I was a principal engineer measuring latency and throughput. So I've seen the evolution of Kafka, from before they required ZooKeeper and all of these things. The history of how we got here is that at first we were optimizing ads using Storm, and then Mesos started to come about, and I thought, oh, Mesos is really cool, this is the future of streaming: you're going to have an oracle, and then something is going to schedule the containers.
In retrospect, now that Apache Mesos has been archived (it got pushed to the Apache Attic a couple of weeks ago), we chose the wrong technology. I think it was the right choice at the time: Kubernetes was barely working on a couple hundred nodes, and Mesos was proven at scale, so it seemed like the right choice. What we focused on, though, was streaming. Streaming is technology that helps you extract the value of "now," or in general deal with time-sensitive data. A trade, fraud detection, Uber Eats: these are all good examples of things that need to happen relatively quickly, in the now, and streaming systems are the kind of technology designed to help you deal with that complexity. So what Concord did was say, hey, we really liked the Apache Storm ideas. Back then, when I was at Yieldmo, Storm was really slow and really hard to debug, with this thing called Nimbus, and the supervisors, and stack traces across Scala, Clojure, and Java. I felt like I had to learn three standard libraries just to debug it. So we wrote Concord in C++ on top of Mesos, squarely on the compute side. Streaming is really where storage and compute come together, and at the end of it you do something useful: you say, hey, this credit card transaction was fraudulent. That's what I did for a long time. Long story short, I had been using Kafka mostly as a storage medium, and I personally couldn't get enough performance out of it with the right safety settings. So in 2017 I did an experiment. I took two optimized edge computers and literally wired them back to back, no rack latency, nothing, just a cable connecting the two machines, and I measured it. I started a Concord server (version 2.4 or 2.1 at the time) and a Concord client and measured what they could drive the hardware to for ten minutes, in both latency and throughput. Then I tore that down and wrote a C++ program that bypasses the kernel, and bypasses the page cache at the storage level too, to see what the hardware was actually capable of. I just wanted to understand the gap. Where does the accidental complexity come from? How much performance are we leaving on the table? [5:00] The first implementation was a 34x tail-latency improvement, and I was floored. I spent two weeks comparing the bytes of the data, just making sure the experiment had actually worked. Honestly, that was the experiment that got me thinking: hardware is so fundamentally different from what hardware was a decade or more ago, when Kafka and Pulsar and all these other streaming technologies were invented. The Linux scheduler and its block and IO algorithms, the machinery that takes the data you send to a file and organizes it for optimal writes and reads, were designed for effectively millisecond-level latencies, and the new disks are designed for microsecond-level latencies. That's a huge gap. Not to mention that you can now rent 220 cores on a VM on Google; you can rent a terabyte of RAM. So the question is: what could you do differently with this new hardware? It's so different, it's like a totally different computing paradigm.
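The gap Alex describes is easy to get a feel for. Below is a rough, hypothetical sketch of the kind of measurement he mentions: timing small direct writes that bypass the Linux page cache. This is not his actual benchmark; the file path, block size, and iteration count are made up, and the filesystem has to support O_DIRECT.

```cpp
// Sketch: time page-cache-bypass writes on Linux (compile with g++, run on a real filesystem).
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    constexpr size_t kBlock = 4096;  // O_DIRECT needs aligned buffers and sizes
    void* buf = nullptr;
    if (posix_memalign(&buf, kBlock, kBlock) != 0) return 1;
    std::memset(buf, 'x', kBlock);

    // O_DIRECT skips the kernel page cache; O_DSYNC makes each write durable.
    int fd = open("odirect-test.bin", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 1000; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        if (pwrite(fd, buf, kBlock, static_cast<off_t>(i) * kBlock) != (ssize_t)kBlock) {
            perror("pwrite");
            break;
        }
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - t0).count();
        std::printf("write %d: %lld us\n", i, static_cast<long long>(us));
    }
    close(fd);
    free(buf);
}
```

On an NVMe device the per-write numbers give a raw baseline to compare a full streaming stack against, which is roughly the exercise described above.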
And I know that [Inaudible6:19] has coined the term, "there's no free lunch." you basically have to architect something from scratch for the new bottleneck and the new bottleneck is the CPU. The delayed [Inaudible 6:30] of this, are so good and same thing with network and all these other peripheral devices that the new bottleneck is actually the coordination of work across the 220 core machine. The future is not cores getting faster; the future is getting more and more cores. And so the bottlenecks are in the coordination of work in the CPU's. And so, we rewrote this thing in C++, and that's kind of maybe a really long-winded way of saying how we got here.Utsav: That is fascinating. So maybe you can explain a little bit about where would someone deploy Kafka initially? You mentioned like fraud, and that makes sense to me, right. Like, you have a lot of data and you need to stream that. But the example that was, I guess, a little surprising was like Uber eats. So how would you put Kafka in like Uber eats? Like where would that fit in like the pipeline?Alex: Great Question. Let me actually give you a general sense and then we can talk about that case. Event streaming has been a thing for a really long time. People have been trading systems, but in the modern stack, it's called event streaming. And what is an event? An event is what happens when you contextualize data. So you have some data, let's say a yellow t-shirt. Right. Or like a green sweater, the one I'm wearing today. That's just data doesn't mean anything. But now if I say I bought this green t-shirt with my visa credit card and I bought it from, let's say, just Korean seller that is coming through www.amazon.com. And then I start to all of this context that makes an event. There's a lot of richness there. Implicitly, there's also a lot of time to that transaction. If I buy it today, there's this immutability about this facts. And so event streaming is this new way about thinking on your architecture as this immutable, contextualized data, things that go through your architecture.And so in the case of Uber eats, for example, when I go into like my Uber eats app and I select my favorite Thai restaurant, and I said, hey, get me number 47, I can't pronounce it, but I know it's the item I always get from the third restaurant across the corner. It’s like Chinese broccoli. And so, it's immutable that I paid $10 for it. It is immutable that the restaurant got it, 30 seconds later after this order. And so you start to produce kind of this chain of events, and you can reason about your business logic as this effectively a function application over a series of immutable events. It's kind of like functional programming at the architectural level. And why is that powerful? That's powerful because you can now actually understand how you make decisions. And so to go back to the case of fraud detection its really useful, not in making the decision. Like you can just create a little micro-service in node or in Python. It doesn't matter. And you just say, hey, is whatever is the credit card, both $10,000 is probably a fraudulent for, for buying Thai food. That's not the interesting part. The interesting part is that it's been recorded in an order fashion so that you can always make sense of that data and you can always retrieve it.[10:01]: So there are these properties about Kafka that Kafka brought to the architect, which were durability. I mean, this data actually lives on disk, and by the way, it's highly available. 
If you crash one computer, the data is going to live on the other two computers, so you can always get it back. And three, it's replayable: if the compute crashes, you can resume the computation from the previous position. Those were the properties that enterprise architects started to understand: oh, this isn't just for finance and trading, it works across almost any industry. Today we have customers (and we're a relatively young company) in oil and gas, measuring the jitter between oil and gas pipelines, where you have these little Raspberry Pi-looking things. The point of that Kafka pipeline, which was later replaced with Redpanda, was just to measure how much jitter there is on the pipeline and decide whether to shut it off, which is really cool. We've seen it in healthcare, where people are analyzing patient records and patient data and want to use new technologies like Spark ML or TensorFlow connected to real-time streams. For COVID, for example, we were talking with a hospital in Texas that wanted to track their COVID vaccines in real time and alert on all sorts of supplies. We've seen people in the food industry, and in sports betting, which is huge outside of the United States. To me it feels like streaming is at the stage databases were at in the seventies. Before databases, people wrote to flat files: every customer gets a flat file, you read it, and every time you need to change it you rewrite the entire customer file. That's a pseudo-database, but the database gave users a higher level of abstraction and a modeling technique. To some extent that's what Kafka has done for the developer: use this pluggable system, with those three properties, as the new way to model your infrastructure, as an immutable sequence of events that you can reproduce, that you can consume, that's highly available. Those were the benefits of switching to an architecture like that.

Utsav: Those customers are super interesting to hear about, and IoT makes sense. Maybe one more question that comes to a lot of people's minds: at what point should you stop using something like SQS? It seems to provide a lot of similar functionality; it's just more expensive, and you don't get to see the bare-bones stuff, but Amazon supports it for you. So why do customers stop using SQS or something like it and start using Kafka or Redpanda directly?

Alex: Yeah. So here's the history. Kafka is 11 years old now, and the value to developers is not Kafka the system, but the millions of lines of code they didn't have to write to connect to other downstream systems. The value is in the ecosystem, not in the system. So when you think about that, think of Kafka as two parts: Kafka the API and Kafka the system. People have a really challenging time operating Kafka the system, with ZooKeeper. I know there might be some listeners who are thrilled that KIP-500 was released; we can talk about what KRaft means, and ZooKeeper, later. But anyway, if you look at Kafka, it's two things: the API and the system.
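As a concrete picture of "Kafka the API," here is a minimal, hypothetical producer using librdkafka. The same few calls speak to Kafka or to Redpanda, since the wire protocol is shared; the broker address, topic name, and payload are invented for illustration.

```cpp
// Sketch: produce one event through the Kafka API (librdkafka C++ client).
#include <librdkafka/rdkafkacpp.h>
#include <iostream>
#include <memory>
#include <string>

int main() {
    std::string err;
    std::unique_ptr<RdKafka::Conf> conf(RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL));
    conf->set("bootstrap.servers", "localhost:9092", err);  // Kafka or Redpanda listener
    conf->set("acks", "all", err);                           // wait for full replication

    std::unique_ptr<RdKafka::Producer> producer(RdKafka::Producer::create(conf.get(), err));
    if (!producer) { std::cerr << err << "\n"; return 1; }

    // An "event": contextualized, immutable data about something that just happened.
    std::string value = R"({"order_id": 47, "item": "chinese broccoli", "amount_usd": 10})";
    RdKafka::ErrorCode rc = producer->produce(
        "orders",                        // topic
        RdKafka::Topic::PARTITION_UA,    // let the partitioner choose
        RdKafka::Producer::RK_MSG_COPY,
        const_cast<char*>(value.data()), value.size(),
        nullptr, 0,                      // no key
        0,                               // timestamp: now
        nullptr);                        // per-message opaque
    if (rc != RdKafka::ERR_NO_ERROR) std::cerr << RdKafka::err2str(rc) << "\n";

    producer->flush(5000);               // wait up to 5s for delivery
}
```

Every downstream connector, database sink, and stream processor that already understands this protocol then works unchanged, which is the ecosystem argument that follows.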
The reason is, and why someone would want to use the Kafka API, which by the way, Pulsar are also started supporting, it’s not just Redpanda, really like a trend in the data streaming system is you can take spark ML and TensorFlow and [inaudible 00:14:04] and all of these databases, it just floats right in, and you didn't write a single line of code.Alex: You start to think about the systems of these Lego pieces. Of course, for your business logic, like you have to write the code, it's what people get paid for but for all of these other databases, all of these downstream systems, whether you're sending data to anything: Datadog for alerting, or a Google BigQuery or Amazon Redshift, or [Inaudible00:14:32] to be, or any of these databases. There's already this huge ecosystem of code that works that you don't have to maintain because there’s people already maintaining it and so you're just flogging into this ecosystem. So I would say the largest motivation for someone to move away from a non-Kafka API system, which you know, is before Google gloves pops up and Azure event hub and there's like a hundred of them is in the ecosystem.[15:00] And realizing that it is very quickly at the community. I think Redpanda, for example, makes Kafka the community bigger and better. We start to expand the uses of the Kafka API into embedded use cases. For example, for this security appliance company, they actually embed Redpanda because we're in a C++, super well footprint, but they didn't embed as processed to do intrusion detection. So every time they see an intrusion into the network, they just write a bunch of data to disk, but it's through the Kafka API and in the cloud, it's the same code. It's just not one Redpanda local instance is the collector. And so I think for people considering other systems, whether it's SQS, or [Inaudible00:15:46] or pops up or Amazon event hub. First of all, there are specific traders that we need to dig really to the detail, but at the architectural level is plugging into this ecosystem of the Kafka API is so important in getting to leverage the last 10 years of work that you didn't have to do and it takes you have five seconds to connect to these other systemsUtsav: That is really fascinating. I guess large companies, they have like a million things they want to integrate with like open source things and like all of these new databases, like materialize all of that. So Kafka is kind of like the rest API in a sense.Alex: I think it's become the new network to some extent. I mean, people joke about this. Think about this, if you had an appliance that could keep up with the throughput and latency of your network, but give you auditability. It gives you access control. It gives you a replay ability. Why not? That I think some of our more cutting edge users are using Redpanda as the new network, and they needed the performance that Redpanda brought to the Kafka KPI ecosystem to enable that kind of use case which is where every message gets sent to Redpanda. It could keep up. It could saturate hardware, but now that they get this tracing and auditability. They could go back in time. So you're right. It's almost like you have the new rest API for micro services.Utsav: Yeah. What is it about Kafka that makes it slow? Like from an outsider's perspective, to me, it seems like when a code base gets more and more features continued by like hundreds of people over like a long time span. 
There are just so many ifs and elses and checks that tend to bloat the API surface and slow things down, and then somebody tries to profile and improve things incrementally. Could you walk me through what you've learned by looking at the code base, and why you think it's slow? One thing you mentioned is that you do kernel bypass and skip all of the overhead there, but is there anything inherent about Kafka itself that makes it really slow?

Alex: Yeah. So it's slow comparatively speaking, and we spent 400 hours benchmarking before we launched, so I have a lot of detail on this particular question. Let me step back. An expert could probably tune Kafka to get much better performance than most people, but most people don't have 400 hours to benchmark different settings of Kafka. Kafka is multimodal in performance, and I can dig into that a little bit. Assume you're an expert and you're going to spend the time to tune Kafka for your particular workload, which, by the way, changes depending on throughput; the performance characteristics of sustained workloads on Kafka actually vary. Your threading model varies too: the number of threads for your network, the number of threads for your disk, the number of threads for your background workloads, and the amount of memory. That part of tuning Kafka is really the most daunting task for an engineer, because it's impossible, in my opinion, to ask an engineer who doesn't know the internals of Kafka, unless they go and read the code, to understand the relationship between IO threads, disk threads, and background workloads, and how much memory to reserve for this versus that. All of those trade-offs matter as soon as you start to hit some form of saturation. So let me give the details on the parts where we improved performance, which are specifically tail latency (and why that matters for a messaging system) and throughput. By and large, Kafka can't drive hardware to a throughput similar to Redpanda's. Redpanda always does at least as well as Kafka, and in some cases, which we highlight in the blog, we're a little better, say 30% or 40%. The real improvement is in the tail of the latency distribution. [20:00] Why does that matter? I'm going to focus on what Redpanda brings to the market rather than the negatives of Kafka, because we're built on the shoulders of Kafka; if Kafka didn't exist, we wouldn't have gotten to learn these improvements or understand the nuance of "maybe I should do this a little differently." So, on the tail-latency improvement: latency, and I've said this a few times, is the sum of all your bad decisions. That's just what happens at the user level: when you send a request to the microservice you wrote, you wonder, should I have used a different data structure? There's no cache locality, and so on. What we focused on is how to give people predictable tail latency, and it turns out that for our users, predictable tail latency often results in something like a 5x hardware reduction. So let me make this concrete: where we are better, and how that materializes for users.
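For a sense of the knobs Alex is describing, here is a small, illustrative slice of a Kafka broker's server.properties. The setting names are real, but the values are placeholders rather than recommendations; as he says, the right numbers depend on the workload and on each other.

```
# Thread pools that have to be sized against each other
num.network.threads=8            # socket send/receive
num.io.threads=16                # request handling and disk I/O
background.threads=10            # log cleanup and other background work
num.replica.fetchers=4           # threads pulling data for replication

# Flush behavior interacts with all of the above
log.flush.interval.messages=10000
log.flush.interval.ms=1000

# Heap is sized separately, e.g. KAFKA_HEAP_OPTS="-Xms6g -Xmx6g" in the environment
```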
We paid a lot of attention to detail. That means we spent a ton of engineering time, effort, money, benchmarking, and test suites on making sure that once you get to a particular latency, it doesn't spike around. It's stable, because you need that kind of predictability. Let me give you a mental model of how you could achieve really good average latency and terrible tail latencies. Say you have a terabyte of heap and you just write to memory, and every ten minutes you flush the terabyte. Every ten minutes you get one request that takes something like five minutes, because you have to flush a terabyte to disk, and otherwise the system looks fine. What people need to understand is that the more messages you put into Kafka, the more you start to hit its tail-latency spikes, and because it's a messaging system, many of your users are going to experience that tail latency. So we asked, how can we improve this for users? In March we said, let's rethink this from scratch, and that really had a fundamental impact. We don't use a lot of the Linux kernel facilities now. There are global locks in the Linux kernel whenever you touch a global object, the page cache for example. I actually think a global page cache is the right decision for the kernel, but if you look at our code, there are a ton of edge cases we had to handle to make our own path even just work, and a lot more to make it work fast. It was a lot of engineering effort that, honestly, we didn't know would pay off, and it happened to pay off. We believed we could do better on modern hardware. So we don't have those global locks at the low level on Linux kernel objects, and because we don't use global resources, we've partitioned the memory across every individual core. Memory allocations are local. You don't have one massive garbage collection that has to traverse terabyte heaps; you have these localized little memory arenas. It's like taking a 96-core computer, creating a mental model of 96 little computers inside it, and structuring the Kafka API on top of that. Because, again, the new bottleneck in computing is the CPU, so you rethink the architecture to really extract the value out of the hardware. My philosophy is that the hardware is so capable, the software should be able to drive it at saturation at all times. If you're not driving the hardware to saturation on throughput, then you should be driving it at the lowest latency you can. And these things need to be predictable, because when you build an application you don't want to think, what is my tail latency for this and that; most of the time I need five computers, but 10% of the time I need 150 computers at the limit, so let's call it an average of 70 or 75 computers. It's really hard to build applications when your underlying infrastructure is not predictable, so that's a big improvement. The last improvement on the Kafka API side is that we only expose safe settings. We use Raft as the replication model, and I think that was a big improvement on the state of the art in streaming.
If you look at the actual implementations: Kafka has the ISR replication model, Pulsar I think is primary-backup with some optimizations, versus our Raft implementation. We didn't invent our own protocol. [25:00] So there's a mathematical proof behind the replication, but you also understand it as a programmer: oh, this is what I'm used to, this is what it means to have three or five replicas of the data. So there's all of that context. That was a long-winded answer, but you asked about such a critical thing that I had to be specific, to make sure I don't leave room for ambiguity.

Utsav: Yeah. Can you explain why it's important to partition memory per core? What happens when you don't? One thing you mentioned was the garbage collection that has to traverse everything. What exactly is wrong with that? Can you elaborate?

Alex: Yeah. So there's nothing wrong, and everything works; it's about the trade-offs we want to optimize for. Basically, make it cost-efficient for people to actually use data streaming. To me, streaming is in that weird space [inaudible 25:57] a few years ago, where there's all this money being put into it, but very few people actually get value out of it. Why is this thing so expensive to run, and how do we bring it to the masses so it isn't so massively expensive? Anyone who has run another streaming system in the [inaudible 26:17] always has to over-provision, because they just don't understand its performance characteristics. So let me talk about the memory partitioning. For modern computers, the new trend is increasing core counts; the clock frequency of the CPU is not going to improve. Here's the tricky part, where it gets very detailed: individual CPUs still got faster even though the clock frequency didn't improve. How is that possible? Through very low-level things like instruction prefetching and pipelined execution, all of these tricks at the lowest level of instruction execution. Even without clock frequency increases, that gave something like a 2x, maybe 3x, improvement over the last ten years. But the larger trend in computing is more cores. My desktop has 64 physical cores; it's the Ryzen 3900-class part. In the data center there's also this trend, which I don't think the industry has settled on, where even a single motherboard has two sockets. With two sockets you have this thing called NUMA, non-uniform memory access, and NUMA domains, which means every socket has local memory that it can access and allocate with low latency, as if it were one computer, but it can also use remote memory attached to the other socket. And when you rent a cloud computer, you want to understand what kind of hardware it is; to some extent you're paying for that virtualization, and most people run in the cloud these days. So why does pinning memory to a particular thread matter? It matters because, like I mentioned, latency is the sum of all your bad decisions.
And so what we did is say, okay, let's take all of the memory for this particular machine, and I'll give you an opinionated view on it: if you're running at really large scale, the optimal production setting is two gigabytes per core. That's what we recommend for Redpanda. You can run it on something like 130 megabytes for very low-volume use cases, but if you're really aiming to go ham on the hardware, those are the memory recommendations. So why is that important? When Redpanda starts up, it starts one pthread for every core, and that gives me, the programmer, a concurrency and parallelism model. Within each core, when I'm writing code in C++, I code against a concurrent construct, and the parallelism is a free variable that gets exercised on the physical hardware. The memory comes in because we split the memory evenly across the cores. Say you have a computer with 10 cores: we take all the memory, sum it up, subtract about 10%, and split it by 10. Then we do something even more interesting. We ask the hardware: hey, for this core, which memory bank belongs to it, in this NUMA domain, in this CPU socket? And the hardware will tell you, based on the motherboard configuration, this is the memory that belongs to this particular core. Then we tell the Linux kernel: allocate this memory, [30:00] pin it to this particular thread, and lock it, so you don't give it to anybody else. That thread then preallocates it as a single byte array. What you've done is eliminate all forms of implicit cross-core communication, because that thread will only allocate memory on its own core unless the programmer explicitly asks to allocate memory on a remote core. It's a relatively onerous model to get your head around at first, but it fits if you're programming in an actor model. So what does that mean for a user? Let me give you a real impact: we were working with a big Fortune 1000 company, and they took a 35-node Kafka cluster and brought it down to seven nodes. All of these little improvements matter, because at the end of the day, a 5.5x improvement here, a hardware cost reduction there, a 1,600% performance improvement somewhere else, it all adds up. There's a blog post we wrote a month ago where we talk about the other mechanical-sympathy techniques we use to make sure we deliver low latency through a Kafka API. That was a long-winded way of explaining, at the lowest level, why it matters how you allocate memory. It all boils down to the things we were optimizing for: saturating the hardware, so streaming is affordable for a lot of people, and making it low latency, so you enable new use cases like that oil and gas pipeline. That's one of the really deep [inaudible 31:40]. I'm happy to compare our pool algorithms and how those are different, but that's how we think about building software.

Utsav: I wanted to know, what is the latency difference when you use memory from your own NUMA node versus when you access remote memory? How much faster is it to just stay on your own core, I guess?

Alex: It's faster relative to the other.
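A rough sketch of the per-core memory pinning Alex describes: pin a thread to a core, ask which NUMA node that core's memory bank belongs to, and allocate the arena only from that node. This is illustrative, not Redpanda's code; the core number and arena size are invented (the two-gigabytes-per-core figure above is his production guidance, not a requirement of the API).

```cpp
// Sketch: NUMA-local arena for a core-pinned thread (Linux, link with -lnuma).
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) { std::puts("no NUMA support"); return 1; }

    int core = 3;                              // hypothetical core for this shard
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);  // pin the thread

    int node = numa_node_of_cpu(core);         // "which memory bank belongs to this core?"
    size_t bytes = 256ull << 20;               // small arena for the sketch
    void* arena = numa_alloc_onnode(bytes, node);  // allocate only from the local node
    if (!arena) { std::puts("allocation failed"); return 1; }
    mlock(arena, bytes);                       // best effort: keep it resident
    std::memset(arena, 0, bytes);              // touch pages so placement actually happens

    std::printf("core %d uses NUMA node %d for its arena\n", core, node);
    numa_free(arena, bytes);
}
```

From this point on, any cross-core data movement has to be written explicitly, which is the "96 little computers inside a 96-core computer" model described above.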
I think the right question to ask is: what is the latency of crossing that NUMA boundary in relation to the other things you have to do in the computer? If you have only one thing to do, you just need to allocate more memory on that core, it'll be plenty fast. But if you're trying to saturate hardware, let me give you an order-of-magnitude comparison. It's a few microseconds to cross the boundary and allocate memory from another core; that's from some experiments I did last year crossing the NUMA boundary and allocating memory. Now put that in perspective against writing a single page to disk on NVMe with [inaudible 32:55] bypass: you can write a page to an NVMe device, not 3D XPoint, just the regular NVMe in your laptop, in single- to double-digit microseconds. So when a memory allocation sits in the microsecond range, that's really expensive compared with actually doing useful work. It's really hard for humans to have an intuition for latency unless we work in the low-latency space and have a feel for what the computer can actually do for the kind of work in that particular case. And it adds up: if you have contention, it can get into human-perceptible time, depending on how contended the resources are. Say you're trying to allocate from a memory bank on a remote NUMA node and there's no memory left; then you wait for a page fault and for things to get written out to disk. These things just add up. It's hard to give intuition, but the better framing is to compare against the other useful things you need to be doing. The useful thing about a computer is doing something useful for the business, like detecting fraud or planning an Uber ride to your house, and the crossing is really expensive compared with what you could actually be using the computer for.

Utsav: That makes sense. Is there any other fancy stuff you do, in terms of networking, for example? Recently I've heard that even NICs are getting extremely fast and there's a lot of overhead in software. Does the kernel get in the way there?

Alex: I think this is such an exciting time to be in computing. Let me tell you a couple of things we do. [35:00] We actually expose some of the latency that the disk is reporting. Say you're writing to disk, and over time you learn the device is getting busier: writes that took one to eight microseconds per page now take thirty to a hundred microseconds, because there's a little contention, a little queuing. Because we don't use the Linux kernel page cache, we get to monitor that latency ourselves against thresholds and propagate it up to the application level, to the Raft replication model. It's very cool when you co-design a replication model with the mechanical execution model, because it means you're passing implementation details, on purpose, up to the application level, so you can do application-level optimizations. One of those optimizations is reducing the number of flushes.
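A toy illustration of that flush-reduction idea: several appends acknowledged with one fdatasync instead of one flush per record. The file name and batch contents are invented, and this is not Redpanda's implementation, just the shape of the optimization.

```cpp
// Sketch: batch appends, then make the whole batch durable with a single flush.
#include <fcntl.h>
#include <unistd.h>
#include <string>
#include <vector>

// Write every record, then call fdatasync once. A replication layer would only
// acknowledge the batch's entries after this returns successfully.
bool append_batch(int fd, const std::vector<std::string>& records) {
    for (const auto& r : records) {
        if (write(fd, r.data(), r.size()) != static_cast<ssize_t>(r.size())) return false;
    }
    return fdatasync(fd) == 0;  // one durability point for N records
}

int main() {
    int fd = open("segment.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    std::vector<std::string> batch = {"record-1\n", "record-2\n", "record-3\n"};
    bool ok = append_batch(fd, batch);
    close(fd);
    return ok ? 0 : 1;
}
```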
With Raft, to replicate a write you write the page and then you flush the data to disk. But we can do adaptive batching, so we write a bunch of entries and then issue a single flush instead of, say, five. That's one thing. The second thing this latency headroom gives you is a new compute model. We added WebAssembly: we took the V8 engine, and there are currently two modes of execution for it, one as a sidecar process and one inline in the process. Every core gets a V8 isolate, which is like the engine inside Node.js. Inside a V8 isolate there's a thing called a V8 context; let me go through the terminology for a second, because there are a lot of terms. A context is what a JavaScript file attaches to, and in fact a context can execute multiple JavaScript files. So why does this matter? Redpanda becomes programmable storage for the user. Think of the Transformers, when Optimus Prime and the other robots combine into a bigger robot: you can ship code to the storage engine and change the characteristics of the storage engine. You're streaming data, but because we're so fast, we have a latency budget, and we can do new things with that budget. We can introduce a compute model that lets the storage engine do inline transformations of the data. Say you send a JSON object and you want to remove the social security number for GDPR compliance or HIPAA compliance or whatever it is: you ship a JavaScript function and it does an inline transformation, or it simply masks the field for performance, so you don't reallocate, you just write "xxx" over it and pass it along. Now you get to program what happens at the storage level. You're not ping-ponging your data between Redpanda and other systems; you're raising the level of abstraction of the storage and streaming system to do things you couldn't do before: inline filtering, inline masking, simple enrichments. Simple things that are actually really challenging to do outside of this model. So you insert a computation model where you ship code to the data and it executes inline. One of the more interesting parts is exposing that WebAssembly engine, which is just V8, to our end users. As an end user, with rpk, the command-line tool we ship, you deploy a transform by giving it a JavaScript file; you tell it the input source and the output source, and that's it. The engine is in charge of executing that little JavaScript function for every message that goes in. That's the kind of impact being fast gives you: computational headroom that lets you do things you couldn't do before.

Utsav: That's interesting. You mentioned HIPAA compliance and stripping out information. What are some use cases you can talk about publicly that you just would not expect? Where you thought, wow, I can't believe that's what this thing is being used for.

Alex: Yeah, let me think. Well, one of them is IP credit scores. Why that's interesting is not as a single step but as an aggregate of steps; it's really critical.
So, let me just try to frame it in a way that doesn't [Inaudible00:39:50] the customers, but we have a massive customer internet scale customer. [40:00] They are trying to give every one of their users. So really like profiling information that is anonymous, which is kind of wild. So every IP gets a credit score information. So let's say in America, you have credit scores from like 250 to 800 or maybe 820. And so you give it the same number or like a credit score to every IP and you monitor that over time, but now they can push that credit to score in [Inaudible00:40:27] inside that IP. And then you can potentially make a trade on that. And so there's all of this weird market dynamics. Let me give you this example. Let's say you watch the Twitter feed and you're just like, oh, what is the metadata that I can associate with this particular period and can I make a trade on that? So it's like a really wild thing to do. And then the last one that we're focusing on is this user who is actually extending the WebAssembly protocol, because it's all in GitHub. So you could literally download Redpanda and then stop our WebAssembly engine or your own web assembly engine. Here's actually spinning out very tall computers that are tall both in terms of CPU, in terms of memory and in terms of GPU. He has a [Inaudible00:41:19] job running on the GPU and then every time a message comes in it would make the local a call to this sidecar process that is running this machine learning thing to say, like, you know, should I proceed or should I not proceed with this particular decision? Those are the things that we just didn't plan for. I was like, wow when you sort of expand the possibilities and give people a new computing platforms, everyone will use it. It’s actually not a competing platform. It enriches how people think about their data infrastructure because a spike and Apache Flink, they all continue to work. Like the Kafka API continues to work, were just simply expanding that with this WebAssembly engine. Yeah.Utsav: I think it's fascinating. Let's say that you want to build a competitor to Google today. Like just the amount of machines that they have for running search is very high. And not that you'd be easily be able to build like a competitor, but at least using something like this will make so much of your infrastructure costs cheaper, that it's possible to do more things. That's like the way I'm thinking about it.Alex: Yeah. We're actually are in talks with multiple database companies that we can't name but what's interesting is that there's multiple. We are actually their historical data, both the real-time engines and their historical data. So Kafka API and of course Redpanda give the programmer a way to address by integer every single message that has been written to the log. It gives you a total addressability of the log, which is really powerful. Why? So if you're a database, imagine that each one of these address is just like a page on a file system, like an immutable page. So like this database vendors, they're streaming data through Redpanda, and then, it gets written to the Kafka, a batch, and then we push that data to S3. We can transparently, hydrate those pages and give it to them. So they've actually built an index on top of their campaign index that allows them to reason of a Kafka batch, as you know, it's real, the guiding way to fetch pages. It sounded like this page faulty mechanism and can also ingest real-time traffic. 
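A minimal sketch of the offset addressability described above: through the Kafka API a reader can jump straight to a known integer offset, much like fetching a page by number. Broker, topic, partition, and offset here are hypothetical.

```cpp
// Sketch: read one record at an exact offset through the Kafka API (librdkafka).
#include <librdkafka/rdkafkacpp.h>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

int main() {
    std::string err;
    std::unique_ptr<RdKafka::Conf> conf(RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL));
    conf->set("bootstrap.servers", "localhost:9092", err);
    conf->set("group.id", "offset-demo", err);

    std::unique_ptr<RdKafka::KafkaConsumer> consumer(
        RdKafka::KafkaConsumer::create(conf.get(), err));
    if (!consumer) { std::cerr << err << "\n"; return 1; }

    // Address the log directly: topic "orders", partition 0, offset 1234.
    std::vector<RdKafka::TopicPartition*> assignment = {
        RdKafka::TopicPartition::create("orders", 0, 1234)};
    consumer->assign(assignment);

    std::unique_ptr<RdKafka::Message> msg(consumer->consume(5000 /* ms */));
    if (msg->err() == RdKafka::ERR_NO_ERROR) {
        std::cout << "offset " << msg->offset() << ": "
                  << std::string(static_cast<const char*>(msg->payload()), msg->len())
                  << "\n";
    }
    RdKafka::TopicPartition::destroy(assignment);
    consumer->close();
}
```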
And so that's another like really sort of fascinating thing that we didn't think of until we started building this.Utsav: Yeah. So then let me ask you, how do you get customers ready? You build this thing, it's amazing. And I'm assuming your customer is like another principal engineer somewhere who's like frustrated at how slow Kafka is or like the operational costs. But my concern would be that how do we know that Redpanda is not going to lose data or something and right now it's much easier because you have many more customers. But how do you bootstrap something like this and how do you get people to trust your software and say, yes, I will replace Kafka with this?Alex: Yeah. So that's a really hard answer because building trust is really hard as a company. I think that's one of the hardest thing that we had to do in the beginning. So our earlier users were people that knew me as an individual engineer. Like, there are friends of mine. I was like, I'm building this would you be willing to give it a shot or try it. So, that only lasts so long. So what we really had to do is actually test with Jepsen, which is basically a storage system hammer[45:00] that just verifies that you're not b**********g your customers. Like if you say you have a rapid limitation, then you better have a rapid limitation according to the spec. Pilo is a fantastic engineer too. And so what we did is that we were fortunate enough to hire Dennis who has been focused on correctness for a really long time. And so he came in and actually built an extended and internal Jepsen test suite and we just test it for like eight months. So it seems like a lot of progress, but you have to understand that we stayed quiet for two and a half years. We just didn't tell anyone and in the meantime, we're just testing the framework. The first rapid implementation, just to put it in perspective, took two and a half years to build something that actually works. That is scalable and this isn't like that overnight. It's like, well, it took a two and a half year to build a rapid implementation and now we get to build a future of streaming. And so the way we verify is really through running our extended Jepsen suite. We're going to be releasing hopefully sometime later this year, an actual formal evaluation with external consultants. People trust us because their code is on GitHub. And so you're just like, well, this is not just a vendor that is saying they have all the cool stuff and underneath it seems just a proxy for Kafka and the Java. And I was like, no, you could go look at it. A bunch of people have bought like 300,000 lines of C+++ code or maybe 200 something. It’s on GitHub and you can see the things that we do.I invite all the listeners to go and check it out and try it out, because you could just verify this claims for yourself. You can't hide away from this and it's in the open. So we use the thing [Inaudible 46:51] so everyone can download it. We only have one restriction that says we are the only one that is allowed to have a hosted version of Redpanda. Other than that you can run it. In fact, in four years it becomes Apache 2. So it's really only for the next four years. So it's really, I think, a good tradeoff for us. But you get trust multiple ways. One is, people will know you and they're willing to take a bet on you but that only lasts with like your first customer or two. And then, the other ones is that you build a [Inaudible 47:24] like empirically. So that's an important thing. 
You prove the stuff that you're claiming.It is on us to prove to the world that we're safe. The third one is we didn't invent a replication protocol. So ISR is a thing that Kafka invented. We just went with Raft. We say, we don't want to invent a new mechanism for replicating data. We want to simply write a mechanical execution or Raft that was super good. And so, it's relying on existing research like Raft and focusing on the things that we were good at, which was engineering and making things really fast. And then, over time there was social proof, which is you get a couple of customers and they refer, and then you start to push petabytes of traffic. I think a hundred and something terabytes of traffic per day with one of our customers. And we thought, some point the system is always intact. If you have enough users, your system is always intact. And I think we just stepped into that criteria where we just have enough users that every part of the system has always been intact but still, we have to keep testing and make sure that every command runs through safety and we adhere to the claims.Utsav: Yeah. Maybe you can expand a little bit about like the release process. Like how do you know you're going to ship a new version that's like safe for customers?Alex: Yeah. That is a lot to cover. So we have five different types of fault injection frameworks. Five frameworks, not five kind of test suites. Five totally independent frameworks. One of them, we call it Punisher and it's just a random poll exploration and that's sort of the first level. Redpanda is always running on this one test cluster and every 130 seconds the fatal flaw that is introduce into the system. Like literally, there's like an estimated time in that logged in, not manually, but programmatically and removes your data director and it has to recover from that. Or [Inaudible 49:32] into the system and that's K-9 or it sends like the incorrect sequence of commands to create a topic in Kafka at the first level and that's always running. And so that tells us that our system just doesn't like [Inaudible49:50] for any reason. The second thing is we run a fuse file system and so what it does is then instead of writing to disk, it writes to this virtual disk interface, and then we get to inject the deterministic. [50:08] The thing about Fault Injection is when you combine three or five, like edge criteria is when things go south. I think through unit testing, you can get pretty good coverage of the surface area, but it's when you combine like, oh, let me get a slow disc and a faulty machine and a bad leader. And so we get to inject like surgically for this topic partition, we're going to inject a slow write and then we're going to accelerate the reads. And so then you have to verify that there's a correct transfer of leadership.There's another one, which is a programmatic fault injection schedule, which we can terminate, we can delay, and we can do all these things. There's ducky, which is actually how Apache Kafka gets tested. There’s this thing in Apache Kafka called ducktape and it injects particular balls at the Kafka KPI level. So it's not just enough that we do that internally for the safety of the system but at the user interface the user are actually getting the things that we say we are. And so we leveraged now with the Kafka tests, because we're Kafka API compatible is to just work to inject particular failures. So we start off with three machines. 
We take librdkafka, write something like a gigabyte of data, crash the machines, bring them back up, and read the gigabyte of data back. Then we start to complicate the experiments. That's the fourth one, and I'm pretty sure I mentioned another, but I know we have five. The release process is that every single commit to dev gets run for a really long time, and our CI is parallel: every merge to dev takes something like five hours of testing, but we parallelize it, so in human time it takes about an hour; we just run several of them at a time. That's really the release process. It takes a long time, and of course if something breaks, we hold the release. In addition to that there are manual tests too, for the things we're still codifying into the chaos suites.

Utsav: I wonder if Kafka could use the frameworks you've built. Maybe that will be an interesting day, when Kafka starts using some of that.

Alex: Some of the things are too specific to the company. To launch tests we have an internal tool called vtools, as in "vectorized tools," after the name of the company. You say, give me a cluster, and it picks the binary from CI, deploys it into a cluster, and starts a particular test. That's specific to our setup. Otherwise, a lot of these tests, I think three out of the five, are in the open; the general ones people can just go look at, so that could help. The other two, the more involved ones, are internal.

Utsav: Okay. What is something surprising you've learned about the state of Kafka? You had an opinion when you started Vectorized, that streaming is the future and it's going to be used everywhere, and I'm sure you've learned a lot through talking to customers and actually deploying Redpanda in the wild. What is something surprising you've learned?

Alex: I feel like I'm surprised almost every day; people are doing such interesting things. There were multiple surprises: business surprises and customer surprises. On the business side, I'm a principal engineer by trade, I was a CTO before, and this is the first time I'm a CEO. It was a lot of self-discovery: can I do this, and how on earth do I do it? It started from a place of wanting to scratch a personal itch. I wrote this storage engine; if you look at the commits, I think to date I'm still the largest committer in the repo, and I'm the CEO, though that's obviously just history. I wrote the storage engine with the page cache bypass and the first allocator and the first compaction strategy because I wanted it to exist at a technical level. Then it turned out, when we started interviewing people, tens of companies, maybe less than a hundred but definitely more than fifty, that they were having the same problems I was having personally. [55:02] That goes from small companies to medium companies to large companies: everyone was struggling with operationalizing Kafka. It takes a lot of expertise, or money, or both, and talent costs money, just to get the thing stable. And I thought, we could do better. The fact that that was possible, even though Kafka has been around for 11 years, shocked me. I was like, wow.
There's this huge opportunity; I'll make it simple. And the interesting part is that the JavaScript and Python ecosystems love us, because they're used to running Node.js and NGINX, these little processes that are light in footprint and run sophisticated code, and they go, oh, Redpanda is this really simple thing that I can run myself. We empower some of these developers who would never come over to the JVM and Kafka ecosystem, just because it's so complicated to productionize. Kafka is easy to get started with, let me be clear; it's hard to run it stably in production. So that was a really surprising thing. Then at the customer level, the oil and gas case was really revealing. The embedded use case and the edge IoT use case blew me away, because I had always used Kafka as a centralized message hub where everything goes. I never thought of it as something that could power IoT-like things, or this intrusion detection system where they wanted a consistent API between their cluster and their local processes, and Redpanda was the perfect fit for that. In the modern world there are going to be a lot more users like that, people pushing the boundaries. There have been a lot of surprising things, but those are good highlights.

Utsav: Have you been surprised at the number of people who use Kafka, more than anything else, or were you kind of expecting it?

Alex: I've been blown away. I think there are two parts to that. One, streaming is a nascent market. I think Confluent just filed their S-1 recently, I saw that today, and I think they're going to be massively successful. I wish them luck and success, because the bigger they are, the bigger we are too; we add to the ecosystem. I actually see us expanding the Kafka API to users and use cases that couldn't be served before. The market is big in total size, but it's also big in the sense that it's growing. The number of new users coming to us saying, I'm thinking about this business idea, how would I do it in real time? And here's what's interesting about that: you and I put pressure on products, and that ultimately translates into things like Redpanda. Say I want to order food. My wife is pregnant and she wants Korean food tomorrow; she doesn't eat meat, but when she's pregnant she wants meat. So I want to order that, and I want to be notified every five minutes: what's going on with the food, is it going to get here? End users end up putting pressure on companies, and software doesn't run on category theory, it runs on real hardware, and ultimately that ends up looking like Redpanda. There are a ton of new users being asked by end users, like you and me, to effectively transform the enterprise into a real-time business and start extracting value out of the now, what's happening literally right now. Once they learn they can do this, that Redpanda empowers them to do it, they never want to go back to batch. Streaming is a strict superset of batch, without getting theoretical here. Once you can extract value out of what's happening in your business in real time, nobody ever wants to go back. So I think it's those two things.
One, the market is large and two, it is growing really fast and so that was surprising to me.Utsav: Cool. This is a final question. You don't have to answer this. This is just a standard check. If Confluence CEO came to you tomorrow and said, we just want to buy Redpanda, what would you think of that?Alex: I don't know. I think from a company perspective, I want to take the company, myself, to be a public company and I think there's plenty of room. Like the pie is actually giant. And I just think there hasn't been a lot of competition in this space. It's my view, at least. [1:00:00] Yeah and so I think we're making it better. Let me give you an example. Apache web server dominated the HTTP standards for a long time. It almost didn't matter what the standard said is like Apache web server was like the thing. If anything, they implemented that was the standard implied. Then NGINX came about, and people were like, wait, hold on a second, there's this new thing. And it sort of actually kicked off this new, I think, interest in trying to standardize and develop the protocol farther.I think it's similar. I think it's only natural to happen to the Kafka API. The Pulsar is also trying to bring in the Kafka API. We say, this is our first class citizen and so, I think that there's room for multiple players who offer different things. We offer extremely low latency safety. So we get to take advantage of the hardware. And so I think people are really attracted to our offering from a technical perspective, especially for new use cases. And yeah, and so I don't know. I think there's a chance for multiple players in that.Utsav: Yeah. It's exciting. I think like there's open telemetric. There'll be like an open streaming API or something that eventually there'll be like a working group that will be folded into CNCF and all those things.Alex: Exactly. Utsav: Well, thanks Alex again for being a guest. I think this was a lot of fun and it is fascinating to me. I did not realize like how many people need streaming and are using streaming. I guess the IOT use case just like slipped my mind, but it makes so much sense. This was super informative and thank you for being a guest. Alex: Thanks for having me. It was a pleasure. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Welcome, Alex, to an episode of the Software at Scale podcast. Alex is the CEO and co-founder of Vectorized, which is a company that provides a product called Red Panda. And you can correct me if I'm wrong, but Red Panda is basically a Kafka replacement, which is a hundred times faster or significantly faster than Kafka itself. I am super fascinated to learn about why we're building, like I understand the motivation behind building Red Panda,
Starting point is 00:00:41 but you know, what got you into it and what you learned in the process? And thank you for being here. Yeah, thanks for having me, Utsav. A pleasure being here. This is always so fun to get a chance to talk about the mechanical implementation. To a large extent,
Starting point is 00:00:58 to this day — the team is pretty large now — so I get to do just a little bit of code review for the business, and so it's always fun to get to the details. Yeah, you know, I've been in streaming for a really long time, like just data streaming, for like 12 years now. And the way I got into it was through a startup in New York. I was doing my PhD
Starting point is 00:01:17 in crypto, I dropped out, went to work for this guy, Mike, on a startup called Yieldmo. Doesn't mean anything, but it was an ad tech company that competed against Google in the mobile market space, and that's really how I got introduced to it. You know, the name of the game back then was to use Kafka and Apache Storm and ZooKeeper 3.2, and honestly it was really hard to debug. So I think I experienced the entire life cycle of Kafka, from the 0.7 release or 0.8 release back in 2011 or something like that, all the way until my previous job at Akamai, where I was a principal engineer and I was just sort of measuring latency and throughput. And so I've sort of seen the evolution of Kafka, from back when they required ZooKeeper and all of these things.
Starting point is 00:02:07 And so the way I got into it was, I guess the history of how we got here is: at first we were optimizing ads and, you know, we were using Storm and all of this, and then Mesos started to come about. And I was like, oh, Mesos is really cool. The future of streaming is going to be, you're going to have an oracle and then something is going to schedule those containers.
Starting point is 00:02:31 I happened to choose, you know — in retrospect, now that Apache Mesos got archived, or it got pushed to the Apache Attic a couple of weeks ago — we chose the wrong technology. I think it was the right choice at the time. Like, Kubernetes, you know, was barely working on a couple hundred nodes and Mesos was proven at scale. And so it just seemed like the right choice. What we focused on, though, is streaming as the technology that helps people sort of extract the value of now, or in general deal with time-sensitive data. Like a trade, fraud detection,
Starting point is 00:03:07 Uber Eats — these are all really good examples of things that need to happen relatively quickly, like in the now. And so streaming systems are the kind of technology designed to help people deal with that kind of complexity. So what Concord did was — we really liked the Apache Storm ideas. Back then, when I was at Yieldmo, Storm was really slow and it was really hard to debug: this thing called Nimbus and the supervisors and that kind of thing, you know, stack traces across languages like Scala, Clojure and Java. And I was like, I need to figure out three standard libraries to debug this thing. And so we wrote Concord in C++ on top of Mesos. It sat squarely on the compute side. And so streaming is really storage and compute chained together. And then at the end of it, you do something useful. You say, hey, this credit card transaction was fraudulent. That's what I did for a long time.
Starting point is 00:04:06 And, you know, long story short, I've been using Kafka as really this storage medium. And we really couldn't, I personally couldn't get enough performance out of it with the right safety mechanics. And so in 2017, I did an experiment where I took two Akamai Edge computers with a wire back to back to each other, so no rack latency, nothing.
Starting point is 00:04:30 It's just an SPF wire connected back to back between these two computers. And I measured, let me start up a Kafka server, I think maybe 2.4 or something like that, or 2.1 at the time, a Kafka server and a Kafka client and measure what it can drive this hardware to for 10 minutes, right? Both in latency and throughput. Let me turn that down and let me write a C++ program that bypasses the kernel and bypasses the page cache, so the storage level two,
Starting point is 00:04:59 and see what the hardware is actually capable of. I just wanted to understand: what's the gap? Where does this accidental complexity come from? Like, how much money are we leaving on the table? The first implementation was a 34x tail latency performance improvement. And I was just floored. I took two weeks and I was comparing the bytes of the data on disk, just making sure that
Starting point is 00:05:26 the experiment actually worked. So yeah, honestly, that was the experiment that got me thinking for a long time that hardware is so fundamentally different from how hardware was a decade or more ago, when Kafka and Pulsar and all these other streaming technologies were invented. If you look at it, actually the Linux scheduler and the block and IO algorithms — basically, the thing where you send data to a file and the Linux kernel sort of organizes it for optimal writes and reads — it's fundamentally different. It was designed for effectively millisecond-level latencies,
Starting point is 00:06:05 and the new disks are designed for microsecond-level latencies. So it's like, this is a huge, huge gap in performance. Not to mention that now you can rent on Google 220 cores on a VM. You can rent a terabyte of RAM. So the question is, what could you do differently with this new world? Right? Like, it's so different. It's like a totally different computing paradigm. And I know that Herb Sutter coined the term, there's no free lunch. And you basically have to re-architect something from scratch for the new bottleneck.
Starting point is 00:06:40 And the new bottleneck is the CPU. The latencies of disks are so good, and same thing with the network and all these other peripheral devices, that the new bottleneck is actually the coordination of work across the 220-core machine. That is, if you will: the future is not cores getting faster, the future is getting more and more cores. And so the bottleneck is in the coordination of work across the CPUs. And so we rewrote this thing in C++, and that's kind of maybe a really long-winded way of saying how we got here. That is fascinating. So maybe you can explain a little bit about
Starting point is 00:07:25 where would someone deploy Kafka initially? You mentioned fraud, and that makes sense to me. You have a lot of data and you need to stream that. But the example that was, I guess, a little surprising was Uber Eats. So how would you put Kafka in Uber Eats? Where would that fit in the pipeline? Great question. Let me actually give you a general sense, and then we can talk about that case. Kafka sits in this new way of like event streaming has been a thing for a really long time. People have been trading systems, they have been, right? But in the modern stack,
Starting point is 00:07:58 it's called event streaming. And what is an event? An event is what happens when you contextualize data. So you have some data, let's say a yellow t-shirt, or like a green sweater, the one I'm wearing today. That's just data, it doesn't mean anything. Now let's say this green sweater is coming through Amazon.com, and then I start to add all of this context — that makes an event, right? There's a lot of richness there implicitly. There's also a time attached to that transaction, right? If I buy it today, there's this immutability about these facts. And so event streaming is really this new way of thinking about your architecture as these immutable, contextualized pieces of data that go through your architecture. And so in the case of Uber Eats, for example, when I go into my Uber Eats app and I select my favorite Thai restaurant and I say, hey, get me number 47.
Starting point is 00:09:04 I can't pronounce it, but I know it's the item I always get from the Thai restaurant across the corner. It's like Chinese broccoli with some stuff. It's delicious. And so the Uber, it's immutable that I paid $10 for it. It's immutable that the restaurant got it, you know, 30 seconds later after this order. And so you start to produce kind of this chain of events.
Starting point is 00:09:28 And you can reason about your business logic as this, you know, effectively a function application over a series of immutable events. It's kind of like functional programming at the architectural level. And why is that powerful? That's powerful because you can now actually understand how you, you know, how you make decisions. And so to go back to the case of fraud detection is really useful, not in making the decision. Like you can make, you can just create a little microservice in Node or in Python, it doesn't matter. And if you say, hey, is the, you know, whatever, is the credit card above $10,000?
Starting point is 00:10:02 That's probably fraudulent for buying, you know, Thai food. That's not the interesting part. The interesting part is that it's been recorded in an ordered fashion, so that you can always make sense of that data and you can always retrieve it, right? So there are these properties that Kafka brought to the architect. One, durability — it means this data actually lives on disk. Two, by the way, it's highly available: if you crash one computer, it's going to live on the other two computers, so you can always get back this data. And three, it's replayable: if the compute crashes, you can resume the compute from the previous offset, right?
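Purely to make those properties concrete — this is an illustration, not Kafka or Redpanda code — here is a tiny C++ sketch of an offset-addressed, append-only log: every event gets an immutable, ordered offset, and a consumer that remembers its last processed offset can replay from exactly that point after a crash.

```cpp
// Toy append-only log with integer offsets (illustration only; not Kafka/Redpanda code).
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct Log {
    std::vector<std::string> records;            // ordered, append-only

    // Appending returns the record's immutable offset.
    uint64_t append(std::string event) {
        records.push_back(std::move(event));
        return records.size() - 1;
    }

    // Replay everything from a saved offset, e.g. after a consumer crash.
    template <typename Handler>
    void replay_from(uint64_t offset, Handler&& handle) const {
        for (uint64_t i = offset; i < records.size(); ++i) {
            handle(i, records[i]);
        }
    }
};

int main() {
    Log orders;
    orders.append(R"({"event":"order_placed","item":47,"amount":10})");
    orders.append(R"({"event":"order_accepted","eta_minutes":30})");

    uint64_t checkpoint = 1;                      // last offset the consumer had processed

    // The consumer restarts and resumes exactly where it left off.
    orders.replay_from(checkpoint, [](uint64_t off, const std::string& r) {
        std::printf("offset %llu: %s\n", static_cast<unsigned long long>(off), r.c_str());
    });
}
```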
Starting point is 00:10:34 And so I think those were the properties that the enterprise architects started to understand and see. They're like, oh, it's not just for finance and doing trades, it works across almost any industry. In oil and gas, measuring the jitter on oil and gas pipelines, where you have these little Raspberry Pi-looking things. And the point of this Kafka pipeline, which was later replaced with Redpanda, was just to measure how much jitter there is on this pipeline, and should we turn it off? It's really cool. We've seen it in healthcare, where people are actually analyzing,
Starting point is 00:11:21 you know, patient records and patient data, and they want to use new technologies like SparkML or TensorFlow, and they want to connect to this, like, you know, real-time streaming for COVID, for example. We're talking with a hospital in Texas. They wanted to measure their, like, COVID vaccines in real time and alert things, you know, for all sorts of suppliers. We've seen people in the food industry. It's like in a sports betting. It's like huge outside of the United States. There's just this massive,
Starting point is 00:11:52 it's almost to me, it feels like streaming is at this stage where databases were in the 70s. Before that, people are writing to flat files and it works. That's a database. Every customer gets a flat file. You read it every time you need to change it. You just rewrite the entire customer data. And that's kind of like a pseudo database.
Starting point is 00:12:12 But then databases gave users a higher level of abstraction and a modeling technique. And to some extent, that's what Kafka has done for the developer: think about, and use, this pluggable system that has this rich ecosystem as a new way to model your infrastructure as an immutable sequence of events that you can reproduce, you can consume, and that's highly available. So I think those were the benefits of switching to an architecture like Kafka. Those customers are super interesting to hear about, and that makes sense, IoT and all of that. So maybe one more question that would come to a lot of people's minds is, at what point should you stop using something like SQS?
Starting point is 00:12:51 It seems like it provides a lot of that similar functionality, just that it'll be much more expensive and you don't get to see the bare bones stuff. But Amazon supports that for you, right? So why do customers stop using SQS or something like that and start using like Kafka or Red Panda directly? Yeah, Kafka has done through
Starting point is 00:13:12 history, right? So Kafka is 11 years old now. And the value to developers on Kafka is not Kafka the system, but the millions of lines of code that they didn't have to write to connect to other downstream systems, right? It's like the value is in the ecosystem, the value is not in the
Starting point is 00:13:31 system. And so when you think about that, let's think about Kafka as two parts: Kafka the API, and Kafka the system. People have a really challenging time operating Kafka the system, with ZooKeeper — even if, I know, there might be some listeners that are interested and they're like, oh, well, KIP-500 was released. I think we could talk about that, about what KRaft means and ZooKeeper, too, later. But anyways, so if you look at Kafka as two things, the API and the system, the reason why someone would want to use the Kafka API — which, by the way, Pulsar also started supporting; it's not just Redpanda, it's really a trend in data streaming systems —
Starting point is 00:14:08 is you can take Spark ML and TensorFlow and ClickHouse and MemSQL and all of these databases, it just floods right in. And you didn't write a single line of code. You start to think about the systems of these Lego pieces. Of course, for your business logic, you have to write the code. It's what people get paid for, right? But for all of these databases, all of these downstream systems, whether you're sending data to anything, Datadog for alerting
Starting point is 00:14:36 or Google BigQuery or Amazon Redshift or Materialize or CockroachDB or any of these databases, there's already this huge ecosystem of code that works, that you don't have to maintain because other people are maintaining it, right? And so you just plug into this ecosystem.
Starting point is 00:14:54 So I would say the largest motivation for someone to move away from a non-Kafka API system, which, you know, is true for Google PubSub and Azure Event Hub, and there's like 100 of them, is in the ecosystem. And realizing that it is basically the community, I think Red Panda, for example, makes Kafka the community bigger and better. We start to expand the uses of the Kafka API into embedded use cases.
Starting point is 00:15:28 For example, for this security appliance company, they actually embed a Redpanda, because we're in C++, super small footprint. So they just embed us in-process to do intrusion detection. So every time they see an intrusion into the network, they just write a bunch of data to disk, but it's fronted through the Kafka API. And in the cloud, it's the same code. It's just that it's not one local Redpanda instance.
Starting point is 00:15:47 It's a cluster. And so I think for people considering other systems, whether it's SQS or Kinesis or PubSub or Azure Event Hub — first of all, there are specific trade-offs that, you know, we'd need to dig into in real detail. But at the architectural level, plugging into this ecosystem of the Kafka API is so important in getting to leverage the last 10 years of work that you didn't have to do. And it takes you five seconds to connect to these other systems. That is really fascinating.
Starting point is 00:16:15 And I guess large companies, they have a million things they want to integrate with open source things and all of these new databases materialize and all of that so kafka is kind of like the rest api in a sense it's the i think it's become the new network to some extent i mean people joke about this but if you're think about this if you had an appliance that could keep up with the throughput and latency of your network but give you auditability it give you access control it it gives you replayability, why not? That's I think some of our more cutting edge users are using Red Panda as the new network and they needed the performance that Red Panda brought
Starting point is 00:16:56 to the Kafka API ecosystem to enable that kind of use case, which is every message just gets sent to Red Panda. It could keep up with, it could saturate hardware, but now they get this tracing and auditability, they could go back in time. So you're right, it's almost like the new REST API for microservices. So what is it about Kafka that makes it slow?
Starting point is 00:17:18 From an outsider's perspective, to me, it seems like when a code base gets more and more features contributed by like hundreds of people over like a long time span there's just so many like ifs and elses and checks and this and that that tend to like bloat the api service and also slow things down and then somebody tries to profile and improve things incrementally but could you maybe walk me through like what have you learned by looking at the code base? And why do you think it's slow? Like, one thing you mentioned was, you know, you just do kernel bypass and you skip, like, all of the overhead there.
Starting point is 00:17:51 But is there anything inherent about Kafka itself that makes it really slow? Yeah, so it's slow, comparatively speaking, right? And we spent 400 hours benchmarking against Kafka, so I have a lot of details about this particular endeavor. Kafka is, let me step back and say, an expert could probably tune Kafka to get much better performance than most people can, right? Most people don't have 400 hours
Starting point is 00:18:21 to benchmark different settings of Kafka, and Kafka is multi-modal in its performance. And I can dig into that a little bit. But assuming that you're an expert, and assuming that you're going to spend the time to tune Kafka for your particular workload — which, by the way, changes depending on throughput; the performance characteristics of running sustained workloads on Kafka actually vary, and so therefore your threading model varies: the number of threads for your network, the number of threads for your disk, the number of threads for your background workloads, and the amount of memory. I think it's this whack-a-mole of tuning Kafka that is really the most daunting task for an engineer, because it is impossible,
Starting point is 00:19:07 I think, in my opinion, to ask an engineer who doesn't know any of the internals of Kafka — unless they go and read the code — to understand, well, what is the relationship between my IO threads and my disk threads and my background workloads, and how much memory should I reserve for this versus how much memory should I reserve for that, right? There are all of these trade-offs that do matter as soon as you start to hit some form of saturation. So let me give you now the details on the parts where we improve performance, which is specifically in the tail latency, and why that matters in the messaging space, versus throughput. So by and large, Kafka can drive hardware
Starting point is 00:19:47 to the same throughput, similar throughput as Red Panda. Red Panda does always at least as much as Kafka. And in some cases, which we highlight is in the blog posts, in some cases we're a little better, let's say like 30% or 40% better. The actual improvements in performance is in the tail latency distribution. Why does that matter?
Starting point is 00:20:10 Red Panda, and I'm just going to focus on the things that Red Panda brings to the market rather than the negatives of Kafka, because I think we built on the shoulders of Kafka. If Kafka didn't exist, we wouldn't have gotten to learn the improvements or understand the nuance or get you know, get the, oh, maybe I should do this a little different, right? So on the tail latency performance
Starting point is 00:20:30 improvement, latency, and I've said this a few times, is the sum of all your bad decisions. That's just like, that's what happened that, you know, at the user level, when you send the request to a microservice that you wrote, you're just like, oh, right, I should have used a different data structure. There's no cache locality, blah, blah, blah. That's what latency is. And so what we focus on is how do we give people predictable tail latency? And it turns out that for all of our users,
Starting point is 00:20:58 that predictable tail latency often results in like 5X hardware reduction. So let me materialize all of this performance improvement where we are better and how that materializes for users. So we pay a lot of attention in detail. That means we spent a ton of engineering time and effort and money and benchmarking and test suite on making sure that once you get to a particular latency,
Starting point is 00:21:24 it doesn't spike around, it's stable, because you need that kind of predictability. Let me give you a mental example, or mental model, in which you could potentially achieve really good average latency and terrible tail latency. Let's say that you have a terabyte of heap and you just write to memory. And every 10 minutes, you flush a terabyte.
Starting point is 00:21:44 So every 10 minutes, you get one request that is like five minutes long, because you have to flush a terabyte. And then otherwise, the system looks good. And so what people need to understand is that the more messages you put in the system, the more you start to hit those tail latency spikes that Kafka has — and being that you are a messaging system, most of your users are therefore going to experience that latency. And so we said, well,
Starting point is 00:22:11 how can we improve this for users? And so in large, we said, let's rethink this from scratch. And that really had a fundamental impact in that we don't use a lot of the Linux kernel facilities. So there are these global locks that happen in the Linux kernel when you touch global objects. For example, the page cache. And I actually think it's the right decision for the page cache to be global. Because if you look at the code,
Starting point is 00:22:38 there's a ton of edge cases and things that we have to optimize for to make sure that it even just works. And then a lot more to make sure that it worked fast. So it's a lot of engineering effort on an effort that we didn't know it was going to pay off, to be honest. And then it happened to pay off. So we just believed that we could do better with modern hardware. And so we don't have this global locks kind of at the low level on the Linux kernel objects. And because we don't use these global resources,
Starting point is 00:23:07 we partition the memory across every individual core. And so memory allocations are local. You don't have this global massive garbage collection that has to traverse terabytes of heaps, right? You have this localized little memory arena. It's kind of like taking a 96-core computer and creating a mental model of 96 little computers inside that 96 core computer.
Starting point is 00:23:30 And then structuring the Kafka API on top of that, it's really that re-architecture, because again, remember that the new bottleneck in computing is CPU, is rethinking the architecture to maximize and really extract the value out of hardware. And my philosophy is the hardware is so capable, the software should be able to drive hardware at saturation at all points, right? If you're not driving hardware saturation at throughput, then you should be driving
Starting point is 00:24:01 hardware basically at the lowest latency that you can. And these things need to be predictable because when you build an application, you don't say, oh, let me think about what is my tail latency for this and that. And then I might need most of the time, I need five computers, but there's other 10% of the time I need 150 computers. Let me just take an average of 70, you know, whatever, 50 or 75 computers. So it's really hard to think about building applications when your underlying infrastructure is, you know, not predictable. And so that's really a big improvement. And then the last improvement on the Kafka API
Starting point is 00:24:36 was that we only expose safe settings. We just went with Raft for the replication model. And I think that was a big improvement on the state of the art of streaming. If you look at the actual implementation of Kafka's ISR replication model, or Pulsar's — I think it's primary-backup with some optimizations — versus our Raft implementation, you know that we didn't invent our own protocol, so there's a mathematical proof behind the replication. But also, you understand as a programmer, oh, this is what it means to have two out of three replicas, or this is what it means to have three out of five replicas up
Starting point is 00:25:14 and running, right? So it's kind of all of this context. So that was a long-winded answer, but you asked such a critical thing that I had to be very specific just to make sure I don't leave room for ambiguity — or try not to. Maybe can you explain why it is important to partition memory per core? What happens when you don't? One thing you mentioned was there's the garbage collection that has to go through everything. What exactly is wrong with that? If you can elaborate on that.
Starting point is 00:25:43 Yeah. So there's nothing wrong, and everything works; it's the trade-offs that we want to optimize for — basically, make it cost-efficient for people to actually use data streaming. To me, I feel that streaming is in that weird space that Hadoop was in a few years ago, where there's all this money being put into it, but very few people actually get value out of it. It's just like, why is this thing so expensive to run? And so when I went into that, I was like,
Starting point is 00:26:11 how do we bring this to the masses so that it's not so massively expensive to run? If you ask, you know, basically anyone that has run any other streaming system that's out there, they always have to over-provision, because they just don't understand the performance characteristics of the system. So let me talk about the memory partitioning. So modern computers — the new trend is they're going to keep increasing in core count.
Starting point is 00:26:36 The frequency, the clock frequency, the CPU is not going to improve. Here's the tricky part where it gets very detailed. Even on one CPU, CPUs individually still got faster, even if the clock frequency didn't improve. You're just like, how is this possible? It improved through the very low level things like instruction prefetching, you know, basically parallel execution, like pipeline execution. There's all of these tricks at the lowest level of instruction execution that made even even if the clock frequency of the cpu wasn't getting faster it made something like you know 2x
Starting point is 00:27:11 performance improvement or maybe 3x over the last 10 years but now the actual larger trend in computing is getting more more core counts my desktop has 64 physical cores it's like the ryzen 3900 so uh so so that trend. In the data center, there's also this weird trend, which I actually don't think the industry has settled on, where even on a single motherboard, you have two sockets. And so now when you have two sockets, you have this thing called NUMA memory access, right, and NUMA domains, which is every socket has a local memory that it makes, quote, low latency access and allocations,
Starting point is 00:27:50 that it works like one computer. But it can leverage remote memory from the other socket's memory, right? And so when you rent a cloud computer, you don't understand, well, what kind of hardware is it, right? Like to some extent, you're paying for that virtualization, and most people are running in the cloud these days. So why is memory, why is like pinning the memory to that particular thread important?
Starting point is 00:28:16 It matters because, like I mentioned, latency is the sum of all your bad decisions. And so what we did is we said, okay, let's take all of the memory for this particular machine. And I want to give you an opinionated view on it, which is, if you're running this at really large scale, I'm going to say the optimal production setting is two gigabytes per core. That's what we recommend for Redpanda. You can run it on like 130 megabytes if you want to, for very low volume use cases. But if you're really aiming to go ham on that hardware, those are kind of the memory recommendations that we make. Okay, so why is that important?
Starting point is 00:28:53 When Redpanda starts up, it says, I'm going to start one pthread for every core. That gives me, the programmer, a concurrency and parallelism model. So within each core, when I'm writing code in C++, I code to it like it is a concurrent structure. But the parallelism is a free variable that gets executed on the physical hardware.
Starting point is 00:29:16 The memory comes in, in that we split the memory evenly across every core. So say you have a computer with 10 cores: we take all the memory, we sum it up, we subtract like 10%, and then we split it by 10. And then we do something even more interesting. We go and we ask the hardware, hey, hardware, tell me, for this core, what is the memory bank that belongs to it, right? Like, in this NUMA domain, in this CPU socket, what is the memory that belongs to this CPU socket? And then the hardware is going to tell you,
Starting point is 00:29:52 based on the motherboard configuration, this is the memory that belongs to this particular core. And then we tell the Linux kernel, hey, allocate this memory and pin it to this particular thread, and lock it — so don't give it to anybody else. And then this thread preallocates that as a single byte array. And so now what you've done is you've eliminated all forms of implicit cross-core communication, right?
Starting point is 00:30:19 Because that thread will only allocate memory on that particular core, unless the programmer explicitly programs the computer to go and allocate memory on a remote core. And so it's kind of a relatively onerous system to wrap your head around, but it's fine if you're programming in an actor model. So what does that mean for a user?
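As a rough sketch of the shard-per-core mechanism being described — one pthread pinned to each core, with a NUMA-local, preallocated arena that only that thread ever touches — here is a minimal, hypothetical C++ example using Linux's libnuma and pthread affinity. It is a toy model of the idea, not Redpanda's actual allocator, and the sizes are illustrative.

```cpp
// Minimal shard-per-core sketch: one thread pinned per core, each with a private,
// NUMA-local arena that only that thread ever allocates from (Linux + libnuma).
// Build: g++ -O2 -pthread shards.cc -lnuma
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <numa.h>          // numa_available, numa_node_of_cpu, numa_alloc_onnode
#include <pthread.h>       // pthread_setaffinity_np
#include <sys/mman.h>      // mlock
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

struct Shard {
    unsigned core = 0;
    char*    arena = nullptr;   // preallocated byte array, never shared across cores
    size_t   size = 0;
    size_t   used = 0;          // simple bump allocator, no locks needed

    void* alloc(size_t n) {     // only the owning thread ever calls this
        if (used + n > size) return nullptr;
        void* p = arena + used;
        used += n;
        return p;
    }
};

static void run_shard(Shard* s, size_t bytes_per_core) {
    // Pin this thread to its core so the scheduler never migrates it.
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(s->core, &mask);
    pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

    // Ask which NUMA node owns this core and allocate the arena on that node.
    int node = numa_node_of_cpu(static_cast<int>(s->core));
    if (node < 0) node = 0;
    s->arena = static_cast<char*>(numa_alloc_onnode(bytes_per_core, node));
    if (s->arena == nullptr) return;
    s->size = bytes_per_core;
    mlock(s->arena, s->size);   // keep it resident (may need raised memlock limits)

    // From here on, allocations on this core come out of the local arena only.
    void* first = s->alloc(4096);
    std::printf("core %u -> numa node %d, arena %p, first alloc %p\n",
                s->core, node, static_cast<void*>(s->arena), first);
}

int main() {
    if (numa_available() < 0) return 1;              // needs a NUMA-aware kernel
    unsigned cores = std::thread::hardware_concurrency();
    if (cores == 0) cores = 1;
    // Toy size; the conversation's production guidance is on the order of 2 GB per core.
    size_t per_core = 256UL << 20;

    std::vector<Shard> shards(cores);
    std::vector<std::thread> threads;
    for (unsigned c = 0; c < cores; ++c) {
        shards[c].core = c;
        threads.emplace_back(run_shard, &shards[c], per_core);
    }
    for (auto& t : threads) t.join();
}
```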
Starting point is 00:30:37 Let me give you a real impact. We were running with a big Fortune 1000 company, and they took a 35-node Kafka cluster and we brought it down to seven. So all of these little improvements matter, because at the end of the day you could get a 5.5x — sorry, hardware cost reduction — and a 1600% performance improvement, you know, all of these things. There's a blog post we wrote a month ago where we actually talk about what are the other mechanical sympathy techniques that we do to ensure that we get low latency, you know, that we give low latency to the Kafka API. And so that was a long-winded way of explaining, at the lowest level, why it matters how you allocate memory. It all boils down to the things
Starting point is 00:31:23 that we were optimizing for, which is saturated hardware. So streaming is affordable for a lot of people. Make it at low latency so that you enable new use cases like this oil and gas pipeline, for example. And yeah, and so that's kind of, that was a really deep, deep stuff. I'm happy to compare with our thread pool algorithms
Starting point is 00:31:45 and how that's different, but that's how we think about building software. Now I want to know, what is the latency difference when you use memory from your own NUMA node versus when you try to access a remote memory? How much faster is it to just stay in your own core, I guess? Well, it's faster in proportion...
Starting point is 00:32:08 Sorry, it's faster relative to the other... I think the right question to ask is, what is the latency of crossing that NUMA boundary in relation to the other things that you have to do in the computer? If you have one thing to do, which is you just need to allocate more memory on that core, it'll be plenty fast, right? But if you're trying to saturate hardware
Starting point is 00:32:29 and you're trying to do this Kafka API, I think then let me give you orders of magnitude comparison. Yeah. It's a few microseconds on like, you know, to cross the boundary, et cetera, and allocate memory from another core. That's assuming, by the way, that you've cached some, with the latest experiments,
Starting point is 00:32:46 some experiments I did last year to cross the NUMA boundary and allocate memory, right? But let me give you, let me put that in perspective with writing a single page to disk using the NVMe with kernel bypass. You could write a page
Starting point is 00:33:01 to an NVMe SSD device, assuming non-3D cross-point technology, just regular NVMe on your laptop, in single to double-digit microseconds, right? So when you say now a memory allocation is in the microsecond space, you're just like, well, shit, that's really expensive in comparison with actually doing useful work like sending this thing to this, right? So, so I think, you know, it's really hard to, for humans to understand latency and unless you work in the low latency space or have like an intuition for what the computer can actually do or the type of work that you're trying to do in that particular space, it's really hard to judge. But that adds up. Now, if you add contention, right, then it's in like the human time, right? Depending on how content that the resources are,
Starting point is 00:33:51 let's say that, you know, trying to read, to allocate from a particular memory bank in a remote NUMA node, if there's no memory, then you have to wait for a page fault and stuff to get, you know, reaching the disk. And so, so these things just add up and it's sort of like, it's really hard to, to give intuition. But I think the better intuition is like, let's compare with the other things, useful things that you need to be doing. Like, right. Like the useful thing of a computer is to be used for doing some useful thing for the business, like detecting fraud
Starting point is 00:34:24 detection, or sending an Uber rider around to your house, or doing all of these things. And so the question is, is it expensive? I think it's really expensive in comparison with the things that you could actually be using the computer for. That makes sense to me. And is there any other fancy stuff you do, like in terms of networking? Because recently I've heard that, you know, even NICs and everything are getting extremely fast, and there's a lot of overhead in software that, yeah, just the kernel and other things impose. I mean, I think this is kind of such an exciting time to be in computing. Let me tell you a couple of things that we do. So we actually expose some of the latency that the disk is propagating.
Starting point is 00:35:08 So let's say you're writing to disk, and over time you start to learn, you know, the device is getting busier, so the latencies were one to eight microseconds to write a page and now they're, like, you know, 30 to 100 microseconds to write a page, because there's a little bit of contention, there's a little bit of queuing, and you start to learn that. At some point, there are some thresholds that we get to monitor, because we don't use the Linux kernel page cache. So we get to monitor these latencies and propagate those latencies to the application level, to the replication model,
Starting point is 00:35:41 which is very cool, when you co-design a replication model with a mechanical execution model, because it means that you're sort of passing implementation details on purpose up to the application level, so you can do application-level optimizations. One of those optimizations is reducing the number of flushes. So in Raft, in order to persist data on disk, you write the page and then you flush the data to disk. But we can do this adaptive batching, so that we write a bunch and then we issue a single flush, as opposed to, like, five flushes in a row.
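A hypothetical C++ sketch of that flush-coalescing idea (not Redpanda's implementation): appends are buffered into an append-only file, a single fsync is issued for the whole batch, and only then are the acknowledgements for all the covered writes released.

```cpp
// Hypothetical flush-coalescing writer: many appends, one fsync, then release all
// the acknowledgements that the single flush made durable (error handling trimmed).
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

class CoalescingLog {
public:
    explicit CoalescingLog(const char* path, size_t flush_threshold = 1 << 20)
        : fd_(::open(path, O_CREAT | O_WRONLY | O_APPEND, 0644)),
          flush_threshold_(flush_threshold) {}

    ~CoalescingLog() {
        flush();
        if (fd_ >= 0) ::close(fd_);
    }

    // Append a record; the ack callback fires only once the data is durable.
    void append(const std::string& record, std::function<void()> ack) {
        if (fd_ < 0) return;
        if (::write(fd_, record.data(), record.size()) < 0) return;
        pending_bytes_ += record.size();
        pending_acks_.push_back(std::move(ack));
        if (pending_bytes_ >= flush_threshold_) flush();   // one fsync covers many appends
    }

    // The expensive fsync is amortized across the whole batch.
    void flush() {
        if (pending_acks_.empty()) return;
        ::fsync(fd_);
        for (auto& ack : pending_acks_) ack();
        pending_acks_.clear();
        pending_bytes_ = 0;
    }

private:
    int fd_ = -1;
    size_t flush_threshold_;
    size_t pending_bytes_ = 0;
    std::vector<std::function<void()>> pending_acks_;
};

int main() {
    CoalescingLog log("/tmp/coalesced.log");
    for (int i = 0; i < 1000; ++i) {
        log.append("record\n", [i] { std::printf("acked %d\n", i); });
    }
    log.flush();   // flush whatever is left before shutdown
}
```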
Starting point is 00:36:15 That's one thing. The second thing though, is what this latency gives you is a new computer model. We added a web assembly. We actually took the V8 engine and we, you know, there's two modes of execution for that V8 engine currently. One is as a sidecar process and one is inline in the process. So every core gets a V8 isolate, which is like the Node.js engine, right? Inside a V8 isolate, there's a thing called V8 context. And just go with the terminology
Starting point is 00:36:44 for a second, because there's a lot of terms. It means that you can execute a JavaScript file inside a V8 context. In fact, a context can execute multiple JavaScript files. So why does this matter? It means that Red Panda becomes programmable storage for the user. Think of the Transformers
Starting point is 00:37:08 when Optimus, whatever, says unite, and then all the robots combine and you make a bigger robot, kind of thing. It's the fact that you can ship code to the storage engine and change the characteristics of the storage engine. Now you're streaming
Starting point is 00:37:24 data, but because we're so fast, we're like, oh, now we have a latency budget. So we could do new things with the latency, but we could introduce a computing model that allows the storage engine to do inline transformations of this data. So let's say you send the JSON object and you want to remove the social security number for GDPR compliance or whatever, HIPAA compliance or whatever it is, then you could just send the JavaScript function and it'll do an inline transformation of that or it'll just obscure it for performance so you don't reallocate, just write XXXX
Starting point is 00:37:54 and then you pass it along. But now you're sort of adding, you get to program what gets on this at the storage level. You're not ping-ponging your data between Red Panda and other systems, right? You're executing, like, to some extent, you're really just sort of raising the level of abstraction
Starting point is 00:38:14 of the storage system and the streaming system to do things that you couldn't do before, like inline execution of filtering, inline execution of masking, simple enrichments, simple, yeah, just simple things that are actually really challenging to do outside of this model. And so you invert the computational model
Starting point is 00:38:32 where now you can ship code to the data and it executes inline. And so some of the more interesting things is actually exposing the WebAssembly engine, which is just V8, to our end users. So as an end user, we've now expanded the Kafka API, where you say rpk — this command line thing that we wrote — wasm deploy, and you give it a JavaScript file,
Starting point is 00:38:53 you tell it the input source and the output source, and that's it. The engine is in charge of executing that little JavaScript function for every message that goes in. So I think this is like the kind of impact that being fast gives you. You now have computational efficiencies that allow you to do different things that you couldn't do before.
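In Redpanda the deployed function is JavaScript running in V8; purely to show the shape of such an inline, per-record transform, here is a minimal C++ sketch of the masking example mentioned earlier. The record layout and field name are assumptions for illustration.

```cpp
// Shape of an inline, per-record masking transform (illustration only; in Redpanda
// the deployed function is JavaScript, and the field name here is made up).
#include <cstdio>
#include <string>

// Replace the value of a top-level "ssn" field with a masked placeholder.
// A real transform would parse JSON properly; this just shows the idea.
std::string mask_ssn(const std::string& record) {
    const std::string key = "\"ssn\":\"";
    size_t start = record.find(key);
    if (start == std::string::npos) return record;    // nothing to mask
    start += key.size();
    size_t end = record.find('"', start);
    if (end == std::string::npos) return record;
    std::string out = record;
    out.replace(start, end - start, "XXX-XX-XXXX");    // mask in place
    return out;
}

int main() {
    // Imagine this running once per message between an input topic and an output topic.
    std::string in = R"({"name":"jane","ssn":"123-45-6789","amount":10})";
    std::printf("%s\n", mask_ssn(in).c_str());
}
```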
Starting point is 00:39:14 That is interesting. And I think one thing you mentioned over there was HIPAA compliance or something to get rid of information. What are some use cases that you can talk about publicly that you've seen that you just would not expect? And you were like, wow, I can't believe that is what this thing is being used for. Yeah, let me think. Well, you know, one of them is IP credit score.
Starting point is 00:39:37 So why that's interesting is not as a single step, but as an aggregate of steps, it's really critical. So we have this, let me just try to frame it in a way that doesn't leak who the customer is. But we have a massive customer, internet scale customer. They are trying to give every one of their users, so really like profiling information that is anonymous, which is kind of wild. So every IP gets a credit score information. So you'd say this IP, let's say in America, you have credit scores from like 250 to 800 or maybe 820. And so you give it the same number,
Starting point is 00:40:19 or like a credit score to every IP and you monitor that over time. But now they can push that credit score enrichment inside that IP. And then you can potentially make a trade on that. And so there's all of this like weird market dynamics. Let me say, let me give you this example. Let's say you watch the Twitter feed and you're just like, oh, you know, what is like the metadata that i can associate with this particular tweet and can i make a trade on that and so it's like a really wild thing um to do and then the last one that we're focusing on is this user who is actually extending the web assembly
Starting point is 00:40:57 protocol, because it's all on GitHub, right? So you could literally download Redpanda and then swap our WebAssembly engine for your own WebAssembly engine. He's actually spinning up very tall computers that are tall both in terms of CPU, in terms of memory, and in terms of GPU. And he has a CUDA job running on the GPU, and then every time a message comes in, he just makes a local call to this sidecar process that is running this machine learning thing, to say, like, you know, should I proceed or should I not proceed with this particular decision. Those are things that we just didn't plan for. I was like, wow. You know, when you
Starting point is 00:41:34 sort of expand the possibilities and give people new computing platforms, everyone will use it. It's actually not a competing platform — it enriches how people think about their data infrastructure. Because Spark and Apache Flink, they all continue to work. The Kafka API continues to work. We're just simply expanding that with this WebAssembly engine. Yeah. I think it's fascinating, because you can think of, let's say that you want to build a competitor to Google today. Just the amount of machines that they have for running search is very high. And not that you'd easily be able to build a competitor, but at least using something like this will make so much of your infrastructure costs cheaper that it's possible to do more
Starting point is 00:42:19 things. That's the way I'm thinking about it. Yeah, we're actually in talks with multiple database companies that we can't name, but what's interesting is that there are multiple of them, where we are actually the historical data — both the real-time engine and the historical data. So Kafka — the Kafka API — and of course Redpanda,
Starting point is 00:42:42 give the programmer a way to address by integer every single message that has been written to the log, right? It gives you total addressability of the log, which is really powerful. Why? So if you're a database, imagine that each one of these addresses is just like a page on a file system, like an immutable page, right? So like this database vendors, they're streaming data through Red Panda. And then, you know, it gets written as a Kafka batch, and then we push that data to S3. We can transparently hydrate those pages and give it to them.
Starting point is 00:43:19 So they've actually built an index on top of the Kafka index that allows them to reason about a Kafka batch as, you know, this real dynamic way to fetch pages, right? Sort of like this page-faulting mechanism that can also ingest real-time traffic. And so that's another really sort of fascinating thing that we didn't think of until we started building this. Yeah. So then let me ask you, how do you get customers? Right? Like, okay, you build this thing, it's amazing, and I'm assuming your customer is like another principal engineer somewhere who's frustrated at how
Starting point is 00:43:54 slow Kafka is, or at the operational cost. But my concern would be, you know, how do we know that Redpanda is not going to lose data or something? Right now it's much easier because you have many more customers, but how do you bootstrap something like this? And how do you get people to trust your software and say, yes, I will replace Kafka with this?
Starting point is 00:44:25 I think that's one of the hardest things that we had to do in the beginning. So our earlier users were people that knew me as an individual engineer. And I just asked them, I was like, hey, man, I'm building this. They're friends of mine. I was like, I'm building this. Would you give it a shot? I want to try it. So there's that.
Starting point is 00:44:43 But that only lasts so long. So what we really had to do is actually test with Jepsen, which is basically this empirical storage-system hammer that just verifies that you're not bullshitting your customers. If you say you have a Raft implementation, then you better have a Raft implementation according to the spec. And Kyle is a fantastic engineer, too. And so what we did is, we were fortunate enough to hire Denis, who has been focused on correctness for a really long time. And so he came in and actually built and extended an internal Jepsen test suite.
Starting point is 00:45:28 And we just tested for like eight months. So it seems like a lot of progress, but you have to understand that we stayed quiet for two and a half years. We just didn't tell anyone. And in the meantime, we were just testing the framework. The first Raft implementation, just to put it in perspective, took like two weeks. I mean, something janky, but still, right? Like two weeks.
Starting point is 00:45:44 But it took two and a half years to build something that actually works, that is scalable. And so this isn't like success overnight. It's like, well, it took us two and a half years to build the Raft implementation. And now we get to build the future of the streaming. And so the way we verified is, you know, it's really through running our extended Jepsen suite. And, you know, we're going to be releasing,
Starting point is 00:46:06 hopefully sometime later this year, an actual formal evaluation with external consultants too. So there's that. People trust us because the code is on GitHub. And so you're just like, well, this is not just a vendor that is saying they have all this cool stuff. And underneath the scenes, it's just a proxy for Kafka to Java. And I was like, no, you could go look at it. It's it a bunch of C++ code like 300,000 lines of C++ code or maybe 200 something
Starting point is 00:46:31 it's on GitHub and you can see the things that we do and I invite all the listeners to go and check it out and try it out because you could just verify these claims for yourself it's you know you can't hide away from this and And it's in the open, right? So we use the same license with CockroachDB. So everyone can download it. We only have one restriction that says you can't compete with us being, we're the only one that is allowed to have a hosted version of Red Panda. Other than that, you can run it.
Starting point is 00:46:59 In fact, in four years, it becomes Apache 2. So it's really only for the next four years. It's really, I think, a good trade-off for us. And so you get trust multiple ways. One is people know you and they're willing to take a bet on you. Two is you, but that only lasts with like your first customer, or two. And then, you know, the other ones is that you build, you test like empirically, right? So that's an important thing. You prove the stuff that you're claiming. It is on us to prove to the world that we're safe. The third one is we didn't invent the replication protocol.
Starting point is 00:47:34 So ISR is a thing that Kafka invented. We just went with Raft. We said we don't want to invent a new mechanism for replicating data; we want to simply write a mechanical execution of Raft that is super good. And so it's relying on existing research like Raft, and focusing on the things that we were good at, which was just engineering and making things really fast.
Starting point is 00:48:00 And then over time, there's social proof, which is you get a couple of customers and they refer, and then you start to push petabytes of traffic. And, you know, I think 100 and something terabytes of traffic per day with one of our customers. And so at some point, like the system is always in test. If you have enough users, like your system is always in test. And I think we just stepped into that criteria.
Starting point is 00:48:23 We just have enough users that it's always, every part of the system has always been in test. But still, we have to keep testing and make sure that every commit runs through safety and we adhere to the claims. Yeah. Maybe you can expand a little bit about the release process. How do you know that you're going to ship a new version that's safe for customers? Yeah, that is a lot to cover. So, okay. So, we have five different types of fault injection frameworks.
Starting point is 00:48:51 Five? Yeah, five frameworks, not just five tests — five kinds of test suites, five totally independent frameworks. One of them, we call it Punisher, and it's just random fault exploration. And that's sort of the first level. Redpanda is always running on this one test cluster, and every 130 seconds there is a fatal fault that is introduced into the system. Literally, there's an SSH admin that logs in — not manually, but programmatically — and removes your data directory. And you're just like, well, you know, it has to recover from that.
Starting point is 00:49:28 Or it SSHes into the system and does a kill -9. Or it sends, like, an incorrect sequence of commands to create a topic in Kafka, right? That's sort of the first level, and that's always running. And so that tells us that our system just doesn't, like, fault for any reason. The second thing is, we run a FUSE file system, and so what it does is that instead of Redpanda writing to disk, it writes to this virtual disk interface, and then we get to inject
Starting point is 00:50:02 deterministic faults. The thing about fault injection is, when you combine three or five, like, edge criteria — that's when things go south. Like, I think through unit testing you can get pretty good coverage of the surface area, but it's when you combine, like, oh, let me get a slow disk and a faulty machine and a bad leader.
Starting point is 00:50:21 And so we get to inject surgically: for this topic partition, we're going to inject a slow write and then we're going to accelerate the reads. And so then, you know, you have to verify that there's a correct transfer of leadership. There's another one, which is a programmatic fault injection schedule,
Starting point is 00:50:40 where we say, like, we can terminate, we can delay, we can do all these things. Then there's ducky, which is actually how Apache Kafka gets tested too — there's a thing in Apache Kafka called ducktape — and it injects particular faults at the Kafka API level, right? So it's not just enough that we do that internally for the safety of the system; at the user interface, is the user actually getting the things that we say they are?
Starting point is 00:51:06 And so we leverage some of the Kafka tests — because we're Kafka API compatible, it just works — to inject particular failures. And so we start up three machines, we take librdkafka and write like a gigabyte of data, then we crash the machines, we bring them back up, then we read a gigabyte of data, and we start to complicate the experiments. And so that's the fourth one. And then I'm pretty sure I'm missing another one,
Starting point is 00:51:34 but I know we have five. And so the release process is: every single commit to dev gets run for a really long time in CI. Our CI is parallel, so I think every merge to dev takes like five hours or something like that, but we parallelize it, so in human time it takes one hour — we just run, like, five different test suites at the same time. And so, yeah, I mean, that's how we do it, and that's really sort of the release process.
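For a sense of what a "Punisher"-style random fault scheduler could look like, here is a hedged C++ sketch. The node names and fault commands are placeholders, not Redpanda's internal tooling; the point is just the loop: sleep, pick a random fault, run it against a random node, and let a separate checker verify that nothing acknowledged was lost.

```cpp
// Sketch of a "Punisher"-style random fault scheduler. The node names and fault
// commands are placeholders for illustration, not Redpanda's internal tooling.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <string>
#include <thread>
#include <vector>

int main() {
    // Remote commands to run over SSH against a random node (placeholders).
    const std::vector<std::string> faults = {
        "rm -rf /var/lib/redpanda/data",                   // destroy the data directory
        "pkill -9 redpanda",                               // hard-kill the process
        "tc qdisc add dev eth0 root netem delay 500ms",    // add network latency
    };

    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<int> pick_fault(0, static_cast<int>(faults.size()) - 1);
    std::uniform_int_distribution<int> pick_node(1, 3);

    for (;;) {
        std::this_thread::sleep_for(std::chrono::seconds(130));   // one fault every 130s
        std::string cmd = "ssh node-" + std::to_string(pick_node(rng)) +
                          " '" + faults[pick_fault(rng)] + "'";
        std::printf("injecting fault: %s\n", cmd.c_str());
        std::system(cmd.c_str());   // the cluster is expected to survive and recover
        // A separate checker (not shown) keeps producing and consuming the whole time
        // and asserts that no acknowledged write is ever lost.
    }
}
```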
Starting point is 00:52:05 It takes a long time, and of course if something breaks, then we sort of halt the release. And in addition to that, there's manual testing too, because there are things that we're still starting to codify into the chaos tests. I wonder if Kafka can use the frameworks that you've built for it, and
Starting point is 00:52:21 maybe that would be an interesting day, when Kafka starts using some of that. Yeah. Exactly. You know, some of the things are so company-specific. Like, to launch a test, we have an internal tool called vtools — like Vectorized tools, Vectorized is the name of the company. And so we say, vtools, give me a cluster. And it just takes the binary from CI, deploys it into a cluster, starts a particular test,
Starting point is 00:52:45 and it's just so specific to our setup that it's really hard to... Otherwise, a lot of these tests, I think three out of five are in the open. The things that are actually general, people can just go and look at it, it's on GitHub. But the other two, the ones that are more involved, are just internally.
Starting point is 00:53:05 Okay. What is something surprising you've learned about the state of Kafka? You had an opinion when you started Vectorized that streaming is the future and it's going to be used everywhere. And I'm sure you've learned a lot through talking to customers and actually deploying Redpanda into the wild. So, like, what is something surprising that you've learned? I feel like I'm surprised almost every day. People are so interesting with the stuff that they're doing, it's very cool. And there are, I guess, multiple surprises — there are business surprises, there are customer
Starting point is 00:53:42 surprises. So from the business side: I'm a principal engineer by trade, I was a CTO before, and this is the first time I'm being a CEO. And, you know, there was really, I think, a lot of self-discovery in this, like, can I do this? And to all the devs: if I could do this, you could do this. So that's one — that was a lot of self-discovery, because it started from a place of really wanting to scratch a personal itch. It's probably like, I wrote this storage engine — if you look at the commits, I think today I'm still the largest committer in the repo
Starting point is 00:54:21 and I'm the CEO. And it's obviously through history. It's because I wrote this storage engine with, you know, kernel page cache bypass and the first allocator and the first compaction strategy, because I wanted these things to be from a technical level, you know? And then it turns out that when I started interviewing people and we interviewed, man, tens of companies, maybe less than a 100, but definitely more than 50, somewhere in between.
Starting point is 00:54:51 You know, people were having the same problems that I was having personally, you know, and this like varied just from small companies to medium companies to large companies. Everyone was just struggling with operationalizing Kafka. It takes a lot of expertise or money or both, money and talent, which costs money to just get the thing stable. And I was like, you know, we could do better. And so the fact that that was possible, even though Kafka has been around for 11 years, I was shocked.
Starting point is 00:55:21 I was like, wow, there's this huge opportunity in making it simple. And you know what the interesting part about that is? The JavaScript and Python ecosystems love us, because they're used to running Node.js and Nginx and these little processes, I mean, lighter in terms of footprint, but sophisticated code. And they're like, oh, Redpanda is a really simple thing that I can run myself. And it's easy.
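To make that concrete: because Redpanda speaks the Kafka protocol, an off-the-shelf Kafka client works against it unchanged. Here is a minimal sketch using the kafka-python library, assuming a local broker listening on localhost:9092 and a topic named events that either already exists or is auto-created.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"  # assumed single-node Redpanda listener

# Produce a couple of records; no JVM or ZooKeeper involved on the client side.
producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send("events", b"sensor-1 reading=42")
producer.send("events", b"sensor-2 reading=17")
producer.flush()

# Read them back with a plain Kafka consumer.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5s with no new records
)
for record in consumer:
    print(record.offset, record.value)
```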
Starting point is 00:55:46 And they feel that we sort of empower some of these developers that would never come over to the JVM and Kafka ecosystem, just because it's so complicated. And so, you know, basically to productionize it: it's easy to get started, let me be clear, but it's hard to run it stably in production. And so that was a really surprising thing. And then at the customer level, seeing it used for oil and gas was really, I think, revealing. The embedded use case and the edge and IoT use case, I was blown away, because I've been using Kafka as a centralized sort of message hub where everything goes. I never thought of it as being able to power IoT-like things,
Starting point is 00:56:29 you know, or like this intrusion detection system where they wanted a consistent API between their cluster and their local processes. And, you know, Redpanda was the perfect fit for that. I think in the modern world there are going to be a lot more users like that, and I think that's people pushing the boundaries. Yeah, I mean, I think there have been a lot of surprising things, but those are good highlights. Have you been surprised, more than anything else, at the number of people who use Kafka?
Starting point is 00:56:55 Or were you kind of expecting it? I've been blown away. You know, I think there are two parts to that. One, streaming is a massive market. And, you know, I think Confluent just filed with the SEC; I think I saw that today. And I think they're going to have a massively successful IPO.
Starting point is 00:57:14 And we wish them lots of success, because the bigger they are, the bigger we are too. And, you know, we add to the ecosystem. I actually see us as expanding the Kafka API user base to things that couldn't be done before. And so, one, I think it's big in terms of its total size, but I also think it's big in that it's growing. And so there's the number of new users that are coming to us like, oh, I'm thinking
Starting point is 00:57:42 about this business idea, how would you do it in real time? And here's what's interesting about that. You and I put pressure on products, and that transitively translates to things like Redpanda. Like, you and I want to order our food. Say I want to order food: my wife is pregnant and she wants Korean food tomorrow, but she doesn't eat meat.
Starting point is 00:58:01 But when she's pregnant, she wants to eat meat. And so I want to order that, and I want to be notified every five minutes of, you know, what's going on with the food, is it going to get here, blah blah. And so end users end up putting pressure on companies, and ultimately, right, software doesn't run on category theory, it runs on real hardware, and ultimately it ends up in something looking like Redpanda. And so I think that's interesting. There's a ton of new users who are getting asked by end users like you and I to effectively
Starting point is 00:58:40 transform their enterprise into this real-time business and start to extract value out of the now, you know, what's happening literally right now. And I think, to them, once they learn that they can do this, that Redpanda empowers them to do this, they never want to go back to batch, right? Streaming is also a strict superset of batch, without getting too theoretical here. Once you can start to extract value out of what's happening in your business in real time, nobody's ever going to want to go back. And so I think there are those two trends: one, the market is large, and two, it is growing really fast. And so that was surprising to me.
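One concrete reading of "streaming is a strict superset of batch": a batch job is just a stream consumed over a bounded offset range. A rough sketch with kafka-python, assuming the same hypothetical events topic as above with a single partition:

```python
from kafka import KafkaConsumer, TopicPartition

tp = TopicPartition("events", 0)  # single-partition topic assumed for brevity

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
consumer.assign([tp])

# Snapshot the current end of the log: everything before it is the "batch".
end = consumer.end_offsets([tp])[tp]
consumer.seek_to_beginning(tp)

records = []
while consumer.position(tp) < end:
    polled = consumer.poll(timeout_ms=1000)
    for msgs in polled.values():
        records.extend(m for m in msgs if m.offset < end)

print(f"batch view: {len(records)} records up to offset {end}")
# Keep polling past `end` instead, and the same code becomes a streaming job.
```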
Starting point is 00:59:20 Cool. And this is a final question; you don't have to answer this, it's just tongue in cheek. If Confluent's CEO came to you tomorrow and said, we just want to buy Redpanda, what would you think of that? I don't know. I think, from a company perspective, I want to take the company to being a public company myself. And I think there's plenty of room; the pie is actually giant. And I just think there hasn't been a lot of competition in this space, that's my view at least. And so, yeah, I think we're making it better.
Starting point is 00:59:54 If you look at it, let me give you one example: NGINX. The Apache web server dominated the HTTP standards for a long time. It almost didn't matter what the standard said; the Apache web server was the thing. Whatever they implemented, that was implicitly the standard. Then NGINX came about, and people were like, wait, hold on a second, there's this new thing.
Starting point is 01:00:16 And it actually sort of kicked off this new interest, I think, in trying to standardize and develop the protocol further. I think it's similar; I think it's only natural for that to happen to the Kafka API. You know, Pulsar is also trying to bring in the Kafka API. We say this is our first-class citizen. And so, you know, I think that there's room for multiple players that are offering different things.
Starting point is 01:00:39 We offer extremely low latency and safety, right? We get to take advantage of the hardware. And so I think people are really attracted to our offering from, you know, a technical perspective, especially for new use cases. And so, I don't know, I think there's a chance for multiple players. Yeah, it's exciting. I think, like OpenTelemetry and all that, there'll be an open streaming API or something eventually,
Starting point is 01:01:09 there'll be a working group, it'll be folded into the CNCF, and all those things. It's exciting to see. Yeah, exactly. Well, thanks again, Alex, for being a guest. I think this was a lot of fun, and it is fascinating to me. I did not realize how many people need streaming
Starting point is 01:01:26 and are using streaming. And, I guess, the IoT use case just slipped my mind; it makes so much sense. This was super informative.
