CppCast - Event Streaming
Episode Date: March 19, 2021

Rob and Jason are joined by Alex Gallego. They first discuss blog posts from Visual C++ on IntelliSense updates and a tutorial for programming StarCraft AI. Then they talk to Alex Gallego about Red Panda, the event streaming platform written in C++ that's compatible with the Kafka API.

News:
MTuner
IntelliSense Improvements in Visual Studio 2019
STARTcraft - Complete Beginner StarCraft: Brood War AI Programming Tutorial with C++ / BWAPI
Git source has a banned.h file that blocks use of certain C functions

Links:
Vectorized.io
RedPanda on GitHub
The Kafka API is great; now let's make it fast!

Sponsors:
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
Date Processing Attracts Bugs or 77 Defects in Qt 6
COVID-19 Research and Uninitialized Variables

Episode Transcripts: PVS-Studio
Transcript
Episode 291 of CppCast with guest Alex Gallego recorded March 17th, 2021.
The sponsor of this episode of CppCast is the PVS-Studio team.
The team promotes regular usage of static code analysis and the PVS-Studio static analysis tool. In this episode, we discuss AI programming for StarCraft.
Then we talk to Alex Gallego, CEO of Vectorized.io.
Alex talks to us about Red Panda,
the Kafka-compatible event streaming platform written in C++ by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. Episode 291.
Our last guest, I think, asked us if we had anything special planned for 300.
Yes, I believe Arnaud did ask that kind of once we kind of wrapped up the recording.
Yeah, Alf.
Yeah.
Yeah.
And the answer is no.
But I will say if our listeners have an idea that they think would be outstanding, then go ahead and respond to that on Twitter or something.
And maybe we'll do one of those ideas.
Yeah, I feel like at this point, we've had kind of the biggest names on who, you know, were interested in being on the podcast.
But if there's someone you think we've missed, certainly reach out to us and let us know, like, oh, you should definitely have so-and-so on.
Definitely. All right, well, at the top of every episode I'd like to read a piece of feedback. We got this tweet saying there is a C++ memory profiler and leak finder for Windows, and it's by a GitHub user named Milos Tosic. I have the repo open; it's called MTuner, an advanced C and C++ memory profiler. I have not had a chance to check it out, but this listener recommended it. If you were listening to last week's episode and Heaptrack sounded wonderful, but you need something that works on Windows, then maybe check out MTuner and see how well that tool works for tracking memory usage.
Definitely, yeah.
Okay, well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter,
or email us at feedback at cppcast.com.
And don't forget to leave us a review on iTunes
or subscribe on YouTube.
Joining us today is Alex Gallego.
Alex is the founder and CEO of Vectorized, 12 years in streaming,
previously a principal engineer at Akamai and founder and CTO of Concord.io,
a C++ stream processing engine sold to Akamai in 2016.
Alex, welcome to the show.
Thanks, Rob, for having me. How are you guys?
Doing great. How are you doing?
I'm doing well. It's, I think, early in San Francisco. Glad to be here. I feel like I've
been listening to you for so long. It's great to finally, you know, kind of see you face-to-face
on the podcast.
Oh, that's cool.
Yeah.
Your bio makes it sound like you've been
doing something in content distribution or data streaming of some sort for a rather long time.
Yeah, so actually right out of college, I was a cryptography major, and then I dropped out of my grad school program to go work on an embedded database in finance, for basically how the S&P 500 portfolio is managed. Back then it was the BlackBerry days, so it was extending SQLCipher with some cryptography primitives. It was really cool. But yeah,
ever since after that job, I've been in the real-time and big data space.
In particular, I guess the first gig was really how to do real-time multi-armed bandit algorithms for ad tech.
If you're trying to display ads, how do you get more people to effectively click on things? And how do you run experiments?
But how do you do that in real time so that the budget spent is spent efficiently across a fleet of 100 servers?
That kind of stuff.
And then after that, I went on to build a stream computing platform. So I don't know if the listeners know Spark streaming or Apache Flink,
but it's effectively just big data frameworks to do real-time manipulation of events
as they're coming in through your data pipeline.
And then the last one is...
Yeah, go ahead.
I mean, if we're going to be talking around this concept of streaming for the rest of the interview,
go ahead and give us a definition of what you mean by that. Is that like YouTube streaming or is that just like data processing,
like a data stream coming through with like pipelines of data processing or what?
Yeah, that's a good one. So it's not video streaming, it's really data streaming. All right. So event streaming is really kind of a set of technologies to deal with time-sensitive data.
And let me define an event for the audience.
So an event, for the purpose of event streaming,
is really data with a timestamp and an action.
So like a click, for example, it happened at a particular time
and it may contain a bunch of data.
Or a trade, right? When you make a trade, it's some data, but it has a timestamp and it has an action, and so on. And so event streaming is really kind of a technology to help developers deal with this time-sensitive data. And the time-sensitive component is an interesting way of putting it, because there is this expectation that something will happen with regards to the timestamp. So for a trade, when you make a trade on,
let's say Robinhood, you kind of expect that thing to, you know, to eventually get resolved.
Same thing when you order food on your phone app from Uber Eats or DoorDash or all of these
applications, you kind of expect the food to get prepared and eventually get delivered to your
house. But it's so all of these events, you know, are kind of like data coupled with time and inaction.
And then at a low scale, you know, you could probably use a pencil and a notepad when you're,
you know, lining up for like a restaurant to get a table.
But at large scale, it's really difficult.
And you need systems like Apache Kafka or Red Panda to help enterprises deal with
that kind of volume. And that's specifically what those tools do is process data. Yeah, correct.
Yeah. So it's really like an event streaming platform. And so all of the last 12 years,
I think as a comment to the intro, I've been working on this and I think event streaming
as an ecosystem is really two technologies.
There is the compute side and then there's the storage side. And so when you chain compute and store, that chain of compute and store at the end is really kind of what stream processing is. And you need both. You need something to do something intelligent, like, whatever, parse some fields of JSON or save it onto a database, but you also need these intermediary steps of storage. And so combining those technologies is really what stream processing is about.
Okay. Okay.
All right, so we're gonna absolutely dig more
into stream processing, but before we do,
we got just a couple news articles to touch on.
So feel free to comment on any of these, Alex, okay?
Okay.
All right, so this first one is a post
on the Visual C++ blog,
and this is on IntelliSense improvements
in Visual Studio 2019.
And lots of great stuff in here.
They're starting to have IntelliSense support
for C++20 features, like coroutine support.
It's always good to see new features of the language lighting up in the editor, in the IDE.
Anything you want to highlight here, Jason?
I don't know.
Yeah, lots of like more real-time things as you're typing.
I like that.
I haven't actually really played with IntelliCode yet.
I'm curious if either one of you have.
They highlight enhancements in IntelliCode here also.
That's the AI-backed IntelliSense. Yeah, I think I have
run that against my code base. And yeah, basically when you
are getting prompted with IntelliSense hints, the things you use
the most often will be kind of up on the top instead of just being alphabetized
or whatever it was before. So yeah, it has been helpful.
Yeah, that's interesting.
I've used Emacs for so long
that to me,
I feel like autocomplete
is me usually grabbing
for the method signature.
Now, I'm kidding.
I do use ClangD as autocomplete,
but I'd be curious to experiment
with coroutines
because they're so gnarly to debug.
And I think we'll talk about our code base
and kind of our experiences with debugging
with asynchronous code and coroutines in a second.
But I hope that in addition to autocomplete
and all these features,
they are making coroutines easier to debug.
I'm sure someone's working on that.
I feel like I just want to make a comment real quick
on autocomplete because, yeah, I agree.
Even though I've really gotten in the swing of using IDEs lately, and I've been using CLion a lot,
I swear it is almost never that the autocomplete on, like, braces does what I want it to do.
I'm like, oh, I'm missing an opening brace here.
So I type the opening brace, and it automatically adds a closing brace.
No, I was already correcting a bug.
I don't need you to create a new bug for me.
I don't know.
That's just an aside.
And it's not even a comment to CLion.
It's just the kind of thing that seems to happen
with like every IDE that I use.
Anyhow.
All right.
Next, we have this YouTube video
and this is called STARTcraft.
And it's a complete beginner StarCraft: Brood War AI
programming tutorial with C++. And I've played StarCraft a lot. I've never thought to try to
program the AI. Jason, you watched some of this video, right? How does it work?
Yeah, I watched about half of the video. That's about what I had time for. And I was watching it at like 1.75 times speed, which for me was a pretty good flow for this particular video, because it was a livestream thing, so it was a little on the slow side. But yeah, it's funny, because I watched my roommates play hundreds of hours of StarCraft, but I never really cared about the game myself. But it seems like a really cool concept that they do these AI-versus-AI game matches.
Oh, okay, so that's kind of the point: you're programming an AI in C++ to play against other AIs?
Yeah. Anyhow, if you have any interest at all in this, I feel like it was a really good intro to getting started with a Visual Studio project, with C++, with how all of these things work, how game time loops work, a little bit of understanding at a high level of how they do DLL injection to be able to hook into the StarCraft game, to be able to do this per-frame thing to pump inputs and outputs between the things. I thought it was pretty cool. It seems like a really fun way to get people into programming and into C++, into game development to some extent.
Yeah.
I saw the video too, and it was really cool just to see it.
I haven't developed in Windows in so long,
but it seemed really neat.
Yeah, and he starts out with, like,
basically starts you in the debugger.
Like, here, you can put a breakpoint here and hit play,
and the StarCraft game is just going to stop and wait for you
while you do your debugging on that frame.
I like that the programming API is synchronous,
so it'll block.
So I think it's kind of key to getting people into programming games
because multi-threading is really hard for people
to usually wrap their head around,
especially as a getting started project.
So whoever designed that, I think it seems really good for people just getting into game programming and into building AI.
Because then you just have to deal with the current state. You don't have to deal with concurrency.
Right, and you don't have to write any of the game engine.
You can just deal with the AI. And then they're doing all kinds of other things, like overlays on top of the current game and stuff that, if you really want to get advanced, you can.
And also, the techniques they used were, you know, fairly modern-ish looking code and stuff. Like, I felt like it had a good clean feel to it in general, except I wanted them to use auto more. I feel like, with those signatures, I was watching the video and I was like, auto, please use auto.
It's kind of like the C++98 memories. When I started writing C++, it was before the standard, for OpenVMS. So it had all these janky primitives. And long story short, the type signatures of this game were so long, I was like, it reads so much better if you use auto.
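(For listeners curious what that synchronous, per-frame hook looks like in code, here is a rough from-memory sketch of a minimal BWAPI module. Names are approximate and this is an illustration, not code from the STARTcraft tutorial.)

```cpp
#include <BWAPI.h>

// Hypothetical minimal bot: BWAPI drives the module synchronously, calling
// onFrame() once per game frame while the game waits, so there is no
// threading to reason about. (Sketch from memory of the BWAPI 4.x API;
// exact names may differ slightly.)
class HelloBot : public BWAPI::AIModule {
public:
    void onStart() override {
        BWAPI::Broodwar->sendText("Hello from a C++ bot!");
    }
    void onFrame() override {
        // Send every idle worker to mine the nearest mineral patch.
        for (auto unit : BWAPI::Broodwar->self()->getUnits()) {
            if (unit->getType().isWorker() && unit->isIdle()) {
                unit->gather(unit->getClosestUnit(BWAPI::Filter::IsMineralField));
            }
        }
    }
};
```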
Well, and I saw a couple of places where he typed auto and then said, well, this is actually a whatever, just for the sake of explanation, and deleted auto and typed out the full type name. And I'm like, you could have just left auto.
I totally agreed, though. Yeah.
Okay, last thing we have, and this is just a link to Git's own repo for Git.
And they have a banned header file
and just a nice little trick
if you want to prevent certain C functions
from being used within your code,
you can have this type of header file
where it just defines something like strcpy
to print out
an error message. So if anyone tries to use strcpy and you, you know, don't trust that function, then you'll get an error message, and the developer is going to have to move on and find something else to do. I think I've seen this type of trick before, so I guess it's somewhat common. I'm curious if either one of you were surprised by any of these.
No, I think it's kind of annoying when you integrate with a third-party library if you have those macros defined, because then, like, if it's in a header, it sort of complains. I like the cpplint style, where it prints errors and then via comments you can disable them, or, you know, with clang-tidy you can sort of disable the warnings. You're like, I'm pretty sure I know exactly what I'm doing, and it also serves as a warning for future developers, like, this was exactly my intention and I intend to do this thing. I mean, I do get that people, you know, should probably use std::copy_n as opposed to, you know, the C variant, etc. But yeah, it seems annoying to integrate with third-party libraries, but I guess if you write most of your code, it should be okay.
That's a really good point.
Like, do not #include this banned.h into another header file, right?
It struck me that this is assuming that the compilers implemented all of these as macros, because it starts with #undef strcpy and then defines strcpy, #undef strcat, and so on. I feel like for portability you would need #ifdef strcpy, #undef strcpy, right? Otherwise you're going to get a compiler error, or at least a warning, on some compilers, unless those are just disabled.
I guess that's a good thing about open source: you make it somebody else's problem, you make it the compiler people's problem, right? Like, if it doesn't work on Windows, you're just like, someone on the Windows team is going to come back and patch this.
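(For listeners curious what such a header can look like, here is a minimal sketch in the spirit of Git's banned.h, including the #ifdef guards Jason mentions. This is an illustration, not Git's actual file.)

```cpp
// banned.h -- minimal sketch of the idea, not Git's actual header.
// Include it from .c/.cpp files only, never from other headers.
#ifndef BANNED_H
#define BANNED_H

// Poison the call site: any use of strcpy now fails to compile with a
// message pointing at this identifier instead of silently compiling.
#define BANNED(func) sorry_##func##_is_a_banned_function

#ifdef strcpy            // guard in case the C library defines it as a macro
#undef strcpy
#endif
#define strcpy(dst, src) BANNED(strcpy)

#ifdef strcat
#undef strcat
#endif
#define strcat(dst, src) BANNED(strcat)

#endif // BANNED_H
```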
I've never taken that approach. That sounds like a good one.
Okay, well, Alex, so after introducing you, we talked a little bit about what stream processing is. And I think you mentioned Kafka and Red Panda.
Could you go into a little bit more detail about what Kafka and the product you work on, Red Panda,
are? Sure. So Kafka is really two things. I think when people say Kafka, they often refer to Kafka,
the API, and Kafka, the system. Combined, they are sort of the largest data streaming platform in the world right now.
It's so popular, it's almost synonymous with real-time data streaming.
So basically every big company is using that.
So Uber, Robinhood, and you can basically YouTube just say Kafka and company,
and there's almost like a video per company.
It's kind of like the MySQL for databases, but in the streaming world.
So that's what it is.
And then what Red Panda, so this technology was built around 11 years ago out of LinkedIn.
Kafka was.
Correct.
Okay.
And Red Panda, the project that we are working on is a C++ re-implementation of kind of the base API. And it's really the way I like to frame it to people is sort of the natural evolution of the principles that Kafka, the system, built on.
And we extend sort of Kafka, the API.
So let me give you an analogy, I think, for all of the listeners. If you think of SQL as this lingua franca for speaking to multiple databases, whether
you're speaking to DynamoDB or Google Spanner or Postgres or MySQL, right?
SQL is sort of this language to communicate with databases.
Similarly, in the streaming world, we think that Kafka, the API, became this lingua franca
for communicating with real-time streaming systems.
And Red Panda is really a reimplementation.
And I think what we think is the future of streaming.
And so what is interesting for this audience is, as one of our advisors likes to say, sometimes you have to reinvent the wheel when the road changes. And hardware has changed so, so much over the last decade, right? Like, if you go on Google, you can now rent a machine with 244 cores, a terabyte of RAM, and, you know, multiple terabytes of NVMe SSD devices.
In 2010, the bottleneck in computing was spinning disk.
And so Red Panda was like, well, what could we do if we, you know, were to start from scratch for modern hardware, where CPU is the new bottleneck, and it's no longer the spinning disk, for these kinds of big data and storage systems. And so we re-implemented Kafka, the API, from scratch in C++. We actually use C++20 now, but we started with C++11, using a thread-per-core architecture.
And then the product really is kind of aimed to do three things. The Kafka API is kind of the first thing. The second thing is to unify historical and real-time access by integrating with storage systems like Amazon S3 or Google Cloud Storage buckets or Azure Blob Storage. That's the second one. And then the last one is inline lambda transformations with WebAssembly. So I know, Jason, that you have talked about WebAssembly on the show before; the idea is to sort of push these capabilities to the storage engine.
So it's really, that's what the project is aiming to do.
And kind of the analogy for the C++ community that is listening to this is think of Kafka the API, kind of like the C++ language spec.
And, you know, Clang and GCC as being, you know, Kafka, the system and Red Panda.
So it's really two implementations for the same API at kind of the base layer.
Okay.
Okay.
So what initially, I guess you kind of went into this a little bit, but you talked a lot
about, you know, hardware has changed a lot over the past decade.
Is that what prompted you to create Red Panda?
Or were there other factors that went into it?
Yeah.
So in 2017, I actually did a YouTube talk, which I think the audience can Google, at Seastar Summit. And so the idea there: I just wanted to take two edge computers, I think at Akamai, and I wanted to measure what is basically the gap in performance between the state-of-the-art software for doing data streaming, which at that point really was Kafka, and the hardware. I just wanted to measure the gap, right? And so I wrote a prototype using DPDK, so kernel bypass at the NIC level, and DMA, so bypassing the kernel page cache to speak to the storage device.
So as fast as I could, but really just the first pass.
And then compare that with Kafka, right?
So as slow as I could get on the hardware level and really comparing it with the state-of-the-art system.
And the first pass was a 34x latency reduction on the tail latencies.
So it's like, wow, why is this the case?
And so I really didn't do anything.
I started this company in 2019, but that was really sort of this big moment of realization
that really the roads have changed.
The hardware is so fundamentally different.
And so let me give you orders of magnitude. Disk, just disk alone, is about a thousand times faster than disks were a decade ago. And so it sort of requires you to think about it, because the latencies are so, so different, right? Like, a DMA write on a modern NVMe SSD is, say, double-digit microseconds to about 130 microseconds, versus, you know, the latency a decade ago was multiple milliseconds on a spinning disk.
And so it's like you need to rethink your software architecture
when the components are so different.
It's a totally different world.
And so that really was the motivation.
I really wanted a storage engine that could keep up with the volumes
that I wanted to push when I was at Akamai, and previously in ad tech,
and all these other places in a way that was cost effective.
And so when I saw that hardware was really so capable,
I was like, wow, there's so much potential here.
And that was really sort of the beginnings of Red Panda.
I want to make sure I understand your original experiment. You're saying you were doing, basically, writing to the disk at the fastest possible speed? And what is it that you're doing, DMA? Sorry, I lost the two different technologies you were using there.
Yeah, so DPDK is an Intel driver framework that allows you to decommission the NIC from the kernel; you unload the kernel drivers, and the driver runs as part of the application space. So your application literally talks to the hardware, and then it loads the driver, and it runs the NIC itself.
So when you get a NIC write,
like when you're receiving a packet on your receive queue,
it literally writes to the first 8 kilobyte of application memory space.
And so there's no context switch, there's nothing,
because your application owns the memory that the device is writing to.
And so that's at the network level.
And then on the disk level, it's using DMA.
And so what it is, is that you're aligning memory in multiples of what the file system exposes, so that the SSD device basically could just, you know, memcpy that onto the SSD sectors, kind of on the underlying hardware.
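(To make the page-cache-bypass idea concrete, here is a rough standalone sketch of an O_DIRECT write on Linux with aligned memory. It is illustrative only and not Red Panda's code; the file name and block size are assumptions, and it expects g++ on Linux since O_DIRECT is a GNU/Linux extension.)

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // O_DIRECT bypasses the kernel page cache; buffer, offset, and size must
    // be aligned to the device/filesystem block size (assumed 4096 here).
    constexpr size_t kAlign = 4096;
    constexpr size_t kSize  = 4096;

    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kSize) != 0) return 1;
    std::memset(buf, 'x', kSize);

    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { std::perror("open"); std::free(buf); return 1; }

    // With aligned memory, the device can DMA straight from our buffer.
    if (write(fd, buf, kSize) < 0) std::perror("write");

    close(fd);
    std::free(buf);
    return 0;
}
```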
And the experiment was,
let me run a Kafka client and server
as the first, as the baseline,
and measure how much and how fast
I can push the hardware, right?
That's the state-of-the-art streaming system.
Turn that off, and on the same hardware
with two SFP NICs back-to-back,
so like the server is literally connected with a single wire.
For both experiments,
let me run a C++ client and a C++ server
where the server uses DPDK and page cache bypass
on the Linux kernel side,
and then just measure what is the gap in latency.
What can the hardware actually do? Effectively, to me, it boils down to: what is the essential versus the accidental complexity? What happened? Why is software at 34x? You know, why is there such a gap in the latency performance?
So that was the experiment.
So that number that you came up with, that 34x, is it fair to say that's the theoretical limit?
That's as fast as possible, but you're going to get slower as you start to actually add software
on top of that? Or was there still room for optimization there? Well, that was just the
first pass, right? So it didn't take into account all sorts of, you know, further improvements. I'm not sure. I think we could compute the theoretical limit. Well, so, bottlenecks,
that's a really interesting and difficult question to answer.
But the bottleneck will move around, right?
Because if your disk can keep up with the network, then your network will be the bottleneck.
But if your network is faster than the disk, then your disk will be the bottleneck.
So the gist, and the way we like to explain this to people, is that Red Panda should drive your hardware at capacity. And in fact, the way we test it, and this is really powerful when we go into, like, a hedge fund or a bank or an oil refinery, etc., is we ask them to just run dd on the disk, or sorry, fio on the disk, to measure latency and throughput, which is this Linux tool that Jens Axboe wrote, who works on the asynchronous IO interfaces for the kernel. And then on the network side, run iperf. And it should literally run at basically the slowest of the fastest. Does that make sense?
So it should run at whatever is the bottleneck.
And so what the limit of the hardware should give you
should really be very close,
like Red Panda should be able to run very close
within the physical limits of the hardware,
both in latency and throughput.
So you pull up a process manager,
you expect to see Ethernet 100%,
disk 100% or whatever.
Correct.
Okay.
Yeah.
And that's how we did a lot of the initial benchmarks because we couldn't find clients that could keep up with a lot of this stuff.
And so we just kept increasing the load.
And then kind of a big improvement from a software perspective was: how do we deliver that with stable tail latencies, right? Because you can deliver basically fantastic latencies when the hardware isn't saturated, but what happens when the hardware starts to get a little bit saturated? So it's not at 100%, but let's say between your 75th and 90th percentile of hardware saturation, when you are really exercising every part of the system: you're exercising your memory reclamation techniques, you're exercising your memory allocation, you're exercising, you know, the hardware, your backpressure from disk all the way to the NIC, etc. And so I think delivering flat tail latencies at near-saturation speeds is something that we've worked on a lot.
The sponsor of this episode is the PVS-Studio team.
The team develops the PVS-Studio static code analyzer. The tool detects errors in C, C++, C#, and Java code.
When you use the analyzer regularly,
you can spot and fix many errors right after you write new code.
The analyzer does the tedious work of sifting through the boring parts of code,
never gets tired of looking for typos.
The analyzer makes code reviews more productive
by freeing up your team's resources.
Now you have time to focus on what's important,
algorithms and high-level errors.
Check out the team's recent article,
Date Processing Attracts Bugs or 77 Defects in Qt 6
to see what the analyzer can do.
The link is in the podcast description.
We've also added a link there to a funny blog post,
COVID-19 research and
uninitialized variable, though you'll have to decide by yourself whether it's funny or sad.
Remember that you can extend the PVS Studio trial period from one week to one month.
Just use the CppCast hashtag when you're requesting your license.
Before we dig a little bit more into Red Panda and how you've implemented this streaming service, going back to Kafka, what is that service written in, and have they made...
And it's really a two-system thing.
It uses this system called ZooKeeper, which is notoriously difficult to manage, and Kafka.
And so kind of the challenge is not only in performance, but is in the operational complexity
of running two systems, right?
If something fails, you need to have a deep understanding
of what are the possible failure semantics
and the additional fault domains.
So let's say Zookeeper crashes, what happens to Kafka?
And under what operation will Kafka report an error
if Zookeeper is down?
Or clearly, if one of the Kafka brokers is down, there are probably known failure domains there. But as a software engineer, as a developer who isn't necessarily a streaming expert, dealing with the complexity of two fault domains is really challenging, especially because the configuration on clients, and producers and consumers of data, and the broker is sort of all intermingled. And so a lot of the improvements from a technical perspective were like, if we were to start from scratch, how can we make this really, really easy to use? And so we wrote it in C++ as a single binary. And honestly, that has opened up the doors for so many things for us. But yeah, at a high level, Kafka is implemented in Java.
And yeah, does that help?
Yeah, I think so. This is kind of an aside, but is Kafka why people that I know in big data know Scala? Is that the main correlation there, or is there some other tool?
Yeah, I think Spark Streaming is really the tool that kind of drove a lot of the Scala adoption.
So Scala is really nice for developing DSLs.
Or I think back in 2012, DSLs in Scala were all the rage.
I think that's what happens when people start to get into functional programming languages. They're like, oh, look, I can express all of these things and I can introduce new symbols. And so I think Spark Streaming was really the framework that drove Scala adoption, in my opinion.
Okay. Definitely an aside.
But Spark Streaming does connect to Kafka. So there is a correlation in that big data world.
And often, to do real-time streaming, Spark is really the compute level of things and Kafka is the storage level. And so I think in the beginning of the podcast, I mentioned how you need both. You need compute and a store kind of chained together to do something useful, like fraud detection, or delivering your food to your door, or things like that.
So I think it's both.
Spark is often deployed alongside Kafka.
So Spark Streaming, Kafka, and you mentioned ZooKeeper, and you're talking about multiple points of failure. Which of these things does Red Panda and its single binary encompass, exactly?
Yeah, great question.
A lot of all of them.
So the first thing is we eliminated the multiple fault domains. We made it a single C++ binary. And we rewrote Raft from scratch. So it took us two years, really, to write Raft and test Raft in a way that was scalable and actually could drive the hardware at capacity, right? Because that was an initial goal.
It's like, how do we utilize hardware efficiently?
And so we made it a single binary, which means we eliminated Zookeeper
and we eliminated all of the tunables that come with Kafka.
And so that was the first one.
It's really easy for people to deploy.
The next part of it, which is what we think is the future of streaming, is that
to do simple things in real-time streaming is actually really complicated. Let's say that you
just wanted to guarantee GDPR compliance. And to do that, you just have to remove a social security
number from a JSON object. That simple use case, for people to do it at scale, you need Kafka and ZooKeeper and Spark Streaming, and a bunch of data ping-ponging between Kafka and Spark Streaming, back into Kafka, and potentially back into Elasticsearch or whatever the end data store would be. And so what we said is, let me give you, the programmer, an ability to express these, computationally speaking, dumb transformations inline (it's sophisticated machinery, but they're dumb transformations), so that as data is coming through you could just, you know, remove the social security number. So you don't need to ping-pong your data between, you know, Red Panda and Spark Streaming and back into Red Panda. But WebAssembly, the engine that we've embedded, is really meant for simple things. It's not meant to do complicated multi-way merges. So at the storage level, we are a drop-in replacement for Kafka. So we have some of
the largest hedge funds in the world, some of the largest oil and gas companies in the world,
really finance and web companies, CDN companies, et cetera. So at the lowest level,
we are a drop-in replacement for Kafka, the system, and the API.
But we're also starting to extend the primitives with simple things like WebAssembly.
So does that make sense?
So we're not a full replacement for Spark Streaming, which could do complicated multi-way merges.
But the simple things are now actually simple to do with Red Panda.
And if they need to do the complex things, you can still offload that data to some other process? Correct, correct. Because we're Kafka
API compatible. So you could still use Spark with the same drivers and the same applications that used
to work with Kafka, right? Because we implement the same API, the same specification. So they
could just point it at Red Panda and everything will just work correctly. So we're just sort of
making the simple things simple.
I'm kind of finding myself a little fascinated by the development history here, because I'm
assuming Kafka and this menagerie of tools around it evolved over a decade or so.
Correct.
Or something like that.
And so then, when you come in with Red Panda, you are coming from a perspective, and correct me if I'm wrong, please, of basically saying: well, the problem is solved, we know what the problem is and we know what the solution is, but now we get to start from a fresh perspective and say, okay, let's do it, but better.
Yeah. I mean, we have to give a lot of credit to Kafka, in that
before Kafka you had, like, RabbitMQ, and you had TIBCO and Solace and all of these enterprise message buses that people really don't like to use. And then Kafka came, and developers loved that they could be heroes overnight, literally. It's kind of impressive. The ecosystem is so large that you can take Kafka and Spark Streaming and Elasticsearch
and in a day, you can come up with a sophisticated fraud detection pipeline for a bank.
So that's what people love.
It was this API, this pluggability, this huge ecosystem.
But like you mentioned, we knew what the end result should be.
And the insight there was like, well, what if we start from scratch?
What are the things that we could fundamentally change about how to implement this software?
Still give people the same sort of hero and wow moments of being able to plug and play all of these open source systems,
but really design it from scratch in C++ and see what we could do better,
both from a protocol, from an architectural perspective,
from a performance perspective.
Yeah, and so that was kind of the technical history there.
I also find it interesting that you're using,
you keep saying WebAssembly,
and it sounds like WebAssembly specifically
for your processing engine.
You're not saying JavaScript that then gets compiled, or C++ or whatever. You're just saying, you feed us some WebAssembly, we'll run it on the code, on the stream?
Yeah, yeah, correct.
So right now we are exploring between Wasmer and V8.
So the first implementation uses V8.
And WebAssembly is important because effectively
people like to program in whatever language they like to program.
That is just how it is. And so we're just like, well, what is sort of this common denominator
that can give us, you know, sort of reasonable performance while we're executing?
And so we leverage the LLVM backend to produce really good web assembly
in whatever programming language you want, right?
So if you want to send us garbage,
it'll increase the latency of your processing,
but largely people are really good at, you know,
doing these simple things pretty fast.
And so WebAssembly sort of opens up the ecosystem
in that you as a developer,
you could program in whichever language you want,
like that's up to you,
whatever libraries and whatever you want to bring up,
that's your problem.
We just execute it inline.
And so it's really actually from a mental perspective, it's the same idea of when you inline a function in C++,
like, you know, it's going to be fast
because it's like the code is right there.
So it's really the same idea here with WebAssembly.
It's like as data is coming in,
you know, the code is literally right there
to make just the simple transformations and then pass on the information onto the storage subsystem.
Fascinating how often WebAssembly comes up in our C++ discussions.
It's coming up more and more, it seems.
More and more, yeah.
If you're joining CppCast for the first time ever, you probably want to go research WebAssembly.
Yeah, go back and listen to the episode. What was it, who did we have on WebAssembly most recently, Jason?
I don't know. It's come up like seven or eight times recently and I'm having a hard time keeping track of them. Ben Smith, that's right, episode 251.
Okay. Oh yeah, that was a good one. That was fun.
Yeah. Anyhow, sorry. So we've talked a lot about, you know, what stream processing is and what you've done. Do you want to dip down into more technical aspects of what you've done when architecting and writing Red Panda to make it so much faster compared to Kafka?
Yeah, this is really the most fun to talk about with a technical audience. You know, I feel like latency is really the sum of all of your bad decisions.
That was good.
Sorry, to deliver low latency, it really takes a kind of discipline in making sure that the things that you're doing have this holistic design principle. And so the first thing that we noticed is we
didn't want to use the page cache, the page cache in the Linux kernel. So this runs on Linux only. The page cache in the Linux kernel is a generic, you know, caching system, and so every file gives you a page cache for that particular file. But that file, for simplicity of IO, has a global lock, right? And then it also has all of this lazy materialization of pages. It's actually really sophisticated and really nice, in that when you do a read or write syscall, there's all of this generic machinery. So for example,
let's say that you read the first byte of a file. The Linux kernel will actually go ahead and read the next four or five pages; basically, it'll kind of prefetch, start the mechanical execution of fetching the next few pages from disk, so that the next read call that you make on the file will probably have the data already materialized in memory. And so, you know, for the kind of stuff that we're building, we know exactly what the patterns are. Literally, the Kafka protocol tells us the offset of where to start reading data from, how much data, at what time, and we understand the mechanics of a queue.
We know that once you start reading, you never read behind.
You always read ahead.
So we can start pretty aggressive read-ahead mechanics.
But what's more important is that memory is limited,
and we don't use, actually, virtual memory.
So we actually pre-allocate the whole machine's memory to get low latency allocation,
and then we split the memory evenly
across all the number of cores.
At the lowest level, we use a thread-per-core architecture. So every core has a pinned thread, which means that the threads don't move around. So there's no, you know, memory access ping-ponging and cache line invalidations. Let's say you have a four-core machine; you'll have exactly four pthreads, full stop. And we'll allocate the whole machine's memory across these four pthreads, and then we'll split it up, and we'll use libnuma and the hardware locality (hwloc) library, which is the supercomputing library, which tells the application which memory regions literally belong to that particular core, so you're not doing this kind of remote memory access. And then we lock the memory that belongs to that particular core so that the accesses aren't kind of, you know, ping-ponging around, especially on multi-socket systems. And so that's really kind of the basic layer. And the way communication happens across cores is that we use explicit, structured message passing between the cores. And so at its lowest level, we really
have, I think, quite a different architecture. Because remember, what we were trying to move away from was this multi-threaded, kind of work-stealing-style compute platform that Kafka has, right? Which is like thread pools and so on, built for when the bottleneck a decade ago was spinning disk, and toward the new bottleneck in compute, which is CPU. It's like, how do you keep the CPUs busy and doing useful work? And so that was how we started this.
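(As a rough illustration of just the pinning part of a thread-per-core design, and not Seastar's or Red Panda's actual code, pinning one thread to each core on Linux looks roughly like this:)

```cpp
#include <pthread.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to a single CPU so the scheduler never migrates it.
static void pin_to_core(unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> shards;
    for (unsigned c = 0; c < cores; ++c) {
        shards.emplace_back([c] {
            pin_to_core(c);                 // exactly one thread per core
            std::printf("shard %u running\n", c);
            // ... each shard would own its slice of memory and its own event loop ...
        });
    }
    for (auto& t : shards) t.join();
}
```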
And kind of from a C++ perspective, it's been a lot of fun because we compile our own
compiler.
So right now, for example, we bootstrap Clang with GCC 10.
So we use Clang 11.
And then we use Clang 11 to compile all of our transitive closure libraries
so we get to use C++ 20, and we get to use kind of all of the nice things.
And, yeah, and so it's been kind of a lot of fun to use the latest coroutines
and figure out how to effectively implement an operating system
within the application because we don't use virtual memory.
We bypass the Linux kernel on the page cache.
We kind of do all of these tricks at the lowest level.
Yeah, from an architecture perspective,
it kind of sounded like you said
you take a four-core machine
and split it into four separate machines
if you're giving each one its own dedicated process,
its own dedicated memory.
Yeah, correct.
And what's interesting about that
from a C++ perspective is the way you communicate between these multiple cores: you set up SPSC queues, so single-producer, single-consumer lock-free queues.
And so when you communicate with a remote core, let's say core zero communicates with core three, you explicitly invalidate that particular cache line, but the other cores are untouched, right? You do this explicit synchronization between cores, and that is really scalable, especially when you have multi-socket computers. But even on a single-socket motherboard, that network of queues sort of makes all of your communication explicit. Here's the concurrency and parallelism model that it gives you, or that it superimposes on the developer: you know that your parallelism is fixed to the number of physical cores, full stop. Right? Like, you'll have four pthreads and there will be no other pthreads ever running. But what we do is we use coroutines and futures to have concurrency, not parallelism, within each core. It's kind of like you can have a single-core computer,
but move your mouse and watch YouTube at the same time.
So it's a similar thing that happens on a single pthread. We use coroutines with a cooperative scheduling algorithm, so that every coroutine yield actually yields a future, which is a structure. I think this is what's missing in the spec: a structure, like literally a C struct, that you can add tags and integers on, so you can, you know, sort of schedule it at different points in terms of execution. But on each core you get concurrency. And so when you're writing code it's really freeing, because there's exactly one model of both parallelism and concurrency. And when you develop code with this paradigm, you only write code with a concurrent mindset. You understand that once you're in a single core, there will never be a data race from a pthread perspective; there will be concurrent access, but not parallelism. And so parallelism becomes a free variable, which means you can take the same code structure, which was made concurrent, and execute it on a 244-core machine, and it'll just sort of scale, because you programmed to the structure; you didn't program to the explicit parallelism.
Does that make sense?
I think so.
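(For the curious, a toy single-producer/single-consumer ring buffer, the kind of cross-core mailbox Alex is describing, might look roughly like the sketch below. It is an illustration, not Seastar's implementation.)

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// One producer core pushes, one consumer core pops; the only shared state is
// two atomic indices, so there are no locks and minimal cache-line traffic.
template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    alignas(64) std::atomic<std::size_t> head_{0};  // advanced by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return v;
    }
};
```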
Are you using any libraries or anything
to help you with your coroutine stuff?
Because it seems like coroutines takes a lot of bootstrapping to get running.
Yeah, so we use Seastar as sort of the lowest-level library. And basically every time you do co_return or co_await or something like that... well, actually co_await doesn't generate a future, but when you do co_return, it'll generate a future.
And then the future has a scheduling point.
And the reason you need this is because you need additional metadata to understand the priority of IO, to understand the priority of CPU. Let's say you have a core that is communicating with a bunch of the other cores, because it's just receiving most of the network requests and it needs to route them. You need to understand how much time of the scheduler, of the cooperative scheduler, you want to spend sending messages to the other cores versus doing useful work on this particular core. And that's fundamentally, I think, what is missing in coroutines.
I think coroutines is a really low-level primitive
that, you know, it's really all about the state machine bootstrapping.
But when you're building an application, you need more.
You need a scheduler to allow you to prioritize
different kinds of work at different points of execution.
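(For a flavor of what this looks like in code, here is a rough Seastar-style coroutine sketch, assuming Seastar's seastar::future plus its C++20 coroutine support. It is simplified and not taken from Red Panda.)

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

using namespace std::chrono_literals;

// co_return produces a seastar::future<int>; while we are suspended, the
// reactor on this core runs other tasks, giving concurrency without parallelism.
seastar::future<int> slow_answer() {
    co_await seastar::sleep(10ms);   // suspension point: yields to the scheduler
    co_return 42;
}

seastar::future<> use_it() {
    int v = co_await slow_answer();  // no pthread data race: same pinned core
    (void)v;                         // ... do something with v ...
    co_return;
}
```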
Very cool.
What else do we want to go over?
We're a little bit low on time.
Yeah, I was just looking up Seastar. So that's S-E-A-S-T-A-R.
Correct, seastar.io.
Okay.
Because, you know, whenever coroutines comes up on CppCast,
there's always this question of, like,
oh, well, you know, it has this bootstrapping.
And like you said, there's the stuff missing from, it feels like it's missing from the spec
because it's such a low level primitive. But are there any other C++20 features that you're
really taking advantage of besides coroutines? Yeah, I think polymorphic lambdas are really useful. So we wrote this basically type tree walker.
And so if you give the serialization framework a plain C struct with no constructors,
which means we can splat the struct into multiple fields.
And so we can get the arity of the particular struct,
the arity being the number of fields in the struct.
So let's say you have two integers and one string,
that would give you arity of three.
So we use polymorphic lambdas to basically take any type,
compute the arity, the number of fields,
and then do auto bracket ref, you know, like auto& [a, b, c], and sort of get access to individual fields, and then recurse on those individual fields, so that we do a DFS-type traversal on a type tree. So it's really neat, because you can give it a high-level type, let's say like a building, and then a building would have people, and people would have jobs, I'm just kind of going off the top of my head, and it would just recursively traverse that and give you the building serialized in a DFS traversal. And that is only using, in C++, you know, janky reflection via polymorphic lambdas and tuple splatting.
That's cool.
So that, and then, you know, lambdas, we use lambdas kind of super, super heavily.
Coroutines, I think those are the main things
that we're getting from C++20, I would say.
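(To give a concrete, and much simplified, flavor of that trick: the sketch below splats a plain aggregate with structured bindings and visits its fields with one generic lambda. Real arity detection, as in libraries like Boost.PFR, is more involved, so the arity is hard-coded here; the type and field names are hypothetical.)

```cpp
#include <iostream>
#include <string>

struct Person {
    int id;
    std::string name;
};

// Visit each field of a two-field aggregate with the same callable.
// A real serializer would compute the arity generically and recurse.
template <typename T, typename Fn>
void for_each_field(const T& value, Fn&& fn) {
    const auto& [a, b] = value;   // "tuple splat" via structured bindings
    fn(a);
    fn(b);
}

int main() {
    Person p{7, "alex"};
    for_each_field(p, [](const auto& field) {   // one generic (polymorphic) lambda for all field types
        std::cout << field << '\n';
    });
}
```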
I mean, honestly, coroutines are as useful as auto.
Like, you know, when auto first came out,
you kind of like don't really have an intuition
of when to use auto, whether you use auto with templates.
And if you do that, that kind of blows up your instruction cache because it'll instantiate,
basically it'll do the combinatorial explosion of all of the inner types plus the outer type
of the template parameter.
So anyways, so you kind of develop a sense for auto or when to use auto, like when it
makes sense for auto.
Coroutines are kind of like that.
It takes a little bit of experimenting for you to develop an intuition of when to use coroutines.
But once you get the hang of it, it's awesome because it's really such a useful primitive.
Especially in this concurrent world where parallelism is a free variable, it is really freeing to just write your code with coroutines and have, you know, sort of an explicit understanding of when exactly your code is going to get executed and run.
All right, so I've got one other architecture question, then. My understanding is that coroutines necessarily have a dynamic allocation
with them which can sometimes be optimized away.
In high-performance environments, people tend to avoid dynamic allocations.
But then I'm thinking about the fact that you have dedicated to each core its own chunk of memory,
so maybe dynamic allocations, when you need to do them, aren't such a big deal?
Yeah, and that's a big part of the allocator design, I think, that Avi wrote, the original Seastar author, is that when you
pre-allocate the memory, then memory access and memory
allocation is thread local, but it's already cached, right? So it's just
giving you a quick pointer assignment to this thing.
And your allocator doesn't have to have any locks around it or anything because
you know no one else can touch it, right?
Exactly. And it's locked memory and it's thread local.
And so there isn't any real kind of like bottlenecks there.
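(A toy sketch of the per-core, pre-allocated idea: each shard grabs its slice of memory up front and then hands out chunks with a bump pointer, with no locks because only that core's thread ever touches it. This is nothing like Seastar's real allocator, just the shape of the idea, and the sizes are arbitrary.)

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Toy bump allocator: one of these per core, carved out of memory that was
// reserved up front. No locks, because only the owning thread ever calls it.
class ShardArena {
    std::byte*  base_;
    std::size_t cap_;
    std::size_t used_ = 0;
public:
    explicit ShardArena(std::size_t bytes)
        : base_(static_cast<std::byte*>(std::malloc(bytes))), cap_(bytes) {
        if (!base_) throw std::bad_alloc{};
    }
    ~ShardArena() { std::free(base_); }

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // align the bump pointer
        if (p + n > cap_) return nullptr;                    // arena exhausted
        used_ = p + n;
        return base_ + p;
    }
};

int main() {
    ShardArena arena(1 << 20);                // e.g. 1 MiB reserved for this shard
    auto* x = static_cast<int*>(arena.allocate(sizeof(int)));
    if (x) *x = 42;
}
```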
But also, let's kind of look at it holistically, just to wrap up this thought: our bottleneck is really IO. And so we're in the double-digit microsecond latencies; we're not in the nanosecond latencies. Most of the time, of course, nanoseconds add up, and like I mentioned, latency is the sum of all your bad decisions. But because we're in the microsecond scale, anywhere from, you know, low double digits to, I guess when you make an IO, like 100 microseconds, it is low latency, but it isn't HFT low latency. It's kind of, you know, right above that. It's really more like in the algorithmic trading latency space. So it's okay. I think for us, allocations still happen within nanoseconds. And so in the grand scheme of things, our compute is in the microsecond scale, you know, one, two, to double-digit microseconds, for most of the things that we do. So it hasn't been a problem for us.
So then, now, what's the end result? I know you started with this 34x in your initial experiment.
Yeah, fantastic. We just released a blog post with 400 hours of benchmarking. Thanks for asking.
And I almost didn't talk about it.
We spent 400 hours comparing the open source alternatives with Red Panda.
And the best result that we've gotten is 10,000% improvement over the state of the art.
So it does pay to rebuild from scratch in C++20. And what's funny is that we are literally a startup, like, we're a venture-backed startup, so it's almost the opposite of what people think when you're building a startup: one, you don't build infrastructure, and two, you definitely don't use C++20. But, you know, it sort of worked out really well for us, if you think about it. And, you know, we've had a lot of experience with C++, so it's kind of a natural choice for us.
But yeah, that's really, I think, the best result.
And yeah, I think there's, you know, there's really kind of this in-depth blog post that
I'll send it to you guys so you can post it on the podcast footnotes.
But definitely dig deep.
You know, I'm also available on Twitter if people want to reach out and they're like,
hey, this benchmark didn't make sense.
We spent literally hundreds of hours
testing, so I can probably answer very, very
detailed and pointed
questions. Well, I know our listeners who are
interested, the next question they're going to have is,
what is your business model? Because if I understand right,
Kafka's Apache
Foundation open source? Correct.
And Red Panda is
also available on GitHub.
So you can download it.
You don't have to pay us.
It's free to use.
It's BSL, which means as long as you don't compete with a hosted version of Red Panda. Like, Vectorized gets exclusive rights to hosting it, but only for the next four years. So it's like a four-year rolling window. So for today's release, we've got four years to be the exclusive ones that can host it.
That's a very interesting model.
Yeah. But in four years, the core becomes Apache 2. And so, other than that, people can just run it. They can use it as a Kafka drop-in replacement. It's on GitHub: all of the code, all of the C++ techniques and details. People can go and download it and play around with it and ask us questions. And so our monetization strategy is the cloud, largely.
And we have some enterprise features like disaster recovery,
you know, kind of like some enterprise security features,
multi-cloud deployments, you know, multi-region,
kind of like the things that it would take larger enterprises to feel comfortable
running, you know, sort of the heart, the guts of the system,
which is often this data pipeline.
But, yeah, it's also available under a BSL license,
which is the same license like CockroachDB and MariaDB
and a bunch of other database companies use.
Yeah, and so, to me, that felt like a good balance between building open source and then also building a company that can pay the bills. So that's the license that we settled on.
Okay, awesome. Thank you.
Yeah, well, it's been great having you on the show today, Alex. You mentioned you are on Twitter. Where can listeners find you online?
Yeah, my handle is emaxerrno, like literally the largest errno number you can get.
That's on Twitter; vectorized.io is on Twitter as well.
And that's really, I think, the best place.
We have a Slack at vectorized.io/slack
if you want to chat with me real time about these things.
But yeah, I'd love to, if anyone has any questions,
just feel free to ping me or GitHub, really.
File a ticket and be like, you know, this doesn't work.
I'm kidding.
You could be nice about it.
But let me know.
I think either of those channels are good for me.
Okay, great.
Thanks, Alex.
All right, guys.
Thank you.
Pleasure being here.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons
who help support the show through Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course, you can find all that info
and the show notes on the podcast website
at cppcast.com.
Theme music for this episode
was provided by podcastthemes.com.