CppCast - Event Streaming
Episode Date: March 19, 2021

Rob and Jason are joined by Alex Gallego. They first discuss blog posts from Visual C++ on IntelliSense updates and a tutorial for programming StarCraft AI. Then they talk to Alex Gallego about Red Panda, the event streaming platform written in C++ that's compatible with the Kafka API.

News:
MTuner
IntelliSense Improvements in Visual Studio 2019
STARTcraft - Complete Beginner StarCraft: Brood War AI Programming Tutorial with C++ / BWAPI
Git source has a banned.h file that blocks use of certain C functions

Links:
Vectorized.io
RedPanda on GitHub
The Kafka API is great; now let's make it fast!

Sponsors:
PVS-Studio. Write #cppcast in the message field on the download page and get one month license
Date Processing Attracts Bugs or 77 Defects in Qt 6
COVID-19 Research and Uninitialized Variables

Episode Transcripts: PVS-Studio
Transcript
Episode 291 of CppCast with guest Alex Gallego recorded March 17th, 2021.
The sponsor of this episode of CppCast is the PVS-Studio team.
The team promotes regular usage of static code analysis and the PVS-Studio static analysis tool. In this episode, we discuss AI programming for StarCraft.
Then we talk to Alex Gallego, CEO of Vectorized.io.
Alex talks to us about Red Panda,
the Kafka-compatible event streaming platform written in C++ by C++ developers.
I'm your host, Rob Irving, joined by my co-host, Jason Turner.
Jason, how are you doing today?
I'm all right, Rob. Episode 291.
Our last guest, I think, asked us if we had anything special planned for 300.
Yes, I believe Arnaud did ask that kind of once we kind of wrapped up the recording.
Yeah, Alf.
Yeah.
Yeah.
And the answer is no.
But I will say if our listeners have an idea that they think would be outstanding, then go ahead and respond to that on Twitter or something.
And maybe we'll do one of those ideas.
Yeah, I feel like at this point, we've had kind of the biggest names on who, you know, were interested in being on the podcast.
But if there's someone you think we've missed, certainly reach out to us and let us know, like, oh, you should definitely have so-and-so on.
Definitely. All right, well, at the top of every episode I'd like to read a piece of feedback. We got this tweet saying there is a C++ memory profiler and leak finder for Windows, and it's by a GitHub user named Milos Tosic. I have the repo open; it's called MTuner, an advanced C and C++ memory profiler. I have not had a chance to check it out, but this listener recommended it. If you were listening to last week's episode and Heaptrack sounded wonderful, but you need something that works on Windows, then maybe check out MTuner and see how well that tool works for tracking memory usage.
Definitely, yeah.
Okay, well, we'd love to hear your thoughts about the show.
You can always reach out to us on Facebook, Twitter,
or email us at feedback at cppcast.com.
And don't forget to leave us a review on iTunes
or subscribe on YouTube.
Joining us today is Alex Gallego.
Alex is the founder and CEO of Vectorized, 12 years in streaming,
previously a principal engineer at Akamai and founder and CTO of Concord.io,
a C++ stream processing engine sold to Akamai in 2016.
Alex, welcome to the show.
Thanks, Rob, for having me. How are you guys?
Doing great. How are you doing?
I'm doing well. It's, I think, early in San Francisco. Glad to be here. I feel like I've
been listening to you for so long. It's great to finally, you know, kind of see you face-to-face
on the podcast.
Oh, that's cool.
Yeah.
Your bio makes it sound like you've been
doing something in content distribution or data streaming of some sort for a rather long time.
Yeah, so actually right out of college, I was a cryptography major, and then I dropped out of my grad school program to go work on an embedded database in finance, for basically how the S&P 500 portfolio is managed. Back then it was the BlackBerry days, so it was extending SQLCipher with some cryptography primitives. It was really cool. But yeah,
ever since after that job, I've been in the real-time and big data space.
In particular, I guess the first gig was really how to do real-time multi-armed bandit algorithms for ad tech.
If you're trying to display ads, how do you get more people to effectively click on things? And how do you run experiments?
But how do you do that in real time so that the budget spent is spent efficiently across a fleet of 100 servers?
That kind of stuff.
And then after that, I went on to build a stream computing platform. So I don't know if the listeners know Spark streaming or Apache Flink,
but it's effectively just big data frameworks to do real-time manipulation of events
as they're coming in through your data pipeline.
And then the last one is...
Yeah, go ahead.
I mean, if we're going to be talking around this concept of streaming for the rest of the interview,
go ahead and give us a definition of what you mean by that. Is that like YouTube streaming or is that just like data processing,
like a data stream coming through with like pipelines of data processing or what?
Yeah, that's a good one. So it's not video streaming, it's really data streaming. All right. So event streaming is really kind of a set of technologies to deal with time-sensitive data.
And let me define an event for the audience.
So an event, for the purpose of event streaming,
is really data with a timestamp and an action.
So like a click, for example, it happened at a particular time
and it may contain a bunch of data.
Or a trade, right? When you make a trade, it's some data, but it has a timestamp and it has an action, and so on. And so event streaming is really kind of a technology to help developers deal with this time-sensitive data. And the time-sensitive component is an interesting way of putting it, because there is this expectation that something will happen with regards to the timestamp. So for a trade, when you make a trade on,
let's say Robinhood, you kind of expect that thing to, you know, to eventually get resolved.
Same thing when you order food on your phone app from Uber Eats or DoorDash or all of these
applications, you kind of expect the food to get prepared and eventually get delivered to your
house. But it's so all of these events, you know, are kind of like data coupled with time and inaction.
And then at a low scale, you know, you could probably use a pencil and a notepad when you're,
you know, lining up for like a restaurant to get a table.
But at large scale, it's really difficult.
And you need systems like Apache Kafka or Red Panda to help enterprises deal with
that kind of volume. And that's specifically what those tools do is process data. Yeah, correct.
Yeah. So it's really like an event streaming platform. And so all of the last 12 years,
I think as a comment to the intro, I've been working on this and I think event streaming
as an ecosystem is really two technologies.
There is the compute side and then there's the storage side. And so when you chain compute and store, that chain of compute and store at the end is really kind of what stream processing is. And you need both. You need something to do something intelligent, like, whatever, parse some fields of JSON or save it onto a database, but you also need these intermediary steps of storage. And so combining those technologies is really what stream processing is about.
Okay. Okay.
All right, so we're gonna absolutely dig more
into stream processing, but before we do,
we got just a couple news articles to touch on.
So feel free to comment on any of these, Alex, okay?
Okay.
All right, so this first one is a post
on the Visual C++ blog,
and this is on IntelliSense improvements
in Visual Studio 2019.
And lots of great stuff in here.
They're starting to have IntelliSense support
for C++20 features, like coroutine support.
It's always good to see new features of the language lighting up in the editor, in the IDE.
Anything you want to highlight here, Jason?
I don't know.
Yeah, lots of like more real-time things as you're typing.
I like that.
I haven't actually really played with IntelliCode yet.
I'm curious if either one of you have.
They highlight enhancements in IntelliCode here also.
That's the AI-backed IntelliSense. Yeah, I think I have
run that against my code base. And yeah, basically when you
are getting prompted with IntelliSense hints, the things you use
the most often will be kind of up on the top instead of just being alphabetized
or whatever it was before. So yeah, it has been helpful.
Yeah, that's interesting.
I've used Emacs for so long
that to me,
I feel like autocomplete
is me usually grabbing
for the method signature.
Now, I'm kidding.
I do use ClangD as autocomplete,
but I'd be curious to experiment
with coroutines
because they're so gnarly to debug.
And I think we'll talk about our code base
and kind of our experiences with debugging
with asynchronous code and coroutines in a second.
But I hope that in addition to autocomplete
and all these features,
they are making coroutines easier to debug.
I'm sure someone's working on that.
I feel like I just want to make a comment real quick
on autocomplete because, yeah, I agree.
Even though I've really gotten in the swing of using IDEs lately, and I've been using CLion a lot,
I swear it is almost never that the autocomplete on, like, braces does what I want it to do.
I'm like, oh, I'm missing an opening brace here.
So I type the opening brace, and it automatically adds a closing brace.
No, I was already correcting a bug.
I don't need you to create a new bug for me.
I don't know.
That's just an aside.
And it's not even a comment to CLion.
It's just the kind of thing that seems to happen
with like every IDE that I use.
Anyhow.
All right.
Next, we have this YouTube video
and this is called STARTcraft.
And it's a complete beginner StarCraft: Brood War AI
programming tutorial with C++. And I've played StarCraft a lot. I've never thought to try to
program the AI. Jason, you watched some of this video, right? How does it work?
Yeah, I watched about half of the video. That's about what I had time for. And I was watching it at like 1.75 times speed, which for me was a pretty good flow for this particular video, because it was a livestream thing, so it was a little on the slow side. But yeah, it's funny, because I watched my roommates play hundreds of hours of StarCraft, but I never really cared about the game myself. But it seems like a really cool concept that they do these AI-versus-AI game matches.
Oh, okay, so that's kind of the point: you're programming an AI in C++ to play against other AIs?
Yeah. Anyhow, if you have any interest at all in this, I feel like it was a really good intro to getting started with a Visual Studio project, with C++, with how all of these things work, how game time loops work, a little bit of understanding at a high level of how they do DLL injection to be able to hook into the StarCraft game, to be able to do this per-frame thing to pump inputs and outputs between the things. I thought it was pretty cool. It seems like a really fun way to get people into programming and into C++, into game development to some extent.
Yeah.
I saw the video too, and it was really cool just to see it.
I haven't developed in Windows in so long,
but it seemed really neat.
Yeah, and he starts out with, like,
basically starts you in the debugger.
Like, here, you can put a breakpoint here and hit play,
and the StarCraft game is just going to stop and wait for you
while you do your debugging on that frame.
I like that the programming API is synchronous,
so it'll block.
So I think it's kind of key to getting people into programming games
because multi-threading is really hard for people
to usually wrap their head around,
especially as a getting started project.
So whoever designed that, I think it seems really good for people just getting into game programming and into building AI.
Because then you just have to deal with the current state. You don't have to deal with concurrency.
Right, and you don't have to write any of the game engine.
You can just deal with the AI. And then they're doing all kinds of other things, like overlays on top of the current game and stuff that, if you really want to get advanced, you can.
And also, the techniques they used were, you know, fairly modern-ish looking code and stuff. Like, I felt like it had a good clean feel to it in general, except I wanted them to use auto more. I feel like, with those signatures, I was watching the video and I was like, auto, please use auto.
It's kind of like the C++98 memories. When I started writing C++, it was before the standard, for OpenVMS. So it had all these janky primitives. And long story short, the type signatures of this game were so long, I was like, it reads so much better if you use auto.
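(For listeners curious what that synchronous, per-frame hook looks like in code, here is a rough from-memory sketch of a minimal BWAPI module. Names are approximate and this is an illustration, not code from the STARTcraft tutorial.)

```cpp
#include <BWAPI.h>

// Hypothetical minimal bot: BWAPI drives the module synchronously, calling
// onFrame() once per game frame while the game waits, so there is no
// threading to reason about. (Sketch from memory of the BWAPI 4.x API;
// exact names may differ slightly.)
class HelloBot : public BWAPI::AIModule {
public:
    void onStart() override {
        BWAPI::Broodwar->sendText("Hello from a C++ bot!");
    }
    void onFrame() override {
        // Send every idle worker to mine the nearest mineral patch.
        for (auto unit : BWAPI::Broodwar->self()->getUnits()) {
            if (unit->getType().isWorker() && unit->isIdle()) {
                unit->gather(unit->getClosestUnit(BWAPI::Filter::IsMineralField));
            }
        }
    }
};
```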
Well, and I saw a couple of places where he typed auto and then said, well, this is actually a whatever, just for the sake of explanation, and deleted auto and typed out the full type name. And I'm like, you could have just left auto.
I totally agreed, though. Yeah.
Okay, last thing we have, and this is just a link to Git's own repo for Git.
And they have a banned header file
and just a nice little trick
if you want to prevent certain C functions
from being used within your code,
you can have this type of header file
where it just defines something like strcpy
to print out
an error message. So if anyone tries to use strcpy and you, you know, don't trust that function, then you'll get an error message, and the developer is going to have to move on and find something else to do. I think I've seen this type of trick before, so I guess it's somewhat common. I'm curious if either one of you were surprised by any of these.
No, I think it's kind of annoying when you integrate with a third-party library if you have those macros defined, because then, like, if it's in a header, it sort of complains. I like the cpplint style, where it prints errors and then via comments you can disable them, or, you know, with clang-tidy you can sort of disable the warnings. You're like, I'm pretty sure I know exactly what I'm doing, and it also serves as a warning for future developers, like, this was exactly my intention and I intend to do this thing. I mean, I do get that people, you know, should probably use std::copy_n as opposed to, you know, the C variant, etc. But yeah, it seems annoying to integrate with third-party libraries, but I guess if you write most of your code, it should be okay.
That's a really good point.
Like, do not #include this banned.h into another header file, right?
It struck me that this is assuming that the compilers implemented all of these as macros, because it starts with #undef strcpy and then defines strcpy, #undef strcat, and so on. I feel like for portability you would need #ifdef strcpy, #undef strcpy, right? Otherwise you're going to get a compiler error, or at least a warning, on some compilers, unless those are just disabled.
I guess that's a good thing about open source: you make it somebody else's problem, you make it the compiler people's problem, right? Like, if it doesn't work on Windows, you're just like, someone on the Windows team is going to come back and patch this.
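(For listeners curious what such a header can look like, here is a minimal sketch in the spirit of Git's banned.h, including the #ifdef guards Jason mentions. This is an illustration, not Git's actual file.)

```cpp
// banned.h -- minimal sketch of the idea, not Git's actual header.
// Include it from .c/.cpp files only, never from other headers.
#ifndef BANNED_H
#define BANNED_H

// Poison the call site: any use of strcpy now fails to compile with a
// message pointing at this identifier instead of silently compiling.
#define BANNED(func) sorry_##func##_is_a_banned_function

#ifdef strcpy            // guard in case the C library defines it as a macro
#undef strcpy
#endif
#define strcpy(dst, src) BANNED(strcpy)

#ifdef strcat
#undef strcat
#endif
#define strcat(dst, src) BANNED(strcat)

#endif // BANNED_H
```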
I've never taken that approach. That sounds like a good one.
Okay, well, Alex, so after introducing you, we talked a little bit about what stream processing is. And I think you mentioned Kafka and Red Panda.
Could you go into a little bit more detail about what Kafka and the product you work on, Red Panda,
are? Sure. So Kafka is really two things. I think when people say Kafka, they often refer to Kafka,
the API, and Kafka, the system. Combined, they are sort of the largest data streaming platform in the world right now.
It's so popular, it's almost synonymous with real-time data streaming.
So basically every big company is using that.
So Uber, Robinhood, and you can basically YouTube just say Kafka and company,
and there's almost like a video per company.
It's kind of like the MySQL for databases, but in the streaming world.
So that's what it is.
And then what Red Panda, so this technology was built around 11 years ago out of LinkedIn.
Kafka was.
Correct.
Okay.
And Red Panda, the project that we are working on is a C++ re-implementation of kind of the base API. And it's really the way I like to frame it to people is sort of the natural evolution of the principles that Kafka, the system, built on.
And we extend sort of Kafka, the API.
So let me give you an analogy, I think, for all of the listeners. If you think of SQL as this lingua franca for speaking to multiple databases, whether
you're speaking to DynamoDB or Google Spanner or Postgres or MySQL, right?
SQL is sort of this language to communicate with databases.
Similarly, in the streaming world, we think that Kafka, the API, became this lingua franca
for communicating with real-time streaming systems.
And Red Panda is really a reimplementation.
And I think what we think is the future of streaming.
And so what is interesting for this audience is, as one of our advisors likes to say, sometimes you have to reinvent the wheel when the road changes. And hardware has changed so, so much over the last decade, right? Like, if you go on Google, you can now rent a machine with 244 cores, a terabyte of RAM, and, you know, multiple terabytes of NVMe SSD devices.
In 2010, the bottleneck in computing was spinning disk.
And so Red Panda was like, well, what could we do if we, you know, were to start from scratch for modern hardware, where CPU is the new bottleneck, and it's no longer the spinning disk, for these kinds of big data and storage systems. And so we re-implemented Kafka, the API, from scratch in C++. We actually use C++20 now, but we started with C++11, using a thread-per-core architecture.
And then the product really is kind of aimed to do three things. The Kafka API is kind of the first thing. The second thing is to unify historical and real-time access by integrating with storage systems like Amazon S3 or Google Cloud Storage buckets or Azure Blob Storage. That's the second one. And then the last one is inline lambda transformations with WebAssembly. So I know, Jason, that you have talked about WebAssembly on the show before; the idea is to sort of push these capabilities to the storage engine.
So it's really, that's what the project is aiming to do.
And kind of the analogy for the C++ community that is listening to this is think of Kafka the API, kind of like the C++ language spec.
And, you know, Clang and GCC as being, you know, Kafka, the system and Red Panda.
So it's really two implementations for the same API at kind of the base layer.
Okay.
Okay.
So what initially, I guess you kind of went into this a little bit, but you talked a lot
about, you know, hardware has changed a lot over the past decade.
Is that what prompted you to create Red Panda?
Or were there other factors that went into it?
Yeah.
So in 2017, I actually did a YouTube talk, which I think the audience can Google, at Seastar Summit. And so the idea there: I just wanted to take two edge computers, I think at Akamai, and I wanted to measure what is basically the gap in performance between the state-of-the-art software for doing data streaming, which at that point really was Kafka, and the hardware. I just wanted to measure the gap, right? And so I wrote a prototype using DPDK, so kernel bypass at the NIC level, and DMA, so bypassing the kernel page cache to speak to the storage device.
So as fast as I could, but really just the first pass.
And then compare that with Kafka, right?
So as slow as I could get on the hardware level and really comparing it with the state-of-the-art system.
And the first pass was a 34x latency reduction on the tail latencies.
So it's like, wow, why is this the case?
And so I really didn't do anything.
I started this company in 2019, but that was really sort of this big moment of realization
that really the roads have changed.
The hardware is so fundamentally different.
And so let me give you orders of magnitude. Disk, just disk alone, is about a thousand times faster than disks were a decade ago. And so it sort of requires you to think about it, because the latencies are so, so different, right? Like, a DMA write on a modern NVMe SSD is, say, double-digit microseconds to about 130 microseconds, versus, you know, the latency a decade ago was multiple milliseconds on a spinning disk.
And so it's like you need to rethink your software architecture
when the components are so different.
It's a totally different world.
And so that really was the motivation.
I really wanted a storage engine that could keep up with the volumes
that I wanted to push when I was at Akamai, and previously in ad tech,
and all these other places in a way that was cost effective.
And so when I saw that hardware was really so capable,
I was like, wow, there's so much potential here.
And that was really sort of the beginnings of Red Panda.
I want to make sure I understand your original experiment. You're saying you were doing, basically, writing to the disk at the fastest possible speed? And what is it that you're doing, DMA? Sorry, I lost the two different technologies you were using there.
Yeah, so DPDK is an Intel driver framework that allows you to decommission the NIC from the kernel; you unload the kernel drivers, and the driver runs as part of the application space. So your application literally talks to the hardware, and then it loads the driver, and it runs the NIC itself.
So when you get a NIC write,
like when you're receiving a packet on your receive queue,
it literally writes to the first 8 kilobyte of application memory space.
And so there's no context switch, there's nothing,
because your application owns the memory that the device is writing to.
And so that's at the network level.
And then on the disk level, it's using DMA.
And so what it is, is that you're aligning memory in multiples of what the file system exposes, so that the SSD device basically could just, you know, memcpy that onto the SSD sectors, kind of on the underlying hardware.
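(To make the page-cache-bypass idea concrete, here is a rough standalone sketch of an O_DIRECT write on Linux with aligned memory. It is illustrative only and not Red Panda's code; the file name and block size are assumptions, and it expects g++ on Linux since O_DIRECT is a GNU/Linux extension.)

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // O_DIRECT bypasses the kernel page cache; buffer, offset, and size must
    // be aligned to the device/filesystem block size (assumed 4096 here).
    constexpr size_t kAlign = 4096;
    constexpr size_t kSize  = 4096;

    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kSize) != 0) return 1;
    std::memset(buf, 'x', kSize);

    int fd = open("data.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { std::perror("open"); std::free(buf); return 1; }

    // With aligned memory, the device can DMA straight from our buffer.
    if (write(fd, buf, kSize) < 0) std::perror("write");

    close(fd);
    std::free(buf);
    return 0;
}
```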
And the experiment was,
let me run a Kafka client and server
as the first, as the baseline,
and measure how much and how fast
I can push the hardware, right?
That's the state-of-the-art streaming system.
Turn that off, and on the same hardware
with two SFP NICs back-to-back,
so like the server is literally connected with a single wire.
For both experiments,
let me run a C++ client and a C++ server
where the server uses DPDK and page cache bypass
on the Linux kernel side,
and then just measure what is the gap in latency.
What can the hardware actually do? Effectively, to me, it boils down to: what is the essential versus the accidental complexity? What happened? Why is software at 34x? You know, why is there such a gap in the latency performance?
So that was the experiment.
So that number that you came up with, that 34x, is it fair to say that's the theoretical limit?
That's as fast as possible, but you're going to get slower as you start to actually add software
on top of that? Or was there still room for optimization there? Well, that was just the
first pass, right? So it didn't take into account all sorts of, you know, further improvements. I'm not sure. I think we could compute the theoretical limit. Well, so, bottlenecks,
that's a really interesting and difficult question to answer.
But the bottleneck will move around, right?
Because if your disk can keep up with the network, then your network will be the bottleneck.
But if your network is faster than the disk, then your disk will be the bottleneck.
So the gist, and the way we like to explain this to people, is that Red Panda should drive your hardware at capacity. And in fact, the way we test it, and this is really powerful when we go into, like, a hedge fund or a bank or an oil refinery, etc., is we ask them to just run dd on the disk, or sorry, fio on the disk, to measure latency and throughput, which is this Linux tool that Jens Axboe wrote, who works on the asynchronous IO interfaces for the kernel. And then on the network side, run iperf. And it should literally run at basically the slowest of the fastest. Does that make sense?
So it should run at whatever is the bottleneck.
And so what the limit of the hardware should give you
should really be very close,
like Red Panda should be able to run very close
within the physical limits of the hardware,
both in latency and throughput.
So you pull up a process manager,
you expect to see Ethernet 100%,
disk 100% or whatever.
Correct.
Okay.
Yeah.
And that's how we did a lot of the initial benchmarks because we couldn't find clients that could keep up with a lot of this stuff.
And so we just kept increasing the load.
And then kind of a big improvement from a software perspective was: how do we deliver that with stable tail latencies, right? Because you can deliver basically fantastic latencies when the hardware isn't saturated, but what happens when the hardware starts to get a little bit saturated? So it's not at 100%, but let's say between your 75th and 90th percentile of hardware saturation, when you are really exercising every part of the system: you're exercising your memory reclamation techniques, you're exercising your memory allocation, you're exercising, you know, the hardware, your backpressure from disk all the way to the NIC, etc. And so I think delivering flat tail latencies at near-saturation speeds is something that we've worked on a lot.
The sponsor of this episode is the PVS-Studio team.
The team develops the PVS-Studio static code analyzer. The tool detects errors in C, C++, C#, and Java code.
When you use the analyzer regularly,
you can spot and fix many errors right after you write new code.
The analyzer does the tedious work of sifting through the boring parts of code,
never gets tired of looking for typos.
The analyzer makes code reviews more productive
by freeing up your team's resources.
Now you have time to focus on what's important,
algorithms and high-level errors.
Check out the team's recent article,
Date Processing Attracts Bugs or 77 Defects in Qt 6
to see what the analyzer can do.
The link is in the podcast description.
We've also added a link there to a funny blog post,
COVID-19 research and
uninitialized variable, though you'll have to decide by yourself whether it's funny or sad.
Remember that you can extend the PVS Studio trial period from one week to one month.
Just use the CppCast hashtag when you're requesting your license.
Before we dig a little bit more into Red Panda and how you've implemented this streaming service, going back to Kafka, what is that service written in, and have they made...
And it's really a two-system thing.
It uses this system called ZooKeeper, which is notoriously difficult to manage, and Kafka.
And so kind of the challenge is not only in performance, but is in the operational complexity
of running two systems, right?
If something fails, you need to have a deep understanding
of what are the possible failure semantics
and the additional fault domains.
So let's say Zookeeper crashes, what happens to Kafka?
And under what operation will Kafka report an error
if Zookeeper is down?
Or clearly, if one of the Kafka brokers is down, there are probably known failure domains there. But as a software engineer, as a developer who isn't necessarily a streaming expert, dealing with the complexity of two fault domains is really challenging, especially because the configuration on clients, and producers and consumers of data, and the broker is sort of all intermingled. And so a lot of the improvements from a technical perspective were like, if we were to start from scratch, how can we make this really, really easy to use? And so we wrote it in C++ as a single binary. And honestly, that has opened up the doors for so many things for us. But yeah, at a high level, Kafka is implemented in Java.
And yeah, does that help?
Yeah, I think so. This is kind of an aside, but is Kafka why people that I know in big data know Scala? Is that the main correlation there, or is there some other tool?
Yeah, I think Spark Streaming is really the tool that kind of drove a lot of the Scala adoption.
So Scala is really nice for developing DSLs.
Or I think back in 2012, DSLs in Scala were all the rage.
I think that's what happens when people start to get into functional programming languages. They're like, oh, look, I can express all of these things and I can introduce new symbols. And so I think Spark Streaming was really the framework that drove Scala adoption, in my opinion.
Okay. Definitely an aside.
But Spark Streaming does connect to Kafka. So there is a correlation in that big data world.
And often, to do real-time streaming, Spark is really the compute level of things and Kafka is the storage level. And so I think in the beginning of the podcast, I mentioned how you need both. You need compute and a store kind of chained together to do something useful, like fraud detection, or delivering your food to your door, or things like that.
So I think it's both.
Spark is often deployed alongside Kafka.
So Spark Streaming, Kafka, and you mentioned ZooKeeper, and you're talking about multiple points of failure. Which of these things does Red Panda and its single binary encompass, exactly?
Yeah, great question.
A lot of all of them.
So the first thing is we eliminated the multiple fault domains. We made it a single C++ binary. And we rewrote Raft from scratch. So it took us two years, really, to write Raft and test Raft in a way that was scalable and actually could drive the hardware at capacity, right? Because that was an initial goal.
It's like, how do we utilize hardware efficiently?
And so we made it a single binary, which means we eliminated Zookeeper
and we eliminated all of the tunables that come with Kafka.
And so that was the first one.
It's really easy for people to deploy.
The next part of it, which is what we think is the future of streaming, is that
to do simple things in real-time streaming is actually really complicated. Let's say that you
just wanted to guarantee GDPR compliance. And to do that, you just have to remove a social security
number from a JSON object. That simple use case, for people to do it at scale, you need Kafka and ZooKeeper and Spark Streaming, and a bunch of data ping-ponging between Kafka and Spark Streaming, back into Kafka, and potentially back into Elasticsearch or whatever the end data store would be. And so what we said is, let me give you, the programmer, an ability to express these, computationally speaking, dumb transformations inline (it's sophisticated machinery, but they're dumb transformations), so that as data is coming through you could just, you know, remove the social security number. So you don't need to ping-pong your data between, you know, Red Panda and Spark Streaming and back into Red Panda. But WebAssembly, the engine that we've embedded, is really meant for simple things. It's not meant to do complicated multi-way merges. So at the storage level, we are a drop-in replacement for Kafka. So we have some of
the largest hedge funds in the world, some of the largest oil and gas companies in the world,
really finance and web companies, CDN companies, et cetera. So at the lowest level,
we are a drop-in replacement for Kafka, the system, and the API.
But we're also starting to extend the primitives with simple things like WebAssembly.
So does that make sense?
So we're not a full replacement for Spark Streaming, which could do complicated multi-way merges.
But the simple things are now actually simple to do with Red Panda.
And if they need to do the complex things, you can still offload that data to some other process? Correct, correct. Because we're Kafka
API compatible. So you could still use Spark with the same drivers and the same applications that used
to work with Kafka, right? Because we implement the same API, the same specification. So they
could just point it at Red Panda and everything will just work correctly. So we're just sort of
making the simple things simple.
I'm kind of finding myself a little fascinated by the development history here, because I'm
assuming Kafka and this menagerie of tools around it evolved over a decade or so.
Correct.
Or something like that.
And so then, when you come in with Red Panda, you are coming from a perspective, and correct me if I'm wrong, please, of basically saying: well, the problem is solved, we know what the problem is and we know what the solution is, but now we get to start from a fresh perspective and say, okay, let's do it, but better.
Yeah. I mean, we have to give a lot of credit to Kafka, in that
before Kafka you had, like, RabbitMQ, and you had TIBCO and Solace and all of these enterprise message buses that people really don't like to use. And then Kafka came, and developers loved that they could be heroes overnight, literally. It's kind of impressive. The ecosystem is so large that you can take Kafka and Spark Streaming and Elasticsearch
and in a day, you can come up with a sophisticated fraud detection pipeline for a bank.
So that's what people love.
It was this API, this pluggability, this huge ecosystem.
But like you mentioned, we knew what the end result should be.
And the insight there was like, well, what if we start from scratch?
What are the things that we could fundamentally change about how to implement this software?
Still give people the same sort of hero and wow moments of being able to plug and play all of these open source systems,
but really design it from scratch in C++ and see what we could do better,
both from a protocol, from an architectural perspective,
from a performance perspective.
Yeah, and so that was kind of the technical history there.
I also find it interesting that you're using,
you keep saying WebAssembly,
and it sounds like WebAssembly specifically
for your processing engine.
You're not saying JavaScript that then gets compiled, or C++ or whatever. You're just saying, you feed us some WebAssembly, we'll run it on the code, on the stream?
Yeah, yeah, correct.
So right now we are exploring between Wasmer and V8.
So the first implementation uses V8.
And WebAssembly is important because effectively
people like to program in whatever language they like to program.
That is just how it is. And so we're just like, well, what is sort of this common denominator
that can give us, you know, sort of reasonable performance while we're executing?
And so we leverage the LLVM backend to produce really good web assembly
in whatever programming language you want, right?
So if you want to send us garbage,
it'll increase the latency of your processing,
but largely people are really good at, you know,
doing these simple things pretty fast.
And so WebAssembly sort of opens up the ecosystem
in that you as a developer,
you could program in whichever language you want,
like that's up to you,
whatever libraries and whatever you want to bring up,
that's your problem.
We just execute it inline.
And so it's really actually from a mental perspective, it's the same idea of when you inline a function in C++,
like, you know, it's going to be fast
because it's like the code is right there.
So it's really the same idea here with WebAssembly.
It's like as data is coming in,
you know, the code is literally right there
to make just the simple transformations and then pass on the information onto the storage subsystem.
Fascinating how often WebAssembly comes up in our C++ discussions.
It's coming up more and more, it seems.
More and more, yeah.
If you're joining CppCast for the first time ever, you probably want to go research WebAssembly.
Yeah, go back and listen to the episode. What was it, who did we have on WebAssembly most recently, Jason?
I don't know. It's come up like seven or eight times recently and I'm having a hard time keeping track of them. Ben Smith, that's right, episode 251.
Okay. Oh yeah, that was a good one. That was fun.
Yeah. Anyhow, sorry. So we've talked a lot about, you know, what stream processing is and what you've done. Do you want to dip down into more technical aspects of what you've done when architecting and writing Red Panda to make it so much faster compared to Kafka?
Yeah, this is really the most fun to talk about with a technical audience. You know, I feel like latency is really the sum of all of your bad decisions.
That was good.
Sorry, to deliver low latency, it really takes a kind of discipline in making sure that the things that you're doing have this holistic design principle. And so the first thing that we noticed is we
didn't want to use the page cache, the page cache in the Linux kernel. So this runs on Linux only. The page cache in the Linux kernel is a generic, you know, caching system, and so every file gives you a page cache for that particular file. But that file, for simplicity of IO, has a global lock, right? And then it also has all of this lazy materialization of pages. It's actually really sophisticated and really nice, in that when you do a read or write syscall, there's all of this generic machinery. So for example,
let's say that you read the first byte of a file. The Linux kernel will actually go ahead and read the next four or five pages; basically, it'll kind of prefetch, start the mechanical execution of fetching the next few pages from disk, so that the next read call that you make on the file will probably have the data already materialized in memory. And so, you know, for the kind of stuff that we're building, we know exactly what the patterns are. Literally, the Kafka protocol tells us the offset of where to start reading data from, how much data, at what time, and we understand the mechanics of a queue.
We know that once you start reading, you never read behind.
You always read ahead.
So we can start pretty aggressive read-ahead mechanics.
But what's more important is that memory is limited,
and we don't use, actually, virtual memory.
So we actually pre-allocate the whole machine's memory to get low latency allocation,
and then we split the memory evenly
across all the number of cores.
At the lowest level, we use a thread-per-core architecture. So every core has a pinned thread, which means that the threads don't move around. So there's no, you know, memory access ping-ponging and cache line invalidations. Let's say you have a four-core machine; you'll have exactly four pthreads, full stop. And we'll allocate the whole machine's memory across these four pthreads, and then we'll split it up, and we'll use libnuma and the hardware locality (hwloc) library, which is the supercomputing library, which tells the application which memory regions literally belong to that particular core, so you're not doing this kind of remote memory access. And then we lock the memory that belongs to that particular core so that the accesses aren't kind of, you know, ping-ponging around, especially on multi-socket systems. And so that's really kind of the basic layer. And the way communication happens across cores is that we use explicit, structured message passing between the cores. And so at its lowest level, we really
have, I think, quite a different architecture. Because remember, what we were trying to move away from was this multi-threaded, kind of work-stealing-style compute platform that Kafka has, right? Which is like thread pools and so on, built for when the bottleneck a decade ago was spinning disk, and toward the new bottleneck in compute, which is CPU. It's like, how do you keep the CPUs busy and doing useful work? And so that was how we started this.
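(As a rough illustration of just the pinning part of a thread-per-core design, and not Seastar's or Red Panda's actual code, pinning one thread to each core on Linux looks roughly like this:)

```cpp
#include <pthread.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to a single CPU so the scheduler never migrates it.
static void pin_to_core(unsigned core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    unsigned cores = std::thread::hardware_concurrency();
    std::vector<std::thread> shards;
    for (unsigned c = 0; c < cores; ++c) {
        shards.emplace_back([c] {
            pin_to_core(c);                 // exactly one thread per core
            std::printf("shard %u running\n", c);
            // ... each shard would own its slice of memory and its own event loop ...
        });
    }
    for (auto& t : shards) t.join();
}
```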
And kind of from a C++ perspective, it's been a lot of fun because we compile our own
compiler.
So right now, for example, we bootstrap Clang with GCC 10.
So we use Clang 11.
And then we use Clang 11 to compile all of our transitive closure libraries
so we get to use C++ 20, and we get to use kind of all of the nice things.
And, yeah, and so it's been kind of a lot of fun to use the latest coroutines
and figure out how to effectively implement an operating system
within the application because we don't use virtual memory.
We bypass the Linux kernel on the page cache.
We kind of do all of these tricks at the lowest level.
Yeah, from an architecture perspective,
it kind of sounded like you said
you take a four-core machine
and split it into four separate machines
if you're giving each one its own dedicated process,
its own dedicated memory.
Yeah, correct.
And what's interesting about that
from a C++ perspective is the way you communicate between these multiple cores: you set up SPSC queues, so single-producer, single-consumer lock-free queues.
And so when you communicate with a remote core, let's say core zero communicates with core three, you explicitly invalidate that particular cache line, but the other cores are untouched, right? You do this explicit synchronization between cores, and that is really scalable, especially when you have multi-socket computers. But even on a single-socket motherboard, that network of queues sort of makes all of your communication explicit. Here's the concurrency and parallelism model that it gives you, or that it superimposes on the developer: you know that your parallelism is fixed to the number of physical cores, full stop. Right? Like, you'll have four pthreads and there will be no other pthreads ever running. But what we do is we use coroutines and futures to have concurrency, not parallelism, within each core. It's kind of like you can have a single-core computer,
but move your mouse and watch YouTube at the same time.
So it's a similar thing that happens on a single pthread. We use coroutines with a cooperative scheduling algorithm, so that every coroutine yield actually yields a future, which is a structure. I think this is what's missing in the spec: a structure, like literally a C struct, that you can add tags and integers on, so you can, you know, sort of schedule it at different points in terms of execution. But on each core you get concurrency. And so when you're writing code it's really freeing, because there's exactly one model of both parallelism and concurrency. And when you develop code with this paradigm, you only write code with a concurrent mindset. You understand that once you're in a single core, there will never be a data race from a pthread perspective; there will be concurrent access, but not parallelism. And so parallelism becomes a free variable, which means you can take the same code structure, which was made concurrent, and execute it on a 244-core machine, and it'll just sort of scale, because you programmed to the structure; you didn't program to the explicit parallelism.
Does that make sense?
I think so.
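(For the curious, a toy single-producer/single-consumer ring buffer, the kind of cross-core mailbox Alex is describing, might look roughly like the sketch below. It is an illustration, not Seastar's implementation.)

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// One producer core pushes, one consumer core pops; the only shared state is
// two atomic indices, so there are no locks and minimal cache-line traffic.
template <typename T, std::size_t N>
class SpscQueue {
    std::array<T, N> buf_{};
    alignas(64) std::atomic<std::size_t> head_{0};  // advanced by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool push(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head_.load(std::memory_order_acquire)) return false;  // full
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }
    std::optional<T> pop() {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return std::nullopt;  // empty
        T v = buf_[h];
        head_.store((h + 1) % N, std::memory_order_release);
        return v;
    }
};
```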
Are you using any libraries or anything
to help you with your coroutine stuff?
Because it seems like coroutines takes a lot of bootstrapping to get running.
Yeah, so we use Seastar as sort of the lowest-level library. And basically every time you do co_return or co_await or something like that... well, actually co_await doesn't generate a future, but when you do co_return, it'll generate a future.
And then the future has a scheduling point.
And the reason you need this is because you need additional metadata to understand the priority of IO, to understand the priority of CPU. Let's say you have a core that is communicating with a bunch of the other cores, because it's just receiving most of the network requests and it needs to route them. You need to understand how much time of the scheduler, of the cooperative scheduler, you want to spend sending messages to the other cores versus doing useful work on this particular core. And that's fundamentally, I think, what is missing in coroutines.
I think coroutines is a really low-level primitive
that, you know, it's really all about the state machine bootstrapping.
But when you're building an application, you need more.
You need a scheduler to allow you to prioritize
different kinds of work at different points of execution.
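(For a flavor of what this looks like in code, here is a rough Seastar-style coroutine sketch, assuming Seastar's seastar::future plus its C++20 coroutine support. It is simplified and not taken from Red Panda.)

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>

using namespace std::chrono_literals;

// co_return produces a seastar::future<int>; while we are suspended, the
// reactor on this core runs other tasks, giving concurrency without parallelism.
seastar::future<int> slow_answer() {
    co_await seastar::sleep(10ms);   // suspension point: yields to the scheduler
    co_return 42;
}

seastar::future<> use_it() {
    int v = co_await slow_answer();  // no pthread data race: same pinned core
    (void)v;                         // ... do something with v ...
    co_return;
}
```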
Very cool.
What else do we want to go over?
We're a little bit low on time.
Yeah, I was just looking up Seastar. So that's S-E-A-S-T-A-R.
Correct, seastar.io.
Okay.
Because, you know, whenever coroutines comes up on CppCast,
there's always this question of, like,
oh, well, you know, it has this bootstrapping.
And like you said, there's the stuff missing from, it feels like it's missing from the spec
because it's such a low level primitive. But are there any other C++20 features that you're
really taking advantage of besides coroutines? Yeah, I think polymorphic lambdas are really useful. So we wrote this basically type tree walker.
And so if you give the serialization framework a plain C struct with no constructors,
which means we can splat the struct into multiple fields.
And so we can get the arity of the particular struct,
the arity being the number of fields in the struct.
So let's say you have two integers and one string,
that would give you arity of three.
So we use polymorphic lambdas to basically take any type,
compute the arity, the number of fields,
and then do auto bracket ref, you know, like auto& [a, b, c], and sort of get access to individual fields, and then recurse on those individual fields, so that we do a DFS-type traversal on a type tree. So it's really neat, because you can give it a high-level type, let's say like a building, and then a building would have people, and people would have jobs, I'm just kind of going off the top of my head, and it would just recursively traverse that and give you the building serialized in a DFS traversal. And that is only using, in C++, you know, janky reflection via polymorphic lambdas and tuple splatting.
That's cool.
So that, and then, you know, lambdas, we use lambdas kind of super, super heavily.
Coroutines, I think those are the main things
that we're getting from C++20, I would say.
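(To give a concrete, and much simplified, flavor of that trick: the sketch below splats a plain aggregate with structured bindings and visits its fields with one generic lambda. Real arity detection, as in libraries like Boost.PFR, is more involved, so the arity is hard-coded here; the type and field names are hypothetical.)

```cpp
#include <iostream>
#include <string>

struct Person {
    int id;
    std::string name;
};

// Visit each field of a two-field aggregate with the same callable.
// A real serializer would compute the arity generically and recurse.
template <typename T, typename Fn>
void for_each_field(const T& value, Fn&& fn) {
    const auto& [a, b] = value;   // "tuple splat" via structured bindings
    fn(a);
    fn(b);
}

int main() {
    Person p{7, "alex"};
    for_each_field(p, [](const auto& field) {   // one generic (polymorphic) lambda for all field types
        std::cout << field << '\n';
    });
}
```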
I mean, honestly, coroutines are as useful as auto.
Like, you know, when auto first came out,
you kind of like don't really have an intuition
of when to use auto, whether you use auto with templates.
And if you do that, that kind of blows up your instruction cache because it'll instantiate,
basically it'll do the combinatorial explosion of all of the inner types plus the outer type
of the template parameter.
So anyways, so you kind of develop a sense for auto or when to use auto, like when it
makes sense for auto.
Coroutines are kind of like that.
It takes a little bit of experimenting for you to develop an intuition of when to use coroutines.
But once you get the hang of it, it's awesome because it's really such a useful primitive.
Especially in this concurrent world where parallelism is a free variable, it is really freeing to just write your code with coroutines and have, you know, sort of an explicit understanding of when exactly your code is going to get executed and run.
All right, so I've got one other architecture question, then. My understanding is that coroutines necessarily have a dynamic allocation
with them which can sometimes be optimized away.
In high-performance environments, people tend to avoid dynamic allocations.
But then I'm thinking about the fact that you have dedicated to each core its own chunk of memory,
so maybe dynamic allocations, when you need to do them, aren't such a big deal?
Yeah, and that's a big part of the allocator design, I think, that Avi wrote, the original Seastar author, is that when you
pre-allocate the memory, then memory access and memory
allocation is thread local, but it's already cached, right? So it's just
giving you a quick pointer assignment to this thing.
And your allocator doesn't have to have any locks around it or anything because
you know no one else can touch it, right?
Exactly. And it's locked memory and it's thread local.
And so there isn't any real kind of like bottlenecks there.
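(A toy sketch of the per-core, pre-allocated idea: each shard grabs its slice of memory up front and then hands out chunks with a bump pointer, with no locks because only that core's thread ever touches it. This is nothing like Seastar's real allocator, just the shape of the idea, and the sizes are arbitrary.)

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Toy bump allocator: one of these per core, carved out of memory that was
// reserved up front. No locks, because only the owning thread ever calls it.
class ShardArena {
    std::byte*  base_;
    std::size_t cap_;
    std::size_t used_ = 0;
public:
    explicit ShardArena(std::size_t bytes)
        : base_(static_cast<std::byte*>(std::malloc(bytes))), cap_(bytes) {
        if (!base_) throw std::bad_alloc{};
    }
    ~ShardArena() { std::free(base_); }

    void* allocate(std::size_t n, std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // align the bump pointer
        if (p + n > cap_) return nullptr;                    // arena exhausted
        used_ = p + n;
        return base_ + p;
    }
};

int main() {
    ShardArena arena(1 << 20);                // e.g. 1 MiB reserved for this shard
    auto* x = static_cast<int*>(arena.allocate(sizeof(int)));
    if (x) *x = 42;
}
```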
But also, let's kind of look at it holistically, just to wrap up this thought: our bottleneck is really IO. And so we're in the double-digit microsecond latencies; we're not in the nanosecond latencies. Most of the time, of course, nanoseconds add up, and like I mentioned, latency is the sum of all your bad decisions. But because we're in the microsecond scale, anywhere from, you know, low double digits to, I guess when you make an IO, like 100 microseconds, it is low latency, but it isn't HFT low latency. It's kind of, you know, right above that. It's really more like in the algorithmic trading latency space. So it's okay. I think for us, allocations still happen within nanoseconds. And so in the grand scheme of things, our compute is in the microsecond scale, you know, one, two, to double-digit microseconds, for most of the things that we do. So it hasn't been a problem for us.
So then, now, what's the end result? I know you started with this 34x in your initial experiment.
Yeah, fantastic. We just released a blog post with 400 hours of benchmarking. Thanks for asking.
And I almost didn't talk about it.
We spent 400 hours comparing the open source alternatives with Red Panda.
And the best result that we've gotten is 10,000% improvement over the state of the art.
So it does pay to rebuild from scratch in C++20. And what's funny is that we are literally a startup, like, we're a venture-backed startup, so it's almost the opposite of what people think when you're building a startup: one, you don't build infrastructure, and two, you definitely don't use C++20. But, you know, it sort of worked out really well for us, if you think about it. And, you know, we've had a lot of experience with C++, so it's kind of a natural choice for us.
But yeah, that's really, I think, the best result.
And yeah, I think there's, you know, there's really kind of this in-depth blog post that
I'll send it to you guys so you can post it on the podcast footnotes.
But definitely dig deep.
You know, I'm also available on Twitter if people want to reach out and they're like,
hey, this benchmark didn't make sense.
We spent literally hundreds of hours
testing, so I can probably answer very, very
detailed and pointed
questions. Well, I know our listeners who are
interested, the next question they're going to have is,
what is your business model? Because if I understand right,
Kafka's Apache
Foundation open source? Correct.
And Red Panda is
also available on GitHub.
So you can download it.
You don't have to pay us.
It's free to use.
It's BSL, which means as long as you don't compete with a hosted version of Red Panda. Like, Vectorized gets exclusive rights to hosting it, but only for the next four years. So it's like a four-year rolling window. So for today's release, we've got four years to be the exclusive ones that can host it.
That's a very interesting model.
Yeah. But in four years, the core becomes Apache 2. And so, other than that, people can just run it. They can use it as a Kafka drop-in replacement. It's on GitHub: all of the code, all of the C++ techniques and details. People can go and download it and play around with it and ask us questions. And so our monetization strategy is the cloud, largely.
And we have some enterprise features like disaster recovery,
you know, kind of like some enterprise security features,
multi-cloud deployments, you know, multi-region,
kind of like the things that it would take larger enterprises to feel comfortable
running, you know, sort of the heart, the guts of the system,
which is often this data pipeline.
But, yeah, it's also available under a BSL license,
which is the same license like CockroachDB and MariaDB
and a bunch of other database companies use.
Yeah, and so, to me, that felt like a good balance between building open source and then also building a company that can pay the bills. So that's the license that we settled on.
Okay, awesome. Thank you.
Yeah, well, it's been great having you on the show today, Alex. You mentioned you are on Twitter. Where can listeners find you online?
Yeah, my handle is emaxerrno, like literally the largest errno number you can get.
That's on Twitter; vectorized.io is on Twitter as well.
And that's really, I think, the best place.
We have a Slack at vectorized.io/slack
if you want to chat with me real time about these things.
But yeah, I'd love to, if anyone has any questions,
just feel free to ping me or GitHub, really.
File a ticket and be like, you know, this doesn't work.
I'm kidding.
You could be nice about it.
But let me know.
I think either of those channels are good for me.
Okay, great.
Thanks, Alex.
All right, guys.
Thank you.
Pleasure being here.
Thanks so much for listening in as we chat about C++.
We'd love to hear what you think of the podcast.
Please let us know if we're discussing the stuff you're interested in,
or if you have a suggestion for a topic, we'd love to hear about that too.
You can email all your thoughts to feedback at cppcast.com.
We'd also appreciate if you can like CppCast on Facebook and follow CppCast on Twitter.
You can also follow me at Rob W. Irving and Jason at Lefticus on Twitter.
We'd also like to thank all our patrons
who help support the show through Patreon.
If you'd like to support us on Patreon,
you can do so at patreon.com slash cppcast.
And of course, you can find all that info
and the show notes on the podcast website
at cppcast.com.
Theme music for this episode
was provided by podcastthemes.com.