Orchestrate all the Things - The Ring Zero of real-time data processing: Redpanda scores $50M Series B funding to grow its streaming platform. Featuring CEO / Founder Alex Gallego
Episode Date: February 23, 2022
Maintaining compatibility with the de-facto market standard, while reimplementing and extending it. It's easier said than done, but that's what Redpanda is doing, and it seems to be working. Art...icle published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Maintaining compatibility with a de facto market standard
while reimplementing and extending it.
It's easier said than done, but that's what Red Panda is doing
and it seems to be working.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.
Thanks for having me, George.
By means of introduction, my name is Alex, and I've been working in stream processing for about 13 years prior to starting Red Panda Data and working on Red Panda the engine.
I sold the company in 2016 to Akamai. It was called Concord.
You can think of it as a competitor to Apache Flink or Apache Spark. And we built it on top
of Mesos. And yeah, that's sort of the executive summary.
Okay. Thank you. And then I guess the next thing would be to talk a little bit about Red Panda, the business, like your milestones and your headcount, some clients and use cases, if you'd like to refer to that.
And of course, the specifics around the funding.
Sure. We finished raising a $50 million round led by Google Ventures with participation from our
existing investors, Lightspeed and Haystack. And so I guess for those that are tuning in on the
call and the podcast, Red Panda really was started as kind of a natural evolution of what I thought
streaming should be. And in 2017, I guess the background of how we got started, so I can give
the context of growth and numbers and fundamentals here, is that I wanted to understand what was the
gap between what the hardware could do and what the software could do. And so that was really the genesis. And so in 2017, I kind of gave a talk
about, you know, just what is the gap between hardware and software. And so in 2019, I started
this company originally for experts by experts. Actually, it was designed for people that were like me, that were streaming experts that wanted something more
with the storage.
And I wanted it to be compatible
with the entire Kafka ecosystem.
And so I wrote Red Panda, the storage engine,
before I even decided to build the company.
And so the interesting piece,
to share the context of the numbers is that about 40% of our customers are experts of streaming engines. And so they're either Kafka
migrations or Pulsar migrations or some other classical technology like that. But what turns
out is that the way we built Red Panda, which was a single binary, has no external dependencies on Zookeeper
or Schema Registry or HTTP proxy, all that,
allowed us to onboard net new users to streaming,
people that had never even heard about what streaming is
because they didn't feel that sort of, I guess,
emotionally empowered to go and tackle
some of these problems.
And so anyways, with that context,
when we released Red Panda, it was at first closed source. And in late 2020, yeah, late 2020,
we made it source available, and it's under the same license as CockroachDB. We actually took
inspiration from them, which means for the first four years,
we're the only company that is allowed to host Red Panda as a service.
And after that, it becomes permissively licensed, Apache 2.
And then in 2021,
we really started probably with hundreds of customers.
By the middle of the year, we were in the thousands of customers,
several thousands of customers.
And we ended the year in hundreds of thousands of Red Panda clusters coming alive.
So it's felt like every nut and bolt rattled as we started to take off.
And then with people too, we started last year with a little bit less than 20 and we ended with 60. And so it's been incredible
growth over the last year. And I guess on the customer front to contextualize some of our
listeners is that we, you know, really to my surprise, at first I thought we were going to
just capture the Kafka experts and, you know, classical Kafka migrations, but
these new kind of users to Red Panda, net new users to streaming, really opened up a bunch of
doors. And so with that, we were able to send Red Panda into suborbital, which I find the coolest
use case. It is a Red Panda process sitting in a satellite somewhere in orbit. So IoT, we also help oil and gas pipelines to
measure the jitter. And so when you're shipping crude, the pipelines need to be monitored for
actual physical jitter of the pipeline in case it blows up physically. And then, but also the
classical, you know, fintech and gaming companies and ad tech and kind of web 2.0, like Snapchat is a,
Zenly from Snapchat is a public example and Alpaca, which is a startup in the FinTech space.
So anyway, so what we actually saw was sort of tremendous growth in just about every vertical.
So it makes it kind of hard to explain to just like, oh, it's only one particular vertical.
It's more about use case.
Hopefully that helps for contextualizing the audience.
Yeah, yeah, it does.
Thank you very much for sharing that.
And I was going to ask, you know,
obviously I looked around a little bit
and trying to get some of the basics,
let's say the key premises around Red Panda
before having this conversation.
And my initial impression was that it seems simple enough. I mean, the key premise like
Kafka compatible, similar kind of business model. I initially had the impression it was open core
and software as a service. I guess I was wrong about the former, so it's source
available, as you said. However, the rest I think I got right, and you can correct me
if I missed something. Would you say that's an accurate description? And if yes, I would
like to ask you to elaborate a little bit on, well, simple enough, but I'm sure it wasn't simple, it wasn't
easy to execute on this simple idea. So what were the technical underpinnings that
enabled you to build a faster Kafka while maintaining compatibility?
That was a really fun question. Remember that I was a principal engineer, and prior to this I was a CTO.
So when I started Red Panda, I really wanted to understand why in 2017 I saw such a large gap between hardware and software. And so I literally connected two edge computers
at Akamai with a cable back to back, just to make sure that there was nothing in between these two computers.
And I just wanted to measure, like understand what is the fundamental evolution of hardware
and did software actually take advantage of modern hardware? And so when you look at existing
solutions, they were built for decade-old hardware, where spinning disk was the fundamental
limitation of the previous platform. The new limitation is actually CPU coordination.
And to not get too much into the weeds,
sometimes you really get to reinvent the wheel when the road changes.
It's kind of the executive summary here,
where if you were to start from scratch, what would you do differently?
And what we did differently is that we built a single binary.
But to just address your point directly of how would I describe Red Panda,
I think it's simple, fast, reliable, and unified.
And the last part was actually, I would say it was a discovery.
Does that make sense?
Were you going to ask something else, George, or do you want me to continue?
No, please go ahead.
Okay.
So on the unified part,
so when we first started,
we took a fundamental approach,
which is called a thread-per-core architecture.
And the idea is,
how do we take advantage of very tall computers?
So if you look at the trend in computing,
disks are a thousand times faster than they were a decade ago.
And computers are 20 to 30 times taller, a.k.a. more cores.
And so with this new platform, you would do things fundamentally differently.
And so we started Red Panda in C++ with no dependencies, really trying to extract every ounce of performance out of the hardware.
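As a rough illustration of the thread-per-core idea mentioned here, the sketch below gives each core its own inbox and its own slice of state, and routes every partition to exactly one owner, so no two cores ever touch the same data and no locks are needed. This is only a conceptual model in Python: the names `Core`, `owner_of`, and `route` are invented for the example, the cores are simulated sequentially rather than pinned to real CPU cores, and none of it reflects Red Panda's actual C++ implementation.

```python
import zlib
from collections import deque


class Core:
    """One simulated core: it owns its inbox and its state exclusively."""

    def __init__(self, core_id: int):
        self.core_id = core_id
        self.inbox = deque()   # messages routed to this core
        self.log = {}          # partition name -> list of messages

    def run_once(self) -> None:
        # Drain the inbox. Only this core ever writes to self.log,
        # so no locking is required.
        while self.inbox:
            partition, message = self.inbox.popleft()
            self.log.setdefault(partition, []).append(message)


def owner_of(partition: str, n_cores: int) -> int:
    # Stable hash so a partition always maps to the same core.
    return zlib.crc32(partition.encode("utf-8")) % n_cores


def route(cores, partition: str, message) -> None:
    # Hand the message to the single core that owns this partition.
    cores[owner_of(partition, len(cores))].inbox.append((partition, message))


cores = [Core(i) for i in range(4)]
for i in range(10):
    route(cores, f"topic-{i % 3}", i)
for core in cores:
    core.run_once()
```

The design choice being illustrated is that coordination cost is paid once, at routing time, instead of on every access to shared state.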
And that worked. And we, you know, we onboarded like electric car manufacturers,
the largest CDN in the world, some of the largest banks.
And, but what we discovered, and this is, I think,
a slight difference that we really need to do a better job on our marketing
and on our website, is it's true that Red Panda is a streaming platform
that is simple, fast, reliable, aka doesn't lose data.
But the last part, unified, is different.
And what we learned is that in order for us to make a fundamental shift into how developers actually think about building applications,
we needed to give them compatibility with the way developers are building applications.
So meaning infinite data retention with things like Amazon S3. And so
I think those, I would say that the premise is simple. It's a simple, fast, reliable engine,
but unified lifts up the risk, sort of allows developers to build a new category of applications
that they couldn't have built before. And because, you know, for the developer, what it means to have unlimited data retention
is that they don't have to worry about disaster recovery, and, you know, they now have a backup.
They don't have to worry a priori about which other databases or downstream systems they
need to materialize. They simply push their data into Red Panda and we transparently tier it, and so it's relatively
cost-effective to store even petabytes of data. Hopefully that helps.
Yeah, thank you. I had
gotten the part about the re-implementing everything in C++ and well even that alone
I think would have made a difference but I think you did a nice job of adding to that, well, elaborating on why it wasn't just that, basically.
Which leads us to the next question.
And I was wondering, also earlier,
you mentioned, you described a few of your use cases.
So I was wondering if you can share with us
a little bit about what your use cases look like.
Basically, brownfield versus greenfield. The fact
that you have Kafka compatibility obviously helps. So I imagine a few of your use cases will be
replacing Kafka, but you also mentioned that you can do some things that were previously
not possible. So how many of your use cases are brownfield versus greenfield?
Yeah, let me give you color here. Let's start with the brownfield.
So by being Kafka API compatible, so I think that we have to give credit to Kafka and to Pulsar and
to RabbitMQ. And that's really this huge family of streaming systems
that came before Red Panda.
I consider really Red Panda,
the natural evolution, if you will,
if you were an expert in streaming
and you were like, okay,
well, what is the new bottleneck in computing?
And how do I design for that?
But what Kafka did is really not,
the Kafka broker was a fundamental piece
in building the new streaming infrastructure.
But what it did is it actually,
the most powerful thing that Kafka did
is it created an ecosystem.
And so it's the fact that it connects transparently
to Spark streaming and to Flink streaming
and to Materialize and to MongoDB and to ClickHouse
to the extent that when we launched,
for example, the partnership with MongoDB
or with Materialize or MemgraphDB and all of these other partners,
SingleStore, for example, it just worked.
And that is the experience that a lot of our brownfield folks get when we show up to a deal.
Let's say we get to influence a decision maker.
And so they always are worried. They're like, hey, I have
100,000 lines of code or 150,000 lines of code, right? Like a large body of work that connects
with Kafka in a particular way. Let's say they're doing real-time fraud detection, or they're doing
real-time oil and gas pipeline jittering, or they're building an Uber competitor that delivers food
like Mr. Yum. Well, actually Mr. Yum is not an Uber competitor, but they work in the restaurant industry.
Anyways.
And so what we found is that Kafka compatibility for us was really
fundamental.
And to the application developer, by the way, there's no code changes.
And that perhaps is maybe the most radical thing,
or maybe the most shocking thing that a developer experiences:
like, hey, it just works.
And it's not just works for their application.
It just works with entirety of the ecosystem,
whether you plug in TensorFlow.
When I tested TensorFlow, it took me 30 minutes
and I wrote a blog post in 15 because it just worked.
There's often no hero migration.
There's no hero story.
There's nothing.
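To make the no-code-changes claim concrete, here is a minimal sketch of what such a migration could look like from the application's side. The setting names `bootstrap.servers`, `client.id`, and `acks` are standard Kafka client configuration keys; the helper `client_config`, the client id, and both hostnames are hypothetical placeholders invented for this example.

```python
def client_config(bootstrap_servers: str) -> dict:
    """Build the settings an existing Kafka application would hand to its client."""
    return {
        "bootstrap.servers": bootstrap_servers,  # where the brokers live
        "client.id": "fraud-detector",           # hypothetical application name
        "acks": "all",                           # wait for full acknowledgement
    }


# Before: the application points at an existing Kafka cluster (made-up host).
kafka_cfg = client_config("kafka-1.internal.example:9092")

# After: same application, same code, pointed at a Kafka API compatible broker.
redpanda_cfg = client_config("redpanda-1.internal.example:9092")

# The only setting that differs between the two deployments.
changed = {key for key in kafka_cfg if kafka_cfg[key] != redpanda_cfg[key]}
print(changed)
```

Everything else in the application, including producers, consumers, and any connectors that speak the Kafka protocol, would stay as it is under this assumption.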
It just, it's simply a configuration change. And then you pick up and go. And so those are the brownfield. Those are existing
applications, like kind of classical Kafka uses of things like analytics or not necessarily
foreground use cases. Now, on the greenfield, on new use cases, there's really two quick categories here.
One is the way new applications are being built. So if you look at the way new databases are being
built, let's say Materialize is a good example, and MemgraphDB is another good example. I know
of three ML companies that are launching, that are in stealth mode so far. But where the industry is
heading is they want to leverage something like Red Panda or Kafka or some distributed ledger,
right? Distributed source of truth, engine of truth. And then they build on top. And so an
example is Materialize is a SQL engine on top of the log. And MemgraphDB is a graph database on top
of the log. So that's one particular example for the advanced use case.
And then the last one I'll mention on the green field is this company called Alpaca.
And so they trade everything from cryptocurrencies to derivatives to, you know, basically it's a fintech API.
And they're growing super rapidly. And the difference between that use case
and the brownfield is that Red Panda sits in the foreground.
That is every time you interact with their API
and their application,
every single message that goes through
the Alpaca trading framework,
it goes through Red Panda.
And what we give them is basically this huge compatibility with the ecosystem,
but we're so fast that it sort of unlocks new revenue streams for them.
So that hopefully could give you color as to how the classical migrations happen,
which it just works.
And then this new greenfield.
And I actually think we're still early in the greenfield because people are just trying to understand right now
what Red Panda is and how it can help them.
Okay.
Another question I had was about, well,
the focus, let's say, of your use cases
and the actual layer that you're targeting.
And what I mean by that is that I tried to look around a little bit
and try and get a sense of the capabilities let's say in your offerings and I got the impression
that it seems like the suggested use is more on the transport layer and not so much as a processing
or transformation layer and for example I've seen a blog post in which
it seems like the suggested use for that would be to use Red Panda in conjunction with Flink.
And I don't think I have seen a SQL interface, for example. So is my impression correct? And if yes,
is this an area you're aiming to expand to in the future?
So two things here.
One is that stream processing is a relatively rich kind of area.
It's similar to saying databases.
There's many flavors and many capabilities here. So let's break down stream processing into complex stream processing and simple transformations.
So for the simple transformations, let's say a simple transformation is you take a personal object and you want to mask all of the private and sensitive information like your social security number in the US.
And so often the problem that people face,
and this is how we're advancing the state of the art of streaming
for this one shot transformation,
not for complex processing,
is do you read your data from Red Panda
and then you ping pong it into a separate system
like Flink or Spark or whatever it is.
Do you transform it, you mask it,
that is you take the social security number and you replace it with XXXXX and then you save it
back into Red Panda. So we developed an in-broker processing engine, we're calling it the Wasm
engine, really, because at some point it'll take WebAssembly; right now it only takes JavaScript, but it'll take WebAssembly pretty soon. It is the ability to do these one-shot transformations
inside the broker itself, eliminating the data ping pong for the simple cases.
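The one-shot masking transformation described here is small enough to sketch. The example below is written in Python purely to show the shape of such a function; it is not the in-broker engine's API (which, per the conversation, takes JavaScript at this point), and the names `mask_record`, `transform_topic`, and the `ssn` field are assumptions made for illustration.

```python
SENSITIVE_FIELDS = ("ssn",)  # hypothetical field names to mask


def mask_record(record: dict) -> dict:
    """Return a copy of the record with sensitive fields replaced by XXXXX."""
    masked = dict(record)  # copy, so the input record is left untouched
    for field in SENSITIVE_FIELDS:
        if field in masked:
            masked[field] = "XXXXX"
    return masked


def transform_topic(source_records, fn):
    """Apply a stateless, per-record function to every record of a topic."""
    return [fn(record) for record in source_records]


raw_topic = [
    {"name": "Ada", "ssn": "123-45-6789"},
    {"name": "Grace", "ssn": "987-65-4321"},
]
masked_topic = transform_topic(raw_topic, mask_record)
```

Because the function is stateless and looks at one record at a time, it is the kind of work that can run next to the data rather than in a separate processing cluster, which is the data ping-pong the simple case is meant to avoid.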
And so that's really where I think that Red Panda shines.
Now on the complex stream processing side, whether you have SQL or you have another one of our partners
has this really beautiful Python experience.
They're called Deephaven,
and you can process and visualize
and use all of Python's power locally.
It's kind of an interactive notebook in my view.
And so there's just this huge ecosystem of matured and innovative
companies that are innovating at that level. And I feel that the ring zero, the source of truth
is not a solved problem. And that's really where we shine. And so for the time being, we hope to
stay in this space.
And so the alternative is that then we partner with all of these other database companies,
whether you're Mongo or SingleStore or Materialize or Deephaven and so on.
I think it builds a richer ecosystem.
I think having companies that are focused on specific layers yields a better product.
Okay.
Thanks for the elaboration.
And I was also wondering, I think you mentioned as well yourself earlier when you were referring
to some of your use cases.
I was wondering about your experience with people using Red Panda to support machine learning
use cases and actually real-time
machine learning.
Do you have any of those?
And do you see them often?
Do you think that's something that Red Panda is a good fit for?
Yeah, actually in fintech.
So what's interesting about a lot of the techniques that people have adapted into big data is that a lot of them have been in
use for a really long time in fintech, in particular in the HFT shops, but they were
always built before as these one-off solutions, right? That was the secret sauce of this particular
HFT shop and so on. I think the trend, the difference in machine learning now is that,
first of all, there are these fantastic engines that are being released, like, you know, TensorFlow and so on.
What is challenging about those is actually continuously retraining the model and being able to replay it.
So for machine learning, I actually think Red Panda is a really good fit.
And there are, like I mentioned, I know of one, two, three companies that are in stealth mode, three startups that are in stealth mode, all backed by fantastic investors, tier one investors that are about to come out of stealth, that are working on this problem itself.
And so there's, I think, multiple tiers to this problem.
But where Red Panda fits into this
storyline is not on the ML algorithms, right? That's TensorFlow, it's got that covered, and
Spark ML has got that covered, and there's, you know, Deep Java or whatever it's called, Deep4j, I
think it's called, this library. So there are these very special layers for the actual machine
learning algorithm. What Red Panda brings to the table is a scalable, effectively back
pressure valve that allows the machine learning algorithm to replay. So let's be specific with
one use case. Fraud detection is kind of this classical example for ML and real-time ML,
where let's say that you have a credit score application where you're giving, let's say, you know, men and women a credit card.
And, you know, you had a bias in your credit score application.
Now you want to go back and reprocess the entire history.
This is where Red Panda really shines: it allows your application to be able to reprocess the entire history of all of your events that
led into that decision without changing a single line of code on the application level.
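The replay scenario in this answer can be sketched with an in-memory append-only log. The example keeps every event at a fixed offset, then recomputes all decisions from offset 0 with a corrected scoring function, without touching the stored events. The class `EventLog`, both scoring functions, and the event fields are invented for illustration and are far simpler than a real broker or a real credit model.

```python
class EventLog:
    """Append-only log: events are never modified, only read back by offset."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the event just written

    def read_from(self, offset: int = 0):
        for i in range(offset, len(self._events)):
            yield i, self._events[i]

    def __len__(self) -> int:
        return len(self._events)


def score_v1(event: dict) -> int:
    # Deliberately biased toy model: penalises applicants by gender.
    penalty = 10 if event["gender"] == "f" else 0
    return event["income"] // 1000 - penalty


def score_v2(event: dict) -> int:
    # Corrected toy model: gender no longer influences the score.
    return event["income"] // 1000


def reprocess(log: EventLog, score) -> dict:
    """Replay the full history from offset 0 with the given scoring function."""
    return {offset: score(event) for offset, event in log.read_from(0)}


log = EventLog()
log.append({"applicant": "a1", "gender": "f", "income": 50000})
log.append({"applicant": "a2", "gender": "m", "income": 50000})

old_scores = reprocess(log, score_v1)
new_scores = reprocess(log, score_v2)
```

The point of the sketch is that the log, not the model, is the record of what happened: swapping `score_v1` for `score_v2` changes the derived decisions while the history stays exactly as it was written.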
And so really what Red Panda is doing is sort of creating a new engine of record
that allows these ML algorithms to reprocess the data, have access controls, have back
pressure, spill to disk in case that you get a ton of load.
It's really kind of a centerpiece of the architecture. And then you combine that with TensorFlow and maybe a serving layer. And now you have an entire serving category and a serving
product. So you can think of Red Panda as, you know, I like to call it ring zero. It's kind of
the very bottom layer to build a scalable real-time ML company.
Okay, thank you.
And by the way, you said you are in touch with a few companies working on real-time machine learning that are about to come out of stealth.
Well, if you want to send them my way, feel free.
I would love to talk to them.
I guess we're actually already a bit over time,
so probably a good time to wrap up. And to do that, I'd just like to ask you,
where do you think this domain is headed, basically, and how do you see your role in it?
I think that, you know, to me, Kafka, and the API, is a historical artifact in a positive
way. Developers bought into the ecosystem and they built millions of lines of code.
I think the future is a different API. I think the future is serverless. I think the future is a protocol that is less
heavyweight than the Kafka protocol. And so I think that Red Panda, the company,
is a company that can give people both A and B. A is compatibility with this hugely rich ecosystem
that is always going to be important. And B is because we're more tied to the market evolution
from batch to real time.
Today, it happens to be that Kafka is the best,
Kafka API is sort of the best way that we could do that.
But I think in the future, it'll be a different API
and it'll be a new API that is really designed
for the way modern applications are being built.
And so that's kind of how I see the story
arc for Red Panda, the company.
I hope you enjoyed the podcast. If you like my work,
you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.