The Infra Pod - Will Protobuf be the default for everything? Chat with Akshay from Buf
Episode Date: September 9, 2024
Ian and Tim sat down with Akshay Shah (CTO of Buf) to talk about his experience at Uber, where the company decided to go 'schema-first' to address the challenges of a microservices architecture and loosely structured data. Akshay also shares his 'spicy hot take' on the limitations of the big data and data science hype, and the need for more converged, end-to-end data platforms.
Transcript
Welcome back to yet another Infra Deep Dive.
As usual, Tim from Essence and Ian, let's go.
Man, I'm so excited today, Tim.
This is Ian Livingstone.
We are joined by a dear friend of mine and the current CTO at a company called Buf.
Akshay Shah, tell us a little about yourself, my friend.
Hey, I'm Akshay.
I'm the CTO at Buf right now.
And I'm an engineer.
I don't know.
I came to engineering via a long and winding road, but have spent most of my engineering career in startups, the most notable of which was Uber, which I joined when it was, I don't
know, a couple hundred engineers and grew to at least an order of magnitude more than that.
And I started over there in data engineering, working on this kind of metrics and anomaly
detection stuff, moved over to network infrastructure and service discovery and RPC,
and then ended up running the team that built out Uber's Go infrastructure.
Started my own company for a bit, went to Microsoft for a bit and worked on Azure,
and then ended up at Buf doing Protobuf infrastructure.
Amazing. And can you give us a little bit,
like a little 411 or the download? What is Buf?
What's Buf trying to do? Why does Buf exist? What's the mission at Buf?
Absolutely. We build Protobuf
infrastructure for other companies to buy as a packaged, kind of end-to-end piece of your
infrastructure. That means we build a Protobuf compiler and a command-line tool, so it replaces
protoc, plus a bunch of other stuff you might want: linting, formatting, breaking change detection,
interacting with RPC
services, using binary payloads, all that kind of stuff. We actually built an RPC framework called
Connect, which is part of the CNCF now. Our main commercial product to date has been a schema
registry, which basically brings all the benefits of a Bazel-style monorepo for protobuf schemas to your not-Bazel, not-monorepo
mesh of services. And recently, we finally launched this message queue that we've been
working on called Bufstream, which brings a lot of the same kind of protobuf-first benefits
to streaming data. What is it about Buf's mission that really spoke to you?
Was there some experience or something you had that you were like, ah, this company solved
this problem and I have to go join and help and figure this out too?
Like what got you excited and what got you going?
That's a great question.
You know, so pretty much as soon as I joined Uber, the company decided to go all in on
schema-driven development.
And it was shaped much like most companies are today. I joined right after New Year's,
so the very beginning of 2015. And that was a pretty frothy, enthusiastic time for microservices.
But directionally, I don't think much has changed. So you have a bunch of developers, everybody's shipping code from their own repository,
like their own Git repository, using usually their choice of language and framework.
And then the idea is that all these services kind of talk to each other over the network, and they pass messages over some queue. And we want to make all of that schema first.
And the idea there is you get this layer of safety
and policy control over all of your systems.
So everybody knows what are the inputs and outputs
of these functions we're calling over the network,
exactly what is inside of this Kafka topic.
It's not just some like garbage pile of JSON.
And that all should be like self-documenting,
self-enforcing, and then you get some efficiency benefits
out the top of it.
In practice, what you end up with is just the system
that's designed in individual tiny pieces
that in theory are like nicely factored layers,
but in practice is just a mess.
Like somebody needs to go and spend
what ended up being many, many, many millions of dollars
in headcount and infrastructure
to actually assemble this stuff
into a platform that works end-to-end.
And this goes from the most trivial problems
to these kind of gigantic data infrastructure problems.
On the trivial side, anyone you talk to about Thrift or Protobuf or JSON schema, they'll
tell you that one of the benefits is that you can identify ahead of time using the schema
when a given change is going to break backward compatibility.
So like, oh, I can look at the pull request and tell you, like, you shouldn't do this. This is
going to break all your existing callers, or it's going to break all the existing consumers of this
data. And in theory, that's true. But then if you dig one step deeper, you're like, okay,
how do we do this? You're like, oh, yeah, you just read the PR very, very carefully and think hard about it.
Like this is a garbage answer.
There are no tools to do this
and you have to build the tools.
And to build the tools,
sometimes you have to build a parser
or a compiler for the schema language
because the existing one just doesn't expose
the APIs needed to do this.
And so we kind of carefully and laboriously
solve this problem and all of the follow-on problems.
How do you get the schemas from one repo to another?
What does dependency management look like?
How do you package them up and move them around?
How do you access the schema at runtime
if I want to do some dynamic schema-driven thing?
How do I attach extra policy information onto the schema?
Like, who's allowed to call this API?
Or is this
fit for public consumption? Or if I'm an engineer debugging this topic, am I allowed to see this
information? Or is it privileged and sensitive and cannot exit production onto my laptop?
All of that was so laborious to do. And it's funded the way all internal infra was funded, which is you get it
to like just good enough. And then all these expensive engineers get allocated onto some
problem that's like the new priority of the year. And I really wanted to come and solve that problem
once and for all for everyone. And I wanted a cohesive platform
that forevermore,
when I got tapped as the Protobuf person
or the RPC person or the Kafka person,
I could just point people and say,
my answer is that you should buy or build exactly that.
That is the way to do it end-to-end.
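A minimal sketch of the kind of change that schema-aware tooling catches ahead of time; the message and field names below are hypothetical, and the command in the comment is today's public buf CLI invocation for this check:

```protobuf
// Hypothetical schema, for illustration only.
syntax = "proto3";

package rides.v1;

message Trip {
  string trip_id = 1;

  // Renaming this field, changing its type, or reusing its field number is
  // exactly the kind of edit that silently breaks existing callers and stored
  // data. A schema-aware check such as
  //   buf breaking --against '.git#branch=main'
  // flags it on the pull request instead of relying on careful review.
  int64 fare_cents = 2;
}
```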
So one of the things you said at the beginning
of your explanation was,
as soon as you got to Uber, they decided to go schema-first.
I think to the uninformed, anyone who's operated an at-scale data system
that is real-time or operational, right?
Uber's an operational data use case because you're trying to price rides
and do all this other stuff in real time as the world changes around it.
What were the pain points that led us to, oh, we've got to be schema-first?
This problem actually extended far beyond the operational side.
It went into the analytics side of things, too.
That's actually the angle that I joined the company focused on
because I joined the data engineering team.
You have to rewind the clock.
It's now the very, very beginning of
2015. And what is the overall engineering zeitgeist? You know, if you went back to the
Wayback Machine and sampled the front page of Hacker News, what would you be seeing at that
time? And my recollection of the world at that time was that we were still on the upswing of everyone's enthusiasm about AP systems instead of CP.
We're on the upswing of Cassandra-style key value stores.
We're on the upswing of interest in CRDTs and exposing consistency constraints to application
developers.
We're very definitely on the upswing of interest in microservices
and having a company build a product as a bunch of small binaries that communicate with each other
in a pretty chatty way over the network. Kubernetes did not yet really exist the way it does today.
The closest was Mesos, which was roughly being commercialized by Mesosphere, but the primitives we depend on
today just didn't exist. And so what did Uber have? Uber had a pretty large pool of hosts that
we were physically racking in data centers and managing with Puppet. And what the company would
do is it would ask every team that wanted to ship a new service
to go add a little bit of data to the puppet run to define what the service was, where
it was going to run, what port it was going to be available on.
Everything was RESTful-ish.
And then that focus on loosely structured JSON kind of came out of a development culture that used a lot of
Node and Python.
So that was a pretty comfortable world to live in.
And then that data would go into often Kafka or some not Kafka system, because Kafka was
also pretty new at the time.
It would pop out the other end, and it would be processed often in some sort of complicated ad hoc way that depended on treating all the data just as a dictionary full of stuff.
And this led to all the problems that you might expect.
It's incredibly error-prone.
It's very, very difficult to reason about the blast radius of a change.
So you're a hapless developer who's trying to change something, and you just can't really tell what's going to break.
It was really difficult to reason about consistency and correctness over time because it was very difficult to understand when and how might somebody access old data that's using a schema that's implicit in application code that's from a year ago.
And this just led to this constant stream of small outages and brownouts and fire drills
that was really difficult to cope with.
It also in general made it very difficult for teams to communicate with each other.
It was a constant efficiency tax.
JSON is just expensive.
The team before I joined did a whole bake-off of options.
And so they were looking at the OpenAPI world, at RAML, at a variety of schema technologies.
And the one they settled on at the time was Thrift. Protobuf was open source, but gRPC didn't
exist yet in the outside world. And so Thrift was the only place you could go to
get a schema language, a serialization format, and an RPC framework out of the box. Very quickly,
Uber abandoned that RPC framework and built its own network transport. And then in a sad pivot,
the data ecosystem was like, oh God, this Thrift thing is never going to work for us. We're going to use Avro
instead. So in the worst of
all worlds, we settled on two of the possible
options and then spent a bunch of time building
elaborate interop between them.
It's so funny
to mention Thrift and Mesosphere.
It brings back all the old memories here.
I feel like we have been
running into this problem
forever.
But maybe the complexity of microservices,
maybe the pace of changes,
and now we're actually going into data, right?
This thing, you know, we have been,
I think the Hadoop ecosystem had Avro.
Things have been in sort of different ecosystems with their different formats and different tools.
And they're both growing complexity.
I feel like data is probably way more now than the microservice,
I think, just given the complexity here.
And what I've always been very intrigued by is,
I feel like since Buf is really talking about protobuf
as the standard format for all,
is there any challenge or driving force
of why people should use the same format everywhere?
Yeah.
Protobuf to drive everything.
Because I think the fundamental belief here is Protobuf is the center of everything.
And sort of Buf creates all the tooling and products necessary towards that.
But I want to kind of start there.
What is the driving motivation, or the things you're seeing, for people to start to say, like,
you know what, let's not
use five things or four different formats,
and we've got to adopt
one thing, and it's Protobuf.
Can you tell us more?
Yeah, for sure.
There are two kind of altitudes
that you can talk about a problem like that.
One of them is kind of an organizational
altitude of what are the benefits of having
one thing instead of two or three or four?
And what are the characteristics of the one schema language and data format that you might
want?
And then on a more granular in the weeds level, it's like, okay, we can compare and contrast
protobuf with Avro or with Thrift or with JSON.
At a high level for an organization, I think you want a couple of
things out of a schema language. Ideally, you want to be able to use the schema language in a couple
of different arenas pretty seamlessly. And if you're talking about your modern cloud native
organization, that means you want to model your business all the way from RPCs down to tables in your lakehouse using a common language.
And what that gives you is it gives you the ability to standardize on definitions of common business entities, at least in a bounded domain, and pass them along very easily from one system to another without a translation step that jumps
through a bunch of arbitrary application code. Once you have that, that is a really interesting
point of leverage over a bunch of problems. So now, because you're using one system throughout,
you can actually, from a technical and a process perspective, start imposing
some control on change safety, so backwards compatibility. You can start imposing some
best practices around how the schemas ought to be shaped and what sort of data you need to attach
to the schema to make it useful to consumers. You can also start attaching policy information
in a way that's really powerful.
So for example, you can say that a user has a profile
and the profile has an email address,
but this email address in the schema,
you can mark it as protected information.
And you can say that it's not just a string,
it's actually an email address. And in
our system, by entering this email address, the user has only consented to communication
about like billing related events. You may not use this for marketing. And then you can build
infrastructure that enforces that all the way down. It says, look, when you show up to read this data,
or when you show up to make an RPC, you must inform the server or the queue what you're doing with the data.
Are you here for marketing?
Are you here for billing?
Are you here to train an LLM?
And then you only receive the data that's greenlit for that purpose.
And that's the kind of capability that's really an end-to-end capability of the architecture, but it's really powerful for an organization because it lets you take those problems and push them down into your infrastructure instead of taking a TPM and handing them Google spreadsheets and being like, please go victimize individual engineers until the checklist has been checked.
You tell me how many quarters it's going to take to harass everybody into installing whatever bespoke libraries we've done for this and updating them and whatever.
Nobody likes that. It's expensive. It's inefficient. It's painful. It has a bunch of
holes in it. You want to take that and you want to make that a characteristic of your data platform.
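A rough sketch of what attaching that kind of policy to a schema could look like; the option and enum names below are invented for illustration and are not an actual Buf or Google API:

```protobuf
syntax = "proto3";

package user.v1;

import "google/protobuf/descriptor.proto";

// Hypothetical set of purposes a user can consent to.
enum DataUse {
  DATA_USE_UNSPECIFIED = 0;
  DATA_USE_BILLING = 1;
  DATA_USE_MARKETING = 2;
}

// Hypothetical custom option carrying the consented purposes for a field.
extend google.protobuf.FieldOptions {
  repeated DataUse consented_uses = 50001;
}

message Profile {
  // Marked as usable for billing communication only. The RPC framework,
  // message queue, or query layer can then require callers to declare their
  // purpose and hand back only the fields greenlit for it.
  string email = 1 [(consented_uses) = DATA_USE_BILLING];
}
```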
And this, I think, is where Protobuf shines. It's paired often with gRPC, which is available in a
million languages and is kind of a widely used standard, right? So if you're doing microservices
and you have schemas today, chances are you're doing gRPC. You can then take those Protobuf schemas, and
this huge, relatively quiet group of companies uses Protobuf all the way from gRPC down into
Kafka. Historically, one of the misconceptions about Protobuf is that you must generate code
ahead of time and you cannot use Protobuf the way you use Avro with a schema
registry and a bunch of dynamic message work. You can, it's just that there wasn't a schema
registry available to you. Now there is. And once you get out of Kafka and you're over in batch land,
like nobody's really doing Avro anyways, right? Everything's in Parquet. The Parquet files are
self-describing. Avro was just an incidental detail of your row format along the way.
Easy peasy, right?
Yeah.
So I guess I think you're talking about how, when we all consolidate,
there are just so many more things you can actually do,
to not just prevent things, but to power all the policies.
That's really interesting.
I'm actually very curious, because when I think of protobuf and Buf,
I think of gRPC,
I think of microservice APIs,
all that stuff up front as well,
because that's where the world
has largely adopted this,
in that space.
And now you're in
the data business now, with
Bufstream.
I'm actually really curious.
To me,
I think Bufstream
is a really bold take
because it's not just like you're adding
protobuf support to existing data stuff.
You're actually like, you know what?
Use us as a Kafka replacement, right?
That's kind of how I read it.
I guess this is the question.
Why did you choose this to be the first,
not the first entry of a data product,
but almost like the first major product
you want to go into
the data side with?
Because I can imagine you also take the schema registry for Kafka, for example.
We already do that.
We shipped that early last year.
We do that with a bunch of customers.
Got it.
So then why Bufstream?
I guess probably the easier question:
Why did you want to actually build a Kafka replacement?
Because I feel like that is more just the benefits you talked about, maybe.
Because with Kafka replacements, there are quite a lot of different things you can go after here.
So maybe talk about motivation here.
Yeah.
Like a lot of startups, this stuff comes from early and pretty deep engagements with our existing customers and with a couple of prospects.
And all of those companies, which for the most part, the ones for this product,
they're quite large and sophisticated. They would like to use Protobuf for everything,
from their RPC APIs down through their message queues, their stream processing.
And the only place they would like to get out of Protobuf is when things become parquet.
We started working with them early last year,
so the kind of beginning of 2023,
when we were building the Buf Schema Registry support
for the Kafka registry protocol.
We built that, and we started talking to them.
We were like, okay, did we fix it?
And their basic answer was not quite. No, we still have a ton of problems. And the basic problem they
had was kind of what we've articulated already. They have an end to end problem and they don't
want a collection of individual layered solutions that they then need to go and ensure that every code base layers
correctly. So for Kafka in particular, it's where we started because in a modern data architecture,
that's typically where you're making the transition from your online transaction processing world into your data engineering, offline stream processing,
kind of semi-real-time but not hard real-time work. And that's where you're crossing the
boundary into analytic data use cases. The basic problem that our customers have is that
the Kafka team, like the streaming data team at their company, they view their job as
accurately schlepping the bytes around. And so from the Kafka team's perspective,
they run Apache Kafka or they help to manage the confluent install or whatever.
And their job is to get the bytes that they're handed by the producer and give them to the
consumer. And that's kind of the
end of it. If anything about the bytes is wrong or the bytes were not supposed to be there to
begin with, that's a conversation that the consumer needs to take up with the producer.
And now you're in this like whodunit murder mystery of like, there was a NaN in the floating
point numbers. And now some exec is looking at like a revenue dashboard that just says NaN.
And everyone's scrambling to figure out
where this thing came from,
which stream processing job along the way introduced it,
like whose fault is this basically?
This is just not really workable.
It's not what the business wants out of the data platform.
What the business wants is they want some guarantee
that yes, the bytes are making their way around
correctly, but more importantly, that the business entities being modeled by these topics are moving
around correctly, that consumers can rely on getting valid data out of this system as a platform
guarantee, not as a best effort, go talk to the producer guarantee.
And then furthermore, that this system has built-in hooks to enable the kind of data
governance that they're increasingly concerned about now. And so we started working on this
with a gateway, right? You look at that problem and you kind of smell it and you're like,
ah, this feels like an API gateway. That's how I would solve this over
in networking land. I would slap a sidecar next to your process. I would inspect the data as it comes
in and out. I would redact it. I would load balance it. I'd have some control plane for policy data.
And that's how we would fix this. And that's roughly how Envoy works. It's kind of a
commoditized architecture. And you look over at Kafka and you're like, there's not really anything
like this. This smells like a similar problem.
Let's try this out.
And so we built a gateway,
which does all that stuff for Kafka.
And of course it has to speak the Kafka protocol,
which is particularly onerous to implement.
And it mostly works.
The problem is that it's really irritatingly complicated.
And there are key pieces of this that you just cannot do as a proxy.
You have to own the data layer.
And the most important thing, actually, let's say the two key things that we were unable
to do in a satisfying way with a proxy were number one, making all of these correctness
things a guarantee and not best effort.
What we were telling our early
prospects to do is to configure firewall rules to make sure that only the proxies can talk to
the brokers. That's just very laborious and painful. From a functionality perspective,
the biggest thing that we couldn't do is we couldn't control the format of the data on disk or in S3. And really what we want is we want the native
format of your Kafka topics to be parquet files. That lets you get huge efficiency benefits and
simplicity benefits. And so we looked at that and we kind of looked at this gateway product and we
said, like, this is cool, but we can do dramatically better if we own
the storage layer. We've already implemented the whole protocol,
so we might as well own the storage layer too. So that's what we did.
Can you explain why? Like, you said so much that I'm like, okay, we need to step back and have like
a little explanation. But could you explain why Parquet is the right format for Kafka?
What are the efficiencies that you're seeing?
Because you talked before about how you have this sort of transmission layer,
and it's a little odd for a transmission layer to take a block and turn it into Parquet.
But why is Parquet the right format?
If you had a Parquet native stream processor, why is Parquet the right solution? There are a couple of things that are really good about Parquet from a theoretical
perspective, and then there's a practical angle. And the most important practical angle is that
maybe this is getting into the kind of spicy hot takes world, but in my view, really elaborate
stream processing architectures,
they're kind of 2015 vintage.
Like, oh, we're really excited.
We're going to do Storm and Heron and Samza and Spark
streaming.
And we're going to get really bent out of shape
about whether your streaming is micro-batching or truly
streaming.
My sense is that, for the most part,
this has gone the way of eventually consistent
transactional databases. The programming
model is too complicated for most use cases. And instead, we want to take all that stuff and we
want to do most of it in a near real-time batch kind of way. Or at least that's the programming
model you want to present people. SQL. Over in that world, Parquet is the lingua franca of data. It's like CSV for
business data. If you want to deal with Iceberg or Delta Lake or Hudi or DuckDB or Arrow,
Parquet is the bedrock of all of that. And there's a ton of complexity in systems like Hudi,
but really it's just a bunch of Parquet files with a really simple set of manifest files on top of it that just tell you what data
is in which Parquet file.
And so for topics that have a schema, which is most of the ones that matter, if you're
natively storing Kafka data as Parquet, it's really only a hop, skip, and a jump to also
materialize the manifest files that you need.
And now you go all the way back to Jay Kreps' original blog post on Kafka.
And he made a big deal at the time about stream table duality.
Jay Kreps was talking about that, and then a couple of years later,
Martin Kleppmann gives the talk about turning the database inside out.
And at the time, I think everyone really interpreted that to mean, oh my God, we've got to do
ksqlDB.
We've got to have Flink support SQL as a true stream processing abstraction.
But I think where we've ended up actually is that we want all the application logic
to be SQL-like for the most part.
And we want to use Kafka kind of as a way of like directly
accessing the write ahead log of the database.
And if we can store your Kafka data as an iceberg table
or as Parquet files, we're there.
Writing to a Kafka topic becomes the same thing
as appending to your lake house table.
And your consumers can choose the access modality that
makes the most
sense for them. They can get record at a time via the Kafka API, or they can jack this S3 bucket
into their query engine as an external table, and they can run whatever SQL queries they'd like.
And if you look at what people do in practice, this is kind of the default architecture.
It even has a name.
This is the Medallion architecture.
And your Kafka topics are your tin or bronze tables.
And then you're going to run SQL or you're going to run some processing on the other
end to refine that data into lower volume, more like analysis-ready tables.
But all that stuff doesn't need to require
another distributed system to copy the data around. It doesn't need to require another
whole copy of the data, which is extortionately expensive. We can just do that at rest. We can
build that end-to-end in a way that's ready to use in one step in one system.
I think Parquet as a format has kind of won. It has some things that are not great about it.
And there are a lot of people trying to do better.
So like Facebook just did Nimble.
It's really interesting.
And I don't really have a strong opinion
about whether Nimble has nailed
all of the important problems with Parquet.
Yeah, I guess we should let the format wars begin.
Yeah, I mean, we're already fighting over in lake house formats.
And I guess we're gonna like push that fight one layer down to
one more standard to rule them all.
So I want to move on to something we call the spicy future. Spicy Futures. As it's very simply understood,
we want you to tell us what you believe
that most people don't believe yet.
And I think you already have something.
So what is the spicy hot take
of the data engineer world that you have?
I think all of this was a mistake.
Like the idea that really kicked off a lot of our focus on big data. We've got Jay Kreps
and Martin Kleppmann and DJ Patil. And we're all going to say that lurking in this mass of data
are these business changing insights. And if we could only build you the incredibly modular,
expensive, operationally complex stack, and if you would
only staff up a whole new function of people to operate that thing for you, and then if you would
only staff a team of data engineers to get the low-rent analysis off their plate, we could have
this aristocracy of data scientists show up and just start delivering game-changing insights.
I mean, every company would have their people-you-may-know
LinkedIn feature equivalent,
and they'd be shipping them like once a month.
And that just has not panned out.
Company after company has invested in this stack,
and what they've gotten at the end, for the most part,
is an approach that works for well-understood
business-critical problems. So at Uber, that is surge pricing, right? That is the dominant use
for a bunch of this data. And it has been business-critical from day one. Everybody knew
that. It wasn't some new problem that emerged and was discovered in the data. And the other thing that companies can achieve
is they can achieve an easier-to-use,
more user-friendly analysis pipeline
that doesn't require calling your friendly business analyst
to run some crazy SQL Teradata monster for you.
For the most part, people are not churning out
these game-changing insights
on a regular and predictable
basis. And I think that the consequences of that are pretty far-reaching. To me, it means that
as infra people, the idea that you can sell an incredibly expensive, piecemeal, difficult-to-operate
data platform that needs a dedicated crew of people to manage
and integrate. And in Kafka's case, you're like, oh, I need a dedicated crew of people that just
deals with partition rebalancing and operations. And that the way you sell that is the future value
of this game-changing analysis, that's dead. That money is not available anymore. Increasingly,
I suspect companies are going to be
unwilling to bite this off. And so that means from an infra perspective, we need to refocus
on delivering end-to-end platforms that bake in best practices into the technology and not into
a complicated user guide where you're buying data engineering, the definitive guide
from O'Reilly, and it makes that big thump when you drop it on the desk.
So I think a lot of our data infra today needs to pivot.
And I think this is the place where the modern data stack needs the equivalent of the
Hortonworks and the Cloudera of 2010.
I mean, this is something I've talked about a lot.
It's like, I think we're in the cloud consolidation phase,
right, where we went through this like crazy thing
over the last 15 years, since 2010, right?
And there's this, this is happening in AI right now,
but like, you know, in cloud and data, cloud and data, we had this massive Cambrian explosion of all these technologies and all these companies.
We ended up with this crazy polyglot, weirdo architecture.
It's like we picked different-shaped Lego bricks, and none of them actually fit together.
It's like we went to every universe and were like, what's your Lego style?
And we just took them all and put them in a stack and then tried to build
Lego with it.
This is what cloud architecture
feels like today.
Like the cloud native
ecosystem is still,
you know,
a thousand plus
different components
and they all work differently.
It's all weird.
And so I like
finally agree
with what you're saying.
I guess the question is,
what do you think causes,
like there's all these
like macroeconomic drivers,
right?
Like interest rates
are at 5% and, you know, capital's not free and we're kind of in this like tech recession.
I generally agree with what you're saying.
Now, how do you think this consolidation occurs, and what do you think drives the consolidation?
It's a great question.
So the first thing I would say is, this isn't limited to the cloud.
Because these same architectures are being used on-prem. Uber,
actually, I think they just published this big blog post on this system called Odin.
And really what Odin is, is it's trying to bring some consistent way to manage data gravity and
operations for disks full of data that you can't just, you know, yeet into the cloud and never
think about again. And so this same problem exists for on-prem workloads.
My sense here is this is partly an economic situation where you're right.
Budgets are down, the cloud is really expensive, interest rates are up.
But this is partly also an acknowledgement that these systems are just very, very, very
complicated to run and operate.
And there are only so many companies that want to fund an expansive data engineering team
to enable their analysts. And their analyst needs tend to be relatively straightforward.
You know, really, I want something that speaks SQL that I can jack into my dashboarding
and query engine of choice.
At one point, that would have been Tableau or Looker.
I think that world has expanded a lot since then.
But it's hard to justify millions in headcount
on an ongoing basis to support that use case.
It doesn't pass the smell test.
I mean, that's a super salient point,
is just the headcount expense to manage these systems.
And the fact that, like, a lot of the complexity of these systems is
due to the fact that they're funny-shaped Lego bricks
that were never thought of as being run together, right?
It's like the glue is where all the work is
and that glue is very expensive.
So I'm curious, like, what do you think the future of a data org looks like or an
engineering organization looks like? What is the future of these organizations? Because if we're
starting at the point is like, we're spending millions and millions of dollars in headcount
on things that we don't necessarily need to if there was consolidation, like, what's the future
organization look like? Because we got here through the broad concept of platform teams and feature teams, right?
And that's how we ended up in this world where you have the Kafka team that just runs Kafka and
doesn't care about bytes, because that's the way we drew their boundary. Do you think there's a
fundamental engineering culture change that has to occur? How do you think this all changes?
I think my view of engineering culture is that a lot of it ends up shaped by the tools
that we use and that we invest ourselves and our identities and our careers in.
And so the best way to change the culture of how teams interact with each other is to change the
technology that they use. It redraws the boundaries, it redraws responsibilities,
and it changes the place where, like, cost and value sit. So, in my ideal world, you have a unified stack that moves data all the way from ephemeral network data down to, like,
kind of historical reporting data in your lakehouse.
And again, this is my perspective, right?
That we want one format.
That's the lingua franca there.
Not that you have to use that, but there's a clearly paved path to use one thing throughout. And then we stop thinking of Kafka as a
particular message queue and more as a convenient protocol for streaming updates. And that protocol
is a commodity. Nobody today thinks of HTTP as a feature of the Apache web server. And you're like,
oh, if this thing has to speak HTTP,
like we got to put Apache in front of it.
Everything speaks HTTP.
That's just the way we hook systems together for request response workloads.
Similarly, the Kafka protocol
should be the way we hook things together
when we want record at a time streaming.
Everything else, right,
is just a question of convenience.
I would love to get to a world where you write data into a system with Kafka, but that same
system accepts RPC or REST calls for relatively simple use cases. Ideally, like all the workloads
that once upon a time we did with HBase tables, that is right
there for the asking.
We can support all of that.
With some squinting and maybe some cooperation from the object storage vendors, we could
do reasonably low-latency, MongoDB-like access on top of the same data set at rest.
I would love something that's much more converged
that doesn't necessarily try to bridge transaction processing and analytics,
but over in the use cases where you want throughput over latency,
we can offer you your data in a whole bunch of different formats
and via a bunch of different protocols and APIs.
You know, this is actually fascinating.
Our last episode was talking about HTAP, which also has another flavor of combining transactional
and analytics as well.
And you're bringing a totally different angle to it, which is super interesting.
I think...
HTAP is the holy grail.
Like, if we could get there, that would be amazing.
I feel like HTAP with one single format to rule
them all might be heaven
for some sort of data
engineer world out there.
Everyone listening to this podcast from inside Google
is just giggling to themselves.
More or less, my understanding is this is basically how it works.
You have some protobufs and you're just
flinging protobufs everywhere.
Under the hood, everything is protobuf.
Their equivalent of Parquet is just
shredded up protobuf records. That's the Dremel
paper. The SQL-type
database takes and returns
protobuf records. It's the kind
of world you can get to quickly
if you have an
internal ecosystem that has
really, really lavish infrastructure funding
and has
buy-in from the top to really focus
on uniformity, no matter what the cost.
And so I think my last question, really: we haven't gotten to this future state that we're
talking about here.
We probably can't talk about all the single steps it requires, but what is one major
thing that you guys are working on that's required to make the push to have the portable format be
more widely adopted? What is, like, the first major step you guys are taking, to say, hey,
we can get better, this is something we're working on to get it more widely adopted?
The first thing that we're doing is we're shipping native Parquet support in Bufstream.
And I think that by itself is enough to get rid of so much extra plumbing and so much extra expense
from a lot of data pipelines that I think of it as a proof of concept. It's the one thing that we
can point to and say, look, if you were willing to standardize,
I can give you this not only with lower operational overhead and more simplicity, but with lower
hard costs too.
That is also, I think, foundational. That's the bridge between your stream processing
and your kind of analytic estate over in Databricks or Snowflake or BigLake or
wherever you put your data. And once those get bridged, now we can start moving policy information
back and forth too. And I think that'll be the place where we can show people the power of a
unified platform to solve business problems for you, right?
To solve problems of compliance, enforcement, GDPR, CCPA.
How do you govern whether your LLMs are learning on the wrong data?
And give you a firm footing to tackle all those problems.
Well, awesome.
I guess if people want to learn more about you or Bufstream or Buf, where should we go?
Just go to buf.build. That's the homepage.
And right up at the top, there's an announcement banner for Bufstream.
You can check out the blog post. You can check out the cost deep dive.
You can reach out to us and we can kind of get you slated for a POC with us.
Amazing. Thanks so much. It was so informative.
When I saw Bufstream launch,
I was like,
I need to know more.
Now I get it.
Now I understand.
I appreciate it enough.
It was wonderful
to talk to both of you.
I appreciate it.