The Infra Pod - Will Protobuf be the default for everything? Chat with Akshay from Buf
Episode Date: September 9, 2024
Ian and Tim sat down with Akshay Shah (CTO of Buf) to talk about his experience at Uber, where the company decided to go 'schema-first' to address the challenges of a microservices architecture and loosely structured data. Akshay also shares his 'spicy hot take' on the limitations of the big data and data science hype, and the need for more converged, end-to-end data platforms.
Transcript
Welcome back to yet another Infra Deep Dive.
As usual, Tim from Essence and Ian, let's go.
Man, I'm so excited today, Tim.
This is Ian Livingstone.
We are joined by a dear friend of mine and the current CTO at a company called Buf.
Akshay Shah, tell us a little about yourself, my friend.
Hey, I'm Akshay.
I'm the CTO at Buf right now.
And I'm an engineer.
I don't know.
I came to engineering via a long and winding road, but have spent most of my engineering career in startups, the most notable of which was Uber, which I joined when it was, I don't
know, a couple hundred engineers and grew to at least an order of magnitude more than that.
And I started over there in data engineering, working on this kind of metrics and anomaly
detection stuff, moved over to network infrastructure and service discovery and RPC,
and then ended up running the team that built out Uber's Go infrastructure.
Started my own company for a bit, went to Microsoft for a bit and worked on Azure,
and then ended up at Buf doing Protobuf infrastructure.
Amazing. And can you give us a little bit,
like a little 411 or the download? What is Buf?
What's Buf trying to do? Why does Buf exist? What's the mission at Buf?
Absolutely. We build Protobuf
infrastructure for other companies to buy as a packaged, kind of end-to-end piece of your
infrastructure. That means we build a Protobuf compiler and a command-line tool, so it replaces
protoc, plus a bunch of other stuff you might want: linting, formatting, breaking change detection,
interacting with RPC
services, using binary payloads, all that kind of stuff. We actually built an RPC framework called
Connect, which is part of the CNCF now. Our main commercial product to date has been a schema
registry, which basically brings all the benefits of a Bazel-style monorepo for protobuf schemas to your not-Bazel, not-monorepo
mesh of services. And recently, we finally launched this message queue that we've been
working on called Bufstream, which brings a lot of the same kind of protobuf-first benefits
to streaming data. What is it about Buf's mission that really spoke to you?
Was there some experience or something you had that you were like, ah, this company solved
this problem and I have to go join and help and figure this out too?
Like what got you excited and what got you going?
That's a great question.
You know, so pretty much as soon as I joined Uber, the company decided to go all in on
schema-driven development.
And it was shaped much like most companies are today. I joined right after New Year's,
so the very beginning of 2015. And that was a pretty frothy, enthusiastic time for microservices.
But directionally, I don't think much has changed. So you have a bunch of developers, everybody's shipping code from their own repository,
like their own Git repository, using usually their choice of language and framework.
And then the idea is that all these services kind of talk to each other over the network, and they pass messages over some queue. And we want to make all of that schema first.
And the idea there is you get this layer of safety
and policy control over all of your systems.
So everybody knows what are the inputs and outputs
of these functions we're calling over the network,
exactly what is inside of this Kafka topic.
It's not just some like garbage pile of JSON.
And that all should be like self-documenting,
self-enforcing, and then you get some efficiency benefits
out the top of it.
In practice, what you end up with is just the system
that's designed in individual tiny pieces
that in theory are like nicely factored layers,
but in practice is just a mess.
Like somebody needs to go and spend
what ended up being many, many, many millions of dollars
in headcount and infrastructure
to actually assemble this stuff
into a platform that works end-to-end.
And this goes from the most trivial problems
to these kind of gigantic data infrastructure problems.
On the trivial side, anyone you talk to about Thrift or Protobuf or JSON schema, they'll
tell you that one of the benefits is that you can identify ahead of time using the schema
when a given change is going to break backward compatibility.
So like, oh, I can look at the pull request and tell you, like, you shouldn't do this. This is
going to break all your existing callers, or it's going to break all the existing consumers of this
data. And in theory, that's true. But then if you dig one step deeper, you're like, okay,
how do we do this? You're like, oh, yeah, you just read the PR very, very carefully and think hard about it.
Like this is a garbage answer.
There are no tools to do this
and you have to build the tools.
And to build the tools,
sometimes you have to build a parser
or a compiler for the schema language
because the existing one just doesn't expose
the APIs needed to do this.
And so we kind of carefully and laboriously
solve this problem and all of the follow-on problems.
How do you get the schemas from one repo to another?
What does dependency management look like?
How do you package them up and move them around?
How do you access the schema at runtime
if I want to do some dynamic schema-driven thing?
How do I attach extra policy information onto the schema?
Like, who's allowed to call this API?
Or is this
fit for public consumption? Or if I'm an engineer debugging this topic, am I allowed to see this
information? Or is it privileged and sensitive and cannot exit production onto my laptop?
All of that was so laborious to do. And it's funded the way all internal infra was funded, which is you get it
to like just good enough. And then all these expensive engineers get allocated onto some
problem that's like the new priority of the year. And I really wanted to come and solve that problem
once and for all for everyone. And I wanted a cohesive platform
that forevermore,
when I got tapped as the Protobuf person
or the RPC person or the Kafka person,
I could just point people and say,
my answer is that you should buy or build exactly that.
That is the way to do it end-to-end.
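A minimal sketch of the kind of change that schema-aware tooling catches ahead of time; the message and field names below are hypothetical, and the command in the comment is today's public buf CLI invocation for this check:

```protobuf
// Hypothetical schema, for illustration only.
syntax = "proto3";

package rides.v1;

message Trip {
  string trip_id = 1;

  // Renaming this field, changing its type, or reusing its field number is
  // exactly the kind of edit that silently breaks existing callers and stored
  // data. A schema-aware check such as
  //   buf breaking --against '.git#branch=main'
  // flags it on the pull request instead of relying on careful review.
  int64 fare_cents = 2;
}
```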
So one of the things you said at the beginning
of your explanation was,
as soon as you got to Uber, they decided to go schema-first.
I think to the uninformed, anyone who's operated an at-scale data system
that is real-time or operational, right?
Uber's an operational data use case because you're trying to price rides
and do all this other stuff in real time as the world changes around it.
What were the pain points that led us to, oh, we've got to be schema-first?
This problem actually extended far beyond the operational side.
It went into the analytics side of things, too.
That's actually the angle that I joined the company focused on
because I joined the data engineering team.
You have to rewind the clock.
It's now the very, very beginning of
2015. And what is the overall engineering zeitgeist? You know, if you went back to the
Wayback Machine and sampled the front page of Hacker News, what would you be seeing at that
time? And my recollection of the world at that time was that we were still on the upswing of everyone's enthusiasm about AP systems instead of CP.
We're on the upswing of Cassandra-style key value stores.
We're on the upswing of interest in CRDTs and exposing consistency constraints to application
developers.
We're very definitely on the upswing of interest in microservices
and having a company build a product as a bunch of small binaries that communicate with each other
in a pretty chatty way over the network. Kubernetes did not yet really exist the way it does today.
The closest was Mesos, which was roughly being commercialized by Mesosphere, but the primitives we depend on
today just didn't exist. And so what did Uber have? Uber had a pretty large pool of hosts that
we were physically racking in data centers and managing with Puppet. And what the company would
do is it would ask every team that wanted to ship a new service
to go add a little bit of data to the puppet run to define what the service was, where
it was going to run, what port it was going to be available on.
Everything was RESTful-ish.
And then that focus on loosely structured JSON kind of came out of a development culture that used a lot of
Node and Python.
So that was a pretty comfortable world to live in.
And then that data would go into often Kafka or some not Kafka system, because Kafka was
also pretty new at the time.
It would pop out the other end, and it would be processed often in some sort of complicated ad hoc way that depended on treating all the data just as a dictionary full of stuff.
And this led to all the problems that you might expect.
It's incredibly error-prone.
It's very, very difficult to reason about the blast radius of a change.
So you're a hapless developer who's trying to change something, and you just can't really tell what's going to break.
It was really difficult to reason about consistency and correctness over time because it was very difficult to understand when and how might somebody access old data that's using a schema that's implicit in application code that's from a year ago.
And this just led to this constant stream of small outages and brownouts and fire drills
that was really difficult to cope with.
It also in general made it very difficult for teams to communicate with each other.
It was a constant efficiency tax.
JSON is just expensive.
The team before I joined did a whole bake-off of options.
And so they were looking at the OpenAPI world, at RAML, at a variety of schema technologies.
And the one they settled on at the time was Thrift. Protobuf was open source, but gRPC didn't
exist yet in the outside world. And so Thrift was the only place you could go to
get a schema language, a serialization format, and an RPC framework out of the box. Very quickly,
Uber abandoned that RPC framework and built its own network transport. And then in a sad pivot,
the data ecosystem was like, oh God, this Thrift thing is never going to work for us. We're going to use Avro
instead. So in the worst of
all worlds, we settled on two of the possible
options and then spent a bunch of time building
elaborate interop between them.
It's so funny
to mention Thrift and Mesosphere.
It brings back all the old memories here.
I feel like we have been
running into this problem
forever.
But maybe the complexity of microservices,
maybe the pace of changes,
and now we're actually going into data, right?
This thing, you know, we have been,
I think the Hadoop ecosystem had Avro.
Things have been in sort of different ecosystems with their different formats and different tools.
And they're both growing complexity.
I feel like data is probably way more now than the microservice,
I think, just given the complexity here.
And what I've always been very intrigued by is,
I feel like since Buf is really talking about protobuf
as the standard format for all,
is there any challenge or driving force
of why people should use the same format everywhere?
Yeah.
Protobuf to drive everything.
Because I think the fundamental belief here is Protobuf is the center of everything.
And sort of Buf creates all the tooling and products necessary towards that.
But I want to kind of start there.
What is the driving motivation, or the things you're seeing, for people to start to say, like,
you know what, let's not
use five things or four different formats,
and we've got to adopt
one thing, and it's Protobuf.
Can you tell us more?
Yeah, for sure.
There are two kind of altitudes
that you can talk about a problem like that.
One of them is kind of an organizational
altitude of what are the benefits of having
one thing instead of two or three or four?
And what are the characteristics of the one schema language and data format that you might
want?
And then on a more granular in the weeds level, it's like, okay, we can compare and contrast
protobuf with Avro or with Thrift or with JSON.
At a high level for an organization, I think you want a couple of
things out of a schema language. Ideally, you want to be able to use the schema language in a couple
of different arenas pretty seamlessly. And if you're talking about your modern cloud native
organization, that means you want to model your business all the way from RPCs down to tables in your lakehouse using a common language.
And what that gives you is it gives you the ability to standardize on definitions of common business entities, at least in a bounded domain, and pass them along very easily from one system to another without a translation step that jumps
through a bunch of arbitrary application code. Once you have that, that is a really interesting
point of leverage over a bunch of problems. So now, because you're using one system throughout,
you can actually, from a technical and a process perspective, start imposing
some control on change safety, so backwards compatibility. You can start imposing some
best practices around how the schemas ought to be shaped and what sort of data you need to attach
to the schema to make it useful to consumers. You can also start attaching policy information
in a way that's really powerful.
So for example, you can say that a user has a profile
and the profile has an email address,
but this email address in the schema,
you can mark it as protected information.
And you can say that it's not just a string,
it's actually an email address. And in
our system, by entering this email address, the user has only consented to communication
about like billing related events. You may not use this for marketing. And then you can build
infrastructure that enforces that all the way down. It says, look, when you show up to read this data,
or when you show up to make an RPC, you must inform the server or the queue what you're doing with the data.
Are you here for marketing?
Are you here for billing?
Are you here to train an LLM?
And then you only receive the data that's greenlit for that purpose.
And that's the kind of capability that's really an end-to-end capability of the architecture, but it's really powerful for an organization because it lets you take those problems and push them down into your infrastructure instead of taking a TPM and handing them Google spreadsheets and being like, please go victimize individual engineers until the checklist has been checked.
You tell me how many quarters it's going to take to harass everybody into installing whatever bespoke libraries we've done for this and updating them and whatever.
Nobody likes that. It's expensive. It's inefficient. It's painful. It has a bunch of
holes in it. You want to take that and you want to make that a characteristic of your data platform.
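A rough sketch of what attaching that kind of policy to a schema could look like; the option and enum names below are invented for illustration and are not an actual Buf or Google API:

```protobuf
syntax = "proto3";

package user.v1;

import "google/protobuf/descriptor.proto";

// Hypothetical set of purposes a user can consent to.
enum DataUse {
  DATA_USE_UNSPECIFIED = 0;
  DATA_USE_BILLING = 1;
  DATA_USE_MARKETING = 2;
}

// Hypothetical custom option carrying the consented purposes for a field.
extend google.protobuf.FieldOptions {
  repeated DataUse consented_uses = 50001;
}

message Profile {
  // Marked as usable for billing communication only. The RPC framework,
  // message queue, or query layer can then require callers to declare their
  // purpose and hand back only the fields greenlit for it.
  string email = 1 [(consented_uses) = DATA_USE_BILLING];
}
```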
And this, I think, is where Protobuf shines. It's paired often with gRPC, which is available in a
million languages and is kind of a widely used standard, right? So if you're doing microservices
and you have schemas today, chances are you're doing gRPC. You can then take those Protobuf schemas, and
this huge, relatively quiet group of companies uses Protobuf all the way from gRPC down into
Kafka. Historically, one of the misconceptions about Protobuf is that you must generate code
ahead of time and you cannot use Protobuf the way you use Avro with a schema
registry and a bunch of dynamic message work. You can, it's just that there wasn't a schema
registry available to you. Now there is. And once you get out of Kafka and you're over in batch land,
like nobody's really doing Avro anyways, right? Everything's in Parquet. The Parquet files are
self-describing. Avro was just an incidental detail of your row format along the way.
Easy peasy, right?
Yeah.
So I guess I think you're talking about how, when we all consolidate,
there are just so many more things you can actually do,
to not just prevent things, but to power all the policies.
That's really interesting.
I'm actually very curious, because when I think of protobuf and Buf,
I think of gRPC,
I think of microservice APIs,
all that stuff up front as well,
because that's where the world
has largely adopted this,
in that space.
And now you're in
the data business now, with
Bufstream.
I'm actually really curious.
To me,
I think Bufstream
is a really bold take
because it's not just like you're adding
protobuf support to existing data stuff.
You're actually like, you know what?
Use us as a Kafka replacement, right?
That's kind of how I read it.
I guess this is the question.
Why did you choose this to be the first,
not the first entry of a data product,
but almost like the first major product
you want to go into
the data side with?
Because I can imagine you also take the schema registry for Kafka, for example.
We already do that.
We shipped that early last year.
We do that with a bunch of customers.
Got it.
So then why Bufstream?
I guess probably the easier question:
Why did you want to actually build a Kafka replacement?
Because I feel like that is more just the benefits you talked about, maybe.
Because with Kafka replacements, there are quite a lot of different things you can go after here.
So maybe talk about motivation here.
Yeah.
Like a lot of startups, this stuff comes from early and pretty deep engagements with our existing customers and with a couple of prospects.
And all of those companies, which for the most part, the ones for this product,
they're quite large and sophisticated. They would like to use Protobuf for everything,
from their RPC APIs down through their message queues, their stream processing.
And the only place they would like to get out of Protobuf is when things become parquet.
We started working with them early last year,
so the kind of beginning of 2023,
when we were building the Buf Schema Registry support
for the Kafka registry protocol.
We built that, and we started talking to them.
We were like, okay, did we fix it?
And their basic answer was not quite. No, we still have a ton of problems. And the basic problem they
had was kind of what we've articulated already. They have an end to end problem and they don't
want a collection of individual layered solutions that they then need to go and ensure that every code base layers
correctly. So for Kafka in particular, it's where we started because in a modern data architecture,
that's typically where you're making the transition from your online transaction processing world into your data engineering, offline stream processing,
kind of semi-real-time but not hard real-time work. And that's where you're crossing the
boundary into analytic data use cases. The basic problem that our customers have is that
the Kafka team, like the streaming data team at their company, they view their job as
accurately schlepping the bytes around. And so from the Kafka team's perspective,
they run Apache Kafka or they help to manage the confluent install or whatever.
And their job is to get the bytes that they're handed by the producer and give them to the
consumer. And that's kind of the
end of it. If anything about the bytes is wrong or the bytes were not supposed to be there to
begin with, that's a conversation that the consumer needs to take up with the producer.
And now you're in this like whodunit murder mystery of like, there was a NaN in the floating
point numbers. And now some exec is looking at like a revenue dashboard that just says NaN.
And everyone's scrambling to figure out
where this thing came from,
which stream processing job along the way introduced it,
like whose fault is this basically?
This is just not really workable.
It's not what the business wants out of the data platform.
What the business wants is they want some guarantee
that yes, the bytes are making their way around
correctly, but more importantly, that the business entities being modeled by these topics are moving
around correctly, that consumers can rely on getting valid data out of this system as a platform
guarantee, not as a best effort, go talk to the producer guarantee.
And then furthermore, that this system has built-in hooks to enable the kind of data
governance that they're increasingly concerned about now. And so we started working on this
with a gateway, right? You look at that problem and you kind of smell it and you're like,
ah, this feels like an API gateway. That's how I would solve this over
in networking land. I would slap a sidecar next to your process. I would inspect the data as it comes
in and out. I would redact it. I would load balance it. I'd have some control plane for policy data.
And that's how we would fix this. And that's roughly how Envoy works. It's kind of a
commoditized architecture. And you look over at Kafka and you're like, there's not really anything
like this. This smells like a similar problem.
Let's try this out.
And so we built a gateway,
which does all that stuff for Kafka.
And of course it has to speak the Kafka protocol,
which is particularly onerous to implement.
And it mostly works.
The problem is that it's really irritatingly complicated.
And there are key pieces of this that you just cannot do as a proxy.
You have to own the data layer.
And the most important thing, actually, let's say the two key things that we were unable
to do in a satisfying way with a proxy were number one, making all of these correctness
things a guarantee and not best effort.
What we were telling our early
prospects to do is to configure firewall rules to make sure that only the proxies can talk to
the brokers. That's just very laborious and painful. From a functionality perspective,
the biggest thing that we couldn't do is we couldn't control the format of the data on disk or in S3. And really what we want is we want the native
format of your Kafka topics to be parquet files. That lets you get huge efficiency benefits and
simplicity benefits. And so we looked at that and we kind of looked at this gateway product and we
said, like, this is cool, but we can do dramatically better if we own
the storage layer. We've already implemented the whole protocol,
so we might as well own the storage layer too. So that's what we did.
Can you explain why? Like, you said so much that I'm like, okay, we need to step back and have like
a little explanation. But could you explain why Parquet is the right format for Kafka?
What are the efficiencies that you're seeing?
Because you talked before about how you have this sort of transmission layer,
and it's a little odd for a transmission layer to take a block and turn it into Parquet.
But why is Parquet the right format?
If you had a Parquet native stream processor, why is Parquet the right solution? There are a couple of things that are really good about Parquet from a theoretical
perspective, and then there's a practical angle. And the most important practical angle is that
maybe this is getting into the kind of spicy hot takes world, but in my view, really elaborate
stream processing architectures,
they're kind of 2015 vintage.
Like, oh, we're really excited.
We're going to do Storm and Heron and Samza and Spark
streaming.
And we're going to get really bent out of shape
about whether your streaming is micro-batching or truly
streaming.
My sense is that, for the most part,
this has gone the way of eventually consistent
transactional databases. The programming
model is too complicated for most use cases. And instead, we want to take all that stuff and we
want to do most of it in a near real-time batch kind of way. Or at least that's the programming
model you want to present people. SQL. Over in that world, Parquet is the lingua franca of data. It's like CSV for
business data. If you want to deal with Iceberg or Delta Lake or Hudi or DuckDB or Arrow,
Parquet is the bedrock of all of that. And there's a ton of complexity in systems like Hudi,
but really it's just a bunch of Parquet files with a really simple set of manifest files on top of it that just tell you what data
is in which Parquet file.
And so for topics that have a schema, which is most of the ones that matter, if you're
natively storing Kafka data as Parquet, it's really only a hop, skip, and a jump to also
materialize the manifest files that you need.
And now you go all the way back to Jay Kreps' original blog post on Kafka.
And he made a big deal at the time about stream table duality.
Jay Kreps was talking about that, and then a couple of years later,
Martin Kleppmann gives the talk about turning the database inside out.
And at the time, I think everyone really interpreted that to mean, oh my God, we've got to do
ksqlDB.
We've got to have Flink support SQL as a true stream processing abstraction.
But I think where we've ended up actually is that we want all the application logic
to be SQL-like for the most part.
And we want to use Kafka kind of as a way of like directly
accessing the write ahead log of the database.
And if we can store your Kafka data as an iceberg table
or as Parquet files, we're there.
Writing to a Kafka topic becomes the same thing
as appending to your lake house table.
And your consumers can choose the access modality that
makes the most
sense for them. They can get record at a time via the Kafka API, or they can jack this S3 bucket
into their query engine as an external table, and they can run whatever SQL queries they'd like.
And if you look at what people do in practice, this is kind of the default architecture.
It even has a name.
This is the Medallion architecture.
And your Kafka topics are your tin or bronze tables.
And then you're going to run SQL or you're going to run some processing on the other
end to refine that data into lower volume, more like analysis-ready tables.
But all that stuff doesn't need to require
another distributed system to copy the data around. It doesn't need to require another
whole copy of the data, which is extortionately expensive. We can just do that at rest. We can
build that end-to-end in a way that's ready to use in one step in one system.
I think Parquet as a format has kind of won. It has some things that are not great about it.
And there are a lot of people trying to do better.
So like Facebook just did Nimble.
It's really interesting.
And I don't really have a strong opinion
about whether Nimble has nailed
all of the important problems with Parquet.
Yeah, I guess we should let the format wars begin.
Yeah, I mean, we're already fighting over in lake house formats.
And I guess we're gonna like push that fight one layer down to
one more standard to rule them all.
So I want to move on to something we call the spicy future. Spicy Futures. As it's very simply understood,
we want you to tell us what you believe
that most people don't believe yet.
And I think you already have something.
So what is the spicy hot take
of the data engineer world that you have?
I think all of this was a mistake.
Like the idea that really kicked off a lot of our focus on big data. We've got Jay Kreps
and Martin Kleppmann and DJ Patil. And we're all going to say that lurking in this mass of data
are these business changing insights. And if we could only build you the incredibly modular,
expensive, operationally complex stack, and if you would
only staff up a whole new function of people to operate that thing for you, and then if you would
only staff a team of data engineers to get the low-rent analysis off their plate, we could have
this aristocracy of data scientists show up and just start delivering game-changing insights.
I mean, every company would have their people-you-may-know
LinkedIn feature equivalent,
and they'd be shipping them like once a month.
And that just has not panned out.
Company after company has invested in this stack,
and what they've gotten at the end, for the most part,
is an approach that works for well-understood
business-critical problems. So at Uber, that is surge pricing, right? That is the dominant use
for a bunch of this data. And it has been business-critical from day one. Everybody knew
that. It wasn't some new problem that emerged and was discovered in the data. And the other thing that companies can achieve
is they can achieve an easier-to-use,
more user-friendly analysis pipeline
that doesn't require calling your friendly business analyst
to run some crazy SQL Teradata monster for you.
For the most part, people are not churning out
these game-changing insights
on a regular and predictable
basis. And I think that the consequences of that are pretty far-reaching. To me, it means that
as infra people, the idea that you can sell an incredibly expensive, piecemeal, difficult-to-operate
data platform that needs a dedicated crew of people to manage
and integrate. And in Kafka's case, you're like, oh, I need a dedicated crew of people that just
deals with partition rebalancing and operations. And that the way you sell that is the future value
of this game-changing analysis, that's dead. That money is not available anymore. Increasingly,
I suspect companies are going to be
unwilling to bite this off. And so that means from an infra perspective, we need to refocus
on delivering end-to-end platforms that bake in best practices into the technology and not into
a complicated user guide where you're buying data engineering, the definitive guide
from O'Reilly, and it makes that big thump when you drop it on the desk.
So I think a lot of our data infra today needs to pivot.
And I think this is the place where the modern data stack needs the equivalent of the
Hortonworks and the Cloudera of 2010.
I mean, this is something I've talked about a lot.
It's like, I think we're in the cloud consolidation phase,
right, where we went through this like crazy thing
over the last 15 years, since 2010, right?
And there's this, this is happening in AI right now,
but like, you know, in cloud and data, cloud and data, we had this massive Cambrian explosion of all these technologies and all these companies.
We ended up with this crazy polyglot, weirdo architecture.
It's like we picked different-shaped Lego bricks, and none of them actually fit together.
It's like we went to every universe and were like, what's your Lego style?
And we just took them all and put them in a stack and then tried to build
Lego with it.
This is what cloud architecture
feels like today.
Like the cloud native
ecosystem is still,
you know,
a thousand plus
different components
and they all work differently.
It's all weird.
And so I like
finally agree
with what you're saying.
I guess the question is,
what do you think causes,
like there's all these
like macroeconomic drivers,
right?
Like interest rates
are at 5% and, you know, capital's not free and we're kind of in this like tech recession.
I generally agree with what you're saying.
Now, how do you think this consolidation occurs, and what do you think drives the consolidation?
It's a great question.
So the first thing I would say is, this isn't limited to the cloud.
Because these same architectures are being used on-prem. Uber,
actually, I think they just published this big blog post on this system called Odin.
And really what Odin is, is it's trying to bring some consistent way to manage data gravity and
operations for disks full of data that you can't just, you know, yeet into the cloud and never
think about again. And so this same problem exists for on-prem workloads.
My sense here is this is partly an economic situation where you're right.
Budgets are down, the cloud is really expensive, interest rates are up.
But this is partly also an acknowledgement that these systems are just very, very, very
complicated to run and operate.
And there are only so many companies that want to fund an expansive data engineering team
to enable their analysts. And their analyst needs tend to be relatively straightforward.
You know, really, I want something that speaks SQL that I can jack into my dashboarding
and query engine of choice.
At one point, that would have been Tableau or Looker.
I think that world has expanded a lot since then.
But it's hard to justify millions in headcount
on an ongoing basis to support that use case.
It doesn't pass the smell test.
I mean, that's a super salient point,
is just the headcount expense to manage these systems.
And the fact that, like, a lot of the complexity of these systems is
due to the fact that they're funny-shaped Lego bricks
that were never thought of as being run together, right?
It's like the glue is where all the work is
and that glue is very expensive.
So I'm curious, like, what do you think the future of a data org looks like or an
engineering organization looks like? What is the future of these organizations? Because if we're
starting at the point is like, we're spending millions and millions of dollars in headcount
on things that we don't necessarily need to if there was consolidation, like, what's the future
organization look like? Because we got here through the broad concept of platform teams and feature teams, right?
And that's how we ended up in this world where you have the Kafka team that just runs Kafka and
doesn't care about bytes, because that's the way we drew their boundary. Do you think there's a
fundamental engineering culture change that has to occur? How do you think this all changes?
I think my view of engineering culture is that a lot of it ends up shaped by the tools
that we use and that we invest ourselves and our identities and our careers in.
And so the best way to change the culture of how teams interact with each other is to change the
technology that they use. It redraws the boundaries, it redraws responsibilities,
and it changes the place where, like, cost and value sit. So, in my ideal world, you have a unified stack that moves data all the way from ephemeral network data down to, like,
kind of historical reporting data in your lakehouse.
And again, this is my perspective, right?
That we want one format.
That's the lingua franca there.
Not that you have to use that, but there's a clearly paved path to use one thing throughout. And then we stop thinking of Kafka as a
particular message queue and more as a convenient protocol for streaming updates. And that protocol
is a commodity. Nobody today thinks of HTTP as a feature of the Apache web server. And you're like,
oh, if this thing has to speak HTTP,
like we got to put Apache in front of it.
Everything speaks HTTP.
That's just the way we hook systems together for request response workloads.
Similarly, the Kafka protocol
should be the way we hook things together
when we want record at a time streaming.
Everything else, right,
is just a question of convenience.
I would love to get to a world where you write data into a system with Kafka, but that same
system accepts RPC or REST calls for relatively simple use cases. Ideally, like all the workloads
that once upon a time we did with HBase tables, that is right
there for the asking.
We can support all of that.
With some squinting and maybe some cooperation from the object storage vendors, we could
do reasonably low-latency, MongoDB-like access on top of the same data set at rest.
I would love something that's much more converged
that doesn't necessarily try to bridge transaction processing and analytics,
but over in the use cases where you want throughput over latency,
we can offer you your data in a whole bunch of different formats
and via a bunch of different protocols and APIs.
You know, this is actually fascinating.
Our last episode was talking about HTAP, which also has another flavor of combining transactional
and analytics as well.
And you're bringing a totally different angle to it, which is super interesting.
I think...
HTAP is the holy grail.
Like, if we could get there, that would be amazing.
I feel like HTAP with one single format to rule
them all might be heaven
for some sort of data
engineer world out there.
Everyone listening to this podcast from inside Google
is just giggling to themselves.
More or less, my understanding is this is basically how it works.
You have some protobufs and you're just
flinging protobufs everywhere.
Under the hood, everything is protobuf.
Their equivalent of Parquet is just
shredded up protobuf records. That's the Dremel
paper. The SQL-type
database takes and returns
protobuf records. It's the kind
of world you can get to quickly
if you have an
internal ecosystem that has
really, really lavish infrastructure funding
and has
buy-in from the top to really focus
on uniformity, no matter what the cost.
And so I think my last question, really: we haven't gotten to this future state that we're
talking about here.
We probably can't talk about all the single steps it requires, but what is one major
thing that you guys are working on that's required to make the push to have the portable format be
more widely adopted? What is, like, the first major step you guys are taking, to say, hey,
we can get better, this is something we're working on to get it more widely adopted?
The first thing that we're doing is we're shipping native Parquet support in Bufstream.
And I think that by itself is enough to get rid of so much extra plumbing and so much extra expense
from a lot of data pipelines that I think of it as a proof of concept. It's the one thing that we
can point to and say, look, if you were willing to standardize,
I can give you this not only with lower operational overhead and more simplicity, but with lower
hard costs too.
That is also, I think, foundational. That's the bridge between your stream processing
and your kind of analytic estate over in Databricks or Snowflake or BigLake or
wherever you put your data. And once those get bridged, now we can start moving policy information
back and forth too. And I think that'll be the place where we can show people the power of a
unified platform to solve business problems for you, right?
To solve problems of compliance, enforcement, GDPR, CCPA.
How do you govern whether your LLMs are learning on the wrong data?
And give you a firm footing to tackle all those problems.
Well, awesome.
I guess if people want to learn more about you or Bufstream or Buf, where should we go?
Just go to buf.build. That's the homepage.
And right up at the top, there's an announcement banner for Bufstream.
You can check out the blog post. You can check out the cost deep dive.
You can reach out to us and we can kind of get you slated for a POC with us.
Amazing. Thanks so much. It was so informative.
When I saw Bufstream launch,
I was like,
I need to know more.
Now I get it.
Now I understand.
I appreciate it enough.
It was wonderful
to talk to both of you.
I appreciate it.