The Data Stack Show - 56: Stream Processing and Observability with Jeff Chao of Stripe

Episode Date: October 6, 2021

Highlights from this week’s conversation include:
- Jeff’s history with stream processing (2:52)
- Working with Mantis to address the impact of Netflix downtime (4:20)
- Defining observability as operational insight (6:58)
- Time series data and the value of data today (18:52)
- Data integration’s shift from batch to streaming (29:34)
- The current state of change data capture (32:20)
- How an engineer thinks of the end-user (56:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the show. We have a really special guest
Starting point is 00:00:30 who has worked on some unbelievable technology at companies like Heroku and Netflix, really deep in the stack at those companies. And so what a treat to get to talk to someone who has that level of experience. I'm super excited to talk to Jeff. I think one of the things that I'm interested in asking him about is just some of the, he's worked on technology that tends to be sort of internal infrastructure technology at these companies. And so I'm really interested to know,
Starting point is 00:01:05 how does he think about the end consumer of the product? And Netflix is a great example. So he has worked on a lot of infrastructure stuff, but I'm just interested to know, does he think about the end consumer in his work? Because the work product doesn't necessarily touch the consumer in the sense of like the last mile. So if we have time, I'm going to try and sneak that question in. But Costas, I know you have a ton of technical questions. So what are you interested in asking?
Starting point is 00:01:35 Yeah, absolutely. I mean, Jeff is also one of the maintainers of an open source project called Mondays, which is a stream processing framework. So Jeff is like the kind of person that you could call an expert in stream processing. So I have quite a few questions around that. I want to learn more about what stream processing is, why it's important. It's also like a term that it's coming up to our conversations again and again, especially compared to batch processing.
Starting point is 00:02:06 So I really want to hear his opinion about what happened, how was streaming processing 10 years ago, how it is today and where we are moving towards to in terms of like these technologies. So yeah, I think I'll focus more on like the streaming processing side of things and we'll see. I'm pretty sure that there will be surprises. There always are. Great. Well, let's jump in and talk with Jeff. Yeah, let's do it. Jeff, we are so excited to have you on. Thanks for taking the time to join us. Hey, thank you. It's a pleasure to be here. Well, you have a really impressive resume,
Starting point is 00:02:44 but I'm going to let you just give your quick background. So where did you come from and what do you do today? Hey, great. Thanks, Eric. So yeah, I'm Jeff. I have a stream processing background. I'm currently at Stripe. I've been here for a little over a month. But prior to that, I was at Netflix for a little over three years. And at Netflix, I worked on an open source project called Mantis. Mantis, we can talk about it later. But fundamentally, it's a platform that enables developers to build real time cost effective stream processing applications specifically geared towards the observability space. So I've been working on that for the entirety of my Netflix tenure.
Starting point is 00:03:30 And prior to that, I was at Heroku working in the data space as well. I worked on Kafka as a service and Postgres as a service. And then prior to that, I was at Salesforce through a series of acquisitions working on stream processors. Super interesting. And what a treat. I just have to say again, what a treat. You've worked on some of the most famous companies in terms of tech and just all the amazing engineering heritage. So really excited to have you on the show. I think we should just start with talking about Mantis. So born inside of Netflix and you did a ton of work on it. And from my understanding, it solved an observability
Starting point is 00:04:12 problem at Netflix. And that was sort of the main use case there, but I guess it's since expanded, but tell us, tell us about Mantis. Yeah, definitely. So Mantis was born out of Netflix. It came before I joined and I joined and worked on it a bunch. And I'm currently one of the committers here. But the whole premise of Mantis was originally around reducing downtime or the mean time to detect in Netflix playback. so back in the day 2014 and prior imagine if you are a netflix member and you're going home and then you're trying to play your video and netflix is down so that back then the amount of downtime for the number of members netflix had on the service was let's say it was like, if the service was down for like two hours, you would have some impact. So that same impact today would be in the order of minutes. So as Netflix continued to grow, we realized like, hey, we need to have better detection and remediation and time to insight. So the current systems at the time were very
Starting point is 00:05:26 limiting. You have your typical logs, metrics, and traces, which typically get sent through like some sort of aggregation tier eventually to be served for you. But the problem is like, the longer it takes for you to get the insight means the longer the downtime is, which means the more impacted is to Netflix members trying to watch their favorite films and series on the service. So Mantis is a system that was born to enable like highly granular sub-second or second detection and also it allows you to trigger automatic remediations. So then we can reduce ultimately impact for playback experience at Netflix. Since then, it's expanded to a wide number of other use cases at Netflix. It's used for debugging use cases.
Starting point is 00:06:17 It's used for things like anomaly detection. And then at the end of the day, you can sync the events out to some other persistent, durable thing like Kafka. And then that opened up other use cases in the data and analytics space. Before we continue, so you mentioned the word observability. Can you define us a little bit what observability means for MADIS specifically? Like what was the exact problem that MADIS was going to solve and how it is described?
Starting point is 00:06:50 That's one question. The second question is like, why observability is a good use case for streaming processing? Yeah, that's a great question. So the first question, how do I define observability? Yeah, that has been a definition that's been going
Starting point is 00:07:06 through the community for some time. But for me, like buzzwords aside, what I'm really saying is operational insight. So I want to be able to ask questions of my system, either ad hoc or not, and get an answer without necessarily having to re-instrument my system now whether i re-instrument or not i guess it doesn't matter because it's more of like what kind of insight can i gain into my system and so specifically for mantis the insight would be well one of the one of the one of the questions we would mainly ask is, what does the playback experience look like for a member at Netflix? And a member at Netflix means, what does the playback experience look like for someone in a specific country, watching on a specific device, for which
Starting point is 00:08:00 television series and which episode and which part in the series. So you can imagine the cardinality of that can get pretty high. So how do we answer those questions without ensuring that the cost in answering such questions costs more than the actual system serving the playback experience? So cardinality was definitely a consideration. Also stream processing, being able to evaluate events one at a time and leave things like windowing or any sort of batchy statefulness up to the developer to define the semantics of how that should work. Because a developer develops their microservice application events will stream
Starting point is 00:08:45 out of there they kind of know what the behavior should look like and kind of what windows they should define and how that's great so we are talking about focusing more on observability around a service right so for example when most of the times when I hear the term observability from people, it's a very common SRE term, right? Where people won't like to have, as you said, operational awareness for their servers and more low level, let's say, services there. Is Mantis used also for these use cases Or is it specifically for more higher level, like product-related services in a microservice architecture that it's commonly deployed and used for? Yeah, that's a great question. So it's generally used at more of the application tier,
Starting point is 00:09:38 but more specifically, so we have logs, metrics, and traces, right? There's sort of a fourth thing, which is events. Events are just a thing that has a bunch of fields in it. Some people, it looks like structured logs, and it's very ambiguously defined. But the idea is, it's just a thing that has a bunch of fields in it. So the developers are free to define whatever fields they want for whatever use case. So it depends on the instrumentation or the library. At Netflix, we have auto instrumentation through Spring Boot. We just use GRPC interceptors. So as events come through the request path, things are automatically added for the developers, so they don't have to worry about adding it but should they choose to add their own fields or add more context they can do that
Starting point is 00:10:31 and on the flip side is if they aren't on the like the request response pattern if they had just any standard microservice or any like stateful system they can just use a library to just add fields as they want. So the long story short is it can be used for lower level system stuff. It's up to the developer to explicitly add those in. That's great. And okay, this is probably a question that you have heard many times before, but what made the team at Netflix to build something from scratch
Starting point is 00:11:09 instead of using something that already exists out there for instrumenting services and collecting events? Yeah, that's a great question. Build versus buy, right? Yeah. So definitely, if you imagine back in 2013, so Mantis has been in prod at Netflix since 2014. So development's probably a year or two even before that.
Starting point is 00:11:38 And back in those days, we had Storm. And I think Trident might have been around, Trident on top of Storm. And then we also had Spark streaming at some point. So the technologies that existed at the time weren't satisfactory for the requirements that Netflix had. So really, they being the people that came before me, Mantis with a few certain trade offs in mind, and operating principles on what what the architecture should look like. So there's three. The first is data should be on demand access. So data, you shouldn't pay the cost to export, serialize, or just move data in general, unless you need to, unless somebody is subscribed to it or listening to that data. When they're subscribed, they should be able to filter for sample project exactly what they're looking for. So you can get the granularity that you want,
Starting point is 00:12:46 but you just have to be very intentional about doing so. So number one is on-demand data access. Number two is aggressively reusing the data. So oftentimes people are subscribing to a stream and they're looking for, they could be looking for the same data or very similar shape of data or like a subset. So if you have that data already on hand in memory, for example, then you shouldn't have to go all the way upstream to the applications to grab that data again, you should just send it back down to the subscriber. So there's a sort of bookkeeping mechanism there. And then the lastly was cloud native auto-scaling. So it needed auto scaling native in the platform at different tiers to be able to scale in and out resources to just generally account for the subscription load and the publishing load.
Starting point is 00:13:46 Yeah, so the systems evaluated at the time during those years didn't fit those three things. Yeah, yeah. I love trade doves. It's where engineering starts, actually. So it's very interesting always to hear the trade doves based on the trade doves that the system was based on. Oh, yeah, definitely. Yeah, it defines pretty much the context
Starting point is 00:14:08 of the software that was built. This is great. And you said in 2013, back then the landscape of the available technologies was very different. Today, 2021, there's a lot of work being done on both streaming platforms and in data platforms in
Starting point is 00:14:26 general. Is there something out there now that if you had at Netflix to choose today to implement Mantis again, are there solutions out there that are closer to this basic that can satisfy this trade dose that we are talking about based on your experience and knowledge? Yeah, I haven't fully looked at the entire landscape of technologies. I mean, from the pure stream processing side, we have Flink. Flink has its whole async checkpointing thing. Oftentimes people use it with Kafka. Within the whole Kafka streaming ecosystem, we have Kafka Streams or even Kafka Connects for the integration story. There are other newcomers with a sort of querying type user experience, like Materialize. Details aside, I think the premise of Mantis was or is that if you're looking for operational insights, insights right now, at least during an incident, are more valuable than insights a month ago than insights a few weeks ago.
Starting point is 00:15:34 So the main tradeoff is we favor the cost effectiveness over not necessarily the correctness, but the persistence of the data. So what that means is effectively a sort of ephemeral or at most one's delivery mechanism where, yeah, if no one's subscribing to a stream, then the data is the cost isn't paid. And so if you want to actually have that context from weeks ago, you would actually explicitly and intentionally have to sync the data out to some more durable store. And then later on, join it, do like a stream table join or just join it together with that context later on. But you did touch on an interesting thing, though, like Mantis today. So it is open source source but then the community isn't as large as I would hope and of course part of that is like putting in the rigor and
Starting point is 00:16:34 and then the care that you would for an open source community and so it becomes it becomes tough because the one of the benefits of open source is in theory you can broaden your development to people outside your company. You can withstand a lot of organizational pressure. You can withstand lots of other things that might come through in the development experience, of course, with tradeoffs. there can still be room in general for existing systems to consider like a mechanism where you have to be intentional about persisting the data. Yeah. And I think that's like a very, very interesting topic, which is something that since I started like working with data systems, I keep thinking about it, which is this characteristic of what I call time series data in general is the more generic, let's say, for this type of data
Starting point is 00:17:33 that they have this temporal nature, like events, for example. I think it's very common that this assumption makes total sense. The data today are much more important than the data that you collected a year ago, right? And this is something that, okay, outside probably of like some InfoSec related use cases where the data is like very important to be kept because you might have like retrospectively go back and figure out what happened six months ago when someone broke into your system. But outside of this, I think the majority of the use cases around streaming data, data today are much more valuable than the data yesterday. And this is a question that I would like also to make to Eric, because Eric comes from being the consumer of this type of data from the marketing perspective, but time series kind
Starting point is 00:18:22 of data that are also used in his everyday work. And like the most typical example is the instrumentation, let's say of the customer, right? Trying to capture all the different events that a customer is generating and try to rebuild the behavior out of this. And based on that,
Starting point is 00:18:38 run some campaigns and do marketing. But what's your experience, Eric? Are the data like in marketing today have the same value as the data that have been collected for your customers like six months ago? That's a great question. I would say in my experience, a lot of marketing data and marketing reporting actually tends to be pretty primitive because unfortunately, I mean, I think marketers are getting more technical, so I don't, I mean, I am a marketer, so I guess I can speak badly about my own kind, but the marketing automation
Starting point is 00:19:18 tool and CRM are where a lot of the customer profile information lives. And so when you think about time series data, especially in the context of an event-based paradigm, I actually still think, even though there's some really good marketing analytics tools out there, I think in terms of the fundamental building blocks of that, a lot of marketing reporting is really behind. And so they tend to rely on a snapshot that comes from exporting structured data out of relational databases, which is in the form of like marketing automation and CRM tools. But I would say for me, the time series data is actually the most important. And so when, like, I'm just the biggest fan of thinking about data in terms of events, especially when it comes to marketing, because it's really the most robust way to identify trends over time. And so to me, that really is the most valuable kind of data. Like if I can look back
Starting point is 00:20:26 six months and look at a metric and the way it changes, and then if I can identify the marketing activities that I've executed or the tactics or campaigns that I've run and line those up on a timeline relative to data that's displayed as events over time, that's where you get your most valuable insights because you can triangulate pretty accurately within reason. I mean, of course, you run into issues around statistical significance, et cetera, but that's where you can kind of say like, okay, I can start to see that when we execute these sorts of tactics, they have an impact, but the delay is actually like a month or three months. And so from a forecasting perspective, when you think about deploying budget, and then when you're going to
Starting point is 00:21:10 start seeing results, it's really helpful, especially in businesses that have a sales cycle or activation cycle that is longer, right? So if you think about a multi-month activation or purchase cycle, it's pretty hard to get without time series data. It's very difficult to get insight into when your marketing activities and especially your budget are actually going to show up on the bottom line, as it were. So that was maybe a longer answer than you were looking for, but without a doubt, like if you can get time series data and we do that on a warehouse and use some BI tools from that standpoint using, you know, event-based data, that's really the holy grail for marketing, I think, especially relative to attribution. Yeah. Yeah.
Starting point is 00:21:56 Makes total sense. talking about all that, Eric, like I started thinking, because you mentioned the evolution in marketing around data and like how primitive some stuff are. And like at the same time, on the other hand, we are talking with Jeff about how streaming, processing on real-time data was implemented in Netflix like back in 2013. And it's very interesting to see like the,
Starting point is 00:22:21 how to say that, like the difference between different roles inside the company and how they are implementing and they're using technologies. And I wanted to ask you, Jeff, you've been in this space and you're working on streaming data for a long time now. Kafka, I think, as a technology has been around for almost 10 years now, if I'm not mistaken.
Starting point is 00:22:43 How have you seen... Yeah, it's actually like a long time in tech time. But how you have seen streaming as a paradigm, streaming data as a paradigm change all these years? And where was it when you started working on that? And how it is today? Yeah, that's a great question you mentioned Kafka 10 years boy it's been long I remember Kafka back in the 0.7 days yeah well and I if I remember
Starting point is 00:23:15 correctly I think the offsets were stored in Zookeeper yeah so and that was before the high level consumer was introduced so yeah we we've come a long way in the community and the data space and the streaming data space. I remember back then, like a lot of it, first of all, data, depending on the company, like you might not even have that much data. You could serve things out of standard leader, follow Postgres or some relational database.
Starting point is 00:23:43 Mongo and WebScale was a huge thing during those days as well. In terms of streaming, a lot has changed in the community in the sense that data integration is like a very common ubiquitous word now, like moving data with schema and versioning and exposing that in some sort of via catalog and connecting sources and syncs and simple transforms like that's a lot of people are trying to solve that problem and really understand what the developer experience the user experience from that looks like personally i feel like it's not quite solved yet. And then we still got some ways to go. Another thing is data quality is a huge thing. As your company
Starting point is 00:24:31 grows and employees and users, et cetera, there will be more data. The data can come from anywhere in any form. So how do you make sure that your data upholds some level of quality threshold? So there's a bunch of companies tackling that as well. So we have like alerting thresholds on systems infrastructure. We have like KPIs for business, but then we are still trying to work on what does that look like for data? What does that experience look like for data as well? And then the last bit is started by the Flink folks is
Starting point is 00:25:06 streaming is a superset of batch. So batch has been around for a while in the Hadoop ecosystem. And there's been some effort through an SQL like interface to merge the different paradigms, streaming and batch into a single abstraction it might have separate underlying infrastructures but at least a single abstraction for for a user to to work with that's still going i think at least from the larger companies it's hard to move because of processes and integrations so at least from what i've seen I've seen, a lot of the larger companies are still having a lot of batch systems through Spark and whatnot.
Starting point is 00:25:49 But with that said, there has been a relatively recent technology through Apache Iceberg to help with the table format. And then you use that with a column file store like Parquet or something. So with tight integration with spark like that makes it pleasant to work with but while at the same time it's sort of i don't
Starting point is 00:26:13 know what that means for the velocity of the the initiative of streaming is a super set of batch quote unquote that's a very interesting point like actually i wanted to ask you exactly that we have like quite a few conversations on this show where people were actually they were trying to say is that bots on an abstract way at least is like a subset of stream right and that we are going towards a reality where ideally everything can be like treated as a stream right is this something like do you agree with that? Do you see this happening? Do you think that at some point,
Starting point is 00:26:48 batch is going to be just, let's say, an obsolete pattern of accessing and working with data and streaming is going to be the de facto way of working with data? Or do you think that we are going to get at a balance point where both are going to coexist and have a single abstraction at the end on top. Yeah, I'm all for simplifying it
Starting point is 00:27:12 under a single abstraction in the future. If we can get the verbs right, verbs being assuming we stick with the SQL paradigm. In its purest form, batch is just like fixed windows windows are typically on longer time horizons and streaming are just generally on smaller windows or more i guess dynamic on smaller time horizons you the windows you might have some state you checkpoint the state you checkpoint in batch as well in one hand you're saving more often than the other so in its purest form that's what i believe and i'm not sure practically if we can get there
Starting point is 00:27:53 and how we would define the criteria for success in that case because there's been a lot of history at least for larger companies with batch systems and a lot a lot of these, at least for larger companies with batch systems. And a lot of these larger companies and even mid-market companies have ML initiatives. And so I'm not familiar with how amenable the streaming patterns are to those as well. So data is accessed in more use cases than before when we were starting out talking the talk of streaming as a superset of batch. So there's, well, I'm all for it. And I champion that fact. I think practically speaking, there's a lot of, I guess, practical hurdles that we have to go over. And the practical hurdles are for good reasons, because people are in the business of whatever core competency that their companies are trying to deliver for their users. Yeah, yeah. Makes total sense. That's very
Starting point is 00:28:59 interesting insight, especially like that we have like many more ways of accessing and working with data today. And that makes total sense. Okay. Question about data integration. You mentioned the term before. Data integration, traditionally, at least, and especially if we look at the vendors that were used a couple of years ago, was like a batch business. Most of data integration was happening with some batch processing systems.
Starting point is 00:29:24 Is this changing? Is data integration more of a streaming workload today? Do you think that this is going to happen? And if yes, how? How it has changed? Yeah, I'm really glad you asked this question. I think it is, at least from all of the startups that I've seen in this space, it's moving towards streaming.
Starting point is 00:29:43 And for one reason that I see, the latency aspect, I mean, some people might have batch systems for whatever requirements they have for their use cases, like maybe they have to do something once a day or once a week for whatever reason their business requires, or maybe it just exists. but getting the data from upstream, like the processing can be according to however they want, whatever semantics they want, but getting the data, people really want their data as quickly as they can, assuming that the trade-off correctness is, I mean, you still get the correctness of the data. So when I say getting the data quickly, I mean, you have applications, REST APIs, ERPC APIs, they're basically just generating data. These data are generally persisted in some sort of
Starting point is 00:30:35 database, or there's a bunch of data sources. There's many applications. So downstream of that, it would be really nice to just get that data as soon as you can. And what better way to do that than in the streaming way. And I think one of the pivotal points was, I mean, Kafka was definitely a contributing factor. So it helps with the, it's got that at least once delivery, and then you can get the correctness factor by replaying and deduping on the other side. So the idea is if you can essentially have a little stream processor, quote unquote, that reads from these log, their log data structures off of these databases,
Starting point is 00:31:17 event at a time, and then write them into Kafka. Then downstream, they can pick up the data at their own leisure. They can process the data at their own leisure they can process this process the data at their own leisure and and for the canonical word today we call that cdc or change data capture surprise yeah yeah let's discuss a little bit more about cdc because it's uh something that you hear a lot about and it's very i think it's a it's a very interesting topic so okay cdc is like
Starting point is 00:31:46 more of a pattern right like it's how do you capture like changes on the state of a database system and you can propagate like these changes like to other systems tell us about it how do you see cdc change uh through time if i remember correctly i've seen projects the basium there around for a while, right? So it's not like something that came up today. But it seems that today people are much more interested in CDC at the point where we even see startups out there actually implementing like CDC as a service, whatever that means. So what is the current state of CDC? Where do you think that it's going?
Starting point is 00:32:26 Yeah, CDC is a fascinating thing so the way i see it is like there's this concept of like the stream table duality where uh operation like a table in a database represents an integration on top of a set of a change stream or it's a snapshot of a point in time of what the representation of this this stream of changes looks like so then then if you take the derivative of that, you get the chain streams, or what I mean by that is like the operations that happen. So if you're inserting a record or updating a record, what did the event look like before? What does it look like after?
Starting point is 00:32:58 What is the operation and the timestamp? So then if you have a stream processor that just reads from the beginning of that stream and applies the operation, you could eventually get to a snapshot. And that's just like a very interesting thing. It's just this log looking data structure that is traditionally internal to database systems. And a lot of them do replication this way, particularly in the leader follower model, right, the single leader model. And then so it's like an extension of that, where like, hey, what if we expose that in a public slash stable API for people or consumers outside of the whole replication ecosystem, so that we can look at that chain stream and then move it move that materialized views after the fact so one of the news cases that i see is people read from these streams and then they have a bunch there are a few stream processors that materialize different views
Starting point is 00:33:57 for different consumers so let's say i'm interested in like a user's table and I want to, I don't care of the exact event, a user event. I just care about maybe like some sort of aggregation or roll up. So then instead of querying that database directly, I can, if I can relax my latency guarantees a little bit, then I can rely on some other thing to read from a chain stream, populate a materialized view for me. And then I can just read from that without hammering the main, the main data set. So the tricky part is you don't just have one database system you have, at least in a, in a larger setting, you might have like multiple systems with their own representation of this log-like data structure and their own representation of how you get data out of those boxes over the network into your systems. So there's a semantics of doing that.
Starting point is 00:34:59 And then there's also the shape. So what kind of envelope, or if at all, is there in place for you to move these payloads from different technologies in a coherent way that something can transform into data that eventually should be enriched for a downstream consumer? So there's a lot of work being done for that. And that's a hard problem. The problem just stems from like, people are generating lots of data, there's lots of applications, many different data sources. But at the end of the day, people aren't looking, the downstream consumers aren't looking at, hey, is this a Postgres, MySQL, Cassandra, Mongo, etc. They're just looking at my user model, my accounts model, my billings model, et cetera. Jeff, one question. So it's interesting to think about CDC. We've talked about it on the show several times, but it's really interesting that you started out
Starting point is 00:35:58 with a use case that is very practical from a business like operations execution standpoint, right? So a user's table, when we think about that, that could apply to product marketing, sales, customer support. CDC is one of those things that is not a new, it's not a new idea, but is the trend towards moving CDC closer to the end consumer of the data? And even from an organizational standpoint, I'm just interested, as you sort of see the technology as it's being leveraged inside of organizations and even some of the tooling around it, at a base level, it's a mechanism to capture useful data, but is it moving closer to the end consumer who may not be as technical? I think in terms of moving the data, it's getting better. In terms of moving towards the end user, I'm not sure, actually.
Starting point is 00:37:01 At the end of the day, the user just wants some view of the data. It can be the raw view of whatever the table looks like all the way upstream, or it can be some materialized view that a stream processor in between has materialized and enriched for them. I think that the tricky thing is CDC has been around for a long time. It's not a new concept. But the interesting thing is there has been a few enabling technologies that made the developer experience better. So like we talked about Debezium and Kafka and Kafka Connect with this high level consumer and the consumer rebalance protocol.
Starting point is 00:37:46 Like there's just a bunch of things that have made it incrementally easier. And today I think technology aside, the concepts have been distilled to a point where if people choose not to use Kafka or that whole ecosystem, they can take those concepts and then build in their own CDC-like technology.
Starting point is 00:38:08 So exporting the data out of a database from like a Mongo op log or a MySQL binlog Postgres wall, et cetera, like that experience has gotten better. Taking that data and moving it in a scalable way has gotten better. And then I think the last piece is how can you transform that data in an easy way to materialize the view? That is so tricky because like an end consumer, there's different personas, right? So if I have the persona at a lower layer, like a data engineer, I can say, hey, look, write a flink job or something that reads from a Kafka stream, make it do a join with
Starting point is 00:38:53 this external data source and produce a view to like Redshift, Snowflake, or Iceberg. And then someone downstream of that can use like a Presto or whatever tools that they use to get that data. But the tricky thing is you need to have that person in between to write that job, to take the raw stream of data and then make the view. So I'm wondering if there's a way where the person all the way downstream that traditionally writes Presto and Trino or works more on like visualizations
Starting point is 00:39:24 and dashboards and stuff that they can just own that stack end to end sure yeah it's super interesting yeah that's a exactly what i was getting at and i really appreciate the candid way that you said the end consumer and in many ways because this is, but the end consumer just wants a materialized view. And it's kind of like, I don't really care how it gets there. Yeah, exactly. I imagine in a smaller company, you would just wear multiple hats and figure out how to ETL the data yourself. Or in a larger company, you would ping like multiple teams and take forever to get the result that you actually want. Yeah, Yeah. Yeah. So I think that I
Starting point is 00:40:05 just think it's a really valuable insight saying that the developer experience has gotten way better and it seems like it's at least moved more towards the data engineering type persona where it's pretty, it's, it's a lot easier for them to leverage CDC in order to produce the result for the end consumer. Yeah, we're getting there. We've still got some ways to go because CDC is, I mean, probably a thing that's been gated, not gated, but more prevalent in larger companies because the mid-market companies, they're still generating data and they still need insight into their systems. And everybody wants to materialize views for whatever use case, but they don't have the resources and teams. They don't have the ingestion people. They don't have data engineers. years. So developer experience wise, we as a community has got some way to go to simplify
Starting point is 00:41:05 that and make it more accessible for companies that are more smaller or in the mid market size. Sure. Jeff, one question, and this is jumping back a little bit in the conversation, but I just want to connect the dots for me and hopefully for our listeners. When you were talking about Mantis and downtime for end users watching video on Netflix, I'm interested to know when you think about the observability of like, okay, we have downtime and we need much more robust data around that. When you were looking at that problem, how much were you interfacing with the teams who were dealing with the user interface side
Starting point is 00:41:53 of trying to communicate about those problems with the end user? Or is that part of it? And the reason I ask that is because as you kind of think about being a consumer of Netflix, as I'm sure almost all of our listeners are, it's interesting to know like how the observability data gets to the engineering team so that they can fix the problem faster. And then what does the loop back
Starting point is 00:42:18 to the consumer look like, if that makes sense, or like the end user watching videos? Yeah, yeah. So you kind of have to define what's interesting to you. And one of the most interesting metrics for Netflix playback experience is called stream starts per second, or streams per second or SPS. And what that means is anytime someone hits play on a television show or a film, basically a giant event gets fired off. And so in its purest form, you can just count the number of those. And if it deviates from some threshold, either static or dynamic, then you want to sound the alarm. Sounding the alarm is a tricky thing because Netflix has hundreds of microservices and these microservices are generally operated and owned by disjoint teams that might not even talk to each other. They might not know of each other. So
Starting point is 00:43:20 it gets tricky with the alert because you have to be able to have a coherent alert that is ideally consolidated. So it's not only quick, but consolidated so that you can triage appropriately and page the appropriate teams. Yeah. It's like the context. Exactly. Yeah, exactly. So stream starts per second is like the main metric. So what happens is it go, everything goes through like a front end proxy. And then that will distribute through to a bunch of microservices within the Netflix ecosystem, internal infrastructure. Netflix, fortunately, has, we use dynamic thresholds, but even without those, a static threshold is sufficient because Netflix playback is pretty predictable over a 24-hour period because people, they get home from work, they watch Netflix. During the work hours, they probably aren't watching Netflix or shouldn't be watching Netflix. And so you have this sort of sinusoidal pattern throughout the week, day in and day out. Yeah, that's super
Starting point is 00:44:33 interesting. Wow. How interesting to think about sort of the rhythm of a day within a time zone, creating some level of predictability. And's really it's really interesting too because uh there's not like a single alert right so there's lots of granularity you can alert on the like a dip in global stream starts per second but you that might not even dip because you might have like if you imagine a long tail distribution? Like someone watching in like a smaller country on an older device, like we care about those members too in their playback experience. So if you're doing things like aggregating or looking on a less granular view, then you're going to miss that person's playback experience if it's bad.
Starting point is 00:45:26 So with Mantis, it allows you to going to miss that person's playback experience if it's bad. So with Mantis, it allows you to zone in on that person. So if someone's watching on like a Wii U, which is just decommissioned by now, I think, Wii U, Stranger Things, Series 3, Episode 1 in like Russia or something, like we'll be able to see if that person is having playback problems. So fascinating. So fascinating. I mean, I just think about the scale of Netflix and the ability to sort of quickly like identify statistically significant problems on a regional level is, is just really cool. I mean, a lot of companies don't have that at an international scale. Yeah, the example that I like to give is like,
Starting point is 00:46:11 suppose you're having playback problems, you call customer support, they could inspect the stream, a stream from an application, and then put in your, like target the query to exactly your device and put 100 sampling and see all of the events coming through and then as you can click tap and swipe through the application
Starting point is 00:46:32 assuming you're on your phone or something they can see all of the events for you going through live and help you troubleshoot right there and then but doing that through like like if you're trying to store all of those events and then aggregate and then look at it after the fact, that could get quite costly for Netflix. Jeff, I want to take you back to CDC again, if that's okay, because I know that Eric has like more questions that are, let's say, closer to the customer, let's say, of all that stuff. Oh yeah, we can jump around. Yeah, yeah, yeah, absolutely. But I have a question that I need to ask you
Starting point is 00:47:12 based on what you mentioned previously about the data modeling on top of the CDC streams. And I wanted to ask, the question is the following. So traditionally, data integration is built around the concept of the data warehouse. It's assumed there is a data warehouse somewhere. We are going to load the data onto this data warehouse. And in the process of doing that, we are also going to integrate the data together.
Starting point is 00:47:37 So yeah, sure. I might have one service that represents a user in one way, another service that represents the user somehow different. And we are going to normalize these into one data model that then our BI team or whoever wants to work on that is going to do that. But that's a very, let's say, so far it has been done like always in a more, let's say, batch way or assuming that's a batch process that are happening. So assuming CDC and assuming that we have these streams of data, right?
Starting point is 00:48:09 Is this going to change? Is this work on the data modeling side of things is going to keep happening like on a data warehouse? Do we need the data warehouse? Or are these things that we can also do on the streams? That's interesting. So I think personally, the answer would be where where does the tooling lie and and where is the best experience and most familiar experience and
Starting point is 00:48:32 like market leader for tooling so it's also sort of a question for eric which i'll pass on to in a minute but like the down the downmost stream the downmost stream consumer, the end user, what tooling do they like? What are they familiar with? And what do they need to generate whatever they need to generate for their use case and their own customers? So the CDC thing, unfortunately, I think it just gets your data faster. So instead of reading two tables, two snapshots from two different data sources, you're reading two streams of changes.
Starting point is 00:49:11 And then as the events come in, you are joining or enriching those two streams with some extra context and then materializing the view as you go. So for Eric's case, it makes his life better. He gets data faster, but how he got the data as like, honestly, I don't think Eric would care or it matters. I mean, if, if the batch thing works in a day, if you could stomach like doing it every hour and recomputing everything that might even be fine or in the streaming sense, if you can get it down to minutes, like that's great for Eric, but the data is still the same at the end of the day.
Starting point is 00:49:50 Absolutely. Yeah, it's very interesting. So another question, and I think you are like probably the perfect person to ask that also because of your experience in the companies that you have worked. So you worked at Salesforce, Heroku, then at Netflix. So Heroku had a very interesting product for me itself that was called Heroku Connect, right? Which the whole idea was we have a Postgres database on Heroku. We have a Salesforce account somewhere. And these two different systems are synchronized one way or another.
Starting point is 00:50:26 This product has been around for quite a while, I think. I always found it very interesting. And having a little bit of the knowledge of how it works, do you see CDC as a pattern also applying these kind of use cases? Because here we don't have a database system and another database system, right? We have a database system. And if we, right? Like we have a database system and if we want to make it a little bit more technical,
Starting point is 00:50:47 we have like a service that is exposed to a REST API or a SOAP API and which the modality is a little bit different. Do you think that like CDC can be used or should be considered also as a pattern that can do this kind of integrations? Yeah, it's just a pattern, as you said, or a tool mechanism to move data into the tool
Starting point is 00:51:11 that is most comfortable for the end user. So it's really interesting in the Salesforce case because you're moving data out of Heroku Postgres, which might be from a smaller dev team, from a new app that somebody in a larger company just started, or like an acquisition of a larger company. Or maybe they have customers of their company that want to use Salesforce and they have their own integration.
Starting point is 00:51:36 So it's just a natural fit to take the data from your online systems, which are in Postgres, and move them to another data source or a sink, in this case, like Salesforce. And then your actual customers are, in this case, would be most comfortable with the Salesforce ecosystem. There's lots of tooling and other Salesforce products that they can use and integrate as well. So if it's not Salesforce, it could be some other thing.
Starting point is 00:52:06 It just depends on which customer are you trying to serve for the CDC mechanism that you happen to have within your infrastructure. Yeah, yeah. So it could be Marketo. So we can also have Eric Copy, right? Yeah. I mean, as I think about, again, coming, like I'm not an engineer and I don't have formal training as an engineer.
Starting point is 00:52:27 But I understand the concept of CDC. And as a non-engineer, it feels incredibly freeing when I think about the stack, because the event-based time series data to me is so valuable and in sort of an. Like, I think it's a really compelling concept. And I think, I mean, my, at least from an outsider's perspective, like, I think the value there and as the ease of delivery gets better and better, I think we're going to see a lot more of it just because it's kind of a way to provide the most valuable type of data to downstream teams, regardless of the system, right? And so you don't have to force a super high level of conformity in terms of like the way that things are captured at the source, if that makes sense.
Starting point is 00:53:36 That's, that's really awesome insight, Eric. And while you were speaking, it reminded me of, of another word, which is just getting, so if you have the raw change stream, you're effectively just getting the raw data. You're not getting a snapshot of a point in time. You're getting it as it comes. And so the beauty of that is it's up to the downstream. Ideally, it would be up to the downstream user to interpret that in a way that fits their use case or fits whatever problem they're trying to solve. So it's a bit more flexible in that sense, because you're not working with data that's rolled up. It's just that today, like you would have to ask someone
Starting point is 00:54:16 to interpret that stream and then put it in a way that works for you. But yeah, if you don't have the raw data, then it might be a little bit different to more difficult to do what you need to do. And that's, I guess, what really was one of the fundamental values of CDC is. Yeah, for sure. Well, we have a little bit more time. So I'm going to ask a question that's less technical. You have worked on some really heavy duty systems, really deep in the stack at some really amazing companies. And one thing I was thinking about just prepping for the show today was, and this is, I love that you describe some of the problems that you're solving in terms of the user experience. So for example, if someone's trying to watch Netflix and they have downtime, right?
Starting point is 00:55:11 Or I translate that as, wow, you interrupted my like binge buzz because I'm trying to hammer through Game of Thrones. But that really stuck out to me because you're working pretty deep in the stack. And in many cases, it sounds like some of the stuff that you're doing doesn't even necessarily touch the end user, if that makes sense. It certainly has an influence, but when we think about observability,
Starting point is 00:55:41 you're trying to get system data to people who can solve a problem with the system, which is sort of an internal feedback loop that someone's going to do something to get it back online for the consumer or the end user. How do you think about that? Especially having like span multiple different companies who are very user centric, but working very deep in the stack. I would just love your perspective. And I think our audience would appreciate like, is that a concern for you? Like, do you try to think about the end user, even if what you're
Starting point is 00:56:17 doing is more of an internal feedback loop with data or infrastructure? Yeah, I love this question. It's something that really resonates for me because a lot of times as engineers, we tend to get stuck in the weeds of the deep technical. And it takes a lot of discipline and rigor to remember to get out of that and start actually with what problem are you trying to solve? Who are you trying to solve it for? What things have been attempted before and why are they bad and how can you improve it? So with that, you're basically just getting context to help you inform the inputs and the requirements to how you should solve your problem and what kind of technology you should be introducing to solve such problems. Because if you start from the inside out,
Starting point is 00:57:06 from the technical side, I just feel like that likelihood of building the wrong thing is much higher. And then therefore you'll end up building a solution in search of a problem instead of actually building the right thing. So definitely it's a skill or a trait, but definitely something that any engineer should be doing first or thinking about is like, what problem are you actually trying to solve?
Starting point is 00:57:31 And is it the right problem to be solving? And then there's other tactical things like incrementally building upon hypothesis, like build something, test it out, build something, get feedback. So just as things that mitigate the risk of building the wrong thing. Yeah. And you know, one thing that, that came to mind, number one, love the perspective and it sounds like, I don't want to want to assume things about you, but it sounds like you've developed like a strong level of discipline in terms of thinking that way, which is really cool. And the other thing that just came to mind around data is that you, you have the end user, right? So the person watching a video with Netflix, but then there are also end users of the data products inside the company as well. Like the first stop and users inside the company, and then the last
Starting point is 00:58:23 stop or the last mile and users, the one who's actually consuming the product. And then the last stop or the last mile end user is the one who's actually consuming the product. And that's a really interesting environment to operate in. And in many ways kind of is complex because it's easy to stop at the first end user and not do the work of thinking about the end end user. Yeah. And that's okay. You can just start somewhere because as you mentioned, if you're deep down in the infrastructure stack, you have to focus. You got to pick some user persona, work with them, get a tight feedback loop going, iterate, ship, and iterate and ship. And then from there, you can either make their experience even better, or you can broaden to
Starting point is 00:59:03 a different user persona and see how you can up-level the abstraction in your whatever product or service that you're providing. And then ultimately, Netflix is a single product by itself. is partnering with, if their use cases would eventually serve that streaming path, like the actual series and films watching path, then you would have to actually keep that in mind as well. So there's different levels of customers, if you will, different levels of partners and user personas, and you can just focus and then incrementally chip away at it. Yeah, absolutely. And I think I'm far from the expert, but if there's one thing that I have learned about mature engineers is that stepping back to ask whether you're building the right thing, regardless of technology is, is absolutely a hallmark of, of a mature engineer. So love that mindset and a great reminder for us and the audience. Well,
Starting point is 01:00:05 we are at the end of our time. Jeff, I feel like we could keep talking for another hour or two, which means we'll have to have you back on the show, but this has been great. Yeah, it's been awesome talking to you all. Thanks. Awesome. Well, we'll check back in with you. We'd love to have you back on the show. Best of luck at Stripe. That's a new adventure. So super exciting and congrats. Thanks again. I'll see you guys later. That show felt like it passed in three minutes, but we talked for an hour. So there was so much in there. I think my big takeaway, we've talked about CDC multiple times on the show, but I think I may have had a little bit of a light bulb moment or sort of an epiphany thinking
Starting point is 01:00:46 about how potentially flexible CDC is for event data. I'm a huge fan of event data. I don't hide that, of course, but it just seems really interesting as a technology. And before I thought it was interesting. Now I would say I'm probably bullish on it as sort of a core piece of the stack that we see developing over the next five years. So that's my takeaway and my prediction. Absolutely.
Starting point is 01:01:13 I think that CDC is a term that we are going to be hearing more from now on. Although as Jeff also said, it's not something new, like it's been around for quite a while. But it has been a little bit more of like an esoteric kind of pattern that bigger companies always use.
Starting point is 01:01:28 But, I mean, okay, the overall conversation, the whole conversation that we had with Jeff was amazing, and I'm pretty sure that we could have a separate episode for each one of the things that we discussed with him. But what I would like to mention and
Starting point is 01:01:43 remind, and what really impressed me is two things. One, how he was talking about technologies and why they were built and how they were built based on the concept of trade-offs and what are the trade-offs. And how many times when we ask him about, is this how things are going to look like in the future is this like the right direction is this the other the right direction his response was let's ask eric who's going to be using it right and these two characteristics are like the characteristics of like like a very experienced and good engineer like that's at the end how you build technology because technology has to serve someone, right? Like we don't build technology for the sake of technology.
Starting point is 01:02:30 And so that's like what I want to keep at the end of this conversation outside of all the amazing information that Jeff shared with us about all these amazing companies and technologies that he has worked with so far. Yeah, absolutely. I mean, to hear someone he has worked with so far. Yeah, absolutely. I mean, to hear someone who has worked on a project like Mantis
Starting point is 01:02:50 and has been deeply involved in really cool sort of advancements in tech like CDC, to hear him step back and ask the question, am I building the right thing is really great. I think that's just a really healthy reminder for all of us. Absolutely. So I appreciated that. Great. Well, thanks for joining us. Tons of great episodes coming up. Subscribe if you haven't, and we'll catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
Starting point is 01:03:22 podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
