Drill to Detail - Drill to Detail Ep.27 'Apache Kafka, Streaming Data Integration and Schema Registry' with Special Guest Gwen Shapira
Episode Date: May 22, 2017
Mark Rittman is joined by Gwen Shapira from Confluent to talk about Apache Kafka, streaming data integration and how it differs from batch-based, GUI-developed ETL development, the problem with architects, exactly-once processing, and how data governance is coming to Kafka development with Confluent's new schema registry server.
Transcript
So my guest this week is Gwen Shapira, long-term friend of the Oracle, Big Data and Data Integration
Communities, and now working as a product manager at Confluent, the people behind Kafka.
So Gwen, thanks for coming on the show, and why don't you tell us a bit about what you do and how you got here?
Yeah, so right now, as I said, I'm a product manager and I take care of Confluent's data
integration products, which are mostly Kafka Connect.
So it's a way to get data in and out of Kafka and between databases.
And the Confluent schema registry, which is kind of a cool piece of technology
that allows us to add structure and schema
to data in Kafka,
which is by definition kind of schema-less
and therefore has the potential of being a huge headache.
And I obviously got here by kind of a convoluted way.
Like, I think my career has been defined
by my inability to say no to things that sound interesting.
Hence me being on this podcast.
I started my career as a developer.
And then when our chief DBA quit, it sounded like a good opportunity to start something cool. So I
volunteered to take on the Oracle DBA position. And I did it for quite a while. So Oracle was
a really, really cool place to be for, I think, a good 10 years of my career. And then, you know,
I was in Oracle consulting, I worked for Pythian at the time, and you're a consultant, right?
So you work with all those clients,
and all they want to talk about is Hadoop, right?
And they're like, okay,
they no longer want to move data between two Oracle systems.
They all want to get data from Oracle to Hadoop.
And after client number five was talking about,
I want to move all my data from Oracle to Hadoop,
you kind of get to thinking,
if this is actually everything they're doing,
will I still have a job doing Oracle in five or ten years because it's all moved to Hadoop?
So after the fifth client or so, I started talking to Cloudera about, hey, can I do this Hadoop thing?
And that's the nice thing about consulting.
And in general, I think in big data the skills are super transferable. If you are really good at your fundamentals, you know how to look at what a program is
doing, instrument it, and dig into: is it using disk, is it using CPU, is the network slow? You can kind
of do pretty much anything. So I ended up doing pretty much what I did for Oracle, which is: why is my ETL process so slow?
I started doing the exact same thing for Hadoop.
So suddenly here I am learning a new technology
just so I can help my client start using it.
And after a few of those,
I became the Kafka person at Cloudera,
but Cloudera wasn't that interested in doing Kafka at the time.
I think personally, I think they're still not.
It's not something official, but it just doesn't seem like an area they're hugely invested in.
And it's kind of slightly depressing to be doing a sideshow for a big company or even a medium-sized company, for that matter.
So when I noticed that all the Kafka people were kind of coalescing around this one company, I was very interested in talking to them, basically.
So that's kind of who I am.
Actually, I joined their company as an engineer.
But then they said that, hey, we don't really have any product people.
And our investors say we need product people.
So can you do product?
And I'm like, I have no idea what I'm doing.
And they're like, hey, we're pretty sure that nobody in product knows what they're doing.
So I'm still learning that.
And I think that's the cool part about a career, right?
That you can kind of keep learning and reuse your skills in different ways.
Exactly.
I've always thought you've always been the person who's had the job
that's one bit cooler than me, actually.
So when I was learning kind of big data, you were working at Cloudera.
And then when I got into big data more so, you were working as a product manager.
And I think you've always kind of blazed quite a trail, really,
when it comes to the companies you work for and the technologies that you kind of specialize in.
But I think particularly data integration and ETL has been an area that you've done a lot of things with.
And as a confession, I would say that probably a lot of the good ideas I've had in the past around this sort of area
have come from things that you've presented on, particularly, I suppose, around the way that you'd lay out,
in the old days of HDFS, a kind of file system,
and, I suppose, the governance around it as well. And I mean, what
particularly interested you in the data integration side of things, really?
Actually, the funniest thing is that I never actually did data integration, and I wasn't even that interested in it,
except that I started my career as a DBA, right?
Yeah.
And 90% of the problems I had to deal with as a DBA could be summed up as: why is my ETL so slow?
Yes.
Why is it not finishing on time? Or it was fast yesterday, but it's
slow today. Or it's only slow
on Sundays or Mondays
when it rains.
ETLs are large
processes. They can get convoluted
and they're usually a source
of performance pain for
customers. So
pretty much I just hung
out around ETL people a lot, I guess, and tried to make things
faster. And so that's kind of how I found myself there. And obviously, in Oracle and
traditional ETL, speed and resource usage and how it interacts with other parts of the system was
a huge pain point. But then when I started moving to Hadoop, speed was never the pain point.
I mean, ETL was fast. Pain was usually after the ETL process finished. And okay, now you have a
bunch of basically bytes on the file system. And it's not like Oracle, right? After the ETL process
finishes in Oracle, you have nice tables and partitions and that kind of thing. In Hadoop, you don't really get that.
You get, if you're lucky, you get hive tables, but that's about it.
So a lot of the pain I had to solve for my customers is actually things that were super, super obvious back in Oracle.
You know, like, how do we know what data types do we have?
Should everything really be a string?
Yeah, exactly.
I mean, as a lead-in, I think, to talk about Kafka, which is obviously the
area that you specialize in now,
let's kind of think about how, I suppose, ETL has changed over the last few years as we've
gone more towards things like streaming,
and we've used, I suppose, a much wider set of tools and so on.
I mean, looking back at presentations you've written in the past, you talked about there being
kind of patterns and anti-patterns and different kinds of tools and so on that people would use.
If you were assessing or thinking about ETL for a customer, or reviewing a system,
what typically would you see there in terms of approaches that they would have,
and how are people starting to adopt, I suppose, things like stream processing and so on?
What are the kind of patterns you see there?
Yeah, it's fascinating, because that is an area that changes a lot over time, and actually
the evolution of the area is something that I'm super interested in right now. Because, you know, I
work in Silicon Valley, and actually a lot of Confluent's customers are kind of cool Silicon Valley companies.
And you see things that are a lot different.
And sometimes Silicon Valley is a trend center.
So I kind of try to see it as does it predict something for the rest of the world?
Because often it does, right?
And if I have to say what ETL is in Silicon Valley right now,
I would say that it's no longer ETL I would recognize.
It's software engineers using data in their applications,
is the way I would characterize it, right?
And last week, you talked to the Airbnb guy,
and he's not calling himself an ETL person, right?
He's a software engineer who happens to be building data stuff.
There is a very good book by your fellow Englishman, Martin Kleppmann,
about data-intensive applications.
And it talks about applications that use data in super broad,
super generic terms, and it goes over a lot of patterns
just around applications
that use data a lot.
And obviously
a large part of it is what we would call ETL
but for him, he was
building search for
LinkedIn. And
if you're building search for LinkedIn, yes,
I would call it an ETL because you're
getting data out of Kafka
and maybe out of some databases.
And you need to land it in Elasticsearch, where you run searches in all kinds of specific indexes.
So to me, it looks a lot like an ETL.
But he would never call himself an ETL engineer or ETL specialist, right?
He would call himself a software engineer.
And this kind of changes everything because if you're
a software engineer, you don't really use point-and-click tools anymore, right? So it's been,
let me think, five years since I've last seen anyone actually do the whole Informatica thing
or DataStage thing. Because you see everything as a software problem.
You need tools, but
your native
environment is writing code to solve
problems. And if something
like Informatica doesn't really
let you do that in a convenient
way, you know, it doesn't integrate
with your GitHub
and Jenkins
and all.
Software engineers have their environments
that they're productive in.
Yeah, when I moved from being a DBA
to doing more engineering work,
the first thing that my colleagues told me
is you need to set up your development environment.
And it turns out that it's like, okay,
how do I get a copy of the code into my machine?
How do I edit it in my editor?
How do I make a small change?
How do I run the test?
And how do I get it back into the correct release?
Which is kind of a cycle that basically engineers keep working around, right?
Developing, testing, integrating, deploying.
It's an interesting point.
I mean, yeah, sorry to interrupt you there.
I mean, I had the conversation with Maxime last week.
And I mean, certainly working now within a product area in a big data company,
building kind of software products, as you say,
nobody will be seen using a tool like Informatica or anything really like that.
I mean, there are tools like, say, StreamSets that I think you're probably aware of
that have tried to take that point-and-click approach and update it, I suppose,
for the kind of cloud era world and so on.
I mean, do you think, though, that this... I mean, you know, you and I have both worked in kind of database
environments, DBA environments, where there used to be kind of hand-scripting of ETL routines, and DBAs
would say, I'm much more productive using scripts rather than using kind of ETL tools. But, you know,
ETL tools were brought in to, I suppose, broaden the kind of set of people that
can do this kind of work, and to add a degree of, I suppose, standardization
and governance to it. Do you think what you're seeing now in the industry, or in Silicon
Valley, is just kind of the maturity of the people doing it, or is it a fundamental change
in how this kind of work is done?
Yeah, the work still seems a lot the same, at least in some senses.
I think that part of the struggle is around the main differences between structured and
unstructured data, right?
That's something that has been like when Informatica was written and DataStage was written, everything
was structured.
You moved data between one relational database and another.
And now you have all those logs and JSONs
and all kinds of just more systems,
more types of data to integrate with.
And the other difference is that
sometimes it's not even databases on either side of the ETL.
So we do work a lot with change capture scenarios now
because apparently Kafka plus something like Dbvisit is pretty popular.
So obviously Dbvisit is getting data out of Oracle and putting it into Kafka.
And then the consumers of the data in Kafka don't necessarily write it anywhere.
Sometimes it's just an application that wants to get an update whenever something happens in the database.
So you have all those microservices, and one of the microservices is responsible for sending the
"congratulations, your account has been opened" email for a bank, and they want this application to get notified
whenever the account creation entry shows up in their DB2.
So that's kind of a classic change data capture use case,
except that there is no database at the other end of it.
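As a rough illustration of the change-data-capture pattern described above, here is a minimal sketch (an editor's addition, not something from the conversation) of a small notification service consuming database change events from Kafka with the standard Java consumer. The topic name, the JSON payload shape, and the "op" field are hypothetical stand-ins for whatever a CDC tool such as Dbvisit actually emits.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AccountCreatedNotifier {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "account-congratulations-service"); // each microservice is its own consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic fed by a CDC tool capturing the ACCOUNTS table
            consumer.subscribe(Collections.singletonList("db.accounts.changes"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assume the CDC payload is JSON with an "op" field marking inserts
                    if (record.value().contains("\"op\":\"INSERT\"")) {
                        sendCongratulationsEmail(record.value());
                    }
                }
            }
        }
    }

    private static void sendCongratulationsEmail(String changeEvent) {
        // Stand-in for the real email integration
        System.out.println("Congratulations, your account has been opened: " + changeEvent);
    }
}
```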
So the integration of, I etl and up building and kind of merging the two
worlds is something that's fairly new and in my opinion really fascinating okay okay so that leads
on quite nicely to i suppose kind of kafka and streaming and so on so you talk quite a lot about
streaming pipelines and and how this is kind of how things are done now in this world so for anybody
listening to this podcast who is i, I suppose, still working with
tools like Data Integrator and ODI and so on, what fundamentally is different
about say streaming pipelines and the way that that kind of integration works
within the kind of big data world?
Okay.
So I'll warn you, I don't know that much about ODI, but...
Okay, any tool, any tool. So any tool that is batch and graphical,
you know, it's quite a different paradigm, isn't it, to doing things streaming?
Yeah, definitely.
So graphical is not the big change.
I mean, it's almost a side effect of stuff being done by developers versus the ETL specialist.
But really, the big change is moving from batch to streaming
and you can see it
in the conversations we're having
because we started out...
my first streaming tool was Spark Streaming,
and they had these micro-batches,
and we thought that micro-batches
are good enough.
For the longest time I thought micro-batches were good enough,
but then we pretty much as a community discovered that even though I don't care that much about latency,
you know, like one second or 100 milliseconds is often fine,
but I care about the flexibility of my data windows.
So if you work with microbatches, it also kind of fights how you do aggregates,
how you're slicing your aggregates
to specific points.
And it affects your ability to handle things like late events,
which are obviously, if you do real-time data,
you can't just say, oh, let's rerun yesterday's batch
and redo it because we just got a bunch of events.
Streaming is
all about being continuous, right?
And I think people call it
unbounded data sometimes.
It's not batch and streaming, it's bounded and unbounded.
And batch
has those very clear boundaries,
and streams basically
is always on. So you need to start
worrying about, how do I handle errors in the world that is always on?
How do I handle late events?
How do I make updates?
And those are pretty hard problems.
I think the entire field is still kind of struggling
with exactly what are the correct answers.
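For readers who want to see what flexible windows and late events look like in code, here is a minimal sketch (an editor's addition) of a windowed count with a grace period for late-arriving events, written with the Kafka Streams DSL that comes up later in the conversation. The topic name and window sizes are arbitrary assumptions, and other stream processors such as Dataflow express the same idea with their own APIs.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class WindowedClickCounts {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-click-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks"); // hypothetical topic keyed by user id

        clicks.groupByKey()
              // 5-minute event-time windows that keep accepting events arriving up to
              // 10 minutes late, instead of rerunning "yesterday's batch" when stragglers show up
              .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(10)))
              .count()
              .toStream()
              .print(Printed.toSysOut());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```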
I talked to a super smart guy
from one of those streaming radio apps,
which is pretty popular. A lot of people use it, so they definitely have big data. And they said that they moved to use Google
Dataflow recently, which is one of the more advanced stream processing systems out there. Yeah,
I think a lot of the field, it's a field that advances in many directions and everyone exchanges
their new ideas but we've definitely
been inspired by some of their work
and vice versa they've been definitely inspired
by some of our work
and you can see
so this guy from Spotify
is telling me that
they're using it but they're mostly using it
in batch mode because
it's a bit scary for them to move to streams.
They still haven't figured out how to do everything they used to do
in this streams world.
And things like error handling and reprocessing of data
was definitely one of the things that is still kind of,
they're still trying to get used to it.
They're trying it on maybe some non-critical data loads.
And so if some of the smartest people out there are still trying to figure it out, I think that's
kind of the cost of being out front.
Yeah, exactly. So, I mean, in terms of Kafka and Confluent then, what was the story of Kafka, and how was Confluent
involved in that? And I suppose what's really interesting is, what was the innovation with Kafka
that made it a better solution than, say, Spark Streaming?
You mentioned it's continuous versus micro-batch and so on.
But what was the thing, the breakthrough thing with Kafka
that made it kind of very popular, do you think?
Yeah, that's funny because Kafka was written as a message bus.
And a message bus is something that me as a DBA almost never saw, right?
It's pure app developers.
It's something that TIBCO did.
It's not something that I've ever really seen myself.
And it started out basically as a smarter, more scalable message bus.
And the way they decided to scale, so it turns out that normal message buses basically are very aware of who is supposed to consume every message.
And the message will be stored often in memory until every single consumer consumed it.
And if you ever used Oracle Advanced Queues, which everyone who used Oracle Streams happened to use, then that's exactly how Oracle Advanced Queues would work, right?
It will never delete data until everyone consumed it.
And that causes a lot of issues around scale, because if you try to scale to a lot of consumers,
then suddenly this whole thing starts falling apart.
And LinkedIn had something like over 10,000 consumers at some point.
So Kafka was basically built to deal with that. And the way they
decided to deal with that is to say, we are going to keep data for four days or seven days, which
is enough for even our worst behaving consumer, and just delete it afterwards. And we're not going
to bother keeping track of the consumers. And turns out that this simplifying principle basically
allowed for pretty big scale.
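As a small concrete example of that design (an editor's addition, with a hypothetical topic name and sizing), this is roughly what "keep everything for seven days and then delete it, regardless of consumers" looks like when creating a topic with Kafka's Java AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateRetentionTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("page-views", 6, (short) 3)
                    // Keep every event for 7 days regardless of who has read it, then delete;
                    // the broker never tracks which consumer has consumed what.
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```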
And they kept on simplifying.
They simplified the data structure.
Oh, just this write-ahead log
is actually enough to maintain a queue,
so let's just use this
bunch of partitioned
write-ahead logs. Basically, like an
Oracle redo log without the database.
And so
there are a lot of smart data engineering tricks
for scale, like zero-copy memory
and a lot of cool things we like to talk about.
But at the base, it was all about simplifying.
How simple can we make a system
and it will still behave like publish-subscribe queues
that we need?
And it turns out you can simplify quite a lot.
And at that point,
they started calling those logs of events,
what we would call a redo log,
they called it a stream of events
because that's just the way these people looked at it.
So if you looked at the very old Kafka consumer,
you can basically define how many threads the consumer has
and they call it streams
because it's like streams of data that the consumer gets
Yeah, and that was way before stream processing was even a thing. That's the funny part of it.
When LinkedIn started having use cases for real-time processing and real-time updates
And that kind of thing,
it was very natural to build it on top of the data already in Kafka, because so much of the data they needed,
they basically built a lot of the integration
between their services via Kafka.
So when they look to see where can we have data
for those streams of events,
well, they had the data in the database,
but getting data out of the database in a real-time manner,
as you know, is not always something DBAs look at kindly.
They worry a lot about performance implications
and how it will be done.
And then in Kafka, data was already there,
and as you know, Kafka didn't care about having additional consumers.
That's what it was written for.
So the Kafka admins were like, oh, go ahead,
just connect to this stream of events and get whatever you need.
So it was just very natural.
And that's when LinkedIn started writing the system called Samza,
which is still one of the coolest systems out there
because it combines basically event-level stream processing,
which is very fast and very scalable,
with localized caches for the data,
which allows you to do more advanced things like aggregation
and joins of data between streams or even...
So they had this idea of actually building materialized views
in this local cache, which is absolutely brilliant.
So suppose you have streams of clicks or something, and you want to join it with something like
profiles.
And you know that the profile table is not that large.
So you say, okay, no problem.
I'll make it fast.
Instead of, every time I get a clickstream event, which is very often, going and asking
Oracle, which would make everyone mad, I'll just read this profile table once
and have it in cache and start doing my joins
locally, which is something that
is slightly controversial, right?
Because doing joins in the application
is something that a lot of times we
say, hey, why are you doing that?
The database can do it a lot better,
but it doesn't necessarily want to do it
a million times a second.
So you build this application that will do this join for you.
And then you say, well, but what if someone updates the profile table?
And then you say, okay, I'll just do change data capture.
I'll create a stream of events out of this profile table.
Every time someone does an insert or an update,
I will get this as an event.
And now it's just a matter of basically creating
a materialized view of those events,
but instead of in another oracle or even the same oracle,
in my local cache, or in the case of some,
it was using rocksdb, which is a pretty nice
in-memory database from LinkedIn.
Sorry, not LinkedIn, Facebook.
So they built this system
and it turned out to work
really, really well.
And when we started Confluent, we
actually wanted the system.
And that was not a problem, except
that it was built to run on Yarn.
And
Yarn is really hard to manage.
And remember that the entire philosophy of Kafka is minimizing,
stripping away the stuff you don't need.
So we decided that we don't really need to run it on Yarn.
We can just write it as a library,
run the stream processing as a library in the application.
We are writing applications and deploying them anyway.
And we can kind of make the whole thing simpler.
So that's kind of like the way the stream processing evolved into what we're doing now with Confluent.
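To make that evolution concrete, here is a minimal sketch (an editor's addition) of the pattern described above: a click stream joined against a locally materialized view of a profile changelog, written with the Kafka Streams library that Confluent ended up building. The topic names and the string values are hypothetical placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class ClickEnricher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // High-volume stream of click events, keyed by user id (hypothetical topic name)
        KStream<String, String> clicks = builder.stream("clicks");

        // Change-data-capture feed of the profile table, materialized as a local table:
        // every insert or update becomes the latest value for that user id
        KTable<String, String> profiles = builder.table("profile-changes");

        // Join each click against the locally cached profile instead of asking the
        // database a million times a second
        KStream<String, String> enriched =
                clicks.join(profiles, (click, profile) -> click + " | " + profile);

        enriched.to("enriched-clicks");

        // The whole thing runs as a library inside this plain Java application;
        // no YARN or separate cluster scheduler is involved
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```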
Okay, so what's the business model behind Confluent then?
Because was Confluent a spin-off from LinkedIn?
I mean, I know there's obviously they employ you as a product manager.
They have a kind of a commercial offering and so on.
But what's the kind of commercial kind of, I suppose, model behind what you're doing?
Yeah, so we do something that's pretty common to open source companies.
We have the open source core, which is libraries and connectors and Kafka itself.
And then we say that there is a lot of added value that is specific to enterprise. So, for example, security or some aspects of security,
integration with Active Directory is something that pretty much
only enterprise companies would really be interested in.
If you're a startup, you don't do Active Directory.
But if you're a bigger company, you care about it.
And then stuff like management tools and graphical monitoring and
click-here-to-create-topics and all that kind of stuff, is things that are kind of added value. You
can definitely live without them, you can do it yourself, and most startups actually do it themselves.
But we feel that if you're a bank and you want to save some time, it's good to just pay
someone who knows what they're doing.
Okay, okay, that's interesting. So that's a good introduction, I think, to kind of Kafka and streaming and so on. So around
that topic, you present quite a lot, you talk quite a lot about this topic and ETL and so on, and
so I prepared a list of things you've said on Twitter over the last few months that I want you
to go through and explain what you mean by them.
So, starting off then, you talked about "no ETL", and you had kind of a phrase there
around no ETL and, I suppose, how things have changed. I mean, what was the thinking behind that,
really? What changes are you seeing happening, and what point are you trying
to make with that, really?
Yeah, so actually I didn't invent the term. I think I heard it at Scala, by the way, like a year or two ago.
And I'm not even 100% sure what they meant at the time.
But for me, it's just the experience of looking around
and seeing that most companies do not have the ETL person
or the ETL tools anymore.
No, that's right.
They have data applications that happen to integrate with each other.
Yeah, that's interesting.
I mean, I think I was talking to the people at another big company
we both used to work with before and about, I suppose,
how ETL is changing for that long tail of customers as well.
And do you think that ETL is changing completely
or it's just
kind of taking on a new kind of guise within the kind of world that we work in? Because there's
still companies out there that are integrating kind of old database systems and transaction
systems and so on. Do you think ETL is changing for them, or is it just the kind of world
that we're in now, really?
In a sense, I hope that ETL is changing, because for the 10 years I've been doing it, it just had more and more problems.
And I almost blame the ETL pains for Hadoop.
You know, it was so painful.
First of all, data warehouses were very painful, right?
Like the modeling exercise and the conformed dimensions
and then the pain trickled
into the ETL, right? Because that was
the process that was supposed to do it.
And then loading data into very large
data warehouses was always
kind of an exercise.
You know how much Oracle talks about it.
It just was super painful.
And then when someone said, oh, you don't
really need to do all
that. Here are some disks, just start dumping data and it will be very fast, because you don't do
anything and you don't need the conformed dimensions and stuff. Everyone just jumped on it.
And three years later, they're looking at it and they're like, okay, but what do we do with all this
crap? Who knows? But now suddenly ETL is not painful, because it barely exists
anymore, you just dump data on those disks, and ETL tools are no longer that
useful. And on the other hand, you have all those bytes on disks, and you kind of
need an engineer to write MapReduce jobs or Spark jobs or whatever to make
sense of this, right? So that's basically... I don't really know.
I have no proof of that,
but that's my theory of the change we're seeing.
So ETL became less painful but less useful,
but the job of getting data from a database
to somewhere else that needs the data
is still a job that exists.
It's just that the whole tooling suddenly stopped being done at one place in the process
and moved to another place, and the whole schema-on-write versus schema-on-read thing
just seems to have moved to a different part of the organization.
So yeah, it's not dead, not even close, but it's just being done very, very differently.
Now, what I'd like to do is, a bit later on, just before the end of the call, talk about things like
data governance and schemas, because I know that's an area you've been looking at recently,
and I think that's quite related to what you're talking about there. But one other thing you've been talking about recently on Twitter
is architects and architectures.
And I think in particular it was to do with the kind of microservices
presentation you saw and so on, really.
If you come across an architect on a customer site
or you're asked to talk about architecture,
what kind of things do you like about that
and what don't you like about it?
What's the problem with architects, do you think?
It depends a lot on the architects, right? But I think the field tends to be... A lot
of times it's just engineers who no longer write code, or not that much.
I'll quote you, I'll quote you on here. You said, "We need architecture discussions to
be informed by data, especially around operability metrics." And the other one was, "It feels like this is the age of architects refusing
to make the hard decisions they're being paid to make." So obviously you make a point there,
but what's the underlying issue, and what do you think is the way things are
going, that we could do things better, really?
Yeah, so architects are usually not the people who
actually build the systems and not the people who run it in production.
So a lot of times I feel like their incentives
are sometimes just slightly misaligned.
And so I'm thinking a lot about how they're misaligned
and how to fix this.
And part of it is that you see it a lot these days.
You go to those talks from startups showing off their architecture,
and it's just a huge mishmash of a lot of technology.
So obviously you have all those microservices,
and every one of them will have its own database,
and every one of them will be a different type of database
because they care very much about best tool for the job.
And that's very defensible.
So I can see why it wouldn't get an architect fired
because you can explain that you are making the best decisions
and microservices are those best practices.
But when you take it to production,
like how many people can actually operate well 15 different databases?
In my opinion,
it's not going to happen.
So the two parts of it is that there has to be
a relationship between
the architects and ops.
The system is super difficult
to operate.
If there's too many databases,
too many applications,
nobody can understand,
if an application went down,
what the impact of it is.
If no one in the organization
can actually answer that,
in my opinion, the architect didn't do a very good job at that.
The system has to be maintainable and understandable.
And the other part is that because architects sometimes pretend
that there is no cost to running the system,
the trade-off between do we pick a good database for this problem
or do we use a slightly not as awesome database
but that we're already very familiar with,
they will always take the cool database.
Even if I don't talk about the resume-driven development,
they will always pick the cool database
because they don't see the cost of adding the cool database.
They only see the benefits.
And I think that if an architect is not making this trade-off call,
is this new complexity actually worth it,
then he's not doing what an architect is supposed to do,
which is consider those hard trade-offs.
So that's kind of where I was coming from.
I feel like a lot of architects are actually ex-operations people,
in which case they do see the complexity,
but you just run into a lot of people who haven't done the job in many years
and you can see how far removed they are from the realities.
I suppose in a way that comes back to what you said earlier on about doing things simple.
You know, the Kafka and the Confluent sort of approach is to do it simply, really.
Yeah, you know, I'm now in Denmark, right?
Yes.
They're very much into the Scandinavian minimalism, and it kind of affects you. You need to see your architecture
as this bare white room, and you pick furniture with a lot of care, and you don't want to overstuff
it with things.
Yes, the place you're staying, actually, I remember being there before, and it had no furniture, so that was taken to an extreme, actually. So in that respect, you're saying, you know, don't overcomplicate things.
But also, I suppose it's important to do some things correctly.
And I think data governance is an area that, in my opinion, Hadoop and Big Data and NoSQL has had a bit of a kind of free pass, really, over the last few years.
You know, do you think that data governance will become more important for the work we're doing in this area?
Do you think there's kind of like, you know, do you think there's enough emphasis put in now or what really?
You know, I've been unable to have a conversation with a customer about anything else for several months now.
And so I come in with, like, I have the schema registry, and I tell them about, hey, we allow you to manage all those schemas, and
you can reject bad data before it makes it into the system. And this resonates with people, right?
Now, compatibility and data compliance is only part of the story around data
governance, but that's a pain point. Like, nobody wants Kafka to turn into what Hadoop became, which is kind of
an unorganized dump of data
that people have trouble making sense of.
So people
take a lot more care of the data going
in and out, and the schema registry is kind of
helpful for them.
But then they really want us to do a lot
more. They keep asking us,
can we also track
who is using the data? Can we track what
they are using it for? Can we have rules around what data is allowed for which use cases? Because
apparently in Europe, they have very strict rules about privacy. And if you have private data,
you can use it for some things, but not other things and so they need to have very good tracking around that
And then apparently, I also found out by talking to people in the EU, they are doing a lot
of work automating how loans are approved. But if you reject someone's loan, you can't just go to
court and say "it's the algorithm" like we do in the States; you actually have to have a very good
explanation of what data was used to make the decision, and how it was made, and who touched the data, and so on.
So every piece of data that may be used to make those big financial decisions has to have very,
very good tracking on it. And that's something that I don't know has really been solved, right?
Because what we have is big data, but you also have to keep all this metadata of
who touched it and where it was created
and what it was used for. We're basically
talking about taking big data
and making it maybe 10 times bigger.
So
it's not something that's super
trivial and easy. I mean,
yeah, if you use Kafka for small problems,
that's not a big deal, and a lot of people are using
Kafka for small problems.
And they use new features like headers to basically keep track of lineage and things.
But that only goes up to a point.
And if you start doing something crazy like, hey, I'll use the clickstream data to learn about a person
and then make loan decisions based on that, you will no longer be able to really do it, because
there's just too many clicks to track governance for all of those. So, yeah.
Okay, so you mentioned the schema registry, a product that you're kind of product managing. So tell us about that, and,
I suppose, how are you managing to do what you just talked about without getting bogged down in
the same kind of, I suppose, friction that
you get from doing that in the data warehousing world? How can you do that?
So I have to tell you that I don't think you can get away from this kind of friction. Like, we have this
internal discussion at Confluent always, whether the schema registry is vitamins or sweets. Like,
is it something that our customers want and ask for,
or is it a bitter pill that we force them into swallowing?
And for, I mean, when I talk to data architects
and people doing governance,
then definitely it's a sweet.
They want it.
They know that bad events in the system can cause havoc,
especially since data sticks around for a long time.
And basically schema registry allows you to declare
this is the schema that this topic is going to have.
And every message that the producer sends
will get validated against the schema.
You can still make changes, but they have to be compatible changes.
So if you add a column with a default,
then everyone will still be able to
read and write data.
But if someone just changes
a string into a number,
then that's not going to be compatible.
And the rule is that the person causing
the problems has to feel the most
pain. So basically
if you send an incompatible schema,
your code will get the error message and you will
not be able to produce it. So everyone who is reading from the topic will be able to count
on the fact that the topic has 100% data that they will be able to use, which is super important.
On the other hand, the people who write the data are not very happy about it.
A lot of the pain we deal with is people trying to basically game the system.
Oh, we have this incompatible change,
but actually nobody will care
because there are no consumers on the topic.
And then, so we decided to disable the schema check
and just write it.
And then obviously there was a consumer on the topic.
There always are, right?
Or even if there's not now,
two months later, someone says,
oh, I know, I'll write all this data into Hadoop
and we'll build this Hive table and run queries.
And then your bad data goes back and bites you.
So there's always, always something.
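As an illustration of the enforcement Gwen describes, here is a minimal sketch (an editor's addition) of a producer that serializes Avro records through Confluent's schema registry. The topic name and schema are hypothetical, and the exact exception you see on an incompatible change depends on the serializer version and the configured compatibility mode.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AccountEventProducer {

    // A new "status" field with a default keeps the change backward compatible;
    // changing an existing field from string to int would be rejected instead.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"AccountCreated\",\"fields\":["
          + "{\"name\":\"account_id\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"new\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord event = new GenericData.Record(schema);
        event.put("account_id", "12345");
        event.put("status", "new");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // If the registry considers this schema incompatible with the subject's existing
            // versions, the serializer throws here and nothing reaches the topic: the producer
            // feels the pain, and consumers keep getting data they can use.
            producer.send(new ProducerRecord<>("account-events", "12345", event));
        }
    }
}
```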
And the way we're trying to get better at that,
that's something that I actually learned
from my own customers.
They basically told me that we can no longer, we as in the architects and the governance
people, we can no longer force software engineers to do anything.
We can convince them, but they'll do whatever they want.
So you need to build tools that those engineers will want to use.
There is absolutely no choice about it.
Whatever tool you build for the schema registry has to work very well for those engineers. So instead of just stopping
bad data in operations at production
time, we basically built
a tool that will do all those validations
in advance. So you can run it on
your own machine, like Maven, or
Gradle for that matter, or you
can build it into your nightly
tests, or into a continuous integration
system, or have your Jenkins
run it whenever someone tries to merge a patch.
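As one possible shape for that kind of pre-merge check (an editor's addition, not Confluent's own tooling), the sketch below asks the schema registry's compatibility endpoint whether a proposed schema would be accepted, and fails the build if not. It assumes the standard Schema Registry REST API, a registry at localhost:8081, and the default topic-name-plus-"-value" subject convention; in practice, the Maven or Gradle integration mentioned above does the same job.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaCompatibilityCheck {

    public static void main(String[] args) throws Exception {
        // The proposed new schema, wrapped in the registry's request envelope
        String newSchema =
            "{\"type\":\"record\",\"name\":\"AccountCreated\",\"fields\":["
          + "{\"name\":\"account_id\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"new\"}]}";
        String body = "{\"schema\": " + toJsonString(newSchema) + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/compatibility/subjects/account-events-value/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // Expected response looks like {"is_compatible":true}; fail the build otherwise
        if (!response.body().contains("\"is_compatible\":true")) {
            System.err.println("Incompatible schema change: " + response.body());
            System.exit(1);
        }
        System.out.println("Schema change is compatible with the latest registered version.");
    }

    // Minimal JSON string escaping for the embedded schema
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```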
So catching things
earlier in the development process
has been advantageous. On the other
hand, you can do it too early in the development process
because then
they will tell you that you don't let them move fast
and break things. So it's kind of a delicate
balance. You have to be careful there.
But basically that's
how we're trying to help
with the problem.
Okay, okay, fantastic. And one last thing, just to mention: I noticed that there
was some stuff that you and Confluent have been talking about around exactly-once processing as well.
And I know what that concept is, but maybe just explain what that means in the context of
Kafka and this conversation as well.
Yeah, it actually means two separate things.
Both of them are really cool.
So for the longest time, if you produced an event to Kafka
and Kafka, for any reason, neglected to acknowledge the event,
so it was written to Kafka but it wasn't acknowledged,
the producer would retry.
And now Kafka would have the event twice.
For some use cases, it was kind of annoying,
and how do we remove duplicates, and it was kind of painful.
So the first thing we added was what we call an idempotent producer,
which basically adds sequence numbers to all the messages you send.
So if the broker receives the same message twice,
it will just say, oh, I already got this one and throw it away.
That's the simplest part.
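In the Java producer, that idempotence is essentially a configuration switch. A minimal sketch (an editor's addition, with a hypothetical topic and key):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IdempotentProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // With idempotence enabled, the producer attaches a producer id and sequence numbers
        // to each batch; if a retry re-sends a batch the broker already wrote, the broker
        // recognizes the sequence number and discards the duplicate.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required for idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "charged"));
        }
    }
}
```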
The second part we baked in is what we call transactions.
But obviously, the term transactions can mean a lot of things.
But it allows you to do a begin transaction, produce to a lot of different topics or partitions or whatever. And then either abort, which means that none of it,
everything you just wrote is going to get ignored,
or commit, in which case it will be persisted.
And consumers can decide if they want to enable dirty reads
or they want to do read committed.
And that's a pretty big deal, right?
Because now we can actually have data in common between different
topics and
know that either it's all there or
none of it is there. And it's especially
useful for stream processing
because it means that you can
consume an event,
process it,
write the result, and either the result and the fact that the event was successfully processed get written at the same time,
or neither will be written. Which means if someone is retrying, they will retry the processing of the event, because
the result is not there and the fact that it has been processed is not there. So for us this is huge, right? It
just allows us to have much more reliable streams of events, it allows us to do very reliable
aggregates because that's where it really gets you. If you start summing things up and you sum
duplicates, it's almost impossible to actually get the correct number out of that. So we have high hopes that exactly-once will make our stream processing
a lot more reliable for kind of those critical, highly accurate use cases.
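To ground the consume-process-produce description above, here is a simplified sketch (an editor's addition) of a transactional processor using the Kafka Java clients: the output records and the consumed offsets are committed in one transaction, and a read_committed consumer only ever sees committed results. The topic names and the pass-through "processing" are hypothetical, and error handling is reduced to the bare minimum.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TransactionalProcessor {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-totals");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Only see messages from committed transactions
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-totals-processor-1");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("orders"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // "Processing" stand-in: pass the value through to the output topic
                        producer.send(new ProducerRecord<>("order-results", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Results and consumed offsets are committed atomically: either both
                    // become visible, or a retry will reprocess the input events.
                    producer.sendOffsetsToTransaction(offsets, "order-totals");
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                }
            }
        }
    }
}
```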
Fantastic.
Well, I'll let you go now because I know you're presenting at a meetup
in Copenhagen later on today.
But it's been great to speak to you.
I'll see you tomorrow, actually, because I'll be over there as well
speaking at an event with you. But, yeah, thanks very much for coming on
the show, and have a good time in Copenhagen. I know I've taken you away for an hour now,
but thanks so much for coming on the show, and it's been great speaking to you.
Okay, take care. Bye.
Thank you.