Drill to Detail - Drill to Detail Ep.27 'Apache Kafka, Streaming Data Integration and Schema Registry' with Special Guest Gwen Shapira
Episode Date: May 22, 2017
Mark Rittman is joined by Gwen Shapira from Confluent to talk about Apache Kafka, streaming data integration and how it differs from batch-based, GUI-developed ETL development, the problem with architects, exactly-once processing, and how data governance is coming to Kafka development with Confluent's new schema registry server.
Transcript
So my guest this week is Gwen Shapira, long-term friend of the Oracle, Big Data and Data Integration
Communities, and now working as a product manager at Confluent, the people behind Kafka.
So Gwen, thanks for coming on the show, and why don't you tell us a bit about what you do and how you got here?
Yeah, so right now, as I said, I'm a product manager and I take care of Confluent's data
integration products, which are mostly Kafka Connect.
So it's a way to get data in and out of Kafka and between databases.
And the Confluent schema registry, which is kind of a cool piece of technology
that allows us to add structure and schema
to data in Kafka,
which is by definition kind of schema-less
and therefore has the potential of being a huge headache.
And I obviously got here by kind of a convoluted way.
Like, I think my career has been defined
by my inability to say no to things that sound interesting.
Hence me being on this podcast.
I started my career as a developer.
And then when our chief DBA quit, it sounded like a good opportunity to start something cool. So I
volunteered to take on the Oracle DBA position. And I did it for quite a while. So Oracle was
a really, really cool place to be for, I think, a good 10 years of my career. And then, you know,
I was in Oracle consulting, I worked for Pythian at the time, and you're a consultant, right?
So you work with all those clients,
and all they want to talk about is Hadoop, right?
And they're like, okay,
they no longer want to move data between two Oracle systems.
They all want to get data from Oracle to Hadoop.
And after client number five was talking about,
I want to move all my data from Oracle to Hadoop,
you kind of get to thinking,
if this is actually everything they're doing,
will I still have a job doing Oracle in five or ten years because it's all moved to Hadoop?
So after the fifth client or so, I started talking to Cloudera about, hey, can I do this Hadoop thing?
And that's the nice thing about consulting.
And in general, I think in big data the skills are super transferable. If you are really good at your fundamentals, you know how to look at what a program is
doing, instrument it, and dig into: is it using disk, is it using CPU, is the network slow? You can kind
of do pretty much anything. So I ended up doing pretty much what I did for Oracle, which is: why is my ETL process so slow?
I started doing the exact same thing for Hadoop.
So suddenly here I am learning a new technology
just so I can help my client start using it.
And after a few of those,
I became the Kafka person at Cloudera,
but Cloudera wasn't that interested in doing Kafka at the time.
I think personally, I think they're still not.
It's not something official, but it just doesn't seem like an area they're hugely invested in.
And it's kind of slightly depressing to be doing a sideshow for a big company or even a medium-sized company, for that matter.
So when I noticed that all the Kafka people were kind of coalescing around this one company, I was very interested in talking to them, basically.
So that's kind of who I am.
Actually, I joined their company as an engineer.
But then they said that, hey, we don't really have any product people.
And our investors say we need product people.
So can you do product?
And I'm like, I have no idea what I'm doing.
And they're like, hey, we're pretty sure that nobody in product knows what they're doing.
So I'm still learning that.
And I think that's the cool part about a career, right?
That you can kind of keep learning and reuse your skills in different ways.
Exactly.
I've always thought you've always been the person who's had the job
that's one bit cooler than me, actually.
So when I was learning kind of big data, you were working at Cloudera.
And then when I got into big data more so, you were working as a product manager.
And I think you've always kind of blazed quite a trail, really,
when it comes to the companies you work for and the technologies that you kind of specialize in.
But I think particularly data integration and ETL has been an area that you've done a lot of things with.
And as a confession, I would say that probably a lot of the good ideas I've had in the past around this sort of area
have come from things that you've presented on, particularly, I suppose, around the way that you'd lay out,
in the old days of HDFS, a kind of file system,
and, I suppose, the governance around it as well. And I mean, what
particularly interested you in the data integration side of things, really?
Actually, the funniest thing is that I never actually did data integration, and I wasn't even that interested in it,
except that I started my career as a DBA, right?
Yeah.
And 90% of the problems I had to deal with as a DBA could be summed up as: why is my ETL so slow?
Yes.
Why is it not finishing on time? Or it was fast yesterday, but it's
slow today. Or it's only slow
on Sundays or Mondays
when it rains.
ETLs are large
processes. They can get convoluted
and they're usually a source
of performance pain for
customers. So
pretty much I just hung
out around ETL people a lot, I guess, and tried to make things
faster. And so that's kind of how I found myself there. And obviously, in Oracle and
traditional ETL, speed and resource usage and how it interacts with other parts of the system was
a huge pain point. But then when I started moving to Hadoop, speed was never the pain point.
I mean, ETL was fast. Pain was usually after the ETL process finished. And okay, now you have a
bunch of basically bytes on the file system. And it's not like Oracle, right? After the ETL process
finishes in Oracle, you have nice tables and partitions and that kind of thing. In Hadoop, you don't really get that.
You get, if you're lucky, you get hive tables, but that's about it.
So a lot of the pain I had to solve for my customers is actually things that were super, super obvious back in Oracle.
You know, like, how do we know what data types do we have?
Should everything really be a string?
Yeah, exactly.
I mean, as a lead-in, I think, to talk about Kafka, which is obviously the
area that you specialize in now,
let's kind of think about how, I suppose, ETL has changed over the last few years as we've
gone more towards things like streaming,
and we've used, I suppose, a much wider set of tools and so on.
I mean, looking back at presentations you've written in the past, you talked about there being
kind of patterns and anti-patterns and different kinds of tools and so on that people would use.
If you were assessing or thinking about ETL for a customer, or reviewing a system,
what typically would you see there in terms of approaches that they would have,
and how are people starting to adopt, I suppose, things like stream processing and so on?
What are the kind of patterns you see there?
Yeah, it's fascinating, because that is an area that changes a lot over time, and actually
the evolution of the area is something that I'm super interested in right now. Because, you know, I
work in Silicon Valley, and actually a lot of Confluent's customers are kind of cool Silicon Valley companies.
And you see things that are a lot different.
And sometimes Silicon Valley is a trend center.
So I kind of try to see it as does it predict something for the rest of the world?
Because often it does, right?
And if I have to say what ETL is in Silicon Valley right now,
I would say that it's no longer ETL I would recognize.
It's software engineers using data in their applications,
is the way I would characterize it, right?
And last week, you talked to the Airbnb guy,
and he's not calling himself an ETL person, right?
He's a software engineer who happens to be building data stuff.
There is a very good book by your fellow Englishman, Martin Kleppmann,
about data-intensive applications.
And it talks about applications that use data in super broad,
super generic terms, and it goes over a lot of patterns
just around applications
that use data a lot.
And obviously
a large part of it is what we would call ETL
but for him, he was
building search for
LinkedIn. And
if you're building search for LinkedIn, yes,
I would call it an ETL because you're
getting data out of Kafka
and maybe out of some databases.
And you need to land it in Elasticsearch, where you run searches in all kinds of specific indexes.
So to me, it looks a lot like an ETL.
But he would never call himself an ETL engineer or ETL specialist, right?
He would call himself a software engineer.
And this kind of changes everything because if you're
a software engineer, you don't really use point-and-click tools anymore, right? So it's been,
let me think, five years since I've last seen anyone actually do the whole Informatica thing
or DataStage thing. Because you see everything as a software problem.
You need tools, but
your native
environment is writing code to solve
problems. And if something
like Informatica doesn't really
let you do that in a convenient
way, you know, it doesn't integrate
with your GitHub
and Jenkins
and all.
Software engineers have their environments
that they're productive in.
Yeah, when I moved from being a DBA
to doing more engineering work,
the first thing that my colleagues told me
is you need to set up your development environment.
And it turns out that it's like, okay,
how do I get a copy of the code into my machine?
How do I edit it in my editor?
How do I make a small change?
How do I run the test?
And how do I get it back into the correct release?
Which is kind of a cycle that basically engineers keep working around, right?
Developing, testing, integrating, deploying.
It's an interesting point.
I mean, yeah, sorry to interrupt you there.
I mean, I had the conversation with Maxime last week.
And I mean, certainly working now within a product area in a big data company,
building kind of software products, as you say,
nobody will be seen using a tool like Informatica or anything really like that.
I mean, there are tools like, say, StreamSets that I think you're probably aware of
that have tried to take that point-and-click approach and update it, I suppose,
for the kind of cloud era world and so on.
I mean, do you think, though, that this... I mean, you know, you and I have both worked in kind of database
environments, DBA environments, where there used to be kind of hand-scripting of ETL routines, and DBAs
would say, I'm much more productive using scripts rather than using kind of ETL tools. But, you know,
ETL tools were brought in to, I suppose, broaden the kind of set of people that
can do this kind of work, and to add a degree of, I suppose, standardization
and governance to it. Do you think what you're seeing now in the industry, or in Silicon
Valley, is just kind of the maturity of the people doing it, or is it a fundamental change
in how this kind of work is done?
Yeah, the work still seems a lot the same, at least in some senses.
I think that part of the struggle is around the main differences between structured and
unstructured data, right?
That's something that has been like when Informatica was written and DataStage was written, everything
was structured.
You moved data between one relational database and another.
And now you have all those logs and JSONs
and all kinds of just more systems,
more types of data to integrate with.
And the other difference is that
sometimes it's not even databases on either side of the ETL.
So we do work a lot with change capture scenarios now
because apparently Kafka plus something like Dbvisit is pretty popular.
So obviously Dbvisit is getting data out of Oracle and putting it into Kafka.
And then the consumers of the data in Kafka don't necessarily write it anywhere.
Sometimes it's just an application that wants to get an update whenever something happens in the database.
So you have all those microservices, and one of the microservices is responsible for sending the
"congratulations, your account has been opened" email for a bank, and they want this application to get notified
whenever the account creation entry shows up in their DB2.
So that's kind of a classic change data capture use case,
except that there is no database at the other end of it.
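As a rough illustration of the change-data-capture pattern described above, here is a minimal sketch (an editor's addition, not something from the conversation) of a small notification service consuming database change events from Kafka with the standard Java consumer. The topic name, the JSON payload shape, and the "op" field are hypothetical stand-ins for whatever a CDC tool such as Dbvisit actually emits.

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class AccountCreatedNotifier {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "account-congratulations-service"); // each microservice is its own consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic fed by a CDC tool capturing the ACCOUNTS table
            consumer.subscribe(Collections.singletonList("db.accounts.changes"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Assume the CDC payload is JSON with an "op" field marking inserts
                    if (record.value().contains("\"op\":\"INSERT\"")) {
                        sendCongratulationsEmail(record.value());
                    }
                }
            }
        }
    }

    private static void sendCongratulationsEmail(String changeEvent) {
        // Stand-in for the real email integration
        System.out.println("Congratulations, your account has been opened: " + changeEvent);
    }
}
```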
So the integration of, I etl and up building and kind of merging the two
worlds is something that's fairly new and in my opinion really fascinating okay okay so that leads
on quite nicely to i suppose kind of kafka and streaming and so on so you talk quite a lot about
streaming pipelines and and how this is kind of how things are done now in this world so for anybody
listening to this podcast who is i, I suppose, still working with
tools like Data Integrator and ODI and so on, what fundamentally is different
about say streaming pipelines and the way that that kind of integration works
within the kind of big data world?
Okay.
So I'll warn you, I don't know that much about ODI, but...
Okay, any tool, any tool. So any tool that is batch and graphical,
you know, it's quite a different paradigm, isn't it, to doing things streaming?
Yeah, definitely.
So graphical is not the big change.
I mean, it's almost a side effect of stuff being done by developers versus the ETL specialist.
But really, the big change is moving from batch to streaming
and you can see it
in the conversations we're having
because we started out...
my first streaming tool was Spark Streaming,
and they had these micro-batches,
and we thought that micro-batches
are good enough.
For the longest time I thought micro-batches were good enough,
but then we pretty much as a community discovered that even though I don't care that much about latency,
you know, like one second or 100 milliseconds is often fine,
but I care about the flexibility of my data windows.
So if you work with microbatches, it also kind of fights how you do aggregates,
how you're slicing your aggregates
to specific points.
And it affects your ability to handle things like late events,
which are obviously, if you do real-time data,
you can't just say, oh, let's rerun yesterday's batch
and redo it because we just got a bunch of events.
Streaming is
all about being continuous, right?
And I think people call it
unbounded data sometimes.
It's not batch and streaming, it's bounded and unbounded.
And batch
has those very clear boundaries,
and streams basically
is always on. So you need to start
worrying about, how do I handle errors in the world that is always on?
How do I handle late events?
How do I make updates?
And those are pretty hard problems.
I think the entire field is still kind of struggling
with exactly what are the correct answers.
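For readers who want to see what flexible windows and late events look like in code, here is a minimal sketch (an editor's addition) of a windowed count with a grace period for late-arriving events, written with the Kafka Streams DSL that comes up later in the conversation. The topic name and window sizes are arbitrary assumptions, and other stream processors such as Dataflow express the same idea with their own APIs.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Printed;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class WindowedClickCounts {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-click-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks"); // hypothetical topic keyed by user id

        clicks.groupByKey()
              // 5-minute event-time windows that keep accepting events arriving up to
              // 10 minutes late, instead of rerunning "yesterday's batch" when stragglers show up
              .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ofMinutes(10)))
              .count()
              .toStream()
              .print(Printed.toSysOut());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```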
I talked to a super smart guy
from one of those streaming radio apps,
which is pretty popular. A lot of people use it, so they definitely have big data. And they said that they moved to use Google
Dataflow recently, which is one of the more advanced stream processing systems out there. Yeah,
I think a lot of the field, it's a field that advances in many directions and everyone exchanges
their new ideas but we've definitely
been inspired by some of their work
and vice versa they've been definitely inspired
by some of our work
and you can see
so this guy from Spotify
is telling me that
they're using it but they're mostly using it
in batch mode because
it's a bit scary for them to move to streams.
They still haven't figured out how to do everything they used to do
in this streams world.
And things like error handling and reprocessing of data
was definitely one of the things that is still kind of,
they're still trying to get used to it.
They're trying it on maybe some non-critical data loads.
And so if some of the smartest people out there are still trying to figure it out, I think that's
kind of the cost of being out front.
Yeah, exactly. So, I mean, in terms of Kafka and Confluent then, what was the story of Kafka, and how was Confluent
involved in that? And I suppose what's really interesting is, what was the innovation with Kafka
that made it a better solution than, say, Spark Streaming?
You mentioned it's continuous versus micro-batch and so on.
But what was the thing, the breakthrough thing with Kafka
that made it kind of very popular, do you think?
Yeah, that's funny because Kafka was written as a message bus.
And a message bus is something that me as a DBA almost never saw, right?
It's pure app developers.
It's something that TIBCO did.
It's not something that I've ever really seen myself.
And it started out basically as a smarter, more scalable message bus.
And the way they decided to scale, so it turns out that normal message buses basically are very aware of who is supposed to consume every message.
And the message will be stored often in memory until every single consumer consumed it.
And if you ever used Oracle Advanced Queues, which everyone who used Oracle Streams happened to use, then that's exactly how Oracle Advanced Queues would work, right?
It will never delete data until everyone consumed it.
And that causes a lot of issues around scale, because if you try to scale to a lot of consumers,
then suddenly this whole thing starts falling apart.
And LinkedIn had something like over 10,000 consumers at some point.
So Kafka was basically built to deal with that. And the way they
decided to deal with that is to say, we are going to keep data for four days or seven days, which
is enough for even our worst behaving consumer, and just delete it afterwards. And we're not going
to bother keeping track of the consumers. And turns out that this simplifying principle basically
allowed for pretty big scale.
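As a small concrete example of that design (an editor's addition, with a hypothetical topic name and sizing), this is roughly what "keep everything for seven days and then delete it, regardless of consumers" looks like when creating a topic with Kafka's Java AdminClient:

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class CreateRetentionTopic {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("page-views", 6, (short) 3)
                    // Keep every event for 7 days regardless of who has read it, then delete;
                    // the broker never tracks which consumer has consumed what.
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));

            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```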
And they kept on simplifying.
They simplified the data structure.
Oh, just this write-ahead log
is actually enough to maintain a queue,
so let's just use this
bunch of partitioned
write-ahead logs. Basically, like an
Oracle redo log without the database.
And so
there are a lot of smart data engineering tricks
for scale, like zero-copy memory
and a lot of cool things we like to talk about.
But at the base, it was all about simplifying.
How simple can we make a system
and it will still behave like publish-subscribe queues
that we need?
And it turns out you can simplify quite a lot.
And at that point,
they started calling those logs of events,
what we would call a redo log,
they called it a stream of events
because that's just the way these people looked at it.
So if you looked at the very old Kafka consumer,
you can basically define how many threads the consumer has
and they call it streams
because it's like streams of data that the consumer gets
Yeah, and that was way before stream processing was even a thing. That's the funny part of it.
When LinkedIn started having use cases for real-time processing and real-time updates
And that kind of thing,
it was very natural to build it on top of the data already in Kafka, because so much of the data they needed,
they basically built a lot of the integration
between their services via Kafka.
So when they look to see where can we have data
for those streams of events,
well, they had the data in the database,
but getting data out of the database in a real-time manner,
as you know, is not always something DBAs look at kindly.
They worry a lot about performance implications
and how it will be done.
And then in Kafka, data was already there,
and as you know, Kafka didn't care about having additional consumers.
That's what it was written for.
So the Kafka admins were like, oh, go ahead,
just connect to this stream of events and get whatever you need.
So it was just very natural.
And that's when LinkedIn started writing the system called Samza,
which is still one of the coolest systems out there
because it combines basically event-level stream processing,
which is very fast and very scalable,
with localized caches for the data,
which allows you to do more advanced things like aggregation
and joins of data between streams or even...
So they had this idea of actually building materialized views
in this local cache, which is absolutely brilliant.
So suppose you have streams of clicks or something, and you want to join it with something like
profiles.
And you know that the profile table is not that large.
So you say, okay, no problem.
I'll make it fast.
Instead of, every time I get a clickstream event, which is very often, going and asking
Oracle, which would make everyone mad, I'll just read this profile table once
and have it in cache and start doing my joins
locally, which is something that
is slightly controversial, right?
Because doing joins in the application
is something that a lot of times we
say, hey, why are you doing that?
The database can do it a lot better,
but it doesn't necessarily want to do it
a million times a second.
So you build this application that will do this join for you.
And then you say, well, but what if someone updates the profile table?
And then you say, okay, I'll just do change data capture.
I'll create a stream of events out of this profile table.
Every time someone does an insert or an update,
I will get this as an event.
And now it's just a matter of basically creating
a materialized view of those events,
but instead of in another oracle or even the same oracle,
in my local cache, or in the case of some,
it was using rocksdb, which is a pretty nice
in-memory database from LinkedIn.
Sorry, not LinkedIn, Facebook.
So they built this system
and it turned out to work
really, really well.
And when we started Confluent, we
actually wanted the system.
And that was not a problem, except
that it was built to run on Yarn.
And
Yarn is really hard to manage.
And remember that the entire philosophy of Kafka is minimizing,
stripping away the stuff you don't need.
So we decided that we don't really need to run it on Yarn.
We can just write it as a library,
run the stream processing as a library in the application.
We are writing applications and deploying them anyway.
And we can kind of make the whole thing simpler.
So that's kind of like the way the stream processing evolved into what we're doing now with Confluent.
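To make that evolution concrete, here is a minimal sketch (an editor's addition) of the pattern described above: a click stream joined against a locally materialized view of a profile changelog, written with the Kafka Streams library that Confluent ended up building. The topic names and the string values are hypothetical placeholders.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

import java.util.Properties;

public class ClickEnricher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-enricher");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // High-volume stream of click events, keyed by user id (hypothetical topic name)
        KStream<String, String> clicks = builder.stream("clicks");

        // Change-data-capture feed of the profile table, materialized as a local table:
        // every insert or update becomes the latest value for that user id
        KTable<String, String> profiles = builder.table("profile-changes");

        // Join each click against the locally cached profile instead of asking the
        // database a million times a second
        KStream<String, String> enriched =
                clicks.join(profiles, (click, profile) -> click + " | " + profile);

        enriched.to("enriched-clicks");

        // The whole thing runs as a library inside this plain Java application;
        // no YARN or separate cluster scheduler is involved
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```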
Okay, so what's the business model behind Confluent then?
Because was Confluent a spin-off from LinkedIn?
I mean, I know there's obviously they employ you as a product manager.
They have a kind of a commercial offering and so on.
But what's the kind of commercial kind of, I suppose, model behind what you're doing?
Yeah, so we do something that's pretty common to open source companies.
We have the open source core, which is libraries and connectors and Kafka itself.
And then we say that there is a lot of added value that is specific to enterprise. So, for example, security or some aspects of security,
integration with Active Directory is something that pretty much
only enterprise companies would really be interested in.
If you're a startup, you don't do Active Directory.
But if you're a bigger company, you care about it.
And then stuff like management tools and graphical monitoring and
click-here-to-create-topics and all that kind of stuff, is things that are kind of added value. You
can definitely live without them, you can do it yourself, and most startups actually do it themselves.
But we feel that if you're a bank and you want to save some time, it's good to just pay
someone who knows what they're doing.
Okay, okay, that's interesting. So that's a good introduction, I think, to kind of Kafka and streaming and so on. So around
that topic, you present quite a lot, you talk quite a lot about this topic and ETL and so on, and
so I prepared a list of things you've said on Twitter over the last few months that I want you
to go through and explain what you mean by them.
So, starting off then, you talked about "no ETL", and you had kind of a phrase there
around no ETL and, I suppose, how things have changed. I mean, what was the thinking behind that,
really? What changes are you seeing happening, and what point are you trying
to make with that, really?
Yeah, so actually I didn't invent the term. I think I heard it at Scala, by the way, like a year or two ago.
And I'm not even 100% sure what they meant at the time.
But for me, it's just the experience of looking around
and seeing that most companies do not have the ETL person
or the ETL tools anymore.
No, that's right.
They have data applications that happen to integrate with each other.
Yeah, that's interesting.
I mean, I think I was talking to the people at another big company
we both used to work with before and about, I suppose,
how ETL is changing for that long tail of customers as well.
And do you think that ETL is changing completely
or it's just
kind of taking on a new kind of guise within the kind of world that we work in? Because there's
still companies out there that are integrating kind of old database systems and transaction
systems and so on. Do you think ETL is changing for them, or is it just the kind of world
that we're in now, really?
In a sense, I hope that ETL is changing, because for the 10 years I've been doing it, it just had more and more problems.
And I almost blame the ETL pains for Hadoop.
You know, it was so painful.
First of all, data warehouses were very painful, right?
Like the modeling exercise and the conformed dimensions
and then the pain trickled
into the ETL, right? Because that was
the process that was supposed to do it.
And then loading data into very large
data warehouses was always
kind of an exercise.
You know how much Oracle talks about it.
It just was super painful.
And then when someone said, oh, you don't
really need to do all
that. Here are some disks, just start dumping data and it will be very fast, because you don't do
anything and you don't need the conformed dimensions and stuff. Everyone just jumped on it.
And three years later, they're looking at it and they're like, okay, but what do we do with all this
crap? Who knows? But now suddenly ETL is not painful, because it barely exists
anymore, you just dump data on those disks, and ETL tools are no longer that
useful. And on the other hand, you have all those bytes on disks, and you kind of
need an engineer to write MapReduce jobs or Spark jobs or whatever to make
sense of this, right? So that's basically... I don't really know.
I have no proof of that,
but that's my theory of the change we're seeing.
So ETL became less painful but less useful,
but the job of getting data from a database
to somewhere else that needs the data
is still a job that exists.
It's just that the whole tooling suddenly stopped being done at one place in the process
and moved to another place, and the whole schema-on-write versus schema-on-read thing
just seems to have moved to a different part of the organization.
So yeah, it's not dead, not even close, but it's just being done very, very differently.
Now, what I'd like to do is, a bit later on, just before the end of the call, talk about things like
data governance and schemas, because I know that's an area you've been looking at recently,
and I think that's quite related to what you're talking about there. But one other thing you've been talking about recently on Twitter
is architects and architectures.
And I think in particular it was to do with the kind of microservices
presentation you saw and so on, really.
If you come across an architect on a customer site
or you're asked to talk about architecture,
what kind of things do you like about that
and what don't you like about it?
What's the problem with architects, do you think?
It depends a lot on the architects, right? But I think the field tends to be... A lot
of times it's just engineers who no longer write code, or not that much.
I'll quote you, I'll quote you on here. You said, "We need architecture discussions to
be informed by data, especially around operability metrics." And the other one was, "It feels like this is the age of architects refusing
to make the hard decisions they're being paid to make." So obviously you make a point there,
but what's the underlying issue, and what do you think is the way things are
going, that we could do things better, really?
Yeah, so architects are usually not the people who
actually build the systems and not the people who run it in production.
So a lot of times I feel like their incentives
are sometimes just slightly misaligned.
And so I'm thinking a lot about how they're misaligned
and how to fix this.
And part of it is that you see it a lot these days.
You go to those talks from startups showing off their architecture,
and it's just a huge mishmash of a lot of technology.
So obviously you have all those microservices,
and every one of them will have its own database,
and every one of them will be a different type of database
because they care very much about best tool for the job.
And that's very defensible.
So I can see why it wouldn't get an architect fired
because you can explain that you are making the best decisions
and microservices are those best practices.
But when you take it to production,
like how many people can actually operate well 15 different databases?
In my opinion,
it's not going to happen.
So the two parts of it is that there has to be
a relationship between
the architects and ops.
The system is super difficult
to operate.
If there's too many databases,
too many applications,
nobody can understand,
if an application went down,
what the impact of it is.
If no one in the organization
can actually answer that,
in my opinion, the architect didn't do a very good job at that.
The system has to be maintainable and understandable.
And the other part is that because architects sometimes pretend
that there is no cost to running the system,
the trade-off between do we pick a good database for this problem
or do we use a slightly not as awesome database
but that we're already very familiar with,
they will always take the cool database.
Even if I don't talk about the resume-driven development,
they will always pick the cool database
because they don't see the cost of adding the cool database.
They only see the benefits.
And I think that if an architect is not making this trade-off call,
is this new complexity actually worth it,
then he's not doing what an architect is supposed to do,
which is consider those hard trade-offs.
So that's kind of where I was coming from.
I feel like a lot of architects are actually ex-operations people,
in which case they do see the complexity,
but you just run into a lot of people who haven't done the job in many years
and you can see how far removed they are from the realities.
I suppose in a way that comes back to what you said earlier on about doing things simple.
You know, the Kafka and the Confluent sort of approach is to do it simply, really.
Yeah, you know, I'm now in Denmark, right?
Yes.
They're very much into the Scandinavian minimalism, and it kind of affects you. You need to see your architecture
as this bare white room, and you pick furniture with a lot of care, and you don't want to overstuff
it with things.
Yes, the place you're staying, actually, I remember being there before, and it had no furniture, so that was taken to an extreme, actually. So in that respect, you're saying, you know, don't overcomplicate things.
But also, I suppose it's important to do some things correctly.
And I think data governance is an area that, in my opinion, Hadoop and Big Data and NoSQL has had a bit of a kind of free pass, really, over the last few years.
You know, do you think that data governance will become more important for the work we're doing in this area?
Do you think there's kind of like, you know, do you think there's enough emphasis put in now or what really?
You know, I've been unable to have a conversation with a customer about anything else for several months now.
And so I come in with, like, I have the schema registry, and I tell them about, hey, we allow you to manage all those schemas, and
you can reject bad data before it makes it into the system. And this resonates with people, right?
Now, compatibility and data compliance is only part of the story around data
governance, but that's a pain point. Like, nobody wants Kafka to turn into what Hadoop became, which is kind of
an unorganized dump of data
that people have trouble making sense of.
So people
take a lot more care of the data going
in and out, and the schema registry is kind of
helpful for them.
But then they really want us to do a lot
more. They keep asking us,
can we also track
who is using the data? Can we track what
they are using it for? Can we have rules around what data is allowed for which use cases? Because
apparently in Europe, they have very strict rules about privacy. And if you have private data,
you can use it for some things, but not other things and so they need to have very good tracking around that
And then apparently, I also found out by talking to people in the EU, they are doing a lot
of work automating how loans are approved. But if you reject someone's loan, you can't just go to
court and say "it's the algorithm" like we do in the States; you actually have to have a very good
explanation of what data was used to make the decision, and how it was made, and who touched the data, and so on.
So every piece of data that may be used to make those big financial decisions has to have very,
very good tracking on it. And that's something that I don't know has really been solved, right?
Because what we have is big data, but you also have to keep all this metadata of
who touched it and where it was created
and what it was used for. We're basically
talking about taking big data
and making it maybe 10 times bigger.
So
it's not something that's super
trivial and easy. I mean,
yeah, if you use Kafka for small problems,
that's not a big deal, and a lot of people are using
Kafka for small problems.
And they use new features like headers to basically keep track of lineage and things.
But that only goes up to a point.
And if you start doing something crazy like, hey, I'll use the clickstream data to learn about a person
and then make loan decisions based on that, you will no longer be able to really do it, because
there's just too many clicks to track governance for all of those. So, yeah.
Okay, so you mentioned the schema registry, a product that you're kind of product managing. So tell us about that, and,
I suppose, how are you managing to do what you just talked about without getting bogged down in
the same kind of, I suppose, friction that
you get from doing that in the data warehousing world? How can you do that?
So I have to tell you that I don't think you can get away from this kind of friction. Like, we have this
internal discussion at Confluent always, whether the schema registry is vitamins or sweets. Like,
is it something that our customers want and ask for,
or is it a bitter pill that we force them into swallowing?
And for, I mean, when I talk to data architects
and people doing governance,
then definitely it's a sweet.
They want it.
They know that bad events in the system can cause havoc,
especially since data sticks around for a long time.
And basically schema registry allows you to declare
this is the schema that this topic is going to have.
And every message that the producer sends
will get validated against the schema.
You can still make changes, but they have to be compatible changes.
So if you add a column with a default,
then everyone will still be able to
read and write data.
But if someone just changes
a string into a number,
then that's not going to be compatible.
And the rule is that the person causing
the problems has to feel the most
pain. So basically
if you send an incompatible schema,
your code will get the error message and you will
not be able to produce it. So everyone who is reading from the topic will be able to count
on the fact that the topic has 100% data that they will be able to use, which is super important.
On the other hand, the people who write the data are not very happy about it.
A lot of the pain we deal with is people trying to basically game the system.
Oh, we have this incompatible change,
but actually nobody will care
because there are no consumers on the topic.
And then, so we decided to disable the schema check
and just write it.
And then obviously there was a consumer on the topic.
There always are, right?
Or even if there's not now,
two months later, someone says,
oh, I know, I'll write all this data into Hadoop
and we'll build this Hive table and run queries.
And then your bad data goes back and bites you.
So there's always, always something.
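As an illustration of the enforcement Gwen describes, here is a minimal sketch (an editor's addition) of a producer that serializes Avro records through Confluent's schema registry. The topic name and schema are hypothetical, and the exact exception you see on an incompatible change depends on the serializer version and the configured compatibility mode.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class AccountEventProducer {

    // A new "status" field with a default keeps the change backward compatible;
    // changing an existing field from string to int would be rejected instead.
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"AccountCreated\",\"fields\":["
          + "{\"name\":\"account_id\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"new\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", KafkaAvroSerializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericRecord event = new GenericData.Record(schema);
        event.put("account_id", "12345");
        event.put("status", "new");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // If the registry considers this schema incompatible with the subject's existing
            // versions, the serializer throws here and nothing reaches the topic: the producer
            // feels the pain, and consumers keep getting data they can use.
            producer.send(new ProducerRecord<>("account-events", "12345", event));
        }
    }
}
```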
And the way we're trying to get better at that,
that's something that I actually learned
from my own customers.
They basically told me that we can no longer, we as in the architects and the governance
people, we can no longer force software engineers to do anything.
We can convince them, but they'll do whatever they want.
So you need to build tools that those engineers will want to use.
There is absolutely no choice about it.
Whatever tool you build for the schema registry has to work very well for those engineers. So instead of just stopping
bad data in operations at production
time, we basically built
a tool that will do all those validations
in advance. So you can run it on
your own machine, like Maven, or
Gradle for that matter, or you
can build it into your nightly
tests, or into a continuous integration
system, or have your Jenkins
run it whenever someone tries to merge a patch.
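As one possible shape for that kind of pre-merge check (an editor's addition, not Confluent's own tooling), the sketch below asks the schema registry's compatibility endpoint whether a proposed schema would be accepted, and fails the build if not. It assumes the standard Schema Registry REST API, a registry at localhost:8081, and the default topic-name-plus-"-value" subject convention; in practice, the Maven or Gradle integration mentioned above does the same job.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SchemaCompatibilityCheck {

    public static void main(String[] args) throws Exception {
        // The proposed new schema, wrapped in the registry's request envelope
        String newSchema =
            "{\"type\":\"record\",\"name\":\"AccountCreated\",\"fields\":["
          + "{\"name\":\"account_id\",\"type\":\"string\"},"
          + "{\"name\":\"status\",\"type\":\"string\",\"default\":\"new\"}]}";
        String body = "{\"schema\": " + toJsonString(newSchema) + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/compatibility/subjects/account-events-value/versions/latest"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // Expected response looks like {"is_compatible":true}; fail the build otherwise
        if (!response.body().contains("\"is_compatible\":true")) {
            System.err.println("Incompatible schema change: " + response.body());
            System.exit(1);
        }
        System.out.println("Schema change is compatible with the latest registered version.");
    }

    // Minimal JSON string escaping for the embedded schema
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}
```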
So catching things
earlier in the development process
has been advantageous. On the other
hand, you can do it too early in the development process
because then
they will tell you that you don't let them move fast
and break things. So it's kind of a delicate
balance. You have to be careful there.
But basically that's
how we're trying to help
with the problem.
Okay, okay, fantastic. And one last thing, just to mention: I noticed that there
was some stuff that you and Confluent have been talking about around exactly-once processing as well.
And I know what that concept is, but maybe just explain what that means in the context of
Kafka and this conversation as well.
Yeah, it actually means two separate things.
Both of them are really cool.
So for the longest time, if you produced an event to Kafka
and Kafka, for any reason, neglected to acknowledge the event,
so it was written to Kafka but it wasn't acknowledged,
the producer would retry.
And now Kafka would have the event twice.
For some use cases, it was kind of annoying,
and how do we remove duplicates, and it was kind of painful.
So the first thing we added was what we call an idempotent producer,
which basically adds sequence numbers to all the messages you send.
So if the broker receives the same message twice,
it will just say, oh, I already got this one and throw it away.
That's the simplest part.
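In the Java producer, that idempotence is essentially a configuration switch. A minimal sketch (an editor's addition, with a hypothetical topic and key):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class IdempotentProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // With idempotence enabled, the producer attaches a producer id and sequence numbers
        // to each batch; if a retry re-sends a batch the broker already wrote, the broker
        // recognizes the sequence number and discards the duplicate.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // required for idempotence

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "order-42", "charged"));
        }
    }
}
```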
The second part we baked in is what we call transactions.
But obviously, the term transactions can mean a lot of things.
But it allows you to do a begin transaction, produce to a lot of different topics or partitions or whatever. And then either abort, which means that none of it,
everything you just wrote is going to get ignored,
or commit, in which case it will be persisted.
And consumers can decide if they want to enable dirty reads
or they want to do read committed.
And that's a pretty big deal, right?
Because now we can actually have data in common between different
topics and
know that either it's all there or
none of it is there. And it's especially
useful for stream processing
because it means that you can
consume an event,
process it,
write the result, and either the result and the fact that the event was successfully processed get written at the same time,
or neither will be written. Which means if someone is retrying, they will retry the processing of the event, because
the result is not there and the fact that it has been processed is not there. So for us this is huge, right? It
just allows us to have much more reliable streams of events, it allows us to do very reliable
aggregates because that's where it really gets you. If you start summing things up and you sum
duplicates, it's almost impossible to actually get the correct number out of that. So we have high hopes that exactly-once will make our stream processing
a lot more reliable for kind of those critical, highly accurate use cases.
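To ground the consume-process-produce description above, here is a simplified sketch (an editor's addition) of a transactional processor using the Kafka Java clients: the output records and the consumed offsets are committed in one transaction, and a read_committed consumer only ever sees committed results. The topic names and the pass-through "processing" are hypothetical, and error handling is reduced to the bare minimum.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class TransactionalProcessor {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-totals");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Only see messages from committed transactions
        consumerProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-totals-processor-1");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("orders"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        // "Processing" stand-in: pass the value through to the output topic
                        producer.send(new ProducerRecord<>("order-results", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Results and consumed offsets are committed atomically: either both
                    // become visible, or a retry will reprocess the input events.
                    producer.sendOffsetsToTransaction(offsets, "order-totals");
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                }
            }
        }
    }
}
```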
Fantastic.
Well, I'll let you go now because I know you're presenting at a meetup
in Copenhagen later on today.
But it's been great to speak to you.
I'll see you tomorrow, actually, because I'll be over there as well
speaking at an event with you. But, yeah, thanks very much for coming on
the show, and have a good time in Copenhagen. I know I've taken you away for an hour now,
but thanks so much for coming on the show, and it's been great speaking to you.
Okay, take care. Bye.
Thank you.