CoRecursive: Coding Stories - Tech Talk: Micro Services vs Monoliths With Jan Machacek
Episode Date: June 6, 2018. Tech Talks are in-depth technical discussions. I don't know a lot about microservices, like how to design them and what the various caveats and anti-patterns are. I'm currently working on a project that involves decomposing a monolithic application into separate parts, integrated together using Kafka and HTTP. Today I talk to the co-author of the upcoming book Reactive Systems Architecture: Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you'll really enjoy this interview. The insights from this conversation are already helping me. Contact: Jan Machacek is the CTO at Cake Solutions. Videos: Long-Lived Microservices. Book: Reactive Systems Architecture.
Transcript
Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development.
I am Adam, your host.
Hey, here is my confession. I do not know a lot about microservice architectures.
I'm currently working on a project that involves decomposing a monolithic application into separate parts and integrating those parts using Kafka and HTTP.
Today I talked to the co-author of the upcoming O'Reilly book, Reactive Systems Architecture, subtitled Designing and Implementing an Entire Distributed System.
If you want to learn some of the hows and whys of building a distributed
system, I think you'll enjoy this interview.
The insights from this conversation are
already helping me.
Jan Machacek.
Actually, is that how you pronounce your name?
It's Moflem.
Sorry, I'll shut up.
Jan
Machacek is the CTO of Cake Solutions and a distributed systems expert.
How do you feel about being called an expert?
It's false advertising entirely.
No idea if I would call myself expert.
I've had my fingers burnt quite a few times.
So I'm not sure if that qualifies as the answer.
I think that it qualifies
like, from where I'm sitting, you know, somebody
who's made some mistakes
is in a great place to stop me from making them.
Okay, I'll do my best and make more
mistakes.
So recently I've been trying to get
up to speed on how to split
up a monolithic app
and also just about
how you might design a distributed system in general.
And my former colleague, Pete, recommended I check out your work. So why would I build a
distributed system?
Now, that's actually a killer question. A lot of the time, the typical answer is that people say, oh, well,
we don't like this monolithic application because it's monolithic.
That's the only reason.
And by monolithic, I mean a thing where all the modules, all the code that makes up the
application runs in one container.
I'm using the word container in this most generic way.
So, you know, maybe a Docker container, maybe a JVM or something,
but everything runs in this one blob of code.
And that doesn't actually mean that it doesn't scale, right?
It's quite possible to think of multiple instances
of this monolithic application,
monolithic in the sense that it contains the thing that deals with orders
and the thing that deals with generating emails and what have you.
But it scales out, right?
It runs on multiple computers and it talks to, I don't know,
again, a scaled-out database.
Now, that might be okay for business,
but clearly it's not really entirely satisfying for development.
And soon enough, the two will meet,
and business will say things like,
well, guys, we need you to update the blah component,
the thing that generates the emails that go out to our customers.
And the development team says,
yeah, okay, yeah, yeah, sure, we can do that.
Tappity, tappity, tap.
And then they say, yep, that's all done.
So when can we redeploy this entire thing?
Now, that's where it's starting to get a bit ropey.
I just think, you know, the product owners,
the business will have questions like,
okay, yeah, yeah, okay.
Are you sure you haven't, like, everything else is fine, right?
And then we, the programmers say,
oh, yeah, yeah, yeah.
We haven't changed a single line of code
in this other stuff.
We've only changed the lines of code
that make up the email sending service.
And we deploy it and then something breaks.
Now that's actually really unpleasant.
So we say, oh, we don't want that.
We realize that the entire system is made up of smaller components and they have a different life cycle.
So we want to break them up.
So we want to have a service that does email sending.
We want to have a service that does, I don't know, the product management and some sort of e-commerce component that takes payment.
And so we break it up.
Now, when we do, we gain the flexibility
of deploying these smaller components independently,
but there's a price to pay, right?
The complexity of the application remains the same.
And the fact that we've broken it up and we said,
okay, this is now simpler.
Well, the complexity hasn't gone away.
That still stays.
And we've just pushed it aside to the way in which these components talk to each other.
So this brings me to like the first monumental mistake.
And yeah, I've made it.
We said, oh, well, what we actually have now is a distributed monolith.
What I mean by that is application that runs in multiple processes,
multiple containers, if you like, but its components need to talk to each other.
So to display a web page, I need to talk to the product service. And I also need to talk to the
discount service and the image service. And I need to get all three results before I can render out
the page. And if one of these components fails, ah, well, that's
tough. It doesn't work. So that is a particularly terrible example of a distributed monolith.
Essentially, what you've replaced is the in-process method calls and function calls with
RPC. Now, even when I say it out loud, it sounds pretty unpleasant. Who would want to do that?
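To make that fan-out concrete, here is a minimal Scala sketch of the distributed-monolith trap: rendering one page needs three remote calls, and if any of them fails or times out, the whole render fails with it. The service calls and names here are hypothetical, not taken from the conversation.

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // Hypothetical remote calls: each one is now an RPC to another service.
    def fetchProduct(id: String): Future[String]  = Future("product")   // product service
    def fetchDiscount(id: String): Future[Double] = Future(0.1)         // discount service
    def fetchImageUrl(id: String): Future[String] = Future("image.png") // image service

    // A "distributed monolith" page render: all three calls must succeed.
    // One failure, and the page cannot be rendered at all.
    def renderPage(id: String): Future[String] =
      for {
        product  <- fetchProduct(id)
        discount <- fetchDiscount(id)
        image    <- fetchImageUrl(id)
      } yield s"$product ($discount off, $image)"

The coupling is exactly the same as with in-process method calls; only now it runs over an unreliable network.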
So, clearly, this might not be the right thing to do, right? You have a monolithic application,
you decide to break it up. But you need to take a step back. The designers need to take
a step back and recognize the fact that we now have a distributed application,
which means that time is not a constant.
So we need to worry about time.
We need to worry about unreliable network.
We need to worry about network as such.
These components need to be talking to each other.
And because these components will now,
each of them will get their own life,
we need to worry about
versioning all of a sudden.
So all that brings up this
vast blob of
complexity that we need
to solve. And
you might have heard about this reactive
application development thing,
have you? Yeah.
I feel I'm not clear whether it's just a buzzword,
to be honest, like a marketing term.
I hear you.
What is Reactive?
So it has these four key components,
and it centers around being message-driven.
So the components, when they interact with each other,
it's not a request followed by a fully baked response. It's a world where, you know,
one of the components publishes a message and it says, okay, here's my favorite example.
So we're building this e-commerce website, and when we're
designing the checkout process,
we have this entire thing,
and we've integrated it with payments, this all works.
But each service,
when it runs, it should emit
messages.
Now, this is all sort of hand-wavy.
You're not seeing me live,
but I'm waving hands, so you can imagine
the picture in all
its horror. But
you publish a message onto some
message bus. So think Kafka,
think Kinesis, that sort of thing.
A durable message.
We'll get to that. So hang on to the word
durable message. Now, each service,
as it operates, it should emit
as much information about
its work as possible onto this message topic.
Now, you've successfully implemented this.
This first version of the e-commerce website works.
Now, because these services were emitting these events as they went along,
it is now possible, retrospectively, this is the cool thing,
it's now possible for someone to come along and say,
you know what, I'm going to build a blackmailing service.
The blackmailing service is going to go through all the historical messages
in this persistent message journal,
and it's going to pick out the embarrassing orders that I've made.
You wouldn't believe the stuff I buy on Amazon.
Now, if we design a microservice architecture that way,
so we are event-driven, each service publishes an event, the events are durable, it's possible later to
construct another service that consumes these past and future events and does more work, right?
This is how we extend. So this is the promise. Surely we can just add another service to our
system and it now does more things. Now that, if you think about it, can only be achieved if we have this historical record
of stuff that happened. Not just, here's an entry in the database; that's like a snapshot
in time. You have a stream of messages going from, like, message number one, the beginning of time, all the way to today.
Now, if you were to replay all the messages and write them to a database table, you'd get a snapshot in history.
You'd get a snapshot of when I replayed all the messages, here's the state of the system.
But as more messages come in, of course, this snapshot keeps changing. And so it's actually sometimes really useful to think of even a database table or some sort of persistent store as a snapshot of this system at a particular point in time. But the
life of it is expressed through these messages. So just so I'm clear, you're saying you designed
the e-commerce system as a standalone monolithic app. However, you make sure that you're emitting everything you're doing to some.
You know what? That could be a good start.
If you have a monolithic application, you think, I want to do microservices.
Emit stuff first and build the new stuff as a new microservice.
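As a rough sketch of what "emit stuff first" can look like, assuming Kafka as the durable message bus; the topic name and the JSON payload are hypothetical, not from the conversation:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Minimal producer configuration; the serializers ship with the Kafka client.
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    // Inside the existing monolith, after a checkout succeeds, also publish what happened.
    // "checkout-events-v1" is a hypothetical topic name.
    def emitCheckoutEvent(orderId: String, eventJson: String): Unit =
      producer.send(new ProducerRecord("checkout-events-v1", orderId, eventJson))

A new microservice can then be built purely as a consumer of that topic, without touching the monolith again.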
There's a whole bunch of value in
that versus the usual
enterprise
initiative that says
let's take what we have
and rewrite it.
I have yet to see one
successful rewrite from scratch.
It just never works.
There is so much value
and so much experience baked in existing code
that rewriting it is almost always a disaster. But extending it is a good idea. Now, extending it,
you don't want to add another monolithic bit onto it. So adding messaging, asynchronous messaging,
which is asynchronous, non-blocking, another reactive buzzword, right? As part of
processing of, I don't know, a request from a user, the last thing
that the app should do is to wait and block on some
sort of I/O.
Because we are distributed, after
all, and this I.O. could be
network.
Now, who knows how long that might take?
Sometimes even forever.
Sometimes there's no result.
And, you know, these TCP sockets time out after, I don't know, 60 seconds.
60 seconds.
There's no way on earth that a thread, this heavyweight thing,
should be blocked for 60 seconds.
So it needs to be
asynchronous. Now, modern operating systems,
I mean, the kernels of the modern
operating systems have
all these asynchronous primitives.
So it's possible to say
things like, begin
reading from a socket
and continue running
this thread, and then the OS will wake up
some other thread and will say, okay, yes, I now have the data.
Here's your callback.
I've read five kilobytes from the socket.
Here, you deal with it.
And then it's up to the application frameworks to manage it
in some reasonable way for the application programmers.
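A minimal sketch of that idea in plain Scala Futures, no particular framework; readFromSocket is a hypothetical asynchronous call standing in for the kernel primitive:

    import scala.concurrent.{ExecutionContext, Future}
    import ExecutionContext.Implicits.global

    // Hypothetical asynchronous read: it completes later, on another thread.
    def readFromSocket(): Future[Array[Byte]] = Future(Array.emptyByteArray)

    // Register what to do when the bytes arrive; the current thread is not parked.
    readFromSocket().foreach { bytes =>
      println(s"read ${bytes.length} bytes") // the "here's your callback" part
    }
    // Execution continues immediately; no heavyweight thread blocks for 60 seconds.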
So I do a lot of Scala, so there's a whole bunch of asynchronous,
convenient asynchronous libraries, you know, like Akka. Go obviously goes in a very similar way.
So is this fire and forget, or is it that I'm just not blocking, but then the scheduler is going to return something? Like, is every request returning like a 204 or something?
Like I say, save user, and they're like,
okay, we heard you.
That's also a really, really, really good question.
If you can get away with fire and forget,
your system will be so much quicker
and so much easier to write.
Now, a lot of our systems,
the users wouldn't accept fire and forget. So again, think about typical messaging. Read from a socket or write to disk
or write to this persistent journal. Now, if you accept at-most-once delivery, so fire and forget, what these components are
telling you is we're going to do our best. We'll try not to lose your message. And that's probably
okay for statistics. It's probably perfectly fine for health checks monitoring. But it's not okay
for, I don't know, payments. This is where it gets a little more difficult.
So if a component has taken money from your card,
the thing that receives it really should receive it.
Now, this is where it gets really complicated
because distributed system, right?
Now, okay, so you and I need to exchange information
in a guaranteed way.
What I can do is say, hey, Adam, so I have the payment for you and I can hang up or disconnect my computer.
Now, from where I'm sitting, I'll hear nothing from you.
Okay, okay, okay, okay. I don't know if the message that was going to you got lost,
or if you actually received it and you have it,
or if you actually received the message,
you began processing it,
but then you crashed.
So I should send it again in both cases.
Or is it the case that I sent a message,
it successfully got to you,
you processed it,
but then just as you were replying,
my network went down? In which case, you have it, but I don't know that. So I'm going to send the message again to
you. Now, that means that you might get duplication. So you might hear the same message again. Now,
there's nothing I can do about it, because from my perspective, I haven't heard a confirmation from you. So I need to send it to you again. You now need to do extra work to deduplicate.
Now that's problematic. How do you do deduplication? Well, okay. You hash the message, right? So
you compute some SHA-256 of it. You keep it in memory. Yeah. How much memory do you have?
Because this can go on for a really long time. Now, this
is where we say pragmatic things like, well, in reality, it's all, you know, we have a system.
We know that between you and I, we exchange like one message per second. So you think, well,
how much memory do I really need to remember this stuff? Oh, yeah, I'm going to need the last 10 messages.
So you sort of make sure you have memory for the hashes of the last 10 messages,
and that's good enough for you.
And the key thing is to measure the system, right,
and know how much is going through it.
And then you can tune these deduplicating strategies
and you can keep things in check.
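A minimal sketch of that deduplication strategy, with hypothetical names: hash each message and remember only the last N hashes, where N comes from measuring the real message rate.

    import java.security.MessageDigest
    import scala.collection.mutable

    // Remembers the hashes of the last `capacity` messages; anything older is forgotten.
    final class Deduplicator(capacity: Int) {
      private val seen = mutable.Queue.empty[String]

      private def sha256(message: Array[Byte]): String =
        MessageDigest.getInstance("SHA-256").digest(message).map("%02x".format(_)).mkString

      // Returns true if the message should be processed, false if it is a recent duplicate.
      def firstTime(message: Array[Byte]): Boolean = {
        val hash = sha256(message)
        if (seen.contains(hash)) false
        else {
          seen.enqueue(hash)
          if (seen.size > capacity) seen.dequeue() // keep the memory bounded
          true
        }
      }
    }

    // Roughly one message per second observed, so a small window is enough here.
    val dedup = new Deduplicator(capacity = 10)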
But it's a probability game.
At some point, something is going to go wrong.
If you phone me and say,
hey, I got your payment,
then that's like at most once
because you hang up,
you haven't heard an acknowledgement from me.
Exactly.
Then if you wait for me to acknowledge back,
then we have this problem where that's at least once, because maybe the line cuts out
while I'm telling you, so then you have to re-deliver, so I got it twice. So then, I mean, I
think what everybody really wants is exactly once.
I know, I want that too. I'd also like a wristwatch with a fountain. Never gonna happen.
So is exactly once possible?
Okay, yes, if you reduce the context, if you
remove a lot of the I/O, or if you are prepared to trade off something else. So you think,
I want exactly once. Okay. Okay. Okay. That's fine. Which now means that you need to say,
I need an extra coordinator and this extra service, extra coordinator can now tell me,
have I seen this message? Yes or no. And if the answer is yes, I've already seen this message, then okay, reject it. Don't even attempt to deliver it. And that's okay, but you've
sacrificed availability. And if this coordinator goes down, your system has to say, well, no,
I'm no longer sending anything. So it is a trade-off. Throughout this thing, it's a trade-off.
So in the phone example, I guess,
if we had some third person who tracked whether...
We both acknowledge to the third party, is that the idea?
That's the idea.
You'd have something else that listens and says,
yeah, no, that one can go through, that one can go through.
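A minimal sketch of that coordinator idea, with hypothetical types: the coordinator answers "have I seen this message before?", which gives effectively-once processing, but if the coordinator is unreachable, delivery has to stop, which is exactly the availability being traded away.

    import scala.util.{Failure, Success, Try}

    // Hypothetical coordinator: tracks which message ids have already been delivered.
    trait Coordinator {
      def alreadySeen(messageId: String): Try[Boolean] // Failure = coordinator unreachable
    }

    def deliver(messageId: String, payload: String, coordinator: Coordinator)
               (process: String => Unit): Unit =
      coordinator.alreadySeen(messageId) match {
        case Success(true)  => ()               // duplicate: reject, don't even attempt it
        case Success(false) => process(payload) // first time: process it
        case Failure(_)     =>
          // Coordinator down: to keep the guarantee, we stop sending anything at all.
          throw new IllegalStateException("coordinator unavailable, refusing to deliver")
      }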
So let's rewind.
You said earlier that... I think that you said, like, one of the main reasons for wanting a distributed system was deployment. It seems like a small reason to me, just to want to deploy components independently.
The big reason, the favorite reason, is we say scale.
That's the kicker, right?
And we always imagine the system that goes, you know what, I'll be Amazon.
This will be so cool.
I will handle 100,000 requests per second.
Maybe even more.
And then you say, okay, well, it's unreasonable for a monolithic application to be able to handle 100,000 requests per second
when the distribution of work is actually 90,000 of browsing, people just look for stuff,
and only 10,000 is people buying.
I think I've made those numbers up.
I think if that's the case, then, you know, Mr. Bezos is the happiest man on earth.
If you get like 90, 10, 9-1 browsing to actually buying,
oh, wow, that would probably be pretty good.
So I think the numbers are different, but it doesn't matter.
So, yes, it's scale.
And, of course, if the system is divided into different services,
each service can be scaled differently.
What's actually even more encouraging about it is that when the system is broken up into services, we can now really think about failure scenarios and what we do if something is broken. Take an e-commerce website. Let's say that the e-commerce website is really keen that people get whatever
goes into a shopping basket. People have to be able to get it, right? We're just going to believe
that. So if I put stuff in my shopping basket and browse, so that's one set of microservices
that I can do the search and the image delivery and all that personalization.
And it's all in my shopping basket. And I go to the payment, and the payment service goes down.
Now, one of the options that we have, which would be sort of silly in this particular example,
but nevertheless, reasonable to think about it, let's say we're really good,
and we say we want our customers to be super happy.
They've spent all this time putting their stuff in the shopping basket.
Regardless of whether the payment service works or not,
we're going to send the stuff.
And so if I make a...
See, now, okay, I know, silly, right?
But you make a request to this payment service,
and you say, hey, I want to now pay for this.
You know, I put a few DVDs in my...
I don't even have a DVD drive.
But, you know, years ago, right?
I put a couple of DVDs in my shopping basket,
and I paid for it.
I checked out.
Checked out.
The service that ran the website
made a request to the payment service.
That was down.
And so I said, never mind.
Let's just do another list.
Now, ridiculous, you say.
Of course it is.
Think, though.
Maybe you're subscribing to music,
online music.
You're a Spotify.
I've actually heard of an example.
So Starbucks, they have their member,
it's like a gift card or whatever,
but it's also their membership card, right?
And you can put coffees on it.
So I guess they have problems with their system going down.
And when down, they just give free coffees.
Absolutely right.
Absolutely right.
Now you can extend it to, say, a music service. So suppose you want to listen to something. You open your phone, you say, I'm going to subscribe, I'm going to subscribe using Apple Pay. And so the payment has gone through the Apple systems, and then the receipt gets sent to this music service, and it says, hey guys, so I have receipt number 47 from Apple. And the music service now says, okay, well, I'm going to go to Apple
and just double check whether receipt number 47 has really been paid.
Apple service goes 503 over capacity.
Now at that point, it's really reasonable for the music service to say,
ah, nevermind, I'm going to let the user listen to stuff.
For the next 15 minutes, I'm going to hand out the authentication token
that allows access to all this music.
But 15 minutes later, the user has to check back,
because by then I will have checked for sure with Apple, and I will know.
So this is actually a really reasonable thing to do.
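A minimal sketch of that degraded mode, assuming a receipt check that can fail with a timeout or a 503; the token type and the durations are hypothetical:

    import scala.concurrent.{ExecutionContext, Future}
    import scala.concurrent.duration._
    import ExecutionContext.Implicits.global

    final case class AccessToken(userId: String, validFor: FiniteDuration)

    // Hypothetical call to the payment provider; the Future fails if the provider is unreachable.
    def verifyReceipt(receiptId: String): Future[Boolean] = Future(true)

    def authorise(userId: String, receiptId: String): Future[Option[AccessToken]] =
      verifyReceipt(receiptId)
        .map { paid =>
          if (paid) Some(AccessToken(userId, validFor = 24.hours)) // verified: full access
          else None                                                // genuinely not paid: no access
        }
        .recover {
          // Provider over capacity or unreachable: trust the user for a short while
          // and check again properly when this short-lived token expires.
          case _ => Some(AccessToken(userId, validFor = 15.minutes))
        }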
Now, again, it sort of makes, I guess, the commercial people's blood go stone cold, right?
But what's the alternative?
If we didn't do this, then the alternative would have been really annoyed customers.
They would have used something.
Everything would have worked on their side.
Just because one of our services is down, they don't get to do what they want it to do.
So they pick up the phone or write a tweet. It's actually better in a lot of cases to just
trust people
and maybe allow them access
for a short duration of time.
Again, it depends on the industry and all that.
Generalizing overly or overly generalizing
even. But with
distributed system,
this is now possible.
If you were in a monolith,
if the payment service went down
because it's part of the monolith, that's it, you're done.
Nothing works.
If it's a distributed system, you have a better chance
of defining some sort of failure strategy.
And because the payment system publishes events
about what happened to it, you as developers,
you now have a chance of going through what happened and going,
oh, this sort of sequence of events might lead to a failure, which you could use to improve
your code, or maybe you can write some sort of predictive mechanism that tells the operations
team or tells whoever happens to be on pager duty that says, look, heads up, this is not going to be pretty.
We're still fine,
but something's coming.
And that is fantastic.
That is the one thing that keeps services running,
keeps the entire systems running.
It's interesting because
it brings to mind to me,
even if you have a monolithic app,
likely you already have
a distributed system, right?
Because you're calling out to
some payment processor, you're calling out to some database. So even though your application is
fully formed, right? I guess what I'm hearing you saying is that you need to model these
systems and what will happen when you can't reach them. I just always assume I can reach
the database. If I can't, you know, whatever.
Absolutely right.
Absolutely right.
And the worst thing about it
is that at small scale,
I don't mean it in a bad way,
but at low scale,
it's quite possible to go to AWS
and provision a MySQL instance
and that thing will run forever.
It'll never fail.
Because it's just one computer,
but as the number of computers
increases, the chance of failure goes right up.
Something is going to
fail. Now, I used to say this. I used to say, oh, well, we need to build these
distributed systems, and we need to treat failures as first-class citizens. This is really important.
And I said that out loud, and at the back of my mind, I thought, yeah, but come on,
when does this happen? Well, we now happen to have a system that handles significant load. And it happens every day. Every day, some service
goes down. A couple of nodes of a database
go down. A couple of brokers go down. They get restarted.
And it's fine. The key thing is, because we can
reason about this failure and because we're prepared for it,
it's fine. We survive it.
Our infrastructure keeps running. Our application
keeps running. The users are getting
their responses.
But internally, of course,
we can detect
errors and recover from them.
So that was
very, that was the opener, truly.
What do you think about this argument? I haven't heard it in a while, but it was very popular: build the monolith first, and then once you reach these problems, then you start splitting.
I mean, I don't have a general answer to that.
I think a lot of the times you can know what the expected load is going to be. If you know, as in you have an existing customer base and you know that you have a million customers.
And so the new service, when it goes live, it will get that load. Now, I think it would
be silly in that scenario to say, no, no, no, we're just going to start with this thing that we
know isn't right. And then we'll see how it goes, how it crashes.
Feel free to say it's a horrible idea. So it came, I think it was Martin Fowler who wrote about this
probably several years ago. And I think his argument was, if you try to build this new application as a distributed system,
but once it starts interacting with customers, your requirements start changing drastically.
And if you don't know where the change is going to be, maybe you've cut things up in a way that doesn't make sense.
And that was going to be my follow-on.
If it's a new application,
if this is just a, dare I say, startup,
then it makes total sense to experiment first.
But there was another,
I think that this was Googlers who wrote it,
and it goes along these lines:
take a system and say,
okay, I need to be able to handle 10 concurrent users.
Design it for 100.
But once you get to 1,000, it's a new system.
And once you get to 10x the original design,
it's just not going to work.
So, okay, I know 10 is a silly example, but design, say you have 100,000 users,
design a system that will hold up to, say, a million.
But once your usage grows out to be 10 million, the thing isn't suitable.
It's the wrong thing.
Now, I think this is where Martin Fowler was getting as well.
The design choices for something that handles 10 or 100 users might be completely different to the thing that handles
1,000. And if the starting position is that you have zero users, well, wow, you're right.
Design something, anything. Because as you point out, who knows what these silly users will say
when they finally log in and say, well, I don't like this. And you might cut your system up the wrong way. You might
define these consistency boundaries. That's a big word. I haven't come up with a big word for
a long time. So consistency boundary, or this context boundary. Because we're in a distributed system,
we have, I'm sure people know, the CAP theorem,
which says, okay, well, you have consistency, availability,
or partition tolerance.
You'd like all three, but you can only have two.
And as long as one of them is partition tolerance,
because you have a distributed system, right?
So partition tolerance,
now pick two, consistency or availability. And it really depends on what you're building. Now, the choice isn't usually quite... it's not binary. It's not like consistency 1-0. There's a scale.
But what you're ultimately dealing with is physics. You have a data center, say there's a data center in Dublin,
and there's a data center in East Coast, US East, Amazon, and EU West in Dublin.
It takes time for this electricity nonsense to get from America to Ireland.
It just does, and there's nothing you can do about it.
And so if you write a thing to a
system like a database in
Dublin, it's going to take
100 milliseconds before the
signal gets to
US East, and there's nothing you can do about
it. That's just life.
So what can you do?
You can say, well, so if someone else is reading the same row from, you know,
US East, they'll see old data because it's just not there yet.
So you can say, oh, okay, okay, okay, okay.
So you hate the idea.
Absolutely hate the idea.
It must be consistent.
When I look at my, this database row, I must see, wherever I look in the world,
I must see the same value.
All right. So you better have this other component, this coordinating guy that adds a tick and says,
yes, I've now heard acks from every data center, every replica, that they now have it.
And now it's good. Now I can release these read locks that people are waiting for.
Okay? It's consistent.
What if one of these network connections goes down?
What if this coordinating component goes down?
Oh.
There you go.
You can't be consistent.
So the only thing that the system can do is no bye-bye, no service.
I think even in monolithic applications, people don't think about the fact that maybe you're consistent
within the database, but that doesn't necessarily... Like, I imagine a scenario: you
have a bunch of app servers and a database behind it, like Postgres, and everything's ACID.
But, you know, one user reads a record and it displays it somewhere, and then they change some things. And then your little magic ORM software saves the record en masse back.
And somebody else could have the same record up at the same time.
So the consistency is lost. Sometimes people don't even realize it, I guess.
Oh, absolutely.
Oh, absolutely.
So the more you distribute the system, the more you have to deal with these scenarios, the more you have to deal with the possibility that something will be inconsistent.
And it can actually be a really, I've had a whole bunch of discussions with typically e-commerce people who find it absurd.
What do you mean that I don't know how much stuff I have in stock?
Well, you do roughly, but it's not down to like one unit.
So they say, oh, no, no.
How about we reserve items?
You can.
How about we lock items?
Yes, you can.
But as the number of users grows, the number of locks or these reservations grows.
And so it's really easy to get to the point where everything's
locked and no one can actually do anything because you want to be consistent. Because you want to be
consistent, you've sacrificed availability. And so, you know, people say, no, no, no, no, no, no,
no, no, no. You can't have that. I want to be consistent and available. Which means you say,
okay, fine. Buy one big, huge machine
and run everything on this one machine.
But even that's ridiculous.
That machine will have multiple CPUs.
It will have wires going through it and they will break.
So it's...
And the availability is lost if that machine goes down.
Oh, absolutely.
It's not pretty.
So you had an example about an image processing system you designed.
I did, yes.
I'm wondering if you could describe that a little bit.
Yeah, of course. So this was a really interesting thing.
Now, I guess people are a little bit more worried about their private information.
But nevertheless, this was in good old days where people uploaded pictures of their passports
freely onto any old service they could find.
Now, what we did is,
it was obviously doing a couple of processing steps
with these identity documents.
So what was the goal of the system?
It was to essentially provide biometric verification of the user.
So think you want to open a bank account and you don't want to go to the branch.
And the bank doesn't really want to open a branch for you because it costs a lot of money.
What they would like to do instead is to have the users use their phones to look into their phones and take a picture of their driving license or something.
And for this system to say,
yeah, no, this is looking good.
The driving license is the real deal.
It's not being tampered with.
And the thing that's staring into the camera,
that's the same face.
It's alive.
So that was that.
Now, of course, the users,
they want this thing to run beautifully smoothly, right?
And they will give you 10 seconds of attention, maybe.
So if the processing really isn't done within, say, 10 seconds,
maybe 15 if we invent some clever animation,
then this would have been a total failure.
So this needed to be obviously scalable, it needed to be parallel, it needed to be concurrent.
Many of these steps needed to be executed concurrently. And a lot of these steps were actually pretty difficult. As you can imagine, there was a whole bunch of machine learning involved. And so the PDF that you might have read,
the preview book that you might have read,
essentially describes the thing: there's a front service
that ingests a couple of images, as in high-resolution pictures
of documents or whatever else,
and it ingests a video of the person staring into the phone.
And then the downstream services then compute the results. So it needs to be OCR'd.
The face needs to be extracted from it. We need to check that
you haven't just stuck another picture on. We used to do this.
Well, I didn't, because I'm from Eastern Europe,
so we used to drink all the time.
But I keep hearing that people elsewhere in the world have to fake driving licenses
in order to prove that they are allowed to have a drink.
And so they take someone else's driving license
and put their picture on it.
So we need to be able to detect those scenarios, obviously.
And then combine the results.
At the end, there is a state machine that reacts to all these messages that are coming from these
upstream components. And as soon as it's ready, it needs to emit a result. So imagine that you're
the bank, right? You want to open a bank account. And you ask, you know, a question:
how much? Or maybe you're a betting company
or someone who needs to verify
the identity of a person.
And so, so if I'm a betting company
and I say, okay, you know what?
I'm going to play on the Grand National,
which is a UK horse race.
Okay, you need,
I've never bet before, right?
So I need to register
and prove that I'm over 18.
So I go through this process.
And then in the app, I say, how much do you want to bet?
And I say, I'll splash out and bet £2.
Well, right?
So the betting company then can have a ruling that says,
as soon as I hear from this system,
as soon as it emits a message that says,
look, you know what? We've OCR'd it, and this driving license looks like a UK driving license.
At that point, it might say, that's good enough.
It's fine.
Allow that to go through.
If, on the other hand, I was saying, well, look, I'm going to bet £3,000, they would actually need to wait for all these messages to arrive.
That is to say, it's OCR'd and we checked the driving license register to verify that
it's the real thing.
And we also compared the picture on the driving license with a face that was staring into
the phone.
And yes, it's the right person. Now, that is only possible
if your system emits these messages.
It would have been impossible to condense down
and to start processing
and then wait until we have everything.
We have this entire processing flow
and then send one callback,
one result to the gambling company.
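A minimal sketch of the consuming side of that, with hypothetical field names and thresholds: the verification results arrive as separate messages, and how many of them the consumer waits for depends on the stake.

    // Partial verification state, filled in as messages arrive from the upstream services.
    final case class Verification(
      ocrLooksLikeUkLicence: Option[Boolean] = None,
      licenceRegisterMatch:  Option[Boolean] = None,
      faceMatchesDocument:   Option[Boolean] = None
    )

    sealed trait Decision
    case object Accept      extends Decision
    case object Reject      extends Decision
    case object WaitForMore extends Decision

    // A £2 bet can be accepted on the first message; £3,000 needs the full set of checks.
    def decide(stakePounds: BigDecimal, v: Verification): Decision =
      if (stakePounds <= 10)
        v.ocrLooksLikeUkLicence match {
          case Some(true)  => Accept
          case Some(false) => Reject
          case None        => WaitForMore
        }
      else
        (v.ocrLooksLikeUkLicence, v.licenceRegisterMatch, v.faceMatchesDocument) match {
          case (Some(true), Some(true), Some(true))                            => Accept
          case (Some(false), _, _) | (_, Some(false), _) | (_, _, Some(false)) => Reject
          case _                                                               => WaitForMore
        }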
That makes sense.
Because if we were doing,
back to our startup,
generating something quick,
I might have some web server
that kind of throws these images
into whatever Kafka topic,
and then just something else
that just does everything, right?
Pulls it out and goes through it. So you're saying
the problem with that is
the upstream consumer needs to know
about specific...
It wants to know before it's done, I guess.
Yes, and the
main challenge
is that you don't know
what it wants to know.
Until you deploy
the first cut of this,
and until you say to your customers,
look, here is what we can do,
you don't get the feedback.
And what they might say is,
well, that's nice,
but couldn't you just send us,
I'm a gambling company,
couldn't you just send us
the initial processing?
That's good enough for us.
Oh, yeah,
so if client equals equals five,
then do this. And then some other client comes along and says, well,
you see, but if the betting
amount is more than 200, then we will need it.
That way lies pain. So it's much better to just
emit everything that you know as soon
as you know it from your services.
Now, if we
need to implement it in the system
and we decide together that we're going to be
so good to our customers that
we're going to implement this workflow engine,
we can as a separate service.
But as a
separate service, that's the key.
So we can defer that decision
absolutely.
So how does the client... the client is a confusing term here. So there's a user with a phone, but then there's your actual client, like the bank. So how are they consuming these results?
Now, there are multiple choices.
The most,
the crudest one is
we just push
stuff to them
over a connection.
Right?
So we say
to our
clients
who speak to
the system
that should
receive the
messages from us,
we say,
okay, guys,
why don't you
stand up a
publicly accessible
HTTP endpoint
and we're going
to post stuff
to you?
That's a horrible way of doing it.
Don't ever do that.
First of all, your customers will
hate it because they'll say,
no,
we don't want to
stand up a web server in order
to consume your service. That's a stupid idea.
Firstly, and secondly,
it
pushes the responsibility of
making sure that you're talking to the right endpoint,
like a dual service.
It would be really bad if, say, you had this complicated image processing service,
and if it posted data, private data, biometric data, to some other URL.
Yeah.
Through misconfiguration.
So, yeah, horrible.
So, how about we do it like Twitter
used to do? So we say, okay, you know, Mr. Client, open a long POST request, like a
connection, and we will send you chunks, HTTP chunks of messages, as soon as we receive something. Now, that's a much nicer way of doing it
because it just allows the client
to control where it's connecting to.
It also means that the client has to check the HTTPS certificate on the connection.
It has to know and it has to be sure
that it's talking to the right service
so we can wash our hands of this whole horrible
business.
It also means that
the client now is in control
of its consumption.
That's kind of nice.
What's slightly problematic
about it is the scale of it.
It is the
case of one image per second
variance, if it were, oh,
I don't know, a thousand images per second,
that's a bit tougher, right?
You would want multiple of these
connections to be opened.
And so now you need to think about a way to
partition which connection
gets which results
and which images, and they need to be
routed more carefully.
And you also need to think about
you need to be nice to
your clients and say, we understand
that the connection will go down
and you might have to make
a new connection. Okay, but
you need to be able to say things like, I'm making
a connection, but I want to start consuming
from record
75 instead of the last one.
So, but that can be, you know, a query parameter. So that's a neat way of really
bridging the gap between someone who says, so we're Scala and JVMs and Kafka, and, you know,
our Mr. Client says, no, I'm .NET, none of this stuff, don't even talk to me.
In which case, this sort of long HTTP connection works pretty well. Of course, the ideal scenario
is the client would say, well, we would like for your system to expose another kind of interface
and we can just mirror that. So that's
one of the other options that we offer.
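A minimal sketch of the client side of that long-lived connection, using the JDK HTTP client from Scala; the endpoint URL and the from query parameter are hypothetical:

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    val client = HttpClient.newHttpClient()

    // The server keeps the connection open and streams one result per line;
    // from=75 asks it to resume from record 75 after a reconnection.
    val request = HttpRequest
      .newBuilder(URI.create("https://results.example.com/results?from=75"))
      .GET()
      .build()

    // BodyHandlers.ofLines hands us the chunks line by line as they arrive.
    val response = client.send(request, HttpResponse.BodyHandlers.ofLines())
    response.body().forEach(line => println(s"received: $line"))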
Now, in the end, this particular
system, we ended up just with these HTTP
endpoints. So that's something
you do? You do integration
with Kafka?
We offered that, because
there was a particular deployment
that looked like it was going to be large
enough to warrant that, but
in the end, we dropped it. Interesting.
I can see why it would be easy from your side.
Oh, yeah, exactly. That's why we said, oh, no, no, we can
do this for you, but
it...
I didn't like it in the end because
it exposed
too much of our internal workings.
It felt like integration through the database.
It wasn't quite right.
Now, okay, you can restrict the topics that are replicated.
You can maybe transform them.
But it still felt a little bit crummy.
So I think it's much better to have a really sharp,
completely disconnected interface.
And even if you're talking, say you're running an AWS and you publish your messages to Kinesis,
and then you would say to your client,
oh, yes, I see you have Kinesis.
Why don't we push stuff to your Kinesis?
Again, it feels like you're letting your dirty laundry
out for everyone to see.
So there should be an integration service
that it might well consume from Kinesis,
do some cleanup and publish to another Kinesis.
That's all good, right?
And it might even be a lambda for all I care.
But it shouldn't be a direct connection.
So that's interfacing externally.
So inside of your microservices world,
how do you draw these?
How do you agree on these formats?
Formats being the message bus or the...
Yeah, how do you agree on how you communicate
between the various microservices?
So a lot of it is driven by the environment.
So we have one system that's divided,
and part of it lives in a data center,
and part of it lives in AWS.
Now that drives the choice,
because there's no Kinesis in a data center.
We had to use Kafka in that particular case.
And then we publish messages out to Kinesis,
and the AWS components consume Kinesis.
What's actually more important is to think about
the properties of this messaging thing
and think about the payload that goes on this thing.
So in a distributed system,
one of the super evil things to have is ephemeral messages.
So if the integration between our components,
our services is like an HTTP request,
there is no way to get back to it.
If once the request is made, it's made, it's gone.
There's no way, no record of it anywhere.
So what we really prefer is persistent messaging.
And you can then think of these message brokers
as both messaging delivery mechanism
as well as journal in the CQRS event-sourced sense.
So the message bus contains all messages for a duration of time,
but they're not lost.
So if I publish a message to Kinesis, I publish a message to Kafka,
once the publisher gets the confirmation that, yes, okay,
the sufficient number of nodes of
Kinesis or Kafka have the record.
It's there, right?
It sits somewhere. It can be
re-read again. It's persistent.
Now, actually, terms and conditions apply;
that's not exactly what's happening with
Kafka, because what happens is,
by default, you have a cluster of,
like, n nodes,
and if you say, I want to receive a confirmation
when the record is published on a quorum of nodes,
that still doesn't mean that the message is written to disk.
It just means that the quorum number of nodes have the message in memory.
Now, it's a probabilities game.
You'd have to be really unlucky that all of these nodes would fail before they would flush their memory to disk.
Of course, you can say, no, no, no, it needs to be fsynced, but then the message rate goes right down.
Now it needs to be fsynced, and we have a distributed file system, right?
So one of the components is running in OpenShift, which has GlusterFS, which is a
distributed network file system painted over a number of SSDs. And so we have a cluster of Kafka brokers that use this GlusterFS, which we're not entirely happy about, but we're not also
entirely unhappy about, because it runs. But hey, right? This is confusing you.
It wouldn't be fun if it weren't messy.
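For the producer side of that, a minimal sketch of the dial being discussed; acks is a real Kafka producer property, and the caveat in the comments is the one described above:

    import java.util.Properties

    val props = new Properties()
    // Wait for all in-sync replicas to acknowledge each write before confirming it.
    props.put("acks", "all")
    // Caveat: "acknowledged" means the replicas have the record (typically in memory and
    // the page cache), not that it has been fsynced; forcing flushes is a broker-side
    // setting such as log.flush.interval.messages, and it costs throughput.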
So just to make sure that I'm understanding.
So we're going to...
Your image service has a bunch of microservices inside of it.
And they are communicating with each other
using Kafka topics.
And we kind of assume that that is persistent,
even if some conditions apply.
So how do you organize it in terms of message schemas and topics?
That's a good question.
So for the messages, we've chosen protocol buffers as the format and description of our messages.
So we're very, very, very strict about it.
And we actually have...
This is where people are going to get really angry
because one of the things about microservices is that they are independent.
Each microservice is its own world.
It shouldn't have any shared code and shared dependencies.
Well, I mean, true, but then reality kicks in.
Now, what we've done,
so people won't get angry,
we actually created one separate Git repository
that contains our protocol definitions
for all the services.
Now, this is weird.
It's really kind of weird.
But this one thing allows us humans
to spot when someone is making a change to the protocol.
So if there's a pull request to this one central repository that contains all the protocol definitions for the services,
the entire team is on it and goes, oh, wait a minute.
If we merge this, is it going to break any of our stuff?
Because remember, these microservices have independent lifecycle.
And so it's quite possible to have version 1.2.3 of service A
running alongside version 1.4.7 of service B.
And following the usual semantic versioning rules,
these should be compatible.
Well, except semantic versioning is only as good as the
programmers who make it. And so this is why we have this shared repository and we have
humans in my team sort of eyeball it and say, is this going to work? Yes or no? When we
build the microservices, they all build protocol implementations using this shared
definitions repository.
So if there's a C++ version, protoc generates
its own C++ version
of the protocol.
If there's a Scala version, it's Scala code.
But then it's all
quote-unquote guaranteed
to work together when it's deployed.
Is protobuf the best solution,
or what do you think of?
Ah,
now, to me,
the best solution is
language that has
good language tooling.
So you could say,
oh, well, we should have used Avro.
Surely Avro is more efficient.
It has support in Kafka and all that.
Maybe we should have used Thrift or something else.
But when we were developing the system,
Protobuf had really mature tooling for the languages that we used.
So there was tooling for Swift and Objective-C for the clients.
There was good tooling for C++, and there was good tooling for Scala. And that made the difference.
So I guess what I'm saying is you're free to choose which of the protocols you like,
protocol language, protocol definition language, as long as the tooling matches your development lifecycle and as long as the
protocol is rich
enough to allow
you to express
all the messages
that you're
sending.
That makes
sense.
So how do
you decide
about topics?
Like, do you
have each, is
each microservice
pushing to a
singular topic
or?
Yes, so that's a good question.
Typically, each microservice is pushing to multiple topics.
So there is the basic results topic,
but it might have partial results,
it might have some statistics.
So these are definitely topics that...
I can't say that one service only ever consumes from one topic and publishes to one topic.
A typical service in that system will consume from two topics and publish to maybe three.
It is worth mentioning though, particularly with versioning, is the way we've
gone about it is we have topics for major versions. So there's like a V1 of images.
And obviously V1 will be forwards and backwards compatible across all the
minor versions within the one
major version. And if we ever
decide to deploy image service
v2, which presumably is completely
different, right? Because it's just not
compatible, then we would create
an image-v2 topic
for it. I mean, it seems to be a theme you're saying
is to make these things explicit?
Uh-huh. Absolutely right.
Absolutely right.
You have to be able to talk about these things
and accept reality.
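A minimal sketch of that topic-per-major-version convention; the topic names are illustrative:

    // One topic per major version of the protocol: minor versions stay compatible within it,
    // and an incompatible change gets a brand new topic rather than breaking the old one.
    def topicFor(service: String, majorVersion: Int): String = s"$service-v$majorVersion"

    val imagesV1 = topicFor("images", 1) // "images-v1": all compatible 1.x services share this
    val imagesV2 = topicFor("images", 2) // "images-v2": created only for the incompatible v2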
What always scares
me the most is when people
describe the system and they say,
we will break up the monolith, or we are designing
a microservices-based architecture,
and we're going to have this component
and this component, and when it gives us a response, we're going to do this, that, and the other. My first question in
all those scenarios is, what if it's down? What if it's not available? And, okay, so this is very
esoteric. Let's go back to a user database. If you're building a user database and you have a
login page, and you have like an e-commerce
music shop or something.
And my question to you would be,
okay, what do you do when your user database is
down and someone wants to log in?
Freak out. Get paged.
There you go. Now, a lot of people
will say, well, what can we possibly do?
The database is down.
But what you
can do, of course, option A is
denied. You can't log in.
Option B is, okay, fine, you're logged in.
I don't care what username
and password you typed in. You're logged in.
We'll take your word for it.
We're going to issue you a token that's good
for the next two minutes.
How about that?
And it's this thing, right?
And then you say it to your product team and they say,
are you ridiculous? We can't allow that to happen. And then you say, okay, fine, fine, fine. I hear
you. So option B is if the database goes down and it will go down, then your service is down.
Well, we don't like, we hate that as well. And we do. I know. I hate the database
going down, and I'm sure that
product owners hate the idea of just letting people in.
But we have to make a choice.
That's reality.
It makes sense, yeah. For a music service,
for a bank,
I think that...
Absolutely.
And this is where consistency,
this is where the dials come in, right? It's not a one-zero.
So of course, if I can't verify my online banking identity, I can't let you in. What if though,
let's say there are a number of banks in the UK. One of them actually had a
massive IT failure last week. Anyway, you're running a bank and you have
2005
code and you're doing
two-factor authentication
by text messages. Okay.
I was able to log in;
my username and password, or whatever,
the third digit of my
security code, was fine,
but the SMS sending service is
down.
And now it's a reasonable choice.
Maybe a possibly reasonable choice is to allow read-only access.
Makes sense.
Because then the SMS...
Just not do two-factor.
That could be a choice you could make.
Absolutely.
Interesting.
So it has to be...
You have to think about it.
And actually, designing these systems,
these particular systems
that are made up of multiple components,
as we design them, we always say,
what if it's not there?
What if it's unreliable?
What if it's slow?
And not treat it as either an academic discussion,
as in, well, what if the data center is offline?
No, no, no.
I mean these mundane, boring things, failures that happen all the time.
Okay, well, someone broke the table.
The database is inaccessible.
My SMS sending service is down.
The email SMTP server is down.
And they happen all the time.
And so that's the first layer of digging. The answer shouldn't be, oh well. There has to be something, there has to be a decision. And it's
the implicit oh well that causes the most problems.
Oh well could be a decision, I guess. Or, I mean, I guess it's implied you're saying we need an explicit decision to say we have to.
That's exactly it.
It can never be.
We assume that.
And it takes so much of mental practice
because we are used to things.
And, you know, we program, right?
We say, I don't know, select star from user.
And if the TCP connection is denied,
then most people, me, myself
included, will say,
what do you want?
Seriously, what do you want?
But if we don't say it out loud,
if we don't say that
if the database isn't there,
I'm not doing anything, then
other people won't know that we have made
this decision. It's this hidden
decision, and that's going to come back and bite us.
I assume also the cutting things
up this way into
microservices allows us to be
more explicit about various
parts failing while others remain.
Like in a monolith, it's hard to make these...
Yes.
Absolutely.
The lines aren't clear, I guess, between...
Yeah, absolutely.
it's
you know
people want
we want it right
when you go on
Netflix
and you watch stuff
and you're just like
no you expect it to happen
because after all
you paid $10
and
you know
I want
the world-class service
I want
files to be ready
on CDNs,
ready for me to start watching
within a second of pressing play.
Because come on, I paid my $10.
You know, we are guilty of that expectation.
And so I find it bizarre that we would then be at work
and say things like, no, no, well, you know,
it's the best we can do.
And we're building a system for a bank
or for a retailer
or for something
that actually gets more money than
the $10 per month on Netflix.
So Jan, you're
the CTO of Cake Solutions.
What's that like?
It doesn't sound to me
like you're doing CTO things.
It sounds like you're
in the weeds of designing distributed systems.
Well, quite.
I mean, I guess I would hate the idea of not understanding what my teams are doing.
That frightens me.
And the same goes for non-coding architects and that sort of thing.
I think to actually make a useful decision,
one really has to know what's involved.
And I think these are complicated decisions.
And so it would really annoy me if I didn't know,
if I didn't understand, and if I had to go to meetings
where I would hear things that people talk about,
and I sat there and thought,
well, it sounds complicated.
I suppose they're right.
That would really frighten me.
So I think it's actually super important
for techies to be leading
big teams and big companies.
I think that was an interesting article
I read a while ago that said
that, I think it was
Joel Spolsky, in fact,
that talks about his time
at Microsoft.
He said that Bill Gates,
you know, the back then CEO,
I don't know, C-something, chief very important person.
Yeah.
He said he would dig into technical details. He says he remembers the time when Bill Gates was insisting
that there is a reusable rich text editor component.
Now, that can only come from really understanding
what the hell's going on in technology.
Without it,
how would
a CTO who doesn't code,
how would a CTO who doesn't understand any of this
insist on having a reusable rich
text component?
So, yeah, that's
my take on it anyway.
I think it's great. I think that all CTOs should take that on.
I think, I guess, people just get buried by the ins and outs of the job and sometimes...
And that's fair, you know, it's fair.
And it's tempting to say, to achieve the optimum meeting density, as Wally says in truth.
You know, it's tempting, but no, resist.
Honestly, if anyone's listening, don't do it.
It's horrible.
Be interested.
There is so much interesting stuff going on.
That goes down to teams that are implementing all these microservices.
Some of the stuff is, on the outside, boring.
Oh, goodness, it's boring. Who would
like to make a PDF report? Really? You know, Jasper reports, and off we go. But what I'm
always reminded of, and again, I'm terrible with names, no idea who said this, but when
one looks closely enough, everything is interesting. And I think that's really the case.
Like, even PDF reports
can be made more interesting.
I mean, in the darkest of days,
we used all our bloody transformers
to make PDF reports.
So the PDF report thing is still boring.
But we could use this other stuff,
which was really interesting.
And I guess my point is,
you know, this is the time to
be alive.
There's so much
technology available to us that I
really don't think that anything can be boring.
That's a great
sentiment. Do you want to say a little
bit about Cake Solutions before we wrap up?
Well,
sure.
I'm sure you've heard what we used to do. So we used to do these, and we still make,
distributed systems. But of course, we were working
for our clients, and then about a year ago we were acquired by BAMTech. And BAMTech was subsequently acquired by Disney.
And so we now
like distributed systems, it's the same stuff,
right? But we now concentrate on
media delivery. So
if anyone is
a sports fan, and if you guys are watching
ESPN+,
there you go, that's the stuff.
Those are the distributed
systems.
I can't tell you, but
in 2019,
there will be a thing
that will just be the best thing
in existence, like sliced bread,
nothing.
Wow, it sounds exciting. I can understand
where you're coming out from scale, then, if you're working on ESPN.
Oh, God, yeah, absolutely.
This was the volume of media.
And I can't be completely specific about what we do,
but you can tell how delivering that sort of experience is super important.
It's super important for these systems to deliver content to our users.
Now, I remember back in the days
when we were making these
quote-unquote ordinary distributed systems
for, you know, banks and e-commerce.
Well, you know,
if an e-commerce app is not available,
it's annoying.
You know, we can get it in our days.
Imagine, though, there's a game, there's like a baseball game, football. You're a fan. You really want to see this, right?
And... a 500? No. Well, first of all, it's live, so, like, it cannot happen. You cannot have a 500.
Because where in e-commerce users were annoyed if the thing went down, now they're angry, deeply angry. So, you know, that was quite an eye-opener. But having
this distributed system, comprised, in part, of these microservices, allowed us to
think about what will happen. What do we do if something breaks?
What's the mitigation?
Because ultimately the motivation is to deliver content.
You know, our users have paid for this.
And they're fans.
This is their passion.
They want to see this video.
So there you have it.
Awesome.
So thanks so much for your time, Jan.
It's been great.
It was an absolute pleasure.