CoRecursive: Coding Stories - Tech Talk: Micro Services vs Monoliths With Jan Machacek
Episode Date: June 6, 2018Tech Talks are in-depth technical discussions. I don't know a lot about micro services. Like how to design them and what the various caveats and anti-patterns are. I'm currently working on a proje...ct that involves decomposing a monolithic application into separate parts, integrated together using Kafka and http. Today I talk to coauthor of upcoming book, Reactive Systems Architecture : Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you'll really enjoy this interview. The insights from this conversation are already helping me. Contact Jan Machacek is the CTO at Cake Solutions. Videos long lived micro services Book - Reactive System Architecture
 Transcript
 Discussion  (0)
    
                                         Welcome to Code Recursive, where we bring you discussions with thought leaders in the world of software development.
                                         
                                         I am Adam, your host.
                                         
                                         Hey, here is my confession. I do not know a lot about microservice architectures.
                                         
                                         I'm currently working on a project that involves decomposing a monolithic application into separate parts and integrating those parts using Kafka and HTTP.
                                         
                                         Today I talked to the co-author of the upcoming O'Reilly book, Reactive Systems Architecture, subtitled Designing and Implementing an Entire Distributed System.
                                         
                                         If you want to learn some of the hows and whys of building a distributed
                                         
                                         system, I think you'll enjoy this interview.
                                         
                                         The insights from this conversation are
                                         
    
                                         already helping me.
                                         
                                         Jan Machacek.
                                         
                                         Actually, is that how you pronounce your name?
                                         
                                         It's Moflem.
                                         
                                         Sorry, I'll shut up.
                                         
                                         Jan
                                         
                                         Machacek is the CTO of Cake Solutions and a distributed systems expert.
                                         
                                         How do you feel about being called an expert?
                                         
    
                                         It's false advertising entirely.
                                         
                                         No idea if I would call myself expert.
                                         
                                         I've had my fingers burnt quite a few times.
                                         
                                         So I'm not sure if that qualifies as the answer.
                                         
                                         I think that it qualifies
                                         
                                         like from where I'm sitting, you know, somebody
                                         
                                         who's made some mistakes
                                         
                                         isn't a great place to stop me from making them.
                                         
    
                                         Okay, I'll do my best and make more
                                         
                                         mistakes.
                                         
                                         So recently I've been trying to get
                                         
                                         up to speed on how to split
                                         
                                         up a monolithic app
                                         
                                         and also just about
                                         
                                         how you might design a distributed system in general.
                                         
                                         And my former colleague, Pete, recommended I check out your work. So why would I build a
                                         
    
                                         distributed system? Now, that's actually a killer question. A lot of the times,
                                         
                                         so typical answer is that people say, oh, well,
                                         
                                         we don't like this monolithic application because it's monolithic.
                                         
                                         That's the only reason.
                                         
                                         And by monolithic, I mean a thing where all the modules, all the code that makes up the
                                         
                                         application runs in one container.
                                         
                                         I'm using the word container in this most generic way.
                                         
                                         So, you know, maybe a Docker container, maybe a JVM or something,
                                         
    
                                         but everything runs in this one blob of code.
                                         
                                         And that doesn't actually mean that it doesn't scale, right?
                                         
                                         It's quite possible to think of multiple instances
                                         
                                         of this monolithic application,
                                         
                                         monolithic in the sense that it contains the thing that deals with orders
                                         
                                         and the thing that deals with generating emails and what have you.
                                         
                                         But it scales out, right?
                                         
                                         It runs on multiple computers and it talks to, I don't know,
                                         
    
                                         again, a scaled-out database.
                                         
                                         Now, that might be okay for business,
                                         
                                         but clearly it's not really entirely satisfying for development.
                                         
                                         And soon enough, the two will meet,
                                         
                                         and business will say things like,
                                         
                                         well, guys, we need you to update the blah component,
                                         
                                         the thing that generates the emails that go out to our customers.
                                         
                                         And the development team says,
                                         
    
                                         yeah, okay, yeah, yeah, sure, we can do that.
                                         
                                         Tappity, tappity, tap.
                                         
                                         And then they say, yep, that's all done.
                                         
                                         So when can we redeploy this entire thing?
                                         
                                         Now, that's where it's starting to get a bit ropey.
                                         
                                         I just think, you know, the product owners,
                                         
                                         the business will have questions like,
                                         
                                         okay, yeah, yeah, okay.
                                         
    
                                         Are you sure you haven't, like, everything else is fine, right? And then we, the programmers, say, oh, okay, yeah, yeah, okay. Are you sure you haven't, like everything else is fine, right?
                                         
                                         And then we, the programmers say,
                                         
                                         oh, yeah, yeah, yeah.
                                         
                                         We haven't changed a single line of code
                                         
                                         in this other stuff.
                                         
                                         We've only changed the lines of code
                                         
                                         that make up the email sending service.
                                         
                                         And we deploy it and then something breaks.
                                         
    
                                         Now that's actually really unpleasant.
                                         
                                         So we say, oh, we don't want that.
                                         
                                         We realize that the entire system is made up of smaller components and they have a different life cycle.
                                         
                                         So we want to break them up.
                                         
                                         So we want to have a service that does email sending.
                                         
                                         We want to have a service that does, I don't know, the product management and some sort of e-commerce component that takes payment.
                                         
                                         And so we break it up.
                                         
                                         Now, when we do, we gain the flexibility
                                         
    
                                         of deploying these smaller components independently,
                                         
                                         but there's a price to pay, right?
                                         
                                         The complexity of the application remains the same.
                                         
                                         And the fact that we've broken it up and we said,
                                         
                                         okay, this is now simpler.
                                         
                                         Well, the complexity hasn't gone away.
                                         
                                         That still stays.
                                         
                                         And we've just pushed it aside to the way in which these components talk to each other.
                                         
    
                                         So this brings me to like the first monumental mistake.
                                         
                                         And yeah, I've made it.
                                         
                                         We said, oh, well, what we actually have now is a distributed monolith.
                                         
                                         What I mean by that is application that runs in multiple processes,
                                         
                                         multiple containers, if you like, but its components need to talk to each other.
                                         
                                         So to display a web page, I need to talk to the product service. And I also need to talk to the
                                         
                                         discount service and the image service. And I need to get all three results before I can render out
                                         
                                         the page. And if one of these components fails, ah, well, that's
                                         
    
                                         tough. It doesn't work. So that is a particularly terrible example of a distributed monolith.
                                         
                                         Essentially, what you've replaced is the in-process method calls and function calls with
                                         
                                         RPC. Now, even when I say it out loud, it sounds pretty unpleasant. Who would want to do that?
                                         
                                         So, clearly, this might not be the right thing to do, right? You have a monolithic application,
                                         
                                         you decide to break it up. But you need to take a step back. The designers need to take
                                         
                                         a step back and say, is the fact that we now have a distributed application,
                                         
                                         which means that time is not a constant.
                                         
                                         So we need to worry about time.
                                         
    
                                         We need to worry about unreliable network.
                                         
                                         We need to worry about network as such.
                                         
                                         These components need to be talking to each other.
                                         
                                         And because these components will now,
                                         
                                         each of them will get their own life,
                                         
                                         we need to worry about
                                         
                                         versioning all of a sudden.
                                         
                                         So all that brings up this
                                         
    
                                         vast blob of
                                         
                                         complexity that we need
                                         
                                         to solve. And
                                         
                                         you might have heard about this reactive
                                         
                                         application development thing,
                                         
                                         have you? Yeah.
                                         
                                         I feel I'm not clear whether it's just a buzzword,
                                         
                                         to be honest, like a marketing term.
                                         
    
                                         I hear you.
                                         
                                         What is Reactive?
                                         
                                         So it has these four key components,
                                         
                                         and it centers around being message-driven.
                                         
                                         So the components, when they interact with each other,
                                         
                                         it's not a request followed by a fully baked response. It's a world where, you know,
                                         
                                         one of the components publishes a message and it says, okay, here's my favorite example.
                                         
                                         So we're building this e-commerce website, and when we're
                                         
    
                                         designing the checkout process,
                                         
                                         we have this entire thing,
                                         
                                         and we've integrated it with payments, this all works.
                                         
                                         But each service,
                                         
                                         when it runs, it should emit
                                         
                                         messages.
                                         
                                         Now, this is all sort of hand-wavy.
                                         
                                         You're not seeing me live,
                                         
    
                                         but I'm waving hands, so you can imagine
                                         
                                         the picture in all
                                         
                                         its horror. But
                                         
                                         you publish a message onto some
                                         
                                         message bus. So think Kafka,
                                         
                                         think Kinesis, that sort of thing.
                                         
                                         A durable message.
                                         
                                         We'll get to that. So hang on to the word
                                         
    
                                         durable message. Now, each service,
                                         
                                         as it operates, it should emit
                                         
                                         as much information about
                                         
                                         its work as possible.
                                         
                                         On to this message topic now.
                                         
                                         You've successfully implemented this.
                                         
                                         This first version of the e-commerce website works.
                                         
                                         Now, because these services were emitting these events as they went along,
                                         
    
                                         it is now possible, retrospectively, this is the cool thing,
                                         
                                         it's now possible for someone to come along and say,
                                         
                                         you know what, I'm going to build a blackmailing service.
                                         
                                         The blackmailing service is going to go through all the historical messages
                                         
                                         in this persistent message journal,
                                         
                                         and it's going to pick out the embarrassing orders that I've made.
                                         
                                         You wouldn't believe the stuff I buy on Amazon.
                                         
                                         Now, if we design a microservice architecture that way,
                                         
    
                                         so we are event-driven, each microservice architecture that way, so we are
                                         
                                         event-driven, each service publishes an event, the events are durable, it's possible later to
                                         
                                         construct another service that consumes these past and future events and does more work, right?
                                         
                                         This is how we extend. So this is the promise. Surely we can just add another service to our
                                         
                                         system and it now does more things now that
                                         
                                         can only be if you think about it that can only be achieved if we have this historical record
                                         
                                         of stuff that happened not just here's an entry in the database that that's that's like a snapshot
                                         
                                         in time you have stream of messages going from like message number one, the beginning of time, all the way to today.
                                         
    
                                         Now, if you were to replay all the messages and write them to a database table, you'd get a snapshot in history.
                                         
                                         You'd get a snapshot of when I replayed all the messages, here's the state of the system.
                                         
                                         But as more messages come in, of course, this snapshot keeps changing. And so it's actually sometimes really useful to think of even a database table or some sort of persistent store as a snapshot of this system at a particular point in time. But the
                                         
                                         life of it is expressed through these messages. So just so I'm clear, you're saying you designed
                                         
                                         the e-commerce system as a standalone monolithic app. However, you make sure that you're emitting everything you're doing to some.
                                         
                                         You know what? That could be a good start.
                                         
                                         If you have a monolithic application, you think, I want to do microservices.
                                         
                                         Emit stuff first and build the new stuff as a new microservice.
                                         
    
                                         There's a whole bunch of value in
                                         
                                         that versus the usual
                                         
                                         enterprise
                                         
                                         initiative that says
                                         
                                         let's take what we have
                                         
                                         and rewrite it.
                                         
                                         I have yet to see one
                                         
                                         successful rewrite it from scratch.
                                         
    
                                         It just never works.
                                         
                                         There is so much value
                                         
                                         and so much experience baked in existing code
                                         
                                         that rewriting it is almost always a disaster. But extending it is a good idea. Now, extending it,
                                         
                                         you don't want to add another monolithic bit onto it. So adding messaging, asynchronous messaging,
                                         
                                         which is another bit, another asynchronous non-blocking, another reactive buzzword, right? You don't want
                                         
                                         to, as part of
                                         
                                         processing of, I don't
                                         
    
                                         know, a request from
                                         
                                         a user, the last thing
                                         
                                         that the app should do is to
                                         
                                         wait and block on some
                                         
                                         sort of I.O.
                                         
                                         Because we are distributed, after
                                         
                                         all, and this I.O. could be
                                         
                                         network.
                                         
    
                                         Now, who knows how long that might take?
                                         
                                         Sometimes even forever.
                                         
                                         Sometimes there's no result.
                                         
                                         And, you know, these TCP sockets time out after, I don't know, 60 seconds.
                                         
                                         60 seconds.
                                         
                                         There's no way on earth that a thread, this heavyweight thing,
                                         
                                         should be blocked for 60 seconds.
                                         
                                         So it needs to be
                                         
    
                                         asynchronous. Now, modern operating systems,
                                         
                                         I mean, the kernels of the modern
                                         
                                         operating systems have
                                         
                                         all these asynchronous primitives.
                                         
                                         So it's possible to say
                                         
                                         things like, begin
                                         
                                         reading from a socket
                                         
                                         and continue running
                                         
    
                                         this thread, and then the OS will wake up
                                         
                                         some other thread and will say, okay, yes, I now have the data.
                                         
                                         Here's your callback.
                                         
                                         I've read five kilobytes from the socket.
                                         
                                         Here, you deal with it.
                                         
                                         And then it's up to the application frameworks to manage it
                                         
                                         in some reasonable way for the application programmers.
                                         
                                         So I do a lot of Scala, so there's a whole bunch of asynchronous,
                                         
    
                                         convenient asynchronous libraries, you know, like Akka. Go obviously goes in a very similar way. or is it I'm just not blocking, but then the scheduler is going to return something? Like is every request returning like 204 or something?
                                         
                                         Like I say, save user, and they're like,
                                         
                                         okay, we heard you.
                                         
                                         That's also a really, really, really good question.
                                         
                                         If you can get away with fire and forget,
                                         
                                         your system will be so much quicker
                                         
                                         and so much easier to write.
                                         
                                         Now, a lot of our systems,
                                         
    
                                         the users wouldn't accept fire and forget. So again, think about typical messaging. Read from a socket or write to disk
                                         
                                         or write to this persistent journal. Now, if you accept at most one's delivery, so far and forget, what these components are
                                         
                                         telling you is we're going to do our best. We'll try not to lose your message. And that's probably
                                         
                                         okay for statistics. It's probably perfectly fine for health checks monitoring. But it's not okay
                                         
                                         for, I don't know, payments. This is where it gets a little more difficult.
                                         
                                         So if a component has taken money from your card,
                                         
                                         the thing that receives it really should receive it.
                                         
                                         Now, this is where it gets really complicated
                                         
    
                                         because distributed system, right?
                                         
                                         Now, okay, so you and I need to exchange information
                                         
                                         in a guaranteed way.
                                         
                                         What I can do is say, hey, Adam, so I have the payment for you and I can hang up or disconnect my computer.
                                         
                                         Now, from where I'm sitting, I'll hear nothing from you.
                                         
                                         Okay, okay, okay, okay. I don't know if the message that was going to you got lost and you actually
                                         
                                         received it and you have it,
                                         
                                         or if you,
                                         
    
                                         or if the,
                                         
                                         you actually received the message,
                                         
                                         you began processing it,
                                         
                                         but then you crashed.
                                         
                                         So I should send it again in both cases.
                                         
                                         Or has,
                                         
                                         is it the case that I sent a message,
                                         
                                         it successfully got to you,
                                         
    
                                         you processed it,
                                         
                                         but then just as you were replying,
                                         
                                         my network went down, in which case you have it, but then just as you were replying, my network went
                                         
                                         down. In which case, you have it, but I don't know that. So I'm going to send the message again to
                                         
                                         you. Now, that means that you might get duplication. So you might hear the same message again. Now,
                                         
                                         there's nothing I can do about it, because from my perspective, I haven't heard a confirmation from you. So I need to send it to you again. You now need to do extra work to deduplicate.
                                         
                                         Now that's problematic. How do you do deduplication? Well, okay. You hash the message, right? So
                                         
                                         you compute some SHA-256 of it. You keep it in memory. Yeah. How much memory do you have?
                                         
    
                                         Because this can go on for a really long time. Now, this
                                         
                                         is where we say pragmatic things like, well, in reality, it's all, you know, we have a system.
                                         
                                         We know that between you and I, we exchange like one message per second. So you think, well,
                                         
                                         how much memory do I really need to remember this stuff from yonder. Oh, yeah, I'm going to need the last 10 statements.
                                         
                                         So you sort of make sure you have memory for the 10 hashes of the last 10 sentences,
                                         
                                         and that's good enough for you.
                                         
                                         And the key thing is to measure the system, right,
                                         
                                         and know how much is going through it.
                                         
    
                                         And then you can tune these deduplicating strategies
                                         
                                         and you can keep things in check.
                                         
                                         But it's a probability game.
                                         
                                         At some point, something is going to go wrong.
                                         
                                         If you phone me and say,
                                         
                                         hey, I got your payment,
                                         
                                         then that's like at most once
                                         
                                         because you hang up,
                                         
    
                                         you haven't heard an acknowledgement from me.
                                         
                                         Exactly.
                                         
                                         Then if you wait for me to acknowledge back
                                         
                                         then we have this problem where that's at least once because maybe the the line comes out yes
                                         
                                         while i'm telling you so then you have to re-deliver so i got it twice so then i mean i
                                         
                                         think what everybody really wants is exactly what i know i want that too i i'd also like a wristwatch with a fountain never
                                         
                                         gonna happen so is exactly once possible um okay uh yes in if you reduce the context so if you
                                         
                                         remove a lot of bio or if you are prepared to say, to trade off something else. So you think
                                         
    
                                         I want exactly once. Okay. Okay. Okay. That's fine. Which now means that you need to say,
                                         
                                         I need an extra coordinator and this extra service, extra coordinator can now tell me,
                                         
                                         have I seen this message? Yes or no. And if the answer is yes, I've already seen this message, then okay, reject it. Don't even attempt to deliver it. And that's okay, but you've
                                         
                                         sacrificed availability. And if this coordinator goes down, your system has to say, well, no,
                                         
                                         I'm no longer sending anything. So it is a trade-off. Throughout this thing, it's a trade-off.
                                         
                                         So in the phone example, I guess,
                                         
                                         if we had some third person who tracked whether...
                                         
                                         We both acknowledge to the third party, is that the idea?
                                         
    
                                         That's the idea.
                                         
                                         You'd have something else that listens and says,
                                         
                                         yeah, no, that one can go through, that one can go through.
                                         
                                         So let's rewind.
                                         
                                         You said earlier that... I think that you said, like, one of the main reasons for wanting a distributed system was deployment. It seems like a small reason to me, just to want to deploy components the big reason, the favorite reason is we say scale.
                                         
                                         That's the kicker, right?
                                         
                                         And we always imagine the system that goes, you know what, I'll be Amazon.
                                         
                                         This will be so cool.
                                         
    
                                         I will 100,000 requests per second.
                                         
                                         Maybe even more.
                                         
                                         And then you say, okay, well, it's unreasonable for a monthly application to be able to handle 100,000 requests per second
                                         
                                         when the distribution of work is actually 90,000 of browsing, people just look for stuff,
                                         
                                         and only 10,000 is people buying.
                                         
                                         I think I've made those numbers up.
                                         
                                         I think if that's the case, then, you know, Mr. Bezos is the happiest man on earth.
                                         
                                         If you get like 90, 10, 9-1 browsing to actually buying,
                                         
    
                                         oh, wow, that would probably be pretty good.
                                         
                                         So I think the numbers are different, but it doesn't matter.
                                         
                                         So, yes, it's scale.
                                         
                                         And, of course, if the system is divided into different services,
                                         
                                         each service can be scaled differently.
                                         
                                         What's actually even more encouraging about it is that when the system is broken up into services, we can now really think about failure scenarios and what do we do if something is broken. e-commerce website. Let's say that the e-commerce website is really keen that people get whatever
                                         
                                         goes into a shopping basket. People have to be able to get, right? We're just going to believe
                                         
                                         that. So if I put stuff in my shopping basket and browse, so that's one set of microservices
                                         
    
                                         that I can do the search and the image delivery and all that personalization.
                                         
                                         And it's all in my shopping basket. And I go to the payment, and the payment service goes down.
                                         
                                         Now, one of the options that we have, which would be sort of silly in this particular example,
                                         
                                         but nevertheless, reasonable to think about it, let's say we're really good,
                                         
                                         and we say we want our customers to be super happy.
                                         
                                         They've spent all this time putting their stuff in the shopping basket.
                                         
                                         Regardless of whether the payment service works or not,
                                         
                                         we're going to send the stuff.
                                         
    
                                         And so if I make a...
                                         
                                         See, now, okay, I know, silly, right?
                                         
                                         But you make a request to this payment service,
                                         
                                         and you say, hey, I want to now pay for this.
                                         
                                         You know, I put a few DVDs in my...
                                         
                                         I don't even have a DVD drive.
                                         
                                         But, you know, years ago, right?
                                         
                                         I put a couple of DVDs in my shopping basket,
                                         
    
                                         and I paid for it.
                                         
                                         I checked out.
                                         
                                         Checked out.
                                         
                                         The service that ran the website
                                         
                                         made a request to the payment service.
                                         
                                         That was down.
                                         
                                         And so I said, never mind.
                                         
                                         Let's just do another list.
                                         
    
                                         Now, ridiculous, you say.
                                         
                                         Of course it is.
                                         
                                         Think, though.
                                         
                                         Maybe you're subscribing to music,
                                         
                                         online music.
                                         
                                         You're a Spotify.
                                         
                                         I've actually heard of an example.
                                         
                                         So Starbucks, they have their member,
                                         
    
                                         it's like a gift card or whatever,
                                         
                                         but it's also their membership card, right?
                                         
                                         And you can put coffees on it.
                                         
                                         So I guess they have problems with their system going down.
                                         
                                         And when down, they just give free coffees.
                                         
                                         Absolutely right.
                                         
                                         Absolutely right.
                                         
                                         Now you can extend it to say a music service so suppose you
                                         
    
                                         want to listen to something open open your phone you say i'm going to subscribe i'm going to
                                         
                                         subscribe using apple pay and so the payment has gone through the apple systems and then
                                         
                                         the receipt gets sent to this music service and it says hey guys so i have a receipt number 47
                                         
                                         from apple and the music service now says okay well, well, I'm going to go to Apple
                                         
                                         and just double check really that if the receipt number 47 has that been paid.
                                         
                                         Apple service goes 503 over capacity.
                                         
                                         Now at that point, it's really reasonable for the music service to say,
                                         
                                         ah, nevermind, I'm going to let the user listen to stuff.
                                         
    
                                         For the next 15 minutes, I'm going to hand out the authentication token
                                         
                                         that allows access to all this music.
                                         
                                         But 15 minutes later, the user has to check back,
                                         
                                         because by then I will have checked for sure with Apple, and I will know.
                                         
                                         So this is actually a really reasonable thing to do.
                                         
                                         Now, again, it sort of makes, I guess, the commercial people's blood go stone cold, right?
                                         
                                         But what's the alternative?
                                         
                                         If we didn't do this, then the alternative would have been really annoyed customers.
                                         
    
                                         They would have used something.
                                         
                                         Everything would have worked on their side.
                                         
                                         Just because one of our services is down, they don't get to do what they want it to do.
                                         
                                         So they pick up the phone or write a tweet. It's actually better in a lot of cases to just
                                         
                                         trust people
                                         
                                         and maybe allow them access
                                         
                                         for a short duration of time.
                                         
                                         Again, it depends on the industry and all that.
                                         
    
                                         Generalizing overly or overly generalizing
                                         
                                         even. But with
                                         
                                         distributed system,
                                         
                                         this is now possible.
                                         
                                         If you were in a monolith,
                                         
                                         if the payment service went down
                                         
                                         because it's part of the monolith, that's it, you're done.
                                         
                                         Nothing works.
                                         
    
                                         If it's a distributed system, you have a better chance
                                         
                                         of defining some sort of failure strategy.
                                         
                                         And because the payment system publishes events
                                         
                                         about what happened to it, you as developers,
                                         
                                         you now have a chance of going through what happened and going,
                                         
                                         oh, this sort of sequence of events might lead to a failure, which you could use to improve
                                         
                                         your code, or maybe you can write some sort of predictive mechanism that tells the operations
                                         
                                         team or tells whoever happens to be on pager duty that says, look, heads up, this is not going to be pretty.
                                         
    
                                         We're still fine,
                                         
                                         but something's coming.
                                         
                                         And that is fantastic.
                                         
                                         That is the one thing that keeps services running,
                                         
                                         keeps the entire systems running.
                                         
                                         It's interesting because
                                         
                                         it brings to mind to me,
                                         
                                         even if you have a monolithic app,
                                         
    
                                         likely you already have
                                         
                                         a distributed system, right?
                                         
                                         Because you're calling out to some right? Because you're calling out to
                                         
                                         some payment processor, you're calling out to some database. So even though your application is
                                         
                                         fully formed, right? I guess what I'm hearing you saying is that you need to model these
                                         
                                         systems and what will happen when you can't reach them. I just always assume I can reach
                                         
                                         the database. If I can't, you know, whatever.
                                         
                                         Absolutely right.
                                         
    
                                         Absolutely right.
                                         
                                         And the worst thing about it
                                         
                                         is that at small scale,
                                         
                                         I don't mean it in a bad way,
                                         
                                         but at low scale,
                                         
                                         it's quite possible to go to AWS
                                         
                                         and provision a MySQL instance
                                         
                                         and that thing will run forever.
                                         
    
                                         It'll never fail.
                                         
                                         Because it's just one computer,
                                         
                                         but as the number of computers
                                         
                                         increases, the chance
                                         
                                         of probability goes,
                                         
                                         the chance of failure goes right up.
                                         
                                         Something is going to
                                         
                                         fail. Now, this was
                                         
    
                                         I used to
                                         
                                         say this. I used to say, oh, well, we need to build these
                                         
                                         distributed systems, and we need to treat failures as first-class citizens. This is really important.
                                         
                                         And I said that out loud, and at the back of my mind, I thought, yeah, but come on,
                                         
                                         when does this happen? Well, we now happen to have a system that handles significant load. And it happens every day. Every day, some service
                                         
                                         goes down. A couple of nodes of a database
                                         
                                         go down. A couple of brokers go down. They get restarted.
                                         
                                         And it's fine. The key thing is, because we can
                                         
    
                                         reason about this failure and because we're prepared for it,
                                         
                                         it's fine. We survive it.
                                         
                                         Our infrastructure keeps running. Our application
                                         
                                         keeps running. The users are getting
                                         
                                         their responses.
                                         
                                         But internally, of course,
                                         
                                         we can detect
                                         
                                         errors and recover from them.
                                         
    
                                         So that was
                                         
                                         very, that was the opener, truly.
                                         
                                         What do you think about this argument? I haven't heard it in a while but it was very popular about
                                         
                                         um you know build the monolith first and then you know once you reach some you know once you reach
                                         
                                         these these problems then you know then you start splitting i mean i don't have a general answer to that.
                                         
                                         I think a lot of the times you can know what the expected load is going to be. If you know, as in you have an existing customer base and you know that you have a million customers.
                                         
                                         And so the new service, when it goes live, it will get that load. Now, I think it would
                                         
                                         be silly in that scenario to say, no, no, no, we're just going to start with this thing that we
                                         
    
                                         know isn't right. And then we'll see how it goes, how it crashes.
                                         
                                         Feel free to say it's a horrible idea. So it came, I think it was Martin Fowler who wrote about this
                                         
                                         probably several years ago. And I think his argument was, if you try to build this new application as a distributed system,
                                         
                                         but once it starts interacting with customers, your requirements start changing drastically.
                                         
                                         And if you don't know where the change is going to be, maybe you've cut things up in a way that doesn't make sense.
                                         
                                         And that was going to be my follow-on.
                                         
                                         If it's a new application,
                                         
                                         if this is just a, dare I say, startup,
                                         
    
                                         then it makes total sense to experiment first.
                                         
                                         But there was another,
                                         
                                         I think that this was Googlers who wrote it,
                                         
                                         and it goes along the lines of design.
                                         
                                         Take a system and say,
                                         
                                         you say, okay, I need to be able to handle 10 concurrent users.
                                         
                                         Design it for 100.
                                         
                                         But once you get to 1,000, it's a new system.
                                         
    
                                         And once you get to 10x the original design,
                                         
                                         it's just not going to work.
                                         
                                         So, okay, I know 10 is a silly example, but design, say you have 100,000 users,
                                         
                                         design a system that will hold up to, say, a million.
                                         
                                         But once your usage grows out to be 10 million, the thing isn't suitable.
                                         
                                         It's the wrong thing.
                                         
                                         Now, I think this is where Martin Fowler was getting as well.
                                         
                                         The design choices for something that handles 10 or 100 use might be completely different to the thing that handles
                                         
    
                                         1,000. And if the starting position is that you have zero users, well, wow, you're right.
                                         
                                         Design something, anything. Because as you point out, who knows what these silly users will say
                                         
                                         when they finally log in and say, well, I don't like this. And you might cut your system up the wrong way. You might
                                         
                                         define these consistency boundaries. That's a big word. I haven't come up with a big word for
                                         
                                         a long time. So consistency boundary, or this context boundary, because we're in a distributed system,
                                         
                                         we have, I'm sure people know this cap theory,
                                         
                                         which says, okay, well, you have consistency, availability,
                                         
                                         or partition tolerance.
                                         
    
                                         You'd like all three, but you can only have two.
                                         
                                         And as long as one of them is partition tolerance,
                                         
                                         because you have a distributed system, right?
                                         
                                         So partition tolerance,
                                         
                                         now pick two, consistency or availability. And it really depends on what you're building. Now, the choice isn't usually quite... it's not binary. It's not like consistency 1-0. There's a scale.
                                         
                                         But what you're ultimately dealing with is physics. You have a data center, say there's a data center in Dublin,
                                         
                                         and there's a data center in East Coast, US East, Amazon, and EU West in Dublin.
                                         
                                         It takes time for this electricity nonsense to get from America to Ireland.
                                         
    
                                         It just does, and there's nothing you can do about it.
                                         
                                         And so if you write a thing to a
                                         
                                         system like a database in
                                         
                                         Dublin, it's going to take
                                         
                                         100 milliseconds before the
                                         
                                         signal gets to
                                         
                                         US East, and there's nothing you can do about
                                         
                                         it. That's just life.
                                         
    
                                         So what can you do?
                                         
                                         You can say, well, so if someone
                                         
                                         else is reading the same row
                                         
                                         from
                                         
                                         US East, they'll see old data, because it's just You can say, well, so if someone else is reading the same row from, you know,
                                         
                                         US East, they'll see old data because it's just not there yet.
                                         
                                         So you can say, oh, okay, okay, okay, okay.
                                         
                                         So you hate the idea.
                                         
    
                                         Absolutely hate the idea.
                                         
                                         It must be consistent.
                                         
                                         When I look at my, this database row, I must see, wherever I look in the world,
                                         
                                         I must see the same value.
                                         
                                         All right. So you better have this other component, this coordinating row, I must see wherever I look in the world, I must see the same value. All right. So you better have this other component, this coordinating guy that adds a tick and says,
                                         
                                         yes, no, this is now everyone I've heard acts from every data center, every replica that they now have it.
                                         
                                         And now it's good. Now I can release these read logs that people are waiting for.
                                         
                                         Okay? It's consistent.
                                         
    
                                         What if one of these networks connection,
                                         
                                         what if this coordinating component goes down?
                                         
                                         Oh.
                                         
                                         There you go.
                                         
                                         You can't be consistent.
                                         
                                         So the only thing that the system can do is no bye-bye, no service.
                                         
                                         I think even even in um
                                         
                                         monolithic applications i think people don't think about the fact that maybe you're consistent
                                         
    
                                         within the database but but that doesn't necessarily like i imagine a scenario you
                                         
                                         have a bunch of app servers and a database behind it like postgres and you know everything's acid
                                         
                                         but you know one user reads a record and it displays it somewhere and then they change some things. And then your little magic ORM software saves the record en masse back.
                                         
                                         And somebody else could have the same record up at the same time.
                                         
                                         So the consistency is lost. Sometimes people don't even realize it, I guess.
                                         
                                         Oh, absolutely.
                                         
                                         Oh, absolutely.
                                         
                                         So the more you distribute the system, the more you have to deal with these scenarios, the more you have to deal with the possibility that something will be inconsistent.
                                         
    
                                         And it can actually be a really, I've had a whole bunch of discussions with typically e-commerce people who find it absurd.
                                         
                                         What do you mean that I don't know how much stuff I have in stock?
                                         
                                         Well, you do roughly, but it's not down to like one unit.
                                         
                                         So they say, oh, no, no.
                                         
                                         How about we reserve items?
                                         
                                         You can.
                                         
                                         How about we lock items?
                                         
                                         Yes, you can.
                                         
    
                                         But as the number of users grows, the number of locks or these reservations grows.
                                         
                                         And so it's really easy to get to the point where everything's
                                         
                                         locked and no one can actually do anything because you want to be consistent. Because you want to be
                                         
                                         consistent, you've sacrificed availability. And so, you know, people say, no, no, no, no, no, no,
                                         
                                         no, no, no. You can't have that. I want to be consistent and available. Which means you say,
                                         
                                         okay, fine. Buy one big, huge machine
                                         
                                         and run everything on this one machine.
                                         
                                         But even that's ridiculous.
                                         
    
                                         That machine will have multiple CPUs.
                                         
                                         It will have wires going through it and they will break.
                                         
                                         So it's...
                                         
                                         And the availability is lost if that machine goes down.
                                         
                                         Oh, absolutely.
                                         
                                         It's not pretty.
                                         
                                         So you had an example about an image processing system you designed.
                                         
                                         I did, yes.
                                         
    
                                         I'm wondering if you could describe that a little bit.
                                         
                                         Yeah, of course. So this was a really interesting thing.
                                         
                                         Now, I guess people are a little bit more worried about their private information.
                                         
                                         But nevertheless, this was in good old days where people uploaded pictures of their passports
                                         
                                         freely onto any old service they could find.
                                         
                                         Now, what we did is,
                                         
                                         it was obviously doing a couple of processing steps
                                         
                                         with these identity documents.
                                         
    
                                         So what was the goal of the system?
                                         
                                         It was to essentially provide biometric verification of the user.
                                         
                                         So think you want to open a bank account and you don't want to go to the branch.
                                         
                                         And the bank doesn't really want to open a branch for you because it costs a lot of money.
                                         
                                         What they would like to do instead is to have the users use their phones to look into their phones and take a picture of their driving license or something.
                                         
                                         And for this system to say,
                                         
                                         yeah, no, this is looking good.
                                         
                                         The driving license is the real deal.
                                         
    
                                         It's not being tampered with.
                                         
                                         And the thing that's staring into the camera,
                                         
                                         that's the same face.
                                         
                                         It's alive.
                                         
                                         So that was that.
                                         
                                         Now, of course, the users,
                                         
                                         they want this thing to run beautifully smoothly, right?
                                         
                                         And they will give you 10 seconds of attention, maybe.
                                         
    
                                         So if the processing really isn't done within, say, 10 seconds,
                                         
                                         maybe 15 if we invent some clever animation,
                                         
                                         then this would have been a total failure.
                                         
                                         So this needed to be obviously scalable, it needed to be parallel, it needed to be concurrent.
                                         
                                         Many of these steps needed to be executed concurrently. And a lot of these steps were actually pretty difficult. As you can imagine, there was a whole bunch of machine learning involved. And so what the PDF that you might have read
                                         
                                         or the preview book that you might have read
                                         
                                         essentially describes the thing that there's a front service
                                         
                                         that ingests a couple of images, as in high-resolution pictures
                                         
    
                                         of documents or whatever else,
                                         
                                         and it ingests a video of the person staring into the phone.
                                         
                                         And then the downstream services then compute the results. So it needs to be OCR'd.
                                         
                                         The face needs to be extracted from it. We need to check that
                                         
                                         you haven't just stuck another picture on. We used to do this.
                                         
                                         I didn't because I'm from Eastern Europe,
                                         
                                         so we used to drink all the time.
                                         
                                         But I keep hearing that people elsewhere in the world have to think that driving licenses
                                         
    
                                         in order to prove that they are allowed to have a drink.
                                         
                                         And so they take someone else's driving license,
                                         
                                         put a picture on it, put their picture on it.
                                         
                                         So we need to be able to detect those scenarios, obviously.
                                         
                                         And then combine the results.
                                         
                                         At the end, there is a state machine that reacts to all these messages that are coming from these
                                         
                                         upstream components. And as soon as it's ready, it needs to emit a result. So imagine that you're
                                         
                                         the bank, right? You want to open a bank account. And you say, you know, a question,
                                         
    
                                         how much, how much,
                                         
                                         maybe you're a betting company
                                         
                                         or someone who needs to verify
                                         
                                         identity of a person.
                                         
                                         And so, so if I'm a betting company
                                         
                                         and I say, okay, you know what?
                                         
                                         I'm going to play on the Grand National,
                                         
                                         which is a UK horse race.
                                         
    
                                         Okay, you need,
                                         
                                         I've never bet before, right?
                                         
                                         So I need to register
                                         
                                         and prove that I'm over 18.
                                         
                                         So I go through this process.
                                         
                                         And then in the app, I say, how much do you want to bet?
                                         
                                         And I say, I'll splash out and bet £2.
                                         
                                         Well, right?
                                         
    
                                         So the betting company then can have a ruling that says,
                                         
                                         as soon as I hear from this system,
                                         
                                         as soon as it emits a message that says,
                                         
                                         look, you know what? We've OCR'd it, and this driving license looks like a UK driving license.
                                         
                                         At that point, it might say, that's good enough.
                                         
                                         It's fine.
                                         
                                         A lot of that to go through.
                                         
                                         If, on the other hand, I was saying, well, look, I'm going to bet £3,000, they would actually need to wait for all these messages to arrive.
                                         
    
                                         That is to say, it's OCR'd and we checked the driving license register to verify that
                                         
                                         it's the real thing.
                                         
                                         And we also compared the picture on the driving license with a face that was staring into
                                         
                                         the phone.
                                         
                                         And yes, it's the right person. Now, that is only possible
                                         
                                         if your system emits these messages.
                                         
                                         It would have been impossible to condense down
                                         
                                         and to start processing
                                         
    
                                         and then wait until we have everything.
                                         
                                         We have this entire processing flow
                                         
                                         and then send one callback,
                                         
                                         one result to the gambling company.
                                         
                                         That makes sense.
                                         
                                         Because if we were doing,
                                         
                                         back to our startup,
                                         
                                         generating something quick,
                                         
    
                                         I might have some web server
                                         
                                         that kind of throws these images
                                         
                                         into whatever Kafka topic,
                                         
                                         and then just something else
                                         
                                         that just does everything, right?
                                         
                                         Pulls it out and goes through it. So you're saying
                                         
                                         the problem with that is
                                         
                                         the upstream consumer needs to know
                                         
    
                                         about specific...
                                         
                                         It wants to know before it's done, I guess.
                                         
                                         Yes, and the
                                         
                                         main challenge
                                         
                                         is that you don't know
                                         
                                         what it wants to know.
                                         
                                         Until you deploy
                                         
                                         the first cut of this,
                                         
    
                                         and until you say to your customers,
                                         
                                         look, here is what we can do,
                                         
                                         you don't get the feedback.
                                         
                                         And what they might say is,
                                         
                                         well, that's nice,
                                         
                                         but couldn't you just send us,
                                         
                                         I'm a gambling company,
                                         
                                         couldn't you just send us
                                         
    
                                         the initial processing?
                                         
                                         That's good enough for us.
                                         
                                         Oh, yeah,
                                         
                                         so if client equals equals five,
                                         
                                         then do this. And then some other client comes along and says, well,
                                         
                                         you see, but if the betting
                                         
                                         amount is more than 200, then we will end it.
                                         
                                         That way you will eyes pain. So it's much better to just
                                         
    
                                         admit everything that you know as soon
                                         
                                         as you know it from your services.
                                         
                                         Now, if you
                                         
                                         need to implement in the system
                                         
                                         and we decide together that we're going to be
                                         
                                         so good to our customers that
                                         
                                         we're going to implement this workflow engine,
                                         
                                         we can as a separate service.
                                         
    
                                         But as a
                                         
                                         separate service, that's the key.
                                         
                                         So we can defer that decision
                                         
                                         absolutely
                                         
                                         so how does the client
                                         
                                         the client is a confusing term here
                                         
                                         so there's a user with a phone
                                         
                                         but then there's your actual client
                                         
    
                                         like the bank
                                         
                                         so how are they consuming
                                         
                                         these results
                                         
                                         now
                                         
                                         there are multiple choices
                                         
                                         the most the crudest one is we just push stuff to them over these results? No. There are multiple choices.
                                         
                                         The most,
                                         
                                         the crudest one is
                                         
    
                                         we just push
                                         
                                         stuff to them
                                         
                                         over a connection.
                                         
                                         Right?
                                         
                                         So we say
                                         
                                         to our
                                         
                                         clients
                                         
                                         who speak to
                                         
    
                                         the system
                                         
                                         that should
                                         
                                         receive the
                                         
                                         messages from us,
                                         
                                         we say,
                                         
                                         okay, guys,
                                         
                                         why don't you
                                         
                                         stand up a
                                         
    
                                         publicly accessible
                                         
                                         HTTP endpoint
                                         
                                         and we're going
                                         
                                         to post stuff
                                         
                                         to you?
                                         
                                         That's a horrible way of doing it.
                                         
                                         Don't ever do that.
                                         
                                         First of all, your customers will
                                         
    
                                         hate it because they'll say,
                                         
                                         no,
                                         
                                         we don't want to
                                         
                                         stand up a web server in order
                                         
                                         to consume your service. That's a stupid idea.
                                         
                                         Firstly, and secondly,
                                         
                                         it
                                         
                                         pushes the responsibility of
                                         
    
                                         making sure that you're talking to the right endpoint,
                                         
                                         like a dual service.
                                         
                                         It would be really bad if, say, you had this complicated image processing service,
                                         
                                         and if it posted data, private data, biometric data, to some other URL.
                                         
                                         Yeah.
                                         
                                         Through misconfiguration.
                                         
                                         So, yeah, horrible.
                                         
                                         So, how about we do it like Twitter
                                         
    
                                         used to do? So we say, okay, you know, Mr. Client, open a long post request and like a
                                         
                                         connection and we will send you chunks, HDB chunks of messages as soon as we receive something. Now, that's a much nicer way of doing it
                                         
                                         because it just allows the client
                                         
                                         to control where it's connecting to.
                                         
                                         It also means that the client has to check the HTTP,
                                         
                                         the certificate on the connection.
                                         
                                         It has to know and it has to be sure
                                         
                                         that it's talking to the right service
                                         
    
                                         so we can wash our hands of this whole horrible
                                         
                                         business.
                                         
                                         It also means that
                                         
                                         the client now is in control
                                         
                                         of its consumption.
                                         
                                         That's kind of nice.
                                         
                                         What's slightly problematic
                                         
                                         about it is the scale of it.
                                         
    
                                         It is the
                                         
                                         case of one image per second
                                         
                                         variance, if it were, oh,
                                         
                                         I don't know, a thousand images per second,
                                         
                                         that's a bit tougher, right?
                                         
                                         You would want multiple of these
                                         
                                         connections to be opened.
                                         
                                         And so now you need to think about a way to
                                         
    
                                         partition which connection
                                         
                                         gets which results
                                         
                                         in which images, and they need to be
                                         
                                         routed more carefully. And and they need to be rooted more carefully.
                                         
                                         And you also need to think about
                                         
                                         you need to be nice to
                                         
                                         your clients and say, we understand
                                         
                                         that the connection will go down
                                         
    
                                         and you might have to make
                                         
                                         a new connection. Okay, but
                                         
                                         you need to be able to say things like, I'm making
                                         
                                         a connection, but I want to start consuming
                                         
                                         from record
                                         
                                         75 instead of the last one.
                                         
                                         So, but that can be, you know, a big query parameter. So that's a neat way of really
                                         
                                         bridging the gap between someone who says, so where's Scala and JDMs and Kafka? And, you know,
                                         
    
                                         our Mr. Client says, no, I'm.NET, none of this stuff, don't even talk to me.
                                         
                                         In which case, this sort of long HTTP connection works pretty well. Of course, the ideal scenario
                                         
                                         is the client would say, well, we would like for your system to expose another kind of interface
                                         
                                         and we can just mirror that. So that's
                                         
                                         one of the other options that we offer.
                                         
                                         Now, in the end, this particular
                                         
                                         system, we ended up just with these HTTP
                                         
                                         endpoints. So that's something
                                         
    
                                         you do? You do integration
                                         
                                         with Kafka?
                                         
                                         We offered that, because
                                         
                                         there was a particular deployment
                                         
                                         that looked like it was going to be large
                                         
                                         enough to warrant that, but
                                         
                                         in the end, we dropped it. Interesting.
                                         
                                         I can see why it would be easy from your side.
                                         
    
                                         Oh, yeah, exactly. That's why we said, oh, no, no, we can
                                         
                                         do this for you, but
                                         
                                         it...
                                         
                                         I didn't like it in the end because
                                         
                                         it exposed
                                         
                                         too much of our internal workings.
                                         
                                         It felt like integration through database
                                         
                                         wasn't quite right. Now, okay, you can exposed too much of our internal workings. It felt like integration through database.
                                         
    
                                         It wasn't quite right.
                                         
                                         Now, okay, you can restrict the topics that are replicated.
                                         
                                         You can maybe transform them.
                                         
                                         But it still felt a little bit crummy.
                                         
                                         So I think it's much better to have a really sharp,
                                         
                                         completely disconnected interface.
                                         
                                         And even if you're talking, say you're running an AWS and you publish your messages to Kinesis,
                                         
                                         and then you would say to your client,
                                         
    
                                         oh, yes, I see you have Kinesis.
                                         
                                         Why don't we push stuff to your Kinesis?
                                         
                                         Again, it feels like you're letting your dirty laundry
                                         
                                         out for everyone to see.
                                         
                                         So there should be an integration service
                                         
                                         that it might well consume from Kinesis,
                                         
                                         do some cleanup and publish to another Kinesis.
                                         
                                         That's all good, right?
                                         
    
                                         And it might even be a lambda for all I care.
                                         
                                         But it shouldn't be a direct connection.
                                         
                                         So that's interfacing externally.
                                         
                                         So inside of your microservices world,
                                         
                                         how do you draw these?
                                         
                                         How do you agree on these formats?
                                         
                                         Formats being the message bus or the...
                                         
                                         Yeah, how do you agree on how you communicate
                                         
    
                                         between the various microservices?
                                         
                                         So a lot of it is driven by the environment.
                                         
                                         So we have one system that's divided,
                                         
                                         and part of it lives in a data center,
                                         
                                         and part of it lives in AWS.
                                         
                                         Now that drives the choice,
                                         
                                         because there's no Kinesis in a data center.
                                         
                                         We had to use Gatsby in that particular case.
                                         
    
                                         And then we publish messages out to Kinesis
                                         
                                         and we AWS components consume Kinesis.
                                         
                                         What's actually more important is to think about
                                         
                                         the properties of this messaging thing
                                         
                                         and think about the payload that goes on this thing.
                                         
                                         So in a distributed system,
                                         
                                         one of the super evil things to have is ephemeral messages.
                                         
                                         So if the integration between our components,
                                         
    
                                         our services is like an HTTP request,
                                         
                                         there is no way to get back to it.
                                         
                                         If once the request is made, it's made, it's gone.
                                         
                                         There's no way, no record of it anywhere.
                                         
                                         So what we really prefer is persistent messaging.
                                         
                                         And you can then think of these message brokers
                                         
                                         as both messaging delivery mechanism
                                         
                                         as well as journal in the CQRS event-sourced sense.
                                         
    
                                         So the message box contains all messages for a duration of time,
                                         
                                         but they're not lost.
                                         
                                         So if I publish a message to Kinesis, I publish a message to Kafka,
                                         
                                         once the publisher gets the confirmation that, yes, okay,
                                         
                                         the sufficient number of nodes of
                                         
                                         Kinesis or Kafka have the record.
                                         
                                         It's there, right?
                                         
                                         It sits somewhere. It can be
                                         
    
                                         re-read again. It's persistent.
                                         
                                         Now, actually, in terms of conditions
                                         
                                         apply, that's not exactly what's happening with
                                         
                                         Kafka, because what happens is
                                         
                                         if you, by default,
                                         
                                         give you a cluster of
                                         
                                         like n number of nodes,
                                         
                                         and if you say, I want to receive a confirmation
                                         
    
                                         when the record is published on quorum number of nodes,
                                         
                                         that still doesn't mean that the message is written to disk.
                                         
                                         It just means that the quorum number of nodes have the message in memory.
                                         
                                         Now, it's a probabilities game.
                                         
                                         You'd have to be really unlucky that all of these nodes would fail before they would flush their memory to disk.
                                         
                                         Of course, you can say, no, no, no, it needs to be fsynced, but then the message rate goes right down.
                                         
                                         Now it needs to be fsynced. We have a distributed file system, right?
                                         
                                         So, so, one of the components is running in OpenShift, which is GloucesterFS, which is a
                                         
    
                                         distributed network file system painted over a number of SSDs. And so we have a cluster of Kafka's that use this GloucesterFS, which we're not entirely happy about, but we're not also
                                         
                                         entirely unhappy about because it runs. But hey, right? This is confusing you.
                                         
                                         It wouldn't be fun if it weren't messy.
                                         
                                         So just to make sure that I'm understanding.
                                         
                                         So we're going to...
                                         
                                         Your image service has a bunch of microservices inside of it.
                                         
                                         And they are communicating with each other
                                         
                                         using Kafka topics.
                                         
    
                                         And we kind of assume that that is persistent,
                                         
                                         even if some conditions apply.
                                         
                                         So how do you organize it in terms of message schemas and topics?
                                         
                                         That's a good question.
                                         
                                         So the messages we've chosen,
                                         
                                         we call buffers as the format and description of our messages.
                                         
                                         So we're very, very, very strict about it.
                                         
                                         And we actually have...
                                         
    
                                         This is where people are going to get really angry
                                         
                                         because one of the things about microservices is that they are independent.
                                         
                                         Each microservice is its own world.
                                         
                                         It shouldn't have any shared code and shared dependencies.
                                         
                                         Well, I mean, true, but then reality kicks in.
                                         
                                         Now, what we've done,
                                         
                                         so people won't get angry,
                                         
                                         we actually created one separate Git repository
                                         
    
                                         that contains our protocol definitions
                                         
                                         for all the services.
                                         
                                         Now, this is weird.
                                         
                                         It's really kind of weird.
                                         
                                         But this one thing allows us humans
                                         
                                         to spot when someone is making a change to the protocol.
                                         
                                         So if there's a pull request to this one central repository that contains all the protocol definitions for the services,
                                         
                                         the entire team is on it and goes, oh, wait a minute.
                                         
    
                                         If we merge this, is it going to break any of our stuff?
                                         
                                         Because remember, these microservices have independent lifecycle.
                                         
                                         And so it's quite possible to have version 1.2.3 of service A
                                         
                                         running alongside version 1.4.7 of service B.
                                         
                                         And following the usual semantic versioning rules,
                                         
                                         these should be compatible.
                                         
                                         Well, except semantic versioning is only as good as the
                                         
                                         programmers who make it. And so this is why we have this shared repository and we have
                                         
    
                                         humans in my team sort of eyeball it and say, is this going to work? Yes or no? When we
                                         
                                         build the microservices, they all build protocol implementations using this shared
                                         
                                         definitions repository.
                                         
                                         So if there's a C++ version, it makes
                                         
                                         the proto-Cs
                                         
                                         its own C++ versions
                                         
                                         of the protocol.
                                         
                                         If there's a Scala version, it's a Scala code.
                                         
    
                                         But then it's all
                                         
                                         quote-unquote guaranteed
                                         
                                         to work together when it's deployed.
                                         
                                         Is protobuf the best solution,
                                         
                                         or what do you think of?
                                         
                                         Ah,
                                         
                                         now, to me,
                                         
                                         the best solution is
                                         
    
                                         language that has
                                         
                                         good language tooling.
                                         
                                         So you could say,
                                         
                                         oh, well, we should have used Avro.
                                         
                                         Surely Avro is more efficient.
                                         
                                         It has support in Kafka and all that.
                                         
                                         Maybe we should have used Drift or something else.
                                         
                                         But when we were developing the system,
                                         
    
                                         Protobuf had really mature tooling for the languages that we used.
                                         
                                         So there was tooling for Swift and Objective-C for the clients.
                                         
                                         There was good tooling for C++, and there was was tooling for Swift and Objective-C for the clients. There was good
                                         
                                         tooling for C++ and there was good tooling for Scalp. And that made the difference.
                                         
                                         So I guess what I'm saying is you're free to choose which of the protocols you like,
                                         
                                         protocol language, protocol definition language, as long as the tooling matches your development lifecycle and as long as the
                                         
                                         protocol is rich
                                         
                                         enough to allow
                                         
    
                                         you to express
                                         
                                         all the messages
                                         
                                         that you're
                                         
                                         sending.
                                         
                                         That makes
                                         
                                         sense.
                                         
                                         So how do
                                         
                                         you decide
                                         
    
                                         about topics?
                                         
                                         Like, do you
                                         
                                         have each, is
                                         
                                         each microservice
                                         
                                         pushing to a
                                         
                                         singular topic
                                         
                                         or?
                                         
                                         Yes, so that's a good question.
                                         
    
                                         Typically, each microservice is pushing to multiple topics.
                                         
                                         So there is the basic results topic,
                                         
                                         but it might have partial results,
                                         
                                         it might have some statistics.
                                         
                                         So these are definitely topics that...
                                         
                                         I can't say that one service only ever consumes from one topic and publishes to one topic.
                                         
                                         A typical service in that system will consume from two topics and publish to maybe three.
                                         
                                         It is worth mentioning though, particularly with versioning, is the way we've
                                         
    
                                         gone about it is we have topics for major versions. So there's like a V1 of images.
                                         
                                         And obviously V1 will be forwards and backwards compatible across all the major versions within the...
                                         
                                         minor versions within the one
                                         
                                         major version. And if we ever
                                         
                                         decide to deploy image service
                                         
                                         v2, which presumably is completely
                                         
                                         different, right? Because it's just not
                                         
                                         compatible, then we would create
                                         
    
                                         an image-v2 topic
                                         
                                         for it. I mean, it seems to be a theme you're saying
                                         
                                         is to make these things explicit?
                                         
                                         Uh-huh. Absolutely right. it seems to be a theme you're saying is to make these things explicit?
                                         
                                         Absolutely right.
                                         
                                         You have to be able to talk about these things
                                         
                                         and accept reality.
                                         
                                         What always scares
                                         
    
                                         me the most is when people
                                         
                                         describe the system and they say,
                                         
                                         we will break on the monolith or we are designing
                                         
                                         microservices based on architecture
                                         
                                         and we're going to have this component
                                         
                                         and this component and when it gives us a response, we're going to do this, that, and the other. My first question in
                                         
                                         all those scenarios is, what if it's done? What if it's not available? And, okay, so this is very
                                         
                                         esoteric. Let's go back to a user database. If you're building a user database and you have a
                                         
    
                                         login page, and you have like an e-commerce
                                         
                                         music shop or something.
                                         
                                         And my question to you would be,
                                         
                                         okay, what do you do when your user database is
                                         
                                         down and someone wants to log in?
                                         
                                         Freak out. Get pages.
                                         
                                         There you go. Now, a lot of people
                                         
                                         will say, well, what can we possibly do?
                                         
    
                                         The database is down.
                                         
                                         But what you
                                         
                                         can do, of course, option A is
                                         
                                         denied. You can't log in.
                                         
                                         Option B is, okay, fine, you're logged in.
                                         
                                         I don't care what username
                                         
                                         and password you typed in. You're logged in.
                                         
                                         We'll take your word for it.
                                         
    
                                         We're going to issue you a token that's good
                                         
                                         for the next two minutes.
                                         
                                         How about that?
                                         
                                         And it's this thing, right?
                                         
                                         And then you say it to your product team and they say,
                                         
                                         are you ridiculous? We can't allow that to happen. And then you say, okay, fine, fine, fine. I hear
                                         
                                         you. So option B is if the database goes down and it will go down, then your service is down.
                                         
                                         Well, we don't like, we hate that as well. And we do. I know. I hate the database
                                         
    
                                         going down, and I'm sure that
                                         
                                         product owners hate the idea of just letting people in.
                                         
                                         But we have to make a choice.
                                         
                                         That's reality.
                                         
                                         It makes sense, yeah. For a music service,
                                         
                                         for a bank,
                                         
                                         I think that...
                                         
                                         Absolutely.
                                         
    
                                         And this is where consistency,
                                         
                                         this is where the dials go in. right? It's not a one zero.
                                         
                                         So of course, if I can't verify my online banking identity, I can't let you in. What if though,
                                         
                                         let's say there are a number of banks in the UK. One is actually one of them had a
                                         
                                         massive IT failure last week. Anyway, anyway you're running a bank and you have
                                         
                                         2005
                                         
                                         code and you're doing
                                         
                                         two-factor authentication
                                         
    
                                         by text messages. Okay.
                                         
                                         I was able to log in
                                         
                                         my username and password or
                                         
                                         whatever third digit of your
                                         
                                         security code was fine,
                                         
                                         but the SMS sending service is
                                         
                                         down.
                                         
                                         And now it's a reasonable choice.
                                         
    
                                         Maybe a possibly reasonable choice is to allow re-only access.
                                         
                                         Makes sense.
                                         
                                         Because then the SMS...
                                         
                                         Just not do two-factor.
                                         
                                         That could be a choice you could make.
                                         
                                         Absolutely.
                                         
                                         Interesting.
                                         
                                         So it has to be...
                                         
    
                                         You have to think about it.
                                         
                                         And actually designing these systems,
                                         
                                         systems, this is the particular, right,
                                         
                                         systems that are made up of the multiple components,
                                         
                                         as we design them, we always say,
                                         
                                         what if it's not there?
                                         
                                         What if it's unreliable?
                                         
                                         What if it's slow?
                                         
    
                                         And not treat it as either an academic discussion,
                                         
                                         as in, well, what if,, what if the data center is offline?
                                         
                                         No, no, no.
                                         
                                         I mean these mundane, boring things, failures that happen all the time.
                                         
                                         Okay, well, someone broke the table.
                                         
                                         The database is inaccessible.
                                         
                                         My SMS sending service is down.
                                         
                                         The email SNTP server is down.
                                         
    
                                         And they happen all the time.
                                         
                                         And so that's the first layer of digging. the answer shouldn't be oh well it there has to be something there has to be a decision and it it's
                                         
                                         the implicit oh well that causes the most problems oh well could be a decision like i guess or i mean
                                         
                                         i guess it's implied you're saying we need an explicit decision to say we have to.
                                         
                                         That's exactly it.
                                         
                                         It can never be.
                                         
                                         We assume that.
                                         
                                         And it takes so much of mental practice
                                         
    
                                         because we are used to things.
                                         
                                         And, you know, we program, right?
                                         
                                         We say, I don't know, select star from user.
                                         
                                         And if TCP connection denied,
                                         
                                         then most people, select star from user, and if TCP can actually deny it, then
                                         
                                         most people, me, myself
                                         
                                         included, will say,
                                         
                                         what do you want?
                                         
    
                                         Seriously, what do you want?
                                         
                                         But if we don't say it out loud,
                                         
                                         if we don't say that
                                         
                                         if the database isn't there,
                                         
                                         I'm not doing anything, then
                                         
                                         other people won't know that we have made
                                         
                                         this decision. It's this hidden
                                         
                                         decision, and that's going to come back and bite us.
                                         
    
                                         I assume also the cutting things
                                         
                                         up this way into
                                         
                                         microservices allows us to be
                                         
                                         more explicit about various
                                         
                                         parts failing while others remain.
                                         
                                         Like in a monolith, it's hard to make these...
                                         
                                         Yes.
                                         
                                         Absolutely.
                                         
    
                                         The lines aren't clear, I guess, between...
                                         
                                         Yeah, absolutely. yes the absolutely the lines aren't clear I guess between yeah yeah yeah
                                         
                                         absolutely
                                         
                                         it's
                                         
                                         you know
                                         
                                         people want
                                         
                                         we want it right
                                         
                                         when you go on
                                         
    
                                         Netflix
                                         
                                         and you watch stuff
                                         
                                         and you're just like
                                         
                                         no you expect it to happen
                                         
                                         because after all
                                         
                                         you paid $10
                                         
                                         and
                                         
                                         you know
                                         
    
                                         I want
                                         
                                         the world-class service
                                         
                                         I want
                                         
                                         files to be ready
                                         
                                         on CDNs,
                                         
                                         ready for me to start watching
                                         
                                         within a second of pressing play.
                                         
                                         Because come on, I paid my $10.
                                         
    
                                         You know, we are guilty of that expectation.
                                         
                                         And so I find it bizarre that we would then be at work
                                         
                                         and say things like, no, no, well, you know,
                                         
                                         it's the thing we can do.
                                         
                                         And we're building a system for a bank
                                         
                                         or for a retailer
                                         
                                         or for something
                                         
                                         that actually gets more money than
                                         
    
                                         the $10 per month on Netflix.
                                         
                                         So Jan, you're
                                         
                                         the CTO of Cake Solutions.
                                         
                                         What's that like?
                                         
                                         It doesn't sound to me
                                         
                                         like you're doing CTO things.
                                         
                                         It sounds like you're
                                         
                                         in the weeds of designing distributed systems.
                                         
    
                                         Well, quite.
                                         
                                         I mean, I guess I would hate the idea of not understanding what my teams are doing.
                                         
                                         That frightens me.
                                         
                                         And the same goes for non-coding architects and that sort of thing.
                                         
                                         I think to actually make a useful decision,
                                         
                                         one really has to know what's involved.
                                         
                                         And I think these are complicated decisions.
                                         
                                         And so it would really annoy me if I didn't know,
                                         
    
                                         if I didn't understand, and if I had to go to meetings
                                         
                                         where I would hear things that people talk about,
                                         
                                         and I sat there and thought,
                                         
                                         well, it sounds complicated.
                                         
                                         I suppose they're right.
                                         
                                         That would really frighten me.
                                         
                                         So I think it's actually super important
                                         
                                         for techies to be leading
                                         
    
                                         big teams and big companies.
                                         
                                         I think that was an interesting article
                                         
                                         I read a while ago that said
                                         
                                         that, I think it was
                                         
                                         Joe Spolsky, in fact,
                                         
                                         that talks about his time
                                         
                                         at Microsoft.
                                         
                                         He said that Bill Gates,
                                         
    
                                         you know, the back then CEO,
                                         
                                         I don't know, see something, chief, very important person.
                                         
                                         Yeah.
                                         
                                         He said he would dig into technical details. He says he remembers the time when Bill Gates was insisting
                                         
                                         that there is a reusable rich text editor component.
                                         
                                         Now, that can only come from really understanding
                                         
                                         what the hell's going on in technology.
                                         
                                         Without it,
                                         
    
                                         how would
                                         
                                         a CTO who doesn't code,
                                         
                                         how would a CTO who doesn't understand any of this
                                         
                                         insist on having a reusable rich
                                         
                                         text component?
                                         
                                         So, yeah, that's
                                         
                                         my take on it anyway.
                                         
                                         I think it's great. I think that all CTOs should take that on.
                                         
    
                                         I think, I guess, people just get buried by the ins and outs of the job and sometimes...
                                         
                                         And that's fair, you know, it's fair.
                                         
                                         And it's tempting to say, to achieve the optimum meeting density, as Wally says in truth.
                                         
                                         You know, it's tempting, but no resist.
                                         
                                         Honestly, if anyone's listening, don't do it.
                                         
                                         It's horrible.
                                         
                                         Be interested.
                                         
                                         There is so much interesting stuff going on.
                                         
    
                                         That goes down to teams that are implementing all these microservices.
                                         
                                         Some of the stuff is, on the outside, boring.
                                         
                                         Oh, goodness, it's boring. Who would
                                         
                                         like to make a PDF report? Really? You know, Jasper reports, and off we go. But what I'm
                                         
                                         always reminded of, and again, I'm terrible with names, no idea who said this, but when
                                         
                                         one looks closely enough, everything is interesting. And I think that's really the case.
                                         
                                         Like, even PDF reports
                                         
                                         can be made more interesting.
                                         
    
                                         I mean, in the darkest of days,
                                         
                                         we used all our bloody transformers
                                         
                                         to make PDF reports.
                                         
                                         So the PDF report thing is still boring.
                                         
                                         But we could use this other stuff,
                                         
                                         which was really interesting.
                                         
                                         And I guess my point is,
                                         
                                         you know, this is the time to
                                         
    
                                         be alive.
                                         
                                         There's so much
                                         
                                         technology available to us that I
                                         
                                         really don't think that anything can be boring.
                                         
                                         That's a great
                                         
                                         sentiment. Do you want to say a little
                                         
                                         bit about Cake Solutions before we wrap up?
                                         
                                         Well,
                                         
    
                                         sure.
                                         
                                         I'm sure you've heard what we used to do so we used to do these
                                         
                                         and we still made distributed systems but of course about a year ago we were working for
                                         
                                         for our clients and then about a year ago we we were acquired by bantech. And Bantech was subsequently acquired by Disney.
                                         
                                         And so we now
                                         
                                         like distributed systems, it's the same stuff,
                                         
                                         right? But we now concentrate on
                                         
                                         media delivery. So
                                         
    
                                         if anyone is
                                         
                                         a sports fan, and if you guys are watching
                                         
                                         ESPN+,
                                         
                                         there you go, that's the stuff.
                                         
                                         Those are the distributed
                                         
                                         systems.
                                         
                                         I can't tell you, but
                                         
                                         in 2019,
                                         
    
                                         there will be a thing
                                         
                                         that will just be the best thing
                                         
                                         in existence, like sliced bread,
                                         
                                         nothing.
                                         
                                         Wow, it sounds exciting. I can understand
                                         
                                         where you're coming out from scale, then, if you're working on ESPN.
                                         
                                         Oh, God, yeah, absolutely.
                                         
                                         This was the volume of media.
                                         
    
                                         And I'll try to be completely specific about what we do,
                                         
                                         but you can tell how delivering that sort of experience is super important.
                                         
                                         It's super important for these systems to deliver content to our users.
                                         
                                         Now, I remember back in the days
                                         
                                         when we were making these
                                         
                                         quote-unquote ordinary distributed systems
                                         
                                         for, you know, banks and e-commerce.
                                         
                                         Well, you know,
                                         
    
                                         if an e-commerce app is not available,
                                         
                                         it's annoying.
                                         
                                         You know, we can get it in our days.
                                         
                                         Imagine, though, there's a game,
                                         
                                         there's like a
                                         
                                         baseball game
                                         
                                         football
                                         
                                         you're a fan
                                         
    
                                         you really want to
                                         
                                         see this
                                         
                                         right
                                         
                                         and a 500
                                         
                                         no
                                         
                                         no well first of
                                         
                                         all it's live
                                         
                                         so
                                         
    
                                         like
                                         
                                         it's
                                         
                                         it's
                                         
                                         it cannot
                                         
                                         happen
                                         
                                         you cannot
                                         
                                         have 500
                                         
                                         because
                                         
    
                                         where it's
                                         
                                         in e-commerce
                                         
                                         users were annoyed
                                         
                                         if the thing went
                                         
                                         down now they're
                                         
                                         angry but deeply angry so you know that that was uh that was quite an eye-opener but but having
                                         
                                         this this distributed system comprised uh you know part of these microservices allowed us to
                                         
                                         think about what will happen what do we do if something breaks?
                                         
    
                                         What's the mitigation?
                                         
                                         Because ultimately the motivation is to deliver content.
                                         
                                         You know, our users have paid for this.
                                         
                                         And they're fans.
                                         
                                         This is their passion.
                                         
                                         They want to see this video.
                                         
                                         So there you have it.
                                         
                                         Awesome.
                                         
    
                                         So thanks so much for your time, Jan.
                                         
                                         It's been great great it was an absolute
                                         
                                         pleasure
                                         
