CoRecursive: Coding Stories - Tech Talk: Micro Services vs Monoliths With Jan Machacek

Episode Date: June 6, 2018

Tech Talks are in-depth technical discussions. I don't know a lot about microservices, like how to design them and what the various caveats and anti-patterns are. I'm currently working on a project that involves decomposing a monolithic application into separate parts, integrated together using Kafka and HTTP. Today I talk to the co-author of the upcoming book, Reactive Systems Architecture: Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you'll really enjoy this interview. The insights from this conversation are already helping me. Jan Machacek is the CTO at Cake Solutions. Video: long-lived microservices. Book: Reactive Systems Architecture.

Transcript
Starting point is 00:00:00 Welcome to CoRecursive, where we bring you discussions with thought leaders in the world of software development. I am Adam, your host. Hey, here is my confession. I do not know a lot about microservice architectures. I'm currently working on a project that involves decomposing a monolithic application into separate parts and integrating those parts using Kafka and HTTP. Today I talked to the co-author of the upcoming O'Reilly book, Reactive Systems Architecture, subtitled Designing and Implementing an Entire Distributed System. If you want to learn some of the hows and whys of building a distributed system, I think you'll enjoy this interview. The insights from this conversation are
Starting point is 00:00:50 already helping me. Jan Machacek. Actually, is that how you pronounce your name? It's Moflem. Sorry, I'll shut up. Jan Machacek is the CTO of Cake Solutions and a distributed systems expert. How do you feel about being called an expert?
Starting point is 00:01:13 It's false advertising entirely. No idea if I would call myself an expert. I've had my fingers burnt quite a few times. So I'm not sure if that qualifies as the answer. I think that it qualifies, like, from where I'm sitting, you know, somebody who's made some mistakes is in a great place to stop me from making them.
Starting point is 00:01:32 Okay, I'll do my best and make more mistakes. So recently I've been trying to get up to speed on how to split up a monolithic app and also just about how you might design a distributed system in general. And my former colleague, Pete, recommended I check out your work. So why would I build a
Starting point is 00:01:54 distributed system? Now, that's actually a killer question. A lot of the times, so typical answer is that people say, oh, well, we don't like this monolithic application because it's monolithic. That's the only reason. And by monolithic, I mean a thing where all the modules, all the code that makes up the application runs in one container. I'm using the word container in this most generic way. So, you know, maybe a Docker container, maybe a JVM or something,
Starting point is 00:02:28 but everything runs in this one blob of code. And that doesn't actually mean that it doesn't scale, right? It's quite possible to think of multiple instances of this monolithic application, monolithic in the sense that it contains the thing that deals with orders and the thing that deals with generating emails and what have you. But it scales out, right? It runs on multiple computers and it talks to, I don't know,
Starting point is 00:03:00 again, a scaled-out database. Now, that might be okay for business, but clearly it's not really entirely satisfying for development. And soon enough, the two will meet, and business will say things like, well, guys, we need you to update the blah component, the thing that generates the emails that go out to our customers. And the development team says,
Starting point is 00:03:25 yeah, okay, yeah, yeah, sure, we can do that. Tappity, tappity, tap. And then they say, yep, that's all done. So when can we redeploy this entire thing? Now, that's where it's starting to get a bit ropey. I just think, you know, the product owners, the business will have questions like, okay, yeah, yeah, okay.
Starting point is 00:03:43 Are you sure you haven't, like, everything else is fine, right? And then we, the programmers, say, oh, yeah, yeah, yeah. We haven't changed a single line of code in this other stuff. We've only changed the lines of code that make up the email sending service. And we deploy it and then something breaks.
Starting point is 00:03:58 Now that's actually really unpleasant. So we say, oh, we don't want that. We realize that the entire system is made up of smaller components and they have a different life cycle. So we want to break them up. So we want to have a service that does email sending. We want to have a service that does, I don't know, the product management and some sort of e-commerce component that takes payment. And so we break it up. Now, when we do, we gain the flexibility
Starting point is 00:04:29 of deploying these smaller components independently, but there's a price to pay, right? The complexity of the application remains the same. And the fact that we've broken it up and we said, okay, this is now simpler. Well, the complexity hasn't gone away. That still stays. And we've just pushed it aside to the way in which these components talk to each other.
Starting point is 00:04:52 So this brings me to like the first monumental mistake. And yeah, I've made it. We said, oh, well, what we actually have now is a distributed monolith. What I mean by that is application that runs in multiple processes, multiple containers, if you like, but its components need to talk to each other. So to display a web page, I need to talk to the product service. And I also need to talk to the discount service and the image service. And I need to get all three results before I can render out the page. And if one of these components fails, ah, well, that's
Starting point is 00:05:26 tough. It doesn't work. So that is a particularly terrible example of a distributed monolith. Essentially, what you've replaced is the in-process method calls and function calls with RPC. Now, even when I say it out loud, it sounds pretty unpleasant. Who would want to do that? So, clearly, this might not be the right thing to do, right? You have a monolithic application, you decide to break it up. But you need to take a step back. The designers need to take a step back and say, is the fact that we now have a distributed application, which means that time is not a constant. So we need to worry about time.
Starting point is 00:06:13 We need to worry about unreliable network. We need to worry about network as such. These components need to be talking to each other. And because these components will now, each of them will get their own life, we need to worry about versioning all of a sudden. So all that brings up this
Starting point is 00:06:31 vast blob of complexity that we need to solve. And you might have heard about this reactive application development thing, have you? Yeah. I feel I'm not clear whether it's just a buzzword, to be honest, like a marketing term.
Starting point is 00:06:52 I hear you. What is Reactive? So it has these four key components, and it centers around being message-driven. So the components, when they interact with each other, it's not a request followed by a fully baked response. It's a world where, you know, one of the components publishes a message and it says, okay, here's my favorite example. So we're building this e-commerce website, and when we're
Starting point is 00:07:25 designing the checkout process, we have this entire thing, and we've integrated it with payments, this all works. But each service, when it runs, it should emit messages. Now, this is all sort of hand-wavy. You're not seeing me live,
Starting point is 00:07:42 but I'm waving hands, so you can imagine the picture in all its horror. But you publish a message onto some message bus. So think Kafka, think Kinesis, that sort of thing. A durable message. We'll get to that. So hang on to the word
Starting point is 00:07:58 durable message. Now, each service, as it operates, it should emit as much information about its work as possible. On to this message topic now. You've successfully implemented this. This first version of the e-commerce website works. Now, because these services were emitting these events as they went along,
Starting point is 00:08:18 it is now possible, retrospectively, this is the cool thing, it's now possible for someone to come along and say, you know what, I'm going to build a blackmailing service. The blackmailing service is going to go through all the historical messages in this persistent message journal, and it's going to pick out the embarrassing orders that I've made. You wouldn't believe the stuff I buy on Amazon. Now, if we design a microservice architecture that way,
Starting point is 00:08:44 so we are event-driven, each service publishes an event, the events are durable, it's possible later to construct another service that consumes these past and future events and does more work, right? This is how we extend. So this is the promise. Surely we can just add another service to our system and it now does more things. Now, if you think about it, that can only be achieved if we have this historical record of stuff that happened. Not just here's an entry in the database, that's like a snapshot in time; you have a stream of messages going from, like, message number one, the beginning of time, all the way to today.
Starting point is 00:09:32 Now, if you were to replay all the messages and write them to a database table, you'd get a snapshot in history. You'd get a snapshot of when I replayed all the messages, here's the state of the system. But as more messages come in, of course, this snapshot keeps changing. And so it's actually sometimes really useful to think of even a database table or some sort of persistent store as a snapshot of this system at a particular point in time. But the life of it is expressed through these messages. So just so I'm clear, you're saying you designed the e-commerce system as a standalone monolithic app. However, you make sure that you're emitting everything you're doing to some. You know what? That could be a good start. If you have a monolithic application, you think, I want to do microservices. Emit stuff first and build the new stuff as a new microservice.
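To make that concrete, here is a minimal sketch of what it could look like using the Kafka client from Scala. The topic name, the JSON string payloads, and the consumer group are assumptions for the example, not the system Jan describes; the point is only that the existing application emits a durable event for everything it does, and a service written much later can consume the whole history by starting from the earliest offset.

    import java.util.Properties
    import java.time.Duration
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
    import scala.jdk.CollectionConverters._

    // The monolith keeps doing its job, but also emits a durable event for every
    // order it processes. acks=all means the send is only confirmed once the
    // in-sync replicas have the record.
    val producerProps = new Properties()
    producerProps.put("bootstrap.servers", "localhost:9092")
    producerProps.put("acks", "all")
    producerProps.put("key.serializer", classOf[StringSerializer].getName)
    producerProps.put("value.serializer", classOf[StringSerializer].getName)
    val producer = new KafkaProducer[String, String](producerProps)

    def emitOrderPlaced(orderId: String, eventJson: String): Unit =
      producer.send(new ProducerRecord("orders-v1", orderId, eventJson))

    // A brand-new service, added retrospectively, replays the journal from
    // "message number one" and then keeps consuming future events.
    val consumerProps = new Properties()
    consumerProps.put("bootstrap.servers", "localhost:9092")
    consumerProps.put("group.id", "new-downstream-service")
    consumerProps.put("auto.offset.reset", "earliest")
    consumerProps.put("key.deserializer", classOf[StringDeserializer].getName)
    consumerProps.put("value.deserializer", classOf[StringDeserializer].getName)
    val consumer = new KafkaConsumer[String, String](consumerProps)
    consumer.subscribe(List("orders-v1").asJava)

    while (true) {
      for (record <- consumer.poll(Duration.ofSeconds(1)).asScala)
        println(s"offset ${record.offset}: ${record.value}") // build the new read model here
    }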
Starting point is 00:10:23 There's a whole bunch of value in that versus the usual enterprise initiative that says let's take what we have and rewrite it. I have yet to see one successful rewrite from scratch.
Starting point is 00:10:41 It just never works. There is so much value and so much experience baked in existing code that rewriting it is almost always a disaster. But extending it is a good idea. Now, extending it, you don't want to add another monolithic bit onto it. So adding messaging, asynchronous messaging, which is another bit, another asynchronous non-blocking, another reactive buzzword, right? You don't want to, as part of processing of, I don't
Starting point is 00:11:10 know, a request from a user, the last thing that the app should do is to wait and block on some sort of I.O. Because we are distributed, after all, and this I.O. could be network.
Starting point is 00:11:27 Now, who knows how long that might take? Sometimes even forever. Sometimes there's no result. And, you know, these TCP sockets time out after, I don't know, 60 seconds. 60 seconds. There's no way on earth that a thread, this heavyweight thing, should be blocked for 60 seconds. So it needs to be
Starting point is 00:11:46 asynchronous. Now, modern operating systems, I mean, the kernels of the modern operating systems have all these asynchronous primitives. So it's possible to say things like, begin reading from a socket and continue running
Starting point is 00:12:01 this thread, and then the OS will wake up some other thread and will say, okay, yes, I now have the data. Here's your callback. I've read five kilobytes from the socket. Here, you deal with it. And then it's up to the application frameworks to manage it in some reasonable way for the application programmers. So I do a lot of Scala, so there's a whole bunch of asynchronous,
Starting point is 00:12:24 convenient asynchronous libraries, you know, like Akka. Go obviously goes in a very similar way. So is this fire and forget, or is it that I'm just not blocking, but then the scheduler is going to return something? Like, is every request returning like a 204 or something? Like I say, save user, and they're like, okay, we heard you. That's also a really, really, really good question. If you can get away with fire and forget, your system will be so much quicker and so much easier to write. Now, a lot of our systems,
Starting point is 00:13:07 the users wouldn't accept fire and forget. So again, think about typical messaging. Read from a socket or write to disk or write to this persistent journal. Now, if you accept at-most-once delivery, so fire and forget, what these components are telling you is we're going to do our best. We'll try not to lose your message. And that's probably okay for statistics. It's probably perfectly fine for health checks monitoring. But it's not okay for, I don't know, payments. This is where it gets a little more difficult. So if a component has taken money from your card, the thing that receives it really should receive it. Now, this is where it gets really complicated
Starting point is 00:13:57 because distributed system, right? Now, okay, so you and I need to exchange information in a guaranteed way. What I can do is say, hey, Adam, so I have the payment for you and I can hang up or disconnect my computer. Now, from where I'm sitting, I'll hear nothing from you. Okay, okay, okay, okay. I don't know if the message that was going to you got lost and you actually received it and you have it, or if you,
Starting point is 00:14:28 or if the, you actually received the message, you began processing it, but then you crashed. So I should send it again in both cases. Or has, is it the case that I sent a message, it successfully got to you,
Starting point is 00:14:40 you processed it, but then just as you were replying, my network went down. In which case, you have it, but I don't know that. So I'm going to send the message again to you. Now, that means that you might get duplication. So you might hear the same message again. Now, there's nothing I can do about it, because from my perspective, I haven't heard a confirmation from you. So I need to send it to you again. You now need to do extra work to deduplicate. Now that's problematic. How do you do deduplication? Well, okay. You hash the message, right? So you compute some SHA-256 of it. You keep it in memory. Yeah. How much memory do you have?
Starting point is 00:15:23 Because this can go on for a really long time. Now, this is where we say pragmatic things like, well, in reality, it's all, you know, we have a system. We know that between you and I, we exchange like one message per second. So you think, well, how much memory do I really need to remember this stuff? Oh, yeah, I'm going to need the last 10 messages. So you sort of make sure you have memory for the hashes of the last 10 messages, and that's good enough for you. And the key thing is to measure the system, right, and know how much is going through it.
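A minimal sketch of that kind of consumer-side deduplication, with invented names: hash each incoming payload and keep only the last N hashes in memory, where N comes from measuring the actual message rate.

    import java.security.MessageDigest
    import scala.collection.mutable

    // Deduplication for at-least-once delivery: remember the SHA-256 of the last
    // `capacity` messages and drop anything we have already seen.
    final class Deduplicator(capacity: Int) {
      private val seen = mutable.LinkedHashSet.empty[String]

      private def sha256(payload: Array[Byte]): String =
        MessageDigest.getInstance("SHA-256").digest(payload).map("%02x".format(_)).mkString

      /** True if this payload has not been seen before and should be processed. */
      def firstTime(payload: Array[Byte]): Boolean = synchronized {
        val hash = sha256(payload)
        if (seen.contains(hash)) false
        else {
          seen += hash
          if (seen.size > capacity) seen -= seen.head // evict the oldest hash
          true
        }
      }
    }

    // Sized for roughly one message per second; a few hundred hashes is plenty.
    val dedup = new Deduplicator(capacity = 600)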
Starting point is 00:15:59 And then you can tune these deduplicating strategies and you can keep things in check. But it's a probability game. At some point, something is going to go wrong. If you phone me and say, hey, I got your payment, then that's like at most once because you hang up,
Starting point is 00:16:19 you haven't heard an acknowledgement from me. Exactly. Then if you wait for me to acknowledge back, then we have this problem where that's at least once, because maybe the line cuts out just while I'm telling you, so then you have to re-deliver, so I got it twice. So then, I mean, I think what everybody really wants is exactly once. I know, I want that too. I'd also like a wristwatch with a fountain. Never gonna happen. So is exactly once possible? Okay, yes, if you reduce the context, so if you remove a lot of the I/O, or if you are prepared to trade off something else. So you think
Starting point is 00:17:08 I want exactly once. Okay. Okay. Okay. That's fine. Which now means that you need to say, I need an extra coordinator and this extra service, extra coordinator can now tell me, have I seen this message? Yes or no. And if the answer is yes, I've already seen this message, then okay, reject it. Don't even attempt to deliver it. And that's okay, but you've sacrificed availability. And if this coordinator goes down, your system has to say, well, no, I'm no longer sending anything. So it is a trade-off. Throughout this thing, it's a trade-off. So in the phone example, I guess, if we had some third person who tracked whether... We both acknowledge to the third party, is that the idea?
Starting point is 00:17:52 That's the idea. You'd have something else that listens and says, yeah, no, that one can go through, that one can go through. So let's rewind. You said earlier that... I think that you said, like, one of the main reasons for wanting a distributed system was deployment. It seems like a small reason to me, just to want to deploy components separately. The big reason, the favorite reason, is we say scale. That's the kicker, right? And we always imagine the system that goes, you know what, I'll be Amazon. This will be so cool.
Starting point is 00:18:33 I'll have 100,000 requests per second. Maybe even more. And then you say, okay, well, it's unreasonable for a monolithic application to be able to handle 100,000 requests per second when the distribution of work is actually 90,000 of browsing, people just look for stuff, and only 10,000 is people buying. I think I've made those numbers up. I think if that's the case, then, you know, Mr. Bezos is the happiest man on earth. If you get like 90, 10, 9-1 browsing to actually buying,
Starting point is 00:19:05 oh, wow, that would probably be pretty good. So I think the numbers are different, but it doesn't matter. So, yes, it's scale. And, of course, if the system is divided into different services, each service can be scaled differently. What's actually even more encouraging about it is that when the system is broken up into services, we can now really think about failure scenarios and what do we do if something is broken. Take an e-commerce website. Let's say that the e-commerce website is really keen that people get whatever goes into a shopping basket. People have to be able to get it, right? We're just going to believe that. So if I put stuff in my shopping basket and browse, so that's one set of microservices
Starting point is 00:19:55 that I can do the search and the image delivery and all that personalization. And it's all in my shopping basket. And I go to the payment, and the payment service goes down. Now, one of the options that we have, which would be sort of silly in this particular example, but nevertheless, reasonable to think about it, let's say we're really good, and we say we want our customers to be super happy. They've spent all this time putting their stuff in the shopping basket. Regardless of whether the payment service works or not, we're going to send the stuff.
Starting point is 00:20:28 And so if I make a... See, now, okay, I know, silly, right? But you make a request to this payment service, and you say, hey, I want to now pay for this. You know, I put a few DVDs in my... I don't even have a DVD drive. But, you know, years ago, right? I put a couple of DVDs in my shopping basket,
Starting point is 00:20:45 and I paid for it. I checked out. Checked out. The service that ran the website made a request to the payment service. That was down. And so I said, never mind. Let's just do another list.
Starting point is 00:20:56 Now, ridiculous, you say. Of course it is. Think, though. Maybe you're subscribing to music, online music. You're a Spotify. I've actually heard of an example. So Starbucks, they have their member,
Starting point is 00:21:11 it's like a gift card or whatever, but it's also their membership card, right? And you can put coffees on it. So I guess they have problems with their system going down. And when down, they just give free coffees. Absolutely right. Absolutely right. Now you can extend it to say a music service so suppose you
Starting point is 00:21:26 want to listen to something. You open your phone, you say, I'm going to subscribe, I'm going to subscribe using Apple Pay. And so the payment has gone through the Apple systems, and then the receipt gets sent to this music service, and it says, hey guys, so I have a receipt number 47 from Apple. And the music service now says, okay, well, I'm going to go to Apple and just double check that receipt number 47, has that been paid? Apple's service goes 503, over capacity. Now at that point, it's really reasonable for the music service to say, ah, never mind, I'm going to let the user listen to stuff.
Starting point is 00:22:02 For the next 15 minutes, I'm going to hand out the authentication token that allows access to all this music. But 15 minutes later, the user has to check back, because by then I will have checked for sure with Apple, and I will know. So this is actually a really reasonable thing to do. Now, again, it sort of makes, I guess, the commercial people's blood go stone cold, right? But what's the alternative? If we didn't do this, then the alternative would have been really annoyed customers.
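As a sketch, with made-up function names standing in for the real payment and token components, the music service's decision might look something like this: verify the receipt when Apple answers, and fall back to a short-lived provisional token when it doesn't.

    import scala.concurrent.{ExecutionContext, Future}
    import scala.concurrent.duration._

    sealed trait Access
    case class FullAccess(token: String)        extends Access
    case class ProvisionalAccess(token: String) extends Access // client must re-check after expiry
    case object Denied                          extends Access

    // verifyReceipt and issueToken are hypothetical stand-ins for the real services.
    def grantAccess(receiptId: String)
                   (verifyReceipt: String => Future[Boolean],
                    issueToken: FiniteDuration => String)
                   (implicit ec: ExecutionContext): Future[Access] =
      verifyReceipt(receiptId)
        .map(paid => if (paid) FullAccess(issueToken(24.hours)) else Denied)
        .recover {
          // 503, timeout, connection refused: trust the user for 15 minutes and
          // force a re-check once the provisional token expires.
          case _: Exception => ProvisionalAccess(issueToken(15.minutes))
        }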
Starting point is 00:22:31 They would have used something. Everything would have worked on their side. Just because one of our services is down, they don't get to do what they want it to do. So they pick up the phone or write a tweet. It's actually better in a lot of cases to just trust people and maybe allow them access for a short duration of time. Again, it depends on the industry and all that.
Starting point is 00:22:53 Generalizing overly or overly generalizing even. But with distributed system, this is now possible. If you were in a monolith, if the payment service went down because it's part of the monolith, that's it, you're done. Nothing works.
Starting point is 00:23:10 If it's a distributed system, you have a better chance of defining some sort of failure strategy. And because the payment system publishes events about what happened to it, you as developers, you now have a chance of going through what happened and going, oh, this sort of sequence of events might lead to a failure, which you could use to improve your code, or maybe you can write some sort of predictive mechanism that tells the operations team or tells whoever happens to be on pager duty that says, look, heads up, this is not going to be pretty.
Starting point is 00:23:46 We're still fine, but something's coming. And that is fantastic. That is the one thing that keeps services running, keeps the entire systems running. It's interesting because it brings to mind to me, even if you have a monolithic app,
Starting point is 00:24:01 likely you already have a distributed system, right? Because you're calling out to some payment processor, you're calling out to some database. So even though your application is fully formed, right? I guess what I'm hearing you saying is that you need to model these systems and what will happen when you can't reach them. I just always assume I can reach the database. If I can't, you know, whatever. Absolutely right.
Starting point is 00:24:27 Absolutely right. And the worst thing about it is that at small scale, I don't mean it in a bad way, but at low scale, it's quite possible to go to AWS and provision a MySQL instance and that thing will run forever.
Starting point is 00:24:46 It'll never fail. Because it's just one computer, but as the number of computers increases, the chance of failure goes right up. Something is going to fail. Now, this was
Starting point is 00:25:02 I used to say this. I used to say, oh, well, we need to build these distributed systems, and we need to treat failures as first-class citizens. This is really important. And I said that out loud, and at the back of my mind, I thought, yeah, but come on, when does this happen? Well, we now happen to have a system that handles significant load. And it happens every day. Every day, some service goes down. A couple of nodes of a database go down. A couple of brokers go down. They get restarted. And it's fine. The key thing is, because we can
Starting point is 00:25:39 reason about this failure and because we're prepared for it, it's fine. We survive it. Our infrastructure keeps running. Our application keeps running. The users are getting their responses. But internally, of course, we can detect errors and recover from them.
Starting point is 00:25:58 So that was very... that was the opener, truly. What do you think about this argument? I haven't heard it in a while, but it was very popular: you know, build the monolith first, and then once you reach these problems, then you start splitting. I mean, I don't have a general answer to that. I think a lot of the times you can know what the expected load is going to be. If you know, as in you have an existing customer base and you know that you have a million customers. And so the new service, when it goes live, it will get that load. Now, I think it would be silly in that scenario to say, no, no, no, we're just going to start with this thing that we
Starting point is 00:26:51 know isn't right. And then we'll see how it goes, how it crashes. Feel free to say it's a horrible idea. So it came, I think it was Martin Fowler who wrote about this probably several years ago. And I think his argument was, if you try to build this new application as a distributed system, but once it starts interacting with customers, your requirements start changing drastically. And if you don't know where the change is going to be, maybe you've cut things up in a way that doesn't make sense. And that was going to be my follow-on. If it's a new application, if this is just a, dare I say, startup,
Starting point is 00:27:37 then it makes total sense to experiment first. But there was another, I think that this was Googlers who wrote it, and it goes along the lines of design. Take a system and say, you say, okay, I need to be able to handle 10 concurrent users. Design it for 100. But once you get to 1,000, it's a new system.
Starting point is 00:27:58 And once you get to 10x the original design, it's just not going to work. So, okay, I know 10 is a silly example, but design, say you have 100,000 users, design a system that will hold up to, say, a million. But once your usage grows out to be 10 million, the thing isn't suitable. It's the wrong thing. Now, I think this is where Martin Fowler was getting at as well. The design choices for something that handles 10 or 100 users might be completely different to the thing that handles
Starting point is 00:28:31 1,000. And if the starting position is that you have zero users, well, wow, you're right. Design something, anything. Because as you point out, who knows what these silly users will say when they finally log in and say, well, I don't like this. And you might cut your system up the wrong way. You might define these consistency boundaries. That's a big word. I haven't come up with a big word for a long time. So consistency boundary, or this context boundary, because we're in a distributed system, we have, I'm sure people know the CAP theorem, which says, okay, well, you have consistency, availability, or partition tolerance.
Starting point is 00:29:18 You'd like all three, but you can only have two. And as long as one of them is partition tolerance, because you have a distributed system, right? So partition tolerance is one; now pick the other: consistency or availability. And it really depends on what you're building. Now, the choice isn't usually quite... it's not binary. It's not like consistency 1-0. There's a scale. But what you're ultimately dealing with is physics. You have a data center, say there's a data center in Dublin, and there's a data center in East Coast, US East, Amazon, and EU West in Dublin. It takes time for this electricity nonsense to get from America to Ireland.
Starting point is 00:30:00 It just does, and there's nothing you can do about it. And so if you write a thing to a system like a database in Dublin, it's going to take 100 milliseconds before the signal gets to US East, and there's nothing you can do about it. That's just life.
Starting point is 00:30:18 So what can you do? You can say, well, so if someone else is reading the same row from, you know, US East, they'll see old data because it's just not there yet. So you can say, oh, okay, okay, okay, okay. So you hate the idea.
Starting point is 00:30:35 Absolutely hate the idea. It must be consistent. When I look at my, this database row, I must see, wherever I look in the world, I must see the same value. All right. So you better have this other component, this coordinating guy that adds a tick and says, yes, I have now heard acks from every data center, every replica, that they now have it. And now it's good. Now I can release these read locks that people are waiting for. Okay? It's consistent.
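A toy version of that coordinator, just to make the trade-off visible; `writeTo` is a stand-in for whatever replication call the real store uses.

    import scala.concurrent.{ExecutionContext, Future}

    // The write is only acknowledged, and waiting readers only released, once
    // every replica in every data centre has confirmed the new value.
    def coordinatedWrite[Replica](replicas: Seq[Replica], key: String, value: String)
                                 (writeTo: (Replica, String, String) => Future[Unit])
                                 (implicit ec: ExecutionContext): Future[Unit] =
      Future.sequence(replicas.map(r => writeTo(r, key, value))).map(_ => ())
      // Consistent, yes. But if a single replica, or the coordinator running this
      // code, is unreachable, the future never completes and nobody gets served.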
Starting point is 00:31:09 What if one of these network connections, what if this coordinating component goes down? Oh. There you go. You can't be consistent. So the only thing that the system can do is say no, bye-bye, no service. I think even in monolithic applications, I think people don't think about the fact that maybe you're consistent
Starting point is 00:31:31 within the database, but that doesn't necessarily... Like, I imagine a scenario: you have a bunch of app servers and a database behind it, like Postgres, and you know, everything's ACID. But one user reads a record and it displays it somewhere, and then they change some things. And then your little magic ORM software saves the record en masse back. And somebody else could have the same record up at the same time. So the consistency is lost. Sometimes people don't even realize it, I guess. Oh, absolutely. Oh, absolutely. So the more you distribute the system, the more you have to deal with these scenarios, the more you have to deal with the possibility that something will be inconsistent.
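One common guard for exactly that lost-update scenario is optimistic locking: a version column, and an UPDATE that only succeeds if the row still has the version the user originally read. A sketch with invented table and column names:

    import java.sql.Connection

    // Returns true if the save won; false means someone else saved the same row
    // since we read it, and the caller has to decide what to do about that,
    // instead of silently overwriting their change.
    def saveUser(conn: Connection, id: Long, email: String, readVersion: Long): Boolean = {
      val stmt = conn.prepareStatement(
        "UPDATE users SET email = ?, version = version + 1 WHERE id = ? AND version = ?")
      try {
        stmt.setString(1, email)
        stmt.setLong(2, id)
        stmt.setLong(3, readVersion)
        stmt.executeUpdate() == 1
      } finally stmt.close()
    }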
Starting point is 00:32:11 And it can actually be a really, I've had a whole bunch of discussions with typically e-commerce people who find it absurd. What do you mean that I don't know how much stuff I have in stock? Well, you do roughly, but it's not down to like one unit. So they say, oh, no, no. How about we reserve items? You can. How about we lock items? Yes, you can.
Starting point is 00:32:35 But as the number of users grows, the number of locks or these reservations grows. And so it's really easy to get to the point where everything's locked and no one can actually do anything because you want to be consistent. Because you want to be consistent, you've sacrificed availability. And so, you know, people say, no, no, no, no, no, no, no, no, no. You can't have that. I want to be consistent and available. Which means you say, okay, fine. Buy one big, huge machine and run everything on this one machine. But even that's ridiculous.
Starting point is 00:33:08 That machine will have multiple CPUs. It will have wires going through it and they will break. So it's... And the availability is lost if that machine goes down. Oh, absolutely. It's not pretty. So you had an example about an image processing system you designed. I did, yes.
Starting point is 00:33:29 I'm wondering if you could describe that a little bit. Yeah, of course. So this was a really interesting thing. Now, I guess people are a little bit more worried about their private information. But nevertheless, this was in good old days where people uploaded pictures of their passports freely onto any old service they could find. Now, what we did is, it was obviously doing a couple of processing steps with these identity documents.
Starting point is 00:33:58 So what was the goal of the system? It was to essentially provide biometric verification of the user. So think you want to open a bank account and you don't want to go to the branch. And the bank doesn't really want to open a branch for you because it costs a lot of money. What they would like to do instead is to have the users use their phones to look into their phones and take a picture of their driving license or something. And for this system to say, yeah, no, this is looking good. The driving license is the real deal.
Starting point is 00:34:32 It's not being tampered with. And the thing that's staring into the camera, that's the same face. It's alive. So that was that. Now, of course, the users, they want this thing to run beautifully smoothly, right? And they will give you 10 seconds of attention, maybe.
Starting point is 00:34:59 So if the processing really isn't done within, say, 10 seconds, maybe 15 if we invent some clever animation, then this would have been a total failure. So this needed to be obviously scalable, it needed to be parallel, it needed to be concurrent. Many of these steps needed to be executed concurrently. And a lot of these steps were actually pretty difficult. As you can imagine, there was a whole bunch of machine learning involved. And so what the PDF that you might have read or the preview book that you might have read essentially describes the thing that there's a front service that ingests a couple of images, as in high-resolution pictures
Starting point is 00:35:39 of documents or whatever else, and it ingests a video of the person staring into the phone. And then the downstream services then compute the results. So it needs to be OCR'd. The face needs to be extracted from it. We need to check that you haven't just stuck another picture on. We used to do this. I didn't because I'm from Eastern Europe, so we used to drink all the time. But I keep hearing that people elsewhere in the world have to think that driving licenses
Starting point is 00:36:12 in order to prove that they are allowed to have a drink. And so they take someone else's driving license, put a picture on it, put their picture on it. So we need to be able to detect those scenarios, obviously. And then combine the results. At the end, there is a state machine that reacts to all these messages that are coming from these upstream components. And as soon as it's ready, it needs to emit a result. So imagine that you're the bank, right? You want to open a bank account. And you say, you know, a question,
Starting point is 00:36:45 how much, how much, maybe you're a betting company or someone who needs to verify identity of a person. And so, so if I'm a betting company and I say, okay, you know what? I'm going to play on the Grand National, which is a UK horse race.
Starting point is 00:36:59 Okay, you need, I've never bet before, right? So I need to register and prove that I'm over 18. So I go through this process. And then in the app, I say, how much do you want to bet? And I say, I'll splash out and bet £2. Well, right?
Starting point is 00:37:14 So the betting company then can have a rule that says, as soon as I hear from this system, as soon as it emits a message that says, look, you know what? We've OCR'd it, and this driving license looks like a UK driving license. At that point, it might say, that's good enough. It's fine. Allow that to go through. If, on the other hand, I was saying, well, look, I'm going to bet £3,000, they would actually need to wait for all these messages to arrive.
Starting point is 00:37:46 That is to say, it's OCR'd and we checked the driving license register to verify that it's the real thing. And we also compared the picture on the driving license with a face that was staring into the phone. And yes, it's the right person. Now, that is only possible if your system emits these messages. It would have been impossible to condense down and to start processing
Starting point is 00:38:16 and then wait until we have everything. We have this entire processing flow and then send one callback, one result to the gambling company. That makes sense. Because if we were doing, back to our startup, generating something quick,
Starting point is 00:38:35 I might have some web server that kind of throws these images into whatever Kafka topic, and then just something else that just does everything, right? Pulls it out and goes through it. So you're saying the problem with that is the upstream consumer needs to know
Starting point is 00:38:50 about specific... It wants to know before it's done, I guess. Yes, and the main challenge is that you don't know what it wants to know. Until you deploy the first cut of this,
Starting point is 00:39:05 and until you say to your customers, look, here is what we can do, you don't get the feedback. And what they might say is, well, that's nice, but couldn't you just send us, I'm a gambling company, couldn't you just send us
Starting point is 00:39:20 the initial processing? That's good enough for us. Oh, yeah, so if client equals equals five, then do this. And then some other client comes along and says, well, you see, but if the betting amount is more than 200, then we will need it. That way lies pain. So it's much better to just
Starting point is 00:39:43 emit everything that you know as soon as you know it from your services. Now, if you need to implement it in the system, and we decide together that we're going to be so good to our customers that we're going to implement this workflow engine, we can, as a separate service.
Starting point is 00:40:01 But as a separate service, that's the key. So we can defer that decision. Absolutely. So how does the client... The client is a confusing term here. So there's a user with a phone, but then there's your actual client,
Starting point is 00:40:15 like the bank so how are they consuming these results now there are multiple choices the most the crudest one is we just push stuff to them over these results? No. There are multiple choices. The most, the crudest one is
Starting point is 00:40:27 we just push stuff to them over a connection. Right? So we say to our clients who speak to
Starting point is 00:40:35 the system that should receive the messages from us, we say, okay, guys, why don't you stand up a
Starting point is 00:40:40 publicly accessible HTTP endpoint and we're going to post stuff to you? That's a horrible way of doing it. Don't ever do that. First of all, your customers will
Starting point is 00:40:50 hate it because they'll say, no, we don't want to stand up a web server in order to consume your service. That's a stupid idea. Firstly, and secondly, it pushes the responsibility of
Starting point is 00:41:04 making sure that you're talking to the right endpoint, like a dual service. It would be really bad if, say, you had this complicated image processing service, and if it posted data, private data, biometric data, to some other URL. Yeah. Through misconfiguration. So, yeah, horrible. So, how about we do it like Twitter
Starting point is 00:41:26 used to do? So we say, okay, you know, Mr. Client, open a long POST request, like a long-lived connection, and we will send you chunks, HTTP chunks of messages as soon as we receive something. Now, that's a much nicer way of doing it because it just allows the client to control where it's connecting to. It also means that the client has to check the HTTPS certificate on the connection. It has to know and it has to be sure that it's talking to the right service
Starting point is 00:42:03 so we can wash our hands of this whole horrible business. It also means that the client now is in control of its consumption. That's kind of nice. What's slightly problematic about it is the scale of it.
Starting point is 00:42:23 It is fine in the case of one image per second; if it were, oh, I don't know, a thousand images per second, that's a bit tougher, right? You would want multiple of these connections to be opened. And so now you need to think about a way to
Starting point is 00:42:38 partition which connection gets which results and which images, and they need to be routed more carefully. And you also need to think about... you need to be nice to your clients and say, we understand that the connection will go down
Starting point is 00:42:54 and you might have to make a new connection. Okay, but you need to be able to say things like, I'm making a connection, but I want to start consuming from record 75 instead of the last one. But that can be, you know, a query parameter. So that's a neat way of really bridging the gap between someone who says, so, we're Scala and JVMs and Kafka? And, you know,
Starting point is 00:43:21 our Mr. Client says, no, I'm.NET, none of this stuff, don't even talk to me. In which case, this sort of long HTTP connection works pretty well. Of course, the ideal scenario is the client would say, well, we would like for your system to expose another kind of interface and we can just mirror that. So that's one of the other options that we offer. Now, in the end, this particular system, we ended up just with these HTTP endpoints. So that's something
Starting point is 00:43:54 you do? You do integration with Kafka? We offered that, because there was a particular deployment that looked like it was going to be large enough to warrant that, but in the end, we dropped it. Interesting. I can see why it would be easy from your side.
Starting point is 00:44:10 Oh, yeah, exactly. That's why we said, oh, no, no, we can do this for you, but it... I didn't like it in the end because it exposed too much of our internal workings. It felt like integration through database wasn't quite right. Now, okay, you can exposed too much of our internal workings. It felt like integration through database.
Starting point is 00:44:26 It wasn't quite right. Now, okay, you can restrict the topics that are replicated. You can maybe transform them. But it still felt a little bit crummy. So I think it's much better to have a really sharp, completely disconnected interface. And even if you're talking, say you're running an AWS and you publish your messages to Kinesis, and then you would say to your client,
Starting point is 00:44:51 oh, yes, I see you have Kinesis. Why don't we push stuff to your Kinesis? Again, it feels like you're letting your dirty laundry out for everyone to see. So there should be an integration service that it might well consume from Kinesis, do some cleanup and publish to another Kinesis. That's all good, right?
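The shape of that integration service, reduced to a sketch with invented field names: consume the internal stream, keep only what the client is meant to see, and publish that to a deliberately separate, client-facing stream.

    // Internal messages carry everything the pipeline knows; the external message
    // is deliberately small, so the internal workings never leak to the client.
    case class InternalResult(requestId: String,
                              verdict: String,
                              modelScores: Map[String, Double],
                              debugTrace: String)

    case class ExternalResult(requestId: String, verdict: String)

    def toExternal(in: InternalResult): ExternalResult =
      ExternalResult(in.requestId, in.verdict)

    // The surrounding loop, reading from the internal Kinesis or Kafka stream,
    // mapping with toExternal, and publishing to the client-facing stream, is the
    // part that can live in its own small service.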
Starting point is 00:45:14 And it might even be a lambda for all I care. But it shouldn't be a direct connection. So that's interfacing externally. So inside of your microservices world, how do you draw these? How do you agree on these formats? Formats being the message bus or the... Yeah, how do you agree on how you communicate
Starting point is 00:45:37 So a lot of it is driven by the environment. So we have one system that's divided, and part of it lives in a data center, and part of it lives in AWS. Now that drives the choice, because there's no Kinesis in a data center. We had to use Kafka in that particular case.
Starting point is 00:46:02 And then we publish messages out to Kinesis, and the AWS components consume Kinesis. What's actually more important is to think about the properties of this messaging thing and think about the payload that goes on this thing. So in a distributed system, one of the super evil things to have is ephemeral messages. So if the integration between our components,
Starting point is 00:46:33 our services is like an HTTP request, there is no way to get back to it. If once the request is made, it's made, it's gone. There's no way, no record of it anywhere. So what we really prefer is persistent messaging. And you can then think of these message brokers as both messaging delivery mechanism as well as journal in the CQRS event-sourced sense.
Starting point is 00:47:07 So the message bus contains all messages for a duration of time, but they're not lost. So if I publish a message to Kinesis, I publish a message to Kafka, once the publisher gets the confirmation that, yes, okay, a sufficient number of nodes of Kinesis or Kafka have the record. It's there, right? It sits somewhere. It can be
Starting point is 00:47:31 re-read again. It's persistent. Now, actually, terms and conditions apply; that's not exactly what's happening with Kafka, because what happens is, by default, you have a cluster of, like, n number of nodes, and if you say, I want to receive a confirmation
Starting point is 00:47:48 when the record is published on a quorum number of nodes, that still doesn't mean that the message is written to disk. It just means that the quorum number of nodes have the message in memory. Now, it's a probabilities game. You'd have to be really unlucky that all of these nodes would fail before they would flush their memory to disk. Of course, you can say, no, no, no, it needs to be fsynced, but then the message rate goes right down. Now it needs to be fsynced to... we have a distributed file system, right? So, one of the components is running in OpenShift with GlusterFS, which is a
Starting point is 00:48:32 distributed network file system painted over a number of SSDs. And so we have a cluster of Kafkas that use this GlusterFS, which we're not entirely happy about, but we're also not entirely unhappy about because it runs. But hey, right? This is confusing you. It wouldn't be fun if it weren't messy. So just to make sure that I'm understanding. So we're going to... Your image service has a bunch of microservices inside of it. And they are communicating with each other using Kafka topics.
Starting point is 00:49:05 And we kind of assume that that is persistent, even if some conditions apply. So how do you organize it in terms of message schemas and topics? That's a good question. So for the messages, we've chosen protocol buffers as the format and description of our messages. So we're very, very, very strict about it. And we actually have...
Starting point is 00:49:31 This is where people are going to get really angry because one of the things about microservices is that they are independent. Each microservice is its own world. It shouldn't have any shared code and shared dependencies. Well, I mean, true, but then reality kicks in. Now, what we've done, so people won't get angry, we actually created one separate Git repository
Starting point is 00:49:53 that contains our protocol definitions for all the services. Now, this is weird. It's really kind of weird. But this one thing allows us humans to spot when someone is making a change to the protocol. So if there's a pull request to this one central repository that contains all the protocol definitions for the services, the entire team is on it and goes, oh, wait a minute.
Starting point is 00:50:19 If we merge this, is it going to break any of our stuff? Because remember, these microservices have independent lifecycle. And so it's quite possible to have version 1.2.3 of service A running alongside version 1.4.7 of service B. And following the usual semantic versioning rules, these should be compatible. Well, except semantic versioning is only as good as the programmers who make it. And so this is why we have this shared repository and we have
Starting point is 00:50:50 humans in my team sort of eyeball it and say, is this going to work? Yes or no? When we build the microservices, they all build protocol implementations using this shared definitions repository. So if there's a C++ version, it runs protoc to make its own C++ version of the protocol. If there's a Scala version, it's Scala code.
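As a hedged illustration of how that can look on the Scala side: the bytes that go onto the topic come from bindings generated out of the shared definitions repository (protoc for C++, something like ScalaPB for Scala), so every service agrees on the wire format. The topic name and the idea that the serialized result arrives here as a byte array are assumptions for the example.

    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // `resultBytes` would be produced by the generated protobuf class, e.g. the
    // toByteArray of a message defined in the shared .proto repository; a C++
    // consumer built from the same definitions can parse exactly these bytes.
    def publishResult(producer: KafkaProducer[String, Array[Byte]],
                      requestId: String,
                      resultBytes: Array[Byte]): Unit = {
      producer.send(new ProducerRecord("images-v1", requestId, resultBytes))
      ()
    }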
Starting point is 00:51:18 But then it's all quote-unquote guaranteed to work together when it's deployed. Is protobuf the best solution, or what do you think of? Ah, now, to me, the best solution is
Starting point is 00:51:35 language that has good language tooling. So you could say, oh, well, we should have used Avro. Surely Avro is more efficient. It has support in Kafka and all that. Maybe we should have used Thrift or something else. But when we were developing the system,
Starting point is 00:51:56 Protobuf had really mature tooling for the languages that we used. So there was tooling for Swift and Objective-C for the clients. There was good tooling for C++, and there was good tooling for Scala. And that made the difference. So I guess what I'm saying is you're free to choose which of the protocols you like, protocol language, protocol definition language, as long as the tooling matches your development lifecycle and as long as the protocol is rich enough to allow
Starting point is 00:52:28 you to express all the messages that you're sending. That makes sense. So how do you decide
Starting point is 00:52:36 about topics? Like, do you have each, is each microservice pushing to a singular topic or? Yes, so that's a good question.
Starting point is 00:52:48 Typically, each microservice is pushing to multiple topics. So there is the basic results topic, but it might have partial results, it might have some statistics. So these are definitely topics that... I can't say that one service only ever consumes from one topic and publishes to one topic. A typical service in that system will consume from two topics and publish to maybe three. It is worth mentioning though, particularly with versioning, is the way we've
Starting point is 00:53:26 gone about it is we have topics for major versions. So there's like a V1 of images. And obviously V1 will be forwards and backwards compatible across all the major versions within the... minor versions within the one major version. And if we ever decide to deploy image service v2, which presumably is completely different, right? Because it's just not compatible, then we would create
Starting point is 00:53:57 an image-v2 topic for it. I mean, it seems to be a theme you're saying is to make these things explicit? Uh-huh. Absolutely right. it seems to be a theme you're saying is to make these things explicit? Absolutely right. You have to be able to talk about these things and accept reality. What always scares
Starting point is 00:54:14 me the most is when people describe the system and they say, we will break up the monolith, or we are designing a microservices-based architecture, and we're going to have this component and this component, and when it gives us a response, we're going to do this, that, and the other. My first question in all those scenarios is, what if it's down? What if it's not available? And, okay, so this is very esoteric. Let's go back to a user database. If you're building a user database and you have a
Starting point is 00:54:42 login page, and you have like an e-commerce music shop or something. And my question to you would be, okay, what do you do when your user database is down and someone wants to log in? Freak out. Get pages. There you go. Now, a lot of people will say, well, what can we possibly do?
Starting point is 00:54:59 The database is down. But what you can do, of course, option A is denied. You can't log in. Option B is, okay, fine, you're logged in. I don't care what username and password you typed in. You're logged in. We'll take your word for it.
Starting point is 00:55:15 We're going to issue you a token that's good for the next two minutes. How about that? And it's this thing, right? And then you say it to your product team and they say, are you ridiculous? We can't allow that to happen. And then you say, okay, fine, fine, fine. I hear you. So option B is if the database goes down and it will go down, then your service is down. Well, we don't like, we hate that as well. And we do. I know. I hate the database
Starting point is 00:55:46 going down, and I'm sure that product owners hate the idea of just letting people in. But we have to make a choice. That's reality. It makes sense, yeah. For a music service, for a bank, I think that... Absolutely.
Starting point is 00:56:01 And this is where consistency, this is where the dials come in, right? It's not a one-zero. So of course, if I can't verify my online banking identity, I can't let you in. What if, though... Let's say there are a number of banks in the UK. Actually, one of them had a massive IT failure last week. Anyway, anyway,
Starting point is 00:56:32 by text messages. Okay. I was able to log in my username and password or whatever third digit of your security code was fine, but the SMS sending service is down. And now it's a reasonable choice.
Starting point is 00:56:49 Maybe a possibly reasonable choice is to allow read-only access. Makes sense. Because then the SMS... Just not do two-factor. That could be a choice you could make. Absolutely. Interesting. So it has to be...
Starting point is 00:57:01 You have to think about it. And actually designing these systems, systems, this is the particular, right, systems that are made up of the multiple components, as we design them, we always say, what if it's not there? What if it's unreliable? What if it's slow?
Starting point is 00:57:19 And not treat it as either an academic discussion, as in, well, what if the data center is offline? No, no, no. I mean these mundane, boring things, failures that happen all the time. Okay, well, someone broke the table. The database is inaccessible. My SMS sending service is down. The email SMTP server is down.
Starting point is 00:57:42 And they happen all the time. And so that's the first layer of digging. The answer shouldn't be, oh well. There has to be something, there has to be a decision, and it's the implicit oh well that causes the most problems. Oh well could be a decision, I guess? Or, I mean, I guess it's implied, you're saying we need an explicit decision to say, we have to. That's exactly it. It can never be... We assume that. And it takes so much of mental practice
Starting point is 00:58:14 because we are used to things. And, you know, we program, right? We say, I don't know, select star from user. And if TCP connection denied, then most people, me, myself included, will say, what do you want?
Starting point is 00:58:31 Seriously, what do you want? But if we don't say it out loud, if we don't say that if the database isn't there, I'm not doing anything, then other people won't know that we have made this decision. It's this hidden decision, and that's going to come back and bite us.
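One way to force that decision into the open, sketched with invented names: make the what-happens-when-the-database-is-down choice a value that has to be written down, reviewed, and shown to the product team, instead of an accidental exception path.

    import scala.concurrent.duration._

    sealed trait UserDbDownPolicy
    case object DenyLogins                                   extends UserDbDownPolicy
    case class IssueTemporaryToken(validFor: FiniteDuration) extends UserDbDownPolicy
    case object AllowReadOnlyAccess                          extends UserDbDownPolicy

    // The hidden "oh well" becomes one line of configuration everyone can see and argue about.
    val whenUserDbIsDown: UserDbDownPolicy = IssueTemporaryToken(validFor = 2.minutes)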
Starting point is 00:58:48 I assume also the cutting things up this way into microservices allows us to be more explicit about various parts failing while others remain. Like in a monolith, it's hard to make these... Yes. Absolutely.
Starting point is 00:59:02 The lines aren't clear, I guess, between... Yeah, absolutely. yes the absolutely the lines aren't clear I guess between yeah yeah yeah absolutely it's you know people want we want it right when you go on
Starting point is 00:59:12 Netflix and you watch stuff, and you're just like... No, you expect it to happen, because after all, you paid $10, and, you know,
Starting point is 00:59:20 I want the world-class service I want files to be ready on CDNs, ready for me to start watching within a second of pressing play. Because come on, I paid my $10.
Starting point is 00:59:34 You know, we are guilty of that expectation. And so I find it bizarre that we would then be at work and say things like, no, no, well, you know, it's the thing we can do. And we're building a system for a bank or for a retailer or for something that actually gets more money than
Starting point is 00:59:52 the $10 per month on Netflix. So Jan, you're the CTO of Cake Solutions. What's that like? It doesn't sound to me like you're doing CTO things. It sounds like you're in the weeds of designing distributed systems.
Starting point is 01:00:09 Well, quite. I mean, I guess I would hate the idea of not understanding what my teams are doing. That frightens me. And the same goes for non-coding architects and that sort of thing. I think to actually make a useful decision, one really has to know what's involved. And I think these are complicated decisions. And so it would really annoy me if I didn't know,
Starting point is 01:00:41 if I didn't understand, and if I had to go to meetings where I would hear things that people talk about, and I sat there and thought, well, it sounds complicated. I suppose they're right. That would really frighten me. So I think it's actually super important for techies to be leading
Starting point is 01:01:06 big teams and big companies. I think there was an interesting article I read a while ago, I think it was Joel Spolsky, in fact, that talks about his time at Microsoft. He said that Bill Gates,
Starting point is 01:01:22 you know, the back-then CEO, I don't know, C-something, chief very important person. Yeah. He said he would dig into technical details. He says he remembers the time when Bill Gates was insisting that there is a reusable rich text editor component. Now, that can only come from really understanding what the hell's going on in technology. Without it,
Starting point is 01:01:51 how would a CTO who doesn't code, how would a CTO who doesn't understand any of this insist on having a reusable rich text component? So, yeah, that's my take on it anyway. I think it's great. I think that all CTOs should take that on.
Starting point is 01:02:08 I think, I guess, people just get buried by the ins and outs of the job and sometimes... And that's fair, you know, it's fair. And it's tempting to say, to achieve the optimum meeting density, as Wally says in truth. You know, it's tempting, but no resist. Honestly, if anyone's listening, don't do it. It's horrible. Be interested. There is so much interesting stuff going on.
Starting point is 01:02:37 That goes down to teams that are implementing all these microservices. Some of the stuff is, on the outside, boring. Oh, goodness, it's boring. Who would like to make a PDF report? Really? You know, Jasper reports, and off we go. But what I'm always reminded of, and again, I'm terrible with names, no idea who said this, but when one looks closely enough, everything is interesting. And I think that's really the case. Like, even PDF reports can be made more interesting.
Starting point is 01:03:10 I mean, in the darkest of days, we used all our bloody transformers to make PDF reports. So the PDF report thing is still boring. But we could use this other stuff, which was really interesting. And I guess my point is, you know, this is the time to
Starting point is 01:03:26 be alive. There's so much technology available to us that I really don't think that anything can be boring. That's a great sentiment. Do you want to say a little bit about Cake Solutions before we wrap up? Well,
Starting point is 01:03:42 sure. I'm sure you've heard what we used to do. So we used to do these, and we still make, distributed systems. But of course, about a year ago... We were working for our clients, and then about a year ago we were acquired by BAMTech. And BAMTech was subsequently acquired by Disney. And so we now, like, distributed systems, it's the same stuff, right? But we now concentrate on media delivery. So
Starting point is 01:04:13 if anyone is a sports fan, and if you guys are watching ESPN+, there you go, that's the stuff. Those are the distributed systems. I can't tell you, but in 2019,
Starting point is 01:04:30 there will be a thing that will just be the best thing in existence, like sliced bread, nothing. Wow, it sounds exciting. I can understand where you're coming out from scale, then, if you're working on ESPN. Oh, God, yeah, absolutely. This was the volume of media.
Starting point is 01:04:48 And I'll try not to be too specific about what we do, but you can tell how delivering that sort of experience is super important. It's super important for these systems to deliver content to our users. Now, I remember back in the days when we were making these quote-unquote ordinary distributed systems for, you know, banks and e-commerce. Well, you know,
Starting point is 01:05:14 if an e-commerce app is not available, it's annoying. You know, we can get it in our days. Imagine, though, there's a game, there's like a baseball game football you're a fan
Starting point is 01:05:27 football. You're a fan, you really want to see this, right? And... a 500? No, no. Well, first of all, it's live, so
Starting point is 01:05:35 like, it cannot happen. You cannot have a 500, because
Starting point is 01:05:40 whereas in e-commerce, users were annoyed if the thing went down, now they're angry, deeply angry. So, you know, that was quite an eye-opener. But having this distributed system comprised, you know, in part of these microservices allowed us to think about what will happen, what do we do if something breaks?
Starting point is 01:06:06 What's the mitigation? Because ultimately the motivation is to deliver content. You know, our users have paid for this. And they're fans. This is their passion. They want to see this video. So there you have it. Awesome.
Starting point is 01:06:24 So thanks so much for your time, Jan. It's been great. Great, it was an absolute pleasure.
