Software Huddle - Building for Scale with Mario Žagar from Infobip

Episode Date: November 7, 2023

In this episode, we spoke with Mario Žagar, a Distinguished Engineer at Infobip. Infobip is a tech unicorn based out of Croatia that is a global leader in omnichannel communication, bootstrapping its way to a staggering $1B+ in revenue. We discussed the super early days of engineering at Infobip, when they were running a monolith on a single server, through to today, running a hybrid cloud containerized infrastructure with thousands of databases serving billions of requests. It's a really fascinating look and deep dive into the evolution of engineering over the past 15 years and the challenges of essentially architecting for scale. Follow Mario: https://www.linkedin.com/in/mzagar/ Follow Sean: https://twitter.com/seanfalconer Software Huddle ⤵︎ X: https://twitter.com/SoftwareHuddle LinkedIn: https://www.linkedin.com/company/softwarehuddle/ Substack: https://softwarehuddle.substack.com/

Transcript
Starting point is 00:00:00 Most of our platform, I would say like 90% is written in Java. Initially, the guys tried to do some business in Croatia, but it's a very small market. Basically, they almost kind of gave up until they realized that actually they can do business outside of Croatia, in the whole world. And this was kind of the game changer. If we are building a new product, maybe the fastest way is, you know, not to worry about infrastructure so much and try just to, you know, get this product as fast as possible out there, kind of validate that it works and that it starts bringing money in. Hey there, it's Sean Falconer, one of the creators of Software Huddle, and I'm really excited for you to listen to today's interview with Mario Žagar,
Starting point is 00:00:38 a distinguished engineer at Infobip. A lot of you might not be super familiar with the company Infobip, but they're actually a monster in the omni-channel communication space that bootstrapped to $1 billion in revenue based out of Croatia. They're now doing multiple billions in revenue and competing directly against companies like Twilio. Mario has been there nearly since the beginning, and in the episode, we go through his and Infobip's journey for the past 15 years. We discussed the super early days of engineering at Infobip when they were running a monolith on a single server to today running a hybrid cloud containerized infrastructure with thousands of databases serving billions of requests. It's a really fascinating look and deep dive
Starting point is 00:01:19 into the evolution of engineering at a company over the past 15 years and the challenges of actually architecting for scale. I really think you're going to enjoy hearing from Mario. Last thing before I kick things over to the interview, if you enjoy the episode, don't forget to subscribe to Software Huddle and leave us a positive review and rating. All right, enough plugs.
Starting point is 00:01:38 Let's get to the interview. Mario, welcome to the show. Hey, Sean. Hi, thanks for having me. Yeah, thanks so much for being here. I know it's kind of towards the end of the day for you, I imagine, but I appreciate you finding time for this. Let's start by having you introduce yourself. Who are you? What do you do? And how did you get to where you are today? Yeah, sure. So I'm Mario. And basically, I'm a
Starting point is 00:02:02 software engineer. So I've been in software development for the last about 20 years, but I've been into computers basically for as long as I can remember. And currently I've been working at Infobip for the last about 14 years, a little bit more than that, currently in the position of distinguished engineer, part of the platform architecture team. Yeah, this is pretty much it.
Starting point is 00:02:29 Right. So you've been 15 years at Infobip. You still enjoy it? Oh, yeah. Lots of challenges there. Very dynamic. You know, it's like not one single company doing one product. It's like a bunch of companies doing lots of products.
Starting point is 00:02:45 So I can pick and choose the problem I want to solve. That's awesome. Yeah. So Infobip is a global cloud communication platform. And it's a big company. And I think one of the most incredible things that I've always heard or thought about Infobip is the fact that it bootstrapped to a billion dollars in revenue, which is, you know, just an amazing achievement. And I know about Infobip from my time working at Google because I was working in the business messaging space and I did a lot of work with Infobip at that time. But I feel like unless you're really in sort of the communication space,
Starting point is 00:03:18 a lot of people in the U.S. aren't that familiar with Infobip. So how did the company start and sort of when was that? Yeah, so the company was founded in 2006 by a couple of guys fresh out of college and sending SMS basically. And around 2008, guys managed to send their own SMS. So what's the difference? There's a bunch of companies that actually kind of provide you with the API to send messages to end users from business side. But usually they connect to the operators, to telecoms using their APIs and so on. But these guys manage to kind of really deploy an SMSC. So basically, you know, an application inside the telecom network and basically send the SMS very cheap.
Starting point is 00:04:14 And this was kind of when the boom started. And I joined Infobip shortly after that. This is basically when kind of business picked up. And more and more kind of customers started coming in. And we needed to kind of start building lots of features. Yeah. And was that originally where they just focused on the Croatian market or were they all over Europe even at the beginning? Yeah, this is like a funny story. Initially, the guys tried to do some business in Croatia, but it's a very small market and
Starting point is 00:04:46 basically they almost kind of gave up, until they realized that actually they can do business outside of Croatia, in the whole world. And this was kind of the game changer and something that kind of opened up the horizons for what the company could go for. Yeah, I imagine. So can you talk a little bit about what the early days of being an engineer at Infobip, sort of on the ground floor, were like? Yeah, sure. So yeah, for me, it's also like amazing to basically witness, you know, the whole evolution of Infobip from the day when I came there and how it looked then to where the company is now, right? So when I came to the company, there were like two sites in Croatia.
Starting point is 00:05:29 And one site had around maybe 10 developers and two applications they were working on. And another site, where I was at, had just one application which we worked on. And there basically wasn't any of the stuff that is kind of normal these days, like, you know, some build servers, some, you know, repositories for the artifacts that they build, some deployment procedures. Everything was done pretty manually. But the interesting thing was, and this is one thing that I liked really a lot, is that the guys wrote tests. So when I kind of got there, they had tests. And yeah, deployment was manual. And there wasn't any build server.
Starting point is 00:06:22 One guy built it on his own machine and then copied it over and deployed it. But they had high availability. Since they were doing it in this telecom space, they realized they need to have some high availability. If one machine goes down, then the other should be able to handle the load. So this was put on the standard from the early days, which was very, very good. Yeah, I remember those days of the somewhat manual build process. You build it basically on your local machine and then you move it over. But it sounds like there was probably a lot
Starting point is 00:07:01 of focus on basically building and deploying scalable data centers because of the fact that you're in the telecommunication space. And also the nature of that time, it's not like you had public cloud services where you're just spinning up containers on Google Cloud or Azure or AWS or something like that. Yeah, exactly. So it was a little, like in the early days, it was a little bit better than running like the whole production on the machine under my desk. You know, it was some data center, we rented some physical hosts there. And it was actually like Windows machines. And we were running, like, all the applications on a couple of hosts. And the other application, this, you know, telecom application that was actually running inside the telecom operator's network, this was, let's say, in another data center owned by the telecom as
Starting point is 00:07:55 a piece of equipment actually kind of running there and sending and receiving SMS messages directly from the telecom network. So this was a very interesting time. And there was no dependency management. Basically, weird things would happen. We would add some new library that we found useful, and we would test it, try to test it locally, we would deploy it and then
Starting point is 00:08:28 there would be some edge case when this library would call another library which we didn't package and everything would fall apart. So there were, like, funny scenarios like that. And at some point we realized, like, hey, maybe we should be able to build these kind of artifacts that we deploy on one single source-of-truth machine. It shouldn't be like, what if this guy goes on vacation or something, you know, it's like this bus factor problem. And then we kind of started thinking, okay, let's try to introduce some continuous integration. At least we have tests, and we have this version control. Why not, whenever there is some change, run all these tests automatically instead of waiting for someone to do it?
Starting point is 00:09:17 What was the code base at that time? What was the main programming language and stack that you were building with? Oh, yeah. So basically, there were like three applications. One application is running on telecom premises, this SMSC, which is responsible for receiving and talking basically to the telecom network using telecom protocols. This was Java.
Starting point is 00:09:37 Java running on Linux machines. And the application that was receiving kind of requests from customers to send messages, this was also written in Java. And there is like this one back office application, basically, through which we kind of configured the behavior of the system and kind of tried to do billing and so on. This was actually like, I think, web forms, visual basic, something like that. So basically, you know, the people that were there, whatever they knew, this was the stack. There was no, you know, like it wasn't really about pick and choosing, like what was the best. It was like what you could do.
Starting point is 00:10:17 This is what you kind of used to solve the problem. And then everybody at the time for the engineering organization was located in Croatia? Yeah, everything was in Croatia, basically in Pula. The main office was in Pula and the other development site was in Zagreb, and this was pretty much it.
Starting point is 00:10:34 What was the source control system? Oh, it was Subversion. Okay. Yeah. We were using Subversion, and at some point later in this evolution, we switched to Microsoft Team Foundation Server because it had not only source control, it also had this task management, so this was kind of cool.
Starting point is 00:10:59 Then after that, we kind of switched over to Git and also to Jira, and it was kind of the evolution. But right now, this is where we are. We're kind of using Jira for issue management and tracking and Git for source control. And then besides some of the pain around essentially scaling up your CI/CD to, like, a build process that is not running on someone's machine or dependent on someone's machine, what were some of the big engineering challenges you faced in those early days? Oh, yeah. So one of the challenges was like, how do we deploy?
Starting point is 00:11:36 Like this manual deployment, at some point, we kind of realized that we are making enough mistakes to kind of start changing stuff and try to automate it and try to remove the human factor from the equation and maybe have some more stable deployment. So this was the first thing. Then the next big thing that we solved was actually this dependency management. How do we handle these libraries? How do we pull libraries that we want to use into our application and then also pull in transitive libraries that these libraries are using? So we started using Maven for Java. And this also solved this pain point,
Starting point is 00:12:19 which caused a lot of problems in production just because we were missing libraries. This was also a kind of mess. There were also other challenges, mostly involved with the infrastructure and how we deploy. Manual deployment was one thing where you needed to know exactly the steps you need to perform, like, okay, I should reconfigure HAProxy, I should remove this target from the list of backends, then I should stop, I should start, I should do something. And all of this was done manually. And the other thing was the underlying infrastructure itself.
Starting point is 00:12:57 So we had these physical machines, and on these physical machines we were running some SQL Server databases. We wanted the SQL Server databases to be highly available. So it was basically some Windows cluster running this, using some shared storage under the hood where the actual data was stored. And then the Windows cluster would know, one machine is down, I will promote another one to be the owner of this data and continue handling the data.
Starting point is 00:13:34 Then the shared storage died. We basically rented the shared storage solution. It was not under our control, we rented it from the data center provider. So at that point we kind of realized, okay, maybe we should have more control over what kind of storage we use, and also maybe switch away from these dedicated physical machines where we would basically know, like, okay, this machine is for that, that machine is for this, the classical, you know, pets versus cattle problem. And at that point, we kind of started going in the direction
Starting point is 00:14:13 where we introduced virtualization. So instead of kind of directly putting applications on physical machines, we actually just rented bigger physical machines and used virtualizers to create virtual machines and run our applications on top of them. And that was kind of a move for the better. And also we started,
Starting point is 00:14:36 we bought dedicated storage solutions that were more under our control, and we knew what we could kind of count on in terms of failures and what could go wrong, basically. So yeah, it was mostly about stability. And for the virtualization, was that something that you had to build or was that something that you were able to buy?
Starting point is 00:14:58 At that time we were using Microsoft Hyper-V to kind of drive the virtual machines. That was, again, something that some guy knew, and this is what we started using. Today, we are mostly switched over to VMware. VMware is the primary virtualization platform. So, yeah, but it's still like virtualization. And then is Java still the sort of primary programming language that people are developing?
Starting point is 00:15:32 Definitely. So most of our platform, I would say like 90%, is written in Java. Then we have some Node.js. We have some .NET, be that C# or whatever. And, yeah, this is pretty much it. For the UI, it's mostly React stuff. This is what our stack looks like, but also more and more with
Starting point is 00:15:57 these data analytics and data science applications being created, there is also lots of Python and whatever, basically. Maybe for the applications that are running on the platform, we kind of prefer Java, because the interoperability between all of these services and applications that are running is then easier. The tool chain, basically, that we are using to make the building of applications easier is then the same for everyone. But there is
Starting point is 00:16:34 no, like, if I can solve a problem with Golang or with whatever, let's do that. And this is where we are at now, and for now it's kind of working out fine. It has its own, like, pros and cons. On the analytics front, what's the tool chain there? Are you using sort of modern warehousing technologies, things like Snowflake or Databricks or something like that, or are you doing something that's more custom? Yeah, so basically, like, when we are doing analytics, it's mostly about customer-facing analytics, so that we can provide some reporting to the customers, different kinds of reports. And these are mostly aggregated reports. So how this stack looks: a bunch of these messages and message statuses go to Kafka as some messaging pipeline.
Starting point is 00:17:25 We process it and in the end, it ends up in ClickHouse. In ClickHouse, we do the aggregation and this is then the data source that we basically exposed to our customers where we get this data from. But before ClickHouse, we had our own solution built on top of SQL Server, which is basically an aggregation engine.
Starting point is 00:17:49 And it is still used, so it's not like we have one solution. But mostly we are preferring ClickHouse. Sometimes there are some reasons where this SQL solution still works okay. It's fine. How are those choices made within the organization to essentially invest in something like ClickHouse? Are you someone essentially charged with figuring that stuff out and then they test out a bunch of different possible solutions
Starting point is 00:18:20 and make a recommendation? How are those decisions made? Yeah. Usually it all starts with having some problem. There is some itch that we want to scratch for some reason, and let's see how we can do it better, potentially. And usually it's problem-driven. So there is some problem that we are experiencing.
Starting point is 00:18:42 Be that, for example, how fast can we add another field to this report? Like, is this like a process that takes one month and involves 30 teams? And if yes, how can we improve this? Is this even like a problem? And if yes, then how do we go about it? So usually it's like some problem that kind of starts this thinking process, what could be potential solutions,
Starting point is 00:19:11 and then basically teams or the developers or the engineers that are in this field try to see if there is some smarter way. Sometimes we just try to kind of step out of, like, whatever technology I'm using. So if I'm very comfortable with SQL Server, probably every solution that I can think of will involve SQL Server. Sometimes we just try to step out of this zone and try some others. See what is now available in the market, what are other companies doing.
Starting point is 00:19:47 In the end, yes, somebody tries it out, okay, let's do a POC. Like, on my machine this looks blazingly fast, this ClickHouse, how could this work? Let's try to connect some data directly from Kafka and see how this works, does it die. The good thing here is that there is no risk, basically. We're not exposing anything to the customer. We will just put a bunch of this data in ClickHouse. If it breaks, fine. We will learn something. If it doesn't break, we continue.
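As a rough picture of the kind of proof of concept Mario describes here, the sketch below (not Infobip's actual code) reads message-status events from a hypothetical Kafka topic and batch-inserts them into a ClickHouse table over JDBC. The topic name, table name, and connection details are invented for the example, and it assumes the Kafka client and the ClickHouse JDBC driver are on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MessageStatusPocLoader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "clickhouse-poc");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection ch = DriverManager.getConnection("jdbc:clickhouse://localhost:8123/poc")) {
            consumer.subscribe(List.of("message-status"));   // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                try (PreparedStatement insert = ch.prepareStatement(
                        "INSERT INTO message_status_raw (message_id, status) VALUES (?, ?)")) {
                    for (ConsumerRecord<String, String> r : records) {
                        insert.setString(1, r.key());
                        insert.setString(2, r.value());
                        insert.addBatch();
                    }
                    insert.executeBatch();   // one ClickHouse insert per consumed Kafka batch
                }
            }
        }
    }
}
```

A throwaway loader like this is enough to answer the questions he mentions: does it keep up with the traffic, and does anything fall over.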
Starting point is 00:20:17 So this is pretty much it. Sometimes we will test multiple solutions, but usually we are kind of restricted with time. Then we try to narrow this amount of choices and try to pick one. Sometimes, because we are really in need to solve something fast, we will just pick some cloud solution,
Starting point is 00:20:43 like DynamoDB or whatever, instead of, you know, spinning up some whatever, you know, locally. It's just faster. And then over time, we can make this decision. Okay, now we have enough traffic. This is generating a lot of cost. Let's see if we can do it on-prem.
Starting point is 00:21:01 Yeah. Okay, so part of the, I imagine, like a lot was in the early days, you were buying from data centers or you're doing things on-prem, and now it sounds like you kind of have like a hybrid system set up, but you might move things to on-prem once they have been proven out in order to essentially have certain cost savings. Is that sort of the motivation there? Yeah, exactly.
Starting point is 00:21:21 Usually how it starts is, you know, if we are building a new product, maybe the fastest way is not to worry about infrastructure, the cost of these cloud solutions is not that big. But then, you know, if we kind of got the product right and there is interest for this and there is more and more traffic, then also the bills start to increase. There is some point at which we will say, okay, now we should kind of see
Starting point is 00:21:59 how much are we earning, how much does this cost, does it make sense to invest into kind of moving stuff on-prem? The same thing is about data centers. So we initially started with just renting space in the data centers, collocating our hardware there and running stuff there. And yes, you have this upfront cost that you need to pay to buy all this hardware, ship it to data center, install it, set it up, and so on.
Starting point is 00:22:29 But in comparison, at least according to our calculations, comparing it to the cloud solutions, it was always cheaper. It's just cheaper to do. But sometimes, especially when we kind of need it fast, then it's just faster to do it in the cloud. Like we spin it up in the cloud and use the hardware there. And we have this infrastructure there so set up that usually I don't really care where my machine is as long as the network is,
Starting point is 00:23:03 you know, good enough so that I don't really care where my machine is as long as the network is good enough so that I don't feel this extra latency between my on-premise data center and the nearest cloud which is there. Sometimes we will just do that, especially during these seasonal events like Black Friday, Cyber Monday, Christmas, Easter, and so on. When there is the increase in traffic and we know that it will happen, then it just doesn't make sense to buy a bunch of hardware and then after this week passes, what to do with it. So we just kind of go into the cloud, spin up one data center,
Starting point is 00:23:41 run it for a week, and then scale it down. Yeah, that makes sense. It's not a long-term investment. You just need to scale up for these spikes in traffic. In terms of when you're developing new products or new features, I imagine Infobip's dealing with really, really large-scale, high volumes of calls. How do you prepare for or figure out what you need
Starting point is 00:24:07 in terms of infrastructure to gauge against essentially reducing the potential latency? How do you go about sort of testing for scale? Yeah. So this is like a standard problem that we have when we spin up a new data center. So on one side, we have the input from the business. Okay, we know why we are spinning up this data center.
Starting point is 00:24:30 We know what customers are waiting for this data center, which customers are going to use it, what is their plan in terms of how many messages per second they plan to send over our platform. So this is one number that we have. So we will usually spin up this data center so that we are able to handle this number. And we already have a bunch of data centers, so we know how much we need. And we also have already some flavors so that we know, okay,
Starting point is 00:25:00 we need to spin up type A to be able to handle 10,000 requests per second or whatever. But once the data center is up, we actually need to kind of verify it. We want to do some acceptance testing that will tell us, yeah, when you really call this API, the message gets sent and delivered. And when you call this API, stuff that needs to happen really happens.
Starting point is 00:25:26 It's not like I will just deploy it and it automatically works. So after the data center is deployed, so we bring in the hardware, we kind of put the virtual machines on top, we deploy all this software, we configure everything. Then there is the load testing phase where we are basically putting some artificial traffic into this data center, just as customers will. Data really gets processed. It really goes to this messaging pipeline. It really gets stored in the database. And we are really pushing the system to see, okay, at which point will something break? Do we have any parts of the system that may need additional resources or instances or whatever? And after this test is done, basically we clean up the databases
Starting point is 00:26:11 and then it's ready for the customers. And then later, just to add, there might be new customers that come in. They will bring additional traffic, but we know how much we can handle. And then we also know, okay, if we need an additional, I don't know, 1,000 requests per second of capacity, we know how to scale our system, like 10 more instances of that, three more instances of that, and so on. So yeah, this is kind of where we ended up empirically through time.
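The scaling arithmetic Mario sketches can be made concrete with a toy calculation; the per-instance throughput numbers below are invented purely for illustration.

```java
public class CapacityPlan {
    // Round up so the target load is always covered.
    static int instancesNeeded(int targetRequestsPerSecond, int perInstanceCapacity) {
        return (targetRequestsPerSecond + perInstanceCapacity - 1) / perInstanceCapacity;
    }

    public static void main(String[] args) {
        int extraLoad = 1_000;  // additional requests per second a new customer will bring
        System.out.println("Instances of service A: " + instancesNeeded(extraLoad, 100));  // 10
        System.out.println("Instances of service B: " + instancesNeeded(extraLoad, 400));  // 3
    }
}
```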
Starting point is 00:26:50 And trial and error. Something that works for us, yeah. And then when you're doing deployments, are you deploying things in such a way where it's like a progressive rollout where maybe the feature is available to 5% of users and then 10% and
Starting point is 00:27:06 then 20% or something like that. That way you can roll back the changes if anything bad happens from a scale perspective or from a bug being found during your live production that you weren't expecting. Yeah. So this is also something that kind of evolved through time, the way how we kind of roll out this deployment. So initially, you know, initially in the early days when we had just one data center and, you know,
Starting point is 00:27:33 you had maybe two instances of your application running or maybe 10 instances or whatever, you would usually deploy new version on a single server, kind of, you know, put some traffic to it, check out the logs, if everything works okay, maybe take a look at the metrics, are there any weird spikes or something, errors, whatever. Then if not, you would roll out to the rest of the servers.
Starting point is 00:28:01 This is basically what we are doing today, only we are not looking at this because now we have 30 data centers. And I want to roll out my application, new version to all 30 data centers. And usually how this works is when developer is ready to deploy, like, yes, we have these environments where we kind of deploy before going to production and do this initial kind of set of tests. And then we roll to production. This is just to kind of catch some, you know, nasty bugs and really different, like difficult failures early on. But once we go to production, we use this canary approach. So basically, we have a deployment pipeline that is able to deploy one by one instance and automatically check the metrics and other kind of potential KPIs that I can custom specify for my application
Starting point is 00:29:00 and automatically roll back if I'm outside of predefined limits. By default, it's like we are looking at, are your APIs returning errors or does everything look okay? How many error logs do you have in the log file? This is compared to the same machine's statistics, like from one hour ago, before you started the deployment. And we are trying to use these kinds of heuristics to maybe automatically detect, hey, we should really roll back because something is weird here. And if not, then we progress to the next machine.
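To make that heuristic concrete, here is a minimal sketch (not Infobip's actual pipeline code) of a canary check that compares an instance's error rate after the new version is deployed against the same instance's error rate from an hour before; the metric sources and the threshold are placeholders.

```java
import java.util.function.DoubleSupplier;

public class CanaryCheck {
    private final DoubleSupplier errorRateNow;       // e.g. errors/requests over the last few minutes
    private final DoubleSupplier errorRateHourAgo;   // same metric, one hour before the deploy
    private final double maxAllowedIncrease;         // e.g. 0.02 == 2 percentage points

    public CanaryCheck(DoubleSupplier now, DoubleSupplier hourAgo, double maxAllowedIncrease) {
        this.errorRateNow = now;
        this.errorRateHourAgo = hourAgo;
        this.maxAllowedIncrease = maxAllowedIncrease;
    }

    /** Returns true if the new version looks healthy and the rollout may continue. */
    public boolean healthy() {
        double baseline = errorRateHourAgo.getAsDouble();
        double current = errorRateNow.getAsDouble();
        return current <= baseline + maxAllowedIncrease;
    }
}
```

A deployment pipeline would call healthy() after each instance is upgraded, triggering a rollback when it returns false and otherwise moving on to the next instance, which is what Mario describes next.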
Starting point is 00:29:47 We are able to do it one by one, really sequentially, like whole data centers, and you need to wait for a long time for everything to get deployed. Or you do it one by one per data center. So you don't have 30 data centers in parallel, but you are doing canary basically on one by one instance inside this data center. This helps a lot in preventing and speeding up the rollback basically when some issues got detected. Right. Beyond just clearly the scale issues you have to deal with from an infrastructure standpoint, there's also challenges as you're essentially scaling teams.
Starting point is 00:30:33 So at what point did having multiple teams working on different things, but on the same source code and projects, start to become an issue? And what were the ways that you went about trying to, like, solve some of those issues? I imagine back in the early days all of this was essentially a monolith that at some point you had to think about, like, breaking up. Yeah, exactly. And this was, like, I think this is like a totally normal approach. Like, we usually build some application and you start adding features, and then, at least for us, we start getting more and more traffic, more and more customers, more and more features needed to be
Starting point is 00:31:10 built into this monolith. And as we are developing this, even, you know, up to some point it's not a problem having a lot of people working on this, but then at some point you really start stepping on each other's toes, right? So we will, you know, we will touch some common code, which will break, you know, totally something on the other side. Hopefully this gets caught by the tests.
Starting point is 00:31:36 But in the end, we would really like not to, like if I'm working on a part of the system which really doesn't have anything to do with the other part of the system, I don't want my changes to kind of break this other part of the system. And there is also one other thing, like as this monolith grew bigger, there were just more and more tests. So if I had one feature, I need to wait, like, for all these tests that I don't really care about, which I didn't touch, to pass. And at some point, basically at this point where we started having two or three groups
Starting point is 00:32:12 of people being really knowledgeable in this domain, this part of this monolith, we saw that maybe we should start pulling it out. It wasn't just about that. It was also about how many resources, how does this machine look like for this monolith? How much RAM or CPU do I need to have? Basically,
Starting point is 00:32:36 it's a sum of everything, right? And if I have spikes in some other system, it should be able to handle that. If I have spikes in some other part of the system, basically both spikes should be able to kind of survive on this machine. And it was really, it started to be difficult to understand, you know, when these spikes would happen, what are these spikes,
Starting point is 00:33:00 how to test this system, and so on. And it was starting to get difficult to think about the system. Basically, I just have one small part, but actually running in a more complex environment. Basically, we had stages of this, how do we scale and how do we organize teams? At first, this monolith was just kind of, okay, we need more throughput, just add more monoliths. That was it. And the second step, as more and more people got involved, we started to pull out these independent parts. Billing, let's pull it out. Handling of incoming SMS messages,
Starting point is 00:33:47 let's handle it, you know, totally separate from, you know, outgoing SMS messages. And this was kind of natural thing. And we didn't start immediately like dismantling everything. There were just some parts that came naturally to kind of extract
Starting point is 00:34:04 and evolve on their own. And with time, we got more and more such parts. And it got easier to handle, to reason about them, and to actually handle different scale requirements. Because incoming messages at that time were like 10 messages per second at best. And outgoing was maybe 1,000 messages per second. So two machines were enough for incoming messages. But I needed to have like 10 machines for the outgoing.
Starting point is 00:34:38 So yeah. And also the deployment cycle got easier because now I'm just deploying my part. I'm not touching everything else. I'm not touching some common code. It's like my own playground where I own the code that I write. This was the progression of how we went from a single monolith application to just copying the monolith and then extracting and organizing teams around basically
Starting point is 00:35:06 functionalities, you know, standalone functionality that can evolve on its own. I imagine one of the other benefits too, since, you know, a lot of this was Java code, was besides, you know, having to wait for tests to run through this entire monolith, even if the tests had nothing to do with what you were building, you also have the compilation cycle, where if the code base is really big, you might be waiting quite a while for it to get essentially compiled just so that you can test and deploy it, which is going to slow down your development cycles versus going with this essentially logical, essentially you're doing some version of microservices.
Starting point is 00:35:36 essentially you're doing some version of microservices. Exactly, exactly. And at that time, we didn't really know that it's called microservices. We didn't really think in those terms. We just had a problem. We have this big piece of code. Everything is slow. I need to wait a lot. I'm making lots of mistakes. I'm killing other people's work. And how do we solve this?
Starting point is 00:36:01 And so kind of separating it and going into this multiple service direction was a good thing, but then it also kind of brought on another set of problems with it, right? Because nothing is for free. Now we have multiple services that need to communicate. It's not now the same application, I can exchange data very easily. Now I have multiple processes running and I need to pass the data over the network somehow and make them communicate. And also, what with databases now? Should we continue to use this one single database or how does this work? And how do we also prevent, you know, these bugs from database level, like changing some table that you are, you know, your service is also using,
Starting point is 00:36:49 and I'm accidentally like removing a column and I don't know that you are using it and so on. So we, we kind of needed to also think about that and how to, how to start kind of putting, you know, data in their own domain and having, you know, dedicated databases for your own service. How did you solve the problem of
Starting point is 00:37:08 how these different services are talking to each other? What was essentially the methodology or approach that you took there? Yeah, so first approach, because it was Java service, we just used this Java RMI thing that comes
Starting point is 00:37:24 with Java. Basically, there is an example how I can call over the network, one Java method from different Java application. This is a remote procedure call. Yeah, exactly. Then this was fine for Java, but it was also cumbersome. It's not really easy. You need to have this registry, something,
Starting point is 00:37:54 and then you need to really understand how this all works. Then it's really difficult to talk to non-Java services. How do we do that? We need to have some other system and so on. We went through a couple of iterations there and we ended up basically passing JSON over HTTP. So I would just pass JSON and say, look, I want to call your method F with these parameters, here are the parameters. And we basically built our own RPC engine that kind of just used JSON over HTTP transport. And this actually proved to be nice because
Starting point is 00:38:38 now I could call this method, you know, even from the command line. I could use curl to call some method if I needed to call some batch jobs, or cron jobs, or whatever. I could easily call it from non-Java services because it's just some HTTP endpoint and you pass JSON. So this was nice. But the downside is, okay, now we built our own RPC mechanism, but in this Infobip universe of services, how do we know where the services are? How do we know which services expose which methods? And then it kind of pushes you in the direction of, okay, we should have some service registry, where we can really see which instances are alive, which instances expose which services, so that we can actually do the RPC call and know which target to hit. And then we ended up basically
Starting point is 00:39:39 doing our own service registry. And also, we decided to start with client-side balancing. So this basically means when I start up my application, I know which services are needed, I will look them up in the registry, and then I will call them directly from my application. And we built this library that did this client-side balancing. So this library would do this heavy lifting, like registering on service registry, pulling up the services that we are depending on, understanding which services are available,
Starting point is 00:40:19 what is their IP, what is the endpoint that we need to call for some method and so on. And on the developer side, it was actually really simple. You just said, hey, I have this Java interface, which has these methods. And this is, I want to call that service, which implements these methods. And in Java code, you just had interfaces and it automatically works. When you call it, we would basically through this library, serialize this call into JSON,
Starting point is 00:40:52 pass it over the network to the endpoint that we chose in this client side balancing logic, and deserialize the response and give you back the response inside this Java function. So you didn't have a clue that it was in-process or out-of-process code. You didn't really care. This was really fun. Is that the service that's still in use today or have you
Starting point is 00:41:17 moved into using something like gRPC? Yeah. When we developed this, there wasn't, you know, gRPC. Maybe there were some implementations for RPC calls and these libraries, but actually nothing was really mature enough that we would kind of be fine with using. And we did try. I remember we, at some point, we tried to use this Eureka service registry basically from Netflix that they open sourced. And we ended up... I mean, I really wanted to use it, but then I ended up like, okay, now I need to really do this simple
Starting point is 00:42:01 thing, but now I need to first understand this system. And then when I have problems, I need to fix this system. And we already had a service registry. It was very simple. We understood how it works. And the conclusion was, look, this just doesn't make sense. We'll be constantly troubleshooting some other system that we know very little about, and we already have like 90% of this built. And we just kind of kept our own. And this has good and bad things, right? I mean, ideally, I would just use some open source stuff and I would be able to quickly add the features that I want. But usually, at least in my experience, it doesn't work that way. You need to really understand this other system and then kind of add your features on top of it.
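To give a rough idea of the interface-driven RPC style Mario described a moment ago, here is a minimal sketch built on a JDK dynamic proxy that turns a Java interface call into a JSON-over-HTTP request. It is an illustration only: the BillingService interface, endpoint layout, and JSON shape are invented, and Infobip's real library additionally handles service discovery, client-side balancing, response deserialization, and metrics.

```java
import java.lang.reflect.Proxy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JsonRpcClient {
    interface BillingService {                       // hypothetical remote service
        String charge(String accountId, long amountCents);
    }

    @SuppressWarnings("unchecked")
    static <T> T remote(Class<T> iface, String baseUrl, HttpClient http) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface},
            (proxy, method, callArgs) -> {
                // Naive JSON encoding of the arguments; a real library would use a proper
                // JSON mapper and map the response body back onto the return type.
                StringBuilder json = new StringBuilder("{\"args\":[");
                for (int i = 0; i < callArgs.length; i++) {
                    if (i > 0) json.append(',');
                    json.append('"').append(callArgs[i]).append('"');
                }
                json.append("]}");
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create(baseUrl + "/" + iface.getSimpleName() + "/" + method.getName()))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json.toString()))
                        .build();
                return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
            });
    }

    public static void main(String[] args) {
        BillingService billing = remote(BillingService.class, "http://10.0.0.5:8080/rpc",
                HttpClient.newHttpClient());
        System.out.println(billing.charge("acct-42", 100));   // reads like a local call
    }
}
```

The appeal is exactly the one Mario points out: the caller just invokes a Java interface method and never needs to care whether the implementation is in-process or on another machine.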
Starting point is 00:42:39 So we are still using the service registry that we kind of developed, and we are still using the RPC library that we developed. Because over time, we added more and more features to this library, like very nice stuff, for example, status checks, Prometheus metrics,
Starting point is 00:43:06 out-of-the-box metrics. Basically now, you know, any service that you kind of create from this service template will out-of-the-box have, you know, metrics and status reports and it will know how to phone home and, you know, give kind of health check pings
Starting point is 00:43:21 to this service registry, so that we have a good overview of what's running, what's not, what's problematic, where maybe there are some network connection issues, and so on. And this proved super cool, because now that we created this service registry, it was easy to hook up a monitoring system to it. I had one place where I know everything that is running inside our platform, and it's very easy to now configure Prometheus. Okay, now go and
Starting point is 00:43:52 scrape the metrics and kind of let us build these dashboards and alerts on top of it and whatever. So yeah. So a lot of, I feel like a lot of companies, you know, you mentioned like, you know, Netflix companies, Google, Facebook, these companies that have had to solve these massive scale issues over time and solve a lot of these problems. Sometimes they've been able to take some of their solutions and they bring it to the open source community. And then that becomes the way that people solve these problems. Has InfoBip contributed any of their bespoke solutions that they've come up with internally to solve these problems to open source, or has that not been something that they've really focused on? I know that we kind of discussed at some point,
Starting point is 00:44:36 should we kind of open source this InfoBip RPC library? But already now in this open source world, there are lots of, like if I was going to But already now, in this open source world, there are lots of... If I was going to do it now, I would just take something off the shelf because there are lots of great libraries already there. But we did open source, for example, for
Starting point is 00:44:56 Kafka, and this is available on GitHub. Basically, it's an application that allows you to manage Kafka topics on a really big scale. Because this kind of started to be a problem at some point for us. Like we have in every data center, we have a Kafka cluster.
Starting point is 00:45:16 These Kafka clusters are interconnected. You are able to create a topic and then define, you know, the replication. How do you want to do the cross data center replication? I want to write one topic in data center A, and then I want to start to replicate it to all other data centers so that I have the same data in that topic in these other data centers and stuff like that. And then it was a problem like these guys that are maintaining all these Kafka clusters. How do they create all these topics? How do they configure it?
Starting point is 00:45:48 How do they track the changes? How do they modify? How can we see the performance of this? And in the end, they just build a tool for themselves and for the end users, meaning developers, where basically it's very easy to kind of create the topics, manage changes, apply these changes in production, have some out-of-the-box metrics. And actually, I mean, you can probably buy this from Confluent, but we kind of ended up not doing that.
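For a sense of what programmatic topic management looks like, here is a small sketch using Kafka's standard AdminClient to create a single topic with a replication factor and a retention setting. This is not the tool Infobip open sourced (and it says nothing about the cross-data-center replication that tool also manages); the topic name, partition count, and broker address are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-dc-a:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("message-status", 12, (short) 3)   // 12 partitions, RF 3
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000"));  // 7 days
            admin.createTopics(List.of(topic)).all().get();   // wait for the cluster to acknowledge
        }
    }
}
```

Multiply that by dozens of clusters and hundreds of topics, and it is easy to see why a dedicated tool for creating, tracking, and modifying topic definitions became worth building.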
Starting point is 00:46:19 We just solved our own problems, and it looks like maybe it would be useful for some other folks as well. And yeah, it's available on GitHub. So this is one part that we kind of wanted to open source. Oh, awesome. I mean, looking back, you have such a rich career from the time that you've been at Infobip, you still had to work on a lot of challenging, complex scale problems, both from scaling teams to scaling infrastructure and moving, essentially adapting existing systems that are in production to new, more modern technologies and approaches. What do you think is the biggest engineering challenge that you've faced through
Starting point is 00:47:04 that time? Yeah. So mostly it was about kind of stability and how to architect systems so that they continue to work when stuff breaks down. Right? So that we have this graceful degradation, or no degradation at all, if possible. And also, like, one of the big challenges for us is, like, how do we do these multi-data-center applications, you know? Do we kind of confine them to one data center? What if we have, like, two data centers in the same region that, you know,
Starting point is 00:47:42 are basically a backup for one another? Should we do hot-cold standbys or should we just do active-active? These were the main questions. In the end, we just said, let's do active-active, because this passive stuff never works when you need it to. So we started doing this active-active. And also, at every level, the most challenging stuff for me is, like, where can we have failures? Because failures not only happen at the application or database level, they happen like, hey, my router will die,
Starting point is 00:48:29 or some ISP connection will die. How do we handle that? So there are a bunch of layers before these packets even come to my application that can actually die. And how do we architect for that, so that we can survive this and continue serving these 10,000 requests per second for customers within these predefined latencies that we want to hit? So yeah, for me, this was, and still is, the biggest challenge. How do we do that?
Starting point is 00:48:54 Yeah, I think these infrastructure challenges, they never quite go away. You're always dealing with more scale, and you can always figure out new ways of sort of like, you know, optimizing and making sure that in the case of a failure, things are handled, you know, even better and more gracefully. And then all the deployment challenges that you're also facing. And it sounds like you have a, you know, you have a mix of essentially on-prem and public cloud. I'm sure there's a lot of complexity around how those, how the sort of the deployment pipeline works, even, you know, choosing which, where do you deploy that stuff? Which data center, you know, which cloud and so forth.
Starting point is 00:49:31 It probably gets really complicated really fast. Well, Mario, I could talk to you all day. This is really fascinating. I want to thank you so much for being here. There's so much stuff I think we didn't even get to. I'd love to, you know, dig into how you solve some of your database scale challenges and so forth. But maybe we can have you back down the road.
Starting point is 00:49:48 But I know it's getting late for you. I want you to enjoy some of your Friday night. So I will say thank you so much for being here. And thanks for sharing your experience. Yeah, thank you, Sean.
