Software Huddle - Building for Scale with Mario Žagar from Infobip
Episode Date: November 7, 2023. In this episode, we spoke with Mario Žagar, a Distinguished Engineer at Infobip. Infobip is a tech unicorn based out of Croatia that is a global leader in omnichannel communication, bootstrapping its way to a staggering $1B+ in revenue. We discussed the super early days of engineering at Infobip when they were running a monolith on a single server to today running a hybrid cloud containerized infrastructure with thousands of databases serving billions of requests. It's a really fascinating look and deep dive into the evolution of engineering over the past 15 years and the challenges of essentially architecting for scale. Follow Mario: https://www.linkedin.com/in/mzagar/ Follow Sean: https://twitter.com/seanfalconer Software Huddle ⤵︎ X: https://twitter.com/SoftwareHuddle LinkedIn: https://www.linkedin.com/company/softwarehuddle/ Substack: https://softwarehuddle.substack.com/
Transcript
Most of our platform, I would say like 90% is written in Java.
Initially, guys tried to do some business in Croatia, but it's a very small market.
Basically, they almost kind of gave up until they realized that actually they can do business outside of Croatia and the whole world.
And this was kind of the game changer.
If we are building a new product, maybe the fastest way is, you know, not to worry about infrastructure so much and try just to, you know, get this product as fast as possible
out there, kind of validate that it works and that it starts bringing money in.
Hey there, it's Sean Falconer, one of the creators of Software Huddle,
and I'm really excited for you to listen to today's interview with Mario Zagar,
a distinguished engineer at Infobip. A lot of you might not be super familiar with the company
Infobip, but they're actually a monster in the omni-channel communication space that bootstrapped to $1 billion in revenue based out of Croatia.
They're now doing multiple billions in revenue and competing directly against companies like Twilio.
Mario has been there nearly since the beginning, and in the episode, we go through his and Infobip's journey for the past 15 years. We discussed the super early days of engineering at Infobip
when they were running a monolith on a single server
to today running a hybrid cloud containerized infrastructure
with thousands of databases serving billions of requests.
It's a really fascinating look and deep dive
into the evolution of engineering at a company
over the past 15 years
and the challenges of actually architecting
for scale.
I really think you're going to enjoy hearing from Mario.
Last thing before I kick things over to the interview, if you enjoy the episode, don't
forget to subscribe to Software Huddle and leave us a positive review and rating.
All right, enough plugs.
Let's get to the interview.
Mario, welcome to the show.
Hey, Sean.
Hi, thanks for having me.
Yeah, thanks so much for
being here. I know it's kind of towards the end of the day for you, I imagine, but I appreciate
you finding time for this. Let's start by having you introduce yourself. Who are you? What do you
do? And how did you get to where you are today? Yeah, sure. So I'm Mario. And basically, I'm a
software engineer. So I've been in software development for the last about 20 years.
But I've been into computers basically for as long as I can remember.
And currently I'm working at Infobip for the last about 14 years,
a little bit more than that.
Currently the position of the distinguished engineer,
part of the platform
architecture team.
Yeah, this is pretty much it.
Right. So you've been 15 years
at Infobip. You still enjoy it?
Oh, yeah. Lots of challenges
there. Very dynamic.
You know, it's like
not one single
company doing one product. It's like a bunch
of companies doing lots of products.
So I can pick and choose the problem I want to solve.
That's awesome.
Yeah.
So Infobip is a global cloud communication platform.
And it's a big company.
And I think one of the most incredible things that I've always heard or thought about Infobip is the fact that it bootstrapped to a billion dollars in revenue, which is, you know, just an amazing achievement. And I know about Infobip from my time working at Google
because I was working in the business messaging space and I did a lot of work with Infobip at
that time. But I feel like unless you're really in sort of the communication space,
a lot of people in the U.S. aren't that familiar with Infobip. So how did the company start and sort of when was that?
Yeah, so the company was founded in 2006 by a couple of guys fresh out of college
and sending SMS basically. And around 2008, guys managed to send their own SMS. So what's the difference?
There's a bunch of companies that actually kind of provide you with the API
to send messages to end users from business side.
But usually they connect to the operators, to telecoms using their APIs and so on.
But these guys managed to kind of really deploy an SMSC. So basically,
you know, an application inside the telecom network, and basically send the SMS very cheap.
And this was kind of when the boom started. And I joined Infobip shortly after that.
This is basically when kind of business picked up. And more and more kind of customers started coming in.
And we needed to kind of start building lots of features.
Yeah.
And was that originally where they just focused on the Croatian market or were they all over Europe even at the beginning?
Yeah.
This is like a funny story.
Initially, the guys tried to do some business in Croatia, but it's a very small market. And basically, they almost kind of gave up until they realized that actually they can do business outside of Croatia, in the whole world. And this was kind of the game changer, and something that kind of opened up the horizons of what the company could go for.
Yeah, I imagine.
So can you talk a little bit about what the early days of being an engineer at Infobip, sort of on the ground floor, was like?
Yeah, sure. So yeah, for me, it's also like amazing to basically witness, you know, the whole evolution of Infobip from the day when I came there, and how it looked then, to where the company is now, right? So when I came to the company, there were like two sites in Croatia.
And one site had around maybe 10 developers and two applications they were working on.
And another site, where I was, had just one application which we worked on. And basically, you know, there wasn't any of the normal stuff that is kind of normal these days, like, you know, some build servers, some, you know, repositories for the artifacts that they build, some deployment procedures. Everything was done pretty manually.
But the interesting thing was, and this is one thing that I liked really a lot,
is that guys wrote tests.
So when I kind of got there, they had tests.
And yeah, deployment was manual. And there wasn't any build server.
One guy built it on his own machine and then copied it over and deployed it.
But they had high availability.
Since they were doing it in this telecom space, they realized they need to have some high availability.
If one machine goes down, then the other should be able to handle the load.
So this was the standard from the early days, which was
very, very good. Yeah, I remember those days
of the somewhat manual build process. You build it basically
on your local machine and then you move it over. But it sounds like there was probably a lot
of focus on basically building and deploying scalable data centers
because of the fact that you're in the telecommunication space.
And also the nature of that time, it's not like you had public cloud services
where you're just spinning up containers on Google Cloud or Azure or AWS or something like that.
Yeah, exactly.
So, like, in the early days, it was a little bit better than running the whole production on the machine under my desk. You know, it was some data center, we rented some physical hosts there. And they were actually, like, Windows machines. And we were running all the applications there; there were a couple of hosts. And the other application, this, you know, telecom application that was actually running inside the telecom operator's
network, this was the kind of, let's say,
in another data center owned by the telecom as
a piece of equipment actually kind of running there and
sending and receiving SMS messages
directly from the telecom network.
So this was a very interesting time.
And there was no dependency management.
Basically, weird things would happen.
We would add some new library that we found useful,
and we would try to test it locally, we would deploy it, and then
there would be some edge case when this library would call another library which we didn't package
and everything would fall apart. So it was like funny scenarios like that.
And at some point we realized like, hey, maybe we should be able to build these kind of artifacts that we deploy on one single
source of truth machine. It shouldn't be like, what if this guy goes on vacation or something,
you know, it's like this bus factor problem. And then we kind of started, okay, let's try to introduce some continuous integration. At least we have tests, and we have this version control.
Why not, whenever there is some change, run all these tests automatically
instead of waiting for someone to do it?
What was the code base at that time?
What was the main programming language and stack that you were building?
Oh, yeah.
So basically, there were like three applications.
This application is running on telecom premises, this SMSC,
which is responsible for receiving and talking basically to the telecom
network using telecom protocols.
This was Java.
Java running on Linux machines.
And the application that was receiving kind of requests from customers to send messages, this was also written in Java.
And there is like this one back office application, basically, through which we kind of configured the behavior of the system and kind of tried to do billing and so on.
This was actually, like, I think, Web Forms, Visual Basic, something like that. So basically, you know, the people that were there,
whatever they knew, this was the stack.
There was no, you know, like it wasn't really about pick and choosing,
like what was the best.
It was like what you could do.
This is what you kind of used to solve the problem.
And then everybody at the time for the engineering organization was located in Croatia?
Yeah, yeah. Everything was in Croatia,
basically in Pula.
The main office was in Pula and
the other development site was in Zagreb,
and this was pretty much it.
What was the source control system?
Oh, it was Subversion.
Okay.
Yeah.
We were using Subversion,
and at some point later in this evolution,
we switched to Microsoft Team Foundation Server because it had not only source control,
it also had this task management, so this was kind of cool.
Then after that, we kind of switched over to Git and also to Jira, and it was kind of the evolution. But right now,
this is where we are. We're kind of using Jira for issue management and tracking and Git for
source control. And then besides some of the pain around essentially scaling up your CI/CD to actually
like a build process that is not running on someone's machine
or dependent on someone's machine.
What were some of the big engineering challenges you faced in those early days?
Oh, yeah.
So one of the challenges was like, how do we deploy?
Like this manual deployment, at some point,
we kind of realized that we are making enough mistakes
to kind of start changing stuff and try to automate it and try to remove the
human factor from the equation and maybe have some more stable deployment. So this was the first
thing. Then the next big thing that we solved was actually this dependency management. How do we
handle these libraries? How do we pull libraries that we want to use
into our application and then also pull in transitive libraries that these libraries are
using? So we started using Maven for Java. And this also solved this pain point,
which caused a lot of problems in production just because we were missing libraries.
This was also a kind of mess. There were also other challenges, mostly involved with the
infrastructure and how do we deploy. Manual deployment was one thing where you needed to
know exactly the steps you need to perform, like, okay, I should reconfigure HAProxy, I should remove this target from the list of backends, then I should stop, I should start, I should something.
And all of this was done manually.
And the other thing was the underlying infrastructure itself.
So we had these physical machines and then these physical machines, we were running some
SQL Server database on top of them.
We wanted the SQL Server databases to be highly available.
So it was basically some Windows cluster running this using some shared storage under the hood where the actual data was stored.
And then Windows cluster would know,
one machine is down,
I will promote another one to be owner of
this data and continue handling the data.
Then the shared storage died.
We rented basically the shared storage solution. It was not under our control; we rented it from the data center provider.
So at that point we kind of realized, okay, maybe we should have more control over what kind of storage we use, and also switch maybe from these dedicated physical machines where we would basically know, like, okay, this machine is for that, that machine is for this, the classical, you know, pets versus cattle problem.
And at that point, we kind of started going into direction
where we introduced virtualization.
So instead of kind of directly putting applications on physical machines,
we actually just rented bigger physical machines and
used virtualizers to create
virtual machines and run our applications
on top of it. And that was kind of
a move for the better.
And also we started,
we bought a dedicated
storage solution that was
more under our
control and we knew what we could
kind of count on in terms of failures and what could go wrong
basically. So yeah, it was mostly about stability.
And for the virtualization, was that something that you had to build or was that something that
you were able to buy something?
At that time we were using Microsoft Hyper-V to kind of
drive the virtual machines.
That was, again, something that some guy knew, and this is what we started using.
Today, we have mostly switched over to VMware.
VMware is the primary virtualization platform.
So, yeah, but it's still like virtualization.
And then is Java still the sort of primary programming language
that people are developing?
Definitely.
So most of our platform, I would say like 90% is written in Java.
Then we have some Node.js.
We have some .NET, be that C#, or whatever.
And, yeah, this is pretty much it.
For the UI, it's mostly React stuff.
This is what our stack looks like,
but also more and more with
this data analytics and data science applications being created,
there is also lots of Python and whatever basically.
Maybe for the applications that are running on the platform, we kind of prefer Java, because the interoperability between all of these services and applications that are running is then easier.
The tool chain basically that
we are using to make the building of applications easier is then the same for everyone. But there is no rule like that. If I can solve a problem with Golang or with whatever, let's do that. And this is where we are at now, and for now it's kind of working out fine. It has its own, like, pros and cons.
On the analytics front, what's the toolchain there? Are you using sort of modern warehousing technologies, things like Snowflake or Databricks or something like that, or are you doing something that's more custom?
Yeah, so basically, like, when we are doing analytics, it's mostly about customer-facing
analytics so that we can provide some reporting to the customers, different kinds of reports.
And these are mostly aggregated reports. So how this stack looks: a bunch of these messages and message statuses go to Kafka as the messaging pipeline.
We process it and in the end,
it ends up in ClickHouse.
In ClickHouse, we do the aggregation and this is
then the data source that we basically expose to our customers, where we get this data from.
But before ClickHouse,
we had our own solution built on top of SQL Server,
which is basically an aggregation engine.
And it is still used, so it's not like we have one solution.
But mostly we are preferring ClickHouse.
Sometimes there are some reasons where this SQL solution still works okay.
It's fine.
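Just to make the shape of that pipeline concrete, here is a minimal, hypothetical Java sketch of the loading side: a consumer reads message-status events from Kafka and batch-inserts them into ClickHouse over JDBC, leaving the actual aggregation to ClickHouse itself (for example via a materialized view). The topic, table, and column names are illustrative assumptions, not Infobip's actual schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MessageStatusToClickHouse {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "clickhouse-loader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             // ClickHouse speaks JDBC; server-side aggregation (e.g. a materialized view) is elided.
             Connection ch = DriverManager.getConnection("jdbc:clickhouse://clickhouse:8123/reports")) {

            consumer.subscribe(List.of("message-status"));   // hypothetical topic name
            String sql = "INSERT INTO message_events (message_id, channel, status, ts) VALUES (?, ?, ?, ?)";

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                try (PreparedStatement insert = ch.prepareStatement(sql)) {
                    for (ConsumerRecord<String, String> record : records) {
                        // In reality the payload would be JSON/Avro; parsing is simplified here.
                        String[] fields = record.value().split(",");
                        insert.setString(1, fields[0]);               // message id
                        insert.setString(2, fields[1]);               // channel, e.g. "sms"
                        insert.setString(3, fields[2]);               // delivery status
                        insert.setLong(4, Long.parseLong(fields[3])); // event timestamp
                        insert.addBatch();
                    }
                    insert.executeBatch();                            // one batched insert per poll
                }
                consumer.commitSync();                                // commit only after the batch landed
            }
        }
    }
}
```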
How are those choices made within the organization
to essentially invest in something like ClickHouse?
Are you someone essentially charged with figuring that stuff out
and then they test out a bunch of different possible solutions
and make a recommendation?
How are those decisions made?
Yeah.
Usually it all starts with having some problem.
There is some itch that we want to scratch for some reason,
and let's see how we can do it better, potentially.
And usually it's problem-driven.
So there is some problem that we are experiencing.
Be that, for example, how fast can we add another field to this report?
Like, is this like a process that takes one month
and involves 30 teams?
And if yes, how can we improve this?
Is this even like a problem?
And if yes, then how do we go about it?
So usually it's like some problem that kind of starts this thinking process,
what could be potential solutions,
and then basically teams or the developers
or the engineers that are in this field
try to see if there is some smarter way.
Sometimes we just try to kind of step out of this whatever technology I'm using.
So if I'm very comfortable with SQL Server, probably every solution that I can think of
will involve SQL Server.
Sometimes we just try to step out of this zone and try some others.
See what is now available in the market, what are other companies doing.
In the end, yes, somebody tries it out: okay, let's do a POC. On my machine, this ClickHouse looks blazingly fast; how could this work? Let's try to connect some data directly from Kafka and see how this works, does it die.
We're not exposing anything to the customer.
We will just put a bunch of this data in ClickHouse.
If it breaks, fine.
We will learn something.
If it doesn't break, we continue.
So this is pretty much it.
Sometimes we will test multiple solutions,
but usually we are kind of restricted with time.
Then we try to narrow this amount of choices
and try to pick one.
Sometimes because we are really in the need
to solve something fast,
we will just pick some cloud solution,
like DynamoDB or whatever, instead of, you know, spinning up something, you know, locally.
It's just faster.
And then over time, we can make this decision.
Okay, now we have enough traffic.
This is generating a lot of cost.
Let's see if we can do it on-prem.
Yeah. Okay.
So part of the, I imagine, like a lot was in the early days,
you were buying from data centers or you're doing things on-prem,
and now it sounds like you kind of have like a hybrid system set up,
but you might move things to on-prem once they have been proven out
in order to essentially have certain cost savings.
Is that sort of the motivation there?
Yeah, exactly.
Usually how it starts is, you know, if we are building a new product, maybe the fastest way is not to worry about infrastructure so much and just get the product out there, and initially the cost of these cloud solutions is not that big.
But then, you know,
if we kind of got the product right
and there is interest for this and there is more
and more traffic, then
also the bills start to increase.
There is some point at which we will say, okay,
now we should kind of see
how much are we earning, how much does
this cost, does it make sense to
invest into kind of moving stuff on-prem?
The same thing is about data centers.
So we initially started with just renting space in the data centers,
collocating our hardware there and running stuff there.
And yes, you have this upfront cost that you need to pay to buy all this hardware,
ship it to data center, install it, set it up, and so on.
But in comparison, at least according to our calculations,
comparing it to the cloud solutions, it was always cheaper.
It's just cheaper to do.
But sometimes, especially when we kind of need it fast,
then it's just faster to do it in the cloud.
Like we spin it up in the cloud and use the hardware there.
And we have this infrastructure set up there so that usually I don't really care where my machine is, as long as the network is, you know, good enough so that I don't feel
this extra latency between my on-premise data center and the nearest cloud which is there.
Sometimes we will just do that, especially during these seasonal events like Black Friday,
Cyber Monday, Christmas, Easter, and so on.
When there is the increase in traffic and we know that it will happen,
then it just doesn't make sense to buy a bunch of hardware
and then after this week passes, what to do with it.
So we just kind of go into the cloud, spin up one data center,
run it for a week, and then shut it down.
Yeah, that makes sense.
It's not a long-term investment.
You just need to scale up for these spikes in traffic.
In terms of when you're developing new products or new features,
I imagine Infobip's dealing with really, really large-scale,
high-volume of calls.
How do you prepare for or figure out what you need
in terms of infrastructure to gauge against essentially reducing
the potential latency?
How does it go about sort of testing for scale?
Yeah.
So this is like a standard problem that we have when we spin up
a new data center.
So on one side, we have the input from the business.
Okay, we know why we are spinning this data center.
We know what customers are waiting for this data center,
which customers are going to use it,
what is their plan in terms of how many messages per second
they plan to send over our platform.
So this is one number that we have.
So we will usually spin up
this data center so that we are able to handle this number. And we already have a bunch of data
centers, so we know how much we need. And we also have already some flavors so that we know, okay,
we need to spin up type A to be able to handle 10,000 requests per second or whatever.
But once the data center is up,
we actually need to kind of verify it.
We want to do some acceptance testing
that will tell us, yeah, when you really call this API,
the message gets sent and delivered.
And when you call this API,
stuff that needs to happen really happens.
It's not something I will just deploy it and it automatically works. So after the data center is deployed,
so we bring in the hardware, we kind of put the virtual machines on top, we deploy all this
software, we configure everything. Then there is the load testing phase, where we are basically putting some artificial traffic into
this data center, just as customers will. Data really gets processed. It really goes to this
messaging pipeline. It really gets stored in the database. And we are really pushing the system to
see, okay, at which point will something break? Do we have any parts of the system that may need additional resources
or instances or whatever?
And after this test is done, basically we clean up the databases
and then it's ready for the customers.
And then later, there might be new customers that come in.
They will bring additional traffic, but we know
how much we can handle. And then we also know, okay, if we need an additional, I don't know, 1,000 requests per second of capacity, we know how to scale our system, like 10 more instances of
that, three more instances of that, and so on. So yeah, this was kind of where we ended up, empirically, through time.
And trial and error.
Something that works for us, yeah.
And then when you're doing
deployments, are you deploying things
in such a way where it's like a progressive
rollout where maybe the
feature is available to 5%
of users and then 10% and
then 20% or something like that.
That way you can roll back the changes if anything bad happens from a scale perspective
or from a bug being found during your live production that you weren't expecting.
Yeah.
So this is also something that kind of evolved through time,
the way how we kind of roll out this deployment.
So initially, you know, initially in the early days
when we had just one data center and, you know,
you had maybe two instances of your application running
or maybe 10 instances or whatever,
you would usually deploy new version on a single server,
kind of, you know, put some traffic to it,
check out the logs, if everything works okay,
maybe take a look at the metrics,
are there any weird spikes or something, errors, whatever.
Then if not, you would roll out to the rest of the servers.
This is basically what we are doing today,
only we are not looking at this manually, because now we have 30 data centers.
And I want to roll out my application, new version to all 30 data centers.
And usually how this works is when a developer is ready to deploy. Like, yes, we have these environments where we kind of deploy before going to production and do this initial kind of set of tests.
And then we roll to production.
This is just to kind of catch some, you know, nasty bugs and really different, like difficult failures early on.
But once we go to production, we use this canary approach. So basically, we have a deployment pipeline that is able to deploy one by one instance
and automatically check the metrics and other kind of potential KPIs that I can custom specify for my application
and automatically roll back if I'm outside of predefined limits.
By default, it's like we are looking at,
are your APIs returning errors or does everything look okay?
How many error logs do you have in the log file?
Is it comparable to the same machine's statistics, like from one hour ago, before you started the deployment? And we are trying to use this kind
of heuristics to maybe check, automatically detect, hey, we should really roll back because
something is weird here. And if not, then we progress to the next machine.
We are able to do it one by one, really sequentially, like whole data centers,
and you need to wait for a long time for everything to get deployed. Or you do it one
by one per data center. So you don't have 30 data centers in parallel, but you are doing canary basically on one by one instance inside this data center.
This helps a lot in preventing issues and speeding up the rollback, basically, when some issue gets detected.
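As a rough illustration of the kind of heuristic described here, below is a small Java sketch of one way the "compare the instance against its own stats from an hour ago, and roll back if it regresses" check could look. The MetricsClient interface, thresholds, and window sizes are assumptions for the example; this is not Infobip's actual deployment pipeline.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Sketch of a canary health check: after deploying a new version to one instance,
 * compare its error rate against the same instance's error rate from an hour before
 * the deployment, and decide whether it is safe to continue to the next instance.
 * The MetricsClient abstraction and the thresholds are hypothetical.
 */
public class CanaryCheck {

    /** Abstracts "go ask the monitoring system", e.g. a Prometheus range query. */
    public interface MetricsClient {
        double errorRate(String instance, Instant from, Instant to);
    }

    private final MetricsClient metrics;
    private final double absoluteLimit;      // e.g. never tolerate more than 5% errors
    private final double regressionFactor;   // e.g. roll back if errors triple vs. baseline

    public CanaryCheck(MetricsClient metrics, double absoluteLimit, double regressionFactor) {
        this.metrics = metrics;
        this.absoluteLimit = absoluteLimit;
        this.regressionFactor = regressionFactor;
    }

    /** Returns true if the new version on this instance looks healthy enough to continue. */
    public boolean looksHealthy(String instance, Instant deployedAt, Duration observationWindow) {
        Instant now = deployedAt.plus(observationWindow);

        // Error rate of the freshly deployed instance during the observation window.
        double current = metrics.errorRate(instance, deployedAt, now);

        // Baseline: same instance, same-length window, one hour before the deployment started.
        Instant baselineEnd = deployedAt.minus(Duration.ofHours(1));
        double baseline = metrics.errorRate(instance, baselineEnd.minus(observationWindow), baselineEnd);

        if (current > absoluteLimit) {
            return false; // hard limit breached, roll back regardless of history
        }
        // Guard against a zero baseline; use a small floor before comparing ratios.
        double floor = Math.max(baseline, 0.001);
        return current <= floor * regressionFactor;
    }
}
```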
Right. Beyond just clearly
the scale issues you have to deal with from an infrastructure standpoint,
there's also challenges as you're essentially scaling teams.
So at what point did having multiple teams working on different things,
but on the same source code and projects start to become an issue?
And what were the ways that you went about trying to like solve some of those issues
i imagine back in the early days all this was essentially a monolith that at some point you
had to think about like breaking up yeah exactly and this was like um like i think this is like
totally normal approach like we usually build some application and you you start adding features and
then you start at least this is for us we start
to getting more and more traffic more and more customers more and more features needed to be
built into into this monolith and as we are developing this even you know to some point
it's not a problem having a lot of people working on this but then at some point you really start
stepping on each other's toes, right?
So we will, you know,
we will touch some common code,
which will break, you know,
totally something on the other side.
Hopefully this gets caught by the tests.
But in the end,
we would really like not to,
like if I'm working on a part of the system,
which really doesn't have anything to do
with the other part of the system, I don't want my changes to kind of break this other part of the system.
And there is also one other thing, like as this monolith grew bigger, there were just more and more tests.
So if I had one feature, I need to wait for all these tests that I don't really care about, which I didn't touch, to pass. And at some point, basically at this point where we started having two or three groups
of people being really knowledgeable in this domain, this part of this monolith, we saw
that maybe we should start pulling it out.
It wasn't just about that.
It was also about
how many resources,
how does this machine look like for this monolith?
How much RAM
or CPU do I need to have? Basically,
it's a sum of
everything, right?
And if I have spikes in some other system,
it should be able to handle that.
If I have spikes in some other part of the system,
basically both spikes should be able to kind of survive on this machine.
And it was really, it started to be difficult to understand, you know,
when these spikes would happen, what are these spikes,
how to test this system, and so on.
And it was starting to get difficult to think about the
system. Basically, I just have one small part, but actually running in a more complex environment.
Basically, we had stages of this, how do we scale and how do we organize teams?
At first, this monolith was just kind of, okay, we need more throughput, just add more monoliths.
That was it.
And the second step, as more and more people got involved, we started to pull out these independent parts.
Billing, let's pull it out. Handling of incoming SMS messages,
let's handle it, you know,
totally separate from, you know,
outgoing SMS messages.
And this was kind of natural thing.
And we didn't start immediately
like dismantling everything.
There were just some parts
that came naturally to kind of extract
and evolve on their own.
And with time, we got more and more such parts.
And it got easier to handle, to reason about them,
and to actually handle different scale requirements.
Because incoming messages at that time were like 10 messages per second at best.
And outgoing was maybe 1,000 messages per second.
So two machines were enough for incoming messages.
But I needed to have like 10 machines for the outgoing.
So yeah.
And also deployment cycle got easier because now I'm just deploying my part.
I'm not touching everything else.
I'm not touching some common code.
It's like my own playground where I own the code that I write.
This was the progression of how we went from
single monolith application to just copying the monolith and then
extracting and organizing teams around basically
functionalities, you know, standalone functionality that can evolve on its own.
I imagine one of the other benefits too, since, you know, a lot of this was Java code, was, besides, you know, having to wait for tests to run through this entire monolith even if the tests had nothing to do with what you were building, you also have the compilation cycle, where if the code base is really big,
you might be waiting quite a while for it to get essentially compiled
just so that you can test and deploy it,
which is going to slow down your development cycles
versus going with this essentially logical,
essentially you're doing some version of microservices.
Exactly, exactly.
And at that time, we didn't really know that it's called microservices.
We didn't really think in those terms.
We just had a problem.
We have this big piece of code. Everything is slow. I need to wait a lot.
I'm making lots of mistakes. I'm killing other people's work.
And how do we solve this?
And so kind of separating it and going into this multiple service direction
was a good thing, but then it also kind of brought on another set of problems with it,
right? Because nothing is for free. Now we have multiple services that need to communicate. It's
not the same application anymore, where I can exchange data very
easily. Now I have multiple processes running and I need to pass the data over the network somehow
and make them communicate. And also, what with databases now? Should we continue to use this
one single database or how does this work? And how do we also prevent, you know, these bugs from database level, like changing some table that you are,
you know, your service is also using,
and I'm accidentally like removing a column and I don't know that you are
using it and so on. So we,
we kind of needed to also think about that and how to,
how to start kind of putting, you know,
data in their own domain and having, you know,
dedicated databases
for your own service.
How did you solve the problem of
how these different services are talking to each
other? What was
essentially the methodology or approach that you took there?
Yeah, so
first approach, because it was
Java service, we just used
this Java RMI
thing that comes
with Java.
Basically, it's a way to call, over the network, one Java method from a different Java application.
This is a remote procedure call.
Yeah, exactly. Then this was fine for Java,
but it was also cumbersome.
It's not really easy.
You need to have this registry, something,
and then you need to really understand how this all works.
Then it's really difficult to talk to non-Java services.
How do we do that?
We need to have some other system and so on.
We went through a
couple of iterations there and we ended up basically passing JSON over HTTP. So I would
just pass JSON and say, look, I want to call your method F with these parameters, here are the parameters and we would we would basically we built our own rpc engine uh that that kind of
just used json over http transport and and this this kind of uh actually proved to be nice because
now i could call this method you know even from command line i could use CURL to call some method if I needed to call some batch jobs,
or clone jobs, or whatever. I could easily call it from non-Java services because it's just some
HTTP endpoint and you pass JSON. So this was nice. But the downside is, okay, now we built our own RPC mechanism, but in this Infobip universe of
services, how do we know where the services are? How do we know which services expose which methods?
And then it kind of pushes you in the direction, okay, we should have this service registry.
Some service
registry where we can really see which instances are alive, which instances expose which services,
so that we can actually do the RPC call and know which target to call. And then we ended up basically
doing our own service registry. And also, we decided to start with client-side balancing.
So this basically means when I start up my application, I know which services are needed,
I will look them up in the registry, and then I will call them directly from my application.
And we built this library that did this client-side balancing.
So this library would do this heavy lifting,
like registering on service registry,
pulling up the services that we are depending on,
understanding which services are available,
what is their IP,
what is the endpoint that we need to call
for some method and so on.
And on the developer side, it was actually really simple. You just said,
hey, I have this Java interface, which has these methods. And this is, I want to call that service,
which implements these methods. And in Java code, you just had interfaces and it automatically works.
When you call it, we would basically through this library,
serialize this call into JSON,
pass it over the network to the endpoint that we
chose in this client side balancing logic,
and deserialize the response and give you
back the response inside this Java function.
So you didn't have a clue whether it was in-process or out-of-process code.
You didn't really care.
This was really fun.
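As a sketch of the idea, and assuming a simple wire format, a client along these lines can be built with a JDK dynamic proxy: the proxy serializes the method call to JSON, picks a live instance from a service registry (client-side balancing), POSTs the call over HTTP, and deserializes the response into the method's return type. The ServiceRegistry interface, URL layout, and JSON envelope below are illustrative, not Infobip's actual library.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

import com.fasterxml.jackson.databind.ObjectMapper;

/**
 * Sketch: turn a plain Java interface into a remote service via a dynamic proxy that
 * does client-side balancing against a service registry and passes JSON over HTTP.
 */
public class JsonRpcClient {

    /** Whatever answers "which instances of this service are alive right now?". */
    public interface ServiceRegistry {
        List<String> instancesOf(String serviceName); // e.g. ["http://10.0.0.5:8080", ...]
    }

    private static final ObjectMapper JSON = new ObjectMapper();
    private static final HttpClient HTTP = HttpClient.newHttpClient();

    @SuppressWarnings("unchecked")
    public static <T> T create(Class<T> serviceInterface, ServiceRegistry registry) {
        InvocationHandler handler = (proxy, method, args) -> {
            // Client-side balancing: pick one live instance at random from the registry.
            List<String> instances = registry.instancesOf(serviceInterface.getSimpleName());
            String target = instances.get(ThreadLocalRandom.current().nextInt(instances.size()));

            // "I want to call your method F with these parameters" as a JSON envelope.
            String body = JSON.writeValueAsString(Map.of(
                    "method", method.getName(),
                    "args", args == null ? new Object[0] : args));

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(target + "/rpc/" + serviceInterface.getSimpleName()))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());

            // Deserialize the JSON response back into the method's declared return type,
            // so the caller never notices that the call left the process.
            if (method.getReturnType() == void.class) {
                return null;
            }
            return JSON.readValue(response.body(), method.getReturnType());
        };
        return (T) Proxy.newProxyInstance(
                serviceInterface.getClassLoader(), new Class<?>[] {serviceInterface}, handler);
    }
}
```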
Is that the service that's still in use today or have you
moved into using something like gRPC?
Yeah. When we developed this, there wasn't, you know, gRPC.
Maybe there were some implementations for RPC calls and these libraries,
but actually nothing was really mature enough that we would kind of be fine with using.
And we did try.
I remember we, at some point, we tried to use this Eureka
service registry basically from Netflix that they open sourced. And we ended up... I mean,
I really wanted to use it, but then I ended up like, okay, now I need to really do this simple
thing, but now I need to first understand this system. And then when I
have problems, I need to fix this system. And we already had service registry. It was very simple.
We understood how this works. And the conclusion was, look, this just doesn't make sense. We'll
be constantly troubleshooting some other system that we know very little about. And we have already like 90% of this built.
And we just kind of kept our own.
And this, I mean, this has good and bad things, right?
I mean, ideally, I would just use some open source stuff
and I would be able to fast add features that I want.
But usually, at least in my experience, it doesn't work that way.
You need to really understand this other system
and then kind of to add your features on top of it.
So we are still using the service register that we kind of developed,
and we are still using the RPC library that we developed.
Because over time, we added more and more features to this library,
like very nice stuff, for example, like status checks,
Prometheus metrics,
out-of-the-box metrics. Basically now,
you know, any service that you kind of create
from this service template
will out-of-the-box have, you know,
metrics and status reports
and it will know how to phone
home and, you know, give
kind of health check pings
to this service registry
so that we have a good overview of what's
running, what's not, what's problematic, where maybe there are some network connection issues
and so on.
And this proved super cool because now that we created this service registry, then it
was easy to hook up monitoring system to it.
I had one place where I know everything that
is running inside our platform, and it's very easy to now configure Prometheus: okay, now go and
scrape the metrics and kind of let us build these dashboards and alerts on top of it and whatever.
So yeah. So a lot of, I feel like a lot of companies, you know, you mentioned like, you know,
Netflix companies, Google, Facebook, these companies that have had to solve these massive scale issues over time and solve a lot of these problems.
Sometimes they've been able to take some of their solutions and they bring it to the open source community.
And then that becomes the way that people solve these problems. Has InfoBip contributed any of their bespoke solutions
that they've come up with internally to solve these problems to open source,
or has that not been something that they've really focused on?
I know that we kind of discussed at some point,
should we kind of open source this InfoBip RPC library?
But already now in this open source world,
there are lots of, like if I was going to But already now, in this open source world, there are lots of...
If I was going to do it
now, I would just take something off the shelf
because there are lots of great libraries already
there. But we did
open source, for example, for
Kafka, and this is available
on GitHub.
Basically, it's an application
that allows you to manage
Kafka topics
on a really big scale.
Because this kind of started to be a problem at some point for us.
Like we have in every data center, we have a Kafka cluster.
These Kafka clusters are interconnected.
You are able to create a topic and then define, you know, the replication.
How do you want to do the cross data
center replication? I want to write one topic in data center A, and then I want to start to
replicate it to all other data centers so that I have the same data in that topic in these other
data centers and stuff like that. And then it was a problem like these guys that are maintaining
all these Kafka clusters. How do they create all these topics?
How do they configure it?
How do they track the changes?
How do they modify?
How can we see the performance of this?
And in the end, they just built a tool for themselves and for the end users, meaning developers, where basically it's very easy to kind of create the topics,
manage changes, apply these changes in production,
have some out-of-the-box metrics.
And actually, I mean, you can probably buy this from Confluent,
but we kind of ended up not doing that.
We just solve our own problems, and it looks like maybe it would be useful for some other folks as well.
And yeah, it's available on GitHub. So this is one part that we kind of open sourced.
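For a sense of the per-cluster work such a tool automates, here is a tiny Java sketch using Kafka's standard AdminClient to create one topic with an agreed partition count, replication factor, and retention on a single cluster; multiply this by every topic, every interconnected data center, and the cross-data-center replication config, and the motivation for dedicated tooling is clear. The topic name and settings are illustrative.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-dc-a:9092"); // one cluster in one data center

        try (AdminClient admin = AdminClient.create(props)) {
            // 12 partitions, replication factor 3, one week of retention (illustrative values).
            NewTopic topic = new NewTopic("message-status", 12, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",
                            "cleanup.policy", "delete"));
            admin.createTopics(Set.of(topic)).all().get(); // block until the broker confirms
        }
    }
}
```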
Oh, awesome. I mean, looking back, you have such a rich career from the time that you've been at Infobip. You've had to work on a lot of challenging, complex scale problems, both from scaling teams to scaling
infrastructure and moving, essentially
adapting existing systems that are in production to new,
more modern technologies and approaches. What do you think
is the biggest engineering challenge that you've faced through
that time?
Yeah.
So mostly it was about, uh, kind of stability and how to architect systems so that they continue to work when stuff breaks down.
Right.
So that we have this graceful degradation, or no degradation at all, if possible.
And also, like, one of the big challenges for us is, like, how do we do this multi-data center applications, you know?
Do we kind of confine them to one data center?
What if we have, like, two data centers in the same region that, you know, are basically backups, one for the other? Should we do hot-cold
standbys or should we just do active-active? These were the main questions. In the end,
we just said, let's do active-active because this passive stuff never works when you need it to.
So we started doing this active-active.
And also, at every level, the most challenging stuff for me is, like, where can we have failures? Because failures not only happen at the application or database level, they happen like, hey, my router will die,
or some ISP connection will die.
How do we handle that?
So there are a bunch of layers before these packets even come to my application that can actually die.
And how do we architect for that so that we can survive this
and continue serving these 10,000 requests per second for customers
within these predefined latencies that we want to hit.
So yeah, for me, this was, it still is the biggest challenge.
How do we do that?
Yeah, I think these infrastructure challenges, they never quite go away.
You're always dealing with more scale, and you can always figure out new ways of sort of like, you know, optimizing and making sure that in the case of a failure, things are handled, you know, even better and more gracefully.
And then all the deployment challenges that you're also facing.
And it sounds like you have a, you know, you have a mix of essentially on-prem and public cloud.
I'm sure there's a lot of complexity around how those, how the sort of the deployment pipeline works,
even, you know, choosing which,
where do you deploy that stuff?
Which data center, you know, which cloud and so forth.
It probably gets really complicated really fast.
Well, Mario, I could talk to you all day.
This is really fascinating.
I want to thank you so much for being here.
There's so much stuff I think we didn't even get to.
I'd love to, you know,
dig into how you solve some of your database scale challenges and so forth.
But maybe we can have you back down the road.
But I know it's getting late for you.
I want you to enjoy some of your Friday night.
So I will say thank you so much for being here.
And thanks for sharing your experience.
Yeah, thank you, Sean.