Screaming in the Cloud - The Complexity and Value of Scaling Reliability with Kannan Solaiappan
Episode Date: February 16, 2023
Kannan Solaiappan, Head of Reliability and Data Engineering at Circles Life, joins Corey on Screaming in the Cloud to discuss building a team in a start-up environment and the complexities of balancing reliability and security with scale. Kannan describes the challenges of building a single platform, multiple instances model and how products like Severalnines have helped identify and optimize potential problems before they affect customers. Kannan and Corey also discuss the impact that major outages had on the world at large when it came to fault tolerance on entry points, and Kannan explains how guardrails can improve reliability without creating the same resistance from engineers that governance can.
About Kannan
With over 20 years of experience in the technology industry, Kannan Solaiappan is a highly motivated and passionate leader with a track record of driving results. With a background in software development, operations, architecture, security, and Agile transformation, Kannan has served as a Head of DevOps/Reliability/Data Engineering & Architecture, managing budgets of over 10 million dollars.
Kannan has successfully led teams of up to 80 members and has a strong background in building and maintaining world-class organizational structures and cultures.
Currently, Kannan is leading a team of SRE, DevOps, and Data Engineering professionals at Circles Life, Asia’s first fully digital telco, where Kannan is working towards building the world’s best Telco SaaS platform with a focus on CI/CD, observability, reliability, resilience, and security.
Kannan has a diverse set of skills including IT Service Management, team management, IT strategy, vendor management, site reliability engineering, architecture, and leadership.
Links Referenced:
Severalnines: https://severalnines.com/
Circles.Life: https://circles.life
Circles.Life Instagram: https://www.instagram.com/circleslifesg/
Circles.Life Twitter: https://twitter.com/circleslifesg
Circles.Life Facebook: https://www.facebook.com/CirclesLifeSG/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is brought to us by our friends at Wiz.
Wiz is on a mission to help every organization rapidly identify and remove critical risks in their cloud environments.
Purpose built for the cloud, Wiz delivers full stack visibility, accurate risk
prioritization, and enhanced business agility. In other words, nobody beats the Wiz. Wiz connects
in minutes using an agentless approach that scans both platform configurations and inside of every
workload. They provide a deep assessment that goes well beyond what standalone CSPM, CWPP, and other unknowable acronym-based
tools offer, and find the toxic combination of flaws that represent real risk. To learn more,
visit wiz.io. That's W-I-Z dot I-O. Today's episode is brought to you in part by our friends
at MinIO, the high-performance Kubernetes-native object store
that's built for the multi-cloud,
creating a consistent data storage layer
for your public cloud instances,
your private cloud instances,
and even your edge instances,
depending upon what the heck you're defining those as,
which depends probably on where you work.
Getting that unified
is one of the greatest challenges
facing developers
and architects today. It requires S3 compatibility, enterprise-grade security and resiliency,
the speed to run any workload, and the footprint to run anywhere, and that's exactly what MinIO
offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data
you've got on the system, it's exactly what you've been looking for. Check it out today at
min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them
that I sent you. Welcome to Screaming in the Cloud. I'm Corey Quinn. This is one of those fun episodes because
it is brought to us as a promoted guest episode by our friends at Several Nines. However,
no one from Several Nines is on this conversation today. Instead, I am joined by Kannan Solaiappan,
who is currently the head of reliability and data engineering at Circles.Life.
Now, they happen to be a Several Nines customer, but let's dive into it.
Kannan, thank you for joining me. I appreciate it.
Thanks, Corey. Glad to be part of the call.
So, let's start at the very beginning. What does Circles.Life do?
Circles.Life is a digital telco. We are trying to disrupt the market by providing a vertical SaaS telco operating system to all the telco partners. And our mission is to give the power back to the
customer, so they don't get themselves into lock-in contracts with multiple MNOs, giving all
the power and savings back to the customer while providing
a high-performance and highly reliable network of data and call services to them across the world.
Working in telco is one of those interesting areas because it's one of those problems of
mistakes are absolutely going to show, but when you get it right, no one notices or cares
because of course the phone is going to talk to the other end.
Why wouldn't it?
And of course the data is going to flow until suddenly TCP terminates on the floor and then
nothing gets to talk to anything ever again.
Several Nines is an offering around the idea of database as a service, which on some level
sounds to me like, wait, aren't there a lot of managed database offerings? But when I started digging into them a little bit further, that's not really
how they position themselves. They talk about the idea of sovereign database as a service offerings
and the ability to wind up running a wide variety of different data stores in ways that are much
more portable and, to be direct, responsible than running it yourself in a different bespoke way
in every environment that winds up coming across your desk.
Now, that's the official story that their shiny marketing pages say.
What's your perspective on them?
What do they do for you folks that leaves you at least happy enough
to wind up showing up and claiming you're going to say nice things,
but we'll see how it plays out.
Fantastic question, right?
So that's a bigger challenge when you're trying to build a product as a startup and grow into a multinational company expanding to hundreds of countries.
So one of the big challenges for us is that, based on data regulation, we are compelled to use the public clouds available in the local markets, which also pushes you to choose the managed services
of those public clouds in those markets.
So we decided to go cloud agnostic,
especially on the persistent stack.
And that brings a new challenge of, you know,
how you are going to set up the environments
and how you are going to operate those environments,
the time taken to provision those hundreds of databases,
and adding data recovery, backups,
fault tolerance, and observability on top of it. And Several Nines actually gave us a hand
in speeding up those areas, especially in the areas of disaster recovery planning and
provisioning, and backup and recovery provisioning for the multiple variations of databases we have,
as well as observability of those database instances.
Yes, of course, they have not met 100% of our needs,
but in the areas where we are using them, we are more than happy to get their help in speeding up those operations.
I can already hear the plaintive comments coming in before I even wind up putting this on the internet, because people are going to be accusing me of, now hang on a second, you've always been wrong. It's instead, oh, you're probably doing something right because you have much closer insight to your strategic requirements. And when you're talking about taking a workload
as you are and putting it in a bunch of different places based upon where the customer happens to be,
you've got to be able to wind up deploying that in a bunch of different places because
we'll all grow old and wither and die waiting for our cloud provider of choice
to build a region where we really, really need one to be.
At least that's always been my philosophy on this.
Now, that doesn't mean every aspect of every workload
needs to be run in a completely autonomous way,
but there are some core functions that seem to.
At least that's my approach on it.
How do you think about it?
True. Some aspect of it is entirely true.
The way we are trying to build our telco as a platform,
verticals as a platform, is this:
after many thoughtful architectural discussions
and brainstorming sessions with key architects
and industry experts,
we decided to go with the single platform,
multiple instances model, right? Which means you need to add utmost precision in building your core platform.
Your single platform is going to act as multiple instances in different countries, even though you
are using multiple clouds in those areas. So that brings us an opportunity to, you know, optimize
the design aspect and the complexities you'll be facing with
the multi-cloud environment.
That also introduces new challenges: you should have 100% observability, you
should have high availability and fault tolerance, you should have alerts in place for all
the areas, and slow queries should be identified before your customers complain to you. So most of these areas are covered in our single platform, multiple instances model,
especially using Several Nines.
Last year, we were able to identify slow-performing queries
and optimize them just before customers reached out.
I think on some level, there's a bit of a reduction into what almost tries to be
a binary when there needs to be a spectrum instead when it comes to the idea of independence. And on
the one hand, it's, oh, we're not going to trust any of those cloud providers. We're going to build
everything in our data centers ourselves with the sweat of our brow and the bad grounding of our
electrical supply, et cetera, et cetera. And the other side of it is, no, no, no, we're just going
to make everything that we do run by other people
because they're good at things,
and we just want to focus on the one thing that we do.
In practice, it really feels to me like independence is a spectrum.
Where do you see yourselves falling on that spectrum?
Great question.
So if you look at the large successful SaaS companies, the way they started is they put everything on one plate and tried to expand in a multi-tenant model: your servers are operated from a single country and you are serving the data to the entire world. Then comes the highly regulated environment, right? So those companies are losing their freedom.
We got all those learnings when we started our journey
and we built our systems in such a way
that your persistent data layer
can reside in the country where you are operating
and you have a whole platform
where your non-regulated data can reside, right?
So that level of initial thought process
let us design the system optimally,
so that when you are trying to scale, you have reduced all this complexity beforehand.
It feels on some level like some of the worst takes around, oh, we're going to wind up building
these things to go in a bunch of different places come right in the wake of significant
cloud outages because, oh, wow, AWS went down in a particular region for a couple
hours and it made headlines everywhere. So we're going to go ahead and run in multiple cloud
providers to avoid it. In practice, what I see happening instead is that people are just doubling
their exposure to outages. Now, whenever GCP or AWS have a problem, now we're going to be hard down.
It feels like it's going in the exact opposite direction from where things really should be
moving. You are the head of both reliability, which I spent a bit of time in back when I was
hands-on keyboard and in my engineering life, and also of data engineering, something I stayed the
hell away from because I'm unlucky, have an aura, and standing too close to the data warehouse means I don't have a company anymore. So in my experience,
playing around on the reliability side was always an area of trade-offs and concerns. You've been
in that area for almost your entire career, by my understanding. What do you think that a lot
of the industry is getting wrong now?
Fantastic question, Corey.
I love to answer this, right?
So people in, you know, operations and
engineering are passionate about adding, you know,
as much reliability as possible to all the services they're offering.
The biggest mistake happening in those areas is, you know,
like when you are buying a new house and there is no
specification of how many locks you can add to that house, right? So you can put 100 locks to
make it more secure, or you can just have one lock, right? So this analogy is about how much
security you need to add in whatever layers, right? The same goes for reliability, right? So
there is a trade-off, right? The more reliability, the more SLA you commit to your customer,
the more you are going to spend on your platform,
and then there is not going to be an ROI for your business.
So you need the right trade-off, and the right trade-off comes from
what the customer expectation is
and which customer journeys need to be highly available.
So think about it: if you
are running a ride-hailing application, all you need to worry about is that booking a cab,
riding in the cab, and making the payment should be your key customer journey. Apart from that, you know,
looking at different, you know, areas of interest and where to visit, that static content can have three
nines availability, but your ride hailing should have, you know, five nines
plus availability. So in designing your systems on the reliability front, right, whether
it is adding observability or, you know, adding high availability and creating SLAs and SLOs and
committing an SLA with the customer, the fine line is how you are going to balance high-precision engineering against customer requirements and your ROI.
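To put numbers on the three-nines versus five-nines comparison, here is a minimal Python sketch of the standard availability arithmetic. The function name `allowed_downtime_minutes` is illustrative and not something from the conversation; it only converts an availability target into the downtime that target permits over a 365-day year.

```python
# Illustrative arithmetic only: converts an availability target ("nines")
# into the downtime it permits per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, target in [("three nines", 99.9), ("five nines", 99.999)]:
        minutes = allowed_downtime_minutes(target)
        print(f"{label} ({target}%): {minutes:.2f} minutes of downtime per year")
```

Three nines leaves roughly 525.6 minutes (under nine hours) of downtime a year, while five nines leaves a little over five minutes, which is why the stricter target is usually reserved for the key customer journeys.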
So it's been a fantastic journey for us so far.
And we had a lot of learnings in our B2C country launches.
And those learnings fed into building a state-of-the-art, futuristic SaaS telco.
And we are really balancing it out without compromising the quality of the
customer platforms.
I think that's something that a lot of businesses, when they start out at a
business level, don't fully understand because you ask them the question of, oh, how much downtime
is acceptable for this platform? And the default expected answer is no downtime whatsoever. Okay,
we're never going to achieve that. But to
start, I'm going to need $20 billion. And I'll let you know when that runs out, etc, etc. And then
they wind up saying, Whoa, what do you mean? And come to find out what they really mean is that
they want the email server that powers Outlook to be up during business hours. Okay, great. And we
talk about what the trade offs look like as we have the SLO and SLA negotiations with the business. And eventually we come to an idea
of these things are core and need to be up, other things don't. One of the things I love personally
about the positioning of Several Nines just as a brand is it's an area that is highly focused on
being very precise as far as the exact levels of service that are being guaranteed.
And their name effectively just cuts against that in a really fun way. How many nines of
reliability do you offer? Several. And you know, I love that approach. I think there's really
something very human about that.
That's true. That's true. So as we discussed in the previous
question, right, how many more nines would you like to add
versus how many nines are expected by the customer?
Balancing that will give you the ability
to design a strategy, and to achieve and adapt that strategy.
When we started discussing our SaaS platform years ago,
our first question was about where to start.
Instead of, you know,
starting from the product side,
we just started from the reliability side,
because for any SaaS product,
being as reliable as possible
is going to be the core value
you'll be providing to the customer,
and you add more features as you go, right?
So Several Nines is one of the partners
who helped us, you know,
achieve that high availability
and, you know, disaster recovery, backup and recovery,
and observability capabilities
when we tried to launch a country with hundreds of databases.
Another pain point we had was that
when you are trying to launch
multiple countries at one time,
you need a lot of operations engineers
to provision and operate those
components of your platform, but that need goes down when you stabilize those platforms, right? So
where tools like Several Nines helped us is in achieving that elasticity without
having to attract more operations engineers to our platform. Instead, we utilize that,
you know, technology,
which helps us as a single pane of glass to provision those many hundreds of databases and other
instances, and to operate and, you know, scale them as well. Yeah, it's a balancing act, and the
forethought is going to help you in architecting rightly, choosing the right tool, and
using it in the right way.
If I use Several Nines for something else,
I'm going to fail.
I really use Several Nines for the purpose
for which it is built.
And we are one of the customers
who had a lot of feature requests.
We are a highly demanding customer for them.
I'm personally a big fan of misusing things as databases,
like DNS text records or the contents of certain databases that were never designed to serve DNS records and then use it back again.
Again, there's all kinds of ways to misuse things in horrifying ways that no one's happy with.
But used in the right way, the right tool is an absolute pleasure and a joy to wind up working with. That said, I tend to be relatively skeptical
of experience reports when people have nothing but good things to say about a company's product,
full stop, the end. For examples of this, you can look at any conference keynote where they
have a customer testimonial ever, because it turns out getting in front of a few thousand
customers of a company and saying, here's what they're terrible at, gets you basically taken
out by a sniper if you're not careful.
We don't have any of those here.
So have you had any experiences working with Several Nines that left you a little bit, I guess, I don't know if dissatisfied is the right word,
but learning, okay, this is not necessarily the best way to apply it in different ways?
What hasn't worked out super well?
So at large, right, Several Nines is a tool
that helped us in achieving
what we tried to achieve.
But when we tried to operationalize
for a running country, right,
when you are setting up the systems,
everything is green.
But when your customers
started using your systems,
that's where the rubber meets the road.
The architecture would be perfect
if it weren't for the users or customers.
It would be glorious.
So why don't we just keep them out forever?
Yeah, it turns out
that's not sustainable.
Exactly.
So corner cases
like your backup fails
due to some port issues
on the network side
of customer control side,
then we work with the engineers.
The engineers are really good
and they quickly pick up the call
and get into the meeting
and quickly sort that out.
So when we started launching,
there are multiple, you know, service requests
to sort those corner cases,
mostly related to the setups and, you know, scaling
and utilizing some of the configurations in a right way.
Apart from that, I don't recollect anything major
happened to us in this engagement.
And Circles.Life is really good at observability.
We use a multi-observability tool strategy.
We don't rely on a single observability tool,
with the sense of, you know,
what if, for example,
if you're using the New Relics or Dynatraces of the world,
what if they go down, right?
Their service goes down, right?
You should have at least, you know,
an additional eye looking into those alerts and monitoring capabilities.
We are actually a multi-observability strategy company.
Even in those kinds of cases, we make sure there is no customer impact.
This episode is sponsored in part by our friends at Strata.
Are you struggling to keep up with the demands of managing and securing
identity in your distributed enterprise IT environment?
You're not alone, but you shouldn't let that hold you back.
With Strata's Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP,
so your IT teams will never have to refactor for identity again.
Imagine modernizing app identity in minutes instead of months, deploying passwordless
on any tricky old app, and achieving business resilience with always-on identity,
all from one lightweight and flexible platform. Want to see it in action? Share your identity
challenge with them on a discovery call, and they'll hook you up with a complimentary pair
of AirPods Pro. Don't miss out. Visit strata.io slash screaming cloud. That's strata.io slash screaming
cloud. There's something to be said for being a very large user of a given tool or product. And
one of the joys I imagine of being as telco focused as you are with the scale of your customer base and how you operate,
that you tend to wind up straining some of the bounds of what a lot of things were designed to
do. It's a strange challenge because some vendors seem able to rise to that occasion and others,
they try and gaslight you or they're like, oh, our site fell over, but here's an SLA credit.
Great. I have unhappy customers on my side
that I have to talk to,
and getting 40 bucks back on the enterprise deal
I'm spending doesn't help anything.
It really does tend to separate out the vendors
that are known and trustworthy in this space
from those who become an experience report,
because experience is what you get
when you didn't get what you wanted.
That's true.
So the most recent incidents,
right? I think last year, we are all aware that, you know, one of the major CDN providers, you know,
had an outage, and Atlassian had an outage as well. So the wonderful part of all
these companies' outages is that they taught the entire world how you can build fault tolerance on
the entry points as well. So we all
talk much and discuss more
and architect our systems and platforms
to have fault tolerance on the
back-end and services
to make sure that the customer
does not get impacted. And we are making
an assumption that your entry points
are always
secured and highly available, 100%.
So those incidents broke those assumptions, and architects started discussing
how to make entry points also fault-tolerant and highly available,
with disaster recovery capabilities.
So those incidents really taught a good lesson to the entire world,
and they gave us an opportunity to re-architect some of
our components before we met with such incidents.
One of the things that I didn't really have a keen
appreciation for is scale itself. When I'm sitting here building a hello world style application and
okay, I have even a microservices architecture with a half dozen different things all talking to one another.
It's not that hard for me to start tracing what's going on through the application as I hit various syntax errors because I don't know what a linter is and I'm terrible at programming.
Great.
But once you're at a point of significant scale where every individual transaction is like looking for a needle in a haystack, it feels like what I used to call monitoring, and most people call observability,
and I refer to as hipster monitoring, now seems to have taken on a very
different tone,
but it does seem like you need to be at a certain point of scale before any of
it really makes sense and starts to resonate.
Has that been your experience or am I missing something fundamental?
Yes. Some part of it, right?
So when you grow up from your startup base
with a few thousand customers to a few million, right?
That's the journey you're going to make.
And your initial assumptions are going to be proven wrong
as you go into that journey, right?
So the major aspect of scaling,
especially in Circles Life,
we take a proactive approach in scaling.
And I would
see that rest of the world, major tech
companies are doing the same as well.
So we don't rely entirely on
auto-scaling capabilities. And also
we don't rely entirely on monitoring and
observability for us to make
scaling decisions.
So when you build
complex systems in a SaaS
environment, you just need to build in such a way that
your end-to-end user journey
can be performance tested
and the metrics from that testing
should be taken as input,
number one. And number two, you also
need to do microservice
to microservice-based individual independent
testing to make sure that it can
withstand
N number of, you know,
transactions per second without breaking any of the underlying systems.
And number three, your databases are scaled, you know, appropriately.
Then you add that observability piece on top of that.
Together, it is going to give you a proactive approach of keeping your system scaled before your customers actually get into the system
in large numbers.
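As one possible illustration of the proactive approach described here, the Python sketch below turns a load-test result into a capacity decision made ahead of demand. Everything in it is an assumption for illustration: the function name `instances_needed`, its parameters, and the 30% default headroom are not taken from the conversation or from any Circles.Life tooling.

```python
import math


def instances_needed(expected_peak_tps: float,
                     tested_tps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances to provision ahead of a forecast peak.

    tested_tps_per_instance is the throughput one instance sustained in a
    load test; headroom is spare capacity on top of the forecast
    (0.3 means 30% above the expected peak). Both are hypothetical inputs.
    """
    if expected_peak_tps < 0 or tested_tps_per_instance <= 0 or headroom < 0:
        raise ValueError("peak and headroom must be non-negative, per-instance TPS positive")
    target_tps = expected_peak_tps * (1 + headroom)
    # Always keep at least one instance running, even with no forecast load.
    return max(1, math.ceil(target_tps / tested_tps_per_instance))


if __name__ == "__main__":
    # Example: a campaign forecast of 1,000 TPS against instances that each
    # sustained 250 TPS in testing.
    print(instances_needed(expected_peak_tps=1000, tested_tps_per_instance=250))
```

The point of the sketch is the ordering: the capacity number comes from tested throughput and a forecast, with auto-scaling and observability layered on top rather than serving as the only inputs.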
And also you have situations like,
you know, for example,
you have Christmas
and New Year coming up
and you're running campaigns
and you're going to acquire
a large amount of customers
on that particular day.
How can I have dynamic scaling
enabled and, you know,
disabled at a later point in time?
So these scaling strategies
are going to really help you
achieve lower customer impact and high availability of your system.
And we are ahead of the curve, especially in the digital telco arena.
And we spend a lot of time and money on this area of reliability.
It really feels on some level like all of these things get lumped together.
And at small scale, they absolutely do.
The idea of reliability of the infrastructure underlying things,
performance engineering, observability as a whole.
It's all basically if it plugs into the wall,
keep it running and make sure that the site is up
and we find out about it before customers start calling to tell us that there's a problem.
But as you wind up scaling out, it feels like these things start to become,
they gain a bit of organizational distance between them, where they're related, but they
tend to be handled by different teams focusing on different areas. Given that all of these roads
effectively lead to you in your current role, how do you think that they should be structured
in an organizational context?
Yeah, so that's a great question. So with any structuring of your team in any organization,
including ours, right, we start from somewhere and we evolve to a more mature and, you know, more optimal way. And nobody can start from an ideal team on day one, right? Being a startup,
we started with, you know, three to four people. You know, our founder started in a room with an engineer, and now, you know, a thousand-plus engineers are working in the company.
The way I see it is, you know, you need to structure your team, if you're a tech
company, right, based on what enterprise architecture you're going to adopt,
right? Think about this, for example: if you have isolation
between your infrastructure, reliability, and DevOps engineers
and your developers, your QAs,
and your automation engineers,
that is going to create silos.
We learned this lesson the hard way,
and we made changes accordingly
to build full-stack DevOps and SRE teams.
That gives us a lot of ability to improve throughput
and reliability as well as speed.
So structuring the team depends on how you are architecting your product
and how you can build full-stack teams to support
delivering that architecture, and have software guardrails
to minimize any impact.
Think about this, right?
So you are outsourcing your deliverables
to a country where you have junior engineers, right?
So your platform should not be broken
by even an intern, right?
So you need to build guardrails accordingly.
So your CI/CD pipelines are completely automated.
You check in any junk.
If it is a junk,
it is going to reject
and it is not going to be,
you know, promoted to another environment.
It sounds almost like
you're suggesting the radical idea
that it's not sustainable
when every engineer
to work in your Kubernetes environment
needs to have 15 years of experience
running large-scale applications.
Otherwise, they're a menace
to themselves and others.
There's a scalability problem. How do we make this stuff more accessible where you can be
a junior engineer and not be a walking disaster waiting to happen? And guardrails are the answer,
but everyone hears the term and almost seems to recoil because, ooh, governance. I don't like
being told what I can't do. Yeah, and I don't like getting a notice that my data has just been
leaked by yet another vendor. So, you know, we all have choices to make.
Absolutely, right?
When you say governance, people hate that word, right?
When I was an engineer, the most hated word was governance as well.
Because, you know, when you are young, you are exploring, and when the freedoms are cut, you feel bad.
So that's why young engineers' best place to work is a lab environment,
an R&D environment where you have more freedom.
But now the world has changed.
So now you can bring guardrails within the technology.
I'll give an example.
If you're in a public cloud, if you hire an engineer to play around with your AWS components or GCP components,
you can build policies internally.
For example, if an engineer is creating an S3 bucket
and making it public,
your policy will prevent that from happening.
Immediately, that transaction will be rolled back.
He won't be able to create an S3 bucket.
So I'm not keeping people to police people.
So you have central policy
which will be shared across the table.
If you adhere to it, you are good.
You can deliver your stuff.
If you are not, then your system is not going to allow you.
That's how now companies evolved.
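The guardrail idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Circles.Life or any cloud provider implements it: `BucketRequest`, `evaluate`, and `create_bucket` are hypothetical names, no real cloud API is called, and the request is denied before creation rather than rolled back afterwards as described in the conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketRequest:
    """Hypothetical description of a storage bucket an engineer wants to create."""
    name: str
    public_access: bool
    requested_by: str


def evaluate(request: BucketRequest) -> tuple[bool, str]:
    """Apply the central policy and return (allowed, reason)."""
    if request.public_access:
        return False, f"denied: bucket '{request.name}' must not be public"
    return True, "allowed"


def create_bucket(request: BucketRequest, audit_log: list[str]) -> bool:
    """Gate creation on the policy and record every decision for auditing."""
    allowed, reason = evaluate(request)
    audit_log.append(f"{request.requested_by} -> {request.name}: {reason}")
    # A real system would call the cloud provider here, and only when allowed.
    return allowed


if __name__ == "__main__":
    log: list[str] = []
    create_bucket(BucketRequest("campaign-assets", True, "intern"), log)
    create_bucket(BucketRequest("campaign-assets", False, "intern"), log)
    print("\n".join(log))
```

The same policy applies to every engineer regardless of seniority, and every decision lands in the audit log, which is what makes the guardrail feel different from a person saying no.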
Build automation at every aspect of your guardrails.
Whether it is CI/CD or the release management side of it
or the access control side of it,
or you are trying to expose the data or access the data,
everything needs to be digitally tracked, audited, and actioned.
I really want to thank you for taking so much time
out of your day to speak with me about how you're approaching a lot of these things, and given
what I've perceived to be a very positive but also honest and fair assessment of your experience as a
Several Nines customer, if people want to learn more about what you're up to,
where's the best place for them to find you?
Thanks, Corey.
First of all, thanks for hosting me,
and thanks for Several Nines as well.
Yeah, so Circles.Life is a digital telco.
As I mentioned, we are going to expand into multiple countries
in a very short span of time,
an exciting journey with partners like Several Nines helping us. You can
find us on the Circles.Life website, Instagram, Twitter, and Facebook, every area,
every social network. And we will be talking at TM Forum and other
Asia telco conferences as well. You can find our colleagues there.
It's an area I find myself paying an increased amount of attention to as the world
continues to progress towards, well, hopefully something. Thank you once again for your time.
I really appreciate it.
Thank you, Corey. Thanks for your time as well.
Kannan Solaiappan, Head of Reliability and Data Engineering at Circles.Life, brought to us by
our friends at Several Nines. I'm Cloud Economist Corey Quinn, and this
is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your
podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star
review on your podcast platform of choice, along with an angry, insulting comment that will not
get posted because your podcast platform of choice is having a problem with one of the 17
different cloud providers that they deploy to, so as a result, nothing works until it's fixed.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.