Screaming in the Cloud - The Complexity and Value of Scaling Reliability with Kannan Solaiappan

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. This episode is brought to us by our friends at Wiz. Wiz is on a mission to help every organization rapidly identify and remove critical risks in their cloud environments.

Starting point is 00:00:41 Purpose-built for the cloud, Wiz delivers full-stack visibility, accurate risk prioritization, and enhanced business agility. In other words, nobody beats the Wiz. Wiz connects in minutes using an agentless approach that scans both platform configurations and inside of every workload. They provide a deep assessment that goes well beyond what standalone CSPM, CWPP, and other unknowable acronym-based tools offer, and find the toxic combination of flaws that represent real risk. To learn more, visit wiz.io. That's W-I-Z dot I-O. Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud,

Starting point is 00:01:28 creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges

Starting point is 00:01:44 facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io slash download and see for yourself. That's min.io slash download, and be sure to tell them that I sent you. Welcome to Screaming in the Cloud. I'm Corey Quinn. This is one of those fun episodes because it is brought to us as a promoted guest episode by our friends at Several Nines. However,

Starting point is 00:02:32 no one from Several Nines is on this conversation today. Instead, I am joined by Kannan Solaiappan, who is currently the head of reliability and data engineering at Circles.Life. Now, they happen to be a Several Nines customer, but let's dive into it. Kannan, thank you for joining me. I appreciate it. Thanks, Corey. Glad to be part of the call. So, let's start at the very beginning. What does Circles.Life do? Circles.Life is a digital telco. We are trying to disrupt the market by providing a vertical SaaS telco operating system to all the telco partners. And our mission is to give the power back to the customer, so they don't get themselves into lock-in contracts with multiple MNOs, and giving all

Starting point is 00:03:23 the power and savings back to the customer while providing the high-performance and highly reliable network of data and call services to them across the world. Working in telco is one of those interesting areas because it's one of those problems of mistakes are absolutely going to show, but when you get it right, no one notices or cares because of course the phone is going to talk to the other end. Why wouldn't it? And of course the data is going to flow until suddenly TCP terminates on the floor and then nothing gets to talk to anything ever again.

Starting point is 00:03:55 Several Nines is an offering around the idea of database as a service, which on some level sounds to me like, wait, aren't there a lot of managed database offerings? But when I started digging into them a little bit further, that's not really how they position themselves. They talk about the idea of sovereign database as a service offerings and the ability to wind up running a wide variety of different data stores in ways that are much more portable and, to be direct, responsible than running it yourself in a different bespoke way in every environment that winds up coming across your desk. Now, that's the official story that their shiny marketing pages say. What's your perspective on them?

Starting point is 00:04:35 What do they do for you folks that leaves you at least happy enough to wind up showing up and claiming you're going to say nice things, but we'll see how it plays out. Fantastic question, right? So that's a bigger challenge when you're trying to build a product as a startup and grow to a multinational company expanding to hundreds of countries. So one of the big challenges for us is, based on the data regulation, we are compelled to use the public clouds available in the local markets, which also makes you choose the managed services of those public clouds in those markets. So we decided to go cloud-agnostic,

Starting point is 00:05:11 especially on the persistent stack. And that brings a new challenge of, you know, how you are going to set up the environments and how you are going to operate those environments, the time taken to provision those hundreds of databases, and making the data recovery, backups, fault tolerance, and observability work on top of it. And Several Nines actually gave us a hand in speeding up those areas, especially in the areas of disaster recovery planning and

Starting point is 00:05:39 provisioning and backup and recovery provisioning for multiple variations of the databases we have, as well as observability of those database instances. Yes, of course, they have not met 100% of our needs, but in the areas where we are using them, we are more than happy to get their help in speeding up those operations. I can already hear the plaintive comments coming in before I even wind up putting this on the internet, because people are going to be accusing me of, now hang on a second, you've always been wrong. It's instead, oh, you're probably doing something right because you have much closer insight to your strategic requirements. And when you're talking about taking a workload as you are and putting it in a bunch of different places based upon where the customer happens to be, you've got to be able to wind up deploying that in a bunch of different places because we'll all grow old and wither and die waiting for our cloud provider of choice

Starting point is 00:06:46 to build a region where we really, really need one to be. At least that's always been my philosophy on this. Now, that doesn't mean every aspect of every workload needs to be run in a completely autonomous way, but there are some core functions that seem to. At least that's my approach on it. How do you think about it? True. Some aspect of it is entirely true.

Starting point is 00:07:07 The way we are trying to build our telco as a platform, verticals as a platform, is, after many thoughtful architectural discussions and brainstorming sessions with key architects and industry experts, we decided to go with the single platform, multiple instances model, right? So which means you need to add utmost precision in building your core platform; your single platform is going to act as multiple instances in different countries, even though you

Starting point is 00:07:36 are using multiple clouds in those areas. So that brings us an opportunity to, you know, optimize the design aspect and the complexities you'll be facing with the multi-cloud environment. So that also introduces new challenges, like you should have 100% observability and you should have high availability, fault tolerance, and then you have alerts in place for all the areas, and slow queries should be identified before your customers complain to you. So most of these areas are covered in our single platform, multiple instances model, especially using Several Nines. Last year, we were able to identify slow-performing queries,

Starting point is 00:08:16 and just before customers reach out, we were able to identify and optimize. I think on some level, there's a bit of a reduction into what almost tries to be a binary when there needs to be a spectrum instead when it comes to the idea of independence. And on the one hand, it's, oh, we're not going to trust any of those cloud providers. We're going to build everything in our data centers ourselves with the sweat of our brow and the bad grounding of our electrical supply, et cetera, et cetera. And the other side of it is, no, no, no, we're just going to make everything that we do run by other people because they're good at things,

Starting point is 00:08:48 and we just want to focus on the one thing that we do. In practice, it really feels to me like independence is a spectrum. Where do you see yourselves falling on that spectrum? Great question. So if you see the large successful SaaS companies, the way they started is they put everything in one place and tried to expand, and in a multi-tenant model, your server will be operated from a single country and you are serving the data to the entire world. the highly regulated environment, right? So those companies are losing their freedom. We got all those learnings when we started our journey and we built our systems in such a way that your persistent data layer

Starting point is 00:09:32 can reside in the country where you are operating and you have a whole platform where your non-regulated data can reside, right? So that level of initial thought process let us design the system optimally, so that when you are trying to scale, you can reduce all this complexity beforehand. It feels on some level like some of the worst takes around, oh, we're going to wind up building these things to go in a bunch of different places come right in the wake of significant

Starting point is 00:10:04 cloud outages because, oh, wow, AWS went down in a particular region for a couple hours and it made headlines everywhere. So we're going to go ahead and run in multiple cloud providers to avoid it. In practice, what I see happening instead is that people are just doubling their exposure to outages. Now, whenever GCP or AWS have a problem, now we're going to be hard down. It feels like it's going in the exact opposite direction from where things really should be moving. You are the head of both reliability, which I spent a bit of time in back when I was hands-on keyboard and in my engineering life, and also of data engineering, something I stayed the hell away from because I'm unlucky, have an aura, and standing too close to the data warehouse means I don't have a company anymore. So in my experience,

Starting point is 00:10:50 playing around on the reliability side was always an area of trade-offs and concerns. You've been in that area for almost your entire career, by my understanding. What do you think that a lot of the industry is getting wrong now? Fantastic question, Corey. I love to answer this, right? So people on the operations and engineering side are passionate about adding, you know, as much reliability as possible to all the services they're offering. The biggest mistake happening in those areas is, you know,

Starting point is 00:11:22 when you are buying a new house and there is no specification of how many locks you can add into that house, right? So you can put 100 locks to make it more secure and you can just have one lock, right? So this analogy is for how much security you need to add in whatever layers, right? So similar goes for reliability, right? So there is a trade-off, right? So the more reliability, the more SLA you are committing to your customer, the more you are going to spend on your platform, then there is not going to be an ROI of your business. So the right trade-off and the right trade-off comes from

Starting point is 00:11:58 what is the customer expectation and which customer journeys need to be highly available. So think about if you are running a ride-hailing application: all you need to worry about is that booking a cab, riding a cab, and making the payment should be your key customer journeys. Apart from that, you know, looking at different, you know, areas of interest, where to visit, and static content can be three nines availability, but your ride-hailing app should have, you know, five-nines-plus availability. So designing your systems on the reliability front, right? So whether

Starting point is 00:12:31 it is adding observability or, you know, adding high availability and creating SLAs and SLOs and committing an SLA with the customer, the fine line is how you are going to balance between high-precision engineering versus customer requirements and your ROI. So it's a fantastic journey for us so far. And we had a lot of learnings in our B2C country launches. And those learnings fed into building a state-of-the-art, futuristic SaaS telco. And we are really balancing that out without compromising the quality of the customer platforms. I think that's something that a lot of businesses, when they start out at a business level, don't fully understand because you ask them the question of, oh, how much downtime

Starting point is 00:13:18 is acceptable for this platform? And the default expected answer is no downtime whatsoever. Okay, we're never going to achieve that. But to start, I'm going to need $20 billion. And I'll let you know when that runs out, etc, etc. And then they wind up saying, Whoa, what do you mean? And come to find out what they really mean is that they want the email server that powers Outlook to be up during business hours. Okay, great. And we talk about what the trade-offs look like as we have the SLO and SLA negotiations with the business. And eventually we come to an idea of these things are core and need to be up, other things don't. One of the things I love personally about the positioning of Several Nines just as a brand is it's an area that is highly focused on

Starting point is 00:14:01 being very precise as far as the exact levels of service that are being guaranteed. And their name effectively just cuts against that in a really fun way. How many nines of reliability do you offer? Several. And you know, I love that approach. I think there's really something very human about that. That's true. That's true. So as we discussed in the previous question, right? So how many more nines would you like to add versus how many nines are expected by the customer? Balancing that act will give you the ability to design a strategy and achieve and adapt that strategy.
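To make the "how many nines" trade-off concrete, here is a minimal sketch that converts a number of nines into an allowed downtime budget per year. The function name and the journey names in the usage lines are illustrative assumptions, not anything described by Circles.Life or Several Nines; it also assumes a 365-day year and counts all downtime against the budget.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_budget_minutes(nines: int) -> float:
    """Return the yearly downtime allowed by an availability of `nines` nines.

    For example, 3 nines means 99.9% availability, so 0.1% of the year
    may be spent down.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** (-nines)
    return unavailability * MINUTES_PER_YEAR


# Hypothetical journeys, echoing the ride-hailing example in the conversation.
targets = {"static content": 3, "booking and payment": 5}
for journey, nines in targets.items():
    print(f"{journey}: {downtime_budget_minutes(nines):.2f} minutes/year")
```

Under these assumptions, three nines leaves roughly 8.8 hours of downtime a year while five nines leaves a little over five minutes, which is why committing the higher target only to the key customer journeys keeps the cost in proportion.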

Starting point is 00:14:38 While we started discussing our SaaS platform years ago, our first question was, instead of, you know, starting from the product side, we just start from the reliability side, because for any SaaS product, being as reliable as possible is going to be the core value

Starting point is 00:14:53 you'll be providing to the customer and you're adding more features as you go, right? So Several Nines is one of the partners who helped us, you know, achieve that high availability and, you know, disaster recovery and backup and recovery and observability capabilities when we tried to launch a country with hundreds of databases.

Starting point is 00:15:15 Another pain point we had was when you are trying to launch a country, or multiple countries at one time, you need a lot of operational engineers to provision and operate those components of your platform, but that goes down when you stabilize those platforms, right? So where tools like Several Nines helped us is to achieve that elasticity without having to attract more operations engineers to our platform to do that. Instead, we utilize those,

Starting point is 00:15:44 you know, technologies, which help us as a single pane of glass to provision those many hundreds of databases and other instances and operate and, you know, scale as well. Yeah, it's a balancing act, and thinking well beforehand is going to help you in architecting rightly and choosing the right tool and using it in the right way. If I use Several Nines for something else, I'm going to fail. I really use Several Nines for the purpose

Starting point is 00:16:11 for which it is built. And we are one of the customers who had a lot of feature requests. We are a highly demanding customer for them. I'm personally a big fan of misusing things as databases, like DNS text records or the contents of certain databases that were never designed to serve DNS records and then use it back again. Again, there's all kinds of ways to misuse things in horrifying ways that no one's happy with. But used in the right way, the right tool is an absolute pleasure and a joy to wind up working with. That said, I tend to be relatively skeptical

Starting point is 00:16:45 of experience reports when people have nothing but good things to say about a company's product, start, stop, the end. For examples of this, you can look at any conference keynote where they have a customer testimonial ever, because it turns out getting in front of a few thousand customers of a company and saying, here's what they're terrible at, gets you basically taken out by a sniper if you're not careful. We don't have any of those here. So have you had any experiences working with Several Nines that left you a little bit, I guess, I don't know if dissatisfied is the right word, but learning, okay, this is not necessarily the best way to apply it in different ways.

Starting point is 00:17:20 What hasn't worked out super well? So at large, right, Several Nines is a tool that helped us in achieving what we tried to achieve. But when we tried to operationalize for a running country, right, when you are setting up the systems, everything is green.

Starting point is 00:17:35 But when your customers started using your systems, that's where the rubber meets the road. The architecture would be perfect if it weren't for the users or customers. It would be glorious. So why don't we just keep them out forever? Yeah, it turns out

Starting point is 00:17:46 that's not sustainable. Exactly. So corner cases like your backup fails due to some port issues on the network side of customer control side, then we work with the engineers.

Starting point is 00:17:57 The engineers are really good and they quickly pick up the call and get into the meeting and quickly sort that out. So when we started launching, there are multiple, you know, service requests to sort those corner cases, mostly related to the setups and, you know, scaling

Starting point is 00:18:12 and utilizing some of the configurations in the right way. Apart from that, I don't recollect anything major happening to us in this engagement. And Circles.Life is really good at observability. We use a multi-observability-tool strategy. We don't rely on a single observability tool, in the sense of, you know, what if, for example,

Starting point is 00:18:35 if you're using the New Relics or Dynatraces of the world, what if they go down, right? Their service goes down, right? You should have at least, you know, an additional eye looking into those alerts and monitoring capability. We are actually a multi-observability strategy company. Even in those kinds of cases, we make sure there is no customer impact. This episode is sponsored in part by our friends at Strata.

Starting point is 00:19:00 Are you struggling to keep up with the demands of managing and securing identity in your distributed enterprise IT environment? You're not alone, but you shouldn't let that hold you back. With Strata's Identity Orchestration Platform, you can secure all your apps on any cloud with any IDP, so your IT teams will never have to refactor for identity again. Imagine modernizing app identity in minutes instead of months, deploying passwordless on any tricky old app, and achieving business resilience with always-on identity, all from one lightweight and flexible platform. Want to see it in action? Share your identity

Starting point is 00:19:35 challenge with them on a discovery call, and they'll hook you up with a complimentary pair of AirPods Pro. Don't miss out. Visit strata.io slash screaming cloud. That's strata.io slash screaming cloud. There's something to be said for being a very large user of a given tool or product. And one of the joys I imagine of being as telco focused as you are with the scale of your customer base and how you operate, that you tend to wind up straining some of the bounds of what a lot of things were designed to do. It's a strange challenge because some vendors seem able to rise to that occasion and others, they try and gaslight you or they're like, oh, our site fell over, but here's an SLA credit. Great. I have unhappy customers on my side

Starting point is 00:20:25 that I have to talk to, and getting 40 bucks back on the enterprise deal I'm spending doesn't help anything. It really does tend to separate out the vendors that are known and trustworthy in this space from those who become an experience report, because experience is what you get when you didn't get what you wanted.

Starting point is 00:20:42 That's true. So the most recent incident, right? So I think last year, we are all aware that, you know, one of the major CDN providers, you know, had an outage, and Atlassian had an outage as well. So the wonderful part of all these companies' outages is that they taught us, taught the entire world, how you can build fault tolerance on the entry points as well. So we all talk much and discuss more and architect our systems and platforms

Starting point is 00:21:10 to have fault tolerance on the back-end and services to make sure that the customer does not get impacted. And we are making an assumption that your entry points are always secured, highly available, 100%. So those incidents broke those concepts for architects to start discussing

Starting point is 00:21:30 about how I can make my entry points also fault-tolerant and highly available and how I can add disaster recovery capabilities. So those incidents really taught a good lesson to the entire world, and they gave us an opportunity to re-architect some of our components before we met with those incidents. One of the things that I didn't really have a keen appreciation for is scale itself. When I'm sitting here building a hello world style application and, okay, I have even a microservices architecture with a half dozen different things all talking to one another, it's not that hard for me to start tracing what's going on through the application as I hit various syntax errors because I don't know what a linter is and I'm terrible at programming.

Starting point is 00:22:15 Great. But once you're at a point of significant scale where every individual transaction is like looking for a needle in a haystack, it feels like what I used to call monitoring, and most people call observability, and I refer to as hipster monitoring now, seems to have taken on a very different tone. But it does seem like you need to be at a certain point of scale before any of it really makes sense and starts to resonate. Has that been your experience or am I missing something fundamental? Yes. Some part of it, right?

Starting point is 00:22:44 So when you grow up from your startup base with a few thousand customers to a few million, right? That's the journey you're going to make. And your initial assumptions are going to be proven wrong as you go into that journey, right? So the major aspect of scaling, especially in Circles.Life, we take a proactive approach in scaling.

Starting point is 00:23:04 And I would see that rest of the world, major tech companies are doing the same as well. So we don't rely entirely on auto-scaling capabilities. And also we don't rely entirely on monitoring and observability for us to make scaling decisions.

Starting point is 00:23:19 So when you build complex systems in a SaaS environment, you just need to build in such a way that your end-to-end user journey can be performance tested and the metrics from that testing should be taken as input, number one. And number two, you also

Starting point is 00:23:36 need to do microservice-to-microservice individual, independent testing to make sure that it can withstand N number of, you know, transactions per second without breaking any of the underlying systems. And number three, your databases are scaled, you know, appropriately. Then you add that observability piece on top of that.

Starting point is 00:24:02 Together, it is going to give you a proactive approach of keeping your system scaled before actually your customer getting into the system in a large number. And also you have situations like, you know, for example, you have Christmas and New Year coming up and you're running campaigns and you're going to acquire

Starting point is 00:24:15 a large number of customers on that particular day. How can I enable dynamic scaling and, you know, disable it at a later point in time? So these scaling strategies are going to really help you achieve lower customer impact and high availability of your system.
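The proactive approach described above, where performance-test results feed scaling decisions ahead of a known campaign, can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the headroom parameter, and the example numbers are hypothetical and are not taken from how Circles.Life actually plans capacity.

```python
import math


def instances_needed(expected_peak_tps: float,
                     tested_tps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Estimate how many instances to provision before a traffic peak.

    expected_peak_tps: forecast transactions per second (e.g. for a campaign).
    tested_tps_per_instance: sustained TPS one instance handled in load tests.
    headroom: extra safety margin as a fraction (0.3 means 30% above forecast).
    """
    if tested_tps_per_instance <= 0:
        raise ValueError("tested_tps_per_instance must be positive")
    if expected_peak_tps < 0 or headroom < 0:
        raise ValueError("expected_peak_tps and headroom must be non-negative")
    required_tps = expected_peak_tps * (1 + headroom)
    # Round up: a fractional instance still requires a whole one.
    return math.ceil(required_tps / tested_tps_per_instance)


# Hypothetical numbers: a normal day versus a holiday campaign.
print(instances_needed(expected_peak_tps=1000, tested_tps_per_instance=250))
print(instances_needed(expected_peak_tps=4000, tested_tps_per_instance=250))
```

The point of the sketch is the ordering: capacity is computed from measured per-instance throughput and a forecast, then applied before the event, with auto-scaling and observability acting as a second line of defense rather than the only one.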

Starting point is 00:24:30 And we are ahead of the curve, especially in the digital telco arena. And we spend a lot of time and money on this area of reliability. It really feels on some level like all of these things get lumped together. And at small scale, they absolutely do. The idea of reliability of the infrastructure underlying things, performance engineering, observability as a whole. It's all basically if it plugs into the wall, keep it running and make sure that the site is up

Starting point is 00:25:00 and we find out about it before customers start calling to tell us that there's a problem. But as you wind up scaling out, it feels like these things start to become, they gain a bit of organizational distance between them, where they're related, but they tend to be handled by different teams focusing on different areas. Given that all of these roads effectively lead to you in your current role, how do you think that they should be structured in an organizational context? Yeah, so that's a great question. So any structuring of your team in any organization, including ours, right? So we start from somewhere and we evolve to a more matured and, you know, more optimal way. And nobody can start from an ideal team on day one, right? Being a startup,

Starting point is 00:25:41 we started with, you know, three to four, you know, our founder started in a room with an engineer, and now, you know, thousand-plus engineers are working in the company. The way I see it is, you know, you need to structure your team, if you're a tech company, right, based on what architecture, what enterprise architecture, you're going to adopt, right? Think about this: if you have isolation between your infrastructure, reliability, and DevOps engineers and your developers and your QAs and your automation engineers, that is going to be silos.

Starting point is 00:26:15 We learned this lesson the hard way, and we made changes accordingly to build full-stack DevOps and SRE teams. That brings a lot of ability for us to improve throughput and reliability as well as speed. So structuring the team depends on how you are architecting your product and how you can make full-stack teams to support delivering that architecture and have software and guardrails

Starting point is 00:26:45 to minimize any impact. Think about this, right? So you are outsourcing your deliverables to a country where you have junior engineers, right? So your platform should not be broken by even an intern, right? So you need to build guardrails accordingly. So your CI/CD pipelines are completely automated.

Starting point is 00:27:04 You check in any junk. If it is a junk, it is going to reject and it is not going to be, you know, promoted to another environment. It sounds almost like you're suggesting the radical idea that it's not sustainable

Starting point is 00:27:16 when every engineer to work in your Kubernetes environment needs to have 15 years of experience running large-scale applications. Otherwise, they're a menace to themselves and others. There's a scalability problem. How do we make this stuff more accessible where you can be a junior engineer and not be a walking disaster waiting to happen? And guardrails are the answer,

Starting point is 00:27:34 but everyone hears the term and almost seems to recoil because, ooh, governance. I don't like being told what I can't do. Yeah, and I don't like getting a notice that my data has just been leaked by yet another vendor. So, you know, we all have choices to make. Absolutely, right? When you say governance, people hate that word, right? When I was an engineer, the most hated word was governance as well. Because, you know, when you are young, you are exploring, and when the freedoms are cut, you feel bad. So that's why young engineers' best place to work is a lab environment,

Starting point is 00:28:07 an R&D environment where you have more freedom. But now the world has changed. So now you can bring guardrails within the technology. I'll give an example. If you're in a public cloud, if you hire an engineer to play around with your AWS components or GCP components, you can build policies internally. For example, if an engineer is creating an S3 bucket and makes it public,

Starting point is 00:28:27 your policy will prevent that from happening. Immediately, that transaction will be rolled back. He won't be able to create an S3 bucket. So I'm not keeping people to explain people. So you have a central policy which will be shared across the table. If you adhere to it, you are good. You can deliver your stuff.
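The guardrail idea in this answer, a central policy that rejects a non-compliant request automatically instead of relying on a person to review it, can be sketched in a few lines. Everything here is a hypothetical, cloud-neutral illustration: the `BucketRequest` type, the `PolicyViolation` exception, and the `enforce_no_public_buckets` function are invented for this example and are not AWS or GCP APIs. On a real cloud, the same intent would be expressed through the provider's own policy mechanisms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketRequest:
    """A hypothetical request to create a storage bucket."""
    name: str
    public: bool
    requested_by: str


class PolicyViolation(Exception):
    """Raised when a request breaks the central policy."""


def enforce_no_public_buckets(request: BucketRequest) -> BucketRequest:
    """Central guardrail: reject any bucket that would be publicly readable.

    Compliant requests pass through unchanged, so the engineer can carry on
    delivering; non-compliant ones are stopped before anything is created.
    """
    if request.public:
        raise PolicyViolation(
            f"bucket '{request.name}' requested by {request.requested_by} "
            "must not be public"
        )
    return request


# Usage: the compliant request goes through, the public one is rejected.
enforce_no_public_buckets(BucketRequest("invoices", public=False, requested_by="intern"))
try:
    enforce_no_public_buckets(BucketRequest("exports", public=True, requested_by="intern"))
except PolicyViolation as err:
    print(f"rejected: {err}")
```

Because the check is code rather than a review meeting, it applies equally to a senior engineer and an intern, which is the property the guest is describing.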

Starting point is 00:28:46 If you are not, then your system is not going to allow you. That's how companies have now evolved. Build automation into every aspect of your guardrails. Whether it is CI/CD or the release management side of it or access control, or you are trying to expose the data or access the data, everything needs to be digitally tracked, audited, and actioned. I really want to thank you for taking so much time

Starting point is 00:29:12 out of your day to speak with me about how you're approaching a lot of these things, and giving what I've perceived to be a very positive but also honest and fair assessment of your experience as a Several Nines customer. If people want to learn more about what you're up to, where's the best place for them to find you? Thanks, Corey. First of all, thanks for hosting me, and thanks to Several Nines as well. Yeah, so Circles.Life is a digital telco.

Starting point is 00:29:36 As I mentioned, we are going to expand into multiple countries in a very short span of time, an exciting journey with partners like Several Nines helping us. You can find us on Facebook, on the Circles.Life website, Instagram, Twitter, and Facebook, every area, every social network. And we will be talking at TM Forum and other Asia telco conferences as well. You can find our colleagues there. It's an area I find myself paying an increased amount of attention to as the world continues to progress towards, well, hopefully something. Thank you once again for your time.

Starting point is 00:30:10 I really appreciate it. Thank you, Corey. Thanks for your time as well. Kannan Solaiappan, Head of Reliability and Data Engineering at Circles.Life, brought to us by our friends at Several Nines. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry, insulting comment that will not get posted because your podcast platform of choice is having a problem with one of the 17 different cloud providers that they deploy to, so as a result, nothing works until it's fixed.

Starting point is 00:31:00 If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
