Screaming in the Cloud - Making Sense of Data with Harry Perks
Episode Date: December 8, 2022
About Harry
Harry has worked at Sysdig for over 6 years, helping organizations mature their journey to cloud native. He's witnessed the evolution of bare metal, VMs, and finally Kubernetes establish itself as the de facto standard for container orchestration. He is part of the product team building Sysdig's troubleshooting and cost offering, helping customers increase their confidence operating and managing Kubernetes.
Previously, Harry ran, and later sold, a cloud hosting provider where he was working hands-on with systems administration. He studied information security and lives in the UK.
Links Referenced:
Sysdig: https://sysdig.com/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is brought to us in part by our friends at Pinecone.
They believe that all anyone really wants is to be understood,
and that includes your users.
AI models combined with the Pinecone Vector Database
let your applications understand and act
on what your users want without making them spell it out. Make your search application find results
by meaning instead of just keywords. Your personalization system make picks based on
relevance instead of just tags. And your security applications match threats by resemblance
instead of just regular expressions.
Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable.
Thanks to my friends at Pinecone for sponsoring this episode.
Visit pinecone.io.
Nobody cares about backups. Stop lying to yourselves.
You care about restores,
usually right after you didn't care enough about backups.
If you're tired of the vulnerabilities,
costs, and slow recoveries
when using snapshots to restore your data,
assuming that you even have them at all,
living in AWS land,
there's an alternative for you.
Check out Veeam. That's V-E-E-A-M
for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore.
Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this
ridiculous podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted
episode has been brought to us by our friends at Sysdig, and they have sent one of their
principal product managers to suffer my slings and arrows. Please welcome Harry Perks.
Hey, Corey. Thanks for hosting me. Good to meet you.
Absolute pleasure. And thanks for basically being willing to suffer all of the various
nonsense I'm about to throw your direction.
Let's start with origin stories. I find that those tend to wind up resonating the most.
Back when I first noticed Sysdig coming into the market, because it was just launching at that point,
it seemed like it was a, we'll call it an innovative approach to observability,
though I don't recall that we used the term observability back then.
It more or less took a look at whatever an application was doing almost at a system call level and tracing what was going on as those requests worked on an individual system, and then providing those in a variety of different forms to reason about.
Is that directionally correct as far as the origin story goes, or am I misremembering
an evening event I went to what feels like half a lifetime ago? I'd say the latter, but just because
it's a funnier answer. But that's correct. So Sysdig was created by Loris Degioanni, one of
the founders of Wireshark. And when containers and Kubernetes were being incepted, it created this problem where you lacked visibility
into what's going on inside these
opaque boxes, these black
boxes which are containers.
So we started using system calls
as a source of truth for
I don't want to say observability, but observability.
And using those system calls
to essentially see what's
going on inside containers from the
outside.
And leveraging system calls, we were able to pull out metrics,
such as the golden signals of applications running in containers,
and network traffic.
So it was a very simple way to instrument applications.
And that was really how monitoring started.
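To make that idea concrete, here is a purely illustrative Python sketch of turning a stream of observed system call events into golden-signal style metrics per container. The event shape, field names, and time window are invented for the example; this is not Sysdig's data model or agent API.

    # Purely illustrative: derive golden-signal-style metrics from a stream of
    # observed system call events. The event shape here is invented for the
    # example; it is not Sysdig's agent API.
    from collections import defaultdict

    def golden_signals(events, window_seconds=60.0):
        """events: iterable of dicts like
        {"container": "web-1", "syscall": "read", "latency_us": 120, "errno": 0}"""
        stats = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_us": 0})
        for ev in events:
            s = stats[ev["container"]]
            s["calls"] += 1
            s["errors"] += 1 if ev["errno"] != 0 else 0
            s["latency_us"] += ev["latency_us"]
        return {
            c: {
                "rate_per_s": s["calls"] / window_seconds,        # traffic
                "error_ratio": s["errors"] / s["calls"],          # errors
                "avg_latency_us": s["latency_us"] / s["calls"],   # latency
            }
            for c, s in stats.items()
        }

The point of the sketch is only the shape of the idea: raw per-process observations taken from outside the container can be rolled up into the traffic, error, and latency signals you would otherwise need in-application instrumentation to get.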
And then Sysdig kind of morphed into a security product. What was it that drove that transformation? Because generally speaking, when you have a product in a particular space
that's aimed at a particular niche, pivots into something that feel as orthogonal as security
don't tend to be something that you see all that often. What did you folks see that wound up driving that change?
The same challenges that were being presented
by containers and microservices for monitoring
were the same challenges for security.
So for runtime security,
it was very difficult for our customers
to be able to understand
what the heck is going on inside a container.
Is a crypto miner being spun up? Is there malicious activity going on? So it made logical
sense to use that same data source system calls to understand both the monitoring and the security
posture of applications. One of the big challenges out there is that security tends to be one of those pervasive things.
I would argue that observability does, too, where once you have a position of being able to see what is going on inside of an environment and be able to reason about it.
And this goes double for inside of containers, which from a cloud provider perspective, at least, seems to be, oh, yeah, just give us the containers.
We don't care what's going on inside, so we're never going to ask, notice, or care. And being able to bridge between that lack of visibility from the
outside of container land and inside of container land has been a perennial problem. There are
security implications, there are cost implications, there are observability challenges, to be sure,
and of course, reliability concerns that flow directly from that, which is, I think,
how most people, at least historically, contextualize observability. It's a fancy
word to describe, is the site about to fall over and crash into the sea? At least in my experience.
Is that your definition of observability? Or have I basically been hijacked by
a number of vendors who have decided to relabel what they'd been doing for 15 years as observability?
I think observability is one of those things
that is down to interpretation,
depending on what is the most recent vendor
you've been speaking with.
But to me, observability is,
am I happy, am I sad?
Are my applications happy? Are they sad?
Am I able to complete business
critical transactions that keep me online, keep me afloat? It's really as simple as that.
There are different ways to implement observability, but it's really, you can't
improve the performance, you can't improve the security posture of things you can't see.
So it's, how do I make sure I can see everything?
And what do I do with that data? That is really what observability means to me.
The entire observability space across the board is really one of those areas that is defined on some level by outliers within it. It's easy to wind up saying that any given observability tool
will, oh, it alerts you
when your application breaks. The problem is that the interesting stuff is often found in the
margins, in the outlier products that wind up emerging from it. What is the specific area of
that space where Sysdig tends to shine the most? Yeah, so you're right. The outliers can typically
cause the problems, and often you don't know what you don't know. And I think if you look at Kubernetes specifically, there is a whole bunch of new problems and challenges and things that you need to be looking at. Maybe I've got a pod that's cycling in a CrashLoopBackOff.
And hey, I'm a developer who's running my application on Kubernetes. I've got this pod in a CrashLoopBackOff. I don't know what that means. And then suddenly I'm being expected to
alert on these problems. Well, how can I alert on things that I didn't even know were a problem?
So one of the things that Sysdig is doing on the observability side is we're looking at all of this data and we're actually presenting opinionated views that help customers make sense of that data.
Almost like, you know, I could present this data and give it to my grandma and she would say, oh, yeah, OK, you've got these pods in CrashLoopBackOff.
You've got these pods that are being CPU throttled. Hey, you know, I didn't know I had to worry about CPU limits or, you know, memory limits,
and now I'm suffering OOM. So I think one of the things that's quite unique about Sysdig
on the monitoring side that a lot of customers are getting value from is demystifying some of those challenges and making a lot of that data actionable.
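As a concrete illustration of the kind of signal Harry describes, here is a minimal Python sketch that uses the official Kubernetes client to flag pods sitting in CrashLoopBackOff or restarting repeatedly. The restart threshold and output format are arbitrary choices for the example, not anything Sysdig ships.

    # Minimal sketch: list pods stuck in CrashLoopBackOff or with high restart
    # counts, using the official Kubernetes Python client. The threshold and
    # printed format are illustrative only.
    from kubernetes import client, config

    def find_unhealthy_pods(restart_threshold=5):
        config.load_kube_config()  # or config.load_incluster_config() inside a cluster
        v1 = client.CoreV1Api()
        for pod in v1.list_pod_for_all_namespaces(watch=False).items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if waiting and waiting.reason == "CrashLoopBackOff":
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"container {cs.name} is in CrashLoopBackOff")
                elif cs.restart_count >= restart_threshold:
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"container {cs.name} restarted {cs.restart_count} times")

    if __name__ == "__main__":
        find_unhealthy_pods()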
At the time of this recording, I've not yet bothered to run
Kubernetes in anger, by which I of course mean production. My production environment is of course
called anger, and similarly to the way that my staging environment is called theory, because
things work in theory but not in production. That is going to be changing in the first quarter of
next year, give or take. The challenge with that, though, is that so much has changed, we'll say,
since the evolution of Kubernetes into something that is mainstream production in most shops.
I stopped working in production environments before that switch really happened. So I'm still
at a relatively amateurish level of understanding around a lot of these things. I'm still thinking
about old school problems like, okay, how big do I make each one of the nodes in my Kubernetes cluster?
Yeah, if I get big systems, it's likelier that there will be economies of scale that start
factoring in fewer nodes to manage, but it does increase the blast radius if one of those nodes
gets affected by something that takes it offline for a while. I'm still at the very early stages of trying to wrap my head around
the nuances of running these things in a production environment. Cost is, of course, a separate
argument. My clients run it everywhere, and I can reason about it surprisingly well for something
that does not lend itself to easy understanding by any sense of the word. And you almost have
to intuit its existence just by looking at the AWS bill.
No, I like your observations.
And I think the last part there around costs
is something that I'm seeing a lot in the industry
and in our customers is,
okay, suddenly I've got a great monitoring posture
or observability posture, whatever that really means.
I've got great security posture.
And as customers are maturing in their journey to Kubernetes,
suddenly there are a bunch of questions that are being asked from the top.
And we've seen this internally, such as,
hey, what is the ROI of each customer?
Or what is the ROI of a specific product line or feature that we deliver to our customers?
And we couldn't answer those problems because we're running a bunch of applications and
software on Kubernetes.
And when we receive our billing reports from the multiple different cloud providers we
use, Azure, AWS, and GCP, we just received a big fat bill that was compute.
And we were unable to break that down by the different teams and business units, which is a real problem. And one of the problems that,
you know, we really wanted to start solving both for internal uses, but also for our customers as well.
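For a sense of what "breaking the bill down by team" can look like at its simplest, here is a hypothetical Python sketch that allocates a flat monthly compute bill across teams in proportion to measured CPU-hours. The team names, usage figures, and bill total are invented, and a real allocation would also need memory, storage, and network dimensions.

    # Minimal sketch: split a flat monthly compute bill across teams in
    # proportion to their measured CPU-hours. All numbers are hypothetical.
    def allocate_compute_bill(total_bill, cpu_hours_by_team):
        total_cpu_hours = sum(cpu_hours_by_team.values())
        return {
            team: round(total_bill * hours / total_cpu_hours, 2)
            for team, hours in cpu_hours_by_team.items()
        }

    usage = {"payments": 1200.0, "search": 800.0, "internal-tools": 400.0}  # CPU-hours
    print(allocate_compute_bill(24000.0, usage))
    # {'payments': 12000.0, 'search': 8000.0, 'internal-tools': 4000.0}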
Yeah. When you have a customer coming in, the easy part of the equation is, well,
how much revenue are we getting from a customer? Well, that's easy enough to just wind up polling your finance group. And yeah, how much have they paid us this year?
Great. Good to know. Then it gets really confusing over on the cost side because it gets into a
unit economic model that I think most shops don't have a particularly advanced understanding of.
If we have another hundred customers sign up this month, what will it cost us to service them?
And what
are the variables that change those numbers? It really gets into a fascinating model where people
more or less do some gut checks and some rounding, but there are a bunch of areas where people get
extraordinarily confused start to finish. Kubernetes is very much one of them because
from a cloud provider's perspective, it's just a single tenant app that is really gnarly in terms of its behavior. It does a bunch of different things. And
from the bill alone, it's hard to tell that you're even running Kubernetes unless you ask.
Yeah, absolutely. And there was a survey from the CNCF recently that said 68% of folks are
seeing increased Kubernetes costs, of course, and 69% of respondents
said that they have no cost monitoring in place, or just cost estimates, which is simply not good
enough. People want to break down that line item to those individual business units and teams,
which is a huge challenge that cloud providers aren't fulfilling today. Where do you see most of
the cost issue breaking down? I mean, there's
some of the stuff that we are never allowed to talk
about when it comes to cost, which is the realistic
assessment that the people who work
on the technology cost more than the technology
itself. There's a
certain, how do we put this,
unflattering perspective
that a lot of people are
deploying Kubernetes into
environments because they want
to bolster their own resume, not because it's the actual right answer to anything that they have
going on. So that's a little hit or miss on some level. I don't know that I necessarily buy into
that, but you take a look at the compute side, you look at the data transfer side, which it seems
that almost everyone mostly tends to ignore, despite the fact that Kubernetes itself has no zone affinity. So it has no idea whether its internal communication is free or expensive.
And it just adds up to a giant question mark. Then you look at Kubernetes architecture diagrams,
or God forbid, the CNCF landscape diagram, and realize, oh my God, they have more of these
things than they do Pokemon. And people give up any hope of understanding it other than just saying it's complicated and accepting that that's just the way that it is. I'm a little less
fatalistic, but I also think it's a heck of a challenge. Absolutely. I mean, the economics of
cloud, why is ingress free, but egress is not free? Why is it so difficult to understand that inter-AZ traffic is completely,
you know, billed separately from public traffic, for example?
And I think network cost is one thing that is extremely challenging for customers.
One, they don't even have that visibility into what is the network traffic,
what is internal traffic, what is public traffic.
But then there's also a whole bunch of other challenges
that are causing Kubernetes cluster costs to rise.
You've got folks that struggle with setting the right requests for Kubernetes,
which ultimately blows up the scale of a Kubernetes cluster.
You've got the complexity of AWS, for example,
economics of instance types. I don't know whether I need to be running ten m5.xlarge instances
versus four Graviton instances.
And this ability to size a cluster correctly,
as well as size a workload correctly,
is very, very difficult.
And customers are not able to establish that baseline today.
And obviously, you can't optimize what you can't see. So I think a lot of customers struggle with
both that visibility, but then the complexity means that it's incredibly difficult to optimize those costs.
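To illustrate the instance-mix question Harry raises, here is a small Python sketch comparing the aggregate vCPU, memory, and hourly cost of two candidate node mixes. The instance shapes and hourly prices are example placeholders; substitute real on-demand or committed-use rates for your region and purchase model.

    # Minimal sketch: compare two candidate node mixes by total vCPU, memory,
    # and hourly cost. Prices are example placeholders, not quoted rates.
    NODE_TYPES = {
        # name: (vCPU, GiB memory, $/hour) -- example values only
        "m5.xlarge": (4, 16, 0.192),
        "m6g.2xlarge": (8, 32, 0.308),
    }

    def mix_cost(mix):
        """mix: dict of instance name -> node count."""
        vcpu = sum(NODE_TYPES[n][0] * count for n, count in mix.items())
        mem = sum(NODE_TYPES[n][1] * count for n, count in mix.items())
        cost = sum(NODE_TYPES[n][2] * count for n, count in mix.items())
        return vcpu, mem, cost

    for mix in ({"m5.xlarge": 10}, {"m6g.2xlarge": 4}):
        vcpu, mem, cost = mix_cost(mix)
        print(f"{mix}: {vcpu} vCPU, {mem} GiB, ${cost:.2f}/hour")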
You folks are starting to dip your toes in the Kubernetes costing space.
What approach are you taking? Sysdig builds products for Kubernetes first.
So if you look at what we're doing on the monitoring space,
we were really pioneering what customers want to get out of Kubernetes observability.
And then we were doing the similar things for security.
So making sure our security product is, I'm going to say Kubernetes native.
And what we're doing on the cost side of the things is, of course, there are a lot of cost
products out there that will give you the ability to slice and dice by AWS service,
for example, but they don't give you that Kubernetes context to then break those costs
down by teams and business units. So Sysdig, we've already been collecting usage information,
resource usage information, requests,
the container CPU, the memory usage.
And a lot of customers are using that data today
for right-sizing.
But one of the things they said was,
hey, I need to quantify this.
I need to put a big fat dollar sign
in front of some of these numbers
we're seeing so I can go to these teams and management and actually prompt them to right size.
So it's quite simple. We're essentially augmenting that resource usage information
with cost data from cloud providers. So instead of customers saying, hey, I'm wasting one terabyte
of memory, they can say, hey, I'm wasting, you know,
500 bucks on memory each month.
So it's very much kind of Kubernetes specific,
you know, using a lot of Kubernetes context and metadata.
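A minimal sketch of that "put a dollar sign on the waste" idea: convert over-requested memory into a monthly dollar figure. The per-GiB-hour price here is a made-up placeholder, not a published rate.

    # Minimal sketch: translate over-requested memory into dollars per month.
    # The blended per-GiB-hour price is a hypothetical placeholder.
    HOURS_PER_MONTH = 730
    PRICE_PER_GIB_HOUR = 0.005  # hypothetical blended memory price

    def monthly_memory_waste(requested_gib, p95_used_gib):
        wasted_gib = max(requested_gib - p95_used_gib, 0)
        return wasted_gib * PRICE_PER_GIB_HOUR * HOURS_PER_MONTH

    # A workload requesting 4 GiB but peaking at 1.5 GiB wastes about $9/month per replica.
    print(f"${monthly_memory_waste(4.0, 1.5):.2f}")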
This episode is sponsored in part by our friends at Uptycs,
because they believe that many of you are looking
to bolster your security posture with CNAPP and XDR solutions.
They offer both cloud and endpoint
security in a single UI and data model. Listeners can get Uptycs for up to a thousand assets through
the end of 2023, that is next year, for one dollar. But this offer is only available for a limited
time on uptycssecretmenu.com. That's U-P-T-Y-C-S, secretmenu.com.
Part of the whole problem that I see across the space is that the way to solve some of these problems internally has been when you start trying to divide costs between different teams is, well, we're just going to give each one their own cluster or their own environment.
That does definitely solve the problem of shared services. The counterpoint
is it solves them by making every team individually incur them. That doesn't necessarily seem like the
best approach in every scenario. One thing I have learned, though, is that for some customers,
that is the right approach. Sounds odd, but that's the world we live in, where context
absolutely matters a lot. I'm very reluctant
these days to say at a glance, oh, you're doing it wrong. You eat a whole lot of crow when you
do that, it turns out. I see this a lot. And I see customers giving their own business units,
their own AWS account, which I kind of feel like is a step backwards. I don't think you're
properly harnessing the power of Kubernetes
and creating this shared tenancy model
when you're giving a team their own AWS account.
I think it's important we break down those silos.
There's so much operational overhead
with maintaining these different accounts
that there must be a better way to address some of these challenges.
It's one of those areas where it depends becomes the appropriate answer to almost anything.
I'm a fan of having almost every workload have its own AWS account within the same shared AWS
organization than with shared VPCs, which tend to work out. But that does add some complexity to
observing how things interact there. One of the
guidances that I've given people is assume in the future that in any architecture diagram you ever
put up there, that there will be an AWS account boundary between any two resources because
someone's going to be doing it somewhere. And that seems to be something that AWS themselves are
just slowly starting to awaken to as well. It's getting easier and easier every week
to wind up working with multiple accounts
in a more complicated structure.
Absolutely.
But I think when you start to adopt a multi-cloud strategy,
suddenly you've got so many more increased dimensions.
I'm running an application in AWS, Azure, and GCP,
and now suddenly I've got all of these sub-accounts.
That is an operational overhead that I don't think jibes very well, considering there is
such a shortage of folks that are real experts, I wouldn't say experts, in operating this
environment.
And it's really, I think, one of the challenges that isn't being spoken enough about today. It feels like so much of the time that Kubernetes is winding up being
an expression of the same way that getting into microservices was, which is, well, we have a
people problem. We're going to solve it with this approach. Great. But then you wind up with people
adopting it where they don't have the context that applied
when the stuff was originally built and designed for. Like with monorepos: yeah, it was a problem when
you had 5,000 developers all trying to work on the same thing and stopping each other, so breaking
that apart made sense. But the counterpoint of where you wind up with companies with 20 developers
and 200 microservices starts to be a little, okay, has this pendulum swung too far?
Yeah, absolutely.
And I think that when you've got so many people
being thrown at a problem,
there's lots of changes being made,
there's new deployments,
and I think things can spiral out of control pretty quickly,
especially when it comes to costs.
Hey, I'm a developer and I've just made this change.
And well, actually, how do I understand what is the financial impact
of this change? Has this blown up my network costs because suddenly
I'm not traversing the right network path? Or suddenly
I'm consuming so much more CPU and actually there is a
physical compute cost to this? There's a lot of cooks in the kitchen
and I think that is causing
a lot of challenges for organizations. You've been working in product for a while, and one of my
favorite parts of being in a position where you are so close to the core of what it is your company
does is that you find it's almost impossible to not continue learning things just based upon how
customers take what you built and the problems that they experience, both that they bring you in to solve.
And of course, the new and exciting problems that you wind up causing for them or to be more charitable, surfacing that they didn't realize already existed.
What have you learned lately from your customers that you didn't see coming?
One of the biggest problems that I've been seeing is I speak to a lot of customers and I've maybe spoken to 40 or 50 customers over the last few months about a variety of topics, whether it's observability in general or on the financial side, Kubernetes costs. And what I hear about time and time again,
regardless as to the vertical or the size of the organization,
is the platform teams, the people closest to Kubernetes,
know their stuff.
They get it.
But a lot of their internal customers,
so the internal business units and teams,
they, of course, don't have the same clarity and understanding.
And these are the people that are getting the most frustrated.
I've been shipping software for 20 years, and now I'm modernizing my applications and starting to use Kubernetes.
I've got so many new different things to learn about that I'm simply drowning in problems, in cloud-native problems.
And I think we forget about that.
Too often we spend time throwing fancy technology at the people,
such as the DevOps engineers, the platform teams.
But a lot of internal customers are struggling to leverage that technology
to actually solve their own problems.
They can't make sense of this data,
and they can't make the right changes based
off of that data. I would say that that is a very common affliction of Kubernetes, where so often
it winds up handling things that are now abstracted away to the point where we don't need
to worry about that. That's true right up until the point where they break and now you have to
go diving into the magic. That's one of the reasons that I was such a fan of Sysdig when it first came out, was the idea that it was getting into what I viewed
at the time as operating system fundamentals, actually seeing what was going on, abstracted
away from the vagaries of the code and a lot more into what system calls is it making? Great. Okay,
now I'm starting to see a lot of calls that it shouldn't necessarily be making, or it's thrashing in a particular way.
And it's almost impossible to get to that level of insight
historically through traditional observability tools.
But being able to take a look at what's going on
from a more fundamentals point of view
was extraordinarily helpful.
I'm optimistic if you can get to a point
where you're able to do that with Kubernetes
given its enraging ecosystem,
for lack of a better term.
Whenever you wind up rolling out Kubernetes,
you've also got to pick some service delivery stuff,
some observability tooling, some log routers,
and so on and so forth.
It feels like by the time you're running anything in production,
you've made so many choices along the way
that the odds that anyone else has made the same choices you have
are vanishingly small.
So you're running your own bespoke unicorn somewhere. Absolutely. Flip a coin and that's
probably one of the solutions that you're going to throw at a problem. And you keep flipping that
coin and then suddenly you're going to reach a combination that nobody else has done before.
And you're right. The knowledge that you have gained from, I don't know, Corey Quinn Enterprises
is probably not going to ring true at Harry Perks Enterprise Limited. There is a whole different set
of problems and technology and people that, you know, of course you can bring some of that knowledge
along. There are some common denominators, but every organization is ultimately using technology
in different ways, which is problematic to the people that are actually pioneering some of these cloud-native applications.
Given my professional interests, I am curious about what it is you're doing as you start moving a little bit away from the security and observability sides and into cost observability.
How are you approaching that? What are the mistakes that you see people making and how are you meeting them where they are?
The biggest challenge that I am seeing is with sizing workloads and sizing clusters.
And I see this time and time again, where our product shines the light on the capacity utilization of compute.
And what it really boils down to is two things.
Platform teams are not using the correct instance types or the combination of instance types to run the workloads for their teams, their application teams. But also application developers
are not setting things like requests correctly, which makes sense.
You know, again, I flip a coin and maybe that's the request I'm going to set. I used to size a
VM with one gig of memory. So now I'm going to size my pod with one gig of memory. But it doesn't
really work like that. And of course, when you request usage, that is essentially my slice of
the pizza that's been carved out. And even if I don't eat that entire slice of pizza,
it's for me, nobody else can use it. So what we're trying to do is really help customers
with that challenge. So if I'm a developer, I want to be looking at the historical usage of
a workload. Maybe it's the maximum usage or the P99 or the P95, and then easily setting my workload
request to that. You keep doing that over the course of
the different teams and applications you have and suddenly you start to establish this baseline of
what is the compute actually needed to run all of these applications. And that helps me answer
the question, what should I size my cluster to? And it's really important because until you've
established that baseline, you can't start to do
things like cluster reshaping, to
pick a different combination of instance
types to power your cluster.
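Here is a minimal Python sketch of the right-sizing flow Harry describes: take a container's historical usage samples, pick a high percentile, add some headroom, and suggest that as the request. The samples, percentile choice, and headroom factor are illustrative assumptions rather than a recommendation.

    # Minimal sketch: suggest a container request from historical usage by
    # picking a high-percentile sample and adding headroom. Inputs are invented.
    def recommend_request(usage_samples, percentile=0.95, headroom=1.10):
        ordered = sorted(usage_samples)
        idx = min(int(len(ordered) * percentile), len(ordered) - 1)
        return ordered[idx] * headroom

    cpu_millicores = [120, 150, 140, 160, 180, 170, 155, 165, 150, 175]  # sampled usage
    print(f"suggested CPU request: {recommend_request(cpu_millicores):.0f}m")

Repeating that per workload is what builds the baseline Harry mentions: the sum of the suggested requests is roughly the compute your cluster actually needs, which is the number cluster reshaping decisions hang off.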
On some level, a lack of diversity in
instance types is a bit of a red flag
just because it generally means that
someone said, oh yeah, we're going to start with this default
instance size, and then we'll adjust
as time goes on. And spoiler, just like anything
else labeled to-do in your codebase, it never gets done. So you find yourself pretty quickly in a scenario
where some workloads are struggling to get the resources they need inside of whatever that
default instance size is. And on the other, you wind up with some things that are more or less
running a cron job once a day and sitting there completely idle but running the whole time regardless.
And optimization and right-sizing
on a lot of these scenarios is a little bit tricky.
I've been something of a,
I'll say a pessimist
when it comes to the idea of right-sizing EC2 instances
just because so many historical workloads
are challenging to get recertified
on newer instance families and the rest.
Whereas when we're running on Kubernetes already, presumably, everything's built in such a way that it can stop existing in a stateless way and the service still continues to work.
If not, it feels like there are some necessary Kubernetes prerequisites that may not have circulated fully internally yet. And to make this even more complicated,
you've got applications that may be more memory-intensive
or CPU-intensive.
So understanding the ratio of CPU to memory requirements
for your applications, depending on how they've been architected,
makes this more challenging.
Pods are jumping around, and that makes it incredibly difficult
to track these movements
and actually pick the instances
that are going to be most appropriate
for my workloads and for my clusters.
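As a rough illustration of matching workload shape to instance shape, this hypothetical Python sketch computes the aggregate memory-to-CPU ratio of a set of workloads and compares it to the common 1:2, 1:4, and 1:8 vCPU-to-GiB family shapes. The workload numbers are invented for the example.

    # Minimal sketch: compare the aggregate GiB-per-vCPU ratio of requested
    # resources against typical instance family shapes. Workloads are invented.
    FAMILY_RATIOS = {
        "compute-optimized (1:2)": 2,
        "general-purpose (1:4)": 4,
        "memory-optimized (1:8)": 8,
    }

    def best_family(workloads):
        total_cpu = sum(w["cpu"] for w in workloads)        # vCPU requested
        total_mem = sum(w["mem_gib"] for w in workloads)    # GiB requested
        ratio = total_mem / total_cpu
        return ratio, min(FAMILY_RATIOS, key=lambda f: abs(FAMILY_RATIOS[f] - ratio))

    workloads = [{"cpu": 2, "mem_gib": 12}, {"cpu": 1, "mem_gib": 6}, {"cpu": 1, "mem_gib": 10}]
    ratio, family = best_family(workloads)
    print(f"aggregate GiB per vCPU: {ratio:.1f} -> closest shape: {family}")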
I really want to thank you
for being so generous with your time.
If people want to learn more,
where's the best place for them to find you?
Sysdig.com, of course,
is where you can learn more about
what Sysdig is doing as a company
and our platform in general.
And we'll, of course, put a link to that in the show notes. Thank you so much for your time. I appreciate it. Thank you, Corey. Hope to speak to you again soon.
Harry Perks, Principal Product Manager at Sysdig. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on
your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review
on your podcast platform of choice, along with an angry, insulting comment that we will lose
track of because we don't know where it was automatically provisioned to.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.