Screaming in the Cloud - Making Sense of Data with Harry Perks
Episode Date: December 8, 2022
About Harry
Harry has worked at Sysdig for over 6 years, helping organizations mature their journey to cloud native. He's witnessed the evolution of bare metal, VMs, and finally Kubernetes establish itself as the de facto standard for container orchestration. He is part of the product team building Sysdig's troubleshooting and cost offering, helping customers increase their confidence operating and managing Kubernetes.
Previously, Harry ran, and later sold, a cloud hosting provider where he was working hands-on with systems administration. He studied information security and lives in the UK.
Links Referenced:
Sysdig: https://sysdig.com/
Transcript
Hello, and welcome to Screaming in the Cloud, with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
This episode is brought to us in part by our friends at Pinecone.
They believe that all anyone really wants is to be understood,
and that includes your users.
AI models combined with the Pinecone Vector Database
let your applications understand and act
on what your users want without making them spell it out. Make your search application find results
by meaning instead of just keywords. Your personalization system make picks based on
relevance instead of just tags. And your security applications match threats by resemblance
instead of just regular expressions.
Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable.
Thanks to my friends at Pinecone for sponsoring this episode.
Visit pinecone.io.
Nobody cares about backups. Stop lying to yourselves.
You care about restores,
usually right after you didn't care enough about backups.
If you're tired of the vulnerabilities,
costs, and slow recoveries
when using snapshots to restore your data,
assuming that you even have them at all,
living in AWS land,
there's an alternative for you.
Check out Veeam. That's V-E-E-A-M
for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore.
Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this
ridiculous podcast. Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted
episode has been brought to us by our friends at Sysdig, and they have sent one of their
principal product managers to suffer my slings and arrows. Please welcome Harry Perks.
Hey, Corey. Thanks for hosting me. Good to meet you.
Absolute pleasure. And thanks for basically being willing to suffer all of the various
nonsense I'm about to throw your direction.
Let's start with origin stories. I find that those tend to wind up resonating the most.
Back when I first noticed Sysdig coming into the market, because it was just launching at that point,
it seemed like it was a, we'll call it an innovative approach to observability,
though I don't recall that we used the term observability back then.
It more or less took a look at whatever an application was doing almost at a system call level and tracing what was going on as those requests worked on an individual system, and then providing those in a variety of different forms to reason about.
Is that directionally correct as far as the origin story goes, or am I misremembering
an evening event I went to what feels like half a lifetime ago? I'd say the latter, but just because
it's a funnier answer. But that's correct. So Sysdig was created by Loris Degioanni, one of
the founders of Wireshark. And when containers and Kubernetes were being incepted, it created this problem where you lacked visibility
into what's going on inside these
opaque boxes, these black
boxes which are containers.
So we started using system calls
as a source of truth for
I don't want to say observability, but observability.
And using those system calls
to essentially see what's
going on inside containers from the
outside.
And leveraging system calls, we were able to pull out metrics,
such as the golden signals of applications running in containers,
and network traffic.
So it was a very simple way to instrument applications.
And that was really how monitoring started.
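To make that idea concrete, here is a purely illustrative Python sketch of turning a stream of observed system call events into golden-signal style metrics per container. The event shape, field names, and time window are invented for the example; this is not Sysdig's data model or agent API.

    # Purely illustrative: derive golden-signal-style metrics from a stream of
    # observed system call events. The event shape here is invented for the
    # example; it is not Sysdig's agent API.
    from collections import defaultdict

    def golden_signals(events, window_seconds=60.0):
        """events: iterable of dicts like
        {"container": "web-1", "syscall": "read", "latency_us": 120, "errno": 0}"""
        stats = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_us": 0})
        for ev in events:
            s = stats[ev["container"]]
            s["calls"] += 1
            s["errors"] += 1 if ev["errno"] != 0 else 0
            s["latency_us"] += ev["latency_us"]
        return {
            c: {
                "rate_per_s": s["calls"] / window_seconds,        # traffic
                "error_ratio": s["errors"] / s["calls"],          # errors
                "avg_latency_us": s["latency_us"] / s["calls"],   # latency
            }
            for c, s in stats.items()
        }

The point of the sketch is only the shape of the idea: raw per-process observations taken from outside the container can be rolled up into the traffic, error, and latency signals you would otherwise need in-application instrumentation to get.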
And then Sysdig kind of morphed into a security product. What was it that drove that transformation? Because generally speaking, when you have a product in a particular space
that's aimed at a particular niche, pivots into something that feel as orthogonal as security
don't tend to be something that you see all that often. What did you folks see that wound up driving that change?
The same challenges that were being presented
by containers and microservices for monitoring
were the same challenges for security.
So for runtime security,
it was very difficult for our customers
to be able to understand
what the heck is going on inside a container.
Is a crypto miner being spun up? Is there malicious activity going on? So it made logical
sense to use that same data source system calls to understand both the monitoring and the security
posture of applications. One of the big challenges out there is that security tends to be one of those pervasive things.
I would argue that observability does, too, where once you have a position of being able to see what is going on inside of an environment and be able to reason about it.
And this goes double for inside of containers, which from a cloud provider perspective, at least, seems to be, oh, yeah, just give us the containers.
We don't care what's going on inside, so we're never going to ask, notice, or care. And being able to bridge between that lack of visibility from the
outside of container land and inside of container land has been a perennial problem. There are
security implications, there are cost implications, there are observability challenges, to be sure,
and of course, reliability concerns that flow directly from that, which is, I think,
how most people, at least historically, contextualize observability. It's a fancy
word to describe, is the site about to fall over and crash into the sea? At least in my experience.
Is that your definition of observability? Or have I basically been hijacked by
a number of vendors who have decided to relabel what they'd been doing for 15 years as observability?
I think observability is one of those things
that is down to interpretation,
depending on what is the most recent vendor
you've been speaking with.
But to me, observability is,
am I happy, am I sad?
Are my applications happy? Are they sad?
Am I able to complete business
critical transactions that keep me online, keep me afloat? It's really as simple as that.
There are different ways to implement observability, but it's really, you can't
improve the performance, you can't improve the security posture of things you can't see.
So it's, how do I make sure I can see everything?
And what do I do with that data? That is really what observability means to me.
The entire observability space across the board is really one of those areas that is defined on some level by outliers within it. It's easy to wind up saying that any given observability tool
will, oh, it alerts you
when your application breaks. The problem is that the interesting stuff is often found in the
margins, in the outlier products that wind up emerging from it. What is the specific area of
that space where Sysdig tends to shine the most? Yeah, so you're right. The outliers can typically
cause the problems, and often you don't know what you don't know. And I think if you look at Kubernetes specifically, there is a whole bunch of new problems and challenges and things that you need to be looking at. Maybe I've got a pod that's cycling in a CrashLoopBackOff.
And hey, I'm a developer who's running my application on Kubernetes. I've got this pod in a CrashLoopBackOff. I don't know what that means. And then suddenly I'm being expected to
alert on these problems. Well, how can I alert on things that I didn't even know were a problem?
So one of the things that Sysdig is doing on the observability side is we're looking at all of this data and we're actually presenting opinionated views that help customers make sense of that data.
Almost like, you know, I could present this data and give it to my grandma and she would say, oh, yeah, OK, you've got these pods in CrashLoopBackOff.
You've got these pods that are being CPU throttled. Hey, you know, I didn't know I had to worry about CPU limits or, you know, memory limits,
and now I'm suffering OOM. So I think one of the things that's quite unique about Sysdig
on the monitoring side that a lot of customers are getting value from is demystifying some of those challenges and making a lot of that data actionable.
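As a concrete illustration of the kind of signal Harry describes, here is a minimal Python sketch that uses the official Kubernetes client to flag pods sitting in CrashLoopBackOff or restarting repeatedly. The restart threshold and output format are arbitrary choices for the example, not anything Sysdig ships.

    # Minimal sketch: list pods stuck in CrashLoopBackOff or with high restart
    # counts, using the official Kubernetes Python client. The threshold and
    # printed format are illustrative only.
    from kubernetes import client, config

    def find_unhealthy_pods(restart_threshold=5):
        config.load_kube_config()  # or config.load_incluster_config() inside a cluster
        v1 = client.CoreV1Api()
        for pod in v1.list_pod_for_all_namespaces(watch=False).items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if waiting and waiting.reason == "CrashLoopBackOff":
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"container {cs.name} is in CrashLoopBackOff")
                elif cs.restart_count >= restart_threshold:
                    print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
                          f"container {cs.name} restarted {cs.restart_count} times")

    if __name__ == "__main__":
        find_unhealthy_pods()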
At the time of this recording, I've not yet bothered to run
Kubernetes in anger, by which I of course mean production. My production environment is of course
called anger, and similarly to the way that my staging environment is called theory, because
things work in theory but not in production. That is going to be changing in the first quarter of
next year, give or take. The challenge with that, though, is that so much has changed, we'll say,
since the evolution of Kubernetes into something that is mainstream production in most shops.
I stopped working in production environments before that switch really happened. So I'm still
at a relatively amateurish level of understanding around a lot of these things. I'm still thinking
about old school problems like, okay, how big do I make each one of the nodes in my Kubernetes cluster?
Yeah, if I get big systems, it's likelier that there will be economies of scale that start
factoring in fewer nodes to manage, but it does increase the blast radius if one of those nodes
gets affected by something that takes it offline for a while. I'm still at the very early stages of trying to wrap my head around
the nuances of running these things in a production environment. Cost is, of course, a separate
argument. My clients run it everywhere, and I can reason about it surprisingly well for something
that does not lend itself to easy understanding by any sense of the word. And you almost have
to intuit its existence just by looking at the AWS bill.
No, I like your observations.
And I think the last part there around costs
is something that I'm seeing a lot in the industry
and in our customers is,
okay, suddenly I've got a great monitoring posture
or observability posture, whatever that really means.
I've got great security posture.
And as customers are maturing in their journey to Kubernetes,
suddenly there are a bunch of questions that are being asked from the top.
And we've seen this internally, such as,
hey, what is the ROI of each customer?
Or what is the ROI of a specific product line or feature that we deliver to our customers?
And we couldn't answer those problems because we're running a bunch of applications and
software on Kubernetes.
And when we receive our billing reports from the multiple different cloud providers we
use, Azure, AWS, and GCP, we just received a big fat bill that was compute.
And we were unable to break that down by the different teams and business units, which is a real problem. And one of the problems that,
you know, we really wanted to start solving both for internal uses, but also for our customers as well.
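For a sense of what "breaking the bill down by team" can look like at its simplest, here is a hypothetical Python sketch that allocates a flat monthly compute bill across teams in proportion to measured CPU-hours. The team names, usage figures, and bill total are invented, and a real allocation would also need memory, storage, and network dimensions.

    # Minimal sketch: split a flat monthly compute bill across teams in
    # proportion to their measured CPU-hours. All numbers are hypothetical.
    def allocate_compute_bill(total_bill, cpu_hours_by_team):
        total_cpu_hours = sum(cpu_hours_by_team.values())
        return {
            team: round(total_bill * hours / total_cpu_hours, 2)
            for team, hours in cpu_hours_by_team.items()
        }

    usage = {"payments": 1200.0, "search": 800.0, "internal-tools": 400.0}  # CPU-hours
    print(allocate_compute_bill(24000.0, usage))
    # {'payments': 12000.0, 'search': 8000.0, 'internal-tools': 4000.0}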
Yeah. When you have a customer coming in, the easy part of the equation is, well,
how much revenue are we getting from a customer? Well, that's easy enough to just wind up polling your finance group. And yeah, how much have they paid us this year?
Great. Good to know. Then it gets really confusing over on the cost side because it gets into a
unit economic model that I think most shops don't have a particularly advanced understanding of.
If we have another hundred customers sign up this month, what will it cost us to service them?
And what
are the variables that change those numbers? It really gets into a fascinating model where people
more or less do some gut checks and some rounding, but there are a bunch of areas where people get
extraordinarily confused start to finish. Kubernetes is very much one of them because
from a cloud provider's perspective, it's just a single tenant app that is really gnarly in terms of its behavior. It does a bunch of different things. And
from the bill alone, it's hard to tell that you're even running Kubernetes unless you ask.
Yeah, absolutely. And there was a survey from the CNCF recently that said 68% of folks are
seeing increased Kubernetes costs, of course, and 69% of respondents
said that they have no cost monitoring in place, or just cost estimates, which is simply not good
enough. People want to break down that line item to those individual business units and teams,
which is a huge challenge that cloud providers aren't fulfilling today. Where do you see most of
the cost issue breaking down? I mean, there's
some of the stuff that we are never allowed to talk
about when it comes to cost, which is the realistic
assessment that the people who work
on the technology cost more than the technology
itself. There's a
certain, how do we put this,
unflattering perspective
that a lot of people are
deploying Kubernetes into
environments because they want
to bolster their own resume, not because it's the actual right answer to anything that they have
going on. So that's a little hit or miss on some level. I don't know that I necessarily buy into
that, but you take a look at the compute side, you look at the data transfer side, which it seems
that almost everyone mostly tends to ignore, despite the fact that Kubernetes itself has no zone affinity. So it has no idea whether its internal communication is free or expensive.
And it just adds up to a giant question mark. Then you look at Kubernetes architecture diagrams,
or God forbid, the CNCF landscape diagram, and realize, oh my God, they have more of these
things than they do Pokemon. And people give up any hope of understanding it other than just saying it's complicated and accepting that that's just the way that it is. I'm a little less
fatalistic, but I also think it's a heck of a challenge. Absolutely. I mean, the economics of
cloud, why is ingress free, but egress is not free? Why is it so difficult to understand that inter-AZ traffic is completely,
you know, billed separately from public traffic, for example?
And I think network cost is one thing that is extremely challenging for customers.
One, they don't even have that visibility into what is the network traffic,
what is internal traffic, what is public traffic.
But then there's also a whole bunch of other challenges
that are causing Kubernetes cluster costs to rise.
You've got folks that struggle with setting the right requests for Kubernetes,
which ultimately blows up the scale of a Kubernetes cluster.
You've got the complexity of AWS, for example,
economics of instance types. I don't know whether I need to be running ten m5.xlarge instances
versus four Graviton instances.
And this ability to size a cluster correctly,
as well as size a workload correctly,
is very, very difficult.
And customers are not able to establish that baseline today.
And obviously, you can't optimize what you can't see. So I think a lot of customers struggle with
both that visibility, but then the complexity means that it's incredibly difficult to optimize those costs.
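To illustrate the instance-mix question Harry raises, here is a small Python sketch comparing the aggregate vCPU, memory, and hourly cost of two candidate node mixes. The instance shapes and hourly prices are example placeholders; substitute real on-demand or committed-use rates for your region and purchase model.

    # Minimal sketch: compare two candidate node mixes by total vCPU, memory,
    # and hourly cost. Prices are example placeholders, not quoted rates.
    NODE_TYPES = {
        # name: (vCPU, GiB memory, $/hour) -- example values only
        "m5.xlarge": (4, 16, 0.192),
        "m6g.2xlarge": (8, 32, 0.308),
    }

    def mix_cost(mix):
        """mix: dict of instance name -> node count."""
        vcpu = sum(NODE_TYPES[n][0] * count for n, count in mix.items())
        mem = sum(NODE_TYPES[n][1] * count for n, count in mix.items())
        cost = sum(NODE_TYPES[n][2] * count for n, count in mix.items())
        return vcpu, mem, cost

    for mix in ({"m5.xlarge": 10}, {"m6g.2xlarge": 4}):
        vcpu, mem, cost = mix_cost(mix)
        print(f"{mix}: {vcpu} vCPU, {mem} GiB, ${cost:.2f}/hour")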
You folks are starting to dip your toes in the Kubernetes costing space.
What approach are you taking? Sysdig builds products for Kubernetes first.
So if you look at what we're doing on the monitoring space,
we were really pioneering what customers want to get out of Kubernetes observability.
And then we were doing the similar things for security.
So making sure our security product is, I'm going to say Kubernetes native.
And what we're doing on the cost side of the things is, of course, there are a lot of cost
products out there that will give you the ability to slice and dice by AWS service,
for example, but they don't give you that Kubernetes context to then break those costs
down by teams and business units. So Sysdig, we've already been collecting usage information,
resource usage information, requests,
the container CPU, the memory usage.
And a lot of customers are using that data today
for right-sizing.
But one of the things they said was,
hey, I need to quantify this.
I need to put a big fat dollar sign
in front of some of these numbers
we're seeing so I can go to these teams and management and actually prompt them to right size.
So it's quite simple. We're essentially augmenting that resource usage information
with cost data from cloud providers. So instead of customers saying, hey, I'm wasting one terabyte
of memory, they can say, hey, I'm wasting, you know,
500 bucks on memory each month.
So it's very much kind of Kubernetes specific,
you know, using a lot of Kubernetes context and metadata.
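A minimal sketch of that "put a dollar sign on the waste" idea: convert over-requested memory into a monthly dollar figure. The per-GiB-hour price here is a made-up placeholder, not a published rate.

    # Minimal sketch: translate over-requested memory into dollars per month.
    # The blended per-GiB-hour price is a hypothetical placeholder.
    HOURS_PER_MONTH = 730
    PRICE_PER_GIB_HOUR = 0.005  # hypothetical blended memory price

    def monthly_memory_waste(requested_gib, p95_used_gib):
        wasted_gib = max(requested_gib - p95_used_gib, 0)
        return wasted_gib * PRICE_PER_GIB_HOUR * HOURS_PER_MONTH

    # A workload requesting 4 GiB but peaking at 1.5 GiB wastes about $9/month per replica.
    print(f"${monthly_memory_waste(4.0, 1.5):.2f}")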
This episode is sponsored in part by our friends at Uptycs,
because they believe that many of you are looking
to bolster your security posture with CNAPP and XDR solutions.
They offer both cloud and endpoint
security in a single UI and data model. Listeners can get Uptycs for up to a thousand assets through
the end of 2023, that is next year, for one dollar. But this offer is only available for a limited
time on uptycssecretmenu.com. That's U-P-T-Y-C-S, secretmenu.com.
Part of the whole problem that I see across the space is that the way to solve some of these problems internally has been when you start trying to divide costs between different teams is, well, we're just going to give each one their own cluster or their own environment.
That does definitely solve the problem of shared services. The counterpoint
is it solves them by making every team individually incur them. That doesn't necessarily seem like the
best approach in every scenario. One thing I have learned, though, is that for some customers,
that is the right approach. Sounds odd, but that's the world we live in, where context
absolutely matters a lot. I'm very reluctant
these days to say at a glance, oh, you're doing it wrong. You eat a whole lot of crow when you
do that, it turns out. I see this a lot. And I see customers giving their own business units,
their own AWS account, which I kind of feel like is a step backwards. I don't think you're
properly harnessing the power of Kubernetes
and creating this shared tenancy model
when you're giving a team their own AWS account.
I think it's important we break down those silos.
There's so much operational overhead
with maintaining these different accounts
that there must be a better way to address some of these challenges.
It's one of those areas where it depends becomes the appropriate answer to almost anything.
I'm a fan of having almost every workload have its own AWS account within the same shared AWS
organization than with shared VPCs, which tend to work out. But that does add some complexity to
observing how things interact there. One of the
guidances that I've given people is assume in the future that in any architecture diagram you ever
put up there, that there will be an AWS account boundary between any two resources because
someone's going to be doing it somewhere. And that seems to be something that AWS themselves are
just slowly starting to awaken to as well. It's getting easier and easier every week
to wind up working with multiple accounts
in a more complicated structure.
Absolutely.
But I think when you start to adopt a multi-cloud strategy,
suddenly you've got so many more increased dimensions.
I'm running an application in AWS, Azure, and GCP,
and now suddenly I've got all of these sub-accounts.
That is an operational overhead that I don't think jibes very well, considering there is
such a shortage of folks that are real experts, I wouldn't say experts, in operating this
environment.
And it's really, I think, one of the challenges that isn't being spoken enough about today. It feels like so much of the time that Kubernetes is winding up being
an expression of the same way that getting into microservices was, which is, well, we have a
people problem. We're going to solve it with this approach. Great. But then you wind up with people
adopting it where they don't have the context that applied
when the stuff was originally built and designed for. Like with monorepos: yeah, it was a problem when
you had 5,000 developers all trying to work on the same thing and stopping each other, so breaking
that apart made sense. But the counterpoint of where you wind up with companies with 20 developers
and 200 microservices starts to be a little, okay, has this pendulum swung too far?
Yeah, absolutely.
And I think that when you've got so many people
being thrown at a problem,
there's lots of changes being made,
there's new deployments,
and I think things can spiral out of control pretty quickly,
especially when it comes to costs.
Hey, I'm a developer and I've just made this change.
And well, actually, how do I understand what is the financial impact
of this change? Has this blown up my network costs because suddenly
I'm not traversing the right network path? Or suddenly
I'm consuming so much more CPU and actually there is a
physical compute cost to this? There's a lot of cooks in the kitchen
and I think that is causing
a lot of challenges for organizations. You've been working in product for a while, and one of my
favorite parts of being in a position where you are so close to the core of what it is your company
does is that you find it's almost impossible to not continue learning things just based upon how
customers take what you built and the problems that they experience, both that they bring you in to solve.
And of course, the new and exciting problems that you wind up causing for them or to be more charitable, surfacing that they didn't realize already existed.
What have you learned lately from your customers that you didn't see coming?
One of the biggest problems that I've been seeing is I speak to a lot of customers and I've maybe spoken to 40 or 50 customers over the last few months about a variety of topics, whether it's observability in general or on the financial side, Kubernetes costs. And what I hear about time and time again,
regardless as to the vertical or the size of the organization,
is the platform teams, the people closest to Kubernetes,
know their stuff.
They get it.
But a lot of their internal customers,
so the internal business units and teams,
they, of course, don't have the same clarity and understanding.
And these are the people that are getting the most frustrated.
I've been shipping software for 20 years, and now I'm modernizing my applications and starting to use Kubernetes.
I've got so many new different things to learn about that I'm simply drowning in problems, in cloud-native problems.
And I think we forget about that.
Too often we spend time throwing fancy technology at the people,
such as the DevOps engineers, the platform teams.
But a lot of internal customers are struggling to leverage that technology
to actually solve their own problems.
They can't make sense of this data,
and they can't make the right changes based
off of that data. I would say that that is a very common affliction of Kubernetes, where so often
it winds up handling things that are now abstracted away to the point where we don't need
to worry about that. That's true right up until the point where they break and now you have to
go diving into the magic. That's one of the reasons that I was such a fan of Sysdig when it first came out, was the idea that it was getting into what I viewed
at the time as operating system fundamentals, actually seeing what was going on, abstracted
away from the vagaries of the code and a lot more into what system calls is it making? Great. Okay,
now I'm starting to see a lot of calls that it shouldn't necessarily be making, or it's thrashing in a particular way.
And it's almost impossible to get to that level of insight
historically through traditional observability tools.
But being able to take a look at what's going on
from a more fundamentals point of view
was extraordinarily helpful.
I'm optimistic if you can get to a point
where you're able to do that with Kubernetes
given its enraging ecosystem,
for lack of a better term.
Whenever you wind up rolling out Kubernetes,
you've also got to pick some service delivery stuff,
some observability tooling, some log routers,
and so on and so forth.
It feels like by the time you're running anything in production,
you've made so many choices along the way
that the odds that anyone else has made the same choices you have
are vanishingly small.
So you're running your own bespoke unicorn somewhere. Absolutely. Flip a coin and that's
probably one of the solutions that you're going to throw at a problem. And you keep flipping that
coin and then suddenly you're going to reach a combination that nobody else has done before.
And you're right. The knowledge that you have gained from, I don't know, Corey Quinn Enterprises
is probably not going to ring true at Harry Perks Enterprise Limited. There is a whole different set
of problems and technology and people that, you know, of course you can bring some of that knowledge
along. There are some common denominators, but every organization is ultimately using technology
in different ways, which is problematic to the people that are actually pioneering some of these cloud-native applications.
Given my professional interests, I am curious about what it is you're doing as you start moving a little bit away from the security and observability sides and into cost observability.
How are you approaching that? What are the mistakes that you see people making and how are you meeting them where they are?
The biggest challenge that I am seeing is with sizing workloads and sizing clusters.
And I see this time and time again, where our product shines the light on the capacity utilization of compute.
And what it really boils down to is two things.
Platform teams are not using the correct instance types or the combination of instance types to run the workloads for their teams, their application teams. But also application developers
are not setting things like requests correctly, which makes sense.
You know, again, I flip a coin and maybe that's the request I'm going to set. I used to size a
VM with one gig of memory. So now I'm going to size my pod with one gig of memory. But it doesn't
really work like that. And of course, when you request usage, that is essentially my slice of
the pizza that's been carved out. And even if I don't eat that entire slice of pizza,
it's for me, nobody else can use it. So what we're trying to do is really help customers
with that challenge. So if I'm a developer, I want to be looking at the historical usage of
a workload. Maybe it's the maximum usage or the P99 or the P95, and then easily setting my workload
request to that. You keep doing that over the course of
the different teams and applications you have and suddenly you start to establish this baseline of
what is the compute actually needed to run all of these applications. And that helps me answer
the question, what should I size my cluster to? And it's really important because until you've
established that baseline, you can't start to do
things like cluster reshaping, to
pick a different combination of instance
types to power your cluster.
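Here is a minimal Python sketch of the right-sizing flow Harry describes: take a container's historical usage samples, pick a high percentile, add some headroom, and suggest that as the request. The samples, percentile choice, and headroom factor are illustrative assumptions rather than a recommendation.

    # Minimal sketch: suggest a container request from historical usage by
    # picking a high-percentile sample and adding headroom. Inputs are invented.
    def recommend_request(usage_samples, percentile=0.95, headroom=1.10):
        ordered = sorted(usage_samples)
        idx = min(int(len(ordered) * percentile), len(ordered) - 1)
        return ordered[idx] * headroom

    cpu_millicores = [120, 150, 140, 160, 180, 170, 155, 165, 150, 175]  # sampled usage
    print(f"suggested CPU request: {recommend_request(cpu_millicores):.0f}m")

Repeating that per workload is what builds the baseline Harry mentions: the sum of the suggested requests is roughly the compute your cluster actually needs, which is the number cluster reshaping decisions hang off.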
On some level, a lack of diversity in
instance types is a bit of a red flag
just because it generally means that
someone said, oh yeah, we're going to start with this default
instance size, and then we'll adjust
as time goes on. And spoiler, just like anything
else labeled to-do in your codebase, it never gets done. So you find yourself pretty quickly in a scenario
where some workloads are struggling to get the resources they need inside of whatever that
default instance size is. And on the other, you wind up with some things that are more or less
running a cron job once a day and sitting there completely idle but running the whole time regardless.
And optimization and right-sizing
on a lot of these scenarios is a little bit tricky.
I've been something of a,
I'll say a pessimist
when it comes to the idea of right-sizing EC2 instances
just because so many historical workloads
are challenging to get recertified
on newer instance families and the rest.
Whereas when we're running on Kubernetes already, presumably, everything's built in such a way that it can stop existing in a stateless way and the service still continues to work.
If not, it feels like there are some necessary Kubernetes prerequisites that may not have circulated fully internally yet. And to make this even more complicated,
you've got applications that may be more memory-intensive
or CPU-intensive.
So understanding the ratio of CPU to memory requirements
for your applications, depending on how they've been architected,
makes this more challenging.
Pods are jumping around, and that makes it incredibly difficult
to track these movements
and actually pick the instances
that are going to be most appropriate
for my workloads and for my clusters.
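As a rough illustration of matching workload shape to instance shape, this hypothetical Python sketch computes the aggregate memory-to-CPU ratio of a set of workloads and compares it to the common 1:2, 1:4, and 1:8 vCPU-to-GiB family shapes. The workload numbers are invented for the example.

    # Minimal sketch: compare the aggregate GiB-per-vCPU ratio of requested
    # resources against typical instance family shapes. Workloads are invented.
    FAMILY_RATIOS = {
        "compute-optimized (1:2)": 2,
        "general-purpose (1:4)": 4,
        "memory-optimized (1:8)": 8,
    }

    def best_family(workloads):
        total_cpu = sum(w["cpu"] for w in workloads)        # vCPU requested
        total_mem = sum(w["mem_gib"] for w in workloads)    # GiB requested
        ratio = total_mem / total_cpu
        return ratio, min(FAMILY_RATIOS, key=lambda f: abs(FAMILY_RATIOS[f] - ratio))

    workloads = [{"cpu": 2, "mem_gib": 12}, {"cpu": 1, "mem_gib": 6}, {"cpu": 1, "mem_gib": 10}]
    ratio, family = best_family(workloads)
    print(f"aggregate GiB per vCPU: {ratio:.1f} -> closest shape: {family}")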
I really want to thank you
for being so generous with your time.
If people want to learn more,
where's the best place for them to find you?
Sysdig.com, of course,
is where you can learn more about
what Sysdig is doing as a company
and our platform in general.
And we'll, of course, put a link to that in the show notes. Thank you so much for your time. I appreciate it. Thank you, Corey. Hope to speak to you again soon.
Harry Perks, Principal Product Manager at Sysdig. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on
your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review
on your podcast platform of choice, along with an angry, insulting comment that we will lose
track of because we don't know where it was automatically provisioned to.
If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying.
The Duckbill Group works for you, not AWS.
We tailor recommendations to your business and we get to the point.
Visit duckbillgroup.com to get started.
This has been a HumblePod production.
Stay humble.