Screaming in the Cloud - Solving the Case of the Infinite Cloud Spend with John Wynkoop
Episode Date: October 24, 2023
John Wynkoop, Cloud Economist & Platypus Herder at The Duckbill Group, joins Corey on Screaming in the Cloud to discuss why he decided to make a career move and become an AWS billing consultant. Corey and John discuss how, once you’re deeply familiar with one cloud provider, those skills become transferable to other cloud providers as well. John also shares the trends he has seen post-pandemic in the world of cloud, including the increased adoption of a multi-cloud strategy and the need for cost control even for VC-funded start-ups.
About John
With over 25 years in IT, John’s done almost every job in the industry, from running cable and answering helpdesk calls to leading engineering teams and advising the C-suite. Before joining The Duckbill Group, he worked across multiple industries including private sector, higher education, and national defense. Most recently he helped IGNW, an industry-leading systems integration partner, get acquired by industry powerhouse CDW. When he’s not helping customers spend smarter on their cloud bill, you can find him enjoying time with his family in the beautiful Smoky Mountains near his home in Knoxville, TN.
Links Referenced:
The Duckbill Group: https://duckbillgroup.com
LinkedIn: https://www.linkedin.com/in/jlwynkoop/
Transcript
Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the
Duckbill Group, Corey Quinn.
This weekly show features conversations with people doing interesting work in the world
of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles
for which Corey refuses to apologize.
This is Screaming in the Cloud.
Welcome to Screaming in the Cloud.
I'm Corey Quinn, and the times, they are a-changing.
My guest today is John Wynkoop.
John, how are you?
Hey, Corey, I'm doing great. Thanks for having me.
So big changes are afoot for you.
You've taken a new job recently.
What are you doing now?
Well, so I'm happy to say I've joined the Duckbill Group as a cloud economist. So I came out of the big company world and have dived back in or dove back into the startup world.
It's interesting because when we talk to those big companies, they always identify us as, oh, you're a startup, which is
hilarious on some level because our AWS account hangs out in AWS's startup group. But if you look
at the spend being remarkably level from month to month to month to year to year to year,
they almost certainly view us as
they're a startup, but they suck at it. They completely failed. And so many of the email
stuff that you get from them presupposes that you're venture backed, that you're trying to
conquer the entire world. We don't do that here. We have this old-timey business model that our
forebears would have understood: we make more money than we spend every month, and we continue that trend for a long time. So first, thanks for joining us both on the show and at the company.
We like having you around. Well, thanks. And yeah, I guess that's maybe a startup isn't the
right word to describe what we do here at the Duckbill Group. But as you said, it seems to fit
into the industry classification.
But it is one of the things I actually really liked that was appealing about joining the team was we do spend less than we make.
And we're not after hyper growth and we're not trying to consume everything.
So it's interesting when you put a job description out into the world and you see who applies.
And let's be clear,
for those who are unaware, job descriptions are inherently aspirational shopping lists.
If you look at a job description and you check every box on the thing and you've done all the things they want, the odds are terrific you're going to be bored out of your mind when you wind
up showing up to do these, whatever that job is. You should be learning stuff and growing. At least that's always been my philosophy to it.
One of the interesting things about you
is that you checked an awful lot of boxes,
but there is one that I think would cause people
to raise an eyebrow,
which is you're relatively new to the fun world of AWS.
Yeah, so obviously, you know,
I've been around the block a few times
when it comes to cloud.
I've used AWS, built some things in AWS, but I wouldn't have classified myself as an AWS guru by any stretch of the imagination.
Spent the last probably three years working in Google Cloud, helping customers build and deploy solutions there.
But I do at least understand the fundamentals of cloud and more importantly, at least for our customers,
cloud cost, because at the end of the day, they're not all that different.
I do want to call out that you have a certain humility to you, which I find endearing, but you're not allowed to do that here. I will sing your praises for you. Before they deprecated it
like they do almost everything else, you were one of the relatively few Google Cloud certified fellows, which was sort of like their heroes program.
Only, you know, they killed it in favor of something else like the champion program or whatnot.
You were very deep in the world of both Kubernetes and Google Cloud. Yeah, so there was a few of us
that were invited to come out
and help Google pilot that program
in, I believe it was 2019,
and give feedback
to help them build the Cloud Fellows program.
And thankfully, I was selected
based on some of our early experience
with Anthos.
And specifically, it was around certified fellow
in what they call hybrid multi-cloud. So experience around Anthos, or at the time,
they hadn't called it Anthos, they were calling it CSP or cloud services platform,
because that's not an overloaded acronym. So yeah, definitely was very humbled to be part
of that early on. I think the program, as you said, grew to about 70 or so, maybe 100
certified individuals before they transitioned, not killed,
transitioned that program into the Cloud Champions program. So those folks are all still around,
myself included. They've just now changed the moniker, but we all get to use the old title
still as well. So that's kind of cool. I have to ask, what would possess you to go from being one of the best in the world at using Google Cloud over here to our corner of the AWS universe?
Because the inverse, if I were to somehow get ejected from here, which would be a neat trick, but I'm sure it's theoretically possible.
Like, what am I going to do now?
I would almost certainly wind up doing something in the AWS ecosystem just due to inertia, if nothing else. You clearly didn't see things
quite that way. Why make the switch? Well, a couple of different reasons. So
being at a Google partner presents a lot of challenges. And one of the things that was
supremely interesting about coming to Duckbill
is that we're independent. So we're not an AWS partner. We are an independent company that
is beholden only to our customers. And there isn't anything like that in the Google ecosystem today.
There's, you know, there's Google partners and then there's Google customers and then there's
Google. So that was part of the appeal.
And the other thing was I enjoy learning new things.
And honestly, learning into the depths of AWS cost hell is interesting.
There's a lot to learn there.
And there's a lot of things that we can extract and use to help customers spend less.
So that to me was super interesting.
And also, I want to help build an organization.
So I think what we're doing here at the Duckbill Group is cool. And I think that there's an
opportunity to grow our services portfolio. And so I'm excited to work with the leadership team
to see what else we can bring to market that's going to help our customers, not just with cost
optimization, not just with contract negotiation, but through the lifecycle of their AWS journey, I guess we'll call it.
It's one of those things where I always have believed on some level that once you're deep in a particular cloud provider, if there's reason for it, you can reskill relatively quickly to a different provider. There are nuances, deep nuances, that differ from provider to provider, but the underlying concepts generally all work the same way.
There's only so many ways you can have data go from point A to point B.
There's only so many ways to spin up a bunch of VMs and whatnot.
And you're proof positive that that theory was correct.
You'd been here less than a
week before I started learning nuances about AWS billing from you. I think it was something to do
with the way that late fees are assessed when companies don't pay Amazon as quickly as Amazon
desires. So we're all learning new things constantly and no one stuffs this stuff all
into their head. But that, if nothing else, definitely cemented the,
yeah, we've got the right person in the seat.
Well, thanks.
And certainly, the deeper you go on a specific cloud provider,
things become fresh in your memory.
They're cached, so to speak.
So coming up to speed on AWS has been a little bit more documentation reading
than it would have been if I were, say, jumping right into a GCP engagement. But as you said, at the end of the day,
there's a lot of similarities. Obviously, understanding the nuances of,
for example, account organization versus GCP's projects and folders. Well, that's a substantial
difference. And so there's a lot of learning that has to happen. Thankfully, all these companies,
maybe with the exception of Oracle, have done a
really good job of documenting all of the concepts in their publicly available documentation.
And then obviously having a team of experts here at the Duckbill Group to ask stupid questions
of doesn't hurt, but definitely it's not as hard to come up to speed as one may think,
once you've got it understood in one provider.
I took a look recently and was kind of surprised to discover that I've been doing this as an
independent consultant prior to the formation of the Duckbill Group for seven years now. And
it's weird, but I've gone through multiple industry cycles and changes as a part of this. And it
feels like I haven't been doing it all that long, but I guess I have. One thing that's definitely
changed is that it used to be that companies would basically pick one provider and almost
everything would live there. At any reasonable point of scale, everyone is using multiple things.
I see Google in effectively every client that we have. It used to be that going to Google Cloud Next was a great place to hang out with AWS customers. But these days, it's just as true
to say that a great reason to go to re:Invent is to hang out with Google Cloud customers.
Everyone uses everything. And that has become much more clear over the last few years.
What have you seen change over the, I guess, since the start of the pandemic, just in terms of broad cycles?
Yeah, so I think there's a couple of different trends that we're seeing.
Obviously, one is that, as you said, especially as large enterprises make moves to the cloud, you see independent teams or divisions within a given organization, leveraging maybe not the right tool for the job
because I think that there's a case to be made
for swapping out a specific set of tools
and having your team learn it.
But we do see what I like to refer to as tool fetishism
where you get a team that's super, super deep into BigQuery
and they're not interested in moving to Redshift or Snowflake or a competitor.
So you see those start to crop up within large organizations where the purchasing power is distributed.
So that's one of the trends: multi-cloud adoption.
And I think the big trend that I like to emphasize around multi-cloud is just because you can run it anywhere doesn't mean you should run it everywhere.
So Kubernetes, as you know, as it took off 2019 timeframe, 2020, we started to see a lot of people using that as an excuse to try to run their production application in two, three public cloud providers and on-prem. And unless you're a SaaS customer or SaaS company
with customers in every cloud,
there's very little reason to do that.
But having that flexibility, that's the other one,
is we've seen that AWS has gotten a little difficult
to negotiate with, or maybe Google and Microsoft
have gotten a little bit more aggressive.
So obviously having that flexibility
and being able to move your workloads, that was another big trend.
I'm seeing a change in things that I had taken as givens back when I started. I mean,
that's part of the reason, incidentally, I write the Last Week in AWS newsletter,
because once you learn a thing, it is very easy not to keep current with that thing.
And things that are not possible
today will be possible tomorrow. How do you keep abreast of all of those changes? And the answer
is to write a deeply sarcastic newsletter that gathers in everything from the world of AWS.
But I don't recommend that for most people. One thing that I've seen in more prosaic terms,
you have a bit of background in, is that HPC on cloud went from being met five, six years ago with, "Oh, that's a good one. Now
pull the other one, it has bells on it," to something that these days is extremely viable.
How'd that happen? So I think that's just a, again, back to trends. I think that's just a
trend that we're seeing from cloud providers in listening to their customers and continuing to improve the service. So one of the reasons that HPC was, especially we'll call it capacity
level HPC or large HPC, right? You've always been able to run high throughput. The cloud is a high
throughput machine, right? You can run a thousand disconnected VMs, no problem, auto-scaling.
Anybody who runs a massive web front end can attest to that. But what we saw with HPC, and we used to call those grid jobs, right?
The small decoupled computing jobs.
But what we've seen is a huge increase in the quality of the underlying fabric.
Things like RDMA being made available.
Things like improved network locality where you now have predictable latency between your nodes or between your VMs. And I think those combined with the huge investment that companies like AWS have made in their file systems,
the huge investment companies like Google have made in their data storage systems,
have made cloud-based HPC viable for organizations, especially at a small scale.
And for a small engineering team
who's looking to run, say,
computer-aided engineering simulation
or who's looking to prototype some new way of testing
or doing some kind of simulation,
it's a huge, huge improvement in speed
because now they don't have to order
a dozen or two dozen or five dozen
nodes, have them shipped, rack them, stack them, cool them, power them, right? They can just spin
up the resource in the cloud, test it out, try their simulation, try out the software that they
want, and then spin it all down if it doesn't work. So that elasticity has also been huge.
And again, I think the big, to kind of summarize, I think the big driver there is the improvement in the service itself.
We're seeing cloud providers taking that discipline a little bit more seriously.
I still see that there are cases where the raw math doesn't necessarily add up for sustained long-term use cases.
But I also see increasingly that with HPC, that's usually not
what the workload looks like. With, you know, the exception of we're going to spend the next 18
months training some new LLM thing. But even then, the pricing is ridiculous. What is it,
their new P6 or whatever it is, P5? The instances that have those giant half-rack NVIDIA cards that
are $800,000 or so a year each
if you were to just rent them straight out.
And then people running fleets of these things,
it's, wow, that's more commas in that training job
than I would have expected.
But I can see just now the availability
driving some of that.
But the economics of that,
once you can get them in your data center,
doesn't strike me as being particularly favoring the cloud.
Yeah, there's a couple of different reasons. So it's almost like an inverse curve, right? There's
a crossover point or a break-even point at which, and you could make this argument with almost any
level of infrastructure, if you can keep it sufficiently full, whether it's AI training, AI inference, or even traditional HPC, if you can keep the machine or the group of machines sufficiently full, it's probably cheaper to buy it and put it in your facility.
But if you don't have a facility or if you don't need to use it 100% of the time, the dividends aren't always there, right? It's not always worth buying a $250,000 compute system, like say an
NVIDIA, like a DGX, right? It's a good example, the DGX H100. I think those are a couple
hundred thousand dollars. If you can't keep that thing full and you just need it for training jobs
or for development, and you have a small team of developers that are only
going to use it six hours a day, it may make sense to spin that up in the cloud and pay for
a fractional use, right? It's no different than what HPC has been doing for probably the past 50
years with national supercomputing centers, which is where my background came from before cloud,
right? It's just a different model, right? One is public economies of, you know,
insert your credit card and spend as much as you want. And the other is grant funded and
supporting academic research. But the economies of scale are kind of the same on both fronts.
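The break-even point John describes can be sketched with some rough arithmetic. This is a minimal illustration, not a pricing model: the $250,000 purchase price and the six-hours-a-day usage come from the conversation, while the operating cost, the three-year lifetime, the $40 hourly cloud rate, and the function names are all assumptions made up for the example.

```python
def break_even_utilization(purchase_price, annual_opex, years, cloud_hourly_rate):
    """Fraction of all hours the machine must be busy before buying
    costs no more than renting the equivalent capacity by the hour."""
    total_hours = years * 365 * 24
    on_prem_total = purchase_price + annual_opex * years
    cloud_total_if_always_on = cloud_hourly_rate * total_hours
    return on_prem_total / cloud_total_if_always_on


def cheaper_option(utilization, break_even):
    """Return which side of the crossover a given utilization falls on."""
    return "buy" if utilization >= break_even else "cloud"


# Hypothetical inputs: $250,000 system, $30,000/year power, cooling and
# staff, kept for 3 years, versus an assumed $40/hour cloud rate.
break_even = break_even_utilization(250_000, 30_000, 3, 40.0)

# A small team using the system six hours a day is at 25% utilization.
six_hours_a_day = 6 / 24
print(f"break-even utilization: {break_even:.1%}")
print(f"six hours a day favors: {cheaper_option(six_hours_a_day, break_even)}")
```

Under these made-up numbers the crossover sits at roughly a third of all hours, so a team that only keeps the machine busy six hours a day comes out ahead renting. Different prices move the crossover, which is the point of running the numbers per workload.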
I'm also seeing a trend that this is something that is sort of disturbing when you realize what
I've been doing and how I've been going about things that for the last couple of
years, people actually started to care about the AWS bill. And I have to say, I felt like I was
severely out of sync with a lot of the world the first few years, because there's giant savings
lurking in your AWS bill. And the company answer in many cases was, we don't care. We'd rather focus our energies on
shipping faster, building something new, expanding, capturing market. And that is logical.
But suddenly those chickens are coming home to roost in a big way. Our phone is ringing off the
hook, as I'm sure you've noticed in your time here. And suddenly money means something again.
What do you think drove it? So I think there's a couple of driving factors. The first is obviously the broader economic conditions, you know, with the economic
growth in the U.S., especially slowing down post-pandemic. We're seeing organizations looking
for opportunities to spend less, to be able to deliver, you know, recoup that money and deliver
additional value. But beyond that, right,
because, okay, but startups are probably still lighting giant piles of VC money on fire.
And that's okay. But what's happening, I think, is that the first wave of CIOs that said cloud
first, cloud only, basically got their comeuppance. And these enterprises saw their explosive cloud bills
and they saw that, oh, we moved 5,000 servers
to AWS or GCP or Azure,
and we got the bill and that's not sustainable.
And so we see a lot of cloud repatriation,
cloud optimization, right?
A lot of second gen cloud,
I'll call them second-gen cloud-native
CIOs coming into these large organizations where their predecessor made some bad financial
decisions and either left or got asked to leave. And now they're trying to stop from lighting
their giant piles of cash on fire. They're trying to stop spending 3x what they were spending on-prem.
I think an easy mistake for folks to make is to get lost in the raw infrastructure cost. I'm not saying it's not important, obviously not, but you could save a giant pile of money on your RDS
instances by running your own database software on top of EC2. But I don't generally recommend
folks do it because you also need
engineering time to be focusing on getting those things up, care and feeding, etc.
And what people lose sight of is the fact that the payroll expense is almost universally more
than the cloud bill at every company I've ever talked to. So there's a consistent series of,
well, we're just trying to get to be the absolute lowest
dollar figure total.
It's the wrong thing to emphasize.
Otherwise, it's cool.
Turn everything off and your bill drops to zero or migrate it to another cloud provider.
AWS bill becomes zero.
Our job is done.
It doesn't actually solve the problem at all.
It's about what's right for the business, not about getting the absolute lowest possible
score like it's some kind of code golf tournament. Right. So I think that there's a couple of
different ways to look at that. One is obviously looking at making your workloads more
cloud native. I know that's a stupid buzzword to some people, but the problem I have
with the term is that it means so many different things to different people. Right. But I think that the gist of that is
taking advantage of what the cloud is good at. And so what we saw was that excess capacity on-prem
was effectively free once you bought it. Right. There was no accountability for burning through extra vCPUs or extra RAM. And then
you had, right, you spin something up in your data center and the question is, is the physical
capacity there? And very few companies had a reaping process until they were suddenly seeing
capacity issues, and suddenly everyone starts asking you a whole bunch of questions about it.
But that was a natural forcing function that existed. Now, S3 has infinite storage,
or it might as well.
They can add capacity fast
and you can fill it.
I know this.
I've tried.
And the problem that you have then
is that it's always just
a couple more cents per gigabyte
and it keeps on going forever.
There's no,
we need to make an investment decision
because the SAN is at 80% capacity.
Do you need all those 16 copies
of the production data
that you haven't touched since 2012?
No, I probably don't.
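The "couple more cents per gigabyte" pattern can be made concrete with a small sketch of storage that only ever grows. The per-GB rate, the starting size, and the growth figure below are illustrative assumptions, not quoted prices, and the function name is invented for the example.

```python
def cumulative_storage_cost(initial_gb, monthly_growth_gb, months, rate_per_gb_month=0.023):
    """Total spend over `months` when nothing is ever deleted.

    Each month is billed on what is stored at the start of that month,
    then the stored amount grows by a fixed number of gigabytes.
    """
    total = 0.0
    stored_gb = initial_gb
    for _ in range(months):
        total += stored_gb * rate_per_gb_month
        stored_gb += monthly_growth_gb
    return total


# Hypothetical: start at 50 TB, add 5 TB a month, never clean up.
one_year = cumulative_storage_cost(50_000, 5_000, 12)
five_years = cumulative_storage_cost(50_000, 5_000, 60)
print(f"first year:  ${one_year:,.0f}")
print(f"five years:  ${five_years:,.0f}")
```

No single month looks alarming, which is why there is no natural moment that forces the "do we still need this?" conversation the way a full SAN did.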
Yeah, there's definitely a forcing function
when you're doing your own capacity planning
and the cloud, for the most part,
as you've alluded to,
for most organizations is infinite capacity.
So when they're looking at AWS
or they're looking at
any of the public cloud providers,
it's a potentially infinite bill.
Now, that scares a lot of organizations.
And so because they didn't have
the forcing function of,
hey, we're out of CPUs
or we're out of hard disk space
or we're out of network ports,
I think that because the cloud was a buzzword
that a lot of shareholders and boards wanted to see in IT status reports and IT strategic plans,
I think we grew a little bit further than we should have from an enterprise perspective.
And I think a lot of that's now being clawed back as organizations are maturing and looking to manage cost. Obviously, the huge growth
of just the term FinOps from a search perspective over the last three years has cemented that,
right? We're seeing a much more cost-conscious consumer, cloud consumer, than we saw three years
ago. I think that the baseline level of understanding has also risen. It used to be that I would go into a client environment, prepared to deploy all kinds of radical stuff that these days looks like context-aware architecture and things that would automatically turn down developer environments when developers were done for the day or whatnot.
And I would discover that, oh, you haven't bought reserved instances in three years.
Maybe start there with
the easy thing. And now you don't see those big misconfigurations or the big oversights the way
that you once did. People are getting better at this, which is a good thing. I'm certainly not
having a problem with this. It means that we get to focus on things that are more architecturally
nuanced, which I love. And I think that it forces us to continue
innovating rather than just doing something that basically any random software stack could provide.
Yeah, I think to your point, the easy wins are being exhausted or have been exhausted already,
right? Very rarely do we walk into a customer and see that they haven't bought a reserved instance or a savings plan.
That's just not a thing.
And the proliferation of software tools to help with those things, of course, in some cases, dubious proposition of we'll fix your cloud bill automatically for a small percentage of the savings that some of those software tools have.
I think those have kind of run their course.
And now you've got a smarter populace
or smarter consumer.
And it does come into the more nuanced stuff, right?
All right.
Do you really need to replicate data across AZs?
Well, not if your workloads aren't stateful.
Well, so some of the old things,
and Kubernetes is a great example of this, right?
The age-old adage of,
if I'm going to spin up an EKS cluster,
I need to put it in three AZs.
Okay, why?
That's going to cost you money.
The cross AZ traffic.
And I know cross AZ traffic is a simple one, but we still see that.
We still see, well, I don't know why I put it across all three AZs.
And so the service-to-service communication inside that cluster, the control plane traffic inside that cluster is costing you money. Now, it might be minimal, but as you grow and as you scale your product or the services
that you're providing internally, that may grow to a non-trivial sum of money.
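A back-of-the-envelope sketch of the cross-AZ point: if service-to-service calls inside a cluster are spread evenly over the zones with no zone-aware routing, most of them leave the caller's zone. The even-spread model, the traffic volume, and the per-GB rate (assumed here to be charged once in each direction) are all assumptions for illustration, so check current pricing before relying on the numbers.

```python
def cross_az_fraction(zone_count):
    """Share of calls that leave the caller's zone when targets are
    picked uniformly across zones (no topology-aware routing)."""
    return (zone_count - 1) / zone_count


def monthly_cross_az_cost(gb_per_day, zone_count, rate_per_gb_each_direction=0.01, days=30):
    """Estimated monthly transfer charge for intra-cluster traffic."""
    cross_az_gb = gb_per_day * cross_az_fraction(zone_count) * days
    # Assumed billing model: the rate applies on egress and again on ingress.
    return cross_az_gb * rate_per_gb_each_direction * 2


# Hypothetical cluster moving 5 TB a day between its own services.
for zones in (1, 2, 3):
    cost = monthly_cross_az_cost(5_000, zones)
    print(f"{zones} AZ(s): about ${cost:,.0f} per month")
```

A single-zone cluster pays nothing under this model, which is the trade-off being questioned: the extra zones buy availability, and the bill shows what that availability costs as traffic grows.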
I think that there's a tipping point where an unbounded growth problem is always going
to emerge as something that needs attention and needs to be focused on. But I should ask you this because you have a skill set that is, as you know, extremely
in demand. You also have that rare gift that I wish wasn't as rare as it is, where you can be
thrown into the deep end, knowing next to nothing about a particular technology stack, and in a
remarkably short period of time, develop what can only be
called subject matter expertise around it. I've seen you do this years past with Kubernetes,
which is something I'm still trying to wrap my head around. You have a natural gift for it,
which meant that for many respects, the world was your oyster. Why this? Why now?
So I think there's a couple of things that are unique at this thing, at this
time point, right? So obviously, helping customers has always been something that's fun and exciting
for me, right? Going into an organization and solving the same problem I've solved
20 different times, for example, spinning up a Kubernetes cluster. I guess I have a little bit
of squirrel syndrome, so to speak, and that gets boring. I'd rather just automate that or build
some tooling and disseminate that to the customers and let them do that. So the thing with cost
management is it's always a different problem. Yeah, we're solving fundamentally the same problem,
which is I'm spending too much, but it's always a different root cause.
In one customer, it could be data transfer fees. In another customer, it could be errant development growth where they're not controlling the spend on their development
environments. In yet another customer, it could be excessive object storage growth.
So being able to hunt and look for those and play detective is really fun. And I think that's one of the things that drew me to this particular area.
The other is just from a timing perspective, this is a problem a lot of organizations have.
And I think it's underserved.
I think that there are not enough companies, service providers, whatever, focusing on the hard problem of cost optimization.
There's too many people who think it's a finance problem and not enough people who think it's an engineering problem.
So I wanted to work at a place
where we think it's an engineering problem.
It's been a very long road.
And I think that engineering problems
and people problems are both fascinating to me.
And the AWS bill is both.
It's often misunderstood as a finance problem
and finance needs to be consulted, absolutely.
But they can't drive an optimization project
and they don't know what the context is
behind an awful lot of decisions that get made.
It really is breaking down bridges,
but also there's a lot of engineering in here too.
It scratches my itch in that direction anyway.
Yeah, it's one of the few
business problems that I think touches multiple areas. As you said, it's obviously a people
problem because we want to make sure that we are supporting and educating our staff.
It's a process problem. Are we making costs visible to the organization? Are we making sure
that there's proper chargeback and showback methodologies, etc.?
But it's also a technology problem.
Did we build this thing
to take advantage of the architecture
or did we shoehorn it in
in a way that's going to cost us
a small fortune?
And I think it touches all three,
which I think is unique.
John, I really want to thank you
for taking the time to speak with me.
If people want to learn more
about what you're up to any given day, where's the best place for them to find you?
Well, thanks, Corey.
And thanks for having me.
And of course, obviously, our website, duckbillgroup.com, is a great place to find out what we're working on, what we have coming.
I also, I'm pretty active on LinkedIn.
I know, I'm not a huge Twitter guy, but I am pretty active on LinkedIn.
So you can always drop me a follow on LinkedIn and I'll try to post interesting and useful content
there for our listeners. And we will, of course, put links to that in the show notes, which in my
case is, of course, extremely self-aggrandizing, but that's all right. We're here to do self-promotion.
Thank you so much for taking the time to chat with me, John. I appreciate it. Now, get back to work.
All right. Thanks, Corey. Have a good one.
John Wynkoop, cloud economist at the Duckbill Group. I'm cloud economist Corey Quinn,
and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star
review on your podcast platform of choice. Whereas if you've hated this
podcast, please leave a five-star review on your podcast platform of choice while also taking pains
to note how you're using multiple podcast platforms these days because that just seems to be the way
the world went. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill
Group works for you, not AWS. We tailor recommendations to your business and we get
to the point. Visit duckbillgroup.com to get started.