Screaming in the Cloud - Solving the Case of the Infinite Cloud Spend with John Wynkoop

Starting point is 00:00:00 Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at the Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Welcome to Screaming in the Cloud. I'm Corey Quinn, and the times, they are a-changing.

Starting point is 00:00:37 My guest today is John Wynkoop. John, how are you? Hey, Corey, I'm doing great. Thanks for having me. So, big changes are afoot for you. You've taken a new job recently. What are you doing now? Well, so I'm happy to say I've joined the Duckbill Group as a cloud economist. So I came out of the big company world and have dived back in or dove back into the startup world.

Starting point is 00:01:06 It's interesting because when we talk to those big companies, they always identify us as, oh, you're a startup, which is hilarious on some level because our AWS account hangs out in AWS's startup group. But if you look at the spend being remarkably level from month to month to month to year to year to year, they almost certainly view us as they're a startup, but they suck at it. They completely failed. And so much of the email stuff that you get from them presupposes that you're venture backed, that you're trying to conquer the entire world. We don't do that here. We have this old timey business model that our forebears would have understood of we make more money than we spend every month and we continue that trend for a long time. So first, thanks for joining us both on the show and at the company.

Starting point is 00:01:51 We like having you around. Well, thanks. And yeah, I guess that's maybe a startup isn't the right word to describe what we do here at the Duckbill Group. But as you said, it seems to fit into the industry classification. But it is one of the things I actually really liked that was appealing about joining the team was we do spend less than we make. And we're not after hyper growth and we're not trying to consume everything. So it's interesting when you put a job description out into the world and you see who applies. And let's be clear, for those who are unaware, job descriptions are inherently aspirational shopping lists.

Starting point is 00:02:34 If you look at a job description and you check every box on the thing and you've done all the things they want, the odds are terrific you're going to be bored out of your mind when you wind up showing up to do whatever that job is. You should be learning stuff and growing. At least that's always been my philosophy to it. One of the interesting things about you is that you checked an awful lot of boxes, but there is one that I think would cause people to raise an eyebrow, which is you're relatively new to the fun world of AWS. Yeah, so obviously, you know,

Starting point is 00:03:01 I've been around the block a few times when it comes to cloud. I've used AWS, built some things in AWS, but I wouldn't have classified myself as an AWS guru by any stretch of the imagination. Spent the last probably three years working in Google Cloud, helping customers build and deploy solutions there. But I do at least understand the fundamentals of cloud and more importantly, at least for our customers, cloud cost, because at the end of the day, they're not all that different. I do want to call out that you have a certain humility to you, which I find endearing, but you're not allowed to do that here. I will sing your praises for you. Before they deprecated it like they do almost everything else, you were one of the relatively few Google Cloud certified fellows, which was sort of like their heroes program.

Starting point is 00:03:51 Only, you know, they killed it in favor of something else like the champion program or whatnot. You were very deep in the world of both Kubernetes and Google Cloud. Yeah, so there was a few of us that were invited to come out and help Google pilot that program in, I believe it was 2019, and give feedback to help them build the Cloud Fellows program. And thankfully, I was selected

Starting point is 00:04:19 based on some of our early experience with Anthos. And specifically, it was around certified fellow in what they call hybrid multi-cloud. So experience around Anthos, or at the time, they hadn't called it Anthos, they were calling it CSP or cloud services platform, because that's not an overloaded acronym. So yeah, definitely was very humbled to be part of that early on. I think the program, as you said, grew to about 70 or so, maybe 100 certified individuals before they transitioned, not killed,

Starting point is 00:04:47 transitioned that program into the Cloud Champions program. So those folks are all still around, myself included. They've just now changed the moniker, but we all get to use the old title still as well. So that's kind of cool. I have to ask, what would possess you to go from being one of the best in the world at using Google Cloud over here to our corner of the AWS universe? Because the inverse, if I were to somehow get ejected from here, which would be a neat trick, but I'm sure it's theoretically possible. Like, what am I going to do now? I would almost certainly wind up doing something in the AWS ecosystem just due to inertia, if nothing else. You clearly didn't see things quite that way. Why make the switch? Well, a couple of different reasons. So being at a Google partner presents a lot of challenges. And one of the things that was

Starting point is 00:05:42 supremely interesting about coming to Duckbill is that we're independent. So we're not an AWS partner. We are an independent company that is beholden only to our customers. And there isn't anything like that in the Google ecosystem today. There's, you know, there's Google partners and then there's Google customers and then there's Google. So that was part of the appeal. And the other thing was I enjoy learning new things. And honestly, learning into the depths of AWS cost hell is interesting. There's a lot to learn there.

Starting point is 00:06:15 And there's a lot of things that we can extract and use to help customers spend less. So that to me was super interesting. And also, I want to help build an organization. So I think what we're doing here at the Duckbill Group is cool. And I think that there's an opportunity to grow our services portfolio. And so I'm excited to work with the leadership team to see what else we can bring to market that's going to help our customers, not just with cost optimization, not just with contract negotiation, but through the lifecycle of their AWS journey, I guess we'll call it. It's one of those things where I always have believed on some level that once you're deep in a particular cloud provider, if there's reason for it, you can reskill relatively quickly to a different provider. There are nuances, deep nuances, that differ from provider to provider, but the underlying concepts generally all work the same way.

Starting point is 00:07:13 There's only so many ways you can have data go from point A to point B. There's only so many ways to spin up a bunch of VMs and whatnot. And you're proof positive that that theory was correct. You'd been here less than a week before I started learning nuances about AWS billing from you. I think it was something to do with the way that late fees are assessed when companies don't pay Amazon as quickly as Amazon desires. So we're all learning new things constantly and no one stuffs this stuff all into their head. But that, if nothing else, definitely cemented the,

Starting point is 00:07:45 yeah, we've got the right person in the seat. Well, thanks. And certainly, the deeper you go on a specific cloud provider, things become fresh in your memory. They're cached, so to speak. So coming up to speed on AWS has been a little bit more documentation reading than it would have been if I were, say, jumping right into a GCP engagement. But as you said, at the end of the day, there's a lot of similarities. Obviously, understanding the nuances of,

Starting point is 00:08:12 for example, account organization versus GCP's project and folders. Well, that's a substantial difference. And so there's a lot of learning that has to happen. Thankfully, all these companies, maybe with the exception of Oracle, have done a really good job of documenting all of the concepts in their publicly available documentation. And then obviously having a team of experts here at the Duckbill Group to ask stupid questions of doesn't hurt, but definitely it's not as hard to come up to speed as one may think, once you've got it understood in one provider. I took a look recently and was kind of surprised to discover that I've been doing this as an

Starting point is 00:08:53 independent consultant prior to the formation of the Duckbill Group for seven years now. And it's weird, but I've gone through multiple industry cycles and changes as a part of this. And it feels like I haven't been doing it all that long, but I guess I have. One thing that's definitely changed is that it used to be that companies would basically pick one provider and almost everything would live there. At any reasonable point of scale, everyone is using multiple things. I see Google in effectively every client that we have. It used to be that going to Google Cloud Next was a great place to hang out with AWS customers. But these days, it's just as true to say that a great reason to go to re:Invent is to hang out with Google Cloud customers. Everyone uses everything. And that has become much more clear over the last few years.

Starting point is 00:09:40 What have you seen change over the, I guess, since the start of the pandemic, just in terms of broad cycles? Yeah, so I think there's a couple of different trends that we're seeing. Obviously, one is that, as you said, especially as large enterprises make moves to the cloud, you see independent teams or divisions within a given organization, leveraging maybe not the right tool for the job because I think that there's a case to be made for swapping out a specific set of tools and having your team learn it. But we do see what I like to refer to as tool fetishism where you get a team that's super, super deep into BigQuery

Starting point is 00:10:23 and they're not interested in moving to Redshift or Snowflake or a competitor. So you see those start to crop up within large organizations where the purchasing power is distributed. So that's one of the trends: multi-cloud adoption. And I think the big trend that I like to emphasize around multi-cloud is just because you can run it anywhere doesn't mean you should run it everywhere. So Kubernetes, as you know, as it took off 2019 timeframe, 2020, we started to see a lot of people using that as an excuse to try to run their production application in two, three public cloud providers and on-prem. And unless you're a SaaS customer or SaaS company with customers in every cloud, there's very little reason to do that. But having that flexibility, that's the other one,

Starting point is 00:11:13 is we've seen that AWS has gotten a little difficult to negotiate with, or maybe Google and Microsoft have gotten a little bit more aggressive. So obviously having that flexibility and being able to move your workloads, that was another big trend. I'm seeing a change in things that I had taken as givens back when I started. I mean, that's part of the reason, incidentally, I write the Last Week in AWS newsletter, because once you learn a thing, it is very easy not to keep current with that thing.

Starting point is 00:11:44 And things that are not possible today will be possible tomorrow. How do you keep abreast of all of those changes? And the answer is to write a deeply sarcastic newsletter that gathers in everything from the world of AWS. But I don't recommend that for most people. One thing that I've seen in more prosaic terms, that you have a bit of background in, is that HPC on cloud went from being met five, six years ago with, oh, that's a good one, now pull the other one, it has bells on it, to something that these days is extremely viable. How'd that happen? So I think that's just a, again, back to trends. I think that's just a trend that we're seeing from cloud providers in listening to their customers and continuing to improve the service. So one of the reasons that HPC was, especially we'll call it capacity

Starting point is 00:12:31 level HPC or large HPC, right? You've always been able to run high throughput. The cloud is a high throughput machine, right? You can run a thousand disconnected VMs, no problem, auto-scaling. Anybody who runs a massive web front end can attest to that. But what we saw with HPC, and we used to call those grid jobs, right? The small decoupled computing jobs. But what we've seen is a huge increase in the quality of the underlying fabric. Things like RDMA being made available. Things like improved network locality where you now have predictable latency between your nodes or between your VMs. And I think those combined with the huge investment that companies like AWS have made in their file systems, the huge investment companies like Google have made in their data storage systems,

Starting point is 00:13:16 have made cloud-based HPC viable for organizations, especially at a small scale. And for a small engineering team who's looking to run, say, computer-aided engineering simulation or who's looking to prototype some new way of testing or doing some kind of simulation, it's a huge, huge improvement in speed because now they don't have to order

Starting point is 00:13:43 a dozen or two dozen or five dozen nodes, have them shipped, rack them, stack them, cool them, power them, right? They can just spin up the resource in the cloud, test it out, try their simulation, try out the software that they want, and then spin it all down if it doesn't work. So that elasticity has also been huge. And again, I think the big, to kind of summarize, I think the big driver there is the improvement in the service itself. We're seeing cloud providers taking that discipline a little bit more seriously. I still see that there are cases where the raw math doesn't necessarily add up for sustained long-term use cases. But I also see increasingly that with HPC, that's usually not

Starting point is 00:14:26 what the workload looks like. With, you know, the exception of we're going to spend the next 18 months training some new LLM thing. But even then, the pricing is ridiculous. What is it, their new P6 or whatever it is, P5? The instances that have those giant half-rack NVIDIA cards that are $800,000 or so a year each if you were to just rent them straight out. And then people running fleets of these things, it's, wow, that's more commas in that training job than I would have expected.

Starting point is 00:14:56 But I can see just now the availability driving some of that. But the economics of that, once you can get them in your data center, doesn't strike me as being particularly favoring the cloud. Yeah, there's a couple of different reasons. So it's almost like an inverse curve, right? There's a crossover point or a break-even point at which, and you could make this argument with almost any level of infrastructure, if you can keep it sufficiently full, whether it's AI training, AI inference, or even traditional HPC, if you can keep the machine or the group of machines sufficiently full, it's probably cheaper to buy it and put it in your facility.
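The crossover point described here can be put into rough numbers. What follows is a minimal sketch in Python, not a pricing model: the function name and every figure in it (purchase price, yearly operating cost, lifetime, cloud hourly rate) are hypothetical placeholders chosen only to show the shape of the calculation.

```python
# Rough buy-versus-rent break-even sketch.
# Every number below is a hypothetical placeholder, not a quoted price.

HOURS_PER_YEAR = 24 * 365


def break_even_utilization(purchase_price, yearly_operating_cost,
                           lifetime_years, cloud_hourly_rate):
    """Return the fraction of all hours the machine must be busy
    for owning it to cost the same as renting the equivalent in the cloud."""
    # Total cost of ownership: hardware plus power, cooling, space, staff.
    total_owned_cost = purchase_price + yearly_operating_cost * lifetime_years
    # Cost of renting the equivalent capacity for every hour of the lifetime.
    always_on_cloud_cost = cloud_hourly_rate * HOURS_PER_YEAR * lifetime_years
    return total_owned_cost / always_on_cloud_cost


if __name__ == "__main__":
    threshold = break_even_utilization(
        purchase_price=250_000,        # assumed hardware price
        yearly_operating_cost=30_000,  # assumed facility and staffing cost
        lifetime_years=3,              # assumed useful life
        cloud_hourly_rate=40.0,        # assumed on-demand rate
    )
    six_hours_a_day = 6 / 24
    print(f"Owning breaks even at about {threshold:.0%} utilization")
    print("Six hours a day favors",
          "buying" if six_hours_a_day > threshold else "renting")
```

Below the threshold, paying for fractional use is cheaper; above it, keeping the machine full in your own facility wins. A real comparison would also account for committed-use discounts, financing, and resale value, none of which are modeled here.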

Starting point is 00:15:34 But if you don't have a facility or if you don't need to use it 100% of the time, the dividends aren't always there, right? It's not always worth buying a $250,000 compute system, like, say, an NVIDIA DGX, right? It's a good example, the DGX H100, I think those are a couple hundred thousand dollars. If you can't keep that thing full and you just need it for training jobs or for development, and you have a small team of developers that are only going to use it six hours a day, it may make sense to spin that up in the cloud and pay for a fractional use, right? It's no different than what HPC has been doing for probably the past 50 years with national supercomputing centers, which is where my background came from before cloud, right? It's just a different model, right? One is public economies of, you know,

Starting point is 00:16:25 insert your credit card and spend as much as you want. And the other is grant funded and supporting academic research. But the economies of scale are kind of the same on both fronts. I'm also seeing a trend that this is something that is sort of disturbing when you realize what I've been doing and how I've been going about things that for the last couple of years, people actually started to care about the AWS bill. And I have to say, I felt like I was severely out of sync with a lot of the world the first few years, because there's giant savings lurking in your AWS bill. And the company answer in many cases was, we don't care. We'd rather focus our energies on shipping faster, building something new, expanding, capturing market. And that is logical.

Starting point is 00:17:11 But suddenly those chickens are coming home to roost in a big way. Our phone is ringing off the hook, as I'm sure you've noticed in your time here. And suddenly money means something again. What do you think drove it? So I think there's a couple of driving factors. The first is obviously the broader economic conditions, you know, with the economic growth in the U.S., especially slowing down post-pandemic. We're seeing organizations looking for opportunities to spend less, to be able to deliver, you know, recoup that money and deliver additional value. But beyond that, right, because, okay, but startups are probably still lighting giant piles of VC money on fire. And that's okay. But what's happening, I think, is that the first wave of CIOs that said cloud

Starting point is 00:17:58 first, cloud only, basically got their comeuppance. And these enterprises saw their explosive cloud bills and they saw that, oh, we moved 5,000 servers to AWS or GCP or Azure, and we got the bill and that's not sustainable. And so we see a lot of cloud repatriation, cloud optimization, right? A lot of second gen cloud, I'll call them second-gen cloud-native

Starting point is 00:18:25 CIOs coming into these large organizations where their predecessor made some bad financial decisions and either left or got asked to leave. And now they're trying to stop from lighting their giant piles of cash on fire. They're trying to stop spending 3x what they were spending on-prem. I think an easy mistake for folks to make is to get lost in the raw infrastructure cost. I'm not saying it's not important, obviously not, but you could save a giant pile of money on your RDS instances by running your own database software on top of EC2. But I don't generally recommend folks do it because you also need engineering time to be focusing on getting those things up, care and feeding, etc. And what people lose sight of is the fact that the payroll expense is almost universally more

Starting point is 00:19:16 than the cloud bill at every company I've ever talked to. So there's a consistent series of, well, we're just trying to get to be the absolute lowest dollar figure total. It's the wrong thing to emphasize. Otherwise, it's cool. Turn everything off and your bill drops to zero or migrate it to another cloud provider. AWS bill becomes zero. Our job is done.

Starting point is 00:19:37 It doesn't actually solve the problem at all. It's about what's right for the business, not about getting the absolute lowest possible score like it's some kind of code golf tournament. Right. So I think that there's a couple of different ways to look at that. One is obviously looking at making your workloads more cloud native. I know that's a stupid buzzword to some people, but the problem I have with the term is that it means so many different things to different people. Right. But I think that the gist of that is taking advantage of what the cloud is good at. And so what we saw was that excess capacity on-prem was effectively free once you bought it, right? There was no accountability for burning through extra vCPUs or extra RAM. And then

Starting point is 00:20:26 you had, right, you spin something up in your data center and the question is, is the physical capacity there? And very few companies had a reaping process until they were suddenly seeing capacity issues, and suddenly everyone starts asking you a whole bunch of questions about it. But that was a natural forcing function that existed. Now, S3 has infinite storage, or it might as well. They can add capacity fast and you can fill it. I know this.

Starting point is 00:20:49 I've tried. And the problem that you have then is that it's always just a couple more cents per gigabyte and it keeps on going forever. There's no, we need to make an investment decision because the SAN is at 80% capacity.
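To see how "a couple more cents per gigabyte" adds up when nothing ever forces a cleanup, here is a minimal sketch in Python. The function name, the starting volume, the growth rate, and the per-gigabyte price are all illustrative assumptions; real object storage pricing is tiered and varies by region and storage class.

```python
# Sketch of unbounded object storage growth.
# All figures are illustrative assumptions, not actual provider pricing.

def cumulative_storage_cost(start_gb, monthly_growth_gb,
                            price_per_gb_month, months):
    """Total spend over `months` when data only accumulates and is never deleted."""
    total = 0.0
    stored = start_gb
    for _ in range(months):
        total += stored * price_per_gb_month  # this month's bill
        stored += monthly_growth_gb           # nothing is ever reaped
    return total


if __name__ == "__main__":
    price = 0.023  # assumed dollars per GB-month
    first_month = 100_000 * price
    last_month = (100_000 + 5_000 * 35) * price
    three_years = cumulative_storage_cost(100_000, 5_000, price, 36)
    print(f"Month 1 bill:  ${first_month:,.0f}")
    print(f"Month 36 bill: ${last_month:,.0f}")
    print(f"36-month total: ${three_years:,.0f}")
```

No single month looks alarming, which is the point: without a capacity ceiling, the only forcing function is someone deciding to look at the trend.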

Starting point is 00:21:01 Do you need all those 16 copies of the production data that you haven't touched since 2012? No, I probably don't. Yeah, there's definitely a forcing function when you're doing your own capacity planning and the cloud, for the most part, as you've alluded to,

Starting point is 00:21:19 for most organizations is infinite capacity. So when they're looking at AWS or they're looking at any of the public cloud providers, it's a potentially infinite bill. Now, that scares a lot of organizations. And so because they didn't have the forcing function of,

Starting point is 00:21:38 hey, we're out of CPUs or we're out of hard disk space or we're out of network ports, I think that because the cloud was a buzzword that a lot of shareholders and boards wanted to see in IT status reports and IT strategic plans, I think we grew a little bit further than we should have from an enterprise perspective. And I think a lot of that's now being clawed back as organizations are maturing and looking to manage cost. Obviously, the huge growth of just the term FinOps from a search perspective over the last three years has cemented that,

Starting point is 00:22:13 right? We're seeing a much more cost-conscious consumer, cloud consumer, than we saw three years ago. I think that the baseline level of understanding has also risen. It used to be that I would go into a client environment, prepare to deploy all kinds of radical stuff that these days look like context-aware architecture and things that would automatically turn down developer environments when developers were done for the day or whatnot. And I would discover that, oh, you haven't bought reserved instances in three years. Maybe start there with the easy thing. And now you don't see those big misconfigurations or the big oversights the way that you once did. People are getting better at this, which is a good thing. I'm certainly not having a problem with this. It means that we get to focus on things that are more architecturally nuanced, which I love. And I think that it forces us to continue

Starting point is 00:23:06 innovating rather than just doing something that basically any random software stack could provide. Yeah, I think to your point, the easy wins are being exhausted or have been exhausted already, right? Very rarely do we walk into a customer and see that they haven't bought a reserved instance or a savings plan. That's just not a thing. And the proliferation of software tools to help with those things, of course, in some cases, dubious proposition of we'll fix your cloud bill automatically for a small percentage of the savings that some of those software tools have. I think those have kind of run their course. And now you've got a smarter populace or smarter consumer.

Starting point is 00:23:48 And it does come into the more nuanced stuff, right? All right. Do you really need to replicate data across AZs? Well, not if your workloads aren't stateful. Well, so some of the old things, and Kubernetes is a great example of this, right? The age-old adage of, if I'm going to spin up an EKS cluster,

Starting point is 00:24:04 I need to put it in three AZs. Okay, why? That's going to cost you money. The cross AZ traffic. And I know cross AZ traffic is a simple one, but we still see that. We still see, well, I don't know why I put it across all three AZs. And so the service-to-service communication inside that cluster, the control plane traffic inside that cluster is costing you money. Now, it might be minimal, but as you grow and as you scale your product or the services that you're providing internally, that may grow to a non-trivial sum of money.
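A back-of-the-envelope version of that cross-AZ cost can be sketched in a few lines of Python. The function name is hypothetical, and the sketch rests on two labeled assumptions: traffic is spread evenly, so a call lands in a different AZ with probability (n - 1) / n, and cross-AZ transfer is billed at an assumed $0.01 per GB in each direction. Check current pricing before relying on either.

```python
# Estimate of cross-AZ data transfer charges for a cluster spread across AZs.
# The per-GB rate and the even-spread traffic model are assumptions.

def monthly_cross_az_cost(total_gb_per_month, az_count,
                          price_per_gb_each_way=0.01):
    """Estimated monthly charge for service-to-service traffic crossing AZs."""
    if az_count < 1:
        raise ValueError("az_count must be at least 1")
    # With evenly spread pods, this share of traffic leaves its own AZ.
    cross_fraction = (az_count - 1) / az_count
    # The transfer is charged once on the way out and once on the way in.
    return total_gb_per_month * cross_fraction * price_per_gb_each_way * 2


if __name__ == "__main__":
    for gb in (1_000, 100_000, 5_000_000):
        cost = monthly_cross_az_cost(gb, az_count=3)
        print(f"{gb:>9,} GB/month across 3 AZs: about ${cost:,.2f}")
```

At small volumes the number rounds to nothing, and at large volumes it becomes a line item worth an architectural conversation. Topology-aware routing or a single-AZ deployment for stateless workloads would lower `cross_fraction`, at the cost of some resilience.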

Starting point is 00:24:35 I think that there's a tipping point where an unbounded growth problem is always going to emerge as something that needs attention and needs to be focused on. But I should ask you this because you have a skill set that is, as you know, extremely in demand. You also have that rare gift that I wish wasn't as rare as it is, where you can be thrown into the deep end, knowing next to nothing about a particular technology stack, and in a remarkably short period of time, develop what can only be called subject matter expertise around it. I've seen you do this years past with Kubernetes, which is something I'm still trying to wrap my head around. You have a natural gift for it, which meant that in many respects, the world was your oyster. Why this? Why now?

Starting point is 00:25:22 So I think there's a couple of things that are unique at this thing, at this time point, right? So obviously, helping customers has always been something that's fun and exciting for me, right? Going into an organization and solving the same problem I've solved 20 different times, for example, spinning up a Kubernetes cluster. I guess I have a little bit of squirrel syndrome, so to speak, and that gets boring. I'd rather just automate that or build some tooling and disseminate that to the customers and let them do that. So the thing with cost management is it's always a different problem. Yeah, we're solving fundamentally the same problem, which is I'm spending too much, but it's always a different root cause.

Starting point is 00:26:09 In one customer, it could be data transfer fees. In another customer, it could be errant development growth where they're not controlling the spend on their development environments. In yet another customer, it could be excessive object storage growth. So being able to hunt and look for those and play detective is really fun. And I think that's one of the things that drew me to this particular area. The other is just from a timing perspective, this is a problem a lot of organizations have. And I think it's underserved. I think that there are not enough companies, service providers, whatever, focusing on the hard problem of cost optimization. There's too many people who think it's a finance problem and not enough people who think it's an engineering problem. So I wanted to work at a place

Starting point is 00:26:50 where we think it's an engineering problem. It's been a very long road. And I think that engineering problems and people problems are both fascinating to me. And the AWS bill is both. It's often misunderstood as a finance problem and finance needs to be consulted, absolutely. But they can't drive an optimization project

Starting point is 00:27:11 and they don't know what the context is behind an awful lot of decisions that get made. It really is breaking down bridges, but also there's a lot of engineering in here too. It scratches my itch in that direction anyway. Yeah, it's one of the few business problems that I think touches multiple areas. As you said, it's obviously a people problem because we want to make sure that we are supporting and educating our staff.

Starting point is 00:27:36 It's a process problem. Are we making costs visible to the organization? Are we making sure that there's proper chargeback and showback methodologies, etc.? But it's also a technology problem. Did we build this thing to take advantage of the architecture or did we shoehorn it in in a way that's going to cost us a small fortune?

Starting point is 00:27:56 And I think it touches all three, which I think is unique. John, I really want to thank you for taking the time to speak with me. If people want to learn more about what you're up to any given day, where's the best place for them to find you? Well, thanks, Corey. And thanks for having me.

Starting point is 00:28:11 And of course, obviously, our website, duckbillgroup.com, is a great place to find out what we're working on, what we have coming. I'm also pretty active on LinkedIn. I'm not a huge Twitter guy, but I am pretty active on LinkedIn. So you can always drop me a follow on LinkedIn and I'll try to post interesting and useful content there for our listeners. And we will, of course, put links to that in the show notes, which in my case is, of course, extremely self-aggrandizing, but that's all right. We're here to do self-promotion. Thank you so much for taking the time to chat with me, John. I appreciate it. Now, get back to work. All right. Thanks, Corey. Have a good one.

Starting point is 00:28:51 John Wynkoop, cloud economist at the Duckbill Group. I'm cloud economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice. Whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice while also taking pains to note how you're using multiple podcast platforms these days because that just seems to be the way the world went. If your AWS bill keeps rising and your blood pressure is doing the same, then you need the Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get

Starting point is 00:29:40 to the point. Visit duckbillgroup.com to get started.
