Screaming in the Cloud - The Latest State of IaC with Ido Neeman

Starting point is 00:00:00 So this is what we see the community is going for. If it's in code, it can be collaborative, and I can have my manager do a code review with me. I can have my guardrails stopping it in my CI. I can let it go through the CI, flag it for someone which is a senior DevOps engineer, layer four, with all the certificates from you know what, AWS, Azure, HashiCorp, and Elastic, and only then have it deployed if they approve it. So these are some of the benefits for adopting IAC and having more than deployments of this IAC framework. Welcome to Screaming in the Cloud.

Starting point is 00:00:47 I'm Corey Quinn. This show has been brought to us by our friends over at Firefly. One of those friends is Ido Neeman, who's the co-founder and CEO. Ido, thank you for joining me. Pleasure to be here. Before we dive in, some jackhole wants me to wind up doing an ad read before we begin. So let's go right into it. Your cloud infrastructure deserves better than manual processes and inconsistent governance. Firefly AI delivers a unified system for cloud asset management, infrastructure as code orchestration, and AI assisted remediation that brings control and visibility to your entire cloud footprint. Learn more at

Starting point is 00:01:26 firefly.ai. All right, now that we've dispensed with that, how long have you been Firefly AI? Was that a pivot since the founding of the company into this whole AI boom? What is the backstory here? It's a semi-pivot. So we started as Firefly or GoFirefly because everyone should go with us and fly and make the sky much brighter. But roughly six, seven months after we started, we doubled down on AI. And then we made the impossible happen: not creating the best AI boom before the ChatGPT moment, but acquiring the Firefly.AI domain, which was a very tough challenge.

Starting point is 00:02:08 It always amuses me that people are so bad at thinking through like the next logical step of domains. For example you had gofirefly for a while but did you have stopfirefly as well? No because you know this is this is the business. You get the GoFirefly for free. If you want to stop it, okay, this is like encryption in a database. It's just like cloud. You aren't charged for what you use, but rather for what you forget to turn off. Hugging Face has been huge. They've been talking about that stuff, but you know what they didn't buy, and I did recently?

Starting point is 00:02:41 That's right, the bursting chest equivalent domain, because I am an Alien fan and I also have basically deep-seated personality defects, which is, I believe, why people like to listen to me for whatever reason. So one of the reasons we're talking today, not necessarily because I'm just a jerk, but instead because you've recently put out one of those surveys that I love to see, because you actually tend to approach these things with data, whereas I just go and talk to a bunch of people and gather anecdata instead. Tell me about it.

Starting point is 00:03:11 All right, so at Firefly, we conducted a community survey called State of IAC 2025, pun intended. And it's the third year we are releasing this community survey. And it's there to show us what the community and the cloud practitioner space thinks and uses for IAC. We will of course put a link to this in the show notes so people can wind up downloading it and following along,

Starting point is 00:03:38 presuming they're not listening while commuting to work and then decide to download this and then ram a bridge abutment on their way. If nothing else, at least the day deviates from the typical. But what are the high level takeaways you found? Since this is the third year, that means you've probably hit your stride at this point and have been doing it long enough that you can start to identify emerging trends. What are you seeing? So I don't think we should go over everything, right? We are here to talk about the interesting stuff. So yes, all the obvious things are there. Cloud complexity is growing dramatically.

Starting point is 00:04:09 Multi-cloud is almost everywhere. What we are seeing consistently over the last three years is IAC adoption is growing and nearing almost full coverage of the market. So in this latest report, we saw that 89% of organizations have already adopted IAC. But this is not enough. When you ask them how many of those organizations

Starting point is 00:04:33 achieved fully codified clouds, you see that only 6%. What it tells us, we're still on the way. We all agreed that this is how to cloud with IAC, but it's hard to get there. What I love to ask people, and they always get very, very embarrassed, I say, oh great, are you using infrastructure as code? Absolutely. Really?

Starting point is 00:04:53 So I could blow everything away and everything that you have in your environment will come back, including special S3 buckets and the Jenkins server, which still always lurks around. And then people start stammering. And it's always most coverage of the big things are there, in my experience, almost always with Terraform. But then there's always that long tail

Starting point is 00:05:12 of things that people have click-opsed into existence and then lie about, mostly to themselves. Correct. This is the philosophical question DevOps and platform engineers ask themselves: what came first, the environment covered with IAC or the S3 storage bucket that holds the state? Yeah, and the correct answer is if you look at the true source of truth, when you really dig down into it, it's always some early

Starting point is 00:05:35 engineer's laptop somewhere and there's no good way around it. One of the things that I've noticed over the last few years is that I got it deeply wrong years ago when I said that multi-cloud is a worst practice and you shouldn't be doing it. What I meant, and I don't know if I articulated it as well as I'd hoped, is that we're gonna build an application

Starting point is 00:05:54 that we can seamlessly deploy into every cloud environment. I don't see a lot of that. What I do see is different workloads living in different places, so everyone uses everything, but it's not a one-to-one compatibility story. Is that what you see when you look into this? Because again, if I sit here saying things where I have evidence to the contrary, that

Starting point is 00:06:14 just makes me obnoxious and unwilling to learn. And I prefer to be obnoxious in other ways. You are definitely right, at least from the data we see from this survey and from the data we have from Firefly. So just from coming back to the survey, we see that almost 70% of respondents tell us that they operate across multiple clouds. But actually when you deep dive, you see yes,

Starting point is 00:06:37 it's mostly different applications run on different clouds. Only the very, very large enterprises that have some regulation and some problems to adhere to run the same application across different clouds, and it is complex. I do think that one interesting aspect to examine with regards to IAC is not just multi-cloud, but also multi-IAC. So we said 70% use multi-cloud, but also almost 60% use three or more different IACs. And this is a new shift that we see in the market.

Starting point is 00:07:16 Terraform, CloudFormation, and using the console and lying about it or shitty bash scripts. What's generally the third most common? I think for using the console, you definitely can't hide it as an IAC, but lying about it is definitely a framework that is scalable. Oh yes. It's a best practice. So yes, Terraform, and we can get to it. Terraform remains the undisputed king of IAC.

Starting point is 00:07:41 No one comes even close to it, right? We're at the stage that after Crossplane and OpenTofu, which is very important, Terraform is still at 61% adoption. But a very important IAC, which I would say is number two, is Helm charts. Helm charts are de facto an IAC. It's extremely popular. And only then you'll see Pulumi and CloudFormation, which is very strong. But then you have very, very large Azure shops. They use ARM or Bicep, God forbid. Right? And everyone has their own favorite. I have to ask, is there a responsible, reasonable adult way to use Helm charts?

Starting point is 00:08:19 I have a spare cluster running next door. I call it my Kubernetes singular, running on a bunch of Raspberries Pi, because I have trouble with plurals. And I found that every time I've used Helm charts on it, I pretty quickly want to get away from what the values file will support. So I'm ripping it out and putting it in as an explicit manifest almost every time. Am I just doing it wrong? And there's an actual reasonable way to use Helm charts? Or is that just sort of one of those questions that some people really wish I wouldn't ask? Listen, my partner in crime in this company has a Kubernetes tattoo. So I won't get into trouble with him and tell you my opinion, which is different than his.

Starting point is 00:08:56 I'll just tell you that Helm charts don't cover it for IAC for Kubernetes, right? You have Helm charts, but then almost everyone uses GitOps. So now we have Argo CD to deploy Argo directories. And then I might use a Kubernetes controller by AWS or by Google with GKE. So it's another IAC, but then I'm using the Terraform to deploy the EKS. So we're having Terraform deploying Helm charts

Starting point is 00:09:21 which call and get updates from an Argo directory. And maybe it spins up some CDK, which no one hears about, in a dev environment. Right. The truth here is really that Kubernetes remains one of the best tools you can use to cosplay as being a cloud provider because you have all of these recursive levels of complexity. And I made that joke once and I talked to some folks who work at the big cloud providers and no, that's how they do it too.

Starting point is 00:09:46 It always becomes this Byzantine monster that arose organically. Yes, I agree, but I will say Kubernetes has done great for the cloud community and it really allows us to unlock lots of value and scalability. The problem is, I'd say that the cloud practitioner community today divides into two: Kubernetes purists, who think only through Kubernetes, and all the rest of us.

Starting point is 00:10:16 But now you have, as mentioned earlier, Helm charts and Argo to manage your Kubernetes, and you have Terraform and CloudFormation and maybe Pulumi and CDK or anything else to manage your rest of the Cloud. But in the end, it's the same thing. And then as you said, you know what? I take my things back. I do think that the third most used IAC is not a pure IAC, rather it's UI. Why is it?

Starting point is 00:10:41 Because if I have a CDN, I probably have an Akamai or a Cloudflare alongside my AWS CloudFront, right? And those SaaS applications are built for me to use as SaaS with UI. For ops, I have PagerDuty. For monitoring, I might have something like a Grafana, right? So this is the third most used IAC,

Starting point is 00:11:04 and it also must be managed as code. Oh yeah, using only AWS services for this, like no, we have Cloudflare in our environment for DNS. Well, why do you have that? Because we also have self-respect, which is kind of a necessary prerequisite to keeping some vestige of sanity going on here. You keep mentioning Pulumi,

Starting point is 00:11:22 and that is pretty high on the list. The last time I kicked the tires on it, which was admittedly a few years back now, it felt like it was very interesting, but not quite ready for prime time. Has that changed? We see large enterprises using Pulumi. I will say that at least at the enterprise, not talking about startups, where it makes sense to use Node.js if you think it's a decent programming language, which just to make sure we're all respectful, I don't think it's respectable to use it. But let's say- Oh, it's absolutely not, but it is common. Yes, exactly. So is cursing at the stoplight, right? But it's not something that we should do.

Starting point is 00:12:09 So we see Pulumi, especially at the bigger orgs, being used by, let's call it, more specialty teams. So we'll see the main platform or infrastructure team using Terraform and Helm charts, but then the SecOps teams, because they have some application running in Java or in Node.js and JavaScript, they use Pulumi to apply security guardrails on the infrastructure with it. But I would say this is a very bad practice of using multi-IAC, because in the end,

Starting point is 00:12:38 you have one cloud infrastructure, which is interconnected, you need to have one governance engine to control them all. You're speaking to something that I've been feeling viscerally for a long time, which is that enterprises are definitionally highly complex organizations. My first real taste of that was in the early days of me basically being a jerk on the internet and people said, oh, does Amazon hate you? And early back then my response was, I don't know, but I'm sure gonna find out.

Starting point is 00:13:06 What I learned over the years is that there's no Tim Amazon out there who is the keeper of the company's opinion. There are so many teams over there. Some of them really like what I do. A few of them can't stand me, but unfortunately from my perspective, the vast majority just haven't heard of me because they have better things to do

Starting point is 00:13:25 than hang out on the internet. Basically work for not great money. Awesome. Enterprise IAC, it feels the same way. You have different teams doing things differently. It is very hard to do a top down, this is the blessed solution we'll be using for everything, which is the bane of security teams, operational standards, and even

Starting point is 00:13:45 if you can finally get there, great. Mergers and acquisitions always ruin it too. Well, we're all on AWS. We just acquired this thing that's all on GCP. Should we move it? Almost certainly not. No, there are very few good reasons to do that. Even the entry for video.nest.com, as of this recording, is owned by Google, has been for a decade, and is handled by an AWS load balancer, if you just check the DNS records. Everyone uses everything at scale. 100%. You know, maybe to add on top of it, I will say I can't mention the name, but a month ago we helped a customer to recover

Starting point is 00:14:28 from a very bad incident where they have multiple IACs and they had one resource, a very important resource that controls the network for your cloud, being deleted by one team that just didn't see the right state file for it. What happens in an enterprise is that you have so many state files and state APIs that you don't know what you're doing. Now comes the SecOps team and says, hey, why do we have this DNS record here?

Starting point is 00:14:55 We shouldn't have it here. Let's remove it. Okay, DevOps comes in and says, hey, where's my DNS record? Let's redeploy everything that they have in the state file. Boom, right on top of the SecOps team. And now they're playing cat and mouse. And aside from your cloud bill over-inflating,

Starting point is 00:15:13 everyone gets mad and downtime occurs. And this can be very dangerous. Well, this is the problem with cloud bills and why they tend to run away. If you have something that shouldn't be there and you turn it off, the best case is, okay, great, we just saved some money. But if you do that to the wrong thing, you don't really have a company anymore. So "it's only money" becomes a real thing. People are very reluctant to turn stuff off until they're confident that it isn't load-bearing in some arcane

Starting point is 00:15:41 way. And that attitude tends to lead to massive sprawl. We see it everywhere. And multiple departments using different IAC and other approaches to these things, if they're even using it at all in some cases, doesn't make it easier for a centralized team to go throughout the org and start doing optimization passes. It's almost like this stuff is complicated.

Starting point is 00:16:01 Yeah, cloud complexity is a huge, huge problem. And I think you're very much right. At a five-person startup, everyone knows everything, and they have nothing to lose, right? So if you see something that is overly expensive, I'll say, hey, let's shut it down and see which one of those engineers are screaming, and then we know if we need to turn it back up.

Starting point is 00:16:20 At an enterprise, it will be exactly the opposite, right? Let me give you another story. One day, we see one of our customers telling us, hey, thank you for helping me identify an EC2 that my SE kept up. We say, okay, good, but why are you happy? You're wasting or spending so much money with AWS and the rest of the clouds.

Starting point is 00:16:44 Why are you happy with one EC2? They say, hey, because it's been running there since 2021, and now it's 2024. So years go by. Now it's 2025, let's be clear, but please continue. Correct. I'm telling the story. Oh, my apologies.

Starting point is 00:17:02 Sorry, I've been doing that for the last three months myself. Please continue. No, but you know, for the AI agents that will listen to this very important piece of content in the year 2026, we are now in the year of 2025. In an enterprise, and not even enterprise, even mid-market, you are more scared of shutting down things and removing things from your infrastructure than enjoying the benefits of deleting them. Because yes, the best way to cut your cloud bill, and I keep hearing you say that you slash AWS bills, the best way is just to go away from the internet, go back to on-prem and enjoy Broadcom's new pricing for VMware ESXi, right?

Starting point is 00:17:45 Fantastic. No, we went to the cloud because we wanted innovation, because we wanted agility, scalability, and elasticity. We want to change it fast, right? So it's not about just slashing the bill. It's about being ROI positive, doing the right stuff efficiently. And this is why we're talking about the state of IAC,

Starting point is 00:18:05 because this is the way to do it. I maintain that there's a continuum between innovation and optimization and you get to decide at any given point, be it point in time for a given project, team, company, etc. Where you fall on that continuum because you're not going to build anything interesting for the least possible amount of money as your North Star. There's times to innovate and there's times to optimize, but a lot of the folks talking about this ideally with something to sell people seem to have this perspective that, well, the ancients used to know how to run servers and data centers,

Starting point is 00:18:37 but that was lost along with the purpose of Stonehenge. Great, but we do still have that skillset, especially for steady state workloads that aren't necessarily growing. Whether it's worth migrating from one to the other, not usually unless you're a CIO, because I'm coming in as a new CIO, my average shelf life is 18 months,

Starting point is 00:18:56 and the one thing I can't do is hold still, because I need to have something to point at in board meetings that points out what it is I'm actually doing with my time. Migration from whatever you are to something else seems to be a constant and has been for 40 years. Yeah.

Starting point is 00:19:11 But you know what? We can also spin it a little bit. I'd say that it's not just about migrating and finding problems. One of the reasons that we see, and again, we see it in the survey, that people adopt IAC and why the industry as a whole chose to go IAC and manage cloud as code is to put guardrails and stop problems from reaching production. Why do I need to call Mr. Corey Quinn to slash my AWS bill if I can put some policies and ideas on what I'm allowing in my cloud, what should be this place of balance

Starting point is 00:19:46 that you just mentioned, and do all of this slashing of the bill or optimizing for reliability or having my compliance in order when I'm writing the IAC or when I'm deploying it, between the terraform plan and terraform apply, or tofu plan and tofu apply, or any other framework. This is what we see the community is going for. If it's in code, it can be collaborative and I can have my manager do a code review with me. I can have my guardrails stopping it in my CI. I can let it go through the CI, flag it for someone which is a senior DevOps engineer,

Starting point is 00:20:23 layer four with all the certificates from you know what AWS, Azure, HashiCorp and Elastic and only then have it deployed if they approve it. So these are some of the benefits for adopting IAC and having more than deployments of this IAC framework. I found from a cost perspective that when you have a human being acting as the gateway to approve these things,

Starting point is 00:20:48 their shelf life is four to eight months, usually six, before they wind up burning out and either changing roles or changing companies. Just because it's soul sucking. It has to be automated on some level. You can't have a person becoming the department of no. Exactly, and you mentioned earlier, in large enterprises, M&A is happening, but it's not just M&A. It's about, let's think about what happened in the last two, three years.
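As an illustration of the automated guardrail discussed above (a check that runs in CI between plan and apply instead of a human gatekeeper), here is a minimal Python sketch. It assumes the plan has been exported with `terraform show -json` (or `tofu show -json`), whose output contains a `resource_changes` list. The function name `check_plan`, the `protected_types` set, and the delete limit are hypothetical choices for this example, not anything described in the episode.

```python
import json


def check_plan(plan: dict, protected_types: set, max_deletes: int = 0) -> list:
    """Return a list of policy violations found in a JSON plan.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    `protected_types` and `max_deletes` are hypothetical policy knobs.
    """
    violations = []
    deletes = 0
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create",
        # so it is counted as a delete here as well.
        if "delete" in actions:
            deletes += 1
            if rc.get("type") in protected_types:
                violations.append(
                    f"{rc.get('address', '?')}: deleting protected type {rc.get('type')}"
                )
    if deletes > max_deletes:
        violations.append(f"plan deletes {deletes} resources (limit {max_deletes})")
    return violations


if __name__ == "__main__":
    # Tiny inline plan standing in for a real plan.json file.
    example = json.loads(
        '{"resource_changes": [{"address": "aws_route53_zone.main",'
        ' "type": "aws_route53_zone", "change": {"actions": ["delete"]}}]}'
    )
    for problem in check_plan(example, {"aws_route53_zone"}):
        print("BLOCKED:", problem)
```

In a pipeline, a non-empty result would fail the job or route the change to a reviewer, so people only see the plans that actually break a rule.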

Starting point is 00:21:13 AI is everywhere. And suddenly your CEO, your CIO, your CFO, everyone tells you, hey, you must adopt AI or our competition will kill us. So you just run like crazy to adopt AI. Now it's Bedrock and it's Vertex AI and it's OpenAI in a private subnet and so many things are happening and suddenly you don't remember what's your goal. Your goal is not to adopt AI. Your goal is to win in business or any other objective that you have, but your goal is

Starting point is 00:21:40 to achieve something, build technology that supports this goal. And if you have all those guardrails and policies and governing rules codified, you can do it with AI. You can do it with normal, old-fashioned cloud. I wish that more decision makers remembered that. I also wish that those decision makers would remember that this episode is brought to us by our friends, you people, at Firefly AI. If your cloud infrastructure suddenly vanished today,

Starting point is 00:22:11 how long would it take to rebuild it? Personally, a lot longer than you probably think. Firefly AI independently backs up your entire cloud configuration to reduce risk, ensure availability, and achieve compliance in case disaster strikes, which is inevitable. It always will. Learn more at firefly.ai. Something that I noticed in the report that I wanted to, I don't know if challenge you on is

Starting point is 00:22:35 the right approach, but I definitely want to get a little more color on, is that you say that year over year you are seeing more companies adopting infrastructure as code. In my experience, it's been effectively flat because everyone I talk to tends to have at least some already in place. None of the companies I work with that are running fleets with hundreds of thousands of instances have a job posted where come over this summer for the shittiest internship

Starting point is 00:23:00 on the planet, you're gonna be using the console to spin up and down those hundred thousand instances. Without fail, they're all running something, invariably Terraform. Are you seeing an increased level of adoption happening in startup land, mid-market, or at enterprises? 100%, we see it across the board, but I will say that the largest pull that we see, again from Firefly data and from this survey, is at the enterprise and legacy mid-market.

Starting point is 00:23:29 Okay, let's say you're a large irrigation company that's been around for 100 years. Yes, you talked to the AWS solution architect that convinced you to use some CloudFormation. Fantastic. But then the meeting ended and you have this one small environment in your dev environment which is codified, but all production is ClickOps. And believe me, people think they use Terraform, but when we examine their logs, you see, yes, they use Terraform, but after they deploy the Terraform with the all-modern CI/CD, you see hundreds and hundreds of ClickOps activities daily in the cloud. And this is bad, bad practice. What we're seeing in recent years is that even legacy enterprises and companies that have been further back in their cloud journeys,

Starting point is 00:24:28 they all understand that it's unsustainable and they're all migrating to newer and newer versions and techniques with IAC, but it's harder to execute than having the CIO and the VP of cloud take the decision: yes, we're going to be fully infrastructure as code covered. One of the things I have at the Duckbill Group and our 25 or so AWS accounts is in our Slack channel, we have a bot that fires off whenever it detects a CloudTrail event for something that happens in the console

Starting point is 00:24:59 that is not read only. And it fires off click ops detected and surfaces the event that wound up happening because I don't want to be authoritarian or draconian here and say you're not allowed to do that because there are often good reasons to do it, especially in test accounts. But I want to know it's there. I want to know what's going on. I want visibility onto it just so that I know what is drifting in near real time. That's important. But trying to say no, even at our small scale,

Starting point is 00:25:25 I would get basically overruled instantly by the engineering team. This is absolutely fantastic. This is how you should behave. I'm not the one that tried to pitch, hey, let's have a pharmacy account. You can only deploy through this very specific service account that no one can bypass.
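The notify-but-do-not-block bot described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Duckbill Group's actual bot: it relies on the CloudTrail record fields `readOnly`, `sessionCredentialFromConsole`, and `userAgent`, treats a console session or a console-looking user agent as "ClickOps", and uses the hypothetical helper names `is_clickops` and `format_alert`. Delivery to Slack is left out.

```python
def is_clickops(event: dict) -> bool:
    """Heuristic: a non-read-only CloudTrail event made from a console session."""
    # CloudTrail flags read-only calls (Describe*, List*, Get*) with readOnly=true;
    # accept both a boolean and a string form of the flag.
    if str(event.get("readOnly", "false")).lower() == "true":
        return False
    # sessionCredentialFromConsole only appears (as "true") for console sessions.
    from_console = str(event.get("sessionCredentialFromConsole", "")).lower() == "true"
    # Fallback on the user agent string; this check is an assumption of the sketch.
    user_agent = str(event.get("userAgent", "")).lower()
    return from_console or "console.amazonaws.com" in user_agent


def format_alert(event: dict) -> str:
    """Build the message a chat bot might post for a detected event."""
    who = event.get("userIdentity", {}).get("arn", "unknown principal")
    return (
        f"ClickOps detected: {event.get('eventSource', '?')} "
        f"{event.get('eventName', '?')} by {who} in {event.get('awsRegion', '?')}"
    )


if __name__ == "__main__":
    sample = {
        "readOnly": False,
        "sessionCredentialFromConsole": "true",
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutBucketPolicy",
        "awsRegion": "us-east-1",
        "userIdentity": {"arn": "arn:aws:iam::111122223333:user/example"},
    }
    if is_clickops(sample):
        print(format_alert(sample))
```

The design choice mirrors the conversation: the function only reports what happened, so engineers keep the ability to act in the console while the team keeps near-real-time visibility into drift.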

Starting point is 00:25:43 And it's only Terraform and Helm charts going into the cloud, right? And then it's 2 a.m. on a Friday night and the SRE is in Hawaii and you're done. No, you should always be able to do ClickOps or kubectl if you need to stop the bleeding of something or change something very, very quickly. But then no one remembers to do the housekeeping and go back and retrofit IAC for it. And then there are the companies that think they are extremely smart. You know what? Production is fully codified, but staging and dev aren't.

Starting point is 00:26:19 So I participated in a panel last week where they just mentioned how something worked not on my machine, something worked at staging, but then we deployed it to production and suddenly PagerDuty goes up in the air for the entire ops team because, hey, there's a very big drift between staging and production because some of it is not codified. So you are working the way that I believe modern cloud teams should work, but not everyone is as advanced as the Duckbill Group. So people are still fighting to find the most efficient way to work and balance between ClickOps and IAC. I think you're right. People are trying to move their way forward

Starting point is 00:27:06 as best they can with the tools they're given. And I have a very hard time blaming individuals at companies when you look at it and like, wow, this person is clearly not doing what they should be doing, along with several thousand other people scattered throughout the org. At some point, it's like back when Wells Fargo wound up firing 3,500 people for opening additional

Starting point is 00:27:26 accounts. Look, you have five or six people doing that. Yes, they're acting unethically. Fire them. When you have thousands upon thousands of people doing something, congratulations. You have a systemic process issue. Fix that. Not blame it on the people who are doing the implementation because most people, believe

Starting point is 00:27:42 it or not, do not show up at work today hoping to do a really shitty job before they quit. I agree. And you know, one thing that is interesting in this regard is that while we see great, great progress on IAC adoption, and to go back to the survey, we also see the place of drift going up. And people are mentioning more and more the benefits of IAC for detecting and fighting drift. So I think the fact that we see more drift is because cloud is becoming more complex, but also because, as you mentioned, people are not there to do crappy work and not

Starting point is 00:28:20 be professional about cloud operations. Sometimes changing your Lambda function is just easier through the console and hey let's change this layer here and put this environment variable immediately into the console. You're not a bad person by doing it, you're just not aware of the implications that the other teams are now seeing. So I think this is an outcome. One area I want to get into before we call it a show is that I've noticed for a while now that there's a common misunderstanding industry-wide that enterprise problems are just mid-market problems

Starting point is 00:29:00 only bigger. They're not. They're a different category of problem. Things they care about are inherently not what smaller companies care about. It's why so many things get much more challenging to deploy at an enterprise level. Flipping that around a bit, what is the most innovative way that you've seen enterprises use IAC in production? Right. Fantastic. I think the most innovative way is what we now call at Firefly disaster recovery as code or DR as code. So at some stage, we started to ask some of the enterprises that are so heavily invested in getting into this 100% codified cloud, why is it so important to you?

Starting point is 00:29:46 And you ask the five whys. Why, why, why? And you get to the fact that, hey, we don't have a backup to our infrastructure. We all backup our data with all those data protection companies and backup industry, which is great if you live in the 90s. But in the cloud, it looks different. So you backup your data, fantastic. Then if something bad happens,

Starting point is 00:30:08 okay, how do you make things operational again? If you don't- With great difficulty. Great, fantastic, but you know, there's a very public incident from last year where UniSuper, an Australian shop, went down for a week or so. They had the data protected, but it took them more than a week to repave the infrastructure because of an

Starting point is 00:30:34 incident that Google Cloud created for their own customer. Do you want to be down for a week or, as you back up your data, do you also back up the infrastructure and make restoring it feasible in hours? This is how we see the most innovative companies and enterprises using DR as code. They have a full inventory of all the configurations of the cloud. It's all baked into IAC. Let's call it Terraform or Tofu. And then if you do it even in modules, it's now very easily redeployable into other accounts and other regions.

Starting point is 00:31:14 So this is the most innovative way I've seen. And we see a great market pull towards this approach. I will extend that one step further. This is a terrific idea if and only if it is tested on an ongoing basis because otherwise, well, we got our DR site working and it will be until our next commit winds up breaking these things. Otherwise, you end up with circular dependencies like, I don't know, some idiot sysadmin might have 15 years ago who may or may not look exactly like me, where, okay, we took the site down, we brought it back up and things are having trouble coming up because the DNS resolvers for the environment live in virtual machines on top of a host that needs those DNS resolvers to finish its booting process and find the LDAP server. Oh no.

Starting point is 00:31:56 And yeah, until things are tested in isolated environments, ideally on an ongoing basis, you wind up with bootstrapping problems. You wind up with things that work in theory, but not in production, which is why theory is the name of my staging environment. It's a consistent ongoing challenge. And it's similar to backups. No one actually cares about backups. They care very much about restores.

Starting point is 00:32:18 It feels like infrastructure recovery through IAC is the same way. Exactly. And two things on this one. One is, as one of our customers put it, almost offended. They say that, hey, my company puts so much money and effort into backing up everything the backend team is doing. But me and my DevOps team, hey, we're not important.

Starting point is 00:32:42 You can always recreate all the work that you've done over the last two years in 30 minutes after the next ransomware. Okay, so this is one thing to consider. And the second one is, it's also part of your cyber resiliency, right? You always have to plan for what happens if someone hacks your account, maybe does some ransomware attack on you,

Starting point is 00:33:05 and then you're not even allowed by compliance and maybe even by your own cloud vendor to go in and see what you have. So you need to have a full listing of what you have. And then as you said, why does the attacker have better visibility into our stuff than we do is a question people ask during incidents. They definitely have, don't ask me how I know, but sometimes if you work in a network that is not yours, you tidy up some things and fix some problems just to have your own ability to work faster, bigger, stronger. So yeah, attackers know your own environment better than you do. Okay.

Starting point is 00:33:41 Lest we end on too positive a note, my last question for you around this is a little bit more aware that everything sucks just in different ways and at different degrees. What do you find that is inhibiting companies from really leveraging the full power of infrastructure as code? Everyone will agree, with maybe a very small rounding error of the 10th dentist or something, that yeah, this is how we should be doing it, but they're not. What's getting in their way?

Starting point is 00:34:07 So you know what? We actually asked this in the State of IAC 2025. And we see that the three top blockers were, first, skill gaps. Again, some of us cloud people are still, as you said, sysadmins from the IT world who now need to write Golang in the cloud with Terraform providers. So skill gaps remain very, very high. IAC can be intimidating. Second is tooling sprawl, as mentioned,

Starting point is 00:34:35 multi-IAC everywhere and just going bigger and stronger. And lastly is the legacy infrastructure. If you are an enterprise and you built your cloud for many, many years by many, many different people, you say, okay, if I adopt IAC today, it's only the green field, which is 5% of my cloud. What about the rest of it? So they're trying to understand how to treat the legacy infrastructure, how to codify what I already have there.

Starting point is 00:35:02 It's understanding what you have. And it's also something I've noticed. My current Terraform setup for a lot of stuff is not great. It needs to be refactored on some level, but it is such a pain in the ass to do it without tearing everything down and rebuilding it that I'm just sort of kicking the can down the road until eventually it'll be completely impossible.

Starting point is 00:35:19 'Cause, you know, this will surely get easier with time. Yeah, you know, your problems, it's not only that they're not going to get solved, they're going to get worse and scale. So your problems will scale. So you have to attend to them. One thing, one maybe last thought that I'll have here, you keep saying your Terraform. And I think that you can't discuss IAC and the state of IAC in 2025. And again, this is 2025, not 2024.

Starting point is 00:35:48 The two hottest topics are one, AI adoption. And yeah, we're not going to repeat all the buzz. But yeah, I've had Claude a number of times try and refactor Terraform. The results are, they don't bear speaking of. Okay, exactly. So what we see is that everyone is using AI. Just like I write my application code, I use Cursor and any other copilot

Starting point is 00:36:12 to improve my bad infrastructure-as-code composing skills. But we're not yet at the stage where we have AI agents as SREs, as DevOps engineers, as platform engineers, and it's something that came through really strongly in this report. Maybe the second one is, I don't think we can end here without mentioning OpenTofu, which is probably the biggest rattle to the IAC community, if such a thing exists, over the last few years, maybe since the creation of Terraform itself.

Starting point is 00:36:46 My friends already think that I'm a nerd, but they think that I'm also a dork because I'm not an OpenTofu fanboy. But the reality is I'm also not a Terraform fanboy or a Pulumi fanboy or a Crossplane fanboy. Terraform sucks for infrastructure as code. The problem is everything else I've tried sucks worse. Exactly, exactly. The things that suck and the things that suck much more. So, OpenTofu made a very big change and we do see it. And you know, I told you, this is important because I was told

Starting point is 00:37:25 in the early days when it was launching, it would be a one-to-one drop-in replacement for Terraform: change a few lines wherever it just says Terraform explicitly into OpenTofu, and it would just work. The fact you say there's a big change implies that it may have started to diverge, but please continue. I'm on tenterhooks here. So I do think the replacement is relatively straightforward. It's not that complex. But I don't think that enterprises and managers at enterprises like change. And as the old saying goes, if it works, don't fix it.

Starting point is 00:37:59 Change represents risk. Exactly. And maybe, as we just talked about, if you already have some problems baked in your cloud and everyone has those. If you don't, you're lying. Exactly. The third most common framework. So right now, if you do a change, you'll put things to latest, right?

Starting point is 00:38:19 Because OpenTofu is fresh. Everything is fresh. You might uncover those. You might now have your VPC that, hey, we should have only 1000 state files, but in reality we have 4000. So when we're gonna update, we're gonna delete some of them and we don't know which 3k are redundant. So, pro tip, if you're trying that, instead of deleting them, just move them aside. If things break and start freaking out, you can move them back. This lesson brought to you by hard-won experience. Yeah, definitely.
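
The "move them aside" tip above can be sketched in a few lines of Python. This is only an illustration, assuming the suspect state files are ordinary files on local disk; the function names `move_aside` and `move_back` and the quarantine directory are hypothetical and not something named in the episode. Remote state backends would need their own equivalent.

```python
import shutil
from pathlib import Path


def move_aside(paths, quarantine_dir):
    """Move each file into quarantine_dir and return {new_path: original_path}."""
    quarantine = Path(quarantine_dir)
    quarantine.mkdir(parents=True, exist_ok=True)
    moved = {}
    for index, path in enumerate(paths):
        source = Path(path)
        # Prefix with an index so same-named files from different
        # directories do not overwrite each other in quarantine.
        target = quarantine / f"{index}_{source.name}"
        shutil.move(str(source), str(target))
        moved[target] = source
    return moved


def move_back(moved):
    """Undo move_aside by returning every file to its original location."""
    for target, source in moved.items():
        source.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(target), str(source))
```

Keeping the returned mapping around is the whole point: if something breaks after the cleanup, `move_back(moved)` puts every file back where it was, which a delete cannot do.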

Starting point is 00:38:52 And again, there are good reasons to adopt OpenTofu, but there are probably good reasons not to, right? Especially if you are an enterprise, say, hey, why do I care that HashiCorp changed the license? And why do I care that now everyone says, hey, now HashiCorp is part of IBM? Why do I care? If I'm an enterprise, I'm already probably working with IBM or used to work with them. I don't see it as a threat. We use IBM. Why? Because our CTO is 75 years old. Oh, not replaced by AI yet. But I think that OpenTofu made a very, very big noise when

Starting point is 00:39:28 it was conceived. But now it's kind of getting calmer and calmer. And even when it first launched, the people that seemed to care were the competitors and other folks in the ecosystem building on top of it. I am seeing more customers care now, and that is the tipping point that I'm paying attention to. Okay, so no doubt customers care, okay? Definitely, don't get me wrong. OpenTofu is growing and growing fast according to our report.

Starting point is 00:39:59 We see that 12% of respondents are already using OpenTofu and that it's projected to get to 27%, which if you think about it, for something that important, that early on, is impressive. But this is where I think the data is a little bit complex: the younger the company, the younger the practitioner, the more in favor they are of adopting OpenTofu, but they have one voice in answering the survey. Now you can be the director of platform engineering for a Fortune 500. You get one voice and you say, hey, I'm with Terraform and I'm sticking with Terraform,

Starting point is 00:40:41 but the startup has 15 AWS accounts. The director of platform engineering at the Fortune 500 has 1500 AWS accounts, another 2300 Azure subscriptions and a bunch of GCP projects, and they get equal votes. So what we're actually seeing, yes, OpenTofu is making very big moves, but the whales are feeling very comfortable with Terraform. I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place for them to find you? Obviously, everything important that I had to say, I'm saying on our

Starting point is 00:41:18 website, firefly.ai. I'm not that big on social. I do have a Twitter handle. You must be so happy, my God. Yes, yes. This is the prime time of my day. And I might have some aliases in some Discord servers. But again, everything important that I have to say on cloud can be found on our blog and our website firefly.ai. Which we will of course put in the show notes.

Starting point is 00:41:45 Ido, thank you so much for speaking to me today. I appreciate it. Thank you for having me. Ido Neeman, CEO and co-founder of Firefly, who is of course providing this episode to us. I'm cloud economist, Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast,

Starting point is 00:42:02 please leave a five-star review on your podcast platform of choice. Whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry insulting comment that will be completely impossible to recreate because you did it by hand through the power of ClickOps. See you next time.
