Screaming in the Cloud - Build vs Buy: The Hidden Costs of “Just Building It” with Ahmed Bebars
Episode Date: April 2, 2026

Just because you can build it doesn't mean you should. In this episode, Ahmed Bebars, Principal Engineer at The New York Times, joins Corey Quinn to talk about real-world cloud decisions, Kubernetes complexity, and the constant trade-off between building your own solutions and buying existing ones. From home labs to enterprise architecture, they unpack what actually works, and what engineers often get wrong.

Show Highlights:
(00:19) Intro
(01:09) From Imposter Syndrome
(06:34) Honest Community Feedback
(09:29) EKS Versus ECS Debate
(21:32) Home Lab Reality Check
(22:40) Build vs Buy Long Game
(28:04) Focus on Core Business
(34:35) Uptime Tradeoffs and Standards
(39:41) Networking and IPv6 Debate
(41:28) Wrap Up and Where to Find

Links:
Ahmed's LinkedIn: https://www.linkedin.com/in/ahmedbebars
Sponsored by: duckbillhq.com
Transcript
The idea of build versus buy and all of that kind of stuff, it comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources in keeping the uptime, all of the operational stuff, all of those things.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
And I am joined today by a man of many talents.
Ahmed Bebars is a principal engineer at The New York Times.
He's an AWS Container Hero.
He's a Cloud Native Ambassador.
and a prolific public speaker,
Ahmed, welcome to the show.
Thank you, Corey, for having me.
I'm excited to see what we're going to dive into.
I know that you have a lot of questions,
so I'm looking forward to hearing some of them.
We'll start with the directly insulting one, I suppose.
You're an AWS hero.
You're a cloud-native ambassador.
What got you down the path of, "You know what I should do? That's right: volunteer work for giant entities that, frankly, could afford to pay people to do this," if you really think about it the right way?
I mostly kid. Lord knows I've spent enough time in the community myself, but how do you wind up there?
Yeah, to be honest, I didn't know that I'd end up there. A few years ago, when I started my journey, when I came to the United States, I was like, sure, I'll try to solve a couple of problems in a couple of organizations here and there. And then all of a sudden, in 2019, I had my first public speaking opportunity, and it struck me that I'd always thought I didn't know enough to share. That was really the tipping point for me. Everything I do, I assumed, yeah, everyone knows that, everyone knows that. Until that moment. Then I went and gave my first talk, and a lot of people didn't know what I was going to talk about, and they liked it and said, this is great content. So from there, I started to say, if someone doesn't know about something that I'm doing, why am I not sharing it, at least to have it out there?
And that ended up being, sure, I contribute to many open source communities. I can teach people how to get there. And then all of these things came along: sure, there's an ambassador program for the CNCF. Can I apply and see how I can explore the world from that space? It gave me a great opportunity. AWS Hero is kind of like they pick you, so that's a bit of a different story, but I've also been doing a lot of work with AWS, so that's where that came from. But what's really my interest here is to share more of what I have done, what I've heard about, what I have seen work better in my opinion, and see if that helps anyone in the ecosystem.
It feels like you fall prey to the same trap that many of us do.
Lord knows, I still have to talk myself out of this, where I have this internalized perception that if I know something, it must be commonly known.
Everyone basically knows this.
But if I don't know something, that's the hard stuff.
That's the interesting piece of it.
And it's never true.
Similarly, I've found that making a talk more broadly accessible to a larger number of people has never been the wrong decision, because everything is new to someone. We live in a big world and a big space.
You nailed it. It's that concept in your head, like when you draw a circle and then you always keep circling around inside it. And I'm like, everyone knows it. But talk to someone. How many people do I talk to, usually? It's not a lot. And then you talk to someone, and yeah, they know this feature. You talk to someone else, they know this feature. But when you look at the whole picture, a lot of people don't know. And sometimes, actually, even if you talk about the same topic over and over, some people may listen to this one and not the others. So you share the same content, sometimes in different ways, in different formats. What I also have seen resonate with people is that when I talk, I'm not selling anything. No one has to listen to me because I'm trying to sell them a solution. It also comes from me being an end user: I tried something, I'm sharing my thoughts. I'm not pushing you to buy my software. I'm telling you what worked for me. I tested something. I tried it. It works. You want to use it. You want to listen to it. You want to correct me. It's community work. That's the feedback I'm after.
But also, what I learned from that is that by contributing, people might tell me, oh, but have you looked into this? And that opened a whole can of worms: oh, you know, I didn't look into this. Let me look into it. And actually, at many of my talks, I have people say, sure, that was a great talk and all that kind of stuff, but have you looked into that? We tried this before and it didn't work. And that struck me as a great conversation to have: I didn't look into it, let me try. And then I start to look into it, and it becomes a bigger thing.
It's why I love conferences and the rest of the community, where I'll talk to someone. The most recent thing that still irritates me, that I went this long without knowing about it, is Atuin, A-T-U-I-N. It's an incredibly awesome shell history that syncs between machines. I've discovered that, installed it everywhere. I cannot go back to using the built-in nonsense, given how ephemeral most of my stuff tends to be.
It's these weird things where, oh, well, why not build this tool? Like, there are downsides to that, too. After I first built out my original, overly wrought newsletter publication system, someone said, well, why didn't you just use curated.co? It's, what? Why didn't I use what now? Because I didn't know it existed. That would have been handy several months ago. There are always ways to do it, and talking to people and getting the real skinny on what people think about how something works is incredibly valuable.
Yeah, this is usually how most of my learning has been over the years. And that got me to a space of, you know what, I experienced it? Let's talk about it. Let's see how it goes. Is it bad or good? It solved a problem in my own experience. And sometimes it's also interesting to show the failures, because you want to tell people what you tried that didn't work out. You don't want them to fall into that trap. So either way, I'm learning something. But usually I try, most of the time, as much as I can, to base my talks on an experience that I have had, a real story. I don't want to just bring a topic and talk about it in the abstract. Sure, I can talk about Kubernetes. I can talk about AWS. I can talk about anything. But I usually try to pick topics around a problem I tried to solve, or a situation I've been in. That gives me, I don't want to say credibility, but it shows I'm in it. I don't pad the talk much; I give them a real story about what exactly happened.
I mean, something I find is that documentation falls down terribly when it just tries to be a list: here are all the features, here's an API reference. For whatever reason, the thing I'm trying to do is never well documented in these things. So I like experience reports: I'm going to build a to-do list app, to use an overdone example. Great. I want to know how you used the tool to do it, what your steps were, how it wound up looking. You're driving to an outcome.
I also deeply appreciate the community stuff, especially the Heroes folks in the AWS world, because you are not beholden to AWS in the same way an AWS employee is. If an AWS employee talks about aspects of AWS being complete crap, they're likely not going to be an AWS employee for very long, whereas the rest of the community, we talk about this because it does have sharp edges. These things are painful. How do you split the difference there? Because on some level, it feels weird to go speak at a company's conference, use their platform, and then use that to drag them. I mean, I have a personal policy of not making people regret inviting me to things, so I'm not going to crap on them at their own conference. But I do sometimes feel like I have to strike a balance.
Yeah, the balance is always being honest and showing what the real value of something is. I always come onto many social media platforms and say, that didn't work for me. That wasn't the right approach. There are venues and spaces for the things I should say. I've been saying over the years, and a lot of people know this about me, that the AWS user experience has been clunky the whole time. They haven't mastered it.
That is such a flattering way to put it.
Yeah, in a way. You know, it's been ridiculous how many times I have seen it. I go into talks with service teams, and sometimes I say, you know what? Why do we have this three times on the same page? Why? They are reliable in some things, but they are not in others.
And that's where the balance always comes in. But also, I want to give them feedback, and I want it to be critical, but I want it to be, I don't want to say in a nice way, but I want it to be honest feedback. I don't want to embrace something that other vendors have done for years and say, this is great. I would say, it's great you have this now, but...
What took so long?
Yeah. It's been a long time to get to something like that. But there are some innovations I have seen in this space that deserve it. And like I said, I'm not beholden to anything, because I don't sell anything, so I'm not obligated in any of that.
Like, I don't like AWS.
You're going to, what?
Like, what's going to happen?
Like, I'm not...
You're going to use AWS in the next three years?
Yeah, I don't work for the company. It's my honest opinion. I think that's what I should be doing, because I'm not doing this for AWS specifically. I know the intricacies of AWS in some cases, but I'm doing this for the people. Like, if someone asked my opinion today, we have these debates all the time. Funny story, and you probably know this because you have seen a lot: when I talk to other people in the container AWS Hero space, I have a favorite. I favor EKS. Some others favor ECS. You have to see the debate when we trash some of the services to each other. Like, oh, no, ECS is better than EKS. I'm saying, no, EKS is better. And all of that kind of stuff. At the end of the day, it's a fun situation, comparing things, but at least we have honest opinions about where exactly each use case fits. We laugh about it. We kid about it a lot of the time. But if you ask me one day, which service should I use? It comes down to this: if you're a small company, Kubernetes is probably not the right fit for you. If you are using containers, there have to be characteristics and criteria behind your decision. So there's the fun talk and all of that, but there are also the technical decisions you need to make, and those are situational. It's not, hey, go all the way, containers, EKS, because that doesn't work. I have never, never seen one solution that fits all.
At Duckbill, we are building our product on top of ECS, because it makes sense for our scale and current constraints, and we have a path forward. And boom, surprise sponsorship. That's right: this show is sponsored by duckbillhq.com, my employer. We have a platform now, rather than just handling the consulting side of the world as we have historically, with contract negotiations for large entities debating with AWS what the future might hold for both parties. Now we have software that we are systematizing part of this in. If that sounds relevant to what you're doing, please check us out at duckbillhq.com. And also, we are hiring.
And also, also, Ahmed, one of the best parts about this timing is that yesterday, for the first time since it first came out, I spun up an EKS cluster, because I'm building a bunch of weird projects that I want to throw at a wall. All of my customers use EKS in some way, shape, or form. It's time for me to use it, and it's gotten, from what I can tell, slightly better. It only took 10 minutes to spin up the EKS cluster instead of the 25 it took when I did it several years ago. So it's improving bit by bit. What's your take on it?
Let's not talk about the startup time for the cluster. That has been a dilemma for a while: why does this take forever? I have seen the architecture for the control plane behind the scenes, and I still don't get why it takes so long, because others have done it and it seems to be working for them. So I'm not sure where this is coming from.
I'm certain there are reasons and good reasons for it.
And honestly, how often do you spin up or down your production cluster?
Oh, I don't, but that's kind of the point.
In development, when I'm testing my infrastructure stuff, I want to smoke test it in a test account.
And that adds a tremendous burden to how long it takes to run through those tests.
Please fix it.
That's why I care.
Yeah, exactly. I went through this use case, and I agree with you. How often do you bring up a Kubernetes cluster? Sure, not too much, but when I need it for testing, when I need to mimic something, when I'm doing a demo, I have to wait 10 minutes for a cluster to come up.
other, like there is new as an ecosystem pattern of like, so let me tell you why I like
Kubernetes in general. Like the generality of it is just because it's a common pattern across
multiple cloud provider. Like I can get that flavor on AWS. I can get this flavor and as a
providers. Does a lot of the things behind the scene change? Sure, instances, all of the kind of stuff,
like how they are with each other, all of the can stuff. But at the end of the day, it's a deployment,
it's a bud, it's a container, it's all shared. I can get a similar flavor into it into my machine
to test, which is relevant to what you're saying. So in my CI, I can spin up whatever like Kubernetes
thing on a Docker, whatever ecosystem to test something with. Problem with, is like when you start
having like disperse and like have solutions that like sure I have something that works for a cloud
but something else will work for local it's like becoming like a tangling effect and sometimes you
cannot test the same stuff so that's where like we have to come up with mockups mock up APIs and
see like oh now I have to call the EKS API now I have to get my body identity all of the can stuff
this is the sad part of it the good part of it there are like more capabilities that's coming
into services like this.
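One common shape of that local-cluster-in-CI pattern, sketched with kind (Kubernetes in Docker); the cluster name and the trivial smoke test are illustrative assumptions, not anything prescribed on the show:

```python
# Sketch: spin up a throwaway kind cluster for a CI smoke test,
# run a check, and tear it down. Assumes the kind and kubectl
# binaries are on PATH; all names here are illustrative.
import subprocess

CLUSTER = "ci-smoke"

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

try:
    # --wait blocks until the control plane reports Ready.
    run("kind", "create", "cluster", "--name", CLUSTER, "--wait", "120s")
    # A trivial smoke test: can we schedule and reach a pod?
    run("kubectl", "--context", f"kind-{CLUSTER}", "run", "probe",
        "--image=nginx", "--restart=Never")
    run("kubectl", "--context", f"kind-{CLUSTER}", "wait",
        "--for=condition=Ready", "pod/probe", "--timeout=120s")
finally:
    run("kind", "delete", "cluster", "--name", CLUSTER)
```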
One of the things that I really embraced when I saw it is the concept of managed add-ons. Or not add-ons, they call them managed services now, whatever, but it's the concept of having something like Argo managed for you. I've seen how Argo runs, and it's complex. Other controllers might not be, but that's a good option. If there are other community or open-source projects that could run the same way, where they take the complexity of running the control plane out, I love that. That's a great idea. It removes a burden, even if you know how to run things like that. So from that perspective, it seems like it's growing.
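For a sense of the mechanics, managed add-ons are driven through the same EKS API as the cluster itself. A rough sketch with boto3, using the long-standing CoreDNS add-on as a safe example; which newer controllers are offered this way varies by region and over time, so treat the names as illustrative:

```python
# Sketch: inspect and install an EKS managed add-on with boto3.
# "coredns" is a conservative example; cluster name and Kubernetes
# version are assumptions, not values from the episode.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# What add-ons does this cluster already run?
print(eks.list_addons(clusterName="smoke-test")["addons"])

# Which versions of an add-on are compatible with my cluster version?
versions = eks.describe_addon_versions(
    addonName="coredns", kubernetesVersion="1.31"
)
print([v["addonVersion"] for v in versions["addons"][0]["addonVersions"]])

# Install it; AWS then owns the lifecycle of the controller itself.
eks.create_addon(clusterName="smoke-test", addonName="coredns")
```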
Does it do a better job than other providers? Maybe, maybe not. I'd have to put them in a benchmark to see what they can do. There are use cases that I hear about. Are they interesting to me? I don't know. Do I need to run 100,000 nodes on a single cluster? Never. I've never had that use case in my entire life.
Well, yes, if you're trying to get the AWS bill high score, how else are you planning on doing it?
Sure, but I don't have that money to spend in my account.
Oh, good lord, you never do it with your own money. That's what employers are for, or clients in the consulting world. I digress. I've been running a test Kubernetes cluster at home for two years. I had to build a conference talk out of it, because I mouthed off on the internet seven years ago and said no one's going to care about Kubernetes, so I ended up having to give a talk called Terrible Ideas in Kubernetes. But I found it useful, where now I can just write random nonsense, or find it somewhere on GitHub, in a container, throw it onto the cluster, access it over my Tailscale network, and just have a bunch of heterogeneous things running. Unfortunately, I've become a victim of my own success, in that some of my team have seen some of the tools I've built that are useful for what they're doing, nothing coupled to client data, because my God. Oh, I built a great image manipulator for marketing purposes. It has some advantages. And they said, great, can I get a copy of that? Like, all right, time to build an internal cluster for this. But we're going to do it right, and by right, I mean enterprise-y. We're doing GitOps the whole way with Argo CD. We're using OpenTofu because Terraform gets really weird at scale. It is a wildly overbuilt solution for a single container at the moment, but something I've found about these clusters is they never tend to stay single-tenant for long. You start adding things to them, and in the fullness of time, it becomes really straightforward to start launching a bunch of internal corporate tools, which is handy. But the teething exercise of getting up and running with it: I'm glad this is not critical path for anything right now, because I don't know it well enough to support it.
It is not. I recall the days when I thought, oh, do I have to manage an enterprise cluster for many use cases? Do I have to run all of the kubeadm commands and join instances together and do all of that in the cloud? I was like, yeah, I don't want to.
Because it depends on what laptop you ran it from.
And then you get told, oh, no, no, you're only supposed to run that from the CI/CD system. That would have been terrific to put on the warning label.
Yeah. So, all of that. It solves a problem. You know, the whole cloud solves the problem of not having to care about hardware. But do I run my own stuff? Sure.
My entire home automation system runs on K3s.
That's where, like...
I'm running K3s myself.
Home Assistant?
Yeah, the whole system.
I do not have that running on the cluster, because that has gotten sizable and logic-heavy enough that I got an HP mini PC that I put the whole thing on. And again, with my wife, we're definitely proving the old trope that when you have a couple and someone's really into IoT, the dynamic is that one of you loves the fact that you're living in the future and the other one thinks the house is haunted. It's great.
Yeah, that's exactly where I'm at right now. I can tell you, I've had a few days where my wife would call me and say, hey, the house lights are not turning on. I was like, I don't know what's happening. And she said, all of a sudden, it's not working. I was like, yeah, you probably have to restart the cluster somehow. Go unplug it, plug it in again, and it will work. And sure enough, that works. But now I'm haunted by my own clusters, and I have to keep them up. Sometimes I have to upgrade them and do all of the work around them.
But, to be honest, it works. I ran into this; it's one of the things that you said. Setting it up the first time was a complex story, where I had to get everything set up on my end. I have seen how complex it is: I had to bake images and do all of those things just to get a small cluster running in my home. So imagine running this at enterprise scale: I have to bake my images, do all of the work to get this. Now it's easier. Now it's just a couple of clicks and you get a cluster up. That was a cool thing to have.
I just discovered a few weeks ago, from the person who wrote Atuin, Ellie, as it turns out, that K3s has a built-in registry that is distributed across the nodes, which is awesome. I can stop pulling the same image again and again, which is freaking wonderful.
I didn't know about that.
It's a single command-line argument. Spegel, S-P-E-G-E-L. It is built into K3s. You pass the server a command-line parameter, and you're done.
Okay.
I actually will look this up.
Yeah.
That's why I talk to you.
I learned something.
I'm going to go implement it.
Probably my lights will not work tonight, but that's okay. It's, you know, for the greater good.
That's another trick: I switched all of the light switches I was using over to Lutron, which is a little on the expensive side, but it's also what a lot of the smart home contractors build out. And what I love about them is, if you don't hook it up to anything, it acts like a normal light switch. And when the system fails, the way it works is like a normal light switch: you push the button, the lights turn on, and suddenly I get yelled at less.
I actually like this idea more. I ended up on that trend, not for all of my lights, but because I used the Hue lights before, and the switches were very... interesting. But then Lutron: this office is running on a Lutron switch, and it actually allows me to do three-way switches and different things, all of that kind of stuff, to mimic a normal environment. But also, when Wi-Fi is working and everything is stable, when the cloud is running, it runs beautifully from a remote perspective.
But, you know, it's a balance between what I need to do day to day and how I test things, and it depends on what I'm actually aiming for. To be honest, my cluster is running up there and I barely touch it most of the time. I don't need to touch it, because it's working. It auto-upgrades, all of that kind of stuff; it runs effectively, doing what I need it to do. But when I need to throw a container on it, that's an easy thing. Just log into it, throw a container on, get out, and it all works. So, yeah.
All my config lives in a Git repo that I just run kubectl against for the home stuff. I haven't GitOps'd it yet. But it means that when I tear down the cluster and rebuild it, as I have to every year and a half or so because it gets wonky, it's pretty easy to get the stuff I care about back up and running.
Yeah, I have a backup, so I didn't GitOps it either. Not like what I would normally do for my cloud stuff.
Yeah, for the home stuff, it's, like, I'll run my own RSS aggregator.
Terrific.
Awesome.
If that breaks, it's annoying.
I have to get it back up and running.
But none of my business stuff goes down.
Nothing breaks.
This is a different RSS system than the one that feeds the newsletter.
That stuff all lives in AWS, like a grown-up might put something there.
It's also strange sometimes to look at the monitoring for this and realize that my 11-node cluster that is all plugged into the same power strip has better uptime for the month than GitHub Actions, and, all right, that's unfortunate, but okay.
There's the other side of it too, that when it goes down, no one's coming to save me.
I've got to get it up and running myself and not just wait for a vendor to fix it for me.
It's a mixed bag.
I don't know that there's necessarily one right way for this.
It's just the reality of it.
We've forgotten on some level how to run hardware ourselves.
This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes in to help. Remember, you can't duck the Duckbill bill,
which I am reliably informed by my business partner,
is absolutely not our motto.
To learn more, visit duckbillhq.com.
To be honest, it's a debate. It's a debate that I've been in for years. So let's talk about it from a software perspective in general, not just cloud. With a lot of solutions out there, you see, oh, this solution provides me A, B, and C. Oh, but I can build A, B, C, and D. Sure, you can build it, but the problem is not in the building anymore. The problem is a few years out: how you maintain it, how you keep it up and running, how you do all of that kind of work.
It used to be day-two problems. Great. Now it's like day 50.
Yeah. This is where I start to think about it from a business perspective too, just a business mindset. What happens when the person who maintains a system leaves, or a team does, or the technology just gets old, or you have to upgrade it, or you have to run instances?
Or you leave it running in AWS for more than a year, in which case now it's extended support, which costs six times more for the cluster. Why? Because screw you. Another year goes past, then they will blindly upgrade you at a time of their choosing, not yours. So you're just kicking the can down the road, gaining nothing by it and paying through the nose for it. It's, okay, that doesn't seem the most customer obsessed.
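For concreteness on that six-x figure: at the published EKS rates at the time of writing, roughly $0.10 per cluster-hour for standard support and $0.60 for extended (worth verifying against current pricing), the per-control-plane arithmetic looks like this:

```python
# Back-of-the-envelope: what an EKS control plane costs once it ages
# into extended support. Rates are assumptions based on published
# pricing; check the current price list before relying on them.
HOURS_PER_YEAR = 24 * 365
STANDARD = 0.10   # $/cluster-hour, standard support
EXTENDED = 0.60   # $/cluster-hour, extended support

print(f"standard: ${STANDARD * HOURS_PER_YEAR:,.0f}/yr per cluster")  # $876
print(f"extended: ${EXTENDED * HOURS_PER_YEAR:,.0f}/yr per cluster")  # $5,256
print(f"multiplier: {EXTENDED / STANDARD:.0f}x")                      # 6x
```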
This was always interesting. The extended support part was always interesting, because they are trying to balance how to keep it sustainable for the team, or whatever the team is managing, just from my perspective. But six X is a lot. It's a big number when you try to do something like that, and you're always left asking why. Then again, if you look at a couple of clusters, you don't pay much. The control plane, for example, you don't pay a lot of money for. Still money, but not a lot. I would have a heart attack in some way if this applied to my nodes or something like that, which would be more complicated, and we would have to have a conversation about it. But again, the idea of build versus buy and all of that kind of stuff, it comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources in keeping the uptime, all of the operational stuff, all of those things. When I run something in my home, I understand the risk that this might not work. I have built a plan. It's not critical to my life. My lights will still turn on and off, but my Alexa won't be able to say, hey, turn on the light from here. That's the whole impact, here or there. But when I run a system and I have to maintain it, there's a lot of operational overhead I have to spend maintaining the system, maintaining the infrastructure, maintaining everything behind it.
So when someone asks, should I build versus buy, I always tend to tell people: what do you have? Are you building a gigantic system where you want to do everything yourself? I would rather lean on open source. What I have seen in my career is that the majority of tech problems have been solved in some way. So you're going to find a solution that solves 50, 60 percent of your way of doing it. Don't rebuild it. If Kubernetes works for you 80 percent of the way, don't try to rebuild it.
That's the rise-of-AI problem right there in a nutshell: well, I could just build my own custom solution out of spare parts, and that'll work. And it will, mostly, for the exact use case you defined and tested. As soon as the requirements change, now you have a problem to work with. And for weird back-of-house, single-purpose apps, I do that all the time. But for stuff that matters, of course I'm paying vendors. I pay for Notion at work with a smile on my face, for a bunch of reasons that should be obvious. I built my own newsletter publication system, and finally rebuilt it the way I wanted to at the start of this year, with the lessons learned. This is the third generation of that system. It's much better than the previous generations, but I'm sure I'm going to tear it down and replace it in a few years with something else. And that's okay. It's understanding where the right approach is. Someone had a tweet a while back that it's interesting that Anthropic, as a company, uses ADP for payroll instead of vibe-coding their own. And the answer is: because they're not insane. They understand that you're not just paying for a piece of software. You're paying for understanding the nuances of payroll law in a bunch of different jurisdictions in which you operate, keeping up to date with legal changes, and not having the Department of Labor kick your door off its hinges three days after you miss a payroll run. It's the right move. It's not just the software; it's understanding the business context of what you're trying to do.
100%.
It's not "because you can do it, you should do it." There's a big, big gap between the two. If you want to try something out, if I want to build something really quickly, have a demo, all of that kind of stuff, sure, go build it, try it, do whatever you want. But when I think about long-term sustainability, not everything is worth rebuilding. Just because I can doesn't mean I should. And this is where I stand: if you solved the problem, and I look at your solution and it fits, why not? Why not use it and add to it, or bake it into my way of thinking, rather than say, oh, it doesn't do all of the 10 things that I need it to do. But it does eight, or seven, or five. It's built, and there are 10 other people looking at it. Because think about it: if I rely on software, pick any project in the open source ecosystem, it's never just a single person using it. Many people use it, so someone has an interest in keeping it going. But if you build your own software, it's your responsibility alone. That's the thing you have to maintain.
And saying yes to something means saying no to something else.
Take your day job. You work at the New York Times. The New York Times does a bunch of different things. Officially, I suppose, you're a news outlet. Personally, I think your job is to employ history's greatest monster, whoever it is that organizes and runs the Connections puzzle every day, which vexes me like you would not freaking believe, because I don't think in the right frame of reference sometimes. But through none of those perspectives is it, oh, what does the New York Times do? That's right: you're a database company. You should build your own database. No, that is not where the value is. You have a website. You should build and run your own web servers. That's something a fool would say. If I'm dealing with a bank: handling the money, ensuring compliance, making sure the money is there when you say it is, that's the key job. An airline's job is to get people and planes and cargo from place to place. It is not to push the boundaries of computer science. And companies tend to lose sight of this, especially when engineers, in some cases, get carried away with resume-driven development.
Yeah, that's where scope and focus and specialty are things anyone should look into. I would rather spend my time in my area of expertise, what I'm good at, how I work. If I'm an engineer and I want to do a design, sure, I can do something quickly, but I don't necessarily have all of that understanding of how it works. Again, just because I can doesn't mean I should. The idea is that you should get to a point where you have SMEs. That's why they call someone an SME: they have studied the thing. If I'm asking for a serverless opinion in any way, I'm going to go ask a serverless person who dealt with this in a real production system, who knows when it breaks, who knows what the bad things about it are. A lot of people, when we talk, say serverless is great, you can spin up. Sure, you can spin up. Have you ever run a serverless architecture that has a thousand functions? Let's talk about how you govern all of them when they work together. That's a different story. The story I saw in a demo: sure, a Lambda function, any function in any cloud system, runs in a matter of seconds. Ship a container and it pops up. Great. Let's talk about how to govern this in a bigger system.
That's a different story.
So that's why it comes back to my point. When I give a talk, I talk about my experience. I talk about the things that I explored, because I have knowledge in that area. I have an understanding, rather than just losing focus on what I'm trying to do.
Right.
I'm a former SRE, and I have a radically different perspective on environments depending on where they are. I was always considered one of the most stodgy, conservative, curmudgeonly types when it came to things like databases and file systems, because mistakes there are going to show. But in my test environment, I have good backups of all the stuff I care about. Yeah, I'll go nuts. We'll do the bleeding-edge alpha thing. Oh, I guess that's why it's not GA yet. Whoops, roll back. And I am fine with throwing things over the wall. I have a dedicated AWS account, with no access to data, that I have an EC2 box in, upon which runs Claude Code in full-permissions mode. It has an EC2 role that gives it root, administrative access to the entire AWS environment. It is called Superfund, because it is both toxic and expensive. And the only blast radius, worst case here, is that it spikes my AWS bill, which I can handle if that's what it comes down to. Honestly, if I call in begging for forgiveness to the AWS billing department, it will become a company-wide holiday. Ladies and gentlemen, we got him. It'll be great.
I think the separation of concerns works most of the time, in a lot of cases, where you need to understand where your competency is.
Right.
Why not just do that in your production environment with, like, the customer database?
It's because I'm not insane.
Thank you for asking.
Exactly.
Like I said, it's also a situation where you should always think about segregation. That's why some people will say, yeah, let's ship it to production. Have you tested it before? Oh, I tested it. But I can tell a different story. In some of our environments, we ship to multiple places, multiple environments. People ask, have you tested that before? Yes, I tested it. Have you tested at the same scale? No, they only tested it in one specific environment. And I was like, why? You should always test with the same parameters. I worked at a company before where we were doing some deployments, and one of the deployments was, oh, the code is ready, shipped, everything is cool. It was a small company. And then, all of a sudden, they shipped it to production, and it doesn't work. It doesn't work because you don't have the same parameters you are running it with. You tested one thing with curl, shipping a single API call, and then in production you're sending 50,000 requests. Have you done this?
Oh, the testing is real. Yeah. It just didn't meet the standard expected.
All of these things are actually things that you have to think about.
Or even canary deployments, because at some point of scale, you cannot test at the same scale. Facebook gave a lot of talks about this back when they had reasonable approaches to things, because they didn't have a spare billion users to run in the dev environment. So they started off by having the developer run it themselves, then a small gated list of internal test users. There were something like nine concentric circles from the individual developer out to the entire Facebook user base. It was scaled and measured, and they monitored the heck out of it: oh, we're starting to see errors increase, let's dial that back.
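The concentric-circle idea generalizes to a simple gated-rollout loop. A generic sketch of the pattern, not Facebook's actual tooling; the stage sizes, soak time, and error budget are made-up numbers, and the two hooks stand in for whatever feature-flag and metrics systems you actually run:

```python
# Sketch: staged ("concentric circle") rollout gated on error rate.
import time

STAGES = [0.01, 0.1, 1, 5, 25, 100]  # percent of users, inner to outer
ERROR_BUDGET = 0.001                 # max tolerated error rate

def rollout(set_percent, error_rate, soak_seconds=1800):
    """Walk outward through the circles; bail and dial back on errors."""
    for pct in STAGES:
        set_percent(pct)
        time.sleep(soak_seconds)     # let real traffic hit the new code
        if error_rate() > ERROR_BUDGET:
            set_percent(0)           # "let's dial that back"
            raise RuntimeError(f"error budget blown at {pct}% stage")
    print("rolled out to 100% of users")

# Example wiring with dummy hooks:
if __name__ == "__main__":
    rollout(lambda p: print(f"serving {p}% of users"),
            lambda: 0.0004, soak_seconds=0)
```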
Which works super well for Facebook but would work terribly for Stripe, because every 500 error they get means someone didn't get paid. And that's a worse outcome than, oh, the cat picture didn't load fast enough.
Yeah, that's basically what they care about. Some companies care about every single transaction. And one of the things you mentioned here, for Stripe, for example: we don't even know which transaction will fail. That might be a very expensive transaction to fail, and it will cause the business to lose a lot of money. But if someone hit like on my Facebook comment and the send didn't work, I'm not going to get too offended.
Right. And there's also the reputational damage, too. I try to buy your book for $30. I hit send and the transaction fails. Both you and I are going to be upset by that, depending on how technical I am, and definitely for you, and that's going to flavor our impression of Stripe. It's, okay, that's not great. It has to work. It must. Whereas with other use cases, the restrictions are different. The product that we're building is for business, back-of-house users. Yes, we would like the site to be up when people are attempting to use it. But in the event that the site is down for an update or something for 20 minutes and has the maintenance page up, it is not disastrous. It is not critical path for serving their customers that day. And I can see a future in which that potentially changes, at which point our approach to uptime and responsibility and maintenance windows will have to change too; we're going to be very cognizant of its needs. But not everything needs to be hyper-scale, and not everything needs five nines of uptime. Understand the use case and the problem you're trying to solve for, rather than just doing engineering fantasy. Build the thing that solves the problem.
The thing about uptime sometimes strikes me, because of use cases I have seen in my past. You talk to someone, in a consulting opportunity or anywhere, and ask, what's your system uptime? And they say, oh, it's five nines. Great, what are you using? And they say, this API, that API, that API. And it turns out one of them is, oh, not five nines. And it's critical for you. Yeah, but my system is five nines. How does that even work? The backend systems that you're using are not five nines, but then you claim that yours is. It's a very complex world.
If you sincerely care about five nines of uptime on a service, you need to be in multiple regions to do it, arguably multiple providers, though I could be convinced otherwise, and you cannot take third-party dependencies. Because, look, I can test in my account what happens if none of my stuff can reach S3, hypothetically. But I cannot test what happens if my third-party vendor dependencies can't reach S3, or their third-party dependencies can't reach S3. The only way you test that is by S3 going down, which fortunately is not a common occurrence. But if you're serious about "this must stay up at all times," you have to own so much of that availability piece yourself.
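The arithmetic behind that point is unforgiving: the availabilities of serial dependencies multiply. A quick sketch, with illustrative numbers rather than anyone's actual SLA:

```python
# Composite availability of a serial dependency chain is the product
# of the parts. The figures below are illustrative, not real SLAs.
MINUTES_PER_YEAR = 365 * 24 * 60

deps = {"region": 0.9999, "vendor-a": 0.999, "vendor-b": 0.9995,
        "own-service": 0.9999}

composite = 1.0
for availability in deps.values():
    composite *= availability

downtime = (1 - composite) * MINUTES_PER_YEAR
print(f"composite availability: {composite:.5f}")      # ~0.99830
print(f"expected downtime: {downtime:.0f} min/year")   # ~890 minutes
# Five nines (0.99999) allows only ~5.3 minutes of downtime per year.
```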
Exactly. And that's where we have to think about every piece that you put in. At the end of the day, to me, technology is like any other thing. It's an architecture. It's a balance. It's a trade-off somewhere. You get something, you lose something. It's not, hey, you get the optimum everywhere. We all strive for the perfect system overall. But sure, you want to build it? You want to own it? Get your data center. Get your stuff. Make sure that you have redundancy, all of that kind of stuff. At a cost. Sure. Or you go the other way. At a cost. It's one way or the other, and then you build for what you need, and that's exactly what you have at the end of the day.
But that's what I'm looking for: what is the right balance? I prefer sometimes to say, here are some Lego blocks, and then you build whatever fits you. You want it tall, you want it short, you want it wide, you want it large. What you need is what you build, based on your requirements. Here's the thing: your requirements usually change over time, because this is what we have seen. Today you build, as you said, for a single app; tomorrow, 10 apps; the day after tomorrow, 100 apps. Do you have that scale? Do you want something you throw away, where you keep re-architecting every couple of days? You want something pluggable. That's why it's usually about finding the right patterns, what other people have found. A lot of people have spent their time on Kubernetes figuring out how to make that work.
It's a spectrum, like anything. There are trade-offs: decisions you should make early on so they will not hamstring you in the future. Almost every hyperscaler has had this problem before, where "we're just going to build a small thing for back-of-house stuff."
Great.
We're going to use the local time zone for the database entries.
No, no, no, no, no.
Talk to anyone who was at Google for about a decade and a half, use the phrase "Google Standard Time," and watch them flinch.
Because that is very painful to fix after the fact.
And it makes everything so much harder.
So everything I build these days, even my dev box, sits there running in UTC. If I want to know what time it is locally, great: that's something my user account can change at the presentation layer. Awesome. But the system itself must be UTC.
That's where standards come in. UTC is a common frame that everyone agrees on, and you convert based on your needs. Because I'm in New Jersey and someone else is in California, but we all know what the offset is. If I start to ingest into my database data coming from all the local zones, though, now I have a problem, and I have seen it in apps. I go into an app and look at it: the last user visited this app... tomorrow? I was like, what? What is today? Today is the third, but someone visited the app tomorrow? How does this even happen? Because they inserted the timestamp with their local machine's time zone into the system, and then you have a wrong representation, because now it's this edge case.
The XKCD RSS feed always goes into the past, for whatever reason, by about, I think, eight and a half hours from UTC. Something is not right, so it always pops up where I have to scroll back to find it.
Yesterday, when I was building out an EKS cluster with OpenTofu, I had Claude Code do most of the Terraform-slash-Tofu code. And I had to correct it. First, it put everything into 10.0. Great, that's going to conflict with something somewhere, because everyone uses that; in this case, the staging environment. Great, put it somewhere else. Then it built a bunch of /24s for the subnets, right next to each other. No. Because when you run more than 255 containers, which can happen...
Sorry, 253.
That's right. You've got the broadcast and network addresses as well.
Yeah.
Oh, and the DNS one, which you can't get rid of inside of the subnet either, so that drops it two more.
Great.
Point being, above a certain threshold you have to renumber, and that is painful. Build in room to expand without having to move things around, and you'll be much happier for it.
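That room to expand is cheap to check up front. A sketch with Python's ipaddress module; the CIDR choices are illustrative, and the five reserved addresses per subnet reflect AWS's documented behavior (network, VPC router, DNS, future use, broadcast):

```python
# Sketch: carve a VPC CIDR into subnets with room to grow, instead
# of packing /24s back to back. All CIDR choices are illustrative.
from ipaddress import ip_network

vpc = ip_network("10.42.0.0/16")   # deliberately not 10.0.0.0/16

# /20 per subnet: 4096 addresses, minus the 5 AWS reserves = 4091 usable.
subnets = list(vpc.subnets(new_prefix=20))[:4]
for s in subnets:
    print(s, "usable in AWS:", s.num_addresses - 5)

# Sanity check: nothing overlaps, and most of the /16 is still free
# for the day the cluster stops being single-tenant.
assert not any(a.overlaps(b) for a in subnets for b in subnets if a != b)
```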
Yeah.
That is a problem that I had to solve on many occasions, and there's always a solution for it. It's like, use a secondary CIDR there. It's just like...
Just use IPv6.
It's like, the grown-ups are speaking, please.
Sure. Yeah, this strikes me as one of the most bogus standards that I have seen so far. Not for anything, but because we can't yet agree on it. I think we are still living in the IPv4 world: we want more, but we cannot get more, yet not everything supports IPv6, so we are stuck on IPv4. What's your take on it?
They've been saying this since I was a child, and they'll be selling it to my grandkids as well.
It's not a problem that is top of mind for anyone except the salt of the earth folk who keep the internet moving.
So we're going to continue to ignore it until we can't anymore.
And then don't worry, the AI will fix it.
Yeah, AI will fix a lot of things until like robots will not be able to get IBs and we're going to be all stuck in that world.
Exactly.
So I want to thank you for taking the time to speak with me. If people want to learn more about what you're up to and how you view the world, and catch your next conference talk, where's the best place for them to find you?
The best place is LinkedIn. That's where I usually stay most up to date. If anyone wants to hit me up over email, they will find my links and all of my contact info there. If you need anything, or just want to check where I'm going next, I'm going to KubeCon in Amsterdam next month, where I'm doing all of the things. So yeah, LinkedIn is the way.
I want to say that's maybe the first time in history that the phrase "LinkedIn is the best place" has ever been uttered, because it is certainly a place. I'm there a lot more myself, and we will, of course, put links to that in the show notes. Ahmed, thank you so much for being so generous with your time. I deeply appreciate it.
Thank you, Corey. I really appreciate you having me here, and I'm looking forward to seeing you at many in-person events.
Oh, I'll be there.
Ahmed Bebars, principal engineer at The New York Times, and AWS Community Hero and Cloud Native Ambassador. We're just stacking up the accomplishments these days. I am Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
please leave a five-star review on your podcast platform of choice,
along with an angry, insulting comment that I won't ever see
because that podcast platform of choice
runs on somebody's home K3S cluster.
