Screaming in the Cloud - Build vs Buy: The Hidden Costs of “Just Building It” with Ahmed Bebars
Episode Date: April 2, 2026

Just because you can build it doesn't mean you should. In this episode, Ahmed Bebars, Principal Engineer at The New York Times, joins Corey Quinn to talk about real-world cloud decisions, Kubernetes complexity, and the constant trade-off between building your own solutions and buying existing ones. From home labs to enterprise architecture, they unpack what actually works, and what engineers often get wrong.

Show Highlights:
(00:19) Intro
(01:09) From Imposter Syndrome
(06:34) Honest Community Feedback
(09:29) EKS Versus ECS Debate
(21:32) Home Lab Reality Check
(22:40) Build vs Buy Long Game
(28:04) Focus on Core Business
(34:35) Uptime Tradeoffs and Standards
(39:41) Networking and IPv6 Debate
(41:28) Wrap Up and Where to Find

Links:
Ahmed's LinkedIn: https://www.linkedin.com/in/ahmedbebars
Sponsored by: duckbillhq.com
Transcript
The idea of build versus buy and all of that kind of stuff, it comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources in keeping the uptime, all of the operational stuff, all of those things.
Welcome to Screaming in the Cloud.
I'm Corey Quinn.
And I am joined today by a man of many talents.
Ahmed Bebars is a principal engineer at The New York Times.
He's an AWS Container Hero.
He's a Cloud Native Ambassador.
and a prolific public speaker,
Ahmed, welcome to the show.
Thank you, Corey, for having me.
I'm excited to see what we're going to dive into.
I know that you have a lot of questions,
so I'm looking forward to hearing some of them.
We'll start with the directly insulting one, I suppose.
You're an AWS hero.
You're a cloud-native ambassador.
What got you down the path of, "You know what I should do? That's right: volunteer work for giant entities that, frankly, could afford to pay people to do this," if you really think about it the right way?
I mostly kid. Lord knows I've spent enough time in the community myself, but how do you wind up there?
Yeah, to be honest, I didn't know that I'd end up there. A few years ago, when I started my journey, when I came to the United States, I was like, sure, I'll try to solve a couple of problems in a couple of organizations here and there. And then all of a sudden, in 2019, I had my first public speaking opportunity, and it struck me that I'd always thought I didn't know enough to share. That was really the tipping point for me. Everything I do, I assumed, yeah, everyone knows that, everyone knows that. Until that moment. Then I went and gave my first talk, and a lot of people didn't know what I was going to talk about, and they liked it and said, this is great content. So from there, I started to say, if someone doesn't know about something that I'm doing, why am I not sharing it, at least to have it out there?
And that ended up being, sure, I contribute to many open source communities. I can teach people how to get there. And then all of these things came along: sure, there's an ambassador program for the CNCF. Can I apply and see how I can explore the world from that space? It gave me a great opportunity. AWS Hero is kind of like they pick you, so that's a bit of a different story, but I've also been doing a lot of work with AWS, so that's where that came from. But what's really my interest here is to share more of what I have done, what I've heard about, what I have seen work better in my opinion, and see if that helps anyone in the ecosystem.
It feels like you fall prey to the same trap that many of us do.
Lord knows, I still have to talk myself out of this, where I have this internalized perception that if I know something, it must be commonly known.
Everyone basically knows this.
But if I don't know something, that's the hard stuff.
That's the interesting piece of it.
And it's never true.
Similarly, I've found that making a talk more broadly accessible to a larger number of people has never been the wrong decision, because everything is new to someone. We live in a big world and a big space.
You nailed it. It's that concept in your head, like when you draw a circle and then you always keep circling around inside it. And I'm like, everyone knows it. But talk to someone. How many people do I talk to, usually? It's not a lot. And then you talk to someone, and yeah, they know this feature. You talk to someone else, they know this feature. But when you look at the whole picture, a lot of people don't know. And sometimes, actually, even if you talk about the same topic over and over, some people may listen to this one and not the others. So you share the same content, sometimes in different ways, in different formats. What I also have seen resonate with people is that when I talk, I'm not selling anything. No one has to listen to me because I'm trying to sell them a solution. It also comes from me being an end user: I tried something, I'm sharing my thoughts. I'm not pushing you to buy my software. I'm telling you what worked for me. I tested something. I tried it. It works. You want to use it. You want to listen to it. You want to correct me. It's community work. That's the feedback I'm after.
But also, what I learned from that is that by contributing, people might tell me, oh, but have you looked into this? And that opened a whole can of worms: oh, you know, I didn't look into this. Let me look into it. And actually, at many of my talks, I have people say, sure, that was a great talk and all that kind of stuff, but have you looked into that? We tried this before and it didn't work. And that struck me as a great conversation to have: I didn't look into it, let me try. And then I start to look into it, and it becomes a bigger thing.
It's why I love conferences and the rest of the community, where I'll talk to someone. The most recent thing that still irritates me, that I went this long without knowing about it, is Atuin, A-T-U-I-N. It's an incredibly awesome shell history that syncs between machines. I've discovered that, installed it everywhere. I cannot go back to using the built-in nonsense, given how ephemeral most of my stuff tends to be.
It's these weird things where, oh, well, why not build this tool? Like, there are downsides to that, too. After I first built out my original, overly wrought newsletter publication system, someone said, well, why didn't you just use curated.co? It's, what? Why didn't I use what now? Because I didn't know it existed. That would have been handy several months ago. There are always ways to do it, and talking to people and getting the real skinny on what people think about how something works is incredibly valuable.
Yeah, this is usually how most of my learning has been over the years. And that got me to a space of, you know what, I experienced it? Let's talk about it. Let's see how it goes. Is it bad or good? It solved a problem in my own experience. And sometimes it's also interesting to show the failures, because you want to tell people what you tried that didn't work out. You don't want them to fall into that trap. So either way, I'm learning something. But usually I try, most of the time, as much as I can, to base my talks on an experience that I have had, a real story. I don't want to just bring a topic and talk about it in the abstract. Sure, I can talk about Kubernetes. I can talk about AWS. I can talk about anything. But I usually try to pick topics around a problem I tried to solve, or a situation I've been in. That gives me, I don't want to say credibility, but it shows I'm in it. I don't pad the talk much; I give them a real story about what exactly happened.
I mean, something I find is that documentation falls down terribly when it just tries to be a list: here are all the features, here's an API reference. For whatever reason, the thing I'm trying to do is never well documented in these things. So I like experience reports: I'm going to build a to-do list app, to use an overdone example. Great. I want to know how you used the tool to do it, what your steps were, how it wound up looking. You're driving to an outcome.
I also deeply appreciate the community stuff, especially the Heroes folks in the AWS world, because you are not beholden to AWS in the same way an AWS employee is. If an AWS employee talks about aspects of AWS being complete crap, they're likely not going to be an AWS employee for very long, whereas the rest of the community, we talk about this because it does have sharp edges. These things are painful. How do you split the difference there? Because on some level, it feels weird to go speak at a company's conference, use their platform, and then use that to drag them. I mean, I have a personal policy of not making people regret inviting me to things, so I'm not going to crap on them at their own conference. But I do sometimes feel like I have to strike a balance.
Yeah, the balance is always being honest and showing what the real value of something is. I always come onto many social media platforms and say, that didn't work for me. That wasn't the right approach. There are venues and spaces for the things I should say. I've been saying over the years, and a lot of people know this about me, that the AWS user experience has been clunky the whole time. They haven't mastered it.
That is such a flattering way to put it.
Yeah, in a way. You know, it's been ridiculous how many times I have seen it. I go into talks with service teams, and sometimes I say, you know what? Why do we have this three times on the same page? Why? They are reliable in some things, but they are not in others.
And that's where the balance always comes in. But also, I want to give them feedback, and I want it to be critical, but I want it to be, I don't want to say in a nice way, but I want it to be honest feedback. I don't want to embrace something that other vendors have done for years and say, this is great. I would say, it's great you have this now, but...
What took so long?
Yeah. It's been a long time to get to something like that. But there are some innovations I have seen in this space that deserve it. And like I said, I'm not beholden to anything, because I don't sell anything, so I'm not obligated in any of that.
Like, I don't like AWS.
You're going to, what?
Like, what's going to happen?
Like, I'm not...
You're going to use AWS in the next three years?
Yeah, I don't work for the company. It's my honest opinion. I think that's what I should be doing, because I'm not doing this for AWS specifically. I know the intricacies of AWS in some cases, but I'm doing this for the people. Like, if someone asked my opinion today, we have these debates all the time. Funny story, and you probably know this because you have seen a lot: when I talk to other people in the container AWS Hero space, I have a favorite. I favor EKS. Some others favor ECS. You have to see the debate when we trash some of the services to each other. Like, oh, no, ECS is better than EKS. I'm saying, no, EKS is better. And all of that kind of stuff. At the end of the day, it's a fun situation, comparing things, but at least we have honest opinions about where exactly each use case fits. We laugh about it. We kid about it a lot of the time. But if you ask me one day, which service should I use? It comes down to this: if you're a small company, Kubernetes is probably not the right fit for you. If you are using containers, there have to be characteristics and criteria behind your decision. So there's the fun talk and all of that, but there are also the technical decisions you need to make, and those are situational. It's not, hey, go all the way, containers, EKS, because that doesn't work. I have never, never seen one solution that fits all.
At Duckbill, we are building our product on top of ECS, because it makes sense for our scale and current constraints, and we have a path forward. And boom, surprise sponsorship. That's right: this show is sponsored by duckbillhq.com, my employer. We have a platform now, rather than just handling the consulting side of the world as we have historically, with contract negotiations for large entities debating with AWS what the future might hold for both parties. Now we have software that we are systematizing part of this in. If that sounds relevant to what you're doing, please check us out at duckbillhq.com. And also, we are hiring.
And also, also, Ahmed, one of the best parts about this timing is that yesterday, for the first time since it first came out, I spun up an EKS cluster, because I'm building a bunch of weird projects that I want to throw at a wall. All of my customers use EKS in some way, shape, or form. It's time for me to use it, and it's gotten, from what I can tell, slightly better. It only took 10 minutes to spin up the EKS cluster instead of the 25 it took when I did it several years ago. So it's improving bit by bit. What's your take on it?
Let's not talk about the startup time for the cluster. That has been a dilemma for a while: why does this take forever? I have seen the architecture for the control plane behind the scenes, and I still don't get why it takes so long, because others have done it and it seems to be working for them. So I'm not sure where this is coming from.
I'm certain there are reasons and good reasons for it.
And honestly, how often do you spin up or down your production cluster?
Oh, I don't, but that's kind of the point.
In development, when I'm testing my infrastructure stuff, I want to smoke test it in a test account.
And that adds a tremendous burden to how long it takes to run through those tests.
Please fix it.
That's why I care.
Yeah, exactly. I went through this use case, and I agree with you. How often do you bring up a Kubernetes cluster? Sure, not too much, but when I need it for testing, when I need to mimic something, when I'm doing a demo, I have to wait 10 minutes for a cluster to come up.
other, like there is new as an ecosystem pattern of like, so let me tell you why I like
Kubernetes in general. Like the generality of it is just because it's a common pattern across
multiple cloud provider. Like I can get that flavor on AWS. I can get this flavor and as a
providers. Does a lot of the things behind the scene change? Sure, instances, all of the kind of stuff,
like how they are with each other, all of the can stuff. But at the end of the day, it's a deployment,
it's a bud, it's a container, it's all shared. I can get a similar flavor into it into my machine
to test, which is relevant to what you're saying. So in my CI, I can spin up whatever like Kubernetes
thing on a Docker, whatever ecosystem to test something with. Problem with, is like when you start
having like disperse and like have solutions that like sure I have something that works for a cloud
but something else will work for local it's like becoming like a tangling effect and sometimes you
cannot test the same stuff so that's where like we have to come up with mockups mock up APIs and
see like oh now I have to call the EKS API now I have to get my body identity all of the can stuff
this is the sad part of it the good part of it there are like more capabilities that's coming
into services like this.
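One common shape of that local-cluster-in-CI pattern, sketched with kind (Kubernetes in Docker); the cluster name and the trivial smoke test are illustrative assumptions, not anything prescribed on the show:

```python
# Sketch: spin up a throwaway kind cluster for a CI smoke test,
# run a check, and tear it down. Assumes the kind and kubectl
# binaries are on PATH; all names here are illustrative.
import subprocess

CLUSTER = "ci-smoke"

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

try:
    # --wait blocks until the control plane reports Ready.
    run("kind", "create", "cluster", "--name", CLUSTER, "--wait", "120s")
    # A trivial smoke test: can we schedule and reach a pod?
    run("kubectl", "--context", f"kind-{CLUSTER}", "run", "probe",
        "--image=nginx", "--restart=Never")
    run("kubectl", "--context", f"kind-{CLUSTER}", "wait",
        "--for=condition=Ready", "pod/probe", "--timeout=120s")
finally:
    run("kind", "delete", "cluster", "--name", CLUSTER)
```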
One of the things that I really embraced when I saw it is the concept of managed add-ons. Or not add-ons, they call them managed services now, whatever, but it's the concept of having something like Argo managed for you. I've seen how Argo runs, and it's complex. Other controllers might not be, but that's a good option. If there are other community or open-source projects that could run the same way, where they take the complexity of running the control plane out, I love that. That's a great idea. It removes a burden, even if you know how to run things like that. So from that perspective, it seems like it's growing.
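For a sense of the mechanics, managed add-ons are driven through the same EKS API as the cluster itself. A rough sketch with boto3, using the long-standing CoreDNS add-on as a safe example; which newer controllers are offered this way varies by region and over time, so treat the names as illustrative:

```python
# Sketch: inspect and install an EKS managed add-on with boto3.
# "coredns" is a conservative example; cluster name and Kubernetes
# version are assumptions, not values from the episode.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# What add-ons does this cluster already run?
print(eks.list_addons(clusterName="smoke-test")["addons"])

# Which versions of an add-on are compatible with my cluster version?
versions = eks.describe_addon_versions(
    addonName="coredns", kubernetesVersion="1.31"
)
print([v["addonVersion"] for v in versions["addons"][0]["addonVersions"]])

# Install it; AWS then owns the lifecycle of the controller itself.
eks.create_addon(clusterName="smoke-test", addonName="coredns")
```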
Does it do a better job than other providers? Maybe, maybe not. I'd have to put them in a benchmark to see what they can do. There are use cases that I hear about. Are they interesting to me? I don't know. Do I need to run 100,000 nodes on a single cluster? Never. I've never had that use case in my entire life.
Well, yes, if you're trying to get the AWS bill high score, how else are you planning on doing it?
Sure, but I don't have that money to spend in my account.
Oh, good lord, you never do it with your own money. That's what employers are for, or clients in the consulting world. I digress. I've been running a test Kubernetes cluster at home for two years. I had to build a conference talk out of it, because I mouthed off on the internet seven years ago and said no one's going to care about Kubernetes, so I ended up having to give a talk called Terrible Ideas in Kubernetes. But I found it useful, where now I can just write random nonsense, or find it somewhere on GitHub, in a container, throw it onto the cluster, access it over my Tailscale network, and just have a bunch of heterogeneous things running. Unfortunately, I've become a victim of my own success, in that some of my team have seen some of the tools I've built that are useful for what they're doing, nothing coupled to client data, because my God. Oh, I built a great image manipulator for marketing purposes. It has some advantages. And they said, great, can I get a copy of that? Like, all right, time to build an internal cluster for this. But we're going to do it right, and by right, I mean enterprise-y. We're doing GitOps the whole way with Argo CD. We're using OpenTofu because Terraform gets really weird at scale. It is a wildly overbuilt solution for a single container at the moment, but something I've found about these clusters is they never tend to stay single-tenant for long. You start adding things to them, and in the fullness of time, it becomes really straightforward to start launching a bunch of internal corporate tools, which is handy. But the teething exercise of getting up and running with it: I'm glad this is not critical path for anything right now, because I don't know it well enough to support it.
It is not. I recall the days when I thought, oh, do I have to manage an enterprise cluster for many use cases? Do I have to run all of the kubeadm commands and join instances together and do all of that in the cloud? I was like, yeah, I don't want to.
Because it depends on what laptop you ran it from.
And then you get told, oh, no, no, you're only supposed to run that from the CI/CD system. That would have been terrific to put on the warning label.
Yeah. So, all of that. It solves a problem. You know, the whole cloud solves the problem of not having to care about hardware. But do I run my own stuff? Sure.
My entire home automation system runs on K3s.
That's where, like...
I'm running K3s myself.
Home Assistant?
Yeah, the whole system.
I do not have that running on the cluster, because that has gotten sizable and logic-heavy enough that I got an HP mini PC that I put the whole thing on. And again, with my wife, we're definitely proving the old trope that when you have a couple and someone's really into IoT, the dynamic is that one of you loves the fact that you're living in the future and the other one thinks the house is haunted. It's great.
Yeah, that's exactly where I'm at right now. I can tell you, I've had a few days where my wife would call me and say, hey, the house lights are not turning on. I was like, I don't know what's happening. And she said, all of a sudden, it's not working. I was like, yeah, you probably have to restart the cluster somehow. Go unplug it, plug it in again, and it will work. And sure enough, that works. But now I'm haunted by my own clusters, and I have to keep them up. Sometimes I have to upgrade them and do all of the work around them.
But, to be honest, it works. I ran into this; it's one of the things that you said. Setting it up the first time was a complex story, where I had to get everything set up on my end. I have seen how complex it is: I had to bake images and do all of those things just to get a small cluster running in my home. So imagine running this at enterprise scale: I have to bake my images, do all of the work to get this. Now it's easier. Now it's just a couple of clicks and you get a cluster up. That was a cool thing to have.
I just discovered a few weeks ago, from the person who wrote Atuin, Ellie, as it turns out, that K3s has a built-in registry that is distributed across the nodes, which is awesome. I can stop pulling the same image again and again, which is freaking wonderful.
I didn't know about that.
It's a single command-line argument. Spegel, S-P-E-G-E-L. It is built into K3s. You pass the server a command-line parameter, and you're done.
Okay.
I actually will look this up.
Yeah.
That's why I talk to you.
I learned something.
I'm going to go implement it.
Probably my lights will not work tonight, but that's okay. It's, you know, for the greater good.
That's another trick: I switched all of the light switches I was using over to Lutron, which is a little on the expensive side, but it's also what a lot of the smart home contractors build out. And what I love about them is, if you don't hook it up to anything, it acts like a normal light switch. And when the system fails, the way it works is like a normal light switch: you push the button, the lights turn on, and suddenly I get yelled at less.
I actually like this idea more. I ended up on that trend, not for all of my lights, but because I used the Hue lights before, and the switches were very... interesting. But then Lutron: this office is running on a Lutron switch, and it actually allows me to do three-way switches and different things, all of that kind of stuff, to mimic a normal environment. But also, when Wi-Fi is working and everything is stable, when the cloud is running, it runs beautifully from a remote perspective.
But, you know, it's a balance between what I need to do day to day and how I test things, and it depends on what I'm actually aiming for. To be honest, my cluster is running up there and I barely touch it most of the time. I don't need to touch it, because it's working. It auto-upgrades, all of that kind of stuff; it runs effectively, doing what I need it to do. But when I need to throw a container on it, that's an easy thing. Just log into it, throw a container on, get out, and it all works. So, yeah.
All my config lives in a Git repo that I just run kubectl against for the home stuff. I haven't GitOps'd it yet. But it means that when I tear down the cluster and rebuild it, as I have to every year and a half or so because it gets wonky, it's pretty easy to get the stuff I care about back up and running.
Yeah, I have a backup, so I didn't GitOps it either. Not like what I would normally do for my cloud stuff.
Yeah, for the home stuff, it's, like, I'll run my own RSS aggregator.
Terrific.
Awesome.
If that breaks, it's annoying.
I have to get it back up and running.
But none of my business stuff goes down.
Nothing breaks.
This is a different RSS system than the one that feeds the newsletter.
That stuff all lives in AWS, like a grown-up might put something there.
It's also strange sometimes to look at the monitoring for this and realize that my 11-node cluster that is all plugged into the same power strip has better uptime for the month than GitHub Actions, and, all right, that's unfortunate, but okay.
There's the other side of it too, that when it goes down, no one's coming to save me.
I've got to get it up and running myself and not just wait for a vendor to fix it for me.
It's a mixed bag.
I don't know that there's necessarily one right way for this.
It's just the reality of it.
We've forgotten on some level how to run hardware ourselves.
This episode is sponsored by my own company, Duckbill. Having trouble with your AWS bill? Perhaps it's time to renegotiate a contract with them. Maybe you're just wondering how to predict what's going on in the wide world of AWS. Well, that's where Duckbill comes in to help. Remember, you can't duck the Duckbill bill,
which I am reliably informed by my business partner,
is absolutely not our motto.
To learn more, visit duckbillhq.com.
To be honest, it's a debate. It's a debate that I've been in for years. So let's talk about it from a software perspective in general, not just cloud. With a lot of solutions out there, you see, oh, this solution provides me A, B, and C. Oh, but I can build A, B, C, and D. Sure, you can build it, but the problem is not in the building anymore. The problem is a few years out: how you maintain it, how you keep it up and running, how you do all of that kind of work.
It used to be day-two problems. Great. Now it's like day 50.
Yeah. This is where I start to think about it from a business perspective too, just a business mindset. What happens when the person who maintains a system leaves, or a team does, or the technology just gets old, or you have to upgrade it, or you have to run instances?
Or you leave it running in AWS for more than a year, in which case now it's extended support, which costs six times more for the cluster. Why? Because screw you. Another year goes past, then they will blindly upgrade you at a time of their choosing, not yours. So you're just kicking the can down the road, gaining nothing by it and paying through the nose for it. It's, okay, that doesn't seem the most customer obsessed.
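For concreteness on that six-x figure: at the published EKS rates at the time of writing, roughly $0.10 per cluster-hour for standard support and $0.60 for extended (worth verifying against current pricing), the per-control-plane arithmetic looks like this:

```python
# Back-of-the-envelope: what an EKS control plane costs once it ages
# into extended support. Rates are assumptions based on published
# pricing; check the current price list before relying on them.
HOURS_PER_YEAR = 24 * 365
STANDARD = 0.10   # $/cluster-hour, standard support
EXTENDED = 0.60   # $/cluster-hour, extended support

print(f"standard: ${STANDARD * HOURS_PER_YEAR:,.0f}/yr per cluster")  # $876
print(f"extended: ${EXTENDED * HOURS_PER_YEAR:,.0f}/yr per cluster")  # $5,256
print(f"multiplier: {EXTENDED / STANDARD:.0f}x")                      # 6x
```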
This was always interesting. The extended support part was always interesting, because they are trying to balance how to keep it sustainable for the team, or whatever the team is managing, just from my perspective. But six X is a lot. It's a big number when you try to do something like that, and you're always left asking why. Then again, if you look at a couple of clusters, you don't pay much. The control plane, for example, you don't pay a lot of money for. Still money, but not a lot. I would have a heart attack in some way if this applied to my nodes or something like that, which would be more complicated, and we would have to have a conversation about it. But again, the idea of build versus buy and all of that kind of stuff, it comes to a point where, sure, this system is unstable, but unstable in a way where you don't have to invest all of the resources in keeping the uptime, all of the operational stuff, all of those things. When I run something in my home, I understand the risk that this might not work. I have built a plan. It's not critical to my life. My lights will still turn on and off, but my Alexa won't be able to say, hey, turn on the light from here. That's the whole impact, here or there. But when I run a system and I have to maintain it, there's a lot of operational overhead I have to spend maintaining the system, maintaining the infrastructure, maintaining everything behind it.
So when someone asks, should I build versus buy, I always tend to tell people: what do you have? Are you building a gigantic system where you want to do everything yourself? I would rather lean on open source. What I have seen in my career is that the majority of tech problems have been solved in some way. So you're going to find a solution that solves 50, 60 percent of your way of doing it. Don't rebuild it. If Kubernetes works for you 80 percent of the way, don't try to rebuild it.
That's the rise-of-AI problem right there in a nutshell: well, I could just build my own custom solution out of spare parts, and that'll work. And it will, mostly, for the exact use case you defined and tested. As soon as the requirements change, now you have a problem to work with. And for weird back-of-house, single-purpose apps, I do that all the time. But for stuff that matters, of course I'm paying vendors. I pay for Notion at work with a smile on my face, for a bunch of reasons that should be obvious. I built my own newsletter publication system, and finally rebuilt it the way I wanted to at the start of this year, with the lessons learned. This is the third generation of that system. It's much better than the previous generations, but I'm sure I'm going to tear it down and replace it in a few years with something else. And that's okay. It's understanding where the right approach is. Someone had a tweet a while back that it's interesting that Anthropic, as a company, uses ADP for payroll instead of vibe-coding their own. And the answer is: because they're not insane. They understand that you're not just paying for a piece of software. You're paying for understanding the nuances of payroll law in a bunch of different jurisdictions in which you operate, keeping up to date with legal changes, and not having the Department of Labor kick your door off its hinges three days after you miss a payroll run. It's the right move. It's not just the software; it's understanding the business context of what you're trying to do.
100%.
It's not "because you can do it, you should do it." There's a big, big gap between the two. If you want to try something out, if I want to build something really quickly, have a demo, all of that kind of stuff, sure, go build it, try it, do whatever you want. But when I think about long-term sustainability, not everything is worth rebuilding. Just because I can doesn't mean I should. And this is where I stand: if you solved the problem, and I look at your solution and it fits, why not? Why not use it and add to it, or bake it into my way of thinking, rather than say, oh, it doesn't do all of the 10 things that I need it to do. But it does eight, or seven, or five. It's built, and there are 10 other people looking at it. Because think about it: if I rely on software, pick any project in the open source ecosystem, it's never just a single person using it. Many people use it, so someone has an interest in keeping it going. But if you build your own software, it's your responsibility alone. That's the thing you have to maintain.
And saying yes to something means saying no to something else.
Take your day job. You work at the New York Times. The New York Times does a bunch of different things. Officially, I suppose, you're a news outlet. Personally, I think your job is to employ history's greatest monster, whoever it is that organizes and runs the Connections puzzle every day, which vexes me like you would not freaking believe, because I don't think in the right frame of reference sometimes. But through none of those perspectives is it, oh, what does the New York Times do? That's right: you're a database company. You should build your own database. No, that is not where the value is. You have a website. You should build and run your own web servers. That's something a fool would say. If I'm dealing with a bank: handling the money, ensuring compliance, making sure the money is there when you say it is, that's the key job. An airline's job is to get people and planes and cargo from place to place. It is not to push the boundaries of computer science. And companies tend to lose sight of this, especially when engineers, in some cases, get carried away with resume-driven development.
Yeah, that's where scope and focus and specialty are things anyone should look into. I would rather spend my time in my area of expertise, what I'm good at, how I work. If I'm an engineer and I want to do a design, sure, I can do something quickly, but I don't necessarily have all of that understanding of how it works. Again, just because I can doesn't mean I should. The idea is that you should get to a point where you have SMEs. That's why they call someone an SME: they have studied the thing. If I'm asking for a serverless opinion in any way, I'm going to go ask a serverless person who dealt with this in a real production system, who knows when it breaks, who knows what the bad things about it are. A lot of people, when we talk, say serverless is great, you can spin up. Sure, you can spin up. Have you ever run a serverless architecture that has a thousand functions? Let's talk about how you govern all of them when they work together. That's a different story. The story I saw in a demo: sure, a Lambda function, any function in any cloud system, runs in a matter of seconds. Ship a container and it pops up. Great. Let's talk about how to govern this in a bigger system.
That's a different story.
So that's why it comes back to my point. When I give a talk, I talk about my experience. I talk about the things that I explored, because I have knowledge in that area. I have an understanding, rather than just losing focus on what I'm trying to do.
Right.
I'm a former SRE, and I have a radically different perspective on environments depending on where they are. I was always considered one of the most stodgy, conservative, curmudgeonly types when it came to things like databases and file systems, because mistakes there are going to show. But in my test environment, I have good backups of all the stuff I care about. Yeah, I'll go nuts. We'll do the bleeding-edge alpha thing. Oh, I guess that's why it's not GA yet. Whoops, roll back. And I am fine with throwing things over the wall. I have a dedicated AWS account, with no access to data, that I have an EC2 box in, upon which runs Claude Code in full-permissions mode. It has an EC2 role that gives it root, administrative access to the entire AWS environment. It is called Superfund, because it is both toxic and expensive. And the only blast radius, worst case here, is that it spikes my AWS bill, which I can handle if that's what it comes down to. Honestly, if I call in begging for forgiveness to the AWS billing department, it will become a company-wide holiday. Ladies and gentlemen, we got him. It'll be great.
I think the separation of concerns works most of the time, in a lot of cases, where you need to understand where your competency is.
Right.
Why not just do that in your production environment with, like, the customer database?
It's because I'm not insane.
Thank you for asking.
Exactly.
Like I said, it's also a situation where you should always think about segregation. That's why some people will say, yeah, let's ship it to production. Have you tested it before? Oh, I tested it. But I can tell a different story. In some of our environments, we ship to multiple places, multiple environments. People ask, have you tested that before? Yes, I tested it. Have you tested at the same scale? No, they only tested it in one specific environment. And I was like, why? You should always test with the same parameters. I worked at a company before where we were doing some deployments, and one of the deployments was, oh, the code is ready, shipped, everything is cool. It was a small company. And then, all of a sudden, they shipped it to production, and it doesn't work. It doesn't work because you don't have the same parameters you are running it with. You tested one thing with curl, shipping a single API call, and then in production you're sending 50,000 requests. Have you done this?
Oh, the testing is real. Yeah. It just didn't meet the standard expected.
All of these things are actually things that you have to think about.
Or even canary deployments, because at some point of scale, you cannot test at the same scale. Facebook gave a lot of talks about this back when they had reasonable approaches to things, because they didn't have a spare billion users to run in the dev environment. So they started off by having the developer run it themselves, then a small gated list of internal test users. There were something like nine concentric circles from the individual developer out to the entire Facebook user base. It was scaled and measured, and they monitored the heck out of it: oh, we're starting to see errors increase, let's dial that back.
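The concentric-circle idea generalizes to a simple gated-rollout loop. A generic sketch of the pattern, not Facebook's actual tooling; the stage sizes, soak time, and error budget are made-up numbers, and the two hooks stand in for whatever feature-flag and metrics systems you actually run:

```python
# Sketch: staged ("concentric circle") rollout gated on error rate.
import time

STAGES = [0.01, 0.1, 1, 5, 25, 100]  # percent of users, inner to outer
ERROR_BUDGET = 0.001                 # max tolerated error rate

def rollout(set_percent, error_rate, soak_seconds=1800):
    """Walk outward through the circles; bail and dial back on errors."""
    for pct in STAGES:
        set_percent(pct)
        time.sleep(soak_seconds)     # let real traffic hit the new code
        if error_rate() > ERROR_BUDGET:
            set_percent(0)           # "let's dial that back"
            raise RuntimeError(f"error budget blown at {pct}% stage")
    print("rolled out to 100% of users")

# Example wiring with dummy hooks:
if __name__ == "__main__":
    rollout(lambda p: print(f"serving {p}% of users"),
            lambda: 0.0004, soak_seconds=0)
```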
Which works super well for Facebook but would work terribly for Stripe, because every 500 error they get means someone didn't get paid. And that's a worse outcome than, oh, the cat picture didn't load fast enough.
Yeah, that's basically what they care about. Some companies care about every single transaction. And one of the things you mentioned here, for Stripe, for example: we don't even know which transaction will fail. That might be a very expensive transaction to fail, and it will cause the business to lose a lot of money. But if someone hit like on my Facebook comment and the send didn't work, I'm not going to get too offended.
Right. And there's also the reputational damage, too. I try to buy your book for $30. I hit send and the transaction fails. Both you and I are going to be upset by that, depending on how technical I am, and definitely for you, and that's going to flavor our impression of Stripe. It's, okay, that's not great. It has to work. It must. Whereas with other use cases, the restrictions are different. The product that we're building is for business, back-of-house users. Yes, we would like the site to be up when people are attempting to use it. But in the event that the site is down for an update or something for 20 minutes and has the maintenance page up, it is not disastrous. It is not critical path for serving their customers that day. And I can see a future in which that potentially changes, at which point our approach to uptime and responsibility and maintenance windows will have to change too; we're going to be very cognizant of its needs. But not everything needs to be hyper-scale, and not everything needs five nines of uptime. Understand the use case and the problem you're trying to solve for, rather than just doing engineering fantasy. Build the thing that solves the problem.
The thing about uptime sometimes strikes me, because of use cases I have seen in my past. You talk to someone, in a consulting opportunity or anywhere, and ask, what's your system uptime? And they say, oh, it's five nines. Great, what are you using? And they say, this API, that API, that API. And it turns out one of them is, oh, not five nines. And it's critical for you. Yeah, but my system is five nines. How does that even work? The backend systems that you're using are not five nines, but then you claim that yours is. It's a very complex world.
If you sincerely care about five nines of uptime on a service, you need to be in multiple regions to do it, arguably multiple providers, though I could be convinced otherwise, and you cannot take third-party dependencies. Because, look, I can test in my account what happens if none of my stuff can reach S3, hypothetically. But I cannot test what happens if my third-party vendor dependencies can't reach S3, or their third-party dependencies can't reach S3. The only way you test that is by S3 going down, which fortunately is not a common occurrence. But if you're serious about "this must stay up at all times," you have to own so much of that availability piece yourself.
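The arithmetic behind that point is unforgiving: the availabilities of serial dependencies multiply. A quick sketch, with illustrative numbers rather than anyone's actual SLA:

```python
# Composite availability of a serial dependency chain is the product
# of the parts. The figures below are illustrative, not real SLAs.
MINUTES_PER_YEAR = 365 * 24 * 60

deps = {"region": 0.9999, "vendor-a": 0.999, "vendor-b": 0.9995,
        "own-service": 0.9999}

composite = 1.0
for availability in deps.values():
    composite *= availability

downtime = (1 - composite) * MINUTES_PER_YEAR
print(f"composite availability: {composite:.5f}")      # ~0.99830
print(f"expected downtime: {downtime:.0f} min/year")   # ~890 minutes
# Five nines (0.99999) allows only ~5.3 minutes of downtime per year.
```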
Exactly. And that's where we have to think about every piece that you put in. At the end of the day, to me, technology is like any other thing. It's an architecture. It's a balance. It's a trade-off somewhere. You get something, you lose something. It's not, hey, you get the optimum everywhere. We all strive for the perfect system overall. But sure, you want to build it? You want to own it? Get your data center. Get your stuff. Make sure that you have redundancy, all of that kind of stuff. At a cost. Sure. Or you go the other way. At a cost. It's one way or the other, and then you build for what you need, and that's exactly what you have at the end of the day.
But that's what I'm looking for: what is the right balance? I prefer sometimes to say, here are some Lego blocks, and then you build whatever fits you. You want it tall, you want it short, you want it wide, you want it large. What you need is what you build, based on your requirements. Here's the thing: your requirements usually change over time, because this is what we have seen. Today you build, as you said, for a single app; tomorrow, 10 apps; the day after tomorrow, 100 apps. Do you have that scale? Do you want something you throw away, where you keep re-architecting every couple of days? You want something pluggable. That's why it's usually about finding the right patterns, what other people have found. A lot of people have spent their time on Kubernetes figuring out how to make that work.
It's a spectrum, like anything. There are trade-offs: decisions you should make early on so they will not hamstring you in the future. Almost every hyperscaler has had this problem before, where "we're just going to build a small thing for back-of-house stuff."
Great.
We're going to use the local time zone for the database entries.
No, no, no, no, no.
Talk to anyone who was at Google for about a decade and a half, use the phrase "Google Standard Time," and watch them flinch.
Because that is very painful to fix after the fact.
And it makes everything so much harder.
So everything I build these days, even my dev box, sits there running in UTC. If I want to know what time it is locally, great: that's something my user account can change at the presentation layer. Awesome. But the system itself must be UTC.
That's where standards come in. UTC is a common frame that everyone agrees on, and you convert based on your needs. Because I'm in New Jersey and someone else is in California, but we all know what the offset is. If I start to ingest into my database data coming from all the local zones, though, now I have a problem, and I have seen it in apps. I go into an app and look at it: the last user visited this app... tomorrow? I was like, what? What is today? Today is the third, but someone visited the app tomorrow? How does this even happen? Because they inserted the timestamp with their local machine's time zone into the system, and then you have a wrong representation, because now it's this edge case.
The XKCD RSS feed always goes into the past, for whatever reason, by about, I think, eight and a half hours from UTC. Something is not right, so it always pops up where I have to scroll back to find it.
Yesterday, when I was building out an EKS cluster with OpenTofu, I had Claude Code do most of the Terraform-slash-Tofu code. And I had to correct it. First, it put everything into 10.0. Great, that's going to conflict with something somewhere, because everyone uses that; in this case, the staging environment. Great, put it somewhere else. Then it built a bunch of /24s for the subnets, right next to each other. No. Because when you run more than 255 containers, which can happen...
Sorry, 253.
That's right. You've got the broadcast and network addresses as well.
Yeah.
Oh, and the DNS one, which you can't get rid of inside of the subnet either, so that drops it two more.
Great.
Point being, above a certain threshold you have to renumber, and that is painful. Build in room to expand without having to move things around, and you'll be much happier for it.
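That room to expand is cheap to check up front. A sketch with Python's ipaddress module; the CIDR choices are illustrative, and the five reserved addresses per subnet reflect AWS's documented behavior (network, VPC router, DNS, future use, broadcast):

```python
# Sketch: carve a VPC CIDR into subnets with room to grow, instead
# of packing /24s back to back. All CIDR choices are illustrative.
from ipaddress import ip_network

vpc = ip_network("10.42.0.0/16")   # deliberately not 10.0.0.0/16

# /20 per subnet: 4096 addresses, minus the 5 AWS reserves = 4091 usable.
subnets = list(vpc.subnets(new_prefix=20))[:4]
for s in subnets:
    print(s, "usable in AWS:", s.num_addresses - 5)

# Sanity check: nothing overlaps, and most of the /16 is still free
# for the day the cluster stops being single-tenant.
assert not any(a.overlaps(b) for a in subnets for b in subnets if a != b)
```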
Yeah.
That is a problem that I had to solve on many occasions, and there's always a solution for it. It's like, use a secondary CIDR there. It's just like...
Just use IPv6.
It's like, the grown-ups are speaking, please.
Sure. Yeah, this strikes me as one of the most bogus standards that I have seen so far. Not for anything, but because we can't yet agree on it. I think we are still living in the IPv4 world: we want more, but we cannot get more, yet not everything supports IPv6, so we are stuck on IPv4. What's your take on it?
They've been saying this since I was a child, and they'll be selling it to my grandkids as well.
It's not a problem that is top of mind for anyone except the salt of the earth folk who keep the internet moving.
So we're going to continue to ignore it until we can't anymore.
And then don't worry, the AI will fix it.
Yeah, AI will fix a lot of things until like robots will not be able to get IBs and we're going to be all stuck in that world.
Exactly.
So I want to thank you for taking the time to speak with me. If people want to learn more about what you're up to and how you view the world, and catch your next conference talk, where's the best place for them to find you?
The best place is LinkedIn. That's where I usually stay most up to date. If anyone wants to hit me up over email, they will find my links and all of my contact info there. If you need anything, or just want to check where I'm going next, I'm going to KubeCon in Amsterdam next month, where I'm doing all of the things. So yeah, LinkedIn is the way.
I want to say that's maybe the first time in history that the phrase "LinkedIn is the best place" has ever been uttered, because it is certainly a place. I'm there a lot more myself, and we will, of course, put links to that in the show notes. Ahmed, thank you so much for being so generous with your time. I deeply appreciate it.
Thank you, Corey. I really appreciate you having me here, and I'm looking forward to seeing you at many in-person events.
Oh, I'll be there.
Ahmed Bebars, principal engineer at The New York Times, and AWS Community Hero and Cloud Native Ambassador. We're just stacking up the accomplishments these days. I am Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review
on your podcast platform of choice.
Whereas if you've hated this podcast,
please leave a five-star review on your podcast platform of choice,
along with an angry, insulting comment that I won't ever see
because that podcast platform of choice
runs on somebody's home K3S cluster.
