The Standup with ThePrimeagen - Software Horror Stories
Episode Date: August 4, 2025

🔗 Sponsored by Code Rabbit https://coderabbit.link/primeagen-vscode #sponsored

## 📌 Chapters:
00:00:00 - Intro & Teej's Medical Records Disaster
00:04:20 - Code Rabbit Sponsor #ad
00:05:04 - No...body wants to work in Healthcare
00:07:05 - Healthcare testing process
00:12:00 - Epic's Development Pipeline Deep Dive
00:21:33 - Load Testing Nightmare
00:25:00 - The 45-Minute Site Outage
00:27:50 - Class Registration Disaster
00:32:00 - Everything Goes Wrong in Production
00:35:55 - Netflix Static Variable Mistake
00:39:00 - The Boss Who Spotted the Obvious Bug
00:42:00 - Lady Gaga's Broken Countdown Billboard
00:47:00 - 10 Requests /s Disaster
00:49:30 - GraphQL Query Explosion

## Key Topics:
- Medical software development challenges and patient safety
- Epic Systems development process and MUMPS language
- Load testing disasters and production outages
- Netflix development war stories
- Server-side rendering performance pitfalls
- GraphQL resolver optimization nightmares
- The importance of proper testing before production releases

## Hosts:
- **TJ** (@teej_dv) - Epic Systems horror stories
- **Trash Dev** (@trash_dev) - Load testing and Next.js disasters
- **Prime** (@ThePrimeagen) - Netflix static variable nightmares
Transcript
All right, today is another...
Let's go on.
Welcome to the stand-up.
We are doing dev horror scenes.
Everybody is laughing at me because this is take two.
Well, guess what?
We are going to hear some of the craziest stories.
TJ has guaranteed that Trash would love his dev horror story.
So, TJ, why don't you start us off?
Okay, so here you go.
I'll tell us the story, for real.
Yeah, yeah, yeah, yeah, yeah, yeah, yeah.
Anyway, sorry.
This is my first real job outside of school.
I'm working at Epic, the medical health records company, which is important to the story to know that it's not Fortnite.
And I worked on an application called Datalink.
Datalink is basically a way that we turned like insurance claims into medical data for you.
So like, Trash, if you were traveling around and you broke your arm or something, right?
You jujitsued and stubbed your toe, but you were in like Florida instead, right?
You would make a claim, and then the insurance company would send those claims back to your home hospital network. It would be like, hey, your doctor needs to know that you broke your toe at some point, because people just literally forget, right? So you broke your arm, you got this surgery, whatever, all this other stuff happens out of network, and they want to know back at your home place so you can get better care.
The problem with this is twofold. First one, it's a dev horror story from before I was there: sales promised that it would be done in three months.
Not a doable project in three months. It took nine months or something like this.
This was before I got hired to be clear, trash, because I already know you're going to say it's
my fault.
I wasn't there.
Okay.
I wasn't there.
Sure, sure, sure, sure.
It took three times as long as they said it was going to take.
And then they had to delete the product because it didn't work and was putting patients in
danger the first time around.
They literally got all the hospitals to remove it.
And they did a new version later.
Okay.
So it was off to a rocky start for round one.
I get hired.
We're in the midst of round two.
It's working like reasonably now for most cases.
It's kind of working.
So I imagine that there are a lot of horror stories that I didn't get to hear about round one.
But here's the thing that happened that was that I know Trash is going to laugh at.
I know it for sure.
So we get these things are called patient safety escalations, right?
Which means somebody reported something crazy happening in a hospital.
because that's your thing.
It's not did we, you know, like stream the wrong movie or something.
It's like did we accidentally potentially kill somebody, right?
So just to be clear, before we get out there, this never actually hurt anybody.
We were confident of that; we did a whole bunch of research afterwards and everything.
But the thing that we found, the reason that this got surfaced is a doctor was looking at a new mother's chart and was going through the chart, just recently had her baby,
going through all the procedures.
And about three quarters of the way down on the chart,
he sees an external procedure done on her.
And what does it say?
Circumcision.
The reason this happens is because when you have a baby,
a lot of times insurance companies charge everything to the mom's insurance claim ID.
So they do a circumcision on a little baby?
We input the claims.
We say, oh, look, it's on mommy's.
It's on mom's insurance ID.
You put that right on her chart.
She gets back the week after having a baby.
She's at her child visit.
So it turns out, so we find this out.
So this then spawns an entire, like, you know, big problem because we're like,
holy cow, we could be putting procedures on the wrong people.
So we're looking into everything.
We're like calling, you know, industry.
like we have a bunch of people on staff who are industry experts and everything.
Obviously, like this one, it's not so bad.
It's chill.
It's like, yeah, it was obvious.
We fixed it and it was good.
The one that was much more like potentially dangerous in this scenario, but also due to a bunch of other Swiss cheese factors.
Like, if they actually got covered, it never happened.
Developers, are you tired of code reviews that just say "looks good to me"?
Or "here's seven things I need you to change" that are irrelevant to the PR?
Well, meet CodeRabbit.
AI reviews right inside your editor.
Starting with VS Code, Cursor, and Windsurf.
Sorry, no Vim yet.
Hit review and bam.
Out comes feedback even before you open a pull request.
CodeRabbit can help flag bugs that your vibe coding with your LLM bots missed.
And with one easy click, it can send all the context back
to your favorite agentic tools for quick fixes line by line.
CodeRabbit offers a generous free tier with up to 50 code reviews per day.
Grab the extension now at CodeRabbit dot com slash Primeagen dash VS Code.
Links in the description and ship cleaner code today.
The other crazy one for this,
which is why you don't let sales drive what features you're going to do,
is there's a bunch of problems that exist
that you need industry experts to actually tell you,
is when you get a transplant,
so like, Trash, if you gave your liver to Prime,
generally, at least in the U.S.,
the insurance company just puts it on the person that the procedure is going on.
If you're the donor, it always puts it on the donee.
What's the person who receives?
The recipient.
Okay.
It always just puts the claims on that one.
So like it literally, this didn't ever actually happen, but like it could have if we had had the scenario.
But it was literally like, it would be like donate liver got liver right in a row.
or like because both of those claims would be associated with the recipient claimant ID.
Because the problem is medical insurance data is not supposed to be medical data, right?
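A minimal sketch of the failure TJ describes, assuming a claim carries both the insurance ID it was billed under and some field identifying who the procedure was actually performed on. The `Claim` fields, the `members` lookup, and both function names are hypothetical; the transcript does not say what Datalink's data or its eventual fix looked like.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Claim:
    claimant_id: str   # insurance ID the claim was billed under
    patient_name: str  # who the procedure was performed on (hypothetical field)
    procedure: str


def naive_chart_owner(claim: Claim) -> str:
    """File the procedure on the chart of whoever the claim was billed to."""
    return claim.claimant_id


def checked_chart_owner(claim: Claim, members: Dict[str, str]) -> Optional[str]:
    """Only file the procedure when the billed member is also the patient."""
    if members.get(claim.claimant_id) == claim.patient_name:
        return claim.claimant_id
    return None  # hold for manual review instead of guessing


# Hypothetical data: a newborn's procedure billed to the mother's insurance ID.
members = {"INS-001": "Alex Mother"}
newborn_claim = Claim("INS-001", "Baby Newborn", "circumcision")

print(naive_chart_owner(newborn_claim))             # lands on the mother's chart
print(checked_chart_owner(newborn_claim, members))  # held back for review
```

The same check would hold back a transplant claim billed under the recipient's ID but performed on the donor.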
So there you go, Trash.
The main one that I knew you would laugh at from that, though, is new mother gets circumcision procedure on chart.
Dude, I don't know if I could ever, I feel like working like in that industry is just,
just those bugs are just like catastrophic, I feel like, right?
Oh, yeah, it's not, I didn't like it.
That's why I left.
Like, I mean, like, Epic was fine to work at.
I thought, like, I actually liked working there all right, like, and everything.
But, dude, when you have literally, like, multiple hundred millions of patients and all of their medical data is going through the system you're working on, it's a very different experience than, like, oops, we deleted a document on accident, or some photos got lost or something like that. Everything is potentially life or death.
Yeah. So now I have a follow-up question.
Yeah.
What's your testing process before you roll this stuff out?
Oh, I mean, how long do you want to go, Trash?
It's actually kind of interesting.
I mean, it doesn't have to be in depth. I'm just curious, is it pretty hardcore or what? I assume it is.
Part of it, part of it, part of it that's kind of like,
um,
so basically like at Epic,
generally speaking,
they did,
they were changing this a little bit when I left,
but they were,
they were moving more towards like closer to a rolling release.
Like every three months,
they would cut a new version of Epic and they would try and get people on.
But like up until then,
they had,
uh,
basically like,
you know,
every three years would be like a new version of Epic,
right?
Like classic enterprise software.
You've got Epic 2014.
Epic 2017.
Um, but okay, so for the testing thing. What we would do is we had these big versions. And then, if you wanted to get a package sent back to people, you had to create this thing called a special upgrade. They were basically patches that got applied. And so we would have effectively multiple main branches. They're like environments where it would be like, oh, this is Epic 2016 with the latest versions applied. And you would have some ability to say, oh, they've installed these things, or this one depends on this, and it would do a whole graph and all that stuff. Actually lots of really impressive tech in that stack.
But so basically, for anything you wanted to change, any change that ever got landed, no matter how small, no matter how big, you would do that in our dev environment. This is just if you want to do it for the main one. So first we'll just cover main: main is our current version of Epic that's coming out later.
You would do all your dev in this big main, like shared environment.
It has tons of test data, like loaded in, like big scale, try out all this different
stuff, whatever.
You can play around with it.
All these good things.
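The "this one depends on this, and it would do a whole graph" part of special upgrades can be sketched as a dependency graph over patches. This is an illustration only: the patch names are made up, and nothing here reflects how Epic's tooling actually resolves special upgrades. It uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical patch names; each maps to the patches it depends on.
depends_on = {
    "SU-3": {"SU-1", "SU-2"},
    "SU-2": {"SU-1"},
    "SU-1": set(),
}


def install_order(graph):
    """Return an order in which every patch comes after its prerequisites."""
    return list(TopologicalSorter(graph).static_order())


def missing_prerequisites(patch, installed, graph):
    """Walk prerequisites transitively and return those not yet installed."""
    missing, stack = set(), list(graph.get(patch, ()))
    while stack:
        dep = stack.pop()
        if dep not in installed and dep not in missing:
            missing.add(dep)
            stack.extend(graph.get(dep, ()))
    return missing


print(install_order(depends_on))
print(missing_prerequisites("SU-3", {"SU-1"}, depends_on))
```

A site that already has `SU-1` installed would need `SU-2` before `SU-3` could be applied.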
So you got to get approval.
We had a bunch of stuff for this, design docs, et cetera, project tracking, you know,
yada, yada, yada.
Yeah.
For once, it's a product where you think, okay, it's probably a good idea.
Yeah.
Like most of the time, you're like, guys, can we not do design docs? I just want to try the feature. But for this one, like for the example I said, if we had a design doc earlier for some of these things and had pulled in people from insurance and billing, they probably would have called this out. You know what I'm saying? It could have been a solved issue. Anyways.
So then you do all your dev in this environment.
We had our own whole custom source control, deployment, IDE, review thing, everything. All of it was built in, because Epic's old. It started in the seventies.
So they built all this stuff all by themselves forever.
And so then you would get here, you would get past dev, and then you'd have some, someone do code review.
You'd go back and forth.
You'd have to get it.
They'd have to sign off.
It wouldn't let you merge unless you got sign off from people, stuff like that, right?
Then it goes to QA1.
QA1, you get a QA person from your department.
QA environment has a whole bunch of different setup, different environments, different things, different,
kind of like a different deployment.
It's completely detached from dev.
They go through.
They do all their testing.
They've got a test script.
They've got regression tests.
They got a bunch of stuff like this that they run.
Obviously, like, there's also automated testing and things that we did too.
But like you get a QA person to go test your thing.
Once that QA1 goes and if they find anything, you go back and forth, blah, blah, blah.
You potentially need Dev one to review again, depending how big the changes are.
QA can ask like, yo, you need changes.
QA1 gets finished.
You have code review number two.
Code review number two happens.
Different developer, can't be the same one as last time.
Code reviewer or, like, testing?
Code review. So we did code review 1, QA1, code review 2.
Completely different software developer.
How does that make sense?
Like, what if code review 2 asks for changes, and then now you have to go back through that whole testing process?
Yeah, if there's significant enough changes, you'll just kick it back.
You'll kick it back to dev 1.
Yep. Okay.
Yeah, exactly.
Yes, if it's that, then you got to kick it back to dev 1.
What about code reviewer 1?
Does he get in trouble?
Because he missed what code reviewer 2 caught?
Maybe if it was a pattern.
The people who are code reviewing are on your app.
So they're like on your team.
So like we had the Datalink team.
The Datalink team was like, you know,
at various times between like five and ten devs.
So it's not like you would know.
If somebody's always just, you know,
rubber stamping, dev 1, get them out.
Get them out.
trip to Karen's office.
Put them on PIP.
Can't be messing around with that kind of stuff there.
And so then Dev 2 completes.
Now it moves into a completely separate environment.
QA2, a different QAer reviews it.
Everything checks out.
It's good.
Then it gets merged into basically like the main line,
like current branch for where all the code is and what's going on.
So that's like the process for just getting like one change.
done and getting into like the current version of epic and then like that would eventually get
deployed out to different hospitals whenever they're ready to upgrade.
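The four-stage gate described above (code review 1, QA1, code review 2 by a different developer, QA2 by a different QAer, then merge) can be sketched as a small state machine. The class, stage names, and people below are hypothetical and only model the rules stated in the conversation; Epic's actual tooling is not described in that detail.

```python
STAGES = ["code_review_1", "qa_1", "code_review_2", "qa_2"]

# The second reviewer and second QAer must differ from the first.
MUST_DIFFER_FROM = {"code_review_2": "code_review_1", "qa_2": "qa_1"}


class Change:
    def __init__(self, title):
        self.title = title
        self.signoffs = {}  # stage name -> person who signed off

    def next_stage(self):
        """Return the first stage without a sign-off, or None if all are done."""
        for stage in STAGES:
            if stage not in self.signoffs:
                return stage
        return None

    def sign_off(self, stage, person):
        if stage != self.next_stage():
            raise ValueError("stages must be completed in order")
        earlier = MUST_DIFFER_FROM.get(stage)
        if earlier and self.signoffs[earlier] == person:
            raise ValueError(f"{stage} needs a different person than {earlier}")
        self.signoffs[stage] = person

    def kick_back(self):
        """Significant changes requested: start over from the first review."""
        self.signoffs.clear()

    @property
    def mergeable(self):
        return self.next_stage() is None


change = Change("fix claim attribution")
change.sign_off("code_review_1", "alice")
change.sign_off("qa_1", "bob")
change.sign_off("code_review_2", "carol")  # "alice" would be rejected here
change.sign_off("qa_2", "dave")
print(change.mergeable)
```

Shipping the same fix as a special upgrade would mean running a second `Change` through the same stages against the older branch.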
Does that make sense?
Yeah, yeah, yeah, yeah.
Then though, Trash, if this is a fix for an important bug or a feature that we wanted to send
back to people without them having to do a whole thing, you'd have to do this entire process again
starting from a different initial branch, sometimes could be auto applied, but sometimes not,
Because, like, maybe there's code you use that doesn't exist in the 2016 version, but now we're in the 2020 version.
You go back, you got to go do that whole same process again, and it gets put out basically as a new patch, a special upgrade.
You know, special upgrade 37,982 or something like this, right?
And then that one could get sent back.
So you repeat that whole process again if you need to get it shipped.
What's that average release time, like weeks?
Like you're saying from 1PR starting to finished?
Yeah, I guess, I mean, obviously it depends on like the significance of the changes.
Yeah, yeah, I mean, it's super depends.
I mean, if you, so like one of the things that I did, so this, you'll think this is kind of funny, too.
Sorry, we'll just randomly talk about this and maybe we'll, we can tell some other horrors.
We're right, we're right.
We're right.
Um, tell us your horror story.
One of the things I release process.
Yeah.
That is a horror of soft.
That is a horror.
Have you ever had code take months to hit?
This actually.
I actually have a tangent on this.
That's going on here.
Like, this isn't that.
Yeah.
Like, right.
for the, you know, for the vertical that Epic's operating in, you're like happy to hear that
you do all of these things.
You're like, oh, it moves slow.
You're like, okay, well, that's good.
This is good.
So, like, one of the things I built a little bit before I left, I actually built a kind of scuffed version of Telescope, the Neovim fuzzy finder.
It's the thing that inspired me to later try it out when I was done working there.
But not for Neovim, because we didn't use Neovim.
Neovim... I had Epic Studio while I was there when I was writing some of this code.
Like, we had our own IDE, which was connected to all the environments and all this other stuff.
Although I was literally trying to write my own competing Neovim plugin distribution.
Story for a different day.
Now I forgot.
Oh, the telescope thing.
So I built this like mini telescope thing to help our QA search for a bunch of like scripts and other stuff that they could run.
and then I built a way for them to test data link,
basically using like Excel files.
Because the problem was all the stuff we were doing before
was like this crazy manual process
where you do something like load claims into the data warehouse.
Data warehouse, you have to wait until it ETLs back into our claims
in the actual epic environment.
Then that epic environment you have to test and do things.
And then, right?
So it's like it could literally take hours or days
for like some of these changes to go through.
but you could force them manually, like, if you knew what you were doing, right?
So I wrote this thing that would be, like, you could put in the claims in, like, an Excel table you wanted as, like, an input sheet.
And then, like, you would have an output sheet in Excel, and you could say the things you wanted to see in the patient's chart.
And so you could write out all those test cases.
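The input-sheet and output-sheet idea is a form of table-driven testing. Here is a rough sketch that uses CSV text as a stand-in for the Excel sheets, since the custom Excel parsing libraries TJ mentions are not public. The column names, the sample rows, and `claims_to_chart` (a placeholder for the system under test) are all assumptions.

```python
import csv
import io

# Stand-in for the input sheet: the claims to load for each test case.
INPUT_SHEET = """case,claimant_id,procedure
broken-arm,INS-001,arm fracture repair
two-claims,INS-002,toe splint
two-claims,INS-002,x-ray
"""

# Stand-in for the output sheet: what should appear on the patient's chart.
EXPECTED_SHEET = """case,chart_entry
broken-arm,arm fracture repair
two-claims,toe splint
two-claims,x-ray
"""


def claims_to_chart(claims):
    """Placeholder for the system under test: one chart entry per claim."""
    return [claim["procedure"] for claim in claims]


def rows_by_case(sheet_text):
    """Group the rows of a sheet by their test case name."""
    grouped = {}
    for row in csv.DictReader(io.StringIO(sheet_text)):
        grouped.setdefault(row["case"], []).append(row)
    return grouped


def run_cases(input_text, expected_text):
    """Run every case in the input sheet and compare with the expected sheet."""
    expected = rows_by_case(expected_text)
    results = {}
    for case, claims in rows_by_case(input_text).items():
        want = [row["chart_entry"] for row in expected.get(case, [])]
        results[case] = claims_to_chart(claims) == want
    return results


print(run_cases(INPUT_SHEET, EXPECTED_SHEET))
```

The point of the design is that QA writes rows, not code: a new scenario is a few more lines in each sheet.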
And then I built a telescope, like, fuzzy finder thing.
So they could search through, like, hundreds or thousands of these, like, test cases later.
Yeah, so it literally changed things from, like, something that used to take multiple days.
for QA to test something manually
into like less than an hour
and automatic, which was really,
like that part was pretty cool.
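For the fuzzy finder over test cases, a very small version of the idea is subsequence matching: a query matches a name if its characters appear in order, and tighter matches rank first. This is not Telescope's actual sorting algorithm and not what TJ built at Epic; the scoring rule and the test case names are assumptions for illustration.

```python
def fuzzy_score(query, candidate):
    """Return the span of a greedy in-order match, or None if there is none.

    Lower scores are better. Greedy leftmost matching is a simplification:
    it does not always find the tightest possible span.
    """
    query, candidate = query.lower(), candidate.lower()
    pos, first = -1, None
    for ch in query:
        pos = candidate.find(ch, pos + 1)
        if pos == -1:
            return None
        if first is None:
            first = pos
    return 0 if first is None else pos - first + 1


def fuzzy_find(query, candidates):
    """Return matching candidates, best score first, ties broken by name."""
    scored = [(fuzzy_score(query, name), name) for name in candidates]
    return [name for score, name in sorted(s for s in scored if s[0] is not None)]


# Hypothetical test case names.
test_cases = [
    "newborn_claim_on_mother",
    "transplant_donor_recipient",
    "out_of_network_arm",
]
print(fuzzy_find("tdr", test_cases))
print(fuzzy_find("arm", test_cases))
```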
And literally the deal like this, too,
the reason that I found that I was inspired for this idea,
I'm just poking around in random old code that we have
in this crazy language called MUMPS,
which is also a software horror story.
I can tell you about this after this, Trash.
And we had our own custom Excel parsing libraries
that I discovered.
I was like, well, that's cool.
What could you do with that?
It was literally like, oh, we could just
make it so that QA can write tests in Excel, and then we can write code to handle these things. But it just happened because I was poking around in some random places and found Excel parsing libraries.
Yeah, stuff going on over there.
Oh yeah. There's a fate worse than death, and that fate is using a company's custom IDE. That just feels awful.
It was better than VS Code, dude.
It's a thousand times better than VS code.
Like, no joke. Like, it's not, it was great.
It connected to every environment automatically.
It would, like, you could see diffs across environments and you could check out all this
stuff.
It was connect.
Dude, we had actual code review.
You could, like, push this thing and you could go check out the code.
You were in the environment automatically.
None of this dumb.
Your environment scuffed.
You're trying to do a dumb code review in GitHub, and GitHub won't even load the entire
diff and all this.
Like, dude, most companies, like, the experience is subpar using VS Code compared to what they had at Epic,
using a 1970s programming language called MUMPS at a healthcare place, which is also funny.
That's all proprietary?
That whole IDE was proprietary?
Yeah, yeah, they made it themselves.
They were just like, oh, gosh, that's a secret sauce over there.
Well, it's not, I mean, it's just like, I'm going to get a job at Epic just to check this out.
Yeah, I mean, like, MUMPS is a bad language, like it is, it's true.
They were working on a project to make TypeScript compile to MUMPS, which was crazy.
And like it was working.
They just rewrote the back end where instead of emitting JavaScript, it could emit MUMPS instead.
I've never heard of MUMPS.
What the heck is that?
Yeah, well, so it's kind of like NoSQL, but scripting.
Oh, yeah, yeah, yeah, yeah, yeah.
But I'm just saying like, somehow it made it way harder to understand what it is.
Can you describe it in any different possible way than that?
That's literally what it is.
So it's a key value store scripting language.
So you can do business logic stuff on sparsely populated data sets
and you can write it in the same language that you're doing all your other stuff.
So like almost all the back end was in Mumps and it's like a key value store language.
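To make "a key value store language" for sparse data a bit more concrete, here is a Python sketch loosely modeled on the way MUMPS addresses stored values by a list of subscripts. It is not MUMPS and not Epic's schema: the class, the subscript layout, and the sample records are invented, and `children` only roughly imitates walking sibling subscripts in sorted order.

```python
class SparseGlobal:
    """A sparse tree of values addressed by tuples of subscripts."""

    def __init__(self):
        self._values = {}

    def set(self, subscripts, value):
        self._values[tuple(subscripts)] = value

    def get(self, subscripts, default=""):
        """Return the stored value, or a default for nodes never set."""
        return self._values.get(tuple(subscripts), default)

    def children(self, prefix):
        """Return the sorted, distinct next-level subscripts under a prefix."""
        prefix = tuple(prefix)
        depth = len(prefix)
        return sorted(
            {key[depth] for key in self._values
             if len(key) > depth and key[:depth] == prefix}
        )


# Hypothetical records: only the nodes that exist take up any space.
patient = SparseGlobal()
patient.set(("prime", "vitals", "2025-01-10", "bp"), "120/80")
patient.set(("prime", "allergy", "penicillin"), "rash")
patient.set(("teej", "vitals", "2025-03-02", "bp"), "118/76")

print(patient.children(("prime",)))
print(patient.children(("teej",)))
print(repr(patient.get(("teej", "allergy", "penicillin"))))
```

Two patients share only the `vitals` branch here; nothing is stored for the fields one of them lacks, which is the property that suits records with little overlap.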
Was the TypeScript-to-MUMPS compiler,
was it called like MMR vaccine or something like that?
It should have been, though.
No, I just called it TSM.
That's what I was going with this.
Yeah, they tried.
So they never called it mumps.
You only call the language M.
Because you don't want to keep reminding
doctors and hospital people that your medical stack is written on mumps.
Yeah.
Well, dude, it's the classic programmer thing.
Some programmers made this up at like Massachusetts,
university, medical, something, right?
So they were like, guys, know what would be funny?
Like, we should call this programming language MUMPS.
And then they,
Because, like, the acronym's not even that good.
You know, like, it doesn't actually line up.
The programmers knew.
They were working at a university hospital.
They made this up and were like, bruh, pretty funny calling it mumps.
I can't believe I spend my days doing this.
What the hell's wrong with me?
This is crazy.
I'm going down a mumps rabbit hole right now.
Yeah.
Trash, you know what?
We can do WTF mumps.
And I'll show you a bunch of crazy stuff in the language would be so funny.
I will say, I will say, this is one of those things where I'll just be like, the serious note is, a bunch of people are like, I can't ship productive software in this stack, blah, blah, whatever. Guys, Epic has made more money than you can possibly dream up. They're going to stay private forever. It made Judy many billions of dollars. It's, like, crazy. Literally changed the world, and it's the language no one else in the entire universe writes.
Yeah, like, you can ship productive software in other things. It's just, they deliver business value. They are doing stuff that nobody else is doing.
Like, so just too bad.
Oh, you have to write Python at work or
yeah, if you're solving the business problem,
then it's going to be okay.
So, and actually I kind of,
there were some things that were very unique
and interesting about mumps that were actually kind of good.
Uh, like, it's good for sparse data sets.
That's medical records.
Like, dude, medical records are sparse data sets.
Yeah.
Prime, you and I have almost probably almost no,
overlapping data in our health record.
You know what I'm saying?
Right?
Like it's like, okay, we have maybe like that we got vitals taken the last times that we
went to the doctor.
But like nothing else is the same.
Right.
So just trying to do like some secret.
But you know, I live in South Dakota.
We don't even believe in doctors here, okay?
You're just when, when the guy who comes out to check if the cows are okay,
he takes your blood pressure too.
Yeah, yeah.
I just sit in line with the dairy cow.
Okay, Trash.
I'll come back with the presentation of WTF MUMPS for you. And then I'll give you the rundown. I'll give you the rundown.
I know. That was a long, that was a long.
Without even what we're talking about anymore at this point.
Trash, what else you want to know about Epic? This can just be your time to ask me about that.
And then we'll just be done.
We need to preface that I want to hear Trash's horror story.
Trash is like, oh, I got good ones. And so I'm just like, okay, lay it down. I want to hear it.
I don't want to do it. I want to do like a round robin.
Like, do you guys have more than one, I'm assuming?
That was the one that I wanted to tell was that we accidentally put circumcision on the mom's chart.
I mean, I have like a list that I can go down.
So I'm, you know, analyzing the list here.
Yeah, pick one. Okay.
You have a list of these?
Do you write them down after they happen of like, this is a memeable moment in my career?
No, these are like early in my career.
All right?
These are early.
Well, uh, yeah, okay, I got one.
I got one.
I got one.
I'm not going to name.
I'm not going to say any company name.
Netflix.
But I will say it's not my current company, 100%.
Government contracting.
Okay, okay.
Are you guys ready for this?
Oh, yeah.
So one would think you want to load test and prod, but we did.
We did.
And I'm going to tell you why.
So when you see, when you look at, sometimes you look at sites and it's, you know, populated by some data that looks simple.
But behind the scenes, it's typically not.
You have, like, different data sources and other things, like processing it, ETL, what have you.
Yeah.
I mean, it's not as simple as, like, let me just query one database and all that stuff is just there.
This is, like, you know, 20 plus databases potentially.
I'm not saying that's exactly what's happening in this case.
Well, let's just imagine that's the case.
So usually when you have all these data sources and you have to kind of aggregate that,
you usually use some level of caching, right?
Because you don't want to hit those data sources constantly,
especially if you have a high traffic site, which would be like in this case, I don't know, but it's pretty high.
It's pretty dang high.
Like tens of visitors a day or what?
At least.
Okay.
That's a lot.
That's more than I've ever done, so that's cool.
So, obviously, when you load test, when you stress test, you just want to see how your infrastructure reacts.
So we did load test in our lower environments.
But it's that classic case where, obviously, your prod environment is just not like your test environment or your staging, so you just can't get that data, which is literally the classic issue in any tech company, right?
So we're like, well, we need some real results.
And we were conscious of like how much it can handle.
We did everything like we could to be like, okay, is this a good idea or is this a bad idea?
But it sounds like it's still going to be a bad idea.
Yeah, you flip a coin and there was heads on both sides and the heads was a bad idea.
So check this out.
So this site,
we do a heavy amount of caching because of the amount of requests we get a second.
Okay.
And what we do is, we wrote our own internal cache buster for specific paths and stuff.
But we only clear the cache at low peak times, because when you clear that cache for like one second with that much traffic, it's just pretty catastrophic.
For whatever reason...
We were testing, like, not at peak time, because that's just almost as dumb as load testing in prod.
But we were load testing at a relatively low-activity time, like during work hours or whatever, because people aren't...
Naturally.
Naturally.
I'm not going to say what kind of site it was.
But for whatever reason, the cache got cleared in one of these instances at the same time we load tested.
And it took down our site for about 30 to 45 minutes, which is.
like absolutely unacceptable.
One, you can't log into the site.
Two, you can't sign up to pay us money to get to the site.
And like, as soon as it happened, we started seeing tweets.
We started seeing things on LinkedIn.
And like me, it was just me.
It was like me and two other engineers.
We're just sitting on a call.
And we're like, yo, is da, da, da, dot, dot, whatever working?
He's like, nah, man.
And I was like, we're fresh and I was like, we just kind of sat there like on a call like literally like just like we're doing now.
We're kind of just kind of staring at the space.
And we're just like, okay.
So try refreshing again.
We're just like.
And we're just like.
So hold on.
Why did it take so long?
Like could you just restart things?
Like what went down that took so long to get back up?
It's because it's so connected.
Like it just goes so like it takes down.
It's basically dominoes across different infrastructure.
Okay, microservices. I've heard of these. Okay, yeah.
Basically, basically, right? So when the cache clears, all of those systems downstream get overwhelmed, and they just topple over.
And not only does it take out like prod, it takes down other systems that are dependent on those systems.
So it's kind of just like, it was like literally the worst thing that could possibly happen.
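The effect Trash describes, where clearing the cache under load sends everything to the systems behind it, can be shown with a small deterministic model. It assumes a single cached path, a cache that starts warm, and no coalescing of concurrent misses, so every request that arrives between the clear and the first completed refill goes to the backend. The traffic numbers are invented, not the site's.

```python
def backend_hits(arrival_times_ms, backend_latency_ms, cache_cleared_at_ms=None):
    """Count requests that reach the backend for one cached path.

    Requests before the clear are served from the warm cache. After the
    clear, every request misses until the first refill has completed.
    """
    if cache_cleared_at_ms is None:
        return 0
    refill_done = None
    hits = 0
    for t in sorted(arrival_times_ms):
        if t < cache_cleared_at_ms:
            continue  # served from the warm cache
        if refill_done is None:
            refill_done = t + backend_latency_ms  # first miss starts the refill
            hits += 1
        elif t < refill_done:
            hits += 1  # still missing: refill has not landed yet
    return hits


# Illustrative load: one request per millisecond for two seconds.
arrivals = range(2000)
print(backend_hits(arrivals, backend_latency_ms=500))
print(backend_hits(arrivals, backend_latency_ms=500, cache_cleared_at_ms=1000))
```

In this model the backend goes from seeing no requests to seeing every request for the length of one slow refill. If the extra load also slows the backend down, the window gets longer, which is the toppling-over described above.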
We didn't get in trouble. I mean, it's got like a stuff, but like, you know, it was probably it was probably it was at least it was at least.
these top three worst moments of my life.
Okay, so when you said load testing, what were you load testing or how does that work out?
What were you doing that caused this?
I'm still a little bit nebulous on that.
So we're testing like the very fresh request for everything.
So we have like, so let's just talk about we have layers of the cake.
The topmost layer is the layer we own, which is built on top of the caching on all the
underlying pieces.
So we assumed the caching was at right below our piece, right?
And the top layer is not that important.
Like everything below that is like the most important part. But we had
like some logic and we wanted to just kind of like
test the CPU readings and see like
how would we react to like X number
of requests.
But I mean the whole like
it reactively
it was like a freak accident where the cache
had never been cleared. Okay. That's all I'm saying. But we're just
testing like just literally do we have to spend how many
containers do we need up at like X amount
of time right? Just to have like a steady
baseline and that's pretty much what we were
looking for.
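The baseline question here, how many containers are needed for a given request rate, comes down to simple arithmetic once a load test has measured what one container can sustain. The formula below is a common rule of thumb, not the team's method, and every number in it is made up.

```python
import math


def containers_needed(peak_rps, rps_per_container, headroom=0.25):
    """Estimate container count, keeping a fraction of each one's capacity spare."""
    usable_rps = rps_per_container * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


# Hypothetical: 900 requests/s at peak, 100 requests/s sustained per container.
print(containers_needed(900, 100))
```

The measured per-container rate is the number that differs between staging and prod, which is why the team wanted real results.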
And it turned out
that that was a bad idea.
Like I said, we load tested in staging and test,
but it's just not the same.
It's not the same.
Trash, the thing I wanted to ask, though.
You said that was in your top three worst moments.
I'm now very curious, what are the top three?
Yeah, yeah, give us a little quick one.
I have another, I have another really bad one.
Give us one more.
Okay.
This was probably my third year into professional software development
where I mainly just wrote PL/SQL.
Like, all I did,
was just do database stuff.
Oh, yeah, trash.
Data warehouse life.
Right.
I wrote a, I wrote like this whole pipeline that basically maps soldiers to sign up for classes.
So we have like these benefits that allow soldiers to, you know, sign up for classes at these whatever universities that will accept them.
And based on like their zip code or their status or whatever, they can go some certain direction.
That's not important.
But I built like this whole mapping system of just like stored procedures.
And the TLDR is like basically on sign up day, it's pretty much havoc.
Like everyone like everyone's like, you know, you know when you sign for classes,
you want to get the best class.
Yep.
Or you have the schedule.
11 a.m. or later.
Yeah, you don't want to get the 8 a.m. 1.
You just don't.
You want all my classes on Tuesday and I want like a two days.
I'm going to take four days off so I can get a lot of work done.
And then Monday night rolls around.
So, but let me preface this.
We tested the bejesus out of this thing, as one would expect, because people sign up for school.
And I also want to preface, like, it wasn't just like all me, like it was like a shared responsibility of a team.
Okay.
Okay.
It's not like, and we need to write this and they just looked at me and you're like, it's all on you, buddy.
Right?
But I wrote like a lot of the core, the core stuff.
And I was like super into like SQL.
Anyways, we tested the crap out of this.
We launch it.
It goes to like a big testing process.
Stored procedures, right?
Mm-hmm.
Stored procedures.
And it goes through its testing phase, just like with Teej, right?
We have like UAT, code reviews, and we actually have multiple layers of testing before it even gets to UAT, which is user acceptance testing for those who don't know.
So we're very confident, you know, like 99% confident this was going to go very well. We were in a war room on the day it's launching.
And this is like launch day is basically like or not necessarily launch day, but the day of everyone enrolling.
And we have like all these dashboards up. I have like the database. I'm like monitoring the database.
And initially things were going well and I was like cool like I kind of looked at the data.
We kind of had like a whole like diagram with like how the flow should look.
And we started getting like the initial ones, not too much traffic because you know time zones,
early, but some people are like, do stuff super early.
And I was like, okay, like early signups or something.
So we could see like things are being mapped correctly.
But like on the main day when like everybody was signing up for whatever reason,
the mapping just started going the wrong way.
It's like when you go down a slide at a water park and I don't know, somehow that slide
makes a wrong turn to the deep end.
And we just started seeing.
It's not how slides work, Trash.
What?
It's an analogy.
Formal analogy.
I was going to turn it into it's like when your veins, it's like when your veins in your arm, they make a wrong turn and they go to like your fingers.
That's also not how many.
Those are my two choices and I chose the slide one because it was less worse.
But okay, so the errors that happened, it's not, it's not, it's not obvious.
It's not like you have error rates for things like going to the wrong thing.
You kind of just have to look at the data and understand like, oh, that's kind of weird, right?
Yep. So I had all these like pre-written queries just for like certain scenarios just to like spot check the data.
And I started noticing obviously the data was wrong. I was like, okay. And then we also built our own like CRM system.
And we just started the system just started getting overwhelmed with tickets and people were getting pissed.
But anyways, like I wasn't there to like see the whole process unfold. But in the war room, we were just like, holy shit, what the hell.
Everyone's getting signed up for the wrong thing. And it's not something we
could just roll back because it's kind of just done.
Because of stored procedures.
Yeah, you know, you know how that goes.
I had to, I don't even remember what we had to do, but we had to just basically accept what
was happening.
You just had to accept it.
And then you just had to fix it later.
You know what I mean?
It's like one of those.
So if I'm like kind of like, it's not like they're going to go up to you and be like,
you got to stop signing up for classes.
It's just not that kind of scenario.
It's basically the person that's helping the person is like, oh, that's weird.
We'll get back to you.
And then that's basically, that's kind of what happens.
So we had to just understand what went wrong so we could kind of reverse engineer that logic and put people in the right place.
It wasn't like as difficult as you would think because, you know, these people had profiles.
You knew like their PII or their information.
Okay.
So you could kind of, based off like zip code and stuff, you'd be like, okay.
this is mapped clearly because of X, Y, Z.
But at the time, you know, everyone was just like,
well, this is not what I signed up for.
This isn't the class I wanted.
This isn't the time slot I wanted.
So a lot of people's schedules got messed up, right?
But it was weird.
I remember sitting in the corner of the room, I had my feet up.
I was just like, very relaxed, very calm.
And everyone just kind of looks at me, and they're just like,
why is it doing that?
And I was like, I don't know.
And I literally just had to sit there and just watch it happen the whole day
while I was trying to reverse engineer why it's happening, because like I said, we couldn't stop it.
It just had to happen, you know. It's like a car coming towards you and you just have to
let it keep going you lock out here that's not what happens and that's
I don't know either well you can't stop a car come now you're gonna get smashed
you can move it's like jumping off a cliff and then as you're going towards
the water you can do there you go that's an analogy that finally works
yeah yeah anyways
You're very bad at analogies.
You should just stick to telling stories.
They're way better.
Your story is crazy without them.
No, keep the analogies coming.
Please.
I think this was like 2010, 2011 when this happened.
This was like my first, like, the team was supportive.
Okay, I didn't feel like I was ashamed or anything.
But it was like my first experience of having a catastrophic failure.
So, you know, lessons learned.
And again, it was a team effort.
Not all on me, but kind of all on me at the same time. So...
Trash, you're not up in your case, buddy.
well you know what I said I said well who tested this
that's what I said
I was like well how did this get past testing
How did this get past testing? Classic rebuttal.
You really can't you really I mean did you also
were you in charge of testing Trash?
No
I have a funny story
I have a funny story
side tangent this project I'm talking about
the passwords that our QA
has been using, like, we had like a password sheet.
That's where he learned it from.
I literally, I literally used those passwords.
I was kidding.
I didn't think that was actually.
They were so complex.
Well, they're like, they're kind of complex.
But I typed them so many times because I would use the QA accounts that they became
part of like my being and I just started using those passwords.
Oh my gosh.
Like what?
How complex are they trash?
Give me an example.
That's the lore to my password.
I swear.
That you don't use anymore.
Hit me with one.
Old password?
No, I was kidding.
I was trying to bait him in this.
Yeah, yeah, yeah, yeah, I'm not going to do it.
But it was somehow.
Don't do it. Don't do it.
We cannot leak your one password, buddy.
Please, please, please, please.
Help yourself.
Yeah, there's the lore for the one password.
Okay.
It was proud of God.
I can't believe we uncovered.
We're going to have to tell Casey this when he comes.
Yeah, no.
I got more stories, too, after Prime goes.
Like that.
I got more stories.
Yeah.
Prime, let's hear one from you,
and then I want to save a few for Casey.
We can do a, we can run this back again.
I want to do a part two where
Casey gives us some too. Okay, so obviously the one that I can't do is the one where I created
that really horrible bug for Netflix that got a named attack, the Repulsive Grizzly
attack. I've already told that story. We already know Falcor. Okay, we already know Falcor existed.
More like Fail-cor. Am I right, chat? All right. So I'll tell you this story. Okay, so
thanks, Trash. I was doing PHP startups for a long time. And in PHP land, at least back in the day,
when you had a static variable, a static variable would be for the life of the request
because every request ran in its own process.
And so I kind of had this mentality that requests, you know, like static was just associated
with requests for whatever reason.
You know, things happen.
You know, you just are so used to a specific environment.
You kind of forget the meaning of what static is.
And so it just like happens in a specific way.
You're saying, like, because it restarts the PHP process every time?
Every request is its own PHP process, right? So that's why, like, old PHP was so slow.
And so I remember going to Netflix, and I had to kind of come up with logging for all the UI,
because we just didn't have logging. Like, we had no logging for anything. And how this got
started is that we started releasing these originals, and what would happen is that we needed to have
an original, and we needed to have its art for just
like the box, the box art.
And then there's also things called
Rower panels.
Trash, I think you probably remember
Rour panels.
Ror panels were the tall ones.
And then you also had these like ultra long boxes.
And then you also had it in Korean.
You also had it in, you know, Portuguese.
You also had it like,
so it had all these like sub-variants as well.
So every single image for a show had literally like a hundred assets related to it.
And we just had no idea what,
what places of the world just didn't have those images.
And so my boss was just like,
like, can you fix that problem?
And so I was like, dude, I got this.
I got a great idea.
So I went out to go come up with like a really good idea.
And my first swing at it is that I didn't, I just forgot that static is not a request-level thing.
It's just that I was so used to PHP at that time that I just started writing it that way.
And so that was like a whole situation that came in because for like the next day, I'm just like, why am I like, I just made a request, but I'm in Brazil.
Like, I know this is from me.
This can't be real.
And the thing is, that mistake is really stupid and small.
But the thing that really got me wasn't that,
wasn't that I made that mistake.
It's just like, hey, you've just been in PHP land for too long.
You made a small mistake and all that.
But my boss at the time, Jeff Wagner, was just like,
was just like, this is ridiculous.
Open up the code.
And I opened up the code.
He's like, dude, your variable's static.
And like having your boss who doesn't program point out the most obvious thing in the universe.
Like, yeah, it's for the lifetime of an application.
It was, like, by far the most embarrassing moment of my lifetime to have a non-programming person
to be like, hey, your code's wrong and it's obvious.
Like, I can see it from here and I don't even program.
It just, like, it felt so, it just felt so horrible.
And I just remember, like, that was by far, like, the part that I was just, like, the most
personally sad about things.
is that I made that mistake
and it was so,
it was so simple not to make that mistake.
Oh my gosh.
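[Editor's sketch] The static-variable mix-up can be illustrated roughly as follows. This is a minimal Python analogue, not the actual Netflix code: the class, the method names, and the country values are all hypothetical. It assumes a long-lived server process, where a class-level attribute (the nearest Python equivalent of a static) survives from one request to the next.

```python
class ImageCheckLogger:
    # Class-level attribute: one shared value for the whole process.
    # Hypothetical stand-in for the "static" variable in the story.
    country = None

    @classmethod
    def handle(cls, request_country):
        # Bug: assumes a fresh process (and a fresh static) per request,
        # so the value is only ever set by the first request.
        if cls.country is None:
            cls.country = request_country
        return f"image check for {cls.country}"


def handle_fixed(request_country):
    # Fix: keep per-request state in a local variable (or a request object).
    country = request_country
    return f"image check for {country}"


first = ImageCheckLogger.handle("BR")
second = ImageCheckLogger.handle("US")  # still reports BR
fixed = handle_fixed("US")
```

In this sketch the second request is logged under the first request's country, which is the kind of "I just made a request, but I'm in Brazil" confusion described above.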
And so my stories are much smaller
and nicer than yours.
I've actually had very few,
uh,
very few like production problems.
So I don't have a lot of these stories.
Except for the big, big one.
Except for the big big one.
Yeah, the one where we took down a big company.
Except for the really, really big one.
Yeah, because I always would,
I was always usually in charge of writing like the libraries.
Like that,
that was just a library.
Right.
Yeah.
So if you, like,
trash,
I'm sure at some point,
If you were in the UI side, you ran into like the Atlas Mantis logger for all of groovy and all that.
That was like my first attempt at making all the UIs have a unified ability to log.
Anything about them, which we had none.
And the reason why we had none, fun side stories, I went to the Atlas team, which if you don't know Atlas, Atlas is like a kind of, it's kind of like a tag storage database where if you have an event, you can put a bunch of tags.
and the combinatorics of the tags
is how many different types of data
will be stored inside the database.
And so that way you can have like instant lookup.
Like I want to slice the data in all these different dimensions.
And so you can see the last two days,
a billion requests.
You can see that in like absolutely no time.
So it's like a really cool thing.
But I went to the team and I was just like,
hey, you know, I'm writing this stuff for the UI
and I really need access to these APIs.
And they're like, oh, you guys had access at one point.
And I was like, yeah.
And they're like, we don't let the UI have access to these APIs anymore. And I'm like, why? And they're like,
they spent like a million dollars in a day and we took it away from them. You can't play with
this toy anymore. You're not allowed to play with this toy. So I actually had to program things in
there to be like, that's probably not a good idea to store. Right? Like, we had to be very careful
on these things. And I had to get things whitelisted just to be like, only these values the UI is allowed to
store. Don't even try that with me, TJ. You know, I'm going to denylist that
request, TJ. So that was, uh, that was a lot of fun.
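[Editor's sketch] The cost problem with tag combinatorics, and the allowlisting used to contain it, can be sketched like this. It is an illustration only and does not use any real Atlas API: the tag names, the allowed values, and both helper functions are assumptions made for the example.

```python
from math import prod


def combination_count(tag_values):
    """Upper bound on distinct tag combinations: the product of the
    number of distinct values each tag can take."""
    return prod(len(values) for values in tag_values.values())


# Hypothetical allowlist: only these keys and values may be stored.
ALLOWED = {
    "device": {"tv", "web", "mobile"},
    "country": {"US", "BR", "KR"},
}


def filter_tags(tags, allowed=ALLOWED):
    """Drop any tag whose key or value is not on the allowlist."""
    return {k: v for k, v in tags.items() if k in allowed and v in allowed[k]}


bounded = combination_count(ALLOWED)  # 3 * 3
# One high-cardinality tag multiplies everything else.
unbounded = combination_count({**ALLOWED, "user_id": range(1_000_000)})
clean = filter_tags({"device": "tv", "country": "BR", "user_id": "12345"})
```

Adding a single tag with a million possible values takes the bound from 9 to 9,000,000 combinations, which is why restricting the UI to a fixed set of values matters.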
My only other really big one was, I've said the story once or twice, but it was, I think it was
Lady Gaga who was having a movie come out on Netflix, and there was like this big countdown billboard.
And I was in charge of rewriting the billboard and kind of getting it more into a modernized
version of React where we had a little bit better, like, kind of layout of everything. And I just get done
with it, and we have this big countdown billboard, and they had all the
studio execs and Lady Gaga and all those people getting ready for the billboard to have this like
countdown, and when it counts down, it was going to like show her movie, and it was going to be so awesome
and all that kind of stuff. But I just like messed up the testing, and when it went down to zero, the
whole page would freeze. And so, like, ah, so I just like totally broke the product.
But it was only like seven people who saw it, which was just the execs and the people
who were concerned about it. Dude, I just broke it in front of them, and I received very
or emails from that one.
Oh, no.
Nice.
The seven most important people.
We even looked at the stats.
It was literally like 12,
like it was like such a small amount because it was you had to like reload the page
within like 30 seconds of the video going to go live.
Like so it's like such a small fraction of people that would even have the chance to see it.
It was just all the people who needed to see it saw it not work.
Like, if it was just a second later, it would have just like been like, oh, yeah, it would have worked.
It was just the hierarchy up the company from you to the top.
Who saw it?
If Netflix was like the Illuminati triangle, it was the eyeball that saw it.
I was in trouble on that one.
So that one was very embarrassing.
That's pretty much all my embarrassing stories are those ones.
I don't really have any other embarrassing stories.
I really didn't.
I almost never broke prod.
I got like a couple that aren't like terrible,
but they're kind of like, oh, now I know
and I won't do that again kind of thing.
So here's one, this one's not my fault.
And I hope whoever.
To be fair, none of them have been your fault so far, Trash.
I've really kind of put a part in that.
It's usually not my fault.
Yeah.
I'm going to preface this just in case
because I know some of them watch you
for whose fault this is.
So not blaming you.
It's a team effort.
It's a team effort.
But it was your fault, not mine.
But, so this was, I'm not even going to say the year,
because this is going to be even more obvious if they're watching this.
So we were launching a new site, a site that I was not a part of writing.
Stop avoiding responsibility.
I'm just kidding. I'm just kidding.
Kind of.
But so this was early Next.js days.
I didn't even know what Next.js was.
So this was, I think...
I don't even know what Next.js is.
It's that Vercel thing.
This is like when everyone's like, oh my God,
server-side rendering, even though it's been a thing forever,
it just became more accessible to people that didn't know what it was.
So it was like the new hotness,
and they wanted to use this new framework.
So everything was just server side rendering.
And what that means is we basically just wrote like a million queries
at request time.
Isomorphic.
They probably are using terms isomorphic,
during that time.
Bro, is that isomorphic?
Yeah.
I'm not sure.
Is that isomorphic JavaScript?
I got to look up the definition every time.
Right.
So we're using GraphQL.
And GraphQL had like some of these APIs.
I'll basically crawl your tree, look for all the promises,
and then make sure all those promises are resolved before I return some data.
But that's a big problem when you have such a nested tree and every single part of that tree needs some part of data.
You're just requesting like a million times before it can actually render anything.
So we were doing a brand new rework of a site, a very popular site, and we're like,
on this day, brand new launch, brand new feel, brand, you're going to love it. We launched this
site. It was a client-side rendered site, I think, before that, or it was better than what it
was when we did it at that point. It looked better. We made it look better, but it did not perform
better? So we launched it. That's like the most classic rewrite of all time when it comes to.
It looked cool, it looked cool, but a complete disaster to, like, click and wait like five seconds before it even navigates, right?
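[Editor's sketch] The slowdown from resolving a deeply nested tree level by level can be illustrated with a toy model. This is not the real GraphQL or Next.js API: `fetch`, the 10 ms delay, and the depth of 5 are assumptions chosen for the example.

```python
import asyncio
import time


async def fetch(name):
    """Stand-in for one data request made during rendering (hypothetical)."""
    await asyncio.sleep(0.01)  # pretend each request takes 10 ms
    return name


async def render_waterfall(depth):
    # Each level only discovers its data needs after its parent has
    # resolved, so the requests run one after another.
    results = []
    for level in range(depth):
        results.append(await fetch(f"level-{level}"))
    return results


async def render_batched(depth):
    # If all data needs are known up front, the requests can overlap.
    return list(await asyncio.gather(*(fetch(f"level-{level}") for level in range(depth))))


def timed(render, depth):
    """Run one render to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(render(depth))
    return result, time.perf_counter() - start


slow_result, slow_seconds = timed(render_waterfall, 5)
fast_result, fast_seconds = timed(render_batched, 5)
```

Both renders return the same data, but the waterfall version pays for every level in sequence, so its latency grows with the depth of the tree.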
Chat is saying it's balls.orgia. It is not balls.orgia. I vibecoded that myself, chat. Trash was not involved.
Yes, that was a T.J. Vibe code special. Okay. So stop that. Stop. Yeah, you stop. Anyways, continue, Trash.
So before we did this whole, like, Next.js fiasco, we had another sister site, and we used something that was called pre-render. And if you're not familiar, pre-render
is basically a bot that will crawl your site
and then cache those results for like a Google bot or something.
So if you don't necessarily need to server-side render anything,
you can just pay for this bot or have a bot
that just sends back cached results.
So we had something like that,
but we switched to this new paradigm of using this framework.
And we were able to handle a pretty high number
of requests per second, but when we launched this site,
our request per second per container went down to 10.
It can only handle 10 requests per container.
And we were trying to scale.
And we were trying to scale.
We were trying to see the containers spin up as fast as they could.
But we were just shedding so much traffic.
But imagine a container only taking 10 requests per second.
Oh, chat can imagine.
For a pretty simple site.
It was actually very catastrophic.
And it was really damaging to like,
our team and the company.
10 requests per second is like impressively slow.
Dude, it was like 10 to 15.
And since we couldn't scale up enough containers fast enough,
we were obviously going to shed a bunch of traffic.
And I had a bunch of downstream effects.
But man, it was, it was, it was literally so embarrassing.
We're serving multiple digits of RPS.
It was, it was so embarrassing.
And I had no idea like how how this framework worked.
And they're just like, we got to fix this now.
We did.
Clearly it wasn't a bug that we fixed immediately.
It took us like a week to actually resolve it.
But it was just, it was a nightmare.
And one of the most stressful moments of my life for sure.
But to watch it like release or all just sit in there and just watching.
How was it that nobody knew it was only going to serve approximately 10?
People had no idea back then about.
We didn't stress test it.
Like when you're in your staging environments, right?
Okay, so it was just because the database in prod was so much bigger than the other one that, like, loading these made it take.
I don't even, so I wasn't part of like the testing process before we released it.
I was just part of the team and kind of watching it unfold.
But clearly we didn't stress test.
But, I mean, the classic story is when you usually release stuff, you just have one QA engineer, go into test, click around, things load.
And then once they load, that's kind of the end of the story.
we never really think about, oh, we should load test this with, you know, 50,000
RPS or something, right?
Yeah.
And go from there.
So it was just like a classic mishap of that.
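[Editor's sketch] The capacity arithmetic behind this story is short. The 10 requests per second per container and the 50,000 RPS load-test figure come from the conversation; the 500 RPS comparison value and the function name are assumptions added for illustration.

```python
import math


def containers_needed(target_rps, rps_per_container):
    """Containers required to serve target_rps, rounded up."""
    return math.ceil(target_rps / rps_per_container)


at_10_rps = containers_needed(50_000, 10)    # what the new site managed
at_500_rps = containers_needed(50_000, 500)  # hypothetical healthier container
```

At 10 RPS per container, a 50,000 RPS load needs 5,000 containers instead of 100, which is why autoscaling alone could not keep up and traffic was shed.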
I don't think anyone, it was also because that whole paradigm was new to that, to the team.
Yeah, that makes sense.
They're just like, cool, SEO.
This is sick.
Heck yeah.
But yeah, once, once in.
I remember watching the graphs, everything was just like exponentially up.
And they just see our like containers trying to spin up and keep
up and just keep going down. It was a disaster. Yeah, lesson learned. What was the lesson to learn,
trash? Don't server-side render or what? What is the lesson? Don't try to query literally every
piece of data to serve a simple page, you know? Smart. Okay. That is a good lesson. Like the modern
Silicon Valley way, though. That's also, yeah, but that's also a classic, classic GraphQL problem, too.
People just think that it's on the graph.
I can just, they have no idea which resolvers are actually very heavy.
So they're just like, oh, it's in the schema.
Blah, blah, blah, blah, blah, blah.
I don't know.
It's aggregating the data.
They probably used like one tenth of the fields anyways.
Like they're just putting all the fields in because they're like, well, whatever.
I mean, it's in the schema.
I'm going to put it in.
It's fine.
Yeah, yeah.
This is a classic.
Classic.
Classic GraphQL issue.
Yeah.
Yeah.
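[Editor's sketch] The "it's in the schema, so I'll request it" problem can be shown with a toy cost table. The field names and per-resolver costs below are invented for the example and do not come from any real schema; the point is only that a query costs the sum of the resolvers it touches.

```python
# Hypothetical per-field resolver cost in milliseconds.
RESOLVER_COST_MS = {
    "title": 1,
    "boxArt": 2,
    "reviews": 120,
    "recommendations": 300,
}


def query_cost(fields, costs=RESOLVER_COST_MS):
    """Total resolver time for a query that selects the given fields."""
    return sum(costs[field] for field in fields)


everything = query_cost(RESOLVER_COST_MS)     # every field in the schema
only_needed = query_cost(["title", "boxArt"])  # what the page actually renders
```

In this made-up table, requesting every field costs 423 ms against 3 ms for the two fields the page uses, and nothing in the query itself shows which resolvers are the heavy ones.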
Oh, even if you can get like really precise data, you still have to look it up at a
table. It's not like you're getting much out when you're just like, I'm going to go to a huge
table and perform a bunch of really bad queries, but I'm only pulling out one piece of data that I'm
going to use. It's still like, yeah, this is, this whole experience is really, really bad. Can we just
end the episode? Boot up the day. Vibcoons errors on my screen. Terminal coffee and hair.
