PurePerformance - Perform 2020 Chris Tells a Performance Horror Story
Episode Date: February 6, 2020...
Transcript
I have these lights at home too that I use when I live stream.
Coming to you from Dynatrace Perform in Las Vegas, it's Pure Performance!
Hey! Welcome, everyone.
We are taking a few testimonials here, and hopefully our next guest, Chris, has an amazing performance horror story.
Chris, welcome to Pure Performance.
Hey, thank you. Appreciate being here in such a wonderful conference with all the best and latest in Dynatrace technology.
Look at that. You're hired.
You didn't pay him, did you?
No, I did not.
That was really, really good.
Is this your first time at Dynatrace Perform?
It's not.
I was actually here last year.
I did not get a chance to be part of this experience.
I saw it, so I thought I'd come up and talk to you guys.
Oh, yeah, that's great.
I appreciate it a lot.
It's really cool.
We're giving away gifts and things.
We have little micro drones.
Okay.
So you'll get in the running for this.
Yeah. As you walked up, you were just inquiring about stuff. The testimonials can be anything.
Okay.
A lot of people come to find Dynatrace only after it's too late, after a disaster. Like, oh, why didn't we spend this money? Why didn't we do this work?
The price of saying no versus the price of saying yes, like what our main speaker talked about yesterday.
Yeah, exactly right. I wouldn't say willful ignorance, but just the cost of, I didn't know we should be doing this work.
But you have a particular horror story.
I do.
To share, but not from your current employment.
No, it's from a couple of lives back.
We'll let it go. No problem.
So, you know, my name is Chris Labrado. I've been in IT for about 20 years. Around 2007, I was working for T-Mobile, and I was running their middleware administration team. That was back in the day of SOA, three-tiered architectures, blah, blah, blah.
Yeah.
And what happened was we had a Java application that was using a precursor to XML called exec flow. The developer no longer worked for us, but basically it looked like a language where you put in checkboxes with descriptions, and then it would read it, and then it would do some stuff.
Yeah.
It was archaic.
But this was on our prepaid billing system, which had 8 million subscribers, and I was responsible for it.
So T-Mobile, a prepaid phone.
Yes, yeah.
So what they wanted to do is they wanted to link the prepaid billing system with the postpaid billing system,
and our software developers pushed out, you know, a code release.
Yeah.
And I usually come in in the morning, check that things are going, business as usual.
I get the call at 4 o'clock in the morning: Chris, I need you in the war room.
Nothing's working.
I said, okay, I've been through this a million times.
No problem.
I walk in there and, you know, they're like, we can't do anything.
I'm like, what do you mean?
They're like, we can't refill prepaid cards, and everyone's card balance was wiped out.
Oh.
Right?
Along those lines.
Something along those lines.
And so we're like, okay, that's no good.
And they're like, what do we need to do?
I was like, well, we need to get everybody in the war room and start, you know... Our developers were in Israel.
We got people on the phone.
And so this is the beginning of something that we spent about 164 hours on in eight days.
All right.
So now all of a sudden.
Are you talking about 164 hours?
What's 24 times eight?
No, that's for me.
Okay.
164.
So like I had a whole bunch of people with me in this room day in and day out.
So.
Oh yeah.
Several 24 hours.
That's not a lot of sleep.
It's not a lot of sleep.
Like that's like 200 hours.
Yeah.
A hundred, a total you can take.
No, it's 164.
164.
That's, like, very little sleep.
Very, very little sleep, all right. Because the problem was it made BusinessWeek and it was on the news, and nobody wants to see their corporation on the news saying, hey, a telco cannot have customers placing phone calls. That's not a great place to be.
So when it wiped the balances out to zero, just to get some more details here, did you have logging to know what the previous amount was?
Or was it like, we have no way of knowing?
I had 500 hours that I paid for.
So the issue is that now T-Mobile is giving away free minutes, right,
to stop the bleeding for their experience.
Good for you.
You just got a bunch of extra free.
Right.
But now we've got some time, because we don't want to give away minutes, because people are just going to keep, you know, playing the arcade with free quarters.
You got it.
So we had no tools.
We had no way of understanding why it was a problem.
All we knew is that every time that somebody tried to do a transaction,
it was a full table lock.
Right.
And what that means is that you can only do one thing at a time. We had, you know, a bunch of application stuff talking to this middleware process, talking to the database.
The middleware process was basically one lane,
and you had like eight lanes at the top all trying to get in.
So you have a traffic jam.
Queued up and times out.
Yep, yep.
Just couldn't do anything.
Retry storm, maybe.
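The one-lane traffic jam described above can be sketched with a small simulation. This is an illustrative model only, not the actual prepaid billing system: the function name `simulate_lanes`, the request count, the service time, and the timeout are all assumptions chosen to make the effect visible.

```python
def simulate_lanes(num_requests, service_time, timeout, lanes=1):
    """Return (completed, timed_out) for requests that all arrive at t=0.

    Each lane handles one request at a time, the way a full table lock
    lets only one transaction through. A request counts as timed out if
    it could not finish within `timeout` seconds of arriving.
    """
    completed = 0
    timed_out = 0
    lane_free_at = [0.0] * lanes  # time at which each lane becomes idle

    for _ in range(num_requests):
        # The request waits for whichever lane frees up first.
        lane = min(range(lanes), key=lambda i: lane_free_at[i])
        finish = lane_free_at[lane] + service_time
        if finish <= timeout:
            lane_free_at[lane] = finish
            completed += 1
        else:
            timed_out += 1  # the caller gives up before getting the lock

    return completed, timed_out


if __name__ == "__main__":
    # Eight callers, one-second transactions, three-second timeout (all hypothetical).
    print("one lane:  ", simulate_lanes(8, 1.0, 3.0, lanes=1))
    print("four lanes:", simulate_lanes(8, 1.0, 3.0, lanes=4))
```

With a single lane most callers time out, while the same load across several lanes completes, which is the intuition behind the later decision to stand up more copies of the middleware.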
So what we ended up doing, and if we had had Dynatrace,
it would have made this completely simpler
because back then debuggers for Java were practically nonexistent.
Oh, yeah.
Right?
If you go back before, there were, like, dev-only tools, YourKit or some of the profilers.
The granddaddy at that time was CA Introscope, which just got bought by Broadcom. There's a reason for that, because it's not very good.
Yeah.
You know?
Dynatrace did actually exist in 2007.
AppMon.
Yeah. Well, it was lowercase d, capital T.
Yes.
And it was in its infancy. You know?
Right.
I started using it in 2009 or '10, and it was still...
Right.
Yeah, so I can't even imagine.
And then Dynatrace just started spanking the competition, right?
There you go.
It's a family show, by the way.
Right.
Says the guy who drops F-bombs.
Right.
F-bombs.
It stands for a word that starts with F.
Foodie bombs.
Food bombs.
See?
Pizza.
Franklin.
Anyway, back to you, Chris.
So here's what happened.
So the chief architect and our chief information officer said, okay, you work for me for now.
What do we need to do?
Right.
I gave him two answers.
I said, number one, we're going to need some software engineers here to do a lot of work. I want you to go to every grocery store in a two-mile radius and buy all the Red Bull that you can.
This is the truth.
You are ahead of the curve on that one, too.
Yeah, because then they said, why do you want that?
I was like, because everyone's going to get motivated, and they're going to have energy,
and you're going to get more out of your team.
It's a cheap way to get more hours.
Yeah.
And then number two is we need to set up stations where when the thing breaks,
we can get all the diagnostics, right?
So we can figure out where the problem is.
For each layer.
Like person-to-person or man-to-man coverage.
We flew in the developer, by the way. We found the developer. We flew him in at $400 an hour. He's always on cheat.
Wow.
And only for him to say, you can't solve it. I have to build you a new system. And he was wanting some, like, $200,000. Right?
Whoa. That's unscrupulous.
Nobody wants that. So what do we do? Well, it's Java, and everything is based on, you know, TCP, and this point talks to this IP, just more.
Right.
So I told them, okay, we need more servers.
We're going to take copies of everything,
and we're going to rewire it and restitch it together.
Yeah.
And we're going to send it all back through the system.
Yeah.
So that we can build more of these lanes where everything was stuck.
Yeah.
Chief Information Officer says, great idea.
So what do we do?
Build more bridges.
This is before virtualization, by the way.
Right. We're talking about HP-UX, iron on the floor.
Really?
Yes. And it was about $333,000 per server.
So what do we do? We spent $10 million in hardware in four days, standing up everything on the floor.
That's unbelievable.
That number, think about that number
relative to a software licensing agreement.
We could have gone to any APM provider and said,
hey, do you have something better than $10 million?
They would have said, here's all you need for five.
Yeah, you could almost acquire an APM vendor for that.
Sure, at that time, yeah.
But we couldn't, right? Back then, you're talking about the days of Java 1.5, 1.6.
Oh, sure, yeah.
JMX was just barely on the market. Nobody knew how to use it, blah, blah, blah.
Yeah.
So we did all that, and then we had to write some scripting process to deploy it because the current deployment process only knew how to do one process.
Right, right.
So we had a team of developers doing all this just to stand the system back up.
Murderously.
After about 164 hours of my time, bags under my eyes, which are still here,
we finally got the system back up enough,
and the developers had time to finally write a debugger
to figure out where in their software the problem was.
Yeah.
And I always think about this because 200 engineers,
call it 100 hours apiece.
Sure.
Because not everybody's working the same.
Right.
And the cost per employee, $80 an hour.
The cost of the hardware, if we had had Dynatrace,
Yeah.
with Davis telling us,
here's where your problem was, we could have just rolled back.
Developers could have written a patch.
You'd have been done.
Right.
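The back-of-the-envelope numbers in this part of the story can be worked through in a few lines. The inputs are the rough figures quoted in the conversation (200 engineers, about 100 hours apiece, $80 an hour, $10 million in hardware at about $333,000 per server); the variable names are just for illustration, and the results are estimates, not audited costs.

```python
# Rough figures as quoted in the conversation; all are approximations.
ENGINEERS = 200
HOURS_EACH = 100            # "call it 100 hours apiece"
RATE_PER_HOUR = 80          # dollars per engineer-hour
HARDWARE_SPEND = 10_000_000 # dollars spent on servers in four days
COST_PER_SERVER = 333_000   # dollars per HP-UX server

labor_cost = ENGINEERS * HOURS_EACH * RATE_PER_HOUR
servers_bought = HARDWARE_SPEND // COST_PER_SERVER  # whole servers only
total_cost = labor_cost + HARDWARE_SPEND

if __name__ == "__main__":
    print(f"labor:    ${labor_cost:,}")
    print(f"servers:  about {servers_bought}")
    print(f"combined: ${total_cost:,}")
```

On these figures the labor alone comes to about $1.6 million, on top of roughly 30 servers' worth of hardware.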
And the new thing with Davis is it could predict, hey, Chris,
I predict that you're going to spend $200,000 on a developer who doesn't know what they're doing.
Absolutely.
Would you like to continue?
No effing way. Heck no. N, capital O.
Yeah.
Send it back. Let's talk about something else. That's what I would say.
Was there a release related to this event?
This was a release. This was a planned release, and it passed all the QA. And the issue there is that, just like we've always seen with all these CI/CD pipelines, if you don't build your test cases with quality gates, like what they're talking about with Keptn, if you do not have the right load, if you do not have the right data types, then is your QA process valid?
Yeah.
It's a false negative.
Right.
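The point about quality gates can be sketched as a simple check that refuses to pass a release when the test load was not representative. This is a hypothetical illustration and not Keptn's actual API or configuration format: the `LoadTestResult` type, the `quality_gate` function, and every threshold here are assumptions.

```python
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    concurrent_users: int    # load actually applied during the test
    p95_response_ms: float   # 95th percentile response time
    error_rate: float        # fraction of failed requests, 0.0 to 1.0


def quality_gate(result, expected_prod_users,
                 max_p95_ms=500.0, max_error_rate=0.01):
    """Return a list of reasons to block the release (empty means pass)."""
    failures = []
    # A fast, error-free test proves little if the load was unrealistic.
    if result.concurrent_users < expected_prod_users:
        failures.append("load below expected production concurrency")
    if result.p95_response_ms > max_p95_ms:
        failures.append("p95 response time too high")
    if result.error_rate > max_error_rate:
        failures.append("error rate too high")
    return failures


if __name__ == "__main__":
    # Looks healthy, but was only exercised at a fraction of production load.
    light_test = LoadTestResult(concurrent_users=50,
                                p95_response_ms=120.0, error_rate=0.0)
    print(quality_gate(light_test, expected_prod_users=800))
```

The design choice worth noting is that the gate checks the test conditions themselves, not just the measured results, so a release that "passed all the QA" under too little load is still flagged.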
And I'm sure you've seen plenty of that, Brian, like out in the industry.
I also used to work in performance.
Absolutely.
I was guilty of making some mistakes in the way we set up tests in the past.
And then something blows up like, oh, we needed to do that.
We live and learn.
I should have been fired so many times.
How many times do the performance teams do an extrapolation?
It goes over into production, and then the developers say, oh, it's production.
Extrapolation is evil.
Yeah.
Unless you're in like rocket science where it's like $150 million just to test.
But I don't have to do that anymore because Davis knows the answer.
Davis has the data, and now in my current role, which I can't talk about, I use Dynatrace daily.
Awesome.
And it's amazing.
And you're just getting fed the information that you absolutely didn't have back in the...
I'm faster, better, cheaper, safer, more agile, and able to get...
Right?
And I'm able to get information to my developers.
And well-rested.
Yeah, and well-rested.
And well-rested.
Time for pizza.
I do have one outstanding question.
Was the financial impact ever quantified for the individual phones, the people affected?
So I think they quantified that once it was all said and done,
I think they lost a couple million bucks outside of the numbers that were spent on the hardware.
Right.
10 million and then some.
It was a complete disaster.
Yeah.
But they recovered, got it all going.
And then I think there was some process in the background to go and smooth out some of the people who kind of took advantage of the free minutes.
And clean it all up.
Yeah, yeah.
But it was an important lesson in the delivery.
And then later on, people had some best practices.
And I just keep thinking, if I only had Dynatrace.
Yeah, there you go.
Did they lose any talent at the time as well?
Because people always talk about losing money.
Did people after that say, you know what, it's time to find a new job?
Yeah, some people probably decided that was enough for them.
They don't want to work those hours.
That was back in the days when AT&T did their Siebel 7 implementation.
It was a disaster, and the CIO was firing people on the spot.
This is in Washington State, Seattle area. T-Mobile wasn't that mean, but it certainly wasn't nice when you heard executives yelling at other executives and you stand there going, I have no idea what to do.
I used to live in Bothell, and there's the T-Mobile offices. You did the 164 hours down there?
No, I did it in Factoria.
Okay, yeah.
But I used to work in the Bothell data center up there.
Yeah, it was nice. I walked my dog right past there.
Yep, yep.
Really cool.
Awesome.
Chris, thank you very much for the story.
No problem, Mark.
Thank you so much.
If you have a business card, we'll enter you to win one of our really awesome micro drones.
Okay, no problem.
Brian, always a pleasure.
You as well.
And do we have any other testimonials?
You spread the word.
Yeah.
If people come on as you wander around, we'll be around taking more testimonials.
No problem.
Thank you for having me.
Enjoy the rest of the conference.
Will do.
Thank you.
Bye-bye.