Software Misadventures - David Henke - On building a culture of "Site Up" at LinkedIn and Yahoo! - #3
Episode Date: January 23, 2021
David is LinkedIn’s former SVP of Engineering and Operations. He came out of retirement to join LinkedIn in 2009 during a time of rapid growth. After 4 years at LinkedIn, he retired in 2013. Throughout his career, David has been in multiple leadership positions and has been recognized as one of the best Operations Executives. This was an extremely fascinating conversation. David shares insightful stories from early days at LinkedIn and what it took to develop the culture of “Site Up and Secure”. He shares one of the most severe outages he has experienced in his career - this one was at Yahoo!, which he calls the 10g massacre. We talk about David’s 3 retirements throughout his career, his advice on developing operational excellence and lessons on being an effective leader. Throughout this conversation you’ll also hear various nuggets of wisdom from David, better known as Henkeisms. Please enjoy this highly entertaining and deeply insightful conversation with David Henke.
Website link: https://softwaremisadventures.com/henke
Music Credits: Vlad Gluschenko - Forest. License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Transcript
The old way of doing operations, where you had engineers and they would hand over a fence to
the operations team and the operations team would run it, doesn't work in the internet.
It doesn't scale. It's not fast enough. It doesn't deal with issues fast enough.
And so I was actually a good candidate to be somebody to learn about operations.
I actually find the best operations personnel are people that were engineers first
and then became operations personnel. If you're strictly operations and you don't know how the
code works, then you don't know how the internet works.
Welcome to the Software Misadventures
podcast, where we sit down with software and DevOps experts to hear their stories from the
trenches about how software breaks in production.
We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply.
When most of us started in this field, we didn't really know what to expect,
and wish there were more resources on how veteran engineers overcame the daunting task
of debugging complex systems.
In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders.
Hey everyone, this is Ronak here.
For this episode, Austin and I had the honor of speaking with David Henke.
David is LinkedIn's former SVP of Engineering and Operations.
He came out of retirement to join LinkedIn in 2009 during a time of rapid growth. After four years at LinkedIn,
he retired in 2013. Throughout his career, David has been in multiple leadership positions and has
been recognized as one of the best operations executives. This was an extremely fascinating
conversation. David shares insightful stories from his early days at LinkedIn
and what it took to develop the culture of site up and secure.
He also shares one of the most severe outages he has been a part of.
This was at Yahoo, which he calls the 10G massacre.
We talk about David's three retirements throughout his career,
his advice on developing operational excellence,
and lessons on being an effective
leader. Throughout this conversation, you'll also hear various nuggets of wisdom from David,
better known as Henkeisms. Please enjoy this highly entertaining and deeply insightful conversation
with David Henke.
David, it's such an honor to have you with us today. Welcome to the show.
My pleasure. Good to be here.
So while preparing for this conversation, I was reading about you. And when I went on Google and searched David Henke, one of the first hits I got was an image of you pointing at a screen with
some metrics and graphs. You were pointing at the screen, but you were also looking at the camera. And what appeared is that you were screaming at the camera. So I thought we would start with
asking you about the story behind that picture.
Okay, well, you're referring to a poster that
said, this is my freaking site. And you can substitute freaking for whatever word you
would like, but I can assure you it's nastier than freaking.
By the way, just so you know, in India, they made us change it to, it's my precious site, because they were offended by freaking.
What happened was, site up, which is something that's important, especially to a company the size of Yahoo, with so many properties and so many important pages, was not going well.
And they wanted to bring attention
to the site. And at the time, I was running production operations, which is really the
data centers, the networks, the 28 pops and the 35 data centers, and also the 1 million computers.
But the sites were actually run by different groups like Messenger and Search and Mail and
so forth. And things
weren't going well. And they wanted me to bring attention to the site. So I pointed to something
that matters to me greatly, which is sponsored search, which was my first job at Yahoo, which is
how we made half of our money. And if you know anything about calculus, the area under the curve
is money. Right? And we don't want to lose money. And we were losing money because we had some
sponsored search problems. And I was trying to make everybody or remind everybody, it's your
freaking site. And site up does matter because that's the business we are in.
Yeah. Well, site up does matter. And it is something that we are going to dig a lot into.
But remind me, was this poster also kind of stuck on walls in various buildings of the company?
This poster, unfortunately, I don't really, I'm not a very photogenic person, and I'm not a very good speaker.
But I will tell you, my ugly mug was in every floor of every building at Yahoo, and it was in every data center.
And people who did not know me would go, is that you?
Man, you are one ugly son of a gun.
I would disagree that you're not a good speaker.
I think you're a great speaker, as we are seeing here today.
Well, thank you for that.
I appreciate it.
So you've had an amazing career in tech, and you've had a huge impact on companies like
LinkedIn and Yahoo, AltaVista, Silicon Graphics.
One thing that while I was researching, I learned that you retired three times throughout your
career. Can you tell us more about these times?
I can. I'm trying to think how to do this nicely.
So I left Silicon Graphics after eight years. And Silicon Graphics was arguably the greatest company I ever
worked for. Jeff Weiner sometimes gets mad at me because LinkedIn really is a great company.
But Silicon Graphics, just the fact that I was there for eight years, I thought we had the best
engineers on the planet. And they left to go to companies like Google, like Netscape.
But when I left after eight years, it was because we had just bought Cray computers,
and that was becoming the largest player in a dying market, because Google had shown us that
the large number of small computers was better than this mammoth supercomputer. So I just quit.
And I had enough money. I always made money. I was good at that. And I just said, I'm done.
And I sat on the sidelines for a while. And a friend of mine who I had worked with at Silicon Graphics called me up.
And he said, I'm at an Elon Musk company.
It's his first company.
It's called Zip2.
Oh, right.
We did door-to-door directions.
And we did newspaper sites for the New York Times and the Mercury, San Jose Mercury.
And he wanted me to run operations.
And I had never run operations
before. I'd always been an engineer. So I said, what the heck? I'm sitting on my butt. Might as
well go get another job. So I went to work for this little company called Zip2, Elon Musk's
first company. We sold it for $310 million, which got him into the business. But also I made some
dough. We sold it to Compaq, which at the time owned AltaVista.
Okay.
And that's how I got into AltaVista.
And then after four years of that, I retired.
That's retirement one.
I retired because my children were the age of high schoolers,
freshmen and sophomore in high school.
And it's the one time your kids do not want to be with you.
Young people won't know this yet, but trust me, they don't want to be with you.
But I wanted to be with them, so I retired for three and a half years,
and I did not work. I took karate.
I took Spanish.
I took real estate courses.
But I also spent a lot of time with my kids.
And then I got another call, and this was from Yahoo.
And that's when Yahoo asked me to come back and help them with their sponsored search,
which was Overture, a company they had bought.
And that was responsible for half their dough.
And it was a hell of a thing, and we can talk about that later.
But I came down and did that for two years, and then I did the production operations for Yahoo for two years.
And at that point, Yahoo capitulated on search and search monetization, which I was very disappointed with. And so I retired again. And then within a few
months, somebody I had worked with at Yahoo, a guy named Jeff Weiner, who was the CEO of LinkedIn,
and now is chairman of the board of LinkedIn. He called me and said, I want you to hear about
this company. And I said, you know what, Jeff, I think I'm not going to be working anymore. Yahoo kind of left
a bad taste in my mouth. And he brought me in and I talked to Reid Hoffman. I talked to all the
engineers. But the most important guy I talked to was the CFO. He said, let me explain this
business that we have based on this data that we have, this incredible set of data
and what we're trying to do. And it just blew me away. So I thought, what the heck, I might as well try
it one more time. And that was arguably probably the best business decision I ever made. And I
think looking back, LinkedIn is the best company I've worked for.
We are extremely grateful that you chose to work at LinkedIn.
I think Austin and I here have a job probably because of you, because you built in that
site up culture at LinkedIn when you came in.
And before we go there, you mentioned that you were always on the engineering side and
operations is not something that you had at least led before.
So when you went to Silicon Graphics, or sorry, Zip2 at the time,
did you know you would enjoy this? Or were you passionate about operations engineering? Or
did you just want to take a chance?
I actually didn't know if I would enjoy it. And in fact,
I did not enjoy it. I don't like it, as many of you don't, when things do not work. And I will tell
you, things did not work at Zip2. It was one of those
Microsoft shops. It didn't scale. I was a good enough programmer and coder that I actually
reviewed changes before they got on the site. Because at that point, I was like, this isn't
getting onto my site. I remember Elon Musk himself trying to write code to get on the site. Not a good engineer. A very great entrepreneur, but not a good programmer. And I will say, I didn't know a lot about data
centers. I didn't know a lot about networks and networking. I didn't know a lot about the scale
and the issues. The beauty of AltaVista, when we took that over, that was a large problem. It was a large search engine before
Google, but they used the big DEC Alpha machines, and that wouldn't scale because of cost.
And so you learn a lot real fast. And I realized that engineering and operations really need to
work very closely together if you're going to scale on the internet. And that's a lesson a lot
of little companies need to learn, including LinkedIn when I joined them.
Yeah, it's pretty amazing that
you learned all of that on the job. Also, you mentioned that for the first
many years of your career, you were a programmer. But then you went from the IC
track to the management track. How did that happen?
Well, that's a long story.
So I was a founder in my two startups, and I was the principal programmer.
The first startup, I wrote 73% of the code.
Oh, wow.
And it was all C programming back then, if you were interested in that.
When I got to Silicon Graphics, I thought I was a real hotshot programmer.
And then, I would say, out of the 200 programmers that were there, I would rank myself 197. And it was a little humbling.
And I thought, well, I can look at this one of two ways. I can feel bad about that,
or I can feel really good that I can learn from these other 196 really smart people
that are way better programmers than I am. And that's what I chose to do. I worked for the tools group and also compilers.
So I was involved with the C++ compiler.
I was the one who brought Java over from Sun,
what a piece of shit Java was at that time, 100 times slower than C.
We brought Purify into the company, from Pure Software.
But at the time, I was still an individual
contributor. And we were supposed to get to a 64-bit computing model. We were the first ones
to do it in the major computer makers. And everything worked except the tools and the
compilers. And our group was choking on the tools. I wasn't part of that exercise. So they came to me
as an individual contributor. And they said, we need you to work on this problem.
I said, well, what if I don't want to work on it?
And they said, then you're fired.
And that's the way Silicon Graphics was.
So they made me manage a group for the first time in my life of people,
and I put them into two shifts working around the clock to get our debuggers and our performance analysis tools
and our C++ front ends and all these systems to work with the 64-bit computing.
You would think this would be an easy problem coming from 32 bits to 64 bits.
It's a really hard problem.
And we worked two shifts around the clock, day and night, for three months,
created a minimum viable product, and all of a sudden,
Silicon Graphics could ship their 64-bit computers. And I realized, as an individual contributor,
I can do this amount of work. But I just got these gentlemen and these ladies to work really hard on
this problem, and we were heroes. And Silicon Graphics was really good about taking care of
their heroes. They sent me to Hawaii for five days
in a five-star hotel with presents on my desk every day,
plus bonuses, of course.
That sounds amazing.
A very good company to work for that way.
But anyway, that's when I became a manager
and decided that I could do more with that.
Now, if you ask a person like me,
and especially Kevin Scott,
who's the CTO of Microsoft and whom I hired for LinkedIn, he would rather be programming in Python right now than doing anything.
So once you're an engineer, you're an engineer, and you have to give up something to be a manager or leader.
So don't forget that when you're making that transition.
So as you got into management, I'm sure at some point you would also miss writing code.
But what aspects kind of
got you going to continue on the management route?
Well, let's get back to the code
part. I still coded. So for example, when we did the transition to 64 bits, I wrote all the
test cases. When I was the manager of the group that brought in Purify, to get Purify to work,
I wrote the Purify torture test, everything you could do wrong in C programming. And it was so
good that the Pure Software people made me an honorary member of their engineering team. Now,
I was the manager at the time, but I was writing the test code. Even when I was at LinkedIn,
we had hack day, and I wrote a hack that basically scraped
all the members of LinkedIn. At the time there were 140 million. And I could tell you their names and
their titles and their companies. And I scraped it without having access to LinkedIn directly.
And then I handed it as the hack to my security team, who worked for me, of course, and they were
quite embarrassed. And then they wrote a thing called Sentinel, which fixed this problem. So dummies like me couldn't scrape
LinkedIn. So I kept my hand in. But at the end of the day, you have to decide, you know, you're
going to trust your team. And I always trusted the team. And now, you know, at this point,
even when I left LinkedIn, everybody was smarter than I was, which is great. That's the way it
should be. Yeah.
Well, so going from the code part, what got you going on the management track?
Like, what did you like about it?
What I liked about it is the ability to handle and do many more things.
So I could get a lot more work done and accomplish a lot more if I could direct traffic. And at that point, I became like a coach.
I don't know if your audience knows this, but I'm a big fan of the Los Angeles Lakers.
Sorry about that, Warriors fans, but I grew up here and I have season tickets.
Well, you make a championship team not by just having LeBron and AD on the team.
You make it by having rebounders and defensive specialists.
And I assembled those teams. At LinkedIn, when they did the going away for me, there were 22 LinkedIn employees in the room. Out of those 22, five of them were there before I got there. The other 17 I hired personally, including Kevin Scott and Bruno. And that was my team. And that's what I realized. With that kind of a team, you're going to win championships. You're going to win.
I love that. Makes sense. So I'm very eager to jump into your time at LinkedIn, but before I do that, you mentioned that when you went to
AltaVista, operations was something that you learned on the job, like learning about
networks and data centers. Well, first of all, it sounds challenging. And throughout your career, you were still
writing code and just staying close to the ground while managing a big organization.
What did that learning on the job about data centers and networks look like?
If I imagine, it just sounds hard to do all of that.
Well, when you realize that at the end of the day, it's either hardware or software,
and what you'll find, and this is where I have friends in this business.
So my first startup, we wrote software to design integrated circuits and PC boards.
So I have some background in hardware engineering.
But the people that know about hardware and software combined make for good people.
The old way of doing operations where you had engineers and they would hand over a fence to the operations team and the operations team would run it doesn't work in the Internet.
It doesn't scale.
It's not fast enough.
It doesn't deal with issues fast enough.
And so I was actually a good candidate to be somebody to learn about
operations. I actually find the best operations personnel are people that were engineers first
and then became operations personnel. If you're strictly operations and you don't know how the
code works, then you don't know how the internet works.
Oh, yeah, a huge plus one to that.
Well, so talking about your time at LinkedIn, you mentioned the third time you came out of retirement
was when you came to LinkedIn
and led engineering and operations.
Yes.
Every SRE at LinkedIn has heard this,
that the number one priority after talent is site up.
Yay!
That took a while to get right.
Yeah, and for every SRE who joins LinkedIn, from the get-go, from the boot camp to every post-mortem they go to after an outage,
or when they are speaking with a team about a new design, site up is a culture that's talked
about a lot and it's something that's attributed to you. And many people at LinkedIn who are still there since your time talk about site up.
Yeah, that comes from David Henke.
And I was reading about you and it seemed like that's true for your time at Yahoo as well.
And I realize we probably can't go into all the details, but I would love to know what did it take to bring that culture at a company like LinkedIn?
Like you mentioned, it took a while to get there.
What major changes needed to happen?
I think that you used the right word, culture.
So let me tell you what happened.
I come from Yahoo, and Yahoo and Google know how to run at scale on the internet.
Good for them. And that was a good experience. Come to LinkedIn, the fucking site is down every day. Every day.
Okay? They had a word that I never heard of called throttling.
You know what throttling means?
Where I come from, it means you're on your motorcycle and you're pulling back and you're gassing it, right?
That's what throttling means to me.
That's not what it means to these guys.
It means you're throwing bits on the floor because you can't handle this many requests.
Never heard of this term before, okay?
This happened every day.
And I'm like, this is not good.
This is not a good outcome.
Why is this happening?
And then you start exploring it.
So the first thing we did at LinkedIn was we had a daily operational meeting.
And the beauty of a daily operational meeting is I get to hear everything that's wrong.
And unfortunately, there's many things that are wrong at this point.
But unlike Jeff Weiner, who was probably the greatest QA person that LinkedIn
ever had, all 30 of the things that he reported that day aren't going to kill me. But these three
things are. And let's go over these three things and figure out what we're doing. At that point,
you're bringing attention to the problem. Then you're getting people to realize, you know,
if we weren't having so many problems, maybe we could spend more time working on things that are of more interest to us
so we can get out of some of these problems.
The other problem was just the sheer culture of the organization.
LinkedIn was a product company.
Reid Hoffman's a wonderful guy, and he's a product guy.
Jeff Weiner's a wonderful CEO, product guy.
Deep Nishar, a very technical guy, but he was the head of product at LinkedIn from Google.
And so you have three of the most powerful guys at LinkedIn, and they're all product guys.
There's an old adage, quality, schedule, features.
Pick two.
Engineers typically pick quality and schedule.
Not always, but typically.
Product people typically pick quality and schedule. Not always, but typically. Product people typically pick schedule and features.
So there's this natural tension between quality, schedule, features, pick two.
And, of course, when I asked Jeff Weiner this question, what do you think he said?
I want all three.
That's not the deal, Jeff.
That's not the problem we're trying to solve for here.
Bottom line, LinkedIn didn't treat site up as important.
And now we did.
We had to.
Because if you can't keep the site running and the service running, who gives a damn about this next feature?
Okay?
I'm a believer in growing fast.
And I think that's a lesson I learned from Mr. Hoffman, Reid Hoffman.
Go as fast as you can.
Grow as fast as you can.
But the site's got to work.
The other thing that was important to me was security.
And, you know, when 500,000 of your users have the password 123456, that's not a very secure system.
And you want to start building that out as well.
Now, we weren't moving money at LinkedIn like a bank, but we still wanted to
make it as secure as possible. So I wanted site up and security to be on top after our talent.
Talent always comes first because without the people, you can't do anything. Then that,
and then everything else. Now, that doesn't mean we spent all our time working on site up and
security. What it means is when push came to shove,
that took priority. And that was a difficult and time-consuming argument with Jeff, and with Deep,
and with Reid, and with the rest of LinkedIn, because LinkedIn wasn't used to that. But we
figured it out. And it was in our best interest. And at the end of the day, it really, really paid off.
So I want to touch on what you said: quality, schedule, features, pick two.
It's a challenge, especially during high growth times. Like if you're spending too much time getting the perfect technological solution,
but not for the right product, you won't survive.
On the other hand, if you build the right product, but like you said, if the site is not working,
that doesn't work either.
So while you want to grow as fast as you can,
but still want to keep the site up,
how have you seen successful teams manage this?
Well, what you realize is,
you look for the things that are killing you.
So if you're constantly fighting the site,
then you really don't have time to add new features, right?
You're just not going to do that. The other thing that was killing LinkedIn was the release process.
Again, before your guys' times, but we used to release every two weeks very badly.
It was a huge Java monolith.
Remember, I'm the guy that ported Java from Sun to Silicon Graphics when it first came out.
What a piece of shit.
And the memory management model was horrific.
It's much better now.
Bottom line is that we shipped this big piece of junk every two weeks, and we had to fix it forward.
Sometimes we were there till the next morning,
sometimes till the next afternoon,
just trying to get our site to work again.
And it was horrific, right?
So we finally went to the product people
and to the rest of LinkedIn, the engineers,
and said, we want to redo this.
We want to re-architect how we deploy.
And we want to do this in a way that at the end
of the day, we will all be better off. And we had four principles. The project was called Inversion.
Everything at LinkedIn is "in" something. It's kind of a nice name. Inversion. So we had four
principles. One, trunk development. Everybody checks into the trunk. This is the way it should
be. This is the way we did it at SGI. By the way, if you broke the build at Silicon Graphics, you're fired. Okay? So anyway, don't break the build
because we're all checking into trunk. Two, got to be 24-7 testing against the trunk. We're
constantly testing against it because we're all using it as the common base. Three, canary in the
coal mine deployment. Instead of deploying to all 100 nodes,
I'm deploying to one. If that works, three. If that works, five. If that works, 10. If that works,
30. If that works, 70. If that works, all 100 nodes get deployed to. At any moment,
if it does not work, undo. You have to be able to undo. And with those four principles and the machinery behind that
called inversion, we changed how code was delivered, deployed, tested, undone,
and the pace at which this was all done at LinkedIn. People could release anything at any moment.
And if it didn't work, we undo it.
This makes the engineers happy.
They can go as fast as they can.
This makes the ops people happy.
If it doesn't work,
we're undoing it. This makes the product people
happy. We're shipping more
features than we were before.
This makes the sales people happy. We're making more money.
Everybody's happy. And I will tell you, that was a very good thing for us to do.
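Editor's sketch: the staged rollout David describes (1, then 3, 5, 10, 30, 70, and finally all 100 nodes, with undo at any failure) could look roughly like the Python below. This is not Inversion's actual implementation. The function name `staged_rollout`, the `deploy`, `healthy`, and `undo` callbacks, and the reading of the stage numbers as cumulative node counts are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Stage sizes as quoted in the conversation, read here (an assumption)
# as cumulative node counts for a 100-node fleet.
STAGES = [1, 3, 5, 10, 30, 70, 100]


def staged_rollout(
    nodes: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    undo: Callable[[str], None],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Deploy to a growing prefix of `nodes`, one stage at a time.

    After each stage, every node deployed so far is health-checked.
    On the first failure, all deployed nodes are undone in reverse
    order and False is returned. Returns True if every stage passes.
    """
    deployed: List[str] = []
    for target in stages:
        # Extend the rollout up to this stage's cumulative target.
        for node in nodes[len(deployed):target]:
            deploy(node)
            deployed.append(node)
        # "If that works" gate: all deployed nodes must look healthy.
        if not all(healthy(node) for node in deployed):
            for node in reversed(deployed):
                undo(node)
            return False
    return True


if __name__ == "__main__":
    fleet = [f"node{i}" for i in range(100)]
    live = set()  # hypothetical stand-in for "nodes running the new build"
    ok = staged_rollout(fleet, live.add, lambda n: True, live.discard)
    print(ok, len(live))
```

The point of the small early stages is that a bad build is caught while it touches only a handful of nodes, and the undo path is exercised by the same machinery that does the deploy.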
Yeah, it's really interesting to see how Project Inversion came about. And I can imagine
that this was the right thing to do for LinkedIn at that time. And it's not easy to make that call and say like,
hey, this is what we have to do.
And a lot of, I can imagine a lot of engineers coming in and saying,
well, I thought I was going to work on, you know,
like the latest and greatest technology.
Why am I being put to do these other tasks,
which I thought would be done already?
And I can imagine someone in your position
has to be able to keep such a large group of engineers motivated.
Okay, so to your point, think about not just the engineers, but think about the product people, right? Their
whole world is new features, new product development, and they were effectively put on hold,
largely, for six months, right?
So what I'm curious about is, how were you able to influence them
to kind of convince them and keep them motivated to be like, Hey, this is, we have to do this.
And this is how it's going to pay out. Like in the short term, I understand this is going to
really, really suck, but you're going to really enjoy it, you know, after this time.
That's right. And first you go to the leadership, right?
Make no mistake. We went first to Deep Nishar, the VP of product. And the good news is Deep's a
pretty good engineer. And so he understood the mess we were in. Then we went to Jeff Weiner.
And the nice thing about Weiner is he spoke every two weeks in front of the whole crew,
every two weeks. And I always asked
him why. And he says, because you have to say something 42 times before anybody will remember
it. That's one. And two, he said, we got a lot of new people. They've never heard any of this shit.
We got to do it over and over again. Well, in this case, he's going to speak to the whole company
about inversion because we need him to. We need to get buy-in from the product people,
from the finance people, from the sales people, as well as from the engineers. And we did.
Nice. So I want to pivot now. So we love discussing stories about production outages on the show,
and also the lessons learned. I imagine, leading up to Project Inversion, and probably during it,
there were probably many of those.
So you've seen many production outages, not just at LinkedIn, but also at other companies.
Could you share maybe one or two of these war stories from your experience?
Yeah, some of these actually bring back very bad memories for me.
So if I start crying at some point, you'll understand. So probably
the greatest outage in the history of Yahoo that I was on board with was what I call the
10G massacre. So 10G is Oracle, Oracle 10g. And at the time, again, we had a legacy system that was failing,
that was responsible for half of Yahoo's money.
And the 10G massacre went from SPARC to...
We were going to change all of our databases.
And we're going to go from SPARC to Intel, single computer to RAC, 32 bits to 64 bits,
Solaris OS to Linux OS, big-endian to little-endian in the
byte representation, EMC to NetApp storage, and finally Oracle 9i to Oracle 10g.
This was our migration path.
Not all in one step though, right?
Remember, I had been at Yahoo for one month just trying to keep the site going.
And this had all been tested.
I was assured by everyone this was all tested.
And we had certifications and all this stuff.
And like anything else, when you make that many changes all at once, it's probably not going to work.
So I called it at the end of the day the 10G massacre.
We loaded it up on a weekend, which was typically our non-traffic time.
And by the morning when agents started coming online on Monday morning,
things turned to shit in a hurry, and it was bad.
And effectively what we learned was this version of Oracle on
Linux in this environment was beta at best, and it was crashing constantly. I now know what an
ORA-600 is. Never knew that before, but you got to know that because that's basically you're screwed,
and you don't know why. But I'm going to preserve the data if I can.
That's an Oracle fatal error.
Anyway, it took us two months to get out of this nightmare.
Two months.
During this time, I would get up at 7 in the morning, walk to work,
and I would leave every morning at 2.
So I got five hours of sleep, max, and I was living in Pasadena at the time, because that's where Overture was.
And at 2 a.m. in the morning in Pasadena, there's only two groups of people that are out there.
And they would ask me, can I have a cigarette? Because I used to smoke at the time. Sure.
And then they would ask me, why are you out here so late? I said, well, it's hard to explain,
but I'm responsible for half of the money at a company, and we've just shot ourselves in the
foot, if not the head. And then they started asking me personal questions like, what's my name?
Because they would see me every night.
After about three weeks, they said, David, how's it going today?
Any better?
And I said, well, you know, we stayed up for most of the day and so forth.
After five weeks, they said, we have a good feeling about this, David.
At this point, I'm buying them beer and cigarettes and shit.
And remember, I see the same crew every night at 2 a.m.
After two months, we sorted all this out with a lot of help from Oracle and experts from a lot of places,
including a lot of people flown down from headquarters in Sunnyvale for Yahoo.
And at the end of it, I went to a liquor store, bought five cartons of cigarettes and as many bags of booze as I could carry. And I took it out there at 2am to my friends. I said,
I will never see you again, I hope. And I never did. That's the 10G massacre.
Yeah, that sounds fairly horrifying. And I'm glad that you guys were able to get out of that
at some point. So you wrote a series of blog posts
probably quite related to this
on the LinkedIn blog titled
Every Day is Monday in Operations.
As I was reading through one of those posts,
you were talking about the Panama Project,
where you wrote one of the axioms
that goes like,
go to work every day willing to be fired.
Can you elaborate more on this or share any related stories?
Well, I can.
So Panama is, as you know, the Panama Canal is probably one of the greatest construction
efforts in the 20th century.
It's an amazing story.
I recommend to all your listeners to read it.
It's like the 1,000-page book in one of my bibliographies on this topic.
But we called our project to rebuild sponsored search for Overture at Yahoo "Panama."
And it was very similar in terms of many things.
One, in the Panama Canal, they had to keep the workforce alive because of yellow fever and because of malaria.
And they didn't know the cause of it.
In our case, we just had to build a team.
When I got to Yahoo, there were 27 people working on Panama. When I ended up, there were 500 people
working directly on Panama. The second problem was how to engineer the solution, and we had to create
brand new engineering and solutions to make this work and infrastructure as well, just like they
did on the Panama Canal. For those of you who
don't know it, the Suez Canal connected two seas, but the Panama Canal connects two oceans.
And they literally had to build a lake at the top of Panama, get the water in there, and use that to
float the boats and to lower the boats in the locks. And that's the engineering solution that
worked. But the other thing to remember is the Panama
Canal was a long-term project. The French had started it. They dug up one-third of it and quit,
and it almost bankrupted France. And then the Americans took over because of military
reasons, and Teddy Roosevelt was smart enough to know that we needed this,
if nothing else for our military. But you can't give up. So there's a famous part where
they're cutting what's called the Culebra Cut in the Panama Canal. And it's a very mountainous area.
And once again, because it's a tropical rainforest, it fills in with mud and water. And the engineer
goes to the chief engineer, he says, what do I do now? We just filled up the trench, the Culebra Cut. By the way, culebra means snake in Spanish.
We filled it up again. What are you going to do? And the chief engineer says, what do you think
you're going to do? Dig. Well, that's what we had to do on the Panama Project. It took us one and a
half years to do this. Now, getting back to go to work every day willing to be fired, right near the end of the Panama Project, the boss shows up.
The CTO of Yahoo, who shall remain nameless at this time, but he's my boss.
And I got all my lieutenants in there.
And I had written this 25-page spec for what it meant for acceptance criteria for Panama, right?
And he says, we've got to ship now. I said,
but we're not, we haven't checked off everything on the 25 page spec. So he starts taking things
off my list and I start getting angry and this is not, not good. And I will say I didn't behave
well. I took my badge and I threw it at the big boss and I walked out of the room and I quit.
Because he was trying to undo my list in front of my staff.
At the end of it, he came and we talked and we both apologized to my staff.
Because neither of us handled that one very well.
Bottom line, we didn't relax the criteria for releasing it,
and we did a hell of a job. And I'm still proud of that project and very proud of the people that
worked on it. Yeah, that's really great to hear. Yeah, you mentioned like you work with many
engineers even at that time. And I imagine a big part of your role has been to grow them as well and to actually see key qualities that you're like,
this is what I want to see in a really good engineer.
What kind of key qualities have you kind of discovered
from a variety of engineers that you feel like
they're going to go places and do great things in the future?
Well, again, obviously you've got to have the smarts,
but that's not enough.
Necessary, not sufficient.
So, believe it or not, culture once again comes into play, right?
You could be the smartest guy on the planet, or a gal,
and if you can't get along or figure out how to get along with this team, you're off the team.
Some engineers also have a certain knack for exploring or thinking
about what could be. And you're always looking to them because those are the ones that are going to
take it to the next level. And that's another trait that I look for. By the way, I interviewed
almost every engineer we hired in the old days. I spent 35% of my time in LinkedIn hiring. Imagine that.
Remember, I was there all day, all night, but still, 35% of my time hiring people.
And I would talk to junior interns as well as senior leaders.
It didn't matter to me because everybody that joins this company, that is the number one priority for us, hiring the best and the brightest,
but also the cultural fit. That does matter.
Since you talked about your boss a little bit in the previous conversation,
in what Austin was talking about during the Panama Project, I was looking at a LinkedIn profile,
and Kevin Scott, CTO of Microsoft, actually has a recommendation for you and one thing that he says
is that you're the best boss
he's ever had.
Yeah, that actually pisses off
all his other bosses.
That's okay.
I'll take it.
I have only one other recommendation
that I post, by the way.
That's a guy named Qi Lu.
Yeah, and I have a question
on that one too.
Go ahead.
Okay, go ahead.
I just want to make sure we mention him because everybody's got a hero in this business,
and my hero is Qi Lu.
Okay.
Why don't you tell us why?
When I got to Yahoo, Yahoo was what I call the loose confederation of warring tribes.
You had a lot of really smart people working on a lot of very different things, not necessarily together. Qi just wanted to do search
and search monetization, and he was really good at it, and he attracted people to do it. So his team
was very loyal to him, and I eventually worked for him at the end of the day in search and search monetization. He's the smartest,
most humble, hardest working person I've ever met in my life, period. He literally, you can call him
up every hour of the day, and I tried this. I even had a cron job to do this because I was too lazy to
stay up, just to see whether he would respond.
All but four hours of the day, he responded.
Anyway, that's Qi.
So I'll get back to my question about Kevin Scott's recommendation about you being the
best boss.
But since you mentioned hard work, also on these recommendations, both from Kevin
Scott and Qi Lu, I actually want to
read out just a couple of lines from both of their recommendations. So this is from Kevin Scott:
"I knew within a day of working with him that David's passion and commitment to his work and to
his employees were almost superhuman." And Qi Lu says something along similar lines:
"Ferocious intensity is another hallmark
of Henke.
I still remember the days where Henke can fight off a major site outage with his enormous
willpower."
So you mentioned Qi Lu was one of the hardest working folks, or the hardest working guy, you
worked with.
My question for you is, these two people that you respect a lot are saying that your abilities
with the intensity you came in with and the passion you have are amazing.
So how did you develop this passion and intensity that you brought to work every day?
Well, that's a good question.
By the way, I'm not like Qi Lu and Kevin Scott.
They're both introverts.
They're both a lot smarter than I'll ever be.
But I don't like it when shit doesn't
work. It really bugs me. And I don't like to fail. You know, I always say this to people,
two kinds of people. Do you love winning or do you hate losing? Ask yourself that question
sometime. I hate losing more than I love winning. Doesn't mean I don't
love to win. I just hate losing. And I hate it when things don't work. And, you know, the nice
thing about LinkedIn is there was no end to it there. And Yahoo and Overture, no end to it there
either. So let's go fix that. And I was often brought in to fix things like that. And I know
what to do. In terms of willpower to overcome it,
that's not me, that's the team.
What I had to do was get the team in place
and make sure everybody knew what they were going to do.
And once you get enough people marching in the right direction,
you'll solve any problem.
Well said.
And coming back to my question about Kevin Scott's recommendation for you being the best boss, I know I've repeated that now three times. I don't know how many people that has pissed off.
You can't repeat all the reasons he might have said that, but you can interview Kevin someday and maybe ask him that question.
Oh, yeah. We would love to have Kevin on the show as well.
He's quite a character. Let's get him going. We'll just leave it at that.
Okay. So my question actually is, what do you think makes for a good boss?
A good boss is
first of all, somebody that listens
to you. Listening
is a very underrated skill.
You want somebody that actually
hears what you're saying,
right, but also can
speak it back to you. So this gets back to one
of my deficiencies of communication. Email's not communication. Texting is not communication. IRC
chat is not communication. Communication is when you and I are talking to each other,
I say something to you, you can say what I said back to me, and I have to agree that you got it
right. That's communication. I had to learn that the hard way in marriage counseling,
and that didn't work. But anyway, bottom line is, communication is really hard.
And believe it or not, I would use this trick with some of my teams.
Okay, person B, you need to talk to person A, say what he said, and make sure he agrees that you've got it right. You would be amazed how hard this is. And that's something we need to focus on.
The other thing that I used to do as the boss was I wanted to know what my employees wanted to do.
Look, what is it you want to be when you grow up? What is it you want to do next?
What is it you want to work on? So I don't wait every six months or
one year to have a talk like that with them. I wait maybe a week at the max, and we'll hash
it out. And by the way, there's usually three things, but no one wants to hear more than three
things that you want them to work on. Trust me, a fourth thing and their eyes are glazing over
and they could give a shit. So pick three things. And instead of saying you suck at this,
say, what if you did this instead?
Maybe we can talk about that.
What if this, for the three things
that we're trying to work on?
Because there's always something to work on.
And that's true for all of us.
Yeah, yeah, that's absolutely true.
Yeah.
So you've been in a lot of leadership roles at this point and
worked with many, many leaders as well. I'm pretty sure there's a ton of parallels with
being a good boss, but I imagine there are other individual contributors
who also have the leader attribute, just not in a
managerial position. So what, to you, makes a great leader
beyond, like...
I know. Well, let's talk about individual contributors for a second. I'm going
to turn your question a little bit because it's a good LinkedIn story as well. So when I got to
LinkedIn on day two, I asked the VP of operations at the time, show me the DR site. Tell me about
it. How long will it take me to cut over
if I lose this data center? And he says, I don't think you understand. And I said, I don't think
you understand. It's a simple question. Is it going to take four hours, eight hours, 12 hours,
a day? And he says, I don't think you understand. I said, I don't think you understand. He said,
well, we have 90% of the computing we need in the second data center,
the DR site. I said, that's fine. We can buy the other 10%. I said, what else? He says, well,
we don't have any software on those computers of our software. We don't have any data replicated
on those computers, and we don't have any configuration parameters set up. Basically,
they're just machines cooking the machine room. I said, so basically what you're
telling me is we don't have a DR site. Is that what you're telling me? And he goes, that's right.
So by the way, that's not good. I had to go right to the CEO and right to the board of directors and
say, if we lose this data center, we are out of business. Everyone nod your head. Everyone
understands this, right? Okay. So fast forward, we built a DR site.
Thanks a lot to Neil Pinto and a whole bunch of people.
And we built it from scratch.
And it worked.
We know it worked because we cut over to it in about eight hours.
Pain in the ass, but we cut over to it.
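The gaps described in that exchange (compute, software, replicated data, configuration) can be written down as a simple readiness check. This is a minimal sketch only: the `DRSite` fields and the `dr_gaps` helper are hypothetical names made up for illustration, not LinkedIn's actual tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DRSite:
    """Hypothetical snapshot of a disaster-recovery site's state."""
    compute_fraction: float   # share of production compute available (1.0 = full)
    software_deployed: bool   # is our software installed on those machines?
    data_replicated: bool     # is production data being replicated there?
    config_applied: bool      # are configuration parameters set up?


def dr_gaps(site: DRSite) -> List[str]:
    """Return the reasons this site could not take over production today."""
    gaps = []
    if site.compute_fraction < 1.0:
        missing = 1.0 - site.compute_fraction
        gaps.append(f"short {missing:.0%} of required compute")
    if not site.software_deployed:
        gaps.append("software not deployed")
    if not site.data_replicated:
        gaps.append("data not replicated")
    if not site.config_applied:
        gaps.append("configuration not applied")
    return gaps


if __name__ == "__main__":
    # The state described in the story: 90% of the machines, nothing else.
    day_two = DRSite(0.9, False, False, False)
    for gap in dr_gaps(day_two):
        print("GAP:", gap)
    print("Usable DR site:", not dr_gaps(day_two))
```

Run against the situation in the story, every check fails, which matches the conclusion that there was effectively no DR site.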
Then we wanted to build Multicolo.
Now, Multicolo is a lot harder because you got to
flip traffic, but the hardest part for LinkedIn was all the data sources had to be replicated
so that they were consistent. And that was a really hard problem. So I took the best engineer
I could think of at LinkedIn that we hired from Yahoo, a guy named Swee Lim, individual contributor,
not a manager, not a director, blah, blah, blah, but a real smart engineer. And I said, Swee,
we've got to make multi-colo work, just like we used to do at Yahoo, just like they do at Google.
Flip traffic, who gives a shit if we lose a data center, right? Everybody in the company at that point worked
for Swee Lim, the individual contributor. He became the leader. Jeff Weiner got in front of the entire
company and said, if this gentleman comes into your office and asks for something, do what he says.
And we built Multicolo in one year, roughly. Now, it runs in many, many data centers. And I remember the last time
I was in the NOC at LinkedIn, it was green, green, green, red. Red was one data center down,
and the other three data centers were up. And they said, Mr. Henke, isn't this a beautiful thing? We don't
give a shit. It works perfectly. And I said, that is a beautiful thing. But it took an individual contributor, to your point, to lead
all of LinkedIn on a cross-functional project, which was a real pain in the ass, but we did it.
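The "green, green, green, red" picture can be sketched as a traffic-shifting rule: when one data center goes down, its share of traffic is spread over the healthy ones. This is an assumption-laden toy, not how LinkedIn's system works; the data center names and the `shift_traffic` function are hypothetical, and it ignores the data replication problem that the story calls the hardest part.

```python
from typing import Dict


def shift_traffic(weights: Dict[str, float], healthy: Dict[str, bool]) -> Dict[str, float]:
    """Send no traffic to unhealthy data centers and rescale the healthy
    ones so they keep their relative proportions and sum to 1."""
    healthy_total = sum(w for dc, w in weights.items() if healthy[dc])
    if healthy_total == 0:
        raise RuntimeError("no healthy data center left to take traffic")
    return {
        dc: (w / healthy_total if healthy[dc] else 0.0)
        for dc, w in weights.items()
    }


if __name__ == "__main__":
    # Four hypothetical data centers sharing traffic equally.
    weights = {"dc1": 0.25, "dc2": 0.25, "dc3": 0.25, "dc4": 0.25}
    # Green, green, green, red.
    status = {"dc1": True, "dc2": True, "dc3": True, "dc4": False}
    for dc, share in shift_traffic(weights, status).items():
        print(f"{dc}: {share:.1%}")
```

The rule itself is simple. What makes it safe to apply is that every healthy data center already has consistent data and spare capacity, which is where the year of engineering effort went.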
And I imagine, like you said, Swee was extremely smart, and I'm assuming that
another big part was that he was
able to bring other people along, which I think is something that a lot of engineers
slowly start to develop over time, which made him so useful in that particular position.
That's right. People want to be part of something. They want to follow people that
are very good. They want to learn from people that are very good.
Shit, I learned a tremendous amount from Swee.
How does this work?
How does that work?
What's the biggest problem we are faced with to get this replication problem to work?
And by the way, I think he just left LinkedIn.
He's over at Databricks now.
Good for them.
Very good guy.
But it just shows you that you don't have to be a manager or director or VP.
You could lead as an individual contributor, as long as you have the backing of
the other leadership. And the nice thing is Swee had the backing of the CEO of the company.
Again, Jeff Weiner would stand up every two weeks: this is Swee Lim. If he walks into your office, do what he says.
I have all of that.
It's amazing that Jeff and the entire company supported Swee on this project of Multicolo.
And I mean, today, we see that.
Like LinkedIn runs out of multiple data centers.
We do traffic shifts.
And it's just amazing.
That's one of the first things when I came to LinkedIn that I was extremely impressed by: oh, we just click a button and it all happens. It's like magic. But what does it take for an IC to build...
You know, the nice thing about talking with Mr. Weiner is he likes philosophy.
So we talk about things like existential issues.
Existential is basically death or life, right?
So he gets the fact that if we really can't make this work, we're just going to be in a bad spot.
And the nice thing is we also came from a company that understood how to do this.
So people know it can be done.
Google, it's not a problem.
They do it all the time.
Facebook, they do it all the time.
But LinkedIn did not, and Yahoo knew how to do it. So it's important to say it can be done.
Once it's done, everyone's going to breathe easier.
Believe it or not, of my four years there until we had Multicolo, I did not breathe easy.
I would ask: a lot of startups who are on this accelerated path of growth
are probably still working towards developing that site-up culture,
which they want, but it's not there yet.
What advice would you have for them, either teams or early-stage companies?
Well, one thing is you're in a better spot now, right?
So when I started at LinkedIn, AWS was just coming out, right, with Amazon.
Google didn't have the cloud.
Microsoft didn't really have Azure.
But AWS was starting, and we knew some of the first users of it because they were ex-Yahoo guys that went over to Netflix.
And Netflix was going to go all in with this.
And believe it or not, at that time, this was years and years ago, it wasn't reliable. It wasn't elastic. It
wasn't cost effective. The system administration tools sucked. The security
systems sucked. And it was like, much as we wanted to use it, we wouldn't use it for all of those
reasons. Now, it's great. And you can go to AWS, you can go to Google Cloud, you can go to Azure,
and you can be pretty sure that you don't have to deal with data centers and networks and the
computing. That doesn't mean your software doesn't have to be resilient and your monitoring systems
don't have to be excellent and your scaling systems don't have to be knowledgeable. And so you have to build that in, at least architect for that, up front.
That's one of my suggestions.
I know you want to go as fast as you can.
And I'm a big believer in Mr. Hoffman's blitzscaling.
That's something I had to learn the hard way.
The faster, the better.
But that's why you invest in things that help you go faster, help you scale.
Because you don't want to be in a position of what I call the going out of business business.
Let's say your success is so good, but you can't keep up with the demand because you can't scale it fast enough.
Even by throwing computing at it, you can't scale it fast enough.
Then you're in a bad spot. And if you're moving money, that's what I like to talk about, you know, like the encryption guys and
the Bitcoin guys and the banks versus LinkedIn. LinkedIn had to run at three nines.
We did not at first, but my goal was three nines, 99.9% uptime. The money guys can't do that. They
have to run at four nines or better because they're moving people's money. And if they screw it up,
they're out of business. And
I think it's important to
grasp that as you're running as fast as you can
because it seems like they're at
odds with each other. They don't have to be.
You just have to engineer and architect your solutions
to scale. Always think
10x.
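To put "three nines" and "four nines" in concrete terms, the arithmetic below converts an availability target into a yearly downtime budget. It is a small sketch that assumes a 365-day year and counts any outage minute equally; the function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a service may be down and still meet the target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, target in (("three nines", 0.999), ("four nines", 0.9999)):
        minutes = downtime_budget_minutes(target)
        print(f"{label} ({target:.2%}): {minutes:.1f} min/year, "
              f"about {minutes / 60:.2f} hours")
```

Three nines leaves roughly 8.8 hours of downtime per year, while four nines leaves under an hour, so each extra nine cuts the budget by a factor of ten.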
Makes sense.
So we're starting to wrap up and I have a few more questions for you.
I want to make sure we respect your time.
But before I go on to these questions, are there any other war stories you would like to share with our listeners?
I have many.
Maybe if people like them, I literally could go on forever.
And so I will not do that.
I will go to your questions.
Okay.
Well, we will save that for another time.
We, or at least I heard that you read a lot and you also like gifting books to your staff,
to the people who work with you.
What are some of the books that you've gifted the most?
So I not only gift books, but it's
required reading. Okay, so if you work for me directly, you have to read these books. And
actually, I'm just going to look for a list, but I can't find it. So here's
some of the books that I really recommend.
But I have a bibliography, and your readers can go find it on the Internet.
It's a leadership talk that I've given maybe 30 times.
It was recorded by UC Santa Barbara, and I managed not to use the F word in that thing because it was my alma mater.
But at the end of this deck that this leadership speech is done,
it has a bibliography of all these books.
Okay, we'll link that in the show notes for sure.
And definitely allow your readers just to look at the bibliography
because they're good reads.
But I have many, many good reads.
Some of them go way back, you know, like The Mythical Man-Month.
It's a software book.
Six Hats is kind of a fun read.
But some of the more important books are,
one of them is a philosophical book,
but written by the Toltecs,
who predated the Aztecs.
And it's called The Four Agreements.
And it's very important to me.
The Four Agreements are,
do your best, which you guys all do,
which is great.
Be impeccable with your word.
That's number two.
Impeccable is Latin for
without sin. Don't assume anything. Well, that's Ops 101, right? And the hardest one, don't take
anything personally. I love this one because that may be the hardest one for most people to
appreciate, but I'll give you an example. David Filo, the founder of Yahoo, smart guy, engineer,
and I worked for him at one point when the CTO left,
and we were looking for a new CTO. And he called me an idiot about 500 times while I worked for him.
And I said, David, I have the largest group in the company. And I have the largest budget,
because I run the data centers and the computing and the networks, and you've given me all this responsibility, and I'm an idiot. What does that make you? And I'm thinking, he must
have just kicked his dog that day or something. I don't know. Bottom line is, don't take it
personally. I don't think he really thought I was an idiot. Maybe he did, maybe he didn't,
but I don't worry about it. So that's a very good book for me because you can use it in your life
as well as in work.
Go to the bibliography because I think it talks about things like how they built the Panama Canal.
It's a hell of a read.
You want to see how something's done, a big project's done?
Because not everything is a quick and dirty project anymore.
Some things are much harder to build than others.
The graph at LinkedIn, the network, we did it three times when I was there.
That's a hell of a thing.
And it's a hell of a thing to get right.
And it may be the most interesting data structure we have.
And we just released a new database to store that graph.
You did?
Well, I didn't know that.
See, you're on iteration number N,
where N is greater than I remember.
Well, thank you for that recommendation.
And David, we could go on and on with you, and we could ask you a lot more questions,
but probably we'll save that for another time.
It's been such a pleasure and an honor to speak with you.
Thank you so much for taking the time.
Anytime, and if you work at LinkedIn, you're probably part of the best SRE group in the
universe, at least in our universe.
And the reason I know that is
ex-Googler, Kevin
Scott, now CTO of Microsoft,
knows how good Google is at this
and he said, Bruno, your team's
better. So there you go.
We'll take
that. Thank you so much again.
We really appreciate it.
Take care.
Hey, thank you so much for listening to the show.
You can subscribe
wherever you get your podcasts
and learn more about us
at softwaremisadventures.com.
You can also write to us
at hello at softwaremisadventures.com.
We would love to hear from you.
Until next time, take care.