Software at Scale - Software at Scale 3 - Bharat Mediratta: ex-CTO, Dropbox
Episode Date: December 19, 2020

Bharat Mediratta was a Distinguished Engineer at Google, CTO at AltSchool, and CTO at Dropbox. At Google, he worked on GWS (Google Web Server), a system that I've always been curious about, especially since its Wikipedia entry calls it "one of the most guarded components of Google's infrastructure".

In this podcast, we discuss GWS, bootstrapping a culture of testing at Google, breaking up services to be more manageable, monorepos, build systems, the ethics of software at scale, and more. We spent almost an hour and a half, and didn't even manage to cover his experiences at AltSchool or Dropbox (which hopefully will be covered in a follow-up).

Listen on Apple Podcasts or Spotify.

Highlights

Notes are italicized.

0:20 - Background. Childhood interests in technology. His dad was a director at ADE, India. His dad recruited APJ Abdul Kalam, arguably one of India's most popular Presidents, and kick-started his career.

6:10 - Studying tech in university. Guru Meditation errors.

10:50 - Working at Sun Microsystems as a first job.

12:30 - Transitioning from being a programmer to a leader, and thinking about project plans and deadlines.

14:15 - Working on side projects for the company (a potential inspiration for 20% projects at Google?)

15:30 - Moving from Sun to a few startups to Google. How did 20% projects start?

16:50 - Google News as a 20% project. Apparently 20% time has its own Wikipedia page.

18:24 - Did 20% time require management approval?

19:30 - TK at Google Cloud, and how the management model compares to early Google.

21:00 - Declining an offer from Google in 2002, and going to VA Linux instead.

22:28 - Growth at Google from 2004 onwards.

24:28 - Hiring at Google at that time. "A players hire A players, B players hire C players".

24:55 - Culture fit (indoctrination)? Two weeks of "fairly intense education", a Noogler project, and a general investment of time and money to help explain the Google way of doing things. It wasn't accidental. I went through this in 2016 and definitely learnt a bunch, especially from an intriguing talk called "Life of a Query".

27:22 - Culturally integrating acquisitions successfully. YouTube as an example.

28:40 - Differences between Google and YouTube, and other acquisitions like Motorola Mobility.

30:20 - Search/Google Web Server (GWS) only had 3 nines of availability? The difference between forging and refining (in terms of programming).

31:15 - What was GWS? The server responsible for the Google homepage and Google Search.

32:20 - There was only one infrastructure engineer on GWS at the time (who wanted to switch teams), but about a hundred engineers made changes to it every week.

33:10 - Starting with writing unit tests for this system.

33:40 - "They" used to call GWS "the neck of Google". Extremely critical, but also extremely fragile. Search results and 98% of revenue came through this system. One second of downtime implied revenue loss. Rewriting was infeasible.

34:50 - How to use unit tests to create a culture of shared understanding. Bharat released a manifesto that basically said "all changes to GWS require unit tests". This caused massive consternation at the time.

36:10 - A quick example of how to enforce unit tests on new code. If an engineer didn't add a new unit test, Bharat would write the test for them, which often would fail due to a bug in the engineer's code. This led to a culture where engineers realized the value of writing these tests (and implicitly spread the practice).

39:23 - New Googlers were taught to write unit tests, so that new engineers would spread a culture of writing tests. "Oh, everyone writes unit tests at Google".

41:50 - "What kind of features were those hundreds of engineers adding to GWS?" An example: searching for a UPS tracking number automatically showed you UPS tracking results. These were all quiet launches. Some of the software design around experimentation on Google Search might have influenced Optimizely's design.

45:00 - Google's search page in 2007 was pure HTML. In 2009, it was completely AJAX based. This was a massive shift that happened transparently for users.

46:00 - "We wanted Search to be a utility. We wanted it to be the air you breathe. You don't turn on the faucet and worry that water doesn't come out."

47:40 - The evolution of GWS's architecture. Initially, very monolithic: GWS would talk to indices, get results, rank results, and send back HTML. This eventually was broken into layers. Each layer had a responsibility, and the plan was to stick to that.

49:00 - "You could find one line of code in GWS that had C code, C++ code, HTML, JavaScript, and CSS output". Wow. The number one query at Google at the time was "Yahoo" - a navigational search query.

50:00 - Google Instant was rolled out in 2010. Internally, this was called "Google Psychic", because it was pretty good at predicting what users wanted to search.

51:50 - "A rewrite would have been a disaster". GWS was essentially refactored from the inside out every 18 months for 11 years. The first refactor was breaking out ranking from GWS into another service.

57:00 - YouTube knew that if it convinced enough people to get better internet, Google would make more revenue.

59:00 - Search grew from 500-1000 people in 2004 to 3000 people in 2010.

59:30 - How exactly did search ranking work, technically and organizationally? The Long Click.

61:40 - Google ran 20+ experiments to figure out the best shade of blue on the Search page. This might seem silly, but it helps at scale, since it could potentially find the shade that would help the most colorblind individuals.

67:50 - Hate speech in Google Search results, and the ethical quandaries around building a humanity-scale system.

70:30 - Improving iteration speed and developer productivity for these systems.

71:50 - Google had an ML model for search results back in 2004 that was competitive with the hand-built systems, but didn't end up using it, due to the lack of understandability. This has definitely changed now. I had read that document during my internship, but was surprised to learn that Google had a working ML model for ranking since 2004.

73:30 - Service Oriented Architecture at Google. It enabled GWS to move from C to C++ and divest itself of some responsibilities. But Google stuck with a monorepo, compared to Amazon.

76:40 - Components in the monorepo plus Blaze (Bazel) helped Google scale build times and improve iteration speed. Components is the most interesting piece, since to my understanding it hasn't been written about much externally.

78:00 - The scale and complexity of the monorepo.

79:40 - The 400,000 line Makefile, and the start of Blaze.

82:00 - What were the benefits of "Components"?

84:00 - The project to multi-thread GWS, when it was serving 5-10 billion search queries a day. It started off as a practical joke.

91:00 - It's rarely only about the technology. It's about culture and team cohesion.

This is a public episode.
If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Yeah, thanks for joining me and welcome to the Software at Scale podcast.
Great, thanks for having me.
Yeah, for sure.
If you could give me a quick background, you know, what's your story? What got you into tech?
My father was an aeronautical physicist.
So he designed supersonic aircraft.
He was the head of essentially the Indian aerospace research arm.
And then he went to the UK and was instrumental there in their aerospace industry.
But in the 60s, he and my mom, who was a practicing OBGYN in India,
immigrated to the US with very little, because they felt like it was gonna be a better experience for their four
children. Well, at that time they only had two.
But I grew up in an environment of science and medicine and physics and mathematics.
When I was very young,
my dad had me doing multiplication tables on the backs of paper bags from the supermarket, and he really cared a lot about that.
And they introduced me to computers when I was 10. My parents bought me a RadioShack TRS-80
in 1980. And it was love at first sight. You know, I looked at this and it was
very satisfying. It was something I could work with, you know,
in this digital realm. They got me this book
that was basically computer programs. And what you did at the time was you would type in the
program from this book. And very often there were no checksums or anything. So you type it all in,
then you'd have to debug it to get it working. And then it would do something for you.
Exciting programs back then were like calculating what day Easter would fall on in 2025 or something.
This is back in 1980.
So like these things were a little bit abstract, but very concrete in terms of just like getting it working.
Sometimes there were games like Hangman.
And then you would save that program on a tape drive. The computer would do
digital-to-audio conversion, record it on tape, and you could save and restore it. So I had all
these cassette tapes of programs that I typed in from a book that I would run in 4K of RAM,
or maybe even a luxurious 16K of RAM, on this computer. And I loved it, you know. So from a
very early age, when I was about 10,
I knew I was going to do something with computers,
and I was really into technology at the time.
And then it just kind of flourished from there.
I had a very strong desire to go learn.
So anything that I could find that was digital,
I would take it apart and learn about it.
I broke a large number of Casio digital watches that way.
And I fiddled with a lot of technology. I never really got into the hardware side of it quite as much.
You know, I was not particularly great with soldering things,
but the digital side of it, you know, was really fascinating.
And it was kind of a love-at-first-sight thing for me,
about 40 years ago now.
Okay. So was the organization your dad worked in ISRO or was it something else?
It was called ADE.
ADE.
Yeah.
And, you know, my father passed away recently.
And the head of the, I forget what it is, like the Aerospace Defense.
I have to look up the name.
One of his engineers who he recruited there sent a very nice note talking about what it was like to work for my dad.
My dad recruited the guy who went on to be the president of India and mentored a lot of early engineers there.
And so it was a lifelong love for him.
And he passed it on to me.
And it was something that I really just grew to enjoy, although I wound
up being more on the digital side of the world, whereas he was much more in designing physical
components.
Yeah, I'm sorry to hear about your dad, but that's a great story. Plus,
he recruited the president. So was that Abdul Kalam, if you know? Or was it some other president?
I don't remember the name. I mean, I could look it up for you.
I think any president of India is probably like a pretty big achievement, though.
Yeah.
Yeah.
I mean, it was amazing.
And, you know, he didn't really talk about this stuff.
We kind of only discovered it by doing a lot of forensic research, asking
him a lot of questions, and looking through
a little bit of his written work and a lot of his physical possessions,
finding the stories attached to them.
I mean, he lived till he was 96. He had a long and very rich life and a complex journey.
But it definitely imparted to me a deep love for understanding,
you know, technology, science, mathematics, physics,
and, to a certain extent, music.
You know, music is mathematics in so many interesting ways.
And I didn't really appreciate that as a child.
And I came to appreciate it more and more as I got into the art and the science of technology.
That's fascinating. Sounds like an immigrant father in general.
Yeah, totally.
Okay, so I'm guessing you studied computer science then in university?
I did. You know, I went to a school that happened to have an old mainframe. Well,
it was not old at the time. It was a VAX-11/750 running an early version of BSD Unix.
And so when I was 13, I had access to a bunch of really interesting technology. And at the time,
BSD Unix was compiled from source, so you would build and
upgrade the system by downloading new source packages and compiling them. And so,
by going on Usenet and downloading source, I learned how to build and maintain code. That's
how I learned C, the C programming language. At the same time, I was getting formally trained
in software engineering through things like Pascal, later on Turbo Pascal,
and then assembly language. And so in high school and college, I was a computer scientist,
but it was also my passion. So I did it on the side nonstop.
You know, I tried to program every computer I could get my hands on, from the TRS-80 to the Apples and the Commodores back in that era.
And then, you know, when I got to college, it was Macs and PCs.
And, you know, we learned 68000 programming on old Amigas. You know, Guru Meditation errors, for those who remember back then:
that's what the Amiga called, essentially, system failures.
And so, you know, so much of that stuff was so low to the ground.
They didn't have a lot of abstractions.
They were just beginning to build out layers of abstractions. So the first time I saw more advanced systems,
you know, like we had a NeXTcube in college,
and learning and seeing like a really nicely designed user interface
on top of these low-level systems,
you begin to see like, oh, wow, these things can actually change the world.
Like they can change the world of work, they can change the way people think about work, you know, having access to a
good computer and a laser printer. Like, the NeXTcube shipped with Roget's Thesaurus,
Merriam-Webster's dictionary, and the complete works of Shakespeare. And so you look at this,
and you're like, huh, I kind of have access to all of this.
And, you know, that plus a laser printer meant that like your work product could all of a sudden
be so much better than what it was before, really on the back of like a very straightforward system
at the time. And we're talking like the late 80s, early 90s. And that really kind of started me on
this journey of like, wow, you know, the world is going to be
digital. We're going to communicate by email. You know, this is when everyone expected me to write
letters to them, like handwritten letters, and I was trying to get everyone on email so I could
just send them an email. And my argument was like, look, I can either send you a five-page email
or I can send you a postcard. What would you prefer? You'd rather get the information.
You know, the content is more valuable than the medium. And the medium is shifting towards digital. So I was trying to
get everyone onto digital. And now it's crazy to think that people wouldn't have email. It's crazy
to think they wouldn't even have SMS, right? Back then, convincing
people to cross that chasm to digital communication was actually quite hard. But I was
there, you know. I was there way ahead of most of my colleagues and my friends, and I was just
trying to get them all to catch up.
Yeah, you were probably just the geeky kid trying to convince
them that this is something new and this is the future.
Totally. And it was very unsexy and unpopular, you know. The university I went to, Colgate, great university, did not have a strong computer science program.
So I was very much the odd man out.
And it's kind of ironic now because in so many ways, this is where the world is.
Like everyone has moved towards this.
Even the people who were holdouts still have some digital presence now, in a way that even 15 or 20 years ago,
holding out was actually pretty common. I remember having long conversations with my brother-in-law,
who didn't want to have any kind of digital connection, because he felt like that would be a
way that people could tie him down. But I always looked at it as a way to actually liberate you.
You know, it's like you have access to information, you have access to communication,
you have access to technology.
It's a tool that you can use for whatever your goals are.
Just because you have a phone
and people can dial your number doesn't mean you have to pick
it up and talk to them.
It doesn't mean you have to respond to their emails.
It's a tool for you, and you choose how you use it.
Yeah. So I think that makes sense.
Did you start at Sun right after college?
I know from Ray at Dropbox that you worked at Sun.
Yeah. Yep. Yep.
You know, when I graduated from college in 1992,
it was not a particularly great time to enter the workforce. In fact, many of my graduating class struggled to get good jobs, and I was very fortunate.
I used the alumni network and contacted a guy who worked at Sun Microsystems. His name is Ken Houck,
great guy. He put me in touch with some folks,
and I flew out and I interviewed,
and they offered me the job the same day.
In a great twist of fate,
Ken reached out to me 20 years later
to talk about getting a job at Google.
So it was kind of a nice circle-of-life moment for us.
But yeah, it was right out of college and I got a job doing device drivers for
graphics cards for Sun's new line of high performance graphics.
And I knew nothing about it.
And so it was all like an amazing learning experience for me to go work at
a large corporation that was doing quite well in the
heart of Silicon Valley. I mean, SUN stands for the Stanford University Network, and their motto
was the network is the computer. So they were like way ahead of their time in terms of the value of
networking. And first couple of years, I worked in device drivers, and then I transitioned into a
group that was nominally around internal tools, but it gave me a great playground to work
on web technology. That's where I met Ray Jackson. We worked together. He kind of mentored me
at the time and helped me understand, because I didn't understand how corporate, how companies
worked at all. All I cared about was like, I'm going to sit down and write some code.
But then I found myself in a position of like having to be a leader and be responsible and do
like project plans and have deadlines. And I was like, I just want
to write code, you know? So I had to learn a lot. I needed to turn to the people around me and ask
them a lot of questions. Like I got this email from my director. What does it mean? Like, how
do I think about this? Like, am I in trouble? What do I do? And I was constantly spinning up side projects and doing
funky things and getting in trouble and getting unwanted attention. And my director was, you know,
multiple times came back to me and she's like, what are you doing over there? Like,
why am I getting these weird emails from other parts of the company about you? Like, what's
going on? And now, in retrospect, I
totally get how much of a challenge I was: a relatively junior engineer
stirring up trouble by doing weird things. At the time, though, it didn't make any sense to me. I was like,
there was a problem, and I solved it, and I wrote some code and I shipped it to the whole company;
why is it a big deal, right?
Exactly. Especially because these side projects are generally
super useful for these companies. It's just that they tend to be things that people don't prioritize,
or it just gets lost, in a sense.
Yeah. I wrote a tool that synchronized the early
PalmPilots with the Sun platform, because I had a PalmPilot and I was like, this is a super cool handheld computer,
who wouldn't want one of those?
And I was like, but
all of my data is on this Sun proprietary,
not proprietary,
but, you know, the Sun ecosystem.
And the PalmPilot was designed to work with PCs,
because that was the dominant platform
at the time for individuals.
And so I just figured out the APIs
and started synchronizing it.
And so I shipped this thing to anyone who wanted it.
And then one day I get this email.
Well, my director gets this email.
They didn't even CC me. It said, hey, the entire sales team in Sweden has bought PalmPilots because this guy on your team wrote this software.
And we believe we can use it to be more effective.
And so she comes to me with this printout of this email.
And she's like, what are you doing? And I was like, I mean,
it wasn't my job, it was completely unrelated. She's like,
why is this team,
who doesn't report to me (actually, maybe they did report to her),
why are they now spending $10,000 on random hardware because of something you did?
It just didn't compute for her. No, but I have always had random side projects, because it gets me excited and it keeps
me motivated, you know, something I can be passionate about.
Yeah, maybe that's where,
like, I don't know if it was you who had the inspiration for the 20% time at Google, but I've
seen that you've written a New York Times op-ed about it. So
maybe that's where some of that came from. I mean, I was by no means the instigator of that,
but when I joined Google, so fast forwarding through several failed startups and then a
stint at VA Linux where I worked on SourceForge, which at the time was kind of ground zero for open source, I wound up at Google.
And Google had this great philosophy that if you let engineers kind of,
if you give them some time and space, they can do amazing things
that you don't necessarily get with top-down direction.
Now, to be clear, Google had the luxury of doing this,
partially because they were making so much money,
but it was kind of a vicarious, you know,
virtuous cycle. It was a virtuous cycle where because they had smart engineers and they gave
engineers room, those engineers would do great things that would make the company better,
which enabled the company to give them even more room, right? And while the company was going
through this like hyper growth period, what
Google really discovered is, listen, engineers are going to
do this anyway, why don't we endorse it? And, you know, make
it acceptable and allow them to talk to each other about
it. At Dropbox, we do hack week, right? But this was like
a nonstop hack week. And the company was going through
this growth phase where so many
of the things that people did turned out to be really valuable. So there were some huge successes
for this. Google News was a good example of something that just came out of one of the
product managers deciding, hey, I could just go scrape news sites. And they're showing up in the
Google search results and the snippets are there. Why don't we just
make a landing page which shows you all the news results? And instead of it being
about search relevance, have it be time-based, about what's topical. And so that was
a 20% project. That was something that came out of this idea that like, what's a cool thing we could do? And what Google basically said was, you know, and I think people struggled with this a little
bit.
The philosophy at Google was, if you have something you really want to do, we will guard
some time for you to do it, even if everyone else thinks it's crazy, right?
So it wasn't like, hey, I should figure out what I do on my Fridays, so much as, hey,
I have this thing that I think is really cool, and I want some space to go work on it.
Later on, I think people struggled with like feeling like I have to have a 20% project.
I need to go find something.
And I always felt like that was putting the cart before the horse.
Like do your job.
But if along the way of doing your job, you have this amazing idea, like don't let that idea go to waste. And there was an internal website where
people would post their ideas that they didn't have time to work on and other people would pick
them up. And this created an amazing culture and community of innovation. And you really want
pockets of innovation happening everywhere. And Google really did a great job of feeding that.
So was it something that always required management approval, in a sense?
Because that's what it sounds like.
It did not require management approval. Okay, so let's think about
this in epochs. I joined Google in 2004, when the company was five years old, and I left in 2015.
So over that 11-year period, the company went through several different significant phases.
In the early days, we were kind of post-product-market fit, had a business model
around ads, had just launched Gmail, had just bought YouTube, and were just beginning to expand
in different ways. But, you know, Google now, 16 years after I started, is a very different beast.
It's a big, complex company that's driven...
You just start getting more and more pinned down, and you have a little bit less room
to run.
And that, I think, caused some of the challenges for the company in terms of giving engineers
tons of freedom, you know? So like when TK joined Google Cloud from Oracle,
Oracle is a very different culture.
His approach is much more, you know, top down.
Like here are the goals cloud needs to hit.
Here's what you need to do in that kind of model,
which I believe is a good model and a right model.
I'm not criticizing it,
but it doesn't leave as much room for people to be like,
hey, I want to try doing something completely blue sky. And it's also important to note that like
the web in 2004, I mean, just setting the clock back: IE6, Internet Explorer 6, was the dominant
platform. Microsoft had built it up to be the dominant platform
and then was letting it stagnate because they were trying to feed the cash cow, which was Microsoft Office. So it was much more of the frontier. The frontier was right there.
And you could do so many more interesting things on the web that no one had really done before.
And we were kind of pushing through all these technologies in a way that now it's much more
mature. And when in a mature environment, it's a little harder to find those frontiers.
So 20% time really happened at a great time for Google in its model and for the industry.
And it had a great, I mean, it all just kind of came together in a way that was beautiful.
Yeah, I think 2004 was also, I think, kind of at the end of the dot-com bust or maybe a few years after that.
But it seemed like there was an influx of great engineers from all of these different startups.
Yeah, I mean, I had received an offer from Google in 2002 when the company was only about 300 people.
And I declined it because Google was very notoriously tight-lipped about what they were
doing. So I didn't really know what they were doing. And the job didn't appeal to me.
So I went somewhere else. I went to VA Linux. I met some of the best friends in my life and I had
a great time. And I worked with some really smart and competent people that I enjoyed
working with, and some are, you know, some of my
best friends now. And I grew there a little bit. And then when I took a step back,
I realized Google's actually just in the very early innings of something huge. And I wanted
to be part of that. And so that was really the moment. Because in 2002, the dot-com bubble had burst and was
beginning to stabilize, but Google offered me a bunch of equity, and I was like, I don't think
this equity is going to be worth anything. By 2004, I had figured out, oh, I see, this is how it actually
works; I should maybe go take that job. And, you know, Google kept it open. I could just go
back and take the offer.
You didn't have to re-interview and all of that?
No, I sat down with
Jen Fitzpatrick in kind of a hilarious meeting where she basically just made sure I hadn't, you
know, gotten lobotomized in the last two years. She's like, let me just check in on you. Yep, okay,
all right.
I mean, in those two years, like you said, Google was like 300 people at the time;
by 2004 it was already like 1,500 people, and it had already IPO'd and all of that, right?
Well, I joined six months before the IPO. I think there were about 1,800 people in the company.
So it increased about sixfold in those two years, and then we proceeded to triple over the next two years.
You know, because I remember in 2005 they were like,
hey, we're going to triple, and I was like, that's crazy, we're going to go from 2,000 people to 6,000?
That's crazy. And then we did that. And then they were like, we're going to triple again, and I was
like, that's crazy. And I think we did. You know, Wayne Rosing, our VP of
Eng, used to have an engineering all-hands in a relatively small room. And he would do this thing where everyone would stand up,
and it's like, if you just joined this week, you know, sit down.
And then a significant fraction of the room would sit down.
Like, that's how fast we were growing.
And it was like, a quarter, a year, two years, three years, and very quickly you'd be down to like 20 people standing.
Right, because the company was growing so fast that every year it was doubling.
And, you know, that kind of growth in terms of team makes maintaining your culture really hard, because you have to find your cultural center and keep it. So Google
invested a lot in culture, to try to make that growth not be massively
destabilizing.
So how did Google maintain that culture? You said Google put in a lot of
work.
Yep, yep. Well, so, everybody went through a... well, first let's just zoom out and look from the outside first.
The industry was booming; it was really taking off. Google was booming in the industry. Google
was the darling back then; they were the plucky underdog taking on Microsoft and, you know, the old
guard. Google had a very good hiring process. It was slow and it was painful, but it held a high
bar. And Google had a lot of A players. You know that expression: A players hire A players,
B players hire C players. Once you drop your bar, you just start rolling downhill.
So Google had kept a very high
bar, and Google cared a lot about culture fit. And then when you joined, there was kind of a, I don't
want to call it indoctrination, but it's not too far off from what it was, which is:
Google heavily invested in getting you culturally aligned, right?
Trying to get you to the point where you understood the what and the why and the how.
So it was like two weeks of fairly intense education. And then you join your team and
you spend some time ramping up. You have your Noogler project. And then, you know,
they tried to maintain those cultural things.
We had cultural events; we had beer
bashes on Fridays. You know, we tried to get the team together, I mean,
like, Google just really invested in it, they spent a lot
of time and money on trying to build that cultural center. They
had kind of cultural ambassadors. I remember
folks like Craig Silverstein, who was one of the earliest engineers, would relocate to other offices and be a cultural ambassador and help
people there understand the Google way of doing things. You can't maintain that forever, but they
did for a shockingly long time, keep the culture together. You know, and it wasn't
an accident. It was because there was a bunch of fairly senior leaders, you
know, on the engineering side, like Wayne Rosing,
Alan Eustace, Jeff Huber, Cos Nicolaou, and Bill
Coughran, folks who really understood the value of
engineering culture, investing in making sure that the culture
continued.
And so that enabled them to make sure that the tidal wave of people that was joining did not water down the culture or steer us in a different direction.
The biggest challenge is acquisitions.
Because when you acquire a company, that company already has a culture.
If you bring in onesies, twosies, they get assimilated more quickly because they don't
have anything else to really attach to. But when you bring in a whole group of people who have
worked together before, then you have the challenge that they have to actually want to assimilate.
And then they have to make the effort to do it
because they have a very natural,
comfortable group themselves.
And Google, I mean,
I don't know what the strategy was.
I wasn't senior enough to be involved in it.
I don't know how they did it,
but Google did a pretty good job
making sure that the acquisitions got integrated.
But there were some cases where it
wasn't easy, you know,
and they maybe weren't quite as successful.
Sounds like YouTube might have been one of
those first acquisitions. I don't know how big YouTube was, like in terms of employees.
Yeah. Well, YouTube was a big company. It was already very successful in terms
of user engagement, maybe less so on
the monetization front. I don't really remember; I think it was like late 2004, maybe early 2005,
I can't remember. They were not physically co-located; they were in, I think, Burlingame,
or wherever the YouTube offices were, maybe still are. And they used a different technology stack.
And they had cultural differences.
Like, YouTube used to have two hours of downtime every Sunday
when they pushed new stuff. Two hours of downtime!
It's just mind-boggling.
From a search perspective, when I joined,
I'm pretty sure we had three nines.
And within two or three years,
we were at four and a half nines. I mean, a crazy amount of uptime and reliability. And, you
know, finding out that YouTube had two hours of downtime every Sunday just made my head
explode. And it's not like, I mean, they had perfectly good reasons for making those decisions,
and I'm not criticizing them, but I'm just pointing out it was a very large cultural divide in terms of uptime and downtime. And it was hard to integrate them.
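Some rough arithmetic on what those figures mean (my numbers, not ones from the episode):

```latex
2\,\text{h} \times 52\ \text{Sundays} = 104\ \text{h/yr} \approx 98.8\%\ \text{uptime}
\qquad
\text{three nines: } 99.9\% \Rightarrow 0.001 \times 8760\,\text{h} \approx 8.8\ \text{h of downtime/yr}
\qquad
\text{four and a half nines: } 99.995\% \Rightarrow \approx 26\ \text{min/yr}
```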
And to a certain extent, YouTube maintained its own separate identity,
its own separate branding; it was really separate for a very long time. Even up till now,
people tend to think of YouTube as a separate brand from Google. And so you have to make some
tough decisions as a leadership team, where you're going to get more value from merging and
assimilation and where you need to keep things separate. I remember when Google acquired Motorola
Mobility. And then you have some really interesting cultural divides, because
Google was like 40,000 people or something along those lines.
Motorola Mobility was like 20,000 people, right?
And it was like some of those people worked in the factory line.
It was a very different kind of motion than what Google had done historically.
And so they had to think about like, well, wait a minute.
Do those people get free food?
What kind of benefits do they get? What kind of perks do they get? What's their culture? How do they
think about it? Are they Motorola? Is Motorola separate? You can see how
you rapidly get into all kinds of crazy complexity. If you try to think of this as
one big culture, you have to be very, very thoughtful about how you do these things.
Yeah, it seems like it's basically a different business. It's just in the Alphabet brand or Google brand or something.
That seems like an easier way to think about it.
Yeah.
So I know that you used to work on Google Web Server.
At least that's what I've seen on its Wikipedia page, right?
And you just mentioned that search went from three nines to like four and a half nines.
Even right now, thinking of Google Search
being only three nines is like, I can't imagine it.
Well, I'm sure some Google SRE somewhere
when they listen to this is going to be like,
no, we were never three nines.
But I honestly can't remember.
I do remember it got better, right?
I mean, it was kind of a classic case where, you know, there's two modes you can get into. One is forging: pushing ahead,
forging into new areas. The other is refining: okay, you've moved into this space,
let's make it better and better and better. I'm kind of a refiner, right? I look at
this thing and it's a mess, and I'm like, let's clean it all up. So when I joined, I was a slightly more seasoned engineer.
I was older and I had a little more industry experience.
And they tapped me to go work on the infrastructure
behind the Google web server: GWS, or "Gwis," as we called it.
And GWS was the infrastructure that served the Google homepage
and the Google search results page, which were,
I think, the top two web properties in the world.
And we were responsible for making them fast and reliable in all languages and all
domains, and for being able to push new changes on a weekly basis,
right? Because you can't stagnate. You have to forge ahead; you have to enable people to
forge ahead. But the thing was a toxic mess, because of the organic way that Google had grown.
No one was really marshalling it and cleaning it up and enforcing the rules. I mean, not no one,
but it was very, very hard to do that. And as a result,
it had gotten very messy. And most of the engineers who worked on it didn't really want to work on it
anymore. And they were down to one engineer left in charge of infrastructure, this guy named Todd
Turnage. And Todd wanted to go do something else. And he'd been there long enough that he kind of
earned the right to go spend a couple of months on a 20% project that he was pretty excited
about. But our senior director at the time, Bill Coughran, convinced Todd to stick
around for a couple of months and train me up on this. So it's Todd and me, and our job was
to make sure that this thing continued to run and continued to be
flexible and adaptable, so people could change it.
And at that time about a hundred engineers
were making changes to this thing a week,
which seems like a lot,
but the challenge is that no one really understood
exactly what it did, right?
And so I remember sitting in a room with Bill Coughran
and Cos Nicolaou, the directors at the time, and they're looking at me, and they're like, well, what are you going to do with this?
And I was like, well, OK, someone has to tell me what this thing does.
And they really weren't sure how to interpret that comment.
They're like, well, what are you going to do?
I was like, OK, I'm going to write unit tests because I figure, you know, a test is documentation.
It's executable documentation.
I'm going to write this unit test.
And this is something I had learned at VA Linux.
Our backs were to the wall at VA, and we really needed to figure out how to do better software
development.
We went and learned how to do this, and I brought this with me to Google.
And it's not like people hadn't been writing unit tests, but this project, GWS, we used
to call the neck of Google.
It was that soft, squishy thing that all the oxygen and blood and
nerves and food and water and everything flowed through. And it was so fragile, right? It was
delivering search results with a couple hundred milliseconds of latency, you know, 99.9% of the time
or better. And all of our ad revenue came through it. We had launched AdSense. So it was like 98%
of the revenue came through this thing. So if this thing was down for a minute, we lost revenue.
If it was down for a second, we lost revenue. It was really noticeable. And so we basically couldn't, we could have rewritten it.
And I think the trending sentiment was like, let's rewrite it.
But to rewrite it, you have to understand what it's doing.
Because if you rewrite it and you leave out some critical features, like how are you ever going to transition to it?
Meanwhile, while you're rewriting it, 100 engineers are contributing to it.
So my argument was, let's go write unit tests.
Let's understand it deeply. Then let's use those tests to create invariants in the system.
And then let's start cleaning it up while not breaking any invariants. And so basically,
I used unit tests to create a culture of understanding and a culture of, like, a safety net.
If you check in some code and it breaks the unit tests, well, then you can't ship it,
because we know there's a breakage. And if it doesn't break a unit test, you can commit it.
And if it breaks something in production, okay, fine, we'll fix it, and we'll write a new unit
test. So you just ratchet it tighter and tighter and tighter.
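As an illustration of using tests as executable documentation and as a ratchet, here is a minimal sketch. This is my example, not GWS code: the function and its 60-character invariant are hypothetical, and GoogleTest is used only as a familiar harness (link against gtest_main to run it).

```cpp
#include <gtest/gtest.h>
#include <string>

// Hypothetical stand-in for a behavior discovered in the legacy code.
std::string RenderResultTitle(const std::string& raw_title) {
  // Observed invariant (assumed for this sketch): titles are cut at 60 chars.
  return raw_title.substr(0, 60);
}

// The test documents the invariant and acts as the safety net: if a later
// cleanup changes the truncation behavior, this fails before it ships.
TEST(ResultRenderingTest, TruncatesTitlesAtSixtyChars) {
  EXPECT_EQ(60u, RenderResultTitle(std::string(100, 'x')).size());
  EXPECT_EQ("abc", RenderResultTitle("abc"));
}
```

So by the end of 2004, I put out a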
manifesto basically saying, anyone who wants to make changes to GWS... and because of my role as
the infrastructure guy, the only infrastructure guy, I could be a gatekeeper for it. I basically said,
if you want to make changes to this, you have to write a unit test. And this caused
massive consternation. All these people were super upset. They were like, I don't need to
write unit tests. My code is perfect. And there were some hilarious techniques I used to get
people to write unit tests. And that's, I mean, some of them are very funny.
Because of the place that Google was in, the moment in time, the place that I happened to sit, and the experience that I had, I was able to basically just put this
speed bump in front of everything and be like, you have to write a unit test. And that started
to shift the culture of engineering in the company towards writing
unit tests. Okay, so I'll tell you a quick aside.
An engineer would come to me and submit some code, and they would say, please review
this and submit it; I need it to go out live on Monday, it's going to make us a lot of money. And
I'd say, okay, that's great. Well, actually, we weren't so money focused. It'd be more like, it's a great feature for the users, and,
you know, it would drive engagement. And very often it would be very monetizable,
but we actually were not super money focused. That was one of the great things about
Google. But the engineers would say, hey, my code's perfect, no issues. And so what I would do is,
I'd be like, okay, write a unit test.
They'd be like, I don't want to write a unit test.
It's really important.
So I'd write a unit test for them.
I'd be like, okay, fine.
You may understand this, but I don't really understand this.
So I'd write a unit test.
And very often, if I was diligent,
my unit test would find a bug in their code
that I could not find through code review because it would be very complex.
It was a very complex, intertangled code base that was very difficult to parse.
So very often it was hard to find these things in code review.
And so then what I would do is I would send them this unit test and be like, listen, I wrote one for you.
Just go ahead, add it to your changelist, make sure it passes, and you can submit.
So then they would take the unit test.
They would run the unit test.
It would fail and they would find a bug in the code
that they had sworn up and down, had no bugs.
They would go fix the bug.
And then you could tell,
because of the way Perforce was set up,
you could tell afterward whether or not they had made some changes.
Sometimes they would fess up to it and be like,
your test found a bug;
I will write a test in the future. Sometimes they'd try to sneak it by and be like, oh, no, nothing changed. That was kind of funny. But what happened
was, for those people, the next time they sent me code and I'd be like, write a unit test, they'd be
like, fuck, he's going to find a bug and he's going to humiliate me. I mean, not humiliate, but
they didn't want to be in that situation. So then they would start running tests.
And eventually they'd get to the point where they're just like, just tell me where the bug is.
And eventually they get to the point where they're just like, just tell me where the bug is.
And I'd be like, no, no, no, you have to write a test.
You have to do it.
And the funny thing about this, and I didn't really plan this, was it was viral.
So when someone got stung like this, they would want to pass it on.
So the next person, they'd be like, you write a test.
And it spread.
It spread to all these other projects, where people would use unit tests as a way to actually do better code reviews. And engineers
did not want to be responsible for putting a bug into the system. And so we used this to not only
ratchet the technology and the process more tightly, but the culture: we demonstrated the
value of testing. And it was only possible because we sat at the very center of what drove people's brand perception of Google, the value proposition to users, and the monetization.
And so, I mean, I was very fortunate to sit in this moment and I was, you know, I didn't plan to do this, but I could see when it started working, when it started turning over, that it was beginning to really shift the culture. And then that spread. It spread rapidly throughout the company.
And then one of the benefits of the fact that the company was growing so fast, we were doubling or
tripling, is that the new people who came in just kind of accepted that as the status quo. In fact,
we used to do this
great trick. We were trying to convince the old guard to write tests, and they didn't
really want to, because they were like, it used to work just fine; we're going to keep doing that.
So we formed this thing called the Testing Grouplet.
Grouplets were like a group of people working on a 20% project. And I worked with
Antoine Picard and Nick Lesiecki, and we did it together,
the three of us. And one of them, I forget which, came up with this idea that we would,
as part of the two-week onboarding process, teach a class on unit tests. And so we were doing this,
and we were getting some uptake. And then one of the guys who was a frequent teacher, I think it was Chris Lopez, came up with the idea that we would just tell all the Nooglers, as a matter of fact, that the status quo throughout the whole company was: oh, everyone writes unit tests; unit testing is the culture here. And then what would happen is these new Googlers would wind
up in a group that didn't write tests, and we would tell them, oh, it's okay, you know, that's one of
the older groups, but you can help them. And so what happened was we kind of did
this bear hug on the older engineers, where all of a sudden everyone around them was just
like, I was told that we write unit tests, so here's my unit test,
right? And so that actually created this massive upswell towards rigor and discipline about
building heavy-duty systems at scale that we had the ability to modify safely and
confidently, because we had a good safety net to know that they weren't going to fail. And that transition probably only took about a year. But in that critical moment, when
we were going from like 2,000 engineers to 6,000 to 18,000, it became foundational to the engineering
organization, and it shifted the way people thought. We did all kinds
of other hilarious, funny things; you know, I could talk for hours about it. But
actually, the trick to getting it to happen at scale was cultural. The technology existed, the
processes were there; it was getting the culture to shift that was really hard.
Yeah, and it sounds so
simple: you just need one gatekeeper to bring
this in. So you need to empower a gatekeeper, in a sense, and they have to enforce what the good
culture looks like, and then it can just spread virally. But to clarify my understanding,
you said that there were like a hundred engineers adding features. So what kind of features were
they adding? Because when you look at Google Search, it seems like a simple UI that doesn't
do that much, though I'm sure there's a lot of search happening behind
the scenes. So was this like experiments on search? Was this ad logic that was changing?
What were the new features they were adding?
Right. Well, so, you know, funny story about that.
So Marissa Mayer was essentially the product lead for search. And one of the things she really cared about was simplicity.
I mean, I think Google backed into simplicity.
I don't think it was their aesthetics so much as like,
they just didn't focus on building a complex interface.
But then they started to appreciate the value of it.
And Marissa used to track how many words were on that page,
like a hawk.
She would know if you added one. I think it was like
20 or 25 words max on that page. So we kept it very simple, but there was so much happening
behind the scenes. The search application had hundreds and hundreds of screens that you couldn't
easily see, but people would find them. It's like if you searched for a tracking number for UPS,
you'd get a different interface.
We called this thing at the top a one box.
It was like a little box that was like a specialized result.
We started showing sidebars.
We started showing like doing calculations for you,
doing math, doing unit conversion,
giving you weather, tuning the results,
giving you better shopping results.
Like the initial interface was very simple and the simplicity and the power and the speed
was what made it so valuable.
But the results themselves were always changing.
And we never really announced what we were doing.
We just kind of rolled it out.
You know, we'd write a feature, we'd roll it out to half a percent, wait a couple of weeks. Yep, it's trending in the right
direction. 1%, 5%, 10%, 50%, a hundred percent, you know? We were always in that process.
We built a world-class experimentation system. I think Dan Siroker
and his great work at Optimizely is based a little bit on the work that we did for the search and ads experimentation system.
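As a minimal sketch of how staged percentage rollouts like that can work (my illustration, not the actual Google or Optimizely design): hash a stable user id, salted by the experiment name, into buckets, and compare against the rollout fraction.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Returns true if this user falls inside the experiment's rollout fraction.
// Salting the hash with the experiment name gives each experiment an
// independent slice of the user population.
bool InExperiment(const std::string& user_id, const std::string& experiment,
                  double rollout_fraction) {
  std::size_t h = std::hash<std::string>{}(experiment + "/" + user_id);
  return static_cast<double>(h % 10000) < rollout_fraction * 10000.0;
}
```

Ramping from 0.5% to 100% is then just raising rollout_fraction from 0.005 to 1.0; because the hash is deterministic, users already in the experiment stay in it.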
I was working on search infrastructure.
We had a whole set of people working on ads infrastructure.
There was a huge team of people doing things below the surface,
like an iceberg, where only 10% is actually visible in the UI: making it fast, putting
it in all languages, having the diversity of the responses match the diversity of the
types of searches we were getting, tuning it, building models, shipping those models,
experimenting on them. Basically, Google was exceptionally good at getting more and more value out of each query
every single day, kind of perpetually.
So we didn't actually make groundbreaking, huge strides in the UI that people could see.
But we did some things that were truly exceptional.
For example, in 2007, the Google search result page was almost pure HTML. You type in a search
and you get back an HTML blob, right? By 2009, we had migrated the whole thing to be Ajax,
and almost nobody noticed. We moved the world's
largest web app. We completely changed the infrastructure from being completely non-
JavaScript-driven to a hundred percent JavaScript-driven over a period of like 12 to 18 months of
steady, applied work, without anybody noticing. All of your existing URLs that you could cut and paste and share still worked. New URLs were backwards compatible. We didn't lose any revenue. In fact, we in many
ways, improved latency, improved revenue. It was a huge, complex infrastructural shift.
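The episode doesn't describe the mechanics, but a generic pattern for this kind of incremental migration looks something like the sketch below; this is entirely an assumption on my part, not the actual GWS design. One handler serves full HTML for legacy URLs while the new JavaScript frontend requests the same results as data, so the cutover can ramp gradually.

```cpp
#include <string>

// Hypothetical stand-in for the backend search call.
std::string RunQuery(const std::string& query) { return "results for " + query; }

struct Request {
  std::string query;
  bool wants_json;  // set by the new JS frontend; false on legacy URLs
};

// One handler, two renderings: legacy URLs keep working unchanged while the
// Ajax frontend consumes the JSON form.
std::string HandleSearch(const Request& req) {
  const std::string results = RunQuery(req.query);
  if (req.wants_json) return "{\"results\":\"" + results + "\"}";
  return "<html><body>" + results + "</body></html>";
}
```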
But the success for us was that you don't notice it, right? Because we wanted search to be a utility. We wanted it to
be the air you breathe, right? You don't turn on the faucet and worry that water might not come out.
It's a utility. If water doesn't come out, you're pissed off, right? But if water does come out,
you're like, you don't notice. We wanted search to be the same way. We wanted search just to be:
it's fast, it's reliable, it's a utility. And we achieved it. You know, search's uptime was legendary. We had not a single significant outage,
you know, the entire time I was there. You know, and that's not because of me. I mean,
partly because, I mean, we had a tremendous team of SREs, you know, running our ingress and egress
traffic, our massive points of presence, our edges, data centers.
Like, a huge amount of work, all well orchestrated. The end result being people don't notice it.
But what we found was that if you got a new computer or you got to a new place and you
tried something, the first thing people would try is Google. And if Google was down, they'd blame
their internet provider, right? Like people were much more likely to blame Comcast, AT&T,
than they were to blame us if there was ever an outage.
And that was success for us.
So you shouldn't notice it.
The fact that you're saying that means that we succeeded.
Like we managed to actually stay up with the times,
stay fast, stay relevant without you noticing.
What you're describing is like, GWS kind of served the whole front end of Google. If you had to change
what HTML was being served or what JavaScript was being served, an engineer would write something
in GWS. But it seems like there's also this part of search which would classify a search query,
something like: this seems like a UPS tracking number.
Yep.
It should go to this other module.
Was that also part of GWS, or was that some other search service?
Was it just a large monolithic thing?
In the early days, it was very monolithic.
GWS would talk directly to the indexes:
it would take your inputs, it would formulate backend calls,
it would talk to the indexes and get results back,
it would rank them, twiddle them, you know, twiddle the rankings.
And then that did not scale.
So then a team came along and they broke out
the ranking from the serving.
GWS became serving,
and this thing called SuperRoot became ranking.
And then, basically,
we had the edge that talked to the user,
which was always GWS, the shiny tip of the spear,
and then the deep indexes.
And those indexes proliferated.
And then we built layers in between
where we tried to maintain this idea
that a layer had a responsibility and that we would stay within that range of responsibility. Because the problem was growing.
The early GWS was designed for raw speed. It was written in C and made abundant use of
macros. You could find a line in GWS that had C code, C++ code, HTML output, CSS output, and JavaScript output,
all in one line, right?
It would basically be a C++ type, a C print statement, and then, in quotes, HTML,
CSS, and JavaScript.
Because it was always in transition.
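To make that concrete, here is a hypothetical illustration, my invention rather than actual GWS code, of the kind of line being described: C control flow emitting HTML, inline CSS, and JavaScript from a single printf-style statement.

```cpp
#include <cstdio>

int main() {
  bool has_onebox = true;  // hypothetical flag
  int result_id = 42;      // hypothetical result id
  // One line mixing C control flow with HTML, CSS, and JavaScript output:
  if (has_onebox) printf("<div style='color:#00c'><script>track(%d);</script>OneBox</div>\n", result_id);
  return 0;
}
```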
And so we had to improve our templating systems.
You know, GWS had so many problems to focus on
that we managed to divest a set of those problems
to other teams who could solely focus on them.
Those teams were doing amazing work,
like figuring out: how do you classify a search query?
Is it a UPS tracking number?
Is it a navigational search query?
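A toy sketch of that kind of classification (my illustration, not Google's actual classifiers; the UPS pattern assumes the common "1Z" plus 16 alphanumerics format, and the navigational list is invented):

```cpp
#include <regex>
#include <string>

enum class QueryType { kTrackingNumber, kNavigational, kGeneric };

QueryType ClassifyQuery(const std::string& q) {
  // UPS tracking numbers commonly look like "1Z" followed by 16 alphanumerics.
  static const std::regex kUps("1Z[0-9A-Za-z]{16}");
  if (std::regex_match(q, kUps)) return QueryType::kTrackingNumber;
  // A real system would learn these from user behavior; this list is invented.
  if (q == "yahoo" || q == "facebook") return QueryType::kNavigational;
  return QueryType::kGeneric;
}
```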
Our number one query at Google when I got there was one word: Yahoo. People would want to get to
Yahoo. They would go to Google and type in Yahoo and then click on the first response. And so we
were making money off of Yahoo, because Yahoo would advertise on Google,
even though the organic result was right there, because they didn't want someone else to take their traffic.
So we would make money off of these navigational queries of people typing Yahoo into Google to get to Yahoo.
That's just the way people's behavior developed.
And so we had to classify these things.
We had to understand them.
We had to do spell checking: you typed a word wrong,
and it's like, did you mean this? Which then evolved into, hey, we're so confident that you meant this, we're just going to show you the results for this instead. Things like Google
Suggest, all of these things. One day you just got a slightly improved feature, a slightly better feature, a slightly more targeted feature. You know, in 2010, we rolled out Google Instant, which was: you type a
letter based on that letter. And based on what we know about you, where you live, what you're
interested in, we can be pretty damn sure what you're looking for. You know, the internal code
name for this feature was psychic psychic because when we did it right
it was like google was psychic you know you type in w and normally that would be walmart and we
could just show you the search results for walmart but you're in san francisco and you're a sports
fan so you probably need warriors and because there's a game coming up you probably want
warriors tickets you know and we would people would be like, how is that possible? And it's like, look, yeah, you're unique, but you're not that unique.
Right.
We can classify you.
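A minimal sketch of that kind of prediction: rank candidate completions for a typed prefix, boosted by what's known about the user. The data model and weighting below are invented for illustration:

    #include <string>
    #include <vector>

    struct Candidate { std::string query; double global_popularity; };

    std::string PredictQuery(const std::string& prefix,
                             const std::vector<Candidate>& candidates,
                             const std::vector<std::string>& user_interests) {
      std::string best;
      double best_score = -1;
      for (const auto& c : candidates) {
        if (c.query.rfind(prefix, 0) != 0) continue;  // must match the prefix
        double score = c.global_popularity;
        // Boost queries matching what we know about the user ("Warriors" for
        // a San Francisco sports fan, per the example above).
        for (const auto& interest : user_interests)
          if (c.query.find(interest) != std::string::npos) score *= 2.0;
        if (score > best_score) { best_score = score; best = c.query; }
      }
      return best;
    }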
So all of that work was happening originally in GWS, and then more and more in these adjoining
teams that were taking one piece of it and making it 10x better,
and then taking some other piece and making that better, until
you have what you have now. And by the way, they were doing the same thing on ads at the same
time, and Gmail, and YouTube, across the whole spectrum.
Yeah. So you never ended up rewriting GWS, because that didn't make sense. It was just splitting it off into various
sub-components that made sense?
Yeah. I mean, a rewrite would have been a disaster, because we would have had to either pause forward progress at a critical moment,
or we would have spent years trying to do a cutover and then failing. Rewrites for large
systems at scale are a fool's errand. It does not make any sense. Whatever they wanted to rewrite
to, I could refactor the
system there faster, with fewer resources, with less downtime, guaranteed. And this
was the argument I made to Bill and Kaz, and I proved it. It didn't
take long before they were like: oh, wait a minute. After six months of working on it, GWS
no longer needs a rewrite. The pain has gone away, and now you have all the benefits. And you
didn't have any outage along the way. And every day, it just gets better and
better. By refactoring, we essentially rewrote the thing from the inside out
every 18 months, for the 11 years I was there. Every 18 months for 11 years.
So the first refactoring, I should say, certainly made sense. You added
unit tests and all of that. But did you have to refactor again for
growing scale, or is it just that you had to break off some parts?
Yeah. I mean, just breaking off, for example, ranking out of GWS and pushing it into SuperRoot
required us to have exceptionally good unit tests, build a migration path, build a new code base,
and move. You would literally have code in GWS that makes a call to SuperRoot,
gets a result back, and runs some code on it. Then we would move that code into SuperRoot, and we'd have two code paths.
And then we would verify that they still worked.
And then we would change the code path
to call SuperRoot instead,
and we would delete the code from GWS.
You just do that over and over and over again.
It was like moving a mountain
a teaspoon at a time.
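In code form, that incremental cutover might look something like this minimal sketch (the names, stubs, and verification step are assumptions for illustration):

    #include <string>
    #include <vector>

    struct Result { std::string url; double score; };

    // Old in-process path and new service call, stubbed for illustration.
    std::vector<Result> RankLocally(std::vector<Result> raw) { return raw; }
    std::vector<Result> RankViaSuperRoot(std::vector<Result> raw) { return raw; }

    std::vector<Result> Rank(const std::vector<Result>& raw, bool verify_mode) {
      std::vector<Result> new_results = RankViaSuperRoot(raw);
      if (verify_mode) {
        // Dual-path phase: run both and compare before trusting the new path.
        std::vector<Result> old_results = RankLocally(raw);
        if (old_results.size() != new_results.size()) {
          // In practice: log the diff and keep serving from the old path.
          return old_results;
        }
      }
      return new_results;  // once verified, RankLocally() gets deleted
    }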
And it takes a continuous application of steady pressure and a high bar to do it.
And Google invested in it.
Google hired the best engineers and they gave us the opportunity to convince them to come
work on the team.
And we built a great culture in our team.
By the time I left,
I think the team was probably like 150 people
just doing infrastructure. And a hundred
people, for a quarter, can move a mountain,
even if all they've got is teaspoons. That's what we did.
And it picked up so much momentum in so many ways.
It got really fun.
We enjoyed the process.
It sounds like it must have been wild, right?
Just being at Google in those 10 years.
Do you have any numbers?
It was the most widely used website then,
and probably even now, right?
So what kind of requests per second
were you seeing?
And kind of related to that:
early on you didn't have tests,
but I'm sure the site didn't go down all the time,
because you'd be losing revenue.
So there had to be some kind of monitoring
or something in place.
So how did that evolve?
Yeah, well, to be clear,
there are too many teams to name that contributed to this.
GWS relied on an incredibly strong internal toolchain.
Tons of teams worked on that.
A testing infrastructure: tons of teams worked on that. The build system you'd use to build a binary. Automated QA: we had one QA engineer. One. Everything was
automated, all of our testing was automated, and that one QA engineer would be able to fully test the entire system in about a day.
I mean, at scale, this was a fantastic accomplishment.
We had a site reliability team, GWS SRE, which was world-class,
led over the years by a number of people.
At the end, I think it was Astrid Atkinson,
who's now gone off to start a clean energy company.
And there was an unbelievably strong data center team, which was the substrate this all ran on.
All of this ran in and out of
Google's massive ingress and egress
of network traffic,
handled by a traffic team
that was run by Jenna Hussain.
I don't know what the actual numbers were
in terms of ingress and egress,
but it's a staggeringly large number.
I remember YouTube had this stat
that if they convinced people
to get a better internet connection from Comcast,
YouTube made noticeably more money. So YouTube, Google Fiber, all these efforts were designed
to just give people more bandwidth, because we knew the more bandwidth people had, the better
for everybody, but even better for Google, because Google was the place people wanted to go. The
more you enable people to come to Google, the more people would get value from Google, and the more they would come
back. It's a value cycle. And the value for us as a business
was that we'd get massive amounts of revenue from it. So it was an enormous number of
people making that work. And from me on down, it was all substrate: everyone I listed.
From me on up, there was a huge search features team that was building and putting
search features into it. There was a search ranking team, I think led by Pandu Nayak,
and before that Amit Singhal. We used to measure our metric for search value in terms of Wikipedias of information.
So it's like: how many more Wikipedias of information can we provide in X amount of time?
The search features team was always looking at what users do versus what they are trying to do.
Why is a user always going to the third page of search results to get this weird thing?
How can we find that user and make it easier, reduce the clicks, reduce the latency, deliver it
to them faster?
That was Ben Gomes' team, doing search features.
And that team grew to be even larger than the GWS infrastructure teams.
When I joined, I would say Search was the nexus of the company, but it was probably
only about 500 to 1,000 people, because there was also a lot of sales and marketing and
a lot of other ancillary G&A efforts around it. It was easily 3,000 people by 2010. Search grew so fast. So maintaining
the culture inside Search was also pretty challenging, because Search was the oldest
bastion at Google. It was an enormous team working together. My part
was small. It was important, but it was small, comparatively.
I'm trying to piece together
what the architecture might look like from what you said.
It seems like there were two aspects, right?
You have a couple of people changing ranking
based on what people are clicking or not clicking.
That reminds me of the long click,
which I've read about out there:
are people just clicking and not going back to the Google page, and is that your success metric?
And then there's also this really large experimentation platform where people
make these changes, and the experimentation platform tells you: okay, this
particular change in the ranking algorithm
is getting you more clicks or fewer clicks.
Is that kind of what it looked like?
Well, yes, except think of it like this.
Think of it like many, many small teams,
each with an idea, running experiments.
They're like: let's look at nav clicks,
let's look at long clicks,
let's look at abandonment rates, let's look at revenue uptake. Let's look at all these things, formulate a
hypothesis, make it data-driven, experiment with a bunch of different things that might improve it,
try to pick something that works, productionize it, get it ready for launch, and launch.
At any given time, there were scores of teams working on projects small and large to do this. And many of these
things at scale require a degree of complexity and thought that's very nuanced. Like:
is this obeying our fundamental doctrine of first do no harm? Are we damaging users'
experience in any way? Is it adding enough value?
Is it going to operate at scale? Is it going to do the right thing globally? Is it discriminating
against subgroups? Is it going to create systemic negative behaviors? Many of these things
would really slow down our process. In the early days, you could just throw something at
the wall; if it stuck, great, fantastic. But as you become more and more mature, you
really have to consider all the ramifications of some of the things that
you're doing. And that can make things go slowly.
There's a very famous criticism that Google ran experiments with
20 shades of blue to figure out which blue was best.
And yeah, from a design perspective,
I'm sure that can be annoying: I just want it to be this beautiful blue. But
it's very easy for individual designers to have biases, and I'm not saying that they did or
they didn't in this case. And I'm not saying 20 experiments is the right way to go.
But you do have to find that balance of: we have to understand what's really happening, and whether or not this is really good. And
we had cases where changing the color improved the user's experience. Why, we weren't
100% sure. There's a significant portion of the world that's colorblind, right? We only have one color for
our interface. We know that we can't satisfy everybody; some people can't see
the colors. My son is colorblind. I forget sometimes, but he can't see certain colors.
So if one day the interface changed to a color he can see, it would have a noticeable
impact on him, positive or negative. So you have to think about these things.
And so Google built out an incredibly sophisticated at-scale system to make minute changes and see the consequences, because we had so much volume. We had so much scale that we could run a
half-a-percent experiment for a week. It was important to run it for at least a one-week
cycle: there's the weekly cycle, the monthly
cycle, the yearly cycle. Everyone has cycles. We'd run it for a week. And after a week, we would know,
with reasonable statistical significance, what was going to happen when we launched it.
And we always wanted to know, the day before it launched, what the world would be like the day
after. We wanted to be able to predict that, because we wanted to continually offer a better experience for our users, and not have them show up and be like: Google got worse today,
I'm bummed. Disappointment is a tough emotion. We tried to never disappoint our users.
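Mechanically, a half-a-percent arm could be assigned deterministically per user, something like this minimal sketch (the hashing and thresholds are assumptions, not Google's actual diversion logic):

    #include <cstdint>
    #include <functional>
    #include <string>

    bool InExperiment(const std::string& user_id, uint64_t experiment_id) {
      // Salt the user id with the experiment id so arms are independent
      // across experiments, then take 5 of every 1000 buckets (~0.5%).
      std::size_t h = std::hash<std::string>{}(
          user_id + ":" + std::to_string(experiment_id));
      return h % 1000 < 5;
    }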
And presumably, it seems like even the Facebook feed, the LinkedIn feed, or any feed in general could be designed in a really similar way. You tweak the ranking of posts, and you roll it out to half a percent of users;
the volume is so large. And maybe that's why it made sense for all of these companies to
hire so many ex-Google engineers.
Yeah. I mean, I think Google just demonstrated that
it was actually possible. You don't need to have lived it to know it;
we demonstrated it. But look at what Twitter just did.
They are trying to improve the discourse on their platform.
They rolled out a feature that gave quote retweets
slightly more prominence, to try to drive discourse.
They did data analysis on it.
It did not do what they wanted. They rolled it back. They did it out in the open.
There are a lot of people who were unhappy with it; okay, fine. But I respect that they had
a hypothesis. They tested the hypothesis at scale. They let it run for a while. They posted their
results, and they made a change. You have to do that, because
otherwise you're not data-driven. You're acting on the back of your opinion or your aesthetic,
as opposed to what's happening in the world. And frankly, if you operate at the Twitter,
Facebook, Apple, Google level, you have a duty of stewardship to the world to pay attention to this.
You cannot allow your platform to be biased, because the biases will go against the people who are the most disempowered.
It will dramatically affect people like the Black community, underrepresented minority communities in general, women, people who are financially challenged. Those are the people who will suffer.
So you have a duty of stewardship to make sure that what you're getting is a result that genuinely benefits all of us.
And if you don't do that, then you're derelict in your responsibility.
It's a social contract.
And I think that it's hard because you have a lot of financial drivers
but it's important. It's important for the world.
Yeah. It reminds me of that YouTube
bug where it didn't work well for left-handed people: the videos would be uploaded upside
down, because they just had nobody left-handed testing that stuff.
Yeah, and there are countless examples of this. It's okay
as a fledgling company: at the vision stage, okay; product-market fit, okay; growth, still kind of okay; at
scale, not okay. If you're serving a hundred million users, if you're serving millions of
users, you have to start paying attention to these things. It is a drag on your system, by the way.
It will slow everything down.
It will cause some of your engineers and your designers to get frustrated.
But you cannot allow your platform to be used to damage our species.
That's not okay.
Scale comes with a set of responsibilities that are truly, truly essential and critical.
And I saw Jack Dorsey talk about it at a conference, where
they were asking: why doesn't Twitter just kick Trump off the platform, or
whatever the spicy question was at the time? And he's like: look, we have to care about the
quality of the discourse,
but we also have to empower free speech. That's important. You cannot be both the
platform as well as the editor. And Google really tried to do this in Search. We definitely tried
to do this. Even though some search results are not wonderful, and they represent a bad side of humanity, we shouldn't pretend they
don't exist. We shouldn't edit it out. There's a classic thing that happened, and I
hesitate to bring it up because I'm afraid I might get some of the facts wrong, and I hope you'll
take it in the intended spirit, that I'm trying to talk about an example. But concretely, there
was a time back, I want to say in 2008 (I'm placing where I was, in which building, trying to remember the timetable), where if you typed the word Jew into Google,
it came up with some pretty bad hate speech, and people were really upset with Google. Like,
why don't you filter this out? And it turns out that that word, used in that way, is largely used within these
types of groups. The Jewish community refers to themselves as Jewish, or they
talk about Judaism; they don't use the word Jew in that way, right? And so
what Google was shining the light on, holding a mirror up to, is: this word is used in this way across this entire corpus.
Based on the data we've gathered, that is indisputable.
And so now what do we do?
Do we pretend it doesn't exist?
Do we disown the sentiment?
It's a tough situation, but you have to face these things.
And I think part of the challenge is, of course, that it's hideously nuanced.
And for the most part, the media cannot boil it down into a palatable soundbite.
And so a lot of the nuance is lost.
And so in the modern discourse, people fill in those gaps with their own biases and their own theories.
And very often the conversation splinters. Having lived it,
I respect the fact that if somebody explains to you what's going on here
in a way that's quick, then they are, by definition, leaving out a lot of the nuance.
And so in that world, you're free to assume, but you have to be thoughtful that
there's probably more than you're able to see. You've got a lot of people doing
the best that they can. And obviously it's hard to trust; I trust them more having worked with
them directly at Google specifically. But it's very, very hard. These are
smart people. If there were an easy solution, they would have taken it.
Yeah, it's just not feasible.
And does it even make sense?
Because it's the same ranking algorithm that's ranking everything else
that's also ranking bad websites, or websites you might not agree with.
So you can't just go ahead and manually edit the algorithm,
or edit ranking, or something like that.
Yeah. It's a hard problem. Even on social media, right? Should you ban people who are clearly spreading misinformation? What if they're the
president? It's hard to say.
So, going back a bit: there were a lot of teams working
on different ranking algorithms,
or tweaking them for long clicks or abandonments
and things like that.
But historically, working in C or C++,
language features aside, it also takes a really long time
to compile those things.
Yeah, it does.
How did Google ensure that those engineers,
or engineers in general, could be productive
and iterate fast on those things?
What happened?
Yeah. Part of the challenge that I walked into
was that Google was beginning to grind, not to a halt, but to grind down,
because of the tangled complexity
and tangled responsibility.
And so we separated that out:
there is a server that does ranking,
and then there are servers that do experimental ranking.
And then we tried to get to the place
where smaller and smaller teams had clearer and clearer areas of responsibility, so that they could drive some of this stuff.
And then we moved things into data: moving it out of code and into data.
Early on, in 2004, when I was wandering around Google, I went over to the search quality team, the ranking team, and I was talking to an engineer I'd met there, and he showed me a dashboard. He's like: hey, we have machine learning up and running, and we
have a version of our ranking, using machine learning, that is competitive with our current hard-coded
ranking system. This is 2004, 16 years ago. And it was really good. And we were
talking about it, and we didn't wind up using it. And the reason why we didn't wind up using
it, and I can't remember if this was an Amit Singhal call or a Pandu Nayak call, both of them
were very good at this, was that at that time, with that technology,
it was very difficult to understand why something was ranked higher or lower.
And so debugging and modifying it was very tricky.
And so we could easily have switched to a model which, if I recall correctly, at the time was going to deliver higher-quality results.
But we were worried, and I think quite rightfully so, that it would ultimately damage our momentum.
And so we were slow and thoughtful.
I mean, Google certainly has moved more toward machine-learned ranking systems, but it took a while, right?
And so moving that stuff out of GWS into its own servers, giving those teams and their servers the ability to generate data
files, and having us read those data files and act on them, was a better approach for us. Again,
it came down to the division of responsibility: allowing small teams to be more agile, have
a clear swim lane, and be effective.
And Google's whole architecture, Jeff and Sanjay,
and I'm sure many other people, architected from an early stage a system
with a very service-oriented architecture,
where each service has clearly defined inputs and outputs.
Then they built this whole protocol buffer system,
which basically allowed your internal data structure
to also be your over-the-wire data structure,
and to be so clearly defined that you could just link in a library that turned it into an HTML web page and let you
basically do input and output on it. I'd never seen something so
elegantly put together. It had a crappy user interface, but that didn't really matter. It was fast and effective
and easy and cognitively simple. And so we just used that to create
essentially a proliferation of divisions of responsibility that were very, very effective.
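That property, the same structure being your in-memory type and your wire bytes, can be sketched with open-source protocol buffers (the message itself is invented; only the serialize/parse calls are the real protobuf API):

    // search_result.proto (hypothetical):
    //   message SearchResult {
    //     optional string url   = 1;
    //     optional double score = 2;
    //   }
    #include <string>
    #include "search_result.pb.h"  // generated by protoc from the above

    std::string ToWire(const SearchResult& result) {
      std::string bytes;
      result.SerializeToString(&bytes);  // in-memory struct becomes wire bytes
      return bytes;
    }

    SearchResult FromWire(const std::string& bytes) {
      SearchResult result;
      result.ParseFromString(bytes);     // and back, on the other end of the RPC
      return result;
    }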
And then that enabled GWS to move from C to C++. We did some even crazier things
to basically re-architect it, but it allowed us to divest ourselves of some of the drag
of all the other responsibilities, and then improve our architecture
as quickly as we could.
Yeah. So Google had the same idea at around the same time as
Amazon, which is also super famous for its service-oriented architecture, and there were
protobufs and all of that. But then Google decided to stick with the monorepo
and stick with a single build system.
Yes.
So I would say two things.
First, Amazon's model went all the way down to the P&L level.
Amazon has its roots in a B2C business: commerce.
Google's roots are in technology.
So Google was not trying to flow P&Ls through these systems. It was
really just trying to build efficient ways
to have the system be well architected. And so how do you do that? Well, you want
to have a division of responsibility: there's this team and there's that team. But you also
want to give engineers the ability to move across the whole system. And so we had a single monorepo.
It was originally in Perforce.
We eventually re-implemented our own Perforce backend.
Essentially, we had a shit ton of unit tests on Perforce
and then re-implemented Perforce at scale.
I forget what it was called.
It was a great project.
Piper?
Piper.
And it was really well done.
It was an almost seamless migration,
from my perspective, at scale.
But that enabled engineers to be like:
okay, I have found an issue over here
that's tied to something over there.
I can go make a code change across both, over there and over here. Now, those things break down at scale. So we built a system called Components.
And Components was a very bumpy launch at Google. I think the architecture was right. The idea was
right. But the edge cases were very, very complex to deal with. Because where I sat, where
the rubber meets the road, if you pull in a component
that's prepackaged, prebuilt, tagged at a certain level, and it's got a bug in it, we still have to
ship the next version of Search. We have to. So we needed a way to surgically say:
we need this without that, so that we can unblock our pushes. The trains have to flow.
And so it took a long time to get components
to the point where we could do that.
And when components got to that point,
I think, at scale, we had a build system
that used a backend object store, essentially a cache.
So it's like you compile this thing
in a hermetic environment.
It generates this output every single time deterministically,
which means you can cache it and you can reuse it.
So the more we started layering those systems, the more the system could scale. It's like: you
check in this code, we checksum it. You build it once, we save that object, we checksum it.
You wrap it into a component, the component is checksummed and, probably at this point,
digitally signed. And then you can use it going forward. It's like memoization:
you can essentially cut down a lot of your work. And so having a monorepo certainly
gave Google efficiency across all of our subsystems independently, but also efficiency
across the whole system.
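The memoization he describes can be sketched as a content-addressed cache: key each build action by a checksum of its inputs, and reuse the cached output when the key matches. The hashing and store below are stand-ins for illustration, not Google's actual build internals:

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // Stand-in for a hermetic, deterministic compile step.
    std::string CompileAction(const std::vector<std::string>& inputs) {
      return "object-code-for-" + std::to_string(inputs.size()) + "-inputs";
    }

    std::map<std::size_t, std::string> g_object_store;  // checksum -> artifact

    std::string BuildWithCache(const std::vector<std::string>& inputs) {
      std::string combined;
      for (const auto& in : inputs) combined += in + '\n';  // inputs determine output
      std::size_t key = std::hash<std::string>{}(combined);
      auto it = g_object_store.find(key);
      if (it != g_object_store.end()) return it->second;   // cache hit: skip compile
      std::string artifact = CompileAction(inputs);
      g_object_store[key] = artifact;
      return artifact;
    }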
There were times when changes were needed across the whole repo. At the very root, there's a thing called google3: google3 was the repository (there was a google2 and a google1 before it). And at the root of google3, there's a small set of people who could approve changes anywhere.
And because I was on a critical piece of infrastructure,
I was one of a very small number of people in that file. And very occasionally,
there would be something systemically wrong. Like: hey, we found a bug in a low-level library,
these 15 packages need to get updated, and we can't find those people. So they would come to me
and be like: we need an emergency override to approve a change across all of google3.
And so that would be one of my jobs: I would review it and I would stamp it, and then we would push it. And that allowed Google to be agile. And that was
very, very important, because you need to make sure, as you scale, that you don't let your velocity
wind down. You've got to make some trade-offs, and the monorepo was effective.
The monorepo had some challenges, in that it was an unbelievably large repo. So we had to build
more and more capabilities into Perforce, like: I only want to check out this bit.
And: I want to check out this bit and that file from over there,
and then I want to pretend that I have everything, except only some of it is backed by files.
So we had a FUSE file system
where some of these things are backed
by an object store over the network,
with a little daemon going
and getting them on demand,
and some of these things are locally editable files.
Over time, it got a little confusing,
and then we always had this cycle: it gets complicated,
and then we simplify it,
and we're making these trade-offs.
But Google always invested in this.
And that was key.
You always have to invest in it.
So you mentioned this idea of Components, right?
Would GWS say: I depend on this component? And would components be tagged, in a sense,
or would you be using the component at head?
Well, so in the early days,
it was build from head.
I mean, it was really bad. When I joined,
we had a single Makefile,
and it was like
400,000 lines of
programmatically generated Makefile.
And there was always some kind
of merge conflict going on with that thing. It was a disaster.
People would be sending emails around, like: who broke the Makefile? It was a disaster.
And then they built up the build system. There was an intervening one, and then there was the one we used, with BUILD files, which is what Blaze, and now Bazel, uses. So we had a lot of complexity there.
But we still built from head. We built from head for a very long time.
We used caching to speed it up and to get it to go faster.
Sometimes your full build might take a little bit longer:
if you were the first person in the morning to do a checkout and a full build, you wouldn't get the benefit of anyone else's caching.
But over time, that got better and better. Components, then, was an optimization. Components
really was, in a sense, a way to say: all of these components compiled together, they all
passed their unit tests together; they were a viable unit that you could use. I think it was
ultimately the right way to go. It was just very bumpy getting there,
because the devil was in the details of making a lot of these things work.
And it caused so much consternation, but this is what
happens at scale. You have 20,000 people in the company, you have some mission-critical
things, and the complexity of it is staggering. And there are hundreds of people who can say no and block your efforts. You have
to push forward. At some point, Bill Coughran came and asked me to help get it
over the finish line, because they needed to build a coalition of people that supported it.
And I did. I mean, I was originally one of the people blocking it,
and then I was one of the people helping it go. But the reality is that it's not about the
technology or the process. It's about the culture. It's about getting people to understand: hey,
we have to do these things because we're facing a problem. So let's all see this problem
together. There were a bunch of solutions; we've chosen this one. Let's get aligned and do it
together. It was a cultural challenge, not a technical challenge.
Yeah. What was the major benefit?
Was it that you didn't have to recompile the component,
and that would save a lot of time,
or were there other benefits?
It meant that you could know reliably
what you were shipping at any given time.
If you didn't do this,
imagine I build GWS today and I want to ship it.
I can tag it as: I built it at this Perforce revision number.
But I don't necessarily know what's in that Perforce
revision.
I have to look at every change that happened before
or after that cut line.
And that's not bad,
but at scale it's complicated.
Because if you're saying:
hey, we think there is a data privacy or security or migration or calculation or hardware or firmware bug somewhere, how do you trace it through the system?
What binaries is it in?
What binaries is it not in?
We have this service-oriented architecture with thousands of binaries running in this fleet in a carefully orchestrated dance. How do you go find these things? So Components simplifies the
problem: if the bug is in this component, we can now say this component is in these 10 systems and not in those 18 systems.
Let's not worry about this; let's worry about that. So it created cognitive simplicity, it created operational efficiency, it simplified
the problem. But it was a very big shift.
Yeah. Because it seems like you're not building some
things at head, in a sense. Or you still are, but it's a way to break down the amount
of complexity. You might have
a million transitive dependencies,
but that's only because this one really large component
has a hundred thousand.
So you can say: I just have these 10 major dependencies,
and this one component dependency has that bug.
So it's a way to simplify the mental model for everything.
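As a sketch of that traceability benefit (the data model below is invented for illustration): if each binary records the checksummed components it was built from, you can ask directly which running systems contain a bad component version.

    #include <set>
    #include <string>
    #include <vector>

    struct Component { std::string name; std::string checksum; };
    struct Binary    { std::string name; std::vector<Component> components; };

    std::set<std::string> SystemsContaining(const std::vector<Binary>& fleet,
                                            const std::string& bad_checksum) {
      std::set<std::string> affected;
      for (const auto& binary : fleet)
        for (const auto& component : binary.components)
          if (component.checksum == bad_checksum) affected.insert(binary.name);
      return affected;  // worry about these; ignore the rest of the fleet
    }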
Right. I mean, that was, I would say,
easier than multi-threading all of Google. Multi-threading
all of Google was insanely hard. With Components,
there's a build system aspect to it, you can pin it down, and you have a whole
revision control system behind you to help you. But Google in the early days was building
its own servers. It was all process-oriented: one process per core.
Your process consumes all of the core, and you only have this one core.
Then we moved to a model with 16-core, 32-core, 64-core machines, where you run one process,
because running 64 processes on 64 cores on one system is not efficient.
And so you really need to take advantage of all those cores.
You need to do thread mechanics.
So you have to multi-thread this incredibly complex system
that the company's entire revenue rests on.
Are you okay with that?
Now we're talking about
some real technical challenges:
getting that right at scale,
and making sure you don't have a thread deadlock
that's going to blow up Search,
blow up Ads, cause an outage,
and screw the company.
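The shift he describes, from one single-threaded process per core to one process with workers across all cores, can be sketched in standard C++ (the query type and strided work split are assumptions for illustration; a real server would pull from a shared queue):

    #include <cstddef>
    #include <string>
    #include <thread>
    #include <vector>

    void HandleQuery(const std::string& query) { /* must now be thread-safe */ }

    void Serve(const std::vector<std::string>& queries) {
      unsigned cores = std::thread::hardware_concurrency();  // e.g. 64, not 1
      if (cores == 0) cores = 1;
      std::vector<std::thread> workers;
      for (unsigned i = 0; i < cores; ++i) {
        workers.emplace_back([&queries, i, cores] {
          // Each worker takes a strided slice of the incoming work.
          for (std::size_t q = i; q < queries.size(); q += cores)
            HandleQuery(queries[q]);
        });
      }
      for (auto& w : workers) w.join();
    }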
Yeah. How do you manage that?
Just like everything else:
really slow rollout,
see what breaks, fix things?
Well, ironically, it was cultural.
There's a funny story behind this.
So I worked for a VP, Alan Eustace.
I had known Alan for a very long time.
He's the guy who convinced me to come to Google.
I had known him by that point for like 10 years. And we needed to
multi-thread Search, in a big way. I had this guy running a team out in Pittsburgh,
Jay Krim. And Jay, a great, solid engineer, was a relatively new TL.
He owned a big chunk of the underlying subsystems for GWS.
Alan was heading out to Pittsburgh to visit that team. And as he would do,
he reached out to me, like: hey Bharat,
I'm going to Pittsburgh to visit one of your teams.
What should I say to your team? And, I'm a practical joker,
I was like: let me have
some fun with this. I knew these guys pretty well, and Jay and Alan didn't know
each other so well, so I thought this was going to be hilarious. So I said: hey Alan,
why don't you go ask Jay when he's going to multi-thread GWS? Be like: Jay, this is really
important. When are you going to get it done? Because I thought the poor guy was going to be completely deer-in-the-headlights.
The senior VP of engineering shows up on your doorstep and asks you to do essentially the impossible.
What are you going to do? And so Alan flies out there and he says: hey Jay, when are you going
to multi-thread GWS? And Jay says: okay, I'll get right on that. He puts together this plan,
which was amazing. I mean, it was a little bit of slow
and steady, but he went and recruited these people who were amazing at this effort, put
together a really solid proposal for how we would statistically know that we were headed in
the right direction and be able to have enough safety checks. And he multi-threaded that thing in a
year. It was a phenomenal accomplishment. It was crazy when you think about it: at that point we were doing billions of search
queries a day, maybe somewhere between five and ten billion search queries a day.
And this is a massively long-tail problem. You're talking to tons and tons of backends.
GWS had hundreds of backends doing all kinds of complicated things, with state machines internally.
And multi-threading that was an insane task.
And to be thread-safe, everything underneath you has to be thread-safe.
And if it's not thread-safe, you need to know it, and you need to act accordingly. They built a system that
used a combination of heuristics, massive testing, and building out tooling and support.
I think Robert Love and Alan Blount were on the team at the time;
they're just super smart. I'm trying to think who else. I'm probably leaving some people out,
but they built an entire framework
to validate thread safety.
And they drove this thing.
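The internal framework isn't public, but in a similar spirit, Clang's -Wthread-safety analysis can statically validate lock discipline, assuming an annotated mutex type (libc++'s std::mutex and absl::Mutex carry these annotations). A minimal sketch:

    #include <mutex>

    // Clang-style thread-safety annotation (absl defines a similar macro).
    #if defined(__clang__)
    #define GUARDED_BY(x) __attribute__((guarded_by(x)))
    #else
    #define GUARDED_BY(x)
    #endif

    class QueryCounter {
     public:
      void Increment() {
        std::lock_guard<std::mutex> lock(mu_);
        ++count_;  // OK: mu_ is held
      }
      // int Peek() { return count_; }  // unguarded read: flagged by -Wthread-safety
     private:
      std::mutex mu_;
      int count_ GUARDED_BY(mu_) = 0;
    };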
And the end result was phenomenal.
We got to the point where
you could run a single GWS process,
releasing massive amounts of memory and disk consumption,
massive savings in our architecture.
So you could basically pack our systems far more
efficiently than ever before, and you could scale up: at that point,
we were probably only talking about eight to 16 cores in a machine, but scale up to 64
cores in a machine. It was phenomenal. I think they did it in about a year.
Re-architecting GWS on the front end to be AJAX without anyone noticing, and re-architecting GWS on the back end to be multi-threaded without anyone noticing, were two of our three largest accomplishments.
How many people were on that team? What does it take to do something like that?
Yeah, I mean, the team itself was probably relatively small, but they had the mandate. They had the authority. They were able to
compel people to act differently and to code differently. They had the authority to
get other teams to pull, and it was an important effort. And that authority
flowed from Alan all the way down.
So it was really about empowering the team to go do a thing that we agreed should get done, and letting them take a shot at it.
And the reality is, they were up to the task and they got it done.
And I was super proud of them. It was an amazing moment.
And which year was this? Was this earlier or a little later?
It was probably like 2010.
2010?
In that timetable, yeah. 2009, 2010.
Yeah, that was it.
So GWS had probably grown quite a lot by then.
It wasn't super tiny.
Yeah.
The team?
No, just the code base.
Because of the amount of code you have to make thread-safe.
Yeah, I would say GWS probably
was like a million lines of code sitting
on top of 10 million lines of libraries,
something in that range, give or take.
That's probably about the right rough order of magnitude.
You know, I have so many more questions, but I think this is a good stopping point.
We didn't even get to talking about AltSchool or Dropbox at all.
We also didn't talk about the team culture: driving RVs across the country, dressing up in a bear suit, the friendly rivalry with other teams, the scale accomplishments.
I mean, we could talk for days on this. The one thing I would say,
the key takeaway, is that people think of it as technology. They think of it as process.
It is dominated by culture and team cohesion. It is dominated by having a shared purpose
and getting aligned and working together. That is
how you achieve scale. I think people lose that, because they're like: oh god, your tech
is amazing. It's not about the tech. The tech was a mixed bag of amazing and horrific, and
probably still is. It's the team that makes it amazing, the team that makes it work.
Yeah. You have the shared vision that pretty much everyone subscribes to, and then they like working with each other.
Yeah. And you get that.
Well, thanks so much for being a guest on this podcast. I'll have to invite you again, because I have to hear all these other stories.
All right. Well, thanks for letting me take a wonderful trip down memory lane. When I look back on that time, it was fun and phenomenal.
And I hope you hear that in my voice.
Like I love that team.
I love the projects.
I love the problem we solved.
I love the way that we did it.
There were a lot of things that went wrong.
There was an overwhelming number of things that went right.
You know, I'd do it again in a heartbeat.