PurePerformance - From Infra to Services to Happy End Users: The role of SLOs at Uber with Vishnu Acharya
Episode Date: January 6, 2025

eBay, Yahoo, Netflix, and then 10+ years at Uber. In this episode we sit down with Vishnu Acharya, Head of Network Infrastructure EMEA and Platform Engineering at Uber. Vishnu shares how Uber has scaled over the years to about 4,000 engineers and how his team makes sure that infrastructure and platform engineering scale with the growing company and the growing demand on their digital services. Tune in and learn how Vishnu thinks about SLOs across all layers of the stack, how they manage to get better insights with their cloud providers, and why it's important to have an end-to-end understanding of the most critical end-user journeys.

Links we discussed:
Conference talk at Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/vishnuacharya
Vishnu's LinkedIn Page: https://www.linkedin.com/in/vishnuacharya/
Uber Engineering Blog: https://www.uber.com/blog/engineering/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Hello everybody and welcome to another episode of Pure Performance.
My name is Brian Wilson.
And as always, I have with me my wonderful co-host, Andy Grabner.
And Andy, the fact that you did not mock me... I was going to ask you if Krampus came to visit you, but maybe he did. Maybe Krampus did come to visit you, which is why you're suddenly being nice to me.
Maybe. And it's just a couple of days before Christmas, so maybe I do want to get some Christmas presents.
No, it's after Christmas.
This episode airs after Christmas.
Yeah, but the recording is still in the past.
Yeah, but we're pretending for the people it's after. So everyone listening, pretend this is after. Pretend Andy didn't just blow the cover, because we'll leave this all in. It's more fun this way, because it's more entertaining than anything I would say. So yeah, it's the new year, Andy. It's 2025, and we're coming to you from the future. And so, will it be an interesting episode?
I think it will be an uber-interesting one, because when I heard the company name the first time, I thought, why is everybody talking about Uber? It must be "Über", or where does the name come from? And with my German background, obviously I got a little bit confused.
But we have somebody with us today who hopefully can shed some light not only on the name, but more importantly on what it is actually like to work at Uber. How did Uber change over the years? I want to learn how Uber manages their service level objectives and how Uber ensures that everything works as expected.
And I want to also find out if anybody still calls it Uber, because there's no umlaut. So yes, I have to get that one in there.
We were hearing that in the early days.
Hey, Vishnu, as we told you in the first couple of minutes,
it's a little bit boring for our guests
because Brian and I just go off.
But thank you so much for being on the show.
Can you do me a quick favor?
Because you have not run away, so that's a good sign for us.
The first episode in the new year. Can you quickly
introduce yourself to the audience?
Sure, absolutely. And first I'll say
before I introduce myself,
I'm a bit embarrassed because I actually don't know
the origin story of the name.
It's not something that's widely
talked about here, but I will
make an effort to find out after this podcast.
And to answer your question, Brian, people do call it all kinds of different things, and that pronunciation I've definitely heard.
But it's very interesting as you go around the world.
So you need the umlaut over the you.
Yeah, exactly.
So hey, everybody.
My name is Vishnu Acharya.
I'm super happy to be here on the podcast with Brian and Andy.
I've been looking forward to this for some time. A little bit about me. As you realize by now,
I'm an engineering leader here at Uber. I've been at Uber about 10 and a half years now, and I was talking about it with some co-workers last night: it feels like two days and 20 years at the same time. It's this really weird
time warp thing
that goes on.
I've been working on
platform engineering, network engineering,
SRE type
areas here for the last 10 and a half years
and I'm excited to be here with you guys.
It's amazing. Who did we have on recently who also happened to be with their company for so many years? We've had a couple of guests who have really been with their current organization for more than a decade. And I think it's just really interesting, because some people jump around in our industry quite a bit, and the benefit of that is you see a lot of different teams and a lot of different technology companies in different stages of life. But people like you, and also Brian and myself, we've been with Dynatrace for quite a bit. We've been here for so many years, and we saw the change.
And this is actually my first question, because I remember for the folks that are listening
in, Vishnu and I, we got to meet at an SRE conference, an observability conference in London. I think
it was early October.
And I sat
down with you at a fireside chat.
I was asked to moderate that session.
And then we got to know each other a little bit in
chatting. And then the first
thing that really struck me is, hey, you've been there for 10
years. Can you
just quickly fill us in? What has changed
over the last 10 years at Uber? What was
the company like when you started? What has changed? And obviously, always with a bit of the engineering life in mind: what has changed from a scaling perspective? We talked a lot about performance and observability, so that would just be interesting to hear.
Sure, absolutely. And just before I answer that question, really quick: in my career I actually did jump around quite a bit in my younger years. Maybe I'm dating myself here, but I started working in the industry around 1999, 2000, in my 20s, and I didn't work anywhere longer than about two and a half or three years, the last of which was Netflix, prior to Uber.
So when I got to Uber, I really enjoyed building things and scaling things in companies.
And I really enjoyed that startup energy.
And when I got to Uber, I obviously got that in spades, and here I am 10 and a half years later as it's grown into this massive worldwide organization and technology company.
So I think in the early days, especially as I came into Uber, I joined the infrastructure organization, which is what we called it before "platform engineering" became the industry standard. In our infra org, I think there were like 16 engineers, right? I was like number 17 or whatever. At that time, for a company of the size Uber was, even in 2014, and the speed it was growing at,
every day it was just everything was on fire, right?
If we made it through a day without various parts of the system falling over
and just completely failing, that was a win, right?
So on the one hand, it's exciting because you come in and there's challenges left and right.
And there's just so much to do.
On the downside, it's like you feel like you're just trying to survive or trying to make it through that day.
So it was very, very intense, very, very fast moving, both on the business side and as well as the technology side.
And when you're in that kind of situation, there's also this sort of strange dynamic. By the time I joined Uber in 2014, it was by no means a tiny startup; they had raised, I think, a big $4 billion round at that time. So it was a huge startup, I would say.
But the focus was really, at that time,
even nobody really knew how big it could get.
We just saw sort of all the graphs,
whether you look at systems or our infrastructure
or our services or our business metrics,
everything was a huge hockey stick graph,
but we didn't know where it would end, right?
So the decision-making as you're building that infrastructure
is pretty interesting because you have, on the one hand,
this instinct to like, we just need to make it through today.
But then you also have to think like,
how do we make this through the week, the month, the next five years?
And that part is very difficult when you're under that kind of pressure.
So I think in the early days when we talk about, you know,
SLOs and SLIs, we didn't really have any of that, right?
We were just trying to keep the service up as much as possible
while also growing it in every dimension, you know, by 10x, 100x, right?
So that was sort of my initial foray into Uber.
Thank you so much for going back in time.
And obviously you mentioned when you joined,
Uber was already no longer a small startup out of a garage,
but already had a good size.
Still, a lot of things have changed over the years. I remember the discussion we had in London; I believe you said you're about 4,000 engineers now, give or take. Is that right?
Yeah, that's correct. I think we're around 4,000 engineers globally. What infrastructure has turned into, platform engineering, is around 900 to a thousand out of that.
So it's a large organization. Obviously we've had a lot of people come and go, but there are a surprising number of us who've been around at the 10-year mark or close to it. But yeah, we've tried to keep the good pieces of our early engineering culture, which I think we're now calling "go get it": from a technical standpoint, and just in how we address issues and initiatives, pulling out all the stops and doing whatever it takes to do what we need to do. But there's definitely been a lot of growing pains along the way. And then,
you know, as we fast forward to today, you know, we've matured in many, many different ways,
but especially, you know, for this conversation in how we're building our infrastructure,
how we're measuring our success, or in some cases our lack of success, in those initiatives, and how we're ensuring that we're building a stable and performant platform, not just for our internal customers, right, which is every other engineer at Uber and all our product teams, but also ultimately our end customers who are using the service.
And then, over the intervening years, we've added an important component
to that, which is our partners, right?
So, you know, when we were just offering rides in San Francisco, you know, and a few, I'd
say, I don't know the exact number, but, you know, not that many cities around the
world at that time. And we had one product,
right, which is rides. To now, where we have the rides business, we have a delivery business,
we have a courier business, we're partnering with self-driving car manufacturers,
and there's other initiatives as well. So when you add those other business partners,
the need for a very tight understanding of the system, the performance, how we measure performance, how we communicate that to our partners becomes even more critical.
Because we have partnerships with large fast food chains like McDonald's, you know, and they have a certain expectation for our reliability that we certainly have to meet.
Just on this topic, I'd be curious: when you talk about somebody like McDonald's and they have a certain expectation, they have a certain SLA with you, I guess, right? How do you measure that? Do you report it back to them? Do they actually validate that your APIs deliver what you promised them? I don't know what the contractual obligation is. Can you share whatever you know of this?
Yeah, absolutely. So I think, and this is actually true also for some of our infrastructure
partners as well, like our cloud providers, for example, which we can get into.
So, you know, I think one of the challenges is,
you know, these deals originate,
you know, on the business side.
And in an ideal world,
you know, engineering would be involved
in a lot of those discussions
and defining a lot of those SLAs up front
and agreeing to them.
I think in most companies, in most cases, that probably doesn't happen, at least not consistently. And we're no different. So some of these deals get written in such a way that
we then have to adapt as best we can to meet those expectations. So I can't get into specifics. And
to be honest with you, I actually don't know the specifics of all the deals. But I'll give you some
general sense is that what we aim to do is be as transparent as possible. So, you know, there's this tendency, I think,
in the past to sort of, you know, protect our metrics or protect our performance information
from partners or from others. And I think now we're really exposing that to a lot of our partners and vice versa. So we want to actually really make use of our partners' metrics and what they're looking at in terms of performance and availability and ingest that into our systems and get this whole view of how the whole system is working.
Because in these sorts of partnerships, if Uber is working perfectly, but let's say Company X in the fast food world, they're not working
perfectly and they're down, the end result is the same for our customers,
which is that they can't get their meal. Or even for our delivery drivers
who are depending on this for income and for a way to
survive and live. So we have to understand all of the stakeholders involved, all of the places where our systems together, the whole system, could break, and then how we together monitor that and get proper metrics and responses to it.
So we partnered pretty deeply with some of these larger companies
on the delivery side to really understand their infrastructure,
their metrics, what's important to them,
and then also be transparent from our side and expose our systems.
And then I think the other piece of that is also communication.
So in a very general sense, Uber could be having some technical issue that impacts maybe one partner but not the others. And we used to treat our incidents, our outages, and our incident management in a very general sense, right? Okay, the delivery business is having an issue. We didn't really customize our incident response or our communications around that. For our larger customers, what they expected and what they would demand is to know how their system is doing in relation to us, not necessarily how the entire system is doing. So we had to really think about that. And this is all post go-live, right? Everything's already live, there's hundreds of millions of dollars flowing through this system, and we have to make sure that we adapt our systems to reflect that.
You know, you brought up a very interesting point, and I was actually going to bring it up, but you did: you talked about how you're not like a traditional e-commerce platform that is responsible from the user coming to the site, to the fulfillment of the product, to getting the product to the shipper.
Right.
They have full ownership of that.
So if there's anything wrong in that chain and you get the notification it's gone to UPS or it's gone to FedEx.
So you're like, OK, cool.
Now I know as a customer any delay is from FedEx,
not from the commerce site that I bought it from.
With Uber or any of your partners,
let's think about the Uber Eats kind of stuff, right?
There is no real breakdown between the two, right?
If it's slow, chances are they're going to blame Uber Eats, because McDonald's is fast food: of course it came in quick. Meanwhile, the order got lost at McDonald's, right? So it brings a whole new paradigm to making sure the customer knows that you're doing your job right, without throwing your partners under the bus.
And I apologize, I don't use Uber Eats because, you know, we have food at home.
But does it actually say it's with the driver now?
Is there a breakdown in the app to at least hint, like, okay, the driver has picked up your order now and it's on the way?
Okay, so you do at least have some sort of a...
Yeah, there's all that traceability
of the order status.
But you have a great point.
And that's why
this becomes so important
because ultimately
the customer doesn't care.
If McDonald's or somebody
gives the driver the wrong order
or the order's messed up or we mess up and the driver is not dispatched or, you know, there's a million things that can go wrong in this chain.
But ultimately, like the customer doesn't care, right?
They're going to blame most likely us because we're the mechanism for delivery.
But, you know, it doesn't reflect well on the restaurant, on anybody.
So we all have this shared interest of making sure that this transaction
from end to end is successful,
but measuring that transaction becomes pretty difficult
when you have this sort of triple-sided marketplace.
And you could even take it further, right?
Which is, in some sense,
it could even be a four-sided
or four-partner marketplace
because we also use cloud providers, right?
So if one of our cloud providers has an issue, and that's the origin of the problem, then it causes our delivery business to have an issue and the driver to not get the order.
You can see where this goes.
So I think it presents tremendous opportunity, but also a tremendous challenge, right?
The challenge is that if you don't have strong partnerships and you also don't have strong transparency
between all of those links,
then it's easy to get into a sort of blame game, right?
Like, okay, it can really corrode a partnership.
I think if you are on the same page
and you can bring transparency to that,
then you have a much better chance
of ensuring that this whole thing works
and works reliably from end to end.
And we're talking about millions and millions of transactions per day around the world,
so it's a big challenge.
And each partner is different, right?
Andy, what's that one in the UK that did all that? I just wanted to bring them up.
It's Mitchells & Butlers. It's one of these restaurant chains that, especially during the pandemic, had to change their business model. And I think that also forced them to work more closely with the food delivery services that they have in the UK.
And it was Mark Forrester who talked about it.
He pointed out with Uber Eats
that they also get your data into their observability
because from their perspective, they also want to make sure that their
customers are loyal. They're coming back to them because they had a good experience
whether they went into a restaurant, into a bar, or ordered
through, let's say, Uber Eats. And I remember Mark saying, yes, they're
collaborating with their partners. And I said, well, who is your partner? Well, it's Uber Eats
and all these.
Yeah, it's very important, because in the early days of delivery, and we actually caught a lot of perhaps well-deserved flak for some of this, there were some restaurants we didn't even have an official integration with, right? Same thing with DoorDash and everyone else. We would literally just be putting an order in, and a driver would go there and pick it up. And there's no partnership with that restaurant.
And so if something goes wrong, you know,
it's going to reflect badly on the restaurant.
It's going to reflect badly on us. So, so I think, you know,
having those close partnerships is very important. And then it gets into even more detail, in the sense that there are things that are a big concern in the restaurant industry, and for me as a consumer it's a concern as well: you want your food to arrive in good condition, preferably hot, ready to eat, and all of that. So there are things that we do with partners on packaging. We put a lot of effort into the routing that we send drivers on, how they get from point A to point B on the most efficient routes, particularly if they're handling multiple deliveries at once.
Um, we offer the, you know, we offer the option for customers, for users to select priority
delivery, meaning that driver is only going to bring
their order directly to them.
So for all of those things, you have to have all this data, and then what do you do with it? Well, we have to share it, and we want to share it with our partners to improve things for all of us.
Yeah.
And for me, this is the interesting moment. I think this should be a reminder for every one of us that we need to think end-to-end. We need to put ourselves into the shoes of the end user. Now, I know that for certain software companies, maybe the end-user journey really starts at the homepage of their website, and five clicks later that's it. But many of us are providing individual services that then make up an end-to-end user journey. We need to think end-to-end,
which means we need to understand
what is actually critical,
what is a user journey,
how do we measure it,
what type of data points do we have under control,
where do we need to give our data
in order to get data back.
And I think this is also so critical
and this is the conversation
and the question that I now have to you is,
how do you define and find these end user journeys?
And then how do you apply the concept of SLOs?
Do you apply any observability on that and what do you do?
Because this is a question that I always get and where I get also confused is if people say,
I want to put an SLO on every microservice.
I have a thousand microservices
and I want to put an SLO on it,
yet they fail in the end to deliver a good user experience
because they have not thought about the end user from the outside in.
Yeah, absolutely.
That's a great question.
And so I think this is something that we're still in the early days on, but we are definitely trying to understand it. Historically, our focus was the physical network: monitoring it and making sure that it's up. And we had some service-level objectives in terms of availability, and some around latency between different segments of the network, etc. But what we didn't understand was, when we breached those, what is the actual impact on our client services? Now, the challenge at Uber, or anywhere really, is that in the physical networking world we're probably the most tier-zero of tier-zero services, in that we underlie everything. If the network doesn't work,
then service A cannot talk to service B.
And as you mentioned, like many companies, we have this sort of sprawling microservices architecture. I think we have upwards of 3,500 microservices.
Now, out of those, you know, there's some subset, a much smaller subset that's critical for core trip flow.
So we look at two things, right?
One is core trip flow and the delivery business, meaning the ability of a driver to go online and say, I'm ready to take trips, a rider to actually look at a map
and book a ride, and then for that matching to happen,
and then for the ride to happen.
And then delivery business is very similar, like place an order,
route the driver to the pickup place, pick it up, drop it off.
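As a rough sketch of what a journey-level indicator for such a flow could look like (our illustration, not Uber's actual definitions; the step names are made up): a journey only counts as good if every step in the chain succeeds, which is exactly what per-service SLOs can miss.

```python
# Hypothetical journey-level SLI for a delivery flow; step names are invented.
# A journey only counts as "good" if every step succeeded.

DELIVERY_STEPS = ["place_order", "match_courier", "pickup", "dropoff"]

def journey_sli(journeys: list) -> float:
    """Fraction of journeys where every step succeeded end to end."""
    if not journeys:
        return 1.0
    good = sum(all(j.get(step, False) for step in DELIVERY_STEPS) for j in journeys)
    return good / len(journeys)

sample = [
    {"place_order": True, "match_courier": True, "pickup": True, "dropoff": True},
    {"place_order": True, "match_courier": True, "pickup": False, "dropoff": False},
]
print(journey_sli(sample))  # 0.5 -- each service's own SLO could still look green
```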
So what we're trying to do now is really understand, you know,
end-to-end what are SLOs on the network, how they impact the services above us, right?
So within our infrastructure, let's say we're guaranteeing like five nines availability
within a given availability zone for the network.
Okay, great.
When we have downtime or we have degradation,
in the past we'd say, well, okay, we're still hitting that five nines or four nines, depending on which part of the network, so we're fine. But actually, we could still be severely impacting the business during the time we are down within those four nines, or in degradations that don't rise to the level of an outage.
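For listeners who want the arithmetic behind "four nines or five nines": a minimal sketch of how an availability target translates into an allowed-downtime budget, assuming a simple 30-day window (illustrative only, not Uber's actual SLO math):

```python
# Rough error-budget math for an availability SLO (illustrative only).
# Even "five nines" leaves a real, if tiny, downtime budget, and as Vishnu
# notes, severe business impact can still fit inside it.

SECONDS_PER_DAY = 24 * 60 * 60

def downtime_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in seconds for a given SLO over the window."""
    return (1.0 - slo) * window_days * SECONDS_PER_DAY

for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.5f} -> {downtime_budget_seconds(slo):7.1f} s per 30 days")

# 0.99900 ->  2592.0 s per 30 days (~43 minutes)
# 0.99990 ->   259.2 s per 30 days (~4.3 minutes)
# 0.99999 ->    25.9 s per 30 days (~26 seconds)
```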
And so for those core services that make up the core trip flow I described, what we've really tried to do is understand them. First of all, we start at the beginning of the process, which is really helping them understand how our network is designed, so they can take maximum advantage of the physical redundancy we've built.
Because without that knowledge, we've seen incidents in the past where, even though we've built what we think is this super awesome network, we'd have critical services all deployed in like a handful of racks within one zone.
And we lose that part of the network and it's a big outage, right?
So over the years, one of the things we've really worked at and gotten better at is educating our service owners about the infrastructure that's available. A lot of times they don't know, and arguably they shouldn't have to care, right? Tooling should actually abstract that away from them. And we've now built that at Uber: if I'm a service owner, a customer-facing product service owner, and I'm deploying my service, it should take care of that dispersion for me and take advantage of the physical redundancy that's built.
So we've done that.
Now, the next part is sort of like traceability and understanding like all these services and how they interact.
And that part, I would say we're still very much kind of work in progress.
But, you know, in general,
what we're trying to do is take our network SLOs
and tie it to the core services
that really, you know,
really can impact our ability
to do those two functions for delivery and rides.
And then understanding how those services talk all across our infrastructure, and seeing whether our SLOs match what they're expecting.
Because another example is, in an Uber-controlled network, we can guarantee, well, we can't really guarantee anything, but what our SLO is, is four nines across zones and five nines within a zone.
Now, where this starts to break down is we also have huge cloud deployments.
So we have cloud deployments in three of the major cloud providers.
And you'll find in many cases that cloud providers will provide an SLA to their interconnects or where you connect with them.
But then anything that happens within their network is not really covered by
SLA or they may have a service they offered within the cloud.
You know, think DynamoDB or Spanner or something, right?
They have an SLA on that,
but the network traffic to get from the interconnect to that service is not
guaranteed necessarily. So then you have to think, okay, how do we handle that? The first part is visibility: understanding how these services work across both our infrastructure and cloud infrastructure, and then whether our guarantees are sufficient to meet the business requirements.
And then I think what we really have to do is focus on, you know,
making these services more resilient, right?
Because infrastructure failures do happen,
particularly when you have all these different pieces involved
in terms of three different cloud providers,
as well as our own infrastructure.
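One way to see why those uncovered hops matter (a back-of-the-envelope sketch; the component numbers are invented, not Uber's or any provider's actual SLAs): availabilities of serially dependent pieces multiply, so the end-to-end figure is worse than any single part.

```python
# Illustrative only: serial composition of availabilities along a request path.
# All numbers are made up for the example.
from math import prod

path = {
    "uber_network_cross_zone": 0.9999,  # "four nines across zones" from the talk
    "cloud_interconnect_sla":  0.9999,  # hypothetical interconnect SLA
    "in_cloud_hop_no_sla":     0.999,   # hop not covered by any SLA, assumed weaker
    "managed_service_sla":     0.9999,  # hypothetical managed-service SLA
}

end_to_end = prod(path.values())
print(f"end-to-end availability ~ {end_to_end:.5f}")  # ~0.99870, below every part
```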
First of all, thank you so much for sharing. I mean, not everybody is Uber, not everybody is Uber scale, or Über scale.
But it's really interesting to hear, because I never thought about this, that when you
are putting your services on cloud provider A, that they're guaranteeing you a certain
SLA to their door.
But what happens from their door to that next
service they use, whether it's a database or putting a container on their Kubernetes
environment or the managed environment.
I thought that's really interesting.
Now, it might not be the concern of every one of our listeners, but I think it should
be a concern because in the end, we are all deploying our critical workloads, most likely
not on our own infrastructure or not everything on our infrastructure. So we want to make sure
we understand everything end-to-end from the end user
until it actually hits that service.
And being aware of this, for me that was new and I assume also for
some of our listeners, this is new.
I remember the conversation we had in London and you said,
you mentioned partners, these cloud vendors,
these cloud providers are also partners of Uber.
Have you been able to engage with them and get more data out of them, out of their otherwise closed environments? Because what we hear is that some of these environments are really black boxes. They only give you what they give you, but sometimes it's not enough. Serverless, for example, and some of those other ones: it's abstracted from you and you're not going to get it. Do you guys have the clout, let's say, to say we need more from you?
Yeah, that's a great question, right? And I've run into this a couple of times in my career, so I'll quickly illustrate the earlier cases and show how much things have changed in a positive way. I remember working at Netflix with a major cloud provider, and Netflix was probably at that time, and maybe still is, their biggest customer. We were troubleshooting issues, and they showed us a snippet of a configuration, and from that configuration we could tell what kind of network devices they were running. But when we asked them, they would never tell us, yes or no, whether they had that kind of device. So the secrecy was there.
It is still there, but we've worked on it over the years, and they've opened up quite a bit as well, especially now with the partnerships we have. Uber's foray into cloud really started in probably 2016, 2017. We had all our own on-prem data centers, which we still have to this day, but we've also expanded into all three cloud providers, including one major one, Oracle Cloud, which we just launched publicly about a year and a half ago, I think.
So taking those lessons learned from day one, we've been working very, very closely with our cloud providers to understand as much about their infrastructure as we can.
They don't tell us everything, obviously, but I think where the transparency has really changed and moved is around monitoring,
alerting, and metrics. So, you know, we're at a point where we're emitting metrics from our
observability platforms directly to the GCPs, the OCIs of the world. And likewise, we are also
getting, you know, direct metrics and alerting from them as well for different parts of the infrastructure.
And then one piece of the puzzle is, when we have an issue or see an issue, letting them know and trying to find out: is it them? Do they know about it? Are they working on it? All of that.
So this is the whole operational piece that has to follow. So, you know, we do things like, you know,
we have shared Slack channels directly with engineers from some of those
companies where we can just chat with them directly.
We also obviously follow their internal processes for ticketing and all that.
But we, in general,
we're trying to make our team here an extension of their engineering team or
vice versa, right?
So building those relationships across is really important.
But I think we've started with really metrics and observability.
And I think in a few select cases, we've been sort of able to influence their roadmap,
not so much in what they're building, but in sort of priority, right?
Like, hey, this feature that you guys are thinking about, we really like it; can you reprioritize it higher for us? And so we've seen some movement there. But I think having those relationships is super important, so it's not just a black box that you don't understand. I also recognize it's probably not possible for every relationship or every company. We run into walls with other providers because we aren't that deployed in them, right? We have a pretty small deployment. So obviously the money and the scale matter to them as well.
Yeah, I think a lot of this harkens back to
there's company secrecy
or what companies pit against each other
versus what people in the IT world need.
And we've seen historically
people in the IT world share everything.
I'm throwing my code up in GitHub.
I'm talking about the things we severely messed up, so that you don't repeat them, or about what worked well. And in order for IT companies to work successfully together, there is that movement amongst, let's call them, the people.
I don't mean to be all grandiose here, right?
But the people doing the work need to share that information.
And it sounds like companies like Uber who have that financial power are starting to open the door.
I can understand why they wouldn't want to say exactly what kind of network device
it is, because what if they decide to change network devices?
People are going to freak out. But if they're at least providing the network performance data consistently, not necessarily telling you about an upgrade or a change, then hey, I at least have the transparency to see that everything is still coming in fine from the network components we rely on in the back end, and I'm happy enough.
So there is a give and take on
we don't want to freak people out
every time we're going to do something by being too transparent.
But even when you say that some of these newer cloud providers are not necessarily being as open,
it at least gives me hope that over time
they're going to start sharing some of these metrics so that people
can understand, I'm using your service,
everything looks good on my end, what's going on? Because that's just
going to give people more confidence in using those services
and help. We talked about the same dynamic between Uber Eats and a restaurant: it doesn't matter who's slow.
For the cloud provider, it
would behoove them to be transparent about it.
Oh yeah, our bad. Let's fix that to keep you happy.
You know?
So who knows?
It seems, hopefully, there's some hope in the future for more transparency, at least on a metrics level.
But we'll see, I guess.
No, and I think you're absolutely right. Obviously there are competitive things that companies don't want to give away, but there are a lot of other areas of collaboration and cooperation that could actually improve the reliability of their cloud for themselves, for their customers, and for us.
And then we can also improve on our side.
We learn stuff all the time. When you're dealing with engineers from GCP or AWS or OCI, or any one of these cloud providers, you're dealing with some really brilliant engineers who have thought about how to do infrastructure at scale. We're not tiny, but we're nowhere near their scale. They've seen where things break, and when it comes to thinking about going 100x bigger, they've done it, right?
So there's a lot we can learn from them as well.
Yeah, but it does get into interesting areas because even, you know,
on the example of the network device, you know, in some sense,
they're also our competitors, or at least they used to be.
So during the pandemic when all the supply chain issues were happening,
we were getting our hardware orders redirected by our vendors to some of these cloud providers, right, because they're the big buyers. So we were losing hardware for our own data centers to these cloud providers. There's still that competition angle, I would say, around hardware and things like that. But for the most part the relationships have been really, really good, and I do see these companies opening up a lot more publicly, in their engineering blogs and just in general. Because you're right, that's how the industry was founded, on those sorts of ideals and principles: what I learned could help you, so maybe I should teach you or help you, or vice versa.
Just a quick question, because you said you started with your own data centers, and you still have your own data centers. I guess you obviously invest in them, yet you have the cloud vendors.
What makes you decide where new workload goes?
Are there geographical decisions?
Are there technology stack decisions?
What makes you decide what goes where?
Yeah, so that's a great question.
So the way Uber's infrastructure works is there's a lot of back-end services, obviously.
And then we have, you know, sort of our edge services that need to be closer to a customer, right?
So I'll give an example.
In the early days or earlier days of Uber, before we had cloud, you know, our two main regions where our data centers are are East Coast and West Coast of the United States, right?
But Uber is a global company, and India, for example, is a huge market, and at the time we were competing very fiercely with the local competitor there. And if you turned on the Uber app on your phone in India, it might take a minute or two to load, right? Which is unacceptable performance from a user perspective.
And the reason, or a big reason, for that is that the network traffic is going from that phone to the Indian mobile provider's network,
to all the peerings across the world,
to subsea cables, to our data center in Virginia, right?
So that architecture had to change
where we get much closer to the customer.
So today, those edge services are deployed all over the world, primarily, actually entirely, in cloud providers, because it makes sense for us to utilize their presence around the world rather than building our own data centers everywhere.
So that's one example.
Now, in terms of back-end services, it's really critical that we look at cross zone or cross region dependencies,
which we've started to do, I'd say, in earnest maybe a year or two ago. So previously,
there weren't a lot of controls around it, right? Which would leave a lot of areas of brittleness, where things could break and cause outages. For example, we'd have service A in a zone, and it needs to talk to service B, and service B is actually also in the same zone. But for misconfiguration or historical reasons, service A would talk to the instance of service B across the country. And then a network problem happens in our backbone, they can't talk to the one across the country, and it just breaks.
We've seen a lot of cases like that over the years.
So today, what we're really striving for is zonal isolation, where for the most part services are self-contained in the zone. That way, if we lose a zone, it's fine; we have those services in other zones.
But from a network perspective,
this cross-zone, cross-region chattiness,
if it doesn't need to be there,
we're trying to reduce it as much as we can.
And that also goes for cloud providers.
We've had examples where we deploy some stateless services
in a cloud provider,
but all the stateful services and databases we need to talk to are still in our data center.
And then, you know, there's network issues or any other issues,
cloud provider issues or data center issues,
and then that communication breaks down.
So we're trying to co-locate as many dependencies as possible
and then sort of have our failure domain be like a zone, right?
Like, we can lose a zone, that's fine. But we're not quite there today. I think we still have quite a few cross dependencies, and this is also a moving target, right? With 3,500 microservices, and old ones being deprecated and new ones being written all the time, all of this is changing and morphing constantly. So we're trying to clean it up and get it to a stable state,
and then the next step is, again, tooling
and having this be an automated thing
where people don't have to think about it.
So in our deployment tooling today,
we've tried to stop the bleeding by implementing controls
where, again, as a service owner,
if I'm writing a product that's not infrastructure,
I shouldn't have to think about, hey, do I deploy it to zone A or zone B?
The tooling should think about it.
Look at my service. Look at the service tier.
Look at the subsequent reliability requirements of that service
and then deploy it dispersed as needed in a standard fashion
with all the monitoring and metrics that go with that.
So that's sort of where we're at today. I'd say we have the
controls in place, but we're still doing the cleanup and
enforcement of it.
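To make that idea concrete (a hypothetical sketch of ours, not Uber's actual deployment tooling; the tiers, names, and numbers are invented): the pipeline maps a declared service tier to a dispersion policy, so the owner never picks zones by hand.

```python
# Hypothetical tier-driven placement; not Uber's real tooling.
# The deploy pipeline, not the service owner, decides zone dispersion.
from dataclasses import dataclass, field

# Invented policy table: tier -> minimum zones and replicas per zone.
PLACEMENT_POLICY = {
    "tier0": {"min_zones": 3, "replicas_per_zone": 4},  # e.g. core trip flow
    "tier1": {"min_zones": 2, "replicas_per_zone": 2},
    "tier2": {"min_zones": 1, "replicas_per_zone": 2},
}

@dataclass
class ServiceSpec:
    name: str
    tier: str
    colocate_with: list = field(default_factory=list)  # stateful deps to keep zonal

def plan_placement(spec: ServiceSpec, zones: list) -> dict:
    """Spread replicas across zones according to the service's tier policy."""
    policy = PLACEMENT_POLICY[spec.tier]
    chosen = zones[: policy["min_zones"]]
    # A real planner would also weigh where spec.colocate_with dependencies
    # live, to keep cross-zone chatter down, as discussed above.
    return {zone: policy["replicas_per_zone"] for zone in chosen}

spec = ServiceSpec("dispatch", "tier0", ["rider-db"])
print(plan_placement(spec, ["zone-a", "zone-b", "zone-c", "zone-d"]))
# {'zone-a': 4, 'zone-b': 4, 'zone-c': 4}
```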
So if I get this right: as an engineer at Uber, or in any organization, you should in the end focus on creating your new service. You know in which region it should be available, because you're building a new feature, and you have certain dependencies on other services. But then the platform should figure out on its own where it should be deployed, in which capacity, and with whatever failover mechanisms are needed to meet your reliability and resiliency goals, as close as possible to the systems you depend on. So in case something fails, you contain the problem to that zone by making sure they are co-located as well as possible. And that is information you cannot expect every engineering team to know.
This is where you have collective intelligence that you build into a central platform
and then take these decisions off the shoulders of your engineers.
And one thing, oh, sorry, go ahead.
No, I also think, in the beginning you mentioned that you're doing a lot of, I'm not sure if you used the word mentoring, but you're working very closely with these engineering teams to educate them on what type of services you actually provide. And I think this whole educational aspect
is also a big point. But still, you cannot educate 4,000
people on all the individual details. This is where platform engineering,
at least from my perspective, comes in, that you're providing self-services
to your teams to allow
them to do their stuff. And then you take over the hard choices: what does it mean for this critical service to be five nines, and where do we need to deploy it because it is highly dependent on certain other backend services?
Yeah, absolutely. And I would just add, you touched on it, it's also, like I said, a moving target. In a lot of cases service owners may not know all their dependencies, right? They may not know, hey, I talk to Redis, I talk to this, I talk to that. They know some of it, but they may not know all of it, or it may change. And that's where I think humans break down: there's no way we can understand all of it or keep track of it.
And so that's where, not my team specifically, but other teams at Uber are exploring quite a bit. I know it's a hot thing right now, but whether you call it machine learning or AI, they're trying to apply it to the service graph and how these services all interact, and to apply those principles around deployment safety, how we test things, how we monitor metrics, all of those things, but have it be much more dynamic, where it could understand the system as it's morphing and changing over
time. So it's a huge area of opportunity. I mean, it's sort of a crazy problem to think about, but I think there's a lot of progress to be made there.
It's interesting with that, too,
because
those of us in the observability space collect a lot of the kind of data that decisions can be made upon. You have
things like Kubernetes,
you have add-ons to all this stuff that help you scale,
right? All the ingredients
for the recipe are there.
And it's interesting to see people finally starting to do what I think we all expected sooner,
which would be everybody should be automating this whole process.
We have this data.
We have the ability to scale.
We have the ability to do this.
Even if we're in zone one and everything's working in zone one except for a network connection between two components, if the system knows that it'll actually be faster to go out to zone two for that one hop and then pop back in, it should just do it, and then readjust. Self-healing and all that: all the ingredients are there, right? But it's just, I guess, the priorities, and how much does it actually take to build that intelligence into a system to make it work reliably?
And I think that's the piece that we're starting to see with all this.
But it's definitely a heavy lift, right?
But, you know, yeah, it's just really cool stuff that you get to work on there.
Yeah, it's hugely exciting, right?
I mean, with your guys' organization, there's so much there; you're right. It's tantalizingly close, or it feels like it is, but somehow there's still a lot to do.
So it's interesting.
Hey, Vishnu, before we close this out,
I have one more question for you.
In your role, right,
head of, I'm just looking at your LinkedIn profile here,
head of network infrastructure,
EMEA, platform engineering at Uber,
you mentioned this in the beginning.
What wakes you up at night?
Or what wakes you up on the weekend?
What could potentially ruin, well, obviously I know we're already in January when this airs. But what could have ruined your Christmas?
Besides finding out that I didn't know we could time travel before today? Yeah, no, I think for us it's always capacity, right? I won't name him, but shout out to my first boss at Uber. He's a very interesting guy, and he used to tell us: whatever you do, never ever run out of capacity as a team. And at that time Uber was going crazy and we were just trying to, like I said, keep our heads
uber he's a very interesting guy but he used to tell us like whatever you do like never ever run
out of capacity right as a team like and at that time uber is going crazy and we're just trying to
like i said keep our heads
above water. And in some ways that hasn't changed, right? Like Uber's growth has slowed a little bit,
but it's, you know, percentage wise it's growing and it's growing off a huge, huge base number.
It's like the growth is actually still kind of crazy. And we see it on the infrastructure side.
And I think this is where also, you know, deep observability and metrics can help is like predicting our growth.
We're very good at predicting our business growth, but I think translating that into infrastructure growth is something that I'm always worried about.
Because especially when you're dealing in the physical network world, if you're in a cloud provider, there's this perception that the cloud providers can just spin up unlimited capacity. And through our partnerships with them, we've learned
that's not possible. They have the same constraints we have, which is you need hardware,
it takes time to get the hardware, you need time to build it, all those things.
So there's a lead time for our underlying physical infrastructure
which can be addressed sort of on the edges. You can pull
in some dates here and there,
but there's some things that you just can't,
like that just take time, right?
You're laying fiber optic cables,
you're connecting things,
you're ordering hardware,
all these things take time.
So really getting a deep understanding
of like where the business is going
and how that translates into what we need to build
and that we build it in time.
That's what would kind of keep me up at night.
And maybe that's a lesson I learned
from my first boss here at Uber.
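As a thumbnail of that translation step from business forecast to hardware order (every number here is invented for illustration): demand gets projected forward by the procurement lead time, so the order placed today matches the load on the day the gear actually goes live.

```python
# Illustrative capacity-planning arithmetic; all figures are invented.
import math

trips_per_day_now = 25_000_000   # hypothetical current demand
monthly_growth = 0.03            # hypothetical 3% month-over-month growth
trips_per_server_day = 50_000    # hypothetical capacity of one server
lead_time_months = 6             # hardware/fiber procurement lead time

# Project demand to the point when ordered hardware actually arrives.
future_trips = trips_per_day_now * (1 + monthly_growth) ** lead_time_months
servers_needed = math.ceil(future_trips / trips_per_server_day)

print(f"order ~{servers_needed} servers now to be live in {lead_time_months} months")
```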
Well, shout out to him who remains unnamed,
but the people that know him will probably know him.
Yeah.
We'll call him Jebediah.
Some ancient kind of name.
Yeah. Vishnu, I want to say thank you so much. By the time you listen to this, you will hopefully have had a great end of the year and a great start of the new year.
I was really fortunate
to bump into you in London.
I want to also say thanks to Sam
who organized the conference
to allow me to do the fireside chat with you.
This got us connected and this in the end resulted in this podcast.
So I think treating partnerships really well pays off in the end. It comes back to what you said earlier: Sam allowed me to host that part of the conference, he did me a favor, and in return I helped him out. And so in the end, now we're here. So treat your partners well. And thank you so much.
Yeah, thank you.
I wanted to thank you and Brian. I really enjoyed the conversation.
I looked at the clock just now.
It went by so fast. We're having so much fun.
Thank you.
It's always great to hear what's happening on the cutting edge.
So really, really appreciate you sharing it
and continuing that spirit of sharing knowledge.
It's always difficult in the corporate world, like, oh, can I say that? But there are some fundamentals that really benefit everybody.
I think at the end of the day, we're all just trying to build cool stuff.
Right? Yeah.
And keep going. So, yeah. Thanks.
All right. Thank you. And Happy New Year to everybody.
Or if we want to go back in time for when we're recording,
happy holidays, everybody, whatever you're celebrating. And thanks, everyone. We'll see you on the next episode. Bye-bye.
Thank you.
Bye.