PurePerformance - From Infra to Services to Happy End Users: The role of SLOs at Uber with Vishnu Acharya

Episode Date: January 6, 2025

eBay, Yahoo, Netflix and then 10+ years at Uber. In this episode we sit down with Vishnu Acharya, Head of Network Infrastructure EMEA and Platform Engineering at Uber. Vishnu shares how Uber has scaled over the years to about 4,000 engineers and how his team makes sure that infrastructure and platform engineering scales with the growing company and the growing demand on their digital services. Tune in and learn how Vishnu thinks about SLOs across all layers of the stack, how they manage to get better insights with their cloud providers, and why it's important to have an end-to-end understanding of the most critical end user journeys.

Links we discussed:

Conference talk at Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/vishnu-acharya
Vishnu's LinkedIn Page: https://www.linkedin.com/in/vishnuacharya/
Uber Engineering Blog: https://www.uber.com/blog/engineering/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson, and as always, I have with me my wonderful co-host, Andy Grabner. And Andy, the fact that you did not mock me... I was going to ask you if Krampus came to visit you, but the fact that you didn't, or maybe he did. Maybe Krampus did come to visit you, which is why you're being nice to me
Starting point is 00:00:48 suddenly. Maybe and it's just a couple of days before Christmas so maybe I do want to get some Christmas presents. No, it's after Christmas. This episode is after Christmas. Yeah, but the recording is still, it's in the past. Yeah, but we're pretending for the people it's after. So everyone listening, pretend this is after. Pretend
Starting point is 00:01:04 Andy didn't just blow the cover because we'll leave this all in it's more fun this way because it's more entertaining than anything i would say um so yeah it's it's the new year andy it's 2025 we're coming to you from the future um and so will it be an um i think it will be an an uber interesting word because Uber is basically something that I typically, when I heard the company name the first time, I thought,
Starting point is 00:01:32 why is everybody talking about Uber? It must be Uber or where does it come from? And with my German background, obviously I got a little bit confused. But we have somebody with us today that hopefully can shed some light on not only the name, but more importantly, about what is it actually like to work at Uber? How did Uber change
Starting point is 00:01:54 over the years? I want to learn about how does Uber manage their service level objectives? How does Uber ensure that everything works as expected? And I want to also find out if anybody still calls it Uber. Because there's no umlaut. So yes, I have to get that one in there. We were hearing that in the early days. Hey, Vishnu, as we told you in the first couple of minutes, it's a little bit boring for our guests because Brian and I just go off.
Starting point is 00:02:22 But thank you so much for being on the show. Can you do me a quick favor? Because you have not run away, so that's a good sign for us. The first episode in the new year. Can you quickly introduce yourself to the audience? Sure, absolutely. And first I'll say before I introduce myself, I'm a bit embarrassed because I actually don't know
Starting point is 00:02:37 the origin story of the name. It's not something that's widely talked about here, but I will effort to find out after this podcast. And to answer your question, Brian, people do call it all kinds of different things, and that pronunciation I've definitely heard. But it's very interesting as you go around the world. So you need the umlaut over the you. Yeah, exactly.
Starting point is 00:02:59 So hey, everybody. My name is Vishnu Acharya. I'm super happy to be here on the podcast with Brian and Andy. I've been looking forward to this for some time. A little bit about me. As you realize by now, I'm an engineering leader here at Uber. I've been at Uber about 10 and a half years now, which we were talking about it with some co-workers last night. It feels like two days and 20 years also at the same time. It's this really weird time warp thing
Starting point is 00:03:28 that goes on. I've been working on platform engineering, network engineering, SRE type areas here for the last 10 and a half years and I'm excited to be here with you guys. It's amazing. Who did we
Starting point is 00:03:45 have recently on who also happened to be with the company for so many years. So we have a couple of guests who have really
Starting point is 00:03:52 been with their current organization for more than a decade. And I think it's just really interesting because some people are
Starting point is 00:03:58 jumping around in our industry quite a bit, which on the benefit is you see a lot of different teams, a lot of different technology companies in different stages of life.
Starting point is 00:04:07 But people like you and also Brian and myself, we've been with Dynatrace for quite a bit. We've been here for so long, for so many years. We saw the change. And this is actually my first question, because I remember for the folks that are listening in, Vishnu and I, we got to meet at an SRE conference, an observability conference in London. I think it was early October. And I sat down with you at a fireside chat.
Starting point is 00:04:32 I was asked to moderate that session. And then we got to know each other a little bit in chatting. And then the first thing that really struck me is, hey, you've been there for 10 years. Can you just quickly fill us in? What has changed over the last 10 years at Uber? What was the company like when you started? What has changed? And obviously, always with a little
Starting point is 00:04:51 bit of a background mind on what's the engineering life, obviously, what has changed from a scaling perspective? We talked a lot about performance, about observability. That would be just interesting to hear. Sure, absolutely. And just before I answer that question be just interesting to hear sure absolutely and just before i answer that question just really quick um so so in my in my career i actually did jump around quite a bit in my younger years i would say i'm not maybe i'm dating myself here but you know i started working in the industry like 1999 2000 in my 20s you know i didn't work anywhere longer than like two and a half three years i think which was, which was Netflix prior to Uber. So when I got to Uber, I really enjoyed building things and scaling things in companies. And I really enjoyed that startup energy.
Starting point is 00:05:36 And when I got to Uber, I obviously got that in sp be here 10 and a half years later as it's grown into this massive worldwide organization and technology. So I think in the early days, especially as I came into Uber, in the infrastructure organization, which is what we sort of talked about or called it before platform engineering, became the industry standard. So in our infra org, I think it was like 16 engineers, right? I was like number 17 or whatever. At that time, it was for a company, even at that time in 2014, of the size that Uber was Uber was and the speed it was growing at, every day it was just everything was on fire, right?
Starting point is 00:06:30 If we made it through a day without various parts of the system falling over and just completely failing, that was a win, right? So on the one hand, it's exciting because you come in and there's challenges left and right. And there's just so much to do. On the downside, it's like you feel like you're just trying to survive or trying to make it through that day. So it was very, very intense, very, very fast moving, both on the business side and as well as the technology side. And so, you know, I think when you're in that kind of situation, and it was also this sort of strange dynamic where we knew, you know, by the time I joined Uber, it was 2014. So it was by no
Starting point is 00:07:17 means a tiny startup, you know, they had raised, I think, at that time, a big $4 billion round. So it was a huge startup, I would say. But the focus was really, at that time, even nobody really knew how big it could get. We just saw sort of all the graphs, whether you look at systems or our infrastructure or our services or our business metrics, everything was a huge hockey stick graph, but we didn't know where it would end, right?
Starting point is 00:07:47 So the decision-making as you're building that infrastructure is pretty interesting because you have, on the one hand, this instinct to like, we just need to make it through today. But then you also have to think like, how do we make this through the week, the month, the next five years? And that part is very difficult when you're under that kind of pressure. So I think in the early days when we talk about, you know, SLOs and SLIs and SLIs, we didn't really have any of that, right?
Starting point is 00:08:15 We were just trying to keep the service up as much as possible while also growing it in every dimension, you know, by 10x, 100x, right? So that was sort of my initial foray into Uber. Thank you so much for going back in time. And obviously you mentioned when you joined, Uber was already no longer a small startup out of a garage, but already had a good size. Still a lot of things have changed over the years i
Starting point is 00:08:45 remember the discussion we had in london i believe you said something you're about 4 000 engineers now give or take is that right yeah that's correct so i think we're around 4 000 engineers uh globally um you know what infrastructure has turned into platform engineering, I think is around 900 to a thousand out of that. You know, so it's a, it's a large organization. You know, I think obviously, you know, we've had a lot of people come and go. There are, there are a surprising number of us who've been around sort of at the 10 year mark or, or, or close to that. But yeah, we've, we've tried to sort of keep the good pieces of our early engineering culture,
Starting point is 00:09:28 which is this sort of, I think now we're calling it go-get-it, where basically from a technical standpoint and just from how we address issues and initiatives, pulling out all the stops and doing whatever it takes to kind of do what we need to do. But there's definitely been a lot of growing pains along the way. And then, you know, as we fast forward to today, you know, we've matured in many, many different ways, but especially, you know, for this conversation in how we're building our infrastructure,
Starting point is 00:10:02 how we're measuring our success or in we're measuring our success, or in some cases, lack of success in those initiatives, how we're ensuring that we're building a stable and performant platform, not just for our internal customers, right, which is every other engineer at Uber and all our product teams, but also ultimately our end customers who are using the service. And then, you know, we've also, over the intervening years, product teams, but also ultimately our end customers who are using the service. And then, you know, we've also, over the intervening years, we've added an important component to that, which is our partners, right?
Starting point is 00:10:31 So, you know, when we were just offering rides in San Francisco, you know, and a few, I'd say, I don't know the exact number, but, you know, not that many cities around the world at that time. And we had one product, right, which is rides. To now, where we have the rides business, we have a delivery business, we have a courier business, we're partnering with self-driving car manufacturers, and there's other initiatives as well. So when you add those other business partners, the need for a very tight understanding of the system, the performance, how we measure performance, how we communicate that to our partners becomes even more critical. Because we have partnerships with companies like large fast food chains like McDonald's know, they have a certain expectation for our reliability
Starting point is 00:11:25 that we have to certainly meet. Just on this topic, we'll be curious, when you talk about somebody like McDonald's and they have a certain expectation, they have a certain SLA with you, I guess, right? How do you measure that? Do you report that back to them? Do they actually validate that your APIs, your promise to them is what they, you know, I don't know what the contractual obligation is. Can you just, if you know any of this, yeah? Yeah, absolutely. So I think, and this is actually true also for some of our infrastructure partners as well, like our cloud providers, for example, which we can get into. So, you know, I think one of the challenges is,
Starting point is 00:12:08 you know, these deals originate, you know, on the business side. And in an ideal world, you know, engineering would be involved in a lot of those discussions and defining a lot of those SLAs up front and agreeing to them. I think in most companies, in most cases,
Starting point is 00:12:23 that probably doesn't happen at least consistently. So, you know different. So some of these deals get written in such ways that we then have to adapt as best we can to meet those expectations. So I can't get into specifics. And to be honest with you, I actually don't know the specifics of all the deals. But I'll give you some general sense is that what we aim to do is be as transparent as possible. So, you know, there's this tendency, I think, in the past to sort of, you know, protect our metrics or protect our performance information from partners or from others. And I think now we're really exposing that to a lot of our partners and vice versa. So we want to actually really make use of our partners' metrics and what they're looking at in terms of performance and availability and ingest that into our systems and get this whole view of how the whole system is working. Because in these sorts of partnerships, if Uber is working perfectly, but let's say Company X in the fast food world, they're not working perfectly and they're down, the end result is the same for our customers,
Starting point is 00:13:31 which is that they can't get their meal. Or even for our delivery drivers who are depending on this for income and for a way to survive and live. So we have to understand that all of the stakeholders involved, all of the stakeholders involved, all of the places that our system together, the whole system could break, and then how do we together monitor that and get proper metrics and response to that. So we partnered pretty deeply with some of these larger companies on the delivery side to really understand their infrastructure,
Starting point is 00:14:03 their metrics, what's important to them, and then also be transparent from our side and expose our systems. And then I think the other piece of that is also communication. So in a very general sense, Uber could be having some technical issues that impacts maybe one provider but not the others um and and so we used to treat we used to treat sort of our incidents and our outages and our incident management just in a very general sense right okay like delivery business is having an issue and so we didn't really customize our our incident response or our communications around that you
Starting point is 00:14:41 know for our larger customers, which is what they expected and what they would demand is that they want to know how their system is doing in relation to us, not necessarily how the entire system is doing. So we had to do, you know, we had to really think about that. And this is all post, you know, go live, right? So everything's already live. There's hundreds of millions of dollars, you know, flowing flowing through this system and we have to make sure that we adapt our systems to reflect that you know I was just wanted to bring it you brought up a very interesting point and I was actually going to bring it up but you did you talked about how you know you're not like at a traditional e-commerce
Starting point is 00:15:22 platform who is responsible from the user coming to the site to the fulfillment of the product to getting the product to the shipper. Right. They have full ownership of that. So if there's anything wrong in that chain and you get the notification it's gone to UPS or it's gone to FedEx. So you're like, OK, cool. Now I know as a customer any delay is from FedEx, not from the commerce site that I bought it from. With Uber or any of your partners,
Starting point is 00:15:52 let's think about the Uber Eats kind of stuff, right? There is no real breakdown between the two, right? If it's slow, chances are they're gonna blame uber eats because mcdonald's is fast food of course it came in in quick meanwhile the order got lost in mcdonald's right so it brings a whole new um paradigm to making sure the customer knows that you're you're doing your job right without throwing your partners under the bus. And I apologize, I don't use Uber Eats because, you know, we have food at home. But does it actually say it's with the driver now?
Starting point is 00:16:35 Is there a breakdown in the app to at least like slightly hint and lightly like say, okay, the driver has picked up your order now and it's going to deliver? Okay, so you do at least have some sort of a... Yeah, there's all that traceability of the order status. But you have a great point. And that's why this becomes so important
Starting point is 00:16:56 because ultimately the customer doesn't care. If McDonald's or somebody gives the driver the wrong order or the order's messed up or we mess up and the driver is not dispatched or, you know, there's a million things that can go wrong in this chain. But ultimately, like the customer doesn't care, right? They're going to blame most likely us because we're the mechanism for delivery. But, you know, it doesn't reflect well on the restaurant, on anybody.
Starting point is 00:17:22 So we all have this shared interest of making sure that this transaction from end to end is successful, but measuring that transaction becomes pretty difficult when you have this sort of triple-sided marketplace. And you could even take it further, right? Which is, in some sense, it could even be a four-sided or four-partner marketplace
Starting point is 00:17:43 because we also use cloud providers, right? So if one of our cloud providers has an issue, and that's the origin of the cause, and then it causes our delivery business to have an issue and the driver to not get the order. You can see where this goes. So I think it presents tremendous opportunity, but also a tremendous challenge, right? The challenge is that if you don't have strong partnerships and you also don't have strong transparency between all of those links, then it's easy to get into a sort of blame game, right?
Starting point is 00:18:11 Like, okay, it can really corrode a partnership. I think if you are on the same page and you can bring transparency to that, then you have a much better chance of ensuring that this whole thing works and works reliably from end to end. And we're talking about millions and millions of transactions per day around the world, so it's a big challenge.
Starting point is 00:18:33 And each partner is different, right? Andy, what's that one in the UK that did that all? I just wanted to bring them up. It's Mitchell & Butler's and Vision of a U. It's one of these restaurant change that also, especially during the pandemic, had to change their business model. And I think that also forced them to work more closely
Starting point is 00:18:50 with the food delivery services that they have in the UK. And it was Mark Forrester who talked about it. He pointed out with Uber Eats that they also get your data into their observability because from their perspective, they also want to make sure that their customers are loyal. They're coming back to them because they had a good experience whether they went into a restaurant, into a bar, or ordered through, let's say, Uber Eats. And I remember Mark saying, yes, they're
Starting point is 00:19:19 collaborating with their partners. And I said, well, who is your partner? Well, it's Uber Eats and all these. Yeah, it's very important because in the early days of delivery, and we actually caught a lot of probably perhaps well-deserved fact for some of this is like there's some restaurants like we didn't even have an official integration with, right? Or the same thing with DoorDash and everyone else. So they would literally just be putting an order in and a driver will go there and pick it up. And there's no, you know, there's no partnership with that restaurant.
Starting point is 00:19:52 And so if something goes wrong, you know, it's going to reflect badly on the restaurant. It's going to reflect badly on us. So, so I think, you know, having those close partnerships is very important. And then even, you know, it gets into even more detail in the sense of, you know, there's things that our then even, you know, it gets into even more detail in the sense of, you know, there's things that our partners are, you know, a big concern in the restaurant industry and me as a consumer, it's a concern for me as well. It's like, you know, you want your food to arrive in good condition, you know, preferably hot, you know, ready to eat and all of that. So,
Starting point is 00:20:21 you know, there's things that we do with partners on packaging. There's, um, we put a lot of effort into, you know, the routing that we send drivers on, like how they get from point A to point B, most efficient routes is specifically, or particularly if they're handling, you know, multiple deliveries at once. Um, we offer the, you know, we offer the option for customers, for users to select priority delivery, meaning that driver is only going to bring their order directly to them. So all of those things, you have to have all this data
Starting point is 00:20:49 that you can then, what do you do with it? Well, we have to share it and we want to share it with our partners to improve things for all of us. Yeah. And for me now, this is the interesting moment where
Starting point is 00:21:01 I think this should be a reminder for every one of us that we need to think end-to-end. We need to put ourselves into the shoes of the end-user. Now, I know that certain software companies, maybe the end-user journey really starts at the homepage of their website and five clicks and that's it. But for many of us where we are providing individual services that then make up an end-to-end user journey. We need to think end-to-end,
Starting point is 00:21:27 which means we need to understand what is actually critical, what is a user journey, how do we measure it, what type of data points do we have under control, where do we need to give our data in order to get data back. And I think this is also so critical
Starting point is 00:21:41 and this is the conversation and the question that I now have to you is, how do you define and find these end user journeys? And then how do you apply the concept of SLOs? Do you apply any observability on that and what do you do? Because this is a question that I always get and where I get also confused is if people say, I want to put an SLO on every microservice. I have a thousand microservices
Starting point is 00:22:09 and I want to put an SLO on it, yet they fail in the end to deliver a good user experience because they have not thought about thinking about the end user from the outside in. Yeah, absolutely. That's a great question. And so I think this is something that we're still in the early days on, but we were definitely trying to understand that, the physical network, and monitoring it and making sure that it's up.
Starting point is 00:22:48 And we had some service-level objectives in terms of availability, and we had some around latency between different segments of the network, etc. But what we didn't understand is when we breached that, what is the actual impact to our client services. Now, the challenge at Uber or anywhere is actually in the physical networking world, we're probably the most tier zero of tier zero services in that we underlie everything. If the network doesn't work, then service A cannot talk to service B. And as you mentioned, we have this, like many companies, we have this sort of sprawling microservices architecture. I, we have this, like many companies, we have this sort of sprawling microservices architecture. I think we have like upwards of 3,500 microservices.
Starting point is 00:23:30 Now, out of those, you know, there's some subset, a much smaller subset that's critical for core trip flow. So we look at two things, right? One is core trip flow and the delivery business, meaning the ability of a rider to go online and say, I'm ready to take trips, a rider to actually look at a map and book a ride, and then for that matching to happen, and then for the ride to happen. And then delivery business is very similar, like place an order, route the driver to the pickup place, pick it up, drop it off.
Starting point is 00:24:03 So what we're trying to do now is really understand, you know, end-to-end what are SLOs on the network, how they impact the services above us, right? So within our infrastructure, let's say we're guaranteeing like five nines availability within a given availability zone for the network. Okay, great. When we have downtime or we have degradation, you know, in the past we'd say, well, okay, well, we're still hitting that five nines or four nines, you know, depending on which part of the network we're fine. But actually, you know, we could still
Starting point is 00:24:35 be severely impacting the business in the time that we are down that doesn't match those four nines or even in the degradations that we have that doesn't rise to the level of outage. And so what we've really tried to do is those core services that make up the core trip flow that I described is we try to really understand. First of all, we try to start at the beginning of the process, which is really trying to help them understand how our network is designed for them to take the maximum advantage of the physical redundancy we've built. Because without that knowledge, we've seen incidents in the past where even though we've built this, what we think is this super awesome network, we'd have microservices like all deployed in
Starting point is 00:25:15 or critical services all deployed in like a handful of racks within one zone. And we lose that part of the network and it's a big outage, right? So over the years, that's one of the things and it's a big outage right so over over the years that's that's one of the things we really worked at and gotten better at is really educating our service owners about the infrastructure that's available so a lot of times they don't know and and then arguably they shouldn't care right like tooling should actually abstract that away from them and we've got now have that built that at uber where if I'm a service owner or a product, you know, product customer facing service owner, you know, and I'm deploying my service, it should take care of that dispersion for me and take advantage of the physical redundancy that's built. So we've done that.
Starting point is 00:25:56 Now, the next part is sort of like traceability and understanding like all these services and how they interact. And that part, I would say we're still very much kind of work in progress. But, you know, in general, what we're trying to do is take our network SLOs and tie it to the core services that really, you know, really can impact our ability to do those two functions for delivery and rides.
Starting point is 00:26:23 And then understanding how those services talk all across our infrastructure and see do our SLOs match what they're expecting. Because another example is, you know, in an Uber-controlled network, we can guarantee, well, we can guarantee anything, but, you know, we try to, what our SLO is, is five nines within a, or sorry, four nines across zones and five nines within a zone. Now, where this starts to break down is we also have huge cloud deployments. So we have cloud deployments in three of the major cloud providers.
Starting point is 00:26:56 And you'll find in many cases that cloud providers will provide an SLA to their interconnects or where you connect with them. But then anything that happens within their network is not really covered by SLA or they may have a service they offered within the cloud. You know, think DynamoDB or Spanner or something, right? They have an SLA on that, but the network traffic to get from the interconnect to that service is not guaranteed necessarily. So, so then you have to think, okay, well, how do we – so first part is sort of visibility and understanding how these services work
Starting point is 00:27:33 across both our infrastructure, cloud infrastructure, and then do our guarantees, are they sufficient to meet the business requirements? And then I think what we really have to do is focus on, you know, making these services more resilient, right? Because infrastructure failures do happen, particularly when you have all these different pieces involved in terms of three different cloud providers, as well as our own infrastructure.
Starting point is 00:28:01 I mean, not everybody, first of all, thank you so much for sharing. Not everybody is Uber, Not everybody is Uber scale, or Uber scale. But it's really interesting to hear, because I never thought about this, that when you are putting your services on cloud provider A, that they're guaranteeing you a certain SLA to their door. But what happens from their door to that next service they use, whether it's a database or putting a container on their Kubernetes environment or the managed environment.
Starting point is 00:28:35 I thought that's really interesting. Now, it might not be the concern of every one of our listeners, but I think it should be a concern because in the end, we are all deploying our critical workloads, most likely not on our own infrastructure or not everything on our infrastructure. So we want to make sure we understand everything end-to-end from the end user until it actually hits that service. And being aware of this, for me that was new and I assume also for some of our listeners, this is new.
Starting point is 00:29:07 I remember the conversation we had in London and you said, you mentioned partners, these cloud vendors, these cloud providers are also partners of Uber. Have you been able to engage with them and get more data out of them, out of their otherwise closed environments. Because what we hear is some of these environments are really black boxes. They only give you what they give you, but sometimes it's not enough. Serverless, for example, and some of those other ones, it's like, well, it's distracted
Starting point is 00:29:40 from you and you're not going to get it. Do you guys have the clout, let's say, to say we need more from you and you're not going to get it do you guys have the you know the clout let's say to say we we need more from you yeah yeah so that's a that's a great question right and i've run into this a couple times in my career and i'll just quickly illustrate the earlier ones and show how much things have changed sort of in a positive way so you know i remember working at netflix and and as a major cloud provider and Netflix is probably at that time, or maybe still their biggest, their biggest customer. And, you know, we were troubleshooting issues and, you know, they showed us a snippet of a
Starting point is 00:30:17 configuration, and from that configuration we could tell, you know, what kind of network devices they were running. But when we asked them, they would never tell us, like, yes or no, we have this kind of device. So, you know, the secrecy was there. It is still there, but I think, with what we've done over the years and the partnerships we have now, they've opened up quite a bit as well. Even the early days of Uber's foray into the cloud, that was probably 2016, 2017, when we really started.
Starting point is 00:30:50 We had all our own on-prem data centers, which we still have to this day, but we've also expanded into all three cloud providers, starting from one major one, which is Oracle Cloud, which we just announced publicly like a year and a half ago, I think. So taking those lessons learned from day one, we've been working very, very closely with our cloud providers to understand as much about their infrastructure as we can. They don't tell us everything, obviously, but I think where the transparency has really changed and moved is around monitoring, alerting, and metrics. So, you know, we're at a point where we're emitting metrics from our observability platforms directly to the GCPs, the OCIs of the world. And likewise, we're also getting direct metrics and alerting from them for different parts of the infrastructure. And that's sort of one piece of the puzzle: really understanding, when we have an issue or we see an issue, letting them know and trying to find out, is it them?
Starting point is 00:31:58 Do they know about it? Are they working on it? All of that. So this is the whole operational piece that has to follow. So, you know, we do things like, you know, we have shared Slack channels directly with engineers from some of those companies where we can just chat with them directly. We also obviously follow their internal processes for ticketing and all that. But we, in general,
Starting point is 00:32:18 we're trying to make our team here an extension of their engineering team, or vice versa, right? So building those relationships across is really important. But I think we've started with really metrics and observability. And I think in a few select cases, we've been able to influence their roadmap, not so much in what they're building, but sort of in priority, right? Like, hey, this feature that you guys are thinking about, yeah, we really, really like it, you know, can you guys
Starting point is 00:32:49 reprioritize this higher for us? And so we've seen some movement there. But I think having those relationships is super important, so it's not just a black box to you that you don't understand. But I also recognize, you know, it's probably not possible for every relationship or every company. We run into walls with other providers because we aren't that deployed in them, right? We have a pretty small deployment. So I think, obviously, the money and the scale matter to them as well. Yeah, I think a lot of this harkens back to: there's company secrecy
Starting point is 00:33:30 or what companies pit against each other, versus what people in the IT world need. And we've seen historically that people in the IT world share everything: I'm throwing my code up on GitHub, I'm talking about the things we severely messed up so that you don't repeat them, or what worked well. And in order for the IT companies to work successfully together, there is that movement amongst, let's call them, the people. I don't mean to be all grandiose here, right?
Starting point is 00:33:58 But the people doing the work need to share that information. And it sounds like companies like Uber, who have that financial power, are starting to open the door. I can understand why they wouldn't want to say exactly what kind of network device it is, because what if they decide to change network devices? People are going to freak out. But if they're at least providing the network performance consistently, not necessarily telling you about an upgrade or a change, hey, if I have at least the transparency to see
Starting point is 00:34:29 everything is still coming in fantastic from the network components that we rely on in the back end, I'm happy enough. So there is a give and take on we don't want to freak people out every time we're going to do something by being too transparent. But even when you're saying on some of these new cloud providers how they're not necessarily being as open, it at least gives me hope that over time
Starting point is 00:34:52 they're going to start sharing some of these metrics so that people can understand: I'm using your service, everything looks good on my end, what's going on? Because that's just going to give people more confidence in using those services. It's the same as what we talked about with Uber Eats and a restaurant: it doesn't matter who's slow. For the cloud provider, it would behoove them to be transparent about it.
Starting point is 00:35:23 Oh yeah, our bad. Let's fix that to keep you happy, you know? So who knows? Hopefully there's some hope in the future for more transparency, at least on a metric level. But we'll see, I guess. No, and I think you're absolutely right. And, you know, obviously there's competitive things
Starting point is 00:35:41 that companies don't want to give away, but there's a lot of other areas of collaboration and cooperation that could actually improve the reliability of their cloud for themselves, for their customers, for us. And then we can also improve on our side. We learn stuff all the time. I mean, when you're dealing with engineers from GCP or AWS or OCI or any of these cloud providers, you're dealing with some really brilliant engineers who have thought about how to do infrastructure at scale. You know, we're not tiny, but we're nowhere near their scale. And so they've seen where things break, and how to think about going 100x bigger; they've done it, right? So there's a lot we can learn from them as well. Yeah, but it does get into interesting areas, because even, you know,
Starting point is 00:36:34 on the example of the network device, in some sense they're also our competitors, or at least they used to be. So during the pandemic, when all the supply chain issues were happening, we were getting our hardware orders, let's say, redirected from our vendors to some of these cloud providers, right, because they're the big buyers. And so we're losing hardware for our own data centers to these cloud providers. So there's that competition angle, you know, still, I would say, around hardware and things like that. But, for the most part, the relationships have been really, really good.
Starting point is 00:37:10 And I do see these companies opening up a lot more publicly, like in their engineering blogs and just in general. Because you're right, that's how the industry was sort of founded, on those ideals and principles: what I learned could help you, so maybe I should teach you or help you, or vice versa. Just a quick question, because you said you started with your own data centers, and you still have your own data centers. I guess you obviously invest in them, yet you have the cloud vendors. What makes you decide where new workload goes? Are there geographical decisions?
Starting point is 00:37:50 Are there technology stack decisions? What makes you decide what goes where? Yeah, so that's a great question. So the way Uber's infrastructure works is, there's a lot of back-end services, obviously, and then we have, you know, sort of our edge services that need to be closer to a customer, right? So I'll give an example. In the early days, or earlier days, of Uber, before we had cloud, our two main regions, where our data centers are, are the East Coast and West Coast of the United States, right? But Uber is a global
Starting point is 00:38:25 company, and so, you know, India, for example, is a huge, huge market, and we were competing at the time very fiercely with Ola, which is the local competitor there. And if you turn on the Uber app on your phone in India, it might take a minute or two minutes to load the app, right? Which is unacceptable performance from a user perspective. And the reason for that, or a big reason for that, is that that network traffic is going from that phone to an Indian mobile provider network, to all the peerings across the world,
Starting point is 00:38:55 to subsea cables, to our data center in Virginia, right? So that architecture had to change, where we get much closer to the customer. So today, we're deployed in all those edge services all over the world, actually entirely in a cloud provider, because it makes sense for us to utilize their presence all over the world rather than building our own data centers all over the world. So that's one example.
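That routing change, serving the user from the nearest edge rather than a fixed home region, boils down to a simple selection rule. Here's a toy sketch of it; the region names and latency numbers are invented to mirror the India-to-Virginia example, not Uber's actual topology or routing logic:

```python
# Toy latency-based edge selection: route a client to whichever
# candidate region answers with the lowest round-trip time.
# All regions and RTT values below are illustrative only.
MEASURED_RTT_MS = {
    "us-east-virginia": 240.0,  # phone in India -> subsea cables -> Virginia
    "us-west": 280.0,
    "ap-south": 35.0,           # a nearby cloud edge presence
}

def nearest_edge(rtts: dict[str, float]) -> str:
    """Return the region with the smallest measured round-trip time."""
    return min(rtts, key=rtts.get)

print(nearest_edge(MEASURED_RTT_MS))  # ap-south
```

In practice this decision is typically made by DNS steering or anycast routing rather than application code, but the effect is the same: the first hop terminates near the user instead of crossing the world.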
Starting point is 00:39:18 Now, in terms of back-end services, it's really critical that we look at cross-zone or cross-region dependencies, which we've started to do, I'd say, in earnest maybe a year or two ago. Previously, there weren't a lot of controls around it, right? Which would leave a lot of areas of, I would say, brittleness, where things could break and cause outages. So for example, we'd have service A in a zone, and it needs to talk to service B, and service B is actually also in the same zone. But for misconfiguration or historical reasons, service A would talk to the service B across the country. And then a network problem happens in our backbone, and they can't talk to the one across the country, and it just breaks.
Starting point is 00:40:07 We've seen a lot of cases like that over the years. So today, what we're really striving for is zonal isolation, where for the most part services are self-contained in the zone. That way, if we lose the zone, it's fine; we have those services in other zones. But from a network perspective, this cross-zone, cross-region chattiness, if it doesn't need to be there, we're trying to reduce it as much as we can.
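The misconfiguration pattern described here, service A calling a copy of service B across the country even though B also runs in A's own zone, can be expressed as a simple check over a call graph. A minimal sketch, with entirely hypothetical service and zone names:

```python
# Flag cross-zone calls that could have stayed zone-local: the callee
# is also deployed in the caller's zone, so the cross-zone hop is
# avoidable. All names are hypothetical.
deployments = {
    "service-a": {"zone-1"},
    "service-b": {"zone-1", "zone-2"},   # present in both zones
    "redis-cache": {"zone-1"},
}
# (caller, callee, zone the caller is configured to reach)
call_edges = [
    ("service-a", "service-b", "zone-2"),    # avoidable cross-zone call
    ("service-a", "redis-cache", "zone-1"),  # already zone-local
]

def avoidable_cross_zone(deployments, call_edges):
    findings = []
    for caller, callee, target_zone in call_edges:
        for caller_zone in deployments[caller]:
            # Cross-zone call, but the callee also runs in the caller's zone.
            if target_zone != caller_zone and caller_zone in deployments[callee]:
                findings.append((caller, callee, caller_zone, target_zone))
    return findings

print(avoidable_cross_zone(deployments, call_edges))
# [('service-a', 'service-b', 'zone-1', 'zone-2')]
```

A real system would derive both inputs from service discovery and traced traffic rather than static config, but the core question is the same one Vishnu poses: does this hop need to leave the zone at all?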
Starting point is 00:40:35 And that also goes for cloud providers. We've had examples where we deploy some stateless services in a cloud provider, but all the stateful services and databases we need to talk to are still in our data center. And then, you know, there's network issues or any other issues, cloud provider issues or data center issues, and then that communication breaks down. So we're trying to co-locate as many dependencies as possible
Starting point is 00:41:00 and then sort of have our failure domain be a zone, right? Like, we can lose a zone, that's fine. But we're not quite there today. I think we still have quite a few cross dependencies, and also this is a moving target, right? With, you know, 3,500 microservices, and old ones being deprecated and new ones being written all the time, all of this is changing and morphing all the time. So I think we're trying to clean it up and get it to a stable state, and then the next step is, again, tooling
Starting point is 00:41:31 and having this be an automated thing where people don't have to think about it. So in our deployment tooling today, we've tried to stop the bleeding by implementing controls where, again, as a service owner, if I'm writing a product that's not infrastructure, I shouldn't have to think about, hey, do I deploy it to zone A or zone B? The tooling should think about it.
Starting point is 00:41:53 Look at my service. Look at the service tier. Look at the subsequent reliability requirements of that service and then deploy it dispersed as needed in a standard fashion with all the monitoring and metrics that go with that. So that's sort of where we're at today. I'd say we have the controls in place, but we're still doing the cleanup and enforcement of it. If I get this right, as an engineer in any organization,
Starting point is 00:42:22 say you're at Uber, you should in the end just focus on creating your new service. You know in which region it should be available, because you're building a new feature, and you have certain dependencies on other services. But then the platform should figure out on its own where, in which capacity, and with whatever failover mechanisms are needed to meet your reliability and resiliency goals, it should be deployed, so that it's also as close as possible to the systems you depend on. So in case something fails, you kind of contain that problematic zone by making sure they are co-located as well as possible.
Starting point is 00:43:00 And that is knowledge you cannot expect every engineering team to have. This is where you have collective intelligence that you build into a central platform, and then take these decisions off the shoulders of your engineers. And one thing, oh, sorry, go ahead. No, I also think in the beginning you mentioned that you're doing a lot of, I'm not sure if you used the word mentoring, but you're working very closely with these engineering teams to educate them on what type of services you actually provide. And I think this whole educational aspect
Starting point is 00:43:32 is also a big point. But still, you cannot educate 4,000 people on all the individual details. This is where platform engineering, at least from my perspective, comes in: you're providing self-services to your teams to allow them to do their stuff, and then you take over the hard choices of, you know, what it means for this critical service to be five nines, and where we need to deploy it because it is highly dependent on certain other backend services. Yeah, absolutely. And I would just add, you touched on it, it's also, like I said, a moving target. So, you know, in a lot of cases service owners
Starting point is 00:44:12 may not know all the dependencies, right? They may not know, hey, I talk to Redis, I talk to this, I talk to that. They know some of it, but they may not know all of it, or it may change. So that's where I think humans break down, right? There's no way we can understand all of it or keep track of it. And so that's where, I know not my team specifically, but there are other teams at Uber that are exploring quite a bit. I know it's a hot thing right now, but whether you call it machine learning or AI, trying to apply that to the service graph
Starting point is 00:44:40 and how these services all interact and trying to apply those principles around deployment safety and how we test things, how we monitor metrics, all of those things but have it be much more dynamic where it could understand the system as it's morphing and changing over
Starting point is 00:44:58 time. So it's a huge area of opportunity. I mean, it's sort of a kind of crazy problem to think about, but I think there's a lot of progress to be made there. It's interesting with that, too, because us in the observability space,
Starting point is 00:45:14 we collect a lot of the kind of data that decisions can be made upon. You have things like Kubernetes, you have add-ons to all this stuff that help you scale, right? All the ingredients for the recipe are there. And it's interesting to see people finally starting to do what I think we all expected sooner, which would be everybody should be automating this whole process.
Starting point is 00:45:37 We have this data. We have the ability to scale. We have the ability to do this. Even if we're in zone one and everything's working in zone one except for a network connection between two components, if the system knows that it'll actually be faster to go out to zone two for that one bit and then pop back in, it should just do it, you know, and then readjust. Like self-healing and all, all the ingredients are there, right? But it's just, I guess, the priorities, and
Starting point is 00:46:02 how much does it actually take to build that intelligence into a system to make it work reliably? And I think that's the piece that we're starting to see with all this. But it's definitely a heavy lift, right? But, you know, yeah, it's just really cool stuff that you get to work on there. Yeah, it's hugely exciting, right? I mean, I think, like, your guys' organization, I mean, there's so many, like, you're right. It's, like, tantalizingly close or it feels like it is,
Starting point is 00:46:28 but somehow there's still a lot to do. So it's interesting. Hey, Vishnu, before we close this out, I have one more question for you. In your role, right, head of, I'm just looking at your LinkedIn profile here, head of network infrastructure, EMEA, platform engineering at Uber,
Starting point is 00:46:44 you mentioned this in the beginning. What wakes you up at night? Or what wakes you up on the weekend? What ruins potentially, now obviously I know this is, we're already in January when this airs. But what could have ruined your Christmas? Besides that, I didn't know we could time travel
Starting point is 00:47:02 since today, or before today. Yeah, no, I think for us it's always capacity, right? So, I won't name him, but shout out to my first boss at Uber. He's a very interesting guy, and he used to tell us: whatever you do, never, ever run out of capacity as a team. And at that time Uber is going crazy and we're just trying to, like I said, keep our heads above water. And in some ways that hasn't changed, right? Uber's growth has slowed a little bit, but percentage-wise it's growing, and it's growing off a huge, huge base number. The growth is actually still kind of crazy. And we see it on the infrastructure side.
Starting point is 00:47:40 And I think this is where deep observability and metrics can also help: predicting our growth. We're very good at predicting our business growth, but translating that into infrastructure growth is something that I'm always worried about. Especially when you're dealing in the physical network world. If you're in a cloud provider, there's this perception that the cloud providers can just spin up unlimited capacity, and through our partnerships with them, we've learned that's not possible. They have the same constraints we have: you need hardware, it takes time to get the hardware, you need time to build it, all those things. So there's a lead time for our underlying physical infrastructure, which can be addressed sort of on the edges. You can pull in some dates here and there,
Starting point is 00:48:25 but there's some things that you just can't, like that just take time, right? You're laying fiber optic cables, you're connecting things, you're ordering hardware, all these things take time. So really getting a deep understanding of like where the business is going
Starting point is 00:48:37 and how that translates into what we need to build and that we build it in time. That's what would kind of keep me up at night. And maybe that's a lesson I learned from my first boss here at Uber. Well, shout out to him who remains unnamed, but the people that know him will probably know him. Yeah.
Starting point is 00:48:59 We'll call him Jebediah. Some ancient kind of name. Yeah. Vishnu, I want to say thank you so much. By the time you listen to this, you will hopefully have had a great end of the year and a great start of the year. I was really fortunate
Starting point is 00:49:17 to bump into you in London. I also want to say thanks to Sam, who organized the conference, for allowing me to do the fireside chat with you. This got us connected, and this in the end resulted in this podcast. So treating partnerships really well, I think, pays off in the end. It comes back to what you said earlier: because Sam allowed me to host that part of the conference, he did me a favor, and in return
Starting point is 00:49:44 I helped him out. And so in the end, now we're here. So it's treating your partners well. And thank you so much. Hopefully. Yeah, thank you. I wanted to thank you and Brian. I really enjoyed the conversation. I looked at the clock just now.
Starting point is 00:49:58 It went by so fast. We're having so much fun. Thank you. It's always great to hear what's happening on the cutting edge. So really, really appreciate you sharing it and continuing that spirit of sharing knowledge. It's always difficult in the corporate world of like, oh, can I say it? But it's like there's some fundamentals that really benefit everybody.
Starting point is 00:50:25 I think at the end of the day, we're all just trying to build cool stuff. Really? Yeah. And keep going. So, yeah. Thanks. All right. Thank you. And Happy New Year to everybody. Or if we want to go back in time for when we're recording, happy holidays, everybody, whatever you're celebrating. And thanks, everyone. We'll see you on the next episode.
Starting point is 00:50:47 Bye-bye. Thank you. Bye.
