Software Misadventures - Todd Underwood - On lessons from running ML systems at Google for a decade, what it takes to be a ML SRE, challenges with generalized ML platforms and much more - #10

Episode Date: May 7, 2021

Todd is a Sr Director of Engineering at Google where he leads Site Reliability Engineering teams for Machine Learning. Having recently presented on how ML breaks in production by examining more than a decade of outage postmortems at Google, Todd joins the show to chat about why many ways that ML systems break in production have nothing to do with ML, what’s different about engineering reliable systems for ML vs. traditional software (and the many ways that they are similar), what he looks for when hiring ML SREs, and more.

Transcript
Starting point is 00:00:00 I think a lot of people get focused on ML, ML, AI. But there are subtle things about what we're doing that are a little bit different. But there are a lot of things that are just about modern distributed computing and how software works on medium-sized collections of computers. And a lot of people work on software on medium-sized collections of computers. So I think the thing I would say is for people who are going into ML and going into data sciences, if you're excited about the model building and you're excited about that side of it, you should go to that. But there's going to be a huge amount of work for a very long time in making this stuff work. And so if you are a software engineer, you are a systems engineer, you are a data management
Starting point is 00:00:45 engineer, and you are an SRE, this is really, really good stuff to know. And you should not be worried about not having the academic background in modeling. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps experts to hear their stories from the trenches about how software breaks in production. We are your hosts, Ronek, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect,
Starting point is 00:01:20 and wish there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders. Hello everyone, this is Guang. Our guest for this episode is Todd Underwood. Todd is the Senior Director of Engineering at Google, where he leads site reliability engineering teams for machine learning. Having worked on SRE at Google for more than 12 years,
Starting point is 00:01:51 Todd recently gave a talk on how ML breaks in production, drawing on more than a decade of outage reports and postmortems. In this conversation, we go into different aspects of what makes it difficult to do ML well in production. Like why it's not enough just to look at aggregated statistics for ML monitoring, and who is on the hook when ML models don't perform as expected in production. We also chat about what Todd looks for when hiring ML SREs,
Starting point is 00:02:21 his impressive skill of getting LinkedIn skills endorsements, and much more. Please enjoy this insightful conversation with Todd Underwood. Awesome. Hey, Todd, it's great to have you with us today. Welcome to the show. Great. Thanks for having me. I'm excited to have the conversation. So when we were LinkedIn stalking you to prepare for this episode, I saw that in addition to leading reliability engineering efforts for ML at Google, you're also the Pittsburgh site lead. I've never heard the title of site lead before and thought that was really cool.
Starting point is 00:02:59 Can you tell us more? Like, what is the role of site lead? Yeah, absolutely. But I do want to like i want to start with the digression which is i am disappointed that what you didn't focus in on my linkedin page was my skills so as i'm sure those of you who work at linkedin know most of us in the real industry think the skills thing is silly and not useful so my particular brand of personal protest against the skills is to have
Starting point is 00:03:25 outrageous and ridiculous skills. Among my skills are nuclear proliferation, brunch, locomotion, and pork. And I should clarify, I'm a vegetarian. So the pork skill is more about detecting and avoiding pork than enjoying it. So I just want to say like, there's good LinkedIn skills out there and you should like, you should all be seeking out these skills. There really are. So sorry. Sorry. Sorry for that. I'm polluting the data for training models. I am helping. How many people are good at French? I have skill in something called, so there's a CAD/CAM package called Rhinoceros, but rhinoceros is also just a common noun for an animal in English, right? So I have a skill of rhinoceros. And so it's just an animal, like it's a skill. I don't know, Todd. Some poor ML SRE, you know,
Starting point is 00:04:13 late at night. It's like, why is this model giving really weird results? I want to point out one of my top LinkedIn skills, which is a self-documenting skill, is the skill of getting LinkedIn skills endorsements. That is one of my top endorsed skills, which is true because I'm really good at getting LinkedIn skills endorsements. That is one of my top endorsed skills, which is true because I'm really good at getting LinkedIn skills endorsements. I'm really curious about what kind of job recommendations you get. So one time, I think it was some kids from Blizzard. It was some video game company. We're like, we know you're not going to move here. You said you weren't going to move, but we just wanted to reach out and say, we really enjoyed your skills. And we've been having a tough day looking for candidates and you like your, your profile really brightened our afternoon. I'm like, that's a thing. Brighten some recruiter, some sourcers afternoon, right? Sorry, let's get back
Starting point is 00:04:53 to culture. That's what I'm like. I'm like, yeah, so, um, good. So a site lead. So, uh, one of the things, different companies work differently. Google has a huge engineering organization. Um, and one of the things that happens companies work differently. Google has a huge engineering organization. And one of the things that happens is, like, when you think about engineering organizations, they're mostly organized functionally or by product, right? You, like, you work on a thing, whatever that thing is, whether it's like some kind of engineering and some kind of engineering organization or some kind of product and some kind of product organization. But that's your day job.
Starting point is 00:05:23 But this is inconvenient for those of us who work in technology. Humans exist in a place and time, right? Humans are not sort of little abstract bundles of possibility that like just exist on some theoretical plane. They live in cities, they have houses, they have like families, they walk and they bike and they eat food and they have labor conditions and they have salaries and they pay taxes. There's all this inconvenient stuff about us all being humans. And also, like, you know, I think one of the things that I am always disappointed about, about our industry, the technology industry, is it spends a lot of time pretending that we will all just move anywhere at the drop of the hat for any particular reason. And that's absolutely true for some people at some times. But actually, a lot of people have families, they have communities,
Starting point is 00:06:09 they have places that they prefer to live in places that they don't like. People don't like living at a place they don't feel comfortable in. They don't like living in a place they don't feel welcome in or they don't have good ties to. And so a lot of companies try to pretend this isn't true. And anybody like all of the top talent will just move anywhere. But it's just not true. Like we live in a place. And so the job of a site lead is really to oversee, curate and set up the conditions for success of all of the different organizations that employ people in a place. So in Pittsburgh, we've got about 1,000 people working for Google, roughly in that area. I think it's a little bit, somewhere in that area. And so there's a building and there's recruitment and there are benefits issues. And there are, especially in the last year,
Starting point is 00:06:58 there's health and safety issues, right? Like our office is closed and when's it gonna reopen? And so that's an example of a question. Like you can't have like the ads lead and the cloud lead decide when the Pittsburgh office reopens. Cause that from their perspective, that's a weird edge case from our perspective, it's our entire life, right? Like this is where we live and work. So that's really what a site lead does is set up the conditions for success at a site. I see. How did you get put into that? Or like, is that something that you- I mean, mistakes were made. Clearly, like everybody else stepped back and I wasn't,
Starting point is 00:07:34 I ended up using. I mean, I think like it really is like at, you know, different, it's not a, it's not a, it sounds super fancy or it sounds important, but it's really like who, who failed to avoid doing it. When I was being asked to do it, someone told me like, don't do it under no conditions, do it. It's not the, it's not the kind of work that's rewarded. I'm like, okay, cool. And then, you know, like, so I, you know, you all may as well, like I, you know, I interview a lot of people. And so one of the things when you interview someone is you, you try to form an opinion, positive, negative, you know, good at this, bad at this, but then you look for evidence that you're wrong, because you want to test like,
Starting point is 00:08:13 maybe I was jumping to a positive or a negative conclusion. And so just really quick, I said, okay, cool, I got it, I shouldn't do it. Just out of curiosity, if I were to do it, what's the reason I would do it? And they're like, oh, you would do it if you really cared about that stuff. If you were positive nobody else was going to do it, or you didn't think the people who were willing to do it were going to do a good job of it. I was like, well, that's disappointing. Cause you just talked me into it. Like you had talked me out of it and then you just have, yeah. But it's, I mean, it's weird. It's, it's a really important job, but it's not an engineering job. It's fundamentally a job about setting up the conditions for human success. I see.
Starting point is 00:08:48 Very cool. So when you first joined Google as the SRE manager, you were working on online applications like ads, payments, and now you're leading the ML SRE team. What made you decide to switch and what was that transition like? Yeah, it was really an outgrowth of it so um i started the ads ml team in pittsburgh uh there were sort of two engineers with no manager when i started and i built that team um and built sre in pittsburgh up from those two people to about 150 people that it is now um yeah i mean google employs a lot of people so that's still a small site but you know it's it's like it's enough i think our objective was people should have choice like
Starting point is 00:09:31 engineers like to work on different stuff. Like each of you, like, when we were chatting before the show, like, you're like, I used to work on this, but now I work on that, because that's what we like to do. People need, like, good, like, options. Um, not always because they're doing a bad job. Sometimes you're doing a great job, but you're just bored. You're like, I did that. I want something new. I want a different challenge and to try to apply my skills. So the first team that I joined was this team working on productionizing ads machine learning infrastructure at Google. And as you all may know, and I think a lot has been written about, like, Google's been using machine learning to target ads for a very, very long time.
Starting point is 00:10:07 It was even controversial in the early days. Like there was this funny news story where I'm trying to remember how it went, but as I recall, sort of Yahoo was like in the market and they were trying to make this new ad serving system as sort of one of their last gasps of trying to figure out how to like pivot from the portal version of like the closed walled garden version of the Internet that they've been doing very well with. You know, the sports at Yahoo and the finance at Yahoo was great.
Starting point is 00:10:35 But like the search, they just couldn't really nail and they couldn't nail the advertising part of it. And at one point, someone from Yahoo said, well, you know, Google's doing well here because they use math. And we're like, you know, so there were some T-shirts made that said Google ads quality. We use math because like we do, we just use math. Like it's not, it's public math. It's not secret math. Anyone could use the same math. So we productionized that stuff. And then, you know, I think I did some work on payment systems and I did some work on a few other systems, but I always like noticed that those ad systems were highly tuned for that application. But other places that I looked in the company were not as, were not as well served with general purpose infrastructure for their machine learning use
Starting point is 00:11:25 cases. And I think, you know, as the revolution in deep learning happened in the sort of 2000, you know, what, 12 to 15 period, and people are starting to see like, hey, this is actually usable for a really wide variety of use cases, you know, natural language use cases, image use cases, fraud and abuse use cases, all kinds of other prediction use cases. But the underlying software infrastructure is still bespoke, complicated, finicky. And so when you look at an application, you know, like something simple, like when you in Gmail or an Android, when you have type ahead, like on a phone, that's genius. That's like predictive typing is amazing, right?
Starting point is 00:12:04 Because like these input, like in spite of how like the kids these days and their fast thumbs, like there's a lot, these are not great devices for inputting significant amounts of text. And so if you can tell me not just like predict this word, but the next six words I'm about to say, because you know me or you know how people speak this language that I'm speaking, that's amazing. Well, there's a model that needs to do that. Like then you have to sort of do this calculation. Is that a small idea or a big idea? I think that's a small but good idea. I don't think it's a big idea. So if I were running
Starting point is 00:12:34 a company, I would not assign a team of people just to do that one thing. That looks ridiculous. That looks like a waste of time. But if your infrastructure is sufficiently complicated, you probably need several people working on that just to ship that one feature. And that's not cool, right? And so I think that was that kind of thinking of, you know, we're generalizing this infrastructure. At the same time, Google was in secret shipping TPUs that we've now announced publicly. But these amazing hardware devices that really cost effectively do
Starting point is 00:13:05 basically low precision linear algebra. And so we're looking at these things saying, hey, there's a revolution here, but the reliability and the simplicity is not facing the developers yet. And so that's what I really got interested in. So I founded the MLSRE team and I've been building it up ever since. Dang, that's really cool. I guess changing gears a little bit, Google being a pretty global company with offices all around probably means a lot of your teams
Starting point is 00:13:34 do work in distributed manners even before COVID. Are there practices that you found that really helps keep good communications across channels, across these different teams? Yeah, for sure. I think one of the benefits when you talk to people, and you all may experience it as well at your organizations and in your history, when you talk to people who do this kind of distributed work, the most successful thing is not having one place. It's having like
Starting point is 00:14:05 everything be distributed, everything being somewhere other than the main place. Um, there was a neat trick. There was a good example. I think, I think it was Vijay Gill when he was at Google, but there was an example of like, there was a big meeting at the home office in Mountain View for some networking thing. And then there were several people at some other sites. So maybe in Sydney or maybe in Dublin or maybe in New York. Okay, so that's fine. He broke up the big meeting into two meetings. He just booked two smaller rooms instead of one big room. And so it wasn't that like you didn't have to go to either room, but you couldn't fit everyone in one room. And so all of a sudden the meeting shifted from in the room with some like remote participants to on the video conference. And so, you know,
Starting point is 00:14:50 I think things like that are like genius things to do for the COVID front. Google had had a culture of video conferencing for a very long time. Like, you know, I've, when I went to Google in 2009, they'd already been conducting 100% of their inter-office meetings on video on these old Tanberg devices for several years, like since 2006 or seven or something. And so like we already had,
Starting point is 00:15:20 like every conference room had video capabilities. Sometimes they were like little goofy devices with small screens and sometimes they were a little bit bigger. But by the time last year rolled around, you know, like that was just how all meetings happened. So that's fine. But what we found is that we were still relying really heavily on periodic in-person contact. And like, I think all of us have found like working from home, like I like working from home, but I don't like only working from home. I don't like always working from home. I've got a solid, like, you know, quarter or third of my coworkers that I work closely with who I've never met. And I think there's really good evidence that as humans,
Starting point is 00:16:01 we form trust affiliations and we have high bandwidth communication because of in-person interaction. Now, as a nerd, I think that's super disappointing. I'm like, I just, I see people on the screen. It's fine. Right. But for some reason, it's not completely fine. And so I think the jury's still out on what the right thing to do there is. You know, the New York Times Magazine last summer, which was early in this, like, I don't know, we felt like it was the middle of it, but it was not, it was not, but it was like in June or something, had this whole episode on working from home
Starting point is 00:16:32 technologies and there's a ton of good research in that episode that have all kinds of things I guess they call them editions of their newspaper not episode apologies there's a ton of good science in there and some of the science in there. And some of the science is things like staring at your own face is tiring and weird. So like this default
Starting point is 00:16:51 that most of our video interfaces center us, that's weird because you're not used to looking at yourself and it makes you self-conscious and annoyed. Staring at other people 100% of the time is also weird because if you think about if you're in a meeting with six people, you're really only looking at one or two of them at any given point in time because two of them are sitting next to you and you can't look at them unless you like turn around and look at them, right? But now we're like you're in a meeting with 12 people and you're looking at all 12 of them all the time and your brain is exhausted because your brain is trying constantly. It's scanning to parse all of that state, all of that emotional reaction and facial reaction. And so I think there's a bunch of stuff.
Starting point is 00:17:30 I'm both excited about the fact that we've moved a lot of interaction online, but disappointed in how little we've innovated so far. Like we need to experiment a lot more before we get this really right. Yeah, I can definitely relate. My brain is definitely doing all that processing, all the faces, definitely not browsing Reddit or, you know, anything like that. So absolutely. No, no, no.
Starting point is 00:17:52 Hacker news, never. Twitter, not at all. No, no, no, no, no, no. Cool. So at the director level, I know that you're on a lot of hiring committees for MLSRE. And I imagine that, you know, finding people that are both experts in ML as well as infrastructure engineering is really difficult. Because I think the backgrounds are almost kind of orthogonal.
Starting point is 00:18:14 One is like a little bit more academic. The other one is more hands-on, practical. In another interview, you mentioned that actually most people working on reliability engineering for ML at Google don't actually have an ML background. So I thought that was really surprising, but also makes sense. Like, tell us more about that. Like, how have you grown your teams? Like, you know, what do you look for in candidates? Love to learn more. Yeah, for sure. I think more broadly for SRE, I think there's probably an analogy between software algorithmic skill and SRE and ML skill and ML SRE. And let me explain what I mean by that. I think when I look for, you know, I've teach someone the right attitude and approach to SRE or can I teach them software skills?
Starting point is 00:19:10 You're like, well, you know, you can teach either side of that, but you really have to, you have to address that gap somewhere. So if I get someone who's pure software oriented and has never really thought from a systems perspective, that's a pretty steep barrier to overcome. Likewise, if I get someone who, like, really thinks from a systems perspective but has no software sense at all, like, that is also, from an SRE perspective, a pretty steep barrier. I think that's analogous to the ML front. What we see in the ML front is much of the work that we do on MLSRE requires very little deep understanding of machine learning, very little. Now, that fades pretty quickly because what it does require is understanding how these pipelines fail, what the requirements of them are, how the structure of the systems fit together. I'll
Starting point is 00:20:05 give two concrete examples. One is the difference between, say, reinforcement learning and something like supervised learning. So in a supervised learning situation, you have a bunch of examples and you're trying to apply labels to those examples. And you're using that to produce a set of lookups or predictions where you can categorize novel events or novel examples. That has a flavor to it. That has actually a kind of linear left to right pipeline to it. Now, if I talk about reinforcement learning or something that has unsupervised learning followed by supervised learning followed by a human agent going back to the supervised learning. Now we're like, oh, from a systems perspective, you don't even need to know what those words mean. But I just like, I just strung the boxes together in a pretty different pattern, right? And if you understand
Starting point is 00:21:00 a little bit about what that means, you'll understand a lot about how it's going to break. And so I think one of the things is that people who start on MLSRE who really want to be working on machine learning algorithms and modeling are super disappointed, because we don't do a lot of that. In fact, we do very little of that. And so that's actually a bad fit. Having a ton of ML expertise and really wanting to work on building models, that's not what we do. Google has lots of teams that do that. They really have some of the best model builders in the world. And so if you can't get hired by those teams, it's cool if you understand that a model is a set of like pre-distilled computation. It's cool if you understand that these data processing pipelines are hyper, hyper data sensitive in a way that traditional, you know, data processing pipelines are not. Like they're sensitive to small changes in distribution of the data. They're sensitive to small dropouts of particular parts of the data. That's novel and interesting. And as an SRE, I hear that as like, oh, new failure modes, new really, really subtle failure
Starting point is 00:22:09 modes. That's great. That's cool. That's where I live. That's what I want to understand. But you don't need very, very detailed, substantive model knowledge. Like 99% of what happens in NeurIPS is not relevant to what we do this year, although it will be in five years.
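To make that kind of subtle breakage concrete, here is a minimal sketch of the sort of check an ML SRE might run: compare how a categorical feature is distributed in the training data versus what is arriving at serving time, and alert on a large shift. This is purely illustrative, not Google tooling; the feature name, toy data, and threshold are all made up.

```python
from collections import Counter

def category_shares(examples, feature):
    """Fraction of examples falling into each value of `feature`."""
    counts = Counter(ex[feature] for ex in examples)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def distribution_shift(reference, current, feature):
    """Total variation distance between two categorical distributions."""
    ref = category_shares(reference, feature)
    cur = category_shares(current, feature)
    values = set(ref) | set(cur)
    return 0.5 * sum(abs(ref.get(v, 0.0) - cur.get(v, 0.0)) for v in values)

# Toy data: the "language" mix the model was trained on vs. today's traffic.
training_examples = [{"language": "en"}] * 80 + [{"language": "es"}] * 20
serving_examples = [{"language": "en"}] * 95 + [{"language": "es"}] * 5

shift = distribution_shift(training_examples, serving_examples, "language")
if shift > 0.05:  # the threshold is a judgment call the model owners make
    print(f"ALERT: 'language' distribution shifted by {shift:.2f}")
```

Nothing in this scenario crashes and the pipeline keeps running; the data just quietly stops looking like what the model learned from, which is exactly the failure mode described above.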
Starting point is 00:22:29 One thing that you mentioned before is sort of like having that empathy because you do have the context of like, you know, how these models work. So like, you know, where they could potentially break. Does that come into the picture a lot? Like when the SREs work with like the model builders or work with the devs, did you see that like kind of a big piece or is it, you know, middle of the road? No, so I think in technology,
Starting point is 00:22:54 one of the biggest problems all of us have is that we frequently don't treat each other like human beings. And I'm disappointed by that. Like, I just say it that way. Like, I think like it is easy, and this is worse, this is worse this last year, because we are all stressed out. We're all tremendously isolated from each other. We're missing our families. We're missing our friends. And we're angry. And we frequently take that anger out on the people on the video conference near us. I'm going to yell at you all in a second, I'm just going to start yelling. But like, so I understand that. But in reality, like, everyone who's participating in these really big complex bits of infrastructure is just trying to do their job. They're trying to innovate. They're trying to, and in particular the ML stuff, they're trying to solve an interesting problem that's worth solving,
Starting point is 00:23:40 they're trying to do something that's good for users often. Like people give ads a bad name, but like ads is what makes the internet free right now. And that's cool. And ads should be good. Like, like if you're, do you want crappy ads or good ads? I want the good. I mean, I like, we'd all rather have no ads, but if there's no ads, all of the stuff costs money. And we've honestly, each and every one of us has chosen not to pay for that. Like we keep doing it. Like you may not admit it to yourself, but you've done it. I've done it. Right. And so we're like, well, if I'm going to have ads, I want ads that are stuff I care about stuff I might care about. Well, I'll click on them if they're very, very, very good. Maybe. Right. We all think that. But so like even the people just doing stuff that we think mundane is mundane and boring,
Starting point is 00:24:24 but people trying to stop fraud. They're not trying to make people's lives inconvenient. They're trying to make a payment system work and be affordable and be fast and be functional and have people not steal our money and not steal your money. So, yeah, I think empathy is a huge part of it. Cool, cool, cool. And sort of speaking of things not working, so you gave this talk recently about how ML systems break. What really caught my eye during your talk there was this slide where you had 19 different categories of failures
Starting point is 00:24:53 in terms of thinking about how things break, and that's sort of being kind of aggregated from years of experience looking at postmortems of how things fail. So this sounded like a really good opportunity for playing bingo. So did you guys like play this during like a quarterly ML SRE offsite? That's a really good idea. See, now that you say that, the honest answer is no, not yet.
Starting point is 00:25:20 Like now you say that, I'm like, this sounds really good. Can you send me a copy when you guys do, you know, I know I'll send you a copy of the bingo card. That's a stellar idea. So most of the work for that talk was done by Daniel Papazian, who's like a stellar engineer that I've worked with for years. And so we like, we got to all of us who have worked on this stuff for a long time, have the sense, like, it's not the ML that breaks, it's everything else.
Starting point is 00:25:44 And sometimes the breakage it's not the ML that breaks, it's everything else. And sometimes the breakage is related to the ML. So I'll give a good example. Like if 10% of the data goes away, but it's a biased 10% of the data, like a good example would be like if we train a model that has language, like what language the model comes in and we drop all of the stuff that's in Spanish, well, we're going to have really weird results for anything that is about or in Spanish as soon as we finish training that model. But if that's only like 8% or 9% or 10% of our total data,
Starting point is 00:26:14 as an SRE working on that, you're like, oh, there's a little bit less data today than there was yesterday. I might not notice, right? And that's the problem is you can lose these little slices. Like what if the U.S. lost all of the data from Georgia and Alabama, but not Louisiana or, you know, Mississippi? Like, you'd have weird results about certain regional things in the U.S. South that you wouldn't have otherwise. And so I think, you know, but in general, most of the things that go wrong are the format was wrong. So I stopped being able to parse the
Starting point is 00:26:45 file. Like, that's not an ML thing. That's happened to every single person who's ever, like, worked with a computer. Like, I set up a template to read from the file, the file changed or my template changed, and now they don't match, and now I'm not reading from the file, and I didn't notice really quickly. And so now I'm not loading any data, so now I'm behind, because I got stuck on this thing that I wasn't supposed to do, because I should have been monitoring it and not screwed it up. Right. So that's fine. Like, so we saw a lot of that.
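As a rough sketch of what catching both of those failure modes can look like, the snippet below counts records that fail to parse and compares today's per-slice row counts against yesterday's instead of only watching the total. The slice key, numbers, and threshold are invented for illustration; this is not any particular team's tooling.

```python
import json

def load_records(lines):
    """Parse newline-delimited JSON, counting records that no longer parse."""
    records, parse_failures = [], 0
    for line in lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            parse_failures += 1
    return records, parse_failures

def rows_per_slice(records, key="language"):
    """Count rows per slice -- here, per language."""
    counts = {}
    for record in records:
        slice_name = record.get(key, "unknown")
        counts[slice_name] = counts.get(slice_name, 0) + 1
    return counts

def slice_alerts(today, yesterday, max_drop=0.5):
    """Flag any slice that shrank sharply, even if the total still looks fine."""
    alerts = []
    for slice_name, old_count in yesterday.items():
        new_count = today.get(slice_name, 0)
        if old_count and (old_count - new_count) / old_count > max_drop:
            alerts.append(f"{slice_name}: {old_count} -> {new_count}")
    return alerts

# The "format was wrong" case: the file changed and some records stop parsing.
todays_lines = ['{"language": "en", "text": "hello"}', "this line is not json"]
records, failures = load_records(todays_lines)
print("parse failures:", failures)       # 1
print(rows_per_slice(records))           # {'en': 1}

# The "biased slice went missing" case: Spanish quietly disappears, but the
# total row count only drops about 10%, which an aggregate alert could miss.
yesterday_counts = {"en": 900, "es": 100}
today_counts = {"en": 900}
print(slice_alerts(today_counts, yesterday_counts))  # ['es: 100 -> 0']
```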
Starting point is 00:27:12 And I think like, you know, when we looked at it, we really did find that, you know, systems thinking and general SRE work is by far the most important thing on working with ML systems. Now, you're always going to have to have some people who know quite a bit more about how machine learning works. But the systems, the general systems monitoring thinking is the most important part. That's really well put. I liked how you brought up just the whole aspect of folks going into ML ops, maybe sometimes think they're going to do a bunch
Starting point is 00:27:46 of machine learning and they end up, yeah, um, and yeah, this, this helped explain a lot of it. But another thing I was actually really curious about was, how do new folks that enter in the kind of MLOps, are they, are they embedded with specific, like, product teams? Or is it more of, more on the platform side, of like, we are responsible more for making sure the workflow of how these teams, uh, train new models and how they stitch them all together? Are they more responsible for more on the platform side, or is it, um, even some that are like, I'm going to be working on, I'm an SRE that's very closely tied to the ad section and work with them very closely? Yeah, I think it varies. I think we would like to be more on the platform side. But to do that, you have to have mature enough and widely adopted enough platforms. And so one of the things I think,
Starting point is 00:28:41 you know, I'll just be frank, like a lot of people like, oh, Google solved all the problems. We have not solved all the problems. And so like among the problems we have not solved, we don't have a super reliable, super general, widely adopted single set of ML platform technologies used by everyone. Like, I wish we did. We have pieces of that, but we don't have that, right? And like, I think a lot of companies are like, oh, Google's got to have that. We should build that. I'm like, you should build that, but we also should build that.
Starting point is 00:29:11 And we're working on it. You should too, right? And so when I look at like my SRE teams, one of the things I see is sometimes you will see people working on a platform. Like the, I'll give you an example. Like we do have do have basically a TensorFlow serving system. You give us a TensorFlow saved model, and we'll just distribute it around the world and answer questions about it. Now, you all know that's not a very complicated service. I mean,
Starting point is 00:29:38 when you do it at a very large scale, it's more complicated. But you get the idea. You give me a model. I'll answer questions about it. That's the whole thing. The whole idea. I just said it like two sentences, very short sentences. Well, that service is very successful because the value proposition is there. It's easy to understand how to build that service. It's moderately easy to scale that service. Lots of questions about SLOs.
Starting point is 00:30:00 Lots of questions about like latency. Lots of questions about manageability, blah, blah, blah. But you would get that in any very very large service when you start backing into stuff like well build me a training system you're like cool what do you mean by training what counts as training for you you're like and some people say like you know just supervised learning like i just you know i just give you the examples and you build a model and it looks like here's the config for the model. Do these instructions and then put the results over there. Okay.
Starting point is 00:30:29 That's one thing. Somebody else will say, take the examples, do unsupervised learning to create clusters, send them to some human eval agents to label samples of a subsample of this class. We can't do the whole thing because that would be very expensive and tedious. Then take those examples and run some supervised learning, then subsample that back out to go see how well you did, then run it through and you're like, okay, that's a different thing. And that's also training, right? And so what we find right now is for the simplest cases, we have platforms and we have people coming in and working on those platforms. For the more sophisticated cases, we have more bespoke assembled bits of infrastructure,
Starting point is 00:31:06 and those bits of infrastructure are worked on directly by a single SRE team. So the best example of that is probably CCAI, the contact center AI application that Google Cloud AI sells. I'll be honest. I thought this thing sounded boring. I thought this thing was not going to fly. You all are nerds. You'll get this.
Starting point is 00:31:24 It automates call centers. Okay, like, who, like, why is that a big deal? Like, who calls people on phones? Oh my god, people call people on phones. So it turns out a lot of people call a lot of businesses on phones, and it's very expensive, and a lot of this is subject to different kinds of automation. So some of the automation is understanding the customers and understanding what their questions are and routing them to agents. Some of the automation is listening to an interaction between an agent and a customer and populating documents and populating resources on the agent's screen before the agent types. Some of it is chatbots. So all of this is really bespoke, complicated
Starting point is 00:32:06 speech-to-text, text-to-speech, chatbot technology, serving technology, these custom interfaces to telephony for each of the providers. And so, no, we don't have a platform that does all that. That's a team that builds that bespoke stuff and then works. And so the SRE team works directly with that team on that application. Interesting. And do you, but kind of rewinding back, like, would your vision, you know, for both Google and for the broader sort of MLOps industry is to have this more universal platform that basically would enable, you know, different use cases by devs to create new things? Would that be fair to say? I think so.
Starting point is 00:32:43 Yeah. But let me put like, and as people who work sort of are aware of this space, you might, this might resonate with you. I think we're at a tough spot because, so cast your minds back five or six years
Starting point is 00:32:56 and think about what machine learning infrastructure looked like and what were the models people were building and what were they trying to accomplish with those models and think about today. And now think about five years from now.
Starting point is 00:33:15 So what I don't want to do is pour service and code concrete over what we're doing right now and say, this is all you can do. You can't do anything else. Because I really do think there's quite a lot of innovation left. In fact, I don't think we're even close to knowing for sure all the stuff we want to do. So that's one side of things. But the other side of things is, but by being hard, we're actually making it really, really difficult for smaller teams who do have simple use cases to just innovate, to just do something and get anything out the door. So I think what I'm in favor of is a couple of things. One is building reliable, simple platforms for the use cases we're positive a large number of people have, because I think that just enables them to get their work done. Enables them to, like, the phrase innovation has a bad, like, that word has a bad vibe to it because we've all been sort of like lied to about innovation. It's been used stupidly, but like, I really mean like creating new ideas, trying new ideas, solving problems in a new way. I think there's a lot of that in most of our organizations that's tied up because people
Starting point is 00:34:17 are like, oh, I had an idea for an ML model, but then like, you know, I couldn't, I couldn't do it. So I just moved on to my next idea. And you're like, well, why didn't – we want to try – your idea – to be frank, your idea was probably dumb because most of our ideas are. No, and I'm honest. But part of what lowering the cost to try it out does is make it faster to find the good ideas by making it faster to work through the bad ideas. We all have bad ideas. I want 10 bad ideas to be tried out every week so that, like, maybe we can find a few good ideas. So yeah, I think, like, I'm a little bit torn on that, because I think what we're going to have in the medium term is we're going to do a lot of work to try to
Starting point is 00:34:55 productionize the simple use cases, but then we're also going to do some amount of work to enable the truly custom work at the sort of leading edge of this goofiness. But those people are kind of on their own. They're like, they got to figure all this stuff out. And we see that now, like, like, I think Google's TPUs are a good example of that, where they're really hard for most people to program. Like a lot of people outside of Google are like, what are these for? This is really a goofy API. I don't understand. And we're like, they're for super, super cheap. That's what they're for. Like they're for crazy cheap and crazy fast. Right. And like, just figure out the API. It's
Starting point is 00:35:30 worth it. But that only works, you know, if you have a big problem and you're well staffed. Yeah. It seems like there's one of the biggest problems is going to be as a, as an organization figuring out what are those problems that you guys want to really solidify and other ones to kind of, we're going to put some stakes in the ground so that everybody can still be useful with it, but give them enough flexibility to kind of play around until collectively we all figure out like, yes, this is the direction that we want. Let's harden this sort of thing and not doing it too early. Like you were talking about. Yeah. I think that's a great observation. I think like, uh, I've mentioned this before, but I think a lot of these are product questions. They're not engineering questions. They're not SRE questions. They're not ML questions. They're product questions. Like
Starting point is 00:36:12 somebody and again, like in the engineering community, product managers get a bad rap, but good, a good product manager should be out there talking to a lot of users thinking carefully about what they're trying to accomplish and like coming back and saying hey this subset exactly as you say like this subset of use cases all of these people had and if you did all of this it's not that much but you'd be able to meet the needs of all of these people meanwhile if you did this other stuff you would be able to provide basic infrastructure that's customizable to those big users over there. And together, that might be the bulk of your sort of addressable market. Yeah, I think that's right.
Starting point is 00:36:51 Yeah. And I want to take a step back, and you kind of briefly covered it. And you had mentioned something about for someone to just try out a model, and to get it out, and to see like, okay, what are the kind of results we're going to get? To me, that sounds a little bit like, and maybe I'm wrong, correct me if I'm wrong, is like model delivery. Is that a fairly solved problem at this point? In terms of treating model delivery as a deployment of I'm shipping a binary, a verified binary
Starting point is 00:37:18 out to multiple instances, and I just need to make sure it's doing well and have all the rollback mechanisms. Like, is it pretty similar in that regards? I do. I think that part is similar. I think it's the step that comes right before that, that I think is way, way harder. So if you already have a model in like some agreed upon format by your organization, you
Starting point is 00:37:41 know, we use save model format or something like that. But like, you know, some distilled format of the model that can be used by a server. Yeah. Then you just like ship it out and like you're done, you know? Yeah. So I think that's a solved problem. To me, what happens is before that, what happens when you have an idea for a model, right? So you and I work on the same team, you already built a model that solves some problem. Maybe it's a search ranking problem. And I have a better idea. I think i think i do it's probably it turns out it's not a better idea but i think i have a better idea i'm like i'm gonna take austin's model and i'm gonna like i'm really gonna i'm gonna fix this up i'm rubbing my hands together i'm excited
Starting point is 00:38:15 what do i need to do that quickly well um i need all of the data that you use to train your model to be in a common feature store of some kind i need metadata about all of the data that you use to train your model to be in a common feature store of some kind. I need metadata about all of that data so that I can know, like, well, what were those feature columns or what were those data items and how did you use them? I need a snapshot of your model, ideally, so I don't have to start from scratch because I'm going to use transfer learning to steal straight from what you got. I'm going to strip out a couple layers and train my own layers on top of that or whatever, right? I am going to look at your model configuration to see how you built the model and then i want to tweak that and then i want to train a new model and then i need a system to compare the two once i ship them so i need like some system that does fractional traffic to me and to you or like replays
Starting point is 00:39:00 logs to me and compares it to you that's actually a lot of moving parts, right? Like that's a lot of moving parts. But if I have all those moving parts, and if you have those for every team at your company that deals with data and is trying to solve these kinds of problems, that could be amazing, right? Like, because that's really what I'm talking about when I talk about unblocking innovation is this dream of like, you know, a new software engineer on a team on some product team, they're like, you know, a search ranking team, or they're, you know, a message delivery team, or they're whatever, like their their clubhouse, I don't know, they're like some team that's
Starting point is 00:39:36 building some feature or something, you know, you want that person to have access to data, other models, configurations, and an easy environment to tweak them and try again. So, but I do think if part of what your question was, was like, is this pretty similar to general binary delivery? Yeah. Like a machine learning model is code. Absolutely. And thinking about it as code is the right way to think about it, which is also by the way, why if you freeze code during like particular periods or holidays, you should definitely freeze model development as well or model deployment. Right.
Starting point is 00:40:06 Right. Treating of methods as the same. Yeah. And again, taking a step back, you had mentioned something about like using similar features so that we have kind of a common base to come from. And this is maybe something common in the probably very common in the machine
Starting point is 00:40:21 learning systems, which maybe differs from a lot of more traditional software engineering, is the need for feature engineering, both in the online for real-time space as well as in the offline batch stuff. So maybe for our audience that's maybe not as familiar with the machine learning aspect, can you tell us a little bit more about feature engineering and what sort of complexities it kind of introduces? Yeah, so I think like at its most basic form, I think a feature engineering is deciding what matters and putting it into a form so that you can use that. And like, that sounds like, wait, okay, did you actually say something? But like, so if you imagine, like, let's say we're trying to predict the temperature, like we're trying to predict the temperature like we're trying to predict the temperature tomorrow well you know what knowing the temperature today if we don't know anything else knowing the temperature today pretty good predictor of tomorrow like it's unlikely to be 100 degrees celsius different tomorrow it is possible but unlikely right so we can imagine like the feature that we would use to predict the temperature tomorrow is the temperature today and
Starting point is 00:41:23 they're like that's not very good what else And so now you start to think like, well, actually a historical record of features going back quite a ways would help because then like, okay, okay, but what else? And you're like, well, the date that the temperature was recorded would be really useful because it's possible that, you know, a middle of April temperature is similar to another middle of April temperature. And then we start thinking, well, what else might be relevant? Like, what about the lat-long, right? The lat-long might be really because if we know the location of that temperature. And so all of that work is sort of the initial creative side of feature engineering. But then comes the question of like, well, how are you going to store that? And what are you going to do with it? And then how are you going to combine
Starting point is 00:42:04 those things into a model? Now, I will say one thing, that last step, I'm pretty sure the computers are going to do it better than us very soon. And in some cases, they already do. So I think a lot of the AutoML technologies are better than humans at figuring out how to take a big bundle of features and turn them into a model that works. So that may disappoint people who are planning to build their careers on building ML models. But the people who are good at building ML models and are five years ahead of you are
Starting point is 00:42:32 destroying your jobs right now. Like they're like, it's sad to say that that happens sometimes. But if you understand how this stuff works, there's still plenty of stuff to do. It just won't be the manual tedious task of let me try this one, let me try this one. But I think that as an SRE, one of the things that happens is during the process of feature engineering, we make some choices about how to distill and how to store data. And those choices actually have profound consequences on reliability. So I'll give a concrete example of quantization. Quantization sounds like a fancy word. It's actually off-putting to people
Starting point is 00:43:05 who don't know what it is. But it's literally just like taking a numerical space like all of the integers and putting it in buckets. So you're like, well, and you can imagine this with age. You're like, I've got age. Everybody has a year of age. You're like, what if I only have like 12 or 13? How old's the oldest human? Not more than 130, right? So what if I only have like 12 or 13 how old's the oldest human not more than 130 right so what if i only have 13 buckets and your age is just a decade that's all i don't store more you might say like well why do you do that and you're like actually i do that because i might not lose very much information but it's crazy cheap to store so you take a bigger space and you store it in a smaller space now if you change the meaning of that quantization at any point in your system, you have just like not only thrown away data, you've ruined everything. Everything has gone to heck. It's all terrible, right? years and a different part that quantizes it on 25 years, the age that you put in at the beginning
Starting point is 00:44:05 is not apparent. There is no, if anything actually works in that model, it's pure luck, right? Like that's not, it was because age didn't matter because you just threw away all the useful information about age. But that's the kind of thing as an SRE is hard to see. And I think that for me in ML SRE is what's interesting is that's a good example of you don't have to know how machine learning works. You don't have to be tremendously sophisticated about it, but you need to know what quantization is and how you define it and where it's defined and ensure that those definitions are consistent. And similar with all kinds of other things, like it can be as simple as like you have numbered features and like the age was feature number one.
Starting point is 00:44:46 And then in serving the temperature is feature number one. Well, that's going to be a weird like it's not going to work well for you. And so like those kinds of very, very simple off by one errors are very, very simple configuration errors. In traditional software services, they show up as like fatal errors or really obvious errors. But the webpage didn't serve or the contents of the table are empty. But in ML, it comes as people aren't so happy or the results are kind of weird, but only for queries from China and Germany. And you're like, what's up with that? And you're like, that's just what's happening right now. And that can be the result's up with that? And you're like, that's just what's happening right now.
Starting point is 00:45:25 And that can be the result of something more subtle in the data layer. And so that's what I think is exciting and frustrating about ML reliability. That's really cool. So one of our questions was around that. So like monitoring, debugging. I've heard that one of the top reasons
Starting point is 00:45:41 why engineers don't want to ever leave Google is because the tooling is just so world-class. So what kind of fancy debugging tools do you guys have for like, you know, ML systems where, you know, like kind of tackling like exactly some of the issues that you just pointed out? Yeah. Well, so on your first point, I will say like that is hilarious because there is not an SRE at Google who does not complain incessantly about the terrible quality of our tools. They're awful.
Starting point is 00:46:09 Like, you know, it's just funny. You know, it is like they are no crazy general good use, good tools for like, there's no notion of like, this model is good, right? Because it depends on what it's supposed to do. And so I think when I look at, so frankly, like I am dissatisfied with the state of general purpose infrastructure for this. What I think you can do, which is useful, is you can have a set of infrastructure that takes metrics and like lets users define metrics for their models that they'd like to track and can set some thresholds on those.
Starting point is 00:46:58 You can also have infrastructure that runs golden sets through a candidate model. So in the case where people are building new models, this idea of a golden set is like, here's some questions and the defined correct answers, right? And so like, if you give me those questions and you give me a new model, I'll give you back the results and we can compare them to what you said they would be.
Starting point is 00:47:19 And the problem with ML is if they're a little bit different, that actually might be good because you might've improved the model. If they're a lot different, then we're not going to ship that into serving until we understand why it's a lot different. And so there's a couple of things you can do with the infrastructure layer to facilitate that. But in all of these cases, you might have noticed, we're actually punting to the original model team to tell us what it's supposed to do. And what it's supposed to do is about the – is this a photo categorization thing? Is this a text prediction thing? Is this a fraud detection thing? Like,
Starting point is 00:47:49 until you know what the model is supposed to do, it's very difficult to tell whether you have a good model or a bad model. So I think we have some tools, but we not not enough yet, not even close to enough. It's kind of interesting, the sort of, would it be fair to kind of compare to test-driven development and then moving that to the ML side, where at least you should know or you need to be able to quantify what your model does. And that becomes really difficult, right? Kind of to your point, I feel like what I'm used to
Starting point is 00:48:18 is having a golden set that's very specific at the record level. And then maybe you look some general, maybe flag rate for your predictions, or if you're doing something else, but just very high-level statistics. Is that also generally how you guys are? One thing that you mentioned I thought that was really cool is to do these slices on different populations.
Starting point is 00:48:40 But that does feel like you would require quite a bit of infrastructure to kind of enable those use cases um yeah i think that's right i think that like so so yes like in general you're like we're going to have some like so when you build a model you have some something you're trying to accomplish you know some objective function and so you train the model according to that objective function the objective function includes the things you're trying to accomplish. This model is trying to increase the quality of click prediction. This model is trying to increase the satisfaction of users in this particular case.
Starting point is 00:49:17 This model is trying to lose as little money as possible if it's a trading model or whatever, right? So you already have the general objective, the metric that you have, and you can track the model against that. But the problem is that these things live and breathe and the world changes around them and our understanding of the model grows. And so, you know, frequently you're like, this is a great model, but not for this class of cases. Like this is a stellar model, but not for this class of users. So what do we do about that? Okay, well, you know, I guess now, and that's really where the idea of these slices come from is you start to really
Starting point is 00:49:56 subdivide and say like, actually like to meet the widest possible set of needs, I need to actually narrow my focus and then jump around, right? So like, if you look at a text prediction, like the text prediction models were really, really good in US English and terrible and everything else. Okay. And then they added more languages and they got a little bit better. But until very recently, I think until last year, if you had two languages enabled on your Android phone, they were terrible because they couldn't figure out what language you were speaking. And so like, I don't know, I give up. I'm just saying stuff. And half the time, my phone is set to English and Spanish. And I'm like, this is just gibberish. If I talk to one part of my family, it's gibberish. If I
Starting point is 00:50:38 talk to the other part of my family, it's equally useless. And that was a case where, if you think about it from an aggregate-data point of view, how many people who have Android phones in the world have those two languages turned on? Most people don't even know you can turn on two languages. So I'm assuming it's like me and six people. So you need people who really understand the uses of the model, and who really understand what the model is trying to accomplish, to be able to start to get some infrastructure to look at slices. But I think one of the things that you sort of implicitly point out is, if your model is important enough, it's important to spend some time understanding how well it's doing. And I can tell you, for example, for some of the ads models, we have 15 years of analytics infrastructure to try to understand: how well is it performing with these kinds of queries? How well is it performing in these markets? And that makes sense, right? This is a lot of money for Google. Google takes it seriously. We want to make the money and serve the user's needs and sell good ads. But then in other cases, if it's just a little bit experimental, it's harder to justify that kind of analytical investment.
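As a rough illustration of the slicing idea (a sketch assuming a simple tabular evaluation set; this is not Google's analytics stack), computing the same quality metric per population slice, rather than only in aggregate, is what surfaces cases like the bilingual-phone one:

```python
# Sketch: evaluate one metric per population slice instead of only in aggregate.
# The slice key and the accuracy metric are illustrative assumptions about what
# the evaluation data contains.

from collections import defaultdict
from typing import Callable, Iterable, Mapping

def accuracy_by_slice(
    examples: Iterable[Mapping],   # each example: {"features": ..., "label": ..., "slice": ...}
    predict: Callable,             # features -> predicted label
    slice_key: str = "slice",
) -> dict:
    """Compute accuracy separately for every population slice in the eval set."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        s = ex[slice_key]
        total[s] += 1
        correct[s] += int(predict(ex["features"]) == ex["label"])
    return {s: correct[s] / total[s] for s in total}

# Hypothetical usage: a model that scores 0.95 in aggregate might still show
# something like {"en": 0.96, "es": 0.94, "en+es": 0.40} when sliced by the
# languages a user has enabled, which the aggregate number completely hides.
```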
Starting point is 00:51:51 Yeah. And I think it's also even more interesting, slash complicated, because during model development there's also evaluation and test, right? So then it's like, how do you build the infrastructure such that you bring that testing as close as possible to the actual production slices that you create? I feel like that's also a pretty big challenge, right? I think it is. And going back to the first part of your last question, I really do think that thinking of the machine learning cycle, the pipeline of development, as being very similar to CI/CD and very similar to just software
Starting point is 00:52:32 engineering in general, I think we should be moving in that direction. Well, to move in that direction, we need to think carefully about what the general-purpose testing infrastructure is. And when you think about it for software, and this is maybe your point, for software I also don't know what any given method is going to do. I require the user: if you just wrote a function, you'd better give me some unit tests. I'll run the unit tests, but I don't know what inputs the function takes. I mean, I can look at the header, but I don't know what it's supposed to do. You do, right? And so I think with ML as well, we need to move towards that, and there are a couple of pieces of open source that are starting to hinge on this,
Starting point is 00:53:11 but I think infrastructure for model quality measurement and model quality improvement is something that we're going to see a lot of in the next few years. Because if I'm developing a model, the first thing I want to know is: is it garbage? And if it's kind of garbage, what can you tell me about that? How am I doing? Because that's how we're going to get better.
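In that CI/CD spirit, one way to picture "unit tests for a model" is a pytest-style quality gate that the model owner writes, because only they know what the model is supposed to do, and that the platform simply runs on every retrain. Everything below (the toy model, the tiny golden set, the 0.9 threshold) is invented for illustration; it is a sketch of the idea, not anyone's actual test suite.

```python
# Sketch of a pytest-style quality gate a model owner might write. The model,
# golden set, and threshold are stand-ins; the point is that the owner encodes
# "what good means" and the platform runs it like any other CI test.

import pytest

class ToyModel:
    """Stand-in for a freshly trained candidate model."""
    def predict(self, features: dict) -> int:
        return int(features.get("score", 0.0) > 0.5)

def evaluate_accuracy(model, examples) -> float:
    correct = sum(int(model.predict(ex["features"]) == ex["label"]) for ex in examples)
    return correct / len(examples)

@pytest.fixture
def candidate_model():
    return ToyModel()

@pytest.fixture
def golden_set():
    # A fixed, human-checked evaluation set; tiny and fake here.
    return [
        {"features": {"score": 0.9}, "label": 1},
        {"features": {"score": 0.1}, "label": 0},
    ]

def test_candidate_is_not_garbage(candidate_model, golden_set):
    # The model owner, not the platform, decides what "not garbage" means here.
    assert evaluate_accuracy(candidate_model, golden_set) >= 0.9
```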
Starting point is 00:53:23 Yeah, this sounds like shifting all the testing towards the left, as we talk about for general software, and thinking about the general pipeline for ML models. And like you said, we should also be thinking about this like the CI/CD pipelines we have for non-ML systems. So anyone who has read the Google SRE book, or just thought about productionizing a service, normally thinks about, okay, I have a kind of production readiness checklist that I would go through and say,
Starting point is 00:53:56 does it hit all of these things? Does it have a disaster recovery plan? Do I have my metrics, monitoring, alerting, and all of those things? I assume the checklist would look a little different for a machine learning system, where I have to care about the data distribution, like some of the problems you mentioned: hey, is your quantization configuration right throughout the pipeline? Did we miss data over the last month in a region, or things like that? So when you're thinking about productionizing a machine learning model, or a machine learning system itself with some of these complicated pipelines, what kind of
Starting point is 00:54:30 quote-unquote readiness checklist would one think about, or do you guys think about? Yeah, it's a really important question. I think the biggest challenge we have in even getting that conversation started is that many model developers think they're going to develop a model once, or a very small number of times. And so they really approach this with, just let me slap some stuff together. That's pretty common. And I'm like, far be it from me to get in your way. But what's going to happen is, nine months from now, I'm going to be dealing with the crap you just slapped together. And that's going to make me sad.
Starting point is 00:55:12 So in terms of the approach, I don't want to slow people down. And I actually do think a lot of people slap something together and discover it's not a good idea, or discover they got what they needed out of it. And they may update it once or twice, but they're not going to run it continuously, and they're not going to do 20 variations of it. But other people don't stop there. And it's those cases where we really want to intervene. I think the right answer here is that we need to make the platforms do most of this, which puts pressure on: do we really understand the use cases, and how many of them can we accomplish? Because if the platforms are going to do most of this, they have to handle things like the most common problem we have in machine learning that's not well structured: the data you wanted to use is not available, or a different version of it is available, or it's in a different format, right? That's one of the most common things that people are just
Starting point is 00:56:01 like, I'm training a model. I trained this model before. I'm training it now. It's not working. You're like, why is it not working? Well, because three of the 26 data sources don't exist anymore, or are different than they were when you trained this model. Right. And for many of the libraries or training environments people use, that's actually hard for them to figure out.
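A minimal sketch of the kind of pre-training check that catches this class of problem, assuming the pipeline can at least list its inputs and the columns they are expected to have; the paths, the SCHEMA file format, and the helpers are invented for illustration:

```python
# Sketch: verify that every data source a training job depends on still exists
# and still has the columns the model was trained against last time.
# The paths, schema format, and read_columns() helper are illustrative.

import os

EXPECTED_SOURCES = {
    "/data/clicks/daily": {"user_id", "item_id", "clicked", "timestamp"},
    "/data/users/profile": {"user_id", "country", "language"},
}

def read_columns(path: str) -> set:
    """Stand-in: a real pipeline would read Parquet/Avro or warehouse metadata here."""
    with open(os.path.join(path, "SCHEMA")) as f:
        return set(f.read().split())

def check_training_inputs(expected=EXPECTED_SOURCES) -> list:
    """Return human-readable problems; an empty list means it is safe to train."""
    problems = []
    for path, expected_cols in expected.items():
        if not os.path.isdir(path):
            problems.append(f"{path}: no longer exists")
            continue
        missing = expected_cols - read_columns(path)
        if missing:
            problems.append(f"{path}: missing columns {sorted(missing)}")
    return problems

if __name__ == "__main__":
    for problem in check_training_inputs():
        print("TRAINING BLOCKED:", problem)
```

The point is not the specific check but that it runs before an expensive training job does, so "three of the 26 sources changed" becomes a clear error message instead of a mysteriously worse model.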
Starting point is 00:56:24 I think for us, the next most common problem that we really need to solve is just scheduling of training jobs. A lot of the machine learning training jobs that people run at Google are very large, just because we have a lot of data. So frequently people say, I would like to train on all the X.
Starting point is 00:56:40 You're like, okay, that's this many exabytes. How are you feeling about that? And half the time, they're like, oh, that sounds fine. Okay, we're going to town. We're training on exabytes. And half the time, they're like, what's an exabyte? Okay, that means that was not the conversation you
Starting point is 00:56:55 thought we were going to have, so let's go back and try to figure out what you're trying to accomplish and whether there's any other way to do it. Because frequently, people just say, I want all the X. Do you want all of it? Yes, all of it. Well, no, it turns out, probably not. But I think what we really need to do is start focusing on building those platforms so that by default people get roughly the right stuff. And I think there's a tension there. I work with SREs, and you all work with or are SREs,
Starting point is 00:57:23 and SREs want to do everything perfectly. Actually, for the people messing around, I want to do something good enough. I really want to make it super easy to do first, and then have a reasonable path to super well done. Because I think super easy to do is how we're going to get people in the door. And so one of the things I've been thinking about recently is thinking about these platforms in tiers that have different SLOs and different models of what quality looks like. So I'll give an example. If you're just playing around with a model, you want to be in the serving-system tier that syncs it out somewhere immediately, but you don't care if it's reliable.
Starting point is 00:58:06 You just want to look some stuff up, right? And so you don't need it globally replicated with very high capacity allocated. You need to look up something in it right now. On the other hand, if you're syncing the 3,476th version of your model, and this is a daily update or an hourly update, if I take 20 minutes to get it everywhere in the world, you just don't care, but you do care if I drop any requests. You're like, I want it reliable.
Starting point is 00:58:29 I want it fast. I don't care how fast the new thing is available. It doesn't have to be perfectly fresh. Well, as you can probably intuit, those are actually two different systems. They're related to each other, but they have fundamentally different requirements. And so I'm arguing that for the people messing around, experimenting, trying to innovate something, we need to think about their needs as a first-class set of requirements that are different from the needs of the smaller number of very highly demanding production users. This might come across as weird, but as you were talking about different requirements, in my head I was thinking about different databases, as there is not one database that fits all use cases. Absolutely, totally agree. There's a relational store, there's a key-value store, there's a document store, and you cannot just fit everything in one place, because, well, the requirements are different.
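One way to picture the two tiers being contrasted is to write the requirements down explicitly. The numbers below are invented, and this is only a sketch of the idea that "experiment" and "production" are effectively different serving systems behind one interface:

```python
# Sketch: the same "push a model to serving" interface backed by tiers with
# very different SLOs. All numbers are illustrative assumptions.

from dataclasses import dataclass

@dataclass(frozen=True)
class ServingTier:
    name: str
    sync_latency_minutes: float   # how quickly a new model version becomes visible
    replicas: int                 # how widely it is replicated
    availability_slo: float       # fraction of lookups that must succeed

EXPERIMENT_TIER = ServingTier(
    name="experiment",
    sync_latency_minutes=1,       # "I want to look something up right now"
    replicas=1,                   # no global replication, no reserved capacity
    availability_slo=0.95,        # dropping some requests is fine
)

PRODUCTION_TIER = ServingTier(
    name="production",
    sync_latency_minutes=20,      # the 3,476th daily push can take its time
    replicas=30,                  # replicated everywhere the traffic is
    availability_slo=0.9999,      # but never drop requests
)

def pick_tier(is_experiment: bool) -> ServingTier:
    return EXPERIMENT_TIER if is_experiment else PRODUCTION_TIER
```

An experimenter trades availability and replication for immediacy; a mature production model trades freshness for never dropping a request.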
Starting point is 00:59:17 In terms of thinking about these requirements, one aspect that you mentioned is, well, I want to know when it doesn't go right. And when you actually put something in production, what does the feedback loop look like, both for the SREs working on the ML side of the world and for the model developers? You know, I think the interface between those two is the single most complicated part of ML SRE. And it's something we're still working on because, to be fair, I think that's actually complicated for a lot of services. Many of our services are tricky and subtle. And so a lot of us who have worked as SREs and who have worked as application and platform developers, we argue about this barrier a lot, right? We're like, is this your fault or
Starting point is 01:00:03 my fault? And by fault, I don't want to be blamey, but: who should have the fundamental responsibility for fixing this kind of thing? And so I think my first step is, let's be as collaborative as possible. But the very next step is going to be that we can't scale the SRE team if they're responsible for model quality issues. Okay, so that sounds good. We're like, model quality belongs to the model owner. Sounds good, right? Well, the platform can mess up model quality, right? We know that. So wait a second, right? It sounded good. We're like, we're going to run the platform, and you're going to run the models, and the platform will take care of the models. But if the models are no good, that's your problem. Unless too many of the models are no good in kind
Starting point is 01:00:49 of the same way, in which case that's our problem, right? So this is where it gets really tricky and really interesting. And honestly, I could spend a very long time on this; I spend a lot of my day thinking about this and talking to people about this. But the short version is, we haven't solved that yet either. And for anyone who's serious about making multi-party ML platforms, unless I'm just stupid, I think this is
Starting point is 01:01:12 the meat of the whole thing. It's weird for the vegetarian to be like, that's the meat, but that's the center of the whole thing. That's the value right there: if you can figure that out in a scalable way, so that model owners and model builders are enabled, everything's working well,
Starting point is 01:01:31 they're excited about that, and they can fix their own problems when their models are bad, but the platform owners are kept out of that most of the time, except when it's legitimately their responsibility because it comes from the platform. That would be the holy grail. That is not yet achieved in any of my teams, but we're thinking hard about it and continuing to work on it. That's like a really good way of actually kind of capturing some of the,
Starting point is 01:01:57 I feel like my grinds with the problem as well. Awesome. So to wrap up, thank you so much for taking the time today, Todd. Just kind of the fun question that we like to ask. What was the tool that you recently discovered and really liked? Recently discovered and really liked. So it's not software at all.
Starting point is 01:02:20 Like I've been, you know, so I live in Pennsylvania. So I grew up in Puerto Rico where you can just like the process for growing things is you look outside and stuff is growing and then you eat some of it. And here in Pennsylvania, growing is intentional. And so one of the things I've been super excited about is like raised beds in the garden. They're freaking fantastic. I'm growing stuff. You can grow stuff and eat stuff.
Starting point is 01:02:43 It's easier to get to. You can grow more of it because it's all loose. And I mean, that's old-school technology, right? It's just boards in a shape with some dirt in them, but that's been the thing. I know that wasn't what you were expecting, but I'm talking raised gardening beds here, buddy. Showing off that you have a yard and a garden, I see you, Todd. Come to Pittsburgh. Houses here are cheap as chips; you can buy several of them. The number of my mid-career engineering colleagues who just keep a house when they move, because they might need it later. The houses are cheap here. You
Starting point is 01:03:17 should come move here. It's great. Must be nice. The gardening kind of reminded me of your LinkedIn skills section, so I did more stalking on LinkedIn. Yeah. I want to call out a couple of things. And I think anyone who hasn't looked at your LinkedIn profile should just go and look at it, because it's very entertaining, to say the least. What do I got? Yeah, what are you seeing? Okay, so there are things like Terracotta retaining walls.
Starting point is 01:03:42 Terracotta, right? Like, I don't know. I don't actually know that much about terracotta, but I just love these skills that are a noun. Why is there a skill that's a noun? So there are two which are also very interesting. Third world driving. Absolutely.
Starting point is 01:03:56 Plus one. Right? It's different, right? So when I drive in the Caribbean, you just drive differently. There's this whole idea Americans have that every lane is sacred and you're supposed to stay in it. And if you try to stay in your exact lane in a lot of parts of the world, that's not going to go well for you or anyone else. You need to have a much more flexible, aggressive notion of driving. Yeah, so
Starting point is 01:04:19 someone told me it's like the water is flowing and you need to be part of the water in the traffic. Just join the water, be part of it. And I don't know if you heard this, but there was a good Freakonomics episode on roundabouts. And one of the things that came out of that is that roundabouts are one of the hardest problems for self-driving cars. Because if you actually do it right, you should definitely never enter a busy roundabout; there's no safe time to enter a roundabout. And so they're like, well, you gotta kind of guess and try, but also be ready to stop, and apparently it's really hard. What was the second skill? So the second skill was political asylum. I'm like, hmm, that's interesting. Yes, I don't know. So where this came from, the first part, the way I found out
Starting point is 01:05:01 about this is that a co-worker, Andrew Ames, endorsed me for nuclear proliferation. Not anti-proliferation, which I also don't know much about, but for nuclear proliferation. And it came to me as a notice, like, do you approve this? I'm like, first of all, yes. But second of all, how did you do that? And that was where the whole thing got started. I see. So the last one I'll mention, and it seems that you are really good at this skill, because
Starting point is 01:05:23 you have the most endorsements on that one. And it's getting LinkedIn skills endorsements. That's what I'm telling you. That is a self-documenting skill. Either a lot of people endorse you for it or they don't, in which case it is what it is, right? Absolutely. Feel free to peer me on LinkedIn
Starting point is 01:05:39 and endorse me for getting LinkedIn skills endorsements. Oh, for sure. I'm going to plus one to a lot of these because we have the proof now. Proof, excellent. Oh, for sure. I'm going to plus one to a lot of these because we have the proof now. Proof. Excellent. Well, moving on. Is there anything else you would like to share with our listeners, Todd?
Starting point is 01:05:53 No, this has been a great conversation. I think the kinds of topics that you all are raising here are of super general use. This is, you know, I think a lot of people get focused on ML, ML, AI. But there are subtle things about what we're doing that are a little bit different. But there are a lot of things that are just about modern distributed computing and how software works on like medium sized collections of computers. And a lot of people work on software and medium sized collections of computers. And a lot of people work on software and medium-sized collections of computers. So I think the thing I would say is for people who are going into ML and going into data sciences,
Starting point is 01:06:31 if you're excited about the model building and you're excited about that side of it, you should go to that. But there's going to be a huge amount of work for a very long time in making this stuff work. And so if you are a software engineer, you are a systems engineer, you are a data management engineer, and you are an SRE, this is really, really good stuff to know. And you should not be worried about not having the academic background in modeling.
Starting point is 01:06:57 Oh, that's really good advice. On that note, thank you so much for taking the time, Todd. This has been awesome. Yeah, it was great. Thanks for having me. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com.
Starting point is 01:07:22 We would love to hear from you. Until next time, take care.
