Software Misadventures - Uma Chingunde - On managing migrations, growing engineering teams and much more - #8

Starting point is 00:00:00 My philosophy is that if it's work that everyone wants to do, like if it's bright, shiny work, then you spread out the opportunity so that everyone gets an opportunity. And if it's grunt work, then also you kind of essentially spread out the opportunity. So essentially, that's kind of like, you know, your way to fairness, right? So my thing, like the broader question is, you just have to make sure that everyone participates in all phases and not just the fun stuff. Like, you know, that's the way to a healthy team. Welcome to the Software Misadventures podcast, where we sit down with software and DevOps

Starting point is 00:00:42 experts to hear their stories from the trenches about how software breaks in production. We are your hosts, Ronak, Austin, and Guang. We've seen firsthand how stressful it is when something breaks in production, but it's the best opportunity to learn about a system more deeply. When most of us started in this field, we didn't really know what to expect, and wished there were more resources on how veteran engineers overcame the daunting task of debugging complex systems. In these conversations, we discuss the principles and practical tips to build resilient software, as well as advice to grow as technical leaders. Hey everyone, this is Ronak here. Our guest in this episode is Uma Chingunde. Uma is a VP of Engineering at Render. Prior to this,

Starting point is 00:01:27 she led the compute infrastructure group at Stripe. And before that, she led compute virtualization teams at Delphix and VMware. Austin and I had a great time speaking with Uma. Our major focus in this episode was large-scale infrastructure migrations. And Uma shared many insights on how to manage them successfully.

Starting point is 00:01:42 We discussed the importance of communicating the why behind a migration, identifying success metrics, creating a culture where migrations are identified as highly impactful projects, and much more. Uma also shared stories where parts of a migration didn't go as planned, how the team fixed the issue, and the kind of engineers she thinks would make good tech leads for these projects. There's a lot to learn from Uma's experience, and we had a great time speaking with her. Please enjoy this highly educational conversation with Uma Chingunde.

Starting point is 00:02:19 Hey, Uma. Super excited to talk to you today. Welcome to the show. Great. Thank you. Thank you for having me. So we thought we would start with asking you about your background and how you entered the infrastructure engineering space. Great. So I'm actually going to joke that my more recent work is higher up in the stack than my early career work. So my first major job in the US was working at VMware after I actually interned with them. And I worked in the hypervisor management group, on this product called vSphere. And to kind of describe it in a very simple way, it managed a cluster of hypervisors, which was VMware's hypervisors, which kind of was like the cluster management software. And if you kind of go broader, that's actually what companies like AWS or GCP or Azure use in their backend. So

Starting point is 00:03:13 essentially like, you know, when it's someone else's like, you know, server, it's usually like someone else's virtualized server that's like somewhere in the cloud. And so I started at VMware, then worked at Delphix that was doing a very similar product, but for databases, trying to virtualize databases. But I was kind of very cognizant of the fact that generally tech as an industry for over a decade now has been moving to SaaS. And so it was kind of like an intentional effort to move to companies that are essentially like, you know, software as a service. And the kind of natural extension for me was to look for a role in an infrastructure team because that's closest to what my experience was previously.

Starting point is 00:03:55 So I did that at Stripe. Most recently for a few years on the compute group. And now I'm at Render, which is, I think, another interesting abstraction where we're building the next level of abstraction for people wanting to deploy to the cloud. Yeah, that's very interesting. So you mentioned you were

Starting point is 00:04:15 on the compute group at Stripe. Can you tell us more about your role there and what the compute group looked like? Yeah, for sure. So essentially compute was kind of, the name is intended to be a little self-explanatory. So the idea is that Stripe is essentially providing a payments API to users across the world. And for everything that Stripe is running, that's kind of essentially being run and provisioned on compute resources that my team

Starting point is 00:04:47 managed. I mean, we were hosted on a cloud provider ourselves. But if you are a product engineer at any company like Stripe, you don't necessarily want to have to deal with the nitty-gritty of the cloud provider APIs. So what my team did was essentially abstract away what a typical cloud provider provides in terms of, you know, compute instances, and build abstractions on top of those instances so that essentially you were just focusing on building your service versus having to deal with also the nitty-gritty of running the service. So we made it easy for you to build your service without thinking of where and how to run it. So reliability, scale, we also did. So essentially, we built our internal Kubernetes layer, and also managed, you know, the service-to-service communication via Envoy, which is a service mesh that we had adopted internally.

Starting point is 00:05:52 Oh, very interesting. So would it be fair to say like your team built the abstraction layer for the rest of engineering to say, okay, I'm going to tell you my service, and I'm going to tell you the compute I need, and just run it somewhere in a data center or on the cloud. Yeah, I wouldn't say we were 100% there, but that was like the reason for like my team's existence, basically. I see. What's interesting is I've seen a lot of compute teams who build this abstraction layer. I mean, I work on a compute team myself, so I can relate to a lot of the challenges you

Starting point is 00:06:21 might be dealing with. What's interesting in there is you mentioned you also built the Envoy proxy layer to provide those service mesh capabilities. Did you collaborate or did you have to collaborate with the network team as well on this or the traffic team to build that out? So I think with the traffic team, yes, very heavily.

Starting point is 00:06:38 So we had a traffic team at Stripe actually because I should kind of clarify in the things that my team managed, we had like multiple clusters and the edge cluster was actually managed by our traffic team. And so they had their own rollout of Envoy. So we worked very heavily with them. And my team managed the kind of internal compute cluster versus they managed the edge network and edge cluster. But yes, we collaborated together heavily. So the separation of responsibilities was we managed the network,

Starting point is 00:07:12 the service-to-service communication for the internal, for the cluster that we managed and all the infrastructure we managed, they managed it for theirs, but there was heavy overlap, right? Because with the network, you don't have such a clear separation. Yeah, that makes sense. And usually, so your product is essentially a product for all the engineers at Stripe. And so the customers are all internal, which is good, and also sometimes challenging, because they will give you feedback really quickly when it's good. It shows right away and it's extremely gratifying. But sometimes different teams have different requirements. And sometimes the single layer of abstraction doesn't work for everyone.

Starting point is 00:07:57 How did your team or how in your experience have you dealt with these requirements in general, when you know that this is going to work for 80% of the use cases, but there is going to be this 20% for whom we'll have to either give access to the raw APIs under the hood, or we need to do something else? Yeah, I think this is a really good articulation of a common problem for internal teams. And also, honestly, it's kind of like a subset of a problem that any infrastructure product actually has. We actually had a version of this at VMware itself, right? So we kind of, I think one way to think of it is like cohorts of users, like you have your bulk users,

Starting point is 00:08:35 and then you have like your super or admin users that we would call them. So at Stripe, it was essentially like an ongoing conversation, right? Like what is the class of user that you kind of optimize for and build your interfaces for? And who do you kind of just say, actually yours is a specialized case. So it was like always an ongoing conversation and open point of dialogue. But the idea was that because you also kind of have the luxury

Starting point is 00:09:04 because it's an internal team, that you just let them kind of, you know, run with the more fine-grained access. And there were definitely teams that had that; not all of their workloads could eventually be moved to Kubernetes. So they were just like a separate cohort of users. So our kind of overall broad strategy was thinking of these as cohorts of users and approaching them that way. And I think over time, something that we also realized is you do in the end have to draw a kind of dotted line to these users, to your external company users. So you kind of have to either directly or indirectly optimize for the business.

Starting point is 00:09:58 So for Stripe, it's like your payment users, and the most important payments products are where you will focus most of your attention. But then, you know, with key critical investments being made into, say, emerging businesses that maybe had a different thing, or any other kind of one-off use cases, you do have to then make strategic bets, where maybe you let them go develop on their own. So it was a combination of engineering and business decisions, I guess, is a good summary. That's really good insight. I haven't thought about it that way, but what you said makes a lot of sense, that along with working with the internal

Starting point is 00:10:38 teams, tying it back to the business itself and then optimizing for some of the decisions you make and how you prioritize them. So now you've moved on to Render. By the way, congratulations. Thank you. I know you joined recently. And as you mentioned, Render is also building kind of a compute layer with another abstraction on top of it. And in this case, your consumers are actually not internal teams, but that's the product itself. So a lot of your experience ties into the product at Render really well. I'm curious, can you describe your role at Render? And also what kind of

Starting point is 00:11:18 similarities or differences you see in the product that you're building or the challenges that you see as users are using the product? Yep. So it's actually one of the, this actually touches on one of my motivations for joining Render because I was already following them for a while because I was like, okay, this is interesting.

Starting point is 00:11:38 This is an interesting product. We kind of used to joke internally that at Stripe, that what our users really want is for us to really abstract everything else and just give them a way to run their services, right? And that's really kind of, you know, that is what all developers want. So I had kind of been like, you know, following Render and there was always also the Stripe connection where the CEO and co-founder is also an ex-Stripe. So, you know, there's just kind of like this association. So I'd already been following them with this interest in mind.

Starting point is 00:12:10 So when they reached out and wanted to talk to me, I was like, yeah, obviously, I mean, at a minimum, I want to learn how you're tackling this problem. Because the way I see it is I have the team that I was working on. And also I was part of a larger foundation. The name was foundation, but the larger infrastructure team, right? Like our equivalent exists at pretty much every company like LinkedIn, as you just talked about, has the same thing.

Starting point is 00:12:34 All our peers like Slack, Lyft, Uber all have similar versions. So it's a pretty standard problem. And the way I'm kind of thinking of it, the way this appealed to me was it's at a larger scale, like Stripe scale, you build a team to fix the problem for you. It's a similar model as if you're Google or Facebook scale, you have your data centers. The next layer, if you are Stripe or LinkedIn or Uber or Lyft, you have an internal

Starting point is 00:13:06 infra team, even though you're probably hosted on a cloud. But then what about the next layer, which is the even smaller developers? If you want to develop something, you either have to still learn infrastructure, like you have to kind of balance between learning the infrastructure to run your service or actually building your service. And so I could see intuitively, like, you know, the need for this. And so that was exciting. I had never worked on this problem at this scale, however. So it was kind of really appealing to try out something completely different in terms of scale, like build something from the beginning versus work with an existing system.

Starting point is 00:13:43 And also for me, I really like growing engineering teams, and the people side of it is really exciting to me. So the opportunity to build a startup from the beginning was something I couldn't really pass up, basically. Yeah, that's certainly exciting. I know for early stage startups, if infrastructure is not the core product, you're constantly kind of, the priorities are competing against each other. Do I build the application

Starting point is 00:14:10 that makes money or do I build the infrastructure to support it? So a product like Render makes a lot of sense. And as you mentioned, it's an early stage startup and there is an amazing opportunity

Starting point is 00:14:19 to build out the engineering team from the ground up. In terms of how you're thinking about building this engineering team and what you see at Render right now, like what are some of the things you are thinking about these days? So right now our current thing is,

Starting point is 00:14:33 so I would say, since I've started, a lot of my focus is also just growing the team itself, because currently we actually have such good traction and users want to use us. There's a clear need. Like, you know, essentially what you just described shows the need for that. We're essentially pretty much gated on our bandwidth to keep delivering new features. So it's a very good problem to have, right? And the clear solution is adding people and growing the team. And that's what we, that's what my immediate

Starting point is 00:15:05 focus is, and doing it in a way that is sustainable. And like, you know, it's like a growing and scaling problem. So that's like the biggest thing that we're doing. And essentially, we're actually very transparent with our roadmap. There's actually a feedback.render.com where folks can see what we're building. And currently, the things constraining our opportunity are pretty much our own personal bandwidth and our ability to execute. And so that's where, you know, just onboarding new people in a sustainable way and just building comes in. And that was partially my excitement as well, which is, after having done different things, I was kind of missing the focus and the kind of more of like

Starting point is 00:15:52 being in the weeds and just like executing on stuff. Yeah. Being in the new role, do you get any bandwidth for yourself to do these deep dives with the team or design discussions, or does your time go into other things? Right now, I would say, so far not a lot has. But the refreshing thing, which is very different, and I kind of understood it but I hadn't really seen how much it would actually be the case, is how much day-to-day overhead there is in a much larger organization, right? Like just your volume of email, just the volume of meetings, just the overhead of communication that comes from a few thousand people is so much different from, like, my total team. Like, Render as a company is 14 people right

Starting point is 00:16:45 now. So when you think of that, right, you just kind of, you know, cut through a lot of that. So I do have a lot more time to, you know, actually sit down and absorb the product. So far, I think I'm still scratching the surface, but I'm actually very excited to be able to do that. Nice, nice. Yeah. That sounds like a pretty exciting transition going from Stripe, which has been growing at a very rapid pace over the last few years, I'd say, to a much smaller startup as well. I wanted to kind of take a step back. You've pretty much always worked in these spaces where any abstractions that you're working on are generally going to be

Starting point is 00:17:25 pretty huge. And the impacts are going to be pretty big. And on the compute side, as you guys are growing this platform, a big part was, of course, you know, I'm going to keep delivering these features. And I would also assume that a lot of customers, while you were at Stripe, may not have been on that platform already. So I kind of want to pull back onto the whole concept of migrations. And you wrote an excellent blog post on this a while back, talking about managing migrations. It was a great read. And we'll definitely put that in the show notes. But I would imagine on the compute side, there are definitely migrations that are going to be needed there. Some that are maybe easier than others

Starting point is 00:18:05 and some that are a little bit more, yeah, scary to even encounter. But I think the blog that you wrote gave a very good rundown of just your thought process of how you go about it. And I kind of want to talk about that more today, and kind of jump into that. For starting any sort of migration, I'm assuming the first part always is the planning part. If you don't plan for it, you're, I'm assuming, pretty much set up for failure. I think I've seen this firsthand on my side, migrations that have gone well, some that have gone awfully. So yeah, I just want to get your perspective on that. I wrote a little checklist in that as well,

Starting point is 00:18:51 which is kind of like, you know, things to think of even before you have written, you know, a single line of code or done anything to migrate. Some of it depends on the time you have, right? Like some migrations are planned and some are, you know, last minute, like, I refer to the Spectre/Meltdown migration that we had to do. So I think it really depends on how much time you have. But I do think it's one of those things where the metaphor of measure twice, cut once really helps. So the way I like to think of it is, you really have to invest in the planning, and depending on your bandwidth, you can obviously, you know, constrain the planning to being a quick iteration and then keep going, versus you

Starting point is 00:19:38 actually spend a lot of time doing the planning upfront. But at a minimum, it's kind of like this checklist that I actually put together in my blog, which is just like, you know, just kind of sitting down at a minimum for an hour and just like asking, like, you know, okay, what does this migration mean? Why are we doing it? What are the goals? Like, you know, is 80% the goal, 50%, 100%, right? And like, what is the kind of almost like OKR style? What's the 100%, what's the 80%? What's the priority? What are the constraints? Like, is it like an execution constraint?

Starting point is 00:20:15 Is it like a technology constraint? Things like that. So at a minimum, like, you know, so I tried to like really summarize a checklist, which I put in the blog, into like the things that I just saw were very repeatable. So at a minimum, having this one meeting with the key stakeholders and having this actually put in a document, which is like, why are we doing this migration? It was actually a colleague on the team who came up with this idea, which is like why Envoy, which was like one of the first ones we did this for. And that's like the output typically of this conversation,

Starting point is 00:20:52 listing out all the constraints. So that, I would say, is like the MVP of the planning, but we can do a lot more. Got it. Yeah. And it makes a lot of sense, starting with a why. I think there have been books about this as well. And I think it definitely applies here. Otherwise, people say like, of course, why are we doing this? You also mentioned, I really like that you touched on how far do we want to go with this migration? Like, what do we want to call done? Which I think I, myself, also would say, okay, yeah, we're going to get this migration to the end, but we don't really specify, like, stamp it down, making it very clear what that is. And you talk about how you want to have some sort of metrics to track this progress, which I'm assuming means that what you call done is going to be reflected in the metrics that you're able to capture, right? Yep. Yep.

Starting point is 00:21:51 And also it can really help drive alignment. The metrics and what you call done can really drive alignment between all the stakeholders. So the Spectre/Meltdown one was actually the one where we started doing the metrics and found them to be super useful, because there we actually had a commitment to our external users, which we had kind of decided on. Essentially, we had different percentages that our security team felt comfortable committing to, and then we

Starting point is 00:22:26 communicated those to our external users, which is like, this percentage of our fleet is going to be running on this update by this time. So that really helped frame the importance of the metrics to us. And then because it was such a clearly urgent thing, that kind of, you know, helped drive it. But that experience with the metrics then helped me at least realize that if you don't have that urgency, you can still frame the, like, why are we doing this? And what does that look like, even in the non-urgent case? Because then it helps teams prioritize things relative to each other. And it drives clarity. Like, in this case, you have your account managers talking to their users, you have

Starting point is 00:23:21 the leadership team wanting to know what our exposure is, you have the security team wanting to know how fast different teams are working on it. And everyone can just look at this one dashboard, or different versions of the same dashboard, and get the same information. Yeah, that makes a lot of sense. For the metrics, on the compute side, I can imagine, let's say if it was to patch the cluster or something, it would be like, what do we want to call done at that point for the full migration? It could be 95%, 100%, whatever it is. For some of the metrics, I can imagine you can get alignment. Some of these metrics may exist.
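A minimal sketch of the kind of "percentage of the fleet patched by a date" metric discussed above. Nothing here comes from Stripe's actual tooling: the `Host` and `Milestone` records, the team names, the dates, and the 95% target are all hypothetical, chosen only to illustrate how a single shared number can back a dashboard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Host:
    # Hypothetical inventory record; a real fleet would have far more fields.
    name: str
    team: str
    patched: bool  # True once the host runs the updated version


@dataclass(frozen=True)
class Milestone:
    # Hypothetical commitment: "target_fraction of the fleet patched by due".
    due: date
    target_fraction: float


def patched_fraction(hosts):
    """Fraction of hosts that are patched (0.0 for an empty list)."""
    if not hosts:
        return 0.0
    return sum(1 for h in hosts if h.patched) / len(hosts)


def fraction_by_team(hosts):
    """Per-team patched fraction, so each team sees its own slice."""
    grouped = {}
    for h in hosts:
        grouped.setdefault(h.team, []).append(h)
    return {team: patched_fraction(members) for team, members in grouped.items()}


def milestone_status(hosts, milestone, today):
    """Return 'met', 'in progress', or 'overdue' for one milestone."""
    if patched_fraction(hosts) >= milestone.target_fraction:
        return "met"
    return "overdue" if today > milestone.due else "in progress"


# Made-up example data.
fleet = [
    Host("pay-1", "payments", True),
    Host("pay-2", "payments", True),
    Host("data-1", "data", False),
    Host("data-2", "data", True),
]
goal = Milestone(due=date(2018, 2, 1), target_fraction=0.95)

overall = patched_fraction(fleet)
per_team = fraction_by_team(fleet)
status = milestone_status(fleet, goal, today=date(2018, 1, 15))
```

The same three values (overall fraction, per-team breakdown, milestone status) could feed different views for account managers, leadership, and security, which is the "different versions of the same dashboard" idea.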

Starting point is 00:24:01 Some may not. Have you been able to strike a good balance? Like, there are some metrics where this is the perfect metric that we want, but we don't have access to it, and we would have to put a lot of time into creating that metric alone. How do you balance those two sides of it? Like, at some point we have to say, okay, this is good enough as a proxy, we don't want to go too far down this path. Yeah, no,

Starting point is 00:24:26 no. I think this is, this is a really, this is actually like a really good topic to talk about. Sometimes building the metrics and extracting them takes more time than a lot of the other things. So I think in that case, it's like you can, I think as long as you have a good enough proxy,

Starting point is 00:24:41 right? Like you can do something as simple as someone manually updating a spreadsheet, right? Like that's okay. As long as it's a good enough proxy and as long as it's not too much work. So it's essentially like you need an MVP of the metric, which is, okay, I can get all the patch versions of this in this way. And then someone has to maybe clean the data and pipe it into the spreadsheet, and, you know, it's like a hodgepodge, but it's fine, it works. Versus, if it'll take someone a week to get it all automated, then you just want to do the quick and dirty thing. I do think it's important to focus on the right version versus having clean, beautiful data

Starting point is 00:25:23 for this. The goal of the metric is to drive the migration, not the metric itself. But that's a good call. Yeah. And for a lot of these migrations, it's usually not just one team, not the one team that's just running the platform. It's usually one team that's managing the platform interfacing with many, many of the customers. So you've got to work with many other people. So for a lot of these migrations in general, it's good to have a few folks that are kind of like driving this entire thing,

Starting point is 00:25:55 which probably I'm assuming begins even in the planning segment as well. Were there any sort of key characteristics that would make for a good, either tech lead or just a general lead, in these migration efforts? I think someone that has breadth is really useful. So I think people that typically, you know, understand breadth. And if they don't have an existing understanding, as they come up against roadblocks, they are able to then dig through them. So an example would be, you know, oh, we have these kind of weird stateful machines that are relatively harder to patch. So the best case scenario is you have someone with a lot of knowledge of your systems that is like, oh, yeah, we have these weird systems that are going to be harder to patch.

Starting point is 00:26:51 So we get started on them, start special casing them, et cetera. And if not that, then you want people that are able to problem solve, essentially debug. I think that being said, though, I think in the end, it's honestly an alignment and prioritization question because the bigger problem often is that everyone actually knows what needs to be done, but they're just too busy to do it.

Starting point is 00:27:19 So that, I think, is actually, I would say, the broader thing. As long as you have alignment where the leadership team and engineering overall is like, okay, yes, this migration is the most important or the second most important, I am going to spend X amount of time on it, usually the problem then gets tractable. So the way I like to think of it is like you have the core team and then you have representation

Starting point is 00:27:43 across the board. Like the core team has people they can go to, and if they don't have people they can go to when they get stuck, that's when everything starts taking much longer. Makes sense. I can actually relate to a lot of the things you're saying because I was involved in the Meltdown/Spectre patching at LinkedIn, and stateful systems are interesting. I'll just say plus one to that. I won't go into the details because that's probably another conversation.

Starting point is 00:28:13 But since you mentioned that, as you're migrating these systems and you have this representation across different teams, and there's obviously buy-in and alignment at the leadership level. But as you go through the execution, the migration takes maybe a quarter or more,

Starting point is 00:28:28 depending upon what you're trying to do. And priorities evolve based on business, based on what's current within that team. And sometimes you will see one of these teams who has a unique requirement. And like you said, there is some custom work that you have to do. They just have to put in those hours and sometimes there are conflicting priorities and

Starting point is 00:28:48 they cannot. What are some of the effective ways you've seen to still push the migration forward? I'm not saying for someone to do the work, but still, repetition kind of helps: repeat that, hey, this is why we're doing this, this is how it helps. So I'm curious, what are some of the effective ways you've seen that help in this? I think one effective way is trying to make it as easy as possible for them to do it, right? So if there is prep work, especially if there is generalizable prep work, which is like, say, staging everything and then being like, we have staged this, this is where you go, this is the setting you change, these are the scripts you run, this is the how-to. Essentially, the more extra upfront support you can offer to teams that are struggling, the better. Like some tactical things we did was, you know, the team that

Starting point is 00:29:44 was doing this would run office hours and be like, hey, if you're stuck, come to these hours and we will help you do this work during those. Things like that. So that's the kind of, you know, upfront being nice, and just offering a lot of white glove support can get you a lot of the way. And then I think the other thing is aligning, which is, if they're not able to do it, what is the underlying reason? Is it something else that is higher priority that has to be delivered instead of that? And then that often comes to business alignment, which is like, that org leader has to align with your org leader

Starting point is 00:30:23 and be like, okay, maybe for this particular set of things, for this migration, we are either going to get an exception or we're going to get some other team to help out and essentially carve out a path. For that second step, it's almost always like a business slash team alignment decision versus a technical solution, right? And then the technical solution might be that, oh, the core team actually does the migration for them, but it's usually obviously like a discussion. Those are kind of like the two broad categories, I would say. Essentially, and this is what I kind of alluded to in the blog post, getting specialized program or project management help via a program manager or a TPM can be very helpful because they're trained to think of these as holistic systems and people problems. Oh yeah, that is true. I cannot emphasize enough the role of a TPM in a migration effort, for sure.

Starting point is 00:31:23 A team could go crazy just doing that task itself and not the migration if you don't have that support. Exactly. It's like the separation of responsibilities. Like there's the technical work and then there's the organizational and people side, and I would almost say that in most migrations the second one is the bulk of where the energy goes. Yeah. I should just say that. Shout out to all the TPMs who support all the engineering teams. We talk about engineering and business. I just want to say shout out to them as well.

Starting point is 00:31:51 Or anyone who is moonlighting as a TPM. Or that, yes. Yeah, so you mentioned, and this was actually really neat to hear just now, that for these migrations you've set up these sort of staging environments for teams to kind of say, hey,

Starting point is 00:32:09 this is kind of the first transition before we move fully onto the migration. And it's just kind of personal preference, but I think for anybody who talks about a migration, it's generally not a happy sort of thing, or an "oh yeah, let's totally do it" sort of thing. Usually it's one of those where it's just like, oh gosh, what's going to mess up in this? Because maybe they've been burned by other migrations in the past. And in general, migrations are just hard, right?

Starting point is 00:32:38 So I really like that staging idea, because at least for me, I see it as something that proves to me that it works, but that also kind of de-risks some of the bad fallout while going through this migration. And these are kind of the technical things, and I'm sure there are other ones too. But have you found other ways to help teams de-risk that and be less fearful of these migrations? I think going after teams that have maybe, I think getting some wins under your belt is actually key. And I also wanted to pull on the comment that you just made, which is that no one likes migrations. I think depending

Starting point is 00:33:23 on it, there might actually be, I have found, for instance, depending on the migration, there were always actually a few super users that were raring to go, that wanted to be opted into the new system, like the ones that wanted to be on our new Kubernetes cluster, for instance, right? They were like, sign us up right now. And I think that's actually a really interesting thread to pull on, because that touches on most users' problems, right? Users are willing to bear some pain if you can give them a reward, right? So: what's in it for me? So if you can actually find a carrot in your migration, which is: at the end of this, this is what you will be getting.

Starting point is 00:34:06 That's actually the most powerful thing, right? And that goes back to the why: why are we doing this, right? What's in it for the user? Is it a compliance thing? But even then there's, okay, you will be the first to be, you know, in compliance, or I won't have to manage my infrastructure, or I will get these better tools, or I will get something. So it's really important to make sure that, if at all possible, you have a very compelling reason for the people being migrated to be part of the migration. Because then you're halfway there. And that can really solve a lot of things.

Starting point is 00:34:43 Yeah, I guess that emphasizes, again, going back to the planning part of the why. If you don't have that, I'm assuming this is probably how a lot of migrations don't go well. It feels like you're pulling teeth most of the time. And for these migrations, there are the folks that are working on it directly, the ones that are owning this migration, and also potentially other engineers that are more on the customer side that have to work on their part of the migration. And you've hinted at this: we need to find a way for them to see a benefit. I can see a lot of engineers coming in, especially newer ones, who don't see migrations

Starting point is 00:35:23 as a shiny new feature that they want to implement. This probably wasn't even part of their interview process, and they're just like, well, I have to do this thing that's just moving data from point A to point B. How have you been able to communicate that to other engineers so that they can understand the full impact of the work that they're doing? Like, it's not a feature, but it's still very high impact. Yeah, I think one thing I've learned is you actually have to leverage multiple channels to make this effective. So one channel is, you know, you just have a landing page for everyone,

Starting point is 00:35:59 which is like, why? You know, why Kubernetes? Why Spectre/Meltdown? Why Envoy? Why do I as an engineer care about this migration? And you have to make sure it's really crisp messaging. And for every organization there are often preferred delivery mechanisms. Stripe was a very written culture, you know, emails got read. You could actually send out a "why" email and you could be sure that the majority of people would read it and process it. So in a company like that, you know, you use that mechanism.

Starting point is 00:36:35 In other companies, there's an all-hands presentation that maybe everyone goes to, or, you know, different channels like Slack are maybe a better one for some organizations. So you have to find the preferred communication channels for your organization and really leverage them. And then regardless of what the main preferred one is, you actually have to repeat it multiple times. So you kind of have to, you know, send the email, have the all-hands, have the VP of Eng send the email. So you kind of have to make sure the message is getting repeated. And

Starting point is 00:37:10 that's where the alignment comes in, which is that you've first done the prep work and done the alignment and got, you know, your org leader on board and said that we are going to be sending this email. So either they send the email on your behalf, or you send the email and they do a tap back, which is like, yes, everyone, this email is super important, everyone should be treating it as highest priority. Or they reference it in their notes, or they reference it in the next all-hands meeting. So you kind of have to figure it out for your org. And this is also where the scope and the scale come in, right? Is it the team, the org, or the entire company that's being affected? Based on that, you have to create different channels for the impact and then tailor your message accordingly. And this is also the why, right? That's where you would then tailor it. Like, why should I care?

Starting point is 00:37:59 What's the impact to me? And that's where the delivery of the message and the communication really matters. An example, right: in written communication, the casual user should get the most important information first, in that above-the-fold kind of section, which is: what's happening, what do I need to do, and why should I care? And all the details go below it. Got it. And being in more of a leadership role, I think a big part has always been to try to recognize the work that's being done by all the other engineers and such. Now that you've worked on many other projects as well, including many migrations, have there been

Starting point is 00:38:34 other specific or maybe lesser-used ways of how you recognize this type of work? Yeah, for sure. I think the recognition actually goes hand in hand with, kind of like, you know, [the thinking] is: I want to get promoted in the next year, and if I finish this project, I have a clear path to promotion, versus if I spend time doing this migration, my path to promotion is less obvious. So if you've created a culture like that, either implicitly or explicitly, then you have a problem getting this work done, right? So it's in all the reward and recognition that work gets prioritized. So the way to do it is to explicitly have the conversation up front, and that's on the managers and leaders, which is: this is explicitly work that everyone has to do,

Starting point is 00:39:37 right? And so you have to avoid the urge to tap the same people. The other kind of really bad signal is, oh, you just always pull the newest hire on the team to do the work, who doesn't know what they're getting into, right? That's a big red flag. So you try to distribute the work evenly and fairly and recognize it. And the recognition is, again, org-based, right? Call it out in the promotion packet, call it out in the all-hands. Maybe your company does cash bonuses, maybe there's, you know, peer bonuses. The same reward and recognition system that you have for everything should apply for this work. Otherwise

Starting point is 00:40:16 you're implicitly making the decision that no one will want to do the work. Yeah, that makes sense. Those are some really good ideas on how a team and an organization can go about conveying, not just to their customers, hey, this is how important the migration is, but also to the team doing the migration: you're doing impactful work

Starting point is 00:40:36 and this is how it helps the business. So in your blog, you mentioned that at one point in the compute group at Stripe, you were doing, I think, what, five migrations simultaneously with a team size of less than 20? Yeah. I mean, one, that sounds extremely hectic. I'm curious, how did the team manage multiple

Starting point is 00:41:00 migrations? I should clarify that statement. So a team member actually went in, I actually referenced him, Charles Hooper. So he actually just counted. So they were not all active, right? So they were ongoing. So essentially, they were migrations that were like of things where like, you know, we had started this migration from version one to version two of our internal platform. And it had just been slowly progressing for multiple years at that point. So that was one.

Starting point is 00:41:31 Then there was another one to move to Envoy. There was another one to move to Kubernetes. There was an OS upgrade. And all of them were being done with different levels of urgency and were in different stages. So not all of them were things that people were actively picking up and doing, and that's the problem he saw and cited.

Starting point is 00:41:50 Basically he was like, we have these half-done things, right? We have five parallel work streams. So essentially what would happen is, we were in different places, so people would go chip away at one of these: you know, upgrade a few more boxes while they're at it, do a little more extra cleanup here, do this. And that was the ongoing problem. So the thing was that all five of them weren't all active at the same time. They had just been kind of slowly proceeding for, as I said, in the most extreme case, multiple years at that point. And essentially, this basically was accumulated tech debt. Yeah. Tech debt is a sign of a rapidly growing

Starting point is 00:42:33 organization. So I forget where this was said, I think it was about Google or somewhere, that there's no final version of the product: the one in use will be deprecated, and the one that we want to use is being built right now. And as you mentioned, there were these multiple migrations going on, and some of these were moving forward slowly. Now, I think it would be appropriate to say that migrations are a marathon, not a sprint. And as the team is moving forward with this work, if the migration, or any work, goes on for too long, it's only natural as humans that one would start losing some interest, not because someone wants to, but because that's just the nature of the work.

Starting point is 00:43:19 So as a leader, how do you ensure the team stays motivated to continue on, and prevent burnout in the process? Yeah, I think that's a really, really important topic, especially for infra teams. I would almost turn it around and say that migrations, and just long-running projects, should for this reason actually be avoided, in the sense that even if it's going to take multiple years, you should have clear cutoff points. And that's kind of the solution we took with those, you know, up to five migrations, which was that the team collectively decided, and the manager, Ian, decided, that they would actually just focus and burn it down in one go. Because there's the cognitive and just the tech debt version of it, which just continues if you let it linger. So they actually made it a goal to finish all of the ongoing ones. And the other side of it is that no one actually benefits from an incomplete migration, because not everyone has access to it. We always have this cohort of people that can't use the newest tool.

Starting point is 00:44:32 There's actually a really good talk by a former manager of mine, Will Larson, which kind of seeded my blog, which is that migrations are actually your way to fix tech debt. Because, you know, as you finish the migration, you're fixing the tech debt. So I think the way to look at it is, one, keep it actually time-bounded, and this goes back to the kickoff: define the done point, and then explicitly have it be on or be off. That was the problem with those migrations that had been lingering, which is that we were not making an explicit decision to prioritize or deprioritize, so they were just kind of lingering. So I think that's where, if you have that explicit decision,

Starting point is 00:45:15 what does that look like? Will we stop at 50 percent, or will we wait until 100 percent? What are the costs of stopping? What are the costs of finishing? What are the benefits of finishing? So just being very intentional at every stage is what counts. And I think if you're doing that, if you're being intentional and time-boxing things, that's when you prevent burnout. Because a good description of burnout is when people don't feel like they're in control of their destiny. And so if they are constantly dealing with two versions of a system, right, like, oh, I already fixed this bug in the new system, but I'm still having to, you know, field support for the old

Starting point is 00:45:56 system because the migration isn't complete, that leads to frustration. So that's where, you know, you have an honest conversation and say, okay, we're going to deprioritize other stuff and get the team to 100%. Then you prevent burnout. That makes sense. I think the takeaway, personally for me, is: ensure there is an end state to the migration, and ensure there is clarity on what that looks like, for the team working on it and also for the stakeholders. Yep. I mean, if the migration goes on for long, you just end up supporting two systems, which is more cost on the team. So as teams work on these migrations, well, once you're done with the migration, people are super

Starting point is 00:46:46 excited to work on the new system, to support the new system, because usually it's a better, faster, improved version of what you had before. In that regard, there are two aspects to it. One is, let's build the new system. And then there is the migration part, which is, let's move the people, move the customers, from one to the other. How have you seen, or in your experience, how have people balanced work, which is, I should, let me rephrase that.

Starting point is 00:47:18 How have you seen engineers balance this work, one, themselves? And also, as a leader, do you move people around between building the system and doing the migration? Because I can imagine, as an engineer, one wanting to build more and spend less time on the migration. I think this goes back to, it's a similar problem to the reward and recognition one, right? And my philosophy is that if it's work that everyone wants to do, like if it's bright, shiny work, then you spread out the opportunity so that everyone gets an opportunity.

Starting point is 00:47:50 And if it's grunt work, then you also essentially spread out the opportunity. So essentially, that's kind of, you know, your way to fairness, right? So on the broader question, you just have to make sure that everyone participates in all phases and not just the fun stuff. That's the way to a healthy team. Makes sense. And I know we're getting to this a little late, but we love to talk about war stories and production outages on this show. And migrations are notorious for creating them, I should say. Yeah. At least in my experience, the intentions are always good. It's just that we cannot foresee every scenario.

Starting point is 00:48:36 So I'm curious, are there any war stories, related to migrations or otherwise, that you could share with us today? I think it's actually maybe harder to pick which ones I can share, because pretty much all of them had some. I was impressed by the team that ran the Spectre/Meltdown migration, though, honestly, because for the tightness of the timeline and for the scope, which was our entire fleet,

Starting point is 00:49:04 we actually had very minor hiccups. Given that, though, the interesting one was where we ran into a weird incompatibility between our OS version and the underlying machines that we were running on, and it caused an interesting production outage. That was one thing, because it was basically an incompatibility between the hypervisor and the OS

Starting point is 00:49:37 and the packages we were running. And it took a lot of digging to figure out exactly where that was. And I think we were not the only ones who saw it. And that's also an interesting anecdote, maybe, which is: if it's something that's almost industry-wide, which Spectre/Meltdown was, you're actually not the only people facing it. And that's the first time I actually saw the leverage of the network. I was actually relatively new to the team then, so I was mostly observing and coming in

Starting point is 00:50:09 to see what the existing team was building on. But we kind of reached out to a lot of our peers and got lots of, you know, support, and also from the vendors that we were using, on figuring this one out. So I think that's maybe one thing that we haven't yet touched on, which is that when you run into production outages of a large scale, sometimes you can see who else is running into these problems. That's maybe one anecdote, but I'm happy to talk about other ones. I have war stories from pretty much every migration.

Starting point is 00:50:46 Oh yeah, I definitely want to talk about more, but one question I have on this one: this sounds like a very nasty bug, incompatibilities between different abstraction layers. What was the impact? Say, for instance, you rolled out the new patch, what did the team see? So basically, the machines that had been patched were just kind of, this was a while ago, so I'm probably stating it incorrectly, with that caveat, but there was just a weird thing where they would segfault and reboot, basically. And the cluster that this was happening on was our internal build cluster, and so they would just randomly reboot.

Starting point is 00:51:30 And so then it was this trade-off: do we essentially roll back these machines? Because they were relatively isolated, there was nothing external-network-facing. So essentially we did a trade-off where, for a very short period of time, we actually rolled back the patches on them and continued working while we figured out what the right fix was. And I think this is the case where it really is, you know,

Starting point is 00:51:56 kudos to all the engineers that were in that incident, figuring it out live. It sounds like even the detection of this was pretty quick. It was almost multiple days in that particular case. I mean, but I,

Starting point is 00:52:10 well, I imagine, while managing such a large fleet, the worst kind of bugs are the ones that are intermittent, that just happen every now and then. So it's hard to know if it's because of that or because of something else. Is that something that the team has always considered for any of these sorts of things? Like, let's do this, bake for a week, I'm not sure.

Starting point is 00:52:31 I think the staging and the rollout are important. And that's why, you know, you kind of pick the less critical workloads first and then continue. In this particular case, we were lucky that it was somewhat localized, that it was a particular cluster. Again, if you roll it out completely, then it becomes harder to pinpoint which of the steps has caused the problem that you're seeing. Which is why you always want to do it in stages. That way, at every stage, as you hit problems, you know what's likely to have introduced a change, and then you can roll back or keep going. I think that's just an important way of doing these things. Yeah, identifying the issues early is very important

Starting point is 00:53:16 because rolling back when you're more than halfway through is way too expensive. And I think that's also where doing it in a way where you can partially roll back is important. Yeah, I think that would also touch on how a team thinks about the mechanism to migrate. Because one is, you make the decision. The other one is, you have to think about this as a new feature of sorts that you're ramping. And if something goes bad, you need that undo button when you want it. So thinking about the undo is super important, not just for features but also for the migration part. Yeah. So you mentioned you have other war stories to share. We would love to dig into more. Can you share another one with us? I think it was mostly just around, I think the more critical the infrastructure,

Starting point is 00:54:11 the more careful you have to be with the migration, right? And I think just never underestimate what it is, and how critical that infrastructure is. So with Envoy, it's a great piece of software, it enables so many things. But the key thing is that when you migrate to it, as we did, we were very careful, you know, we went a few at a time, setup by setup, but then at some critical point you do have to, you know, start serving production traffic on it.
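The staged approach discussed here (least critical workloads first, a health check after each stage, an undo path when something goes wrong) can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the tooling the team actually used: the names `Stage`, `staged_rollout`, `apply`, `undo`, and `healthy`, along with the example host names, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    """A hypothetical group of hosts that are migrated together."""
    name: str
    hosts: List[str]


def staged_rollout(
    stages: List[Stage],
    apply: Callable[[str], object],
    undo: Callable[[str], object],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Apply a change one stage at a time, in the order given.

    If any host in a stage is unhealthy after the change, only that
    stage is rolled back and the rollout stops, so the failure can be
    pinned to the most recent step. Returns the names of the completed
    stages and the name of the failed stage (or None).
    """
    completed: List[str] = []
    for stage in stages:
        for host in stage.hosts:
            apply(host)
        if not all(healthy(host) for host in stage.hosts):
            for host in stage.hosts:
                undo(host)  # partial rollback: earlier stages stay migrated
            return completed, stage.name
        completed.append(stage.name)
    return completed, None


# Usage: order stages from least to most critical (all names are made up).
patched = set()
flaky = {"batch-1"}  # pretend this host misbehaves once patched
stages = [
    Stage("internal-build", ["build-1", "build-2"]),
    Stage("batch", ["batch-1"]),
    Stage("production", ["api-1", "api-2"]),
]
completed, failed = staged_rollout(
    stages,
    apply=patched.add,
    undo=patched.discard,
    healthy=lambda host: not (host in patched and host in flaky),
)
print(completed, failed)
```

In this sketch the rollout stops at the first unhealthy stage and never reaches production, which is the point of ordering stages by criticality. A real system would also let each stage bake for a while before the health check, since intermittent failures may not show up immediately.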

Starting point is 00:54:56 And I think it was just interesting how many issues we had. There was a time where we had a few issues where, because of the critical nature of a service mesh, right, we were essentially bringing down large parts of production with our issues. The lesson there that we learned was that having a lot more diagnosability baked into any critical

Starting point is 00:55:26 infrastructure is really the key. That was our big learning. And then also, I think another big learning was making sure that knowledge of a new system is really, really spread out early on. Because what happened in that particular instance was that a very small key team was in charge of the migration and had been working on it for a long time, but then as it went into production, they essentially became the go-to people for on-call. So when these incidents started happening, they were the ones who were constantly dealing with the incidents. So what we had to then do was essentially say, okay, everyone, we're just going to stop all the work, go fix the reliability issues, and then continue. Makes sense.

Starting point is 00:56:19 It's so important to recognize that we're making the trade-off where you know, okay, it's time to pause, go back, and fix the issues that are affecting site uptime. It's a question of site uptime at this point. So let's go and fix them and then move forward. So Render is still in very early stages. Have you seen any migrations yet, or is it too early for that? Not really migrations per se. But yeah, we've been, I mean, I think if you're smaller, it's almost more fun, because everything is so much, you know,

Starting point is 00:56:55 new and fun, and also can be broken more easily, because it's just a different stage. And usually one question that we like asking everyone is: what was a recent tool that you discovered and really liked? I think one of the joys I've had of starting at a startup is that because we are so small, we really get to experiment with a lot of tools that are just, you know, new, because our requirements

Starting point is 00:57:35 for scale, and just overall, are so small, right? So I would say one new one that I have liked is this tool called Linear, which I'm very new to. Think of it as issue tracking software. And it's a really interesting problem, right, and a notoriously hard problem, especially for someone like me who's lived through, like, the original bug tracking software that I worked with was Bugzilla. So my expectations are very low. And also it's a very hard problem to solve at scale. So I think Linear is actually very refreshing in the way they're approaching this problem.

Starting point is 00:58:18 And I just didn't expect to actually like using something like issue tracking software, but they've made it really, really nice and very cool. That's pretty neat. I'll need to check that out. That's the first time I've ever heard someone say they like their bug tracking software, because for whatever reason, engineers, managers, everyone, they just don't like Bugzilla, Jira, or whatever the new software is that they're using.

Starting point is 00:58:44 But we've got to check it out. Yep, definitely. Is there anything else you would like to share with our listeners today? I think maybe, you know, since we touched so much on migrations, my big thing is that there's just this theme around things where people hate the thing. Like, you know, bug tracking software is actually a good example, right?

Starting point is 00:59:10 Bug tracking software, migrations, meetings, right? People hate all of those things. And one of the things that I would maybe leave the listeners with is: it's likely not the thing that you hate. It's the way the thing is being done, right? If you hate meetings, it's the way the meetings are being run. If you hate your issue tracking software, in that case, yes, it is the issue tracking software, but it's also the way it's being used. If you hate the migration, it's the way it's being done, not the fact that, you know, it's going to unlock some new capability. Often in these things, the answer is to step back and ask: why

Starting point is 00:59:46 do you hate the thing? Everyone hates incidents, but they shouldn't be a source of misery, again, right, for everyone. And to the theme of this podcast, right: why is something a misadventure versus an actual adventure? It's always in the how of how you do it. Yeah, yeah. And I think a lot of these things are inherently just hard, like you stated, hard problems. So depending on how much effort is being put into them, they can get executed not as well as other things that are probably easier in that regard. Well, one thing which I would just say for incidents is that incidents are actually unintentional investments in learning.

Starting point is 01:00:31 It's like, you didn't plan for that, but there is a lot of learning that comes out of it. Agreed. And then maybe this might be self-serving, but Render itself is a pretty delightful software to use. So maybe I can end with that plug. Oh yeah, for sure. We'll definitely link to Render in our show notes, and

Starting point is 01:00:52 we encourage our listeners to check it out. Thank you so much again for coming on the show. It was really enjoyable to speak with you about migrations and your experiences. Thank you for having me. This was a really great talk. Thank you so much for your time, Uma. Really appreciate it.

Starting point is 01:01:09 Thank you. Hey, thank you so much for listening to the show. You can subscribe wherever you get your podcasts and learn more about us at softwaremisadventures.com. You can also write to us at hello at softwaremisadventures.com. We would love to hear from you. Until next time, take care.
