The Infra Pod - What is Platform Engineering? Chat with Ian Nowland

Episode Date: February 24, 2025

In this episode of the Infrapod, Tim and Ian sat down with Ian Nowland (ex-SVP of Datadog, co-author of Platform Engineering book), a platform engineering expert and co-founder of Junction Labs. They dive deep into the nuances of platform engineering, discussing the evolution of roles like sysadmin, DevOps, and SRE, and the current state and future of platform engineering. Ian shares his journey from Amazon to Datadog and talks about his new company, Junction Labs, which aims to simplify microservices networking. Tim and Ian explore the challenges and solutions in platform engineering, and Ian provides insights on building effective, user-friendly platforms.

00:00 Introduction and Welcome
00:21 Ian Nowland's Background and Journey
01:49 Understanding Platform Engineering
04:03 Challenges in Platform Engineering
19:56 Effective Strategies and Common Pitfalls
25:02 Insights on Buying Software for Platforms
39:48 Conclusion

Transcript
Starting point is 00:00:00 Welcome to the InfraPod. This is Tim from Essence VC. Ian, let's go. Oh, this is going to be a fun one. I can't wait to talk about platform and this new company called Junction Labs with our great guest, Ian Nowland, also a Commonwealth brother of mine. Now we have two Ians in the podcast. It's incredible.
Starting point is 00:00:22 Ian, tell us a little about yourself and why you decided to write a book on platform engineering and start a new company called Junction Labs. Yeah, so I'm a fellow Commonwealthian from Australia. I moved to the US in 2006 to Amazon, which actually does tie to sort of platform engineering. Amazon had got big very early in building a web platform. And so they sort of had these platform teams and were doing what you would now call platform engineering in 2006. And that was just like, you know, that was like water to me. That was just what Amazon did. So I sort of accepted it as real. I eventually found my way to AWS, which is of course much more building the infrastructure, so sort of away from the normal
Starting point is 00:00:57 developer ecosystem. Ended up leaving Amazon in 2016, moving to New York. I had a couple of jobs at a FinTech and then at Datadog, where I was managing platform teams. So the interesting thing to me, sort of seeing that sort of arc was what Amazon was doing in 2006, in many ways the industry wasn't able to replicate. Each company didn't have as many engineers as Amazon had.
Starting point is 00:01:20 And so I stayed at Datadog until 2023. And then my friend at the FinTech, Camille Fournier, she asked me to write a book on platform engineering. Both of us were skeptics about what the movement was becoming. We wanted to head that off and just say that this is just good engineering, good management, and that was the book. So I did that, it took longer than we'd like. I think I finished in about March this year.
Starting point is 00:01:43 Now I'm heads down, I'm co-founding a startup, which is very much in the platform engineering space called Junction Labs. So what is platform engineering? What's your definition of it? Yeah, so there's what I would like it to be versus maybe what the industry is making it into. And it's funny, Camille, my co-author, like in 2020 or so, she was like, is this just the SREs rebranding themselves? She tweeted that. And maybe if we'd written a book earlier, we sort of could have headed that off.
Starting point is 00:02:09 There is now a role called platform engineering, which as far as I can tell is completely indistinguishable from what was once sysadmin, became DevOps engineer, became SRE. And that's really not what we want the roles to be. We do not think you get good platforms by cobbling together open source and
Starting point is 00:02:27 vendor tools with configuration. The main thesis of the book, we use the definition of Evan Bottcher, but platforms are internally developed things. They focus on internal customers. They basically are leveraged by product teams to move faster. To us, that's all that platforms are. The key thing that is very key to our definition
Starting point is 00:02:47 and is very, very key to the book is that there is a substantial amount of in-house software development to customize the platform to the needs of the company, whether they be historic needs, technologies you used eight years ago and you just can't get rid of, or whether they just be the business need.
Starting point is 00:03:04 You have specific needs as a business that other businesses don't have, and a generic platform, whether it be vendor or open source, isn't going to meet your needs perfectly. We talk about the four different factors, but most of them are just obvious with the word platform. They're broad, they're targeting a large range of users,
Starting point is 00:03:20 they're operated, they need to run well. The big thing that I focus on is that they're developed. What a lot of the industry is focused on is this idea of platform as a product. Camille, my co-author, is a strong believer in that. I think that's important, particularly versus a pure infra mentality, which is just vending crap
Starting point is 00:03:36 without really thinking about what your users want. But I'd say the software development stuff is far more important in our mind, that this is software people working with systemsy people but creating platforms that are for software people. To me, that's maybe the biggest disagreement in the industry today. But I think that's always
Starting point is 00:03:55 been the disagreement actually in the industry. Why do we need a DevOps engineer? Why isn't everyone a software engineer? That's the classic question. I'm curious, what's your classic answer for why that is, right? I think before we started recording, I mentioned, you know, I've been having a lot of chats with platform folks, and it's interesting
Starting point is 00:04:11 because a lot of the time, when you look under the hood of a platform organization, it's like an embedded DevOps model where there's like an embedded person that owns DevOps and they report back to like a centralized team and it's kind of starting to become a platform. It's interesting, right? So help us understand like your view of how a platform works.
Starting point is 00:04:29 I think it would be really useful to help understand what you mean by platform. So in the book we talk about if you went back to the 90s or the 80s, there were always these two roles, systems administrator versus software developer becoming software engineer. You have to sort of dance around this because it's sort of characterizing people by personalities. But throughout my time in the industry, there's always been these two types of personalities. And you know, people exist on the spectrum, but you generally see these two types
Starting point is 00:04:52 of personalities dominate. So there's the people who, you know, the fact that computer systems, you know, there's networks and operating systems, and they're very, very complicated. There's people who love that level of detail, gravitate towards the detail. And like the funnest part of their job is understanding of that detail and succeeding despite the detail. You also find by the way
Starting point is 00:05:11 those people aren't the greatest coders in the world because the amount of detail that they keep in their head sort of gets in the way of writing lots of code. Then there's the classic software developer where like, you know, ideally they would be just completely ignoring any, there's no computer, there's no network, they're just writing what used to be called business logic. And what they love is that flow of software development. And to them, detail gets in the way.
Starting point is 00:05:35 And so there's always been this tendency to want to separate that out. And so you have the systems engineers, or the SREs, whatever you want to call them, doing the platforms, whatever you want to call them, maybe that's infrastructure you're calling them, and the software engineers layering on top. The key thing is between open source and just the cloud, those things have really squashed together.
Starting point is 00:05:53 And so when we talk about platform engineering, what we say is you can't divide, like there's two types of people, I think they are literally two types of people, but they need to be on the same team cooperating on platforms. So you get both the detail freaks, you get the value of their expertise, but the people who write code do as well.
Starting point is 00:06:09 So that's when we say platform, it's both of those coming together as opposed to 15 years ago, if you're a Microsoft shop, you just bought the .NET SDK and you just ran with that. Maybe if you're Apache and in the Linux space, you'd really separate the role. For us, platforms with the last 25 years of innovation, you need both those teams together building in-house software.
Starting point is 00:06:33 The classic things that they're building are the classic things we all know. Developer tools, compute tools, integrating different storage systems. It's not rocket science what they're doing, but platform is a way of doing it. Amazing. If you go look under the hood of a company, it's sometimes interesting like platform bingo card. Like what are traditionally the things we would consider like a platform?
Starting point is 00:06:54 Would you consider things like user and identity for a product a part of a platform? Or would you only focus on things like IAM developer access? Like I'm kind of curious, how do you divide the lines? I think looking at different companies, those things end up in different shapes and formulations. Sometimes security teams are a part of the platform, sometimes they're not. You know, and then I think there's also this discussion of when we use, talk about platform,
Starting point is 00:07:16 is that platform formulation that, okay, we're going to, like platform is really owning the path to production, the SDLC and the lifecycle for developers, or does it mean more than that? Or is it less than that? And it's interesting because we actually have a future guest, Wayne Duso, who used to run data products at AWS, who has a very interesting take on this. So I'm kind of curious to get your perspective
Starting point is 00:07:34 having run like a relatively large organization at Datadog. So, you know, you can spend all day thinking about definitions. And you often find, by the way, you know, by the end of my time at Datadog, I was managing a very, very large org. So you're mentoring managers who remind you of yourself earlier in your career. And there is this tendency, particularly for like mid-level managers, to like overthink the meaning of words.
Starting point is 00:07:56 And you know, what is it? They're all trying to do reverse Conway's law where they try. I won't say it's completely ill-intentioned, like it's just careerism, but it's clear that like they're just seeing the world from their own best interests, and their own best interests of either their career or the type of system that they want to build. So that's why I say upfront that a lot of this just comes down to, you can overthink it and not get it. Say within Datadog, I had an infrastructure organization,
Starting point is 00:08:19 and what I think I called an application infrastructure organization. Infrastructure was all the classic stuff that I think everyone would agree is platform engineering. So you had compute platforms, data platforms, maybe SRE platform was in there, and then the one that's on the edge actually at the moment, but I think it's clearly platform, is developer tooling. So whether you call it developer experience or whatever,
Starting point is 00:08:38 I think it's very, very bad if you start to try to separate them from the platform. You end up with developer experience on one side and platform on the other. So I think everyone agrees that's platform engineering. If you look at the book, 80 percent of the stories we talk about, just given our experience, are those types of things. What we say though, is that there's also application infrastructure.
Starting point is 00:08:55 So at Datadog, that was like the revenue platform, the auth platform, the front-end website platform, the backend for front-end, I guess is the term for it. If you look at actually the lessons of managing platform engineering teams, like the fact that you have many stakeholders, 80 percent of it applies. The big difference for them is they always have
Starting point is 00:09:14 this bit that does poke up and external customers see it, that creates a whole bunch of new problems for them as well. But otherwise, like a lot of their focus that needs to be on, hey, you need to think about developer tooling, you need to think about making your platform observable, you need to think about making the platform self-service. It completely carries across. So in our definition, and this is partly,
Starting point is 00:09:33 we wanted to have a broad tent for the book. We think of them all as platform engineering. I would say if you look at what the industry is doing now, the application platforms aren't quite running to the term the way that the infrastructure platforms are. They're very, very happy to call themselves platform engineers at this point. Whereas the ones in the middle, I think, they're not infra people. They often have fewer of those systems engineers.
Starting point is 00:09:54 And so they don't run to the term quite the same way. I don't think we're able to dive into all the details of the platform engineering nuances. Just going through the book, there's so many levels of things to consider, but I'm very curious because coming from my background, you know, working as an engineer for a lot of different places too. I think this idea of having a centralized team for helping infra or helping on a platform side is nothing new. It seems like every company has one, right? Almost like the same way, like every company used to have operations,
Starting point is 00:10:27 or what we call DevOps. And we keep changing the names, but what really stuck, I feel like, was more the nuances of what the last gen of DevOps feels like. And then we have SREs, we use Google as the golden child and sort of things kind of snowball. Platform engineering, I think is interesting because I don't think there's like a lot of descriptions of what the last gen platform engineering feels like and what the new gen platform engineering feels like.
Starting point is 00:10:53 I don't know if that makes sense. Like this topic is nothing new. But why do you feel like there is a need to talk about platform engineering now? Is there like a version of platform engineering that we should be able to like aspire to? Is it the technology, the type of team, the type of work they're doing? How are you able to almost like tell folks
Starting point is 00:11:15 that have been working in industry for some time and know what a centralized team means, like, hey, this is what the new platform engineering look and feel might be like? In some ways, I feel like the book is sort of back to 2010 in that I feel like sort of two things happened. The DevOps people were very well intentioned, but they could never quite define themselves.
Starting point is 00:11:38 And I think sort of the cloud came along with all these promises of, oh, you're not going to need ops teams anymore. Adrian Cockcroft got into this big thing because he said, at Netflix, we do NoOps. And pissed off like half the industry because it was seen as saying basically you don't need operations at all. But what he was actually saying was build really good platforms.
Starting point is 00:11:55 And what I saw sort of happen with the Cloud and just with open source coming, was just the idea that you could just have one or two DevOps on every team and number one, you could find enough of those people. Number two, that they'll actually stay happy. Number three, the outcomes would be good on the other side of it. I think that was a big mistake by the industry. In the book, we use this term glue.
Starting point is 00:12:14 It's nothing fancy, but you can think of glue as what happens when you ask every product team at the company to write their own YAML, interface with Kube directly. Ask 100 teams to interface with Kube directly. It's great when you're a 20-person startup, it's horrible when you're a 1,000-engineer company. So, platform engineering number one is just saying, look, if you take your people who are good at
Starting point is 00:12:35 that DevOps-y stuff but stick them on every individual team, you're not actually getting any leverage and actually you're getting much worse than that. You're getting stuff that's going to be very, very hard to change later. The other thing I think the industry really struggled with, and I don't think anyone in the movement was ill-intentioned. I think SRE was a horrible thing for the industry because they talked about a
Starting point is 00:12:54 model that worked only at the massive scale of Google, where for every system you could have an SRE team and the software team. And it sort of went out to the world with that. And what you got was a lot of people who wanted to just nerd out on the reliability aspects and like use these, oh, we're going to take the pager and we're going to hold it over your head that we could always hand back the pager if you don't do what I want. And it just was a horrible way of taking those people and actually building
Starting point is 00:13:18 good relationships with them. When I think about what platform engineering is, it's like, look, it's almost the good side of DevOps, right? Like the fact that it is more of a culture, it is more collaborative, but also a platform isn't just work, right? A platform is a thing that exists, it is built, it is operated.
Starting point is 00:13:34 I think that is the key aspect where it's- so we don't say this outright in the book, but this is a podcast, so I can say it. In some ways, we're trying to move DevOps finally to its mature model. That's what it always promised, but it sort of got stuck in this embedded per team model throughout the industry, which works great at small scale. That's why everyone starts it, but it completely makes messes once you scale up.
Starting point is 00:13:57 Yeah, it's so fascinating. I think that's also the idea of DevOps to a mature state or the sort of ideal platform engineering state. Because in the book you define like measurements, right? Almost like cloud native measurements, or it has to be trusted and loved. Anyway, I really- I wouldn't say by the way that those are measurements. I hate the term measurement for things that I don't believe can be measured.
Starting point is 00:14:21 But they are absolutely things that you should work out about your system. Sorry to interrupt. Yeah, it's fascinating because you do have it in the book. And to me, like when I'm looking at it, I was like, wow, it's not that common to find platform engineering that is trusted and loved. Oh yeah, completely. I see platform teams everywhere. Every single company has a platform team or by definition, right?
Starting point is 00:14:45 But then, are they really being used? Their products are like kind of only half adopted. Many people just don't trust what they do. Like this is like such a weird state of where most platform engineering teams are. And they of course aspire to be the centralized platform team. And a centralized platform is either a joke
Starting point is 00:15:04 or almost like an experiment. What's holding platform engineering as it is today back from getting to that mature state? Is it the talent of the team? Is it the culture, the mindset, is it just everything? Is there like, maybe what's the most common missing point that you see existing companies run into right now? So I think there's sort of two answers that matter a lot.
Starting point is 00:15:28 So Datadog was growing 40 to 50% year over year, and even there the platform teams really, really struggled. That's very, very different say to parts of the industry now where engineering is growing at 1 to 2%. I think if you talk to anyone at that 1 to 2% about why their platforms are struggling, they'll say 10 to 20 years of tech debt, not enough headcount.
Starting point is 00:15:51 That's just a very, very tough management problem. Those are failures, they have to work hard to solve it. To me, that's not the interesting part. The interesting thing is how many fast growing companies have the exact problem that you described. They're growing, they can hire into their platform team,
Starting point is 00:16:05 and yet the platform team is still sort of hated by the rest of them. And it clearly happened at Datadog, by the way. And it happened at the FinTech, Two Sigma. So I spent a chapter on this term re-architecture, which is a term which I won't say I fully invented, but wasn't that out there. But really it was just this idea that I hate second system syndrome. I find so many platform engineering teams
Starting point is 00:16:25 at both of those companies actually Datadog and Two Sigma. Bazel was this classic, oh Bazel is gonna solve all the problems of the build systems of the past. And you'd see these teams of 10 to 20 people spend years on something that was still not even like 10 to 20% of the total usage, right? Like everyone in the Go ecosystem
Starting point is 00:16:44 would just keep using what they're doing. And it had this story, oh, you know, build it and they will come. Like that's another classic thing that I hate with platform. Oh, we'll build it and they will come. And so I think when we, when people talk about product management in platform engineering,
Starting point is 00:16:56 what they mean is that you just be far less arrogant about what your users want versus what your ideal architecture is. And you'd be far more iterative about working within the systems that are appreciated today, even though they're completely ugly. Even though the engineer in all of us would say, this was a mistake and we should start from scratch.
Starting point is 00:17:14 As a startup, that is not product market fit. Yet these platform teams, I think, in fast-growing companies, that's been their big Achilles heel, is that they latch onto the new technology of the day. Oh, this worked at Facebook or this. I saw Dropbox put a layer in front of the SQL database and force everyone to use it. Of course, we all know Dropbox is
Starting point is 00:17:33 a great font of product innovation. I say this one because this one came up at Datadog. So yeah, I think a lot of the success in platform engineering is an incrementalist approach, as opposed to believing migrating to a single big technology is what's going to save the company. I think that's the biggest flaw in the growing companies throughout the industry about why platforms have really failed.
Starting point is 00:17:55 It's just that migration is hard. We all know that the last 20%, the tail, takes 80% of the time. I think that's been a lot of the problem. And so a lot of the book in many ways is just like, be more humble. Go slower, be more iterative. That's actually the way you serve the company's needs. Might not give you the fanciest resume, but it definitely serves the company's needs.
Starting point is 00:18:16 Yeah, I think that resonates for me. Like I often think a lot of platform engineers have like excitement driven development, you know, in the sense that like, oh, I'm really excited about this new methodology, this new tool that drives a lot of the roadmap. I mean, my own experience, the thing I recognized in what you just said,
Starting point is 00:18:30 and having built and managed platform tools myself, is a broad idea around user empathy. It's like, at the end of the day, the platform isn't the thing that drives the revenue. It's the software, the product that's built on top. So you have these engineers, where typically they'll go and they're like, I'm going to go teach these software teams that are building revenue producing applications how to do it.
Starting point is 00:18:52 And it's just like the actual opposite of the right motion. The right motion, from my experience, has always been like, okay, you're actually a janitor. You're the janitor and you're making all these other people, your goal is to create focus for the company. Your goal is to create return on investment. I'm kind of curious, what do you think are good points of leverage for a platform engineering team?
Starting point is 00:19:11 Obviously, the Bazel build example you just gave, basically what you're saying is, that was not a useful investment. The horizon on which it was useful completely exceeded the value of it. Especially company growth is unpredictable. Yes, I could believe there is this perfect multi-language build system that handles
Starting point is 00:19:28 per-language dependencies fantastically. Why the hell are we trying to build this in our company? That to me is often the struggle of these things. Yeah. What are common points of leverage and investment you see that actually work versus don't work? One would be a build system as an example of what maybe isn't a good investment. From your experience, I'm sure you have some areas like,
Starting point is 00:19:47 okay, I come in, I land first, look at a platform or I'm like, you're going to build one, here are the places where I would spend money and these are the things I wouldn't tackle. Do you have generalized thoughts on what that looks like for a platform? So at Datadog, I had both successes and failures. My calls were part of both of those. I think that the key successes had two aspects. So generally, if you want to do a new initiative,
Starting point is 00:20:07 it is find the product that you know is going to be most exciting at the executive layer. Forget about building the platform perfectly. Just build what they want. In some messy partnership sometimes, because they want to build it themselves. Build what they want, knowing that you're going to spend the next two to three years on cleanup afterwards, but knowing that if you succeed, number one, no one's going to kill that product. So you've sort of got this beachhead already. Number two, I think engineers just in general, like we're not the most attuned to social things. Like you have massive social proof. You have just greatly enabled
Starting point is 00:20:43 the business. You have goodwill from that, that will help you get the next two to three internal customers and get them on board. So I think the key first one is just trying as much as possible to align your new initiatives to something that is big in the business. Now, there are counterexamples to that, right? There's plenty of people who use AI
Starting point is 00:21:01 as their way of pushing Kube. Again, you have to be a little bit humble. Is Kube really the thing that should be running your LLM workload? I don't know if there's any people who tell you it should. But I still think the best success is they just got, I use the snowball metaphor a lot at Datadog. You just got these snowballs. Once you got that thing started,
Starting point is 00:21:20 then you could get more momentum and you could use past successes to argue for a bit more compromise from the next thing. So true business value, like not happy internal users, true business value, I think, is the first one. The second one was really just like what we call our DRE team, which was like Cassandra, Postgres, Kafka. And that team was just like being managed to the book of DRE. But just given the importance of those technologies to Datadog, it was just a complete burnout shop. They kept imagining this layer of abstraction, this service layer that they could
Starting point is 00:21:51 build that would make their ops manageable. I gave it two years and you could just say this, it's just repeating, it was second system syndrome or whatever. But that completely failed. What really made that team a lot more functional was just doing far more actually client type stuff. Rather than focus on the stuff that you fully control, do a bit more development just on the client side. What is it with Postgres, the connection thing?
Starting point is 00:22:18 PGPool. PGBouncer, right? Yeah. Do some work on PGBouncer, right? Yeah, it's messy, but that's going to greatly improve. So I think the second thing is, if you see the platform just as almost like software as a service and you don't know anything about,
Starting point is 00:22:35 I think that's a big mistake. I think a lot of the value comes from focusing actually on the internal customer's code bases. The two types of successes generally had one of those two things as a big part of their success. The biggest failures were the ones who approached it almost like waterfall. We're going to understand requirements across multiple product lines.
Starting point is 00:22:53 We're going to prioritize across them, and we're going to iterate from this big design. They just took too long to show value, and then they had this migration problem. I signed off on these approaches. They did not work. I'm curious, is there a differentiation here between, in terms of app versus infra, right? Like SDLC versus reusable components
Starting point is 00:23:14 that result in product lines. Like, I'm kind of curious, is there any difference between that? Or do you think those successes are true across just the whole portfolio? I think the successes are true across the portfolio. So even true infra, right, like the classic accusation, you shouldn't be building platform for platform's sake,
Starting point is 00:23:29 Bazel, but other stuff. I look at my, yeah, the team that's on Kubernetes, they were falling in love with Cilium early, they were falling in love with Istio early. So they attached Cilium to going into GovCloud. And they could have used different technologies, they could have just used an AWS native technology, but they believed Cilium was a better long-term play.
Starting point is 00:23:49 And so no one really questioned that. And that gave them the confidence in using Cilium to roll out. So that's on the pure infrastructure side. I think on the app infrastructure side, I think it's like a revenue engineering team, right? You want to move people away from using Datadog metrics as a way for capturing revenue
Starting point is 00:24:03 towards using Kafka as a way of, like, they attached that to, say, a big new initiative and that worked well. The implementations look different, but I think the ideas remain the same. Like, in some ways I'm just saying it's a truism, right? Like, you know, executives only care about top line value,
Starting point is 00:24:15 right? And so you have to find your way to attach the right projects to that. Yeah, and second system syndrome is not valuable to an executive at the end of the day, because it doesn't move a business metric. Yeah, it's funny coming from AWS where I found the leadership there as an infrastructure business did have five-year time horizons.
Starting point is 00:24:32 I really respect Andy Jassy, Charlie Bill Petersen for that. But most of the industry just cannot invest on that time horizon. And so second system just sounds like a boondoggle that's never going to deliver any value. And I have a question. It was a good segue. What's your advice for people looking to sell into platform engineering, like vendors? Because I think a lot of vendors, like you mentioned Cilium as an example, the eBPF stuff, are trying to sell to platform engineering organizations, and obviously you've been a buyer. So what's worked and what hasn't from your perspective in terms of where you found success buying software
Starting point is 00:25:08 from vendors and where you haven't? It's a great question, particularly to ask a co-founder of a startup. I will say, one of the challenges at Datadog was the founders were sort of cheap, which is like all founders. But they particularly had a reason in that Datadog, because it was so close to this space, really didn't want to invest in someone who could become a competitor. It was like, okay, this could be a part of Datadog someday. And so I'd say if you're an infra tool
Starting point is 00:25:34 vendor, don't try too hard to sell to Datadog. It's a hard company to sell to. Two Sigma, which I was at though, was easier. They're classic finance, you know, to them it's just money versus human time, right? And humans in New York are expensive. At Two Sigma, I think the best success was the usual: what is the immediate problem that has a champion who has the executive buy-in to get the budget? Even if in the longer term, like if I look at say, this is public so I could say, Two Sigma was a big early buyer of Mesos actually, really, really believed in it and it just got
Starting point is 00:26:07 crushed by the Kubernetes wave. But why didn't Mesos succeed at Two Sigma? Because there's lots of doubters within Two Sigma who were like, oh, at the time it was like, oh, OpenStack, OpenStack is the future. Not this, or Mesos wasn't even containers yet, it was just execution. But what worked really well for Mesos was,
Starting point is 00:26:24 it was actually one of the modeling engineering teams who just had this need for scale-ups, get on compute. And so it was really finding that champion within New York. And once that succeeded, again, Mesos is a tough example because Kubernetes sort of came along and crushed it. But the team who actually built platforms on top of Mesos, those platforms still exist. They just migrated to Kubernetes eventually. So yeah, I do think it's the usual stuff for startups.
Starting point is 00:26:47 It's like, don't presume actually that the platform team is your ideal buyer. It's that person with the burning problem who the platform team might not be solving and you're trying to sell it in a way that can eventually migrate to the platform team. I'd say that was the biggest success at Two Sigma. They know you could sell the same thing,
Starting point is 00:27:03 but they would just use open source and it would fall out of the same pile. Yeah, I was part of the early journey of getting Two Sigma to use Mesos, by the way. I am aware. As an early employee of Mesos back then, so always fun memories. So I want to jump into, actually, this is all relevant
Starting point is 00:27:19 because we're going to jump to talk about your startup very briefly here. Obviously, writing that book, you know, you've been in roles that have seen so many platform initiatives and teams and efforts, and now you've jumped out to start a company that is pretty much almost like building a product
Starting point is 00:27:35 towards that team, I would argue, right? And as you know, like, you've probably seen, it's really been hard to have one single tool that all platform engineers and all companies adopt. It takes a lot of different nuances to do it. So tell us, what is Junction Labs? Why did you start to do this?
Starting point is 00:27:50 And what is the approach here that you think has a differentiation here? So Junction Labs actually came from, actually both Two Sigma, but also Datadog. So both those companies were very early on the Kubernetes train and very early then on multi-cluster Kubernetes. Just finding, I guess, two things. Number one, Kubernetes networking for multi-cluster is pretty difficult. Suddenly, all the orchestration stuff and
Starting point is 00:28:15 service you were doing in one cluster completely fails when you start talking about multiple clusters. Then the second thing, and this is more political, but I think it's like service mesh was terrible for such a long time, right? Istio was overly complicated. Linkerd was underdone. I think Linkerd is getting there.
Starting point is 00:28:31 I think Istio is sort of doing its big pivot towards ambient. So the industry was sort of saying, you've got this heavyweight Kubernetes problem, and these heavyweight service meshes are the way of solving all the problems of Kubernetes. And just the engineer in me was just like outraged. It's just like this just seems like layering crap upon more crap. So Junction Labs in many ways, like knowing there is no one
Starting point is 00:28:53 true monolithic technology that's going to work. So we sort of narrowed down on service discovery as the place where a lot of these networking features can plug in. So basically, what used to be just resolving a hostname to an IP, you can imagine as resolving a URL to a set of rules. Can we just plug in there and fix a whole bunch of networking use cases within Kubernetes?
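The idea of resolving a URL to a set of rules, rather than a hostname to an IP, can be illustrated with a minimal sketch. This is not Junction's actual API: `Rule`, `Resolver`, the field names, and the hostnames are all hypothetical, invented here only to show the shape of the idea, with longest-prefix matching assumed as the selection policy.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Rule:
    """Hypothetical routing rule for requests under a path prefix."""
    path_prefix: str
    backends: tuple        # names of the services that should receive the traffic
    retries: int = 0       # illustrative per-route policy
    timeout_s: float = 1.0


@dataclass
class Resolver:
    """Maps a full URL to a rule, instead of mapping a hostname to an IP."""
    rules_by_host: dict = field(default_factory=dict)

    def add(self, host, rule):
        self.rules_by_host.setdefault(host, []).append(rule)

    def resolve(self, url):
        parts = urlsplit(url)
        path = parts.path or "/"
        # Keep only the rules for this host whose prefix matches the path.
        candidates = [
            r for r in self.rules_by_host.get(parts.hostname, [])
            if path.startswith(r.path_prefix)
        ]
        if not candidates:
            return None
        # Assumed policy: the most specific (longest) prefix wins.
        return max(candidates, key=lambda r: len(r.path_prefix))


# Usage with made-up services: checkout traffic gets its own backend and retries.
resolver = Resolver()
resolver.add("cart.internal", Rule("/", ("cart-v1",)))
resolver.add("cart.internal", Rule("/checkout", ("cart-v2",), retries=2))
rule = resolver.resolve("http://cart.internal/checkout/pay")
```

Because the lookup happens in the client, per-route behavior such as retries or timeouts could be attached at resolution time without a proxy in the request path, which is the "proxyless" property discussed later in the conversation.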
Starting point is 00:29:15 We can leave the L4 stuff to, like, the Ciliums and the other CNIs. The layer 4 networking, like leave that to the experts, but make application networking much simpler. So really what we're trying to do is fix what you'd call maybe the microservices debuggability problem, not by observability but by building
Starting point is 00:29:36 a much easier system to have microservices configure their communication with each other. If I want to use the industry term, it's proxy-less service mesh. Some people know what that is, but I wouldn't use those terms for anyone who didn't know what it was. I mean, you've made this, to be honest,
Starting point is 00:29:53 what is kind of a bold choice if you look at the history of platform engineering tools. Most platform engineering tools look like an Istio. Don't touch the app, build something around the app. And instead, you're going after this sort of SDK approach, which is where you get your, like, proxyless properties from for service discovery, network discovery, and configuration.
Starting point is 00:30:13 What made you choose an SDK? I think it's number one, you know, you don't want to intermingle. gRPC is great in many ways, but it's forcing this big migration because it intermingles transport, protocol, and service discovery. There's many reasons why many companies will never be able to adopt gRPC at any scale because
Starting point is 00:30:32 of that. Protobufs are fine until you realize that they're using way too much garbage collection in Go. You want FlatBuffers. Well, you've chosen gRPC. Sorry. Part of it is just, I think, thinking a lot about composability. Now, there is this critique that a good technology is not a good startup.
Starting point is 00:30:48 And so we do think that we'll probably find our way, it might be around progressive delivery, might be around testing in production or testing in pre-production, basically running network where we have these products that actually make the technology a lot more useful. But really to me, the only way to do this right, in a composable way, is actually in a library.
Starting point is 00:31:07 Now, the big bet there is, historically you'd go to platform teams and you'd say, hey, this new technology requires a library. And they're like, well, good luck. Because we've got eight different languages and 20 years of legacy and not everyone's going to be able to upgrade on any realistic timeline. A little bit of a bet is between maybe monorepos and also
Starting point is 00:31:30 AI-based tools around refactoring and just security needs around vulnerabilities. It'll be far easier to keep a library up to date than it was 10 years ago. I've heard the critique of how awful it was to get teams to upgrade libraries 10 years ago. So it drove people towards service mesh. Again, like that Log4j,
Starting point is 00:31:46 I still think as an industry, right? We really haven't internalized that Log4j vulnerability, right? That was such a terrible thing. And at some point, if we don't get our shit in line, like governments will force us to get our shit in line. And so a little bit of a bet on it is managing libraries is not quite as horrible as it was.
Starting point is 00:32:03 Across an enterprise, it's not quite as horrible as it was 10 years ago. So we want to jump into our favorite section of the pod called Spicy Future. Spicy Future. And I will be very curious what exactly you want to give out as a spicy hot take here. What is your spicy hot take about the infra world that you believe and most others don't yet? So I struggle with this hot take because usually it's a very negative hot take, and particularly as a startup in this space you want to be like, oh, I'm part of solving the problem.
Starting point is 00:32:36 So to me Kubernetes is a dead end. Maybe it's not that hot take because everyone sort of realizes it but throughout the industry right we're still betting on Kubernetes so heavily because we don't know what comes next. In my mind, Kubernetes is a dead end because it ties you to this cluster model that doesn't scale well. In the meantime, the hyperscalers are completely building their own in-house thing that would never work on-premise. So we're sort of going to this place where we're going to
Starting point is 00:33:01 all these private data centers around GPUs, pretending Kubernetes is the ecosystem that's going to cross them all. But Kubernetes is not the remote cluster world. So I think that to me is the hot take is Kubernetes is an industry dead end that we're all locked into as an industry. So if Kubernetes is a dead end, what do you think the future of compute orchestration looks like and why?
Starting point is 00:33:24 I have a very similar thesis, but I'm actually curious why you think that. It is a dead end because it promises itself to be this multi-cloud technology, sort of lowest common denominator. I should say, if you're using GKE or EKS, you're willing to be a hundred percent on those clouds. I'm not sure about Azure, but I'm sure about those two. You're pretty good. But this mixed sort of model, I think that's happening with GPU where suddenly it's not just the hyperscalers anymore, everyone else is building data centers.
Starting point is 00:33:51 Well, they're never going to be great at building all the internal shit that Google and Amazon has built to actually make Kubernetes go well. We saw the OpenAI outage, like it's just such dumb freaking shit, and yet OpenAI can hire the world's best engineers and they still get hit by the world.
Starting point is 00:34:06 You know, what are the B and C grade companies who were paying like half of what OpenAI pays an engineer? What are they gonna do? So it's clear to me that that operating model cannot persist. I don't think OpenShift, you know, it's good to have Red Hat supporting you, but I don't think OpenShift is a way out.
Starting point is 00:34:21 You have to believe that the future is gonna take certain aspects of Kube and stay compatible with that, like maybe kubectl or something, but like find a way that like administering many, many clusters just isn't as hard for so many companies. It's taking way too many resources today and we're like 5% into the migration
Starting point is 00:34:43 and nothing makes me think it's going to get better. Like it's just inherent to the sort of way Kubernetes is rooted. That does not answer your question though. Like what is it actually like? It's not really Lambda, right? It's clear function as a service has a place at an enterprise, but it's definitely not going to be 100% of workloads. Maybe it's a combination of Lambda, durable execution, you know, Kelsey Hightower a long time ago was like, people should really be running on top. And if you can build these platforms on top
Starting point is 00:35:09 that totally abstract that they're on Kubernetes, maybe that's the sort of the vendor part is like, more things that look like Temporal, more things that look like Restate, less things that look like, you know, YAML around server. Like I'm integrating with Argo Rollouts, literally yesterday.
Starting point is 00:35:24 It's like, it's fine. I just can't imagine any application team wants to touch the 50 lines of configuration they have to get right to just get a rollout to work nicely. So I guess my answer is, it is higher level abstractions that fundamentally, at that point, do not need Kubernetes. They could be running on any computer.
Starting point is 00:35:42 Do you think that high level abstraction is like some new open source project that comes along? Or do you think this is like a vendor API layer? Because I agree with you about everything you just said. I don't know a single application-layer software developer that enjoys or wants to pay attention to anything they write in YAML files. And it's always like not even best effort. It's like worst effort to get back to doing the thing that they actually want to do. I guess this is in some sense, you know, talking my own book in terms of, you know, why I've sort of started Junction Labs. Like I look at Dapr as sort of an application platform.
Starting point is 00:36:13 And in one sense, I think it's really, really well founded. But to the extent that it requires almost a full migration to take advantage of it, it seems a very difficult endeavor for me to imagine most larger companies ever moving significant workloads to it. So I guess what I imagine is, you know, there's maybe Junction Labs and maybe three others like us who have raised the level of abstraction about what it means to, you know, there'd be a scheduling system and a networking system. Uh, you don't break compute into true compute. You don't have this thing called orchestration
Starting point is 00:36:46 coupling compute and networking. I think maybe it's various open source slash vendors that cobble together enough of the Kube API in terms of glue. I think maybe that is what eventually replaces it. It's difficult. You always look like Linux has stuck together for a really long time, but distros have come and gone.
Starting point is 00:37:08 I think there's something about the distro nature that evolution through that is the next stage for Kubernetes. This is such a big topic. I don't think we can even touch on anything that's like not going to take even more like an hour or two. What is your belief around maybe the Junction angle into the sort of like Kube-less world? Like you talk about like the abstraction has to go higher, but you know, your layer isn't fully at the Temporal, Restate level, right? It's not truly at the Kubernetes level. It's somewhere in between.
Starting point is 00:37:42 Where do you see your stuff fit then in this future? Yeah, what Junction wants to get really good at over time is dynamic configuration. So like a lot of applications at the end of the day, and this is, you know, what HashiCorp Consul sort of promised 10 years ago. So what we want to focus on to start with is dynamic configuration, mostly around networking. Eventually, we want to focus around dynamic configuration around different aspects of application behavior. So basically, how do you change things
Starting point is 00:38:10 without requiring a complete redeployment workflow? If the idea is that there are client SDKs, maybe they're doing client-side caching, maybe they're doing a little bit of client-side wasm to do some type of matching. We want to get really good over time at that layer of abstraction to build the broad platform, but then find one or two products that
Starting point is 00:38:33 really make people want to use it. I always just come back to the biggest pain, I think, as people move to services is quality. It's always either, how do I test my stuff with production services without ending up with the staging-is-a-mess problem? Or it's, how do I do safe rollouts? I think those are the places where,
Starting point is 00:38:53 at least at Amazon, at least at Datadog, I saw massive value in terms of a product. Even as the technology, Junction itself, is really just dynamic configuration plugging in with an SDK. Well, I think we have to wrap here because we can easily go on for hours. Where can people find more about you?
Starting point is 00:39:11 And I guess also plug a little bit about your book and Junction. I mean, probably the easiest for me is just the book is probably the best entry point. So it's called Platform Engineering. It's like a primer for leadership. But if you look for Platform Engineering on Amazon, I'm pretty sure it's the number one that comes up at the moment. Junction Labs is just junctionlabs.io. Because the SDK needs to be open source to succeed,
Starting point is 00:39:34 we're sort of building in the open. So it's just two of us at the moment in a room, writing code. It's early days, but you can get a good sense of what we intend to build from some of our blog posts there. So I'd say those are the best two ways. Cool. Well, thank you, Ian, and all the Ians in the room. We're having a ton of fun here. So we definitely need to have you on in some near future. But thanks a ton for coming on our Infrapod. Yeah, thanks for hosting me.
Starting point is 00:39:59 Thanks so much.
