The Infra Pod - What is Platform Engineering? Chat with Ian Nowland
Episode Date: February 24, 2025. In this episode of the Infrapod, Tim and Ian sat down with Ian Nowland (ex-SVP of Datadog, co-author of the Platform Engineering book), a platform engineering expert and co-founder of Junction Labs. They dive deep into the nuances of platform engineering, discussing the evolution of roles like sysadmin, DevOps, and SRE, and the current state and future of platform engineering. Ian shares his journey from Amazon to Datadog and talks about his new company, Junction Labs, which aims to simplify microservices networking. Tim and Ian explore the challenges and solutions in platform engineering, and Ian provides insights on building effective, user-friendly platforms. 00:00 Introduction and Welcome 00:21 Ian Nowland's Background and Journey 01:49 Understanding Platform Engineering 04:03 Challenges in Platform Engineering 19:56 Effective Strategies and Common Pitfalls 25:02 Insights on Buying Software for Platforms 39:48 Conclusion
Transcript
Welcome to the InfraPod.
This is Tim from Essence VC.
Ian, let's go.
Oh, this is going to be a fun one.
I can't wait to talk about platform and this new company called Junction Labs with our
great guest, Ian Nowland, also a Commonwealth brother of mine.
Now we have two Ians in the podcast.
It's incredible.
Ian, tell us a little about yourself and why you decided
to write a book on platform engineering and start a new company called Junction Labs.
Yeah, so I'm a fellow Commonwealthian from Australia. I moved to the US in 2006 to Amazon,
which actually does tie to sort of platform engineering. Amazon had got big very early in
building a web platform. And so it sort of had these platform teams that were doing what you
would now call platform engineering in 2006. And that was just like, you know, that was like water to me.
That was just what Amazon did. So I sort of accepted it as real. I eventually found my way
to AWS, which is of course much more building the infrastructure, so sort of away from the normal
developer ecosystem. Ended up leaving Amazon in 2016, moving to New York. I had a couple of jobs
at a FinTech and then at Datadog,
where I was managing platform teams.
So the interesting thing to me,
sort of seeing that sort of arc was
what Amazon was doing in 2006,
in many ways the industry wasn't able to replicate.
Each company didn't have as many engineers as Amazon had.
And so I stayed at Datadog until 2023.
And then my friend at the FinTech,
Camille Fournier, she asked me to write a book on platform engineering.
Both of us were skeptics about what the movement was becoming.
We wanted to head that off and just say that this is just good engineering,
good management, and that was the book.
So I did that; it took longer than we'd have liked.
I think I finished in about March this year.
Now I'm heads down, I'm co-founding a startup, which is very much in the platform engineering space called Junction Labs.
So what is platform engineering? What's your definition of it?
Yeah, so that's what I would like it to be versus maybe what the industry is making it into.
And it's funny, Camille, my co-author, like in 2020 or so, she was like, is this just the SREs
rebranding themselves?
She tweeted that.
And maybe if we'd written a book earlier,
we sort of could have headed that off.
There is now a role called platform engineering,
which as far as I can tell
is completely indistinguishable
from what was once sysadmin became DevOps engineer,
became SRE engineer.
And that's really not what we want the roles to be.
We do not think you get good platforms together by
cobbling together open source and
vendor tools with configuration.
The main thesis of the book,
we use the definition of Evan Bottcher,
that platforms are internally developed things.
They focus on internal customers.
They basically are leveraged by product teams to move faster.
To us, that's all that platforms are.
The key thing that is very key to our definition
and is very, very key to the book
is that there is a substantial amount
of in-house software development
to customize the platform to the needs of the company,
whether they be historic needs,
technologies you used eight years ago
and you just can't get rid of,
or whether they just be the business need.
You have specific needs as a business that
other businesses don't have,
and a generic platform,
whether it be vendor or open source,
isn't going to meet your needs perfectly.
We talk about the four different factors,
but most of them are just obvious with the word platform.
They're broad, they're targeting a large range of users,
they're operated, they need to run well.
The big thing that I focus on is that they develop.
What a lot of the industry is focused on
is this idea of platform as a product.
Camille, my co-author, is a strong believer in that.
I think that's important,
particularly versus a pure infra mentality,
which is just vending crap
without really thinking about what your users want.
But I'd say the software development stuff
is far more important in our mind,
that this is software people working with systemsy people but creating
platforms that are for software people.
To me, that's maybe the biggest disagreement
in the industry today.
But I think that's always
been the disagreement actually in the industry.
Why do we need a DevOps engineer?
Why isn't everyone a software engineer?
That's the classic question.
I'm curious, what's your classic answer for why that is, right?
I think before we started recording,
I mentioned, you know, I've been having a lot of chats
with platform folks, and it's interesting
because a lot of the time, when you look
into the hood of a platform organization,
it's like an embedded DevOps model
where there's like an embedded person that owns DevOps
and they report back to like a centralized team
and it's kind of starting to become a platform.
It's interesting, right?
So help us understand like your view of how a platform works.
I think it would be really useful to help understand what you mean by platform.
So in the book we talk about if you went back to the 90s or the 80s, there was always these
two roles, systems administrator versus software developer becoming software engineer.
You have to sort of dance around this because it's sort of characterizing people by personalities.
But throughout my time in the industry,
there's always been these two types of personalities.
And you know, people exist on the spectrum,
but you generally see these two types
of personalities dominate.
So there's the people who, you know,
computer systems, networks, and operating systems
are very, very complicated,
and there's people who love that level of detail,
gravitate towards the detail. And the most fun part of their job is understanding
that detail and succeeding despite the detail. You also find, by the way,
those people aren't the greatest coders in the world because the amount
of detail that they keep in their head sort of gets in the way of writing
lots of code. Then there's the classic software developer where, like,
ideally there's no computer, there's no network,
they're just writing what used to be called business logic.
And what they love is that flow of software development.
And to them, detail gets in the way.
And so there's always been this tendency to want to separate that out.
And so you have the systems engineers,
or the SREs, whatever you want to call them,
doing the platforms, whatever you want to call them,
maybe that's infrastructure you're calling them,
and the software engineers layering on top.
The key thing is between open source and just the cloud,
those things have really squashed together.
And so when we talk about platform engineering,
what we say is you can't divide,
like there's two types of people,
I think they are literally two types of people,
but they need to be on the same team cooperating on platforms.
So you get both the detail freaks,
you get the value of their expertise,
but the people who write code do as well.
So that's when we say platform,
it's both of those coming together as opposed to 15 years ago,
if you're a Microsoft shop,
you just bought the .NET SDK and you just ran with that.
Maybe if you were in the Apache and Linux space,
you'd really separate the roles.
For us, platforms with the last 25 years of innovation,
you need both those teams together building in-house software.
The classic things that they're building are
the classic things we all know.
Developer tools, compute tools,
integrating different storage systems.
It's not rocket science what they're doing,
but platform is a way of doing it.
Amazing. If you go look under the hood of a company, it's sometimes like an interesting
platform bingo card. Like what are traditionally the things we would consider like a platform?
Would you consider things like user and identity for a product a part of a platform? Or would you
only focus on things like IAM developer access? Like I'm kind of curious, how do you divide the lines?
I think looking at different companies,
those things end up in different shapes and formulations.
Sometimes security teams are a part of the platform,
sometimes they're not.
You know, and then I think there's also this discussion of
when we use, talk about platform,
is that platform formulation that, okay, we're going to,
like platform is really owning the path to production,
the SDLC and the lifecycle for developers,
or does it mean more than that? Or is it less than that?
And it's interesting because we actually have a future guest,
Wayne Duso, who used to run data products at AWS,
who has a very interesting take on this.
So I'm kind of curious to get your perspective
having run like a relatively large organization at Datadog.
So, you know, you can spend all day
thinking about definitions.
And you often find, by the way, you know,
by the end of my time at Datadog,
I was managing a very, very large org. So you're mentoring managers who
remind you of yourself earlier in your career. And there is this tendency,
particularly for like mid-level managers, to like overthink the meaning of words.
And you know, what is it? They're all trying to do reverse Conway's law.
I won't say it's completely ill-intentioned, like it's just careerism,
but it's clear that like they're just seeing the world from their own best interests,
and their own best interests of either their career
or the type of system that they want to build.
So that's why I say upfront that a lot of this just comes down to,
you can overthink it and not get it.
Say within Datadog, I had an infrastructure organization,
and I think I called it an application infrastructure organization.
Infrastructure was all the classic that I think
everyone would agree is platform engineering.
So you had compute platforms, data platforms,
maybe SRE platform was in there,
and then the one that's on the edge actually at the moment,
but I think it's clearly platform is developer tooling.
So whether you call it developer experience or whatever,
I think it's very, very bad if you
start to try to separate them from the platform.
You end up with developer experience on one side and platform on the other.
So I think everyone agrees that's platform engineering.
If you look at the book,
80 percent of the stories we talk about,
just given our experience, are those types of things.
What we say, though, is that application infrastructure counts too.
So at Datadog, that was like the revenue platform,
the auth platform, the front-end website platform,
the backend for front-end, I guess is the term for it.
If you look at actually the lessons of
managing platform engineering teams,
like the fact that you have many stakeholders,
80 percent of it applies.
The big difference for them is they always have
this bit that does poke up and external customers see it,
that creates a whole bunch of new problems for them as well.
But otherwise, like a lot of their focus that needs to be on,
hey, you need to think about developer tooling,
you need to think about making your platform observable,
you need to think about making the platform self-service.
It completely carries across.
So in our definition, and this is partly,
we wanted to have a broad tent for the book.
We think of them all as platform engineering.
I would say if you look at what the industry is doing now,
the application platforms aren't quite running to
the term the way that the infrastructure platforms are.
The infrastructure ones are very, very happy to call themselves platform engineers at this point.
Whereas the ones in the middle, I think, they're not infra people.
They often have fewer of those systems engineers.
And so they don't run to the term quite the same way.
I don't think we'll be able to dive into all the details of the platform engineering nuances.
Just going through the book, there are so many levels of things to consider, but
I'm very curious because, coming from my background, you know,
working as an engineer at a lot of different places too.
I think this idea of having a centralized team for helping infra or helping on a
platform side is nothing new. It seems like every company has one, right?
Almost the same way, like every company used to have operations,
or what we called DevOps. And we keep changing the names, but what really stuck,
I feel like, was more the nuances of what the last gen of DevOps felt like.
And then we have SREs,
we use Google as the golden child and sort of things kind of snowball.
Platform engineering,
I think is interesting because I don't think
there's like a lot of descriptions of what the last gen platform
engineering feels like and what the new gen platform engineering feels like.
I don't know if that makes sense.
Like this topic is nothing new.
But why do you feel like there is a need to talk about platform engineering now?
Is there like a version of platform engineering
that we should be able to like aspire to?
Is it the technology, the type of team,
the type of work they're doing?
How do you give folks
that have been working in the industry for some time,
who know what a centralized team means, a sense of, like,
hey, this is what the new platform engineering
look and feel might be?
In some ways, I feel like the book is sort of back to 2010
in that I feel like sort of two things happened.
The DevOps people were very well intentioned,
but they could never quite define themselves.
And I think sort of the cloud came along
with all these promises of,
oh, you're not going to need ops teams anymore.
Adrian Cockroft got into this big thing because he said,
Netflix, we do no ops.
And pissed off like half the industry because it was
seen as saying basically you don't need operations at all.
But what he was actually saying was build really good platforms.
And what I saw sort of happen with
the Cloud and just with open source coming,
was just the idea that you could just have one or two DevOps people on
every team, and, number one, that you could find enough of those people.
Number two, that they'll actually stay happy.
Number three, the outcomes would be good on the other side of it.
I think that was a big mistake by the industry.
In the book, we use this term glue.
It's nothing fancy, but you can think of glue as what happens when you
ask every product team at the company to write their own YAML,
interface with Kube directly.
Ask 100 teams to interface with Kube directly.
It's great when you're a 20-person startup,
it's horrible when you're a 1,000-engineer company.
So, platform engineering number one is just saying,
look, if you stick your people who are good at
that DevOps-y stuff but stick them on every individual team,
you're not actually getting any leverage
and actually you're getting much worse than that.
You're getting stuff that's going to be very,
very hard to change later.
The other thing I think the industry really struggled with, and I don't think anyone in the movement was ill-intentioned.
I think SRE was a horrible thing for the industry because they talked about a
model that worked only at the massive scale of Google, where for every system
you could have an SRE team and the software team.
And that model sort of went out to the whole world.
And what you got was a lot of people who wanted to just nerd out on
the reliability aspects and like use these, oh, we're going to take the pager
and we're going to hold it over your head that we could always hand back the pager
if you don't do what I want.
And it just was a horrible way of taking those people and actually building
good relationships with them.
When I think about what platform engineering is, it's like, look, it's
almost the good side of DevOps, right?
Like the fact that it is more of a culture,
it is more collaborative,
but it also a platform isn't just work, right?
A platform is a thing that exists,
it is built, it is operated.
I think that is the key aspect where it's- so
we don't say this outright in the book,
but this is a podcast I could say.
In some ways, we're trying to move
DevOps finally to its mature model.
That's what it had always promised, but it sort of got stuck in this embedded per-team model
throughout the industry, which works great at small scale.
That's why everyone starts it, but it completely makes messes once you scale up.
Yeah, it's so fascinating.
I think that's also the idea of moving DevOps to a mature state, or the sort of ideal platform
engineering state.
Because in the book you define, like, measurements, right?
Almost like cloud-native measurements, or that it has to be trusted and loved.
Anyway, I really-
I wouldn't say by the way that those are measurements.
I hate the term measurement for things that I don't believe can be measured.
But they are absolutely things that you should work out about your system.
Sorry, I already interrupted.
Yeah, it's fascinating because you do have it in the book.
And to me, like when I'm looking at it, I was like, wow, it's not that common to find
platform engineering that is trusted and loved.
Oh yeah, completely.
I see platform teams everywhere.
Every single company has a platform team, almost by definition, right?
But then they're rarely being used.
Their products are like kind of only half adopted.
Many people just don't trust what they do.
Like this is such a weird state that most
platform engineering teams are in.
And they of course aspire to be
the centralized platform team.
And the centralized platform is either a joke
or almost like an experiment.
What's holding platform engineering back,
as it is today, from getting to that mature state?
Is it the talent of the team?
Is it the culture, the mindset, or just everything?
Maybe what's the most common missing piece
you see existing companies run into right now?
So I think there's sort of two answers that matter a lot.
So take Datadog, which was growing 40 to 50% year over year;
even there the platform teams really, really struggled.
That's very, very different, say, to parts of the industry
now that aren't growing, where engineering is growing at one to two percent.
I think if you talk to anyone at that one to 2%
and why their platforms are struggling,
they'll say 10 to 20 years of tech,
there's not enough headcount.
That's just a very, very tough management problem.
Those are failures they have to work hard to solve.
To me, that's not the interesting part.
The interesting thing is how many
fast growing companies have
the exact problem that you described.
They're growing, they can hire into their platform team,
and yet the platform team is still sort of hated by the rest of the company.
And it's clearly happened at Datadog, by the way.
And it happened at the FinTech, Two Sigma.
So I spent a chapter on this term re-architecture,
which is a term which I won't say I fully invented,
but wasn't that out there.
But really it was just this idea that I hate second system syndrome.
I saw it in so many platform engineering teams
at both of those companies actually, Datadog and Two Sigma.
Bazel was this classic,
oh Bazel is gonna solve all the problems
of the build systems of the past.
And you'd see these teams of 10 to 20 people
spend years on something that was still not even like
10 to 20% of the total usage, right?
Like everyone in the Go ecosystem
would just keep using what they're doing.
And it had this story, oh, you know, build it
and they will come.
Like that's another classic thing that I hate with platform.
Oh, we'll build it and they will come.
And so I think when we,
when people talk about product management
in platform engineering,
what they mean is that you just be far less arrogant
about what your users want
versus what your ideal architecture is.
And you'd be far more iterative about
working within the systems that are appreciated today,
even though they're completely ugly.
Even though the engineer in all of us would say,
this was a mistake and we should start from scratch.
As a startup, that is not product market fit.
Yet these platform teams, I think,
in fast-growing companies, that's been their big Achilles heel,
is that they latch onto the new technology of a day.
Oh, this worked at Facebook or this.
I saw Dropbox put a layer in front of
the SQL database and force everyone to use it.
Of course, we all know Dropbox is
a great font of product innovation.
I say this one because this one came up at Datadog.
So yeah, I think a lot of the success in
platform engineering is an incrementalist approach,
as opposed to believing migrating to a single big technology
is what's going to save the company.
I think that's the biggest flaw in the growing companies
throughout the industry about why platforms have really failed.
It's just, migration is too hard.
We all know that the 20% tail takes 80% of the time.
I think that's been a lot of a problem.
And so a lot of the book in many ways is just like, be more humble.
Go slower, be more iterative.
That's actually the way you serve the company's needs.
Might not give you the fanciest resume,
but it definitely serves the company's needs.
Yeah, I think that resonates for me.
Like I often think a lot of platform engineers
have like excitement driven development,
you know, in the sense that like,
oh, I'm really excited about this new methodology,
this new tool that drives a lot of the roadmap.
I mean, my own experience,
the thing I recognized in what you just said,
and having built managed platform tools myself,
is a broad idea around user empathy.
It's like, at the end of the day,
the platform isn't the thing that drives the revenue.
It's the software, the product that's built on top.
So you have these engineers,
where typically they'll go and they're like, I'm going to
go teach these software teams that are building revenue producing applications how to do it.
And it's just like the actual opposite of the right motion.
The right motion, from my experience, has always been like, okay, you're actually a
janitor.
You're the janitor, and you're there to make all these other people better. Your goal is to create focus
for the company.
Your goal is to create return on investment.
I'm kind of curious, what do you think are good points
of leverage for a platform engineering team?
Obviously, the Bazel build example you just gave,
basically what you're saying is,
that was not a useful investment.
The horizon on which it was useful
completely exceeded the value of it.
Especially since company growth is unpredictable.
Yes, I could believe there is
this perfect multi-language build system that handles
per-language dependencies fantastically.
Why the hell are we trying to build this in our company?
That's to me is often the struggle of these things.
Yeah. What are common points of leverage
and investment you see that actually work versus don't work?
A build system would be one example that
maybe isn't a good investment.
From your experience, I'm sure you have some areas like,
okay, I come in, I land first,
look at a platform or I'm like,
you're going to build one, here are the places where I would
spend money and these are the things I wouldn't tackle.
Do you have generalized thoughts on what that looks like for a platform?
So at Datadog, I had both successes and failures.
My calls were part of both of those.
I think that the key successes had two aspects. So generally, if you want to do a new initiative,
it is to find the product that you know the executive layer is going to be most excited about.
Forget about building the platform perfectly.
Just build what they want.
Enter some messy partnerships sometimes, because they want to build it themselves.
Build what they want, knowing that you're setting up the next two to three years of cleanup afterwards,
but knowing that if you succeed, number one, no one's going to kill that product. So you've
sort of got this beachhead already. Number two, I think engineers just in general, like we're not
the most attuned to social things. Like you have massive social proof. You have just greatly enabled
the business. You have goodwill from that,
that will help you get the next two to three
internal customers and get them on board.
So I think the key first one is just trying
as much as possible, align your new initiatives
to something that is big in the business.
Now, there are counters to that, right?
There's plenty of people who use AI
as their way of pushing Kube.
Again, you have to be a little bit humble.
Is Kube really the thing that should be running your LLM workload?
I don't know if there's any people who tell you it should.
But I still think the best success is they just got,
I use the snowball metaphor a lot at Datadog.
You just got these snowballs.
Once you got that thing started,
then you could get more momentum and you could use
past successes to argue for
a bit more compromise from the next thing. So true business value, like not happy internal users, true
business value, I think, is the first one. The second one was really just like what we call
our DRE team, which was like Cassandra Postgres, Kafka. And that team was just like being managed
to the book of DRE. But just given the importance of those technologies to data was just a complete burnout shop.
They kept imagining this layer of abstraction,
this service layer that they could
build that would make their ops manageable.
I gave it two years, and you could just say it,
it was just repeating second system syndrome or whatever.
But that completely failed. What really made that team a lot
more functional was just doing far more actually client type stuff.
Rather than focus on the stuff that you fully control,
do a bit more development just on the client side.
What is it with Postgres, the connection thing?
PGPool.
PGBouncer, right?
Yeah.
Do some work on PGBouncer, right?
Yeah, it's messy, but that's going to greatly improve.
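A minimal sketch of the kind of client-side work being described, in Go: pointing an application's connection pool at PgBouncer rather than straight at Postgres. The hostname, credentials, and pool settings below are illustrative placeholders, not anything from Datadog's actual setup.

```go
// Sketch: route an app's Postgres connections through PgBouncer.
// Hostnames, credentials, and pool sizes are illustrative placeholders.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres driver; PgBouncer speaks the same wire protocol
)

func main() {
	// Point at the bouncer (6432 is its conventional port) instead of Postgres' 5432.
	dsn := "host=pgbouncer.internal port=6432 user=app dbname=orders sslmode=disable"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Keep the client-side pool small and connections short-lived;
	// with transaction pooling, PgBouncer owns the real pool.
	db.SetMaxOpenConns(10)
	db.SetConnMaxLifetime(5 * time.Minute)

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected via PgBouncer")
}
```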
So I think the second thing is,
if you see the platform just as almost like
software as a service, and you don't know anything about your customers' code,
I think that's a big mistake.
I think a lot of the value comes from focusing
actually on the internal customers' code bases.
The two types of successes generally had one of those two things as a big part of their success.
The biggest failures were the ones
who approach it almost like waterfall.
We got understanding requirements
across multiple product lines.
We're going to prioritize across them,
and we're going to iterate from this big design.
They just took too long to show value,
and then they had this migration problem.
I signed off on these approaches. They did not work.
I'm curious, is there a differentiation here
between, in terms of app versus infra, right?
Like SDLC versus reusable components
that result in product lines.
Like, I'm kind of curious,
is there any difference between that?
Or do you think those successes are true
across just the whole portfolio?
I think the successes are true across the portfolio.
So even in true infra, right, there's the classic accusation
that you're building platform for platform's sake,
like Bazel, but there's other stuff.
I look at, yeah, the team that was on Kubernetes,
they were falling in love with Cilium early,
they were falling in love with Istio early.
So they attached Cilium to going into GovCloud.
And they could have used different technologies,
they could have just used an AWS native technology,
but they believed Cilium was a better long-term play.
And so no one really questioned that.
And that gave them the confidence
in using Cilium to roll out.
So that's on the pure infrastructure side.
I think on the app infrastructure side,
I think it's like a revenue engineering team, right?
You want to move people away from using Datadog metrics
as a way of capturing revenue
towards using Kafka.
They attached that to, say, a big new initiative
and that worked well.
The implementations look different,
but I think the ideas remain the same.
Like, in some ways I'm just saying a truism, right?
Like, you know, executives only care about top line value,
right?
And so you have to find your way to attach
the right projects to that.
Yeah, and second system syndrome is not valuable
to an executive at the end of the day,
because it doesn't move a business metric.
Yeah, it's funny coming from AWS where I found the leadership there as an infrastructure
business did have five-year time horizons.
I really respect Andy Jassy, Charlie Bell, Peter DeSantis for that.
But most of the industry just cannot invest on that time horizon.
And so a second system just sounds like a boondoggle that's never going to deliver any value.
And I have a question. It was a good segue. What's your advice for people looking to sell
into platform engineering, like vendors? Because I think a lot of vendors, like you mentioned some
of them, Cilium as an example, the eBPF stuff, like trying to sell to platform engineering
organizations, obviously you've been a buyer. So what's worked and what hasn't from your perspective
in terms of how well you found success buying software
from vendors and where you haven't?
It's a great question, particularly
to ask a co-founder of a startup who will say,
one of the challenges at Datadog was the founders
were sort of cheap, which is like all founders.
But they particularly had a reason, in that Datadog,
because it was so close to this space, really didn't want to invest in someone who could become a competitor. It was like, okay,
this could be a part of Datadog someday. And so I'd say if you're an infra vendor,
don't try too hard to sell to a Datadog; it's a hard company to sell to. The other things that I was at, though,
were easier, the classic finance thing, you know, where it's just money versus human time, right?
And humans in New York are expensive.
At Two Sigma, I think the best success, it was the usual: what is the immediate problem
that has a champion who has the executive buy-in to get the budget?
Even if in the longer term, like if I look at say, this is public so I could say, Two
Sigma was a big early buyer of Mesos actually,
really, really believed in it and it just got
crushed by the Kubernetes wave.
But how did Mesos succeed at Two Sigma?
There were lots of doubters within Two Sigma who were like,
oh, at the time it was like, oh,
OpenStack, OpenStack is the future.
Not this. Or, Mesos wasn't even doing containers yet,
it was just execution.
But what worked really well for Mesos was,
it was actually one of the modeling engineering
teams who just had this need for scale-out compute.
And so it was really finding that champion within New York.
And once that succeeded, again, Mesos is a tough example because Kubernetes sort of came
along and crushed it.
But the team who actually built platforms on top of Mesos, those platforms still exist.
They just migrated to Kubernetes eventually.
So yeah, I do think it's the usual stuff for startups.
It's like, don't presume actually that the platform team
is your ideal buyer.
It's that person with the burning problem
who the platform team might not be solving
and you're trying to sell it in a way
that can eventually migrate to the platform team.
I'd say that was the biggest success at Two Sigma.
Otherwise, you know, you could sell the same thing,
but they would just use open source
and it would fall out of the same pile.
Yeah, I was part of the early journey
of getting Two Sigma to use Mesos, by the way.
I am aware.
As an early employee of Mesos back then,
so always fun memories.
So I want to jump into, actually, this is all relevant
because we're going to jump to talk to your startup
very briefly here.
Obviously, writing that book,
you know, you've been in your role
that has seen so many platform initiatives
and teams and efforts,
and now you've jumped out to start a company
that is pretty much almost like building a product
towards that team, I would argue, right?
And as you know, like, you've probably seen,
it's really been hard to have one single tool
that all platform engineers and all companies adopt.
It takes a lot of different nuances to do it.
So tell us, what is Junction Labs?
Why do you start to do this?
And what is the approach here that you think is differentiated?
So Junction Labs actually came from both Two Sigma, but also Datadog.
So both those companies were very early on the Kubernetes train
and very early on then on multi-cluster Kubernetes.
Just finding, I guess two things.
Number one, Kubernetes networking for
multi-cluster is pretty difficult.
Suddenly, all the orchestration stuff and
service discovery you were doing in one cluster
completely fails when you start talking about multiple clusters.
Then the second thing, and this is more political,
but I think it's like service mesh
was terrible for such a long time, right?
Istio was overly complicated.
Linkerd was underdone.
I think Linkerd is getting there.
I think Istio is sort of doing its big pivot towards Ambient.
So the industry was sort of saying these heavyweight service mesh, you've got this heavyweight
Kubernetes problem, these heavyweight service meshes are the way of solving all the problems
of Kubernetes.
And just the engineering in me was just like outraged.
It's just like this just seems like layering crap upon more crap.
So Junction Labs in many ways,
like knowing there is no one
true monolithic technology that's going to work.
So we sort of narrowed down on service discovery
as the place where a lot of these networking features live.
So basically, what used to be just resolving a hostname to an IP,
you can imagine as resolving a URL to a set of rules.
Can we just plug in there and fix
a whole bunch of networking use cases within Kubernetes?
We can leave the L4 stuff to, like, the Ciliums of the world.
The layer 4 networking,
like leave that to the experts,
but make application networking much simpler.
So really what we're trying to do is fix what you'd call
maybe the microservices debugability problem,
not by observability but by building
a much easier system to
have microservices configure their communication with each other.
If I want to use the industry term,
it's proxy-less service mesh.
Some people know what that is,
but I wouldn't use those terms for anyone
who didn't know what it was.
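To make the "resolve a URL to a set of rules" idea concrete, here is a small, hypothetical Go sketch of what a proxy-less client could look like. None of these types, names, or endpoints are Junction Labs' actual SDK; the routing table is hard-coded where a real system would push dynamic configuration to the library.

```go
// Hypothetical sketch only; not Junction Labs' real API.
package main

import (
	"fmt"
	"math/rand"
)

// Endpoint is one concrete backend address a request could go to.
type Endpoint struct {
	Addr   string
	Weight int // relative traffic share, e.g. for a canary
}

// Route is the "set of rules" a name resolves to, instead of a single IP.
type Route struct {
	Endpoints  []Endpoint
	TimeoutMs  int
	RetryLimit int
}

// resolve stands in for a client-side discovery lookup. A real system would
// read config pushed to the SDK; here the answer is hard-coded.
func resolve(url string) Route {
	return Route{
		Endpoints: []Endpoint{
			{Addr: "orders-v1.cluster-a.internal:8080", Weight: 95},
			{Addr: "orders-v2.cluster-b.internal:8080", Weight: 5}, // 5% canary
		},
		TimeoutMs:  250,
		RetryLimit: 2,
	}
}

// pick applies the weights in the client itself, which is the "proxy-less"
// part: no sidecar in the data path, the library makes the routing decision.
func pick(r Route) Endpoint {
	total := 0
	for _, e := range r.Endpoints {
		total += e.Weight
	}
	n := rand.Intn(total)
	for _, e := range r.Endpoints {
		if n < e.Weight {
			return e
		}
		n -= e.Weight
	}
	return r.Endpoints[0]
}

func main() {
	route := resolve("http://orders.internal/checkout")
	fmt.Printf("sending to %s (timeout %dms, retries %d)\n",
		pick(route).Addr, route.TimeoutMs, route.RetryLimit)
}
```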
I mean, you've made this, to be honest,
what is kind of a bold choice if you look at the history
of platform engineering tools.
Most platform engineering tools look like an Istio.
Don't touch the app, build something around the app.
And instead, you're going after this sort of SDK approach,
which is where you get your, like, proxy-less properties
from for service discovery, network discovery,
and configuration.
What made you choose an SDK?
I think it's, number one, you know,
if you don't want to intermediate:
gRPC is great in many ways,
but it forces this big migration
because it intermediates transport, protocol, and service
discovery.
There's many reasons why many companies will never be able to adopt gRPC at any scale because
of that.
Protobufs are fine until you realize that they're using way too much garbage collection in Go.
You want flat buffers.
Well, you've chosen gRPC.
Sorry.
Part of it is just, I think, thinking a lot about composability.
Now, there is this critique that
a good technology is not a good startup.
And so we do think that we'll probably find our way,
it might be around progressive delivery,
might be around testing production or testing pre-production,
basically running network where we have
these products that actually make
the technology a lot more useful.
But really to me, the only way to do this right,
in a composable way, is actually in a library.
Now, the big bet there is,
historically you'd go to platform teams and you'd say,
hey, this new technology requires a library.
And they're like, well, good luck.
Because we've got eight different languages
and 20 years of legacy and not everyone's going to be
able to upgrade on any realistic timeline.
A little bit of a bet is between maybe monorepos and also
AI-based tools around refactoring
and just security needs around vulnerabilities.
It'll be far easier to keep a library
up to date than it was 10 years ago.
I've heard the critique of how awful it
was to get Teams to upgrade libraries 10 years ago.
So that drove people towards service mesh.
But again, like with Log4j,
I still think as an industry, right,
we really haven't internalized
that Log4j vulnerability, right?
That was such a terrible thing.
And at some point, if we don't get our shit in line,
like governments will force us to get our shit in line.
And so a little bit of the bet is that
managing libraries across an enterprise
is not quite as horrible as it was 10 years ago.
So we want to jump into our favorite section of a pod called Spicy Future.
Spicy Future.
And I will be very curious what exactly you want to give out as a spicy hot take here.
What is your spicy hot take about the infra world that you believe and
most others don't yet? So I struggle with this hot take because usually it's a very negative hot
take, and particularly as a startup in this space you want to be like, oh, I'm part of solving the problem.
So to me Kubernetes is a dead end. Maybe it's not that hot take because everyone sort of realizes it
but throughout the industry right we're still betting on Kubernetes so heavily because we don't know what comes next.
In my mind, Kubernetes is a dead end because it ties you to
this cluster model that doesn't scale well.
In the meantime, the hyperscalers are
completely building their own in-house thing
that would never work on-premise.
So we're sort of going to this place where we're going to
all these private data centers around GPUs,
pretending Kubernetes is the ecosystem that's going to cross them all.
But Kubernetes is not built for that multi-cluster world.
So I think that to me is the hot take is Kubernetes is
an industry dead end that we're all locked into as an industry.
So if Kubernetes is a dead end,
what do you think the future of
compute orchestration looks like and why?
I have very similar thesis, but I'm actually curious why you think that.
It is a dead end because it promises itself to be this multi-cloud
technology, sort of the lowest common denominator. I should say,
if you're using GKE or EKS and you're willing to be a hundred percent on those clouds,
I'm not sure about Azure, but I'm sure about those two, you're pretty good.
But this mixed sort of model, I think that's happening with GPU where
suddenly it's not just the hyperscalers anymore,
everyone else is building data centers.
Well, they're never going to be great at building
all the internal shit that Google and
Amazon have built to actually make Kubernetes go well.
We saw the OpenAI outage,
like it's just such dumb freaking shit,
and yet OpenAI can hire
the world's best engineers
and they still get hit by it.
You know, what are the B and C grade companies
who were paying like half of what OpenAI pays an engineer?
What are they gonna do?
So it's clear to me that that operating model
cannot persist.
I don't think OpenShift, you know,
it's good to have Red Hat supporting you,
but I don't think OpenShift is a way out.
You have to believe that the future is gonna take
certain aspects of Kube and stay compatible with that,
like maybe KubeKutol or something,
but like find a way that like administering
many, many clusters just isn't as hard
for so many companies.
It's taking way too many resources today
and we're like 5% into the migration
and nothing makes me think it's going to get
better. Like it's just inherent to the sort of way Kubernetes is rooted.
That does not answer your question though. Like what is it actually like? It's not really Lambda,
right? It's clear function as a service has a place at an enterprise, but it's definitely not
going to be 100% of workloads. Maybe it's a combination of Lambda, durable execution,
you know, Kelsey Hightower a long time ago was like,
people should really be running on platforms built on top.
And if you can build these platforms on top
that totally abstract that they're on Kubernetes,
maybe that's the sort of the vendor part is like,
more things that look like temporal,
more things that look like restate,
fewer things that look like, you know,
YAML around servers.
Like, I was integrating with Argo Rollouts
literally yesterday.
It's like, it's fine.
I just can't imagine any application team wants to touch
the 50 lines of configuration they have to get right
to just get a rollout to work nicely.
So I guess my answer is,
it is higher level abstractions that fundamentally,
at that point, do not need Kubernetes.
They could be running on any compute.
Do you think that high level abstraction
is like in some new open source projects that comes along?
Or do you think this is like a vendor API layer? Because I agree with you about everything you just
said. I don't know a single application-layer software developer that enjoys or wants to or
pays attention to anything they write in YAML files. And it's always like not even best effort. It's
like worst effort to get back to doing the thing that they actually want to do.
I guess this is in some sense, you know, talking my own book in terms of, you know, why I've sort of started Junction Labs.
Like I look at Dapr as sort of an application platform.
And in one sense, I think it's really, really well done. But it requires almost a full migration to take advantage of it.
It seems a very difficult like endeavor for me to imagine most larger companies ever moving significant workloads to it.
So I guess what I imagine is, you know,
there's maybe Junction Labs and maybe three others like us who have raised the
level of abstraction about what it means to, you know,
have a scheduling system and a networking system.
You break compute back into true compute.
You don't have this thing called orchestration
that couples compute and networking.
I think maybe it's very open source
slash vendors that cobble together enough of
the Kube API in terms of glue.
I think maybe that is what eventually replaces it.
It's difficult. You always look at how
Linux has stuck together for a really long time,
but distros have come and gone.
I think there's something about the distro nature, that evolution through distros is the next stage for Kubernetes.
This is such a big topic.
I don't think we can even touch on anything that's not going to take, like, another hour or two.
What is your belief around maybe the Junction angle into this sort of, like, Kube-less world?
Like you talk about how the abstraction has to go higher, but you know, your layer isn't
fully at the Temporal, Restate level, right?
It's not truly at the Kubernetes level.
It's somewhere in between.
Where do you see your stuff fits then in this future?
Yeah, what junction we want to get really good at over time is dynamic configuration.
So like a lot of applications at the end of the day, and this is,
you know, what HashiCorp Consul sort of promised 10 years ago.
So what we want to focus on to start with is dynamic configuration,
mostly around networking. Eventually, we want to focus around dynamic configuration around
different aspects of application behavior.
So basically, how do you change things
without requiring a complete redeployment workflow?
The idea is that there are client SDKs,
maybe they're doing client-side caching,
maybe they're doing a little bit of
client-side wasm to do some type of matching.
We want to get really good over time at
that layer of abstraction to build the broad platform,
but then find one or two products that
really make people want to use it.
I always just come back to the biggest pain,
I think, as people move to services is quality.
It's always either, how do I test my stuff with
production services without ending up with the 'staging is a mess' problem?
Or, how do I do safe rollouts?
I think those are the places where,
at least at Amazon, at least at Datadog,
I saw massive value in terms of a product.
Even as the technology junction itself
is really just a dynamic configuration
plugging in with an SDK.
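As a rough illustration of that dynamic-configuration idea, here is a small Go sketch of a client-side cached value that an application reads on every request and that a control plane can flip at runtime, with no redeploy. It is not Junction's real SDK; the update channel stands in for whatever push or polling mechanism would actually deliver the config.

```go
// Hypothetical sketch of dynamic configuration with a client-side cache.
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// DynamicValue holds the latest config value pushed from a control plane,
// cached client-side so reads never block on the network.
type DynamicValue struct {
	v atomic.Value
}

func NewDynamicValue(initial string, updates <-chan string) *DynamicValue {
	d := &DynamicValue{}
	d.v.Store(initial)
	go func() {
		for u := range updates {
			d.v.Store(u) // apply pushed changes; no restart or redeploy needed
		}
	}()
	return d
}

func (d *DynamicValue) Get() string { return d.v.Load().(string) }

func main() {
	updates := make(chan string)
	// e.g. which backend version a request should be routed to
	target := NewDynamicValue("orders-v1", updates)

	go func() {
		time.Sleep(200 * time.Millisecond)
		updates <- "orders-v2" // control plane flips the setting at runtime
	}()

	for i := 0; i < 3; i++ {
		fmt.Println("routing to:", target.Get())
		time.Sleep(150 * time.Millisecond)
	}
}
```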
Well, I think we have to wrap here
because we can easily go on for hours.
Where can people find more about you?
And I guess also plug a little bit
about your book and Junction.
I mean, probably the easiest for me is just the book
is probably the best entry point.
So it's called Platform Engineering.
It's like a primer for leadership.
But if you look for Platform Engineering on Amazon, I'm pretty sure it's the number one that comes up at the moment.
Junction Labs is just junctionlabs.io. Because the SDK needs to be open source to succeed,
we're sort of building in the open. So it's just two of us at the moment in a room,
writing code. It's early days, but you can get a good sense of what we intend to build
from some of our blog posts there. So I'd say those are the best two ways.
Cool. Well, thank you, Ian, and all the Ians in the room.
We're having a ton of fun here.
So we definitely need to have you on in some near future.
But thanks a ton for coming on our Infrapod.
Yeah, thanks for hosting me.
Thanks so much.