PurePerformance - Lessons learned when building the NAIS Platform with Hans Kristian Flaatten
Episode Date: September 30, 2024

NAIS (pronounced like NICE) is a team-centric application platform that provides DevOps teams with the tools they need to build, test, deploy, run and observe applications. In this episode Hans Kristian Flaatten, Platform Engineer at NAV, walks us through the WHYs, HOWs and challenges of building modern platforms on Kubernetes. Tune in and hear WHY they defined their own abstraction layer for applications, HOW developers benefit from that platform and WHY they developed their own developer portal instead of going with other popular available choices.

Links we discussed:
Hans Kristian's LinkedIn: https://www.linkedin.com/in/hansflaatten/
NAIS Documentation: https://docs.nais.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance. This is not the sexy voice of Brian Wilson, but this is the just regular voice of Andy Grabner.
Brian cannot be with us today, but Brian, I hope when you are listening to this, you feel better because this is why you couldn't join us today.
All the best, but thanks also for doing all the post-production. Now,
typically, sometimes Brian starts with remembering
a strange dream that he had.
I didn't have strange dreams. I really just
hope next time
we record an episode, he's there with
me. But I'm not solo today. We
actually have a repeat guest, and I think this is
the shortest amount of time between two
episodes to bring a guest back. Hans Kristian, servus. Hey, how are you doing? Very good, very good. So
now I was making the joke via email that I'm looking forward to having another episode with
my two Norwegian friends, because in the first episode we learned obviously you are Norwegian,
but that Brian also has some Norwegian roots. Well, this time I have to do it solo,
so we are 50-50 between Austria
and Norway.
Yeah.
Hey, Hans Kristian,
first of all, thank you so much for the last
episode, which, folks,
if you are interested in how
and why the Norwegian
government has decided to go all in on
OpenTelemetry,
the reasons behind it, but also some of the challenges, the lessons learned,
then please do me a favor, listen in to that episode.
The link can be found in the description.
But then last time, Hans Kristian, somewhere at the end of the podcast,
when we ran out of time, we said,
wow, there will be so much more to talk about because you are a platform engineer.
Platform engineering is a big topic for every one of us out there.
We also talked about architectural decisions.
We talked about developer portals, fault domains.
How can we isolate things in a platform to make sure that the blast radius is not too big?
And really, these are the things that I want to talk about today.
So, folks, if you're listening to this episode,
I want to learn from Hans Kristian today about his experiences with platform engineering,
building platforms on top of Kubernetes
and how you can give these platforms
to your engineering teams
so that they are becoming more proficient,
lessons learned,
similar to what we did last time with OpenTelemetry,
but focusing on best practices on platform engineering.
Hans Kristian, I give it to you.
What is the biggest learning maybe
when it comes to platform engineering in your world?
Well, maybe it's cliché,
but the platform is only half of the equation.
The other part is, of course, the applications.
So neither can be successful without the other.
You can think about it like this: you have a very expensive Formula One car.
Then you also need a very expensive Formula One car driver.
If you're just a regular Joe, even though you're super enthusiastic about cars,
if you're not a Formula One driver and that's your profile, you won't
get anything out of that car. And the same goes with platforms, really. You
can build the best platform there is, with all of the tools, build it yourself or buy it, whatnot,
everything is gold plated. But pardon my language, if the applications are crap,
then the outcome will be bad. The platform doesn't really fix that. And I guess I didn't talk too much about where I'm working and the platform.
And it's super cool to be on the show two times in order to talk more about that.
Because that is maybe the one thing where we did really well.
We have made a lot of bad decisions and a lot of great decisions. One of the really great decisions
very early on was that this would not be a lift and shift migration. This was actually
going to be a modernization and sort of bringing the applications and application architecture up
to speed and onto the new platform. And we have always had this sort of unspoken rule,
or actually we have spoken about it a lot,
but maybe we have not put it in writing.
But it's similar to,
if you go to the carnivals or the Tivoli,
there are certain rides where you need to be this tall. And now I'm
gesticulating with my hands for those who are just listening. Because really, to be on the platform,
for your applications to thrive there, you needed to be at a certain level. And you need to also go back 10 years.
That is when we started, really.
And the whole notion of 12-factor apps,
famously written at Heroku
by Adam Wiggins,
was not as prevalent then as it is today.
We were coming from large monoliths.
We emphasized that you need to rethink your application architecture
in order for them to function.
There's no reason to just lift it over.
That shouldn't be the reason in itself.
Of course, we wanted more developers, more applications on our platform,
but we also wanted them to put in the work,
and we knew that there would be a lot of work.
And so the second part there as well,
this was not time bound.
Sort of, it was not like,
oh, we are going to move all of the applications
in six months or some ridiculous timeframe.
This was the new platform.
And once you sort of had the time and took the effort to sort of modernize
your application, you could be on there and you would get all of these benefits.
But it would be side by side with all of the existing platforms
for a very long time, an unforeseeably
long time. And to this day: we still have mainframes, we still have huge
application servers of the old fashion, and then our more modern Kubernetes clusters,
first only on the on-premise environment, and then
later on inside the cloud environment.
We still run about 50-50.
I guess it's 40-60 by now.
We have moved 60% of the Kubernetes-based applications onto cloud.
So we're very happy with seeing the sort of the trends, but we don't have
any fixed time or date where we would turn off the on-premise cluster because we know
that there are reasons for some of the applications that have to be there, latency and so forth.
And same goes with the older platforms as well. Of course, there is a cost here.
So we don't want them to live on and drag on for infinity.
But we know that Rome wasn't built in a day,
and neither were the application platforms here.
Hans Kristian, can I ask a quick question?
In order to recap a little bit,
so that means there's kind of three stages.
Your traditional, let's call it legacy for lack of a better term, applications
that have been around for a while. Then when you talk about the platform,
you started with Kubernetes on-premise. So that means you were able to put certain
things into Kubernetes on-premise. And then the goal is to
move as many as possible for those that make sense
from the on-premise cluster into the cloud.
Did I get this correctly? Totally correct.
Perfect. Then I have another question because I want to also clarify this for the listeners.
When you talk about the platform, the new platform, we talk about you made a conscious
decision 10 years ago that Kubernetes is the core of your platform.
But you brought up an interesting point, and this is where I want to dig in deeper.
You said, you have to be this tall.
I like that. Or the 12-factor apps.
You have to have certain, your app has to have certain capabilities to actually leverage the new platform.
What is it that the development team gets if they build an app that can then either run on your
on-premise Kubernetes cluster or in the cloud? What are the additional things that your platform
really provides that they cannot have from building a traditional monolithic app and then
run it somewhere in the legacy system? So the largest benefit is that you get this,
you build it, you run it type of environment
because all of the legacy platforms
would require, to a certain degree,
someone else helping you
with certain operations.
And it can vary. For some of the platforms,
you really need to hand over an artifact, and the operations team will put that into
production. For some platforms, that process might be automated, but there are still
manual processes when you, for instance, want to get a new application.
If you want all of this automated,
then the new platform was for you.
So the second really big thing,
and what I would say was a good decision
in my organization,
was to start building up team autonomy.
Because also, before the Kubernetes-based platform,
we really had developers developing for a long time,
maybe months, maybe years, and then handing it over to production.
Or maybe there was, of course, QA in the middle,
and back and forth, before handing it over.
So the new platform represented a cut there as well in the process.
It represented that in this way of working, we don't have a separate operations team.
There is only one team.
It's the application team.
It's the product team.
So this domain-driven design
was something that became more important:
how do we actually partition
and cut up the organization
so they fit nicely into two-pizza
teams, as coined by Amazon?
So they can have a manageable set of services or applications that they are in charge of, and don't really
need to coordinate with too many other teams.
Of course, there will be some coordination, but for the most part, the team should be
self-coordinating, so that for the day-to-day interactions at least, it's inside the team
where they do that coordination.
Cool. So that means basically
instead of having, and I know people don't see us now when we move our hands on the screen,
but instead of having kind of like a vertical silo between dev and ops, you kind of have a
horizontal silo. Again, not a silo, but you have a horizontal interface, where on the bottom you
have your platform that provides and automates most of the technical challenges that people typically have with building, deploying and operating.
And then you make this as easy as possible for the engineers to really take responsibility for the applications, from building to operating.
And this is so fresh on my mind because I had a discussion yesterday
with one of our architects and he said, well, does this mean, he asked me the question,
does this mean that now developers in that model need to know everything about Kubernetes?
Do they need to know about ingresses? Do they need to know about security? Do they need
to know about everything? And isn't there too much burden? Because then everyone needs to understand the full complexity of operating software. And then my answer was,
no, this is the actual idea of a platform engineering team that builds a platform as a
product to abstract this complexity away and really just provides the building blocks for
engineers to take on that responsibility. Did I get this correct, or is this wrong?
Or how do you do that?
How do you see this?
Yeah, you are spot on, Andy.
So we started before this was really called platform engineering.
So I guess we called it, maybe we called it platform as a service.
I think that was the term we used back then.
But really, you are spot on.
And the goal here is to provide a higher level abstraction
and abstract away these lower level building blocks. Of course, there will be complexity.
We cannot hide it all. But what we did very, very early on was that we would not expose directly
all of the Kubernetes resources to our developers.
What we did was that we created,
and this was really, really early on.
I think we had some people that experimented
with what came before the custom resource definitions
that we know today.
Some POCs that actually used, I'm not sure what it was called back then, but today we know them as
custom resource definitions, where you can extend the Kubernetes API. And what we have built is an
application resource. So it sort of combines all of these resources that you listed.
It combines your deployment, your service, your ingress,
your network policies, all of these lower-level building blocks, and it provides
a higher-level abstraction where the developer
needs to specify some metadata about the application:
what's it called, which namespace should it run in,
and then your container image.
And then everything from there on is really optional,
where we have what we call our sensible defaults.
And then very, very few fields are actually required
in order for your application to spin up.
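To make that concrete, here is a minimal sketch of what such an application manifest could look like. The apiVersion and kind follow the public NAIS documentation, but the values are placeholders, not an actual NAV application:

```yaml
# Minimal sketch of a NAIS-style application manifest; values are
# placeholders. The operator expands this single object into a
# Deployment, Service, Ingress, network policies, etc., filling in
# the platform's sensible defaults for everything not specified.
apiVersion: nais.io/v1alpha1
kind: Application
metadata:
  name: my-app            # what the application is called
  namespace: my-team      # which namespace it should run in
  labels:
    team: my-team
spec:
  image: ghcr.io/my-org/my-app:1.2.3   # the container image
```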
Of course, as the application and the team matures,
they will leverage more and more features.
It's not that the manifest is always tiny.
It can be 150 lines long
if you take advantage of all the different services.
But for getting started, the minimum viable
application, it's really, really condensed. So that, again,
of course, is an abstraction that the developers need to know,
but it really allows us
to have that layer above the underlying resources. So we could,
for instance, go from one version of an underlying Kubernetes resource to another, or remove
it altogether.
So we have changed service mesh without the teams really knowing because that's something
that we treat as an implementation detail.
We have changed how network policies work and really sort of how these lower level building blocks really work.
In some cases, there is some migration that needs to be done, with the help of the teams.
But in many, many cases, we are able to actually change versions, change components, without requiring the teams to change their manifest or change their application.
So it's a really, really strong overlay.
Did you implement then your own, I guess, operator
that handles those applications?
Have you, and I guess this may have been before tools
like Crossplane emerged,
but have you looked into Crossplane as a tool as well for that?
So we have looked into Crossplane.
That was quite early.
So most of the functionality is in our own operator or operators. We have some other supporting operators as well doing custom things.
Then, what we're using today for our Google cluster,
we are using an operator similar to Crossplane.
It's built by Google.
It's Config Connector.
Because that's where we are running our cloud cluster,
in Google Kubernetes Engine.
So it actually comes with these APIs already built in,
or the Kubernetes API already extended,
where you can create Google resources.
You can create Cloud SQL instances,
storage buckets, etc.
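As an illustration, a Config Connector resource is just another Kubernetes object. The sketch below uses the SQLInstance kind from Config Connector's sql.cnrm.cloud.google.com API group, with placeholder values:

```yaml
# Sketch of a Config Connector resource: applying this Kubernetes
# object makes the operator reconcile an actual Cloud SQL instance
# in Google Cloud. Values are placeholders.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: my-app-db
  namespace: my-team
spec:
  databaseVersion: POSTGRES_15
  region: europe-north1
  settings:
    tier: db-custom-1-3840   # machine size for the instance
```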
And it's similar to Crossplane.
If Crossplane had been a little bit more mature
when we started that,
we most likely would have used Crossplane
because we tend to favor open source,
open implementations,
and not going too vendor-specific.
Of course, you need to have a provider of some sort.
We really want someone to provide us Postgres databases.
That's not something that we really want to operate ourselves if we can afford not to do it.
But the resource, the API of how we are creating those Postgres databases,
we would really like that to be as open as possible and sort of a standard,
which Crossplane really represents.
And I know the discussion has come up a few times, whether we
should move to Crossplane. Currently,
we have no real
roadmap
items to support other cloud
environments. That's the other benefit as well
with Crossplane: instead of
the Config Connector from Google, which is
only Google-specific,
Crossplane is more like Terraform
where it supports
a number of providers
which makes that integration
work there a lot easier.
But it works in a similar way, Andy.
That's cool because I just
had a podcast
recording with Viktor Farcic,
whom I know you know, and he
will also be speaking at Cloud Native
Bergen Day, because he's the developer advocate for Crossplane. That's why it's fresh on my mind.
So that's really cool. I'd like to recap a little bit just to make sure. You basically
built an abstraction layer for developers because really what they want to do is they want to build and deploy applications.
So you defined an application object.
You have an operator that then translates this application object
into the actual deployments underneath the hood
without the developers having to deal with the complexity
and also allowing you to switch things in the backend,
like switching from one service mesh to another.
I also like what you said with you as the experts here on the platform engineering team,
you know what good defaults are.
So that means you can start with sensible defaults, but then as you mature, you can
also adjust them. And I think this is the autonomy aspect, right?
You give them a certain level of autonomy, but also guardrails, which is great.
Back to the question that my colleague asked me yesterday, right?
He asked me, you know, do they need to know everything?
And you clearly stated how you can solve this problem by the abstraction layer.
But here's my question.
What if something doesn't work?
Where is the responsibility or what can a development team do
and where do they then need to go to you for troubleshooting? Because I think that's a big
thing, right? What if my application all of a sudden doesn't deploy anymore? How do I know: is it my fault
as an engineering team, or is it the fault of the platform? Maybe it's not even your fault. Maybe it's a fault from the underlying infrastructure provider. This is kind of like the fault domain isolation and fault domains,
which is why I have been interested in knowing how this works in your world.
So I don't have sort of like this one definitive answer, and it's certainly matured over time. And of course, the focus has been a lot
on the deployment interface and
how do you specify the application. And then on the operations side,
we have always given them access to the Kubernetes API.
So you didn't really need to know how to specify your
resources, but you certainly needed to know about the concept of pods.
Each instance of your application would run as a pod.
And of course, that's sort of an underlying implementation detail.
But it's something where we have not found a good way to provide the same abstraction on the operations
side as well. But we do have logs, and we do try to be helpful with the error messages
when you deploy to the platform. For instance: is this a problem of ours, or is it a
problem in your application? In most cases, this is an
application problem, so that's what it states. And all of the helpful guides, etc., start
with 'check your application logs' in most cases. And I would argue that
at least 80 percent of the time, something is wrong on the application side. It's not necessarily a bug of sorts;
of course, the environment, your local development environment,
is slightly different
from running inside a Kubernetes-based environment.
So there might be some of those differences
that you have not accounted for,
with regards to how the database is specified or what you can access,
et cetera. But in most cases, it's the application logs. And then in some cases,
you need to check the status of the pods: is it running out of memory?
Are the liveness checks not working? Sometimes these can be misconfigured. So it's not
easy. And it is certainly, I would argue, a step above having to know whether all
of the different components are configured correctly. We are
very certain that at least your ingress is pointing to the right service
and the service is pointing to the right port on the deployment.
So that isn't really an issue.
But of course, figuring out the exact problem there
has been a challenge for quite some time.
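For listeners less familiar with the pod-level settings being referred to here, the fragment below shows the two usual suspects, memory limits and liveness checks, as they look in a raw Kubernetes container spec. On this platform they would sit behind the application manifest; the values are purely illustrative:

```yaml
# Illustrative container-spec fragment: the two settings most often
# behind "my application won't start" incidents mentioned above.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi         # the pod is OOM-killed if it exceeds this
livenessProbe:
  httpGet:
    path: /health         # hypothetical endpoint; must exist and respond
    port: 8080
  initialDelaySeconds: 10 # too low, and a slow-starting app gets killed
  periodSeconds: 5
```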
So, fast forward, and this is quite recent: we actually started work on a proper developer portal.
And that would encompass more of the operational day-two side.
So once the application has been deployed, can we provide a better overview? Can we give better and more insight into the application
in order to debug and look into the performance?
So we did some POCs very, very early on with Backstage,
but concluded that we wanted something,
I wouldn't say slightly different,
but we concluded that Backstage was too big, rather.
It's also trying to do the provisioning of applications, and we were not certain about that
and the integrations with the different resources. We have some very concise, well-defined entry points where
you can get the information. We don't really need them to go look into details inside your
cloud environment, really. We have most of the status in the Kubernetes cluster. So we built
our own, basically. We built our own developer portal that is based around
this application manifest that we have created.
So once your application is deployed, then you get
all of these separate sections and subparts
with the different additional
services that you can provision alongside your application.
So that'd be SQL instances, cloud storage, Kafka topics, et cetera.
We have a fairly concise entry point, or API,
where we can get all of this information.
We just needed to display it.
And a lot of the work... so there are three different
goals here.
One, of course, is giving the developers a better overview of what's running inside their namespace:
what the applications and the resources deployed actually are.
Two is the cost, and that was maybe one of the main drivers, actually.
It was not that operations was hard.
It's certainly hard, but it was not sort of a pressing concern.
It was not like, oh, this is a really, really big issue
and we have downtime all too often
because something is misconfigured and they can't figure it out.
That's, of course, a challenge every now and then.
Most often it's only development environment.
You've done some changes to your application,
either that's the manifest or the application itself,
and then it fails in the development environment.
It doesn't really cause an issue on the production side.
If it's already going to production,
in most cases it will also be caught:
you have a working application that's already running,
and we will prevent the broken one
from actually pushing out all of the replicas of the good one.
But this cost part was really what drove a lot of the effort: to give
the developers better feedback on their resource consumption and how much that is actually costing
us in dollars or euros. Because when we give our developers autonomy, they can actually choose their own journey, choose their own adventure.
So how big an SQL instance do you want?
How much CPU and memory would you like to allocate for this
application here?
Of course, there is a tendency to over-provision.
Well,
it's fascinating,
but I need to interject and just ask a question here.
So do you then, as, let's say, the central platform team,
also provide advice, mentoring, any type of support to say,
hey, by the way, it seems you are 50% over-provisioned most of the time.
And here are some recommendations on proper sizing.
Do you provide this or does the platform automatically provide this in some way?
Or do developers need to manually look into this
and they need to make manual decisions?
Currently, it's manual, but it's less manual than it used to be.
It used to be you had to go into Grafana and find the right dashboard,
or some other analytics dashboards that we have.
So now at least it's in your developer portal.
So once you log on, you open your application,
you get that recommendations there.
But we are not doing that.
We are not automatically adjusting it.
One thing: Henrik, who is on my team,
Henrik Rexed,
I'm sure you've heard his name, he is
Mr. Is It Observable for me.
He did a really cool thing at last
year's KubeCon where
he was basically using observability
data, and he built an assistant
using workflows to,
every day in the morning, basically send
developers an overview and say, hey,
these workloads or this namespace of yours is 50% overprovisioned based on the actual
load.
Here's a recommendation on how to adjust your resource requests and limits.
So basically using observability data and some algorithms to basically say, this is
what I as an expert would do and then provide this.
And then another colleague of mine, Katharina Sik,
she also recently spoke at KCD Romania.
She also just built something really cool
where she's using automation workflows
to automatically open up pull requests
with proposals of changed resource requests and
limits based on current
and predicted CPU and memory
consumption.
That's really cool. Because if you
then have a pull request open that says, here's a
suggestion from your platform,
but it still gives the human
the decision to say,
yes, sounds good, or not good, that is a
really cool thing to do.
Yeah, absolutely.
And it's certainly something,
because we have had the data here for a long time,
at least CPU and memory data.
And we can see how much you are using of those
compared to what you have allocated for the application.
So we are training our developers and application team
to keep an eye and to use this tool that we have built
and to adjust those accordingly.
And what we also do is that we translate those
into a euro amount.
Because we have certainly had instances
where, oh, you have requested 400 millicores
and you're only using 200.
But the cost of that is almost negligible.
So that is not where we are trying to push our developers.
We are mainly focusing on where the big costs are, even though that's a 50% decrease.
Yeah, but if it doesn't make an impact, you want to focus on the real hotspots.
Yes, absolutely.
The big money grabbers, yeah.
Yeah, so there have, of course, been some applications that have been really, really greedy when allocating resources.
And we also have some databases that have been really, really over-provisioned.
It's a little bit harder to auto-scale those and scale them back down and up. So that's not an easy task, and something where we are still looking into how we can provide a better way of scaling those.
You can scale the memory and CPU, but the underlying disk, at least this is Google Cloud, will only increase.
It's very hard, next to impossible, to decrease the disk.
So if you, for instance,
removed a lot of data from a database,
actually decreasing that disk is very hard.
So interesting, and thanks for sharing the insights
on why you built your own developer portal,
because I think it also makes a lot of sense, because you came up with your own application manifest definition. I think the whole CNCF, or like our whole
community, has been asking for kind of this additional level of abstraction of an app.
There's also TAG App Delivery,
the CNCF group that is working
on some of these concepts.
But yeah, so far, I think there's no real application object
out there, and that's why I see many organizations
basically doing what you're doing,
either coming up with your own implementation
with an operator, or looking at tools like Crossplane
to then build composites, as it's called in Crossplane, to
kind of build a level of abstraction. Did you, in your developer portal, also implement the use case
of onboarding and creating a new app from a template? Because this is one of the features
that people like about Crossplane as well. Yeah, no, so that was one of the features that we didn't feel was ready for us,
or we weren't ready for that type of feature.
And it goes back to that we have given a lot of autonomy and flexibility to the development teams,
but there isn't really one template to rule them all.
Of course, there are certainly things that are common across them,
but at least from the platform's point of view,
and it speaks a little bit to the fact
that we have a large number of teams.
So we have about 200 registered teams on our platform.
Not all of them are logical, separate teams.
So maybe there's somewhere between 100 and 200 actual teams.
We have 2,000 of these applications. So it's quite hard for at least the platform team
to know what's a good,
what's a sensible starting application of some sort.
So that's actually something that we haven't really
been able to answer for quite some time.
Maybe we could have done it early on, because there were fewer
teams and fewer applications, but certainly not now. There's
too wide a variety of different
applications, and the
amount of boilerplate is quite minimal. Of course,
you don't really need a GitHub Action,
but that's the CI/CD system that we are using,
where we have a custom build-and-deploy action.
Yeah, reusable.
You need, in most cases, a Dockerfile,
unless the framework that they are using can produce one for you.
And then you need this nice YAML manifest.
Apart from that, there's nothing else, at least from the platform's point of view,
that you need to create.
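For illustration, a workflow along these lines could look as follows. The custom action name and its inputs are assumptions for the sketch, not NAV's actual reusable action:

```yaml
# Hypothetical build-and-deploy workflow; the deploy action name and
# its inputs are illustrative, not the team's actual reusable action.
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push container image   # registry login omitted
        run: |
          docker build -t ghcr.io/my-org/my-app:${{ github.sha }} .
          docker push ghcr.io/my-org/my-app:${{ github.sha }}
      - name: Deploy via the platform's deployment API
        uses: my-org/deploy-action@v1          # hypothetical action
        with:
          cluster: dev
          resource: .nais/app.yaml             # the application manifest
```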
But I guess you put a lot of effort, obviously, in your application definition, right?
I mean, I guess you, instead of providing different templates that contain, I don't know, Helm charts and all sorts of other YAML files, you invested in defining this abstraction layer of an application.
And I assume to get there, you initially had to talk with a lot of the application owners, like how many applications need a database, what type of database,
and then you abstract it that way.
How do you, I assume you're using Flux or Argo
or any of these GitOps tools to pull this in,
or do you just push these?
Actually not.
So we're very much push-based.
And that goes back to, again,
the manifest is in Git.
So this application
manifest, it's 99%
of what's actually running.
There might be a few variables,
like the container
image; apart from that,
everything is already in Git.
And I really
want to emphasize that it's Git.
You need to have your things in Git.
But then the question still is,
how does this object in Git
make it into your Kubernetes cluster
so your operator can pick it up?
And in your case, it's still a push.
Yeah, it's push-based.
So we have a deployment API that exposes a very limited
subset, or rather, it accepts YAML or JSON objects that are Kubernetes objects, and
then it forwards them to the correct cluster, and then it does the authorization check: is this repository here authorized to do a deployment
into this namespace in this cluster?
So the reason that we are really, really happy with this flow
is that it gives the developers immediate feedback.
So once they push something, the GitHub action is started,
and it builds, and it tests, and it deploys,
and there's an absolute feedback there.
Did it work? Did it not work?
Because the deployment API will wait for the application
to actually be started by the operator,
and it will wait for it to actually get acknowledgement back
that, yes, this application here is actually running.
And for all of the checks that we can see, it's running well.
And this is obviously in contrast to an event-driven model, where with GitOps, right, you don't know how
often the GitOps operator, Argo or Flux, depending on how it's configured, pulls changes in.
And so it makes this a little bit more challenging.
It's interesting. I'm just interested to hear why people are, let's say,
using push versus pull.
What are the reasons for it?
What's your source of truth then?
What is currently deployed on the cluster
and how would you then recover a cluster state?
Do you have any, what's your suggestion?
So let me answer the last one first.
So we have a complete backup.
So if we need to restore a cluster, we can restore that.
So we have, I guess it's Velero that we are using now.
It goes through and it takes a backup or snapshot of all of the Kubernetes resources,
all of the namespaces,
and the persistent volumes as well.
So we can restore those in the case
where something catastrophic has occurred
or someone has deleted something they shouldn't have.
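A Velero setup along the lines described could be expressed as a Schedule resource like the sketch below; the cadence and retention values are illustrative, not NAV's actual configuration:

```yaml
# Sketch of a Velero Schedule: nightly backup of all namespaces,
# including persistent volume snapshots. Values are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"         # every night at 02:00
  template:
    includedNamespaces: ["*"]   # all namespaces
    snapshotVolumes: true       # also snapshot persistent volumes
    ttl: 720h                   # keep backups for 30 days
```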
So again, going back to the source of truth, that is up to the developers
to decide. We have taken the stance that, well, there will be a lot of different models,
there are different development flows. We are not here to say that you need to use
this or that. We can come with some recommendations.
What we provide you with
is a development environment
and a production environment.
And we advocate that this should be sufficient
for most of the teams.
But if you want a setup where you're using feature branches and you have release branches, etc.,
that's up to you. You just need to create the correct
GitHub workflow that sort of picks up the branch
on whatever pattern and deploys your application with the correct name
to the correct environment. We don't give you
more environments,
but you can deploy your application
as many times as you like
and then create these virtual environments
if you want to have those.
But we certainly advocate
that it needs to be driven by Git.
The teams have the possibility to go in and edit things,
and, I wouldn't say too much,
they can apply changes from their own machines
because they have access to their own clusters.
Again, it's certainly a challenging operation
because they first need to have a built container image.
It needs to be pushed.
It needs to be available.
You cannot just build that image locally and then deploy it.
So what ends up happening is that things are going into Git,
and then they have a very, very fast feedback cycle
where you can get this up and running, out and deployed,
which really alleviates the
need for doing any manual deployments. From time to time they will
go in and do manual changes. But it's a trade-off that we are okay with. We believe that within these teams, it's contained enough when someone says,
oh, I'm going in and changing this on the fly
for some reason.
And that the team is sort of aware of that.
That might not be 100% true all of the time,
but from our point of view,
we have not caught any outages or problems,
at least in production environments, where the root cause has been,
oh, we have the ability to do manual changes.
Rather, having the ability to do some manual changes actually makes it easier for them to go and fix
or test a really
one-off thing that they need to just check.
Oh, if I put in this number here or this value for some configuration, how does that affect
the application?
Yeah.
Cool.
Hey, Hans Kristian, it's amazing how time flies, because we are almost 40, 50 minutes in already.
For me, it's always great to listen to people that actually really implement platforms.
And everybody has a different background, different company, organization, in your case, government agency, reasons why certain things are done that way.
But what I really like, and I think this is something that I see all over: you are basically,
as a platform engineering team, providing basically an app, a platform as a service, a platform
as an application, whatever you want to call it.
But in the end, you've built an abstraction layer on top of the complexity of the next generation of platforms we're going to build.
And that's Kubernetes for many people.
And even if it's not Kubernetes, even if it's serverless, if it's anything else, platform engineering basically provides this abstraction layer to actually,
and I think this is what I really like, what you said earlier: to really enable teams to be autonomous, to really take the 'you build it, you run it' approach, and actually being able to
execute it without having to deal with all the complexity, because you're working
within a certain
area where you can move left and right because you have,
in your case, this application object that you've defined and that you maintain and that
you have an operator for.
But yeah, that's really, really great.
Now, there are reasons why you built your own developer portal.
It all makes a lot of sense.
And yeah, really cool that you shared this. I know in the previous episode,
when we talked about OpenTelemetry,
you had a great blog post about your OpenTelemetry story.
Are there also blog posts or is there any public material out there
in case people want to read more about your platform engineering projects?
Yeah, absolutely.
So we have written blog posts over the years,
but we are really, really proud of our application documentation,
and we are developing all of this in public,
in the open, and the documentation is in the open.
So if you go to docs.nais.io,
that is N-A-I-S dot I-O.
It's pronounced nice.
It's a nice platform.
You can see all of this from a,
well, how does this look from a developer point of view?
There are getting-started guides
and examples of what the application manifest looks like,
the deployment workflows,
and all of these extra bells and whistles
that you can put onto your application.
And of course, my baby, all of the observability part,
where we have really made it as one-click as you can get
to enable OpenTelemetry instrumentation for your application,
by using, as I mentioned, the agents
that we just attach
to your application when it's starting up.
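For reference, the "one-click" enablement described here appears in the public NAIS docs as a small fragment of the application manifest, roughly like the sketch below; treat the exact field names as illustrative:

```yaml
# Sketch of enabling OpenTelemetry auto-instrumentation in the
# application manifest; the platform attaches the agent at startup.
# Field names per the public NAIS docs; treat details as illustrative.
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: java   # language runtime the agent should target
```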
Really cool.
Folks, we will also add the link to NAIS,
and put the links to NAIS
in the description of the podcast as well.
So we have been quite vocal about our platform
in the platform engineering community and the
rest of the public and government sector here in Norway. And we know for a fact that there are at
least a few people that are a little bit like, oh, how did they manage to take that cool
name? It's a play on NAV, that's the organization, and then it's some
play on that. And having this NAIS, oh, that is nice, and how that is used so much,
at least by my platform team. And then we try to export that
and to market our platform, and they are like, oh no.
And now I also know the title of the podcast is probably something like building nice platforms.
Yes.
So I guess it's this,
it's a play on NAV infrastructure as a service.
And then if you just squint a little,
that would be NAIS, and nice.
Yeah, nice.
That's perfect.
Naming is a pretty hard thing;
finding a good name is challenging.
And in your case, it's perfect.
Cool.
Hey, Hans Kristian,
I'm really very much looking forward
to meeting you in person in Bergen.
It's just a couple,
well, still a couple more weeks to go, but still, you know.
Likewise, and it's going to be
really, really cool. Really nice.
Yeah. So folks, if you listen to this, and if you've never
been to Bergen in Norway, but you want to meet people like Hans Kristian, and also others, like I know
Viktor Farcic is also going to be there,
and some others.
And obviously the local community.
Then I'm pretty sure you can still get some tickets, maybe.
Or are you sold out already?
No, not yet.
We are almost sold out of early bird tickets,
but I guess there will be one or
two tickets left
by the time the podcast is airing.
So yeah, cloudnativebergen.dev is the place to go.
Exactly, cloudnativebergen.dev.
Also a link that we add to the description.
Cool.
Hans Kristian, I say thanks again.
And Brian, thanks for doing all the post-production as always.
And next time, I'm sure I'll have you back as my co-host.
Awesome. Thanks for being here.
Bye.