PurePerformance - Lessons learned when building the NAIS Platform with Hans Kristian Flaatten
Episode Date: September 30, 2024

NAIS (pronounced like NICE) is a team-centric application platform that provides DevOps teams with the tools they need to build, test, deploy, run and observe applications. In this episode Hans Kristian Flaatten, Platform Engineer at NAV, walks us through the WHYs, HOWs and challenges of building modern platforms on Kubernetes. Tune in and hear WHY they defined their own abstraction layer for applications, HOW developers benefit from that platform and WHY they developed their own developer portal instead of going with other popular available choices.

Links we discussed:
Hans Kristian's LinkedIn: https://www.linkedin.com/in/hansflaatten/
NAIS Documentation: https://docs.nais.io/
Transcript
It's time for Pure Performance!
Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson.
Welcome everyone to another episode of Pure Performance. This is not the sexy voice of Brian Wilson, but this is the just regular voice of Andy Grabner.
Brian cannot be with us today, but Brian, I hope when you are listening to this, you feel better because this is why you couldn't join us today.
All the best, but thanks also for doing all the post-production. Now,
typically, sometimes Brian starts with remembering
a strange dream that he had.
I didn't have strange dreams. I really just
hope next time
we record an episode, he's there with
me. But I'm not solo today. We
actually have a repeat guest, and I think this is
the shortest amount of time between two
episodes to bring a guest back. Hans Kristian, servus. Hey, how are you doing? Very good, very good. So
now I was making the joke via email that I'm looking forward to having another episode with
my two Norwegian friends, because in the first episode we learned obviously you are Norwegian,
but that Brian also has some Norwegian roots. Well, this time I have to do it solo,
so we are 50-50 between Austria
and Norway.
Yeah.
Hey, Hans Kristian,
first of all, thank you so much for the last
episode, which, folks,
if you are interested in how
and why the Norwegian
government has decided to go all in on
OpenTelemetry,
the reasons behind it, but also some of the challenges, the lessons learned,
then please do me a favor, listen in to that episode.
The link can be found in the description.
But then last time, Hans Kristian, somewhere at the end of the podcast,
when we ran out of time, we said,
wow, there will be so much more to talk about because you are a platform engineer.
Platform engineering is a big topic for every one of us out there.
We also talked about architectural decisions.
We talked about developer portals, fault domains.
How can we isolate things in a platform to make sure that the blast radius is not too big?
And really, these are the things that I want to talk about today.
So, folks, if you're listening to this episode,
I want to learn from Hans Kristian today about his experiences with platform engineering,
building platforms on top of Kubernetes
and how you can give these platforms
to your engineering teams
so that they are becoming more proficient,
lessons learned,
similar to what we did last time with OpenTelemetry,
but focusing on best practices on platform engineering.
Hans Kristian, I give it to you.
What is the biggest learning maybe
when it comes to platform engineering in your world?
Well, maybe it's cliché,
but the platform is only half of the equation.
The other part is, of course, the applications.
So neither can be successful without the other.
You can think about it like this: you have a very expensive Formula One car.
Then you also need a very expensive Formula One car driver.
If you're just a regular Joe, even though you're super enthusiastic about cars,
if you're not a Formula One driver and that's your profile, you won't
get anything out of that car. And the same goes with platforms, really. You
can build the best platform there is, with all of the tools, build it yourself or buy it, whatnot,
everything is gold plated. But pardon my language, if the applications are crap,
then the outcome will be bad. The platform doesn't really fix that. And I guess I didn't talk too much about where I'm working and the platform.
And it's super cool to be on the show two times in order to talk more about that.
Because that is maybe the one thing where we did really well.
We have made a lot of bad decisions and a lot of great decisions. One of the really great decisions
very early on was that this would not be a lift and shift migration. This was actually
going to be a modernization and sort of bringing the applications and application architecture up
to speed and onto the new platform. And we have always had this sort of unspoken rule,
or actually we have spoken about it a lot,
but maybe we have not put it in writing.
But it's similar to,
if you go to the carnivals or the Tivoli,
there are certain rides where you need to be this tall. And now I'm
gesticulating with my hands for those who are just listening. Because really, to be on the platform,
for your applications to thrive there, you needed to be at a certain level. And you need to also go back 10 years.
That is when we started, really.
And the whole notion of 12-factor apps,
famously written at Heroku
by Adam Wiggins,
was not as prevalent then as it is today.
We were coming from large monoliths.
We emphasized that you need to rethink your application architecture
in order for them to function.
There's no reason to just lift it over.
That shouldn't be the reason in itself.
Of course, we wanted more developers, more applications on our platform,
but we also wanted them to put in the work,
and we knew that there would be a lot of work.
And so the second part there as well,
this was not time bound.
Sort of, it was not like,
oh, we are going to move all of the applications
in six months or some ridiculous timeframe.
This was the new platform.
And once you sort of had the time and took the effort to sort of modernize
your application, you could be on there and you would get all of these benefits.
But it would be side by side with all of the existing platforms
for a very long time, an unforeseeably
long time. And to this day: we still have mainframes, we still have huge
application servers of the old fashion, and then our more modern Kubernetes clusters,
first only on the on-premise environment, and then
later on inside the cloud environment.
We still run about 50-50.
I guess it's 40-60 by now.
We have moved 60% of the Kubernetes-based applications onto cloud.
So we're very happy with seeing the sort of the trends, but we don't have
any fixed time or date where we would turn off the on-premise cluster because we know
that there are reasons for some of the applications that have to be there, latency and so forth.
And same goes with the older platforms as well. Of course, there is a cost here.
So we don't want them to live on and drag on for infinity.
But we know that Rome wasn't built in a day,
and neither were the application platforms here.
Hans Kristian, can I ask a quick question?
In order to recap a little bit,
so that means there's kind of three stages.
Your traditional, let's call it legacy for lack of a better term, applications
that have been around for a while. Then when you talk about the platform,
you started with Kubernetes on-premise. So that means you were able to put certain
things into Kubernetes on-premise. And then the goal is to
move as many as possible for those that make sense
from the on-premise cluster into the cloud.
Did I get this correctly? Totally correct.
Perfect. Then I have another question because I want to also clarify this for the listeners.
When you talk about the platform, the new platform, we talk about you made a conscious
decision 10 years ago that Kubernetes is the core of your platform.
But you brought up an interesting point, and this is where I want to dig in deeper.
You said, you have to be this tall.
I like that. Or the 12-factor apps.
You have to have certain, your app has to have certain capabilities to actually leverage the new platform.
What is it that the development team gets if they build an app that can then either run on your
on-premise Kubernetes cluster or in the cloud? What are the additional things that your platform
really provides that they cannot have from building a traditional monolithic app and then
run it somewhere in the legacy system? So the largest benefit is that you get this,
you build it, you run it type of environment
because all of the legacy platforms
would require, to a certain degree,
someone else helping you
with certain operations.
And it can vary. For some of the platforms,
you really need to hand over an artifact, and the operations team will put that into
production. For some platforms, that process might be automated, but there are still
manual processes when you, for instance, want to get a new application.
If you want all of this automated,
then the new platform was for you.
So the second really big thing,
and what I would say was a good decision
in my organization,
was to start building up team autonomy.
Because also, before the Kubernetes-based platform,
we really had developers developing for a long time,
maybe months, maybe years, and then handing it over to production.
Or maybe there was, of course, QA in the middle,
and back and forth, before handing it over.
So the new platform represented a cut there as well in the process.
It represented that in this way of working, we don't have a separate operations team.
There is only one team.
It's the application team.
It's the product team.
So this domain-driven design
was something that became more important:
how do we actually partition
and cut up the organization
so they fit nicely into two-pizza
teams, as coined by Amazon?
So they can have a manageable set of services or applications that they are in charge of, and don't really
need to coordinate with too many other teams.
Of course, there will be some coordination, but for the most part, the team should be
self-coordinating, so that for the day-to-day interactions at least, it's inside the team
where they do that coordination.
Cool. So that means basically
instead of having, and I know people don't see us now when we move our hands on the screen,
but instead of having kind of like a vertical silo between dev and ops, you kind of have a
horizontal silo. Again, not a silo, but you have a horizontal interface, where on the bottom you
have your platform that provides and automates most of the technical challenges that people typically have with building, deploying and operating.
And then you make this as easy as possible for the engineers to really take responsibility for the applications, from building to operating.
And this is so fresh on my mind because I had a discussion yesterday
with one of our architects and he said, well, does this mean, he asked me the question,
does this mean that now developers in that model need to know everything about Kubernetes?
Do they need to know about ingresses? Do they need to know about security? Do they need
to know about everything? And isn't there too much burden? Because then everyone needs to understand the full complexity of operating software. And then my answer was,
no, this is the actual idea of a platform engineering team that builds a platform as a
product to abstract this complexity away and really just provides the building blocks for
engineers to take on that responsibility. Did I get this correct, or is this wrong?
Or how do you do that?
How do you see this?
Yeah, you are spot on, Andy.
So we started before this was really called platform engineering.
So I guess we called it, maybe we called it platform as a service.
I think that was the term we used back then.
But really, you are spot on.
And the goal here is to provide a higher level abstraction
and abstract away these lower level building blocks. Of course, there will be complexity.
We cannot hide it all. But what we did very, very early on was that we would not expose directly
all of the Kubernetes resources to our developers.
What we did was that we created,
and this was really, really early on.
I think we had some people that experimented
with what came before the custom resource definitions
that we know today.
Some POCs that actually used, I'm not sure what it was called back then, but today we know them as
custom resource definitions, where you can extend the Kubernetes API. And what we have built is an
application resource. So it sort of combines all of these resources that you listed.
It combines your deployment, your service, your ingress,
your network policies, all of these lower-level building blocks, and it provides
a higher-level abstraction where the developer
needs to specify some metadata about the application:
what's it called, which namespace should it run in,
and then your container image.
And then everything from there on is really optional,
where we have what we call our sensible defaults.
And then very, very few fields are actually required
in order for your application to spin up.
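To make that concrete, here is a minimal sketch of what such an application manifest could look like. The apiVersion and kind follow the public NAIS documentation, but the values are placeholders, not an actual NAV application:

```yaml
# Minimal sketch of a NAIS-style application manifest; values are
# placeholders. The operator expands this single object into a
# Deployment, Service, Ingress, network policies, etc., filling in
# the platform's sensible defaults for everything not specified.
apiVersion: nais.io/v1alpha1
kind: Application
metadata:
  name: my-app            # what the application is called
  namespace: my-team      # which namespace it should run in
  labels:
    team: my-team
spec:
  image: ghcr.io/my-org/my-app:1.2.3   # the container image
```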
Of course, as the application and the team matures,
they will leverage more and more features.
It's not that the manifest is always tiny.
It can be 150 lines long
if you take advantage of all the different services.
But for getting started, the minimum viable
application, it's really, really condensed. So that, again,
of course, is an abstraction that the developers need to know,
but it really allows us
to have that layer above the underlying resources. So we could,
for instance, go from one version of an underlying Kubernetes resource to another, or remove
it altogether.
So we have changed service mesh without the teams really knowing because that's something
that we treat as an implementation detail.
We have changed how network policies work and really sort of how these lower level building blocks really work.
In some cases, there is some migration that needs to be done, with the help of the teams.
But in many, many cases, we are able to actually change versions, change components, without requiring the teams to change their manifest or change their application.
So it's a really, really strong overlay.
Did you implement then your own, I guess, operator
that handles those applications?
Have you, and I guess this may have been before tools
like Crossplane emerged,
but have you looked into Crossplane as a tool as well for that?
So we have looked into Crossplane.
That was quite early.
So most of the functionality is in our own operator or operators. We have some other supporting operators as well doing custom things.
Then, what we're using today for our Google cluster,
we are using an operator similar to Crossplane.
It's built by Google.
It's Config Connector.
Because that's where we are running our cloud cluster,
in Google Kubernetes Engine.
So it actually comes with these APIs already built in,
or the Kubernetes API already extended,
where you can create Google resources.
You can create Cloud SQL instances,
storage buckets, etc.
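As an illustration, a Config Connector resource is just another Kubernetes object. The sketch below uses the SQLInstance kind from Config Connector's sql.cnrm.cloud.google.com API group, with placeholder values:

```yaml
# Sketch of a Config Connector resource: applying this Kubernetes
# object makes the operator reconcile an actual Cloud SQL instance
# in Google Cloud. Values are placeholders.
apiVersion: sql.cnrm.cloud.google.com/v1beta1
kind: SQLInstance
metadata:
  name: my-app-db
  namespace: my-team
spec:
  databaseVersion: POSTGRES_15
  region: europe-north1
  settings:
    tier: db-custom-1-3840   # machine size for the instance
```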
And it's similar to Crossplane.
If Crossplane had been a little bit more mature
when we started that,
we most likely would have used Crossplane
because we tend to favor open source,
open implementations,
and not going too vendor-specific.
Of course, you need to have a provider of some sort.
We really want someone to provide us Postgres databases.
That's not something that we really want to operate ourselves if we can afford not to do it.
But the resource, the API of how we are creating those Postgres databases,
we would really like that to be as open as possible and sort of a standard,
which Crossplane really represents.
And I know the discussion has come up a few times, whether we
should move to Crossplane. Currently,
we have no real
roadmap
items to support other cloud
environments. That's the other benefit as well
with Crossplane: instead of
the Config Connector from Google, which is
only Google-specific,
Crossplane is more like Terraform
where it supports
a number of providers
which makes that integration
work there a lot easier.
But it works in a similar way, Andy.
That's cool because I just
had a podcast
recording with Viktor Farcic,
whom I know you know, and he
will also be speaking at Cloud Native
Bergen Day, because he's the developer advocate for Crossplane. That's why it's fresh on my mind.
So that's really cool. I'd like to recap a little bit just to make sure. You basically
built an abstraction layer for developers because really what they want to do is they want to build and deploy applications.
So you defined an application object.
You have an operator that then translates this application object
into the actual deployments underneath the hood
without the developers having to deal with the complexity
and also allowing you to switch things in the backend,
like switching from one service mesh to another.
I also like what you said with you as the experts here on the platform engineering team,
you know what good defaults are.
So that means you can start with sensible defaults, but then as you mature, you can
also adjust them. And I think this is the autonomy aspect, right?
You give them a certain level of autonomy, but also guardrails, which is great.
Back to the question that my colleague asked me yesterday, right?
He asked me, you know, do they need to know everything?
And you clearly stated how you can solve this problem by the abstraction layer.
But here's my question.
What if something doesn't work?
Where is the responsibility or what can a development team do
and where do they then need to go to you for troubleshooting? Because I think that's a big
thing, right? What if my application all of a sudden doesn't deploy anymore? How do I know: is it my fault
as an engineering team, or is it the fault of the platform? Maybe it's not even your fault. Maybe it's a fault from the underlying infrastructure provider. This is kind of like the fault domain isolation and fault domains,
which is why I have been interested in knowing how this works in your world.
So I don't have sort of like this one definitive answer, and it's certainly matured over time. And of course, the focus has been a lot
on the deployment interface and
how do you specify the application. And then on the operations side,
we have always given them access to the Kubernetes API.
So you didn't really need to know how to specify your
resources, but you certainly needed to know about the concept of pods.
Each instance of your application would run as a pod.
And of course, that's sort of an underlying implementation detail.
But it's something where we have not found a good way to provide the same abstraction on the operations
side as well. But we do have logs, and we do try to be helpful with the error messages
when you deploy to the platform. For instance: is this a problem of ours, or is it a
problem in your application? In most cases, this is an
application problem, so that's what it states. And all of the helpful guides, etc., start
with 'check your application logs' in most cases. And I would argue that
at least 80 percent of the time, something is wrong on the application side. It's not necessarily a bug of sorts;
of course, the environment, your local development environment,
is slightly different
from running inside a Kubernetes-based environment.
So there might be some of those differences
that you have not accounted for,
with regards to how the database is specified or what you can access,
et cetera. But in most cases, it's the application logs. And then in some cases,
you need to check the status of the pods: is it running out of memory?
Are the liveness checks not working? Sometimes these can be misconfigured. So it's not
easy. And it is certainly, I would argue, a step above having to know whether all
of the different components are configured correctly. We are
very certain that at least your ingress is pointing to the right service
and the service is pointing to the right port on the deployment.
So that isn't really an issue.
But of course, figuring out the exact problem there
has been a challenge for quite some time.
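For listeners less familiar with the pod-level settings being referred to here, the fragment below shows the two usual suspects, memory limits and liveness checks, as they look in a raw Kubernetes container spec. On this platform they would sit behind the application manifest; the values are purely illustrative:

```yaml
# Illustrative container-spec fragment: the two settings most often
# behind "my application won't start" incidents mentioned above.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    memory: 512Mi         # the pod is OOM-killed if it exceeds this
livenessProbe:
  httpGet:
    path: /health         # hypothetical endpoint; must exist and respond
    port: 8080
  initialDelaySeconds: 10 # too low, and a slow-starting app gets killed
  periodSeconds: 5
```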
So, fast forward, and this is quite recent: we actually started work on a proper developer portal.
And that would encompass more of the operational day-two side.
So once the application has been deployed, can we provide a better overview? Can we give better and more insight into the application
in order to debug and look into the performance?
So we did some POCs very, very early on with Backstage,
but concluded that we wanted something,
I wouldn't say slightly different,
but we concluded that Backstage was too big, rather.
It's also trying to do the provisioning of applications, and we were not certain about that
and the integrations with the different resources. We have some very concise, well-defined entry points where
you can get the information. We don't really need them to go look into details inside your
cloud environment, really. We have most of the status in the Kubernetes cluster. So we built
our own, basically. We built our own developer portal that is based around
this application manifest that we have created.
So once your application is deployed, then you get
all of these separate sections and subparts
with the different additional
services that you can provision alongside your application.
So that'd be SQL instances, cloud storage, Kafka topics, et cetera.
We have a fairly concise entry point, or API,
where we can get all of this information.
We just needed to display it.
And a lot of the work... so there are three different
goals here.
One, of course, is giving the developers a better overview of what's running inside their namespace:
what the applications and the resources deployed actually are.
Two is the cost, and that was maybe one of the main drivers, actually.
It was not that operations was hard.
It's certainly hard, but it was not sort of a pressing concern.
It was not like, oh, this is a really, really big issue
and we have downtime all too often
because something is misconfigured and they can't figure it out.
That's, of course, a challenge every now and then.
Most often it's only development environment.
You've done some changes to your application,
either that's the manifest or the application itself,
and then it fails in the development environment.
It doesn't really cause an issue on the production side.
If it's already going to production,
in most cases it will also be caught:
you have a working application that's already running,
and we will prevent the broken one
from actually pushing out all of the replicas of the good one.
But this cost part was really what drove a lot of the effort: to give
the developers better feedback on their resource consumption and how much that is actually costing
us in dollars or euros. Because when we give our developers autonomy, they can actually choose their own journey, choose their own adventure.
So how big an SQL instance do you want?
How much CPU and memory would you like to allocate for this
application here?
Of course, there is a tendency to over-provision.
Well,
it's fascinating,
but I need to interject and just ask a question here.
So do you then, as, let's say, the central platform team,
also provide advice, mentoring, any type of support to say,
hey, by the way, it seems you are 50% over-provisioned most of the time.
And here are some recommendations on proper sizing.
Do you provide this or does the platform automatically provide this in some way?
Or do developers need to manually look into this
and they need to make manual decisions?
Currently, it's manual, but it's less manual than it used to be.
It used to be you had to go into Grafana and find the right dashboard,
or some other analytics dashboards that we have.
So now at least it's in your developer portal.
So once you log on, you open your application,
you get that recommendations there.
But we are not doing that.
We are not automatically adjusting it.
One thing: Henrik, who is on my team,
Henrik Rexed,
I'm sure you've heard his name, he is
Mr. Is It Observable for me.
He did a really cool thing at last
year's KubeCon where
he was basically using observability
data, and he built an assistant
using workflows to,
every day in the morning, basically send
developers an overview and say, hey,
these workloads or this namespace of yours is 50% overprovisioned based on the actual
load.
Here's a recommendation on how to adjust your resource requests and limits.
So basically using observability data and some algorithms to basically say, this is
what I as an expert would do and then provide this.
And then another colleague of mine, Katharina Sik,
she also recently spoke at KCD Romania.
She also just built something really cool
where she's using automation workflows
to automatically open up pull requests
with proposals of changed resource requests and
limits based on current
and predicted CPU and memory
consumption.
That's really cool. Because if you
then have a pull request open that says, here's a
suggestion from your platform,
but it still gives the human
the decision to say,
yes, sounds good, or not good, that is a
really cool thing to do.
Yeah, absolutely.
And it's certainly something,
because we have had the data here for a long time,
at least CPU and memory data.
And we can see how much you are using of those
compared to what you have allocated for the application.
So we are training our developers and application team
to keep an eye and to use this tool that we have built
and to adjust those accordingly.
And what we also do is that we translate those
into a euro amount.
Because we have certainly had instances
where, oh, you have requested 400 millicores
and you're only using 200.
But the cost of that is almost negligible.
So that is not where we are trying to push our developers.
We are mainly focusing on where the big costs are, even though that's a 50% decrease.
Yeah, but if it doesn't make an impact, you want to focus on the real hotspots.
Yes, absolutely.
The big money grabbers, yeah.
Yeah, so there have, of course, been some applications that have been really, really greedy when allocating resources.
And we also have some databases that have been really, really over-provisioned.
It's a little bit harder to auto-scale those and scale them back down and up. So that's not an easy task, and something where we are still looking into how we can provide a better way of scaling those.
You can scale the memory and CPU, but the underlying disk, at least this is Google Cloud, will only increase.
It's very hard, next to impossible, to decrease the disk.
So if you, for instance,
removed a lot of data from a database,
actually decreasing that disk is very hard.
So interesting, and thanks for sharing the insights
on why you built your own developer portal,
because I think it also makes a lot of sense, because you came up with your own application manifest definition. I think the whole CNCF, or like our whole
community, has been asking for kind of this additional level of abstraction of an app.
There's also TAG App Delivery,
the CNCF group that is working
on some of these concepts.
But yeah, so far, I think there's no real application object
out there, and that's why I see many organizations
basically doing what you're doing,
either coming up with your own implementation
with an operator, or looking at tools like Crossplane
to then build composites, as it's called in Crossplane, to
kind of build a level of abstraction. Did you, in your developer portal, also implement the use case
of onboarding and creating a new app from a template? Because this is one of the features
that people like about Crossplane as well. Yeah, no, so that was one of the features that we didn't feel was ready for us,
or we weren't ready for that type of feature.
And it goes back to that we have given a lot of autonomy and flexibility to the development teams,
but there isn't really one template to rule them all.
Of course, there are certainly things that are common across them,
but at least from the platform's point of view,
and it speaks a little bit to the fact
that we have a large number of teams.
So we have about 200 registered teams on our platform.
Not all of them are logical, separate teams.
So maybe there's somewhere between 100 and 200 actual teams.
We have 2,000 of these applications. So it's quite hard for at least the platform team
to know what's a good,
what's a sensible starting application of some sort.
So that's actually something that we haven't really
been able to answer for quite some time.
Maybe we could have done it early on, because there were fewer
teams and fewer applications, but certainly not now. There's
too wide a variety of different
applications, and the
amount of boilerplate is quite minimal. Of course,
you don't really need a GitHub Action,
but that's the CI/CD system that we are using,
where we have a custom build-and-deploy action.
Yeah, reusable.
You need, in most cases, a Dockerfile,
unless the framework that they are using can produce one for you.
And then you need this nice YAML manifest.
Apart from that, there's nothing else, at least from the platform's point of view,
that you need to create.
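For illustration, a workflow along these lines could look as follows. The custom action name and its inputs are assumptions for the sketch, not NAV's actual reusable action:

```yaml
# Hypothetical build-and-deploy workflow; the deploy action name and
# its inputs are illustrative, not the team's actual reusable action.
name: build-and-deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push container image   # registry login omitted
        run: |
          docker build -t ghcr.io/my-org/my-app:${{ github.sha }} .
          docker push ghcr.io/my-org/my-app:${{ github.sha }}
      - name: Deploy via the platform's deployment API
        uses: my-org/deploy-action@v1          # hypothetical action
        with:
          cluster: dev
          resource: .nais/app.yaml             # the application manifest
```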
But I guess you put a lot of effort, obviously, in your application definition, right?
I mean, I guess you, instead of providing different templates that contain, I don't know, Helm charts and all sorts of other YAML files, you invested in defining this abstraction layer of an application.
And I assume to get there, you initially had to talk with a lot of the application owners, like how many applications need a database, what type of database,
and then you abstract it that way.
How do you, I assume you're using Flux or Argo
or any of these GitOps tools to pull this in,
or do you just push these?
Actually not.
So we're very much push-based.
And that goes back to, again,
the manifest is in Git.
So this application
manifest, it's 99%
of what's actually running.
There might be a few variables,
like the container
image; apart from that,
everything is already in Git.
And I really
want to emphasize that it's Git.
You need to have your things in Git.
But then the question still is,
how does this object in Git
make it into your Kubernetes cluster
so your operator can pick it up?
And in your case, it's still a push.
Yeah, it's push-based.
So we have a deployment API that exposes a very limited
subset, or rather, it accepts YAML or JSON objects that are Kubernetes objects, and
then it forwards them to the correct cluster, and then it does the authorization check: is this repository here authorized to do a deployment
into this namespace in this cluster?
So the reason that we are really, really happy with this flow
is that it gives the developers immediate feedback.
So once they push something, the GitHub action is started,
and it builds, and it tests, and it deploys,
and there's an absolute feedback there.
Did it work? Did it not work?
Because the deployment API will wait for the application
to actually be started by the operator,
and it will wait for it to actually get acknowledgement back
that, yes, this application here is actually running.
And for all of the checks that we can see, it's running well.
And this is obviously in contrast to an event-driven model, where with GitOps, right, you don't know how
often the GitOps operator, Argo or Flux, depending on how it's configured, pulls changes in.
And so it makes this a little bit more challenging.
It's interesting. I'm just interested to hear why people are, let's say,
using push versus pull.
What are the reasons for it?
What's your source of truth then?
What is currently deployed on the cluster
and how would you then recover a cluster state?
Do you have any, what's your suggestion?
So let me answer the last one first.
So we have a complete backup.
So if we need to restore a cluster, we can restore that.
So we have, I guess it's Velero that we are using now.
It goes through and it takes a backup or snapshot of all of the Kubernetes resources,
all of the namespaces,
and the persistent volumes as well.
So we can restore those in the case
where something catastrophic has occurred
or someone has deleted something they shouldn't have.
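A Velero setup along the lines described could be expressed as a Schedule resource like the sketch below; the cadence and retention values are illustrative, not NAV's actual configuration:

```yaml
# Sketch of a Velero Schedule: nightly backup of all namespaces,
# including persistent volume snapshots. Values are illustrative.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"         # every night at 02:00
  template:
    includedNamespaces: ["*"]   # all namespaces
    snapshotVolumes: true       # also snapshot persistent volumes
    ttl: 720h                   # keep backups for 30 days
```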
So again, going back to the source of truth, that is up to the developers
to decide. We have taken the stance that, well, there will be a lot of different models,
there are different development flows. We are not here to say that you need to use
this or that. We can come with some recommendations.
What we provide you with
is a development environment
and a production environment.
And we advocate that this should be sufficient
for most of the teams.
But if you want a setup where you're using feature branches and you have release branches, etc.,
that's up to you. You just need to create the correct
GitHub workflow that sort of picks up the branch
on whatever pattern and deploys your application with the correct name
to the correct environment. We don't give you
more environments,
but you can deploy your application
as many times as you like
and then create these virtual environments
if you want to have those.
But we certainly advocate
that it needs to be driven by Git.
The teams have the possibility to go in and edit things,
and, I wouldn't say too much,
they can apply changes from their own machines
because they have access to their own clusters.
Again, it's certainly a challenging operation
because they first need to have a built container image.
It needs to be pushed.
It needs to be available.
You cannot just build that image locally and then deploy it.
So what ends up happening is that things are going into Git,
and then they have a very, very fast feedback cycle
where you can get this up and running, out and deployed,
which really alleviates the
need for doing any manual deployments. From time to time they will
go in and do manual changes. But it's a trade-off that we are okay with. We believe that within these teams, it's contained enough when someone says,
oh, I'm going in and changing this on the fly
for some reason.
And that the team is sort of aware of that.
That might not be 100% true all of the time,
but from our point of view,
we have not caught any outages or problems,
at least in production environments, where the root cause has been,
oh, we have the ability to do manual changes.
Rather, having the ability to do some manual changes actually makes it easier for them to go and fix
or test a really
one-off thing that they need to just check.
Oh, if I put in this number here or this value for some configuration, how does that affect
the application?
Yeah.
Cool.
Hey, Hans Kristian, it's amazing how time flies, because we are almost 40, 50 minutes in already.
For me, it's always great to listen to people that actually really implement platforms.
And everybody has a different background, different company, organization, in your case, government agency, reasons why certain things are done that way.
But what I really like, and I think this is something that I see all over: you are basically,
as a platform engineering team, providing basically an app, a platform as a service, a platform
as an application, whatever you want to call it.
But in the end, you've built an abstraction layer on top of the complexity of the next generation of platforms we're going to build.
And that's Kubernetes for many people.
And even if it's not Kubernetes, even if it's serverless, if it's anything else, platform engineering basically provides this abstraction layer to actually,
and I think this is what I really like, what you said earlier: to really enable teams to be autonomous, to really take the 'you build it, you run it' approach, and actually being able to
execute it without having to deal with all the complexity, because you're working
within a certain
area where you can move left and right because you have,
in your case, this application object that you've defined and that you maintain and that
you have an operator for.
But yeah, that's really, really great.
Now, there are reasons why you built your own developer portal.
It all makes a lot of sense.
And yeah, really cool that you shared this. I know in the previous episode,
when we talked about OpenTelemetry,
you had a great blog post about your OpenTelemetry story.
Are there also blog posts or is there any public material out there
in case people want to read more about your platform engineering projects?
Yeah, absolutely.
So we have written blog posts over the years,
but we are really, really proud of our application documentation,
and we are developing all of this in public,
in the open, and the documentation is in the open.
So if you go to docs.nais.io,
that is N-A-I-S dot I-O.
It's pronounced nice.
It's a nice platform.
You can see all of this from a,
well, how does this look from a developer point of view?
There are getting-started guides
and examples of what the application manifest looks like,
the deployment workflows,
and all of these extra bells and whistles
that you can put onto your application.
And of course, my baby, all of the observability part,
where we have really made it as one-click as you can get
to enable OpenTelemetry instrumentation for your application,
by using, as I mentioned, the agents
that we just attach
to your application when it's starting up.
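For reference, the "one-click" enablement described here appears in the public NAIS docs as a small fragment of the application manifest, roughly like the sketch below; treat the exact field names as illustrative:

```yaml
# Sketch of enabling OpenTelemetry auto-instrumentation in the
# application manifest; the platform attaches the agent at startup.
# Field names per the public NAIS docs; treat details as illustrative.
spec:
  observability:
    autoInstrumentation:
      enabled: true
      runtime: java   # language runtime the agent should target
```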
Really cool.
Folks, we will also add the link to NAIS,
and put the links to NAIS
in the description of the podcast as well.
So we have been quite vocal about our platform
in the platform engineering community and the
rest of the public and government sector here in Norway. And we know for a fact that there are at
least a few people that are a little bit like, oh, how did they manage to take that cool
name? It's a play on NAV, that's the organization, and then it's some
play on that. And having this NAIS, oh, that is nice, and how that is used so much,
at least by my platform team. And then we try to export that
and to market our platform, and they are like, oh no.
And now I also know the title of the podcast is probably something like building nice platforms.
Yes.
So I guess it's this,
it's a play on NAV infrastructure as a service.
And then if you just squint a little,
that would be NAIS, and nice.
Yeah, nice.
That's perfect.
Naming is a pretty hard thing;
finding a good name is challenging.
And in your case, it's perfect.
Cool.
Hey, Hans Kristian,
I'm really very much looking forward
to meeting you in person in Bergen.
It's just a couple,
well, still a couple more weeks to go, but still, you know.
Likewise, and it's going to be
really, really cool. Really nice.
Yeah. So folks, if you listen to this, and if you've never
been to Bergen in Norway, but you want to meet people like Hans Kristian, and also others, like I know
Viktor Farcic is also going to be there,
and some others.
And obviously the local community.
Then I'm pretty sure you can still get some tickets, maybe.
Or are you sold out already?
No, not yet.
We are almost sold out of early bird tickets,
but I guess there will be one or
two tickets left
by the time the podcast is airing.
So yeah, cloudnativebergen.dev is the place to go.
Exactly, cloudnativebergen.dev.
Also a link that we add to the description.
Cool.
Hans Kristian, I say thanks again.
And Brian, thanks for doing all the post-production as always.
And next time, I'm sure I'll have you back as my co-host.
Awesome. Thanks for being here.
Bye.