PurePerformance - Observability in the AI‑Native Era with Hilliary Lipsig and Rob Rati

Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Welcome everyone to another episode of Pure Performance. And unfortunately, for the second time in a row now, this is not the sexy voice of Brian Wilson. It's just a regular voice with an Austrian accent, from Andy Grabner. I still hope that people are not turning off or unsubscribing from our channel

Starting point is 00:00:43 because they don't hear his voice. I hope the content that we produce is still as interesting. But Brian, looking forward to having you back. I know you're back on the next episode. Today, I'm actually here with two of my friends and co-authors of Observability in the AI-Native Era. And Hilliary, I think you just held the book up.

Starting point is 00:01:04 I know people won't see the recording now, the video, but can you hold it up again just to make sure that this is for real? Observability in the AI-Native Era. Hilliary, you are responsible that this book exists. So I want to let you introduce yourself first. And then I would like you to give our other author the chance to introduce himself. But actually, tell me first why we started this book and then why you brought Rob in, and then, Rob, it's up to you. And then we'll kind of circle around.

Starting point is 00:01:39 Yeah, thanks, Andy. So, as Andy said, my name's Hilliary Lipsig. I've been on Pure Performance now a couple of times, so it's always good to be back. Day job right now is I'm a senior principal site reliability engineer at Red Hat, where I also run my own podcast, GitOps Guide to the Galaxy, where I've had Andy as a guest in return many times too, so we do this, I guess, a lot. Much like the last book we wrote together, Platform Engineering for Architects, the motivation behind writing this book was there was a book I wanted to read on AI and

Starting point is 00:02:19 IT operations and observability that didn't exist. And so when Packt came to us and asked, hey, would you be interested in writing a book on this topic? That was the proposal I put together for them: the thing that I felt I needed personally. It was the book that I personally needed. And I wouldn't call myself like an

Starting point is 00:02:47 AI expert, certainly not leading into the book. I had to do many, many, many, many weeks of research and reading every book I could find on AI to make sure that what we were about to write would represent it correctly.

Starting point is 00:03:04 And as much as, you know, I'm a reliability engineer and I've got, you know, pretty solid expertise on observability, like, Andy, it does not compare to yours. You know, you're the person I sanity-check myself against. So, of course, having you involved was a must-have, right? I don't think I would have done it if you hadn't said yes. And then what I really wanted was a solid, like a business case, like a use case, to really tie together the theory we were showing, and that's where Rob comes in. So Rob and I worked together at Red Hat in the reliability engineering organization. I'll let him introduce himself more, but then he went on to do greater

Starting point is 00:03:46 things where he had more visibility into business needs and so forth. And I knew that he could really bring that business case perspective, and of course obviously the technical acumen, that goes without saying, but bring that piece that I was worried would otherwise be missing. So Rob, go ahead and tell the lovely audience a little bit more about yourself. Sure. Thanks, Hilliary. My name is Rob Rati. As Hilliary kind of mentioned, I kind of met her and became friends, right, when I was in SRE at Red Hat. And I've had, you know, a journey through small, medium, and large companies in various different areas, you know, software engineering, platform engineering, SRE, management, at these different companies. And, you know, as I've gone from small to medium to large, certainly you see a lot more, you know, similar problems in the large companies, but you start seeing them, you know, when you're in the medium

Starting point is 00:04:45 and the small. You start seeing some of the origins of some of those problems that you see in large companies. But so what I kind of, you know, have seen throughout my journey is that there's a lot of stuff that's very, very common, and the larger the company, the more of these kinds of issues you'll kind of see. And so, you know, Hilliary came to me and asked about this book. And I've never done a book before. I've never done a podcast before. This is a first for me as well.

Starting point is 00:05:15 And now she came to me and asked me about it, and I had never done it before. But I was, you know, I was kind of raised in this area. I'm a problem solver by nature. So in all my jobs, I'm not just accepting what we're doing. I'm asking why. And I dig down, and that's where you start coming up with these common, you know, root-cause kind of problems throughout these organizations.

Starting point is 00:05:39 And, you know, I said, well, sure. I mean, I've got that kind of expertise and that knowledge because I just want to have it. I don't like just accepting the status quo. And so I said, I'll take this on and, you know, here we are. Here we are a couple of months later after we started the whole thing. It's phenomenal. Hilliary, you already have a printed version of the book in your hand.

Starting point is 00:06:03 It was just recently released. We gave out the first couple of copies at KubeCon in Amsterdam in mid-to-end March, which was great. We were also lucky to meet Max, Max Körbächer, who was the person that inspired us for the previous book, and he also wrote the foreword. And if you guys are okay, I won't read the whole foreword, but I wanted to quickly read maybe some motivational stuff for why anybody should look into the book that we wrote. And he starts, you know, with why this is important in the day and age that we are in.

Starting point is 00:06:41 But in the end, he says: for engineering leaders, this book is a roadmap to building more resilient, efficient, and autonomous systems. For practitioners, it's a playbook for turning raw telemetry into intelligence. And for anyone who has ever been woken up at 3 a.m. by a paging alert, it's a promise: the future of operations is one where machines do the heavy lifting and humans focus on what they do best, innovation. The AI-native era has come to stay, to evolve, and to revolutionize. Let this book be your guide. Max, if you're listening in, thank you so much for the lovely words.

Starting point is 00:07:23 It was a really interesting journey. And Hilliary, I know you said that, obviously. From an observability perspective, I've been now working in observability for 18 years and before that did a lot of performance testing. So I didn't have to do as much research, maybe, as you did on the AI topic. But still, and I remember this in the early days of the book, we wanted to make sure we're not just writing something and making constructs and coming up with artificial use cases. We also did a couple of interviews with organizations, with people, to try to figure out what

Starting point is 00:07:57 observability in the modern day and age looks like, and what their challenges are. So I think this was also very important, that we got some additional input from external people besides the research that you did. Yeah, but I'm really glad in the end, even though I said it multiple times, I will never write a book again. I'll say it again. This time, Hilliary, I'll try. We'll see what happens next.

Starting point is 00:08:24 But I'm very happy that I said, yes, that's what I want to say. It was a really good journey and a lot of great lessons learned. And I'm really glad to have this book now because most of the stuff, everything that we wrote is very beneficial right now in all of my conversations. Yeah. Yeah. And that's kind of the same boat that I'm in where it's really helping me in my conversations as well. Yeah, like, you know, I also said I wasn't going to write another book again. and you know

Starting point is 00:08:56 Andy and Rob you guys know this I had just had shoulder surgery and could not use my left arm when we decided to do this and I was expecting to just be taking several weeks and focusing on healing and relaxing and then I was like oh it turns out I'm bad at that I'm bored

Starting point is 00:09:22 and that was something so over the many conversations we'd had where I think the two of you had to take all of the notes in our conversations because I couldn't do anything including those interviews those early on interviews I could barely type

Starting point is 00:09:39 I think we really relied on Rob's notes for those. So that was very, that was almost comical. It was comical, actually. It was just very comical, the whole situation of trying to write a book with one arm when you haven't set up any accessibility software ahead of time. So, but the online interviews, those were so helpful, right?

Starting point is 00:10:06 Those experts who lent their time, and many of whom were like, hey, I would like to actually be anonymous. Like, don't credit me, don't cite me. I think I said it in our acknowledgments, that is everything that's great about our industry. People who are willing to just, like, sit and lend expertise to the efforts of others without asking for anything in return, because they care about making sure that good information is out there, the information that they needed that they didn't have. Honestly, like, again, we can't name them by name. A lot of them chose to be anonymous, but, like, I can't thank those people enough. That was just absolutely above and beyond, and incredible of them to do.

Starting point is 00:10:51 I think, you know, one of the other very funny things, Rob, was when you joined the conversations in our early meetings and we were talking about what kind of problem sets we wanted to have. And we were talking about what we are seeing in the industry across the three of us. I think, correct me if I'm wrong, we pretty much noticed that despite all of our varied access points to information, the trends were, like, really, really almost identical. Yeah, that's actually, it's interesting you mention that. As you were commenting about it, we kind of discussed that these are the problems that we've seen in our careers, and we had those interviews. And in the interviews, those anonymous sources and whatnot, they, you know,

Starting point is 00:11:38 ended up relaying very, very similar issues, right? No prompting or anything from us. We asked simple questions about, hey, what's, you know, what's working, what's not working. It tends to be pervasive across the industry. And one of the earliest times I discovered these kinds of things was in my first job. And as I kind of alluded to, I don't like to accept just the status quo. And so we were being required to do these things, and I'm a junior engineer. And I would ask, why are we doing this? This doesn't make any sense.

Starting point is 00:12:09 This isn't right. And the answer I got back was, that's just the way we've always done it. And like that was, that was the accepted answer. I'm like, but that doesn't make any sense. Like literally what we're doing doesn't provide value. And that was the answer. And at this point, I was very frustrated. And I was at a wedding reception with a cousin of mine who was in, you know, insurance,

Starting point is 00:12:29 he's in insurance claims. And I was relaying the story. And I said, well, I was talking about the frustration and stuff. And I said, so I asked my boss, you know, I said, so why are we doing this? And he responded, because that's the way it's always been done, right? So now we're talking not even technical, right? But there's that pervasiveness of these same problems existing across industries. That's the insurance industry versus the technology industry.

Starting point is 00:12:52 Same answer to the same question. And so, you know, the very idea that we see a lot of the same issues across companies and across organizations, across businesses and stuff, is just kind of the way it is. Like, it's not an outlier. It just is. Yeah. And this also reminds me, I've started to go on LinkedIn a little bit and post some snippets of some of the chapters that I

Starting point is 00:13:21 contributed, and this also reminds me, like, you know, we're doing it the way we've always done it because that used to work, and why shouldn't it work going forward, even though the world around us is changing quite dramatically. And one of the things that I quoted in my last post, this actually came from my personal work experience. One of our users asked me, I need to create 10,000 SLOs, please help me to do this, right? And I said, why? Well, because we've always done it that way on our monolithic applications, but now we need to do it on our microservice, cloud-native, highly scalable architecture. And so that means, if you do the math, instead of having 100 SLOs,

Starting point is 00:13:59 I now need 10,000 SLOs. And I explained, I understand where you're coming from, but maybe you need to rethink what observability is for you and what you want to get alerted on. Do you really want to get alerted on, and maintain these alerts for, 10,000 different metrics? Is this really the way this should work? And it was really a tough discussion, because if you have been doing certain things for so long and you became an expert and it used to work, it's obviously very hard to change. And I think we're also now seeing this with everything that happens around us, right?
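
The contrast in that exchange, one hand-maintained definition per service versus a single objective applied everywhere, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SloTemplate`, `evaluate`, the 99% target, and the sample request counts are hypothetical names and numbers, not taken from the book or from any specific observability product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTemplate:
    """One objective, defined once and applied to every reporting service."""
    name: str
    target: float  # required fraction of good requests, e.g. 0.99


def evaluate(template, request_counts):
    """Return the services that miss the target.

    request_counts maps service name -> (good_requests, total_requests).
    Services with no traffic are skipped rather than flagged.
    """
    breaching = []
    for service, (good, total) in sorted(request_counts.items()):
        if total == 0:
            continue
        if good / total < template.target:
            breaching.append(service)
    return breaching


# Hypothetical sample data: new services need no new alert definition.
availability = SloTemplate(name="availability", target=0.99)
counts = {"checkout": (995, 1000), "search": (970, 1000), "idle": (0, 0)}
breaches = evaluate(availability, counts)
print(breaches)
```

The design point is that adding the 101st or the 10,000th service changes the input data, not the number of definitions someone has to maintain.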

Starting point is 00:14:34 Everything that happens with AI, and many of us need to rethink what our role is. Many of us need to rethink what our day-to-day work looks like in the age of AI. And it's going to be challenging for many, but it's just the reality: change is the only constant. And this has been said many times. But yeah, I also saw this in the book. Yeah. Essentially, you mentioned that, Andy. You said... sorry, Hilliary, that is actually one of the threads that we weave into the journey of our fictitious company, right?

Starting point is 00:15:07 is that it's a cultural change, right? Observability is almost like this red-headed stepchild. We have something that kind of works. We don't keep track of industry trends. We don't keep track of where it's going. We don't keep track of tooling evolution. At some point in time, whoever created the stack, or the people who know about it, they said, oh, these open source tools, Prometheus and Grafana and things like that,

Starting point is 00:15:30 or whatever the tool set that they're using, but they're not staying up to date with normal, with how things are evolving like they normally would in other parts of industries. And that kind of gets into this problem. And it is actually, as you're kind of relaying, in my opinion, like a cultural change. Observability needs to be that, you know, first class kind of citizen and understanding when you're designing and keeping it up to date just like anything else. Because as it evolves and as we kind of highlight through this journey, you know, as the computing evolves, observability needs to evolve as well.

Starting point is 00:16:03 The same practices can't work as you, you know, go to distributed and highly available type systems in today's environments. Yeah. It's funny because there is a little bit of a tension: the way we've always done things is not holding up.

Starting point is 00:16:24 And yet, so many of the software design patterns are. But because of the evolution of technology, the design patterns just kind of, like, crib back from, okay, well, these are things that have worked and do work, and we know how to scale and we know how to secure. But the operations of them in the newer paradigms, right, where it's not a monolith but now a bunch of microservices: just because the data flows follow the same design pattern doesn't mean the operations follow the same design

Starting point is 00:16:57 patterns. And so I think what's happening is a false dichotomy of, like, okay, well, you know, it's this design pattern, so it's this operations. And it's like, no, because, you know, the platform has changed. And I think that's actually really why we leaned into platform engineering so much in our narrative, precisely that. You know, I was talking to a couple of my mentees recently. And I was explaining, like, the important skill for an SRE is the ability to decompose a system, right? We need to be able to look at the system and see everything it does and then break it down and decompose it into what needs insights and observability, and what needs orchestration, like, beyond what, you know, Kubernetes or what have you is already taking care of. And I actually still recommended the book Design Patterns from 1994,

Starting point is 00:18:02 because we're still writing software with those patterns, even though, you know, what that means is now different. But it's one of those things where we do need to understand how things have always been done, how software has always been written, what design patterns we always lean back on. But we also need to understand that what operations means has changed. Because, yeah, again, a lot of data storage and data flow is in the pattern, but the exact implementation details are what make, like, that is one of those devil-is-in-the-details moments, actually. And that is the factor that makes modern operations

Starting point is 00:18:45 look so different. And especially in cloud-native environments, like, you know, at the request of our publisher, we covered, you know, VMs and serverless and not just Kubernetes. And so in that way our scope is really broad. But I do think that we showed the most depth on Kubernetes, frankly just because of comfort. But we didn't go as deep into any one of those architecture models as we could have, simply because we had to hit on all of them. But again, some things still just work, kind of regardless. If we look at an implementation detail of Prometheus, that can just,

Starting point is 00:19:29 Prometheus and Grafana can just work, you know, regardless of your architecture, or even if you have a hybrid architecture, right, you've got some VMs, some Kubernetes, maybe a serverless thing here or there.

Starting point is 00:19:43 You can, you know, use something like Prometheus to bridge the data collection aspect. But there's data collection and there's data normalization. And then there's, turning that data into information, right? Or turning it into insights.

Starting point is 00:20:01 We said it both ways in the book. Both ways are correct. Because information and insights are different. So, you know, how we've done things before is important. It informs a lot of what we decide to do now, but we have to evolve. Or, you know, we're going to end up not scaling our efforts correctly. It's not even about scaling organizations and scaling headcounts. It's about scaling our efforts correctly. Putting the right amount of work into the right place is probably

Starting point is 00:20:37 the thing I care about the most. It's the thing that I definitely champion the hardest in my day job as well. Is this going to get us the correct benefits for the effort that we're putting in? Because sometimes the answer is no. And it's very, very difficult, I think, for us as an industry to handle that the answer is no, especially if we've come up with something brilliant or elegant that would be really cool to do. I think that's actually even more relevant,

Starting point is 00:21:08 sorry, again, in this space, right? Because you're talking about huge, when you're doing that evolution, right, from an on-prem or a monolithic architecture to these kind of cloud-native architectures, you're also talking about an explosion in just the amount of data that you're having to deal with. Yes.

Starting point is 00:21:24 That's an excellent question in terms of, okay, we all love data. We want to have data. Is this data going to get us somewhere? Because now there's a real cost to that question. And also, to follow up on what you said earlier, Hilliary: certain things are changing. Certain things are still true, even if the world around us is changing. We talked a lot about that. I remember it's either chapter one or chapter three,

Starting point is 00:21:53 where I talk about how context is king. So, enriching data with context. I brought the example of going from pets to cattle, right? As we're treating infrastructure that way, we also need to rethink how we deal with observability data: we cannot give servers pet names, but we need

Starting point is 00:22:09 to come up with a good tagging strategy. We need to come up with good metadata, and observability data needs to be enriched so that if we're looking at this amount of data, somebody can actually make sense out of it. But I also make the point that not every data problem can be solved with machine learning and AI; certain things are universally

Starting point is 00:22:28 still true. And this is to your point that certain things always work a certain way and they will always work a certain way. And I brought a couple of examples. One of them was, you know, a service connecting to a database. That means you are either doing smart queries or you are hitting the N+1 query problem, where you are doing too many round trips to fetch certain data. Connection pools,

Starting point is 00:22:55 thread pools, these are all things we always had and we will always have in future architectures, and I don't need to have a, let's say, very elaborate, expensive AI or machine learning to understand the dependency

Starting point is 00:23:11 between a service, a connection pool, and a backend service. So I can also, like, always just alert on certain things. But then the world is changing. We have very dynamic systems. We have a lot of new data and indicators. So I can't put a threshold on every metric. That's why, with the whole argument I made earlier, it doesn't make sense to specify 10,000 custom alerts, or even if

Starting point is 00:23:36 we call them SLOs, because in the end, an SLO is also an alert, a custom alert that you're creating. So what I wanted to say is certain things hold true for a long, long, long time and will hold true for the future, but the world around us is changing, and so we also need to change with it. Yeah. And I think the point we make in the book is, like, AI doesn't necessarily replace automation, right? AI can leverage automation. I highly recommend that pattern, right? You write your automation and you tell your AI, okay, here are your choices of automation to use, especially if you want it to fix things autonomously. But at the end of the day, like, an AI implementation can be expensive. You can keep those costs under control and you can get ROI and you can balance them.
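
The N+1 query problem raised a little earlier in the conversation is a good example of something a fixed rule can catch without machine learning. The following is a minimal sketch, assuming trace spans are available as plain dictionaries; the field names `trace_id`, `kind`, and `statement`, the helper `find_n_plus_one`, and the threshold of 5 are all hypothetical choices for illustration, not the schema of any particular tracing tool.

```python
from collections import Counter


def find_n_plus_one(spans, threshold=5):
    """Return (trace_id, statement) pairs where the same database statement
    ran more than `threshold` times within a single trace."""
    counts = Counter(
        (span["trace_id"], span["statement"])
        for span in spans
        if span.get("kind") == "db"  # ignore non-database spans
    )
    return sorted(key for key, count in counts.items() if count > threshold)


# Hypothetical sample: trace t1 loads one order list, then one query per order.
ITEM_QUERY = "SELECT * FROM items WHERE order_id = ?"
spans = [{"trace_id": "t1", "kind": "db",
          "statement": "SELECT * FROM orders WHERE customer_id = ?"}]
spans += [{"trace_id": "t1", "kind": "db", "statement": ITEM_QUERY}
          for _ in range(10)]
spans += [{"trace_id": "t2", "kind": "db", "statement": ITEM_QUERY}]
spans += [{"trace_id": "t1", "kind": "http", "statement": ""}]

suspects = find_n_plus_one(spans)
print(suspects)
```

The rule is deliberately simple: it counts repeated statements per trace and applies one threshold, which is the kind of check that stays valid regardless of how the surrounding architecture changes.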

Starting point is 00:24:24 And we actually talk a little bit about those economic models in the book as well. But where you don't need it, you know, don't bring a sledgehammer to a job where you need a screwdriver, right? Yeah, you might get the thing in the wall, but that's really not how you needed to do it. So automation doesn't lose its value. Automation still continues to be incredibly valuable. You can have the AI identify and write new automation. That is a thing it's fairly good at. I've been experimenting with that myself. But, you know, you're not going to lose out on benefits and return on investment by continuing to make investments into automation and a good automation strategy.
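
The pattern described here, writing the automation first and then giving the AI a fixed menu of automations to choose from, can be sketched as a simple allow-list. This is a minimal illustration, assuming the AI's suggestion arrives as a dictionary with `action` and `target` keys; the function names, the registry, and the proposal format are all hypothetical, and the handlers only return strings instead of touching real systems.

```python
def restart_service(target):
    """Stand-in for an existing, tested automation."""
    return f"restarted {target}"


def scale_out(target):
    """Stand-in for another existing, tested automation."""
    return f"scaled out {target}"


# The only actions an AI component is allowed to trigger.
APPROVED_AUTOMATIONS = {
    "restart_service": restart_service,
    "scale_out": scale_out,
}


def run_proposed_action(proposal):
    """Execute an AI-proposed action only if it is on the approved list."""
    action = proposal.get("action")
    handler = APPROVED_AUTOMATIONS.get(action)
    if handler is None:
        return f"rejected: {action!r} is not an approved automation"
    return handler(proposal["target"])


print(run_proposed_action({"action": "restart_service", "target": "payments"}))
print(run_proposed_action({"action": "drop_database", "target": "payments"}))
```

Keeping the registry in ordinary code means the automation keeps its value on its own, and widening what the AI may do autonomously becomes an explicit, reviewable change to the list.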

Starting point is 00:25:21 Because even if things change, the history of what was automated and how that worked or whatever can be another set of information that you feed into your AI models for it to help build out the next generation. So, and again,

Starting point is 00:25:37 sometimes less is more, right? I don't need to go through an entire implementation to get an AI to run something every so often when a cron job works just as well, right? You know, there's there's a time and a place for everything. And I think a lot of the case that we made for AI is that it's a unifying layer,

Starting point is 00:26:05 a unifying abstraction layer. So I talked earlier about data normalization and potentially different data sources and data having different forms. That's one of the most annoying problems to solve. It's a solvable problem. I worked in IoT. It's a very solvable problem. But if you add a new data source with a new data form, you have to solve the problem again with those data transformations.

Starting point is 00:26:28 And folks who've worked in Apache Airflow and Argo Workflows know exactly what I'm talking about. Or you've written custom data transformation pipelines, because it's not that unusual. But that is one of those ways where, okay, like, you can have the AI do the interpretation for you, and that can make a lot of sense, and you can have it write out its interpretation somewhere or just have it reinterpret every time, just depending on your business case and need,

Starting point is 00:26:54 like, whatever actually makes sense for you in terms of where the costs need to live. Should the cost live in data storage or in GPU processing time, whatever it is. So, you know, there is a case to be made

Starting point is 00:27:11 still that, like, automation is good. And I've said it at my work: you know, we can't just be implementing AI for the sake of AI. We still have to have an answer to, what is our automation strategy? Right. What are the things that we are fully comfortable with having totally automated and autonomously navigated? And then what would it take for us to get comfortable with the next set of things? So it's not a magic sledgehammer. It can feel like magic, but it's not.
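
The data normalization problem mentioned a little earlier, where every new source with a new data form means solving the transformation again, looks roughly like this when done the conventional way. This is a minimal sketch under stated assumptions: the source names `vm_agent` and `k8s_exporter`, their field names, and the common schema (`resource`, `value`, `timestamp`) are invented for illustration.

```python
# One hand-written field mapping per data source (raw field -> common field).
FIELD_MAPS = {
    "vm_agent": {"host": "resource", "cpu_pct": "value", "ts": "timestamp"},
    "k8s_exporter": {"pod": "resource", "cpu": "value", "time": "timestamp"},
}


def normalize(source, record):
    """Translate a source-specific record into the common schema.

    Raises KeyError for a source without a mapping, which is exactly the
    moment the transformation problem has to be solved again.
    """
    mapping = FIELD_MAPS[source]
    normalized = {common: record[raw] for raw, common in mapping.items()}
    normalized["source"] = source
    return normalized


print(normalize("vm_agent", {"host": "vm-1", "cpu_pct": 71.5, "ts": 1700000000}))
print(normalize("k8s_exporter", {"pod": "api-7f9", "cpu": 0.42, "time": 1700000005}))
```

Whether an AI layer interprets new forms on the fly or a mapping like this is extended by hand is the cost trade-off discussed above: storage and maintenance on one side, processing time on the other.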

Starting point is 00:27:46 There's still a time and a place for doing things the quote-unquote good old-fashioned way. Talking about good old-fashioned, not the drink now, but just the book. We split the book into multiple chapters, also in the beginning. I'm not sure who out there has ever written a book, but the way we addressed it, right,

Starting point is 00:28:06 we obviously had a certain idea about what needs to be covered. We broke it into 10 chapters. We defined what the goal of every chapter is and kind of what it is covering, and then we also assigned individual chapters to us authors. Because I got some feedback and questions from people: how do you split up a book between three authors? Do you do it by topic, by chapter, do you just collaborate on every single piece? And the way we decided, like, it also worked pretty well with

Starting point is 00:28:37 the previous book: we assigned authors to chapters, or chapters to authors. And Rob, looking at you now, you covered all of the chapters that were basically putting into practice the things that Hilliary and I were writing around AI and observability, and applying them to ACME Financial Services, that fictitious company, which I have heard is really a great way for us to not just explain individual use cases disconnected, but really bring it all together. For potential readers that are listening in now, Rob, would you quickly reframe or just explain who ACME Financial Services is and just roughly sketch the journey that they're making? I know you talked about it a little bit earlier, but it would be nice. Sure.

Starting point is 00:29:35 Yeah, and so, yeah, I, you know, as Andy said, I did those chapters about ACME Financial Services. And so when we were kind of kicking around, you know, the ideas for this book, really what I wanted was to take all of the information that Andy and Hilliary are putting out there, all that technical stuff, and put it into a business context. Because, you know, it's one thing to kind of see the technical side of it, and another thing, you know, in businesses, there's all sorts of, you know, many more complexities than just the raw technical. And having been through a lot of different companies and seeing a lot of different things, you start seeing that the larger the company, the more issues and whatnot you're going to see in terms of culture, in terms of, you know, hodgepodge solutions, in terms of hybrid implementations. And that, you know, kind of comes about as the company grows organically.

Starting point is 00:30:30 When you're small, you're kind of, you know, trying to make things work, and you grow and you don't necessarily have time to address the stuff that you kind of, you know, kludged together and made work. You try to improve it as you can. As you grow, maybe you start acquiring other companies that have gone through

Starting point is 00:30:49 their own things, and you are merging, or you've expanded into new, you know, areas. And certainly having worked in regulated companies, right, in regulated environments, it's a whole other world of things you have to worry about, not just the technical stuff. Now you've got the evolution of regulations and things that you're worrying about. And so you end up,

Starting point is 00:31:07 the larger you go with these kinds of companies, with just more and more of these issues that kind of stack up, combined with a cultural problem in this space. And you know, so now you've got very large, complex systems, even just saying, hey, we're moving from a monolithic to, you know, a cloud-based architecture, right? That's more and more services and things. And it highlights these problems more and, you know, larger. And even in a lot of the large companies that I've worked in, you know, the observability practices, they kind of worked, but they were even behind the times of, you know,

Starting point is 00:31:47 modern-at-the-time kind of observability practices. It was one of those things where it was working well enough. And as I kind of said earlier, it's not that first-class concept. Like, we need to be constantly addressing this and improving it, asking that question: is this the most efficient way to use our resources? You can have a big system and have it go down and bring 100 people on the call. Is that really the best way to find out that something is wrong and how to solve it? A customer says, you know, I can't access my bank account, and that's how you find out,

Starting point is 00:32:19 you know, maybe as a small company, that was something that was more acceptable, but, you know, that stuff has to evolve. And so that's really what, you know, ACME Financial's journey is about. They're a large company in regulated industries across multiple businesses, you know, doing lots of things. And they've got a lot of technical debt. They've got a lot of, hey, you know, we understand the software side of it. We need to be cloud native.

Starting point is 00:32:47 We're making those changes. But they've left behind, as a lot of companies do, that observability side of it. And so they did their, you know, cloud migrations. The beginning of our story is, hey, we've just completed our cloud migration. And now our observability is just killing us. We can't keep anything up. We've got too much stuff. Our old practices just don't work.

Starting point is 00:33:06 And because, you know, as I said, the company is not keeping up to date, they're way behind on current observability practices. They've got all these problems and they're underwater and it's, you know, it's costing them, you know, reputation and all these things. And so they have to go on a journey of figuring out, okay, what's wrong? They have to do root cause analysis and, you know, figure out what's broken and that. Okay, now how can we possibly start figuring out how to fix these things? And that's where you have to, like, they have to go out and research and say,

Starting point is 00:33:35 what are common practices, because these are new problems to them. And now they're seeing, oh my gosh, we're so behind. There's all these ways to do things. And they kind of go through and they, all right, we make this first set of changes. And they think, all right, you know, we, again, that's cultural. They make a first set of changes along with their culture. They don't necessarily have a strong central authority that can dictate how certain

Starting point is 00:33:57 things will be done or even do the implementation. So they do the first wave of implementation in their current culture. And they find out, okay, we got some benefits. But now it unearths all these other problems that we didn't realize were there. And so again, they're doing this. Okay, well, now these are new problems. We have to figure it out. And now they're kind of realizing this is more than just technical stuff.

Starting point is 00:34:18 There is a culture change that has to happen here to allow the company to succeed at this scale. Right. And that's kind of the journey they go through. And at the end of their journey, right, they've kind of finally made those cultural changes. They've made those thought changes. And now they're kind of starting to accept the tooling and allow a lot of that busy work to kind of fall away. And they start having more time to do the things they always wanted to do. If you're in these large corporations or even a medium-sized corporation, you will undoubtedly have some kind of an infrastructure set up.

Starting point is 00:34:52 That is working. But everybody says, don't touch it. Nobody knows how that works anymore. Where the networking system is so confusing, nobody can figure it out, right? And we don't have the time to figure it out. Or the same kind of, you know, problems in software, right? Longstanding bugs or longstanding weird features that, you know, it kind of works. And we just, the authentication mechanism works, but we don't really understand it anymore. We don't want to touch it. That type of thing. So you start getting that time to start tackling

Starting point is 00:35:23 some of that debt and making, you know, the business and the application or the infrastructure better because now you're not just spending all your time, you know, fighting your system. Yeah. The code that will survive the longest in production is the code that was supposed to be a proof of concept. Yes, exactly. Right.

Starting point is 00:35:42 This is, I have never worked somewhere in 16 years, inclusive of current job, where that has not been the case, at least one or two places. In fact, Rob, you shared context on things that we can't explicitly say, but there are some things that were written as proof-of-concept code when you were here at Red Hat that are still there. Exactly. You know, there's always some business pressure to move on to the next

Starting point is 00:36:10 thing. And that's one of the ways technical debt piles up. And the other thing is just obviously, as things modernize, even if you did everything like perfectly the first time, it ages. And typically it doesn't age like fine wine. It ages more like on the counter. Yes.

Starting point is 00:36:28 Definitely. Yeah. But Rob, thank you for walking us through that story. And as we know, there's many well-established enterprises out there that probably feel the pain, and that's great that we also tell the story and walk the journey of such a company and not just providing a book where we are writing, this is how you should do it and this is the best practice, but really seeing how this applies to a company that is well-established and has more than just technical challenges. So I think this is also what I like about the book, that it's both to learn a lot of stuff about observability, about AI, about security, but then see really not only how it can be applied on the greenfield, but also how it can be applied with all the other complexity around it in a large enterprise.

Starting point is 00:37:17 Now, Hillary, at the end, I think it was chapter 9 that you wrote. The title is No Future Without Challenges, and I think you talk about one of your favorite topics around ensuring security, compliance, building trust. Cost was also a big topic,

Starting point is 00:37:37 mitigating risks such as hallucinations and unreliable outputs. Any things that you want to highlight from this part? Because it's obviously a very important topic and I know you could probably, we could fill a full podcast with this,

Starting point is 00:37:51 but if I just give you two, three minutes, what would be the things that stand out in this chapter? Yeah, so for me, I think actually somebody called it out in one of the reviews, and I appreciated the attention to detail in that review. I really did. One of the things that I call out is that if your AI is doing write actions, it needs to have its own identity for auditing purposes. Right now, a lot of us are using AI on our local machines,

Starting point is 00:38:22 and so it gets to inherit our permissions and then act within the boundaries of those permissions. That's likely too permissive. I know for me, because I am fairly senior in my organization, I have God mode in places. I try to, you know, follow least privilege and so forth and not have more permissions than I need. But somebody has to hold these permissions, and I get that glorious responsibility through the process of elimination, in that other people who had it left and now it's me.

Starting point is 00:38:59 I was thinking maybe you were the last person to put your finger on your nose. I've also done that, yeah. Functional, yeah, last one in, last one out. So that's what happened here. But so having an AI system running on my machine, running with my level of permissions can be, could be catastrophic

Starting point is 00:39:18 if it makes a bad decision. So it needs to have its own identity. It needs to have its own permission set. You know, running on my machine with my level of permissions is okay for something that's disposable, something that is not going to go anywhere, something that is just for experimentation or so forth. But we're actually using these systems in production. We need to be using least privilege. So it needs to have its own identity.

Starting point is 00:39:41 We talk a little bit about the type of governing, you know, there's managed identities in Azure and Google. I think it's called, I forget the name already. I haven't had coffee yet. It's my fault. STS in AWS. These are sets of permissions accompanied by short-lived tokens, right? So the AI having its own.

Starting point is 00:40:04 And then also governance technologies like AI gateways or middleware to kind of be capturing in the prompts and responses whether or not things are, I'm going to use the word kosher. So those areas are really, I think, very key.

Starting point is 00:40:22 When we were writing the book, I didn't see people talking about these things. Now that the book is published, I coincidentally do see other people talking about these. There were some projects that were already in the works to do exactly what I'm talking about that just hadn't been unveiled at the time of publication that I wish we could have included. But I think that's the most important thing, that we still follow the pattern of least privilege. Give your AI least privilege and its own identity. And you will need that for auditing purposes, I guarantee you.
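
Editor's sketch: the idea above, an AI agent with its own identity, its own narrow permission set, and a short-lived token, can be illustrated roughly as follows. This is a minimal sketch in plain Python, not the API of any cloud provider; `AgentToken`, `issue_token`, `authorize`, and the action names are hypothetical names chosen for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentToken:
    """Hypothetical short-lived credential tied to one agent identity."""
    agent_id: str               # the agent's own identity, not a human's
    allowed_actions: frozenset  # least-privilege permission set
    expires_at: float           # epoch seconds after which the token is invalid


def issue_token(agent_id, allowed_actions, ttl_seconds=900, now=None):
    """Issue a token that expires after ttl_seconds (assumed default: 15 minutes)."""
    issued = time.time() if now is None else now
    return AgentToken(agent_id, frozenset(allowed_actions), issued + ttl_seconds)


def authorize(token, action, now=None):
    """Allow an action only if the token is unexpired and explicitly grants it."""
    current = time.time() if now is None else now
    if current >= token.expires_at:
        return False
    return action in token.allowed_actions
```

Because every decision is keyed on `agent_id` rather than on the engineer whose machine the agent runs on, an audit record can attribute each action to the agent itself.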

Starting point is 00:40:51 Oh, definitely. Speaking, you know, you're in a regulated field, like that's all incredibly important. And also a lot of like regulated industries and, you know, companies all want to push their cloud offerings. When you're in a regulated

Starting point is 00:41:04 industry, that is a hard sell to get into a cloud offering. So a lot of companies end up self-hosting. So now those companies, right, if you're using AI and you're getting all this data into a central location so your AI can function on it, right?

Starting point is 00:41:20 Now you've just opened up a huge, you know, potential security hole. Like that thing has to be secure. There's more, there's more data in that service than probably any place else in the company that's telling what's going on within the company. And so that it becomes even more important in these environments to make sure you're doing least privilege and ensuring good security principles. And how about compliance? I remember I covered a little bit of the compliance topic in,

Starting point is 00:41:48 one of my chapters around using observability to observe and trace what an agentic workflow is doing, what the input and the output of an LLM is, what an agentic workflow is calling for tools. And the whole topic of compliance, I guess, is also just getting more and more important, right? Yeah. And that goes back to that audit thing. So you need auditing, not just for security. You actually need it for compliance. So if you're going through a compliance certification process,

Starting point is 00:42:22 you need to be able to demonstrate that you can audit everything that happens within your system. This is true for PCI. This is true for FedRAMP. This is true for, I think, definitely IL5. I already said PCI. It's true for pretty much all of them. NIST, yes, thank you. It's true for all of them.

Starting point is 00:42:42 And I call this out, security and compliance are related, but not the same thing, because you could have a compliant system that is not secure and a secure system that is not compliant. Now, a secure system that's not compliant is far more common because a lot of compliance frameworks have a really good standard of security within them. But they're not perfect and they're not always kept up to date or modernized. So the ability to have all that information and that audit trail, you need it for compliance. You need it for compliance even if your AI is internal only and not customer-facing at all and you're only using it as an internal tool. You need to be saving those prompts and those responses so that if it does hallucinate something

Starting point is 00:43:23 or whatever that you can catch it. How you capture them, I think we talked about OpenLLMetry. There's other ways to do it as well. If you're using something like an AI gateway, then the gateway can capture it as well and like a little service in there can put it out. You could write your own middleware to do things with custom logic if you want instead of using something that already exists. Like however you want to do it.
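
Editor's sketch: the "write your own middleware" option mentioned above could look something like this. It is a minimal illustration only, assuming the model is reachable through a plain function that maps a prompt string to a response string; `audited`, `call_model`, and `audit_sink` are hypothetical names, and this is not how OpenLLMetry or any specific AI gateway is implemented.

```python
import json
import time


def audited(call_model, audit_sink, agent_id):
    """Wrap a model call so every prompt/response pair is recorded.

    call_model: any function taking a prompt string and returning a response string.
    audit_sink: any function that persists one JSON line (file, queue, log shipper).
    agent_id:   the identity the call is attributed to in the audit trail.
    """
    def wrapper(prompt):
        response = call_model(prompt)
        record = {
            "ts": time.time(),
            "agent_id": agent_id,
            "prompt": prompt,
            "response": response,
        }
        audit_sink(json.dumps(record))  # one self-contained record per call
        return response
    return wrapper
```

For example, `audited(my_model, open_log.write, "obs-agent")` would return a drop-in replacement for `my_model` that leaves a trail an auditor can replay later.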

Starting point is 00:43:44 But yeah, that's really important for compliance. And then guardrails. So the other thing is that you need to have basically guardrails that prevent bad prompts from making it to the AI so that it would respond. Or if it is responding that, you know, you can intercept that response and remove anything that it would have that's undesirable. Or just flat out, like, reject it and be like, no, I'm not sending that forward, and return an error to the user instead. So all of those things are really important for compliance. They are as important for compliance as they are for security. But you definitely need to prove it more regularly because compliance, you basically

Starting point is 00:44:29 need to do it for security if there's an oopsie, right? Something went wrong. Now you have to go leverage all your security stuff. For compliance, it's proactive, right? It's all like, there is a security aspect of the compliance framework. I don't want to pretend there isn't. but you need to be able to demonstrate it on a moment's notice due to random audit, whatever it is.
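
Editor's sketch: the guardrail idea described a moment ago (block bad prompts before they reach the model, scrub or reject undesirable responses) can be illustrated with a small wrapper. This is a deliberately naive sketch: the blocked-prompt pattern and the "10 to 16 digits looks like an account number" rule are assumptions made up for illustration, and `guarded_call` and `GuardrailError` are hypothetical names rather than part of any real guardrail product.

```python
import re

# Assumed example policy: reject one well-known prompt-injection phrase.
BLOCKED_PROMPT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
# Assumed example policy: treat any standalone run of 10-16 digits as an account number.
ACCOUNT_NUMBER = re.compile(r"\b\d{10,16}\b")


class GuardrailError(Exception):
    """Raised when a prompt is rejected before reaching the model."""


def guarded_call(call_model, prompt):
    """Check the prompt, call the model, then redact the response."""
    for pattern in BLOCKED_PROMPT_PATTERNS:
        if pattern.search(prompt):
            raise GuardrailError("prompt rejected by guardrail")
    response = call_model(prompt)
    return ACCOUNT_NUMBER.sub("[REDACTED]", response)
```

The caller can catch `GuardrailError` and return an error to the user, which is the "reject it flat out" path; the redaction step is the "intercept and remove" path.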

Starting point is 00:44:51 And even if you are not in an organization that has high compliance requirements, I really recommend that you act like you are because ultimately for your organization to reach the next market share, to reach its next level of income, you will need to suddenly start caring about these compliance frameworks. So it's better to be proactively approaching your design with compliance in mind, even if you're not actively being certified for that compliance type. Because if you want to grow your market share, you want to grow your organization, you want your business to continue to expand, you will need these things.

Starting point is 00:45:31 And I think going back really quick to what Rob was talking about with the chapters on Financial, what do we call it, Acme Financial Services? I always forget. Yes. So bad. I always forget. I literally wrote it wrong in one of the chapters and we had to catch it in editing. It's shame on me. If people are reading those chapters expecting a step-by-step implementation guide, that's not what they're going to get.

Starting point is 00:45:56 We're really bringing the exposure that a lot of engineering people don't get, which is to the business context with these chapters. Because a lot of my career in engineering, I didn't have the business context. I didn't have access to understanding the business problems. So that is what we're bringing in those chapters. And compliance is a business problem. It is very much a business problem because it directly impacts revenue. Yeah, I was actually going to use this as an opportunity to kind of like show. So Hillary went through like a lot of really good information there.

Starting point is 00:46:29 And I was going to bring it back to our company and show exactly the kind of ramifications of what Hillary is mentioning. Right. So let's think in terms of, you know, the Acme Financial Services. Let's say you're a bank or any financial service. You've got people's money, and that's really heavily regulated. And there are severe penalties for not being able to prove that you are conforming to those regulations. So if you're unable to audit, let's say you have a breach and, you know,

Starting point is 00:46:57 some account numbers go out and some bank account, you know, some money gets taken. The end of that problem is not the financial institution making the people whole who got, you know, compromised. It's not like, oh, you lost your bank account number, we'll give you a new bank account number, and we'll put the money back into your account. That's not where that ends, right? That, when you have that stuff, now you've got regulators coming at the company saying, how did this happen? Show us that you're in compliance with,

Starting point is 00:47:25 you know, the things that you need to do to prevent this type of thing. And if you don't have any auditing, you don't have any kind of audit trail that shows, hey, maybe you had AI doing something. And AI did X, Y, and Z, and that may have caused it. Or no, you know, we have some defense against these things and what caused it was some fluke or something or other. Right? If you don't have any of that, you don't have any kind of auditing to show what's happening, when it's happening, any of that stuff, you're going to have severe problems with the regulators and you could have massive, massive financial penalties. And of course, any of that stuff, could you imagine if your, you know, money is in a bank and they get compromised and they give it back to you and they said, we have no

Starting point is 00:48:07 idea what happened. Sorry about that. Are you going to trust them? Right? That's one of the things that, exactly, it's one of the things in the book we talk about trust. When you're dealing with people's finances, that is huge. Trust is big. Yeah, trust is big. Yeah, no, if that happened, my money was just Thanos snapped away one day and the financial institution was like, yeah, we don't know why, here you go. Like, I would not be with that financial institution any longer. My money might be in a for a while, frankly, but you know, I would certainly not be continuing

Starting point is 00:48:43 to hold my assets with that financial institution. Hey, I always say it's amazing how fast time flies when you have fun because we have just spent about 50 minutes on this podcast talking about our book. It was also amazing

Starting point is 00:49:01 how fast time was flying when we wrote the book because all of a sudden all of these milestones came up faster than we were hoping for, or these deadlines, let's say it that way. It's really great that we have accomplished what we've accomplished. I do hope that those that are listening in,

Starting point is 00:49:20 and if you happen to have a copy of the book and you read it, please also give us feedback. Otherwise, it's a one-way street. We write and we don't hear feedback. It's always appreciated. Send us a message. Leave a comment on wherever you bought the book. This also shows us what we can improve or whatever else we do in the future, writing another edition

Starting point is 00:49:44 or yeah, it's always good to get feedback. Yeah, I would say with this, Rob, I'm very much looking forward to the end of May because we just learned before the podcast that finally we have a place to meet face-to-face because so far we've only managed on Zoom. So I'm looking forward to Indianapolis. Me too. I think it's going to be a lot of fun. I think so too. And Hillary, I'm pretty sure we'll see each other again, at the latest, and Rob, this might also be something for you, at KubeCon in North America. I know it's still a couple of months out, but at least, Hillary, I'm not sure what your plans are,

Starting point is 00:50:23 but I will definitely, I think, be there, and there's a good chance for everybody that listens in that we have a couple of copies of our book to give away, signed copies by the way. So yeah. Yeah, I expect to be there. This time they managed to schedule KubeCon so it doesn't fall exactly on my son's birthday. So it's a little bit before. So that makes it a lot easier. The last time they successfully did that was Chicago. So otherwise, the reason I don't go to KubeCon North America is usually because it falls right over my child's birthday. And I prioritize his birthday. That's a good thing. Family first and then. But yeah, let's hope we'll see each other all there. With this, Brian, sorry that you couldn't make it today, but I'll have you back as my co-host for the next episode. But for now, I say, thanks, Rob. Thanks, Hillary.

Starting point is 00:51:16 Thank you, everybody for listening in. See you next time.
