PurePerformance - Serverless Observability needs a paradigm shift with Toli Apostolidis

Episode Date: August 28, 2023

Only a few can claim they have successfully created a pure-serverless architecture, and only those really understand the challenges of observing real event-driven architectures. Apostolis Apostolidis (also known as Toli) is one of those people, and that's why we invited him back to discuss all the lessons learned from his time as Head of Engineering Practices at cinch. Tune in and learn about the evolution of serverless observability and the challenges when observing API Gateways, queues and Step Functions. Listen to Toli's advice on picking one observability vendor, doing your own custom instrumentation and making yourself familiar with the observability data from your managed service provider. Also go back to our previous episode to hear more from his Engineering Practices for Success, and remember that the time to ask about cold starts is over 🙂

Additional links we discussed today:
Previous Podcast with Toli: https://www.spreaker.com/user/pureperformance/unlocking-the-power-of-observability-eng
OpenTelemetry: https://opentelemetry.io/
AWS Step Functions: https://aws.amazon.com/step-functions/
Dynatrace Business Flow: https://www.youtube.com/watch?v=W0bSzvQrUzA

Transcript
Discussion (0)
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody and welcome to another episode of Pure Performance. My name is Brian Wilson and as always I have the mimicker-in-chief Andy Grabner making fun of me during the opening, which someday we'll record visually to let everybody see, because then people can sleep at night. Speaking of sleeping at night, I don't have any dreams again, Andy. But I did just wake up recently, so if I sound, you know, bed-weary, that's because I am. Hi Andy, how are you doing? I'm very good, but I am in a different setting today. But, uh, what did you say, M-I-C in control? What did you call me earlier?
Starting point is 00:01:06 Oh, my, something-in-chief. Oh, mimicker-in-chief. Mimicker-in-chief, M-I-C. Oh, and I forgot to call you,
Starting point is 00:01:14 I'm supposed to call you Salsa Boy. I forgot about that. That's right. Yeah, yeah, yeah. See, it's morning.
Starting point is 00:01:18 So Andy got a nickname, everybody. Well, we did the recording yesterday, but in the previous episode, Salsa Boy, because we had a guest, Salaboy, and Andy mentioned Salsa Boy,
Starting point is 00:01:28 so we now have a new name for Andy which I'm going to try to promote all throughout the company I won't do that to you Andy but anybody listening I'm sure you deserve more respect than you get from the things I hear people talking about you Andy I think you deserve a lot more respect I'm actually also good with salsa, boy, because that's
Starting point is 00:01:46 my passion, right? Dancing salsa. But I also have one quick story to tell, and this is why I'm in an unusual setting today because I made the stupid mistake of trusting the Windows update message. It says there's an update pending. It only takes six minutes. 45 minutes later,
Starting point is 00:02:02 it was still going on, and so I switched to a different room here. It finally finished, but I'm in a different setting. In case I sound different today, then it's because I'm in a different setting. But anyway, let's not talk about Windows updates. Let's not talk about Salsa or something like that. We actually have a guest back who we just recently had. Actually, two days ago, as of the recording today, we just aired the episode, which was called Unlocking the Power of Observability, Engineering Practices for Success with Toli. Toli, welcome back
Starting point is 00:02:38 to the show. And the reason why we have you back on the show: last time when we talked, you said there's a whole set of topics around serverless. And you wanted to talk about serverless observability. Serverless is a really big topic for our communities out there, from a performance engineering perspective, from a site reliability engineering perspective, from a platform engineering perspective. And that's why we're really happy to have you back
Starting point is 00:02:59 because it's a topic that was on your mind and you also said you want to have a little conversation here so we can kind of bounce back ideas on how to, you know, bring serverless observability to, well, discuss the challenges with it and come up with some conclusions. But, Toli, welcome back. How are you? I'm great. Thanks for having me back.
Starting point is 00:03:21 Thanks for publishing the podcast. I had really good fun talking about observability last time. I think we danced around the topic of serverless in the last podcast. I think the history is that my experience was I joined a company called cinch four years ago and the company was entirely serverless. So everything we were talking about, or most of the stuff we were talking about last time,
Starting point is 00:03:49 was relating to serverless observability, but we kind of masked it. So I'm really looking forward to hear what you found in the field about how people are using observability, how they're enabling observable systems. Last time we talked a bit more about practices, I guess, but this time it'd be interesting to see the technical difficulties as well.
Starting point is 00:04:13 I mean, from my perspective, and with that let's actually start right away: the technical challenge that I see. Brian and I, we've both been in the field of observability for a long, long time, where we built agents that you installed on the server, on a VM, right? And as we said, even though there are servers in serverless, these servers are out of reach for the classical agent. And so the question is, what are good
Starting point is 00:04:37 approaches to actually bring observability into your serverless functions, right? Because you don't control the runtime underneath. Now, I know a lot of things have changed over the years because, you know, the way, like when you take, for instance, AWS or Microsoft or Google, right? I think they also learned that they need to bring a little bit more flexibility into allowing observability frameworks or vendors
Starting point is 00:05:02 into their runtimes. Obviously not too deep, but just getting it in. I also think that OpenTelemetry is obviously playing a big role in serverless observability. But yeah, I think there's definitely a difference, because we can no longer just install an agent on a machine that automatically detects all the processes, all the containers that run on it, and instruments them. We need to kind of instrument from within the serverless function, or whatever we are allowed to do. So I think that's the big difference in approach.
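Instrumenting from within the function, as Andy describes, usually means wrapping the handler itself. The following is a minimal stdlib-only sketch of that idea — not any vendor's SDK (in practice you would use something like the OpenTelemetry tracer); the handler name, fields, and in-memory span list are all invented for illustration.

```python
import time
import uuid

# Collected spans; a real tracer (e.g. OpenTelemetry) would export these
# to a backend instead of keeping them in memory.
SPANS = []

def traced(name):
    """Decorator that records one span per invocation of the wrapped handler."""
    def wrap(fn):
        def inner(event, context=None):
            span = {"name": name, "trace_id": uuid.uuid4().hex, "status": "OK"}
            start = time.perf_counter()
            try:
                return fn(event, context)
            except Exception:
                span["status"] = "ERROR"
                raise
            finally:
                span["duration_ms"] = (time.perf_counter() - start) * 1000
                SPANS.append(span)
        return inner
    return wrap

@traced("checkout-handler")
def handler(event, context=None):
    # Hypothetical business logic, purely for illustration.
    return {"statusCode": 200, "orderId": event["orderId"]}

result = handler({"orderId": "o-123"})
```

The point of the pattern is that the telemetry lives inside the deployment artifact, since no host-level agent can be installed on the invisible server underneath.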
Starting point is 00:05:35 Now, Toli, I know with your experience at your previous organization, how did you approach it? How did you instrument your serverless functions? What type of mechanisms did you use? Did you instrument with OpenTelemetry? Did you use what was provided by the vendor? Any other things that we may not be talking about that you found out? That was a great summary, I think, of the estate, really. It's very interesting to see how the vendors have evolved over time. Our experience was we did start with an agent hosted in an app service, I think.
Starting point is 00:06:12 And that was when we moved to serverless. It sounded a bit over the top to have a server to host the agent, but then everything else is serverless. At the time, I think this is around 2020, a lot of the vendors were learning how to grow and how to migrate to this new paradigm. Serverless, having worked for a company that was entirely serverless, didn't own a server, didn't own a VM at all, it is really apparent how much of a paradigm shift serverless is. And this comes with instrumentation as well.
Starting point is 00:06:55 So we quickly figured out that the best option is to instrument using the vendor SDK, the vendor tracer. Over time, the vendor learned and got better. And as you said, the cloud providers were actually allowing more support. So typically, just for the listeners, what happens, and I guess this is maybe AWS specific, so I'd be interested to know what the other two big vendors do.
Starting point is 00:07:26 They have a native tracing client, for example, and they have a native product that you can use. But for a long time, that made it hard for other vendors to instrument
Starting point is 00:07:42 as well as that, because it doesn't have that native integration with everything. But AWS in particular opened up using this form of extensions on Lambdas to actively work with these vendors and allow them to build extensions on, as you said, the Lambda runtime. And that really, we could definitely experience it. Things got a lot better, a lot quicker. So typically in that context,
Starting point is 00:08:10 you'd use an extension as your agent, really like a mini agent, and your Lambda runtime would push the telemetry async, I believe, after the end of the Lambda invocation. And that seems like a nice, clean architecture. And that works nicely with instrumentation. That's for traces, but you could probably say the same for other things. What do you think were the main drivers why somebody like a vendor, whichever one of the big ones you name,
Starting point is 00:08:48 what do you think were the real big drivers for them and the motivation to open up? Why would they, if I have a solution already, right? I mean, I'm providing serverless and I also want to provide the whole package, right? Obviously all these vendors, all these cloud vendors, they try to, I think at least, keep you in their environment, right? Because then everything feels like it's from one vendor.
Starting point is 00:09:14 What do you think pressured them to actually open up? That's a great question. I don't have any inside information, but I believe that the reason is they recognize that the observability vendors do observability a lot better than they would. That's not the differentiator. I think it's really smart that they've done that. They offload observability capabilities to other companies and in a way
Starting point is 00:09:52 they make their users' life easier as well. I tend to, whenever I meet someone at AWS or I meet someone at Datadog, whenever there's a bit of a, or Dynatrace, whenever there's a difficult problem, I just say, can you just speak to each other? Just speak to each other, make it work. So the signals I've got is I've not got any friction from either side. They seem to collaborate quite nicely. I think it's very critical to have serverless experts
Starting point is 00:10:25 in the observability vendors as employees. Early on, we had a lot of struggle actually getting our observability vendor to really understand our problems, because they didn't really understand serverless. So that helped a lot. But yeah, I guess the TL;DR is, I think it's to their benefit
Starting point is 00:10:44 to offload this non-differentiator. They can definitely offer basic observability functionality with, in AWS's case, things like CloudWatch and X-Ray and their metrics. And that works well for a while, but it quickly breaks if you want to do more advanced and complex things. Yeah, I mean, from my perspective, what I see,
Starting point is 00:11:10 and Brian, I'm not sure how you see this as well, when working with our customers and there are typically large enterprises, you are typically not just using one kind of stack, right? That means if you're using AWS Lambda, then chances are that those apps are calling into other services that may run on another cloud, that may run on your on-premise,
Starting point is 00:11:33 but you still want to get the end-to-end observability. And this is why you don't want to end up with observability silos where all the Lambda data is in AWS, and then you have all of your other data in another observability platform. Really in the end, you want to get the end-to-end view. And I think also that was one of the reasons
Starting point is 00:11:50 why they opened up. I think it was the pressure from their customers saying, hey, we need this type of visibility, right? Because otherwise, it's going to be harder for me to troubleshoot. Also, if I'm your customer and I don't know if it's my mistake, it might be yours, then I need to open up a support ticket with you.
Starting point is 00:12:08 And then your guys need to look into this. So why not just give me the data that we need? And then I think it was really good that OpenTelemetry came along so that we had an open standard that everybody could agree on, that we don't have anything proprietary, but we really built something that everybody could easily consume. I think that was a really important piece in the last couple of years with OpenTelemetry. Yeah, and I think what you're speaking to there is the market pressure, right? You know, if they're not doing that, the customers are going to be unhappy. And then, like, if Amazon had
Starting point is 00:12:40 it closed and then Azure came in with their serverless functions and said, hey, it's open, they might see an exodus. But I also think part of it could be the general IT community. What we see over and over is there's one cloud vendor, not one of the big three, that I could imagine would lock it down. I'm not going to say their name, but I imagine they'd be like, no, you're going to do it all through us, right? I think there are so many people at the big three and so many other places whose nature is just to be like, oh yeah, we should do that. People want it and it makes sense. There's that spirit within our community that really lends to that. So I'd like to think that that's part of it, right?
Starting point is 00:13:25 Yes, market pressure is going to be a big part of it, but I think it made it easy just because of the way we all operate as an IT community. Yeah, that's an excellent point. I really like that. I hope that's a pressure that we apply as a community, because in a way we want to collaborate and we want things to not be locked down overall. But I also like the market pressure thing. You can imagine
Starting point is 00:13:50 a Wardley map where the user is anchored as a developer or a company, and they want to monitor their systems, that's their need. And then, what do you need to achieve that? And you can imagine the native client, the custom-built thing from the cloud provider, to be something that they really don't want to keep custom-built, so they want to move that into product. And the inertia to move it into a product is that that's not the differentiator, maybe, so why not farm it out to other companies? Hey, Toli, coming back to your experience with building a system that was purely serverless-based, one of the challenges that I hear from our users, or questions that come up, is like, hey, serverless monitoring, yes, we can monitor the individual serverless functions, right? We can
Starting point is 00:14:38 instrument them with OpenTelemetry. We can use the extensions in AWS Lambdas and so on and so forth. But how can we get visibility into our end-to-end system? Because there are many services in between, like the end user and the serverless function and all the other things that a serverless app is consuming. What's your experience on this? First of all, (a) was it critical for you to get end-to-end visibility also through these connected services, whether it's API Gateways, event buses, you know, down to the database? Was this important? And (b) how did you
Starting point is 00:15:18 solve it, if it was important for you? Um, that's a very interesting question. I guess the answer is maybe, or yes and no. It is important, but it isn't. I think, first of all, it's interesting to say that I was part of a company that had maybe 10, 15, 20 teams at any point in time building serverless systems. So they were building the systems. They had to understand the visibility of their systems. So they had to understand how observability works in serverless. So the first thing they did, if you take it almost like a timeline, the first thing they did was, okay, I can emit telemetry data from the thing that I can control,
Starting point is 00:16:04 which is your Lambda in AWS, in this case, or your serverless function. But then I think what you're saying is, the question is, if you want to use tracing, how do you trace across managed services like API Gateway in AWS or EventBridge and things like that? And the answer is that we tried.
Starting point is 00:16:26 Initially, we used a functionality of X-Ray. So X-Ray traces a lot of these things natively, and we merged those traces with the observability vendor's SDK, their version of a tracer. And that sort of works nicely, but has a lot of limitations when it comes to filtering, and X-Ray seems to lose traces. And so we couldn't really use span data, you know, for monitoring, because we couldn't rely on it being there. So that was one of the strongest limitations of that approach. Then as the years went past, interestingly, we waited for things to evolve, and that's often something that happens when you adopt a paradigm that's new to everyone: you have to wait for things to evolve and improve. So what happened was the observability vendor tracer got better, that improved over time. So it natively traced
Starting point is 00:17:39 things like API Gateway quite easily. And API Gateway, so if we classify things, I don't know how you think of them in your head, but there are three things that I'm thinking of in terms of serverless. And before I define them, can I take a small parenthesis to define serverless? Sure, yeah. So in my mind, serverless is elastic compute and elastic billing.
Starting point is 00:18:03 Your usage of whether it's a managed service, whether it's a computer you control, to an extent, that should go to zero if you've got no traffic. And it should scale automatically and you shouldn't need to put any rules in. And when we talked about server earlier, the server should not be visible to you. So when we talk about Lambda runtime, that runtime is not a VM or a server. It's just the runtime of the Lambda. It's a bit of a higher level abstraction.
Starting point is 00:18:34 So given that that's what serverless is, you might then have to integrate your bits of code, which is your serverless functions, with other managed services. And in this case, you have three classes of managed services. I'd say things like API Gateway, which is fairly straightforward, and it's synchronous. And then you've got more asynchronous things,
Starting point is 00:18:55 which is your queuing systems, your SQS, or your event bus, which is your EventBridge, or your notification system, SNS, in AWS. And then the third class is step functions. And that's a big topic in itself. But I'd say the API Gateway one is really useful because, as we talked about in the last episode, you can have real user monitoring. So then you can get a trace that has your real user monitoring top span.
Starting point is 00:19:22 Then you have an API gateway and then you have your Lambda function. And that's useful because you can trace things across. You can see how long the API gateway takes when you're looking at support cases and just in general what overhead it adds and what hacks you can do to improve that. But I'd say that it's not the biggest value. We talked about it last time.
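The trace shape Toli describes — a real-user-monitoring top span, an API Gateway span beneath it, a Lambda span beneath that — can be sketched as a small span tree, from which the gateway's overhead can be read off. This is a stdlib-only illustration with invented span names and timings, not real trace data:

```python
# A trace represented as nested spans: RUM page action -> API Gateway -> Lambda.
# Durations are in milliseconds and entirely made up for this sketch.
def gateway_overhead(trace):
    """API Gateway time not spent inside the Lambda itself, in milliseconds."""
    gw = trace["children"][0]       # the gateway span under the RUM top span
    lam = gw["children"][0]         # the Lambda span under the gateway
    return gw["duration_ms"] - lam["duration_ms"]

trace = {
    "name": "rum:page-action", "duration_ms": 310, "children": [
        {"name": "api-gateway", "duration_ms": 120, "children": [
            {"name": "lambda:handler", "duration_ms": 95, "children": []},
        ]},
    ],
}

overhead = gateway_overhead(trace)  # gateway added 25 ms on top of the Lambda
```

This is the kind of number Toli mentions looking at for support cases: what the managed service adds on top of your own code.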
Starting point is 00:19:43 The flame graph is not the biggest value. So it's useful, but not amazingly useful. Then the second thing is the asynchronous stuff. And so things like SQS and SNS and EventBridge, those are quite hard to trace. And I think the biggest...
Starting point is 00:19:57 I'm curious to see what you're seeing in your focus group. I don't know what the best visualization is for loads of things going into SQS and loads of things going out of SQS and all this kind of many-to-many relationship. We tried it
Starting point is 00:20:13 a few times to see what's the best approach, but it's not easy to visualize. It's not easy to understand. So if the experts in the company can't really figure out what's the best way and the tool is not helping you, then I don't find any use in giving this to all the teams to use.
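One common way to make the asynchronous case traceable at all is to carry trace context in the message attributes, so a consumer's span can at least be linked back to the producer's trace despite the many-to-many fan-out. The sketch below uses a plain Python list standing in for SQS; the `traceparent` attribute name follows the W3C Trace Context convention, but everything else here is an assumption for illustration:

```python
import json
import uuid

def send(queue, payload, trace_id=None):
    """Producer injects trace context into message attributes, the way
    tracers propagate context via SQS message attributes."""
    trace_id = trace_id or uuid.uuid4().hex
    queue.append({"body": json.dumps(payload),
                  "attributes": {"traceparent": trace_id}})
    return trace_id

def receive(queue):
    """Consumer extracts the context so its span joins the producer's trace."""
    msg = queue.pop(0)
    return json.loads(msg["body"]), msg["attributes"]["traceparent"]

queue = []
tid = send(queue, {"order": "o-1"})
payload, linked_trace = receive(queue)  # linked_trace ties consumer to producer
```

Linking is the easy half; as Toli notes, visualizing thousands of such links flowing in and out of one queue is the part nobody has solved cleanly.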
Starting point is 00:20:37 And a similar story is with step functions. So step functions are a very different paradigm as well. I think it's Logic Apps in Azure, but for those of you who don't know what step functions are, it's basically a state machine. So you have a well-defined set of steps, and once it starts, then it goes through these steps, and then there's an end state.
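A state machine like the one just described can be sketched in a few lines: a well-defined sequence of steps, each transforming the state, with an execution history like the step log a Step Functions console shows. The step names here are invented purely for illustration:

```python
# Minimal state-machine sketch of what a Step Function (or Azure Logic App)
# orchestrates: ordered steps, each passing state to the next, until the end.
def run_state_machine(steps, state):
    history = []
    for name, fn in steps:
        state = fn(state)
        history.append(name)  # execution history, step by step
    return state, history

steps = [
    ("CheckPayment", lambda s: {**s, "paid": True}),
    ("FulfilOrder",  lambda s: {**s, "fulfilled": True}),
    ("ShipOrder",    lambda s: {**s, "shipped": True}),
]
final, history = run_state_machine(steps, {"order": "o-1"})
```

Because each real step can take minutes or days, the flame-graph view of one giant span over the whole execution is exactly what breaks down, as discussed next.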
Starting point is 00:21:02 And it's a bit different to use. You can use a Lambda within the step function, you can kind of compose a bit of a, um, it's an orchestration versus choreography, as they call the event-driven stuff. So that is very difficult to observe effectively, because it may take two weeks, so you don't want a flame graph with a massive span. So the visualizations in the asynchronous managed services and the step functions are not really compatible, I think. But have you seen anything different? Yeah, I think what we are seeing these days, I think I took some notes and I'll try to
Starting point is 00:21:43 at least give you my thoughts on this. Um, first of all, when you are sending a message to a queue, right, I think one of the things that you want to measure and monitor, and also then highlight, is how many subscribers you have on the other side. And I think there are patterns that you can figure out and say, hey, why do we have, you know, 50 subscribers to this particular message? And it grew from 50 to 60 to 100. I think that's a good indication. This is something you can actually answer
Starting point is 00:22:14 when you actually do distributed tracing. So, you know, one message comes in and then it fans out into 50, 60. And you want to have an eye on this number, because one of the things that I've been talking about almost since the beginning, since I started working in observability, is what I call architectural validation and architectural regressions. So with an architectural regression, in the early days, Brian, we talked about the N+1 query problem.
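The check Andy is describing — watching how many downstream calls or subscribers a single entry point fans out to, and flagging growth between snapshots — might look like this sketch. The field names and the 2x threshold are assumptions, not anything from a real product:

```python
# Track average fan-out (downstream calls, subscribers) per entry point and
# flag entry points whose fan-out grows sharply between two snapshots --
# the classic N+1-style architectural regression.
def avg_fanout(traces):
    totals = {}
    for t in traces:
        calls, n = totals.get(t["entry"], (0, 0))
        totals[t["entry"]] = (calls + t["downstream_calls"], n + 1)
    return {entry: calls / n for entry, (calls, n) in totals.items()}

def regressions(baseline, current, factor=2.0):
    base = avg_fanout(baseline)
    now = avg_fanout(current)
    return sorted(e for e in now if e in base and now[e] >= factor * base[e])

yesterday = [{"entry": "/order", "downstream_calls": 5}]
today = [{"entry": "/order", "downstream_calls": 50}]
flagged = regressions(yesterday, today)  # "/order" fanned out 10x more today
```

The same comparison works whether the downstream calls are database queries, as in the Java example that follows, or queue subscribers.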
Starting point is 00:22:39 And coming back to this, because in the end, it's the same thing. In the classical Java apps that we saw in the very beginning of our work in Dynatrace, we had one call, somebody makes a request, and then 100 database calls go off. And then tomorrow, instead of 100 database calls, 200 database calls go off for the same transaction. Now, why is this? Because somebody was iterating through the result list of 100, 200 and then making an additional database call, the classical N+1 query problem. And I think we answered these, so we highlighted
Starting point is 00:23:17 these architectural patterns through distributed traces, because we said: this request on that URL with these parameters was producing five database calls yesterday, and today the same one is 50, and tomorrow it's 100. So you clearly have a data-driven problem here. And I think that's one of the things that can also be applied when you're sending a message and then you want to see how this message fans out, so just keeping track of how many it fans out to. So that's one thing. The other thing, looking at the other example, there was step functions. I think with step functions you're talking a lot about almost like a business process, like you'll be basically
Starting point is 00:23:55 modeling a process. And what we do here now is, um, I mean, in our terminology we talk about business events, and it might not be the perfect term. But what we basically try to do, we try to model an end-to-end business process that can take a minute, it can take an hour, it can take a day, it can take a week. And we're basically trying to capture the phase this process is in as an event. So it's the same thing, right? I mean, you have an event, you know, order comes in, order gets sent to, I don't know, checking the payment, when the payment is completed then it goes into order fulfillment, it gets shipped, and then at the end the customer says, I received this package. And so what we are basically looking into is how many instances of these processes are executed,
Starting point is 00:24:48 how long does it take to go from step to step? Does something change over time, right? Do certain times between steps all of a sudden become longer? Does this correlate with the workload or with the volume that comes in? Does it change maybe with some attributes on these transactions? So maybe an order, because we recently had a podcast with Mark Forrester from Mitchells & Butlers, they are a restaurant chain in the UK, and they basically monitor their whole food delivery end to end with this, right? And it's their systems, but it's also the Uber Eats of the world that are doing the delivery, and then also
Starting point is 00:25:32 including the payment and the delivery. And they basically, you know, monitor every step along the way, and then they figure out where things may go wrong. Maybe Uber Eats had a bad day today, and therefore all of a sudden, whenever Uber Eats was used to deliver food, there was a bad experience and people ended up not paying for it in the end, because they gave a bad rating and things like that. So to sum up, I think when we talk about step functions and you have longer processes that can take much longer than just a second, like a traditional transaction, you want to think about what are the individual phases of your business process, and how can you monitor every phase, how many transactions actually go through, and then also detect critical parameters that tell you, well, in this case, this transaction is just buying a burger.
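Monitoring the phases of a long-running process, as Andy describes, boils down to computing the time between consecutive business events for each process instance. A minimal sketch, with invented phase names and timestamps in seconds:

```python
# Group timestamped business events by order and compute the duration of each
# phase transition. All field names and values here are illustrative.
def phase_durations(events):
    by_order = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        by_order.setdefault(e["order_id"], []).append(e)
    result = {}
    for order_id, evs in by_order.items():
        result[order_id] = {
            f'{a["phase"]}->{b["phase"]}': b["ts"] - a["ts"]
            for a, b in zip(evs, evs[1:])
        }
    return result

events = [
    {"order_id": "o-1", "phase": "ordered",   "ts": 0},
    {"order_id": "o-1", "phase": "paid",      "ts": 60},
    {"order_id": "o-1", "phase": "delivered", "ts": 1800},
]
durations = phase_durations(events)
```

Comparing these per-phase durations over time, and slicing them by attributes like delivery partner or order type, is exactly the analysis described above.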
Starting point is 00:26:23 In this case, this transaction is buying a much more complicated item that may also take longer to produce. And so I think this is where you also need to have some domain knowledge to then figure out how do you detect certain patterns and not just put every transaction into the same bucket. That's kind of what I'm trying to say. That sounds really exciting to me, to be honest. I wanted to interrupt you a few times there, but I didn't. I think it was really good. By the way, business events is an ideal name,
Starting point is 00:26:55 I think, in my mind. I think it's a really, really good term. I have loads of questions about what you said, but if I understood correctly, what you're saying is that at the center of what you do is the business events, and you visualize and monitor those, and you get insights for things like time between events, events that have failed. And I think absolutely this is what is missing from most observability vendors, to be honest.
Starting point is 00:27:34 And I think that, especially with serverless, you have the opportunity to stop worrying about servers and VMs and you start worrying about business transactions. And what's interesting with business transactions is, in a distributed system that we typically build nowadays, like a transactional distributed system, I guess, you don't care only about the single request response. But you want to know about multiple hops.
Starting point is 00:28:12 And I think what you were saying there about the particular example is that there's a lifetime of an order. And it's interesting because I've just moved into the restaurant tech industry. So I'm very interested in that example. But there's a lifetime of an order and you want to have visibility of that.
Starting point is 00:28:29 That was the number one thing back when I was working at cinch that executives and directors cared about, and the teams cared about ultimately. How do I visualize it? Okay, I've got all these Lambda functions. I've got all this, what they call choreography. With choreography,
Starting point is 00:28:48 they mean that because you're using an event-driven architecture and systems are integrating with each other with events rather than synchronous API calls, you're hoping that the whole process falls together beautifully. But you have no way, well, you have no easy way, of verifying, as you were saying with the architectural validation maybe, that that process has worked. So, I guess I've got two questions. Is your answer to that that you have this concept of business events and you build your
Starting point is 00:29:27 And if yes, what are the telemetry data types that you use? So the way this works is, right, you want to, first of all, kind of almost sketch out your business process end to end. Like, where does it start? I guess typically with an end user that is trying to do something. And then how does this business process evolve until that process is completely fulfilled? And coming back to that example with Mark Forrester: only a small piece of the whole food delivery is stuff that they actually built.
Starting point is 00:30:10 Most of it is third-party components. And so they were basically sketching out the whole end-to-end business process. And then we were trying to figure out, how can we get the individual phases? How can we monitor this? So some of this can be done by looking at traces, at logs. People can also, especially SaaS vendors, they can also push data to our observability platform. So, for instance, this was one of the things that I liked so much about that conversation, because I asked him, how did you convince the SaaS vendors that you're working with, right? Most
Starting point is 00:30:42 of the services that they consume are other APIs. And then I was asking him, how did you convince vendors A, B, and C to send you their logs and traces? And he said it was very easy, because when we have a problem and we don't have the insights, then we open up a support ticket with them,
Starting point is 00:31:01 which means we are tying up people on their end and on our end to try to figure out whose fault it is, why the food was not delivered. Was it our mistake, because we couldn't make the API call or it didn't reach their end? Or was it a problem on their end, because they couldn't call back? And so he said, once we established that it's a benefit for both sides to get insights, it was very easy, right? And in the end, they now have all the data and can solve technical problems on their end
Starting point is 00:31:31 because now they know that there are problems. And if the problem is on the other side, they have the proof and can say, you know, we have done the root cause analysis. We know that when you are making a call back to us, to figure out: do we still have burgers in stock? Do we still have this and this food item? Sometimes our API fails. This is one of the scenarios that he talked about. Sometimes these food platforms were
Starting point is 00:31:55 showing an outdated stock supply, right? This showed the restaurant still has it, you can still order this, even though the restaurant didn't have it anymore. So when the customer then went through the process, at the end of the order process, the system returned an error, because when they made the final call, they said we're out of stock. And so this is an interesting piece, right? And so, to answer the question, how do we get this data? Right now, the way we do it, and I'm sure maybe other observability vendors are doing this as well, we can ingest it (a) through our agent, (b) through logs. We can also ingest this from the front end, so we also have an agent, or a JavaScript library, on the front end if you're building web applications, so again we can
Starting point is 00:32:42 also capture it there. Brian, did I forget anything? You can just send events to the API. Yeah, we have an API where you can just send events. So this is where SaaS vendors can just send it. Now, the critical piece is you need to have some unique identification that kind of tells us that all of these events somehow belong together, right? And that's where typically you have something like an order ID or something. Customer ID, order ID.
Starting point is 00:33:08 Yeah, some ID. But we can take all of these individual events and we categorize them phase one, phase two, phase three, phase four, and then we can stitch them together if you want to know the whole process for one order. But then we can run analysis and say, hey, that many orders came in, that many made it successfully to the end, and then we can fan it out and say, hey, at the food delivery, we have five food delivery people or companies, and we have a higher
Starting point is 00:33:35 error rate when we use delivery one versus two versus three. And so these are the things we can do. And you know, it doesn't get more serverless than tracking shipping. Yeah. So I'm very interested in this
Starting point is 00:33:53 approach, because I'm happy you said order ID, because that's the one thing that I was going to say: you need something to stitch it together. So it's the difference between tracing a request, which is a request from the boundary of an API, or even the front end
Starting point is 00:34:11 if you link it with the real user monitoring, and tracing a particular order. And the trouble you have as a customer with these types of things is that it's good practice to have some top-level tags, if you like, call them tags for now, that are used across all the teams. And in this case, as you say, across companies.
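The stitching-by-order-ID idea in this exchange can be sketched in a few lines. This is a rough illustration, not any vendor's actual API: the event shape, phase names, and provider field are all hypothetical. Group raw events by the shared order ID, then fan out error rates per delivery provider, as Andy describes.

```python
from collections import defaultdict

# Hypothetical business events, as they might arrive from agents, logs, or an ingest API.
events = [
    {"order_id": "A1", "phase": "order_placed", "provider": "delivery-one"},
    {"order_id": "A1", "phase": "delivered",    "provider": "delivery-one"},
    {"order_id": "B2", "phase": "order_placed", "provider": "delivery-two"},
    # B2 never reaches "delivered": a failed order for delivery-two.
]

def stitch_by_order(events):
    """Group individual events by their shared order ID so one order's whole flow is together."""
    orders = defaultdict(dict)
    for e in events:
        orders[e["order_id"]][e["phase"]] = e
    return orders

def error_rate_by_provider(orders):
    """Fan out: per delivery provider, what fraction of orders never reached 'delivered'?"""
    totals, failures = defaultdict(int), defaultdict(int)
    for phases in orders.values():
        provider = phases["order_placed"]["provider"]
        totals[provider] += 1
        failures[provider] += "delivered" not in phases
    return {p: failures[p] / totals[p] for p in totals}

print(error_rate_by_provider(stitch_by_order(events)))
# {'delivery-one': 0.0, 'delivery-two': 1.0}
```

The whole scheme only works if every party propagates the same order ID, which is exactly the naming-convention point Toli raises next.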
Starting point is 00:34:34 So you have to have a naming convention and you have to stick to that naming convention. And if you don't, then all of that kind of doesn't work that well. But ultimately, from a user perspective, what I'd really, really like is if the vendors, the observability vendors, helped with that process of basically having that convention and adhering to it, elevating the top-level metrics, top-level tags,
Starting point is 00:35:02 making them a really, really important aspect of the observability telemetry data. I think that would be really useful. And also the events themselves and that whole choreography, if that could be visualized in something different than flame graphs or a list of spans,
Starting point is 00:35:24 is also really useful. And just the last thing, very, very business-related. What's interesting with what you said about the two companies realizing that it's more efficient and effective to collaborate is if you look at it from the other perspective, the restaurants or the hungry person ordering food doesn't care whose fault it is.
Starting point is 00:35:42 No, of course not, yeah. They say, I don't care whose fault it is, just figure it out. Just fix it. I want my order. Or you've lost my order. Where is it? Things like that. So it's super interesting. It's 2 a.m. Where are my wings? Exactly.
Starting point is 00:35:58 So the other, you know, and Toli, you brought up another interesting point that I didn't think of, I think I read it into your comment, but like with the idea of the order ID again, right? If we go back to the very beginning of our conversation where we had this idea of the cloud vendors opening up the API because people needed this information, right? So it was like a market demand, if you will, and also cooperation. Not only is it cool that these companies are working together to do this, but when you take something like an order ID, we might find
Starting point is 00:36:32 ourselves in situations where either the originating vendor, the money generating vendor, in this case, it would be the restaurants generating money for the Uber Eats and the stock suppliers and all that kind of stuff. Or it could even be something like the observability vendors who are getting them to change the process. Because I can imagine, and again, I'm just using Uber Eats as an example, so I'm not picking on them. Uber Eats, someone goes in and puts an order in through Uber Eats. Uber Eats has their own order ID. Maybe that gets sent to the restaurant. Maybe in the restaurant, they're tracking it with
Starting point is 00:37:05 their own order ID that they put back to that. Now that's going to be very inefficient and hard to go through. So if the restaurants then make a modification so that they can observe end to end and have all this information by saying, we're going to take whatever order ID our delivery service is sending us and use that in our system and unify that piece of data throughout this entire process. Now you open it up so that all these parties can work together and then leverage the observability platforms or any other thing that they want to do to track this. It's interesting that this might drive inter-company cooperation for the purposes of improving their services and making more money, which I just find fascinating. And yeah, it's crazy.
Starting point is 00:37:49 It's crazy to me. Yeah. I just started to that, by the way, you can use Flipdish if you want, which is a company I've started. You can pick, you can pick, you can pick on them. So I find it very interesting because one of the things that we do at Flipdish is we, we surface the events as they happen to the customer in a UI. And that's something that's custom implemented separately as part of the software. I've always thought that there's a very close connection between what we do with observability data, which is almost like it's metadata.
Starting point is 00:38:24 It's data about the software system. But we put so much methodical and systematic work into instrumenting our code. That's not very far from then surfacing these types of things. If we had that list of events in your observability vendor, then it shouldn't be very hard to then display that in a different context, for a more operational aspect. I know that's mixing responsibilities a bit, but you're more likely to be more accurate, maybe, or you're looking at the
Starting point is 00:39:00 same data. So yeah, it's just an idea that I thought... I always feel it's really quite difficult that we have to build our own thing, but we also have things in our observability vendor to understand our system. It feels like an overlap. Yeah. One quick thing, Brian, on your idea, because what you were saying is you should kind of encourage different organizations that work in a similar field to come up with data standards, right? That's, in the end, a data standard. But data standards have existed for many years, right? I mean, I remember when I was in high school, and it was in the 90s, we had our main
Starting point is 00:39:41 teacher in class, he came with an industrial background, the steel industry, and he told us about systems that were exchanging messages in a data format that had been standardized for I don't know how many years even before that. So I guess now more and more non-traditional industries go into IT and need to communicate with each other, right? Maybe this is another kind of point in time where we see an explosion of new data standards, but business data standards. How can we track an order? How do we track delivery?
Starting point is 00:40:23 So that everybody that participates in the end-to-end workflow can easily participate. And that also then allows you to easily switch to one provider versus the other because you know that the API is still the same. In the end, it's about APIs and data standards. I don't know what it is, but there is something for food menus, for example, for restaurants. I found out, basically. Yeah. I'm sure people are, the businesses that are going to do best have already started tackling this.
Starting point is 00:40:55 I mean, that's the example Andy was doing. The fact that they were able to use that, that order ID means that a lot of them are tracking that the proper way. So it doesn't, and at that point it's nice too, because it doesn't even become a stretch to do it. It's not like, oh, now we have to go through and say, it's like, if there is that standard established, it's, you know, I don't want to say simple, but it's a heck of a lot simpler to be like, yeah, we can expose that as opposed to, you know, standard queue monitoring is how many are going in and how many are coming out. And there's no tie. It's just looking at the volumes in and out. Whereas if you have that ID going through,
Starting point is 00:41:35 you're now seeing when specific ones are going in and out, which is the big jump, right? That's what we need. Hey, quickly back to some of the observability things that I would like to ask you.
Starting point is 00:41:52 One of the questions that came up in our community, because I do run a working group within our customer base,
Starting point is 00:42:00 and questions that came up: how do you correctly do this, are there any standards on when you're rolling out a new version of a function
Starting point is 00:42:09 that is providing some additional capability? Are there any standards already on kind of progressive delivery? Is there something on, you know, deploying the serverless function but not yet releasing it, maybe through canary deployments, feature flagging?
Starting point is 00:42:30 Are there any best practices out there on how you deal with when you switch over from one version to another, and also how do you retire it? Because you want to eventually also get rid of functions that you maybe no longer need. Have you any experience with this? Unfortunately, I've got a bit of a non-answer to this. We tried a couple of times, at least that I know of,
Starting point is 00:42:56 to implement a progressive delivery with Lambda functions. I don't know if there's a way to do it. I don't know that if we actually found a way to do it, but from recollection, we mostly didn't find a good way of doing it. And I think one of the problems, I can't remember the specifics of the problem, but it was something to do with versioning. But what we did do was we would deploy Lambda functions with feature flags that we...
Starting point is 00:43:30 Implementation of feature flags that we created with things like Parameter Store and stuff like that. So you would read from Parameter Store and decide whether you open up the functionality or not. That's, I think... And we considered more managed feature flagging services as well, but we never actually implemented them. So we spent at least three or four years evolving a serverless architecture without any of these progressive delivery of functions. I guess my comment on
Starting point is 00:44:01 that is probably that that's a signal in itself. We never really had a problem. And when you do have a problem, it's easy to roll back. Again, it's a paradigm shift. It's different. You sort of want a progressive delivery if you have a big thing that you're deploying that takes a long time to deploy and a long time to roll back. Or it's hard to figure out what's wrong. Obviously, you don't want to break things in the first place.
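The Parameter Store pattern Toli describes, a flag the function reads to decide whether to open up the functionality, might look roughly like this. A plain dict stands in for AWS SSM Parameter Store so the sketch is self-contained; the parameter name and handler shape are hypothetical.

```python
# In production this lookup would go to AWS SSM Parameter Store, e.g. via boto3:
#   boto3.client("ssm").get_parameter(Name=name)["Parameter"]["Value"]
# A plain dict stands in for the store here so the sketch runs on its own.
PARAMETER_STORE = {"/myapp/flags/new-checkout": "true"}  # hypothetical parameter name

_flag_cache = {}

def flag_enabled(name, store=PARAMETER_STORE):
    # Cache the value across warm Lambda invocations so we don't hit the store per request.
    if name not in _flag_cache:
        _flag_cache[name] = store.get(name, "false").lower() == "true"
    return _flag_cache[name]

def handler(event, context=None):
    # The new code path is deployed, but only opened up once the flag says so.
    if flag_enabled("/myapp/flags/new-checkout"):
        return {"version": "new", "order_id": event["order_id"]}
    return {"version": "old", "order_id": event["order_id"]}

print(handler({"order_id": "A1"}))  # {'version': 'new', 'order_id': 'A1'}
```

One caveat with the caching: a warm Lambda keeps the cached value until the execution environment is recycled, so flipping the flag in the store takes effect gradually unless you add a TTL to the cache.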
Starting point is 00:44:27 So, yeah, so in a way, we don't have a way to progressively deploy serverless functions. But then you have, I guess, in our case, the resilience of queues and event-driven architecture. So, again, it's a paradigm shift in a way. So in practice, progressive delivery became more of a nice-to-have. Yeah. Yeah. I mean, I've been talking recently a lot about
Starting point is 00:44:54 feature flagging for multiple reasons. One, we launched OpenFeature as a standard last year, as a CNCF open source project. And therefore, I'm just a little bit more familiar now with all the use cases. And for me, most of the use cases that I had in mind were like maybe some A/B testing. You turn on a feature for an end user, and then you figure out if it works, and then you give it to somebody else. But what I also recently learned, and this was in a conversation that I had with a customer, and he said, for us, the biggest adoption for feature flagging,
Starting point is 00:45:31 this could also work well in the serverless world, is we're building new functionality that assumes a certain backend system to be available, or to have been upgraded, a backend system that maybe we are not controlling, a third-party vendor, right? Let's say a third-party vendor promises a new API, so we are implementing new features based on that new API, but we don't know, is it really going to be ready next week, next month, or next year? But we don't want to just keep the code lingering around and never deploy it, because there's a lot of cognitive load on our developers, because they're always fearing, hey, a month from now, will the old code that I wrote a month ago finally be released
Starting point is 00:46:09 or not? And I still have to remember it. So what they do is they're using feature flags to really be able to deploy new code, but not activate it until that point when the backend system is ready. And then they can flip the switch. And let's say the backend system all of a sudden breaks and fails and doesn't do what it's supposed to: turn off the feature flag, go to the old system, go to the old API. Another example that I also learned: if you have particular times when you are allowed to do certain things as an organization, I don't know, maybe you have a marketing event and you're selling things from one to two for a special discount. You can also use feature flags to maybe then show a special banner, but you don't have to deploy at
Starting point is 00:46:55 one and then roll back an hour later. And I think these are interesting use cases for feature flags as well. And the first one that I mentioned could be very interesting also for serverless functions. So you're deploying a new version that has new code, but you're still kind of holding it locked behind the feature flag, and then you just turn it on when the time is ready. Yeah, I think the use case that I was very interested in, if you use a managed service that evolved these things and made them work really well, then you get this idea of releasing things to a group of people, maybe in your company, and that's in production,
Starting point is 00:47:39 and that's exercising production code. So releasing things to segments of people is really, really powerful. But unfortunately, I don't have experience with that, so I can't really comment. No worries. Another question that I have for you is the topic of cold starts. I'm not sure if that's still as big of a problem
Starting point is 00:47:58 as it used to be in the early days of serverless. So I'm just throwing it out to you. Cold starts, was this ever a challenge? Any considerations on it? Any best practices? I get this question a lot from people that haven't got serverless experience. Bam, Andy, you just got smacked. I'm kidding. I'm kidding too. I'm just kidding. Yeah. So don't ask that question again.
Starting point is 00:48:30 No, I mean, so I can see the problem, because my background is in .NET. So as a .NET developer, you don't want to put the .NET runtime on a serverless function, or you do, but then it's typically more for async jobs, more back-office jobs, integrations, for example, that are not that time sensitive. But then people think, well, if you're running a website with a backend being functions,
Starting point is 00:48:53 then you will basically hit call start and you'll have a very slow system. Our experience was that our system got faster when we got a lot more traffic, which is what you expect from cold starts, and it got more efficient. But our base response time was never a problem. We never really had any situations where we had problems with cold starts. I think, obviously, our configuration was more Node.js.
Starting point is 00:49:29 And I think with Node.js, you don't have that many problems. Interestingly, I'm just going to reference someone who I collaborated with at Datadog, AJ. He's a serverless product manager. He wrote a great blog post about SDKs, like what SDK you use, what AWS SDK you use, and how you use it might affect the duration of your Lambda function, for example, things like that. But that's less about cold start.
Starting point is 00:49:57 So you start thinking about these type of things. So maybe the telemetry libraries actually, interestingly, might be a bit more of a problem than the runtimes, has been my experience. But overall, I'd say that we didn't spend three, three and a half years talking about cold starts. So it was a bit of a non-problem for us. Okay, cool. And then, because last time we talked about kind of, you know, engineering best practices, how do we become an observability-driven organization? Specifically on serverless, if people are listening in and they also think, you know, serverless is the right choice for us, what do you need?
Starting point is 00:50:40 Any other things besides obviously having them to listen to the previous podcast, but any additional things that you want to tell people that really makes them successful with serverless, especially around observability? What do you need to do in order to make sure that developers put in the right level of observability from the start? What do you need to provide as an organization to make this easier for them? I've just had a talk at KCD Munich, Kubernetes Community Days.
Starting point is 00:51:06 I talked about platform engineering. What can platform engineers do to provide guidance templates to just make the developer's life easier and more efficient? That's a really good question. I think it all depends on your context, obviously, but overall, you need the tooling, right? And I think you mentioned it earlier that you want
Starting point is 00:51:28 to strive to have one single pane. So in this case, you want to have one observability vendor rather than have multiple ones and have fracture planes across your understanding of your systems, across your organization. So the first thing to think about is what observability vendor you'll go for. And for serverless, there are options. There's loads and loads of options that are coming up, and also the more traditional ones. Vendors that come to mind are Lumigo, Epsagon, Thundra, I think, as well. Newer ones: Baselime is one that I've noticed that's kind of come up a lot.
Starting point is 00:52:07 So you can choose one of those, which are very serverless-centric, and they'll give you a very good view: all the UI will be focused on serverless concepts and the serverless paradigm. However, it depends on what you have, because if you have other things that you also support that are not entirely serverless,
Starting point is 00:52:29 then you might want to look at the more traditional vendors who also do serverless. So that's the number one I'd say, but you have to choose, you have to think about it a bit carefully. The other thing is for observability, specifically for serverless, how you can help engineers
Starting point is 00:52:48 to understand. I think you need to do two things. You need to do custom instrumentation, which is whether you're going to choose your tracing, the same thing as the non-serverless. You can use tracing
Starting point is 00:52:59 or you're going to use logs or you're going to use metrics. It doesn't... Well, I prefer traces, but ultimately, as long as you are consistent across the teams, or with what you do within your team, then that doesn't really matter. If metrics work for you, that's fine. The thing that's a bit different with serverless
Starting point is 00:53:21 is the metrics from the managed services: make sure you familiarize yourself with the metrics that they give you, and potentially the logs that they give you, because they're really, really important. You rely on them as much as you rely on your code. So although they mostly work, you want to have visibility of those. And I don't know, I think it's a bit funny to say, but it's not observability related,
Starting point is 00:53:51 but consider your architecture a bit. Are you building a serverless system that's mimicking your server-full system, or are you shifting your mindset to a more event-driven approach, an approach where you are building small building blocks and you can observe those, but ultimately they're part of a bigger choreography or part of a bigger orchestration? And then you start having questions like: where do you do orchestration?
Starting point is 00:54:25 Where do you do choreography? I think those questions are really important before you think about observability. And obviously, when you get to that, the biggest thing is, I think, going back to what you said, try and find a way to use whatever telemetry data you're using,
Starting point is 00:54:41 but try and visualize the steps between order created, for example, to order fulfilled, visualize it in a dashboard, and start understanding how that operates. Find the gaps, find the problems, go as quickly as you can to the health of your business transactions. Because serverless enables you to not care about, okay, you care about memory, but you don't care about CPU, you don't care about any of the things that we traditionally look at. But you start caring about your business, and that's the most powerful thing. So serverless allows you to think about your business transactions in a non-traditional way
Starting point is 00:55:26 because you're not bound and constrained by memory and CPU. So we think differently about it. Thank you so much. No, so that's why I'm very happy that sometimes I get comments like, you shouldn't ask this question, because this tells me that you are not an expert. Like, I should not talk about cold starts anymore. But I'm actually happy that I do, because that's how I learn so much. I know, it's amazing how time flies.
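The "order created to order fulfilled" view Toli describes boils down to a funnel over business events. A minimal sketch, with hypothetical step names and events: count how many distinct orders reached each step, and the drop-offs between steps show where to dig.

```python
STEPS = ["order_created", "payment_taken", "order_fulfilled"]

# Hypothetical (order_id, step) business events from across the architecture.
events = [
    ("A1", "order_created"), ("A1", "payment_taken"), ("A1", "order_fulfilled"),
    ("B2", "order_created"), ("B2", "payment_taken"),
    ("C3", "order_created"),
]

def funnel(events):
    """How many distinct orders reached each step; gaps between steps are where orders stall."""
    seen = {step: set() for step in STEPS}
    for order_id, step in events:
        seen[step].add(order_id)
    return {step: len(seen[step]) for step in STEPS}

print(funnel(events))
# {'order_created': 3, 'payment_taken': 2, 'order_fulfilled': 1}
```

Note there is no CPU or memory metric anywhere in this view, which is exactly the shift Toli is pointing at: the health signal is the business transaction itself.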
Starting point is 00:56:05 Do we have any other things that we need to discuss? Anything we missed that is important? Anything that made you successful in your previous job and will make you successful now with Flipdish? I think the most important thing, if people have not listened to the previous episode, is that observability needs to become a core practice within your teams, within your software engineers.
Starting point is 00:56:36 It's not enough to just buy a tooling. It's not enough to also instrument and create monitors and automate all that. You need to be understanding what you want to instrument and what you want to observe on the other side. With serverless, I've got a bit of a plead to all the observability vendors. Think about the serverless paradigm and how it's different. Suddenly you care about implications rather than uptime. You care about events.
Starting point is 00:57:08 If your serverless has naturally driven you down the road of event-driven architectures, you care about events. Those are your protagonists, and you mentioned them, the business events. Make those a protagonist in your UI, in your UX of using observability vendors. That will really help the serverless architectures and look at the language and the concepts and the notation that's used in serverless and
Starting point is 00:57:35 adopt some of it so that reserved, I can't remember what it's called now, reserved concurrency or things like that are more important than scaling out servers and scaling up servers and things like that in the serverless world. So those are the two things. Care about observability if you're a software engineer and if you're an observability vendor,
Starting point is 00:57:56 try and make these concepts a protagonist in your UI. I will bring this to our engineering team. This feedback is really good. Amazing. No, of course. We are here to educate
Starting point is 00:58:13 our global community that may or may not use our products, but I think it's great to learn from folks like you that have worked in this field and know what challenges are out there and how tool vendors can
Starting point is 00:58:27 do a better job in helping people like you, people, engineers, architects. And so I'm definitely taking this as great feedback back to the engineering team. Perfect. The question now is, right, what's the next topic we're going
Starting point is 00:58:43 to discuss on the next episode? And Toli, you've already been on twice. The next topic is when you start paying me to be a
Starting point is 00:58:51 guest. We'll look at our budget, which is zero, and we'll give you 50% of our operating budget. Yeah.
Starting point is 00:59:03 Amazing. And I'll just start shooting down all your questions then. No, it's been great fun. I've learned so much. It's really, really good
Starting point is 00:59:15 to connect with other observability nerds, I guess. Sorry, I didn't mean to call you nerds. No, no, no. That's what you are. And yeah, I really, really hope
Starting point is 00:59:27 that the serverless observability just becomes better over time because it is really, especially the step functions bit, it's just mind-blowing really. So yeah. Yeah. Awesome.
Starting point is 00:59:42 Yes, thank you so much. I've got nothing to add this was just such a jam-packed episode that yeah the only thing I can add is thank you so much
Starting point is 00:59:51 for being on again it's been a great pleasure to have you back and we hope our listeners feel the same and we really you know
Starting point is 00:59:58 One thing I do take away is, again, that community that we have here in the IT world, the modern IT world; obviously it was built from the past. I mean, we can see
Starting point is 01:00:10 all the evidence of where this came from. And as long as we keep driving towards sharing information and working with each other, I think this will continue to be a great community to work in. So thanks for everyone for maintaining that. And yeah, that's it. Thanks everyone for listening.
Starting point is 01:00:26 Until next time. Thank you. Bye-bye. Thank you. Bye-bye.
