PurePerformance - 061 Serverless Performance, Monitoring and Best Practices from Fender

Episode Date: May 7, 2018

Serverless comes with its own set of best practices, quirks and benefits when it comes to monitoring and performance engineering. In this episode we have Michael Garski, Director of Platform Engineering at Fender Musical Instruments ( https://www.linkedin.com/in/mgarski/ ), giving us a technical deep dive into lessons learned and best practices from re-platforming their architecture onto AWS Lambda. We get to learn about optimizing Cold Starts, Re-Using HTTP Connections, Leveraging API Gateway Caching, and finding the sweet spot for CPU & Memory settings to optimize the price/performance of AWS Lambda executions. For more details check out Michael's slides on Innovating Through React Native Mobile Apps. ( https://www.slideshare.net/AmazonWebServices/innovating-through-react-native-mobile-apps-with-fender-musical-instrumentspdf )

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello, everybody, and welcome to another episode of Pure Performance. My name is Brian Wilson and as always with me, Andy Grabner. Andy, are you there? I am there. And I know we are both excited about this episode. Yeah, and you know, I have to say it's funny because as the theme music was playing, I realized, you know, there's one instrument that's not present in that theme music.
Starting point is 00:00:50 And it's very relevant to our guest today. Can you guess what that is? Hmm. Hmm. Come on. Throw it out there. There's no six-string guitar. Yeah. I say that because there is a bass in there.
Starting point is 00:01:01 So I guess there's some relevancy because there's a bass guitar in there. So that's extremely relevant. Before we go on, Andy, how have you been? It's been a long time since I talked to you. What, two days, three days? Yeah, about that. Actually, a lot happened last weekend. I actually saw Dan Auerbach playing at the House of Blues,
Starting point is 00:01:20 which brings me again back to the theme that we have today, or the guest of honor. Very interesting concert. I've never seen him before. I was very, very happy to be brought into that concert. That's cool. I didn't know you played. Yeah, in Boston, House of Blues.
Starting point is 00:01:38 Very cool. Nice. So, yeah, why don't you introduce our guest today? Yeah, just as we talked about performances, obviously we talk about performance today. We talk about serverless performance. And the guest we bring in is Michael Garsky, hopefully I say this correctly, Director of Platform Engineering at Fender Musical Instruments.
Starting point is 00:02:03 Michael, are you there with us? Yes, I am. And you got that pronunciation proper. Awesome. Perfect. Where's the last name come from, if I may ask? Garski. It's Polish.
Starting point is 00:02:13 Polish. We have a large engineering team out of Gdansk in Poland. Do you know, do you happen to know where your ancestors come from? Or you maybe? Were you born over there? I do not know um i grew up in wisconsin area so there was lots of ski names everywhere all right cool hey michael thanks thank you so much for the show uh for coming on to the show and for obviously also doing a good
Starting point is 00:02:38 show with us today and you actually it was a good show we'll wrap it up now and we'll talk to you all. Thank you. Exactly. It was the shortest podcast we ever did. No, but seriously, so you actually presented at the AWS Loft recently, and you also told me that you also presented at the Dataverse Reinvent last year. And I think the title of your presentation was Innovating Through React Native Mobile Apps. And one of our colleagues, Bill Sajak, he saw it and he said, hey, guys, you need to interview Michael. Because it's very interesting to learn more about the architecture of what Fender actually built their applications on and learning more about serverless performance. So, Michael, I want to just throw it over to you. Maybe start out with,
Starting point is 00:03:26 for people that don't know Fender, you know, who is Fender? And then let's dive right into some of the stuff that you guys built using serverless technology from AWS. Sure. Yeah. Well, I think many people are very familiar with Fender. We've been around since 1952, but the creation of the iconic telecaster and then later the stratocaster precision bass jazz bass jaguar mustang it's that and amplifiers as well from tube so we still make tube amplifiers and tubes and also have a very good line of connected digital amplifiers as well. And I'm part of Fender Digital. And Fender Digital, we do all of our flagship product is Fender Play, which is a digital learning application for people to learn how to play guitar.
Starting point is 00:04:22 And we also have a, so that's one application we have that's both on Android and iOS. We also have Fender Tune, which is a tuner guitar tuner that's also android ios and it's same thing with fender tone which is the uh companion app to our mustang gt amplifier series allowing you to remotely control your amplifier and share your presets and settings with other users of the community that's interesting so the i know we're not here to talk about that bit but the the tone thing is that like are these modeled amplification or are you just talking about you your amplifier your own settings are digitally recorded and you can then pass those off and preset them to someone else these are digital modeling amplifiers.
Starting point is 00:05:05 So we have the Rumble bass amps that just came out and the Mustang GT guitar amps. It's like having a big pedal board all within your amp. Andy, we can start going all different ways on this. And so before we dive into this, Andy, I did just want to bring up for our listeners, we are going to be talking about serverless and Lambda a lot today. And if you're just a little bit still unshaky on the whole concept of serverless, go back. I just looked it up. Episode 42 is where we did kind of like the 101 on serverless.
Starting point is 00:05:34 So if some of this might be a little bit confusing in terms of how serverless works and everything, go back, listen to episode 42, then come back and listen to this one, and it'll probably make a lot more sense, hopefully. But with that, go on. Sorry. Okay. So we can dive right into some of the topics here. We can start out talking a bit about serverless performance and some of the things that we've encountered and our experience and what we've learned from that. Yeah, that will be perfect. Great. So one thing that you always hear about quite a bit
Starting point is 00:06:08 released with AWS Lambda is a cold start, which the first time a function is invoked after code is deployed or if it's been idle for a period of time and they've recycled the container underneath you, it actually will download your code, initializes the container, and gets it started. So there can be a brief delay on that initial invocation that could result in a user getting a slow response. If you return an error from a function,
Starting point is 00:06:40 because usually the function arguments you can return both a payload and an error, if you return an error, it will end up triggering another cold restart, cold start on it. So that if you can avoid returning errors when necessary, especially in a synchronously invoked function, it's a really good idea to do. Another thing that can really make the cold start time longer than you anticipate is using a vpc so if you need to access resources within your vpc say an rds database or your own internally set up elastic search cluster or whatever other resources you need it has to attach an eni onto that container an elastic network interface and that takes it can take up to seven, eight seconds for that cold start to really occur to get that ENI attached.
Starting point is 00:07:29 So every time you start, that has to get attached to it. It's part of like the build out, right? Correct. Yes. Yeah. I hadn't thought about that. And one thing to really worry about with the use in VPCs is that having your Lambda functions in a VPC, those ENIs, the quantity that
Starting point is 00:07:47 you have available to you, they're limited at the account level. So if you have a large number of functions and they're all attached into the VPC, once you hit that ENI account level, it's not going to, with a big request spike, it's not going to allocate more containers and you'll start returning errors back. So one thing you can do to sort of help prevent that is, well, only use a VPC if absolutely necessary. Like if you're accessing Amazon services such as S3 or DynamoDB or the AWS Elasticsearch service, you don't need to use a VPC. And if you do, AWS Lambda now has concurrency control, so you can limit how many instances of a container will be running at any one time. So you can kind of get an idea of what your E&I limits are and ask for account limits if you actually do need to access resources in the VPC.
Starting point is 00:08:40 Other things we've seen is our reuse of HTTP connections between invocations. All our lambdas are written in Go. We've done that from the beginning, from last year, even prior to native Go support that AWS launched this last year. We use a framework called Apex, which uses a Node.js shim to run the Go binary. And the client is called Elastic. It's the Golang Elastic Search client. And it will keep the HTTP connections around in the back end. But we found that between invocations it's like they go nowhere. So we actually have to close up
Starting point is 00:09:19 the connections before the function invocation completes and release those resources. Another good tip for performance is definitely take advantage of, especially if you're using API gateway in front of your functions, API gateway caching. We have that on our CMS and lesson items. The items don't change very often and we get a greater than 95% cache hit ratio. So that really saves us a lot of function invocation times and makes it very fast for the end user.
Starting point is 00:09:53 Can I ask a question on that one? Because a lot of this, I think, well, both Andy's got quite a bit more AWS experience than I do, but I think for a lot of our listeners, too, it's a lot of new stuff. So obviously, I know the API gateway in front of the Lambda function. So you're saying you can use that API gateway sort of as a caching layer so that when you invoke or when you call data from your Lambda function, you're storing it in there and running there. So the next time you make a hit, you don't actually have to make a call back to Lambda. Is that kind of the architecture you're speaking of? Absolutely correct. Yes. That's really cool. And we actually
Starting point is 00:10:29 hit it up within our, we use it heavily within our CMS. And whenever we publish updates, we can actually, our API, when an update is published into production, we actually can flush the cache to make sure those new changes are available right away. And it doesn't take 20 minutes to show up. And Andy, this makes me think of, you know, we see so many of the same problem patterns over and over, especially going from, you know, the monolith to microservice kind of style. And even though serverless is considered this other kind of thing, it really isn't. And you still have the same problem patterns to avoid.
Starting point is 00:10:59 So it's just to me, the interesting idea of throwing caching in front of the Lambda functions so that you call less of them, thus saving the money of hitting them, is really a very interesting concept. Yeah, I mean, in any type of architecture, right, if you have content that doesn't change that regularly, the closer you can get it statically to the end user, the better and the cheaper it is. Now, Michael, I have one question here. For people that may not know the API gateway that way, then the option when it comes to caching, does the API gateway cache itself or is it where you're using S3 or CloudFront as a cache or how does this work?
Starting point is 00:11:40 And is it configurable in the API gateway rule definition on where it's cached and how the cache works? Yes, you can actually specify how much your cache size is and they do charge you based on an instance size type of thing. So you can have a larger cache or a smaller cache. And it is cached within the, it's a feature actually within API Gateway itself. You can specify which routes are cached, which are not, and what would make up the uniqueness of a cache key. So if you're caching, if you want to cache based on a user level, you can have your authorization header as part of the cache key.
Starting point is 00:12:20 That's cool. And actually, one thing, one little piece of information you didn't mention, but it's in your notes for the podcast, was the costs that you have for Lambda. And that was actually quite fascinating, because if I read this correctly, you said $80 per month that you paid for Lambda. But can you also kind of tell us what the costs are for the rest of the infrastructure? Because $80 seems like nothing for your Lambda. But then knowing that 95% of the requests never hit Lambda, what's the cost of the cash as compared to if you would push everything through to Lambda? Oh, the cash cost I don't have off the top of my head, but I don't think it's very significant. So our original services that we started building about two years ago were all based off of EC2-based microservices. And then in early 2017, we just skipped right over containers and dove wholeheartedly into Lambda.
Starting point is 00:13:22 So one of our EC2 services that is still running is our authentication service. So average is about four requests per second. These are C5 instances. We have used like one year reserved instances and they cost for two of them, it costs us $80 a month and it's 2% utilization. Whereas all of our Lambda functions, which includes everything that supports Fender Play, Fender Tone, and Fender Tune, that's about 40 invocations per second, and that costs us $80.
Starting point is 00:13:58 It is fascinating, right? If you think about this new world where we talk about these costs and it's amazing how how how low these costs seem for something that you run right you run your online business on 80 worth of lambda and 80 worth of ec2 for your authentication service and uh it's fantastic it's phenomenal and i also love the fact that you skipped right over containers that's i think that's the first that i've heard that's really and it's funny and like just in my head i keep on thinking like as you're explaining
Starting point is 00:14:32 all this stuff i'm like who would have thought fender you know guitar company was so advanced it just goes to show you you never know um you know where the where all the smart people are like like again andy going back to um you know capital one with you know okay it all the smart people are. Like, again, Andy, going back to, you know, Capital One with, you know, okay, it's a bank, but like they're super, super advanced on their tech side. So it's just always cool and shocking to hear who it actually is. Not that we thought, you know, not that I ever thought Fender would be like some backwards technology company, but never in my head, like, oh, who do you think is cutting edge?
Starting point is 00:15:05 Yeah. Hey, and Michael, I got one more question. So you still have the authentication service running on bare EC2. And did you say what technology stack that is? That's a Golang-based microservice. We're using an RDS-backed instance for that due to the cost savings and also so we can make our application multi-region, we're going to be migrating that either to Cognito or
Starting point is 00:15:30 our own serverless based solution. Okay. In terms of performance, you also had mentioned something about memory. We kind of cut you off before you got to that one, but there's some talk about testing there, so I want to make sure we get to that one because that's going to be interesting, I think. Yes, so with Lambda functions, you can allocate memory anywhere from 128 megabytes to just over three gigabytes. And CPU increases proportionally with that memory
Starting point is 00:15:57 as well. So even if your function is maybe only consuming 50 megabytes of RAM, you may want to give it one and a half gigs to make that fastest response. It's kind of a bit of experimentation to find out where that sweet spot is between memory allocation and performance. For synchronous functions, we generally use 1.5 gigabytes or more. Asynchronous functions that are more event driven. Those we tend to go with a lower amount because it doesn't matter if it takes an extra 100 milliseconds to execute. Now, when you're setting up Lambda for this, are you, I know part of the pay is for how long it's running, right? But does the memory or CPU consumption come into account on what you're paying for? Or is it just strictly time?
Starting point is 00:16:45 In which case, the idea could be, as you're saying, find that sweet spot. And if it needs more CPU and memory, run it because it might actually cost you less. How does the pricing work for that? The pricing is a combination of both time and how much memory. So if you're allocating more memory to a function, it's going to cost you more per 100 millisecond increment. Right, okay. So it's – all right. Good, good.
Starting point is 00:17:06 That's interesting, too. Hey, and do you, Michael, if I may ask, because this is very interesting. So part of your performance engineering then in pre-prod is to actually find the sweet spot per function. So that means we're expecting that type of load, and we want to find the sweet spot in terms of the memory setting and the CPU setting on my serverless container. That's actually part of your pre-prod performance engineering? Yes, we play around. We mess around in our active environment whenever we're doing testing and developing to see
Starting point is 00:17:40 what would be best. That's cool. You know, that goes back, Michael, that goes back to something Andy's been trying to encourage people to do for a long time now, has been finding the cost of your deployment. You know, what's the cost of the new code, and how can that be done? And I think, well, yeah, that's just a perfect example of it. And I think with Lambda, by the way, it's priced, it's even, it's very, it's just right out there. You can see it very clearly so
Starting point is 00:18:06 i think that's an awesome metric to add into that i was um two weeks ago i had the chance to meet up with one of the guys at google in in zurich and they there we talked about the cost of a request so they're actually measuring the cost of every single request and then based on that make decisions on how to scale up the individual services and actually how to redirect traffic and how to uh kind of throw back requests to the the calling side and it was very fascinating um you know similar concepts but uh yeah very cool stuff that you guys are doing here and also i thought um what what google was doing there trying to really measure
Starting point is 00:18:46 every single request and based on that, then make the decision on how to scale or which requests are kind of rejected back to the initial caller and then hoping that the initial caller is retrying and then hitting a service
Starting point is 00:19:02 potentially that is less expensive. So there's different thoughts that we bounced off when we made this discussion. Cool. So performance, cold starts, GPC usage. Yeah, go ahead. Oh, on the memory setting, I have one quick thing. One thing we really want to investigate is actually doing CPU profiling on function execution to find out when you get no more return.
Starting point is 00:19:25 So you can kind of crank it up to the memory. And after a certain point, you don't get it. You get diminishing return. So you know where to keep it at. Kind of use that tool to kind of determine that. Yeah. Cool. In terms of so, and I'm looking at our notes here, and I think this was fantastic.
Starting point is 00:19:42 So we talked about cold starts, the VPC usage, the reuse of HTTP connections, the API gateway caching, and also the memory settings. Now, obviously, a big question here is all of these things need some type of measuring and monitoring. I assume you are obviously having your own different tools that you're using. There's a lot of stuff out there. Can you fill us in a little bit on how you monitor? And obviously, we, with our background in monitoring, would also be interested in what the shortfalls are of APM, as you put it in your notes. That's very interesting because this is where we want to learn. Now we can say, hey, okay, what can we do to help companies like yours to get better insights into architectures that you guys are building?
Starting point is 00:20:33 Yeah, so we rely very heavily on CloudWatch logs and CloudWatch metrics for measuring performance of our Lambda functions. And we actually with APM tools, say such as like Datadog or New Relic, what they'll do is they'll have an agent that, like with Datadog, there's an agent that runs on the server. You can't use that agent in a Lambda function. With something like New Relic, they've got a Golang API client, but it batches up requests. It's meant to sort of
Starting point is 00:21:04 funnel them off like once every 30 seconds to push those stats up to their servers. And while it is possible to configure that client to flush those stats after X, before the invocation of the function ends, it adds time because it's got to send that up to the server. So your error functions ended up taking an extra 50 to 100 milliseconds to execute to flush those stats. Because even if you've got that client that's batching things up, your container could go away. If you've got one request and if it's a seldomly invoked Lambda, there may not be another request for 10 minutes, that performance data is gone.
Starting point is 00:21:43 So that's the major shortfall with those APM tools is they're flushing things out to a third-party service. But with CloudWatch, yeah. CloudWatch metrics have all that stuff. Can you, yeah. Sorry, I will come back to that later on. Yeah, go on. And then I have some comments on that and also ideas and thoughts.
Starting point is 00:22:04 Sure. Go ahead. I'd like to hear your ideas and thoughts on that. So, I mean, it's interesting because these are obviously the challenges that we also heard from people that we worked with in the early days of serverless. And as you just said, you know, it's not possible to install an agent on a machine that you can't control because it's AWS who controls the underlying machine. And you don't want to batch up any data that is then asynchronously sending it back at some point, and then you potentially lose it if the container goes down. So without obviously too much of a commercial here, but you should definitely look at the audience, the listeners should check out what we have done when it comes to serverless monitoring, because we figured out a way how to trace requests end-to-end. And I know you'll probably go into later, maybe talk a little bit about X-Ray. But one thing we have done as a tool vendor in that space is making it seamless to actually trace Lambda functions end-to-end. And we've done this
Starting point is 00:23:03 for Node.js in the beginning and also for Java and Go is on the list. So that's why it will also be very interesting then once you dig deeper into how you get better transactional data to get your feedback then actively from the way we are doing it. So because we try, we heard these problems
Starting point is 00:23:22 and I believe we are on the right way of, we solved them for Node.js and Java already and for Go, that's on the list. So that would be very cool. But we heard the same thing that you just brought up. Yeah, whenever you have a Go client ready, I would love to dig in and play around with it
Starting point is 00:23:40 and see what it does. Yeah, so CloudWatch logs and metrics is what we use for capturing all our performance. And we actually just use Datadog for dashboarding all those metrics within there. And they have a neat functionality with custom counters as well, where if you write a log entry to CloudWatch logs formatted in a specific way, you can actually create metrics and tag them. So if you have certain events or want to keep track of response codes from a third party, you can actually set counters and alerts around that. Now, the downside with CloudWatch
Starting point is 00:24:17 logs. So we look at, if you think about traditional microservice, we sort of do that same sort of approach with Lambda. So we'll, a collection of Lambda functions that operate on a given business domain is a service for us. And the advantage to using Lambda is you can have multiple inputs to the service, whether it be SNS, direct invocation, S3 events, DynamoDB streams, web requests via API gateway, and have shared business logic among these functions and just have all these different inputs and responding to different things. And with all those different functions, CloudWatch logs, those logs are siloed for each function. And so it's really difficult to kind of see like, oh, this request went here and then it did this and what happened and what services and functions did this touch in that. So we actually have a Lambda function that feeds off of all our CloudWatch logs. And this is a Lambda function we have that actually is part of our VPC because it dumps it into our own Elk stack. So all of our logs will get dumped into that so we can actually see them in one unified spot. And we also recently started using a tool, I'm a very big fan of Honeycomb, honeycomb.io. And we sample our logs and we send them up to them. It's very, very fast and easy
Starting point is 00:25:37 to dice through if we have issues and kind of find those unknown unknowns and really discover what's happening with your performance. And you mentioned X-ray tracing. It's something we have not, it's something on our list to do this year. We haven't really dug into it yet because with using the Node.js shim with the Go binary, we could use X-ray,
Starting point is 00:26:02 but we would have to manually create all the segments and set everything up. Now with the native Go support, we should be able to use X-Ray so it automatically tracks all our requests to publishing to SNS and calls to DynamoDB and things of that nature. Yeah, that's pretty cool. So let me ask you a couple of questions here. So you said the way you solve the CloudWatch quote unquote challenge, where the metrics go into every single, you know, are obviously collected by every single function. You are you're pulling the data through a CloudWatch event and then you're pushing it into a central, is it ELK instance? That's what I think what you said. Yeah, we just feed those logs in. Yeah.
Starting point is 00:26:49 What are the patterns you're looking for? Is this, I mean, what are the main use cases? This will be very interesting. What are the main use cases you are detecting by looking at these logs that are coming in? What are the main things you find every day? The main thing we're looking for is sort of performance. So we can see, we can do a heat map and see what's going on with the time,
Starting point is 00:27:13 the function invocation, if we're seeing a lot of errors, what the status codes are coming back and actually dig in. And actually, if a problem happens, we can, that impacts customers, we can actually get a list of those customers and provide that to our support department so they can preemptively reach out and contact people like,
Starting point is 00:27:34 hey, since you had a problem signing up for a subscription, we apologize. Everything's been corrected now, things like that. Cool. And I assume you have set up some alerts, like it means if you have a failure rate of, let's say, higher than 1% or 2% or higher than usual, then you proactively feed this information to your support team or is your support team kind of more reactively pulling this data in? It's a little bit of both. So we get a bit of reactive if we're like one or two things go through. But if we see something large and we can be proactive about it, we are. Cool. And when you say you see something, then you become proactive. That means with we,
Starting point is 00:28:20 you mean your team that is constantly looking and analyzing the data and keeping an eye on kind of the health status of your system, and then you are alerting the individual teams in case something is abnormal? Or did you also build in some automated mechanisms? We have some automated mechanisms that will alert to pager duty if 500 errors, they exceed a certain limit. Engineers will set up specific alerts for things that they're kind of keeping an eye on. One thing that we recently had an issue with was we had a production release to our subscription management service and everything passed successfully in QA, but we had an issue in production due to a configuration with how our subscription billing provider was configured in their test environment versus their
Starting point is 00:29:11 production environment. And we were able to quickly see that when it was happening and know that, oh, it was tied to this release, we need to roll back. And it's also prompted us to start looking into the canary deployments and traffic shaping within Lambda to modify our release process. So that the releasing engineer that releases it out of production can say 20% of traffic will go to the new version, kind of keep an eye on things, and then decide whether or not to go 100% or if we need to roll it back and investigate something. Now, that's pretty cool. I mean, this is, I mean, we talked about this before, before we started the recording, that this is a very exciting topic and that we see a lot out there.
Starting point is 00:29:51 And Brian and I, we both have been in the testing space for many, many years. So obviously we know how much it takes. I mean, that we should all invest in testing early, testing often, testing automated, but certain things, you know, you can only really, you know, find and experiment with in production with real users. And as hard as it is for me to say, sometimes our end users are our guinea pigs.
Starting point is 00:30:17 Sometimes that's the way it is. Production is the new – there was the old – remember the – I think it was – I forget what beer it was, but there was the old um with the old remember the i think it was forget what beer it was but there was the most interesting man remember that campaign there was yeah yeah that got like uh commandeered for a lot of other memes and one of them was a picture of him and it says i don't always test but when i do i test in production i've seen that one many times yeah yeah hey and michael today to the can deployments that you're looking into, and you said you will have the ability to say 20% of the traffic, you know, try it out, get the results, look at the monitoring data, and then decide is it good enough or not. Do you know that feature that AWS provides, the 20%? Is this something that would then be sticky to end user sessions? It obviously has to, right?
Starting point is 00:31:08 So you can control it by user stickiness or by session stickiness or by geo. Do you have any ideas or any insights in how this traffic shifting works? We're just investigating it now and seeing what we need to do to modify our processes to take advantage of it. I'm not sure if I don't believe it does anything with user stickiness, but that's actually a very good point and something I would want to follow up on. Yeah, because if you do it 20% across all the requests, you may end up having individual users hitting different versions. The same user hitting different versions of your function.
Starting point is 00:31:45 It has to be somehow sticky to your end user, which for me the only way to make sense because otherwise you end up with a combination of different versions for a single user. It might be challenging. Yeah, I wanted to go back to some of the stuff you said on the monitoring side because I think this is the problem that you solved with your architecture,
Starting point is 00:32:06 where you are combing through the logs, and then you are looking at the performance metrics, and then you are alerting on failure rates and all that. I mean, it's great to hear that you're doing this. And it's just, I think for me, it's a confirmation that we've also been doing the right thing. So just to give you a little background on what we are doing, and I'm sure the other, you know, APM vendors in the space or other monitoring tools do it in a similar way. We're automatically doing multidimensional baselining,
Starting point is 00:32:33 as we call it. So we automatically baseline your response time, failure rate, and throughput based on individual endpoints. So in your case, it will be your individual functions. And then we calculate baselines and then we can alert based on baseline violations. And then we would, for instance, I think you mentioned PagerDuty, right? That's the tool using them. We can trigger that. So that's
Starting point is 00:32:57 good to know. So you're watching out for the same things that we've built into the product. I'm looking at your presentation right now because your presentation is up on SlideShare on the Amazon Web Services SlideShare account. And I mean, stuff that you explained verbally, you have some very nice architectural diagrams where you see Lambda functions calling other Lambda functions
Starting point is 00:33:19 and then going into DynamoDB and calling other Lambda functions. Do you, how many, how often does it happen or how many hops does a request take from let's say mobile app all the way until the very end? Is it one hop? Is it two hops? Is it up to five hops?
Starting point is 00:33:40 How many hops does a request take? So to service a user request, there's generally just two. It hits a custom authorizer, and then it hits the function that actually executes the business logic and returns the response. But then in the course of handling that response, say something is written to DynamoDB, then there's a function that's not tied into the request something is written to DynamoDB, then there's a function that's not tied into the request, but feeds off that DynamoDB stream and say decides whether it needs to raise an event. So for example, let's say when someone modifies their subscription or starts a new subscription defender play, the DynamoDB stream function off that table looks at
Starting point is 00:34:24 previous state of their subscription, previous state of their subscription, current state of the subscription, and decides whether or not it needs to raise an event that the subscription has changed. That event gets raised and it can get picked up by different applications. It's currently picked up by two applications, one that updates our email service provider and one that updates our data warehouse via segments. And then that's all. So it's all happening asynchronously in the background
Starting point is 00:34:48 after the users made their request. That's pretty, I mean, I can really encourage, we should, I want to definitely, if it's okay with you, you know, put the link to your slides up on the podcast page. It's really fascinating to see that.
Starting point is 00:35:04 Yeah. The reason why I mentioned, and I was asking for the number of hops, so one thing that I've seen... Because you're Austrian and you have beer on your mind? That's true. I just, yeah, so I said the word hops several times today, right? Yeah, that's right.
Starting point is 00:35:19 I'm so sorry. No, that's good. The reason why I'm talking about the hops, meaning from service to service and not the ingredient that makes the beer, what I call beer, is I see it a lot with our customers or people that we interact with, where they're also going down the microservice route or the Lambda, the serverless route, and then even employing or including things like service meshes. And they think about all these things that in the end end up with very deep service call chains, as I call them. You have one service call and the next, the next, and whether it's directly or indirectly through a table like DynamoDB. And what I've been advocating, and I would like to get your feedback on that,
Starting point is 00:36:04 what I've been advocating is treating a microservice like aoDB. And what I've been advocating, and I would like to get your feedback on that, what I've been advocating is treating a microservice like a LinkedIn profile. And here's my analogy, just to make this make sense. If I look at my LinkedIn profile, I have connections in my first grade and my second grade connections. So let's say I have 100 first grade connections and these 100 connections
Starting point is 00:36:21 have 1,000 second grade connections. And when you think about a microservice or a serverless function you have the same thing like a linkedin profile you have incoming connections and you have outcoming connections first grade and then these first grade have second grade connections i believe we should start also not only looking when it comes to performance on the real performance aspect of a service like response time and throughput and CPU consumption, but also the dependencies, the incoming, outgoing, and when it goes to outgoing, not only first generation, but second generation of connections. And then if we keep track on this in CI, CD, even if change is coming down the pipeline, we can immediately see if a
Starting point is 00:37:06 code change or a configuration change or including a new library, if that changes anything of the dependencies in my first generation and second generations. Does this make sense? Yes, that does, yeah. Because you mentioned, right, you're using caching. And if you are saying, or if somebody makes a mistake, you know, mistakes can happen, and somebody misconfigures your cache, or somebody, you know, removes the caching layer between two services, and all of a sudden this missing caching layer is increasing your dependencies by factor X, then you want to be aware of that.
Starting point is 00:37:46 But you might not be aware of it if you only look at response time of the service endpoint. So I think looking at these dependencies and the hops in first generation and second generation, this is something that I've been trying to advocate. And what we have within our Dynatrace platform, we have the ability to give you this information on a service-by-service and function-by-function level. We call it smart skip information, where we have the smart dependency map, the live dependency map,
Starting point is 00:38:17 and we can actually traverse the dependency tree through an API, which makes it very easy to use in a CI, CD setting, but also for constant validation in production. And I want to see what you think about this. That sounds like a very compelling tool, especially when you mentioned earlier when you talked about the baseline. So if a service is generally performing at a given point, but then all of a sudden it's gone up beyond that point, but it's not returning an error, that's something you would want to know about
Starting point is 00:38:51 and investigate before it became an error. And definitely all those dependencies. We try to minimize the dependencies between functions. There's only a handful of services that actually call out to other services in executing a request. But we try to keep things as much contained as possible. I think that's the tricky part right there, because the whole spirit of
Starting point is 00:39:16 microservices is, well, at least in one aspect of it is here's my service for anybody to use, and then not knowing who's using it gets really, really tricky. So the not knowing who's using it gets really, really tricky. Uh, so the idea of trying to keep it under control, um, I think that's just in general, the, uh, the alligator that everyone's got to wrestle when they move over. Um, I don't have any advice on how to control that, but you just control the best you can, I guess, and, and, and hope that it doesn't get too out of hand. But at least knowing what those dependencies are, I think, is a good key element to that. And obviously, you know, it's not like you're if you're one of these extremely large, let's say, banking organizations that have, you know, within them, there are almost 30 different companies.
Starting point is 00:40:02 That's where you can start getting really out of hand. I think the smaller you are, the more control you have over that situation for sure. Exactly. We have a fairly very good, very small team here. We have four engineers that work on the back-end services. So everyone's familiar with everything else, and we kind of switch off. So everyone has familiarity with all the different services and how they interact. That's a great practice, the switching.
Starting point is 00:40:26 Yeah. Cool. Michael, I know you gave us a lot of great insights in performance, what you're looking for when it comes to performance. You talked about monitoring. Any other best practices maybe that you have for our listeners once they start their journey down towards Lambda lane or serverless lane? One thing to sort of keep in mind and we're sort of is function explosion, like a large, large number of functions.
Starting point is 00:41:00 So with our API gateway invoked functions, we decided to go down the route of a one function per route method combination. So that, you know, rather than having like a router and using API gateway as a router and directing it to different functions. With language, if you're using Node.js or Python based serverless servers, you're not going to really notice much of a difference. However, with compiled languages such as Go or Java or.NET, the more functions you add, the longer and longer that build time is going to take until it just becomes unacceptable. So, you know, controlling the amount of functions and sort of we're now looking at ways to sort of reduce those like functions per event source. So API gateway is an event source. Okay, that's one function. And we're going to have a router, a muxer inside that function and handle all the different types of requests within that.
Starting point is 00:41:59 If we've got to listen on a specific SNS topic, that or a stream from a DynamoDB table to sort of limit those amount of functions just to the distinct inputs that you have. That's a big one. The other ones is our timeout detections, too. So you can set the time limit for a Lambda function. I think the shortest is five seconds and the longest is five minutes. However, when you hit that timeout, we rely on structured logging within our functions. We use a library called Logris to provide JSON formatted logging into CloudWatch that we then forward onto Elk.
Starting point is 00:42:37 And if the timeout occurs, it's not a structured log that goes into CloudWatch, so we don't really see it. So capturing those timeouts, say 200 milliseconds before the function would, AWS would pull the plug on you to cancel it and log something meaningful and return an error response to the user. That's a big piece. Do you get a chance, is there a callback or is there a log that Lambda is writing in case it kills your function? It does. It does write it to CloudWatch logs. It does say, you know, function exceeded, make your exact text function timeout.
Starting point is 00:43:15 And then when your function timeout, you get hit with another cold start on the next invocation as well. So if you can actually capture those timeouts, like say you're calling a third-party service that's a little flaky and sometimes it hangs, if you can capture those timeouts and return an error before the function itself times out, you can minimize your cold starts. Yeah, that makes sense. That's a good one. Thank you. Anything else? the best practice? A couple other simple ones is if you're doing very simple DynamoDB CRUD operations that you don't even need a Lambda function,
Starting point is 00:43:53 you can just go through API Gateway and use Velocity Template Mapping to get, put, delete items from DynamoDB. We use that within our Fender Tune application for users to create their own custom tuning and save them to the cloud. And there's no functions at all involved. It's just straight through API gateway mapping using a custom authorizer so we can get the user ID out of the authentication token. Cool. So that means this allows direct REST access to do CRUD operations on an MUDB without having to write any Lambda function in front of an MUDB that actually converts the REST call into the MUDB API call.
Starting point is 00:44:40 So that's really sweet. The only downside to that approach is logging if there are errors, say like if you have an error within your templates. And we're looking at beefing up our testing processes for those templates so that we can, because right now it's sort of a bit of an experimentation. You're playing around with the template. Does that work? Does that work? And sort of manual testing. We're looking at ways to automate that testing for those templates.
Starting point is 00:45:05 And one last thing that we're sort of doing, we're doing more and more now, is using step functions for decoupling functions. So Amazon step functions allow you to, say, coordinate function invocations and do various things. So we use Elasticsearch for full-text search within our offender tone for the community presets that are shared. And we fill that index from DynamoDB stream table, all the data stored in DynamoDB. The stream function updates the Elasticsearch index. And we also, for disaster recovery purposes, we need a way to refill that index from scratch.
Starting point is 00:45:46 And it takes longer than five minutes to do that operation. So we have a step function that invokes the refill function with an empty JSON document. And then if that function returns with, say, the last evaluated keys from DynamoDB table, the state machine actually invokes the function again using that as the input to continue where it left off. And that way it just kind of loops through invocations while until the refill is complete and then the state machine stops. And other use cases would be, say, you were handling a recording events that are coming in from the front end that maybe have a bare bones minimum of data.
Starting point is 00:46:26 So you could have that come in to a step function, which you can actually invoke step functions directly via API gateway. And you could have that function come in and then maybe it calls like one function. You need to, say, aggregate data from three different services before you publish this denormalized data into the data warehouse. So rather than having one function that's coupled to all those three services, you can actually add that data via the state machine. It's sort of like, oh, I'm going to call this service from that and get this data and kind
Starting point is 00:46:56 of keep all that data together and then finally send it off to your final destination. This is actually one of the painful lessons i learned myself when i started implementing my first lambda functions i approached lambda in the traditional sense of how i solve problems in the old days when i was coding i don't know c sharp or java very synchronous and very like a big block of thing and i'm doing a b c d and i'm waiting until a is done then doing b and i'm not sure how you may have solved this problem with with your engineers or your architects to also change the thinking towards what they just explained right using step functions and basically using state machines to break down bigger problems into smaller pieces and and when i started
Starting point is 00:47:42 developing i really made the big mistake of just trying to synchronize as much as possible and then have one lambda function that called other functions and always waited on them. And obviously, which is stupid, because you pay more
Starting point is 00:47:57 because you pay for the main function waiting for the other function. You also have to wait for the other. You also have to pay for the functions you call. And obviously, you then run into the potential problem that you guys also solved with this is timeouts. And I'm just wondering, did you have anything,
Starting point is 00:48:16 did you educate your engineers on what is, how do we properly use Lambda to leverage its full potential? I would say our engineers are very good at educating each other and researching and playing around with things and experimenting and seeing what works and keeping up on Amazon's new technology and product releases. It's very much a team effort to even put all this information that we've discussed today. I certainly am not alone in it. I'd have to thank everyone on the team for all the work we've done to sort of really make this work for us to be successful.
Starting point is 00:48:56 Cool. Well, and I also want to say a big thank you for you really to come on the air and educate. I mean, Brian, for me, this was a lot of great information on serverless. And yes, we had the one-on-one on serverless, but this was a real great experience report. This is like real, I would say this is kind of a real world advanced, you know, the kind of things you hear are possible with serverless,
Starting point is 00:49:21 but you don't really hear about people doing it. This is kind of like your vendors doing it, you know. But Andy, did you don't really hear about people doing it. This is kind of like your vendor's doing it, you know. But Andy, did you want to go ahead and summon the Summary then? We'll wrap things up. Let's do it. Yeah, I want to wrap it up. And so again, Michael, thank you so much. What I learned
Starting point is 00:49:38 today, there's a lot of things we, when we move over to the serverless world, that we have to understand. We are no longer in control of the underlying container or the underlying hardware, obviously. So things like cold starts that you talked about and how we can avoid cold starts. You talked about, we need to think a lot about caching
Starting point is 00:49:59 so we can cache, obviously, a lot of the dynamic response from our Lambda functions using built-in features of API Gateway. In your case, you had a 95% cache hit ratio. I liked what you said about doing performance engineering to figure out the optimum setting of memory CPU when it comes to price and performance, finding that sweet spot. Then I think we all have to rethink and relearn on how we do proper monitoring.
Starting point is 00:50:29 There are monitoring options out there, whether it's CloudWatch, whether it's X-Ray. And as Michael, you have shown us, you can do a lot of the built-in stuff, but it also takes, you know, obviously some work and some thinking of streaming the data to Elk
Starting point is 00:50:42 and then doing an area analysis. That's a kind of like a reminder for us as tool vendors to understand these use cases and then build something that works out of the box and solves all of these these use cases because you should not need to figure out how to get the best out of the data how to figure out if something is wrong and how to figure out if there's a higher failure rate and then notify the people. This is something that you should expect from modern monitoring tools. And then in the end, thanks for the great best practices, you know, breaking the big functions up into smaller pieces, using step functions to decouple big functions.
Starting point is 00:51:23 And another last thing that I kind of take back home with me is using Canary deployments to really test new features out in the wild and see if it's really working, if we have missed anything in testing, and then either broaden it up to the larger world or kind of
Starting point is 00:51:39 rolling it back and let engineering work on a better solution in case there are problems. So thank you so much. We will definitely make sure to link your slides that are up on SlideShare with the podcast. Any final words on your side? I think Fender is obviously a great company to work for. You stole my line already.
Starting point is 00:52:07 Fender is a wonderful – it's basically I'm at my dream job right now. This is a wonderful place to work. We do have openings at this time. I encourage people to check out the Fender website. We have an office in Scottsdale, Arizona, and in Hollywood, California. If you're interested, take a look and see what we have available and reach out. And Fender also has a very big social media presence. So be sure to check out Instagram and Facebook and everything else that has
Starting point is 00:52:35 Fender. The Instagram feed is wonderful. Great, great photography. And bonus points. If you can play a version of your resume on guitar that transforms itself the sound transforms into the actual printed resume when they play it back that would be something spectacular right yes uh which have to do with it but it has to be recorded with a fender guitar of course the one thing i wanted to just point out andy uh and and michael again thank you
Starting point is 00:53:06 so much great stuff here is uh you know andy you and i have talked a lot about um different performance considerations and for a long time now it seemed like old performance is the same new performance meaning every old performance problem pattern that you come across in the past carries over into any new platform. Once we started going to microservices and containers, it's still the same problem patterns we're looking out for, the same most common mistakes. But as Michael pointed out here, to me at least, when he was talking about these performance issues, the VPC usage, cold starts, even using the caching layers, especially the one I love the most was the CPU memory balance and defining that sweet spot. This is, this sounds all new to me. This is, it sounds like there are new performance challenges, new patterns, new things to look out for, for, you know, a lot more. I mean, every, every new,
Starting point is 00:54:00 every new kind of technology comes with a few new ones, but the server list has a whole, it seems like it has a whole larger spectrum of new performance considerations. I think we're going to be finding out more and more of them as their usage advances. So I'm just excited about that. And to me, that was very exciting to hear, Michael. So thank you for sharing it.
Starting point is 00:54:20 You're welcome. Thanks for having me on. All right. Awesome. Well, thanks everyone for listening and we'll catch you next time on Pure Performance thank you very much thank you
