PurePerformance - Don't burst in Flames: 20 years of Performance Engineering with Martin Spier

Episode Date: October 23, 2023

Martin Spier was one of six engineers taking care of all of Netflix operations about 10 years ago. Back then, performance and observability tools weren't as sophisticated and didn't scale to the needs of Netflix as some do today. FlameScope was one of the open source projects that evolved out of that period, visualizing flame graphs on a time-scaled heatmap to identify specific performance patterns that caused issues in their complex systems.

Tune in to this episode and hear more performance and observability stories from Martin: about his early days in Brazil, his time at Expedia and Netflix, and about his current role as VP of Engineering at PicPay, one of the hottest fintechs in Brazil.

More links we discussed:

Performance Summit talk about FlameCommander: https://www.youtube.com/watch?v=L58GrWcrD00
CMG Impact talk on Real User Monitoring at Netflix: https://www.cmg.org/2019/04/impact-2019-real-user-performance-monitoring-at-netflix-scale/
Learn more about Vector: https://netflixtechblog.com/extending-vector-with-ebpf-to-inspect-host-and-container-performance-5da3af4c584b
Martin's GitHub: https://github.com/spiermar
Connect with him on LinkedIn: https://www.linkedin.com/in/martinspier/

Transcript
Starting point is 00:00:00 It's time for Pure Performance! Get your stopwatches ready, it's time for Pure Performance with Andy Grabner and Brian Wilson. Hello everybody, and welcome to another episode of Pure Performance, back on the record. Yes, yes, it has been a while. And thank you to our audience for excusing my absence, but we had to get some new content out, and Andy was traveling like he does, as always. You know, it's an exciting time right now, Andy, I've got to tell you. Well, first of all, it's the month of October,

Starting point is 00:01:03 which means I can officially, for this month, call you Andy Candy Grabner. Bringing that back. That started a long time ago, which brings me to the second exciting bit: we're on episode 193. So after this one, we have seven more until we hit 200. I think we've been doing this for about seven years now. I think so. Which is pretty crazy. Yeah.

Starting point is 00:01:21 Yeah. And I just recently, just last month, hit my 12-year milestone at Dynatrace. Wow. Which is pretty, pretty crazy. Yeah, it's been a long, long time. Been awesome. But it's funny, too, thinking back to when we started this all and where you and I came from, right? We both started in performance testing. I don't know if we called it engineering at the time, but performance testing, then performance engineering, and then into this whole observability thing. But it's always been performance at the heart of it.

I know I used to think that performance could only be load tests. And as we got into the world of Dynatrace and some of this other stuff, there are so many other aspects of it. And I think today it's coming a little bit more full circle, right? We're going to be doing a little bit more of a performance-oriented podcast.
Starting point is 00:02:10 Yeah, and I also think, right, because back in the days we talked a lot about patterns and how to detect patterns. And I think there's a lot of stuff that we are going to hear today from our guest. And I think, Brian,
Starting point is 00:02:21 do you think it's time to introduce himself? Absolutely. I think it's time. If he knows who he is. If he knows who he is. If he's figured that out. And if he doesn't know who he is, maybe he can tell us who he wants to be.
Starting point is 00:02:31 Two things. Yeah. But I will pronounce the guest of today with the way I would pronounce it because it looks like a German name. Martin Speer. But I will let Martin introduce himself. Martin, thank you so much for being on the show we met a couple of weeks ago in sao paulo at an event and now we're here recording this podcast welcome to pure performance thank you for being here please do us a favor if you know who you are introduce yourself or if you know who you want to be introduce who you want to be
Starting point is 00:03:02 well well thank you thank you and hello everyone, thank you. Thank you. And hello, everyone. Thank you for the invite. I'm really, really happy to be sharing my war stories with you guys and all the audience. Yeah, it's a very deep, thoughtful thing. Hard to think. But one thing is for sure, I do love performance. I mean, I think my whole career was around performance somehow. Just in the recent years, I've been more of a bureaucrat.
Starting point is 00:03:30 But other than that, my whole career was around that. I mean, I think I wrote my first line of code, a really bad one, kind of when I was super young, was kind of nine-ish, I guess. I got to experience the whole internet in the 90s when things were kind of fairly small. It was a bit of a no-brainer going into computer science. Ended up studying that. And I started my career as a sysadmin when sysadmin was a thing.
Starting point is 00:03:57 And by the way, most performance engineers I kind of get in touch these days either came from a testing background. Because back then it was kind of perf testing and then you kind of get in touch these days either came from a newer testing background, you know, because back then it was kind of perf testing, and then you kind of go into something else, or sysadmin. You know, at Netflix, the bulk of the engineers there came from a sysadmin background. And, you know, back then, by pure luck, I sort of got
Starting point is 00:04:19 involved into a perf improvement project. The application was slow back when we kind of had, you know, waterfall model projects, perf improvement project. Application was slow back when we kind of had, you know, waterfall model projects, perf improvement project. It went really, really well. We improved things. And from that, the company I was working at at the time,
Starting point is 00:04:35 I said, hey, maybe we need a performance engineering team. And, you know, back then I didn't even know performance engineering was the thing. I started researching a bit more and that's how I got involved know performance engineering was the thing. I started researching a bit more, and that's how I got involved into performance engineering. Back then, it was a bit of mainframe and a few other things, not to give up my age here. But I ended up working in that, moved to the U.S.
Starting point is 00:05:02 12, 13, 14, 15 years ago. I don't know, long, long time. I spent a couple years at Expedia. So if you're from the US, you probably know Expedia, travel agency, working really in the large lines of business, hotels, air, cars, booking, all those things. And then my sort of random chance, I received a cold email from a recruiter in California. I knew the company, not just because it was a user, but they also had some really, really cool open source projects back then. That was Netflix. And they were starting a new performance engineering team over there. And this was over a decade ago. The company was really, really different back then.
Starting point is 00:05:46 I remember celebrating the 20 million user mark, which was huge back then. Now it's kind of a lot more than that. There's only a few hundred employees. Content development wasn't a thing back then. It was just licensing. DVD was still huge from those who are kind of not from the US. Netflix started in the US here as a DVD delivery kind of thing.
Starting point is 00:06:10 You went to websites, selected what you wanted. You got a red envelope with the DVD. And it was a really, really interesting time back then because Netflix was migrating from the data center
Starting point is 00:06:23 to this new thing called the cloud. And everyone is really apprehensive about the cloud and suspicious. Should I host my data over there? And is it safe? Who's guaranteeing that? So I ended up facing a lot of the performance architecture scaling problems at cloud at first hand. Because it was not something I could kind of go to Stack Overflow and ask how to kind of solve that problem.
Starting point is 00:06:50 Probably no one kind of faced those issues before. I got to work with really, really great people back in the day. Maybe you guys know Adrian Cockcroft from Sun. Amazing, amazing time. Try to define what architecture for the cloud was what it is today, which is really, really cool.
Starting point is 00:07:12 I migrated to lots of different areas there. Yes, I started with architecture, backend systems, but over time, I work on client side, which is not just your Android, iOS, and web, but also TVs and PlayStations and Chromecasts and all those kind of weird things. There was even one thing that ran Windows C back then. So interesting times. Big data, which kind of started, now is a huge problem. Back then, it was something new.
Starting point is 00:07:45 Hey, processing all this data costs a lot of money. How can I improve that? And now we do machine learning. Machine learning is becoming a problem. There's perf engineering teams focusing on machine learning, how to optimize that. And I got to work on all those things, all with
Starting point is 00:08:01 sort of a performance lens. And ended up developing a bunch of tools. We can talk about those things, but all with a performance lens. Ended up developing a bunch of tools. We can talk about those things later. And about two years ago, I left Netflix. And today, I lead what I call the foundation engineering team at PicPay. So PicPay, probably most of the listeners don't know PicPay, but PicPay is probably most of the listeners know PicPay, but PicPay is short for picture payment.
Starting point is 00:08:27 The analogy I try to make, Venmo in Latam in Brazil is one of the large fintechs in Brazil. Over 70 million accounts open, fairly big. And the way I like to explain what foundational engineering is there is basically the bulk of core engineering so everything is not directly related to a financial product but everything that supports all of those programs so your infrastructure, platforms internal platforms, architecture, internal tools, mobile
Starting point is 00:09:03 platform and I have data platform until recently. I sort of call it the plumbing. When it's working well, no one sees it. But when it starts giving you headaches, all hell breaks loose. And that includes Observability too. Observability is one of my teams. It's one of the teams I have a bit of passion. And they get bothered with me a bit of passion. And I kind of,
Starting point is 00:09:25 they get bothered with me a bit because I try to, you know, give more opinions than I should. But it's been an interesting, interesting change. I think when we, well, thank you so much. It's amazing to have somebody
Starting point is 00:09:38 like you on the podcast with such big history, right? I mean, it's amazing when you explain. Now I got to ask you a question and without dating yourself, on the podcast with such big history, right? I mean, it's amazing when you explain. Now I got to ask you a question and without dating yourself, but what was your first development language? Because you said when you were nine years old,
Starting point is 00:09:53 you were studying the codes. Do you remember the language? That was basic. Yes. Yeah. That was basic. Yeah. It's funny.
Starting point is 00:09:59 It's for me the same. My first computer was an Amiga Commodore Amiga 500 and it was Amiga Basic. Did you have the basic with the line numbers or I think it was they had a version maybe QBasic or something like that after? No, it was the line numbers. Line numbers, yeah. Yeah, okay. 20 go to 10. 10 for Brian, 20 go to 10. Exactly. Yeah. That's awesome. Yeah. And I think that helps. You appreciate hardware resources. And I think that's one thing I really love about performance. You get into how to optimize things
Starting point is 00:10:34 and because you value hardware resources, which is something that changed over the years. A lot of brand new developers, they work at a level of abstraction and everything is abstracted, even memory allocation. I mean, I can probably ask most engineers today, you know how your kind of language here
Starting point is 00:10:53 allocates memory and de-allocates memory or what's malloc? It's most will probably never have to deal with it. Yeah, it's interesting. Go on, Eddie. I was just Yeah, it's interesting. I remember... Go on, Eddie. I was just saying, understanding the basics, I think,
Starting point is 00:11:09 is also a privilege that we have when we grew up because I remember, besides basic, my first language in school, in high school, was assembler. So we actually had to learn how to move bits and pieces around in the register. It was really interesting. And I think they really just, back in the days,
Starting point is 00:11:26 it was in the early to mid-90s, not that Assembler was still something you would code, because obviously we already had languages like C, C++, but it was really great foundational knowledge that we gained. Yeah, I was just going to go back into the, I don't know if you had it. You were doing Rational, right, Andy, you worked with? No, I was working with Segway.
Starting point is 00:11:51 Oh, Segway, so yeah. So I don't know, was the language C on that one? Because I remember in Lode Runner, we would have to do C, and anytime we do a fancy function, you have to do the malloc. And I remember that confused the hell out of me. I think C was Lode Runner was doing, yeah, C, but we were doing, I think it was more like Pascal-based, to be honest with you, if I think back, yeah. Yeah, and LoadRunner was C, but it wasn't a standard compiler. Right, right. Which kind of caused a lot of headaches. Yeah. Yeah.
Starting point is 00:12:18 So first language, basic, yeah. And Andy, I don't know if you caught it, right? So Martin has a direct tie-in to Dynatrace because his first, I think you said it was your first performance job, you were working for Expedia, which just made me think immediately of Easy Travel. So you basically worked for Easy Travel. Easy Travel is our demo, one of our demo apps that we've had for years and years and years for Dynatrace. So I spin up this travel website that just cracks me up it was space travel before that wasn't it yeah anyway it was it was way too back yeah yeah no it's called something else but something with
Starting point is 00:12:58 space travel yeah yeah but uh martin a couple of quick questions so in the preparation of this call or this podcast you sent us a bunch of links uh it was one of the as you call it earlier before we hit the recording button like your little baby um you have a lot of presentations on uh flame commander or a flame scope uh an open source project that you um brought to life i guess, while your time at Netflix, correct? Yes, correct. We will, folks, if you're interested in learning more about Flame Commander and all the other open source projects
Starting point is 00:13:34 that were released back then by Netflix, you will see the links in the description of the podcast. But can you tell us a little bit about why you built this tool back then? What problem you actually tried to solve? Yeah, sure. It's quite an interesting and funny story. So we had a fairly small team at Netflix.
Starting point is 00:13:53 I think at peak it was maybe six engineers to take care of all Netflix globally, all devices, everything. And everyone had a specific focus, kind of backend, client, data, benchmarking, kernel, JVM. And I remember it was one of the most common requests was, hey, I had a CPU regression or something's kind of weird with my CPU, new build. Pretty common problem. And this issue was intermittent. You know, every... You guys probably noticed that.
Starting point is 00:14:31 They had that problem in the past. And this one was specifically hard because it wasn't even a second. It was kind of sub-millisecond. It was maybe kind of 100 millis, like sub that. And it was really hard to find what was blipping here.
Starting point is 00:14:45 And then back then, Vadim, my colleague, he was having that problem and then took a CPU profile, sampled NextApp, and started slicing it into very, very tiny bits and generating flame grass from those things to say, hey, what's that spike?
Starting point is 00:15:05 But it was really hard to catch that specific moment. And then I think it was, it's kind of fuzzy because it was a long, long time ago, but I think it was Brendan that decided to, hey, let's plot that as a heat map and see kind of what we can find here. Interesting enough, we could clearly see the blips and exactly the timeframe which should slice things. And then I took it to, hey, let's create a tool on that
Starting point is 00:15:30 where I can navigate back and forth between these two things, like the heat map and the flame graph. And cool, developed a tool. Interesting. Can we open all our old profiles that we took off kind of applications over the years
Starting point is 00:15:43 and see what we can find. And sure enough, there was a lot of really interesting patterns. We even wrote a blog post about that. Interesting patterns like your GC spiking or jobs that trigger every second or so.
Starting point is 00:16:00 All sorts of things. And wow, there's so much we can optimize here that I never even saw before. It just got washed into the profiles. And interesting, that's just CPU. What else can I open with that visualization? Then we started to move, hey, memory allocation profiles
Starting point is 00:16:19 and all the BPF things. And it kind of grew into a really, really interesting tool. It was standalone back then. That was Flamescope. That's the open source version. And, hey, we need to make it easier for all developers to have that capability. Then came Flame Commander.
Starting point is 00:16:37 It was basically a cloud-hosted version of that where you could just point to your instance, single click, and take any profile and analyze that either as a flame graph or, you know, your flame scope visualization, go back and forth.
Starting point is 00:16:51 It had a historical archive. You could compare kind of older versions with new versions of things. And it kind of grew as a cloud, overall cloud profiler on Netflix.
Starting point is 00:17:02 So it's just, that's the story how it created with a really, really tiny thing that, you know, probably most of you faced before.
Starting point is 00:17:09 Yeah. And so for me, a couple of questions on this. So I think this was, if I look back and also in the Git repository, what was
Starting point is 00:17:17 it like eight years ago? Probably. Yeah. So eight years ago, there must have been 2015. Obviously you said
Starting point is 00:17:24 you were a small team. It's amazing. Only six engineers taking care of Netflix. Now, did you look for other tools? Or did you just say, no, we built it ourselves? Or was there nothing available? Nothing from that level. Remember that back in the day, especially when Netflix started,
Starting point is 00:17:42 not a lot of observability tools were available that worked at that scale. Take your time series metrics, internally developed Atlas, still huge. I still don't know if any tools available today can kind of take on that load. Vector, which was kind of real-time monitoring sub-second of low-level metrics. Also, I don't think I've seen anything similar to that,
Starting point is 00:18:07 that is not centrally aggregated, that kind of connects to the host, kind of streams directly to the browser. So yes, there was part of, hey, we like to develop tools, but there is also, in general, there's nothing that can do what the level of granularity we want to get, and also support our scale. So that's where we kind of went straight to developing tools. Obviously, you were pioneers back then, especially as it comes to that scale, right? I think now years have passed
Starting point is 00:18:35 and I guess there might be other alternatives now out there. But that's really interesting. Brian, this also reminds me, if you remember, one of the early podcasts we had with Goranka, who was a performance engineer at Facebook back in the days. Oh, yeah. I spoke at conferences with her. She was taking up capacity engineering there. Yeah, yeah. Exactly.
Starting point is 00:18:53 So we had her on the podcast as well. And obviously, Facebook back then, same challenge. Big scale, no other tools available. And they had to figure out a new way to get all this data from all of their hundreds and thousands of hosts that they had back then and then analyze it. Another question that I had, so open source. You or Netflix decided to open source these tools. And I think this was at a time where open sourcing,
Starting point is 00:19:23 I'm not sure if that many large organization actually went down where you know open sourcing i'm not sure if that many large organization actually went down that route in open sourcing something that was built obviously it's intellectual property that you built and do you remember why netflix decided back then to actually open source these tools because that's giving away a lot of stuff for free eventually right i mean yeah um i think first it's part of the DNA. Everyone likes to have those discussions in the open and also kind of discuss
Starting point is 00:19:52 implementation and how we can solve this problem. There were a few things that I remember back then was, first was the question was is it a competitive differential for Netflix if everyone starts using this? the question was, is it a competitive differential for Netflix if
Starting point is 00:20:07 everyone starts using this? If it is, we're probably not open source. It's just a... But if you check most of the tools, they're not the recommendation algorithm, for example, which is close it, guard it.
Starting point is 00:20:24 But cloud management thing said, hey, if which is close it, guard it. But cloud management things that, hey, if everyone starts using that, that's good for us. We're sort of a standard. We can all contribute and improve how we use the tool internally too. It's really interesting
Starting point is 00:20:39 to give visibility to engineers of the problems we work in. It's a tech brand. It's really important. We're competing for really top talent and it's good for everyone to know the kind of problems we're working on. And that generally comes to open source projects. So it wasn't much of a huge discussion. It was just, hey, if our competitors are using this tool, that'd be a problem. Probably not. And then after came how much effort it takes to
Starting point is 00:21:11 manage those projects over time and so on and so forth. But the idea was generally that. Let's just, you know, it's interesting, it's good for the community. No competitive advantage. Let's open source it. Yeah. No, it's interesting it's good for the community um no competitive advantage let's open source it yeah no it's good and obviously it gave you a lot of chance to speak at all sorts of conferences and
Starting point is 00:21:40 i have a couple of tabs here open on my browser uh we spoke at some like uh cmg impact i see here uh and some other conferences obviously it's a great way to then, you know, speak about it, make obviously, you know, kind of like always free advertising for it as well, right? Because obviously these conferences, they are happy that somebody speaks about their own experiences. And especially if the tools that you're using are something that everybody can then use and don't have to purchase some. Exactly.
Starting point is 00:22:05 And we love contributions too. I think that was really interesting. If people start using it, they'll find other uses for it. They'll add features to it. Everyone benefits. It goes back to that community that we see so much in the IT world of people sharing and helping each other out. It's funny, too, because in a way,
Starting point is 00:22:26 without making it sound terrible, but I like the phrase, open source is marketing, where it's you're marketing your technology staff for a cool place to work to help. Obviously, that's not the reason you're going to do the open source, but that can be another factor.
Starting point is 00:22:41 Is anybody going to get a competitive advantage, and will this potentially help us attract more talent as we're growing and expanding it's uh it's an absolute yes on the second one for sure you know yeah one thing that i noticed i looked at the the video uh from the performance summit that was the first video i watched from you and it reminded me so much about pattern recognition, right? What we did. So Brian and I, when we started our work as performance engineers, especially now with Dynatrace, everything was about distributed tracing.
Starting point is 00:23:13 So I'm not sure how many distributed traces we've analyzed in our lifetime, but it's enough. But the interesting thing is we always kept looking for patterns. Like we always come back to the N plus one query pattern, you know, too many threads being used, the calling, fetching too much data, making too many calls to a remote system,
Starting point is 00:23:34 high latency and things like this. What I really liked about the way flame scope and flame commander worked, I think it's flame scope with the visualization. Yeah. Is the different patterns that you then could like visually see. And, and I Commander worked, I think it's Flame Scope with the visualization, is the different patterns that you then could visually see. And that was just, it was really great. Folks, if you can check it out, I'm sure you have presented this in multiple different presentations, but the one on YouTube
Starting point is 00:23:58 is the one Flame Commander Netflix Cloud Profiler by Martin. And you, I think, starting with minute number five in that video, all the way until like five minutes of just pattern after pattern after pattern and how it looks like visually. Yeah. And it's interesting. Every time we presented about that,
Starting point is 00:24:17 it was always the same question. Hey, have you kind of trained a model to kind of learn those patterns and detect those automatically that's on the at least it was on
Starting point is 00:24:28 the to-do list when I left but it's so that's
Starting point is 00:24:33 actually an interesting thought you can train a model to detect those
Starting point is 00:24:37 because what a human eye can do picture recognition is so far
Starting point is 00:24:41 advanced so if it's creating that picture and then it seems
Starting point is 00:24:44 like it's a very easy leap. Exactly and it's at scale makes a huge difference. Hey, manually checking one application very easy when I have thousands and thousands of applications and doing that manually it becomes harder. Reminds me of the fingerprint database in all those detective movies. They get the finger and then you get a hit. Exactly. We got them. Do you know, Martin, is the tool still used at Netflix? I know you left a couple
Starting point is 00:25:14 of years ago, but still, are you aware? As far as I know, yes. Jason Vadim can compliment, yes. Now, with all the stuff that you learned at Netflix, now you moved on to PicPay, and this is also how we got to know each other because you were presenting in Brazil.
Starting point is 00:25:35 Do you have a different... I know you have a different role now, but I guess with everything that you've learned, do you bring a certain, let's say, motivation into your organization around observability, around performance? do you bring a certain let's say motivation into your organization around observability around performance like do you still do you take some of the lessons learned
Starting point is 00:25:51 and make sure that in your organization you're doing things similarly or is it a different world now because different technologies take maybe different people no I'm definitely taking a lot of lessons I think that was one of the reasons I joined was
Starting point is 00:26:07 PicPay is a different company. It's a lot younger. Well, it's younger-ish. But engineering-wise, the company grew so fast. And with that, it absorbed a lot of technical depth. And engineering maturity didn't grow as fast. So that's what I'm trying to bring to the table, kind of get a lot of the lessons I learned when Netflix was scaling and all the pains we kind of suffered during those
Starting point is 00:26:36 years to pick pay, which was in a very similar momentum, right? Just to make those things kind of get fixed a bit faster. That's what I'm trying to bring to the table. Let's have the architecture discussions early on so I don't really suffer
Starting point is 00:26:52 when things are just too big to fix. Performance, for sure. Let's not get this out of hand here. Also, observability. The
Starting point is 00:27:08 observability space, at least that is my reading in Brazil, and I'm assuming most markets that are not super developed, your big tech companies, observability is still quite young. Most engineers in your team, if they did not come from a tech companies. Observability is still quite young. Most
Starting point is 00:27:25 engineers in your team, if they did not come from a big tech or a large company, their observability is, hey, can I go check the logs and scan everything manually? That's the background of most engineers that never worked at a company
Starting point is 00:27:42 at PicPay or Netflix scale. And trying to get rid of that thinking of, hey, I can manually do a lot of things or I can publish text and scan text as much as I want and bring observability to a pattern, a point where I can scale, I can continue scaling. And that's kind of where we are right now in that space to a pattern, a point where I can scale, I can continue scaling.
Starting point is 00:28:05 And that's kind of where we are right now in that space is getting rid of all the technical that causes problems. Problems is too expensive. That's the first one. And it's really slow
Starting point is 00:28:19 to find problems or understand what's going on in the system. And that's kind of moving from scanning logs to, hey, metrics and traces and full end-to-end traces
Starting point is 00:28:31 and this sort of thing. So that's the maturity I'm trying to bring to the organization. And obviously, I've seen that. Not many engineers on my team have seen that at scale. So I do give, I do have a lot more opinions than I should in my position. I guess that's always hard, right?
Starting point is 00:28:53 To sometimes not let your past dictate your actions. But yeah, that's what it is. Yeah, there was a joke internally too. I mean, I'm working a lot of improving app performance, so Android iOS performance in the app. It's a focus right now. And I had to contain myself actually installing Android Studio
Starting point is 00:29:16 and start profiling things again to the point where I took a screenshot of Android Studio running on my machine and I kind of sent that to the team. And then it became a joke. Hey, Martin is reviewing your PRs now. That's just to contain myself. Yeah.
Starting point is 00:29:34 I know we had this discussion also in Sao Paulo because when you got on stage and you introduced yourself to me and you said you're working in foundation engineering. And i said this is something that i would call maybe a platform engineering right because that's kind of the term that i have been using for a little bit and i think we you agreed on this and also the way you explained it earlier you really make sure that that an organization an engineering organization really has everything so that they can really produce great output
Starting point is 00:30:05 without all the complexity around that tool ecosystem brings and the processes bring. I get a lot of questions from our community on what is the right thing to get observability into foundational engineering, into platform engineering. Question to you now, do you bring observability as a mandatory thing into everything you do? Or is this still, you know, optional?
Starting point is 00:30:32 Or is it mandatory? That would be an interesting thing to hear. Yeah, sure. I like to see, I agree with you. Foundation, platform, different terms, but our reading is basically the same. End of the day is all layers of abstraction, right? What I'm trying to provide to my clients, and I tell it to my team all the time, I think of our Platform Engineering Foundation as a startup internally in the company. I'm providing a service,
Starting point is 00:30:58 I'm providing a level of abstraction so all other teams can build their features for the users a lot faster without having to worry about the details. Of course, the more I know, the better, but delivering that faster. I think that's the idea. And observability, when observability started within platform engineering, that was mostly to provide a managed platform to other teams, being internal hosted tools, tools that we acquired, tools that came to M&As, managed old tools.
Starting point is 00:31:35 That was the initial idea. And with time, I'm changing that vision to, hey, we're actually responsible for the best practices and the processes and what's expected from each team. I don't buy into the huge idea of, hey, something is mandatory
Starting point is 00:31:56 for other teams. What I'd like to do is to offer a better product that they would get somewhere else. They need to see the value of that. It's just not part of my culture of, hey, you have to do that. It's mandatory.
Starting point is 00:32:10 Sometimes it's necessary for sure. But I like to provide a better service, a better product. And that comes, hey, here is your Java image that you use to build all your application. It's all fully instrumented already. And it's a lot easier
Starting point is 00:32:28 for you to just use that and get all those things for free than having to develop those things yourself. We come up with, hey, here's a minimum that you need to have a system in production. All those things come for free if you use our platform. You're free to
Starting point is 00:32:43 use whatever, but at the same time, most engineers don't want to have more work that they need. That's why you become an engineer, right? Because you don't want to manually do things. You want to automate them and be done with it.
Starting point is 00:32:59 And less work is the constant search for less work. And it's a very interesting approach. I brought that from Netflix. Back at Netflix, we called that the paved path. Here's a really nice highway you can follow. And here's the off-road path. There's always cases that you have to go off-road because you don't have a really nice highway there.
Starting point is 00:33:27 But 99% of other cases, it's just a lot easier to go on a highway. Yeah. So you could call them paved paths or something. I think it'd be like golden paths, paved paths, whatever it is. But I think the way you explain it is nice, right? I mean, hopefully everybody understands that driving on the paved path makes more sense now do you how do you sell your product internally or have you reached a status already where you said now it's clear for everyone to use our platform or did you have to do you still have to do advertising do you still have to do advertising? Do you still have to sell it internally? Do you still have to educate people?
Starting point is 00:34:06 So the current products, they're pretty established, I would say, with a few exceptions. It's all about internal, how other teams see our team as a reference. I think that helps a lot. If you work on doing your research and going through every detail, listening to everyone on their input before you make a decision, and then you do the job, you don't need a lot of internal marketing.
Starting point is 00:34:42 People just look at you and, hey, I trust that they did their homework and I'll follow. But obviously, there are always technologies that you don't have a full consensus for sure. It always happens. And on those cases, we have to do a bit more marketing, a bit more education, and kind of go... I always try to bring the discussion
Starting point is 00:35:08 to a technical discussion. But not a lot of marketing. And once you standardize things, it's... I joke with the Apple environment. Everything works nicely together here. Once you step out, everything becomes a lot harder. It's nicely together here. Once you step out, everything becomes a lot harder.
Starting point is 00:35:27 It's good or bad. But if you're in the environment, everything comes for free. It's a lot easier. Everyone tested your path and whatever you need to do has a really nice documentation. It's just easier. I thought of a
Starting point is 00:35:42 marketing poster you can hang in the office so people have an idea. So on the paved path, you can have an engineer with a laptop in a self-driven car on a smooth road typing away with no problem getting their work done. For the off-road path, they're going to be in a
Starting point is 00:35:58 big Jeep Wrangler on a big rocky road bouncing around as they're trying to type and drive at the same time. Take your pick, man. I do like the off-road bit, as an engineer, but at the same time, I work for a bank. It's not the right place to be doing that. But do you want to try typing while you're driving and bouncing? That's the bit, right? Yeah, yeah.
Starting point is 00:36:23 It's interesting, too, because, Andy because anyway we talked a lot about platform engineering as well there's the idea you know i like the terms paved path and you know unpaved road um better than opinionated right but there's it's similar concept to the here's here's the the the one where you have your rules it's all set up but it's easier to go. I like that, though. You're keeping it open, right? You're letting the engineers make the choice. You're trying to inform them and educate them about the pros and cons.
Starting point is 00:36:57 And obviously, if there's a reason why they should pick the unpaved road, just like the reason why you pick any technology is because there is a reason, but it gets them to think about it. And I think it's more powerful if they come to the decision to do the paved path because they realize that's the better path as opposed to just like, no, this is what you do. It's interesting to see how that'll work. It's exactly the case because
Starting point is 00:37:16 even if you try to provide a platform that will cover all use cases, that'll never be the case. You always have exceptions and things you don't want to support as a central team just because it's one corner case there
Starting point is 00:37:30 and it makes no sense for us to invest a lot of time and effort and people to support that specific one use case for kind of one team. And there's always cases like that in a large company. You'll never be able
Starting point is 00:37:43 to standardize everything. Sometimes you might have a Windows server running Lua there because that one solution that they bought and it needs to be that one and you always have cases like that and when you try to impose things you always forget about the corner cases
Starting point is 00:37:59 back when I was at WebMD there was a team that had a tool written in Fox Pro which I'd never heard of until that came out. And they're like, well, we need to make it work because it's no longer being made or maintained. So I was like, oh, wow. But that's the case. Just bringing that up for the old language nod. The question for you now, coming back to where we all started where we started all in performance
Starting point is 00:38:25 engineering um do you now at picpay where do you do performance engineering do you still do performance testing as part of your delivery pipeline or is it more everything moved to production where you're basically analyzing performance behavior and performance changes as part of a production let's say a blue green blue-green rollout or a canary rollout. How does this, what do you do? So it's, there is performance testing, but it's ad hoc on specific cases. And it's the same on Netflix. I can tell you the whole story kind of went, because I remember the first thing I did when I joined was creating a performance, fully automated test framework.
Starting point is 00:39:08 So ad hoc cases, there is a completely new application. I have no idea how it behaves. I need to put some load in it to see how it behaves. It's not to validate. It's not a regression test. It's nothing like that. It's just I want to see how it behaves with load. That's it. There is a patch
Starting point is 00:39:25 or there is a new library version or there's something that changed that is risky. I want to see how it behaves and kind of what's the difference. I don't need to match production workload or anything like that.
Starting point is 00:39:38 I just want to see how it behaves under load and try to find any issues before production. To guarantee production is working, it's part of the development process, I guess. I think that's the same as with Netflix. Canary releases, I think that's the first. And with that, during the canary,
Starting point is 00:40:07 you need perf metrics there. Have you regressed CPU significantly? Have you regressed your memory allocations? Are you generating errors? Whatever you can think of should be there, should be in the canary evaluation. If you do that automatically, manually, but it should be part of the evaluation to see how things behave in production. Blue-green helps, even if you didn't catch whatever you had to catch during Canary. Hopefully, you'll catch that in a blue company. And that happened at PicPay before I joined.
Starting point is 00:40:47 And I think it was exactly like that. You get to a scale where it's just really, really, really, really hard to test things pre-production. Not just because of the scale and the load you need to generate. There's always ways of kind of achieving that. But the system became too complex for you to simulate all weird things that can happen. Getting an environment
Starting point is 00:41:09 with the right configuration really, really hard. You have feature flags. You have things that change the behavior of the system. You have different versions of applications. Having a separate environment
Starting point is 00:41:19 that scales or it's a, you know, have data from that's similar to production. So the test is just too many variables to know pre-production. And at the same time, we're releasing multiple times a day, every application.
Starting point is 00:41:35 So there's not a lot of time to test anything pre-production to make sure there's no regression. Keeping those tests updated. Netflix was fairly simple in user use cases. PicPay is huge, huge of features, what's available to the user. So keeping that to test something end-to-end
Starting point is 00:41:53 really, really hard. So it just became really hard to test things pre-production and we have to rely on the actual development and deploy processes to catch those problems. And one more question on that. Obviously, doing the load tests are different, but one thing I didn't know, again,
Starting point is 00:42:16 until I got more into Dynatrace, the understanding that performance is more than just load. In those pre-production environments, are you at least looking for patterns when you're not under load? Andy mentioned the N plus one query problem or single execution, CPU utilization is more, the number of calls to the database is more, things that might indicate a potential problem that gets exasperated under load.
Starting point is 00:42:39 Is that being looked at or is it... And I don't even know if the flame thing can help with that because it's not really going to be quite under load. But, you know, just imagine there are a lot of common issues. And I'm just thinking on the AI side, if you have a picture of what that looks like, you can do a comparison. Yeah, right now it's not part of, you know, your everyday engineer, everyday developer life.
Starting point is 00:43:08 Obviously, if there is a suspicion that something might be bad or I'm not sure, yes, all the tools we have available allow for that, at least, to how to investigate that in pre-production environments, for sure. Not part of the process as a day-to-day.
Starting point is 00:43:24 But if you're an engineer, you smell when things can go bad. I'm adding this SDK, and for some reason, it's a lot larger than the previous one. Interesting. Kind of let me see what's going on here. Or, hey, I developed this algorithm here, but, you know, quite complex. Let me see how it behaves.
Starting point is 00:43:45 Or I'm adding this dependency here, external call or whatever. Interesting. Let me see how that behaves in a non-production environment. It's just, you know things that are risky when you touch them
Starting point is 00:43:55 and hopefully it's a trigger for you to investigate a bit more. Martin, I got one final question because I think then we're getting almost to the end. Now it seems that at Netflix you were obviously heavily looking at metrics, like your infrastructure metrics and so on. Now with the whole, let's say, excitement about OpenTelemetry and Traces, even though
Starting point is 00:44:21 Traces has been around for a long, long time, but it's been really made popular, obviously, with OpenTelemetry and all the tools that came with it. Do you now see the benefit of having all these additional signals besides your metrics and your logs? Do you also look at Traces? Do you look at real user data and all this stuff? Oh, yeah, of course, of course. Even on Netflix, we developed
Starting point is 00:44:48 a kind of end-to-end kind of distributed tracing solution. Back then, it was internally developed. Back in the day,
Starting point is 00:44:56 it was internally developed. Today, I think it's Zipkin-based. When I left, it was Zipkin-based. What's the Google paper?
Starting point is 00:45:03 It was Dapper? Dapper, yeah. Dapper, exactly exactly it was based on that internally I think we called
Starting point is 00:45:08 salt and extremely important especially to understand a large and complex
Starting point is 00:45:15 environment and the dependencies and when something breaks I ended up even
Starting point is 00:45:19 developing a bunch of tools a few open source I think trying to remember
Starting point is 00:45:23 if any of the open source tools are there that are based on tracing data of tools, a few open source, I think. I'm trying to remember if any of the open source tools are there. That are based on tracing data, for sure. Take, for example, so, develop a tool to visualize
Starting point is 00:45:38 volume of requests versus time. So, where am I spending time? I have a request that comes to my edge layer. And have you seen the Sankey diagrams? Yeah. They kind of open up. So I use the Sankey diagram based on tracing to see where we're spending time. Okay.
Starting point is 00:45:58 That compose the time, I guess, to the edge layer request itself. So very interesting, super important to have edge layer request itself. So very interesting, super important to have the tracing data there. In a constant changing environment too, which is you have a company with kind of thousands of microservices, those things are changing all the time.
Starting point is 00:46:19 I remember in the early days when I tried to design a, do a design of the architecture, where the requests go and et cetera. By the time I finished, you know, a couple of days later, it already changed. So the only way of understanding how, like, the flow of data and the flow of requests in a system
Starting point is 00:46:37 is through tracing. So extremely, extremely important. Yeah, the data structure I don't like very much is logging just because it doesn't scale really well. Tracing metrics, super, extremely important. Yeah, the data structure I don't like very much is logging just because it doesn't scale really well. Tracing, metrics, super, super important. Cool. Hey, Martin, thank you so much for taking time out of your day. I'm sure you're super busy with your role at PicPay.
Starting point is 00:47:00 Thank you for giving us all the insights into what you learned over the years. It was great to hear that your first programming language was basic kind of brought back some memories from my own childhood I hope our paths will cross again I know you're currently
Starting point is 00:47:16 in Texas I will be in Texas by the way first after KubeCon the week after KubeCon if you make it to KubeCon if you make it to KubeCon, visit us there as well. All right, yeah, not scheduled, but yeah, if you're around, let's have a beer. Let's have a beer, yeah, sounds good.
Starting point is 00:47:33 Brian, any final words from you? Yeah, well, first of all, thank you as well. Two more thoughts that I didn't get in the beginning, right? Number one is that, you know, thank you also to all of our listeners. You know, we wouldn't be able to talk to amazing guests like Martin.
Starting point is 00:47:49 And, you know, Andy and I learn so much from this as always. And hopefully you're all learning. And the other special thing about today. So, Martin, you're in Texas. You're not too far away. We might cross paths because today is 5G zombie day, if no one knew about this, right? So in the United States, they're doing a testing of the federal emergency broadcasting. And the latest conspiracy on that is that's going to trigger, I don't know if it's the
Starting point is 00:48:16 COVID microchips that are supposedly being used, but it's going to turn us all into zombies. So Martin, if that happens, I'll meet you on the field eating brains together. And Andy, yeah, I guess maybe travel to the United States might be restricted because it'll be a zombie land. To all of our future zombie listeners, thank you. Martin, it's been a real pleasure. Yeah, you have any last things you want to get in there, Martin, as well? No, just really, really thank you. Thank you for the chance of sharing my war stories here.
Starting point is 00:48:48 Always eager to chat performance. As you guys noticed, it's something I'm really, really passionate about. And again, I always mention that. I mean, it's a very small world. We tend to kind of bump into each other at conferences and you name it so you know listeners you guys anyone
Starting point is 00:49:08 if you want to chat performance or anything as always feel free to kind of reach out you know I think you guys will share links to LinkedIn
Starting point is 00:49:14 and all those things I'm always eager to chat with like-minded folks alright well thank you thank you for giving back to everybody with your projects
Starting point is 00:49:24 and everything else so alright everyone thanks for listening we'll talk to you next time and happy october
