PurePerformance - Encore Presentation: 033 Performance Engineering at Facebook with Goranka Bjedov

Episode Date: December 3, 2018

Encore Presentation: Goranka Bjedov ( https://www.linkedin.com/in/goranka-bjedov-5969a6/ ) has an eye over the performance of thousands of servers spread across the data centers of Facebook. Her infrastructure supports applications such as the Facebook social network, WhatsApp, Instagram and Messenger. We wanted to learn from her how to manage performance at such scale, how Facebook engineers bring new ideas to the market and what role performance and monitoring play.

Transcript
Starting point is 00:00:00 Hello everybody and welcome to our first Encore presentation of Pure Performance. Yes, for the first time in two years and five months, we are airing a repeat. We've been very busy here at Dynatrace, and between the workload and scheduling conflicts, Andy and I slipped. Don't worry though, we've got a lot of great new episodes being scheduled. One reason we've been so busy is that we're preparing for Dynatrace Perform 2019 in Las Vegas this January 28th through the 30th at the Cosmopolitan. Make sure to head over to dynatrace.com forward slash perform dash Vegas, or simply search for Dynatrace Perform 2019 Vegas to get more details and register. We'll be podcasting from the event
Starting point is 00:00:37 and would love to interview you. Speaking of events, today's encore presentation features Goranka Bjedov, Capacity Engineer at Facebook. She'll be presenting at this year's DevOne Conference in Linz, Austria on April 11th of 2019. And now, let's go back in time to episode 33, Performance Engineering at Facebook with Goranka Bjedov. It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello, hello, everybody. This is Brian Wilson of Pure Performance with my co-host Andy Grabner.
Starting point is 00:01:32 As always, hello, Andy. How are you today? Hola, Brian. Muy bien. You too? Yeah, muy bien. I have to keep practicing my Spanish because I just came back from Colombia. And I just came back from Spain.
Starting point is 00:01:48 So how was your Spanish there? You can get away with a lot of English in Barcelona. But, you know, a lot started coming back to me here and there. I probably sounded very silly, but, uh, I could communicate enough to get some things across where I needed to. I was lucky to have my girlfriend with me. Let's say that. Yeah.
Starting point is 00:02:17 You have a crutch. You'll never learn the language, right? Tell them this. Yeah, exactly. So, um, Brian, I think today, well, I'm very excited. Yes, I am too. About our guest, Goranka Bjedov. Hopefully I correctly pronounced it.
Starting point is 00:02:35 We are two fellow Europeans. And Goranka just told me that sometimes, when both of us speak in the US, we are assumed to share the same background. People think, of both Goranka and me, that we may have some German accents, even though Goranka is not from a German background, but from Croatia. Goranka, are you there with us? Hey, guys. Nice to join you here. And yeah, Andy, right on target. I frequently get asked if I'm German, and I think it's those Vs and Ws that trick us all.
Starting point is 00:03:11 Here's the true test. How do you say... there's that alcoholic drink that mixes well with a lot of things? Originally made in Russia, I believe. Oh, vodka? Okay, there you go. The V or the W. Vodka. Yes. Yeah. So that one, obviously being Slavic, it's very easy for me to pronounce. Yes. Um, not exactly the first time I'm using the word. But, uh, you know, one of the
Starting point is 00:03:37 things that I'm always confused about is, uh, and that I would think would be the difference between person from German background and somebody like me is the articles. As you will probably notice, they are randomly sprinkled in my sentences. They make no sense whatsoever, and I expect most German speakers will correctly put them in the right places because German as a language has articles. Croatian doesn't. So bear with me on that front so garanka they um the reason why we invited is if the two of us we recently got introduced to each other uh at whopper
Starting point is 00:04:15 um the performance workshop down in lovely new zealand and uh and if i look at your uh linkedin profile i put pulled linkedin up maybe i should have pulled up your facebook page because what it says is it you are a capacity engineer at facebook at least on linkedin i'm not sure if that's the accurate title but i just read it out loud and i i will admit i'm not very good about updating my linkedin stuff uh i think the accurate title is performance and capacity engineer or capacity and performance engineer. You know how it is in companies. They shuffle those around every once in a while and I just don't bother updating anything. But yeah, I do
Starting point is 00:04:57 capacity and I do performance. Oh, so I totally agree with you. You know, titles don't really tell you a whole lot. I think it's more interesting what you do in a day-to-day life and actually what you do to actually make Facebook successful. And I think all of us know what Facebook does and the scale of Facebook. And so especially performance engineering and capacity planning, I believe, is just a very – I assume just – I don't even know. I assume it's just very, very different to what most of us are seeing. Or maybe not. Maybe it's just the same thing but just on an automated different scale. So what I would like to know is how do you do performance engineering at Facebook?
Starting point is 00:05:39 How does it work? What can we learn as non-Faceers uh for for our environments that we have that might just be on a smaller scale but we're still maybe struggling with similar things so maybe you can enlighten us a little bit uh sure happy to tell you what are some of the things that we do um i uh i'll automatically apologize to any of my colleagues whose work I forget to mention in this place. But, you know, we are quite a diverse environment, right? Because I am a capacity and performance engineer for Facebook, but I really have four different brands, fairly large to manage. And so I have Facebook, which has, you know, roughly going almost to 1.8 billion, uh, monthly active users.
Starting point is 00:06:25 Then there is a Facebook messenger, which is a completely separate product line and a WhatsApp, which both have more than a billion monthly active users. And then there is Instagram and that's just the big things. And so, uh, the goal of the team that I'm a part of is to kind of support all of them. Plus the other stuff that I didn't mention, like, you know, there is Oculus, there is all of the work that we do in internet.org and so on, and make sure that all of these teams get enough, uh, pretty much of everything they need, whether it's, uh, servers, whether it's, um, you know, network, you know, whatever they need, it's, it's our job to make sure that they get enough.
Starting point is 00:07:07 And sometimes that means procuring more resources, but most of the times it also means looking at the code and fixing the really big problems, making the code run more efficiently so that you can fit into the footprint that we have kind of allocated to you. So I know we are not allowed to talk about real numbers, but obviously I assume you have thousands of servers where you run large numbers. So that means a performance problem, like if a developer is committing code and it goes out into production, bad code can potentially cause huge costs, right? Because it's just the scale itself. So how does this work?
Starting point is 00:07:54 If I'm a developer and I make code changes, do you and your team help developers with architectural guidance, with code reviews before they deploy something into production? Or do you allow developers to deploy something in production and then punish them afterwards? So how does this work? So the answer is it depends. If you think about, and I'll just focus on Facebook, the social network and the product associated with it, it has the front end, what we call the product front end, and most of that code is PHP.
Starting point is 00:08:30 However it runs inside HHVM. And so we tend to work very closely with HHVM team, right? So in that particular case, we don't really work very closely with the product people, but we have a whole bunch of tools and monitoring systems available to them so that if they deploy something that is truly atrocious in PHP, they will know about the results and the consequences before they actually push the change. The front end also has continuous push to production. So as soon as somebody commits a change, it's going to go out live and obviously take some time for it to propagate to all of the servers, but it is live automatically.
Starting point is 00:09:17 If the change is really, really, really, really bad, it should have been caught before it goes live. Obviously it can be reverted. Now that's the front end, right? But then you have all of these other, we call them services and the back ends. And I'm not talking about the database, right? So the, the database, um, is handled by the database team. And, uh, as from my perspective, they do a fantastic job and I don't really have to worry about it too much. But then you have all of these things like photos, um, like newsfeed, like, uh, ads, uh, all of the other stuff that comprises Facebook. And, uh, those have individual teams that do development. They have, uh, you know, they pick their own stack. So, uh, I think most people know that majority of our
Starting point is 00:10:05 backends are written in C++, but there is also C code. There is Java code. There is, um, there are, there are smaller patches of different other things. Like there is some Erlang, um, there is Haskell, right? And so I tend to work primarily in C, code base. We have people who work, you know, in Java. We have people who work in the other programs as a part of our team. And so we did those teams. We work a lot closer because we kind of manage their machines a lot more directly. If you think about the front end is really managed through HHVM. I mean, we work with the HHVM. I mean,
Starting point is 00:10:50 we work with the HHVM team, but everything else kind of lands in there. For these other things, you know, depending on what the new product is, we will be involved from the point in time when they propose an architecture because they have to tell us what are their predictions and what do they think they will need in terms of resources? At that point in time, I ask for architecture review because when you are this scale, you cannot support, you cannot run away with bad architecture. So if you, in general, any large company will tell you, so it doesn't matter whether it's Facebook or whether it's Google or Microsoft or Amazon or so on, what you want to have architecturally is that with the linear growth of number of users, you have sublinear, ideally logarithmic growth in the number of resources. So you want to leverage the scale. The opposite of that, if, for example, somebody proposes an architecture that for the linear growth of the number of users has exponential growth of the number of servers, I can't let you launch that because that's fundamentally impossible for me to support at this scale. You know, what somebody who has maybe five or ten machines can say, hey, I can get away with it for another month because the growth is relatively slow.
Starting point is 00:12:02 I just can't, I don't have a month to get away with it. Uh, because exponential things, um, you know, the, the numbers that we were talking about, billions of users, you just can't, can't allow that to happen. And then from that point on, you tend to look at, um, you know, typically the team. So if you agree on architecture, the team goes and, uh, and, you know, develops their code. And, uh, one of the famous Facebook mottos is, uh, move fast and break things. And so when they have, let's call it prototype ready, I don't expect it to be perfect. Um, I just want it to be quickly available and, uh, you know, we can launch it and get user feedback on it. Because one of the problems that we have, like a lot of other companies,
Starting point is 00:12:48 is something may seem like a phenomenally great idea to us and the users could just shrug and say, yeah, we don't really care about it. While, you know, in a different situation, something that we feel like, well, you know, we'll just toss this out and see, but we don't, we're not very excited about it. Users will, you know, embrace and absolutely love the product. And so what you want to do is you want to push things into production as quickly as possible, see what the user response is, and then decide how you're going to proceed. In some cases, you will basically eventually just shut down the product. In the other cases, you realize this is going to be big.
Starting point is 00:13:29 And so you start doing a lot of performance work. So that's basically, you're talking about continuous experimentation almost. I mean, I'm not sure if experiment is the right word for it, but it sounds like it, right? You're making an experiment. Absolutely. You know, every day there is ab experiments where people are trying to figure out you know what do users like more or what do users engage
Starting point is 00:13:52 with what is more useful to to people right uh even how things look what is more appealing um you're constantly running the experiments but on the on back ends, the main thing is if this product has proven to be successful, if people have really, really enjoyed it, how do we continue to deliver it at lower cost, which usually means improving performance. One question I wanted to ask about the user feedback, right? Are you basing that strictly just on usage metrics? Or are you also taking a look at some of the feeds that people might be posting on, a combination of things? What's the determining factor to see? And on top of that, how do you know or can you determine if people aren't using it and maybe say, well, maybe it's because it's a little too slow?
Starting point is 00:14:44 A combination of things. Absolutely. So, you know, Facebook has a very large private cloud, like most large companies. And so obviously everything is being monitored. And so what we can see is, let's say, hey, maybe people in country A are using this product less than people in country B. And you can say, here is a hypothesis. Maybe this is because, you know, the product is slower in country A than it is in country B. And then you can go and experiment. What happens if I speed up the product in country A? Does the usage go up?
Starting point is 00:15:23 Or is the result, you know, is the fact that they actually use it less due to some other thing? But in general, you want to understand what is going on. We monitor very closely how quickly things respond. We try to do everything we can to get as close to the user and to terminate our connection, what we say in POPS, in points of presence that is close to the user, so that all of the TCP IP and all of the SSL and all of the HTTP handshaking and stuff is actually happening between a machine that is very close to the user and they don't have to go all the way to our data
Starting point is 00:16:05 center. So yeah, we experiment on those things. And luckily, you know, we have the ability to move things around that allows us to run these live experiments to understand, you know, what is going on, you know, form a hypothesis, do some measurements, figure out how you're going to experiment, then go ahead, do the measurements and then draw your conclusions. But in general, how do we get feedback? Well, for one, we get all of the observed measures that we get everywhere. But also, I think one of the things that no Facebook engineer is ever short of is the user feedback. Um, we, we get to hear every single time and we screw something up. Uh, you know, I, I will certainly hear from my friends in the feed and from just random people, you know, at a conference, you hear the stuff and, um, I,
Starting point is 00:17:02 I personally appreciate it. I, uh, I know that know that you know we are lucky to have users that uh are passionate about the product and as frustrating it is it can be at the time you know because obviously we've gone through several several um redesigns where people were really really angry and you look at it and you go like well how can this be i mean we've just kind of made it better and we all believe it's better. And then you realize Facebook has become something so personal that the equivalent of us rearranging the page and making things go to different places could really be explained by me taking somebody's wallet and rearranging all the things inside it to what I think is better. And, and people get, I know I would get really annoyed if somebody did it to me. And so, uh, so I'm hoping we are, we are getting better at these things, but at the same time, you have to continue experimenting. You have to continue getting better because one thing that happens in this field is
Starting point is 00:18:03 that if you, if you stay in one place, um, you're not going to be around for very long. Right. And this is fascinating. So deploying fast, fail quickly. When you deploy a change, I think you mentioned this also in New Zealand. And earlier, before we started, you said it's okay to tell you because most people are doing it that way you are always it takes a while until the changes get propagated into all the servers but then you still leverage things like a be testing or can we release the deployments and yeah so
Starting point is 00:18:40 we can do let's say let's say there is a brand new product and we really don't know, uh, you know, how it will be, uh, received by people. And we would just like to get some preliminary feedback before we polish it. Um, you know, and, and sort of decide, do we want to go further with it or so on? Um, I don't think it's a, it's a big secret. Definitely. It's not a secret in New Zealand, but pretty much all of the companies, um, in the technology feed will like, um, usually pushing things out in New Zealand and getting early feedback over there. You have a relatively small English speaking country, uh, extremely friendly. Um, and, uh,
Starting point is 00:19:17 and since they have been used to these kinds of things, it doesn't matter which particular company is doing it. I mean, from, let's say from old guys like IBM all the way to the newest startups, you can get incredibly valuable feedback, and you can then pivot and do whatever needs to be done before you deploy things to the other countries. Obviously, being who we are and doing what we do, we have the ability to do the testing. So let's say I'm only interested in what would be the feedback of a particular age group or a
Starting point is 00:19:54 particular country, we have the ability to deploy only to those users. And so we use those things as well to just evaluate how good the product is and how do we make it better. And can I ask a quick question? Does this mean you have particular server groups for a particular geo or a particular age group? Or is this you have to deploy it everywhere, then you use feature flags that you dynamically turn on and on depending on the user? Feature flags. Feature flags, yeah. that you dynamically turn on and on depending on on the user feature flags feature flags yeah um yeah and and occasionally when we misuse feature flags uh it ends up being a lot of fun in the company trying to figure out what just happened right but you know every once in a while uh you
Starting point is 00:20:38 you you know we mess up and there is a report in the news saying uh facebook's just launched such and such and and we'll be like wait how do these guys know and then you realize you know we made a mistake on a feature flag or we didn't set the feature flag or something like that i think uh the the worst one and the most recent one and i you know but i'll mention it uh we accidentally killed off half of the people on the planet. Oh, I remember hearing about that. Oh, you didn't hear about that? No, I heard about that.
Starting point is 00:21:10 I think I remember hearing some people said I'm not dead. Including our CEO, right? Yeah. You know. So I think. It's the Paul is dead thing, right? Yeah, exactly. And so imagine being the engineer who comes to work one day and you find out that you killed your CEO and half of the people on the planet. But I think one of the wonderful things about being at Facebook is,
Starting point is 00:21:30 you know, if you really don't go after the people, I couldn't even tell you who made that mistake. I mean, I know the team that was involved, but I don't know who the person is. And it's irrelevant. It's not that person's fault. It's the fault of the whole engineering organization that we have allowed for something like this to be possible to happen. Right. You can't build systems where a small mistake by a human being that is under pressure, under stress, working hard can create something like this. And so we learn from it and we create better systems. That almost goes back, ties back to, what was that recent outage where somebody typed it? Sam, I mustn't.
Starting point is 00:22:10 Yeah. Right. So where somebody typed something in and a lot of the, there was a lot of people come, came to the defense of the engineer who typed that in saying, Hey, who, who hasn't made a typo before and why was that a manual process? Right. So I think that ties directly back to that, that same concept concept the other thing i wanted to mention is you know going back like five six seven years ago i before you know before there were so many large-scale companies like this and obviously you have thousands of you know large amounts of servers and a very well-developed pipeline where there's checks and balances before things go to production. But I still remember back in the earlier days, and I shouldn't
Starting point is 00:22:48 say earlier days, but back in the days when the joke was starting to come out of as, you know, production is the new QA, right? And on some level it is. Right. But back then there were no gates, right? It was the things were getting thrown in. And I don't mean necessarily Facebook. I mean, a lot of companies were starting to get lazy and cutting out testing and just like, oh, we got to push this out but now with with the pipelines and the whole all the gates that can be put in and especially when you have the scale that you you do you really can make to an extent production qa but behind the scenes there's all those checks and balances that people aren't seeing they aren't aware of that you know we know
Starting point is 00:23:23 if there's been architectural regressions or any of these other things that have gone on in between before it gets out to production. And then you can do those nice, um, the tests, the New Zealand tests, there should be some kind of a name for that, right? It sounds like. You know, Hey, they are, they are a separate continent right now. Right. So, uh, so, so we test on the eighth continent. Um, but as I said. But as I said, it was the same case. You know, I've spent some time at different companies before. And so it's kind of a standard practice. And as Andy can confirm, it was certainly not a surprise to anybody in New Zealand. They very well know that they get access from their perspective. You know, they get access to all the
Starting point is 00:24:03 code much faster than anybody else and they get to see things before anybody else does so that's kind of cool yeah and that's similar to where my my wife grew up in a town called bricktown new jersey and that was one of the experimental places for new products for at&t probably at&t labs and stuff well not even that but this was just uh well at&t labs i think that was more in north jersey but this was like new products would come out right and they would do test markets it was bricktown was always this test market it's just this weird phenomenon so yeah you know you you have to somehow find out what you know what people think what should be changed because um and they probably remember i've made this comment uh you know i've been at google for five or six
Starting point is 00:24:43 years five five and a half years before facebook. And I have like absolutely perfect track record of mispredicting the value, success, or longevity of any product that we've launched during that time. Every single time. And I feel this is fantastic. Everybody's going to love it. You know, the product goes nowhere. And when I'm in there going like, who would care about this? Those things explode and end up being phenomenal. So, you know, you have to accept that, uh, you know, part of the reason why I'm not, uh, I'm not a product person and why I'm really happy to, to listen to product people. And regardless of how much I disagree with some of their decisions, I also know I have a track record of really bad, bad predictions on these things. And so, you know, you got to listen to people who actually have a lot better understanding of what people want.
Starting point is 00:25:38 But you also have to verify it. So in case something slips through and in case something goes wrong, do you, what's the mechanism at Facebook? Is it the roll forward? Is it the roll back? How does this work? Varies from team to team. Facebook is fairly liberal when it comes to how do you run your, your group. So, you know, people always ask me like, Hey, do you, do you do agile? And, you know, do you have a standups and so on?, some things do some things don't, uh, I don't think anybody runs it according to, you know, the official agile, agile philosophy or so on. Um, but, uh, some
Starting point is 00:26:18 things will just go, there are times when people will approach me and say, we are heading into a serve. Uh, Hey, uh, we need help. We know we screwed this up. Uh, could you help us out and, you know, loan us X number of machines for the next two weeks, because we really, really don't want to roll this back. And, uh, that is one of those situations where, you know, capacity and performance engineers kind of look at it and go like, okay, if it's the right thing for the company, we'll go ahead and help you. But I always extract promise of payment by saying, okay, but I want you to fix this and this and this in the long run. And we put our own people to work with the team to just make the code
Starting point is 00:27:00 be better when after three weeks everything is settled down. So that happens sometimes. In the other cases, the problem can be so bad that you just go like roll it back immediately, right? And most of these teams will roll things only to a small subset of servers and figure out what is going on. And then you just roll it back and go like, oops, you know, we didn't catch this particular thing. Or if it's something that you realize, uh, you know, I just messed this up. Uh, you will just roll forward and, uh, you know, just do the bug fix, uh, automatically. I mean, for years we used to
Starting point is 00:27:36 roll our main, our, uh, front end code base on Tuesday. And then all of the other days of the week, there would be a major release going on but it tended to be mostly bug fix so we would be rolling forward. Wow and so you just mentioned something earlier you said you know your code is always rolled out everywhere and then you turn on things with feature flex but now you just mentioned there are some teams that can deploy code to a certain set of servers. That means, is this then for a test environment? Or is this something where they can say, please load balancer, put the traffic from this particular group to that particular set of machines?
Starting point is 00:28:21 So it's, again, we do all of it. So for example, we constantly do performance testing live in production in every single one of our, what we call front-end clusters. And think of front-end clusters as basically a large number of web servers receiving and terminating connections from the users and then sending requests to all of the various backends, packaging it and sending the response back to the user. So in each one of those, we actually tell load balancers to keep certain subsets of these machines at a much higher load. A long time ago, we've written a paper about it. I'm just trying to figure out. And there is definitely a note on Facebook engineering page about it. I'm just trying to figure out, and there is definitely a note on a Facebook engineering page about it, but the way we measure performance for pretty much anything at Facebook
Starting point is 00:29:13 is we look at how much delay does Linux scheduler induce from the point in time it received requests to the point in time it was dispatched to one of the cores to be served. And so we refer to this as kernel or scheduler-induced delay. At the point in time when that reaches 100 milliseconds, we consider the machine loaded maximally. And at that point in time, that machine should not be taking any more traffic. Most production machines do not get anywhere close to that kind of a load. Because, you know, the reason is very simple. Once you start queuing things up and they end up waiting before they are deployed to be executed on,
Starting point is 00:29:59 you end up basically entering the realm of queuing theory and all of the calculations, everything else that you're doing that you're optimizing is basically relevant. Only queuing theory matters. And so, however, we do have a subset of machines in each of our front-end clusters that we run at two different levels of schedule-induced delay, just so that we can extrapolate at any point in time and find out how much load could we throw here and still be okay compared to the existing load.
Starting point is 00:30:32 So that's one of the things that we run in production every day, all the time, and that is on the front end. Now, again, the various backends and the various services have complete freedom and can figure out how they want to test and check things. You can imagine minor things, hey, I may have just improved a little bit or changed a little bit in terms of, I don't know, search quality result here or there. That will go probably and be pushed to a subset of machines and then it's going to go live very
Starting point is 00:31:05 quickly. But alternatively, hey, I'm changing how I'm sharding my index, right? So I'm changing how I'm storing all of this information, all of these bits, how it's going to be searched for. Well, that cannot be pushed to a small subset of machine in production because that is a completely different system. That's, that's really a different backend. And so it's going to be pushed to a backend. It's going to be taking a certain amount of shadow traffic and it will be monitored for a while and looked at to understand, you know, how, how does it perform? Then, uh, one of the things that you will do is you will now turn it live. And obviously a subset of users will be going to this system and a subset will be going to the old systems. And then you'll find out, is there actually difference in user behavior across
Starting point is 00:31:57 these systems? And if there is, and if the, if it's clear that a new system has made things worse, you will pull it back out and try to understand why, what is going on. But these are the kinds of things, these are the kinds of tests that you really cannot do without live users. Most of the stuff that I really care about, I need to have live traffic. That's mind-boggling. I mean, even thinking of, you know, my background from sitting on the other side of the fence was load testing and just even trying to think of how you would run some kind of a test at scale that in that manner is, you know, kind of mind-boggling.
Starting point is 00:32:40 So, yeah, I can absolutely see the need for production. And I think it's amazing too that you can simulate these little pieces or turn on little, as you were mentioning, the shadow traffic and the delay to see how that's going to impact the performance. It's just mind-boggling to me. You know, so one of the things is I've been approached by vendors at different events and they're like, oh, but we have this Facebook traffic simulator and you can always replay the same traffic and get. And it's like that is the last thing I want to do. What do I gain by replaying exactly the same traffic from a year ago and finding out that my code works really, really well, uh, based on the stuff that people used to do a year ago. How does that, you know,
Starting point is 00:33:33 that information is utterly meaningless to me. All those pokes are going to bring you down, right? Exactly. Right. It's like, Oh, you know, the, because I'm using a year ago over here, maybe five years or so on. It's like there is so much new code in here. There is so much new functionality that people are doing different things. You know, it always kind of perplexes me when I try to explain why we use live traffic and I have people going like, well, that's terrible. And I'm thinking, you know, alternative is to say, OK, I'm not going to test at all. I don't have a choice. I have to use live traffic. And, you know, and so that's what we do.
Starting point is 00:34:17 Obviously, also, we have a way that we can just send things to the Facebook employees. So if we really destroy something badly, we are the ones who will suffer first. Right. And so we use that method as well. Again, um, pretty much anything we can to get the highest quality product to the users as quickly as possible, because one of the things that was really frustrating for me when I joined Facebook was this idea of, you know, quick and it's like, you can't get it done as well as I would like it to be quickly. And then you realize the alternative is that I take three or six months and 10 people working on a project. And we have now performance optimized this code so that it's really spectacularly beautiful. It has no, you know, absolutely no problems.
Starting point is 00:35:08 It is completely NUMA aware. There is no locks of any kind. There is no forced interrupts. It is just running beautifully. You pass it through all the analyzers and it goes like, this is fantastic. You release it and the users go like, yeah, no. Right. And so now you have wasted 10 people's time for three or six months because let's face
Starting point is 00:35:31 it, it's going to take some time. And you have spent five, let's say person years on something that just gets discarded immediately. It is just a bad, a bad approach. Um, and as much as I wish, you know, we could spend more time perfecting things and making sure that they absolutely spectacularly good. I realized that, um, you know, we live in different times with the, with software being in the cloud and with us not shipping the media to somebody um you have opportunity to fix things much faster but more importantly you have opportunity to stop working on things that
Starting point is 00:36:11 don't matter uh quicker and so you no longer have people spending you know years developing things that users like really don't care about yeah that's also a big concept that we've been we've been promoting for a while basically monitor monitor usage but also resource impact of features out there and then as you said you know kill those features that nobody needs improve those that are heavy hitters or like that people like but that have not the best performance and i think you mentioned it earlier uh give give developers a fix-it ticket right you know you proved it that this is valuable now i give you a fix-it ticket which means i give you three weeks six months however you think it needs to make it to make it efficient enough that we can actually make money of it and uh and it does it that it becomes less costly than we gain from it right so i think this
Starting point is 00:37:05 is this is great and met i mean metrics everywhere and based on metrics make the right make the right decisions right this is the key the key thing here you know it's uh it's interesting you know i give occasional lectures and sometimes workshops on capacity and on performance and in particular in cloud. And I tell people, it always astounds me when people go like, well, you know, don't, don't talk about monitoring. That's okay. We will, you know, do this or that. And it's like monitoring, if there is one thing you have to do and you have to do it
Starting point is 00:37:39 right, it's monitoring. And then the second thing you have to do right is monitoring. And then the third one is monitoring. And then you can add alerting if you want. But if you're not monitoring your production, like why even bother doing any kind of testing? I mean, you obviously don't care, right? So it always surprises me how often I meet people from smaller groups, from let's say startups and so on. And they just go like, well, you know, we don't have any monitoring uh and they would like some help and it's like you know what
Starting point is 00:38:08 if you don't have any monitoring you really don't want any kind of advice from me because you just don't care right because you cannot they cannot measure and we cannot measure the impact you actually have with every action right that's the problem you don't know and i think the key there is too is monitoring right uh which is always the hardest thing i remember uh back andy during perform while you were running around in your leader hose and um we were talking to josh mckenty at pivotal and he was talking similar he was bringing up a similar uh concept of knowing the correct metrics and and adapting and modifying your metrics as time goes on throwing back all the way back into like the geocity days when people would put
Starting point is 00:38:46 how many hits they had and hits, hits being a metric that was used, which, you know, today we can look at and be like, well, that's, that's pretty meaningless. Right. But that correct monitoring and knowing what to look at is like, you know, really the harder part. So I'm curious, do you have any pictures or videos of Andy and Lederhosen? Oh, there are. He was riding a bike, right? A red bike.
Starting point is 00:39:13 There was Klaus. Oh, that was Klaus, that's right. I can't tell one Austrian from another, and I'm kidding. Because I think we could, you know, we could try and make it viral video on Facebook. You know, I'm happy to share it. Make that one of your releases. Yeah, so what we actually did, we staged out our own DevOps transformation within Dynatrace on stage with our CTO and founder in Lederhosen, our DevOps manager, Anita. And she was working at Dindlo.
Starting point is 00:39:43 And I was also there in lederhosen and we we were basically uh staging or playing playing the um uh the um you know we were playing it out acting it out why we as that is made our big transformation and i think part of that i also presented uh you know in new zealand uh the the idea of the UFO that came up, highlighting quality, using our own tools when we develop our own tool. I think you made a great statement in New Zealand, and you said something like, you always get approached from these vendors, and then they always tell you something that they believe helps you. But most of these vendors don't even use their own tools to prove that it's actually useful and and i think we we just damage ways we are in a fortunate situation obviously to build software for software companies and so if we cannot make use out of it then then we're obviously building something wrong and and and i think yeah that's oh absolutely you know and uh
Starting point is 00:40:43 you know i've told you like one of the things that truly boggles my mind. And I mean, you know, I'm from the Balkans. Right. So it's not like we are the most diplomatically correct people on the planet or anything of the sort. But, you know, people approach me and that's part of the reason why I could never work in sales. But and they go like, here's this great thing I have and you should buy it. And I always wonder about it. Like, shouldn't you start by asking me, Hey, do I have a problem? And if I do what that problem is and then figuring out how to help solve it, because all of these salespeople seem to be solving their own problem, which is I need to get my sales quota up. They
Starting point is 00:41:20 don't even understand what I'm dealing with. And it's like, this is really cute that you have that problem, but it's not my problem and it's not something I'm interested in at all. But, you know, when I found out that people have a tool that their own company is not using, I don't know if you could have a worse situation. It's sort of like, so it's not really good enough for your developers, but you want me to give you money so I could test it for you. Well, there is a great plan.
Starting point is 00:41:48 You know, where do I sign? But, you know. That's actually an interesting concept for anybody who gets approached by any vendor. First question to ask is, well, how do you use it within your organization? Absolutely. You know, and what problem are you solving right um and you know and when they try and tell me it's like but i don't have that problem um then they keep saying but this solves that problem really great and it's like i don't have that problem but you know i understand uh i
Starting point is 00:42:19 understand what their job is and you know if i at a conference, I'm always happy to talk with people and, you know, share feedback and give them hints. One of the things that we find, and again, Facebook is not the first company that I've been in that is very large. Most of the tools work reasonably well for medium size and probably some large companies as well. They end up really breaking at the size that we are. And so whenever we hear complaints about why does Facebook have to develop all of its own tools and we obviously open source all of our stuff, it's fundamentally when you have 1.8 billion monthly active users, there really aren't a lot of off-the-shelf tools that you can install and expect them to work at that size it's just it's just not going to happen and if you have to maintain things it is a lot easier to maintain
Starting point is 00:43:11 things that you have code for it was phenomenal listening to uh understanding the scope of of how of how facebook actually works and how big you actually are i mean i think we kind of knew but actually remembering that it's not only Facebook, but it's WhatsApp, Instagram, Messenger. What I really liked are two things that you said. I mean, I liked a lot of stuff, but two things that strike me a lot is if somebody proposes an architectural change or a new feature, then you come in and help them review because you want to make sure if this thing goes really off that that you know they did something very stupid i like that a lot so performance requirements architectural and scalability requirements being part of the requirements definition i like that
Starting point is 00:43:56 and and the second thing that i just love which i will hopefully more and more companies get to do is uh continuous experimentation, deploy fast, figure it out, throw it away. If nobody needs it, don't bother with optimizing it. But if it is a great feature, people like it, then optimize it. You know, the only thing that I would say you absolutely cannot fix once it's deployed is bad architecture. It is. And, and and and you know same as anything else now somebody who's listening will say well that's not the case here is this one situation sure but if you deploy bad architecture and you deploy it to a large number of people uh it becomes
Starting point is 00:44:37 incredibly expensive and difficult to fix that so you're better off starting with the right architecture, you know, so simple things, right. What, what should be, you know, if you think about data, right, what is the hot data, hot data belongs into Ram, warm data belongs on flash, cold data can be on the disc, right. Uh, you know, if you, if you just start with the idea that all of your data is going to be in Ram, uh, well, that's a disaster at our scale, right? I will not be able to support your product. There is just not enough RAM to purchase, you know, for something like that. Nor would I ever actually attempt to do something like that. Instead, I'm going to sit down with you and come up with a solution that, again, does multiple
Starting point is 00:45:22 layers of caching and we figure out what needs to be where. Those are the kinds of things that I think need to be fixed before you start coding. The other thing that I'm sure no other company has ever run into a problem like when you have five solutions for exactly the same problem. And so a new person comes to the company and they can't find what they're searching for. And then they go like, I will develop my own my own i don't know key value store or pub sub system and so on and so when they come to me and say and then this i go like no you will use one of the existing 10 data stores that we already have uh and so you don't have to maintain the code and if there is
Starting point is 00:46:01 something that is missing in those then we will upgrade one of those, but I'm not going to be paying for yet another, you know, pick a system of your choice. As I said, PubSub, key value pairs. I'm sure every single company has five solutions of at least one different basic computer science problem. Yeah. Brian, let's conclude this session. Any, any final thoughts from your side on this one? Well, yeah, I guess I'm thinking more of the, on a higher level of, you know, if, if Facebook can do it on this scale, you really have, you know, no excuse to get through. You can't get through whatever you're getting through, right?
Starting point is 00:46:41 Cause this is like uncharted territory where, you know, as Karinka was mentioning, having to create your own tools, having to do all sorts of kind of things to make it through. But what I really find fascinating is the experimenting in production and even using that production load as you're testing, right?
Starting point is 00:47:01 Because as we talked about, maintaining or creating realistic kind of load models is very, very difficult in any situation. Uh, and, and just for the fact of new features coming out all the time, there's really no way to stay on top of that. And, and it's just makes complete sense. So to see, you know, for anybody who's kind of like a hater on it, I think it makes absolute sense, especially with the scale that Facebook's operating at. Now, maybe at smaller scale kind of operations where maybe A-B testing is not really an option and other kind of components like that, where those restrictions are in place, you might have to do those, you know, those models and run it that way. But whenever you have that opportunity, I think it should definitely be leveraged. And I think that's really, really great.
Starting point is 00:47:51 So that's all I have for me. We will be back in a part two with Grinka and Bjerowna. I got that name wrong. I totally blew it. It's okay, Andreas. Andy can pronounce it. No problem. Andy, why don't you give the outro on this then?
Starting point is 00:48:10 It's okay. We'll be back in a couple of minutes, hopefully, with the next episode. With who, Andy? With Goranka Piedov. There you go. How was that? Good. All right.
Starting point is 00:48:23 Thank you, everybody. We'll be back in a moment. Or really, I guess there's no moments because this isn't radio. So we'll be back in a click. How's that? Good. All right. Thank you, everybody. We'll be back in a moment. Or really, I guess there's no moments because this isn't radio. So we'll be back in a click. How's that? Thank you all for listening.
