PurePerformance - 033 Performance Engineering at Facebook with Goranka Bjedov

Episode Date: April 24, 2017

Goranka Bjedov ( https://www.linkedin.com/in/goranka-bjedov-5969a6/ ) has an eye on the performance of thousands of servers spread across Facebook's data centers. Her infrastructure supports applications such as the Facebook social network, WhatsApp, Instagram, and Messenger. We wanted to learn from her how to manage performance at such a scale, how Facebook engineers bring new ideas to market, and what role performance and monitoring play.

Transcript
Starting point is 00:00:00 It's time for Pure Performance. Get your stopwatches ready. It's time for Pure Performance with Andy Grabner and Brian Wilson. Hello, hello, everybody. This is Brian Wilson, Pure Performance, with my co-host Andy Grabner. As always, hello, Andy. How are you today? Hola, Brian.
Starting point is 00:00:38 Muy bien. You too? Yeah, muy bien. I have to keep practicing my Spanish because I just came back from Colombia, and I just came back from Spain. So how was your Spanish there? You can get away with a lot of English in Barcelona. But a lot started coming back to me here and there. I probably sounded
Starting point is 00:01:02 very silly, but I can get my point, you know, get, uh, community. I could communicate enough to get some things across where I needed to. I was lucky to have my girlfriend with me. Let's say that. Yeah. You have a crutch. You'll never learn the language, right? Tell him this. Yeah, exactly. So, um um brian i think today well i'm very excited yes i am too uh i about uh about our guest uh goranka vidov hopefully i correctly pronounced it we are two fellow europeans and goranka just told me sometimes when both of us
Starting point is 00:01:42 speak in the usp sometimes we're put into the same background. They both think, both from Goranka and me, that we may have some German accents, even though Goranka is not from a German background, but from Croatia. Goranka, are you there with us? Hey, guys. Nice to join you here. And yeah, Andy Wright on Target. I frequently get asked if I'm German, and I think it's those Vs and Ws that trick us all.
Starting point is 00:02:12 Here's the true test. How do you say there's that alcoholic drink that mixes well with a lot of them? Originally made in Russia, I believe. Oh, vodka? Okay, there you go. the v the w yes yeah yeah yes so that one obviously being slavic it's very easy for me to pronounce yes yes um not exactly the first time i'm using the word but uh you know one of the things that i'm always confused about is uh and that i would think would be the difference between a person from a German background and somebody like me is the articles. As you will probably notice, they are randomly sprinkled in my sentences. You know, they make no sense whatsoever.
Starting point is 00:02:55 And I expect most German speakers will correctly put them in the right places because German as a language has articles. Croatian doesn't. So bear with me on that front. So, Goranka, the reason why we invited you is because the two of us, we recently got introduced to each other at WAPR, the performance workshop down in lovely New Zealand. And if I look at your LinkedIn profile, I pulled LinkedIn up.
Starting point is 00:03:26 Maybe I should have pulled up your Facebook page because what it says is you are a capacity engineer at Facebook, at least on LinkedIn. I'm not sure if that's the accurate title, but I just read it out loud. And I will admit I'm not very good about updating my LinkedIn stuff. I think the accurate title is performance and capacity engineer or capacity and performance engineer. You know how it is in companies. They shuffle those around every once in a while, and I just don't bother updating anything. But, yeah, I do capacity and I do performance.
Starting point is 00:04:01 Cool. So I totally agree with you. Titles don't really tell you a whole lot. I think it's more interesting what you do in a day-to-day life and actually what you do to actually make Facebook successful. and capacity planning, I believe is just a very – I assume just – I don't even know. I assume it's just very, very different to what most of us are seeing. But – or maybe not. Maybe it's just the same thing but just on an automated different scale. So what I would like to know is what do you – how do you do performance engineering at Facebook? How does it work?
Starting point is 00:04:42 And what can we learn as non--facebookers for our environments that we have that might just be on a smaller scale but we're still maybe struggling with similar things so maybe you can enlighten us a little bit. Sure happy to tell you what are some of the things that we do I automatically apologize to any of my colleagues whose work I forget to mention in this place. But, you know, we are quite a diverse environment, right? Because I am a capacity and performance engineer for Facebook, but I really have four different brands, fairly large, to manage. And so I have Facebook, which has, you know, roughly going almost to 1.8 billion monthly active users. Then there is Facebook Messenger, which is a completely separate product line.
Starting point is 00:05:31 Right. And WhatsApp, which both have more than a billion monthly active users. And then there is Instagram. And that's just the big things. And so the goal of the team that I'm a part of is to kind of support all of them. Plus the other stuff that I didn't mention, like, you know, there is Oculus, there is all of the work that we do in internet.org and so on, and make sure that all of these teams get enough, uh, pretty much of everything they need, whether it's, uh, servers, whether it's, um, you know,
Starting point is 00:06:03 network, you know, whatever they need, it's our job to make sure that they get enough. And sometimes that means procuring more resources. But most of the times, it also means looking at the code and, you know, and fixing the really big problems, making the code run more efficiently so that you can fit into the footprint that we have kind of allocated to you. So I know we are not allowed to talk about real numbers, but obviously I assume you have thousands of servers where you run large numbers. So that means a performance problem, like if a developer is committing code and it goes out into production, bad code can potentially cause huge costs, right?
Starting point is 00:06:50 Because it's just the scale itself. So how does this work? If I'm a developer and I make code changes, do you and your team help developers with architectural guidance, with code reviews before they deploy something into production? Or do you allow developers to deploy something in production and then punish them afterwards? So how does this work? So the answer is it depends.
Starting point is 00:07:17 If you think about, and I'll just focus on Facebook, the social network and the product associated with it, it has the front end, what we call the product front end. And most of that code is PHP. However, it runs inside HHVM. And so we tend to work very closely with HHVM team, right? So in that particular case, we don't really work very closely with the product people, but we have a whole bunch of tools and monitoring systems available to them so that if they deploy something that is truly atrocious in PHP, they will know about the results and the consequences
Starting point is 00:07:58 before they actually push the change. The front end also has continuous push to production so as soon as somebody commits a change it's going to go out live and obviously take some time for it to propagate to all of the servers but it is live automatically. If the the change is really really really really bad it should have been caught before it goes live obviously it can be reverted now that's the front end right but then you have all of these other we call them services and the back ends and i'm not talking about the database right so the the database is handled by the database team and uh as from my perspective they do a fantastic job and i don't
Starting point is 00:08:43 really have to worry about it too much. But then you have all of these things like photos, like newsfeed, like ads, all of the other stuff that comprises Facebook. And those have individual teams that do development. They have, you know, they pick their own stack. So I think most people know that majority of our backends are written in C++, but there is also C code. There is Java code. There are smaller patches of different other things, like there is some Erlang. There is Haskell, right? And so I tend to work primarily in C++ code base. We have people who work, you know, in Java.
Starting point is 00:09:29 We have people who work in the other programs as a part of our team. And so with those teams, we work a lot closer because we kind of manage their machines a lot more directly. If you think about the front end, it's really managed through HHVM. I mean, we work with the HHVM team, but everything else kind of lands in there. For these other things, you know, depending on what the new product is, we will be involved from the point in time when they propose an architecture, because they have to tell us what are their predictions and what do they think they will need in terms of resources. At that point in time, I ask for architecture review because when you are this scale,
Starting point is 00:10:10 you cannot support, you cannot run away with bad architecture. So if you, in general, any large company will tell you, so it doesn't matter whether it's Facebook or whether it's Google or Microsoft or Amazon or so on, what you want to have architecturally is that with the linear growth of number of users, you have sublinear, ideally logarithmic growth in the number of resources. So you want to leverage the scale. The opposite of that, if, for example, somebody proposes an architecture that for the linear growth of the number of users has exponential growth of the number of servers,
Starting point is 00:10:48 I can't let you launch that because that's fundamentally impossible for me to support at this scale. You know, what somebody who has maybe five or ten machines can say, hey, I can get away with it for another month because the growth is relatively slow. I just can't. I don't have a month to get away with it because exponential things, you know, the numbers that we were talking about, billions of users, you just can't allow that to happen. And then from that point on, you tend to look at, you know, typically the team. So if you agree on architecture, the team goes and, you know, typically the team. So if you agree on architecture, the team goes and and, you know, develops their code. And one of the famous Facebook mottos is move fast and break things.
Starting point is 00:11:33 And so when they have, let's call it prototype ready, I don't expect it to be perfect. I just want it to be quickly available and, you know, we can launch it and get user feedback on it. Because one of the problems that we have, like a lot of other companies, is something may seem like a phenomenally great idea to us and the users could just shrug and say, yeah, we don't really care about it. While, you know, in a different situation, something that we feel like, well, you know, we'll just toss this out and see, but we don't, we're not very excited about it. Users will, you know, embrace and absolutely love the product. And so what you want to do is you want to push things into production as quickly as possible, see what the user response is, and then decide how you're
Starting point is 00:12:22 going to proceed. In some cases, you will basically eventually just shut down the product. In the other cases, you realize this is going to be big, and so you start doing a lot of performance work. So that's basically, you're talking about continuous experimentation almost. I mean, I'm not sure if experiment is the right word for it, but it sounds like it, right? You're making an experiment. Absolutely.
Starting point is 00:12:44 You know know every day there is ab experiments where people are trying to figure out you know what do the users like more or what do the users engage with what is more useful to to people right uh even how things look what is more appealing um you're constantly running the experiments. But on the back ends, the main thing is if this product has proven to be successful, if people have really, really enjoyed it, how do we continue to deliver it at lower cost, which usually means improving performance? One question I wanted to ask about the user feedback. Are you basing that strictly just on usage metrics, usually means improving performance. One question I wanted to ask about the user feedback, right?
Starting point is 00:13:25 Are you basing that strictly just on usage metrics, or are you also taking a look at some of the feeds that people might be posting on, a combination of things? What's the determining factor to see? And on top of that, how do you know or can you determine if people aren't using it and maybe say, well, maybe it's because it's a little too slow? A combination of things. Absolutely. So, you know, Facebook has a very large private cloud, like most large companies.
Starting point is 00:13:54 And so obviously everything is being monitored. And so what we can see is, let's say, hey, maybe people in country A are using this product less than people in country B. And you can say, here is a hypothesis. Maybe this is because, you know, the product is slower in country A than it is in country B. And then you can go and experiment. What happens if I speed up the product in country A? Does the usage go up? Or is the result, you know, is the fact that they actually use it less due to some other thing? But in general, you want to understand what is going on.
Starting point is 00:14:34 We monitor very closely how quickly things respond. We try to do everything we can to get as close to the user and to terminate our connection, what we say in POPs, in points of presence that is close to the user so that, you know, all of the HTTP, all of the TCP IP and all of the SSL and all of the HTTP handshaking and stuff is actually happening between a machine that is very close to the user and they don't have to go all the way to our data center so yeah we experiment on those things and luckily you know we have the ability to move things around that allows us to run these live experiments to understand you know what is going on you know form a
Starting point is 00:15:20 hypothesis do some measurements figure out how you're going to experiment then go ahead do the measurements and then draw your conclusions. But in general, how do we get feedback? Well, for one, we get all of the observed measures that we get everywhere. But also, I think one of the things that no Facebook engineer is ever short of is the user feedback. We get to hear every single time when we screw something up. You know, I will certainly hear from my friends in the feed and from just random people, you know, at a conference.
Starting point is 00:16:00 You hear the stuff. And I personally appreciate it. I, uh, I know that, you know, we are lucky to have users that, uh, are passionate about the product and as frustrating it is, it can be at the time, you know, because obviously we've gone through several, several, um, redesigns where people were really, really angry and you look at it and you go like, well, how can this be? I mean, we've just kind of made it better and we all believe it's better. And then you realize Facebook has become something so personal that the equivalent of us rearranging the page and making things go to different places could really be explained by me taking somebody's
Starting point is 00:16:42 wallet and rearranging all the things inside it to what I think is better. And people get, I know I would get really annoyed if somebody did it to me. And so I'm hoping we are getting better at these things. But at the same time, you have to continue experimenting. You have to continue getting better. Because one thing that happens in this field is that if you if you stay in one place um you're not going to be around for very long right and uh wow this is fascinating um so deploying fast fail quickly the when when you deploy a change uh i think we you mentioned
Starting point is 00:17:20 is also in new zealand uh and earlier before we started you said it's okay to tell you because most people are doing it that way um you are obviously it takes a while until the changes get propagated into all the servers but then you still leverage things like ap testing or can we release the book with the deployments yeah so one of the things that we can do let's say let's say there is a brand new product and we really don't know uh you know how it will be uh received by people and we would just like to get some preliminary feedback before we polish it um you know and and sort of decide do we want to go further with it or so on um i don't think it's a it's a big secret definitely it's not a secret in new zealand but pretty much all of the companies, um, in the technology feed will, will like, um, usually pushing things out in New Zealand
Starting point is 00:18:11 and getting early feedback over there. You have a relatively small English speaking country, uh, extremely friendly. Um, and, uh, and since they have been used to these kinds of things, doesn't matter which particular company is doing it. I mean, from, let's say from old guys like IBM, uh, all the way to the, to the newest, uh, startups, you can get incredibly valuable feedback. Um, and, uh, you know, you can then pivot and do whatever needs to be done before you deploy things to the other countries. Obviously being who we are and doing what we do, we have the ability to do the testing. So let's say I'm only interested in, uh, you know, what would, what would be the feedback of a particular age group or a particular country. We have the ability
Starting point is 00:18:58 to deploy only to those users. Um, and so, so we use those things as well to, you know, just evaluate how, how good the product is and how do we make it better. And, and does this, can I ask a quick question? Does this mean you have particular server groups for a particular geo of a particular age group,
Starting point is 00:19:18 or is this, you have to deploy it everywhere. Then you use feature flags that you're dynamically turn on and on, depending on, on the user. Feature flags. Feature user. Feature flags. Feature flags, yeah. Yeah, and occasionally when we misuse feature flags,
Starting point is 00:19:33 it ends up being a lot of fun in the company trying to figure out what just happened, right? But, you know, every once in a while, you know, we mess up and there is a report in the news saying Facebook's just launched such and such. And we'll be like, wait, how do these guys know? And then you realize, you know, we made a mistake on a feature flag or we didn't set the feature flag or something like that. I think the worst one and the most recent one and I, you know, but I'll mention it. We accidentally killed off half of the people on the planet. Oh, I remember hearing about that.
Starting point is 00:20:08 Oh, you didn't hear about that? No, I heard about that. I think I remember some people said I'm not dead. Including our CEO, right? Yeah. You know, so I think. It's the Paul is dead thing, right? Yeah, exactly.
Starting point is 00:20:19 And so imagine being the engineer who comes to work one day and you find out that you killed your CEO and half of the people on the planet. But I think one of the wonderful things about being at Facebook is, you know, you really don't go after the people. I couldn't even tell you who made that mistake. I mean, I know the team that was involved, but I don't know who the person is. And it's irrelevant. It's not that person's fault. It's the fault of the whole engineering organization that we have allowed for something like this to be possible to happen, right? You can't build systems where a small mistake by a human being that is under pressure,
Starting point is 00:20:56 under stress, working hard, can create something like this. And so we learn from it, and we create better systems. That almost goes back, ties back to what was that recent outage where somebody typed it? Sam Emerson. Yeah, right. So where somebody typed something in and a lot of the, there was a lot of people came to the defense of the engineer who typed that in saying, hey, who hasn't made a typo before? And why was that a manual process, right? So I think that ties directly back to that same concept. The other thing I wanted to mention is going back like five, six, seven years ago, before there were so many large-scale companies like this,
Starting point is 00:21:38 and obviously you have large amounts of servers and a very well-developed pipeline where there's checks and balances before things go to production. But I still remember back in the earlier days, and I shouldn't say earlier days, but back in the days when the joke was starting to come out of as, you know, production is the new QA, right? And it's right. But back, back then there were no gates, right? It was the things were getting thrown in. And I don't mean necessarily Facebook. I mean, a lot of companies were starting to get lazy and cutting out testing and just like, oh, we got to push this out. But now with the pipelines and the whole, all the gates that can be put in, and especially when you aware of that, you know, we know if there's been architectural regressions or any of these other things that have gone on in between before it gets out to production. And then you can do those nice, um, the tests, the New Zealand tests, there should be some kind of a name for that,
Starting point is 00:22:36 right? It sounds like. You know, hey, they are, they are a separate continent right now, right? So, uh, so, so we test on the eighth continent. But as I said, it was the same case. You know, I've spent some time at different companies before, and so it's kind of a standard practice. And as Andy can confirm, it was certainly not a surprise to anybody in New Zealand. They very well know that they get access. From their perspective, you know, they get access to all the code much faster than anybody else and they get to see things before anybody else does so that's kind of cool yeah and that's similar to where my my wife grew up in a town called bricktown new jersey and that was one of the experimental places for new products for at&t probably at&t labs and
Starting point is 00:23:21 stuff well not even that but this was just uh well at&&T Labs, I think that was more in North Jersey, but this was like new products would come out, right? And they would do test markets. It was Bricktown was always this test market. It's just this weird phenomenon. So, yeah. You know, you have to somehow find out what, you know, what people think, what should be changed. Because, Andy probably remembers, I've made this comment, you know, I've been at Google for five or six years,
Starting point is 00:23:44 five, five and a half years before Facebook. And I have like absolutely perfect track record of mispredicting the value, success or longevity of any product that we've launched during that time. Every single time. And I feel this is fantastic. Everybody's going to love it. You know, the product goes nowhere. And when I'm in there going like, who would care about this? Those things explode and end up being phenomenal. So,
Starting point is 00:24:11 you know, you have to accept that, uh, you know, part of the reason why I'm not, uh, I'm not a product person and why I'm really happy to, to listen to product people. And regardless of how much I disagree with some of their decisions, I also know I have a track record of really bad, bad predictions on these things. And so, you know, you got to listen to people who actually have a lot better understanding of what people want. But you also have to verify it. So in case something slips through and in case something goes wrong, do you, what's the mechanism at Facebook? Is it a roll forward? Is it a roll back? How does this work? Uh, varies from team to team. Uh, Facebook is fairly liberal when it comes to,
Starting point is 00:24:58 uh, how do you run your, your group? So, you know, people always ask me like, Hey, do you, do you do agile? And, you know, do you have a standups and so on? Uh, some things do some things don't, uh, I don't think anybody runs it according to, you know, the official agile, agile philosophy or so on. Um, but, uh, some things will just go, there are times when people will approach me and say, we are heading into SEV. Hey, we need help. We know we screwed this up. Could you help us out and loan us X number of machines for the next two weeks?
Starting point is 00:25:36 Because we really, really don't want to roll this back. And that is one of those situations where capacity and performance engineers kind of look at it and go like, OK, if it's the right thing for the company, uh, we'll go ahead and help you. But, you know, I always extract promise of payment, um, by saying, okay, but I want you to fix this and this and this, um, in the long run. And, and we put our own people to work with the team to just make the code, uh, be better. Um, when after three weeks, you know, everything is settled down. So that happens sometimes. In the other cases, the problem can be so bad that you just go like roll it back immediately, right?
Starting point is 00:26:14 And most of these teams will roll things only to a small subset of servers and figure out what is going on. And then you just roll it back and go like, oops, you know, we didn't catch this particular thing. Or if it's something that you realize, you know, I just messed this up. You will just roll forward and, uh, you know, just do the bug fix, uh, automatically. I mean, for years we used to roll our main, our, uh, front end code base on Tuesday. And then all of the other days of the week there would be a major release going on but it it tended to be mostly bug fix so we would be rolling forward wow and um so you just mentioned something earlier you said you know your code is always rolled out everywhere and then you turn on things with feature flex but now you just mentioned there are some teams that can deploy code to a certain set of servers that means there is is this then for a test environment or is this something where they can say please uh load
Starting point is 00:27:16 balancer put the traffic from this particular group to that particular set of machines uh so it's again we do all of it. So for example, we constantly do performance testing live in production in every single one of our, what we call front-end clusters. And think of front-end clusters as basically a large number of web servers receiving and terminating connections from the users and then sending requests to all of the various backends, packaging it and sending the response back to the user. So in each one of those, we actually tell load balancers to keep certain subsets of these machines at a much higher load. A long time ago, we've written a paper about it. I'm just trying to figure out and there is definitely
Starting point is 00:28:06 a note on facebook engineering page about it but the way we measure performance um for pretty much anything at facebook is uh we look at how much delay does linux scheduler induce from the point in time it received request to the point in time it was dispatched to one of the cores to be served. And so we refer to this as kernel or scheduler-induced delay. At the point in time when that reaches 100 milliseconds, we consider the machine loaded maximally. And at that point in time, that machine should not be taking any more traffic. Most production machines do not get anywhere close to that kind of a load. Because, you know, the reason is very simple. Once you start queuing things up and they end
Starting point is 00:28:57 up waiting before they are deployed to be executed on, you end up basically entering the realm of queuing theory and all of the calculations, everything else that you're doing, that you're optimizing is basically relevant. Only queuing theory matters. And so, however, we do have a subset of machines in each of our front-end clusters that we run at two different levels of schedule-induced delay, just so that we can extrapolate at any point in time and find out how much load could we throw here
Starting point is 00:29:30 and still be okay compared to the existing load. So that's one of the things that we run in production every day, all the time, and that is on the front end. Now, again, the various back ends and the various services have complete freedom and can figure out how they want to test and check things. You can imagine minor things, uh, may have just improved a little bit, um, or, or change a little bit in terms of, I don't know, search quality result here or there, um, that will go probably, uh, and be pushed to a subset of machines and then it's going to go live very
Starting point is 00:30:06 quickly but um alternatively hey i'm changing how i'm sharding my index right so i'm uh i'm changing how i'm storing all of this information all of these bits how it's going to be searched for well that cannot be pushed to a small subset of machine in production because that is a completely different system that's that's really a different back end and so it's going to be pushed to a back end it's going to be taking a certain amount of shadow traffic and it will be monitored for a while and looked at to understand you know how how does it, uh, one of the things that you will do is you will now turn it live. Uh, and obviously a subset of users will be going to this system and a subset will be going to the old systems. And then you'll find out, is there actually difference in user
Starting point is 00:30:57 behavior across these systems? And if there is, and if the, if it's clear that a new system has made things worse, you will pull it back out and try to understand why, what is going on. But these are the kinds of things, these are the kinds of tests that you really cannot do without live users. You just, most of the stuff that I really care about, I need to have live traffic. That's mind-boggling. I mean, even thinking of, you know, my background from sitting on the other side of the fence was load testing and just even trying to think of how you would run some kind of a test at scale that in that manner is, you know, kind of mind-boggling.
Starting point is 00:31:42 So, yeah, I can absolutely see the need for production. And I think it's, it's amazing too that you can simulate these little pieces or turn on little, uh, as you were mentioning the, um, the delay, yeah, the shadow traffic and the delay to, to, to see how that's going to impact, um, the performance. It's, it's just mind boggling to me. You know, it's, uh, so one of the things is i've been approached by vendors at
Starting point is 00:32:07 different uh events and they're like oh but we have this facebook traffic simulator and you can always replay the same traffic and and get and it's like that is the last thing i want to do what do i gain by replaying exactly the same traffic from a year ago and finding out that my code works really, really well, uh, based on the stuff that people used to do a year ago. How does that, you know, that, that information is utterly meaningless to me. All those pokes are going to bring you down, right? Exactly. Right. It's like, Oh, you know, the, because because and i'm using a year ago over here maybe five years or so on it's like there is so much new code um in here there is so much new functionality
Starting point is 00:32:52 that that people are doing different things um you know it's it always kind of uh perplexes me when when i try to explain why we use live traffic and I have people going like, well, that's terrible. And I'm thinking, you know, alternative is to say, OK, I'm not going to test at all. I don't have a choice. I have to use live traffic. And, you know, and so that's what we do. Obviously, also, we have a way that we can just send things to the Facebook employees. So if we really destroy something badly, we are the ones who will suffer first. Right. And so we use that method as well.
Starting point is 00:33:31 Again, um, pretty much anything we can to get the highest quality product to the users as quickly as possible, because one of the things that was really frustrating for me when I joined Facebook was this idea of, you know, quick. And it's like you can't get it done as well as I would like it to be quickly. And then you realize the alternative is that I take three or six months and 10 people working on a project. And we have now performance optimized this code so that it's really spectacularly beautiful. It has no, you know, absolutely no problems. It is completely NUMA aware.
Starting point is 00:34:11 There is no locks of any kind. There is no forced interrupts. It is just running beautifully. You pass it through all the analyzers and it goes like, this is fantastic. You release it and the users go like, yeah, no. Right. And so now you have wasted 10 people's time for three or six months, because let's face it, it's going to take some time. Um, and you have spent five, let's say person years, um,
Starting point is 00:34:40 on something that just gets discarded immediately. Like it is just a is just a bad approach. And as much as I wish we could spend more time perfecting things and making sure that they're absolutely spectacularly good, I realize that we live in different times. With software being in the cloud and with us not shipping the media to somebody, you have opportunity to fix things much faster. But more importantly, you have opportunity to stop working on things that don't matter quicker. And so you no longer have people spending years developing things that users really don't care about.
Starting point is 00:35:24 It's also a big concept that we've been we've been promoting for a while basically monitor monitor usage but also resource impact of features out there and then as you said you know kill those features that nobody needs improve those that are heavy hitters or like that people like but that have not the best performance. And I think you mentioned it earlier. Give developers a fix-it ticket, right? You know, you proved that this is valuable. Now I give you a fix-it ticket, which means I give you three weeks, six months,
Starting point is 00:35:55 however long you think it needs to make it efficient enough that we can actually make money off it and that it becomes less costly than we gain from it. So I think this is great. And metrics everywhere and based on metrics, make the right decisions. This is the key thing here. You know, it's interesting. I give occasional lectures and sometimes workshops on capacity and on performance,
Starting point is 00:36:25 and in particular in cloud. And I tell people, it always astounds me when people go like, well, you know, don't, don't talk about monitoring. That's okay. We will, you know, do this or that. And it's like monitoring. If there is one thing you have to do and you have to do it right, it's monitoring. And then the second thing you have to do right is monitoring. And then the third one is monitoring. And then you can add alerting if you want. But if you're not monitoring your production, like why, why even bother doing any kind of testing? I mean, you obviously don't care. Right. So it, it always surprises me how often I meet people from smaller groups from let's say startups and so on, and they just go like,
Starting point is 00:37:05 well, you know, we don't have any monitoring. And they would like some help. And it's like, you know what? If you don't have any monitoring, you really don't want any kind of advice from me because you just don't care, right? Because you cannot measure the impact you actually have with every action, right?
Starting point is 00:37:21 That's the problem. You don't know. And I think the key there is, too, is monitoring right, which is always the hardest thing. You don't know. And I think the key there is too, is monitoring right, which is always the hardest thing. I remember back, Andy, during Perform, while you were running around in your leader hose and we were talking to Josh McEntee at Pivotal
Starting point is 00:37:34 and he was talking similar, he was bringing up a similar concept of knowing the correct metrics and adapting and modifying your metrics as time goes on, throwing back all the way back into like the geocity days when people would put how many hits they had and hits being a metric that was used, which, you know, today we can look at and be like, well, that's pretty meaningless, right?
Starting point is 00:37:56 But that correct monitoring and knowing what to look at is like, you know, really the harder part. So I'm curious, do you have any pictures or videos of Andy and Lederhosen? Oh, there are. He was riding a bike, right? A red bike? No, that was Klaus. Oh, that was Klaus, that's right. I can't tell one Austrian from another, and I'm kidding.
Starting point is 00:38:21 Because I think we could try and make it viral video on Facebook. You i'm happy to share it make that one of the releases yeah so what we actually did uh we we staged out our own devops transformation within dynatrace on stage with our cto and founder in lederhosen our devops manager anita and she she was working at Dindlo. And I was also there in Dederhosen. And we were basically staging or playing the – we were playing it out, acting it out, why we as Dindlo made our big transformation. And I think part of that I also presented in New Zealand, the idea of the UFO that came up, highlighting quality, using our own tools when we develop our own tool. Because I think you made a great statement in New Zealand, and you said something like you always get approached from these vendors, and then they always tell you something that they believe helps you.
Starting point is 00:39:26 But most of these vendors don't even use their own tools to prove that it's actually useful. And I think we as Dynatrace, we are in a fortunate situation, obviously, to build software for software companies. And so if we cannot make use out of it, then we're obviously building something wrong. And I think, yeah, that's... Oh, absolutely. And I've told you, one of the things that truly boggles my mind, and I mean, you know, I'm from the Balkans, right? So it's not like we are the most diplomatically correct people on the planet or anything of the sort.
Starting point is 00:39:57 But, you know, people approach me, and that's part of the reason why I could never work in sales. But, and they're going like, here's this great thing I have and you should buy it. And I always wonder about it. Like, shouldn't you start by asking me, Hey, do I have a problem? And if I do what that problem is and then figuring out how to help solve it, because all of these salespeople seem to be solving their own problem, which is, I need to get my sales quota up. They don't even understand what I'm dealing with. And it's like, this is really cute that you have that problem, but it's not my problem. And it's not something I'm interested in at all. Uh, but you know, when I found out that, that people have a tool that their own company is not using, I don't know if you could have a worse situation. It's sort of like, so it's not
Starting point is 00:40:41 really good enough for your developers, but you want me to give you money so I could test it for you. Well, there is a great plan, you know, where do I sign? Um, but you know, that's actually an interesting concept. If, uh, for anybody who gets approached by any vendor, first question to ask is, well, how do you use it within your organization? Absolutely. You know, and what problem are you solving? Right. Um, and, uh, you know, and when they try and tell me, it's like, but I don't have that problem. Um, then they keep saying, but this solves that problem really great. And it's like, I don't have that problem, but you know, I understand, uh, I understand what their job is.
Starting point is 00:41:22 And, you know, if I'm at a conference, I'm always happy to talk with people and, you know, share feedback and give them hints. One of the things that we find, and again, Facebook is not the first company that I've been in that is very large. Most of the tools work reasonably well for medium size and probably some large companies as well. They end up really breaking at the size that we are, right? And so whenever we hear complaints about why does Facebook have to develop all of its own tools, and we obviously open source all of our stuff, it's fundamentally when you have 1.8 billion monthly active users, there really aren't a lot of off-the-shelf tools that you can install and expect them to work
Starting point is 00:42:05 at that size. It's just not going to happen. And if you have to maintain things, it is a lot easier to maintain things that you have code for. It was phenomenal listening to, understanding the scope of how Facebook actually works and how big you actually are. I mean, I think we kind of knew, but actually remembering that it's not only Facebook, but it's WhatsApp, Instagram, Messenger. What I really liked are two things that you said. I mean, I liked a lot of stuff, but two things that strike me a lot is if somebody proposes an architectural change or a new feature,
Starting point is 00:42:40 then you come in and help them review because you want to make sure if this thing goes really off that they did something very stupid. I like that a lot. So performance requirements, architectural and scalability requirements, being part of the requirements definition. I like that. And the second thing that I just love, which I will hopefully more and more companies get to do, is continuous experimentation. Deploy fast, figure it out, throw it away if nobody needs it. Don't bother with optimizing it. But if it is a great feature, people like it, then optimize it.
Starting point is 00:43:16 You know, the only thing that I would say you absolutely cannot fix once it's deployed is bad architecture. It is. And, and, you know, same as anything else. Now, somebody who's listening will say, well, that's not the case. Here is this one situation. Sure. But if you deploy bad architecture and you deploy it to a large number of people, uh, it becomes incredibly expensive and difficult to fix that. So you're better off starting with the right architecture, you know, so simple things, right? What, what should be, you know, if you think about data, right, what is the hot data, hot data belongs into RAM, warm data belongs on flash, cold data can be on the disk, right? Uh, you know, if you, if you just start with the idea that all
Starting point is 00:44:02 of your data is going to be in the RAM well, that's a disaster at our scale. I will not be able to support your product. There is just not enough RAM to purchase for something like that. Nor would I ever actually attempt to do something like that. Instead, I'm going to sit down with you and come up with a solution that, again, does multiple layers of caching and, and we figure out what needs to be rare. Uh, those are the kinds of things that I think need to be fixed before you start coding. The other thing that I'm sure no other company has ever, uh, run into problem, like, uh, when you have five solutions for exactly the same problem. And so a new person comes to the company
Starting point is 00:44:43 and, uh, they can't find what they're searching for. And then they go like, I will develop my own, I don't know, key value store or pub subsystem and so on. And so when they come to me and say, and then this, I go like, no, you will use one of the existing 10 data stores that we already have. And so you don't have to maintain the code.
Starting point is 00:45:02 And if there is something that is missing in those, then we will upgrade one of those. But I'm not going to be paying for yet another, you know, ecosystem of your choice. As I said, PubSub, key value pairs. I'm sure every single company has five solutions of at least one different basic computer science problem. Brian, let's conclude this session. Any final thoughts from your side on this one?
Starting point is 00:45:28 Yeah, I guess I'm thinking more of the on a higher level of, you know, if Facebook can do it on this scale, you really have, you know, no excuse to get through. You can't get through whatever you're getting through, right? Because this is like uncharted
Starting point is 00:45:44 territory where, you know, as Greg was mentioning, having to Because as, as, as we talked about maintaining or creating realistic kind of load models is very, very difficult in any situation. And, and just for the fact of new features coming out all the time, there's really no way to stay on top of that. And, and it's just makes complete sense. sense. So, you know, for anybody who's kind of like a hater on it, I think it makes absolute sense, especially with the scale that Facebook's operating at. Now, maybe at smaller scale kind of operations where maybe A-B testing is not really an option and other kinds of components like that, where those restrictions are in place, you might have to do those, you know do those models and run it that way.
Starting point is 00:46:46 But whenever you have that opportunity, I think it should definitely be leveraged. And I think that's really, really great. So that's all I have for me. We will be back in a part two with Grinka and Bjerovna. I got that name wrong. I totally blew it. It's okay, Andreas. Andy can pronounce it no problem. Andy, why don't you
Starting point is 00:47:08 give the outro on this then? So I don't have to ask myself. No, it's okay. It's okay. We'll be back in a couple of minutes, hopefully, with the next episode.
Starting point is 00:47:16 With who, Andy? With Goranka Biedov. There you go. How was that? Good. All right. Thank you, everybody. We'll be back in a moment.
Starting point is 00:47:25 Or really, I guess there's no moments because this isn't radio. So we'll be back in a click. How's that? Thank you all for listening. Thank you.
