Software at Scale - Software at Scale 1 - Alexey Ivanov: Principal Engineer, Infrastructure at Dropbox

Episode Date: December 6, 2020

Welcome to the first Software at Scale podcast. This episode contains an interview with Alexey Ivanov, Principal Engineer, Infrastructure at Dropbox. The motivation for yet another software podcast is to let software builders share technical decisions, opinions, and stories in an informal way. Personal blogs and corporate engineering blogs are extremely informative, but often require high activation energy to be published. This podcast instead tries to replicate bar conversations with grizzled senior engineers reminiscing about horrors of their systems and what they've learnt over the years.

In this podcast, we discuss object storage, load balancing, build systems like Bazel, Nginx/Envoy config management, monoliths, services, gRPC, and more.

Highlights

Notes are italicized.

0:00 - Introduction
2:20 - Experience working on Object Storage at Yandex in 2012
3:55 - LevelDB wasn't efficient enough for avatar storage, presumably due to record size. The failure mode was memory consumption, and it didn't seem to work well on spinning drives at the time, so they built a custom storage backend.
5:55 - RocksDB/WiredTiger might be more appropriate for such a use case today. In general, today, it makes sense to take off the shelf components, unless it involves the core of the business and requires innovation. Other examples - Figma and browser based multiplayer design, and Dropbox with Magic Pocket.
8:55 - Experience working on Server Team at Dropbox in 2015. Teams were fairly broad (Server Team, Client Team), and a new Systems Engineering team was created as a lower layer for Server Team that focused on the edge network and runtime concepts like service discovery. Service Discovery at Dropbox today is fairly sophisticated.
10:45 - Dropbox's stack in 2015. Stateless systems weren't as mature as the stateful ones, and there might have been a little duct tape involved. David Mah's talk on securing user data at SRECon is worth listening to.
12:53 - Initial, DNS based service discovery, and Nginx config management and generation via Python and Jinja2.
15:09 - A DNS outage story.
17:30 - Monoliths aren't that bad, and many successful businesses start off with monoliths.
17:51 - Dropbox doesn't use the term "microservice", neither does it encourage too many tiny services. Services shouldn't be too small or big.
20:10 - How to reasonably manage configs for Envoy - learnings from Nginx config generation. Object Oriented Python that generates a protobuf config helps with a declarative and reusable config format. Materializing to protobuf helps avoid a lot of bugs, like "no" in YAML.
23:05 - "Config languages eventually converge to Turing Complete languages". Pick the appropriate language based on the engineer who will need to edit these the most. Concretely - Python for Traffic engineers for configurability, a stripped down YAML format for product engineers that need to perform small scoped tasks like adding new routes.
24:50 - What sparked the Nginx to Envoy migration? The killer feature was the community and the ability to participate in the development process. Shout out to Matt Klein, the leader of the Envoy project, for fostering an inclusive community.
28:30 - Envoy aligns with the Google way of development that Dropbox adopted - it works well with gRPC and builds with Bazel. gRPC at Dropbox; Bazel at Dropbox.
29:10 - Initially, Bazel was an unpopular decision at Dropbox, but it ended up as one of the best decisions made. The hermeticity guarantees and the build graph are extremely useful features for incremental builds and tests, tracking down dependencies and keeping deployments (and the deployment system) simple. "Bazel is like a sewer. You get out of it what you put into it" - Mike Solomon
32:10 - How should someone decide whether Bazel is the right choice for their company? It's definitely not for everyone. When it starts becoming painful to manage multi language builds, it might become worth it.
36:00 - The "Google" way of development (monorepo, one build system) is very different from the "Amazon" style of team independence in all software decisions. Which approach is better, and why? The Google style seems to work well for midsize companies that don't have infinite resources to paper over the inefficiencies and duplication in the Amazon development style. Somewhat alternative opinion: the Google Platform Rant.
40:20 - Why did Dropbox decide to build their own monitoring system as recently as 2019 instead of using something off the shelf? The answer is mostly cost efficiency. The volume of metrics logged would make an external solution prohibitive. Magic Pocket probably logs a lot of metrics.
44:00 - What's a project that you're most proud of? The transition from systems engineers just managing Nginx configs to rolling out Dropbox's edge network and the first ten Points of Presence was awesome. That work hit the sweet spot of technical innovation and direct, measurable improvement for users.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Yay! Hi, Alexey. Thank you for joining me for this first-ever edition of the Software at Scale podcast. Thank you. Thank you for inviting me. Of course. A quick background on you is you've been working at Dropbox for roughly five years in various parts of our infrastructure, like the traffic layer, managing our MySQL fleet and infrastructure in general. And before that, you were working at Yandex on object storage and at LinkedIn on various parts of the traffic stack? So, yeah, I did lots of things.
I've worn many different hats in my life and I hope to wear even more. Generally, yeah, way back in Yandex, I was doing system engineering for like migration from FreeBSD to Linux for our search engine. I was working then later on backend for the object storage. That is quite interesting. I always oscillated like between a SWE, software engineering and like system administration,
network engineering, database administration. So basically each time I kind of broadened my view of the world a bit. What was the scale of Yandex at that time? If you can share, what kind of numbers were you looking at? How many requests per second or queries per second? And for those of you who might not know, Yandex is like Russia's Google. It's like search engine and a bunch of stuff. Let me think.
I can't give exact numbers. It was likely many tens of thousands of boxes just for the search engine and millions and millions of QPS. There was a lot at that point. Basically, it was the busiest site in Russia and a bunch of countries nearby. So that was a lot. And you said you worked on an object storage over there. What is that supposed to do? What is it meant for? So nowadays, storage engines are... There are good off-the-shelf storage engines and databases for storing data. And even services that actually provide that service for a fee. But back in 2010, there was not much.
Starting point is 00:02:43 It wasn't a commodity that you can buy from any cloud provider. And in fact, we were the cloud provider where people wanted to put their photos on. And also, we were very, very, very poor at that point. Specifically, our department was very unsubsidized. So we needed something dirt cheap, and there wasn't anything off the shelf. Even LevelDB at that point, for example, as a storage engine, was quite inefficient.
Starting point is 00:03:13 And that's what eventually led to Rog's DB and its dominance over the market because, again, LevelDB was not that good. At that time, we were developing our own object storage for the needs of the Yandex as a service provider. So we needed a very cheap storage that would run good on very, very cheap hardware that would constantly fail, constantly corrupt data. So we needed to detect that and actually run as normal on not very reliable hardware. That's interesting because when you read about LevelDB, and LevelDB is released by Google, or at least the initial version was the underlying storage engine behind some of their larger
Starting point is 00:04:02 systems like Bigtable. So it's interesting to hear that it wasn't efficient enough. Do you remember? I know it's been a while, but do you remember what exactly was not good enough with it? It wasn't good enough for our use case. Maybe the record size we used it for was not entirely optimized. Like for small databases, level DB would be fine. We were storing images at that point. So it was lots and lots of medium-sized records. So at that point, that was avatar storage. Our biggest pain at that point, yeah, was avatar storage, then
Starting point is 00:04:47 image storage. And nowadays, I think that thing involved into fully-fledged object storage at Yandex that they actually sell, basically, and that's free API-compatible thing. Okay, interesting. And like... Yeah, the level DB failed on, I think, memory consumption and it wasn't working. Even though it was log structured merge database, it didn't work quite well on the on spinning drives. Okay. Which is strange.
Starting point is 00:05:23 Yeah, that is interesting. And if you had to make that decision today, would you probably go with ROCKSDB? It depends on the use case. If it's an append mostly storage, I think LSM is fine, likely. Yeah, since the beauty of using over-the-shelf things and generally industry is that industry will push itself forward while using in-house thing,
Starting point is 00:05:54 it means that we would need to actually invest in it over and over and over again for it to move forward with the industry, either faster than industry or catching up with the industry. So like on the broader side of things, right now I would most likely pick something off the shelf, be it like WireTiger or ROGSDB, like likely ROGSDB. So I'm not sure how good it is for general purpose blob storage, where blobs get larger and larger over time. So yeah, I think that makes sense. You said that you would rather pick something off the shelf since you'd get this continuous growth of features or growth in scalability, rather than just build something in-house and try to keep it up, even if it's hyper-optimized for your use case? It makes sense in some cases.
Starting point is 00:06:52 When we are talking specifically about the core value of the company or of some organization within the company, it makes sense to actually invest there. If you can actually innovate, if you can spend these resources by providing value to your customers, be it like the upstream customers of your organization or your actual direct customers of your startup or company, whatever. Like yeah, investing into home brew that is highly optimized. For example, in Dropbox, we did exactly the same. We created Magic Pocket that was highly optimized hardware
Starting point is 00:07:32 slash software for our specific use case. And something that we couldn't do is we couldn't get this kind of efficiency by just using AWS at the time. So it makes sense when it's in the core of your business. When it's like a small thing, like when there is so many layers of separation between what you actually want to do, like give customers a very simple object storage in 2020 and like storage
Starting point is 00:08:04 engine for efficient storage nowadays probably start with off the shelf definitely and then kind of see where you can optimize it and maybe you can just contribute some of your fixes to rog's db and make it faster for everyone yeah yeah i think i think that example made perfect sense right like if it's related to the core value of your company and you can get a lot of efficiency gains as Dropbox presumably did with Magic Pocket. That provides an excellent segue to your current job, which is also at Dropbox. What did you work on then when you first started? I saw your experience.
Starting point is 00:08:42 It said you worked on Traffic Team when it was barely like a couple of people or were you the person who started Traffic Team at Dropbox? What was that like? Oh, in Dropbox? At that point, there were not many teams at Dropbox. Like, if you look at old team names, they were like the server team, which is basically everything on the backend, client team, everything on the desktop application. So the teams were quite small. At that point, we just factored out system engineering from like the rest of the backend. So we can focus on like system-ish stuff that are slightly lower layer. And I was basically the Nginx guy. Okay, he is Russian, he has Russian accent, he can probably configure Nginx and indeed I could. So we started with just
Starting point is 00:09:35 as a team of one, as people who actually can configure Nginx to the team who actually build a app in our maximum. We actually owned, we were around 10 people, we've built a whole Dropbox H network, so points of presence around the world. We covered almost every continent. We grew from 20 Nginx servers to thousands of them. We started owning our service discovery at some point. Great people joined the team and started just killing... Ruslan, I hope you will have... Someday we'll have a podcast with him. Started just killing all the runtime stuff. So service discovery, gRPC, all that stuff.
Starting point is 00:10:33 So, yeah, but if you can describe a little more, like what was the stack when you just joined? Like how was Dropbox being served as much as you can talk about it, like in 2015? Like what was going on behind the scenes with all the duct tape? Was Dropbox being served as much as you can talk about it in 2015? What was going on behind the scenes with all the duct tape? I don't think we have enough duct tape on that podcast to describe that. Let me think.
So there were a couple of good things and lots of very legacy stuff. We were on Ubuntu 12 at that point. Some of our configs were just rsynced to our servers. Some of the binaries were basically rsynced. That was the method of deploy. Lots of deployments were based on Chef. I'm not sure that's a deployment. It's probably a yoloiment or whatever. We can only push forward.
Starting point is 00:11:36 It wasn't all that bad, don't get me wrong. Things like that would only work on stateless parts. In places were stateful, like databases or magic pocket. It was actually quite good. We used to be very, very paranoid on the, and still are by the way, very paranoid on stateful system side. Ma has, David Ma has a couple of presentations, so on SREcon, for example,
Starting point is 00:12:08 about how we are protecting users' data. And that was, is, and will always be the paramount of any cloud service that actually stores users' data, including Dropbox. So yeah, there were lots of fun stuff, especially around Nginx and load balancing. How deep technical do you want to go?
Starting point is 00:12:33 As deep as you want. Oh my God, that's my favorite. So, okay, let's go deeper. So Nginx, the configuration was written by hand. The load balancing was based on DNS with VVRP for fault tolerance. So basically we had... And also keep alive D and hijacking of ARP requests. So basically once one host dies, another host in the same rack, same rack is very important
Starting point is 00:13:08 because you need to be in the same L2 domain, notices that it's dead and hijacks its IP address. Very, very robust. Basically rate 10 out of all these servers. What else was that? Yeah, the first thing we did was stop writing Nginx configs by hand and actually wrote some basic config generators. Lots of ginger Python. I think we mentioned it in the blog post about Nginx to envoy migration. It worked great compared to actually writing them by hand
Starting point is 00:13:46 uh configs by hand but after like five years it started to show its age um so started generating config files then uh switched to actual l4 load balancing so instead of using that ARP IP address hijacking on the Ethernet layer, we actually moved to a separate layer of proxies that would load balance connections and just pack IP packets into IP packets and send them through that tunnel to the backends. So without terminating TCP, without terminating HTTP or TLS, just IP packets and send them through that tunnel to the backends. So without terminating TCP, without terminating HTTP or TLS, just rub the packet, send it somewhere else. Then we were able to better load balance stuff.
Starting point is 00:14:36 And we were actually able to scale our front ends without any dependence on DNS, for example, which was terrible. If you see closely, I think we still have domains like API zero zero, API zero one. So we used to have a load balancing on both DNS level and application layer in some point. It was fun.
Starting point is 00:15:03 Yeah, that makes me think, are there any fun, I should put fun in quotes, like air quotes, like fun outages you remember from this setup? So over the years, Traffic Team started owning
Starting point is 00:15:17 a lot of things around our production. And because we already owned external DNS, it was thought that it was okay for us also on internal dns and i think that the first day we started owning internal dns and we had a major outage on the dns side because someone decided that they don't need the dns cache and this because it's not actually doing anything a Spoiler alert, it was doing a lot. So the first day we've got all our DNS infrastructure, someone did remove all the
Starting point is 00:15:51 caching from DNS layer and we were running around trying to understand how to stop that thing. It wasn't particularly interesting, but still quite representative of Dropbox at that time. There were not many metrics, very high velocity on stateless parts of the system, but oh my God, we really, really didn't have a lot of metric. It was all gut feeling based. And yeah, it's important to note at that time,
Starting point is 00:16:27 like it was like a different era. Now there's so many things that are like open source, like Kubernetes and Prometheus and all of these different systems. And then you can also use like a SaaS product, like SignalFX or like Datadog. But back in the day, there were like barely any options for people to use like right out of the gate, right?
Starting point is 00:16:43 And Dropbox is a pretty old company. It started in 2007. We're still running the monolith Python web app from those days. It's been a long time. And we kind of evolved in this time when it wasn't super easy to plug and play. And I think a startup today probably
Starting point is 00:17:02 has a lot of different options. How should I say it? They do, but most of them still end up with the same monolith, just on a more modern stack, maybe, not JS, packed in the container and running on the Kubernetes. But at the end of the day, all your business logic is in the monolith. At some point, just because the truth is monoliths are not bad. They're quite fast to develop, quite fast to iterate. And any successful startup by the end of the day
Starting point is 00:17:38 will likely end up with a monolith. And that's a good thing. And they will eventually split it and here is also critical parts we don't call our services microservices we actually call them services specifically because we don't have uh don't want to have like an infinite like mesh of everything talking to everything um and squared complexity on each layer. Basically, we don't want that. Our services are not micro.
Starting point is 00:18:11 They're actually quite medium in size. At the same time, we don't want to have that monolith that we used to have forever. Yeah. So the term microservice doesn't exist too much, or people don't talk too much about it at Dropbox. Yep, it's just a service in medium size, not too big, not too small. Of course, we sometimes mess up with the definition of how big or how small something should be.
Starting point is 00:18:38 But eventually things kind of settle at the proper size sizing like teams will get merged or split if if they start to move slower than they should or start breaking more than they should it doesn't matter whether these are services or functions in the code you don't want your code to be in a way where each and every like function is a single line and everything calls everything. And then you have hotspots that call 20 different functions and very sparse spots when you have functions that only do one thing, one tiny thing in return. So there should be balance just with the code and so is like the software architecture. That balance usually comes from periodic reevaluation
Starting point is 00:19:34 and just holistic approach to some things because otherwise, just like with our config generators which I'm partially to blame for Nginx, they did involved into some monstrosities. When we started writing Envoy config generation, we thought, oh my god, it's like, it's so much simpler right now. But we very quickly understood that if without proper governance, that will eventually lead to the same situation with like over the course of five to
Starting point is 00:20:04 seven years, it will be exactly the same as N like over the course of five to seven years it will be exactly the same as Nginx if we don't put some good practices in place. Yeah so yeah I mean can you talk about a little bit about that governance right how do you govern like what did you do differently when you started with Envoy config? Oh that one is probably not the best question for me. I was mostly providing guidance of how to not do things. The general problem was that there were too many ways of how you can do things. Because we had Ginger, we had the YAML, we had Python. All of them allowed some kind of logic and templating.
Starting point is 00:20:49 In YAML, you can use anchors. In Ginger, you can use for loops, logic, macros, even like lots of plugins that can do pretty much anything. It's too incomplete at that point. And you have Python that kind of processes all of that. So when you want to change something in that config generation, if you have enough time, you can, of course, think how to best put and rework existing code.
Starting point is 00:21:22 But usually you just, okay okay i need a function oh where should i put it oh seems like ginger is a good place to put it or i will filter some data in python and then pass it to ginger or maybe i will do just couple of anchors and uh in yaml and it would be fine so that that flexibility of doing stuff uh in different ways actually hurts. So in Envoy, we re-engineered it in a foolproof way. We have protobufs and Python, and that's pretty much all of it. And we also tried to make Python more object-oriented instead of a function, colon, function, colon, function. So more functional, more object oriented actually helped a lot. It forced us to put proper abstractions. Inability to use any logic and protobuf definition also helped a lot. OK, so just so that you can clarify my understanding,
Starting point is 00:22:28 in order to generate Envoy configs, you run a Python script where people can configure these different options on your particular domain name. And these are the routes that map to that domain name. So they define it in Python functions. and you run a big like Python script that runs all of these functions in some way and it generates the config that Onway needs on startup. Yeah, basically the output config is, you can mostly consider it as a protobuf. It's not entirely a protobuf, but basically our config
Starting point is 00:23:05 language became Python since we've noticed that pretty much all config languages eventually converge to some fully Turing complete thing. So we just start with it. Maybe we would do stuff differently if we thought that our main users are not us, but someone else, then we would probably add some kind of DSL on top of it. I think we did that for route generation, when we actually give that. We don't want to give too much flexibility to our end users.
Starting point is 00:23:41 Otherwise, again, stuff will go terribly wrong very, very fast. Yeah. So, yeah, like your end users in a sense are product engineers at Dropbox who are defining new routes. And there's a lot of engineers. So you don't want them to have complete control over this domain name, don't apply any security on it. So you want to have some kind of limits and guardrails. Yeah, of course, it can be engineered around by giving people only a fixed set of classes with a very defined public method. But again, if you're going that way, maybe YAML is good enough at that point. For us, since we develop a lot internally and only our team
Starting point is 00:24:28 touches some parts of the configuration, we decided, okay, let's just use Python, but we will be responsible for keeping these interfaces clean. Got it. That makes sense. And yeah, then just a step back, like what, at what point did you think, or did somebody propose that, you know, Nginx is kind of outdated for Dropbox or you don't need it anymore? So what sparked the whole Nginx to Envoy migration? So I've looked at Envoy some time ago. At that point, like probably two and a half years or so. Like somewhere when, yeah, I think we met with Matt on one of the NGINX conferences.
Starting point is 00:25:14 Talked a lot. At that point, Envoy was not ready. At some point, though, when Google started to put like infinite amount of resources, it was ready, but I did not know that. I did not notice that I fully missed that part. If not for other engineer, namely Ruslan, coming to me and saying that he can actually make it work. I would probably not,
Starting point is 00:25:38 I wouldn't take another look at it for, probably maybe right now I would start thinking about that. Okay. But yeah, the trigger was fully external to myself because I've looked at it and it was bad. It only supported a very limited subset of features, but what helped it a lot is community. So when Google came to Envoy,
Starting point is 00:26:07 they were able to change things. And that's a shout out to Matt. He is fully responsible for this. He built a really great community. And when you come to Nginx, it's way harder to push patches through Nginx code. It's very stable, but it comes at the cost of inability to change things. For example, one of examples is for example, gRPC. So gRPC, we waited for it for quite a while for Nginx and Envoy was added fairly quickly by Google people. Okay, so one of the killer features of Envoy was automatically GRPC support, in a sense.
Starting point is 00:26:52 I would say the killer feature of Envoy for us was the ability to change things. OK. In Nginx, we couldn't do that. So ability to modify and change it, and be actually participants in the development process. It's what actually sold us. If not for that, it would be a bit hard. If it would be impossible to push patches in either Nginx nor Envoy.
Starting point is 00:27:25 It wouldn't matter that much in which of these services we don't have gRPC support. Yeah. Yeah, so I think that makes sense. Like, even some of these larger decisions that seem purely technical in a sense, like, a lot of stuff goes on behind the scenes, like how the community is like,
Starting point is 00:27:47 how easy it is to submit new patches. Is the development model completely open source? And this is a great example of a lot of non-technical reasons for picking one product versus another. Yeah, but back to the technology part, we've mentioned lots of reasons about why we, in the blog post, we mentioned a lot about why did we pick Envoy. It's by no means a bugless or the most performant thing or the, like, the best thing for your specific startup or app.
Starting point is 00:28:30 It was just very convenient for us because how much we aligned with Google way of developing things. Like having your PC support as a first class citizens is very valuable for us, it may be totally irrelevant for any other startup out there. Having Bazel support again, it was a great win for us. We used Bazel as our build system of choice, which gives us hermetic incremental builds and tests. And oh, my God, it is good.
Starting point is 00:29:02 I think it's one of the best things that happened to dropbox infrastructure everyone hates it at first it costs a lot and how mike solomon our engineer mentioned once basil is like a sewer you get out of it what you put into it so we've put a lot of effort into basil it works beautifully again one of of the best things I've touched in my engineering career. So can you talk a little bit about what it was like before Bazel? Like what is the so-called sewer? What was it like? Oh, my God. So before Bazel.
Starting point is 00:29:43 So do you remember that R-sync I was talking about? Yeah. Yeah, that was basically it. So, oh, my God, it was bad. Config were rsynced, binaries were rsynced in append only storage. So what else? Deployments were Chef based or at that point,
Starting point is 00:30:11 Puppet based, I think. Many of them. Other things just, oh my God, build scripts for Python were also quite non-existent, I would even say. So it was very ad hoc, like very ad hoc. Just thinking about it gives me PTSD. It was so much dependent on the system that migration from Ubuntu 12 to 16 was like a major pain in the ass.
Starting point is 00:30:51 In terms of sustainability, like moving fast was very good for early Dropbox, but in terms of sustainability, it was terrible. Each time we needed to change something in the operating system, so many unrelated things broke. We have everything built by Bazel in fully isolated build environment, run in a fully isolated runtime environment. So there is no dependence on the system. In theory, if you wanted to go from Ubuntu 16 to Ubuntu 20, that should be no op from application standpoint, except for the kernel parts.
Starting point is 00:31:23 So that's basically the only dependency that is left on the system. We don't even depend on Lipsy or basically anything besides the kernel. And that speeds you up in a lot of infrastructure projects because you don't have to just think about so many things. Yeah, nowadays our deployment system
Starting point is 00:31:43 is basically we build Bazel. We test with Bazel. We build with Bazel. We pack it into SquashFS, torrent the result to a set of servers, then start up the binary that Bazel produced, like as simple as possible. Yeah. And SquashFS is like this package like, it's like this read-only FS where you can just put a bunch of files in. So if I was like somebody in a, let's say in a startup or like a midsize company, and if I asked you like, should I use Bazel for my company? How would you go about like answering that question? What would you need from me? And how, how, how would you say I should make that decision? While you are on a single language,
Starting point is 00:32:34 for example, while you are just writing go, no front-end, just go back end, Bazel is not needed, whatever it is. Like if you use like node for back end and front end, basically, if you use JavaScript, then probably Bazel is also not needed. Once you get into two, three, four, five languages in your repo and you start having like dependencies between repositories and languages and assets
Starting point is 00:33:12 and artifacts and especially when you then need to integration test all of that stuff in a reproducible environment. Once you start implicitly, once you start having outages because your things actually are not hermetic and depend on something outside of their execution or test environment, at that point, something like Bazel needs to be there. Maybe some other common build system and test system. But just look at it from a software engineering standpoint. Basically Bazel gives you that view of all the dependency graph of everything in your
Starting point is 00:34:00 repository. You can always say what depends on what. All the transitive dependencies are also there. And that dependency graphs gives you a lot besides just ability to build things. It gives you ability to test things incrementally. You change one thing, you know exactly what you need to test. You know if you have a bug, that thing gives you an ability to understand what packages have that bug and what you need to redeploy.
Starting point is 00:34:37 Having the full graph of your build also allows you to apply modifications on that graph. Once it's programmatic, you can say, okay, I want to add what's in Bazel called aspect, modify a graph and create some modification of that subgraph. So it's a bit more abstract in words, but in practice, it allows you to automatically create tests, for example, or automatically generate code. For example, you create a protobuf, you can actually generate code for it, et cetera, et cetera. Like there are so many usages of that very simple concept of what depends on what. I'm not talking about even office service use cases,
Starting point is 00:35:34 like when you just need to understand what your code depends on or security purposes, like, okay, we have CV in that library. What should we repush? Like there are so many immediate things and programmatic things like again, that dynamic draft modification on the fly. I want to talk about one point which you said earlier,
Starting point is 00:35:58 which is Dropbox has followed the Google model of building and developing and deploying things. And I think Bazel is one part of that. I wonder like, if you know, whether this was a deliberate decision and, or in just general, you've thought about there's this one way of doing it, which is like the Google Facebook where there's one build system for everything. And there's also like this Amazon style, do whatever you want.
Starting point is 00:36:24 Every team or every two teams have their own service, and everybody should interact with... Every team's stuff should interact with other teams' stuff through services. And it was service-oriented architecture from day one, but not super prescriptive about how to build those services.
Starting point is 00:36:40 I just wonder, have you thought about that trade-off, and what are your thoughts on it? For now, I think I like the Dropbox approach a lot. Mostly, I did think about it. And it sounds very appeasing to just use RPCs between the teams. And within the team, you can do whatever you want. It probably works great on either very small companies or very large ones. I'm not sure it works
Starting point is 00:37:14 like when you have infinite resources it probably works okay, okay-ish. You're still wasting a lot of time doing that. Most of the time, projects are very fast to build. To build something, to create a new entity, a new service in the organization, it's quite fast. But at the same time, once you've built it, you need to support it forever. And at that point, like cost which you pay by supporting a thing is way greater than the cost you've spent to actually create it. So I'm not sure how important it is to create things fast in a way that like, I think it's more important how much it costs you to support things. Of course, it goes both ways. For example, in our Bazel environment, if you need to change Bazel itself, you need to make sure that everything builds with new Bazel, for example.
Starting point is 00:38:15 Or when you update a GML or some other common library, you need to be sure that everything that has tests actually passes with a new library. But I think there are workarounds around that, and it's a way better paradigm when you don't have infinite resources. And even Google that arguably has infinite resources went with that common infrastructure approach. And what's even more, they actually create a product out of it. Nowadays, you can actually, in GCP, you can buy, if I'm not mistaken, a Bazel build farm. So not only they've open sourced Bazel itself,
Starting point is 00:39:03 so everyone can build Google open source, but also they've productionized parts of their infra, and now you can buy Bazel build farms and probably test farms on GCP. Yeah, they have this remote build service, which I think you can just run Bazel build locally and it goes to their servers and magically does stuff. Yeah. I like
Starting point is 00:39:31 I wish we would do more of that when we can actually factor out some pieces of our infrastructure and actually sell it. Because we have good products. I mentioned service discovery. Of course, it's I've mentioned service discovery. Of course, it's impossible to sell service discovery,
Starting point is 00:39:49 but we can at least open source it and get some people on that, arguably better service discovery than what's available, and give us some engineering power. I wouldn't say for free, but at least we will also get some visibility in the open source community. But there are also things like our monitoring system. We had a couple of blog posts about our monitoring system, which is great. Mostly compatible with Prometheus.
Starting point is 00:40:21 I don't see why we shouldn't just at least open source it. It's a great thing. Why not productionize it also? Yeah. Yeah. So do you have any insight? I know you didn't work on any of these things. Why Dropbox went with making its own monitoring system instead of using something externally.
Starting point is 00:40:48 Why didn't we do that? I think we tried a couple of times. So we started pretty much any startup in 2007 with a fully homegrown monitoring system because there was nothing on the market. And in the hidden side, that was a sign that if something is not on the market, that probably something will appear, or at least it's a pretty good market at that point. Once companies started to appear, I think people from Facebook created their monitoring as a service. DataDog, New Relic, they all started
Starting point is 00:41:31 to appear. But at that point, we have so much data in our metric that it was pretty much impossible for us to buy anything. It was just cost a lot. Our monitoring system is very efficient. Like, it's... yeah, I'm not sure I can say numbers, but oh, my God, it's good. So migrating away from it would be impossible unless these kind of metric storage, span storage for tracing and like logs, structured log storage will become dirt cheap. So once it becomes commodity that every cloud has, then we can probably migrate. Yeah. Being mostly compatible with Prometheus actually helps us a lot here.
Starting point is 00:42:20 We can actually direct the map Prometheus concept to our own, plus add some on top of that. Okay. So one kind of thing I'm hearing is that even though the cloud has commoditized a lot of things like servers, all of these next generation, like one layer up of a monitoring system, it's still not cost effective enough for a midge-sized company like Dropbox to use. That's what it looks like.
Starting point is 00:42:49 It heavily depends on the number of metrics that you have. But yeah, we've got very spoiled by... It's basically a feedback loop. Our metrics collection was so cheap that we would add a lot of metrics that would actually make it impossible for us to actually migrate off it. But technology moves forward. So like everything, if you can see, you see the technology trends. At some point, every small company had like, small to medium-sized company had a rack of servers
Starting point is 00:43:32 in some data center. Then we eventually had cloud infrastructure. Then we had services in the cloud. So instead of like deploying MySQL on a server in the cloud, you would deploy it on... You would just use RDS, for example. Then we went one layer higher. It was Aurora. Now you don't need to build your own replication.
Starting point is 00:43:55 It's out there, done for you. Same with compute. Basically, we used to have servers crunching numbers. Nowadays we have lambdas. With a metrics collection, I think it will go the same way. Basically pretty much everything moves along the axis of evolution towards commodity, if it's actually useful. Yes, at some point electricity was, oh my God,
Starting point is 00:44:20 so innovative that we didn't have it. And eventually everything moves to commodity if it's used. Yeah, so it's just a matter of time until it makes sense to maybe use something that's like an offering because it'll be commoditized. Cool. Yeah. I want to ask you like a couple of like lighter questions.
Starting point is 00:44:43 What are some of the projects that you worked on at any company which ask you like a couple of like lighter questions just what is some of the the projects that you worked on at any company which you're like most proud of which you think of as you know i'm so happy i worked on that or it's seen a good result or a good outcome oh good result. I, let me think. So there are a couple of things that come to mind. Again, during my career, I've oscillated between software engineering and what's called SRE nowadays. So I've heard many different hats. And it's very hard to find it.
Starting point is 00:45:37 Because I had very interesting things on the network engineering stuff, on database stuff, on software engineering, and on SRE slash traffic slash anything. I think in terms of drive and actual, when I just came to Dropbox and we started building Edge network, the most exciting thing I think was the moment when I joined Dropbox and we went from just people managing Nginx to running, building our points of presence around the world. Like first five to 10 points of presence were just awesome. We were just killing it. We basically removed all the blockers starting from like hardware load balancer, we replaced
Starting point is 00:46:21 them with like in kernel IPVS based load balancer, then went further with help of Nikita that joined from Facebook to move on further into XDPBPF. So there was a constant drive when we would just add new pop, improve the efficiency, add new pop, improve the efficiency. We started playing with congestion control like that. That was the most innovative time when we could actually put our effort into things that matter directly to the customer. And each new thing that we did gave quite a bit of performance to them. That was just awesome.
Starting point is 00:47:03 The same cost that was also lower in our bill that we paid. So like a perfect balance of innovation, user impact, how people felt the performance of Dropbox improving with each new point of presence. Oh my God. Nowadays, of course, we don't get that. Each new point of presence is like 10 milliseconds. But previously, it was 30, like double the speed, et cetera.
Starting point is 00:47:34 What exactly is a point of presence? Is it just like a bunch of servers in different parts of the world to do things like edge termination? The answer is yes, but we actually worked very hard so it would be as simple as possible. Like the service we put in relatively cheap. They are very uniform, there are no hardware load balancers, there are very simple networking equipment. So we work very hard for the point of presence to be basically a software defined edge for the Dropbox. So we have Linux over there and it makes all the magic of making communication
Starting point is 00:48:29 to user very fast. And at the same time, our network costs very efficient. And at the same time, we don't need any hardware load balancers or hard, basically it's all used to be Nginx plus IPVS. Nowadays it's XDP and EPPF and Envoy. Cool and yeah thank you so much for taking the time for the first ever edition of the Software at Scale podcast. It was like a pleasure talking to you and I hope you had fun as well. And let's catch up on work slack at some point. Yeah, I reminisced a lot. I will have PTSD. I will actually need to go and drink right now to wash off that feeling in my head.
Starting point is 00:49:20 Anyway, so that's it.
