Software at Scale 15 - Ben Sigelman: CEO, Lightstep

Episode Date: April 4, 2021

Ben Sigelman is the CEO and Co-Founder of Lightstep, a DevOps observability platform. He was the co-creator of Dapper, Google's distributed tracing system, and Monarch, an in-memory time-series database for metrics. He is also a co-creator of the OpenTelemetry and OpenTracing standards. We spent this episode discussing Dapper and Monarch: their design, rollout, and lessons learned in practice.

Transcript

[Intro] [00:00]: Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening.

Utsav Shah: Hey Ben, welcome to another episode of the Software at Scale podcast. Could you tell our guests a bit about your story? There's so much in your background that is interesting to me, right from starting off at Google to creating Lightstep.

Ben: Sure. Thanks for having me, I'm excited to be here. I don't know whether my background is interesting or not - to me it's kind of boring - but yeah. I graduated from college right in the thick of the dot-com bust, in the 2003 era, and I was very fortunate to get an offer to work at Google at the time. When I went over there, they actually put me on some stuff in the ad system that was incredibly boring, to be honest with you. Ads make a lot of money at Google, of course, but this particular part of the ad system wasn't making any money, so it was kind of boring, not very lucrative for Google, and I didn't like it very much. The way I got into Dapper and distributed tracing was actually incredibly arbitrary, but it's a funny story. They had this one-time event you could opt in to - I don't know what they called it, but it was a program where they would take everyone who opted in and look at a bunch of different dimensions: how long you'd been out of school, what office you worked in, what languages you worked in, where you were in the org chart, that kind of stuff. I think they had 10 dimensions. Then they found the person who also opted into this program who was literally the furthest from you in this 10-dimensional space, set up a half-hour meeting with no agenda, and that was it. So I was working on this stuff in ads that was, as I was saying, totally pointless, and they paired me up with this woman named Sharon Pearl, a very distinguished researcher who had come over from Digital Equipment's research lab when it fizzled out after the merger in the late nineties. She and some of the other old guard at Google were doing all the really cool systems stuff. She asked me what I was doing - I didn't want to talk about it - so, what are you doing? And she went through this list of really interesting systems projects. One of them was kind of a predecessor to S3, a blob storage system. There was some NLP thing she was working on. And in this list was this prototype of a distributed tracing system called Dapper that never really saw the light of day - it was just kind of an idea - and she described it to me. I thought it sounded incredibly useful and really fun. My manager at the time had 150 direct reports - direct reports, not a broader org - so he had no idea what I was doing, obviously. How could you? So I just started working on it, basically.

Utsav Shah: Did you switch teams or anything?
Ben: Well, Google famously had this 20% program, so it was kind of that type of thing, but I really liked it and I thought it was quite valuable, actually. I moved to New York for personal reasons and just started working on Dapper full-time. My manager in New York also had like a hundred direct reports, so he also had no idea what I was doing, and I got it to the point where it was in production and actually solving problems pretty quickly - I can get into why it was possible to do that so quickly if you want. I got hooked on that stuff and I really haven't looked back. That was early 2005, and now, sixteen years later, I'm still basically working in that same overall space: how do you observe complex distributed systems, and what kind of improvements can you make to the software engineering process if you're able to observe them effectively? After working on Dapper for a while, I just wanted to do something different, so I went and did a couple of systems projects that really didn't work and that aren't well known, because they were failures - I'm happy to talk about those too if you want - but I eventually found my way over to Monarch. We set out to create a multi-tenant, high-availability time series database, basically; in terms of the open source world, probably the closest parallel would be M3 or something like that. I ended up working on that for about three or four years and then left Google and started a social media company that, as a product, was a complete failure. About a year into it I realized it was never going to work and abandoned the product, but I realized I enjoyed being an entrepreneur. I wouldn't even say we pivoted - pivot's the wrong word, because pivot implies that you keep one foot in the same place. I just started playing a different sport, but with the same investors, and that's actually Lightstep. Lightstep was founded as a social media company in 2013, and a year and a half in I completely changed what I was doing, added some co-founders at that point, and here we are six years after that. I'm still working on building stuff and really enjoying it.

Utsav Shah: Yeah, I think that is a super interesting background. I guess the [05:00] first question on that is: was that the era when Larry Page or whoever decided that there's no need for managers, and that's why they just fired all of them?

Ben: I don't think it's so much that they fired them, but they would just hire a lot of engineers and not hire managers to go along with them.

Utsav Shah: Yeah.

Ben: I think there was this idea that management was bad in some capacity, and I understand where they were coming from. I definitely don't agree - I think good management is actually one of the most incredibly supportive things you can possibly have in an organization - but I think a lot of the people who had come to believe that were just coming from really bad management. Certainly bad management is worse than no management, but good management is better than all of it.

Utsav Shah: Right.

Ben: The other thing that was interesting about Google at the time was that they were growing so quickly that if you didn't like what you were doing, you only had to wait a couple of months and some new person would take over. So that papers over a lot of issues. I think there was a belief that Google had solved the management problem through software or something like that.
That was another thing: there was a belief that by writing internal software systems to do a lot of the blocking and tackling that managers might do - and they certainly had tech leads, who served a managerial purpose in dividing up work - they had solved that issue. But there's a law of large numbers: even though Google has been very successful, at some point they had to grow more slowly, and when that started to happen, the need for managers became much more obvious. Sure enough, at this point I don't know what the ratio is, but I'm sure it's not 150 to one anymore, so they realized they needed to correct that. It was liberating in the sense that you could do whatever you wanted, but I think it was pretty disorganized and not very efficient.

Utsav Shah: Yeah. Could you talk about the architecture? You said you worked on some part of the ad system that wasn't particularly interesting. From my understanding - and I could be completely wrong about this - there was one monolith, the Google web server, GWS, and not that many services around it. Is that roughly accurate? Because I'm also thinking, why did Dapper make sense if it's just one large server? But I guess that's clearly wrong.

Ben: Yeah, I don't think that's correct. Certainly if you go back far enough, into 1999 or something, it was probably true, but by the time I showed up, we didn't call them microservices, but they absolutely were. And I would say that the best - probably the only good - reason to adopt microservices goes back to management: it's difficult to get more than 15 or 20 engineers to work on anything efficiently in a single code base that's deployed as a single unit. So microservices serve a purpose from a software development management standpoint, where you can create a unit of deployment. But microservices at Google were much more about horizontal scaling, and that was a necessary thing - they had throughput that required that kind of horizontal scaling. I remember when we turned Dapper on in production - we'd never really been able to visualize this before - a cache miss in Google web search would certainly hit GWS, the web server you're referring to, at the top of the stack, but by the time it got down to the bottom of the stack, from the front-end load balancers to the final thing that would actually look through some index on disk, it was 10 or 20 levels of depth. So yeah, it was definitely quite distributed, and there was also huge fan-out. Oftentimes a parent would have parallelized requests to 30 or 50 or a hundred children that held different parts of the index, and so you had tail latency issues that were really scary, and stuff like that. So it was quite distributed, especially on the web search side, early on. There were other parts of the system - like the front end of the ad system that merchants would actually use - that were basically a database and a Java web server. So it wasn't everything, but for the high-throughput, low-latency stuff it was pretty distributed from early on.

Utsav Shah: Interesting. And just out of curiosity, did Google prioritize consistency or availability? Because of that large fan-out, I'm assuming availability, and it just dropped the data coming from a few shards if they were too slow, or something.
Ben: Yeah, I don't think there's one answer to that, but Jeff Dean did a talk - the slides are online - at Berkeley around 2010 or 2012, a really good talk where he discussed a lot of the techniques they would use, depending on the situation, to deal with tail latency. I understand you're referring to the CAP theorem, but another trade-off we had to wrestle with a lot was basically cost, or efficiency, versus latency. [10:00] We would often end up with something that was more expensive in order to put a tighter bound on latencies. So if you had three copies of some service, you'd send the request to two of them in parallel and just take the first one that came back, in order to manage high-latency outliers and things like that. But I don't think there's a single answer from an availability-versus-consistency standpoint; it really depends on the business requirements.

Utsav Shah: Yeah, I've seen the "Tail at Scale" stuff, so that might be what you're referring to. That's interesting. And you turned on Dapper in 2005, you said - what was the immediate engineering impact? Engineers at Google were your customers, so what was their reaction, and did you see some immediate changes from releasing it and showing it to people?

Ben: That's a great question, and one of the most interesting things about Dapper is that when we first got it out there - well, at Google - it was definitely not something where everyone said, oh my God, this is incredible, and told their office mates about it; it was nothing like that. In fact, I would basically go and find a tech lead for, you name it - Gmail, web search, anything that was operating at scale and had a lot of services - and I would kind of beg them to meet me, and then I would show up in their office. The UI, admittedly, was terrible, but it was still good enough to be useful. I would show them some traces, and they would always be like, wow, this is actually really interesting, I didn't know this. They would often explore it with me and we'd find something that was troublesome and novel to them, so they would get something that was interesting to them, and sometimes they would go and fix that issue. But it wasn't like it took off - we had our own dashboards to track Dapper's activity, and it really didn't get a lot of use. It did generate a lot of value in the sense that we were able to highlight the outliers, understand where that latency was coming from, and make some substantial optimizations. But it was very much a special-purpose tool used by experts doing performance analysis in the steady state. That was what it was primarily used for initially, and there are some technical reasons for why that was the case. But if you were to think of it from a product standpoint, the issue was that we weren't integrated into the tools that people were already using. And that is still the number one problem with the sophisticated side of the observability spectrum: the insights that are generated are genuinely useful and insightful, even self-explanatory when you put them in front of someone, but people simply are not going to find them themselves unless they're integrated into the tools they're already using. It's still, I think, the number one barrier to value in observability: it's just not integrated into the daily-habit tools, whatever those may be.
At some point we did make a change. Josh McDonald, who actually still works with me at Lightstep and who was working on Dapper in 2005 as well, eventually made a change to Stubby, which is the internal name for what's essentially gRPC - and in particular to this library called requestz, which was used to look at active requests going through the process - to basically cordon off the requests that had a Dapper trace with sampling set to true. So you could go to any process - people were already using this requestz thing all the time to see requests going through their service, and it kept a cache of slow requests from the last hour or whatever at different latencies - and we added a little table of requests that had Dapper traces, where you could click on the link and go directly to the trace. Then it was something people were already using, and the number of people that used Dapper - I don't remember exactly, but it must have been like a 20x improvement when we released that. It was a huge change. And the lesson is that Dapper didn't get any more powerful when we did that; it just got a lot easier to access. So it's all about being in the context of the workflow. Similarly, there was a guy, Jonathan - an incredibly smart person, much smarter than I am, that's for sure - who ended up really pressing us to build a bulk data API to run MapReduces and things like that over the Dapper data. He was in charge of something called TeraGoogle, which was actually the largest part of Google's index, but also the least frequently accessed. It's a very complicated system, the way that it worked - I won't go into it, just because we don't have time and I don't know if I'm allowed to talk about it, but suffice it to say it was really complicated. He did some fascinating work to understand the critical path of that system and made some really substantial improvements as a result of it. So there were people like that who made these big improvements, but there's a big difference between delivering quote-unquote business value to Google, usually in the form of latency reductions, and having a lot of daily activity - and the daily activity really didn't come until we integrated into these everyday tools. [15:00] I think that was one of the most important lessons from the Dapper stuff: cool technology really is not enough to get retention from engineers who are busy doing other things.

Utsav Shah: That's super interesting. I think I'd heard the term TeraGoogle maybe five years ago when I interned there, and I think I finally learned what it meant - I'm sure I forgot it within three months. And requestz - it seems like a front end for visualizing a request's context, in a sense. Is that an accurate way of phrasing it? And why did engineers use requestz? That's something I'm curious about now.

Ben: Well, for different things. What was particularly nice - requestz is also known as rpcz; the page contained a few things, but this is the part we're really talking about - was that it was basically just a table. The table would have a row for every RPC method in your Stubby service, your gRPC service essentially, so each row is a different method. Okay, fine, that's simple enough. And then the columns were basically different latency buckets.
So you'd have requests that took less than 10 microseconds, less than a hundred microseconds, less than one millisecond, less than 10, et cetera, all the way up to things that took longer than 10 seconds. And you could examine a very detailed micro-log of what took place during that request - you can think of it as a little snippet of logs that pertained to that request and only that request. And then, as I was saying, if the thing was traced, you could link off to the distributed version of it and see the full context. The thing that was particularly powerful, though, is that it had one special column for requests that were still in flight and had been taking a really long time. So you could have a request that was stuck, and you were trying to debug it in a live incident, and you could inspect the logs just for the requests that were stuck - usually because of, let's say, a mutex lock that was under contention, and the request was stuck waiting on it. You could go and see that exact thing had happened. There was a lot of really clever stuff in the implementation to defer the evaluation of any of the strings in the logs until someone was looking at them, so you could afford to be extremely verbose with the logging, and you only evaluated the logs when an actual human being was sitting there hitting refresh in the browser, looking at it - unlike most logging frameworks, where that's not generally how it works. So there was a lot of pretty complicated reference counting and things like that. But the upshot was that if you were having an issue, you could figure out that you were blocked on this Bigtable tablet server, and it was this particular mutex lock that was contended, and it was being contended by these other transactions, which you could look at to figure out where the contention came from. Being able to pivot like that in real time was pretty powerful, and having that linked into Dapper to understand the context was also pretty powerful. I don't know if that makes sense, but it was one of those tools that I really haven't seen since. I think OpenCensus and z-pages have some of that functionality, but it doesn't really make sense unless everything is using that little micro-logging framework, and I just haven't seen that outside of Google or Google open source. So I still miss it; it was a really useful piece of technology.
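To make the shape of that tool concrete, here is a minimal Go sketch of the idea Ben describes: one row per RPC method, one column per latency bucket, a small ring of recent samples per bucket, and log lines stored as closures so the strings are only rendered when someone actually looks at the page. All names and details are invented for illustration; this is not Google's requestz/rpcz code.

```go
package rpcz

import (
	"sync"
	"time"
)

// Upper bounds for the latency columns: <10us, <100us, <1ms, ..., >=10s.
var bucketBounds = []time.Duration{
	10 * time.Microsecond, 100 * time.Microsecond,
	time.Millisecond, 10 * time.Millisecond, 100 * time.Millisecond,
	time.Second, 10 * time.Second,
}

// Sample is the per-request "micro log". Log lines are closures so the
// formatting work is deferred until a human views the page.
type Sample struct {
	Start    time.Time
	Latency  time.Duration
	TraceID  string          // non-empty if the tracer sampled this request
	LogLines []func() string // lazily evaluated log messages
}

// methodRow is one row of the table: a tiny ring of finished samples per
// latency column, plus the special column of requests still in flight.
type methodRow struct {
	buckets  [][]Sample
	inFlight map[uint64]*Sample
}

// Registry is the whole table, keyed by RPC method name.
type Registry struct {
	mu   sync.Mutex
	rows map[string]*methodRow
}

// Finish files a completed request under the latency column it falls into,
// keeping only the most recent few samples per column.
func (r *Registry) Finish(method string, s Sample) {
	r.mu.Lock()
	defer r.mu.Unlock()
	row := r.row(method)
	col := 0
	for col < len(bucketBounds)-1 && s.Latency > bucketBounds[col] {
		col++
	}
	row.buckets[col] = append(row.buckets[col], s)
	if n := len(row.buckets[col]); n > 8 {
		row.buckets[col] = row.buckets[col][n-8:] // keep a small ring
	}
}

func (r *Registry) row(method string) *methodRow {
	if r.rows == nil {
		r.rows = map[string]*methodRow{}
	}
	row, ok := r.rows[method]
	if !ok {
		row = &methodRow{
			buckets:  make([][]Sample, len(bucketBounds)),
			inFlight: map[uint64]*Sample{},
		}
		r.rows[method] = row
	}
	return row
}
```

The two details that carry most of the value in Ben's telling are the deferred log formatting, which makes verbose logging nearly free until viewed, and the in-flight column, which is what lets you inspect a stuck request during a live incident.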
Utsav Shah: No, I think that's amazing given this was so long ago. It makes me think, taking a step back: maybe five or ten years ago, internal tools at Google were probably better than the development experience externally, right? You got so much stuff for free - you talk about Blaze and all of these different tools - but things have evolved a lot recently. It seems like there are so many startups coming up with different things, and even Datadog and the like are fairly mature now, so you get a lot for free from them. Would you say that the development environment externally is probably better than anything Google can offer, in the sense that you get the holistic experience now? Or do you think there are still things, as you said with requestz, some functionality that's just missing because of the lack of consistency, and you have to integrate a million different things and maintain this Rube Goldberg set of integrations to get a similar development experience?

Ben: Yeah, that's a good question. It's really hard to compare the inside-Google and outside-Google experience, and it's not that Google was all better - a lot of stuff at Google was actually really annoying. I was just talking to someone about this yesterday: the trouble with Google was that everything had to operate at Google scale, and there's this idea, [20:00] which is totally false in my mind, that things that operate at higher scale are better. They're usually not; there's a natural trade-off between the scale that something can operate at and the feature set. So a lot of the stuff we had at Google was actually pretty feature-poor compared to what you can use right now in open source; the one thing it really had going for it is that it scaled incredibly well. The exceptions were mostly areas where having a monorepo, with almost no inconsistency in terms of the way things are built, gives you some leverage - and there are a lot of examples of that: requestz is a fine example, and Dapper is actually a fine example too. The instrumentation for Dapper, to get that thing most of the way there, was a couple thousand lines of code for all of Google - whereas, just look at the scope of OpenTelemetry or something to get a sense of how much effort is required to get that sort of thing to happen in the broader ecosystem. So they had this lever around consistency. A lot of the tooling at Google - it's not that it was bad, it was very scalable, but it didn't have a lot of the features we would expect from tooling outside. I'd also say that in my seven years at Google working on infrastructure and observability, I never had a designer on staff and I barely ever had a PM, and it really showed. Having worked now with really talented designers - not just people who can make UIs look nice, but who really think about design with a capital D - it's a completely different ball game in terms of how discoverable the value and the feature set are. So the Google technology often lacked that sort of polish, and I think there are many different vendors out in the world right now that have built things that are much easier for an inexperienced user to consume; even if the technology is equivalent, the user experience is not. That's another area where what we had at Google is unfortunately a pretty far cry from what you can get now just by buying SaaS.

Utsav Shah: Yeah. I've thought about that - if people ran infrastructure teams like a product, like you're trying to sell each piece of your infrastructure to potential buyers, you would create a better product, because you'd have to think about user experience and about customers actually getting value. Trying to make that happen is not the easiest thing in the world, but that makes sense. When you released Dapper, did you have any sampling at all? I'm assuming yes, because it wouldn't have just worked otherwise.

Ben: Yeah, sampling is an interesting topic.
There are a lot of places you can perform sampling in a tracing system, and Dapper performed it in almost every one of them. Dapper had pretty aggressive sampling: we started with one in 1024, so that was the base sampling rate. Initially we wrote the data just to local disk, to log files, and we deployed a daemon that ran on every host at Google - by the way, if you ever want to jump through some hoops, try to deploy a new piece of software that runs as root on every machine; that was a real nightmare from a process standpoint. Anyway, this daemon would sit there, scan the log files, and basically do a binary search any time someone was looking for a trace. That's what I started with, and that was honestly a terrible way to build the system - really bad. So eventually we moved to a model where we would centralize the data somewhere, for all the reasons you might imagine, but it turned out that the network cost of centralizing that data, even after the one-in-a-thousand cut, was substantial, and the storage costs were also really substantial. So we did another one-in-ten on top of the one-in-a-thousand - we were doing roughly one-in-ten-thousand sampling, randomly, before the data got to the central store that was used for things like MapReduces and so on. And that pretty much means you can never use Dapper for all sorts of applications. For web search it was fine, because - I don't remember the number, but it was on the order of a million queries per second, so fine. But something like Checkout, where people are actually buying stuff, is intrinsically a much lower-throughput service whose transactions are actually more valuable, so you're getting cut both ways. We didn't have a dynamic sampling mechanism for that while I was working there - people could adjust the sampling rates themselves, but they usually didn't. So the technology was really not that useful except for the high-throughput services, where that sampling wasn't a complete deal breaker. I think with Lightstep, and with other systems that have been written in the last couple of years, there's a recognition that sampling really serves a couple of purposes. One is to protect the system from itself: you don't want to have [25:00] an observer effect and actually create latency through tracing. With Dapper we had that issue, because we wrote to local disk - we were basically entering the kernel, at least on disk flush - and for hosts that were doing a lot of disk activity, we could actually create latency with high sampling percentages. But there's no need to do that; you can just flush the stuff over the network, especially now - that was 2005, and networks are a lot faster now. You can actually get the data out of the process without sampling in almost all situations. There are probably outlier cases here or there where that's not true, but overall there's no issue with flushing all the data out of the process. And then you just need to decide how much you're willing to spend on network and how much you're willing to spend on storage, and that's a whole set of other constraints.
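For concreteness, here is a minimal Go sketch of the two-stage sampling Ben describes: a head-based decision at the root of the trace, and a second cut before traces are shipped to the central store. The rates and structure are illustrative only, not the actual Dapper implementation.

```go
package sampling

import "math/rand"

// Two-stage sampling along the lines Ben describes for Dapper: an aggressive
// head-based decision when the trace starts (about 1 in 1024), and a second
// 1-in-10 cut before traces are shipped to the central store, so only roughly
// 1 in ~10,000 traces is kept long term.

// SampleAtRoot decides whether a new trace is recorded at all. The decision is
// made once, at the root, and propagated in the trace context so every service
// in the request path agrees on it.
func SampleAtRoot() bool {
	return rand.Float64() < 1.0/1024
}

// KeepForCentralStore is the second cut, applied to traces that were recorded
// locally, deciding which of them get shipped over the network and stored.
// It is a deterministic function of the trace ID rather than a coin flip, so
// all spans of a given trace get the same verdict on every host.
func KeepForCentralStore(traceID uint64) bool {
	return traceID%10 == 0
}
```

Making the second stage a deterministic function of the trace ID, rather than another random draw, is what keeps a given trace either wholly present or wholly absent in the central store.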
The other thing that I have come to recognize is that long-term storage is quite cheap. The networking costs over the lifetime of that data end up being almost as expensive, or in many cases more expensive, than storing it for a year. So if you can find some way to push the storage closer to the application itself, even if it's just in the same physical building or availability zone, that's a pretty big win as well. A lot of the work we've done at Lightstep is trying to take advantage of that - trying to be on the right side of those cost curves in terms of where we actually handle the high-throughput data and where we do the sampling and things like that.

Utsav Shah: This reminds me of how Monarch is designed - I was just reading the paper before this. It's the same idea, where you flush to something in the local data center or the local availability zone, and then finally, when you query, you're getting so much less data that you can do that once and ask questions of multiple regions. Is that roughly accurate?

Ben: Yeah, I think there are definitely some similarities in Monarch. I have to say, if we could go back in time, I would have pushed back harder on some of the requirements that were put on us. I don't think we did the wrong thing given the requirements that were handed down, but the requirement that we depend on almost nothing except, essentially, the physical machines - that's kind of a misnomer, but the fact that we weren't allowed to take advantage of Google's other infrastructure beyond the scheduling system and the kernel and things like that - really limited what we could do. And when you also pile on some other requirements around performance and availability, you're kind of forced to store everything in memory, and we did. The paper talks about the number of tasks - which are basically virtual machines - that Monarch consumes in the steady state, and if I remember correctly the paper's number is something like 250,000 VMs in steady state. That is just an extraordinarily expensive system right there. A VM of course is not the same size as a physical machine, but it's a lot, and that's not even counting the VMs being used for durable storage and long-term storage of the data, wherever they're putting that stuff in Google's longer-term storage systems. Just a tremendously expensive system, and that's not a good thing, and I'm not convinced it's the right approach. With some of the work we've been doing lately at Lightstep, we basically had to write our own time series database from scratch, and rather than trying to re-implement what we did with Monarch, I think a lot of the lessons we've learned is that there are ways to do it that are far more efficient without really paying a penalty in terms of performance. I remember that we felt like we had no choice but to do everything in memory - there are some similar systems at Facebook, like the Gorilla system, that I think ended up making the same decision at about the same time. Maybe it was because flash wasn't quite commodity at that point, so we felt like it was physical spinning disk or memory. Now of course there are some interesting alternatives, but that was expensive. It's a cool system and it's very powerful, but awfully expensive.

Utsav Shah: Yeah. Just for listeners: Monarch is a monitoring system. You can think of it as providing the same end interface as Prometheus, but it's designed in a very different way internally, including for the user.
I think the configuration system is different, but it serves the same purpose. And it did replace Borgmon, the original monitoring system, which engineers had to deploy for themselves, whereas Monarch was like a SaaS service in the sense that you just had to add your metrics and things would work automatically. Is that a good summary?

Ben: I think that's exactly right about what Monarch is. The Prometheus comparison is a little funny, though: Prometheus is architecturally much more similar to Borgmon than to Monarch, importantly. Borgmon had a lot of issues that Prometheus, I think, has improved upon, and they were self-inflicted. To actually use Borgmon to monitor your system, [30:00] you had to use not one but five different, totally-unique-to-Google domain-specific languages.

Utsav Shah: PSM.

Ben: Yeah. All of which were totally arcane, if you want my honest opinion, and had lots of gotchas. For instance - sorry, this is a rant - if you wanted to do arithmetic, which of course is something you'll want to do when you're writing queries, you could use the minus operator, no surprise. But if you had variables that you were subtracting and you didn't separate them with spaces, it allowed hyphens in variable names and it would just silently fail - it would just substitute a zero for that expression - and crazy stuff like that. And of course, since it was a handwritten DSL that wasn't particularly well documented or maintained, there really wasn't staffing to improve that. There was definitely a period at Google where it was kind of the norm, any time you ran into a new problem, to write some kind of language - some of these languages in the Borgmon universe were pretty small, to be fair, but the point I'm making is that they each had their own grammar and their own rules, and most people basically just copy-pasted someone else's config in order to hit their launch criteria. So there wasn't a lot of thought and care being given to writing maintainable code, and it definitely is code. If I remember correctly, Gmail's configurations - which, admittedly, were generated programmatically - were like 50,000 lines of Borgmon config, and it was totally inscrutable. So there was a lot of frustration about that kind of stuff. Whereas Prometheus, I think, is far more sensible. I could critique this or that, but it generally makes sense; I get it. And I don't want to sound overly critical - this is not a good or a bad thing, it's just recognizing that every system is designed for a certain set of problems. But Prometheus did inherit one of the things that was most problematic from my standpoint, which is that it wasn't really designed for distributed query evaluation. You can kind of do it, but you have to manually shard the thing yourself, and that's a very difficult thing to maintain - to do all the rebalancing and things like that. And I think the initial effort at Google was similar - it wasn't Prometheus, but it was almost like: let's fix Borgmon by building a new system that has the same scaling characteristics, but with one language instead of five, improvements to this or that, a better internal time series store, things like that. But it was still basically the same architecture.
And my recollection is that this guy Alan Donovan - another person who's a lot smarter than I am, a really clever person - was working on this stuff at the time, and I think his observation was: if we're saying that Borgmon has tons of issues, how could it be the right thing to architect a new system where the block diagram is exactly the same, but each block is just better? Shouldn't we be thinking about this a little more holistically and really examine the problems that people are having? And I think when we did that, we realized that the number one problem - the one causing a lot of the other weird contortions people were going through - was the fact that they had to manually shard and balance this thing, and that distributed query evaluation was kind of a hack. So the thing that made Monarch so interesting, and also so difficult, was that it really was horizontally scalable, and users did not need to worry about where their data was being balanced. It was also multitenant from day one, which allowed a central team to run it for all of Google instead of repeating that effort with every team running their own little cluster. That made the design much harder, but ultimately, I think, more robust. I'm not knowledgeable enough about Prometheus to know how much effort would have to go into making it really do that - I've seen that Thanos has added some functionality like this - but the query evaluation piece, really pushing the evaluation down and making sure that you do as much of the aggregation as you can at the lowest level and only then bring things back up, has a lot of subtleties that I think we felt we had to build into the design pretty early. That's the thing we were really trying to escape from with Borgmon: a design that made it difficult to do distributed query evaluation, and therefore difficult to handle really large datasets that don't fit on a single machine, because that was the underlying pain point in Borgmon that led to a lot of the other pain points.

Utsav Shah: That is super interesting. The name Alan Donovan - I think I've seen it in Bazel git logs. He wrote Starlark for Go, and I think I might be in that Google group.

Ben: Yes, that's right. I think he also wrote one of the official Go programming language books. He has a languages background - really nice person, very intelligent guy. I credit him with sort of forcing us to step back and really think about what problem we were solving with Monarch. And yeah, that was really fun; I loved building that system. That was probably my happiest time at Google, the summer we were prototyping it - it was an amazing team that moved very quickly. It was a lot of fun.

Utsav Shah [35:00]: Can you talk more about the distributed query evaluation? I don't fully get why it's problematic. Let me explain what I think, and you can tell me where I'm wrong: what you're saying is that with Borgmon, query evaluation mostly happened at the higher layers, where... actually, I guess if you could just explain it to me, because I don't fully grasp it - what exactly is the difference in architecture?

Ben: Yeah, I probably wasn't being clear, so that totally makes sense. Let's take a simple query: you want to understand the ratio of your error rate to your total request rate across your application, and you want to group it by RPC method.
Now let's assume that the amount of data you have for all of these time series is large. To put this in context, some of Gmail's metrics were distribution-valued - the actual value type was a histogram - and a single metric, with all its cardinality, turned into 250 million time series in the steady state. So there's a very high-cardinality surface area that we were trying to aggregate over. The problem when you try to do that sort of error-ratio query is that it's a join: you have two queries, an error-count query and a total-count query, and you have to create buckets for each of these RPC methods - just doing a group-by - and then within each of those you have to do a bunch of math. One option is to have a query evaluator at the top of the stack that just talks to all of the leaf nodes - in Monarch we called them leaves - and says to each leaf, okay, give me all the data you have for this particular metric; they stream the data back and you do the math. It turns out the data size is large enough that if you do that, your query evaluation times move into the tens of seconds, or minutes in some cases, which was kind of a non-starter. So instead, what you'd like to do is say, okay, fine, we'll compute this at the leaves. But the problem is - and this is the most important point - if you're grouping by RPC method, there's absolutely no guarantee at all, and in fact it's just not true, that all of the data for one RPC method is going to be on one server or another. Each of the Monarch leaves is going to have some portion of the data. So what you want to do is compute what we would call a partial aggregation: every leaf computes its part of this particular query, so they each make the RPC-method buckets, and then they pass those partial results up the stack to the mixer level. By now you've done most of the aggregation, so the data size over the wire is pretty small; you complete the aggregation once you have all the data, get the final numbers for both the error count and the total count, and at the last step, at the top, you join the two, divide them, and you've finished the query. So the most important thing to understand is that it's not possible for the lower-level nodes, where the horizontal scaling has happened, to compute the final number, because they don't have enough data to do it. They have to have some way of communicating partial results back up to the top of the stack. In that example it's not that difficult, but in terms of the full query plan for an arbitrary query in the language we were building, there's a lot of subtlety and complexity to how different types of queries can and cannot be pushed down to be evaluated at the leaves. And if you ever end up in a situation where you need to pull all of the data up into the mixer level, the whole thing totally falls apart from a performance standpoint, and oftentimes even from a feasibility standpoint - you end up OOMing the thing if you're not careful. So there's a lot of streaming evaluation and pushback on channels so you don't flood the node that's getting this huge fan-in from all the children, and stuff like that. This guy John Banning - another person much smarter than I am - designed that thing and worked on it for years to optimize it. It was a really interesting piece of technology, but it was just quite subtle.
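As a toy illustration of that leaf/mixer split, here is a sketch in Go (not Monarch code; the names are invented): each leaf computes per-method error and total counts over only the series it happens to host, and the mixer merges those small partials before doing the division - the one step no leaf could do on its own.

```go
package query

// Partial holds per-RPC-method counts computed by one leaf over only the
// time series that live on that leaf. No single leaf can compute the final
// error ratio, because data for one method is spread across many leaves;
// the ratio is only meaningful after the partials are merged.
type Partial struct {
	Errors map[string]float64 // method -> error count seen by this leaf
	Total  map[string]float64 // method -> request count seen by this leaf
}

// MergeAtMixer combines the partial aggregates from every leaf and only
// then finishes the query by computing error/total per method.
func MergeAtMixer(partials []Partial) map[string]float64 {
	errs := map[string]float64{}
	total := map[string]float64{}
	for _, p := range partials {
		for m, v := range p.Errors {
			errs[m] += v
		}
		for m, v := range p.Total {
			total[m] += v
		}
	}
	ratio := map[string]float64{}
	for m, t := range total {
		if t > 0 {
			ratio[m] = errs[m] / t
		}
	}
	return ratio
}
```

The point is that a partial is tiny - one entry per RPC method - even when the leaf is scanning millions of series, which is why pushing the group-by down and shipping partials rather than raw points keeps the query fast.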
And I think if we hadn't designed for that initially, it would be hard to retrofit - it's hard to take a query model that isn't designed to create these partial aggregates and bolt that on after the fact, because the way you approach the computation is that you have to be able to truncate the computation and send it up the stack as a partial computation instead of as a set of query results. I'm not saying it's impossible; I'm just assuming it gets pretty ugly. So that's the thing I was referring to. Does that make sense?

Utsav Shah: Are you saying that the query language itself also needs to be designed with this in mind? Or is it mostly just the way you shard things out and make your query plan?

Ben: We really tried not to put constraints on the query language because of this. There were certain types of joins that were really hard - like a full relational join; it wasn't designed to be a relational [40:00] database, and those things are really quite difficult to implement - so the joins are where I think we had to draw the line on some of the functionality. But for a lot of other stuff, the query language had some very powerful capabilities, especially for dealing with the time dimension. It was also more limited than a lot of SQL-like languages for doing general-purpose computation and stuff like that. So it was very much designed for time series data, but I would say it turned out to be general enough to represent a wide variety of operational use cases. It just wouldn't take the place of BigQuery or something like that for general-purpose computation.

Utsav Shah: Yeah. And one thing I think listeners should note is that you might think users are not running that many queries, so how does it matter? But a lot of people write alerts and monitors, and they're basically super complicated and have to be evaluated constantly. I'd imagine that a lot of your load was from these alerts that have to be continuously evaluated, and it's super critical that they fire quickly - you don't want a production outage and then a slowdown in the monitoring system on top of it.

Ben: Certainly the case!

Utsav Shah: So it's a really interesting problem. I think Dropbox tried to build their own, and I think they rolled it out successfully, and they just put cardinality limits in place - you can't have metrics with super high cardinality, but there are very few other limits - because I think the cardinality explosion is where a large source of the problems is. Is that accurate?

Ben: Yeah, that's definitely accurate. So that's another rant, if you don't mind me going in that direction for a minute. This is one of the things I did not realize when we were working on Monarch, but I've come to believe that there really are two types of monitoring data - telemetry, I guess you could call it. There is statistical data, which we usually call metrics, and there's transactional data, which we usually call traces, or sometimes logs or structured events. Those are the two flavors of data. The Achilles heel for the structured events - the traces, the transactional data - is that at high throughput, just retaining it for a long time is really expensive. If you're processing a lot of transactions, it's big data; it's expensive.
I don't mean big data in the Big Data sense, but it's a lot of data. On the metrics side, you can handle high throughput very naturally, because the only thing that changes is the value of the counter - it just goes up. High throughput doesn't actually make the metric data larger; it just changes the numbers. So the Achilles heel for metrics data is high cardinality. I actually wrote some stuff about this on Twitter a few weeks ago - I can send it to you afterwards if you want to include it in the article - but the thing that's so frustrating is that cardinality is necessary. It's totally inevitable that you're going to want to include some tags in order to understand and isolate symptoms of interest, and I think that's what metrics should be used for: isolating symptoms of interest that connect the health of some part of your system to the business. At some abstract level, that's really what you're supposed to be doing with metrics. And then, because that's the tool people have, it becomes the hammer for every nail they see, and you end up trying to use cardinality to address every aspect of observability, which is a complete disaster from a cost standpoint and from a user experience standpoint. Let me try to elaborate on that. Let's go back to this example of RPCs: it's totally fair, and smart actually, to have a tag on your RPC metrics for the method, because you want to distinguish, say, writes from reads - totally fine, because those are different things you might want to independently measure from a health standpoint. But then you might see a traffic spike or a latency spike and want to understand why. If you're using a metrics system, the only tool you really have to understand that variance or that change is to do a group-by on some other tag and hope that one tag value explains the blip, so that you have a lead to follow wherever that tag value leads you, whether it's a host name or whatever. I understand the appeal of that, but there are two problems. One, you add a couple more tags and the combinatorial explosion is immediate, and you're suddenly spending a ton of money, whether it's on Prometheus or a vendor - it's really expensive. And then, I think, the even more pressing issue is that in the code, you only have access to things that are locally available. You have access to your own host name and things like that, but it's often the case - in fact, the numbers I've seen, which ring true to me, are that 70 to 75% of incidents in production are caused by an intentional change, like a deployment or a config push, elsewhere in the stack. That version change or config push is not going to be in your local tags. If you're in service B [45:00] at the bottom of the stack, and service A at the top of the stack pushes a new version that floods you with traffic, service A's version is not going to be available for your grouping and filtering anyway. So you're paying a lot of money to have all this cardinality, and you can't even group by the thing that would explain the change. Now, in the transaction data - the tracing - you absolutely do have that: the traces flow through both services. These health metrics are linked to the transactions via hosts, via service names, via method names, all sorts of stuff like that, so there are ways to pivot over to the tracing data programmatically in an observability solution. And in the transaction data, high cardinality is not an issue.
You can do an analysis of thousands of traces in real time and actually understand that the thing that changed is that, before, service A above you was on version five and now it's on version six, and that explains the difference in the health metric you started with. But using metric cardinality as the way to do observability is a big mistake. I think that metrics should be used to understand health and nothing more, and then you have to be using observability tooling that's smart enough to pivot over to the transaction data, where it's both cheaper and more effective to understand the sorts of systemic changes that lead to those health changes in the first place. At Google, I think we were actually way behind where we are now with Lightstep, honestly, in terms of how we would pivot from time series data over to transaction data and back again, but that's really the essence of it: you can't do high throughput with one and you can't do high cardinality with the other, so you have to be able to use tools that pivot from one to the next intrinsically. I think that's the thing that allows you to get out of that cardinality trap more than anything else - it's not setting a cardinality limit, it's just not needing the cardinality to be in your metric data in the first place. I think that's really the solution we'll find ourselves pursuing over the next couple of years.

Utsav Shah: Well, I think that makes total sense. And maybe just super quickly, to spell out the lesson: why is high cardinality bad? I think the answer is that in a time series database, when you have a data point with a different tag combination, you basically have to store it as a different row or different column, and that's what causes the explosion.

Ben: Yeah, that's basically it. And high cardinality isn't bad - high-cardinality metrics are bad. The issue is that in a time series database, you can basically think of it as a huge spreadsheet where each row is a different time series, and it turns out that creating a new row is a lot more expensive than creating a new cell.

Utsav Shah: So you're just incrementing a number in that cell?

Ben: Yeah - adding another data point to an existing time series is far cheaper than creating a new time series. The issue with high cardinality sounds so esoteric, but it ends up being an issue that your chief financial officer starts caring about. Because you can add one piece of code, literally one line of instrumentation, that says: I've got this metric that tracks requests, and I'm going to add customer ID and host name - and that just explodes. Every single distinct combination of values flowing through that line of code is going to create a new time series. And it's not vendors being evil, by the way - they have COGS, they have to pay for it - but if you write a piece of code that incurs 10 million time series in some TSDB somewhere, that's going to be expensive no matter what TSDB you're using. Some are more expensive than others, but they're all just different flavors of expensive. I think ideally the developer can add that code and a platform team can write some kind of control that says, I never want cardinality for any metric to exceed X, and then you can do some kind of top-k thing to retain the high-frequency data and aggregate the rest. I think that's the kind of long-term vision I have for how cardinality should work.
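A back-of-the-envelope sketch of that explosion, with made-up numbers: the number of time series a metric can produce is roughly the product of the distinct values of each tag attached to it.

```go
package main

import "fmt"

// Rough rule of thumb: a metric produces up to one time series per distinct
// combination of tag values, so the series count is bounded by the product of
// each tag's distinct-value count. The numbers below are invented.
func seriesCount(distinctValuesPerTag []int) int {
	n := 1
	for _, v := range distinctValuesPerTag {
		n *= v
	}
	return n
}

func main() {
	// rpc_requests{method}: ~20 methods -> 20 series. Cheap.
	fmt.Println(seriesCount([]int{20}))

	// rpc_requests{method, status}: 20 * 5 -> 100 series. Still fine.
	fmt.Println(seriesCount([]int{20, 5}))

	// Now add customer_id (~50,000 customers) and hostname (~2,000 hosts):
	// 20 * 5 * 50,000 * 2,000 = 10,000,000,000 potential series. Even if only
	// a small fraction of combinations ever appear, the TSDB bill explodes.
	fmt.Println(seriesCount([]int{20, 5, 50000, 2000}))
}
```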
But really, it would be great to use event data - tracing data - for high cardinality, where you don't pay a penalty at all; it literally doesn't matter in terms of the cost of the solution. And then stick to metrics for the high-throughput data, where you need precise answers about critical symptoms.

Utsav Shah: And then the flip side of that: why is cardinality not a problem for tracing? How do you store tracing data in a way that cardinality isn't an issue at all?

Ben: I think it depends on how you index it. The cardinality problem is really an issue with the index in the TSDB.

Utsav Shah: Yes.

Ben: In tracing databases there are many different ways of doing it. Some people use column stores - I don't work on any of those, so I don't want to misspeak, but I'm pretty confident the underlying database for some of them is a column store, and you have a different set of trade-offs in that type of world. What Lightstep does is too complicated for me to explain right now, but we have our own way of managing cardinality in the trace data. So it's not that it's free or something, but it's not an issue like it is for a time series database. Different people have addressed it in different ways, [50:00] and I realize this isn't a very satisfying answer to your question, but in a time series database, because of the locality constraints you have around time series, adding cardinality just has a fixed cost that's relatively high. That's probably the best way I can explain it.

Utsav Shah: Okay. Is the index in the tracing data just the trace ID, or a hash, basically, or the service - a relatively straightforward index?

Ben: There are a lot of different ways to do it - that's why I'm hedging on this; I like being precise, and there's actually quite a diversity in how it's done right now, so I can't speak to how it's done everywhere. In the Dapper ecosystem, we did have a few indices that we special-cased, and if you wanted anything that wasn't one of those indices, you had to write a MapReduce, which is super high latency. Other systems have no limits on cardinality; others will index up to N values of every key and no more. There are a lot of different approaches to it.

Utsav Shah: That makes sense, and I can read up more on how Jaeger works - when you search for a trace ID you get it fast, and everything else is kind of not that fast, so that makes sense to me. I think there was a lot of good information here; I feel like I've learned a lot about how all of this infrastructure works. Do you have anything you want to add on top of this? Just from what you've learned from all of this - you left Google in 2012 - what is the one thing that you still use, or some piece of knowledge that you still remember from that time? One design principle, perhaps, or one way of thinking about things?

Ben: That's a good question. I kind of referred to it already, but not as precisely: one thing that Jeff Dean said a few times, which really did stick with me, was this idea that you really can't design a system that's appropriate for more than three or four orders of magnitude of scale - "appropriate" being the key word. And this goes back to the idea I mentioned about systems at Google.
They're not better just because they're more scalable. And one thing I liked - it's not just about Google, I think most companies are like this - is that the in-house technology at Google came with pretty accurate advertising for what it was good for and what it wasn't good for. There's no shame in saying, yeah, this database is good until you hit this scale, and this one's terrible if you go below that scale because it doesn't have all the features you'd expect, or what have you. My experience outside of Google, whether with open source software or vendor software, is that people are understandably reluctant to describe the scale their system is appropriate for. And it's a great question to ask, actually, if you're talking to someone about something they're really excited about and trying to pitch you on: just say, so tell me, what's too much scale for this, and what's too little scale for it? If people can't come up with an actual answer to that, I think that's a bit of a red flag in my book. That was something that really stuck with me after Google, and I think it applies to any technology you're building. It's also good to be humble about it: you can remind yourself that you built some really scalable thing, but it probably doesn't do something that a less scalable thing could do, and vice versa - just try to think about fitness for purpose. That's maybe the thing that comes up over and over again, in everything from engineering to product, almost to marketing: what market segment is this really appropriate for, and where does the scale we're targeting live in the marketplace? So I think it's relevant at that level too.

Utsav Shah: Yeah, it reminds me of that meme, "MongoDB is web scale."

Ben: What does that even mean? No comment.

Utsav Shah: The web is so different for so many different things, there's no single concept of web scale - it just sounds fancy. My current company uses MongoDB; it seems to work so far, it's probably fine. And I have so many more questions - I want to ask you about OpenTelemetry and OpenTracing and Lightstep - but I think it would be nice to do that in a follow-up session. This was great, and I feel like I learned a lot, so thank you so much for being a guest.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey Ben, welcome to another episode of the Software at Scale podcast. Could you tell our guests just about your story? Because there's so much in your background that is interesting to me. So right from, you know, starting off at Google to like creating LightStep. Sure. Yeah. Thanks for having me. I'm excited to be here. I don't know whether my background is interesting or not, but I mean, to me, it's kind of boring.
Starting point is 00:00:37 Yeah, I mean, I graduated from college right in the thick of the dot-com busts in the sort of 2003 era and I was very fortunate to get an offer to work at Google at the time and when I went over there I they actually put me on some stuff in the ad system that was incredibly boring to be honest with you and also of course ads makes a lot of money at Google but it was this particular part of the ad system that wasn't making any money so it was like just of boring, not very lucrative for Google. And I didn't like it very much. And then they had this, the way I got into Dapper and to distributed tracing was actually incredibly arbitrary, but it's a funny story. They had this one time event where you could opt into this, I don't know what they called it, but it was this program where
Starting point is 00:01:25 they would take everyone who opted in, they'd look at a bunch of different dimensions, like how long you've been at a school, what office you worked in, what languages you worked in, where you were in the org chart, that kind of stuff. And I think they had like 10 dimensions. And then they found the person who also opted into this program, who is literally the furthest from you in this 10 dimensional space. And they set up a half an hour meeting with no agenda. And that was it. So I was working on this god awful stuff and ads that was, as I was to say, totally pointless. And they paired me up with this woman named Sharon Pearl, who is a very distinguished researcher who had come over from digital equipments research lab when it kind of fizzled out after the merger in the late 90s.
Starting point is 00:02:04 And she and some of the other old guard at Google were doing all the really cool system stuff. And I just she asked me what I was doing. Like, I don't want to talk about it. What are you doing? And then she went through this list of really interesting systems projects. One of them was kind of like a predecessor to like an S3 is like a blob storage thing. There was some NLP thing she was working on. And then in this list was this prototype of a distributed tracing system called dapper that never really saw the light of day it was just kind of an idea and she described it to me and i just thought it sounded incredibly useful
Starting point is 00:02:35 and really fun and my manager at the time had 150 direct reports with direct reports i don't mean that it was 100 but like he had no idea what i was doing obviously how could you and and so i just started working on it basically like switch teams or anything well i did eventually it it turned out that i mean i did it google famously had this 20 program you know so it was kind of that type of thing but i really liked it and i i thought it was quite valuable actually and so i moved to new york for personal reasons and i just started working on dapper full-time my manager in new york also had like 100 direct reports so he also had no idea what i was doing and and i got it to a point where it was in production and it was actually solving problems like pretty quickly just because it was it well i can get into that if you want why
Starting point is 00:03:21 it was possible to do that and i just got kind of hooked on that stuff. And I really haven't looked back. I mean, that was early 2005. And now 16 years later, I'm still basically working in that same overall space of how do you observe complex distributed systems and what kind of improvements can you make to the software engineering process if you are able to observe them effectively. After working on Dapr for a while, I just wanted to do something different. So I went over actually did a couple of systems projects that really didn't work if you that are not well known, because they were failures. I'm happy to talk about those two if you want. But
Starting point is 00:03:56 eventually found my way over to Monarch, which is a project I started to create a multi tenant high availability time series database basically and and it was something kind of like you know similar to i don't know in terms of the open source world probably the closest parallel bm3 or something like that but but ended up working on that for about three or four years and then left google started a social media company that was a as a product a complete failure about a year into it i realized that it was never going to work, abandoned the product, but realized I enjoyed being an entrepreneur.
Starting point is 00:04:30 And I wouldn't even say pivot's the wrong word because pivot implies that you keep one foot in the same place. I just started playing a different sport, but with the same investors. And that's actually LightStep. LightStep was founded as a social media company in 2013. And about a year and a half in, I completely changed what I was doing.
Starting point is 00:04:48 Added some co-founders at that point. And here we are six years after that. And I'm still working on this durability stuff and really enjoying it. I think that is a super interesting background. So first question on that is, was that the era when Larry Page or whatever decided that there's no need for managers? That's why they just fired all of them? I don't think so much they fired them, but they would have hired
Starting point is 00:05:09 a lot of engineers and hired managers to go along with it. Yeah, I think there was this idea that management was bad in some capacity. And I understand where they're coming from. I definitely don't agree. I think good management is actually one of the most incredible supportive things you
Starting point is 00:05:28 can possibly have in an organization. But I think that a lot of the people who had come to believe that management was bad were just coming from really bad management. I don't know how else to put it. I mean, it's like certainly bad management is worse than no management, but good management is better than all of it. Right. The other thing that was interesting about Google at the time was that they were growing so quickly that, you know, if you didn't like what you were doing, you only had to wait a couple of months and some new person would take over. You know, so that that's that papers over a lot of issues.
Starting point is 00:05:59 I think there was a belief that Google had solved the management problems through software or something like that. That was another thing. There was a belief that by writing internal software systems to do a lot of the blocking and tackling that managers might do. And they certainly had tech leads, which serve a managerial purpose for just dividing work up. I think there was a belief that they'd solve that issue. And once the company, you know, every, there's a lot of large numbers, and even though Google has been very successful, at some point, they had to grow slower. And when that started to happen, the need for managers became much more obvious. And sure enough, at this point, I don't know what the ratio is, but I'm sure it's not 151 anymore. So that, you know, they realized that they needed to correct that. But yeah, it was liberating in the sense that you could do whatever you want. But I think it was
Starting point is 00:06:41 pretty disorganized and not very efficient. Yeah yeah could you talk about the architecture you said you worked on some part of like the ad system that wasn't particularly interesting from my understanding and i could be completely wrong about this because this is second hand like there was like one monolith like google web server like gws and not that many like services around is that that like roughly accurate? Would that just like a couple? Because I'm also thinking, why did Dapper make sense if it's just like one large server? But I guess that's clearly wrong.
Starting point is 00:07:12 Yeah, I don't think that's correct. I mean, certainly if you go back far enough into like 1999 or something, that's probably true. But by the time I showed up- 2003-ish. Yeah, we didn't call them microservices, but they absolutely were microservices.
Starting point is 00:07:25 I have said in the past, and I would say again, that the microservices at Google were... I mean, the best, probably the only good reason to adopt microservices is, well, going back to management, it's difficult to get more than 15 or 20 engineers to work on anything efficiently in a single code base that's deployed as a single unit. It's just difficult to do that from a release engineering standpoint. So microservices serve a
Starting point is 00:07:50 purpose from a software development, you know, management standpoint, where you can create a unit of deployment, the microservices at Google were much more about horizontal scaling. And that was a necessary thing. I mean, they, they had throughput that required that kind of horizontal scaling, but they definitely had Yeah, I mean, I remember when they turned Dapper on in production, it was, you know, we'd never really been able to visualize it before. But a cache bits and Google Web Search, yeah, it certainly went to Gwis, GWS, which you're referring to Google Web Server at the top of the stack. But by the time it got down to the bottom of the stack, I don't know it got down to the bottom of the stack,
Starting point is 00:08:27 I don't know, between the front-end load balancers and the final thing that actually would look through some index on disk, it was, you know, 10 or 20 levels of depth to get down. So, yeah, it was definitely quite distributed. And also a huge fan out. You know, oftentimes a parent would have, would parallelize requests to 30 or 50 or 100 in some cases children that had different parts of the index and so you had tail latency things were really scary and stuff like that so yeah it was actually it was quite distributed especially on the web search side
Starting point is 00:08:57 early on there were other parts of the system like the ad system was the front end of the ad system that merchants would actually use was was basically like a database and a java web server right so i mean there wasn't everything but but that was for the high throughput uh low latency stuff it was pretty distributed from early on i think interesting and just out of curiosity did google like prioritize the consistency or availability when that because of like that fan out. I'm assuming availability and it just dropped data coming from
Starting point is 00:09:30 a few shards if they were too slow or something. I don't think there's one answer to that, but Jeff Dean did a talk that the slides are online at Berkeley in 2010 or 2012. It was a really good talk where he discussed a lot of the techniques that they would use depending on the situation to deal with tail latency. But I think that, yeah, I mean,
Starting point is 00:09:48 I understand you're kind of referring to the cap there and stuff. But another trade-off that I think we had to wrestle with a lot was basically just cost or efficiency versus latency. And we would often end up with something that was more expensive in order to put a tighter bound on latency. So you'd send, if you had three copies of some service, you'd send the request to two of them in parallel and just take the first one that came back in order to manage high latency outliers and things like that. But I don't think there is a single answer from a availability or consistency standpoint. It really depends on, I don't know, I guess, what's the word, the business word the business requirements yeah that makes sense transaction yeah i've seen like the tail at scale stuff so i think that might be what you're referring to yeah that's that's interesting and so it turns out you turned out you turned on dapper in 2005 is what you said and what was like
Starting point is 00:10:41 the immediate like engineering impacts from engineers at google were your customers right so how did what was their reaction and did you see like some immediate changes based on releasing it and showing it to people that's a great question and one of the most interesting things about dapper is that when we when we first got it out there in the world, well, at Google, it was definitely not something where everyone's like, oh, my God, this is incredible. And telling their office mates about it, it was nothing like that. In fact, I would basically go and find a tech lead for whatever. I mean, you know, you name it, like Gmail, web search, and anything that was operating at scale had a lot of services. And I would kind of beg them to like meet with me.
Starting point is 00:11:25 And then I would show up in their office. The UI admittedly was terrible, but it was still, you know, good enough to be useful, I think. And I would show them some traces and they would always be like, wow, this is actually really interesting. I didn't know this and would often, you know, explore it with me. And we'd find something that was troublesome and novel to them. So like, you know, they would get something that was troublesome and novel to them. So they would get something that was interesting to them.
Starting point is 00:11:48 And sometimes they would go and fix that issue. But it wasn't like we had our own dashboards to track activity. And it really didn't get a lot of use. I did generate a lot of value in the sense that we're able to find some high latency outliers and understand where that latency was coming from and make some substantial optimizations. But it was very much a special purpose tool used by experts doing performance analysis in the steady state. That was really what it was primarily used for initially. And there are some technical reasons for why that was the case. But if you were to think of it from a product standpoint, the issue is that we weren't integrated into the tools that people were already using. And that is still the number
Starting point is 00:12:29 one problem with the sophisticated side of the observability spectrum is that the insights that are generated are genuinely useful and insightful and even self-explanatory when you put them in front of someone, but they simply are not going to find them themselves unless it's integrated into the tools they're already using. And it's still, I think the number one barrier to value and observability is just that it's not integrated into the kind of daily habit tools, whatever those may be. At some point, we did make a change. Josh McDonald, who actually still works with me at LightStep, who was working at Dapper in 2005 as well. He eventually made a change to Stubby, which is the internal name for gRPC, basically, essentially anyway. And this, well, particularly this library called RequestC,
Starting point is 00:13:12 which was used to look at active requests that are going through the process to basically just cordon off requests that had a Dapper trace bit set to true. And so you could go to any process where people are already using this requesty thing all the time to see requests going through their service. And it kept a cache of slow requests from the last, I don't know, an hour or whatever at different latencies. And we had a little table of requests that had Dapper traces where you could click on the link and go directly
Starting point is 00:13:40 to the trace. And then it was something people are already using. And the number of people that use Dapr, I don't know, I mean, I don't remember exactly, but it must have been like a 20x improvement when we released that. It was a huge, huge change. And the only lesson, Dapr didn't get any more, it didn't get any more powerful. When we did that, it just got a lot easier to access. And so it's all about being in the context of the workflow. That said, there were some people, this guy, Yonatan Sumer, who incredibly smart person, much smarter than I am, that's for sure. But he ended up really pressing us to build kind of a bulk data API
Starting point is 00:14:12 to run map produces and things like that over the Dapper data. And he was in charge of something called TerraGoogle, which was the, it was actually the largest part of Google's index, but also the least frequently accessed. It was a very, very complicated system, the way that it worked.
Starting point is 00:14:25 I won't go into it just because we don't have time and I don't know if I'm allowed to talk about it, but suffice it to say, it was really complicated. And, and he, yeah, he did some fascinating work to understand the critical path of the system using, you know, bulk map reduce, blah, blah, blah, blah. It made some really substantial improvements as a result of it. So there are people like that who made these big improvements. But there's a big difference between delivering quote-unquote business value to Google,
Starting point is 00:14:50 usually in the form of latency or error reductions, and having a lot of daily activity. The daily activity really didn't come until we integrated into these everyday tools. And I think that was one of the most important lessons from the Dapper stuff, is that the cool technology really is not enough to get retention from engineers who are busy doing other things. That's super interesting. And I think I've heard the term TerraGoogle maybe five years ago when I interned there.
Starting point is 00:15:16 And I think I finally learned what it meant. I'm sure I forgot about it in like three months or whatever. That's interesting. And requests, it seems like a front end towards like visualizing like a context or like ghost context in a sense. Is that like an accurate way of phrasing it? And why did engineers use Request C?
Starting point is 00:15:35 That's something I'm curious about now. Well, for different things, but what was particularly nice about Request C, also known as well, RPCZ contained request C, but request C was the part that we're really talking about what it allowed you to do. I guess,
Starting point is 00:15:52 you know, it was really basically just a table. That's all that you saw. The table would have a row for every RPC method that you had in your stubby service, your GRPC service. And then, so each row is a different method. Okay, fine, that's simple enough. And then the columns were basically different latency buckets. So you'd have, you know, requests that took less than 10 microseconds, less than 100 microseconds, less than one millisecond, less than 10, etc. And it would
Starting point is 00:16:24 go all the way up until I don't know, things that took longer than 10 seconds or something. And you could you could examine a very, very detailed kind of micro log of what took place during that request. So it was, you could think of it as as just a little snippet of logs that were pertained to that request, and only that request. And then as I was saying, if the thing was Dapper traced, you could then link off to the distributed version of it and see the full context. The thing that was particularly powerful though, is that it had one special column for requests that were still in flight that may be taking a really long time. So what would happen is you could have a request that was stuck and and you were trying to debug it, like in a live, you know, incident.
Starting point is 00:17:07 And you could inspect the logs just for requests that were stuck, usually because of, you know, let's say, it was often that there was a mutex lock that was under contention, and you're stuck waiting on it. You could go and see that that exact thing had happened. And there was a lot of really pretty clever stuff they did in implementation to defer any of the evaluation of any of the strings in the logs until someone was looking at them. So you could afford to be extremely verbose with the logging on this thing. And then you only evaluated the logs when an actual human being was sitting there
Starting point is 00:17:38 like hitting refresh in the browser looking at it. So unlike most logging frameworks where that's not generally how it works. So there's a lot of pretty complicated reference counting and things like that. But it was all so that if you were having an issue, you could figure out that you were blocked on this big table tablet server
Starting point is 00:17:54 and it was this particular mutex lock that was contended and that was being contended by these other transactions which you could look at and figure out where the cont attention came from. But to be able to pivot like that in real time was pretty powerful. And then having it linked into Dapr to understand the context was also pretty powerful. I don't know if that makes sense, but it was a tool that, you know, is one of those things that I really haven't seen.
Starting point is 00:18:19 I've seen, you know, I think OpenCensus and ZPages have some of that functionality, but it really, it didn't, it doesn't really make sense unless everything is using that little myfologging framework. And I just haven't seen that outside of Google or Google open source stuff. So I still miss it. It was a really useful piece of technology. No, I think that's amazing given this was like so long ago. And that makes me think about yeah taking a step back like is do you think the
Starting point is 00:18:48 i think maybe five years ago or 10 years ago like internal tools at google are probably better than like the development experience externally right you have so much stuff for free like you talk about blaze and you talk about all of these different tools but like things have evolved a lot recently it seems like there's so many startups coming up with like different things like and even like datadog and stuff are like fairly mature now you get a lot of stuff for free from them would you say that the development environment externally is probably better than anything that google can offer in the sense of you get the holistic experience now or do you think there's still things like,
Starting point is 00:19:28 you know, as you said, request Z, like some functionality that's just missing because of the lack of consistency and you have to integrate like a million different things and you're maintaining like this Rube Goldberg set of like integrations to get like a similar development experience? Yeah, it's a good question. It's really hard to compare the inside and outside of Google experience. And it's not that Google is all better. A lot of stuff to Google is actually really annoying. And I was just talking to someone about this yesterday. But the trouble with Google is that everything had to operate at Google scale. And there's this idea, which is totally false in my mind that things that operate at higher scale are better and they're usually not i mean there's a natural trade-off between the scale that something can operate at and the feature set and so a lot of the stuff we had at google was actually pretty feature poor actually compared to what you can use right now in open source but the only thing that really had going for it is that it scaled you know incredibly well the exceptions are mostly areas where having a monorepo
Starting point is 00:20:26 with almost no inconsistencies in terms of the way things are built, it gives you some leverage. And of course, there are a lot of examples that request you as a fine example. Dapper is actually a fine example too. I mean, the instrumentation for Dapper to get that thing most of the way there
Starting point is 00:20:41 was a couple thousand lines of code for all of Google. It's crazy. Whereas just look at the scope of open telemetry or something to get a sense of like how much effort is going to be required to get that sort of thing to happen in the broader ecosystem. So that, you know, they had this lever around consistency. A lot of the tooling at Google was, it wasn't, it's not that it was bad, it was very scalable, but it didn't have a lot of the features that we would expect from tooling outside.
Starting point is 00:21:07 I'd also say that in my, what, I guess, seven years at Google working on infrastructure and observability, I never had a designer on staff and I barely had a PM ever. And it really showed, you know, it's like having worked now with really talented designers, not just who can make UIs look nice, but who really think about design with a capital D and stuff like that.
Starting point is 00:21:28 It's just a completely different ballgame in terms of how discoverable some of the value is in the future set and things like that. So the Google technology often lack that sort of polish. And I think there are many different vendors out in the world right now that I think have built things that are much easier for an inexperienced user to consume, even if the technology is equivalent. I think the user experience is not. So that's another area where I think what we had at Google is actually, unfortunately, a pretty far cry from what you can get now just by buying sass yeah yeah yeah i've always thought about you know if people
Starting point is 00:22:07 ran infrastructure teams like product like you're trying to sell each piece of your infrastructure to like potential buyers you would create a better product because for engineers that because you'd have to think about user experience and you have to think about do my customers can my are my customers actually getting value so but trying to make that happen it's not the easiest thing in the world but that makes sense when you released dapper did you have any sampling at all or was it like i'm assuming yes but it wouldn't have just worked before yeah yeah i mean sampling is interesting that's an interesting topic i mean there's a lot of places you can perform sampling in a tracing system, and Dapr performed it almost in every one of them.
Starting point is 00:22:52 But yeah, Dapr had actually pretty aggressive sampling. We started with one for 1,024, so that was the base sampling rate in Dapr. And then we realized that even after that, you know, cut of one for a thousand, centralizing the data, when we initially, we wrote the data just to local disk where we wrote log files
Starting point is 00:23:14 and we deployed a daemon that ran on every host at Google as root actually. By the way, if you ever, yeah, if you ever want to like jump through some hoops, try to deploy a new piece of software that runs as root on every machine. I'll tell you that was a real nightmare from a process standpoint. But anyway, so this thing would sit there, it would scan the log files and basically do a binary
Starting point is 00:23:35 search anytime someone was looking for a trace. And so that's the thing that I started with. That was honestly a terrible way to build that system. Really, really bad. So eventually we moved to a model where we would try to centralize that data somewhere for all the reasons you might imagine. But it turned out that the network costs of centralizing that data, even after the one for 1000 cut was substantial, and the storage costs were also really substantial. So we did another one for 10 on top of the one for 1000. So we were doing one for 10,000 sampling totally randomly before we got to the central store that was used for things like map releases and stuff like that. And
Starting point is 00:24:10 it was, you know, pretty much means you can never use Dapper for all sorts of applications, like for web search, it was fine, because I don't remember the number, but it was order of like a million queries per second. So fine, whatever. But you know, for something like, I don't know, just making it up like Google checkout, where people are actually buying stuff, of course, intrinsically much lower throughput service, but the transactions are actually more valuable. So you're getting cut both ways. And we didn't have a dynamic sampling mechanism on Dapr when I was working there. And people could adjust the sampling rates themselves, but they usually didn't. So yeah, the technology was really not that useful, except for the high throughput services where that sampling wasn't a complete deal breaker. I think
Starting point is 00:24:48 with LightStep and with other systems that have been written in the last couple of years, there's a recognition that sampling really serves a couple of purposes. One is to protect the system from itself. So you don't want to have an observer effect and actually create latency through tracing. With Dapper, we had that issue because we wrote to local disk. We're basically entering the kernel, you know, at least on disk flush. And for hosts that were doing a lot of disk activity, we could actually create latency
Starting point is 00:25:13 with, you know, high sampling percentages. But there's no need to do that. I mean, you can just flush the stuff over the network. And especially that was 2005. That works a lot faster now. Nix are a lot faster now. You can actually get the data out of the process without sampling in almost all situations. I mean, there are probably outlier cases here or there. It's not true, but overall, there's no issue with flushing all the
Starting point is 00:25:36 data out of the process. And then you just need to decide how much you're willing to spend on network and how much you're willing to spend on storage. And so, you know, and that's a whole set of other constraints. The other thing that I recognize is that long term storage is quite cheap, the network, you know, wide area networking costs in terms of the lifetime of that data end up being almost as expensive, or in many cases, more expensive than storing it for a year. So that if you can find some way to push the storage closer to the application itself, even if it's just in the same physical building or availability zone, that's a pretty big win as well.
Starting point is 00:26:09 So a lot of the work that we've done at LightStep is actually trying to take advantage of some of those, you know, just trying to be on the right side of those cost curves in terms of where we actually do the high throughput ingest and then where we do the sampling and stuff like that. This reminds me of how like Monarch is designed. I was just like reading up on the paper before this, like it's the same where you're trying to flush to something that's in like a local data center or like a local availability zone. And then finally, like when you query, you're getting such less data that you can do that once and ask questions of
Starting point is 00:26:45 like multiple regions. Is that like roughly accurate? Yeah, I think there definitely are some similarities. In modern, I have to say, if we could go back in time, I would have pushed back harder on some of the requirements that were put on us. I don't think we did the wrong thing given the requirements that were handed down. But the requirement that we depend on almost nothing except for, you know, physical DM, physical DM is kind of a misnomer, but the fact that we weren't,
Starting point is 00:27:10 we weren't allowed to take advantage of Google's other infrastructure beyond just the scheduling system and, you know, the kernel and things like that, it really limited what we could do. And then when you, you know, when you also pile on some other requirements around performance and availability, we're kind of forced to store everything in memory. And we did, you know, and then the paper goes and talks about the number of tasks, which are basically virtual machines that Monarch consumes in the steady state.
Starting point is 00:27:38 And if I remember correctly, the paper's number is like 250,000 DMs steady state. And that is just, just extraordinarily expensive system right there. You know, it's like, I mean, a VM of course is not the same size as a physical machine,
Starting point is 00:27:52 but it's a lot, you know, and that's not even counting the VMs that are being used for durable storage and long-term storage of the data and, you know, you know, wherever they're putting that stuff in Google's longer term storage systems.
Starting point is 00:28:04 I mean, just a tremendously expensive system. And that's not a good thing. And I'm not convinced that that's the right approach. We've certainly, with some of the work we've been doing lately, LightStep, we basically had to write our own time-series database from scratch. And rather than trying to re-implement
Starting point is 00:28:22 what we did with Monarch, I think a lot of the lessons we've learned is that there are ways to do that that are far, far, far more efficient without really paying a penalty in terms of performance. And yeah, I remember that we felt like we had no choice but to do everything in memory. There are some similar systems that Facebook, like the system, I think also ends up making the same decision at about the same time. Maybe it was because Splash wasn't quite commodity at that point. And so we felt like it was disk or like physical spinning disk or memory. And now, of course, there's some interesting things.
Starting point is 00:28:54 But gosh, that was expensive. I don't know. It's a cool system. It's very powerful, but awfully expensive. Yeah. So just for listeners, like Monarch is a monitoring system. It's, you can say it provides the same end interface to as like Prometheus. It's just, it's designed in a very different way internally, including to the user, like I think the configuration system is
Starting point is 00:29:16 different, but it provides kind of like the same purpose. Right. And it was built to design, to replace Borgmon, which is the original monitoring system, which like engineers had to deploy for themselves. Whereas like Monarch was like a SaaS service in the sense that you just had to add your metrics and things would work automatically. Is that like a good summary? I think that's exactly right about what Monarch is.
Starting point is 00:29:40 I would say the Prometheus thing is a little funny though. Prometheus was was architecturally much more is architecturally much more similar to Borgmon than Nifty Monarch and Borgmon had a lot of issues that Prometheus I think has improved upon that were self-inflicted like Borgmon to actually use Borgmon to monitor your system you had to use not one but five different totally unique to Google domain specific languages. All of the ESLs. Yeah, all of which were totally arcane if you want my honest opinion and had like lots of gotchas, you know, like for instance, I'm sorry, this is a rant, but if you wanted to do arithmetic, which of course is something you'll want to do in writing queries,
Starting point is 00:30:23 you could use the minus operator. Okay, no surprise. But if you had variables that you were subtracting, and you didn't separate it by spaces, it allowed hyphens to be in variable names, and it would just like silently fail big, oh, that's not it, it would just substitute a zero for that expression. And it's crazy stuff like that, you know, and of course, since it was a handwritten DSL that wasn't particularly well documented or maintained, there really wasn't staffing to improve that. There was definitely a period of Google where it was kind of occur on
Starting point is 00:30:52 anytime you ran into any problem to write some kind of language. Some of these languages in the Borgman universe are pretty small to be fair, but the point I'm making is they each have their own grammar and their own rules. And most people, you know, basically just copy pasted someone else's board in order to hit their launch criteria. So there wasn't a lot of thought and care being given to writing maintainable code.
Starting point is 00:31:14 And it definitely is code. I mean, I think if I remember correctly, Gmail's configurations, which admittedly were generated programmatically, but those board configurations were like 50,000 lines of code or something like that. And it was totally inscrutable. So there's a lot of frustration about that kind of stuff. Whereas PromQL, I think is far more sensible, right? So a lot of the, I mean, I could critique this or that, but it generally makes sense, right? Like I get it. I think that I don't want to sound overly critical of Prometheus. This is not a good or a bad thing. It's just sort of recognizing that every system is designed for a certain set of problems or whatever. But Prometheus did inherit one of the most problematic things by
Starting point is 00:31:50 Borgman, which is that it wasn't really designed for distributed query evaluation. You can kind of do it, but you have to manually shard the thing yourself. And that's a very difficult thing to maintain, to do all the rebalancing and things like that. And I think the initial effort at Google was actually basically, I mean, it wasn't Prometheus, but it was almost like Prometheus. Let's fix Borgman and build in a new system that has the same scaling characteristics, but has one language, not five,
Starting point is 00:32:16 but better language, you know, improvements to this or that, you know, a better internal time series store, things like that. But it was still basically the same architecture. And my recollection is this guy alan donovan is another person who's a lot smarter than i am really really clever person but he he was working on this stuff at the time and i think his observation was like if we're if we're saying that this system borgman has tons of issues how could it be the right thing to re-architect it and have the kind of block
Starting point is 00:32:46 diagram be exactly the same, but have each block just be better? Like, shouldn't we be thinking about this a little bit more holistically and to really examine the problems that people are having? And I think when we did that, we realized that, you know, the number one problem that was causing a lot of the other weird gesticulations people are doing was the fact they had to manually shard and balance this thing. And that distributed query evaluation was kind of a hack. So the thing that made Monarch so interesting and also so difficult was that it really was horizontally scalable and that users did not need to worry about where their data was being balanced.
Starting point is 00:33:20 It was also multi-tenant from day one, which allowed a central team to run it for all of Google instead of trying to repeat that effort with every team in their own little cluster. And it made the design much, much harder, but ultimately I think more robust. And I'm not knowledgeable enough about Prometheus to know how much effort would have to go into making it really do that. I've seen Thanos has added some functionality like this, but I think that the query evaluation, really pushing that down and making sure that you do as much of the aggregation as you can at the lowest level and then bring things back up. There's a lot of subtleties to that that I think we felt like we had to build into the design pretty early.
Starting point is 00:34:01 That's the thing that we were really trying to escape from with Borgman was, was a design that, that made it difficult to do distributed query evaluation and that's difficult to handle really large data sets that don't fit in a single feed. Cause that's what people that's, that was the underlying pain point in Borgman that led to a lot of other, other pain points. That is super interesting. The, the name Alan Donovan, I think I've seen it with like Bazel Git git logs like he wrote like star lock for go and i think i might be in that google group so yes that's right he also i think he wrote the the one of the sort of official looking go programming languages books he has a programming languages background really nice person very very intelligent guy but but i
Starting point is 00:34:43 credit him with sort of forcing us to step back and really think about what problem we were solving with Monarch. And yeah, that was really fun though. I love building that system. That was probably my happiest time at Google was the summer that we were prototyping that. It was just like amazing team. It went very quickly. It was a lot of fun. So can you talk more about the distributed query evaluation? Like, I don't fully get why it's problematic. So let me explain it to you and you can tell me where I'm wrong. So what you're saying with Borgmon is that query evaluation mostly happened at the higher layers where, yeah, I guess if you could just explain to me, because I don't fully grasp
Starting point is 00:35:18 it, like what exactly is the difference? And like, sure, sure, sure. Yeah. Yeah. I mean, I wasn't being clear. So it totally makes sense. So let's take, let's take a simple ish query, like you want to understand the ratio of error of your error rate to your total request rate across your application, and you have for all these time series is large. To put this in context, some of Gmail's metrics were distribution valued, so the actual value type was a histogram.
Starting point is 00:36:00 And a single metric with all the cardinality turned into 250 million time series in the steady state. So very, very, very high cardinality, like, you know, surface area that we're trying to aggregate around. And the problem that you have, you're trying to do that sort of query, that error ratio query, it's a join. So you have two different queries, you have a rate query and account query. And then you have to compute, you have to create buckets for each of these RPC methods, just doing a group by and then within each of those, you have to do a bunch of math. One option is to basically have a query evaluator at the top of the stack that just talks to all of the sort of like leaf nodes, and in Monarch, we call them leaves, each leaf node, and you would say, okay, give me all the data you have
Starting point is 00:36:39 for this particular metric, and they would stream the data back to you, and you just do that. You do the math. It turns out the data size is large enough that if you do that, your query evaluation times, you know, move into the 10s of seconds or minutes, in some cases, it's kind of a non starter. So instead, what you'd like to do is say, Okay, fine. So we'll compute this at the leaves. But the problem is, and this is the most important point, if you're a grouping by RPC method, there's absolutely no guarantee at all.
Starting point is 00:37:05 In fact, it's just not true that all of the data for one RPC method is going to be on one server or another. So each of the services, each of the monarch leaves is going to have some portion of the data. So what you want to do is compute what we would call a partial aggregation. So you, basically everyone computes their part of this, you know, particular query. So they each make the RPC method buckets. And then they pass those partial results up the stack to the mixer level where now you've done the aggregation. So the data size is pretty small over the wire. And then you complete the aggregation.
Starting point is 00:37:37 Now that you have all the data, get the final numbers for both the error and the count. And then at the last step at the top, you join the two things, divide them all, and you finish your result. So the most important thing to understand is that it's not possible for the lower level nodes where the horizontal scaling has happened. They cannot compute the final number because they don't have enough data to do it.
Starting point is 00:38:00 So they have to have some way of communicating partial results back to the top of the stack. In that example, it's not that difficult. But in terms of the full query plan for an arbitrary query in the language that we are doing, it's a lot of subtlety and complexity to how those different types of queries can and cannot be pushed down to be evaluated at the leaves. And if you ever end up in a situation where you need to pull all the data up into the mixer level, the whole thing totally falls apart from a performance standpoint. And oftentimes, even from a feasibility standpoint, like you end up booming that thing if you're not careful. So
Starting point is 00:38:37 there's a lot of like streaming eval and pushback on channels. So you don't like flood the thing that's getting, you know, this huge fan in from all the children and stuff so this guy john banning who yeah another person much smarter than i but he he designed that thing and worked on it for years and years to kind of optimize and optimize and optimize it and yeah it's really interesting piece of technology but it's it was just quite subtle and i think if we hadn't designed it for that initially, it's just hard to make a query model that isn't designed to create these partial aggregates. I think it's hard for me to imagine how you would shim that in after the fact
Starting point is 00:39:13 because it really changes the way you approach the computation because you have to be able to kind of truncate the computation and send it as a partial computation up the stack instead of as a set of query results. And I'm not saying it's impossible. I just think it ends up getting, I'm assuming it gets pretty ugly. So that's the thing that I was referring to. Does that make sense? Yeah, that makes sense. Are you saying that the query language itself also needs to be designed with this thing in mind? Or is that
Starting point is 00:39:39 mostly okay? It's just the way you shard out and you make your query plan. We really tried not to put constraints on the query language because of this. There were certain types of joins that were really hard, like a full relational join. It wasn't even, it was deeper than this problem, but it wasn't designed to be a distributed relational database. And those things are really quite difficult to implement. But the joins are where I think we difficult to implement but the the joins are where i think we had to kind of draw the line on some of the functionality but a lot of the other stuff you know it was kind of the query language has some very powerful capabilities especially
Starting point is 00:40:14 dealing with the time dimension that also was you know much more limited than you it wasn't as powerful as a lot of sql-like languages for doing just general purpose computation and stuff like that. So it was very much designed for time-series data. But I would say it seems to have been general enough to represent a wide variety of operational use cases. I just wouldn't want it to, you know, it wouldn't take the place of BigQuery or something like that for general purpose computations yeah and one thing i think listeners should like note is that you might think that you know users are not running that many queries so like how does it matter but like a lot of people write alerts and like monitors and they're basically super complicated and they have to be evaluated a lot and i'd imagine that a lot of your load
Starting point is 00:41:01 was from these alerts that have to be continuously evaluated to make sure. And it's super critical that they fire quickly. You don't want there to be a production outage and there's a slowdown due to the monitoring system. Yeah, certainly the case. Certainly the case. So it's a really interesting problem. I think Dropbox tried to build their own and I think they rolled it out successfully and they just put cardinality limits. You can't have queries with super high cardinality, but very few other limits. Because I think the cardinality
Starting point is 00:41:32 explosion is where a large source of problems is. Is that kind of accurate? Yeah, that's definitely accurate. So that's another rant if you don't mind me going in that direction for a minute. I mean, I've really got i this is one of those things i did not realize you know working on monarch but but i've come to
Starting point is 00:41:49 believe that there's there are really two types of monitoring data there or telemetry i guess you could call it there is statistical data which we usually call metrics and there's transactional data which we usually call traces or sometimes logs or structured events or something like that. And those are the two flavors of data. And the Achilles heel for the structured events, the traces for the transactional data, the Achilles heel there is that high throughput, just retaining it for a long time is just really expensive because there's so much of it, right? Like you're processing a lot of transactions. It's big data.
Starting point is 00:42:22 It's expensive. I don't mean big data in the sense of big data, but it's a lot of data, right? And then on the metric side, you can handle the high throughput very naturally. The only thing that changes the value of the counter just goes up, but it doesn't actually make the metric data larger. It just changes the numbers if you have high throughput. So the Achilles heel for metrics data is high cardinality. I actually wrote some stuff about this on Twitter a few weeks ago, which I can send you afterwards if you want to include it in an article or something. But the thing that's so frustrating is that cardinality is necessary. It's totally
Starting point is 00:42:53 inevitable that you're going to want to include some tags in order to understand and isolate symptoms of interest. And I think metrics should be used to isolate symptoms of interest that connect the health of some part of your system to the business, like at some abstract level, that's really what you're supposed to be doing with metrics. And then I think because that's the tool that people use, it just becomes, you know, the hammer for every nail that they see. And you just try to use cardinality to address every aspect of observability, which is a complete disaster from a cost standpoint and from user experience standpoint.
Starting point is 00:43:28 And I will try to elaborate on that. So let's go back to this example of RPCs or something like it's totally fair and smart actually to have a tag on your RPC metrics for the method because you want to distinguish, you know, writes and reads or whatever. Okay, totally fine. Because those are different things you might want to independently measure from a health standpoint. But then you might want to say, well, okay, right traffic spike or right latency spikes, and I want to understand why. And so if you're using a metric system, the only tool you really
Starting point is 00:43:57 have to understand that variance, or that change is to do a group by on some other tag and hope to see that one tag value explains that blip. And so you now have a lead to go and follow that tag value where it leads you, whether it's a host name or whatever. And I understand the appeal of that, but there's two problems. One, like you add a couple more tags
Starting point is 00:44:18 and the combinatorial explosion is immediate and you're suddenly spending a ton of money, whether it's on Prometheus or a vendor or whatever, it's really expensive. And then I think the even more pressing issue is that in the code, you only have access to things that are locally available. So you have access to your own host name
Starting point is 00:44:35 and things like that. But it's often the case, in fact, I think the numbers I've seen, which ring true for me, is that 70, 75% of incidents in production are caused by an intentional change like a deployment or a config push elsewhere in the stack that version change or config push is not going to be in your local tags like if you're in service a at the you know or if you're in
Starting point is 00:44:59 service b at the bottom of the stack and service a at the top of the stack pushes a new version that's flooding you with traffic service a's version is not going to be available for grouping and filtering anyway so you you're paying a lot of money to have all this cardinality and you can't even group by the thing that would explain it now in the transaction data and the tracing you absolutely do have that the traces flow through both services these health metrics are linked to the transactions via hosts, via service names, via method names, all sorts of stuff like that. Like there are ways to pivot over to the tracing data programmatically in an observability solution. And in the transaction data, the high cardinality is not an issue. You can do an analysis of thousands of traces in real time and actually
Starting point is 00:45:43 understand that the thing that changed is that before, you know, service A above you was on version five, and now it's in version six. And that, in fact, is a thing that explains the difference in the health metric we started with. But using cardinality as a way to do observability is a big mistake. Sorry, metric cardinality as a way to do observability is a big mistake. i think that metrics should be used to understand health and nothing more and then you have to be using observability tooling that's smart enough to pivot over to the transaction data where it's both cheaper and more effective to understand these sorts of systemic changes that lead to these health changes in the first place so i i mean at google I think we were actually way behind, you know, where we are now,
Starting point is 00:46:25 honestly, with LightStep in terms of how we would pivot from time series data over to transaction data and back again. But that's really the essence of it, because you can't do high throughput with one, and you can't do high cardinality with the other. So you have to be able to use tools that pivot from one to the next, you know, intrinsically. And I think that's the thing that allows you to kind of get out of that cardinality trap more than anything else. It's not setting a cardinality limit. It's just not needing it to be in your metric data in the first place. I think that's really the solution that we'll find ourselves pursuing over the next couple
Starting point is 00:46:58 of years. No, I think that makes total sense. And maybe just super quick to explain to listeners, listeners, why is high cardinality bad? And I think the answer is because in a time series database, when you have a data point with a different cardinality, you have to basically store it as a different row or a different column. And that's what causes the explosion. Yeah, that's basically it.
Starting point is 00:47:20 And I mean, high cardinality isn't bad. High cardinality metrics are bad, right? I think the issue is that in a time series database, you can basically think of it as a huge spreadsheet and each row is a different time series. And it turns out that creating a new row is a lot more expensive than creating a new cell. Because you're just incrementing like a number in that cell.
Starting point is 00:47:37 Yeah, or adding another data point to an existing time series is far cheaper than creating a new time series. I think that, you know, so the issue with high cardinality, you know, it's such a, it sounds so esoteric, but it ends up being an issue that like your chief finance, financial officers couldn't start caring about, right? Because you can add one piece of code, like literally one line of instrumentation that
Starting point is 00:47:57 says, I've got this metric that tracks requests. I'm going to add customer ID and host name. And then it just explodes. And then every single value, every single iteration of that line of code is going to create a new time series. And so suddenly you're dealing with, and you're probably dealing with,
Starting point is 00:48:14 it's not vendors being evil, by the way. I mean, they have cogs they have to pay for, right? But if you write a piece of code that incurs 10 million time series in some tstb somewhere that's going to be expensive no matter what tstb you're using so i mean some are more expensive than others but it's still just different flavors and expensive and and i think ideally the developer can add that code a platform team can write some kind of control that says i never want cardinality for any metric to exceed x. And then you can do some
Starting point is 00:48:45 kind of top K thing to to retain the high frequency data and aggregate the rest. I think that's the kind of long term vision I have for how cardinality should work. But really, it'd be great to use event data tracing data for high cardinality, where you don't pay a penalty at all. I mean, it just literally doesn't matter in terms of the cost of those solutions. And then stick to metrics for, you know, high throughput data where you need precise answers about critical symptoms. Yeah. And then the flip side of that is like, why is cardinality not a problem for tracing? How do you store tracing data in a way that cardinality isn't an issue at all? I think it just depends on how you index it, right?
Starting point is 00:49:21 So, I mean, cardinality is really an issue with the index in a TSDB. In a tracing database, it's actually interesting, there are many different ways of doing it. Some people have column stores. I don't work at Eaglem, so I don't want to speak for them, but I'm pretty confident that their underlying database
Starting point is 00:49:38 is a column store where, you know, you have different trade-offs in that type of world. What LightStep does is too complicated for me to explain right now, but we have our own way of managing cardinality in the tracing data. So it's not that it's free or something, but it's not an issue like it is for a time series database. So I think that different people have addressed it in different ways,
Starting point is 00:49:58 but I don't feel that's a very satisfying answer to your question. In a time series database, because of the locality constraints you have around time series, adding cardinality just has a fixed cost that's relatively high. I think that's probably the best way I can explain it. Okay. Is the index in tracing data just the trace ID, or like the hash, basically?
Starting point is 00:50:18 Or like a service, which is relatively straightforward to index? There's a lot of different ways to do it. I mean, that's why I'm kind of hedging on this, because I'm trying to be precise. I like being precise. And there's actually quite a diversity of ways that it's done right now, so I can't specify how it's done everywhere in the ecosystem. In Dapper, we did have a few indices that we special-cased, and then if you wanted anything that wasn't one of those indices, you had to write a MapReduce, you know, which is super high latency, right? But other systems have, you know, no limits on cardinality. Others will, you know,
Starting point is 00:50:49 index up to n values of every key and no more. I mean, there are a lot of different approaches to it. Yeah, that makes sense. And I can read up more on like how Jaeger works. But overall, like, yeah, when you search for like a trace ID in Jaeger, you get that fast. Everything else is not that fast. So that makes sense to me. And I think that was a lot of good information. I feel like I've just learned a lot on how all of this infrastructure works. For sure. Yeah.
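To make the indexing trade-off concrete, here is a rough Python sketch, hypothetical and not how Dapper, Jaeger, or LightStep actually store spans: a trace store that indexes only a couple of chosen fields. Looking up by trace ID or service is a cheap index hit, while querying on any other high-cardinality tag falls back to a scan, which is slower but adds no per-value storage cost the way a new time series does.

```python
from collections import defaultdict

class ToyTraceStore:
    """Spans are stored once; only a few chosen fields get secondary indices."""

    def __init__(self):
        self.spans = []                       # the raw event data
        self.by_trace_id = defaultdict(list)  # indexed: fast lookup
        self.by_service = defaultdict(list)   # indexed: fast lookup

    def add(self, span: dict):
        idx = len(self.spans)
        self.spans.append(span)
        self.by_trace_id[span["trace_id"]].append(idx)
        self.by_service[span["service"]].append(idx)
        # High-cardinality tags like customer_id are stored but NOT indexed,
        # so they add no per-value index cost.

    def get_trace(self, trace_id):
        # Indexed path: direct lookup by trace ID.
        return [self.spans[i] for i in self.by_trace_id[trace_id]]

    def scan(self, predicate):
        # Un-indexed tags require a full scan: slower, but cardinality-free.
        return [s for s in self.spans if predicate(s)]

store = ToyTraceStore()
store.add({"trace_id": "abc", "service": "checkout", "customer_id": "cust-42"})
print(store.get_trace("abc"))                               # fast, indexed
print(store.scan(lambda s: s["customer_id"] == "cust-42"))  # slower, scanned
```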
Starting point is 00:51:17 Do you have anything that you want to add on top of this? You've learned all of this stuff. You left Google in 2012. What was like the one thing that was just really something that you still use or some information that you still remember from that time? Like one design principle perhaps, or just one way of thinking about things. That's a good question. I think I would actually, I kind of referred to it, but not as precisely. But one thing that Jeff Dean said a few times, which really did stick with me, was just this idea that you really can't design a system that's appropriate for, you know, more than like three or four orders of magnitude of scale. Appropriate being the key word, right? And this goes back to this idea that systems at Google were not better because they're more scalable. And one thing I liked, it's not
Starting point is 00:52:08 about Google, I think most companies are like this, but the in-house technology at Google came with pretty accurate advertising for what it was good for and what it wasn't good for. And there was no shame in saying, yeah, this database is good until you hit a certain scale, and then it's not, you know, and like, this one's terrible if you go below that because it doesn't have all these features that you'd expect or what have you. And my experience outside of Google, you know, whether it's open source software or vendor software, is just that people are, you know, understandably, I guess, reluctant to describe the scale that their system is appropriate for. And I mean, it's a great question to ask, actually, if you're talking to someone about something
Starting point is 00:52:45 that they're really excited about and they're trying to pitch you on it, just say, so tell me, like, what's too much scale for this? And what's like too little scale for it, you know? And if people can't come up with an actual answer to that, I think that's a bit of a red flag in my book. And that was something that really stuck with me
Starting point is 00:53:01 after Google. And I think it applies to any technology that you're building. And also, it's good to be humble about that, too. If you built some really scalable thing, you can remind yourself that it probably doesn't do something a less scalable thing could do, or vice versa, you know. And just try to think about fitness for purpose.
Starting point is 00:53:17 That's maybe the thing I feel like comes up over and over and over again. Anything from engineering to product, almost to marketing, right? To think about what market segment is this really appropriate for? Where would the scale that we're targeting live in the marketplace, right? So I think it's relevant at that level too.
Starting point is 00:53:39 Yeah, it reminds me of that meme, MongoDB is web scale. What does that even mean? The web is so different for so many different things, there's no real concept of web scale; it just sounds fancy. Yeah, my current company uses MongoDB and it seems to work so far, so yeah, it's fine. Yeah, it's probably fine. Anyway, I have so many more questions. I want to ask you about OpenTelemetry and OpenTracing and all of these things, and LightStep,
Starting point is 00:54:10 but I think it'll be nice if we do that in a follow-up episode. This was great, and I feel like I learned a lot, so thank you so much for being a guest. My pleasure, thank you. Great questions. Thank you.
