Software at Scale - Software at Scale 44 - Building GraphQL with Lee Byron
Episode Date: March 22, 2022

Lee Byron is the co-creator of GraphQL, a senior engineering manager at Robinhood, and the executive director of the GraphQL Foundation.

Apple Podcasts | Spotify | Google Podcasts

We discuss the GraphQL origin story, early technical decisions at Facebook, the experience of deploying GraphQL today, and the future of the project.

Highlights (some tidbits)

[01:00] - The origin story of GraphQL.

Initially, the Facebook application was an HTML web-view wrapper. It seemed like the right choice at the time: the iPhone launched without an App Store, Steve Jobs called it an "internet device," and Android phones came out soon after, with Chrome, a brand-new browser. But the application had horrendous performance and high crash rates, used up a lot of RAM on devices, and animations would lock the phone up. Zuckerberg called the bet Facebook's biggest mistake. The idea was to rebuild the app from scratch using native technologies. A team built a prototype for the news feed, but quickly realized there weren't any clean APIs to retrieve data in a format palatable for phones - the relevant APIs all returned HTML. Facebook did have a nice ORM-like library in PHP for fast data access, and there was a parallel effort to speed up the application using this library, plus another project to declaratively specify data requirements for this ORM for better performance and developer experience.

Another factor was that mobile data networks were quite slow, and a chatty REST API for the news feed would have meant extremely slow round trips and tens of seconds to load the feed. So GraphQL started off as a little library that could make declarative calls to the PHP ORM library from external sources; it was originally called SuperGraph. The last piece was making the language strongly typed, drawing on lessons from other RPC frameworks like gRPC and Thrift.

[16:00] - So there weren't any data loaders or any such pieces at the time.

GraphQL has generally been agnostic to how the data actually gets loaded, and there are plugins to manage things like efficient data loading, authorization, etc. Facebook also didn't need a data loader, since its internal ORM handled de-duplication, so one wasn't built until there was sufficient external feedback.

[28:00] - GraphQL for public APIs - what to keep in mind. Query costing, and other differences from REST.

[42:00] - GraphQL as an open-source project.

[58:00] - The evolution of the language, and the new features Lee is most excited about, like client-side nullability.

Client-side nullability is an interesting proposal: clients can explicitly state how important retrieving a certain field is, and on the flip side, allow partial failures for fields that aren't critical.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining us today is Lee Byron, the co-creator of GraphQL, a senior engineering
manager at Robinhood, and a director at the GraphQL Foundation. Thank you so much for joining us.
Yeah, absolutely. My pleasure. Thanks for the invite.
Yeah, I have been really excited to talk to you for a few months since we decided at my company,
we're going to be exposing a GraphQL API. And I'm so happy I finally get to talk to you.
Yeah, always fun to tell some stories.
So I want to start with, you know, a history of GraphQL, right? Like, how did we decide,
like, how did Facebook decide that we need to do something different with our application? I know
you have a talk that talks about, you know, deploying a native app or an HTML5 app. I'd
love to get to know some of the behind the scenes,
going from we need to do this to deploying it and seeing whether it works or not.
Yeah.
So it's been almost 10 years
because we started the effort that became GraphQL
in early 2012.
So 10 years ago, which is kind of crazy to think about.
At that point, GraphQL actually didn't feel like the wild part of that adventure. The wild part was actually rebuilding our iOS app.
So at the time, our iOS app was awful. It was really bad. The core technology layer,
and I'm definitely partially, if not primarily, to blame for this.
The core technology stack was a wrapped web view in a native shell.
The native shell would do all the things that you could really only do natively that there wasn't a web API for.
So if you needed to bring up the camera to take a picture, or we had just launched this thing called Places where you could check in.
Remember when you'd go places and check in?
Those are all powered by native code.
But the rendering layer was entirely web-based.
And the bet was really this idea that when Steve Jobs introduced the iPhone,
he was like, it's a communicator.
It's an internet device.
And if you want to build an experience for it, you just build for the web. You couldn't build apps for the iPhone when it first came out. You could only build websites. And then there's Google, like the internet company number one, you know, and they make this brand new Chrome browser. So my bet was
that these companies were going to go head to head and compete on web rendering technology on this
new mobile platform. I was super wrong. That's not what happened. They instead actually
really crippled their web platforms on these OSes and instead leaned into creating the walled garden
that we really have today,
where natively run apps only work on these platforms,
and it is what it is.
So we had gotten pretty far down this road
of betting on web as a common UI stack
that would work portably, whether on mobile web if you hit it in a browser on any platform, or within the core of this iOS app.
We had a similar one that was within the core of the Android app.
With this idea of, you know, Facebook, or Meta now, is a giant company, but at the time it was not a particularly huge company.
There weren't that many product engineers.
And mobile in general was kind of new.
So you would go to these teams that were super focused on building a desktop site and say,
Hey, we really need you to ship mobile product. And they were like, uh, okay. Yeah. Like,
what does that entail? And you're like, well, you got to build a mobile web version and a
feature phone version and iOS version and Android version. Like, all right. So you're not talking
about building one extra thing. You're talking about building five extra things. Yeah. No,
thank you. We're not going to do that. And we were kind of really caught in a really tough situation pretty late in the game when we realized that we were behind on just straight-up product coverage. There were plenty of features that just did not exist on mobile, and the ones that did were low quality. App speed was bad.
Crash rate was really high because it turned out
using the web stack
as a rendering layer
ate a lot of RAM
and these older devices,
they just didn't have that much RAM.
And rather than gracefully
evicting stuff in the background,
they would just hard crash.
There were just showstopper bugs.
You try to do some animation
and the animation would just lock
and then the whole browser stack would crash. And it was just unviable. And I think at some point, Mark Zuckerberg
was in some interview and someone asked him about this mobile strategy being off. And he said,
betting on HTML5 was like the number one, the worst mistake that Facebook had ever made.
Me and my team were like watching him like, oh, oops, that's us. That was our fault. We are the number one mistake that Facebook
ever made. So really, the big bet was this idea that we were going to rebuild a mobile app from
scratch. And at this point, in part, because of the strategy that we had taken to invest in the
web, but also in part, just these platforms being new and recruiting being hard.
We just didn't really have that many iOS experts at the company.
We had a couple,
I think we had maybe three or four at that point that we would have considered really experts. And that meant, since the platform was new, these were Apple engineering experts. They had been writing Objective-C forever, and the iPhone was new for everybody, but they had already understood how to build for that
platform. We had a few of those. We were starting to train people within the company on how to do
this, but we needed to double down, triple down on this. So we started a brand new app and we decided that we were going
to start with one feature that was going to be newsfeed. That was the very first thing you see
when you open the app that was going to be purely native, no web technology anywhere.
And then there was going to be a compatibility layer that bridged out into all of the existing
stuff that we had already built. So all of the existing features would still be there. This would give us the ability to ship iteratively.
We could ship a native experience newsfeed,
and then piece by piece,
we could move more and more things onto this new technology stack.
So we spun up this group of engineers and said,
hey, go build a prototype,
something feasible in the next three months.
And they did this.
This was maybe late 2011 or super early 2012. And they did this, they came back, and they said, Hey,
you know, we've got this, this app, we think it's good, we think it's actually probably ready to ship. And I took a look at it, and noticed that it was missing a bunch of content in the newsfeed
and said, Hey, okay, this looks great. But like, when are you going to get these other kind of
story types showing up in the newsfeed? Like, what, okay, this looks great. But like, when are you going to get these other kinds of story types showing up in the
newsfeed?
And they're like, what are you talking about?
All the story types that come back from us from the API are here.
And then like, all of a sudden, I had that moment where like the blood rushes out of
your face.
You're like, what API?
Because we just like completely, of course, missed this.
And everything is web technology based.
There is no separation of concerns.
Like there's just one
big ball of code back there that talks to databases and services and does business logic
and it munches it all together. And its output is HTML. In fact, there were backend systems at Facebook at that time that literally yielded you blobs of HTML that you were supposed to interleave into the rest of another page, which was kind of frustrating for other reasons if you were trying to build something for mobile
because they would stick desktop CSS in there
and you'd have to figure out how to fix it.
There was no concept of a data layer abstraction.
That was not really a thing.
Except for the fact that there was a software-based data layer.
So not an API tier.
Typically today, we would think about these are services,
whether it's a database-level service or a high-level data service
or even a GraphQL API or REST API.
You yield them some requests, whether that's over gRPC or HTTP or whatever,
and it comes back to you with data in some form.
That's not what this was.
This was like kind of like an ORM, like within the runtime of the server would be these PHP objects that would describe every kind of thing that existed within the Facebook ecosystem with getter methods.
And we had just started in on this path to what has
now been the hack language for a long time. At that point, it was still kind of finding its name.
It was brand new. But the one feature that they had added early on was async await,
which at the time was really novel. I think C# was the only other language that had this;
that's where we borrowed it from. And that feature was taking the code base by storm. There was a massive sort of performance
and speed effort that was being run sort of orthogonally to all this happening. And so
anyway, we found ourselves in this situation where a brand new iOS app was getting built. It was consuming these APIs that they had found, and actually what they were were three or four year old APIs that had been built as one-offs for some partner integration and then abandoned. They were just ancient and missing stuff and slow, and it's like, yeah, no, we absolutely cannot ship to production on this, we've got to go build a new thing. And at the same time, we had these ORM-ish objects.
They were called Ent, E-N-T, for entity,
inside the data layer
or inside just the application code at Facebook.
And we thought, okay, really what we want
is we want to take this ORM-ish Ent framework
that actually lots of people really enjoyed using. It was a
very well-built abstraction, but we wanted to be able to access it from iOS code in the same way
that a PHP engineer would be able to access it from within the big ball of PHP code.
You can't just do that, right? You need to be able to go back and forth across the network.
And we sort of said, all right, well, how does this thing work?
And, you know, what are its core primitives?
And really the single biggest sticking point was this idea that Ent, being part of the backend system, was not afraid of going back and forth from the database multiple times.
You know, if you needed to say,
I need to load all of this person's friends. Great, done. Okay, now I also need their profile pictures.
Great, just go back again, get more data. No problem. Because you're talking about requests
within a data center pretty fast. We didn't want to do that from an iOS app because even these days,
we would consider that probably poor design um but back
in 2012 you're talking about really slow 3g networks where it's not just that the bandwidth
is slow it's the latency is slow multiple seconds per round trip so if you have to go back twice
you're talking about changing a page's load time from three seconds, which would have been considered fast at that time, to six. And
if you have to do two or three layers of first I load this, then I load B, then I load C,
it's really easy to end up in the tens of seconds to load a screen, which would have put us right
back where we started with the mobile web-based thing we were trying to replace in terms of
performance. So we needed to come up with a way to keep the programming model that we liked from this ORM layer called
Ent, but leave the back and forthiness of how it worked under the hood on the PHP side and not in
the iOS side. And it turned out that there was this experimental API that we had been
playing with on the Ent side called Ent loader, which was the ability to declaratively state the relationships between the data that you wanted, so that basically a query scheduler under the hood could figure out how to do the most performant thing with the Ents under the hood.
And we thought, great,
how about we just, you know, write something that can directly translate to building an Ent loader
on the fly. So we wrote a little piece of code. My co-creator Nick Schrock wrote that. He called
it Supergraph. And the first version's parser was a bunch of regular expressions, you know, it was not pretty, but it looked like PHP code. You know, imagine taking all the features away from PHP except, you know, dot, method name, and then parentheses. It's like that syntax was the only thing that remained. But it was just enough to mirror the code that you would have written in PHP to write that Ent loader and the sort of dependencies of the data that I need.
And so you would write that, and it would get parsed. You know, you just send it as a string to the server, it gets parsed on the server from a string into an intermediate data structure. That then, sort of on the fly, builds one of these Ent loaders,
runs it, takes the data,
and then sends it right back to the client.
This was sort of prototypal GraphQL.
That was early 2012.
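To make the shape of that prototype concrete, here is a minimal, hypothetical sketch of the flow Lee describes: a query string is parsed on the server into an intermediate tree, walked against getter-style backend objects, and returned as plain JSON in a single round trip. The names and structures below are illustrative only, not the actual Supergraph or Ent code.

```typescript
// Hypothetical sketch of the flow described above, not the real Supergraph code.

type QueryNode = { field: string; children?: QueryNode[] };

// The kind of intermediate structure a server-side parser might produce for
// something like "viewer { name, friends { name } }".
const parsedQuery: QueryNode[] = [
  {
    field: "viewer",
    children: [{ field: "name" }, { field: "friends", children: [{ field: "name" }] }],
  },
];

// Stand-in for the Ent-style data layer: objects exposing async getters.
type Entity = { [field: string]: () => Promise<unknown> };

async function execute(nodes: QueryNode[], source: Entity): Promise<Record<string, unknown>> {
  const result: Record<string, unknown> = {};
  for (const node of nodes) {
    const value = await source[node.field]();
    if (node.children) {
      // Sub-selections recurse; list-valued fields fan out over each entity.
      result[node.field] = Array.isArray(value)
        ? await Promise.all(value.map((v) => execute(node.children!, v as Entity)))
        : await execute(node.children, value as Entity);
    } else {
      result[node.field] = value;
    }
  }
  return result;
}
```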
So we put this in front of the mobile engineers
and we're like, is this useful?
Is this helpful?
And they were like ecstatic.
They're like, this is insane.
This is amazing.
Yes, this is absolutely what we need.
And at the same time, I was not working on the project at the time, actually.
I was sort of knee deep in trying to get these iOS engineers focused on the right things.
I was taking a different angle at the API problem.
The thing that stood out to me was all of our API technology at Facebook at the time was sort of one layer deep.
You couldn't ask for dependencies of data.
And there was no typing information.
You just sort of hit an endpoint.
You got a big blob of JSON back.
And the point that I jumped to was these RPC languages that all were based on a firm type system with relationships.
And they're self-describing
and there was auto-generated documentation. I was like, that's where we want to be. We want to be where gRPC is, or where Thrift is, or Protobuf, or Cap'n Proto. Like, that technology is where we want to go towards. And so there was this magical moment between Nick Schrock, who had written the Supergraph prototype, and me, who had this sort of insane 3,000-line prototype of a News Feed API all written as a Thrift document with no idea of how I would actually translate that into something that would work.
Like, what if we put these two ideas together?
And that's what we did.
So we built a version of Supergraph on top of Newsfeed.
Our other co-creator, Dan Schaefer, got involved.
He was on the News Feed team and was able to sort of make sure
that the pieces on the PHP side would work the way we wanted them to work.
And that was the origin. It worked.
We did a lot of iteration on the syntax to get it into a state
where the iOS engineers kind of understood what they were looking at
and not just like fake PHP code. And that unlocked our ability to ship that app.
In late summer 2012, we shipped a version of the Facebook for iOS app that was a completely
native News Feed that was entirely powered by GraphQL. And it loaded very complicated
News Feed endpoints in two and a half seconds.
It was pretty incredible. And right after that, it started to expand like wildfire. We had
teams sort of like pounding at our door saying, this is super cool. How do I build for your iOS
app? I want to add a screen to the iOS app. But also like this GraphQL thing is interesting,
like how can I use it? And there was enough energy around that that we decided to create
a whole team around that.
So we actually created the GraphQL team in early 2013, sort of in response to that demand
and kind of the rest is history.
That's such a cool origin story for technology.
And I guess at this time, there were no bells and whistles.
It was pretty much like a JSON schema.
And there were no data loaders or anything at this time.
Is that roughly correct?
There were... So I think this is part of the reason why I say that GraphQL, well, I think that's the piece that has longevity from this effort, although the iOS app today still has its code base origins in that effort, so that has certainly lasted too. But the reason why we were able to build something in a matter of a couple of months was that we were standing on top of many years of work. That Ent framework, which was also originally authored by Nick Schrock, who was the one who came up with the Supergraph concept.
He had been working on that for close to three years at that point, which started as a very,
very lightweight thing, just noticing common patterns and how people were doing data access
and trying to build an abstraction around that. And then encountering sort of problem after problem after problem that this thing could solve. So we early on in like maybe 2009 or 2010, the dominant problem that we faced was access control. a set of business logic that would determine whether or not a particular person was allowed to see a particular piece of data. And then if the answer was yes, then it would fall through
to the next piece of code, which would actually then go load that data and get it back and put
it in the appropriate place in the screen. And it was just way too easy to make a mistake.
And, you know, one subtle bug in that first piece of logic, and it falls through and it loads that data. And
lo and behold, people write bugs sometimes. And it was, it meant you would have to unit test every
single place where you would access data, which, you know, we had some, but also tests only test
what you write them to test. So there would be things where we thought our test coverage was good,
but something still would kind of slip through the cracks.
So that was another thing that we added early on: the ability to describe access control rules that were tied to the type of data itself
rather than to the place where that data gets used.
So you actually cannot load that data without first running the access control rules.
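A rough sketch of that idea, with all names hypothetical (this is not the real Ent framework): the visibility rule lives on the entity type itself, so a call site cannot get the data without that rule running.

```typescript
// Hypothetical sketch: the privacy rule is part of the entity definition, and
// the only way to get an entity is through a loader that has already run it.

interface Viewer {
  id: string;
}

interface EntDefinition<Raw> {
  fetch(id: string): Promise<Raw | null>; // raw data access
  canSee(viewer: Viewer, raw: Raw): Promise<boolean>; // privacy rule for this type
}

async function loadEnt<Raw>(
  def: EntDefinition<Raw>,
  viewer: Viewer,
  id: string,
): Promise<Raw | null> {
  const raw = await def.fetch(id);
  if (raw === null) return null;
  return (await def.canSee(viewer, raw)) ? raw : null;
}

// Example entity type: a photo is only visible if the viewer isn't blocked by
// its author. blockedUserIds is a stub standing in for arbitrary business
// logic that may itself load more data.
async function blockedUserIds(authorId: string): Promise<Set<string>> {
  return new Set(); // stub
}

type Photo = { id: string; authorId: string; url: string };

const PhotoEnt: EntDefinition<Photo> = {
  fetch: async (id) => ({ id, authorId: "author-1", url: `/photos/${id}` }),
  canSee: async (viewer, photo) => !(await blockedUserIds(photo.authorId)).has(viewer.id),
};
```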
And another was the loading efficiency.
So the reason why you never heard us
when we talk about GraphQL, talk about query planning or anything like that, is that the Ent infrastructure had dynamic query planning built into it. Because at some point, long before we built GraphQL, there was this idea of, hey, our CPU and memory usage on our servers is way higher than
we want it to be. And if we need to continue to scale and grow as a company and have more people
use our services, then we have to be loading data as efficiently as possible. And so the thinking of what can you do in parallel,
what truly depends on what, so that you never want to wait to load data B if it doesn't depend
on data A. If there's no dependency there, you should start loading it as soon as possible.
That gets more complicated when access control rules themselves often require loading data.
So you get this thing where it's like, can I see this picture?
It's like, well, I don't know.
Has the author of that picture blocked you?
Oh, I got to go load the author of that picture,
and then I have to go load all the people they've blocked
to see if that person is in that list.
That's arbitrary business logic.
That's additional data fetching.
All of that has to get factored into that query planner.
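The heart of that scheduling idea fits in a few lines: anything without a dependency starts loading immediately, and only genuine dependencies are awaited in sequence. A minimal sketch with hypothetical stub loaders:

```typescript
// Minimal sketch of dependency-aware loading; the loaders are hypothetical stubs.

const loadProfile = async (userId: string) => ({ name: `user-${userId}` });
const loadFriendIds = async (userId: string) => ["f1", "f2"];
const loadPhotoAuthorId = async (photoId: string) => "author-1";
const loadBlockList = async (userId: string) => new Set<string>();

async function loadScreen(viewerId: string, photoId: string) {
  // Profile and friends don't depend on each other or on the photo check:
  // kick both off right away rather than awaiting them one at a time.
  const profilePromise = loadProfile(viewerId);
  const friendsPromise = loadFriendIds(viewerId);

  // The photo visibility check is a real dependency chain: author first,
  // then that author's block list, then the decision.
  const authorId = await loadPhotoAuthorId(photoId);
  const canSeePhoto = !(await loadBlockList(authorId)).has(viewerId);

  const [profile, friends] = await Promise.all([profilePromise, friendsPromise]);
  return { profile, friends, canSeePhoto };
}
```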
So these are just a handful of many examples of things that are built into that,
that piece of infrastructure. And even that final piece, that idea of like, actually,
there's a really high level API that sits on top of this, that allows you to sort of assemble
a high level query plan that says, these are the high level pieces of information I need,
this is actually the subset that I really care about. So you can get a nuanced understanding
of dependencies. And then here's what to do once that data is loaded. And that's what GraphQL ended up being built on top of in the first place. And then I think a lot of the work, both within Facebook in the years after we originally built that, but then even more so after open sourcing, was unpacking all of that,
like coming to an understanding of just how much was in this underlying
Ent layer that we were reliant upon that we needed to at least be able to
tell the story about to people who are using GraphQL.
And I think a happy accident of that is GraphQL is left in a very
agnostic state towards these kinds of problems, because all it is is a mapping layer between
essentially arbitrarily run compute on the server. It translates on the server side to
calling functions in some particular order. And those functions can return a promise or a task
or whatever your async primitive is.
But that's it.
There's no concept of data fetching.
There's no concept of query planning.
There's no concept of access control.
Not to say that those aren't important.
It's just that it's allowed to be agnostic to them
so some layer underneath can do it.
And that was a happy accident
because that's just happened to be
what our technology stack looked
like at the time.
So you create a system that doesn't have too many things attached, but you create a plugin system so you can easily include things like authorization. And you just didn't need data loaders at that time because Ent did all of that for you.
That was... you know, at first I didn't think open sourcing GraphQL was going to be a good idea.
We were kind of pressured into it by another team, the Relay team.
So the Relay team had built this really cool integration between GraphQL and React and had just watched React get open sourced and saw how sort of wildly popular that was.
We're like, wow, that was really cool.
We should open source Relay.
And they're like, well, open sourcing Relay makes no sense
if GraphQL isn't also open sourced.
And so they came knocking at our door and they're like,
would you guys ever consider open sourcing GraphQL?
We're like, well, you know, it's kind of complicated
and it's really kind of tied to all this Facebook-specific technology.
They eventually convinced me,
and that's what kind of led to the effort to open source it.
And not just throwing code over a wall,
but really this idea that if we gave people a big ball of PHP code,
they'd probably look at us like we were crazy and not use it.
And so instead, we kind of generalized it away from that
and made something a little more consumable
for a reference implementation.
But as soon as we had done that, immediately we start hearing about the problems that people
would have of, hey, how do you do X or Y?
Or, hey, the server that I built is really, really slow, and I'm trying to understand
why.
How did you go about solving this problem in GraphQL?
And of course, the answer was always, well, we didn't solve that problem in GraphQL. It was solved for us at some lower level. But that's kind of what
led us on this journey in working our way down the stack of abstractions that tie together and
make sure that we were telling the story of those, even if they weren't open source themselves.
Part of the way that we would do that is for companies that were really early adopters,
we would go visit them, especially if they were Bay Area adopters, we would go visit them,
especially if they were Bay Area companies.
So we would have a meeting at Intuit
and a meeting at Pinterest
and a meeting at a handful of these places
that were experimenting with this early on
just to hear them out,
hear what their infrastructure looked like,
hear what their early experiences were like
and started noticing common problems that were surprising to us.
Things that felt like not where we would have expected the bulk of the work would be.
And like, oh, that seems like an obvious thing to solve.
But as we would describe how it was solved for us, it was non-obvious.
It was only obvious to us because we had been staring it in the face for so many years that it seemed like, of course, that's the way that you would do it. But this kind of goes back
to when we had done this transition to our own programming language, we were building this Hack programming language. And one of the very first features was async await. Very few other runtimes had this kind of mechanism. And a lot of our data abstractions were
really based on this idea of having an async primitive. And you go talk to all these other
shops that are built in Ruby or Python or Java, and they don't have that runtime primitive. And
so they end up with very different architectures. And as we would describe how ours worked,
like you'd see people light up like,
whoa, that sounds really interesting.
And they started kind of jamming about
how they might go about building that themselves.
And even still sort of stuck in the mire
of the complexity of their existing stack.
And so Dan Schaefer and I would do these tours.
We would go to these companies
and kind of do the public roadshow for GraphQL.
And one day after visiting Pinterest, we had sat down with, I think, maybe 12 engineers from their product infrastructure team.
And, like, again, heard repeated things like, wow, this is the third time we've heard someone talk about this problem.
And the answer seems so simple when we said it out loud and of course we'd never really thought
about extracting this one piece out of the big you know depth of complexity of our own abstraction
layers but it felt easy enough to talk about that surely there was like one little piece here
And we literally left that meeting at Pinterest at like 2:30 and walked down the street in SoMa in San Francisco, which is where their office was, into a coffee shop, and literally opened up our laptops and started writing code. Dan started writing unit tests and I started writing the implementation, and he was just like, it should work this way. So he was building an API and writing tests against that API.
And I was coming up with a runtime model that would work in JavaScript.
And by, you know,
6 PM when the coffee shop was trying to kick us out because they wanted to
close, we had something working and that was passing all of our tests.
And it was just a matter of like writing documentation for it.
And I think I spent the next week at work writing documentation.
So like it took us, you know,
like maybe three hours for the meat of the technical implementation and then
another week to finish it up. And that was DataLoader. It's got a little bit of iteration since then, but yeah, that was like a week of work, and it's got maybe 10 times as many lines of documentation as code.
And I think I recorded a video about it as well, just like explaining how it worked.
And now there's like data loader equivalents written in like 20 different programming languages.
And like, that's exactly what I hoped happened.
Like, couldn't have hoped for a better outcome is people stole the technique, not the code.
It's like, great, I don't want to maintain the code.
I want to get the technique out there.
And people have leaned on that as a way to power the GraphQL engines in the same way
that we did early on at Facebook.
Certainly not the only way.
And again, GraphQL is built in a way that's agnostic.
You can power it in lots of different ways.
But certainly a very viable one
that worked quite well for us at Facebook
and seems to be working well for plenty of other people.
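For readers who haven't seen the technique, here is a minimal sketch using the open-source dataloader package: resolvers ask for individual keys, and loads issued in the same tick are coalesced into one batched fetch, with repeated keys de-duplicated. The batch function here is a hypothetical stand-in for a real data source.

```typescript
// Minimal DataLoader sketch: per-request batching and de-duplication.
import DataLoader from "dataloader";

type User = { id: string; name: string };

// Hypothetical batched fetch, e.g. SELECT ... WHERE id IN (...ids).
async function fetchUsersByIds(ids: readonly string[]): Promise<User[]> {
  return ids.map((id) => ({ id, name: `user-${id}` }));
}

// DataLoader expects results in the same order as the requested keys.
const userLoader = new DataLoader<string, User>(async (ids) => {
  const users = await fetchUsersByIds(ids);
  const byId = new Map(users.map((u) => [u.id, u]));
  return ids.map((id) => byId.get(id) ?? new Error(`No user ${id}`));
});

// Resolvers can ask for users one at a time; these three loads trigger a
// single call to fetchUsersByIds, and the repeated key is served from cache.
async function demo() {
  const [a, b, again] = await Promise.all([
    userLoader.load("1"),
    userLoader.load("2"),
    userLoader.load("1"),
  ]);
  console.log(a.name, b.name, again === a);
}
```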
Yeah, it's like 200 lines of code.
Like I think I've seen the source.
It's a tiny, tiny library, right?
Yeah.
And now I think Yelp has like a library
to like auto-generate data loaders
based on your database schemas.
And then like the rabbit hole goes on and on and on.
Right.
Exactly.
Yeah.
Nobody writes data loader code at Facebook, right?
Like they have something similar to what you just described.
There's, you know,
you just describe at a high level how you access the stuff, and a lot of
this low level stuff gets auto generated.
Yeah.
I guess our company is still stuck in the old ages where we have to write
custom data loaders and like have to fix that at some point.
That's fine.
Yeah, yeah, yeah.
It's fascinating to me.
You formed the GraphQL team in 2013.
Our company was founded in 2017.
And it's purely GraphQL based.
So things move so fast in the technology area.
And that actually brings me to a question.
We exposed
our first public API endpoints recently in GraphQL because we like GraphQL. I think it has
a bunch of benefits for exposing public APIs. You can actually see what fields people are using,
how often they're using them. But of course, you can be worried. Nobody has as much experience
with GraphQL as compared to REST.
And that's just a model that more people are used to.
I'm sure you had to think
about that trade-off as well, right?
While building this,
it's like people will feel confused
because they've never seen
this language before.
How would you weigh that trade-off?
Yeah, I think there's a handful of things
we have to keep in mind.
For us, what was very
helpful was starting
small and expanding deliberately.
So
for us, that was starting with
the one thing that we wanted to ship, which was
that News Feed app. So we only
built out as much API service as was
necessary strictly to make that
work. And that meant the set
of people who were consuming that
API were all like, I could, you know, ball up a piece of paper and throw it and hit one of them.
So we didn't have to worry too much about somebody who we had less of a contact with
getting confused. Later, we did have that problem. But because we were a little bit more deliberate about how it spread, we could get a handle on that.
I think there's a very different problem
of how you expand something like GraphQL
within a company, even a big company,
versus something that is public.
Because at least within a company,
you have some organizational tools you can lean on
that you might not be able to have
for a truly public API.
But I think there's some corollaries there. So most of my stories are, of course, in how to
roll it out within an internal company, because that's what we did. But we started with
appreciating that we also didn't necessarily have the right answers. So by going slow,
we would make sure that each decision we were making along the way
seemed well-reasoned. And we would roll out sort of one new feature at a time. We would
choose one person from the engineering team that was working on that feature to be,
you know, wear the GraphQL hat. They got to be the final decision maker about that API.
And they would come to us whenever they needed help. So we would have office hours every day.
So every day after lunch, someone could show up to our area.
We would hang out with them for up to two hours.
And so our whiteboard was just filled with crazy API design ideas
and it would go back and forth.
And we would, you know, stuff that everyone agreed
was the right thing to do, we would do.
And if we had an idea later on how to make it better, we would do that.
And it was great that we would fold one of these engineers from
outside our team into this process and let them really hold the torch and kind of own the conclusion.
And as that expanded, what we ended up with was we noticed that a handful of these people kept
coming back. They'd go build, someone would come from the photos team and they'd be building one
feature and they'd come back three months later and they'd be building the next feature. And they would,
that same person would have come back and they'd be like, Hey, remember how we were doing this
thing about how to do tagging? And we came up with this data model. I found a way where it breaks
for our new feature, you know? And so there'd be this ongoing thread of API evolution with a
relatively small set of people. And as it continued to expand
and ultimately got to the next phase where we're sort of, you know, open entry, anyone could add
to it. We had minted all these experts spread out across our company. And that was extremely useful.
And we had this sort of, you know, almost like judicial, what's the right word, the existing case matter, right, of all these decisions that had been made before. Precedent, that's the word I'm looking for. All this existing precedent of decisions that new decisions could then lean on. And so we wouldn't end up reinventing the wheel. So a little bit orthogonal to what you're asking about, but I think what's also kind of important is to be intentional about how this rolls out, start small, and just, you know, be ready to say that you don't have the answers. Then you want to think about sort of the operational cost
of running a GraphQL service,
which is more complicated in some ways,
simpler in some ways,
but more complicated in some ways than REST APIs.
It's just a different surface area.
And so really kind of knowing what you're going to get there
is important.
Attack vectors, you know, making sure someone can't DDoS you,
but also having good visibility. So if something goes wrong, you know, making sure someone can't DDoS you, but also having good visibility.
So if something goes wrong, you can see it. And none of this stuff necessarily comes for free out
of the box, you have to kind of intentionally make sure you're building it. And there's certainly a
lot of companies out there that are ready and willing to help with that for a fee. And some
of these things are actually kind of straightforward to build just from first principles.
Yeah, like there's like with, with GraphQL,
if you think about query costing,
which is not really a thing with REST APIs because you know exactly what you're
sending.
Yeah. There's a whole gamut there too. You know,
this is yet another one of these examples of things that lots of people are
asking us about. Yeah. How do you, how do you do query cost analysis? And we're just like, we never did any of that. You know, like, wait,
how did we get this far without ever having to address this problem? And so sometimes that's
actually, you know, I've come to really appreciate that that's a fantastic thing to be celebrated
when you realize you've sidestepped an entire problem because of some decision you made earlier
and to try to really consciously figure out, like, what was it that allowed us to sidestep it, rather than immediately facing it and trying to solve the problem. It's way better to make the problem go away than it is to solve it. So for example, for query cost, you say, okay, well, why is it that you need to care about query cost? Well, some attacker could send a query that is intentionally, maliciously designed to maximally consume costs. And I want to protect against that. Okay, that's fair. Or, I have a semi-public API, like it's, you know, third party, but it's technically a private API, people are
paying per use. And I want to make sure my customers are aware of the cost that they're
about to incur on themselves and don't end up in a situation where I have a tough conversation with a client that's going to be unfortunate.
So there's a lot of very different kinds of reasons why you might care about query costs.
And your solve might be really different for each.
We shared the idea of we didn't want an attacker to do something bad.
And so what we would do is
this idea of query persistence. For us, actually, it was much more about not the runtime cost of the query itself, but all the other pieces around GraphQL: all the parsing time and the validation time, and we wanted to cut down on all that, or even just the query upload time. It turns out that a sufficiently complicated GraphQL query can be quite a number of bytes, especially in the days before we had fragments, where you just had to manually unfold all the things that you would have wanted as fragments. Those could be quite big, and mobile upload bandwidth is an order of magnitude slower than download bandwidth, at least, if not two orders of magnitude. So we were seeing cases
where query performance was quite poor. And when we did an end-to-end analysis of where all that
time was being spent, it was being spent in upload. It was being spent uploading a query.
And that's silly. That is the least useful part, especially since there's all of these iOS apps spread out across millions of
little devices in people's pockets, sending exactly the same query up, you know, like,
maybe it's a little bit different version to version. But otherwise, this is exactly the
same thing. So this is where the very, very early idea of query variables came from. At that point,
it was called query templates. And so the idea is that you would,
the query template would still live on the client.
It was a step during build.
So at build time, you would send up all of the queries within the code base to a server endpoint that we wrote.
And if the server had seen them before,
then it would return you a unique ID
that represented the one that it had seen before.
And if it hadn't seen them before,
it would make one of those unique IDs and send that back to you. Either way, now the client-side app can say, great, rather than sending this giant ball of query code,
I can just send this 32-bit integer and that uniquely recognizes this particular instance of this query. And then that alongside your query variables or your parameters probably fits in one packet, one upload packet.
So wildly better upload performance.
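That mechanism is roughly what the community now calls persisted queries. A rough sketch of the two halves, with hypothetical names; the real system handed out server-assigned ids, whereas this sketch uses a content hash just to stay self-contained.

```typescript
// Rough sketch of build-time query persistence (names are hypothetical).
import { createHash } from "node:crypto";

// Server-side registry: id -> full query text, populated at build time.
const persistedQueries = new Map<string, string>();

// Build step: every query found in the client code base is registered, and
// the shipped app bundles the returned id instead of the query text.
function registerQuery(queryText: string): string {
  const id = createHash("sha256").update(queryText).digest("hex").slice(0, 12);
  persistedQueries.set(id, queryText);
  return id;
}

// Runtime: the client sends { id, variables }, which fits in a single upload
// packet, instead of the full query document.
type PersistedRequest = { id: string; variables: Record<string, unknown> };

function resolveQueryText(req: PersistedRequest): string {
  const queryText = persistedQueries.get(req.id);
  if (!queryText) {
    throw new Error("Unknown query id: only pre-registered queries are run");
  }
  return queryText; // hand off to the normal parse/validate/execute pipeline
}
```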
So we had sidestepped this problem of abuse, kind of.
You could still actually, like if you, you know, decompiled our app and looked at what was going on, or even just looked at traffic, you could probably figure out that there was this thing going on, that it was called GraphQL, because our endpoint was called /graphql, and you could probably poke at it and eventually figure out what it was doing and then come up with some attack. We actually had logging in place to look. We didn't block it, because we wanted to know if someone was going to do that first, so we set up logs. And we actually set it up to email us if this ever happened, because we wanted to celebrate whoever would be the first person to figure this out. And it took two and a half years before that log got tripped, so we were very disappointed, very disappointed. But eventually it did. Still, we had pretty much sidestepped that problem. So, you know, I talked to some folks who are trying to figure out really expensive ways to do query cost analysis, or I want to
make sure my query is maximally this deep or this wide or like lots of different things like that.
It's like, why don't you just whitelist your queries? Like here's the sanctioned set of
queries you're allowed to run that you've pre-vetted and you know are going to run in a
reasonable amount of time
And for, I don't know, like two-thirds of use cases, that's sufficient, because all of your use cases are your internal product. And literally the cheapest possible way to do that is there's a directory somewhere in a code base that's shared between your runtime and your clients, where you just literally write GraphQL text, and then you refer to the file name or something. There are very, very, very simple ways to do this, and then there are slightly more sophisticated ways, like the way that we had done it at Facebook. And then if you're really worried about it,
the one mechanism that we had was actually not GraphQL specific. We did not build this; this is another example of us leveraging existing technology that we had. It applied to everything that would run on our API domain. So, you know, we served this from api.facebook.com/graphql.
Every endpoint that came from that domain had a super basic timeout mechanism.
If your query ran for more than two seconds, it was just hard killed.
And you'd return an error code.
And there was something above the application code layer that would enforce that, to make sure something couldn't spin loop and get around it.
And then there was a graph that would show you how frequently that happens.
And we ended up making a GraphQL specific version
of that graph.
So if there was a specific query
that was timing out more so than others
that we could point people
who were authoring those queries to them
to help them spot those.
But that was a great,
like extremely basic abuse vector mitigation
because if someone wanted to do something that was DDoS-ish in shape, they would very quickly hit this timeout. And we knew that we had a lot of capacity to manage many, many, many two-second-long queries.
But then on top of that, our security team would use that as a signal that went into their security model. So they would say if the
same, you know, API key keeps sending timeout queries over and over again, we're going to
throttle it like they're going to send a query and we won't even run it, we'll just return,
hey, you've sent too many API requests, please come back in, you know, five minutes before you
send your next one. And if they still do it, we'll kill the API key altogether and then they won't be able to send requests at all.
And if they've got some bot
that's just generating API keys,
we'll probably spot their IP range
and we'll kill the IP range.
And, you know, if that's not it,
like the security team has got a whole plethora of tools
that they can use for that, right?
But again, like it's another great example
of making sure the lines of abstraction
are in the right place and clarifying: is this a GraphQL problem, or is this a general problem? You just want to make sure you have a good bridge, where GraphQL is one particular kind of request amongst many other kinds of requests that can share this shared concept of request cost analysis, of just purely how long do you let it run.
I think later it got slightly more sophisticated where we also looked at how much data is being loaded because we were worried about the ability to kind of run loop the database layer
rather than the application layer.
And so there was threshold set for that as well.
But again, that had nothing to do with GraphQL.
That was generalized and actually even applied
to just a typical web endpoint.
You know, you just load facebook.com.
You get exactly that same case.
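That timeout layer was not GraphQL-specific, and the general shape is easy to reproduce: wrap whatever executes the request in a hard deadline and surface an error when it trips. A minimal sketch with hypothetical handler names; note that a Promise.race only reports the timeout, while actually killing the work, as described here, needs support from the layer underneath.

```typescript
// Minimal sketch of a hard per-request deadline (names are hypothetical).

class RequestTimeoutError extends Error {}

async function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new RequestTimeoutError(`exceeded ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for the real request handler (GraphQL or otherwise).
async function executeRequest(body: unknown): Promise<unknown> {
  return { data: null }; // stub
}

async function handle(body: unknown) {
  try {
    return { status: 200, data: await withDeadline(executeRequest(body), 2000) };
  } catch (err) {
    if (err instanceof RequestTimeoutError) {
      // Count these per query and per API key: a graph of frequent timeouts
      // points authors at slow queries, and repeated timeouts from one caller
      // can feed a rate limiter as an abuse signal.
      return { status: 503, error: "query timed out" };
    }
    throw err;
  }
}
```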
Yeah. Okay.
It sounds like, at least with the open source thing, right, you exposed, like, you added GraphQL as an open source library, you spoke about it, tons of people started using it. How do you foster a good community after that? Right, there are a lot of people who have a lot of excitement, but then how do you make sure that the community is building in one strategic direction, or doing the right thing in your opinion? Like, what are the challenges that come in with that?
Yeah. First I'll say that's a very fortunate problem to have, and not the one that most open source authors find themselves facing. For most of them, it's how do you get energy around the project in the first place. So I want to acknowledge that it truly exceeded our expectations. You know, we presented this for the first time, actually before we were considering open sourcing it, at the very first React conference that Facebook hosted. And we decided that we wanted to talk a little bit about GraphQL only because we were talking about Relay.
And we got this super positive response to it.
And people were starting to reverse engineer GraphQL.
And we're like, this is nuts.
And, OK, maybe this thing the Relay team has been telling us about open sourcing, they're probably right. We should follow their good wisdom here and pursue this.
And so we invested time in the open source effort.
And then we launched it later that same year at the React Europe conference, which was the very first of that conference.
And so I gave a talk about the details of the open sourcing
and sort of what we had changed
and what was actually getting released and where you could find it.
And even just within that, you know, half-week conference, we were getting really excited folks coming up asking how they could help.
And, you know, maybe par for the course.
Sometimes people are just polite and they say, you know, thanks for a great talk.
And so I was ready to kind of chalk it up to that.
But I attended another conference a number of months later and had a handful of people come up to me and say, hey, can you come check out the thing that I built? People working in GraphQL, or GraphQL-adjacent, were showing me cool things that they had built that were essentially like GraphQL DevOps, or, you know, tooling around this outside of the context of Facebook, which was super exciting to me. And so, you know, in hindsight, looking at this, I think a few things. One is making sure that your voice is heard and clear
and factors in the opinions of
others. So a lot of what I would do is I would, I would talk about the stuff and then I would hear
what people were saying about it, what was concerning to them, what parts were they excited
about that they wanted to hear more of. And then I spent a lot of time and energy in making sure
that I was getting conference talks in front of the right audience.
And I was taking that pretty seriously.
I wanted to make sure that I was not only speaking at React conferences, because I was actually kind of concerned from those first two, that people would look at this and say, this is a JavaScript technology.
And this is a thing for React.
And it was like, no.
When we built, the original idea was that this was a thing that we had made for our iOS app.
The fact that a bunch of React engineers are excited about it is great.
But like, that's actually not the reason why we built this in the first place.
So I wanted to make sure that we were getting this in front of other kinds of engineers.
And spent probably a good year and a half, maybe two, talking to backend engineering communities, mobile engineering communities, web engineering communities, DevOps communities,
just like trying to explain what it is
and explain my vision for it.
And that really helped
because then when a problem,
I knew it was working
when I saw other people giving talks on GraphQL
that used almost word for word phrases
that I was using in my talks.
That was like a great success. It wasn't
like, oh, that person's copying my talk. It was like, no, fantastic. I have people getting the
exact same messaging around the project and the vision setting that I would have said myself.
And so I can kind of take a step back from doing the conference talks and let more and more people
talk about it and give their own spin. And then the other piece was making sure that for the people who were excited and wanted to contribute,
that there was a path in for them. And that's really hard to do. Because a lot of the work
that we needed to do was actually not really technical. It was about getting the documentation
in a better place. How do you integrate with other technologies? A lot of this is the stuff we were talking about before, like query planning and cost analysis, and all of the DevOps tooling around it: how does all this stuff work? And making sure that the people who were working on this were kind of in lockstep with one another. And then, as people did have good feedback for GraphQL itself, it was clear to me that the model of, you know, a traditional open source project, where you send a pull request and someone reviews it and then merges it, doesn't work for something like GraphQL, because GraphQL is not one code base; there's a specification and many code bases
which follow that specification. And I'd spent a fair amount of time in TC39, which is the steering committee that oversees
JavaScript, and tried to basically emulate and borrow ideas from that, which I had found
reasonably successful.
And that was the beginnings of what became our working group.
I think our working group started in early 2017
was our first working group call,
which was about a year and a half
after we had open sourced it.
So, you know, it took us a while
to figure out the right model.
That's quite a long time,
but eventually we figured that out.
And now that has become the true center of gravity
for how the project evolves.
Can you elaborate a little bit
about steering committee, right?
Like TC39, like what
do these committees do just at a higher level? Yeah. So their primary purpose is to make sure
that these broadly used technologies evolve in a healthy way. And that charter can mean a lot
of different things. So, you know, TC39 has this, and I've also written this for GraphQL: a set of principles. And that also really helps in tying things back when someone has a really cool idea, to say, does this align with what we said we wanted to do here? And one of our very first
principles is favor no change, which is really frustrating for somebody who wants to show up and
add something new. But as a user of GraphQL, what you really don't want is for that thing to be
changing out under your feet all the time. And so this is a little unintuitive, especially for the
people who are involved and are motivated to make improvements, that the most important thing that we can do is ensure a sense of stability to the technology
and give someone a sense that this is not a fad. It's not a thing that's going to evolve
major version to major version multiple times over a month by month. And if you build something
big and significant on top of it, it's going to keep working for many, many years. And that's a very significant
responsibility. So that means rather than going straight to code, we tend to spend a lot more time
looking at problems. What problems are people facing? Is this a real problem? What is the
design space that we can work in around these problems?
We'll sometimes spend an entire year for a particular proposal, just making sure we really
understand the problem space. Because the cost of a mistake is really high. And the cost of doing
nothing is actually not that high. Our principle is to do nothing, right? Like, that's our default: only make a change if we're very confident that it's the right thing to do.
So the working group has really embodied
these values and these principles.
A lot of familiar faces will show up to those meetings,
but also a lot of new people will show up
to those meetings with great ideas.
And a lot of it is helping people guide
through this
process that is new for many open source contributors of how to get your proposal
into a really kind of a bulletproof stance, how to make sure you understand not just its impact
on a bit of JavaScript code, but on the impact on many other bits of code on the specification
itself on the runtime environments on the tooling environments. So it's not just about getting the change landed, but then making sure that the ecosystem effects
are well managed, documentation is in place. And the working group oversees all of this.
It's a completely volunteer-based thing. It's open source in that true sense where it's truly
folks who show up who care about it and volunteer their time to facilitate this and to,
you know, work through the proposals. If you want to answer this, like feel free not
to answer this question. What's one change you're very proud of that you didn't make
to the specification? Something that looked really interesting, but you decided let's not do this.
There are plenty. Actually, a lot of them were from early on. So, you know, I had a
mini version of this process earlier on. The version of GraphQL that we used in 2012, all the
way up really through 2015 internally at Facebook was pretty different from what we use today, what we see GraphQL as today. I spent that window of time between us talking about GraphQL
at the first React conference.
I think that was maybe January or February 2015.
And when we open sourced it, which was July or August of that same year,
basically that window of time was a massive redesign of GraphQL.
Like, basically, me looking at what had unintentionally evolved into a state, and being kind of not really proud of that, and not wanting to put that out into the world, and wanting to make sure that what we put out was good. And so it's like, okay, well, if the problem was that this thing kind of evolved into its end state, then me just taking a hatchet to it is likely to result in just the same bad result. It'll just feel novel because I just came up with it all, but it'll end up being kind of equally poor.
And I wanted to avoid the so-called second system syndrome, which is where, you know, you build it the first time and you kind of don't know what you're doing.
And then, you know, you look back and you're like,
I would have made all these changes.
And then you do it a second time and you think that you know exactly what you're doing
because you've already done it once before.
And you make way more mistakes because you are overconfident
and you end up with something way too complicated.
And I almost did that.
But we ended up running this process where I would write proposals. So I would write a white paper for every change that I wanted did that. But like, we ended up running this, this process where I would write
proposals. So I would write a white paper for every change that I wanted to make. Why what
problem it was solving, why it was important change to make and what it would look like,
here's the alternatives and why we're not doing the alternatives. And then I would shop that
around a bunch of engineers that I trusted, certainly our core team, you know, I put it in
front of Nick and Dan, but I would also go around to other very senior engineers across the company and say, does this make sense to you?
Like, please poke holes.
And they would.
And I would say maybe one out of four of the proposals that I made stuck.
One of them was I wanted to get rid of the curly braces.
I was spending time in Python and a couple other languages.
And I was like, this is the future.
Like people are going to write languages that don't have these weird curly braces.
Everyone's like, nope, you're nuts.
That's wrong.
Like the curly braces are fine.
And I think they were right.
I was probably wrong on that one.
And I'm glad they talked me out of it.
That's one example.
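As a purely hypothetical aside, since the proposal was rejected and indentation-based syntax was never part of GraphQL, the contrast Lee is describing would have looked roughly like this:

```graphql
# Today's GraphQL, with curly braces:
{
  me {
    name
    friends {
      name
    }
  }
}

# The rejected idea, very roughly: significant indentation instead of braces.
# (Invented syntax for illustration only; this is not valid GraphQL.)
# me
#   name
#   friends
#     name
```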
Another one, in hindsight: I think there are plenty of things that I got wrong in that process too, that now we're kind of in the midst of trying to fix. One was interfaces versus unions. I got pushed back really early on that we should not have unions, that there should only be one way to describe a type abstraction. And I was pretty, pretty adamant that this difference mattered: something declares that it implements an interface, and the relationship with a union is exactly the other way around. Something doesn't know whether or not it's part of a union, and the interface doesn't necessarily know all the things that implement it, but a union definitionally knows all the things that are in that union. And that reversal felt critical and a very different piece of type design.
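To make the distinction concrete, here is a minimal sketch with hypothetical types: a type opts into an interface by declaring that it implements it, while a union is defined by enumerating its members, which know nothing about the union.

```graphql
interface Node {
  id: ID!
}

# The object type declares the relationship: it says it implements Node.
# The Node interface does not necessarily know every type that implements it.
type User implements Node {
  id: ID!
  name: String
}

type Comment implements Node {
  id: ID!
  body: String
}

# The union declares the relationship in the other direction: it enumerates
# exactly which types belong to it, and User and Comment are unaware of it.
union SearchResult = User | Comment
```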
And so people eventually believed me and shipped it.
And I think it's produced a lot of value,
but we've been now going through a lot of iteration
around that space and landing on things these days
that I think may be better.
But just having to work our way around interfaces and unions
has been pretty challenging.
And so I'm thinking maybe that's something I would have done differently earlier.
Another example is how we manage nullability.
I'm actually quite happy with the change that we made when we made it.
Previously, there was no modeling for nullability at all. Every field was nullable all the time; there was no concept of null versus non-null. It was very controversial to introduce the idea of a non-null field, and to this day it remains a point of confusion for newcomers to GraphQL, so you know, the controversy still stands. A lot of people, the thing that they're confused about is, why is it backwards? Why is this thing not just non-nullable by default, where you have to explicitly say that it's nullable? That's how many other languages work. There are firm reasons for that. But the big miss, I think, was making
this purely a part of the schema design and not something that the client could control.
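For context on the schema side of this: in GraphQL's schema language, fields are nullable by default, and a trailing exclamation mark opts a field into being non-null. A minimal sketch with hypothetical fields:

```graphql
type User {
  # Nullable by default: the server may return null here without failing the query.
  nickname: String

  # The trailing "!" marks the field non-null: the server promises a value, and if
  # resolving it fails, the error propagates up to the nearest nullable parent field.
  id: ID!
  name: String!
}
```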
And we actually have a proposal that is most of the way through the process, a very compelling proposal at the moment. We're now kind of poking holes in it, trying to make it bulletproof. It's readdressing that and trying to reintroduce the concept of client-controlled nullability, which is super exciting to me.
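The client-controlled nullability proposal would let the query, not just the schema, state how important a field is. A rough sketch of the idea, using the designator syntax from the RFC as it stood around the time of this conversation (the syntax may change, and the field names are hypothetical):

```graphql
query GetUser {
  user(id: 4) {
    # "!" asks for this field to be treated as required for this query,
    # even if the schema declares it nullable.
    name!

    # "?" marks the field as explicitly optional, so a failure here can be
    # tolerated as a partial result rather than failing a larger subtree.
    profilePicture? {
      url
    }
  }
}
```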
There were other things that we punted on early on that
actually didn't come from me, came from other sources that I was hesitant about, and I'm quite
happy that we did not do. One example was parametric types. Connections are an important primitive in GraphQL for doing pagination, and people wanted the ability to say literally Connection, angle bracket, T, closing angle bracket, and then to parametrize those per use case: you could have a connection of User, a connection of this and that. But it was always really hard to explain exactly how that would work. At a high level, you can see why that would be useful as an organizational technique, but then, okay, mechanics: what is this actually doing? It was always really hard to piece that together in a way that didn't feel super complicated.
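A rough illustration of the difference: the generic syntax people were asking for, versus the concrete per-use-case connection types that schemas spell out today (the generic form is invented for illustration and was never part of GraphQL; the type names follow the common cursor-connections convention):

```graphql
# What was being asked for, roughly (not valid GraphQL):
# type Query {
#   friends: Connection<User>
# }

# What schemas actually do today: define a concrete connection type per use case.
type Query {
  friends(first: Int, after: String): UserConnection
}

type UserConnection {
  edges: [UserEdge]
  pageInfo: PageInfo!
}

type UserEdge {
  node: User
  cursor: String!
}

type PageInfo {
  hasNextPage: Boolean!
  endCursor: String
}

type User {
  id: ID!
  name: String
}
```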
And if I've come to any conclusion after spending so many,
so many years steeped in this,
it's that type systems are really hard.
Like, all the complexities and the bugs and the mistakes within these runtimes come from expecting something to be of one type when actually it's of another type.
And all of our most complicated validation rules
are about asserting that queries are fitting the particular shape that they said that they would fit.
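As a small example of the kind of shape-checking Lee is describing, here is a hypothetical schema and an operation that validation rejects before execution, because the query asks for a field the type does not have:

```graphql
# Hypothetical schema:
type Query {
  me: User
}

type User {
  id: ID!
  name: String
}

# An operation that fails validation against it: "email" is not a field on User,
# so a validator rejects the document before the query ever runs.
query {
  me {
    id
    email
  }
}
```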
And every time we make the type system more complicated, that surface of code gets more and more complicated.
And so there was a pressure to keep the type system as simple as possible.
And we made it slightly more complicated in the course of the pre-open source work, hopefully in valuable ways,
but we also kicked a lot down the road. That was just one example of many others that would have made the
type system nominally more powerful, but also quite a bit more complicated. And I think biasing
towards simplicity there was the right call. Cool. Yeah. You've already basically answered
what my next question was going to be, which is like, what are you excited about for like the evolution or like the next step of GraphQL?
You mentioned one thing, which was client-controlled nullability.
Is there like something else that you want to discuss?
Yeah, I feel like the GraphQL runtime itself and the query features, actually, these days,
there's more changes and proposals to that than there ever have been. And yet, at the same time, paradoxically, it's actually not where most
of my excitement is. Not to say that I'm not excited about all those changes. In fact,
like client-controlled nullability, super exciting to me. There's stream and defer,
which is also super exciting to me. There's many very, very cool proposals that are happening
at that layer. But for me, it's actually the ecosystem.
That's really where I think the value is.
Like giving people one extra technique for their querying is a nice-to-have, but it's not a game changer in the same way. I think there's actually way more ground to cover than there is ground that has been covered in terms of tying many different schemas together: not just schema federation within one organization, but the internet of data. How do we interconnect data sources and use GraphQL as a primitive for that? There are really early conversations around how that might work, and that to me is super exciting. And then there are all the tools that are getting built out now. Half our conversation early on was about these sort of fundamental ideas of operational excellence and setting up a GraphQL service, like how to make sure you're not screwing
something up. And there's a handful of companies now that will help you with that, but just a
handful. And I think there's fertile ground to build a whole ecosystem of pretty incredible tooling that has GraphQL as a core primitive within it. And that, to me, is actually where most of the exciting work is happening. And that all kind of harkens back to this principle that the core of GraphQL, the spec of GraphQL, is not there to just add feature after feature after feature. It's to hold firm as a stable base. And this top principle of stability, the intent of that was that a stable base allows you to build other things on top of it. And I care way more about what people do with GraphQL than I care about GraphQL itself. I'm very excited to see
that that has yielded
fruit and people are now building
very, very interesting things and now
layers upon layers of tools and
technology that leverages that.
It's very exciting to
see that that piece of
the vision is starting to play out today.
Yeah. I spoke to the founder of Hasura, where you can take a database schema and sort of auto-generate a GraphQL API out of it. You can also take that to the next level, right? Like, from a GraphQL API, I hope someday you can auto-generate React components. And that way, if you combine Hasura with that new technology, you can just have a database schema and boom, you have a UI. There are so many interesting possibilities. Yeah. My long, long-term hope
for GraphQL is like, actually, you know, I'm very happy that 10 years after creating it,
it's not just that people are still using it. It's become a really common tool and maybe even ubiquitous.
That makes me very happy. But if we're still using GraphQL in the same capacity
10 or 20 years from now, I'll be a little underwhelmed. That's not the long arc I want
for it necessarily. I'm sure plenty of people will continue to use it, but I would much rather people steal the ideas from it. And that's the thing: all this stuff gets to evolve, and we get to steal ideas from old technologies and apply them to new ones, and figure out, all right, where does this thing break and when is it time for a new technology, and we can start to evolve from there. Hopefully it's modestly paced and deliberate, and not crazy like a lot of the other tech stacks that we work with every day are.
But my parallel in a lot of this is SQL.
SQL is now many, many decades old.
Absolutely still in use.
Is it the way that everyone builds everything?
Of course not.
But have the ideas that it popularized
become pervasive in technologies
maybe not even related to SQL anymore?
Absolutely, including GraphQL.
And that to me is the true legacy of SQL.
It's not that it's the SQL query language specifically
and what features that it's added over time.
It's the like broad impact and shift
that it's had on the technology ecosystem.
And if I could have like one long-term wish
for the GraphQL project and its long life,
it's that it has that.
That's really what I hope for.
Yeah, I think that's a great vision to have.
And I'm sure you'll also see versions of things like a super GraphQL that's GraphQL without braces. But I'm
sure people will build stuff that's even better than that. Well, Lee, thank you so much. This
has been an amazing conversation. Thanks so much for being on the show. My pleasure.