Software at Scale - Software at Scale 44 - Building GraphQL with Lee Byron
Episode Date: March 22, 2022

Lee Byron is the co-creator of GraphQL, a senior engineering manager at Robinhood, and the executive director of the GraphQL Foundation.

Apple Podcasts | Spotify | Google Podcasts

We discuss the GraphQL origin story, early technical decisions at Facebook, the experience of deploying GraphQL today, and the future of the project.

Highlights (some tidbits)

[01:00] - The origin story of GraphQL.

Initially, the Facebook application was an HTML web-view wrapper. It seemed like the right choice at the time: the iPhone launched without an App Store, Steve Jobs called it an "internet device," and Android phones came out soon after, with Chrome, a brand-new browser. But the application had horrendous performance and high crash rates, used up a lot of RAM on devices, and animations would lock the phone up. Zuckerberg called the bet Facebook's biggest mistake. The idea was to rebuild the app from scratch using native technologies. A team built a prototype for the news feed, but quickly realized there weren't any clean APIs to retrieve data in a format palatable for phones - the relevant APIs all returned HTML. Facebook did have a nice ORM-like library in PHP for fast data access, and there was a parallel effort to speed up the application using this library, plus another project to declaratively specify data requirements for this ORM for better performance and developer experience.

Another factor was that mobile data networks were quite slow, and a chatty REST API for the news feed would have meant extremely slow round trips and tens of seconds to load the feed. So GraphQL started off as a little library that could make declarative calls to the PHP ORM library from external sources; it was originally called SuperGraph. The last piece was making the language strongly typed, drawing on lessons from other RPC frameworks like gRPC and Thrift.

[16:00] - So there weren't any data loaders or any such pieces at the time.

GraphQL has generally been agnostic to how the data actually gets loaded, and there are plugins to manage things like efficient data loading, authorization, etc. Facebook also didn't need a data loader, since its internal ORM handled de-duplication, so one wasn't built until there was sufficient external feedback.

[28:00] - GraphQL for public APIs - what to keep in mind. Query costing, and other differences from REST.

[42:00] - GraphQL as an open-source project.

[58:00] - The evolution of the language, and the new features Lee is most excited about, like client-side nullability.

Client-side nullability is an interesting proposal: clients can explicitly state how important retrieving a certain field is, and on the flip side, allow partial failures for fields that aren't critical.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey, welcome to another episode of the Software at Scale podcast.
Joining us today is Lee Byron, the co-creator of GraphQL, a senior engineering
manager at Robinhood, and a director at the GraphQL Foundation. Thank you so much for joining us.
Yeah, absolutely. My pleasure. Thanks for the invite.
Yeah, I have been really excited to talk to you for a few months since we decided at my company,
we're going to be exposing a GraphQL API. And I'm so happy I finally get to talk to you.
Yeah, always fun to tell some stories.
So I want to start with, you know, a history of GraphQL, right? Like, how did we decide,
like, how did Facebook decide that we need to do something different with our application? I know
you have a talk that talks about, you know, deploying a native app or an HTML5 app. I'd
love to get to know some of the behind the scenes,
going from we need to do this to deploying it and seeing whether it works or not.
Yeah.
So it's been almost 10 years
because we started the effort that became GraphQL
in early 2012.
So 10 years ago, which is kind of crazy to think about.
At that point, GraphQL actually didn't feel like the wild part of that adventure. The wild part was actually rebuilding our iOS app.
So at the time, our iOS app was awful. It was really bad. The core technology layer,
and I'm definitely partially, if not primarily, to blame for this.
The core technology stack was a wrapped web view in a native shell.
The native shell would do all the things that you could really only do natively that there wasn't a web API for.
So if you needed to bring up the camera to take a picture, or we had just launched this thing called Places where you could check in.
Remember when you'd go places and check in?
Those are all powered by native code.
But the rendering layer was entirely web-based.
And the bet was really this idea that when Steve Jobs introduced the iPhone,
he was like, it's a communicator.
It's an internet device.
And if you want to build an experience for it, you just build for the web. You couldn't build apps for the iPhone when it first came out. You could only build websites. And then there's Google, like the internet company number one, you know, and they make this brand new Chrome browser. So my bet was
that these companies were going to go head to head and compete on web rendering technology on this
new mobile platform. I was super wrong. That's not what happened. They instead actually
really crippled their web platforms on these OSes and instead leaned into creating the walled garden
that we really have today,
where natively run apps only work on these platforms,
and it is what it is.
So we had gotten pretty far down this road
of betting on web as a common UI stack
that would work portably, whether on mobile web if you hit it in a browser on any platform, or within the core of this iOS app.
We had a similar one that was within the core of the Android app.
With this idea of, you know, Facebook, or Meta now, is a giant company, but at the time it was not a particularly huge company.
There weren't that many product engineers.
And mobile in general was kind of new.
So you would go to these teams that were super focused on building a desktop site and say,
Hey, we really need you to ship mobile product. And they were like, uh, okay. Yeah. Like,
what does that entail? And you're like, well, you got to build a mobile web version and a
feature phone version and iOS version and Android version. Like, all right. So you're not talking
about building one extra thing. You're talking about building five extra things. Yeah. No,
thank you. We're not going to do that. And we were kind of really caught in a really tough situation pretty late in the game when we realized that we were behind on just straight-up product coverage. There were plenty of features that just did not exist on mobile, and the ones that did were low quality. App speed was bad.
Crash rate was really high because it turned out
using the web stack
as a rendering layer
ate a lot of RAM
and these older devices,
they just didn't have that much RAM.
And rather than gracefully
evicting stuff in the background,
they would just hard crash.
There were just showstopper bugs.
You try to do some animation
and the animation would just lock
and then the whole browser stack would crash. And it was just unviable. And I think at some point, Mark Zuckerberg
was in some interview and someone asked him about this mobile strategy being off. And he said,
betting on HTML5 was like the number one, the worst mistake that Facebook had ever made.
Me and my team were like watching him like, oh, oops, that's us. That was our fault. We are the number one mistake that Facebook
ever made. So really, the big bet was this idea that we were going to rebuild a mobile app from
scratch. And at this point, in part, because of the strategy that we had taken to invest in the
web, but also in part, just these platforms being new and recruiting being hard.
We just didn't really have that many iOS experts at the company.
We had a couple,
I think we had maybe three or four at that point that we would have considered really experts. And that meant, since the platform was new, these were Apple engineering experts. They had been writing Objective-C forever, and the iPhone was new for everybody, but they had already understood how to build for that
platform. We had a few of those. We were starting to train people within the company on how to do
this, but we needed to double down, triple down on this. So we started a brand new app and we decided that we were going
to start with one feature that was going to be newsfeed. That was the very first thing you see
when you open the app that was going to be purely native, no web technology anywhere.
And then there was going to be a compatibility layer that bridged out into all of the existing
stuff that we had already built. So all of the existing features would still be there. This would give us the ability to ship iteratively.
We could ship a native experience newsfeed,
and then piece by piece,
we could move more and more things onto this new technology stack.
So we spun up this group of engineers and said,
hey, go build a prototype,
something feasible in the next three months.
And they did this.
This was maybe late 2011 or super early 2012. And they did this, they came back, and they said, Hey,
you know, we've got this, this app, we think it's good, we think it's actually probably ready to ship. And I took a look at it, and noticed that it was missing a bunch of content in the newsfeed
and said, Hey, okay, this looks great. But like, when are you going to get these other kind of
story types showing up in the newsfeed? Like, what, okay, this looks great. But like, when are you going to get these other kinds of story types showing up in the
newsfeed?
And they're like, what are you talking about?
All the story types that come back from us from the API are here.
And then like, all of a sudden, I had that moment where like the blood rushes out of
your face.
You're like, what API?
Because we just like completely, of course, missed this.
And everything is web technology based.
There is no separation of concerns.
Like there's just one
big ball of code back there that talks to databases and services and does business logic
and it munches it all together. And its output is HTML. In fact, there were backend systems at Facebook at that time that literally yielded you blobs of HTML that you were supposed to interleave into the rest of another page, which was kind of frustrating for other reasons if you were trying to build something for mobile
because they would stick desktop CSS in there
and you'd have to figure out how to fix it.
There was no concept of a data layer abstraction.
That was not really a thing.
Except for the fact that there was a software-based data layer.
So not an API tier.
Typically today, we would think about these are services,
whether it's a database-level service or a high-level data service
or even a GraphQL API or REST API.
You yield them some requests, whether that's over gRPC or HTTP or whatever,
and it comes back to you with data in some form.
That's not what this was.
This was like kind of like an ORM, like within the runtime of the server would be these PHP objects that would describe every kind of thing that existed within the Facebook ecosystem with getter methods.
And we had just started in on this path to what has
now been the hack language for a long time. At that point, it was still kind of finding its name.
It was brand new. But the one feature that they had added early on was async await,
which at the time was really novel. I think C# was the only other language that had this;
that's where we borrowed it from. And that feature was taking the code base by storm. There was a massive sort of performance
and speed effort that was being run sort of orthogonally to all this happening. And so
anyway, we found ourselves in this situation where a brand new iOS app was getting built. It was consuming these APIs that they had found, and actually what they were were three or four year old APIs that had been built as one-offs for some partner integration and then abandoned. They were just ancient and missing stuff and slow, and it's like, yeah, no, we absolutely cannot ship to production on this, we've got to go build a new thing. And at the same time, we had these ORM-ish objects.
They were called Ent, E-N-T, for entity,
inside the data layer
or inside just the application code at Facebook.
And we thought, okay, really what we want
is we want to take this ORM-ish Ent framework
that actually lots of people really enjoyed using. It was a
very well-built abstraction, but we wanted to be able to access it from iOS code in the same way
that a PHP engineer would be able to access it from within the big ball of PHP code.
You can't just do that, right? You need to be able to go back and forth across the network.
And we sort of said, all right, well, how does this thing work?
And, you know, what are its core primitives?
And really the single biggest sticking point was this idea that Ent, being part of the backend system, was not afraid of going back and forth from the database multiple times.
You know, if you needed to say,
I need to load all of this person's friends. Great, done. Okay, now I also need their profile pictures.
Great, just go back again, get more data. No problem. Because you're talking about requests
within a data center pretty fast. We didn't want to do that from an iOS app because even these days,
we would consider that probably poor design um but back
in 2012 you're talking about really slow 3g networks where it's not just that the bandwidth
is slow it's the latency is slow multiple seconds per round trip so if you have to go back twice
you're talking about changing a page's load time from three seconds, which would have been considered fast at that time, to six. And
if you have to do two or three layers of first I load this, then I load B, then I load C,
it's really easy to end up in the tens of seconds to load a screen, which would have put us right
back where we started with the mobile web-based thing we were trying to replace in terms of
performance. So we needed to come up with a way to keep the programming model that we liked from this ORM layer called
Ent, but leave the back and forthiness of how it worked under the hood on the PHP side and not in
the iOS side. And it turned out that there was this experimental API that we had been
playing with on the Ent side called Ent loader, which was the ability to declaratively state the relationships between the data that you wanted, so that basically a query scheduler under the hood could figure out how to do the most performant thing with the Ents under the hood.
And we thought, great,
how about we just, you know, write something that can directly translate to building an Ent loader
on the fly. So we wrote a little piece of code. My co-creator Nick Schrock wrote that. He called
it Supergraph. And the first version's parser was a bunch of regular expressions, you know, it was not pretty, but it looked like PHP code. You know, imagine taking all the features away from PHP except, you know, dot, method name, and then parentheses. It's like that syntax was the only thing that remained. But it was just enough to mirror the code that you would have written in PHP to write that Ent loader and the sort of dependencies of the data that I need.
And so you would write that, and it would get parsed. You know, you just send it as a string to the server, it gets parsed on the server from a string into an intermediate data structure. That then, sort of on the fly, builds one of these Ent loaders,
runs it, takes the data,
and then sends it right back to the client.
This was sort of prototypal GraphQL.
That was early 2012.
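To make the shape of that prototype concrete, here is a minimal, hypothetical sketch of the flow Lee describes: a query string is parsed on the server into an intermediate tree, walked against getter-style backend objects, and returned as plain JSON in a single round trip. The names and structures below are illustrative only, not the actual Supergraph or Ent code.

```typescript
// Hypothetical sketch of the flow described above, not the real Supergraph code.

type QueryNode = { field: string; children?: QueryNode[] };

// The kind of intermediate structure a server-side parser might produce for
// something like "viewer { name, friends { name } }".
const parsedQuery: QueryNode[] = [
  {
    field: "viewer",
    children: [{ field: "name" }, { field: "friends", children: [{ field: "name" }] }],
  },
];

// Stand-in for the Ent-style data layer: objects exposing async getters.
type Entity = { [field: string]: () => Promise<unknown> };

async function execute(nodes: QueryNode[], source: Entity): Promise<Record<string, unknown>> {
  const result: Record<string, unknown> = {};
  for (const node of nodes) {
    const value = await source[node.field]();
    if (node.children) {
      // Sub-selections recurse; list-valued fields fan out over each entity.
      result[node.field] = Array.isArray(value)
        ? await Promise.all(value.map((v) => execute(node.children!, v as Entity)))
        : await execute(node.children, value as Entity);
    } else {
      result[node.field] = value;
    }
  }
  return result;
}
```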
So we put this in front of the mobile engineers
and we're like, is this useful?
Is this helpful?
And they were like ecstatic.
They're like, this is insane.
This is amazing.
Yes, this is absolutely what we need.
And at the same time, I was not working on the project at the time, actually.
I was sort of knee deep in trying to get these iOS engineers focused on the right things.
I was taking a different angle at the API problem.
The thing that stood out to me was all of our API technology at Facebook at the time was sort of one layer deep.
You couldn't ask for dependencies of data.
And there was no typing information.
You just sort of hit an endpoint.
You got a big blob of JSON back.
And the point that I jumped to was these RPC languages that all were based on a firm type system with relationships.
And they're self-describing
and there was auto-generated documentation. I was like, that's where we want to be. We want to be where gRPC is, or where Thrift is, or Protobuf, or Cap'n Proto. Like, that technology is where we want to go towards. And so there was this magical moment between Nick Schrock, who had written the Supergraph prototype, and me, who had this sort of insane 3,000-line prototype of a News Feed API all written as a Thrift document with no idea of how I would actually translate that into something that would work.
Like, what if we put these two ideas together?
And that's what we did.
So we built a version of Supergraph on top of Newsfeed.
Our other co-creator, Dan Schaefer, got involved.
He was on the News Feed team and was able to sort of make sure
that the pieces on the PHP side would work the way we wanted them to work.
And that was the origin. It worked.
We did a lot of iteration on the syntax to get it into a state
where the iOS engineers kind of understood what they were looking at
and not just like fake PHP code. And that unlocked our ability to ship that app.
In late summer 2012, we shipped a version of the Facebook for iOS app that was a completely
native News Feed that was entirely powered by GraphQL. And it loaded very complicated
News Feed endpoints in two and a half seconds.
It was pretty incredible. And right after that, it started to expand like wildfire. We had
teams sort of like pounding at our door saying, this is super cool. How do I build for your iOS
app? I want to add a screen to the iOS app. But also like this GraphQL thing is interesting,
like how can I use it? And there was enough energy around that that we decided to create
a whole team around that.
So we actually created the GraphQL team in early 2013, sort of in response to that demand
and kind of the rest is history.
That's such a cool origin story for technology.
And I guess at this time, there were no bells and whistles.
It was pretty much like a JSON schema.
And there were no data loaders or anything at this time.
Is that roughly correct?
There were... So I think this is part of the reason why I say that GraphQL, well, I think that's the piece that has longevity from this effort, although the iOS app today still has its code base origins in that effort, so that has certainly lasted too. But the reason why we were able to build something in a matter of a couple of months was that we were standing on top of many years of work. That Ent framework, which was also originally authored by Nick Schrock, who was the one who came up with the Supergraph concept.
He had been working on that for close to three years at that point, which started as a very,
very lightweight thing, just noticing common patterns and how people were doing data access
and trying to build an abstraction around that. And then encountering sort of problem after problem after problem that this thing could solve. So we early on in like maybe 2009 or 2010, the dominant problem that we faced was access control. a set of business logic that would determine whether or not a particular person was allowed to see a particular piece of data. And then if the answer was yes, then it would fall through
to the next piece of code, which would actually then go load that data and get it back and put
it in the appropriate place in the screen. And it was just way too easy to make a mistake.
And, you know, one subtle bug in that first piece of logic, and it falls through and it loads that data. And
lo and behold, people write bugs sometimes. And it was, it meant you would have to unit test every
single place where you would access data, which, you know, we had some, but also tests only test
what you write them to test. So there would be things where we thought our test coverage was good,
but something still would kind of slip through the cracks.
So that was another thing that we added early on: the ability to describe access control rules that were tied to the type of data itself
rather than to the place where that data gets used.
So you actually cannot load that data without first running the access control rules.
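A rough sketch of that idea, with all names hypothetical (this is not the real Ent framework): the visibility rule lives on the entity type itself, so a call site cannot get the data without that rule running.

```typescript
// Hypothetical sketch: the privacy rule is part of the entity definition, and
// the only way to get an entity is through a loader that has already run it.

interface Viewer {
  id: string;
}

interface EntDefinition<Raw> {
  fetch(id: string): Promise<Raw | null>; // raw data access
  canSee(viewer: Viewer, raw: Raw): Promise<boolean>; // privacy rule for this type
}

async function loadEnt<Raw>(
  def: EntDefinition<Raw>,
  viewer: Viewer,
  id: string,
): Promise<Raw | null> {
  const raw = await def.fetch(id);
  if (raw === null) return null;
  return (await def.canSee(viewer, raw)) ? raw : null;
}

// Example entity type: a photo is only visible if the viewer isn't blocked by
// its author. blockedUserIds is a stub standing in for arbitrary business
// logic that may itself load more data.
async function blockedUserIds(authorId: string): Promise<Set<string>> {
  return new Set(); // stub
}

type Photo = { id: string; authorId: string; url: string };

const PhotoEnt: EntDefinition<Photo> = {
  fetch: async (id) => ({ id, authorId: "author-1", url: `/photos/${id}` }),
  canSee: async (viewer, photo) => !(await blockedUserIds(photo.authorId)).has(viewer.id),
};
```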
And another was the loading efficiency.
So the reason why you never heard us
when we talk about GraphQL, talk about query planning or anything like that, is that the Ent infrastructure had dynamic query planning built into it. Because at some point, long before we built GraphQL, there was this idea of, hey, our CPU and memory usage on our servers is way higher than
we want it to be. And if we need to continue to scale and grow as a company and have more people
use our services, then we have to be loading data as efficiently as possible. And so the thinking of what can you do in parallel,
what truly depends on what, so that you never want to wait to load data B if it doesn't depend
on data A. If there's no dependency there, you should start loading it as soon as possible.
That gets more complicated when access control rules themselves often require loading data.
So you get this thing where it's like, can I see this picture?
It's like, well, I don't know.
Has the author of that picture blocked you?
Oh, I got to go load the author of that picture,
and then I have to go load all the people they've blocked
to see if that person is in that list.
That's arbitrary business logic.
That's additional data fetching.
All of that has to get factored into that query planner.
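The heart of that scheduling idea fits in a few lines: anything without a dependency starts loading immediately, and only genuine dependencies are awaited in sequence. A minimal sketch with hypothetical stub loaders:

```typescript
// Minimal sketch of dependency-aware loading; the loaders are hypothetical stubs.

const loadProfile = async (userId: string) => ({ name: `user-${userId}` });
const loadFriendIds = async (userId: string) => ["f1", "f2"];
const loadPhotoAuthorId = async (photoId: string) => "author-1";
const loadBlockList = async (userId: string) => new Set<string>();

async function loadScreen(viewerId: string, photoId: string) {
  // Profile and friends don't depend on each other or on the photo check:
  // kick both off right away rather than awaiting them one at a time.
  const profilePromise = loadProfile(viewerId);
  const friendsPromise = loadFriendIds(viewerId);

  // The photo visibility check is a real dependency chain: author first,
  // then that author's block list, then the decision.
  const authorId = await loadPhotoAuthorId(photoId);
  const canSeePhoto = !(await loadBlockList(authorId)).has(viewerId);

  const [profile, friends] = await Promise.all([profilePromise, friendsPromise]);
  return { profile, friends, canSeePhoto };
}
```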
So these are just a handful of many examples of things that are built into that,
that piece of infrastructure. And even that final piece, that idea of like, actually,
there's a really high level API that sits on top of this, that allows you to sort of assemble
a high level query plan that says, these are the high level pieces of information I need,
this is actually the subset that I really care about. So you can get a nuanced understanding
of dependencies. And then here's what to do once that data is loaded. And that's what GraphQL ended up being built on top of in the first place. And then I think a lot of the work, both within Facebook in the years after we originally built that, but then even more so after open sourcing, was unpacking all of that,
like coming to an understanding of just how much was in this underlying
Ent layer that we were reliant upon that we needed to at least be able to
tell the story about to people who are using GraphQL.
And I think a happy accident of that is GraphQL is left in a very
agnostic state towards these kinds of problems, because all it is is a mapping layer between
essentially arbitrarily run compute on the server. It translates on the server side to
calling functions in some particular order. And those functions can return a promise or a task
or whatever your async primitive is.
But that's it.
There's no concept of data fetching.
There's no concept of query planning.
There's no concept of access control.
Not to say that those aren't important.
It's just that it's allowed to be agnostic to them
so some layer underneath can do it.
And that was a happy accident
because that's just happened to be
what our technology stack looked
like at the time.
So you create a system that doesn't have too many things attached, but you create a plugin system so you can easily include things like authorization. And you just didn't need data loaders at that time because Ent did all of that for you.
That was... you know, at first I didn't think open sourcing GraphQL was going to be a good idea.
We were kind of pressured into it by another team, the Relay team.
So the Relay team had built this really cool integration between GraphQL and React and had just watched React get open sourced and saw how sort of wildly popular that was.
We're like, wow, that was really cool.
We should open source Relay.
And they're like, well, open sourcing Relay makes no sense
if GraphQL isn't also open sourced.
And so they came knocking at our door and they're like,
would you guys ever consider open sourcing GraphQL?
We're like, well, you know, it's kind of complicated
and it's really kind of tied to all this Facebook-specific technology.
They eventually convinced me,
and that's what kind of led to the effort to open source it.
And not just throwing code over a wall,
but really this idea that if we gave people a big ball of PHP code,
they'd probably look at us like we were crazy and not use it.
And so instead, we kind of generalized it away from that
and made something a little more consumable
for a reference implementation.
But as soon as we had done that, immediately we start hearing about the problems that people
would have of, hey, how do you do X or Y?
Or, hey, the server that I built is really, really slow, and I'm trying to understand
why.
How did you go about solving this problem in GraphQL?
And of course, the answer was always, well, we didn't solve that problem in GraphQL. It was solved for us at some lower level. But that's kind of what
led us on this journey in working our way down the stack of abstractions that tie together and
make sure that we were telling the story of those, even if they weren't open source themselves.
Part of the way that we would do that is for companies that were really early adopters,
we would go visit them, especially if they were Bay Area adopters, we would go visit them,
especially if they were Bay Area companies.
So we would have a meeting at Intuit
and a meeting at Pinterest
and a meeting at a handful of these places
that were experimenting with this early on
just to hear them out,
hear what their infrastructure looked like,
hear what their early experiences were like
and started noticing common problems that were surprising to us.
Things that felt like not where we would have expected the bulk of the work would be.
And like, oh, that seems like an obvious thing to solve.
But as we would describe how it was solved for us, it was non-obvious.
It was only obvious to us because we had been staring it in the face for so many years that it seemed like, of course, that's the way that you would do it. But this kind of goes back
to when we had done this transition to our own programming language, we were building this Hack programming language. And one of the very first features was async await. Very few other runtimes had this kind of mechanism. And a lot of our data abstractions were
really based on this idea of having an async primitive. And you go talk to all these other
shops that are built in Ruby or Python or Java, and they don't have that runtime primitive. And
so they end up with very different architectures. And as we would describe how ours worked,
like you'd see people light up like,
whoa, that sounds really interesting.
And they started kind of jamming about
how they might go about building that themselves.
And even still sort of stuck in the mire
of the complexity of their existing stack.
And so Dan Schaefer and I would do these tours.
We would go to these companies
and kind of do the public roadshow for GraphQL.
And one day after visiting Pinterest, we had sat down with, I think, maybe 12 engineers from their product infrastructure team.
And, like, again, heard repeated things like, wow, this is the third time we've heard someone talk about this problem.
And the answer seems so simple when we said it out loud and of course we'd never really thought
about extracting this one piece out of the big you know depth of complexity of our own abstraction
layers but it felt easy enough to talk about that surely there was like one little piece here
And we literally left that meeting at Pinterest at like 2:30 and walked down the street in SoMa in San Francisco, which is where their office was, into a coffee shop, and literally opened up our laptops and started writing code. Dan started writing unit tests and I started writing the implementation, and he was just like, it should work this way. So he was building an API and writing tests against that API.
And I was coming up with a runtime model that would work in JavaScript.
And by, you know,
6 PM when the coffee shop was trying to kick us out because they wanted to
close, we had something working and that was passing all of our tests.
And it was just a matter of like writing documentation for it.
And I think I spent the next week at work writing documentation.
So like it took us, you know,
like maybe three hours for the meat of the technical implementation and then
another week to finish it up. And that was DataLoader. It's got a little bit of iteration since then, but yeah, that was like a week of work, and it's got maybe 10 times as many lines of documentation as code.
And I think I recorded a video about it as well, just like explaining how it worked.
And now there's like data loader equivalents written in like 20 different programming languages.
And like, that's exactly what I hoped happened.
Like, couldn't have hoped for a better outcome is people stole the technique, not the code.
It's like, great, I don't want to maintain the code.
I want to get the technique out there.
And people have leaned on that as a way to power the GraphQL engines in the same way
that we did early on at Facebook.
Certainly not the only way.
And again, GraphQL is built in a way that's agnostic.
You can power it in lots of different ways.
But certainly a very viable one
that worked quite well for us at Facebook
and seems to be working well for plenty of other people.
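For readers who haven't seen the technique, here is a minimal sketch using the open-source dataloader package: resolvers ask for individual keys, and loads issued in the same tick are coalesced into one batched fetch, with repeated keys de-duplicated. The batch function here is a hypothetical stand-in for a real data source.

```typescript
// Minimal DataLoader sketch: per-request batching and de-duplication.
import DataLoader from "dataloader";

type User = { id: string; name: string };

// Hypothetical batched fetch, e.g. SELECT ... WHERE id IN (...ids).
async function fetchUsersByIds(ids: readonly string[]): Promise<User[]> {
  return ids.map((id) => ({ id, name: `user-${id}` }));
}

// DataLoader expects results in the same order as the requested keys.
const userLoader = new DataLoader<string, User>(async (ids) => {
  const users = await fetchUsersByIds(ids);
  const byId = new Map(users.map((u) => [u.id, u]));
  return ids.map((id) => byId.get(id) ?? new Error(`No user ${id}`));
});

// Resolvers can ask for users one at a time; these three loads trigger a
// single call to fetchUsersByIds, and the repeated key is served from cache.
async function demo() {
  const [a, b, again] = await Promise.all([
    userLoader.load("1"),
    userLoader.load("2"),
    userLoader.load("1"),
  ]);
  console.log(a.name, b.name, again === a);
}
```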
Yeah, it's like 200 lines of code.
Like I think I've seen the source.
It's a tiny, tiny library, right?
Yeah.
And now I think Yelp has like a library
to like auto-generate data loaders
based on your database schemas.
And then like the rabbit hole goes on and on and on.
Right.
Exactly.
Yeah.
Nobody writes data loader code at Facebook, right?
Like they have something similar to what you just described.
There's, you know,
you just describe at a high level how you access the stuff, and a lot of
this low level stuff gets auto generated.
Yeah.
I guess our company is still stuck in the old ages where we have to write
custom data loaders and like have to fix that at some point.
That's fine.
Yeah, yeah, yeah.
It's fascinating to me.
You formed the GraphQL team in 2013.
Our company was founded in 2017.
And it's purely GraphQL based.
So things move so fast in the technology area.
And that actually brings me to a question.
We exposed
our first public API endpoints recently in GraphQL because we like GraphQL. I think it has
a bunch of benefits for exposing public APIs. You can actually see what fields people are using,
how often they're using them. But of course, you can be worried. Nobody has as much experience
with GraphQL as compared to REST.
And that's just a model that more people are used to.
I'm sure you had to think
about that trade-off as well, right?
While building this,
it's like people will feel confused
because they've never seen
this language before.
How would you weigh that trade-off?
Yeah, I think there's a handful of things
we have to keep in mind.
For us, what was very
helpful was starting
small and expanding deliberately.
So
for us, that was starting with
the one thing that we wanted to ship, which was
that News Feed app. So we only
built out as much API service as was
necessary strictly to make that
work. And that meant the set
of people who were consuming that
API were all like, I could, you know, ball up a piece of paper and throw it and hit one of them.
So we didn't have to worry too much about somebody who we had less of a contact with
getting confused. Later, we did have that problem. But because we were a little bit more deliberate about how it spread, we could get a handle on that.
I think there's a very different problem
of how you expand something like GraphQL
within a company, even a big company,
versus something that is public.
Because at least within a company,
you have some organizational tools you can lean on
that you might not be able to have
for a truly public API.
But I think there's some corollaries there. So most of my stories are, of course, in how to
roll it out within an internal company, because that's what we did. But we started with
appreciating that we also didn't necessarily have the right answers. So by going slow,
we would make sure that each decision we were making along the way
seemed well-reasoned. And we would roll out sort of one new feature at a time. We would
choose one person from the engineering team that was working on that feature to be,
you know, wear the GraphQL hat. They got to be the final decision maker about that API.
And they would come to us whenever they needed help. So we would have office hours every day.
So every day after lunch, someone could show up to our area.
We would hang out with them for up to two hours.
And so our whiteboard was just filled with crazy API design ideas
and it would go back and forth.
And we would, you know, stuff that everyone agreed
was the right thing to do, we would do.
And if we had an idea later on how to make it better, we would do that.
And it was great that we would fold one of these engineers from
outside our team into this process and let them really hold the torch and kind of own the conclusion.
And as that expanded, what we ended up with was we noticed that a handful of these people kept
coming back. They'd go build, someone would come from the photos team and they'd be building one
feature and they'd come back three months later and they'd be building the next feature. And they would,
that same person would have come back and they'd be like, Hey, remember how we were doing this
thing about how to do tagging? And we came up with this data model. I found a way where it breaks
for our new feature, you know? And so there'd be this ongoing thread of API evolution with a
relatively small set of people. And as it continued to expand
and ultimately got to the next phase where we're sort of, you know, open entry, anyone could add
to it. We had minted all these experts spread out across our company. And that was extremely useful.
And we had this sort of, you know, almost like judicial, what's the right word, the existing case matter, right, of all these decisions that had been made before. Precedent, that's the word I'm looking for. All this existing precedent of decisions that new decisions could then lean on. And so we wouldn't end up reinventing the wheel. So a little bit orthogonal to what you're asking about, but I think what's also kind of important is to be intentional about how this rolls out, start small, and just, you know, be ready to say that you don't have the answers. Then you want to think about sort of the operational cost
of running a GraphQL service,
which is more complicated in some ways,
simpler in some ways,
but more complicated in some ways than REST APIs.
It's just a different surface area.
And so really kind of knowing what you're going to get there
is important.
Attack vectors, you know, making sure someone can't DDoS you,
but also having good visibility. So if something goes wrong, you know, making sure someone can't DDoS you, but also having good visibility.
So if something goes wrong, you can see it. And none of this stuff necessarily comes for free out
of the box, you have to kind of intentionally make sure you're building it. And there's certainly a
lot of companies out there that are ready and willing to help with that for a fee. And some
of these things are actually kind of straightforward to build just from first principles.
Yeah, like there's like with, with GraphQL,
if you think about query costing,
which is not really a thing with REST APIs because you know exactly what you're
sending.
Yeah. There's a whole gamut there too. You know,
this is yet another one of these examples of things that lots of people are
asking us about. Yeah. How do you, how do you do query cost analysis? And we're just like, we never did any of that. You know, like, wait,
how did we get this far without ever having to address this problem? And so sometimes that's
actually, you know, I've come to really appreciate that that's a fantastic thing to be celebrated
when you realize you've sidestepped an entire problem because of some decision you made earlier
and to try to really consciously figure out, like, what was it that allowed us to sidestep it, rather than immediately facing it and trying to solve the problem. It's way better to make the problem go away than it is to solve it. So for example, for query cost, you say, okay, well, why is it that you need to care about query cost? Well, some attacker could send a query that is intentionally, maliciously designed to maximally consume costs. And I want to protect against that. Okay, that's fair. Or, I have a semi-public API, like it's, you know, third party, but it's technically a private API, people are
paying per use. And I want to make sure my customers are aware of the cost that they're
about to incur on themselves and don't end up in a situation where I have a tough conversation with a client that's going to be unfortunate.
So there's a lot of very different kinds of reasons why you might care about query costs.
And your solve might be really different for each.
We shared the idea of we didn't want an attacker to do something bad.
And so what we would do is
this idea of query persistence. For us, actually, it was much more about not the runtime cost of the query itself, but all the other pieces around GraphQL: all the parsing time and the validation time, and we wanted to cut down on all that, or even just the query upload time. It turns out that a sufficiently complicated GraphQL query can be quite a number of bytes, especially in the days before we had fragments, where you just had to manually unfold all the things that you would have wanted as fragments. Those could be quite big, and mobile upload bandwidth is an order of magnitude slower than download bandwidth, at least, if not two orders of magnitude. So we were seeing cases
where query performance was quite poor. And when we did an end-to-end analysis of where all that
time was being spent, it was being spent in upload. It was being spent uploading a query.
And that's silly. That is the least useful part, especially since there's all of these iOS apps spread out across millions of
little devices in people's pockets, sending exactly the same query up, you know, like,
maybe it's a little bit different version to version. But otherwise, this is exactly the
same thing. So this is where the very, very early idea of query variables came from. At that point,
it was called query templates. And so the idea is that you would,
the query template would still live on the client.
It was a step during build.
So at build time, you would send up all of the queries within the code base to a server endpoint that we wrote.
And if the server had seen them before,
then it would return you a unique ID
that represented the one that it had seen before.
And if it hadn't seen them before,
it would make one of those unique IDs and send that back to you. Either way, now the client-side app can say, great, rather than sending this giant ball of query code,
I can just send this 32-bit integer and that uniquely recognizes this particular instance of this query. And then that alongside your query variables or your parameters probably fits in one packet, one upload packet.
So wildly better upload performance.
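That mechanism is roughly what the community now calls persisted queries. A rough sketch of the two halves, with hypothetical names; the real system handed out server-assigned ids, whereas this sketch uses a content hash just to stay self-contained.

```typescript
// Rough sketch of build-time query persistence (names are hypothetical).
import { createHash } from "node:crypto";

// Server-side registry: id -> full query text, populated at build time.
const persistedQueries = new Map<string, string>();

// Build step: every query found in the client code base is registered, and
// the shipped app bundles the returned id instead of the query text.
function registerQuery(queryText: string): string {
  const id = createHash("sha256").update(queryText).digest("hex").slice(0, 12);
  persistedQueries.set(id, queryText);
  return id;
}

// Runtime: the client sends { id, variables }, which fits in a single upload
// packet, instead of the full query document.
type PersistedRequest = { id: string; variables: Record<string, unknown> };

function resolveQueryText(req: PersistedRequest): string {
  const queryText = persistedQueries.get(req.id);
  if (!queryText) {
    throw new Error("Unknown query id: only pre-registered queries are run");
  }
  return queryText; // hand off to the normal parse/validate/execute pipeline
}
```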
So we had sidestepped this problem of abuse, kind of.
You could still actually, like if you, you know, decompiled our app and looked at what was going on, or even just looked at traffic, you could probably figure out that there was this thing going on, that it was called GraphQL, because our endpoint was called /graphql, and you could probably poke at it and eventually figure out what it was doing and then come up with some attack. We actually had logging in place to look. We didn't block it, because we wanted to know if someone was going to do that first, so we set up logs. And we actually set it up to email us if this ever happened, because we wanted to celebrate whoever would be the first person to figure this out. And it took two and a half years before that log got tripped, so we were very disappointed, very disappointed. But eventually it did. Still, we had pretty much sidestepped that problem. So, you know, I talked to some folks who are trying to figure out really expensive ways to do query cost analysis, or I want to
make sure my query is maximally this deep or this wide or like lots of different things like that.
It's like, why don't you just whitelist your queries? Like here's the sanctioned set of
queries you're allowed to run that you've pre-vetted and you know are going to run in a
reasonable amount of time
And for, I don't know, like two-thirds of use cases, that's sufficient, because all of your use cases are your internal product. And literally the cheapest possible way to do that is there's a directory somewhere in a code base that's shared between your runtime and your clients, where you just literally write GraphQL text, and then you refer to the file name or something. There are very, very, very simple ways to do this, and then there are slightly more sophisticated ways, like the way that we had done it at Facebook. And then if you're really worried about it,
the one mechanism that we had was actually not GraphQL specific. We did not build this; this is another example of us leveraging existing technology that we had. It applied to everything that would run on our API domain. So, you know, we served this from api.facebook.com/graphql.
Every endpoint that came from that domain had a super basic timeout mechanism.
If your query ran for more than two seconds, it was just hard killed.
And you'd return an error code.
And there was something above the application code layer that would enforce that, to make sure something couldn't spin loop and get around it.
And then there was a graph that would show you how frequently that happens.
And we ended up making a GraphQL specific version
of that graph.
So if there was a specific query
that was timing out more so than others
that we could point people
who were authoring those queries to them
to help them spot those.
But that was a great,
like extremely basic abuse vector mitigation
because if someone wanted to do something that was DDoS-ish in shape, they would very quickly hit this timeout. And we knew that we had a lot of capacity to manage many, many, many two-second-long queries.
But then on top of that, our security team would use that as a signal that went into their security model. So they would say if the
same, you know, API key keeps sending timeout queries over and over again, we're going to
throttle it like they're going to send a query and we won't even run it, we'll just return,
hey, you've sent too many API requests, please come back in, you know, five minutes before you
send your next one. And if they still do it, we'll kill the API key altogether and then they won't be able to send requests at all.
And if they've got some bot
that's just generating API keys,
we'll probably spot their IP range
and we'll kill the IP range.
And, you know, if that's not it,
like the security team has got a whole plethora of tools
that they can use for that, right?
But again, like it's another great example
of making sure the lines of abstraction
are in the right place and clarifying: is this a GraphQL problem, or is this a general problem? You just want to make sure you have a good bridge, where GraphQL is one particular kind of request amongst many other kinds of requests that can share this shared concept of request cost analysis, of just purely how long do you let it run.
I think later it got slightly more sophisticated where we also looked at how much data is being loaded because we were worried about the ability to kind of run loop the database layer
rather than the application layer.
And so there was threshold set for that as well.
But again, that had nothing to do with GraphQL.
That was generalized and actually even applied
to just a typical web endpoint.
You know, you just load facebook.com.
You get exactly that same case.
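That timeout layer was not GraphQL-specific, and the general shape is easy to reproduce: wrap whatever executes the request in a hard deadline and surface an error when it trips. A minimal sketch with hypothetical handler names; note that a Promise.race only reports the timeout, while actually killing the work, as described here, needs support from the layer underneath.

```typescript
// Minimal sketch of a hard per-request deadline (names are hypothetical).

class RequestTimeoutError extends Error {}

async function withDeadline<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new RequestTimeoutError(`exceeded ${ms}ms`)), ms);
  });
  try {
    return await Promise.race([work, deadline]);
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for the real request handler (GraphQL or otherwise).
async function executeRequest(body: unknown): Promise<unknown> {
  return { data: null }; // stub
}

async function handle(body: unknown) {
  try {
    return { status: 200, data: await withDeadline(executeRequest(body), 2000) };
  } catch (err) {
    if (err instanceof RequestTimeoutError) {
      // Count these per query and per API key: a graph of frequent timeouts
      // points authors at slow queries, and repeated timeouts from one caller
      // can feed a rate limiter as an abuse signal.
      return { status: 503, error: "query timed out" };
    }
    throw err;
  }
}
```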
Yeah. Okay.
It sounds like, at least with the open source thing, right, you exposed, like, you added GraphQL as an open source library, you spoke about it, tons of people started using it. How do you foster a good community after that? Right, there are a lot of people who have a lot of excitement, but then how do you make sure that the community is building in one strategic direction, or doing the right thing in your opinion? Like, what are the challenges that come in with that?
Yeah. First I'll say that's a very fortunate problem to have, and not the one that most open source authors find themselves facing. For most of them, it's how do you get energy around the project in the first place. So I want to acknowledge that it truly exceeded our expectations. You know, we presented this for the first time, actually before we were considering open sourcing it, at the very first React conference that Facebook hosted. And we decided that we wanted to talk a little bit about GraphQL only because we were talking about Relay.
And we got this super positive response to it.
And people were starting to reverse engineer GraphQL.
And we're like, this is nuts.
And, OK, maybe this thing the Relay team has been telling us about open sourcing, they're probably right. We should follow their good wisdom here and pursue this.
And so we invested time in the open source effort.
And then we launched it later that same year at the React Europe conference, which was the very first of that conference.
And so I gave a talk about the details of the open sourcing
and sort of what we had changed
and what was actually getting released and where you could find it.
And even just within that, you know, half-week conference, we were getting really excited folks coming up asking how they could help.
And, you know, maybe par for the course.
Sometimes people are just polite and they say, you know, thanks for a great talk.
And so I was ready to kind of chalk it up to that.
But I attended another conference a number of months later and had a handful of people come up to me and say, hey, can you come check out the thing that I built? People working in GraphQL, or GraphQL-adjacent, were showing me cool things that they had built that were essentially like GraphQL DevOps, or, you know, tooling around this outside of the context of Facebook, which was super exciting to me. And so, you know, in hindsight, looking at this, I think a few things. One is making sure that your voice is heard and clear
and factors in the opinions of
others. So a lot of what I would do is I would, I would talk about the stuff and then I would hear
what people were saying about it, what was concerning to them, what parts were they excited
about that they wanted to hear more of. And then I spent a lot of time and energy in making sure
that I was getting conference talks in front of the right audience.
And I was taking that pretty seriously.
I wanted to make sure that I was not only speaking at React conferences, because I was actually kind of concerned from those first two, that people would look at this and say, this is a JavaScript technology.
And this is a thing for React.
And it was like, no.
When we built, the original idea was that this was a thing that we had made for our iOS app.
The fact that a bunch of React engineers are excited about it is great.
But like, that's actually not the reason why we built this in the first place.
So I wanted to make sure that we were getting this in front of other kinds of engineers.
And spent probably a good year and a half, maybe two, talking to backend engineering communities, mobile engineering communities, web engineering communities, DevOps communities,
just like trying to explain what it is
and explain my vision for it.
And that really helped
because then when a problem,
I knew it was working
when I saw other people giving talks on GraphQL
that used almost word for word phrases
that I was using in my talks.
That was like a great success. It wasn't
like, oh, that person's copying my talk. It was like, no, fantastic. I have people getting the
exact same messaging around the project and the vision setting that I would have said myself.
And so I can kind of take a step back from doing the conference talks and let more and more people
talk about it and give their own spin. And then the other piece was making sure that for the people who were excited and wanted to contribute,
that there was a path in for them. And that's really hard to do. Because a lot of the work
that we needed to do was actually not really technical. It was about getting the documentation
in a better place. How do you integrate with other technologies? A lot of this is the stuff we were talking about before, like query planning and cost analysis, and all of the DevOps tooling around it: how does all this stuff work? And making sure that the people who were working on this were kind of in lockstep with one another. And then, as people did have good feedback for GraphQL itself, it was clear to me that the model of, you know, a traditional open source project, where you send a pull request and someone reviews it and then merges it, doesn't work for something like GraphQL, because GraphQL is not one code base; there's a specification and many code bases
which follow that specification. And I'd spent a fair amount of time in TC39, which is the steering committee that oversees
JavaScript, and tried to basically emulate and borrow ideas from that, which I had found
reasonably successful.
And that was the beginnings of what became our working group.
I think our working group started in early 2017
was our first working group call,
which was about a year and a half
after we had open sourced it.
So, you know, it took us a while
to figure out the right model.
That's quite a long time,
but eventually we figured that out.
And now that has become the true center of gravity
for how the project evolves.
Can you elaborate a little bit
about steering committee, right?
Like TC39, like what
do these committees do just at a higher level? Yeah. So their primary purpose is to make sure
that these broadly used technologies evolve in a healthy way. And that charter can mean a lot
of different things. So, you know, TC39 has this, and I've also written this for GraphQL: a set of principles. And that also really helps in tying things back when someone has a really cool idea, to say, does this align with what we said we wanted to do here? And one of our very first
principles is favor no change, which is really frustrating for somebody who wants to show up and
add something new. But as a user of GraphQL, what you really don't want is for that thing to be
changing out under your feet all the time. And so this is a little unintuitive, especially for the
people who are involved and are motivated to make improvements, that the most important thing that we can do is ensure a sense of stability to the technology
and give someone a sense that this is not a fad. It's not a thing that's going to evolve
major version to major version multiple times over a month by month. And if you build something
big and significant on top of it, it's going to keep working for many, many years. And that's a very significant
responsibility. So that means rather than going straight to code, we tend to spend a lot more time
looking at problems. What problems are people facing? Is this a real problem? What is the
design space that we can work in around these problems?
We'll sometimes spend an entire year for a particular proposal, just making sure we really
understand the problem space. Because the cost of a mistake is really high. And the cost of doing
nothing is actually not that high. Our principle is to do nothing, right? Like, that's our default: only make a change if we're very confident that it's the right thing to do.
So the working group has really embodied
these values and these principles.
A lot of familiar faces will show up to those meetings,
but also a lot of new people will show up
to those meetings with great ideas.
And a lot of it is helping people guide
through this
process that is new for many open source contributors of how to get your proposal
into a really kind of a bulletproof stance, how to make sure you understand not just its impact
on a bit of JavaScript code, but on the impact on many other bits of code on the specification
itself on the runtime environments on the tooling environments. So it's not just about getting the change landed, but then making sure that the ecosystem effects
are well managed, documentation is in place. And the working group oversees all of this.
It's a completely volunteer-based thing. It's open source in that true sense where it's truly
folks who show up who care about it and volunteer their time to facilitate this and to,
you know, work through the proposals. If you want to answer this, like feel free not
to answer this question. What's one change you're very proud of that you didn't make
to the specification? Something that looked really interesting, but you decided let's not do this.
There are plenty. Actually, a lot of them were from early on. So, you know, I had a
mini version of this process earlier on. The version of GraphQL that we used in 2012, all the
way up really through 2015 internally at Facebook was pretty different from what we use today, what we see GraphQL as today. I spent that window of time between us talking about GraphQL
at the first React conference.
I think that was maybe January or February 2015.
And when we open sourced it, which was July or August of that same year,
basically that window of time was a massive redesign of GraphQL.
Like, basically, me looking at what had unintentionally evolved into a state, and being kind of not really proud of that, and not wanting to put that out into the world, and wanting to make sure that what we put out was good. And so it's like, okay, well, if the problem was that this thing kind of evolved into its end state, then me just taking a hatchet to it is likely to result in just the same bad result. It'll just feel novel because I just came up with it all, but it'll end up being kind of equally poor.
And I wanted to avoid the so-called second system syndrome, which is where, you know, you build it the first time and you kind of don't know what you're doing.
And then, you know, you look back and you're like,
I would have made all these changes.
And then you do it a second time and you think that you know exactly what you're doing
because you've already done it once before.
And you make way more mistakes because you are overconfident
and you end up with something way too complicated.
And I almost did that.
But we ended up running this process where I would write proposals. So I would write a white paper for every change that I wanted did that. But like, we ended up running this, this process where I would write
proposals. So I would write a white paper for every change that I wanted to make. Why what
problem it was solving, why it was important change to make and what it would look like,
here's the alternatives and why we're not doing the alternatives. And then I would shop that
around a bunch of engineers that I trusted, certainly our core team, you know, I put it in
front of Nick and Dan, but I would also go around to other very senior engineers across the company and say, does this make sense to you?
Like, please poke holes.
And they would.
And I would say maybe one out of four of the proposals that I made stuck.
One of them was I wanted to get rid of the curly braces.
I was spending time in Python and a couple other languages.
And I was like, this is the future.
Like people are going to write languages that don't have these weird curly braces.
Everyone's like, nope, you're nuts.
That's wrong.
Like the curly braces are fine.
And I think they were right.
I was probably wrong on that one.
And I'm glad they talked me out of it.
That's one example.
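As a purely hypothetical aside, since the proposal was rejected and indentation-based syntax was never part of GraphQL, the contrast Lee is describing would have looked roughly like this:

```graphql
# Today's GraphQL, with curly braces:
{
  me {
    name
    friends {
      name
    }
  }
}

# The rejected idea, very roughly: significant indentation instead of braces.
# (Invented syntax for illustration only; this is not valid GraphQL.)
# me
#   name
#   friends
#     name
```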
Another one, in hindsight: I think there are plenty of things that I got wrong in that process too, that now we're kind of in the midst of trying to fix. One was interfaces versus unions. I got pushed back really early on that we should not have unions, that there should only be one way to describe a type abstraction. And I was pretty, pretty adamant that this difference mattered: something declares that it implements an interface, and the relationship with a union is exactly the other way around. Something doesn't know whether or not it's part of a union, and the interface doesn't necessarily know all the things that implement it, but a union definitionally knows all the things that are in that union. And that reversal felt critical and a very different piece of type design.
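To make the distinction concrete, here is a minimal sketch with hypothetical types: a type opts into an interface by declaring that it implements it, while a union is defined by enumerating its members, which know nothing about the union.

```graphql
interface Node {
  id: ID!
}

# The object type declares the relationship: it says it implements Node.
# The Node interface does not necessarily know every type that implements it.
type User implements Node {
  id: ID!
  name: String
}

type Comment implements Node {
  id: ID!
  body: String
}

# The union declares the relationship in the other direction: it enumerates
# exactly which types belong to it, and User and Comment are unaware of it.
union SearchResult = User | Comment
```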
And so people eventually believed me and shipped it.
And I think it's produced a lot of value,
but we've been now going through a lot of iteration
around that space and landing on things these days
that I think may be better.
But just having to work our way around interfaces and unions
has been pretty challenging.
And so I'm thinking maybe that's something I would have done differently earlier.
Another example is how we manage nullability.
I'm actually quite happy with the change that we made when we made it.
Previously, there was no modeling for nullability at all. Every field was nullable all the time; there was no concept of null versus non-null. It was very controversial to introduce the idea of a non-null field, and to this day it remains a point of confusion for newcomers to GraphQL, so you know, the controversy still stands. A lot of people, the thing that they're confused about is, why is it backwards? Why is this thing not just non-nullable by default, where you have to explicitly say that it's nullable? That's how many other languages work. There are firm reasons for that. But the big miss, I think, was making
this purely a part of the schema design and not something that the client could control.
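For context on the schema side of this: in GraphQL's schema language, fields are nullable by default, and a trailing exclamation mark opts a field into being non-null. A minimal sketch with hypothetical fields:

```graphql
type User {
  # Nullable by default: the server may return null here without failing the query.
  nickname: String

  # The trailing "!" marks the field non-null: the server promises a value, and if
  # resolving it fails, the error propagates up to the nearest nullable parent field.
  id: ID!
  name: String!
}
```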
And we actually have a proposal that is most of the way through the process, a very compelling proposal at the moment. We're now kind of poking holes in it, trying to make it bulletproof. It's readdressing that and trying to reintroduce the concept of client-controlled nullability, which is super exciting to me.
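The client-controlled nullability proposal would let the query, not just the schema, state how important a field is. A rough sketch of the idea, using the designator syntax from the RFC as it stood around the time of this conversation (the syntax may change, and the field names are hypothetical):

```graphql
query GetUser {
  user(id: 4) {
    # "!" asks for this field to be treated as required for this query,
    # even if the schema declares it nullable.
    name!

    # "?" marks the field as explicitly optional, so a failure here can be
    # tolerated as a partial result rather than failing a larger subtree.
    profilePicture? {
      url
    }
  }
}
```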
There were other things that we punted on early on that
actually didn't come from me, came from other sources that I was hesitant about, and I'm quite
happy that we did not do. One example was parametric types. Connections are an important primitive in GraphQL for doing pagination, and people wanted the ability to say literally Connection, angle bracket, T, closing angle bracket, and then to parametrize those per use case: you could have a connection of User, a connection of this and that. But it was always really hard to explain exactly how that would work. At a high level, you can see why that would be useful as an organizational technique, but then, okay, mechanics: what is this actually doing? It was always really hard to piece that together in a way that didn't feel super complicated.
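A rough illustration of the difference: the generic syntax people were asking for, versus the concrete per-use-case connection types that schemas spell out today (the generic form is invented for illustration and was never part of GraphQL; the type names follow the common cursor-connections convention):

```graphql
# What was being asked for, roughly (not valid GraphQL):
# type Query {
#   friends: Connection<User>
# }

# What schemas actually do today: define a concrete connection type per use case.
type Query {
  friends(first: Int, after: String): UserConnection
}

type UserConnection {
  edges: [UserEdge]
  pageInfo: PageInfo!
}

type UserEdge {
  node: User
  cursor: String!
}

type PageInfo {
  hasNextPage: Boolean!
  endCursor: String
}

type User {
  id: ID!
  name: String
}
```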
And if I've come to any conclusion after spending so many,
so many years steeped in this,
it's that type systems are really hard.
Like, all the complexities and the bugs and the mistakes within these runtimes come from expecting something to be of one type when actually it's of another type.
And all of our most complicated validation rules
are about asserting that queries are fitting the particular shape that they said that they would fit.
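As a small example of the kind of shape-checking Lee is describing, here is a hypothetical schema and an operation that validation rejects before execution, because the query asks for a field the type does not have:

```graphql
# Hypothetical schema:
type Query {
  me: User
}

type User {
  id: ID!
  name: String
}

# An operation that fails validation against it: "email" is not a field on User,
# so a validator rejects the document before the query ever runs.
query {
  me {
    id
    email
  }
}
```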
And every time we make the type system more complicated, that surface of code gets more and more complicated.
And so there was a pressure to keep the type system as simple as possible.
And we made it slightly more complicated in the course of the pre-open source work, hopefully in valuable ways,
but we also kicked a lot down the road. That was just one example of many others that would have made the
type system nominally more powerful, but also quite a bit more complicated. And I think biasing
towards simplicity there was the right call. Cool. Yeah. You've already basically answered
what my next question was going to be, which is like, what are you excited about for like the evolution or like the next step of GraphQL?
You mentioned one thing, which was client-controlled nullability.
Is there like something else that you want to discuss?
Yeah, I feel like the GraphQL runtime itself and the query features, actually, these days,
there's more changes and proposals to that than there ever have been. And yet, at the same time, paradoxically, it's actually not where most
of my excitement is. Not to say that I'm not excited about all those changes. In fact,
like client-controlled nullability, super exciting to me. There's stream and defer,
which is also super exciting to me. There's many very, very cool proposals that are happening
at that layer. But for me, it's actually the ecosystem.
That's really where I think the value is.
Like giving people one extra technique for their querying is a nice-to-have, but it's not a game changer in the same way. I think there's actually way more ground to cover than there is ground that has been covered in terms of tying many different schemas together: not just schema federation within one organization, but the internet of data. How do we interconnect data sources and use GraphQL as a primitive for that? There are really early conversations around how that might work, and that to me is super exciting. And then there are all the tools that are getting built out now. Half our conversation early on was about these sort of fundamental ideas of operational excellence and setting up a GraphQL service, like how to make sure you're not screwing
something up. And there's a handful of companies now that will help you with that, but just a
handful. And I think there's fertile ground to build a whole ecosystem of pretty incredible tooling that has GraphQL as a core primitive within it. And that, to me, is actually where most of the exciting work is happening. And that all kind of harkens back to this principle that the core of GraphQL, the spec of GraphQL, is not there to just add feature after feature after feature. It's to hold firm as a stable base. And this top principle of stability, the intent of that was that a stable base allows you to build other things on top of it. And I care way more about what people do with GraphQL than I care about GraphQL itself. I'm very excited to see
that that has yielded
fruit and people are now building
very, very interesting things and now
layers upon layers of tools and
technology that leverages that.
It's very exciting to
see that that piece of
the vision is starting to play out today.
Yeah. I spoke to the founder of Hasura, where you can take a database schema and sort of auto-generate a GraphQL API out of it. You can also take that to the next level, right? Like, from a GraphQL API, I hope someday you can auto-generate React components. And that way, if you combine Hasura with that new technology, you can just have a database schema and boom, you have a UI. There are so many interesting possibilities. Yeah. My long, long-term hope
for GraphQL is like, actually, you know, I'm very happy that 10 years after creating it,
it's not just that people are still using it. It's become a really common tool and maybe even ubiquitous.
That makes me very happy. But if we're still using GraphQL in the same capacity
10 or 20 years from now, I'll be a little underwhelmed. That's not the long arc I want
for it necessarily. I'm sure plenty of people will continue to use it, but I would much rather people steal the ideas from it. And that's the thing: all this stuff gets to evolve, and we get to steal ideas from old technologies and apply them to new ones, and figure out, all right, where does this thing break and when is it time for a new technology, and we can start to evolve from there. Hopefully it's modestly paced and deliberate, and not crazy like a lot of the other tech stacks that we work with every day are.
But my parallel in a lot of this is SQL.
SQL is now many, many decades old.
Absolutely still in use.
Is it the way that everyone builds everything?
Of course not.
But have the ideas that it popularized
become pervasive in technologies
maybe not even related to SQL anymore?
Absolutely, including GraphQL.
And that to me is the true legacy of SQL.
It's not that it's the SQL query language specifically
and what features that it's added over time.
It's the like broad impact and shift
that it's had on the technology ecosystem.
And if I could have like one long-term wish
for the GraphQL project and its long life,
it's that it has that.
That's really what I hope for.
Yeah, I think that's a great vision to have.
And I'm sure you'll also see versions of things like a super GraphQL that's GraphQL without braces. But I'm
sure people will build stuff that's even better than that. Well, Lee, thank you so much. This
has been an amazing conversation. Thanks so much for being on the show. My pleasure.