The Changelog: Software Development, Open Source - The principles of data-oriented programming (Interview)

Starting point is 00:00:00 This week on The Changelog, we're talking about the principles of data-oriented programming. Yehonathan Sharvit, author of Data-Oriented Programming, joined Jared to talk about the virtues of treating data as a first-class citizen in our applications and the four principles that make it possible. A massive thank you to our friends and partners at Fastly and Fly. Our pods are fast to download globally because Fastly is fast globally. Check them out at Fastly.com. And our friends at Fly let you put your app and your database closer to users all over the world. Zero DevOps. Learn more at Fly.io.

Starting point is 00:00:51 This episode is brought to you by our friends at Square, developed on the platform that sellers trust. Here's what you can do with Square. You can bridge more experiences. You can build online, mobile, and in-person commerce experiences that connect more customers and sellers. You can build custom booking solutions. You can create and track orders. You can accept payments. You can manage and curate inventory. You can organize customers.

Starting point is 00:01:09 You can manage employees. You can extend Square gift cards to your app. You can use Afterpay. And all this is powered by the world-class Square APIs and SDKs that enable you to build full-feature business apps for yourself or millions of Square sellers. So much is available as a Square Solutions Partner. Learn more and get started at changelog.com slash square. Again, changelog.com slash square. All right, I'm here with Yehonathan Sharvit, author of Data-Oriented Programming, published by our friends at Manning in July of 2022. Welcome to the show. Welcome. Very happy to be here with you.

Starting point is 00:02:15 Happy to have you. So, data-oriented programming, or data-oriented programming, depending on your affectation, a concept that I hadn't really heard about. I feel like I've heard about lots of different things. And so maybe like a niche area that you're trying to expose. Tell us about why you decided to spend, it looked like 18 months in early access and finally published. You have a real hard bound book that physically is in your hands right there, which is really cool. Why did you decide to write this book? All right. But before that, let me ask you

Starting point is 00:02:45 a question back. Okay. If you don't know what is data-oriented programming, do you agree that it sounds sexy or cool? I do. I think it does sound intriguing. That's why I was like, yeah, we'll have you on the show. It sounds cool. Yeah. So that was my feeling when I started to write a book that nobody really knows what it is, but everybody thinks they know what it is. And if you would ask a developer in the street, what do you think about data-oriented programming? Is it a good paradigm or a bad paradigm? They say, for sure, it's a good paradigm. And when I started to write the book two and a half years ago, I did the research on Google, and I was thrilled to discover that there were no hits on Google

Starting point is 00:03:30 for data-oriented programming. Zero. Nothing. Then I said, wow, that's amazing. I can be the reference, the guru of data-oriented programming. You can be the one. Yeah, exactly. And then I looked on Wikipedia.

Starting point is 00:03:46 Of course, there were no articles about data-oriented programming. So I wrote an article on Wikipedia about data-oriented programming. And I said, wow, I'm going to be famous. But can you guess what happened? No. Wikipedia refused my article. They took it down. They said it wasn't good enough.

Starting point is 00:04:04 Yeah, they said, in order for a topic to be worth having a page on Wikipedia, you need to provide secondary sources, meaning you need to find books or articles that are not about data-oriented programming

Starting point is 00:04:19 that mention data-oriented programming and it was obviously not the case. And I cannot use my book as a reference for my topic. So right now, my article on Wikipedia is not there. Now, seriously, I am a member of the Clojure community. I've joined the Clojure community 11 years ago, in 2011. No, 2012.

Starting point is 00:04:48 And Clojure is a data-oriented programming language. It's a data-oriented programming language, and it was marketed as so. So when I write Clojure code, even if I don't know what is data-oriented programming, in fact, I do data-oriented programming. And my first attempt with Manning was to write a book about Clojure. And it was a total failure.

Starting point is 00:05:12 Nobody purchased the book. I mean, maybe we sold 200 copies. But the folks at Manning were intrigued by the Clojure paradigm and they said, you know what? We liked working with you. The book didn't go well. So you know how they work.

Starting point is 00:05:27 They stop before the book is published. They do like early release of the book. And if it doesn't work, they stop. We didn't go into full production. It was just a couple of chapters. But they said, we want to work with you. Can you suggest another topic? And then I did a trick.

Starting point is 00:05:44 I said, okay, I cannot convince people to get interested in Clojure. I'm going to give them Clojure without Clojure. Clojure spirit without the syntax. And that's exactly data-oriented programming. It's the principles behind the Clojure philosophy. I see. So you've also tricked me.

Starting point is 00:06:07 We're doing a Clojure episode. I didn't realize it, but here we are. We're doing Clojure. Too late. Too late. Yeah, we're here now. Yeah. And it seems that people are interested

Starting point is 00:06:20 and intrigued by this topic. I got a lot of responses and questions from readers all around the world. And it's not really new. I didn't invent anything. And Clojure also didn't invent anything. It just pulled some best practices from many, many languages

Starting point is 00:06:39 and made them into a coherent whole. So if we take something we don't understand, data-oriented programming, and compare it to some things that we may already understand, many of us understand object-oriented programming. Others of us also understand functional programming. And we try to fit this in somewhere amongst things that we already kind of understand.

Starting point is 00:07:05 Is it set against object oriented? Is it set against functional? Is it set inside these things somewhere? How does it relate to these paradigms that we're familiar with? Okay, great question. It's both, both against, both fit with. And you know, in 2022, it's very hard to put a clear distinction between paradigms and languages. For example, Java, which is the object-oriented programming language,

Starting point is 00:07:33 supports functional programming. And even Clojure, which is totally functional programming, supports kind of object-oriented. And JavaScript, you can do both. So it's not about languages. I think nowadays, most modern languages support many, many paradigms, but some languages guide you into... In some languages, it's more natural to use this or this paradigm. So what is data-oriented programming? Data-oriented programming is a set of principles that makes it pleasant and effective for developers to write programs that manipulate data. Programs that what we call information systems. Programs that manipulate data, but where the data does not belong to the program.

Starting point is 00:08:29 Programs that manipulate data that they have not created. Programs where the lifecycle of the data goes beyond the program. For example, programs that manipulate data that come from a database. Let's say a web server. The web server does not own the data. It processes the data. And with

Starting point is 00:08:52 this thing, according to data-oriented programming, you need a program that treats data as a first-class citizen and allows you to manipulate data in a flexible way. And for that, it starts from a big, big thing against object-oriented programming.

Starting point is 00:09:10 Because in object-oriented programming, data and behavior, or data and code, are encapsulated together into objects. Right. So the first thing that we do, we separate. Data can live on its own, and code can live on its own. Like in functional programming. Okay.

Starting point is 00:09:28 So the first step of data-oriented programming is exactly the same as functional programming. The second step is that instead of using specific structures to represent our data, we prefer to use generic data structures like hash maps, like we have in Ruby and in JavaScript, or dictionaries in Python.

Starting point is 00:09:51 That's our main ingredient for representing data that we have fetched from the database. And that's where there is a little split versus standard functional programming languages like Haskell and all the ML family, where you use strongly typed things to model your data. Here, we prefer to use generic data structures, mainly hash maps and lists. And number three, which is similar to functional programming, is that we never mutate data. We use immutable data structures. And there are very, very advanced, or very performant, sorry, immutable data structures for generic data structures. So we have, in all languages, we have super efficient immutable hash maps, where instead of modifying the data in place,

Starting point is 00:10:49 you create kind of a new version of the data, but without having to clone the original data. We can talk about that later if you are interested. That was principle number three. And principle number four is, okay, if you don't have types for your data, how do you prevent, how do you avoid the big mess that you will be into? If all the pieces of data that you're manipulating in your program are hash maps, how do you know if in the hash maps, you expect a field that is called email and user and ID, and how do you know how to spell it?
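Principle number three, as described above, can be sketched in a few lines of Python. This is only an illustration: the helper name with_field and the sample book are hypothetical, and the sketch makes a shallow copy of the map on every change. The efficient immutable hash maps mentioned in the conversation avoid that copy by sharing structure between versions, which this sketch does not attempt.

```python
from types import MappingProxyType


def with_field(data, key, value):
    """Return a new read-only map with key set to value.

    The original map is left untouched. Note: this copies the top-level
    map, unlike persistent data structures that share structure.
    """
    return MappingProxyType({**data, key: value})


# Hypothetical sample data, wrapped in a read-only view.
book = MappingProxyType({"title": "Example Title", "pages": 100})
updated = with_field(book, "pages", 120)

# Trying to modify the map in place is rejected.
try:
    book["pages"] = 1
    mutation_rejected = False
except TypeError:
    mutation_rejected = True
```

After this runs, book still has 100 pages, updated is a separate version with 120 pages, and the attempted in-place change has been refused.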

Starting point is 00:11:23 How do you know as a programmer and how does the program know to fail fast and not pass forward invalid data? And that's the way we do data validation in data-oriented programming is by having the schema, the data schema, separated

Starting point is 00:11:39 from the data itself. And data is validated at runtime, not at compile time. So these four principles of data-oriented programming, the first one separating the code from the data, the second one representing data

Starting point is 00:11:55 with generic data structures like maps and lists, the third one treating data as immutable, and the fourth one separating the schema from the data representation. Let's step through these and let's just focus in on each one for a moment. So this first one, separating code from data, as you said, this is kind of like against traditional object-oriented programming, which is kind of defined objects as code plus data coexisting in the same entity. Data-oriented programming says separate those two.

Starting point is 00:12:27 And so the question to that, which comes to my mind, is like, why? Why is it better to separate them versus to have them together? Because almost every developer that has worked in a production-ready, object-oriented system has suffered from huge class hierarchies. And you inherit from something that inherits from something. And when you want to make a little change, you influence so many things that it's a nightmare of complexity. And also for code reuse.

Starting point is 00:12:58 If you have a method of a class that does, I don't know, calculates the full name of a user by concatenating first name and last name. If you want to use this piece of code for calculating the name of an author, which happens to also have a first name and a last name, you need to have author and user inherit from a common object that you call person

Starting point is 00:13:25 or that you call human being or that you don't know how to call exactly and sometimes you can't do it and sometimes you need multiple inheritance. While the only thing that you need is the ability to call a piece of code. And you cannot really do that in OOP in a simple way.
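The full-name example can be sketched with plain maps and a free-standing function. This is a minimal illustration in Python, not code from the book; the field names first_name and last_name and the sample records are assumptions made for the sketch.

```python
def full_name(person: dict) -> str:
    """Works on any map that carries first_name and last_name keys."""
    return f"{person['first_name']} {person['last_name']}"


# Hypothetical records: a user and an author share no class or parent type.
user = {"first_name": "Jane", "last_name": "Doe", "email": "jane@example.com"}
author = {"first_name": "John", "last_name": "Smith", "genre": "fiction"}

user_name = full_name(user)
author_name = full_name(author)
```

Because the function is not a method of any class, neither record needs to inherit from a common Person type for the code to be reused.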

Starting point is 00:13:44 There are tricks and design patterns and da-da-da-da. But in the most straightforward way, code is kind of in jail inside the objects that wrap it. And we want to free them. We have a political agenda. We want to free the world. And we don't want code to be in jail. I see. So if I have an object, which is a person, and inside of that object, there's the data

Starting point is 00:14:12 of the person's first name and last name. And there's a code that says, here's how I represent that as their full name. I've implemented that inside of this little object, and it's stuck inside of there. And I have to do a bunch of tricks, whether it's inheritance or includes or imports or whatever it is, in order to free that logic from the person and give it to other areas of my application that may also need the exact same logic. So the problem is, the data and the code are wrapped up together, and that's trapping the functionality, and we want to make it free. Is that what you're saying? Yeah. And I think object-oriented is fine where the data that you wrap in the,

Starting point is 00:15:01 that you encapsulate in the object is not information. Data is, we have different kinds of data. Sometimes we have data, for example, the internals of the data structure, the left child and the right child and the number of children and is visited and stuff like that. This is not what I call data. This is not information that comes from the outside. This is not something about the real world.

Starting point is 00:15:27 It's something about your program. For things about your program, that's fine to use objects. But for facts about the world that come from outside, I think it's better, they deserve to live on their own, not to be stuck into our mental systems. Okay, so there's kind of like internal data and external data, is kind of what you're saying. But you're saying one's information, one's not. And it's okay to encapsulate internal things because they're uninteresting to the outside world. But if you encapsulate things that are

Starting point is 00:16:03 eventually interesting to the outside world, now you've backed yourself into a corner. I see what you're saying. Okay, so there's the why for principle one. What about principle two, representing data with generic data structures? Why use a map or a list? Those are kind of the two main ones, right? Lists of things and then like dictionaries of things or maps or hashes or whatever your language calls them. When you could more richly represent them as what the world wants to see them as necessarily. Why is it beneficial to just pass around generic things if we have the capability of building specific things? Okay, that's the toughest question. And that's the question that comes again and again.

Starting point is 00:16:46 That's the strongest critique against Clojure and data-oriented programming. But also that's the most interesting one. So let's say that we have a way to do data validation. And we will talk about it when we talk about principle number four. So let's say we are not scared about having to manipulate invalid data. Let's put this fear aside for a moment. And let's just see what we lose with static typing. When I force data to be, let's say I manipulate books and I have a structure, a struct, static types for a book. Let's see what kind of problem we have when we force this thing about the real world, which is information about the books, the title, the number of pages.

Starting point is 00:17:38 That's something in the real world. But when I force it to be wrapped into my algebraic data type or my struct, let's see what do we lose. First of all, we lose the ability to refer to fields by their name at runtime because a struct, when it's compiled, it becomes just an array and the field names become offsets inside the array, meaning that, for example, it's very hard to be dynamic and to receive, let's say, from the user,

Starting point is 00:18:14 the name of the field they want to retrieve, because the name of the field is a dynamic string, and there is no easy way to fetch dynamically the value of a field inside a struct without using reflection. While if it's just a map, you can access any field in a map by its name. It's the essence of the map. Does it make sense? I think so.
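The contrast between a map and a struct can be sketched in Python, with the caveat that Python is dynamic, so the gap is smaller than in a compiled language where field names become offsets. The Book class, the sample values, and the idea that field_name arrives from a user are assumptions for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    """A struct-like record with a fixed set of fields."""
    title: str
    pages: int


book_record = Book(title="Example Title", pages=100)
book_map = {"title": "Example Title", "pages": 100}

# Imagine this string arrived from the user at runtime.
field_name = "pages"

# With a map, a dynamic field name is just a key lookup.
from_map = book_map[field_name]

# With a record, looking a field up by name is a reflective operation.
from_record = getattr(book_record, field_name)
```

Both lookups return the same value, but only the map treats access by name as its ordinary, primary operation.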

Starting point is 00:18:37 But you said absent reflection. Lots of languages have reflection abilities. So you can get at the names of things pretty reliably, right? Yeah, but I think that if you write a program and you rely too much on reflection, you will be rejected by a code reviewer. They will say, hey, what are you doing here? And, you know, in a sense, if you use reflection,

Starting point is 00:18:59 and anyway, when you use reflection, you bypass the type checker. So you do data-oriented programming in a sense. If you use structs and access fields with reflection, it's the same as using maps. So just use maps if that's what you want to do. Let me give you an example. Let's say you fetch from the database information about a book

Starting point is 00:19:24 with title and number of pages, and you want to rename a field because in your API, title should be called the title and number of pages should be called pages. In object-oriented programming, what you need to do is to create another struct, or if you do static typing, you need to have two structs. One that holds the data as it is stored in the database. And you need another struct with the names as they need to be seen by the API. While if it's just a map, in a map you can rename fields.

Starting point is 00:19:58 It's just a two-line function to rename a field in a map. And if you want the user to decide how to rename the field, also you have a big problem. You cannot create a priori data types for every possible combination. If you want flexibility with your data and your field names, you need a flexible data structure like a map. Okay.
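Renaming fields in a map really is a very short function. The sketch below is one possible Python version; the name rename_keys, the field name number_of_pages, and the sample row are hypothetical.

```python
def rename_keys(data: dict, renames: dict) -> dict:
    """Return a new map where each key listed in renames gets its new name."""
    return {renames.get(key, key): value for key, value in data.items()}


# Hypothetical row as it might come back from the database.
db_row = {"title": "Example Title", "number_of_pages": 100}

# The renames mapping is plain data, so it could even be supplied by a user.
api_book = rename_keys(db_row, {"number_of_pages": "pages"})
```

The original row is not modified, and no second type is needed to hold the API-facing names.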

Starting point is 00:20:22 So from the static typing side, though, aren't you throwing away a lot of upside? You're throwing away a lot of tooling inference. You're throwing away a lot of refactoring abilities. I know you said set validations aside for this part of it, but obviously that does play a role in, in decision-making processes. I'm going to show my true colors. I'm more of a dynamic guy myself. So I'm not going to be the, I'm going to be an easier sell than probably a lot of our listeners when it comes to that side of it. I can hear in the way you ask the question, you just pretend that you ask the question. Well, I'm representing what a lot of people think. I work in small teams,

Starting point is 00:21:00 small code bases. I don't have a lot of the problems that static types solve personally. I've seen them and I've heard them from a lot of people. And so I represent them, but yes, I am not going to be the hardest sell on this, but I don't want to be a pushover either. So. Yeah. I think that most of the concern of static type people is based on fear. Like, oh, I need to know what I have. It's like you are a control freak. And tooling. Yes, tooling is the big one. But if you put those two aside

Starting point is 00:21:34 and you are interested about what really happens when the program runs, after you have written it, let's say you want to debug a program in production. So there your tooling will not really help you. And you want something that is, when it runs, you want the, even it's not the artifact,

Starting point is 00:21:52 you want the runtime to be simple. And the less complex data structures you use, the simpler your program. Moreover, it's very easy to carry maps around. For example, the API for Google Docs, right? You want to modify the title of a Google document. If you

Starting point is 00:22:14 are using generic data structure, you pass a JSON with the title and document and body and author name and first paragraph. And that's what goes on the wire anyway. But if you are static and you could have this map from many, many functions, right?

Starting point is 00:22:41 You can write functions that enrich the map, that remove stuff, that rename fields, etc. While if you use a static type API, like the Java API, you cannot really do that. Everything needs to be statically known. And Java also with setter, set title, set author, set this, set that. And writing unit tests also, if you think of,

Starting point is 00:22:58 I think one way to measure the readability or the goodness of the code is to see how easy it is to write tests for it. And when you use generic data structures, it's very easy to write tests for your code. You just create a map with the fields that your function expects and you call the function.
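A test in this style might look like the following Python sketch. The function book_summary and its fields are hypothetical; the point is only that the test builds the smallest map the function actually reads.

```python
def book_summary(book: dict) -> str:
    """Reads only title and pages, whatever else the map carries."""
    return f"{book['title']} ({book['pages']} pages)"


def test_book_summary():
    # The test supplies just the two fields the function looks at.
    minimal_book = {"title": "Example Title", "pages": 100}
    assert book_summary(minimal_book) == "Example Title (100 pages)"


test_book_summary()
```

No constructor, builder, or fixture for a full book type is required to exercise the function.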

Starting point is 00:23:16 And it could be that a function in production receives a map with 10 fields but only looks at two. In order to test it, you don't need to create the whole map with all the 10 fields. You can just create a small map with the two fields that you know the function cares about. And this kind of flexibility is really valuable, really valuable. But does that flexibility scale? So one of the things that I've found over time, having written many Ruby programs and done it in such a way that it's flexible and I could just pass maps around

Starting point is 00:23:47 and I could just test the parts of the maps that I'm interested in is that I end up writing a lot of tests, a lot of tests that are merely type checking. Like I'm merely saying, did I get what I expect? Right. And so I know you said that we get, if we can set validations aside, but we really can't because a lot of our programs are the interface between a human and a database. Right. And like in between there is like, did I get what I expect from the human? Like a lot of what we do is that, and I can see the type, the static type argument that if you can enforce that constraint formalized in a way, then you can guarantee it and not have to write a unit test that says, well, what if they pass me nil? Now what do I do? How do you respond to that in data-oriented programming?

Starting point is 00:24:38 Yeah, so let's skip to principle number four, and then we will go back to principle number three. Okay, so all right, let's set that one aside. Let's go, yeah, straight to principle four, and we'll go back to three. Separating data schema from data representation. Go ahead. So how do we do data validation in a dynamically typed world, right?

Starting point is 00:24:58 So I'm writing an HTTP server, an API, with lots of endpoints, and each endpoint receives a payload, and each payload has an expected shape for the data. So until, I think, four years ago, the way I would validate that the data is valid is most of the time,

Starting point is 00:25:19 I just won't validate and be optimistic, and then fail in production, fail with unclear errors. Because instead of having a failure that says, hey, you passed invalid data, I would have foo was called with nil. And it was very hard to re-concentrate. Right. You can't call this method on this nil thing.

Starting point is 00:25:43 So that works well when you are a startup and you don't really care about safety and you need to move fast. And that's what I did until four years ago. But four years ago, I discovered that there is a way to express programmatically the expected data shape. It started from something called Clojure spec, and then something called Malli, and something called JSON Schema. So let's talk about JSON Schema because it's universal. It fits in every programming language.

Starting point is 00:26:22 So in JSON schema, you have a map that describes what you expect, the expected field in your map. For example, you could say the schema for a book, I have a field called title, a field called pages, title should be a string,

Starting point is 00:26:38 pages should be a number. And there are functions for validating and say, okay, here is a piece of data that I got from the user. Here is my schema. Please validate. And if it's not valid, tell me why. And then you can, with no code, just by writing the schema,

Starting point is 00:26:55 you can automatically, with middleware, generate. You don't need to write code. You just use the middleware that, when your endpoint receives an invalid piece of data, returns to the user the 404, 402, or I don't know, 405 error code automatically with explanation about, hey, the field title was not provided. Oh, you provide title and it was not a string. Or even better, you provide pages with a number that is negative or with a number that is higher than a million

Starting point is 00:27:26 or you can do lots of advanced things that you cannot really do with static typing. You can do because you do runtime check. And anyway, the kind of check that you want to do are at runtime. You cannot validate a compile time user input, right? You agree with me? Sure.
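The idea of a schema kept apart from the data, checked at runtime, can be sketched as follows. This is a hand-rolled stand-in written for illustration, not a real JSON Schema library: it supports only a small subset of keywords (type, required, minimum, maximum). The names BOOK_SCHEMA, validate, and handle_create_book are hypothetical, and the 400 status code is an assumption (the conventional code for a malformed request), not something stated in the conversation.

```python
# The schema is plain data, kept apart from the data it describes.
BOOK_SCHEMA = {
    "type": "object",
    "required": ["title", "pages"],
    "properties": {
        "title": {"type": "string"},
        "pages": {"type": "integer", "minimum": 0, "maximum": 1_000_000},
    },
}

PYTHON_TYPES = {"string": str, "integer": int}


def validate(data, schema):
    """Return a list of readable errors; an empty list means valid."""
    if not isinstance(data, dict):
        return ["payload should be an object"]
    errors = []
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"field '{name}' was not provided")
    for name, rules in schema.get("properties", {}).items():
        if name not in data:
            continue
        value = data[name]
        expected = PYTHON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(f"field '{name}' should be of type {rules['type']}")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"field '{name}' should be >= {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"field '{name}' should be <= {rules['maximum']}")
    return errors


def handle_create_book(payload):
    """Middleware-style wrapper: reject invalid payloads with an explanation."""
    errors = validate(payload, BOOK_SCHEMA)
    if errors:
        return {"status": 400, "body": {"errors": errors}}
    return {"status": 201, "body": payload}
```

For example, handle_create_book({"pages": -1}) returns a 400 response whose body explains that title was not provided and that pages should be at least 0, without any endpoint-specific checking code.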

Starting point is 00:27:43 So instead of writing a class that says, hey, that's the class of a book. And let me try to JSON parse this string into my book. I have a schema, which is super flexible. I can express many, many conditions. There is no limit. I can even pass a function as predicate or numbers in range or stuff like that. And everything is just code. There is middleware for HTTP servers that automatically generates the proper error response when there is

Starting point is 00:28:13 a failure. So I think for that, for API, for validating user input, dynamic programming is even better than static programming. It's not only as good as, I claim it's better. For this kind of data validation, there is another

Starting point is 00:28:29 kind where it's worse, but this first kind is better. Okay. It's better because say why it's better again. Why is it better? Okay. First of all, because you can express conditions that are not expressible as static types, like number in a range.

Starting point is 00:28:46 Number of pages should be between zero and 10,000. You cannot express that as static types. That's number one. And number two, because you can, for example, very easily generate the Swagger JSON from the schema. And in fact, the language for Swagger is based on JSON schema. And generating JSON schema from a class is doable,

Starting point is 00:29:13 but it involves tricks, reflections, and stuff like that. While generating Swagger data from JSON schema is straightforward. And you can programmatically also manipulate. Okay, I will do more later, but until now. So it's richer than static typing because you can express any condition. And it's a perfect match with Swagger, let's say.

Starting point is 00:29:39 And there is no downside. So conceptually, what you're saying is you get to defer the typing. You're basically saying, well, you're going to have a schema, you're going to have types, or you're going to have validations and requirements of your data. But it's going to be separated out from your application code. And as long as it is enforced at the last minute or at the end of the chain of operations, and as long as the result of that failure is matriculated back up to a place where it's displayed to an end user, it's not like an explosion or a crash.

Starting point is 00:30:18 It's a displayed error that's somehow built into the system. Then it's better of doing it kind of at the edge nodes of your code at the entrance point. Are you convinced or half convinced? I'm interested. It seems like I understand why you use JSON schema as an example because it's broad sweeping versus a specific implementation

Starting point is 00:30:43 or tool chain inside of Clojure. But I wonder how accessible this setup is to different developers in different circumstances. Oftentimes what we find is inside of a application framework, you end up with, even if you have a strong schema at your database layer, for example, you end up with undefined is not a function calls as people send input. And I wonder how practically people would get this going for themselves.

Starting point is 00:31:16 Yeah, so here you need a little bit of discipline because nobody is going to force you to write a schema for your endpoint, for the payload and for the response. While in statically typed languages, you are forced to type. So, yes, that's maybe a little downside. But once you get to it, you do code review and you won't accept code.

Starting point is 00:31:46 You won't accept a new endpoint without a schema. Where it's more challenging is when we talk about another kind of validation, which is, okay, I've passed my endpoint, and the endpoint calls a function that calls a function that calls a function that calls a function. So I'm going down the stack. And now I call the function foo that receives a book. But inside the code of foo, what I see is that the parameter is called book, but the type is just a map or it's just a var. And as a developer, I have no way to know that the book parameter received by foo is a map with those fields. And I can use JSON schema again here, but it is overkill to use JSON schema everywhere.

Starting point is 00:32:32 So in Clojure, we have different tools for that. And when they are wired properly, they give you kind of IDE functionalities. So when I wire it properly in my Clojure code, I have the function foo, and I can say here it's a map, but here is the schema of this map. And if I call foo from somewhere else

Starting point is 00:32:54 and I mistype the field name, the IDE will tell me, hey, you have passed an invalid input to foo, like in a statically typed language. I don't know, maybe you have seen in VS Code, which relies on JSON schema, when you want to edit some configuration files, VS Code knows the JSON schema of the file.

Starting point is 00:33:17 And if you mistype the name of a field inside your configuration, on the fly, VS Code will tell you, hey, you have an error. Because actually there is a repository of JSON schemas that VS Code reads from there. So we can have something similar in our code, not only for configuration data, but also for function arguments.

Starting point is 00:33:39 So here, we are not on par with static typing. In terms of tooling and internal functions, when the data flows, it's not as good as in static typing, I admit. But what we have, what we do have is

Starting point is 00:33:57 when you decide to type the function arguments, you are not forced to, but when you decide to do so, one benefit that you have is unit tests for free. Let me tell you, again, let's take the function foo that receives a book and it's supposed to return, I don't know, whether it is a good book or not. Yeah, thumbs up or thumbs down.

Starting point is 00:34:18 Yeah, thumbs up or thumbs down. Let's say thumbs up is more than three stars, something like that, and less than a thousand pages. If you have a schema for your book, what you can do is use JSON schema library. So the first library that we discussed would validate data against schema, but there are libraries that generate data out of a schema.

Starting point is 00:34:39 So once you have your schema, you can say, hey, generate me a thousand samples of books, call the function, and make sure that the result is as I expected. It's called generative testing. And it's very easy to do that. And in my book, I show a couple of examples of how to leverage these capabilities. In addition to unit tests, where you cover five, ten cases, you can cover all the cases. You can say, generate all possible inputs, or a thousand or a million possible inputs, and validate that my code behaves properly.
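The generative testing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the schema format (a field name mapped to an inclusive integer range), the `generate_book` helper, and the `is_good_book` rule are hypothetical, and the generator is hand-rolled with the standard `random` module rather than taken from a JSON Schema library.

```python
import random

# Hypothetical schema: each field maps to an inclusive (min, max) integer range.
BOOK_SCHEMA = {"stars": (0, 5), "pages": (1, 3000)}


def generate_book(schema, rng):
    """Build one random book map whose fields respect the schema ranges."""
    return {field: rng.randint(low, high) for field, (low, high) in schema.items()}


def is_good_book(book):
    """Thumbs up: more than three stars and fewer than a thousand pages."""
    return book["stars"] > 3 and book["pages"] < 1000


def check_property(samples=1000, seed=0):
    """Call is_good_book on many generated books and assert simple properties."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    for _ in range(samples):
        book = generate_book(BOOK_SCHEMA, rng)
        result = is_good_book(book)
        assert isinstance(result, bool)
        # A long book must never get a thumbs up.
        if book["pages"] >= 1000:
            assert result is False
        # A book with three stars or fewer must never get a thumbs up.
        if book["stars"] <= 3:
            assert result is False
    return samples
```

Running `check_property()` exercises the function on a thousand generated books instead of the five or ten cases a hand-written unit test would cover. The properties here are deliberately simple; in practice they would describe whatever the function is supposed to guarantee.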

Starting point is 00:35:12 And every time I run, I use that, I find bugs. You know, some edge cases with regular expressions, with special characters, with negative things, positive things. And doing so with static types is much, much more challenging. To generate random data out of an algebraic data type is more challenging. I'm not saying it's impossible, but it's more challenging. While with JSON schema and maps, it's very, very natural.

Starting point is 00:35:39 Yeah. So have you ever tried using your database schema as the schema? Or do you need an internal representation and an external representation? Usually it's different. Because usually in the application, you don't treat your data as tables. You have maps instead of tables. And those maps are denormalized instead of normalized.

Starting point is 00:36:03 So I don't think it's a... But I'm sure that there are tools that take a SQL schema and translate it into a JSON schema. Yeah, there is an NPM package, SQL DDL to JSON schema. How does GraphQL fit into this story? Don't talk with me about GraphQL.

Starting point is 00:36:23 I tried GraphQL and I was so upset, so upset. Why? Because it's too rigid. It's too rigid. Too rigid. Too rigid. And I really tried hard, but so like many things,

Starting point is 00:36:38 when you start, it's great. For Hello World or MVP, it's great. But when the complexity of your requirements grows, it becomes unmanageable. And we had to do so many tricks to please the GraphQL type checker, and it added too much complexity to our business problem. So JSON schema in relation to GraphQL

Starting point is 00:37:00 is much less rigid. JSON schema is much more... It's much more flexible. I prefer to have REST plus JSON schema than GraphQL. Because in GraphQL also you have these things

Starting point is 00:37:11 that... Let me just give you an example if I remember correctly. You cannot have union types for input data. Something like that.

Starting point is 00:37:20 They decided it should not be done. And there is debate on it in the GitHub issues. So probably they will add it in a few years. But sometimes you need it. And so what we did in the end was to pass a string as part of the data to GraphQL

Starting point is 00:37:40 and to parse it as JSON in order to get back the flexibility that we wanted. This episode is brought to you by Sentry. Build better software faster, diagnose, fix, and optimize the performance of your code. More than a million developers in 68,000 organizations already use Sentry, and that includes us. Here's the easiest way to try Sentry. Head to sentry.io slash demo slash sandbox.

Starting point is 00:38:23 That is a fully functional version of Sentry that you can poke at. And best of all, our listeners get the team plan for free for three months at Sentry.io and use the code changelog when you sign up. Again, Sentry.io and use the code changelog. Let's loop back to principle number 3 because we skipped over it, treating data as immutable. This one is an easy sell for people who have been doing FP for a while, but it's a hard

Starting point is 00:39:02 sell for a lot of OOP proponents. So I think it's a hard sell just because we got used to mutation and I think that a while ago in Java strings were mutable and then they fixed it to be

Starting point is 00:39:19 immutable and it's much, much, much better. And I don't think that anybody likes mutation. It's just they think that if you go to immutability, you will pay a huge performance price, cost. Right, because you're copying data around. So I think the interesting question is, how can you manipulate data in an immutable way

Starting point is 00:39:44 without a performance hit? That's the interesting question. What's the interesting answer? The interesting answer is Git. Git? Yeah, Git is an immutable source control tool. And every time you do a commit, you don't do a modification.

Starting point is 00:40:03 You create a new node in the tree, and you just move the pointer of the branch that says, hey, now you are going to point to this commit. But the previous commit is not modified. So the Git tree is immutable. Now the question is, how do they do the magic? How do they allow us to create a new commit with, let's say, 10 changes in 10 files

Starting point is 00:40:26 without having to... And they create the illusion that you have a new tree without creating a new tree. And in Git, you can go back in time 10,000 commits ago, and in a millisecond, you have the new folder hierarchy. So there is no performance hit, and they don't replicate the whole tree on each commit. So what is the secret

Starting point is 00:40:47 behind Git? It's called structural sharing. Are you familiar with this term? No. So let's start with Git and then we will see how it applies with data. So Git, you have folders and in each folder you have folders and folders and folders

Starting point is 00:41:04 and then files. So imagine that you have a hierarchy of 10 folders, and you want to change a file at the bottom of the hierarchy. So what you can do is to create a new tree. And let's say at the first level, you have five folders. And you modify only, and the file that you want to change belongs to folder number five. So the four other folders can be copied by reference safely because you don't change them.

Starting point is 00:41:35 Folder number five, you cannot copy by reference because you have a change below folder number five. So what you do, you create a new folder, five prime, but all the children of folder five, except the one that you are changing, you can copy them by reference. And you do that recursively until you reach the leaf. And that's what Git does. So at each level, it copies by

Starting point is 00:41:56 reference all the children, and it creates a new node for the modified node. And that's structural sharing. And we can do the same trick with maps. So let's say you have a map with 10 fields, and field number 10 is also a map. And you want to modify a field inside field number 10. So you copy by

Starting point is 00:42:17 reference the other nine fields, and no matter how big they are, it's just a pointer copy. And for node number 10, the map, you create a new node, and you copy all the children of node number 10 besides the one that you want to change. Gotcha.
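The map version of structural sharing described here can be sketched in Python. This is a minimal illustration, not how any particular library does it: the `assoc_in` helper and the sample `library` map are hypothetical, and plain Python dicts are only treated as immutable by convention. Persistent data structure libraries typically implement the same idea with more efficient tree-based maps.

```python
def assoc_in(data, path, value):
    """Return a new map with the value at `path` replaced.

    Only the maps along `path` are recreated. Every other child is
    copied by reference, so it is shared between the old and new versions.
    """
    if not path:
        return value
    key, rest = path[0], path[1:]
    copy = dict(data)  # shallow copy: children are copied by reference
    copy[key] = assoc_in(data.get(key, {}), rest, value)
    return copy


# Illustrative data only.
library = {
    "name": "Example Library",
    "catalog": {"books": ["book-1", "book-2"]},
    "address": {"city": "Old City", "street": "Main Street"},
}

updated = assoc_in(library, ["address", "city"], "New City")

# The untouched subtree is shared between both versions.
shared = updated["catalog"] is library["catalog"]
# The modified subtree is a new node, and the original is left as it was.
original_city = library["address"]["city"]
```

After this runs, `shared` is `True` because the `catalog` map was never copied, while `library["address"]` still holds the old city. The cost of the update depends on the depth of the path and the width of the maps along it, not on the size of the untouched subtrees.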

Starting point is 00:42:33 So you're only copying the diffs. Like, you're only actually... The new stuff is the only thing that's actually getting new memory allocated. Everything else is just referencing existing. And so that's why it's better than it used to be. Yeah. Okay. So imagine for sake of conversation that you've completely convinced me. I've bought in.

Starting point is 00:42:58 I'm now a data-oriented programmer. Okay. So I separate my code from my data. I use only generic data structures in my application. My data, everything's immutable. I'm not doing any mutations. And I have a separate data schema from my data and I'm living the life and I'm going about my merry way. What does my life look like? What have I gained? What am I experiencing? You know, how many rainbows are there and unicorns? Like give us the best case scenario of adopting this.

Starting point is 00:43:35 Is it better in every way? Are there trade-offs? Go ahead and paint that picture. Yeah. So first of all, you belong to the population that is enlightened and you are grateful for that. And now you look at all your former colleagues and you see how much they suffer. And you pray for them.

Starting point is 00:43:52 And you buy them books on data-oriented programming. You buy them books. And you give them away to your friends, hoping that they will also make the move and become enlightened. There you go. Seriously. Let me just mention, I don't know if it was clear, but those paradigms, those principles

Starting point is 00:44:13 are applicable in any programming language. It's not applicable only in Clojure. You can apply them in Java, JavaScript, in Ruby, in Python, in any, virtually in any programming language. Moreover, you are not forced to embrace them as a whole. You can decide, okay, in some places of my code, I will just separate code from data, but I will keep static typing.

Starting point is 00:44:36 And in other places, I will use generic data structures. And in some places, I will allow mutability if I want. So it's cherry-pickable. Okay, now let's say you decided to write your HTTP server in the data-oriented programming way. What does your life look like? So your life looks like this: you deal only with the business logic. You don't deal with pleasing the compiler or pleasing the language.

Starting point is 00:45:04 In data-oriented programming, you have so many goodies in terms of data manipulation as part of the language or as third-party libraries, like functools in Python, Lodash in JavaScript, etc., that it's very, very easy to manipulate data, to massage your data, to read it like this, to manipulate it like that, to join and to pass it forward, which is what most of our APIs do. They read data from one place, from another place, they merge it together and they pass it forward. You don't deal with serialization because serializing a map is a problem that is completely solved. Use a library. You don't deal with creating the swagger from your

Starting point is 00:45:49 endpoint. You just have your schema and middleware creates it. You don't do validation. It is done by middleware. So you only do business logic. You generate unit tests by randomly generated

Starting point is 00:46:06 data. You pass data around, you pass maps around, you use maps and you live happily. And from time to time, someone says, hey, what are the fields that this map expects? Why didn't you document it? And then you say, oh, I should have written a schema

Starting point is 00:46:22 here to make it clearer. So you got to have those schemas. That's the discipline. Yeah, I think that's the problem that is not yet solved. It's a tooling problem. We don't yet, in 2022, have a common way to express that this argument

Starting point is 00:46:41 is expected to be of this schema. We can do that, as I mentioned, but it feels a bit awkward. It's a problem that is not yet solved. And let me tell you something interesting that happened to me when, in the beginning of last year, I was contacted by the main engineer of a very interesting language called Ballerina. Have you heard of Ballerina? Yeah, I think we did a show on Ballerina.

Starting point is 00:47:07 It's designed specifically for APIs, right? Exactly. So you did a show. Or for the cloud. I can't remember how they pitch it. Yeah, we did a show like two or three years ago on Ballerina. I haven't heard of it since, honestly. Okay, so it has continued its evolution.

Starting point is 00:47:23 And the Ballerina of 2022, it is marketed as a data-oriented language. So those guys came to me. It's an army of developers. It's 100 developers working for five years. And the manager of all this army came to me and said, Oh, your book looks very interesting. Do you mind writing an article about how Ballerina fits

Starting point is 00:47:49 with the paradigms of your book? And I was, wow. So I did a little research about Ballerina, and I wrote a couple of articles in InfoQ about Ballerina. And one of the very interesting things that Ballerina fosters is they have something which is called the flexible type system, which is neither static nor dynamic. It's in the middle. And it's super interesting because it gives you all the goodies of a statically typed language with all the flexibility of a dynamically typed language. Let me give you an example to illustrate what it looks like.

Starting point is 00:48:28 So the syntax for maps is very similar to JavaScript, curly braces and JSON. But for accessing a member, you have two different syntaxes. You have the dot notation and the square bracket notation. In JavaScript, both notations are equivalent, but in Ballerina, the dot notation is for fields that are at compile time part of your data, and the square bracket notation is for dynamic types. If for some reason you want to say,

Starting point is 00:48:57 here I want to add a new member to my data, which is not part of the schema, dear compiler, let me do so. So most of the time, you will use the field that you know at compile time, and from time to time, you will allow yourself to add new fields.

Starting point is 00:49:13 You can splurge a little. You can go out for the evening. Yeah. Exactly. Treat yourself nice. Yeah. For me, I think it's the future. I don't know if Ballerina is going to nail it down, or maybe another language. But I don't think that everything is perfect in dynamically typed languages.

Starting point is 00:49:30 Even with JSON schema, like I mentioned, we have the tooling problem. With static typing, we have a rigidity problem. And maybe we need a new language that will combine the best of both worlds. Yeah.

Starting point is 00:49:43 Or maybe that language is not Ballerina specifically, but this panacea, maybe it's too good to be true. Maybe there is no such middle ground that we can actually stake out because of the requirement of discipline. You know, if you provide me the ability to shoot myself in the foot, I may just do it over and over again until my code base is unmaintainable. I don't know. That's what you have in dynamic and we do okay. But I do think tooling is definitely the downfall at this time of the

Starting point is 00:50:14 dynamic world. We see all the cool new tools coming out and we're like, wow. This is why we can't have nice things. But we do have freedom. Yes. And the tooling also gets better and better. Like I said, in Clojure we have clj-kondo with

Starting point is 00:50:29 Malli and there is decent integration. It's not like Java IDEs, but it's getting closer. But the future will tell. I'm sure there's areas that we didn't touch. The one thing that you mentioned you might return to was the circumstance in which the

Starting point is 00:50:50 schema representation is worse than static typing. You mentioned in the case of API endpoints with JSON schema, it's better for the two reasons that you stated. But what's the kind of program in which it's actually worse? Yeah, so it's not the kind of program, it's where in the program. Okay, where? So we have two kinds of data validation. One at the boundaries of our programs,

Starting point is 00:51:16 and there the validation is inherently dynamic because you get input from the outside. So the validation, by definition, cannot happen at compile time. It cannot be static, it has to be dynamic. So that's where dynamic typing has an edge. But when you are inside your code, when a function calls a function that calls a function, here, statically typed languages have an edge.

Starting point is 00:51:43 And we have the tooling problem. All right. Well, what about community? Is there a place where data-oriented programmers hang out, discuss, tell war stories? Is it the Clojure community? Are these just the same communities? Unfortunately, not yet. Not yet.

Starting point is 00:51:58 Okay. Well, as the author of Data-Oriented Programming and the guy who the Ballerina folks come to for their punditry, maybe you could be the one to get something started around this group of people who, it seems like at least in Clojure land, Clojure seems to make this accessible, this style. What are other languages or areas

Starting point is 00:52:23 where it's pretty easy to do data-oriented programming? I think JavaScript is really a good fit. Data-oriented programming is natural. And in a sense, TypeScript is kind of data-oriented because at runtime,

Starting point is 00:52:39 the types are not there. At runtime, TypeScript is JavaScript. So TypeScript is like a linter. It's not really a statically typed language. It's like a linter, a static type linter. So in a sense, TypeScript is kind of... And in TypeScript, you can say, here, leave me alone, it's anything or any or whatever.

Starting point is 00:53:01 Right, that's kind of your eject button from TypeScript, is the any type, right? Yeah, but even without the any type, what I wanted to say is that in TypeScript, the types are not part of the data. You still have the freedom to create types decoupled from the data. The types are like glasses

Starting point is 00:53:20 through which you want to look at the data. And you could have the same data and look at it in one function as being this type, and in another function, it's another type. For one function, it's just a person, and for another function, it's an author. And for another function, it's just a map. So what I like in TypeScript

Starting point is 00:53:38 is the decoupling between types and schema. So that's why I'm saying maybe it can be considered as a data-oriented programming language. Gotcha. Well, there are more and more people writing TypeScript each and every day, so maybe we have more people with access to this style.

Starting point is 00:53:58 If it is interesting to you, check out Yehonathan's blog. There's lots of extracts from the book and blog posts out there covering the principles and some of the history of this style. Of course, there is the book Data-Oriented Programming out there published by Manning. Check that out. We'll have links to all the things in the show notes so people can connect with you, connect with the book. And hopefully, you know, I like that it's cherry-pickable. People can start to integrate some of these styles, these principles into their code. You don't have to go all in.

Starting point is 00:54:33 You can say, I'm going to start writing my programs immutably from here on out. And you can just kind of adopt that as you go. So all four things you can build towards, towards that perfect world where you're completely enlightened and living the good life, as Yehonathan attests. Thanks so much for coming on the show. Really appreciate it. Okay, that's it. The show's done. Thank you for tuning in. A big thank you to Yehonathan for joining us on today's show. And hey, we have some books of his to give away. But join us in Slack. That's where you get them.

Starting point is 00:55:08 Changelog.com slash community. You can join for free. You might get some free books. That's a good deal. And once again, a big thank you to our friends and partners at Fastly and Fly. And thank you also to our mysterious friend, BMC, Breakmaster Cylinder, for making sure we're bumping the best beats in the biz. Yes! We love those beats, and I hope you love them too. And FYI, if you

Starting point is 00:55:30 didn't know this, we share video clips of all of our podcasts on YouTube. Subscribe at youtube.com slash changelog. But hey, that's it. This show's done. Thank you again for tuning in. We'll see you on Monday.


Jerod is joined by Yehonathan Sharvit, author of Data-Oriented Programming, to discuss the virtues of treating data as a first-class citizen in our applications and the four principles that make it possible.
