The Data Stack Show - 98: Category Theory and the Mathematical Foundation of the Technologies We Use with Eric Daimler of Conexus

Episode Date: August 3, 2022

Highlights from this week’s conversation include:
Eric’s background and career journey (3:30)
Presenting to people without knowledge of AI (11:04)
Why math was chosen over AI (19:03)
From compilers to... databases (25:42)
The contribution of category theory (30:09)
The Conexus customer experience (37:45)
The primary user of Conexus (46:33)
Interacting with 300,000 databases (51:07)
When Conexus begins to add value (54:02)
The best way to learn this mathematical approach (55:46)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Hey Data Stack Show listeners, Brooks here. Usually, I'm behind the scenes keeping things rolling for the show, but today I'm coming out of hiding to share some exciting news. We have another live show coming up, and we want you to
Starting point is 00:00:37 join us for the recording. This time, we're bringing back Tristan from Continual and Willem from Tecton to talk about the future of machine learning. We'll record the show on August 10th at 2 o'clock Eastern, 11 o'clock Pacific. So mark your calendars and visit datastackshow.com slash live to register today. Welcome to the Data Stack Show. We have an exciting episode because we are going to talk about the White House. We are going to talk about math and we are going to talk about a data company that solves really complex data problems all in one conversation with Eric, who is from Conexus, which is a fascinating company and he's a fascinating person. Costas, of course,
Starting point is 00:01:25 I have to ask him what it was like to be an advisor for AI at the White House. I think typically when we think of the government, we don't necessarily think about people solving issues or thinking deeply about the subject of AI, but that's exactly what he did. I want to know what that meant practically day to day. So that's what I'm going to ask. How about you? Absolutely. Like I'm, I really want to hear like all the stories that he has to share from like trying
Starting point is 00:01:54 to help the government understand, like, what's the implication of all the state of the art technology, and help them introduce all the right legislation, and how these things happen, and all these interactions. It's something that's very, very different from what we are used to when building businesses or building products. So definitely a lot of questions there around that. But at the same time, he's also representing a company that's one of these companies that have a product that is very directly connected to very foundational research,
Starting point is 00:02:32 especially with mathematics. So we'll have one of these rare opportunities where we can talk with someone and go through the whole, let's say, from the product itself and the experience that it has and the problem that it solves down to, let's say, the core mathematics that are used to actually deliver this value. So let's see what he has to say about all that stuff. All right, let's do it. Eric, thank you so much for giving us some time
Starting point is 00:02:58 and joining us on the Data Stack Show. We can't wait to chat. It's good to be here, Eric. Thanks for having me. All right, well, give us a, you have an absolutely fascinating background. You've probably sat at almost every seat at the table that someone could think of when you think about a technology company in the data space, and then some that you wouldn't think about. So can you just give us a quick history of all the different things you've done
Starting point is 00:03:25 and roles that you've played and then what you're doing today? Sure. I've been told that I have a rare, if not unique perspective in having exposure to the areas of AI from the perspective of being a researcher, to being an entrepreneur to being a venture capitalist and even spending time in Washington, D.C. And that's often how people will know me, if they know my name, is as acting as an AI authority during the last year of the Obama administration. Before that, I had spent time as a professor in computer science and sitting on a couple of other boards, one of which was SoftBank's largest investment into AI, Petuum. Wow.
Starting point is 00:04:34 Costas, I didn't even know where to start. I mean, this is so exciting. But Eric, let's dig in, I think, where my mind went and I think a lot of our listeners. So an advisor in the White House on the topic of AI. Can you just tell us what was that like? What were your responsibilities? And then I think I'm so interested in what were some of the really specific things that came up that you worked through in that role? Sure. I can say that it's a very privileged position. I was really grateful for the time.
Starting point is 00:05:11 I worked with some really smart and dedicated people, and I hope to do it again someday. The role itself actually has been elevated. There's now an AI office inside what's colloquially known as a science advisory group. I know the person that leads it and some of the people that work inside of it. They're all super smart, competent, the right people there. And even the other job is now a cabinet level job. The job senior to that one is now a cabinet level job. So lovely people working very hard on behalf of the American people. When I was there, there were other people in other areas of expertise from space to
Starting point is 00:05:54 healthcare. There's an expert on soil and agriculture. I just happened to be the authority on AI during my time. There was another, in computer science, there was another Princeton professor who was an expert on computer security. Before me, the person was more of an expert on very large computing systems. But I am very happy to have been there when I was there. It was a hot time to be there around AI. What we did, what we do, it was colloquially known as the science advisory group.
Starting point is 00:06:31 It's really nonpartisan, and that was my experience. So this was not a whole bunch of West Wing people in the way that you read about in a screenplay or something or some TV series. These are nerds, right? These are nerds. Yeah. They're a science thing. We did not talk about politics. And really, for all I know, and actually I did know in a couple of cases,
Starting point is 00:06:55 people had different political views even than the president that we served. Oh, fascinating. I know that the person to whom I reported said, I would serve a lot of presidents, but this president I served with enthusiasm. And that's how I felt. We represented, humbly speaking on behalf of the president, the goals of the White House in coordinating the executive branch. So the executive branch is State, Defense, of course, but also Health and Human Services, and Transportation is a big one, coordinating those efforts in AI. So generally the funding of research, but also a coordination of the goals and outlook that the federal government might have for the coming years. This got expressed then in written reports. Many of them were public. Obviously, some of the work we did within the DOD is not,
Starting point is 00:07:58 and the intelligence community is not, but much of the work was public. I think you can even see this still on the White House archives, the work we did. And it really was helpful in coordinating a conversation that then we could share with Congress, who would then allocate funds. Where do we want to go? What do we want to fund? What do we see happening in the future? We would take some lessons from what some of our allies would be doing and vice versa. It was a wonderful experience where you have a very high level perspective of AI initiatives. And this was actually a bigger deal than I had expected. Obviously, everybody knows the federal government is big, but one of the wonderful parts about that job, and I get goosebumps even thinking about recalling that experience, was that it is bigger than any organization, any other organization.
Starting point is 00:08:54 So I get to see, oh, this is where people are experiencing roadblocks today that become a lot worse in the future. These are some of the big scale difficulties people are going to be running into. Because everybody knows the doubling every two years sort of thing, or every 18 months for computing power. But data is also growing, doubling every two years
Starting point is 00:09:24 or some such thing. The exponential growth in data is well understood. But the equally exponential, or quadratic to be more precise, growth in combinations of data and data sources produces an unfathomably large number of data relationships. And that is just breaking systems. Because if you flip from millions and billions to billions and trillions, you have to be thinking about your systems in a fundamentally different way. It's really a phase change. Ice to water, water to gas. It is a fundamentally different way to be interacting with that scale of data relationships. And that's what we saw begin to happen in the federal government. Fascinating. Okay. One more question to
Starting point is 00:10:27 satisfy my curiosity, and then I'll hand it over to Costas because I know his mind is buzzing with questions. And this is just more of a curiosity in terms of taking research, papers, discoveries, recommendations, and say, presenting those to people who may not have a good understanding of what AI is. I mean, we all work in the data industry. And even then people, a lot of times will misuse the term AI, right? Or speak about it in a way that's ambiguous. And so if you think about the wide audience of people who were exposed to the work that you and your team did, was it difficult, say, if you were presenting a research paper to Congress or they were digesting that, how did you approach the problem of not everyone has a baseline understanding of sort of what AI is at the
Starting point is 00:11:26 fundamental level. Was that a challenge? I'll actually say that was the challenge. I mean, you don't present research papers to members of Congress. It just doesn't work. Some of the senators are very, very smart, some of them less so, but they still may not understand, and really shouldn't be expected to understand, the nuances of this tech. So it's a big part of the job, actually, to both work with my peers in the State Department or the Defense Department or the Transportation Department or Energy, you know, some super smart people at the Energy Department, work with my peers at that level, and then go back to members of Congress and try to, you know, you don't say dumb it down because many of these people are super smart in their own right, but try to simplify it in a way that is meaningful so that they can have a grasp to make more effective policy. I have a couple of conclusions from that experience. And it really was daily,
Starting point is 00:12:32 if not hourly. The social calendar wasn't really a social calendar because I would often be the entertainment at dinner talking about AI at some ambassador's residence or with some members of Congress. So the lesson I took is tell a simplified version of AI. And I can even share with you how I told it. Stories can often be helpful. And then the second lesson is that we really need to bring more people into the conversation around AI. Because even if the members of Congress and senators at the federal level would understand this, we still have every state government. To say nothing of other governments, allies around the world from whom we also take some direction and where our companies are often subject to those laws, GDPR being a perfect example. In Europe, we could talk about their implementation and
Starting point is 00:13:39 their modifications of GDPR over the last few years; despite having very, very smart people, they've often been misguided in their modifications to GDPR. So those two lessons: one is have a good definition, which I'll share. Another is just generally working to bring more people into the conversation of AI, what it is, how we want it to be implemented. So the definition that I worked with that I found to resonate with members of Congress is that AI is a system, a system that collects data, senses data. So that could be from the LiDAR on top of your car, it could be from the air quality sensor in your home. Then through that sensor it takes the data into a system that then cogitates about it, thinks about it, plans for action.
Starting point is 00:14:34 That's a traditional place that people would think of AI. And notice I try not to get too pedantic about saying, well, AI is just that with a subset of deterministic and probabilistic AI, a subset of which is machine learning, a subset of which is deep learning, right? That's not helpful. Unhelpful for people that aren't researchers day to day, right? But it's a system that kind of senses, plans,
Starting point is 00:14:58 and then acts on those decisions, learning from the experience. So we take that whole system and then apply it to how ordinary non-AI professionals can get engaged. And we talk about automated car, driving down the street, seeing something. Is it a crosswalk?
Starting point is 00:15:17 On the crosswalk, is it a person? Is it a tumbleweed? Is it a shadow? What do I do? Slow, stop, or keep going? Do I ask for driver intervention? That's a point that everybody can get: we as a society need to make a decision. We as a society will need to determine, you know, where do we put that liability: on the driver, on the manufacturer, on
Starting point is 00:15:40 the coder. That will happen. And we will have litigation around this to make that determination. You know, Mercedes, for example, makes really no bones about them biasing towards the safety of the driver. So you think, you know, if I see an automated Mercedes coming at me, I might back off a little bit. And, you know, we're all part of Tesla's beta test, you know, whether we like it or not. You know, they regularly break the law. That's kind of just their mode of operation for testing their autonomous software. So we need to engage more people in the conversation. Use the definition and work to engage more people.
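The sense, plan, act, learn definition given above can be sketched as a toy loop. This is only an illustration of the shape of such a system, not how any real vehicle works: the class name, the thresholds, and the idea of a single "confidence" reading are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CrosswalkAgent:
    """Toy sense-plan-act-learn loop; every name and threshold is hypothetical."""

    slow_threshold: float = 0.5
    stop_threshold: float = 0.8
    log: list = field(default_factory=list)

    def sense(self, reading: float) -> float:
        # Clamp a raw sensor reading into a 0..1 "something is there" confidence.
        return min(1.0, max(0.0, reading))

    def plan(self, confidence: float) -> str:
        # Choose between the three options named in the conversation.
        if confidence >= self.stop_threshold:
            return "stop"
        if confidence >= self.slow_threshold:
            return "slow"
        return "keep going"

    def act(self, action: str) -> str:
        # Record the action taken; a real system would drive actuators here.
        self.log.append(action)
        return action

    def learn(self, action: str, was_hazard: bool) -> None:
        # If we kept going past a real hazard, become more cautious next time.
        if was_hazard and action == "keep going":
            self.slow_threshold = max(0.0, self.slow_threshold - 0.1)

    def step(self, reading: float, was_hazard: bool) -> str:
        action = self.act(self.plan(self.sense(reading)))
        self.learn(action, was_hazard)
        return action


agent = CrosswalkAgent()
for reading in (0.9, 0.45, 0.45):
    print(agent.step(reading, was_hazard=True))
```

The same reading of 0.45 is handled differently the second time because the learn step lowered the threshold, which is the "learning from the experience" part of the definition.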
Starting point is 00:16:20 Super helpful and super fascinating. I could keep going, but Costas, please jump in. I know you have a question. I have a feeling you were keeping a lot of notes to share with the sales enablement team or something, right? That's right. Or actually, I will say, Eric, though, that is a very helpful definition.
Starting point is 00:16:42 I mean, it's the classic, you're at a cocktail party. And to your point, it's not that people aren't intelligent. It's just that distilling a subject like AI with all of the various componentry is kind of hard. And so I really appreciate that. I'm going to paraphrase that definition in the future, if you don't mind, at the next cocktail party where AI comes up. Please use it. Yeah, I mean, that was pretty amazing, to be honest. It was one of the best, let's say, how-tos on taking some very, very complex concepts and distilling them down to things that everyone can understand.
Starting point is 00:17:27 So that's a very, very rare skill. So I totally understand why you had the position you had there. It's amazing. And I think it's one of the skills that anyone who's working with technology, we should work more on improving ourselves, to be honest, because like, it's a big, it's a big problem that we have, right? Like, especially when we introduce, not like, we introduce like new pieces of technology that they are pretty much like, we also have to invent, let's say, new language. Like, like people are just like not ready.
Starting point is 00:18:00 Like it takes time, like for, doesn't matter how smart you are, right? Like you need like to rewire your brain and start like thinking in different ways. So that's, that was amazing. Like, I don't know if you write, like if you have a blog or like, do you plan to write a book at some point? But please do. Like, I think many people. Oh, thank you.
Starting point is 00:18:18 All right. So having said that, and thank you so much for this amazing introduction, I'd like to chat a little bit more about the company, Conexus, right? And we were talking until now about AI, which is, let's say, the holy grail of data. We collect all these massive amounts of data because at some point we want to build these models that are going to use the data and help automate big parts of our life in a very positive way. But Conexus works on a much lower level of this, let's say, journey or supply chain of
Starting point is 00:18:59 data, let's say. Okay? Yeah. So what made you from being working with AI for so long, go and like build a company that works on like much more, let's say boring in a way like, and don't take this wrong. It's not boring for me,
Starting point is 00:19:17 but I'm pretty sure that you were discussing much more about AI than you were discussing about how to create connections between data, like with the people that you were meeting there. So, but I think there's a very good reason and I'd love to hear more about that. So. Yeah. Yeah. Thanks for that. You know, there's a lot of different levels at which we could talk about this. But I can take the, the last point, which is to a non-nerd, it ain't sexy. That's for sure. It's going to be difficult to write a Hollywood screenplay about math. But unless there are aliens involved, I think like if you have like an alien there that you try to communicate with, I think that script works most of the times.
Starting point is 00:20:05 There is, I will pay for this movie. There's a brilliant woman, Eugenia Cheng. She does a fascinating job bringing math to life through the metaphor of baking. Baking pi is, you know, obviously a kind of a pregnant way of saying this, but, you know, she even went so far as to have a children's book that I bought and I read to one of my nieces explaining math and specifically categorical algebra at the level I read to a four-year-old. So she's brilliant. I'm a fan of hers.
Starting point is 00:20:35 I can say that the math is where it's going. As I was studying computer science, the more I advanced in that domain, the more we got away from the syntax of the different languages, of course, and the more we got into the mathematics. What I have come to believe is that we are, not to be hyperbolic, but we're entering a new epoch where we are shifting the framework from that of logic that helped our current infrastructure of computing create itself to another epoch, which is that of composability. You see expressions of the concept of composability in such things as quantum computing, and specifically in quantum compilers, where we would not, as humans, be able to understand the output of quantum computers without the math of categorical algebra, category theory, or type theory. You see other expressions of composability with smart contracts, the structure of which would not be able to exist without categorical algebra or category theory.
Starting point is 00:21:57 That math helps you understand and analyze these increasingly complex systems. And there's really no other way to do that. You know, the math that we all grew up with is, you know, the math of the 20th century. I mean, I'd even say the math of the 19th century, calculus, geometry, trigonometry. You know, it's going to become a little bit like Latin, which is interesting, intellectually interesting, but less and less relevant to the digital age. Those are the maths that we will use for aerospace engineering or mechanical engineering. But for digital applications and the emerging compositional systems, we will be relying on the math of category theory and type theory and,
Starting point is 00:22:46 their expression in categorical algebra. So that's where we're going. And that's what Conexus is building. Conexus is built on a mathematical discovery, and that's as foundational as you get. You know, that's a law of nature. That's better than physics, right? I will say, as a little aside, just to point out how nerdy some of these math professors are, one of whom is our co-founder, David Spivak: you go to MIT's math department, and how will you be able to tell you're in the math department? Two ways. One is no computers on the desks. That's weird. The second is blackboards, not whiteboards. So, I mean, these are hardcore. I mean, if you went to Central Casting and you said, give me a mathematician, our co-founder would pop up. And that's what that math department looks like. So, it's a funny little aside about that. But this domain of math had a discovery where this sort of metamath of categorical algebra was applied to databases. So the translation of problems between spaces can now be done with databases.
Starting point is 00:24:00 That was expressed in software by Dr. Wisnesky, and it then began to have a commercial expression. And that's when I found out about it. Initially, I put money into it and then decided to jump in full time. The reason is because this is going to be the biggest trend we see over the next 10 to 20 years. Other people can be fascinated about other domains they may read about in the press. But this, although it's not telegenic and the math is not as easy of a story to tell, is foundational. It's going to make a very big difference.
Starting point is 00:24:37 Category theory, categorical algebra will be the math that our kids learn in the future. You might say the more math, the better, but if I was to choose, I would be replacing calculus, geometry, trigonometry with statistics, probability, and category theory. That's very interesting. So, okay. Let's step back a little bit and let's talk about category theory. So I, I mean, I was aware of like the impact
Starting point is 00:25:08 of category theory and type theory has like especially in compilers and functional languages, right? Like it's a big part of like the conversation that's happening especially like in people that are working like with Haskell and like the functional let's say, paradigm in writing content. So how did we go from these applications with compilers and computer languages to databases? What happened in this between? And how did this happen? I'd love to learn about that.
Starting point is 00:25:41 The easiest way to get at that is just from what our customers tell us. I mean, that's the best. We as engineers might really enjoy programming in Haskell. That's a fun place to be. But your commercial expressions of Haskell are kind of few and far between. I mean, we even worked with Uber,
Starting point is 00:26:02 and I'm not saying anything out of school here, to say that they didn't want to even open source their Haskell code, the code we created with them because it was Haskell. It's just not, it's not that they want to just be affiliated with it. They don't want to, they don't want any part of it. It's fun. Maybe we'll all do that in retirement is just sit and program in Haskell. What our clients tell us is, the clients of
Starting point is 00:26:26 Conexus, they come to us like Uber to say, hey, we tried solving this problem other ways, and we reached a dead end. So Uber is an interesting story where they, like many companies, have some very smart people, really exceptionally smart people have been at Uber in their technical functions, and they had an effectively infinite balance sheet with which to fund an optimal IT strategy. But neither of those allowed them to actually create an optimal IT infrastructure. Despite having smart people and an infinite balance sheet, they grew up respecting the business. They grew up in that case then as a ride-sharing company, city by city or jurisdiction by jurisdiction. And the output of that was they were then prevented from easily respecting a privacy lattice so that, for instance, driver's licenses might have a different privacy sensitivity than license plates, depending on the jurisdiction. Easy business questions, theoretically, such as, hey, there's a sporting event coming up, how will driver supply be affected or rider demand, could be answered for Richmond, Virginia or Charlotte, but they couldn't be answered for the whole state or the whole eastern seaboard of the U.S., the whole country or the whole world.
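The privacy lattice mentioned above can be illustrated with a small sketch. Everything here is hypothetical: the three sensitivity levels, the jurisdiction names, and the policy table are invented for the example, and a simple total order is used as the most basic kind of lattice. The point is only that a query spanning several jurisdictions has to take the join (least upper bound) of the sensitivities involved.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical privacy levels; a total order is the simplest lattice."""

    PUBLIC = 0
    INTERNAL = 1
    RESTRICTED = 2


def join(a: Sensitivity, b: Sensitivity) -> Sensitivity:
    # Least upper bound: the stricter of the two levels.
    return max(a, b)


# Hypothetical policy: the same field can be classified differently
# depending on the jurisdiction it was collected in.
POLICY = {
    ("jurisdiction_a", "license_plate"): Sensitivity.INTERNAL,
    ("jurisdiction_a", "drivers_license"): Sensitivity.RESTRICTED,
    ("jurisdiction_b", "license_plate"): Sensitivity.RESTRICTED,
    ("jurisdiction_b", "drivers_license"): Sensitivity.RESTRICTED,
}


def required_level(jurisdictions, fields) -> Sensitivity:
    """Clearance needed to run one query across all given jurisdictions."""
    level = Sensitivity.PUBLIC
    for jurisdiction in jurisdictions:
        for name in fields:
            level = join(level, POLICY[(jurisdiction, name)])
    return level


print(required_level(["jurisdiction_a"], ["license_plate"]).name)
print(required_level(["jurisdiction_a", "jurisdiction_b"], ["license_plate"]).name)
```

A city-level query only needs the local classification, while the wider query inherits the strictest rule of any jurisdiction it touches, which is one way the "whole state" question becomes harder than the single-city one.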
Starting point is 00:27:51 So Uber then looked about how to solve this problem. How do I integrate 300,000 databases in their particular case? This is what I meant about the scale. You know, that's a phase change when you have 300,000 databases. They realized, and this is to your point, Costas, hey, this needs to be solved at a deeper level. The commercial expressions are broken. They don't extend in the marketing language that you'll often hear.
Starting point is 00:28:21 They don't extend to 300,000 databases in a way that's feasible. We need to look deeper. They came up with the solution of categorical algebra, and then they looked around the world about who are the leaders in that, and they found Conexus. Conexus happens to be 40 miles north of them, so that worked out. But we then co-developed with them to solve that problem of bringing together 300,000, essentially 300,000 data models. And in a way that was guaranteed to maintain its integrity, which is really the point of the category theory. We can't have four mutate into approximately four. You have to have four equal four every single time you evolve that. Uber could have done this without category theory,
Starting point is 00:29:07 but the budget on that, I think they computed, would have been roughly $2 trillion. It's just because they don't do it because of all the connections that category theory allows. So that actually is the same as in quantum computing. You can use traditional methods in quantum mechanics to run the compiler, but then you start having to use imaginary numbers, which can be done, but it's not pleasant. It's not as easy. It's not as dependable.
Starting point is 00:29:37 In high consequence contexts, you want to have math that is less susceptible to human error. And that's where category theory comes in. And that's how it's evolved, I guess, as a way to be answering your question. And how is category theory, seeing the way that we solve these problems, what's the contribution of category theory that helps make these problems tractable while they were not before? So what's the secret sauce, would you say, of category theory?
Starting point is 00:30:08 Yeah. So category theory is a sort of metamath. I mean, if you're already talking about formal methods, then you're already most of the way there. It's really defining in a very precise way, giving the ability to define in a very precise way a reasoning engine or a
Starting point is 00:30:24 rules engine that then is encoded and shared with others. One of the clients of Conexus came to us and they described it this way. Have you ever been in a room where you wanted to ensure that you heard everybody's viewpoint and you wanted to make sure that the loudest didn't dominate the conversation and the quietest got heard such that everybody left the room feeling that their opinion was represented exactly as they had said it? That's how they said it. How they're telling it is that that can happen sometimes in a consensus. Say you have 30 people, 30 engineers get together and they have a consensus about what a wellhead would look like and how you define it. But then you add the 31st person and then it all breaks again. You have to do it all over again.
Starting point is 00:31:19 And that's exactly the sort of problem that engineering and manufacturing, to say nothing of healthcare and logistics and financial services, run into with some degree of regularity. What category theory allows is a logical composition of each of those that respects the different definitions or meanings that the engineer represents, or that the subject matter expert more generally represents. Category theory just supplies that language. You don't have to worry about the syntax, about how you might encode that, but the math allows for the semantics to be respected in any data transformation. Okay. So how do we go
Starting point is 00:32:06 from the business problem that we have? Let's make it simple. Let's say we have two databases, forget the 300,000 databases. And we want to align the two data models there.
Starting point is 00:32:22 How are we doing that using category theory? Like help us a little bit to understand the, let's say, the experience that the developer would have
Starting point is 00:32:30 trying to do that using category theory or the Conexus product, right? Yeah. Yeah. Well, at two databases, this is actually the problem
Starting point is 00:32:39 is a lot of existing solutions you'll deploy as a sort of proof of concept. Hey, let's deploy two and now let's do 200. And that's exactly where they break. So Conexus is engaged with customers who have had this experience and the people say, oh yeah, I can do it with category theory. And then they scale up to even 10 and they start fudging it.
Starting point is 00:33:01 You know, who among us hasn't ever hard-coded a cell in Excel, right? We've done it before. And that's what often these people will do if they don't have a foundation on which their product is built in category theory. So with two databases, you're not going to see a big difference. It's really a difference between a quadratic scaling and a linear scaling for the complexity. So let's just say five. You know, it doesn't have to get terribly big.
Starting point is 00:33:26 We had this experience, Conexus did, with a healthcare system in New York. So I didn't know this was possible, but this one healthcare system, one system, had different definitions of diabetes between the groups. Maybe this is familiar to you or some of the listeners, but for me, I thought, well, I can just look in the dictionary for a definition of diabetes, but no.
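The quadratic versus linear contrast above can be made concrete with a little arithmetic. The sketch below assumes one common reading of it: "quadratic" means writing a separate mapping for every pair of databases, and "linear" means writing one mapping per database into a shared schema. The transcript does not say that this is exactly how Conexus achieves linear scaling, so treat the second function as an illustrative assumption.

```python
def pairwise_mappings(n: int) -> int:
    """Point-to-point integration: one mapping per unordered pair of databases."""
    return n * (n - 1) // 2


def hub_mappings(n: int) -> int:
    """Assumed alternative: one mapping from each database into a shared schema."""
    return n


# Compare the two growth rates at the sizes mentioned in the conversation.
for n in (2, 5, 200, 300_000):
    print(f"{n:>7} databases: {pairwise_mappings(n):>14,} pairwise vs {hub_mappings(n):>7,} hub")
```

At two databases the counts are 1 and 2, so nothing looks wrong in a proof of concept; at five they are already 10 and 5, and the gap widens from there.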
Starting point is 00:33:50 How that expressed itself is that one group would say diabetes in the table, the attribute would be diabetes, and then in the row would be yes, no. And then in another division, this might be in research, so it might be temporal, or it might be in the row would be yes, no. And then in another division, this might be in research, so it might be temporal, or it might be in the use, clinician versus research. But the next table might say, diabetes, how are you treating it? Or the next one might say, diabetes, how long ago? Or the next one might have well-meaning clinicians that would say, well, Eric had diabetes, and then we treated it, and it doesn't appear to be showing up anymore for some reason, like whatever. So you have different definitions of diabetes across the organization.
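The differing diabetes tables described above can be sketched with two small record types. The field names and types are hypothetical, chosen only to mirror the "yes, no" and "how long ago" columns from the conversation. The sketch shows one explicit relationship between them: a yes/no view can be derived from the richer record by a function, without throwing the richer record away.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClinicRecord:
    """Hypothetical table from one division: diabetes as yes/no."""

    patient_id: str
    diabetes: bool


@dataclass(frozen=True)
class ResearchRecord:
    """Hypothetical table from another division: diabetes as 'how long ago'."""

    patient_id: str
    years_since_diagnosis: Optional[float]  # None means never diagnosed


def to_clinic_view(record: ResearchRecord) -> ClinicRecord:
    # The yes/no answer is derivable from the richer field; the mapping is
    # written down once instead of being re-decided by each integrator.
    return ClinicRecord(record.patient_id, record.years_since_diagnosis is not None)


research_rows = [
    ResearchRecord("p1", 3.5),
    ResearchRecord("p2", None),
]

# Derive the coarse view while keeping the original rows intact.
clinic_view = [to_clinic_view(row) for row in research_rows]
for original, derived in zip(research_rows, clinic_view):
    print(original, "->", derived)
```

Going the other way is not possible without extra information, since a bare "yes" does not say how long ago; that asymmetry is the kind of fact a schema mapping has to record.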
Starting point is 00:34:31 You know, what Connexus has seen is clients will do one of a few things. They'll either normalize all of that. So they'll just combine it, squash it into diabetes, yes, no, and lose the fidelity or the semantics from the subject matter experts. As we talked about with the meeting, everybody's meaning was lost because it's just the lowest common denominator. Or they will pick a couple of the ones that are easier to integrate. Maybe get diabetes, yes, no, and diabetes, how long ago? And then they'll ignore the rest. And that's often what happens, or they'll just ignore all the other ones. They'll just have
Starting point is 00:35:09 one and the others will remain dark, to use a Gartner term for it. Dark data: data you collected but aren't going to use. You know, it's really funny, actually, as an aside, I feel like companies have gotten the memo about data collection, you know, data is the new oil and all that. Like, there's a lot of data now being collected, but I really am curious about how much of that data is actually being used. Because my experience is not a lot. You know, they're collecting it, but it's not being put to good use by the data scientists. And it's because of that difficulty, that layer between data scientists and data engineering. So back to the example: we can either ignore it, we could
Starting point is 00:35:50 normalize it, or we can spend a whole bunch of money on ETL tools, and then on Tata or Wipro or TIBCO or Infosys or Accenture combining that data over a period of months, if not years. That's all suboptimal. So that's the background. What category theory provides, what Connexus provides for its clients, is a way to have a mathematical representation of those different traits, of those different columns in those different tables. So you can say, for example, the diabetes yes, no is related to diabetes how
Starting point is 00:36:33 long ago. And you can keep the fidelity of diabetes how long ago equals diabetes yes or no, plus some other attribute. And then you could just keep going. So that's actually how this gets expressed. You have a mathematical relationship that is able to be captured. So you can then begin to realize, oh, this is a little bit like graph theory, which is what my PhD is in. It's a little bit like graph theory, but it has a richness that is unavailable in graph theory. And that's why it's a little bit more like type theory. Because you can have an infinite amount of expressions
Starting point is 00:37:06 inside of every edge and node in the vernacular of graph theory. Category theory has its own vernacular. Yeah. Okay, and how, like... Okay, let's talk a little bit about what's the experience that the customer has, right? Like, what's...
Starting point is 00:37:22 Let's say the hospital comes to you, right? What's the journey that they have to go through together with Connexus to build this rule engine and represent all these semantics that they have around their data using category theory? Yeah, it's just a great, great question. And I'll tell you, it's really an easy start.
Starting point is 00:37:47 For engineers, accountants, lawyers, or people that are often trained to be at least aware of the precise meanings of their words, other people can get this too. Those people that haven't necessarily been trained in those disciplines. But often I find that those disciplines find this to be a little easier. You just have a whole bunch of data. It's the data plus the data relationships. So every person, because this is simple, every person has a name, right? That's person, name: every person has a name. That's the simplest part of an ontology log. And here's where it gets, here's where it's important.
Starting point is 00:38:46 You say, well, whatever, Eric, that's pretty, you know, that's trivial. Yes, it is. But here's how this gets messed up all the time. So just to use a common example, for me, you'd say, well, Eric has a nose and eyes. Eric has ears and eyes, ears and eyes. So where in there did I say singular? And where did I say plural? That can matter. You know, it matters a lot. You want to be super, super clear when you're writing down as a subject matter expert, my knowledge of my face, you know,
Starting point is 00:39:17 nose, eyes, ears. Yeah. So you've got to write that down in the ontology log. Every head has two eyes, maybe both working. One nose, maybe working. Two ears, maybe working. You've got to write it down with that level of precision. Any subject matter expert that kind of lives with their work has this implicit knowledge. They then need to represent that implicit knowledge explicitly. And that's what Connexus helps its clients do. But really, Connexus isn't going to read anybody's mind and it's not magic. So that has to be captured by the subject matter expert.
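The level of precision described above (two eyes, one nose, two ears) can be sketched as explicit counts that a record is checked against. This is a simplified illustration under stated assumptions, not the ontology log notation or any Connexus syntax. The `OLOG` table and the `violations` function are hypothetical names.

```python
# Hypothetical olog fragment: (source type, relation, target type) -> exact count required.
OLOG = {
    ("Head", "has", "Eye"): 2,
    ("Head", "has", "Nose"): 1,
    ("Head", "has", "Ear"): 2,
}


def violations(instance: dict[str, int]) -> list[str]:
    """Compare one Head instance (part name -> count) against the written-down counts."""
    problems = []
    for (_, _, part), required in OLOG.items():
        found = instance.get(part, 0)
        if found != required:
            problems.append(f"{part}: expected {required}, found {found}")
    return problems


print(violations({"Eye": 2, "Nose": 1, "Ear": 2}))  # consistent with the olog
print(violations({"Eye": 2, "Nose": 2, "Ear": 2}))  # singular/plural mix-up on Nose
```

Writing the singular and plural down explicitly is what lets a mismatch be detected mechanically instead of relying on what the expert implicitly knows.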
Starting point is 00:39:53 And we facilitate that and are working over time to make that easier. Yeah. So, okay. I have like a couple of follow-up questions here. First, when it comes like to domain experts, right? Like usually domain experts don't know category theory because they spend their time getting really deep into something else, right? Like for example, someone might be like a medical doctor
Starting point is 00:40:17 or they might be like, I don't know, like an expert in finance or whatever. How do we help these people encode this knowledge that they have into something that can then be accurately represented using category theory? Right. I mean, this is great. No one needs to... In order to do this, we would be at a huge disadvantage
Starting point is 00:40:41 if somehow we were requiring people to learn any math, let alone abstract math. This is probably not bringing up warm memories for people that studied abstract math at school. So, you know, category theory, I think, is easier than calculus, frankly. So I think people getting into it might enjoy it, you know, especially with the easy on-ramps available from the likes of Eugenia Cheng and others. One of our co-founders wrote two books on category theory. They're excellent. So people may enjoy it, but they don't have to learn it in order to do what we are describing here. In order to capture diabetes, yes, no, diabetes, how long ago, diabetes, how are we treating?
Starting point is 00:41:22 In order to capture that, you just have to start with a logical diagram. We'll call it an olog, an ontology log. You just have to create the ontology log. Transferring that olog with the right syntax into the software, that's the job of Connexus. That's what we will do. And that's super easy for us, but that's all it is. It's just a syntax like SQL, because we're just dealing with databases.
Starting point is 00:41:48 That's what Connexus does. So it's really super easy. You know, depending on the level of sophistication or interest, people can pick up this little modification of CQL in a long weekend. That's how it happens: ontology log, then syntax, then building a Connexus. Okay.
Starting point is 00:42:11 And now the tricky question, how do we deal with changes in the semantics? Because, okay. What we know today about diabetes might change in the future, right? And that's one of the reasons that we have also all these different versions of what diabetes might be. So what about the process of going back and refining the ontology log and representing this back to the rules that we have?
Starting point is 00:42:38 How does this happen? And how do we find out that we have to do that also? Well, the knowledge that any subject matter expert represents on an ontology log gets captured in the software. If the software needs to be modified to represent a change in the ontology log, that would be driven by the subject matter expert. The software is only an automatic reasoning engine. It just looks for all the possible connections. This is a factorial explosion of possible connections, or combinatorial, which is a term that people are often familiar with: a combinatorial explosion of relationships. So it's a reasoning engine about those relationships, but it's not going to read anybody's mind about the change in reality that
Starting point is 00:43:39 mandated a change in the design of the flange on the wellhead. That's a subject matter expert's responsibility. Okay. And, okay, we do, like, these things. We create, like, the rules. And then what? Like, what's the experience that we are having there? Like, are we able to, like, query this rule engine instead of, like, going and querying all
Starting point is 00:44:06 the different databases? How does this work in whatever we have been used to call, let's say, the analytics environment that the company has? How does this work? How can we integrate this with the data warehouses and the databases that we have? Yeah, I mean, the experience of a user is really going to be transparent. And to build on your last question about that, you'll get some benefits such as just contradiction detection. So another answer to your question also addresses the last one, which is, what are the benefits and how does it show up? It will tell you whether you have contradictions, which can often happen. If you have 30 engineers in a room, all analyzing a complex system, you may, and this is what happens today in many of these situations, have to go through iterations, many, many iterations, to expose the contradictions
Starting point is 00:45:08 in the different viewpoints of the subject matter experts. This automatic reasoning engine, which is powered by creating a Connexus, is available immediately. Just the push of a button and then you will see the contradictions in the data models that then will require perhaps a change in the semantics or a change in the ontology, which you might have taken years to discover, and perhaps only after something bad happened. The experience day-to-day should really be transparent because you're no longer needing to develop a consensus and abide by a consensus with others.
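The contradiction detection described above can be sketched in a few lines. This is not the Connexus reasoning engine, whose internals are not described in the conversation. It is a minimal illustration that assumes each expert's model is a flat set of named facts. The expert names, fact names, and values are all hypothetical.

```python
def find_contradictions(models: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Return each fact on which two or more experts' models disagree."""
    claims: dict[str, dict[str, int]] = {}
    for expert, model in models.items():
        for fact, value in model.items():
            # Group every expert's value under the fact it describes.
            claims.setdefault(fact, {})[expert] = value
    return {
        fact: by_expert
        for fact, by_expert in claims.items()
        if len(set(by_expert.values())) > 1
    }


# Two hypothetical engineers describing the same wellhead.
models = {
    "engineer_a": {"bolts_per_flange": 8, "valves_per_wellhead": 1},
    "engineer_b": {"bolts_per_flange": 12, "valves_per_wellhead": 1},
}
print(find_contradictions(models))
```

The check only surfaces the disagreement. Deciding which value reflects reality remains the subject matter experts' job, as the speaker notes.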
Starting point is 00:45:53 Once you have a Connexus, you're just experiencing it the same way you'd interact with any other sort of database or with SQL. You have your version and that remains the case. And you're accessing that in relation to the linkage, the automatically created linkage between all of the other definitions. Okay. So, is Connexus a tool that is used primarily by the data engineers while they're maintaining, let's say, the databases and the data lakes of the company?
Starting point is 00:46:27 Like who's like the, let's say, the primary user? Yeah, it operates somewhat at the level between a data scientist and a data engineer. Often the data engineers, although they are maintaining a complex data infrastructure, they often have less autonomy than they may wish. They're told what to do and implement, where data scientists just want to get it. They maybe have more flexibility to get done what they want to get done. This is often driven by peers operating in a business group that need to guarantee the meaning of their semantics.
Starting point is 00:47:06 That's how it was driven at Uber. That's how it was driven at most of our other clients. It's engineers in a business unit, not driven by IT, that need to collaborate with their teammates and not spend a couple of hours a day that we've heard sometimes are spent exchanging Excel docs. And one last question before I give the stage to Eric. So how long does it usually take from day one, when someone wants to build this ontology log and populate the reasoning engine, until they have,
Starting point is 00:47:43 let's say, everything in place and they are ready to start using it? The big issue is developing the ontology log. That actually can take someone moving their job from something that's implicit to explicit. There's a large time variance in how long it takes to create one of those, but I'll tell you, just last week we had somebody stay up after we told them about this just over dinner. At 10:30 at night, they sent us an ontology log of their job.
Starting point is 00:48:12 And then we coded it up in the semantics of a Connexus in 30 minutes and sent it back to them. So they can start reviewing that and say, oh, you know, is this the ontology log I meant? Just in the language of SQL, really? And they can be good to go. The magic comes, of course, when you're combining these different subject matter experts, these different ontology logs into one Connexus. That's where the power comes.
Starting point is 00:48:36 So, you know, the more, the better. But it can go pretty quick. The time is in developing the ontology log. And then the weakness, it's a question you didn't ask, but it's an important one. Where's the weakness? Where does it fall down? There are some jobs where we can't actually define them easily. An example was told to me last week where there was a person that walked around a big manufacturing plant and listened to the motors.
Starting point is 00:49:05 Okay. And would pull out parts, you know, designate, pull that part out, pull that part out. And how I heard the story was that nine out of 10 times, the person was right. It was good preventative maintenance. And when that person retired, there was a large increase in the preventative maintenance budget. So they pulled the person out of retirement for an hour a week, just to walk around and listen to the plant.
Starting point is 00:49:28 You know, maybe that could be replicated with machine learning and a microphone, you know? But if you can't take that implicit knowledge and put it down in a logical diagram, you know, we don't have a lot to say, Connexus doesn't have a lot to say. Yeah. Yeah. So I guess like, we are not going to see like sommeliers doing that
Starting point is 00:49:47 like anytime soon, I guess, creating ontology logs over like the wine experiences that they have. But I guess that's fine. That's okay. We can do that. Eric, all yours. Eric, a couple of questions here. So, you know, I think your average person working in data
Starting point is 00:50:07 probably doesn't get to see the scale of 300,000 databases, you know, a hundred databases is a lot, thousands is, you know, a lot. So I'm interested to know, you know, I feel like I could have a good handle on the ontology log and sort of how you prepare that, but I'm trying to wrap my mind around what's required on your end in order to actually interact with that many databases. Is that a significant infrastructural problem? Is that something that the customer sort of enables on their infrastructure that you have access to? How does that work practically when you engage, say, with an Uber who has 300,000 databases? Yeah. So we're not, Connexus is not a bit store company. We never will see the data in that way. So we're not storing petabytes of data.
Starting point is 00:51:18 And there may be different security protocols. So we have applications in defense and the intelligence community that may require different sorts of security protocols, but generally that's a complication of any implementation, just respecting that sort of framework for deployment. But the deployment is really a pretty light one because it's all in math. You know, this isn't Oracle that you're deploying, that is not us.
Starting point is 00:51:47 You know, it's cloud native, but it works with any of the cloud providers kind of in some fashion. It's really complementary to all of that. What Connexus does is, Connexus is just a way of doing what was previously difficult, if not impossible, like I say. Some things were just infeasible. That's what I meant to say. Infeasible, if not impossible. And Connexus just enables that. 300,000 databases is where we're going in many cases, but that is just an example of a sophisticated system. We worked with, Connexus worked with a financial services company that had a goal of taking
Starting point is 00:52:30 86 databases down to one. That was their vision. There's a well-known bank, but this is actually the story that they told us, this particular well-known US bank, 86 databases down to one. They budgeted $20 million in three years for this project. Five years and $120 million later, after people then got fired, they then went back and scaled down that problem from 86 databases to 16. They then budgeted another $100 million in five years, and they succeeded. But this is the point where we then get involved
Starting point is 00:53:10 because they say, well, at the end, I know that we still are left with a super fragile system in that if we acquire another company or divest one of our assets, we have to do it all over again in some sense, or we have introduced that degree of fragility. And that's what developing a Connexus can provide for these firms that really can't afford to have their data models be mutated.
Starting point is 00:53:35 Yep. And a question on the discussion around two databases, okay, five, 100, 500, 1,000, 100,000, 300,000. At what point do companies start feeling the pain? What's the breakpoint of complexity where Connexus really starts to add value? There's two ways where Connexus can begin to provide value. One, the general proxy is number of databases. And there, you could just do the simple arithmetic of a linear versus quadratic explosion. And so you'd say, well, three, four, oh, five, yes. You know, where you really start to know there's a difference. Yeah. But the real answer, the more nuanced answer is
Starting point is 00:54:30 it's in the sophistication of the data models. So if you have homogenous data models, then there's not a lot, there's not as much to say. Whereas if you had a broad heterogeneity of data models, such as in engineering, energy, transportation, manufacturing, then you really have a very big problem where the consequence of failure is large. And that's where Connexus provides the most value. That's where people actually will come to us unsolicited.
Starting point is 00:55:00 Yeah, absolutely. And we're closing in on time here, but one thing that I would love for you to provide our listeners is, you know, we've talked about AI a good bit on the show. We've never talked about sort of the mathematical componentry of it, you know, or certainly sort of how that's influencing a technology like you're building. What would you say to someone who, you know, wants to begin to study this now so that as technology like this, you know, sort of experiences wide adoption, what are the best ways for them to start to learn, you know, sort of this mathematical focused approach that you're using? Sure. Category theory, categorical algebra have been around since 1948. The mathematical discovery upon which Connexus builds happened in 2011 out of MIT. So I might look for texts and videos for category theory since that time. We mentioned Eugenia Cheng. One of our co-founders, Dr. Spivak,
Starting point is 00:56:06 is another one, and Brendan Fong, one of his collaborators in academia. Those people are excellent resources to learn more about category theory, categorical algebra. That's a great place for the mathematical foundation. You don't really need to learn that. Often as a programmer, as a coder, as a computer scientist, formal methods are maybe a nice, easy gateway into that. Some people might be more comfortable through type theory as a gateway into that. I often will say that category theory is like graph theory, but with more structure, just because that's my background. But I think that, oh, they may also benefit by just remembering that there is the theory, category theory, and then there's applied category theory, or that's why we call it categorical algebra. It
Starting point is 00:56:57 can often sound a little easier for people, because you can immediately just think of examples that might be day-to-day helpful. Very cool. Well, Eric, this has been just absolutely fascinating. Thank you for enlightening us on so many subjects. It's not often that you get to talk about the White House and mathematical theory in the same conversation, but you have brought those worlds together and we've learned a ton. So thank you
Starting point is 00:57:25 again for spending some time with us today. It's just been fun. Thank you, Eric. Thank you, Costas. Costas, I have three takeaways. I guess I keep breaking the rules more and more. The problem's getting worse because we're supposed to have one major takeaway, but here's my three. I'm just on a roll after Brooks not being here for a couple episodes. And so I'm sort of still sowing my wild oats. I'm going off script. So one is just how smart people are. I mean, it's just, you know, I felt like every, you know, five minutes we covered a subject where it was clear that Eric was like deeply knowledgeable, you know, probably on an expert, you know, fundamental level, you know, on all these different topics, which is really fascinating. The other, my second one is, it was just a good reminder that, you know, a lot of times we take it for granted that, you know, programming or working with data or solving problems around that has a foundation in math. I mean, it almost seems obvious, but I forget about that a lot. And so
Starting point is 00:58:31 I think it was fun to see him really draw the direct connection between mathematics and some of the things that we do day-to-day or the problems we solve. And then the last one, which is perhaps my favorite, was talking about the mathematics department at MIT, where you go in there and there's no computers and only blackboards and whiteboards. And that's like, you know, staying true, you know, to, to mathematics. And I just love that, that mental picture. So how about you?
Starting point is 00:59:03 Yeah, I mean, I'm okay. Like what I'll keep like from the conversation that we've had is how much more work can be done in like introducing like new technologies and new like things like doing computations and how this can change like the type of problems that we can solve. So we might think that like everything that could be solved is solved, but actually it's not like this. And at the same time, I get this feeling of fascination of how
Starting point is 00:59:36 such abstract concepts and stuff like category theory started as a very abstract and technical mathematical tool. They were trying to build the foundation of the rest of the mathematics, let's say. And it ends up solving real-life problems that affect me and you when we call an Uber or Lyft to come and pick us up. So that's one of the biggest joys that I have, like in doing the work that I'm doing and also the show here, is I keep like learning and really relearning this again and again. And it's like, so, so fascinating for me.
Starting point is 01:00:21 And it's one of the reasons that I really love like doing the stuff that I'm doing. I agree. I agree. Yeah, I think, you know, when you think about the practical nature of the example you give around something like diabetes, you realize, wow, I mean this,
Starting point is 01:00:36 you know, not to be too dramatic, but on some level, lives can be on the line, you know, if you get certain things wrong. So definitely some weighty stuff and a fascinating guy. Many more great episodes coming up, and we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. Thank you.
