The Data Stack Show - 98: Category Theory and the Mathematical Foundation of the Technologies We Use with Eric Daimler of Conexus
Episode Date: August 3, 2022

Highlights from this week's conversation include:
- Eric's background and career journey (3:30)
- Presenting to people without knowledge of AI (11:04)
- Why math was chosen over AI (19:03)
- From compilers to... databases (25:42)
- The contribution of category theory (30:09)
- The Conexus customer experience (37:45)
- The primary user of Conexus (46:33)
- Interacting with 300,000 databases (51:07)
- When Conexus begins to add value (54:02)
- The best way to learn this mathematical approach (55:46)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, the CDP for developers.
You can learn more at rudderstack.com.
Hey Data Stack Show listeners, Brooks here.
Usually, I'm behind the scenes keeping things rolling for the show, but today I'm coming out
of hiding to share some exciting news. We have another live show coming up, and we want you to
join us for the recording. This time, we're bringing back Tristan from Continual and Willem
from Tecton to talk about the future
of machine learning. We'll record the show on August 10th at 2 o'clock Eastern, 11 o'clock
Pacific. So mark your calendars and visit datastackshow.com slash live to register today.
Welcome to the Data Stack Show. We have an exciting episode because we are going to talk about
the White House. We are going to talk about math and we are going to talk about a data company
that solves really complex data problems all in one conversation with Eric, who is from
Conexus, which is a fascinating company, and he's a fascinating person. Kostas, of course
I have to ask him what it was like to be an advisor for AI at the White House. I think
typically when we think of the government, we don't necessarily think about people
solving issues or thinking deeply about the subject of AI, but that's exactly what he did.
I want to know what that meant practically day to day.
So that's what I'm going to ask.
How about you?
Absolutely.
I really want to hear all the stories he has to share from trying
to help the government understand the implications of all the state-of-the-art
technology, helping them introduce the right legislation, and how
these things happen, all these interactions.
It's something that's very, very different from what we are used to in building businesses
or building products.
So definitely a lot of open questions there around that.
But at the same time, he's also representing a company that's one of these companies that
they have a product that is very directly connected to very foundational research,
especially with mathematics.
So we'll have one of these rare opportunities where we can talk with someone and go through
the whole, let's say, from the product itself and the experience that it has and the problem
that it solves down to, let's say,
the core mathematics that are used to actually deliver this value.
So let's see what he has to say about all that stuff.
All right, let's do it.
Eric, thank you so much for giving us some time
and joining us on the Data Stack Show.
We can't wait to chat.
It's good to be here, Eric.
Thanks for having me.
All right, well, you have an absolutely fascinating background. You've probably sat at almost every
seat at the table that someone could think of when you think about a technology company
in the data space, and then some that you wouldn't think about. So can you just give us a quick
history of all the different things you've done,
the roles that you've played, and then what you're doing today? Sure. I've been told that I have a
rare, if not unique perspective in having exposure to the areas of AI from the perspective of being
a researcher, to being an entrepreneur, to being a
venture capitalist and even spending time in Washington, D.C. And that's often how people
will know me, if they know my name, is as acting as an AI authority during the last year of the
Obama administration. Before that, I had spent time as a professor in computer science and sitting on a couple of other
boards, one of which was SoftBank's largest investment into AI, Petuum.
Wow.
Kostas, I don't even know where to start.
I mean, this is so exciting.
But Eric, let's dig in, I think, where my mind went and I think a lot of our listeners.
So an advisor in the White House on the topic of
AI. Can you just tell us what was that like? What were your responsibilities? And then I think
I'm so interested in what were some of the really specific things that came up that you
worked through in that role? Sure. I can say that it's a very privileged position.
I was really grateful for the time.
I worked with some really smart and dedicated people,
and I hope to do it again someday.
The role itself actually has been elevated.
There's now an AI office inside what's colloquially known as a
science advisory group. I know the person that leads it and some of the people that work inside
of it. They're all super smart, competent, the right people there. And the job senior to
that one is now a cabinet-level job. So lovely people working very hard on behalf of the American
people. When I was there, there were other people in other areas of expertise from space to
healthcare. There's an expert on soil and agriculture. I just happened to be the authority
on AI during my time. There was another, in computer science, there was another Princeton professor
who was an expert on computer security.
Before me, the person was more of an expert
on very large computing systems.
But I am very happy to have been there when I was there.
It was a hot time to be there around AI.
What we did, what we do, it was colloquially known as the science advisory group.
It's really nonpartisan, and that was my experience.
So this was not a whole bunch of West Wing people in the way that you read about in a screenplay or something or some TV series.
These are nerds, right?
These are nerds.
Yeah.
It's a science thing.
We did not talk about politics.
And really, for all I know, and actually I did know in a couple of cases,
people had different political views even than the president that we serve.
Oh, fascinating.
I know the person I reported to said, I would serve a lot of presidents, but this president I serve with enthusiasm. And that's how I felt. The work was, humbly speaking on behalf of the president, advancing the goals of the White
House in coordinating the executive branch. So the executive branch is State, Defense, of course,
but also Health and Human Services, and Transportation is a big one, coordinating
those efforts in AI. So generally the funding of research, but also a coordination of the goals and outlook that the
federal government might have for the coming years. This got expressed then in written reports.
Many of them were public. Obviously, some of the work we did within the DOD is not,
and the intelligence community is not, but much of the work was public. I think you can even see
this still on the White House archives, the work we did. And it really was helpful in coordinating a conversation that then we could share with Congress, who would then allocate funds. Where do we want to go? What do we want to fund? What do we see happening in the future? We would take some lessons from what some of our allies would be doing and vice versa.
It was a wonderful experience where you have a very high level perspective of AI initiatives.
And this was actually a bigger deal than I had expected.
Obviously, everybody knows the federal government is big, but one of the wonderful parts about
that job, and I get goosebumps even thinking
about recalling that experience, was that it is bigger than any other organization.
So I get to see, oh, this is where people are experiencing roadblocks today that become
a lot worse in the future.
These are some of the big scale difficulties
people are going to be running into.
Because everybody knows that data is increasing, the doubling-every-two-years sort of thing,
or every 18 months for computing power.
The exponential growth in data is well understood.
But the growth in the combinations of data and data sources, quadratic to be more precise, produces an unfathomably large number of data relationships.
And that is just breaking, breaking systems.
Because if you flip from millions and billions to billions and trillions, you have to be thinking about your systems in a
fundamentally different way. It's really a phase change. Ice to water, water to gas. It is a
fundamentally different way to be interacting with that scale of data relationships. And that's what
we saw begin to happen in the federal government. Fascinating. Okay. One more question to
satisfy my curiosity, and then I'll hand it over to Kostas because I know his mind is buzzing with
questions. And this is just more of a curiosity in terms of taking research, papers, discoveries, recommendations, and say, presenting those to people who may not
have a good understanding of what AI is. I mean, we all work in the data industry.
And even then people, a lot of times will misuse the term AI, right? Or speak about it in a way
that's ambiguous. And so if you think about the wide
audience of people who were exposed to the work that you and your team did, was it difficult,
say, if you were presenting a research paper to Congress or they were digesting that,
how did you approach the problem that not everyone has a baseline understanding of what AI is at the
fundamental level? Was that a challenge? I'll actually say that was the challenge. I mean,
you don't present research papers to members of Congress. It just doesn't work. Some of these
people are smart. Some of the senators are very, very smart, some of them less so, but they still may not understand, and really shouldn't be expected to understand, the nuances of this tech.
So it's a big part of the job, actually, to work with my peers in the State Department or the Defense Department or the Transportation Department or Energy, some super smart people at the Energy Department, and then go back to members of Congress and try to, well, you don't
say dumb it down, because many of these people are super smart in their own
right, but try to simplify it in a way that is meaningful, so that they can have a grasp to make more effective
policy. I have a couple of conclusions from that experience. And it really was daily,
if not hourly. The social calendar wasn't really a social calendar because
I would often be the entertainment at dinner talking about AI at some ambassador's residence or with some members of Congress.
So the lesson I took is: tell a simplified version of AI.
And I can even share with you how I told it.
Stories can often be helpful. And then the second lesson is that we really need to bring more people into the conversation around AI.
Because even if the members of Congress and senators at the federal level would understand this, we still have every state government. To say nothing of other governments, allies around the world
from whom we also take some direction and where our companies are often subject to those laws,
GDPR as being a perfect example. Europe had, we could talk about their implementation and
their modifications of GDPR over the last few years, despite having very, very smart people, they've
been often misguided in their modifications to GDPR. So those two lessons: one is to have a good
definition that you'll share. Another is just generally working to bring more people into the conversation of AI, what it is, and how we want it to be implemented.
So the definition that I worked with that I found to resonate with members of Congress
is that AI is a system, a system that collects data, senses data.
So that could be from the LiDAR on top of your car; it could be from the air quality
sensor in your home. Then through that sensor it takes the data into a system that then cogitates about it, thinks
about it, plans for action.
That's a traditional place that people would think of AI.
And I notice I try not to get too pedantic about saying, well, AI is just that, with a
subset of deterministic and probabilistic AI,
a subset of which is machine learning,
a subset of which is deep learning, right?
That's not helpful, except for people that are researchers day to day, right?
But it's a system that kind of senses, plans,
and then acts on those decisions,
learning from the experience.
So we take that whole system
and then apply it
to how ordinary non-AI professionals can get engaged.
And we talk about an automated car, driving down the street,
seeing something.
Is it a crosswalk?
On the crosswalk, is it a person?
Is it a tumbleweed?
Is it a shadow?
What do I do?
Slow, stop, or keep going?
Do I ask for driver intervention? That's a
point that everybody can get: we as a society need to make a decision. We as a society will need
to determine where we put that liability. On the driver? On the manufacturer? On
the coder? That will happen. And we will have litigation around this to make that determination.
You know, Mercedes, for example, makes really no bones about them biasing towards the safety of the driver.
So you think, you know, if I see an automated Mercedes coming at me, I might back off a little bit.
And, you know, we're all part of Tesla's beta test, you know, whether we like it or not.
You know, they regularly break the law.
That's kind of just their mode of operation for testing their autonomous software.
So we need to engage more people in the conversation.
Use the definition and work to engage more people.
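Eric's sense-plan-act definition can be sketched as a minimal loop. Everything here (the object labels, the actions, the function names) is hypothetical and purely illustrative of the definition, not of any real autonomy stack:

```python
def sense(frame):
    """Classify what the sensor sees (hypothetical stub)."""
    return frame["object"]  # e.g. "person", "tumbleweed", "shadow"

def plan(observation):
    """Decide an action from the observation."""
    if observation == "person":
        return "stop"
    if observation == "tumbleweed":
        return "slow"
    return "keep_going"

def act(action, log):
    """Carry out the action and record it, so the system can learn from experience."""
    log.append(action)
    return action

# Sense -> plan -> act, learning (here, just logging) from the experience.
experience = []
for frame in [{"object": "shadow"}, {"object": "person"}]:
    act(plan(sense(frame)), experience)
print(experience)  # ['keep_going', 'stop']
```

The point of the definition is exactly this shape: a whole system that senses, plans, acts, and learns, rather than any one algorithm inside it.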
Super helpful and super fascinating.
I could keep going, but Kostas, please jump in.
I know you have a question.
I have a feeling you were keeping a lot of notes
to share with the sales enablement team or something, right?
That's right.
Or actually, I will say, Eric, though,
that is a very helpful definition.
I mean, it's the classic, you're at a cocktail party.
And to your point, it's not that people aren't intelligent.
It's just that distilling a subject like AI with all of the various componentry is kind of hard.
And so I really appreciate that.
I'm going to paraphrase that definition in the future,
if you don't mind, at the next cocktail party where AI comes up.
Please use it.
Yeah, I mean, that was pretty amazing, to be honest. It was one of the best examples of taking some very, very complex concepts and distilling them down to things that everyone can understand.
So that's a very, very rare skill.
So I totally understand why you had the position you had there.
It's amazing.
And I think it's one of the skills that anyone working with technology should work
more on improving, to be honest, because it's a big problem that we have, right?
Especially when we introduce new pieces of technology,
we also have to invent, let's say, new language.
People are just not ready.
It takes time. It doesn't matter how smart you are;
you need to rewire your brain and start thinking in different ways.
So that was amazing.
I don't know if you have a blog, or if you plan to write
a book at some point.
But please do. I think many people would benefit.
Oh, thank you.
All right.
So, having said that, and thank you so much for this amazing introduction,
I'd like to chat a little bit more about the company, Conexus, right? We were talking until now about
AI, which is, let's say, the holy grail of data. We collect all these massive amounts
of data so that at some point we can build models that are going to
use the data and help automate big parts of our life in a very positive way.
But Conexus works on a much lower level of this, let's say, journey or supply chain of
data.
Okay?
Yeah.
So what made you go from working with AI for so long
to building a company that works on something much more,
let's say, boring in a way?
And don't take this wrong.
It's not boring for me,
but I'm pretty sure that you were discussing AI much more
than you were discussing how to create connections between data with the
people that you were meeting there. But I think there's a very good reason, and I'd love
to hear more about that.
Yeah. Yeah. Thanks for that. You know, there's a lot of different
levels at which we could talk about this. But I can take the last point, which is: to a non-nerd, it ain't sexy.
That's for sure.
It's going to be difficult to write a Hollywood screenplay about math.
But unless there are aliens involved. I think if you have an alien there that you try to communicate with, that script works most of the time.
I will pay for this movie.
There's a brilliant woman, Eugenia Cheng.
She does a fascinating job bringing math to life through the metaphor of baking.
Baking pie is, you know, obviously kind of a pregnant way of saying this, but she even went so far as to write a children's book, which I bought and read to one of my nieces, explaining math, and specifically categorical algebra, at a level I could read to
a four-year-old.
So she's brilliant.
I'm a fan of hers.
I can say that the math is where it's going. As I was studying computer science, the more I advanced in that domain, the more we got
away from the syntax of the different languages, of course, and the more we got into the mathematics.
What I have come to believe is that, not to be hyperbolic, we're entering a new epoch, where we are shifting from the
framework of logic that helped our current infrastructure of computing create itself
to another epoch, that of composability. You see expressions of the concept of composability in such things as
quantum computing, and specifically in quantum compilers, where we would not, as humans, be able
to understand the output of quantum computers without the math of categorical algebra, category theory, or type theory. You see other expressions of composability with smart contracts, the structure of which
would not be able to exist without categorical algebra or category theory.
That math helps you understand and analyze these increasingly complex systems.
And there's really no other way to do that.
You know, the math that we all grew up with is, you know, the math of the 20th century.
I mean, I'd even say the math of the 19th century, calculus, geometry, trigonometry.
You know, it's going to become a little bit like Latin, which is interesting, intellectually
interesting, but less and less relevant to the digital age.
Those are the maths that we will use for aerospace engineering or mechanical engineering.
But for digital applications and the emerging compositional systems, we will be relying on the math of category theory and type theory, and their expression in categorical algebra. So that's where we're going, and that's what Conexus is building. Conexus is built on a mathematical discovery, and that's as foundational as you get. That's a law of nature. That's better than physics, right? Math is a strange thing.
I will say a little aside, just to point out how nerdy some of these math professors are,
one of which is our co-founder, David Spivak. You go to MIT's math department, and how will you
be able to tell you're in the math department? Two ways. One is no computers on the desks. That's weird. The second is blackboards, not whiteboards. So, I mean, these are hardcore. If you went to Central Casting and you said, give me a mathematician, our co-founder would pop up. And that's what that math department looks like. So, it's a funny little aside. But this domain of math had a discovery, where this sort of metamath of categorical algebra was applied to databases.
So the translation of problems between spaces can now be done with databases.
That was expressed in software by Dr. Wisnesky, which then began to have a commercial expression.
And that's when I found out about it.
Initially, I put money into it and then decided to jump in full time.
The reason is because this is going to be the biggest trend we see over the next 10 to 20 years.
Other people can be fascinated by other domains they may read about in the press.
But this, although it's not telegenic, and the math is not as easy a story to tell, is
foundational.
It's going to make a very big difference.
Category theory, categorical algebra will be the math that our kids learn in the future.
You might say the more math, the better, but if I were to choose, I would replace
calculus, geometry, and trigonometry with statistics, probability, and category
theory.
That's very interesting.
So, okay.
Let's step back a little bit and let's talk about category theory.
So, I mean, I was aware of the impact
that category theory and type theory have, especially in compilers and functional languages,
right? It's a big part of the conversation that's happening, especially among
people working with Haskell and the functional, let's say, paradigm of writing code.
So how did we go from these applications
in compilers and computer languages to databases?
What happened in between?
And how did this happen?
I'd love to learn about that.
The easiest way to get at that
is just from what our customers tell us.
I mean, that's the best.
We as engineers might really enjoy programming in Haskell.
That's a fun place to be.
But your commercial expressions of Haskell
are kind of few and far between.
I mean, we even worked with Uber,
and I'm not saying anything out of school here,
and they didn't want to even open source the Haskell code we created with them, because
it was Haskell.
It's not just that they don't want to be affiliated with it; they don't want any part of it.
It's fun.
Maybe we'll all do that in retirement: just sit and program in Haskell.
What the clients of Conexus tell us, coming to us like Uber did, is: hey, we tried solving this problem other ways,
and we reached a dead end. So Uber is an interesting story. Like many companies, they
have some very smart people, really exceptionally smart people have been at Uber in their technical functions, and they had an effectively infinite balance sheet with which to
fund an optimal IT strategy. But neither of those allowed them to actually create an optimal IT
infrastructure. Despite having smart people and an infinite balance sheet, they grew up reflecting the business.
They grew up, in that case, as a ride-sharing company, city by city or jurisdiction by jurisdiction. And the output of that was they were then prevented from easily respecting a privacy lattice, so that, for instance, driver's licenses might have a different privacy sensitivity than license plates, depending on the jurisdiction.
Easy business questions, theoretically, such as, hey, there's a sporting event coming up:
how will driver supply or rider demand be affected? They could be answered for Richmond, Virginia, or Charlotte, but they couldn't be answered for the whole state, the whole eastern seaboard of the U.S., the whole country, or the whole world.
So Uber then looked about how to solve this problem.
How do I integrate 300,000 databases in their particular case?
This is what I meant about the scale.
You know, that's a phase change and you have 300,000 databases.
They realized, and this is to the point, Kostas, they realized, hey, this needs to be solved
at a deeper level.
The commercial expressions are broken.
They don't extend, whatever the marketing language you'll often hear,
to 300,000 databases in a way that's feasible.
They needed to look deeper. They came upon the solution of categorical algebra,
and then they looked around the world for the leaders in that, and they found Conexus.
Conexus happens to be 40 miles north of them, so that worked out. But we then co-developed
with them to solve that problem of bringing together essentially 300,000 data models.
And in a way that was guaranteed to maintain its integrity, which is really the point of the category theory.
We can't have four mutate into approximately four.
You have to have four equal four every single time you evolve that. Uber could have done this without category theory,
but the budget on that, I think they computed,
would have been roughly $2 trillion.
They just couldn't do it without all the connections that category theory allows.
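The "four can't mutate into approximately four" guarantee amounts to requiring that data migrations be lossless: translate a record into the target model and back, and you must get exactly the original. A minimal sketch of such a round-trip check, with hypothetical field names and mappings (illustrative only, not Conexus's API):

```python
# Forward and backward translations between two toy data models.
def to_target(record):
    """Migrate a record into the target model's vocabulary."""
    return {"duration_years": record["years_with_condition"]}

def from_target(record):
    """Migrate a target-model record back into the source vocabulary."""
    return {"years_with_condition": record["duration_years"]}

def is_lossless(record) -> bool:
    """Round-trip check: the migration must preserve every value exactly."""
    return from_target(to_target(record)) == record

print(is_lossless({"years_with_condition": 4}))  # True
```

In the categorical framing, this exactness is not checked record by record after the fact; it is proved once about the mappings themselves, which is what makes the approach dependable at scale.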
So that actually is the same as in quantum computing.
You can use traditional methods in quantum mechanics to run the compiler, but then you start having to use imaginary numbers, which can be done, but it's not pleasant.
It's not as easy.
It's not as dependable.
In high consequence contexts, you want to have math that is less susceptible to human error.
And that's where category theory comes in.
And that's how it's evolved, I guess, as a way to be answering your question.
And how has category theory changed the way we solve these problems?
What's the contribution of category theory that helps make these problems
tractable when they were not before? So
what's the secret sauce, would you say, of category theory?
Yeah. So category theory is a sort of metamath. I mean, if you're already talking about formal methods, then you're already most of the way there. It really gives you the ability to define, in a very precise way, a reasoning engine or a rules engine that is then encoded and shared with others. One of the clients of Conexus came to us, and
they described it this way. Have you ever been in a room where you wanted to ensure that you heard
everybody's viewpoint and you wanted to make sure that the loudest didn't dominate the conversation
and the quietest got heard such that everybody left the room feeling that their opinion was
represented exactly as they had said it. That's how they said it. That can happen sometimes in a consensus: say you have 30 people, 30 engineers
get together, and they have a consensus about what a wellhead would look like and how you define it.
But then you add the 31st person, and it all breaks again. You have to do it all over again.
And that's exactly the sort of problem that engineering and manufacturing, to say nothing
of healthcare and logistics and financial services, run into with some degree of regularity.
What category theory allows is a logical composition that respects the different definitions or meanings that the engineer, or the subject matter
expert more generally, represents.
Category theory just supplies that language. You don't have to worry about the syntax,
about how you might encode that; the math allows for the semantics to be respected
in any data transformation.
Okay. So how do we go from the business problem that we have? Let's make it simple. Let's say we have two databases, or take the 300,000 databases, and we want to align the two data models there. How are we doing that using category theory? Help us a little bit to understand, let's say, the experience that the developer would have trying to do that using category theory or the Conexus product, right?
Yeah. Yeah. Well, at two databases, this is actually the problem: a lot of existing solutions you'll deploy as a sort of proof of concept.
Hey, let's deploy two, and now let's do 200.
And that's exactly where they break.
So Conexus is engaged with customers who have had this experience, and the people say,
oh yeah, I can do this without category theory.
And then they scale up to even 10, and they start fudging it.
You know, who among us hasn't ever hard-coded a cell in Excel, right?
We've done it before.
And that's what often these people will do if they don't have a foundation in category theory on which their
product is built.
So with two databases, you're not going to see a big difference.
It's really a difference between a quadratic scaling and a linear scaling of the complexity.
So let's just say five.
It doesn't have to get terribly big.
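A minimal sketch of why this is quadratic versus linear: integrating point-to-point requires one mapping for every pair of databases, while mapping each database into one shared model requires only one mapping per database. The arithmetic below is illustrative only, and the function names are ours:

```python
def point_to_point(n: int) -> int:
    """One mapping per pair of databases: quadratic growth."""
    return n * (n - 1) // 2

def shared_model(n: int) -> int:
    """One mapping per database into a common model: linear growth."""
    return n

for n in (2, 5, 200):
    print(n, point_to_point(n), shared_model(n))
# At n=2 the difference is invisible (1 vs 2 mappings);
# at n=200 it is 19900 vs 200.
```

This is why a proof of concept with two databases tells you almost nothing: the approaches only diverge as n grows.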
We have this experience, Conexus did,
with a healthcare system in New York.
I didn't know this was possible,
but this one healthcare system
had different definitions of diabetes
between the groups.
Maybe this is familiar to you or some of the listeners, but for me,
I thought, well, I can just look in the dictionary for a definition of diabetes. But no.
How that expressed itself is that one group would say diabetes in the table;
the attribute would be diabetes, and then in the row would be yes or no. And then in another division,
this might be in research, so it might be temporal, or it might be in the use, clinician versus research. But the next table might say,
diabetes: how are you treating it? Or the next one might say, diabetes: how long ago? Or the
next one might have well-meaning clinicians that would say, well, Eric had diabetes, and then we
treated it, and it doesn't appear to be showing up anymore for some reason, whatever.
So you have different definitions of diabetes across the organization.
You know, what Conexus has seen is clients will do one of a few things.
They'll either normalize all of that:
they'll just combine it, squash it into diabetes yes/no, and lose the fidelity, or the semantics, from the subject matter experts.
As with the meeting we talked about, everybody's meaning is lost, because it's just the lowest common denominator.
Or they will pick a couple of the ones that are easier to integrate,
maybe diabetes yes/no and diabetes how long ago,
and then they'll ignore
the rest. That's often what happens. Or they'll just ignore all the other ones. They'll just have
one, and the others will remain dark, to use a Gartner term for it: dark data, data you
collected but aren't going to use. You know, it's really funny, actually, as an aside: I feel like
companies have gotten the memo about the data collection, you know, data is the new oil and all that.
There's a lot of data now being collected, but I'm really curious about how much of that data is actually being used.
Because in my experience, it's not a lot.
They're collecting it, but it's not being put to good use by the data scientists.
And it's because of that difficulty, that layer between data science and data engineering.
So back to the example: we can either ignore it, we can
normalize it, or we can spend a whole bunch of money on ETL tools,
and then Tata or Wipro or TIBCO or Infosys or Accenture combining that
data over a period of months if not years. That's all
suboptimal.
So what category theory provides, that's the background, what category theory provides,
what Connexus provides for its clients is it provides for a way to have a mathematical
representation of those different traits, of those different columns in those
different tables. So you can say, for example, the diabetes yes, no is related to diabetes how
long ago. And you can keep the fidelity of diabetes how long ago equals diabetes yes or no,
plus some other attribute. And then you could just keep going.
So that's actually how this gets expressed.
You have a mathematical relationship that is able to be captured.
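To make that concrete, here is a hypothetical sketch in Python (the table and attribute names are invented for illustration; Connexus's actual tooling expresses this in its own categorical language, not Python). The idea is that the richer column, "diabetes, how long ago," determines the coarser column, "diabetes, yes/no," via a projection function, so the coarse view can be derived rather than squashing everything to the lowest common denominator:

```python
# Hypothetical illustration: the richer attribute ("diabetes, how long ago?")
# determines the coarser attribute ("diabetes, yes/no") via a projection.

def diabetes_yes_no(years_since_diagnosis):
    """Project the fine-grained attribute onto the coarse yes/no attribute."""
    return "yes" if years_since_diagnosis is not None else "no"

# One division stores the fine-grained attribute...
research_rows = [
    {"patient": "P1", "years_since_diagnosis": 4},
    {"patient": "P2", "years_since_diagnosis": None},
]

# ...and the coarse clinician view is recoverable as a derived column,
# so no fidelity is lost by keeping both representations related.
clinician_view = [
    {"patient": r["patient"],
     "diabetes": diabetes_yes_no(r["years_since_diagnosis"])}
    for r in research_rows
]
print(clinician_view)
```

The point is the direction of the arrow: you can always derive yes/no from "how long ago," but not the reverse, which is exactly the fidelity that gets lost when the tables are squashed together.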
So you can then begin to realize, oh, this is a little bit like graph theory, which is where my PhD is.
It's a little bit like graph theory, but it has a richness that is unavailable in graph
theory.
And that's why it's a little bit more like type theory. Because you can have an infinite amount of expressions
inside of every edge and node
in the vernacular of graph theory.
Category theory has its own vernacular.
Yeah.
Okay, and how, like...
Okay, let's talk a little bit about
what's the experience that the customer has, right?
Like, what's...
Let's say the hospital come to you, right?
What's the journey that they have to go through
together with Connexus to build this rule engine
and represent all these semantics
that they have around their data
using category theory?
Yeah, it's just a great, great question.
And I'll tell you, it's really an easy start.
For engineers, accountants, lawyers, people who are trained to be at least aware of
the precise meanings of their words, I find this to be a little easier, but other people can get this too,
people that haven't necessarily been trained in those disciplines. You just have a whole bunch of data.
It's the data plus the data relationships.
So every person, because this is simple, every person has a name, right?
That's person, name: every person has a name.
That's the simplest part of an ontology log.
And here's where it gets, here's where it's important.
You say, well, whatever, Eric, that's pretty, you know, that's trivial.
Yes, it is.
But here's how this gets messed up all the time.
So just to use a common example, for me, you'd say, well, Eric has a nose and eyes.
Eric has ears and eyes, ears and eyes.
So where in there did I say singular?
And where did I say plural? That can matter. You know, it matters a lot. You want to be super,
super clear when you're writing down as a subject matter expert, my knowledge of my face, you know,
nose, eyes, ears. Yeah. So you got to write that down in the ontology log. Every head has two eyes,
maybe both working, you know. One nose, maybe working.
Two ears, maybe working. You got to write down with that level of precision. Any subject matter
expert that kind of lives with their work has this implicit knowledge. They then need to represent
that implicit knowledge explicitly. And that's what Connexus helps its clients do.
But really, Connexus isn't going to read anybody's
mind, and it's not magic.
So that has to be captured by the subject matter expert.
And we facilitate that and are working over time to make that easier.
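As a rough illustration (in Python rather than the olog notation Connexus actually uses, and with invented names), an ontology log can be thought of as a typed graph where each arrow is a functional "every X has a Y" statement, so singular versus plural becomes explicit:

```python
# Hypothetical sketch of an ontology log ("olog") as a typed graph:
# nodes are types, arrows are functional "every X has a Y" relationships.
olog = {
    "types": {"Person", "Name", "Eye", "Nose", "Ear"},
    "arrows": [
        # (arrow name, source type, target type)
        ("name",      "Person", "Name"),
        ("left_eye",  "Person", "Eye"),   # two eyes become two separate
        ("right_eye", "Person", "Eye"),   # arrows, so the plural is explicit
        ("nose",      "Person", "Nose"),  # exactly one nose
        ("left_ear",  "Person", "Ear"),
        ("right_ear", "Person", "Ear"),
    ],
}

# Sanity check: how many Eye-valued facts must be recorded per Person?
eye_arrows = [a for a in olog["arrows"] if a[1] == "Person" and a[2] == "Eye"]
print(len(eye_arrows))  # 2: "two eyes" is now precise, not implied
```

Writing the knowledge this way is what forces the precision the speaker describes: "has eyes" is ambiguous, but two named arrows into `Eye` is not.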
Yeah.
So, okay.
I have like a couple of follow-up questions here.
First, when it comes like to domain experts, right?
Like usually domain experts don't know category theory
because they spend their time getting really deep into something else, right?
Like for example, someone might be like a medical doctor
or they might be like, I don't know, like an expert in finance or whatever.
How do we help these people encode the knowledge that they have
into something that can then be accurately represented
using category theory?
Right.
I mean, this is great.
No one needs to learn it.
In order to do this, we would be at a huge disadvantage
if somehow we were requiring people to learn any math, let alone abstract math.
This probably isn't bringing up warm memories for people that studied abstract math at school.
So, you know, category theory, I think, is easier than calculus, frankly.
So I think people getting into it might enjoy it, you know, especially with the easy on-ramps available from the likes of Eugenia Cheng and others.
One of our co-founders wrote two books on category theory.
They're excellent.
So people may enjoy it, but they don't have to learn it in order to do what we are describing here.
In order to capture diabetes, yes, no, diabetes, how long ago, diabetes, how are we treating?
In order to capture that, you just have to start with a logical diagram.
We call it an ontology log, or an olog for short.
You just have to create the ontology log, transferring that O log with the right syntax
into the software.
That's the job of Connexus.
That's what we will do.
And that's super easy for us, but that's all it is.
It's just a syntax like SQL, because we're just dealing with databases.
That's what Connexus does.
So it's really super easy.
You know, depending on the level of sophistication or interest,
people can pick up this little modification of CQL in a long weekend.
That's how it happens.
That's what happens.
You put the ontology log syntax into the software, into building a Connexus.
Okay.
And now the tricky question, how do we deal with changes in the semantics?
Because, okay.
What we know today about diabetes might change in the future, right?
And that's one of the reasons that we have also
all these different versions of what diabetes might be.
So what about the process of going back and refining
the ontology log and propagating this back
to the rules that we have?
How does this happen?
And how do we find out that we have to do that also?
Well, the knowledge that any subject matter expert represents on an ontology log gets captured in the software.
If the software needs to be modified to represent a change in the ontology log, that would be driven by the subject matter expert.
The software is only an automatic reasoning engine.
It just looks for all the possible connections.
This is a factorial explosion of possible connections, or combinatorial, a term that people are often familiar with: a combinatorial explosion of relationships. So it's a reasoning engine about those relationships,
but it's not going to read anybody's mind about the change in reality that
mandated a change in the design of the flange on the wellhead. That's a subject matter expert's responsibility.
Okay.
And, okay, we do, like, these things.
We create, like, the rules.
And then what?
Like, what's the experience that we are having there?
Like, are we able to query this rule engine
instead of going and querying all
the different databases? How does this work in what we have been used to calling, let's say,
the analytics environment that the company has? How can we integrate this as part of the data warehouses and the databases that we have?
Yeah, I mean, the experience of a user is really going to be transparent.
And to build on your last question about that, you'll get some benefits such as just contradiction detection.
So another answer to your question also addresses the last one, which is,
what are the benefits and how does it show up? It will tell you whether you have contradictions,
which can often happen. If you have 30 engineers in a room and all analyzing a complex system,
you may, this is what happens today in many of these situations, you may have to go through iterations, many, many iterations to expose the contradictions
in the different viewpoints of the subject matter experts.
This automatic reasoning engine that then is powered by creating a connexus, that's
available immediately.
Just push of the button and then you will see the contradictions in the data models
that then will require perhaps a change in the semantics or change in the ontology that you
might have taken years to discover and perhaps after something bad happened.
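A toy version of that contradiction check, as a hypothetical Python sketch (not the actual reasoning engine, and with invented data): once two subject-matter-expert views are expressed over shared keys, disagreements surface mechanically instead of through many rounds of meetings.

```python
# Hypothetical sketch: flag every (entity, attribute) pair where two
# subject-matter-expert views assert different values.
def find_contradictions(view_a, view_b):
    conflicts = []
    for key, value in view_a.items():
        if key in view_b and view_b[key] != value:
            conflicts.append((key, value, view_b[key]))
    return conflicts

# Two views of the same patients, keyed by (entity, attribute).
clinician = {("P1", "diabetes"): "yes", ("P2", "diabetes"): "no"}
research  = {("P1", "diabetes"): "yes", ("P2", "diabetes"): "yes"}

print(find_contradictions(clinician, research))
# [(('P2', 'diabetes'), 'no', 'yes')]
```

The real system reasons over relationships between data models rather than raw values, but the benefit is the same shape: the conflict is visible at the push of a button rather than years later, after something bad has happened.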
The experience day-to-day should really be transparent because you're no longer needing
to develop a consensus and abide by a consensus with others.
Once you have a Connexus, you're just experiencing it the same way you'd interact with any other
sort of database or with SQL.
You have your version, and that remains the case. And you're accessing that in relation to the linkage, the automatically created
linkage between all of the other definitions.
Okay.
So, is Connexus a tool that is used primarily by the data engineers
while they're maintaining, let's say, the databases and the data lakes
of the company?
Like, who's, let's say, the primary user?
Yeah, it operates somewhat at the level between a data scientist and a data engineer.
Often the data engineers, although they are maintaining a complex data infrastructure,
they often have less autonomy than they may wish.
They're told what to do and implement, whereas data scientists just want to get at it.
They maybe have more flexibility to get done what they want to get done.
This is often driven by peers operating in a business group that need to guarantee the meaning of their semantics.
That's how it was driven at Uber.
That's how it was driven at most of our other clients.
It's engineers in a business unit, not driven by IT,
that need to collaborate with their teammates and not spend a couple of hours a day
that we've heard sometimes are spent exchanging Excel docs.
And one last question before I give the stage to Eric.
So how long does it usually take, from day one when someone wants to build
this ontology log and populate the reasoning engine, until they have,
let's say, everything in place and
they're able to start using it?
The big issue is developing the ontology log.
That can actually take someone moving their job from something that's implicit
to explicit. There's a large time variance in how long it takes to create
one of those. But I'll tell you, just last week we had somebody stay up after we told them about this just
over dinner.
They said at 1030 at night, they sent us an ontology log of their job.
And then we coded it up in the semantics of a connexus in 30 minutes and sent it back
to them.
So that's it. And they can start reviewing that and say, oh, you know, is this the ontology
log I meant?
Just in the language of SQL, really?
And they can be good to go.
The magic comes, of course, when you're combining these different subject matter experts, these different ontology logs into one connexus.
That's where the power comes.
So, you know, the more, the better.
But it can go pretty quick.
The time is in developing the ontology log.
And then the weakness, it's a question you didn't ask, but it's an important one.
Where's the weakness?
Where does it fall down?
There are some jobs where we can't actually define them easily.
An example was told to me last week where there was a person that walked around a big manufacturing plant and listened to the motors.
Okay.
And would pull out parts, you know, designate, pull that part out, pull that part out.
And how I heard the story was that nine out of 10 times, the person was right.
That it prevented failures. It was good preventative maintenance.
And when that person retired, there was a large increase in the preventative
maintenance budget.
So they pulled the person out of retirement for an hour a week, just
to walk around and listen to the plant.
You know, maybe that could be replicated with machine
learning and a microphone, you know?
But if you can't take that implicit knowledge and put it down
in a logical diagram, you know, we don't have a lot to say. Connexus doesn't have
a lot to say.
Yeah.
Yeah.
So I guess we are not going to see a sommelier doing that
anytime soon, creating ontology logs
of the wine experiences that they have.
But I guess that's fine.
That's okay.
We can do that.
Eric, all yours.
Eric, a couple of questions here.
So, you know, I think your average person working in data
probably doesn't get to see the scale of 300,000 databases. You know,
a hundred databases is a lot; thousands is, you know, a lot. So I'm interested to know,
you know, I feel like I have a good handle on the ontology log and sort of how you prepare it, but help my mind around what's required on your end in order to actually interact with that many databases. Is that a significant infrastructural problem? Is that something that the customer sort of enables on their infrastructure so you have access to it? How does that work practically when you engage, say, with an Uber who has
300,000 databases?
Yeah.
So Connexus is not a bit-store company.
We never will see the data in that way.
So we're not storing petabytes of data.
Now, there may be different security protocols.
We have applications in defense and the intelligence community that
may require different sorts of security protocols, but generally
that's a complication of any implementation,
just an aspect of that sort of framework for deployment.
But the deployment is really a pretty light one, because it's all in math.
You know, this isn't Oracle that you're
deploying; that is not us.
It's cloud native, but it works with any of
the cloud providers in some fashion.
It's really complementary to all of that.
What Connexus does is just a way of doing what was previously difficult,
if not impossible, like I say. Some things were just infeasible. That's what I meant to say:
infeasible, if not impossible. And Connexus just enables that. 300,000 databases is where we're
going in many cases, but that is just an example of a sophisticated system.
We worked with, Connexus worked with a financial services company that had a goal of taking
86 databases down to one.
That was their vision.
There's a well-known bank, but this is actually the story that they told us, this particular
well-known US bank, 86 databases down to one.
They budgeted $20 million in three years for this project. Five years and $120 million later, after people then got fired, they then
went back and scaled down that problem from 86 databases to 16. They then budgeted another $100 million in five years,
and they succeeded.
But this is the point where we then get involved
because they say, well, at the end,
I know that we still are left with a super fragile system
in that we acquire another company
or divest of one of our assets,
and we have to do it all over again in some sense,
or we have introduced that degree of fragility.
And that's what developing a Connexus can provide for these firms that
really can't afford to have their data models be mutated.
Yep. And a question on the discussion around the number of databases: two, five, 100, 5,000, 100,000, 300,000. At what point, where do companies start feeling the pain? What's the breakpoint of complexity where Connexus really starts to add value? There's two ways where Connexus can begin to provide value.
One, the general proxy is number of databases.
And that, you could just do the simple arithmetic
of a linear versus quadratic explosion.
And so you'd say, well, three, four, or five, yes.
You know, where you really start to know there's a difference.
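The arithmetic behind that remark, sketched in Python: integrating n databases pairwise needs n(n-1)/2 point-to-point mappings, while mapping each one to a single shared model needs only n.

```python
# Pairwise integration grows quadratically; a shared-model hub grows linearly.
def pairwise_mappings(n):
    """Number of point-to-point mappings among n databases."""
    return n * (n - 1) // 2

for n in (3, 5, 86, 300_000):
    print(n, "databases:", pairwise_mappings(n), "pairwise vs", n, "hub mappings")
# At 86 databases that is 3,655 pairwise mappings versus 86.
```

So a handful of databases is tractable by brute force, but by the time the count reaches the hundreds or the hundreds of thousands, only the linear approach survives.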
Yeah.
But the real answer, the more nuanced answer is
it's in the sophistication of the data models.
So if you have homogenous data models,
then there's not a lot, there's not as much to say.
Whereas if you had a broad heterogeneity of data models,
such as in engineering, energy, transportation, manufacturing, then you really
have a very big problem where the consequence of failure is large.
And that's where Connexus provides the most value.
That's where people actually will come to us unsolicited.
Yeah, absolutely.
And we're closing in on time here, but one thing that I would love for you to provide our listeners: you know, we've talked about AI a good bit on the show, but we've never talked about sort of the mathematical componentry of it, or how that's influencing a technology like the one you're building.
What would you say to someone who, you know, wants to begin to study this now so that as technology like this, you know, sort of experiences wide adoption, what are the best ways for
them to start to learn, you know, sort of this mathematical focused approach that you're
using?
Sure. Category theory, categorical algebra have been around since 1948. The mathematical discovery
upon which Connexus builds happened in 2011 out of MIT. So I might look for texts and videos
for category theory since that time. We mentioned Eugenia Cheng; Dr. Spivak, one of our co-founders,
is another one; and Brendan Fong in academia, one of his collaborators.
Those people are excellent resources to learn more about category theory, categorical algebra.
That's a great place for the mathematical foundation. You don't really need to learn that. Often as a programmer, as a coder, as a computer scientist, formal methods are maybe a nice, easy gateway
into that. Some people might be more comfortable through type theory as a gateway into that. I
often will say that category theory is like graph theory, but with more structure, just because
that's my background. But I think that, oh,
they may also benefit by just remembering that there is the theory, category theory,
and then there's applied category theory, or that's why we call it categorical algebra. It
can often sound a little easier for people, because you can immediately just think of examples that might be day-to-day helpful.
Very cool.
Well, Eric, this has been just absolutely fascinating.
Thank you for enlightening us on so many subjects.
It's not often that you get to talk about the White House and mathematical theory in
the same conversation, but you have brought those worlds together and we've learned a
ton.
So thank you
again for spending some time with us today. It's just been fun. Thank you, Eric. Thank you, Costas.
Costas, I have three takeaways. I guess I keep breaking the rules more and more. The problem's
getting worse because we're supposed to have one major takeaway, but here's my three. I'm just on
a roll after Brooks not being here for a couple episodes. And so I'm sort of still sowing my wild oats. I'm going off script. So one is just how smart people are.
I mean, it's just, you know, I felt like every, you know, five minutes we covered a subject where
it was clear that Eric was like deeply knowledgeable, you know, probably on an expert, you know, fundamental level, you know, on all these different topics, which is really fascinating.
The other, my second one is, it was just a good reminder that, you know, a lot of times we take it for granted that, you know, programming or working with data or solving problems around that has a
foundation in math. I mean, it almost seems obvious, but I forget about that a lot. And so
I think it was fun to see him really draw the direct connection between mathematics and some
of the things that we do day-to-day or the problems we solve. And then the last one,
which is perhaps my favorite, was talking about the
mathematics department at MIT, where you go in there and there's no computers
and only blackboards and whiteboards.
And that's like, you know, staying true, you know, to, to mathematics.
And I just love that, that mental picture.
So how about you?
Yeah, I mean, okay.
What I'll keep from the conversation that we've had is how much more work can be done in introducing new technologies and
new things like doing computations, and how this can change the
type of problems that we can solve.
So we might think that everything that could be solved
has been solved, but
actually it's not like this. And at the same time, I get this feeling of fascination at how
such abstract concepts, stuff like category theory, started as a very abstract and technical mathematical tool.
They were trying to build the foundation of the rest of mathematics, let's say.
And it ends up solving real-life problems that affect me and you.
And we call an Uber or Lyft to come and pick us up.
So that's one of the biggest joys that I have in doing the work that
I'm doing, and also the show here: I keep learning and
really relearning this again and again.
And it's so fascinating for me.
And it's one of the reasons that I really love doing the stuff that I'm doing.
I agree.
I agree.
Yeah, I think, you know,
when you think about
the practical nature of the example
you give around something like diabetes,
you realize, wow, I mean this,
you know, not to be too dramatic,
but on some level,
lives can be on the line,
you know, if you get certain things wrong.
So definitely some weighty stuff and a fascinating guy. Many more great episodes coming up, and we will catch you
on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on
your favorite podcast app to get notified about new episodes every week. We'd also love your
feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. Thank you.