The Data Stack Show - 123: What Is a Universal Database? Featuring Stavros Papadopoulos of TileDB, Inc.
Episode Date: January 25, 2023

Highlights from this week's conversation include:

Stavros' journey into data and founding TileDB (3:12)
What problem was TileDB going to solve? (12:05)
Defining database systems (21:35)
What part of database architecture is TileDB? (31:58)
Storage engine solutions (42:37)
What does the API look like in using TileDB? (50:40)
What makes genomics unique in working with data (55:28)
Final thoughts and takeaways (1:06:46)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
Costas, today we're going to talk with Stavros from TileDB. He's Greek,
so I know you're going to have a great conversation with him. And he created some
really interesting technology. They call it a universal database, and I'm really interested to
know what that means. So I'm going to ask about, you know, what is TileDB?
What is a universal database?
My guess is that the technology is a little bit more specific and opinionated.
But also he spent a lot of time in academia, which is really interesting, something we've
talked a little bit about on the show before. And so I really want to hear the story about how TileDB came about in his work at MIT. So, yeah, those are my questions. How about you?
Yeah, first of all, I want to see how you're going to handle two Greeks at
the same time. So let's see what will happen.
With the Socratic method.
Oh yeah, keep asking questions. We will never finish the recording.
So that's one thing. The other thing is, it's a very interesting opportunity that we have here, because TileDB actually started by building one of the most, let's say, low-level and core parts of a database system, which is the storage engine. The initial question comes from that: how can we store the data at a very low
level in a much more efficient way that also gives us good ergonomics for dealing with the
data?
So it's a great opportunity to focus on that and learn more about one aspect of database systems that we don't
usually have the opportunity to, because
many times
we take it for granted, I think, even
with database systems.
So I think it would be very interesting
to hear from him about
how they built that.
And also, what's the story behind it?
Why did they start from that, and why did they open-source it,
and all the things around TileDB,
not just as a technology, but also as a company.
Well, let's dig in and talk with Stavros.
Stavros, welcome to the Data Stack Show.
I am so excited to talk about so many things,
the primary one of which is all databases, everything databases, and especially TileDB.
So thanks for giving us some time.
Thank you for having me. I'm very happy to be here.
Okay, so give us your background, because you have a long history in technical disciplines, but
not necessarily in databases specifically.
So where did you start?
And then what was the path that led you to founding TileDB?
So TileDB is the first company I'm creating.
I'm a technical CEO, so I have a PhD in computer science.
I did my PhD in Hong Kong, where I spent several years.
Then I became a professor.
So I have a very deep academic background.
It has always been data, not database systems, but data,
a lot of algorithms, data structures.
Then I did a lot of work on security and cryptography, but always on data.
And in 2014, I got an amazing job at Intel Labs and MIT.
And that's when I moved to Boston.
And that's when I started actually working on TileDB, and effectively
on database systems.
That's where I got more experience and I dove deeper into the database systems world.
Super interesting.
Can I ask what your thesis was on in your doctorate?
It was on query result integrity, actually, authentication and integrity of results.
Really?
So I was creating a variety of data structures infused with some cryptographic primitives
so that we can certify that the results returned for a query are indeed the correct results produced by the owner of the data.
There are several different nice techniques for that.
So I built several structures for geospatial data,
and this is where the geospatial angle of TileDB comes from.
So yeah, very much into data structures and cryptographic primitives.
Yeah.
Okay, I have another question for you, if you will entertain me, because, as Costas knows, I love thinking about the interaction of sort of academics and sort of backgrounds and then how that influences sort of building. You have a PhD in computer science. Could you outline what you think
some of, like, maybe the top advantages and disadvantages of an academic
background are, if there are any? You know, because commercializing technology out of academia is hard,
right? Like, that's difficult. But yeah, I'd just love to know your experience, because you
have that so viscerally as someone with a deep academic background and now, you know, who started your first company.
Yeah, that's a great question. So the business world is 100 percent different from the academic world. A small piece of advice to any academically oriented
person starting in business:
start studying, as I did.
I studied a lot, and surrounded
myself with amazing mentors
from the business side of things.
It's a different world. It requires
another PhD.
So your computer science PhD
is not going to help you there at all. Start from scratch.
Where it helps me a lot until today is, first of all, understanding what product to build,
right? I'm very, very heavily involved in the decisions around the design, what features to
build, certain algorithmic aspects. I have an amazing team, but I'm very heavily involved
in a lot of core components of TileDB.
I wrote the original code base of the storage engine,
so I understand the code.
I do code reviews sometimes, believe it or not,
for specific features that I'm very interested in.
The biggest advantages are,
one,
defining the direction of the product,
of course, and having the vision, which,
for a database company, means you need to
understand the technology. You need to understand all the
surrounding technologies in order to
differentiate, innovate, and so on and so forth. So that's
extremely important as a founder
and CEO of a company.
But the other advantage is when I get on a call with the customers
because the customer can tell in the first 30 seconds
that I know what the heck I'm talking about.
Yeah, yeah, yeah.
And I know how to solve the damn problem.
Yeah, yeah.
Very quickly.
Yeah, yeah.
They understand.
And it's all sincere because I do get down into the details.
For example, it's a new problem, which is not exactly a replica of the problems we have solved in the past.
And I offer solutions and I brainstorm with the customers and my team.
And actually, I enjoy it.
Like, this is one of the most enjoyable parts, even today.
I mean, even at the scale we are now, it's still present.
Yeah. Yeah. Okay. One more question. And then,
and then I want to talk about databases.
What is like on the business side as a CEO,
you know, founder of a, of a tech company, what's the most sort of unexpected
delight that you have in the business side that you maybe didn't expect, right? Like, is it
like managing people or like, you know, fundraising? Or is there something where
you're like, man, this is really fun. And I just didn't really know this sort of existed on the business side?
Not exactly, but let me tell you a little bit about my experience on that.
Unless I'm forgetting something, I apologize for that.
So let me tell you what I did throughout the company, and why I never had any pleasant or unpleasant surprises,
believe it or not. I've been running the company for five and a half years now.
Yeah.
And I have never been truly surprised because I was always studying for the stage of the company.
For example, when I was raising my very first round, I was studying how a convertible note is structured.
Yeah.
So that I know what kind of deal to cut, right?
And of course, I was talking to a thousand angel investors and mentors
in order to understand what is a good deal, what is not a good deal.
Yeah, yeah.
As the company was scaling, I started studying more about culture, right?
Like what kind of culture do we want the company to have?
Because the first four people
are going to dictate the culture, but of course
also the first 10, the first
20, and so on and so forth. Not the first thousand.
Certainly a small core of
people will dictate. They will determine
the cultural path
of the company.
And then, of course, when we
started closing enterprise deals,
I was learning about enterprise deals, the sales cycles, budgeting,
procurement, all of that stuff that you need to know
in order to be very efficient.
And the same goes for marketing.
When we started doing a little bit of marketing, how do we brand,
what kind of traction channels we're going after, and so on and so forth.
So I was never really surprised by something. But what is extremely pleasant to me, again speaking as a scientist, not so much
as a business person, is the results we're having with the customers. We have been extremely
fortunate, and we can talk about this later, to focus on certain verticals that are extremely meaningful.
For example, life sciences.
And you know what, it is extremely meaningful when your system helps a hospital save the lives
of babies in the NICU.
So yes, that delights us as a company, right? And we continue to go after challenging
and very meaningful use cases
as our first niche beachhead markets.
And of course we expand once we do that
so that we have both purpose,
but also sustainability and growth.
Yeah, yeah.
I love it.
I mean, I love hearing sort of,
it sounds a little bit cliche from a business standpoint, but being customer-obsessed, you know, is a key ingredient.
And it's very clear that you love digging in with the customers and understanding the problem, which is great.
Okay, well, thank you.
That's such wonderful background.
Let's talk about databases.
So, you know, creating a database product is
a huge undertaking. You've been doing that for five or six years now. The database space has
changed significantly, even in the last, you know, half decade, five years or so.
What problem were you thinking about when you started TileDB and how did you decide
that a database was the right way to solve it? So when I started TileDB, I did not have
in mind to create a company. That was not the original motivation. I was a researcher. I liked
my job as a researcher.
Intel Labs was phenomenal. MIT was phenomenal. I was effectively embedded into MIT by Intel.
So I had both the industry and academia hats, which was amazing.
I interacted with, you know, probably the smartest people on the planet.
And I loved this. And frankly, looking back, it was a good life.
I mean, that is kind of like a dream role
where you get like the best of both worlds, you know,
and you sort of get to straddle it
without any of the like undue burdens on either side.
Absolutely.
And I would have continued doing it.
It's just one of those, you know,
instances in your life where you need to choose.
And so, first of all,
I started TileDB as a research project.
I always wanted to have high quality in my work.
So in the sense that I didn't want
to just write one paper with TileDB.
I had written multiple papers in the past.
I wanted to write papers,
innovate, obviously,
but also do something that kind of lasts.
So I always had a product mentality
even during my research,
but not a company mentality
because I was working for Intel.
And the idea was, you know what? if this works well, maybe we create a database for the labs
to maintain and grow, like very similar to what CWI has done in the past, Berkeley,
and others, right? So that's the kind of mindset I had at the time. Let's build something that is beautiful,
do some technology transfer for Intel,
try to solve some very big problems,
and we take it from there.
So no company plans in the horizon.
And again, there was no specific problem
from a use case standpoint up until something came up in genomics.
What I wanted to do was something different in databases.
That's what I wanted to do.
Also, please note that when I was at Intel and MIT, I was working at the intersection of high-performance computing, so supercomputing, and databases.
And the supercomputing from Intel,
I was working with some true ninjas in high-performance computing,
you know, optimizing operations at the CPU cycle level.
Yeah.
You know, some hardcore stuff.
And the databases, of course.
Some of them, Stonebraker and CSAIL, right?
So what I wanted to do from a research perspective
was to kind of bridge those two very, very different domains
with very, very different crowds
that don't talk to each other very frequently.
Yeah.
And actually, that was one of the first times
that, and big kudos to Intel and MIT
because they partnered for that reason.
They partnered in order to, you know,
try to combine the knowledge of both domains.
And my group at Intel at the time was doing a lot,
and still does, I still keep in touch.
They do a lot of machine learning, AI, a lot of graph algorithms.
So a lot of linear algebra,
a lot of computational linear algebra,
which sits at the core of very, very advanced operations.
If you want to do something very advanced in your CPU or GPU,
in all likelihood you're using somewhere in the depths of your software stack
some kind of package, probably built by Intel or NVIDIA,
that fully optimizes computational linear algebra operations.
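To make that concrete, here is a minimal NumPy sketch (the sizes are illustrative): a single matrix product dispatches to whatever optimized BLAS backend, such as Intel MKL or OpenBLAS, NumPy was built against.

```python
import numpy as np

a = np.random.rand(1024, 1024)
b = np.random.rand(1024, 1024)

# No Python-level loops here: the product is handed off to an optimized
# BLAS library deep in the software stack, as described above.
c = a @ b

# Shows which BLAS/LAPACK implementation this NumPy build links against.
np.show_config()
```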
And the question I had in the beginning, and this is how it started,
was: wait a second, we take the data from storage,
be it in key values or documents or text files or tables or whatever,
and then we wrangle this data into matrices and vectors in main memory
in order to feed those into Intel's CPUs
or NVIDIA's GPUs.
Why aren't we storing the data in this form to begin with?
Well, that was the initial observation.
Like why are we storing the data on anything other
than the form that we compute on in the end?
So it started as a very, very naive observation.
Then this started to pile up, of course.
And then I said, wait a second,
if we do store the data as matrices,
an image is a matrix.
So now I have the means to store also an image natively
as a matrix, and actually it can be sliceable from the disk, not just a blob that I bring in its entirety. I can actually use arithmetic in order to slice portions of this very, very fast.
So now I'm starting to model images as well, in addition to tables.
But if I do images and tables, what if I can do also key values?
What if I can do also graphs representing
the adjacency matrices? And this is how the observations
started to pile up again and again, more and more.
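A minimal NumPy sketch of those observations (the shapes and values here are purely illustrative):

```python
import numpy as np

# An image is a matrix: slicing a window is plain array arithmetic,
# rather than fetching the whole blob and cropping afterwards.
image = np.random.randint(0, 256, size=(4096, 4096), dtype=np.uint8)
window = image[100:200, 300:400]  # a 100x100 region

# A graph fits the same model through its adjacency matrix
# (in practice a sparse one, since most cells are empty).
adjacency = np.zeros((5, 5), dtype=np.uint8)
adjacency[0, 3] = 1  # edge from vertex 0 to vertex 3
```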
And the hypothesis at the time in the lab
was: is the multi-dimensional array a universal
data structure?
And by universal, I mean both in terms of capturing all the different shapes of data,
but also not just generically, but in a performant way.
The hypothesis is twofold.
Can we structure them intuitively, like in an abstract model? But the second part is equally important: can we do it efficiently? Because if you sacrifice
performance, nobody's going to use it. And that's when I started, you know, pounding on
the storage engine, because it starts from storage. Is there an efficient format on disk or on a cloud object
store which can represent data of any form extremely efficiently? So that's what...
again, it started as a scientific hypothesis, and then I started thinking a little bit more
business-wise, and I was talking to a lot of people. At the time, I was working with the Broad Institute,
which was across the street.
They had a very specific population genomics problem,
which at the time had not been thought of as a matrix problem
or as an array problem.
Interesting.
When I modeled this problem as an array,
they experienced a massive performance boost.
And also it was intuitive. There were three aspects of
this dataset that we needed to index on, and this made it the perfect candidate for an array,
and more specifically a sparse array. There is a difference between dense arrays and sparse
arrays. We can talk about this later. But this is how it started. And the bottom line was that after talking to a lot of potential
prospects, we found that if indeed we build such a database system, which is universal,
if we can do it, then we can tackle two extremely important problems as the value proposition of this database system. The first is performance for very complex use cases, genomics,
LIDAR, any kind of point clouds, imaging, video, stuff that,
you know, relational databases are not very good at.
And number two,
we can consolidate data and code assets in a single infrastructure.
If we do that, we eliminate silos.
If you eliminate silos, first, you make people more productive, they collaborate more easily.
And second, they experience a lot of cost reduction. They don't need to buy a lot of
licenses. They don't need to wrangle data. They don't need to, you know, to take time in order to get to insights and so on and so forth.
So the business aspects came up later from observation.
Yeah.
I love it.
Okay.
Well, I could keep asking questions, because you hear about a lot of entrepreneurs trying to find pain in the market to respond to in order to start a company.
And it's so enjoyable to hear about your curiosity leading you to conclusions that ultimately solve some pretty specific problems.
So that's just so fun to hear about your curiosity sort of being a guide from that standpoint.
Okay, let's dig into the technical side.
Costas, please take the mic and help me learn about TileDB and all of the specifics. Thank you, Eric. Thank you.
So Stavros, let's start from the basics, okay? Let's start with database systems. And I'd like to
hear from you before we get into the specifics of TileDB, what a database system is, right? Like, when we
think about something like Postgres
or Snowflake,
like, it doesn't matter. BigQuery.
Like, at the end, like, all these systems,
they have, like, some common
patterns there, right? So I think it would be
great, like, to start from that.
Like, what are these components, the most universally
found ones? And then we'll get
into more details about TileDB.
But I think that's going to be super helpful also for Eric.
So let's do that.
Oh, this is an amazing question, Costas,
because it gives me a segue to so many different things
I want to talk about.
So in order to answer this question,
making no assumptions about the audience and making it as simple as possible:
Let's talk about what you would do if there was no database system in the market.
And you're going to be surprised, especially in the verticals we're working with.
Life Sciences, Geospatial, those folks are not
using databases at all.
So let's talk about what those folks are doing.
Okay.
All right?
In the absence of databases, the first thing that they're doing is that they're coming
up with very obscure formats to store and serialize the data as bytes into files.
Every domain in the absence of a database system is coming up with their own formats
and their own parsers.
You need also some kind of a command line tool to understand the format in the file.
Otherwise, you're not going to be able to open the file and read it.
Yep. So in the absence of a database system, you as a user are responsible for
serializing the data somehow, saying I'm going to put this field first and then the next one,
and then explain to the parser or write the parser so that you're able to parse this format.
And what they usually do is that they're forming consortia
to define the specifications of those formats
so that everybody agrees on them.
And of course, changing the format because the technology changes
may take years.
And this is very problematic because the technology advances
much faster than that, much faster than the specification. So the specification stays behind. You have no flexibility in changing this because
otherwise the downstream analysis tools which rely on this will never be able to work.
And therefore you're stuck with files. Now on the cloud and with the advent of cloud,
this becomes even worse. You need to store thousands or millions of files on some kind of storage bucket,
index them with metadata using, well, maybe a database.
It's just for the index.
This file is that.
It corresponds to that person.
That person has access.
And speaking about access, you need to define your own solution about granting access to the data.
So one part is the storage.
The other part is access.
Who has access to the data?
And with a distributed file system, you're going to call your IT person and say, hey, give access to this username so that they have access to that folder.
And on AWS, it's very similar: give me an IAM role
so that I have access to a prefix in a bucket.
And of course, this can create a revocation hell
because we need to keep track of all the keys
and all the roles.
Effectively, you're creating a lot of work for yourself,
one to store the data, the other to manage the data.
And then it comes to the analysis. And there are so many different tools that get created that reinvent the wheel.
The reason is that all you have from a domain is a bunch of files. So if you want to do some
kind of analysis on those files, you need to
build a system, you need to build some kind of a program to do that. And most of the programs have
some common components. They may have to run some statistical analysis like principal component
analysis or linear regression or something like that. So every single tool implements those or
links to a particular library that implements those.
And again, a lot of reinvention of the wheel because those tools share massive components
on the analysis front.
So this happens today.
Imagine if we did not have databases,
we would do the same thing for CSV files.
We would store the data in CSV files or Parquet files,
that's fine.
And then every single tool
would try to reinvent a SQL
operator, like a WHERE clause
or a filter, right?
Like projections, selections, and joins.
And every single tool would have to implement
a different one.
So what does a database
system do? A database system
abstracts all of that.
It stores the data in some way.
Yes, sure, there are some common formats like Parquet,
but Oracle doesn't expose the actual storage format to you.
In the past, nobody cared about the storage format.
The database system cares about the format,
and it evolves it the way it wants to evolve it, so that
it, you know, becomes faster and faster, without asking anybody's opinion about the format. And then
they have, again, the storage layer, a way to parse the data. They have an execution layer with
operators, right, implementing so many different, so diverse functionalities and computations.
It has a language layer,
some kind of APIs in the database domain.
The most common API is SQL.
SQL, it's an API.
And of course, a parser to parse this
and translate it into, you know,
something that is actionable.
And all those layers, all of those, everything that I have explained,
appears in every single database management system that exists in the world.
Be it tables for transactions, tables for analytics, key value, graph, images, and so on and so forth.
And that was exactly my observation.
But all of those layers are common.
The only thing that systems differ on
is the serialization of the bytes on the disk.
And then some APIs,
different systems use different languages
and different APIs.
And those are set.
We don't want to innovate there.
Those problems have been solved.
And then the query engine is not as different as you think.
Like you decompose a query into some kind of an operator tree or an operator graph.
And then you dispatch the execution.
That was my observation.
And I said, okay, if that's the case, if I choose the format wisely,
and I abstract the operators wisely so that it's not just a WHERE clause or a filter, right?
A filter, a join.
And I expand it a little bit so that it is also a matrix product, an inner product, and so on and so forth,
then maybe I can create something more general than just a SQL engine.
So this is what a database system is.
And that was the original observation that all the database systems in the world share those components.
Absolutely. That was a great, great...
Oh, I'm sorry. I did forget. I did forget a very important one.
The access control layer, because I spoke about storage.
This is super important.
I cannot leave it out.
Database system is responsible for enforcing authentication,
access control, and logging.
You don't want to reinvent those.
They have to be part of the database system.
Absolutely, 100%. And I think we could probably have
at least one episode for each one of these components
to talk about. And there are some
super fascinating things that I think
even people who have built, let's say, databases
forget. I always find it amazing that if you think about SQL, right?
SQL is a language that you use to describe what you want your data to look
like.
And then the database system takes this template of what the data should look like and actually generates
code, in a way, to go and create this data. So I'm abstracting a lot here, because there are many
different ways that you can generate this code. But if you think about it, it's a very interesting
and very different approach compared to going and writing code
to do the same thing on your own, right?
It's very, very different.
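A minimal sketch of that difference, using Python's built-in sqlite3 with a toy table (purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 25.0), ("alice", 5.0)])

# Declarative: describe what the result should look like; the engine
# decides how to scan, group, and aggregate.
totals_sql = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
).fetchall()

# Imperative equivalent: every decision the engine would have made
# (iteration, hashing, accumulation) is now hand-written code.
totals_manual = {}
for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
    totals_manual[customer] = totals_manual.get(customer, 0.0) + amount
```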
Well, think about it like this.
In order to see the absurdity of trying to build
a database system on your own,
other than making a company,
if you're making a company, it's fine
because you're going to try to raise capital
and then build an amazing team
and then do it for a living. That's fine. But
it's absurd to try to do it in a hospital or in a genomics institution or in a geospatial company.
But this is what is happening. I'm just telling you, absolutely, this is what they're doing.
It's absurd for the following reason. Just go to SIGMOD or VLDB. These are the conferences
I used to hang out at, right?
You're going to find professors, big professors, working for life on a sub-component of the components I mentioned, generating hundreds of papers with a lot of innovation, so that you can
understand how difficult it is to build such a database system.
A hundred percent.
So, okay, based on this architecture that you described
and the components of a database system, where is TileDB?
What parts of this architecture does it fit into?
Absolutely.
So again, there are many things we need to touch upon here.
And I'm going to explain the TileDB evolution
in order to see what we have built
and what we're building down the road
so that you see where we are.
But things become even more complicated
than what I'm explaining right now
because I focus on the
components of a database system. However, you know, this kind of database system needs to evolve
because it captures just a very small piece of the puzzle in the data infrastructure of an
organization. Think about it like this. In the past, you used to have mainly tables in most organizations, and you used to buy Oracle, IBM, or Microsoft, and that was it.
That was your data infrastructure. It was just a single colossal database system.
There were no data engineers or data scientists. There were DBAs, database administrators. And that was it.
You were set. You would pay a lot of money every year, but you would be set.
Today, the data infrastructure consists of hundreds of pieces of the puzzle.
You have AI also. Well, AI deals with data as well. It's just a different computation. It's not SQL.
But again, in the context of universality that we're talking about here, it's yet some
other operators and different formats and different data. So now you also have dashboards.
So you have more, more than just a SQL console. You have Jupyter notebooks where you're doing your data science.
You're having pipelines for curation of the data, so more advanced ETL than before.
You have transformations. You have so many different things. And at least in my mind,
a lot of those components need to be built into the database system.
Again, this is equally radical
to the ideas I had in the past
about universality
and having a single database for everything.
But the database system needs to evolve.
Otherwise, you're creating way too many silos.
One silo for the dashboards,
one silo for the ETL,
one silo for AI.
And this also calls for building unnecessarily large data engineering teams.
Data engineers are extremely important, but you're inflating your data engineering teams.
And what happens is that, again, you end up reinventing the database system.
But now it's not on tables.
Now it's on your database system and on your ETL pipelines and on your AI and on your dashboards.
Yeah.
So, again, you're ending up reinventing the database pretty much, because again you need access control, again you need authentication,
again you need a catalog to see what's
happening across your organization,
again you need wrangling,
again ETL. But these are problems
we've solved in databases. So anyway, just a
small note before I tell you where we're
going, because you're going to see some of those components in
TileDB today.
TileDB has evolved even
within its own evolution.
It has evolved even more to capture this kind of aspect. So let me tell you how it started
and how it evolved. Because in my opinion, that's the only sane way to build something
like TileDB, which is colossal, especially the way we started. Nobody gave us a blank
check with tens of millions of dollars to build it.
We built it very organically,
starting with pretty much a single person at the time,
myself, and then, you know, raising capital,
incrementally proving this crazy vision,
attracting more capital,
attracting more great talent to build this, right?
Because I'm not building this.
It's the team that is building this.
So the very first sane decision that we took,
I'm telling you about the decisions
we have not regretted, okay?
Yeah.
I'm probably going to hide any decisions
that we regretted,
although I'm forgetful when it comes to those.
After I learn my lesson, I move forward.
So I move on.
So the only way to start this was with storage.
Yeah.
So the first couple of years at MIT were spent experimenting,
and then the first 18 months as a company,
we focused exclusively on the storage
engine. We built what we believe is perhaps the best format in the space. We don't talk too much
about it, and I have my reasons: I don't want to promote a format, I want to promote an
engine. It's the engine that matters, not just the format. I don't want to create yet another format consortium
and evangelize it. I want to tell you: here's the library. Forget about the format. The format is
always going to be flawless. And here's the engine. Here's the API. We're going to have stable APIs.
So just use the APIs. So we focused on the storage. We built something extremely powerful that has
features that are necessary across domains, like a cloud-native data layout on object stores with file immutability, so that we don't have too much copying of files on object stores.
Versioning, time traveling, amazing indexes, ACID guarantees for the insertions and the deletes and everything that we do.
We have very specific guarantees.
This took a long time to get it right.
But we did get it right.
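As a rough illustration of the versioning and time-travel features mentioned here, a hedged sketch using the tiledb-py package (the URI and timestamp are hypothetical, and the exact signature may vary across versions):

```python
import tiledb

# Writes to a TileDB array land in immutable fragments, so earlier states
# stay readable. Opening at a timestamp (milliseconds since the Unix
# epoch) reads the array as of that moment.
with tiledb.open("s3://my-bucket/my_array", timestamp=1650000000000) as A:
    snapshot = A[0:10, 0:10]  # the slice as it looked at that time
```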
And this gave us amazing performance for very difficult use cases, again,
like genomics, imaging, and others.
Because once you get those right, then the tabular use cases become easier.
Tables are very neat.
They're easier to capture once you start operating at a petabyte
scale. And once you get the indexes right, you can optimize at the IO request level and the CPU
cycle level. Then the rest becomes much easier. So, the storage engine. And we started with five customers
from the storage aspect alone, like on the open source.
There were customers then that are customers until today.
You know, they trusted us when we were like four people.
But, you know, they really enjoyed the storage engine because it solved a lot of the genomics problems, for example.
They could see that, yes, that library alone gives us value.
As we were proving this out
and we were getting some customers
and as we attracted more capital,
we were more confident to start building
the other layers on top.
The next thing that we did
was to build an inordinate number
of APIs and integrations.
So right now we have eight APIs,
all fully maintained, fully optimized by us.
And we started offering, for example,
SQL queries through, well, Presto,
through MariaDB, through Spark,
so that we have a quick win.
So we plugged into those systems and we said,
here, if you want to do some SQL queries at scale,
this is what you should be using.
And that was an easier path than creating all those layers I explained, right?
SQL parser, query rewriter, optimizer, executor, all of those aspects of a SQL engine.
So that was next.
And then we started realizing that one of the most important things is access control, logging, the governance aspect.
And also, if we wanted to build our own execution engine as we are doing today, we need to start with fundamentals.
So we built, a long time ago, our own serverless engine. Since day one, our engine has been serverless.
And we were building it out because we knew that at the end of the day, no matter what query you had,
it's going to be decomposed into either an operator tree
or a task graph.
And that task graph needs to be deployed
in a distributed setting where each task is dispatched
in a different worker. Those workers need to, you know, elastically scale, and they need to
have retries. We need to log everything, we need to monitor everything. So we built a
primitive. We didn't build a SQL operator, we built the primitive. And that helped us solve a
lot of problems, again, in genomics and imaging and AI and so on and so forth.
Again, slowly and gradually.
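TileDB's actual serverless engine isn't shown in the episode, but the shape of the idea (a query decomposed into a task graph whose tasks are dispatched to workers, with retries and monitoring) can be sketched generically with Python's standard library:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_with_retry(fn, *args, retries=3):
    # Each dispatched task is retried on failure, as described above.
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == retries - 1:
                raise

def partial_sum(chunk):
    # A leaf task of the graph: a partial aggregation over one chunk.
    return sum(chunk)

chunks = [range(0, 1000), range(1000, 2000), range(2000, 3000)]
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_with_retry, partial_sum, c) for c in chunks]
    # The fan-in task: combine the partial results into the final answer.
    total = sum(f.result() for f in as_completed(futures))
print(total)  # 4498500
```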
And then we started building
dashboards and building Jupyter environments,
you know, more one-liners
around, you know,
queries that become task graphs a little bit
more automatically, better
ETL processes.
All of those were built on the primitives.
So the architecture was built in a very sane way since the get-go, and that's why we have
very little technical debt in that respect.
We reuse everything.
We don't refactor what we build already.
And nowadays, we're pushing more and more compute down to TileDB because, first of all, we want it to be self-contained.
Second, the
compute moves closer to the data
so we minimize the copies.
Everything is zero-copy
in TileDB.
Again,
we optimize for the L1
cache. There are so
many optimizations we can do
if we manage the data from the
time it comes into main memory to producing
an output. It's much easier
to control performance and to optimize for performance.
So this is where TileDB is. There is a lot of work still to be done around
optimized distributed computing, more primitive pushdowns.
But still, I mean, TileDB today, you can use it and get immense value
for very challenging problems and use SQL, do distributed computations,
use dashboards, use Jupyter notebooks, and most importantly,
federate your data.
Like this is one of the killer features in TileDB.
Okay, so from what I hear, TileDB started as a storage engine,
as you said, solving first of all the problem of...
the fundamentals of how we are going to write the bytes on the storage. And on top of that, you built the rest of the
DBMS, right? All the rest of the stuff that is needed to actually execute queries, give access
control, and all these things. So, let's focus a bit on the first step, which is the storage engine.
So one of the most, let's say, well-known storage engines out there is RocksDB, right?
It is a key-value store, and the mental model of how you interact with the data is super
straightforward. You have a key and a value. The API is quite primitive. With a system
like RocksDB, I request a key to get a value, or I give a key and a value and get it stored.
So if I compare it with TileDB as a storage engine, what's the difference there?
How do I access the data in TileDB?
And what is the API difference?
Share a little bit more about the experience that I should expect if I wanted to work with TileDB today,
compared to something that I already know how it looks.
Oh yeah, it's kind of day and night really, and I'm going to elaborate.
So okay, RocksDB is a key value store.
You can model, again, a lot of stuff as key value pairs. And this is very good for lookups.
I'm looking up for an equality query.
It's an equality query, right?
And I can get back the blob.
That's too simple.
Like this is the use cases that we have are not like this.
I'm going to explain a bit.
I'm going to expand a little bit the baselines.
For example, Parquet and all the variations
on top of Parquet, like Delta, like Iceberg and others. Parquet is a specification. And of course,
through Arrow, for example, you can have the engine. And this is where some people may be
confused. Parquet is a specification. It's a format and you may have multiple different
implementations of it. Arrow is one of them, and actually maybe the best one.
So let's focus on Arrow.
Arrow implements the Parquet format, and others implement the Parquet format.
I think Presto as well.
So Parquet is for tables.
RocksDB is for key values.
And those are the most prevalent ones. TileDB is a multi-dimensional
array engine. And this is a completely different level. First of all, it's more sophisticated.
It requires much more work and thought to be put into this. But in order to understand why, okay, why arrays? I'm going to tell you the following.
Think of an array as a shapeshifter.
Depending on your problem, it's going to shapeshift into a two-dimensional array,
into a three-dimensional array, into a dense array where every cell has a value,
or into a sparse array where the majority of the cells do not have a value,
and they should not be stored, because the space itself might be infinite.
So different semantics there.
And think of the dimensions, the axis of the array, right?
The dimensions.
It's an index.
It's like an index, a very, very powerful index.
And that index allows you to do two things.
The first is, again, in a shapeshifting manner,
so you can really, really tune to your applications.
You can lay out the bytes in the file
in a way that benefits your queries a lot.
It's important.
So performance is dictated by the proximity
of the result bytes to each other in the file.
If your result appears contiguous in the byte space of the file, you can fetch it with a few large requests. If it is scattered, you may end up in the worst case doing one million requests.
And the latency of each request is going to kill you.
It's going to kill the performance.
The arrays, very naively speaking, allow you to retain the proximity of the bytes on the disk with respect to your query
workloads. And in most use cases, you know your workloads. Not 100%, but you know that, you know,
my query is spherical, or my query is elongated along one axis. Like, trust me, in the use cases
we're tackling, you know your queries, more or less.
I'm not saying that you're going to hard code it.
I'm just saying that you know the patterns,
not the actual queries.
So that's one of the things that the arrays do
that you cannot do that with the key value stores.
Because the key value stores are hashing.
They're hashing the values.
So if you're asking for a continuous range
of values, those are not going to appear continuous in a key value store. They're
going to be hashed to random places. It's going to appear continuous. That's the thing.
So you can retain the spatial locality of the multi-dimensional space in the single
dimensional space. The same is more or less true for parquet. I mean, again, you can hack it a little bit,
you can partition, you can change the order, you can do
some hacky things, but you have to hack it. You need to hack
nothing like it's, it's, it's infused into the array model,
you don't need to think about partitioning. You don't need to think about the orders. TallyB does that for you.
So that's the first, the layout of the bytes.
The second is the indexing in the dense case. And this is
very different from Parquet. For dense arrays, you don't need an extra index.
Everything is inferred. The positions of the bytes on the disk, they're
inferred with very simple arithmetic
and instead of doing conditions hey does this cell satisfy my query does that sell
satisfy my query you know a priori the byte ranges that satisfy your query so effectively
you minimize the IO and you minimize the mem copy requests in in main memory. You copy the data from your temporary buffers
into your result buffers.
So massive boost in terms of performance.
So this is what arrays do,
and that's why they're super, super powerful.
You just need to reason in terms of arrays,
not in terms of tables, not in terms of key values.
All right.
So to summarize, key values, we are storing a key and a value, right?
Lookups, issues with locality there if we want ranges and all these things.
We have columnar storage.
We have the whole column.
Let's say we can hack it, as you said, sort, partition the data.
If we want to get the whole column, yeah, there's going to be
some locality there. But you do other trade-offs, right? How do you do point queries there?
Inserts and stuff like that. That gets too complicated. And then we have arrays, right?
And I get what you're saying about how in this structure you can incorporate also indexing in
there and you can infer the indexing and all these things. My question is, let's take it from an API,
purely an API perspective. I am a developer right now. When I have a key value store, I know that
I'm going to make a request. I will ask for a key and I'll get back a value. In a columnar store, I'll ask for a column and I will start iterating on the column
values, one after the other. So when I'm dealing with arrays, and as you said, there are
different types of arrays that you can have there, not one-dimensional or only two-dimensional:
what am I interacting with as a developer?
What does the API look like when I'm using TileDB in my application?
Yeah, this is a good question.
Let's make, again, a differentiation between the technology and the API.
The technology is what I explained.
By the way, TileDB is also columnar. You need to
think of TileDB as
Parquet on steroids. That's how you
should be thinking about it. It's very, very similar
in some respects, but
it introduces stuff that you need to
hack with Parquet to get.
The partitioning, for example:
you need to hack Parquet to
have the partitioning. You need to hack Parquet
to have versioning.
You don't need to do that with TileDB.
It's embedded into the format itself.
You don't have to think.
That's why I'm saying you should think of it as a generalization of Parquet.
Now, the API is as follows.
First of all, you can just ask SQL queries on TileDB.
Imagine that an array has dimensions. Think of a dimension
as a column, and think of
the attributes as other
columns. It's as simple as that.
Think of
the array as a
data frame that is indexed,
and the dimensions are special
columns. That's what you should be thinking
about. And in this case,
the dimensions are non-materialized columns. They're virtual.
But any SQL query is valid. Think of them as different columns, differently compressed,
very similar to what you do with Parquet. Now, SQL is one of the APIs. In Python and in R, we have more Pythonic and more R-like APIs as well.
For example, you can use NumPy-like APIs for TileDB.
So you use the bracket operator and you slice.
You're slicing.
That's what you're doing.
It's just that TileDB also supports real dimensions,
not only integers, and string dimensions. So in your bracket operators,
you can also slice string ranges. And you can slice real ranges. But the API is like interfacing
with a NumPy array. And on top of this, we built also a Pandas-like API. So you can have a data frame operator,
which is very, very similar to Pandas.
And on top of everything,
you can add conditions on the non-dimensions.
And there are a lot of tricks we do.
We push those down.
So this is very similar to Arrow.
You create a query expression and you push it down to TileDB.
Very similar to what you would do, again,
with Arrow or pandas or very similar
data frame-like
libraries.
So it's easier
than you might think. You don't need to think about
the arrays other than
when you model your data so that
you get the proper performance. Once you do
this, and in a lot of applications
we have ingesters that do that for
you, you don't even need to think about this.
From that point onwards, you use what is familiar to you.
The Pythonic API, the R API, SQL, and so on and so forth.
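A hedged sketch of what that looks like in tiledb-py, continuing the illustrative dense array from the earlier sketch (the condition-pushdown syntax has varied across tiledb-py versions, so treat the exact calls as assumptions):

```python
import numpy as np
import tiledb

uri = "example_dense"

# Writing: the bracket operator works like assigning into a NumPy array.
with tiledb.open(uri, mode="w") as A:
    A[:] = np.random.rand(1000, 1000)

with tiledb.open(uri) as A:
    # NumPy-like slicing: returns the attribute values for the region.
    block = A[0:100, 0:100]["value"]

    # Pandas-like access: the same slice as a data frame, with the
    # dimensions appearing as (virtual) columns.
    df = A.df[0:100, 0:100]

    # A condition on an attribute, pushed down into the engine.
    hot = A.query(cond="value > 0.99").df[0:100, 0:100]
```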
That's great.
That's great.
Yeah, the reason I'm insisting on that is because I want...
People are familiar with certain things.
And when you introduce something new, I think if you can create
some kind of parallel
with what people already know,
it's really helpful for them to understand
what they are dealing with. And at the end,
we are talking about
products or technologies, or whatever we want to
call them here, that are primarily
consumed by technical people
and engineers. And engineers want
to understand.
Knowledge is an important part. Extracting knowledge from the process of using something is an important part of the job, right? And it's why many people, at least including me, like doing that. So that's what I was trying to get out of this conversation. I know we spent a lot of time on, let's say, the low-level stuff.
I want to ask something.
You kept mentioning from the beginning genomics as a very important like
use case uh and i would like to hear from you what makes genomics such a unique
use case when it comes like to working with data um so can you elaborate a little bit more on that
yeah absolutely and uh for genomics,
there are effectively
two different kind of sub-verticals.
There is the population genomics,
so DNA, think about it like this.
And then there is single cell,
which is mostly RNA.
I'm oversimplifying, okay?
But there are two different areas,
quite dissimilar, I would say,
on the surface.
From a technology standpoint, it's identical.
It's all arrays.
So I'm going to explain about this.
And although I'm going to talk about genomics,
the same ideas apply in geospatial,
because geospatial is yet another big vertical for us.
And then, please, just for the record, let's not forget that we do a lot of tabular use cases, a lot of time series. Those are a little bit easier for us. The most difficult
ones are those that come in a very specific scientific vertical like, you know, the life
sciences or geospatial. Those are, you know, more scientific
use cases than the typical business analytics stuff that you would see with tables.
So here's what makes them very appealing to us. And then geospatial is very similar. So I'm going
to focus on genomics. So the very first reason why I started working on this was because
those are meaningful use cases. We help hospitals save babies' lives. It's as simple as that. It
creates a lot of purpose in our company. We're solving a super difficult problem for good.
I know it might be a bit cliche, but it is the truth.
The reason why it is appealing to us
and why other databases was extremely difficult
to break into this space
is because these spaces seem very convoluted
to database folks like me.
You really need to invest the time to understand the science. If you don't understand what those scientists you're dealing with say,
there is absolutely no way that you're going to solve their data problems. It's impossible. They
have a lot of jargon. You need to understand this jargon. And then you need to dig into those very convoluted formats.
The formats are crazy.
They look crazy.
They're not crazy.
They look crazy.
Because they come in text.
Again, there's a lot of jargon, multiple fields, seemingly variable-length fields.
The metadata is crazy.
So you really need to love the space
and you need to hire people that are experts
which is exactly what I did
I go very deep into those
domains, but I'm not a
bioinformatician. I'm not going to gather all this knowledge
within the past couple
of years.
I will bring the actual experts
that understand this deeply.
So TileDB is an amazing fit for those because those use cases are not purely tabular.
There are always tables, always.
This is very appealing for a system like ours
because we can definitely handle tables extremely, extremely well,
but then we can handle matrices equally well, right?
Or even better.
And those applications come with a lot of matrices,
either dense or sparse, a lot, and big.
And if you don't have native array functionality,
you really need to hack your way with a relational database system.
Someone may claim, yeah, I can do it.
You might.
You will never get to the performance we're getting.
It's very...
I'm not going to show you theoretically
why this is impossible,
but even if you do it,
it's not going to be worth it.
You're going out of your way to do it.
For us, it's an 80-50.
For others, you need to get out of your way to do this.
So that's why we started with those.
Again, by no means are we focusing exclusively on those,
but they're very good verticals.
We're working with super smart people,
which we really, really love,
and we work very, very closely with those.
We are solving something that has not been solved before.
So we're making our mark in those spaces.
And from a business perspective, they're lucrative.
We're working with very big pharmas and hospitals.
You know, this is a very, very good space.
So it can give us the growth we need in order to accelerate
and then expand to the other verticals, which we know we can get.
The other verticals are easier. We started with the most challenging one.
This is great. I mean, we need to have at least one more episode, because, no question, we just
scratched the surface here. Well, actually, Brooks isn't here and I'm the one recording.
So I'm tempted to go long,
but we don't want to get punished by Brooks too much because he drives a pretty tight ship.
Okay, so we're actually a little bit over time,
which I love because Brooks isn't here.
But I'd love to conclude
with just a really practical anecdote.
So you mentioned earlier, and actually I think this
was when we were, like, discussing the show before we started recording, but, like,
babies in the ICU, you know, that that's, like, an actual sort of use case. Can you sort of bring...
you know, we've talked in such technical depth, which I love, but can you bring it home
for us and talk about what's happening in people's lives? Um, you know, people who have babies in
the ICU, and how TileDB... I mean, I don't want to get overly sentimental, but that's a big deal.
I'd just love to hear, like, do you have a story about, you know, what this looks like on the ground for the people who are sort of the ultimate end consumer of the data?
Yeah, absolutely. The people who are actually saving babies' lives are the pioneer doctors we're talking to.
And they're super pioneers,
and especially Rady Children's
and Dr. Kingsmore and his team,
and a lot of other partners that he's accumulating.
And they're the absolute pioneers
because, of course, they know the science, once again,
but they were so,
so perceptive to understand that their science is blocked by data management. In their case,
the science is clearly blocked by data management. The data is too big.
The idea, and again, I'm going to oversimplify, if somebody from the genetics side is listening in,
I apologize beforehand.
I do that on purpose.
I just don't want to get into all the jargon.
But the idea is this.
It's quite critical sometimes, not sometimes, always,
to genetically test a baby, a newborn, when they're born,
to find specific genetic diseases
that pretty much can destroy their lives if they go untreated early on.
There are specific genetic diseases that are treatable, but you need to take prompt action. In order to be able to treat those diseases,
the very first thing that you need to do
is identify that there is risk for such a disease.
And you do that through DNA sequencing.
Now, in order to be able to identify
whether a baby is prone to or will have this genetic disease,
you need to find the corresponding mutations
in the baby's DNA sequence. But here's where data management comes in. Someone
may say, okay, just sequence the baby, find those locations, and say, okay, this so-called variant, this mutation, is going to lead to something very, very critical for the baby.
That's fine if you know that that variant is responsible for the genetic disease.
But how do you know that this variant is responsible for the genetic disease?
You need to have a very big sample, right?
Yeah, because it's not binary. I mean, it's statistics, right?
It's statistics. It's not binary. Like this chromosome repository.
Exactly. You have a repository, you have a database table which says this particular mutation is pathogenic, it will lead to something bad. But how did you derive that fact? How did you create this factual table that says yes, this disease is going to happen because of this variant? It's because you have the data from a million other babies.
Yeah, yeah.
And the data from these million other babies is huge. Huge. So you need a database system to be able to do analysis at that scale, in order to be able to always keep this factual table up to date, that this variant is going to lead to something, so that you take decisions at the ICU.
Yep.
That's how TileDB contributes to this space. Once again, all the credit to those pioneers,
because they are. All of this technology is new, and this is truly the first time that
genomics plays a big role in clinical use. Up until now, it has been used mostly in research,
but now we're talking about clinical use. And that's why I really respect the people
that we're working with.
Amazing.
Absolutely amazing.
Truly inspiring.
Boy, I'd love to meet some of those doctors.
Maybe we can have one of them on the show.
That would be kind of fun.
But thank you so much for giving us your time.
This has been such a wonderful journey of,
you know, understanding academia, understanding entrepreneurship, understanding, you know,
the deep guts of databases, and then ultimately understanding how the ultimate manifestation of
this can, can truly change lives, which, you know, is, is pretty
incredible. Um, so Stavros, thank you. It's been incredible. And, uh, we'd love to have you back
to continue the conversation. Absolutely. Thank you so much for having me. Anytime.
What a fascinating guy. I think my big takeaway, Costas, is you don't often hear about, um, you know, ideas that arise, you know, sort of from
pure curiosity. Call it... maybe that's an overstatement, because Stavros is obviously
working on, you know, real problems, but he also had a genuine curiosity to understand the relationship between, you know, storage and how that impacts the
functionality of all these other components, you know, of a database system, right? And the way
that you query it and all the things that you can do with it. And I just really loved hearing about
his curiosity, leading him to some interesting questions that ultimately
led to interesting discoveries. You know, because a lot of times the classic entrepreneur story is,
you know, I was sick of late fees at, you know, Blockbuster. And so I started a mail order DVD
company, right? And then it became Netflix, right? And you're responding to some sort of pain that
you or someone experiences.
And so I love that, you know, his idea grew out of curiosity.
Yeah, absolutely. I think this is something that is commonly found in people that have done a
career in research in general. Okay, to be honest, to go through graduate studies and PhDs and
postdocs and all that stuff, you have to be a curious person. Actually, curiosity has
to be important enough for you so that you can, you know, keep grinding through the academia way of doing things.
And I mean, you can tell that also from the energy that the guy has, right?
He can get passionate, right?
So I think that comes together with, and also explains, his career even before that.
I mean, okay, he's definitely Greek. I think we can
say that, right? It was really fun for me to have this conversation with him.
And for me, it's also very interesting to see how TileDB is going to mature and progress as a product.
There's a lot of things that the team is building on top of TileDB as the storage engine, as most people might know about it.
So I'm looking forward to seeing what the future is going to look like for them.
And I have a feeling we will have him on the show again in a couple of months.
And he will have news to share with us.
So I'm looking forward to that.
I agree.
Well, thanks for tuning in.
Subscribe if you haven't.
And we will catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd
also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com. That's
E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.