The Data Stack Show - 123: What Is a Universal Database? Featuring Stavros Papadopoulos of TileDB, Inc.

Episode Date: January 25, 2023

Highlights from this week's conversation include:

- Stavros' journey into data and founding TileDB (3:12)
- What problem was TileDB going to solve? (12:05)
- Defining database systems (21:35)
- What part of database architecture is TileDB? (31:58)
- Storage engine solutions (42:37)
- What does the API look like in using TileDB? (50:40)
- What makes genomics unique in working with data (55:28)
- Final thoughts and takeaways (1:06:46)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. Costas, today we're going to talk with Stavros from TileDB. He's Greek, so I know you're going to have a great conversation with him. And he created some really interesting technology. They call it a universal database, and I'm really interested to
Starting point is 00:00:41 know what that means. So I'm going to ask about, you know, what is TileDB? What is a universal database? My guess is that the technology is a little bit more specific and opinionated. But also he spent a lot of time in academia, which is really interesting, something we've talked a little bit about on the show before. And so I really want to hear the story about how TileDB came about in his work at MIT. So yeah, those are my questions. How about you? Yeah, first of all, I want to see how you're going to handle two Greeks at the same time, so let's see what will happen. With the Socratic method.
Starting point is 00:01:27 Oh yeah, keep asking questions. We will never finish the recording. So that's one thing. The other thing is, it's a very interesting opportunity that we have here because TileDB actually has started by building one of the most, let's say, low-level and core parts of a database system, which is the storage engine. The question initially comes from that of how we can store the data on the very low level in a much more efficient way that gives us out to the ergonomics of dealing with the data. So it's a great opportunity to focus on that and learn more about one aspect of the database systems that we don't usually have the opportunity to do because it's more, let's say, many times
Starting point is 00:02:32 we take it for granted, I think, like, even like database systems. So I think it would be very interesting to hear from him about, like, how they built that. And also, like, what's the story behind it? Why they started from that and why they open-source it
Starting point is 00:02:49 and all the things around like TileDB, not just as a technology, but also as a company. Well, let's dig in and talk with Travis. Stavros, welcome to the Data Stack Show. I am so excited to talk about so many things, the primary one of which is all databases, everything databases, and especially TileDB. So thanks for giving us some time. Thank you for having me. I'm very happy to be here.
Starting point is 00:03:19 Okay, so give us your background, because you have a long history in technical disciplines, but not necessarily in databases specifically. So where did you start? And then what was the path that led you to founding TileDB? So TileDB is the first company I'm creating. I'm a technical CEO, so I have a PhD in computer science. I did my PhD in Hong Kong, where I spent several years. Then I became a professor.
Starting point is 00:03:55 So I have a very deep academic background. It has always been data, not database systems, but data, a lot of algorithms, data structures. Then I did a lot of work on security and cryptography, but always on data. And in 2014, I got an amazing job at Interlabs and MIT. And that's when I moved to Boston. And that's when I started actually working on TAL-EP, but effectively on database systems.
Starting point is 00:04:26 That's where I got more experience and I dove deeper into the database systems world. Super interesting. Can I ask what your thesis was on in your doctorate? It goes on query result integrity, actually, authentication and integrity of results. Really? So I was creating a variety of data structures infused with some cryptographic primitives so that we can certify that the results returned for a query are indeed the correct results produced by the owner of the data. There are several different nice techniques for that.
Starting point is 00:05:07 So I build several structures for geospatial data, and this is where the geospatial angle of LBB comes from. So yeah, very much into data structures and cryptographic primitives. Yeah. Okay, I have another question for you. If you, if you will entertain me because I'm, as Costas knows, I love thinking about the interaction of sort of academics and sort of backgrounds and then how that influences sort of building, you PhD in computer science. Could you outline what you think some of like maybe the top like advantages and disadvantages, if there are any of an academic background are, you know, because like commercializing technology out of academia is hard, right? Like that's difficult. But yeah, I just love to know like your experience because you
Starting point is 00:06:04 have that so viscerally as someone with a deep academic background and now, you know, who started your first company. Yeah, that's a great question. So the business world is 100 percent different from the academic world. Small piece of advice to any academically oriented person starting in business, start studying. Now, as I did, I studied a lot of studies around yourselves with amazing mentors from the business side of things.
Starting point is 00:06:37 It's a different world. It requires another PhD. So your computer science PhD is not going to help you there at all. Start from scratch. Where it helps me a lot until today is, first of all, understanding what product to build, right? I'm very, very heavily involved in the decisions around the design, what features to build, certain algorithmic aspects. I have an amazing team, but I'm very heavily involved in a lot of core components of TileDB.
Starting point is 00:07:12 I wrote the original code base of the storage engine, so I understand the code. I do code reviews sometimes, believe it or not, for specific features that I'm very interested in. The biggest advantage is that when I get one, defining the direction of the product,
Starting point is 00:07:32 of course, and having the vision, which for a database company, you need to understand the technology. You need to understand all the surrounding technologies in order to differentiate, innovate, and so on and so forth. So that's extremely important as a founder and CEO of a company. But the other advantage is when I get on a call with the customers
Starting point is 00:07:51 because the customer can tell in the first 30 seconds that I know what the heck I'm talking about. Yeah, yeah, yeah. And I know how to solve the damn problem. Yeah, yeah. Very quickly. Yeah, yeah. Very quickly. Yeah, yeah. They understand.
Starting point is 00:08:07 And it's all sincere because I do get down into the details. For example, it's a new problem, which is not exactly a replica of the problems we have solved in the past. And I offer solutions and I brainstorm with the customers and my team. And actually, I enjoy it. Like, this is one of the most enjoyable parts that till today. I mean, even at this scale, we are, it's, it's still present. Yeah. Yeah. Okay. One more question. And then, and then I want to talk about databases.
Starting point is 00:08:38 What is like on the business side as a CEO, you know, founder of a, of a tech company, what's the most sort of unexpected delight that you have in the business side that you maybe didn't expect, right? Like, is it like managing people or like, you know, fundraising? Or is there something where you're like, man, this is really fun. And I just didn't really know this sort of existed on the business side? Not exactly, but let me tell you a little bit about my experience on that. Unless I'm forgetting something, I apologize for that. So what I did throughout the company, why I never had any pleasant or unpleasant surprises,
Starting point is 00:09:23 believe it or not, I've been running the company for five years and a half now. Yeah. And I have never been truly surprised because I was always studying for the stage of the company. For example, when I was raising my very first round, I was studying about how a convertible node is structured. Yeah. So that I know what kind of deal to cut, right? And of course, I was talking to a thousand angel investors and mentors in order to understand what is a good deal, what is not a good deal.
Starting point is 00:09:52 Yeah, yeah. As the company was scaling, I started studying more about culture, right? Like what kind of culture do we want the company to have? Because the first, well Because the first four people are going to dictate the culture, but of course the first 10, the first 20, and so on and so forth. Not the first thousand. Certainly a small core of
Starting point is 00:10:14 people will dictate. They will determine the cultural path of the company. And then, of course, when we started closing enterprise deals, I was learning about enterprise deals, the sales cycles, budgeting, procurement, all of that stuff that you need to know in order to be very efficient.
Starting point is 00:10:33 And the same goes for marketing. When we started doing a little bit of marketing, how do we brand, what kind of traction channels we're going after, and so on and so forth. So I was never really surprised with something but what is extremely pleasant to me again speaking as a as a scientist not so much as a business person is the the results we're having with the customers we have been extremely fortunate and we can talk about this later to focus on certain verticals that are extremely meaningful. For example, life sciences.
Starting point is 00:11:09 And you know what, it is extremely meaningful when your system helps a hospital save lives of babies and DNIs in you. So yes, that delights us as a company, right? And we continue to go after challenging and very meaningful use cases as our first niche beachhead markets. And of course we expand once we do that so that we have both purpose, but also sustainability and growth.
Starting point is 00:11:38 Yeah, yeah. I love it. I mean, I love hearing sort of, it sounds a little bit cliche from a business standpoint, but being customer obsessed, you know, is like a key ingredient. And it's very clear that you love digging in with the customers and understanding the problem, which is great. Okay, well, thank you. That's such wonderful background. Let's talk about databases.
Starting point is 00:12:02 So, you know, creating a database product is a huge undertaking. You've been doing that for five or six years now. The database space has changed significantly, even in the last, you know, half decade, five years or so. What problem were you thinking about when you started TileDB and how did you decide that a database was the right way to solve it? So when I started TileDB, I did not have in mind to create a company. That was not the original motivation. I was a researcher. I liked my job as a researcher. Interlapse was phenomenal. MIT original motivation. I was a researcher. I liked my job as a researcher. Intel Labs was phenomenal. MIT was phenomenal. I was effectively embedded into MIT by Intel.
Starting point is 00:12:51 So I had bought the industry in academia hats, which was amazing. I interacted with, you know, probably the smartest people on the planet. And I loved this. And frankly, looking back, it was a good life. I mean, that is kind of like a dream role where you get like the best of both worlds, you know, and you sort of get to straddle it without any of the like undue burdens on either side. Absolutely.
Starting point is 00:13:23 And I would have continued doing it. It's just one of those, you know, instance in your life where you need to choose. And the reason, so first of all, I started Taldi B as a research project. I always wanted to have high quality in my work. So in the sense that I didn't want to just write one paper with Taldibi.
Starting point is 00:13:47 I had written multiple papers in the past. I wanted to write papers, innovate, obviously, but also do something that kind of lasts. So I always had a product mentality even during my research, but not a company mentality because I was working for Intel.
Starting point is 00:14:04 And the idea was, you know what? if this works well, maybe we create a database for the labs to maintain and grow, like very similar to what CWI has done in the past, Berkeley, and others, right? So that's the kind of mindset I had at the time. Let's build something that is beautiful, do some technology transfer for Intel, try to solve some very big problems, and we take it from there. So no company plans in the horizon. And again, there was no specific problem
Starting point is 00:14:43 from a use case standpoint up until something came up in genomics. What I wanted to do was something different in databases. That's what I wanted to do. Also, please note that when I was at Intel and MIT, I was working at the intersection of high-performance computing, so supercomputing, and databases. And the supercomputing from Intel, I was working with some true ninjas in high-performance computing, you know, optimizing operations at the CPU cycle level. Yeah.
Starting point is 00:15:16 You know, some hardcore stuff. And the databases, of course. Some of them, Stonebreaker and CSAO, right? So what I wanted to do from a research perspective was to kind of bridge those two very, very different domains with very, very different crowds that don't talk to each other very frequently. Yeah.
Starting point is 00:15:36 And actually, that was one of the first times that, and big kudos to Intel and MIT because they partnered for that reason. They partnered in order to, you know, try to combine the knowledge of both domains. And my group at Intel at the time was doing a lot, and still does, I still keep in touch. They do a lot of machine learning, AI, a lot of graph algorithms.
Starting point is 00:15:58 So a lot of linear algebra, a lot of computational linear algebra, which sits at the core of very, very advanced operations. If you want to do something very advanced in your CPU or GPU, in all likelihood you're using somewhere in the depths of your software stack some kind of package, probably built by Intel or NVIDIA, that fully optimizes computational linear algebra operations. And the question I had in the beginning that this is how it started
Starting point is 00:16:29 was, wait a second, if we take the data from storage, be it in key values or documents or text files or tables or whatever, and then we rattle this data into matrices and vectors in main memory in order to feed those into Intel's or the Nvidia GPU. Why aren't we storing the data in this form to begin with? Well, that was the initial observation. Like why are we storing the data on anything other
Starting point is 00:17:00 than the form that we compute on in the end? So it started as a very, very naive observation. Then this started to pile up, of course. And then I said, wait a second, if we do store the data as matrices, an image is a matrix. So now I have the means to store also an image natively as a matrix, and actually it can be sliceable from the disk, not just a blob that I bring in its entirety. I can actually use a arithmetic in order to slice portions of this very, very fast.
Starting point is 00:17:38 Then now I'm starting to modeling also images in addition to tables. But if I do images and tables, what if I can do also key values? What if I can do also graphs representing the adjacency matrices? And this is how the observations started to pile up again and again, more and more. And the hypothesis at the time in the lab was, is the mult-dimensional array a universal data structure?
Starting point is 00:18:09 And by universal, I mean both in terms of capturing all the different shapes of data, but also not just generically, but in a performant way. The hypothesis is twofold. Can we structure them intuitively, like in an abstract model? But the second hypothesis is twofold can we structure them intuitively like in an abstract model but the second part is equally important can we do it efficiently because if you sacrifice performance nobody's going to use it and that's that's when i started you know pounding on the on the storage engine because it starts from it starts from storage is there an efficient format on disk or on a cloud object store which can represent data of any form extremely efficiently so that's that's what
Starting point is 00:18:54 that again it started as a scientific hypothesis and then i started thinking a little bit more business-wise and i was talking to a lot of people. At the time, I was working at the Broad Institute, which was across the street. They had a very specific population genomics problem, which at the time has not been thought of as a matrix problem or as an array problem. Interesting. When I modeled this problem as an array,
Starting point is 00:19:22 they experienced a massive performance boost. And also it was intuitive. Yeah, there are three aspects at the time. There were three aspects of this data set that we need to index on, and this makes it the perfect candidate for an array, and more specifically a sparse array. There is a difference between dense arrays and sparse arrays. We can talk about this later. But this is how it started. And the bottom line was that after talking to a lot of potential prospects, we found that if indeed we build such a database system, which is universal, if we can do it, then we can tackle two extremely important problems as the value proposition of this database system. The first is performance for very complex use cases, genomics, LIDAR, any kind of point clouds, imaging, video, stuff that,
Starting point is 00:20:14 you know, relational databases are not very good at. And number two, we can consolidate data and code assets in a single infrastructure. If we do that, we eliminate silos. If you eliminate silos, first, you make people more productive, they collaborate more easily. And second, they experience a lot of cost reduction. They don't need to buy a lot of licenses. They don't need to wrangle data. They don't need to, you know, to take time in order to get to insights and so on and so forth. So the business aspects came up later from observation.
Starting point is 00:20:52 Yeah. I love it. Okay. Well, I could keep asking questions because, you hear about a lot of entrepreneurs trying to find pain in the market to respond to to start a company. And it's so enjoyable to hear about your curiosity leading you to conclusions that ultimately solve some pretty specific problems. So that's just so fun to hear about your curiosity sort of being a guide from that standpoint. Okay, let's dig into the technical side. Kostas, please take the mic and help me learn about TileDB and all of the specifics. Thank you, Eric. Thank you.
Starting point is 00:21:47 So Stavros, let's start from the basics, okay? Let's start with database systems. And I'd like to hear from you before we get into the specifics of TileDB, what a database system is, right? Like, when we think about something like Postgres or Snowflake, like, it doesn't matter. Big query. Like, at the end, like, all these systems, they have, like, some common patterns there, right? So I think it would be
Starting point is 00:22:18 great, like, to start from that. Like, what are these components, the most universal found ones? And then we'll get into more details about, details about TileDB. But I think that's going to be super helpful also for Eric. So let's do that. Oh, this is an amazing question, Costa, because it gives me a segue to so many different things
Starting point is 00:22:38 I want to talk about. So in order to be able to answer to this question, making no assumptions about the audience and making it as simple as possible. Let's talk about what you would do if there was no database system in the market. And you're going to be surprised, especially in the verticals we're working with. Life Sciences, Geospatial, those folks are not using databases at all. So let's talk about what those folks are doing.
Starting point is 00:23:11 Okay. All right? In the absence of databases, the first thing that they're doing is that they're coming up with very obscure formats to store and serialize the data as bytes into files. Every domain in the absence of a database system is coming up with their own formats and their own parsers. You need also some kind of a command line tool to understand the format in the file. Otherwise, you're not going to be able to open the file and read it.
Starting point is 00:23:43 Yep. So in the absence of a database system, you as a user are responsible for serializing the data somehow, saying I'm going to put this field first and then the next one, and then explain to the parser or write the parser so that you're able to parse this format. And what they usually do is that they're forming consortia to define the specifications of those formats so that everybody agrees on. And of course, changing the format because the technology changes may take years.
Starting point is 00:24:19 And this is very problematic because the technology advances much faster than that, much faster than the specification. So the specification stays behind. You have no flexibility in changing this because otherwise the downstream analysis tools which rely on this will never be able to work. And therefore you're stuck with files. Now on the cloud and with the advent of cloud, this becomes even worse. You need to store thousands or millions of files on some kind of storage bucket, index them with metadata using, well, maybe a database. It's just for the index. This file is that.
Starting point is 00:24:56 It corresponds to that person. That person has access. And speaking about access, you need to define your own solution about granting access to the data. So one part is the storage. The other part is access. Who has access to the data? And with a distributed file system, you're going to call your IT person and say, hey, give access to this username so that they have access to that folder. And on AWS, it's very similar.
Starting point is 00:25:24 Give me an IAM role so that I have access to that folder. And on AWS, it's very similar. It gives me an IAM role so that I have access to a prefix in a bucket. And of course, this can create a revocation hell because we need to keep track of all the keys and all the roles. Effectively, you're creating a lot of work for yourself, one to store the data, the other to manage the data. And then it comes to the analysis. And there are so many different tools that get created that reinvent the wheel.
Starting point is 00:25:55 The reason is that all you have from a domain is a bunch of files. So if you want to do some kind of analysis on those files, you need to build a system, you need to build some kind of a program to do that. And most of the programs have some common components. They may have to run some statistical analysis like principal component analysis or linear regression or something like that. So every single tool implements those or links to a particular library that implements those. And again, a lot of reinvention of the wheel because those tools share massive components on the analysis front.
Starting point is 00:26:34 So this happens today. Imagine if we did not have databases, we would do the same thing for CSV files. We would store the data in CSV files or Parquet files, that's fine. And then every single tool would try to reinvent a SQL operator, like a workloads
Starting point is 00:26:51 or like a filter projections, right? Like projections, selections, and joins. And every single tool would have to implement a different one. So what does a database system do? A database system abstracts all of that. It stores the data in some way.
Starting point is 00:27:09 Yes, sure, there are some common formats like Parquet, but Oracle doesn't open to you the actual storage format. In the past, nobody cared about the storage format. The database system cares about the format, and it evolves it the way it wants to evolve it so that it you know becomes faster and faster without asking anybody's opinion about the format and then they have again the storage layer a way to parse the data they have an execution layer with operators right implementing so many different, so diverse functionality and computations.
Starting point is 00:27:47 It has a language layer, some kind of APIs in the database domain. The most common API is SQL. SQL, it's an API. And of course, a parser to parse this and translate it into, you know, something that is actionable. And all those layers, all of those, everything that I have explained,
Starting point is 00:28:10 appears in every single database management system that exists in the world. Be it for tables for transactions, tables for analytics, key value, graph, images, and so on and so forth. And that was exactly my observation. But all of those layers are common. The only thing that systems differ on is the serialization of the bytes on the disk. And then some APIs, different systems use different languages
Starting point is 00:28:44 and different APIs. And those are set. We don't want to innovate there. Those problems have been solved. And then the query engine is not as different as you think. Like you decompose a query into some kind of an operator tree or an operator graph. And then you dispatch the execution. That was my observation.
Starting point is 00:29:04 And I said, okay, if that's the case, if I choose the format wisely, and I abstract the operators wisely so that it's not just a work close or a filter, right? A filter, a join. And I expand it a little bit so that it is also a matrix product, a miner product, and so on and so forth, then maybe I can create something more general than just an SQL engine. So this is what a database system is. And that was the original observation that all the database systems in the world share those components. Absolutely. That was a great, great...
Starting point is 00:29:42 Oh, I'm sorry. I did forget. I did forget a very important one. The access control layer, because I spoke about storage. This is super important. I cannot leave it out. Database system is responsible for enforcing authentication, access control, and logging. You don't want to reinvent those. They have to be part of the database system.
Starting point is 00:30:07 Absolutely, 100%. And I think we can have probably at least one episode for each one of these components to talk about. And there are some super fascinating things that I think even people who have have built, let's say, databases and forget, I always find it amazing the fact that if you think about SQL, right? SQL is a language that what it does is you use it to describe how your data want to look like.
Starting point is 00:30:39 And then the database systems take this template of how the data should look like and actually generates code in a way to go and create this data. So I'm abstracting a lot here because there are many different ways that you can generate this code. But if you think about it, it's a very interesting and a very different approach compared to going and writing code to do the same thing on your own, right? It's very, very different. Well, think about it like this. In order to see the absurdity of trying to build
Starting point is 00:31:13 a database system on your own, other than making a company, if you're making a company, it's fine because you're going to try to raise capital and then build an amazing team and then do it for a living. That's fine. But it's absurd to try to do capital and then build an amazing team and then do it for a living that's fine but it's absurd to try to do it in a hospital or in a genomics institution or in a geospatial company but this is what is what is happening i'm just telling you absolutely this is what they're doing
Starting point is 00:31:37 it's it's absurd for the following reason just go to sigmoid or vlbb like these are the conferences i used to hang out at, right? You're going to find professors, big professors, working on a sub-component of the components I mentioned for life, generating hundreds of papers with a lot of innovation so that you can understand how difficult it is to build such a database system. A hundred percent. So, okay, based on this architecture that you described and the components of a database system, where is TileDB? What parts of this architecture it fits at?
Starting point is 00:32:24 Absolutely. So again, there are many things we need to touch upon here. And I'm going to explain the Tallybee evolution in order to see what we have built and what we're building down the road so that you see where we are. But things become even more complicated than what I'm explaining right now
Starting point is 00:32:43 because I focus on the components of a database system. However, you know, this kind of database system needs to evolve because it captures just a very small piece of the puzzle in the data infrastructure of an organization. Think about it like this. In the past, you used to have mainly tables in most organizations, and you used to buy Oracle, IBM, or Microsoft, and that was it. That was your data infrastructure. It was just a single colossal database system. There were no data engineers or data scientists. There were DBAs, database administrators. And that was it. You were set. You would pay a lot of money every year, but you would be set. Today, the data infrastructure consists of hundreds of pieces of the puzzle.
Starting point is 00:33:38 You have AI also. Well, AI deals with data as well. It's just a different computation. It's not SQL. But again, in the context of universality that we're talking about here, it's yet some other operators and different formats and different data. So now you also have dashboards. So you have more, more than just a SQL console. You have Jupyter notebooks where you're doing your data science. You're having pipelines for curation of the data, so more advanced detail than before. You have transformations. You have so many different things. And at least in my mind, a lot of those components need to be built into the database system. Again, this is equally radical
Starting point is 00:34:29 to the ideas I had in the past about universality and having a single database for everything. But the database system needs to evolve. Otherwise, you're creating way too many silos. One silo for the dashboards, one silo for the ETL, one silo for AI.
Starting point is 00:34:45 And this also calls for building unnecessarily large data engineering teams. Data engineers are extremely important, but you're inflating your data engineering teams. And what happens is that, again, you end up reinventing the database system. But now it's not on tables. Now it's on your database system and on your ETL pipelines and on your AI and on your dashboards. Yeah. So, again, you're ending up reinventing the database pretty much because, you need access control, again you need authentication again you need a catalog to see what's
Starting point is 00:35:28 happening across your organization again you need wrangling again ETL but these are problems we've solved in databases. So anyway just a small note before I tell you where we're going because you're going to see some of those components in Tallybee today. Tallybee has evolved even
Starting point is 00:35:44 within its own evolution it has evolved even within its own evolution. It has evolved even more to capture this kind of aspect. So let me tell you how it started and how it evolved. Because in my opinion, that's the only sane way to build something like Taldi B, which is colossal, especially the way we started. Nobody gave us a blank check with tens of millions of dollars to build it. We built it very organically, starting with pretty much a single person at the time, myself, and then, you know, raising capital,
Starting point is 00:36:12 incrementally proving this crazy vision, attracting more capital, attracting more great talent to build this, right? Because I'm not building this. It's the team that is building this. So the very first same decision that we took, I'm telling you about the decisions we have not regretted, okay?
Starting point is 00:36:32 Yeah. I'm probably going to hide any decisions that we regretted, although I'm forgetful when it comes to those. After I learned my lesson, I move forward. So I move on. So the first, the only way to start this with was storage. Yeah.
Starting point is 00:36:53 I focused the first 18 months of building the, so the first couple of years at MIT experimenting, and then the first 18 months as a company, we focused exclusively on the storage engine we build what we believe is perhaps the best format in the space we don't talk too much about it because i think my reasons i i i don't want to promote a format i want to promote an engine it's the engine that matters not just the format i don't want to create to get another format consortium and evangelize it i want to't want to create to get another format consortium
Starting point is 00:37:25 and evangelize it. I want to tell you, here's the library. Forget about the format. The format is always going to be flawless. And here's the engine. Here's the API. We're going to have stable APIs. So just use the APIs. So we focus on the storage. We build something extremely powerful that has features that are necessary across domains like cloud-native data layout on object stores with file mutability so that we don't have too much copying of files on object stores. Versioning, time traveling, amazing indexes, asset guarantees for the insertions and the leads and everything that we do. We have very specific guarantees. This took a long time to get it right. But we did get it right.
Starting point is 00:38:11 And this gave us amazing performance for very difficult use cases, again, like genomics, imaging, and others. Because once you get those right, then the tabular use cases become easier. Tables are very neat. They're easier to capture once you start operating at a petabyte scale. And once you get the indexes right, you can optimize at the IO request level and the CPU cycle level. Then the rest become much easier. So storage engine. And we started at five customers from the storage aspect alone, like on the open source.
Starting point is 00:38:46 There were customers that are customers until today. You know, I trusted us when we were like four people. But, you know, it really enjoyed the storage engine because it solved a lot of the genomics problems, for example. They could see that, yes, that library alone gives us value. As we were proving this out and we were getting some customers and as we attracted more capital, we were more confident to start building
Starting point is 00:39:11 the other layers on top. We built, the next thing that we did was to build an inordinate amount of APIs and integrations. So right now we have eight APIs, all fully maintained, fully optimized by us. And we started offering, for example, SQL queries through, well, Presto,
Starting point is 00:39:32 through MariaDB, through Spark, so that we have a quick win. So we plugged into those systems and we said, here, if you want to do some SQL queries at scale, this is what you should be using. And that was an easier path than creating all those layers I explained, right? SQL parser, query rewriter, optimizer, executor, all of those aspects of an Arabic system. So that was next.
Starting point is 00:40:09 And then we started realizing that one of the most important things is access control, logging the governance aspect. And also, if we wanted to build our own execution engine as we are doing today, we need to start with fundamentals. So we built, since a long time ago, our own serverless engine, since day one, our engine has been serverless. And we were building it out because we knew that at the end of the day, no matter what query you had, it's going to be decomposed into either an operator tree or a task graph. And that task graph needs to be deployed in a distributed setting where each task is dispatched
Starting point is 00:40:44 in a different worker. Those workers need to be to be you know they need to elastically scale and they need to have retries they need we need to log everything we need to monitor everything so we build a primitive we didn't build a sequel operator we build the primitive and that helped us solve a lot of problems again in genomics and imaging and AI and so on and so forth. Again, slowly and gradually. And then we started building dashboards and building Jupyter environments, you know, more one-liners
Starting point is 00:41:14 around, you know, queries that become task graphs a little bit more automatically, better ETL processes. All of those were built on the primitives. So the architecture was built in a very sane way since the get-go, and that's why we have very little technical depth in that respect. We reuse everything.
Starting point is 00:41:33 We don't refactor what we build already. And nowadays, we're pushing more and more compute down to TALDB because, first of all, we want it to be self-contained. Second, the compute moves closer to the data so we minimize the copies. Everything is zero-copy in TALDB. Again,
Starting point is 00:41:58 we optimize for the L1 cache. There are so many optimizations we can do if we manage the data from the time it comes into main memory to producing an output. It's much easier to control performance and to optimize for performance. So this is where TALDB is. There is a lot of work still to be done around
Starting point is 00:42:23 optimized distributed computing, more primitive pushdowns. But still, I mean, TileDB today, you can use it and get immense value for very challenging problems and use SQL, do distributed computations, use dashboards, use Jupyter notebooks, and most importantly, federate your data. Like this is one of the killer features in TileDB. Okay, so from what I hear, TileDB started as a storage engine, as you said, solving first of all the problem of...
Starting point is 00:42:59 the fundamental of how we are going to write the bytes on the storage. And on top of that, you build the rest of the BMS, right? All the rest of the stuff that is needed to actually execute queries, give access control, and all these things. So, let's focus a bit like on the first step, which is the storage engine. So the most, or one of the most, let's say, well-known like storage engines out there is RocksDB, right? Which is a key value stores, let's say the mental model of how you interact with the data is super straightforward. You have a key and a value. The API is quite primitive. So what's the difference if I have to convert a system like ROPSDB? I request a key to get it or give a key and a value and get stored. So if I compare it with TileDB as a storage engine, what's the difference there?
Starting point is 00:44:12 How do I access the data in TileDB? And is there the API difference? Share a little bit more about the experience that I should have if I wanted to work with TileDB today compared to something that I already know how it looks like. Oh yeah, it's kind of day and night really, and I'm going to elaborate. So okay, RocksDB is a key value store. You can model, again, a lot of stuff as key value pairs. And this is very good for lookups. I'm looking up for an equality query.
Starting point is 00:44:49 It's an equality query, right? And I can get back the blob. That's too simple. Like this is the use cases that we have are not like this. I'm going to explain a bit. I'm going to expand a little bit the baselines. For example, Parquet and all the variations on top of Parquet, like Delta, like Iceberg and others. Parquet is a specification. And of course,
Starting point is 00:45:11 through Arrow, for example, you can have the engine. And this is where some people may be confused. Parquet is a specification. It's a format and you may have multiple different implementations of it. Arrow is one of them, and actually maybe the best one. So let's focus on Arrow. Arrow implements the Parquet format, and others implement the Parquet format. I think Presto as well. So Parquet is for tables. RocksDB is for key values.
Starting point is 00:45:43 And those are the most prevalent ones. TileDB is a multi-dimensional array engine. And this is a completely different level. First of all, it's more sophisticated. It requires much more work and thought to be put into this. But in order to understand why, okay, why arrays? I'm going to tell you the following. Think of an array as a shapeshifter. Depending on your problem, it's going to shapeshift into a two-dimensional array, into a three-dimensional array, into a dense array where every cell has a value, or into a sparse array where the majority of the cells do not have a value, and they should not be stored, because the space itself might be infinite.
Starting point is 00:46:29 So different semantics there. And think of the dimensions, the axis of the array, right? The dimensions. It's an index. It's like an index, a very, very powerful index. And that index allows you to do two things. The first is, again, in a shapeshifting manner, so you can really, really tune to your applications.
Starting point is 00:46:50 You can lay out the bytes on the file in a way that benefits a lot your queries. It's important. So performance is dictated by the proximity of the result bytes to each other on the file. If your result appears contiguous in the byte space in the file,, you may end up in the worst case doing one million requests. And the latency of each request is going to kill you. It's going to kill the performance.
Starting point is 00:47:37 The arrays, very naively speaking, the arrays allow you to retain the proximity of the bytes on the disk with respect to your query workloads. And in most use cases, you know your workloads. Not 100%, but you know that, you know, my query is spherical, or my query is elongated along one axis. Like, trust me, in the use cases we're tackling, you know your queries, more or less. I'm not saying that you're going to hard code it. I'm just saying that you know the patterns, not the actual queries. So that's one of the things that the arrays do
Starting point is 00:48:15 that you cannot do that with the key value stores. Because the key value stores are hashing. They're hashing the values. So if you're asking for a continuous range of values, those are not going to appear continuous in a key value store. They're going to be hashed to random places. It's going to appear continuous. That's the thing. So you can retain the spatial locality of the multi-dimensional space in the single dimensional space. The same is more or less true for parquet. I mean, again, you can hack it a little bit,
Starting point is 00:48:50 you can partition, you can change the order, you can do some hacky things, but you have to hack it. You need to hack nothing like it's, it's, it's infused into the array model, you don't need to think about partitioning. You don't need to think about the orders. TallyB does that for you. So that's the first, the layout of the bytes. The second is the indexing in the dense case. And this is very different from Parquet. For dense arrays, you don't need an extra index. Everything is inferred. The positions of the bytes on the disk, they're
Starting point is 00:49:23 inferred with very simple arithmetic and instead of doing conditions hey does this cell satisfy my query does that sell satisfy my query you know a priori the byte ranges that satisfy your query so effectively you minimize the IO and you minimize the mem copy requests in in main memory. You copy the data from your temporary buffers into your result buffers. So massive boost in terms of performance. So this is what arrays do, and that's why they're super, super powerful.
Starting point is 00:49:55 You just need to reason in terms of arrays, not in terms of tables, not in terms of key values. All right. So to summarize, key values, we are storing a key and a value, right? Lookups, issues with locality there if we want ranges and all these things. We have columnar storage. We have the whole column. Let's say we can hack it, as you said, sort, partition the data.
Starting point is 00:50:21 If we want to get the whole column, yeah, there's going to be some locality there. But you do other trade-offs, right? How do you do point queries there? Inserts and stuff like that. That gets too complicated. And then we have arrays, right? And I get what you're saying about how in this structure you can incorporate also indexing in there and you can infer the indexing and all these things. My question is, let's take it from an API, purely an API perspective. I am a developer right now. When I have a key value store, I know that I'm going to make a request. I will ask for a key and I'll get back a value. In a columnar store, I'll ask for a column and I will start iterating on the column, the values, one after the other. So when I'm dealing with arrays, and as you said, there are
Starting point is 00:51:20 different types of arrays that you can have there. It's not one-dimensional or only two-dimensional. What I'm interacting with as a developer, what the API looks like when I'm using TileDB in my application. Yeah, this is a good question. Let's make, again, a differentiation between the technology and the API. The technology is what I explained. By the way, TileDB also a columnar. You need to think of Talib as
Starting point is 00:51:47 Parquet on steroids. That's how you should be thinking about it. It's very, very similar in some respects, but it introduces stuff that you need to hack with Parquet to get. The partitioning, for example, you need to hack Parquet to have the partitioning. You need to hack Parquet
Starting point is 00:52:04 to have versioning. You don't hack Parquet to have versioning. You don't need to do that with Tallybee. It's embedded into the format itself. You don't have to think. That's why I'm saying you should think of it as a generalization of Parquet. Now, the API is as follows. First of all, you can just ask SQL queries on Tallybee. Imagine that an array has dimensions. Think of a dimension
Starting point is 00:52:26 as a column, and think of the attributes as other columns. It's as simple as that. The array is... Think of the array as a data frame that is indexed, and the dimensions are special columns. That's what you should be thinking
Starting point is 00:52:42 about. And in this case, the dimensions are non-materialized columns. They're virtual. That's how you should be thinking about. And in this case, the dimensions are non-materialized columns. They're virtual. That's how you should be thinking about it. But any SQL query is valid. Think of them as different columns, differently compressed, very similar to what you do with PyChem. Now, SQL is one of the APIs. In Python and in R, we have more Pythonic and more R-like APIs as well. For example, you can use NumPy-like APIs for TALDB. So you use the bracket operator and you slice. You're slicing.
Starting point is 00:53:17 That's what you're doing. It's just that TALDB also supports real dimensions, not only integers, and string dimensions. So in your bracket operators, you can also slice string ranges. And you can slice real ranges. But the API is like interfacing with an AMPI array. And on top of this, we built also a Pandas-like API. So you can have a data frame operator, which is very, very similar to pandas. And on top of everything, you can add conditions to the non-dimensions.
Starting point is 00:53:54 And there are a lot of tricks we do. We push those down. So this is very similar to Arrow. You create a query expression and you push it down to TalDB. Very similar to what you would do, again, with Arrow or pandas or very similar data frame-like libraries.
Starting point is 00:54:10 So it's easier than you might think. You don't need to think about the arrays other than when you model your data so that you get the proper performance. Once you do this, and in a lot of applications we have ingesters that do that for you, you don't even need to think about this.
Starting point is 00:54:25 From that point onwards, you use what is familiar to you. The Pythonic API, RAPI, SQL, and so on and so forth. That's great. That's great. Yeah, the reason I'm insisting in that is because I want... People are familiar with certain things. And when you introduce something new, I think if you can create some kind of parallel
Starting point is 00:54:47 with what people already know, it's really helpful for them to understand what they are dealing with. And at the end, we are talking about products or technologies or whatever we want to call them here, that they are primarily consumed by technical people and engineers. And engineers won't
Starting point is 00:55:03 like to understand. Knowledge is an important part. part like extracting knowledge from the process of like using something like it's an important uh part of the job right and why like many people at least like including me like is uh why i like uh doing that right so uh that's what i was trying like to get uh out of this conversation i know we spent like a lot of time on, let's say, the low-level stuff. But before I give it to Eric, because we are also getting close to the end of the recording, I want to ask something. You kept mentioning from the beginning genomics as a very important like use case uh and i would like to hear from you what makes genomics such a unique
Starting point is 00:55:52 use case when it comes like to working with data um so can you elaborate a little bit more on that yeah absolutely and uh for genomics, there are effectively two different kind of sub-verticals. There is the population genomics, so DNA, think about it like this. And then there is single cell, which is mostly RNA.
Starting point is 00:56:18 I'm oversimplifying, okay? But there are two different areas, quite dissimilar, I would say, on the surface. From a technology standpoint, it's identical. It's all arrays. So I'm going to explain about this. And although I'm going to talk about genomics,
Starting point is 00:56:38 the same ideas apply in geospatial, because geospatial is yet another big vertical for us. And then, please, just for the record, let's not forget that we do a lot of tabular use cases, a lot of time series. Those are a little bit easier for us. The most difficult ones are those that come in a very specific scientific vertical like, you know, the life sciences or geospatial. Those are, you know, more scientific use cases than the typical business analytics stuff that you would see with tables. So here's what makes them very appealing to us. And then geospatial is very similar. So I'm going to focus on genomics. So the very first reason why I started working on this was because
Starting point is 00:57:26 those are meaningful use cases. We help hospitals save babies' lives. It's as simple as that. It creates a lot of purpose in our company. We're solving a super difficult problem for good. I know it might be a bit cliche, but it is the truth. The reason why it is appealing to us and why other databases was extremely difficult to break into this space is because these spaces seem very convoluted to database folks like me.
Starting point is 00:58:07 You really need to invest the time to understand the science. If you don't understand what those scientists you're dealing with say, there is absolutely no way that you're going to solve their data problems. It's impossible. They have a lot of jargon. You need to understand this jargon. And then you need to dig into those very convoluted formats. The formats are crazy. They look crazy. They're not crazy. They look crazy. Because they come in text.
Starting point is 00:58:35 Again, there's a lot of jargon, multiple fields, seemingly variable-length fields. The metadata is crazy. So you really need to love the space and you need to hire people that are experts which is exactly what I did I go very deep into those domains but I'm not a bioinformatician, I'm not going to gather all this knowledge
Starting point is 00:58:57 within the past couple of years I will bring the actual experts that understand this deeply so TALDB is an amazing I will bring the actual experts that understand this deeply. So TileDB is an amazing fit for those because those use cases are not purely tabular. There are always tables, always. This is very appealing for a system like ours
Starting point is 00:59:19 because we can definitely handle tables extremely, extremely well, but then we can handle matrices equally well, right? Or even better. And those applications come with a lot of matrices, either dense or sparse, a lot, and big. And if you don't have native array functionality, you really need to hack your way with a relational database system. Someone may claim, yeah, I can do it.
Starting point is 00:59:47 You might. You will never get to the performance we're getting. It's very... I'm not going to show you theoretically why this is impossible, but even if you do it, it's not going to be worth it. You're going out of your way to do it.
Starting point is 01:00:03 For us, it's an 80-50. For others, you need to get out of your way to do it. For us, it's an 80-50. For others, you need to get out of your way to do this. So that's why we started with those. Again, by no means are we focusing exclusively on those, but they're very good verticals. We're working with super smart people, which we really, really love, and we work very, very closely with those.
Starting point is 01:00:23 We are solving something that has not been solved before. So we're making our mark in those spaces. And from a business perspective, they're lucrative. We're working with very big pharmas and hospitals. You know, this is a very, very good space. So it can give us the growth we need in order to accelerate and then expand on the other verticals, which we know we can get. We can expand on the other verticals which we know we can get we can get to the other verticals are easier we started with the most challenging one
Starting point is 01:00:50 this is great i mean we need to have at least one more uh episode because no question we just like scratched the surface here well actually brooks isn't here and I'm the one recording. So I'm tempted to go long, but we don't want to get punished by Brooks too much because he drives a pretty tight ship. Okay, so we're actually a little bit over time, which I love because Brooks isn't here. But I'd love to conclude with just a really practical anecdote.
Starting point is 01:01:27 So you mentioned, you know, you mentioned earlier that, um, and actually I think this is when we were like discussing the show, uh, before we started recording, but like babies in the ICU, you know, that that's like an actual sort of use case. Can you sort of bring, you know, we've, we've talked in such technical depth, which I love, but can you bring it home for us and talk about like what's happening in people's lives? Um, you know, who have babies in the ICU and how tile DB, I mean, I don't want to get overly sentimental, but that's a big deal. I'd just love to hear, like, do you have a story about, you know, what this looks like on the ground for the people who are sort of the ultimate end consumer of the data? Yeah, absolutely. the people who are actually saving babies' lives are the pioneer doctors we're talking to.
Starting point is 01:02:27 And they're super pioneers, and especially Rady Children's and Dr. Kingsmore and his team, and a lot of other partners that he's accumulating. And they're the absolute pioneers because, of course, they know the science, once again, but they were so so perceptive so perceptive to understand that their science is blocked by data management in their case
Starting point is 01:02:51 the science is clearly clearly blocked by data by data management the data is too big the the idea and again i'm gonna oversimimplify if somebody from the genetic side is listening in. I apologize beforehand. I do that on purpose. I just don't want to get into all the jargon. But the idea is this. It's quite critical sometimes, not sometimes, always, to genetically test a baby, a newborn, when they're born,
Starting point is 01:03:22 to find specific genetic diseases that pretty much can destroy their lives if they go untreated early on. There are specific genetic diseases that are treatable, but you need to take prompt action. In order to be able to treat those diseases, the very first thing that you need to do is identify that there is risk for such a disease. And you do that through DNA sequencing. Now, in order to be able to identify whether a baby is prone or will have this genetic disease,
Starting point is 01:04:03 you need to find the corresponding mutations in its DNA in the baby's DNA sequence but here's here's the what data management comes in someone may say okay just sequence the baby find those locations and say okay this this so-called variant, this mutation, is going to lead to something very, very critical for the baby. That's fine if you know that that variant is responsible for the genetic disease. But how do you know that this variant is responsible for the genetic disease? You need to have a very big sample right yeah because it's not like a binary like it's not a binary like i mean it's statistics it's statistics right like it's not a binary like this chromosome repository exactly you have a repository you have a database
Starting point is 01:04:59 table which says this particular mutation is pathogenic it will lead to something bad but how did you derive the fact how did you create this factual table but yes this disease is gonna happen because of this very it's because you are like the million other babies yeah yeah and the data from these a million other babies is huge huge so you need a database system to be able to do analysis at that scale in order to be able to always keep up to date this factual table that this variant is going to lead to something so that you take decisions at the icu yep that's how probably contributes to this space once again all the credit to those pioneers,
Starting point is 01:05:47 because they are. All of this technology is new, and this is truly the first time that genomics plays a big role in clinical use. Up until now, it has been used mostly in research, but now we're talking about clinical use. And that's why I really respect the people that we're working with. Amazing. Absolutely amazing. Truly inspiring. Boy, I'd love to meet some of those doctors.
Starting point is 01:06:16 Maybe we can have one of them on the show. That would be kind of fun. But thank you so much for giving us your time. This has been such a wonderful journey of, you know, understanding academia, understanding entrepreneurship, understanding, you know, the deep guts of databases, and then ultimately understanding how the ultimate manifestation of this can, can truly change lives, which, you know, is, is pretty incredible. Um, so Stavros, thank you. It's been incredible. And, uh, we'd love to have you back
Starting point is 01:06:51 to continue the conversation. Absolutely. Thank you so much for having me anytime. What a fascinating guy. I think my big takeaway Costas is you don't often hear about, um, you know, ideas that arise, you know, sort of from pure curiosity, call it, maybe that's an overstatement because Travis is obviously working on, you know, real problems, but he also had a genuine curiosity to understand the relationship between, you know, storage and how that impacts the functionality of all these other components, you know, of a database system, right? And the way that you query it and all the things that you can do with it. And I just really loved hearing about his curiosity, leading him to some interesting questions that ultimately led to interesting discoveries. You know, because a lot of times the classic entrepreneur story is,
Starting point is 01:07:52 you know, I was sick of late fees at, you know, Blockbuster. And so I started a mail order DVD company, right? And then it became Netflix, right? And you're responding to some sort of pain that you or someone experiences. And so I love that, you know, his grew out of curiosity. Yeah, absolutely. I think this is like something that it's commonly found in people that they have done like a career in research in general like okay to to be honest like to go through uh graduate studies and phds and postdocs and all that stuff like you have to be a curious person like like actually curiosity has to be like important enough for you so you can you know keep grinding through like the academia way of doing things
Starting point is 01:08:47 And I mean you can you can tell that like also like from the energy that the guy has right like He he can get passionate, right? So I think that's that comes together with like it was like explains like he's a good even carrier before that I Mean okay. He's definitely Greek. I think we can say that, right? It was really fun for me to have this conversation with him. And for me, it's also very interesting to see how TileDB is going to mature and progress as a product. There's a lot of things that the team is building on top of TileDB as the storage engine, as most people might know about it. So I'm looking forward to see what the future is going to look like for them.
Starting point is 01:09:50 And I have a feeling we will have him again in the show in a couple of months. And he will have news to share with us. So I'm looking forward to that. I agree. Well, thanks for tuning in. Subscribe if you haven't. And we will catch you on the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd
Starting point is 01:10:11 also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rutterstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rutterstack.com.
