The Infra Pod - Will HTAP database eat both OLAP & OLTP? Chat with Moritz & Christian at CedarDB

Episode Date: July 22, 2024

Ian and Tim sat down with the cofounders of CedarDB (https://cedardb.com/), who are building an all-in-one database that merges both OLAP and OLTP into one. Listen to our chat to hear how they got started in academia and are now jumping into making this database to take on all workloads.

Transcript
Starting point is 00:00:00 Welcome to yet another Infra Deep Dive podcast. As usual, Tim from Essence and Ian, let's go. Hey, this is Ian. Super excited. Today we have the founders of CedarDB on the pod. Moritz and Chris, welcome. Can you tell us a little bit about yourselves, but more importantly, what is CedarDB and why am I excited about it? Sure. So, hi, I'm Moritz. I'm one of the five co-founders of CedarDB
Starting point is 00:00:32 and I'm joined by my colleague Chris. And yeah, so to quickly answer what CedarDB is, we like to call ourselves the all-in-one database system. So in a sense, it is what Postgres used to be in the 90s and the 2000s, but built for the hardware and data processing you have right now. So at its core, CedarDB is just a SQL-based relational database system, and it's also compatible with Postgres via its protocol and its grammar, which means it allows people who are used to working with Postgres to work with their data, but much faster and much more efficiently. Amazing. Oh, go ahead, Chris.
Starting point is 00:01:12 Just wanted to add to the intro. Yeah, we're really excited to build something that finally makes it easy again for people to analyze their data without having to work with tons of tools and learn tons of languages. Just make it simple, like it was in Postgres. What's the primary innovation that enables it? Postgres was the 90s; CedarDB is the 2020s.
Starting point is 00:01:35 The primary differentiator in the data world is OLAP versus OLTP. Does CedarDB solve that problem? You say it's an all-in-one solution. Help me understand what it is about CedarDB that makes it the future. Right. So CedarDB by itself, I think what differentiates it most from many other database systems is that it is built from scratch. And it is designed for the current hardware you have right now. And this really goes through all the components that CedarDB has. So if we start with our query optimizer,
Starting point is 00:02:06 it can generate efficient query plans even for deeply nested queries with hundreds of joins, which means that as a user, you can finally just write SQL without having to think too much about what the database will do with the query and how it will execute it. And then, of course, CedarDB takes these optimized query plans
Starting point is 00:02:25 and what we do with these, we compile them to efficient machine code directly. So essentially, this gives you performance similar to handwritten C code for every individual SQL query, which makes use of the specific CPU you have on the instance where CedarDB is running. And then during query processing, our buffer manager is also designed
Starting point is 00:02:45 to effectively utilize large memory sizes, which on today's servers, you can easily get a few terabytes, right? So what we do is that we can reach main memory processing speeds as long as your data size fits entirely into main memory. And we gracefully slow down to the speeds of NVMe SSDs or cloud storage
Starting point is 00:03:06 as soon as the processing you do exceeds the memory capacity. And all of these components together allow CedarDB to easily outperform all the other database systems that you have right now, and even non-relational database systems. So you can also run other use cases such as graph processing, stream processing, all of this in CedarDB, and combine them with traditional relational data processing. It's been a while
Starting point is 00:03:30 that we've been hearing about these sorts of HTAP systems, or these combined systems, but not many people have actually tried it. I think that's a reality. From a research point of view, I think it's been pretty well researched and well understood. But industry-wise, I don't think there's actually that prevalence of usage yet.
Starting point is 00:03:49 Maybe can you talk about the history of these sorts of products or systems? Has this mostly been research only? Have we seen any commercial products at all? Talk about what has been the reason why we haven't seen everybody just start using this everywhere. So I kind of disagree there that there isn't really an HTAP product out there. Because Postgres, for example, is like a prime example of an HTAP system. A system that can do both transactions and analytics. It's just that people don't perceive it as such, because it's just like a database system that's always been there.
Starting point is 00:04:27 And so in a sense, I think that a lot of people use HTAP systems, but they don't actively call them that. That's just what they're used to with a database system in general. And so I believe that there are quite a few workloads for those. Of course, in the past, we've seen the transition away from a single system such as Postgres to a more diverse landscape where you have a system specifically for your analytics, a big data warehouse like Snowflake or BigQuery and so on. But in a sense, that was only out of the necessity of Postgres not being able to keep up with the current workload. And why we haven't seen that many other products being built is it's not that easy to build an HTAP system from scratch, right? I mean, we've taken five, six years in research
Starting point is 00:05:16 with a not small team by any means, and not many people have the chance to work on such a project for such a long time without actively having to deliver something in between. And so building this deep integration of transaction processing and analytics, and keeping everything fast as well, is something that's quite the engineering challenge. And so I think that's the reason why we haven't seen this: because for most people, who have a small workload, who don't have tons of complex analytics running at the same time, Postgres still scratches that itch, right? It still works just fine.
Starting point is 00:05:51 It's still a great system. And for everyone else, they were fine with building data pipelines into Snowflake and so on. I think I agree, yes. There's Postgres, and all those variants that even came before that. The database world grew a lot, from going relational to actually shoving in a lot more data over time. But there's a point: the database history in the more recent years
Starting point is 00:06:16 is an explosion of databases. There are so many databases now. It almost feels like there's a database for everything: not just vector databases for AI, but graph and time series and the whole nine yards. Fundamentally, maybe just for context, what do you guys think: the explosion seems to have happened because this one-size-fits-all system, at least for Postgres, wasn't possible, right? To push for certain kinds of queries, push for certain kinds of workloads. And like I said, technical limitations probably, or the demand or needs of the data
Starting point is 00:06:50 has changed quite a bit. And maybe can you talk about what actually are the main challenges when it comes to trying to combine workloads? And I don't think you can combine all kinds of workloads, combine the whole history of databases, or all recent history, into one, right? So there are probably certain kinds of things that seem more suitable to fit into HTAP, I assume. Because I think this is kind of the
Starting point is 00:07:16 interesting part: how do we start thinking about the fundamental things that are changing, for something like Cedar, that make it work here? So we can talk about a little nuance here. What are the biggest things that are happening that make a database like yours actually work? And what kinds of workloads can you actually consolidate into your all-in-one database now? Like you said, there's tons of different database systems
Starting point is 00:07:42 that you can get today for very specific workloads, such as graph processing. But I think if you look into what the actual tasks behind these systems are, they're quite similar. Mostly for analytical systems, this is like loading data from your storage, be it internally or externally, transforming it, so filtering it for certain attributes
Starting point is 00:08:02 or certain values or certain combinations, and combining it with other data. So in graphs you will have hops, and you have edges between nodes, and in relations you will have joins, and so on. But the concepts of these are very similar, and they're not that different to process. Semantically, from the outside, and from the languages that you interact with these systems through, they're quite different. But if you look down into the individual components or individual steps that the system has to take in order to process this data, this is very similar for a vast majority of what I would call fixed data analysis. So something that's not highly interactive and changing all the time.
Starting point is 00:08:47 So as long as you can scan static data and analyze it, I think it doesn't matter that much if you want to see it as a graph, or if you want to see it as a semi-structured JSON document, or if you want to see it as relations. I think the steps that the system has to take are pretty similar. And so starting from this, then building a system like ours: if you get these basic blocks right, if you're good at filtering data and good at combining data with other data, you can cover most of these workloads. Some use cases are not a natural fit for this model, but the vast majority of people we've spoken with, they are fine with the subset that maps down to these basic blocks. And for the vast majority of, like, three, four, five, ten hops in a graph, you can easily do this with the same mechanisms
Starting point is 00:09:51 that you need in a database system for transactions. I'm really curious to understand, you know, in the traditional enterprise, you've got your data warehouse, maybe it's Snowflake, maybe you're doing some Databricks stuff, Delta Lake. You've got a bunch of OLTP stores, so a bunch of Postgres instances all over the place. You've got online, offline. You've got ETL systems moving data from the Postgres instances into the warehouse, and maybe you've got some reverse
Starting point is 00:10:14 ETL moving stuff from the warehouse back into the online processing, or maybe into some SaaS. Of that cacophony of moving pieces, the bubblegum and popsicle sticks that is modern data infrastructure: if you adopt CedarDB, what of that stuff are you replacing that you don't need anymore, because we've solved this problem? For the simple folk, is all the OLTP stuff together and just inside CedarDB?
Starting point is 00:10:40 Help us understand. The nice thing about CedarDB is that essentially this choice is up to you. So you aren't forced to really replace your entire data stack up front. But of course, in the ideal world, and what we think should work best, is that you combine most of your OLTP and analytical workloads and ETL and reverse ETL
Starting point is 00:11:03 all into the same place, into one single source of truth. Because this really allows you to get rid of all the downsides of having such a complex stack, right? If you have 10 systems, you need to transfer data between them, which, I mean, if you're running ETL pipelines,
Starting point is 00:11:20 if you're lucky, they are running every night, but the delays can be even longer. So you will necessarily have delays. And then, of course, you will have the problems of synchronizing data between all of the systems, which, for example, make your analytical results potentially out of date. That could be an issue if you are doing more modern data processing, where your data results feed into automated decisions or any other automated processes which rely on the analytics results being precise and correct.
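As a toy sketch of that single-place point (SQLite standing in purely for illustration, with a hypothetical orders table; CedarDB itself speaks the Postgres protocol): the transactional write and the analytical read hit the same store, so the aggregate is current the moment the commit lands.

```python
import sqlite3

# Toy single-system sketch (SQLite standing in, not CedarDB's API).
# The point: the write and the analytical read hit the same store,
# so there is no ETL delay and no second copy of the data to drift.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# "OLTP" side: a transactional batch of inserts.
with con:  # commits atomically
    con.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                    [("eu", 120.0), ("us", 80.0), ("eu", 50.0)])

# "OLAP" side: the aggregate reflects the commit immediately.
totals = con.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('eu', 170.0), ('us', 80.0)]
```

With two systems and a nightly pipeline, that same aggregate could lag the writes by hours; in one store it cannot.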
Starting point is 00:11:52 And if you have everything in a single system, which CedarDB can do, you necessarily get rid of all the delays caused by ETL, and you have only a single version of your data, which immediately makes it consistent. And of course, since CedarDB is a transactional database system, you have real transactions over your entire data, and your analytics run on the same system as your transactions do. So even all your analytics
Starting point is 00:12:19 will be part of these transactional semantics and will also be guaranteed to be correct. So to answer your question: ideally, you want all your data to be in one place, but CedarDB specifically allows you to gradually start out with replacing some of the applications or some of the data storage you have, by replicating the data you have lying around somewhere else
Starting point is 00:12:42 and then gradually replacing existing systems one by one. Really interesting. Help me understand the scaling semantics. One of the reasons we've ended up in the world where we have tens, hundreds, for some people, thousands of different database instances and many different warehouses is scaling. We started this world where it's like, 15 years ago, we talked about a three-tier app where you had your UI layer
Starting point is 00:13:13 and then you had your business logic layer, and that was probably one ginormous Java app, and then you had your massive Oracle instance, and you basically built your entire business on top of that. That was the go-to stack. And a bit of what it sounds like you're telling me is, actually, CedarDB is the pathway back to that world. Now, is that true? How do you think about the scale of the data, the different workloads? Is it the same format
Starting point is 00:13:38 and protocol? Are there different instances? If you were to prognosticate for your mid-sized enterprise or even your mid-market company, how do you think about the way that people deploy CedarDB and the optimum use cases for CedarDB? And what do you simplify for them? It's really interesting, the idea of HTAP. But we went to this world of thousands of different databases all over the place for a reason. So help me understand what CedarDB has solved specifically. And is it one instance?
Starting point is 00:14:07 Is it many instances? Like, what's the deployment topology, and why? To answer the easiest question: the topology of CedarDB is that we do data processing on a single node only, and we distribute only the storage. So that's the easy answer. The reason why we believe this is the correct approach is
Starting point is 00:14:26 I mean, there are many answers to this. One of the answers is that people will rarely actually need real distributed, especially transactional, processing for the amount of data they have. You do have companies like Google, like Amazon, which will process
Starting point is 00:14:42 petabytes of data every day, but this is not your typical customer, right? So your typical customer will have terabytes, a few terabytes of data, which nowadays can fit even into main memory of the largest AWS instances you can buy, right? They have, I think, around two terabytes of main memory right now.
Starting point is 00:15:00 So there's easily enough capability, even in main memory, to process the vast majority of all customers' needs on a single node right now. And because of this, we also believe it's much easier to build a very efficient system on a single node, right? So you don't have to think about distributed transactional processing especially, which involves a lot of consensus protocols and all of this, which necessarily makes things slower, which, I mean, you can't really avoid,
Starting point is 00:15:29 but, as I said, most of the query processing will easily run on a single server anyway. And if you have a system such as CedarDB, which is so much more efficient and faster than all the other systems you have, then it's much easier to really deliver on this promise of: okay, you just need a single server.
Starting point is 00:15:48 Because what we also believe is that the reason why people in the last decade or so tended to distribute their data processing heavily is that no system could really utilize the, for example, 200 CPU cores you get on a large CPU server. So instead, you use, for example, Spark. It's a very popular and very efficient system for doing easy distributed processing. But if, for example, our system, I mean, this heavily depends on the workload, but if our system is 100 times faster than Spark, you can easily, I mean, that's pretty obvious, you need only a hundredth of these servers. So if you have a cluster with 100 servers right now,
Starting point is 00:16:19 you could replace it maybe with just a single instance of CedarDB. So that's actually really fascinating. I think you actually put it in a very interesting way, that instead of thinking of how many servers and how much distributed computing, you really think about the single node. Because most of the frameworks these days, I think at some point, even like a Mongo or Influx, all of these,
Starting point is 00:16:51 you know, at some point, production has to be clusters and more nodes, right? And for compute layers, the Sparks of the world have famously been talking about how many thousands of nodes they can run with how big of a data set, right? And that was the way to make it work on more commodity hardware. And people got used to that. And that's why we even spend money to have other people manage those fleets of clusters for us. But when you think of a CedarDB, I guess you're saying a single node's performance, pushing that limit and re-implementing, can make a huge difference here.
Starting point is 00:17:18 So how do you think people are going to change the way they actually operate with data? Because right now the assumption is data is not going to fit on a single node anyway, right? There's no single disk that can hold all the data. So you put everything in S3, and therefore you can have scale-out compute. And the compute-storage separation has been pretty much a standard
Starting point is 00:17:39 for everything now: warehouses, the whole nine yards. But if you have single-node computing, do you still treat compute and storage separately? How do people think about the data itself when they think about CedarDB? Are you back to the world of Postgres, where everything is shoved onto a single disk and we all run there, with RAID and that kind of stuff?
Starting point is 00:18:01 Or can we actually leverage your big server compute power with still the scale-out data? Does CedarDB have a different philosophy when it comes to this sort of data and compute relationship? No, I would definitely say that the separation between compute and storage is there to stay. So while you get tons of memory and even tons of SSD capacity in a single server, people tend to store a lot of data even if they don't look at it that much, right? If they have the data, storage is cheap, especially with stuff like S3, which basically comes at commodity prices, right? You don't have to care about how much you store; it's pretty cheap per month. So I
Starting point is 00:18:45 definitely think that separating compute from storage is there to stay, but what isn't necessary is distributing the compute itself. So, I mean, you can have different instances, but one instance can definitely serve one query, so no need to distribute a query over multiple workers. And at the same time, to echo what Moritz talked about before, distributing transactions is also not something that's going to be necessary for a ton of folks. But we definitely do distributed storage, and we can also read from S3 and other cloud object stores. I'm really kind of interested.
Starting point is 00:19:26 I mean, this is the nuts and bolts of what differentiates CedarDB. And so one of the questions I wanted to ask you was, what are the challenges that make Postgres not scale? I mean, you're kind of answering that question, right? Like these fundamental things. Of the challenges you've had to solve to create CedarDB, what has been the most difficult thing to get to the point that you are today?
Starting point is 00:19:47 What has been the biggest technical challenge to overcome? Because for a long time, HTAP has been "it'll never happen, these are completely polar opposites, it's not possible." And then you all come along, you launch CedarDB, and we have this database that can do relational and graph data, can do analytics workloads, can do transactional workloads. So is there one thing, is it your architecture,
Starting point is 00:20:12 that opened up the opportunity, or are there multiple different things? And of those things, which was the most difficult to figure out? We definitely don't think that there's only one component that you could, for example, improve in Postgres to make it as fast as CedarDB. So it's definitely a combination of many things. But there definitely are some things which were harder than others.
Starting point is 00:20:32 So one thing is that, by design, every query in CedarDB always runs on all the CPU cores you give it. So this is something that's very different to especially Postgres, which is just now starting to parallelize
Starting point is 00:20:47 some of the operators. But a very important philosophy of CedarDB's query processing is that from start to end of a query, all your CPU cores should always do useful work. And this requires you
Starting point is 00:20:58 to rethink a lot of the algorithms that you usually use in data processing. For example, just sorting data, it's just inherently not easy to fully parallelize. I mean, there's a lot of research
Starting point is 00:21:09 in how to actually do this, but there are also other things. Like, okay, you have join processing and a lot of other things that a database system usually does. How can you make sure that, essentially, if you look at your CPU utilization on your server,
Starting point is 00:21:22 that it will always be on 100% from start to finish. So this is one thing which requires a lot of rethinking of algorithms. And the other thing is the hard drives you have right now, the SSDs, they have very, very different
Starting point is 00:21:35 performance characteristics to traditional magnetic spinning hard drives. So, for example, it's not as bad anymore to do non-sequential accesses, which is the number one performance killer for magnetic drives. But you have different things. For example, you have the internal block sizes of an SSD.
Starting point is 00:21:53 It's much larger than the 4-kilobyte page size. So you need to rethink a lot of how you can access this SSD because it gives you a lot more performance, but only if you access it correctly. And then, of course, you have some more things like modern transactional processing and multiversion concurrency control and all of this, which also ties into multi-user support
Starting point is 00:22:13 and scheduling and all of this. So there are certainly a lot of components, but I hope I could give you a bit of an idea of what the most challenging things were for us. So I want to ask: I'm reading your blog posts talking about Postgres and introducing CedarDB. And one thing that caught my eye, beyond what we talked about, is that the fundamental thing about having an all-in-one database system isn't just
Starting point is 00:22:44 combining the capabilities. You also mentioned there has to be a very simple base layer. So I guess the idea is not to overcomplicate the interface, so that in Cedar you can do pretty much every single possible thing here: graph, key-value lookup.
Starting point is 00:23:00 And, you know, everybody has basically reinvented some kind of query language at this point that seems well suited for their workloads and makes things somewhat simpler, but it's really a complicated explosion. But then I wonder: how do you decide what's the base layer of operators you should provide, where it also doesn't make transitioning from any other database to you seem almost impossible? Or maybe it's possible. Is there a thought process here? Like, how do you think about designing that interface and figuring out what the transitional paths are?
Starting point is 00:23:40 Because it's basically not backward compatible with any other database at this point, right? If you're using Cedar for graph, you've got to rewrite it, right? I assume. Or if you're using some SQL language, everybody has a different SQL, so you might need to migrate, right? Correct me if my assumption is wrong here, but that's kind of what I feel like I understand based on your statements. But talk more about how you think about the interfaces, and how people can transition to you in a more seamless way. So I think a big part of our strategy there is being as compatible with Postgres as possible, right? Because tons of systems today have interfaces with PostgreSQL, and so we basically
Starting point is 00:24:18 re-implemented the wire protocol. We also re-implemented the SQL grammar, in a sense, so that wherever you're using PostgreSQL, which a lot of tools offer as the basic choice, you can also simply use us without much hassle. So you can keep your tools and so on. And this also, of course, goes for
Starting point is 00:24:40 the add-ons. So, for example, for vector stuff, we are implementing the pgvector extension interface and so on. But you're, of course, right that there are some use cases which have seen their own languages, like graph processing. And for those, right now, for most of them, and we will have a blog post on this very soon, depending on when this airs, it will probably already be out.
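The hops-in-SQL idea can be sketched with a toy, hypothetical edge table (SQLite here purely for illustration, not CedarDB's API): one hop is a join on the edge table, and variable-length paths fall out of a recursive CTE.

```python
import sqlite3

# Hypothetical edge table standing in for a graph (illustration only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany("INSERT INTO edges VALUES (?, ?)",
                [("alice", "bob"), ("bob", "carol"),
                 ("bob", "dave"), ("carol", "erin")])

# One hop = one join on the edge table, so two hops is a single self-join.
two_hop = con.execute("""
    SELECT DISTINCT e2.dst
    FROM edges e1
    JOIN edges e2 ON e2.src = e1.dst   -- the second hop
    WHERE e1.src = 'alice'
    ORDER BY e2.dst
""").fetchall()
print(two_hop)  # [('carol',), ('dave',)]

# Variable-length paths (up to 3 hops here) via a recursive CTE,
# still ordinary relational machinery underneath: scans, joins, dedup.
reachable = con.execute("""
    WITH RECURSIVE reach(node, depth) AS (
        SELECT 'alice', 0
        UNION
        SELECT e.dst, r.depth + 1
        FROM reach r JOIN edges e ON e.src = r.node
        WHERE r.depth < 3
    )
    SELECT DISTINCT node FROM reach WHERE depth > 0 ORDER BY node
""").fetchall()
print(reachable)  # [('bob',), ('carol',), ('dave',), ('erin',)]
```

A dedicated graph language would spell these as path patterns, but the operators a relational engine runs are the same joins and filters.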
Starting point is 00:25:05 It's not that hard to express these typical graph queries of a few hops in SQL as well. And of course, supporting a graph language in the end just comes down to parsing the grammar and translating it into our internal operators, like joins and filters and scans and so on. So I'm definitely not excluding the possibility that we will have a graph language interface at some point. But right now there isn't one; like, the GQL standard came out recently, and we're still not sure if this will get adopted or if people will stay more closely with the Neo4j Cypher language. And so for now, most interfaces that we will focus on are these PostgreSQL-based ones, because that's just something a lot of people are familiar with, not only from Postgres but also from other systems like
Starting point is 00:25:59 aurora or rds in the cloud and so on, which also depend on these tools. I'm really curious. You know, one of the things, I was having a conversation with a friend of mine actually yesterday about Postgres, and he looked at Postgres 16 and was like really interested in all the new features that have come out in the last like four to five years around horizontal scaling,
Starting point is 00:26:17 which didn't exist before. Like, the ecosystem of Postgres is just phenomenal. I'm curious, you know, how do you think about competing with Postgres's plugin ecosystem? They have this core, and then they have all these plugins, effectively extensions. The database is a system for front-end developers, right? You don't have to own the back end. It's back-endless. It's back end as a service or whatever. I'm kind of curious, have you thought about an extension system
Starting point is 00:26:48 and how you build or enable an ecosystem like Postgres's on top of CedarDB? Have you thought about supporting their extension ecosystem? And how production-ready do you think CedarDB is in general today? And where do you think it is in terms of your ability to,
Starting point is 00:27:06 should Snyk use it, is kind of my question. So to answer your question about whether we are intending to support Postgres plugins right away, the answer to this is no. The way Postgres processes data and processes queries is fundamentally incompatible with how we process data. So every plugin assumes that the query will run on a single thread only.
Starting point is 00:27:28 And since we compile to machine code, this is also done very differently. So we don't support Postgres plugins directly, but we try to find which Postgres plugins are most popular, such as pgvector, as Chris said, which seems to be very popular for vector processing in Postgres, and try to incorporate them into CedarDB directly as well,
Starting point is 00:27:49 with more efficient implementations. And of course, many plugins that we also see, they purely exist because of some downsides of Postgres. So you have plugins which try to improve some performance aspects. You have plugins for cloud storage or plugins which make parallel processing easier or they loosen up some of the transactional guarantees to allow you to process the data more quickly.
Starting point is 00:28:12 I mean, since our database system is so much faster than Postgres anyway, we also think that many plugins may not necessarily be required in CedarDB. Of course, there will always be things that CedarDB will not offer if you have some very specific data processing needs. And for this, we support user-defined functions.
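As a loose analogy for what mixing user-defined functions with relational operators looks like (SQLite's interpreted Python UDFs here, purely illustrative; the episode's point is that a compiling engine can inline such custom code next to joins and aggregates):

```python
import sqlite3

# Loose analogy only: register custom scalar logic as a SQL function,
# then use it inside ordinary filtering, grouping, and aggregation.
def clamp(x, lo, hi):
    """Custom scalar logic the built-in SQL dialect doesn't provide."""
    return max(lo, min(hi, x))

con = sqlite3.connect(":memory:")
con.create_function("clamp", 3, clamp)  # expose the function to SQL

con.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
con.executemany("INSERT INTO readings VALUES (?, ?)",
                [("a", -5.0), ("a", 42.0), ("b", 999.0)])

# The UDF participates in normal SQL: grouping, aggregation, ordering.
rows = con.execute("""
    SELECT sensor, AVG(clamp(value, 0.0, 100.0))
    FROM readings
    GROUP BY sensor
    ORDER BY sensor
""").fetchall()
print(rows)  # [('a', 21.0), ('b', 100.0)]
```

SQLite runs the Python interpreted and pays a per-row call cost; a system that compiles queries to machine code can instead compile the custom code together with the query plan.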
Starting point is 00:28:34 I mean, of course, user-defined functions, everybody knows them. And people tend to not like to use them, for many reasons. One is that often, for example in Postgres as well, or also in databases like Oracle or Microsoft SQL Server,
Starting point is 00:28:47 you have to use a very specialized, database-specific SQL dialect to write the algorithms. And obviously, no one wants to write their AI inference,
Starting point is 00:28:57 whatever, in a SQL-based language. So in that case, we can also exploit the fact that we compile to machine code. And if you have an algorithm written in a language like C, C++, Rust, any of the languages
Starting point is 00:29:11 that compile to machine code as well, we can easily combine them with the code that we generate for the SQL queries. And this allows us to essentially also get the same performance, like the handwritten code performance for custom code you've written, but you can still combine it with existing functionality like aggregations, filters, and joins,
Starting point is 00:29:31 and all of this in a single database. Right now, we do have our own SQL dialect for these kinds of algorithms as well. And we are currently working on making our UDF implementation more stable and production-ready. So maybe if you're interested in this, I can go into a bit more detail there.
Starting point is 00:29:51 So in general, how this will work is that we support WebAssembly as essentially an input language for user-defined functions. So you can run any kind of untrusted code very efficiently in the database system and combine it very easily with all your other workloads. Awesome. Well,
Starting point is 00:30:10 I want to jump into a section we call the spicy future. Spicy future. As is very self-explanatory, we want you guys to tell us what you believe that not many people believe yet, hence spicy.
Starting point is 00:30:29 Chris, why don't you start with yours and let's go from here. Sure. So I think Moritz has already taken some of that away in our discussion before, but I think a large part of what I believe is that the
Starting point is 00:30:43 amount of distribution that is sold today is not something that will be necessary, even in the future, for a ton of people. There will always be some outliers that really need petabytes of intermediate state, but for the vast amount of people, even if you're crunching petabytes of data, your intermediate results will not be petabytes. And for those people, it is always better to stay on a single machine if you can and just make the most use out of the hardware you have, without getting all the overhead and pain of distributing to get a minimal gain in that area. Interesting.
Starting point is 00:31:21 Moritz, how about you? I have a very different hot take. Maybe it's more of a mild take because I think people are starting to realize this right now. But I think SQL is a great language. SQL was designed in the 70s, 80s, something like this. And it has shown that it's still very relevant today. And the reason for this is that SQL has been designed
Starting point is 00:31:41 in a way that does not prescribe how a database has to execute the query. And this has led to database systems right now, such as CedarDB, that can choose how they want to execute the SQL statement themselves, which means that the SQL statement you wrote in the 80s will run much, much faster right now, even though you didn't change any of the code. And to add on that, I think many database systems tried to think of new languages. I mean, of course, you have document-based database systems
Starting point is 00:32:11 such as MongoDB, which try to come up with their own language. You have things also like object-relational mappers and all of these libraries. I strongly believe you should just write SQL directly. So you can do JSON processing in SQL as well. CedarDB supports the same syntax as Postgres does. It has been shown year after year, and for every new system, that everybody tends to go back to SQL anyway.
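[Editor's note: Moritz's point about SQL's declarative design can be seen in miniature with any engine (SQLite is used here as a stand-in): the same statement text gets a new execution strategy once the physical layout changes, without touching the query.]

```python
import sqlite3

# The query text never changes; only the database's physical layout
# does, and the planner picks a different strategy on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")

query = "SELECT id FROM orders WHERE customer = 'acme'"

# Without an index, the planner has to scan the whole table...
before = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(before[0][3])  # e.g. "SCAN orders"

# ...after adding an index, the unchanged SQL is answered by an
# index lookup instead: same query, different (faster) plan.
conn.execute("CREATE INDEX idx_customer ON orders(customer)")
after = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
print(after[0][3])   # e.g. "SEARCH orders USING INDEX idx_customer ..."
```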
Starting point is 00:32:35 MongoDB supports SQL, and you also have the SQL adapter for Spark, for example. And SQL is just great. And this is why we are still building on SQL, and we intend to continue doing so. We are strong believers in SQL. I mean, those were spicy. Both of those were spicy. Because if
Starting point is 00:32:57 the average bear, I think, would say, I don't want SQL. And the other average bear would say, I believe in a distributed world. So it's really interesting to think through that space of the future. Where do you align around things like object store? Like, what do you think about CedarDB and the future of CedarDB? Do you think about the future of object store? Does that play into how you think about your database design? It sounds like no, because you really want to run as close to the bare hardware
Starting point is 00:33:24 and take advantage of SSDs and all these other things. How do you think about the ecosystem and how it evolves as CedarDB becomes more mature, gets more adopted? How does it change the rest of the ecosystem? Do you mean things like S3 with object storage? I do, yeah. We do want to work very closely to the hardware because that's the only way to really get all of the performance
Starting point is 00:33:50 that you can get with modern hardware. But we definitely think, as Chris already touched on, separating compute and storage makes a lot of sense and you will need to do this. And for this, using object stores like S3 just makes sense. That's the obvious way to do this. And our entire data storage model, like the internal data representation of CedarDB,
Starting point is 00:34:10 is designed specifically with object storage in mind. So we have certain block sizes, we have a columnar block-wise storage, which makes it very easy to separate this into different blobs and makes it very easy to access this efficiently and also cost effectively, because storing data in S3 is cheap, but accessing it can be very expensive if your system doesn't do this very efficiently. So object storage is definitely something that we think is a good way to go
Starting point is 00:34:41 and you should do that. Object storage is not something for us that's out far in the future, right? We already have built quite a lot of work and actually, like Moritz said before, parallelize everything also applies to the network stack. If you want to get the most out of the network bandwidth that you pay for with cloud instances
Starting point is 00:35:03 in AWS and Azure and so on, you need to also fetch your data in parallel and we can achieve like reading at instance bandwidth from object stores to keep this delay of not having it available on SSD as low as possible and still get the most performance even if your data is stored remotely. You know, I think the ability to run everything with SQL on a single node is probably everyone's
Starting point is 00:35:31 dream, to be honest, right? Who really wants to run thousands of servers? But that kind of raises the question, really: how far are you guys right now? Because I'm sure everybody listening to this has gotten an inkling, like, this could be really great. They'll probably want to try it and figure out: can they use it? Are you ready for production yet? Where are you with the progress of CedarDB? And if people actually want to play with things, is it possible today?
Starting point is 00:36:00 So we are a very young company; we just incorporated this year. But, I mean, as you mentioned, we now have regular blog posts where you can get insights into what we are doing and what our technology does. And we also have a waitlist for interested, you know, potential customers. And so if you're interested in working with us,
Starting point is 00:36:18 just sign up on our waitlist and you can join the few people that are already trying out and testing CedarDB for their production use case. And then, of course, you can expect many more updates from us in the next months. That's amazing. So I guess to kind of top that off, where should I go find you? What is the best place to kind of like keep track of the progress of CedarDB?
Starting point is 00:36:40 So you can definitely find us on our website, cedardb.com. Our blog posts are published there, and you can also find our Twitter account. And on LinkedIn, we will also publish our updates. So these are the three main channels. Amazing. I think this is one of those super exciting projects that hopefully becomes a product that everybody can use.
Starting point is 00:37:00 And it just changes the landscape, right? Thanks to both of you for being on. You know, we got enough high spiciness and all this sort of goodness that we needed to get here. So everybody else, check out CedarDB, go download it when it's ready, or track their progress. It's super exciting to hear this. Yeah, thank you very much for having us. Thank you. It's been so much fun.
