Orchestrate all the Things - Data Lakehouse, meet fast queries and visualization: Databricks unveils Delta Engine, acquires Redash. Featuring Databricks CEO / Co-Founder Ali Ghodsi
Episode Date: June 24, 2020
Data warehouses alone don't cut it. Data lakes alone don't cut it either. So whether you call it data lakehouse or by any other name, you need the best of both worlds, says Databricks. A new query engine and a visualization layer are the next pieces in Databricks' puzzle. We connected with Ali Ghodsi, co-founder and CEO of Databricks, to discuss their latest news: the announcement of a new query engine called Delta Engine, and the acquisition of Redash, an open source visualization product. Our discussion started with the background on data lakehouses, which is the term Databricks is advocating to signify the coalescing of data warehouses and data lakes. We talked about trends such as multi-cloud and machine learning that lead to a new reality, how data warehouses and data lakes work, and what the data lakehouse brings to the table. We also talked about Delta Engine and Redash of course, and we wrapped up with an outlook on Databricks' business growth. ZDNet article published in June 2020
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Today's episode features Ali Ghodsi, co-founder and CEO of Databricks.
We connected to discuss the latest news from Databricks,
namely the announcement of a new query engine called Delta Engine
and the acquisition of Redash, an open-source visualization product.
Our discussion started with the background on data lakehouses,
which is the term Databricks is advocating to signify
the coalescing of data warehouses and data lakes.
We talked about trends such as multi-cloud and machine learning
that lead to a new reality,
how data warehouses and data lakes work,
and what the data lakehouse brings to the table.
We also talked about Delta Engine and Redash, of course, and we wrapped up with an outlook
on Databricks' business growth.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
So what I wanted to start with was basically a little bit of a recap
because, well, since it's been a while since the last time we connected,
I think I was quite interested in getting your views on the lake house,
basically, which is like the latest concept that you have been advocating.
And to me, it comes as a natural progression of what you've been working on
for basically since the inception of the company.
So kind of this merging, let's say, this coalescing of the two worlds,
data warehouses and data lakes, basically, and machine learning and everything that you've
been working on.
So I've seen some people raise a few questions around the validity of the term and where
it comes from, whether it's a marketing term or a real thing and so on and so forth.
So I don't know.
Before I say anything, I would just like you to make, let's
say, an opening statement on that one. Yeah, it makes sense. And that's a great question.
Yeah, I actually think the lake house, whether it's going to be called that or something else,
is inevitable. It will happen with or without Databricks, with or without that term. And the reasons for it are just logical. One is that just
this force of machine learning and data science that's becoming really, really important in
organizations and you need it. And I'll connect the dots and I'll connect this to the lakehouse,
but that's a big force that we're seeing. The second force that we're seeing is that
multi-cloud, open source sort of
approaches are also becoming more and more common. I don't know if you saw the Gartner report from
two months ago on multi-cloud, which showed that 81% of the customers they talked to,
out of a group of, I think, 600-some customers, said that they have two or more clouds.
So you have multi-cloud, where you don't want to get locked into a cloud
because you have a multi-cloud strategy, and machine learning platforms,
as two major trends that are happening in our time.
So let's connect it to the lake house then and see why the lake house is inevitable. Well when you look at the lake house
We have two things today and neither can fully do the job end-to-end.
We have, on the one hand, data warehouses.
They do not support machine learning.
So they cannot actually align with this major trend or inflection point that we're at.
They cannot do machine learning.
In fact, you can't even store the data that machine learning workloads oftentimes use.
So machine learning often is about video.
It's often about audio.
It's often about images.
It's often about natural language, text, massive amounts of corpus text.
And in data warehouses, usually you don't even store that kind of data.
I don't know any customer that stores that data in a data warehouse.
So already there you have that issue.
The second issue is this fear of locking into a proprietary format in a data warehouse.
People want it to be open.
They want it to work across the clouds.
They want it to use some standard format.
So that's the second thing against data warehouses.
The third thing is the rising cost of storing all of your data in the data warehouse.
Okay.
So those three things, on the one hand, mean that the data warehouse cannot be the answer to everything.
Okay.
You have to address those three questions that I brought up.
On the other hand, you have data lakes, which also have seen, you know, a big sort of adoption
in the last 10 years.
There we have one, the fact that those data lakes oftentimes
are becoming data swamps, where you're just
dumping lots of data.
But it's hard to actually make use of it,
because there is no structure to it.
Second, they don't give you great performance.
And third, they don't actually support
BI workloads, reporting workloads.
So it's very logical what I'm saying.
Data lakes cannot be the end all answer to all your data
problems because of those three things.
They can't even support BI tools properly.
And on the other hand, enterprise data warehouses
also cannot be the answer to all your data problems
because they can't even support basic machine
learning on video. Okay, so this
has to be solved and the solution is the combination of the two and
that's what we call the lake house. Now you might have a different name for it
but this will happen because the one thing that you can be sure of is that
innovation always happens. So people figure out a way to solve these two. So
in the lake house paradigm, what is it?
And how does it solve these two?
One, it's an open format based on data lakes.
So it's an open architecture.
It's not a closed off walled garden.
And you store all of your data in that data lake.
You can store now video, audio, all that kind of data.
But it's also different from a pure data lake,
because it avoids the data
swamp problem by adding a transactional layer so that you can get quality and so that you
can curate your data lake directly on your data lake. So now you can actually build in
quality and reliability into your data lake. So that's what we call the transactional quality layer.
And there are lots of solutions for that.
At Databricks, obviously, we are developing the open source project
called Delta Lake for that.
So that's the second piece.
And then the third thing is a low latency approach
to actually accessing that data.
So you can get low latency and high throughput,
so great performance directly on the data lake,
so that your BI tools and your reporting can actually directly access it.
That's the paradigm we call Lakehouse.
So given the two problems I said, this is going to happen.
The question is just, can the innovation actually do it?
And I believe we are extremely close already.
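To make the transactional layer idea concrete, here is a minimal PySpark sketch of landing raw files from a data lake as a Delta table. It assumes the open source delta-spark package is installed; the bucket paths and names are illustrative, not from the interview.

```python
# A minimal sketch: turning raw files on a data lake into a transactional
# Delta table. Assumes pyspark plus the open source delta-spark package;
# all paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw, schema-on-read data sitting in the lake (e.g. JSON event logs).
raw = spark.read.json("s3://my-bucket/raw/events/")

# Writing it as Delta adds a transaction log on top of open Parquet files,
# so concurrent writers and readers see consistent snapshots.
raw.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/events")

# Reads go through the same path; the data is still open Parquet underneath.
events = spark.read.format("delta").load("s3://my-bucket/delta/events")
events.show()
```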
That makes sense, and yes, having actually covered Delta Lake the moment it was open
sourced, in fact, a lot of what you say is familiar and does indeed resonate.
And yes, I can see the value of adding a transactional layer and definitely I can see the value and
I'm sure many people can see it as well of having low latency because
you know that enables BI and all the other additional stack of tools that you can have
on top of your lake.
However, I would argue that these two paradigms, the data warehouse and the data
lake, sit on different ends of the spectrum in terms of schema,
basically, because in the data warehouse you have schema on write. So everything
has to adhere to a specific schema, and the type of schema is determined the moment you do your
ETL, basically, the moment you ingest your data. While on the other end of the spectrum, the data lake, you basically have no schema. In a way it's, you know, the Hadoop paradigm of
loose schema on read, and you just apply the schema you want at the moment you want to read the
specific data. And, you know, there are pluses and cons to both approaches. The schema on read in a way
works well because, you know,
in the data lake world you don't necessarily know which part of your data you're going to be using,
which part of your data will be useful, and therefore kind of cutting
corners makes sense, because much of that data is not going to be used anyway.
Where I'm going with that, basically, is I'm trying to figure out where on that
spectrum the lakehouse stands. So is it schema on read? Is it schema on write?
Is it something else? In the description that you gave earlier, I missed that part
entirely. And I think that also relates to things like data catalogs, for example.
Yeah. Great question, George. Actually, this gets to the heart of the issue, right?
It is a data swamp because you were doing schema on read, right?
And it is very structured and reliable, the data warehouse,
because you're doing schema on write.
So in some sense, it might seem like you can't have your cake and eat it too.
Well, it turned out you can actually.
And that's actually what Delta Lake enables you to do.
So the way it works is that you can actually store all your data,
obviously, on a data lake with no schema.
That is actually possible, right?
But then as you format it into Delta tables,
it actually lets you up-level it to various levels of schema on write.
Okay?
So it basically enables you to do schema enforcement,
but you can also sort of let it be at different levels of sort of enforcement. So there's something called merge schema in Delta.
Merge schema lets you actually specify what are the changes that you allow happening
and how flexible do you want to be.
And you can actually go all the way and specify something
called Delta expectations.
Delta expectations, you can express any quality expression.
So for instance, you can say the age of the person
you are inserting into this table has to be over this age.
And if they're not over 18, we're not going to accept them in this table, and so on.
You can specify whatever you want, and then you can specify how the table
should actually behave: should it just warn you, or should it actually put the data in
a quarantine in a different place, or should it reject the data? So you can
actually specify all these levels, and the way you do that in Delta Lake is the operation I mentioned, merge schema.
So the merge schema lets you actually then specify exactly what level of enforcement you want.
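A small sketch of those enforcement levels in Delta Lake: strict enforcement by default, opt-in evolution with the mergeSchema option, and a CHECK-style constraint standing in for the "reject" end of the warn / quarantine / reject spectrum he describes. Table name and paths are illustrative, and the constraint syntax depends on your Delta version.

```python
# A sketch of Delta's schema enforcement levels; names and paths are
# illustrative, and the CHECK constraint assumes a Delta version that
# supports it.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session
path = "s3://my-bucket/delta/people"

# Default behavior: schema enforcement. The table is created with
# (name, age); appending a frame with an extra column is rejected.
spark.createDataFrame([Row(name="Ada", age=36)]) \
    .write.format("delta").mode("overwrite").save(path)

extra = spark.createDataFrame([Row(name="Grace", age=45, country="US")])
# extra.write.format("delta").mode("append").save(path)  # fails: new column

# Opt-in evolution: mergeSchema relaxes enforcement for this write,
# adding the compatible new column to the table schema.
extra.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# Quality rules: a CHECK-style constraint rejects rows that violate it,
# roughly the "reject" end of the warn / quarantine / reject spectrum.
spark.sql(f"CREATE TABLE IF NOT EXISTS people USING DELTA LOCATION '{path}'")
spark.sql("ALTER TABLE people ADD CONSTRAINT adults CHECK (age >= 18)")
```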
And so where does that leave you? Where it leaves you is that what enterprises today are doing
is that they're actually building curated data lakes.
The curated data lake looks as follows.
You have raw tables there.
The raw tables, they might be in any particular format,
and they actually are essentially a schema on read table.
And then after that, you move your data into a bronze table,
and then after that, into a silver table, and after that into a gold table.
At each level of these, you're refining your data, and you're putting more schema enforcement on it.
And the gold tables, it's pretty simple to describe what's happening in those.
In those, you don't do warnings.
You only allow data that satisfies the schema, and anything that doesn't
satisfy it moves into a quarantine, so that you know all the data sets
are actually pristine, and you have alerting on them.
That way you can gradually improve the data that you
have. Now, all of this data sits on a data lake. So it's all in open source parquet format. It's
all on your data lake. But the ones that have been annotated as gold tables,
they are the highest quality.
So you know that the schema is fully enforced.
So to give you an example, on the gold table,
if you add a new data set to it and it
has a new column that didn't exist in the data set before,
it will not allow that operation to happen.
Or if the type of the data that you're adding is incompatible with the type that's already
in the table, it will reject that.
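A sketch of that raw-to-bronze-to-silver-to-gold flow in PySpark, assuming a Delta-enabled Spark session. The columns, quality rule, and paths here are all illustrative, not a prescribed pipeline.

```python
# A sketch of the curated data lake flow described above: raw -> bronze
# -> silver -> gold, tightening schema and quality at each hop.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

bronze = "s3://my-bucket/delta/bronze/orders"      # all paths illustrative
silver = "s3://my-bucket/delta/silver/orders"
gold = "s3://my-bucket/delta/gold/orders"
quarantine = "s3://my-bucket/delta/quarantine/orders"

# Bronze: land the raw, schema-on-read files in Delta mostly as-is.
spark.read.json("s3://my-bucket/raw/orders/") \
    .write.format("delta").mode("append").save(bronze)

# Silver: cast types and drop obviously broken rows -- the schema tightens.
cleaned = (
    spark.read.format("delta").load(bronze)
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())
)
cleaned.write.format("delta").mode("append").save(silver)

# Gold: fully curated; rows failing the rule are quarantined, not dropped,
# mirroring the "reject plus quarantine" behavior described above.
rule = F.col("amount") > 0
curated = spark.read.format("delta").load(silver)
curated.filter(rule).write.format("delta").mode("append").save(gold)
curated.filter(~rule).write.format("delta").mode("append").save(quarantine)
```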
Does that make sense?
Does that answer your question?
It does.
Well, of course, it kind of triggers another set of questions.
So basically, it sounds like, as it should, in fact, that this is not just about how you
implement it technically, but there's a certain line of thinking.
I would even go as far as to call it a methodology behind it, that users basically kind of have to familiarize themselves with and subscribe to, in a way.
And so I'm wondering if, you know, you have some kind of training program, basically, for them to be able to get with the program.
Yeah, absolutely.
There's a certification that our customer success team actually gives to people.
And they have this Delta workshop where they train them.
And then we have solutions architects that actually will train you building it up this way.
So absolutely, you have to follow this methodology, right?
Just like the data warehouse in itself is not, it's just a technology, right?
But once you start using it, you might have to do entity relationship ER diagrams, or
you have to figure out the schema of your data.
Is it the star schema?
What is the structure of it?
There's the same thing here.
But it now enables you to do that on the data lake, which now is open, which now is
based on a standard format, and which now also can store video, audio, and text, which
means now you can actually do machine learning on it.
So it ends up actually giving you benefits that you wouldn't have if you're purely in
a database.
Okay.
Don't get me wrong.
I'm not saying it could be otherwise.
And actually, for me personally,
I count it as a plus
of the fact that you have a methodology
that goes with technology.
Just one final brief question
to wrap this up and move to the other parts.
So I'm wondering if the methodology
is also open source, basically, because Delta Lake is,
and I wonder if the methodology is also freely available beyond Databricks' clients.
Absolutely. And we can share some of the writings we have done on this. The methodology is not trademarked or anything like that. It's very important.
We're actually trying, as much as we can, to get people to adhere to it,
because that way you can actually then build what we call the curated data lake.
Okay, okay, that's great. So, connected to that, actually, I know that part of the news you are
about to announce is a new engine for Delta Lake, which you call Delta Engine.
And just looking at the outline of it,
on the surface, it looks like a faster Hive or Impala or something like this.
SQL, the old story of doing SQL on Hadoop, in a way.
So I'm wondering if that's accurate.
And what's the difference, basically?
What makes Delta Engine different?
Yeah, I mean, this is a state-of-the-art engine that's extremely fast.
So some differences with some of the technologies you mentioned.
And if you have other ones, I'm happy to also compare with.
If you compare it, for instance, with Hive, Hive is written in Java.
And actually, it was important for us to get out of the Java virtual machine, because
no matter what, I mean, we went through multiple generations
of sort of trying to optimize the JVM.
And it just turned out if you want really, really raw, extremely fast performance, you
basically have to get out of it.
We had the project called Tungsten a few years ago, which was a way in which, from the JVM,
we were trying to do the memory management outside of the JVM,
but from the inside.
And it just ends up being very complicated.
And at the end of the day, you don't really
quite get the performance that you would want.
So if you want a really, really state-of-the-art high
performance engine, you have to be
close to the machine language.
So things like C or C++ or even assembly will be necessary.
So that's one difference.
And then you mentioned Impala.
It's actually a very different engine from Impala.
This is a vectorized engine.
So it builds on vectorization.
So it's columnar data.
And the columns of data are actually
executed using vectorization.
And actually, on modern computers, you can use
SIMD instructions. So AVX instructions. So single instruction, multiple data. What that means is
you tell the CPU, for instance, I want to compute the average age here. So I need to add up all the
ages and divide the sum by N. It can do the additions in parallel for you.
The CPU can in one instruction actually take many, like 16 of them, and add them up for
you.
That's what modern hardware can do.
And that's what happens if you lay out the data in a column so that you can easily add
them up.
So that's different from Impala as well.
So those are some of the differences.
So the idea here is to have a very, very fast state-of-the-art columnar engine
and push actually the state-of-the-art
beyond what has been done before.
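A toy illustration of the columnar, vectorized point, using NumPy as a stand-in for a native engine: NumPy's operations run tight native loops that can use SIMD instructions such as AVX. This shows the general effect, not Delta Engine's actual implementation.

```python
# Row-at-a-time interpretation versus one vectorized pass over a column.
# NumPy here is only a stand-in for a native columnar engine.
import time
import numpy as np

ages = np.random.randint(0, 100, size=10_000_000)

# Row-at-a-time: the interpreter touches every value individually.
t0 = time.perf_counter()
total = 0
for a in ages:
    total += int(a)
row_avg = total / len(ages)
t1 = time.perf_counter()

# Columnar / vectorized: one call over the whole column; the native loop
# can process many values per instruction (SIMD, e.g. AVX).
vec_avg = float(ages.mean())
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.2f}s   vectorized: {t2 - t1:.4f}s")
assert abs(row_avg - vec_avg) < 1e-6
```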
Okay.
So based on what you just said,
I guess I will have to assume
that this may not follow the lead of Delta Lake, basically,
because Delta Lake also started its lifecycle as a proprietary product,
and then eventually you open sourced it, basically because you wanted the format,
and then the approach, to be adopted. But I'm guessing that maybe Delta Engine will be
different, because it sounds like it kind of builds on the standards of Delta Lake, but then you add extra performance, basically,
and you would probably want to keep that as a differentiating factor.
Yeah, there's a couple of points on that.
One is that when you are doing these very high-performance things that are low-level,
it's harder to do them as open-source projects.
So that's one.
And second, usually these early projects,
we usually keep them proprietary.
We are, of course, internally always discussing
if we can open source a project.
And my hope is that down the line, we can do that.
We've done it many, many times with internal projects,
whether it was MLflow or Delta or these others.
It is harder when they are very low-level, highly performance-tuned
products, but it might be possible as well. But the current plan is to keep it proprietary right
now. Okay, okay. Well, I guess we'll have to wait and see on that one. Yeah. Okay, so the other
interesting piece of news that you're announcing is the acquisition of Redash.
And Redash is an open source framework for doing things such as dashboards and visualization
and so on.
So, interestingly enough, I saw that my co-contributor at ZDNet had kind of guessed,
kind of saw this coming a while ago.
He was writing about Apache Spark and the Databricks platform,
and he was mentioning how this visualization part at the point, at the time of writing,
seemed to be like something that the stack was missing.
And he was kind of assuming that, well, we may see a partnership coming up in that space.
Well, instead of a partnership, there's an acquisition, and I think this is an interesting one for a number of reasons. First of all, looking at Redash
as a product, it seems very solid, it has a very good user base,
and, interestingly enough, it seems to be also kind of leveraging the same philosophy that Databricks is leveraging. So,
open source and the business model is basically making that available on the cloud software as
a service. So, I would like if you would like to say a few words on how that came to be basically,
what was it that made you go out for an acquisition
rather than a partnership? And then why Redash specifically? And how does that make sense in
terms of a business model? Because, I mean, will the core product remain open source? And if yes,
and will it remain as a standalone project? How exactly is that going to be integrated in your stack?
And how will the logistics work out in a way?
Yeah, great questions.
I'll try to address all of them.
So basically the way it started is that one of our customers,
one of our larger customers was saying,
you guys should look at Redash.
We're using it with Databricks.
And we said, no, we have our own visualization built in.
And, you know, he told us that that's nothing compared to Redash.
So have a look at that project.
So we started looking closer at the project,
and we started working closer and closer with them, with the company,
and especially Arik Fraimovich, who is the founder, in Israel.
And how it happened, the inside scoop is simply it was love at first sight.
You know, it was literally, you know, here we find this guy in Israel, you know, and it's as if,
you know, we were sort of twin brothers. You know, he had the same mindset as us,
super strong technical background. However, he has a skill set that we don't really have
in the same way, right? He's focused on the front end side. And we have been largely a back end company. So it was
really sort of a match made in heaven. So once we met, it was inevitable that things
would transpire from there. Also open source: actually, his company was created in 2013, just like Databricks.
Massive developer adoption.
And then one thing we actually liked is his attention to quality.
So, you know, there are a lot of frameworks out there in open source for doing plotting and visualizations.
But actually Redash stood out.
We actually tried all of them.
Before you buy a company, you see what's out there, right?
How does it compare with what else is out there?
And when we tried the different ones, Redash actually stood out as the one that had, you know,
the fit and finish of each visualization was amazing.
So in other ones, there were corner cases where the visualization would break down.
If you gave it too much data, or if you gave it too many series, it couldn't plot it.
Or if the x-axis sometimes was too big,
it would sort of overflow.
But with Redash, it just seemed very, very robust.
And it just has to do with the kind of culture they had had
and the kind of culture that Arik has
set down in his company.
So we thought that was the thing that made it very, very special for
us. And the rest is kind of history. You know, when you acquire a company,
there are always questions. There's this human aspect:
how are those folks that come from a
different company going to work in their new home?
Is there going to be tissue rejection, or things of that nature? You have
those kinds of questions.
And I'm just shocked that it's as if Arik has been here from day one.
He's almost like a co-founder of Databricks from day one.
So it's sort of mind-meld.
So it's amazing from that point of view.
So that brings us to the other questions.
So how are we going to actually deal with it?
Same way we deal with other things, right?
The core project will remain open source.
We're excited about the community behind Redash, just like we're excited about the community
behind Apache Spark and the community behind Delta Lake.
We want that community to continue to thrive and prosper, which means, yes, they might
be using it on-prem, and Databricks is not an on-prem company, but the same is true about
Apache Spark.
People are using it on-prem. Delta Lake, we added support for HDFS so people could
use it on-prem with HDFS, even though Databricks itself is never
actually involved with HDFS. And then in terms of what it's
going to look like, you'll have to wait a little bit to see it, but
it will be sort of a centerpiece,
front and center of Databricks.
So you'll be able to use it to do visualization.
So, of course, the SaaS platform will be empowered with it.
And it will be sort of highly integrated and just running out of the box on Databricks.
Okay.
So, yeah, what you say makes sense, because just looking at it from the outside, it looked very much to me like what we call an acqui-hire. If you wanted the
technology, you could very well just have the technology; it's open source anyway. It seems to me
that what you wanted to achieve was probably to get
the talent on board, basically, and get them to work closely with
you, so that possibly, besides integrating more closely with your own stack,
I'm guessing that maybe further down the line, you may want to develop some extra proprietary
offering for your stack as well, having the team on board.
Yeah.
You know, I would just say, just as a nuance: usually when you acqui-hire, you're saying,
well, these are great people, let's hire them and then let's not use the product.
They come work on our product.
In this case, we absolutely love Redash.
And so we want both the product and we want the talent.
It is true that it is open source, so you can just pick it up as you mentioned.
The thing though that is really important for me is oftentimes there is actually a factory
behind these software artifacts, right?
The factory that builds them.
And exactly how that factory works, no one really from outside ever knows how these factories
work.
How do they actually build the software end to end?
And when you acquire a full company, you get the whole factory, so you know that it's going to work.
You know that the assembly line will be in sync with the quality control at the end and so on and so forth.
So that's why we've been super excited, and it's been working fantastically so far. Thank you. And to make the connection with an interview that you gave, again for ZDNet, I think it
must have been October 2019 or something, where you mentioned the fact that Databricks
was from that point on going to have two development centers, one in the US and one in Amsterdam,
if I'm not mistaken
and so I guess maybe the people from Redash are going to be your third one,
and I think you said they're based in Israel. Yeah, you know, the COVID
pandemic has changed things and turned it a little bit on its end. So, who knows? Maybe we'll
have 200 development centers.
Yeah.
I was going to mention that this is an
old interview by now, so I guess
things have gone even more distributed
than they used to be.
Yeah. Absolutely.
And to mention
again that same interview,
you mentioned that
Databricks was at the time seeing very, very good growth, basically.
And I think, to quote you, you mentioned something like,
in the last year we've seen growth beyond our wildest expectations, or something
like that, even though, you know, it's not that long since that interview. I was wondering if this is keeping up, basically.
Just kind of wild-guessing, I would say that if anything,
you may have seen some additional growth in the last couple of months,
due to remote work and more cloud and so on and so forth.
So this is the kind of message I'm getting from all cloud-oriented companies,
and I'm just wondering if it's the same in your case.
Yeah, absolutely, a few things I would say.
One is macro trend.
The pandemic is accelerating the future.
So, you know, people are getting rid of cash.
They're doing more, you know, telemedicine.
They're doing more video conferencing.
And AI machine learning is one of those futures, right?
It's the future.
So it's getting accelerated.
So more and more CFOs are saying, let's actually double down on more automation.
Let's make sure that at least we're investing in that.
So I would say it's just on that side, right?
So you're just seeing an accelerated adoption of those things.
You're also right.
Cloud is another thing that is inevitable. Eventually everybody will be in the
cloud. That's also accelerated. People don't want to run data centers and, you know, send humans
into them and worry about the spread of the pandemic and so on. They want to leave it to
the big, big companies that do this, you know, at scale. So those are all positive trends. And
then I would also finally say, you know, it's, you know, a lot of startups have been laying off
people.
They've had hiring freezes.
We've been fortunate that we've sort of, we've planned for an economic downturn for the last
three years.
We've been sort of predicting it.
First, we thought, incorrectly, that it was going to be 2017.
Then we thought '18, '19, and then it happened in '20.
So we've been sort of preparing for it for three years.
So we were really set up for hitting the gas and accelerating when this happened. So for instance, we started hiring, and we see a significant boost;
hiring top talent, especially, has become much, much easier after the pandemic. You know, several of the big sort
of tech shops are doing massive layoffs here, especially in Silicon Valley, you know, Airbnb,
Uber, those are two big ones where you're seeing
sort of mass layoffs of really good people.
And then also other tech companies like Facebook and Google,
you can see that there's a slowdown.
So that helps.
The other thing is that we're well capitalized
because we've been sort of saving money for this.
And the same goes for office space.
Since we thought that this sort of financial downturn would come,
we ended up actually not signing up for all of the office space
that we were otherwise going to sign up for.
In fact, we were looking at $120 million office space
that we were almost about to sign,
but at the last minute we decided not to do it, last year.
And so it just, because we were sort of planning
on some kind of massive downturn coming,
we find ourselves sort of fortunate enough that we can accelerate in these times.
Okay.
Okay.
Well, thanks.
One last short one then on cloud, basically.
I know that you already have a close partnership with Microsoft on Azure, and I was wondering if there's something similar
being planned for the other two big cloud players.
We are actually working more and more very closely with all the cloud vendors including AWS and including all the other ones.
So definitely. And also the Microsoft partnership is going really great, so that's also being deepened.
So definitely we're working close with them.
And if your question is around, is there going to be another cloud?
There absolutely will be.
It's just once the feedback is strong enough from our customer base,
we will add more clouds down the line.
Okay, great.
Well, thanks.
It's been a pleasure and we covered lots of ground in relatively short time. So good value for money. Thanks again for making the time and good luck with
everything.
Thanks so much, George.
I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.