Orchestrate all the Things - Data Lakehouse, meet fast queries and visualization: Databricks unveils Delta Engine, acquires Redash. Featuring Databricks CEO / Co-Founder Ali Ghodsi
Episode Date: June 24, 2020
Data warehouses alone don't cut it. Data lakes alone don't cut it either. So whether you call it data lakehouse or by any other name, you need the best of both worlds, says Databricks. A new query engine and a visualization layer are the next pieces in Databricks' puzzle. We connected with Ali Ghodsi, co-founder and CEO of Databricks, to discuss their latest news: the announcement of a new query engine called Delta Engine, and the acquisition of Redash, an open source visualization product. Our discussion started with the background on data lakehouses, which is the term Databricks is advocating to signify the coalescing of data warehouses and data lakes. We talked about trends such as multi-cloud and machine learning that lead to a new reality, how data warehouses and data lakes work, and what the data lakehouse brings to the table. We also talked about Delta Engine and Redash of course, and we wrapped up with an outlook on Databricks' business growth. ZDNet article published in June 2020
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Today's episode features Ali Ghodsi, co-founder and CEO of Databricks.
We connected to discuss the latest news from Databricks,
namely the announcement of a new query engine called Delta Engine
and the acquisition of Redash, an open-source visualization product.
Our discussion started with the background on data lakehouses,
which is the term Databricks is advocating to signify
the coalescing of data warehouses and data lakes.
We talked about trends such as multi-cloud and machine learning
that lead to a new reality,
how data warehouses and data lakes work,
and what the data lakehouse brings to the table.
We also talked about Delta Engine and Redash, of course, and we wrapped up with an outlook
on Databricks' business growth.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
So what I wanted to start with was basically a little bit of a recap
because, well, since it's been a while since the last time we connected,
I think I was quite interested in getting your views on the lake house,
basically, which is like the latest concept that you have been advocating.
And to me, it comes as a natural progression of what you've been working on
for basically since the inception of the company.
So kind of this merging, let's say, this coalescing of the two worlds,
data warehouses and data lakes, basically, and machine learning and everything that you've
been working on.
So I've seen some people raise a few questions around the validity of the term and where
it comes from, whether it's a marketing term or a real thing and so on and so forth.
So I don't know.
Before I say anything, I would just like you to make, let's
say, an opening statement on that one. Yeah, it makes sense. And that's a great question.
Yeah, I actually think the lake house, whether it's going to be called that or something else,
is inevitable. It will happen with or without Databricks, with or without that term. And the reasons for it are just logical. One is that just
this force of machine learning and data science that's becoming really, really important in
organizations and you need it. And I'll connect the dots and I'll connect this to the lakehouse,
but that's a big force that we're seeing. The second force that we're seeing is that
multi-cloud, open source sort of
approaches are also becoming more and more common. I don't know if you saw the Gartner report from
two months ago on multi-cloud, which showed that 81% of the customers they talked to,
out of a group of, I think, 600-some customers, said that they have two or more clouds.
So you have multi-cloud, where you don't want to get locked into a cloud
because you have a multi-cloud strategy, and machine learning platforms,
as two major trends that are happening in our time.
So let's connect it to the lake house then and see why the lake house is inevitable. Well when you look at the lake house
We have two things today and neither can fully do the job end-to-end.
We have, on the one hand, data warehouses.
They do not support machine learning.
So they cannot actually align with this major trend or inflection point that we're at.
They cannot do machine learning.
In fact, you can't even store the data that machine learning workloads oftentimes use.
So machine learning often is about video.
It's often about audio.
It's often about images.
It's often about natural language, text, massive amounts of corpus text.
And in data warehouses, usually you don't even store that kind of data.
I don't know any customer that stores that data in a data warehouse.
So already there you have that issue.
The second issue is this fear of locking into a proprietary format in a data warehouse.
People want it to be open.
They want it to work across the clouds.
They want it to use some standard format.
So that's the second thing against data warehouses.
The third thing is the rising cost of storing all of your data in the data warehouse.
Okay.
So those three things, on the one hand, mean that the data warehouse cannot be the answer to everything.
Okay.
You have to address those three questions that I brought up.
On the other hand, you have data lakes, which also have seen, you know, a big sort of adoption
in the last 10 years.
There we have one, the fact that those data lakes oftentimes
are becoming data swamps, where you're just
dumping lots of data.
But it's hard to actually make use of it,
because there is no structure to it.
Second, they don't give you great performance.
And third, they don't actually support
BI workloads, reporting workloads.
So it's very logical what I'm saying.
Data lakes cannot be the end all answer to all your data
problems because of those three things.
They can't even support BI tools properly.
And on the other hand, enterprise data warehouses
also cannot be the answer to all your data problems
because they can't even support basic machine
learning on video. Okay, so this
has to be solved and the solution is the combination of the two and
that's what we call the lake house. Now you might have a different name for it
but this will happen because the one thing that you can be sure of is that
innovation always happens. So people figure out a way to solve these two. So
in the lake house paradigm, what is it?
And how does it solve these two?
One, it's an open format based on data lakes.
So it's an open architecture.
It's not a closed off walled garden.
And you store all of your data in that data lake.
You can store now video, audio, all that kind of data.
But it's also different from a pure data lake,
because it avoids the data
swamp problem by adding a transactional layer so that you can get quality and so that you
can curate your data lake directly on your data lake. So now you can actually build in
quality and reliability into your data lake. So that's what we call the transactional quality layer.
And there are lots of solutions for that.
At Databricks, obviously, we are developing the open source project
called Delta Lake for that.
So that's the second piece.
And then the third thing is a low latency approach
to actually accessing that data.
So you can get low latency and high throughput,
so great performance directly on the data lake,
so that your BI tools and your reporting can actually directly access it.
That's the paradigm we call Lakehouse.
So given the two problems I said, this is going to happen.
The question is just, can the innovation actually do it?
And I believe we are extremely close already.
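To make the transactional layer idea concrete, here is a minimal PySpark sketch of landing raw files from a data lake as a Delta table. It assumes the open source delta-spark package is installed; the bucket paths and names are illustrative, not from the interview.

```python
# A minimal sketch: turning raw files on a data lake into a transactional
# Delta table. Assumes pyspark plus the open source delta-spark package;
# all paths are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Raw, schema-on-read data sitting in the lake (e.g. JSON event logs).
raw = spark.read.json("s3://my-bucket/raw/events/")

# Writing it as Delta adds a transaction log on top of open Parquet files,
# so concurrent writers and readers see consistent snapshots.
raw.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/events")

# Reads go through the same path; the data is still open Parquet underneath.
events = spark.read.format("delta").load("s3://my-bucket/delta/events")
events.show()
```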
That makes sense, and yes, having actually covered Delta Lake the moment it was open
sourced, in fact, a lot of what you say is familiar and does indeed resonate.
And yes, I can see the value of adding a transactional layer and definitely I can see the value and
I'm sure many people can see it as well of having low latency because
you know that enables BI and all the other additional stack of tools that you can have
on top of your lake.
However, I would argue that these two paradigms, the data warehouse and the data
lake, sit on different ends of the spectrum in terms of schema,
basically, because in the data warehouse you have schema on write. So everything
has to adhere to a specific schema, and the type of schema is determined the moment you do your
ETL, basically, the moment you ingest your data. While on the other end of the spectrum, the data lake, you basically have no schema. In a way it's, you know, the Hadoop paradigm of
loose schema on read, and you just apply the schema you want at the moment you want to read the
specific data. And, you know, there are pluses and cons to both approaches. The schema on read in a way
works well because, you know,
in the data lake world you don't necessarily know which part of your data you're going to be using,
which part of your data will be useful, and therefore kind of cutting
corners makes sense, because much of that data is not going to be used anyway.
Where I'm going with that, basically, is I'm trying to figure out where on that
spectrum the lakehouse stands. So is it schema on read? Is it schema on write?
Is it something else? In the description that you gave earlier, I missed that part
entirely. And I think that also relates to things like data catalogs, for example.
Yeah. Great question, George. Actually, this gets to the heart of the issue, right?
It is a data swamp because you were doing schema on read, right?
And it is very structured and reliable, the data warehouse,
because you're doing schema on write.
So in some sense, it might seem like you can't have your cake and eat it too.
Well, it turned out you can actually.
And that's actually what Delta Lake enables you to do.
So the way it works is that you can actually store all your data,
obviously, on a data lake with no schema.
That is actually possible, right?
But then as you format it into Delta tables,
it actually lets you up-level it to various levels of schema on write.
Okay?
So it basically enables you to do schema enforcement,
but you can also sort of let it be at different levels of sort of enforcement. So there's something called merge schema in Delta.
Merge schema lets you actually specify what are the changes that you allow happening
and how flexible do you want to be.
And you can actually go all the way and specify something
called Delta expectations.
Delta expectations, you can express any quality expression.
So for instance, you can say the age of the person
you are inserting into this table has to be over this age.
And if they're not over 18, we're not going to accept them in this table, and so on.
You can specify whatever you want, and then you can specify how the table
should actually behave: should it just warn you, or should it actually put the data in
a quarantine in a different place, or should it reject the data? So you can
actually specify all these levels, and the way you do that in Delta Lake is the operation I mentioned, merge schema.
So the merge schema lets you actually then specify exactly what level of enforcement you want.
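A small sketch of those enforcement levels in Delta Lake: strict enforcement by default, opt-in evolution with the mergeSchema option, and a CHECK-style constraint standing in for the "reject" end of the warn / quarantine / reject spectrum he describes. Table name and paths are illustrative, and the constraint syntax depends on your Delta version.

```python
# A sketch of Delta's schema enforcement levels; names and paths are
# illustrative, and the CHECK constraint assumes a Delta version that
# supports it.
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session
path = "s3://my-bucket/delta/people"

# Default behavior: schema enforcement. The table is created with
# (name, age); appending a frame with an extra column is rejected.
spark.createDataFrame([Row(name="Ada", age=36)]) \
    .write.format("delta").mode("overwrite").save(path)

extra = spark.createDataFrame([Row(name="Grace", age=45, country="US")])
# extra.write.format("delta").mode("append").save(path)  # fails: new column

# Opt-in evolution: mergeSchema relaxes enforcement for this write,
# adding the compatible new column to the table schema.
extra.write.format("delta").mode("append") \
    .option("mergeSchema", "true").save(path)

# Quality rules: a CHECK-style constraint rejects rows that violate it,
# roughly the "reject" end of the warn / quarantine / reject spectrum.
spark.sql(f"CREATE TABLE IF NOT EXISTS people USING DELTA LOCATION '{path}'")
spark.sql("ALTER TABLE people ADD CONSTRAINT adults CHECK (age >= 18)")
```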
And so where does that leave you? Where it leaves you is that what enterprises today are doing
is that they're actually building curated data lakes.
The curated data lake looks as follows.
You have raw tables there.
The raw tables, they might be in any particular format,
and they actually are essentially a schema on read table.
And then after that, you move your data into a bronze table,
and then after that, into a silver table, and after that into a gold table.
At each level of these, you're refining your data, and you're putting more schema enforcement on it.
And the gold tables, it's pretty simple to describe what's happening in those.
In those, you don't do warnings.
You only allow data that satisfies the schema, and anything that doesn't
satisfy it moves into a quarantine, so that you know all the data sets
are actually pristine, and you have alerting on them.
That way you can gradually improve the data that you
have. Now, all of this data sits on a data lake. So it's all in open source parquet format. It's
all on your data lake. But the ones that have been annotated as gold tables,
they are the highest quality.
So you know that the schema is fully enforced.
So to give you an example, on the gold table,
if you add a new data set to it and it
has a new column that didn't exist in the data set before,
it will not allow that operation to happen.
Or if the type of the data that you're adding is incompatible with the type that's already
in the table, it will reject that.
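A sketch of that raw-to-bronze-to-silver-to-gold flow in PySpark, assuming a Delta-enabled Spark session. The columns, quality rule, and paths here are all illustrative, not a prescribed pipeline.

```python
# A sketch of the curated data lake flow described above: raw -> bronze
# -> silver -> gold, tightening schema and quality at each hop.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

bronze = "s3://my-bucket/delta/bronze/orders"      # all paths illustrative
silver = "s3://my-bucket/delta/silver/orders"
gold = "s3://my-bucket/delta/gold/orders"
quarantine = "s3://my-bucket/delta/quarantine/orders"

# Bronze: land the raw, schema-on-read files in Delta mostly as-is.
spark.read.json("s3://my-bucket/raw/orders/") \
    .write.format("delta").mode("append").save(bronze)

# Silver: cast types and drop obviously broken rows -- the schema tightens.
cleaned = (
    spark.read.format("delta").load(bronze)
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .filter(F.col("order_id").isNotNull())
)
cleaned.write.format("delta").mode("append").save(silver)

# Gold: fully curated; rows failing the rule are quarantined, not dropped,
# mirroring the "reject plus quarantine" behavior described above.
rule = F.col("amount") > 0
curated = spark.read.format("delta").load(silver)
curated.filter(rule).write.format("delta").mode("append").save(gold)
curated.filter(~rule).write.format("delta").mode("append").save(quarantine)
```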
Does that make sense?
Does that answer your question?
It does.
Well, of course, it kind of triggers another set of questions.
So basically, it sounds like, as it should, in fact, that this is not just about how you
implement it technically, but there's a certain line of thinking.
I would even go as far as to call it a methodology behind it, that users basically kind of have to familiarize themselves with and subscribe to, in a way.
And so I'm wondering if, you know, you have some kind of training program, basically, for them to be able to get with the program.
Yeah, absolutely.
There's a certification that our customer success team actually gives to people.
And they have this Delta workshop where they train them.
And then we have solutions architects that actually will train you building it up this way.
So absolutely, you have to follow this methodology, right?
Just like the data warehouse in itself is not, it's just a technology, right?
But once you start using it, you might have to do entity relationship ER diagrams, or
you have to figure out the schema of your data.
Is it the star schema?
What is the structure of it?
There's the same thing here.
But it now enables you to do that on the data lake, which now is open, which now is
based on a standard format, and which now also can store video, audio, and text, which
means now you can actually do machine learning on it.
So it ends up actually giving you benefits that you wouldn't have if you're purely in
a database.
Okay.
Don't get me wrong.
I'm not saying it could be otherwise.
And actually, for me personally,
I count it as a plus
of the fact that you have a methodology
that goes with technology.
Just one final brief question
to wrap this up and move to the other parts.
So I'm wondering if the methodology
is also open source, basically, because Delta Lake is,
and I wonder if the methodology is also freely available beyond Databricks' clients.
Absolutely. And we can share some of the writings we have done on this. The methodology is not trademarked or anything like that. It's very important.
We're actually trying, as much as we can, to get people to adhere to it,
because that way you can actually then build what we call the curated data lake.
Okay, okay, that's great. So, connected to that, actually, I know that part of the news you are
about to announce is a new engine for Delta Lake, which you call Delta Engine.
And just looking at the outline of it,
on the surface, it looks like a faster Hive or Impala or something like this.
SQL, the old story of doing SQL on Hadoop, in a way.
So I'm wondering if that's accurate.
And what's the difference, basically?
What makes Delta Engine different?
Yeah, I mean, this is a state-of-the-art engine that's extremely fast.
So some differences with some of the technologies you mentioned.
And if you have other ones, I'm happy to also compare with.
If you compare it, for instance, with Hive, Hive is written in Java.
And actually, it was important for us to get out of the Java virtual machine, because
no matter what, I mean, we went through multiple generations
of sort of trying to optimize the JVM.
And it just turned out if you want really, really raw, extremely fast performance, you
basically have to get out of it.
We had the project called Tungsten a few years ago, which was a way in which, from the JVM,
we were trying to do the memory management outside of the JVM,
but from the inside.
And it just ends up being very complicated.
And at the end of the day, you don't really
quite get the performance that you would want.
So if you want a really, really state-of-the-art high
performance engine, you have to be
close to the machine language.
So things like C or C++ or even assembly will be necessary.
So that's one difference.
And then you mentioned Impala.
It's actually a very different engine from Impala.
This is a vectorized engine.
So it builds on vectorization.
So it's columnar data.
And the columns of data are actually
executed using vectorization.
And actually, on modern computers, you can use
SIMD instructions. So AVX instructions. So single instruction, multiple data. What that means is
you tell the CPU, for instance, I want to compute the average age here. So I need to add up all the
ages and divide the sum by N. It can do the additions in parallel for you.
The CPU can in one instruction actually take many, like 16 of them, and add them up for
you.
That's what modern hardware can do.
And that's what happens if you lay out the data in a column so that you can easily add
them up.
So that's different from Impala as well.
So those are some of the differences.
So the idea here is to have a very, very fast state-of-the-art columnar engine
and push actually the state-of-the-art
beyond what has been done before.
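A toy illustration of the columnar, vectorized point, using NumPy as a stand-in for a native engine: NumPy's operations run tight native loops that can use SIMD instructions such as AVX. This shows the general effect, not Delta Engine's actual implementation.

```python
# Row-at-a-time interpretation versus one vectorized pass over a column.
# NumPy here is only a stand-in for a native columnar engine.
import time
import numpy as np

ages = np.random.randint(0, 100, size=10_000_000)

# Row-at-a-time: the interpreter touches every value individually.
t0 = time.perf_counter()
total = 0
for a in ages:
    total += int(a)
row_avg = total / len(ages)
t1 = time.perf_counter()

# Columnar / vectorized: one call over the whole column; the native loop
# can process many values per instruction (SIMD, e.g. AVX).
vec_avg = float(ages.mean())
t2 = time.perf_counter()

print(f"row-at-a-time: {t1 - t0:.2f}s   vectorized: {t2 - t1:.4f}s")
assert abs(row_avg - vec_avg) < 1e-6
```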
Okay.
So based on what you just said,
I guess I will have to assume
that this may not follow the lead of Delta Lake, basically,
because Delta Lake also started its lifecycle as a proprietary product,
and then eventually you open sourced it, basically because you wanted the format,
and then the approach, to be adopted. But I'm guessing that maybe Delta Engine will be
different, because it sounds like it kind of builds on the standards of Delta Lake, but then you add extra performance, basically,
and you would probably want to keep that as a differentiating factor.
Yeah, there's a couple of points on that.
One is that when you are doing these very high-performance things that are low-level,
it's harder to do them as open-source projects.
So that's one.
And second, usually these early projects,
we usually keep them proprietary.
We are, of course, internally always discussing
if we can open source a project.
And my hope is that down the line, we can do that.
We've done it many, many times with internal projects,
whether it was MLflow or Delta or these others.
It is harder when they are very low-level, highly performance-tuned
products, but it might be possible as well. But the current plan is to keep it proprietary right
now. Okay, okay. Well, I guess we'll have to wait and see on that one. Yeah. Okay, so the other
interesting piece of news that you're announcing is the acquisition of Redash.
And Redash is an open source framework for doing things such as dashboards and visualization
and so on.
So, interestingly enough, I saw that my co-contributor at ZDNet had kind of guessed,
kind of saw this coming a while ago.
He was writing about Apache Spark and the Databricks platform,
and he was mentioning how this visualization part at the point, at the time of writing,
seemed to be like something that the stack was missing.
And he was kind of assuming that, well, we may see a partnership coming up in that space.
Well, instead of a partnership, there's an acquisition, and I think this is an interesting one for a number of reasons. First of all, looking at Redash
as a product, it seems very solid, it has a very good user base,
and, interestingly enough, it seems to be also kind of leveraging the same philosophy that Databricks is leveraging. So,
open source and the business model is basically making that available on the cloud software as
a service. So, I would like if you would like to say a few words on how that came to be basically,
what was it that made you go out for an acquisition
rather than a partnership? And then why Redash specifically? And how does that make sense in
terms of a business model? Because, I mean, will the core product remain open source? And if yes,
and will it remain as a standalone project? How exactly is that going to be integrated in your stack?
And how will the logistics work out in a way?
Yeah, great questions.
I'll try to address all of them.
So basically the way it started is that one of our customers,
one of our larger customers was saying,
you guys should look at Redash.
We're using it with Databricks.
And we said, no, we have our own visualization built in.
And, you know, he told us that that's nothing compared to Redash.
So have a look at that project.
So we started looking closer at the project,
and we started working closer and closer with them, with the company,
and especially Arik Fraimovich, who is the founder, in Israel.
And how it happened, the inside scoop is simply it was love at first sight.
You know, it was literally, you know, here we find this guy in Israel, you know, and it's as if,
you know, we were sort of twin brothers. You know, he had the same mindset as us,
super strong technical background. However, he has a skill set that we don't really have
in the same way, right? He's focused on the front end side. And we have been largely a back end company. So it was
really sort of a match made in heaven. So once we met, it was inevitable that things
would transpire from there. Also open source: actually, his company was created in 2013, just like Databricks.
Massive developer adoption.
And then one thing we actually liked is his attention to quality.
So, you know, there are a lot of frameworks out there in open source for doing plotting and visualizations.
But actually Redash stood out.
We actually tried all of them.
Before you buy a company, you see what's out there, right?
How does it compare with what else is out there?
And when we tried the different ones, Redash actually stood out as the one that had, you know,
the fit and finish of each visualization was amazing.
So in other ones, there were corner cases where the visualization would break down.
If you gave it too much data, or if you gave it too many series, it couldn't plot it.
Or if the x-axis sometimes was too big,
it would sort of overflow.
But with Redash, it just seemed very, very robust.
And it just has to do with the kind of culture they had had
and the kind of culture that Arik has
set down in his company.
So we thought that was the thing that made it very, very special for
us. And the rest is kind of history. You know, when you acquire a company,
there are always questions. There's this human aspect:
how are those folks that come from a
different company going to work in their new home?
Is there going to be tissue rejection, or things of that nature? You have
those kinds of questions.
And I'm just shocked that it's as if Arik has been here from day one.
He's almost like a co-founder of Databricks from day one.
So it's sort of mind-meld.
So it's amazing from that point of view.
So that brings us to the other questions.
So how are we going to actually deal with it?
Same way we deal with other things, right?
The core project will remain open source.
We're excited about the community behind Redash, just like we're excited about the community
behind Apache Spark and the community behind Delta Lake.
We want that community to continue to thrive and prosper, which means, yes, they might
be using it on-prem, and Databricks is not an on-prem company, but the same is true about
Apache Spark.
People are using it on-prem. Delta Lake, we added support for HDFS so people could
use it on-prem with HDFS, even though Databricks itself is never
actually involved with HDFS. And then in terms of what it's
going to look like, you'll have to wait a little bit to see it, but
it will be sort of a centerpiece,
front and center of Databricks.
So you'll be able to use it to do visualization.
So, of course, the SaaS platform will be empowered with it.
And it will be sort of highly integrated and just running out of the box on Databricks.
Okay.
So, yeah, what you say makes sense, because just looking at it from the outside, it looked very much to me like what we call an acqui-hire. If you wanted the
technology, you could very well just have the technology; it's open source anyway. It seems to me
that what you wanted to achieve was probably to get
the talent on board, basically, and get them to work closely with
you, so that possibly, besides integrating more closely with your own stack,
I'm guessing that maybe further down the line, you may want to develop some extra proprietary
offering for your stack as well, having the team on board.
Yeah.
You know, I would just say, just as a nuance: usually when you acqui-hire, you're saying,
well, these are great people, let's hire them and then let's not use the product.
They come work on our product.
In this case, we absolutely love Redash.
And so we want both the product and we want the talent.
It is true that it is open source, so you can just pick it up as you mentioned.
The thing though that is really important for me is oftentimes there is actually a factory
behind these software artifacts, right?
The factory that builds them.
And exactly how that factory works, no one really from outside ever knows how these factories
work.
How do they actually build the software end to end?
And when you acquire a full company, you get the whole factory, so you know that it's going to work.
You know that the assembly line will be in sync with the quality control at the end and so on and so forth.
So that's why we've been super excited, and it's been working fantastically so far. Thank you. And to make the connection with an interview that you gave, again for ZDNet, I think it
must have been October 2019 or something, where you mentioned the fact that Databricks
was from that point on going to have two development centers, one in the US and one in Amsterdam,
if I'm not mistaken
and so I guess maybe the people from Redash are going to be your third one,
and I think you said they're based in Israel. Yeah, you know, the COVID
pandemic has changed things and turned it a little bit on its end. So, who knows? Maybe we'll
have 200 development centers.
Yeah.
I was going to mention that this is an
old interview by now, so I guess
things have gone even more distributed
than they used to be.
Yeah. Absolutely.
And to mention
again that same interview,
you mentioned that
Databricks was at the time seeing very, very good growth, basically.
And I think, to quote you, you mentioned something like,
in the last year we've seen growth beyond our wildest expectations, or something
like that, even though, you know, it's not that long since that interview. I was wondering if this is keeping up, basically.
Just kind of wild-guessing, I would say that if anything,
you may have seen some additional growth in the last couple of months,
due to remote work and more cloud and so on and so forth.
So this is the kind of message I'm getting from all cloud-oriented companies,
and I'm just wondering if it's the same in your case.
Yeah, absolutely, a few things I would say.
One is macro trend.
The pandemic is accelerating the future.
So, you know, people are getting rid of cash.
They're doing more, you know, telemedicine.
They're doing more video conferencing.
And AI machine learning is one of those futures, right?
It's the future.
So it's getting accelerated.
So more and more CFOs are saying, let's actually double down on more automation.
Let's make sure that at least we're investing in that.
So I would say it's just on that side, right?
So you're just seeing an accelerated adoption of those things.
You're also right.
Cloud is another thing that is inevitable. Eventually everybody will be in the
cloud. That's also accelerated. People don't want to run data centers and, you know, send humans
into them and worry about the spread of the pandemic and so on. They want to leave it to
the big, big companies that do this, you know, at scale. So those are all positive trends. And
then I would also finally say, you know, it's, you know, a lot of startups have been laying off
people.
They've had hiring freezes.
We've been fortunate that we've sort of, we've planned for an economic downturn for the last
three years.
We've been sort of predicting it.
First, we thought, incorrectly, that it was going to be 2017.
Then we thought '18, '19, and then it happened in '20.
So we've been sort of preparing for it for three years.
So we were really set up for hitting the gas and accelerating when this happened. So for instance, we started hiring, and we see a significant boost;
hiring top talent, especially, has become much, much easier after the pandemic. You know, several of the big sort
of tech shops are doing massive layoffs here, especially in Silicon Valley, you know, Airbnb,
Uber, those are two big ones where you're seeing
sort of mass layoffs of really good people.
And then also other tech companies like Facebook and Google,
you can see that there's a slowdown.
So that helps.
The other thing is that we're well capitalized
because we've been sort of saving money for this.
And the same goes for office space.
Since we thought that this sort of financial downturn would come,
we ended up actually not signing up for all of the office space
that we were otherwise going to sign up for.
In fact, we were looking at $120 million office space
that we were almost about to sign,
but at the last minute we decided not to do it, last year.
And so it just, because we were sort of planning
on some kind of massive downturn coming,
we find ourselves sort of fortunate enough that we can accelerate in these times.
Okay.
Okay.
Well, thanks.
One last short one then on cloud, basically.
I know that you already have a close partnership with Microsoft on Azure, and I was wondering if there's something similar
being planned for the other two big cloud players.
We are actually working more and more very closely with all the cloud vendors including AWS and including all the other ones.
So definitely. And also the Microsoft partnership is going really great, so that's also being deepened.
So definitely we're working close with them.
And if your question is around, is there going to be another cloud?
There absolutely will be.
It's just once the feedback is strong enough from our customer base,
we will add more clouds down the line.
Okay, great.
Well, thanks.
It's been a pleasure and we covered lots of ground in relatively short time. So good value for money. Thanks again for making the time and good luck with
everything.
Thanks so much, George.
I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.