Drill to Detail - Drill to Detail Ep.104 ‘Delta Lakes, Tiger Temples and the Databricks Lakehouse Platform’ with Special Guests Jason Pohl and Stewart Bryson
Episode Date: May 17, 2023
Mark Rittman is joined by Jason Pohl, Senior Director of Data Management at Databricks, along with special co-host and Head of Customer Experience at Coalesce, Stewart Bryson, to talk about data lakehouses, Databricks and helping to raise tigers in a Thai Tiger Temple.
Databricks - A History
What is a Delta Lake?
The Databricks Lakehouse Platform
Databricks SQL
Jason Pohl LinkedIn Profile
Lions and Tigers and a Bear named Maam Maam
Stewart Bryson Github profile
Coalesce.io
Transcript
Before we get into all the kind of Databricks goodies, something that's far more interesting
is the fact that you lived in London before and you raised tigers.
My life seems so boring right now. Hello and welcome to another episode of the Drill to Detail podcast.
And I'm your host, Mark Rittman.
And today we've got a special co-host, Stewart Bryson.
Thanks, Mark. Really appreciate you having me on here.
We're walking a thin line. People are going to get tired of hearing my voice.
But hey, I'm Stewart Bryson, and long-time listeners of this show
will know me as the founder and former CEO of Red Pill Analytics. I think most of the podcasts that
I've done with you were during that time, and you've done a few since. And we're pleased
to be joined today by Jason Pohl from Databricks. And we're going to
dive into Databricks. Do you want to tell us, Jason, a little bit about what you do there?
And tell us a little bit about Databricks as well. Yeah. So thanks for having me on the program. I
really appreciate it. This is pretty exciting for me. I'm Jason Pohl. I'm a Senior Director of Data
Management at Databricks. So I lead a team of specialists, and we focus on bringing to market new products for Databricks, anything around data engineering, data governance, or data warehousing.
And I was also one of the first 10 solution architects at Databricks.
So I've been at Databricks for about seven and a half years now, and it's been a wild ride so far.
So it's been great to kind of see the product evolve since then.
And I'll give just a short plug.
So Databricks is a company that was co-founded by the original co-creators of Apache Spark.
And we've since gone on to create other open source projects like MLflow for machine learning,
as well as Delta Lake for data lake analytics.
And we basically have created these open source projects to become the
de facto standard in their domain. And then we make sure that Databricks is the best place to
run these open source projects. And we kind of like cover a whole gamut of use cases, anything
from data science and machine learning to streaming, ETL, data warehousing, and we do it all
on a single platform, which we call the Lakehouse platform,
which I think we're going to get into a little bit today.
Before we get into all the kind of Databricks goodies, something that's far more interesting
is the fact that you lived in London before and you raised tigers. So give us the kind of
backstory then. How did you go from raising tigers to working at Databricks?
Yeah, so... oh, you're looking at my LinkedIn. Yeah, so I moved
to London, was it in 2001, and I was working for this small consultancy. It was
called Inforte, and then they rebranded, they got bought by a French consultancy called Business
& Decision, which is kind of funny because if you say it too quickly it sounds like Business
Indecision. But I was doing a lot of Siebel
CRM implementations, and I became the data migration guy. So I would migrate these
legacy systems into Siebel CRM. And then I ended up getting into data warehousing after that.
So I was building bespoke data warehouses with Informatica. And that kind of
piqued my interest in data warehousing in general.
And then after I was in London for, what is it, six or seven years, I ended up deciding to take the long way home.
So I sold my stuff and got a backpack and I backpacked around the world.
And one of the stops was in Thailand. I got to go to the Tiger Temple.
I went there as a tourist. I had seen this place on, there's a news program called 60 Minutes in the US, I saw that years ago. I was like, oh, if I'm ever in Thailand, I want to go there.
And so I went, and then they actually needed volunteers, and they offered
two meals a day and a place to sleep. So I was like, oh, this sounds like an
amazing deal. So I got my inflatable roll-up mattress from my backpack and
spread it out on the tiled floor.
And I basically spent a month there helping them raise tigers.
So they're monastic grounds.
So they take in all these unwanted animals.
They have all sorts of animals.
They had a few camels.
They had horses, water buffalo.
But they're most famous for their tigers.
And so I got to help basically feed these tigers and play around with them,
give them a bath and stuff like that.
Mostly the younger cubs,
but I got to work with some of the one-year-olds as well who were about a hundred pounds or something like that.
So it was great.
It was an amazing experience,
something that I don't think you could really replicate in the U.S.
Wow.
So I was about to say, how would you give a tiger a bath?
And I was going to say very carefully, but that would be the worst joke ever.
Nothing you say now, Jason, will be as interesting as the tigers, really. So you've kind of set a high bar now for the rest of the episode.
So, but yeah, I'm really keen to have someone from Databricks on, and Jason, thank you
very much for sort of volunteering to come on. And really, I want to try and understand,
or myself and Stewart want to try and understand, I suppose, the fundamental decisions behind the Databricks platform and the fundamental things
that make it different, and I suppose what makes it particularly suitable for the kind of workloads that you take
on. But Stewart, just very quickly, you know, what is it that interests you
about Databricks? Where did it fit into your world, really? Yeah, so going back, this is some time
ago, we did several Databricks projects when we were at Red Pill. A former employee of mine
went over to Databricks named Pete Tamason, shout out to that guy, and was just singing
its praises. So we sort of got invested and did a few projects and some proof of concepts.
I think the first thing that really stood out to me, and this is some time ago, is it was really
hard to run notebooks back then. And so the first thing was just how easily Databricks made it to
run notebooks. I mean, just cloud-based notebooks. I know it's more than that. They manage the
execution and they manage the cluster and the compute,
but just going in, grabbing a browser, starting to write some code and execute it. And then
additionally, you know, just anybody who's worked with notebooks knows this, but, you know, I wasn't
a notebook expert at that time. The ability to, you know, sort of in a single set of tasks, be able to jump from one language
to another. So the ability to build up some tables using SQL, to do some complex processing, I'm a JVM
developer at heart. And a lot of the ecosystem around Databricks was Scala based and still some
Java. And of course they had a lot of people using Python.
But just the ability to grab a code snippet of Scala
and work on using data frames and process some data
and then jump back into SQL.
And then the ability to, you know,
again, anyone who knows notebooks knows this,
but as I was coming up to speed on notebooks, the ability to visualize, just right inline, the results of your data processing, all in a single product or tool, and not having to jump around to a SQL editor
to go test your code,
and then go over to some sort of workbook solution
to visualize it.
I think that's what really got me hooked.
And then the idea of notebooks as being,
you know, then scheduled
and the actual way that you load your data is exactly
the same experience you went through as a developer. That's what all clicked with me.
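To make that flow concrete, here is a minimal sketch of the kind of mixed-language, inline-visualized notebook session Stewart describes, assuming a Databricks Python notebook where a SparkSession named spark and the notebook's display() helper already exist; the table and column names are made up for illustration.

```python
# Cell 1: build up a table with SQL (in a real notebook this could also live in a %sql cell).
spark.sql("""
    CREATE OR REPLACE TABLE daily_clicks AS
    SELECT date(event_ts) AS click_date, count(*) AS clicks
    FROM raw_events
    GROUP BY date(event_ts)
""")

# Cell 2: switch to the DataFrame API for more involved processing.
df = (spark.table("daily_clicks")
           .filter("clicks > 100")
           .orderBy("click_date"))

# Cell 3: visualize the result inline rather than exporting to a separate BI tool.
display(df)
```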
Yeah. I feel like, in the early days, we were kind of known as the, quote, notebook
company. So we went GA in, I think, July or August... no, July 2015,
but the company had been around for like two years before
that. And at the time, you know, there were IPython notebooks, those became Jupyter.
And then there was another one, Zeppelin, which people were playing around with, but, you know,
people were trying to do this notebook interface. So the demand was there,
like people wanted this stuff, but it was very flaky. If you tried to host this stuff yourself,
things would crash. No one would consider running a production job as a
notebook because you were scared about the instability of it. And so I think we
kind of got famous because we just had these notebooks and they just kind of worked, and you
could import and export Jupyter notebooks, and it actually was running an IPython kernel behind the scenes. So we just
got famous for that. And whenever I first joined, it was my first foray really into notebooks. And
I had kind of like done a lot of Java stuff with IDEs. And at first, I didn't really like it,
to be honest. It wasn't what I was used to. And then as I started to use it more,
what I originally would do is I would use it to prototype.
So, you know, anytime you're working with data, you've got to be able to run a query, get a result, investigate a little bit, profile it, and then go back and run another query.
And as you keep iterating through this, it's hard to do that in an IDE.
But then once you kind of get a sense of the shape and the profile of your data, I would switch over to my IDE and start to do some work there and then go back and forth.
but nowadays, like I, I hardly ever leave the notebook. I kind of really like it because it's,
it's a good place to do just what you were saying, the exploring the data and then visualizing it.
And I'm more of a data engineer than a data scientist. So I'm not really building these
statistical, you know, sophisticated models,
but we do have users that do that. And it's, it's been like a really good tool. And since then,
it's kind of allowed us to do other things. So we've had the ability to schedule
these notebooks as automated jobs for a long time. And then about a couple of years ago,
we introduced AutoML technology, the idea being it's like a glass box method, we call it, where you can take a table of data and select a column that you would want to predict.
And then you go into a wizard where you select a few other things, like what's most important for the prediction. And then basically,
as you finish that wizard,
it's going to automatically generate a bunch of notebooks and then run them
and try to measure which notebook had the best fit, and then give that to
the data scientist.
Like, say, here's a working model.
Here's all the code in the notebook.
Here's the runs that we
did and the best parameters that we tried to tune it with. You can basically take this notebook from
here and then try to make it better using, you know, all the domain expertise that that data
scientist would have. So we were able to build that because we had this notebook infrastructure
already in place. And it's really been a good launch. So it's kind of like hyperparameter
tuning at the notebook level. Yeah. Well, that's one of the things it does.
It does the hyperparameter tuning for you.
And, you know, you could probably hyperparameter tune to infinity, but it'll do enough to give you a good sense of where to start.
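What Jason describes might look roughly like the following, a hedged sketch using the databricks.automl Python API available on Databricks ML runtimes; the table name, target column, and the best_trial attribute access are illustrative and should be checked against the AutoML docs.

```python
from databricks import automl

df = spark.table("customer_features")      # hypothetical input table
summary = automl.classify(
    dataset=df,
    target_col="churned",                  # the column you want to predict
    timeout_minutes=30,
)

# AutoML generates trial notebooks, runs them, and reports which trial fit best;
# a data scientist can open that notebook and keep improving it by hand.
print(summary.best_trial.notebook_url)
```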
We see a lot of Databricks customers that will also jump out to the IDE, as you've described, and using your SDK, they'll build, you know,
whether it be user-defined functions or classes or, you know,
almost compartmentalize your code that way and then, you know,
plug that into your compute and reference that using a notebook. How often are your customers
doing that sort of, I'll call it non-notebook, hardcore IDE-based development, publishing
jar files, uploading those jars to the compute? If you had a rough estimate, how many customers
are using your product in that way? Oh, yeah, a lot. So in general, if you had to classify the buckets of who uses Databricks, you've got data scientists who
generally are gravitating more towards the notebooks. You've got data analysts who are
generally running SQL either through our UI or just connecting a BI tool. And then you've got data
engineers who are building these data pipelines. And a lot of them are using IDEs. And I think whenever I first
started, IntelliJ was probably the most popular, but now VS Code, because it's hosted
and it has Copilot, has become a really popular tool. So we actually just
recently released a plugin for VS Code, which allows you to easily basically just connect to
a Databricks cluster and iterate
with your code there and have some of the same elements that you would have within the
Databricks notebook, like access to clusters and jobs and everything within there to make
the experience seamless.
The other thing that we're doing is we're leading this effort within the Spark community
for something called Spark Connect.
And the idea is that one of the challenges
with Spark is whenever you run, like, a Scala program, you're basically running your application
code on the same JVM that's being used as the driver for Spark. And so it's hard to get that
separation. So Spark Connect is basically going to separate this so that everything you do,
from generating DataFrames, basically gets done within your local client. And then
the output of that is basically a query plan, and that gets shipped over the wire
using RPC to the Spark cluster. And this allows you to separate the creation of the
query plan from the execution of it, which means you'll be able to have Spark Connect in lots of
different languages, even JavaScript. So that's one of the things that we're working on with the Spark community.
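A minimal sketch of the Spark Connect model Jason outlines (available from Spark 3.4), where the client only builds the query plan and ships it over gRPC to a remote cluster; the endpoint address here is a placeholder.

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of embedding the driver locally.
spark = (SparkSession.builder
         .remote("sc://spark-connect-host:15002")   # placeholder endpoint
         .getOrCreate())

# These DataFrame operations only build a plan on the client;
# execution happens on the server when an action like show() is called.
df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
df.show()
```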
Okay, so Jason, a question from me. So hypothetically, if somebody on this call
wasn't so familiar with Databricks and some of the things like Delta Lake and so on,
maybe just walk us through kind of, I suppose, the initial product direction of kind of Databricks and how it came about.
And really, I suppose, fundamentally, why Spark is interesting, really.
What's it about?
And why was it so good that
a company formed around it, really? Yeah, so Spark kind of came about whenever... the
first big data open source project was Hadoop, and Hadoop had been on the scene for
probably, I don't know, maybe five years or something like that whenever Databricks was
founded. And the challenge, though, was that Hadoop was really built for on-prem data centers.
So if you buy a bunch of machines, you have them up and running 24-7
and you need to do some sort of distributed processing, Hadoop was good at that.
But then whenever the cloud came along, that model kind of like didn't work very well
because you didn't want to have machines up and running 24 by 7 in the cloud
because you paid for them by time.
So you'd want to take your data and store it in object storage like S3 on AWS.
And then you'd spin up the compute just whenever you wanted to do some processing,
whether that's for machine learning or analytics or ETL or whatever.
And so in that model, Hadoop didn't really fit well because its storage runs on EC2 machines
and wants those things to be up and running all
the time. So Spark was really good because it kind of separated that computing from the storage
layer. The other thing was that it was really fast. You could run the same routine that you
ran on Hadoop and it would run a hundred times faster on Spark. And the other thing was because
it used Scala as the basis for it rather than
Java, you could actually write the same program in like a 10th of the number of lines of code.
So it got famous for being able to write the same program in fewer lines of code and run
a hundred times faster. And most of what it was being used for was either distributed machine
learning, or you've got so much data, you can't train your model on one single machine.
Or ETL.
And there were a lot of customers in the early days where they basically just had a whole bunch of JSON files sitting on S3 that they got from some application, either mobile or web or whatever, that was dumping it there.
And they wanted to do some analysis on it.
And a lot of what we did in the early days was just helping these customers analyze this JSON data. And we went through a lot of different optimization techniques and it kind of coalesced to like, well, the first thing you would want to do is basically copy
that JSON data into the Parquet format and then run your analysis on those Parquet files.
And if you were to go back to the Spark Summit talks around 2015 to 2018,
everybody's given some sort of talk about, you know, converting data to Parquet and
what are the best techniques to do that? What's the right file size to have for your Parquet files?
How do you make sure that you can recover if your big ETL job fails partway through?
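A minimal sketch of the early pattern Jason describes (land raw JSON, convert it once to Parquet, then query the Parquet copy), assuming a notebook-style spark session; the S3 paths are placeholders.

```python
# One-off conversion: read the raw JSON dump and rewrite it as Parquet.
raw = spark.read.json("s3://my-bucket/raw/app-events/")
raw.write.mode("overwrite").parquet("s3://my-bucket/curated/app-events-parquet/")

# Analysis then runs against the columnar Parquet copy, which is far faster to scan.
events = spark.read.parquet("s3://my-bucket/curated/app-events-parquet/")
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) AS clicks FROM events WHERE event_type = 'click'").show()
```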
And there were a number of things. We actually came out with this thing called DBIO, which was our first stab at doing some sort of atomic operation across
all these different files. Because when you're writing a big job out, you could be writing
thousands of files. And if something happens, you know, like your network goes down or
whatever, and your job doesn't complete, then you don't want to be left with all these half-written
files out there; you want to have a way to roll those back. And that kind of
led us to developing Delta Lake. So we were working with one of our customers, Apple
Computer, and they basically have to do network intrusion analysis across all the network points
on their network. So if they find
an intruder, they need to be able to work their way backwards to find out where this intruder came
in. And the solution that they were using couldn't really handle the data volumes. They were getting
petabytes of data every week, I think. And so they could only go back two weeks worth of data.
And so we kind of co-developed the solution Delta Lake with them and it allowed
them to go back years worth of data to find out, you know, if an intruder came in six months ago.
So that was kind of like the basis. And what Delta Lake does is instead of like thinking about files
on S3, you start thinking about tables. So it's like more of a logical construct. So instead of
thinking about, oh, I need to write all these files and then, you know, commit them and make sure that nothing's trying to read these files at the same time.
It basically allows you to just start to think about inserts, updates, deletes,
merge statements, just like a database really. And you can do it, you know, at scale. So you
can do it on tables that are petabytes big. It was optimized for streaming because with Apple's
use case,
they needed to be able to stream in
all this network information as it came
because there wasn't going to be a batch window
big enough for them to process it.
So it's kind of been engineered from the ground up for scale
to be able to support real-time streaming use cases
and to make sure that you can do other things
like time travel.
So if you need to run a query on a table
as that table existed
at a certain point in the past, you can do that. And we built it with them. Then we ended up
open sourcing it. And it's kind of become the de facto standard now for data lakes.
We have, I mean, just us alone, we have probably over 7,000 or 8,000 customers that are using Delta Lake in production today.
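A minimal sketch of the table-level operations Jason describes, using the open source Delta Lake Python API with a spark session already configured for Delta; the table, key, and path names are placeholders.

```python
from delta.tables import DeltaTable

# Treat the files in object storage as a logical table and MERGE new records in,
# rather than managing individual files and half-finished writes yourself.
target = DeltaTable.forName(spark, "network_events")
updates = spark.read.json("s3://my-bucket/incoming/")      # placeholder source

(target.alias("t")
       .merge(updates.alias("u"), "t.event_id = u.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: query the table as it existed at an earlier version of the transaction log.
spark.sql("SELECT count(*) FROM network_events VERSION AS OF 10").show()
```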
So I've got a follow-up to that,
Jason, and it's probably a question you hear quite a bit, but it seems like the Parquet Plus movement kind of had a whole bunch of entrants all at the same time. So you've got Delta,
you have Iceberg, you have Hudi, and there's probably several others that I'm missing there.
And if you were to sort of bake off there, tell us what does Delta have?
And maybe it's very similar to those formats, but what does Delta have besides the backing of Spark,
right? What does Delta have that those other formats maybe don't?
Yeah. I mean, I think you nailed all the big ones. There's basically Apache Iceberg,
which was created by some people at Netflix, and then Hudi, which was created by some people at Uber.
And those formats combined with Delta Lake, they basically form what you might consider the lake house.
So there's actually a really good white paper that Michael Armbrust and some others created on the lakehouse.
And it basically shows how these formats help enable these types of use cases. In terms of differences, I think if you look at
Iceberg's main page, their claim to fame is that it's for slow-moving large
datasets. Whereas I think Delta can do that as well, but we've been engineered for
streaming from the beginning as well. So you can basically stream into a Delta table at a much lower
latency than you could something like Iceberg, for example. Then there are
differences in functionality where, you know, I think Hudi and Delta Lake have
this ability to do change data feed, where you can actually query the transaction log to see the
actual inserts and updates, but Iceberg doesn't have that functionality, though they'll probably
add it at some point. So there are a lot of similarities across these formats, but I would
say Delta Lake is a bit more performant, like in every benchmark, you know, there's a lot more
performance given by the Delta Lake format. And I would say also we're just a bit more stable. So
the architecture for Delta Lake is that the manifest, or the
metadata, is mastered within the bucket itself, rather than Iceberg, where it relies on a tight
coupling with the catalog. So you have to have some sort of catalog to use Iceberg. And whenever
you're doing concurrent writes to the table in Iceberg, there's potential for corruption because
you might have a heartbeat that gets out of sync. Whereas with Delta Lake, there's no such heartbeat. So
there are some finer details once you start working with it at scale.
And if you look at, you know, Hudi's claim to fame, it was the streaming file system for data lakes,
right? So incorporation of offsets and those sorts of things. But Delta Lake enables
Delta Live Tables and your approach to trying to democratize streaming
for your average data engineer. So it must have a similar concept built in as well to enable
streaming, correct? Yeah, I'm less familiar with the Hudi streaming, but with Delta Lake, we basically make it easy to write short commits.
So basically, whenever you're doing streaming, there's this balance
between throughput and latency. The more throughput you care about, the bigger the files you can
write. And the more you care about latency, the smaller the files you end up writing. And then those smaller files end up having a negative impact on
downstream query performance. And so what you can do is you can have this balance with Delta Lake
of how big you want to write those files. And then the longer you wait to write them, the longer
your latency is. But if you want really short latency, you can just write them as is.
And then there's a compaction routine that you can run every so often, which will basically
compact those smaller files into larger files.
And it's all done transactionally.
So you can be doing this at the same time as you're streaming data in, at the same time
as people are querying those data sets.
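A hedged sketch of the throughput-versus-latency trade-off Jason describes, using Structured Streaming into a Delta table plus a periodic compaction; the paths, schema, and trigger interval are illustrative only.

```python
# Stream raw JSON files into a Delta table with a short trigger interval:
# low latency, but lots of small files.
stream = (spark.readStream
               .format("json")
               .schema("event_id STRING, ts TIMESTAMP, payload STRING")
               .load("s3://my-bucket/incoming/"))

(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
       .trigger(processingTime="10 seconds")
       .toTable("events"))

# Later, on a schedule: compact the small files into larger ones. Because Delta is
# transactional, this can run while the stream keeps writing and readers keep querying.
spark.sql("OPTIMIZE events")
```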
So there's been a lot.
You're talking there about tables and so on and so forth.
And I'd like to kind of jump ahead a little bit now,
and Stuart again might take over a little bit in this conversation,
but you mentioned data lakehouses there,
and there's also things like Photon
and other new features in Databricks
that give it some of the characteristics
of maybe a relational database.
So maybe just kick off with just,
maybe explain what a data lakehouse is.
Okay, I think we can guess in some respects,
but it's got a fairly precise definition, I believe.
And then why and what are Databricks doing to give your product,
I suppose, more relational database type features?
What's the strategy behind that and so on?
Yeah, so in the beginning,
there was just data lakes. And even though people were using Parquet to kind of save
structured data in data lakes, the data lakes, they're really good for machine learning because
typically when you build these machine learning models, you want to have access to the most
granular data. So if I'm building machine learning for a website, I want to know every single click that somebody is making to train a model.
Or likewise, one of our customers is Riot Games, and they build these machine learning models to prevent churn, and they need to know every single move that a player is making to be able to train those models.
So so that was really good. But then if you wanted to run traditional data warehouse style queries or business intelligence queries on top of the data lake, historically, it was kind of slow, to be
honest. So we've had the ability to run SQL against data lakes on Databricks since day one,
but it only really worked really well if you had a super large data set because there wasn't many
alternatives available. But what we've done is we've kind of like made the performance better
and better and making it better on data sets that are smaller and smaller. So now you can actually
have equivalent performance querying a data lake as you could if you queried a traditional data
warehouse on a database. And the things that really make that possible are the open source
Delta Lake project. So as you save
that data out and you save it with this Delta Lake metadata layer, you can do things like
centralize the statistics. So for every column, I know what the min and max values are for every
file. So I know which files I can skip reading whenever I go to run a query, for example.
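A small illustration of the file-skipping idea Jason mentions: Delta records per-file column min/max statistics in its log, so a selective predicate only reads files whose range could match, and OPTIMIZE ... ZORDER BY (a Delta/Databricks command) clusters the data so those statistics become more selective. The table and column names are placeholders.

```python
# Cluster the table on a commonly filtered column so per-file min/max ranges are narrow.
spark.sql("OPTIMIZE web_events ZORDER BY (event_date)")

# This filter can now skip any file whose recorded min/max for event_date
# doesn't cover the requested day, instead of scanning everything.
spark.sql("""
    SELECT count(*) AS events
    FROM web_events
    WHERE event_date = '2023-05-01'
""").show()
```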
And then Photon is really the thing that allowed us
to have the next level of performance.
So Photon, it's a custom-built engine
where we basically kept the API of Spark the same,
and then we created a whole new implementation
of Spark using native code.
So most of Spark is written in a combination of Scala
and Java, which compiles to the JVM.
And then, you know, Spark has always done some memory mapping, where it uses Project Tungsten to allocate memory directly without having to go through the JVM.
But even with that, you're still kind of limited in what you can do with the JVM.
So with Photon, we basically use C++ and we use vectorization techniques to take advantage of modern CPU chipsets with SIMD instructions.
And then we can basically process a chunk of records at a time rather than one record at a time.
And, you know, that combined with Delta, it allows us to basically take the same workload that you would run on open source Spark and then run it on Photon.
And it'll be, you know,
on average about seven or eight times faster. So that's what gives the performance benefit.
And we were able actually to prove this last year where we submitted the TPC-DS benchmark,
and we beat the prior record holders, which was Alibaba. And we were able to prove that you could
have, you know, the same, if not better, data warehouse
performance on top of a data lake, and you'd get the price benefit of the economics of a data lake.
So I think the concept of the lake house is allowing you to basically run your data warehouse
workloads on the same data lake infrastructure that you've already had. Is Photon defined at
cluster creation? And if so,
do you choose Photon when you're defining a cluster? And if that's the case, then are there
any trade-offs to using Photon versus perhaps the more standard JVM? So whenever you go to create a
cluster in Databricks, there are actually different cluster types.
So we have one product called Databricks SQL, which is purpose-built for data warehouse style queries. So whenever you create a Databricks SQL
endpoint, it's actually, it's always using Photon. You don't really have a choice.
And then whenever you go to create another type of cluster, like an ad hoc, either interactive or
ETL cluster, you can choose to enable Photon or not. It's really just a checkbox where you check
it or not. There's really not any downside to using it or not using it. I mean, you basically
get better performance. And then we do have like, you know, for these ad hoc clusters, there is like
a multiplier. So we charge more money because you're getting better performance. So for some
people, you know, if you don't really care how fast your job runs, it takes 30 minutes and you're
fine with that because you're running it once a day, you might be better off by
just not checking the Photon box. But if it does speed it up, like at least 10x,
you'd actually be saving more money by checking it. And we kind of built it gradually over the years. And the way we did it
was, for every operator that Spark hits whenever it's executing a query plan, we basically
would photonize it or, you know, rewrite it in C++. And if you're executing a query and it
hits one of these operators that isn't photonized, it'll just fall back to the JVM version. And then
over time, we've just built out, you know, all the operators that are the most important. So now
almost every run doesn't fall back to the JVM. It just runs straight through Photon.
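A hedged sketch of the "checkbox" Jason refers to, expressed as a call to the Databricks Clusters REST API; the workspace URL, token, runtime string, and node type are placeholders, and the runtime_engine field is how the public API exposes the Photon choice.

```python
import requests

payload = {
    "cluster_name": "etl-with-photon",          # hypothetical name
    "spark_version": "13.3.x-scala2.12",        # example runtime string
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "runtime_engine": "PHOTON",                 # or "STANDARD" to stay on the JVM engine
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",   # placeholder workspace
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```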
So I was interested when Delta Live Tables hit, Jason. So, you know, the idea
that, you know, something similar to what I had been doing in
the past with KSQL and Kafka, the ability to sort of compose tables, declare tables,
and the pipelines that execute them. One of the things that really surprised me as I was digging
in is just how much you guys were invested in this lake house
and these traditional data warehouse workloads
such that live tables had semantics
for slowly changing dimensions
and some of those sort of core data warehousing techniques.
Maybe if you could talk a little bit about
how much of a focus it is to expose semantics like
that, because, you know, traditionally slowly changing dimensions are kind of hard to do,
and you guys have just kind of packaged it up with some semantics. Maybe talk about your vision
for trying to capture those workloads and make them easier to do, you know,
really purely in SQL. Yeah, I would say most of the workloads that Databricks processes are, you know,
ETL focused in some way or the other.
And so people have just been using our platform for that for a long time.
And DLT is, or Delta Live Tables, it's basically a way to simplify that and make it work at scale.
So one of the things that we do is, you mentioned this slowly
changing dimension. So SCD Type 2, it's not that it's so much hard as it's just kind of
tedious. You end up writing the same boilerplate code for all these different tables. And so by
having an API that does that for you, it makes it a little bit more bulletproof, but it just
allows people to be more productive. And it also has a way to specify whatever the column is in the
data that you're ordering by. So that way, if you have late-arriving data, it can automatically
put that record in the right place. So that's one thing.
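A hedged sketch of the SCD Type 2 API Jason describes, using the Delta Live Tables Python interface (dlt.apply_changes); the source view, keys, and ordering column are placeholders, and the exact argument names should be checked against the DLT docs.

```python
import dlt
from pyspark.sql.functions import col

# Target table the pipeline maintains for you.
dlt.create_streaming_table("customers_scd2")

# Declare the change feed once; DLT writes the Type 2 history (including
# late-arriving records, ordered by `updated_at`) instead of hand-written MERGE logic.
dlt.apply_changes(
    target="customers_scd2",
    source="customer_updates",          # an upstream streaming view in the pipeline
    keys=["customer_id"],
    sequence_by=col("updated_at"),
    stored_as_scd_type=2,
)
```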
The other thing that we're doing is we've been investing a lot in this thing called Enzyme. So Enzyme is the code name for what you can think of as an optimizer for ETL.
So just like database engines have optimizers for queries, what we want to have is an optimizer for ETL, where if you're a data engineer and you're building a pipeline, you have to think about the best update strategy for your target. So you might choose to
basically only insert data, or you might choose to do a merge statement on data, or it could be
that you've architected it in a way where you can drop a partition and recreate it, or drop the whole
table and recreate it. But traditionally, the data engineer has to think about the best way to
do that. With Delta Live Tables and Enzyme, we want the optimizer to choose the most optimal path.
And what that results in is, it's essentially kind of like materialized views,
where you're creating a table with a CTAS statement,
and then you basically just say the select of what you want that table to materialize as,
and then Enzyme makes sure that it's always
keeping that output as up-to-date as possible using the least amount of compute possible.
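A minimal sketch of that declarative, CTAS-like style, again using the DLT Python API; the dataset names are made up, and the point is that you declare the result while the pipeline (and, per Jason, Enzyme) decides how to keep it up to date.

```python
import dlt

@dlt.table(name="daily_order_totals", comment="Orders aggregated by day")
def daily_order_totals():
    # You state what the table should contain; you don't code the update strategy.
    return (dlt.read("orders")
               .groupBy("order_date")
               .sum("amount"))
```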
We've also kind of like worked on enhanced auto-scaling. So because we kind of like have
knowledge of the data flow whenever you execute one of these Delta Live tables, you can actually
have some knowledge on what the scale of the cluster
should be. So if you look at how much data has yet to be processed, you can use that information to
scale out the cluster in advance. So that way, if you have any sort of seasonality in your data,
like maybe you're a retail customer and you get more data in December than you do February,
it'll just automatically scale the cluster out to meet the
SLAs rather than having to think about how big of a cluster do I need in November, December,
and then encoding that in somewhere. So those are some of the things where we're building a DLT
where you don't really have to think about things so much. It just makes you more productive.
And I think it's going to be really interesting. Later this year, we'll have serverless as part
of the DLT offering. I think the biggest benefit is, whenever you build a DLT job, you can write it in
either SQL or Python, and it could be streaming or batch, it doesn't really matter. But you don't
have to think about what version of Spark you're on. You just write this,
and then it just runs on our DLT clusters, and then there's no more, like, upgrading to the next
version of Spark. It just automatically happens for you.
So is the strategy here then to, I suppose, be,
if you think about someone who is considering a platform to run data warehouse style workloads
and maybe sort of data science workloads,
they've got choices out there.
They've got, for example, Snowpark and Snowflake.
They've got maybe BigQuery
and some of the features in there. And you've got Databricks. Would you say that these platforms
are kind of interchangeable now? Or are there certain workloads and certain personas that are
better suited to working with Databricks, do you think? Well, I mean, the whole point of the
lake house is you should have, you know, one platform that you can do all of your different
use cases well at scale.
And you can't really have a lake house unless it's using open formats and open technology.
So we've standardized on Delta Lake as this open format. We also leverage Spark as the open API for
doing the processing. And you can only really do machine learning at scale if you're using, you know, these open technologies.
So, like, I think BigQuery, they have some machine learning that's built in.
So with BigQuery, you could run something probably distributed at scale for whatever algorithms that they've baked into BigQuery.
But you can't, like, add the newest, latest algorithm that somebody created in open source.
So they don't really have that option.
And the other thing is, I think most people who are training these models, the data scientists, they're not really doing that in SQL.
They want some sort of data frame API or interface to be able to do that, which is why I think Python has become so popular.
Because first with pandas and then other data frame APIs,
but that's how data scientists prefer to work.
I know Snowflake, they've released Snowpark recently.
They want people to think that it's really good for machine learning,
but it really can't do distributed machine learning.
I mean, all Snowpark really does is it just takes a DataFrame API and then
translates that to SQL and then runs that SQL against the same Snowflake endpoint that was
there before. So, you know, I think it does help some with, like, if you've basically got some
code and you want to turn that into a UDF, Snowpark will turn it into a UDF and then call that
UDF from the SQL. But, I mean, otherwise it's not really doing anything special. So if you wanted to train a model, and
that model and the data you need to train that model, you know, if you can't fit it within the
memory on one of the nodes in Snowpark, it'll just fall over. So, you know, if you're training a lot
of, you know, really small models, it might work. But the large models, it's not going to work.
And I think they also don't have the ability to price that stuff at a reasonable rate.
I think their Snowpark stuff is actually charged like 50% more than their regular clusters.
So I think people are really just going to want a single, integrated, unified experience
with all these different use cases.
Yeah, I was just going to ask, you know, we've got Photon, we've got Enzyme.
Great names, by the way.
I'd love to meet the branding side of Databricks who comes up with these names.
An extreme BI.
Don't Google that.
Don't Google that.
So a little joke there, Jason.
Mark and I used to work for the same company,
and he's never let me live down that name that I came up with.
But maybe tell us about what's coming next, right?
So we've seen some of these.
You've talked about some of these great advancements that you've been making.
What's over the horizon?
You know, we won't hold you to this, you know, safe harbor.
Yeah.
But what are some of the great things that we're going to see from the platform?
For instance, through acquisition, you've added analytics.
And curious how, you know, we might see that evolve, you know, the ability to actually
not even have to have an analytics tool to some degree, if you invest in Databricks. So maybe
talk to us about what we'll see in the future and what you guys are focusing on next.
Yeah, I think, you know, if I think about just kind of like the spectrum of stuff, like
on the machine learning side, we actually, so we just announced on Tuesday, we went GA with our model serving layer.
So this is basically a serverless model serving layer, which allows you to take a model that's built with MLflow, which is the open source project that we created to encapsulate the whole lifecycle of machine learning.
You can basically serve that as a REST endpoint
and have people hit that with their applications very easily.
It integrates well with Feature Store,
which we came out with earlier.
And the Feature Store is, if you think about it,
you've got all these different data scientists
that are training these models.
They're inevitably going to end up training models
based on the same features or derivative of data that
they've created. So having a feature store allows you to register these features in a central place.
And then you can basically monitor the use of these features and then leverage them within
something like the model serving layer. So for example, you might have a feature that says,
you know, what's the number of purchases this customer has made over the last, you know, seven days or 30 days.
And then that feature would just continuously get updated as more sales come in.
So that way, whenever the model gets served, you can basically have the most accurate up-to-date information without having to, you know, think about all the different pipelines you'd have to create to do that.
So we're basically just making that loop very easy and very scalable.
At the end of the day, models are just basically code plus data.
So being able to unify that in a single platform is what we're working on.
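A hedged sketch of what calling a served model might look like from an application, assuming a Databricks serverless model serving endpoint as Jason describes; the endpoint name, workspace URL, and input columns are placeholders, and the request shape follows the documented dataframe_records convention.

```python
import requests

workspace = "https://<workspace-url>"                    # placeholder
endpoint = f"{workspace}/serving-endpoints/churn-model/invocations"

payload = {
    "dataframe_records": [
        # Raw request features; feature-store lookups (e.g. purchases over the
        # last 7 days) can be joined in on the serving side.
        {"customer_id": "C-1001", "plan": "pro"},
    ]
}

resp = requests.post(
    endpoint,
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())   # e.g. {"predictions": [0.87]}
```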
I think a big thing is, we announced at Data and AI Summit the marketplace we're going to create.
So we want to create a data marketplace, which is not going to be just data, but everything
that you want to share.
So data assets, but also notebooks, machine learning models.
So being able to have a marketplace where third parties can basically kind of register
these things and then make them available to anyone who
wants to use them in Databricks. But also, you don't even have to be on Databricks,
because the basis of all that sharing is Delta Sharing, which is an open source
protocol for being able to do data sharing. You know, as long as you're using a
utility that can support the Delta Sharing client, and we've already got that built into,
you know, Spark as well as Java,
and I think Power BI has it built into their tool, you can just basically attach to these Delta
shares.
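A minimal sketch of the open Delta Sharing client Jason mentions, using the delta-sharing Python connector; the profile file and the share/schema/table names are placeholders that the data provider would supply.

```python
import delta_sharing

# Credentials file issued by whoever is sharing the data with you.
profile = "/path/to/provider-config.share"

# share.schema.table coordinates inside that share (hypothetical names).
table_url = f"{profile}#retail_share.sales.orders"

# Read the shared Delta table into pandas without needing a Databricks workspace.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```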
So that's something that we're going to be focused on a lot between now and the next Data and AI Summit. Another big... Can I kind of pause you there for one second? And I don't want
you to lose your train of thought, but I wanted to drill in for a second on the model serving in the feature store.
So, you know, a poor person's feature store is a table, right?
So and the features are columns in the table.
What does the feature store give us on top of that?
Is it something like versioning?
Is it the ability to declare them as features?
What would you say there?
Yeah, I guess some versioning.
And usually with like features, there's like an offline serving and an online serving. So offline is whenever you're basically training a model, you're probably going to be doing something in batch.
And you kind of need to be able to grab these features in a way that scales for batch.
But with online, it's usually like, hey, I've got a,
you can imagine somebody clicking on an app or a website,
and that click is going to basically fire off a call to an inference layer
to give you the result back, a score of some sort.
You need that to be like super quick and super responsive.
So that's like an example of an online feature serving layer
where you need something that's basically a key value lookup that's really quick.
And so with a feature store, you want something that can basically do both of those and not have
to force the engineer to think about how to keep them in sync, but basically keep them in sync
automatically. So that's...
And maybe... Yeah. Oh, sorry. And maybe tell us: before you introduced model serving, to give some folks an understanding of what kind of value that brings,
how would you have had to do that prior to you guys introducing this?
Yeah, there were a lot of different methods. If you think about it, with machine learning, there's training a model and then there's the inference of that model.
So after you've trained a model, if you want to infer, what a lot of people do is they would do it in batch at night.
So depending on what you're trying to predict, you could come up with, you know, every single possibility, every combination of input values, and then run that through the model and get the
output and then just save that output in some sort of key value lookup. So that way, whenever the
application needs to do a lookup the next day, they've got, you know, just something that's
pre-calculated. So that was one way. Streaming was another way that people do it. Just, you know,
if you're doing streaming, you can kind of do that lookup along the way. There was also, we had integrations with MLflow where, you know, if you
have a model that's registered in MLflow, you can use other services that implement MLflow. So like
Azure ML or SageMaker in AWS, where they've got their own online serving where you can basically
just host that. So it kind of depended on the use case, but the nice thing about MLflow and open source
is like so many people have implemented
what's called flavors to do the serving and the hosting
that you're kind of like spoiled for choice
if you have a different environment
where you need to host it.
Fantastic.
Well, we're almost out of time now, Jason.
And so how do people find out more about Databricks
if they're interested in looking further?
Yeah, I would say, you know, Databricks.com.
You can go there.
We've got an easy way to start up a trial.
And, you know, I think you get like two weeks free or something like that where you can basically play around with it.
We have no shortage of blogs.
I can't keep up with all the blogs we put out, to be honest.
It's a lot.
And we've got got videos as well.
And then if you're on the open source side of the house,
we've got big communities around MLflow and Delta Lake and Spark.
And there are Slack channels for all of those, too.
So you can kind of get involved with the community as well.
And I'll just make a call out there too, Mark, as well.
One of the nice things about a notebook interface is
that there's a lot of publicly available notebooks that you can open right up into Databricks and
you can execute them end to end, assuming that the data the notebook
is using is public somewhere. So instead of having to copy and paste scripts or check out scripts or whatever, a lot of times when you Google something, like how is X, Y, or
Z done in Databricks,
what your result is going to be is a series of notebooks where it's been done, that you
can just open up and, I'll say, play around with, but, you know, quite possibly
take through iterations into something that you could take to production, which is really nice. Oh, that reminds me, I almost forgot, there's a
website we just launched called dbdemos.ai. And so if you go to dbdemos.ai,
it's a website which hosts a whole bunch of these different demos using notebooks,
like you just mentioned. And if you've got Databricks, you basically just run one line of code
in your notebook
and it'll automatically download and import
whatever the demo is that you're looking at
and get you up and running,
so you can just try this stuff out on your own.
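The "one line of code" Jason mentions presumably looks something like the following, assuming the dbdemos package installed in a Databricks notebook; the demo name here is just an example and should be checked against the catalogue on dbdemos.ai.

```python
# In a Databricks notebook, first: %pip install dbdemos
import dbdemos

# Downloads and imports the chosen demo's notebooks and sample data into your workspace.
dbdemos.install("lakehouse-retail-c360")
```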
Fantastic.
Well, thank you, Stuart.
And thank you, Jason.
That's fantastic.
Lovely to hear about the story about Databricks
and thank you for the very in-depth
kind of explanation there.
No question we had could catch you out.
So thank you.
Very, very knowledgeable.
Thank you very much.
So thanks very much and take care.
And it's great having you on the show.
Oh, thank you so much, Stuart.
And thank you, Mark.
It was a pleasure to get to meet you guys and love to do it again sometime.
Yeah, thanks, Jason.
It was a pleasure.
Thanks so much.