Drill to Detail - Drill to Detail Ep.104 ‘Delta Lakes, Tiger Temples and the Databricks Lakehouse Platform’ with Special Guests Jason Pohl and Stewart Bryson
Episode Date: May 17, 2023
Mark Rittman is joined by Jason Pohl, Senior Director of Data Management at Databricks, along with special co-host and Head of Customer Experience at Coalesce, Stewart Bryson, to talk about data lakehouses, Databricks and helping to raise tigers in a Thai Tiger Temple.
Databricks - A History
What is a Delta Lake?
The Databricks Lakehouse Platform
Databricks SQL
Jason Pohl LinkedIn Profile
Lions and Tigers and a Bear named Maam Maam
Stewart Bryson Github profile
Coalesce.io
Transcript
Before we get into all the kind of Databricks goodies, something that's far more interesting
is the fact that you lived in London before and you raised tigers.
My life seems so boring right now. Hello and welcome to another episode of the Drill to Detail podcast.
And I'm your host, Mark Rittman.
And today we've got a special co-host, Stewart Bryson.
Thanks, Mark. Really appreciate you having me on here.
We're walking a thin line. People are going to get tired of hearing my voice.
But hey, I'm Stewart Bryson, and long-time listeners of this show
will know me as the founder and former CEO of Red Pill Analytics. I think most of the podcasts that
I've done with you were during that time, and you've done a few since. And we're pleased
to be joined today by Jason Pohl from Databricks. And we're going to
dive into Databricks. Do you want to tell us, Jason, a little bit about what you do there?
And tell us a little bit about Databricks as well. Yeah. So thanks for having me on the program. I
really appreciate it. This is pretty exciting for me. I'm Jason Pohl. I'm a Senior Director of Data
Management at Databricks. So I lead a team of specialists, and we focus on bringing to market new products for Databricks, anything around data engineering, data governance, or data warehousing.
And I was also one of the first 10 solution architects at Databricks.
So I've been at Databricks for about seven and a half years now, and it's been a wild ride so far.
So it's been great to kind of see the product evolve since then.
And I'll give just a short plug.
So Databricks is a company that was co-founded by the original co-creators of Apache Spark.
And we've since gone on to create other open source projects like MLflow for machine learning,
as well as Delta Lake for data lake analytics.
And we basically have created these open source projects to become the
de facto standard in their domain. And then we make sure that Databricks is the best place to
run these open source projects. And we kind of like cover a whole gamut of use cases, anything
from data science and machine learning to streaming, ETL, data warehousing, and we do it all
on a single platform, which we call the Lakehouse platform,
which I think we're going to get into a little bit today.
Before we get into all the kind of Databricks goodies, something that's far more interesting
is the fact that you lived in London before and you raised tigers. So give us the kind of
backstory then. How did you go from raising tigers to working at Databricks?
Yeah, so... oh, you're looking at my LinkedIn. Yeah, so I moved
to London, was it in 2001, and I was working for this small consultancy. It was
called Inforte, and then they rebranded, they got bought by a French consultancy called Business
& Decision, which is kind of funny because if you say it too quickly it sounds like Business
Indecision. But I was doing a lot of Siebel
CRM implementations, and I became the data migration guy. So I would migrate these
legacy systems into Siebel CRM. And then I ended up getting into data warehousing after that.
So I was building bespoke data warehouses with Informatica. And that kind of
piqued my interest in data warehousing in general.
And then after I was in London for, what is it, six or seven years, I ended up deciding to take the long way home.
So I sold my stuff and got a backpack and I backpacked around the world.
And one of the stops was in Thailand. I got to go to the Tiger Temple.
I went there as a tourist. I had seen this place on, there's a news program called 60 Minutes in the US, I saw that years ago. I was like, oh, if I'm ever in Thailand, I want to go there.
And so I went, and then they actually needed volunteers, and they offered
two meals a day and a place to sleep. So I was like, oh, this sounds like an
amazing deal. So I got my inflatable roll-up mattress from my backpack and
spread it out on the tiled floor.
And I basically spent a month there helping them raise tigers.
So they're monastic grounds.
So they take in all these unwanted animals.
They have all sorts of animals.
They had a few camels.
They had horses, water buffalo.
But they're most famous for their tigers.
And so I got to help basically feed these tigers and play around with them,
give them a bath and stuff like that.
Mostly the younger cubs,
but I got to work with some of the one-year-olds as well who were about a hundred pounds or something like that.
So it was great.
It was an amazing experience,
something that I don't think you could really replicate in the U.S.
Wow.
So I was about to say, how would you give a tiger a bath?
And I was going to say very carefully, but that would be the worst joke ever.
Nothing you say now, Jason, will be as interesting as the tigers, really. So you've kind of set a high bar now for the rest of the episode.
So, but yeah, I'm really keen to have someone from Databricks on, and Jason, thank you
very much for sort of volunteering to come on. And really, I want to try and understand,
or myself and Stewart want to try and understand, I suppose, the fundamental decisions behind the Databricks platform and the fundamental things
that make it different, and I suppose what makes it particularly suitable for the kind of workloads that you take
on. But Stewart, just very quickly, you know, what is it that interests you
about Databricks? Where did it fit into your world, really? Yeah, so going back, this is some time
ago, we did several Databricks projects when we were at Red Pill. A former employee of mine
went over to Databricks named Pete Tamason, shout out to that guy, and was just singing
its praises. So we sort of got invested and did a few projects and some proof of concepts.
I think the first thing that really stood out to me, and this is some time ago, is it was really
hard to run notebooks back then. And so the first thing was just how easily Databricks made it to
run notebooks. I mean, just cloud-based notebooks. I know it's more than that. They manage the
execution and they manage the cluster and the compute,
but just going in, grabbing a browser, starting to write some code and execute it. And then
additionally, you know, just anybody who's worked with notebooks knows this, but, you know, I wasn't
a notebook expert at that time. The ability to, you know, sort of in a single set of tasks, be able to jump from one language
to another. So the ability to build up some tables using SQL, to do some complex processing, I'm a JVM
developer at heart. And a lot of the ecosystem around Databricks was Scala based and still some
Java. And of course they had a lot of people using Python.
But just the ability to grab a code snippet of Scala
and work on using data frames and process some data
and then jump back into SQL.
And then the ability to, you know,
again, anyone who knows notebooks knows this,
but as I was coming up to speed on notebooks, the ability to visualize, just right inline, the results of your data processing, all in a single product or tool, and not having to jump around to a SQL editor
to go test your code,
and then go over to some sort of workbook solution
to visualize it.
I think that's what really got me hooked.
And then the idea of notebooks as being,
you know, then scheduled
and the actual way that you load your data is exactly
the same experience you went through as a developer. That's what all clicked with me.
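To make that flow concrete, here is a minimal sketch of the kind of mixed-language, inline-visualized notebook session Stewart describes, assuming a Databricks Python notebook where a SparkSession named spark and the notebook's display() helper already exist; the table and column names are made up for illustration.

```python
# Cell 1: build up a table with SQL (in a real notebook this could also live in a %sql cell).
spark.sql("""
    CREATE OR REPLACE TABLE daily_clicks AS
    SELECT date(event_ts) AS click_date, count(*) AS clicks
    FROM raw_events
    GROUP BY date(event_ts)
""")

# Cell 2: switch to the DataFrame API for more involved processing.
df = (spark.table("daily_clicks")
           .filter("clicks > 100")
           .orderBy("click_date"))

# Cell 3: visualize the result inline rather than exporting to a separate BI tool.
display(df)
```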
Yeah. I feel like, in the early days, we were kind of known as the, quote, notebook
company. So we went GA in, I think, July or August... no, July 2015,
but the company had been around for like two years before
that. And at the time, you know, there were IPython notebooks, those became Jupyter.
And then there was another one, Zeppelin, which people were playing around with, but, you know,
people were trying to do this notebook interface. So the demand was there,
like people wanted this stuff, but it was very flaky. If you tried to host this stuff yourself,
things would crash. No one would consider running a production job as a
notebook because you were scared about the instability of it. And so I think we
kind of got famous because we just had these notebooks and they just kind of worked, and you
could import and export Jupyter notebooks, and it actually was running an IPython kernel behind the scenes. So we just
got famous for that. And whenever I first joined, it was my first foray really into notebooks. And
I had kind of like done a lot of Java stuff with IDEs. And at first, I didn't really like it,
to be honest. It wasn't what I was used to. And then as I started to use it more,
what I originally would do is I would use it to prototype.
So, you know, anytime you're working with data, you've got to be able to run a query, get a result, investigate a little bit, profile it, and then go back and run another query.
And as you keep iterating through this, it's hard to do that in an IDE.
But then once you kind of get a sense of the shape and the profile of your data, I would switch over to my IDE and start to do some work there and then go back and forth.
but nowadays, like I, I hardly ever leave the notebook. I kind of really like it because it's,
it's a good place to do just what you were saying, the exploring the data and then visualizing it.
And I'm more of a data engineer than a data scientist. So I'm not really building these
statistical, you know, sophisticated models,
but we do have users that do that. And it's, it's been like a really good tool. And since then,
it's kind of allowed us to do other things. So we've had the ability to schedule
these notebooks as automated jobs for a long time. And then about a couple of years ago,
we introduced AutoML technology, the idea being it's like a glass box method, we call it, where you can take a table of data and select a column that you would want to predict.
And then you go into a wizard where you select a few other things, like what's most important for the prediction. And then basically,
as you finish that wizard,
it's going to automatically generate a bunch of notebooks and then run them
and try to measure which notebook had the best fit, and then give that to
the data scientist.
Like, say, here's a working model.
Here's all the code in the notebook.
Here's the runs that we
did and the best parameters that we tried to tune it with. You can basically take this notebook from
here and then try to make it better using, you know, all the domain expertise that that data
scientist would have. So we were able to build that because we had this notebook infrastructure
already in place. And it's really been a good launch. So it's kind of like hyperparameter
tuning at the notebook level. Yeah. Well, that's one of the things it does.
It does the hyperparameter tuning for you.
And, you know, you could probably hyperparameter tune to infinity, but it'll do enough to give you a good sense of where to start.
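What Jason describes might look roughly like the following, a hedged sketch using the databricks.automl Python API available on Databricks ML runtimes; the table name, target column, and the best_trial attribute access are illustrative and should be checked against the AutoML docs.

```python
from databricks import automl

df = spark.table("customer_features")      # hypothetical input table
summary = automl.classify(
    dataset=df,
    target_col="churned",                  # the column you want to predict
    timeout_minutes=30,
)

# AutoML generates trial notebooks, runs them, and reports which trial fit best;
# a data scientist can open that notebook and keep improving it by hand.
print(summary.best_trial.notebook_url)
```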
We see a lot of Databricks customers that will also jump out to the IDE, as you've described, and using your SDK, they'll build, you know,
whether it be user-defined functions or classes or, you know,
almost compartmentalize your code that way and then, you know,
plug that into your compute and reference that using a notebook. How often are your customers
doing that sort of, I'll call it non-notebook, hardcore IDE-based development, publishing
jar files, uploading those jars to the compute? If you had a rough estimate, how many customers
are using your product in that way? Oh, yeah, a lot. So in general, if you had to classify the buckets of who uses Databricks, you've got data scientists who
generally are gravitating more towards the notebooks. You've got data analysts who are
generally running SQL either through our UI or just connecting a BI tool. And then you've got data
engineers who are building these data pipelines. And a lot of them are using IDEs. And I think whenever I first
started, IntelliJ was probably the most popular, but now VS Code, because it's hosted
and it has Copilot, has become a really popular tool. So we actually just
recently released a plugin for VS Code, which allows you to easily basically just connect to
a Databricks cluster and iterate
with your code there and have some of the same elements that you would have within the
Databricks notebook, like access to clusters and jobs and everything within there to make
the experience seamless.
The other thing that we're doing is we're leading this effort within the Spark community
for something called Spark Connect.
And the idea is that one of the challenges
with Spark is whenever you run, like, a Scala program, you're basically running your application
code on the same JVM that's being used as the driver for Spark. And so it's hard to get that
separation. So Spark Connect is basically going to separate this so that everything you do,
from generating DataFrames, basically gets done within your local client. And then
the output of that is basically a query plan, and that gets shipped over the wire
using RPC to the Spark cluster. And this allows you to separate the creation of the
query plan from the execution of it, which means you'll be able to have Spark Connect in lots of
different languages, even JavaScript. So that's one of the things that we're working on with the Spark community.
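A minimal sketch of the Spark Connect model Jason outlines (available from Spark 3.4), where the client only builds the query plan and ships it over gRPC to a remote cluster; the endpoint address here is a placeholder.

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server instead of embedding the driver locally.
spark = (SparkSession.builder
         .remote("sc://spark-connect-host:15002")   # placeholder endpoint
         .getOrCreate())

# These DataFrame operations only build a plan on the client;
# execution happens on the server when an action like show() is called.
df = spark.range(1000).selectExpr("id", "id * 2 AS doubled")
df.show()
```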
Okay, so Jason, a question from me. So hypothetically, if somebody on this call
wasn't so familiar with Databricks and some of the things like Delta Lake and so on,
maybe just walk us through kind of, I suppose, the initial product direction of kind of Databricks and how it came about.
And really, I suppose, fundamentally, why Spark is interesting, really.
What's it about?
And why was it so good that
a company formed around it, really? Yeah, so Spark kind of came about whenever... the
first big data open source project was Hadoop, and Hadoop had been on the scene for
probably, I don't know, maybe five years or something like that whenever Databricks was
founded. And the challenge, though, was that Hadoop was really built for on-prem data centers.
So if you buy a bunch of machines, you have them up and running 24-7
and you need to do some sort of distributed processing, Hadoop was good at that.
But then whenever the cloud came along, that model kind of like didn't work very well
because you didn't want to have machines up and running 24 by 7 in the cloud
because you paid for them by time.
So you'd want to take your data and store it in object storage like S3 on AWS.
And then you'd spin up the compute just whenever you wanted to do some processing,
whether that's for machine learning or analytics or ETL or whatever.
And so in that model, Hadoop didn't really fit well because its storage runs on EC2 machines
and wants those things to be up and running all
the time. So Spark was really good because it kind of separated that computing from the storage
layer. The other thing was that it was really fast. You could run the same routine that you
ran on Hadoop and it would run a hundred times faster on Spark. And the other thing was because
it used Scala as the basis for it rather than
Java, you could actually write the same program in like a 10th of the number of lines of code.
So it got famous for being able to write the same program in fewer lines of code and run
a hundred times faster. And most of what it was being used for was either distributed machine
learning, or you've got so much data, you can't train your model on one single machine.
Or ETL.
And there were a lot of customers in the early days where they basically just had a whole bunch of JSON files sitting on S3 that they got from some application, either mobile or web or whatever, that was dumping it there.
And they wanted to do some analysis on it.
And a lot of what we did in the early days was just helping these customers analyze this JSON data. And we went through a lot of different optimization techniques and it kind of coalesced to like, well, the first thing you would want to do is basically copy
that JSON data into the Parquet format and then run your analysis on those Parquet files.
And if you were to go back to the Spark Summit talks around 2015 to 2018,
everybody's given some sort of talk about, you know, converting data to Parquet and
what are the best techniques to do that? What's the right file size to have for your Parquet files?
How do you make sure that you can recover if your big ETL job fails partway through?
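A minimal sketch of the early pattern Jason describes (land raw JSON, convert it once to Parquet, then query the Parquet copy), assuming a notebook-style spark session; the S3 paths are placeholders.

```python
# One-off conversion: read the raw JSON dump and rewrite it as Parquet.
raw = spark.read.json("s3://my-bucket/raw/app-events/")
raw.write.mode("overwrite").parquet("s3://my-bucket/curated/app-events-parquet/")

# Analysis then runs against the columnar Parquet copy, which is far faster to scan.
events = spark.read.parquet("s3://my-bucket/curated/app-events-parquet/")
events.createOrReplaceTempView("events")
spark.sql("SELECT count(*) AS clicks FROM events WHERE event_type = 'click'").show()
```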
And there were a number of things. We actually came out with this thing called DBIO, which was our first stab at doing some sort of atomic operation across
all these different files. Because when you're writing a big job out, you could be writing
thousands of files. And if something happens, you know, like your network goes down or
whatever, and your job doesn't complete, then you don't want to be left with all these half-written
files out there; you want to have a way to roll those back. And that kind of
led us to developing Delta Lake. So we were working with one of our customers, Apple
Computer, and they basically have to do network intrusion analysis across all the network points
on their network. So if they find
an intruder, they need to be able to work their way backwards to find out where this intruder came
in. And the solution that they were using couldn't really handle the data volumes. They were getting
petabytes of data every week, I think. And so they could only go back two weeks worth of data.
And so we kind of co-developed the solution Delta Lake with them and it allowed
them to go back years worth of data to find out, you know, if an intruder came in six months ago.
So that was kind of like the basis. And what Delta Lake does is instead of like thinking about files
on S3, you start thinking about tables. So it's like more of a logical construct. So instead of
thinking about, oh, I need to write all these files and then, you know, commit them and make sure that nothing's trying to read these files at the same time.
It basically allows you to just start to think about inserts, updates, deletes,
merge statements, just like a database really. And you can do it, you know, at scale. So you
can do it on tables that are petabytes big. It was optimized for streaming because with Apple's
use case,
they needed to be able to stream in
all this network information as it came
because there wasn't going to be a batch window
big enough for them to process it.
So it's kind of been engineered from the ground up for scale
to be able to support real-time streaming use cases
and to make sure that you can do other things
like time travel.
So if you need to run a query on a table
as that table existed
at a certain point in the past, you can do that. And we built it with them. Then we ended up
open sourcing it. And it's kind of become the de facto standard now for data lakes.
We have, I mean, just us alone, we have probably over 7,000 or 8,000 customers that are using Delta Lake in production today.
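A minimal sketch of the table-level operations Jason describes, using the open source Delta Lake Python API with a spark session already configured for Delta; the table, key, and path names are placeholders.

```python
from delta.tables import DeltaTable

# Treat the files in object storage as a logical table and MERGE new records in,
# rather than managing individual files and half-finished writes yourself.
target = DeltaTable.forName(spark, "network_events")
updates = spark.read.json("s3://my-bucket/incoming/")      # placeholder source

(target.alias("t")
       .merge(updates.alias("u"), "t.event_id = u.event_id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: query the table as it existed at an earlier version of the transaction log.
spark.sql("SELECT count(*) FROM network_events VERSION AS OF 10").show()
```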
So I've got a follow-up to that,
Jason, and it's probably a question you hear quite a bit, but it seems like the Parquet Plus movement kind of had a whole bunch of entrants all at the same time. So you've got Delta,
you have Iceberg, you have Hudi, and there's probably several others that I'm missing there.
And if you were to sort of bake off there, tell us what does Delta have?
And maybe it's very similar to those formats, but what does Delta have besides the backing of Spark,
right? What does Delta have that those other formats maybe don't?
Yeah. I mean, I think you nailed all the big ones. There's basically Apache Iceberg,
which was created by some people at Netflix, and then Hudi, which was created by some people at Uber.
And those formats combined with Delta Lake, they basically form what you might consider the lake house.
So there's actually a really good white paper that Michael Armbrust and some others created on the lakehouse.
And it basically shows how these formats help enable these types of use cases. In terms of differences, I think if you look at
Iceberg's main page, their claim to fame is that it's for slow-moving large
datasets. Whereas I think Delta can do that as well, but we've been engineered for
streaming from the beginning as well. So you can basically stream into a Delta table at a much lower
latency than you could something like Iceberg, for example. Then there are
differences in functionality where, you know, I think Hudi and Delta Lake have
this ability to do change data feed, where you can actually query the transaction log to see the
actual inserts and updates, but Iceberg doesn't have that functionality, though they'll probably
add it at some point. So there are a lot of similarities across these formats, but I would
say Delta Lake is a bit more performant, like in every benchmark, you know, there's a lot more
performance given by the Delta Lake format. And I would say also we're just a bit more stable. So
the architecture for Delta Lake is that the manifest, or the
metadata, is mastered within the bucket itself, rather than Iceberg, where it relies on a tight
coupling with the catalog. So you have to have some sort of catalog to use Iceberg. And whenever
you're doing concurrent writes to the table in Iceberg, there's potential for corruption because
you might have a heartbeat that gets out of sync. Whereas with Delta Lake, there's no such heartbeat. So
there are some finer details once you start working with it at scale.
And if you look at, you know, Hudi's claim to fame, it was the streaming file system for data lakes,
right? So incorporation of offsets and those sorts of things. But Delta Lake enables
Delta Live Tables and your approach to trying to democratize streaming
for your average data engineer. So it must have a similar concept built in as well to enable
streaming, correct? Yeah, I'm less familiar with the Hudi streaming, but with Delta Lake, we basically make it easy to write short commits.
So basically, whenever you're doing streaming, there's this balance
between throughput and latency. The more throughput you care about, the bigger the files you can
write. And the more you care about latency, the smaller the files you end up writing. And then those smaller files end up having a negative impact on
downstream query performance. And so what you can do is you can have this balance with Delta Lake
of how big you want to write those files. And then the longer you wait to write them, the longer
your latency is. But if you want really short latency, you can just write them as is.
And then there's a compaction routine that you can run every so often, which will basically
compact those smaller files into larger files.
And it's all done transactionally.
So you can be doing this at the same time as you're streaming data in, at the same time
as people are querying those data sets.
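A hedged sketch of the throughput-versus-latency trade-off Jason describes, using Structured Streaming into a Delta table plus a periodic compaction; the paths, schema, and trigger interval are illustrative only.

```python
# Stream raw JSON files into a Delta table with a short trigger interval:
# low latency, but lots of small files.
stream = (spark.readStream
               .format("json")
               .schema("event_id STRING, ts TIMESTAMP, payload STRING")
               .load("s3://my-bucket/incoming/"))

(stream.writeStream
       .format("delta")
       .option("checkpointLocation", "s3://my-bucket/_checkpoints/events")
       .trigger(processingTime="10 seconds")
       .toTable("events"))

# Later, on a schedule: compact the small files into larger ones. Because Delta is
# transactional, this can run while the stream keeps writing and readers keep querying.
spark.sql("OPTIMIZE events")
```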
So there's been a lot.
You're talking there about tables and so on and so forth.
And I'd like to kind of jump ahead a little bit now,
and Stuart again might take over a little bit in this conversation,
but you mentioned data lakehouses there,
and there's also things like Photon
and other new features in Databricks
that give it some of the characteristics
of maybe a relational database.
So maybe just kick off with just,
maybe explain what a data lakehouse is.
Okay, I think we can guess in some respects,
but it's got a fairly precise definition, I believe.
And then why and what are Databricks doing to give your product,
I suppose, more relational database type features?
What's the strategy behind that and so on?
Yeah, so in the beginning,
there was just data lakes. And even though people were using Parquet to kind of save
structured data in data lakes, the data lakes, they're really good for machine learning because
typically when you build these machine learning models, you want to have access to the most
granular data. So if I'm building machine learning for a website, I want to know every single click that somebody is making to train a model.
Or likewise, one of our customers is Riot Games, and they build these machine learning models to prevent churn, and they need to know every single move that a player is making to be able to train those models.
So so that was really good. But then if you wanted to run traditional data warehouse style queries or business intelligence queries on top of the data lake, historically, it was kind of slow, to be
honest. So we've had the ability to run SQL against data lakes on Databricks since day one,
but it only really worked really well if you had a super large data set because there wasn't many
alternatives available. But what we've done is we've kind of like made the performance better
and better and making it better on data sets that are smaller and smaller. So now you can actually
have equivalent performance querying a data lake as you could if you queried a traditional data
warehouse on a database. And the things that really make that possible are the open source
Delta Lake project. So as you save
that data out and you save it with this Delta Lake metadata layer, you can do things like
centralize the statistics. So for every column, I know what the min and max values are for every
file. So I know which files I can skip reading whenever I go to run a query, for example.
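A small illustration of the file-skipping idea Jason mentions: Delta records per-file column min/max statistics in its log, so a selective predicate only reads files whose range could match, and OPTIMIZE ... ZORDER BY (a Delta/Databricks command) clusters the data so those statistics become more selective. The table and column names are placeholders.

```python
# Cluster the table on a commonly filtered column so per-file min/max ranges are narrow.
spark.sql("OPTIMIZE web_events ZORDER BY (event_date)")

# This filter can now skip any file whose recorded min/max for event_date
# doesn't cover the requested day, instead of scanning everything.
spark.sql("""
    SELECT count(*) AS events
    FROM web_events
    WHERE event_date = '2023-05-01'
""").show()
```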
And then Photon is really the thing that allowed us
to have the next level of performance.
So Photon, it's a custom-built engine
where we basically kept the API of Spark the same,
and then we created a whole new implementation
of Spark using native code.
So most of Spark is written in a combination of Scala
and Java, which compiles to the JVM.
And then, you know, Spark has always done some memory mapping, where it uses Project Tungsten to allocate memory directly without having to go through the JVM.
But even with that, you're still kind of limited in what you can do with the JVM.
So with Photon, we basically use C++ and we use vectorization techniques to take advantage of modern CPU chipsets with SIMD instructions.
And then we can basically process a chunk of records at a time rather than one record at a time.
And, you know, that combined with Delta, it allows us to basically take the same workload that you would run on open source Spark and then run it on Photon.
And it'll be, you know,
on average about seven or eight times faster. So that's what gives the performance benefit.
And we were able actually to prove this last year where we submitted the TPC-DS benchmark,
and we beat the prior record holders, which was Alibaba. And we were able to prove that you could
have, you know, the same, if not better, data warehouse
performance on top of a data lake, and you'd get the price benefit of the economics of a data lake.
So I think the concept of the lake house is allowing you to basically run your data warehouse
workloads on the same data lake infrastructure that you've already had. Is Photon defined at
cluster creation? And if so,
do you choose Photon when you're defining a cluster? And if that's the case, then are there
any trade-offs to using Photon versus perhaps the more standard JVM? So whenever you go to create a
cluster in Databricks, there are actually different cluster types.
So we have one product called Databricks SQL, which is purpose-built for data warehouse style queries. So whenever you create a Databricks SQL
endpoint, it's actually, it's always using Photon. You don't really have a choice.
And then whenever you go to create another type of cluster, like an ad hoc, either interactive or
ETL cluster, you can choose to enable Photon or not. It's really just a checkbox where you check
it or not. There's really not any downside to using it or not using it. I mean, you basically
get better performance. And then we do have like, you know, for these ad hoc clusters, there is like
a multiplier. So we charge more money because you're getting better performance. So for some
people, you know, if you don't really care how fast your job runs, it takes 30 minutes and you're
fine with that because you're running it once a day, you might be better off by
just not checking the Photon box. But if it does speed it up, like at least 10x,
you'd actually be saving more money by checking it. And we kind of built it gradually over the years. And the way we did it
was, for every operator that Spark hits whenever it's executing a query plan, we basically
would photonize it or, you know, rewrite it in C++. And if you're executing a query and it
hits one of these operators that isn't photonized, it'll just fall back to the JVM version. And then
over time, we've just built out, you know, all the operators that are the most important. So now
almost every run doesn't fall back to the JVM. It just runs straight through Photon.
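A hedged sketch of the "checkbox" Jason refers to, expressed as a call to the Databricks Clusters REST API; the workspace URL, token, runtime string, and node type are placeholders, and the runtime_engine field is how the public API exposes the Photon choice.

```python
import requests

payload = {
    "cluster_name": "etl-with-photon",          # hypothetical name
    "spark_version": "13.3.x-scala2.12",        # example runtime string
    "node_type_id": "i3.xlarge",
    "num_workers": 4,
    "runtime_engine": "PHOTON",                 # or "STANDARD" to stay on the JVM engine
}

resp = requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",   # placeholder workspace
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```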
So I was interested when Delta Live Tables hit, Jason. So, you know, the idea
that, you know, something similar to what I had been doing in
the past with KSQL and Kafka, the ability to sort of compose tables, declare tables,
and the pipelines that execute them. One of the things that really surprised me as I was digging
in is just how much you guys were invested in this lake house
and these traditional data warehouse workloads
such that live tables had semantics
for slowly changing dimensions
and some of those sort of core data warehousing techniques.
Maybe if you could talk a little bit about
how much of a focus it is to expose semantics like
that, because, you know, traditionally slowly changing dimensions are kind of hard to do,
and you guys have just kind of packaged it up with some semantics. Maybe talk about your vision
for trying to capture those workloads and make them easier to do, you know,
really purely in SQL. Yeah, I would say most of the workloads that Databricks processes are, you know,
ETL focused in some way or the other.
And so people have just been using our platform for that for a long time.
And DLT is, or Delta Live Tables, it's basically a way to simplify that and make it work at scale.
So one of the things that we do is, you mentioned this slowly
changing dimension. So SCD Type 2, it's not that it's so much hard as it's just kind of
tedious. You end up writing the same boilerplate code for all these different tables. And so by
having an API that does that for you, it makes it a little bit more bulletproof, but it just
allows people to be more productive. And it also has a way to specify whatever the column is in the
data that you're ordering by. So that way, if you have late-arriving data, it can automatically
put that record in the right place. So that's one thing.
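A hedged sketch of the SCD Type 2 API Jason describes, using the Delta Live Tables Python interface (dlt.apply_changes); the source view, keys, and ordering column are placeholders, and the exact argument names should be checked against the DLT docs.

```python
import dlt
from pyspark.sql.functions import col

# Target table the pipeline maintains for you.
dlt.create_streaming_table("customers_scd2")

# Declare the change feed once; DLT writes the Type 2 history (including
# late-arriving records, ordered by `updated_at`) instead of hand-written MERGE logic.
dlt.apply_changes(
    target="customers_scd2",
    source="customer_updates",          # an upstream streaming view in the pipeline
    keys=["customer_id"],
    sequence_by=col("updated_at"),
    stored_as_scd_type=2,
)
```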
The other thing that we're doing is we've been investing a lot in this thing called Enzyme. So Enzyme is the code name for what you can think of as an optimizer for ETL.
So just like database engines have optimizers for queries, what we want to have is an optimizer for ETL, where if you're a data engineer and you're building a pipeline, you have to think about the best update strategy for your target. So you might choose to
basically only insert data, or you might choose to do a merge statement on data, or it could be
that you've architected it in a way where you can drop a partition and recreate it, or drop the whole
table and recreate it. But traditionally, the data engineer has to think about the best way to
do that. With Delta Live Tables and Enzyme, we want the optimizer to choose the most optimal path.
And what that results in is, it's essentially kind of like materialized views,
where you're creating a table with a CTAS statement,
and then you basically just say the select of what you want that table to materialize as,
and then Enzyme makes sure that it's always
keeping that output as up-to-date as possible using the least amount of compute possible.
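A minimal sketch of that declarative, CTAS-like style, again using the DLT Python API; the dataset names are made up, and the point is that you declare the result while the pipeline (and, per Jason, Enzyme) decides how to keep it up to date.

```python
import dlt

@dlt.table(name="daily_order_totals", comment="Orders aggregated by day")
def daily_order_totals():
    # You state what the table should contain; you don't code the update strategy.
    return (dlt.read("orders")
               .groupBy("order_date")
               .sum("amount"))
```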
We've also kind of like worked on enhanced auto-scaling. So because we kind of like have
knowledge of the data flow whenever you execute one of these Delta Live tables, you can actually
have some knowledge on what the scale of the cluster
should be. So if you look at how much data has yet to be processed, you can use that information to
scale out the cluster in advance. So that way, if you have any sort of seasonality in your data,
like maybe you're a retail customer and you get more data in December than you do February,
it'll just automatically scale the cluster out to meet the
SLAs rather than having to think about how big of a cluster do I need in November, December,
and then encoding that in somewhere. So those are some of the things where we're building a DLT
where you don't really have to think about things so much. It just makes you more productive.
And I think it's going to be really interesting. Later this year, we'll have serverless as part
of the DLT offering. I think the biggest benefit is, whenever you build a DLT job, you can write it in
either SQL or Python, and it could be streaming or batch, it doesn't really matter. But you don't
have to think about what version of Spark you're on. You just write this,
and then it just runs on our DLT clusters, and then there's no more, like, upgrading to the next
version of Spark. It just automatically happens for you.
So is the strategy here then to, I suppose, be,
if you think about someone who is considering a platform to run data warehouse style workloads
and maybe sort of data science workloads,
they've got choices out there.
They've got, for example, Snowpark and Snowflake.
They've got maybe BigQuery
and some of the features in there. And you've got Databricks. Would you say that these platforms
are kind of interchangeable now? Or are there certain workloads and certain personas that are
better suited to working with Databricks, do you think? Well, I mean, the whole point of the
lake house is you should have, you know, one platform that you can do all of your different
use cases well at scale.
And you can't really have a lake house unless it's using open formats and open technology.
So we've standardized on Delta Lake as this open format. We also leverage Spark as the open API for
doing the processing. And you can only really do machine learning at scale if you're using, you know, these open technologies.
So, like, I think BigQuery, they have some machine learning that's built in.
So with BigQuery, you could run something probably distributed at scale for whatever algorithms that they've baked into BigQuery.
But you can't, like, add the newest, latest algorithm that somebody created in open source.
So they don't really have that option.
And the other thing is, I think most people who are training these models, the data scientists, they're not really doing that in SQL.
They want some sort of data frame API or interface to be able to do that, which is why I think Python has become so popular.
Because first with pandas and then other data frame APIs,
but that's how data scientists prefer to work.
I know Snowflake, they've released Snowpark recently.
They want people to think that it's really good for machine learning,
but it really can't do distributed machine learning.
I mean, all Snowpark really does is it just takes a DataFrame API and then
translates that to SQL and then runs that SQL against the same Snowflake endpoint that was
there before. So, you know, I think it does help some with, like, if you've basically got some
code and you want to turn that into a UDF, Snowpark will turn it into a UDF and then call that
UDF from the SQL. But, I mean, otherwise it's not really doing anything special. So if you wanted to train a model, and
that model and the data you need to train that model, you know, if you can't fit it within the
memory on one of the nodes in Snowpark, it'll just fall over. So, you know, if you're training a lot
of, you know, really small models, it might work. But the large models, it's not going to work.
And I think they also don't have the ability to price that stuff at a reasonable rate.
I think their Snowpark stuff is actually charged like 50% more than their regular clusters.
So I think people are really just going to want a single, integrated, unified experience
with all these different use cases.
Yeah, I was just going to ask, you know, we've got Photon, we've got Enzyme.
Great names, by the way.
I'd love to meet the branding side of Databricks who comes up with these names.
An extreme BI.
Don't Google that.
Don't Google that.
So a little joke there, Jason.
Mark and I used to work for the same company,
and he's never let me live down that name that I came up with.
But maybe tell us about what's coming next, right?
So we've seen some of these.
You've talked about some of these great advancements that you've been making.
What's over the horizon?
You know, we won't hold you to this, you know, safe harbor.
Yeah.
But what are some of the great things that we're going to see from the platform?
For instance, through acquisition, you've added analytics.
And curious how, you know, we might see that evolve, you know, the ability to actually
not even have to have an analytics tool to some degree, if you invest in Databricks. So maybe
talk to us about what we'll see in the future and what you guys are focusing on next.
Yeah, I think, you know, if I think about just kind of like the spectrum of stuff, like
on the machine learning side, we actually, so we just announced on Tuesday, we went GA with our model serving layer.
So this is basically a serverless model serving layer, which allows you to take a model that's built with MLflow, which is the open source project that we created to encapsulate the whole lifecycle of machine learning.
You can basically serve that as a REST endpoint
and have people hit that with their applications very easily.
It integrates well with Feature Store,
which we came out with earlier.
And the Feature Store is, if you think about it,
you've got all these different data scientists
that are training these models.
They're inevitably going to end up training models
based on the same features or derivative of data that
they've created. So having a feature store allows you to register these features in a central place.
And then you can basically monitor the use of these features and then leverage them within
something like the model serving layer. So for example, you might have a feature that says,
you know, what's the number of purchases this customer has made over the last, you know, seven days or 30 days.
And then that feature would just continuously get updated as more sales come in.
So that way, whenever the model gets served, you can basically have the most accurate up-to-date information without having to, you know, think about all the different pipelines you'd have to create to do that.
So we're basically just making that loop very easy and very scalable.
At the end of the day, models are just basically code plus data.
So being able to unify that in a single platform is what we're working on.
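A hedged sketch of what calling a served model might look like from an application, assuming a Databricks serverless model serving endpoint as Jason describes; the endpoint name, workspace URL, and input columns are placeholders, and the request shape follows the documented dataframe_records convention.

```python
import requests

workspace = "https://<workspace-url>"                    # placeholder
endpoint = f"{workspace}/serving-endpoints/churn-model/invocations"

payload = {
    "dataframe_records": [
        # Raw request features; feature-store lookups (e.g. purchases over the
        # last 7 days) can be joined in on the serving side.
        {"customer_id": "C-1001", "plan": "pro"},
    ]
}

resp = requests.post(
    endpoint,
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
print(resp.json())   # e.g. {"predictions": [0.87]}
```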
I think a big thing is, we announced at Data and AI Summit the marketplace we're going to create.
So we want to create a data marketplace, which is not going to be just data, but everything
that you want to share.
So data assets, but also notebooks, machine learning models.
So being able to have a marketplace where third parties can basically kind of register
these things and then make them available to anyone who
wants to use them in Databricks. But also, you don't even have to be on Databricks,
because the basis of all that sharing is Delta Sharing, which is an open source
protocol for being able to do data sharing. You know, as long as you're using a
utility that can support the Delta Sharing client, and we've already got that built into,
you know, Spark as well as Java,
and I think Power BI has it built into their tool, you can just basically attach to these Delta
shares.
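A minimal sketch of the open Delta Sharing client Jason mentions, using the delta-sharing Python connector; the profile file and the share/schema/table names are placeholders that the data provider would supply.

```python
import delta_sharing

# Credentials file issued by whoever is sharing the data with you.
profile = "/path/to/provider-config.share"

# share.schema.table coordinates inside that share (hypothetical names).
table_url = f"{profile}#retail_share.sales.orders"

# Read the shared Delta table into pandas without needing a Databricks workspace.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```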
So that's something that we're going to be focused on a lot between now and the next Data and AI Summit. Another big... Can I kind of pause you there for one second? And I don't want
you to lose your train of thought, but I wanted to drill in for a second on the model serving in the feature store.
So, you know, a poor person's feature store is a table, right?
So and the features are columns in the table.
What does the feature store give us on top of that?
Is it something like versioning?
Is it the ability to declare them as features?
What would you say there?
Yeah, I guess some versioning.
And usually with like features, there's like an offline serving and an online serving. So offline is whenever you're basically training a model, you're probably going to be doing something in batch.
And you kind of need to be able to grab these features in a way that scales for batch.
But with online, it's usually like, hey, I've got a,
you can imagine somebody clicking on an app or a website,
and that click is going to basically fire off a call to an inference layer
to give you the result back, a score of some sort.
You need that to be like super quick and super responsive.
So that's like an example of an online feature serving layer
where you need something that's basically a key value lookup that's really quick.
And so with a feature store, you want something that can basically do both of those and not have
to force the engineer to think about how to keep them in sync, but basically keep them in sync
automatically. So that's...
And maybe... Yeah. Oh, sorry. And maybe tell us: before you introduced model serving, to give some folks an understanding of what kind of value that brings,
how would you have had to do that prior to you guys introducing this?
Yeah, there were a lot of different methods. If you think about it, with machine learning, there's training a model and then there's the inference of that model.
So after you've trained a model, if you want to infer, what a lot of people do is they would do it in batch at night.
So depending on what you're trying to predict, you could come up with, you know, every single possibility, every combination of input values, and then run that through the model and get the
output and then just save that output in some sort of key value lookup. So that way, whenever the
application needs to do a lookup the next day, they've got, you know, just something that's
pre-calculated. So that was one way. Streaming was another way that people do it. Just, you know,
if you're doing streaming, you can kind of do that lookup along the way. There was also, we had integrations with MLflow where, you know, if you
have a model that's registered in MLflow, you can use other services that implement MLflow. So like
Azure ML or SageMaker in AWS, where they've got their own online serving where you can basically
just host that. So it kind of depended on the use case, but the nice thing about MLflow and open source
is like so many people have implemented
what's called flavors to do the serving and the hosting
that you're kind of like spoiled for choice
if you have a different environment
where you need to host it.
Fantastic.
Well, we're almost out of time now, Jason.
And so how do people find out more about Databricks
if they're interested in looking further?
Yeah, I would say, you know, Databricks.com.
You can go there.
We've got an easy way to start up a trial.
And, you know, I think you get like two weeks free or something like that where you can basically play around with it.
We have no shortage of blogs.
I can't keep up with all the blogs we put out, to be honest.
It's a lot.
And we've got got videos as well.
And then if you're on the open source side of the house,
we've got big communities around MLflow and Delta Lake and Spark.
And there are Slack channels for all of those, too.
So you can kind of get involved with the community as well.
And I'll just make a call out there too, Mark, as well.
One of the nice things about a notebook interface is
that there's a lot of publicly available notebooks that you can open right up into Databricks and
you can execute them end to end, assuming that the data the notebook
is using is public somewhere. So instead of having to copy and paste scripts or check out scripts or whatever, a lot of times when you Google something, like how is X, Y, or
Z done in Databricks,
what your result is going to be is a series of notebooks where it's been done, that you
can just open up and, I'll say, play around with, but, you know, quite possibly
take through iterations into something that you could take to production, which is really nice. Oh, that reminds me, I almost forgot, there's a
website we just launched called dbdemos.ai. And so if you go to dbdemos.ai,
it's a website which hosts a whole bunch of these different demos using notebooks,
like you just mentioned. And if you've got Databricks, you basically just run one line of code
in your notebook
and it'll automatically download and import
whatever the demo is that you're looking at
and get you up and running,
so you can just try this stuff out on your own.
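The "one line of code" Jason mentions presumably looks something like the following, assuming the dbdemos package installed in a Databricks notebook; the demo name here is just an example and should be checked against the catalogue on dbdemos.ai.

```python
# In a Databricks notebook, first: %pip install dbdemos
import dbdemos

# Downloads and imports the chosen demo's notebooks and sample data into your workspace.
dbdemos.install("lakehouse-retail-c360")
```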
Fantastic.
Well, thank you, Stuart.
And thank you, Jason.
That's fantastic.
Lovely to hear about the story about Databricks
and thank you for the very in-depth
kind of explanation there.
No question we had could catch you out.
So thank you.
Very, very knowledgeable.
Thank you very much.
So thanks very much and take care.
And it's great having you on the show.
Oh, thank you so much, Stuart.
And thank you, Mark.
It was a pleasure to get to meet you guys and love to do it again sometime.
Yeah, thanks, Jason.
It was a pleasure.
Thanks so much.