ACM ByteCast - Matei Zaharia - Episode 32

Episode Date: December 13, 2022

In this episode of ACM ByteCast, Bruke Kifle hosts Matei Zaharia, computer scientist, educator, and creator of Apache Spark. Matei is the Chief Technologist and Co-Founder of Databricks and an Assistant Professor of Computer Science at Stanford. He started the Apache Spark project during his PhD at UC Berkeley in 2009 and has worked broadly on other widely used data and machine learning software, including MLflow, Delta Lake, and Apache Mesos. Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF Career Award, and the US Presidential Early Career Award for Scientists and Engineers. Matei, who was born in Romania and grew up mostly in Canada, describes how he developed Spark, a framework for writing programs that run on a large cluster of nodes and process data in parallel, and how this led him to co-found Databricks around this technology. Matei and Bruke also discuss the new paradigm shift from traditional data warehouses to data lakes, as well as his work on MLflow, an open-source platform for managing the end-to-end machine learning lifecycle. He highlights some recent announcements in the field of AI and machine learning and shares observations from teaching and conducting research at Stanford, including an important current gap in computing education.

Transcript
Starting point is 00:00:01 This is ACM ByteCast, a podcast series from the Association for Computing Machinery, the world's largest educational and scientific computing society. We talk to researchers, practitioners, and innovators who are at the intersection of computing research and practice. They share their experiences, the lessons they've learned, and their own visions for the future of computing. I am your host, Bruke Kifle. Machine learning is undoubtedly transforming the world we live in. Advancements in modern computing technologies paired with the generation and availability of massive quantities of data have been key to enabling the adoption of machine learning across a wide range of industries and domains.
Starting point is 00:00:44 However, with massive quantities of diverse data, there is a clear need for a highly performant, general distributed processing system for big data workloads that allows users to process, transform, and explore big data sets. Our next guest, Dr. Matei Zaharia, has worked to achieve that and much more in the field of data management and machine learning. Dr. Matei Zaharia is the chief technologist and co-founder of Databricks, as well as an assistant professor of computer science at Stanford. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly on other widely used data and machine learning software, including MLflow, Delta Lake, and Apache Mesos. Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award,
Starting point is 00:01:36 an NSF Career Award, and the U.S. Presidential Early Career Award for Scientists and Engineers. Dr. Matei Zaharia, welcome to ByteCast. Thanks so much for having me here, Bruke. So I'd love to start with a question I often like to lead with, Matei. Can you tell us more about your background and some key inflection points throughout your personal, academic, and professional career that ultimately led you to the field of computing and what you do today? Sure. Yeah. So let's see. So I was born in Romania, in Europe, and I grew up mostly in Canada. And I went into computer science in university, you know, mostly because I liked programming. And I also liked how quickly you can just try sort of the latest techniques for everything,
Starting point is 00:02:19 you know, because you could just run everything on your computer. You don't need special equipment or anything like that. And I was fortunate. I went to the University of Waterloo. I was fortunate to work with this networking professor, Srinivasan Keshav, who got me interested in research. So I was doing research in networking and peer-to-peer systems part-time alongside doing my undergrad. And after that, I applied to PhD programs, and I ended up at UC Berkeley, working with Scott Shenker and Ion Stoica, again, mostly on networking things initially, but I became pretty interested in large-scale data center computing and frameworks like MapReduce
Starting point is 00:03:01 and just all the distributed computing frameworks that were coming out, as well as cloud computing. So that's what put me on the path towards Apache Spark and towards understanding these workloads, looking more at machine learning as well. And it was definitely the right time to start exploring that in the research world because these technologies went from being used at a few large web companies to pretty much every other organization in the world. So it's been a great, fun kind of field to be in. Oh, certainly. And you mentioned, of course, the important role of professors, faculty mentors who ultimately guided your interest in research, as well as this new field that you're in. You highlighted the
Starting point is 00:03:45 development of Spark at Berkeley, of course, as being a key inflection point in your journey. But can you help us understand what is Apache Spark and what are some of the motivations for its development, considering some of the existing solutions at the time with MapReduce, for instance? So Apache Spark is basically a framework for writing programs that are going to run on a large cluster of nodes and process data in parallel mostly. And, you know, there are a bunch of different components to it, but the core of it is just this API where you can write basically single machine code in Python or in Java or in other languages just on a single machine. And you can use these functional operations like map and reduce and other data processing operations like joins and group-bys and so on. And you can write a program using these,
Starting point is 00:04:38 and then Spark will take that program and automatically parallelize it across a cluster, including shipping functions that you wrote and having them run in parallel on lots of items and then giving you back the result. So what it means is that, like anyone who's learned how to do some data processing on a single machine, say with libraries like Pandas in Python,
Starting point is 00:04:59 gets a similar library for working with a big distributed collection of data across a cluster. And you can also use your favorite, you know, your favorite kind of single node libraries as part of your program and just call them in parallel on lots of data. The goal is to make it very accessible for many different types of developers and data scientists, just like people who write programs to run something potentially at large scale. And on top of this basic sort of function that the engine provides, there's also a really rich ecosystem of libraries. So there are libraries on top that run a full-fledged SQL engine
Starting point is 00:05:37 that can do standard kind of analytical database workloads. There's a machine learning library that gives you lots of built-in algorithms. There's an incremental stream processing system, so you can write a computation and then Spark will automatically update the result as new data arrives. And there are many more libraries in the community that are just built on it. So it's also a nice framework for just combining these high-level libraries into a bigger application. And I think compared to the tools that existed before it, I would say Spark was one of the first to really focus on opening up these kind of large systems beyond software engineers. So figuring out APIs that just a data scientist or someone who doesn't just write programs
Starting point is 00:06:23 as their main job, but has maybe like a math background or like domain expertise can still be successful with them. So that was one difference, the focus on Python and R, for example, helped with that. And then the other difference is it focused a lot on composability, both from the programming perspective, you should be able to just call things that other people hold as libraries, like say a machine learning algorithm. And from the efficiency perspective, Spark can do very efficient in-memory data sharing between different steps of the computation. And that's what enables things like iterative algorithms for machine learning or stream
Starting point is 00:07:00 processing or interactive SQL queries. So these were kind of the differences from the previous engines that existed. And so a lot of the research was like figuring out how to make these things work and still be fault tolerant and efficient and so on. But once we did that, you could get this really great ecosystem of libraries on top that now users can just combine to do things.
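To make the pattern Matei describes concrete, here is a minimal PySpark sketch: relational operations like group-bys plus an ordinary Python function that Spark ships to the cluster and runs in parallel. The input path and column names are invented for illustration; this is an editorial sketch, not code from the episode.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or connect to) a Spark session; locally this runs in one process,
# on a cluster the same code is distributed across many nodes.
spark = SparkSession.builder.appName("bytecast-sketch").getOrCreate()

# Hypothetical input: JSON records with "user", "country", and "amount" fields.
events = spark.read.json("s3://my-bucket/events/*.json")

# Relational operations (group-bys, aggregations, joins) that the engine
# plans and executes in parallel across the cluster.
totals = (events
          .groupBy("country")
          .agg(F.sum("amount").alias("total_amount"))
          .orderBy(F.desc("total_amount")))
totals.show()

# Functional operations on a distributed collection: an ordinary Python
# function is shipped to the executors and applied to each record in parallel.
def normalize(row):
    return (row["user"].lower(), row["amount"])

pairs = events.rdd.map(normalize).filter(lambda kv: kv[1] > 0)
print(pairs.take(5))
```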
Starting point is 00:07:23 I see. So certainly abstraction, composability, and of course, cost efficiency. Are there other salient features that you believe has made Apache Spark, you know, sort of dominate and become the framework of choice for big data distributed processing? I think a lot of the other ones kind of stem from this and also from the great community that's contributed to it and, you know, that's formed around it. So, you know, we went from basically an academic research project, mostly developed by grad students, to something that a lot of the major tech companies actually
Starting point is 00:07:58 started using and contributing to. And over time, also like other, you know, kind of non-tech enterprises also started building on it. So, and these have helped the system become a lot better and have helped test it at a huge scale and just make sure it works on, you know, the widest possible range of workloads and data types. So I think for a lot of people today, it's like, it's just a nice ecosystem to build on if you want something that's going to be reliable and that's well supported across the industry that connects to pretty much every data source there is out there. But I think a lot of the reason why people built on it were kind of these design decisions, for example, to make it cheaper and efficient and also just easy from a programming standpoint to compose things and also to make it possible for the engine to distribute and optimize the combination underneath these APIs so we can keep improving the performance of your existing job,
Starting point is 00:08:55 like whenever you upgrade Spark without you having to rewrite your job. And so a whole bunch of work has gone into that to make it kind of declarative and make it possible to optimize things. So you said something earlier, which was quite interesting, which is the ability to, you know, scale something that was once simply a research project into, you know, a solution that's widely adopted across the industry and, you know, by Fortune 500 and, you know, large enterprises. So as you look back, with origins in academia and the open source community,
Starting point is 00:09:27 how was Databricks for what it is now spun out? And how did the early days of Apache Spark ultimately lead to you creating and co-founding Databricks as it is today? Yeah, so I think being able to start a company that's just working to improve this technology and then to provide commercial services around it was very important to help us all really build out the project and get it to the next level. So I started working on Spark, I guess, in 2009. I think we released the very first alpha open source version in 2010.
Starting point is 00:10:04 And that's when I was just a grad student. And over time, like more students at Berkeley started kind of collaborating on it and building things on top of it. And at the time we saw, you know, there was quite a bit of interest, mostly from more tech-centric companies, but also from users, like I mentioned,
Starting point is 00:10:21 the data scientists, like the non-software engineers who still wanted to run large-scale computation in other organizations. And so we saw that there is demand for something like this, and it's quite interesting to figure out how to build it well and what to do in it. And so we encouraged the open source community. We encouraged contributions from outside. We reviewed patches. We moved the project into Apache Software Foundation as kind of a long-term home that's independent of the university. And so that helped. But we also realized that just by doing research, we can
Starting point is 00:10:58 never... It would be hard to invest a huge amount of effort into it and just have people working full time to make the project better. So we were also excited. We saw there is enough interest to justify creating a company in this space. And we didn't want a company that just does, say, support and services around Spark. So we actually started a company that just tries to provide a complete modern data platform based on what we saw working at kind of the most tech forward companies with it. And that does it in the cloud because that was another major kind of shift happening in the industry.
Starting point is 00:11:38 But launching the company also helped us invest more into Spark and also get it to the point where it was good enough that other companies started contributing heavily and using it in production. And it kind of grew from there. So we were excited to have to try to launch a company in this space, even regardless of Spark, just because we thought it's an interesting problem and everyone is going to shift the way they do data management as they move to the cloud. But yeah, it also, I think, helped really like cement our ability to contribute to the open source project. And we could hire engineers who just work on it and help build it out.
Starting point is 00:12:15 So as you think about some of the solutions or offerings, you mentioned some different consumers or users that could benefit, whether it be data scientists, data analysts, business analysts, those who maybe aren't software developers or don't have a software background. Who are sort of the key stakeholders that you think about when you design solutions at Databricks? And when we talk about this unified analytics platform, what do we mean by unified analytics and who's sort of the beneficiary of this? Yeah, great question. Yeah. So the way, you know, many organizations work today, both companies and kind of like things
Starting point is 00:12:51 like research labs and so on, like basically, you know, just organizations that work with data, they always have many different types of users who all want to do stuff with the data that they've collected. Like, you know, whether it's a research lab and you've got all this information collected from experiments or, you know, like telescopes or whatever it is, or it's a company and there's like all this information about how it's operating. Everyone wants to, you know, to understand what's happening
Starting point is 00:13:18 or to build applications that use it, like say a predictive modeling application or something like that. But, you know, these people come in with very, very different backgrounds. It's very valuable to them to have a common sort of representation of all the data, like everyone agrees on the data types you have, the tables, the schemas, all that stuff. And also a common sort of query language or like a common, you know, like semantics of different operations. So for example, maybe an engineer can write a function for computing a particular thing, like say about your customer, and then anyone in the organization can call that function and get that metric in an accurate way, as opposed to everyone trying to implement it in a different tool. So with the Spark engine, we tried to have one engine that can offer these interfaces that work for different people. So for example,
Starting point is 00:14:12 there's the SQL interface, which works for the widest range of users, including users who just connect a visual analytics tool like Tableau or Power BI. This is a tool where you drag and drop to, say, create a chart and it actually runs SQL on the backend, or something like Excel. It can also connect that way. So that's like one extreme. Then there's like the sort of users who will write in a scripting type language like Python. This would be someone like a data scientist where they do programming, but they're not just doing software engineering all the time. They're trying to answer questions or prototype things. And so that's where the Spark Python APIs help. And then there are the software engineers, who are, you know, responsible for like the most important data pipelines being reliable, things being computed correctly, you know, if anything's broken, like there'll be, you know, they'll get paged in the middle of the night to fix it.
Starting point is 00:15:11 And they want, you know, to use kind of the best software engineering tools out there to be able to test things, have static types to understand what's flowing through the system, you know, do different, like clone the job and try out different versions of it and compare the results and so on. And so we try to design like the data model in Spark and in the Databricks platform overall, including the other pieces like the storage component is all the same for everyone. So everyone can agree on, hey, here are the tables, here are the data types, here's what they all mean. The query model is also very similar. So that means someone can write a function
Starting point is 00:15:49 that other people can use, which is great for handing off kind of knowledge between teams. But then the actual interfaces, you can use these different ones that can all call into the same functions and into the same data model, but are tailored to different sort of user personas.
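As a small, hypothetical illustration of the "one data model, many interfaces" idea above: the same registered table can be queried with plain SQL (what a BI tool would send) or with the Python DataFrame API, and both run on the same engine against the same schema. The table location and column names here are assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Suppose the platform already defines a shared "customers" table.
customers = spark.read.parquet("s3://my-bucket/warehouse/customers")
customers.createOrReplaceTempView("customers")

# Interface 1: plain SQL, the kind of query a BI tool like Tableau would issue.
by_region_sql = spark.sql("""
    SELECT region, COUNT(*) AS n_active
    FROM customers
    WHERE last_login >= date_sub(current_date(), 30)
    GROUP BY region
""")

# Interface 2: the same logic written programmatically by a data scientist.
by_region_py = (customers
                .where(F.col("last_login") >= F.date_sub(F.current_date(), 30))
                .groupBy("region")
                .count()
                .withColumnRenamed("count", "n_active"))

# Both go through the same engine and see the same tables, schemas, and data.
by_region_sql.show()
by_region_py.show()
```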
Starting point is 00:16:06 And we work a lot with all of these to make sure they can understand these things and that they map onto this common model that actually lets the organization work in a unified way. And it might seem a little bit obvious, but historically, at least this hasn't been the way that most companies work with data, because especially before the cloud and the on-premise world, you would buy these different systems and provision them on different servers. And they would each have their own way of storing data, their own query language, and so on. And then you'd have to do a lot of work just to
Starting point is 00:16:42 connect them together and to make things consistent. So we do think there's an opportunity to simplify things with this one engine that can do the different types of workloads and then one data store that is in the cloud that everything can connect to. You don't need to deploy different data stores for different workloads. I see. So what are some actual compelling scenarios or use cases of this one engine, one data store that you're seeing across industries from those who are actually adopting this solution? One of the big ones is just classic data warehousing workloads. This is what you would do with, say, a system like Teradata, for example, which is, hey, you load data into some tables, you can transform it, work on it, usually with SQL, and then you can query it with SQL and maybe serve it through these interactive
Starting point is 00:17:37 visualization tools or compute some reports and send those to people. So this is kind of simple analytics workloads where like you load it and then you ask some queries about it. And for that stuff, you know, using Spark for that means you'll be able to run it at very large scale in the cloud and you'll be able to have separated compute and storage. So like while you're running, you know, one query, you don't slow down the whole system. Other things can keep working and access the data. But it also means you could potentially use the other functions like machine learning or streaming on top of the same data model. So some people are just saying, hey, I want this classic stuff, but I want it in the cloud and in an elastic way where I don't have to worry about like, you know, how many CPUs I provision and how much storage. And at the other extreme, like a lot of organizations,
Starting point is 00:18:27 like virtually every, you know, like Fortune 1000 and probably even beyond that company has a machine learning team now and has a data science team. And they're all trying to figure out how to do, you know, predictive analytics, how to do features in their products that actually use machine learning in some way, like say recommendation engines or churn prediction or predicting failures. And there
Starting point is 00:18:51 are some really cool use cases that we've seen there. So for example, we saw a lot of the biotech companies are now developing new drugs for diseases based on analyzing large data sets and understanding what's happening in them. A lot of industrial companies have instrumented everything they put out. For example, every tractor you purchase from, like John Deere now, has lots of sensors on it that, you know, evaluate how it's working and can recommend, you know, when to fix pieces or like can tune it for optimal performance. Same thing with like every jet engine that's produced by like, you know, that's used in airplanes and stuff like that.
Starting point is 00:19:31 They're building all these interesting applications based on it. And I think the really exciting thing for me is allowing, you know, people with minimal effort to be able to do these kind of more cutting edge applications, machine learning, streaming, and so on, on top of data they're collecting, in addition to just kind of the classic applications they can do. You mentioned one term, data warehouses, and obviously there's been evolutions.
Starting point is 00:19:56 We've had data warehouses, data lakes, and now there's the rise of the data lakehouse paradigm for modern data management. Why the need for this new paradigm? What challenges does it address? And why aren't warehouses and lakes sufficient? Yeah, I'm happy to talk about it. I think there are a couple of different things that sometimes get conflated here with these. So there's like data warehouse systems, that's like the actual software that's managing data. And then there's also, there are these architectural terms,
Starting point is 00:20:26 like there is an idea of data warehouse as an architecture for managing data in an organization, which says, hey, before you open data to lots of users, like have a formal way of organizing it and defining different tables, defining relationships between them so that it's not a mess, so that you can keep it accurate over time and extend it and make sure everyone sees correct results. So both of these things are being, you know, actually kind of being, you know, rethought in
Starting point is 00:20:56 various ways. The one I talk about the most is the technology piece. So historically, if you wanted a system that can store lots of data, historical data, and then can do fast queries on it using SQL, you build these data warehouse systems. And they were designed to be deployed on their own servers, right? Like when you're an on-premise kind of company, you have to buy new servers to deploy a new piece of software. So they all assumed that they have full control of the data and they're the only interface to that data. So they were all using basically proprietary custom storage formats. And then the only way applications talk to them is through SQL. And then within that, they get really great performance.
Starting point is 00:21:40 Now, when you have everything in the cloud, and when you start having applications that don't speak SQL, such as machine learning or maybe streaming applications, it becomes a bit of a problem that you've got lots of data locked into something that basically only one system can query through only one interface. So that's where Lakehouse comes in. So the other kind of model that goes under it is what's called a data lake. Data lake is basically just low-cost storage where you can just put files in any format. And this is what a lot of the Hadoop and Apache Hive kind of open source ecosystem built up. They just said, look, I want to manage large amounts of data without loading that into these kind of limited proprietary systems.
Starting point is 00:22:26 I just want to very cheaply store it. And then I'll load subsets of it later and do more sophisticated analysis. So data lakes are just based on low cost storage and this kind of file interface where you usually use open formats that many apps can read for your files. And then Lakehouse is this emerging trend to kind of combine the two and to get the data warehousing-like performance and management features, things like ACID transactions, on top of low cost storage and open formats. So you don't have to convert the data to a different format and move it to a different system just to get fast queries on it. And that's the model that we think is going to be the future. We saw like a lot of the kind of digital native, like new tech companies who had
Starting point is 00:23:12 to build their stack from scratch, just build something like this from day one. They never had the separate systems. And at Databricks, like that's kind of the, we decided to focus our platform around this model and to figure out how to do that well. And we think there's no technical reason why data in open storage formats can't be used to provide really great performance or to provide transactions or management features or all the things people expect. So we're just trying to give people that and to just simplify their data management overall by having one system that every app can connect to. I see. So combining the best of, you know, the low cost storage of
Starting point is 00:23:51 data lakes and, you know, the performance and manageability features. Right. Exactly. It is basically like a technical problem of like how to do that well, but it seems that if you can do it well, it's very useful. Like it just simplifies things. I want to turn my attention to another project that you created and have been actively contributing to, which is MLflow. You know, one of the most challenging aspects of productionizing machine learning is not actually training the models, but as you might predict, it's the deployment and the monitoring to actually ensure, you know, your production grade applications are, you know, as you'd expect them. So, you know, as you work on MLflow, how does MLflow actually help address some of the important pain points
Starting point is 00:24:36 and challenges in the, you know, ML development and deployment lifecycle? Basically, what we've seen is at almost every company that productionizes machine learning that actually tries to use it in a product that has to run well, they end up building what's called an ML platform or sometimes ML ops platform, basically a whole bunch of infrastructure to support the machine learning application. And this does a bunch of things. First of all, it usually trains a model, like retrains a model periodically, kind of automatically, because data is changing. It monitors that.
Starting point is 00:25:12 It gives you metrics about what's happening. And it maybe will alert if things are way off, and the new model isn't doing well, or the data looks different. It versions all the artifacts, so you can kind of roll back as you do development and see what happened in the past. It also handles deploying the model
Starting point is 00:25:30 and then actually serving it. And so, for example, a lot of the large tech companies build infrastructure like this. Like, for example, FB Learner at Facebook and Michelangelo at Uber and many other systems. But even the non-tech enterprise companies we
Starting point is 00:25:47 talked to all had something like this. So with MLflow, we basically created an open source project that handles this problem. It's an open source kind of extensible ML platform project. And what it does is it gives you some built-in functions for common things people need to do with machine learning. Like, for example, packaging a model and deploying it in different serving systems or tracking metrics about it or, you know, sharing experiments with a team and collaborating on those. And so it gives you these built-in things. And it also gives you this extensible framework where you can plug in new pieces. Like, for example, if you say, you know, when I build a new version of my model, I want it to pass through some custom reviewing steps, like maybe an automated test, and then a human that approves that that says, you know, this is actually, you know, a good model or like whatever,
Starting point is 00:26:39 you can plug that into MLflow in various ways to the APIs, and you can build this custom workflow on top of it. So that's what it does. And we see, again, a lot of teams use it as they move from just doing some experiments and creating a cool model to creating an application that's supposed to run all the time, maybe retrain periodically and be very easy to monitor. And we've tried to do something that people can use even during the experimentation phase. It saves you a little bit of time in experiment management and collaboration, but then it puts you in a spot
Starting point is 00:27:13 where you can quickly shift the model to production. And can such a model or a platform be useful for detecting and managing things like concept or data drift, shifts in your upstream data and how that might be impacting production? Yeah, definitely. Yeah. There are some built-in features in MLflow, like there's integration with SHAP for explainability, but it also allows you to put in some custom processing steps for your data, for testing
Starting point is 00:27:44 your model and validating it, also for any data, like doing model serving and inference for the results that are coming back from that. So you can use it to systematically plug in things into the pipeline. I should say MLflow doesn't provide its own algorithms. We're not trying to create a better explainability algorithm or something like that. We're just giving you kind of the programming model or programming framework where you can you can write your application to have these pieces. And it's very easy to instrument parts of it and to observe what's happening, like to automatically collect information and to show it to people and, you know, let them plug in things that listen to that information. So it's a lot like, you know, like when you run, say, a web application, there are these frameworks for how to build it that will handle certain things and make it easy to like roll forward and hold back.
Starting point is 00:28:36 Things like Ruby on Rails, for example. It's a lot like that. It's not, we're not trying to provide new algorithms or anything. I see. Yeah, I think as someone working in the space of deep learning and production sort of applications of deep learning, this is certainly an area that interests me personally. And as it relates to, you know, productionizing machine learning being one of the biggest challenges, I think most practitioners can certainly attest to that.
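For readers who have not used MLflow, a rough sketch of the experiment-tracking and model-packaging workflow discussed above might look like the following; the dataset and model are placeholders rather than anything specific from the conversation.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run(run_name="rf-baseline"):
    params = {"n_estimators": 200, "max_depth": 6}
    model = RandomForestRegressor(**params).fit(X_train, y_train)

    # Track parameters and metrics so runs can be compared and shared later.
    mlflow.log_params(params)
    mlflow.log_metric("test_mse", mean_squared_error(y_test, model.predict(X_test)))

    # Package the model in MLflow's format so it can be handed off to
    # different serving systems (REST serving, batch scoring, and so on).
    mlflow.sklearn.log_model(model, artifact_path="model")
```

The same tracking calls work during early experimentation, which is what lets a team move a promising run toward production without rewriting it.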
Starting point is 00:29:08 ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher, and TuneIn. If you're enjoying this episode, please subscribe and leave us a review on your favorite platform. Looking forward, I know there was the recent data and AI sort of flagship summit. Databricks announced some of the recent announcements and some of the upcoming features or capabilities that they hope to make available. What are some recent announcements that are quite exciting for you and looking for what are some future directions that you look forward to? Yeah, definitely. There's a lot of cool stuff happening in this space. I'll just mention a couple of them that I'm personally excited about. So one is on the data system side. So this summer at the Data and AI Summit, as you mentioned,
Starting point is 00:30:03 Databricks actually open sourced all of this storage management layer we have developed called Delta Lake. So this is what enables that kind of lake house pattern where you have an open format for your data and you can have rich data management features on top, like ACID transactions and versioning and time travel, like rollbacks, things like that, and also improve performance. And so Delta Lake has been open source for a while, but Databricks always had some proprietary kind of enhancements for performance on top of it and for like some connectors to certain systems. And we just saw that, you know, this went from a brand new kind of product in 2018 with no users on it to something where I think more than 70% of data that our customers put in the platform is in Delta Lake today. So it became kind of this essential building block. And so we decided to contribute everything to open source, like even the advanced performance features, so that, you know, more companies, more products can easily build on this and integrate it and people can, you know, can
Starting point is 00:31:10 manage the data in it and not worry about like, hey, is this only usable from Databricks or somewhere else? So I'm really excited about it because I think it's one of the kind of best and most novel pieces of technology we've built. And there's already a bunch of interesting research on these kind of systems. And I think open sourcing it is going to enable a lot more. I've also, I've talked to like a whole bunch of researchers who are trying to do new things to, you know, make these kinds of lake house systems more powerful and more efficient.
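As a concrete picture of the transactions, versioning, and time travel features mentioned here, a minimal sketch using the open source delta-spark package could look like this; the paths and toy data are made up for the example.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Assumes `pip install pyspark delta-spark`; the table path is illustrative.
builder = (SparkSession.builder
           .appName("delta-sketch")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/delta/events"

# Version 0: each write is an ACID transaction over open-format files plus a log.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"]) \
     .write.format("delta").mode("overwrite").save(path)

# Version 1: appending more rows creates a new version of the same table.
spark.createDataFrame([(3, "purchase")], ["id", "event"]) \
     .write.format("delta").mode("append").save(path)

# Time travel: read the table as of an earlier version (useful for rollbacks,
# audits, and reproducing old results).
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
latest = spark.read.format("delta").load(path)
print(v0.count(), latest.count())  # 2 rows vs. 3 rows
```

Because the underlying files stay in an open format, other engines that understand the format can read the same table, which is the point Matei makes about avoiding lock-in.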
Starting point is 00:31:39 So that's one thing. If you're someone working in the system space, definitely encourage you to check it out. There aren't many things that go from like zero to 70 percent of like how someone stores stuff, which is one of the more critical things you have to do with data is like store it reliably in such a short time period. The other one I'll mention on the machine learning side, it's in MLflow. So we're starting this new component to help simplify kind of the handoff of machine learning applications between the ML sort of experts and then the engineering teams that operate the applications. Because again, there are different user, you know, backgrounds and user types who do these different things. You've got someone that's more like a data scientist or like, you know, like an ML researcher who develops models and knows how to evaluate them and maybe how to make, you know, like how to tweak them to make them better. And then you've got a production engineer that knows about like, how do I monitor a thing, make sure it's working? How do I improve the performance? How do I set it up so I can operate and roll back
Starting point is 00:32:49 and deal with outages and stuff like that? And we found it's really hard to have people that do both well. And it shouldn't really have to be that way. It should be possible to have different users who focus on these aspects. And so we have this new component that's called MLflow pipelines that lets basically the engineering team create a pipeline where the ML researcher fills in specific steps of it, but the whole pipeline is operable and instrumentable and controllable by the engineering team. So basically, it's a way to modularize the code everyone is writing.
Starting point is 00:33:27 As a researcher, you get the contract that, hey, if you work within this API, your thing will immediately be productionizable. You don't have to change your code and risk any problems with that. And as an engineer, you get a lot of control about how stuff is passed around and how things are tested. Very interesting. Now on the decision to, or the announcement to, open source all of Delta Lake, what are some of the motivations that go into open sourcing software? And more generally, what are your thoughts on the future of open source innovation?
Starting point is 00:33:59 I mean, I think in general, like open source is a very powerful force in the software industry. And it's something that every software like development company has to keep in mind. And certainly like enterprises who buy software are very aware of it. And, you know, they're very aware, like everyone wants to design an architecture like in their company that's future proof. Nobody wants to, like if they can avoid it, you know, they don't want to pick something that they'll have to revamp in five years, because, you know, that vendor stopped doing the things that they need, or whatever, like, you know, the vendors now charging a lot of money, or whatever it is, right? Or like, you know, it's just locked into one. So I think everyone has to consider, you know, how to do it. And we just thought with Delta Lake, like, you know, at first we thought, oh, maybe this is for a few advanced users or something like that. But actually we realized like it improves everyone's quality of life, like working with these large lake house data sets so much that we actually want everyone to use it. And we want it
Starting point is 00:35:02 to be kind of a no brainer, like decision in terms of risk of like, will you use this versus a more classic like data format for your data, which doesn't have the nice features like transactions and so on. So that's why we wanted to make it open source so that people can feel like, yeah, this is something I can keep using, you know, decades into the future. And there'll be many vendors who support it. And I don't have to worry about petabytes of data locked into one vendor. And we're already seeing, already there are a ton of products that connect to Delta Lake, including all the major cloud data warehouses and all kinds of open source engines and stuff. And we're hoping to see more of that. That's one of the things we saw as a company building things around Spark
Starting point is 00:35:46 is we don't have to go and bug lots of companies to integrate with us and make their product work with us. There are so many products that work with open source Spark, so many libraries, whether free or commercial products that automatically work well on our platform thanks to the open interface.
Starting point is 00:36:03 And we want the same thing for the data. If you put your data in Delta Lake, you can use all the tools in the industry to collaborate on it. And you don't have to worry about that architectural choice. So improving access, improving adoption, and ease of extension with other tools of your liking. Yeah, exactly. Yeah. Great. I'd love to turn our attention to another hat that you wear. You know, while you aren't shaping the future of Databricks as chief technologist, you're actively involved in the future of computer science
Starting point is 00:36:34 as an assistant professor at Stanford University. So I would love to learn about some of the exciting research work that you're doing. So we were chatting briefly and you mentioned the Dawn project, which recently culminated. So I'd love to learn more about some of the sort of contributions of this work. I was very much moved by the mission to, you know, democratize AI by, you know, really making it dramatically easier for those to build AI powered applications. So what are some of the exciting contributions that you've observed with this project over the past number of years? So this is a project we started five years ago, like a group of faculty at Stanford. And we were really interested in this problem of like how to let more, you know, more people, more organizations successfully use AI. And we looked at it from a
Starting point is 00:37:23 whole bunch of angles. So for example, we had Kunle Olukotun, who was one of the faculty members. He works on hardware and programming interfaces and compilers, among other things. So he looked at that aspect of how can we make less expensive, more efficient hardware for AI, which is a super interesting area. I looked more from the system side. Peter Bailis was another professor who looked from the database side, and there were other folks as well. And a bunch of interesting findings came out of it. So one finding was, as you mentioned, that productionizing ML is quite difficult.
Starting point is 00:38:01 And for many groups, this was the bottleneck of going from like a prototype to, you know, like an actual application that like really works and, you know, has impact. And so one of the projects I worked on with Peter, for example, was a new way to sort of debug and monitor and improve the quality of ML models that's called model assertions, which is a little bit like software, like assert statements in software, where you have things you expect to be true about the application, and you can apply the same things to the behavior of models. And then you can actually automatically detect when they're doing things wrong and also use that to supervise the models, to train them, to make them avoid that kind of behavior.
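A toy sketch of the model assertion idea: a predicate over model outputs that flags implausible behavior, such as an object detection flickering on and off across adjacent video frames. The check below is a simplification written for illustration, and the per-frame label sets are hypothetical detector outputs, not data from the research.

```python
def assert_no_flickering(predictions, gap=2):
    """Model assertion: in video, an object should not appear, vanish, and
    reappear within a couple of frames. `predictions` is a list of label sets,
    one per frame (e.g. the output of some object detector). Returns a list of
    (frame_index, label) violations to inspect, relabel, or use as weak
    supervision when retraining."""
    violations = []
    for i in range(gap, len(predictions)):
        for label in predictions[i - gap] & predictions[i]:
            if any(label not in predictions[j] for j in range(i - gap + 1, i)):
                violations.append((i, label))
    return violations

# Toy usage: a pedestrian detection that flickers off in the middle frame.
frames = [{"car", "pedestrian"}, {"car"}, {"car", "pedestrian"}]
print(assert_no_flickering(frames))  # [(2, 'pedestrian')]
```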
Starting point is 00:38:45 So we showed some examples of that and like basically with working with data for autonomous vehicles and for video analytics. So that's like one takeaway. Another interesting takeaway was that in some areas of AI, getting labeled data is actually the bottleneck. It doesn't matter how expensive it is to design your model, if at all, or to train it, actually getting labeled data is hard. So Professor Chris Ré, another one of the PIs, had a whole line of work on minimizing the amount of human labeling you need and using weak supervision, which is basically like using
Starting point is 00:39:23 automated rules that guess at the label, but may not be fully accurate and learning from those. And he's had a lot of success with that in quite a few domains where you can write these generic rules and run them over a collection of, say, like legal documents or medical papers or stuff like that, and actually get a pretty good model. And in fact, you can do better than people who use just a labeled dataset, you know, with less effort without having to have people label, you know, millions of documents or images. So these were a couple of the interesting themes. But yeah, for me, the best part about it was just seeing people, you know, thinking about
Starting point is 00:40:03 this problem from all angles, and getting them all in a group together to talk about it and like kind of learning across these. So a lot of our work kind of ended up mixing insights from, you know, from the different areas that I think wouldn't have happened as easily without this group. Certainly, certainly. That's very interesting. Another line of research that actually excites me is some of the work on retrieval-based NLP. Undoubtedly, we've seen a lot of the great promise of large-scale language models, but there are also very clear limitations around high training costs, high inferencing costs, the true feasibility of actually productionizing these large models. There's explainability issues, you know, models are static, so there's freshness issues. So how does this sort of approach of
Starting point is 00:40:53 retrieval-based NLP work? And how does it address some of these fundamental issues that we observe with trying to make value out of these large-scale language models that exist today? Yeah, this is the research I'm probably most excited about in my group right now. So basically, so far, we've gotten some really great results in NLP with these giant models that have lots of weights, something like GPT-3. And the idea with these models is you have a collection of knowledge, like you have a bunch of documents from the web, and you train the model over it.
Starting point is 00:41:29 And it sort of incorporates that knowledge into the weights. And then when you do predictions, it can do stuff like it knows that the capital of France is Paris. It knows that the president of the US is Joe Biden, like whatever. It has all this knowledge that you know, that appeared in those documents, and it can use that in various tasks. But if you think about it, you know, these are very expensive to train. They're also very expensive to do inference with. And they're very hard to update, because if something changes, like if, you know, after the next election,
Starting point is 00:42:00 the president of the US changes, you know, you got to retrain this whole model from scratch to fix that. And you actually see this if you use GPT-3 today. Well, I don't know about today specifically, but definitely, you know, when we tried it like a while back, you know, it was returning the previous president of the US. It didn't know that this changed. So it's a problem. So the retrieval-oriented approach is to instead separate the knowledge. You have the documents and you have some neural networks, but then when you're given a task, like say you're being asked to answer a question, you search over the collection of knowledge somehow and then you read those documents. You pass the documents that you retrieved along with some context, like the question and other information about your task
Starting point is 00:42:45 into a smaller neural network and you produce the answer. And the nice thing about that is you can always update the knowledge because you can just change these documents. You also get quite a bit more interpretability. You get a sense of like, oh, why is it giving this answer? It's because of what's in this document, and maybe that was confusing. And it turns out to be a lot cheaper. So it depends on the task. These models can't do everything right now. But for some tasks such as question answering, these models are just orders of magnitude cheaper than the large language models. And they're much higher quality in terms of answers. And there's a lot of work in this space. There's now, for example, there are people using retrieval for language modeling, which is a very general use case. The RETRO model does that. There are people using them for images as well as
Starting point is 00:43:37 text. So retrieving images and text together and having interpretable results there. And there are people using them for more sophisticated applications. One of the ones we built can answer questions that require looking up many documents all at once, not just one. And it seeks out new knowledge until it has enough to answer the question, but basically looking at concepts that came up in these. So this still requires some index to actually do the retrieval from, right? I saw an interesting analogy of the open book,
Starting point is 00:44:11 sort of this retrieval-based NLP as being an open book exam, right? So there's still the need for the- Yeah, it's an open book. And actually a lot of the work is on, so that indexing itself is done, and that search is done using a neural network also by maybe embedding the documents into some kind of vector space and then searching for nearby vectors. And a lot of the work is on how to do that better also, which then immediately improves these. Or how to co-train the indexing and lookup together with, say, the question answering so that they're tailored for each other.
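A rough sketch of the retrieve-then-read pattern being described: embed the documents and the question into a vector space, find the nearest documents, and hand them to a reader model. The hashing encoder and the omitted answer-generation step are stand-ins for trained neural components, not any particular system from the research.

```python
import numpy as np

documents = [
    "Paris is the capital of France.",
    "Apache Spark is a cluster computing framework.",
    "The Eiffel Tower is in Paris.",
]

def embed(texts, dim=64):
    """Placeholder encoder: deterministic bag-of-words hashing into a vector.
    A real retrieval system would use a trained neural encoder here."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for token in text.lower().split():
            out[i, hash(token) % dim] += 1.0
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-9)

def answer(question, k=2):
    # Step 1: retrieve the nearest documents in the embedding space.
    doc_vecs = embed(documents)
    q_vec = embed([question])[0]
    top = np.argsort(-(doc_vecs @ q_vec))[:k]
    context = " ".join(documents[i] for i in top)
    # Step 2: a (smaller) reader model would generate an answer from the
    # question plus the retrieved context. That model is omitted here; we just
    # return the retrieved context to show the data flow. Note that updating
    # the system's knowledge only requires editing `documents`, not retraining.
    return context

print(answer("What is the capital of France?"))
```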
Starting point is 00:44:44 Very interesting. and lookup together with, say, the question answering so that they're tailored for each other. Very interesting. And you said there's already some line of work thinking about sort of image search. Are there sort of areas of investigation around multimodal information retrieval, like image to text, text to image? There's a little bit. Are there other areas where you see this potentially being applied? There's a little bit. There's not a ton yet, but there is some work with images.
Starting point is 00:45:06 I think it could be useful in other areas too. One example actually that I'm curious about would be reinforcement learning. Because if you think about it, you have this history of training exercises you did for your model. And again, even though you ran all those, you then just condensed everything into a bunch of weights. But what if you could look up like, what are past training situations that look similar to this? And like, what did I do? And what was the outcome? Maybe you'd be able to improve performance there. So my groups focus most on the NLP use cases, because there are so many of those and like, they're very easy to interact with. But I think it could be useful in other places too. If you think about
Starting point is 00:45:45 it again from the production ML perspective, if you're going to productionize one of these models, you want to be able to interpret what it's doing and also to fix it if something is wrong. And this gives you a nice, and you want it to be fast also, be fast enough to actually run. So this helps with that, but it also gives you these nice ways to see, wait, why is it making that prediction? And if I want to stop it from doing this, like, what do I change? Here, you can just change the documents it's pulling out that are misleading it. Right, right. To turn on to another responsibility that you have sort of in academia, as sort of a professor and as an instructor, what are, you know, some of the biggest gaps or opportunities that you see in computing education? I know certainly going through school myself,
Starting point is 00:46:32 I know there was a big interest in recent years in the field of machine learning, but ultimately to, you know, develop and deploy capabilities, it goes beyond model training. So what other areas do you feel that, you know, folks could sort of emphasize in computing education to really ensure that computer science graduates are well equipped? Yeah, I think one big gap has to do with all the shift to software as a service, or basically software being delivered through the cloud somehow. And this is happening everywhere, whether it's like just a user-facing kind of productivity app
Starting point is 00:47:17 like Google Drive or Salesforce, or it's actually the platform, right? Like you can get a database as a service from Google or from Microsoft, or you can get, you know, machine learning, whatever, like predictive time series, you know, predictions as a service from Amazon and so on. The thing is, all of these are now being delivered continuously. You know, my experience and the same with like, I think everyone who does it is that
Starting point is 00:47:42 building a production, you know, kind of cloud service and maintaining it is really hard. That's what we see at Databricks. And that's what many of the companies we work with see. And there isn't really any education on it. And I think it's more than engineering. I think it could require kind of new programming models that are going to work well. It could require new designs of systems. For example, how do I make my system so I can easily roll out a new version of my app and then roll back to an old one? Or how do I make it so it can isolate requests from different tenants and just
Starting point is 00:48:16 guarantee that no tenant can interfere with the performance of another tenant too much or stuff like that? So I think this is a super interesting area. I would love to have a class where you teach about these things, but also a class where instead of students turning in an assignment every couple of weeks with a little programming thing that we run some tests on, they actually deploy a service on day one, and then it has to keep operating and serving requests throughout the semester. And then, you know, they have to like, you know, put in a new feature, like implement,
Starting point is 00:48:51 I don't know, pagination, implement whatever GDPR compliance or whatever, without like corrupting the data or, or otherwise breaking it. Like I think, because that's what they'll have to do in a real job. Well, it seems like you have the syllabus for your next course. Yeah, potentially. I would certainly say there's certainly a huge divide between how folks are trained in university and sort of the reality of how things operate in industry and in production. And I think oftentimes internships or sort of hands-on experiential project-based learning becomes the best avenue to do that. And so I think this is certainly a very noble model of learning that could certainly benefit many folks as they make the transition and look to actually develop end-to-end systems. So to wrap it up, I'd love to touch on two things. One is, you know, we talked about a lot of your
Starting point is 00:49:42 work with Apache Spark and Databricks, but also your responsibilities as a researcher, as an advisor, as a professor. These are two very difficult jobs to manage. So how do you juggle your work in industry as sort of a practitioner, as the chief technologist at Databricks, with your role in academia as a researcher, as a professor, as an advisor? Do you find these two worlds colliding? Do you find them very different? And in one way, you know, does one role influence the other? Is it, you know, your industry sort of experiences really influencing your role in research? Or is it your research experiences that influence your work at Databricks? Yeah, I think the roles are definitely different. There are different kinds of concerns in each one. I do think a nice thing about a faculty job is it does give you flexibility to work with companies in various ways. And that is kind of part of the
Starting point is 00:50:37 Yeah, I think the roles are definitely different. There are different kinds of concerns in each one. I do think a nice thing about a faculty job is it does give you flexibility to work with companies in various ways. And that is kind of part of the job. That's just the way it works. But they are super different. I do learn a lot of stuff in both that, you know, influences what I do. Mostly, I think it's been seeing stuff in industry that I think, you know, is a big problem that isn't really studied in research. And then like kind of thinking about it from a research perspective, I've tried to keep them like fairly separate. And in most cases, because I don't want to have some kind of conflict of interest or like, you know, students feel they're working on something that like benefits Databricks or whatever. But there are often kind of just long-term things, like this thing with, you know, everyone in industry is writing services, but, you know, there are like, you know, hundreds of papers each year on like debugging and stuff like that that don't consider that.
Starting point is 00:51:24 That's an insight that can lead to work that benefits the whole industry. But things that are very specific, usually I would do that work only in the company. And certainly within the company, it's helped just to see all the perspectives you see at a university on, say, the future of hardware or ML models or stuff like that. But they're both interesting. And I think the big difference is, like, in industry you often get a lot more resources behind a particular thing, and it's very hard to match that in academia. You can't hire like, you know, tens or hundreds of engineers to build a thing and then to maintain it.
Starting point is 00:52:01 And of course, in academia, you get to explore a lot of stuff. And, you know, if something doesn't work, academia, you get to explore a lot of stuff. And, you know, if something doesn't work, you can just switch to something else and so on. So like, it's very flexible that way. And you get to, you know, to teach people in a different way as well. Very interesting. So to wrap it up, I would love to sort of provide you the opportunity to share any parting remarks, but to provide some structure, what are some exciting future directions that you see for the field of data management, machine learning, or computing at large? And what are some of the exciting areas that you hope
Starting point is 00:52:35 to see some of the greatest promise in? Yeah, I mean, I think this is a great time to look at kind of both machine learning and computer systems. And I would say, I mean, I think everyone is very excited about the potential of these like large models and deep learning workloads and so on. I think it's super interesting. But I would also say, you know, if I were to give advice to like students or people getting into the field, I would also say to take a look at the computer system stuff, because there is this huge change to cloud computing and these systems ultimately underlie a lot of, you know, what's happening in other places. And there'll be a lot
Starting point is 00:53:17 of demand for like engineers, researchers, and so on that know about this stuff. Even the large language model stuff, I think in many places, it's become basically a systems problem of like, how do we scale out things more? How do we do it more cheaply and so on to train more models? And it's become less of a modeling problem. Of course, there are a lot of improvements you can do there too. But the point is that like, you know, if you can do that stuff well, it will have a lot of impact on real applications that are out there. Awesome. Well, it's been a pleasure speaking with you, Dr. Matei Zaharia.
Starting point is 00:53:51 Thanks so much for joining us at ACM ByteCast. Thanks so much for having me. ACM ByteCast is a production of the Association for Computing Machinery's Practitioner Board. To learn more about ACM and its activities, visit acm.org. For more information about this and other episodes, please visit our website at learning.acm.org/bytecast. That's learning.acm.org slash ByteCast.
