ACM ByteCast - Matei Zaharia - Episode 32
Episode Date: December 13, 2022
In this episode of ACM ByteCast, Bruke Kifle hosts Matei Zaharia, computer scientist, educator, and creator of Apache Spark. Matei is the Chief Technologist and Co-Founder of Databricks and an Assistant Professor of Computer Science at Stanford. He started the Apache Spark project during his PhD at UC Berkeley in 2009 and has worked broadly on other widely used data and machine learning software, including MLflow, Delta Lake, and Apache Mesos. Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF Career Award, and the US Presidential Early Career Award for Scientists and Engineers. Matei, who was born in Romania and grew up mostly in Canada, describes how he developed Spark, a framework for writing programs that run on a large cluster of nodes and process data in parallel, and how this led him to co-found Databricks around this technology. Matei and Bruke also discuss the new paradigm shift from traditional data warehouses to data lakes, as well as his work on MLflow, an open-source platform for managing the end-to-end machine learning lifecycle. He highlights some recent announcements in the field of AI and machine learning and shares observations from teaching and conducting research at Stanford, including an important current gap in computing education.
Transcript
This is ACM ByteCast, a podcast series from the Association for Computing Machinery, the world's largest education and scientific computing society.
We talk to researchers, practitioners, and innovators who are at the intersection of computing research and practice.
They share their experiences, the lessons they've learned, and their own visions for the future of computing.
I am your host, Bruke Kifle.
Machine learning is undoubtedly transforming the world we live in.
Advancements in modern computing technologies paired with the generation and availability
of massive quantities of data have been key to enabling the adoption of machine learning
across a wide range of industries and domains.
However, with massive quantities of diverse data, there is a clear need for a highly performant,
general distributed processing system for big data workloads that allows users to process,
transform, and explore big data sets. Our next guest, Dr. Matei Zaharia, has worked to achieve
that and much more in the field of data management and machine learning. Dr. Matei Zaharia is the chief technologist and co-founder of Databricks, as well as an
assistant professor of computer science at Stanford. He started the Apache Spark project
during his PhD at UC Berkeley in 2009, and has worked broadly on other widely used data and
machine learning software, including MLflow, Delta Lake, and Apache Mesos.
Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award,
an NSF Career Award, and the U.S. Presidential Early Career Award for Scientists and Engineers.
Dr. Matei Zaharia, welcome to ByteCast.
Thanks so much for having me here, Bruke.
So I'd love to start with a question I often like to lead with, Matei. Can you tell us more about your background and some key inflection points throughout your personal, academic,
and professional career that ultimately led you to the field of computing and what you do today?
Sure. Yeah. So let's see. So I was born in Romania, in Europe, and I grew up mostly in Canada.
And I went into computer science in university, you know, mostly because I liked programming.
And I also liked how quickly you can just try sort of the latest techniques for everything,
you know, because you could just run everything on your computer. You don't need
special equipment or anything like that. And I was fortunate. I went to the University of Waterloo. I was fortunate to work with this
networking professor, Srinivasan Keshav, who got me interested in research. So I was doing research
and networking and peer-to-peer systems part-time alongside doing my undergrad. And after that, I applied to PhD programs, and I ended up at UC Berkeley,
working with Scott Shenker and Ion Stoica,
again, mostly on networking things initially,
but I became pretty interested in large-scale data center computing
and frameworks like MapReduce
and just all the distributed computing frameworks
that were coming out, as well as cloud computing.
So that's what put me on the path towards Apache Spark and towards understanding these
workloads, looking more at machine learning as well.
And it was definitely the right time to start exploring that in the research world because
these technologies went from being used at a few large web companies to pretty much every other organization in the world.
So it's been a great, fun kind of field to be in.
Oh, certainly. And you mentioned, of course, the important role of professors, faculty mentors who ultimately guided your interest in research, as well as this new field that you're in. You highlighted the
development of Spark at Berkeley, of course, as being a key inflection point in your journey. But
can you help us understand what is Apache Spark and what are some of the motivations for its
development, considering some of the existing solutions at the time with MapReduce, for
instance? So Apache Spark is basically a framework for writing programs that are going to run on a large cluster of nodes and process data in parallel mostly.
And, you know, there are a bunch of different components to it, but the core of it is just this API where you can write basically single machine code in Python or in Java or in other languages just on a single machine.
And you can use these functional operations like map and reduce
and other data processing operations like joins and group-bys and so on.
And you can write a program using these,
and then Spark will take that program
and automatically parallelize it across a cluster,
including shipping functions that you wrote
and having them run in parallel on lots of items
and then giving you back the result.
So what it means is that, like anyone who's learned
how to do some data processing on a single machine,
say with libraries like pandas in Python,
gets a similar library for working with a big distributed collection
of data across a cluster. And you
can also use your favorite, you know, your favorite kind of single node libraries as part
of your program and just call them in parallel on lots of data. The goal is to make it very
accessible for many different types of developers and data scientists, just like people who write
programs to run something potentially at large scale. And on top of this basic sort of function that the engine provides,
there's also a really rich ecosystem of libraries.
So there are libraries on top that run a full-fledged SQL engine
that can do standard kind of analytical database workloads.
There's a machine learning library that gives you lots of built-in algorithms.
There's an incremental stream processing system, so you can write a computation and then
Spark will automatically update the result as new data arrives. And there are many more libraries
in the community that are just built on it. So it's also a nice framework for just combining
these high-level libraries into a bigger application.
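To make that concrete, here is a minimal PySpark sketch of the programming model described above: ordinary-looking Python with functional operations, group-bys, and SQL that Spark parallelizes across a cluster. The input paths and column names are invented for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bytecast-sketch").getOrCreate()

# Functional operations (map/reduce) on a distributed collection of lines.
lines = spark.sparkContext.textFile("data/logs.txt")  # hypothetical input file
word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))
print(word_counts.take(10))

# The same engine also exposes DataFrames, SQL, joins, and group-bys.
events = spark.read.json("data/events.json")  # hypothetical input file
events.cache()                                # in-memory sharing across steps
events.groupBy("date").count().show()         # assumes a "date" column

events.createOrReplaceTempView("events")
spark.sql("SELECT date, COUNT(*) AS n FROM events GROUP BY date").show()
```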
And I think compared to the tools that existed before it, I would say Spark was one of the first to really focus on opening up these kind of large systems beyond software engineers.
So figuring out APIs that just a data scientist or someone who doesn't just write programs
as their main job, but has maybe like a
math background or like domain expertise can still be successful with them. So that was one difference,
the focus on Python and R, for example, helped with that. And then the other difference is it
focused a lot on composability, both from the programming perspective, you should be able to
just call things that other people hold as libraries, like say a machine learning algorithm.
And from the efficiency perspective, Spark can do very efficient in-memory data sharing
between different steps of the computation.
And that's what enables things like iterative algorithms for machine learning or stream
processing or interactive SQL queries.
So these were kind of the differences from the previous engines that existed.
And so a lot of the research was like figuring out
how to make these things work
and still be fault tolerant and efficient and so on.
But once we did that,
you could get this really great ecosystem of libraries on top
that now users can just combine to do things.
I see.
So certainly abstraction, composability, and of course, cost efficiency.
Are there other salient features that you believe have made Apache Spark, you know, sort
of dominate and become the framework of choice for big data distributed processing?
I think a lot of the other ones kind of stem from this and also from the great community
that's contributed to it and,
you know, that's formed around it. So, you know, we went from basically an academic research project,
mostly developed by grad students, to something that a lot of the major tech companies actually
started using and contributing to. And over time, also like other, you know, kind of non-tech enterprises also
started building on it. So, and these have helped the system become a lot better and have helped
test it at a huge scale and just make sure it works on, you know, the widest possible range of
workloads and data types. So I think for a lot of people today, it's like, it's just a nice ecosystem
to build on if you want something that's going to be reliable and that's well supported across the industry that connects to pretty much every data source there is out there.
But I think a lot of the reason why people built on it were kind of these design decisions, for example, to make it cheaper and efficient and also just easy from a programming standpoint to compose things
and also to make it possible for the engine to distribute and optimize the combination
underneath these APIs so we can keep improving the performance of your existing job,
like whenever you upgrade Spark without you having to rewrite your job.
And so a whole bunch of work has gone into that to make it kind of declarative
and make it possible to optimize
things. So you said something earlier, which was quite interesting, which is the ability to,
you know, scale something that was once simply a research project into, you know, a solution that's
widely adopted across the industry and, you know, by Fortune 500 and, you know, large enterprises.
So as you look back at its origins in academia
and the open source community,
how was Databricks, as it is now, spun out?
And how did the early days of Apache Spark
ultimately lead to you creating
and co-founding Databricks as it is today?
Yeah, so I think being able to start a company
that's just working to improve this technology and then to provide commercial services around it was very important to help us all really build out the project and get it to the next level.
So I started working on Spark, I guess, in 2009.
I think we released the very first alpha open source version in 2010.
And that's when I was just a grad student.
And over time, like more students at Berkeley
started kind of collaborating on it
and building things on top of it.
And at the time we saw, you know,
there was quite a bit of interest,
mostly from more tech-centric companies,
but also from users, like I mentioned,
the data scientists, like the non-software engineers
who still wanted
to run large-scale computation in other organizations. And so we saw that there is
demand for something like this, and it's quite interesting to figure out how to build it well
and what to do in it. And so we encouraged the open source community. We encouraged contributions
from outside. We reviewed patches. We moved the
project into Apache Software Foundation as kind of a long-term home that's independent of the
university. And so that helped. But we also realized that just by doing research, we can
never... It would be hard to invest a huge amount of effort into it and just have people working full
time to make the project better.
So we were also excited.
We saw there is enough interest to justify creating a company in this space.
And we didn't want a company that just does, say, support and services around Spark.
So we actually started a company that just tries to provide a complete modern data
platform based on what we saw working at kind of the most tech forward companies with it. And that
does it in the cloud because that was another major kind of shift happening in the industry.
But launching the company also helped us invest more into Spark and also get it to the point where
it was good enough that
other companies started contributing heavily and using it in production.
And it kind of grew from there.
So we were excited to try to launch a company in this space, even regardless of
Spark, just because we thought it's an interesting problem and everyone is going to shift the
way they do data management as they move to the cloud. But yeah, it also, I think, helped really like cement our ability to contribute to
the open source project. And we could hire engineers who just work on it and help build it out.
So as you think about some of the solutions or offerings, you mentioned some different
consumers or users that could benefit, whether it be data scientists, data analysts, business analysts,
those who maybe aren't software developers or don't have a software background.
Who are sort of the key stakeholders that you think about when you design solutions at Databricks?
And when we talk about this unified analytics platform, what do we mean by unified analytics
and who's sort of the beneficiary of this?
Yeah, great question. Yeah.
So the way, you know, many organizations work today, both companies and kind of like things
like research labs and so on, like basically, you know, just organizations that work with data,
they always have many different types of users who all want to do stuff with the data that they've
collected. Like, you know, whether it's a research lab
and you've got all this information collected from experiments
or, you know, like telescopes or whatever it is,
or it's a company and there's like all this information
about how it's operating.
Everyone wants to, you know, to understand what's happening
or to build applications that use it,
like say a predictive modeling application
or something like that.
But, you know, these people come in with very, very different backgrounds. It's very valuable
to them to have a common sort of representation of all the data, like everyone agrees on the
data types you have, the tables, the schemas, all that stuff. And also a common sort of query
language or like a common, you know, like semantics of different operations. So for example, maybe an engineer can write a function for computing a particular thing, like say about your customer, and then anyone in the organization can call that function and get that metric in an accurate way, as opposed to everyone trying to implement it in a different tool. So with the Spark engine, we tried to have
one engine that can offer these interfaces that work for different people. So for example,
there's the SQL interface, which works for the widest range of users, including users who just
connect a visual analytics tool like Tableau or Power BI. These are tools where you drag and drop to create
a chart, and they actually run SQL on the backend. Or something like Excel, which can also connect that
way. So that's like one extreme. Then there's like the sort of users who will write in a scripting
type language like Python. This would be someone like a data scientist where they do programming,
but they're not just doing software engineering all the time. They're trying to answer questions or prototype things. And so that's where the Spark Python APIs help. And then there are the data and software engineers who are, you know, responsible for like the most important data pipelines being reliable, things being
computed correctly. You know, if anything's broken, they'll
get paged in the middle of the night to fix it.
And they want, you know, to use kind of the best software engineering tools out there
to be able to test things, have static types to understand what's flowing through the system,
you know, do different, like clone the job and try out
different versions of it and compare the results and so on. And so we try to design like the data
model in Spark and in the Databricks platform overall, including the other pieces like the
storage component is all the same for everyone. So everyone can agree on, hey, here are the tables,
here are the data types, here's what they all mean. The query model is also very similar.
So that means someone can write a function
that other people can use,
which is great for handing off kind of knowledge between teams.
But then the actual interfaces,
you can use these different ones
that can all call into the same functions
and into the same data model,
but are tailored to different sort of user
personas.
And we work a lot with all of these to make sure they can understand these things and
that they map onto this common model that actually lets the organization work in a unified
way.
And it might seem a little bit obvious, but historically, at least this hasn't been the
way that most companies work with data,
because especially before the cloud and the on-premise world, you would buy these different
systems and provision them on different servers. And they would each have their own way of storing
data, their own query language, and so on. And then you'd have to do a lot of work just to
connect them together and to make things consistent.
So we do think there's an opportunity to simplify things with this one engine that can do the different types of workloads and then one data store that is in the cloud that everything
can connect to.
You don't need to deploy different data stores for different workloads.
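As a small illustration of that shared-function idea, here is a hedged PySpark sketch: one team registers a hypothetical metric function once, and both SQL users (or BI tools issuing SQL) and Python users call the same definition against the same tables. The metric name and columns are made up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()

def is_active_customer(days_since_last_order):
    # Hypothetical business rule, defined once by one team.
    return days_since_last_order is not None and days_since_last_order <= 30

# Register the function so SQL, BI tools, and notebooks all share it.
spark.udf.register("is_active_customer", is_active_customer, BooleanType())

customers = spark.createDataFrame(
    [("alice", 12), ("bob", 90)], ["name", "days_since_last_order"])
customers.createOrReplaceTempView("customers")

# A SQL user calls the shared definition...
spark.sql("""
    SELECT name FROM customers
    WHERE is_active_customer(days_since_last_order)
""").show()

# ...and a Python user calls exactly the same definition.
customers.filter(F.expr("is_active_customer(days_since_last_order)")).show()
```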
I see.
So what are some actual compelling scenarios or use cases of this one engine, one data store that you're seeing across industries from those who are actually adopting this solution? Yeah, so one is the classic data warehousing workloads. This is what you would do with, say, a system like Teradata,
for example, which is, hey, you load data into some tables, you can transform it, work on it,
usually with SQL, and then you can query it with SQL and maybe serve it through these interactive
visualization tools or compute some reports and send those to people. So this is kind of simple
analytics workloads where like you load it
and then you ask some queries about it. And for that stuff, you know, using Spark for that means
you'll be able to run it at very large scale in the cloud and you'll be able to have separated
compute and storage. So like while you're running, you know, one query, you don't slow down the whole
system. Other things can keep working and access the data.
But it also means you could potentially use the other functions like machine learning or streaming on top of the same data model.
So some people are just saying, hey, I want this classic stuff, but I want it in the cloud and in an elastic way where I don't have to worry about like, you know, how many CPUs I provision and how much storage. And at the other extreme, like a lot of organizations,
like virtually every, you know, like Fortune 1000
and probably even beyond that company
has a machine learning team now and has a data science team.
And they're all trying to figure out how to do,
you know, predictive analytics,
how to do features in their products
that actually use machine learning in some way,
like say recommendation engines or churn prediction or predicting failures. And there
are some really cool use cases that we've seen there. So for example, we saw a lot of the biotech
companies are now developing new drugs for diseases based on analyzing large data sets and understanding what's happening in them.
A lot of industrial companies have instrumented everything they put out. For example,
every tractor you purchase from, like, John Deere now has lots of sensors on it that, you know,
evaluate how it's working and can recommend, you know, when to fix pieces or can tune it for
optimal performance. Same thing with, like, every jet engine that's produced, you know, that's used in
airplanes and stuff like that.
They're building all these interesting applications based on it.
And I think the really exciting thing for me is allowing, you know, people with minimal
effort to be able to do these kind of more cutting edge applications, machine learning,
streaming, and so on,
on top of data they're collecting,
in addition to just kind of the classic applications they can do.
You mentioned one term, data warehouses,
and obviously there's been evolutions.
We've had data warehouses, data lakes,
and now there's the rise of the data lakehouse paradigm
for modern data management.
Why the need for this new paradigm?
What challenges does it address? And why aren't warehouses and lakes efficient?
Yeah, I'm happy to talk about it. I think there are a couple of different things that sometimes
get conflated here with these. So there's like data warehouse systems, that's like the actual
software that's managing data. And then there's also, there are these architectural terms,
like there is an idea of data warehouse as an architecture
for managing data in an organization, which says,
hey, before you open data to lots of users,
like have a formal way of organizing it
and defining different tables,
defining relationships between them so that it's not a mess,
so that you can keep it accurate over time and extend it and make sure everyone sees correct results.
So both of these things are being, you know, actually kind of being, you know, rethought in
various ways. The one I talk about the most is the technology piece. So historically, if you wanted a system that can store lots of data, historical
data, and then can do fast queries on it using SQL, you build these data warehouse systems. And
they were designed to be deployed on their own servers, right? Like when you're an on-premise
kind of company, you have to buy new servers to deploy a new piece of software. So they all
assumed that they have full control of the data and they to deploy a new piece of software. So they all assumed that
they have full control of the data and they're the only interface to that data. So they were all
using basically proprietary custom storage formats. And then the only way applications
talk to them is through SQL. And then within that, they get really great performance.
Now, when you have everything in the cloud, and when you start having applications that
don't speak SQL, such as machine learning or maybe streaming applications, it becomes a bit of a
problem that you've got lots of data locked into something that basically only one system can query
through only one interface. So that's where Lakehouse comes in. So the other kind of model that goes under it is what's called a data lake.
Data lake is basically just low-cost storage where you can just put files in any format.
And this is what a lot of the Hadoop and Apache Hive kind of open source ecosystem built up.
They just said, look, I want to manage large amounts of data without loading that into
these kind of limited proprietary systems.
I just want to very cheaply store it.
And then I'll load subsets of it later and do more sophisticated analysis.
So data lakes are just based on low cost storage and this kind of file interface where you usually use open formats that many apps can read for your files. And then Lakehouse is this emerging trend to kind of
combine the two and to get the data warehousing like performance and management features,
things like ACID transactions, on top of low-cost storage and open formats. So you don't have to
convert the data to a different format and move it to a different system just to get fast queries on
it. And that's the model that we think is going to be
the future. We saw like a lot of the kind of digital native, like new tech companies who had
to build their stack from scratch, just build something like this from day one. They never had
the separate systems. And at Databricks, like that's kind of the, we decided to focus our platform
around this model and to figure out how to do that well.
And we think there's no technical reason why data in open storage formats can't be used to provide
really great performance or to provide transactions or management features or all the things people
expect. So we're just trying to give people that and to just simplify their data management overall
by having one system
that every app can connect to. I see. So combining the best of, you know, the low cost storage of
data lakes and, you know, the performance and manageability features. Right. Exactly.
It is basically like a technical problem of like how to do that well, but it seems that if you can
do it well, it's very useful.
Like it just simplifies things.
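For a concrete feel of that, here is a minimal sketch using open-source Delta Lake, which comes up again later in the conversation: transactional writes and time travel over plain files in an open format. It assumes a Spark session configured with the delta-spark package; the path and columns are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the Delta Lake JARs are on the classpath
# (e.g. Spark started with --packages io.delta:delta-core_2.12:<version>).
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/events_delta"  # could just as well be an s3:// or abfss:// path

# Transactional write: Parquet data files plus a transaction log.
spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"]) \
     .write.format("delta").mode("overwrite").save(path)

# A second, appended version of the table.
spark.createDataFrame([(3, "click")], ["id", "action"]) \
     .write.format("delta").mode("append").save(path)

# Read the latest version, then "time travel" back to version 0.
spark.read.format("delta").load(path).show()
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```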
I want to turn my attention to another project that you created and have been actively contributing to, which is MLflow.
You know, one of the most challenging aspects of productionizing machine learning is not actually training the models,
but as you might predict, it's the deployment and the monitoring to actually ensure, you know, your production grade applications are, you know, as you'd expect them. So, you know,
as you work on MLflow, how does MLflow actually help address some of the important pain points
and challenges in the, you know, ML development and deployment lifecycle? Basically, what we've
seen is at almost every company that productionizes machine learning
that actually tries to use it in a product that has to run well, they end up building what's
called an ML platform or sometimes ML ops platform, basically a whole bunch of infrastructure to
support the machine learning application. And this does a bunch of things. First of all, it usually trains a model,
like retrains a model periodically,
kind of automatically, because data is changing.
It monitors that.
It gives you metrics about what's happening.
And it maybe will alert if things are way off,
and new model isn't doing well,
or the data looks different.
It versions all the artifacts,
so you can kind of roll back as you do development
and see what happened in the past.
And it also handles deploying the model
and then actually serving it.
And so, for example,
a lot of the large tech companies
build infrastructure like this.
Like, for example, FB Learner at Facebook
and Michelangelo at Uber
and many other systems.
But even the non-tech enterprise companies we
talked to all had something like this. So with MLflow, we basically created an open source
project that handles this problem. It's an open source kind of extensible ML platform project.
And what it does is it gives you some built-in functions for common things people need to do with machine learning.
Like, for example, packaging a model and deploying it in different serving systems or tracking metrics about it or, you know, sharing experiments with a team and collaborating on those.
And so it gives you these built-in things.
And it also gives you this extensible framework where you can plug in new pieces.
Like, for example, if you say, you know, when I build a new version of my model, I want it to pass through some custom reviewing steps, like maybe an automated test, and then a human that
approves that that says, you know, this is actually, you know, a good model or like whatever,
you can plug that into MLflow in various ways to the APIs, and you can build this custom workflow on
top of it. So that's what it does. And we see, again, a lot of teams use it as they move from
just doing some experiments and creating a cool model to creating an application that's supposed
to run all the time, maybe retrain periodically and be very easy to monitor. And we've tried to
do something that people can use even during the experimentation phase.
It saves you a little bit of time
in experiment management and collaboration,
but then it puts you in a spot
where you can quickly shift the model to production.
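As one hedged example of what that looks like in code, here is a minimal MLflow tracking sketch: log parameters, a metric, and a packaged model so teammates can compare runs and later deploy the logged artifact. The dataset and model choice are arbitrary examples.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Everything logged here shows up in the tracking UI and can be
    # compared across runs or promoted toward production later.
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mse", mean_squared_error(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, artifact_path="model")
```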
And can such a model or a platform be useful
for detecting and managing things like concept or data drift, shifts in your
upstream data and how that might be impacting production?
Yeah, definitely.
Yeah.
There are some built-in features in MLflow, like there's integration with SHAP for explainability,
but it also allows you to put in some custom processing steps for your data, for testing
your model and validating it, also for any data,
like doing model serving and inference for the results that are coming back from that.
So you can use it to systematically plug in things into the pipeline. I should say MLflow
doesn't provide its own algorithms. We're not trying to create a better explainability algorithm
or something like that. We're just giving you kind of the programming model or programming framework where you can you can write your application to have these pieces.
And it's very easy to instrument parts of it and to observe what's happening, like to automatically collect information and to show it to people and, you know, let them plug in things that listen to that information. So it's a lot like, you know, like when you run, say, a web application, there are these
frameworks for how to build it that will handle certain things and make it easy to like roll
forward and roll back.
Things like Ruby on Rails, for example.
It's a lot like that.
It's not, we're not trying to provide new algorithms or anything.
I see. Yeah, I think as someone working in
the space of deep learning and production sort of applications of deep learning,
this is certainly an area that interests me personally. And as it relates to, you know,
productionizing machine learning being one of the biggest challenges, I think
most practitioners can certainly attest to that.
ACM ByteCast is available on Apple Podcasts, Google Podcasts, Podbean, Spotify, Stitcher,
and TuneIn. If you're enjoying this episode, please subscribe and leave us a review on your favorite platform. Looking forward, I know there was the recent data and AI
sort of flagship summit. Databricks announced some of the recent announcements and some of the
upcoming features or capabilities that they hope to make available. What are some recent
announcements that are quite exciting for you and looking for what are some future
directions that you look forward to? Yeah, definitely. There's a lot of cool
stuff happening in this space. I'll just mention a couple of them that I'm personally excited about.
So one is on the data system side. So this summer at the Data and AI Summit, as you mentioned,
Databricks actually open sourced all of this storage management layer we have developed called Delta Lake.
So this is what enables that kind of lake house pattern where you have an open format for your data and you can have rich data management features on top,
like ACID transactions and versioning and time travel, like rollbacks, things like that,
and also improve performance. And so Delta Lake has been open source for a while, but Databricks
always had some proprietary kind of enhancements for performance on top of it and for like some
connectors to certain systems. And we just saw that, you know, this went from a brand new kind of product in 2018 with no users on it to something where I think more than 70% of data that our customers put in the platform is in Delta Lake today.
So it became kind of this essential building block. And so this year, we decided to open source everything, like even the advanced performance features, so that, you know, more companies,
more products can easily build on this and integrate it and people can, you know, can
manage the data in it and not worry about like, hey, is this only usable from Databricks
or somewhere else?
So I'm really excited about it because I think it's one of the kind of best and most novel
pieces of technology we've built.
And there's already a bunch of interesting
research on these kind of systems. And I think open sourcing it is going to enable a lot more.
I've also, I've talked to like a whole bunch of researchers who are trying to do new things to,
you know, make these kinds of lake house systems more powerful and more efficient.
So that's one thing. If you're someone working in the system space,
definitely encourage you to check it out. There aren't many things that go from like zero to 70 percent of like how someone stores stuff, which is one of the more critical things you have to do with data is like store it reliably in such a short time period.
The other one I'll mention, on the machine learning side, is in MLflow. So we're starting this new component to help simplify kind of the handoff of machine learning applications between the ML sort of experts and then the engineering teams that operate the applications. Because again, there are different user, you know, backgrounds and user types who do these different things. You've got
someone that's more like a data scientist or like, you know, like an ML researcher who develops
models and knows how to evaluate them and maybe how to make, you know, like how to tweak them to
make them better. And then you've got a production engineer that knows about like, how do I monitor
a thing, make sure it's working? How do I improve the performance?
How do I set it up so I can operate and roll back
and deal with outages and stuff like that?
And we found it's really hard to have people that do both well.
And it shouldn't really have to be that way.
It should be possible to have different users who focus on these aspects.
And so we have this new
component that's called MLflow Pipelines that lets basically the engineering team create a
pipeline where the ML researcher fills in specific steps of it, but the whole pipeline is operable and
instrumentable and controllable by the engineering team. So basically, it's a way to modularize the code everyone is writing.
As a researcher, you get the contract that, hey, if you work within this API, your thing
will immediately be productionizable.
You don't have to change your code and risk any problems with that.
And as an engineer, you get a lot of control about how stuff is passed around and how things
are tested.
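To illustrate the contract being described here, a generic sketch follows. This is not the actual MLflow Pipelines API, just a toy illustration of the modularization idea: the engineering team owns the pipeline skeleton and its quality gates, while the researcher only fills in well-defined steps behind a fixed interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class PipelineSteps:
    # The researcher supplies these callables; everything around them
    # (scheduling, validation, deployment, rollback) stays with engineering.
    train: Callable[[Any], Any]                        # data -> fitted model
    evaluate: Callable[[Any, Any], Dict[str, float]]   # (model, data) -> metrics

def run_pipeline(steps: PipelineSteps, data: Any, quality_gate: float):
    model = steps.train(data)
    metrics = steps.evaluate(model, data)
    # Engineering-owned gate: only models that pass move toward production.
    if metrics.get("accuracy", 0.0) < quality_gate:
        raise RuntimeError(f"Model rejected by quality gate: {metrics}")
    return model, metrics

# Toy researcher-supplied steps: the "model" is just the mean of the data.
steps = PipelineSteps(
    train=lambda data: sum(data) / len(data),
    evaluate=lambda model, data: {"accuracy": 0.97},  # placeholder metric
)
print(run_pipeline(steps, [1, 2, 3], quality_gate=0.9))
```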
Very interesting. Now, on the decision to, or rather the announcement to, open source all of Delta Lake,
what are some of the motivations that go into open sourcing software? And more generally,
what are your thoughts on the future of open source innovation?
I mean, I think in general, like open source is a very powerful force in the software industry.
And it's something that every software like development company has to keep in mind.
And certainly like enterprises who buy software are very aware of it.
And, you know, they're very aware, like everyone wants to design an architecture like in their company that's future proof.
Nobody wants to, like if they can avoid it, you know, they don't want to pick something that they'll have to revamp in five years, because, you know, that vendor stopped doing the things that they need, or whatever, like, you know, the vendors now charging a lot of money, or whatever it is, right? Or like, you know, it's just locked into one. So I think everyone has to consider, you know, how to do it. And we just thought with Delta Lake, like,
you know, at first we thought, oh, maybe this is for a few advanced users or something like that.
But actually we realized like it improves everyone's quality of life, like working with
these large lake house data sets so much that we actually want everyone to use it. And we want it
to be kind of a no brainer, like decision in terms of risk of
like, will you use this versus a more classic like data format for your data, which doesn't have the
nice features like transactions and so on. So that's why we wanted to make it open source so
that people can feel like, yeah, this is something I can keep using, you know, decades into the future.
And there'll be many vendors who support it. And I don't have to worry about petabytes of data locked into one vendor. And we're already seeing, already there are a ton
of products that connect to Delta Lake, including all the major cloud data warehouses and all kinds
of open source engines and stuff. And we're hoping to see more of that. That's one of the things we
saw as a company building things around Spark
is we don't have to go and bug lots of companies
to integrate with us
and make their product work with us.
There are so many products that work with open source Spark,
so many libraries,
whether free or commercial products
that automatically work well on our platform
thanks to the open interface.
And we want the same thing for the data.
If you put your data in Delta Lake, you can use all the tools in the industry to collaborate on it.
And you don't have to worry about that architectural choice.
So improving access, improving adoption, and ease of extension with other tools of your liking.
Yeah, exactly. Yeah.
Great. I'd love to turn our attention to another hat that you wear.
You know, while you aren't shaping the future of Databricks as chief technologist,
you're actively involved in the future of computer science
as an assistant professor at Stanford University.
So I would love to learn about some of the exciting research work that you're doing.
So we were chatting briefly and you mentioned the DAWN project, which recently culminated. So I'd love to learn more about some of the sort of contributions
of this work. I was very much moved by the mission to, you know, democratize AI by, you know, really
making it dramatically easier for those to build AI powered applications. So what are some of the
exciting contributions that you've observed with this project over the past number of years? So this is a project we started five years ago,
like a group of faculty at Stanford. And we were really interested in this problem of like how to
let more, you know, more people, more organizations successfully use AI. And we looked at it from a
whole bunch of angles. So for example,
we had Kunle Olukotun, who was one of the faculty members. He works on hardware and programming
interfaces and compilers, among other things. So he looked at that aspect of how can we make
less expensive, more efficient hardware for AI, which is a super interesting area.
I looked more from the system side.
Peter Bailis was another professor who looked from the database side, and there were other folks as well.
And a bunch of interesting findings came out of it.
So one finding was, as you mentioned, that productionizing ML is quite difficult.
And for many groups, this was the bottleneck of going from like a prototype
to, you know, like an actual application that like really works and, you know, has impact.
And so one of the projects I worked on with Peter, for example, was a new way to sort of
debug and monitor and improve the quality of ML models that's called model assertions,
which is a little bit like software, like assert statements in software, where you have things you expect to be true about the application,
and you can apply the same things to the behavior of models.
And then you can actually automatically detect when they're doing things wrong and also use
that to supervise the models, to train them, to make them avoid that kind of behavior.
So we showed some examples of that and like basically with working with data for autonomous
vehicles and for video analytics.
So that's like one takeaway.
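A toy sketch of that model-assertions idea, assuming a video object-detection setting: declare a property the model's outputs should satisfy (here, objects should not flicker in and out between adjacent frames) and flag violations for debugging or corrective supervision. The data format and rule are illustrative, not the paper's actual code.

```python
from typing import List, Set

def detected_ids(frame_predictions) -> Set[int]:
    """Object-track IDs the model predicted in one video frame."""
    return {p["track_id"] for p in frame_predictions}

def flicker_assertion(video_predictions: List[list]) -> List[int]:
    """Flag frames where an object is present before and after but missing
    in between -- physically implausible, so likely a model error."""
    flagged = []
    for t in range(1, len(video_predictions) - 1):
        before = detected_ids(video_predictions[t - 1])
        now = detected_ids(video_predictions[t])
        after = detected_ids(video_predictions[t + 1])
        if (before & after) - now:
            flagged.append(t)
    return flagged

# Flagged frames can be surfaced for debugging, or used as corrective
# (weak) labels to train the model away from that behavior.
preds = [[{"track_id": 7}], [], [{"track_id": 7}]]
print(flicker_assertion(preds))  # -> [1]
```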
Another interesting takeaway was that in some areas of AI, getting labeled data is actually
the bottleneck. It doesn't matter how expensive it is
to design your model, if at all, or to train it, actually getting labeled data is hard.
So Professor Chris Ré, another one of the PIs, had a whole line of work on minimizing the amount
of human labeling you need and using weak supervision, which is basically like using
automated rules that guess at the label,
but may not be fully accurate and learning from those. And he's had a lot of success with that in
quite a few domains where you can write these generic rules and run them over a collection of,
say, like legal documents or medical papers or stuff like that, and actually get a pretty good
model. And the challenge is how do you do that. In fact, you can do
better than people who just use a labeled dataset, you know, with less effort, without having to have
people label, you know, millions of documents or images. So these were a couple of the interesting
themes. But yeah, for me, the best part about it was just seeing people, you know, thinking about
this problem from all angles, and getting them all in a
group together to talk about it and like kind of learning across these. So a lot of our work kind
of ended up mixing insights from, you know, from the different areas that I think wouldn't have
happened as easily without this group. Certainly, certainly. That's very interesting.
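For readers curious about the weak-supervision idea mentioned above, here is a toy sketch (not Snorkel's actual API): cheap heuristic labeling functions vote or abstain, and their votes are combined into noisy training labels. Real systems learn to weight the functions by estimated accuracy rather than taking a simple majority vote as done here; the rules and documents are made up.

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Heuristic "labeling functions": cheap rules that guess a label or abstain.
def lf_mentions_refund(text):
    return POSITIVE if "refund" in text.lower() else ABSTAIN

def lf_says_thanks(text):
    return NEGATIVE if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return POSITIVE if text.count("!") >= 2 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_says_thanks, lf_many_exclamations]

def weak_label(text):
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # simple majority vote

docs = ["I want a refund now!!", "Thank you for the quick help."]
print([weak_label(d) for d in docs])  # noisy labels to train a model on
```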
Another line of research that actually excites me is some of the work on retrieval-based NLP. Undoubtedly, we've seen a lot of the great promise
of large-scale language models, but there are also very clear limitations around high training
costs, high inferencing costs, the true feasibility of actually productionizing these large models. There's explainability issues,
you know, models are static, so there's freshness issues. So how does this sort of approach of
retrieval-based NLP work? And how does it address some of these fundamental issues that we observe
with trying to make value out of these large-scale language models that exist today?
Yeah, this is the research I'm probably most excited about in my group right now.
So basically, so far, we've gotten some really great results in NLP with these giant models
that have lots of weights, something like GPT-3.
And the idea with these models is you have a collection of knowledge, like you have a
bunch of documents from the web, and you train the model over it.
And it sort of incorporates that knowledge into the weights.
And then when you do predictions, it can do stuff like it knows that the capital of France
is Paris.
It knows that the president of the US is Joe Biden, like whatever.
It has all this knowledge that you know, that appeared in those
documents, and it can use that in various tasks. But if you think about it, you know, these are
very expensive to train. They're also very expensive to do inference with. And they're
very hard to update, because if something changes, like if, you know, after the next election,
the president of the US changes, you know, you've got to retrain this whole model from scratch to fix that. And you actually see this if you use GPT-3 today. Well, I don't know about
today specifically, but definitely, you know, when we tried it like a while back, you know,
it was returning the previous president of the US. It didn't know that this changed.
So it's a problem. So the retrieval-oriented approach is that instead you separate the knowledge. You have
the documents and you have some neural networks, but then when you're given a task, like say you're
being asked to answer a question, you search over the collection of knowledge somehow and then you
read those documents. You pass the documents that you retrieved along with some context, like the
question and other information about your task
into a smaller neural network and you produce the answer. And the nice thing about that is
you can always update the knowledge because you can just change these documents. You also get
quite a bit more interpretability. You get a sense of like, oh, why is it giving this answer?
It's because of what's in this document, and maybe that document was confusing. And it turns out to be a lot cheaper. So it depends on the task. These models can't do
everything right now. But for some tasks such as question answering, these models are just
orders of magnitude cheaper than the large language models. And they're much higher quality
in terms of answers. And there's a lot of work in this space. There's now,
for example, there are people using retrieval for language modeling, which is a very general use case. The RETRO model does that. There are people using them for images as well as
text. So retrieving images and text together and having interpretable results there.
And there are people using them for more sophisticated applications.
One of the ones we built can answer questions that require looking up
many documents all at once, not just one.
And it seeks out new knowledge until it has enough to answer the question,
but basically looking at concepts that came up in these.
So this still requires some index to actually do the retrieval from, right?
I saw an interesting analogy of the open book,
sort of this retrieval-based NLP as being an open book exam, right?
So there's still the need for the-
Yeah, it's an open book. And actually a lot of the work is on,
so that indexing itself is done,
and that search is done using a neural network also
by maybe embedding the documents into some kind of vector space and then searching for nearby vectors.
And a lot of the work is on how to do that better also, which then immediately improves these.
Or how to co-train the indexing and lookup together with, say, the question answering so that they're tailored for each other.
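A minimal retrieve-then-read sketch of the idea follows, with a hashed bag-of-words standing in for the trained neural encoder and a placeholder "reader"; real systems use learned encoders, approximate nearest-neighbor indexes, and a smaller generator model. The documents and question are toy data.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Stand-in encoder: hashed bag-of-words, normalized to a unit vector."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

documents = [
    "Paris is the capital of France.",
    "Apache Spark is a cluster computing framework.",
    "The Danube flows through Budapest.",
]
doc_matrix = np.stack([embed(d) for d in documents])  # the "index"

def retrieve(question: str, k: int = 1):
    scores = doc_matrix @ embed(question)  # cosine similarity of unit vectors
    return [documents[i] for i in np.argsort(-scores)[:k]]

def answer(question: str) -> str:
    context = retrieve(question, k=1)
    # A real system feeds `context` plus the question into a smaller neural
    # reader; updating the knowledge only requires editing `documents`.
    return f"question={question!r} retrieved={context}"

print(answer("What is the capital of France?"))
```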
Very interesting.
And you said there's already some line of work thinking about sort of image search. Are there sort of areas of investigation around multimodal information retrieval, like image
to text, text to image?
There's a little bit.
Are there other areas where you see this potentially being applied?
There's a little bit.
There's not a ton yet, but there is some work with images.
I think it could be useful in other areas too. One example actually that I'm curious about would
be reinforcement learning. Because if you think about it, you have this history of training
exercises you did for your model. And again, even though you ran all those, you then just condensed
everything into a bunch of weights. But what if you could look up like, what are past training situations that look similar to this? And like,
what did I do? And what was the outcome? Maybe you'd be able to improve performance there.
So my group focuses mostly on the NLP use cases, because there are so many of those and, like,
they're very easy to interact with. But I think it could be useful in other places too.
If you think about
it again from the production ML perspective, if you're going to productionize one of these models,
you want to be able to interpret what it's doing and also to fix it if something is wrong. And
you want it to be fast also, fast enough to actually run.
So this helps with that, but it also gives you these nice ways to see, wait, why is it making that prediction? And if I want to stop it from doing
this, like, what do I change? Here, you can just change the documents it's pulling out that are
misleading it. Right, right. To turn on to another responsibility that you have
sort of in academia, as sort of a professor and as an instructor, what are, you know, some of the biggest gaps or
opportunities that you see in computing education? I know certainly going through school myself,
I know there was a big interest in recent years in the field of machine learning, but
ultimately to, you know, develop and deploy capabilities, it goes beyond model training.
So what other areas do you feel that, you know, folks could sort of emphasize in computing education to really ensure that computer science graduates are well equipped? I think one big gap has to do with the shift to software as a service,
or basically software being delivered
through the cloud somehow.
And this is happening everywhere,
whether it's like just a user-facing
kind of productivity app
like Google Drive or Salesforce,
or it's actually the platform, right?
Like you can get a database as a service
from Google or from Microsoft, or you can get,
you know, machine learning, whatever, like predictive time series, you know, predictions
as a service from Amazon and so on.
The thing is, all of these are now being delivered continuously.
You know, my experience and the same with like, I think everyone who does it is that
building a production, you know, kind of cloud service and maintaining it is really hard.
That's what we see at Databricks.
And that's what many of the companies we work with see.
And there isn't really any education on it.
And I think it's more than engineering.
I think it could require kind of new programming models that are going to work well.
It could require new designs of systems. For example, how do I make my system so I can easily roll out a new version of my app and then roll
back to an old one? Or how do I make it so it can isolate requests from different tenants and just
guarantee that no tenant can interfere with the performance of another tenant too much or stuff
like that? So I think this is a super interesting area.
I would love to have a class where you teach about these things,
but also a class where instead of students turning in an assignment every couple of weeks with a little programming thing
that we run some tests on,
they actually deploy a service on day one,
and then it has to keep operating and serving requests throughout
the semester. And then, you know, they have to like, you know, put in a new feature, like implement,
I don't know, pagination, implement whatever GDPR compliance or whatever, without like corrupting
the data or otherwise breaking it. Like I think, because that's what they'll have to do in
a real job. Well, it seems like you have the syllabus for your next course.
Yeah, potentially.
I would certainly say there's certainly a huge divide between how folks are trained in university and sort of the reality of how things operate in industry and in production.
And I think oftentimes internships or sort of hands-on experiential project-based learning becomes the best avenue to do that. And so I think this is certainly a very noble model of learning that could certainly
benefit many folks as they make the transition and look to actually develop end-to-end systems.
So to wrap it up, I'd love to touch on two things. One is, you know, we talked about a lot of your
work with Apache Spark and Databricks, but also your responsibilities as a researcher, as an advisor, as a professor.
These are two very difficult jobs to manage.
So how do you juggle your work in industry as sort of a practitioner, as the chief technologist at Databricks, with your role in academia as a researcher, as a professor, as an advisor?
Do you find these two worlds colliding? Do you find them very different? And in one way, you know, does one
role influence the other? Is it, you know, your industry sort of experiences really influencing
your role in research? Or is it your research experiences that influence your work at Databricks?
Yeah, I think the roles are definitely different.
There are different kinds of concerns in each one. I do think a nice thing about a faculty job is it does give you flexibility to work with companies in various ways. And that is kind of part of the
job. That's just the way it works. But they are super different. I do learn a lot of stuff in both that like I some,
you know, that influences what I do. Mostly, I think it's been seeing stuff in industry that I think, you know, is a big problem that isn't really studied in research. And then like kind
of thinking about it from a research perspective, I've tried to keep them like fairly separate.
And in most cases, because I don't want to have some kind of conflict of interest
or like, you know, students feel they're working on something that like benefits Databricks or
whatever. But there are often kind of just long-term things, like this thing with, you know,
everyone in industry writing services, but there are, you know,
hundreds of papers each year on, like, debugging and stuff like that that don't consider that.
That's an insight that can lead to work that benefits the whole industry. But things that are
very specific, usually I would do that work only in the company. And certainly within the company,
it's helped just to see all the perspectives you see at a university on, say, the future of hardware
or ML models or stuff like that. But they're both interesting.
And I think the big difference is, like, in industry you often get a lot more resources
behind a particular thing, and it's very hard to match that in academia.
You can't hire like, you know, tens or hundreds of engineers to build a thing and then to
maintain it.
And of course, in academia, you get to explore a lot of stuff.
And, you know, if something doesn't work, you can just switch to something else and so on. So like,
it's very flexible that way. And you get to, you know, to teach people in a different way as well.
Very interesting. So to wrap it up, I would love to sort of provide you the opportunity to share
any parting remarks, but to provide some structure,
what are some exciting future directions that you see for the field of data management,
machine learning, or computing at large? And what are some of the exciting areas that you hope
to see some of the greatest promise in? Yeah, I mean, I think this is a great time to look at
kind of both machine learning and computer systems. And I would say,
I mean, I think everyone is very excited about the potential of these like large models and
deep learning workloads and so on. I think it's super interesting. But I would also say,
you know, if I were to give advice to like students or people getting into the field,
I would also say to take a look at the
computer system stuff, because there is this huge change to cloud computing and these systems
ultimately underlie a lot of, you know, what's happening in other places. And there'll be a lot
of demand for like engineers, researchers, and so on that know about this stuff. Even the large
language model stuff, I think in many places,
it's become basically a systems problem of like, how do we scale out things more? How do we do it
more cheaply and so on to train more models? And it's become less of a modeling problem. Of course,
there are a lot of improvements you can do there too. But the point is that like, you know, if you
can do that stuff well, it will have a lot of impact on real applications that are out there.
Awesome.
Well, it's been a pleasure speaking with you, Dr. Matei Zaharia.
Thanks so much for joining us at ACM ByteCast.
Thanks so much for having me.
ACM ByteCast is a production of the Association for Computing Machinery's Practitioner Board.
To learn more
about ACM and its activities, visit acm.org. For more information about this and other episodes,
please visit our website at learning.acm.org/bytecast. That's learning.acm.org/bytecast.