a16z Podcast - a16z Podcast: A Conversation With the Inventor of Spark
Episode Date: June 24, 2015. One of the most active and fastest-growing open source big data cluster computing projects is Apache Spark, which was originally developed at U.C. Berkeley's AMPLab and is now used by internet giants ...and other companies around the world, including, as announced most recently, IBM. In this Q&A with Spark inventor Matei Zaharia -- also the CTO and co-founder of Databricks (and a professor at MIT) -- on the heels of the recent Spark Summit, we cover the difference between Hadoop MapReduce and Spark; the ingredients of a successful open source project; and the story of how Spark almost helped a friend win a million dollars.
Transcript
Hello, everyone. Welcome to the a16z podcast. I'm Sonal, and I'm here today with
Matei Zaharia, the CTO and co-founder of Databricks, which is the primary company driving
and developing Spark. And we're actually just coming out of the Spark Summit, which took place
this week, and it's one of the biggest events for developers who are working on Spark, for
companies that are interested in Spark, and pretty much for anyone who cares about trends in
the big data space. Just to start off, Matei, give us a description of what Spark is.
So Spark is software for processing large volumes of data on a cluster. And the things that make it
unique are, first of all, it has a very powerful programming model that lets you do many kinds of
advanced analytics and processing, such as machine learning or graph computation or stream processing.
And second, it's designed to be very easy to use, much easier to use than previous systems for
working with large data.
So what were some of the previous systems for working with large data sets?
Before Spark, the most widely used system was probably MapReduce, which was invented at Google
and popularized through the open source Hadoop project. And MapReduce itself was a major step over just
writing distributed programs from scratch, but it was still very difficult to adopt and use
and led to very complicated applications and also very poor performance in some of them.
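To make that contrast concrete, here is a toy word count written two ways in plain Python. This is a sketch, not real Hadoop or Spark code -- the dataset and both pipelines are made up -- but it illustrates why forcing every computation into explicit map and reduce phases got complicated, while Spark-style chained transformations stay compact.

```python
from collections import defaultdict
from functools import reduce

lines = ["spark makes clusters easy", "mapreduce makes clusters possible"]

# MapReduce style: every job is forced into explicit map and reduce phases,
# with a shuffle (grouping values by key) in between.
mapped = [(word, 1) for line in lines for word in line.split()]
groups = defaultdict(list)
for key, value in mapped:          # the "shuffle": group values by key
    groups[key].append(value)
counts_mr = {key: sum(values) for key, values in groups.items()}

# Spark-style: the same logic reads as one chain of transformations.
# (Real PySpark would be roughly:
#   sc.parallelize(lines).flatMap(str.split)
#     .map(lambda w: (w, 1)).reduceByKey(operator.add))
words = [w for line in lines for w in line.split()]
counts_spark_style = reduce(
    lambda acc, w: {**acc, w: acc.get(w, 0) + 1}, words, {}
)

assert counts_mr == counts_spark_style
print(counts_mr["clusters"])  # → 2, since each sentence mentions "clusters"
```

The point is not performance here, just that the second form composes operations directly instead of bending them into a fixed two-phase mold.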
So what were some of the reasons for inventing Spark in the first place, then?
I mean, besides the problems and the limitations of that, were you just trying to solve the problems of MapReduce, or were you actually trying to do something different?
Yeah, that's a good question. So we started building Spark after several years of working on MapReduce and working with companies that were very early on using MapReduce.
I was a PhD student at UC Berkeley, and we actually started working with Hadoop users back in 2007.
And I did, for example, an internship at Facebook when Facebook was only about 300 people.
they were just starting to set up Hadoop.
And in all these companies, I saw that there was a lot of potential in putting together
large data sets and processing them on a cluster, but they were all hitting the same kind
of limitations, and they all wanted to do more with it.
So basically, our lab created Spark to address these limitations.
You know, let's take a step back for a moment and talk again about your experience at Facebook,
though.
Why was the problem challenging there?
I mean, big data's been around forever.
So what about that problem was interesting and different?
that made you want something better than what you already had?
Like, what was it about that data, I guess?
So Facebook, like other companies starting to, you know, use business data,
was able to collect a lot of very valuable data about how users were interacting with it.
And Facebook was also growing very quickly.
So they were adding, you know, many tens of millions of users every few months.
And, you know, they definitely couldn't just talk to every user
or even send someone to every country where Facebook was used
and figure out how people were using the site. So they needed to use this data to improve
the user experience. So there were two things that made it especially challenging. One was the
scale of the data, which was, you know, much higher than you could do with traditional tools. And
the second one was how many different people within Facebook wanted to interact with it. It wasn't just
one person or one team doing it. It was many people, often without, you know, that many
technical skills. So it needed to be very easy to work with this data.
What was the limitation? Like, why wasn't what was in place, MapReduce, I guess you're
describing, enough? So MapReduce was actually great for running sort of large batch jobs that
kind of scan through the whole data and summarize all of it and give you an answer. But it was
designed mainly for jobs that take tens of minutes to hours. MapReduce initially came out of Google
where it was used for web indexing, and the whole point was, I will run this giant job every
night, and in the morning it's built a new index of the web. But what Facebook wanted to do was
different. They had a lot of questions that they wanted to ask almost interactively. And there's
a person sitting there who launches a question at the cluster and needs to get an answer back. And it
wasn't very well suited for that. So it's more like the move fast and break things kind of model
at Facebook where you want to just deploy code or more importantly, quickly iterate in real
time. So you can get your feature testing and get feedback instantly.
It's not even feature testing, but just in general when you work with data, you want to ask
multiple questions repeatedly when you're doing ad hoc exploration of the data, as opposed to
when you have a certain application that, you know, okay, I'm just going to run this every night.
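That ad hoc pattern -- load the data once, then ask many questions against it -- is what Spark's in-memory caching targets. Here's a minimal stand-in in plain Python (the dataset and queries are made up; the PySpark equivalent is sketched in the comments):

```python
# Hedged sketch: simulate "scan once, query many times".
# In real PySpark this pattern is roughly:
#   logs = spark.read.json("...")   # expensive scan of a large dataset
#   logs.cache()                    # keep it in cluster memory
#   then several fast follow-up queries against the cached data.
scan_count = 0

def expensive_scan():
    """Stand-in for reading a large dataset off disk."""
    global scan_count
    scan_count += 1
    return [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 7}]

cached = expensive_scan()  # one scan, result held in memory ("cached")

# Ad hoc exploration: each question reuses the cached data.
total_clicks = sum(r["clicks"] for r in cached)
heavy_users = [r["user"] for r in cached if r["clicks"] > 5]

assert scan_count == 1             # the data was only read once
print(total_clicks, heavy_users)   # → 10 ['b']
```

With nightly MapReduce jobs, each new question would pay the full scan again; caching is what makes the interactive back-and-forth feasible.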
So the second feature that you mentioned about the usability.
I mean, I feel like it's obvious. We care about things being pretty and easy to use
because no one wants to deal with a kludgy interface.
I mean, does that really matter in this case?
Because if you're really an insider who knows how to work with data,
do you really need to care about that?
The interesting thing is nobody wants only the insiders to work with data, basically.
Everyone wants to be able to access it directly.
Actually, there was a great keynote about this at the Spark Summit by Gloria Lau,
where she said that also the insiders themselves don't want to be answering questions for other people.
You know, if they have the specialized skills, if they're a data scientist, for example, they'd prefer to be doing something advanced, and the other users outside would prefer to ask the questions themselves.
So ease of use was quite important.
So one of the most interesting announcements that came out of the Spark Summit that I think a lot of people saw was the announcement that IBM is backing Spark.
Basically, IBM is doing two things.
First of all, it's investing in the development of Spark, you know, much in a similar way as to how they've invested in other technologies.
such as Java or Linux, they see Spark as a key technology and they're actually putting developer
resources. And second, IBM is also moving some of its internal products and product lines
to use Spark or to offer Spark to customers because they think it will improve these products
to build them on Spark. So does that mean that IBM is basically making a big bet on the cloud
in a bigger way than ever before? I think it's much more than the cloud. I think even though the cloud
is one area, you know, that IBM is interested in expanding, and they have huge business lines
that are completely unrelated to the cloud in solutions and consulting, and also in products
and services such as Watson or their database products and so on. And, you know, I hope that
what will happen is really great integration between these products and Spark. Were there any other
major news or interesting things that you saw, any other interesting keynotes or trends that
you think are important for our audience to know about what came out of Spark Summit this year?
One of the coolest ones I saw was a talk from Toyota about how they use Spark to improve,
you know, to basically look at social media feedback, what people are writing about their cars
and figure out things like, oh, is there a problem with the brakes on the Prius or how they can
improve their products as a result of this.
So you have a car company that's basically using social media as big data, because it's real time
and it's fast and they probably get a ton of feedback,
and they're using Spark to figure out how to change their product in real time?
Right.
They're not using it to deal with it in real time,
but they're using it just to get a lot more insight into how their vehicles behave once they're out there.
So, you know, one example is if people report, you know, say some noise coming from the brakes or something like that.
Now, they might hear about this, you know, if a person goes and talks to their mechanic,
but there might be many other people who just ask their friends online and ask, oh, look, I hear this.
kind of weird, you know, scrunching sound from my car. And it's actually very hard to take this
kind of text data, where, you know, people have many, many different ways
of describing, say, this noise problem, and actually understand it and classify it, cluster together
all the messages about a specific problem. But for the engineers at Toyota, who, you know,
are trying to design the next car or to figure out, you know, any potential issues with the current
components, it's very useful to have this. Got it. That makes a lot of sense. So they're basically
analyzing the big social media data set to get that insight. What are some of the other interesting
things that you saw coming out of the Spark Summit? Apart from Toyota, we also saw a lot of other
great companies starting to talk about their use of Spark. So we've known about some of them
for a while, but it's nice to see companies talking publicly about interesting applications
they've built. So some of the other highlights, for example, included Netflix, Capital One.
at BVS summits, we also had Goldman Sachs and Novartis in biotech.
One other question just on the big picture side, you know, I'm fascinated by the tension
between open and closed and when it comes to the open source topic and the open source
community as well. And one of the things that I think is really interesting is that every
side of open has an element of closed as well. And the evolution of open source, historically
as well as now, always involves big corporate players getting in the game as well as a lot of
individual developers, academics. How does that sort of affect the ecosystem in general?
I would love to hear your thoughts about that and also how that applies to Spark.
So the Spark community is already very large. It's actually the most active open source project
in data processing in general, as far as we can tell. And it has many companies participating
in it. And I think we've been able to grow the community in a way that everyone gets what they
want out of it. And there aren't really any major tensions between the companies
working on it, and people trying to keep certain things closed or not.
Is there something unique about this that's allowing that to happen?
So I think it's just a matter of the culture in the project and setting it up early on to be
very welcoming to contributors and to actually have people converge on how they're going to work
together, you know, how they keep it stable and reliable.
So we started doing that, you know, back from the UC Berkeley days when we had other people
contribute to it.
And I think we just have that pattern in place;
it's sort of best for every contributor that's involved.
There's been plenty of open source projects that have not reached this kind of scale or grown as fast.
And the ecosystem you're describing, it's just not random chance that it ended up there.
I guess what I'm really interested in is that you are the creator of one of the most popular
and fastest growing open source projects ever.
What are some of the ingredients of a successful open source project?
Like what does it take to kind of get here for other people working in open source?
Yeah, so I should say, you know, from the beginning that, you know, we didn't, we certainly didn't
imagine that Spark would be this widely used when we started. And it's only, you know, it's kind of
been a feedback loop as we saw people being excited about it. We also, you know, decided to spend a lot
of time to make it better and, and to foster the community around it. But I think there are
several things that helped. So first of all, Spark was actually tackling a problem, a real problem
that people had and that more and more people were beginning to have in the future, which was the
problem of working with really large-scale data sets. So it kind of resonated with a real need
people had, automatically. Yes, exactly. And a need that was also growing over time,
and there wasn't much else they could use. So obviously that's helpful. Second, we had a really
fantastic set of people working on the project. And I think it's especially important when you
begin a project, when it goes from the original initial team to bigger and bigger teams,
because you're going to need a really great team to work with it in the future. The people,
at UC Berkeley and the people we got at Databricks to work on Spark are just a really fantastic
team that's able to build great software very quickly. Yeah, so the people aspect of it,
basically, because it is a community open-source project. Yes, definitely. And third, apart from
the sort of core team that was around it early on, we tried from the beginning to be very
engaged with the community and to foster new contributors, help them actually contribute to the
project and learn how to do things and actually participate.
And there are hundreds of people that do that, you know, many people might only send in one or two patches, but we've tried to make the barrier for that extremely low so that they can actually help out.
It takes some effort to do that because at the beginning, you know, if you're someone working on it every day and someone comes in and wants help to get some idea in, it's always faster for you to do it yourself than to help this other person.
But you have to do that.
You have to help them get set up and help them contribute in order to actually grow the total set of people.
who can contribute.
So is there just like a lot of documentation on the project then?
Or, I mean, what really makes it easier for them to be able to contribute?
Like what concretely happens?
Yeah. So there's several things.
So first, you know, first, even before anyone contributes, they have to be able to use it.
So there's been a lot of focus on making Spark very easy to download and use and having
as much documentation and examples out of the box as possible.
And we're still doing a lot to expand this, actually.
The second thing you need is, you know, once people,
are actually trying to send in patches or to try to understand something about the project,
you do have to talk with them to review the patches and so on and, you know, and help them actually
get them in. And the third thing you need that's really important to keep a project moving quickly
is just really great infrastructure for testing, checking the quality, making sure that it continues
to be good. And by investing in this kind of infrastructure, much the same as you do in any other
engineering organization, you can then end up moving a lot faster. So these are the things
we spend time on.
The most interesting thing about open source is the ecosystem that grows up around it,
because otherwise why have it even be open source?
Could you talk a little bit more about that?
I've actually seen you share a chart that shows a really rich ecosystem growing around Spark
and why that matters?
Yeah, definitely.
Yeah.
So over the past few years, we've seen a lot of other open source projects integrate with Spark
and build on top of it and really put together, you know, this set of software you can use
together to build applications. So in particular, you know, one of the things we saw is many of
the projects that were built on top of Hadoop, such as Hive, which is SQL processing at scale
and Pig and Mahout for machine learning are starting to run on top of Spark as well so that users
of those can get, you know, the speedups and the benefits from using Spark. And we've also seen
quite a few of the data storage projects, for example, MongoDB or Cassandra or Tachyon or many
of the NoSQL key value stores now connecting to Spark, offering ways to read the data in.
And for Spark users, that's exciting because it means they can write an application against Spark
and use it against data and all these storage systems.
They don't have to change the application to talk to each one.
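In PySpark terms, that uniformity is the DataFrame reader interface: the analysis code stays the same and only the source changes. Since the exact connector package names vary by project, here's a plain-Python sketch of the same "one interface, many backends" idea, with the PySpark shape in comments:

```python
# Hedged sketch of "one application, many storage systems" in plain Python.
# In PySpark the equivalent is roughly (connector packages assumed installed):
#   df = spark.read.format("...").load(...)   # MongoDB, Cassandra, etc.
#   df = spark.read.parquet("hdfs://...")     # HDFS files
# and the analysis code below the read never changes.

def read(source):
    """Stand-in for a pluggable reader: each backend yields the same row shape."""
    backends = {
        "filesystem": [{"id": 1, "value": 10}, {"id": 2, "value": 20}],
        "keyvalue":   [{"id": 3, "value": 30}],
    }
    return backends[source]

def analyze(rows):
    """The application logic: identical no matter where the rows came from."""
    return sum(r["value"] for r in rows)

# Same analysis against two different "storage systems".
assert analyze(read("filesystem")) == 30
assert analyze(read("keyvalue")) == 30
```

The connectors do the format-specific work of producing rows; the application only ever sees the common row shape.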
So I think even beyond the activity happening in Spark itself, these projects are on top
and on the side are one of the most valuable things for the users.
And in fact, wasn't Spark itself, isn't Spark itself able to sit on top
of Hadoop, for those that use the Hadoop file storage system?
Yeah, definitely. From the beginning, we designed Spark to sit on top of Hadoop and to talk
to Hadoop, but we also left it open so that you can run it in other environments.
And, you know, our basic philosophy in it is to try to integrate with all these
kind of environments where data can be stored and give users one really simple way to
work with the data no matter where it is.
You've talked about the community, you've talked about all the successful ingredients of an
open source project. But I guess I'm kind of fascinated by the story of an inventor. And we
have an opportunity to talk to an inventor of a really interesting thing here. And I want to hear
a little bit more color on what that was like. One of the most interesting ones was Lester Mackie,
who was a PhD student in the same year as me. And he was actually on the team that got second
place in the Netflix challenge. They were extremely close to winning the whole challenge.
So what was a Netflix challenge for our audience to remind people what it was?
Yeah, the Netflix challenge was this $1 million prize to improve the accuracy of movie recommendations on Netflix.
So Netflix released this data set and said, oh, currently we can predict, you know,
a user's score for a movie to, you know, something like within 0.85 error.
And if you can, you know, decrease this to just 0.8 or something like that,
we'll give you a million dollars.
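The 0.85 and 0.8 figures here refer to root-mean-square error (RMSE), the metric the Netflix Prize used to score predictions. A toy computation with made-up ratings shows how it's calculated:

```python
import math

# Toy RMSE computation with made-up ratings on a 1-5 star scale:
# the error is the square root of the mean squared difference
# between actual and predicted scores.
actual    = [4, 3, 5, 2]
predicted = [3.5, 3.0, 4.0, 3.0]

rmse = math.sqrt(
    sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
)
print(rmse)  # → 0.75
```

Lowering the challenge's target figure even slightly meant improving this average across tens of millions of predictions.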
So it was an open challenge, very large scale, to encourage
innovation in this area. Okay. So what happened with Lester? Yeah. So Lester was one of the people who had,
you know, he was developing a ton of new algorithms and combination of algorithms for this problem. He had
this pretty large data set, especially for that time. And he wanted to run these very quickly. So it's
actually one of the applications that I first tried to support in Spark was, you know, the recommendation
algorithm he was working on. So did he win the Netflix prize? Well, his team won second place. So they
didn't quite make it, but they were extremely close.
Wait, do they at least get something for winning second place?
I mean, first place got a million dollars.
Do they get anything?
I'm not sure whether they got anything.
I think you'd have to ask him.
Let's talk about another transition here, which is the transition from an open source
project to becoming a commercial one.
And part of what you guys are doing, obviously, is involved in that.
Can you talk a little bit more about the transition of what that takes and move from
open source to commercial application that people actually use and have expectations
of?
Yeah, definitely.
So, you know, so as we saw Spark do very well in the open source domain, we wanted to start a company around it to really harden it and to bring it to a much wider class of commercial users.
And, you know, we really wanted to find a model that lets it continue to be fully open source and continue to be a successful project that way for everyone participating in it.
And traditionally, it's always been a tension in companies.
that try to commercialize open source projects because, you know, they build all this great
stuff and then they kind of give it away for free. And, you know, it's this tension between,
oh, are there some things we just shouldn't put into it or, you know, how else can we
actually, you know, power a successful business around it. So the way we're doing this at Databricks
is actually quite different. And I think it's a very nice, very powerful model for doing this,
which is that we're offering Spark in a cloud service.
Let's talk about why that is.
Like, why are you guys doing that?
And actually, why is that so different?
Why don't other people do that?
We're offering Spark as a cloud service.
And what that means is, you know, it's the same Spark
that anyone else gets in open source.
All the libraries, all the improvements we put into the engine,
you can just download them and run them yourselves.
Or if you want, you know, you can talk to a vendor
that provides support to run it yourself.
And there isn't any tension for us
between, you know, do we put something in Spark
or does it become some kind of premium feature?
Thank you, Matei.
It was great hearing your story,
and the evolution of Spark and what it is.
And thank you, everyone.
And that's another episode of the a16z podcast.