The a16z Show - a16z Podcast: A Conversation With the Inventor of Spark

Episode Date: June 24, 2015

One of the most active and fastest growing open source big data cluster computing projects is Apache Spark, which was originally developed at U.C. Berkeley's AMPLab and is now used by internet giants ...and other companies around the world. Including, as announced most recently, IBM. In this Q&A with Spark inventor Matei Zaharia -- also the CTO and co-founder of Databricks (and a professor at MIT) -- on the heels of the recent Spark Summit, we cover the difference between Hadoop MapReduce and Spark; what are the ingredients of a successful open source project; and the story of how Spark almost helped a friend win a million dollars.

Transcript
Starting point is 00:00:00 Hello, everyone. Welcome to the a16z Podcast. I'm Sonal, and I'm here today with Matei Zaharia, the CTO and co-founder of Databricks, which is the primary company driving and developing Spark. And we're actually just coming out of the Spark Summit, which took place this week. And it's one of the biggest events for developers who are working on Spark, for companies that are interested in Spark, and pretty much for anyone who cares about trends in the big data space. Just to start off, Matei, could you give us a description of what Spark is? So Spark is software for processing large volumes of data on a cluster. And the things that make it unique are, first of all, it has a very powerful programming model that lets you do many kinds of advanced analytics and processing, such as machine learning or graph computation or stream processing. And second, it's designed to be very easy to use, much easier to use than previous systems for working with large data. So what were some of the previous systems for working with large data sets?
Starting point is 00:00:56 Before Spark, the most widely used system was probably MapReduce, which was invented at Google and popularized by the open source Hadoop project. And MapReduce itself was a major step over just writing distributed programs from scratch, but it was still very difficult to adopt and use and led to very complicated applications and also very poor performance in some of them. So what were some of the reasons for inventing Spark in the first place then? I mean, besides the problems and the limitations of that, were you just trying to solve the problems of MapReduce, or were you actually trying to do something different? Yeah, that's a good question. So we started building Spark after several years of working on MapReduce and working with companies that were very early on using MapReduce. I was a PhD student at UC Berkeley, and we actually started working with Hadoop users back in 2007. And I did, for example, an internship at Facebook when Facebook was only about 300 people.
Starting point is 00:01:56 They were just starting to set up Hadoop. And in all these companies, I saw that there was a lot of potential in putting together large data sets and processing them on a cluster. But they were all hitting the same kind of limitations, and they all wanted to do more with it. So basically, our lab created Spark to address these limitations. You know, let's take a step back for a moment and talk again about your experience at Facebook, though.
Starting point is 00:02:19 Why was the problem challenging there? I mean, big data's been around forever. So what about that problem was interesting and different, and made you want something better than what you already had? Like what was it about that data, I guess? So Facebook, like other companies starting to use business data, was able to collect a lot of very valuable data about how users are interacting with it.
Starting point is 00:02:43 And Facebook was also growing very quickly. So they were adding many tens of millions of users every few months. And, you know, they definitely couldn't just talk to every user or even send someone to every country where Facebook was used and figure out how people are using the site. So they needed to use this data to improve the user experience. So there were two things that made it especially challenging. One was the scale of the data, which was much higher than you could do with traditional tools.
Starting point is 00:03:10 And the second one was how many different people within Facebook wanted to interact with it. It wasn't just one person or one team doing it. It was many people, often with not that many technical skills. So it needed to be very easy to work with this data. What was the limitation? Like, why wasn't what was in place, MapReduce, I guess, that you're describing, enough? So MapReduce was actually great for running sort of large batch jobs that kind of scan through the whole data and summarize all of it and give you an answer. But it was designed mainly for jobs that take tens of minutes to hours. MapReduce initially came out of
Starting point is 00:03:50 Google where it was used for web indexing. And the whole point was, I will run this giant job every night and in the morning it's built a new index of the web. But what Facebook wanted to do was different. They had a lot of questions that they wanted to ask almost interactively. And there's a person sitting there who launches a question at the cluster and needs to get an answer back. And it wasn't very well suited for that. So it's more like the move fast and break things kind of model at Facebook where you
Starting point is 00:04:18 want to like just deploy code or more importantly quickly iterate in real time. So you can get your feature testing and get feedback instantly. Not even feature testing, but just in general, when you work with data, you want to ask multiple questions repeatedly when you're doing ad hoc exploration of the data, as opposed to, you know, when you have a certain application that, you know, okay, I'm just going to run this every night. So the second feature that you mentioned was about the usability. I mean, I feel like it's obvious. We care about things being pretty and easy to use because no one wants to deal with a kludgy interface. I mean, does that really matter in this case? Because if you're really an insider who knows how to work with data, do you really need to care about that? The interesting thing is nobody wants only the insiders to work with data, basically. Everyone wants to be able to access it directly. Actually, there was a great keynote about this at the Spark Summit by Gloria Lau, where she said that also the insiders themselves don't want to be answering questions for other people.
Starting point is 00:05:13 You know, if they have the specialized skills, if they're a data scientist, for example, they'd prefer to be doing something advanced, and the other users outside would prefer to ask the questions themselves. So ease of use was quite important. So one of the most interesting announcements that came out of the Spark Summit that I think a lot of people saw was the announcement that IBM is backing Spark. Basically, IBM is doing two things. First of all, it's investing in the development of Spark, you know, much in a similar way as to how they've invested in other technologies, such as Java or Linux. They see Spark as a key technology and they're actually putting developer resources. And second, IBM is also moving some of its internal products and product lines to use Spark
Starting point is 00:05:56 or to offer Spark to customers because they think it will improve these products to build them on Spark. So does that mean that IBM is basically making a big bet on the cloud in a bigger way than ever before? I think it's much more than the cloud. I think the cloud is one area that IBM is interested in expanding, but they also have huge business lines that are completely unrelated to the cloud, in solutions and consulting and also in products and services such as Watson or their database products
Starting point is 00:06:26 and so on. And, you know, I hope that what will happen is really great integration between these products and Spark. Was there any other major news or interesting things that you saw? Any other interesting keynotes or trends that you think are important for our audience to know about that came out of Spark Summit this year? One of the coolest ones I saw was a talk from Toyota about how they use Spark to basically look at social media feedback, what people are writing about their cars, and figure out things like, oh, is there a problem with the brakes on the Prius, or how they can improve their products as a result of this?
Starting point is 00:07:01 So you have a car company that's basically using social media as big data because it's real time and it's fast and they probably get a ton of feedback. And they're using Spark to figure out how to change their product in real time. Right. They're not using it to deal with it in real time, but they're using it just to get a lot more insight into how their vehicles behave once they're out there. So, you know, one example is if people report, you know, say some noise coming from the brakes or something like that.
Starting point is 00:07:28 Now, they might hear about this, you know, if a person goes and talks to their mechanic, but there might be many other people who just ask their friends online and ask, oh, look, I hear this kind of weird, you know, scrunching sound from my car. So, and it's actually very hard to take this kind of text data, which, you know, might say, oh, the people have many, many different ways of describing, say, this noise problem
Starting point is 00:07:49 and actually understand it and classify it, cluster together all the messages about a specific problem. But for the engineers at Toyota who are trying to design the next car or to figure out any potential issues with the current components, it's very useful to have this.
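To make the idea of grouping differently worded messages about the same problem concrete, here is a minimal sketch in plain Python. It is not the method Toyota described (the talk's details are not given here); the stopword list, the Jaccard word-overlap measure, the 0.3 threshold, and the sample messages are all assumptions chosen for illustration. A real system would need far more robust text processing.

```python
import re

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"a", "the", "my", "i", "is", "when", "from", "of", "on", "and", "to", "it", "in"}

def tokens(message):
    """Lowercase the message and return its set of non-stopword words."""
    words = re.findall(r"[a-z]+", message.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Word-overlap similarity between two token sets, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def cluster(messages, threshold=0.3):
    """Greedily put each message in the first cluster whose first message is similar enough."""
    clusters = []  # each entry: (token set of the cluster's first message, member messages)
    for msg in messages:
        toks = tokens(msg)
        for representative, members in clusters:
            if jaccard(toks, representative) >= threshold:
                members.append(msg)
                break
        else:
            clusters.append((toks, [msg]))
    return [members for _, members in clusters]

# Made-up sample messages, not real customer data.
messages = [
    "weird scrunching sound from the brakes",
    "brakes make a scrunching sound when I stop",
    "radio display is frozen",
    "the radio display frozen again",
]
groups = cluster(messages)
for group in groups:
    print(group)
```

With these sample messages, the two brake reports land in one group and the two radio reports in another. Messages that describe the same noise with no words in common would not be grouped by this approach, which is part of why the problem is described as hard.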
Starting point is 00:08:05 Got it. That makes a lot of sense. So they're basically analyzing a big social media data set to get that insight. What are some of the other interesting things that you saw coming out of the Spark Summit? Apart from Toyota, we also saw a lot of other great companies starting to talk about their use of Spark. So, you know, we've known about some of them for a while, but it's nice to see, you know, companies talking publicly about interesting applications
Starting point is 00:08:27 they've built. So some of the other highlights, for example, included Netflix and Capital One. At previous summits, we also had Goldman Sachs and Novartis in biotech. One other question, just on the big picture side, you know, I'm fascinated by the tension between open and closed when it comes to the open source topic and the open source community as well. And one of the things that I think is really interesting is that every side of open has an element of closed as well. And the evolution of open source historically as well as now always involves big corporate players getting in the game as well as a lot of individual developers, academics. How does that sort of affect the ecosystem in general? I would love to hear
Starting point is 00:09:08 your thoughts about that and also how that applies to Spark. So the Spark community is already very large. It's actually the most active open source project in data processing in general, as far as we can tell. And it has many companies participating in it. And I think we've been able to grow the community in a way that everyone gets what they want out of it. And there aren't really any major tensions between the companies working on it and people trying to keep certain things closed or not. Is there something unique about this that's allowing that to happen? So I think it's just a matter of the culture in the project and setting it up early on to be very welcoming to contributors and to actually have people converge on how they're going to work together,
Starting point is 00:09:50 you know, how they keep it stable and reliable. So we started doing that, you know, back from the UC Berkeley days when we had other people contribute to it. And I think we just have that pattern in place. And it's sort of best for every contributor that's involved. There's been plenty of open source projects that have not reached this kind of scale or grown as fast. And the ecosystem you're describing, it's not just random chance that it's ended up there. I guess what I'm really interested in is that you are the creator of one of the most popular and fastest growing open source projects ever. What are some of the ingredients of a
Starting point is 00:10:21 successful open source project? Like what does it take to kind of get here for other people working in open source? Yeah. So I should say, you know, from the beginning that, you know, we certainly didn't imagine that Spark would be this widely used when we started. And it's only, you know, it's kind of been a feedback loop as we saw people being excited about it. We also, you know, decided to spend a lot of time to make it better and to foster the community around it. But I think there are several things that helped. So first of all, Spark was actually tackling a problem, a real problem, that people had and that more and more people were beginning to have in the future, which was the problem of working with really large-scale data sets. So it kind of resonated with a real need people
Starting point is 00:11:03 actually really had. Yes, exactly. And a need that was also growing over time. So obviously, and there wasn't much else they could use. So obviously that's helpful. Second, we had a really fantastic set of people working on the project. And I think it's especially important when you begin a project, when it goes from the original initial team to bigger and bigger teams, because you're going to need a really great team to work with it in the future. The people at UC Berkeley and the people we got at Databricks to work on Spark are just a really fantastic team that's able to build great software very quickly. Yeah, so the people aspect of it basically, because it is a community open-source project.
Starting point is 00:11:42 Yes, definitely. And third, apart from this sort of core team that was around early on, we tried from the beginning to be very engaged with the community and to foster new contributors, help them actually contribute to the project and learn how to do things and actually participate. And there are hundreds of people that do that. You know, many people might only send in one or two patches,
Starting point is 00:12:03 but we've tried to make the barrier for that extremely low so that they can actually help out. It takes some effort to do that, because at the beginning, you know, if you're someone working on it every day and someone comes in and wants help to, you know, to get some idea in, it's always faster for you to do it yourself than to help this other person. But you have to do that. You have to help them get set up and help them contribute in order to actually grow the total set of people who can contribute. So is there just like a lot of documentation on the project then? Or, I mean, what really makes it easier for them to be able to contribute then? Like what concretely happens? Yeah. So there's several things. So first, you know, even before anyone contributes, they have to be able to use it. So there's been a lot of focus on making Spark very easy to download and use
Starting point is 00:12:47 and having as much documentation and examples out of the box as possible. And we're still doing a lot to expand this, actually. The second thing you need is, you know, once people are actually trying to send in patches or to try to understand something about the project, you do have to talk with them to review the patches and so on and, you know, and help them actually get them in. And the third thing you need that's really important to keep a project moving quickly is just really great infrastructure for testing, checking the quality, making sure that it continues to be good.
Starting point is 00:13:19 And by investing in this kind of infrastructure, much the same as you do in any other engineering organization, you can then end up moving a lot faster. So these are the things we spend time on. The most interesting thing about open source is the ecosystem that grows up around it, because otherwise, why have it even be open source? Could you talk a little bit more about that? I've actually seen you share a chart that shows a really rich ecosystem growing around Spark and why that matters. Yeah, definitely. Yeah. So over the past few years, we've seen a lot of other open source projects integrate with Spark and build on top of it and really put together, you know, this set of software you can use together to build applications. So in particular, you know, one of the things we saw is many of the projects that were built on top of Hadoop, such as Hive, which is SQL processing at scale, and Pig, and Mahout for machine learning, are starting to run on top of Spark as well
Starting point is 00:14:13 so that users of those can get the speedups and the benefits from using Spark. And we've also seen quite a few of the data storage projects, for example, MongoDB or Cassandra or Tachyon or many of the NoSQL key value stores now connecting to Spark, offering ways to read the data in. And for Spark users, that's exciting because it means they can write an application against Spark and use it against data in all these storage systems. They don't have to change the application to talk to each one. So I think even beyond the activity happening in Spark itself,
Starting point is 00:14:47 these projects on top and on the side are one of the most valuable things for the users. And in fact, isn't Spark itself able to sit on top of Hadoop, for those that use the Hadoop file storage system? Yeah, definitely. From the beginning, we designed Spark to sit on top of Hadoop and to talk to Hadoop, but we also left it open so that you can run it in other environments. And, you know, basically our philosophy is to try to integrate with all these kinds of environments where data can be stored and give users one really simple way to work with the
Starting point is 00:15:21 data no matter where it is. You've talked about the community. You've talked about all the successful ingredients of an open source project. But I guess I'm kind of fascinated by the story of an inventor. And we have an opportunity to talk to an inventor of a really interesting thing here. And I want to hear a little bit more color. on what that was like. One of the most interesting ones was Lester Mackey, who was a PhD student in the same year as me,
Starting point is 00:15:46 and he was actually on the team that got second place in the Netflix Challenge. They were extremely close to winning the whole challenge. So what was the Netflix Challenge, to remind our audience? Yeah, the Netflix Challenge was this $1 million prize to improve the accuracy of movie recommendations on Netflix. So Netflix released this data set and said, oh, currently we can predict, you know, a user's score for a movie to, you know, something like within 0.85 error. And if you can decrease this to just 0.8 or something like that, we'll give you a million dollars. So it was an
Starting point is 00:16:21 open challenge, very large scale, to encourage innovation in this area. Okay. So what happened with Lester? Yeah. So Lester was one of the people who, you know, was developing a ton of new algorithms and combinations of algorithms for this problem. He had this pretty large data set, especially for that time, and he wanted to run these very quickly. So actually, one of the applications that I first tried to support in Spark was, you know, the recommendation algorithm he was working on. So did he win the Netflix prize? Well, his team won second place, so they didn't quite make it, but they were extremely.
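For readers wondering what an "error" of around 0.85 means here: the Netflix contest scored predictions by root-mean-square error (RMSE) between predicted and actual star ratings. A minimal sketch of that calculation follows, using made-up ratings; the function name and the sample numbers are assumptions for illustration, not contest data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical star ratings for four (user, movie) pairs.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]

print(rmse(predicted, actual))  # 0.75
```

A lower value means the predicted ratings sit closer to what users actually gave, which is why the challenge was framed as pushing the number down.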
Starting point is 00:16:55 Wait, do they at least get something for winning second place? I mean, first place got one million dollars. Do they get anything? I'm not sure whether they got anything. I think you'd have to ask him. Let's talk about another transition here, which is the transition from an open source project to becoming a commercial one. And part of what you guys are doing, obviously, is involved in that.
Starting point is 00:17:13 Can you talk a little bit more about what that transition takes, moving from open source to a commercial application that people actually use and have expectations of? Yeah, definitely. So, as we saw Spark do very well in the open source domain, we wanted to start a company around it to really harden it and to bring it to a much wider class of commercial users. And, you know, we really wanted to find a model that lets it continue to be fully open source and continue to be a successful project that way for everyone participating in it. And traditionally, there's always been a tension in companies that, you know, that try to commercialize open source projects because, you know, they build all this great stuff and then they kind of give it away for free. And, you know, it's this tension between, oh, are there some things we just shouldn't put into it,
Starting point is 00:18:04 or how else can we actually power a successful business around it. So the way we're doing this at Databricks is actually quite different. And I think it's a very nice, very powerful model for doing this, which is that we're offering Spark in a cloud service. Let's talk about why that is. Like, why are you guys doing that? And actually, why is that so different? Why don't other people do that?
Starting point is 00:18:27 We're offering Spark as a cloud service. And what that means is, you know, it's the same Spark that anyone else gets in the open source, all the libraries, all the improvements we put into the engine, you can just download them and run them yourselves. Or if you want, you can talk to a vendor that provides support to run it yourself. And there isn't any tension for us between, you know,
Starting point is 00:18:49 do we put something in Spark or does it become some kind of premium feature? Thank you, Matei. It was great hearing your story and the evolution of Spark and what it is. And thank you, everyone. And that's another episode of the a16z Podcast.
