a16z Podcast: A New Lab Rises

Episode Date: October 12, 2017

with Ion Stoica, Peter Levine, and Sonal Chokshi

We’ve already talked quite a bit about the Algorithms, Machines, and People lab at U.C. Berkeley (AMPLab) — all about making sense of big data — so what happens when the entire world moves toward artificial intelligence, and the need to make intelligent decisions on that data? That’s where the new RISElab (Real-time Intelligent Secure Execution) comes in. But what is a good “decision”, exactly? Beyond that existential question, what specific attributes make a “good” decision, both computationally and humanly? In this episode of the a16z Podcast (in conversation with general partner Peter Levine and Sonal Chokshi), computer science professor, entrepreneur (co-founder of Databricks), and RISElab director Ion Stoica answers that question. He also shares the “ingredients” of a working research lab model (one that, dare we say, could also apply to many types of institutions?); the role of open source and building community; and the evolution of labs today given intense competition from industry and others… as well as what interesting projects — really, trends in decision making with AI — are coming next.

The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments and certain publicly traded cryptocurrencies/ digital assets for which the issuer has not provided permission for a16z to disclose publicly) is available at https://a16z.com/investments/. Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others.
Please see https://a16z.com/disclosures for additional important information.

Transcript
Starting point is 00:00:00 The content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures. Hi, everyone. Welcome to the a16z Podcast. I'm Sonal. Today, general partner Peter Levine and I are chatting with Ion Stoica, who's the director of the new RISE lab at UC Berkeley, formerly the AMPLab, where he was the co-director and which we've talked a bit about on this podcast. Ion also co-founded Databricks and served as its CEO for the first few years, and has played a seminal
Starting point is 00:00:38 role in multiple startups that have come out of the Berkeley Data Analytics Stack (BDAS, pronounced "badass"). So we wanted to get his thoughts on the ingredients of successful labs, given the evolution of academia, research, and industry, as well as discuss open source and the role of community, and what comes next after big data given AI. Welcome. Thank you. Thanks for having me. Ion, you've played super interesting roles, both on the academic front and on the commercial front.
Starting point is 00:01:04 And those two areas are often in conflict with one another. And you're one of these people who has very successfully sort of embodied those two different characteristics. How do you do that? And what do you think the relationship is between one and the other? And how do you get that to work so well? I'm going to make a little joke here and say, how do you do it all? I mean, actually, I'm super interested in the answer. Yeah, that's a great question, and there are many dimensions, many interactions.
Starting point is 00:01:33 I do think, at a high level, academia allows you to do more experimentation. It's set up for that. And at Berkeley, we are in a privileged position. Of course, being close to Silicon Valley, we have a lot of feedback from the industry, so we are very anchored in the real problems. Then it's very natural to transition some of this work to industry, and to, you know, students and sometimes faculty starting companies around these projects. I'd just be curious, like, how did AMPLab start? How did that come about? How did that evolve into RISE? So first, let me start by saying how the labs are structured at Berkeley, because it's something unique.
Starting point is 00:02:16 And this goes back 30 years now. So for the labs, one rule is that they run around five years. So there's a limit. There is a limit. So then each lab has a particular vision and goals. In particular, for instance, for the AMPLab, the vision is to make sense of big data. And the goal is to build the next-generation analytics stack to be used across industry and academia. Around big data, there were three projects that
Starting point is 00:02:46 really were notable outputs of the AMPLab. One, of course, is Spark, which is the basis of Databricks. Like, how do you describe Spark? It's a big data execution engine. The other project, of course, was Mesos, and that resulted in the company Mesosphere. And what is Mesos? It's a resource management system which allows multiple cluster computing frameworks to share the same cluster, the same hardware.
Starting point is 00:03:13 And the other project that came out of the AMPLab was a project called Tachyon, which has now resulted in a company called Alluxio. It's an in-memory storage engine, which again allows multiple cluster computing frameworks to share data in memory. Okay, so this goes back to what you're saying about how each of the labs at Berkeley is set up with a vision, and this is the vision of the big data ecosystem and what's coming next, and now it all kind of comes together. All these components are part of the stack.
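To make "big data execution engine" concrete, here is a minimal PySpark sketch: a word count expressed as transformations that Spark can distribute across a cluster. It assumes the pyspark package is installed and uses a local master; the sample lines are illustrative, not code from the episode.

```python
# A hedged sketch of Spark as an execution engine: a word count expressed
# as parallel transformations. Assumes `pyspark` is installed; runs locally.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")

lines = sc.parallelize(["big data on spark", "spark runs on mesos"])
counts = (
    lines.flatMap(lambda line: line.split())  # lines -> words
         .map(lambda word: (word, 1))         # word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)     # sum per word, in parallel
)
print(sorted(counts.collect()))
# [('big', 1), ('data', 1), ('mesos', 1), ('on', 2), ('runs', 1), ('spark', 2)]

sc.stop()
```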
Starting point is 00:03:39 I found it very interesting: while this was a big data stack as envisioned by the folks working on these projects, each of these projects has more expansive use than just the individual use case that was originally envisioned. Mesos, as an example, is not just for big data. It's a resource orchestration framework that can be used for any resource, right? And so the commercial applicability of these projects has a derivative effect: each is applicable well beyond just the requirement that these three things work together, as if that were the only way they could be used.
Starting point is 00:04:18 There are multiple layers of the stack. So actually, Mesos was the first project we developed. This is what started the stack. And Mesos was designed to support multiple cluster computing frameworks. We started with Hadoop. And actually, one of the reasons we designed Spark was to show that it's much easier to build a new cluster computing engine from the ground up on top of Mesos than in the absence of Mesos.
Starting point is 00:04:49 Because Mesos will provide a bunch of services: detecting whether the nodes go down and restarting some of the tasks and things like that, providing isolation between different cluster computing frameworks, and so on. Going back to this idea of the ingredients that make the Berkeley labs work: one of them was that they have a vision, that they have this time limit of five years. Is there an architect, like a person who sits at the top and says, here's how we're going to do it? Or is it that the students come up with projects? Like, how does that work? We'd like to claim that there is someone who put the stack in place on day one and that this is what it is. The truth is that things evolve more organically.
Starting point is 00:05:30 You have the major functionality and you kind of know how it fits. But then, as you go, you better understand the kinds of problems the industry wants to solve, and once you understand and solve those problems, other problems pop up that you didn't even know existed when you started the entire process. And these projects, are they all research projects by PhD students? Is there a requirement that they have to be? All of them start as research projects led by the PhD students. So I have a question about that, because one of the things that I've always thought about
Starting point is 00:06:07 when it comes to this tension between academia and industry: you might hear about needs and requirements from industry, but when you're in a research environment, you have an unlimited, open-ended, you know, vision. An algorithm in a research paper is very different from something in production at industry scale. So how do you guys navigate that? First of all, students do internships, they understand the problems, they develop the first solutions. And then with that understanding, you can start more principled designs to build these systems. But the truth is that on every successful project, we had at least one or two partners. For instance, we worked very closely initially
Starting point is 00:06:48 with Facebook. When we started working with Facebook, you know, their entire cluster, the big cluster for big data, was 18 nodes, and their big data team was like three people. Then, for instance, in the case of Mesos, we worked very closely with Twitter. And Ben Hindman went to Twitter and worked very closely with Twitter engineers to deploy Mesos in production. And actually, the feedback from Twitter had a big impact on Mesos's evolution. From just supporting big data cluster computing frameworks like Hadoop, it went on to support these long-running services. Yeah.
Starting point is 00:07:25 It actually strikes me that in your list of ingredients of what makes the Berkeley labs special, besides what we listed so far, and you alluded to this earlier, there is that proximity to industry: physical, geographic proximity. It's not an accident that Silicon Valley and other ecosystems around the world grow up around these universities, and there's something that goes on in university-industry collaboration back and forth, even for the professors, because you were able to go into a company and then go back. Yeah.
Starting point is 00:07:50 The other thing about what AMPLab and all these labs have done is that they take a very strong stand about being open source, and we generate no patents. This is one thing we say explicitly and are very clear about when we engage any partners. This is true, by the way, at Stanford, too. I think both Berkeley and Stanford have this in common. They don't care so much about patenting. I think it's true now of many universities. But Berkeley has this very long tradition of being open source,
Starting point is 00:08:18 even when open source was not fashionable. BSD Unix, for instance, you know, came from there. Because of that leadership that Berkeley has shown in open source, I would argue that the entire industry has now come around on this stuff, and that the nature of collaboration, and the nature of open source being a basis for innovation in software, has really become a predominant guiding point in software development and in building companies as a result. Whereas in the past, open source was seen as a fringe thing.
Starting point is 00:08:57 And innovation would occur within the walls of large organizations. And now, because of the work that Berkeley has done for years around open source, we see it as being the engine of innovation for much of the software landscape today. This is very true. It didn't happen overnight. And, of course, it was not only Berkeley. There are many other centers producing high-quality open-source software.
Starting point is 00:09:21 For instance, one you are very familiar with. Yeah, Cambridge. Yeah, of course. I had no idea that Cambridge was big on open source. I think of Berkeley, MIT. MIT and Carnegie Mellon. But what happens is, like, it's a tipping point now, right? Over the past five, ten years, we have been witnessing the tipping point,
Starting point is 00:09:39 where now you go to try to sell services to the companies and they ask: is it open source? Well, the element, I believe, that has brought open source from fringe hobby project out of academia to mainstream is the fact that you can actually build commercial businesses on top of open source. And a lot of that, which you guys at Databricks are doing, and many other companies (GitHub is another example), these hosted offerings that leverage open source have given a very clear path to sort of marry the innovation of open source with the ability to monetize, which is very important from a commercial standpoint. And up until sort of the cloud and SaaS, you know, there wasn't a clear path to go do that. And now that's sort of the best of both worlds. I think it's worth putting this in bold: when you talk about innovation,
Starting point is 00:10:36 you're essentially crowdsourcing the best developers and the best R&D in the world, because, what's that famous line, all the best people will never work for you? And in fact, at Berkeley, Henry Chesbrough (Hank) is there, and he talks a lot about open innovation: you have to collaborate with the outside world, outside your company walls, to be successful.
Starting point is 00:10:55 The most interesting thing I saw this past week is this stat that Microsoft now has more commits to GitHub than any other company. Can you just think about that for a moment? Like, Microsoft. And it's not only Microsoft. If you think about five years ago, at Google, everything was closed source. Oh, even at Google?
Starting point is 00:11:15 I always thought Google was kind of a pioneer in the open source world. It's more recently, right, with Kubernetes and others, or TensorFlow, that they went open source. But this is relatively recent. If you think about MapReduce and the Google File System, which really started the big data movement, those were closed source. And that's what enabled Hadoop to be developed as an open source alternative to the Google File System and Google's MapReduce. It's another sign of the tipping point. The fact that all these big companies developing primarily closed-source
Starting point is 00:11:57 software and products are now far more active in the open source community. And one thing I want to mention here is you have to build a community. That goes back to your point about needing this collaboration. So all these successful open source projects, they have a strong community. And I think that's also another thing which helps, because people kind of understand better now how to build these communities. There are more examples out there. Is it just engineers and research students? Like, how do you think of community managers? I know companies invest in those functions. Absolutely. So first of all, here is how these Berkeley labs work.
Starting point is 00:12:36 Twice per year, there is a retreat, where people from Berkeley (students, faculty) and industry people from our sponsors come together, and we present the earliest work and get feedback. So it's a very... That's another ingredient to add to the list. It's a fantastic event. Now, what we've done is start what we call camps.
Starting point is 00:13:03 We have the AMP Camp, where we are doing exactly that: trying to foster building the community. We not only present our work, but we have tutorials. We train people to work on our latest software, not always stable. Right. But the beta, the beta. Actually, not even the beta, right? Pre-alpha. And so that also helps to foster the community.
Starting point is 00:13:28 And these are natural precursors of these summits. Well, actually, I was going to say it reminds me: another example of this collaboration between industry, academia, and other players is what O'Reilly did with open source and their starting of Foo Camp and various other camps. It's sort of the same kind of thing. And so it's really interesting, because these ideas are permeating both ways into the system. So it's super interesting to hear about the labs and what happens at Berkeley and the integration between academia and business. So now you're starting, or have just started, the RISE lab. What's that all about? What does it stand for? Yeah, it stands for real-time intelligent secure execution. Real-time intelligent secure execution. That's actually quite a mouthful. I get why you're calling it RISE instead of saying all of that otherwise. It's a mouthful, yes.
Starting point is 00:14:16 So think about putting it in the context of the AMPLab. AMPLab was about analytics. And by the way, what did AMP stand for? AMP is algorithms, machines, people. It was about interactive, ad hoc analytics, doing much better than before on large amounts of data. Getting insights. When you say making sense of big data, you are talking about understanding, getting
Starting point is 00:14:39 insights. What RISELab is about is making intelligent decisions, taking intelligent actions on the data. People have a lot of data, they get more and more data, and then what do they want to do with this data? The first step is getting some insights on the data, computing some metrics, KPIs. But the next thing is to make decisions. Right?
Starting point is 00:15:02 Yeah. This is the holy grail. This is the holy grail. Absolutely. Right? So you want to do medical diagnosis. You want to see whether a transaction is fraudulent or not. You want to see where to steer a car next, a self-driving car.
Starting point is 00:15:17 Of course, you want to know what ads or news you are going to show to the user next. Right. It's sort of predicting the future. Always. To make the best decision, you have to predict the future. Yeah. That's the definition, almost. And then you ask yourself, what is a good decision, right?
Starting point is 00:15:36 What is a good decision? That's a very essential question, in fact. It needs to be fast, right? Because faster decisions are better than slower decisions. It needs to be on the most recent data. I want to make a decision based on what happens now. By most recent as in the realest, real-time-est data out there. Right, right?
Starting point is 00:15:56 Rather than what happened yesterday. Or even an hour ago; actually, we're talking real time. And it needs to be personalized. You hear about this hyper-personalization; it's clear it's here to stay. So this basically defines the goals, because we want to make the decisions on live data, and we want to make decisions which are secure. Why?
Starting point is 00:16:16 Because if I am making a personalized decision, then there is a question: what about the privacy? Okay. So then you want to make personalized decisions while preserving the privacy, while preserving data confidentiality. And of course, because the decision happens in real time, in some cases the humans may not be in the loop. So you need to ensure the integrity of the decisions.
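One standard way to square personalized decisions with privacy is differential privacy: add calibrated noise to anything you release, so no single person's data is exposed. Below is a generic sketch of a noisy count; the epsilon value and the spending query are illustrative assumptions, not a specific RISELab system.

```python
# Hedged sketch of a differentially private count. A count has sensitivity 1,
# so Laplace noise with scale 1/epsilon suffices; epsilon here is illustrative.
import random

def dp_count(values, predicate, epsilon=0.5):
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two Exp(epsilon) draws is Laplace with scale 1/epsilon.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

purchases = [12.0, 95.5, 310.0, 7.25, 88.0]
# "How many users spent over $100?" answered without exposing any one user.
print(dp_count(purchases, lambda amount: amount > 100))
```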
Starting point is 00:16:40 And then there are two other things about the decisions which are very important. One is robustness. If the human is not in the loop, the decisions had better be robust. There are many dimensions of robustness. It's about robustness to noisy input: if it's raining, the camera doesn't capture the same kind of high-quality
Starting point is 00:16:58 images. Or in a real-life example, like if you don't detect a heart attack and it's something else. Totally. It needs to be robust also when some parts of the system fail, some component fails. But also it needs to be robust to unforeseen input. So let me give you an example. Today, it's easy to have an algorithm that tries to recognize cats and dogs. You do that, and it's pretty good, very accurate. But now I show this algorithm an elephant. Really, what you want the algorithm to say is: I don't know, I'm not sure. We need to provide some confidence interval for these decisions. Because then, for instance, if you are a self-driving car, you can pop up and tell the driver to take over, or make a safe decision. Slow down and stop. That's one thing, robustness.
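A thresholded softmax is only a crude stand-in for the calibrated confidence being described here, but it makes the "I don't know" behavior concrete. The two classes and the 0.9 threshold are illustrative assumptions.

```python
# Hedged sketch: abstain when the top softmax probability is below a threshold.
import numpy as np

CLASSES = ["cat", "dog"]
THRESHOLD = 0.9

def decide(logits: np.ndarray) -> str:
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over class scores
    top = int(np.argmax(probs))
    if probs[top] < THRESHOLD:
        return "not sure"                # e.g. hand control back to the driver
    return CLASSES[top]

print(decide(np.array([4.0, 0.1])))      # confident -> "cat"
print(decide(np.array([1.1, 1.0])))      # ambiguous (an elephant?) -> "not sure"
```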
Starting point is 00:17:43 The last one is explainability. Explainability. As we make decisions which augment the ability of humans to make more sophisticated decisions, or take decisions on behalf of humans, like a self-driving car, explainability is very important. Think about a very simple example. You go to the doctor and you have an x-ray, right? And now, actually, there are very good algorithms that go and take that x-ray and give you a diagnosis.
Starting point is 00:18:16 But would you be satisfied with just a diagnosis? No, you'd want someone to hold your hand and tell you more about it. Yes. So that's what it is. You want to know why. This is how people start to trust a decision. Like, even if your friend or spouse makes a decision which affects you and is kind of unexpected, what is your first question?
Starting point is 00:18:36 You know, why did you make that decision? Well, I mean, it sounds like what you're saying is that AMPLab, in evolving into RISELab, is actually set up for decision making, as you're describing all these characteristics. But more importantly, it's big data moving into AI, and the world of science and tech is moving in this direction. There is no question about that. In AI, there have been, over the past five years, great results. When you look at the PhD students who apply to Berkeley, the PhD applicants, it turns out that well over 60% are applying for AI.
Starting point is 00:19:13 I have a pushback here, though. How then is a new lab at Berkeley relevant in a world where companies are now kind of creating their own labs in-house, to really put together a lot of the same skills, PhDs, et cetera? And in thinking through this sort of connection to industry, especially because one of the complaints I hear all the time is about how the majority of AI papers are actually algorithms, and we actually need a lot more tools and tooling to actually take that information
Starting point is 00:19:40 and make decisions on it. Totally. Now, I think that it's an excellent question, and it's a challenge, but, you know, sometimes the challenges are motivating. Those are the best kind. So the goal of the RISELab is to build open source platforms, tools, and algorithms to enable these intelligent real-time decisions on live data with strong security. So this will fill some of the holes you mentioned. And it's true: many of the algorithms which are published are hard to reproduce. So that's one part of the question; we see that need. Now, the other part of the question is to say, well, look, Google or Microsoft, or even now the Chinese companies, Baidu and so forth. I was thinking of Baidu, actually. They
Starting point is 00:20:06 spend huge amounts of resources, and you have also OpenAI. Huge amounts of resources. So how can you make a difference? I think this has always happened. Now it's about AI, but in the past it was about operating systems. It was about, you know...
Starting point is 00:20:39 Computing, personal computing. Personal computing. What was Xerox PARC but an industry lab? So this all happens. So what I can tell you is that we find a lot of large companies and partners are consumers of AI, or they want to be. We have financial companies like Capital One. We have IoT, like Ericsson.
Starting point is 00:21:01 We have Huawei. We have Ant Financial. Which is Alibaba's payments arm. Ant Financial is doing far more than payments. Oh, I know. So on one hand, you know, these companies would prefer to have a very strong open source stack they could use. One that's not solely the province of a single company.
Starting point is 00:21:22 The Microsofts and Googles of the world, which have very strong AI teams. They have hardware, right, powering AI algorithms. What we have is more research collaboration. Even they cannot explore all the potential, say, design choices. I would think that one of the things that you have as an advantage that companies do not is that you have a much different risk profile than a company. Earlier, one of the things you said is that the purpose of a lab is to do experiments. And one of the questions that I have is, in this day and age, a lot of startups and big
Starting point is 00:21:50 companies do experiments in different ways. In the RISE lab, even faculty don't have offices. It's an open floor. We have, you know, people sitting side by side: AI people, systems people, and database people, architecture people. Companies have tried experiments like that, and they don't work, because it's hard to get people to just suddenly start collaborating because you stick them in a room together.
Starting point is 00:22:18 That is something that big companies cannot easily do. They're siloed by function. How many projects do you envision running concurrently at RISE? Is there a limit? Or how do you guys... There's not really a limit. It's, again, more organic. And there are different kinds of projects.
Starting point is 00:22:34 There are projects which hope to become a strong artifact to be used across industry, like Spark or Mesos or Tachyon. And there are more exploratory projects, which will remain at the prototyping stage. But you try to categorize. So, you know, you look at these kinds of big,
Starting point is 00:23:01 broad categories of problems and try to develop solutions for them. I don't know if you ever looked at one of my blog posts. I did "The End of Cloud Computing," right? The return to the edge and distributed computing: a lot of data now gets collected at the edge. It's real-world information. A lot of decisions are happening there. A lot of decisions are happening at the edge. Totally. Where decisions need to get made. And so I'm curious, are there projects that you guys think about related to that?
Starting point is 00:23:18 That's a great question. So, indeed, we are really looking at building systems which span the cloud and the edge. And I think it's going to go both ways. Things which are now done in the cloud are going to migrate some of the functionality to the edge. Also, on the other side, things which are now done only at the edge,
Starting point is 00:23:37 like self-driving cars, will migrate some of the functionality to the cloud. So it's vice versa. So you're saying that things in the cloud are moving to the edge, and things at the edge are moving to the center. I feel like there's a poem in there somewhere. What is that? Things fall apart.
Starting point is 00:23:52 I just remembered the classic poem. One of the really interesting things: so I started with, like, what happens when something ends, right? But now I'm going a step further, to the future of computing, because I believe that programming is sort of at an end. The life cycle of programming is sort of at the end. Like, you can only write so many if-then-else statements. You're kind of done with logic, right, in that sense. And all this purification of data, and the input of data into a system as a precursor to an observation,
Starting point is 00:24:22 that's happening, and real-world information is sort of the future of computing and iteration. You close a loop, right? Correct. We talk so much about programming, but I think that your lab is the exact example of academia
Starting point is 00:24:35 being proactive in the new world of data, as opposed to coding, right? And new methods of academic research will need to be developed in order to promote data as a functional input to computing. So, about that:
Starting point is 00:24:52 In some of the applications of this new AI research, like, for instance, reinforcement learning, it's about synthesizing programs. Correct. Correct. Any other work that you think is interesting? One is called Ray, and it's to build a cluster computing framework which makes it easier to build the next generation of AI applications. One example is reinforcement learning. This reinforcement learning, it's about an agent which continuously interacts with the environment
Starting point is 00:25:26 and learns from this interaction with the environment. If you think of a car, it's about an agent interacting, taking decisions which will affect the real world, and learning from these interactions.
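To make that loop concrete, here is a hedged sketch of parallel rollouts in the style Ray enables. The actor calls (ray.init, @ray.remote, .remote(), ray.get) are Ray's real API; the toy environment and random policy are illustrative assumptions, not RISELab code.

```python
# Hedged sketch: agents interacting with an environment, run in parallel as
# Ray actors. The environment and policy are toys for illustration.
import random
import ray

ray.init()

@ray.remote
class RolloutWorker:
    def rollout(self, steps: int) -> float:
        state, total_reward = 0.0, 0.0
        for _ in range(steps):
            action = random.choice([-1.0, 1.0])  # stand-in for a learned policy
            state += action                       # environment transition
            total_reward += -abs(state)           # reward: stay near zero
        return total_reward                       # the signal to learn from

workers = [RolloutWorker.remote() for _ in range(4)]
print(ray.get([w.rollout.remote(100) for w in workers]))  # four parallel returns
```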
Starting point is 00:25:48 The other project is Clipper. It's about model serving. Once you create these models, you need to serve them to users: they make a request, say, to do image recognition on this image. And this is a hard problem, at multiple levels. How do you scale the life cycle management of the model serving? How are you going to update the models? How are you going to improve these models over time? Why is that such a hard problem right now? Well, because you developed the models on some data set, and, for instance, the data or the queries are going to evolve over time. And because the environment, the world around you, evolves, what you learned, which is embedded in the model, may not be as relevant or as good as it used to be. So that's one example.
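The sketch below is not Clipper's actual API; it is just a toy registry that shows the life-cycle problem being described: predictions keep flowing while a retrained model is swapped in underneath.

```python
# Hedged sketch of model serving with hot-swappable versions (toy, not Clipper).
import threading

class ModelRegistry:
    def __init__(self):
        self._models = {}
        self._lock = threading.Lock()

    def deploy(self, name, version, predict_fn):
        with self._lock:
            self._models[name] = (version, predict_fn)  # swap without downtime

    def predict(self, name, x):
        with self._lock:
            version, fn = self._models[name]
        return version, fn(x)

registry = ModelRegistry()
registry.deploy("image-classifier", 1, lambda x: "cat")
print(registry.predict("image-classifier", "img.png"))  # (1, 'cat')

# After retraining on fresher data, roll out version 2 in place:
registry.deploy("image-classifier", 2, lambda x: "dog")
print(registry.predict("image-classifier", "img.png"))  # (2, 'dog')
```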
Starting point is 00:26:14 The other work I would mention is in the area of security. We have a project called Opaque. It's actually also related to Spark. It's about how you do analytics, for instance, in the cloud, so that you can defend against any kind of attack: even if the operating system is compromised, or the hypervisor or the virtual machine is compromised, the data and the computation are secure. And is it because they're kept separate? This uses these hardware enclaves. Oh, interesting.
Starting point is 00:26:36 Which are now in more and more processors, CPUs, like Intel SGX and ARM TrustZone. So it's using these new developments in the hardware. And finally, one other project is Ground, and this is about trying to understand the semantics, the provenance, the lineage of the data. Oh, interesting. Who created a certain data item? Who modified that data item? Who looked at those data items?
Starting point is 00:27:01 And the goal is to provide this service across multiple data sources, across multiple data storage systems. I love this. It's like data provenance as a service.
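As a hedged sketch, the kind of record such a data-context service tracks might look like the following; the field names are assumptions for illustration, not Ground's actual schema.

```python
# Hedged sketch of lineage records for "who created/modified/read this data?"
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LineageEvent:
    dataset: str       # the data item
    actor: str         # who touched it
    action: str        # "create" | "modify" | "read"
    store: str         # which of the multiple storage systems it lives in
    at: datetime

log = [
    LineageEvent("clickstream_daily", "etl-job-42", "create", "hdfs", datetime(2017, 10, 1)),
    LineageEvent("clickstream_daily", "analyst-jane", "read", "hdfs", datetime(2017, 10, 2)),
]
# "Who looked at this data item?" becomes a query over the event log:
print([e.actor for e in log if e.dataset == "clickstream_daily" and e.action == "read"])
```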
Starting point is 00:27:33 You know, I'd ask you about how you guys came up with the names for all those projects, because I'm always fascinated by the naming of things. One of the funniest stories that Michael Franklin told me about AMPLab is that when you guys were coming up with the name, Dave Patterson wrote an email going: why are you guys being all backward about this? You should instead take a first-principles approach to what you want the qualities to be, and then come up with the acronym, instead of, you know, the other way around. You must have had some debates around how you ended up on RISE. Patterson has two rules. One is that it's good to be a four-letter name. RISC and RAID qualify as examples. RAID.
Starting point is 00:28:03 Didn't he come up with that too? Yeah. That's RAID: redundant array of inexpensive disks.
Starting point is 00:28:21 Yeah, he was the inventor of, like, all this striping and streaming and parity bits and all this stuff. So there's his RISC, RAID, and hopefully RISE now. That's hilarious. The other one is that it starts with an R.
Starting point is 00:28:41 So you guys ignored him the first time around, and now you're like, okay, we're going to start with an R. And so how did you guys sort of land on RISE? Did you guys just literally, like, throw up, you know, this is where we're going and here are all the qualities that we need to capture? We started with what the attributes are, what you want to do. It's like a good branding exercise. And that was a good exercise. Yeah, that's great. Ion, thanks very much for educating us on kind of this intersection of academia and business.
Starting point is 00:29:12 Thank you for joining the a16z Podcast, Ion. Thank you for having me.
