The a16z Show - Reining in Complexity: Data Science & Future of AI/ML Businesses

Starting point is 00:00:00 Hi, everyone. Welcome to the a16z Podcast. I'm Sonal. For this week's episode, we have one of our hallway-style conversations, and this one is literally like eavesdropping on a debate and discussion that actually started as a Twitter thread, all around the question of whether and how data and AI/ML (machine learning) companies are different from software companies, and what that means for the future of software businesses. Our guest even questions our view of software eating the world, or rather asks: when software is everywhere, what comes next? Our guest is Peter Wang, the co-founder and CEO of Anaconda, who also leads their open source and community innovation group, as well as created the PyData community and conferences,

Starting point is 00:00:47 and has devoted a lot of time and energy to growing the data science community there. And he's in conversation with a16z general partner Martin Casado, who's written a lot about the evolution of software businesses, the new age of data, and especially AI/ML economics. You can find those pieces at a16z.com slash ML economics. The two dive into a number of themes throughout this conversation,

Starting point is 00:01:10 ranging from open source and crowdsourced innovation and the messy ways that innovation really plays out, to what it means when you move from hardware to software to data and AI/ML, abstracting something that is not just complicated, but actually complex. And then they touch briefly on what it means practically in building a new type of company

Starting point is 00:01:29 as well as the evolving role of data scientists. But the conversation begins with their shared vantage points in coming from physics, which is relevant here since these new kinds of businesses and products involve a process of experimenting, much as with physics. Both you and I come from the physics, computational physics background, and we've both kind of been pushed into this data, AI, ML, data science. And I don't know if that is coincidence or if we have an

Starting point is 00:01:59 affinity for that. Before we get into that, though, there's kind of a competing view of the world, which basically says SQL can do everything. And it's funny, we spend a lot of time actually looking at the data science or the data landscape. And it feels like there's two kinds of worlds. There's the data warehouse maximalists, which is like, we'll stick all data in the data warehouse, and then we're going to do SQL, and then we're going to have some extensions to SQL, like you see popping up in BigQuery, whatever, and that can do everything that needs to be done. And oh, by the way, if someone's using Python and R, all they're really doing is basic regressions, and so we can just make that a simple extension and we're done. And then there's the other

Starting point is 00:02:32 view of the world, which I like to call the Hadoop refugees, which is like, actually we do hardcore computation and we need R and Python because the stuff we do is very sophisticated. I mean, I know you're squarely on one side of those, but I wonder, do you think there's a convergence that happens? Do these stay two worlds? Does one become irrelevant? Like, what happens there? Just because you oppose extremism doesn't make you an extremist, right? I would say data warehouse maximalists are extremists. And I see a heterogeneous world. And it's the old yarn about, I guess, I don't know, there's so many variants of this. But Alan Perlis, a great computer scientist, has some really great quotes, some irreverencies, about these kinds of things. But I would say that to the

Starting point is 00:03:11 idea that everything can be expressed in SQL, it's like, which SQL, with how many extensions? Because at the end of the day, with enough extensions, and Multicorn on your Postgres actually running a Python kernel, yeah, I guess you're doing SQL, but you're running a Python script, you know, so that doesn't really count. And frankly, a lot of stuff runs on Access and VBA in this world. VBA isn't SQL. I think if you choose to look at the world through a particular lens, you can choose to count everything else as residuals and rounding errors. But if you take off those lenses, you see a much more diverse landscape. And I think that's where, for me, I see the space for SQL. And I understand the reasons why it has evolved into a particular

Starting point is 00:03:47 kind of animal. Like, the shark is still the best predatory fish in the ocean, but it's not the biggest predator in the world. And I think there's something about that. If you're in the ocean, you're going to basically be shark-like if you're going to eat a lot of fish. So if you're in that business data analytics world, especially because a lot of business data looks like fish, it's evolved to look like food for the sharks. So that's kind of the way it is. But what Hadoop opened up, back in 2012, I called it the Hadoop battering ram. I said, listen, we're not going to win the Hadoop game. We'll let the Hadoop vendors go and fight against the Teradatas and the Oracles and all the classical data warehouse guys. Let them do that thing. Once they've battered down the door, we're going to

Starting point is 00:04:22 come flooding in with all sorts of heterogeneous approaches to data science, data analytics, things that are hard to ask in SQL. And moreover, there's a term I use, which I don't hear used very often. Now, obviously, you've heard the term shadow IT, which is used quite a bit. But there's a shadow data management that's a far, far more insidious and dangerous problem. When I was at a large investment bank, they had a million dollar Oracle database sitting somewhere, and it was too slow to actually run the analytics they needed. So they had an instance of this Oracle database, cost a million bucks, and the only query they ran was a full table dump into a CSV. And then they took that CSV and they did everything else with it. And it was

Starting point is 00:05:03 Python scripts. It was some, you know, random Java crap. It was a bunch of other stuff. And it was sort of like, if you're a data manager, if you're in the data management practice, you say, wow, we just have another big old million dollar instance stood up. Our data management techniques are great. It's a, what do you call it, a Potemkin village, I guess, right? But then when you actually go and you ask the developers, hey, where's the source data for the stuff? Where's prod data coming from? Like, oh yeah, this file share, backslash backslash something or other, or, you know, that file. I'm like, that file? What about that database? Don't touch the database. It's too brittle. Right. So there's this kind of stuff going on and everybody

Starting point is 00:05:35 listening knows what I'm talking about. That shadow data management is absolutely a pernicious problem and data science is just eating it alive. Because to ask the question you want to ask, you have to integrate data sets together. Master data management is about siloization, normalization, and all this kind of stuff. You've hit the segue, which I just think is so germane to what we're here to talk about, which is there's clearly problem domains which SQL is totally fine for, right? Yep. And you can argue there's problem domains which it's just not good for. I mean, any sort of hardcore statistics it's just not very good for. And the point of us being on this podcast is actually to talk about, okay, listen, we're seeing kind of new types of companies and new types of workloads,

Starting point is 00:06:13 and they're around kind of processing data. And I totally hear you that this shadow data management is a real issue. And you could make an argument that why that exists is not because people are stupid or they don't know how to do good workflows. It's like, literally, we don't have the tooling to deal with data in the right way. One macro question that I have that I would love to hash out with you is, are we seeing a fundamental shift in workload that requires a fundamentally new set of tools and a fundamentally new type of company,

Starting point is 00:06:44 or is this just more of a transition where we can kind of put into service existing tools? And I just want to be a little bit more specific, which is in the past, you had your toolkit of systems approaches and you had a software system and you'd kind of pull them out and apply them to the problem, and SQL's one of them.

Starting point is 00:07:01 We kind of understood how those software systems behaved, and we kind of understood how the companies built around them behaved. As an investor looking at a lot of data companies, they just don't look the same. The types of tools they use, the type of operational practice they use, and the one that you pointed out was a great one, which is now data becomes a primitive that you want to actually apply software techniques to, in a way, but we don't have the tools to do that. And then we've written posts about how margin structures look a lot different, how the way you build your company is different. And so, do you think this mess is because data scientists don't have formal CS training,

Starting point is 00:07:30 or do you think this is an entirely different problem domain and we should actually look at what the future looks like for that and develop new tools, etc.? This is like the heart of what we're talking about. This is absolutely the heart, and I will try to start from the top, which is this concept that every baby, every child is born and they're raised, and they think their childhood is normal, right? They think their childhood is the normal thing. So we have developers coming online in the late 2000s, let's say, and they think this is the world. Even me as a professional starting in '99, right?

Starting point is 00:07:57 It's like, well, this is just what there is. The more you start researching history and looking back, you're like, you know what, in this industry we just layer frozen accident on top of frozen accident on top of frozen accident. Very, very, very few times do people make principled, intentional, revolutionary shifts, right? It's basically band-aid as substrate, okay? So starting from the top, what I would say is that there is no law. There was nothing carved in stone that Moses brought down from the mountain that said all information systems must be deconstructed into hardware and software and data. There's no such

Starting point is 00:08:31 thing. It was information systems, full stop. The fact that we had different cost structures for innovation in hardware versus software versus networking and so forth, that has led to different rates of innovation, different paces, things like that. And so when a business steps in and says, okay, what's on the shelf that I can use to accelerate my business processes, then it makes sense because this thing, that thing, the other. Like when you buy a car, you buy the car, and then you put CDs in the car. You don't go buy a car with a CD pre-spec'd, right? Is there the exception of technical innovation in certain areas? So, for example, we now know how to build systems that extract very useful information out of data pretty simply, that didn't really

Starting point is 00:09:12 work in the late 90s. Like, I remember the whole first, you know, neural network, genetic programming... Oh yeah, right, right, right, yep. ...miasma of the late 90s. I did a number of projects on that; they didn't really work. They actually work now. So you could also argue that the technical landscape has changed. Has that just been a macroeconomic issue on the company? Yeah, I mean, ornithopters work if you can flap hard enough, right? It doesn't necessarily mean it's the right architecture, and it depends on the density of air. Ornithopters might work great on Mars, but not on Earth, right? Propellers work better on Earth, right? Well, with internal combustion engines, et cetera, et cetera. But the point is that, yes, you're right. I guess my point could

Starting point is 00:09:45 be said thusly: there is a multidimensional optimization surface we should be thinking about, not just the optimization surface of software or data architecture, data management, and, you know, things like that. I mean, as someone who did software-defined networking, you know that better than anybody. But here's what's interesting to me, which is if you build a hardware company, the tools you use, the money that you need to raise, the innovation pace is defined one way. And if you do a software company, it's actually defined quite differently. Although you still use a lot of the same practices, it still is engineering, you can still modularize. It's not clear to me that as soon as you move to data, you're in the same domain. Software to me feels like an engineering problem that you can modularize, you can build interfaces, you're building it from the ground up, you control all the primitives. Data feels like science.

Starting point is 00:10:38 It's like you're trying to rein in the complexity of the physical world, right? It's one thing to build a house; building a very complex building is very hard. We had to do all this design practice. And we got the skyscraper. That's very different than understanding the cosmos. Because the cosmos is so complex, and you don't understand what it is,

Starting point is 00:10:55 and you don't have a blueprint. And data companies are defining the cosmos more than building the skyscraper. You hit it. You hit it on the head. I'll just back up and comment on one thing relative to the hardware and software. Hardware is frozen software to some extent, but, how to put it, because hardware is expensive and slow and has been, at least historically, the industry has a much more robust view towards standards. Here's the thing. Because you have standards, now you have a binary, bullshit-proof, does it work or does it not work kind of thing. Okay, that then reflects and changes kind of what you need to do. What it does is it makes

Starting point is 00:11:30 mistakes in hardware expensive, because there is an intersubjective reality beyond any particular vendor about what is a mistake. In software, because it moves so fast, it's too fast-running to build specs and hard specs and say, did you meet this performance spec you said you were going to? No one cares about that. Software is just so fast and loose, it's like jazz. So because it moves fast and you can't put that thing in, the price of making a mistake in software is almost completely subsumed or lost. And so it's cheap to make mistakes in software because the cost is invisible. 100%. However, the actual engineering practices aren't that different. I mean, you're absolutely right, formal verification is much more important in hardware, but it still

Starting point is 00:12:13 feels like engineering to me. You know exactly where you're going. You have a roadmap. You build an engineering team around that. Data is different. Data is different. You don't have a roadmap. Like, it is the universe that you're trying to ingest and extract insight from. In fact, this is the exact critique. You're absolutely right. When you talk about what you do in software and hardware companies, you are trying to manage complexity for the most part. You get something, but the thing that always screws you... I figure every kind of engineering is trying to achieve some kind of lift while fighting some kind of drag, right? And in the case of software and hardware engineering, usually it's achieving performance or something like that, or some scale of computation, while minimizing complexity

Starting point is 00:12:49 and having manageable errors and things like that. Okay, so that's those things, but it's very goal-oriented. Yeah, building to a goal. It's one thing to say, like, I'm going to build this complex system, where you can basically describe and do mock-ups for the destination. That's very different than saying, extract insight out of this. That's right. The great John Tukey said there's two kinds of data analysis. There's confirmatory, kind of reporting mode, and then there's exploratory.

Starting point is 00:13:11 And the thing you're talking about, the reason why data smells and data practices smell like science, is because there is no such thing as data. All data is just frozen models. Right. Every single data set comes from a sensor. Even a picture. Everyone thinks, oh, well, I took a picture, right? That's just raw data. No, it's not. There's a Bayer matrix. There's a log transform. There's a gamma correction. And fundamentally, there's an exposure time, which is a temporal sampling domain. So there's all of these things. There is no such thing as data. There's just frozen models. And where businesses get screwed up is when they treat data management as sort of this goal-oriented siloization: it's a static artifact and it is artifact management. It's almost like a sort of ad hoc library process. And that's not the same as the kind of data thinking, or the way you think about data, in an ML/AI sort of world. Because in that world, we see that models and data are both fluid. Not to get too metaphysical, but it's more of a process-oriented metaphysics. It's much more temporally oriented than the static

Starting point is 00:14:14 views that current data management practice has. And that's why I think the SQL database extremists are not going to win this particular round. So I'm a systems guy, right? I did my PhD in computer systems. In systems, we have five tricks. It's like virtualization, caching. We literally have five or six tricks that we throw at every single problem. And you can build amazingly complex systems with these things.

Starting point is 00:14:38 Like, you know, we understand distribution. We understand consensus. And so while a piece of software like Google is very complex, it actually can be reduced into sub-problems that we know answers to, and then, you know, we can build it. So I would say like the relative complexity, the relative entropy of a software system is finite. Right. It's not clear to me if you're trying to use data to run a system that the entropy is as finite. Well, yeah.

Starting point is 00:15:10 We don't control nature. I mean, what do we use data for? We use data for pricing. We use data for fraud detection. We use data for calculating wait times. Okay, so what are the inputs to these things? It's like people's behavior. Like, there's so much entropy in all of us.

Starting point is 00:15:27 It's like the weather. It's hugely lossy, right? Well, it's these classically chaotic, high entropy systems. And so one of my theses, and I'd just love to test this on you, is that building a software system is a relatively low entropy exercise because you're dealing with primitives that you understand and you're engineering it. But when you're actually trying to deal with data, you're reining in so much entropy and you're trying to extract from it.

Starting point is 00:15:48 That ultimately is why we end up with different companies, because it's just much, much harder to deal with that much complexity. Yeah, well, that makes a lot of sense. And, you know, the Cynefin framework talks about the difference between complex and complicated and chaotic, right? Yes, yes, yes, yes, sure. Right? And so complicated, and I think the pithiest way to say this is,

Starting point is 00:16:06 complicated means that you can take it apart, understand the bits, and put it back together again. Complex means that you cannot do that, right? So a fine Swiss watch is complicated, a cockroach is complex. And so I think when you talk about computer systems, I'm not a systems guy like you are, but one of the best things I've heard about it is that everyone thinks, what is the quote? Everyone thinks distributed computing is about space, but really it's about time. What is the time horizon in which we can define a unit of atomicity?

Starting point is 00:16:36 What is the time to coherence, right, et cetera, et cetera. And so it's always a space-time trade-off. And I'm sorry I make this sound so much like the physics world, but I see it that way because it's a natural reflex for me. In fact, I wanted to major in computer science. But my dad, who was a physicist, he said, look, son, if you go into computer science, you're going to become a programmer. And you're just going to build tools. If you're a scientist, though, you're going to be the one using those tools to make an impact. So I majored in physics.

Starting point is 00:17:04 But then as soon as I got out of physics, it was '99. I'm like, all my friends are getting starting bonuses and they're getting jobs, and they're worse programmers than me. And so I ended up joining a computer graphics startup. And that's why I started using Python, in '99. I realized that I could script a bunch of C++ much better with Python than with broken template support in Visual Studio. It was God-awful.

Starting point is 00:17:23 I came to networking by way of computational physics. When I was a computational physicist, I was a computer scientist doing computational simulation at Lawrence Livermore National Lab. That was my first job after undergrad. I was a huge Numeric user because that was the only way to do high performance computing in Python, and from what I understand that became Anaconda... I would love it if you would kind of give the history of that project.

Starting point is 00:17:45 So in '99, it was Jim Hugunin, and I think there's some others that I might be forgetting, who can be credited with working on some of the early matrix stuff. Jim Hugunin worked on Numeric, and they realized that the operator overloading in Python would allow you to do something that looked a bit like MATLAB. You know, it's like, hey, it looks like you're writing MATLAB code. And it's like, hey, this hack kind of works.
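The operator-overloading trick described here can be sketched in a few lines of plain Python. This is a toy illustration only: the `Vector` class and its `_pair` helper are hypothetical names invented for this sketch, not Numeric's actual implementation, and it runs at Python speed rather than C speed.

```python
class Vector:
    """Toy 1-D array. A stand-in for illustration, not Numeric's real code."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def _pair(self, other):
        # Pair elements with another Vector, or broadcast a plain number.
        if isinstance(other, Vector):
            if len(other.values) != len(self.values):
                raise ValueError("length mismatch")
            return zip(self.values, other.values)
        return ((v, float(other)) for v in self.values)

    def __add__(self, other):
        return Vector(a + b for a, b in self._pair(other))

    def __mul__(self, other):
        return Vector(a * b for a, b in self._pair(other))

    # Let `2 * x` and `1 + x` work when the number is on the left.
    __radd__ = __add__
    __rmul__ = __mul__

    def __repr__(self):
        return f"Vector({self.values})"


x = Vector([1, 2, 3])
y = Vector([10, 20, 30])
z = 2 * x + y  # reads like matrix-language code, with no explicit loop
print(z)       # Vector([12.0, 24.0, 36.0])
```

The point of the sketch is only the syntax: once `+` and `*` are overloaded, array expressions read the way they would in a matrix language, and the loop can be hidden in a faster layer underneath.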

Starting point is 00:18:05 And also Python's C-level extensibility meant that they could build a little tight C library that would be fast. So you're writing the scripting thing, the syntax looked like MATLAB, but it ran at basically C speed, which is really important. So then it turns out, though, that for some of the features they built, the Space Telescope Science Institute folks, the ones who run the Hubble telescope, had some other ideas about what they wanted to do with this library. And Numeric wasn't quite flexible enough or some other stuff, so they created an alternative matrix library called Numarray. And Numarray had like fancy indexing. Numarray had a few other things. And so the ecosystem in

Starting point is 00:18:38 the early 2000s, when I got my first paid job doing Python, that was 2004, and I was doing consulting on Python and SciPy and all that stuff. And there was still a split between Numarray and Numeric. In fact, most of the libraries that were trying to build on top of the stuff, they built a compatibility layer called numerix, which then would flexibly import symbols from these different libraries depending on which you're trying to, it was terrible. The wild and woolly days of early Python. You know, it's a mess. Crowdsourced innovation is always a mess, but the result is still nice, because what happens is you end up getting somebody like Travis Oliphant who comes along

Starting point is 00:19:09 in 2005 and says, this is a mess and this is slowing down innovation, because everyone has to do the work twice. We've got to make it work with Numarray and with Numeric, and we can't make forward progress. So he spent a year of his life just coding and designing, and he made a really nice thing and he called it NumPy. And he came out with it in like the end-of-2005, 2006 time frame, and then the world rejoiced, and I was like, oh my God, this is great, this is the unification we needed. You know, at the SciPy conference in Pasadena the following year, we gave him an award. Anyway, that's what happened in the mid-2000s. And then many years later, in the 2010 time frame,

Starting point is 00:19:43 he actually joined the company I was at, Enthought. And we had many happy days there doing a lot of scientific computing consulting, which is fun for science nerds, but a niche area, right? But then we started getting contracts and consulting inquiries from hedge funds and from banks and investment banks and things like that. And by the end of the 2000s, I'm walking the floor of like J.P. Morgan, Bank of America, and they have thousands of people relying on SciPy and NumPy to run advanced models.

Starting point is 00:20:09 You had coders sitting next to traders, like on the energy desk, and you're like, this guy is asking me really deep questions about SciPy. He's really trying to do stuff with this. So I had this insight that I think Python is ready to go into the mainstream business analytics space. And it's not just MATLAB that it could be taking market share from, but maybe SAS. So at the same time, big data was starting to crest or peak. And I realized that people want to do more than just ask SQL questions of their big data. And in fact, when I went to the first Strata in 2011, all of the vendors on the show

Starting point is 00:20:40 floor were selling, many of them, flavors of Hadoop, SQL integrations, faster Hadoop, et cetera, but then when you go to the tutorials, every single data science tutorial was teaching Python and R. But there's no Python vendor. And also Python's kind of janky for some of the stuff. It doesn't play with Java very well. Python and R were both second-class citizens in the Hadoop world. So I said, you know, I think there's something here. And that's why I started the company. We started it as Continuum Analytics in 2012. And it was Python for business data analytics, Python for data science. That's what led to that. Anyway, that was a long sort of exposition on your question about the history of all of this, how this came around. But I think that when you talk

Starting point is 00:21:15 about software systems, it's actually very interesting. We build software systems thinking they're merely Lego bricks that we make relatively homogeneous or well-structured: studs are spaced this way, they're this big and this tall. And then we can stack them together and boom, now you have a bigger Lego. But in reality, when you look at any real software language in modern software systems, there's complexity to it more than the complication, and that's where your worst bugs lie. You know, like you have some npm module that pulls in some other crap,

Starting point is 00:21:43 and that interferes with some other crap, and it tries to install this other thing in your system, and now you have complexity beyond the complication. So I think the practice of software is bedeviled by the fact that it actually is playing at this point with so much complication that it basically appears complex to our human minds. Barbara Liskov has my favorite Turing Award acceptance speech ever, and if you haven't heard it, you have to hear it. And it's

Starting point is 00:22:06 basically about modularity in computer science. And it's how you can take big problems and make them small problems. Like, engineering with modularity, you can rein in complexity. So you have a complicated system, but I think you can actually manage the complexity. I'll give you an example on the data side where that's not the case. There are natural systems that are self-similar. By self-similar, it means that they retain the same stochastic properties no matter what zoom level. So unlike a software system, where if you've reduced it down to a method, you've got a fairly simple abstraction,

Starting point is 00:22:35 there are some natural systems like, say, coastlines, that it doesn't matter at what level you look at it. They still are like super complex. So one thesis is like, yes, software systems can be complex, but like they're more complicated in that you can modularize and focus on things. That's not necessarily the case with data. Data is as complex as the natural world. I mean, like, you don't have control over the weather and the weather is self-similar.

Starting point is 00:23:00 And no matter what zoom level you look at it, it still retains the same stochastic properties. It's like that with data. You don't necessarily have the tools to reduce the complexity to something that is merely complicated. Right. So the question then in the data practice world, let's just keep it at that level, which is I think a great place to be talking about it, is at which point do you stop? What is your optimization criterion, right? Because all engineering is a trade-off.
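The coastline point, and the "at which point do you stop" question that follows from it, can be made concrete with a toy calculation. The sketch below uses a Koch curve as a stand-in for a coastline; that choice is an assumption made for illustration (neither speaker names it), and the function names `koch_points` and `path_length` are hypothetical. Unlike weather or real coastlines, the Koch curve is deterministic, but it shares the relevant property: each extra level of zoom reveals the same kind of detail again, so the measured length keeps growing.

```python
import math


def koch_points(depth):
    """Vertices of a Koch curve over [0, 1] after `depth` refinements."""
    points = [(0.0, 0.0), (1.0, 0.0)]
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    for _ in range(depth):
        refined = [points[0]]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
            a = (x0 + dx, y0 + dy)          # end of the first third
            b = (x0 + 2 * dx, y0 + 2 * dy)  # start of the last third
            # Peak: the middle third rotated by 60 degrees about point a.
            peak = (a[0] + dx * cos60 - dy * sin60,
                    a[1] + dx * sin60 + dy * cos60)
            refined.extend([a, peak, b, (x1, y1)])
        points = refined
    return points


def path_length(points):
    """Total length of the polyline through `points`."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Each level of zoom multiplies the measured length by 4/3, without bound.
for depth in range(6):
    print(depth, round(path_length(koch_points(depth)), 4))
```

Because the length never converges, "how long is the coastline" has no answer until a resolution is chosen, which is the trade-off being described: the stopping point comes from the decision the measurement has to support, not from the data itself.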

Starting point is 00:23:23 So for the amount of effort you want to put in, how well do you need to understand the coastline? If you're trying to target a guided missile into a window of a building, you don't need to map the coastline down to a millimeter, right? And so forth. So I think that when you get to data, you recognize that ultimately, if you actually want to get all the value out of it, you've got to loop it around into the overall OODA loop of your business, the observe, orient, decide, act loop, and actually take action with it and correct and zoom in to the appropriate level. I think this is kind of what this all boils down to. So now the question is, let's say that you're building a company

Starting point is 00:24:01 that, instead of building a modular software system, has the goal of reining in the complexity of data, which we're seeing more and more companies do. What does it mean to deal with that much complexity? So what you just mentioned is, well, okay, maybe you look at the different zoom level or maybe you've got a full feedback system or whatever. But before we even get to how you do this, I would like you to either agree or disagree that the companies trying to rein in that complexity are different. I completely agree with that. The companies that actually understand even the problem

Starting point is 00:24:25 they need to solve, they have a better chance to solve the problem. Because it's actually very much like cloud computing. It used to be, how do I build the software on the basis of the computational resource I have access to? Well, once you have the ability to access essentially limitless computation, then you've got to ask, well, what is it I need to build? What do I really want to do, right? So I think with data it's a similar thing, where you say, well, for any arbitrary problem, you can put in more money and get more texture, more resolution in your predictions. Where do you stop? Right? Exactly. And if the stop is like, I can only convince the CEO to hire three data scientists, so that's where we stop, it's just what three scientists can do.

Starting point is 00:25:03 I think that's how a lot of people are winging it right now. But the interesting thing with the hedge funds, when you look at them, is that they understand this. Like some people say, you know what, we're not going to work at the microstructure level. We're just not going to do that. Because there's a few big players that play the high frequency stuff. We're going to leave that out. We're going to do bigger, longer term strategies. So they self-select into zones where they believe they have the observational capacity and connect that to execution

Starting point is 00:25:25 capacity. Again, it's about the OODA loop. They believe they can run a coherent loop. Data is important and all of that, but more important is keeping track of the model, because it's not just processing data anymore. At some point, it's also going to be modifying the systems that are then producing that data, right? It's a loop. And in the most effective companies, it has to be that the data processing is part of both the inference and the execution step, right? And one thing that was shocking to me, honestly, in the last 10 years I've been doing this, is that in so many businesses, big businesses, at the heart of a lot of really important parts of the business, the models are very old. They're very stale.

Starting point is 00:25:58 They iterate very slowly. And it's a massively human-intensive task, with VPs and PowerPoints and everything else, to get revs on models. And then you go to the hedge funds and it's like, no, we hire engineers. They come in and they code a lot of MATLAB, and they're trading 100 grand the first week, right? That's different. That's a very different view of the OODA loop. And, you know, I think in our Twitter exchange, this is where I said, you know, all companies are going to have to look like hedge funds because in a world

Starting point is 00:26:23 where you can have essentially unbounded observational capabilities, you could be a logistics startup, and you could basically get data as good as FedEx or anybody else doing logistics. You could be, you know, whatever. There's a great leveling of the field with regard to the sensory capabilities. There's a great leveler with regard to the cloud computing capabilities. You don't need to go hire 100 sysadmins to go and rack a bunch of servers. You can just turn on some things.

Starting point is 00:26:45 So with that being said, you can now have extremely low-footprint, fast-moving companies that are just there to run the OODA loop, and to have extremely explicit and intentional sensemaking around the modeling. And for them, data then, it's sort of like the difference between the way a fish sees water versus somebody holding water in a ladle, right? You don't even think about the data because you're just swimming in it, right? Obviously, you understand data.

Starting point is 00:27:08 Yeah, so this is like the silly VC observation. The silly VC observation is if you look at a software company that doesn't have to deal with the complexity of data, they tend to have relatively high margins, say 70 to 80%. And the reason is because they're building skyscrapers and then they sell those skyscrapers, and the team needed to build a skyscraper is relatively fixed, and then you can sell as many of those as you want. That's kind of the software model.

Starting point is 00:27:29 When we look at companies that are reining in the complexity of data, and that's how they extract value, the more people you put to rein in that data, the better your results are. And so now you're incented to have more and more people try and work on that data over time. So I think the structure of a hedge fund is we hire more people to work on the data,

Starting point is 00:27:51 we can potentially get more money just because they're actually reining in the complexity of that data. But in the software world, all of that complexity is basically going into the margins, yet depending on who the buyer is, you can't increase the top line in the same way. So let's say I'm going to sell five copies of my software, right? Now, if I sell five copies of my software, people are buying the software. They're not buying the results of the data. Maybe they'll like my software better because it's more accurate, or like it less because it's less accurate, but the number of people working on the data doesn't directly drive the amount of software

Starting point is 00:28:24 that gets built. And so now you have this existential margin issue, which is you want to increase the number of people working on the data, labeling it, cleaning it, because you can always get some improvement. Right. Here's the question. If we think about in the software space, you have software vendors and buyers. And the theory of a software vendor, again, going back on our history, there used to just be computer companies. And then Bill Gates was like, hey, stop pirating my crap, pay for my software. Because software's a thing. It's not just you long-haired hippies copying each other's Unix code.

Starting point is 00:28:57 Like software is a thing, right? And you need to pay me for it. The letter to hobbyists, 1970-whatever it was, or something like that. But he did that at the beginning of the PC era. And the PC era basically said, well, here's a set of standards. Here's x86, the x86 ISA. Here's ISA, here's EISA, the bus, and here are peripherals and networking and all the crap.

Starting point is 00:29:13 And so you have a set of standards in the space. Actually, there's this recent blog post that, I don't know if you wrote it, but you promoted, on the narrow waist of TCP/IP. And yeah, that was me and Ali. That's an old networking guy's look at crypto. The point is, you know, a lot of these things rhyme with each other.

Starting point is 00:29:31 When you have standards, what they do is they reduce the cost of innovation and they increase the innovation surface. The PC era was such a gigantic leveler that allowed the era of software to thrive. But again, Moses didn't have a third tablet that said there must be a software-hardware divide,

Starting point is 00:29:50 and that software must always have these kinds of margins. We're now entering into an era where people are considering the entire stack of what an information system is. And so when you look at that, there's no reason at all why, if I'm an end user customer buyer, X percentage of my alpha or my margin or surplus, if you want to talk about capital and all that stuff, why should this percentage of my surplus, broadly across all of these companies, accrue to just one software vendor? Because if I in-source and in-house the technology, and I have the FTEs,

Starting point is 00:30:25 all of the residual value stays within the boundaries of my firm. And this is what a hedge fund does. In fact, when I go and try to sell to hedge funds, they generally won't buy software. They use our open source. They like to get consulting services and ask questions. They're very high-end users of our open source stuff. But they basically say, why should I share anything? Like they'll buy a database, they'll buy some things that they perceive to be truly infrastructure and truly commodity.

Starting point is 00:30:47 Anything above that, if there's a chance of it contributing deeply in a generative way, not a decomposable way, but in a generative way to their alpha, they're going to keep it in-house. It's proprietary. I was out at dinner with the CTO of a hedge fund, and he was like, tell me why I should care about open source. I'm like, because they had an internal, like, crappy version of pandas. And I was trying to give them the story of like, look, if you just use pandas, you would basically leverage all of the, it's basically cost amortization of innovation for you, right? And it's not differentiating value for you to have your own little tabular data structure. People think that open source is winning or has won. I think the fact that open source is commoditizing all this stuff means that software itself,

Starting point is 00:31:24 the value chain is collapsing. And so right now, open source as a movement, I think, unfortunately, is confused. There's sort of the Stallman-esque religious aspect to it almost. And then there's something deeply beautiful about crowdsourced innovation and legit community collaborative innovation. That's really important. And we're almost losing that because everyone's like, oh, but open source has won now. I think that's a misread of the situation. It's the thing I keep tweeting about because I'm saddened by the loss of that thread of the principle. Why do we do open source? Why do we do crowdsourced innovation? So anyway, it's that conversation. I think software companies do look different because they have thrived in an era of relative... the substrate

Starting point is 00:32:00 they've sat on is pretty flat. And now we're entering a space where performance matters a great deal, where the information systems are integrated again. Software is only one component of a whole integrated information system. And because of that, now it's no longer like I can sell just one piece of software across a thousand companies and just harvest all of this margin. So here's my mental model on these things. Let's imagine that you have two companies, company A and company B. So company A, they're building a system, and all the properties of that system are going to be defined in software. And so they've got a roadmap and then they build the software over a period of time. That's company A. Let's say company B, let's say actually they're going to use just all off-the-shelf kind of AI/ML workflow

Starting point is 00:32:37 where they're not actually really writing software. It's all about getting the models to be predictive. And so the entire company is around cleaning data, labeling data, training the models, right? They're very, very different because the complexity of the second one is just far, far greater. And I would say defensibility of the second one is far, far greater just because of the nature of data. And so it feels to me there's almost like an emergence of a new type of company. Absolutely, yeah. where the organization, the margins, the go to market,

Starting point is 00:33:10 everything is being dictated by the fact that they're processing data rather than writing software primarily. I think we're all still trying to understand what that second class of company looks like. Yeah. One of my pitches is that by harnessing the power of open source to commoditize, to do the disruption on a lot of classical data processing systems, we would basically be one of the last great software companies

Starting point is 00:33:34 and be one of the first great AI companies. The margin doesn't come from how well you do the software bit. And so I think that's the big news. I mean, maybe I have a bit of a controversial view on this, but I think that the era of software being the dominant part of the stack, I know, you know, Marc Andreessen likes to say software's eating the world. It is eating the world, but it's a ruminant at this point, right? It's not the most efficient digester of the value.

Starting point is 00:33:55 And so, look, you benefit from chlorophyll, even though you're not a plant. You just eat a lot of plants. I think in the era of, I mean, if we want to kind of go to the complex systems thinking, right, in the era of data abundance, the people who can benefit, build models, refine models, and execute on them fastest are the ones that are going to win. They're the chaos agents in the ecosystem. So look, we still live in a world of plants, but there's a beautiful infographic I saw the other day, which is how much biomass is on the earth, and most of it is plants. And then you've got, like, this little bit is animals. And in that little bit,

Starting point is 00:34:23 there's like this little bit that is mammals, and then there's like this little bit that is humans. I think that in the world order to come, there's just still going to be, of course, hardware and software companies and so on and so forth. But I think the margins, where you really want to look for the growth, is going to be those people who are moving like animals. And not just claiming a spot: I'm going to sit here, you know, grow my leaves. You can still catch some sunlight. But your optionality, I mean, it is, you know, business is war. Your optionality is reduced.

Starting point is 00:34:47 And the companies that can move fastest among these different places, those are the animals. And that's going to be running faster OODA loops. I would love to talk about how this impacts the actual business. So I'm not sure there's a huge change on go-to-market, except for the fact that there are two types of these AI/ML companies. There's the infrastructure companies,

Starting point is 00:35:07 which basically build the tools to use AI/ML, and that looks like a standard software infrastructure company. It would be like a data company or something like that, to be a really point. And then there's those that use data science and AI/ML to tackle problems in the real world. And in those,

Starting point is 00:35:25 it's kind of interesting because you end up not building a software company, but more of a farming company or an agricultural company. And so you're not selling to core IT, right? And so they just tend to look very different than typical software problems because they're selling to a different constituency. They're not software problems. The software is a means to an end, not an end unto itself. And this is particularly germane to AI/ML because it allows us to solve problems that typically software hasn't been good at solving in the past. It allows us to solve vision problems better

Starting point is 00:35:54 than we've been able to do it before, audio processing problems better than we've been doing it before. It's kind of like the best way to interoperate with the physical world. And so now we're off building these companies that solve these kinds of real-world problems, and you just have different-looking companies to do that, because again, you're selling to the person that inspects the HVAC system, you're selling to the person that is the farmer, you're selling to the person that manages the forest. So I think one thing, at a very high level, that anybody creating a company in this space needs to think through is the following, which is if you're building just the infrastructure, just the tooling and the nuts and bolts, you look like

Starting point is 00:36:25 a software company and somebody else deals with the actual AI/ML application, and that's fine. But let's say that you yourself are ingesting the data, cleaning the data, labeling the data, there's a lot of variable cost to do that. Like every customer may have a new data set. And what happens is this impacts the margins of your business. Like it looks like you have lower margins because for every customer you've got all of this work to do. And so I think you need to make a decision early on whether you want to be the one that's doing that work, or whether that's something you can actually offload to the customer.

Starting point is 00:36:58 So let's say you go to a new customer and say, listen, we're going to take all your data, we're going to clean your data, we're going to create your models, and we're going to solve your problems. And in that case, you internalize all of that. And as far as your organization, you need to know that this is basically a services arm. Another option is you can say, customer, we're going to give you all these tools, but you're going to have to bring in your own data. You're going to have to hire people to label it. You're going to have to learn to tune your models, and we'll help you with all of that, but you're the one that's going to go ahead and sink that cost. So you have to think very deeply about how you structure your company relative to the variable headcount, like the headcount that has to grow per customer, because that seems to be the big difference that we see between these AI/ML companies and the typical software company. I think it's hard to do one of these companies right now because we are in a transitional time. A lot of the customers don't even know what they're asking for and they're kind of looking for that help. And even now, people, they recognize it's a growth area and where the future is headed. So they want to spend some money on it. But absolutely right.

Starting point is 00:37:55 the amount of work you have to do per customer starts looking a lot like a services play. And there's a reason why a lot of companies, when you really look inside the skeleton, like I think I called it the skeleton buried in the ARR, you see a lot. Totally. Eric von Hippel has a great book around democratizing innovation. And he says, even when we have a space in which a product is possible, products usually only cover 60 to 70% of the end user need. The end user still has to do.

Starting point is 00:38:22 And he's not talking about software. He's talking about people like, you know, welding things onto the side of their tractors. So, right? He's talking about in general, the customer has this thing they need to do. When it comes to AI/ML application areas, it's a lot more than just 30% that has to be customized per customer site. So I think for businesses right now in this transition, it's super hard not to end up looking, if you're doing a good job for your customers, it's hard not to look like you're doing a services play. Now, that being said, there are, I think, viable strategies through this,

Starting point is 00:38:50 which is that you can specialize in an area and domain. and say, look, we're going to come in and work on your data set, but we have our own reference model we've built. It's exactly right. Right. And now we can benchmark you against that. We can bring some of our own magic juice into this. So now the thing that is generalizable across or productizable across a thing, maybe it's

Starting point is 00:39:09 only for that sector, but the thing that's generalizable is not just the software. It's actually more defensible than the software. I just want to very quickly put a fine point on this. There's two things that you brought up that are very important to realize. The first one is we are in a transition. So customers don't even know what it means to, like, label data and clean data. Maybe in five years you can go to a customer and say, we've got all the tooling for you, but you're responsible for managing the data,

Starting point is 00:39:31 and therefore you offload the cost. It's just today, you just don't have enough education in the market to do that. They don't have data scientists, et cetera, et cetera. And so I think in order to get the market into the transition, the startups have to do that. You have to build that, basically, services arm. The second point you made, which I think is the critical one, is that there actually is some commonality in verticals. And so you can reduce that margin by sharing as much as possible. But it does require customers to share data or at least share models. And that's sometimes a tough conversation with the customers. Well, it's not just sharing models. I mean, there are deeper and interesting,

Starting point is 00:40:02 more leveraged plays to be made. For instance, you go into a sector and you realize, oh, all of these people are doing their own craptacular things because of their limited budgets and their data sets are broken in this way. But holy crap, there's this other vendor over here with this data set. I can go and negotiate an exclusivity with that vendor. And now I'm the only one that can bring that kind of model lift, you know, into this particular sector. So there's a lot of that 1800s style like homesteading to be done in this space. So I think it's more than just the, let me average out central limit theorem, everybody in this industry. There's some really cool things to be done.
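Editor's note: the per-customer cost argument in this exchange (data work that has to be redone for each new customer eats into margin, while a fixed team building software does not) can be made concrete with a little arithmetic. The following is only a minimal sketch and is not something the speakers presented: the function name `margin` and every dollar figure are hypothetical, chosen purely for illustration.

```python
def margin(price, customers, fixed_cost, cost_per_customer):
    """Fraction of revenue left after fixed and per-customer costs (hypothetical model)."""
    revenue = price * customers
    total_cost = fixed_cost + cost_per_customer * customers
    return (revenue - total_cost) / revenue


# All figures below are made up for illustration.
PRICE, FIXED = 100, 1_000

for n in (100, 1_000):
    # Software company: almost no extra work per customer.
    software = margin(PRICE, n, FIXED, cost_per_customer=10)
    # Data-heavy company: cleaning and labeling repeated for each customer.
    data_heavy = margin(PRICE, n, FIXED, cost_per_customer=60)
    print(f"{n} customers: software {software:.0%}, data-heavy {data_heavy:.0%}")
```

Under these assumed numbers the software case sits at 80% with 100 customers and climbs toward 90% as it scales, while the data-heavy case stays below 40% no matter how many customers are added, because the per-customer cost never amortizes.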

Starting point is 00:40:35 So the first thing companies need to figure out is what type of a company they are. Many are very confused about that. You need to know, are you a software company and you're building tooling, or are you a company where the majority of the complexity of the company is around data? And by the way, many companies start as software companies and end up as data companies, and then they've structured things incorrectly. So let's say that you've come to the answer and you've figured out you're a data company. Once that happens, you need to understand that often, for companies that are extracting value from data, there's a lot of complexity per customer in order to do that. And you need to structure your company the correct way, which is like, just realize it may be hard to scale,

Starting point is 00:41:08 just realize you're going to have different processes around the actual data or come up with a strategy to offload that to the customer. Now, the reality is because the market is so immature, it's unlikely the customer is going to be able to do a lot of that, but it's something that you can over time train the market to do and do that transition. But I think this is the big sticking point with many founders. They think they're software companies. They end up being data companies. They didn't build the organizations to deal with that internal complexity.

Starting point is 00:41:33 It's coming down in the margins. Everybody's kind of confused. So I think just a little bit of self-awareness and a little bit of planning go a really long way in this space. But it requires a very different. Many West Coast firms have the thesis that to do a really great tech startup, you need at least a tech founder somewhere in there because they kind of see where things are going. For a really good AI startup, you need to have machine learning people at that leadership

Starting point is 00:41:58 level because they know what it means. They know why a single data set can be a billion dollars or swing a billion dollar deal. The difference between a software engineer and, like, a data scientist is that in software you generally know what the inputs are, or the types of inputs, and your goal is to construct a system that given these inputs produces these sets of outputs. So you have very nice, clean definitions around correctness for the most part. With data science, there's unfortunately not that. You can have a piece of code and for some sets of values, it's correct. For other sets of values, it still produces a result, but those results are wrong. And a function's correctness is dependent on values. This is the key thing that differentiates all of data science and machine learning

Starting point is 00:42:32 from classical software engineering. Classical software engineering is like, we've got our test data set, we've got our prod data set. It works in tests, it's going to work in prod, right? That's not how data science and machine learning work at all. In data science and machine learning, the correctness of a function is value dependent and also performance dependent. And the performance is also value dependent. So now you have this intertwined synthesis of a data

Starting point is 00:42:52 and a modeling and a computation problem that cannot be decomposed into orthogonal vectors, right? That's the difficulty of this. What I think is that in five, ten years' time, every company that is actually still in existence and doing well has to have essentially brought in a synthesis of their data capacity, their data modeling capacity, the model build and computation, harnessing appropriate computation in an economical fashion to suit their needs. So the word I like to use for this is cybernetics. I mean, we are right now in between the software era and the cybernetic era,

Starting point is 00:43:27 and I think we will get to a cybernetic future. And cybernetic, by the way, you know, it comes from the same word as Kubernetes, right? It means governor. It means a theory of action and control. So businesses have to see computation really moving its way up. The data modeling process has to move all the way up to the very tippy top of the business. That synthesis will happen. It will have to happen.

Starting point is 00:43:47 That's what the selection pressure is in the business world. I don't know exactly the path it will take to get there. In the transitional time, businesses who want to get in ahead of the curve have got to have very clear thinking at the leadership level, and they must have a very clear understanding with their investors about what they're going to look like as they chase the marlin, because it's going to take a little while. So I think that's the trick right now, is that you've got to find founding teams or leadership teams that have a solid understanding of software, of what software is and isn't, of where the value is in the software activity,

Starting point is 00:44:18 and of where the value is in the data and data modeling activities. In a time of fog, you've got to have very, very clear-headed thinking about that sort of thing. But ultimately, that synthesis must be what comes. Thank you. Thank you so much.
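
Editor's note: the claim made near the end of the conversation, that in data work a function's correctness depends on the values it is fed and that it can still return a result when that result is wrong, can be illustrated with a small numerical sketch. This example is an assumption of the editor and was not discussed by the speakers: `naive_variance` is a hypothetical helper using the textbook formula E[x^2] - (E[x])^2, and it is compared against `statistics.pvariance` from the Python standard library.

```python
import statistics


def naive_variance(xs):
    """Population variance via E[x^2] - (E[x])^2: fine on paper, fragile in floating point."""
    n = len(xs)
    mean = sum(xs) / n
    mean_of_squares = sum(x * x for x in xs) / n
    return mean_of_squares - mean * mean


small = [1.0, 2.0, 3.0]
large = [1e9 + 1, 1e9 + 2, 1e9 + 3]  # same spread, shifted by a billion

# The true population variance is 2/3 for both data sets.
for name, xs in (("small", small), ("large", large)):
    print(name, naive_variance(xs), statistics.pvariance(xs))
```

The same code passes on the small values and silently returns a badly wrong number on the large ones, with no exception raised, because the squares are too big for double precision to keep the digits that matter. A test set that only contains the first kind of value would never reveal the problem.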
