a16z Podcast - Reining in Complexity: Data Science & Future of AI/ML Businesses

Starting point is 00:00:00 Hi, everyone. Welcome to the a16z Podcast. I'm Sonal. For this week's episode, we have one of our hallway-style conversations, and this one is literally like eavesdropping on a debate and discussion that actually started as a Twitter thread, all around the question of whether and how data and AI/ML (machine learning) companies are different from software companies, and what that means for the future of software businesses. Our guest even questions our view of software eating the world, or rather asks: when software is everywhere, what comes next? Our guest is Peter Wang, the co-founder and CEO of Anaconda, who also leads their open source and community innovation group, as well as created the PyData community and conferences,

Starting point is 00:00:47 and has devoted a lot of time and energy to growing the data science community there. And he's in conversation with a16z general partner Martin Casado, who's written a lot about the evolution of software businesses, the new age of data, and especially AI/ML economics. You can find those pieces at a16z.com/mleconomics. The two dive into a number of themes throughout this conversation, ranging from open source and crowdsourced innovation and the messy ways that innovation really plays out, to what it means when you move from hardware to software to data and AI/ML, abstracting something that is not just complicated but actually complex. And then they touch briefly on what it means practically in building a new type of company, as well as the

Starting point is 00:01:30 evolving role of data scientists. But the conversation begins with their shared vantage points in coming from physics, which is relevant here since these new kinds of businesses and products involve a process of experimenting much as with physics. Both you and I come from the physics, computational physics background, and we've both kind of been pushed into this data, AI, ML, data science world. And I don't know if that is coincidence or if we have an affinity for that. Before we get into that, though, there's kind of a competing view of the world, which basically says SQL can do everything. And it's funny, we spend a lot of time actually looking at the data science or the data landscape.

Starting point is 00:02:09 And it feels like there's two kinds of worlds. There's the data warehouse maximalists, which is like, we'll stick all data in the data warehouse. And then we're going to do SQL. And then we're going to have some extensions to SQL like you see popping up in BigQuery or whatever. And that can do everything that needs to be done. And, oh, by the way, if someone's using Python and R, all they're really doing is basic regressions, and so we can just make that a simple extension, and we're done.

Starting point is 00:02:31 And then there's the other view of the world, which I like to call the Hadoop refugees, which is like, actually, we do hardcore computation, and we need R and Python because this stuff we do is very sophisticated. I mean, I know you're squarely on one side of those, but I wonder, like, do you think there's a convergence that happens? Do these stay two worlds? Does one become irrelevant?

Starting point is 00:02:50 Like, what happens there? Just because you oppose extremism doesn't make you an extremist, right? I would say data warehouse maximalists, or extremists. And I see a heterogeneous world. And it's the old yarn about, I guess, I don't know, there's so many variants of this. But Alan Perlis, a great computer scientist, has some really great, irreverent quotes about these kinds of things. But I would say that, to the idea that everything can be expressed in SQL, it's like, which SQL, with how many extensions? Because at the end of the day, with how many extensions upon extensions, and Multicorn on your Postgres actually running a Python

Starting point is 00:03:22 kernel. Yeah, I guess you're doing SQL, but you're running a Python script, you know, so that doesn't really count. And frankly, a lot of stuff runs on Access and VBA in this world. VBA isn't SQL. I think if you choose to look at the world through a particular lens, you can choose to count everything else as residuals and rounding errors. But if you take off those lenses, you see a much more diverse landscape. And I think that's where, for me, I see the space for SQL. And I understand the reasons why it has evolved into a particular kind of animal. Like, the shark is still the best predatory fish in the ocean, but it's not the biggest predator in the world. And I think there's something about that, that if you're in the ocean,

Starting point is 00:03:56 you're going to basically be shark-like if you're going to eat a lot of fish. So if you're in that business data analytics world, especially because a lot of business data looks like fish, it's evolved to look like food for the sharks. So that's kind of the way it is. But what Hadoop opened up, back in 2012, I called it the Hadoop battering ram. I said, listen, we're not going to win the Hadoop game. We'll let the Hadoop vendors go and fight against the Teradatas and the Oracles and the classical data warehouse guys, let them do that thing. Once they've battered down the door, we're going to come flooding in with all sorts of heterogeneous approaches to data science, data analytics, things that are hard to ask in SQL. And moreover, there's a term I use,

Starting point is 00:04:32 which I don't hear used very often. Now, obviously, you've heard the term shadow IT, which is used quite a bit, but there's a shadow data management. That's a far, far more insidious and dangerous problem. When I was at a large investment bank, they had a million-dollar Oracle database sitting somewhere and it was too slow to actually run the analytics they needed. So they had an instance of this Oracle database that cost a million bucks, and the only query they ran was a full table dump into a CSV. And then they took that CSV and they did everything else with it. And it was Python scripts. It was some, you know, random Java crap. It was a bunch of other stuff. And it was sort of like, so if you're a data manager, if you're like in the data management

Starting point is 00:05:10 practice, you say, wow, we just have another big old million dollar instance stood up. Our data management techniques are great. It's a, what do you call it, a Potemkin village, I guess, right? But then when you actually go and you ask the developers, hey, where's the source data for this stuff? Where's prod data coming from? Like, oh, yeah, this file share, backslash backslash something or the other, or, you know, that file. I'm like, that file? What about that database? Don't touch the database. It's too brittle, right? So there's this kind of stuff going on, and everybody listening knows what I'm talking about. That shadow data management is absolutely a pernicious problem, and data science is just eating it alive. Because to ask the question you want to ask,

Starting point is 00:05:46 you have to integrate data sets together. Master data management is about siloization, normalization, and all this kind of stuff. You've hit the segue, which I just think is so germane to what we're here to talk about, which is there's clearly problem domains which SQL is totally fine for, right? Yep. And you can argue there's problem domains which it's just not good for. I mean, any sort of hardcore statistics it's just not very good for. And the point of us being on this podcast is actually to talk about, okay, listen, we're seeing kind of new types of companies and new types of workloads, and they're around kind of processing data. And I totally hear you that this shadow data management is a real issue. And you can make an argument that why that exists is not because

Starting point is 00:06:24 people are stupid or they don't know how to do good workflows. It's like literally we don't have the tooling to deal with data in the right way. One macro question that I have that I would love to hash out with you is, are we seeing a fundamental shift in workload that requires a fundamentally new set of tools and a fundamentally new type of company, or is this just more of a transition where we can kind of put existing tools into service? And I just want to be a little bit more specific, which is in the past, you had your toolkit of systems approaches and you had a software system, and you'd kind of pull them out and apply them to the problem, and SQL's one of them.

Starting point is 00:07:01 And we kind of understood how those software systems behaved, and we kind of understood how the companies built around them behaved. You know, as an investor looking at a lot of data companies, they just don't look the same. The types of tools they use, the type of operational practice they use, and the one that you pointed out was a great one, which is now data becomes a primitive that you want to actually apply software techniques to, in a way, but we don't have the tools to do that. And then we've written posts about how margin structures look a lot different, and the way you build your company is different. And so, do you think this mess is because data scientists don't have formal CS training, or do you think this is an entirely different problem domain and we should actually look at what the future looks like for that, develop new tools, et cetera? This is like the heart of what we're talking about. This is absolutely the heart. And I will try to start from the top, which is this concept that every

Starting point is 00:07:43 baby, every child, is born and, as they're being raised, they think their childhood is normal, right? They think of their childhood as the normal thing. So we have developers coming online in the late 2000s, let's say, and they think this is the world. Even me as a professional starting in 99, right? It's like, well, this is just what there is. The more you start researching history and looking back, you're like, you know what, in this industry, we just layer frozen accident on top of frozen accident on top of frozen accident. Very, very few times do people make principled, intentional, revolutionary shifts, right? It's basically band-aids as substrate. Okay, so starting from the top, what I would say is that

Starting point is 00:08:20 there is no law there was nothing carved in stone that Moses brought down from the mountain that said all information systems must be deconstructed into hardware and software and data there's no such thing it was information systems full stop The fact that we had different cost structures for innovation in hardware versus software versus networking and so forth, that has led to different rates of innovation, different paces, things like that. And so when a business steps in and says, okay, what's on the shelf that I can use to accelerate my business processes, then it makes sense because this thing, that thing,

Starting point is 00:08:55 the other. Like when you buy a car, you buy the car and then you put CDs in the car. You don't go buy a car with a CD pre-spec'd, right? Is there the exception of technical innovation in certain areas? So, for example, we now know how to build systems that extract very useful information out of data pretty simply, which didn't really work in the late 90s. Like, I remember the whole first, you know, neural network, genetic programming... Oh, yeah, right, right, yep. ...miasma of the late 90s. I did a number of projects on that, and they didn't really work. They actually work now. So you could also argue that the technical landscape has changed. Has that just been a macroeconomic issue for the company? Yeah, I mean, ornithopters work if you can flap hard enough, right? It doesn't necessarily mean it's the right architecture. And it depends on the density of air. Ornithopters might work great on Mars, but not on Earth, right? Propellers work better on Earth, right? Well, with internal combustion engines, et cetera, et cetera.

Starting point is 00:09:41 But the point is that, yes, you're right. I guess my point could be stated thusly: there is a multidimensional optimization surface we should be thinking about, not just the optimization surface of software or data architecture, data management, and, you know, things like that. I mean, as someone who did software-defined networking, you know that better than anybody. But here's what's interesting to me,

Starting point is 00:10:01 is if you build a hardware company, the tools you use, the money that you need to raise, the innovation pace is defined one way. And if you do a software company, it's actually defined quite differently. Although you still use a lot of the same practices, it still is engineering, you can still modularize. It's not clear to me that as soon as you move to data, you're in the same domain. Software to me feels like an engineering problem that you can modularize, you can build interfaces, you're building it from the ground up, you control all the primitives. Data feels like science. It's like you're trying to rein in the complexity of the physical world, right? It's one thing to build a house. Building a very complex building is very hard. We had to do

Starting point is 00:10:46 all this design practice and the other, but, you know, we got the skyscraper. That's very different than understanding the cosmos. Because the cosmos is so complex and you don't understand what it is and you don't have a blueprint. And data companies are defining the cosmos more than building the skyscraper. You hit it on the head. I'll just back up and comment on one thing relative to the hardware and software. Hardware is frozen software to some extent, but, how to put it, because hardware is expensive and slow, and has been, at least historically, the industry has a much more robust view towards standards. Now, here's the thing. Because you have standards, now you have a binary, bullshit-proof, does it work or does it not work kind of thing. Okay? That then reflects and changes what you need to do. What that does is make mistakes in hardware expensive, because there is an intersubjective reality beyond any particular vendor

Starting point is 00:11:37 about what is a mistake. In software, because it moves so fast, it's moving too fast to build specs, hard specs, and say, did you meet this performance spec you said you were going to meet? No one cares about that. Software is just so fast and loose. It's like jazz. I mean, because it moves fast and you can't put that thing in, the price of making a mistake in software is almost completely subsumed or lost. And so it's cheap to make mistakes in software because the cost is invisible. 100%. However, the actual engineering practices aren't that different. I mean, you're absolutely right, formal verification is much more important in hardware, but it still

Starting point is 00:12:13 feels like engineering to me. You know exactly where you're going. You have a roadmap. You build an engineering team around that. Data is different. Data is different. You don't have a roadmap. Like, it is the universe that you're trying to ingest and extract insight out of. This is the exact critique. You're absolutely right. When you talk about what you do in software and hardware companies, you are trying to manage complexity for the most part. I figure every kind of engineering is trying to achieve some kind of lift while fighting some kind of drag, right? And in the case of software and hardware engineering, usually it's achieving performance or something like that, or some scale of computation, while minimizing complexity

Starting point is 00:12:49 and having manageable errors and things like that. Okay, so that's those things. But it's very goal oriented. Yeah, building to a goal. It's one thing to say, like, I'm going to build this complex system, for which you can basically do mockups of the destination. That's very different than saying, extract insight out of this. That's right. The great John Tukey said there's two kinds of data analysis. There's confirmatory, kind of reporting mode, and then there's exploratory. And the thing you're talking about, the reason why data smells and data practices smell like science, is because there is no such thing as data. All data is just frozen models. Every single data set comes from a sensor. Even a picture. Everyone thinks, oh, well, I took a picture,

Starting point is 00:13:27 that's just raw data. No, it's not. There's a Bayer matrix. There's a log transform. There's a gamma correction. And fundamentally, there's an exposure time, which is a temporal sampling domain. So there's all of these things. There is no such thing as data. There's just frozen models. And where businesses get screwed up is when they treat data management as sort of this goal-oriented siloization: it's a static artifact and it is artifact management. It's almost like a sort of ad hoc library process. And that's not the same as the kind of data thinking, or the way you think about data, in an ML/AI sort of world. Because in that world, we see that models and data are both fluid. It's much more, not to get too metaphysical, but it's more of a process-oriented

Starting point is 00:14:10 metaphysics. It's much more temporal oriented than the static views that current data management practice has. And that's why the, I think, the SQL database extremists are not going to win this particular round. So I'm a systems guy, right? I did my PhD in computer systems. In systems, we have five tricks. It's like virtualization, caching. We literally have five or six tricks that we throw at every single problem. And you can build amazingly complex systems with these things. Like, you know, we understand distribution. We understand consensus. And so while a piece of software like Google is very complex, it actually can be reduced into sub-problems. And, you know, we understand problems that we know answers to and then, you know, we can build it. So I would say like the relative

Starting point is 00:14:55 complexity, the relative entropy of a software system is finite. Right. It's not clear to me, if you're trying to use data to run a system, that the entropy is as finite. Well, yeah. We don't control nature. I mean, what do we use data for? We use data for pricing. We use data for fraud detection. We use data for calculating wait times. Okay, so what are the inputs for these things? It's like people's behavior. Like, there's so much entropy in all of us. It's like the weather. It's hugely lossy, right? Well, it's these classically chaotic, high-entropy systems. And so one of my theses, and I'd just love to test this on you, is that building a software system is a relatively low-entropy exercise because you're dealing with primitives that you understand

Starting point is 00:15:42 and you're engineering it. Whereas when you're actually trying to deal with data, you're reining in so much entropy and you're trying to extract from it. That ultimately is why we end up with different companies, because it's just much, much harder to deal with that much complexity. Yeah. Well, that makes a lot of sense. And, you know, the Cynefin framework talks about the difference between complex and complicated and chaotic, right? Yes, yes, yes, yes, sure. Right. And so complicated, and I think the pithiest way to say this is, complicated means that you can take it apart, understand the bits, and put it back together again. Complex means that you cannot do that, right? So a fine Swiss watch is complicated, a cockroach is complex. And so I think when you talk about computer systems,

Starting point is 00:16:22 I'm not a systems guy like you are, but one of the best things I've heard about it is that, what is the quote? Everyone thinks distributed computing is about space, but really it's about time. What is the time horizon in which we can define a unit of atomicity? What is the time to coherence, right, et cetera, et cetera. And so it's always a space-time tradeoff. And I'm sorry I take this so much into the physics world, but I see it that way because it's a natural reflex for me. In fact, I wanted to major in computer science, but my dad, who was a physicist, he said, look, son, if you go into computer science, you're going to become a programmer, and you're just going to build tools. If you're a scientist, though, you're going to be the one using those tools to make an impact. So I majored in physics. But then as soon as I got out of physics, it was 99. I'm like, all my friends are getting starting bonuses and they're getting jobs, and they're worse programmers than me. And so I ended up joining a computer graphics startup. And that's when I started using Python, in 99. And I realized that I could script a bunch of C++ much better with Python than with broken template support in Visual Studio. It was god-awful.

Starting point is 00:17:23 I came to networking by way of computational physics. When I was a computational physicist, I was a computer scientist doing computational simulation at Lawrence Livermore National Lab. That's my first job after undergrad. I was a huge Numeric user because that was the only way to do high performance computing in Python. And from what I understand, that became NumPy. I would love it if you would kind of give the history of that project. In 99, it was Jim Hugunin, and I think there's some others that I might be forgetting, who can be credited with working on some of the early matrix stuff,

Starting point is 00:17:52 and then Jim Hugunin worked on Numeric, and they realized that the operator overloading in Python would allow you to do something that looked a bit like MATLAB, you know, like it's like, hey, it looks like you write vector code, and it's like, hey, this hack kind of works. And also Python's C-level extensibility meant that they could build a little tight C library that would be fast. So you're writing the scripting thing, the syntax looked like MATLAB, but it ran at basically C speed, which is really important.
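To make the operator-overloading point concrete, here is a minimal sketch in plain Python. It is only an illustration of the language mechanism: the `Vec` class and its `_coerce` helper are hypothetical names invented for this example, not Numeric's or NumPy's actual implementation, and the arithmetic here runs in interpreted Python rather than in a compiled C library.

```python
class Vec:
    """Toy vector type (hypothetical; not Numeric's or NumPy's real implementation)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def _coerce(self, other):
        # Accept another Vec of equal length, or broadcast a plain number.
        if isinstance(other, Vec):
            if len(other.values) != len(self.values):
                raise ValueError("length mismatch")
            return other.values
        return [float(other)] * len(self.values)

    def __add__(self, other):
        # Called for `self + other`: element-wise addition.
        return Vec(a + b for a, b in zip(self.values, self._coerce(other)))

    def __mul__(self, other):
        # Called for `self * other`: element-wise multiplication.
        return Vec(a * b for a, b in zip(self.values, self._coerce(other)))

    # Reflected versions handle `2 * x` and `1 + x`, where the number is on the left.
    __radd__ = __add__
    __rmul__ = __mul__

    def __repr__(self):
        return f"Vec({self.values})"


x = Vec([1, 2, 3])
y = Vec([10, 20, 30])
z = 2 * x + y  # reads like vector code; no explicit loop at the call site
print(z)
```

The call site `2 * x + y` is what "looks like you write vector code". In a real array library the loops hidden inside `__add__` and `__mul__` would be delegated to compiled code, which is where the C-level speed described above comes from.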

Starting point is 00:18:18 So then it turns out, though, that some of the features they built, the Space Telescope Science Institute folks, the ones who run the Hubble Telescope, they had some other ideas about what they wanted to do with this library. And Numeric wasn't quite flexible enough or some other stuff, so they created an alternative matrix library called Numarray. And Numarray had, like, fancy indexing, Numarray had a few other things.

Starting point is 00:18:37 And so the ecosystem in the early 2000s, when I got my first paid job doing Python, it was 2004, and I was doing consulting on Python and SciPy and all that stuff, and there was still a split between Numarray and Numeric. In fact, most of the libraries that were trying to build on top of this stuff, they built a compatibility layer called numerix, which then would flexibly import symbols from these different libraries depending on which one you were trying to use. It was terrible.

Starting point is 00:18:59 The wild and woolly days of early Python. You know, it's a mess. Crowdsourced innovation is always a mess, but the result is still nice, because what happens is you end up getting somebody like Travis Oliphant, who comes along in 2005 and says, this is a mess, and this is slowing down innovation, because everyone has to do the work twice. We've got to make it work with Numarray and with Numeric, and we can't make forward progress.

Starting point is 00:19:18 So he spent a year of his life just coding and designing, and he made a really nice thing, and he called it NumPy, and he came out with it in like the end of 2005, 2006 time frame, and then the world rejoiced. And everyone was like, oh, my God, this is great. This is the unification we needed. You know, at the SciPy conference in Pasadena the following year, we gave him an award. Anyway, that's what happened in the mid-2000s. And then many years later, in the 2010 time frame, he actually joined the company I was at, Enthought.

Starting point is 00:19:46 And we had many happy days there doing a lot of scientific computing consulting, which is fun for science nerds, but a niche area, right? But then we started getting contracts and consulting inquiries from hedge funds and from banks and investment banks and things like that. And by the end of the 2000s, I'm walking onto the floor of like JP Morgan, Bank of America, and they have thousands of people relying on SciPy and NumPy to run advanced models, coders sitting next to traders, like on the energy desk. And you're like, this guy is asking me really deep questions about SciPy. He's really trying to do stuff with this. So I had this

Starting point is 00:20:17 insight that I think Python is ready to go into the mainstream, like the business analytics space. And it's not just MATLAB that it could be taking market share from, but maybe SAS. So at the same time, big data was starting to crest at that time, pre-peak. And I realized that people want to do more than just ask SQL questions of their big data. And in fact, when I went to the first Strata, in 2011, all of the vendors on the show floor were selling many different flavors of Hadoop, SQL integrations, faster Hadoop, et cetera, et cetera. But then when you go to the tutorials, every single data science tutorial was teaching Python and R.

Starting point is 00:20:49 But there's no Python vendor. And also, Python's kind of janky for some of the stuff. It doesn't play with Java very well. Python and R were both second-class citizens in the Hadoop world. So I said, you know, I think there's something here. And that's why I started the company. We started as Continuum Analytics in 2012. And it was Python for Business Data Analytics, Python for Data Analytics, Python for

Starting point is 00:21:07 data science. That's what led to that. Anyway, that was a long sort of exposition. But to your question about the history of all of this, that's how this came around. But I think that when you talk about software systems, it's actually very interesting. We build software systems thinking they're merely Lego bricks that we make relatively homogeneous or well-structured: studs are spaced this way, they're this big and this tall. And then we can stack them together and boom, now you have a bigger Lego. But in reality, when you look at any real software language in modern software systems, there's complexity to it more than the complication, and that's where your worst bugs lie. You know, like you have some npm module that pulls in some other crap, and that interferes with

Starting point is 00:21:44 some other crap, and it tries to install this other thing in your system, and now you have complexity beyond the complication. So I think the practice of software is bedeviled by the fact that it actually is playing at this point with so much complication that it basically appears complex to our human minds. Barbara Liskov has my favorite Turing Award acceptance speech ever, and if you haven't heard it, you have to hear it. And it's basically about modularity in computer science. And it's how you can take big problems and make them small problems. Like, engineering with modularity, you can rein in complexity.

Starting point is 00:22:16 So you have a complicated system, but I think you can actually manage the complexity. I'll give you an example on the data side where that's not the case. There are natural systems that are self-similar. By self-similar, it means that they retain the same stochastic properties no matter what zoom level. So unlike a software system, if you've reduced it down to a method, you've got a fairly simple abstraction. There are some natural systems like, say, coastlines, that it doesn't matter at what level you look at.

Starting point is 00:22:40 They still are like super complex. So one thesis is like, yes, software systems can be complex, but they're more complicated in that you can modularize and focus on things. That's not necessarily the case with data. Data is as complex as the natural world. I mean, you don't have control over the weather, and the weather is self-similar. And no matter what zoom level you look at it,

Starting point is 00:23:02 it still retains the same stochastic properties. With data, you don't necessarily have the tools to reduce the complexity to something that is merely complicated. Right. So the question then in the data practice world, let's just keep it at that level, which I think is a great place to be talking about it, is at which point do you stop? What is your optimization criterion, right? Because all engineering is a trade-off.
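The coastline point, that detail does not run out as you zoom in, and the follow-on question of where to stop can be sketched numerically. The example below is an illustration under stated assumptions: it uses the Koch curve, a deterministic self-similar curve, as a stand-in for a coastline (real coastlines and weather are stochastic, which this does not model), and the function names `koch_points` and `polyline_length` are hypothetical, chosen for this sketch.

```python
import math


def koch_points(level):
    """Vertices of a Koch curve from (0, 0) to (1, 0) after `level` refinements."""
    points = [(0.0, 0.0), (1.0, 0.0)]
    cos60, sin60 = 0.5, math.sqrt(3) / 2
    for _ in range(level):
        refined = [points[0]]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            # One third of the segment, as a vector.
            dx, dy = (x1 - x0) / 3.0, (y1 - y0) / 3.0
            a = (x0 + dx, y0 + dy)
            b = (x0 + 2 * dx, y0 + 2 * dy)
            # Peak of the bump: the middle third rotated by 60 degrees about a.
            peak = (a[0] + dx * cos60 - dy * sin60,
                    a[1] + dx * sin60 + dy * cos60)
            refined.extend([a, peak, b, (x1, y1)])
        points = refined
    return points


def polyline_length(points):
    """Total length of the polyline through the given vertices."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# Each refinement is a finer "zoom level"; the measured length keeps growing.
lengths = [polyline_length(koch_points(n)) for n in range(6)]
for n, length in enumerate(lengths):
    print(f"zoom level {n}: measured length {length:.4f}")
```

Each extra level multiplies the measured length by 4/3, so there is no resolution at which the measurement settles. In this toy setting, choosing the level is the optimization criterion: it has to come from what the measurement is for, not from the curve itself.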

Starting point is 00:23:23 So for the amount of effort you want to put in, how well do you need to understand that coastline? If you're trying to target a guided missile into a window of a building, you don't need to map the coastline down to a millimeter, right? And so forth. So I think that when you get to data, you recognize that ultimately, if you actually want to get all the value out of it, you've got to loop it around into the overall OODA loop of your business, the observe, orient, decide, act loop, and actually take action with it and correct and zoom into the appropriate level. I think this is kind of what this all boils down to. So now the question is, let's say that you're building a company where, instead of building a modular software system, the goal of the company is reining in the complexity of data, which we're seeing more and more companies do. What does it mean to deal with that much complexity? So what you just mentioned is, well, okay, maybe you look at the different zoom levels, or maybe you've got a full feedback system or whatever.

Starting point is 00:24:15 But before we even get to how you do this, I would like to either agree or disagree that the companies trying to rein in that complexity are different. I completely agree with that. The companies that actually understand even the problem they need to solve, they have a better chance to solve the problem. because it's actually very much like cloud computing. It used to be, how do I build the software on the basis of the computational resource I have access to? Well, once you have ability to access essentially limitless computation,

Starting point is 00:24:39 what I really want to do, right? So I think with data, it's a similar thing where you say, well, for any arbitrary amount, you can put in more money and get more texture and more resolution in your predictions. Exactly. Where do you stop? Right. Exactly. And if the stop is, like, I can only convince the CEO to hire three data scientists, so that's where we stop.

Starting point is 00:25:01 It's just what three scientists can do. I think that's how a lot of people are winging it right now. But the interesting thing with the hedge funds, you look at them, is they understand this. Like some people say, you know what, we're not going to work at the microstructure level. We're just not going to do that. Because there's a few big players that play the high frequency stuff. We're going to leave that out. We're going to do kind of longer-term stuff and do bigger strategies, you know, longer-term strategies.

Starting point is 00:25:19 So they self-select into zones where they believe they have the observational capacity and can connect that to execution capacity. Again, it's about the OODA loop. They believe they can run a coherent loop. Data is important and all of that, but more important is keeping track of the model, because it's not just processing data anymore. At some point, it's also going to be modifying the systems

Starting point is 00:25:38 that are then producing that data, right? It's a loop. And in the most effective companies, the data processing has to be part of both the inference and the execution step, right? And one thing that was shocking to me, honestly, in the last 10 years I've been doing this: in so many businesses, big businesses,

Starting point is 00:25:53 at the heart of a lot of really important parts of the business, the models are very old. They're very stale. They iterate very slowly. And it's a massively human-intensive task with VPs and PowerPoints and everything else to get revs on models. And then you go to the hedge funds and it's like, no, we hire engineers. They come and they code a lot of MATLAB, and they're trading 100 grand the first week, right? That's different. That's a very different view of the OODA loop.

Starting point is 00:26:18 And, you know, I think in our Twitter exchange, this is where I said, you know, all companies are going to have to look like hedge funds, because in a world where you can have essentially unbounded observational capabilities, you could be a logistics startup, and you could basically get data as good as FedEx or anybody else doing logistics. You could do whatever. There's a great leveling of the field with regard to the sensory capabilities. There's a great leveler with regard to the cloud computing capabilities.

Starting point is 00:26:40 You don't need to go hire 100 sysadmins to go and rack a bunch of servers. You can just turn on some things. So with that being said, you can now have extremely low-footprint, fast-moving companies that are just there to run the OODA loop, and to have extremely explicit and intentional sense-making around the modeling. And for them, data then, it's sort of like the difference between the way a fish sees water versus somebody holding water in a ladle, right?

Starting point is 00:27:03 You don't even think about the data because you're just swimming in it, right? Obviously, you understand data. Yeah, so this is like the silly VC observation. The silly VC observation is if you look at a software company that doesn't have to deal with the complexity of data, they tend to have relatively high margins, say 70 to 80%. And the reason is because they're building skyscrapers, and then they sell those skyscrapers.

Starting point is 00:27:22 And the team needed to build a skyscraper is relatively fixed, and then you can sell as many of those as you want. That's kind of the software model. When we look at companies that are reining in the complexity of data, and that's how they extract value, the more people you put to rein in that data, the better your results are. And so now you're incented to have more and more people

Starting point is 00:27:43 try and work on that data over time. So I think the structure of a hedge fund is, we hire more people to work on the data, we can potentially get more money just because they're actually reining in the complexity of that data. But in the software world, all of that complexity is basically going into the margins. Yet, depending on who the buyer is, you can't increase the top line in the same way. So let's say I'm going to sell five copies of my software, right? Now, if I sell five copies of my software, people are buying the software.

Starting point is 00:28:14 They're not buying the results of the data. Like maybe they'll like my software better because it's more accurate or less accurate, but the number of people working on the data doesn't directly drive the amount of software that gets built. And so now you have this existential margin issue, which is you want to increase the number of people working on the data, labeling it, cleaning it, because you can always get some improvement. Right. Here's the question. If we think about, in the software space, you have software vendors and buyers. And the theory of a software vendor, again, going back in our history, there used to just be computer companies.

Starting point is 00:28:50 And then Bill Gates was like, hey, stop pirating my crap, pay for my software, because software's a thing. It's not just you long-haired hippies copying each other's Unix code. Like software is a thing, right? And you need to pay me for it. A letter to hobbyists, 1970-whatever it was, or something like that. But he did that at the beginning of the PC era. And the PC era basically said, well, here's a set of standards.

Starting point is 00:29:07 Here's x86, the x86 ISA, here's ISA, here's peripherals, and networking, and all the crap. And so you have a set of standards in the space. Actually, this recent blog post that I think you, I don't know if you wrote it, but you promoted, the narrow waist of TCP/IP and the... Oh yeah, that was me and Ali. That's the old networking guys. Look at crypto. The point is, you know, a lot of these things rhyme with each other. When you have standards, what they do is they reduce the cost of innovation and they increase the innovation surface. The PC era was such a gigantic, it's such a gigantic leveler that allowed the era of software to

Starting point is 00:29:44 thrive. But again, Moses didn't have a third tablet that said there must be a software-hardware divide and that software must always have these kinds of margins. We're now entering into an era where people are considering the entire stack of what an information system is. And so when you look at that, there's no reason at all why, if I'm an end user customer buyer, why should X percentage of my alpha or my margin or surplus, if you want to talk about capital and all that stuff, why should this percentage of my surplus, broadly across all of these companies, accrue to just one software vendor? Because if I insource and in-house the technology, and I have the FTEs, all of the residual value stays within the boundaries of my firm. And this is

Starting point is 00:30:29 what a hedge fund does. In fact, when I go and try to sell to hedge funds, they don't generally buy software. They use our open source. They like to get consulting services and ask questions. They're very high-end users of our open source stuff. But they basically say, why should I share anything? Like, they'll buy a database, they'll buy some things that they perceive to be truly infrastructure and truly commodity. Anything above that, if there's a chance of it contributing deeply in a generative way, not a decomposable way, but in a generative way to their alpha, they're going to keep it in-house. It's proprietary. I was out at dinner with the CTO of a hedge fund, and he's like, tell me why I should care about open source. I'm like, because they had an internal, like,

Starting point is 00:31:06 crappy version of pandas. I was trying to give them the story of like, look, if you just use pandas, you would basically leverage all of the, it's basically cost amortization of innovation for you, right? And it's not differentiating value for you to have your own little tabular data structure. People think that open source is winning or has won. I think the fact that open source is commoditizing all this stuff means that software itself, the value chain is collapsing. And so right now, open source is a movement,

Starting point is 00:31:29 I think, unfortunately, it's confused. There's sort of the Stallman-esque religious aspect to it almost. And then there's something deeply beautiful about crowdsourced innovation and legit community collaborative innovation that's really important, and we're almost losing that because everyone's like, oh, but open source has won now. I think that's a misread of the situation. And it's the thing I keep tweeting about

Starting point is 00:31:46 because I'm saddened by the loss of that thread, of the principle, why do we do open source? Why do we do crowdsourced innovation? So anyway, it's that conversation. I think software companies do look different because they have thrived in an era of relative, the substrate they've sat on is pretty flat. And now we're entering a space where performance matters a great deal,

Starting point is 00:32:04 where the information systems are integrated again. Software is only one component of a whole integrated information system. And because of that, now it's no longer like I can sell just one piece of software across a thousand companies and just harvest all of this margin. So here's my mental model on these things. Let's imagine that you have two companies, company A and company B. So company A, they're building a system, and all the properties of that system are going to be defined in software.

Starting point is 00:32:28 And so they've got a roadmap and then they build the software over a period of time. That's company A. Let's say company B, let's say actually they're going to use just an all off-the-shelf kind of AIML workflow where they're not actually really writing software. It's all about getting the models to be predictive. And so the entire company is around cleaning data, labeling data, training the models, right? They're very, very different because the complexity of the second one is just far, far greater. And I would say defensibility of the second one is far, far greater just because of the nature of data.

Starting point is 00:33:02 And so it feels to me there's almost like an emergence of a new type of company. Absolutely, yeah. where the organization, the margins, the go-to-market, everything is being dictated by the fact that they're processing data rather than writing software primarily. I think we're all still trying to understand what that second class of company looks like. Yeah. One of my pitches is that by harnessing the power of open source

Starting point is 00:33:25 to commoditize, to do the disruption on a lot of classical data processing systems, we would basically be one of the last great software companies and be one of the first great AI companies. The margin doesn't come from how well you do the software bit. And so I think that's the big news. I mean, maybe I have a bit of a controversial view on this, but I think that the era of software being the dominant part of the stack, I know, you know, Marc Andreessen likes to say software's eating the world.

Starting point is 00:33:50 It is eating the world, but it's a ruminant at this point, right? It's not the most efficient digester of the value. And so, look, you benefit from chlorophyll, even though you're not a plant. You just eat a lot of plants. I think in the era of, I mean, if we want to kind of go to the complex systems thinking, right, in the era of data abundance, the people who can build models, refine models, and execute on them fastest are the ones that are going to win. They're the chaos agents in the ecosystem. So look, we still live in a world of plants, but there's

Starting point is 00:34:15 a beautiful infographic I saw the other day, which is how much biomass is on the earth, and most of it is plants. And then you've got, like, this little bit is animals. And of the little bit, there's like this little bit that is mammals and then it's like this little bit that is humans. I think that in the world order to come, there's just still going to be, of course, hardware and software companies and so on and so forth. But I think the margins, where you really want to look for the growth, is going to be those people who are moving like animals and not just claiming a spot: I'm going to, here, you know, grow my leaves. You can still catch some sunlight, but your optionality, I mean, it is, you know, business is war. Your optionality is reduced. And the

Starting point is 00:34:47 companies that can move fastest among these different places, those are the animals. And that's going to be running faster OODA loops. I would love to talk about how this impacts the actual business. So I'm not sure there's a huge change on go to market, except for the fact that there's two types of these kind of AIML companies. There's the infrastructure companies, which basically build the tools to use AIML. And that's standard, that looks like a standard software infrastructure company. It'd be like a data company or something like that, to be a really point. And then there's those that use data science AIML to tackle problems in the real world. And in those, it's kind of interesting because you end up not building a software company,

Starting point is 00:35:28 but more of a farming company or agricultural company. And so you're not selling to core IT, right? So they just tend to look very different than typical software problems because they're selling to a different constituency. They're not software problems. The software is a means to an end, not an end unto itself. And this is particularly germane to AIML because it allows us to solve problems that typically software hasn't been good at solving in the past.

Starting point is 00:35:51 It allows us to solve vision problems better than we've been able to do it before, audio processing problems, better than we've been doing it before. It's kind of like the best way to interoperate with the physical world. And so now we're off, like, building these companies that solve these kind of real world problems, and you just have different-looking companies to do that because, again, you're selling to the person that inspects the HVAC system,

Starting point is 00:36:10 you're selling to the person that is the farmer, you're selling to the person that does manage the forest. So I think one thing at the very high level that anybody creating a company in this space needs to think through is the following, which is if you're building just the infrastructure, just the tooling and the nuts and bolts, you look like a software company,

Starting point is 00:36:26 and somebody else deals with the actual AIML application, and that's fine. But let's say that you yourself are ingesting the data, cleaning the data, labeling the data. There's a lot of variable cost to do that. Like every customer may have a new data set. And what happens is this impacts the margins of your business. Like it looks like you have lower margins because for every customer you've got all of this work to do. And so I think you need to make a decision early on whether you want to be the one that's doing that work, because that's something you can actually offload

Starting point is 00:36:57 to the customer. So let's say you go to a new customer and say, listen, we're going to take all your data, we're going to clean your data, we're going to create your models, and we're going to solve your problems. And in that case, you internalize all of that. And as far as your organization, you need to know that this is basically a services arm. Another option is you can say, customer, we're going to give you all these tools, but you're going to have to bring in your own data, you're going to have to hire people to label it. You're going to have to learn to tune your models and we'll help you with all of that, but you're the one that's going to go ahead and sink that cost. So you have to think very deeply of how you structure your company relative

Starting point is 00:37:32 to the variable headcount, like the headcount that has to grow per customer, because that seems to be the big difference that we see between these AIML companies and the typical software company. I think it's hard to do one of these companies right now because we are in a transitional time. A lot of the customers don't even know what they're asking for

Starting point is 00:37:47 and they're kind of looking for that help. And even now, people, they recognize it as a growth area and where the future is headed, so they want to spend some money on it. But absolutely right, the amount of work you have to do per customer starts looking a lot like a services play. And there's a reason why a lot of companies,

Starting point is 00:38:01 when you really look inside the skeleton, like I think I called it, the skeleton buried in the ARR, you see a lot. Totally. Eric von Hippel has a great book around democratizing innovation. And he says, even when we have a space in which a product is possible, products usually only cover 60 to 70 percent of the end user need. The end user still has to do the rest. And he's not talking about software.

Starting point is 00:38:23 He's talking about people like, you know, welding things onto the side of their tractors. He's talking about, in general, the customer has a thing they need to do. When it comes to AIML application areas, it's a lot more than just 30% that has to be customized per customer site. So I think for businesses right now in this transition, it's super hard not to end up looking, if you're doing a good job for your customers, it's hard not to look like you're doing a services play. Now, that being said, there are, I think, viable strategies through this, which is that

Starting point is 00:38:51 you can specialize in an area and domain and say, look, we're going to come in and work on your data set, but we have our own reference model we've built. It's exactly right. That's exactly right. And now we can benchmark you against that. We can bring some of our own magic juice into this. So now the thing that is generalizable across or productizable across a thing. Maybe it's only for that sector. But the thing that's generalizable is not just the software. It's actually more defensible than the software. I just want to very quickly put a fine point on this. There's two things that you brought up that are very important to realize. The first one is we are in a transition. So customers don't even know what it means to, like, label data and clean data. Maybe in five years you can go to a customer and say, we've got all the tooling for you. But you're responsible for managing the data and therefore you offload the cost. It's just today you just don't have enough education in the market to do that. They don't have data scientists, et cetera, et cetera. And so I think in order to get the market into the transition, the startups have to do that. Like you have to build that, basically, services arm. The second point you made is actually

Starting point is 00:39:45 I think the critical one, is there actually is some commonality in verticals. And so you can reduce that margin by sharing as much as possible. But it does require customers to share data or at least share models. And that's sometimes a tough conversation with the customers. Well, it's not just sharing models. I mean, there are deeper and interesting, more leveraged plays to be made. For instance, you go into a sector and you realize, oh, all of these people are doing their own craptacular things because of their limited budgets and their data sets are broken in this way. But holy crap, there's this other vendor over here with this data set. I can go and negotiate an exclusivity with that vendor. And now I'm the only one that can bring that kind of model lift, you know, into this particular sector. So there's a lot of that 1800s-style, like, homesteading to be done in the space. So I think it's more than just the, let me average out, central limit theorem, everybody in this industry. There's some really cool things to be done. So the first thing companies need to figure out is what type of a company they are. Many are very confused about that. You need to know, are you a software company and you're building tooling, or are you a company where the majority of the complexity

Starting point is 00:40:45 of the company is around data? And by the way, many companies start as software companies and then they've structured things incorrectly. So let's say that you've come to the answer and you've figured out you're a data company. Once that happens, you need to understand that often, for companies that are extracting value from data, there's a lot of complexity per customer in order to do that. And you need to structure your company the correct way, which is like, just realize it may be hard to scale, just realize you're going to have different processes around the actual data, or come up with a strategy to offload that to the customer. Now, the reality is because the market is so immature, it's unlikely the customer is going to be able to do a lot of that,

Starting point is 00:41:19 but it's something that you can over time train the market to do and do that transition. But I think this is the big sticking point many times. They think they're software companies. They end up being data companies. They didn't build the organizations to deal with that internal complexity. It's coming down in the margins. Everybody's kind of confused. So I think just a little bit of self-awareness and a little bit of planning go a really long way in this space. But it requires a very different.
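
The margin point in this exchange can be made concrete with a little arithmetic. The sketch below is only illustrative: the function name `gross_margin` and every number in it are hypothetical and do not come from the conversation, except that the plain software case is chosen to land in the 70 to 80 percent range mentioned earlier. It compares a vendor with no per-customer data work, a vendor that does all the cleaning and labeling itself, and a vendor that offloads that work to the customer.

```python
def gross_margin(revenue, base_cost, data_work_cost=0.0, offloaded_share=0.0):
    """Return per-customer gross margin as a fraction of revenue.

    revenue         -- what one customer pays (hypothetical units)
    base_cost       -- per-customer delivery cost such as hosting and support
    data_work_cost  -- per-customer cost of ingesting, cleaning, labeling data
    offloaded_share -- fraction of the data work the customer does (0.0 to 1.0)
    """
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    if not 0.0 <= offloaded_share <= 1.0:
        raise ValueError("offloaded_share must be between 0 and 1")
    # Only the part of the data work the vendor keeps counts against its margin.
    retained_data_work = data_work_cost * (1.0 - offloaded_share)
    return (revenue - base_cost - retained_data_work) / revenue


# All figures below are made-up, illustrative assumptions.
REVENUE = 100.0
BASE_COST = 25.0
DATA_WORK = 35.0

software_margin = gross_margin(REVENUE, BASE_COST)
services_margin = gross_margin(REVENUE, BASE_COST, DATA_WORK)
offloaded_margin = gross_margin(REVENUE, BASE_COST, DATA_WORK, offloaded_share=1.0)

print(f"software only:       {software_margin:.0%}")
print(f"vendor does the work: {services_margin:.0%}")
print(f"customer does it:     {offloaded_margin:.0%}")
```

Under these assumed numbers, keeping the data work in-house drags the margin well below the software case, and fully offloading it restores the software-like margin. That is the structural decision being described; a real company would sit somewhere in between.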

Starting point is 00:41:43 Many West Coast firms have the thesis that to do a really great tech startup, you need at least a tech founder somewhere in there because they kind of see where things are going. For a really good AI startup, you need to have machine learning people at that leadership level because they know what it means. They know why a single data set can be a billion dollars, or swing a billion dollar deal. The difference between a software engineer and, like, a data scientist is that in software, you generally know what the inputs are or the types of inputs,

Starting point is 00:42:07 and your goal is to construct a system that, given these inputs, produces these sets of outputs. So you have very nice, clean definitions around correctness for the most part. With data science, there's unfortunately not that. You can have a piece of code, and for some sets of values, it's correct. Other sets of values, it still produces a result, but those results are wrong. And a function's correctness is dependent on values. This is the key thing that differentiates all of data science and machine learning from

Starting point is 00:42:32 classical software engineering. Classical software engineering is like, we've got our test data set, we've got our prod data set. It works in test. It's going to work in prod, right? That's not how data science and machine learning work at all. In data science and machine learning, the correctness of a function is value dependent and also performance dependent. And the performance is also value dependent. So now you have this intertwined synthesis of a data and a modeling and a computation problem that cannot be decomposed into orthogonal vectors, right? That's the difficulty of this. What I think is that in five, ten years' time, every company that is actually still in existence and doing well has to essentially have brought in a synthesis of their

Starting point is 00:43:12 data capacity, their data modeling capacity, the model build and computation, hardening appropriate computation in an economical fashion to suit their needs. So the word I like to use for this is cybernetics. I mean, we are right now in between the software era and the cybernetic era. And I think we will get to a cybernetic future. And cybernetic, by the way, you know, it comes from the same word as Kubernetes, right? It means governor. It means a theory of action and control. So businesses have to see computation really moving its way up. Data modeling process has to move its way all the way up to the very tippy top of the business. That synthesis will happen. It will have to happen. That's what the selection pressure is in the business world. I don't know exactly the

Starting point is 00:43:50 path it will take to get there. In the transitional time, businesses who want to basically get in ahead of the curve, they've got to have very clear thinking at the leadership level, and they must have a very clear understanding with their investors about what they're going to look like as they chase the marlin, because it's going to take a little while. So I think that's the trick right now, is that you've got to find founding teams or leadership teams that have a solid understanding of software, of what software is and isn't, of where the value is in the software activity, and of where the value is in the data and data modeling activities. In a time of fog, you've got to have very, very clear-headed thinking about that sort of thing. But ultimately, that synthesis must be what

Starting point is 00:44:28 comes. Thank you. Thank you so much.
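
The point made near the end of the conversation, that in data science a function's correctness depends on the values it sees, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `true_process` stands in for some unknown real-world relationship, and the "model" is an ordinary least-squares line fitted by hand. The code runs without error on every input; it is just quietly wrong once the inputs leave the range it was fitted on.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_process(x):
    # Hypothetical stand-in for the real-world relationship being modeled.
    return x * x


# "Training" data only covers inputs between 0 and 1.
train_xs = [i / 10 for i in range(11)]
train_ys = [true_process(x) for x in train_xs]
slope, intercept = fit_line(train_xs, train_ys)


def predict(x):
    return slope * x + intercept


def abs_error(x):
    return abs(predict(x) - true_process(x))


# "Test" inputs drawn from the same range as training: the model looks fine.
test_error = max(abs_error(x) for x in [0.05, 0.35, 0.65, 0.95])

# A "prod" input far outside that range: same code, no exception, bad answer.
prod_error = abs_error(10.0)

print(f"worst error on in-range test inputs: {test_error:.2f}")
print(f"error on an out-of-range prod input: {prod_error:.2f}")
```

A passing test set therefore says little unless production values resemble the test values, which is why the data, the model, and the computation cannot be checked independently of one another.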
