a16z Podcast: The Product Edge in Machine Learning Startups
Episode Date: March 17, 2017

A lot of machine learning startups initially feel a bit of “impostor syndrome” around competing with big companies, because (the argument goes) those companies have all the data; surely we can’t beat that! Yet there are many ways startups can, and do, successfully compete with big companies. You can actually achieve great results in a lot of areas even with a relatively small data set, argue the guests on this podcast, if you build the right product on top of it. So how do you go about building the right product (beyond machine-learning algorithms in academic papers)? It’s about the whole system: the user experience, transparency, domain expertise, choosing the right tools. But what do you build, what do you buy, and do you bother to customize? Jensen Harris, CTO and co-founder of Textio, and AJ Shankar, CEO and co-founder of Everlaw, share their lessons learned here in this episode of the a16z Podcast — including what they wish they’d known early on. Because, observes moderator (and a16z board partner) Steven Sinofsky, “To achieve product market fit, there’s a whole bunch of stuff beyond a giant corpus of data, and the latest deep learning algorithm.” Machine learning is an ingredient, part of a modern software-as-a-service company; going beyond the hype, it’s really about figuring out the problem you’re trying to solve… and then figuring out where machine learning fits in (as opposed to the other way around). Customers are paying you to help solve a problem for them, after all.

The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments and certain publicly traded cryptocurrencies/digital assets for which the issuer has not provided permission for a16z to disclose publicly) is available at https://a16z.com/investments/.
Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.
Transcript
The content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures.
Hi, everyone. Welcome to the a16z Podcast. I'm Sonal. Today's episode, moderated by a16z board partner Steven Sinofsky, is all about the data edge in machine learning startups. The conversation covers everything from machine learning algorithms in academic papers versus in products, to how startups can compete with big companies on the data front, to machine learning as a service and how to tease apart hype versus reality. Joining us to have this conversation are Jensen Harris, the CTO and co-founder of Textio, which is a platform for augmented writing of business documents like job descriptions.
And let me actually tell you a little bit more about what they do only as it's relevant for the discussion on this pod.
Their platform provides, in real time, quantitative guidance based on evidence that's continually mined from tens of millions of documents.
And then we also have A.J. Shankar, CEO and co-founder of Everlaw, which is an a16z portfolio company.
He's a computer scientist who fell into law, not in a good way, not in a bad way.
And they help lawyers sift through huge piles of evidence to find the proverbial needle in the haystack for both litigation discovery and for helping build the narrative of a convincing case.
And just to give a quick sense of the contrast between people and machines here,
it's something that would take people hours of manual, linear review, often on very tight deadlines, like over a single weekend, booking a warehouse just to sort through everything.
And now with cases that involve over 10 million documents and terabytes and terabytes of data, machine learning helps sift through all that at both a content and context level to figure out who's saying what, what they're talking about and why.
Now over to Steven.
So the category of e-discovery, with or without machine learning, you know, it's a big category.
It's sort of been established for a long time; there's even a lot of legacy tech.
But the big players, like Google, have an e-discovery product; Microsoft has an e-discovery product.
What is it that makes it so that, hey, they have all the data, like, it's all in Gmail already?
Why is it that they don't have this huge advantage over a startup?
What is it that really enables a startup to have an advantage in this machine learning world?
Because a lot of people just believe that deep learning is a big company thing because they have all the data and they're just going to win.
It's a great question to ask for any startup that's looking at machine learning, right?
You know, why try?
Because some, you know, some big co will have all the data and all the expertise, right?
And so the first compelling reason is that, you know, there are many areas that these big companies don't actually invest in.
They don't care about.
You know, I'll discuss legal tech in a second, but legal tech, I actually think, is one of them; medical; HR at the level that you guys are operating at; agriculture; fintech.
There are huge, valuable areas where Google or Microsoft isn't building specific, domain-specific products. I mean, fundamentally, by and large, you know,
they're generic B2B or B2C companies that address these broad areas. So if you go for a
narrow area, there's a great opportunity. The second thing is, you know, there are a couple other reasons
why you can have an advantage or at least be on equal footing, right? So one reason is if
the data sets you're dealing with are isolated and having more data doesn't help you that much
more. In our case, the individual matters that we deal with are largely isolated.
The matter is like a specific case. Yeah, yeah. Case and matter. Yeah, exactly. Yes. Each
matter, you know, it's totally fine to look at them independently, by and large.
And a particular data set in a particular litigation context will have a particular set of
documents that are actually interesting. And that set might differ depending on the
context. And so we don't even know coming in necessarily what's interesting until someone
says, here's why this is interesting for this particular case. So the training has to happen
every time in a case-by-case setting. You can, you know, get pretty far with that.
Another area, of course, is where you can get a lot of data yourself fast, right?
So if you're a self-driving car company, even a small one, you drive a couple hundred miles,
you have an unbelievable amount of data you can start to work with.
But I'll say the third thing to unpack there is, you know, you mentioned, of course,
deep learning as being this thing that's a big-company thing.
And, you know, that in many cases is true for the big problems that they're trying to solve, right?
So we saw breakthroughs in, you know, transcription and translation and, you know, AlphaGo and all that other stuff.
Kitten detection.
Yeah, exactly.
that's obviously the most fundamental, you know, can you find a cat in a picture? You solve that. You've probably
solved everything. But, you know, most ML implementations are not deep learning. They're not neural
networks. There's a ton of implementations that are incredibly valuable that don't involve neural networks.
So, for example, you know, what IBM calls Watson, which is not some brain that just solves every
problem, but a whole agglomeration of different techniques, is largely regression, which is an
incredibly powerful technique. You know, the poker bot that won heads up poker recently, also not a
network, right? So there are many areas where if you know what to optimize, statistical machine
learning will take you really, really far, and there's no reason to shy away from that. It's
incredibly powerful in the right domain. So one of the things I hear you saying a little bit is
if you're getting started in your company and you have access to a data source, it doesn't
necessarily mean your first reaction should be like, get a VM going, fire Torch up, and figure
out what to go do; there's a whole bunch of work to decide what type of learning technique to
apply to what you're doing. I would definitely suggest, you know... I feel it's better
to actually understand your domain and the problem you're trying to solve
and then figure out how machine learning fits into it
as opposed to saying, hey, I'm an expert in ML,
let me just do something. Here's the hammer.
I guess all things being equal,
you'd like to have more data versus less data.
But actually having the right data
and having it be data that is, you know,
tuned for the problem that you're trying to solve
and then having a purpose-built stack
that allows you to build the right user experience,
the right service, and the right business on top of it,
you know, oftentimes it
doesn't take hundreds of millions of rows of data to achieve that. You can actually achieve
pretty good prediction in a lot of areas with a relatively small data set if you build
sort of the right product on top of it. And so the value that you get out of a machine learning
product isn't always about how much data is underneath it. And I think that's a really
important thing, especially in the early days of a startup where you're just not going to have
the huge proprietary data set. Can you get the right data set,
and can you use techniques that are appropriate for that size data set to find the interesting wedge that lets you find the smoking gun, you know, in the discovery, or lets you find the thing that all of a sudden brings 20% more people to apply for a job?
And, you know, I think all of the algorithmic stuff in machine learning is going to be commodity.
Like there are like 20 places in the world where they're inventing new algorithms and that's, you know, educational institutions and huge companies.
And that's really important, but that stuff more and more rapidly ends up in the public domain anyway.
And so it's not so much about that as it is, can you craft the thing that blends machine learning with other techniques, with statistical techniques, with user experience techniques to build a product that has actual value.
In navigating the idea maze for your domains, you both have sort of achieved product market fit using machine learning as an ingredient, but not doing it in the way that we read so much about.
like the, hey, we're going to do machine learning here, so we got all images on Earth, or we're going to do machine learning here, and we've ingested all written words to do translation in between two languages. So that, to me, is, like, super interesting because it really says that for product market fit, like, there's a whole bunch of stuff beyond, like, a giant corpus of data and, like, the latest deep learning algorithm. I think that's a difference between an academic paper and a company. For the academic papers, the sole concern is, given this corpus, what
can I extract from it? We're trying to solve people's problems in the real world, and the
problems are rarely reducible to just extracting the information content of a
corpus. You have to take whatever problem they have and attack it from many angles, in many different
ways. I would say our machine learning component is one of the features, but there are many others
because you can't just solve this problem with this one technique. Well, not just that, but you can
go after huge data, right? You can go scrape the web to try to amass some big database. And if it's
bad data or if it doesn't have an outcome associated with it, it doesn't matter how many
thousands or millions or tens of millions that you have. It's not going to, you know, it's garbage
in, garbage out. You're not going to have a good product at the end. What you're going to have is
a slide that says you have 10 million documents and a product that no one wants.
I want to actually then, building on this, since we're going in an interesting direction,
have you share how your path to product market fit involves a lot of other code. You don't have, like,
a command-line tool that takes a job description in. You've built a whole system around it.
Yeah. So the first thing is, even within the predictive engine, like the brain of it,
machine learning is one part of it. It turns out that to get the best results, it had to be combined
with statistics and traditional natural language processing and heuristic analysis, and all of
these things blended together actually worked much better than just a single model. Oh, and then you
build a word processor. Well, then of course, on top of
it you have the other really important thing, which is user experience. Like, it has to be, you know,
from the time you type something and let go of the key, we have 300 milliseconds to round trip
your whole document down into the cloud, get all the predictions, bring it back up and light it
up on screen so that it's as fast as spell check. And so to do that, you have to build the editor.
You have to build the word processor piece. You have to build, of course, you know, authentication
and all the other pieces that you build. A huge advantage that a startup has that some big company
trying to do the same thing doesn't have, which is we can tailor our security policies,
tailor the way that we handle the data, the way that we sanitize the data, and the way that we use
the data in a very clear way that a blanket terms of service that covers, for instance, all
of Google services or all of Microsoft services can't do.
Yeah, you're a full-on enterprise thing.
So you've got all of that stuff.
People have to be able to sign in, and you have to be able to share documents.
So you have to build a document library.
So you have the whole service stack.
And you have the whole monitoring stack.
So how do you tell whether or not people are using it and how much they're using it and things like that?
And so the heart of it, of course, is the pieces that are about the predictive engine, but that's not all of it.
And in fact, we found in our sort of earliest days, like our first six months, that we didn't have to, you know, really ingest tens of millions of things then.
And what we really needed was, you know, tens of thousands or hundreds of thousands of really good pieces of data.
With the results.
With the results.
Right.
They had the outcomes associated with them.
And then we could build the right models and we could figure out the kind of data that we needed and build the right experience on top of it to figure out whether we had product market fit.
We could have spent a year doing nothing but just aggregating data.
And getting PhDs.
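To ground that: here's a minimal sketch of the kind of small, outcome-labeled modeling Jensen describes, a text classifier trained on documents paired with outcomes rather than on a giant corpus. The toy data, labels, and model choice are illustrative assumptions, not Textio's actual stack.

```python
# Minimal sketch: a small, outcome-labeled data set can support useful prediction.
# Toy data stands in for "tens of thousands of documents with outcomes."
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Each record pairs a document with an observed outcome (e.g., did the job
# listing attract a strong applicant response?). Entirely hypothetical examples.
documents = [
    "Seeking rockstar ninja to crush code under pressure",
    "Join a supportive team building accessible healthcare tools",
    "Aggressive self-starter needed, long hours, fast-paced grind",
    "Collaborative engineer role with mentorship and flexible hours",
]
outcomes = [0, 1, 0, 1]  # 1 = strong applicant response, 0 = weak

X_train, X_test, y_train, y_test = train_test_split(
    documents, outcomes, test_size=0.25, random_state=0
)

# TF-IDF features + logistic regression: the kind of "statistical machine
# learning" that goes far when you know which outcome to optimize.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(X_train, y_train)
print(model.predict(X_test))
```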
Both of you are actually post-Series A startups.
You're not super, super far along.
but you both have like a SaaS business,
you have ongoing revenue,
you have many customers signed up.
But what I think is just interesting
is you basically are like, in my view,
modern SaaS companies.
Like, you've now taken machine learning,
and just like before, when you were a SaaS company,
it was, yeah, we have hosting and we use a database.
Now machine learning
is just part of your ongoing value proposition.
And it's just like an ingredient.
That's right.
I think that's how it should be.
Again, maybe, I mean, in my view,
it seems like there's certainly a ton of hype around machine learning, and certainly, with recent advances, there should be a lot of hype.
It's really compelling.
But I do feel like typically, when companies are hyping up the machine learning component, they might be doing it more to raise money than to provide value to the customer.
The customer, of course, wants to know, what is this doing for me at the end of the day, not what algorithms you're using to do it.
And so I think part of it should be packaged as part of a broader message about what you're doing for them.
Yeah, and it's not clear that you're going to pay just for machine learning.
You're paying for solving a business problem if you're a customer.
So again, the core lesson there is make sure you're articulating that. Maybe you could just kind of tick off a little bit of the core tech that you did use. Are either of you using any of the open source things that people have heard about? How are you integrating them?
I mean, yeah. I mean, we use a whole bunch of stuff, I would say.
That's a terrible answer. Yeah, that didn't really offer any specifics.
And computers?
So we use a lot of open source tools that are available. Spark, I think, is the go-to one for now. But we also spent a lot of time on how do we get clean data into that system.
We basically take stuff from a bunch of different sources and synthesize that to get good results.
But a lot of what we do, actually, to your point earlier: the notion of cleaning up the data is so important to getting good results.
Oh, okay.
And so the notion of garbage in, garbage out is really true.
That's certainly as true in AI as in anything else.
And so a lot of what we actually do is take these off-the-shelf components, but spend a lot of effort on putting really clean input into them.
And so, one brief example of that. This is the kind of thing where, with domain specificity,
you're understanding your user's workflow and providing a tool that's catered to how they like to work.
Again, not a big company.
Right.
Sure.
Sure.
Yeah.
I would imagine, you know, Microsoft doesn't know a ton about, you know, how lawyers specifically
do this one kind of thing.
But here's an example, right?
So if you have an email thread in your discovery process, right?
There's all these emails.
We automatically thread the emails for you from anything, really.
Sometimes people will say, well, this, you know, this is the really interesting, you know,
email in the thread.
I'm going to mark it as really relevant.
But I'm going to mark these other emails as not relevant because, you know, this is the
one that encapsulates everything. By and large, there's actually a lot of shared content between
emails in a thread, replies, and all that other stuff, such that if you were to feed all these
into a machine learning algorithm, where one email was hot and the others were cold but they largely
shared the same content, it's going to dilute that signal, right? It's going to muddle it because
you suddenly have these two opposite ends of the spectrum with the same data. So we're able to
take that kind of information, for instance, and clean it up, because we know what all the threads are.
We look at exact document duplicates and near-duplicates in the data and say, these things are actually
really similar. They probably picked this one because they thought it was interesting. And they don't want to
look at these other duplicates, but we don't want those to muddle the signal. So we have all these
other tools that go into analyzing the data coming in so that it's really clean when it comes
into these algorithms. And that actually helps with performance a lot. Yeah, that's a hugely
important point. You know, we use probably many of the same technologies. We use Spark, as, you know,
pretty much anyone who has to do sort of massively parallel data analysis does.
But, you know, our actual machine learning algorithms and the core NLP stuff we do use the
standard sort of Python libraries that, you know, you can go download and use. But we have put an
enormous amount of time into our data processing pipeline, and that is the very, you know, very tailored
thing that we write. It takes, the same thing as you mentioned, all the job listings that
come in in all sorts of different formats, de-dupes them, cleans them, finds the outcomes, normalizes the
outcomes: all of the stuff that you do to make the data coming in be really meaningful and
throw out the data that isn't meaningful.
That is hugely important.
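A minimal sketch of the pre-training cleanup both guests describe: collapsing near-duplicate documents (say, emails in a thread that share most of their content) so that one "hot" and one "cold" copy of the same text don't muddle the training signal. The normalization rule, similarity threshold, and label policy here are assumptions for illustration, not Everlaw's or Textio's actual pipeline.

```python
# Sketch: collapse near-duplicate documents before training so conflicting
# labels on nearly identical text don't cancel each other out.
import re
from difflib import SequenceMatcher

def normalize(text):
    # Drop quoted reply lines, collapse whitespace, lowercase.
    lines = [l for l in text.splitlines() if not l.lstrip().startswith(">")]
    return re.sub(r"\s+", " ", " ".join(lines)).strip().lower()

def dedupe(docs, labels, threshold=0.9):
    """Keep one representative per near-duplicate cluster.

    docs: raw document strings; labels: parallel 0/1 relevance tags.
    If a cluster contains conflicting labels, keep the 'relevant' label,
    so the training set carries one clear signal instead of two opposing ones.
    """
    kept_docs, kept_labels, kept_norm = [], [], []
    for doc, label in zip(docs, labels):
        norm = normalize(doc)
        match = None
        for i, seen in enumerate(kept_norm):
            if SequenceMatcher(None, norm, seen).ratio() >= threshold:
                match = i
                break
        if match is None:
            kept_docs.append(doc)
            kept_labels.append(label)
            kept_norm.append(norm)
        elif label == 1:
            kept_labels[match] = 1  # prefer the positive signal for the cluster
    return kept_docs, kept_labels

emails = [
    "Ship the Q3 numbers tonight. Board meeting is Monday.",
    "FYI. Ship the Q3 numbers tonight. Board meeting is Monday.",  # forwarded copy
    "Lunch on Friday?",
]
docs, labels = dedupe(emails, [1, 0, 0])
print(docs, labels)  # the forwarded near-duplicate is collapsed into the first
```

A production system would lean on thread metadata or locality-sensitive hashing rather than pairwise comparisons, but the principle, one clear label per cluster of near-identical content, is the same.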
We've also found, talking about technologies,
that when you're getting your company off the ground,
you're going to end up doing a lot of ad hoc data analysis
to try to figure out what are the models that work,
what are the patterns that work.
And what that means is you're just going to be looking at your data.
You're going to be looking for patterns.
You're not going to be operating at 100 million scale.
You're going to be looking at 5,000 things
and trying to figure out, like, where are the patterns that work
and what are the models that work?
And something like Athena, which allows you to do basically serverless SQL queries on an S3 bucket,
can allow you, super fast, to just dump a bunch of data into AWS, into what's called an S3 bucket,
which is like a storage container.
And without deploying any infrastructure, you can just start writing SQL queries and see what you can find.
You know, five years ago, you would have been managing a big infrastructure to do this.
And now, in like your first two weeks creating your startup, you can get started: you can spend more time cleaning the data and finding the patterns,
and less time, like, actually trying to build the algorithms or, you know, doing the other
kinds of things.
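Roughly what that serverless exploration looks like, sketched with boto3 against Athena. The bucket names, database, and table here are hypothetical, and the external table is assumed to already be defined over the raw files in S3.

```python
# Rough sketch of "serverless SQL on an S3 bucket" with AWS Athena via boto3.
# Bucket, database, and schema names are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT outcome, COUNT(*) AS n
FROM job_listings          -- hypothetical external table over files in S3
GROUP BY outcome
ORDER BY n DESC
"""

# Kick off the query; Athena writes results back to S3, no servers to manage.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "adhoc_analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = execution["QueryExecutionId"]

# Poll until the query finishes (fine for ad hoc exploration).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```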
That was super interesting on some of the tools that were used.
But I think there's probably, like, some really good learning in just like, wow, there's
a lot of stuff out there.
Like, all the hosts, all the infrastructure people, have machine learning services.
So do you have any lessons in, like, how you approach the choosing of your technology stack
relative to, like, what's out there, build versus use versus buy versus contribute?
Yeah, the tooling is really good now.
And not only do they have these services,
there actually seems to be a race among all these big companies
to open source their technologies for people to use, which is great.
I think the best thing to do is just try to learn some of the fundamentals
or get people on a team that can learn enough of the fundamentals
to be able to choose rationally amongst the fundamental decision-making points
you're going to have about what kind of problem you're trying to solve.
And then any one of these tools that are released will get you enough of the way to understand,
is this the kind of thing where I want to use ML or not?
Yeah, you don't need to custom build something, and you shouldn't spend any of your time working on that.
Like, you should figure out what cloud platform you're using, whether it's AWS or whether it's Azure or something else.
They all have built-in ML services that you can use that are totally fine.
Like they're using the same algorithms, the same exact things that you might use if you sort of custom-built something,
and they have plenty of power for you to do the more important thing, which is like figure out where the actual predictions are in your data.
And so you shouldn't try to, especially in the early time, build anything custom there.
You should use what's there.
And also, one of the things I think is really cool is that they're, in a sense, academic in their variations: they make it easy for you to iterate and try different models and different networks very, very quickly.
Again, if you don't build any of your own infrastructure and just use theirs, you can change between deep learning models pretty quickly.
That's right.
And you're not buying any hardware.
You're not managing any hardware.
Yeah, no one's going to buy the hardware, though, for sure, I hope.
Please don't.
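To make the model-swapping point above concrete, here's a hedged sketch using local scikit-learn models as stand-ins for a cloud provider's hosted services (the episode doesn't name specific APIs): once the data work is done, iterating across model families is a few lines.

```python
# Sketch: with the data pipeline in place, trying different model families is cheap.
# scikit-learn stands in here for a cloud provider's hosted ML services.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a cleaned, outcome-labeled data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Same data, same evaluation; only the model changes.
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```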
Because you're both past this product-market-fit point and in the scaling kind of phase, but using machine learning, using data, I want to ask each of you to offer some advice on something you might have done differently than you did.
So one of the things for us that I wish we had known then: we felt a lot of impostor syndrome in the first two or three months, especially around machine learning.
This idea that because we didn't have the huge data set, we weren't going to be
able to build something that was of value. And I think we learned that it was actually
okay to start out based on a smaller amount of data that was tailored not
just to a specific domain, but to a specific customer. And that allowed us to prime our learning loop
by having the relationship with that customer and getting their data, which then allowed us to make
our predictions better and get the next customer. So you got the flywheel going. So what you described
is essentially the machine learning version
of just early adopters
in the enterprise space, but using the data
side of it. And not being afraid
to do it at a really small scale.
Right. Like one customer.
One department. One department.
Yeah. Get the really clean, really
tailored data for whatever
you're trying to do. And then
do the flywheel very small because it blows
up very fast. But isn't LinkedIn
going to win because it has all the job descriptions?
Which is what led to sort of that impostor feeling,
I think: there were a lot of job
descriptions that someone else already had.
But here you are doing something that they can't
possibly imagine doing right now.
Because we went and found the data that actually LinkedIn doesn't
have. And a small amount of that data is more
valuable in some ways than the large
pile of sort of, you know, aggregate
data. When you talk about big data, obviously
the algorithms are really compelling and the results are
really compelling. But the other half of the equation
is presenting the data to the user in the way
that they can understand it, right? And actually
make use of it. Certainly in our arena,
it was insufficient as we learned
to just say, hey, here's this great stuff, it's going to tell you
what to do. For a lot of lawyers, that feels like a black box, right? And that black box,
when they have to go explain to their clients, hey, I did this because this thing said this
thing, you know, ugh, right? So a lot of what we do now is give greater transparency about what
the system is doing, what you can trust and what you can't trust. And building that trust
with the user, I think, is really important to actually drive that kind of usage because
at the end of the day, it is their jobs. And the more information you give them about how it's doing,
when it's doing a good job and when it's doing a bad job, I think the better we're seeing
the response be. That runs a little bit counter, and I think it's a really awesome learning, which is
sometimes people think that with machine learning, like, it means you do away with a lot of
UI, you do away with a lot of stuff. But it turns out, in a world where people have to be
comfortable with what led to the recommendation or to the solution, like actually presenting a journey
through that is actually helpful. Yeah. I mean, the key thing you want to do is present the AI as a
partner in people's endeavors, you know, in what they're trying to do. This is something
that's going to help you, you're going to work with it. And when you do that, of course,
you need a lot of trust. It's really about the human and the machine together, accomplishing something
that neither could accomplish on its own. And so, you know, so many machine learning products
have been about trying to either look backwards and tell you what happened or to predict the future.
It's much harder to predict the future and then help you change the future. And that's user experience
and model together. It's not just the algorithms; the secret sauce is the algorithms plus the product
experience that blends humans and computers together in this sort of beautiful learning loop.
Awesome. Well, I really want to thank Jensen and AJ for sharing their advice and wisdom on building
startups using machine learning in the enterprise. So thank you.