a16z Podcast: The Product Edge in Machine Learning Startups
Episode Date: March 17, 2017

A lot of machine learning startups initially feel a bit of “impostor syndrome” around competing with big companies, because (the argument goes) those companies have all the data; surely we can’t beat that! Yet there are many ways startups can, and do, successfully compete with big companies. You can actually achieve great results in a lot of areas even with a relatively small data set, argue the guests on this podcast, if you build the right product on top of it. So how do you go about building the right product (beyond machine-learning algorithms in academic papers)? It’s about the whole system: the user experience, transparency, domain expertise, choosing the right tools. But what do you build, what do you buy, and do you bother to customize? Jensen Harris, CTO and co-founder of Textio, and AJ Shankar, CEO and co-founder of Everlaw, share their lessons learned here in this episode of the a16z Podcast — including what they wish they’d known early on. Because, observes moderator (and a16z board partner) Steven Sinofsky, “To achieve product market fit, there’s a whole bunch of stuff beyond a giant corpus of data, and the latest deep learning algorithm.” Machine learning is an ingredient, part of a modern software-as-a-service company; going beyond the hype, it’s really about figuring out the problem you’re trying to solve… and then figuring out where machine learning fits in (as opposed to the other way around). Customers are paying you to help solve a problem for them, after all.

The views expressed here are those of the individual AH Capital Management, L.L.C. (“a16z”) personnel quoted and are not the views of a16z or its affiliates. Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by a16z. While taken from sources believed to be reliable, a16z has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. This content is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. You should consult your own advisers as to those matters. References to any securities or digital assets are for illustrative purposes only, and do not constitute an investment recommendation or offer to provide investment advisory services. Furthermore, this content is not directed at nor intended for use by any investors or prospective investors, and may not under any circumstances be relied upon when making a decision to invest in any fund managed by a16z. (An offering to invest in an a16z fund will be made only by the private placement memorandum, subscription agreement, and other relevant documentation of any such fund and should be read in their entirety.) Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z, and there can be no assurance that the investments will be profitable or that other investments made in the future will have similar characteristics or results. A list of investments made by funds managed by Andreessen Horowitz (excluding investments and certain publicly traded cryptocurrencies/digital assets for which the issuer has not provided permission for a16z to disclose publicly) is available at https://a16z.com/investments/.
Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Past performance is not indicative of future results. The content speaks only as of the date indicated. Any projections, estimates, forecasts, targets, prospects, and/or opinions expressed in these materials are subject to change without notice and may differ or be contrary to opinions expressed by others. Please see https://a16z.com/disclosures for additional important information.
Transcript
The content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com/disclosures.
Hi, everyone. Welcome to the a16z Podcast. I'm Sonal. Today's episode, moderated by a16z board partner Steven Sinofsky, is all about the data edge in machine learning startups. The conversation covers everything from machine learning algorithms in academic papers versus in products, to how startups can compete with big companies on the data front, to machine learning as a service and how to tease apart hype versus reality. Joining us to have this conversation are Jensen Harris, the CTO and co-founder of Textio, which is a platform for augmented writing of business documents like job descriptions.
And let me actually tell you a little bit more about what they do only as it's relevant for the discussion on this pod.
Their platform provides, in real time, quantitative guidance based on evidence that's continually mined from tens of millions of documents.
And then we also have A.J. Shankar, CEO and co-founder of Everlaw, which is an a16z portfolio company.
He's a computer scientist who fell into law, not in a good way, not in a bad way.
And they help lawyers sift through huge piles of evidence to find the proverbial needle in the haystack for both litigation discovery and for helping build the narrative of a convincing case.
And just to give a quick sense of the contrast between people and machines here,
it's something that would take people hours of manual, linear review, often on very tight deadlines, like over a single weekend, booking a warehouse just to sort through everything.
And now with cases that involve over 10 million documents and terabytes and terabytes of data, machine learning helps sift through all that at both a content and context level to figure out who's saying what, what they're talking about and why.
Now over to Steven.
So the category of e-discovery, with or without machine learning, you know, it's a big category.
It's sort of been established for a long time; there's even a lot of legacy tech.
But the big players, like Google, have an e-discovery product; Microsoft has an e-discovery product.
What is it that makes it so that, hey, they have all the data, like, it's all in Gmail already?
Why is it that they don't have this huge advantage over a startup?
What is it that really enables a startup to have an advantage in this machine learning world?
Because a lot of people just believe that deep learning is a big company thing because they have all the data and they're just going to win.
It's a great question to ask for any startup that's looking at machine learning, right?
You know, why try?
Because some, you know, some big co will have all the data and all the expertise, right?
And so the first compelling reason is that, you know, there are many areas that these big companies don't actually invest in.
They don't care about.
You know, I'll discuss legal tech in a second, but legal tech, I actually think, is one of them; medical; HR at the level that you guys are operating at; agriculture; fintech.
There are huge, valuable areas where Google or Microsoft isn't building specific, domain-specific products. I mean, fundamentally, by and large, you know,
they're generic B2B or B2C companies that address these broad areas. So if you go for a
narrow area, there's a great opportunity. The second thing is, you know, there are a couple other reasons
why you can have an advantage or at least be on equal footing, right? So one reason is if
the data sets you're dealing with are isolated and having more data doesn't help you that much
more. In our case, the individual matters that we deal with are largely isolated.
The matter is like a specific case. Yeah, yeah. Case and matter. Yeah, exactly. Yes. Each
matter, you know, it's totally fine to look at them independently, by and large.
And a particular data set in a particular litigation context will have a particular set of
documents that are actually interesting. And that set might differ depending on the
context. And so we don't even know coming in necessarily what's interesting until someone
says, here's why this is interesting for this particular case. So the training has to happen
every time in a case-by-case setting. You can, you know, get pretty far with that.
Another area, of course, is where you can get a lot of data yourself fast, right?
So if you're a self-driving car company, even a small one, you drive a couple hundred miles,
you have an unbelievable amount of data you can start to work with.
But I'll say the third thing to unpack there is, you know, you mentioned, of course,
deep learning as being this thing that's a big-company thing.
And, you know, that in many cases is true for the big problems that they're trying to solve, right?
So we saw breakthroughs in, you know, transcription and translation and, you know, AlphaGo and all that other stuff.
Kitten detection.
Yeah, exactly.
that's obviously the most fundamental, you know, can you find a cat in a picture? You solve that. You've probably
solved everything. But, you know, most ML implementations are not deep learning. They're not neural
networks. There's a ton of implementations that are incredibly valuable that don't involve neural networks.
So, for example, you know, what IBM calls Watson, which is not some brain that just solves every
problem, but a whole agglomeration of different techniques, is largely regression, which is an
incredibly powerful technique. You know, the poker bot that won heads up poker recently, also not a
network, right? So there are many areas where if you know what to optimize, statistical machine
learning will take you really, really far, and there's no reason to shy away from that. It's
incredibly powerful in the right domain. So one of the things I hear you saying a little bit is
if you're getting started in your company and you have access to a data source, it doesn't
necessarily mean your first reaction should be like, get a VM going, fire Torch up, and figure
out what to go do; there's a whole bunch of work to decide what type of learning technique to
apply to what you're doing. I would definitely suggest, you know... I feel it's better
to actually understand your domain and the problem you're trying to solve
and then figure out how machine learning fits into it
as opposed to saying, hey, I'm an expert in ML,
let me just do something. Here's the hammer.
I guess all things being equal,
you'd like to have more data versus less data.
But actually having the right data
and having it be data that is, you know,
tuned for the problem that you're trying to solve
and then having a purpose-built stack
that allows you to build the right user experience,
the right service, and the right business on top of it,
you know, oftentimes it
doesn't take hundreds of millions of rows of data to achieve that. You can actually achieve
pretty good prediction in a lot of areas with a relatively small data set if you build
sort of the right product on top of it. And so the value that you get out of a machine learning
product isn't always about how much data is underneath it. And I think that's a really
important thing, especially in the early days of a startup where you're just not going to have
the huge proprietary data set. Can you get the right data set,
and can you use techniques that are appropriate for that size data set to find the interesting wedge that lets you find the smoking gun, you know, in the discovery, or lets you find the thing that all of a sudden brings 20% more people to apply for a job?
And, you know, I think all of the algorithmic stuff in machine learning is going to be commodity.
Like there are like 20 places in the world where they're inventing new algorithms and that's, you know, educational institutions and huge companies.
And that's really important, but that stuff more and more rapidly ends up in the public domain anyway.
And so it's not so much about that as it is, can you craft the thing that blends machine learning with other techniques, with statistical techniques, with user experience techniques to build a product that has actual value.
In navigating the idea maze for your domains, you both have sort of achieved product market fit using machine learning as an ingredient, but not doing it in the way that we read so much about.
like the, hey, we're going to do machine learning here, so we got all images on Earth, or we're going to do machine learning here, and we've ingested all written words to do translation in between two languages. So that, to me, is, like, super interesting because it really says that for product market fit, like, there's a whole bunch of stuff beyond, like, a giant corpus of data and, like, the latest deep learning algorithm. I think that's a difference between an academic paper and a company. For the academic papers, the sole concern is, given this corpus, what
can I extract from it? We're trying to solve people's problems in the real world, and the
problems are rarely reducible to just extracting the information content of a
corpus. You have to take whatever problem they have and attack it from many angles, in many different
ways. I would say our machine learning component is one of the features, but there are many others
because you can't just solve this problem with this one technique. Well, not just that, but you can
go after huge data, right? You can go scrape the web to try to amass some big database. And if it's
bad data or if it doesn't have an outcome associated with it, it doesn't matter how many
thousands or millions or tens of millions that you have. It's not going to, you know, it's garbage
in, garbage out. You're not going to have a good product at the end. What you're going to have is
a slide that says you have 10 million documents and a product that no one wants.
I want to actually then, building on this, since we're going in an interesting direction,
have you share how your path to product market fit involves a lot of other code. You don't have, like,
a command-line tool that takes a job description in. You've built a whole system around it.
Yeah. So the first thing is, even within the predictive engine, like the brain of it,
machine learning is one part of it. It turns out that to get the best results, it had to be combined
with statistics and traditional natural language processing and heuristic analysis, and all of
these things blended together actually worked much better than just a single model. Oh, and then you
build a word processor. Well, then of course, on top of
it you have the other really important thing, which is user experience. Like, it has to be, you know,
from the time you type something and let go of the key, we have 300 milliseconds to round trip
your whole document down into the cloud, get all the predictions, bring it back up and light it
up on screen so that it's as fast as spell check. And so to do that, you have to build the editor.
You have to build the word processor piece. You have to build, of course, you know, authentication
and all the other pieces that you build. A huge advantage that a startup has that some big company
trying to do the same thing doesn't have, which is we can tailor our security policies,
tailor the way that we handle the data, the way that we sanitize the data, and the way that we use
the data in a very clear way that a blanket terms of service that covers, for instance, all
of Google services or all of Microsoft services can't do.
Yeah, you're a full-on enterprise thing.
So you've got all of that stuff.
People have to be able to sign in, and you have to be able to share documents.
So you have to build a document library.
So you have the whole service stack.
And you have the whole monitoring stack.
So how do you tell whether or not people are using it and how much they're using it and things like that?
And so the heart of it, of course, is the pieces that are about the predictive engine, but that's not all of it.
And in fact, we found in our sort of earliest days, like our first six months, that we didn't have to, you know, really ingest tens of millions of things then.
And what we really needed was, you know, tens of thousands or hundreds of thousands of really good pieces of data.
With the results.
With the results.
Right.
They had the outcomes associated with them.
And then we could build the right models and we could figure out the kind of data that we needed and build the right experience on top of it to figure out whether we had product market fit.
We could have spent a year doing nothing but just aggregating data.
And getting PhDs.
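To ground that: here's a minimal sketch of the kind of small, outcome-labeled modeling Jensen describes, a text classifier trained on documents paired with outcomes rather than on a giant corpus. The toy data, labels, and model choice are illustrative assumptions, not Textio's actual stack.

```python
# Minimal sketch: a small, outcome-labeled data set can support useful prediction.
# Toy data stands in for "tens of thousands of documents with outcomes."
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Each record pairs a document with an observed outcome (e.g., did the job
# listing attract a strong applicant response?). Entirely hypothetical examples.
documents = [
    "Seeking rockstar ninja to crush code under pressure",
    "Join a supportive team building accessible healthcare tools",
    "Aggressive self-starter needed, long hours, fast-paced grind",
    "Collaborative engineer role with mentorship and flexible hours",
]
outcomes = [0, 1, 0, 1]  # 1 = strong applicant response, 0 = weak

X_train, X_test, y_train, y_test = train_test_split(
    documents, outcomes, test_size=0.25, random_state=0
)

# TF-IDF features + logistic regression: the kind of "statistical machine
# learning" that goes far when you know which outcome to optimize.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(X_train, y_train)
print(model.predict(X_test))
```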
Both of you are actually post-Series A startups.
You're not super, super far along.
but you both have like a SaaS business,
you have ongoing revenue,
you have many customers signed up.
But what I think is just interesting
is you basically are like, in my view,
modern SaaS companies.
Like, you've now taken machine learning,
and just like before, when you were a SaaS company,
it was, yeah, we have hosting and we use a database.
Now machine learning
is just part of your ongoing value proposition.
And it's just like an ingredient.
That's right.
I think that's how it should be.
Again, maybe, I mean, in my view,
it seems like there's certainly a ton of hype around machine learning, and certainly, with recent advances, there should be a lot of hype.
It's really compelling.
But I do feel like typically, when companies are hyping up the machine learning component, they might be doing it more to raise money than to provide value to the customer.
The customer, of course, wants to know, what is this doing for me at the end of the day, not what algorithms you're using to do it.
And so I think part of it should be packaged as part of a broader message about what you're doing for them.
Yeah, and it's not clear that you're going to pay just for machine learning.
You're paying for solving a business problem if you're a customer.
So again, the core lesson there is make sure you're articulating that. Maybe you could just kind of tick off a little bit of the core tech that you did use. Are either of you using any of the open source things that people have heard about? How are you integrating them?
I mean, yeah. I mean, we use a whole bunch of stuff, I would say.
That's a terrible answer. Yeah, that didn't really offer any specifics.
And computers?
So we use a lot of open source tools that are available. Spark, I think, is the go-to one for now. But we also spent a lot of time on how do we get clean data into that system.
We basically take stuff from a bunch of different sources and synthesize that to get good results.
But a lot of what we do, actually, to your point earlier: the notion of cleaning up the data is so important to getting good results.
Oh, okay.
And so the notion of garbage in, garbage out is really true.
That's certainly as true in AI as in anything else.
And so a lot of what we actually do is take these off-the-shelf components, but spend a lot of effort on putting really clean input into them.
And so, one brief example of that. This is the kind of thing where, with domain specificity,
you're understanding your user's workflow and providing a tool that's catered to how they like to work.
Again, not a big company.
Right.
Sure.
Sure.
Yeah.
I would imagine, you know, Microsoft doesn't know a ton about, you know, how lawyers specifically
do this one kind of thing.
But here's an example, right?
So if you have an email thread in your discovery process, right?
There's all these emails.
We automatically thread the emails for you from anything, really.
Sometimes people will say, well, this, you know, this is the really interesting, you know,
email in the thread.
I'm going to mark it as really relevant.
But I'm going to mark these other emails as not relevant because, you know, this is the
one that encapsulates everything. By and large, there's actually a lot of shared content between
emails in a thread, replies, and all that other stuff, such that if you were to feed all these
into a machine learning algorithm, where one email was hot and the others were cold but they largely
shared the same content, it's going to dilute that signal, right? It's going to muddle it because
you suddenly have these two opposite ends of the spectrum with the same data. So we're able to
take that kind of information, for instance, and clean it up, because we know what all the threads are.
We look at exact document duplicates and near-duplicates in the data and say, these things are actually
really similar. They probably picked this one because they thought it was interesting. And they don't want to
look at these other duplicates, but we don't want those to muddle the signal. So we have all these
other tools that go into analyzing the data coming in so that it's really clean when it comes
into these algorithms. And that actually helps with performance a lot. Yeah, that's a hugely
important point. You know, we use probably many of the same technologies. We use Spark, as, you know,
pretty much anyone who has to do sort of massively parallel data analysis does.
But, you know, our actual machine learning algorithms and the core NLP stuff we do use the
standard sort of Python libraries that, you know, you can go download and use. But we have put an
enormous amount of time into our data processing pipeline, and that is the very, you know, very tailored
thing that we write. It takes, the same thing as you mentioned, all the job listings that
come in in all sorts of different formats, de-dupes them, cleans them, finds the outcomes, normalizes the
outcomes: all of the stuff that you do to make the data coming in be really meaningful and
throw out the data that isn't meaningful.
That is hugely important.
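A minimal sketch of the pre-training cleanup both guests describe: collapsing near-duplicate documents (say, emails in a thread that share most of their content) so that one "hot" and one "cold" copy of the same text don't muddle the training signal. The normalization rule, similarity threshold, and label policy here are assumptions for illustration, not Everlaw's or Textio's actual pipeline.

```python
# Sketch: collapse near-duplicate documents before training so conflicting
# labels on nearly identical text don't cancel each other out.
import re
from difflib import SequenceMatcher

def normalize(text):
    # Drop quoted reply lines, collapse whitespace, lowercase.
    lines = [l for l in text.splitlines() if not l.lstrip().startswith(">")]
    return re.sub(r"\s+", " ", " ".join(lines)).strip().lower()

def dedupe(docs, labels, threshold=0.9):
    """Keep one representative per near-duplicate cluster.

    docs: raw document strings; labels: parallel 0/1 relevance tags.
    If a cluster contains conflicting labels, keep the 'relevant' label,
    so the training set carries one clear signal instead of two opposing ones.
    """
    kept_docs, kept_labels, kept_norm = [], [], []
    for doc, label in zip(docs, labels):
        norm = normalize(doc)
        match = None
        for i, seen in enumerate(kept_norm):
            if SequenceMatcher(None, norm, seen).ratio() >= threshold:
                match = i
                break
        if match is None:
            kept_docs.append(doc)
            kept_labels.append(label)
            kept_norm.append(norm)
        elif label == 1:
            kept_labels[match] = 1  # prefer the positive signal for the cluster
    return kept_docs, kept_labels

emails = [
    "Ship the Q3 numbers tonight. Board meeting is Monday.",
    "FYI. Ship the Q3 numbers tonight. Board meeting is Monday.",  # forwarded copy
    "Lunch on Friday?",
]
docs, labels = dedupe(emails, [1, 0, 0])
print(docs, labels)  # the forwarded near-duplicate is collapsed into the first
```

A production system would lean on thread metadata or locality-sensitive hashing rather than pairwise comparisons, but the principle, one clear label per cluster of near-identical content, is the same.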
We've also found, talking about technologies,
that when you're getting your company off the ground,
you're going to end up doing a lot of ad hoc data analysis
to try to figure out what are the models that work,
what are the patterns that work.
And what that means is you're just going to be looking at your data.
You're going to be looking for patterns.
You're not going to be operating at 100 million scale.
You're going to be looking at 5,000 things
and trying to figure out, like, where are the patterns that work
and what are the models that work?
And something like Athena, which allows you to do basically serverless SQL queries on an S3 bucket,
can allow you, super fast, to just dump a bunch of data into AWS, into what's called an S3 bucket,
which is like a storage container.
And without deploying any infrastructure, you can just start writing SQL queries and see what you can find.
You know, five years ago, you would have been managing a big infrastructure to do this.
And now, in like your first two weeks creating your startup, you can get started: you can spend more time cleaning the data and finding the patterns,
and less time, like, actually trying to build the algorithms or, you know, doing the other
kinds of things.
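Roughly what that serverless exploration looks like, sketched with boto3 against Athena. The bucket names, database, and table here are hypothetical, and the external table is assumed to already be defined over the raw files in S3.

```python
# Rough sketch of "serverless SQL on an S3 bucket" with AWS Athena via boto3.
# Bucket, database, and schema names are hypothetical.
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

QUERY = """
SELECT outcome, COUNT(*) AS n
FROM job_listings          -- hypothetical external table over files in S3
GROUP BY outcome
ORDER BY n DESC
"""

# Kick off the query; Athena writes results back to S3, no servers to manage.
execution = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "adhoc_analysis"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
qid = execution["QueryExecutionId"]

# Poll until the query finishes (fine for ad hoc exploration).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```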
That was super interesting on some of the tools that were used.
But I think there's probably, like, some really good learning in just like, wow, there's
a lot of stuff out there.
Like, all the hosts, all the infrastructure people, have machine learning services.
So do you have any lessons in, like, how you approach the choosing of your technology stack
relative to, like, what's out there, build versus use versus buy versus contribute?
Yeah, the tooling is really good now.
And not only do they have these services,
there actually seems to be a race among all these big companies
to open source their technologies for people to use, which is great.
I think the best thing to do is just try to learn some of the fundamentals
or get people on a team that can learn enough of the fundamentals
to be able to choose rationally amongst the fundamental decision-making points
you're going to have about what kind of problem you're trying to solve.
And then any one of these tools that are released will get you enough of the way to understand,
is this the kind of thing where I want to use ML or not?
Yeah, you don't need to custom build something, and you shouldn't spend any of your time working on that.
Like, you should figure out what cloud platform you're using, whether it's AWS or whether it's Azure or something else.
They all have built-in ML services that you can use that are totally fine.
Like they're using the same algorithms, the same exact things that you might use if you sort of custom-built something,
and they have plenty of power for you to do the more important thing, which is like figure out where the actual predictions are in your data.
And so you shouldn't try to, especially in the early time, build anything custom there.
You should use what's there.
And also, one of the things I think is really cool is that they're, in a sense, academic in their variations: they make it easy for you to iterate and try different models and different networks very, very quickly.
Again, if you don't build any of your own infrastructure and just use theirs, you can change between deep learning models pretty quickly.
That's right.
And you're not buying any hardware.
You're not managing any hardware.
Yeah, no one's going to buy the hardware, though, for sure, I hope.
Please don't.
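To make the model-swapping point above concrete, here's a hedged sketch using local scikit-learn models as stand-ins for a cloud provider's hosted services (the episode doesn't name specific APIs): once the data work is done, iterating across model families is a few lines.

```python
# Sketch: with the data pipeline in place, trying different model families is cheap.
# scikit-learn stands in here for a cloud provider's hosted ML services.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for a cleaned, outcome-labeled data set.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Same data, same evaluation; only the model changes.
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
```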
Because you're both past this product-market-fit point and in the scaling kind of phase, but using machine learning, using data, I want to ask each of you to offer some advice on something you might have done differently than you did.
So one of the things for us that I wish we had known then: we felt a lot of impostor syndrome in the first two or three months, especially around machine learning.
This idea that because we didn't have the huge data set, we weren't going to be
able to build something that was of value. And I think we learned that it was actually
okay to start out based on a smaller amount of data that was tailored not
just to a specific domain, but to a specific customer. And that allowed us to prime our learning loop
by having the relationship with that customer and getting their data, which then allowed us to make
our predictions better and get the next customer. So you got the flywheel going. So what you described
is essentially the machine learning version
of just early adopters
in the enterprise space, but using the data
side of it. And not being afraid
to do it at a really small scale.
Right. Like one customer.
One department. One department.
Yeah. Get the really clean, really
tailored data for whatever
you're trying to do. And then
do the flywheel very small because it blows
up very fast. But isn't LinkedIn
going to win because it has all the job descriptions?
Which is what led to sort of that impostor feeling,
I think: there were a lot of job
descriptions that someone else already had.
But here you are doing something that they can't
possibly imagine doing right now.
Because we went and found the data that actually LinkedIn doesn't
have. And a small amount of that data is more
valuable in some ways than the large
pile of sort of, you know, aggregate
data. When you talk about big data, obviously
the algorithms are really compelling and the results are
really compelling. But the other half of the equation
is presenting the data to the user in the way
that they can understand it, right? And actually
make use of it. Certainly in our arena,
it was insufficient as we learned
to just say, hey, here's this great stuff, it's going to tell you
what to do. For a lot of lawyers, that feels like a black box, right? And that black box,
when they have to go explain to their clients, hey, I did this because this thing said this
thing, you know, ugh, right? So a lot of what we do now is give greater transparency about what
the system is doing, what you can trust and what you can't trust. And building that trust
with the user, I think, is really important to actually drive that kind of usage because
at the end of the day, it is their jobs. And the more information you give them about how it's doing,
when it's doing a good job and when it's doing a bad job, I think the better we're seeing
the response be. That runs a little bit counter, and I think it's a really awesome learning, which is
sometimes people think that with machine learning, like, it means you do away with a lot of
UI, you do away with a lot of stuff. But it turns out, in a world where people have to be
comfortable with what led to the recommendation or to the solution, like actually presenting a journey
through that is actually helpful. Yeah. I mean, the key thing you want to do is present the AI as a
partner in people's endeavors, you know, in what they're trying to do. This is something
that's going to help you, you're going to work with it. And when you do that, of course,
you need a lot of trust. It's really about the human and the machine together, accomplishing something
that neither could accomplish on its own. And so, you know, so many machine learning products
have been about trying to either look backwards and tell you what happened or to predict the future.
It's much harder to predict the future and then help you change the future. And that's user experience
and model together. It's not just the algorithms; the secret sauce is the algorithms plus the product
experience that blends humans and computers together in this sort of beautiful learning loop.
Awesome. Well, I really want to thank Jensen and AJ for sharing their advice and wisdom on building
startups using machine learning in the enterprise. So thank you.