The Data Stack Show - 33: ML is a Data Quality Problem with Peter Gao from Aquarium Learning
Episode Date: April 14, 2021
On this week's episode of The Data Stack Show, Eric and Kostas talk with Peter Gao, co-founder and CEO at Aquarium Learning. A former engineer at Cruise Automation, Peter and Aquarium Learning help ML teams improve their model performance by improving their data.
Highlights from this week's episode include:
How getting hit by a drunk driver made researching self-driving cars personal for Peter (2:12)
Filtering out the hype in self-driving car news to get a clear picture of its state today (6:52)
The data required for a self-driving vehicle (13:56)
Operation Vacation and how Aquarium can help provide the tools to make models better (16:53)
Utilizing neural networks to index data (20:41)
How Aquarium fits in the ML stack (30:25)
Interesting use cases of Aquarium (33:59)
Distinguishing subclasses of machine learning (40:05)
Human involvement in machine learning (46:13)
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
Welcome back to the Data Stack Show.
We have another really interesting guest for you on the topic of machine learning, deep
learning in particular. We'll talk about
neural networks, but it's Peter from a company called Aquarium, and he's the founder of the
company. Really, really interesting stuff. My burning question, which is becoming more and
more common, is actually not specifically about machine learning, but Peter has a background in self-driving cars. And I'm just so interested
to ask him about being an early employee at Cruise, the self-driving car company. And so
that's what I want to ask him about. Kostas, what's your burning question?
My question is going to be more around data, to be honest. Data is a big thing in machine
learning anyway. The algorithm is always completely useless without the right data. And also the company that Peter is building is around
how we can provide better data at the end to train our models. The type of data that's usually used
in machine learning are a bit different from what we usually use in analytics. So on one hand,
you have more structured data. For machine learning, you have more unstructured data. So it's going to be very interesting to see and ask about
the aspect of quality and how you work with data and how you annotate data, how you prepare data
in the context of machine learning, which is going to be quite different from what we have
learned in the past with other guests that we had in the show. Yeah, absolutely. I think it'll be
cool to hear about someone who's building tooling for that as opposed to maybe a practitioner. Let's jump in and
get to know Peter. Let's do it. Peter, welcome to the show. We're so excited to have you as a guest
on the Data Stack Show. Yeah, glad to be here. Well, first of all, just give us a brief introduction so our listeners know who you are and tell
us about Aquarium Learning, your company's high-level overview.
Yeah, so I'm Peter.
I'm the co-founder and CEO of Aquarium, and we basically build a ML data management system
that makes it easier for teams to improve their ML models by improving the data sets
that they're trained on.
Typically, the best way to improve your model performance is just to hold the model code
fixed and kind of take a good look at the data set and find places where you can fix
bad labels or add more of certain important data.
And that generally tends to be the most efficient way to improve your model performance.
So we make it easier for people to go through that workflow of looking at their data, finding problems, fixing those problems,
and then retraining a better model. Very cool. Well, our listeners know by now that I have
tons of questions about that, that I'm going to do what I have fallen into the habit of doing.
Of course, we exchanged some communication beforehand and I looked at your background.
And one thing that we love to do for our listeners is just sort of learn where people came from, especially as it relates to sort of their journey with working with data.
And you have a lot of experience in the self-driving car space, which is really fascinating.
And, you know, there's, there's just so many
interesting things about that dealing with data and I mean, even, you know, sort of society and
economy, but I would love to ask you a couple of questions about that, if that's okay.
Yeah, sure. Um, I'd also have to talk about my kind of lead-up to that. It was a bit of a long story,
but yeah, that'd be great. Actually, that'd be great because I'm going to
dominate the conversation with a bunch of self-driving car questions for a minute. So if
you could just give us the run-up to that, I think that'd be awesome. Yeah, sure. So back in high
school, I was actually on the robotics team. That was a big part of my life back then, became the
captain of the team in senior year, and then kind of went into college. And, you know, at the time, all the
careers in robotics tended to be focused around defense, and I wasn't super excited about that.
So I went and worked in web for a little bit. So I did internships at Pinterest and Khan Academy.
And, you know, at that time, I kind of knew that machine learning was this really interesting
field that had a lot of potential. And so I was working kind of like a mix of sort of the web stack and like normal web engineering, and then also integrating machine
learning into that stack. So worked on, you know, that at Khan Academy and Pinterest. At Pinterest,
it was more for spam fraud detection. And then for Khan Academy, it was actually for predicting
the ability of students based on basically a diagnostic test
that we would give to them once they onboarded onto the tool. So we could give them the right
content. So I had kind of like this background of robotics and then also like a lot of exposure to
sort of the web stack and data pipelines behind it. And then, you know, when I was back in school,
you know, my research was in deep learning for object detection. And I actually
happened upon Cruise when, you know, that was kind of a sort of, I think, two person YC company.
And then I ended up interviewing when it was like around five people and then decided to go there
around like, you know, 18, I think I was the 18th employee. But it was personal for me, in part because,
number one, it was kind of a way to like come back to doing robotics and combining that with
like my machine learning and deep learning interest that started to become more and more
relevant for some of the perception tasks. And then the other thing was that I got hit by a
drunk driver in college and that was a pretty negative experience. And so I had a pretty
personal stake in the mission. So, and yeah, I like small companies, so it was a lot of fun.
Yeah, well, they're not 18 people anymore, like 2,000 at this point. Yeah. What an experience, and thank you
for sharing such a personal story and how cool that you later in your career got to return to your love of robotics and also
combine other professional experience. I mean, that's just such a, such a neat opportunity.
Yeah, definitely. Okay. So I will indulge myself in a couple of questions and then hand it over
to Kostas. So this is less on the technical side, but self-driving cars have had an interesting
hype cycle, you know, so they'll sort of, you know, go through or they've gone through, you know, okay,
self-driving cars, you know, did a test and the car passed and this is the future. And then it
kind of goes quiet for a while and then something else pops up. But I'm just interested to know from
someone who worked so closely on it, what is the, what's real and what's
hype and what's your perspective on that? I just, I know our listeners would be interested in that.
So I think if you look at some of like the really early self-driving stuff, like even those like
sort of corny videos from the fifties about cars that will drive themselves, you know, a lot of
the sort of issue with those sorts of setups was that you had to essentially
set up specialized infrastructure for self-driving cars. You know, like you would have to put rails in
the road or like magnets or something like that, so that the car would know where it was and
where it needed to go. And so that's kind of why it didn't really work out until, like, you know, I
think somewhere in like the 90s or early 2000s, there started to be kind of a resurgence in interest in self-driving and, you know, obviously
with the sort of like DARPA grand challenge and urban challenge that led into Waymo and stuff
like that. I think it was just kind of the demonstration that the sensors, the compute
technology, the sort of just hardware and software stack had evolved to the point where you could
actually do perception in a relatively workable way, which is, you know, being able to interpret
the world through the sensors rather than having to build specialized infrastructure for these
vehicles, for these robots, to work. And so I think, you know, the sort of critical piece that started
to make self-driving really get a lot closer to reality was, you know, the DARPA Urban Challenge was like somewhere in like 2008, 2009.
Yeah. Deep learning came onto the scene around 2012. And with deep learning, you suddenly had
this sort of technology that was really, well, you know, it just performed really well for a lot of perception tasks that were relevant for, you know, not only, like, you know, sort of contrived problems like ImageNet, but also for, like, real-world robotics problems like traffic light detection, object detection, classification, and things like that. And this boost of performance made it really workable to have systems that could reliably and safely
operate in the real world without too many modifications to the infrastructure. So, you know,
at least now with self-driving, you can go to Phoenix and if you have the Lyft app, you know,
you can call a Waymo car and they will pick you up and it will send you places without a driver
and it will be safer than a human. You know,
that is the state of the technology right now, and I think that part is underhyped. The part that is
overhyped is this idea that, you know, it's going to be everywhere tomorrow. And so, like, when you
look at these sorts of systems that are very complex, that are very safety critical, you need
to basically recertify them and readapt them to every sort of new domain that you want to deploy them into. So moving from, you know,
Phoenix to San Francisco, where you have more of an urban environment,
moving from San Francisco to New York,
where you have more like weather conditions like snow or like, you know,
sort of unique driving behavior that all takes like a certain amount of time.
And a lot of that time actually comes into sort of readapting these deep
learning models or, like, you know, a lot of the code inside of the sort of normal robotics stack for these new conditions. But so much money went into it and so many great people got together that it showed that real-life robotics
and applied deep learning was something that worked, as long as it was put into the correct
sort of system specifications. And it's something where it is super valuable in a lot of industries
that are not self-driving, that in a lot of cases are a lot simpler and a lot easier and just as
economically impactful. And that's kind of
like the customer base that we actually serve over at Aquarium. So, yeah.
Wow. Yeah. It's almost like
the hype happened too early where I was like, this is going to change the world. And I was like,
okay, it's actually going to take way longer. But then when you say something like you can just
download a consumer mobile app in Phoenix and get picked up by a driverless car and have, you know, a way safer experience than you can have under any other
circumstance. I mean, it's like, okay, the future is here. Like that's insane that you can just,
anyone can go do that in Phoenix. I think Neal Stephenson says it best when, you know,
the future is here, but it's just not very evenly distributed. And I think that's kind of the real reason why
people are underwhelmed with it. You know, it's here, but it's not everywhere. Yeah. I mean,
that's also, you know, AI, we've had several conversations around this with AI where like
AI branding is the craziest thing because it's like, some people fear it, some people, you know,
deny it, you know, and it's sort of one of those things that are like, oh, AI and self-driving cars
are going to change the world. And it's like, well, all you can do is drive cars around Phoenix.
I mean, that's not very cool.
Yeah.
And something, you know, like I want to like sort of emphasize is that like when we were starting off at Cruise, you know, back in like 2015, at that point, the state of the sort of tool chain around machine learning or specifically deep learning was just terrible.
It was non-existent.
You know, I worked on a project at Berkeley called Caffe, which was the sort of first deep learning framework after AlexNet. And at that
point in time, you know, the maintainers for it were these three graduate students who were
simultaneously trying to complete their PhDs and also maintain this like open source repo that was
being used by thousands and thousands of people. And of course, you know, like one of them is going to take priority over the others. And
so like when I got to Cruise, you know, we had to build all this stuff from scratch because there's
a lot of parts of that sort of ML workflow and you have to build like tools to make all of that
easier. And we had to build all of it from scratch, you know, back in the day. And now you look around
and there's so much great stuff out there that covers so many different
parts of the stack. And so now it is easier than ever to get something working. You know, it's easier
than ever to get, like, an MVP that functions at, like, sort of, like, 80% accuracy that you can present to
your boss and be like, yeah, we should invest more into this. But the part that doesn't have as much
tooling and doesn't have as much focus is the part around making this MVP work in production on a large variety of circumstances and at acceptable accuracy.
And that's what Aquarium really focuses on.
Basically taking a lot of the learnings that the self-driving field has already sort of grappled with and in a large extent, like already solved. And helping a lot of all these other people
who are working on machine learning,
working on deep learning and trying to make their models
adapt to the sort of circumstances they see
in the real world and make them better
and iterate on these models over time.
Yeah, well, that's a perfect segue for me
to end my monopoly on the conversation.
And Kostas, I know just from chatting with you today,
you have a bunch of burning questions about Aquarium.
So Kostas, take it away.
I will stop monopolizing the conversation.
Thank you, Eric.
Thank you so much.
So Peter, I have quite a few questions about Aquarium,
but before we go there,
quick question about your experience
with like self-driving cars
and more specifically with
data. So can you give us an idea of like what kind of data you are using when you're trying
to build all these different systems that enable like a self-driving car? Yeah. So if you think
about a self-driving car kind of as like this rough, you know, software block diagram, you know,
in one side of this block diagram is sensor input. So this is
stuff like LIDAR, this is stuff like cameras, this is stuff like accelerometer data, GPS,
all that stuff. And then out the other end comes steering actions, you know, like, you know,
the accelerator, you know, set it to 80%, or brake a little bit, or like steer left to this extent.
And then, you know, you look at the hardware stack, a little bit or like steer left to this extent. And then, you know,
you look at the hardware stack and of course, like there's a lot more stuff that's going on over
there. But when you look at the kind of data that you're handling on the input side, a lot of it
deals with sort of the sensor data that the car basically is capturing as it drives around in the
world, as well as kind of the more, you know, essentially consumer facing inputs.
Like I would like to get picked up here and I would like to go there.
And, you know, what car is going to be the one that's, you know, taking me, and then the sort of command and control aspect of it.
So there's a lot of data, but, you know,
happy to go into specific aspects of it if you're interested.
Yeah, yeah, sure.
So can you share with us a little bit more of your experience,
like with working with this data?
You mentioned earlier that back then when you started at Cruise, you didn't have the...
Not even the tool set was there, right?
So how was a typical day of working with data there at Cruise?
So I think the biggest distinction between a lot of these robotics use cases
and the more traditional web use cases that people are used to
is that if you look at a site like Google or Facebook or something like that, most of the
data that is being generated is quote unquote structured data. You know, these are sort of
like tabular data, things that can be described in sort of like a SQL database or an Excel spreadsheet.
Now in a robotics application with all these sensors, the vast majority of the data that's
being generated is unstructured data. It's primarily
imagery. It is primarily point clouds. It is things that are essentially really hard to index with
traditional data stores that were developed for web. And so like you have this big problem where,
you know, these vehicles are generating terabytes and terabytes of imagery per hour.
And now you have to figure out what to do with it. How do you store it? What do you basically
index and not index? And from the perspective of machine learning practitioner, what do you
train your model on? And that ends up being this huge problem, not only on the piping of where to
store things and how to process it, but also just in terms of like the sort of workflow and the intelligence on top of
it. Like, you know, what do you do to make this stream of just, you know, massive, you know,
onrush of data into something that solves your problem for you?
Super interesting. And how did your experience there drive you to create Aquarium
today? What were the challenges that you had and that you're trying to address with Aquarium today?
So in self-driving, of course, like, you know, you have these very stringent requirements on
your sort of system performance and your sort of machine learning model performance. And so
a lot of the work that we did there was around making sure that we could improve
our models consistently over time.
And the sort of like open secret
that a lot of applied machine learning practitioners
have come to is that most of the practical gains
in your machine learning model performance
come from improvements to the data it's trained on.
So what does this mean?
If you're just getting started
and you look at your labels and you look at your data and
a lot of it is incorrect, you know, either it's mislabeled or it's corrupted or something
like that, like you shouldn't expect your model to be any good.
You should probably clean up the bad data and bad labels and, you know, train it on
clean data.
And of course you're going to have like much better results.
And it's just such low hanging fruit that, you know that is kind of just the easiest thing to look at. And then, you know, the flip
side is also that when you look at the failures of your model, you know, these are not necessarily
things that can be tackled with sort of like PhD level changes to the model code. A lot of the
times it's sort of cases where you need to just go and collect more data of a certain difficult
scenario. And so in the self-driving use case, you know, the sort of common example I like to use
is that let's say you've trained a cone detector, you know, it takes, you know, your model takes in
an image, it says, okay, here's a cone or not a cone or something like that. And if you train this
sort of model, first off, it tends to do really badly on green cones. And
you're like, oh, what's going on? Like, you know, why isn't it at 100% accuracy? And you look into
it and you realize it's not doing well on the green cones because all of the cones that you
trained on were orange. So this model has never seen green cones before. It doesn't know what to
do with these green cones. And the solution here is you should go find more pictures of green cones
and collect them back and label them and retrain your model on it, and it will start to handle green cones.
And so that sort of process of understanding the failure modes, addressing them with the proper data curation, and then making sure that the retrained model is better than the previous one that you had,
that takes a lot of time if you don't have good tooling for it.
And so if you have good tooling for it, not only can this iterative process be really fast and really reliable in producing your improvements to your model,
but it's something where you can essentially take the ML engineer out of the equation,
where they don't need to be there every single day
hand-holding this like machine learning pipeline
from end to end.
Instead, this is something where you can just have
a sort of domain expert who understands
like what is a cone or not a cone
or what is good or what is bad
and have them click around in a user interface
and essentially improve the data set and improve the model.
So this is known, like Andrej Karpathy talks about this a lot, as Operation Vacation,
but specifically it's a way to reliably improve the model without needing extremely skilled
labor, just from looking at the data.
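The loop Peter is describing can be written down as a simple closed cycle. This is a sketch of the workflow rather than working code against any particular API; every callable passed in (train, evaluate, find_failures, collect_more, label) is a hypothetical stand-in for a team's real training, curation, and labeling tools.

```python
# Illustrative sketch of the data iteration loop described above; the callables
# passed in are hypothetical stand-ins for a team's real training, curation,
# and labeling tools, not any particular vendor's API.
def improve_model(dataset, train, evaluate, find_failures, collect_more, label, rounds=3):
    """Repeat: train, find the worst failure modes, gather and label more of that data."""
    model = train(dataset)
    best_score = evaluate(model)
    for _ in range(rounds):
        failure_modes = find_failures(model, dataset)   # e.g. "does badly on green cones"
        new_raw = collect_more(failure_modes)           # pull similar unlabeled data
        dataset = dataset + label(new_raw)              # human-in-the-loop labeling
        candidate = train(dataset)
        score = evaluate(candidate)
        if score > best_score:                          # only keep real improvements
            model, best_score = candidate, score
    return model, best_score
```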
And so when we look at, like, you know, who we work with at Aquarium, there's just a lot of people who have the same sort of problem where they're trying to get a model to work in production.
And we're basically giving them the tools to do the sort of same iteration cycle to make their models better and work in production.
Yeah, makes total sense.
So can you describe to us how you can put structure to this unstructured data? Because my assumption is,
and please correct me if I'm wrong on that,
but I assume that like the first step
is to take this unstructured data
and create some structure out of it, right?
Like create some metadata or these labels
that you mentioned that then you use
for organizing the work around like the model training.
So how is this done?
So the naive way that people try to build structure
around data tends to come from basically
assigning metadata on top of it.
So this is stuff that, you know, for example, you can put timestamps associated to when
like a piece of audio was captured, or you can, you know, get something about like, you
know, who was the speaker inside of this audio.
You can say which device it was captured from, you know, like that sort of stuff are basically convenient splits to be able to sort of index your data in the same way that you would with like structured data, right?
But this has like a lot of limitations because that means that if you want to capture any sort of variation in the underlying data, you have to pull it out into metadata. So either you are like, you know, having humans annotate all
of this stuff, or, you know, you have some way of automatically capturing all of the variation
in the underlying data. And that's just not practical in, you know, the vast majority of
use cases. So really, the magic of what we do with Aquarium is that we rely on neural networks to index the data for us.
So with a neural network, basically when you run it on a piece of data, you can extract out an activation,
a layer from the middle of the neural network, and produce this thing known as an embedding.
And this embedding is kind of like this vector that is what the neural network thought about this data point of this audio or imagery or whatever.
And then you can actually compare these vectors to each other and find, OK, here's actually a cluster of very similar data.
Or here is an outlier in the data set.
And so this neural network is essentially extracting structure out of this very messy input data. And by relying on this sort of aspect of
neural networks, we can actually tell you what is in your data set, what is the distribution of
your data set, what is the variation in your data set. And you can start to uncover patterns in your
model's performance. Like here's like a little cluster of green cones that your model consistently
fails on. And then we can also do things like search
within unlabeled data to find more examples of these green cones that you can therefore collect
and bring back and label instead of having to go look through a spreadsheet of a million images and
click on one link at a time to find the piece of data that you're looking for.
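To make the idea concrete, here is a minimal sketch of the kind of embedding extraction and similarity search Peter describes. It assumes a pretrained torchvision ResNet-50 as the backbone and a brute-force cosine search; the function names are illustrative rather than Aquarium's actual API. In practice, the same embed() pass over an unlabeled pool lets a reviewer start from one flagged example, such as a green cone, and pull back its nearest neighbors for labeling.

```python
# Illustrative sketch, not Aquarium's actual code: extract embeddings from a
# pretrained backbone and find the unlabeled examples most similar to a query.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Everything up to (but not including) ResNet-50's final classification layer
# acts as the embedder; its output is a 2048-dimensional activation per image.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
embedder = nn.Sequential(*list(backbone.children())[:-1])
embedder.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(images):
    """Map a list of PIL images to L2-normalized embedding vectors."""
    batch = torch.stack([preprocess(img) for img in images])
    vecs = embedder(batch).flatten(1)              # shape (N, 2048)
    return nn.functional.normalize(vecs, dim=1)    # unit length, so dot product = cosine similarity

def most_similar(query_vec, unlabeled_vecs, k=10):
    """Return indices of the k unlabeled examples closest to the query embedding."""
    sims = unlabeled_vecs @ query_vec
    return torch.topk(sims, k).indices.tolist()
```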
Oh, that's fascinating, actually. And I assume that still, I mean, you go through these embeddings,
creating these embeddings, which create, like,
extract some kind of structure out of your data.
Is there semantic information around that?
Or this is something that still a human has to do?
I assume that the neural network is not capable of, like, figuring out,
oh, in this image, you have green cones or cones, right? Like
the concept can be just a cone, doesn't matter about the color. Or is it also possible to do
that? How does it work, and how does it work together with a human operator? Yeah. So this is kind of known
as like unsupervised learning in machine learning literature. And so roughly what that means in
practice is that your embeddings are producing like clusters in your imagery and your data and your like unstructured sort of input.
And then a human operator can, instead of looking through like a million sort of just, you know, flat examples of data points, they can just focus on the clusters.
And you can look at each cluster and have a sense that, okay, this is all like, you know, the same red truck,
or this is all the same, like, you know, green can.
And then also you have this notion of similarity
where like, okay, here's a section
that is green cans versus red cans,
or here's a section where you have like,
you know, very boxy objects.
And the cool thing about this is that these activations,
these embeddings can be
extracted from pretty much any neural network. And so if you were trying to train a machine
learning model on your data to do a task, then you can basically extract out these embeddings
as a byproduct of your training process, and they will produce these really good clusterings.
And so for the human operator, really what it does is help them kind of look at this massive pool of data,
and it distills it down into these clusters or patterns that they can more easily look through and understand what is going on.
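A rough sketch of that cluster-review workflow might look like the following; the use of scikit-learn's k-means and the specific cluster counts are assumptions for illustration, not a description of Aquarium's internals.

```python
# Illustrative sketch, not Aquarium's actual code: cluster embeddings so a human
# reviews a handful of groups instead of millions of individual examples.
import numpy as np
from sklearn.cluster import KMeans

def cluster_for_review(embeddings, n_clusters=50, samples_per_cluster=5, seed=0):
    """Group (N, D) embedding vectors and pick representative examples per cluster.

    Returns {cluster_id: [indices of the examples closest to that cluster's center]},
    which a reviewer can skim to understand what each cluster contains.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    review_sets = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        review_sets[c] = members[np.argsort(dists)[:samples_per_cluster]].tolist()
    return review_sets
```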
Peter, isn't there a kind of like chicken-egg problem, though, the way that you describe it?
I mean, if I have to train first the neural networks to create the
embeddings, I need to have some data, right? But in order to do that, I have to annotate the data.
So how does this work in real life? So when someone is getting started with a machine
learning task, usually what you do is you can actually take a pre-trained model.
So there's a lot of neural networks that are trained on just general imagery or general
audio and stuff like that. And those models can be used to extract embeddings on data that they
have never seen before. So if you're trying to just collect a set of data that you're starting
from scratch and therefore you can't train your own model, then you can actually use these
pre-trained models to generate embeddings and to kind of organize this data upfront for you.
And of course the embeddings will not be super great. You know, sometimes they're going to look for similarity
and things that you particularly don't care about for your task. But then once you've sort of
bootstrapped a set of data that you can now train a model on, then you can basically go and train
your own model on that data and extract your own embeddings from it. And then now you have kind of this set of embeddings that is pretty well attuned to your task. And to some extent also, like, you know,
what these embeddings allow you to do is to, for example, uncover, like, here's like a pattern
of failure cases, right? Like, you know, we can go tell you, like, here's like a section of the
dataset where there's an edge case you haven't seen before. But there also sometimes has to be interaction with human labelers where, okay, like, you know, the model
thinks here is like, you know, a set of data of a certain type, but you will also want to send it
to a human workforce for them to basically check it, to QA it, to label it into a form that you
know is clean and that you're comfortable retraining your model on.
So I think it's not as much sort of automation in the sense of replacing the human,
as much as it's kind of an interactive feedback loop between the human and the machine in order to produce a better model.
Yeah, absolutely. And I think this is a kind of pattern that is very recurrent when it comes to machine learning and AI.
And that's something that we have discussed also with other people here who are coming from this space.
And it's very interesting to see that, at the end, the model, I mean, it's a black box that
takes information that has been curated by a human to learn and provide results that are
relevant to humans, right? So I think that's very interesting, and I think that's something that
should be heard more often out there, because you have, like, you know, all these people who
are like, oh, AI is going to be, you know, like a post-apocalyptic situation with Terminator
and stuff like that. So, yeah, I think it's very important to hear that from experts.
Yeah. And, and, you know, like to give you an example of kind of where this, you know,
happens inside of our product, you know, one of the things that we do with Aquarium is to surface
the places in your dataset where your model disagrees the most
with your labels, with your data. And we surface this to a human user and basically show them these
examples and ask, like, is the problem here that the model is wrong, that it is making a mistake
on these green cones or whatnot? Or is this a case where the labels
are wrong? Where like, for example, like a label is missing or it's misannotated or the data is
corrupted. And so ultimately the human has to be the person who judges that, right? That cannot be
resolved automatically. The human has to kind of give their intent of what they want this model to
do. It's kind of like training a coworker to do a task by basically,
you know, you have hired this new person, you give them some instructions on how to do their job,
and then they do their job for a little bit. You inspect their work and you're saying, okay,
this is stuff that you did well, this is the stuff that you did incorrectly, and you should change
up for next time. And by giving them that feedback, now they perform better in their job.
And it's the same way with these sorts of AI models, with these deep learning models, except right now it's like you have to give this feedback
by communicating in Morse code
with sticks that you're banging on a rock,
and it's really hard and it's really difficult.
So that's why we're trying to build this tooling
to make it more easy to interact and iterate
with these machine learning models to get what you want.
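As a concrete illustration of the disagreement surfacing Peter describes, a minimal classification-style sketch could rank examples by how strongly the model's prediction departs from the label; the scoring rule here is an assumption for the example, and real detection or segmentation tasks would compare boxes or polygons instead.

```python
# Illustrative sketch, not Aquarium's actual code: rank examples by how much the
# model disagrees with the label, so a human can triage model errors vs. label errors.
import numpy as np

def rank_disagreements(pred_probs, labels, top_k=100):
    """pred_probs: (N, C) softmax outputs; labels: (N,) integer class labels.

    Scores each example by the probability the model assigns to classes other
    than its label and returns the most suspicious examples first.
    """
    label_prob = pred_probs[np.arange(len(labels)), labels]
    disagreement = 1.0 - label_prob
    order = np.argsort(-disagreement)[:top_k]
    return [(int(i), float(disagreement[i])) for i in order]
```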
Yeah.
So let's talk a little bit more about the product itself
and how it works.
From what I understand,
like we are talking about an interactive process here,
working around the data and curating the data.
In order to do that,
it's one part is of course like the curation itself,
creating the labels,
getting like surfacing the embeddings
and representing these to the users.
But then whatever you do there has to be applied
on retraining your models, right?
And then go back and do the same again.
So how does Aquarium work and operate with the process
of training a model, the rest of the infrastructure
that a machine learning engineer is using today,
and how it fits in this overall, let's say, ML stack?
Yeah. So I think like if you were to look at the ML stack, let's say that you're tackling a problem,
you're tackling a problem, like, you know, in NLP, let's say you're doing, like, you know,
named entity recognition or something. Like, you know, when you have your problem, now you can
sort of work back towards like what tools you need. You know, like maybe
you need a labeling service to, you know, annotate these things into the text. You need some sort of
way to train your model in a distributed way really quickly. You need something for your
experiment management. You need something for deployment. You need something for monitoring.
And there's a lot of different components in the stack. And so some sort of, you know, some approaches by other vendors is to essentially offer all
of this stuff in the pipeline in one complete package.
And what we've seen from working with a lot of machine learning teams is that, you know,
if you were to consider the analogy to web, that would be like you go to like Squarespace
and you click around in Squarespace and now you can
set up this very like, you know, nice, but very basic website. But as soon as you want to do
something more complex, you know, if you want to build Facebook, you kind of instead go towards
this model where you string together a lot of different tools into a tool chain that works
out for you. So you combine Sentry with, you know, like a Django server on top of AWS, you know,
you do all sorts of sort
of stuff where the engineer essentially is like stitching together these pipes to create a
cohesive product. And with Aquarium, you know, we see that is kind of like the mode that a lot of
serious machine learning teams are moving towards where they kind of stitch together a lot of
different tools that are best in class for like, you know, training or for labeling or for deployment
or whatever. And Aquarium basically sits on top of that and is kind of like a workflow layer.
So what we do is integrate into whatever sort of data stores or labeling provider or like
model type with a training system that you have and kind of give you this high level
overview of, okay, this is what your data set looks like.
Here's what your model performance looks like. Here are the places where they disagree. And here's an engine in which you can
basically understand where the failures are happening, triage them, and then take resolutions
on them by, for example, identifying, okay, here's like a section that you're not doing very well on,
and then helping you collect more of that data within the app and then sending it off to a
labeling provider to be annotated, and then basically, you know, triggering a training run after that,
and then allowing you to compare the difference between your new and your old model within
Aquarium. And so we kind of are this interesting mix between, like, Jira and Sentry for the machine
learning stack, where we kind of sit on top of whatever infrastructure
people already have built internally. And we are more just telling them, hey, this is the thing
that you need to do next to make your model better, helping them do that by dispatching tasks
to different tools and parts of the pipeline that they've already built and helping them basically
take actions to improve
their model, not necessarily in Aquarium, but through Aquarium.
It's interesting. And can you
give us an interesting story around one of your customers, or like something with the data that
happened there? The reason I'm asking this question is because anything that has to do with ML and the
actual work itself behind ML,
it's something that it's a little bit opaque to the people out there. Everyone like sees or thinks
about the magic that happens, like we have a self-driving car, right? But can you help us a
little bit understand the work that is done behind that? Yeah, so I can actually give you a few
because I think, you know, the favorite part of my job and the reason I left Cruise to start Aquarium is that there are so many really interesting, awesome problem domains
that people are applying deep learning to. And they're really fun and useful and just unexpected,
honestly. Like some of our customers are doing deep learning on trash. And it turns out that
that's a very lucrative industry to be analyzing what people are recycling
or what food people are throwing away
and giving insights to, for example,
like the recycling center
to know how to sort different pieces of recycling
or to the kitchen owner
to decide what food they need to make less of.
And that is like something I never would have thought of
when I was working on self-driving.
And like, yeah, there's other people who are working on agriculture. There's other people who
are working on logistics and drones and like so many like just disparate places like, you know,
surveillance and like, you know, industrial inspection and stuff like that. And it's so
fun just like getting to know all these people who are doing such interesting stuff and helping them
really. And so I can tell you about one of our customers that we wrote a case study on,
it's called Sterblue, and they're a company based out of Europe. And what they do is basically they
have a stack that allows you to input like drone or aerial imagery for inspection of critical
infrastructure. So this is like, you know, wind turbines, power lines, cooling towers
and power plants. And the way that people used to inspect this stuff was literally you get a ladder
or you get some climbing hooks and you climb this pole up this power line and you go and you look to
make sure that there's no like corrosion or like, you know, damage or whatever to the power line.
And of course this is something that is like very time consuming,
very expensive.
You're going to miss a lot of stuff and it's dangerous because you're sending
like a person to climb up this power line.
And so what Sterblue does is they take this imagery and they analyze it,
you know, number one,
using aerial imagery instead of requiring someone to climb up physically.
And then number two, being able to inspect this imagery with a combination of human experts and
deep learning models to find defects and surface them to the sort of owner of like the grid or
something like that so that they can direct maintenance towards it. And of course, like,
you know, the advantage of this model is that you can go and just inspect way more stuff way more efficiently, and catch a lot more problems before they happen.
And so for them, you know, they've trained this deep learning model that is kind of working in
concert with a team of experts. And they wanted to make this model better in order to be able to
handle just more miles of power lines more efficiently without needing to
rely on like this very limited pool of human experts who are going to take quite a while to
get through all of that data. And so we helped them look at their model and they realized like,
okay, like, you know, where's our model doing badly? And they realized that most of the problems
were actually just with the data. You know, there are cases where like there was like one or two labelers that they were
working with who were kind of consistently making mistakes.
And they were able to go find that and catch that and give sort of like corrective feedback
to the labelers so they could produce good data.
And in certain cases, you know, like sort of in their legacy way of
doing data labeling, they were using a different standard.
They were drawing these very large polygons on top of certain defects instead of very tight polygons around the actual area where the defect was occurring.
You know, instead of drawing like a polygon around like, you know,
a hole in the wood, they were drawing it on the entire, you know,
like, power pole line. And so we helped them uncover,
like, this is the issue, and this is why your
model is kind of outputting weird stuff. And they're like, oh, wow. Yeah. Okay. That makes
sense. And they were able to go back and they were able to do a pass through their labels and fix
them to, you know, adhere to this common standard of like, you know, small polygons around the
actual defect. And when they retrained the model, it got like 13% better. And that was like a week
of work.
And it was just such low hanging fruit that they didn't really know it was even there until they looked.
And then, you know, based on that, they were able to cover just hundreds of miles of power
lines a lot more quickly.
They were able to cut the sort of requirements of their human experts in half and cut the
labeling costs in half.
And, you know, of course they made their customers
much happier. And, you know, I can tell you another story about a customer that we worked with in
industrial inspection. I can tell you some stories from like, you know, different sort of customers
and different sort of domains, like, you know, one of them in agriculture, but, you know, it's,
it's something where number one, I think it's just so great. There's all these different
exciting applications, but number two, I'm also surprised that the same playbook works extremely well across
all these different applications. It's something where you wouldn't think it was something that
was going to be a common way to improve all of them. But with the magic of deep learning,
you can apply this repeatable playbook to a fairly common set of models and achieve the same great results.
Yeah, yeah.
And I think it's amazing on how many different use cases are out there where deep learning is used and people just have no idea about it.
We all focus on what you hear about self-driving cars mainly, to be honest, and anything
that has to do with surveillance. But it's amazing, and I think it's something important for
people to hear all these different, like, amazing use cases that are out there that don't just,
as you said, they don't just reduce costs; in some cases they also save lives, right? Because
climbing on these poles and trying, like, to figure out if there's a defect
there, it's a dangerous job. It's not something that's easy to do. So Peter, two last questions
from me, and then I'll let Eric continue with his questions. First of all, it's about the data again.
Can you give us a sense of what are the most commonly used data in machine learning today?
So I think it's critical to sort of distinguish that there's a lot of, you know, subclasses of machine learning.
So, you know, if we were to go back to kind of like the late 90s and early 2000s, machine
learning has been very successfully deployed in a lot of web applications for things like
recommendations or forecasting or predictions
and things like that. So this is like, you know, if you're on Google and you're clicking around,
you know, what do you recommend to the top of the list? Or if you're trying to forecast what is like
your future revenue based on your previous revenue or something like that, you know,
these are problems that are relatively well understood
and have been applied in a lot of use cases successfully in the early 2000s.
And a lot of this is because it's something where you're kind of getting the data for free
from like user actions on your site or from just like, you know,
seeing, you know, like the present versus the past and then trying to predict the future.
And all this data tended to be kind of like tabular data, like recommendations and ads targeting and price forecasting. That's
all like, you know, stuff that you can put into a SQL database or spreadsheet. So this is like a
class of data that is still, I think, extremely prevalent and extremely, you know, value adding,
and it's very common. Now, the sort of data that we deal with in our line of work with
deep learning tends to be more like unstructured data. So this is a lot of people who are dealing
with imagery, a lot of people who are dealing with audio and NLP sort of text use cases,
and then some people dealing with, for example, 3D point clouds that are generated
from LiDARs, or like CAD models and things like that. But in this sort of new wave of deep learning
as a subset of machine learning, there's kind of more of an emphasis on unstructured data, and that
unstructured data tends to, you know, have a lot more interaction with the real world, with the
messiness of the real world as well, instead of just, like, kind of the tabular sort of clean,
isolated nature of like actions
within a web app or something.
And I think, you know, beyond that,
the thing about this paradigm
of working with unstructured data
is that the data does not come for free.
So instead of this being something
where it's kind of like a prediction
or a forecasting problem, where you kind of are just like trying to like refine your guesses of what people will do in the future based on what they did in the past.
Now you're kind of doing something that is more like automation in terms of your workflow, where you are trying to get humans to do a certain task for you.
And sometimes you have to pay them to do labeling of bounding boxes or whatnot. And then you're essentially telling your model to try and not only imitate it,
but also to generalize from that set of data to data it's never seen before. And so this sort of
new model of doing deep learning and the requirements around it leads to a pretty
different workflow. So I think some parts are in common, like a lot of like the stuff that you need to use for crunching data and moving it from place to
place is definitely in common. But then there's a lot of differences in particular for like,
you know, the fact that you're using a deep learning model that has to be trained on GPUs
at scale, or the fact that now you have to annotate data, or the fact that now you have
to kind of think about like, you know, what is the right data to annotate, which, you know,
we think a lot about. Do you see any use case like today, or like,
do you expect to see something in the future where deep learning can be used with more structured
data? Yeah, actually, what we're seeing right now is that some groups are actually using deep
learning on structured data in really interesting ways. There's a lot of sort of like graph
convolutional models. I think if you were to look at some of the more advanced groups, you know, I think internal to Google and Facebook,
they're already moving over to deep learning models. I think if you were to look at, for
example, Instacart, Instacart has surprisingly done a lot of stuff with deep learning for
basically predicting what people should pick up inside of grocery stores and in what order and
stuff like that, which is really fascinating. And I think the reason why it hasn't been as
widespread so far is just because people kind of have been using sort of old school, non-deep
learning models for quite a while. And there's a lot of inertia that carries over from that,
especially since it performs pretty well. But then it gets to the point where when you start tackling really complex problems
or when you have like really, really, really big data sets,
that's the point where sort of like the new age
of deep learning models offers like way better performance.
Yeah, that's super, super interesting.
Eric, the stage is yours.
I have learned an incredible amount. And
I also have to say, I'm sure our listeners, at least some of them felt the same way. But when
you gave the trash example, I had to take a minute and think, okay, what if someone's running deep
learning on my trash, what is it saying about me? And I got, I had this moment of like, that is so
crazy. Super duper interesting.
Two questions, because I know we're getting close to time here.
One, and this is just interested in your perspective, because you've kind of seen the data tooling and data workflows come of age in a way. as I think about what we've learned on the show over so many episodes is that
even when we're talking about really advanced technologies, there still seems to be,
I guess, if you just, if you break it down, not, I would say like unexpected, but especially to me,
who doesn't have a background in technology, a surprising amount of manual work that still goes
into some of this stuff. And I think about aquarium specifically where there's all sorts
of value, but I think about just the workflows pre-aquarium and post-aquarium as you've described
them. And it's amazing how much just effort and work that it saves in automation. And I guess,
you know, living in an age where we have self-driving cars, it's surprising to me. And I just loved your perspective on that
because it seems to be still so pervasive, even though we're using really, really advanced tools.
Yeah. You know, like one of the examples that I like to raise up when we talk about this is that
before Aquarium, a lot of the times the sort of standard of tooling for people is like spreadsheets or Jupyter notebooks.
Or like, you know, I remember working on a project where our visualizer and our data set organization system was Mac Preview.
And it was a bunch of folders with images in them on my local hard drive that we were labeling.
You know, and this is just like the reality that it's just still
really hard to work with data and the sort of paradigm that, you know, machine learning is
about, you know, like if you look at the way that we write code, there's so many great tools out
there for debugging your code, profiling your code, understanding what is going on in your code.
And, you know, with data, it's still something where like people are kind of waking up to the fact that you need tools to make that process of understanding and
improvement much easier. And so I think there's always going to be some amount of labor or at
least in the next like, you know, four years or so, because someone has got to say to the machine
learning model, this is what I want, right?
It's not something where you can necessarily write a spec as a product manager of exactly
what type of attributes are required to classify something as a cat.
You know, like it's hard to write that down.
And so the sort of process of working on machine learning tends to be a lot more iterative
where you kind of give
it examples of you know cats and then you see where it fails and then you kind of correct it
and then you know continue going on that front so i think there's going to always be some amount of
like human involvement in that front but then so much of like the the sort of unnecessary toil
that happens right now in machine learning is about trying to make sense of like, I have
millions and millions of data points. And where's the section of this data set that I need to focus
my human attention on that is most important for improving this model? Totally. I think that's such
an elegant way to put it where you, another subject that we've talked about on the show a bunch, but the
human involvement in machine learning is so critical, but unnecessary toil is I think a much
better term to describe what I was talking about, you know, that just still seems so pervasive and
how cool that you're building tools to help solve that. One last question here. I won't make that promise because I'm horrible at keeping it, but maybe one last question is when it comes to data for machine learning
applications, you sort of have this interesting issue of critical mass, right? So it becomes valuable when you have enough data to train models and, you know, you sort of have enough inputs to make it really valuable when producing the outputs.
And you work across such a wide variety of different industries with your customers. What are your thoughts on the threshold there? Because I know that there are a lot of companies out there who maybe run like a really tight ship on sort of their general data practice and are
wanting to explore machine learning. What's the critical mass? And does that vary in terms of
types of data? I'd just be interested in your perspective on, I guess, what's the low water
mark as far as the threshold and sort of quantity and types of data? Yeah. So the reality is that it actually just depends on your application. It's really hard to
kind of like say, you know, one size fits all type scenario, like, you know, what is the minimum
threshold for it to perform well? It actually depends on like how complicated your problem is,
how variable your input data is, and whether you have access to
like some sort of pre-trained model or not that can kind of cut out a lot of the work of learning
from the process. So I think, you know, my personal like rule of thumb for working with imagery with
like a pre-trained model is that you want to get something on the order of like 10,000 examples to kind of like start off with. And, you know, you can usually just like human
annotate up to like 10,000 without too much like cost to yourself. But then, you know, the sort of
more general way to understand this is that you can actually do something known as an ablation study,
where if you have some set of data, let's say that you have like a thousand examples,
then what you can do is you can train,
you can set aside like a hundred examples
as an evaluation set,
and then you can train on 100 examples of the remainder
or like 200 or 300 or 400 or 500 or 600, 700, 800, 900.
And then you can see like of these models
that you've trained on different sizes of the
data set, like how well they do on that evaluation set of the hundred. And if you see that you can
add more data and the model performance is going to, it's getting way better as you add
additional bits of data, then you should probably go get some more data. But if you're starting to
get to the point where you have diminishing returns from just generically adding data, you know, that's the point where you have to be really intelligent
about what data you add to get the most improvement to your model. Because then at that point,
most of your error cases are actually, you know, kind of in the long tail, they're edge cases.
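Peter's ablation-study suggestion can be sketched in a few lines: hold out a fixed evaluation set, train on progressively larger slices of the remainder, and check whether the metric is still climbing. The train_model and evaluate functions below are hypothetical placeholders for whatever training and metric code a team already has.

```python
# Illustrative sketch of the ablation study described above; train_model and
# evaluate are hypothetical stand-ins for a team's existing training and metric code.
import random

def learning_curve(examples, train_model, evaluate, eval_size=100, step=100, seed=0):
    """Train on growing subsets of the data and report metric vs. training-set size."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    eval_set, pool = shuffled[:eval_size], shuffled[eval_size:]
    curve = []
    for n in range(step, len(pool) + 1, step):
        model = train_model(pool[:n])               # retrain on the first n examples
        curve.append((n, evaluate(model, eval_set)))
    # If the metric is still rising at the largest n, more generic data will help;
    # if it has flattened, targeted collection of edge cases matters more than volume.
    return curve
```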
So one last thing I actually want to leave you all with, I know that was the last question,
and we're like, you know, now at two o'clock, but I think the thing that we want to do with
Aquarium in terms of the long-term vision is that we want to be able to make a system
that a person who's not an ML expert, who is someone who's just an expert in their domain of agriculture or, you know, waste recycling or
whatnot can go into this nice UI and click some buttons and get a model to do what they want and
to improve it over time. That's really our end goal with Aquarium. And that's the thing that
we're building towards every day. Very cool. Well, Peter, it's been an incredible conversation, as we say to almost all of our
guests, I think, maybe all of them. We'd love to check back in as you continue to build out Aquarium
and see how you're doing maybe in another six months or so and have you back on the show.
Yeah, sounds great. Well, it was great talking and let's keep in touch.
As always, amazing conversation. It's so interesting to meet people and hear their
backgrounds. I'm going to, there was lots of technical stuff. So I'm going to say that I think
one of the most interesting parts of the conversation that really stuck out to me was
Peter's obviously incredibly intelligent and articulate, but there's an underlying passion
there based on his life experiences, you know, sort of from early childhood interest in robotics to going through
sort of a traumatic experience, you know, related to vehicles. And I am just always amazed to see
people building things that are reacting to or built upon really deep life experiences they've had. So I think it
was a privilege that he was vulnerable enough to share some of those things with us and really
appreciated it. Yeah, absolutely. It was a fascinating discussion, actually. And Eric,
one of the things that happened to me during this conversation is that I realized that when we are
talking about self-driving cars, actually we are building robots.
For some reason, I didn't think about this before our conversation today. I was thinking of it more
like a software kind of problem, but actually what we are doing is building a robot, which is amazing.
Anyway, that's my realization. I totally agree. It's one of those things where you say it out loud and it sounds like the simplest conclusion to come to.
But until he actually said that, I hadn't made the connection either, which is so funny.
Yeah, because usually when someone says the word robot, what's the first thing that you think about?
It's Boston Dynamics, right? Like these robots that dance, that try to walk like a human, and all these things. But at the
end, that's exactly what we are doing with a car when we want it to be self-driving. We are building
a robot. Anyway, it was a very fascinating conversation. It was very interesting to hear
about the techniques that they are using for, let's say, extracting structure out
of unstructured data, using the same neural networks that are also used as
the models themselves, with all this theory around the embeddings that Peter was mentioning.
And two things that I'd like our audience to pay attention to.
One is, again, also from Peter, we heard that the relationship between technology, AI, ML,
and humans is much more of a synergistic relationship than the
antagonistic relationship that the media are trying to portray out there, which is great to
hear from him. That's one thing. And the other thing is that at the end, it's all about the data,
right? If you don't have the right data, if the quality of the data is low, no matter how good
your algorithm or your model is, you're not going to have the results that you need.
So I'm really looking forward to having him again on a future show.
Yeah, absolutely.
And for all of our listeners who were also thinking about Autobots and Decepticons in the context of Transformers, when Kostas said, what do you think about when you think about robots?
I did the same thing. I didn't have quite as mature of an initial thought as Kostas.
So you're in good company if you thought about Transformers.
Yeah, and Terminator.
Don't forget Terminator.
And Terminator.
That's right, 2030, we're getting closer.
Great, well, thanks again for joining us on the show.
Please subscribe on your favorite podcast network.
That way you'll get notified of new episodes every week.
We have an incredible lineup in the next couple of weeks.
So you'll want to make sure to catch every show.
And until next time, thanks for joining us.
The Data Stack Show is brought to you by RudderStack,
the complete customer data pipeline solution.
Learn more at rudderstack.com.