The Data Stack Show - 33: ML is a Data Quality Problem with Peter Gao from Aquarium Learning
Episode Date: April 14, 2021
On this week's episode of The Data Stack Show, Eric and Kostas talk with Peter Gao, co-founder and CEO at Aquarium Learning. A former engineer at Cruise Automation, Peter and Aquarium Learning help ML teams improve their model performance by improving their data.
Highlights from this week's episode include:
How getting hit by a drunk driver made researching self-driving cars personal for Peter (2:12)
Filtering out the hype in self-driving car news to get a clear picture of its state today (6:52)
The data required for a self-driving vehicle (13:56)
Operation Vacation and how Aquarium can help provide the tools to make models better (16:53)
Utilizing neural networks to index data (20:41)
How Aquarium fits in the ML stack (30:25)
Interesting use cases of Aquarium (33:59)
Distinguishing subclasses of machine learning (40:05)
Human involvement in machine learning (46:13)
The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.
RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
Welcome back to the Data Stack Show.
We have another really interesting guest for you on the topic of machine learning, deep
learning in particular. We'll talk about
neural networks, but it's Peter from a company called Aquarium, and he's the founder of the
company. Really, really interesting stuff. My burning question, which is becoming more and
more common, is actually not specifically about machine learning, but Peter has a background in self-driving cars. And I'm just so interested
to ask him about being an early employee at Cruise, the self-driving car company. And so
that's what I want to ask him about. Kostas, what's your burning question?
My question is going to be more around data, to be honest. Data is a big thing in machine
learning anyway. The algorithm is always completely useless without the right data. And also the company that Peter is building is around
how we can provide better data at the end to train our models. The type of data that's usually used
in machine learning are a bit different from what we usually use in analytics. So on one hand,
you have more structured data. For machine learning, you have more unstructured data. So it's going to be very interesting to see and ask about
the aspect of quality and how you work with data and how you annotate data, how you prepare data
in the context of machine learning, which is going to be quite different from what we have
learned in the past with other guests that we had in the show. Yeah, absolutely. I think it'll be
cool to hear about someone who's building tooling for that as opposed to maybe a practitioner. Let's jump in and
get to know Peter. Let's do it. Peter, welcome to the show. We're so excited to have you as a guest
on the Data Stack Show. Yeah, glad to be here. Well, first of all, just give us a brief introduction so our listeners know who you are and tell
us about Aquarium Learning, your company's high-level overview.
Yeah, so I'm Peter.
I'm the co-founder and CEO of Aquarium, and we basically build a ML data management system
that makes it easier for teams to improve their ML models by improving the data sets
that they're trained on.
Typically, the best way to improve your model performance is just to hold the model code
fixed and kind of take a good look at the data set and find places where you can fix
bad labels or add more of certain important data.
And that generally tends to be the most efficient way to improve your model performance.
So we make it easier for people to go through that workflow of looking at their data, finding problems, fixing those problems,
and then retraining a better model. Very cool. Well, our listeners know by now that I have
tons of questions about that, that I'm going to do what I have fallen into the habit of doing.
Of course, we exchanged some communication beforehand and I looked at your background.
And one thing that we love to do for our listeners is just sort of learn where people came from, especially as it relates to sort of their journey with working with data.
And you have a lot of experience in the self-driving car space, which is really fascinating.
And, you know, there's, there's just so many
interesting things about that dealing with data and I mean, even, you know, sort of society and
economy, but I would love to ask you a couple of questions about that, if that's okay.
Yeah, sure. Um, I'd also have to talk about my kind of lead-up to that. It was a bit of a long story,
but yeah, that'd be great. Actually, that'd be great because I'm going to
dominate the conversation with a bunch of self-driving car questions for a minute. So if
you could just give us the run-up to that, I think that'd be awesome. Yeah, sure. So back in high
school, I was actually on the robotics team. That was a big part of my life back then, became the
captain of the team in senior year, and then kind of went into college. And, you know, at the time, all the
careers in robotics tended to be focused around defense, and I wasn't super excited about that.
So I went and worked in web for a little bit. So I did internships at Pinterest and Khan Academy.
And, you know, at that time, I kind of knew that machine learning was this really interesting
field that had a lot of potential. And so I was working kind of like a mix of sort of the web stack and like normal web engineering, and then also integrating machine
learning into that stack. So worked on, you know, that at Khan Academy and Pinterest. At Pinterest,
it was more for spam fraud detection. And then for Khan Academy, it was actually for predicting
the ability of students based on basically a diagnostic test
that we would give to them once they onboarded onto the tool. So we could give them the right
content. So I had kind of like this background of robotics and then also like a lot of exposure to
sort of the web stack and data pipelines behind it. And then, you know, when I was back in school,
you know, my research was in deep learning for object detection. And I actually
happened upon Cruise when, you know, that was kind of a sort of, I think, two person YC company.
And then I ended up interviewing when it was like around five people and then decided to go there
around like, you know, 18, I think I was the 18th employee. But it was personal for me, in part because,
number one, it was kind of a way to like come back to doing robotics and combining that with
like my machine learning and deep learning interest that started to become more and more
relevant for some of the perception tasks. And then the other thing was that I got hit by a
drunk driver in college and that was a pretty negative experience. And so I had a pretty
personal stake in the mission. So, and yeah, I like small companies, so it was a lot of fun.
Yeah, well, they're not 18 people anymore, like 2,000 at this point. Yeah. What an experience, and thank you
for sharing such a personal story and how cool that you later in your career got to return to your love of robotics and also
combine other professional experience. I mean, that's just such a, such a neat opportunity.
Yeah, definitely. Okay. So I will indulge myself in a couple of questions and then hand it over
to Kostas. So this is less on the technical side, but self-driving cars have had an interesting
hype cycle, you know, so they'll sort of, you know, go through or they've gone through, you know, okay,
self-driving cars, you know, did a test and the car passed and this is the future. And then it
kind of goes quiet for a while and then something else pops up. But I'm just interested to know from
someone who worked so closely on it, what is the, what's real and what's
hype and what's your perspective on that? I just, I know our listeners would be interested in that.
So I think if you look at some of like the really early self-driving stuff, like even those like
sort of corny videos from the fifties about cars that will drive themselves, you know, a lot of
the sort of issue with those sorts of setups was that you had to essentially
set up specialized infrastructure for self-driving cars. You know, like you would have to put rails in
the road or like magnets or something like that, so that the car would know where it was and
where it needed to go. And so that's kind of why it didn't really work out until, like, you know, I
think somewhere in like the 90s or early 2000s, there started to be kind of a resurgence in interest in self-driving and, you know, obviously
with the sort of like DARPA grand challenge and urban challenge that led into Waymo and stuff
like that. I think it was just kind of the demonstration that the sensors, the compute
technology, the sort of just hardware and software stack had evolved to the point where you could
actually do perception in a relatively workable way, which is, you know, being able to interpret
the world through the sensors rather than having to build specialized infrastructure for these
vehicles, for these robots, to work. And so I think, you know, the sort of critical piece that started
to make self-driving really get a lot closer to reality was, you know, the DARPA Urban Challenge was like somewhere in like 2008, 2009.
Yeah. Deep learning came onto the scene around 2012. And with deep learning, you suddenly had
this sort of technology that was really, well, you know, it just performed really well for a lot of perception tasks that were relevant for, you know, not only, like, you know, sort of contrived problems like ImageNet, but also for, like, real-world robotics problems like traffic light detection, object detection, classification, and things like that. And this boost of performance made it really workable to have systems that could reliably and safely
operate in the real world without too many modifications to the infrastructure. So, you know,
at least now with self-driving, you can go to Phoenix and if you have the Lyft app, you know,
you can call a Waymo car and they will pick you up and it will send you places without a driver
and it will be safer than a human. You know,
that is the state of the technology right now, and I think that part is underhyped. The part that is
overhyped is this idea that, you know, it's going to be everywhere tomorrow. And so, like, when you
look at these sorts of systems that are very complex, that are very safety critical, you need
to basically recertify them and readapt them to every sort of new domain that you want to deploy them into. So moving from, you know,
Phoenix to San Francisco, where you have more of an urban environment,
moving from San Francisco to New York,
where you have more like weather conditions like snow or like, you know,
sort of unique driving behavior that all takes like a certain amount of time.
And a lot of that time actually comes into sort of readapting these deep
learning models or, like, you know, a lot of the code inside of the sort of normal robotics stack for these new conditions. But so much money went into it and so many great people got together that it showed that real-life robotics
and applied deep learning was something that worked, as long as it was put into the correct
sort of system specifications. And it's something where it is super valuable in a lot of industries
that are not self-driving, that in a lot of cases are a lot simpler and a lot easier and just as
economically impactful. And that's kind of
like the customer base that we actually serve over at Aquarium. So, yeah.
Wow. Yeah. It's almost like
the hype happened too early where I was like, this is going to change the world. And I was like,
okay, it's actually going to take way longer. But then when you say something like you can just
download a consumer mobile app in Phoenix and get picked up by a driverless car and have, you know, a way safer experience than you can have under any other
circumstance. I mean, it's like, okay, the future is here. Like that's insane that you can just,
anyone can go do that in Phoenix. I think Neal Stephenson says it best when, you know,
the future is here, but it's just not very evenly distributed. And I think that's kind of the real reason why
people are underwhelmed with it. You know, it's here, but it's not everywhere. Yeah. I mean,
that's also, you know, AI, we've had several conversations around this with AI where like
AI branding is the craziest thing because it's like, some people fear it, some people, you know,
deny it, you know, and it's sort of one of those things that are like, oh, AI and self-driving cars
are going to change the world. And it's like, well, all you can do is drive cars around Phoenix.
I mean, that's not very cool.
Yeah.
And something, you know, like I want to like sort of emphasize is that like when we were starting off at Cruise, you know, back in like 2015, at that point, the state of the sort of tool chain around machine learning or specifically deep learning was just terrible.
It was non-existent.
You know, I worked on a project at Berkeley called Caffe, which was the sort of first deep learning framework after AlexNet. And at that
point in time, you know, the maintainers for it were these three graduate students who were
simultaneously trying to complete their PhDs and also maintain this like open source repo that was
being used by thousands and thousands of people. And of course, you know, like one of them is going to take priority over the others. And
so like when I got to Cruise, you know, we had to build all this stuff from scratch because there's
a lot of parts of that sort of ML workflow and you have to build like tools to make all of that
easier. And we had to build all of it from scratch, you know, back in the day. And now you look around
and there's so much great stuff out there that covers so many different
parts of the stack. And so now it is easier than ever to get something working. You know, it's easier
than ever to get, like, an MVP that functions at, like, sort of, like, 80% accuracy that you can present to
your boss and be like, yeah, we should invest more into this. But the part that doesn't have as much
tooling and doesn't have as much focus is the part around making this MVP work in production on a large variety of circumstances and at acceptable accuracy.
And that's what Aquarium really focuses on.
Basically taking a lot of the learnings that the self-driving field has already sort of grappled with and in a large extent, like already solved. And helping a lot of all these other people
who are working on machine learning,
working on deep learning and trying to make their models
adapt to the sort of circumstances they see
in the real world and make them better
and iterate on these models over time.
Yeah, well, that's a perfect segue for me
to end my monopoly on the conversation.
And Kostas, I know just from chatting with you today,
you have a bunch of burning questions about Aquarium.
So Kostas, take it away.
I will stop monopolizing the conversation.
Thank you, Eric.
Thank you so much.
So Peter, I have quite a few questions about Aquarium,
but before we go there,
quick question about your experience
with like self-driving cars
and more specifically with
data. So can you give us an idea of like what kind of data you are using when you're trying
to build all these different systems that enable like a self-driving car? Yeah. So if you think
about a self-driving car kind of as like this rough, you know, software block diagram, you know,
in one side of this block diagram is sensor input. So this is
stuff like LIDAR, this is stuff like cameras, this is stuff like accelerometer data, GPS,
all that stuff. And then out the other end comes steering actions, you know, like, you know,
the accelerator, you know, set it to 80%, or brake a little bit, or like steer left to this extent.
And then, you know, you look at the hardware stack, a little bit or like steer left to this extent. And then, you know,
you look at the hardware stack and of course, like there's a lot more stuff that's going on over
there. But when you look at the kind of data that you're handling on the input side, a lot of it
deals with sort of the sensor data that the car basically is capturing as it drives around in the
world, as well as kind of the more, you know, essentially consumer facing inputs.
Like I would like to get picked up here and I would like to go there.
And, you know, what car is going to be the one that's, you know, taking me, and then the sort of command and control aspect of it.
So there's a lot of data, but, you know,
happy to go into specific aspects of it if you're interested.
Yeah, yeah, sure.
So can you share with us a little bit more of your experience,
like with working with this data?
You mentioned earlier that back then when you started at Cruise, you didn't have the...
Not even the tool set was there, right?
So how was a typical day of working with data there at Cruise?
So I think the biggest distinction between a lot of these robotics use cases
and the more traditional web use cases that people are used to
is that if you look at a site like Google or Facebook or something like that, most of the
data that is being generated is quote unquote structured data. You know, these are sort of
like tabular data, things that can be described in sort of like a SQL database or an Excel spreadsheet.
Now in a robotics application with all these sensors, the vast majority of the data that's
being generated is unstructured data. It's primarily
imagery. It is primarily point clouds. It is things that are essentially really hard to index with
traditional data stores that were developed for web. And so like you have this big problem where,
you know, these vehicles are generating terabytes and terabytes of imagery per hour.
And now you have to figure out what to do with it. How do you store it? What do you basically
index and not index? And from the perspective of machine learning practitioner, what do you
train your model on? And that ends up being this huge problem, not only on the piping of where to
store things and how to process it, but also just in terms of like the sort of workflow and the intelligence on top of
it. Like, you know, what do you do to make this stream of just, you know, massive, you know,
onrush of data into something that solves your problem for you?
Super interesting. And how did your experience there drive you to create Aquarium
today? What were the challenges that you had and that you're trying to address with Aquarium today?
So in self-driving, of course, like, you know, you have these very stringent requirements on
your sort of system performance and your sort of machine learning model performance. And so
a lot of the work that we did there was around making sure that we could improve
our models consistently over time.
And the sort of like open secret
that a lot of applied machine learning practitioners
have come to is that most of the practical gains
in your machine learning model performance
come from improvements to the data it's trained on.
So what does this mean?
If you're just getting started
and you look at your labels and you look at your data and
a lot of it is incorrect, you know, either it's mislabeled or it's corrupted or something
like that, like you shouldn't expect your model to be any good.
You should probably clean up the bad data and bad labels and, you know, train it on
clean data.
And of course you're going to have like much better results.
And it's just such low hanging fruit that, you know that is kind of just the easiest thing to look at. And then, you know, the flip
side is also that when you look at the failures of your model, you know, these are not necessarily
things that can be tackled with sort of like PhD level changes to the model code. A lot of the
times it's sort of cases where you need to just go and collect more data of a certain difficult
scenario. And so in the self-driving use case, you know, the sort of common example I like to use
is that let's say you've trained a cone detector, you know, it takes, you know, your model takes in
an image, it says, okay, here's a cone or not a cone or something like that. And if you train this
sort of model, first off, it tends to do really badly on green cones. And
you're like, oh, what's going on? Like, you know, why isn't it at 100% accuracy? And you look into
it and you realize it's not doing well on the green cones because all of the cones that you
trained on were orange. So this model has never seen green cones before. It doesn't know what to
do with these green cones. And the solution here is you should go find more pictures of green cones
and collect them back and label them and retrain your model on it, and it will start to handle green cones.
And so that sort of process of understanding the failure modes, addressing them with the proper data curation, and then making sure that the retrained model is better than the previous one that you had,
that takes a lot of time if you don't have good tooling for it.
And so if you have good tooling for it, not only can this iterative process be really fast and really reliable in producing your improvements to your model,
but it's something where you can essentially take the ML engineer out of the equation,
where they don't need to be there every single day
hand-holding this like machine learning pipeline
from end to end.
Instead, this is something where you can just have
a sort of domain expert who understands
like what is a cone or not a cone
or what is good or what is bad
and have them click around in a user interface
and essentially improve the data set and improve the model.
So this is known, like Andrej Karpathy talks about this a lot, as Operation Vacation,
but specifically it's a way to reliably improve the model without needing extremely skilled
labor, just from looking at the data.
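The loop Peter is describing can be written down as a simple closed cycle. This is a sketch of the workflow rather than working code against any particular API; every callable passed in (train, evaluate, find_failures, collect_more, label) is a hypothetical stand-in for a team's real training, curation, and labeling tools.

```python
# Illustrative sketch of the data iteration loop described above; the callables
# passed in are hypothetical stand-ins for a team's real training, curation,
# and labeling tools, not any particular vendor's API.
def improve_model(dataset, train, evaluate, find_failures, collect_more, label, rounds=3):
    """Repeat: train, find the worst failure modes, gather and label more of that data."""
    model = train(dataset)
    best_score = evaluate(model)
    for _ in range(rounds):
        failure_modes = find_failures(model, dataset)   # e.g. "does badly on green cones"
        new_raw = collect_more(failure_modes)           # pull similar unlabeled data
        dataset = dataset + label(new_raw)              # human-in-the-loop labeling
        candidate = train(dataset)
        score = evaluate(candidate)
        if score > best_score:                          # only keep real improvements
            model, best_score = candidate, score
    return model, best_score
```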
And so when we look at, like, you know, who we work with at Aquarium, there's just a lot of people who have the same sort of problem where they're trying to get a model to work in production.
And we're basically giving them the tools to do the sort of same iteration cycle to make their models better and work in production.
Yeah, makes total sense.
So can you describe to us how you can put structure to this unstructured data? Because my assumption is,
and please correct me if I'm wrong on that,
but I assume that like the first step
is to take this unstructured data
and create some structure out of it, right?
Like create some metadata or these labels
that you mentioned that then you use
for organizing the work around like the model training.
So how is this done?
So the naive way that people try to build structure
around data tends to come from basically
assigning metadata on top of it.
So this is stuff that, you know, for example, you can put timestamps associated to when
like a piece of audio was captured, or you can, you know, get something about like, you
know, who was the speaker inside of this audio.
You can say which device it was captured from, you know, like that sort of stuff are basically convenient splits to be able to sort of index your data in the same way that you would with like structured data, right?
But this has like a lot of limitations because that means that if you want to capture any sort of variation in the underlying data, you have to pull it out into metadata. So either you are like, you know, having humans annotate all
of this stuff, or, you know, you have some way of automatically capturing all of the variation
in the underlying data. And that's just not practical in, you know, the vast majority of
use cases. So really, the magic of what we do with Aquarium is that we rely on neural networks to index the data for us.
So with a neural network, basically when you run it on a piece of data, you can extract out an activation,
a layer from the middle of the neural network, and produce this thing known as an embedding.
And this embedding is kind of like this vector that is what the neural network thought about this data point of this audio or imagery or whatever.
And then you can actually compare these vectors to each other and find, OK, here's actually a cluster of very similar data.
Or here is an outlier in the data set.
And so this neural network is essentially extracting structure out of this very messy input data. And by relying on this sort of aspect of
neural networks, we can actually tell you what is in your data set, what is the distribution of
your data set, what is the variation in your data set. And you can start to uncover patterns in your
model's performance. Like here's like a little cluster of green cones that your model consistently
fails on. And then we can also do things like search
within unlabeled data to find more examples of these green cones that you can therefore collect
and bring back and label instead of having to go look through a spreadsheet of a million images and
click on one link at a time to find the piece of data that you're looking for.
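To make the idea concrete, here is a minimal sketch of the kind of embedding extraction and similarity search Peter describes. It assumes a pretrained torchvision ResNet-50 as the backbone and a brute-force cosine search; the function names are illustrative rather than Aquarium's actual API. In practice, the same embed() pass over an unlabeled pool lets a reviewer start from one flagged example, such as a green cone, and pull back its nearest neighbors for labeling.

```python
# Illustrative sketch, not Aquarium's actual code: extract embeddings from a
# pretrained backbone and find the unlabeled examples most similar to a query.
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# Everything up to (but not including) ResNet-50's final classification layer
# acts as the embedder; its output is a 2048-dimensional activation per image.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
embedder = nn.Sequential(*list(backbone.children())[:-1])
embedder.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(images):
    """Map a list of PIL images to L2-normalized embedding vectors."""
    batch = torch.stack([preprocess(img) for img in images])
    vecs = embedder(batch).flatten(1)              # shape (N, 2048)
    return nn.functional.normalize(vecs, dim=1)    # unit length, so dot product = cosine similarity

def most_similar(query_vec, unlabeled_vecs, k=10):
    """Return indices of the k unlabeled examples closest to the query embedding."""
    sims = unlabeled_vecs @ query_vec
    return torch.topk(sims, k).indices.tolist()
```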
Oh, that's fascinating, actually. And I assume that still, I mean, you go through these embeddings,
creating these embeddings, which create, like,
extract some kind of structure out of your data.
Is there semantic information around that?
Or this is something that still a human has to do?
I assume that the neural network is not capable of, like, figuring out,
oh, in this image, you have green cones or cones, right? Like
the concept can be just a cone, doesn't matter about the color. Or is it also possible to do
that? How does it work, and how does it work together with a human operator? Yeah. So this is kind of known
as like unsupervised learning in machine learning literature. And so roughly what that means in
practice is that your embeddings are producing like clusters in your imagery and your data and your like unstructured sort of input.
And then a human operator can, instead of looking through like a million sort of just, you know, flat examples of data points, they can just focus on the clusters.
And you can look at each cluster and have a sense that, okay, this is all like, you know, the same red truck,
or this is all the same, like, you know, green can.
And then also you have this notion of similarity
where like, okay, here's a section
that is green cans versus red cans,
or here's a section where you have like,
you know, very boxy objects.
And the cool thing about this is that these activations,
these embeddings can be
extracted from pretty much any neural network. And so if you were trying to train a machine
learning model on your data to do a task, then you can basically extract out these embeddings
as a byproduct of your training process, and they will produce these really good clusterings.
And so for the human operator, really what it does is help them kind of look at this massive pool of data,
and it distills it down into these clusters or patterns that they can more easily look through and understand what is going on.
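A rough sketch of that cluster-review workflow might look like the following; the use of scikit-learn's k-means and the specific cluster counts are assumptions for illustration, not a description of Aquarium's internals.

```python
# Illustrative sketch, not Aquarium's actual code: cluster embeddings so a human
# reviews a handful of groups instead of millions of individual examples.
import numpy as np
from sklearn.cluster import KMeans

def cluster_for_review(embeddings, n_clusters=50, samples_per_cluster=5, seed=0):
    """Group (N, D) embedding vectors and pick representative examples per cluster.

    Returns {cluster_id: [indices of the examples closest to that cluster's center]},
    which a reviewer can skim to understand what each cluster contains.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    review_sets = {}
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        review_sets[c] = members[np.argsort(dists)[:samples_per_cluster]].tolist()
    return review_sets
```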
Peter, isn't there a kind of like chicken-egg problem, though, the way that you describe it?
I mean, if I have to train first the neural networks to create the
embeddings, I need to have some data, right? But in order to do that, I have to annotate the data.
So how does this work in real life? So when someone is getting started with a machine
learning task, usually what you do is you can actually take a pre-trained model.
So there's a lot of neural networks that are trained on just general imagery or general
audio and stuff like that. And those models can be used to extract embeddings on data that they
have never seen before. So if you're trying to just collect a set of data that you're starting
from scratch and therefore you can't train your own model, then you can actually use these
pre-trained models to generate embeddings and to kind of organize this data upfront for you.
And of course the embeddings will not be super great. You know, sometimes they're going to look for similarity
and things that you particularly don't care about for your task. But then once you've sort of
bootstrapped a set of data that you can now train a model on, then you can basically go and train
your own model on that data and extract your own embeddings from it. And then now you have kind of this set of embeddings that is pretty well attuned to your task. And to some extent also, like, you know,
what these embeddings allow you to do is to, for example, uncover, like, here's like a pattern
of failure cases, right? Like, you know, we can go tell you, like, here's like a section of the
dataset where there's an edge case you haven't seen before. But there also sometimes has to be interaction with human labelers where, okay, like, you know, the model
thinks here is like, you know, a set of data of a certain type, but you will also want to send it
to a human workforce for them to basically check it, to QA it, to label it into a form that you
know is clean and that you're comfortable retraining your model on.
So I think it's not as much sort of automation in the sense of replacing the human,
as much as it's kind of an interactive feedback loop between the human and the machine in order to produce a better model.
Yeah, absolutely. And I think this is a kind of pattern that is very recurrent when it comes to machine learning and AI.
And that's something that we have discussed also with other people here who are coming from this space.
And it's very interesting to see that, at the end, the model, I mean, it's a black box that
takes information that has been curated by a human to learn and provide results that are
relevant to humans, right? So I think that's very interesting, and I think that's something that
should be heard more often out there, because you have, like, you know, all these people who
are like, oh, AI is going to be, you know, like a post-apocalyptic situation with Terminator
and stuff like that. So, yeah, I think it's very important to hear that from experts.
Yeah. And, and, you know, like to give you an example of kind of where this, you know,
happens inside of our product, you know, one of the things that we do with Aquarium is to surface
the places in your dataset where your model disagrees the most
with your labels, with your data. And we surface this to a human user and basically show them these
examples and ask, like, is the problem here that the model is wrong, that it is making a mistake
on these green cones or whatnot? Or is this a case where the labels
are wrong? Where like, for example, like a label is missing or it's misannotated or the data is
corrupted. And so ultimately the human has to be the person who judges that, right? That cannot be
resolved automatically. The human has to kind of give their intent of what they want this model to
do. It's kind of like training a coworker to do a task by basically,
you know, you have hired this new person, you give them some instructions on how to do their job,
and then they do their job for a little bit. You inspect their work and you're saying, okay,
this is stuff that you did well, this is the stuff that you did incorrectly, and you should change
up for next time. And by giving them that feedback, now they perform better in their job.
And it's the same way with these sorts of AI models, with these deep learning models, except right now it's like you have to give this feedback
by communicating in Morse code
with sticks that you're banging on a rock,
and it's really hard and it's really difficult.
So that's why we're trying to build this tooling
to make it more easy to interact and iterate
with these machine learning models to get what you want.
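As a concrete illustration of the disagreement surfacing Peter describes, a minimal classification-style sketch could rank examples by how strongly the model's prediction departs from the label; the scoring rule here is an assumption for the example, and real detection or segmentation tasks would compare boxes or polygons instead.

```python
# Illustrative sketch, not Aquarium's actual code: rank examples by how much the
# model disagrees with the label, so a human can triage model errors vs. label errors.
import numpy as np

def rank_disagreements(pred_probs, labels, top_k=100):
    """pred_probs: (N, C) softmax outputs; labels: (N,) integer class labels.

    Scores each example by the probability the model assigns to classes other
    than its label and returns the most suspicious examples first.
    """
    label_prob = pred_probs[np.arange(len(labels)), labels]
    disagreement = 1.0 - label_prob
    order = np.argsort(-disagreement)[:top_k]
    return [(int(i), float(disagreement[i])) for i in order]
```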
Yeah.
So let's talk a little bit more about the product itself
and how it works.
From what I understand,
like we are talking about an interactive process here,
working around the data and curating the data.
In order to do that,
it's one part is of course like the curation itself,
creating the labels,
getting like surfacing the embeddings
and representing these to the users.
But then whatever you do there has to be applied
on retraining your models, right?
And then go back and do the same again.
So how does Aquarium work and operate with the process
of training a model, the rest of the infrastructure
that a machine learning engineer is using today,
and how it fits in this overall, let's say, ML stack?
Yeah. So I think like if you were to look at the ML stack, let's say that you're tackling a problem,
you're tackling a problem, like, you know, in NLP, let's say you're doing, like, you know,
named entity recognition or something. Like, you know, when you have your problem, now you can
sort of work back towards like what tools you need. You know, like maybe
you need a labeling service to, you know, annotate these things into the text. You need some sort of
way to train your model in a distributed way really quickly. You need something for your
experiment management. You need something for deployment. You need something for monitoring.
And there's a lot of different components in the stack. And so some sort of, you know, some approaches by other vendors is to essentially offer all
of this stuff in the pipeline in one complete package.
And what we've seen from working with a lot of machine learning teams is that, you know,
if you were to consider the analogy to web, that would be like you go to like Squarespace
and you click around in Squarespace and now you can
set up this very like, you know, nice, but very basic website. But as soon as you want to do
something more complex, you know, if you want to build Facebook, you kind of instead go towards
this model where you string together a lot of different tools into a tool chain that works
out for you. So you combine Sentry with, you know, like a Django server on top of AWS, you know,
you do all sorts of sort
of stuff where the engineer essentially is like stitching together these pipes to create a
cohesive product. And with Aquarium, you know, we see that is kind of like the mode that a lot of
serious machine learning teams are moving towards where they kind of stitch together a lot of
different tools that are best in class for like, you know, training or for labeling or for deployment
or whatever. And Aquarium basically sits on top of that and is kind of like a workflow layer.
So what we do is integrate into whatever sort of data stores or labeling provider or like
model type with a training system that you have and kind of give you this high level
overview of, okay, this is what your data set looks like.
Here's what your model performance looks like. Here are the places where they disagree. And here's an engine in which you can
basically understand where the failures are happening, triage them, and then take resolutions
on them by, for example, identifying, okay, here's like a section that you're not doing very well on,
and then helping you collect more of that data within the app and then sending it off to a
labeling provider to be annotated, and then basically, you know, triggering a training run after that,
and then allowing you to compare the difference between your new and your old model within
Aquarium. And so we kind of are this interesting mix between, like, Jira and Sentry for the machine
learning stack, where we kind of sit on top of whatever infrastructure
people already have built internally. And we are more just telling them, hey, this is the thing
that you need to do next to make your model better, helping them do that by dispatching tasks
to different tools and parts of the pipeline that they've already built and helping them basically
take actions to improve
their model, not necessarily in Aquarium, but through Aquarium.
It's interesting. And can you
give us an interesting story around one of your customers, or like something with the data that
happened there? The reason I'm asking this question is because anything that has to do with ML and the
actual work itself behind ML,
it's something that it's a little bit opaque to the people out there. Everyone like sees or thinks
about the magic that happens, like we have a self-driving car, right? But can you help us a
little bit understand the work that is done behind that? Yeah, so I can actually give you a few
because I think, you know, the favorite part of my job and the reason I left Cruise to start Aquarium is that there are so many really interesting, awesome problem domains
that people are applying deep learning to. And they're really fun and useful and just unexpected,
honestly. Like some of our customers are doing deep learning on trash. And it turns out that
that's a very lucrative industry to be analyzing what people are recycling
or what food people are throwing away
and giving insights to, for example,
like the recycling center
to know how to sort different pieces of recycling
or to the kitchen owner
to decide what food they need to make less of.
And that is like something I never would have thought of
when I was working on self-driving.
And like, yeah, there's other people who are working on agriculture. There's other people who
are working on logistics and drones and like so many like just disparate places like, you know,
surveillance and like, you know, industrial inspection and stuff like that. And it's so
fun just like getting to know all these people who are doing such interesting stuff and helping them
really. And so I can tell you about one of our customers that we wrote a case study on,
it's called Sterblue, and they're a company based out of Europe. And what they do is basically they
have a stack that allows you to input like drone or aerial imagery for inspection of critical
infrastructure. So this is like, you know, wind turbines, power lines, cooling towers
and power plants. And the way that people used to inspect this stuff was literally you get a ladder
or you get some climbing hooks and you climb this pole up this power line and you go and you look to
make sure that there's no like corrosion or like, you know, damage or whatever to the power line.
And of course this is something that is like very time consuming,
very expensive.
You're going to miss a lot of stuff and it's dangerous because you're sending
like a person to climb up this power line.
And so what Sterblue does is they take this imagery and they analyze it,
you know, number one,
using aerial imagery instead of requiring someone to climb up physically.
And then number two, being able to inspect this imagery with a combination of human experts and
deep learning models to find defects and surface them to the sort of owner of like the grid or
something like that so that they can direct maintenance towards it. And of course, like,
you know, the advantage of this model is that you can go and just inspect way more stuff way more efficiently, and catch a lot more problems before they happen.
And so for them, you know, they've trained this deep learning model that is kind of working in
concert with a team of experts. And they wanted to make this model better in order to be able to
handle just more miles of power lines more efficiently without needing to
rely on like this very limited pool of human experts who are going to take quite a while to
get through all of that data. And so we helped them look at their model and they realized like,
okay, like, you know, where's our model doing badly? And they realized that most of the problems
were actually just with the data. You know, there are cases where like there was like one or two labelers that they were
working with who were kind of consistently making mistakes.
And they were able to go find that and catch that and give sort of like corrective feedback
to the labelers so they could produce good data.
And in certain cases, you know, like sort of in their legacy way of
doing data labeling, they were using a different standard.
They were drawing these very large polygons on top of certain defects instead of very tight polygons around the actual area where the defect was occurring.
You know, instead of drawing like a polygon around like, you know,
a hole in the wood, they were drawing it on the entire, you know,
like, power pole line. And so we helped them uncover,
like, this is the issue, and this is why your
model is kind of outputting weird stuff. And they're like, oh, wow. Yeah. Okay. That makes
sense. And they were able to go back and they were able to do a pass through their labels and fix
them to, you know, adhere to this common standard of like, you know, small polygons around the
actual defect. And when they retrained the model, it got like 13% better. And that was like a week
of work.
And it was just such low hanging fruit that they didn't really know it was even there until they looked.
And then, you know, based on that, they were able to cover just hundreds of miles of power
lines a lot more quickly.
They were able to cut the sort of requirements of their human experts in half and cut the
labeling costs in half.
And, you know, of course they made their customers
much happier. And, you know, I can tell you another story about a customer that we worked with in
industrial inspection. I can tell you some stories from like, you know, different sort of customers
and different sort of domains, like, you know, one of them in agriculture, but, you know, it's,
it's something where number one, I think it's just so great. There's all these different
exciting applications, but number two, I'm also surprised that the same playbook works extremely well across
all these different applications. It's something where you wouldn't think it was something that
was going to be a common way to improve all of them. But with the magic of deep learning,
you can apply this repeatable playbook to a fairly common set of models and achieve the same great results.
Yeah, yeah.
And I think it's amazing on how many different use cases are out there where deep learning is used and people just have no idea about it.
We all focus on what you hear about self-driving cars mainly, to be honest, and anything
that has to do with surveillance. But it's amazing, and I think it's something important for
people to hear all these different, like, amazing use cases that are out there that don't just,
as you said, they don't just reduce costs; in some cases they also save lives, right? Because
climbing on these poles and trying, like, to figure out if there's a defect
there, it's a dangerous job. It's not something that's easy to do. So Peter, two last questions
from me, and then I'll let Eric continue with his questions. First of all, it's about the data again.
Can you give us a sense of what are the most commonly used data in machine learning today?
So I think it's critical to sort of distinguish that there's a lot of, you know, subclasses of machine learning.
So, you know, if we were to go back to kind of like the late 90s and early 2000s, machine
learning has been very successfully deployed in a lot of web applications for things like
recommendations or forecasting or predictions
and things like that. So this is like, you know, if you're on Google and you're clicking around,
you know, what do you recommend to the top of the list? Or if you're trying to forecast what is like
your future revenue based on your previous revenue or something like that, you know,
these are problems that are relatively well understood
and have been applied in a lot of use cases successfully in the early 2000s.
And a lot of this is because it's something where you're kind of getting the data for free
from like user actions on your site or from just like, you know,
seeing, you know, like the present versus the past and then trying to predict the future.
And all this data tended to be kind of like tabular data, like recommendations and ads targeting and price forecasting. That's
all like, you know, stuff that you can put into a SQL database or spreadsheet. So this is like a
class of data that is still, I think, extremely prevalent and extremely, you know, value adding,
and it's very common. Now, the sort of data that we deal with in our line of work with
deep learning tends to be more like unstructured data. So this is a lot of people who are dealing
with imagery, a lot of people who are dealing with audio and NLP sort of text use cases,
and then some people dealing with, for example, 3D point clouds that are generated
from LiDARs, or like CAD models and things like that. But in this sort of new wave of deep learning
as a subset of machine learning, there's kind of more of an emphasis on unstructured data, and that
unstructured data tends to, you know, have a lot more interaction with the real world, with the
messiness of the real world as well, instead of just, like, kind of the tabular sort of clean,
isolated nature of like actions
within a web app or something.
And I think, you know, beyond that,
the thing about this paradigm
of working with unstructured data
is that the data does not come for free.
So instead of this being something
where it's kind of like a prediction
or a forecasting problem, where you kind of are just like trying to like refine your guesses of what people will do in the future based on what they did in the past.
Now you're kind of doing something that is more like automation in terms of your workflow, where you are trying to get humans to do a certain task for you.
And sometimes you have to pay them to do labeling of bounding boxes or whatnot. And then you're essentially telling your model to try and not only imitate it,
but also to generalize from that set of data to data it's never seen before. And so this sort of
new model of doing deep learning and the requirements around it leads to a pretty
different workflow. So I think some parts are in common, like a lot of like the stuff that you need to use for crunching data and moving it from place to
place is definitely in common. But then there's a lot of differences in particular for like,
you know, the fact that you're using a deep learning model that has to be trained on GPUs
at scale, or the fact that now you have to annotate data, or the fact that now you have
to kind of think about like, you know, what is the right data to annotate, which, you know,
we think a lot about. Do you see any use case like today, or like,
do you expect to see something in the future where deep learning can be used with more structured
data? Yeah, actually, what we're seeing right now is that some groups are actually using deep
learning on structured data in really interesting ways. There's a lot of sort of like graph
convolutional models. I think if you were to look at some of the more advanced groups, you know, I think internal to Google and Facebook,
they're already moving over to deep learning models. I think if you were to look at, for
example, Instacart, Instacart has surprisingly done a lot of stuff with deep learning for
basically predicting what people should pick up inside of grocery stores and in what order and
stuff like that, which is really fascinating. And I think the reason why it hasn't been as
widespread so far is just because people kind of have been using sort of old school, non-deep
learning models for quite a while. And there's a lot of inertia that carries over from that,
especially since it performs pretty well. But then it gets to the point where when you start tackling really complex problems
or when you have like really, really, really big data sets,
that's the point where sort of like the new age
of deep learning models offers like way better performance.
Yeah, that's super, super interesting.
Eric, the stage is yours.
I have learned an incredible amount. And
I also have to say, I'm sure our listeners, at least some of them felt the same way. But when
you gave the trash example, I had to take a minute and think, okay, what if someone's running deep
learning on my trash, what is it saying about me? And I got, I had this moment of like, that is so
crazy. Super duper interesting.
Two questions, because I know we're getting close to time here.
One, and this is just interested in your perspective, because you've kind of seen the data tooling and data workflows come of age in a way. as I think about what we've learned on the show over so many episodes is that
even when we're talking about really advanced technologies, there still seems to be,
I guess, if you just, if you break it down, not, I would say like unexpected, but especially to me,
who doesn't have a background in technology, a surprising amount of manual work that still goes
into some of this stuff. And I think about aquarium specifically where there's all sorts
of value, but I think about just the workflows pre-aquarium and post-aquarium as you've described
them. And it's amazing how much just effort and work that it saves in automation. And I guess,
you know, living in an age where we have self-driving cars, it's surprising to me. And I just loved your perspective on that
because it seems to be still so pervasive, even though we're using really, really advanced tools.
Yeah. You know, like one of the examples that I like to raise up when we talk about this is that
before Aquarium, a lot of the times the sort of standard of tooling for people is like spreadsheets or Jupyter notebooks.
Or like, you know, I remember working on a project where our visualizer and our data set organization system was Mac Preview.
And it was a bunch of folders with images in them on my local hard drive that we were labeling.
You know, and this is just like the reality that it's just still
really hard to work with data and the sort of paradigm that, you know, machine learning is
about, you know, like if you look at the way that we write code, there's so many great tools out
there for debugging your code, profiling your code, understanding what is going on in your code.
And, you know, with data, it's still something where like people are kind of waking up to the fact that you need tools to make that process of understanding and
improvement much easier. And so I think there's always going to be some amount of labor or at
least in the next like, you know, four years or so, because someone has got to say to the machine
learning model, this is what I want, right?
It's not something where you can necessarily write a spec as a product manager of exactly
what type of attributes are required to classify something as a cat.
You know, like it's hard to write that down.
And so the sort of process of working on machine learning tends to be a lot more iterative
where you kind of give
it examples of you know cats and then you see where it fails and then you kind of correct it
and then you know continue going on that front so i think there's going to always be some amount of
like human involvement in that front but then so much of like the the sort of unnecessary toil
that happens right now in machine learning is about trying to make sense of like, I have
millions and millions of data points. And where's the section of this data set that I need to focus
my human attention on that is most important for improving this model? Totally. I think that's such
an elegant way to put it where you, another subject that we've talked about on the show a bunch, but the
human involvement in machine learning is so critical, but unnecessary toil is I think a much
better term to describe what I was talking about, you know, that just still seems so pervasive and
how cool that you're building tools to help solve that. One last question here. I won't make that promise because I'm horrible at keeping it, but maybe one last question is when it comes to data for machine learning
applications, you sort of have this interesting issue of critical mass, right? So it becomes valuable when you have enough data to train models and, you know, you sort of have enough inputs to make it really valuable when producing the outputs.
And you work across such a wide variety of different industries with your customers. What are your thoughts on the threshold there? Because I know that there are a lot of companies out there who maybe run like a really tight ship on sort of their general data practice and are
wanting to explore machine learning. What's the critical mass? And does that vary in terms of
types of data? I'd just be interested in your perspective on, I guess, what's the low water
mark as far as the threshold and sort of quantity and types of data? Yeah. So the reality is that it actually just depends on your application. It's really hard to
kind of like say, you know, one size fits all type scenario, like, you know, what is the minimum
threshold for it to perform well? It actually depends on like how complicated your problem is,
how variable your input data is, and whether you have access to
like some sort of pre-trained model or not that can kind of cut out a lot of the work of learning
from the process. So I think, you know, my personal like rule of thumb for working with imagery with
like a pre-trained model is that you want to get something on the order of like 10,000 examples to kind of like start off with. And, you know, you can usually just like human
annotate up to like 10,000 without too much like cost to yourself. But then, you know, the sort of
more general way to understand this is that you can actually do something known as an ablation study,
where if you have some set of data, let's say that you have like a thousand examples,
then what you can do is you can train,
you can set aside like a hundred examples
as an evaluation set,
and then you can train on 100 examples of the remainder
or like 200 or 300 or 400 or 500 or 600, 700, 800, 900.
And then you can see like of these models
that you've trained on different sizes of the
data set, like how well they do on that evaluation set of the hundred. And if you see that you can
add more data and the model performance is going to, it's getting way better as you add
additional bits of data, then you should probably go get some more data. But if you're starting to
get to the point where you have diminishing returns from just generically adding data, you know, that's the point where you have to be really intelligent
about what data you add to get the most improvement to your model. Because then at that point,
most of your error cases are actually, you know, kind of in the long tail, they're edge cases.
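Peter's ablation-study suggestion can be sketched in a few lines: hold out a fixed evaluation set, train on progressively larger slices of the remainder, and check whether the metric is still climbing. The train_model and evaluate functions below are hypothetical placeholders for whatever training and metric code a team already has.

```python
# Illustrative sketch of the ablation study described above; train_model and
# evaluate are hypothetical stand-ins for a team's existing training and metric code.
import random

def learning_curve(examples, train_model, evaluate, eval_size=100, step=100, seed=0):
    """Train on growing subsets of the data and report metric vs. training-set size."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    eval_set, pool = shuffled[:eval_size], shuffled[eval_size:]
    curve = []
    for n in range(step, len(pool) + 1, step):
        model = train_model(pool[:n])               # retrain on the first n examples
        curve.append((n, evaluate(model, eval_set)))
    # If the metric is still rising at the largest n, more generic data will help;
    # if it has flattened, targeted collection of edge cases matters more than volume.
    return curve
```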
So one last thing I actually want to leave you all with, I know that was the last question,
and we're like, you know, now at two o'clock, but I think the thing that we want to do with
Aquarium in terms of the long-term vision is that we want to be able to make a system
that a person who's not an ML expert, who is someone who's just an expert in their domain of agriculture or, you know, waste recycling or
whatnot can go into this nice UI and click some buttons and get a model to do what they want and
to improve it over time. That's really our end goal with Aquarium. And that's the thing that
we're building towards every day. Very cool. Well, Peter, it's been an incredible conversation, as we say to almost all of our
guests, I think, maybe all of them. We'd love to check back in as you continue to build out Aquarium
and see how you're doing maybe in another six months or so and have you back on the show.
Yeah, sounds great. Well, it was great talking and let's keep in touch.
As always, amazing conversation. It's so interesting to meet people and hear their
backgrounds. I'm going to, there was lots of technical stuff. So I'm going to say that I think
one of the most interesting parts of the conversation that really stuck out to me was
Peter's obviously incredibly intelligent and articulate, but there's an underlying passion
there based on his life experiences, you know, sort of from early childhood interest in robotics to going through
sort of a traumatic experience, you know, related to vehicles. And I am just always amazed to see
people building things that are reacting to or built upon really deep life experiences they've had. So I think it
was a privilege that he was vulnerable enough to share some of those things with us and really
appreciated it. Yeah, absolutely. It was a fascinating discussion, actually. And Eric,
one of the things that happened to me during this conversation is that I realized that when we are
talking about self-driving cars, actually we are building robots.
For some reason, I didn't think about this before our conversation today. I was thinking of it more
like a software kind of problem, but actually what we are doing is building a robot, which is amazing.
Anyway, that's my realization. I totally agree. It's one of those things where you say it out loud and it sounds like the simplest conclusion to come to.
But until he actually said that, I hadn't made the connection either, which is so funny.
Yeah, because usually when someone says the word robot, what's the first thing that you think about?
It's Boston Dynamics, right? Like these robots that dance, that try to walk like a human, and all these things. But at the
end, that's exactly what we are doing with a car when we want it to be self-driving. We are building
a robot. Anyway, it was a very fascinating conversation. It was very interesting to hear
about the techniques that they are using for, let's say, extracting structure out
of unstructured data, using the same neural networks that are also used as
the models themselves, with all this theory around the embeddings that Peter was mentioning.
And two things that I'd like our audience to pay attention to.
One is, again, also from Peter, we heard that the relationship between technology, AI, ML,
and humans is much more of a synergistic relationship than the
antagonistic relationship that the media are trying to portray out there, which is great to
hear from him. That's one thing. And the other thing is that at the end, it's all about the data,
right? If you don't have the right data, if the quality of the data is low, no matter how good
your algorithm or your model is, you're not going to have the results that you need.
So I'm really looking forward to having him again on a future show.
Yeah, absolutely.
And for all of our listeners who were also thinking about Autobots and Decepticons in the context of Transformers, when Kostas said, what do you think about when you think about robots?
I did the same thing. I didn't have quite as mature of an initial thought as Kostas.
So you're in good company if you thought about Transformers.
Yeah, and Terminator.
Don't forget Terminator.
And Terminator.
That's right, 2030, we're getting closer.
Great, well, thanks again for joining us on the show.
Please subscribe on your favorite podcast network.
That way you'll get notified of new episodes every week.
We have an incredible lineup in the next couple of weeks.
So you'll want to make sure to catch every show.
And until next time, thanks for joining us.
The Data Stack Show is brought to you by RudderStack,
the complete customer data pipeline solution.
Learn more at rudderstack.com.