Embedded - 252: A Good Heuristic for Pain Tolerance

Episode Date: July 5, 2018

Katie Malone (@multiarmbandit) works in data science, has a podcast about machine learning, and has a PhD in Physics. We mostly talked about machine learning, ways to kill people, mathematics, and impostor syndrome. Katie is the host of the Linear Digressions podcast (@LinDigressions). She recommended the Linear Digressions interview with Matt Might as something Embedded listeners might enjoy. Katie and Ben also recently did a show about git. Katie taught Udacity's Intro to Machine Learning course (free!). She also recommends the Andrew Ng Machine Learning Coursera course. Neural nets can be fooled in hilarious ways: Muffins vs dogs, Labradoodles vs chicken, and more. Intentional, adversarial attacks are also possible. Impostor syndrome is totally a thing. We've talked about it before. You might recognize the discussion methodology from Embedded #24: I'm a Total Fraud. Katie works at Civis Analytics and they are hiring.

Transcript
Starting point is 00:00:00 Welcome to Embedded. I am Elecia White here with Christopher White. Our guest this week is Katie Malone, a data scientist with a podcast about machine learning. Hi, Katie. Thanks for joining us today. My pleasure. Thanks for having me. Katie, could you give us some background about yourself? Sure. So I've been in data science for about three years. I work at a startup in Chicago called Civis Analytics that does data science services and technology work. Before that, I was finishing a PhD in experimental particle physics. So I did a bunch of searches for new particles.
Starting point is 00:00:47 And so it's been a really fun transition from academia to data science. You still get to do a lot of the programming and a lot of the statistics and machine learning, but usually a little bit quicker and with a little bit of a bigger impact. So I'm really loving it. You worked at CERN. I did. I spent a couple of years there in graduate school doing research. Yep. I want to ask you about that,
Starting point is 00:01:19 but I have so many machine learning questions that I'm not sure we'll get to it. Before we do get to the deeper questions, we have lightning round. Okay. We'll ask you short questions. We want short answers. And if we are behaving ourselves, we won't ask for more detail. Sounds good. Christopher, do you want to start?
Starting point is 00:01:36 Sure. What's your preferred language to develop in? Python, although I don't program that much anymore, quite honestly. TensorFlow or Keras? Keras. Bayesian or frequentist? What are these questions? Bayesian in real life, frequentist if it's on a computer.
Starting point is 00:02:01 Favorite style of neural net to identify animals? The one that's inside my brain. How much data is enough data? More than what I have. That's always true. Okay, favorite style of neural network to identify cars for a self-driving car? Oh, that's an interesting question. I don't remember what they used for the self-driving car.
Starting point is 00:02:32 There was a point where I actually knew the right answer for this one. I'm sure it's some kind of convolutional net for the image recognition, but then I think they had some kind of interesting stuff going on inside for some of the decision-making. It wasn't a quiz. It was personal opinion. Oh, well, I mean, I just want to pick the one that works the best. I assume it's the one that's out there. So can you tell us about your podcast? Sure. Yeah, I'd love to. So it's called Linear Digressions,
Starting point is 00:03:01 which is a little bit of a pun. And I've been doing it for coming up on three. No, coming up on four years. Yay! Oof, that's a long time. I do it with a friend of mine, Ben Jaffe. So we used to work together back when I was still in grad school. I did a summer internship at Udacity, which is a company that does online learning and courses and stuff.
Starting point is 00:03:29 I was putting together a machine learning course and had too much content for what I could put into a single course in three months' worth of work. And I'm a big podcast fan, have been for a long time. So I thought that those were kind of the raw ingredients for an interesting little project. And Ben is not a machine learning person, but he had some audio background. And we just get along really well and he's interested in this sort of stuff. So that was a little bit about how we got started. And so now we're still doing it, I guess, like I said, coming up on four years later. So every week, roughly, we have a new episode about something in data science or machine learning. And that's evolved a little bit over the course of the podcast. When it started
Starting point is 00:04:20 out, it had maybe a little bit more stuff about physics, for example, because that's what I was doing. I have a little bit more of a managerial role now than I used to; that's why I don't program quite as much anymore. But I'm thinking a lot about how machine learning and data science are being used to solve real problems. And so, again, that's sort of a direction that we are taking it now that it hasn't gone as much in the past. But it still has a pretty strong backbone of the general idea that machine learning and data science are fields that move very, very quickly, and it's hard to keep up with sort of the gist of all of the new developments that are happening. And
Starting point is 00:05:19 quite frankly, I think there's not quite enough medium-technical resources for somebody who wants something that's more detailed than, say, the popular press or a New York Times article, but not so detailed as a scientific paper. We try to sit in that gap for non-technical folks who want to feel like they're keeping up with this stuff and for the technical folks who want to fulfill their breadth requirement and keep up with a field that's moving really quickly. And so it is just you and Ben? And usually Ben is learning about something you're talking about? He represents the audience? Yes. Okay, it's been four years. How much is Ben still learning? A ton, I think. It's been interesting. So his background is he's a front-end web developer, so he's quite technical. And in some topics, he's even more technical than I am. So when we were talking, we had a recent episode about Git and GitHub and version control, and so he's of course just as well versed in this as I am, maybe more so. But it's funny, we sometimes chat about this, usually not while we're recording or
Starting point is 00:06:33 anything. But I think, by his own telling, he's learned a lot, and I can see how there's connections that he's making between the topics as we're covering them. Like, I think he's just much more comfortable, almost unconsciously, with these topics than he used to be. But he's a deeply curious person. I don't think he would ever say that he has mastery or anything. Most practitioners in this field wouldn't say they've achieved mastery.
Starting point is 00:07:06 But I think that's been one of the things that's been really fun for him is kind of that learning, but continuing to have sort of that beginner's curiosity as, like you said, sort of a stand-in for some of the folks in our audience that we think are probably in shoes that are really similar to his. There are callbacks to previous episodes and the episodes seem to build on each other. Do I have to start at the beginning? No, no, definitely not. We try really hard to make the callbacks as unobtrusive as possible. There are, I guess the one exception to this is sometimes there's episodes where it would take an hour to really unpack an issue or an item in all of its interesting detail. So sometimes we'll split that across two or it's usually two, maybe sometimes
Starting point is 00:08:02 up to three consecutive episodes. And then you might want to listen to those in a unit. But mostly, one of the things we're trying to do is contextualize this stuff a little bit too. So machine learning, when I was starting, it felt like a lot of disconnected topics. And I would learn a topic and then stop learning it and then move on to the next topic and like learn it and then stop and then move on to the next topic. But as I've grown as a scientist and as the field has grown, you start to see some of the common threads. And so that's been one of the things that's been really interesting is that there's stuff that we were saying two and three and four years ago that just isn't true anymore because the
Starting point is 00:08:45 field has changed. And so that can be a really, really interesting thing to call out when it happens. Like, hey, there's basically this living archive of this field, and of ourselves understanding it, as we've been going. And you can sometimes opportunistically drop back and look at the time capsules, and that's kind of fun. Yeah. You recently mentioned dropout layers, which is still something people are using with convolutional neural networks to find, like, cars, but it may be going away. I mean, the math has never really made sense, but there may be better ways to handle the dropouts. And you did talk about that somewhat recently. Yeah. I mean, and I think sometimes for sure there's topics that we cover not because they're the new hotness, but because they're just these big items from, you know, two or three or five or, you know, sometimes 100 years ago. I'm a jack of many trades, a master of quite few of them.
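Since dropout layers just came up (and, as the discussion goes on to note, dropout is assumed knowledge in a lot of neural net papers), here is a minimal sketch of where a dropout layer sits in a network, written in Keras. The layer sizes, the 0.5 rate, and the 20-feature input are illustrative assumptions, not anything from a model discussed on the show.

```python
# Minimal sketch: dropout randomly zeroes activations during training, which keeps
# the network from leaning too hard on any single unit (a form of regularization).
# Keras turns dropout off automatically at prediction time.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(20,)),               # 20 input features, made up for the example
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                    # randomly drop half the activations while training
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # binary prediction
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```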
Starting point is 00:10:09 I wouldn't claim expertise in neural nets. And one of the things that has been hard for me trying to keep up with that field, number one, it's just moving so fast. And number two, there's these big topics like dropout that are assumed knowledge if you're trying to read a paper. And I was finding myself having trouble actually reading some of the neural net papers and keeping up with new stuff like capsule networks or something because they assume that you already know what dropout is. I didn't. So go out and learn it. That's fine. But part of what I try to do as a like part of the reason that I do the podcast is because then I try to hook that back into sharing that out with people.
Starting point is 00:10:55 So I learn things so that I can share them out and I share them out as a motivation for me to go learn them. So yeah, it's an interesting field to try to learn because, like I said, there's so much context there. And sometimes that's what we're trying to achieve is just doing some of that footwork for our listeners. You must get asked, like, a lot, how do I get into machine learning? Are you sick of that question? I think it would be kind of mean to say I'm sick of it. You can be mean, it's okay.
Starting point is 00:11:40 One thing I've been contemplating, there's kind of a saying that if once you've written the same email, like that, you know, somebody writes me, it's usually in an email, somebody writes me an email, and the gist of it, how can I get into machine learning? Once you've replied to that email three times, you should just write a blog post. I think I'm definitely past the three marker. So, maybe that's my signal that I should just sit down for a couple hours and write out what I usually say. But what I usually say is mostly the same. It's pretty hard to do anything that's very tailored to any individual, because I don't know most of these people on a personal level, or what kind of strengths and weaknesses they might have. So, I usually tell them similar things, which is when you're just starting out, one of the things that at least I
Starting point is 00:12:27 look for when I'm interviewing people that don't have a lot of experience is projects, basically. So in particular, I think projects are really interesting for a few reasons. Number one, they actually make you a better programmer, a better data scientist. They force you to think about actual real world problems. So they're just a good training ground. Second is that especially if they're projects that, like I'm much more keen on projects where people go out and they collect a data set because they're actually interested in trying to formulate and solve a problem versus doing like a Kaggle competition. And I think that's much more realistic. It's much closer to the real world. Most of real world machine learning and data science, somebody doesn't hand you a cleaned up data set and say,
Starting point is 00:13:20 please predict this column. Like you have to go out and make that data set and formulate the problem and understand what column you're supposed to be predicting and all that sort of stuff. So if somebody is doing a project that demonstrates that they've thought about a problem on that level, that's pretty unique and really valuable to me. And then the third thing that's cool about projects is, again, especially if it's a project that someone goes out and starts on their own or contributes to on their own, it tells me something about what they're interested in and what kind of data scientist or machine learning person they might be, which is sometimes hard to get any other way if somebody hasn't had work experience. So it just, it gives you something interesting to connect with people on, on a personal level where you're like, oh, you really care about basketball. Like that's kind of interesting. Tell me about why you care so much about basketball. Tell me about what's going on in basketball. You decided to go out and solve. Or sometimes people have projects that are, you know, social good projects or things that they
Starting point is 00:14:23 really, you know, problems that they care about in society, and they go out and they try to do data science around that, which I think is incredibly cool. So it just gives me something sort of personal to also connect with that person. And so if you're on the other side of the table, and you're a person who's trying to get started in machine learning, I think that those are, you know, the project approach is like one of the more fun and interesting ways to do it. And it can be, at least from my perspective, like a really good way to communicate out what it is you care about and how it is you work, which is really key to getting started. If you're, you know, especially if you're looking for other people to join you or for folks to give you a shot at their company or whatever. Do you have resources, classes, books, blogs,
Starting point is 00:15:18 suggested reading, listening? I mean, other than your own podcast, of course. That's a good question. It depends a little bit on what somebody's asking for. I think the Coursera machine learning course is really, really good. The Andrew Stanford one. Yeah, that is fantastic. Yes. So that's a total classic. Anyone who hasn't done that or something that's like equivalent, I would send them there first. I don't have, I have a lot of- Can I recommend your own Udacity class?
Starting point is 00:15:57 Oh, sure. I mean, it's good too. Why not? Yeah, that's pretty good. Why not? Outside of sort of out-of-the-box machine learning, there's actually too many resources for me to give you a great answer to that question. There are so many that I've come across, through the podcast especially. So we keep a website, lineardigressions.com, and each time we have an episode release, we post a little thing on there. And it's gotten to the point where I actually use that as kind of a library or a link blog for myself to go back and find interesting resources that I've come across.
Starting point is 00:16:51 But maybe it's been a couple years and I don't remember where to find it. I can usually, you know, kind of reverse-engineer what episode that might have been and then go look it up. So it's sort of funny. There was an extreme example of this. There was one time I was teaching, or about to go give a lecture on a fairly technical neural net topic, I think. And I was a little rusty. It had been a while since I thought about it in great detail. I meant to read up on it the night before, but then something came up, I forget. And I actually listened to an old episode of the podcast, because I was like, I think I once did this. That doesn't happen very often, but that's kind of an extreme example. I'm laughing because I did that once too. I had to give a presentation on BLE, and I went back, and on the way to give the presentation, I listened to a show we did with Bleacher Snyder, Josh. And I'm pretty sure I just quoted everything he said.
Starting point is 00:17:57 Yeah, I mean, you did the work. You might as well take advantage of it. So that's great for the newbies. I am an established embedded software engineer. I'm not looking to change careers, but machine learning keeps nibbling at the edges. People want it, people are excited about it, but I'm not sure how we're going to put it into tiny devices that don't have gobs of memory, gobs of processing power. How do I learn enough to know what will affect me? What do I need to know about machine learning? Yeah, I was thinking about this. I think it's a really tough question, because it gets at the heart of what's possible, and it's a little bit hard to
Starting point is 00:18:48 go find a good resource that tells you sort of what's possible that's at the right level of technical detail. Like I said, that's the place where I always found the biggest gap. There's really good, super duper, overly technical stuff. There's really good popular science stuff. The stuff in between is where, at least for me, I want to do the most reading and where I find sometimes a gap. The best, and this is not a great answer, but it's the thing that, if you have access to it, is probably the best one, which is: make friends with a data scientist or a machine learning person. Like maybe there's one at your company, or there's meetups nearby where you can strike up a conversation. Because
Starting point is 00:19:38 that allows you to talk to each other as humans. And sometimes the thing that makes figuring out like if machine learning, what about machine learning is interesting and what's just hot air, like part of the thing that makes that conversation challenging is that machine learning has a lot of jargon. And I imagine that, you know, whatever other technical people, like if you're working on super small devices or something, in order for you to explain in great detail what you do too, like I would probably be mystified by it for a few minutes as well. And so if we were standing there face to face, you would see that and be able to unpack it a little bit for me, that sort of thing. So, if you have access to a data scientist or a machine learning person that doesn't mind, you know, going out for a coffee or going out for a beer and just like telling you what they're thinking about or what they're working on, I think that's the best thing. Well, that's good. As long as you're here, I have a few questions for you. Oh dear. Inference versus training. This I know is very important. And I
Starting point is 00:20:49 know that when I want to train a network, it takes, I mean, this is why I use AWS cloud. This is why I use our game computer, but inference. Can you tell me about the difference between inference and training? Yeah, so if I understand the question, I might phrase it a little bit differently, which is inference versus prediction. So prediction, I usually think of as kind of your standard machine learning output, which is, I have a bunch of data. There's patterns in that data that allow me to predict something that I'm interested in, use statistical learning methods to kind of figure out what those patterns are, and then use them to predict that outcome of interest on new cases where you don't already know the answer. Inference is, it uses in some cases, similar techniques, stuff like linear and logistic regression, but it's based on statistics in kind of a more rigorous way than some of the more
Starting point is 00:21:54 pattern matchy stuff that's in machine learning. And inference is valuable because it can tell you why. So it can make predictions, but it can also tell you sort of what inputs to those predictions are affecting the outcome. And so depending on the problem that you're trying to solve, like if you want to take an action based off of your prediction and an action that's going to change the outcome that someone experiences, then it's likely, although not guaranteed, that what you want is something like inference. But if what you want is just something that's as accurate as possible in giving you the right prediction of what's going to happen, then often prediction or machine learning is what you want to do. And the reason that the
Starting point is 00:22:46 difference matters also is that inferential statistics can give you predictions, just like machine learning can give you predictions. But usually the predictions that you get in machine learning are higher accuracy than the ones that you get with inferential algorithms. And so that means that you pay a little bit of a price for having something that's more explainable, which makes some sense. But it means then that sometimes there's kind of a mismatch in expectations then when you're using these algorithms, because very often people don't realize that there's some trade-off between the predictive power of an algorithm and its explainability. And that can sometimes lead to bad outcomes, because a lot of times people want to understand
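As a rough illustration of the trade-off being described, here is a small scikit-learn sketch on synthetic data. Everything in it is a made-up stand-in: the logistic regression plays the inference role (its coefficients are numbers a person can read and argue about), while the gradient-boosted model plays the pure-prediction role (often a bit more accurate, with no small set of numbers to explain it).

```python
# Sketch: an explainable model versus a more "pattern-matchy" one on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2000, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inference-flavored: the coefficients say which inputs push the outcome, and in which direction.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("coefficients:", logit.coef_.round(2))            # the "why"
print("logistic regression accuracy:", logit.score(X_test, y_test))

# Prediction-flavored: often more accurate, but there is no short list of numbers to read off.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("gradient boosting accuracy:", gbm.score(X_test, y_test))
```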
Starting point is 00:23:34 why an algorithm says something in order to trust it and to feel like they want to use it. This is awesome and confusing, because the NVIDIA material uses inference in place of prediction. I mean, even their outputs of their machine learning, of their AlexNet neural nets, they call them inferences. And so for them, anything you're doing that isn't training, anytime you're running the machine learning algorithms forward, you're doing inference. Oh, I see. So they use inference and prediction interchangeably. Yes, we would call that like scoring probably where I work anyway. Yeah, jargon, right?
Starting point is 00:24:22 Yeah, and I was thinking of causal inference, which is like a subfield of statistics. Huh. Okay. Yeah. Let me take a second crack at your question. I'll try to be a little bit, a little bit briefer. So what's the difference between training and inference or training and prediction? So one is where you're actually learning what the patterns are between the inputs to your algorithm, like your all of your data and the outcome of interest. So that's that's the training part. And that can be really slow and really painful, because there's all kinds of different patterns that can occur. And you need to, you know, potentially search a lot of space in order to find them. Once you've found the patterns and those are somehow encapsulated within a data structure, like a binary tree or a linear equation or something, then inference or prediction or
Starting point is 00:25:18 scoring is taking those patterns, matching them up against new cases, and using that to figure out what the outcome of interest is for those new cases. That is more along the lines of what they told me, but I am so happy to hear that there are differences even from experts. And so the scoring, the prediction inference, whatever it is we do, we call it, that has a much lower cost of processing. That's a fast thing compared to training. Yeah. Generally, yeah. And so that's the part that usually affects the embedded systems more.
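A small sketch of that asymmetry, again with made-up data and a stand-in model: training searches for the patterns once, usually offline on a big machine, and scoring new cases afterwards is comparatively cheap, which is the part that ends up running on the product.

```python
# Sketch: training is the expensive search for patterns; scoring new cases is cheap.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
start = time.perf_counter()
model.fit(X, y)                                          # training: slow, done offline
print(f"training took {time.perf_counter() - start:.2f} s")

new_cases = np.random.default_rng(0).normal(size=(10, 20))
start = time.perf_counter()
print("predictions:", model.predict(new_cases))          # scoring: fast, per new case
print(f"scoring took {time.perf_counter() - start:.4f} s")
```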
Starting point is 00:26:01 You take all the data you can find, you train the system somewhere else with some specialized people with specialized skills, and then they come back and they say, these are the features of interest, and we want you to run this, what ends up being a linear equation. I mean, it's a forward neural network, but at the end, it's just some multiplies and adds. And then I look at them and say, these features make no sense. You can't ask me to do an FFT and then only use some of the results because that FFT doesn't make sense like that.
Starting point is 00:26:39 Let's just do it more mathematically efficiently. And they're like, no, you have to have the features be exactly this. It sounds like a very specific complaint. Okay. Maybe that one is a little specific, but it's not, it's not the only one. I do see this a lot where the features, you just look at them and you're like, yeah, that makes no sense. Oh, well, so like, it sounds like you, you're experiencing a version of the thing that I gave in my first answer when I didn't totally understand the question, which is the interpretability of the results, right? So by interpretability, I mean, does a human, when a human looks at it, can they, can they understand it? And does it make sense? Those are slightly different questions. You can have something that's very simple to understand, but it doesn't make any sense is what I mean. Or you can have something that makes a lot of sense, but it's so complicated that you can't fit it all in your head at the same time.
Starting point is 00:27:34 So it sounds like what the machine learning folks did was they built an algorithm that doesn't know anything about the world. So it doesn't understand what an FFT is, or it doesn't understand what variables should make sense. It's just being trained to make predictions that are as accurate as possible. For some reason, this happens all the time, it finds some combination of features that don't make a ton of sense to you as a human. And that combination of those features is what gives it the most accurate predictions. And now you're at a little bit of an impasse, right? Because the machine learning people, it's really hard for them, or it feels unfair in a sense to have to go back and tell their algorithm that they solved the problem
Starting point is 00:28:25 wrong. And from a technical perspective, that can be really challenging to do. But I totally get what you're saying. You're like, this doesn't make any sense to me. And so then you have to decide where to go from that. And that's more of like a, it's not really a technical problem exactly. Like you could take the outputs of their algorithm and put it into a thing, I assume, mostly. Yeah. Oh, yeah. No, the actual algorithm is not hard. It's just, why would you use mathematically equivalent things in different ways?
Starting point is 00:29:03 Why would you not allow me to use the most optimal math? And that may be due to overfitting. In that case, it was... Because the training doesn't take into account the way the features are used later, right? It's optimizing, right? So if it knew, oh, yes, I'm doing these features, but that makes your later algorithm really inefficient. If you fed that back to it, wouldn't it find a different space in that vector space of solutions?
Starting point is 00:29:37 I mean, you'd have to basically come up with like a, you know, it's like a constrained optimization problem, or there's some kind of constraint that you're putting on the neural net that says, even if it gives me a good, you know, quote, unquote, good answer, I don't want something that looks like this, it has to be within like this range. You know, from a technical and practical perspective, this can be trivial to extremely hard, at least from what I understand. I'm not, again, a huge neural net expert. But yeah, I mean, it's kind of interesting because I mentioned my background was in physics, and high energy particle physics is a field that,
Starting point is 00:30:20 at first glance, looks like it would be great for machine learning because there's just huge amounts of data that are being created. Oh yeah, just like tons. Like, you know, one of the most data-rich fields that you can work in. And what you're trying to do is basically advanced pattern matching, to try to figure out if there's certain signals in those huge piles of data. And that seems like something that would be quite nice for machine learning. But physicists understand theoretical physics, and there's like a bunch of math that they understand that underlies the way that all these particles should be behaving and created and decaying and all this sort of stuff. And machine learning generally doesn't respect any of that, right? It can kind of go do all kinds of crazy things in order to give you an answer. So, there was actually, you know, from my view anyway, a surprisingly high amount of resistance within
Starting point is 00:31:15 high energy physics to machine learning for a long, long time, which is not to say that it was never used, but there was a lot of skepticism relative to some of the other fields where there wasn't that deep theoretical understanding. And I think it's been changing a lot in the last five years or so within physics. There are some people who are doing some really incredible work there and trying to narrow that gap somewhat, as well as just making the case for why machine learning still deserves a place at the table. But just as a little bit of an aside, yeah, it's a huge issue. It's a huge, huge issue. If you work in data science or machine learning, you should be thinking about this very deeply. You should want to be doing work that has an impact. And if part of what that means is that people need to understand it in order to be using it in their decision-making processes or in their work, and they don't trust it, then addressing that trust, I think, would be ideal for the field overall. Sorry,
Starting point is 00:32:27 I got on a little bit of a soapbox there, but it's something that I think is really important and that doesn't get enough consideration. No, I like that soapbox because when an algorithm seems exceedingly fragile and weirdly dependent on things that shouldn't matter. I do want to be able to go back to my data scientists and say, are you sure? Can you make it a little bit more explainable? Because having to be perfectly predictable in what has to be a limited subset of the world, those are questions I have to ask because you can take this pile of data and you can give me an answer. You can give me the highest predictability for this data. But if I put this in my device, it's going to see a lot of other data. And when it fails,
Starting point is 00:33:23 someone's going to say why. And they're going to be right to ask why. Machine learning and why, I hate waving my hands. I at least want to be able to say, well, these are the factors we took into account. And when we look at your system, those factors don't work. And here's maybe some reasons why or some ways we failed, but I don't want to have to just say, I don't know, machine didn't like it. Yeah, no, I think that's right. I mean, one of the hardest problems in machine learning is taking your algorithm out of the lab and then putting it into a place where it can see potentially new stuff that it's never seen before. And, you know, I think that's why we've seen some pretty incredible successes come out of machine learning programs.
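Circling back to the "constrained optimization" idea from a few exchanges earlier: one common way to nudge training away from solutions that are awkward downstream is to add a penalty to the loss. The Keras sketch below is only a guess at the flavor, not how any real team did it; here an L1 penalty on the first layer pushes the weights on unneeded inputs toward zero, so fewer features would have to be computed on the device.

```python
# Sketch: make "using lots of input features" cost something during training.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    keras.Input(shape=(32,)),                                 # 32 candidate features (made up)
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-3)),   # penalty on each input's weights
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
# After training, inputs whose first-layer weights land near zero can be dropped from the
# feature pipeline, trading a little accuracy for a cheaper on-device computation.
```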
Starting point is 00:34:19 Like a good example is AlphaGo. Like, 10 years ago, we thought there was no way that computers could ever beat humans at Go, and now they've beaten the world master, like, very handily, with entirely novel ways. I mean, this is something that we would have said, that I would have said, no, that doesn't look right, because that's not at all how a human would play. And yet it was right. It was totally right. Yep. Yep. And that's one of the things that probably your machine learning person also has in their head a little bit when they hand you this algorithm: if you say, this doesn't look right and I don't think it's going to work, they'll say, well, yeah, but you didn't think
Starting point is 00:35:02 AlphaGo was going to work either. And that's a little bit unfair. And usually, they're probably wrong. But the point that I wanted to make, though, is that, you know, the advantage that AlphaGo has, that you should keep in mind, is that Go is a very tightly controlled environment, right? So it's not like... It has rules. There's like a little grid, and the grid always has the same number of squares on it, and there's the little tiles, and there's only certain places where they can go. And it's very orderly. It's very complex, but it's very orderly. And I would contrast this with, you know, something that's a little bit more, uh, self-driving cars.
Starting point is 00:35:45 Anything in real life? Yeah, I think self-driving cars are a fantastic example, right? Like, there are rules of the road and there's rules of driving, but, you know, there's weird stuff. Yeah, there's weird stuff that can happen when you're on the road. And that's not an excuse for algorithms getting it wrong or for them being undertrained or for stupid mistakes or anything. But like, I, when I was at Udacity, I worked with one of the, one of the people who was, you know, kind of the father of the self-driving car. And I had a chance to talk with him a lot about what that was like in the development program that he was building out at Google before he came to Udacity, before he started Udacity as a founder.
Starting point is 00:36:25 Sebastian Thrun. Yes, yes. And he told a story, I don't remember if it made it into the final cut for the class or not, but told a story that they took a self-driving car out onto the highway. And this probably would have been in like 2012, maybe. This was a while back. So, we did the course in 2014. So they're out on the highway and they're going whatever highway speeds, 60, 70 miles an hour.
Starting point is 00:36:51 It was in self-driving mode and a plastic bag flies across the highway. Something that you and I would see and you know, all those things being equal, I don't want to hit it, but I'm not going to do anything crazy to avoid a plastic bag. Right? At the same time, it's not an extremely common thing. Like, I don't think I've seen a plastic bag blow across the highway in, let's say, at least a year. So, the car sees the plastic bag and it goes into just like a full-on stop and, you know, maintains control of the car and everything, everything was fine, but clearly the car, you know, the car was freaked out if that's possible. And they took it back and they looked at the logs yesterday and they, or they looked at the logs that night, excuse me, and realized that it had set off, you know, an alarm that was equivalent
Starting point is 00:37:44 to basically like, I think there's a kid in the street, right? Like a kid ran out after a ball in between cars or something. And then that would absolutely, yes, that would, I can see how a car and a small child or a plastic bag and a small child might look not that different to something that's, you know, doesn't have a huge number of cases of seeing either one of them. And if there's a child, then absolutely, you want to do everything that you can to stop the car and to not hit the kid. So, that makes a lot of sense. But I give it as an example of, like, how driving is messy in ways that we as humans don't think of very consciously. Like, we have really good systems for kind of dealing with some of that ambiguity and with learning very quickly from
Starting point is 00:38:34 just a few examples. And we've all also been learning for, I guess, most of the people who are listening to this for dozens of years. Compare this with a neural net that's maybe trained for 45 minutes or something that's not that long to learn about everything that you need to know about the world. So anyway, I think it's, it then gives a little bit of perspective, though, on the original question that you asked about, like, when should we, when does machine learning, like, really shine? And when does it really give the best predictions? And when is it kind of underpowered or feel underpowered? And I think it has to do with, you know, how orderly the system can be. And,
Starting point is 00:39:18 you know, are there lots of ways that, in the course of collecting data, noise can creep in? If so, then machine learning is going to, you know, algorithms are going to struggle somewhat more than if there's just lots of very orderly, but perhaps complex, rules. Then sometimes it can do a little bit better, stuff like image recognition or, you know, like I said, AlphaGo. I heard there's some version of IBM Watson that's learned how to do sort of college-style debating, which sounds really impressive, but then you find out that it's only trained to debate on, yeah, like a hundred different topics. You're like, okay, well, that's impressive, but I could debate a thousand different topics. I don't mean to be dismissive
Starting point is 00:40:03 of AI stuff. I think it's really, it's interesting and it's moving very, very quickly. And I'm sure that this is stuff that in a year or two, we'll look back on it and laugh a little bit about how wrong I am about some of these things. But it's a good piece of perspective to keep in mind, I think, especially when you're watching some of these very frothy predictions about the future of AI. You have mentioned you're not an expert in neural nets, although you're far more of an expert than most people have ever met. When you say machine learning, what do you mean by that? So I think of machine learning as a superset of sort of algorithms and some accompanying techniques to make the data good for the algorithms or to
Starting point is 00:40:55 understand how the algorithms are performing. So neural nets are one example or one class of examples of particular algorithms that you can then use to solve specific problems. Okay. As a data scientist, well, maybe I should phrase this differently. A friend recently asked if I could help with an algorithm and he dropped a pile of data on me and said, there's something in here and you just have to figure out which signal is important and how to do it. And I took up the challenge because it was amusing. But is that what a data scientist does? What does a data scientist actually do? I, well, I guess it depends a little bit on what the data was and how, you know, well-constrained that question was, I think, so a data scientist does,
Starting point is 00:41:46 part of their job is understanding machine learning and other types of statistical algorithms for, like, how do you actually solve a problem from a scientific point of view or from a computational point of view? In my view, that's, that's kind of like a, the, the warhead that sits on top of a ballistic missile. And so it's the thing that everybody pays attention to, but there's all this stuff that sits underneath it that makes the whole thing actually go do something. And so the rocket and all the other stuff is things like understanding what the problem is, talking to people, actually going out and collecting the data, because it's pretty rare that the data is actually in a format that's
Starting point is 00:42:33 going to be good for solving the problem, that it's clean. Clean data is very rare. And so a data scientist, that's where they spend probably 80% of their time easily, is just trying to get the problem defined and the data into a state where you can do something like some statistics or some machine learning. Yeah, he gave me a bunch of data. It was CSV, so that was easy. But it wasn't labeled. I didn't know what I was looking at, so I was just looking at changes, which, I mean, turned out to be mostly what he needed. But I plotted it, I filtered it, I looked at noise characteristics, I looked at derivatives, and it was all very amusing, but eventually I actually got the physical model of what was happening and everything became much clearer. Is this, I mean, is this what you do? Or was this just me playing with the techniques I know to find?
Starting point is 00:43:40 I don't really know what a data scientist does. Could you answer that again? Yeah. So I would say a data scientist... So there's a couple different, there's different flavors of data scientists. Let me try to answer the one that maybe is closest to what you're thinking about. So let me give an example from my work because it's the thing I understand the best. And I mentioned that it's from my work at Civis because then if you listen to what I'm saying and you're like, I'm a data scientist or I understand this and that's not what my job is, that's probably because you have a different job than me.
Starting point is 00:44:18 And all respect to that, but there's different flavors of this. As a data scientist at my job, what I do is go out and talk to people at businesses who are trying to understand how to solve problems more effectively with data. So it might be something like, we have a customer churn problem. So we need to be able to predict which of our customers are dissatisfied with our service, and there's some risk that they're going to leave, and then perhaps proactively offer them something that entices them to stay. That's a typical example, if they can that concisely describe what the problem is. But sometimes not. Sometimes it takes a couple hours to kind of unpack all of the stuff that might not be going that great at their business and to figure out that, okay, that's a problem that we can solve with data science. And moreover, you have the data that's appropriate to solve it. That's a huge thing. And so then very often we'll have to spend some time on the computer actually hooking up to their data sources and actually looking at their data very often, bringing it into computing environments that we have set up for this analysis.
Starting point is 00:45:39 Because usually they don't always have the computational tools and software and stuff that we do. So we kind of have to move all the data over, take a look at it, clean it, see how it's formatted, formulate an actual metric that we can use to say, is this customer about to churn, which is more complicated than it sounds. And then maybe we can start thinking about doing some machine learning on the data to make a prediction for any given person about whether somebody has churned. And so then we might spend, you know, a few weeks or up to like a few months, like taking a few passes at those models and talking with them about preliminary results. And here's what we're finding. Is this making sense to you?
Starting point is 00:46:32 Here's how much better we think we could do than sort of some of your taking-a-guess, like, baseline estimates. And then maybe we can say that we've solved the problem. So a lot of it is, like I said, that 80% that's up front and in the back end of actually understanding the business problem and trying to solve it. And then the machine learning and the statistics and stuff is kind of the 20% that sits in the middle there. Does that give you a better picture? Yeah. It is nice to hear that the machine learning is a piece, but it's a pretty difficult piece to learn. So I'm a little sad by that. I mean, most of what you were saying, I'm like, yeah, okay, I totally can. I do that a lot. I go in,
Starting point is 00:47:24 I talk to a client, I realize that what they're telling me is not what they actually want. And then I kind of help them see what they want and help them give them options. And I try to investigate and finally we come up with an agreement. And then I get to do the actual part of my job, the putting things together and programming and designing and making it robust and testing it. But so much of it is just trying to convince people to tell me all the things that they really want instead of the picture they've put on it. Yeah, and a lot of times, like one of the things that I've, I think I'm seeing in the field of data science as it's evolving, is that there's, it's not an individual sport anymore, as much as it used to be, it's a team sport. And so there might be, you know, the data scientist who is a machine learning expert, and really, really understands a lot of those, you know, algorithmic ins and outs.
Starting point is 00:48:27 But then they have other people on their team who are folks like data engineers who understand how to get the data in and out of the systems. Maybe they work with product managers or with client facing folks who are more specialized in the business context and who are conversant in machine learning, but they're not, you know, you don't have to be a PhD in machine learning to understand a business problem. And so that I think is overall good for the field of data science, because it means that there's a more mature idea of how to achieve actual results with it. But then it also means that sometimes folks who were a little more full stack before, data scientists who used to have to play all of those parts,
Starting point is 00:49:19 are now specializing, and it's a little bit narrower. Hopefully it's deeper. But it's an evolutionary step that I think is happening right now, at least from my perspective. Okay, I'm going to switch topics a little bit, because we did get some questions from our Patreon Slack people, which indicates to me that many embedded engineers, hardware and software, hate machine learning. So that was... I noticed that too. There's some skepticism of neural nets in the questions.
Starting point is 00:49:54 Yeah. One of the big ones was attacking neural nets intentionally. Part of it was using neural nets to attack other neural nets, but there was this battle of, can we, how often does it happen? Are there secret machine learning battles going on out there? How did they discover that muffins and dogs look exactly alike, and that labradoodles and chicken look indistinguishable to neural nets? Well, they just kind of look similar, right? Like, have you done that, you've done the Google exercise where you Google, like, dog or chicken? It's kind of similar.
Starting point is 00:50:35 I have trouble telling them apart. Yeah. Yeah. I want the neural nets on that one. Not fair. Yeah. But are there battles going on? I mean, are people taking neural net outputs and then trying to figure out how to fool them? So I have only heard about those in laboratory settings. So where there's a team of researchers that's explicitly setting out with a neural net that's not really trying to solve a real problem, right? So it might be like, no offense, but sorting between dogs and chicken is like not actually a problem that anyone has. So setting up a neural net to disambiguate between the two is a little bit what I mean by it's a toy problem. And then building a network to attack it. And then they, you know, spend three to 12 months refining that and writing the paper and whatnot. So, it's not to say that it's not a thing that's actually happening, but it's happening in the lab. In the same way that there's all kinds of stuff that happens in labs,
Starting point is 00:51:46 but doesn't necessarily make it out into the real world. Now that having been said, those are the things that I've heard about. If there are adversarial attacks that are happening in the wild, then I probably wouldn't hear about them. Right. If, if you're out trying to attack,
Starting point is 00:52:01 say a self-driving car software, you're probably not writing papers about it and publishing them on Google Scholar. So, the absence of evidence that I'm aware of doesn't mean that it's not happening. It just means that if you're really trying to seriously attack something, you're probably not talking about it publicly. Yeah, it strikes me that the intent or the goal of the attacks needs to be taken into account too, because there's, like you said, there are the toy problems, and let's see if we can fool this thing. And then there's taking self-driving cars as an example. Oh, let's make self-driving cars do
Starting point is 00:52:41 something bad. Well, if your goal is to do something bad, there's probably a lot of easier ways to do that than to become an expert in machine learning enough to figure out how to fool a net. I'm just saying, it's like having glass doors and good padlocks. It's like, that's fine, but somebody will just break the window. So you're saying that instead of trying to fool a self-driving car's neural net, we should just spread caltrops on the road? Yeah.
Starting point is 00:53:08 I mean, somebody wants to crash a bunch of cars. It seems a lot easier to do it, you know, in a purely mechanical way. But that's just me. So far, I think that's right. I mean, like, devil's advocate here, imagine we were all driving self-driving cars, and then you could just, I don't know, with one software bug, make it so the transportation system was barely working. Like, that would be like a pretty big deal, but we're not there yet. But it's, it's not because, uh, the people who are doing
Starting point is 00:53:33 the, the adversarial attacks are, uh, well, I think they, I think they're, they're probably thinking longer term or bigger scale, but anyway, I, I take your point. Yeah. If I wanted to cause mayhem, there's easier ways to do it than with, than with hacking a self-driving car, at least right now. It is good to be thinking about these things, because you can picture eventually we'll get to a self-driving car society, and then you put some filter over their camera that makes it so anytime they see a stop sign, it doesn't actually happen, or some other way of fooling it. But this isn't specific to self-driving cars, and it doesn't have to be general. If you're going to murder someone, you're just going to murder them. I don't think it's hard to do it. I don't think you need machine learning to commit mayhem.
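To make the laboratory-setting attacks a little more concrete, here is a toy version against a plain linear classifier rather than a real vision network; the data is synthetic and the attack is the simplest one possible. The idea is the same as in the published adversarial-example work: nudge an input a small amount in the direction that most changes the model's score, and the predicted label can flip even though the input barely changed.

```python
# Toy adversarial example: a small, targeted nudge flips a linear classifier's prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Pick a point near the decision boundary, where a small push matters most.
x = X[np.argmin(np.abs(clf.decision_function(X)))]
w = clf.coef_[0]
original = clf.predict([x])[0]

# Step against the current prediction along the sign of the weights (FGSM-style, linear case).
direction = -np.sign(w) if original == 1 else np.sign(w)
for eps in [0.01, 0.05, 0.1, 0.5]:
    flipped = clf.predict([x + eps * direction])[0]
    print(f"eps={eps}: prediction {original} -> {flipped}")
```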
Starting point is 00:54:38 Let's just stick with that. Okay, so another question we got was about using neural nets for vision recognition systems, but then also using machine learning for trajectory planning and low-level motor control. When we do safety-critical systems, how do we figure out where it's okay to use neural nets and where we need to use mathematically rigorous algorithms? My background is not hardware engineering, obviously. So, forgive me if this is a shallow understanding. But I think without knowing exactly what a safety-critical system is- Something that keeps you alive. Yeah. So, my rule of thumb is like, if the neural net, like, I'm assuming like the neural net in some laboratory conditions does better than the mechanical system. And that's why we're even considering this at all, right? Because if the neural net does worse, then it's not,
Starting point is 00:55:48 it's like, why are we even considering swapping it out? Yeah, I think, like, at least for me, it would be like, what's the, I hate to be like, just talking about business here, but it's like, what's the problem that you're trying to solve? Like, is there an actual safety issue in the system that the mechanical processes are not showing themselves to be adequate for solving? I mean, if that's the case, that's a problem and you should address it. I don't know that the place I would start would be a neural net. The thing about neural nets is they're pretty complex algorithms. And there's, as you point out, they're fairly challenging to train. They can be pretty black boxy. They're hard to understand. There are a lot of lower-octane algorithms that you could try first that can still be an improvement over simple rules-based heuristics, but that are still a little bit more human-friendly and a little bit more robust to some of the adversarial attacks or anything like that. So I don't think those are the only two possibilities is,
Starting point is 00:57:09 is what I'm trying to say is there's a, there's some transition zones in between them that I would suspect are, are plausible intermediate steps. If, if you're finding them that the mechanical systems aren't cutting it. And again, if they're doing fine, then leave them alone. Yeah. When we do talk about control theory, it's funny because when I took AI for robotics at Udacity, they talked about PID control as though it was a machine learning technique. And to some extent, the machine learning fell in how to tune
Starting point is 00:57:46 your PID. But even that, it was just an algorithm. It was just normal. And then Kalman filters came up in AI for robotics. And I was like, it's just a Kalman filter. It's just normal. It's just math. And sometimes I find people get very focused on neural nets and machine learning. And even those, we're finding ways to make them less black box. Looking at the convolutional ones and how they shape different layers allows us to see what they are seeing and what they're cueing on, and things are becoming... I guess the soapbox I'm on is that machine learning isn't neural nets, and there isn't a single line that says this is algorithm heuristics and this is statistically biased or statistically informed. There's some gradient. I bet that was what you just said.
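Since PID came up as the thing the course dressed in machine-learning clothes, here is a bare-bones discrete PID step driving a toy plant; the gains and the plant model are placeholders, not anything tuned. The "learning" part, as the conversation notes, is really just how you choose kp, ki, and kd.

```python
# Bare-bones discrete PID update; the gains here are placeholders, not tuned values.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy usage: drive a crude first-order plant toward a setpoint of 1.0.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.01)
state = 0.0
for _ in range(500):
    command = pid.update(setpoint=1.0, measurement=state)
    state += (command - state) * 0.01       # stand-in plant, not a real system model
print(f"state after 500 steps: {state:.3f}")  # should be approaching the 1.0 setpoint
```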
Starting point is 00:58:52 I think, yeah, I think you put that really well. Yeah, it's not binary. It's continuous. As I mentioned, I've done some machine learning from the Udacity courses and I'm reading it. I'm reading about it. I'm trying it out on my little robot. I'm listening to your podcast, but overall, I don't feel competent in it at all. Do you have any advice for getting over the hurdle of, oh my God, I would never put this on my resume. What if they actually asked me anything about it? Well, I think there's a lot of imposter syndrome out there. So I know that doesn't fix it. But I always try to make a point of saying that. I think there are even people who are, you know, in my position, who would say they have some imposter syndrome, too.
Starting point is 00:59:40 That's, that's part of the reason that I like doing projects. If you're, you know, specifically thinking within the context of an interview, or just in general, you know, if you've done a pretty in-depth project, then you've probably bumped up against a bunch of stuff that's hard and you've thought about it pretty deeply, and in sticking with the project and continuing progress on it, you've implicitly sort of overcome it or worked around it or whatever. And those are the things that, you know, force you to go out and learn stuff, and that's how you become an expert. I don't think there's a way to... And I think that that does instill kind of robustness or confidence, at least for me, in a way that coursework never did. So, if I had to build something or if I had to make something, I would feel like I had some understanding or some mastery. If I just took a course about it or read a book about it, then, you know, I felt like my understanding was shallower. So that's part of the reason I'm a big advocate for that.
Starting point is 01:00:52 One of the things that I saw from, I think, a presentation you gave was that willingness to do math was a good criteria to get into machine learning and data science. But so many people aren't willing to do math. Do you have any idea why? Are we lazy? Math is hard. But we used to love math. Do you remember being a kid and the geeky love of math and puzzles? Yeah, that was before before you get to, you know, multivariable calculus.
Starting point is 01:01:26 Statistics. The brain-bendingness. I mean, I think that's kind of why I think math is a decent heuristic for this, honestly. It's like, machine learning is not... Okay, part of the reason I say that is because math is just
Starting point is 01:01:41 embedded within machine learning. Like, you have to know some linear algebra in order to write some of these algorithms or to understand how they work. Now, that having been said, there's really good libraries out there and you can use these algorithms without understanding deep down how they work. I will be the first person to say there are data scientists out there and I have counted myself among them, who use things without fully understanding the math all the way down. And I think that that's, you know, not, it's not ideal, but it's understandable. And it's how the world keeps moving forward. And like, I'm not going to judge too much. But I think that one of the things that math is a good heuristic for is a little bit of like pain tolerance and sort of everyone's had the opportunity to take math classes and many of us have, you know, struggled to various extents.
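For a feel of the math sitting inside even the simplest of these algorithms, here is least squares fit by a few gradient-descent steps written directly in numpy; the data is random and the learning rate is an arbitrary choice. This is the kind of linear algebra the libraries normally hide.

```python
# The linear algebra under the hood: gradient descent for least squares, by hand.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # 200 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = np.zeros(3)
learning_rate = 0.1
for _ in range(200):
    gradient = 2 * X.T @ (X @ w - y) / len(y)    # gradient of the mean squared error
    w -= learning_rate * gradient
print("recovered weights:", w.round(2))           # should land near [1.5, -2.0, 0.5]
```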
Starting point is 01:02:36 And if you haven't struggled, then there's a good chance that machine learning, at least the very technical pieces of machine learning, might come easier to you than they did to me. But you have to be a little bit tough in this field. And you have to be very rigorous in your thinking. And you have to be very dogged, in the sense that many of these problems have a right answer and many wrong answers, and sometimes close is good enough, but sometimes it's not. Math keeps you disciplined in that sense. So that's the other thing that math has going for it: for most people it is pretty tough, and it's sort of related to stuff that you need to know. So it's not a bad way to tell if you're, like, I don't want to say tough enough, because that glamorizes it in a way that I don't mean to,
Starting point is 01:03:39 but it's a little bit i can't come up with with a better phrase. Like, it's a good way of knowing if you're tough enough for some of the technical challenges that could be slung your way. I heard about Ada Lovelace doing calculus problems in her free time. And I thought, if only I could do that. And then I realized that that was one of my top 10 stupidest thoughts um and I got a calculus book and I started working through it working the problems and it was it was what you're talking about I had forgotten the persistence needed to fight through long, variable, algebraic things. And the stick-to-it-ness and the keeping things organized, even though these were all things I do in my job,
Starting point is 01:04:36 there's a level of rigor with just doing the math and knowing whether or not you're going to get to the right answer and being able to show it. And the book I had was very applied, so there were many physics things going on that I had to think about and had to draw, and I had to remember how to do the drawings in a way that would lead me to the solution. So I kind of agree. And I'm embarrassed that I needed to do that, but it was useful and it has helped me
Starting point is 01:05:06 move on to things like Udacity and machine learning and doing more data science. But I had fallen out of the habit of even things I loved, like the signal processing, which, I mean, I'm always going to be a fan of anything dealing with Fourier, but I had just started to rely on the libraries and not gotten too far into the math anymore, because you don't have to. But the math informs things. It reminds you of what's underneath it all. And so I see why you're saying math is important. I guess, let's see, I only have a couple more questions, and, you know, I have to get back to the imposter syndrome. Do you feel that way? Do you have problems with it? In certain contexts, I do. So it depends on who else is in the room, I guess. I think the podcast actually helps a ton
Starting point is 01:06:09 with this, because the place where I feel imposter syndrome a lot is kind of this, like, oh, have you heard about this thing yet? Everybody's talking about it. And if I'm like, no, then I don't feel like a real data scientist. But the podcast forces me to always be looking for new things to learn about and new things to talk about. But there are definitely situations, like if you were to put me in a room with a lab full of machine learning PhD students and their professors and postdocs and stuff, like, yeah, I would not feel like I was one of them. Okay, you have a PhD from where? Some little school that nobody's ever heard of. Stanford. You worked at CERN, which, like, everybody here is like, why aren't you asking her about CERN? I want to know about CERN. Oh my God, the Higgs boson. Oh my God, it's so cute. What? You did a class with Sebastian Thrun, who is just, oh my God, he's amazing.
Starting point is 01:07:18 And you've been doing a podcast for four years that helps people understand, not just teaches them basics, but gets to the intuitive understanding of machine learning and data science. And you feel this way? Well, yeah. I mean, there's a bunch of stuff I haven't done, right? For example, something that I've always wanted to take, like, six months and three textbooks and just learn is signal processing, actually. And I'm not joking about that. It's one of the most interesting fields that I've not learned, or that I know about but haven't learned. Let me put it that way. There's all kinds of interesting stuff out there, but it just hasn't happened. But there are times when I would, like,
Starting point is 01:08:10 fake it a little bit. Like, I would go read about Fourier transforms for a while. We learned about them a little bit in physics for other reasons, so I wasn't totally making stuff up, but, you know, I'm not a signal processing expert. So if I had been doing that episode with you instead of with Ben, you would have just been correcting me all over the place. So that's a little bit what I mean by, like, I haven't done everything yet. There's a lot of stuff out there. And like I said, it's a field that moves so fast that even if you were an expert two or three or four or five years ago in something interesting, like, I don't know,
Starting point is 01:08:50 like it's moved on, uh, something else is new. So it's very hard to get, at least for me, I don't get very complacent in, in that, uh, context. And so I guess that's for me kind of related to feeling imposter syndrome. Learning is just your turn. It's not about faking it. It's not about being an imposter. It's just about learning. Everybody has to learn this stuff. I'm sorry, I'm giving you such a hard time, and I do this all the time as well. So to some extent, I want to hear other people say it because you are accomplished. And you're using your imposter syndrome in a good way.
Starting point is 01:09:31 You're using it to continue growing and continue learning, which is a great way to take these feelings of inadequacy and fear and turn them into something that is helpful to lots of people. Well, thank you. I appreciate that. I think maybe I'm ready to end the show on that note, unless you have a favorite episode of your podcast that we should start with. That's an interesting question. So we don't do very many interviews, and I would not call myself an excellent interviewer by any means, but there was one that we did. This would have been around the very end of 2016, or maybe the first week or two of 2017, where we interviewed a researcher named Matt
Starting point is 01:10:27 Might. And he sits at kind of the intersection of genomics and computer science and is doing like genomics research for understanding genetic diseases and stuff. And I thought he was an incredibly interesting person to talk to and has like a very compelling personal story and is just an incredibly brilliant and accomplished person. So I don't know if that's exactly the one to start with because it's going to give you a funny flavor of what we do. It's an atypical episode, but it was one that I really enjoyed. I'll put a couple of my favorites in the show notes as well.
Starting point is 01:11:09 Do you have any thoughts you'd like to leave us with? This isn't very profound, but the thing I want to mention is, if any of this excites you and you're in the Chicago area and looking for your next thing, Civis is hiring. We're hiring for my teams; we're hiring all across the company. We're not a huge company by any means. We're still a startup; we're only five years old. But if you're excited and you want to talk, check out our careers page, civisanalytics.com, probably slash careers. I don't actually have the URL on me, but yeah. That would be in the show notes, of course.
Starting point is 01:11:49 Cool. Our guest has been Katie Malone, Director of Data Science in the Research and Development Department at Civis Analytics. Katie also hosts Linear Digressions, an excellent podcast making machine learning concepts accessible. Thank you for being with us. Thank you so much. It's been great. Thank you to Christopher for producing and co-hosting, and thank you for listening. You
Starting point is 01:12:13 can always contact us at show at embedded.fm or hit that contact link on embedded.fm. Thank you to Tom and George for their help with questions this week. If you'd like to find out about guests in advance, please support us on Patreon and then sign up for the Slack. Now, a quote to leave you with. This one's going to be from Bob Ross. Anything we don't like, we'll turn it into a happy little tree or something. We don't make mistakes.
Starting point is 01:12:40 We just have happy accidents. Embedded is an independently produced radio show that focuses on the many aspects of engineering. It is a production of Logical Elegance, an embedded software consulting company in California. If there are advertisements in the show, we did not put them there and do not receive money from them. At this time, our sponsors are Logical Elegance and listeners like you.
