Embedded - 252: A Good Heuristic for Pain Tolerance
Episode Date: July 5, 2018
Katie Malone (@multiarmbandit) works in data science, has a podcast about machine learning, and has a PhD in physics. We mostly talked about machine learning, ways to kill people, mathematics, and impostor syndrome. Katie is the host of the Linear Digressions podcast (@LinDigressions). She recommended the Linear Digressions interview with Matt Might as something Embedded listeners might enjoy. Katie and Ben also recently did a show about git. Katie taught Udacity's Intro to Machine Learning course (free!). She also recommends the Andrew Ng Machine Learning Coursera course. Neural nets can be fooled in hilarious ways: muffins vs. dogs, labradoodles vs. chicken, and more. Intentional, adversarial attacks are also possible. Impostor syndrome is totally a thing. We've talked about it before. You might recognize the discussion methodology from Embedded #24: I'm a Total Fraud. Katie works at Civis Analytics and they are hiring.
Transcript
Welcome to Embedded. I am Elecia White, here with Christopher White. Our guest this week
is Katie Malone, a data scientist with a podcast about machine learning.
Hi, Katie. Thanks for joining us today.
My pleasure. Thanks for having me.
Katie, could you give us some background about yourself? Sure. So I've been in data science
for about three years. I work at a startup in Chicago called Civis Analytics that does
data science services and technology work. Before that, I was finishing a PhD in experimental
particle physics. So I did a bunch of searches for new particles.
And so it's been a really fun transition from academia to data science. You still get to do
a lot of the programming and a lot of the statistics and machine learning,
but usually a little bit quicker and with a little bit of a bigger impact. So I'm really loving it.
You worked at CERN.
I did.
I spent a couple of years there in graduate school doing research.
Yep.
I want to ask you about that,
but I have so many machine learning questions that I'm not sure we'll get to it.
Before we do get to the deeper questions, we have lightning round.
Okay.
We'll ask you short questions.
We want short answers.
And if we are behaving ourselves, we won't ask for more detail.
Sounds good.
Christopher, do you want to start?
Sure.
What's your preferred language to develop in?
Python, although I don't program that much anymore, quite honestly.
TensorFlow or Keras?
Keras.
Bayesian or frequentist?
What are these questions?
Bayesian in real life, frequentist if it's on a computer.
Favorite style of neural net to identify animals?
The one that's inside my brain.
How much data is enough data?
More than what I have.
That's always true.
Okay, favorite style of neural network to identify cars for a self-driving car?
Oh, that's an interesting question.
I don't remember what they used for the self-driving car.
There was a point where I actually knew the right answer for this one.
I'm sure it's some kind of convolutional net for the image recognition,
but then I think they had some kind of interesting stuff going on inside for some of the decision-making.
It wasn't a quiz. It was personal opinion.
Oh, well, I mean, I just want to pick the one that works the best.
I assume it's the one that's out there.
So can you tell us about your podcast?
Sure. Yeah, I'd love to. So it's called Linear Digressions,
which is a little bit of a pun.
And I've been doing it for coming up on three.
No, coming up on four years.
Yay!
Oof, that's a long time.
I do it with a friend of mine, Ben Jaffe.
So we used to work together back when I was still in grad school.
I did a summer internship at Udacity, which is a company that does online learning and courses and stuff.
I was putting together a machine learning course and had too much content for what I could put into a single course in three months worth of work.
And I'm a big podcast fan, have been for a long time.
So I thought that those were kind of the raw ingredients
for an interesting little project. And Ben is not a machine learning person, but he had some
audio background. And we just get along really well and he's interested in this sort of stuff.
So that was a little bit about how we got started. And so now we're still doing it about i guess like i said four years coming up on four years later
so every week roughly we have a new episode about something in data science or machine learning and
that's evolved a little bit over the course of the over the course of the podcast once it started
out it had maybe a little bit more stuff about physics, for example, because that's what I was doing. Then it started to have a little bit more of a managerial role now than I used to be.
That's why I don't program quite as much anymore.
But I'm thinking a lot about how machine learning and data science are being used to solve real
problems.
And so, again, that's sort of a direction that we are taking it now that hasn't gone
as much in the past, but it still has pretty strong backbone of the general idea that
machine learning and data science are fields that move very, very quickly. And it's hard to
keep up with sort of the gist of all of the new developments that are happening. And
quite frankly, I think there aren't quite enough medium-technical resources for somebody who wants something that's more detailed than, say, the popular press or a New York Times article, but not so detailed as a scientific paper.
We try to sit in that gap for non-technical folks who want to feel like they're keeping up with this stuff, and for the technical folks who want to fulfill their breadth requirement and keep up with a field that's moving really quickly.
So it is just you and Ben? And usually Ben is learning about something you're talking about? He represents the audience?
Yes.
Okay. It's been four years. How much is Ben still learning?
A ton, I think. It's been interesting. So his background is he's a front-end web developer,
so he's quite technical. And in some topics, he's even more technical than I am. So when we were
talking, we had a recent episode about Git and GitHub and version control, so he's of course just as well versed in this as I am, maybe more so. But it's funny, we sometimes chat about this, usually not while we're recording or anything, but I think, by his own telling, he's learned a lot. And I can see the connections that he's making between the topics as we're covering them. Like, I think he's just much more comfortable, almost unconsciously, with these topics than he used to be.
But he's a deeply curious person.
I don't think he would ever say that he has mastery or anything.
Most practitioners in this field wouldn't say they've achieved mastery.
But I think that's been one of the things that's been really fun for him is kind of that learning,
but continuing to have sort of that beginner's curiosity as, like you said, sort of a stand-in
for some of the folks in our audience that we think are probably in shoes that are really similar to his.
There are callbacks to previous episodes and the episodes seem to build on each other. Do I have to start at the beginning?
No, no, definitely not. We try really hard to make the callbacks as unobtrusive as possible.
There are, I guess the one exception to this is sometimes there's
episodes where it would take an hour to really unpack an issue or an item in all of its
interesting detail. So sometimes we'll split that across two or it's usually two, maybe sometimes
up to three consecutive episodes. And then you might
want to listen to those in a unit. But mostly, one of the things we're trying to do is contextualize
this stuff a little bit too. So machine learning, when I was starting, it felt like a lot of
disconnected topics. And I would learn a topic and then stop learning it and then move on to
the next topic and like learn it and then stop and then move on to the next topic. But as I've grown as a scientist
and as the field has grown, you start to see some of the common threads. And so that's been one of
the things that's been really interesting is that there's stuff that we were saying two and three
and four years ago that just isn't true anymore because the
field has changed. And so that can be a really interesting thing to call out when it happens. Like, hey, there's basically this living archive of this field, and of ourselves understanding it, as we've been going. And you can sometimes opportunistically drop back and look at the time capsules, and that's kind of fun.
Yeah. You recently mentioned dropout layers, which is still something people are using with
convolutional neural networks to find like cars, but it may be going away. I mean, the math has never really made sense,
but there may be better ways to handle the dropouts. And you did talk about that somewhat
recently. Yeah. I mean, and I think like sometimes for sure there's topics that we cover because not because they're the new hotness, but because they're just these big items from, you know, two or three or five or, you know, sometimes 100 years a jack of many trades, a master of quite few of them.
I wouldn't claim expertise in neural nets. And one of the things that has been hard for me trying to
keep up with that field, number one, it's just moving so fast. And number two, there's these big
topics like dropout that are assumed knowledge if you're trying to read a
paper. And I was finding myself having trouble actually reading some of the neural net papers
and keeping up with new stuff like capsule networks or something because they assume that you
already know what dropout is. I didn't. So go out and learn it. That's fine.
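For reference, the mechanics of dropout amount to a single line in most frameworks. A minimal sketch in Keras (the library Katie picked in the lightning round); the layer sizes and dropout rate here are illustrative, not from the episode:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    # Dropout randomly zeroes 30% of activations on each training step,
    # so no single unit can be leaned on too heavily. It is active only
    # during training and is a no-op at prediction time.
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```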
But part of what I try to do, like part of the reason that I do the podcast, is because
then I try to hook that back into sharing that out with people.
So I learn things so that I can share them out and I share them out as a motivation for
me to go learn them.
So yeah, it's an interesting field to try to learn because, like I said,
there's so much context there. And sometimes that's what we're trying to achieve is just
doing some of that footwork for our listeners.
You must get asked, like, a lot, how do I get into machine learning? Are you sick of that question?
I think it would be kind of mean to say I'm sick of it.
You can be mean, it's okay.
One thing I've been contemplating, there's kind of a saying that once you've replied to the same email three times, you should just write a blog post. Somebody writes me, it's usually in an email, and the gist of it is, how can I get into machine learning? I think I'm definitely past the three marker. So, maybe
that's my signal that I should just sit down for a couple hours and write out what I usually say.
But what I usually say is mostly the same. It's pretty hard to do anything that's
very tailored to any individual, because I don't know most of these people on a personal level,
or what kind of strengths and weaknesses they might have. So, I usually tell them similar
things, which is when you're just starting out, one of the things that at least I
look for when I'm interviewing people that don't have a lot of experience is projects, basically.
So in particular, I think projects are really interesting for a few reasons. Number one,
they actually make you a better programmer,
a better data scientist. They force you to think about actual real world problems.
So they're just a good training ground. Second is that especially if they're projects that,
like I'm much more keen on projects where people go out and they collect a data set because they're actually interested in trying to formulate and solve a problem versus doing like a Kaggle competition.
And I think that's much more realistic. It's much closer to the real world. Most of real world
machine learning and data science, somebody doesn't hand you a cleaned up data set and say,
please predict this column. Like you have to go out and make that data set and formulate the
problem and understand what column you're supposed to be predicting and all that sort of stuff.
So if somebody is doing a project that demonstrates that they've thought about a
problem on that level, that's pretty unique and really valuable to me. And then the third thing
that's cool about projects is, again, especially if it's a project that someone goes out and starts on their own or contributes to on their own, it tells me something about what they're interested in and what kind of data scientist or machine learning person they might be, which is sometimes hard to get any other way if somebody hasn't had work experience. So it just, it gives you something interesting to connect with people on, on a personal level where you're like, oh, you really
care about basketball. Like that's kind of interesting. Tell me about why you care so
much about basketball. Tell me about what's going on in basketball that you decided to go out and solve.
Or sometimes people have projects that are, you know, social good projects or things that they
really, you know, problems that they care about in society, and they go out and they try to do data science
around that, which I think is incredibly cool. So it just gives me something sort of personal to
also connect with that person. And so if you're on the other side of the table, and you're a person who's trying to get started in machine learning, I think that those are,
you know, the project approach is like one of the more fun and interesting ways to do it. And it can
be, at least from my perspective, like a really good way to communicate out what it is you care
about and how it is you work, which is really key to getting started.
If you're, you know, especially if you're looking for other people to join you or for folks to
give you a shot at their company or whatever. Do you have resources, classes, books, blogs,
suggested reading, listening? I mean, other than your own podcast, of course. That's a good question. It depends a little bit on what somebody's asking for.
I think the Coursera machine learning course is really, really good.
The Andrew Ng one, from Stanford.
Yeah, that is fantastic.
Yes. So that's a total classic.
Anyone who hasn't done that or something that's like equivalent, I would send them there first.
I don't have, I have a lot of-
Can I recommend your own Udacity class?
Oh, sure.
I mean, it's good too.
Why not?
Yeah, that's pretty good. Why not? For sort of out-of-the-box machine learning, there's actually too many resources for me to give you a great
answer to that question. There's so many that I've come across through the podcast. We keep a website, lineardigressions.com, and each time we have an episode release, we post a little thing on there. And it's gotten to the point where I
actually use that as kind of a library or a link blog for myself to go back and find interesting resources that I've come across.
But maybe it's been a couple years and I don't remember where to find it.
I can usually try, you know, kind of reverse engineer what episode that might have been and then go look it up.
So it's sort of it's sort of funny.
There was an extreme example of this. There was one time I was teaching, or about to go give a lecture on a fairly technical neural net topic, I think. And I was a little rusty. It had been a while since I thought about it in great detail. I meant to read up on it the night before, but then something came up, I forget. And I actually listened to an old episode of the podcast, because I was like, I think I once did this. That doesn't happen very often, but that's kind of an extreme example.
I'm laughing because I did that once too. I had to give a presentation on BLE, and on the way to give the presentation, I listened to a show we did with Josh Bleecher Snyder.
And I'm pretty sure I just quoted everything he said.
Yeah, I mean, you did the work.
You might as well take advantage of it.
So that's great for the newbies.
I am an established embedded software engineer.
I'm not looking to change careers, but machine learning keeps nibbling at the edges. People
want it, people are excited about it, but I'm not sure how we're going to put it into tiny devices that don't have gobs of memory or gobs of processing power. How do I learn enough to know what will affect me? What do I need to know about machine learning?
Yeah, I was thinking about this. I think it's a really tough question, because it gets at the heart of what's possible. And it's a little bit hard to go find a good resource that tells you sort of what's possible that's at the right level of technical detail. Like I said, that's the place where I always found the biggest gap. There's really good, super duper, overly technical stuff. There's really good
popular science stuff. The stuff in between is where, at least for me, where I want to do the
most reading and where I find sometimes a gap. The best, this is not a great answer, but it's the
thing that if you have access to it, it's probably the best one, which is
make friends with a data scientist or a machine learning person. Like maybe there's one at your
company or there's meetups nearby that you can strike up a conversation. Because
that allows you to talk to each other as humans. And sometimes the thing that makes
figuring out like if machine learning, what about machine learning is interesting and what's just
hot air, like part of the thing that makes that conversation challenging is that machine learning
has a lot of jargon. And I imagine that, you know, whatever other technical people, like if you're working on super small devices or something, in order for you to explain in great detail what you do too, like I would probably be mystified by it for a few minutes as well.
And so if we were standing there face to face, you would see that and be able to unpack it a little bit for me, that sort of thing. So, if you have access to a data scientist or a machine
learning person that doesn't mind, you know, going out for a coffee or going out for a beer and just
like telling you what they're thinking about or what they're working on, I think that's the best
thing. Well, that's good. As long as you're here, I have a few questions for you. Oh dear. Inference versus training. This I know is very important. And I
know that when I want to train a network, it takes... I mean, this is why I use the AWS cloud. This is why I use our gaming computer. But inference? Can you tell me about the difference between inference and training?
Yeah, so if I understand the question, I might phrase it a little bit differently, which is inference versus prediction. So prediction, I usually think of as kind of your standard
machine learning output, which is, I have a bunch of data. There's patterns in that data that allow me to
predict something that I'm interested in, use statistical learning methods to kind of figure
out what those patterns are, and then use them to predict that outcome of interest on new cases
where you don't already know the answer. Inference is, it uses in some cases, similar techniques, stuff like linear and logistic
regression, but it's based on statistics in kind of a more rigorous way than some of the more
pattern matchy stuff that's in machine learning. And inference is valuable because it can tell you
why. So it can make predictions, but it can also tell you
sort of what inputs to those predictions are affecting the outcome. And so depending on the
problem that you're trying to solve, like if you want to take an action based off of your prediction
and an action that's going to change the outcome that someone experiences,
then it's likely, although not guaranteed, that what you want is something like inference.
But if what you want is just something that's as accurate as possible in giving you the right
prediction of what's going to happen, then often prediction or machine learning is what you want to do. And the reason that the
difference matters also is that inferential statistics can give you predictions,
just like machine learning can give you predictions. But usually the predictions that
you get in machine learning are higher accuracy than the ones that you get with inferential
algorithms. And so that means
that you pay a little bit of a price for having something that's more explainable, which makes
some sense. But it means then that sometimes there's kind of a mismatch in expectations then
when you're using these algorithms, because very often people don't realize that there's some trade-off between the predictive power of an algorithm and its explainability.
And that can sometimes lead to bad outcomes, because a lot of times people want to understand
why an algorithm says something in order to trust it and to feel like they want to use it.
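A minimal scikit-learn sketch of the trade-off Katie describes; the dataset and model choices are illustrative, not from the episode. A logistic regression exposes coefficients you can reason about, while a random forest usually predicts better but explains less:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inference-flavored model: the coefficients offer a "why" for each input.
lr = make_pipeline(StandardScaler(), LogisticRegression())
lr.fit(X_train, y_train)
print("logistic regression accuracy:", lr.score(X_test, y_test))
print("first few feature weights:", lr[-1].coef_[0][:5])

# Prediction-flavored model: often more accurate, much harder to explain.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)
print("random forest accuracy:", rf.score(X_test, y_test))
```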
This is awesome and confusing, because the NVIDIA material uses inference in place of prediction.
I mean, even their outputs of their machine learning, of their AlexNet neural nets, they call them inferences.
And so for them, anything you're doing that isn't training, anytime you're running the machine learning algorithms forward, you're doing inference.
Oh, I see.
So they use inference and prediction interchangeably.
Yes, we would call that like scoring probably where I work anyway.
Yeah, jargon, right?
Yeah, and I was thinking of causal inference, which is like a subfield of statistics.
Huh. Okay. Yeah. Let me take a second crack at your question. I'll try to be a little bit,
a little bit briefer. So what's the difference between training and inference or training and
prediction? So one is where you're actually learning what the patterns are between the inputs to your algorithm, like your all of your data and the outcome of interest. So that's
that's the training part. And that can be really slow and really painful, because there's all kinds
of different patterns that can occur. And you need to, you know, potentially search a lot of space in
order to find them. Once you've found the patterns and those are somehow encapsulated within a data structure,
like a binary tree or a linear equation or something, then inference or prediction or
scoring is taking those patterns, matching them up against new cases, and using that to figure
out what the outcome of interest is for those new cases. That is more along the lines of what
they told me, but I am so happy to hear that there are differences even from experts.
And so the scoring, the prediction, the inference, whatever it is we call it, that has a much lower cost of processing.
That's a fast thing compared to training.
Yeah.
Generally, yeah.
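That cost asymmetry is easy to see in code. A minimal sketch with synthetic data (sizes and model are arbitrary): fitting searches for the patterns; predicting just applies them.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((50_000, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)

start = time.perf_counter()
model.fit(X, y)  # training: search for the patterns (slow)
print(f"train: {time.perf_counter() - start:.2f} s")

start = time.perf_counter()
model.predict(X[:1000])  # scoring: apply the found patterns (fast)
print(f"score: {time.perf_counter() - start:.4f} s")
```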
And so that's the part that usually affects the embedded systems more.
You take all the data you can find, you train the system
somewhere else with some specialized people with specialized skills, and then they come back and
they say, these are the features of interest, and we want you to run this, what ends up being a
linear equation. I mean, it's a forward neural network, but at the end, it's just some multiplies and adds. And then I look at them and say,
these features make no sense.
You can't ask me to do an FFT
and then only use some of the results
because that FFT doesn't make sense like that.
Let's just do it more mathematically efficiently.
And they're like, no,
you have to have the features be exactly this.
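For the curious, what the device ends up running really can be that small. A minimal sketch of a feed-forward pass with made-up weights; the shapes and activation are placeholders, not anyone's shipped model:

```python
import numpy as np

# Hypothetical weights, handed over after training somewhere beefier.
rng = np.random.default_rng(0)
W1, b1 = rng.random((8, 4)), rng.random(4)
W2, b2 = rng.random((4, 1)), rng.random(1)

def forward(features):
    """One pass of a tiny feed-forward net: multiplies, adds,
    and a clamp (ReLU). This is all the embedded side has to run."""
    hidden = np.maximum(0.0, features @ W1 + b1)
    return hidden @ W2 + b2

print(forward(rng.random(8)))
```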
It sounds like a very specific complaint.
Okay. Maybe that one is a little specific, but it's not, it's not the only one. I do see this
a lot where the features, you just look at them and you're like, yeah, that makes no sense.
Oh, well, so like, it sounds like you, you're experiencing a version of the thing that I gave in my first answer when I didn't totally understand the question, which is the interpretability of the results, right? So by interpretability, I mean, does a human, when a human looks at it, can they, can they understand it? And does it make sense? Those are slightly different questions. You can have something that's very simple to understand, but it doesn't make any sense is what I mean. Or you can have something that makes a lot of sense,
but it's so complicated that you can't fit it all in your head at the same time.
So it sounds like what the machine learning folks did was they built an algorithm that
doesn't know anything about the world. So it doesn't understand what an FFT is,
or it doesn't understand what variables should make sense. It's just being trained to make
predictions that are as accurate as possible. For some reason, this happens all the time,
it finds some combination of features that don't make a ton of sense to you as a human. And that combination of
those features is what gives it the most accurate predictions. And now you're at a little bit of an
impasse, right? Because the machine learning people, it's really hard for them, or it feels
unfair in a sense to have to go back and tell their algorithm that they solved the problem
wrong. And from a technical perspective, that can be really challenging to do.
But I totally get what you're saying. You're like, this doesn't make any sense to me.
And so then you have to decide where to go from that. And that's more of like a,
it's not really a technical problem exactly. Like you could take the outputs of their algorithm and put it into a thing, I assume, mostly.
Yeah.
Oh, yeah.
No, the actual algorithm is not hard.
It's just, why would you use mathematically equivalent things in different ways?
Why would you not allow me to use the most optimal math? And that may be due to overfitting. In that case, it was overfitting, because the training doesn't take into account the way the features are used later. Right? Because it's optimizing. So if it knew, oh, yes, I'm doing these features,
but that makes your later algorithm really inefficient.
If you fed that back to it,
wouldn't it find a different space in that vector space of solutions?
Yeah, I suppose it could.
I mean, you'd have to basically come up with like a,
you know, it's like a constrained optimization problem, or there's some kind of constraint that
you're putting on the neural net that says, even if it gives me a good, you know, quote,
unquote, good answer, I don't want something that looks like this, it has to be within like
this range. You know, from a technical and practical perspective, this can be
trivial to extremely hard, at least from what I understand.
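One way that kind of constraint can show up in practice, sketched in Keras: fold a penalty for answers the downstream system can't use into the loss. Everything here, the "implementable" band, the penalty weight, and the model, is illustrative.

```python
import tensorflow as tf

def constrained_loss(y_true, y_pred):
    """Usual prediction error, plus a penalty that grows as predictions
    leave a hypothetical implementable band of [0, 1]."""
    error = tf.keras.losses.mean_squared_error(y_true, y_pred)
    violation = tf.maximum(0.0, tf.abs(y_pred - 0.5) - 0.5)
    return error + 10.0 * tf.reduce_mean(violation, axis=-1)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=constrained_loss)
```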
I'm not, again, a huge neural net expert. But yeah, I mean, it's kind of interesting because
I mentioned my background was in physics, and high energy particle physics is a field that,
at first glance, looks like it would be great for machine learning, because there's just huge amounts of data being created. It's one of the most data-rich fields you can work in. And what you're trying to do is basically advanced pattern matching, to try to figure out if there are certain signals in those huge piles of data. And that seems like something that would be quite nice for machine learning. But physicists understand theoretical physics,
and there's like a bunch of math that they understand that underlies the way that all
these particles should be behaving and created and decaying and all this sort of stuff.
And machine learning generally doesn't respect any of that, right?
It can kind of go do all kinds of crazy things in order to give you an answer. So, there was actually, you know, from my view anyway, a surprisingly high amount of resistance within
high energy physics to machine learning for a long, long time, which is not to say that it
was never used, but there was a lot of skepticism relative to some of the other fields where there wasn't that deep theoretical understanding.
And I think it's been changing a lot in the last five years or so within physics.
There are some people who are doing some really incredible work there and trying to narrow that gap somewhat, as well as just making the case for why machine learning still deserves a place at the table.
But just as a little bit of an aside, yeah, it's a huge issue. It's a huge, huge issue. If you work
in data science or machine learning, you should be thinking about this very deeply. You should want to
be doing work that has an impact. And if part of what that means is that people need to understand
it in order to be using it in their decision-making processes or in their work, and they don't trust it, then addressing that, I think, would be ideal for the field overall. Sorry,
I got on a little bit of a soapbox there, but it's something that I think is really important
and that doesn't get enough consideration. No, I like that soapbox because when an algorithm
seems exceedingly fragile and weirdly dependent on things that shouldn't matter, I do want to be able to go
back to my data scientists and say, are you sure? Can you make it a little bit more explainable?
Because having to be perfectly predictable in what has to be a limited subset of the world,
those are questions I have to ask because you can take this pile of data and
you can give me an answer. You can give me the highest predictability for this data.
But if I put this in my device, it's going to see a lot of other data. And when it fails,
someone's going to say why. And they're going to be right to ask why.
Machine learning and why, I hate waving my hands. I at least want to be able to say,
well, these are the factors we took into account. And when we look at your system,
those factors don't work. And here's maybe some reasons why or some ways we failed,
but I don't want to have to just say, I don't know, machine didn't like it.
Yeah, no, I think that's right. I mean, one of the hardest problems in machine learning is
taking your algorithm out of the lab and then putting it into a place where it can see potentially new stuff that it's never seen before.
And, you know, I think that's why we've seen some pretty incredible successes come out of machine learning programs.
Like a good example is AlphaGo.
Like 10 years ago, we thought there was no way that computers could ever beat humans at Go. And now they've beaten the world master very handily, with entirely novel plays. I mean, moves that we would have said, that I would have said, no, that doesn't look right, because that's not at all how a human would play. And yet it was right. It was totally right.
Yep. Yep. And that's one of the things that probably your machine learning person also has in their head a little bit when they hand you this algorithm. If you say, this doesn't look right, and I don't think it's going to work, they'll say, well, yeah, but you didn't think AlphaGo was going to work either. And that's a little bit unfair.
And usually, they're probably wrong.
But the point that I wanted to make, though, that, you know, the advantage that AlphaGo has that you should keep in mind is that Go is a very tightly controlled environment, right?
So it has rules. There's like a little grid, and the grid always has the same number of squares on it, and there's the little tiles, and there's only certain places where they can go. And it's very orderly.
It's very complex, but it's very orderly. And I would contrast this with, you know,
something that's a little bit more, uh, self-driving cars.
Anything in real life?
Yeah, I think self-driving cars are a fantastic example, right? Like, there are rules of the road
and there's rules of driving, but, you know, there's weird stuff. Yeah, there's weird stuff
that can happen when you're on the road. And that's not an excuse for algorithms getting it
wrong or for them being undertrained or for stupid mistakes or anything.
But like, I, when I was at Udacity, I worked with one of the, one of the people who was, you know, kind of the father of the self-driving car. And I had a chance to talk with him a lot
about what that was like in the development program that he was building out at Google
before he came to Udacity, before he started Udacity as a founder.
Sebastian Thrun.
Yes, yes.
And he told a story, I don't remember if it made it into the final cut for the class or not,
but told a story that they took a self-driving car out onto the highway.
And this probably would have been in like 2012, maybe.
This was a while back.
So, we did the course in 2014.
So they're out on the highway and they're going whatever highway speeds, 60, 70 miles an hour.
It was in self-driving mode and a plastic bag flies across the highway. Something that you and
I would see and you know, all those things being equal, I don't want to hit it, but I'm not going
to do anything crazy to avoid a plastic bag. Right? At the same time, it's not an extremely common thing. Like, I don't think I've seen a
plastic bag blow across the highway in, let's say, at least a year. So, the car sees the plastic bag
and it goes into just like a full-on stop and, you know, maintains control of the car and everything,
everything was fine, but clearly the car, you know, the car was freaked out if that's possible.
And they took it back and they looked at the logs yesterday and they, or they looked at the logs
that night, excuse me, and realized that it had set off, you know, an alarm that was equivalent
to basically like,
I think there's a kid in the street, right? Like a kid ran out after a ball in between cars or
something. And then, that would absolutely... yes, I can see how a plastic bag and a small child might look not that different to something that, you know, doesn't have a huge number of cases of seeing either one of them.
And if there's a child, then absolutely, you want to do everything that you can to stop the car and to not hit the kid. So, that makes a lot of sense. But I give it as an example of, like,
how driving is messy in ways that we as humans don't think of very consciously. Like, we have really good
systems for kind of dealing with some of that ambiguity and with learning very quickly from
just a few examples. And we've all also been learning for, I guess, most of the people who
are listening to this for dozens of years. Compare this with a neural net that's maybe
trained for 45 minutes or something. That's not that long to learn everything that you need to know about the world. So anyway,
I think it's, it then gives a little bit of perspective, though, on the original question
that you asked about, like, when should we, when does machine learning, like, really shine? And
when does it really give the best predictions? And when is it kind of underpowered or feel
underpowered? And I think it has to do with, you know, how orderly the system can be. And,
you know, are there lots of ways that in the course of collecting data, noise can creep in?
If so, then machine learning
is going to, you know, algorithms are going to struggle somewhat more than if there's
just lots of very orderly, but perhaps complex rules. Then sometimes it can do a little bit
better. Stuff like image recognition or, you know, like I said, AlphaGo. I heard there's some version of IBM Watson that's learned how to do sort of college-style debating, which sounds really impressive. But then you find out that it's only trained to debate on, like, a hundred different topics. You're like, okay, well, that's impressive, but I could debate a thousand different topics. I don't mean to be dismissive of AI stuff. I think
it's interesting and it's moving very, very quickly. And I'm sure that this is stuff that in
a year or two, we'll look back on it and laugh a little bit about how wrong I am about some of
these things. But it's a good piece of perspective to keep in mind, I think, especially when you're
watching some of these very frothy predictions about the
future of AI. You have mentioned you're not an expert in neural nets, although you're far more
of an expert than most people have ever met. When you say machine learning, what do you mean by that? So I think of machine learning as a superset of sort of
algorithms and some accompanying techniques to make the data good for the algorithms or to
understand how the algorithms are performing. So neural nets are one example or one class of
examples of particular algorithms that you can then use to solve specific problems.
Okay. As a data scientist, well, maybe I should phrase this differently. A friend recently asked
if I could help with an algorithm and he dropped a pile of data on me and said,
there's something in here and you just have to figure out which signal is important and
how to do it. And I took up the challenge because it was amusing. But is that what a data scientist
does? What does a data scientist actually do? I, well, I guess it depends a little bit on what
the data was and how, you know, well-constrained that question was. So part of a data scientist's job is understanding machine learning and other types of statistical algorithms for,
like, how do you actually solve a problem from a scientific point of view or from a computational
point of view? In my view, that's, that's kind of like a, the, the warhead that sits on top of a ballistic missile.
And so it's the thing that everybody pays attention to, but there's all this stuff that
sits underneath it that makes the whole thing actually go do something.
And so the rocket and all the other stuff is things like understanding what the problem
is, talking to people, actually going out and
collecting the data, because it's pretty rare that the data is actually in a format that's
going to be good for solving the problem, that it's clean. Clean data is very rare.
And so a data scientist, that's where they spend probably 80% of their time easily,
is just trying to get the problem defined and the data into a state where you can do something like
some statistics or some machine learning. Yeah, he gave me a bunch of data. It was CSV,
so that was easy. But it wasn't labeled. I didn't know what I was looking at, so I was just looking at changes, which, I mean, turned out to be mostly what he needed.
But I plotted it, I filtered it, I looked at noise characteristics, I looked at derivatives, and it was all very amusing, but eventually I actually got the physical model of what was happening and everything became much clearer.
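That exploratory pass maps almost line for line onto pandas. A minimal sketch; the filename, window size, and channel layout are hypothetical:

```python
import pandas as pd

# Hypothetical: an unlabeled CSV of sensor channels sampled over time.
df = pd.read_csv("mystery_data.csv", header=None)

df.plot(subplots=True)                   # plot every channel
smoothed = df.rolling(window=50).mean()  # low-pass filter to expose trends
noise = df - smoothed                    # residual approximates the noise
print(noise.std())                       # per-channel noise level
deriv = df.diff()                        # first derivative: look for changes
print(deriv.abs().idxmax())              # where each channel changes fastest
```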
Is this, I mean, is this what you do?
Or was this just me playing with the techniques I know to find?
I don't really know what a data scientist does.
Could you answer that again?
Yeah.
So I would say a data scientist.
So there's a couple different, there's different flavors of data scientists.
Let me try to answer the one that maybe is closest to what you're thinking about.
So let me give an example from my work because it's the thing I understand the best.
And I mentioned that it's from my work at Civis because then if you listen to what I'm saying and you're like, I'm a data scientist or I understand this and that's not what my job is, that's probably because you have a different job than me.
And I'll respect that, but there's different flavors of this. As a data scientist at my job, what I do is go out and talk to people at
businesses who are trying to understand how to solve problems more effectively with data.
So it might be something like, we have a customer churn problem. So we need to be able to predict
which of our customers are dissatisfied with our service, and there's some risk that they're going to leave, and then perhaps proactively offer them something that entices them to stay. That's a typical example. Sometimes they can describe the problem that concisely. But sometimes not. Sometimes it takes a couple hours to kind of unpack all of the stuff that might not be going that great at their
business and to figure out that, okay, that's a problem that we can solve with data science.
And moreover, you have the data that's appropriate to solve it. That's a huge thing.
And so then very often we'll have to spend some time on the computer actually hooking up to their data sources and actually looking at their data very often, bringing it into computing environments that we have set up for this analysis.
Because usually they don't always have the computational tools and software and stuff that we do.
So we kind of have to move all the data over, take a look at it, clean it, see how it's formatted,
formulate an actual metric that we can use to say, is this customer about to churn,
which is more complicated than it sounds. And then maybe we can start
thinking about doing some machine learning on the data to make a prediction for any given person
about whether somebody has churned. And so then we might spend, you know, a few weeks or up to like
a few months, like taking a few passes at those models and talking with them about
preliminary results. And here's what we're finding. Is this making sense to you?
Here's how much better we think we could do than sort of your taking-a-guess baseline estimates. And then maybe we can say that we've solved the problem. So a lot of it is,
like I said, that 80% that's upfront and in the backend of actually understanding the business
problem and trying to solve it. And then the machine learning and the statistics and stuff
is kind of the 20% that sits in the middle there. Does that give you a better picture?
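A compressed sketch of that churn workflow in scikit-learn; the file, every column name, and the 90-day churn definition are hypothetical stand-ins for the negotiated, messy parts Katie describes:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical customer table; real engagements spend most of their
# time getting to a clean frame like this one.
customers = pd.read_csv("customers.csv")

# Formulating the metric: call a customer "churned" if they have not
# purchased in 90 days. This definition is the hard, negotiated part.
customers["churned"] = customers["days_since_last_purchase"] > 90

features = customers[["tenure_months", "monthly_spend", "support_tickets"]]
X_train, X_test, y_train, y_test = train_test_split(
    features, customers["churned"], test_size=0.25, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)
print("AUC vs. taking a guess:",
      roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```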
Yeah. It is nice to hear that the machine learning is a piece,
but it's a pretty difficult piece to learn. So I'm a little sad by that. I mean,
most of what you were saying, I'm like, yeah, okay, I totally do that. I do that a lot. I go in,
I talk to a client, I realize that what they're telling me is not what they actually want.
And then I kind of help them see what they want and help them give them options. And I try to
investigate and finally we come up with an agreement. And then I get to do the actual
part of my job, the putting things together and programming and designing and making it robust and testing it.
But so much of it is just trying to convince people to tell me all the things that they really want instead of the picture they've put on it. Yeah, and a lot of times, like one of the things that I've, I think I'm seeing in the field
of data science as it's evolving, is that there's, it's not an individual sport anymore, as much as
it used to be, it's a team sport. And so there might be, you know, the data scientist who is a
machine learning expert, and really, really understands a lot of those, you know, algorithmic ins and outs.
But then they have other people on their team who are folks like data engineers who understand how
to get the data in and out of the systems. Maybe they work with product managers or with
client facing folks who are more specialized in the business context and who are conversant in machine
learning, but they're not, you know, you don't have to be a PhD in machine learning to understand
a business problem. And so that I think is overall good for the field of data science,
because it means that there's a more mature idea of how to achieve actual results with it.
But then it also means that sometimes folks who were a little more full stack before,
data scientists who used to have to play all of those parts,
are now specializing, and it's a little bit narrower. Hopefully it's deeper. But it's an evolutionary step that I think is happening right now, at least from my perspective.
Okay, I'm going to switch topics a little bit, because we did get some questions from our Patreon Slack people, which indicates to me that many embedded engineers, hardware and software, hate machine learning.
So that was...
I noticed that too.
There's some skepticism of neural nets in the questions.
Yeah.
One of the big ones was attacking neural nets intentionally.
Part of it was using neural nets to attack other neural nets,
but there was this battle of, can we? How often does it happen? Are there secret machine learning battles going on out there? How did they discover that muffins and dogs look exactly alike, and that labradoodles and chicken look indistinguishable to neural nets?
Well, they just kind of look similar, right? Like, have you done the Google exercise where you google, like, dog or chicken? It's kind of similar.
I have trouble telling them apart.
Yeah, I can't blame the neural nets on that one. Not fair. Yeah. But are there battles going on?
I mean, are people taking neural net outputs and then trying to figure out how to fool them?
So I have only heard about those in laboratory settings. So where there's a team of researchers that's explicitly setting out with a neural net that's not really trying to solve a real problem, right?
So it might be like, no offense, but sorting between dogs and chicken is like not actually a problem that anyone has.
So setting up a neural net to disambiguate between the two is a little bit what I mean by it's a toy problem. And then building a network to attack it.
And then they, you know, spend three to 12 months refining that and writing the paper and whatnot.
So, it's not to say that it's not a thing that's actually happening, but it's happening in the lab.
In the same way that there's all kinds of stuff that happens in labs,
but doesn't necessarily make it out into the real world.
Now that having been said,
those are the things that I've heard about.
If there are adversarial attacks that are happening in the wild,
then I probably wouldn't hear about them.
Right.
If,
if you're out trying to attack,
say a self-driving car software,
you're probably not writing papers about it and
publishing them on Google Scholar. So, the absence of evidence that I'm aware of doesn't
mean that it's not happening. It just means that if you're really trying to seriously attack
something, you're probably not talking about it publicly.
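For reference, the classic laboratory attack is the fast gradient sign method: nudge every input pixel in the direction that most increases the model's loss. A minimal TensorFlow sketch; the model, labels, and epsilon are placeholders:

```python
import tensorflow as tf

def fgsm_perturb(model, images, labels, epsilon=0.01):
    """Fast Gradient Sign Method: shift each pixel a small step in the
    direction that most increases the model's loss on the true labels."""
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(images)
        predictions = model(images)
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            labels, predictions)
    gradient = tape.gradient(loss, images)
    adversarial = images + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)  # keep valid pixels
```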
Yeah, it strikes me that the intent or the goal of the attacks needs to be taken into account too,
because there's, like you said, there are the toy problems, and let's see if we can fool this thing.
And then there's taking self-driving cars as an example. Oh, let's make self-driving cars do
something bad. Well, if your goal is to do something bad,
there's probably a lot of easier ways to do that than to become an expert in machine learning enough to figure out how to fool a neural net.
I'm just saying, it's like having glass doors and good padlocks.
It's like, that's fine, but somebody will just break the window.
So you're saying that instead of trying to fool a self-driving car's neural net,
we should just spread caltrops on the road?
Yeah.
I mean, somebody wants to crash a bunch of cars.
It seems a lot easier to do it, you know, in a purely mechanical way.
But that's just me.
So far, I think that's right.
I mean, like, devil's advocate here, imagine we were all driving self-driving cars,
and then you could just, I don't know, with one software bug,
make it so the transportation system barely worked. Like, that's, that would be like a
pretty big deal, but we're not there yet. But it's, it's not because, uh, the people who are doing
the, the adversarial attacks are, uh, well, I think they, I think they're, they're probably
thinking longer term or bigger scale, but anyway, I, I take your point. Yeah. If I wanted to cause
mayhem, there's easier ways to do it than with, than with hacking a self-driving car, at least right now. It is good to be thinking
about these things because you can picture eventually we'll get to a self-driving car
society, and then you put some filter over their camera that makes it so anytime they see a stop sign, it doesn't actually register, or some other way of fooling it.
But this isn't specific to self-driving cars, and it doesn't have to be general. If you're going to murder someone, you're just going to murder them. I don't think it's hard to do it.
I don't think you need machine learning to commit mayhem.
Let's just stick with that.
Okay, so another question we got was about using neural nets for vision recognition systems, but then also using machine learning for trajectory planning and low-level motor control.
When we do safety-critical systems, how do we figure out where it's okay to use neural nets and where we need to use mathematically rigorous algorithms?
I don't have a background in hardware engineering, obviously. So, forgive me if this is a shallow understanding. But I think
without knowing exactly what a safety critical system is-
Something that keeps you alive.
Yeah. So, my rule of thumb is like, if the neural net, like, I'm assuming like the neural net in
some laboratory conditions does better than the mechanical system. And that's why we're even
considering this at all, right? Because if the neural net does worse, then it's not,
it's like, why are we even considering swapping it out? Yeah, I think, like, at least for me,
it would be like, what's the, I hate to be like, just talking about business here, but it's like,
what's the problem that you're trying to solve? Like, is there an actual safety issue in the system that the
mechanical, uh, processes are not showing themselves to be adequate for solving?
I mean, if that's the case, I don't like, that's a problem and you should address it. I don't know
that the place I would start would be a neural net. The thing about neural nets is they're pretty complex algorithms. And, as you point out, they're fairly challenging to train. They can be pretty black boxy. They're hard to understand. There are a lot of lower-octane algorithms that you could try at first that can still be an improvement
over simple rules-based heuristics, but that are still a little bit more human-friendly and a
little bit more robust to some of the adversarial attacks or anything like that. So I don't think those are the only two possibilities, is what I'm trying to say. There are some transition zones in between them that I would suspect are plausible intermediate steps, if you're finding that the mechanical systems aren't cutting it.
And again, if they're doing fine, then leave them alone.
Yeah.
When we do talk about control theory, it's funny because when I took AI for robotics at Udacity, they talked about PID control as though it was a machine learning technique.
And to some extent, the machine learning fell in how to tune
your PID. But even that, it was just an algorithm. It was just normal. And then Kalman filters came
up in AI for robotics. And I was like, it's just a Kalman filter. It's just normal. It's just math.
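For anyone who hasn't seen one, a PID controller really is just a handful of multiplies and adds. A minimal sketch with placeholder gains, not from the episode:

```python
class PID:
    """Plain PID control: the only 'learning' is tuning three gains."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

controller = PID(kp=1.2, ki=0.1, kd=0.05)  # gains are placeholders
command = controller.update(setpoint=1.0, measured=0.8, dt=0.01)
```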
And sometimes I find people get very focused on neural nets and machine learning.
And even those, we're finding ways to make them less black box.
Looking at the convolutional ones and how they shape different layers allows us to see what they are seeing and what they're cueing on. I guess the soapbox I'm on is: machine learning isn't just neural nets, and there isn't a single line that says this is algorithmic heuristics and this is statistically biased or statistically informed. There's some gradient.
I think, yeah, I think you put that really well. Yeah, it's not binary. It's continuous.
As I mentioned, I've done some machine learning from the Udacity courses and I'm reading it.
I'm reading about it. I'm trying it out on my little robot. I'm listening to your podcast, but overall,
I don't feel competent in it at all. Do you have any advice for getting over the hurdle of,
oh my God, I would never put this on my resume. What if they actually asked me anything about it?
Well, I think there's a lot of imposter syndrome out there. So I know that
doesn't fix it. But I always try to make a point of saying that. I think there are even people who
are, you know, in my position, who would say they have some imposter syndrome, too.
That's part of the reason that I like doing projects. If you're, you know, specifically thinking within the context of an interview, or just in general, if you've done a pretty in-depth project, then you've probably bumped up against a bunch of stuff that's hard, and you've thought about it pretty deeply. And in sticking with the project and continuing progress on it, you've implicitly sort of overcome it or worked around it or whatever. And those are the things that force you to go out and learn stuff, and that's how you become an expert. I don't think there's a way to shortcut that. And I think that that does instill kind of robustness or confidence, at least for me,
in a way that coursework never did. So, if I had to build something or if I had to make something,
I would feel like I had some understanding or some mastery. If I just took a course about it
or read a book about it, then, you know, I felt like my understanding
was shallower. So that's part of the reason I'm a big advocate for that.
One of the things that I saw from, I think, a presentation you gave was that willingness to
do math was a good criteria to get into machine learning and data science.
But so many people aren't willing to do math.
Do you have any idea why?
Are we lazy?
Math is hard.
But we used to love math.
Do you remember being a kid and the geeky love of math and puzzles? Yeah, that was before before you get to, you know, multivariable calculus.
Statistics.
The brain-bendingness.
I mean, I think that's kind of
why I think math is a
decent heuristic for this, honestly.
It's like, machine learning is not... okay, part of the reason I say
that is because math is just
embedded within machine learning. Like, you have to
know some linear algebra in order to write some of these algorithms or to understand how they work.
Now, that having been said, there's really good libraries out there and you can use these
algorithms without understanding deep down how they work. I will be the first person to say there
are data scientists out there and I have counted myself among them, who use things without fully
understanding the math all the way down. And I think that that's, you know, not, it's not ideal,
but it's understandable. And it's how the world keeps moving forward. And like, I'm not going to
judge too much. But I think that one of the things that math is a good heuristic for is a little bit of like pain tolerance and sort of everyone's had the opportunity to take math classes and many of us have, you know, struggled to various extents.
And if you haven't struggled, then there's a good chance that like machine learning, at least the very technical pieces of machine learning might come easier to you than they did to me. But like, you have to be a little bit tough in this field. And you have to
be very rigorous in your thinking. And you have to be very dogged in the sense that you have to
know that many of these problems have a right answer and many wrong answers. And finding a wrong answer is not... you know, sometimes close is good enough, but sometimes
it's not. Math keeps you disciplined in that sense. So, that's the other thing that math has
going for it, is that for most people it is pretty tough, and it's sort of related to stuff that you need to know. And so it's not a bad way to tell if you're... like, I don't want to say tough enough, because that glamorizes it in a way that I don't mean to, but I can't come up with a better phrase. Like, it's a good way of knowing if you're tough enough for some of the technical challenges that could be slung your way.
I heard about Ada Lovelace doing calculus problems in her free time.
And I thought, if only I could do that.
And then I realized that that was one of my top 10 stupidest thoughts. And I got a calculus book, and I started working through it, working the problems. And it was what you're talking about. I had forgotten the persistence needed to fight through long, variable algebraic things.
And the stick-to-it-ness and the keeping things organized,
even though these were all things I do in my job,
there's a level of rigor with just doing the math
and knowing whether or not you're going to get to the right answer
and being able to show it.
And the book I had was very applied.
So there were many physics things going on that I had to think about and had to draw
and had to remember how to do the drawings in a way that would lead me to the solution.
So I kind of agree.
And I'm embarrassed that I needed to do that, but it was useful, and it has helped me move on to things like Udacity and machine learning and doing more data science. But I had fallen out of the habit of even things I loved, like the signal processing, which, I mean, I'm always
going to be a fan of anything dealing with Fourier, but
I had just started to rely on the libraries and not gotten too far into the math anymore,
because you don't have to. But the math informs things. It reminds you of what's underneath it
all. And so I see why you're saying math is important. I guess, let's see, I only have a couple more questions, and I have to get back to the impostor syndrome. Do you feel that way? Do you have problems with it?
In certain contexts, I do. So it depends on who else is in the room, I guess. I think the podcast actually helps a ton
with this because the place where I feel imposter syndrome a lot is kind of this like, it's like,
oh, have you heard about this thing yet? Everybody's talking about it. And if I'm like,
no, then I don't feel like a real data scientist. But the podcast forces me to always be looking for new things to learn about and new things to talk about. But there are definitely situations, like if you were to put me in a room with a lab full of machine learning PhD students and their professors and postdocs and stuff, yeah, I would not feel like I was one of them.
Okay. You have a PhD from where? Some little school that nobody's ever heard of?
Stanford.
You worked at CERN, which, like, everybody here is like, why aren't you asking her about CERN? I want to know about CERN! Oh, my God, the Higgs boson. Oh, my God, it's so cute.
What?
You did a class with Sebastian Thrun, who is just, oh, my God, he's amazing.
And you've been doing a podcast for four years that helps people understand, not just teaches them basics, but gets to the intuitive
understanding of machine learning and data science. And you feel this way?
Well, yeah. I mean, there's a bunch of stuff I haven't done, right?
I mean, like I, for example, here's an example, something that I've always wanted to take
like six months and, and three textbooks and just learn is signal processing actually. And I'm not
even, I'm not joking about that. It's one of the most interesting fields that I've not learned,
uh, or that I, that I know about that I haven't learned. Let me put it that way. There's all
kinds of interesting stuff out there, but like it it just hasn't happened. But there are times when I would like
fake it a little bit. Like I would go read about Fourier transforms for a while. We learned about
them a little bit in physics for other reasons. So I wasn't totally, totally making stuff up,
but, you know, I'm not a signal processing expert. So if I had been doing that episode with you instead of with Ben, you would have just been
like correcting me all over the place. So that's, that's a little bit what I mean by like, I haven't
done everything yet. Like there's a lot of stuff out there. Uh, so, and like I said, it's a field
that moves so fast that even if you were an expert
two or three or four or five years ago in something interesting, like, I don't know,
like it's moved on, uh, something else is new.
So it's very hard to get, at least for me, I don't get very complacent in, in that, uh,
context.
And so I guess that's for me kind of related to feeling imposter syndrome.
Learning is just your turn. It's not about faking it. It's not about being an imposter. It's just
about learning. Everybody has to learn this stuff. I'm sorry, I'm giving you such a hard time,
and I do this all the time as well. So to some extent, I want to hear other people say it because you are accomplished.
And you're using your imposter syndrome in a good way.
You're showing, you're using it to continue growing and continue learning,
which is a great way to take these feelings of inadequacy and fear
and turn them into something that is helpful to lots of people.
Well, thank you. I appreciate that.
I think maybe I'm ready to end the show on that note, unless you have a favorite episode of your podcast that we should start with.
That's an interesting question. So we don't do very many interviews,
and I would not call myself an excellent interviewer by any means, but there was one
that we did. This would have been in around the very end of 2016, or maybe like the first week or two of 2017, where we interviewed a researcher named Matt
Might. And he sits at kind of the intersection of genomics and computer science and is doing
like genomics research for understanding genetic diseases and stuff. And I thought he was an incredibly interesting person to talk to
and has like a very compelling personal story
and is just an incredibly brilliant and accomplished person.
So I don't know if that's exactly the one to start with
because it's going to give you a funny flavor of what we do.
It's an atypical episode, but it was one that I really enjoyed.
I'll put a couple of my favorites in the show notes as well.
Do you have any thoughts you'd like to leave us with?
This isn't very profound, but the thing I want to mention is if any of this excites you and you're
in the Chicago area and looking for your next thing, Civis is hiring. We're hiring for my teams. We're hiring all across the company. We're not a huge company by any means. We're still a startup. We're only five years old. But if you're excited and you want to talk, check out our careers page, civisanalytics.com, probably slash careers.
I don't actually have the URL on me, but yeah.
That would be in the show notes, of course.
Cool.
Our guest has been Katie Malone,
Director of Data Science in the Research and Development Department at Civis Analytics.
Katie also hosts Linear Digressions,
an excellent podcast making machine learning concepts accessible.
Thank you for being with us.
Thank you so much. It's been great.
Thank you to Christopher for producing and co-hosting, and thank you for listening. You
can always contact us at show at embedded.fm or hit that contact link on embedded.fm.
Thank you to Tom and George for their help with questions this week. If you'd like to find out about guests in advance, please support us on Patreon
and then sign up for the Slack.
Now, a quote to leave you with.
This one's going to be from Bob Ross.
Anything we don't like,
we'll turn it into a happy little tree or something.
We don't make mistakes.
We just have happy accidents.
Embedded is an independently produced radio show that
focuses on the many aspects of engineering. It is a production of Logical Elegance,
an embedded software consulting company in California. If there are advertisements in
the show, we did not put them there and do not receive money from them.
At this time, our sponsors are Logical Elegance and
listeners like you.