Programming Throwdown - Bayesian Thinking
Episode Date: June 22, 2020
Many people have asked us for more content on machine learning and artificial intelligence. This episode covers probability and Bayesian math. Understanding random numbers is key to so many different technologies and solutions. Max and I dive deep and try to give as many pointers as possible. Give it a listen and let us know what you think! Max also has an awesome podcast, The Local Maximum. Check out his show on any podcast app or using the link in the show notes! Show notes: https://www.programmingthrowdown.com/2020/06/episode-102-bayesian-thinking-with-max.html ★ Support this podcast on Patreon ★
Transcript
Programming Throwdown, Episode 102: Bayesian Thinking with Max Sklar. Take it away, Jason.
Hey everybody, so this is going to be an awesome episode. I know so many people have
asked for more content on AI and on machine learning. And today we have Max Sklar, who's
an engineer at Foursquare's Innovation Lab. He's going to talk to us about Bayesian thinking,
Bayesian inference, and this whole universe. So thanks so much for coming on the show.
Thank you so much for having me. I love talking about this stuff. So let's get into it. I'm
excited.
Cool. Awesome. How are you doing? How are you handling the COVID situation?
Well, as reasonably as one can expect, I guess, you know, every week on the podcast, people
can hear my mood going up and down and have commented as such.
But I also experience mood swings.
There's something about, you know, being in the same place for everything that causes you to have these ups and downs.
Yeah, it seems to be a universal thing.
I feel like I was mentioning this to someone yesterday. Like, do you ever, like, go through your Apple photos or Google photos or whatever? And,
you know, just go through your past, like the last five years, 10 years and just scroll.
I'm like, what's it going to be like scrolling through these months? It's like all of a sudden
I'm living my life. And then, oh, it's a bunch of pictures in my room. And then a bunch of pictures
of sirens going by, and then more pictures of my room, and then a bunch of wires. And I'm like,
oh, let's skip this. Let's skip this part. Yeah, it's so true. Yeah. What I heard the other day, which really resonated, was that 2020 is kind of like when you get bored in SimCity and you just turn on all the disasters at once.
It's kind of like what things have degenerated to. It's funny, one of the things I wasted my time on during this pandemic, and unfortunately I feel like I'm not alone in feeling like I could be getting so much done but wasted so much time, was watching a review of the original SimCity on YouTube. So, okay, I also found out, I was reading an article, that there's a game called Sim Refinery that Maxis made, I think, for Chevron and their employees. And someone was able to get a copy of it. And you can actually download it and play it in an emulator. I haven't tried it yet. Cool. So the topic is something that
we need to really attack kind of from the surface because there's sort of a lot of layers to it.
And so I feel like maybe one good way to, you know, get started on it is just to talk a little bit about uncertainty.
I mean, there's a lot of challenges.
You know, everyone knows software, at least most software is deterministic.
You say one plus one and you expect it to be two every time.
But when you're doing anything from you're trying to make some kind of prediction or you're getting data from some analog device or some sensor,
the world around you or even just behaviors of people is full of uncertainty.
And just in general, like how do people deal with that kind of stuff when they're writing code in this very rigid way?
Yeah, well, one thing that I've learned even over the past couple of years, as I've been doing this for many years, is how deep the rabbit hole goes in terms of the different kinds of uncertainty and risk and probability and the
different definitions around it. People don't even agree on what uncertainty means or what
probability means. I kind of tend to take the subjective view that it's sort of your
degree of belief that something is true. But then there are the cases where, you know, hey, we're at a casino and we're flipping
a coin and I know reasonably well, you know, what the probability is.
Everybody agrees that a coin flip is 50 percent and the die roll is a sixth, a sixth, a sixth, et cetera.
But in the real world, it's not like that. You often have cases where, you know,
you don't even know, it's sort of subjective what the probability is of
certain events. And the best you could do is try to estimate it or try to figure out, okay,
what are the different possibilities that I want to consider here? And that's where Bayesian
inference really shines. So I think the first step is just it's a dive into a problem and maybe it'll help to go over
some examples and try to figure out, you know, what type of uncertainty am I dealing with
exactly? And kind of having that discussion with your team is usually very fruitful.
Yeah, it totally makes sense. I mean, I think one example that sort of has both of those, probability and uncertainty, in one is the classic multi-armed bandit,
which is this idea where, you know, imagine you go to Las Vegas and imagine now, you know, in the real world, you know, slot machines are highly regulated.
And so you could go to any casino and I think you could even look up the
probabilities. They're all posted publicly. But, you know, let's suspend disbelief on that for
a moment and just assume that every hotel had their own rules. Maybe some hotels are brand new
and so they make the slot machines really generous. Some of them are very strict, but you
don't know what any of those probabilities are. And so you have, you know, maybe a thousand dollars' worth of quarters in your pocket or in a bucket
or something and you want to basically make as much money as possible. And so you're kind of
torn because you don't know any of the information, but you don't want to waste a lot of
money finding out either. You want to
quickly kind of dial that in and then choose the best slot machine for the rest of the day.
And so you could see how you're sort of balancing, you know, reducing uncertainty and, you know,
becoming more certain about your world with, in this case, I guess, profit.
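A minimal sketch of one Bayesian way to balance exploring and exploiting in that slot-machine scenario, Thompson sampling; the payout rates and payoffs below are made up purely for illustration and are not from the episode:

```python
import random

# A made-up casino: each machine pays out with some unknown probability.
# The player never sees these numbers, only win/lose outcomes.
true_payout_rates = [0.04, 0.07, 0.02]

# Beta(1, 1) prior (uniform) over each machine's payout rate.
wins = [1, 1, 1]
losses = [1, 1, 1]

bankroll = 0.0
for pull in range(1_000):
    # Thompson sampling: draw a plausible payout rate for each machine from its
    # current posterior, then play whichever machine looks best on this draw.
    samples = [random.betavariate(wins[i], losses[i]) for i in range(3)]
    machine = samples.index(max(samples))

    won = random.random() < true_payout_rates[machine]
    bankroll += 1.0 if won else -0.05   # made-up payoffs, just for illustration

    # Bayesian update: the posterior stays a Beta, only the counts change.
    if won:
        wins[machine] += 1
    else:
        losses[machine] += 1

print("posterior mean payout per machine:",
      [round(wins[i] / (wins[i] + losses[i]), 3) for i in range(3)])
```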
Yeah, exactly. And another one of the things that I like to talk about is, sometimes you talk about, okay, what are the implications in terms of building a machine learning model at work?
But sometimes there's also the question of what does this mean for my everyday life?
And the multi-armed bandit example is the question of what is the cost of getting information?
And oftentimes we don't consider what the cost of getting the information to get the right answer is, because it's not zero.
And so people often either go into rabbit holes
and try to get as much information as possible
when it kind of freezes them in place,
or people will throw up their hands
and just make a decision because they don't know.
But taking a second to think about,
what's the cost of getting information and what does that get me is another way to kind of dig into a problem. Yeah, yeah, totally makes sense. Yeah. And so, you know,
imagine, you know, if you're writing, if you're writing something that's going to do any kind of
forecasting, you're invariably going to have some type of probability and then some type of
uncertainty around that. So, I mean, this is something that I think really gets neglected
in a lot of these kind of coding boot camps or these tutorials online. Like if you've seen the
show Silicon Valley, where I have not seen the last season, but you can spoil me if you want.
Oh, you know, I haven't either. So we're both in the same boat.
OK, we're both in the same boat. So, Patrick, try not to spoil it for us.
But basically one of the characters, I think Jian Yang, makes a hot dog, not hot dog app. Right.
Right. And so just taking that example, you know, if you build a model to do that, it's not going to say thumbs up or thumbs down. It's going to give you a probability, and if you build a Bayesian model it's also going to give you a distribution, which we can talk about. But you have to then, you know, turn that into an action. So that really is context dependent. So, for example, take the hot dog, not hot dog app.
You should only say hot dog if it's really confident because you're going to notify someone or send an email and you really don't want to be wrong.
But then on the flip side, imagine something a little more serious like you're doing cancer screening.
Well, maybe if you're 10 percent sure you should say yes, because yes means running an additional test.
Right. And so you really can't do anything in machine learning without, you know,
dealing with probability and sort of philosophy or economics of whatever you're trying to build and trying to marry those two together.
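A tiny sketch of turning a model probability into an action based on the economics of the problem; the payoffs and costs below are invented just to illustrate the two situations described above:

```python
def choose_action(p_positive, value_if_right, cost_if_wrong):
    """Turn a model probability into a yes/no action by expected value.
    All payoff numbers below are invented for illustration."""
    expected_value = p_positive * value_if_right - (1 - p_positive) * cost_if_wrong
    return expected_value > 0

# Hot dog / not hot dog: a false positive is costly (we notify someone),
# so 70% confidence is not enough to act on.
print(choose_action(0.70, value_if_right=1.0, cost_if_wrong=5.0))    # False

# Cancer screening: a "yes" only triggers another test, so even 10% is enough.
print(choose_action(0.10, value_if_right=100.0, cost_if_wrong=2.0))  # True
```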
Yeah. Oftentimes you're ultimately after making a decision, but you
don't want to just go right away to the yes or no. You want to come up with the probability or
the distribution and then use that to help make the decision. So kind of breaking it up into two
parts. If I could give an example from my work, and we'll probably bring this up again: it might not be the most fun example, but it's definitely one of my favorites
because it's just one of the projects that I've done that just brought so much value,
which was in ad attribution.
That was a product I worked on for Foursquare a few years ago.
A lot of people in data science, you're going to run into attribution at some point in your life
trying to figure out whether ads work or not.
Big data problem, you know, lots of people trying to solve it.
And Foursquare was kind of, we were trying to figure out whether people, whether ads were driving people into stores.
And we had some visit data and we had some data on who saw the ads.
And it was really hard to untangle what the clients were asking for
because they wanted to know lift.
They wanted to know, okay, what percent more likely is someone to visit one of my stores
if I give them an ad?
Is it 10 percent?
Is it 20 percent?
Is it 5 percent?
And no one asked for a probability distribution over that.
They just wanted a number, an exact number.
And then they wanted us to tell them either an exact number or that we don't have enough information to say.
And originally, the team was kind of just taking that at face value, being like, okay, let's try to just calculate this number.
But it led to so many problems with, you know, how accurately do we know this number, when can we report it, when can we not report it. And then to untangle that, we took a step
back and said, wait a minute, wait a minute, we don't know what the lift value is for an
ad, but let's try to turn this into a Bayesian model, and we could talk about, like, I don't
know if we want to kind of go and dive into like the intro to Bayesian theory, maybe after this.
But we said instead of talking about what is the lift of an ad, let's say, hey, we don't know what the lift of an ad is, but there's some uncertainty around it.
It could be anything, and we have some probability distribution over it.
And then as we gather more and more data, we update our probability distribution and it gets taller and skinnier over time.
And then we could come up with like a confidence range or sort of a Bayesian confidence range, which is, you know, hey, your lift is somewhere in there.
And then we could use that to help the clients make decisions. And it's sort of, even though some clients didn't
understand what we were doing, it sort of made it a lot easier on the engineering and staff side,
where we could be like, okay, we now all agree on what we're doing and what we're doing makes sense.
And it was just amazing to see how it just detangled a lot of our issues.
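As a simple sketch of the kind of Bayesian treatment described here (not the actual Foursquare model), one could put Beta posteriors on the visit rates of the exposed and unexposed groups and read off a credible range for lift; all counts below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts (not real Foursquare numbers): store visits out of the
# panel of people who saw the ad and the panel of people who did not.
exposed_visits, exposed_total = 620, 10_000
control_visits, control_total = 500, 10_000

# Beta(1, 1) priors updated with the observed counts, sampled many times.
exposed_rate = rng.beta(1 + exposed_visits, 1 + exposed_total - exposed_visits, 100_000)
control_rate = rng.beta(1 + control_visits, 1 + control_total - control_visits, 100_000)

# Lift: how many times more likely an exposed person is to visit.
lift = exposed_rate / control_rate
low, high = np.percentile(lift, [2.5, 97.5])
print(f"estimated lift ~{lift.mean():.2f}, 95% credible range [{low:.2f}, {high:.2f}]")
# With more data the posterior gets "taller and skinnier" and the range narrows.
```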
Yeah, yeah, totally. Yeah, I think you touch on a lot of things there. So we'll try and unpack that. I mean, one is, I think probability distributions
have to be one of the most foreign things. Let's say foreign entry level things in mathematics,
right? Because I mean, there are a lot of things, you know, string theory, that are very complicated. But I would say PDFs are one of the things that are fundamental, foundational, but also really abstract and complicated.
Yeah, can you take a crack at sort of explaining what a probability density function is to folks out there?
Yeah, yeah.
I mean, I guess one thing you could start with is if a PDF is blowing your mind, why not just start with the
discrete case? Because, and I find this too, sometimes when I'm dealing with continuous data,
if I think about it in a discrete way, it's a lot easier to start. So just think of what a regular
probability distribution is. You have a number of potential events. In the case of Bayesian
inference, you're usually talking about several hypotheses, and you're trying to decide, OK, which one of these is the true one. Sometimes you can kind of say, OK, this is what it is. It's just going to be a bunch of numbers that add up to one. Right. So each thing has a probability. And then, you know, you could update those values as more data comes in.
A PDF is a little more complicated because it's continuous data. So instead of having it on, let's say,
10 possibilities or 30 possibilities or two possibilities, actually, a lot of times it's
two possibilities, you're dealing with a space that it's like, okay, this is a number. This is
like a real number. In the case of attribution, we're trying to find the ad lift. This is a positive number greater than zero, and usually it's greater than one, because one means that the ad was not effective, although there are exceptions to that. So now you don't have a bunch of values that add to one. You have a continuous space that, if you know calculus, integrates to one. You can break up the continuous space in different ways. You could break it up into sections, like, hey, I could bucket it: what's the chance it's, you know, greater than five or less
than five? Well, both of those have to add to one. But, you know, essentially, you're just searching
a space of numbers. And you could even make it more complicated than numbers still. You could make it into, you know, R2, R3, you know, a space of vectors, but then that
gets even more abstract.
So depending on, I'm pretty sure you probably have listeners who are comfortable with levels
of abstraction on a high level and not so much, but, you know, essentially you're just
trying to figure out the relative probability of different possibilities.
That's all of this stuff is trying to answer that one question.
And yeah, you know, it's your hypothesis space, depending on how complicated you want your hypothesis space to be.
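A small numerical sketch of that discrete-versus-continuous distinction: a handful of hypotheses whose probabilities sum to one, versus a density whose area integrates to one and where only ranges carry probability (the numbers are made up):

```python
import numpy as np

# Discrete case: a handful of hypotheses whose probabilities add up to 1.
hypotheses = {"ad works well": 0.2, "ad works a little": 0.5, "ad does nothing": 0.3}
print(sum(hypotheses.values()))         # 1.0

# Continuous case: a density over a real number, here a standard Gaussian.
x = np.linspace(-10, 10, 200_001)
dx = x[1] - x[0]
pdf = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

# A single point has probability zero; only areas under the curve mean anything.
print((pdf * dx).sum())                 # ~1.0: the whole density integrates to 1
print((pdf[x > 1.0] * dx).sum())        # ~0.159: probability of landing above 1
```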
Yeah, it totally makes sense. You basically have this function. And you could imagine in your mind just any type of function you'd, you know, graph on your graphing calculator or something. Right. And when that function is high, you know, has a high Y value, then those values of X are just more
likely. And when the function is low, those values are really unlikely. So if you imagine like a coin toss, and you're getting the PDF of this coin, you know, it could be you toss the coin three times and get three heads. It's possible, but it's much more likely that you're going to get at least one tails. And so you could imagine, you know, the Y at one third being higher than the Y at zero. But, yeah, the thing that's really weird is, as you said, you know, if you pick a single point, it actually doesn't really mean anything.
It's really just when you integrate over a range, because, for example, in a continuous space, you never have exactly 0.333. There's always going to be some uncertainty there. And so you have to deal in terms of these ranges. And so it's kind of more like a pie chart that's been kind of unrolled
in a sense. Yeah. There are a number of different ways of looking at it. Another way I think of it is relative probabilities. Like, let's suppose you have just a standard Gaussian function that peaks at zero, right? So the zero value is the highest probability density, but any particular number has a probability of zero of landing on exactly that number. But you can actually compare,
hey, what's the chance of me landing on zero versus the chance of me landing on one? And you
can kind of compare relatively what those two are, even though they're both zero, which is
kind of mind blowing until you're comfortable with calculus and differentials and all that.
But sometimes you could just understand what the graph means, like,
hey, I'm going to be somewhere under this graph. And so I'm much more likely to be in the high
sections than the low sections. And that's sort of where my mind goes when I read a simple PDF.
And then every once in a while, I want to get into like really abstract, like what's going on here.
And I, you know, sometimes I do read into this stuff. I read into topology and measure theory. And you could really get into some really abstract stuff. I was reading the other day, I can't believe I'm bringing this up, about something called pointless topology. And it's this idea that in a space like the real numbers, the point is not the fundamental unit.
You actually can only talk about, you know, places of finite extent.
So you have to be able to move a little bit to the right and a little bit to the left.
And I mean, think about it in terms of electricity. If you're an electrician, you never say, OK, exactly 120.0000 volts is coming out of this socket. No house has that kind of tolerance. Yeah. And so you're always thinking about things in terms of tolerance, in terms of hopefully small ranges of numbers.
Yeah. And the scary thing is, oftentimes the infinite-precision number really has no meaning, you know, because what does it mean if I said there was 120 volts, but it was, you know, zeros all the way down?
Exactly. Like, is that even possible if you go deep down?
It's like quantum physics. Oftentimes the numbers that you you're talking about stop having meaning anymore.
One example that I give is, like, how many water molecules are in the Atlantic Ocean? You could probably come up with a reasonable estimate of that. But is there an exact integer that represents that value?
Probably not.
No. And I mean, with evaporation and everything, you'd never be able to count it. So another thing that I think, you know, talking about Bayesian statistics really helps people with is generative, let's say, models, but even more generically, generative systems, right? I mean, when people write, for example, when I write unit tests, I typically, ideally, like to make a unit test that's generative. So, for example, let's say I write my own sort. Maybe, you know, I could have a unit test where I take some canonical inputs.
I know what the sorted outputs are and I run my sort and then I make sure it matches. But another way to do that would be to have a validation function.
So I have, let's say, the STL quick sort, which I can't use for whatever reason, but I have it.
And my unit test could be generate some data, run the STL sort, run my new sort,
and maybe my new one's faster.
You know, that could be part of the test.
And then compare the results, right?
So that's just an example of this sort of generative unit test.
So some people hate that idea or they tell you you need to fix the random seed,
and there's kind of a whole rabbit hole there.
But, you know, you can imagine a lot of testing as being sort of generative.
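A minimal sketch of such a generative unit test, in Python rather than C++ (so the built-in sorted() stands in for the STL sort as the reference); my_sort is a placeholder for the implementation under test:

```python
import random

def my_sort(values):
    # Placeholder for the sort implementation under test.
    return sorted(values)

def test_my_sort_matches_reference(trials=1_000, seed=None):
    rng = random.Random(seed)   # pass a fixed seed for reproducible CI runs
    for _ in range(trials):
        data = [rng.gauss(0, 100) for _ in range(rng.randint(0, 50))]
        assert my_sort(data) == sorted(data), f"mismatch on input {data}"

test_my_sort_matches_reference()
print("all generated cases passed")
```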
And I think to really understand Bayesian inference and how to build these Bayesian models, you have to think in terms of, you know, you have some phenomena.
What generated that phenomena?
And can I, you know, artificially, how much of that phenomena can I generate artificially?
And so, you know, people who want to do, let's say, game design, you know, have to deal with this kind of thing.
And so in general, it's really good for really any engineer to know how to build sort of a generative system to model some problem.
Yeah. A couple of points on that. First of all, the idea that the first step in kind of a Bayesian inference is,
once you've defined the problem, I guess I would say that's the first step, is then you have to
figure out what your hypothesis space is. You know, what are the possible mechanisms
that I'm going to consider that could have generated the data that I'm seeing or that I'm about to see. And one of the, well, I guess you could say it's a pitfall of Bayesian thinking or Bayesian analysis, is that there is no guarantee that, you know, one of the hypotheses that you come up with is true or works particularly well. And so that's why you often need to throw in like a dummy hypothesis and say, oh, this is actually just random data, or something like that. And then if that beats everything, then, you know,
OK, either it is random or I don't understand the mechanism. And I know we're talking kind of abstractly here, but if we can go back, I'm trying to figure out where we were in the conversation.
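A toy sketch of that dummy-hypothesis idea: compare a small hypothesis space that includes a "just random" mechanism and see which one the data favors. The data and hypotheses below are made up for illustration:

```python
import math

# Observed data: did the user convert on each of 12 visits? (made up)
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

# Hypothesis space, including a "dummy" hypothesis that the data is just noise.
hypotheses = {
    "promo helps (p=0.8)": 0.8,
    "promo hurts (p=0.2)": 0.2,
    "just random (p=0.5)": 0.5,   # the dummy / null mechanism
}
prior = 1 / len(hypotheses)       # equal priors, for illustration

def likelihood(p, data):
    return math.prod(p if x else 1 - p for x in data)

unnormalized = {name: prior * likelihood(p, data) for name, p in hypotheses.items()}
total = sum(unnormalized.values())
for name, value in unnormalized.items():
    print(f"{name}: posterior {value / total:.3f}")

# If "just random" wins, either the data really is noise or none of the
# proposed mechanisms describes it well.
```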
If we go back to the unit test, like that's always very interesting.
Like I remember, writing a unit test for the sort is one thing.
Like I've written some to check if a logistic regression is coming up with the right answer.
And it's often – you often have to kind of tell the rest of the organization, my unit tests run differently.
I either have to set the random number generator in stone, in which case I don't know if I'm necessarily getting what I want.
My unit tests pass 95 percent of the time, but the other five percent, it doesn't pass. So every once in a while when you're doing this stuff, you run into problems
with the organization with unit tests where I don't necessarily know the answer other than
in some cases I just have unit tests that I run offline that aren't part of our automated testing
whenever I change something that's probabilistic. Yeah, that totally makes sense. Yeah, I mean, and this is true in general: you have a generative system, especially one that's, let's say,
generating data for a test. There's always going to be that one chance that it generates,
like, really degenerate data, like maybe all the X's are the same. I mean, that could happen one
out of, you know, a billion times or a million times.
And there are unit test systems that are going to run your test a million times,
maybe even a million times a day.
So that could happen.
Yeah, yeah, exactly.
It happens all the time.
One in a million.
I'm not surprised by one in a million things anymore.
Yeah, I saw something.
This is a pretty old quote,
but it was from the lady who is the head of security at Twitter many years ago.
And she basically said something to the effect of, yeah, if there's a one in a million chance, then it happens, you know, 20 times an hour on Twitter or something like that.
Yeah. We're going to take a little break to talk about University of California, Irvine's Division of Continuing Education.
So this is a pretty cool program.
They have a variety of different kinds of certificates that you can acquire. They have things like Python, they have data science, they have machine learning.
And these are things where, you know, if you didn't necessarily get, let's say, a degree in machine learning or you haven't worked as a machine learning engineer for a bunch of years, this is a way to sort of get a lot of that knowledge, a lot of that expertise.
And, you know, I know Patrick and I, we've both done a bunch of courses online.
And so it's a really good way to sort of boost your knowledge and your skills in a particular area.
Yeah, I mean, I did tons of online classes when I first started working.
And, you know, for me, being part of a class, I mean, it's always interesting.
But the curriculum, the self-paced stuff, it works great sometimes.
But sometimes having a "here's what we're doing each week," marching through their curriculum and going through it, it's very similar to how, you know, just a normal university class works. In fact, you know, feeling like it's almost exactly the same is just a comfortable thing, a good way to learn, and you're learning from professors who, you know, that's their thing.
That's they teach. They help others to learn and having access to it, doing the assignments, it really helped me go from, you know,
where my undergraduate left off to, you know, to just kind of bootstrapping into more specifics,
higher level things, things that were more pertinent to my job at the time. You know,
I highly recommend people taking classes, continuing education, from a college.
Yep. Yep. Yeah, I think getting it through a university is actually really stellar. I mean, it's really awesome that universities are starting to get into this, and, you know, there are going to be quality lectures and professors. There's, you know, a very strong brand behind any sort of major university, and, you know, UC Irvine is one of the top universities for CS.
So they've been around since about 1962, I think.
And they've been around a long time.
They've been teaching a long time.
They've been teaching online a long time.
And so it's a good place to go and get this kind of education.
Yeah, if you're interested, I think they're still doing enrollment for some late classes for spring,
but summer is upcoming. And as we've been talking about this whole episode, I mean,
I think everyone has extra time at home these days. And if you're interested, you can check it out at ce.uci.edu slash programming throwdown. And we'll put the
link in the show notes, of course. But once again, that's ce.uci.edu slash programming throwdown.
Yeah. And if you do sign up and take any courses, you know, we'd love to get feedback. You know,
please write us in. Tell us what you think of it.
You know, we could pass it on to them. But also for us, it's really good to know, you know, what you thought about that.
You know, folks out there who are listening. So. All right. Back to the show.
So, yeah. So what makes something Bayesian, or one way to explain it is, you know, when you're building that generative system, if you build it on top of random numbers, now you have kind of a generative model. So, for example, if we go to our sort example, we're generating some random numbers to sort and comparing the outputs of them.
If we know the function that's generating those numbers.
So, for example, the numbers that we're sorting, we're generating them from a Gaussian distribution.
Right. Getting back to our earlier topic, if we have the probability density function for this generator, now, when we do analysis,
instead of needing the list of numbers, we could just think about it in terms of, okay,
this is the distribution that I'm sorting. And then all of the models and the math afterwards
could be done on the distributions. And your output could also be another probability density
function. Yeah. So I'm not sure I'm following exactly.
So you're generating a bunch of random numbers from, let's say, a Gaussian, and you still want to sort it.
Are you saying, oh, yeah. Yeah.
Yeah. Well, I'm saying let's say we just want to do some analysis on the sort. So, yeah, the actual sort is not going to be a Bayesian system, because it's kind of a deterministic thing.
But in general, I think if you start with random numbers, then instead of applying things to each of the elements that we've drawn from the distribution,
you could also imagine, you know, doing functions on the distribution itself.
So let's take this this Gaussian distribution and let's maybe multiply it by a constant and add it to this other Gaussian distribution.
And if you do that enough times, you get some kind of linear regression.
And so that that output then of that process also becomes a distribution.
Yeah. I mean, I think the main point is you can actually recapture the distribution from the random numbers.
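A small sketch of that idea: the same linear operation carried out on samples drawn from Gaussians and, analytically, on the distributions themselves (a linear combination of independent Gaussians is again a Gaussian); the parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two Gaussian inputs and a constant.
mu1, sigma1 = 2.0, 1.0
mu2, sigma2 = -1.0, 0.5
a = 3.0

# Working on samples: draw numbers and apply the function element by element.
x = rng.normal(mu1, sigma1, 1_000_000)
y = rng.normal(mu2, sigma2, 1_000_000)
z = a * x + y
print(z.mean(), z.std())      # roughly 5.0 and sqrt(9*1 + 0.25) ~ 3.04

# Working on distributions: the same operation done analytically, since a
# linear combination of independent Gaussians is again a Gaussian.
mu_z = a * mu1 + mu2
sigma_z = (a ** 2 * sigma1 ** 2 + sigma2 ** 2) ** 0.5
print(mu_z, sigma_z)          # 5.0 and ~3.04
```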
And that's always very interesting. A few months ago, I was reading a whole bunch of the COVID papers, you know, the coronavirus papers.
And the one that I was really interested in on March, you know, March 10th was how long does it take for someone to get it?
Because I walked into the Foursquare office on March 6th and no one was there.
And I was like, what's going on?
Is it a weekend?
Did I screw up?
And haven't you heard somebody got the COVID and we all have to quarantine for two weeks.
And that was back then. Nobody was quarantining.
So I was like it was kind of like I felt like I was personally being punished.
But but after five days, you know, I was like, OK, I'm not sick.
None of my co-workers are sick.
And what's the probability that we got it, or any of
us got it? And as I looked into the number, I found one paper that was really good, which,
you know, took the discrete data of, okay, these are examples of people who know when they were
exposed, know when they started exhibiting symptoms. And so they construct the probability distribution on, you know,
when you start exhibiting symptoms from when you were exposed.
And I noticed in the paper, which I liked, which gave me confidence in the paper,
which, you know, a lot of the papers that I've read over the last three months
in this topic did not inspire as much confidence.
But one of the things that they did was they checked a lot of forms for the distribution. So they checked to see if it was a Gaussian, but they also checked many others, and said, hey, you have a median of getting this thing in like five days, and a 97 percent chance it'll come at you within 14 days.
So that sort of inspired confidence in the data they got.
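A rough sketch of checking several candidate forms for a distribution like this, assuming SciPy is available; the incubation times below are invented, not the paper's data:

```python
import numpy as np
from scipy import stats

# Hypothetical incubation times in days (invented, not the paper's data).
days = np.array([3.2, 4.1, 5.0, 5.5, 6.3, 2.8, 7.9, 4.4, 5.1, 9.5,
                 3.9, 6.8, 4.7, 5.9, 11.2, 4.0, 5.4, 6.1, 3.5, 8.3])

# Check several candidate forms, not just one, and compare how well each fits.
candidates = {"lognormal": stats.lognorm, "gamma": stats.gamma,
              "weibull": stats.weibull_min}

for name, dist in candidates.items():
    params = dist.fit(days, floc=0)            # times are positive, so fix loc=0
    loglik = dist.logpdf(days, *params).sum()  # higher is a better fit
    median = dist.ppf(0.5, *params)
    p975 = dist.ppf(0.975, *params)
    print(f"{name:9s} log-likelihood {loglik:7.2f}  "
          f"median {median:4.1f} days  97.5% {p975:4.1f} days")
```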
And yeah, nobody at work at least got sick from work.
So that was good.
I was pretty confident after five days, but my co-workers were
still not so confident. Yeah, I mean, I think the statistics, especially when it's rather early and
there hasn't been enough time, it's just very hard to get that right. I think a lot of people learned
the hard way that fractions have denominators, right? And so you might have a thousand cases,
but if you tested a hundred times as many people
as the next place,
then that obviously has a huge impact.
Yeah, I mean, one thing that I like about Bayesian inference
is it sort of tells me what my certainty is as I go.
So if I'm like, quote unquote,
allowed to draw an inference early on, I'll know that.
Because, you know, there'll be less uncertainty around my data if I already have enough data. If I don't have enough data and my probability is still spread all over the place, then I'll know I still have to wait. Yeah. And one thing you touched on is this idea that we have to fit distributions.
So, you know, we talked about the probability density function, but there's a variety of different functions that have really nice properties.
Right. So, for example, as we talked about, there's the Gaussian distribution or the normal distribution.
And that is this symmetric distribution that has a whole bunch of really nice properties.
Now, as we talked about with the electricity example, in the same way the real world will never give you exactly 120 volts, it's also not going to give you exactly a Gaussian distribution either, right? It probably will give you a distribution that is unlike any of the mathematical distributions. So, you know, a lot of work has to go into how do we fit the data that we have to, you know, one or more or some composition of distributions. So that actually is itself a really difficult problem. Yeah. One of
the topics that I like talking about on this is conjugate priors. I don't know if you've
run into that, but an example of a conjugate prior is a
beta distribution. A beta distribution is an uncertainty over a probability. So let me say
it in a way that might make sense, might make more sense intuitively. Like you have a coin,
a weighted coin, and you don't know if the coin is weighted 0% on heads or 100% on heads.
It's somewhere in between or equal to those two things.
It's somewhere on a value from 0 to 1, and we don't know exactly where it is.
And weighted means it's cheated.
It's cheating in some way, but you don't know.
Right, right.
So, I mean, you could think of it as a coin, or you could think of it as any event that's repeated,
where I don't really know what the probability of this
event, like I don't know what the probability of this event is. All I know is that it's repeated
over and over and it's somewhere between zero and one. And so a beta distribution represents
an uncertainty over that. And a beta distribution, like a Gaussian distribution, has two parameters, and, you know, you can say it has a mean and it has a variance and all that.
But the really cool thing
about the beta distribution is
that if you have a beta distribution
to represent the uncertainty
over the
weight of the coin,
if you flip the coin a bunch of times
and then run through Bayes' rule
and then calculate the updated distribution,
the updated distribution is also a beta distribution
just with updated parameters.
And it's actually very easy.
You just add the number of heads to one parameter
and number of tails to the other parameter.
So the math all works out
and you don't even need any complicated calculations. You don't even need to use a computer. You can just figure out what you need. And so sometimes those are really good to look at.
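A minimal sketch of that conjugate update, where Bayes' rule really does reduce to counting heads and tails (the flips are made up):

```python
# A Beta prior over the weight of the coin: Beta(1, 1) says "anywhere between
# 0 and 1 is equally plausible".
alpha, beta = 1, 1

# Observe some flips (1 = heads, 0 = tails).
flips = [1, 1, 0, 1, 1, 1, 0, 1]

# Bayes' rule collapses to counting: the posterior is still a Beta, with heads
# added to one parameter and tails added to the other.
alpha += sum(flips)
beta += len(flips) - sum(flips)

print(f"posterior: Beta({alpha}, {beta})")
print(f"posterior mean weight: {alpha / (alpha + beta):.3f}")   # 0.700
```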
There's beta, Dirichlet, which I've written a bunch about, and gamma. And then, like you said, oftentimes, in some cases, you want to check, you know, hey, could this be some pathological distribution?
And you should often throw that into the mix to check because some of the biggest mistakes, not mistakes, but like errors that people can make is like, you know, sort of trying to fit your data into a certain distribution when it doesn't fit.
And sometimes you have to for simplicity. So you have to kind of get like an intuition for when this is going to be a problem.
But it's usually a problem for instances where you have data where, you know, one day it could blow up and everything is affected.
I'm trying to think of an exact example.
And you get an intuition for which is which.
For example, the attribution, I'm not worried about one ad all of a sudden.
Well, I don't know.
There could be an ad that could be extremely effective.
But I don't expect it to be out of a range of like an order of magnitude.
Yeah, exactly. And the other thing is, you know, these distributions are shaped by what we call hyper parameters.
So, for example, looking at the normal distribution, it's symmetric.
So that's something you can't vary, but you can still vary basically how much you want to, let's say, stretch it.
And so we call that the variance of the distribution.
And you can also vary where the top of it is.
We call that, I guess, the basis or the base.
There's different words for that.
But the center of the distribution, the mean of the distribution works.
And so those are hyperparameters, right?
So let's say you have some data and you can fit a normal distribution.
Let's say the data is bimodal, which means you actually see two normal distributions in your data.
So imagine it has sort of these two humps, right?
Right. So now you can say, well, there's a Gaussian distribution that's generated this data,
but sometimes the mean is this or sometimes the mean is that. And so you can have another
distribution which tells you when you should use one mean or the other. Right. So you could have, let's say, a Bernoulli distribution.
And when it's heads, you use the one mean and when it's tails, use the other mean.
And so then, in this way, it becomes kind of this chain.
You could look at it as a graph where distributions are informing other distributions.
And so you can even introduce, you know, external data.
So, for example, yeah, when it's sunny outside, it's much more likely to be this distribution, this normal distribution.
When it's raining outside, it's more likely to be this other one. And it's just that the data you have was a combination of sunny and rainy. And so that's why you have this bimodal distribution.
But it turns out if you add this extra feature,
now you can explain it in a much better way.
That's predictive.
Yeah, yeah.
And that's how these things kind of chain together.
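A small generative sketch of that bimodal example: pick the group with a Bernoulli-style draw (standing in for sunny versus rainy), then draw from that group's Gaussian; all the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Generative story for bimodal data: first decide which group a point is in
# (a Bernoulli draw, standing in for "sunny" vs "rainy"), then draw from the
# Gaussian that belongs to that group. All numbers are made up.
def generate(n, p_sunny=0.6, sunny_mean=25.0, rainy_mean=12.0, sigma=3.0):
    sunny = rng.random(n) < p_sunny
    values = np.where(sunny,
                      rng.normal(sunny_mean, sigma, n),
                      rng.normal(rainy_mean, sigma, n))
    return sunny, values

sunny, values = generate(10_000)
print("overall mean:", values.mean())            # a blend of the two modes
print("mean given sunny:", values[sunny].mean())
print("mean given rainy:", values[~sunny].mean())
# Pooled together the data looks bimodal; conditioned on the extra feature,
# each half is just a plain Gaussian again.
```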
You can have, hey, which group am I in?
Am I in group A or group B?
And that's the first distribution between A and B. And then within A,
okay, we have a
Gaussian. Within B, we have a Gaussian.
There's a whole field
called hierarchical
models that's kind of based on
that. And it's sort of like, okay, I'm in a subgroup, and then I'm in a larger group. And there are certain properties that the larger group has.
And there's certain properties that each subgroup has. And as I get data, how do I figure out what those properties are?
And a good example of that is in election forecasting. I interviewed on the Local Maximum, Alex Andorra.
I think that was episode 99,
if I remember correctly.
Yeah, so basically,
he does election forecasting.
He's in France.
He's kind of like the 538 blog for France.
Oh, cool.
Yeah.
And so what...
I tried to go to localmax radio slash 99 to make sure, but I don't know, it's not responding right now.
But we talked about hierarchical models.
And the question is, OK, let's say I do a poll from a single town and I get a crazy outlier on that poll.
And the way I talked about it, there's three things that could be going on.
One, you could have just gotten a bad sample; it just happened to be that the random people you hit, it just fell in a weird way. That's like that one in a million, or maybe not even one in a million, shot where it just randomly was an outlier
and that does not reflect the data.
Two, it could be that that town,
something is going on in that town
and people are really changing their minds.
And three, it could be that
this is a broader national trend.
And so when you get that data,
how do you tell?
How do you tell which case it is?
Or maybe it's a combination of those cases.
And a hierarchical model, a hierarchical Bayesian model does a very good job of kind of disambiguating between these and trying to figure out, OK, where does this fit in?
Yeah, totally makes sense. Yeah. I mean, so you could actually have a distribution over all three of those hypotheses, I guess in this case a categorical distribution. And over time, with more surveys and more samples, you would become more and more confident about some mixture of those hypotheses.
Yeah. Yeah. That was episode 98, by the way, not 99.
But yeah, oftentimes, the way I might think of it is, OK, let's say each town, and again, you could break it up by town.
You could break it up by age and gender and demographics and all these things.
But to simplify, let's say it's a two-candidate election, and each town just has a beta distribution over what the probability for each person is in that town.
Then the county, you say, OK, each town in that county, we're going to have a different beta distribution.
And the hyperparameters for that beta distribution are drawn from another distribution for that county, and then so on and so forth.
And you go up to the state level.
And so that's, yeah, and that sort of disambiguates.
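A very simplified sketch of the partial-pooling effect being described; in a real hierarchical model the county-level hyperparameters would themselves be inferred from all the towns at once, but here they're fixed just to make the idea visible (all polls are invented):

```python
# Hypothetical polls: (supporters of candidate A, total respondents) per town.
town_polls = {
    "town_1": (310, 600),
    "town_2": (95, 200),
    "town_3": (18, 20),   # the crazy outlier: 90% support, but only 20 people
}

# County-level prior: "towns around here tend to sit near 50%", worth roughly
# 200 respondents of prior weight. In a real hierarchical model these
# hyperparameters would themselves be inferred from all the towns at once.
prior_a, prior_b = 100, 100

for town, (yes, total) in town_polls.items():
    post_a = prior_a + yes
    post_b = prior_b + (total - yes)
    raw = yes / total
    pooled = post_a / (post_a + post_b)       # posterior mean for this town
    print(f"{town}: raw {raw:.2f} -> partially pooled {pooled:.2f} (n={total})")

# The tiny, extreme poll gets pulled strongly toward the county-wide picture,
# while the big polls barely move: that's the "fluke vs. local shift vs. broad
# trend" question being settled by how much data sits at each level.
```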
And oftentimes in these cases, it can be very intractable because, like, do I want to break it up by geography?
Do I want to break it up by demographics?
Do I want some combination?
And that's where a lot of trial and error comes in; that's where you're getting into, you know, the machine learning curse of dimensionality, trying to figure out which variables are best to include. Yeah, I mean, it kind of ties into the multi-armed bandit stuff again, where, you know, just to give an example, if you flip a coin once and you get heads, there's just a lot of uncertainty there.
If you flip it 10,000 times and it is roughly 50-50, then you're much more certain.
And so you have what's called tighter bounds, right?
And so as you try to come up with more and more features to explain some phenomena.
So, you know, the phenomena is the survey results.
And as you come up with more and more features to better explain that phenomena, you also are
adding more and more uncertainty because, you know, maybe you've only surveyed one 18-year-old
from, you know, the Lyon province, from Nice. And so now you've gotten so specific that you've blown up the variance.
And so you're constantly kind of struggling with this, we'll call it bias variance tradeoff, right?
And so that's one of the reasons why acquisition functions are really important.
And so we could kind of dive into that. I mean, a lot of these systems, and one really cool thing about Bayesian methods in general, is that a lot of it comes from kind of the signal processing world.
You know, a lot of it is designed to be iterative.
So it's not like collaborative filtering and some of these other methods where they're just concerned about this one shot.
Like you generate this probability of hot dog and then that's it, you're done.
But in this case, you can actually have an acquisition function which says, you know,
who should I survey next to get the most possible signal?
And going back to the casino example, right, like which casino should I visit to either learn more about it or make the most money? And so actually, you know, a lot of Bayesian methods directly address the acquisition function, which is another thing that isn't covered really well but is also really important. Yeah, to expand on that a little bit, a real-world example from Foursquare attribution: well, in this case it wasn't so much an acquisition function where we're trying to decide who to survey, because we have our panel of data. It was more like trying to decide which data to throw out, because we had too much. It was very expensive to put all of it in our model. And we
were trying to figure out, let's use Starbucks as an example. I don't think Starbucks is a client,
but let's use Starbucks as an example. We're trying to figure out, okay, what is the probability that
any given user or for every user and their data and information about them, or what is the
probability that they're going to visit Starbucks on this particular day.
And we have that probability set up for every person in our system.
And so that was a pretty big model to train.
And we had examples of days where people did visit Starbucks
and days where people didn't visit Starbucks.
And as you can imagine, there are a
lot of examples of person-days where Starbucks was not visited. There are a lot of examples where
Starbucks was visited, but there's a huge multiple of days where Starbucks wasn't visited. And so
we did a bunch of stuff where we threw away some of that data, so that we could then calculate our probability
function, which was kind of a logistic regression on a smaller data size.
And we, again, use Bayesian inference to say, okay, how do I get these original parameters
from the original data set, given that we probabilistically threw away a bunch of this data
and we were able to calculate that as like a bias-corrected logistic regression.
It's a very interesting problem to tackle.
And again, save a lot of money, save a lot of calculation time
and programming cycles and AWS costs and all that.
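One common way to do that kind of bias correction (sometimes called prior correction; not necessarily exactly what Foursquare did) is to adjust the model's log-odds by the negative-class sampling rate:

```python
import numpy as np

# Suppose visits are rare, so before training we keep every "visited" example
# but only a random 1-in-50 sample of the "did not visit" examples.
keep_rate = 1 / 50

def corrected_probability(logit_from_downsampled_model, keep_rate):
    # The downsampled model's log-odds are inflated because negatives look
    # 50x rarer than they really are; adding log(keep_rate) undoes that.
    corrected_logit = logit_from_downsampled_model + np.log(keep_rate)
    return 1 / (1 + np.exp(-corrected_logit))

# Example: on some person-day the downsampled model says 40% (logit ~ -0.405).
biased_logit = np.log(0.4 / 0.6)
print(corrected_probability(biased_logit, keep_rate))  # ~0.013 after de-biasing
```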
Yeah, yeah, true. It's true. And even if you had infinite of those, you run into this other issue where the model will spend energy based on the loss.
So, for example, the example you gave is great.
You know, the odds that someone goes to Starbucks specifically, you know, if you look at everyone on Foursquare,
including all the folks who have never been to a Starbucks, the odds that someone goes to
Starbucks is probably way less than 1%. Yeah. And so if the model, let's say, isn't a very, very large model, if it doesn't have a lot of free parameters, then the model is just going to say no one goes to Starbucks ever. And it's going to be 99 percent accurate. Right. And so typically, you know, that's one of the big challenges on the modeling side: how do we respect the sort of economics of the problem? Right. Like, why is saying no one ever goes to Starbucks bad? The reason is because it's a lot more valuable to predict when someone's going to Starbucks and get it right than it is to predict that this person is not going to Starbucks. Because if you know someone's going to Starbucks, you could send them, let's say, a coupon, and you could get maybe three dollars' worth of value out of that. If you know that someone's not going to Starbucks, that's kind of useless. Right.
Yeah. And more importantly, it was which people are going to be swayed by the ads, too, because we wanted to know who was going to go no matter what.
Yeah, that's right. Yeah. You need the uplift. Right.
So you do all these various tricks to make sure the model is as useful as possible. But because you've done that, now it's uncalibrated. So now the model thinks that an average person goes to Starbucks, you know, every other day, because it's been given data that's skewed. Right. And so then you have to do this calibration step where you build another, kind of simpler, model on top of that model, which is given unbiased data. The first model is biased on purpose, and you freeze that.
But now you train this unbiased one afterwards. And, you know, I think a lot of this, you know,
between sort of this bias variance trade off and all the things downstream of that and this question of like, how do you subdivide the data and how do you gather new features?
That's probably 90 percent of what a data scientist will do in their day job.
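A minimal sketch of that calibration step, assuming scikit-learn is available: freeze the biased model, then fit a tiny logistic calibration layer on its scores using held-out, unbiased data (all data here is simulated):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Scores from the frozen, intentionally biased first model, evaluated on a
# held-out sample drawn from the unbiased real-world distribution.
biased_scores = rng.uniform(0.05, 0.95, 5_000)
# Simulated ground truth: actual visit rates are far lower than the raw scores.
true_labels = (rng.random(5_000) < biased_scores * 0.05).astype(int)

# Calibration layer: a one-feature logistic regression on the first model's
# log-odds. The first model stays frozen; only this layer sees unbiased data.
log_odds = np.log(biased_scores / (1 - biased_scores)).reshape(-1, 1)
calibrator = LogisticRegression()
calibrator.fit(log_odds, true_labels)

calibrated = calibrator.predict_proba(log_odds)[:, 1]
print("mean raw score:        ", round(biased_scores.mean(), 3))
print("mean calibrated score: ", round(calibrated.mean(), 3))
print("actual positive rate:  ", round(true_labels.mean(), 3))
```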
Yeah. Yeah, I agree. Some of this stuff can get quite involved. And oftentimes you have to take a deep breath and say, okay, let's start at the beginning, and let's do this step by step. Because, well, in my case, and I think in a lot of cases, you often have, sort of, you know, a problem of explaining to the organization all of these
issues, which can be intractable as well, you know, with management. Yeah. So how do you deploy, you know, something like this reliably? And how do you know if you've deployed a bad apple? I mean, let's say there's a data corruption. How do you kind of keep that from getting to people? And how do you build sort of protection in case it does get out to people?
Well, I'm trying to think if I have some good examples of that.
You know, we definitely have a bunch of like, you know, data sanity checks in all of our pipelines to make sure things don't shrink or grow considerably
between runs. But honestly, sometimes they're more, sometimes they cause us more problems
than they help. And other times, other times things have blown up anyway. But in terms of
models, you know, I think, look, there's nothing that you can do if the data coming in is just horrible, just completely zeroed out or something, unless you sort of just have a check that the new model is, you know, completely 180 degrees different from the old model.
Oh, that's a really good point.
Yeah.
So sanity checks and smoke tests help a lot.
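A rough sketch of that kind of guardrail: compare the candidate model's outputs to the previous model's on the same inputs and refuse to deploy if they disagree wildly; the thresholds below are arbitrary placeholders:

```python
import numpy as np

def safe_to_deploy(old_scores, new_scores, max_mean_shift=0.05, max_flip_rate=0.10):
    """Compare a candidate model's outputs to the previous model's on the same
    inputs and block the deploy if they disagree wildly (say, after an upstream
    data corruption). The thresholds are arbitrary placeholders."""
    old_scores = np.asarray(old_scores)
    new_scores = np.asarray(new_scores)
    mean_shift = abs(new_scores.mean() - old_scores.mean())
    flip_rate = np.mean((old_scores > 0.5) != (new_scores > 0.5))
    return mean_shift <= max_mean_shift and flip_rate <= max_flip_rate

old = np.random.default_rng(1).uniform(0, 1, 10_000)
corrupted = np.zeros(10_000)            # e.g. an upstream feed got zeroed out
print(safe_to_deploy(old, old + 0.01))  # True: ordinary week-to-week drift
print(safe_to_deploy(old, corrupted))   # False: refuse to ship this model
```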
One thing that I talk about a lot that's been running on Foursquare for the last, like, six years in its current form,
but really eight years, is the rating system.
So if you go to Foursquare City Guide, you can search for venues and restaurants and bars and stuff. Remember, we used to go to bars, movie theaters and things like that.
Yeah. Yeah. So they had a one to ten rating.
And in that algorithm, one of the biggest signals is sentiment analysis on the Foursquare tips, which are like these mini reviews that people write. And I built this in 2012 and then kind of revamped it in 2014 with Stephanie Yang,
who I had on my show in episode three.
And basically the sentiment analysis algorithm is based on people who have liked and disliked a venue explicitly,
but also left a tip.
And so we trained language models on that.
And the reason why this is my favorite example is that this thing has been getting retrained every week for the last six years.
And it probably, I don't know, Foursquare isn't growing by a big percentage now,
but back in the day, like it used to get better over time.
Like it would be better at doing new languages, because a certain language maybe didn't have enough data. But then when more people speaking that language started using Foursquare, it got better and better at figuring out whether something was positive or negative in that language. And so
sometimes you can deploy something where it actually gets better over time instead of
drifting. Yeah, which is really interesting. A certain number of things have to align.
You have to be training on your own data.
And sometimes, like, you know, it has to be something that doesn't change that fast. I mean, language does change, but in terms of what we were looking for, it doesn't change that much.
Or if it does change, it's the introduction of new terms that it could easily learn and not, you know, completely going opposite.
But, yeah, it's always an interesting, I don't know if I have a clear answer for you,
but it's always an interesting architecture problem.
Like, how do I build this thing to last?
Yeah, I mean, I think it's definitely an open problem. I think, as you said, though, you know, comparing to yesterday's model, having some expectation about drift, and sort of A/B testing, or I guess A/A testing, you know, this version versus the previous one, is really the best you can do.
I mean, there are some sort of counterfactual evaluation techniques,
but it's very, very hard to dial those in.
Yeah, especially something like venue ratings
where how good they are is very subjective.
And then if you want to dial, yes, there's no ground truth there.
And then, I guess, we were using our own data. Even though it seems like the ratings have worked pretty well and they've held up over the years, you know, people still say, oh, we enjoy these better than some of the other services that rate places. But I was trying to think, like, if you go to the sentiment analysis, yes, we have training data where we kind of have ground truth there, with people self-reporting.
But it's not really always truth. So and, you know, there's spam and all that.
So there's a whole bunch of things that could go wrong. Yeah, totally makes sense.
What do you think about, you know, I used to do Patrick and I both used to do a lot of image processing.
This is probably like a decade ago.
You know, and back then, as before, deep learning was a thing.
And so it was a lot of kind of, you know, SVM or sorry, SVD and these other sort of unsupervised like these decompositional models, a whole bunch of human engineering.
And then deep learning came and just ate all of it. Right.
And so I remember when I was at NYU, I took a class in machine learning, and the professor was Yann LeCun.
I didn't know who he was, but now he's at Facebook and all that. And he showed us, I think it was the first day, a little eyeball-shaped camera that he pointed at things, and then on the screen it said what he was pointing at. It would be like keys, phone, desk, you know. And it was just amazing. And now that technology is everywhere. Yeah. And so deep learning kind of just ate all that field, like no one has to know about Gabor filters or Gaussian pyramid models or a lot of these color contrast models. I mean, they just take the raw pixels and throw them into this deep net. And so you're starting to see this emergence,
and it's been going on for a few years now,
of sort of these deep Bayesian models where, basically, you know,
it'll be some deep learning where the last layer is a Gaussian process
that has just priors that nobody knows.
Or you start to see, like another example is, you know, you start to see variational inference pop up everywhere,
like this reparameterization trick pop up everywhere.
And so there are all these sort of black box models where nobody really knows what these distributions are
in the same way that nobody really knows what's going on in the deep learning system.
But it can handle enormous amounts of data.
And so do you think that that sort of deep learning is going to eat a lot of the kind of, you know,
Bayesian math and it's just going to become deep learning?
Well, so first of all, I think some parts of deep learning are based on Bayesian math.
So it has, to some degree or another.
So it's not necessarily that it's going to eat it.
But I think that there is always going to be a market for the simpler models.
You're never going to reach a point where people are like, oh, linear regression and logistic regression.
Don't even learn about those.
Those are obsolete. No, I think those types of things
are always going to be kind of the bread and butter, you know, first step to try to wrap your
head around the problem and figure out and to be interpretable and try to figure out, you know,
what is really going on, what the causal mechanisms are that a deep model can't necessarily tell you.
And deep learning is really good at something like image recognition or speech recognition,
where you just have such complicated levels of abstraction that it can do really well.
But I don't think it eats everything.
I think these complex models are very important, and they're going to do a lot of things, but
eats everything, no.
Yeah, I totally agree with you.
Yeah, I think, yeah, I mean, if you think about it, actually, I mean, imagine you have,
let's say, a click probability model that's a deep model.
It's still a logistic regression at the last layer.
It just has this sort of this embedding, this composition that sort of transforms the data into something that can be learned in a linear space.
Right. And so it might just be that, you know, we do this composition and we have to just cross our fingers there.
But what comes out of that, that embedding, we could still do some interpretable ML on top of that.
Right. And sometimes these click models or this marketing data is not as deeply complex, with layers of abstraction, as image recognition is.
So sometimes you can throw deep learning at it and you can get small wins, but not always.
They're often small wins and not big wins, but for lots of effort. But, hey, I'm sure Google will use it to squeeze an extra 0.1 percent efficiency out of their clicks, and, you know, that translates to some huge amount of money.
But it might not be practical on the scale of, like, a small business.
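As a rough illustration of that "embedding plus logistic regression at the last layer" picture, here is a hedged numpy sketch in which a frozen random projection stands in for the trained deep part and an ordinary logistic regression is fit on top of it. The feature counts, data, and weights are invented; the point is only that the last-layer weights remain something you can inspect.

```python
# Minimal sketch: a "deep" click model seen as embedding + logistic regression.
import numpy as np

rng = np.random.default_rng(1)

n, d_raw, d_emb = 1000, 20, 5
X_raw = rng.normal(size=(n, d_raw))

# Pretend this is the trained deep part: it maps raw features to an embedding.
W_embed = rng.normal(size=(d_raw, d_emb))
def embed(x):
    return np.tanh(x @ W_embed)

# Synthetic click labels that depend on the embedding.
true_w = np.array([2.0, -1.5, 0.0, 0.5, 1.0])
p_click = 1 / (1 + np.exp(-(embed(X_raw) @ true_w)))
y = rng.binomial(1, p_click)

# Last layer: ordinary logistic regression on the embedding, trained by gradient descent.
w, b = np.zeros(d_emb), 0.0
E = embed(X_raw)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(E @ w + b)))
    grad_w = E.T @ (p - y) / n          # gradient of the log-loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# These last-layer weights are the part you can still inspect and reason about.
print("learned last-layer weights:", np.round(w, 2))
```

Even if the embedding itself is a black box, the final linear layer can still be probed like any other logistic regression.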
Yeah. I mean, I think it's probably worth mentioning the reparameterization trick.
It's something that most people could do with really any data without having to fit distributions.
And the idea is, you know, as we talked about, the distributions are defined by these hyperparameters. And so you could imagine, like, if you imagine,
just let's take the logistic regression example.
You have a set of labels.
So you have a set of, you know, coin tosses,
and you want to have some function that takes in, I don't know,
the wind and some other parameters and decides whether the coin tosses
will be biased in one way or the other.
But that's going to give you a single probability or the hot dog, not hot dog, for example.
It's going to give you a single probability like point nine or something.
If you want a distribution, one way to do it is to say, OK, instead of trying to get the actual answer... So the way that 0.9 comes about is, when we
train the model, you know, we give it a picture of a hot dog and we say this is one, we give it a
picture that doesn't have a hot dog and we say this is zero, and so over time it trains on that.
But you could also give it, you know, a batch of pictures and say, in this batch of pictures, 90 percent of them are hot dogs, and not necessarily tell the model which ones are which.
You know, you could do that and it will still learn. It will take longer because it will have to sort of, you know, tease apart from these batches which ones are actual hot dogs.
But it would work.
Does it have multiple such batches? Like, to say in this batch, 90 percent are hot dogs, and in this batch, 10 percent are hot dogs?
Exactly. Yeah. Oh, OK. Yeah.
So so if we give it enough of these batches and let's say we shuffle in between the batches, it'll still learn even though we only have one loss or one, you know, bit of feedback for the entire batch, right?
And so if you can follow that, then you could imagine saying this batch has this distribution
and Mr. Model, I want you to match this distribution. So instead of saying this batch has a 0.9,
you could say this batch has this Gaussian distribution,
you know, match it.
And the next batch will have a different Gaussian distribution.
And over time, you know, it will start to fit
all of these different distributions over all these batches.
And so this is a nice little trick you can use.
It's called the reparameterization trick.
And so you can actually get a distribution out of pretty much any model.
And I feel like that's kind of, you know, lesson 101.
You know, we could do much better than that.
But that's one way where anyone can get into models that output distributions.
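Here is one way the batch idea above could look in code, as a rough sketch rather than the exact trick from the episode: the model never sees per-example labels, only the fraction of positives in each batch, and the loss pushes the model's average predicted probability toward that fraction. The features, batch size, and learning rate are all made up.

```python
# Rough sketch of training from batch-level proportions only.
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hidden "truth": only the first feature really makes a hot dog more likely.
n, d = 2000, 4
X = rng.normal(size=(n, d))
y = rng.binomial(1, sigmoid(2.0 * X[:, 0]))   # per-example labels the model never sees

w, b = np.zeros(d), 0.0
batch_size, lr, epochs = 50, 2.0, 300
for _ in range(epochs):
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        target_frac = y[batch].mean()          # the only supervision: "X% of this batch are hot dogs"
        p = sigmoid(X[batch] @ w + b)
        err = p.mean() - target_frac           # squared-error loss on the batch proportion
        dp = p * (1 - p) / len(batch)          # d mean(p) / d logit, per example
        w -= lr * err * (dp[:, None] * X[batch]).sum(axis=0)
        b -= lr * err * dp.sum()

print("learned weights:", np.round(w, 2))  # the first weight should stand out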
Yeah, I haven't heard of the reparameterization trick.
I'll have to look into it. Even if the data has been mixed and matched and filtered and all that, so long as I know how it was done
or I have some hypothesis as to how it was done,
you could always come up with a Bayesian model on that
and learn something.
Yeah, yeah, totally.
Yeah, I think having the uncertainty
lets you do some really, really cool things.
Like imagine... yeah, again, it's problem-dependent.
Like Foursquare is actually a good example.
Let's say we're recommending restaurants to somebody.
And let's say we don't know whether they like Thai food or not.
Well, maybe we should do what's called the upper confidence bounds,
which is just a fancy way of saying, you know, if the distribution is really wide,
we want to be more likely to choose this item.
So Foursquare doesn't know if I like Thai food.
They show me an advertisement, or they suggest a Thai food restaurant that's nearby.
And if I cancel it, or if I tell them, don't show me this again, well, that tightens the bounds. Now they know something. And so next time,
maybe they'll try Chinese food. Right. On the flip side, it could be that I really love Thai food,
but I always go to the same place. And so I haven't looked for Thai food on Foursquare.
And so by taking that chance, Foursquare... if I do, maybe I click on it and I go to that restaurant. That really opens up this huge array of possibilities that didn't exist before.
Right. So that's where, you know, showing the upper confidence bound of something can give you this nice mix of learning
but at the same time, you know, using the information you've already learned.
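A toy version of that upper-confidence-bound idea, with invented cuisines and feedback counts: each option gets a Beta distribution over "will this user engage with it", and we recommend the option whose plausible-best score (mean plus two standard deviations here) is highest, so wide, uncertain options get explored.

```python
# Toy upper-confidence-bound recommender. All names and counts are invented.
import numpy as np

# (successes, failures) observed so far for this hypothetical user
feedback = {
    "thai":    (0, 0),    # never shown: very wide distribution
    "chinese": (3, 1),    # liked a few times
    "pizza":   (10, 30),  # shown a lot, usually ignored
}

def beta_upper_bound(successes, failures, prior_a=1.0, prior_b=1.0, k=2.0):
    """Mean + k standard deviations of the Beta posterior: a simple UCB score."""
    a, b = prior_a + successes, prior_b + failures
    mean = a / (a + b)
    std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean + k * std

scores = {c: beta_upper_bound(s, f) for c, (s, f) in feedback.items()}
for cuisine, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{cuisine:8s} upper bound {score:.2f}")
print("recommend:", max(scores, key=scores.get))
```

With these made-up counts, the never-shown cuisine wins precisely because its distribution is so wide, which is the exploration behavior being described.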
Yeah. Lots of ways to slice and dice the data.
Yeah, totally.
If I could put it one way.
Yeah, exactly.
I mean, I think having the probability and the uncertainty opens up tons of opportunities.
I feel like this is something that a lot of even big companies like Google and Facebook
and these other large companies are just starting to get their hands around this.
They have these very, very complicated models, but then they all output a single number.
And you're kind of limited in what you can do with that.
Yeah. Yeah. That's a constant struggle: you know, the data scientists often have more information than the product managers or the customers want to use.
And, yeah, I get it.
Maybe like in a consumer app, you don't necessarily want to start teaching people probability,
but you want to be ready with all of the probabilities so that you can then make the changes to the, you know,
you could then make product decisions as you go, I guess.
Yeah, that makes sense.
Like I want to increase or like maybe I want an indication that we're uncertain
or maybe I want an indication that maybe you'll like this place because of this or, you know,
all the different possibilities that we've been over a lot over the last many years at Foursquare, for sure. So in Foursquare, do you have, you know, like these, let's say, clear box or these like
very interpretable models and you have sort of the heavy hitting deep learning model that
squeezes the extra one percent?
Do you do both of that or do you just focus on the clear box or like how do you sort of
strike that balance?
So let's try to figure out which product we want to talk about, the consumer product or attribution.
So we can talk about attribution for a bit.
Attribution was, you know, we had a bias-corrected logistic regression, like I said,
which sort of tried to figure out what the likelihood of someone visiting a place,
given what day it is and given information about them and when they visited before and all that.
And then we compared what happens when they didn't see the ads, when they did see the ad.
And then we use a Bayesian model to come up with ad lift.
That was a few years ago.
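This is not the actual Foursquare model, but a minimal sketch of that kind of comparison: Beta posteriors over the visit rate of people who saw the ad versus those who didn't, and Monte Carlo samples to get a whole distribution over the lift rather than a single number. The counts and priors are invented.

```python
# Minimal stand-in for a Bayesian ad-lift comparison (counts are invented).
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical counts: (visited, did not visit)
exposed   = (130, 9870)   # saw the ad
unexposed = (100, 9900)   # did not see the ad

def posterior_samples(visits, non_visits, n=100_000, prior_a=1, prior_b=1):
    """Samples from the Beta posterior over the visit rate."""
    return rng.beta(prior_a + visits, prior_b + non_visits, size=n)

rate_exposed = posterior_samples(*exposed)
rate_unexposed = posterior_samples(*unexposed)
lift = rate_exposed / rate_unexposed - 1.0   # relative lift

lo, mid, hi = np.percentile(lift, [5, 50, 95])
print(f"median lift {mid:+.1%}, 90% interval [{lo:+.1%}, {hi:+.1%}]")
print("P(lift > 0) =", float((lift > 0).mean()))
```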
More recently, last year, Foursquare bought a company called Placed, which only did location-based attribution.
And so we found that we kind of did it the same way,
but now the Foursquare attribution team is like, I don't know, 20 data scientists.
And I've sort of – when I was doing it, it was like me and two other people.
And now we're like, okay, well, we've got this group in Seattle who are very good at doing it.
So I don't have as much insight into it, but I assume it's similar.
But some of the models we had a logistic regression model on the top layer, but some of the bottom layers were actually a bit more complicated.
So, for example, one of the things that might affect whether you visit a place or not is
your age and gender.
You know, wouldn't you agree?
But we didn't have everyone's age and gender.
And what we did was, for the people whose age and gender we did have, we built a
model to predict what it was, given the places that they visited.
And I did this fun thing where I revealed to the company, like, everyone's probabilistic gender.
And I was nice; I had to be very careful. Like, you know, it's just a model.
This is not, you know, telling you which gender you actually are.
It turned out that most of the people who are misgendered by the model were married people who were doing things with their spouse.
Oh, that makes sense.
Yeah.
And I found out like I was 99 percent male.
But I also found interesting things, like it wouldn't be so high if I didn't go to Best Buy, which is like across the street, all the time.
Then it would be more like 70 percent.
Does going into a Best Buy make you more male?
Of course not.
It's just a model, guys.
It's not causation.
It's correlation.
Yeah, yeah, exactly.
But no, then we were able to use the output of this model as an input to the attribution model.
We didn't just say, which gender are you? We actually used the percentage output from the model and plugged it into the new model directly.
So we had these layers.
And the age and gender model was a little more complicated.
I don't think it was deep learning, but it was gradient-boosted trees, so
something a lot more nonlinear than just, you know, a logistic regression.
Yeah, totally makes sense.
Yeah, I mean, I think from what I've seen, it's always this struggle.
You know, you have the super complicated model that squeezes out the extra percent, but no one can interpret it.
And so you build all these simple models. But then, you know, if there's any skew or bias at all,
then you're kind of informing your product managers based on one model,
but then your actual system is using a different one. And that's always difficult.
Right. Sometimes it sort of evens itself out.
So, for example, the gender model: we're using the output of that to predict whether you're going to visit a place.
And so the visit prediction algorithm is just using this input.
It doesn't know that it's a gender prediction.
It just knows that it's a feature.
And so it could use it however it wants.
And even if the gender prediction is off, the layer on top of it,
the visit prediction kind of takes what it can from that.
And so we don't have to worry about whether it's entirely accurate on
predicting gender or not. Yeah, yeah, totally. And if you have the uncertainty of the gender,
that could also go into the model. So it could say if the gender isn't certain,
then it doesn't matter what number it is, we'll just throw it out or something.
Yeah, that's exactly what we did. Or in our case, if it was like 50-50... I think we had a few...
I don't know, there's a number of things you could do. You could just, you know, input a 0.5 as a real number.
That's the feature. Or you could just say, you know, whatever the probability is,
sample from that and give me the sampled value, either one or zero, and do that a few times.
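A tiny sketch of those two options for a probabilistic feature, with a made-up downstream "visit model" standing in for the real one: either feed the upstream probability in directly as a real-valued feature, or sample hard zero-or-one values from it and average the downstream predictions.

```python
# Two ways to consume a probabilistic feature like "P(user is male) = 0.7".
# The downstream "visit model" here is just a made-up logistic scorer.
import numpy as np

rng = np.random.default_rng(4)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical downstream weights for [gender_feature, dist_to_venue, past_visits]
w = np.array([0.4, -1.2, 0.8])
other_features = np.array([0.3, 1.5])   # made-up values for this user

p_male = 0.7   # output of the upstream (e.g. gradient-boosted) gender model

# Option 1: plug the probability in directly as a real-valued feature.
p_visit_soft = sigmoid(w @ np.concatenate(([p_male], other_features)))

# Option 2: sample hard 0/1 genders and average the resulting predictions.
samples = rng.binomial(1, p_male, size=1000)
p_visit_sampled = np.mean([sigmoid(w @ np.concatenate(([g], other_features))) for g in samples])

print(f"probability as feature: {p_visit_soft:.3f}")
print(f"averaged over samples:  {p_visit_sampled:.3f}")
# A 50/50 gender prediction would carry little signal either way, which is the point above.
```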
There's a lot of things you could do. So what are you doing now? So you've worked on so many
different parts of Foursquare. I mean, we've talked about several of them just in the show alone.
And now you're at the Innovation Lab. What do you do there? Right. So every once in a while,
I get to do a statistical model. But right now I'm really just working with our founder, Dennis Crowley, and we're trying to use Foursquare's core tech to put
out new apps that sort of inspire, show off the company's technology.
And so it's a nice change of pace to be able to essentially work on the fun stuff. We had a really great product ready to go, and it was called Marsbot Audio, where you
would walk around the city, and one of the things that Foursquare can do is it can tell
whether you're walking past a store, like when you walk right past a place.
And so we can trigger an audio file to play when you walk past something.
So we have like sound effects.
We have text-to-speech, and you could even upload your own audio.
And then, you know, whenever someone with the app walks by the place you set it at, they'll hear it.
And so that was really cool.
It was ready to go for March 11th.
And we're like, are we going to be finished?
Are we going to be finished?
And on March 10th, we were like, it looks like we're going to finish for tomorrow.
And then the country shut down. So we kind of pushed that back six months or so.
I really, really hope to get this out, though, as soon as we can, because the tech is ready and it's really cool.
But that's awesome. One of the things I've been thinking a lot about is I think what you're working on falls right into this, is how
to sort of reboot people's habits, right? So imagine someone is in the habit of going to the
movie theater once a week. There has to be some way to sort of reboot that habit when things open
back up. And I think something like this could be pretty cool where, you know, it could kind of encourage... almost like
a fitness tracker, but for fun, you know, just to help people kind of recover their social life.
I know, I know. It's been a little... I'm not happy with the way the industry has gone. I
feel like a lot of these consumer apps,
one of the things that attracted me to Foursquare was how positive it was as a social network
and as a tool to be present
and do things in the real world.
Foursquare is not one of the big tech companies right now,
and I feel like a lot of consumer technology
has gone in a darker direction that maybe we don't like.
And now over the next three months, I feel like I want people to get out again.
I want people to have a good life again.
And I feel like people are saying, oh, no, no, that's evil.
We have to work on contact tracing apps.
We have to work on apps to tell you which friends are out to kill you.
And I'm not too happy with that.
I'm really hoping... I feel like my thinking is against the grain here. But, well, I think it's a
great thing for an innovation lab to be working on, because it's
sort of the next frontier. Right. I mean, getting through
COVID is frontier one. But trying to rebuild, you know, the social infrastructure,
I mean, that's going to have to happen. And so thinking about that now is really good.
Yeah. Yeah. So over the last few months, we've been working more on, you know, data sets for,
you know, coronavirus, essentially not for the virus itself,
but to see which places are coming on and offline at different times.
You know, there's a Foursquare recovery index at visitdata.org.
But, yeah, pretty soon we'll return to these, I hope.
Cool. And so working at Foursquare, can you kind of walk us through,
I mean, we have a lot of folks who are, let's say,
in college right now. They are looking for internships. They're looking for jobs. But even
more than that, they want to know kind of what it's like to work at a company like Foursquare.
Could you kind of like walk us through kind of what a day in your life is like? Yeah, well,
one of the things that I've learned over the last 10 years is don't get too comfortable because once you are, things are going to shift under your feet pretty quickly.
And so, you know, whatever a day in my life is now, well,
it certainly isn't the same as what a day in my life was four months ago, back in February.
And it certainly wasn't what it was, you know, two years ago.
I'd say the best case for Foursquare was when we were, you know,
when we're working on these labs projects and it's just three or four people talking every day,
working on the whiteboard, trying something out, making progress, you know,
building something into the app and then returning and then trying
to go out and use it and then returning and then, you know, trying it again.
Attribution, yeah, the things that I enjoyed were actually, like, coming out with these models
and selling these models to the company and even going on some sales calls and trying to explain, you know,
how they work and why, you know, certain ads are seeing lifts and certain aren't.
There were a lot of issues, you know. Sometimes there are issues on those teams
when, you know, certain companies wanted custom things done that we as engineers didn't really
think made a lot of sense, and you're kind of pressured by the sales team into that.
That stuff happens as well. Sometimes, you know, you run into cases where
you're excited about a product, but the company kind of reaches a dead end or deprioritizes your product.
And for someone, well, even for someone who's been in the industry for a long time, but also for someone new, like that's always a very disappointing point, part of working in this field.
I don't know if you'd agree.
Yeah, totally.
Yeah.
That is the number one complaint.
I'm on a research team, like an applied research team.
And the number one source of frustration is when, you know,
we have something that all of the math
looks great and it's headed in all the right direction, but the product changes and the
demand disappears. Yeah, yeah. Sometimes that happens. Sometimes,
you know, it is very valuable, but the leadership in the company either doesn't see it or is focused elsewhere.
But, you know, there's a plus side to that, too, which is when you do ship something and you get something out.
And either it's, you know, on the back end, business to business, bringing a lot of value, or it's on the consumer side
and people are loving it and telling you about it. Like, there's nothing better than that. Yeah, yeah, totally agree.
So is Foursquare hiring, like, full-time? Are they doing internships? With the whole COVID thing,
are they going to do fall internships? Things are a little bit up in the air at Foursquare because,
again, you know, we bought Placed last year, and this year we bought, or merged with, another company called Factual.
So there's a lot of integrating these companies.
There's a lot of moving pieces at Foursquare right now.
So I don't exactly know what the internship situation at Foursquare is like.
But you can go to Foursquare.com slash jobs. If I were looking for a job, I'd also look at Union Square Ventures, which is one of the VC companies that funds Foursquare. They have a job board for all of their companies, and they invest in a lot of cool companies, so I often go to their job board.
Sometimes I first show people, this is the Foursquare job board, let's see if your job is here. If it's not there, then I jump to the USV job board, because there's a lot of cool companies in there as well that are kind of a similar size, similar philosophy.
So is Foursquare on this remote work bandwagon or are most of the folks in New York?
We are remote right now and I don't know where we're going.
Yeah, that's true.
Personally, I really hope to get back into the office.
I just feel like some of the creative work that we do has to be face to face.
I don't need the open office with 300 people. But having like a small group of, you know, three to seven to ten people get together
and solve problems is way more efficient than everyone in their apartment with either, you
know, delivery people come in or their kids screaming at them or whatever, their dog,
whatever is happening in the background.
In my opinion, it's not working as well. Or at least, in order for it to work as well, we'd have to... like, none of us have our apartments, particularly in New York City, optimized for this.
So it's not... I really hope to get back.
I don't know what the situation is going to be.
There are people who are saying, let's never go back.
I don't know who's
going to win. Yeah, honestly, I don't think anyone does. I mean,
honestly, when Jack Dorsey from Twitter said no one has to come back, that was
absolutely shocking to the entire industry. I mean, that was seismic. I mean, Zillow's stock price
probably tanked when he tweeted that. Right. I mean, it's just unbelievable. And I totally
agree with you that, at least for now, there hasn't been the right mechanism for a while.
Actually, this is way before covid when most of us were in the office, we do have another team that's in Seattle.
And what we did is we set up a television, and they did the same, where,
when you look at the television, you see their desks.
It's kind of weird, you know, it's kind of like, literally,
like some kind of portal.
I don't know if we need to see into everybody's homes either.
I mean, there's a lot of personal things in people's homes.
And I don't see how people don't recognize that.
Yeah, exactly. I think there needs to be. Yeah, there needs to be a new modality.
I mean, like one way to do it is maybe, you know, we have some on the clock time where everyone's there.
And, you know, that is a time where the mic is always on or something.
I honestly don't know.
It's a totally uncharted territory, but we'll have to see how that all goes.
But, yeah, I think if folks out there listening are interested, it's foursquare.com slash jobs, or what was the other
one? Union... union something? Oh, usv.com slash jobs. I think that's what it is.
Let me check. And so, yeah, we'll put it in the show notes and we'll correct it either way. Cool.
This was awesome. Oh, we should talk about your podcast. So your podcast is
called The Local Maximum, every week, which is awesome. Every week we reach a new local
maximum, hopefully higher than the previous local maximum, but not always. Yeah, I think the name is
great. Yeah, hopefully you have some kind of mini-batch going on or something where you can eventually reach a global maximum.
You don't just get stuck. That will be the last... if there is a last episode, it will be called The Global Maximum.
Yeah, that's right. Exactly. Very cool. And so what do you talk about there?
So I generally talk about things that are interesting to me.
I'd say about 50 percent of my episodes are guest interviews.
And so a lot of the guests are technology related, but I do branch out a little bit.
I branch out to, you know, professors, probability people and statisticians, of course,
and then people more on, like, the data engineering side.
But then I've also, like, talked to historians and other podcasters and
comedians and things like that. So I am kind of all over the place there. One of the things that
I like to do is take the concepts in sort of machine learning and Bayesian analysis and just
like apply them to either the news or to your everyday life. And so, and by the way, if you've made it this far,
but you kind of are interested in Bayesian inference
and you kind of want to look at it from,
you kind of want the 101, the more basic ones,
I have a lot of really good episodes for you to listen to.
Like, I think in the hundreds, I've got...
let me check my archive for a second.
But I know in episode 105,
I talked to a mathematician, Sophie Carr, who talks a lot about Bayes' rule. And then I had a few
episodes after that about like, what is probability? And so we do go into the philosophical side of
things. But also, I try to make it relevant to someone who is not a data scientist and not in tech.
Like I can explain what overfitting and underfitting is.
And then I ask, OK, who in your life do you know who is always overfitting?
And people people tend to have an answer for that.
So everyone's got that one uncle, right?
Yeah, exactly.
And well, I think the example we have is often like, you know, toddlers can overfit.
Not always. Yeah. And sometimes older people tend to underfit.
But it's interesting to just have these prompts and come up with these examples.
Or sometimes I just kind of, you know, take a bunch of news stories that day and try to distill it down into a theme like
what's what's Occam's razor or what is expected value or something like that.
What inspired you to I mean, it's a huge, huge undertaking to do a podcast.
I mean, you do one show a week, which is which is just it takes an extraordinary amount of
time.
And so what inspired you to really start it and what kind of keeps you motivated?
Well, I was trying to... well, I had a radio show in college. This was back in, like,
2004 to 2006. And that was, like, the best part of my week, when I was doing that radio show.
So, um, I knew I would enjoy it.
I was trying to put more content out there.
And at the same time, you know, I, you know, I had spent a summer kind of at NYU Future Lab talking to entrepreneurs.
And I was like, well, you know what?
I think I would enjoy this. No matter what project I do, I'm going to want to have like an audience to discuss it with and kind of a forum to learn about, you know, learn about new concepts and learn about issues and talk to people who I'm interested in, like authors.
And so I tried it.
I did like a 10 episode challenge.
Let's put on 10 episodes and see how it goes.
And then after that, I just kept going.
It was just a lot of fun.
Every week is, you know, some weeks are easy.
Some weeks it's a little bit challenging when I have to do a solo show, and the solo shows always take the most research and thought, even though it should be simple.
Right? I could just edit whatever I want. No one will ever know. But I feel like the solo shows are the hardest for me because, when I'm interviewing someone, at least I can ask for their content, their information.
But one week seems to be the sweet spot where if it's two weeks and I had something to say, it would take too long for me to get the time to say it.
But if it was more often than once a week, it would just be completely overwhelming. Yeah, totally makes sense. I think
Patrick and I originally were doing it biweekly and we switched to monthly. But yeah, I totally
agree that, you know, if you find a piece of news early in the month, sometimes you have to you have
to kind of throw it out. And because because, you know, two weeks go by, things become irrelevant really quickly.
I mean, especially in 2020, where your entire lifestyle changes, it seems almost week to week.
Yeah. Yeah. I mean, I'm almost going to go back to some of these episodes and relive it.
It's like, do I want to do that? I don't know. But but no, but it did generate a lot of interesting episodes.
Like I did one episode, 115, all about the coronavirus models.
You know what? What was right? What was wrong? What are they actually trying to do?
And I had a lot of you know, I talked to my friend who works in a hospital about recently about what his experiences have been like over the last three weeks.
So, you know, we have been talking a lot about what's going on.
I, you know, some episodes I feel like I just want to talk about a concept
and ignore what's going on in the outside world.
I've completely thrown that away, I think.
Yep, yep.
Yeah, I think, you know, it's interesting. We have this sort of bipolar, or bimodal, response to our current events.
There's some folks who say, you know, I just want to know about the topic.
And there's other folks who say, I'm so glad you add color to the show and it's not just about tech all the time.
And so I think the compromise we finally struck was to put the timestamp of
when the show topic starts. That seemed to satisfy most people. But I totally agree
with you that the idea of just ignoring the outside world, you know, like the entire world
is burning down and you say, okay, today we're going to talk about the beta distribution,
is kind of odd. But yeah, I feel like... I know we have to wrap it up, but I feel like the purpose of the podcast is not to be, like, a course that you take that's going to be evergreen.
It's more like let's sit back. Let's have a conversation about issues that are important to us.
And let's talk about concepts that we know, and then we'll get interested in things, we'll discover things. Then if we want to learn more, maybe we'll,
like, you know, look at a more formal course or something like that. Yeah, yeah, totally. So
The Local Maximum is the podcast. Are you on... you're probably on, like, Google Podcasts
and Stitcher and all these things? Apple, Spotify. Great. Yeah, if you
don't have any of those, or if you want more information, you can go to localmaxradio.com.
And Max, it was really awesome to have you. I just realized your name is Max
and it's Local Max Radio. Is that on purpose? Well, yes, I realize it's a triple entendre, right?
Yeah, that's right. So there's the local maximum, which is the probabilistic...
Well, the local maximum is both an optimization concept, a machine learning concept, and a design concept, too.
So that's one meaning. There's also my name in it, and also the local maximum.
Well, it's, you know, location data. I'm all about that, too.
So two meanings.
Two meanings is one thing.
But three meanings, that's when you know.
That's very rare, three meanings. Yeah, you hit the jackpot.
It's very clever.
Cool.
This is awesome.
I'm definitely going to listen to some episodes.
Actually, right when we get off this, I'm going to check this out because this is super exciting.
And I really appreciate you coming on.
Folks are just dying,
chomping at the bit to know about anything AI machine learning. And so, you know, I feel like we covered a lot of really, really important ground here that any engineer can apply,
even in their day-to-day work, which is ultimately, I think, really useful and has high impact.
Cool. It's great to be on the show and to have this really fascinating discussion with you.
We covered a lot of ground, things that I have to look more into now.
All right. Cool. Awesome. It was really great talking to you.
And then for folks out there, thank you for supporting us on Patreon.
Obviously, tough times for a lot of folks out there.
We don't demand
anything. All the content is totally free. But we do really appreciate all the donations that goes
to our equipment. And we have some hosting costs and all of that. So thank you again for your
support. Thank you for your emails. A lot of really good ideas keep coming in. And our hope is to, I don't think we'll ever end the FIFO queue,
but our hope is to go through it and try to answer as many questions as you folks have.
And we'll see you all next month. The intro music is Axo by Binar Pilot.
Programming Throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license.
You're free to share, copy, distribute, transmit the work, to remix, adapt the work,
but you must provide attribution to Patrick and I and share alike in kind.