Programming Throwdown - Bayesian Thinking

Episode Date: June 22, 2020

Many people have asked us for more content on machine learning and artificial intelligence. This episode covers probability and Bayesian math. Understanding random numbers is key to so many different technologies and solutions. Max and I dive deep and try to give as many pointers as possible. Give it a listen and let us know what you think! Max also has an awesome podcast, The Local Maximum. Check out his show on any podcast app or using the link in the show notes! Show notes: https://www.programmingthrowdown.com/2020/06/episode-102-bayesian-thinking-with-max.html ★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Programming Throwdown, Episode 102: Bayesian Thinking with Max Sklar. Take it away, Jason. Hey everybody, so this is gonna be an awesome episode. I know so many people have asked for more content on AI and on machine learning. And today we have Max Sklar, who's an engineer at Foursquare's Innovation Lab. He's going to talk to us about Bayesian thinking, Bayesian inference, and this whole universe. So thanks so much for coming on the show. Thank you so much for having me. I love talking about this stuff. So let's get into it. I'm excited. Cool. Awesome. How are you doing? How are you handling the COVID situation?
Starting point is 00:00:53 Well, as reasonably as one can expect, I guess, you know, every week on the podcast, people can hear my mood going up and down and have commented as such. But I also experience mood swings. There's something about, you know, being in the same place for everything that causes you to have these ups and downs. Yeah, it seems to be a universal thing. I feel like I was mentioning this to someone yesterday. Like, do you ever, like, go through your Apple photos or Google photos or whatever? And, you know, just go through your past, like the last five years, 10 years and just scroll. I'm like, what's it going to be like scrolling through these months? It's like all of a sudden
Starting point is 00:01:33 I'm living my life. And then, oh, it's a bunch of pictures in my room. And then a bunch of pictures of sirens going by and then more pictures of my room than a bunch of wires. And I'm like, oh, let's skip this. Let's skip this part. Yeah, it's so true. Yeah. What I heard the other day, which was really resonated, was that 2020 is kind of like when you get bored at SimCity and you just turn on all the disasters at once. It's kind of like what things have degenerated to. to it's it's funny i i read like a full review of one of the things that um i wasted my time on during during this pandemic which unfortunately i feel like i'm not alone when i feel like i could be getting so much done but i wasted so much time was watching like a review of the original sim city on on youtube so okay i found out i was i was reading an article that there's a game called Sim Refinery that Maxis made, I think, for Chevron and further employees. And someone was able to get a copy of it. And you can actually
Starting point is 00:02:32 download it and play it in an emulator. I haven't tried it yet. Cool. So the topic is something that we need to really attack kind of from the surface because there's sort of a lot of layers to it. And so I feel like maybe one good way to, you know, get started on it is just to talk a little bit about uncertainty. I mean, there's a lot of challenges. You know, everyone knows software, at least most software is deterministic. You say one plus one and you expect it to be two every time. But when you're doing anything from you're trying to make some kind of prediction or you're getting data from some analog device or some sensor, the world around you or even just behaviors of people is full of uncertainty.
Starting point is 00:03:31 And just in general, like how do people deal with that kind of stuff when they're writing code in this very rigid way? Yeah, well, one one thing that I've learned even over the past couple of years, as I've been doing this for many years, is how deep the rabbit hole goes in terms of the different kinds of uncertainty and risk and probability and the different definitions around it. People don't even agree on what uncertainty means or what probability means. I kind of tend to take the subjective view that it's sort of your degree of belief that something is true. But then there are the cases where, you know, hey, we're at a casino and we're flipping a coin and I know reasonably well, you know, what the probability is. Everybody agrees that like a coin flip is 50 percent and the die roll is a six, a six, a six, et cetera.
Starting point is 00:04:21 But in the real world, it's not like that. You often have cases where, you know, you don't even know what, you know, it's sort of a subjective on what the probability is of certain events. And the best you could do is try to estimate it or try to figure out, okay, what are the different possibilities that I want to consider here? And that's where Bayesian inference really shines. So I think the first step is just it's a dive into a problem and maybe it'll help to go over some examples and try to figure out, you know, what type of uncertainty am I dealing with exactly? And kind of having that discussion with your team is usually very fruitful. Yeah, it totally makes sense. I mean, I think one of the one way that sort of has both of those probability and uncertainty in one is the is the classic multi armed bandit,
Starting point is 00:05:10 which is this idea where, you know, imagine you go to Las Vegas and imagine now, you know, in the real world, you know, slot machines are highly regulated. And so you could go to any casino and I think you could even look up the probabilities. They're all posted publicly. But but, you know, let's suspend belief on that for a moment and just assume that every hotel had their own rules. Maybe some hotels are brand new and so they make the slot machines really generous. Some of them are very strict, but you don't know what any of those probabilities are. And so you you have, you know, maybe, you know, a thousand dollars worth of quarters in your pocket and you or in a bucket or something and you want to basically make as much money as possible. And so you're kind of
Starting point is 00:05:57 torn because you don't you don't know any of the information, but you don't want to waste a lot of money finding out either. You want to quickly kind of dial that in and then choose the best slot machine for the rest of the day. And so you could see how you're sort of balancing, you know, reducing uncertainty and, you know, becoming more certain about your world with, in this case, I guess, profit. Yeah, exactly. And another, one of the things that I like to talk about is sometimes you talk about, okay, what's the implications in terms of building a machine learning model at work? But sometimes there's also the question of what does this mean for my everyday life?
Starting point is 00:06:37 And the multi-armed bandit example is the question of what is the cost of getting information? And oftentimes we don't consider what the cost of getting information. And oftentimes we don't consider what the cost of getting information to get the right answer is, because it's not zero. And so people often either go into rabbit holes and try to get as much information as possible when it kind of freezes them in place,
Starting point is 00:06:59 or people will throw up their hands and just make a decision because they don't know. But taking a second to think about, what's the cost of getting information and what does that get me is another way to kind of dig into a problem. Yeah, yeah, totally makes sense. Yeah. And so, you know, imagine, you know, if you're writing, if you're writing something that's going to do any kind of forecasting, you're invariably going to have some type of probability and then some type of uncertainty around that. So, I mean, this is something that I think really gets neglected in a lot of these kind of coding boot camps or these tutorials online. Like if you've seen the
Starting point is 00:07:37 show Silicon Valley, where I have not seen the last season, but you can spoil me if you want. Oh, you know, I haven't either. So we're both in the same boat. OK, we're both in the same boat. So, Patrick, try not to spoil it for us. But but basically the person I think Gian makes makes a hot dog, not hot dog app. Right. Right. And so just taking that example, you know, if you build a model to do that, it's not going to say thumbs up or thumbs down it's going to give you a probability and if you build a Bayesian model it's also going to give you a distribution which we can which we can talk about but but you have to then you know turn that into an action so so that really is context dependent so for, for example, you know, maybe the hot dog, not hot dog app.
Starting point is 00:08:26 You should only say hot dog if it's really confident because you're going to notify someone or send an email and you really don't want to be wrong. But then on the flip side, imagine something a little more serious like you're doing cancer screening. Well, maybe if you're 10 percent sure you should say yes, because yes means running an additional test. Right. And so so you really can't you can't do anything in machine learning without, you know, dealing with probability and sort of philosophy or economics of whatever you're trying to build and trying to marry those two together. Yeah. Oftentimes you're ultimately after making a decision, but you don't want to just go right away to the yes or no. You want to come up with the probability or the distribution and then use that to help make the decision. So kind of breaking it up into two
Starting point is 00:09:15 parts. If I could give an example from my work that this is, we'll probably bring this up again, but this is sort of this example. It might not be the most fun example, but it's definitely one of my favorites because it's just one of the projects that I've done that just brought so much value, which was in ad attribution. That was a product I worked on for Foursquare a few years ago. A lot of people in data science, you're going to run into attribution at some point in your life trying to figure out whether ads work or not. Big data problem, you know, lots of people trying to solve it.
Starting point is 00:09:52 And Foursquare was kind of, we were trying to figure out whether people, whether ads were driving people into stores. And we had some visit data and we had some data on who saw the ads. And it was really hard to untangle what the clients were asking for because they wanted to know lift. They wanted to know, okay, what percent more likely is someone to visit one of my stores if I give them an ad? Is it 10 percent? Is it 20 percent?
Starting point is 00:10:18 Is it 5 percent? And no one asked for a probability distribution over that. They just wanted a number, an exact number. And then they wanted to tell us either an exact number or we don't have enough information to say. And originally, the team was kind of just taking that at face value, being like, okay, let's try to just calculate this number. But it led to so many problems with, you know, how accurate do we know? How accurate is this number, you know, when can we report it, when can we not report it, and then to untangle that, we took a step
Starting point is 00:10:54 back and said, wait a minute, wait a minute, we don't know what the lift value is for an ad, but let's try to turn this into a Bayesian model, and we could talk about, like, I don't know if we want to kind of go and dive into like the intro to Bayesian theory, maybe after this. But we said instead of talking about what is the lift of an ad, let's say, hey, we don't know what the lift of an ad is, but there's some uncertainty around it. There is it could be anything. And then as we have some probability distribution there. And then as we gather more and more data, we update our probability distribution and it gets taller and skinnier over time. And then we could come up with like a confidence range or sort of a Bayesian confidence range, which is, you know, hey, your lift is somewhere in there.
Starting point is 00:11:38 And then we could use that to help the clients make decisions. And it's sort of, even though some clients didn't understand what we were doing, it sort of made it a lot easier on the engineering and staff side, where we could be like, okay, we now all agree on what we're doing and what we're doing makes sense. And it was just amazing to see how it just detangled a lot of our issues. Yeah, yeah, totally. Yeah, I think you touch on a lot of things there. So we'll try and unpack that. I mean, one is, I think probability distributions have to be one of the most foreign things. Let's say foreign entry level things in mathematics, right? Because I mean, there's a lot of things that, you know, string theory, there's a lot of things that are very complicated. But I would say PDFs are probably one of the things that
Starting point is 00:12:23 they're fundamental, foundational but they're also like really abstract and complicated. Yeah, can you take a crack at sort of explaining what a probability density function is to folks out there? Yeah, yeah. I mean, I guess one thing you could start with is if a PDF is blowing your mind, why not just start with the discrete case? Because, and I find this too, sometimes when I'm dealing with continuous data, if I think about it in a discrete way, it's a lot easier to start. So just think of what a regular probability distribution is. You have a number of potential events. In the case of Bayesian
Starting point is 00:13:04 inference, you're usually talking about several hypotheses that you're trying to decide, OK, which which one of these is the is the true one. Sometimes you're trying to figure out, you know, trying can kind of say, OK, this is what it is. It's just going to be a bunch of numbers that add up to one. Right. So each things have a probability. And then, you know, you could update those values as more data comes in. A PDF is a little more complicated because it's continuous data. So instead of having it on, let's say, 10 possibilities or 30 possibilities or two possibilities, actually, a lot of times it's two possibilities, you're dealing with a space that it's like, okay, this is a number. This is like a real number. In the case of attribution, we're trying to find the ad lift. This is a
Starting point is 00:14:01 positive number greater than zero. It's where usually it's greater than one because one means that the ad was not effective. So although there are exceptions to that. So it's it's now you don't have you don't have some a bunch of numbers that like a bunch of values that add to one. You have a continuous space that if you know calculus, it integrates to one. You can kind of break up the continuous space in different one. You have a continuous space that if you know calculus, it integrates to one. You can kind of break up the continuous space in different ways. You could break it up into sections like, hey, I could bucket it. What's the chance it's, you know, greater than five or less than five? Well, both of those have to add to one. But, you know, essentially, you're just searching a space of numbers. And you could even make it more complicated than numbers still. You could make it into, you know, R2, R3, you know, a space of vectors, but then that
Starting point is 00:14:50 gets even more abstract. So depending on, I'm pretty sure you probably have listeners who are comfortable with levels of abstraction on a high level and not so much, but, you know, essentially you're just trying to figure out the relative probability of different possibilities. That's all of this stuff is trying to answer that one question. And I and then and yeah, you know, it's your hypothesis. Yeah. It's your depending on how complicated you want your hypothesis space to be. Yeah, it totally makes sense. You basically you have this function.
Starting point is 00:15:29 And when the function is high, like you could imagine in your mind just just any type of function you'd, you know, graph on your graphing calculator or something. Right. And when that function is high, you know, has a high Y value that for those values of X, they're just more likely. And when the function is low, those values are really really unlikely so if you imagine like a coin toss um and and you're you're getting the pdf of this of this coin uh you know it could be you toss the coin three times and get three heads it's possible but it's much it's much more likely that you're going to get at least one tails and so you could imagine you know the y at one third being being higher than the y at zero but yeah the thing is really weird is as imagine, you know, the Y at one third being being higher than the Y at zero. But, yeah, the thing is really weird is, as you said, you know, if you pick a single point, it actually doesn't really mean anything.
Starting point is 00:16:14 It's really just when you integrate over a range, because, for example, yeah, like you never in a continuous phase, like you never have exactly 33, you know, 0.333. It's always going to be there's always going to be some uncertainty there. And so you have to deal in terms of these of these ranges. And yeah. And so it's kind of more like a pie chart that's been kind of unrolled in a sense. Yeah. There are a number of different ways of looking at it. Another one I think of it is relative probabilities. Like, let's suppose you have just a standard Gaussian function that peaks at zero, right? So the zero value is the highest probability. number has a probability of zero of landing on exactly that number. But you can actually compare, hey, what's the chance of me landing on zero versus the chance of me landing on one? And you can kind of compare relatively what those two are, even though they're both zero, which is kind of mind blowing until you're comfortable with calculus and differentials and all that. But sometimes you could just understand what the graph means, like,
Starting point is 00:17:26 hey, I'm going to be somewhere under this graph. And so I'm much more likely to be in the high sections than the low sections. And that's sort of where my mind goes when I read a simple PDF. And then every once in a while, I want to get into like really abstract, like what's going on here. And I, you know, sometimes I do read into this stuff. I read into like topology and, and, and measure theory. And you could really get into, yeah, you could get into like some really abstract. I was reading the other day about, I can't believe I'm bringing this up. I was reading the other day about something called pointless topology. And it's this idea that in a in a in a space like real numbers, the point is not the fundamental unit. You actually can only talk about, you know, places of finite extent.
Starting point is 00:18:17 So you have to be able to move a little bit to the right and a little bit to the left. And I mean, think about it in terms of like electricity. Like if you're doing if you're an electrician, like you never say, OK, exactly one hundred and twenty point zero zero zero zero volts is coming out of this socket like like no house has that kind of tolerance. Yeah. And so you're always thinking about things in terms of tolerance. And so you're always thinking about things in terms of hopefully small ranges of numbers. Yeah. And the scary thing is, is oftentimes the infinite precision number really has no meaning, you know, because what does it mean to if I said there was 120 volts, but it was, you know, zeros all the way down. Exactly. Like, is that even possible if you go deep down?
Starting point is 00:19:08 It's like quantum physics. Oftentimes the numbers that you you're talking about stop having meaning anymore. It's one example that I give is like how many water molecules are in the Atlantic Ocean? Could probably come up with a reasonable estimate to that. But is there an exact integer that represents that that value? Probably not. No. And I mean, with with evaporation and everything, you'd never be able to count it. So another thing that I think, you know, talking about Bayesian and statistics really helps people is with generative, let's say models, but even more generically generative systems, right? I mean, I think when people write, for example, when I write unit tests,
Starting point is 00:19:42 I typically, you know, ideally, I to make a unit test that's generative. So, for example, so, for example, let's say I write my own sort. Maybe, you know, I could have a unit test where I take some some canonical inputs. I know what the sorted outputs are and I run my sort and then I make sure it matches. But another way to do that would be to have a validation function. So I have, let's say, the STL quick sort, which I can't use for whatever reason, but I have it. And my unit test could be generate some data, run the STL sort, run my new sort, and maybe my new one's faster. You know, that could be part of the test.
Starting point is 00:20:27 And then compare the results, right? So that's just an example of this sort of generative unit test. So some people hate that idea or they tell you you need to fix the random seed, and there's kind of a whole rabbit hole there. But, you know, you can imagine a lot of testing as being sort of generative. And I think to really understand Bayesian inference and how to build these Bayesian models, you have to think in terms of, you know, you have some phenomena. What generated that phenomena? And can I, you know, artificially, how much of that phenomena can I generate artificially?
Starting point is 00:21:04 And so, you know, people who want to do, let's say, game design, you know, have to deal with this kind of thing. And so in general, it's really good for really any engineer to know how to build sort of a generative system to model some problem. Yeah. A couple of points on that. First of all, the the the idea that the first step in kind of a Bayesian inference is, once you've defined the problem, I guess I would say that's the first step, is then you have to figure out what your hypothesis space is. You know, what are the possible mechanisms that I'm going to consider that could have generated the data that I'm seeing or that i'm about to see and um one of the well i guess you could say it's like a pitfall of bayesian uh thinking or bayesian analysis is that there is no guarantee that um you know one of the hypotheses that you come up with is is true or that works
Starting point is 00:21:59 particularly well um and so that's why you often need to throw in like a dummy hypothesis and say oh this is actually just random data uh or something like that. And then if that beats everything, then, you know, OK, I either it either is random or I don't understand the mechanism. And so and I know we're talking kind of abstractly here, but if we can go back to I'm trying to get trying to figure out where we were in the conversation. If we go back to the unit test, like that's always very interesting. Like I remember like writing a unit test for the sword is one thing. Like I've written some to check if a logistic regression is coming up with the right answer.
Starting point is 00:22:39 And it's often – you often have to kind of tell the rest of the organization, my unit tests run differently. I either have to set the random number generator in stone, in which case I don't know if I'm necessarily getting what I want. My unit tests pass 95 percent of the time, but the other five percent, it doesn't pass. So every once in a while when you're doing this stuff, you run into problems with the organization with unit tests where I don't necessarily know the answer other than in some cases I just have unit tests that I run offline that aren't part of our automated testing whenever I change something that's probabilistic. Yeah, that totally makes sense. Yeah, I mean, and this is true in general. You have a generative system, especially one that's probabilistic. Yeah, that totally makes sense. Yeah, I mean, and this is true in general is, you know, you have a generative system, especially one that's, let's say, generating data for a test. There's always going to be that one chance that it generates,
Starting point is 00:23:34 like, really degenerate data, like maybe all the X's are the same. I mean, that could happen one out of, you know, a billion times or a million times. And there are unit test systems that are going to run your test a million times, maybe even a million times a day. So that could happen. Yeah, yeah, exactly. It happens all the time. One in a million.
Starting point is 00:23:58 I'm not surprised by one in a million things anymore. Yeah, I saw something. This is a pretty old quote, but it was from the lady who is the head of security at Twitter many years ago. And she basically said something to the effect of, yeah, if there's a if there's a one in a million chance that it happens, you know, 20 times an hour on Twitter or something like that. Yeah. You guys are going to take a little break to talk about University of California, Irvine's Division of Continuing Education. So this is a pretty cool program. They have a variety of different kind of certificates that you could acquire. They have things like Python, they have data science, they have machine learning.
Starting point is 00:24:38 And these are things where, you know, if you didn't necessarily get, let's say, a degree in machine learning or you haven't worked as a machine learning engineer for a bunch of years, this is a way to sort of get a lot of that knowledge, a lot of that expertise. And, you know, I know Patrick and I, we've both done a bunch of courses online. And so it's a really good way to sort of boost your knowledge and your skills in a particular area. Yeah, I mean, I did tons of online classes when I first started working. And, you know, for me, being part of a class, I mean, it's always interesting. But the curriculum, the self-paced stuff, it works great sometimes. But sometimes having a here's what we're doing each week, marching through their curriculum and going through it, it's very similar to how, you know, just a normal university class works.
Starting point is 00:25:28 In fact, you know, feeling like it's almost exactly the same is just a comfortable thing, a good way to learn and learning from professors who, you know, that's that's their thing. That's they teach. They help others to learn and having access to it, doing the assignments, it really helped me go from, you know, where my undergraduate left off to, you know, to just kind of bootstrapping into more specifics, higher level things, things that were more pertinent to my job at the time. You know, I hardly recommend people taking classes, continuing education from a college. Yep. Yep. Yeah. i think getting it through a university is is actually a really really stellar i mean it's really awesome that universities are starting to get into this and um you know that there's going to be sort of quality lectures and
Starting point is 00:26:18 professors there's there's you know there's a very strong brand behind any any sort of major university and you know uc reminds one of the top universities for CS. So they've been around since about 1962, I think. And they've been around a long time. They've been teaching a long time. They've been teaching online a long time. And so it's a good place to go and get this kind of education. Yeah, if you're interested, I think they're still doing enrollment for some late classes for spring,
Starting point is 00:26:50 but summer is upcoming. And as we've been talking about this whole episode, I mean, I think everyone has extra time at home these days. And if you're interested, you can check it out at ce.uci.edu slash programming throwdown. And we'll put the link in the show notes, of course. But once again, that's ce.uci.edu slash programming throwdown. Yeah. And if you do sign up and take any courses, you know, we'd love to get feedback. You know, please write us in. Tell us what you think of it. You know, we could pass it on to them. But also for us, it's really good to know, you know, what you thought about that. You know, folks out there who are listening. So. All right. Back to the show. So. So. So, yeah. So what makes something Bayesian or one of the one way to explain it is, you know, when you're building when you're building that generative system, if you build it on top of random numbers right now, you have kind of a generative model. So, for example, so, for example, if if if we go to our to our sort example, so we're generating some random numbers to sort and to compare the outputs of them.
Starting point is 00:28:05 If we know the function that's generating those numbers. So, for example, the numbers that we're sorting, we're generating them from a Gaussian distribution. Right. If we have the getting back to our earlier topic, if we have the probability density function for this generator, now we have, instead of needing, when we do analysis, instead of needing the list of numbers, we could just think about it in terms of, okay, this is the distribution that I'm sorting. And then all of the models and the math afterwards could be done on the distributions. And your output could also be another probability density function. Yeah. So I'm not sure I'm following exactly. So you're generating a bunch of random numbers from from, let's say, a Gaussian and you still want to sort it.
Starting point is 00:28:53 Are you saying, oh, yeah. Yeah. Yeah. Well, I'm saying let's say we just want to do some analysis on the sort. So, yeah, the actual sort is going is not going to be a Bayesian system because it's it's it's kind of a deterministic thing. But which is in general, I think if you start with sort of random numbers and then you instead of applying things on each of those elements that we've drawn from the distribution, you could also imagine, you know, doing functions on the distribution itself. So let's take this this Gaussian distribution and let's maybe multiply it by a constant and add it to this other Gaussian distribution. And if you do that enough times, you get some kind of linear regression. And so that that output then of that process also becomes a distribution. Yeah. I mean, I think the the main point is you can actually recapture the distribution from the random numbers.
Starting point is 00:29:54 And that's always very interesting. I was I've been reading like a lot of well, a few months ago, I was reading a whole bunch of the the covid papers, you know, the coronavirus papers. And the one that I was really interested in on March, you know, March 10th was how long does it take for someone to get it? Because I walked into the Foursquare office on March 6th and no one was there. And I was like, what's going on? Is it a weekend? Did I screw up? And haven't you heard somebody got the COVID and we all have to quarantine for two weeks.
Starting point is 00:30:27 And that was back then. Nobody was quarantining. So I was like it was kind of like I felt like I was personally being punished. But but after five days, you know, I was like, OK, I'm not sick. None of my co-workers are sick. And what's the probability that that probability that we got it or any of us got it? And as I looked into the number, I found one paper that was really good, which, you know, took the discrete data of, okay, these are example of people who know when they were exposed, know when they started exhibiting symptoms. And so they construct the probability distribution on, you know,
Starting point is 00:31:07 when you start exhibiting symptoms from when you were exposed. And I noticed in the paper, which I liked, which gave me confidence in the paper, which, you know, a lot of the papers that I've read over the last three months in this topic did not inspire as much confidence. But one of the things that they did was they checked a lot of forms for the distribution. So they checked they they checked to see if it was a Gaussian, but they also checked many, hey, you have a median of getting this thing in like five days and set 97 percent chance it'll come at you in 14 days. So that that sort of inspired confidence from the from the data they got.
Starting point is 00:31:57 And yeah, nobody at work at least got sick from work. So that was good. I was pretty confident after five days, but my co-workers were still not so confident. Yeah, I mean, I think the statistics, especially when it's rather early and there hasn't been enough time, it's just very hard to get that right. I think a lot of people learned the hard way that fractions have denominators, right? And so you might have a thousand cases, but if you tested a hundred times as many people as the next place,
Starting point is 00:32:32 then that obviously has a huge impact. Yeah, I mean, one thing that I like about Bayesian inference is it sort of tells me what my certainty is as I go. So if I'm like, quote unquote, allowed to draw an inference early on, I'll know that. And because, you know, there'll be less uncertainty around my data. If I already have enough data, if I don't have enough data and my my probability is still spread all over the place, then I'll know I still have to wait. Yeah. And one thing you touched on is is this idea that we have to fit distribution.
Starting point is 00:33:08 So so, you know, we talked about the probability density function, but there's there's a variety of different functions that have really nice properties. Right. So, for example, as we talked about, there's the Gaussian distribution or the normal distribution. And that is this symmetric distribution that has a whole bunch of really nice properties. Now, as we talked about the electricity example, you know, the real world, in the same way as the real world will never give you exactly 120 volts, it's also not going to give you exactly a Gaussian distribution either, right? And so, and it could give you a distribution that is, and it probably will give you a distribution that is unlike any of the mathematical distributions. So, you know, a lot of work usually does. Yeah. A lot of work has to go into sort of how do we sort of fit the data that we have to, you know, one or more or some
Starting point is 00:34:08 composition of distribution. So that actually is itself a really difficult problem. Yeah. One of the topics that I like talking about on this is conjugate priors. I don't know if you've run into that, but an example of a conjugate prior is a beta distribution. A beta distribution is an uncertainty over a probability. So let me say it in a way that might make sense, might make more sense intuitively. Like you have a coin, a weighted coin, and you don't know if the coin is weighted 0% on heads or 100% on heads. It's somewhere in between or equal to those two things. It's somewhere on a value from 0 to 1, and we don't know exactly where it is.
Starting point is 00:34:52 And weighted means it's cheated. It's cheating in some way, but you don't know. Right, right. So, I mean, you could think of it as a coin, or you could think of it as any event that's repeated, where I don't really know what the probability of this event, like I don't know what the probability of this event is. All I know is that it's repeated over and over and it's somewhere between zero and one. And so a beta distribution represents an uncertainty over that. And a beta distribution like a Gaussian distribution, it has, you know, has two parameters and, you know, you can say it has a mean and it has a variance and all that. And a beta distribution like a Gaussian distribution, it has two parameters
Starting point is 00:35:26 and you can say it has a mean and it has a variance and all that. But the really cool thing about the beta distribution is that if you have a beta distribution to represent the uncertainty over the weight of the coin,
Starting point is 00:35:42 if you flip the coin a bunch of times and then run through Bayes' rule and then calculate the updated distribution, the updated distribution is also a beta distribution just with updated parameters. And it's actually very easy. You just add the number of heads to one parameter and number of tails to the other parameter.
Starting point is 00:35:59 So the math all works out and you don't even need any complicated calculations. You don't even need to use a computer. You can just figure out what the need is. And so sometimes those are really good to look at. There's beta Dirichlet, which I've written a bunch about and Gamma. And then, like you said, oftentimes you want to, you know, in some cases you want to check, you know, hey, could this be some pathological distribution? And you should often throw that into the mix to check because some of the biggest mistakes, not mistakes, but like errors that people can make is like, you know, sort of trying to fit your data into a certain distribution when it doesn't fit. And sometimes you have to for simplicity. So you have to kind of get like an intuition for when this is going to be a problem. But it's usually a problem for instances where you have data where, you know, one day it could blow up and everything is affected. I'm trying to think of an exact answer.
Starting point is 00:37:09 And you get an intuition for which is which. For example, the attribution, I'm not worried about one ad all of a sudden. Well, I don't know. There could be an ad that could be extremely effective. But I don't expect it to be out of a range of like an order of magnitude. Yeah, exactly. And the other thing is, you know, these distributions are shaped by what we call hyper parameters. So, for example, looking at the normal distribution, it's symmetric. So so that's you can't vary that, but you can still vary basically how much you want to, let's say, stretch it.
Starting point is 00:37:48 And so we call that the variance of the distribution. And you can also vary where the top of it is. We call that, I guess, the basis or the base. There's different words for that. But the center of the distribution, the mean of the distribution works. And so those are hyperparameters, right? So let's say you have some data and you can fit a normal distribution. Let's say the data is bimodal, which means you actually see two normal distributions in your data.
Starting point is 00:38:22 So imagine it has sort of these two hops, right? Right. So now you can say, well, there's a Gaussian distribution that's generated this data, but sometimes the mean is this or sometimes the mean is that. And so you can have another distribution which tells you when you should use one mean or the other. Right. So you could have, let's say, a Bernoulli distribution. And when it's heads, you use the one mean and when it's tails, use the other mean. And so then in this way, it becomes kind of this chain or this this this. You could look at it as a graph where distributions are informing other distributions. And so you can even introduce, you know, external data.
Starting point is 00:39:06 So, for example, yeah, when it's sunny outside, it's much more likely to be this distribution, this normal distribution. When it's raining outside, it's more likely to be this other one. And it's just that the data you have was a combination of sunny and rainy. And so that's why you have this bimodal distribution. But it turns out if you add this extra feature, now you can explain it in a much better way. That's predictive. Yeah, yeah. And that's how these things kind of chain together. You can have, hey, which group am I in?
Starting point is 00:39:41 Am I in group A or group B? And that's the first distribution between A and B. And then within A, okay, we have a Gaussian. Within B, we have a Gaussian. There's a whole field called hierarchical models that's kind of based on that. And it's sort of like, okay,
Starting point is 00:39:58 I'm in a subgroup, and then I'm in a larger group. And there are certain properties I'm in a subgroup and then I'm in a larger group. And there are certain there are certain, you know, there's there's certain properties that the larger group has. And there's certain properties that each subgroup has. And as I get data, how do I figure out what those properties are? And a good example of that is in election forecasting. I interviewed on the Local Maximum, Alex Andorra. I think that was episode 99, if I remember correctly.
Starting point is 00:40:32 Yeah, so basically, he does election forecasting. He's in France. He's kind of like the 538 blog for France. Oh, cool. Yeah. And so what... I try to go to localmax radio slash ninety nine to make sure.
Starting point is 00:40:47 But I don't know. It's not not responding right now. But he we talked about hierarchical models. And the question is, OK, let's say I do a poll from a single town and I get a crazy outlier on that poll. And the way I talk about there's three things that could be going on. One, you could have just gotten a bad sample or you could have just gotten like a random, just it just happened to be that the random people you hit, it just fell in a weird way. That's like that one in a million or maybe not even one in a million shot where it's just it just randomly was an outlier and that does not reflect the data.
Starting point is 00:41:29 Two, it could be that that town, something is going on in that town and people are really changing their minds. And three, it could be that this is a broader national trend. And so when you get that data, how do you tell? How do you tell which case it is?
Starting point is 00:41:42 Or maybe it's a combination of those cases. And a hierarchical model, a hierarchical Bayesian model does a very good job of kind of disambiguating between these and trying to figure out, OK, where does this fit in? Yeah, totally makes sense. Yeah. I mean, so so you could actually have a distribution over all three of those hypotheses, like I guess in this case, a traditional distribution. And and over time and with more surveys and more samples, you would become more and more confident about about some mixture of those hypotheses. Yeah. Yeah. This episode 98, by the way, but not 99. But it's it's yeah, it's oftentimes I mean, one of the things that I might think of it is, OK, let's say each town has a and again, you could break it up by by town. You could break it up by age and gender and demographics and all these things. But to be, you know, to to to simplify, let's say each town, let's say it's two candidate election and you just have a beta distribution over what the probability for each person is in those towns.
Starting point is 00:42:49 Then the county, you say, OK, each town in that county, we're going to have a different beta distribution. And the the hyper parameters for the data beta distribution are are drawn from another distribution for that county and then so on and so forth. And you go up to the state level. And so that's, yeah, and that sort of disambiguates. And oftentimes in these cases, it can be very intractable because, like, do I want to break it up by geography? Do I want to break it up by demographics? Do I want some combination? And that's where a lot of kind of trial that that's where you're getting into you know the
Starting point is 00:43:25 machine learning curse of dimensionality i'm just trying to um you know trying to include trying to figure out which variables are best to include yeah i mean it's kind of it kind of ties into the multi-armed bandit stuff again where where you know just to give an example if you if you flip a coin once and you get heads, there's just a lot of uncertainty there. If you flip it 10,000 times and it is roughly 50-50, then you're much more certain. And so you have what's called tighter bounds, right? And so as you try to come up with more and more features to explain some phenomena. So, you know, the phenomena is the survey results.
Starting point is 00:44:06 And as you come up with more and more features to better explain that phenomena, you also are adding more and more uncertainty because, you know, maybe you've only surveyed one 18-year-old from, you know, the Lyon province, from Nice. And so now you've gotten so specific that you've blown up the variance. And so you're constantly kind of struggling with this, we'll call it bias variance tradeoff, right? And so that's one of the reasons why acquisition functions are really important. And so we could kind of dive into that. I mean, a lot of these a lot of these systems and one really cool thing about Bayesian in general is it's because it comes from kind of the signal processing world. You know, a lot of it is designed to be iterative. So it's not like collaborative filtering and some of these other methods where they're just concerned about this one shot.
Starting point is 00:45:05 Like you generate this probability of hot dog and then that's it, you're done. But in this case, you can actually have an acquisition function which says, you know, who should I survey next to get the most possible signal? And going back to the casino example, right, like which which casino should I visit to either learn more about it or or make make the most money? And so so actually, you know, Bayesian, a lot of Bayesian methods directly address the acquisition function, which is another thing that isn't isn't covered really well but is also really important yeah to expand on that a little bit a real world example from uh four square attribution uh well in this case it wasn't so much an acquisition function where we're trying to decide who to survey because we have our panel of data but it was more like trying to decide which data to throw out because we had too much
Starting point is 00:46:02 it was very expensive to put all of it in our model. And we were trying to figure out, let's use Starbucks as an example. I don't think Starbucks is a client, but let's use Starbucks as an example. We're trying to figure out, okay, what is the probability that any given user or for every user and their data and information about them, or what is the probability that they're going to visit Starbucks on this particular day. And we have that probability set up for every person in our system. And so that was a pretty big model to train. And we had examples of days where people did visit Starbucks
Starting point is 00:46:40 and days where people didn't visit Starbucks. And as you can imagine, there are a lot of examples of a person days where Starbucks was not visited. There are a lot of examples where Starbucks was visited, but there's a huge multiple of days where Starbucks wasn't visited. And so we did a bunch of stuff where we threw away some of that data so that we got less, you know, that we can then calculate our probability function, which was kind of a logistic regression on a smaller data size. And we, again, use Bayesian inference to say, okay, how do I get these original parameters from the original data set, given that we probabilistically threw away a bunch of this data
Starting point is 00:47:30 and we were able to calculate that as like a bias-corrected logistic regression. It's a very interesting problem to tackle. And again, save a lot of money, save a lot of calculation time and programming cycles and AWS costs and all that. Yeah, yeah, true. It's true. And even if you had infinite of those, you run into this other issue where the model will spend energy based on the loss. So, for example, the example you gave is great. You know, the odds that someone goes to Starbucks specifically, you know, if you look at everyone on Foursquare, including all the folks who have never been to a Starbucks, the odds that someone goes to
Starting point is 00:48:11 Starbucks is probably way less than 1%. Yeah. And so if the model, let's say, doesn't isn't a very, very large model, it doesn't have a lot of free parameters, then the model is just going to say no one goes to Starbucks ever. And it's going to be 99 percent accurate. Right. And so and so so typically, you know, that's one of the big challenges on the modeling side is is how do we respect the sort of economics of the problem? Right. Like like why is saying no one ever goes to starbucks bad the reason is because it's a lot more valuable to predict when someone's going to starbucks and get it right than it is to predict that this person is not going to starbucks right because if you know someone's going to starbucks you could send them let's say a coupon and you could get maybe three dollars worth of value out of that um if you know that someone's not going to Starbucks, that's kind of useless. Right.
Starting point is 00:49:08 Yeah. And so and more importantly, it was which which people are going to be swayed by the ads, too, because we wanted to know who was going no matter what. Yeah, that's right. Yeah. You need the uplift. Right. So so the so so you do all this, all these various tricks to to make sure the model is as useful as possible. But because you've done that now, it's it's uncalibrated. So now the model thinks that an average person goes to Starbucks, you know, every other day because it's been given data that's skewed. Right. And so then you have to do this calibration step where you build a another kind of simpler model on top of that model, which is which is given unbiased data and you freeze that that first model so that the first model is biased on purpose and you freeze that. But now you train this unbiased one afterwards. And, you know, I think a lot of this, you know,
Starting point is 00:50:01 between sort of this bias variance trade off and all the things downstream of that and this question of like, how do you subdivide the data and how do you gather new features? That's probably 90 percent of what a data scientist will do in their day job. Yeah. Yeah, I agree. It's some of this stuff can get quite, quite involved. And oftentimes, you have to take a deep breath and say, Okay, let's, let's start at the beginning. And let's, let's do this step by step. Because, well, in my case, and I think a lot of cases, you often have sort of, you know, a problem of explaining to the organization all of these issues, which can be intractable as well, you know, management. Yeah. So what is, how do you deploy, you know, something like this reliably? You know, and how do you know if you've deployed a bad
Starting point is 00:50:58 Apple? I mean, let's say that, let's say there's a data corruption. How do you kind of keep that from getting to people? And how do you build sort do you kind of keep that from getting to people? And how do you build sort of protection in case it does get out to people? Well, that's I'm trying to think if I have some good examples of that. You know, we definitely have a bunch of like, you know, data sanity checks in all of our pipelines to make sure things don't shrink or grow considerably between runs. But honestly, sometimes they're more, sometimes they cause us more problems than they help. And other times, other times things have blown up anyway. But in terms of models, you know, I think you just look, there's nothing that you can do if the data coming in is is just horrible, just completely like zeroed out or something, unless you sort of just have a check that the new that the new model is completely, you know, completely 180 degrees different from the old model.
Starting point is 00:52:06 Oh, that's a really good point. Yeah. So sanity smoke checks help a lot. One thing that I talk about a lot that's been running on Foursquare for the last, like, six years in its current form, but really eight years, is the rating system. So if you go to Foursquare City Guide, you can search for venues and restaurants and bars and stuff. Remember, we used to go to bars, movie theaters and things like that. Yeah. Yeah. So they had a one to ten rating. And the algorithm in that is one of the biggest signals is sentiment analysis on the Foursquare tips, which are like these mini reviews that people write. And I built this in 2012 and then kind of revamped it in 2014 with Stephanie Yang,
Starting point is 00:52:50 who I had on my show in episode three. And basically the sentiment analysis algorithm is based on people who have liked and disliked a venue explicitly, but also left a tip. And so we trained language models on that. And the reason why this is my favorite example is that this thing has been getting retrained every week for the last six years. And it probably, I don't know, Foursquare isn't growing by a big percentage now, but back in the day, like it used to get better over time. Like it would be better at doing new languages because, uh, a certain language maybe didn't have enough data.
Starting point is 00:53:29 But then when more people, that language started using four square, uh, it got better and better at figuring out, uh, whether something was positive or negative in that language. And so sometimes you can deploy something where it actually gets better over time instead of drifting. Um, yeah, which is really interesting. It's it's you have to a certain number of things have to align. You have to be training on your own data. And sometimes like, you know, it has to be something that doesn't change like that, that fast. I don't think I don't think language language does change. But in terms of what we were looking for, it doesn't change that much.
Starting point is 00:54:04 Or if it does change, it's the introduction of new terms that it could easily learn and not, you know, completely going opposite. But, yeah, it's always an interesting, I don't know if I have a clear answer for you, but it's always an interesting architecture problem. Like, how do I build this thing to last? Yeah, I mean, I think it's it's definitely an open problem. I think, as you said, though, you know, comparing to yesterday's model, having some expectation about drift and and sort of A.B. testing or I guess A.A. testing, you know, this version versus the previous one is really the best you can do. I mean, there are some sort of counterfactual evaluation techniques,
Starting point is 00:54:46 but it's very, very hard to dial those in. Yeah, especially something like venue ratings where how good they are is very subjective. And then if you want to dial, yes, there's no ground truth there. And then I guess we were using our own data for, even though it seems like they've worked pretty well and they've held up over the years, you know, people still say, oh, we enjoy these, you know, better than some of the other services that that rate places. But it I was trying to think of what like if you go if you go to the sentiment analysis, like, yes, we have we have training data that we have kind of ground truth there with people self-reporting. But it's not really always truth. So and, you know, there's spam and all that.
Starting point is 00:55:35 So there's a whole bunch of things that could go wrong. Yeah, totally makes sense. What do you think about, you know, I used to do Patrick and I both used to do a lot of image processing. This is probably like a decade ago. You know, and back then, as before, deep learning was a thing. And so it was a lot of kind of, you know, SVM or sorry, SVD and these other sort of unsupervised like these decompositional models, a whole bunch of human engineering. And and then deep learning came and just ate all of it. Right. like these decompositional models, a whole bunch of human engineering. And and then deep learning came and just ate all of it. Right. And so I remember when I was at NYU, I took a class in machine learning and the professor was Jan LeCun.
Starting point is 00:56:26 I didn't know who he was, but now he's at facebook and all that and uh he showed us i think it was the first day uh he showed us like a a camera that like a little eyeball shaped camera that he pointed at things and then on the screen it said what he was pointing at like it would be like keys phone uh desk you know and i'd be like and it was just uh it was just amazing and now that technology um is is everywhere yeah yeah yeah and so and so deep learning kind of just ate all that feel like And it was just it was just amazing. And now that technology is everywhere. Yeah. Yeah. Yeah. And so and so deep learning kind of just ate all that feel like no one has to know about Gabor filters or his Gaussian pyramid models or a lot of these color contrast models. I mean, they just take the raw pixels, throw it into this deep net. And so you're starting to see this emergence, and it's been going on for a few years now, of sort of this deep Bayesian models where we're basically, you know, it'll be some deep learning where the last layer is a Gaussian process
Starting point is 00:57:16 that just has priors that nobody knows. Or, as another example, you start to see variational inference pop up everywhere, this reparameterization trick pop up everywhere. And so there are all these sort of black-box models where nobody really knows what the distributions are, in the same way that nobody really knows what's going on inside a deep learning system, but it can handle enormous amounts of data. So do you think that deep learning is going to eat a lot of the Bayesian math, and it's all just going to become deep learning?
Starting point is 00:57:58 Well, first of all, I think some parts of deep learning are based on Bayesian math to some degree or another, so it's not necessarily that it's going to eat it. But I think there is always going to be a market for the simpler models. You're never going to reach a point where people say, oh, linear regression and logistic regression, don't even learn about those, they're obsolete. No, I think those types of things are always going to be the bread and butter, the first step to try to wrap your
Starting point is 00:58:34 head around the problem, to be interpretable, and to figure out what is really going on, what the causal mechanisms are, which a deep model can't necessarily tell you. And deep learning is really good at something like image recognition or speech recognition, where you have such complicated levels of abstraction that it can do really well. But I don't think it eats everything. I think these complex models are very important and they're going to do a lot of things, but eat everything? No. Yeah, I totally agree with you.
Starting point is 00:59:16 Yeah, I mean, if you think about it, imagine you have, let's say, a click probability model that's a deep model. It's still a logistic regression at the last layer; it just has this embedding, this composition that transforms the data into something that can be learned in a linear space, right? And so it might just be that we do this composition and have to cross our fingers there, but we could still do some interpretable ML on top of the embedding that comes out of it. And sometimes these click models or this marketing data isn't as deeply complex, with layers of abstraction, as image recognition is. So sometimes you can throw deep learning at it and you can get small wins, but not always. They're often small wins and not big wins, for lots of effort. But hey, I'm sure Google will use it to squeeze an extra 0.1 percent efficiency out of their clicks, and that translates to some huge amount of money. It might not be practical on the scale of, like, a small business, though. Yeah.
Starting point is 01:00:17 I mean, I think it's probably worth mentioning the reparameterization trick. It's something that most people could do with really any data, without having to fit distributions. And the idea is, as we talked about, that the distributions are defined by these hyperparameters. So let's take the logistic regression example. You have a set of labels, so you have a set of coin tosses,
Starting point is 01:01:15 and you want to have some function that takes in, I don't know, the wind and some other parameters, and decides whether the coin tosses will be biased one way or the other. But that's going to give you a single probability. Or take hot dog versus not hot dog, for example: it's going to give you a single probability, like 0.9 or something. If you want a distribution, one way to do it is to say, okay, instead of trying to get the actual answer... So the way that 0.9 comes about is that when we train the model, we give it a picture of a hot dog and we say this is a one, we give it a picture that doesn't have a hot dog and we say this is a zero, and over time it trains on that.
Starting point is 01:02:00 But you could also give it a batch of pictures and say, in this batch of pictures, 90 percent of them are hot dogs, and not necessarily tell the model which ones are which. You could do that, and it will still learn. It will take longer, because it has to sort of tease apart from these batches which ones are the actual hot dogs, but it would work. Does it have multiple such batches, like, to say in this batch 90 percent are hot dogs, in this batch 10 percent are hot dogs? Exactly, yeah. Oh, okay. So if we give it enough of these batches, and let's say we shuffle in between the batches, it'll still learn, even though we only have one loss, one bit of feedback, for the entire batch, right? And if you can follow that, then you could imagine saying: this batch has this distribution, and Mr. Model, I want you to match this distribution. So instead of saying this batch has a 0.9,
Starting point is 01:03:07 you could say this batch has this Gaussian distribution, match it. And the next batch will have a different Gaussian distribution. And over time it will start to fit all of these different distributions over all these batches. And so this is a nice little trick you can use. It's called the reparameterization trick. And so you can actually get a distribution out of pretty much any model.
Starting point is 01:03:34 And I feel like that's kind of lesson 101. We could do much better than that, but that's one way where anyone can get into models that output distributions. Yeah, I haven't heard of the reparameterization trick. I'll have to look into it. [...] has been mixed and matched and filtered and all that. And as long as I know how it was done, or I have some hypothesis as to how it was done, you can always come up with a Bayesian model on that and learn something.
Starting point is 01:04:15 Yeah, totally. I think having the uncertainty lets you do some really, really cool things. Again, it's problem-dependent, but Foursquare is actually a good example. Let's say we're recommending restaurants to somebody, and let's say we don't know whether they like Thai food or not. Well, maybe we should do what's called the upper confidence bound,
Starting point is 01:04:39 which is just a fancy way of saying that if the distribution is really wide, we want to be more likely to choose this item. So Foursquare doesn't know if I like Thai food. They show me an advertisement, or they suggest that a Thai restaurant is nearby. And if I cancel it, or if I tell them, don't show me this again, well, that tightens the bound. Now they know something, and so next time maybe they'll try Chinese food, right? On the flip side, it could be that I really love Thai food, but I always go to the same place, and so I haven't looked for Thai food on Foursquare. So by taking that chance, maybe I click on it and I go to that restaurant. That really opens up this huge array of possibilities that didn't exist before.
Starting point is 01:05:29 Right. So that's where showing the upper confidence bound of something can give you this nice mix of learning while, at the same time, using the information you've already learned. Yeah, lots of ways to slice and dice the data.
Starting point is 01:06:02 and these other large companies are just starting to get their hands around this. They have these very, very complicated models, but then they all output a single number. And you're kind of limited in what you can do with that. Yeah. Yeah. That's that's a constant struggle of, you know, the the data scientists often have more information than the product managers or the customers want to use. And, yeah, I get it. Maybe like in a consumer app, you don't necessarily want to start teaching people probability, but you want to be ready with all of the probabilities so that you can then make the changes to the, you know, you could then make product decisions as you go, I guess.
Starting point is 01:06:48 Yeah, that makes sense. Like, maybe I want an indication that we're uncertain, or maybe I want an indication that maybe you'll like this place because of this, or, you know, all the different possibilities that we've been over a lot over the last many years at Foursquare, for sure. So at Foursquare, do you have these, let's say, clear-box, very interpretable models, and also the heavy-hitting deep learning model that squeezes out the extra one percent? Do you do both, or do you just focus on the clear box? How do you strike that balance?
Starting point is 01:07:22 So let's try to figure out which product we want to talk about, the consumer product or attribution. We can talk about attribution for a bit. For attribution, we had a bias-corrected logistic regression, like I said, which tried to figure out the likelihood of someone visiting a place, given what day it is, given information about them and when they visited before, and all that. And then we compared what happens when they didn't see the ad versus when they did see the ad, and we used a Bayesian model to come up with ad lift. That was a few years ago.
Starting point is 01:07:56 More recently, last year, Foursquare bought a company called Placed, which only did location-based attribution. And we found that we had done it in kind of the same way, but now the Foursquare attribution team is, I don't know, 20 data scientists, whereas when I was doing it, it was me and two other people. And now we're like, okay, we've got this group in Seattle who are very good at doing it. So I don't have as much insight into it, but I assume it's similar. But in some of the models, we had a logistic regression on the top layer, while some of the bottom layers were actually a bit more complicated. So, for example, one of the things that might affect whether you visit a place or not is
Starting point is 01:08:46 your age and gender. You know, wouldn't you agree? But we didn't have everyone's age and gender. So what we did was, for the people whose age and gender we did have, we built a model to predict what they were, given the places they visited. And I did this fun thing where I revealed to the company everyone's probabilistic gender. And I had to be very careful, like, you know, it's just a model. This is not telling you which gender you actually are.
Starting point is 01:09:18 It turned out that most of the people who were misgendered by the model were married people who were doing things with their spouse. Oh, that makes sense. Yeah. And I found out I was, like, 99 percent male. But I also found interesting things, like if I didn't go to Best Buy, which is right across the street, all the time, then it would be more like 70 percent. Does going into a Best Buy make you more male? Of course not.
Starting point is 01:09:43 It's just a model, guys. It's not causation, it's correlation. Yeah, yeah, exactly. But then we were able to use the output of this model as an input to the attribution model. We didn't just say, which gender are you? We actually used the percentage output from the model and plugged it into the new model directly. So we had these layers.
Starting point is 01:10:07 And the age and gender model was a little more complicated. I don't think it was deep learning, but it was gradient-boosted trees, so something a lot more nonlinear than just a logistic regression. Yeah, totally makes sense. I mean, from what I've seen, it's always this struggle. You have the super complicated model that squeezes out the extra percent, but no one can interpret it, and so you build all these simple models. But then, if there's any skew or bias at all, you're kind of informing your product managers based on one model,
Starting point is 01:10:45 but then your actual system is using a different one, and that's always difficult. Right. Sometimes it sort of evens itself out. So, for example, with the gender model, we're using the output of that to predict whether you're going to visit a place. And so the visit prediction algorithm is just using this input. It doesn't know that it's a gender prediction; it just knows that it's a feature, and so it can use it however it wants. And even if the gender prediction is off, the layer on top of it,
Starting point is 01:11:18 the visit prediction, kind of takes what it can from that. And so we don't have to worry about whether it's entirely accurate at predicting gender or not. Yeah, totally. And if you have the uncertainty of the gender, that could also go into the model. So it could say, if the gender isn't certain, then it doesn't matter what the number is, we'll just throw it out or something. Yeah, that's exactly what we did. Or in our case, if it was like 50/50... there are a number of things you could do. You could just input 0.5 as a real number, and that's the feature. Or you could say, whatever the probability is, sample from that and give me the sampled value, either one or zero, and do that a few times. There are a lot of things you could do.
Starting point is 01:11:59 So what are you doing now? You've worked on so many different parts of Foursquare; I mean, we've talked about several of them just in this show alone. And now you're at the Innovation Lab. What do you do there? Right. So every once in a while I get to do a statistical model, but right now I'm really just working with our founder, Dennis Crowley, and we're trying to use Foursquare's core tech to put out new apps that sort of inspire, that show off the company's technology. And so it's a nice change of pace to be able to essentially work on the fun stuff. We had a really great product ready to go called Marsbot Audio, where you would walk around the city, and one of the things that Foursquare can do is it can tell
Starting point is 01:12:56 whether you're walking past a store, like when you walk right past a place. And so we can trigger an audio file to play when you walk past something. So we have sound effects, we have text-to-speech, and you can even upload your own audio. And then whenever someone with the app walks by the place you set it at, they'll hear it. And so that was really cool. It was ready to go for March 11th, and we were like, are we going to be finished?
Starting point is 01:13:19 Are we going to be finished? And on March 10th, we were like, it looks like we're going to finish for tomorrow. And then the country shut down. So we kind of pushed that back six months or so. I really, really hope to get this out, though, as soon as we can, because the tech is ready and it's really cool. That's awesome. One of the things I've been thinking a lot about, and I think what you're working on falls right into this, is how to sort of reboot people's habits, right? So imagine someone is in the habit of going to the movie theater once a week. There has to be some way to reboot that habit when things open back up. And I think something like this could be pretty cool, where it could kind of encourage that, almost like a
Starting point is 01:14:07 fitness tracker, but for fun, you know, just to help people kind of recover their social life. I know. It's been a little... I'm not happy with the way the industry has gone. With a lot of these consumer apps... one of the things that attracted me to Foursquare was how positive it was as a social network and as a tool to be present and do things in the real world. Foursquare is not one of the big tech companies right now, and I feel like a lot of consumer technology
Starting point is 01:14:42 has gone in a darker direction that maybe we don't like. And now, over the next three months, I feel like I want people to get out again. I want people to have a good life again. And I feel like people are saying, oh, no, no, that's evil. We have to work on contact tracing apps. We have to work on apps to tell you which friends are out to kill you. And I'm not too happy with that. I feel like my thinking is against the grain here. Well, I think it's a
Starting point is 01:15:15 great thing for an innovation lab to be working on, because it's sort of the next frontier, right? I mean, getting through COVID is frontier one, but trying to rebuild the social infrastructure, that's going to have to happen, and so thinking about that now is really good. Yeah. So over the last few months, we've been working more on data sets for the coronavirus, essentially. Not for the virus itself, but to see which places are coming online and offline at different times. There's a Foursquare recovery index at visitdata.org.
Starting point is 01:15:54 But, yeah, pretty soon we'll return to these projects, I hope. Cool. And so, working at Foursquare: we have a lot of folks who are, let's say, in college right now. They're looking for internships, they're looking for jobs, but even more than that, they want to know what it's like to work at a company like Foursquare. Could you walk us through what a day in your life is like? Yeah, well, one of the things that I've learned over the last 10 years is don't get too comfortable, because once you are, things are going to shift under your feet pretty quickly. So whatever a day in my life is now, well,
Starting point is 01:16:36 it certainly isn't the same as a day in my life four months ago, back in February, and it certainly isn't what it was two years ago. I'd say the best times at Foursquare were when we were working on these labs projects and it's just three or four people talking every day, working on the whiteboard, trying something out, making progress, building something into the app, then going out and trying to use it, then returning and trying it again. With attribution, the things that I enjoyed were actually coming up with these models
Starting point is 01:17:19 and selling these models to the company, and even going on some sales calls and trying to explain how they work and why certain ads are seeing lifts and certain ones aren't. There were a lot of issues, you know. Sometimes there are issues on those teams when certain companies want custom things done that we as engineers didn't really think made a lot of sense, and you're kind of pressured by the sales team into that. That stuff happens as well. Sometimes you run into cases where you're excited about a product, but the company kind of reaches a dead end or deprioritizes your product. And even for someone who's been in the industry for a long time, but also for someone new, that's always a very disappointing part of working in this field.
Starting point is 01:18:10 I don't know if you'd agree. Yeah, totally. That is the number one complaint. I'm on a research team, like an applied research team, and the number one source of frustration is when we have something where all of the math looks great and it's headed in all the right directions, but the product changes and the
Starting point is 01:18:31 demand disappears. Yeah, sometimes that happens. Sometimes it is very valuable, but the leadership in the company either doesn't see it or is focused elsewhere. But, you know, there's a plus side to that, too, which is when you do ship something and get it out, and either it's on the back-end, business-to-business side and bringing a lot of value, or it's on the consumer side and people are loving it and telling you about it. There's nothing better than that. Yeah, totally agree. So is Foursquare hiring? Like, full-time? Are they doing internships, with the whole COVID thing? Are they going to do fall internships? Things are a little bit up in the air at Foursquare, because, again, we bought Placed last year, and this year we bought, or merged with, another company called Factual.
Starting point is 01:19:29 So there's a lot of integrating these companies, a lot of moving pieces at Foursquare right now. So I don't exactly know what the internship situation at Foursquare is like, but foursquare.com slash jobs, you can go to. If I were looking for a job, I'd also look at Union Square Ventures, one of the VC companies that funds Foursquare. They
Starting point is 01:19:53 have a job board for all of their companies, and they invest in a lot of cool companies, so I often go to their job board. Sometimes I first show people, this is the Foursquare job board, let's see if your job is here. If it's not there, then I jump to the USV job board, because there are a lot of cool companies on there as well that are a similar size, with a similar philosophy.
Starting point is 01:20:19 So is Foursquare on this remote work bandwagon, or are most of the folks in New York? We are remote right now, and I don't know where we're going. Yeah, that's true. Personally, I really hope to get back into the office. I just feel like some of the creative work that we do has to be face to face. I don't need the open office with 300 people, but having a small group of three to seven to ten people get together and solve problems is way more efficient than everyone in their apartment with delivery people coming in, or their kids screaming at them, or their dog, or
Starting point is 01:21:00 whatever is happening in the background. In my opinion, it's not working as well. Or at least, in order for it to work as well, we'd have to... like, none of us have our apartments, particularly in New York City, optimized for this. So I really hope to get back. I don't know what the situation is going to be. There are people who are saying, let's never go back, and I don't know who's going to win. Yeah, honestly, I don't think anyone does. I mean, honestly, when Jack Dorsey from Twitter said no one has to come back, that was
Starting point is 01:21:36 absolutely shocking to the entire industry. I mean, that was seismic. Zillow's stock price probably tanked when he tweeted that, right? It's just unbelievable. And I totally agree with you that, at least for now, there hasn't been the right mechanism. For a while, actually, and this is way before COVID, when most of us were in the office, we had another team that's in Seattle, and what we did is we set up a television, and they did the same, where when you look at the television, you see their desks. It's kind of weird, you know, literally like some kind of portal.
Starting point is 01:22:20 I don't know if we need to see into everybody's homes either. I mean, there are a lot of personal things in people's homes, and I don't see how people don't recognize that. Yeah, exactly. I think there needs to be a new modality. Like, one way to do it is maybe we have some on-the-clock time where everyone's there, and that is a time where the mic is always on or something. I honestly don't know. It's totally uncharted territory, but we'll have to see how that all goes.
Starting point is 01:22:59 But, yeah, I think if folks out there listening are interested, it's foursquare.com slash jobs, or what was the other one? Union... union something? Oh, usv.com slash jobs, I think that's what it is; let me check. And so, yeah, we'll put it in the show notes and we'll correct it either way. Cool. This was awesome. Oh, we should talk about your podcast. So your podcast is called The Local Maximum, which is awesome. Every week we reach a new local maximum, hopefully higher than the previous local maximum, but not always. Yeah, I think the name is great. Hopefully you have some kind of mini-batch thing going on where you can eventually reach a global maximum and you don't just get stuck. That will be the last... if there is a last episode, it will be called The Global Maximum.
Starting point is 01:23:55 Yeah, that's right. Exactly. Very cool. And so what do you talk about there? So I generally talk about things that are interesting to me. I'd say about 50 percent of my episodes are guest interviews, and a lot of the guests are technology-related, but I do branch out a little bit. I branch out to professors, probability people and statisticians, of course, and then pull more on the data engineering side. But then I've also talked to historians and other podcasters and comedians and things like that. So I am kind of all over the place there. One of the things that
Starting point is 01:24:33 I like to do is take the concepts in machine learning and Bayesian analysis and apply them to either the news or to your everyday life. And by the way, if you've made it this far and you're interested in Bayesian inference, and you kind of want the 101, the more basic episodes, I have a lot of really good ones for you to listen to. I think in the hundreds, I've got... let me check my archive for a second,
Starting point is 01:25:04 but I know in episode 105 I talked to a mathematician, Sophie Carr, who talks a lot about Bayes' rule. And then I had a few episodes after that about, like, what is probability? So we do go into the philosophical side of things. But also, I try to make it relevant to someone who is not a data scientist and not in tech. Like, I can explain what overfitting and underfitting are, and then I ask, okay, who in your life do you know who is always overfitting? And people tend to have an answer for that. So everyone's got that one uncle, right?
Starting point is 01:25:39 Yeah, exactly. And well, I think the example we have is often, like, toddlers can overfit. Not always, yeah. And sometimes older people tend to underfit. But it's interesting to just have these prompts and come up with these examples. Or sometimes I just take a bunch of news stories that day and try to distill them down into a theme, like, what's Occam's razor, or what is expected value, or something like that.
Starting point is 01:26:23 time. And so what inspired you to really start it and what kind of keeps you motivated? Well, I, I was trying to do, well, I had a, um, uh, radio show in college. This was back in like 2004 to 2006. And that was like the best part of my week was when I was doing that radio show. So, um, I knew I would enjoy it. I was trying to put more content out there. And at the same time, you know, I, you know, I had spent a summer kind of at NYU Future Lab talking to entrepreneurs. And I was like, well, you know what?
Starting point is 01:26:56 I think I would enjoy this. No matter what project I do, I'm going to want to have an audience to discuss it with, and kind of a forum to learn about new concepts and learn about issues and talk to people who I'm interested in, like authors. And so I tried it. I did a 10-episode challenge: let's put on 10 episodes and see how it goes. And after that, I just kept going. It was a lot of fun. Some weeks are easy; some weeks it's a little bit challenging, when I have to do a solo show. The solo shows always take the most research and thought, even though they should be simple.
Starting point is 01:27:37 Right, I could just edit whatever I want; no one will ever know. But I feel like the solo shows are the hardest for me, because when I'm interviewing someone, at least I can ask for their content, their information. But one week seems to be the sweet spot: if it were every two weeks and I had something to say, it would take too long for me to get the time to say it, but if it were more often than once a week, it would just be completely overwhelming. Yeah, totally makes sense. I think Patrick and I originally were doing it biweekly, and we switched to monthly. But I totally agree that if you find a piece of news early in the month, sometimes you have to kind of throw it out, because two weeks go by and things become irrelevant really quickly. I mean, especially in 2020, where your entire lifestyle changes almost week to week. Yeah. I mean, am I going to go back to some of these episodes and relive it?
Starting point is 01:28:40 It's like, do I want to do that? I don't know. But no, it did generate a lot of interesting episodes. Like, I did one episode, 115, all about the coronavirus models: what was right, what was wrong, what are they actually trying to do? And I talked to my friend who works in a hospital recently about what his experiences have been like over the last three weeks. So we have been talking a lot about what's going on. Some episodes I feel like I just want to talk about a concept and ignore what's going on in the outside world, but I've completely thrown that away, I think.
Starting point is 01:29:20 Yep. Yeah, it's interesting. We have this sort of bipolar, or bimodal, response to current events. There are some folks who say, I just want to know about the topic, and there are other folks who say, I'm so glad you add color to the show and it's not just about tech all the time. And so I think the compromise we finally struck was to put in the timestamp of when the show topic starts. That seemed to satisfy most people. But I totally agree with you that the idea of just ignoring the outside world, you know, like the entire world is burning down and you say, okay, today we're going to talk about the beta distribution, is
Starting point is 01:30:00 kind of odd. But yeah, I know we need to wrap it up, but I feel like the purpose of the podcast is not to be a course that you take that's going to be evergreen. It's more like, let's sit back, let's have a conversation about issues that are important to us, let's talk about concepts that we know, and then we'll get interested in things, we'll discover things, and if we want to learn more, maybe we'll look at a more formal course or something like that. Yeah, totally. So The Local Maximum is the podcast. You're probably on Google Podcasts and Stitcher and all these things? Apple, Spotify. Great. And if you don't have any of those, or if you want more information, you can go to localmaxradio.com. And Max, it was really awesome to have you. I just realized your name is Max
Starting point is 01:30:59 and it's Local Max Radio. Is that on purpose, or is that...? Well, yes, I realize it's a triple entendre, right? Yeah, that's right. So there's the local maximum, which is both an optimization concept, a machine learning concept, and a design concept, too. So that's one meaning. There's also my name in it. And there's also the local maximum as in, well, location data; I'm all about that, too. Two meanings is one thing, but three meanings, that's when you know.
Starting point is 01:31:32 That's very rare, three meanings. Yeah, you hit the jackpot. It's very clever. Cool. This was awesome. I'm definitely going to listen to some episodes. Actually, right when we get off this, I'm going to check it out, because this is super exciting. And I really appreciate you coming on. Folks are just dying,
Starting point is 01:31:51 chomping at the bit, to know about anything AI and machine learning. And I feel like we covered a lot of really important ground here that any engineer can apply, even in their day-to-day work, which is ultimately, I think, really useful and has high impact. Cool. It's great to be on the show and to have this really fascinating discussion with you. We covered a lot of ground, things that I have to look more into now. All right. Cool. Awesome. It was really great talking to you. And for folks out there, thank you for supporting us on Patreon. Obviously, tough times for a lot of folks out there. We don't demand
Starting point is 01:32:25 anything; all the content is totally free. But we do really appreciate all the donations. That goes toward our equipment, and we have some hosting costs and all of that. So thank you again for your support. Thank you for your emails; a lot of really good ideas keep coming in. I don't think we'll ever empty the FIFO queue, but our hope is to go through it and try to answer as many questions as you folks have. And we'll see you all next month. The intro music is Axo by Binärpilot. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and me and share alike in kind.
