Programming Throwdown - Machine Learning Embeddings with Edo Liberty

Episode Date: September 27, 2021

00:00:24 Introduction
00:02:19 Edo's Background
00:08:20 What are Embeddings?
00:14:00 Self-Organizing Maps & how humans store data
00:22:27 The lifecycle of a machine learning system
00:34:40 The weirdness of high-dimensional spaces
00:42:20 How to manage a vector database
00:47:01 Pinecone the company
★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Hey everybody, how's it going? I know a lot of folks out there ask about AI and machine learning. Patrick's done some machine learning stuff in the health space. I do a lot of machine learning stuff, or at least I pretend to in my day job. And so a lot of people write to us, message us on Twitter, asking questions along these lines. And so it's always a real pleasure where we can talk about an area that we really love. And this is right in the heart of it. We're gonna be talking a lot about machine learning embeddings, which is a term you might not have heard before, but you're going to really enjoy it. And it's something that is extremely, extremely useful.
Starting point is 00:00:57 I'm extremely lucky that we have Edo Liberty here, who's an expert in this area. He's the CEO of Pinecone, and he's here to talk to us about embeddings and kind of riff on that. So thanks so much for coming on the show, Edo. Yeah, thanks for inviting me, guys. Cool. So we've been asking a lot of folks this lately: how has COVID changed your business? Are you back in the office yet, or what's going on with that? Half are and half aren't. The California folks are still at home. New York folks are back in the office.
Starting point is 00:01:28 Israel folks are back in the office. Oh, I see. Is that because of local laws or just how people felt in those areas? It's a bit of both. A bit of both. I'm a big believer in people hanging out together and seeing each other and having lunch together and getting to know each other a little bit better than on Zoom. So I'm pushing everybody who wants to be together to be
Starting point is 00:01:52 together. Yeah. Yeah. Totally agree. Totally agree. I wonder if we'll end up with, remember WeWork? I think is WeWork still a thing? It probably is still a thing, even though they had all sorts of issues as a company, but that might be the kind of thing where physically you're next to a bunch of people that you can socialize with, but you actually all work for different companies or something like that. Who knows what the future is going to be? Yeah, 100%. Cool. So why don't you tell us your story and how you got into this area and how embeddings became important to what you do and kind of where that all started from? Yeah. So my journey into becoming a CEO has nothing to do with business. I did my PhD in computer science and my postdoc in applied math, working mostly on
Starting point is 00:02:38 CS theory and algorithms for machine learning and big data. What I ended up doing mostly was high dimensional geometry. So numerical linear algebra and clustering and dimension reduction and stuff like that. And a big part of that was nearest neighbor search and searching through high dimensional vectors, which we will get to in the end or in the middle of this discussion. And then I opened a company in 2009 that did the real-time video search. And real-time video search kind of has to be based on this vector search capability, which we'll talk about later.
Starting point is 00:03:16 And again, so I started investing in that. So can you tell me real quick, this always fascinates me: how did you make that leap to being an entrepreneur? Right before, let's say 2008, what were you doing? And then how were you able to quit that job and say, okay, I'm going to start this real-time video understanding company? What made you take that leap of faith? So the answer is really two things. One of them is that I've just always been very entrepreneurial by nature. I was importing balance skateboards to Israel when I was in high school; I was always doing stuff like that, right? So part of it was always being a little bit entrepreneurial. But the other one is just dumb luck and happenstance. I was in New York, and my cousin, of all people, said, oh, I'm going to start this company and I'm going to hire, whatever, like 10 engineers to do XYZ. And I'm like, oh, you don't need 10 engineers for this. We can whip it up in like a month, and
Starting point is 00:04:27 you can go raise money on that. And he didn't believe me, so we spent a month and we made something work, and, you know, we found ourselves with some money and a company, so, okay, I guess we have to run a company now. Wow. So you took a month off, like a month. I wouldn't call it a vacation. It's probably the busiest month of your life, but you took that month off, quote unquote, and then you went to investors and said, here's our idea. I was a postdoc at the time, so the work was not incredibly demanding, definitely not time critical, so I could kind of disappear for a month. It was actually great. I loved it. I actually still remember it. I would wake up at like 8am, I would make myself coffee, I would sit down and code until like midnight, and go to sleep. I did that for a full month, and it was fantastic. It's so addictive. It's like this dopamine rush, you know? Exactly. I didn't talk to anyone. I didn't need anything. I would just drink coffee and Coke for like a month straight. It was fantastic. I loved it. Not a single meeting. Man, it sounds amazing. Actually, you bring back really positive memories. I mean, maybe we should do that again every now and then. Highly recommend. Yeah, wow, that's an awesome story. A month's worth of flow.
Starting point is 00:05:47 It was fantastic. But anyway, the cool part about it is that we ended up building a really cool solution, and we ended up selling the company, and I moved to Yahoo, where I was a director of science. So did you sell the company to Yahoo, or is that how it worked out? No, no. We actually sold it to Vizio. Okay, got it. The TV company.
Starting point is 00:06:08 And then I moved to Yahoo, where I spent a lot of time as the director of machine learning. So I was building machine learning platforms and solutions and so on, working on mail and ads and spam and feed ranking and so on. And then I went to AWS and spent about two and a half, three years on the AI offerings from AWS: SageMaker, and, you know, Rekognition for images and video, the language services, all those services. I didn't own all of them, but I was definitely a part of building that org and a big part of the science of a lot of these services. So my journey has been a little bit all over the map, which is highly recommended. It seems chaotic in hindsight,
Starting point is 00:06:56 but kind of fun nonetheless. No, I think it's amazing. I think it's great. I think that, you know, you can see the progression. I mean, it's, it might be, might seem chaotic, you know, geometrically, but you can see kind of a, like a nice trajectory. Yeah. Yeah. Very cool. And so, and so you were at AWS and then from AWS, is that how you were inspired to go to Pinecone or was there a spot in between? No, no, from AWS I opened Pinecone. So building, it was kind of seeing how the sausage is made, like building managed services and databases. That's where I really got into like databases and managed services and cloud offerings and so on,
Starting point is 00:07:39 kind of really diving in deep into that. And it suddenly dawned on me that I have solved exactly the same problem like ten times in my career, and I always built it from scratch, and it was always hard and always insufficient in the end. Yeah. So I figured we should do it once and for all, do it right, and give it to everyone. Yeah, that's awesome. Very, very cool. Fascinating. I think it's an amazing journey. We can now give kind of an overview of what it means to provide a vector database. And so maybe we first need to talk about embeddings and where they come from and why we need them. Yeah. So, one of the interesting things I realized while thinking about opening
Starting point is 00:08:34 Pinecone. I had a lot of conviction that this was needed; I knew embeddings were needed for dealing with unstructured data. When you look at tabular data, it's very clear how to store it, how to slice and dice it, how to work with it, because we have 50 years' worth of databases developed for it. Yeah. I mean, if people have used Excel, you've probably done this thing where you have a bunch of columns with data, and then you want to try to learn a new column.
Starting point is 00:09:08 There's a column that has some missing values, and you want to try to guess it from the other columns. That's pretty easy to do in Excel, and you can kind of imagine what's going on there: there's some model that's just looking at a single row, pulling those values, and then trying to infer that new column. But then there's the question of, how do you do that if it's an image or a video or a website?
Starting point is 00:09:33 You know, now it's totally unclear how to do that. Correct. And this, you know, it extends much more. So it could be like, yeah, those like very complex data types, like images and long text documents and, you know, some like audio recordings and so on. But it could be like travel through a website, like a trajectory through a website or a shopping cart or something else, which is in and of itself like a complex object. It's not like it's not something you can stick in a table in like four columns. Right. And for a long time, it wasn't really clear how to deal with those objects, right? It was like each and every one of them had its own like little
Starting point is 00:10:12 mini discipline in computer science and how to reason about them, how to structure them, how to predict about them and so on. And in the last few years, there has been this movement with deep learning to basically unify this entire paradigm. So basically say, I will use deep learning neural nets, basically, like deep neural nets, right? To take the input in some way and transform it through some layers or through some transformations. And in the end, provide a high dimensional dense vector. So let's say 1,024, an array of 1,024 floating point numbers. But I will train the network such that that vector is somehow semantically representative of the object.
Starting point is 00:11:06 For example, I would train it such that if two documents mean the same thing, they talk about the same topic, they're similar in some sense, then their resulting vectors would be very correlated. Yeah. And then the magic is knowing what that similarity is, right? Or how to measure that. Right. And so there's a whole body of literature and a good amount of technical
Starting point is 00:11:39 know-how of how do you take data and produce these embeddings such as they are good. But the fascinating thing is that it's, while it isn't always the perfect solution, it's very often great. And so you can transform images and text and audio and like shopping habits and like your video preferences and pretty much you name it and convert it into this dense vector. And now you can, instead of dealing with these complex types,
Starting point is 00:12:16 you can now work with this embedding. You actually embed: you put an item in high-dimensional space, you assign a vector to it. Think about a vector as a point in a high-dimensional space; that's why it's called an embedding. And so now, instead of talking about documents and semantic similarity, you talk about vectors and angles and distances, and it becomes a lot more concrete, a lot more actionable, something you can reason about and build code and systems for. It's not an abstract thought anymore.
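To make the vectors-angles-and-distances framing concrete, here is a minimal sketch (not from the episode) of comparing two embeddings with cosine similarity; the vectors below are random stand-ins for what a real embedding model would produce.

```python
import numpy as np

# Random stand-ins for embeddings; a real encoder would produce these vectors.
doc_a = np.random.randn(1024)
doc_b = doc_a + 0.1 * np.random.randn(1024)   # a "semantically similar" document
doc_c = np.random.randn(1024)                 # an unrelated document

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(doc_a, doc_b))  # close to 1.0
print(cosine_similarity(doc_a, doc_c))  # close to 0.0 in high dimensions
```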
Starting point is 00:12:53 Yeah. I mean, people may have seen this visualization where there's a bunch of points and they all have kind of springs in between them, and then you run this through some physics model and they all just kind of separate and end up in this spiderweb kind of shape, right? That's because all the points are kind of repulsing each other, and even though it's only a local repulsion, it creates this really beautiful spiderweb structure globally. You see something similar with birds, right? Birds have some very simple rules in their brain around not getting too close
Starting point is 00:13:31 to other birds and staying in a certain position relative to the bird next to them, and that ends up with this beautiful V shape or swarm or things like that. And this is similar: we're doing that, but some of the points are actually getting pulled together and some are getting pushed apart, depending on things we know about them. Exactly. So there were works from, I think, the early 2000s called self-organizing maps. Actually, you know what, I think the idea itself is even older. There were old works on MDS, stuff like that, Isomap, and there was a flurry of works around it. So you would take, say, all the people, a million people, and just look at their social network graph, and try to create these virtual springs that pull people who know each other close together, so we try to make them physically adjacent to each other. And if you try to do that in two dimensions, what you get is the map of the US, if you do it with US folks. Maybe now with COVID and social networks that doesn't work anymore, but back in the day, it did. Yeah.
Starting point is 00:14:42 Or if you imagine doing a two-dimensional embedding on interests, you might have a little island over here of gamers and a little island of skateboarders. And there might be somebody who's interested in both, so they're going to be kind of on this inflection point, on the saddle point in between those two clusters. But you'll still have those large clusters. Or maybe you'll have another cluster of gamer skateboarders, right,
Starting point is 00:15:08 if there are enough of them. And that cluster will start having a defined border and pushing the other clusters away. And so you end up with these really beautiful behaviors. And when you're done, you have something really interesting: you can draw a little circle on that map and say, everyone here really likes skateboarding. And once you start introducing a whole bunch of other topics and you get to more than two
Starting point is 00:15:30 dimensions, you can even guess at things that people like that they didn't even know they liked. A hundred percent. And I think in two dimensions, you get pretty limited pretty fast. Two-dimensional space is very limited, but when you live in a thousand-dimensional space, you get so much more representational power, because you can be a skateboarder but at the same time be a dad, and at the same time be really into, whatever, anime, and also be vegetarian. All these things are facets of your behavior and things that you care about, and they're not mutually exclusive; they're just a part of you. So you're not a skateboarder or a vegetarian, you're both. Yeah. So that's how,
Starting point is 00:16:15 you know, these high dimensional dense vectors can actually like represent something a lot more deep, a lot more rich, a lot more actionable than just like, you know, this one feature or another. Yeah, it's actually, it's that, it's that curse of dimensionality, but actually working in your favor where we're going from two to four dimensions gives you so much more expressibility. And then going to a thousand dimensions gives you, you know, exponentially more expressibility. And it's just amazing what you can represent. I mean, we've seen just unbelievable things with GPT-3 where people have auto-generated websites.
Starting point is 00:16:50 There's this thing Codex where it auto-generates code. And it's just amazing. And it's all using embeddings. Yeah. And the amazing thing is like, that's how our brain works too. Yeah. So if you think about everything that you remember seeing,
Starting point is 00:17:07 everything that you've ever seen, the parts of the cortex, pretty much all the higher functions in your brain that deal with visual imagery: the input that they get is not the light intensity on your retina, which you might think is highly related to the actual image.
Starting point is 00:17:34 Right. Or what we think of as an image, as an RGB raster. What your brain gets is the output of the visual cortex, which is an activation of a few million neurons, and that's it. And those activations have very little correlation to the actual RGB values in the image. In fact, we know for a fact they have almost no correlation, because when you look at your partner at midnight inside the house, or maybe at high noon outside, in terms of colors it's completely different. It's got nothing to do with each other, right? But to you, it looks exactly the same. The brain just completely changes all the values to make the embedding exactly the same. So you
Starting point is 00:18:26 see the same thing. Yeah. I mean, a good example of this is to try to guess how thick an interior wall is. Now, if you're a dad and you're constantly in the attic fixing stuff, that's different, that's cheating. But you might look at a wall and say, oh, that's maybe half a foot thick, and then when you actually look at a map, it's like, oh no, it's four feet or something. So you'll be off by a ton. And yet you're able to navigate around your house trivially. You don't have a map that is in any way accurate, but you have this really deep, semantic understanding of the different rooms and where to go. And so,
Starting point is 00:19:11 my older son sleepwalks, and somehow he's able to sleepwalk down the stairs and into the fridge and get a stick of cheese. I don't know how he does it, but he's able to do that, yet that map isn't there. So it's all just these super deep embeddings, and because everything gets collapsed down to that, in the case of the human brain, you know, million-dimensional embedding, it's in a sense kind of easy to store, right? Because you're using the same million numbers to hold everything you see, and, you know, you might have to integrate over them for video, and audio might be a different thing, but the data structure, if you will, is the same for everything.
Starting point is 00:19:51 So that actually, I think you nailed it. I think this is like, it brings a very interesting topic, which is how do you even remember? How do you get, like when your son sleepwalks, so when he looks at a stick of cheese and he knows it's a stick of cheese, or you look at your loved one and they're whatever. A stick of cheese.
Starting point is 00:20:11 How do you know that's a stick of cheese? No, how do you know it's the same thing? Yeah. Somewhere in your brain, you need to be able to go and match what you have against a bank of gobs of these embeddings and say, oh, this isn't a truck, this isn't a tree, this isn't, whatever, my grandma, and it isn't a road. It's a stick of cheese. Right. Yeah. And even, you know, you can see really abstract, post-modern versions of chairs and still know right away it's a chair. Yeah. Right. And so how does that happen? How do you do that? And if you think about it, computers need to do the same thing. If I generate an embedding for a sentence, right, so, you know, you might think about, okay, I'm, whatever... Like, I used to work on spam detection at Yahoo, right? So you see an email that looks like some Nigerian prince is trying to convince you to transfer money somewhere.
Starting point is 00:21:17 You're like, okay, I know this. I know what it is, right? But it's not exactly the same email. Like these scams look slightly different for each other. But any rational person would read this like, oh, I remember this thing. It looks like something else. Because in your brain, you do this matching very easily. But computers don't know how to do this. So for computers, you would pass a different email that maybe semantically says the same thing, but you would get a slightly different vector
Starting point is 00:21:46 coming out of it. And unlike structured data, when you can do a hash table lookup and just fetch exactly the same key and then get the same value, you don't have that anymore. You just have a vector that's different than anything else you've ever seen before.
Starting point is 00:22:01 It might be close to other things you've seen before or highly correlated with a small set, but it's not an exact match. And so if you went ahead and tried to code this thing and try to say, okay, now I have a billion, 1,000 dimensional vectors and give me the 100 that correlate the most with it, right? Except for going one by one, computing the correlation and just keeping the top 10, it's not clear what to do. And so now you get into this like really, really rich and interesting like topic in computer science that has to do with like vector indexes and retrieval. And that's, you know, that's something that like gets really deep, really fast. And again, I'm happy to dive as deep as you want. I'm not sure how much you want to, how much you want to scratch the surface.
Starting point is 00:22:52 Yeah. Well, let's go through sort of the life cycle of a system that uses vectors, right? So let's take a recommender system, let's say it's an image recommender system. So we have all of these images and whether a person liked them, and if a person liked two images, that's going to pull those embeddings together. And if a person liked one image but not the other, it's going to pull them apart, right? And there's more complexity around random negative mining and all of that; we'll just leave all that alone and put a bookmark in it. But you can imagine, in the beginning, it's all random. This is one thing that actually surprised a lot of people: in the beginning, all the embeddings are totally random. So, you know, any particular picture is just somewhere floating in that thousand-dimensional space, and every picture is going to have some other spot. But then, through this process of pulling and pushing pairs of points
Starting point is 00:24:00 together or apart, you know, eventually it'll start to coalesce into that structure that we really like. And then to your point, if we know somebody, the last picture they liked was this one, then we can ask the question, what should we show them next? And it's, well, what are the closest things to that one they just liked? So now we have this problem of here's a point
Starting point is 00:24:25 and embedding that we've trained. It could be maybe a week ago, but we have this trained model and we have this point that we know somebody likes. Let's get the points around it. And so, yeah, you're the naive way to do that is to go through every single point and say, how close are you? How close are you? How close are you? And then keep an array of the top 10. But for something like, imagine Amazon. I mean, how many items does Amazon have in their inventory, right? Or how many pictures are on Instagram, right? I mean, just billions. I actually know the answer to that. Oh, really? All right. Let's hear it. If it's not confidential. I think it is.
Starting point is 00:25:03 Okay. Let's say it's more than 100 billion or something, at least for Instagram. There's so many people on there making pictures. And so that whole idea of going one by one, you just can't do it. So that kind of gets to sort of what you've really been focusing on. And so how do people do that? I mean, Amazon works, Instagram, I can press F5 and it comes right up. Every time I hit refresh, it's right there.
Starting point is 00:25:30 I PTR, I pull down in the app and I get new photos. Like how are people able to do that in a reasonable time? So now we basically are discussing this like algorithmic question, right? So let's just kind of make that crisp of what the problem is. You get n vectors in dimension d, so n being large, say billions, d being large-ish but fixed, say 1,000. And now I give you another vector in dimension 1,000.
Starting point is 00:26:01 I tell you: give me all the vectors that correlate the most with it, or let's say are closest in Euclidean distance, so I actually measure the sum of squared discrepancies on the coordinates and take the square root of that. So like you said, yes, you can scan everything. Clearly, that's not feasible.
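For reference, the scan-everything version being ruled out here really is just a few lines; a sketch with NumPy (sizes shrunk so it runs on a laptop), and exactly the thing that stops being feasible at billions of vectors:

```python
import numpy as np

def brute_force_knn(corpus, query, k=10):
    """Exact nearest neighbors: compute the distance to every row, keep the k smallest."""
    dists = np.linalg.norm(corpus - query, axis=1)   # O(n * d) work per query
    idx = np.argpartition(dists, k)[:k]              # k closest, in arbitrary order
    return idx[np.argsort(dists[idx])]               # sorted by distance

n, d = 100_000, 256                                  # billions of 1,000-d vectors in real life
corpus = np.random.randn(n, d).astype(np.float32)
query = np.random.randn(d).astype(np.float32)
print(brute_force_knn(corpus, query, k=10))
```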
Starting point is 00:26:19 So now how do you do it? Interestingly enough, you can actually prove that you cannot accelerate this beyond some limit, which is still pretty grim, without incurring some level of error. In some sense, you need to accept the fact that you will have a little bit of approximation. So you might miss something that is relatively close, maybe not almost identical, but kind of on the perimeter of the ball you're searching in. You might miss that, or you might accidentally include something
Starting point is 00:26:55 that's slightly outside of it. So you have to accept a little bit of a fudge factor, right? But once you do, you can accelerate things significantly. And in fact, all the algorithms that do fast nearest neighbor search do approximate nearest neighbor search. Then approximate nearest neighbor search means look for the nearest neighbors, looks for the vectors that are closest approximately. Approximately means exactly what I said. There's like this little fudge factor around the edges. Now, I'll just kind of give a very quick and
Starting point is 00:27:27 maybe almost like a sketch of kind of the main ideas in big algorithms. So, first of all, there is clustering. Clustering is an obvious idea. I'll just start by saying I'll have like a thousand clusters. Clusters are like a
Starting point is 00:27:43 large collection of points that are all relatively close to one another. And instead of remembering all of them, I'll just kind of remember the center. And so when seeing something, instead of thinking whether it's a mouse or a stick of cheese, I can first say, OK, is this an animal or like an inanimate object, right? If it's an animal, good. I've already reduced the search space by a big margin, right? So now I need
Starting point is 00:28:12 to, and I can iterate on that. So you can deconstruct the world. That makes sense, like hierarchical. Like for Amazon, if you search for batteries, there's probably something in Amazon that says, you know, don't bother with pet food, just forget that whole category, and then they can do something afterwards, right? But even if you just take images, for example, even with images you can already take their embeddings and, yeah, just cluster things. Oh, cluster the embedding space.
Starting point is 00:28:48 You don't have to manually do anything. Yeah, you cluster the embedding space. So now you take those points in high dimension, just say, oh, those set of points are so close together that instead of measuring all the distance, I can just measure the distance of the center. And if it's close enough, then I will maybe suspect that some of them should be close. And then I should check that. So how you cluster exactly and how you then search within those clusters, I still didn't talk about that.
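A rough sketch of that cluster-first idea, which is roughly what vector-search libraries call an IVF (inverted file) index; it assumes scikit-learn for the k-means step, and searching only the few nearest clusters is exactly where the approximation comes in:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
corpus = rng.standard_normal((20_000, 128)).astype(np.float32)

# Cluster the embedding space once, up front.
kmeans = KMeans(n_clusters=100, n_init=1, random_state=0).fit(corpus)
centers = kmeans.cluster_centers_
assignments = kmeans.labels_

def ivf_search(query, k=10, n_probe=5):
    # 1) Compare the query to the 100 cluster centers only.
    center_dists = np.linalg.norm(centers - query, axis=1)
    probe = np.argsort(center_dists)[:n_probe]            # a few closest clusters
    # 2) Scan only the points that live in those clusters.
    candidates = np.where(np.isin(assignments, probe))[0]
    dists = np.linalg.norm(corpus[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

query = rng.standard_normal(128).astype(np.float32)
print(ivf_search(query))
```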
Starting point is 00:29:16 But that's one main idea that works very well. And again, there's a whole literature on how to do that fast and efficiently and accurately and so on. Then within each cluster, so now let's say you've narrowed down your search space from a billion to let's say 20 clusters, each one of them has a million points. Now they're not clusters anymore because they're all sticks of cheese. Maybe different, whatever. Blue cheese. Yeah, like a piece of chalk and a stick of cheese. And I don't know.
Starting point is 00:29:47 So now, like all white cylindrical things, right? Yeah. So now you really have to do something a lot more refined, right? And there you have a whole other set of algorithms, what's called product quantization or other types of quantization. Instead of computing the dot product exactly, so instead of saying I'll actually compute the distance exactly for one candidate and one vector, I will compress it somehow and be able to just
Starting point is 00:30:17 have a very quick, maybe O(1), indication of whether this is a likely match, right? So a real simple way to do this would be just to have buckets, right? You could just uniformly divide the space up and say, if my point is so many buckets away from this point, then don't even bother, just skip that whole bucket or something. Correct. I mean, buckets would actually be probably more akin to the clustering step. Ah, okay. Got it. Which is a good way, by the way: in low dimensions, bucketing, and an algorithm called KD trees, is actually very efficient. If your data is like two or five or ten dimensional, then this bucketing works
Starting point is 00:31:01 fantastically well. In dimensions like 100 to 1,000, it kind of breaks spectacularly because high dimensional spaces are weird. Yeah, that's true. I never thought about that. But yeah, I mean, just to give a bit of context, like for a KD tree, you can only split. And so a KD tree is a binary tree where at each node in the tree,
Starting point is 00:31:20 you split in one dimension, and you say everything that's more than this in that dimension, let's say you split where the fourth dimension is 0.4 or higher, all goes to one part of the tree, and everything else goes to the other part. But because each split is only one dimension,
Starting point is 00:31:37 if you have a thousand dimensions, you can't usually make very good splits. No, I mean, if you split roughly in half, you only have log n splits before you remain with one item in every box. And log n could be like 10, I don't know, like definitely less than 50. And you have 1,000 dimensions.
Starting point is 00:31:57 So that doesn't work. Yeah, so you have product quantization, which is a very, very quick way of iterating through those candidates. And again, it's a very mathematical, very cool field of research; it has a lot of numerical linear algebra in it and high-performance computing tricks. And it's very exciting. In fact, we actually have a paper coming out showing how we beat the state of the art on that, coming up very shortly, probably in a month or so.
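A toy sketch of the product quantization idea mentioned here (again leaning on scikit-learn for the codebook training); production libraries such as FAISS or ScaNN do the same thing with far more engineering:

```python
import numpy as np
from sklearn.cluster import KMeans

d, m, n_centroids = 128, 8, 256     # 8 sub-vectors of 16 dims, one byte of code each
sub = d // m
rng = np.random.default_rng(0)
corpus = rng.standard_normal((10_000, d)).astype(np.float32)

# Train one small codebook per sub-space and encode every vector as m bytes.
codebooks = []
codes = np.empty((len(corpus), m), dtype=np.uint8)
for j in range(m):
    block = corpus[:, j * sub:(j + 1) * sub]
    km = KMeans(n_clusters=n_centroids, n_init=1, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)
    codes[:, j] = km.labels_

def pq_distances(query):
    """Approximate squared distances to every encoded vector via table lookups."""
    total = np.zeros(len(codes), dtype=np.float32)
    for j in range(m):
        q_block = query[j * sub:(j + 1) * sub]
        table = np.sum((codebooks[j] - q_block) ** 2, axis=1)  # 256 entries
        total += table[codes[:, j]]     # one lookup per vector, no full dot product
    return total

query = rng.standard_normal(d).astype(np.float32)
print(np.argsort(pq_distances(query))[:10])
```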
Starting point is 00:32:28 And then there's a whole other type of algorithms that kind of blend the two and create this graph over your points and say, okay, I'll traverse kind of a map in high-dimensional space. I'll have a set of candidates and I'll just look for the neighbor that gets me in the direction that I want to go, and I'll keep going that way. And again, those end up being very efficient sometimes, and sometimes they just completely choke. And there's a lot of science about why and when that happens, when they excel and when they'll just kind of tread water. I can speak about this for many hours. I highly recommend, if you're a computer science major,
Starting point is 00:33:13 if you're a mathematician, if you really care about high dimensional geometry, linear algebra and so on, it's a wide and exciting field where a lot of cool things are happening. It's intimately related to how neural nets are applied and how data is represented. So very cool, very cool field.
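And a toy sketch of the graph-traversal family being described; real systems such as HNSW build and layer the neighbor graph far more cleverly, but the greedy walk at query time looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((5_000, 64)).astype(np.float32)

# Naive k-nearest-neighbor graph built by brute force, just for illustration.
n_links = 16
sq = np.sum(corpus ** 2, axis=1)
pair = sq[:, None] + sq[None, :] - 2.0 * corpus @ corpus.T   # squared distances
graph = np.argsort(pair, axis=1)[:, 1:n_links + 1]           # skip self at column 0

def greedy_search(query, start=0, max_steps=200):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = start
    best = np.linalg.norm(corpus[current] - query)
    for _ in range(max_steps):
        d = np.linalg.norm(corpus[graph[current]] - query, axis=1)
        if d.min() >= best:                  # no neighbor improves: stop here
            break
        current = graph[current][np.argmin(d)]
        best = d.min()
    return current

query = rng.standard_normal(64).astype(np.float32)
print(greedy_search(query))
```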
Starting point is 00:33:32 Yeah, totally. Yeah. Yeah, I think that it makes a ton of sense. So the quantization, you're saying that through that, you don't necessarily need a tree. You're able to get it so fast that once you have that cluster of a million items, you can get the dot product or approximation of the dot product of all million of them really quickly. Correct. Wow. That is so cool. I, yeah, for some reason I, I, uh, I didn't,
Starting point is 00:33:57 I totally didn't know that. I thought people were still using KD trees, but you know, your point is spot on. Um, I guess, I guess maybe I guess maybe just to continue the devil's advocacy, what about ball trees or VP trees or these other things? Is the quantization, just the fact that you can parallelize it maybe on a GPU or something, does that make that more appealing than these other sort of tree ideas? Yeah. So there are two factors that play in. One of them is the hardware. Like you said, some algorithms just lend themselves really well to the kind of hardware that you have. So GPUs and so on. And also CPUs, just a memory hierarchy. I mean, the way it's set up, these efficient scans end up being surprisingly efficient.
Starting point is 00:34:43 That's one. But the second thing is just like the weirdness of high-dimensional space. As humans, we are limited to thinking about three dimensions. We talk about four and a hundred dimensional spaces, but the mental image we have is really three-dimensional. And high-dimensional spaces are weird. I mean, they are just, they behave mathematically very, very, very different than low dimensional spaces. For example, you can ask yourself how many vectors that are exactly orthogonal to each other I can fit in a space of dimension D. Well, they're exactly orthogonal to each other. And so they form like a coordinate system. Right.
Starting point is 00:35:24 And so in dimension D, you can only fit exactly D of those. Yep. What happens if I give you a little bit of flexibility and I say instead of exactly 90 degrees between two of them, you're allowed to have 85 degrees between every pair. How many of those can you put together in dimension D? And so in, or maybe instead of vectors, I think about them as like lines. So they go both ways, right? I only know the answer for two. So in two, you still can do only two.
Starting point is 00:35:57 Yeah, that's right. And so that didn't change. In three dimensions, you, again, can only do three. I guess so, right? Yeah. And so somehow you would think that it would still grow linearly. And maybe at some dimension, it would start growing a little bit more. But for most people, the intuition is that it would stay somehow linear in the dimension. Yeah. The interesting thing is that it grows exponentially in the dimension.
Starting point is 00:36:20 So in dimension 1,000, you can have more vectors. You can have, I actually calculated once. I forget the number, but it was many billions of vectors that are almost orthogonal to one another. Wow, that is a remarkable, wow. That's mind blowing. I want to really put an exclamation point on that. So everything we're doing is an approximation, right? Everything.
Starting point is 00:36:44 I mean, even the fact, the way you got these embeddings was through an approximation. You're using some kind of stochastic descent process. And so the fact that you're not a perfect 90 degrees, you're still effectively orthogonal with that tolerance. And so what you're saying is that there's billions of ways things can be just completely different from each other. Yeah. In fact, I'll give you more than that. In fact, if you just take two random directions in dimension 1000, they will be almost orthogonal. Yeah. Yeah, I believe it. Yeah. Actually, that was a method, right? I mean, early on before deep learning, random projections where you just randomly assign vectors to documents, that was one of the ways that people were doing retrieval.
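That near-orthogonality claim is easy to check empirically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000

# Two random directions in dimension 1,000 are almost orthogonal.
u = rng.standard_normal(d); u /= np.linalg.norm(u)
v = rng.standard_normal(d); v /= np.linalg.norm(v)
cos = float(u @ v)
print(cos)                              # typically a few hundredths
print(np.degrees(np.arccos(cos)))       # an angle within a degree or two of 90

# Even thousands of random directions stay pairwise close to orthogonal.
X = rng.standard_normal((5000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
G = np.abs(X @ X.T)
np.fill_diagonal(G, 0.0)
print(G.max())                          # the worst pair is still far from aligned
```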
Starting point is 00:37:32 Correct. And in fact, my whole PhD was about how to do random projections algorithmically faster. That makes sense. Wow, that is mind-blowing. Oh, I see, and so then that explains the approximation, because I was putting a mental bookmark in that. Because if you do something like KD trees, you can do early stopping: at some point you know that this partition, this entire subtree, is further from the nearest neighbor, you know, even if I was to get the perfect point in this partition, it's still too far, and I don't have to search anymore and I'm done. But because we're doing this clustering, you know, there is a chance that maybe you're in between two clusters
Starting point is 00:38:16 or something like that. It's just that, because the clustering is centered around dense areas, the chance that you're not in one of those dense areas is impossibly small, especially at these high dimensions. Yes. So, I think the last point you made was actually maybe the opposite of what I was saying. I mean, you asked about KD trees and balls and so on, so just maybe to explain to the audience: the idea is to say, okay, I'll have everything in one ball and then I'll cover my space.
Starting point is 00:38:49 I'll just take balls of radius, like, one half. Let's say everything lives in a ball of radius one. I'll cover my data in balls of radius one half, right? So everything lives in there. And then those one-half balls I'll cover with balls of radius a quarter, and so on, right? What I'm saying is that with high-dimensional data, if you have everything in a ball of radius one and you had to cover your data with balls of radius one half, every ball will contain one data point. There would be no tree.
Starting point is 00:39:18 It would just like everything is far away from everything else, you know? And so there is no tree. There is no structure. It's just random chaos. And whether we like it or not, data is sometimes like that. How does clustering not suffer that same problem? It does. And that's why vector indexes are hard to build because in some sense, they are themselves a model of the data. Yeah. I see. So clustering. Oh, I see. I see. So,
Starting point is 00:39:50 so maybe I'll try and get an understanding. So clustering is this coarse thing, but you really only have to do it once, and within a cluster there are still many entries. Versus these tree approaches, where you're kind of doing a bad thing many, many times: as you're traversing the tree, you're making log N mistakes,
Starting point is 00:40:12 you know, until you get to the leaf. Yeah. So, you know, I think, again, it's hard, at least for me, it's very hard to like talk about this abstractly without kind of math and drawings and so on. So, but yeah, it's, I'll just maybe try to wrap it up by saying that it's a fascinating set of problems, which still have a lot of research going into them. And I highly encourage, you know,
Starting point is 00:40:38 CS graduates and EE and anybody who's like into this thing and want to see some really gnarly math and really cool engineering, it's a good place to put some time in. Yeah, yeah, totally. Yeah, I just want to double down on that. I think it's really interesting. It's becoming more and more important every day. And it's the future of information retrieval.
Starting point is 00:41:03 It's the future of how search engines and databases will work. And so if you zoom out from the core index, it's like, think about making the analog of databases. We've just spent 20, 25 minutes just talking about B-trees. Or if it's a search engine, we just spent 25 minutes talking about an inverted index, right? But we haven't talked about the rest of the database around it. How do you do the embeddings? How do you save them? How do you retrieve from them? How do you do the OD? Yeah, the management system part of the DBMS. Exactly. How do you update things in place? How do you read and write concurrently? How do you now take this abstract algorithmic discussion
Starting point is 00:41:48 and make an actual hardened production database out of it, that you can go and do image search on a billion images in production and not worry about whether it will work or not? Yeah, I actually had two questions, which I think will start to get into the management side of it. One was, how do you deal with the fact that when you retrain the model, even on the same data, you know, unless you keep everything deterministic, which is almost impossible, the embeddings are totally different the next time? And so anyone who's using your database and storing the embeddings, for example,
Starting point is 00:42:30 is going to have a real big problem when you retrain, right? So how do you deal with that? So, you know, for the listeners: the company that I founded and am the CEO of is called Pinecone, and it builds a vector database. That's what we do. Okay. The way we deal with this is we give our customers and our users full control over everything. We are not opinionated on what embeddings you have, how you create them,
Starting point is 00:43:04 or when you replace what with what. We actually give you full control over your data and how you retrieve from it and what your vectors are. And so we never actually swap your vectors behind the scenes for you, so that your query looks one way one second and different a second later, and you're sitting there thinking, what the hell just happened? You know, if you train a different model and you create different embeddings, then you create a new vector index in the database, and now you can decide which one of them you want to query.
Starting point is 00:43:38 And maybe you phase one out and maybe you keep both of them or you do whatever you want. And so, you know, as a data infrastructure, we try to be the least opinionated and the most like enabling and just giving all the levers and, you know, buttons possible for our customers. Yeah, that makes sense. Yeah. I think what you're saying is, again, it raises a lot of other very cool problems. If I just tweak some portion of my data,
Starting point is 00:44:11 like, do I have to re-index everything? How does that work? Do I have access to both versions of the index, with some updated and some not? What if I tell you, oh, I just have all the vectors and I just move each one of them by a little bit, so all the data changes, but not by a whole lot. Can I somehow not just throw everything away and do everything from scratch? There's so much to be done. I mean, as a science and as an engineering community,
Starting point is 00:44:43 we're not even remotely close to being able to answer those questions definitively. Yeah, we have something in, so my background is in decision theory, and we have something in decision theory called actor regularized regression, or it could be called critic regularized regression. And what it's basically saying is, let's say it's a robot car. We'll train a model that copies what the robot car is doing right now. So this model, if the robot car maybe is being driven by a person, maybe it's not a robot yet, or who knows what it is, but maybe you have a lot of data of real cars, but we'll train a model that clones that behavior of these people or this past system. And then when we go to train the model that's trying to maximize something, we regularize it, which means we add a penalty the further away it gets from this first model.
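A minimal sketch of that kind of regularization penalty, written here for embeddings, which is the analogy the host draws next; this is purely illustrative, with made-up numbers and weighting, not something the guest or Pinecone prescribes:

```python
import numpy as np

def loss_with_drift_penalty(task_loss, new_emb, old_emb, lam=0.1):
    """Add a penalty for moving far from the previous model's embeddings.

    task_loss: whatever objective the new model is optimizing.
    lam: how strongly to anchor new embeddings to the old ones.
    """
    drift = np.mean(np.sum((new_emb - old_emb) ** 2, axis=1))
    return task_loss + lam * drift

old = np.random.randn(32, 256)                  # embeddings from the previous model
new = old + 0.05 * np.random.randn(32, 256)     # slightly moved embeddings
print(loss_with_drift_penalty(task_loss=1.7, new_emb=new, old_emb=old))
```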
Starting point is 00:45:35 And so I could imagine doing something like that, where if you keep the neural net that generated the old embeddings, then the next time around you can add some penalty so that things won't drift too much. But SGD and these high-dimensional spaces are such that it sounds like it makes sense, and maybe 99% of the time it does make sense, but there's still going to be 1% of the embeddings that are going to go to the moon, you know? Yeah, they're going to change completely. And again, that's fine. What we are trying to achieve is to make that layer as robust and production-ready as possible, so that you as a scientist or an engineer in a big company, if you want to do semantic search over billions of documents or images or what have you,
Starting point is 00:46:30 you get to care about your application and about, you know, how you retrieve and how you show it to your customers and what they care about, what models you're using and not like the entire, you know, everything we just talked about, like everything, like all the algorithms and optimizations and the hardware and the distribution and the systems design. And there's so much that goes into it that frankly, you, you know, you shouldn't care about if that's not your thing. Yeah. Unless you're me, in which case there is your thing.
Starting point is 00:47:00 And it's the only thing you care about. Yeah. So let's dive into that. So there's a bunch of people who are listening to this podcast and thinking this is the coolest thing I've ever seen or heard of in my whole life, right? And so tell us about Pinecone. Are there internships?
Starting point is 00:47:20 Where are the internships located? Are there full-time positions? What's a day in the life of Pinecone? Kind of run us through Pinecone as a company. So Pinecone, first of all, in terms of locations, we have centers in San Francisco, in New York, and in Tel Aviv, Israel. We certainly have open positions all across the board. I'll just kind of say the three, four most critical areas that we invest in.
Starting point is 00:47:54 First of all is the core engine: a lot of high-performance computing, hardware acceleration, all the index stuff. That sits mostly in Israel. Then all the cloud and scaling work: how do you take those indexes and those data structures and make them scale in the cloud, you know, billions of items, and be consistent and persistent and highly available and so on. That sits mostly in New York right now. Another thing that sits mostly in New York right now is the management of it, so everything that looks like security, data governance, metering, logging, all that stuff, right? Everything that makes it a managed service, like resource allocation and so on. Okay. And the last thing that we invest in is the actual machine learning, so the embeddings, the models, the applications of embeddings and vector representations in real life, anywhere from recommendation to text, to images, to personalization, to what have you. So it runs a pretty wide set of disciplines and applications, everywhere from low-level C, C++, and CUDA all the way up to modeling and so on. Of course, we have open positions, both for interns and full-time. Some of them are on our website, pinecone.io. And if you're interested and you're not sure that you fit one of those open recs,
Starting point is 00:49:31 just send us an email. We love hearing from people. Cool. So what's a day like at Pinecone? What's something that makes Pinecone unique as a place to work? Do you have ping pong tournaments, or is there something that just kind of organically grew with Pinecone that has been kind of sticky and interesting? So, COVID has really thrown a wrench into a lot of that, but I will tell you what I feel is most exciting about working at Pinecone. For better or for worse, we've developed this addiction to insanely smart, intelligent people. It makes it very hard to hire, but most of our folks have either PhDs or higher degrees, or come from AWS, Google, Facebook, Amazon, having built huge systems, architected the internals of AWS and S3 and SageMaker and Splunk and Databricks, built unbelievable solutions for, whatever, load optimization. People who just do crazy things, I mean, SIMD and all these things. I've been in and around engineers my entire life, and the sentence, oh, I just made something 3x faster, is said pretty casually. Nice. That's amazing. What happened? Like, oh, I did this and that.
Starting point is 00:51:08 I changed here and I reorganized all of that. It's like, oh, now it's much faster. Oh, shit. I had no idea that was possible. So you work just with people who just kind of mess with your brain. They're so smart and so talented. And it's just fun.
Starting point is 00:51:25 And at the same time, it's just like I said, the problem itself is cool because it runs all the way from data and machine learning to like big gnarly systems designed to low level algorithms and kind of the whole thing
Starting point is 00:51:38 needs to work as one unit. And so you kind of have to be very broad to really get it. And so it's just fun. It's just a fun place to work. That makes sense. And I just wanted to end on this note. Who are the kinds of people who use something like this?
Starting point is 00:51:55 I think some people might look at this and say, well, there's maybe a handful of companies in the world that have a billion images, right? But clearly, even for people who don't have that much data, it's not just about not using a for loop; it's about the true management. I mean, you could use a CSV instead of MySQL if you're not a big company, but it doesn't mean that's a good idea, right? And so give us an idea of who the target audience for this is, how broad that audience is, and what the different verticals are.
Starting point is 00:52:29 Yeah. So it's, it's incredibly broad. I mean, uh, you can, uh, like you said, I mean, the fact that you don't have billions of objects, right. Even if you have 10 million objects, you know, it doesn't mean you have to start building a microservice for it. And the beautiful thing about consumption-based pricing for Pinecone is that you pay for what you use. And so if you have a small use case, it's going to cost you like 50 bucks a month. So why are you even bothering? So there's the whole management aspect of it,
Starting point is 00:53:05 which makes it like you get all the bells and whistles, but you just get the hassle-free experience. And that is already a big enough market. But you get two different kinds of scaling issues that look very different systems-wise. One of them is just companies who have an unbelievable amount of data. So they come to us and say, oh, I have 10 billion items. What the hell do I do now? It doesn't fit anywhere. It's just going to cost me
Starting point is 00:53:36 an arm and a leg to run anything. And companies who have a different scaling issue say, I have 10,000 queries per second, what the hell do I do with this? So now I have this super high availability, low latency system that I have to build, so I have to care about a whole different set of problems. So being able to cater to all three is hard and exciting, and for somebody like me, for whom that's what I do for a living, that's a worthwhile challenge. But, you know, those kind of three buckets, each and every one of them contains thousands of applications and different use cases. And so you don't have to be Pinterest or Facebook or Google to be facing these issues.
Starting point is 00:54:20 Yeah, I mean, one thing you hit on that I didn't think about until you mentioned it was the SLA, you know, the latency. Because I think with PyTorch and TensorFlow and some of these things, you know, you can do vector math without, like, I don't know anything about CUDA or OpenCL, or I've just heard these names, but I'm using PyTorch and it can do the job well. But I think that PyTorch is not designed for low latency. And so if you have a million embeddings in PyTorch, you're stuck. Like there is nothing that you can really do. You have to go outside of PyTorch to do that. Yeah, so you have to use something that's dedicated for it. And you have to optimize all the system around for it. I mean, we have POCs running
Starting point is 00:55:06 with customers who do a lot more than 10,000 queries per second, and they see a 15 millisecond P99. Wow. It's amazing. And to get that, it's the algorithms and so on, but you have to optimize, you know, even the way you deserialize JSONs. I mean, you can't be careless about that. It's stuff like that that you don't necessarily think about. So everything around it needs to be super optimized for this use case, you know.
Starting point is 00:55:39 One last thing, actually. This is just, I don't want to get too much in the weeds here, but, you know, with MySQL, people do an insert statement and it seems kind of natural: you're inserting these tuples, so it kind of makes sense. So in your case, how do people dump, you know, a billion embeddings into your database? I guess the API is still like a REST call with a huge object attached to it or something. I mean, you don't send all a billion in one go. You'd probably batch it and send them in batches of a thousand, or you dump them on disk on S3 or GCS somewhere. Oh, that makes sense. So you have this S3 file that's massive, that has all these embeddings, and you just point Pinecone at it, like, hey, go digest this entire thing. Yeah, that makes sense. Or you can do it on the fly: you have a dynamic API, you just observe and update vectors one by one or in batches. Oh, that makes sense. Like, people are adding new products, and so you're adding, deleting, updating, and so on.
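For a sense of what that looks like on the client side, here is a hypothetical batching sketch; the client object and its upsert method are placeholders for illustration, not Pinecone's documented API, so check the real client library for the actual calls:

```python
import numpy as np

def batched(items, size=1000):
    """Yield fixed-size chunks from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def upsert_all(index, ids, vectors, batch_size=1000):
    """Send (id, vector) pairs to a vector index in batches instead of one giant call."""
    for chunk in batched(zip(ids, (v.tolist() for v in vectors)), batch_size):
        index.upsert(vectors=chunk)   # assumed method on a hypothetical client object

ids = [f"product-{i}" for i in range(10_000)]
vectors = np.random.randn(10_000, 256).astype(np.float32)
# upsert_all(index, ids, vectors)    # `index` would come from whichever client you use
```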
Starting point is 00:56:46 And then boom, there's a new embedding model and everything has to get rewritten. So there's now sort of this other snapshot that's being spun up, and when that's ready, you flip them over. And yeah, I mean, this is stuff that you do not want to do yourself. I've never done it, and it just sounds brutal just talking about it. I mean, it sounds really exciting, but hard to do.
Starting point is 00:57:10 Yeah, exactly. And we're not even scratching the surface, you know. Again, I think anybody who's built a distributed system knows: you think about everything, and then a customer comes to you with a mind bender and you're like, oh, crap. Yeah. They want to do what? So, yeah. Cool. This is really, really awesome. So we have a link to Pinecone in the show notes; people, check that out. Is there a free version for people like high school students or college students, or how can they get started with it? So, first of all, they can go to our website and start a free trial. The free trial right now is two weeks, so it's not a whole lot of time, but if you're a student and you're working on a cool project and you want that extended, just drop me a note, or whatever, we have a chat. Cool. Drop us a note and say,
Starting point is 00:58:05 hey, I'm a student. I'm not really a customer, but I just really think it's neat. Can I keep using it? Odds are, if you really do something cool, we'll just let you do it. Cool. That is amazing.
Starting point is 00:58:18 We've done that before. So we're all into education and supporting people's learning, and people tinkering have been building exciting demos. Very, very cool. Yeah, if people build anything really cool with Pinecone, let us know. Let Edo know. We'll have his contact, his Twitter, et cetera, and the handle for Pinecone in the show notes. Thanks, everybody, for supporting the show on Patreon. We really appreciate that. And thanks for supporting us on Audible if you do that. And thank you, Edo, so much. This was an amazing episode. There's so much more depth we could go into. We could spend a whole day
Starting point is 00:58:57 talking about embeddings. We didn't even get a chance to talk about the sort of machine learning part and ResNet and all of that. But, you know, that's, that's, we've did, we've done a really good job of covering, you know, the high level and folks can definitely, you know, look up some of the things that we talked about and take it to the
Starting point is 00:59:15 next level. Thank you so much. Thank you. Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and I and share alike in kind.
