Programming Throwdown - Machine Learning Embeddings with Edo Liberty
Episode Date: September 27, 2021
00:00:24 Introduction
00:02:19 Edo's Background
00:08:20 What are Embeddings?
00:14:00 Self-Organizing Maps & how humans store data
00:22:27 The lifecycle of a machine learning system
00:34:40 The weirdness of high-dimensional spaces
00:42:20 How to manage a vector database
00:47:01 Pinecone the company
Transcript
Hey everybody, how's it going?
I know a lot of folks out there
ask about AI and machine learning. Patrick's done some machine learning stuff in the health space.
I do a lot of machine learning stuff, or at least I pretend to in my day job. And so a lot of people
write to us, message us on Twitter, asking questions along these lines. And so it's always
a real pleasure where we can talk about an area that we really love. And this is right in the heart of it. We're gonna
be talking a lot about machine learning embeddings, which is a term you might not have heard before,
but you're going to really enjoy it. And it's something that is extremely, extremely useful.
I'm extremely lucky that we have Edo Liberty here, who's an expert in this area. He's the CEO of Pinecone, and he's here to talk to us about embeddings and kind of riff on that. So thanks so much for coming on the show, Edo.
Yeah, thanks for inviting me, guys.
Cool. So we ask a lot of folks this lately, how has COVID kind of changed
your business? Are you back in the office yet? Or what's going on with that?
Half are and half aren't. So the California folks are still at home.
New York folks are back in the office.
Israel folks are back in the office.
Oh, I see.
Is that because of local laws or just how people felt in those areas?
It's a bit of both.
A bit of both.
I'm a big believer in people hanging out together and seeing each other and having lunch together
and getting to know each
other a little bit better than on Zoom. So I'm pushing everybody who wants to be together to be
together. Yeah. Yeah. Totally agree. Totally agree. I wonder if we'll end up with, remember WeWork?
I think is WeWork still a thing? It probably is still a thing, even though they had all sorts of
issues as a company, but that might be the kind of thing where physically you're next to a bunch of people that you can socialize with,
but you actually all work for different companies or something like that. Who knows what the future
is going to be? Yeah, 100%. Cool. So why don't you tell us your story and how you got into this
area and how embeddings became important to what you do and kind of
where that all started from? Yeah. So my journey into becoming a CEO has nothing to do with
business. I did my PhD in computer science and my postdoc in applied math, working mostly on
CS theory and algorithms for machine learning and big data. What I ended up doing mostly was high dimensional
geometry. So numerical linear algebra and clustering and dimension reduction and stuff like that.
And a big part of that was nearest neighbor search and searching through high dimensional
vectors, which we will get to in the end or in the middle of this discussion. And then I opened a company in 2009
that did real-time video search.
And real-time video search kind of has to be
based on this vector search capability,
which we'll talk about later.
And again, so I started investing in that.
So can you tell me real quick,
this always fascinates me.
How did you make that leap to being an entrepreneur?
So right before, let's say 2008, what were you doing?
And then how were you able to sort of quit that job and say, okay, I'm going to start this real-time video understanding company?
What made you take that leap of faith?
So the answer is, it's two things really. One of them is, I've just always been very entrepreneurial by nature. I was importing, like, balance skateboards to Israel when I was in high school, and I was always doing stuff like that, right? So part of it was always a little bit entrepreneurial. But the other one is just dumb luck and happenstance.
I was in New York, and my cousin of all people said, oh, I'm going to start this company and I'm going to hire, whatever, like 10 engineers to do XYZ. And I'm like, oh, you don't need 10 engineers for this. We can whip it up in like a month, and you can go raise money on that. And he didn't believe me, so we spent a month and we made something work. And, you know, we found ourselves with some money and a company, so, like, okay, I guess we have to run a company now.
Wow. So you took a month off,
like a month. I wouldn't call it a vacation. It's probably like the busiest month of your life,
but you took that month off, quote unquote, and then you went to investors and you said,
here's our idea. I was a postdoc at the time. So it's like the work was not incredibly demanding,
definitely not time critical. So like I could kind of disappear for a month.
It was actually a great, like, I loved it. I actually still remember it. I would wake up at
like 8am. I would make myself coffee. I would sit down and I would code until like midnight and go
to sleep. I did that for a full month. And it was fantastic. It was fantastic.
It's so addictive. It's like this dopamine rush, you know?
Exactly. I didn't talk to anyone. I didn't need anything. I would just drink coffee and Coke for like a month straight. It was fantastic. I loved it. Not a single meeting.
Man, it sounds amazing. Actually, you bring back really, really positive memories. I mean, maybe we should do that again every now and then.
Highly, highly recommend.
Yeah. Wow, that's an awesome story.
A month's worth of flow.
It was fantastic.
But anyway, the cool part about it is that we ended up building a really cool solution.
We ended up selling the company, and I moved to Yahoo, where I was a director of science.
So did you sell the company to Yahoo, or is that how it worked out?
No, no.
We actually sold it to Vizio.
Okay, got it.
The TV company.
And then I moved to Yahoo, where I spent a lot of time as the director of machine learning.
So I was building machine learning platforms and solutions and so on, working on mail and ads and spam and feed ranking and so on.
And yeah, then I went to AWS and spent about two and a half, three years.
So all the offerings from AWS, so SageMaker and, you know, there's Rekognition, I think, for video.
Yeah, exactly.
Language stuff.
Language stuff, all those services. I didn't own all of them, but I was definitely a part of building that org and a big part of the science of a lot of these services. So, you know, my journey has kind of
been a little bit all over the map, which is highly recommended. Seems chaotic in hindsight,
but kind of fun nonetheless.
No, I think it's amazing. I think it's great. I think that, you know, you can see the progression. It might seem chaotic, you know, geometrically, but you can see kind of a nice trajectory. Yeah. Very cool. And so you were at AWS, and then from AWS,
is that how you were inspired to go to Pinecone or was there a spot in between? No, no, from AWS I opened Pinecone.
So building, it was kind of seeing how the sausage is made,
like building managed services and databases.
That's where I really got into like databases
and managed services and cloud offerings and so on,
kind of really diving in deep into that.
And it suddenly dawned on me, you know, that I have solved exactly the same problem like 10 times in my career, and I always built it from scratch, and it was always hard and always insufficient in the end.
Yeah.
So I figured we should do it once and for all, do it right, and give it to everyone.
Yeah, that's awesome. Very, very cool. Fascinating. I think it's an amazing journey.
We can now give kind of an overview of, you know, what it means to provide sort of a vector database. And so maybe we first need to talk about embeddings and where they come from and why we need them.
Yeah. So, one of the interesting things I realized while thinking about opening Pinecone: I had a lot of conviction that this was needed, and I knew embeddings were needed
for dealing with unstructured data. So when you look at tabular data, it's very clear how to store it, how to slice and dice
into it, how to work with it, because we have 50 years worth of databases developed for
it.
Yeah.
I mean, if people have used Excel and you've done this thing where maybe you have a bunch
of columns with data in Excel,
and then you want to try to learn a new column.
There's a column that has some missing values,
and you want to try to guess it from the other columns.
That's pretty easy to do in Excel, and you can kind of imagine what's going on there.
There's some model that's just looking at a single row and pulling those values
and then trying to infer that new column.
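As a toy version of that spreadsheet exercise (made-up data and column names; any regressor would do), you can fit a model on the rows where the column is present and fill in the rest:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical table with one column that has missing values.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "visits": [3, 7, 2, 9, 5, 4],
    "spend": [40.0, 95.0, None, 130.0, 70.0, None],
})
known = df[df["spend"].notna()]
missing = df[df["spend"].isna()]

# Learn "spend" from the other columns, then guess it where it's missing.
model = RandomForestRegressor(random_state=0).fit(known[["age", "visits"]], known["spend"])
df.loc[df["spend"].isna(), "spend"] = model.predict(missing[["age", "visits"]])
print(df)
```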
But then there's a question of, how do you do that if it's an image or a video or a website?
You know, now it's totally unclear how to do that.
Correct.
And this, you know, it extends much more. So it could be like, yeah, those like very complex data types,
like images and long text documents and, you know, some like audio recordings and so on.
But it could be like travel through a website, like a trajectory through a website or a shopping cart or something else, which is in and of itself like a complex object.
It's not something you can stick in a table in like four columns. Right.
And for a long time, it wasn't really clear how
to deal with those objects, right? It was like each and every one of them had its own like little
mini discipline in computer science and how to reason about them, how to structure them, how to
predict about them and so on. And in the last few years, there has been this movement with deep learning to basically unify
this entire paradigm. So basically say, I will use deep learning neural nets, basically,
like deep neural nets, right? To take the input in some way and transform it through some layers or through some transformations.
And in the end, provide a high dimensional dense vector.
So let's say 1,024, an array of 1,024 floating point numbers.
But I will train the network such
that that vector is somehow semantically representative of the object.
For example, I would train it such that if two documents mean the same thing,
okay, they talk about the same topic, they're similar in some sense,
then their resulting vectors would be very correlated.
Yeah.
And then the magic is knowing what that similarity is, right?
Or how to measure that.
Right.
And so there's a whole, again, there's a whole body of literature and a good amount of technical
know-how of how you take data and produce these embeddings such that they are good.
But the fascinating thing is that it's, while it isn't always the perfect solution,
it's very often great.
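For example, here is a minimal sketch of that idea using an off-the-shelf sentence-embedding library (the package and model name are just one common choice, assumed to be installed); sentences that mean roughly the same thing come out with highly correlated vectors:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed installed

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model; any text encoder works
sentences = [
    "How do I reset my password?",
    "I forgot my login credentials.",
    "The weather is nice today.",
]
vecs = model.encode(sentences)  # one dense vector per sentence

def cosine(a, b):
    # Similar meaning -> vectors point in nearly the same direction -> cosine close to 1.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # high: both are about account access
print(cosine(vecs[0], vecs[2]))  # much lower: unrelated topic
```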
And so you can transform images and text and audio and like shopping habits
and like your video preferences
and pretty much you name it
and convert it into this dense vector.
And now you can, instead of dealing with these complex types,
you can now have this embedding.
You actually embed.
By embedding, it's like you put an item in high-dimensional space. You assign a vector to it, right? So think about a vector as a point in a high-dimensional space. That's why it's called embedding. And so now, instead of talking about documents and semantic similarity, you talk about vectors and angles and distances. And so now it becomes a lot more concrete, a lot more actionable, something that you can reason about and build for and code systems around. It's not an abstract thought anymore.
Yeah, I mean, if people have seen this visualization where there's a bunch of points and they all have kind of springs in between them, and then you run this through some physics model and they all just kind of separate and they end up in this like spiderweb
kind of shape, right? That's because all the points are kind of repulsing each other. And even
though it's only a local repulsion, it creates this really beautiful kind of spiderweb structure
globally, right? You see something similar with birds, right? Birds have a similar
thing where, you know, they have some very simple rules in their brain around not getting too close
to other birds and like staying in a certain position to the bird next to them. And then that
ends up with this like beautiful kind of V shape or swarm or things like that. And this is similar
where we're doing that, but then some of the
points are actually getting pulled together and some are getting pushed apart depending on things
we know about them.
Exactly. So there were works from, I think, the early 2000s called self-organizing maps. Maybe the idea itself is even older. There were old works on MDS, stuff like that, Isomap. And there was a flurry of works where you would take things, like, say, I'll take all the people, you know, a million people, and I'll just look at their social network graph, and I'll try to create these virtual springs that put people who know each other close together physically. So then we try to make them physically adjacent to each other. And if you try to do that in two dimensions, what you get is the map of the US, if it's the US folks.
Maybe now with COVID and social networks, maybe that now doesn't work anymore.
But back in the day, it did.
Yeah.
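For anyone who wants to play with that spring idea, here is a toy sketch using classical multidimensional scaling from scikit-learn (the distances are made up; this is an illustration, not the original self-organizing-map work):

```python
import numpy as np
from sklearn.manifold import MDS

# Made-up "how connected are these people" distances for five people:
# small = closely connected, large = strangers.
dist = np.array([
    [0.0, 0.2, 0.3, 0.9, 0.8],
    [0.2, 0.0, 0.25, 0.85, 0.9],
    [0.3, 0.25, 0.0, 0.7, 0.75],
    [0.9, 0.85, 0.7, 0.0, 0.15],
    [0.8, 0.9, 0.75, 0.15, 0.0],
])

# MDS acts like the virtual springs: it finds 2D coordinates whose pairwise
# distances match the input distances as closely as possible.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(coords)  # two tight groups appear, mirroring the two communities above
```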
Or if you imagine doing a two-dimensional embedding on interests, you might have like a
little island over here of gamers and a little island of skateboarders. And there might be
somebody who's interested in both. And so they're going to be kind of on this sort of inflection
point or on the saddle point in between those two clusters. But you'll still have those large clusters. Or maybe you'll have another cluster of gamer skateboarders, right,
if there are enough of them.
And that cluster will start having a defined border
and pushing the other clusters away.
And so, yeah, you end up with these really beautiful behaviors.
And when you're done, you have something really interesting.
You can draw a little circle on that map and say,
everyone here really likes skateboarding.
And once you start introducing a whole bunch of other topics and you get to more than two
dimensions, you can even guess at things that people like that they didn't even know they liked.
A hundred percent. And I think in two dimensions, you get pretty limited pretty fast. Two-dimensional space is very limited. But when you live in a thousand-dimensional space, you get so much more representational power, because you can be a skateboarder but at the same time be a dad, and at the same time be really into, whatever, anime, and also be vegetarian. All these things are facets of your behavior and things that you care about, and they're not mutually exclusive. They're just a part of you. So you're not a skateboarder or a vegetarian, you're both. Yeah. So that's how,
you know, these high dimensional dense vectors can actually like represent something a lot more
deep, a lot more rich, a lot more actionable than just like,
you know, this one feature or another.
Yeah, it's that curse of dimensionality, but actually working in your favor, where going from two to four
dimensions gives you so much more expressibility. And then going to a thousand dimensions gives you,
you know, exponentially more expressibility. And it's just amazing what you can represent.
I mean, we've seen just unbelievable things with GPT-3
where people have auto-generated websites.
There's this thing Codex where it auto-generates code.
And it's just amazing.
And it's all using embeddings.
Yeah.
And the amazing thing is like,
that's how our brain works too.
Yeah.
So if you think about everything that you remember seeing,
everything that you've ever seen,
the part of the lateral cortex,
and pretty much everything,
your higher functions in the brain
that deal with visual imagery.
The input that they have is not the light intensity on your retina, which you might
think is highly related to the actual image.
Right.
Or what we think about as an image, like an RGB raster.
What your brain gets is the output of the visual cortex, which is an activation of a few
million neurons. And that's it. And those activations have very little correlation to
the actual RGB values in the image. In fact, we know for a fact they have almost no correlation because when you look at your partner at midnight inside the house or
maybe in high noon outside, in terms of colors, it's completely different. It's got nothing to
do with each other, right? But to you, it looks exactly the same. The brain just completely
changes all the values to make the embedding exactly the same. So you
see the same thing.
Yeah. I mean, a good example of this is to try to guess how thick an interior wall is. Now, if you're a dad and you're constantly in the attic fixing stuff, that's different, that's cheating. But you might look at a wall and say, oh, that's maybe, you know, half a foot thick. But then, when you actually look at a map, it's like, oh no, it's like four feet or something. So you'll be off by a ton. And yet you're able to navigate around your house, you know, trivially.
You know, you don't have a map that is in any way accurate.
But you have this really sort of deep, like semantic understanding of the different rooms and where to go.
And so, my older son sleepwalks. And somehow he's able to sleepwalk down the stairs and into the fridge and get a stick of cheese. I don't know how he does it, but he's able to do that. Yet that map isn't there. So it's all just these super deep embeddings. And because everything gets collapsed down to that, in the case of the human brain, you know, a million-dimensional embedding, it's in a sense kind of easy to store, right? Because you're putting the same million numbers to hold everything you see. You might have to integrate over them for video, and audio might be a different thing, but the data structure, if you will, is the same for everything.
So that actually, I think you nailed it.
I think this is like, it brings a very interesting topic,
which is how do you even remember?
How do you get, like when your son sleepwalks,
so when he looks at a stick of cheese
and he knows it's a stick of cheese,
or you look at your loved one and they're whatever.
A stick of cheese.
How do you know that's a stick of cheese?
No, how do you know it's the same thing?
Yeah.
Somewhere in your brain, you need to be able to go and match what you have against a bank of gobs of these embeddings, right? And say, oh, this isn't a truck, this isn't a tree, this isn't, whatever, my grandma, and it isn't, whatever, a road. It's a stick of cheese, right?
Yeah.
And even, you know, you can see really abstract, like, post-modern versions of chairs and still know right away it's a chair.
Yeah. Right.
And so how does that happen? You know, how do you do that? And so if you think about it, computers need to do the same thing. You know, if I generate an embedding for a sentence, right? So, you know, you might think about,
okay, I'm some, whatever.
Like I used to work on spam detection at Yahoo, right?
So you see an email that looks like
some Nigerian prince is trying to convince you
to transfer money somewhere.
You're like, okay, I know this.
I know what it is, right?
But it's not exactly the same email.
Like these scams look slightly different from each other.
But any rational person would read this like, oh, I remember this thing. It looks like something
else. Because in your brain, you do this matching very easily. But computers don't know how to do
this. So for computers, you would pass a different email that maybe semantically says the same thing,
but you would get a slightly different vector
coming out of it.
And unlike structured data,
when you can do a hash table lookup
and just fetch exactly the same key
and then get the same value,
you don't have that anymore.
You just have a vector that's different
than anything else you've ever seen before.
It might be close to other things you've seen before
or highly correlated with a small set, but it's not an exact match. And so if you went ahead and tried
to code this thing and try to say, okay, now I have a billion, 1,000 dimensional vectors and
give me the 100 that correlate the most with it, right? Except for going one by one, computing the correlation and just keeping the top 10,
it's not clear what to do. And so now you get into this like really, really rich and interesting
like topic in computer science that has to do with like vector indexes and retrieval.
And that's, you know, that's something that like gets really deep, really fast. And again, I'm happy to dive
as deep as you want. I'm not sure how much you want to, how much you want to scratch the surface.
Yeah. Well, let's go through sort of the life cycle of a system that uses vectors, right? So let's take a recommender system. Let's say it's an image recommender system. So we have all of these images and whether the person liked them or not, and the training is going to pull the embeddings of images a person liked together. And if a person liked one image but not the other, it's going to pull them apart, right? And there's more complexity around random negative mining and all of that. We'll just leave all that alone and put a bookmark in that. But you can imagine, you know, in the beginning, it's all random. This is one thing that actually surprised a lot of people. In the beginning, all the embeddings are totally random. So, you know, any particular picture is just somewhere floating in that thousand-dimensional space, and every picture is going to have some other spot. But then through this process of kind of pulling and pushing pairs of points together or apart, eventually it'll start to coalesce into that structure that we really like.
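A minimal PyTorch-style sketch of that pull/push training signal, in one common form, a margin-based contrastive loss (the real recommender could use a different objective, and the names here are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, liked_together, margin=1.0):
    # emb_a, emb_b: (batch, dim) embeddings of two images shown to the same person.
    # liked_together: (batch,) 1.0 if the pair should be pulled together, 0.0 if pushed apart.
    d = F.pairwise_distance(emb_a, emb_b)
    pull = liked_together * d.pow(2)                          # attract positive pairs
    push = (1 - liked_together) * F.relu(margin - d).pow(2)   # repel negatives up to a margin
    return (pull + push).mean()

# Toy usage: random starting embeddings, exactly as described above.
emb_a = torch.randn(8, 128, requires_grad=True)
emb_b = torch.randn(8, 128, requires_grad=True)
liked = torch.randint(0, 2, (8,)).float()
loss = contrastive_loss(emb_a, emb_b, liked)
loss.backward()  # gradients pull liked pairs together and push the rest apart
```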
And then to your point, if we know somebody,
the last picture they liked was this one,
then we can ask the question,
what should we show them next?
And it's, well, what are the closest things
to that one they just liked?
So now we have this problem of: here's a point in an embedding that we've trained. It could be maybe a week ago, but we have this trained model, and we have this point that we know somebody likes. Let's get the points around it. And so, yeah, the naive way to do that is to go through every single point and say,
how close are you? How close are you? How close are you? And then keep an array of the top 10. But for something like, imagine Amazon. I mean, how many items does Amazon have in their
inventory, right? Or how many pictures are on Instagram, right? I mean, just billions.
I actually know the answer to that.
Oh, really? All right. Let's hear it. If it's not confidential.
I think it is.
Okay. Let's say it's more than 100 billion or something, at least for Instagram.
There's so many people on there making pictures.
And so that whole idea of going one by one, you just can't do it.
So that kind of gets to sort of what you've really been focusing on.
And so how do people do that?
I mean, Amazon works, Instagram,
I can press F5 and it comes right up.
Every time I hit refresh, it's right there.
I PTR, I pull down in the app and I get new photos.
Like how are people able to do that in a reasonable time?
So now we basically are discussing this
like algorithmic question, right?
So let's just kind of make that crisp of what the problem is.
You get n vectors in dimension d, so n being large, say billions,
d being large-ish but fixed, say 1,000.
And now I give you another vector in dimension 1,000.
I tell you, give me all the vectors that
correlate the most with it.
Or let's just say, maybe, are closest in Euclidean distance. So I actually measure the sum of squared discrepancies on the coordinates and take the square root of that.
So like you said, yes, you can scan everything.
Clearly, that's not feasible.
So now how do you do it?
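To make the baseline concrete, this is the exact scan being described, as a minimal NumPy sketch (toy sizes; at a billion vectors this per-query cost is exactly what becomes infeasible):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 1000                     # toy stand-in for "n vectors in dimension d"
data = rng.standard_normal((n, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Exact nearest neighbors: compute the Euclidean distance to every vector...
dists = np.linalg.norm(data - query, axis=1)
# ...and keep the k closest. Correct, but it touches all n*d numbers for every query.
k = 10
top_k = np.argpartition(dists, k)[:k]
top_k = top_k[np.argsort(dists[top_k])]
print(top_k, dists[top_k])
```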
Interestingly enough, you can actually prove that you cannot accelerate this
beyond some limit, which is still pretty grim, okay, without incurring some level of error.
In some sense, you need to accept the fact that you will have a little bit of approximation. So you might miss something that is relatively close,
maybe not almost identical,
but kind of on the perimeter of the bowl you're searching in.
You might miss that,
or you might accidentally include something
that's slightly outside of it.
So you have to accept a little bit of a fudge factor, right?
But once you do, you can accelerate things significantly.
And in fact, all the algorithms that do fast nearest neighbor search do approximate nearest neighbor search.
And approximate nearest neighbor search means looking for the vectors that are closest, approximately.
Approximately means exactly what I said.
There's like this little fudge factor around the edges. Now, I'll just kind of give a very
quick and
maybe almost like a
sketch of kind of the
main ideas in big algorithms. So,
first of all, there is
clustering. Clustering is an obvious idea.
I'll just start by saying
I'll have like a thousand
clusters. Clusters are like a
large collection of points that are all relatively
close to one another.
And instead of remembering all of them,
I'll just kind of remember the center.
And so when seeing something, instead of thinking
whether it's a mouse or a stick of cheese,
I can first say, OK, is this an animal or like an inanimate object, right?
If it's an animal, good. I've already reduced the search space by a big margin, right? So now I need
to, and I can iterate on that. So you can like deconstruct the world. That makes sense, like
hierarchical. Like for Amazon, if you search for batteries, there's probably something in Amazon that says, you know, don't bother with pet food, just forget that whole category. There's probably something like that, and then they can do something afterwards, right?
And so, even if you just take images, for example, even with images you can already take their embeddings and just already cluster things.
Oh, cluster the embedding space.
You don't have to manually do anything.
Yeah, you cluster the embedding space.
So now you take those points in high dimension, just say, oh, those set of points are so close
together that instead of measuring all the distance, I can just measure the distance
of the center.
And if it's close enough, then I will maybe suspect that some of them should be close.
And then I should check that.
So how you cluster exactly and how you then search within those clusters, I still didn't talk about that.
But that's one main idea that works very well.
And again, there's a whole literature on how to do that fast and efficiently and accurately and so on.
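To make the clustering idea concrete, here is a toy sketch of an inverted-file-style index (small sizes and illustrative names): cluster once up front, then at query time compare against the centers and only scan the few most promising clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.standard_normal((50_000, 64)).astype(np.float32)   # toy vectors

n_clusters = 100
kmeans = KMeans(n_clusters=n_clusters, n_init=1, random_state=0).fit(data)
# Bucket vector ids by the cluster they were assigned to.
buckets = {c: np.where(kmeans.labels_ == c)[0] for c in range(n_clusters)}

def search(query, k=10, n_probe=5):
    # Step 1: compare the query to 100 centers instead of 50,000 vectors.
    center_dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe = np.argsort(center_dists)[:n_probe]
    # Step 2: exact scan, but only inside the few most promising clusters.
    candidates = np.concatenate([buckets[c] for c in probe])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(search(rng.standard_normal(64).astype(np.float32)))
```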
Then within each cluster, so now let's say you've narrowed down your search space
from a billion to let's say 20 clusters, each one of them has a million points.
Now they're not clusters anymore because they're all sticks of cheese.
Maybe different, whatever. Blue cheese.
Yeah, like a piece of chalk and a stick of cheese.
And I don't know.
So now, like all white cylindrical things, right?
Yeah.
So now you really have to do something a lot more refined, right?
And there you have a whole other set of algorithms
that look like what's called product quantization
or other types of quantization,
instead of computing the dot product exactly. So instead of saying I'll compute the distance exactly for each candidate and the query vector, I will compress it somehow and be able to just have a very quick, maybe O(1), you know, indication of: is this a likely match, right?
So like a real simple way to do this would be just to have buckets, right? So you could just
uniformly sort of divide the space up and say, if my point is so many buckets away from this point,
then don't even bother, just skip that whole bucket or something.
Correct. I mean, buckets actually would be probably more
akin to the clustering step. Ah, okay. Got it. Which is a good way, by the way, in low dimensions,
bucketing, and an algorithm called KD trees, is actually very efficient. If your data is like two or five or 10 dimensional, then this bucketing works fantastically well. In dimensions like 100 to 1,000,
it kind of breaks spectacularly because high dimensional spaces are weird.
Yeah, that's true.
I never thought about that.
But yeah, I mean, just to give a bit of context,
like for a KD tree, you can only split.
And so a KD tree is a binary tree
where at each node in the tree,
you split in one dimension.
And you say everything that's more than this in that dimension, let's say you split where the fourth dimension is 0.4 or higher.
They all go to one part of the tree.
Everything else goes to the other part.
But because each split is only one dimension,
if you have a thousand dimensions,
you can't usually make very good splits.
No, I mean, if you split roughly in half,
you only have log n splits before you
remain with one item in every box.
And log n could be like 10, I don't know,
like definitely less than 50.
And you have 1,000 dimensions.
So that doesn't work.
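A quick way to see both sides of that is SciPy's KD-tree on toy data; the low-dimensional query prunes most of the tree, while the high-dimensional one degrades toward a full scan (the sizes here are arbitrary):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)

# Low-dimensional data: the tree rules out whole branches and queries are very fast.
low = rng.random((200_000, 3))
tree_low = cKDTree(low)
dist, idx = tree_low.query(rng.random(3), k=10)

# High-dimensional data: almost no branch can be ruled out, so the same query
# ends up touching most of the points anyway.
high = rng.random((20_000, 1000))
tree_high = cKDTree(high)
dist_hd, idx_hd = tree_high.query(rng.random(1000), k=10)
```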
Yeah, so you have product quantization,
which is like a very, very quick way of iterating
through those candidates. And again, it's a very mathematical, very cool field of research that has a lot of numerical linear algebra in it and high-performance computing tricks.
And it's very exciting.
In fact, we actually have a paper coming out showing how we beat the state of the art on that coming up very shortly, probably
like a month or so.
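For the curious, here is a stripped-down sketch of the product quantization idea (illustrative sizes; production libraries implement this far more carefully):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(data, n_subspaces=8, n_codes=256):
    # Split each vector into sub-vectors and learn a small codebook per subspace.
    subs = np.split(data, n_subspaces, axis=1)
    return [KMeans(n_clusters=n_codes, n_init=1, random_state=0).fit(s) for s in subs]

def pq_encode(codebooks, data):
    # Each vector is stored as a handful of small integers instead of d floats.
    subs = np.split(data, len(codebooks), axis=1)
    return np.stack([cb.predict(s) for cb, s in zip(codebooks, subs)], axis=1)

def pq_search(codebooks, codes, query, k=10):
    # Precompute, per subspace, the squared distance from the query piece to every code word...
    q_subs = np.split(query, len(codebooks))
    tables = [np.linalg.norm(cb.cluster_centers_ - q, axis=1) ** 2
              for cb, q in zip(codebooks, q_subs)]
    # ...then each candidate's approximate distance is just a few table lookups, no dot products.
    approx = sum(tables[j][codes[:, j]] for j in range(len(codebooks)))
    return np.argsort(approx)[:k]

rng = np.random.default_rng(0)
data = rng.standard_normal((20_000, 64)).astype(np.float32)
books = pq_train(data)
codes = pq_encode(books, data)
print(pq_search(books, codes, rng.standard_normal(64).astype(np.float32)))
```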
And then there's a whole field, there's a whole other type of algorithms that kind of
blend the two and kind of create this graph over your points and say, okay, I'll traverse
kind of the map in high dimensional space.
I'll have a set of candidates and I'll just look for the neighbor that gets me in the direction that I want to go. And I'll keep going that way.
And again, those end up being very efficient sometimes, and sometimes they just completely choke. And there's a lot of science about why and when that happens, and when they excel and when they just kind of tread water.
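For readers who want to try the graph-based family, a small sketch using the hnswlib library (one popular open-source implementation; parameters follow that library's API):

```python
import numpy as np
import hnswlib  # assumed installed

rng = np.random.default_rng(0)
dim, n = 128, 100_000
data = rng.standard_normal((n, dim)).astype(np.float32)

# Build the graph index: each point gets linked to a handful of near neighbors,
# and a query walks the graph toward the region closest to the query vector.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(data, np.arange(n))

index.set_ef(50)  # search-time breadth vs. accuracy trade-off
labels, distances = index.knn_query(data[:1], k=10)  # approximate 10 nearest neighbors
print(labels, distances)
```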
I can speak about this for many hours. I highly recommend if you're a computer science major,
if you're a mathematician,
if you really care about high dimensional geometry,
linear algebra and so on,
it's a wide and exciting field
where a lot of cool things are happening.
It's intimately related to how neural nets are applied
and how data is represented.
So very cool, very cool field.
Yeah, totally.
Yeah.
Yeah, I think that it makes a ton of sense.
So the quantization, you're saying that through that,
you don't necessarily need a tree.
You're able to get it so fast that once you have that cluster
of a million items, you can get the dot product or approximation of the dot product of all million
of them really quickly. Correct. Wow. That is so cool. Yeah, for some reason I totally didn't know that. I thought people were still using KD trees, but you know, your point is spot on. I guess maybe just to continue the devil's advocacy,
what about ball trees or VP trees or these other things? Is the quantization, just the fact that
you can parallelize it maybe on a GPU or something, does that make that more appealing than these
other sort of tree ideas? Yeah. So there are two factors that play in.
One of them is the hardware. Like you said, some algorithms just lend themselves really well to
the kind of hardware that you have. So GPUs and so on. And also CPUs, just a memory hierarchy. I
mean, the way it's set up, these efficient scans end up being surprisingly efficient.
That's one. But the second thing is just like the weirdness of
high-dimensional space. As humans, we are limited to thinking about three dimensions.
We talk about four and a hundred dimensional spaces, but the mental image we have is really
three-dimensional. And high-dimensional spaces are weird. I mean, they are just, they behave mathematically very, very, very different than low dimensional spaces.
For example, you can ask yourself how many vectors that are exactly orthogonal to each other I can fit in a space of dimension D.
Well, they're exactly orthogonal to each other.
And so they form like a coordinate system.
Right.
And so in dimension D, you can only fit exactly
D of those. Yep.
What happens if I give you a little bit of flexibility and I say instead of exactly 90
degrees between two of them, you're allowed to have 85 degrees between every pair. How many of
those can you put together in dimension D? And so in, or maybe
instead of vectors, I think about them as like lines. So they go both ways, right?
I only know the answer for two.
So in two, you still can do only two.
Yeah, that's right.
And so that didn't change. In three dimensions, you, again, can only do three.
I guess so, right? Yeah.
And so somehow you would think that it would still grow linearly.
And maybe at some dimension, it would start growing a little bit more.
But for most people, the intuition is that it would stay somehow linear in the dimension.
Yeah.
The interesting thing is that it grows exponentially in the dimension.
So in dimension 1,000, you can have more vectors.
You can have, I actually calculated once.
I forget the number, but it was many billions of vectors that are almost orthogonal to one another.
Wow, that is a remarkable, wow.
That's mind blowing.
I want to really put an exclamation point on that.
So everything we're doing is an approximation, right?
Everything.
I mean, even the fact, the way you got these embeddings was through an approximation. You're
using some kind of stochastic descent process. And so the fact that you're not a perfect 90 degrees,
you're still effectively orthogonal with that tolerance. And so what you're saying is that
there's billions of ways things can be just
completely different from each other. Yeah. In fact, I'll give you more than that. In fact,
if you just take two random directions in dimension 1000, they will be almost orthogonal.
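You can check this yourself in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
u = rng.standard_normal(d)
v = rng.standard_normal(d)
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

# Angle between two random directions in dimension 1,000.
angle = np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))
print(angle)  # almost always within a few degrees of 90
```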
Yeah. Yeah, I believe it. Yeah. Actually, that was a method, right? I mean, early on before
deep learning, random projections where you just randomly assign vectors to documents, that was one of the ways that people were doing retrieval.
Correct. And in fact, my whole PhD was about how to do random projections algorithmically faster.
That makes sense. Wow, that is mind-blowing. Oh, I see. And so then that explains the approximation, because I was putting a mental bookmark in that. Because if you do something like KD trees, you can do early stopping. At some point you know that this partition, this entire subtree, is further from the nearest neighbor. You know, even if I was to get the perfect point in this partition, it's still too far, and I don't have to search anymore and I'm done.
But because we're doing this clustering
and because we're, you know,
there is a chance that maybe you're in between two clusters
or something like that.
It's just that it's because the clustering
is centered around dense areas,
the chance that you're not in one of those dense
areas is like impossibly small, especially at these high dimensions. Yes. So even though I think,
I think the last point you made was actually maybe the opposite of what I was saying. I mean,
I think that, you know, you asked about KD trees and balls and so on, just maybe to explain to the audience. I mean, the idea is to say, okay, I'll have everything in one ball and then I'll just cover my space.
I'll just take balls of radius, like one half. Let's say my, everything lives in a ball of radius
one. I'll just cover my data in balls of radius one half, right? So everything lives in there.
And then those one half I'll cover with balls of radius a quarter and so on, right? What I'm saying
is that in high dimensional data, okay,
if you have everything in a ball of radius one,
if you had to cover your data with balls of radius one half, okay,
every ball will contain one data point.
There would be no tree.
It would just like everything is far away from everything else, you know?
And so there is no tree.
There is no structure. It's just random chaos.
And whether we like it or not, data is sometimes like that.
How does clustering not suffer that same problem?
It does. And that's why vector indexes are hard to build because in some sense,
they are themselves a model of the data.
Yeah, I see. So maybe I'll try and get an understanding. So clustering is this coarse thing, but you really only have to do it once, and within a cluster there's still many entries. Versus these tree approaches, you're kind of doing a bad thing many, many times. As you're traversing the tree, you're making log N mistakes, you know, until you get to the leaf.
Yeah. So, you know, I think, again, it's hard,
at least for me, it's very hard to like talk
about this abstractly without kind of math
and drawings and so on.
So, but yeah, it's, I'll just maybe try to wrap it up by saying that it's a fascinating set of
problems, which still have a lot of research going into them.
And I highly encourage, you know,
CS graduates and EE and anybody who's like into this thing and want to see
some really gnarly math and really
cool engineering, it's a good place to put some time in.
Yeah, yeah, totally.
Yeah, I just want to double down on that.
I think it's really interesting.
It's becoming more and more important every day.
And it's the future of information retrieval.
It's the future of how search engines and databases
will work. And so if you zoom out from the core index, it's like, think about making
the analog of databases. We've just spent 20, 25 minutes just talking about B-trees.
Or if it's a search engine, we just spent 25 minutes talking about an inverted index,
right? But we haven't talked about the rest of the database around it.
How do you do the embeddings? How do you save them? How do you retrieve from them? How do you do the
OD? Yeah, the management system part of the DBMS. Exactly. How do you update things in place?
How do you read and write concurrently? How do you now take this abstract algorithmic discussion
and make an actual hardened production database out of it
that you can go and do information,
do image search on a billion images in production
and not worry about whether it will work or not?
Yeah, I actually had two questions which I think will start to get into the management side of it. One was, how do you deal with the fact that when you retrain the model, even on the same data, you know, unless you keep everything deterministic, which is almost impossible, the embeddings are totally different the next time? And so anyone who's using your database and storing the embeddings, for example,
is going to be in a real big problem when you retrain, right?
So how do you deal with that?
So, you know, for the listeners: the company that I founded and am the CEO of is called Pinecone, and it builds a vector database.
That's what we do.
Okay.
The way we deal with this is we give our customers and our users full control over everything.
Right.
We are not opinionated on like what embeddings you have, where you, you know, how you create them.
And when do you replace what with what.
We actually give you full control over your data and how you retrieve from it and what your vectors are.
And so we never actually swap your vectors behind the scenes for you, so that your query looks one way one second and a second later it looks different, and you're sitting there thinking, what the hell just happened?
You know, if you train a different model and you create different embeddings, then
you create like a new vector index in the database, and now you can decide
which one of them you want to query.
And maybe you phase one out and maybe you keep both of them
or you do whatever you want.
And so, you know, as a data infrastructure, we try to be the least opinionated and the most like enabling and just giving all the levers and, you know, buttons possible for
our customers.
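To make that concrete, here is a tiny, purely illustrative sketch (an in-memory stand-in, not Pinecone's actual client API) of keeping a v1 and a v2 index side by side and choosing which one queries hit:

```python
import numpy as np

class ToyIndex:
    """In-memory stand-in for a vector index (illustration only, not a real client)."""
    def __init__(self):
        self.vectors = {}                     # id -> embedding

    def upsert(self, items):                  # items: list of (id, vector) pairs
        for item_id, vec in items:
            self.vectors[item_id] = np.asarray(vec, dtype=np.float32)

    def query(self, vector, top_k=3):
        ids = list(self.vectors)
        mat = np.stack([self.vectors[i] for i in ids])
        dists = np.linalg.norm(mat - vector, axis=1)
        return [ids[i] for i in np.argsort(dists)[:top_k]]

# One named index per embedding model; retraining fills a new index
# instead of silently replacing vectors under the old one.
indexes = {"products-v1": ToyIndex(), "products-v2": ToyIndex()}
indexes["products-v1"].upsert([("a", [1, 0]), ("b", [0, 1])])            # old model's embeddings
indexes["products-v2"].upsert([("a", [0.9, 0.1]), ("b", [0.2, 0.8])])    # retrained model's

active = "products-v1"                        # the application flips this when it's ready
print(indexes[active].query(np.array([1.0, 0.0]), top_k=1))              # queries hit the chosen version
```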
Yeah, that makes sense.
Yeah.
I think what you're saying is, again, it raises a lot of other very cool problems.
If I just tweak some portion of my data,
like do I have to re-index everything?
Do I have to, like, how does that work?
Do I have access to both versions of the index, with some updates and some not? If you tell me, oh, I just have all the vectors and I just moved each one of them by a little bit, so all the data changes, but not by a whole lot, can I
somehow not just throw everything away and do everything from scratch? There's so much to be
done. I mean, this is like, and we've, you know, as a science and as an engineering community,
we're not even remotely close to being able to answer those questions definitively.
Yeah, we have something in, so my background is in decision theory, and we have something in decision theory called actor regularized regression, or it could be called critic regularized regression.
And what it's basically saying is, let's say it's a robot car.
We'll train a model that copies what the robot car is doing
right now. So this model, if the robot car maybe is being driven by a person, maybe it's not a
robot yet, or who knows what it is, but maybe you have a lot of data of real cars, but we'll
train a model that clones that behavior of these people or this past system. And then when we go to train the model that's trying to maximize something,
we regularize it, which means we add a penalty the further away it gets from this first model.
And so I could imagine doing something like that, where if you keep your entire neural net that
generated that embedding, then the next time around you can add some penalty
so that the things won't drift too much.
But SGD and these high-dimensional spaces are weird, where, you know, that sounds like it makes sense, and maybe 99% of the time it makes sense, but there's still going to be 1% of the embeddings that are going to go to the moon, you know?
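A minimal sketch of that drift-penalty idea in PyTorch, assuming you still have the previous model's embeddings for the same items (the function and names are illustrative, not something discussed in the episode):

```python
import torch

def loss_with_drift_penalty(task_loss, new_emb, old_emb, lam=0.1):
    # task_loss: whatever the retrained model is optimizing (ranking, contrastive, etc.).
    # old_emb: embeddings the previous model produced for the same items.
    # The penalty grows as the new embeddings drift away from the old ones,
    # so most vectors stay roughly where downstream consumers expect them.
    drift = (new_emb - old_emb.detach()).pow(2).sum(dim=1).mean()
    return task_loss + lam * drift

# Example usage inside a training step (shapes are illustrative):
task_loss = torch.tensor(0.42)                      # stand-in for the real objective
new_emb = torch.randn(32, 128, requires_grad=True)  # current model's batch of embeddings
old_emb = torch.randn(32, 128)                      # previous model's embeddings for the same items
total = loss_with_drift_penalty(task_loss, new_emb, old_emb)
total.backward()
```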
Yeah, they're going to change completely. And again, that's fine. You know, what we are trying to achieve is to make that layer as robust and production ready as possible. So you as a scientist or an engineer in a big company, if you want to do semantic search over billions of documents or images or what have you, you get to care about your application, about how you retrieve and how you show it to your customers and what they care about, what models you're using, and not the entire, you know, everything we just talked about, like all the algorithms and optimizations and the hardware and the distribution and the systems design.
And there's so much that goes into it that frankly, you, you know, you shouldn't care
about if that's not your thing.
Yeah.
Unless you're me, in which case it is your thing.
And it's the only thing you care about.
Yeah.
So let's dive into that. So there's a bunch of people
who are listening to this podcast
and thinking this is the coolest thing I've ever seen
or heard of in my whole life, right?
And so tell us about Pinecone.
Are there internships?
Where are the internships located?
Are there full-time positions?
What's a day in the life of Pinecone?
Kind of run us through Pinecone as a company.
So Pinecone, first of all, in terms of locations,
we have centers in San Francisco, in New York, and in Tel Aviv, Israel.
We certainly have open positions all across the board.
I'll just kind of say the three, four most critical areas that we invest in.
First of all is the core engine.
A lot of high-performance computing, hardware accelerations, all the index stuff.
That sits mostly in Israel. Then there's all the cloud and scaling. Okay, so how do you take those indexes and those data structures and make them scale in the cloud, you know, to billions of items, and be consistent and persistent and highly available and so on.
That sits mostly in New York right now.
And another thing that sits mostly in New York right now is the management of it.
So everything that looks like security, data governance, metering, logging,
like all that stuff, right?
So everything that makes it a managed service, like resource allocation and so on.
Okay.
And the last thing that we invest in
is the actual machine learning.
So the embeddings, the models,
the applications of embeddings
and vector representations in real life
and things like anywhere from recommendation
to text, to images, to personalization,
to what have you.
So it runs a pretty wide set of disciplines and applications, everywhere from low-level C, C++, CUDA and so on, all the way up to modeling and so on. Of course, we have open positions,
both for interns and full time. Some of them are on our website, pinecone.io. And if you're interested and you're not sure
that you fit one of those open reqs,
just send us an email.
We love hearing from people.
Cool.
So what's a day like in Pinecone?
So what's something that makes Pinecone unique
as far as a place to work?
Do you have ping pong tournaments, or is there something that just kind of organically grew with Pinecone that has been kind of sticky and interesting?
So, COVID has really thrown a wrench in a lot of that, but I will tell you what I feel is most exciting about working at Pinecone. For better or for worse, we've developed this addiction to insanely smart, intelligent people. And it makes it very hard to hire people, but most of our folks have either PhDs or higher degrees, you know, AWS, Google, Facebook, Amazon, built huge systems, architected the internals of AWS and S3 and SageMaker and Splunk and Databricks and, you know, built unbelievable solutions for, whatever, load optimization, kind of people who just do crazy things. I mean, like SIMD and all these things. I've been in and around engineers my entire life, and the sentence, oh, I just made something like 3x faster, is being said pretty casually.
Nice. That's amazing.
What happens?
Like, oh, I did this and that.
I changed here and I reorganized all of that.
It's like, oh, now it's much faster.
Oh, shit.
I had no idea that was possible.
So you work just with people
who just kind of mess with your brain.
They're so smart and so talented.
And it's just fun.
And at the same time,
it's just like I said,
the problem itself is cool
because it runs all the way from
data and machine learning
to big gnarly systems design
to low level algorithms
and kind of the whole thing
needs to work as one unit.
And so you kind of have to be very broad
to really get it.
And so it's just fun.
It's just a fun place to work.
That makes sense.
And I just wanted to end on this note.
Who are the kinds of people who use something like this?
I think some people might look at this and say, well, there's maybe a handful of companies
in the world that have a billion images, right?
But clearly, even people who don't have maybe that much data, you know, it's not just about not using a for loop. It's about the true management. I mean, you could use a CSV instead of MySQL if you don't have a big
company, it doesn't mean that's a good idea, right? And so give us an idea of like, who is the
sort of target audience for this?
And like, kind of how broad is that audience and what are the different verticals there?
Yeah.
So it's incredibly broad. I mean, like you said, the fact that you don't have billions of objects, right.
Even if you have 10 million objects, you know, it doesn't mean you have to start building a microservice for it.
And the beautiful thing about consumption-based pricing for Pinecone is that you pay for what you use.
And so if you have a small use case, it's going to cost you like 50 bucks a month.
So why are you even bothering? So there's the whole management aspect of it,
which makes it like you get all the bells and whistles,
but you just get the hassle-free experience.
And that is already a big enough market.
But you get two different kinds of scaling issues
that look very different systems-wise.
One of them is just companies who have an unbelievable amount of data.
So they come to us and say, oh, I have 10 billion items.
What the hell do I do now? It doesn't fit anywhere. It's just going to cost me
an arm and a leg to run anything. And companies who have a different scaling
issue, they say, I have 10,000 queries per second. What do I do with this?
So now I have this super high availability, low latency system that I have to build. So I have
to care about a whole different set of problems. So being able to cater to all three is hard and
exciting. And for somebody like me, that's what I do for a living, so for me it's a worthwhile challenge. But, you know, each and every one of those three buckets contains thousands of applications and different use cases.
And so you don't have to be in Pinterest or Facebook or Google to be facing these issues.
Yeah, I mean, one thing you hit on that I didn't think about until you mentioned it was the SLA, you know, the latency. Because I think with PyTorch and TensorFlow and some of these things, you know, you can do vector math without, like, I don't know anything about CUDA or OpenCL, or I've just heard these names, but I'm using PyTorch and it can do the job well. But I think that PyTorch is not designed for low latency.
And so if you have a million embeddings in PyTorch,
you're stuck.
Like there is nothing that you can really do.
You have to go outside of PyTorch to do that.
Yeah, so you have to use something that's dedicated for it.
And you have to optimize all the system around for it.
I mean, we have POCs running
with customers who use a lot more than 10,000 queries per second. And they see a 15 millisecond
P99.
Wow. Wow. It's amazing.
And then, so for that, to get that, you have to, it's like the algorithms and so on, but you
have to optimize, you know, even like the way you deserialize JSONs.
I mean, you can't be careless about that.
You know, it's like stuff like that you don't necessarily think about.
So it's like everything around it needs to be super optimized for this use case, you
know.
One last thing, actually, this is just, I don't want to get too much in the weeds here,
but, you know, like with MySQL, you know, people do an insert statement and it seems kind of, you know, you're inserting these tuples.
And so it kind of makes sense.
So in your case, how do people dump, you know, a billion embeddings into your database?
I mean, I guess the API is still like a REST call with a huge object that's attached to it or something.
I mean, you don't send all a billion in one go. You probably would batch it and send them in batches of a thousand, or you dump them on disk, on S3 or GCS somewhere.
Oh, that makes sense.
So you have this S3 file that's massive that has all these embeddings, and you just point Pinecone to it. It's like, hey, go digest this entire thing.
Yeah, that makes sense. Or you can do it on the fly. You have a dynamic API, you just upsert and update vectors one by one or in batches.
Oh, that makes sense. Like people are adding new products, and so you're adding, deleting, updating, and so on.
Yeah. So there's this flow where you're adding new products, adding new products, adding new products.
And then boom, there's like a new embedding model.
Everything has to get rewritten.
So there's like now sort of this other snapshot that's being spun up.
And then when that's ready, you flip them over.
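For the batched path mentioned here, the client-side pattern is usually just a chunking loop like this sketch (the upsert call is a hypothetical stand-in for whatever your vector database client exposes):

```python
def batched(items, batch_size=1000):
    # Stream (id, embedding) pairs in fixed-size chunks instead of one giant request.
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage; the real method name and argument shape depend on the client library:
# for chunk in batched(all_id_vector_pairs):
#     index.upsert(vectors=chunk)
```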
And then, yeah, I mean, this is stuff that you do not want to do yourself.
I've never done it.
It just sounds brutal just talking about it. I mean, it sounds really exciting but hard to do.
Yeah, exactly. And we're not even scratching the surface. You know, again, I think anybody who's built a distributed system, you think about everything, and then a customer comes to you with a mind bender and, like, oh, crap. Like, they want to do what?
Yeah. Cool. This is really, really awesome. So we have a link
to Pinecone in the show notes. People, check that out. Is there a free version for people like high school students or college students, or, you know, how can they get started with it?
So first of all, they can go to our website and they can start a free trial. The free trial right now is two weeks, and so it's not a whole lot of time, but if you're a student and you're working on a cool project and you want that extended, just drop me a note, or whatever, we have a chat.
Cool. Drop us a note and say,
hey, I'm a student.
I'm not really a customer,
but I just really think it's neat.
Can I keep using it?
Odds are, if you really do something cool,
we'll just let you do it.
Cool.
That is amazing.
We've done that before. So we're all into education and supporting people's learning, and, yeah, tinkering and building exciting demos.
Very, very cool.
Yeah, if people build anything really cool with Pinecone, let us know.
Let Edo know.
We'll have his contact, his Twitter, et cetera, the handle for Pinecone in the show notes.
Thanks, everybody, for supporting the show on Patreon.
We really appreciate that.
And thanks for supporting us on Audible if you do that.
And thank you, Edo, so much.
This was an amazing episode.
There's so much more depth
we could go into.
We could spend a whole day
talking about embeddings.
We didn't even get a chance
to talk about the sort of
machine learning part
and ResNet and all of that.
But, you know, we've done a really good job of covering the high level, and folks can definitely look up some of the things that we talked about and take it to the next level. Thank you so much.
Thank you.
Music by Eric Barndollar. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, and transmit the work, to remix and adapt the work, but you must provide an attribution to Patrick and I and share alike in kind.