Orchestrate all the Things - Pinecone, a serverless vector database for machine learning, leaves stealth with $10M funding. Backstage chat with CEO Edo Liberty
Episode Date: January 27, 2021. Vectors are foundational for machine learning applications. Pinecone, a specialized cloud database for vectors, has secured significant investment from the people who brought Snowflake to the world. Could this be the next big thing? Article published on ZDNet
Transcript
Welcome to the Orchestrate All the Things podcast.
I'm George Anadiotis, and we'll be connecting the dots together.
Our topic today is Pinecone,
a vector database for machine learning,
which is leaving stealth with $10 million in funding.
We talk with the CEO and founder,
Edo Liberty.
Vectors are foundational for machine learning applications, and Pinecone just secured significant
investment from the people who brought Snowflake to the world.
Could this be the next big thing?
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
Thanks for making the time for the conversation today. And just by way of introduction, let me start by saying that the occasion is your funding.
So let's start from the beginning, basically.
So it's a very typical way to start.
So if you would like to say a few words about you, the company and the team and the funding, what you do, all of those things.
Sure. I'll keep it brief. I am trained as a scientist. I spent most of my life as an academic,
developing and writing papers and research on machine learning and systems and so on.
I spent seven years at Yahoo running the scalable machine learning group, and later roughly three years at AWS,
building a platform called SageMaker, which is the machine learning platform from AWS.
And a year and a half ago, I started Pinecone to build a vector database, which we feel
like is one of the most crucial components in being able to deploy large-scale machine
learning solutions.
To date, we've raised slightly more than $10 million from one of the lead early-stage infrastructure investors in Silicon Valley,
and one of the core investors in Snowflake as well, and built a team.
The team is now distributed between Israel, New York, and California.
And we are launching now,
basically in a few days,
opening the platform
for the first time to the audience,
to external users.
And yeah, that's very exciting.
It's technology we've been working on for a long time now.
Okay. Actually, that's the part I wanted to ask you to reiterate because I didn't quite get it.
How long did you say? I mean, I got the part about your personal background, let's say,
but how long have you and the team been working on Pinecone?
We started in May 2019. Okay, so like a year and a half.
And did you get any seed funding or did you just... Yes, we had seed funding. We had
early seed funding, and now we completed our seed funding with this
final raise from Wing Venture Capital, where the partner is Peter Wagner,
who's one of the most acclaimed investors in the Bay Area, and somebody
I was personally very happy to work with. Like I said, he was one of the early investors
in Snowflake, and one of the people
who made a lot of very accurate calls very early on, and somebody who has a tremendous
amount of experience in this field.
So we're very happy to have him on the board.
So that's a vote of confidence for sure.
And my one-liner on what Pinecone does, and you can
correct me if I'm wrong, is that you're basically a database for vectors. And for anyone who's into
machine learning, that should be enough in and by itself. But I was wondering if you'd like to
expand a little bit on that for people who are not necessarily familiar with that. So,
what are vectors and why are they important? And what is the precise problem that you're
solving for people who use vectors?
100%. So, you're right, it's a very technical term. But machine learning is changing how
data is represented in the world. We are used to data being records in a database,
like keys and values, or images, or audio, or text documents.
But when you use machine learning models,
they don't look at the world this way.
The input that they expect is a very long list of numbers, and that is called
a vector. It's just a list of numbers. For a human, that's completely opaque and meaningless,
but for a machine learning model, that's exactly the inputs and the outputs that they
expect and consume and create. If you are building a large-scale machine learning platform
in a big company or a small company,
if you're deploying large-scale machine learning,
you will have millions, tens of millions,
hundreds of millions of these high-dimensional vectors.
Again, very long lists of numbers, which you have to manipulate in real time, at scale, in production,
and that requires a dedicated infrastructure, and Pinecone is that infrastructure.
And so if you're using managed MongoDB for your collection of JSON objects, or Elasticsearch for
your collection of documents, you will use Pinecone for your collection
of high-dimensional vectors,
and, again, anybody who does machine learning at scale
is already grappling with that problem.
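As a minimal illustration of what such a collection of high-dimensional vectors looks like in code, here is a Python sketch; the sizes and item ids are made up for illustration:

```python
import numpy as np

# A toy "collection of high-dimensional vectors": one 256-dimensional
# vector per item, keyed by item id. Real deployments hold millions.
rng = np.random.default_rng(0)
vectors = {f"item-{i}": rng.standard_normal(256) for i in range(10_000)}

# To a human, each vector is an opaque list of numbers...
print(vectors["item-0"][:5])

# ...but to a machine learning model, geometry is meaning: similar items
# get similar vectors, compared here with cosine similarity.
a, b = vectors["item-0"], vectors["item-1"]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```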
Okay, so actually that was going to be
one of my next questions.
So apparently people are already doing that
and they already do something to manage their vectors and retrieve them and do all of those things that they need to do.
So what is the new thing that you bring to the table as opposed to existing solutions?
Good. So first of all, we invented the solution, not the problem.
Obviously, people are already facing this problem, and they solve it in a variety of different ways, none of which we thought were the right ways. You have the path of trying to somehow bend the pipes of some existing infrastructure that
you have to make it do something that it was not designed to do.
That ends up being both a lot of work and not very efficient.
You have the folks who try to build something from scratch, which ends up being oftentimes
a multi-year process with several engineers, and hard to maintain and so on.
And people build on top of open-source components and cobble them together; again,
that has significant shortcomings.
Frankly, most companies that we speak with end up understanding that this is too much
work or they don't have the right talent to do it in-house
and they just buy a black box solution for the application that they want.
So if they're trying to do, you know, recommendation on a shopping website,
they figure out, oh, we can't actually build this system.
Let's just buy it from a vendor that just does shopping recommendation end-to-end
and not even worry about the whole thing.
But the trend is for companies to now move towards doing more machine learning, more data science,
and owning inside the organization their own machine learning and their own data.
They want to wrestle it away from the black box solutions. And we give them the ability to do that without having to build all the infrastructure.
And we do that by having built three different components that interact together. One of them is the Vector Index itself.
It's a highly specialized piece of software that indexes high-dimensional vectors incredibly efficiently
and is able to interact with them incredibly fast and accurately.
The second is a container distribution platform that allows us to scale horizontally
to any number of vectors and be able to withstand any workload. And a cloud management system that
allows us to give you a very simple API without having to worry about resources. And so you can
spin up a service and spin it down all from a Jupyter notebook without ever
provisioning machines and setting up networks and so on. You just get started. And so
because all three of them work really closely together, you can start immediately,
scale to any size, and work both precisely and quickly. And that brings a level of flexibility that was just not there before.
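As a rough illustration of that spin-up-from-a-notebook workflow, here is a sketch loosely based on Pinecone's Python client; the exact method names and signatures are assumptions and may differ from the actual API:

```python
import pinecone

# Hypothetical sketch: create, use, and tear down a vector index entirely
# from code, with no machines or networks to provision.
pinecone.init(api_key="YOUR_API_KEY")             # credentials, not clusters
pinecone.create_index("my-index", dimension=256)  # spin up a service

index = pinecone.Index("my-index")
index.upsert(vectors=[("doc-1", [0.1] * 256)])    # load a vector
print(index.query(vector=[0.1] * 256, top_k=5))   # query by similarity

pinecone.delete_index("my-index")                 # spin the service down
```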
Okay, so when comparing and contrasting, let's say, existing solutions, you mentioned what I would say I would count as the two extremes. So a completely agnostic approach, like
totally infrastructure-oriented, and something which is very domain-specific, like a recommendation system, for example, of which
vectors would apparently be one part, but which wouldn't be like a generic solution.
And I think from the sound of it, it sounds like you're positioning yourself somewhere in the middle.
You are infrastructure, but not domain-specific infrastructure. So people can use it to build a recommendation or any other kind of solution, right? Correct. We are a horizontal platform for
dealing with large collections of vectors in general. We see that, for example, with shopping
recommendation, that ends up being a very common use case where people,
in what's called embedding, they embed user behavior and items in their catalog into vector
space and do personalization on the website using a vector database. And so that's just a common
pattern. And we see that again and again. And so, yes, I mean, every database will have its own
kind of standard use cases. And that's one of ours, right?
This is just something that repeats itself. So we don't
build it for online retailers,
but online retailers find value in it. And so it's
a common thing.
Okay, and so I guess that ties into one of the other questions I had when I was
looking around before this discussion, you know, to find material on what you do and how you do it.
The way you described the process through which people can use Pinecone stood out for me.
So it's summarized in a few words, customize, load, query, and observe.
And the part of it that I think ties into this discussion is probably the customize,
because the way that you choose to represent, well, objects in your domain of interest,
be it, you know, recommendations or any other kind of application,
is exactly what that customize step touches upon.
So there's many different ways that you can represent the same domain in vectors, right?
So I guess that would be up to the builders of the application to specify.
It doesn't come out of the box with Pinecone, right?
Correct.
So, like we started the discussion by saying, in machine learning
you represent items or objects as vectors.
There are many ways to do that.
So there's a machine learning model whose job is to translate, for example, a text document to a vector.
Any language model does that.
We have a flurry of those, from GloVe to traditional word2vec solutions to LSTMs
to BERT to even GPT-3.
All of those are language models that convert spoken language or text to a
high-dimensional vector. Our customers want to use any one of those or anything that they
build in-house. We, as a horizontal platform, don't want to be opinionated on how you want to represent your world. We want to give you the ability to do that. And so the configuration, the definition of your service
is defined by you. We have a model hub. You can upload your transformation model, be it
something you trained or something generic. And we can orchestrate that in real time and make sure that
when you send us a document, we convert it to a vector and we index it
or we search with it.
And that definition of which model is executed where,
and how the water flows through the pipes, is exactly that customization.
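As one concrete example of such a transformation model, here is a sketch using the sentence-transformers library; the specific model name is just an example, and any of the models named above plays the same role:

```python
from sentence_transformers import SentenceTransformer

# One example of a language model that maps text to vectors; BERT, GloVe,
# word2vec, and so on fill the same role.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

docs = ["a text document", "another text document"]
vectors = model.encode(docs)  # one high-dimensional vector per document
print(vectors.shape)          # e.g. (2, 384)
```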
I see. Yeah, it does. It also brings
up for me a follow-up question. So you spoke about how you need to connect the models that you train
to Pinecone and that makes perfect sense in order to be able to store and to index them. However, it's a very, very common scenario these days that people
retrain their models incrementally. So, what happens when you do that? So, today my model
looks like X and tomorrow it looks like X plus delta. Do I have to reconnect it? Yeah, so think about if you retrain the model that converts documents to vectors, then now
your corpus of documents might not have changed, but the vector representation changed.
And so now your index is completely separate.
You have another index of vectors, really, to work with. What
we allow you to do is actually to have both of them run in parallel and to have a router
in front of them so you can run your A-B test. So you can say, my text search, my neural
search application works with Pinecone as the backend. I will have both indexes live, both vector representations of both my
models live, and I will just route between them some fraction of traffic so you can run your A-B
tests with it. I think you're raising an interesting scientific question,
which I would love to think about,
but which unfortunately will probably take a long time to think about.
Whether, if the embeddings are close to the original ones,
you can somehow morph one index into the other and not re-index everything.
And that's a very good question.
Yeah, that's precisely what I'm saying.
Mathematically, this is not impossible, but that's an amazingly good question.
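To make the A/B-routing idea from a moment ago concrete, here is a minimal sketch; index_a, index_b, embed_a, and embed_b are hypothetical stand-ins for the two live indexes and the old and retrained embedding models:

```python
import random

def search(query_text, index_a, index_b, embed_a, embed_b,
           fraction_b=0.1, top_k=10):
    """Route a fraction of queries to the retrained model's index.

    All index/embed arguments are hypothetical stand-ins; a real router
    would also log which arm served each query for the A/B analysis.
    """
    if random.random() < fraction_b:
        return index_b.query(vector=embed_b(query_text), top_k=top_k)
    return index_a.query(vector=embed_a(query_text), top_k=top_k)
```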
Okay.
Well, you know, I'm sure it is.
That's why I asked it.
But, you know, it's not an easy one to answer.
I didn't want to sound surprised. I mean, it's a very good question.
Okay. Well, you know, it's one of the hardest things when you deal with this whole machine learning pipeline, actually. All these moving parts, your dataset is evolving and your model is evolving at the same time, and just synchronizing all of those is probably one of the hardest
problems around to solve. So, I'm not surprised that you haven't actually cracked it.
No, I mean, to be honest, the data evolving is something we obviously support. So you can incrementally update and delete data all the time.
In fact, this is one of the hardest things to achieve, which we have.
You can update the vector index and the vectors are searchable within, actually, microseconds; but definitely, for the application,
you can count on milliseconds.
We can update hundreds of thousands of vectors a second.
Okay.
And so the data evolving is 100% something
that is a big part of what we do.
Like I said, when the model is retrained,
you can re-index everything very quickly
and switch to that seamlessly.
And so definitely something we care about.
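Continuing the hypothetical client sketch from earlier, incremental updates and deletes might look like this; again, the method names are assumptions:

```python
import numpy as np

# `index` is the live index handle from the earlier sketch.
new_vector = np.random.default_rng(1).standard_normal(256).tolist()

index.upsert(vectors=[("doc-1", new_vector)])  # overwrite a vector in place
index.delete(ids=["doc-2"])                    # remove a vector entirely
# Per the discussion above, updates become searchable within milliseconds.
```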
I think the question that you asked
is about, frankly, a rare setting, in which the model is actually live, like incrementally
training on live data and constantly deployed, so the most
fresh model is always the one being used.
And I know very few places where that's the case, but still, it's a very challenging setting and it's very interesting to think about.
Okay, let's shift to another part of the process, the querying part.
And I also read some of the material that you have available online.
And by reading them, I mean, how you basically try to explain what a query is in
the first place and how you serve that. And that kind of made me wonder if this is an actual thing
for your space. Do people actually query embeddings, or is all they care about, in the context
of their models and their application, whether something is similar to something else?
Can you clarify? I'm not sure I understand the question exactly. Are you asking if...
Okay, go on. Yeah, sure. I'm going to try to give an example so maybe it becomes more clear.
So, you have all your vectors and you have your index of vectors.
And in theory, you have something that on the face of it at least looks like a query language.
So I was wondering if that's actually used as such
in use cases that you're aware of.
So do people actually use Pinecone, for example, to ask, I don't know, bring me all the documents whose vectors correspond to values
that are such and such, I don't know, after that date,
and include this word and this kind of thing?
Yeah, 100%.
I mean, that's why we built our database that way,
because that's how people want to use it.
When you deal with high-dimensional vectors, you don't
have this word appears in the document or not
because you don't have a document. You don't have timestamp is larger
than something because you don't have a timestamp. You don't have SQL.
You don't have terms and documents. You don't have the regular constructs of
a database.
And so you have to communicate your needs in a different way.
And so when you look at two numbers, you can think about them as X and Y coordinates on a sheet of paper, right, on the regular axes, and they correspond to some location,
some point on your page, right, some dot?
If you look at a thousand-dimensional vector, that's a list of a thousand numbers. You can look at them, you can
think about it as a dot in a thousand-dimensional space,
right? It might be hard to imagine a thousand dimensional space, but nevertheless, mathematically,
it's exactly the same thing. It's a dot in a thousand dimensional
space. And now you want to
somehow try to retrieve maybe that data point. So what
do you know about it? You can say, okay, I know where it is. So I want to
describe to you,
for example, give me all the data points around it, okay? Because maybe I'm doing some similarity, okay? And so that query of give me the 10 closest points to some location in space,
or give me everything inside some ball in space. So a ball centered somewhere with some radius, right?
That's a geometric construct, right?
And it sounds maybe very abstract.
But when you deal with high-dimensional vectors,
A, that's the only thing you have.
But B, machine learning practitioners are very used to doing this.
This is exactly the language that they use, right?
Give me everything inside a cone,
or behind some hyperplane, in some half-space.
And again, those sound mathematical and abstract
for non-practitioners,
but for people in the field,
that is, you know,
it's exactly how they communicate
what data they want
and how they retrieve information from such a database.
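Those geometric queries are easy to state in code. Here is a brute-force Python sketch of "the 10 closest points" and "everything inside a ball"; a vector database would answer these with specialized indexes rather than a linear scan, so this only illustrates the semantics:

```python
import numpy as np

rng = np.random.default_rng(2)
points = rng.standard_normal((10_000, 1000))  # 10k dots in 1000-d space
q = rng.standard_normal(1000)                 # the location we query around

dists = np.linalg.norm(points - q, axis=1)    # distance from q to every dot

top_10 = np.argsort(dists)[:10]               # the 10 closest points to q
radius = 45.0                                 # an arbitrary example radius
in_ball = np.where(dists <= radius)[0]        # everything inside the ball
```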
Okay. Actually, that's precisely the reason I asked you this.
Because when you say that you have a database for vectors,
for someone like me, who's not a machine learning practitioner,
I'm trying to relate it to something I know.
So do you have a query language? Does it work like SQL? Apparently,
from your answer, not exactly.
So the analogy is not one-to-one.
So, yeah,
I mean, I think SQL has...
It's not like SQL,
so you can... It's definitely
a, you know...
NoSQL is an old buzzword
at this point, but it's
definitely not a SQL database.
But it does have its own query language, which I suppose has its own expressivity and the things you can do with it.
All the creates, inserts, updates, deletes, and so on.
Correct.
And the kind of queries are geometric queries. They're things that have to do with nearest neighbor search, with similarity search, with cosine search.
They have a lot of names.
These are technical terms that might not mean a lot to non-practitioners.
But again, for practitioners, those are very common things. And so, you know, there's many thousands of academic publications
and technical reports on how these are used in practice
for a flurry of problems, anywhere from recommendation
to anomaly detection, to similarity search, to deduplication,
to data fusion, you name it.
Okay.
So it sounds like you didn't actually, you know,
like invent your own query language from scratch,
but rather you encoded operations
that people were already using.
Correct.
We did not invent the problem.
We invented the solution.
Okay.
So I think we're almost out of time.
So let's wrap up with a more, well,
operational and kind of business-oriented question. Another thing that drew my attention was the fact that
it looked like for the time being you only have a cloud-based solution. I was wondering,
okay, it makes sense why you may want to start with that, but I was wondering if
offering an on-premise solution
is in your roadmap as well. The answer is no, but there is a very good reason for it. First of all,
as a point of curiosity, I speak with customers every day,
oftentimes more than once a day.
Amazingly enough,
I'm asked about on-prem in every conversation.
But the shocking thing is,
nine times out of 10,
when people say on-prem, they actually mean a public cloud, but in their own VPC.
Like, even on-prem has lost its meaning.
They don't actually own any physical machines.
On-prem used to be, like, my actual physical machine somewhere.
You know, the term on-prem has already changed. So people ask me,
can you work on-prem? And then we say, no. And they say, no, no, we don't mean actual on-prem.
We mean on-prem in our own AWS accounts. Okay. Maybe that means that you don't
actually talk to people who work in regulated industries, because for those
people, on-prem actually means, you know, good old on-prem, in many cases, right? And so I don't
want to say that those customers don't exist and
that they don't have a legitimate need for our product. All I'm saying is that the world is moving in a very clear direction.
For us to give the kind of experience that we want to give our customers, which means
fully elastic and auto-scaling, fully managed so you don't have to wake up at night and
maintain anything.
We have folks on call, monitoring everything, and, you know,
alerts set up. And so everything is managed. Everything
is elastic. Everything is kind of hands-free for you.
Our ability to do that and to do cost cutting on your behalf is only possible in a cloud.
When we control everything, we can actually spin down resources,
we can improve our operations, and we can monitor and fix stuff.
If we run on-prem, we just can't offer that
kind of service. In the new world, we see that
people have a business problem. They want to
build a better recommendation engine for the shopping site or a better text search engine for
their documents. They're not in the business of maintaining, you know, distributed systems and infrastructure in the cloud.
And they want that service. Right.
So there's a reason why we need to operate in the cloud.
But we also see that the world is moving in that direction. Even regulated industries, I think,
will move to some version of a public cloud,
maybe more regulated, more secure,
maybe fenced up in some other way.
But the world in which every large,
like most large companies actually own compute centers,
I think we're moving away from that world.
Well, that's a whole discussion in and of itself.
So let's not go there at this time because I know we have to go.
One last thing and I'll let you happily go your way.
Next items in your roadmap after this funding,
growing the team, or go-to-market, or what?
100%.
So, you know, we are laser focused on building the best vector database in the world.
That takes a lot of work, both engineering and science.
And so definitely growing the team in all three locations, in Israel, New York, and San Francisco.
And, yeah, go-to-market. You know, we are very lean on our go-to-market.
We are opening our platform so everybody can onboard and use our product hands-free, kind of self-onboard and use it.
And so, yeah, we're mostly going to invest in just building an amazing product and keep improving it.
Okay. Well, thanks and best of luck.
Thank you so much. Have a great day.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration
on Twitter, LinkedIn, and Facebook.