Orchestrate all the Things - Pinecone, a serverless vector database for machine learning, leaves stealth with $10M funding. Backstage chat with CEO Edo Liberty

Episode Date: January 27, 2021

Vectors are foundational for machine learning applications. Pinecone, a specialized cloud database for vectors, has secured significant investment from the people who brought Snowflake to the world. Could this be the next big thing? Article published on ZDNet.

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate all the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Today's episode is about Pinecone, a vector database for machine learning, leaving stealth with $10 million in funding. We talk with CEO and founder Edo Liberty.
Starting point is 00:00:22 Vectors are foundational for machine learning applications and Pinecone just secured significant investment from the people who brought Snowflake to the world. Could this be the next big thing? I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook. Thanks for making the time for the conversation today. And just by way of introduction, let me start by saying that the occasion is your funding. So let's start from the beginning, basically. It's a very typical way to start.
Starting point is 00:00:58 So if you would like to say a few words about you, the company and the team and the funding, what you do, all of those things. Sure. I'll keep it brief. I am trained as a scientist. I spent most of my life as an academic, developing and writing papers and research on machine learning and systems and so on. I spent seven years at Yahoo running the scalable machine learning group, and later roughly three years at AWS building a platform called SageMaker, which is the machine learning platform from AWS. And a year and a half ago, I started Pinecone to build a vector database, which we feel is one of the most crucial components in being able to deploy large-scale machine learning solutions.
Starting point is 00:02:00 To date, we've raised slightly more than $10 million from one of the lead investors in early infrastructure in Silicon Valley, who is also one of the core investors in Snowflake, and built a team. The team is now distributed between Israel, New York, and California. And we are launching now, basically in a few days, opening the platform for the first time to the audience, to external users.
Starting point is 00:02:42 And yeah, that's very exciting. It's technology we've been working on for a long time now. Okay. Actually, that's the part I wanted to ask you to reiterate because I didn't quite get it. How long did you say? I mean, I got the part about your personal background, let's say, but how long have you and the team been working on Pinecone? We started in May 2019. Okay, so like a year and a half. And did you get any seed funding or did you just... Yes, we had seed funding. We had an early seed round and now we completed our seed funding with this
Starting point is 00:03:25 final raise from Wing Venture, where the partner is Peter Wagner, who's one of the most acclaimed investors in the Bay Area and somebody I was personally very happy to work with. Like I said, he was one of the early investors in Snowflake and one of the people who made a lot of very accurate calls very early on, and somebody who has a tremendous amount of experience in this field. So we're very happy to have him on the board. So that's a vote of confidence for sure.
Starting point is 00:04:03 And my one-liner on what Pinecone does, and you can correct me if I'm wrong, is that you're basically a database for vectors. And for anyone who's into machine learning, that should be enough in and of itself. But I was wondering if you'd like to expand a little bit on that for people who are not necessarily familiar with it. So, what are vectors, why are they important, and what is the precise problem that you're solving for people who use vectors? 100%. So, you're right, it's a very technical term. But machine learning is changing how data is represented in the world. We are used to data being records in a database,
Starting point is 00:04:48 like keys and values, or images, or audio, or text documents. But when you use machine learning models, they don't look at the world this way. The input that they expect is a very long list of numbers, and that is called a vector. It's just a list of numbers. For a human, that's completely opaque and meaningless, but for a machine learning model, that's exactly the inputs and the outputs that they expect and consume and create. If you are building a large-scale machine learning platform in a big company or a small company,
Starting point is 00:05:31 if you're deploying large-scale machine learning, you will have millions, tens of millions, hundreds of millions of these high-dimensional vectors. Again, very long lists of numbers, which you have to manipulate in real time, at scale, in production, and that requires a dedicated infrastructure, and Pinecone is that infrastructure. And so if you're using managed MongoDB for your collection of JSON objects or Elasticsearch for your collection of documents,
Starting point is 00:06:05 you will use Pinecone for your collection of high-dimensional vectors, which, again, is a problem anybody who does machine learning at scale is already grappling with. Okay, so actually that was going to be one of my next questions. So apparently people are already doing that, and they already do something to manage their vectors and retrieve them and do all of those things that they need to do.
Starting point is 00:06:29 So what is the new thing that you bring to the table as opposed to existing solutions? Good. So first of all, we invented the solution, not the problem. Obviously, people are already facing this problem and they solve it in a variety of different ways, none of which we thought were the right ways. You have the path of trying to somehow bend the pipes of some existing infrastructure that you have, to make it do something that it was not designed to do. That ends up being both a lot of work and not very efficient. You have the folks who try to build something from scratch, which oftentimes ends up being a multi-year process with several engineers, and hard to maintain and so on. So people build on top of open-source solutions, or take open-source components and cobble them together; again,
Starting point is 00:07:54 that has significant shortcomings. Frankly, most companies that we speak with end up understanding that this is too much work or they don't have the right talent to do it in-house and they just buy a black box solution for the application that they want. So if they're trying to do, you know, recommendation on a shopping website, they figure out, oh, we can't actually build this system. Let's just buy it from a vendor that just does shopping recommendation end-to-end and not even worry about the whole thing.
Starting point is 00:08:32 But the trend is for companies to now move towards doing more machine learning, more data science, and owning their own machine learning and their own data inside the organization. They want to wrestle it away from the black box solutions. And we give them the ability to do that without having to build all the infrastructure. And we do that by having built three different components that interact together. One of them is the vector index itself, a highly specialized piece of software that indexes high-dimensional vectors incredibly efficiently and is able to interact with them incredibly fast and accurately. The second is a distribution system, a container distribution platform, that allows us to scale horizontally to any number of vectors and withstand any workload. And the third is a cloud management system that
Starting point is 00:09:28 allows us to give you a very simple API without having to worry about resources. And so you can spin up a service and spin it down, all from a Jupyter notebook, without ever provisioning machines and setting up networks and so on. You just get started. And so because all three of them work really closely together, you can start immediately, scale to any size, and work both precisely and quickly. And that brings a level of flexibility that was just not there before. Okay, so when comparing and contrasting, let's say, existing solutions, you mentioned what I would count as the two extremes. So a completely agnostic approach, totally infrastructure-oriented, and something which is very domain-specific, like a recommendation system, for example, of which vectors would apparently be one part, but which wouldn't be a generic solution.
Starting point is 00:10:33 And I think from the sound of it, it sounds like you're positioning yourself somewhere in the middle. You are infrastructure, but not domain-specific infrastructure. So people can use it to build a recommendation or any other kind of solution, right? Correct. We are a horizontal platform for dealing with large collections of vectors in general. We see that, for example, with shopping recommendation; that ends up being a very common use case, where people, in what's called embedding, embed user behavior and items in their catalog into vector space and do personalization on the website using a vector database. And so that's just a common pattern, and we see that again and again. And so, yes, I mean, every database will have its own kind of standard use cases, and that's one of ours, right? This is just something that repeats itself. So we don't build it for online retailers, but online retailers find value in it. And so it's a common thing.
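For readers who want to make that shopping-recommendation pattern concrete, here is a rough, illustrative sketch of embedding-based personalization using a brute-force scan in NumPy. All names and sizes are invented; a vector database replaces the exhaustive scan below with a specialized index so the same query stays fast at hundreds of millions of vectors.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical embeddings: catalog items and one user's behavior,
# both embedded into the same 128-dimensional vector space.
item_embeddings = rng.normal(size=(10_000, 128))
user_embedding = rng.normal(size=128)

# Normalize so that a dot product equals cosine similarity.
item_norms = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
user_norm = user_embedding / np.linalg.norm(user_embedding)

# Brute-force cosine similarity against every item, then take the top 10
# as the personalized recommendations.
scores = item_norms @ user_norm
top10 = np.argsort(scores)[::-1][:10]
print("recommended item ids:", top10)
```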
Starting point is 00:11:32 Okay, and so I guess that ties into one of the other questions I had when I was looking around before this discussion, you know, to find material on what you do and how you do it. The way you described the process through which people can use Pinecone stood out for me. So it's summarized in a few words: customize, load, query, and observe.
Starting point is 00:12:13 And the part of it that I think ties into this discussion is probably the customize, because the way that you choose to represent, well, objects in your domain of interest, be it, you know, recommendations or any other kind of application, I think actually touches upon the customize part. So there's many different ways that you can represent the same domain in vectors, right? So I guess that would be up to the builders of the application to specify. It doesn't come out of the box with Pinecone, right? Correct.
Starting point is 00:12:55 So, like we started the discussion saying, in machine learning you represent items or objects as vectors. There are many ways to do that. So there's a machine learning model whose job is to translate, for example, a text document to a vector. Any language model does that. We have a flurry of those, from GloVe to just traditional word2vec solutions to LSTMs to BERT to even GPT-3. All of those are language models that convert spoken language or text to a
Starting point is 00:13:27 high-dimensional vector. Our customers want to use any one of those, or anything that they build in-house. We, as a horizontal platform, don't want to be opinionated on how you want to represent your world. We want to give you the ability to do that. And so the configuration, the definition of your service, is defined by you. We have a model hub. You can upload your transformation model, be it something you trained or something generic. And we can orchestrate that in real time and make sure that when you send us a document, we convert it to a vector and we index it or we search with it. And which model is executed where, and how the water goes through the pipes, is exactly that definition.
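As an aside, here is a minimal sketch of the document-to-vector-to-index flow described above. It uses a later version of Pinecone's Python client together with an off-the-shelf sentence-transformers model for illustration; the API available at the time of this conversation differed, and the index name, model choice, environment, and key are placeholders.

```python
import pinecone
from sentence_transformers import SentenceTransformer

# Placeholders: swap in a real API key and environment.
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# all-MiniLM-L6-v2 produces 384-dimensional vectors, so the index must match.
pinecone.create_index("docs", dimension=384, metric="cosine")
index = pinecone.Index("docs")

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

docs = {
    "doc-1": "Vector databases index high-dimensional embeddings.",
    "doc-2": "Snowflake is a cloud data warehouse.",
}

# Convert each document to a vector and index it.
index.upsert(vectors=[(doc_id, model.encode(text).tolist())
                      for doc_id, text in docs.items()])

# Searching works the same way: embed the query, then run a similarity search.
results = index.query(vector=model.encode("what indexes embeddings?").tolist(), top_k=1)
print(results)
```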
Starting point is 00:14:24 I see. It also brings up for me a follow-up question. So you spoke about how you need to connect the models that you train to Pinecone, and that makes perfect sense in order to be able to store and index them. However, it's a very, very common scenario these days that people retrain their models incrementally. So, what happens when you do that? So, today my model looks like X and tomorrow it looks like X plus delta. Do I have to reconnect it? Yeah, so think about it: if you retrain the model that converts documents to vectors, then your corpus of documents might not have changed, but the vector representation changed. And so now you have a completely separate index. You have another index of vectors, really, to work with. What
Starting point is 00:15:25 we allow you to do is actually to have both of them run in parallel, and to have a router in front of them so you can run your A/B test. So you can say: my text search, my neural search application works with Pinecone as the backend. I will have both indexes live, both vector representations from both my models live, and I will just route some fraction of traffic between them, so you can run your A/B tests with it. I think you're raising an interesting scientific question, which I would love to think about, but which unfortunately will probably take a long time to think about. Whether, if the embeddings are close to the original ones,
Starting point is 00:16:22 you can somehow morph one index into the other and not re-index everything. And that's a very good question. Yeah, that's precisely what I'm saying. Mathematically, this is impossible, but that's an amazingly good question. Okay. Well, you know, I'm sure it is. That's why I asked it.
Starting point is 00:16:48 But, you know, it's not an easy one to answer. I didn't want to sound surprised. I mean, it's a very good question. Okay. Well, you know, it's one of the hardest things when you deal with this whole machine learning pipeline, actually. All these moving parts, your dataset is evolving and your model is evolving at the same time, and just synchronizing all of those is probably one of the hardest problems around to solve. So, I'm not surprised that you haven't actually cracked it. No, I mean, to be honest, the data evolving is something we obviously support. You incrementally update and delete data all the time. In fact, this is one of the hardest things to achieve, which we have: you can update the vector index and the vectors are searchable within, actually, microseconds; but definitely, for the application, you can count on milliseconds.
Starting point is 00:17:50 We can update hundreds of thousands of vectors a second. Okay. And so the data evolving is 100% something that is a big part of what we do. Like I said, when the model is retrained, you can re-index everything very quickly and switch to that seamlessly. And so it's definitely something we care about.
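To tie this back to the A/B router described a moment ago, here is a hedged sketch of the re-index-and-switch pattern: keep the old and the retrained embedding model live, each paired with its own index, and route a fraction of traffic to the new pair. All names here are hypothetical, and a query must always be embedded by the same model that produced the index it is searched against.

```python
import random

NEW_MODEL_TRAFFIC = 0.1  # send 10% of queries to the retrained model


def search(query_text, old_model, new_model, old_index, new_index, top_k=10):
    # Model and index are switched together, since vectors from one model
    # are meaningless in the other model's index.
    if random.random() < NEW_MODEL_TRAFFIC:
        model, index, variant = new_model, new_index, "B"
    else:
        model, index, variant = old_model, old_index, "A"
    results = index.query(vector=model.encode(query_text).tolist(), top_k=top_k)
    return variant, results  # log the variant to evaluate the A/B test
```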
Starting point is 00:18:16 I think the question that you asked is, frankly, a rare setting, in which the model is actually live, incrementally training on live data and constantly deployed, so that the freshest data and the freshest model are always the ones being used. I know very few places where that's the case, but still, it's a very challenging setting and it's very interesting to think about. Okay, let's shift to another part of the process, the querying part. I also read some of the material that you have available online, and by reading it, I mean how you basically try to explain what a query is in
Starting point is 00:19:07 the first place and how you serve that. And that kind of made me wonder if this is an actual thing for your space. Do people actually query embeddings, or is all they care about, in the context of their models and their application, whether something is similar to something else? Can you clarify? I'm not sure I understand the question exactly. Are you asking if... Okay, go on. Yeah, sure. I'm going to try to give an example so maybe it becomes more clear. So, you have all your vectors and you have your index of vectors. And in theory, you have something that, on the face of it at least, looks like a query language. So I was wondering if that's actually used as such
Starting point is 00:19:56 in use cases that you're aware of. So do people actually use Pinecone, for example, to ask, I don't know, bring me all the documents whose vectors correspond to values that are such and such, I don't know, after that date, and include this word, and this kind of thing? Yeah, 100%. I mean, that's why we built our database that way, because that's how people want to use it. When you deal with high-dimensional vectors, you don't
Starting point is 00:20:28 have "this word appears in the document or not," because you don't have a document. You don't have "timestamp is larger than something," because you don't have a timestamp. You don't have SQL. You don't have terms and documents. You don't have the regular constructs of a database. And so you have to communicate your needs in a different way. When you look at two numbers, you can think about them as X and Y coordinates on a sheet of paper, right, on the regular axes, and they correspond to some location, some point on your page, some dot, right?
Starting point is 00:21:16 If you look at a thousand-dimensional vector, that's a list of a thousand numbers. You can think about it as a dot in a thousand-dimensional space, right? It might be hard to imagine a thousand-dimensional space, but nevertheless, mathematically, it's exactly the same thing. It's a dot in a thousand-dimensional space. And now you want to somehow try to retrieve maybe that data point. So what do you know about it? You can say, okay, I know where it is. So I want to describe to you,
Starting point is 00:21:51 for example, give me all the data points around it, okay? Because maybe I'm doing some similarity, okay? And so that query of give me the 10 closest points to some location in space, or give me everything inside some ball in space. So a ball centered somewhere with some radius, right? That's a geometric construct, right? And it sounds maybe very abstract. But when you deal with high-dimensional vectors, A, that's the only thing you have. But B, machine learning practitioners are very used to doing this. This is exactly the language that they use, right?
Starting point is 00:22:25 Give me everything inside a cone, or behind some hyperplane, in some half-space. And again, those sound mathematical and abstract for non-practitioners, but for people in the field, that is, you know, exactly how they communicate what data they want
Starting point is 00:22:45 and how they retrieve information from such a database. Okay. Actually, that's precisely the reason I asked you this. Because when you say that you have a database for vectors, for someone like me, who's not a machine learning practitioner, I'm trying to relate it to something I know. So do you have a query language? Does it work like SQL? Apparently, from your answer, not exactly. So the analogy is not one-to-one.
Starting point is 00:23:11 So, yeah, I mean, I think SQL has... It's not like SQL, so you can... It's definitely, you know... NoSQL is an old buzzword at this point, but it's definitely not a SQL database.
Starting point is 00:23:29 But it does have its own query language, which I suppose has its own expressivity and the things you can do with it. All the creates, inserts, updates, deletes, and so on. Correct. And the kinds of queries are geometric queries. They're things that have to do with nearest neighbor search, with similarity search, with cosine similarity search. They have a lot of names. These are technical terms that might not mean a lot to non-practitioners. But again, for practitioners, those are very common things. And so, you know, there are many thousands of academic publications and technical reports on how these are used in practice
Starting point is 00:24:10 for a flurry of problems, anywhere from recommendation to anomaly detection, to similarity search, to deduplication, to data fusion, you name it. Okay. So it sounds like you didn't actually, you know, like invent your own query language from scratch, but rather you encoded operations that people were already using.
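For non-practitioners, the geometric queries discussed above can be written out naively in a few lines of NumPy. This is only an illustration of the operations themselves, not Pinecone's API; a vector database answers them from an index instead of scanning every vector, and all names and sizes below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 256))  # the indexed collection
query = rng.normal(size=256)               # "some location in space"

# Euclidean distance from the query to every vector in the collection.
distances = np.linalg.norm(vectors - query, axis=1)

# "Give me the 10 closest points to this location": nearest neighbor search.
nearest_10 = np.argsort(distances)[:10]

# "Give me everything inside a ball centered here with this radius."
radius = 20.0
inside_ball = np.where(distances <= radius)[0]

print("10 nearest ids:", nearest_10)
print("ids inside the ball:", inside_ball[:10], "...")
```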
Starting point is 00:24:33 Correct. We did not invent the problem. We invented the solution. Okay. So I think we're almost out of time. So let's wrap up with a more, well, operational and kind of business-oriented question. Another thing that drew my attention was the fact that it looked like for the time being you only have a cloud-based solution. I was wondering,
Starting point is 00:24:58 okay, it makes sense why you may want to start with that, but I was wondering if offering an on-premise solution is in your roadmap as well. The answer is no, but there is a very good reason for it. First of all, as a point of curiosity, I speak with customers every day, oftentimes more than once a day. Amazingly enough, I'm asked about on-prem in every conversation. But the shocking thing is,
Starting point is 00:25:42 nine times out of 10, when people say on-prem, they actually mean a public cloud, but in their own VPC. Even on-prem has lost its meaning. They don't actually own any physical machines. On-prem used to be, like, my actual physical machine somewhere. The term on-prem has already changed. So people ask me, can you work on-prem? And then we say no. And they say, no, no, we don't mean actual on-prem, we mean on-prem in our own AWS accounts. Okay. Maybe that means that you don't
Starting point is 00:26:21 actually talk to people who work in regulated industries, because for those people, on-prem actually means, you know, good old on-prem, in many cases, right? And so I don't want to say that those customers don't exist or that they don't have a legitimate need for our product. All I'm saying is that the world is moving in a very clear direction. For us to give the kind of experience that we want to give our customers, which means fully elastic and auto-scaling, fully managed so you don't have to wake up at night and maintain anything. We have folks on call, monitoring everything, with
Starting point is 00:27:10 alerts set up. And so everything is managed, everything is elastic, everything is kind of hands-free for you. Our ability to do that, and to do cost-cutting on your behalf, is only possible in the cloud. When we control everything, we can actually spin down resources, we can improve our operations, and we can monitor and fix stuff. If we ran on-prem, we just couldn't offer that kind of service. In the new world, we see that people have a business problem. They want to
Starting point is 00:27:49 build a better recommendation engine for their shopping site or a better text search engine for their documents. They're not in the business of maintaining, you know, distributed systems and infrastructure in the cloud. And they want that service. Right. So there's a reason why we need to operate in the cloud. But we also see that the world is moving in that direction. Even regulated industries, I think, will move to some version of a public cloud, maybe more regulated, more secure, maybe fenced off in some other way.
Starting point is 00:28:35 But the world in which most large companies actually own compute centers, I think we're moving away from that world. Well, that's a whole discussion in and of itself. So let's not go there at this time, because I know we have to go. One last thing and I'll let you happily go your way. Next items on your roadmap after this funding: growing the team, or go-to-market, or what?
Starting point is 00:29:07 100%. So, you know, we are laser-focused on building the best vector database in the world. That takes a lot of work, both engineering and science. And so definitely growing the team in all three locations, in Israel, New York, and San Francisco. And, yeah, go-to-market. You know, we are very lean on our go-to-market. We are opening our platform so everybody can onboard and use our product hands-free, kind of self-onboard and use it. And so, yeah, we're mostly going to invest in just building an amazing product and keep improving it. Okay. Well, thanks and best of luck.
Starting point is 00:29:56 Thank you so much. Have a great day. I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
