Orchestrate all the Things - Weaviate, an open-source search engine powered by machine learning, vectors, graphs, and GraphQL. Featuring co-founder Bob van Luijt

Episode Date: April 7, 2021

Google uses machine learning and graphs to deliver search results. Most search engines do not. Weaviate wants to change that. Bob van Luijt's career in technology started at age 15, building websites to help people sell toothbrushes online. Not many 15-year-olds do that today, and fewer still did it then. Apparently that gave van Luijt enough of a head start to arrive at the confluence of technology trends today. Van Luijt went on to study arts, but ended up working full time in technology anyway. In 2015, when Google introduced its RankBrain algorithm, the quality of search results jumped up. It was a watershed moment, as it introduced machine learning in search. A few people noticed, including van Luijt, who saw a business opportunity, and decided to bring this to the masses. Article published on ZDNet

Transcript
Starting point is 00:00:00 Welcome to the Orchestrate All the Things podcast. I'm George Anadiotis and we'll be connecting the dots together. Google uses machine learning and graphs to deliver search results. Most search engines do not. Weaviate wants to change that. Bob van Luijt's career in technology started at age 15, building websites to help people sell toothbrushes online. Not many 15-year-olds do that today, and fewer still did it then. Apparently that gave van Luijt enough of a head start to arrive at the confluence of
Starting point is 00:00:33 technology trends today. He went on to study arts, but ended up working full-time in technology. In 2015, when Google introduced its RankBrain algorithm, the quality of search results jumped up. It was a watershed moment, as it introduced machine learning in search. A few people noticed, including van Luijt, who saw a business opportunity and decided to bring this to the masses. I hope you will enjoy the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn, and Facebook.
Starting point is 00:01:07 And secondly, so my name is Bob van Luijt. I'm one of the co-founders of SeMI Technologies, which is the company that's created around the vector search engine, Weaviate. I'm of the generation that almost grew up with the internet. I started doing things online when I was very young, creating websites, those kinds of things. The first time I actually made money off the web was when I was 15 years old. I just had a side job, you know, and the guy was selling toothbrushes. And I overheard him saying, like: hey, we actually can sell these toothbrushes online. And I said: you know what, I can help you build a website.
Starting point is 00:01:54 And that's literally how my career in technology started. Then I went off to study. I studied arts. And while I was studying, I was still working in software. And back then, software still had like... If you wanted to do something in software, it often was like a focus on, you know, studying mathematics or studying computer science. And I, you know, I had another view on things that I could create online. Actually when I was in my early 20s and when I was done studying, I learned that there was a lot of opportunity online to build tools,
Starting point is 00:02:32 to build software, so I ended up really full-time in technology. So I've literally been working for 20 years now, almost, if you include me being 15 and helping people sell toothbrushes online. Exactly. So fortunately it evolved a little bit further than the toothbrushes, but it's true that that's really how it started. And then I always worked as a consultant with my own company, and then I founded SeMI. The origin story of SeMI and Weaviate actually comes from the fact that when I was working as a consultant, two things happened.
Starting point is 00:03:14 The first thing was that I was dealing with a lot of unstructured data, like, I guess, everybody who's working in technology. It's a pain, it's a problem, to relate data which is difficult to relate. So maybe even a small data chunk might be structured, but if you have two, for example, from different vendors or different types of products, it becomes difficult. So that was one thing. And the second thing that happened was, I believe it was in 2016, that Google announced that Google search was changing towards RankBrain. And as you can also read on the Wikipedia page of RankBrain, they explain that they use word vectorization to actually make relations in the queries, and
Starting point is 00:04:01 that's how they try to present results. So I was intrigued by that. I was experimenting myself with all these NLP models that were coming out, and then I was at a cloud conference, and I asked somebody, I said, like: are you going to build a B2B solution, a search engine that does this, right, that we can just add business data to, and search through these unstructured data sets
Starting point is 00:04:28 like we can do, for example, with Google search. And then the answer was no. So I thought: that's my opportunity. And that's the origin story of Weaviate. So Weaviate was really created to try to solve the unstructured data problem. That's the origin story. So when was that? I mean, when did you attend that conference and have this aha moment?
Starting point is 00:04:52 I remember exactly where it was. It was in a theater at Mission Street in San Francisco. And it was literally this aha moment. I was like: of course, it makes sense to try to build such a database for businesses. I have to say, though: that was 2016, and now it's 2021. Building a database is very complex. And I have way better people than myself currently working on the technology. But the original idea was this:
Starting point is 00:05:40 NLP machine learning models output vectors. So they place these individual words, back then individual words, in a vector space. And the idea was very simple: what if we take a data object, can be anything, can be an email, can be a product, can be a post, whatever. We look at all these individual words and where the vectors sit
Starting point is 00:06:08 in the space for these individual words. We calculate a new vector position for those words, and that will be where the document sits in the vector space. And that was the original idea, and that turned out to work. So for example, we have a demo data set, and in the demo data set we have all kinds of publications and articles. And then if you ask, for example: okay, what publication is most related to fashion? And Weaviate says: oh, then you need to look at Vogue. And what we've built on top of that is that the data that's in Weaviate is in a graph format, so the moment that you're able to find a node in the graph, you're able to traverse further and find other things in the graph. I was just going to say that
Starting point is 00:07:00 you have already started going a bit deeper, but actually that's why I wanted to stop you before you go too deep. Because for you and me, and maybe some people in the audience, all these vectors and what they do is something we know, but not everyone necessarily knows. So I was going to ask you to maybe take a step back and just do the vector 101. So what is a vector? And why don't databases support vectors out of the box? Why can't I use, I don't know, MySQL or whatever to store vectors? What's the issue with that?
Starting point is 00:07:37 Because this is like a key issue, the key issue that you're solving, basically, the gap that you're addressing. So let's explain to people what vectors are, and what the tough thing is about storing them and retrieving them. Great question. So let's take, for example, the example of recognizing if there's a cat in a photo. That was the famous deep learning example. Of course the machine doesn't literally look at the photo like we humans look at the photo, but the machine
Starting point is 00:08:13 looks at a representation of the photo, and that representation of the photo is in something called vectors. And the easiest way to understand this is to look at coordinates on a map. But they're in a hyperspace, so there are many dimensions; to paint a mental picture, you can see this as three dimensions. But the problem now is this: it was of course great that the pattern could be recognized in the photo, and then it would say yes, it's a cat, or no, it's not a cat. But now the problem comes: what if you want to do that for a hundred thousand photos, or for a million photos, or even more? Then you need a different solution. Then you need to have a way to look through the space and to find similar things. So for example, you could have a
Starting point is 00:09:01 photo of a cat: okay, show me similar photos. So what would happen is that when it gets vectorized, this photo is placed in that space. And then it looks at what other things are in the neighborhood in that space. And that is how it works for photos. But for example, for natural language processing, it works in a similar way. And what I often give as an example there is that if
Starting point is 00:09:26 you go to the supermarket and you have a shopping list, and the shopping list says: I need washing powder, apples, and a banana. If you go into the three-dimensional space of the supermarket and you find an apple, then you know that the banana is going to be closer by than the washing powder. And if you move in the space towards the washing powder, you know you're moving away from the apple and the banana, which are grouped in the fruit section. This is literally how you could paint a mental picture of how a machine learning model works. So the model tries to arrange the data, these individual data points, in that space. And sometimes there are 300 dimensions, sometimes 1,200 dimensions, but the model
Starting point is 00:10:17 constantly tries to arrange stuff in that space. So that's how it works, in layman's terms, under the hood. Okay, so one kind of naive, let's say, approach: obviously, when you're working with machine learning algorithms, you need the ability to store and retrieve vectors, and to do things like similarity search and all those things. So a naive approach would be: okay, vectors are basically a long series of numbers, so I can use any database to store them. Would that work, or what's the problem with that? Why don't people just do that? Yeah, that's a great question. Well, actually, people do do that. And you can compare this a little bit with working with Excel. So if you
Starting point is 00:11:11 work with Excel and you have a lot of data, at some point it's not going to work anymore. Excel is not able to process the amount of data that you have anymore. So if you want to make a pivot table or make a graph, it's just too much data. So then you need to transfer from Excel to a database. The same problem exists with these vectors. So what you see with search engine databases is that they're often built for a specific use case. So if you peel off the onion and you see what's at the heart, then you see that there's an engine at the heart that
Starting point is 00:11:46 solves specific problems. So for example, if you look at Solr and you peel off the onion, you get to Lucene. And Lucene is good at text-based searching and keyword matching. It's not really built for relating vectors at a high scale. So the reason why it becomes interesting to create a vector search engine, or a vector database, is that if you have cases where you don't want the raw text search at the heart, but you want the vector search at the heart, you need different algorithms than, for example, Lucene, to search through them. So in our case, the first one that we use is the so-called HNSW algorithm. There are more; for example, Google recently released their own algorithm. And so these vector search engines have that at the heart. So
Starting point is 00:12:39 Weaviate takes that as the heart to actually grow from. So that's why it becomes easy to do similarity search or classification in Weaviate rather than doing it with a traditional database: you simply can't scale the traditional database to the size that you might want to scale it to. Another question, and that's something which I find quite important, actually, and I asked the same question to Edo Liberty from Pinecone, which is another vector database, let's say: an important point is to understand precisely this encoding. So how do you actually do the translation, let's say, between, you know, your real-life object, or data object, or whatever
Starting point is 00:13:25 you want to call it. So to stick to the three dimensions, which is kind of easier to understand, and to extend on your example with the banana: in order to encode the banana into something that the vector engine can understand, you would maybe, in a three-dimensional space, give it some x, y, z coordinates. But the way you do that is actually very specific to your problem, to your encoding. So if you're using different ways of mapping that, and different algorithms which work with different mappings, then you kind of get this semantic mismatch, basically. So you need to encode and decode in the exact same way in order for that to work,
Starting point is 00:14:16 basically. So how do you manage that in Weaviate? Yeah, that's a great question, actually. Because what you're saying is true: you need to decode and encode in that same space. So in the example of the supermarket: if I go to supermarket A and I find the banana at a certain coordinate, that doesn't mean that if I go to another supermarket I will find another banana at the same coordinate, right? So that's the problem. So, two answers to that question. The first part is related to the vector search engine itself. The vector search engine is agnostic about where the vectors come from. So you just create a data object and you say,
Starting point is 00:14:57 this data object is represented by these vectors. So that's one part of the answer. However, people often also want to somehow vectorize, right? So they don't necessarily, sometimes they want to do that themselves, but in the majority of cases, they want that to be done for them. And how we do that in Weaviate is that we have a module structure. So Weaviate has modules, and you can choose a vectorizer module. And these modules are good at certain tasks. So some are more general purpose. So they might are good at certain tasks. So some are more general purpose. So they might be good at news articles. Some might be better for cybersecurity cases. Some might be
Starting point is 00:15:31 better for healthcare related cases, et cetera, et cetera. So what we also do at Semi is that we present these modules and then we say, okay, if this is your use case, you can use this vectorizer. If that's your use case, you can use that vectorizer. If that's your use case, you can use that vectorizer. And if you're a developer and you want to go a little step deeper or data scientist and you want to go a step deeper, you can even create your own module to vectorize. But what we see in the majority of the cases is that the vectorizers that come out of the box already are good enough to solve the problems.
Starting point is 00:16:08 So that's how we solve it. We offer a vectorizer on top of Weaviate, and then it's the responsibility of the user, basically, to use the same vectorizer when reading and writing and updating and doing whatever, I guess, right? Yes. However, of course, this is complex technology, right? So for people to work with, it's difficult. So one of the goals that we have building Weaviate, specifically focusing on the UX of the APIs of Weaviate, is that we want to take that abstraction layer away. So if you work with Weaviate, you're not really aware that the vectorizer is determining these vectors. You can see it, of course, if you so desire. But if you just want to run it as a database, add data, and trust the vectorizer to do its job, you can also do that. And that's
Starting point is 00:16:56 how we see that the majority of people use it. They just run a Weaviate, throw in the data, which gets vectorized, start doing their queries, and solve the problem that they have. Okay, so can that, or should it actually, be customized for people who want to work with specific machine learning frameworks? So, for example, if I'm using PyTorch for some of my applications and some other people are using TensorFlow, I guess these frameworks have their own way of vectorizing, probably. So does that integrate with Weaviate in some way? That's one part of the question. And the second part, you know, has to do with the framework. If you go to a deeper level of granularity, it also has to do, I guess, with the specific model
Starting point is 00:17:50 that you're training, which may be, and in most cases these days actually is, retrained. So the way that it vectorizes something today may be in theory different from how it does this in a month from now. So how do you keep consistency in those cases? That's a very good question. So the first answer to the question is a simple yes. So the whole goal that we have with building the ecosystem around Weavegate with the modules,
Starting point is 00:18:21 to allow people to make it as easy as possible to use their models to vectorize the data. If they don't want to build a model themselves, that's also fine because then we give them a model. So we say, okay, here you have a model, you can use it to vectorize. We try to make it as simple as possible. The second part of the question is, yes, so the moment that a model sits in, is used with Reviate, you kind of, for that moment, you're stuck to that model, right? So you need to use that model. We currently are looking at ways of efficiently re-indexing if you change the model.
Starting point is 00:19:00 But the thing is, in most cases, it's not really needed. So let's say that you have a model which is fine-tuned for a cybersecurity case, for argument's sake. Then the data in Weaviate is constantly changing, because you add stuff, you search, you get stuff out, but the model doing the vectorizing stays the same. And there's another upside: because it's a stateless model, you can also scale it horizontally to speed up the process of vectorizing and importing data into Weaviate. So yes, we are looking at what we can do for re-indexing, but to be honest, we have at the moment zero use cases where that is actually needed. But if somebody has a use case, I would love to hear it, of course,
Starting point is 00:19:50 but most users just use the model that they started with. Okay. Yeah, I mean, I guess it also has to do with the lifecycle of models, because at some point, I guess they will be updated. It's just a question of when and whether your use cases so far have hit that threshold in time, let's say. Again, this is something you will have to come up with a solution for. Yeah, you can compare that with a traditional database, right? So let's take MySQL. I could imagine that if you are running a certain version of MySQL and you get a new version, you're going to test it out first. You set it up in a staging environment, load the data in,
Starting point is 00:20:37 see if it performs better. You do all these kinds of things. That's really how we see changing the model currently: it's just an update to your stack. It's not the case that the model needs to be updated to be effective. So if you have a use case where you use it to search through documents, for example, you can easily do a year with the same model. What we do have within Weaviate is something we call transfer learning. So what you can do is you can use that model, but in near real time, you can teach it new concepts if something happens.
Starting point is 00:21:17 So for example, in news events or those kinds of things. So that's something we do have. The problem that we wanted to solve there is not having to retrain the model, because of course training models is very expensive. Yeah, that's true, and a very good point. And actually a whole topic in and of itself, so let's not go too deep into that, but it's very timely, let's say. I see many people actually worried about, like: okay, this whole training and retraining cycle is getting a bit out of hand, let's say. But let's not go there, at least for the time being, and
Starting point is 00:21:57 just to follow up on what you mentioned briefly about use cases, it may be a good opportunity to mention a few of those. So what kind of scenarios is Weaviate used for, and perhaps some clients that you can mention? Yeah, sure. So of course, we are a startup, and what you do is you try to investigate what's the best way to apply this technology, and in which industries. And an example of such an industry
Starting point is 00:22:28 is the FMCG and retail industry. So what we see there is that there's a lot of unstructured data: product descriptions, data in ERP systems, invoices, those kinds of things. And they often need to be somehow related to a structured way of working internally. So, for example, it can be a lot of work to actually make these relations. So there we see a lot of applications for Weaviate. And a nice use case example that we have there is something that we do with Metro in Germany. So the challenge that they had was that in their CRM system,
Starting point is 00:23:27 they somehow wanted to figure out what opportunities there were in the market, what potential new customers there were. So what we did with Weaviate is that we loaded customer data from existing data systems into Weaviate. We also looked at public data sources like OpenStreetMap, and Weaviate started to try to make relations automatically between customers and the public data. The moment that it could not make such a relation, it said: hey, this is a potential new customer. So a new restaurant might have opened in Berlin; it wasn't a customer yet, but it was already in, for example, the OpenStreetMap data that I mentioned, and then Weaviate said:
Starting point is 00:24:11 I can't make this connection. And therefore, they knew within a few milliseconds where to go and where to try to find new customers. So that's an example in the FMCG space, but we're also finding a lot of similar cases, and we're currently researching industries like cybersecurity and healthcare as well. But the problem at the root level is always the same, and that is: somehow, unstructured data needs to be related to something internally structured. That is a constantly
Starting point is 00:24:45 recurring pattern, over and over and over again. And that's what we focus on, and those are the problems that we try to help solve in different industries. Okay, that's interesting, and also a good opportunity for me to ask something that I wanted to ask, because I also have to say that this is how I originally came to know Weaviate: as a kind of, well, how to frame it best, maybe graph embedding search engine, let's say. I think you mentioned at some point that you're using a graph data model, for example. But it's a good opportunity, because the example you gave about what you do with Metro seems, on the surface at least, a bit
Starting point is 00:25:36 counterintuitive for me, and I'll explain what I mean. So you basically try to find where it's not possible to find a connection. People actually, most of the time at least, use graphs to do the exact opposite: to discover connections. So that's why I say it's a bit counterintuitive. So I wonder if you'd like to say a few words about the specifics of your data model and how graphs relate to that. Why did you choose it, and how do you use it? Yeah, this is actually a great question.
Starting point is 00:26:14 And it's actually very much related to the graph, but maybe in a different way than you're looking at it. So let me give the Google search example again. In Google search you have the knowledge graph, with, for example, Wikipedia entries and those kinds of things. But the problem is, how do you know which nodes from the graph to show?
Starting point is 00:26:41 So a nice example: the other day I had a conversation with somebody, and we talked about this, and we did the query for the fish from the Disney movie Finding Nemo, and the answer was clownfish. So the powerful thing there was actually finding, somewhere in that graph space, the correct node to show, which in this case was clownfish. Now, the finding of nodes in the graph, so the right nodes in the graph,
Starting point is 00:27:24 or not being able to find them, says a lot, right? You can get a lot of insights from that. So to answer your question: one of the things that we've learned is that focusing on searching in that graph and finding these data objects can actually be very powerful, because it also says something if you can't find something. So sometimes you can find it, and then it's very important and powerful. Sometimes you can't find it, and that's also important. Because if you can't find it, then we can say: hey, you're looking for something that's not in the graph; maybe you want me to represent it in the graph. So one of the things that we also have is automatic classification, where we do exactly that. So you can load your
Starting point is 00:28:09 data in, and you can ask Weaviate to make the graph reference relations itself, just based on where things sit in the space. So then Weaviate tries to make a relation if there is none, and that's again an example of working with not being able to find something in the graph, but actually creating value out of that. Okay, then I also have to ask you about how you access that, because going through your material I saw at some point something that struck me, about a GraphQL interface. And a little bit of semantic clarification may be necessary here, because even today, for lots of people, when they hear GraphQL they think of it as a graph query language, and it's not precisely that;
Starting point is 00:28:58 it's more like a meta-API query language, if you want to call it that, but still an interesting choice. So, why did you choose it, how do you use it, and does it relate to your underlying data model at all? So, this is a really good question. This has to do with UX, with user experience, rather than anything related to the database per se. And it was this: I'm personally a strong believer that expressing our data in a graph-like fashion is just the future. For me, it doesn't make a lot of sense anymore to not do that. I mean, I understand, with data warehouses and just stacking a lot of data, I understand,
Starting point is 00:29:48 but more if we want to relate data to each other, the graph model to represent the data just makes the most sense. But then the second question is: okay, so how are we going to give people access to this information, right? So if we want to give them a lot of capabilities, so that they can do a tremendous amount of stuff, then you might be looking at something like SPARQL.
Starting point is 00:30:11 But on the other hand, if you want to make it simple for people to access the graph, so that they just have a very short learning curve, something which is very easy to learn, then GraphQL becomes interesting, because most developers who are unfamiliar with graph technology, if they see SPARQL, they start sweating and they get nervous. And if they see GraphQL, they go like: hey, I understand this, it makes sense. There's another upside to GraphQL,
Starting point is 00:30:45 and that is the amount of community work happening around it. So with interfaces for different programming languages and software applications, you name it, a tremendous amount of libraries is available. And because we've just used plain GraphQL as an interface, it's really easy to use all these libraries as well. And, to answer the data
Starting point is 00:31:14 model question: the data has an RDF-like, class-property, graph-like data model, which is ideal to represent in GraphQL. So, in the example that I gave with the news publications, it's very easy to make a query saying: show me all articles that are about housing prices, and show me in which publications they appeared. A very easy query in GraphQL. So that's the reason why we chose GraphQL, and it seems to be paying off, because we're getting a lot of positive responses to the GraphQL interface. Yeah, I mean, sure, what you mentioned is true about GraphQL's popularity; it's something I've seen as well, and I guess everyone who works in that field in one way or another also sees that, so probably a good choice.
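As a rough illustration, the housing-prices query described here might look something like this in Weaviate's GraphQL interface. The class and field names (Article, Publication, inPublication) follow the demo dataset discussed in this episode, and the nearText operator is an assumption about the module in use; treat the exact syntax as a sketch rather than a reference:

```graphql
{
  Get {
    Article(nearText: {concepts: ["housing prices"]}) {
      title
      inPublication {
        # graph reference: traverse from the article to its publication
        ... on Publication {
          name
        }
      }
    }
  }
}
```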
Starting point is 00:32:14 However, your reply actually triggered my curiosity, so now I have to ask you, because you mentioned, for example, being able to ask specific questions about entities and things like that. So GraphQL does have a schema, a kind of way to define a schema, and you also referred to RDF-like structures. So I wonder if you have any schema mechanism supporting Weaviate, and if you do, how does one manipulate it? How do you define a schema? And how do you actually map a schema to the vector space? It's a great question. So if you run a Weaviate, you just start one up, right?
Starting point is 00:33:01 So it starts running. Then the API endpoint becomes available. And the first thing that you need to do is that you need to create a class property schema. So in the example of the article dataset, you might say I have the class article and I have a class publication. And then you can say the article, there's a title, an abstract, and is in a publication or appeared in publication, and you can make a reference, a graph reference to the publication. So that's the data model. Now, if you start to add data to Weavey8, for example, an article, you simply have to say, here you have the title and the summary of an article,
Starting point is 00:33:45 so of the class article. You send it to Weaviate and Weaviate takes care of the rest. So now it automatically gets vectorized if you decided to use a vectorizer, and it becomes part of the graph-like model where you can see if you say show me all articles, the article will be part of that search result. So what you now can do is both do the
Starting point is 00:34:07 ML-based search, so you can say, show me all articles about housing prices, but then you can further traverse through the graph and say, okay, in which publication did these articles appear. So you're now able to mix the machine learning searches and traversing the graph to find more information about the data object in one go. That's very interesting. Actually, I didn't realize that that was the case. And yeah, you're right. It actually goes beyond vectors in that case. So do you maintain multiple indexes to be able to do that? That's a good question, and I'm not 100% certain what the
Starting point is 00:34:50 answer is, because that goes very deep into the tech. My colleagues could definitely answer that question. Yeah, fair enough. But the important thing is that we really were focusing on the searching-in-the-graph part. So we are also not saying that we are a graph database. We're not, because we are a vector search engine, but we've adopted the whole graph-like thinking to represent the data to our end users, because it's very intuitive for people to understand that that's the way they can retrieve the data. It's an easy way for people to use Weaviate.
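The class-property schema described above can be sketched as a payload for Weaviate's REST schema endpoint. The property names, and the idea of expressing the graph reference as a dataType pointing at another class, are illustrative assumptions rather than a verbatim API example.

```python
import json

# A sketch of the Article/Publication class-property schema described above,
# in roughly the JSON shape Weaviate's schema endpoint accepts. Property
# names and data types here are illustrative assumptions.
publication_class = {
    "class": "Publication",
    "properties": [{"name": "name", "dataType": ["text"]}],
}

article_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "summary", "dataType": ["text"]},
        # The "graph reference" mentioned above: a cross-reference linking
        # an Article to the Publication it appeared in.
        {"name": "inPublication", "dataType": ["Publication"]},
    ],
}

# Against a running instance, each class would be POSTed to the schema
# endpoint; here we only render the payloads.
print(json.dumps([publication_class, article_class], indent=2))
```

Once both classes exist, sending an object with just its title and summary is enough: as described above, vectorization happens server-side if a vectorizer is configured, and the object becomes part of the graph-like model.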
Starting point is 00:35:35 I even sometimes give business demos where I demo the GraphQL API, and even people who don't necessarily have a developer or data science background intuitively understand the GraphQL API when they see it. Yeah, it's one of the benefits of GraphQL, this simplicity. To come back — sorry, I'm maybe asking you a few too many questions about that, but I'm also learning and interested in figuring out what you do exactly and how you do it. So I wonder if defining a schema is necessary, like required, to work
Starting point is 00:36:13 with vectors, or can you do it without a schema as well? So you need to do it to work with Weaviate, but this is regardless of doing something with vectors. We have use cases where people just want to do a document search and they just have one class, which is Document. And then they search through these documents. So it really depends on the use case. But what we see is that, again, I'm a strong believer in the whole UX part of the API.
Starting point is 00:36:45 It makes it very intuitive for the end users that, even if they only have one class, they can say, okay, I want to find a document, and I want to search for this search term. So it's really choosing the graph, and with that GraphQL as the query language, which really was also a UX decision. Because, as we discussed earlier, this is complex technology. And so what we aim to do is make it as easy as possible for the developer, and therefore the end use case they're solving, to start Weaviate, start working with it, and solve the problem.
Starting point is 00:37:26 So they don't have to learn different query languages. They don't have to learn different practices. Hopefully, and that's our goal, it's very straightforward. And again, on the topic of schema: is it possible to do something like reuse, import existing schemas, be it in GraphQL or even other types of schemas? And yes, I realize it's very case dependent.
Starting point is 00:37:54 Some people may not need one at all. Some people just need maybe a schema with Document and that's it. I'm just wondering, if somebody wants to use something more elaborate, would they be able, for example, to reuse something that they have already created in some other application, or something that exists in a repository, or something like that? Yeah, definitely, and that's also not uncommon. So what we often see in the use cases — and I can give you an example there — is that the existing schema of a graph is used within Weaviate. So just a traditional
Starting point is 00:38:34 schema is used in Weaviate to solve new problems. A practical example: we are currently also looking at the cybersecurity space, and there's a well-known framework in the cybersecurity landscape called the ATT&CK framework, and the ATT&CK framework has a graph-like structure. So it says, okay, I have certain cybersecurity threats, they have names, there are certain mitigations, which are graph relations, and those kinds of things. So what happens is that this traditional graph schema is loaded into a Weaviate with the threats, with the mitigations, etc. But now the powerful thing is that people can start to use the machine learning model to search through that graph. So they say, hey, I see this happening, or I see that happening,
Starting point is 00:39:26 but what do you think this is? And then Weaviate would say, I think that you need to start at this node in the graph, because that seems to be the most closely related to the problem that you were describing. So this is how we try to marry the traditional approach, if I may call it like that, and the new machine learning approach, to build a bridge between them. So basically the answer to you is yes, and that actually happens a lot. Okay, so what format can be imported in Weaviate? I mean, what schema format
Starting point is 00:39:59 would it be? I don't know — RDF, RDFS, or some DDL description from SQL, or what is supported? So currently, the API endpoints from Weaviate, or the clients, just take the class-property structure. Anything that represents a class-property graph can be loaded into Weaviate. If you have a different type of graph, then it becomes a little bit more difficult. But I also would like to say that if people have problems there, that they can't represent their schema for whatever reason in Weaviate, then we would love to know, of course, because then we also know how we can improve it.
Starting point is 00:40:43 Okay, cool. Yeah, I know it's quite involved, actually, this conversation about schemas and graphs and whether you import them and how, but it was just surprising for me to hear that you do that, and that's why I wanted to know more. Yeah, and I would like to emphasize again that that is of course really going down into the tech, but the end goal, the overarching end goal, is that we really want to solve the unstructured data problem, right? So how are you going to get insights from your unstructured data, and how are you going to somehow map that to something that you can understand and use in your day-to-day business processes?
Starting point is 00:41:26 And that's why we've chosen that model to structure Weaviate like that. And another way to visualize it is that I sometimes say, if you look at a graph, the representation is kind of two-dimensional. So you have one node and another node and there's a connection, but the distance that they have to each other doesn't really matter, right? For example, if you take the movie Finding Nemo, and then there's "is produced by", and then you find Disney — how far the distance is between the two doesn't matter. But with a vector search engine, we say, no, these nodes actually have a vector representation, they sit somewhere in space. So now we can start to search
Starting point is 00:42:18 through the space for similar entities in the graph, and then we can target the graph. So that example that I gave with Google search, like, what type of fish is Nemo: that is first a machine learning problem, to try to find the right node in the graph. And then when it's found the node, it turns into a graph problem, because you want to show that that's a clownfish. And that's what we've been inspired by, that we wanted to do that as well. Because when I started,
Starting point is 00:42:52 and we're now proving that with all the cases that we have, the assumption was: if we just know where to find and target this data object, then we have a head start in solving the problem. Because the problem is not making the structure, or making the schema, or, in more complex cases, ontologies. We believe that the problem is more in finding the right data, and that's what we focus on. So we help people to actually find it. That's also why we say that we are a search engine.
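The two-step idea described above — a vector search to land on the right node, then a graph hop to the answer — can be illustrated with a dependency-free toy sketch. The vectors and the edge below are made up purely for the example.

```python
import math

# A toy illustration of the two-step search described above: first a vector
# similarity search finds the most relevant node, then a graph edge is
# traversed from that node. Vectors and edges here are invented for the demo.
node_vectors = {
    "Nemo":         [0.9, 0.1, 0.0],
    "Finding Nemo": [0.7, 0.6, 0.1],
    "Disney":       [0.1, 0.2, 0.9],
}
edges = {"Nemo": ("isSpecies", "clownfish")}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Pretend vectorization of the query "what type of fish is Nemo".
query_vector = [0.95, 0.05, 0.0]

# Step 1: the machine learning problem — nearest node in vector space.
nearest = max(node_vectors, key=lambda n: cosine(query_vector, node_vectors[n]))

# Step 2: the graph problem — follow an edge from the matched node.
relation, target = edges.get(nearest, (None, None))
print(nearest, relation, target)  # → Nemo isSpecies clownfish
```

In a real system step 1 is done by the model plus a vector index, and step 2 by resolving cross-references in the stored graph; the point is only that the two steps compose into one question-answering flow.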
Starting point is 00:43:26 We're really focusing on helping people to find it, regardless of what kind of data you have. Okay, that's a great way to change granularity, basically, and just go one level up. And you reminded me of something I heard in a recent conversation I had with Gary Marcus, and it was one of the most enjoyable conversations. We talked about all kinds of things, and at some point we even talked about vectors, and graph embeddings specifically, because we were talking about future technologies, emerging technologies, and what he thinks of them and so on.
Starting point is 00:44:06 And so when the discussion came to graph embeddings and vectors, he quoted someone else there, someone called Ray Mooney, who is a computational linguist, and according to him the quote was, well, you can't squeeze the meaning of an entire sentence into one vector. He used some other words in between, but I'm just going to leave them out. And Gary's take on that was, well, yeah, okay, they work reasonably well. They can give you something, some degree of similarity, yes, but he had two main objections. A, what is a good degree of similarity, and B was basically consistency. So he said something along the lines of, well, I haven't seen so far, excuse me, something that works consistently well at scale for this kind of
Starting point is 00:45:02 thing. Obviously, you're kind of on the other side, let's say, because you work with vectors and obviously you believe in them, so you have to think about ways to deal with that. So I wonder what your take is, as a kind of counterargument. No, definitely. So I think I'm also on the other side from a different perspective, right? I understand this argument from an academic
Starting point is 00:45:33 perspective, but I'm also looking at it from a business perspective. And the thing is, the problem that we try to solve is finding something. So if the similarity search does a good job in finding something, then I'm already happy, right? Because then I go, yes, we are able to solve this problem and we solve the use case. So it can be a document or an image or whatever. So that's a different way of looking at the problem. And another way to put it is that the technologies that are coming out and the models that are coming out are already good enough.
Starting point is 00:46:16 And I think a great example of that is, well, hopefully we are an example of that, but another example is, again, Google Search, right? They're doing that a lot. And a lot of people get value from doing searches where these kinds of vectorized technologies are being used. So I find it very important to say that there's a difference with the academic discussion — and I agree with that, it's very difficult to capture real meaning, like linguistic semantic meaning, from a sentence in a vector. But if you look at this from a business perspective, if it's good enough to capture the meaning from that
Starting point is 00:46:52 sentence, you're already good, right? Because you can solve the problem at hand. So that's the second thing. The other problem, how I understand the consistency problem, is this. If you want to vectorize something — so let's say you use a model to try to do question answering on a text corpus — then the text corpus needs to be vectorized, and that takes a lot of time. On CPU it's undoable, and even on GPU, if you want to do that at a huge scale for a lot of end users, it's difficult. It's difficult to be consistent in the performance that you're giving. And that, again, is one of the problems that we
Starting point is 00:47:52 also saw, and that's what we're trying to solve. So if you want to search through two documents, the traditional way of doing it is: take two documents, vectorize them, vectorize the query, compare them, answer. That takes a long time. But what if you have a million documents or more? And that is what the vector search engine aims to solve. So it takes the outcomes of the model as a source of truth and uses that. Then, from a data science problem, it becomes an engineering problem: how are we going to search as fast as we can through them? And that's also why one of my colleagues wrote a nice blog post about doing a similarity search based on DistilBERT in less than 50 milliseconds. Because that is the added value of doing that. And we believe that that's an engineering challenge that we need to solve.
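Why this turns into an engineering problem can be seen from the naive approach: a brute-force scan touches every stored vector on every query, so cost grows linearly with the corpus — exactly what approximate nearest-neighbour indexes exist to avoid. A minimal, illustrative full scan (with random stand-in vectors):

```python
import math
import random

# A sketch of brute-force similarity search: one similarity computation per
# stored vector, O(N) per query. At millions of documents this full scan is
# too slow, which motivates approximate nearest-neighbour indexing instead.
random.seed(0)
DIM, N = 8, 10_000
docs = [[random.random() for _ in range(DIM)] for _ in range(N)]
query = [random.random() for _ in range(DIM)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

qn = norm(query)
# Full scan over all N vectors to find the best cosine match.
best_i = max(range(N), key=lambda i: dot(query, docs[i]) / (qn * norm(docs[i])))
print("best match index:", best_i, "after", N, "comparisons")
```

An index trades a little accuracy for sublinear query time, which is what makes consistent sub-50-millisecond similarity search over large corpora feasible.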
Starting point is 00:48:40 So I hope that answers the two questions. Yeah. Which also brings us to another topic we haven't covered so far. You mentioned briefly in the beginning of the conversation that what you do is actually open source. And I wanted to ask whether there is any proprietary part in it at all, what kind of license you use, and what kind of community you have on GitHub — committers and all of that stuff. So that's one part of the question,
Starting point is 00:49:13 which is kind of the logistics of it, let's say. And then the second part, and maybe the even more interesting part, is the philosophy, the rationale behind that. And I think you already kind of alluded to it in saying, well, it's all about scaling. So people, in theory, can take that and use it on their own systems. But it's all about scaling and elasticity and all of those things. So if you can say a few words about that. Yeah, so there are many reasons why you can choose to have something open source, or in our case, an open source core.
Starting point is 00:49:51 In our case, that is simply for transparency towards our customers and users. So we're not necessarily looking for contributors. I mean, of course, it's nice if somebody contributes something, but we're not asking for it. We're not advertising that property. You have to bear in mind that if I go to a potential customer to sell Weaviate-related services, and I explain what it can do, and even if I can demo it to them, the next thing that they're going to say is, wow, that looks very fancy. It's machine learning, so there's a little bit of this black box element to it. So they often forward it to a data scientist or a software engineer and say,
Starting point is 00:50:28 okay, can you take a look at it? What is this? So that's why we've chosen that open core model: because we can be transparent. This is how we do it, right? So this is how the problem is solved. So what is the business model? What do we do?
Starting point is 00:50:48 Well, we actually sell enterprise services around Weaviate. So, for example, think about the open source license: we take the open source license away and we replace it with an enterprise license. Sometimes there are specific fine-tuned models that people use, or custom-built modules in Weaviate. That's something that we sell. So we have really created a business around that
Starting point is 00:51:13 open source core, but it's not a freemium type of thing or something. The open source core is the core at the heart of all applications that we have, but enterprises have needs that, for example, a data scientist who wants to try out Weaviate, or a small startup, doesn't have. And that's our business model: we make money from support and services around that open source core. Okay, interesting. So it's not oriented towards software as a service.
Starting point is 00:51:50 You don't offer it as a kind of, I don't know, pay-as-you-go or whatever subscription-based model or what have you. You offer services around it. Yeah, so yes, we do offer it as a SaaS, but you know what's very interesting? Something that we've learned through talking to customers is that the majority of them are working with data which is very sensitive. And they've now built their, or they have their own clouds,
Starting point is 00:52:18 or they have these hybrid, private, public clouds that they're working with. So almost all customers ask us the question: can Weaviate run on our environments? And we say, yes, of course. And we understand. They don't want to send their data away, even if it needs to be handled some way, or even if you have a data processing agreement — no, no, no.
Starting point is 00:52:44 They just want to make sure that they have control over that data. And so we said, well, you know, if you can't come to us — we understand and appreciate that — we'll come to you. So we really build our ecosystem around offering these services to them in production. What we often see in practice is that development or staging environments happen on the SaaS offering, or smaller startups and smaller companies use the SaaS offering.
Starting point is 00:53:14 But for these big enterprises, the production stuff, the production data, really needs to be on-prem — either actual on-prem or in the private public cloud. Okay. Yeah. I've actually heard from people the exact opposite rationale as well. And it also makes sense.
Starting point is 00:53:35 And at the end of the day, I guess it depends on where the majority of your current use cases and prospects are, and so where you need to focus. Because I've also heard the counterargument: okay, because our clients are already in the cloud, they don't want to have the extra cost of moving data around, so it makes more sense for them if we offer it as a SaaS in the cloud. Which, you know, you can't object to. And at the same time, if your clients, for whatever reason — regulation or I don't know what — want to run it themselves, then yes, it also makes sense for you to offer it that way. Yeah,
Starting point is 00:54:15 and just to bear in mind, that's practically also how it works. So, for example, we work with the major cloud providers. Let's take Google Cloud. How does that work in practice? A customer says, okay, we work on Google Cloud Platform — for example, or any other, but for the sake of argument let's use Google Cloud Platform. We work on Google Cloud Platform. They have a support license. And what they do is create a project that my team members have access to. We have everything as a SaaS, out of the box, with the push of a button, to load it into Google Cloud.
Starting point is 00:54:55 So now it runs in their project. But we can still offer the same support as we would do with the other SaaS offerings that we have. But if they say, no, no, no, for us it must run on Azure — no problem. With the push of a button we run it on Azure, and it's very easy for us to maintain it like that. So you run on, I guess, all three major cloud providers, and perhaps others as well? Yeah, so we want to offer others, but in all honesty,
Starting point is 00:55:30 all of our customers are on one of those three. Okay, yeah. I'm not surprised, actually. Yeah. Yes. Do you also find — so you described so far a way of working which is, I guess, in a kind of close collaboration with clients
Starting point is 00:55:51 for, you know, whatever reasons, because they want your involvement or support or whatnot. But is it also possible for people to self-service, basically? So they just sign up, get a subscription, and get going on their own? Absolutely. So we have two things for that. One is that we have what we call the Weaviate Console. That's our SaaS offering: you log in, select the Weaviate, click go, and you can start playing around with it. We also offer Weaviate five days for free.
Starting point is 00:56:25 So you just have to click go, and you can make as many sandboxes as you want. So that's on one end. On the other hand, we of course also have customers that say, wait, listen, we have our own data science team that wants to work with Weaviate, but occasionally we need support and we need help.
Starting point is 00:56:43 So just a license for that type of support is enough for us. So that's also possible. It really depends on who's buying and who's using it in the organization. Thank you. I'm wondering then, wrapping up, if you may also want to say a few words about the current status of SeMI, the company: like how many people are currently on the team, whether you see it expanding and in what direction, and so on — future plans in general.
Starting point is 00:57:12 So SeMI currently has nine people, and we're completely distributed. That was already the case before the pandemic started. So we have people farthest to the east in Poland and farthest to the west in the US. What we are currently focusing on is, of course, trying to determine where we add the most value with Weaviate, and in which industries. Or, to use the startup jargon, determining and nailing our niches.
Starting point is 00:58:01 That's what we're doing. And I can proudly say that we are seeing more and more customers coming in from the FMCG and retail side. So we really start to understand how Weaviate adds value in that landscape of ERP systems, data warehouses, et cetera. And to answer your question, what we're focusing on, especially this year, is growing and better understanding the number of industries where Weaviate adds value. So think about cybersecurity, healthcare, those kinds of industries. And if somebody hears this and has a great idea, then make sure to reach out to me.
Starting point is 00:58:46 Okay, cool. Well, let's see if that happens. All right, so yeah, I think we covered quite a lot and at quite a different level of, you know, detail from the very, very specific and technical to the quite abstract and everything in between. So yeah, I'm good on my end. If you want to add something to wrap up. Yeah, so if I may add, then, is that however you look at these types of technologies, right, or you look at it as a developer data scientist,
Starting point is 00:59:19 Yeah, if I may add: however you look at these types of technologies — whether you look at it as a developer or a data scientist — you're of course more than welcome to try it out. If you Google Weaviate, you can't miss it. But also, if you're more interested from a business angle and you want to understand, okay, how can this help in my business and what can we do there, then I would like to invite people to just reach out, because I can always explain how, in their specific industry, Weaviate might help. And last but not least, thank you for having me. Thanks for coming. It was really interesting for me as well. And I guess you could tell, because I asked you quite a few questions.
Starting point is 00:59:54 I hope you enjoyed the podcast. If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.
