Orchestrate all the Things - Weaviate, an open-source search engine powered by machine learning, vectors, graphs, and GraphQL. Featuring co-founder Bob van Luijt
Episode Date: April 7, 2021
Google uses machine learning and graphs to deliver search results. Most search engines do not. Weaviate wants to change that. Bob van Luijt's career in technology started at age 15, building websites to help people sell toothbrushes online. Not many 15-year-olds do that today, and fewer still did it then. Apparently that gave van Luijt enough of a head start to arrive at the confluence of technology trends today. Van Luijt went on to study arts, but ended up working full time in technology anyway. In 2015, when Google introduced its RankBrain algorithm, the quality of search results jumped up. It was a watershed moment, as it introduced machine learning in search. A few people noticed, including van Luijt, who saw a business opportunity, and decided to bring this to the masses. Article published on ZDNet
Transcript
Welcome to the Orchestrate all the Things podcast.
I'm George Anadiotis and we'll be connecting the dots together.
Google uses machine learning and graphs to deliver search results.
Most search engines do not. Weaviate wants to change that.
Bob van Luijt's career in technology started at age 15
building websites to help people sell toothbrushes online.
Not many 15-year-olds do that today, and fewer still did it then.
Apparently that gave van Luijt enough of a head start to arrive at the confluence of
technology trends today.
He went on to study arts, but ended up working full-time in technology.
In 2015, when Google introduced its RankBrain algorithm, the quality of search results jumped up.
It was a watershed moment as it introduced machine learning in search.
A few people noticed, including van Luijt, who saw a business opportunity and decided to bring this to the masses.
I hope you will enjoy the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter,
LinkedIn, and Facebook.
So, my name is Bob van Luijt. I'm one of the co-founders of SeMI Technologies,
which is the company that's created around the vector search engine, Weaviate. I'm of
the generation that almost grew up with the internet.
I started doing things online when I was very young, creating websites, those kind of things.
The first time I actually made money off the web was when I was 15 years old.
I just had a side job, you know, and the guy was selling toothbrushes.
And I overheard him saying like, hey, we actually can sell these toothbrushes online.
And I said, you know what, I actually, you know, I can help you build a website.
And that's literally how my career in technology started.
Then I went off to study.
I studied arts.
And while I was studying, I was still working in software.
And back then, software still had like...
If you wanted to do something in software, it often was like a focus on, you know, studying mathematics or studying computer science.
And I, you know, I had another view on things that I could create online. Actually when I was in my early 20s and when I was done studying,
I learned that there was a lot of opportunity online to build tools,
to build software, so I ended up really full-time in technology.
So I've literally been working for 20 years now almost,
if you include me being 15 and selling toothbrushes, or helping people
sell toothbrushes online. Actually, exactly. So fortunately it evolved a little bit further than
the toothbrushes, but it's true that that's really how it started. And then I always worked
as a consultant with my own company, and then I founded SeMI.
The origin story of SeMI and Weaviate actually comes from the fact that when I was working
as a consultant, two things happened.
The first thing was that I was dealing with a lot of unstructured data, like I guess everybody
who's working in technology.
It's a pain, it's a problem to relate data
which is difficult to relate. So maybe even a small data chunk might be structured, but then if
you have two, for example from different vendors or different types of products, it becomes difficult.
So that was one thing. And the second thing that happened was, I believe it was in 2016, that Google announced that Google Search was changing towards RankBrain.
And as you can also read on the Wikipedia page of RankBrain, they
explain that they use word vectorization to actually make relations in the queries, and
that's how they try to present results. So I was intrigued by that.
I was experimenting myself with all these NLP models
that were coming out, and then I was at a cloud conference,
and I asked somebody, I said, like,
are you going to build a B2B solution,
the search engine that does this, right,
that we can just add business data to the search engine
and search through these unstructured data sets
like we can do, for example, with Google search.
And then the answer was no.
So I thought that's my opportunity
and that's the origin story of Weaviate.
So Weaviate was really created
to try to solve the unstructured data problem.
That's the origin story.
So when was that? I mean, when did you attend that conference and had this aha moment?
I remember exactly where it was. It was in a theater at Mission Street in San Francisco.
And it was just, it was like this, literally this aha moment.
I was like, of course, that's like, this is, it makes sense to try to build such a database for businesses.
I have to say though, that was 2016, and now it's 2021.
Building a database is very complex.
And I have way better people than myself currently working on the technology.
But the original idea was this:
NLP machine learning models output vectors.
So they place these individual words, back then it was individual words,
in a vector space.
And the idea was very simple.
That what if we take a data object,
can be anything, can be an email,
can be a product, can be a post, whatever.
We look at all these individual words and how they are, where the vectors sit
in the space for these individual words. We calculate a new vector position for those words,
and that will be where the document sits in the vector space. And that was the original idea,
and that turned out to work. So for example, we have a demo data set, and in that demo data set we have all kinds of publications and articles.
And then you can say, for example: okay, which publication is most related to fashion?
And Weaviate says, oh, then you need to look at Vogue.
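The idea Bob describes, averaging word vectors into a document vector and then searching by proximity, can be sketched like this. The tiny hand-made 2-D "word vectors" below are invented for illustration; real models use hundreds of dimensions:

```python
# Toy sketch: place a document in vector space by averaging its word vectors,
# then find the nearest stored document. All vectors here are made up.
word_vecs = {
    "fashion": (0.9, 0.1), "style":  (0.8, 0.2),
    "finance": (0.1, 0.9), "stocks": (0.2, 0.8), "markets": (0.15, 0.85),
}

def doc_vector(words):
    """Average the word vectors -> the document's position in the space."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))

def distance(a, b):
    """Euclidean distance between two positions in the space."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

publications = {
    "Vogue": doc_vector(["fashion", "style"]),
    "Financial Times": doc_vector(["finance", "stocks", "markets"]),
}

# "Which publication is most related to fashion?"
query = doc_vector(["fashion"])
best = min(publications, key=lambda name: distance(publications[name], query))
print(best)  # -> Vogue
```

The same nearest-neighbor lookup works unchanged whether the stored objects are words, articles, or whole publications, which is what makes the averaging trick useful.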
And what we've built on top of that is that the data that's in Weaviate is in a graph format. So the moment that you're able to find a node in the graph, you're able to traverse
further and find other things in the graph.
I was just going to say that
you have already started going a bit deeper, but actually that's why I wanted to stop you
before you go too deep. Because, actually, yeah, I mean, for you and me and maybe some people in the
audience, all these, you know, vectors and what they do, it's something we know, but not everyone
necessarily knows. So I was going to ask you to maybe take a step back and just do the vector 101.
So what is a vector?
And why don't databases support vectors?
Like out of the box, why can't I use like I don't know, MySQL or whatever to store vectors?
What's the issue with that?
And so because this is like a key issue, the key issue that you're solving, basically, the gap that you're addressing.
So let's explain to people what are vectors and what is the tough thing about storing
them and retrieving them.
Great question.
So if you look at a, let's say, for example, we take the example of recognizing if there's
a cat in a photo.
That was the famous deep learning example. Of course
the machine doesn't literally look at the photo like we humans look at the photo but the machine
looks at a representation of the photo and that representation of the photo is in something called
vectors. And the easiest way to understand this is if you look at coordinates on a map. It's like
that, but in a hyperspace, so there are many dimensions. But
to paint a mental picture, you can see this as three dimensions. Now the problem is:
it was of course great that the pattern could be recognized in the photo, and then it would say,
yes, it's a cat, or no, it's not a cat. But now the problem comes: what if you want to do that for a hundred thousand photos,
or for a million photos, or even more? Then you need a different solution. Then you need to
have a way to look through the space and to find similar things. So for example, you could have a
photo of a cat: okay, show me similar photos.
So what would happen is that when it gets vectorized,
this photo is being placed in that space.
And then it looks at what are other things in their neighborhood in that space.
And that is how it works for photos.
But for example, for natural language processing,
it works in a similar way.
And what I often give as an example there is: if
you go to the supermarket and you have a shopping list, and the shopping list says, I need
washing powder, apples, and a banana. If you go into the three-dimensional space of the
supermarket and you find an apple, then you know that the banana is going to be closer by
than the washing powder. And if you move in the space towards the washing powder, you know you're
moving away from the apple and the banana, which are grouped in the fruit section. This is literally
how you could see such a machine learning model, how you could paint a mental picture of how a
machine learning model works. So the model tries to arrange the data and these individual data points in that space.
And that's how, so sometimes there's 300 dimensions, sometimes 1,200 dimensions, but the model
constantly tries to arrange stuff in that space.
So that's how it works in layman's terms under the hood.
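Bob's supermarket picture translates almost directly into code. The coordinates below are invented, but the point, that distance in the space encodes relatedness, carries over; a minimal sketch:

```python
# Toy 3-D "supermarket space": related items sit close together.
items = {
    "apple":          (1.0, 1.0, 0.0),   # fruit section
    "banana":         (1.2, 0.9, 0.0),   # fruit section, near the apple
    "washing_powder": (9.0, 8.0, 1.0),   # household aisle, far away
}

def dist(a, b):
    """Plain Euclidean distance between two coordinates."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Standing at the apple, the banana is closer than the washing powder:
d_banana = dist(items["apple"], items["banana"])
d_powder = dist(items["apple"], items["washing_powder"])
print(d_banana < d_powder)  # -> True
```

Real embedding models do the same thing, only with 300 or 1,200 dimensions, as Bob notes, and with positions learned from data rather than written by hand.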
Okay, so one kind of naive, let's say, approach in... So obviously, when you're working with
machine learning algorithms, you need the ability to store and retrieve vectors. So you can...
And to do things like similarity search and all
those things. So a naive approach would be like: okay, vectors are basically, you know, a long series of
numbers, so I can use any database to store them. Would that work, or what's
the problem with that? Why don't people just do that?
Yeah, that's a great question. Well, actually, people do do that. And you can compare this a little bit with working with Excel. So if you
work with Excel and you have a lot of data, that at some point, it's not going to work anymore.
Excel is not able to process the amount of data that you have anymore. So if you want to make a
pivot table or make a graph, it's just too much data. So then you need to transfer from Excel to a database.
The same problem is with these vectors.
So what you see with search engine databases
is that they're often built for a specific use case.
So if you peel off the onion and you see what's at the heart,
then you see that there's an engine at the heart that
solves specific problems. So for example, if you look at Solr: if you peel off the
onion, you get to Lucene. And Lucene is good at text-based searching and keyword matching.
It's not really built for relating vectors to them at a high scale. So the reason why it becomes interesting to create a vector search engine or a vector
database is because if you have cases where you don't want to have the raw text search
at the heart, but you want to have the vector search at the heart, you need different algorithms
than, for example, Lucene's, to search through them.
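The need for different algorithms is easiest to see against the naive baseline: comparing the query against every stored vector, whose cost grows linearly with the data. A minimal brute-force sketch (data invented) of what dedicated vector indexes avoid:

```python
import heapq

def brute_force_knn(query, vectors, k=2):
    """Compare the query against EVERY stored vector -- O(n) per query.
    Fine for thousands of vectors, hopeless for hundreds of millions,
    which is why dedicated index structures exist."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return heapq.nsmallest(k, vectors, key=lambda item: dist(item[1], query))

vectors = [("doc_a", (0.1, 0.2)), ("doc_b", (0.9, 0.8)), ("doc_c", (0.15, 0.25))]
hits = brute_force_knn((0.12, 0.22), vectors, k=2)
print([name for name, _ in hits])  # -> ['doc_a', 'doc_c']
```

Approximate nearest-neighbor algorithms keep results close to this exact scan while visiting only a small fraction of the vectors per query.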
So in our case, for example, the first one that we use is the so-called HNSW algorithm. There are more. For example, Google recently
released their algorithm. And so these vector search engines have that at the heart. So
Weaviate takes that at the heart to actually grow from that. So that's why it becomes easy
to do the similarity search or
classification in Weaviate rather than doing it with a traditional database, because you just
simply can't scale it to the size that you might want to scale it to.
Another question, and that's something which I find quite important, actually, and I asked the same question to Edo Liberty from Pinecone, which is
another solution for dealing with vectors, another vector database, let's say. So an important
point is to understand that precisely this encoding, so how do you actually do the translation
let's say, between, you know, your real-life data object, or whatever
you want to call it, and its vector. So to stick to the three dimensions, which is kind of easier
to understand, and to extend on your example
with the banana: in order to encode the banana into something that the
vector engine can understand, you would maybe, in a three-dimensional space, give it some XYZ coordinate. But the way you do that is actually
very specific to, you know, your problem, to your encoding. So if you're using
different ways of mapping that, and different algorithms which
work with different mappings, then you kind of get this semantic mismatch, basically. So you
need to encode and decode in the exact same ways in order for that to work.
So how do you manage that in Weaviate?
Yeah, that's a great question, actually. Because what you're saying, that's true.
So you need to decode and encode in that same space.
So in the example of the supermarket: if I go to supermarket A and I find at a certain coordinate the banana,
that wouldn't say that if I go to another supermarket that I find another banana at the same coordinate, right?
So that's the problem.
So two answers to that question. The first part is related to the vector search engine itself. So the vector search engine
is agnostic about where the vectors come from. So you just create a data object and you say,
this data object is represented by these vectors. So that's one part of the answer.
However, people often also want to somehow vectorize, right?
So they don't necessarily, sometimes they want to do that themselves, but in the majority of cases, they want that to be done for them.
And how we do that in Weaviate is that we have a module structure.
So Weaviate has modules, and you can choose a vectorizer module.
And these modules are good at certain tasks.
So some are more general purpose. So they might be good at news articles. Some might be
better for cybersecurity cases. Some might be
better for healthcare-related cases, et cetera, et cetera. So what we also do at SeMI is that we
present these modules, and then we say: okay, if this is your use case, you can use this vectorizer.
If that's your use case, you can use that vectorizer.
And if you're a developer and you want to go a little step deeper
or data scientist and you want to go a step deeper,
you can even create your own module to vectorize.
But what we see in the majority of the cases is that the vectorizers
that come out of the box already are good enough to solve the problems.
So that's how we solve it. We offer a vectorizer on top of Weaviate.
And then it's the responsibility of the user, basically, to use the same vectorizer when reading and writing and updating and doing whatever, I guess, right?
Yes. However, of course, this is complex technology, right? So for people
to work with it, it's difficult. So one of the goals that we have building Weaviate,
with somebody specifically focusing on the UX of the APIs of Weaviate, is that we want to take that
abstraction layer away. So if you work with Weaviate, you're not really aware that the vectorizer is determining these
vectors. You can see it, of course, if you so desire. But if you just want to run it as a
database and add data and trust the vectorizer to do its job, you can also do that. And that's
how we see that the majority of people use it. They just run Weaviate, throw in the data,
which gets vectorized, start doing their queries and solve the problem that they have.
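The workflow just described, pick a vectorizer module, write data through it, and query through the same one, can be pictured with a toy in-memory store. Everything here is invented for illustration; Weaviate's real modules and API look different:

```python
class ToyVectorStore:
    """Minimal sketch: the store is agnostic about where vectors come from;
    a pluggable vectorizer module turns objects into vectors on both
    write and read, so everything lands in the same space."""

    def __init__(self, vectorizer):
        self.vectorizer = vectorizer  # same module used for writes AND queries
        self.objects = {}

    def add(self, key, text):
        self.objects[key] = self.vectorizer(text)

    def query(self, text):
        qv = self.vectorizer(text)
        dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, qv))
        return min(self.objects, key=lambda k: dist(self.objects[k]))

# A stand-in "vectorizer module": counts two hand-picked signal words.
def toy_vectorizer(text):
    words = text.lower().split()
    return (words.count("fashion"), words.count("finance"))

store = ToyVectorStore(toy_vectorizer)
store.add("vogue_article", "fashion fashion trends")
store.add("ft_article", "finance markets finance")
print(store.query("latest fashion news"))  # -> vogue_article
```

Swapping in a different vectorizer for queries than for writes would put queries in a different space, which is exactly the encode/decode mismatch the interviewer raised earlier.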
Okay, so does that, can that or should it actually be customized for people who want to work with specific machine learning frameworks?
So, for example, if I'm using PyTorch for some of my application and some other people are using
TensorFlow, I guess these frameworks have their own way of vectorizing probably. So
does that integrate with Weaviate in some way? That's one part of the question. And the second part,
you know, so that has to do with the framework. If you go on a bit deeper level of granularity,
it also has to do, I guess, with the specific model
that you're training, which may be,
and in most cases these days actually is, retrained.
So the way that it vectorizes something today
may be in theory different from how it does this
in a month from now.
So how do you keep consistency in those cases?
That's a very good question. So the first answer to the question is a simple yes.
So the whole goal that we have with building the ecosystem around Weaviate with the modules is
to allow people to make it as easy as possible to use their
models to vectorize the data. If they don't want to build a model themselves, that's also
fine because then we give them a model. So we say, okay, here you have a model, you can
use it to vectorize. We try to make it as simple as possible. The second part of the
question is, yes: the moment that a model is used with
Weaviate, you kind of, for that moment, you're stuck with that model, right?
So you need to use that model.
We currently are looking at ways of efficiently re-indexing if you change the model.
But the thing is, in most cases, it's not really needed. So let's say that you have
a model which is fine-tuned for a cybersecurity case, for argument's sake. Then the data in Weaviate
is constantly changing, because you add stuff, you search, you get stuff out, but the model
doing the vectorizing stays the same. And there's another upside: because it's a stateless model, you can
also scale it horizontally to speed up the process of vectorizing and importing data
into Weaviate. So yes, we are looking at what we can do for re-indexing, but to be honest,
we have at the moment zero use cases where that is actually needed.
But if somebody has a use case, and I would love to hear it, of course,
but most users just use the model that they started with.
Okay. Yeah, I mean, I guess it also has to do with the lifecycle of models,
because at some point, I guess they will be updated. It's just a question of when
and whether your use cases so far have hit that threshold in time, let's say.
Again, this is something you will have to come up with a solution for.
Yeah, you can compare that with a traditional database, right? So let's take MySQL.
I could imagine that if you are running a certain version of MySQL and you get a new version, you're going to test it out first.
You set it up in a staging environment, load the data in,
see if it performs better.
You do all these kind of things.
That's really how we see changing the model currently.
It's just an update to your stack.
It's not the case that the model needs to be updated to be effective. So if you have a use case
where you use it to search through documents, for example, you can easily do a year with the model. What we do have within Weaviate is something we call transfer learning.
So what you can do is you can use that model,
but in near real time, you can teach it new concepts if something happens.
So for example, in news events or those kinds of things.
So that's something we do have.
The problem that we wanted to solve there is not having to retrain the model.
Because, of course, training models is very expensive.
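One way to picture teaching a frozen model a new concept without retraining, and this is a generic sketch of the idea rather than Weaviate's actual transfer-learning mechanism, is to place the new concept at the centroid of known, related vectors:

```python
# Known, frozen word vectors (invented 2-D toy values).
frozen = {
    "virus":    (0.8, 0.1),
    "outbreak": (0.7, 0.2),
    "vaccine":  (0.75, 0.15),
}

def centroid(vectors):
    """Average position of a list of vectors in the space."""
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))

# A new term from a news event: define it from concepts it co-occurs with,
# without touching (retraining) the underlying model.
frozen["covid"] = centroid([frozen["virus"], frozen["outbreak"], frozen["vaccine"]])
print(tuple(round(v, 2) for v in frozen["covid"]))  # -> (0.75, 0.15)
```

The frozen vectors never change; only a new entry is added, which is what makes the update cheap and near real-time.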
Yeah, that's true and a very good point.
And actually a whole topic in and of itself, so let's not go too deep into that. But very timely, let's say. I see many people
actually worried, like: okay, this whole training and retraining cycle is getting a bit
out of hand, let's say. But yeah, let's not go there, at least for the time being.
Just to follow up on what you mentioned briefly about use cases, it may be a good opportunity to mention a few of those. So what kind of scenarios is Weaviate used for,
and perhaps some clients that you can mention?
Yeah, sure.
So it's the, of course, we are a startup.
And so what you do is like you try to investigate
what's the best way to apply this technology
in which industries.
And so what we see is that an example of such an industry
is the FMCG and retail industry.
So what we see there is that there's a lot of unstructured data.
So product descriptions, data in ERP systems, invoices,
those kind of things.
And they often need to be somehow related to a structured way of working internally.
So, for example, you could have a lot of work to actually make these models.
So there we see a lot of applications with Weaviate.
And a nice use case example that we have there is something that we do with Metro in Germany. So the challenge that they had was that in their CRM system,
they somehow wanted to figure out what opportunities were in the market,
what potential new customers were. So what we did with Weaviate is that we loaded customer data
from existing data systems into Weaviate. We also looked at public data sources like
OpenStreetMap, and Weaviate started to try to make relations automatically between customers and
public data. The moment that it could not make such a relation, then it said, like: hey, this is a
potential new customer. So a new restaurant might have opened in Berlin,
wasn't a customer yet, but was already in, for example,
the OpenStreetMap data that I mentioned, and then Weaviate said,
I can't make this connection.
And therefore, they knew in a few milliseconds where to go
and where to try to find new customers.
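The Metro pattern, value coming from the relations that cannot be made, can be sketched as a similarity-threshold check. The names, vectors, and threshold below are all invented:

```python
# Public data (e.g. places from OpenStreetMap) vs. known CRM customers,
# both already vectorized; anything without a close-enough match is a lead.
crm_customers = {"Cafe Alpha": (0.1, 0.2), "Bistro Beta": (0.8, 0.7)}
public_places = {"Cafe Alpha": (0.11, 0.19), "New Restaurant Berlin": (0.5, 0.95)}

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

THRESHOLD = 0.2  # max distance still counted as "same entity" (made up)

# A public place whose nearest CRM customer is too far away is unmatched,
# and therefore a potential new customer.
leads = [
    place for place, vec in public_places.items()
    if min(dist(vec, cvec) for cvec in crm_customers.values()) > THRESHOLD
]
print(leads)  # -> ['New Restaurant Berlin']
```

The insight comes from the failure to match: existing customers pair up with their public-data counterparts, and whatever is left over is the sales opportunity.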
So that's an example in the FMCG space, but we're also finding a lot of similar cases
and we're currently researching like cybersecurity, healthcare and those kinds of industries as well.
But the problem is at the root level
always the same and that is
somehow unstructured data needs to be related to something internally structured. That is a constantly
recurring pattern over and over and over again. And that's what we focus on and that's
the problems that we try to help and solve in different industries. Okay, that's interesting
and also a good opportunity for me to ask something that I wanted to ask, because I also have to say that
this is how I originally came to know Weaviate, as a kind of, well, how to frame it best? Maybe
graph embedding search engine, let's say. And now it seems like, on the surface at least, because
that's, I think, what you're using, you mentioned at some point that you're using a graph data model.
But it's a good opportunity, because the example that you mentioned about what you do with Metro is a bit
counterintuitive for me, and I'll explain what I mean. So you basically try to find where it's not
possible to find the connection. People actually most of the time at least use graphs to do the
exact opposite, to discover connections. So that's why I say it's a bit counterintuitive. So
if I wonder if you'd like to say a few words about the specifics of your data model and how graph relates to that.
And why did you choose it and how do you use it?
Yeah, this is actually a great question.
And it's actually very much related to the graph, but in a different way maybe than you're looking at it.
So let me give this example again,
let's give this Google search example again.
So on Google search, you have the knowledge graph
in Google search, so with for example,
Wikipedia entries and those kind of things.
But the problem is, how do you know
which nodes from the graph to show?
So a nice example is, the other day I had a conversation
with somebody and we talked about this, and we did a query about the Disney movie, Finding
Nemo. We asked the question, and the answer was: clownfish.
So the powerful thing there was actually finding somewhere
in that graph space the correct node to show,
which in this case was Clownfish.
Now, the finding of nodes in the graph,
so the right nodes in the graph,
or not being able to find them
says a lot, right? You can get a lot of insights from that. So to answer your question, one of the
things that we've learned is that focusing on searching in that graph and finding these
data objects, it can actually be very powerful, because it also says something
if you can't find something. So sometimes you can find it, and then it's very important
and powerful. Sometimes you can't find it, that's also important. Because if you can't
find it, then we can say, hey, you're looking for something that's not in the graph, maybe
you want me to represent it in the graph. So one of the things that we also have is automatic classification where we do exactly that. So you can load your
data in, and you can ask Weaviate to make the graph reference relations itself,
just based on where things sit in the space. So then Weaviate tries to make a
relation if there is none. And that's again an example of working with
not being able to find something in the graph, but actually, you know, creating value out
of that.
Okay, then I also have to ask you about how you access that, because going through
your material I saw at some point something that struck me about GraphQL interface. And
a little bit of semantic clarification may be necessary here because even today for lots of
people, when they hear GraphQL, they think of it as a graph query language, and it's not precisely that.
It's more like a meta-API query language, if you want to call it that. But still, an interesting choice. So,
why did you choose it and how do you use it and if it relates to your underlying data model at all?
So, this is a really good question. So, this has to do with UX, with user experience, rather than anything related to the database per se.
And that was this.
So I'm personally a strong believer that just expressing our data in a graph-like fashion, that's just the future.
For me, it doesn't make a lot of sense anymore to not do that.
I mean, I understand with data warehouses
and just stacking a lot of data, I understand.
But more if we want to relate data to each other,
it's just that the graph model to represent the data
makes the most sense.
But then the second question is like, okay,
so how are we going to give people access
to this information, right?
So if we want to give them a lot of capabilities that they can do like a tremendous amount of stuff,
then you might be looking at something like SPARQL.
But on the other hand, if you want to make it simple for people to access the graph
so that they just have a very short learning curve, which is very easy to learn, then GraphQL becomes interesting because most developers who are unfamiliar
with graph technology, if they see SPARQL,
they start sweating and they get nervous.
And if they see GraphQL, they go like,
hey, I understand this, it makes sense.
There's another upside to GraphQL,
and that is the amount of community work happening around it.
So with interfaces for different programming languages,
and software applications, you name it,
there's a tremendous amount of libraries available.
And because we've just used GraphQL one-to-one
as an interface, it's really
easy to use all these libraries as well. And to answer the data
model question: the data has an RDF-like class-property, graph-like data model, which is
ideal to represent in GraphQL. So, in the
example that I gave from the news publications, it's very easy to make a
query, say like, show me all articles that are about housing prices and show me in
which publications they appeared. Very easy query in GraphQL. So that's the reason why we chose
GraphQL, and it seems to be paying off, because we're getting a lot of positive responses to the
GraphQL interface. Yeah, I mean, sure, what you mentioned is true about GraphQL's popularity, it's something I've seen as well and I guess
everyone who works in that field in one way or another also sees that, so probably a good choice.
However, your reply actually triggered my curiosity, so now I have to ask you, because
you mentioned, for example, being able to ask specific questions about even entities and
things like that. So GraphQL does have a schema, a kind of way to define a schema, and you also
refer to RDF-like structures. So I wonder if you have any schema mechanism supporting Weaviate, and if you do, how does one manipulate it?
How do you define a schema?
And how do you actually map a schema to the vector space?
It's a great question.
So if you run Weaviate, so you just start one up, right?
So it starts running.
Then the API endpoint becomes available.
And the first thing that you need to do is that you need to create a class property schema.
So in the example of the article dataset, you might say I have the class article and I have
a class publication. And then you can say the article, there's a title, an abstract, and is in a publication or appeared in publication,
and you can make a reference, a graph reference to the publication. So that's the data model.
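That class-property data model can be written down as, for instance, a schema definition along these lines. The JSON shape below follows Weaviate's general class/properties pattern, but treat the exact field names as illustrative rather than copied from the current API:

```python
# Illustrative class-property schema: an Article with a graph reference
# ("inPublication") to a Publication class.
schema = {
    "classes": [
        {
            "class": "Publication",
            "properties": [
                {"name": "name", "dataType": ["text"]},
            ],
        },
        {
            "class": "Article",
            "properties": [
                {"name": "title",    "dataType": ["text"]},
                {"name": "abstract", "dataType": ["text"]},
                # Cross-reference: the dataType is another class name,
                # which is what makes the model graph-like.
                {"name": "inPublication", "dataType": ["Publication"]},
            ],
        },
    ]
}

ref = schema["classes"][1]["properties"][-1]
print(ref["dataType"])  # -> ['Publication']
```

The cross-reference property is the graph edge: an Article node points at a Publication node, which is what later lets a query hop from articles to the publications they appeared in.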
Now, if you start to add data to Weaviate, for example, an article, you simply have to say,
here you have the title and the summary of an article,
so of the class article.
You send it to Weaviate and Weaviate takes care of the rest.
So now it automatically gets vectorized
if you decided to use a vectorizer,
and it becomes part of the graph-like model
where you can see if you say show me all articles,
the article will be part of that search result.
So what you now can do is both do the
ML-based search, so you can say, like, show me all articles about housing prices, but then you can
further traverse through the graph and say like okay in which publication did these articles
appear. So you're now able to mix the machine learning searches and traversing
the graph to find more information about the data object in one go.
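A query along those lines, semantic search plus a hop across the graph reference, might look roughly like this in Weaviate's GraphQL style. The field and operator names here are an approximation from memory, not copied from the docs:

```graphql
{
  Get {
    Article(nearText: {concepts: ["housing prices"]}) {
      title
      inPublication {
        ... on Publication {
          name
        }
      }
    }
  }
}
```

The vector search narrows results to semantically related articles; the `inPublication` block then traverses to each article's publication in the same request.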
That's very interesting. Actually, I didn't realize that that was the case. And yeah,
you're right. It actually goes beyond vectors in that case. So do you maintain multiple
indexes to be able to do that?
That's a good question, and I'm not 100% certain what the
answer is, because that goes very deep into the technology. My colleagues can definitely
answer that question.
Yeah, fair enough. But the important thing is that we really were
focusing on the searching in the graph part.
So we are also not saying like that we are a graph database.
We're not because we are a vector search engine, but we've adopted the whole graph-like thinking
to represent the data to our end users. And because it's very intuitive for people to understand that that's the way
they can retrieve the data,
it's an easy way for people to use Weaviate.
I even sometimes give business demos where I demo the GraphQL API,
and even people who don't necessarily have a developer
or data science background
intuitively understand the GraphQL API
when they see it. Yeah, it's one of the benefits of GraphQL, this
simplicity. To come back here, sorry, I'm maybe asking you a few too many
questions about that, but I'm also learning and interested in figuring out what you do exactly
and how you do it. So I wonder if defining a schema is necessary, like required, to work
with vectors, or can you do it without a schema as well?
So you need to do it to work with Weaviate, but this is regardless of doing something with vectors.
We have use cases where people just want to do a document search
and they just have one class, which is document.
And now they search through these documents.
So it really depends on the use case.
But what we see is that, again,
I'm a strong believer in the whole UX part also of the API.
It makes it very intuitive for the end users that even if they only have one class,
that they can say, okay, I want to find a document,
and I want to search for this search term.
So it's really choosing the graph, and with that GraphQL as the query language,
which really also was a UX decision.
Because as we discussed earlier, this is complex technology.
And so what we aim to do is try to make it as easy as possible for the developer,
and therefore the end use case they're solving, to start with Weaviate, start working with it and solve the problem.
So they don't have to learn different query languages.
They don't have to learn different practices.
It's very, hopefully, and that's our goal,
it's very straightforward.
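As an illustration, a minimal class-property schema of the kind being described might be written as plain data like this (the exact field names are a sketch, not taken verbatim from the Weaviate docs):

```python
# Illustrative Weaviate-style class definitions: an Article with a
# cross-reference to a Publication. Shapes are a sketch, not verbatim API.
publication_class = {
    "class": "Publication",
    "properties": [
        {"name": "name", "dataType": ["text"]},
    ],
}

article_class = {
    "class": "Article",
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "summary", "dataType": ["text"]},
        # a graph reference to another class rather than a primitive type
        {"name": "inPublication", "dataType": ["Publication"]},
    ],
}

def reference_properties(cls, known_classes):
    """Return the names of properties that point at other classes."""
    return [p["name"] for p in cls["properties"]
            if any(dt in known_classes for dt in p["dataType"])]
```

For a document-search use case like the one mentioned, the schema could shrink to a single `Document` class with one text property; the reference properties only appear once you want graph traversal.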
And again, on the topic of schema,
so is it possible to do something like reuse, import existing
schemas, be it on GraphQL or even other types of schemas?
And yes, I realize it's very case dependent.
Some people may not need one at all.
Some people just need maybe a schema with document and that's it.
Just wondering, if somebody wants to use something more elaborate, would they be,
for example, able to reuse something that they have already created, let's say in
some other application, or something that exists in a repository or something like
that? Yeah, definitely, and that's also not uncommon. So what we often see in
the use cases is that the existing schema, and I can give you an example there, that
the existing schema of a graph is used within Weaviate. So just a traditional
schema is used in Weaviate to solve new problems. A practical example: we are
currently also looking at the cybersecurity space, and there's
a well-known framework in the cybersecurity landscape called the ATT&CK
framework, and the ATT&CK framework has a graph-like structure. So it says, okay,
I have certain cybersecurity threats, they have names, there are certain mitigations, which are graph relations, and those kinds of things. So what happens is that this
traditional graph schema is loaded into Weaviate, with the threats, with
the mitigations, etc. But now the powerful thing is that people can start to use
the machine learning model to search through that graph. So they say, hey, I see this happening, or I see that happening,
but do you think this is... and then Weaviate would say,
I think that you need to start at this node in the graph,
because that seems to be the most closely related to the problem that you were
describing. So this is how we try to marry the
traditional approach, if I may call it like that,
and the new machine learning approach, to build a
bridge between them. So basically the answer to you is yes, and that actually
happens a lot. Okay, so what format can be imported into Weaviate? I mean, what schema format
would it be? I don't know, RDF, RDFS, or, I don't know, some DDL description from SQL? What is supported?
So currently, the API endpoints from Weaviate, or the clients, they just take the... you just need to have the class-property structure. Anything that represents a class-property graph can be loaded into Weaviate.
If you have a different type of graph, then it becomes a little bit more difficult.
But I also would like to say that if people have problems there,
that they can't represent their schema for whatever reason in Weaviate,
then also we would love to know, of course,
because then we also know how we can improve it.
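That kind of import, mapping an existing graph schema onto classes and properties, can be sketched like this. The ATT&CK-like structure below (threats linked to mitigations) and every name in it are invented for illustration:

```python
# Sketch: turning an existing graph schema (a toy ATT&CK-like one, with
# threats linked to mitigations) into class-property definitions of the
# kind a vector search engine such as Weaviate could ingest.
existing_graph = {
    "nodes": {
        "Threat": ["name", "description"],
        "Mitigation": ["name", "description"],
    },
    "edges": [("Threat", "mitigatedBy", "Mitigation")],
}

def to_class_schema(graph):
    # each node type becomes a class with plain text properties
    classes = {
        node: {"class": node,
               "properties": [{"name": p, "dataType": ["text"]}
                              for p in props]}
        for node, props in graph["nodes"].items()
    }
    # each edge becomes a cross-reference property on the source class
    for source, relation, target in graph["edges"]:
        classes[source]["properties"].append(
            {"name": relation, "dataType": [target]})
    return list(classes.values())
```

Once loaded this way, the threat descriptions get vectorized, so a free-text description of an incident can land on the nearest threat node, and the `mitigatedBy` references can then be traversed as a graph.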
Okay, cool. Yeah, I know it's
quite involved actually, this conversation about schemas and graphs and whether
you import them and how, but it was just surprising for me to hear that you do that, and
that's why I wanted to know more. Yeah, and I would like to emphasize again that that is
of course really going down into the tech, but the end goal, the overarching end goal, is
that we really want to solve the unstructured data problem, right? So how are you going to get
insights from your unstructured data, and how are you going to somehow map that to something that
you can understand and use in your day-to-day business processes?
And that's why we've chosen that model to structure Weaviate like that.
And another way to visualize it is that I sometimes say, if you look at a graph, it's like the representation is kind of two-dimensional.
So it's like you have one node and another node, and there's a connection, but the distance that
they have to each other doesn't really matter, right? Because,
for a use case, so for example, if you say the movie Finding
Nemo, and then, is produced by, and then you find Disney, how far the distance between the two is doesn't matter. But with the vector
search engine, we say, no, these nodes, their actual vector
representations, they sit somewhere in space. So now we can start to search
through the space for similar entities in the graph, and then we can target the graph. So that example that I gave with Google search,
like what type of fish is Nemo,
that is first a machine learning problem to try to find the right node in the
graph. And then when it's found a node,
it turns into a graph problem because you want to show that that's a clownfish.
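That two-step lookup can be sketched with a toy in-memory model: a nearest-vector step first, then a graph hop. The vectors and edges below are invented for illustration, not how Weaviate stores data internally:

```python
import math

# Toy graph: each node has a vector position "in space" plus graph edges.
nodes = {
    "Nemo":      {"vector": [0.9, 0.1], "edges": {"isSpeciesOf": "Clownfish"}},
    "Dory":      {"vector": [0.1, 0.9], "edges": {"isSpeciesOf": "Blue tang"}},
    "Clownfish": {"vector": [0.8, 0.3], "edges": {}},
    "Blue tang": {"vector": [0.2, 0.8], "edges": {}},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def answer(query_vector, relation):
    # Step 1 (machine learning): find the node closest to the query in space.
    start = max(nodes, key=lambda n: cosine(nodes[n]["vector"], query_vector))
    # Step 2 (graph): traverse the named relation from that node.
    return nodes[start]["edges"].get(relation)
```

A query vector that a model produced for "what type of fish is Nemo" would land nearest the Nemo node, and the graph hop along `isSpeciesOf` then yields the clownfish answer.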
And that's what we were inspired by,
that we wanted to do that as well.
Because when I started,
and we now we're proving that with all the cases that we have,
is that the assumption was,
if we just know where to find and target this data object,
then we have a head start in solving the problem.
Because the problem is not making the structure, or making the schema,
or, in more complex cases, ontologies. We believe that the problem is more in
finding the right data, and that's what we focus on. So we help people
to actually find it. That's also why we say that we are a search engine.
We're really focusing on helping people to find it,
regardless of what kind of data that you have.
Okay, that's a great way to change granularity, basically,
and just go one level up.
And you reminded me of something I heard in a recent conversation I had with
Gary Marcus, and it was one of the most enjoyable conversations. We talked about all kinds of things.
At some point we even talked about vectors and graph embeddings specifically, because we were
talking about, you know, future technologies, emerging technologies, and what he thinks of them and so on.
And so when the discussion came to graph embeddings and vectors, he quoted someone else there,
someone called Ray Mooney, who is a computational linguist, and according to him the quote was,
well, you can't squeeze the meaning of an entire sentence into one vector.
He used some other words in between, but I'm just going to leave them out. And Gary's take
on that was like, well, yeah, okay, they work reasonably well. You know, they can give you
something, some degree of similarity, yes, but he had two main objections. What is a good degree
of similarity, A, and B was basically consistency. So he said something along the lines of, well,
I haven't seen so far, excuse me, something that works consistently well at scale for this kind of
thing. So I wonder what, obviously, you're kind of, I guess,
on the other side, let's say, because you work with vectors
and obviously you believe in them.
And so you have to think about ways to deal with that.
So I wonder what your take is as a kind of counter argument.
No, definitely.
So I think I'm also on the other side, but from
a different perspective, right? So I understand this argument from an academic
perspective, but I'm also looking at it from a business perspective. And the
thing is that the problem that we try to solve is finding something.
So if the similarity search does a good job in finding something, then I'm already happy, right?
Because then I go, yes, we are able to solve this problem and we solve the use case.
So it can be a document or an image or whatever.
So that's a different way of looking at the problem.
And another way to put it is like the technologies that are coming out
and the models that are coming out are already good enough.
And I think a great example of that is, well, hopefully we are an example of that,
but another example is, again, Google Search, right?
They're doing
that a lot. And a lot of people get value from doing searches where these kinds of vectorized
technologies are being used. So I find it very important to say that there's a difference between
the academic discussion, and I agree with that, that it's very difficult
to capture real meaning, like linguistic, semantic meaning, from a sentence in a vector. But if you
look at this from a business perspective, if it's good enough to capture the meaning from that
sentence, you're already good, right? Because you can solve the problem at hand.
So that's the second thing. The other problem, how I understand the consistency problem, is this. So if you want to vectorize something,
so let's say you use a model to try to do question answering on a text corpus,
then the text corpus needs to be vectorized, and that takes a lot of time. On CPU,
it's undoable.
On GPU it's feasible, but if you want to do that on a huge scale for a lot of end users, it's difficult.
It's difficult, you know, to be consistent in the performance that you're giving.
And that, again, is one of the problems that we also saw, and that's what we're trying to solve. So if you want to search through two documents, the traditional way
of doing it is: take two documents, vectorize them, vectorize the query, compare them, answer.
That takes a long time. But what if you have a million documents or more? And that is what the
vector search engine aims to solve. So it takes the outcomes of the model as a
source of truth and uses that. Then, from a data science problem, it
becomes an engineering problem: how are we going to search as fast as
we can through them? And that's also why one of my colleagues wrote a nice blog post about doing a similarity search based on DistilBERT in less than 50 milliseconds.
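The engineering idea, vectorize the corpus once and make each query a cheap lookup, can be sketched as a brute-force linear scan. A real engine like Weaviate replaces this scan with an approximate nearest-neighbor index; the document vectors here are invented:

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Expensive step, done ONCE: the whole corpus is vectorized (vectors here
# are made up) and normalized, so the model's outputs become the stored
# source of truth instead of being recomputed per query.
corpus = {
    "doc-housing": normalize([0.9, 0.2, 0.1]),
    "doc-fish":    normalize([0.1, 0.8, 0.4]),
    "doc-cloud":   normalize([0.2, 0.3, 0.9]),
}

def search(query_vector, top_k=1):
    # Cheap step, done per query: one dot product per stored document.
    q = normalize(query_vector)
    scores = {doc: sum(a * b for a, b in zip(q, v))
              for doc, v in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Even this scan is linear in the corpus size, which is why at millions of documents the problem becomes an indexing one rather than a modeling one.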
Because that is the added value of doing that.
And we believe that that's an engineering challenge that we need to solve.
So I hope that answers the two questions.
Yeah.
So which also brings us to another topic we haven't
covered so far. So you mentioned briefly in the beginning of the
conversation that what you do is actually open source. And I
wanted to ask whether there is any proprietary part in it at
all. And, you know, what kind of license you use, and what kind of
community you have on GitHub, committers and all of that stuff. So that's one part of the question,
which is kind of the logistics of it, let's say. And then the second part, and maybe even more
interesting part, is the philosophy, the rationale behind that. And I think you already kind of alluded to it in saying,
well, it's all about scaling.
So people, in theory, can take that and use it on their own systems.
But it's all about scaling and elasticity and all of those things.
So if you can say a few words about that.
Yeah, so there are many reasons why you can choose to have something open source,
or in our case, an open source core.
In our case, that is simply for transparency towards our customers and users.
So we're not necessarily looking for contributors.
I mean, of course, it's nice if somebody contributes something,
but we're not asking for it. We're not advertising for it. You have to bear in mind that if I go to a
potential customer to sell Weaviate-related services, and I explain what it can do, and even
if I can demo it to them, the next thing that they're going to say, like, wow, that looks very
fancy. It's machine learning, so there's a little bit of this black box element to it.
So they often forward it to a data scientist or a software engineer and say like,
okay, can you take a look at it?
What is this?
So that's why we've chosen that open core model
because we can be transparent.
This is how we do it, right?
So this is how the problem is solved.
So what is a business model?
What do we do?
Well, we actually sell enterprise services around Weaviate.
So for example, think about the open source license.
We take the open source license away
and we replace it with an enterprise license.
Sometimes there's specific fine-tuned models
that people use, or custom-built modules in Weaviate.
That's something that we sell.
So we have really created a business around that
open source core, but it's not a freemium type of thing
or something.
So it's like the open source core is the core at the heart
of all applications that we have,
but enterprises have many needs that, for example, a data scientist
who wants to try out Weaviate, or a small startup, doesn't have. And that's
our business model. So we make money with support
and services around that open source core. Okay, interesting. So it's not oriented towards software as a service.
You don't offer it as a kind of, I don't know, pay-as-you-go
or whatever subscription-based model or what have you.
You offer services around it.
Yeah, so yes, we do offer it as a SaaS, but you know what's very interesting?
Something that we've learned through talking to customers
is that the majority of them are working with data
which is very sensitive.
And they've now built their, or they have their own clouds,
or they have these hybrid, private, public clouds
that they're working with.
So almost all customers ask us the question,
can Weaviate run on our environments?
And we said, yes, of course.
And we understand, they don't want to send their data away,
even if it needs to be handled some way,
or even if you have a data processing agreement, no, no, no.
They just wanna make sure that they have control over that data.
And so we said, well, you know, if you can't come to us,
we understand and appreciate that, we'll come to you.
So we really build our ecosystem around offering these services to them in
production.
What we often see in practice is that development or staging
environments happen on the SaaS offering, or smaller startups
or smaller companies use the SaaS offering.
But these big enterprises, the production stuff,
the production data really needs to be on-prem,
being like actual on-prem or in the private public cloud.
Okay.
Okay.
Yeah.
I've heard actually from people the exact opposite rationale as well.
And it also makes sense.
And at the end of the day, I guess it depends on, you know,
where the majority of your current use cases and prospects are,
and so where you need to focus. Because I've also
heard the counter-argument: okay, because our clients already are in the cloud, you know, they
don't want to have, you know, the extra cost of moving data around, so it makes more sense for them
if we offer it as a SaaS in the cloud. Which, you know, you can't object to. And at the same time,
you know, if your clients, for whatever reason, regulation or I don't know what,
don't want to run in the cloud, then yes, it also makes sense for you to offer it that way. Yeah,
and just to bear in mind, practically that's also how it works. So,
for example, we work with the major cloud providers. So let's take, for example, Google Cloud.
So how does that work in practice? A customer says, okay, we work on Google Cloud Platform,
for example, or any other, but let's, for the sake of argument, use Google Cloud Platform.
We work on Google Cloud Platform, they have a support license, and what they do is that they create a project that my team
members have access to.
We have everything as a SaaS out of the box
with the push of a button to load it into Google Cloud.
So now it runs in their project.
But we can still offer the same support
as we would do with the other SaaS offerings that we have.
But if they have another, if they say no, no, no, for us it must run on Azure, no problem.
With the push of a button we run it on Azure and it's very easy for us to maintain it like that.
So you run on I guess all three major cloud providers and perhaps others as well?
Yeah, so we want to offer others,
but in all honesty,
all of our customers are on one of those three.
Okay, yeah.
I'm not surprised, actually.
Yeah.
Yes.
Do you also find,
so you described so far a way of working
which is, I guess, in kind of close collaboration with clients
for, you know, whatever reasons,
because they want your involvement or support or whatnot.
But is it also possible, you know, for people to self-service basically?
So they just, you know, sign up, get a subscription
and just get going on
their own? Absolutely, and we have two things for that. So one is that we have our, what we
call the Weaviate Console. That's our SaaS offering: you just, you know, log in, select a Weaviate,
click go, and you can start playing around with it. We also offer Weaviate five days for free.
So you just have to click go
and you can make as many sandboxes as you want.
So that's on one end.
On the other hand, we of course also have customers
that say like, wait, listen,
we have our own data science team
that wants to work with Weaviate,
but occasionally we need support and we need help.
So just a license for that type of support is enough for us.
So that's also possible.
It really depends on who's buying and who's using in the organization.
Thank you.
Wondering then, if wrapping up,
you may also want to say a few words about the current status
of SEMI, the company,
like how many people are currently in the team and whether you see
expanding and in what direction and so on, future plans in general.
So SEMI currently, we're with nine people,
and we're completely distributed.
So that was already before the pandemic started.
So we have people on the farthest to the east in Poland
and on the farthest to the west in the US.
What we currently are focusing on is, of course, trying to determine where we add the most value with Weaviate, and in which industries.
Or, to use the startup jargon, determining and nailing our niches.
That's what we're doing. And I can proudly say that we are seeing more
and more customers coming in from the FMCG and retail part.
So we really start to understand how Weaviate adds value
in that landscape of ERP systems, data warehouse, et cetera.
And to answer your question, what we're focusing on, especially also this year,
is growing and better understanding the number of industries where Weaviate adds value.
So think about cybersecurity, healthcare, those kinds of industries.
And if somebody hears this and has a great idea, then make sure to reach out to me.
Okay, cool. Well, let's see if that happens.
All right, so yeah, I think we covered quite a lot and at quite a different level of, you know,
detail from the very, very specific and technical to the quite abstract and everything in between.
So yeah, I'm good on my end.
If you want to add something to wrap up.
Yeah, so if I may add, then,
is that however you look at these types of technologies, right,
or you look at it as a developer data scientist,
then you're, of course, more than welcome to try it out.
Just Google Weaviate, you can't miss it.
But also, if you're more interested from a business angle, if you want to understand, okay, how can this help in my
business and what can we do there, then I would like to invite people to just reach out, because
I can always explain how, in their specific industry, Weaviate might help. And last but not
least, thank you for having me. Thanks for coming.
It was really interesting for me as well.
And I guess you could tell because I asked you quite a few questions.
I hope you enjoyed the podcast.
If you like my work, you can follow Linked Data Orchestration on Twitter, LinkedIn and Facebook.