Drill to Detail - Drill to Detail Ep.102 'LLMs, Semantic Models and Bringing AI to the Modern Data Stack' with Special Guest David Jayatillake
Episode Date: April 14, 2023
Mark Rittman is joined by David Jayatillake, CEO and co-founder of Delphi Labs, to talk about the role of semantic models and the marketplace today, large language models and the phenomenon that is ChatGPT, and how Delphi Labs are planning on bringing AI to Slack and the modern data stack.
https://www.delphihq.com/#about
https://www.linkedin.com/in/david-jayatillake/details/experience/
For B2B Generative AI Apps, Is Less More?
LinkedIn post on rationale behind Delphi Labs product development
Delphi - Jaffle Shop Demo
Semantic Superiority - Part 1 and Part 2
Semantic Search product feature
LLM Implications on Analytics (and Analysts!)
"Is This You?" Entity Matching in the Modern Data Stack with Large Language Models and GitHub repo
ChatGPT, Large Language Models and the Future of dbt and Analytics Consulting
RA Assistant
Transcript
Hello and welcome to another episode of the Drill to Detail podcast, and I'm your host, Mark Rittman.
Today I'm pleased to be joined by David Jayatillake from Delphi.
Thanks for having me. Great to be here.
For anybody who doesn't know you, just maybe explain what you do at the moment and the company that you formed recently.
Sure. So I'm CEO and co-founder of Delphi, a new product which uses a combination of semantic layers and large language models to provide a natural language interface for data.
Okay, so before we get into the detail of your product, and large language models in general, and AI and how it affects sort of analytics: you've got this interesting kind of backstory, I suppose. And you and I were both working at WorldPay at the same time. I think you were on the client side and I was doing a consulting role. But let's just kind of talk about how you ended up doing what you're doing now, and the route to that through Metaplane as well.
Sure. So I guess it starts with when I was at uni, I did maths.
And like most people at the time, looked for an internship, ended up in Big Four accounting.
The part that I enjoyed about it was the analytical part and quickly moved on to my first analyst role, which was at Ocado.
I think most people today have heard of Ocado.
I always ask whether people have heard of them
because when I was there in 2010, not everyone had heard of them.
But I think most people have now.
It's in grocery.
So I started there as a strategy and trading analyst,
and that's where I learnt my SQL.
And this was like pre-BI tools in the UK, really.
They were one of the first companies to play around with Tableau in the UK,
and I was there when they were looking at it. Then I moved on to WorldPay, which is, yeah, I guess where we both were at the same time, with our mutual colleague Chris Tabb.
I was there as what would be a data analyst, I suppose, in role title, but I was doing different things there as well, from data engineering to analytics engineering, because that was just what was required to do the job. And that was the first place I built a data team, focused on portfolio pricing. I moved from there back into more of a mainstream data role at a fintech called Elevate Credit, and there I was head of BI and analytics, looking after a mixed team of analysts and data scientists. And then from there I moved to Lyst, and at Lyst I started with a team of two, which were just in BI, and ended with being senior director of data,
looking after a 25-person data organization team of teams,
doing everything from data science, data engineering,
analytics engineering, and analytics.
And that's kind of like my jumping off point into startup land.
I ended up at a company called Avora, helping a friend of mine spin out a startup to focus on metrics observability, is the way I'd describe it, where we're observing a trend of a metric and doing anomaly detection on that trend to suggest and find reasons for changes in the trend.
Unfortunately it was just a very difficult time, difficult circumstances for the company
and the product.
My main role there was to raise money, didn't end up doing that, but learned a lot from
it.
And then finally, before this role at Delphi, I was at Metaplane as head of data.
It's kind of a strange role, head of data at a data SaaS startup. I was doing, I guess, many different things, including product management, developer relations, content, community, and a little bit of data, but mostly that data work was to generate content.
Interesting, interesting. So that's actually where I first met you, I think, as well. I think
we were at an event. Were you in the audience at an event? I think it was a Firebolt event. Yes.
And I think we got talking there, and after that I tried out Metaplane and thought it was quite interesting. And actually I was keen to get you on the show anyway at that point, but then it turned out you'd actually left there and started your new startup, which just happened to be in two areas that I found particularly interesting.
Semantic models, which is something I've been working with for years now.
And also the new world, I suppose, of large language models and sort of AI, and a lot of things that have now suddenly become very topical because of ChatGPT.
What we can do in this show then really is look at, I suppose, those two things.
So first of all, I suppose,
a foundational look at what semantic models are and how they contrast to,
I suppose, just a sort of database schema.
And also maybe a bit of a commentary on what's in the market at the moment.
Yeah.
And then we'll look at how that kind of links in with the world of AI and LLMs
and I suppose your vision really
for Delphi Labs and how that's going to turn out into a product in time. But let's start off really
with semantic models, semantic layers. For anybody who probably has heard of it but could do with a
definition of what we're talking about, what do you define a semantic model and a semantic layer as being?
So I think, at its core, a semantic layer maps real-world entities, so things like customers, orders, revenue, to a logical data structure. You know, this could be on a database, this could be files on disk, but essentially it abstracts away from the user or the system needing to think about the structure of the data in storage. And it's a way for that system or user to just ask for a metric or an entity and get the results they need.
But how does that differ then from, say, a relational database schema
that already is like an abstraction layer and already, you know,
through declarative SQL, you just ask what you want.
So how does it, I suppose, add value beyond what you get
with a database schema?
So I think with a database schema, you still have to write SQL.
You know, you still have to know how to join, which fields to pull.
And like often with these schemas, they're not clean.
You have to kind of know, oh, I need to pull this column and then filter it by this other column.
Like the whole point of a semantic layer is that it's simple.
You ask for the thing you want and then you maybe filter it.
But those filters are exactly, you know, oh, I want it for UK customers.
It's not some filter to get around the structure of the data; it's a filter to get to what you want.
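To make that contrast concrete, here's a minimal, hypothetical sketch in Python of a toy semantic layer; the table, column and metric names are invented, and this isn't any particular vendor's API:

```python
# A toy semantic layer: business names are mapped to SQL once, so the
# requester never needs to know the storage schema. All table, column
# and metric names here are invented for illustration.
METRICS = {
    "revenue": {
        "select": "SUM(o.amount_pence) / 100.0",
        "joins": "FROM orders o JOIN customers c ON c.id = o.customer_id",
        "base_filter": "o.status != 'cancelled'",  # tribal knowledge, encoded once
    }
}
DIMENSIONS = {"customer.country": "c.country_code"}

def compile_request(metric: str, filters: dict) -> str:
    """Turn a simple semantic request into SQL using the definitions above."""
    m = METRICS[metric]
    where = [m["base_filter"]]
    where += [f"{DIMENSIONS[dim]} = '{value}'" for dim, value in filters.items()]
    return f"SELECT {m['select']} AS {metric} {m['joins']} WHERE {' AND '.join(where)}"

# "I want revenue, but only for UK customers" becomes:
print(compile_request("revenue", {"customer.country": "GB"}))
```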
So, you talked about entities a moment ago. To what extent do semantic models and semantic layers need to understand what the data really means? That a person is a person, maybe a salesperson is a sort of variant of a person. I mean, is that part of what we're talking about as well?
Yeah, definitely.
So I think this is probably what I would say is the difference between a semantic layer and a metrics layer: a semantic layer will have this understanding of what an entity is, rather than just making the metrics and dimensions that those entities have accessible.
So yes, absolutely.
You could have something like a user or a customer or a person.
And then that entity is a real thing in the real world.
But then it's also extensible.
And you can say, well, this person is actually a salesperson.
So that's a subcategory of that entity.
Or this is a customer, which is a subcategory of a user. And so it's like an extension of a class, that sort of way of thinking.
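As a rough Python sketch of that class-extension way of thinking (the entity names and attributes are invented, not any product's syntax):

```python
from dataclasses import dataclass

# Entities as classes: a subtype inherits the attributes (and, by
# extension, the dimensions and access rules) of the base entity.
@dataclass
class User:
    user_id: int
    email: str

@dataclass
class Customer(User):       # a customer is a kind of user...
    lifetime_value: float   # ...with customer-specific attributes

@dataclass
class Salesperson(User):    # a salesperson is another kind of user
    territory: str
```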
Okay. So I suppose, to be good at semantic models, to understand them, it's as much about understanding language and the meaning of language, really, isn't it?
Yes, I think so. Or even the meaning of the business or the organization's world, right? So if you understand the organization's world and the things that are in it, then that's what's required for understanding what should be in the semantic layer.
So if we look at, I suppose, vendors that are in this space and products that are in this space, a semantic model, in a product sense, is more often than not more than just a SQL translation layer. It's things like caching and API access.
Yes.
Where do they come into it?
Why are they typically thought of as being part of a semantic layer or semantic model?
I think it's because that's how you get at it. Because if you just have that core definition of a semantic model, which I described, great: you've described the world, but you can't get at it. So you definitely need, first of all, an API to be able to submit a request and then get a response. And that then leads to needing access control, because, you know, fundamentally, not everyone should have access to everything in a semantic layer, necessarily.
Okay. And what about
things like, obviously, there are table structures and columns and so on, but do you think things like, you mentioned measures, but things like hierarchies, and understanding, I suppose, the relationship between attributes and levels and hierarchies, are they part of it as well, in your mind?
Yeah, I think so. So I was recently looking at AtScale, and AtScale is a semantic layer that actually serves large enterprises like Netflix and Visa, and they have these concepts of hierarchy. So I think dimensions in particular have these hierarchies: you can have, like, geography, and that could have country, and then region, and then city as levels in the hierarchy. So yeah, definitely.
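A rough sketch of that idea in Python (not AtScale's actual syntax): a hierarchy is an ordered list of levels you can drill down through:

```python
from typing import Optional

# A toy dimension hierarchy, ordered from coarsest to finest level.
GEOGRAPHY_HIERARCHY = ["geography", "country", "region", "city"]

def drill_down(level: str) -> Optional[str]:
    """Return the next, finer-grained level below the given one, if any."""
    i = GEOGRAPHY_HIERARCHY.index(level)
    return GEOGRAPHY_HIERARCHY[i + 1] if i + 1 < len(GEOGRAPHY_HIERARCHY) else None

print(drill_down("country"))  # -> "region"
```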
There's a number of players in the market currently making noise. You've got dbt Labs, you've got Cube, and you've got the original, I suppose, in this new generation of products, kind of, Looker. So maybe let's just take a step through what your thoughts are on those various products, and kind of what they're trying to achieve. So going back to, say, Looker and LookML, that was, I suppose, the first of the products of this generation of tools that were big on this. So what was your take on Looker and LookML, and maybe some of the things they're trying to do now with their universal semantic model?
So when I first experienced Looker, I think it was 2019, when I joined Lyst, who were on Looker, still are on Looker.
I was, you know, having come from like a Microsoft SQL background, the only metrics layer that I'd ever seen was SQL Server Analysis Services, which is like an OLAP cube, and very, very traditional.
And then having seen Looker and the way you could define things in code, and it's almost a semantic layer, except it doesn't really have true entities.
I was really amazed, and the power of what you could do in it was fantastic at the time.
And when Looker was a standalone company, I think it was okay that it was tied into the Looker product to an extent.
But I think as it's been bought by Google, there's concern about it being tied in and there being lock-in.
Google have now also brought out Looker Modeler, with its own API. That's now decoupled from the Looker front end, so you can now put other BI tools on top of it.
I think this is a good step forward, although if you think about what GCP's main goals are, it's driving spend on GCP, and the gravity of that is around BigQuery. So, in all likelihood, they will want this as a mechanism for driving BigQuery spend.
So even though I think it's good now
that other BI tools can use the layer,
I think, you know, I would be worried
if I wasn't on BigQuery trying to use this.
Okay, okay.
And of course, the other big player in the market is dbt Labs
and their move last year to announce the semantic layer.
And obviously, then there was an acquisition recently of Transform.
What's been your observation on the dbt metrics layer,
then semantic model and what they're doing now with that acquisition?
So I think I've written about this a bit, but the dbt semantic layer's prior iteration
had some very good things about it.
Like it did have entities.
It had quite a strict way of defining metrics and dimensions and timeframes.
But the problem was that it ended up similar to OLAP cubes or Tableau, where there was no guarantee that you define a metric once.
Because every metric needed to effectively have its own OLAP cube, which was always like a single dbt model, you could see quite quickly that you'd need to pre-join everything for one particular metric and then do it again for a variant of the metric. It wasn't as flexible as you'd want, and it was actually less flexible than LookML, which, you know, defines joins and then has dynamic grain in the query.
And so that, I think, was a problem, and I think really the whole dbt community thought that that was a problem with it, and I think that's partly why it wasn't very well adopted. I think Tristan quoted two and a half percent of dbt orgs were using the semantic layer in the previous iteration, before they bought Transform.
I think the acquisition of Transform is excellent. Transform is definitely a very good semantic layer.
It's up there with Cube and LookML,
if not better than LookML.
And it does allow you to define joins
and have dynamic grain in your queries.
It's a good step forward for dbt, for sure.
What about, I suppose, the fundamental thing
that the semantic model is part of the data transformation layer,
and therefore to make any changes to that and add to it,
you've got to start editing the project.
I mean, I know that obviously with things like Lightdash that's even more the case. But do you think there's fundamentally a bit of an issue there, that to get end users to use it, that's not going to happen? What's your thoughts on that?
So for me, yeah, for end users I think it probably is worse. But, you know, how many end users were actually writing LookML in Looker? It's probably quite rare.
I think, actually,
if you think about that,
analytics engineers or data engineers
were the ones adding to it.
I think it's better
because if you can write your transformation code,
define what those models are,
then define the semantic layer there
in that same workflow.
And then, like you mentioned with Lightdash, possibly define your actual front
end assets in the same workflow again, the likelihood of you capturing any issues in
development is much higher than if they're in three different places.
So I think that's a good thing.
I think it's logical for the transformation work
and the semantic layer definition work
to be in the same place
because the people looking after them
are most likely the same people.
The skillset required is the same as well.
Okay, okay.
I suppose the third player that we've, my company, has been using quite a bit recently is Cube. So Cube, I suppose, is a bit different to the other two, in that they are a standalone company that focuses purely on that semantic layer. So maybe just, kind of, what's Cube, and what's your thoughts on that as a kind of concept and an approach?
I think Cube are really impressive. Like, we're talking to them at the moment. Their semantic layer is very good, in the way it really reminds me of Looker, but it's just a bit better than LookML, I think.
And because, you know, they've had a bit more time than dbt to work on it, you know, they were founded in 2019 and they've been working on this solidly, they've got really good features: access control, caching. Their caching is state-of-the-art.
They built something new because Redis wasn't good enough.
That's pretty amazing.
And they've got access control baked into the semantic layer,
which is very, very good as well.
Also, I think what's really interesting about Cube is, I think they understand that people will want to define the semantic layer with transformation. And that's why they've already enabled allowing people to define their semantic layer in the dbt format, but then being able to serve it using Cube. And I think, you know, they'll continue that with MetricFlow, which is the Transform version, as well; I think that will come out soon from Cube. So I think Cube are doing a really good job.
I know that they're very heavily used
in the embedded analytics space.
They've got thousands of GitHub stars.
And since they've released their cloud product,
I think at the end of last year,
they've got hundreds of customers on it already.
So I think they're going to do very well. And, you know, they're taking a very smart viewpoint of, well, there's no need to try and get people to decide between writing dbt semantic logic or Cube semantic logic; they can do either, it doesn't matter.
Interesting, interesting. In fact, they're the next people on the show, so I'm recording an episode with them next week. So before we move on to the next topic, do you think
maybe the future is BI tools and other tools will support multiple sorts of semantic layers? I noticed with the announcements around the universal semantic model that they said ThoughtSpot are going to support that as well, and I know they already said they'll support the dbt metrics layer. So maybe it's not necessarily a kind of one or the other.
It's a multiple thing, do you think?
I think so.
And I think the problem is that data team leads are very wary of having their semantic layer locked into some other application and this is something i found when i was uh trying to then deal with
looker is oh suddenly you realize that there's a whole monolithic piece of software that you put
inside your semantic layer and you can't get away you know it's very difficult for you to consider moving from that product
because it's so hard to move that logic.
So I think anyone who's been a data team lead
and used one of those tools will be thinking about,
well, how can I protect myself from this in the future?
And using Kube, using DBT, using AtScale,
even using Looker Modeler is a step forward to protecting yourself and then
not needing to be not being beholden to paying whatever that bi tool chooses to charge you
okay okay fantastic so let's move on to this topic of large language models and and and chat gpt and
AI. So, right, just to frame again for the listener's benefit, define what a large language model is, and maybe kind of just outline some of the interest that's been in this recently.
Yeah, so I won't profess to be an expert in large language models, but essentially they're this new generation of models. Some people are calling them AI; I'm not sure they're true AI. But essentially, they can generate content like text or images in ways that we've never been able to even dream of before. And you've, you know, you've seen people ask them questions like, show me an image in the... you know, I recently saw, with Midjourney, so Midjourney uses a large language model in the background, someone recently posted an image of Big Ben in the style of Van Gogh, and it was actually amazing. And, you know, I've used Midjourney before,
and I guess ChatGPT is the text equivalent of this,
where you can say, give me, write me a paragraph on this topic,
or write me a code snippet in this language to do this thing.
And it can do it, and it's learnt how to. And, you know, I think fundamentally some of the things that have enabled these models are data engineering: they've been able to synthesize training data sets that are of really high quality that the models can learn from, and that's, I think, the key part of how they've been successful.
Okay. So I think the basic underlying technology is, I think it's like Markov chains, isn't it? Where, if you've got a sequence of things, or you've got a pattern of things that have happened, being able to predict what the next word is, or the answer to this thing. And so, you know, given, for example, a question, then if you've had enough training data, you would know what the most likely answer to that question would be. And if you have enough input data, then you can start to sort of generate things from there that, you know, weren't even there before, really. Well, to an extent, anyway.
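Strictly speaking, today's LLMs are transformer networks rather than Markov chains, but the next-word-prediction intuition described here can be illustrated with a toy Markov-style model in Python; the training text is invented:

```python
import random
from collections import defaultdict

# Toy next-word predictor: count which word follows which in some training
# text, then sample a likely continuation. Real LLMs are transformers
# trained on vastly more data, but the "predict the next token from what
# came before" intuition is the same.
corpus = "the revenue went up and the revenue went down and the costs went up".split()

following = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word].append(next_word)

def generate(start: str, length: int = 6) -> str:
    words = [start]
    for _ in range(length):
        candidates = following.get(words[-1])
        if not candidates:
            break
        words.append(random.choice(candidates))
    return " ".join(words)

print(generate("the"))  # e.g. "the revenue went down and the costs"
```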
Maybe, to try to think of an example here, how would this apply to things like BI and so on, first of all? Or what do you think some of the initial use cases for this would be in our world?
So I think, you know, we've seen some of these come out, and one of the first things people have tried to do is generate SQL directly from a question. And there's actually tens of them now, and you even see multiple of them in, like, the last YC batch of companies, that are just doing text-to-SQL, and, you know, there's some pretty famous names in there already. So that's, like, one of the first things, and it's probably one of the most simple things to do: translate a natural language query.
Yeah, exactly. Where else
will we see them I could imagine things like ELT potentially because if you think about a lot of ELT is an API
request to some third party system and then pulling the data and then pushing it somewhere
else, you could imagine certainly the original API request being quite straightforward for
an LLM to generate.
Because in some ways, like, a semantic request is a very structured request that it's got to make, with very narrow possibilities. And from what we've seen with Delphi, they do quite well at generating those requests.
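As an illustrative sketch of the difference (using the pre-1.0 openai Python SDK that was current at the time; the prompt, metric names and schema are invented, and this is not Delphi's actual code), you can constrain the model to emit a structured semantic request rather than free-form SQL:

```python
import json
import os
import openai  # pre-1.0 SDK, contemporary with this episode

openai.api_key = os.environ["OPENAI_API_KEY"]

# Constrain the model to a narrow, structured output: a request against
# known semantic objects, rather than arbitrary SQL. Names are invented.
SYSTEM_PROMPT = """You translate questions into JSON semantic requests.
Available metrics: revenue, orders. Available dimensions: marketing_channel, order_week.
Reply with JSON only: {"metric": ..., "group_by": [...], "filters": {...}}"""

def to_semantic_request(question: str) -> dict:
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
        temperature=0,  # deterministic, parseable output
    )
    return json.loads(response["choices"][0]["message"]["content"])

print(to_semantic_request("What was my revenue by marketing channel last week?"))
```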
A couple of things we've been doing. One of the first uses was, we used it to be able to give us the descriptions to go with measures in Looker, and in dbt, for every word, like, say, net profit, or anything where there was a word that is commonly used. We used it via the API then to go and actually generate, as part of a project, the descriptions, for example, or even things like documentation. So, you know, that's one basic way of doing it. But recently something I did was, I thought, well, can I use ChatGPT to build a dbt package? And it was interesting. I mean, certainly you go in there and, with a prompt, you say things like, you know, imagine you're an analytics engineer building a dbt package for a consulting company, in our case. And, you know, it's uncannily good
at the start. You know, you go in there and you ask those questions, and it will come back. And I say, maybe, for example, one of the sources is Harvest, the other source is Xero. You know, it would know what the table structures are in the exports from, say, Fivetran for those, and it would be able to sort of come up with a data model and some mappings and so on there. And it's, you know, in some respects it's like having a text conversation with an outsourced developer, for example. But it's also, I suppose, what it's doing is, it's not necessarily coming up with any new insights; it's maybe regurgitating stuff and coming up with new stuff that's a variant
of that. One of the issues I found, though, was what's called hallucination. Maybe you can explain what they are, David?
Yeah, so I think I've heard that phrase. So I think, with hallucinations, and I think it's very typical when you ask it not to generate code but to generate facts, it will just generate an answer. And because it will always generate an answer, whether it has the actual information or not, it will just make something up that sounds like a good answer. And I think that's generally what people say is hallucination. I'm not sure why it happens; I'm not sure anyone knows exactly why hallucination happens. It could be that the training data has incorrect information in it, or it could just be that it's trying to generate something that sounds right, and it doesn't really matter about the facts underneath.
Yeah, yeah. I think the phrase I've heard used is a confident bullshitter, in some respects. It's like a classic consultant, really, someone who's very confident.
But the example I had was, I asked it to come up with some code that would do a fuzzy match on names and company names. And it said, well, in BigQuery, you can use this function called the Jaro-Winkler function. And it very confidently gave me the code for that. But there is no such function.
And I think an analogy I've heard in the past, someone saying, is it's a bit like if you had your arm amputated, and your brain's model of your body still has the arm there. And so you would feel pain, you would think your arm is there; it takes a while for the model to adjust to the fact that your arm isn't there. And in a way, that's kind of what these models are doing with, say, their own thoughts: it kind of has a mental model of the world, and it will take a while for that model to be adjusted by feedback saying that's incorrect, and so on. I suppose that is why OpenAI have put their products out now for open testing.
Yeah, yeah, I think that's right. And with that example you mentioned, I think it's got that dichotomy of knowing that this Jaro-Winkler is the right way to do this. But then you want it to be done in BigQuery, and it hasn't put two and two together that you can't do it. Or, if you wanted to do it in BigQuery, you'd have to write that as a UDF or something, and it hasn't figured out that that's what it needs to do.
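For reference, Jaro-Winkler similarity is a real string-distance measure; it just isn't a built-in BigQuery function. A minimal fuzzy-match sketch in Python, assuming the third-party jellyfish library, might look like this:

```python
# Fuzzy-matching company names with Jaro-Winkler similarity, using the
# third-party jellyfish library (pip install jellyfish; function name as
# in jellyfish >= 0.9). This runs in Python, not BigQuery; to do it inside
# BigQuery you would indeed have to supply your own UDF.
import jellyfish

candidates = ["Rittman Analytics", "Rittman Analytics Ltd", "Delphi Labs"]

def best_match(name: str, threshold: float = 0.9):
    """Return the most similar candidate above the threshold, if any."""
    score, match = max((jellyfish.jaro_winkler_similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None

print(best_match("Ritman Analytics"))  # -> "Rittman Analytics"
```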
Yes. OK. So when I was looking at your LinkedIn page and some of your sort of blogs, you mentioned an a16z article recently that was quite fundamental in the thinking that you've been doing around Delphi. Right, so maybe just explain what the article was and what it's trying to say, and why that was influential for you.
Yeah. So this a16z article was about the topic of using these large language models in B2B applications. It wasn't specifically about data applications, it was just applications in general, but I think the principles completely hold true. And the way they described the situation is that right now there's this wave one of applications,
which they call generative AI applications, which is correct.
You put in a prompt and you get information.
And you can generate a lot of information and content very quickly because it's very easy to make prompts.
And it generates a large amount of information from the prompts. And I think, in data, this is going to dazzle people, but it's also not really going to help them. Because I think what you'll find, just like with some of the text-to-SQL companies, is that they'll give this to business users, and they'll get many answers for questions, but they'll start to get different answers to the same question very quickly, and it's just going to cause confusion, I think. Because if you just have many, many answers, no one knows what the truth is; it becomes very difficult to actually use it as an insight. And then the article goes on to describe the second wave, which they call SynthAI.
And what SynthAI does is, rather than generating lots of pieces of information, it uses lots of information as inputs to generate fewer insights that are more clear, almost like a distillation of this information. And when I read this article, I realized that this was
exactly what I was trying to articulate about Delphi throughout our fundraise, that this is why
we're different to the text-to-SQL companies. Yes, the interface for the person is the
same. They want to get an answer, so they ask a question, but we don't specifically generate a large amount of content.
What Delphi tries to do is, firstly,
we will try to find a previous question that has been answered
and find out, you know, maybe through methods like Jaro-Winkler on the text, or maybe through semantic similarity of the question to a previous question, that we have answered this question before. And the answer we gave was this. And the person who asked it was this other person; maybe they were your colleague. And so we want to start showing people, you know,
consistent answers and previous answers that have been validated potentially. And so we've got
answers that we've given before.
We've got existing work.
This could be a dashboard.
This could be a notebook in Hex or whatever
where we have semantic similarity to the question
and we can offer this as a potential answer
to the question as well.
And finally, we can generate a semantic request.
So we don't connect to databases directly
to generate brand new SQL queries.
We connect to semantic layers
and generate a semantic layer request.
Now, fundamentally, a semantic layer
is like one of the most scalable ways
that a data team can, you know,
collate their information and knowledge about the data.
So we're leveraging that; we're synthesizing insights from that information that already exists. And that's, I think, the bedrock of Delphi: both the LLM and the semantic layer. And I think that's exactly why we're actually a wave-two SynthAI application, and not a wave-one GenAI app.
Okay, okay. So obviously there's been a lot of conceptual stuff there; let's talk about the product itself. So, I mean, obviously, like most people, I've tried to do things like feeding a kind of dbt model into ChatGPT, feeding a kind of CSV file, you know, feeding that information in, and obviously you don't get very far with that. So maybe just tell us, you know, what is the user experience like with Delphi, and how does it kind of work under the covers? Because there's a lot of kind of buzzwords, a lot of technologies here, but how does it actually work under the covers? Let's do that by walking through the user experience, and then let's drill into it as we go along.
Yeah.
So I think the user experience is very similar to, like, the text-to-SQL companies: here's a box, but the box in our case is Slack. And if you refer to that a16z article, at the end it says SynthAI is the start, but moving into the workflow is how you make this like a moat. And that's exactly what we had thought from the outset: that we want to solve the whole workflow of how someone gets from asking a question to getting an answer, and there's so many steps along the way, if you know how analytics works. So yes, someone can ask a question, but then we go
through these various stages of triage, you know, is this question similar to a previous question?
And so we can do things like text similarity, semantic similarity. So if you think about semantic similarity, what you're doing is you're either thinking
about that question as a vector with embeddings that's similar to other vectors with embeddings,
that's one way, or you can find, well, what items in the semantic layer or objects in
the semantic layer are similar to this question.
And then because you've got that for other questions, you can then say, well, the array of
semantic objects is similar. So therefore, they're semantically similar. And therefore,
we can suggest this as a solution. So that's the second step of triage.
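A rough sketch of that embedding comparison (pre-1.0 openai SDK, illustrative threshold, not Delphi's implementation):

```python
import numpy as np
import openai

# Embed questions and compare them with cosine similarity: questions with
# nearby vectors are semantically similar even when worded differently.
def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return np.array(response["data"][0]["embedding"])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

previous = "What was revenue by marketing channel last week?"
incoming = "How much did we make per channel over the past 7 days?"

if cosine_similarity(embed(previous), embed(incoming)) > 0.9:  # illustrative threshold
    print("Offer the previously validated answer instead of generating a new one")
```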
The third step of triage would be, well, the same thing applies to not only previous questions, but existing work. So this could be
a dashboard, it could be a notebook or any other piece of work which has semantic objects
that are relevant to it. And then finally, we can generate new work. But, as you can see, we don't want to generate new content quickly; we'd rather do that sparingly and then add that to our learnings. So the next step would be to generate a new semantic layer request. So, you know, someone could ask, what was my revenue by marketing channel for last week? And what we'll then do is, because we've gone through that
we're going to make a new semantic request. But we want to provide you know as much safety in the
answer as possible so one of the things we already do today is we answer that we repeat the question
back to the user much like an analyst would to you right today so we would say well by revenue
we're going to assume you mean uh gross merchandise value net of refunds with promo codes applied
by marketing channel you mean UTM channel.
And by last week, you mean the last week by order created date, for example.
And we'll repeat that back to the user, much like an analyst would today.
And then the user can say, well, actually, no, that's not what I meant.
I meant by the order shipped date, not the order created date for the last week.
And I meant I didn't want promo codes applied to the revenue.
So let's get rid of that.
And then you get to a point where the user has much more trust in what they're getting, because they've been told what it means.
And that's possible.
That's only possible because of the semantic layer
because the semantic layer has this defined inside it.
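A toy sketch of that repeat-back step in Python (the definitions are invented; the point is that the wording comes from the semantic layer's own metadata, not from the model's imagination):

```python
# Render a semantic request back into business terms, using descriptions
# stored in the semantic layer. All definitions are invented.
SEMANTIC_LAYER = {
    "metrics": {"revenue": "gross merchandise value, net of refunds, with promo codes applied"},
    "dimensions": {"marketing_channel": "UTM channel"},
    "periods": {"last_week": "the last week by order created date"},
}

def describe(request: dict) -> str:
    metric = SEMANTIC_LAYER["metrics"][request["metric"]]
    dim = SEMANTIC_LAYER["dimensions"][request["group_by"]]
    period = SEMANTIC_LAYER["periods"][request["period"]]
    return (f"By {request['metric']} we assume you mean {metric}; "
            f"by {request['group_by']} you mean {dim}; "
            f"and by {request['period']} you mean {period}.")

print(describe({"metric": "revenue", "group_by": "marketing_channel", "period": "last_week"}))
```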
And then we run the request, and then they can access that information: they can export it as a CSV, they can go and explore it in a BI tool, if that's how they've integrated with us, and they can see the request,
and if this doesn't do what they want it to do, this is where we start involving human analysts.
So they can then ask for help from a human analyst if it still hasn't given them what they want.
Or a human analyst can come and validate the request as well.
Because fundamentally, the highest risk part of the workflow is where a new request is made.
So we want to reduce that risk by bringing
in human analysts at this point to validate the request. And over time, those requests will start
to become less new; they'll be new less of the time. And so in our repository of questions that
we've answered with these requests and responses, we'll have a whole set of validated answers.
Okay. So where does the LLM and say GPT-4 come in there?
Because it sounds like the things you were saying at the start, looking back through the list of previous questions, looking for matches,
that sounds like the sort of thing you've had in the past with say, ask Look and and say thought spot and so on so where specifically does the lm come into this really
There's a few places. Say, for example, generating embeddings, and generating the request of the semantic layer; that's probably the key thing. When we do that, you know, we're using GPT-4 currently; we used Codex before. And then finally, one of the things
we also do is when we do generate the answer, if the person has asked for data, that's fine,
they'll get data, like as a CSV, or explore-from-here in your dashboard tool. But if they've actually just asked for a straightforward answer, like, were we profitable last week, we can pipe the output from the semantic layer into ChatGPT again and interpret it for them. So we can just say, yes, last week you had $200 of profit, so you were profitable last week. That's the sort of thing we can do now. There is some kind of security concern over the last part, because even though you're only giving ChatGPT aggregated data about your business, it's still potentially sensitive. So we have that as something that's configurable on setup, as to whether you want to allow that to happen or not.
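A minimal sketch of that final interpretation step (pre-1.0 openai SDK; the question and figures are invented, and the step is configurable precisely because even aggregates can be sensitive):

```python
import openai

# Pipe an aggregated semantic-layer result through the model so the user
# gets a plain-English answer rather than a table. Illustrative only.
question = "Were we profitable last week?"
semantic_result = {"metric": "profit", "period": "last_week", "value_usd": 200}

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer the user's question briefly from the data given."},
        {"role": "user", "content": f"Question: {question}\nData: {semantic_result}"},
    ],
)
print(response["choices"][0]["message"]["content"])
# e.g. "Yes - last week you made $200 of profit, so you were profitable."
```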
Yeah, okay. So you mentioned embeddings there. Okay, so I've been playing around with a similar thing on our website, where we have been using a service called My AskAI, that uses embeddings, and it will work with ChatGPT to include your data in the responses it returns.
So can you explain what embeddings are
and whether we're training the ChatGPT model
or just giving additional information?
How does that work?
So embeddings. So if you think about representing something as a vector, a vector could be about, say, an entity, for example. So it could be about a person, and one embedding could be their gender, one embedding could be their geography, one embedding, you know, all these different things. Now, that's a very human way to think about embeddings. The truth is that the way a large language model would generate embeddings, you know, it would generate hundreds of thousands of embeddings, and they may be very abstract.
It could be something as strange as,
especially when you're generating embeddings about a text,
it could be, oh, the fifth character versus the first character
was 10 characters apart or something completely abstract
that doesn't mean anything to anyone.
But using that huge number of embeddings, they can work out similarity between text very well.
Okay. So with your product, does it pass the customer's data
and their semantic model to ChatGPT, or is it kept separate? The reason I ask that is because I think there was a thing in the papers recently about Samsung: they were using ChatGPT to do a similar sort of work, and they ended up inadvertently passing a whole chunk of their proprietary IP to ChatGPT, and apparently it became part of the training data. So what's the separation there, and how much of the customer's data is passed to ChatGPT?
Yeah.
So this is, again, I think a strength of our method
instead of just sending it your database schema
and columns and all of that metadata.
So we do need to send ChatGPT the objects from your semantic layer.
So these are the things that exist in your world, you know: customer entities, users, whatever, and their attributes, their dimensions, their metrics. So those names have to get sent to the large language model for our system to work. But we don't have to send it actually any data, because all of that I would classify as metadata to any organization. So our system can work entirely on metadata, because we only send those objects, and then we generate a request based on those objects, and then we use the request, and we don't even have to send the data to ChatGPT.
Okay, okay. I think I was reading again at the weekend that if you use the API for ChatGPT, then you can choose to not have that data be used for training purposes. I think it's partly because you're paying for it. So I think certainly my take on it is that if you're a commercial service using the OpenAI APIs for this, then, you know, you're safe there, really. But certainly if you were sitting there with the ChatGPT web interface and just using it to answer your questions, just like any kind of consumer, that is when your data is potentially being used as training data. But certainly if you're using the API, then it's a different kind of category of commercial use, really.
Yeah, exactly. Yeah. And that's what we do. Yeah.
The question I'd have again, then, is really, how does this work kind of commercially? And how does this work when you've got a tool like, say, well, what BI tools would you use this in conjunction with, really, and how would you then relate to that? And how does it work, I suppose, in a commercial and licensing sense, and that sort of thing?
So, for today, we integrate with Lightdash, Metabase and Looker as BI tools. And then we integrate with Cube and the dbt semantic layer, and that's the original dbt semantic layer, as semantic layers. We are considering looking at things like AtScale in the future, and then the new dbt semantic layer, but we're waiting for the new APIs to come out.
From a licensing, when you say licensing point of view, do you mean in terms of what you pay Delphi? Or do you mean?
Yeah, you know, that, really, I suppose.
Yeah, I think we had an idea of how to price it, and we've recently spoken to some of our beta users about how they would expect us to price it, and actually they came up with a very similar thought to us, so I think that sounds like a reasonable way. But essentially, for now, the way we're thinking about pricing is on consumption, so based on the number of questions you ask Delphi. But what we do is we'd have tiers: so, you know, let's say a $500 a month tier, which had 5,000 questions that you could ask. Because our current interface is in Slack, you know, number one we can't, but we also don't really want to, restrict the number of users who have access to Delphi. You know, our whole mission is to allow the whole of an organization to have access to data.
And in particular, people who are probably not comfortable using BI tools, you know,
that would be like a core audience for us.
So, you know, we want anyone to be able to access Delphi.
Yeah.
Yeah.
I mean, I sat through the Loom video that's on your website.
And the thing that struck me most is it's like having an analyst available to you on Slack who can just answer your questions. Rather than you going into a tool like Lightdash and creating some analysis and working it out yourself, you ask this analyst, who is smart enough to kind of double-check that what you're asking about is the correct thing at the start, and then goes away and kind of comes back to the answer with you. So it's like having your own kind of analyst there. I mean, maybe just mention that Loom video, what it's trying to do and how that process works.
Yeah, exactly.
So yeah, this is our demo Loom video.
And it's Michael showing how Delphi interacts with a person.
So someone will ask Delphi a question, and Delphi will then clarify.
First of all, Delphi will try and show you existing work,
so maybe it's a dashboard that you've already got built.
And then if you say no, it's not one of those.
And then Delphi will generate a new request,
but it won't just run the request before asking you about it.
It will ask you,
well, this is the request we're about to run. And it will speak that back to you in human language.
And you can then say, yes, this is great, or no, it needs work and I need to adjust it.
And then finally, when you say yes, it will then run that request and give you the results, either as a CSV or Explore in Lightdash or another BI tool.
And yeah, that's the current workflow shown in the demo.
Okay, okay.
So I appreciate the product is early stages
and it's been a few months now since you started mentioning it,
but where do you see it going?
Within the bounds of what you can talk about now,
Where would you like to see this going?
And what, I suppose, are the next problems to be solved in this space?
So I think our mission is to solve that workflow, you know. And some people, like, I've spoken to some people and they call this the shoulder-tap problem, about analysts being just disrupted. But I see it as like a two-sided problem: yes, analysts get disrupted by lots of ad hoc questions, and they're not as productive as they'd like to be.
But there's also the second part of that, which is that the stakeholder is either not even given access to the data because they're not one of the lucky people who has a seat in the BI tool.
And they also just, they want quick answers to do their job.
You know, is it a business stakeholder's job
to know how to use a BI tool?
I think because there wasn't much of an option
in order for there to be a scalable way
to access data in the past,
people have said, yes,
it is their job to know how to use it.
But I think we're moving away from that now.
I think it will be the case that it's not their job to know how to use a BI tool. They should just be able to ask a question and get an answer that's safe, that they can then go and do their job with.
Okay. So do you see this as being maybe the actual final solution? People have always been talking about self-service BI, and it's always been around the corner, and it's always been something that is a kind of goal. Do you think maybe this is what could actually deliver that, really? You know, what do you think on that?
Yeah, I think this is the start of the end for that. So the large language models are improving very, very fast, to the point where people are, you know, getting a bit excited and saying, do we need to pause development on them, which I don't agree with. But they,
enough to the point where people are kind of not, you know, they're getting a bit excited and saying, do we need to pause development on them, which I don't agree with. But they,
you know, they're getting better and better. And if you think about where Delphi will be,
even just trying to do the same things we do today, in 12 months, because we'll have more
powerful models available to use, you know, you can see that self-serve will happen,
you know, we will be able to answer a completely non-technical user's question and give them a safe
response. And it won't be every time. And I think that's where human analysts will come in.
And I don't see us just getting rid of data teams. I think, you know, that's like part of the philosophy of some of the text-to-SQL companies: that you don't need the analytics engineer, you don't need your analysts, we'll just do everything for you automatically. I just don't agree with that philosophy. I think you'll always need analysts, and maybe they just won't be doing those kind of rote or easy requests all the time; they'll be focusing on refining the system and answering those more abstract questions. Like, you know, I've seen analysts ask questions like, should we do this activity as a business? And that is, we're a very long way away from a large language model being able to,
because it needs to decompose that thought
into a number of different analytical pieces of work.
And that's like a sequential story,
which then leads to an answer.
We're so far away from a large language model
being able to answer that kind of question.
So we need data teams around.
But our mission with Delphi is to supercharge those data teams, really.
So just to round things off then, how do people find out more about Delphi?
And I suppose also, what's your ideal customer and kind of sort of use case and so on, just so that you can focus on the people that you can be most helpful for?
I think right now our ideal customer is probably a data team who have
implemented a semantic layer. Like, we've thought about, in the future, helping teams set up their semantic layers, and I think that's probably a second-act thing that maybe we'll do next year.
I don't see us looking to hit that this year, necessarily.
But yeah, so right now our ideal customer will be a team who has a semantic layer set up, whether that's Looker, Lightdash, Metabase or Cube or the dbt semantic layer, like one of those five. And that they have this problem where, most likely, they're a smallish data team, and they have many, many stakeholders who hit them with questions. Like, the typical data team's, I think, probably about five people, and they have a channel in Slack where people just ask them data questions. And this is exactly where we want Delphi to live, is in that channel, and they can just ask Delphi those questions.
Fantastic.
And where do people find out about the product then?
So we have delphihq.com,
so you can sign up to our waitlist
if you want to try out the product.
Or you can just contact me on LinkedIn
or Michael Irvine, my co-founder on LinkedIn as well.
We're pretty active in the dbt and Locally Optimistic communities as well.
So yeah, there's a few different ways to reach out to us.
Okay, and when are we going to see you at an event in London? I think you go to most of the analytics engineering ones, and you've actually got your own one as well. So are we going to see one of those soon?
Yeah, so we have a London Analytics Meetup at the end of this month. It's going to be at Depop; I'm just finalizing the details for the meetup invitation. But yeah, I regularly attend the London Analytics Engineering Meetup and the London dbt Meetups as well, so you can definitely see me at those. I'll be attending those whenever I can.
Fantastic, fantastic. Well, David, thank you very much for coming on the show. Really interesting product, really topical as well. So thank you very much, and best of luck for the future.
Thanks so much, Mark.