The Data Stack Show - 174: Does Your Data Stack Need a Semantic Layer? Featuring Artyom Keydunov of Cube Dev
Episode Date: January 24, 2024

Highlights from this week's conversation include:
- Artyom's background in the data space (0:32)
- The growth and changes at Cube (5:58)
- Pain points of managing metrics definitions across different tools... (9:39)
- Trade-offs between coupled and decoupled semantic layers (12:12)
- Making a case for implementing a semantic layer (14:17)
- The evolution of semantic layers (23:28)
- Challenges in designing a decoupled semantic layer (24:16)
- Different approaches to solving the interface problem (26:58)
- Implementing a SQL engine in Cube (35:58)
- Overhead and debugging in semantic layers (39:08)
- The semantic layer and its importance (46:26)
- The need for semantics in data products (47:34)
- What's the future of semantic layers and user experience? (51:49)
- Final thoughts and takeaways (57:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We're here with Artyom, the co-founder and CEO of
Cube. Artyom, thanks for coming on the show and welcome. Thank you. Thank you for having me today.
My name is Artyom. I'm co-founder and CEO at a company called Cube. I also co-created
Cube open source project back in 2019. Then a year after,
I started a company with my co-founder. So I've been on a journey of building Universal
Semantic Layer and how it was going through the cycles of evolution in the last few years. So yeah,
exciting to be here and to chat all about semantic layers and data. Yeah, that's awesome.
And we've had you before.
It's been almost like a year since you were on the show, Artyom.
And many things have happened in the industry.
So I'm very curious to see how semantic layers have evolved in this one year.
And also, what's next, especially after the, let's say, this whole revolution that's happening right now with AI, LLMs, and all these new technologies around data.
So I'm really looking forward to chatting more about that stuff.
What about you?
What are a couple of things that you're really excited to chat about today?
Yeah, it was a great year for semantic layers, for sure.
And I'm very glad how the data community evolved
in their thinking about the need for semantic layers.
And I saw different vendors, different companies
coming up with semantic layer solutions.
And I'm definitely happy to see the category maturing
overall. And at Cube, you know, I hope we contributed a lot to the thinking,
to the framework of how the semantic layer should fit into the modern data stack.
And obviously, you know, like the elephant in the room this year was LLMs, right? Like an AI. And
I felt like it contributed to the, you know, like ideas and the
need for the semantic layer, because LLMs, they're all about semantics. They're essentially text
in, text out, right? And text is semantics. So that's why, you know, like I see that was a strong
tailwind. Okay, we really need semantics, not only for humans, but for AIs as well. And let's
talk about semantic layers, right?
And how we can get semantics about our data.
Yeah, no, that's awesome.
I think we have plenty to talk about.
So what do you think, Eric?
Let's dive in.
Let's do it.
Artyom, it was so fun to have you back on the show again.
It's been about a year and so many exciting updates to talk about.
Before we get into the topics, can you just remind our listeners what Cube is?
Cube is a semantic layer, or a universal and standalone semantic layer.
And the reason why I highlight standalone and universal
because I feel like semantic layers,
they were here for a long time, right?
BusinessObjects, they had a semantic layer.
The problem with that kind of started to happen
as we started to have more and more BI tools
and cloud made it really easy to buy more and more tools.
We started to have a lot of semantic layers
sort of scattered across different tools
because now we have five.
ThoughtSpot, Power BI,
Domo, Looker, Tableau, you name it.
And then all these BIs,
they have semantic layer attached to the product,
coupled with the product.
And now organizations,
there's five BIs and they all have those semantic layers. The problem is that we repeat ourselves
when we define a metric at every BI level. We go into one tool, we define all the metrics. We go
to the second tool, we define it. And the frameworks are always different. The way we
define metrics is different, but essentially we define the same metrics.
And it creates a problem, and your stack becomes not DRY.
And, you know, in engineering, we always try to keep things DRY.
Do not repeat yourself, right?
And that's the whole idea behind the semantic layer.
Let's make our data stack DRY at scale.
So, we take the metrics out of the BI tools and we define them in one place. We call
that place universal semantic layer, and it sits between cloud data warehouses and all the data
visualization tools. And then we just define the metrics in that place. And then we deliver
metrics to all the different data consumption tools. So that's the whole idea behind semantic
layer, a universal semantic layer, and that's what Cube is building.
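The "define once, deliver everywhere" idea can be sketched in a few lines of Python. Everything here, the `Metric` class, the metric names, and the tiny SQL generator, is an invented illustration of the concept, not Cube's actual data-model API:

```python
# A toy sketch of the "define once, query everywhere" idea behind a
# universal semantic layer. All names are illustrative, not Cube's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str   # metric name exposed to every BI tool
    sql: str    # aggregation expression, defined exactly once
    table: str  # source table in the warehouse

# The single, shared definition that every consumer reuses.
METRICS = {
    "revenue": Metric("revenue", "SUM(amount)", "orders"),
    "active_users": Metric("active_users", "COUNT(DISTINCT user_id)", "events"),
}

def compile_query(metric_name: str, dimension: str) -> str:
    """Generate warehouse SQL for any downstream tool from one definition."""
    m = METRICS[metric_name]
    return (f"SELECT {dimension}, {m.sql} AS {m.name} "
            f"FROM {m.table} GROUP BY {dimension}")

print(compile_query("revenue", "country"))
# SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country
```

Every tool asking for "revenue by country" gets the same generated SQL, so the definition never drifts between tools.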
What a great, concise definition.
It sounds like you've explained that a couple times before.
I did, yeah.
Let's talk about the last year.
So when we last had you on the show, Cube was focused on a fairly specific use case.
You know, I think that you were talking about semantics,
but not quite to the,
I know that you were focused on analytics use cases. And I think you were talking about this concept
of headless BI, if my memory serves me correctly.
So can you explain the journey
that you and Cube have been on
going from sort of the
positioning as an analytics type solution and using the term headless?
And then what led you?
How did the company grow and change?
What did you learn about your customers that sort of pushed the move to semantics?
Right.
I think we went through quite an evolution since we started a company and project in
2019, right?
The company started as an open source project in 2019, and then the company itself started
in 2020.
So it's been three years, a little over three years now.
I think that we initially had this big vision where we wanted to create metrics, semantics,
and then deliver them to different
places. But we never had really good semantics about semantics, I would say, how we would call
that, right? We had different names. I remember calling ourselves API for data at some point,
and then we were calling ourselves a headless BI, and then the metrics layer. And eventually,
it felt like the industry arrived at the term semantic layer and that everyone
is using semantic layer right now.
So even from a naming perspective, we went through several steps of evolution here.
And from a product perspective, I think when we first started, the obvious problem to solve was how we use metrics in an embedded analytics
application or customer-facing application. Because that's where you still need to build
a semantic layer, but you would build it manually. You're not going to use one provided by BI,
but you would actually write code in your Django app or in your Ruby on Rails app
to deliver the metrics to the customer, right?
So we thought, let's try to remove that piece
that developers need to build inside these frameworks
and just kind of make it generic.
So that was a very clear first use case
and it was a big need
and that's why it helped us to get initial traction.
As we started to build on top of this use case, we started to have customers saying to us, hey, we're using Cube to show metrics to
our customers inside their app, but we're kind of looking at the same metrics inside our BI tool
and inside our second BI tool and a third BI tool. Why not just use Cube to centralize all the metrics
for all the different places?
And that's how we started, you know,
like to go to this next step,
next kind of, you know, stage of evolution for Cube.
It's like, okay, let's bring all the BIs right now
to work on top of Cube.
And that's sort of how our vision is.
Kind of, you know, like, I wouldn't say expanded
because we always wanted to do that, but the product started to expand toward the bigger
vision, right? And then we added more BI tools and, you know, different AI apps this year.
So, you know, if you go on our website right now, it's quite a different website
than like a year ago and two years ago, right? We talk about the bigger vision right now.
I think that was a major change on our end as a product
is now we're not only serving embedded analytics,
but we're serving a bigger picture of powering
all the different data experiences in the organization.
Makes total sense.
One question I have for you is around adoption.
And what I mean by that is,
I'm just interested in the point at which your customers or users come to you. So you talk about having, you know, multiple different BI systems that all sort of have their own semantic
layers. It would seem that a lot of companies hit a pain point where they're managing
those metrics definitions across a number of different tools,
you know, and platforms. And so do you see a lot of, is that sort of the main inflection point
where companies come to Cube? Yeah, I think it's a compound problem, because you get so many BI tools, and then even inside
of one BI tool, like Tableau, you may have a lot of different workbooks and then every
workbook, it acts like a silo with all its metrics inside it.
And then you think, oh, how do I connect workbooks together or something?
So it sort of, you know, adds up to the problem every time you build a new dashboard, every time you do a new report, or someone tries to do analysis in Excel.
So that's why I think this problem is always top of mind for data engineers and data leaders, trying to find, okay, what is the best way to manage data modeling and metrics, you know,
because it feels always like we did so much progress with ideas around, you know, code
first management, applying software engineering best practices.
So we have matured data pipelines, you know, there's the medallion architecture,
all of these different ideas.
But then we sort of fail at the last moment, where we actually need to build metrics.
And I think that sort of creates this sort of anxiety or, you know, like uncomfortable
feeling among data leaders and data engineers.
There should be a better architecture to do this than how I'm doing it today.
That's sort of probably why people started
to think about semantic layer and talk to us and kind of explore different options.
Yeah, that makes total sense. Are you seeing more companies or teams try to start with this layer, you know, just from scratch, right from the beginning?
I think it happens, I see it happen sometimes. I think it's mostly something that comes after, you know,
like you have a warehouse, you have one, two, three BI tools, because then the problem is more evident, right? Like you see, now I have all that mess, I need to clean up the house here.
But I see companies are starting with thinking about, okay, semantic layer from the beginning,
which feels to me like the category maturing, more awareness, you know, data
teams that are aware that they would need a semantic layer sooner
because they would need it eventually, right?
So they're thinking, okay, let's put it sooner than later.
I think the caveat here is that sometimes there is,
there is sort of, you know, an opportunity to use a tool
like Looker, for example, which offers a great coupled semantic layer,
and it might look like a good idea today.
Say you're a mid-market company with 100-200 people,
still a small data team,
and you might need only one BI tool at this point,
and Looker is an option,
and semantic layer is coupled with Looker,
and everything makes sense.
You have your transformations, you have your semantic layer in LookML, and your visualization, and everything makes sense, right? The problem,
though, is that once your organization grows, you definitely will hire people who would
say, hey, I've used Tableau all my life, why should I use Looker, right? Yeah, yeah. And then Power BI
will come in, Sigma, Excel.
So that's always sort of a trade-off, you know. I see that when companies are picking a semantic layer, sometimes, you know, Looker really looks like a good option if you're small.
But then if you think about what happens next, a decoupled semantic layer would probably be the better option. So that's an interesting caveat I've been seeing, you know, when a company on the smaller side
is thinking about a semantic layer.
Yeah. Yeah.
You know, that's a tricky thing.
And I'd love for you to help us think about how,
you know, how data teams can make a case for this, right?
Because it's one of those things where, you know, you said,
which was my hypothesis as
well, that the perception of value becomes much higher when
you're in a lot of pain because you have multiple BI tools, right? But you save a lot of time and
money by not having to get to that place of pain in the first place. But I think one of the challenges is justifying the, you know,
expenditure of like, you know, cost, right?
Whether that's like, you know, paying for software or, you know,
your team actually implementing it.
How do you think about, you know,
how would you recommend that someone make a case for doing something that,
you know, it's kind of the thing
where it's like, you know, is this going to provide us immediate value now? No. Like,
will it save us a million dollars over the next three years? Yes. You know, in time that we would
have allocated to like wrangling all this data, right? How do you, how would you help someone make that case? Yeah, I think we need to,
and by we, I mean all the semantic layer providers and, you know, to some extent BI vendors
as well, the BI vendors that want to integrate with semantic layers. We need to make it as easy
and as cheap as possible for small teams to implement the best practices, you know, from the beginning.
Right.
Take our solution, for example: the way we think about the pricing at Cube, we try to make it scale with the organization.
So we don't want it to be as expensive as Looker, for example.
Right.
So we want to make sure it's cheap initially.
So you still have a budget for like maybe Superset,
Apache Superset, right?
Or Preset, which is a cloud version of Superset.
You bundle those tools together
and then you kind of go with that architecture
instead of a Looker where you would, you know,
like have vendor lock-in across the whole company
and at scale.
So I think from what we can do first,
we need to create a correct business model.
And then, as we said, you know,
you come in cheap first and then scale with usage.
But we also need to make sure that our products,
they offer a very good experience compared to coupled solutions, right?
Like obviously when you have a coupled solution,
it's easier to build a good user experience
versus when you have a decoupled
because you need to kind of try to make two products
work almost as well as a single product,
which is a really hard problem.
But we need to solve for this problem.
So I think these two things that we need to do
and then for data
teams to justify it, I guess it's really, you know, like understanding the best practices and
understanding that eventually you would need to scale. And, you know, if vendors make it as
easy as possible, with attractive cost and attractive integrations, then it would be
easier for data teams to just kind of go with that architecture
sooner rather than later.
Yep.
Makes total sense.
Okay.
Well, we're talking about semantic layers.
I feel like you've done a great job of explaining where the semantic layer sits.
dbt is, you know, a very widely used data tool, and they emphasize the semantic layer a lot.
How do you compare? You know, I think the way that you're describing a lot of things,
a lot of people would describe dbt in a similar way. So can you just
explain some of the key differences or even use cases? Sure. Maybe I'll first go a little
bit through the history, you know, of how dbt arrived at semantic layers.
Sure.
I think dbt started as a transformation tool, right?
Like dbt Core, a widely used and popular transformation tool.
And then dbt, the company, they kind of raised a lot of money as part of the low-interest-rate data stack phenomenon.
And then kind of started to build around the initial dbt Core CLI tool.
And I think at some point they announced that they wanted to build
a metrics layer or semantic layer, right?
The first attempt was to build it in-house,
and then they sort of failed to deliver on expectations.
And to solve this issue of failing,
they decided to buy a company called Transform Data.
And Transform Data was one of our competitors.
Really great team.
Well, so, you know, like it's always when you start a company,
it's always good to have competitors because you understand
probably other smart people, like, you know, like they're doing something the same as you so you're probably doing something
right right so like as always was good for me to know like competitors like transform data because
it was a lot of like reassurances that we're doing the right thing so what happened is that
we got a little bit more traction and transformed data from a business perspective
and kind of reining the category.
So transformed data, they decided to go and get acquired
and DBT acquired them.
I think that was like a second attempt of DBT to deliver that.
Now it happened almost a year ago.
I think we're still in a state where it's not quite clear what dbt Semantic Layer is as a product.
When I talk to the community, when I talk to users, I hear a lot of awareness about dbt Semantic Layer, but I don't see actual users and customers of the dbt semantic
layer, because I think the product is still not there. And I mean, it's hard for me,
you know, to kind of talk about and think about why it's happening, right? Because they have
all their reasons. That's a big company right now; they raised a lot of money, and it creates a lot of pressure. Maybe, you
know, they're looking at different areas of the product: how to, you know, optimize for
monetization, how to optimize, you know, the conversion from open source to cloud.
They brought in an AWS VP of product, right? Like, to solve all these problems. So it felt like the semantic layer is not getting enough attention
because it's a really hard problem to solve,
both technologically and product-wise.
And it's not like an existential thing for dbt, right?
If dbt fails at the semantic layer, they still have a business.
If Cube fails,
we don't have a business, right? So we have to make it right, we have to make it work. For dbt, it's
just one of the features they have at this point. Yeah, yeah, super interesting. One question, I should
have thought of this earlier, but I'm sure a lot of teams build their own kind of semantic layer to address some of these issues.
What tools are they using to do that?
You know, to essentially sort of mimic what Cube would do.
How are they doing it?
What does the in-house build look like for this?
So I think we can categorize these tools into simple versions
and more complicated ones.
A simple version of that would be to use data marts as sort of your semantic layer.
The problem with leveraging data marts is that you would have to create a data mart
for every level of detail, or grain, of the data,
because non-additive measures and joins all create this complexity where you
cannot have a single data mart serving, you know, metrics at multiple grains.
So in that case, you would have to create a lot of data marts, and that's what
some companies do. If you have a process for how you produce
and control these data marts, you can put
manual work into just doing that.
It's expensive, but it's possible.
But again, it's very expensive.
The other option would be to create your own in-house virtual layer
that would give you the virtual semantic layer
and that would generate SQL.
And I know some of the more sophisticated tech companies,
they build their own in-house versions of that.
But essentially that would be Cube, right?
Or that would be any layer
that offers you a virtual
data layer that actually
generates SQL when you query it.
This way you're kind of solving for this
grain, or level of detail,
problem. So I see essentially
two options: you either put a lot
of money and time into the
manual work of creating and
duplicating data marts, or you build your
in-house version of Cube.
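A tiny Python example makes the grain problem with data marts concrete: a non-additive measure like a distinct count, pre-aggregated at daily grain, cannot simply be summed up to a coarser grain, which is why a virtual layer has to regenerate SQL at the grain being queried. The toy data here is invented:

```python
# Why non-additive measures force one data mart per grain: a distinct count
# pre-aggregated at daily grain cannot be summed to get the monthly value.
# Toy data; the point generalizes to any COUNT(DISTINCT)-style measure.

events = [
    ("2024-01-01", "alice"), ("2024-01-01", "bob"),
    ("2024-01-02", "alice"), ("2024-01-02", "carol"),
]

# Daily "data mart": distinct users per day.
daily = {}
for day, user in events:
    daily.setdefault(day, set()).add(user)
daily_counts = {day: len(users) for day, users in daily.items()}

summed = sum(daily_counts.values())               # naive monthly "rollup"
true_monthly = len({user for _, user in events})  # correct distinct count

print(daily_counts)   # {'2024-01-01': 2, '2024-01-02': 2}
print(summed)         # 4 -- wrong: alice is double-counted
print(true_monthly)   # 3 -- a virtual layer re-queries at the right grain
```

Summing the daily mart gives 4, while the true monthly distinct count is 3, so a monthly question needs its own mart, or a layer that generates fresh SQL at monthly grain.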
Yep, makes total sense.
All right, well, Kostas, I've been monopolizing the mic,
so please jump in here because I know you have a ton of questions.
And I want to hear about the LLM and AI stuff.
Yeah, but before we get there,
I think it should be good to spend a little bit of time
getting a little bit
more technical about semantic layers.
You mentioned, Artyom, a few seconds ago that semantic layers are a hard
problem, both from a product perspective and a technical perspective, right?
What does this mean?
And let's focus on the technical side of things.
Why is building a semantic layer hard?
Right.
So essentially, at the heart or the core of a semantic layer,
you have a virtual data representation that can generate SQL, right?
That's, I think, the whole idea around a semantic layer,
even when the semantic layer is part of the BI, right?
Because if you look back at, you know, BusinessObjects,
like the first generation of BI,
or even at Looker,
we'll see that semantic layer essentially is this sort of a virtual representation of data
that lets you drag and drop things.
And then when you do this, when you build a query, then the system generates SQL and executes that SQL against your cloud data warehouse.
In the case of a live query, that is; with extracts, obviously, it's going to query its own data store. So I think the core of the problem is how you build a virtual layer
that exposes data as measures and dimensions to the end user and then generates SQL.
And now, you know, we have all these problems about how do we deal with joins?
How do we deal with fan-outs? You know, fan traps, chasm traps, all of that, when we're generating
SQL. So I think SQL generation and creating the right framework for abstraction,
that's one piece of the problem.
The other big problem, which was not solved by any coupled semantic layer,
is how do we make an interface to the semantic layer?
Because that problem really arises when we build a decoupled semantic layer.
Because when we have a decoupled semantic layer, it means we need to have an API so
a different system can connect to the semantic layer.
So from that perspective, the question would be like, okay, we have Tableau now.
How would Tableau connect to your decoupled semantic layer?
And there could be different ideas.
One can build a one-to-one connector with Tableau;
the problem here is that you would have to build
one-to-one connectors with all the tools.
And it might be just a maintenance burden,
even almost impossible.
So if you would look at different options,
what you can do, you will probably arrive at SQL.
You would think, okay, semantic layers should probably speak SQL, because all those tools
already speak SQL. Now, the problem with SQL is that SQL doesn't know about metrics,
right? SQL is just columns, right? So you kind of define metrics when you write a query:
you write like an average, a sum, you can do a little bit of math, right,
in your SQL.
But you cannot say, hey, SQL, give me that measure, right, that metric.
So that's where the problem is: how do we make SQL look almost like MDX, or
work almost like MDX so that it can query multidimensional data?
I think the missing piece here is the idea of a measure.
How do we make SQL aware of measures?
So when you write your SQL query, you can say, hey, I just wanted to get that measure
with that dimension and apply these filters and get a result back.
I think that would be the second biggest technical problem when building semantic layers.
So the first one is how do you
design the architecture, you know, how do you define measures, dimensions, all these objects, and
generate SQL. And then the second big problem would be how do you create an interface so tools
like Tableau can query a semantic layer. Yeah, let's start with the last one that you mentioned. So how do we do that?
Do we, yet another time,
try to extend SQL
and add more syntax there to do that?
Is it...
I don't know.
Do we just add metadata or annotation?
First of all, you're the expert here, so what are the different
attempts that people have tried
to solve that, outside of building
one-to-one
connections with every possible BI tool
out there?
Yeah, so far,
outside of one-to-one connectors,
I saw mostly two
attempts.
One is introduce your own query language.
And this query language would be just some sort of custom language.
Think about like a NoSQL-style database query language, right?
Like you still query something, but it's not a SQL.
That was what Transform Data had before dbt acquired them.
I don't think that's the right approach. The good thing about it is that it's native for querying metrics, which, you know, feels good when you use it.
The problem is that you have all this data infrastructure already built around SQL.
And then you need to go and pitch Tableau, Power BI:
hey, can you support my metrics query language?
Which is like an almost impossible thing to do, right?
So that's why I feel like it's not the right approach.
The other approach is to make it SQL first.
Now, here we have two branches of how to make that. One is what we do at Cube,
and what Looker is doing. I'll talk about it in a second. And the other is what dbt
Semantic Layer is currently doing. So what dbt is doing, they're taking SQL more as a container,
and then they put a bunch of Jinja inside it. So from that perspective, you use SQL more like just a protocol, right?
Like the container, but the actual querying happens right inside this Jinja template
inside the SQL.
I think it sort of, you know, might solve the connectivity issue, just the basic
connectivity.
But then the question would be, how would Tableau generate that query, right?
Because Tableau generates a SQL query when you do drag and drop.
Then you sort of come down to like one-to-one connectors,
because you would need to run like a driver for Tableau
that knows how to generate the Jinja template.
So I don't think that's quite a solution.
Again, it might be easier for a person, for a human, right,
to write that inside a SQL query if you use a tool like Hex
or if you use a tool like Mode.
But it might be really hard for tools like Power BI, Tableau
to actually generate that.
Yeah.
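The "SQL as a container" idea can be sketched roughly in Python: the governed metric definition lives behind a macro embedded in the SQL text, which the layer expands before execution. The macro name and syntax below are invented for illustration, not dbt's actual Jinja functions:

```python
# A rough sketch of the "SQL as a container" idea: the real metric reference
# lives in a macro inside the SQL text, which the semantic layer expands
# before sending the query to the warehouse. Macro syntax is invented here.

import re

def expand(sql_with_macro: str, metric_sql: dict) -> str:
    """Replace metric('name') macros with the governed aggregation SQL."""
    return re.sub(
        r"metric\('(\w+)'\)",
        lambda m: metric_sql[m.group(1)],
        sql_with_macro,
    )

metric_sql = {"revenue": "SUM(amount)"}

query = "SELECT country, metric('revenue') FROM orders GROUP BY country"
print(expand(query, metric_sql))
# SELECT country, SUM(amount) FROM orders GROUP BY country
```

A human can type the macro in a notebook-style tool, but a drag-and-drop tool like Tableau has no idea it should generate it, which is exactly the connectivity gap described above.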
And now the final option, that's what we do and what Looker is doing,
is to be as ANSI SQL compatible as possible,
with the addition of the measure type.
Now, the measure type is a special type in SQL
that would represent an already predefined metric,
meaning that it's a special column that knows how to evaluate itself.
So we define that it can be, you know, active
users, it can be like percentage of failed transactions, a ratio metric. And from the
SQL standpoint, it's going to be just a column in your table with the special type measure. And you
would use a function called MEASURE, a special aggregate function, to query this.
So you would say, hey, I want to get my measure back.
And then, by doing this SQL query, you're kind of telling the system: I don't want
to calculate it.
I know that you know how to calculate it.
Just give me the value back. So that seems like the least evil here in terms of changes to SQL, right?
Because we don't want to change SQL, right? But we have to make this minor change. And it feels like
the minimum necessary change we can make to make it work. The challenge here is that it might not be very natural for SQL
because we're kind of making SQL multidimensional at this point.
So we want to make sure to make it accurate
so we're not breaking any SQL standards and SQL expectations.
But that's possible to do.
So that's what we do at Cube, and that's what Looker is doing
with the Looker modeler.
And just for the context, when GCP acquired Looker,
they announced that they wanted to turn Looker into this sort of
semantic layer as well, and it means they need to build an interface.
An interface for that would be Apache Calcite,
which is being developed by Julian Hyde.
And that's how Julian and the team approach the problem of querying measures as well,
with a special measure type and a special measure function.
The one thing here to mention is that you may say Tableau doesn't have a measure function, right?
Like it only has SUM, AVERAGE. Here's an interesting thing:
we probably need to provide some backward compatibility
for BIs that still don't have it.
Maybe, you know, use SUM for measures with subtype sum,
or AVERAGE for measures
with subtype average.
There are some compatibility things,
but long-term, I think that approach is the most viable one.
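The MEASURE()-style approach can be sketched as a rewrite step: the query stays ANSI-shaped, and a special aggregate is expanded into its predefined definition before hitting the warehouse. The rewrite below is a toy illustration, not Cube's or Calcite's actual engine:

```python
# A sketch of the MEASURE()-style interface: SQL stays ANSI-shaped, but a
# special aggregate tells the layer "you already know how to compute this".
# The rewrite rules here are illustrative only.

import re

# Each measure knows how to evaluate itself; the querying tool does not.
MEASURE_DEFS = {
    "active_users": "COUNT(DISTINCT user_id)",
    "revenue": "SUM(amount)",
}

def rewrite(query: str) -> str:
    """Expand MEASURE(name) into the predefined aggregation before execution."""
    return re.sub(
        r"MEASURE\((\w+)\)",
        lambda m: MEASURE_DEFS[m.group(1)],
        query,
    )

q = "SELECT city, MEASURE(active_users) FROM events GROUP BY city"
print(rewrite(q))
# SELECT city, COUNT(DISTINCT user_id) FROM events GROUP BY city
```

Because the outer query is plain SQL, a BI tool can generate it with drag and drop, and the backward-compatibility idea above amounts to accepting SUM or AVG in place of MEASURE for measures with a matching subtype.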
So, okay.
What I see here is that...
Let's start with the approach of DBT.
So dbt is trying to solve the problem more on the front-end side, let's say, by
considering SQL more as a template and then having a pre-processor
there that's based on the Jinja logic and enriches the SQL with whatever it has to be.
Okay, I can see the value in that in terms of the flexibility. And most importantly, you don't really have to go back to the query engine and make changes
to the query engine, which is a pretty hard thing to do in general, right?
But as you say, you have the problem there that you need all the front-end tools at the
end to somehow understand these template languages and stuff in there.
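That pre-processor idea can be sketched very loosely in Python. The `{{ metric('...') }}` placeholder syntax and the metric catalog below are made up for illustration; dbt's real implementation uses Jinja macros and a full metric spec:

```python
# Loose sketch of the template pre-processor idea: metric references in a
# SQL template are expanded into real SQL before the query runs. The
# placeholder syntax and definitions are illustrative, not dbt's API.
METRIC_DEFINITIONS = {
    "total_revenue": "SUM(amount)",  # hypothetical metric definitions
    "order_count": "COUNT(*)",
}

def preprocess(template: str) -> str:
    """Expand {{ metric('name') }} placeholders into their SQL."""
    out = template
    for name, sql in METRIC_DEFINITIONS.items():
        out = out.replace("{{ metric('%s') }}" % name, sql)
    return out

query = "SELECT region, {{ metric('total_revenue') }} FROM orders GROUP BY region"
print(preprocess(query))
# SELECT region, SUM(amount) FROM orders GROUP BY region
```

The catch discussed above is visible here: the raw template is not valid SQL, so any front-end tool that wants to issue this query has to understand the template language first.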
And then you have the other approach of introducing new types,
which, okay, it sounds like the most engineering sound way
to do it, right?
But then, okay, you don't have necessarily to go and change things
on the front end that much, but you have to change things
on the back end, right?
So what's the solution there?
And let's keep it away from Google because Google
is a special kind of creature in terms of the resources that they have and how they think about
products. But let's talk about Cube, right? You can't go out there and be like, hey, Redshift, let's introduce another type here, right? So how do you do that? How do you implement that as Cube?
Yeah.
I think first, a high-level architecture here would be that Tableau generates that query with measures, all of that. That query is going to be sent to Cube, or any semantic layer. And then Cube, a semantic layer, should generate real SQL based on the data models, which is going to be executed in Snowflake or Databricks. And then kind of send the results all the way back to Tableau.
So in that case, the question is, how do we implement that SQL engine, right?
Like that can talk about measures and all of this.
Yeah, that's a challenge; at Cube we implement our own SQL engine.
So we're building one.
We're like building obviously on top of existing technologies, right?
We're using Apache Arrow DataFusion as a SQL parser, and to some extent the SQL logical planner, right? But we extend the planner on a level where we introduce the idea of measures and dimensions.
And then we build our execution engine
in a way that for some of the queries
and execution happens at the core part
of the semantic layer
where we generate the SQL query,
we execute it,
send it all the way back to sort of the Cube SQL engine. And then the rest of the execution happens just as regular SQL, because you might have an inner query that goes to your semantics, and then you can have an outer query that does some post-processing, right? Like when the data is fetched, you want to change it.
So it's kind of, it could be a combination of things.
But to answer your question, yes.
In that case, every semantic layer vendor that is going with that approach would need to have their own SQL engine.
And for us, we built it based on the Postgres protocol. So it's Postgres-compliant; we also support the Redshift style of Postgres, where, you know, like some of the functions might be different, but essentially it's Postgres-compatible.
Okay.
Okay.
So, but you still operate there with other query engines, right? You don't expect the users to substitute, for example, BigQuery with Cube to do the data warehousing, right?
No, yeah. I mean, we look like Postgres to Tableau, Domo, ThoughtSpot, all of this, and then they send a query to Cube. And then once we get a query, we generate a real query to all the backends like Snowflake, Databricks, Starburst, all of these tools.
So it's sort of, you know, like a two-step process, right?
You first send a query to Cube,
which is a query to your semantic layer,
which is still a SQL query.
But then you've got a completely separate SQL query
based on your real data backend.
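The two-step flow Artyom describes, BI tool to semantic layer to warehouse, can be sketched roughly as follows. All names are hypothetical, and the real engine work (speaking the Postgres wire protocol, parsing with DataFusion) is elided:

```python
# Rough sketch of the two-step query flow: the BI tool sends a SQL query
# against the semantic model; the layer generates a second, real SQL
# query for the warehouse. Names and the toy rewrite are hypothetical.
SEMANTIC_MODEL = {
    # measure name -> (warehouse SQL expression, source table)
    "revenue": ("SUM(line_total)", "fact_orders"),
}

def plan_warehouse_query(measure: str, dimension: str) -> str:
    """Step 2: build the real warehouse SQL from the semantic model."""
    expr, table = SEMANTIC_MODEL[measure]
    return (f"SELECT {dimension}, {expr} AS {measure} "
            f"FROM {table} GROUP BY {dimension}")

def handle_bi_query(measure: str, dimension: str) -> str:
    # Step 1 would be parsing the incoming Postgres-protocol query from
    # Tableau/Domo/etc.; here we assume it's already parsed into parts.
    return plan_warehouse_query(measure, dimension)

print(handle_bi_query("revenue", "country"))
# SELECT country, SUM(line_total) AS revenue FROM fact_orders GROUP BY country
```

The point of the design is that the BI tool only ever sees the semantic model's names, while the warehouse only ever sees real SQL.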
Yeah, so like rewrite the query and make it like... So in that case, my question here is two things.
One is the user experience and how it is affected, because we add another layer of
interaction there, which is a very common way of solving problems in engineering,
but it probably also adds more latency there, right? We don't know, but it might. And the other thing is the developer experience, in terms of how do you debug issues now? Because now you don't have just the SQL that I write. Let's say I generate something in my Tableau, this thing which is just visual stuff for me as a user, goes to Cube. Cube rewrites the query and then executes the query on BigQuery, let's say, for example.
You have all these different
steps where the query gets
transformed one way or another
where things can go wrong for whatever reason, right? And the reason I'm asking is because I remember back in 2015 or 2016, back then at Blendo, we had a customer who was using Looker with Redshift, and they were in a total panic one day because something went wrong with their LookML, and Looker started generating some queries that really destroyed the cluster.
Right?
And these things can happen,
but it becomes harder for
the developer to debug.
So how do you find the right balance there, right?
Yeah, that's a good question.
So first, we'll talk about overhead, right?
And potentially adding some performance penalty here.
I feel like that's true.
You know, like you've got something in the middle that gets one query in and generates a second query. So that SQL generation kind of might take some time.
On Cube's side, we optimize it. So many things have been pre-compiled and reused from, you know, like a data model generation and compilation perspective. So we usually try to minimize that overhead to like 100 milliseconds or so.
And while we deal with analytics, usually we talk about seconds in analytics anyway. So
it's usually not a big overhead. We also have a very developed caching layer, because Cube started a lot from embedded analytics, right? And in embedded analytics, latency and performance are really critical. So that's why we have a really sophisticated caching layer that can help in many cases not only to mitigate that additional latency, but also to improve on even the scenario where you wouldn't have Cube in the middle, right? Cube actually can add latency, but can remove latency in many cases if you use the caching layer. So I would generally say that's not a problem we've seen with customers; it can be mitigated.
The second thing is really interesting
and you're spot on that sort of debugging issue,
observability, you know, like, how do you deal with that?
You know, the funny thing is that it's a problem, but an opportunity as well, and that's how Cube Cloud started. So first we had this open source project, and when we, you know, like started a company around it, we raised a seed round and we raised a Series A, and at some point,
okay, well, we need to start building a commercial product, right? We need to make money. And my
co-founder and I, we started to think about what we could build that creates value on top of Cube. And that's the first thing we built: an observability and debugging platform that helps you understand what's happening
with queries. Now, Cube Cloud is much bigger, so a lot of features, a lot of stuff, but that's how it started, and that's still a big part of it.
So we spend a lot of time to build
a lot of tools to help you
navigate issues because that's right.
Once you have something in the middle,
you have to give tools
to people to be able to
debug and understand what's happening. Yeah, 100%. All right, cool. I think that was a very interesting dive into the internals of a system like a semantic layer. For me, it was important to do that, because I think it's hard for people from the outside to see the complexity of building a system like that. And there are, as you said, still problems out there, and that's okay. Probably there are even better ways to do it. So let's move to the future.
And let's talk a little bit
because semantics is like a very interesting term.
And it's something that's also related a lot
with AI and LLMs,
with LLMs being like the technology,
a lot of you like the AI growth right now.
So in a world where, okay, people imagine that, I don't know, in a year or two from now, people would just talk to the laptop and the laptop will generate the SQL and, I don't know, come up with whatever. And we can argue if this is realistic or not, but I'm obviously
just replicating the hype here and exaggerating. But how do you see semantic layers working
together with LLMs and what's the importance, let's say, like of the two being together for the
organization?
Yeah, great question.
When LLMs, you know, like came out, right? And I think GPT, what, a year ago, right? I think it's been a little over a year ago now.
It created a lot of excitement.
And I think in data, one of the first use cases was like,
okay, now we can write SQL automatically. That's what everyone was thinking about. And a lot of attempts, and a lot of companies started to build around this idea again, because it's not a new idea. I mean, remember ThoughtSpot, right? When they started, it was all over their positioning and messaging: ThoughtSpot is like a text-to-SQL kind of generation. So we're doing it again now, with better technology for sure. I think we at Cube have been, you know, thinking a lot about it, talking to a lot of people, trying to do that. And sort of to summarize, you know, like my experience, you know, I was using LLMs for the
text to SQL generation. I think that the recent paper from the data.world team did a really good
job of kind of summarizing what's happening. So what they did is that they decided to do a
benchmark and they published it as a paper. So the idea of the
benchmark is: let's take a dataset with a bunch of relations. I think they took some public datasets in, like, an insurance kind of use case domain. And what they're doing
is that they're taking a set of questions and they expect a set of answers back.
And they can measure accuracy
if the answer is correct or not.
And they have specific prompts.
So the first attempt is to run it directly
on top of schema.
So essentially in the prompt,
they're saying,
you're about to answer the question.
Use this DDL to learn about the schema.
And then accuracy was about 16% or something in that case.
Not good.
Then what they did is they ran it over the knowledge graph.
So they took this ontology, like a standard ontology, they built it, and they kind of fed that ontology to the LLM, and they were saying, like, here's the ontology, run the query on top of the ontology.
And the result was about 56 or 58%, essentially 3x better, but still hit-and-miss overall. One question would be right, one question would be wrong, but it's still a 3x improvement. And after that, we have a few partners, companies that are building
some sort of text-to-SQL products on top of semantic layers. And these partners, their idea is that they believe a semantic layer can significantly improve the accuracy
of this solution. So what they did is they took Cube as a semantic layer, and then they ran the same benchmark on top of Cube, and they got to 100% accuracy. And the thing is that Cube gives all the semantics, all the relationships of the ontology, to the LLM system. And then you craft the prompt with all this information, and you just run it against GPT, and you get really good accuracy. So it went from 16%, which is raw SQL, to 100% accuracy. So I believe
if we want to build a future with, you know, like text to SQL products, it has to have a semantic layer part of it.
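The pattern those partners describe, feeding the semantic layer's metadata into the prompt so the LLM targets governed metrics instead of raw tables, might look something like this sketch. The prompt wording, metric catalog, and dimension names are all made up for illustration:

```python
# Sketch of prompting an LLM with semantic-layer metadata instead of raw
# DDL. The metric catalog and prompt wording are illustrative only.
SEMANTIC_LAYER = {
    "metrics": {
        "bounce_rate": "share of sessions with exactly one page view",
        "transaction_failure_rate": "failed transactions / all transactions",
    },
    "dimensions": ["country", "device", "week"],
}

def build_prompt(question: str) -> str:
    """Assemble a prompt that constrains the LLM to defined semantics."""
    metric_lines = "\n".join(
        f"- {name}: {desc}" for name, desc in SEMANTIC_LAYER["metrics"].items()
    )
    dims = ", ".join(SEMANTIC_LAYER["dimensions"])
    return (
        "Answer using ONLY these governed metrics and dimensions.\n"
        f"Metrics:\n{metric_lines}\n"
        f"Dimensions: {dims}\n"
        f"Question: {question}\n"
        "Respond with a query against the semantic layer."
    )

prompt = build_prompt("What is the bounce rate on my website by country?")
print("bounce_rate" in prompt)
# True
```

The accuracy gain in the benchmark comes from this constraint: the model maps the question onto a small, well-defined vocabulary rather than guessing joins and formulas from a schema.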
And now, I went to re:Invent last week, and AWS announced Q. It's like a big chatbot, right, that lives now in AWS products.
And it's connected to multiple products, including QuickSight. So you get it in QuickSight right now, and QuickSight is the BI from AWS, right? Like, just for the context.
So it's a pretty standard BI,
all the features you would expect from a BI.
And now they added this natural language.
So what you can do, you can sort of create,
they call it a topic,
which essentially kind of a data set
like representation that you get a bunch of measures, dimensions together,
build some kind of data set, you call it a topic,
and then they require you to give a lot of semantics to that topic,
like what are synonyms, how do you call your metrics,
you may have your jargon or specific acronyms
in your organization you use to call metric.
So you kind of essentially give all the semantics to the system and then it can give some results,
some kind of good results, right? So now what I think is going to happen is that every BI is
going to add feature like this. Now, every BI would require semantics to be inside that BI to make that feature work very well.
So it will create, and I started with that, right?
When you asked me what is a semantic layer, a universal semantic layer, I said the problem it's trying to solve is keeping things DRY across the data stack.
So now the problem is going to be even worse
because you will have all this semantics, synonyms,
all the words inside every BI
if you wanted to make it work very well with natural language, right?
So you'll have semantics scattered across all of these places.
So I think that's what's going to really happen.
And I think that's why the value of standalone semantic layer
would even be bigger in this like LLM-based world.
Yeah. So what is the semantic layer bringing? In the example that you mentioned, what is Cube bringing to the LLM that the ontology cannot capture, so that we get such a
big difference in the performance at the end? I think ontology, the idea of ontology is that they don't have... They're all about relationships mostly, right?
They don't have metrics.
And now in analytics, you kind of will ask about metrics anyway.
Like you would ask, what is my transaction failure rate, or what is the bounce rate on my website?
And where is it defined?
You need to define it somewhere, right?
So you either define it inside your BI, like you would do it for Q in QuickSight, you would exactly define it. But then you will have the same problem, right? Because it's going to be scattered across multiple places. Or you define it in some, you know, like standard place, like a semantic layer. Ontology by itself just
doesn't support it. I feel like maybe if we just take ontology and extend it to some degree,
you know, like to cover analytics, it would help.
But then in that case, it would just become a semantic layer, really, right?
Yeah, I get it.
Okay, it's like more about, like, it's not how the information is represented,
but like what information is actually included as part of the ontology.
Like, okay, theoretically, you could have an ontology describing metrics; nothing stops you from that. But obviously, that was not the case. And okay, that's interesting. And my question is, okay, when you completely remove, let's say, the guardrails from a human... because whether we call them UIs or a DSL or a language or whatever, at the end what we are doing is creating, let's say, a very rigid and strict context in which the human brain operates. Now, if we remove that and we let the user just type whatever they want, what can be asked is completely open.
We don't even have, let's say, syntactic checks there, as we have when we write code.
And my question is: sure, your semantic layer will always be a limited representation of the world out there, right? Like, you cannot have everything in your semantic layer. On the other hand, the user can ask whatever, right? So how do you handle the user experience in such an environment?
I don't expect you to have an answer.
To be honest, I think that's one of the really hard problems that people like us who build these systems have to answer when bringing LLMs to the market.
But I'd love to hear how you think about it.
Yeah.
I mean, that's a good question.
And it's obviously more about the distant future, right?
But I think it's good to think about it now.
So the flow as it is today, right, with sort of data, like sort of data products, is that you have a data engineer, right, who sort of owns this semantic layer.
And whether you're doing it inside your BI or, you know, like you're doing it in a cube,
you still define metrics somewhere.
And then you say you have a user who consumes that.
And then usually you have all this conversation,
like you have a meeting with a
marketing team and they say, hey, we have this HubSpot data. We want to look at the metrics,
you know, like maybe contact requests, form fills. And you kind of, as a data engineer, you're having a conversation with them, trying to understand what they want, and you try to map it: do I have it in my semantic layer? Do I have these metrics? Did I build this yet or not? And then you say, okay, I have this and that. I probably don't have these few metrics and dimensions, but I can build them for you, and next week we'll have a meeting and I will show you how to do that. And then you essentially do that, right? You build that and you say, here's, you know, like the metrics, the measures, here's the list of things you can do, like drag and drop, enjoy, right? And then they would probably say, hey, I'm missing this dimension, right?
Can you build it?
And it's like always comes and goes.
That's probably the best scenario, right?
The worst would be they're just asking for ad hoc reports.
But in the best case,
you're building an actual semantic layer
and they use the semantic layers.
But you're still going to develop it.
Semantic layer is not something you build once and you don still going to develop it. Semantic layer is not something you build once
and you don't need to touch it. So now, say in the future, we have this system you were describing,
right? So I think from that perspective, that system would somehow should act as a data
engineer as well that can modify the semantic layer because the system, the AI would know,
okay, this is the semantic layer I have.
You want me to get this information, but I don't have it yet, so I probably need to go and make a change to the semantic layer, and then you will be able to access all of this.
And it either makes an ad hoc change on the fly to extend the semantic layer at the given moment to satisfy your request, or it makes a more universal change that, you know, can be applied everywhere.
Now the question is, do we trust AI to fully automate that process, or is the person still going to be in the loop? That's a good question. You know, I can see a world where AI can make a pull request with a semantic layer change, and then a human would review the change and accept it. That can be
possible. Something may be fully automated and AI will
just kind of maintain it. So I think we'll see how that's going to be developed.
But that's roughly how I see the future flow.
Yeah, that makes a lot of sense.
Anyway, I think we are going to have some very
interesting months, maybe years, ahead of us, with for sure new things coming out there. And I'm very excited to see also what Cube is going to be building there. So, Eric, the microphone back to you. I know that you can't stay away from it for that long. Wow. I just, it has a magnetic pull, or I have a magnetic pull. We'll never know.
But Artyom, one personal question just to land the plane here. You know, we're always interested in
what people would do if they weren't working in data. So if you didn't have a job, you know, building tooling or working in the data space,
what would you do?
I would do,
I mean,
I'm a big fan of
table games
and Dungeons and Dragons
and, you know,
like video games,
RPGs,
all that stuff.
I would do that for a living.
I think
that would be a fun job.
So,
but,
I mean,
I got pulled into software engineering early on, you know, and I just haven't had a real chance to think about what I would do outside of it. That's something, you know, I would do. I understand it may not be, how would I put it, as profitable, but it still could be fun.
Yeah. Yeah.
No, that's interesting.
I mean, I think anyone who's played games knows a really good game feels so natural, and then you play a game that's designed pretty poorly and you're like, whoa, this is, like, not fun, you know?
So yeah, that would certainly be a fascinating, fascinating
problem space to tackle.
Yeah.
I think games, they're all about the stories. And I've actually been thinking a lot lately about,
you know,
like how we can apply AI to games. And I think, like a lot of people in the gaming industry, they're thinking about it, especially, you know, when it comes to RPGs and the story part of that.
I saw a few projects online where, like, you know, Dungeons and Dragons, right? Like you have a dungeon master, kind of, you know, leading a game. So there are projects online kind of making an AI-based dungeon master that would run a game for you, just based on LLMs. So I think that could be cool. So yeah, and then LLMs can
definitely have a really interesting impact on the industry.
Very cool. Well, Artyom, thanks for giving us some of your time. So great to have you back on the show.
Oh, yeah. Yeah, I had fun. Thank you. That was really good.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.