The Data Stack Show - 174: Does Your Data Stack Need a Semantic Layer? Featuring Artyom Keydunov of Cube Dev
Episode Date: January 24, 2024

Highlights from this week's conversation include:
- Artyom's background in the data space (0:32)
- The growth and changes at Cube (5:58)
- Pain points of managing metrics definitions across different tools... (9:39)
- Trade-offs between coupled and decoupled semantic layers (12:12)
- Making a case for implementing a semantic layer (14:17)
- The evolution of semantic layers (23:28)
- Challenges in designing a decoupled semantic layer (24:16)
- Different approaches to solving the interface problem (26:58)
- Implementing a SQL engine in Cube (35:58)
- Overhead and debugging in semantic layers (39:08)
- The semantic layer and its importance (46:26)
- The need for semantics in data products (47:34)
- What's the future of semantic layers and user experience? (51:49)
- Final thoughts and takeaways (57:34)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by RudderStack, the CDP for developers.
You can learn more at rudderstack.com.
We're here with Artyom, the co-founder and CEO of
Cube. Artyom, thanks for coming on the show and welcome. Thank you. Thank you for having me today.
My name is Artyom. I'm co-founder and CEO at a company called Cube. I also co-created
Cube open source project back in 2019. Then a year after,
I started a company with my co-founder. So I've been on a journey of building Universal
Semantic Layer and how it was going through the cycles of evolution in the last few years. So yeah,
exciting to be here and to chat all about semantic layers and data. Yeah, that's awesome.
And we've had you before.
It's been almost like a year since you were on the show, Artyom.
And many things have happened in the industry.
So I'm very curious to see how semantic layers have evolved in this one year.
And also, what's next, especially after the, let's say, this whole revolution that's happening right now with AI, LLMs, and all these new technologies around data.
So I'm really looking forward to chatting more about that stuff.
What about you?
What are a couple of things that you're really excited to chat about today?
Yeah, it was a great year for semantic layers, for sure.
And I'm very glad how the data community evolved
in their thinking about the need for semantic layers.
And I saw different vendors, different companies
coming up with semantic layer solutions.
And I'm definitely happy to see the category maturing
overall. And at Cube, you know, I hope we contributed a lot to the thinking,
to the framework of how the semantic layer should fit into the modern data stack.
And obviously, you know, like the elephant in the room this year was LLMs, right? Like an AI. And
I felt like it contributed to the, you know, like ideas and the
need for the semantic layer, because LLMs, they're all about semantics. They're essentially text
in, text out, right? And text is semantics. So that's why, you know, like I see that was a strong
tailwind. Okay, we really need semantics, not only for humans, but for AIs as well. And let's
talk about semantic layers, right?
And how we can get semantics about our data.
Yeah, no, that's awesome.
I think we have plenty to talk about.
So what do you think, Eric?
Let's dive in.
Let's do it.
Artyom, it was so fun to have you back on the show again.
It's been about a year and so many exciting updates to talk about.
Before we get into the topics, can you just remind our listeners what Cube is?
Cube is a semantic layer, or a universal and standalone semantic layer.
And the reason why I highlight standalone and universal
because I feel like semantic layers,
they were here for a long time, right?
BusinessObjects, they had a semantic layer.
The problem with that kind of started to happen
as we started to have more and more BI tools
and cloud made it really easy to buy more and more tools.
We started to have a lot of semantic layers
sort of scattered across different tools
because now we have five.
ThoughtSpot, Power BI,
Domo, Looker, Tableau, you name it.
And then all these BIs,
they have semantic layer attached to the product,
coupled with the product.
And now organizations,
there's five BIs and they all have those semantic layers. The problem is that we repeat ourselves
when we define a metric at every BI level. We go into one tool, we define all the metrics. We go
to the second tool, we define it. And the frameworks are always different. The way we
define metrics is different, but essentially we define the same metrics.
And it creates a problem, and your stack becomes not DRY.
And, you know, in engineering, we always try to keep things DRY.
Do not repeat yourself, right?
And that's the whole idea behind the semantic layer.
Let's make our data stack DRY at scale.
So, we take the metrics out of the BI tools and we define them in one place. We call
that place universal semantic layer, and it sits between cloud data warehouses and all the data
visualization tools. And then we just define the metrics in that place. And then we deliver
metrics to all the different data consumption tools. So that's the whole idea behind semantic
layer, a universal semantic layer, and that's what Cube is building.
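The "define once, deliver everywhere" idea can be sketched in a few lines of Python. Everything here, the `Metric` class, the metric names, and the tiny SQL generator, is an invented illustration of the concept, not Cube's actual data-model API:

```python
# A toy sketch of the "define once, query everywhere" idea behind a
# universal semantic layer. All names are illustrative, not Cube's API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str   # metric name exposed to every BI tool
    sql: str    # aggregation expression, defined exactly once
    table: str  # source table in the warehouse

# The single, shared definition that every consumer reuses.
METRICS = {
    "revenue": Metric("revenue", "SUM(amount)", "orders"),
    "active_users": Metric("active_users", "COUNT(DISTINCT user_id)", "events"),
}

def compile_query(metric_name: str, dimension: str) -> str:
    """Generate warehouse SQL for any downstream tool from one definition."""
    m = METRICS[metric_name]
    return (f"SELECT {dimension}, {m.sql} AS {m.name} "
            f"FROM {m.table} GROUP BY {dimension}")

print(compile_query("revenue", "country"))
# SELECT country, SUM(amount) AS revenue FROM orders GROUP BY country
```

Every tool asking for "revenue by country" gets the same generated SQL, so the definition never drifts between tools.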
What a great, concise definition.
It sounds like you've explained that a couple times before.
I did, yeah.
Let's talk about the last year.
So when we last had you on the show, Cube was focused on a fairly specific use case.
You know, I think that you were talking about semantics,
but not quite to the,
I know that you were focused on analytics use cases. And I think you were talking about this concept
of headless BI, if my memory serves me correctly.
So can you explain the journey
that you and Cube have been on
going from sort of the
positioning as an analytics type solution and using the term headless?
And then what led you?
How did the company grow and change?
What did you learn about your customers that sort of pushed the move to semantics?
Right.
I think we went through quite an evolution since we started a company and project in
2019, right?
The company started as an open source project in 2019, and then the company itself started
in 2020.
So it's been three years, a little over three years now.
I think that we initially had this big vision where we wanted to create metrics, semantics,
and then deliver them to different
places. But we never had really good semantics about semantics, I would say, how we would call
that, right? We had different names. I remember calling ourselves API for data at some point,
and then we were calling ourselves a headless BI, and then the metrics layer. And eventually,
it felt like the industry arrived at the term semantic layer and that everyone
is using semantic layer right now.
So even from a naming perspective, we went through several steps of evolution here.
And from a product perspective, I think when we first started, the obvious problem to solve was how we use metrics in an embedded analytics
application or customer-facing application. Because that's where you still need to build
a semantic layer, but you would build it manually. You're not going to use one provided by BI,
but you would actually write code in your Django app or in your Ruby on Rails app
to deliver the metrics to the customer, right?
So we thought, let's try to remove that piece
that developers need to build inside these frameworks
and just kind of make it generic.
So that was a very clear first use case
and it was a big need
and that's why it helped us to get initial traction.
As we started to build on top of this use case, we started to have customers saying to us, hey, we're using Cube to show metrics to
our customers inside their app, but we're kind of looking at the same metrics inside our BI tool
and inside our second BI tool and a third BI tool. Why not just use Cube to centralize all the metrics
for all the different places?
And that's how we started, you know,
like to go to this next step,
next kind of, you know, stage of evolution for Cube.
It's like, okay, let's bring all the BIs right now
to work on top of Cube.
And that's sort of how our vision is.
Kind of, you know, like, I wouldn't say expanded
because we always wanted to do that, but the product started to expand toward the bigger
vision, right? And then we added more BI tools and, you know, different AI apps this year.
So, you know, if you go on our website right now, it's quite a different website
than like a year ago and two years ago, right? We talk about the bigger vision right now.
I think that was a major change on our end as a product
is now we're not only serving embedded analytics,
but we're serving a bigger picture of powering
all the different data experiences in the organization.
Makes total sense.
One question I have for you is around adoption.
And what I mean by that is,
I'm just interested in the point at which your customers or users come to you. So you talk about having, you know, multiple different BI systems that all sort of have their own semantic
layers. It would seem that a lot of companies hit a pain point where they're managing
those metrics definitions across a number of different tools,
you know, and platforms. And so do you see a lot of, is that sort of the main inflection point
where companies come to Cube? Yeah, I think it's a compound problem, because you get so many BI tools, and then even inside
of one BI tool, like Tableau, you may have a lot of different workbooks and then every
workbook, it acts like a silo with all its metrics inside it.
And then you think, oh, how do I connect workbooks together or something?
So it sort of, you know, adds up to the problem every time you build a new dashboard, every time you do a new report, or someone tries to do analysis in Excel.
So that's why I think this problem is always top of mind for data engineers and data leaders, trying to find, okay, what is the best way to manage data modeling and metrics, you know,
because it feels always like we did so much progress with ideas around, you know, code
first management, applying software engineering best practices.
So we have matured data pipelines, you know, there's the medallion architecture,
all of these different ideas.
But then we sort of fail at the last moment, where we actually need to build metrics.
And I think that sort of creates this sort of anxiety or, you know, like uncomfortable
feeling among data leaders and data engineers.
There should be a better architecture to do this than how I'm doing it today.
That's sort of probably why people started
to think about semantic layer and talk to us and kind of explore different options.
Yeah, that makes total sense. Are you seeing more companies or teams try to start with this layer, you know, just from scratch, right from the beginning?
I think it happens, I see it happen sometimes. I think it's mostly something that comes after, you know,
like you have a warehouse, you have one, two, three BI tools, because then the problem is more evident, right? Like you see, now I have all that mess, I need to clean up the house here.
But I see companies are starting with thinking about, okay, semantic layer from the beginning,
which feels to me like the category maturing, more awareness, you know, data
teams that are aware that they would need a semantic layer sooner
because they would need it eventually, right?
So they're thinking, okay, let's put it sooner than later.
I think the caveat here is that sometimes there is,
there is sort of, you know, an opportunity to use a tool
like Looker, for example, which offers a great coupled semantic layer,
and it might look like a good idea today.
Say you're a mid-market company with 100-200 people,
still a small data team,
and you might need only one BI tool at this point,
and Looker is an option,
and semantic layer is coupled with Looker,
and everything makes sense.
You have your transformations, you have your semantic layer in LookML, and your visualization, and everything makes sense, right? The problem,
though, is that once your organization grows, you definitely will hire people who would
say, hey, I've used Tableau all my life, why should I use Looker, right? Yeah, yeah. And then Power BI
will come in, Sigma, Excel.
So that's always sort of a trade-off, you know. I see that when companies are picking a semantic layer, sometimes, you know, Looker really looks like a good option if you're small.
But then if you think about what happens next, a decoupled semantic layer would probably be the better option. So that's an interesting caveat I've been seeing, you know, when a company on the smaller side
is thinking about a semantic layer.
Yeah. Yeah.
You know, that's a tricky thing.
And I'd love for you to help us think about how,
you know, how data teams can make a case for this, right?
Because it's one of those things where, you know, you said,
which was my hypothesis as
well, that the perception of value becomes much higher when
you're in a lot of pain because you have multiple BI tools, right? But you save a lot of time and
money by not having to get to that place of pain in the first place. But I think one of the challenges is justifying the, you know,
expenditure of like, you know, cost, right?
Whether that's like, you know, paying for software or, you know,
your team actually implementing it.
How do you think about, you know,
how would you recommend that someone make a case for doing something that,
you know, it's kind of the thing
where it's like, you know, is this going to provide us immediate value now? No. Like,
will it save us a million dollars over the next three years? Yes. You know, in time that we would
have allocated to like wrangling all this data, right? How do you, how would you help someone make that case? Yeah, I think we need to,
and by we, I mean all the semantic layer providers and, you know, to some extent BI vendors
as well, the BI vendors that want to integrate with semantic layers. We need to make it as easy
and as cheap as possible for small teams to implement the best practices, you know, from the beginning.
Right.
Take our solution, for example: the way we think about the pricing at Cube, we try to make it scale with the organization.
So we don't want it to be as expensive as Looker, for example.
Right.
So we want to make sure it's cheap initially.
So you still have a budget for like maybe Superset,
Apache Superset, right?
Or Preset, which is a cloud version of Superset.
You bundle those tools together
and then you kind of go with that architecture
instead of a Looker where you would, you know,
like have vendor lock-in across the whole company
and at scale.
So I think from what we can do first,
we need to create a correct business model.
And then, as we said, you know,
you come in cheap first and then scale with usage.
But we also need to make sure that our products,
they offer a very good experience compared to coupled solutions, right?
Like obviously when you have a coupled solution,
it's easier to build a good user experience
versus when you have a decoupled
because you need to kind of try to make two products
work almost as well as a single product,
which is a really hard problem.
But we need to solve for this problem.
So I think these two things that we need to do
and then for data
teams to justify it, I guess it's really, you know, like understanding the best practices and
understanding that eventually you would need to scale. And, you know, if vendors make it as
easy as possible, with attractive cost and attractive integrations, then it would be
easier for data teams to just kind of go with that architecture
sooner rather than later.
Yep.
Makes total sense.
Okay.
Well, we're talking about semantic layers.
I feel like you've done a great job of explaining where the semantic layer sits.
dbt is, you know, a very widely used data tool, and they emphasize the semantic layer a lot.
How do you compare? You know, I think the way that you're describing a lot of things,
a lot of people would describe dbt in a similar way. So can you just
explain some of the key differences or even use cases? Sure. Maybe I'll first go a little
bit through the history, you know, of how dbt arrived at semantic layers.
Sure.
I think dbt started as a transformation tool, right?
Like dbt Core, a widely used and popular transformation tool.
And then dbt, the company, they kind of raised a lot of money as part of the low-interest-rate data stack phenomenon.
And then kind of started to build around the initial dbt Core CLI tool.
And I think at some point they announced that they wanted to build
a metrics layer or semantic layer, right?
The first attempt was to build it in-house,
and then they sort of failed to deliver on expectations.
And to solve this issue of failing,
they decided to buy a company called Transform Data.
And Transform Data was one of our competitors.
Really great team.
Well, so, you know, like it's always when you start a company,
it's always good to have competitors because you understand
probably other smart people, like, you know, like they're doing something the same as you so you're probably doing something
right right so like as always was good for me to know like competitors like transform data because
it was a lot of like reassurances that we're doing the right thing so what happened is that
we got a little bit more traction and transformed data from a business perspective
and kind of reining the category.
So transformed data, they decided to go and get acquired
and DBT acquired them.
I think that was like a second attempt of DBT to deliver that.
Now it happened almost a year ago.
I think we're still in a state where it's not quite clear what dbt Semantic Layer is as a product.
When I talk to the community, when I talk to users, I hear a lot of awareness about dbt Semantic Layer, but I don't see actual users and customers of the dbt semantic
layer, because I think the product is still not there. And I mean, it's hard for me,
you know, to kind of talk about and think about why it's happening, right? Because they have
all their reasons. That's a big company right now; they raised a lot of money, and it creates a lot of pressure. Maybe, you
know, they're looking at different areas of the product: how to, you know, optimize for
monetization, how to optimize, you know, the conversion from open source to cloud.
They brought in an AWS VP of product, right? Like, to solve all these problems. So it felt like the semantic layer is not getting enough attention
because it's a really hard problem to solve,
both technologically and product-wise.
And it's not like an existential thing for dbt, right?
If dbt fails at the semantic layer, they still have a business.
If Cube fails,
we don't have a business, right? So we have to make it right, we have to make it work. For dbt, it's
just one of the features they have at this point. Yeah, yeah, super interesting. One question, I should
have thought of this earlier, but I'm sure a lot of teams build their own kind of semantic layer to address some of these issues.
What tools are they using to do that?
You know, to essentially sort of mimic what Cube would do.
How are they doing it?
What does the in-house build look like for this?
So I think we can categorize these tools into simple versions
and more complicated ones.
A simple version of that would be to use data marts as sort of your semantic layer.
The problem with leveraging data marts is that you would have to create a data mart
for every level of detail, or grain, of the data,
because non-additive measures and joins all create this complexity where you
cannot have a single data mart serving, you know, metrics at multiple grains.
So in that case, you would have to create a lot of data marts, and that's what
some companies do. If you have a process for how you produce
and control these data marts, you can put
manual work into just doing that.
It's expensive, but it's possible.
But again, it's very expensive.
The other option would be to create your own in-house virtual layer
that would give you the virtual semantic layer
and that would generate SQL.
And I know some of the more sophisticated tech companies,
they build their own in-house versions of that.
But essentially that would be Cube, right?
Or that would be any layer
that offers you a virtual
data layer that actually
generates SQL when you query it.
This way you're kind of solving for this
grain, or level of detail,
problem. So I see essentially
two options: you either put a lot
of money and time into the
manual work of creating and
duplicating data marts, or you build your
in-house version of Cube.
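A tiny Python example makes the grain problem with data marts concrete: a non-additive measure like a distinct count, pre-aggregated at daily grain, cannot simply be summed up to a coarser grain, which is why a virtual layer has to regenerate SQL at the grain being queried. The toy data here is invented:

```python
# Why non-additive measures force one data mart per grain: a distinct count
# pre-aggregated at daily grain cannot be summed to get the monthly value.
# Toy data; the point generalizes to any COUNT(DISTINCT)-style measure.

events = [
    ("2024-01-01", "alice"), ("2024-01-01", "bob"),
    ("2024-01-02", "alice"), ("2024-01-02", "carol"),
]

# Daily "data mart": distinct users per day.
daily = {}
for day, user in events:
    daily.setdefault(day, set()).add(user)
daily_counts = {day: len(users) for day, users in daily.items()}

summed = sum(daily_counts.values())               # naive monthly "rollup"
true_monthly = len({user for _, user in events})  # correct distinct count

print(daily_counts)   # {'2024-01-01': 2, '2024-01-02': 2}
print(summed)         # 4 -- wrong: alice is double-counted
print(true_monthly)   # 3 -- a virtual layer re-queries at the right grain
```

Summing the daily mart gives 4, while the true monthly distinct count is 3, so a monthly question needs its own mart, or a layer that generates fresh SQL at monthly grain.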
Yep, makes total sense.
All right, well, Kostas, I've been monopolizing the mic,
so please jump in here because I know you have a ton of questions.
And I want to hear about the LLM and AI stuff.
Yeah, but before we get there,
I think it should be good to spend a little bit of time
getting a little bit
more technical about semantic layers.
You mentioned, Artyom, a few seconds ago that semantic layers are a hard
problem, both from a product perspective and a technical perspective, right?
What does this mean?
And let's focus on the technical side of things.
Why is building a semantic layer hard?
Right.
So essentially, at the heart or the core of a semantic layer,
you have a virtual data representation that can generate SQL, right?
That's, I think, the whole idea around a semantic layer,
even when the semantic layer is part of the BI, right?
Because if you look back at, you know, BusinessObjects,
like the first generation of BI,
or even at Looker,
we'll see that semantic layer essentially is this sort of a virtual representation of data
that lets you drag and drop things.
And then when you do this, when you build a query, then the system generates SQL and executes that SQL against your cloud data warehouse.
In the case of a live query, that is; with extracts, obviously, it's going to query its own data store. So I think the core of the problem is how you build a virtual layer
that exposes data as measures and dimensions to the end user and then generates SQL.
And now, you know, we have all these problems about how do we deal with joins?
How do we deal with fan-outs? You know, fan traps, chasm traps, all of that, when we're generating
SQL. So I think SQL generation and creating the right framework for abstraction,
that's one piece of the problem.
The other big problem, which was not solved by any coupled semantic layer,
is how do we make an interface to the semantic layer?
Because that problem really arises when we build a decoupled semantic layer.
Because when we have a decoupled semantic layer, it means we need to have an API so
a different system can connect to the semantic layer.
So from that perspective, the question would be like, okay, we have Tableau now.
How would Tableau connect to your decoupled semantic layer?
And there could be different ideas.
One can build a one-to-one connector with Tableau;
the problem here is that you would have to build
one-to-one connectors with all the tools.
And it might be just a maintenance burden,
even almost impossible.
So if you would look at different options,
what you can do, you will probably arrive at SQL.
You would think, okay, semantic layers should probably speak SQL, because all those tools
already speak SQL. Now, the problem with SQL is that SQL doesn't know about metrics,
right? SQL is just columns, right? So you kind of define metrics when you write a query:
you write like an average, a sum, you can do a little bit of math, right,
in your SQL.
But you cannot say, hey, SQL, give me that measure, right, that metric.
So that's where the problem is: how do we make SQL look almost like MDX, or
work almost like MDX so that it can query multidimensional data?
I think the missing piece here is the idea of a measure.
How do we make SQL aware of measures?
So when you write your SQL query, you can say, hey, I just wanted to get that measure
with that dimension and apply these filters and get a result back.
I think that would be the second biggest technical problem when building semantic layers.
So the first one is how do you
design the architecture, you know, how do you define measures, dimensions, all these objects, and
generate SQL. And then the second big problem would be how do you create an interface so tools
like Tableau can query a semantic layer. Yeah, let's start with the last one that you mentioned. So how do we do that?
Do we, yet another time,
try to extend SQL
and add more syntax there to do that?
Is it...
I don't know.
Do we just add metadata or annotation?
First of all, you're the expert here, so what are the different
attempts that people have tried
to solve that, outside of building
one-to-one
connections with every possible BI tool
out there?
Yeah, so far,
outside of one-to-one connectors,
I saw mostly two
attempts.
One is introduce your own query language.
And this query language would be just some sort of custom language.
Think about like a NoSQL-style database query language, right?
Like you still query something, but it's not a SQL.
That was what Transform Data had before dbt acquired them.
I don't think that's the right approach. The good thing about it is that it's native for querying metrics, which, you know, feels good when you use it.
The problem is that you have all this data infrastructure already built around SQL.
And then you need to go and pitch Tableau, Power BI:
hey, can you support my metrics query language?
Which is like an almost impossible thing to do, right?
So that's why I feel like it's not the right approach.
The other approach is to make it SQL first.
Now, here we have two branches of how to make that. One is what we do at Cube,
and what Looker is doing. I'll talk about it in a second. And the other is what dbt
Semantic Layer is currently doing. So what dbt is doing, they're taking SQL more as a container,
and then they put a bunch of Jinja inside it. So from that perspective, you use SQL more like just a protocol, right?
Like the container, but the actual querying happens right inside this Jinja template
inside the SQL.
I think it sort of, you know, might solve the connectivity issue, just the basic
connectivity.
But then the question would be, how would Tableau generate that query, right?
Because Tableau generates a SQL query when you do drag and drop.
Then you sort of come down to like one-to-one connectors,
because you would need to run like a driver for Tableau
that knows how to generate the Jinja template.
So I don't think that's quite a solution.
Again, it might be easier for a person, for a human, right,
to write that inside a SQL query if you use a tool like Hex
or if you use a tool like Mode.
But it might be really hard for tools like Power BI, Tableau
to actually generate that.
Yeah.
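The "SQL as a container" idea can be sketched roughly in Python: the governed metric definition lives behind a macro embedded in the SQL text, which the layer expands before execution. The macro name and syntax below are invented for illustration, not dbt's actual Jinja functions:

```python
# A rough sketch of the "SQL as a container" idea: the real metric reference
# lives in a macro inside the SQL text, which the semantic layer expands
# before sending the query to the warehouse. Macro syntax is invented here.

import re

def expand(sql_with_macro: str, metric_sql: dict) -> str:
    """Replace metric('name') macros with the governed aggregation SQL."""
    return re.sub(
        r"metric\('(\w+)'\)",
        lambda m: metric_sql[m.group(1)],
        sql_with_macro,
    )

metric_sql = {"revenue": "SUM(amount)"}

query = "SELECT country, metric('revenue') FROM orders GROUP BY country"
print(expand(query, metric_sql))
# SELECT country, SUM(amount) FROM orders GROUP BY country
```

A human can type the macro in a notebook-style tool, but a drag-and-drop tool like Tableau has no idea it should generate it, which is exactly the connectivity gap described above.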
And now the final option, that's what we do and what Looker is doing,
is to be as ANSI SQL compatible as possible,
with the addition of the measure type.
Now, the measure type is a special type in SQL
that would represent an already predefined metric,
meaning that it's a special column that knows how to evaluate itself.
So we define that it can be, you know, active
users, it can be like percentage of failed transactions, a ratio metric. And from the
SQL standpoint, it's going to be just a column in your table with the special type measure. And you
would use a function called MEASURE, a special aggregate function, to query this.
So you would say, hey, I want to get my measure back.
And then, by doing this SQL query, you're kind of telling the system: I don't want
to calculate it.
I know that you know how to calculate it.
Just give me the value back. So that seems like the least evil here in terms of changes to SQL, right?
Because we don't want to change SQL, right? But we have to make this minor change. And it feels like
the minimum necessary change we can make to make it work. The challenge here is that it might not be very natural for SQL
because we're kind of making SQL multidimensional at this point.
So we want to make sure to make it accurate
so we're not breaking any SQL standards and SQL expectations.
But that's possible to do.
So that's what we do at Cube, and that's what Looker is doing
with the Looker modeler.
And just for the context, when GCP acquired Looker,
they announced that they wanted to turn Looker into this sort of
semantic layer as well, and it means they need to build an interface.
An interface for that would be Apache Calcite,
which is being developed by Julian Hyde.
And that's how Julian and the team approach the problem of querying measures as well,
with a special measure type and a special measure function.
The one thing here to mention is that you may say Tableau doesn't have a measure function, right?
Like it only has SUM, AVERAGE. Here's an interesting thing:
we probably need to provide some backward compatibility
for BIs that still don't have it.
Maybe, you know, use SUM for measures with subtype sum,
or AVERAGE for measures
with subtype average.
There are some compatibility things,
but long-term, I think that approach is the most viable one.
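The MEASURE()-style approach can be sketched as a rewrite step: the query stays ANSI-shaped, and a special aggregate is expanded into its predefined definition before hitting the warehouse. The rewrite below is a toy illustration, not Cube's or Calcite's actual engine:

```python
# A sketch of the MEASURE()-style interface: SQL stays ANSI-shaped, but a
# special aggregate tells the layer "you already know how to compute this".
# The rewrite rules here are illustrative only.

import re

# Each measure knows how to evaluate itself; the querying tool does not.
MEASURE_DEFS = {
    "active_users": "COUNT(DISTINCT user_id)",
    "revenue": "SUM(amount)",
}

def rewrite(query: str) -> str:
    """Expand MEASURE(name) into the predefined aggregation before execution."""
    return re.sub(
        r"MEASURE\((\w+)\)",
        lambda m: MEASURE_DEFS[m.group(1)],
        query,
    )

q = "SELECT city, MEASURE(active_users) FROM events GROUP BY city"
print(rewrite(q))
# SELECT city, COUNT(DISTINCT user_id) FROM events GROUP BY city
```

Because the outer query is plain SQL, a BI tool can generate it with drag and drop, and the backward-compatibility idea above amounts to accepting SUM or AVG in place of MEASURE for measures with a matching subtype.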
So, okay.
What I see here is that...
Let's start with the approach of DBT.
So dbt is trying to solve the problem more on the front-end side, let's say, by
considering SQL more as a template and then having a pre-processor
there that's based on the Jinja logic and enriches the SQL with whatever it has to be.
Okay, I can see the value in that in terms of the flexibility. And most importantly, you don't really have to go back to the query engine and make changes
to the query engine, which is a pretty hard thing to do in general, right?
But as you say, you have the problem there that you need all the front-end tools at the
end to somehow understand these template languages and stuff in there.
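That pre-processor idea can be sketched very loosely in Python. The `{{ metric('...') }}` placeholder syntax and the metric catalog below are made up for illustration; dbt's real implementation uses Jinja macros and a full metric spec:

```python
# Loose sketch of the template pre-processor idea: metric references in a
# SQL template are expanded into real SQL before the query runs. The
# placeholder syntax and definitions are illustrative, not dbt's API.
METRIC_DEFINITIONS = {
    "total_revenue": "SUM(amount)",  # hypothetical metric definitions
    "order_count": "COUNT(*)",
}

def preprocess(template: str) -> str:
    """Expand {{ metric('name') }} placeholders into their SQL."""
    out = template
    for name, sql in METRIC_DEFINITIONS.items():
        out = out.replace("{{ metric('%s') }}" % name, sql)
    return out

query = "SELECT region, {{ metric('total_revenue') }} FROM orders GROUP BY region"
print(preprocess(query))
# SELECT region, SUM(amount) FROM orders GROUP BY region
```

The catch discussed above is visible here: the raw template is not valid SQL, so any front-end tool that wants to issue this query has to understand the template language first.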
And then you have the other approach of introducing new types,
which, okay, it sounds like the most engineering sound way
to do it, right?
But then, okay, you don't have necessarily to go and change things
on the front end that much, but you have to change things
on the back end, right?
So what's the solution there?
And let's keep it away from Google because Google
is a special kind of creature in terms of the resources that they have and how they think about
products. But let's talk about Cube, right? You can't go out there and be like, hey, Redshift, let's introduce another type here, right? So how do you do that? How do you implement that as Cube?
Yeah.
I think first, a high-level architecture here would be that Tableau generates that query with measures, all of that. That query is going to be sent to Cube, or any semantic layer. And then Cube, a semantic layer, should generate real SQL based on the data models, which is going to be executed in Snowflake or Databricks. And then kind of send the results all the way back to Tableau.
So in that case, the question is, how do we implement that SQL engine, right?
Like that can talk about measures and all of this.
Yeah, that's a challenge; at Cube we implement our own SQL engine.
So we're building one.
We're like building obviously on top of existing technologies, right?
We're using Apache Arrow DataFusion as a SQL parser, and to some extent the SQL logical planner, right? But we extend the planner on a level where we introduce the idea of measures and dimensions.
And then we build our execution engine
in a way that for some of the queries
and execution happens at the core part
of the semantic layer
where we generate the SQL query,
we execute it,
send it all the way back to sort of the Cube SQL engine. And then the rest of the execution happens just as regular SQL, because you might have an inner query that goes to your semantics, and then you can have an outer query that does some post-processing, right? Like when the data is fetched, you want to change it.
So it's kind of, it could be a combination of things.
But to answer your question, yes.
In that case, every semantic layer vendor that is going with that approach would need to have their own SQL engine.
And for us, we built it based on the Postgres protocol. So it's Postgres-compliant; we also support the Redshift style of Postgres, where, you know, like some of the functions might be different, but essentially it's Postgres-compatible.
Okay.
Okay.
So, but you still operate there with other query engines, right? You don't expect the users to substitute, for example, BigQuery with Cube to do the data warehousing, right?
No, yeah. I mean, we look like Postgres to Tableau, Domo, ThoughtSpot, all of this, and then they send a query to Cube. And then once we get a query, we generate a real query to all the backends like Snowflake, Databricks, Starburst, all of these tools.
So it's sort of, you know, like a two-step process, right?
You first send a query to Cube,
which is a query to your semantic layer,
which is still a SQL query.
But then you've got a completely separate SQL query
based on your real data backend.
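The two-step flow Artyom describes, BI tool to semantic layer to warehouse, can be sketched roughly as follows. All names are hypothetical, and the real engine work (speaking the Postgres wire protocol, parsing with DataFusion) is elided:

```python
# Rough sketch of the two-step query flow: the BI tool sends a SQL query
# against the semantic model; the layer generates a second, real SQL
# query for the warehouse. Names and the toy rewrite are hypothetical.
SEMANTIC_MODEL = {
    # measure name -> (warehouse SQL expression, source table)
    "revenue": ("SUM(line_total)", "fact_orders"),
}

def plan_warehouse_query(measure: str, dimension: str) -> str:
    """Step 2: build the real warehouse SQL from the semantic model."""
    expr, table = SEMANTIC_MODEL[measure]
    return (f"SELECT {dimension}, {expr} AS {measure} "
            f"FROM {table} GROUP BY {dimension}")

def handle_bi_query(measure: str, dimension: str) -> str:
    # Step 1 would be parsing the incoming Postgres-protocol query from
    # Tableau/Domo/etc.; here we assume it's already parsed into parts.
    return plan_warehouse_query(measure, dimension)

print(handle_bi_query("revenue", "country"))
# SELECT country, SUM(line_total) AS revenue FROM fact_orders GROUP BY country
```

The point of the design is that the BI tool only ever sees the semantic model's names, while the warehouse only ever sees real SQL.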
Yeah, so like rewrite the query and make it like... So in that case, my question here is two things.
One is the user experience and how it is affected, because we add another layer of
interaction there, which is a very common way of solving problems in engineering,
but it probably also adds more latency there, right? We don't know, but it might. And the other thing is the developer experience, in terms of how do you debug issues now? Because now you don't have just the SQL that I write. Let's say I generate something in my Tableau, this thing which is just visual stuff for me as a user, goes to Cube. Cube rewrites the query and then executes the query on BigQuery, let's say, for example.
You have all these different
steps where the query gets
transformed one way or another
where things can go wrong for whatever reason, right? And the reason I'm asking is because I remember back in 2015 or 2016, back then at Blendo, we had a customer who was using Looker with Redshift, and they were in a total panic one day because something went wrong with their LookML, and Looker started generating some queries that really destroyed the cluster.
Right?
And these things can happen,
but it becomes harder for
the developer to debug.
So how do you find the right balance there, right?
Yeah, that's a good question.
So first, we'll talk about overhead, right?
And potentially adding some performance penalty here.
I feel like that's true.
You know, like you've got something in the middle that gets one query in and generates a second query. So that SQL generation kind of might take some time.
On Cube's side, we optimize it. So many things have been pre-compiled and reused from, you know, like a data model generation and compilation perspective. So we usually try to minimize that overhead to like 100 milliseconds or so.
And while we deal with analytics, usually we talk about seconds in analytics anyway. So
it's usually not a big overhead. We also have a very developed caching layer, because Cube started a lot from embedded analytics, right? And in embedded analytics, latency and performance are really critical. So that's why we have a really sophisticated caching layer that can help in many cases not only to mitigate that additional latency, but also to improve on even the scenario where you wouldn't have Cube in the middle, right? Cube actually can add latency, but can remove latency in many cases if you use the caching layer. So I would generally say that's not a problem we've seen with customers; it can be mitigated.
The second thing is really interesting
and you're spot on that sort of debugging issue,
observability, you know, like, how do you deal with that?
You know, the funny thing is that it's a problem, but an opportunity as well, and that's how Cube Cloud started. So first we had this open source project, and when we, you know, like started a company around it, we raised a seed round and we raised a Series A, and at some point,
okay, well, we need to start building a commercial product, right? We need to make money. And my
co-founder and I, we started to think about what we could build that creates value on top of Cube. And that's the first thing we built: an observability and debugging platform that helps you understand what's happening
with queries. Now, Cube Cloud is much bigger, so a lot of features, a lot of stuff, but that's how it started, and that's still a big part of it.
So we spend a lot of time to build
a lot of tools to help you
navigate issues because that's right.
Once you have something in the middle,
you have to give tools
to people to be able to
debug and understand what's happening. Yeah, 100%. All right, cool. I think that was a very interesting dive into the internals of a system like a semantic layer. For me, it was important to do that, because I think it's hard for people from the outside to see the complexity of building a system like that. And there are, as you said, still problems out there, and that's okay. Probably there are even better ways to do it. So let's move to the future.
And let's talk a little bit
because semantics is like a very interesting term.
And it's something that's also related a lot
with AI and LLMs,
with LLMs being like the technology,
a lot of you like the AI growth right now.
So in a world where, okay, people imagine that, I don't know, in a year or two from now, people would just talk to the laptop and the laptop will generate the SQL and, I don't know, come up with whatever. And we can argue if this is realistic or not, but I'm obviously
just replicating the hype here and exaggerating. But how do you see semantic layers working
together with LLMs and what's the importance, let's say, like of the two being together for the
organization?
Yeah, great question.
When LLMs, you know, like came out, right? And I think GPT, what, a year ago, right? I think it's been a little over a year ago now.
It created a lot of excitement.
And I think in data, one of the first use cases was like,
okay, now we can write SQL automatically. That's what everyone was thinking about. And a lot of attempts, and a lot of companies started to build around this idea again, because it's not a new idea. I mean, remember ThoughtSpot, right? When they started, it was all over their positioning and messaging: ThoughtSpot is like a text-to-SQL kind of generation. So we're doing it again now, with better technology for sure. I think we at Cube have been, you know, thinking a lot about it, talking to a lot of people, trying to do that. And sort of to summarize, you know, like my experience, you know, I was using LLMs for the
text to SQL generation. I think that the recent paper from the data.world team did a really good
job of kind of summarizing what's happening. So what they did is that they decided to do a
benchmark and they published it as a paper. So the idea of the
benchmark is: let's take a dataset with a bunch of relations. I think they took some public datasets in, like, an insurance kind of use case domain. And what they're doing
is that they're taking a set of questions and they expect a set of answers back.
And they can measure accuracy
if the answer is correct or not.
And they have specific prompts.
So the first attempt is to run it directly
on top of schema.
So essentially in the prompt,
they're saying,
you're about to answer the question.
Use this DDL to learn about the schema.
And then accuracy was about 16% or something in that case.
Not good.
Then what they did is they ran it over the knowledge graph.
So they took this ontology, like a standard ontology, they built it, and they kind of fed that ontology to the LLM, and they were saying, like, here's the ontology, run the query on top of the ontology.
And the result was about 56 or 58%, essentially 3x better, but still hit-and-miss overall. One question would be right, one question would be wrong, but it's still a 3x improvement. And after that, we have a few partners, companies that are building
some sort of text-to-SQL products on top of semantic layers. And these partners, their idea is that they believe a semantic layer can significantly improve the accuracy
of this solution. So what they did is they took Cube as a semantic layer, and then they ran the same benchmark on top of Cube, and they got to 100% accuracy. And the thing is that Cube gives all the semantics, all the relationships of the ontology, to the LLM system. And then you craft the prompt with all this information, and you just run it against GPT, and you get really good accuracy. So it went from 16%, which is raw SQL, to 100% accuracy. So I believe
if we want to build a future with, you know, like text to SQL products, it has to have a semantic layer part of it.
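The pattern those partners describe, feeding the semantic layer's metadata into the prompt so the LLM targets governed metrics instead of raw tables, might look something like this sketch. The prompt wording, metric catalog, and dimension names are all made up for illustration:

```python
# Sketch of prompting an LLM with semantic-layer metadata instead of raw
# DDL. The metric catalog and prompt wording are illustrative only.
SEMANTIC_LAYER = {
    "metrics": {
        "bounce_rate": "share of sessions with exactly one page view",
        "transaction_failure_rate": "failed transactions / all transactions",
    },
    "dimensions": ["country", "device", "week"],
}

def build_prompt(question: str) -> str:
    """Assemble a prompt that constrains the LLM to defined semantics."""
    metric_lines = "\n".join(
        f"- {name}: {desc}" for name, desc in SEMANTIC_LAYER["metrics"].items()
    )
    dims = ", ".join(SEMANTIC_LAYER["dimensions"])
    return (
        "Answer using ONLY these governed metrics and dimensions.\n"
        f"Metrics:\n{metric_lines}\n"
        f"Dimensions: {dims}\n"
        f"Question: {question}\n"
        "Respond with a query against the semantic layer."
    )

prompt = build_prompt("What is the bounce rate on my website by country?")
print("bounce_rate" in prompt)
# True
```

The accuracy gain in the benchmark comes from this constraint: the model maps the question onto a small, well-defined vocabulary rather than guessing joins and formulas from a schema.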
And now, I went to re:Invent last week, and AWS announced Q. It's like a big chatbot, right, that lives now in AWS products.
And it's connected to multiple products, including QuickSight. So you get it in QuickSight right now, and QuickSight is the BI from AWS, right? Like, just for the context.
So it's a pretty standard BI,
all the features you would expect from a BI.
And now they added this natural language.
So what you can do, you can sort of create,
they call it a topic,
which essentially kind of a data set
like representation that you get a bunch of measures, dimensions together,
build some kind of data set, you call it a topic,
and then they require you to give a lot of semantics to that topic,
like what are synonyms, how do you call your metrics,
you may have your jargon or specific acronyms
in your organization you use to call metric.
So you kind of essentially give all the semantics to the system and then it can give some results,
some kind of good results, right? So now what I think is going to happen is that every BI is
going to add feature like this. Now, every BI would require semantics to be inside that BI to make that feature work very well.
So it will create, and I started with that, right?
When you asked me what is a semantic layer, a universal semantic layer, I said the problem it's trying to solve is keeping things DRY across the data stack.
So now the problem is going to be even worse
because you will have all this semantics, synonyms,
all the words inside every BI
if you wanted to make it work very well with natural language, right?
So you'll have semantics scattered across all of these places.
So I think that's what's going to really happen.
And I think that's why the value of standalone semantic layer
would even be bigger in this like LLM-based world.
Yeah. So what is the semantic layer bringing? In the example that you mentioned, what is Cube bringing to the LLM that the ontology cannot capture, so that we get such a
big difference in the performance at the end? I think ontology, the idea of ontology is that they don't have... They're all about relationships mostly, right?
They don't have metrics.
And now in analytics, you kind of will ask about metrics anyway.
Like you would ask, what is my transaction failure rate, or what is the bounce rate on my website?
And where is it defined?
You need to define it somewhere, right?
So you either define it inside your BI, like you would do it for Q in QuickSight, you would exactly define it. But then you will have the same problem, right? Because it's going to be scattered across multiple places. Or you define it in some, you know, like standard place, like a semantic layer. Ontology by itself just
doesn't support it. I feel like maybe if we just take ontology and extend it to some degree,
you know, like to cover analytics, it would help.
But then in that case, it would just become a semantic layer, really, right?
Yeah, I get it.
Okay, it's like more about, like, it's not how the information is represented,
but like what information is actually included as part of the ontology.
Like, okay, theoretically, you could have an ontology describing metrics; nothing stops you from that. But obviously, that was not the case. And okay, that's interesting. And my question is, okay, when you completely remove, let's say, the guardrails from a human... because whether we call them UIs or a DSL or a language or whatever, at the end what we are doing is creating, let's say, a very rigid and strict context in which the human brain operates. Now, if we remove that and we let the user just type whatever they want, what can be asked is completely open.
We don't even have, let's say, syntactic checks there, as we have when we write code.
And my question is: sure, your semantic layer will always be a limited representation of the world out there, right? Like, you cannot have everything in your semantic layer. On the other hand, the user can ask whatever, right? So how do you handle the user experience in such an environment?
I don't expect you to have an answer.
To be honest, I think that's one of the really hard problems that people like us who build these systems have to answer when bringing LLMs to the market.
But I'd love to hear how you think about it.
Yeah.
I mean, that's a good question.
And it's obviously more about the distant future, right?
But I think it's good to think about it now.
So the flow as it is today, right, with sort of data, like sort of data products, is that you have a data engineer, right, who sort of owns this semantic layer.
And whether you're doing it inside your BI or, you know, like you're doing it in a cube,
you still define metrics somewhere.
And then you say you have a user who consumes that.
And then usually you have all this conversation,
like you have a meeting with a
marketing team and they say, hey, we have this HubSpot data. We want to look at the metrics,
you know, like maybe contact requests, form fills. And you kind of, as a data engineer, you're having a conversation with them, trying to understand what they want, and you try to map it: do I have it in my semantic layer? Do I have these metrics? Did I build this yet or not? And then you say, okay, I have this and that. I probably don't have these few metrics and dimensions, but I can build them for you, and next week we'll have a meeting and I will show you how to do that. And then you essentially do that, right? You build that and you say, here's, you know, like the metrics, the measures, here's the list of things you can do, like drag and drop, enjoy, right? And then they would probably say, hey, I'm missing this dimension, right?
Can you build it?
And it's like always comes and goes.
That's probably the best scenario, right?
The worst would be they're just asking for ad hoc reports.
But in the best case,
you're building an actual semantic layer
and they use the semantic layers.
But you're still going to develop it.
Semantic layer is not something you build once and you don still going to develop it. Semantic layer is not something you build once
and you don't need to touch it. So now, say in the future, we have this system you were describing,
right? So I think from that perspective, that system would somehow should act as a data
engineer as well that can modify the semantic layer because the system, the AI would know,
okay, this is the semantic layer I have.
You want me to get this information, but I don't have it yet, so I probably need to go and make a change to the semantic layer, and then you will be able to access all of this.
And it either makes an ad hoc change on the fly to extend the semantic layer at the given moment to satisfy your request, or it makes a more universal change that, you know, can be applied everywhere.
Now the question is, do we trust AI to fully automate that process, or is the person still going to be in the loop? That's a good question. You know, I can see a world where AI can make a pull request with a semantic layer change, and then a human would review the change and accept it. That can be
possible. Something may be fully automated and AI will
just kind of maintain it. So I think we'll see how that's going to be developed.
But that's roughly how I see the future flow.
Yeah, that makes a lot of sense.
Anyway, I think we are going to have some very
interesting months, maybe years, ahead of us, with for sure new things coming out there. And I'm very excited to see also what Cube is going to be building there. So, Eric, the microphone back to you. I know that you can't stay away from it for that long. Wow. I just, it has a magnetic pull, or I have a magnetic pull. We'll never know.
But Artyom, one personal question just to land the plane here. You know, we're always interested in
what people would do if they weren't working in data. So if you didn't have a job, you know, building tooling or working in the data space,
what would you do?
I would do,
I mean,
I'm a big fan of
table games
and Dungeons and Dragons
and, you know,
like video games,
RPGs,
all that stuff.
I would do that for a living.
I think
that would be a fun job.
So,
but,
I mean,
I got pulled into software engineering early on, you know, and I just haven't had a real chance to think about what I would do outside of it. That's something, you know, I would do. I understand it may not be, how would I put it, as profitable, but it still could be fun.
Yeah. Yeah.
No, that's interesting.
I mean, I think anyone who's played games knows a really good game feels so natural, and then you play a game that's designed pretty poorly and you're like, whoa, this is, like, not fun, you know?
So yeah, that would certainly be a fascinating, fascinating
problem space to tackle.
Yeah.
I think games, they're all about the stories. And I've actually been thinking a lot lately about,
you know,
like how we can apply AI to games. And I think, like a lot of people in the gaming industry, they're thinking about it, especially, you know, when it comes to RPGs and the story part of that.
I saw a few projects online where, like, you know, Dungeons and Dragons, right? Like you have a dungeon master, kind of, you know, leading a game. So there are projects online kind of making an AI-based dungeon master that would run a game for you, just based on LLMs. So I think that could be cool. So yeah, and then LLMs can
definitely have a really interesting impact on the industry.
Very cool. Well, Artyom, thanks for giving us some of your time. So great to have you back on the show.
Oh, yeah. Yeah, I had fun. Thank you. That was really good.
We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite
podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric@datastackshow.com.
That's E-R-I-C at datastackshow.com.
The show is brought to you by Rudderstack,
the CDP for developers.
Learn how to build a CDP on your data warehouse
at rudderstack.com.