Latent Space: The AI Engineer Podcast - Powering your Copilot for Data – with Artem Keydunov of Cube.dev

Episode Date: October 26, 2023

The first workshops and talks from the AI Engineer Summit are now up! Join the >20k viewers on YouTube, find clips on Twitter (we’re also clipping @latentspacepod), and chat with us on Discord!

Text-to-SQL was one of the first applications of NLP. Thoughtspot offered “Ask your data questions” as their core differentiation compared to traditional dashboarding tools. In a way, they provide a much friendlier interface with your own structured (aka “tabular”, as in “SQL tables”) data, the same way that RLHF and Instruction Tuning helped turn the GPT-3 of 2020 into the ChatGPT of 2022.

Today, natural language queries on your databases are a commodity. There are 4 different ChatGPT plugins that offer this, as well as a bunch of startups like one of our previous guests, Seek.ai. Perplexity originally started with a similar product in 2022.

In March 2023 LangChain wrote a blog post on LLMs and SQL highlighting why they don’t consistently work:

* “LLMs can write SQL, but they are often prone to making up tables, making up field”
* “LLMs have some context window which limits the amount of text they can operate over”
* “The SQL it writes may be incorrect for whatever reason, or it could be correct but just return an unexpected result.”

For example, if you ask a model to “return all active users in the last 7 days” it might hallucinate an `is_active` column, join to an `activity` table that doesn’t exist, or potentially get the wrong date (especially in leap years!).

We previously talked to Shreya Rajpal at Guardrails AI, which also supports Text2SQL enforcement. Their approach was to run the actual SQL against your database and then use the error messages to improve the query.

Semantic Layers to the rescue

Cube is an open source semantic layer which recently integrated with LangChain to solve these issues in a different way.
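For contrast with the semantic layer approach, the execute-and-repair idea described above can be sketched in a few lines. This is a minimal illustration, not Guardrails' actual implementation: the `users` table, its `last_seen_at` column, and the helper `try_query` are assumptions made up for the example, and the "repaired" query is written by hand where a real system would ask the model to regenerate it from the error message.

```python
import sqlite3


def try_query(conn, sql):
    """Run `sql`; return (rows, None) on success or (None, error_message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


# Hypothetical schema: there is no `is_active` column, only a last-seen date.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, last_seen_at TEXT)")
conn.execute(
    "INSERT INTO users VALUES "
    "(1, date('now', '-2 days')), (2, date('now', '-30 days'))"
)

# A query of the kind a model might hallucinate for "active users in the last 7 days".
hallucinated = "SELECT id FROM users WHERE is_active = 1"
rows, error = try_query(conn, hallucinated)
# `error` now names the missing column; it would be appended to the next prompt.

# Stand-in for the model's second attempt after seeing the error message.
repaired = "SELECT id FROM users WHERE last_seen_at >= date('now', '-7 days')"
repaired_rows, repaired_error = try_query(conn, repaired)
```

The database itself acts as the validator here, so nothing has to be defined up front. The catch is that a query can run without errors and still return the wrong number.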
You can use YAML, JavaScript, or Python to create definitions of different metrics, measures and dimensions for your data. Creating these metrics and passing them in the model context limits the possibility for errors as the model just needs to query the `active_users` view, and Cube will then expand that into the full SQL in a reliable way. The downside of this approach compared to the Guardrails one for example is that it requires more upfront work to define metrics, but on the other hand it leads to more reliable and predictable outputs.

The promise of adding a great semantic layer to your LLM app is irresistible - you greatly minimize hallucinations, make much more token efficient prompts, and your data stays up to date without any retraining or re-indexing. However, there are also difficulties with implementing semantic layers well, so we were glad to go deep on the topic with Artem as one of the leading players in this space!

Timestamps

* [00:00:00] Introductions
* [00:01:28] Statsbot and limitations of natural language processing in 2017
* [00:04:27] Building Cube as the infrastructure for Statsbot
* [00:08:01] Open sourcing Cube in 2019
* [00:09:09] Explaining the concept of a semantic layer/Cube
* [00:11:01] Using semantic layers to provide context for AI models working with tabular data
* [00:14:47] Workflow of generating queries from natural language via semantic layer
* [00:21:07] Using Cube to power customer-facing analytics and natural language interfaces
* [00:22:38] Building data-driven AI applications and agents
* [00:25:59] The future of the modern data stack
* [00:29:43] Example use cases of Slack bots powered by Cube
* [00:30:59] Using GPT models and limitations around math
* [00:32:44] Tips for building data-driven AI apps
* [00:35:20] Challenges around monetizing embedded analytics
* [00:36:27] Lightning Round

Transcript

Swyx: Hey everyone, welcome to the Latent Space podcast.
This is Swyx, writer, editor of Latent Space and founder of Smol.ai and Alessio, partner and CTO in residence at Decibel Partners. [00:00:15]

Alessio: Hey everyone, and today we have Artem Keydunov on the podcast, co-founder of Cube. Hey Artem. [00:00:21]

Artem: Hey Alessio, hi Swyx. Good to be here today, thank you for inviting me. [00:00:25]

Alessio: Yeah, thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. And Cube is actually a spin-out of his previous company, which is Statsbot. And this kind of feels like going both backward and forward in time. So the premise of Statsbot was having a Slack bot that you can ask, basically like text to SQL in Slack, and this was six, seven years ago, something like that. A lot ahead of its time, and you see startups trying to do that today. And then Cube came out of that as a part of the infrastructure that was powering Statsbot. And Cube then evolved from an embedded analytics product to the semantic layer and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today, you have a very active open source community. But maybe for people at home, just give a quick like lay of the land of the original Statsbot product. You know, what got you interested in like text to SQL and what were some of the limitations that you saw then, the limitations that you're also seeing today in the new landscape? [00:01:28]

Artem: I started Statsbot in 2016. The original idea was to just make sort of a side project based off my initial project that I did at a company that I was working for back then. And I was working for a company that was building software for schools, and we were using Slack a lot. And Slack was growing really fast, a lot of people were talking about Slack, you know, like Slack apps, chatbots in general. So I think it was, you know, like another wave of, you know, bots and all that. We have one more wave right now, but it always comes in waves.
So we were like living through one of those waves. And I wanted to build a bot that would give me information from different places where data lives to Slack. So it was like developer data, like New Relic, maybe some marketing data, Google Analytics, and then some just regular data, like a production database, Salesforce sometimes. And I wanted to bring it all into Slack, because we were always chatting, you know, like in Slack, and I wanted to see some stats in Slack. So that was the idea of Statsbot, right, like bring stats to Slack. I built that as a, you know, like a first sort of a side project, and I published it on Reddit. And people started to use it even before Slack came up with that Slack application directory. So it was a little, you know, like a hackish way to install it, but people were still installing it. So it was a lot of fun. And then Slack kind of came up with that application directory, and they reached out to me and they wanted to feature Statsbot, because it was already one of the widely used bots on Slack. So they featured me on this application directory front page, and I just got a lot of, you know, like new users signing up for that. It was a lot of fun, I think, you know, but there was sort of a big limitation in terms of how you can process natural language, because the original idea was to let people ask questions directly in Slack, right, hey, show me my, you know, like opportunities closed last week or something like that. My co-founder, who kind of started helping me with this Slack application, he and I were trying to build a system to recognize that natural language. But, you know, we didn't have LLMs back then and all of that technology. So it was really hard to build the system, especially a system that can kind of, you know, like keep talking to you, like maintain some sort of a dialogue. It was a lot of like one-off requests, and like, it was a lot of hit and miss, right?
If you know how to construct a query in natural language, you will get a result back. But you know, like, it was not a system that was capable of, you know, like asking follow-up questions to try to understand what you actually want. And then kind of finally, you know, like, bring all this context together and go generate a SQL query, get the result back and all of that. So that was really the missing part. And I think right now, that's, you know, like, what the difference is. So right now, I'm kind of bullish that if I started Statsbot again, I probably would have a much better shot at it. But back then, that was a big limitation. We kind of built Cube, right, as we were working on Statsbot, because we needed it. [00:04:27]

Alessio: What was the ML stack at the time? Were you building, trying to build your own natural language understanding models, like were there open source models that were good that you were trying to leverage? [00:04:38]

Artem: I think it was mostly a combination of a bunch of things. And we tried a lot of different approaches. The first version, which I built, was Regex. It was working well. [00:04:47]

Swyx: It's the same as I did, I did option pricing when I was in finance, and I had a natural language pricing tool thing. And it was Regex. It was just a lot of Regex. [00:04:59]

Artem: Yeah. [00:05:00]

Artem: And my co-founder, Pavel, he's much smarter than I am. He has like a PhD in math, all of that. And he started to do some stuff. I was like, no, you just do that stuff. I don't know. I can do Regex. And he started to do some models and trying to either look at what we had on the market back then, or try to build different sorts of models. Again, we didn't have any foundation in place back then, right? We wanted to try to use existing math, obviously, right? But it was not something where we could just take a model and try to run it.
I think in 2019, we started to see more stuff, like an ecosystem being built, and then it eventually kind of resulted in all these LLMs, like what we have right now. But back then in 2016, there was not much available for people to just build on top of. There was some academic research, right, kind of happening. But it was very, very early for something you could actually use. [00:05:58]

Alessio: And then that became Cube, which started just as an open source project. And I think I remember going on a walk with you in San Mateo in 2020, something like that. And you had people reaching out to you who were like, hey, we use Cube in production. I just need to give you some money, even though you guys are not a company. What's the story of Cube then from Statsbot to where you are today? [00:06:21]

Artem: We built Cube at Statsbot because we needed it. It was like, the whole Statsbot stack was that we first tried to translate the initial sort of natural language query into some sort of multidimensional query. It's like we were trying to understand, okay, people wanted to get active opportunities, right? What does it mean? Is it a metric? What is a dimension here? Because usually in analytics, you always, you know, like, try to reduce everything down to the sort of, you know, like a multidimensional framework. So that was the first step. And that's where, you know, like it didn't really work well because of this limitation of us not having foundational technologies. But then from the multidimensional query, we wanted to go to SQL. And that's what the semantic layer was, and what Cube was, essentially. So we built a framework where you would be able to map your data into these concepts, into these metrics. Because when people were coming to Statsbot, they were bringing their own datasets, right? And the big question was, how do we tell the system what active opportunities means for that specific user? How do we kind of, you know, like provide that context, how do we do the training?
So that's why we came up with the idea of building the semantic layer so people can actually define their metrics and then kind of use them in Statsbot. So that's how we built Cube. At some point, we saw people starting to see more value in Cube itself, you know, like kind of building the semantic layer and then using it to power different types of applications. So in 2019, we decided, okay, it feels like it might be a standalone product and a lot of people want to use it. Let's just try to open source it. So we took it out of Statsbot and open-sourced it. [00:08:01]

Swyx: Can I make sure that everyone has the same foundational knowledge? The concept of a cube is not something that you invented. I think, you know, not everyone has the same background in analytics and data that all three of us do. Maybe you want to explain like OLAP Cube, HyperCube, the brief history of cubes. Right. [00:08:17]

Artem: I'll try. You know, there are a lot of Wikipedia pages and a lot of blog posts trying to go into the academics of it. So I'm trying to like... [00:08:25]

Swyx: Cubes according to you. Yeah. [00:08:27]

Artem: So when we think about just a table in a database, the problem with the table is that it's not multidimensional, meaning that in many cases, if we want to slice the data, we kind of need to end up with a different table, right? Like think about when you're writing a SQL query to answer one question, a SQL query always ends up with a table, right? So you write one SQL query, you get one table. And then to answer a different question, you write a second query. So you're kind of getting a bunch of tables. So now let's imagine that we can kind of bring all these tables together into a multidimensional table. And that's essentially a cube. So it's just a way that we can have measures and dimensions that can potentially be used at the same time from different angles.
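Editor's sketch: the "one multidimensional table instead of many result tables" idea can be shown with plain Python. This is only an illustration under assumed data; the rows, the measure names `total_amount` and `order_count`, and the helper `slice_cube` are hypothetical and are not Cube's API.

```python
from collections import defaultdict

# Hypothetical fact rows: two dimensions (region, plan) and one numeric column.
ROWS = [
    {"region": "EU", "plan": "free", "amount": 0},
    {"region": "EU", "plan": "pro", "amount": 50},
    {"region": "US", "plan": "pro", "amount": 70},
    {"region": "US", "plan": "pro", "amount": 30},
]

# Measures are defined once, independently of how the data is sliced.
MEASURES = {
    "total_amount": lambda rows: sum(r["amount"] for r in rows),
    "order_count": len,
}


def slice_cube(rows, measure, dimensions):
    """Group rows by the chosen dimensions and apply one measure per group."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row)
    return {key: MEASURES[measure](group) for key, group in groups.items()}


# The same definitions answer different questions without new hand-written queries.
by_region = slice_cube(ROWS, "total_amount", ["region"])
by_plan = slice_cube(ROWS, "order_count", ["plan"])
by_region_and_plan = slice_cube(ROWS, "total_amount", ["region", "plan"])
```

Each call corresponds to one of the separate SQL queries Artem describes; the measures and dimensions stay fixed while the angle changes.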
[00:09:09]

Alessio: So initially, a lot of your use cases were more BI related, but you recently released a LangChain integration. There's obviously more and more interest in, again, using these models to answer data questions. So you've seen the ChatGPT Code Interpreter, which is renamed as like Advanced Data Analysis. What's kind of like the future of like the semantic layer in AI? You know, what are like some of the use cases that you're seeing and why do you think it's a good strategy to make it easier to do now the text to SQL you wanted to do seven years ago? [00:09:39]

Artem: Yeah. So, I mean, you know, when it started to happen, I was just like, oh my God, people are now building Statsbot with Cube. They just have a better technology for, you know, like natural language. So it made sense to me, you know, like from the first moment I saw it. So I think it's something that's, you know, like happening right now and the chatbot is one of the use cases. I think, you know, like if you try to generalize it, the use case would be how do we use structured or tabular data with, you know, like AI models, right? Like how do we take the data and give it context, and then bring it to the model, and then the model can, you know, like give you answers, ask questions, do whatever you want. But the question is like how we go from just the data in your data warehouse, database, whatever, which is usually just tabular data, right? Like in SQL-based warehouses, to some sort of, you know, like a context that the system can use. And if you're building this application, you have to do it. There's no way you can get away with not doing this. You either map it manually or you come up with some framework or something else. So our take, and my take, is that the semantic layer is just a really good place for this context to live, because you need to give this context to the humans. You need to give that context to the AI system anyway, right?
So that's why you define a metric once and then, you know, like you teach your AI system what this metric is about. [00:11:01]

Alessio: What are some of the challenges of using tabular versus language data and some of the ways that having the semantic layer kind of makes that easier maybe? [00:11:09]

Artem: Imagine you're a human, right? And you're like a new data analyst at a company and people just give you a warehouse with a bunch of tables and they tell you, okay, just try to make sense of this data. And you're going through all of these tables and you're really like trying to make sense of them without any, you know, like additional context. Some columns, in many cases, might have weird names. Sometimes, you know, if they follow some kind of like a star schema or, you know, like Kimball-style dimensions, maybe that would be easier because you would have fact and dimension columns, but it's still hard to understand and kind of make sense of, because it doesn't have descriptions, right? And then there is like a whole industry of data catalogs that exists because the whole purpose of that is to give context to the data so people can understand it. And I think the same applies to AI, right? The challenge is the same: if you give it pure tabular data, it doesn't have this sort of context that it can read. So you sort of need to write a book or like an essay about your data and give that book to the system so it can understand it. [00:12:12]

Alessio: Can you run through the steps of how that works today? So the initial part is like the natural language query, like what are the steps that happen in between, to go from model to semantic layer, semantic layer to SQL, and all that flow? [00:12:26]

Artem: The first key step is to do some sort of indexing. That's what I was referring to, like write a book about your data, right? Describe in a text format what your data is about, right?
Like what metrics it has, dimensions, what the structure of that is, what the relationships between those metrics are, what the potential values of the dimensions are. So sort of, you know, like build a really good index as a text representation and then turn it into embeddings in your, you know, like a vector storage. Once you have that, then you can provide that as a context to the model. I mean, there are a lot of options, like either fine-tuning or, you know, like sort of in-context learning, but somehow kind of give that as a context to the model, right? And then once this model has this context, it can create a query. Now the query I believe should be created against the semantic layer because it reduces the room for error. Because what usually happens is that your query to the semantic layer would be very simple. It would be like, give me that metric grouped by that dimension, and maybe that filter should be applied. And then your real query for the warehouse, it might have like five joins, a lot of different techniques, like how to avoid fan-out, fan traps, chasm traps, all of that stuff. And the bigger the query, the more room the model has to make an error, right? Sometimes it could even be a small error and then, you know, like your numbers are going to be off. But making a query against the semantic layer sort of reduces the error. So the model generates a SQL query and then it executes it against the semantic layer. And the semantic layer executes it against your warehouse and then sends the result all the way back to the application. And then this can be done multiple times, because what we were missing was this ability to have a conversation, right? With the model. You can ask a question and then the system can ask follow-up questions, you know, like then do a query to get some additional information and, based on this information, do a query again.
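Editor's sketch: the flow described here (describe the data as text, let the model pick a metric and dimensions, then let the semantic layer expand that into warehouse SQL) can be mocked up as below. Everything in it is an assumption for illustration: the `SEMANTIC_LAYER` dictionary, the `orders` and `customers` tables, and the functions `describe_layer` and `compile_query` are hypothetical and much simpler than what Cube actually does.

```python
# Hypothetical, hand-written semantic layer: names map to vetted SQL fragments.
SEMANTIC_LAYER = {
    "table": "orders",
    "joins": ["JOIN customers ON customers.id = orders.customer_id"],
    "measures": {"revenue": "SUM(orders.amount)"},
    "dimensions": {"country": "customers.country", "status": "orders.status"},
}


def describe_layer(layer):
    """Text description of the layer, i.e. the 'book about your data' to index or prompt with."""
    lines = [f"measure {name}: {expr}" for name, expr in layer["measures"].items()]
    lines += [f"dimension {name}: {expr}" for name, expr in layer["dimensions"].items()]
    return "\n".join(lines)


def compile_query(measure, group_by, filters=None, layer=SEMANTIC_LAYER):
    """Expand a small (measure, dimension, filters) request into full SQL plus parameters."""
    filters = filters or {}
    if measure not in layer["measures"]:
        raise KeyError(f"unknown measure: {measure}")
    for name in [group_by, *filters]:
        if name not in layer["dimensions"]:
            raise KeyError(f"unknown dimension: {name}")
    parts = [
        f"SELECT {layer['dimensions'][group_by]} AS {group_by},",
        f"{layer['measures'][measure]} AS {measure}",
        f"FROM {layer['table']}",
        *layer["joins"],
    ]
    if filters:
        conditions = " AND ".join(f"{layer['dimensions'][d]} = ?" for d in filters)
        parts.append(f"WHERE {conditions}")
    parts.append(f"GROUP BY {layer['dimensions'][group_by]}")
    return " ".join(parts), list(filters.values())


context = describe_layer(SEMANTIC_LAYER)
# The model only has to produce these three small arguments, not the joins.
sql, params = compile_query("revenue", "country", {"status": "paid"})
```

The point of the design is that a hallucinated name such as `is_active` fails fast with a `KeyError` instead of reaching the warehouse, and the join logic is written once by a human rather than regenerated by the model on every request.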
And sort of, you know, like it can keep doing this stuff and then eventually maybe give you a big report that consists of a lot of data points. But the whole flow is that the system knows your data, because you already kind of did the indexing, and then it queries the semantic layer instead of the data warehouse directly. [00:14:47]

Alessio: Maybe just to make it a little clearer for people that haven't used a semantic layer before, you can add definitions like revenue, where revenue is like select from customers and like join orders and then sum of the amount of orders. But in the semantic layer, you're kind of hiding all of that away. So when you do natural language to Cube, it just selects revenue from last week and then it turns into a bigger query. [00:15:12]

Swyx: One of the biggest difficulties around semantic layer for people who've never thought about this concept before, this all sounds super neat until you have multiple stakeholders within a single company who all have different concepts of what a revenue is. They all have different concepts of what active user is. And then they'll have like, you know, revenue revision one by the sales team, you know, and then revenue revision one, accounting team or tax team, I don't know. I feel like I always want semantic layer discussions to talk about the not so pretty parts of the semantic layer, because this is where effectively you ship your org chart in the semantic layer. [00:15:47]

Artem: I think the way I think about it is that at the end of the day, the semantic layer is a code base. And in Cube, it's essentially a code base, right? It's not just a set of YAML files, it's YAML files with Python. I think code is never perfect, right? It's never going to be perfect. It will have a lot of, you know, like revisions. We have version control, which makes it easier to deal with revisions. So I think we should treat our metrics and semantic layer as code, right? And then collaboration is a big part of it.
You know, like if there are multiple teams that sort of have different opinions, let them collaborate on the pull request, you know, they can discuss that, like why they think that should be calculated differently, have an open conversation about it, you know, where everyone can just discuss it, like an open source community, right? Like you go on GitHub and you talk about why that code is written the way it's written, right? Or why it should be written differently. And then hopefully at some point you can come, you know, like to some definition. Now if you still need to have multiple versions, right? It's code, right? You can still manage it. But I think the big part of that is that we really need to treat it as a code base. Then it makes a lot of things easier, not as spreadsheets, you know, like hidden Excel files. [00:16:53]

Alessio: The other thing is like then having the definition spread in the organization, like versus everybody trying to come up with their own thing. But yeah, I'm sure that when you talk to customers, there's people that have issues with the product and it's really like two people trying to define the same thing. One in sales that wants to look good, the other is like the finance team that wants to be conservative and they all have different definitions. How important is the natural language to people? Obviously you guys both work in modern data stack companies either now or before. There's going to be the whole wave of empowering data professionals. I think now a big part of the wave is removing the need for data professionals to always be in the loop and having non-technical folks do more of the work. Are you seeing that as a big push too with these models, like allowing everybody to interact with the data? [00:17:42]

Artem: I think it's a multidimensional question. That's an example of, you know, like where you have a lot inside the question.
In terms of examples, I think a lot of people are building different, you know, like agents or chatbots. You have a company that built an internal Slack bot that sort of answers questions, you know, like based on the data in a warehouse. And then a lot of people kind of go in and ask that chatbot questions. Is it a real big use case? Maybe. Is it still like a toy pet project? Maybe too, right now. I think it's really hard to tell them apart at this point because there is a lot of hype, you know, and people are just building LLM stuff because it's cool and everyone wants to build something, you know, like at least a pet project. So that's what's happening in the Cube community as well. We see a lot of people building a lot of cool stuff and it probably will take some time for that stuff to mature and kind of to see what the real, best use cases are. But I think from what I saw so far, one use case was building this chatbot, and we even have one company that is building it as a service. So they essentially connect to Cube's semantic layer and then offer their chatbot. So you can use it on the web, in Slack, so it can, you know, like answer questions based on data in your semantic layer. But I also see a lot of things that are just being built in-house. And there are other use cases, sort of automation, you know, where an agent checks on the data and then kind of performs some actions based, you know, like on changes in the data. But the other dimension of your question is like, will it replace people or not? I think, you know, like what I see so far in data specifically is a few use cases of LLMs. In the first one, I don't see Cube being part of that use case, but it's more like a copilot for a data analyst, a copilot for a data engineer, where you develop something, you develop a model and it can help you to write SQL or something like that.
So you know, it can create boilerplate SQL, and then you can edit this SQL, which is fine because you know how to edit SQL, right? So you're not going to make a mistake, but it will help you to just generate, you know, like a bunch of SQL that you write again and again, right? Like boilerplate code. So sort of a copilot use case. I think that's great. And we'll see more of it. I think every platform that is building for data engineers will have some sort of copilot capabilities, and at Cube, we're building these copilot capabilities to help people build semantic layers more easily. I think that's just a baseline for every engineering product right now, to have some sort of, you know, like copilot capabilities. Then the other use case, where Cube is a little bit more involved, is like, how do we enable access to data for non-technical people through natural language as an interface to data, right? Like visual dashboards, charts, that has always been the interface to data in every BI. Now I think we will see just a second interface, which is kind of natural language. So I think at this point, many BIs will add it as a commodity feature. Like, Tableau will probably have a search bar at some point saying like, Hey, ask me a question. I know that some of them, you know, like AWS QuickSight, they're about to announce features like this in their BI. And I think Power BI will do that, especially with their deal with OpenAI. So every company, every BI will have some sort of search capabilities built inside their BI. So I think that's just going to be a baseline feature for them as well. But that's where Cube can help because we can provide that context, right? [00:21:07]

Alessio: Do you know how, or do you have an idea for how these products will differentiate once you get the same interface? So right now there's like, you know, Tableau is like the super complicated one and then there's like Superset. It's like easier. Yeah.
Do you just see everything will look the same and then how do people differentiate? [00:21:24]

Artem: It's like they all have a line chart, right? And they all have a bar chart. I feel like it's pretty much the same, and it's going to be fragmented as well. And every major vendor and most of the vendors will try to have some sort of natural language capabilities and they might be a little bit different. Some of them will try to position the whole product around it. Some of them will just have them as a checkbox, right? So we'll see, but I don't think it's going to be something that will change the BI market, you know, like something that can take the BI market and make it more consolidated rather than, you know, like what we have right now. I think it will still remain fragmented. [00:22:04]

Alessio: Let's talk a bit more about application use cases. So people also use Cube for kind of like analytics in their product, like dashboards and things like that. How do you see that changing and more, especially like when it comes to like agents, you know, so there's like a lot of people trying to build agents for reporting, building agents for sales. If you're building a sales agent, you need to know everything about the purchasing history of the customer. All of these things. Yeah. Any thoughts there? What should all the AI engineers listening think about when implementing data into agents? [00:22:38]

Artem: Yeah, I think kind of, you know, like trying to solve for two problems. One is how to make sure that the agent or LLM, right, has enough context about, you know, like the tabular data, and also, you know, like how do we deliver updates to the context, which is also important because data is changing, right? So every time we change something upstream, we surely need to update that context in our vector database or something. And how do you make sure that the queries are correct?
You know, I think it's obviously a big pain, and that's the whole, you know, like AI space right now: how do we make sure that we don't, you know, provide wrong answers. But I think, you know, being able to reduce the room for error as much as possible is what I would look for, you know, to try to minimize potential damage. And then our use case for Cube, it's been used a lot to power sort of customer-facing analytics. So I don't think much is going to change, except that I feel like, again, more and more products will adopt natural language interfaces as sort of a part of that product as well. So we would be able to power not only, you know, like charts and visuals, but also some sort of, you know, like summaries. Probably in the future, you're going to open a page with some stats and you will have a smart summary kind of generated by AI. And that summary can be powered by Cube, right, because the rest is already being powered by Cube. [00:24:04]

Alessio: You know, we had Linus from Notion on the pod and one of the ideas he had that I really like is kind of like thumbnails of text, kind of like how do you like compress knowledge and then start to expand it. A lot of that comes into dashboards, you know, where like you have a lot of data, you have like a lot of charts and sometimes you just want to know, hey, this is like the three lines summary of it. [00:24:25]

Artem: Exactly. [00:24:26]

Alessio: Makes sense that you want to power that. How are you thinking about, yeah, the evolution of like the modern data stack in quotes, whatever that means today. What's like the future of what people are going to do? What's the future of like what models and agents are going to do for them? Do you have any, any thoughts? [00:24:42]

Artem: I feel like modern data stack sometimes is not very, I mean, there's obviously a big crossover between the AI, you know, like ecosystem, the AI infrastructure ecosystem, and then sort of data.
But I don't think it's a full overlap. So I feel like when, you know, like I'm looking at a lot of what's happening in the modern data stack, where like we use warehouses, we use BIs, you know, different transformation tools, catalogs, data quality tools, ETLs, all of that, I don't see a lot being impacted by AI specifically. I think, you know, that space is being impacted as much as any other space in terms of, yes, we'll have all these copilot capabilities, some AI capabilities here and there, but I don't see anything sort of dramatically, you know, being changed or shifted because of, you know, like the AI wave. In terms of just the data space in general, I think in the last two, three years, we saw an explosion, right? Like we got a lot of tools, a vendor for every problem. I feel like right now we should go through the cycle of consolidation. If Fivetran and dbt merge, they can be the Alteryx of a new generation or something like that. And you know, probably some ETL tool there. I feel it might happen. I mean, it's just natural waves, you know, like in cycles. [00:25:59]

Alessio: I wonder if everybody is going to have their own copilot. The other thing I think about these models is like Swyx was at Airbyte and yeah, there's Fivetran. [00:26:08]

Swyx: Fivetran versus Airbyte, I don't think it'll mix very well. [00:26:10]

Alessio: A lot of times these companies are doing the syntax work for you of like building the integration between your data store and like the app or another data store. I feel like now these models are pretty good at coming up with the integration themselves and like using the docs to then connect the two. So I'm really curious, like in the future, what that will look like. And same with data transformation. I mean, you think about dbt and some of these tools and right now you have to create rules to normalize and transform data.
In the future, I could see you explaining to the model how you want the data to be, and then the model figuring out how to do the transformation. I think it all needs a semantic layer as far as like figuring out what to do with it. You know, what's the data for and where it goes. [00:26:53]Artem: Yeah, I think many of these, you know, like workflows will be augmented by, you know, like some sort of a copilot. You know, you can describe what transformation you want to see and it can generate a boilerplate, right, of the transformation for you, or even, you know, like kind of generate a boilerplate of a specific ETL driver or ETL integration. I think we're still not at the point where this code can be fully automated. So we still need a human in the loop, right, like who can use this copilot. But in general, I think, yeah, data work and software engineering work can be augmented quite significantly with all that stuff. [00:27:31]Alessio: You know, the big thing with machine learning before was like, well, all of your data is bad. You know, the data is not good for anything. And I think like now, at least with these models, they have some knowledge of their own and they can also tell you if your data is bad, which I think is like something that before you didn't have. Any cool apps that you've seen being built on Cube, like any kind of like AI native things that people should think about, new experiences, anything like that? [00:27:54]Artem: Well, I see a lot of Slack bots. They all remind me of Statsbot, but I know, like, I played with a few of them. They're much, much better than Statsbot. It feels like it's on the surface, right? It's just that use case that you really want. You know, think about it: you're a data engineer in your company, and everyone is asking you, hey, can you pull that data for me? And you would be like, can I build a bot to replace myself? You know, like, so they can ping that bot instead.
So it's like, that's why a lot of people are doing that. So I think it's the first use case that people are actually playing with. But I think inside that use case, people get creative. So I see bots that can actually have a dialogue with you. So, you know, like you would come to that bot and say, hey, show me metrics. And the bot would be like, what kind of metrics? What do you want to look at? You will be like, active users. And then it would be like, how do you define active users? Do you want to see active users as sort of a cohort, do you want to see active users kind of changing behavior over time, like a lot of follow-up questions. So it tries to sort of, you know, like understand what exactly you want. And that's how many data analysts work, right? When people start to ask you something, you always try to understand what exactly they mean. Because many people don't know how to ask correct questions about your data. It's sort of an interesting spectrum. On one side of the spectrum, you know nothing; it's like, hey, show me metrics. And on the other side of the spectrum, you know how to write SQL, and you can write the exact query to your data warehouse, right? So many people are a little bit in the middle. And the data analysts, they usually have the knowledge about your data. And that's why they can ask follow-up questions to understand what exactly you want. And I saw people building bots who can do that. That part is amazing. I mean, like generating SQL, all that stuff, it's okay, it's good. But when the bot can actually act like it knows your data and it can ask follow-up questions, I think that's great. [00:29:43]Swyx: Yeah. [00:29:44]Alessio: Are there any issues with the models and the way they understand numbers? One of the big complaints people have is like GPT, at least 3.5, cannot do math. Have you seen any limitations and improvement? And also when it comes to what model to use, do you see most people use like GPT-4?
Because it's like the best at this kind of analysis. [00:30:03]Artem: I think I saw people use all kinds of models. To be honest, it's usually GPT. So inside GPT, it could be 3.5 or 4, right? But it's not like I see a lot of something else, to be honest, like, I mean, maybe some open source alternatives, but it feels like the market is being dominated by just ChatGPT. In terms of the problems, I've been chatting about it with a few people. So if math is required, you do the math, you know, like outside of ChatGPT itself, so it would be like some additional Python scripts or something. When we're talking about production-level use cases, it's quite a lot of Python code around, you know, like your model to make it work. To be honest, it's not that magic that you just throw the model in and it can give you all these answers. For like toy use cases, the one we have on, you know, like our demo page or something, it works fine. But, you know, like if you want to do a lot of post-processing, do math on your own, you probably need to code it in Python anyway. That's what I see people doing. [00:30:59]Alessio: We heard the same from Harrison at LangChain that most people just use OpenAI. We did an "OpenAI has no moat" emergency podcast, and it was funny to like just see the reaction that people had to that and how hard it actually is to break down some of the monopoly. What else should people keep in mind, Artem? You're kind of like at the cutting edge of this. You know, if I'm looking to build a data-driven AI application, I'm trying to build data into my AI workflows. Any mistakes people should avoid? Any tips on the best stack to use? What tools to use? [00:31:32]Artem: I would just recommend going to a warehouse as soon as possible. I think a lot of people feel that MySQL can be a warehouse, which it can be, maybe on like a lower scale, but definitely not from a performance perspective.
So just kind of starting with a good warehouse, a query engine, Lakehouse, that's probably like something I would recommend starting with from day zero. And there are good ways to do it, very cheap, with open source technologies too, especially in the Lakehouse architecture. I think, you know, I'm biased, obviously, but using a semantic layer, preferably Cube, for, you know, like context. And other than that, I just feel it's a very interesting space in terms of the AI ecosystem. I see a lot of people using LangChain right now, which is great, you know, like, and we built an integration. But I'm sure the space will continue to evolve and, you know, like we'll see a lot of interesting tools and maybe, you know, like some tools would be a better fit for the job. I'm not aware of any right now, but it's always interesting to see how it evolves. Also it's a little unclear, you know, like how all the infrastructure around actually developing, testing, documenting, all that stuff will kind of evolve too. But yeah, again, it's just like really interesting to see and observe, you know, what's happening in this space. [00:32:44]Swyx: So before we go to the lightning round, I wanted to ask you for your thoughts on embedded analytics and, in a sense, the kind of chatbots that people are inserting on their websites and building with LLMs, which is very much sort of end user programming or end user interaction with their own data. I love seeing embedded analytics, and for those who don't know, embedded analytics is basically user-facing dashboards where you can see your own data, right? Instead of the company seeing data across all their customers, it's an individual user seeing their own data as a slice of the overall data that is owned by the platform that they're using. So I love embedded analytics. Well, actually, overwhelmingly, the observation that I've had is that people who try to build in this market fail to monetize. And I was wondering about your insights on why.
[00:33:31]Artem: I think overall, the statement is true. It's really hard to monetize, you know, like in embedded analytics. That's why at Cube we're excited more about our internal kind of BI use case, or like companies building, you know, like chatbots for their internal data consumption or like internal workflows. Embedded analytics is hard to monetize because it's historically been dominated by the BI vendors. And we still see a lot of organizations using BI tools from vendors. And like what I was talking about, BI vendors adding natural language interfaces, they will probably add that to the embedded analytics capabilities as well, right? So they would be able to embed that too. So I think that's part of it. Also, you know, if you look at the embedded analytics market, the bigger the organization, the more custom, you know, like it becomes, and at some point I see many organizations just stop using any vendor, and they just kind of build most of the stuff from scratch, which is probably, you know, like the right way to do it. So it's sort of, you know, like you've got a market that is very capped at the top. And then in that middle and small segment, you've got a lot of vendors trying to, you know, like compete for the buyers. And because, again, BI is very fragmented, embedded analytics is therefore fragmented also. So you're really going after the mid-market slice, and then with a lot of other vendors competing for that. So that's why it's historically been hard to monetize, right? I don't think AI is really going to change that, just because to use the model, you just pay OpenAI. And that's it, like everyone can do that, right? So it's not much of a competitive advantage. So it's going to be more like a commodity feature that a lot of vendors would be able to leverage. [00:35:20]Alessio: This is great, Artem. As usual, we got our lightning round. So it's three questions.
One is about acceleration, one on exploration, and then a takeaway. The acceleration thing is what's something that already happened in AI or maybe, you know, in data that you thought would take much longer, but it's already happening today. [00:35:38]Artem: To be honest, all these foundational models. I thought that we had a lot of models that had been in production for like, you know, maybe a decade or so. And it was like very niche use cases, very vertical use cases, just like very customized models. And even when we were building Statsbot back then in 2016, right, even back then, we had some natural language models being deployed, like Google Translate or something that was still sort of a model, right, but it was very customized for a specific use case. So I thought that would continue for like many years: we will use AI, we'll have all these customized niche models. But now there are foundational models; they're very generic, they can serve many, many different use cases. So I think that is a big change. And I didn't expect that, to be honest. [00:36:27]Swyx: The next question is about exploration. What is one thing that you think is the most interesting unsolved question in AI? [00:36:33]Artem: I think AI is a subset of software engineering in general. And it's sort of connected to the data as well. Because software engineering as a discipline, it has quite a history. We built a lot of processes, you know, like toolkits and methodologies, how we prod that, [00:36:50]Swyx: right. [00:36:51]Artem: But AI, I don't think it's completely different. But it has some unique traits, you know, like, it's quite not idempotent, right, and kind of from many dimensions, and like other traits, which may require different methodologies, different approaches, and a different toolkit.
I don't know how much it's going to deviate from standard software engineering. I think many tools and practices that we developed for software engineering can be applied to AI. And some of the data best practices can be applied as well. But it's like we got DevOps, right, like it's just a bunch of tools, like an ecosystem. So now AI kind of feels like it's shaping into that, with a lot of its own, you know, like methodologies, practices and toolkits. So I'm really excited about it. And I think there are a lot of still-unsolved questions, again: how do we develop that? How do we test, you know, like, what are the best practices? What are the methodologies? So I think that would be interesting to see. [00:37:44]Alessio: Awesome. Yeah. Our final message, you know, you have a big audience of engineers and technical folks, what's something you want everybody to remember to think about to explore? [00:37:55]Artem: I mean, as someone who tried to build a chatbot, you know, like for analytics, back then, and kind of, you know, like looking at what people do right now, I think, yeah, just do that. I mean, it's working right now. With foundational models, it's actually now possible to build all those cool applications. I'm so excited to see, you know, like, how much changed in the last six years or so, that we actually now can build smart agents. So I think that's sort of, you know, like the takeaway. And yeah, we, as humans in general, we really move technology forward. And it's fun to see it, you know, like, just firsthand. [00:38:30]Alessio: Well, thank you so much for coming on Artem. [00:38:32]Swyx: This was great. [00:38:32] This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.latent.space/subscribe

Transcript
Starting point is 00:00:06 Hey, everyone. Welcome to the Latent Space podcast. This is Swyx, writer, editor of Latent Space and founder of Smol AI, and Alessio, partner and CTO in residence at Decibel Partners. Hey, everyone. And today we have Artem Keydunov on the podcast, co-founder of Cube. Hey, Artem. Hi, Alessio, hi Swyx. Good to be here today. Thank you for inviting me. Yeah. Thanks for joining. For people that don't know, I've known Artem for a long time, ever since he started Cube. And Cube is actually a spin-out of his previous company, which is Statsbot. And this kind of feels like going both backward and forward in time. So the premise of Statsbot was having a Slack bot that you can ask, basically like text-to-SQL in Slack, and this was six, seven years ago, something like that, a lot ahead of its time, and you see startups trying to do that today.
Starting point is 00:00:53 And then Cube came out of that as a part of the infrastructure that was powering Statsbot, and Cube then evolved from an embedded analytics product to the semantic layer and just an awesome open source evolution. I think you have over 16,000 stars on GitHub today. You have a very active open source community. But maybe for people at home, just give a quick lay of the land of the original Statsbot product. What got you interested in like text-to-SQL and what were some of the limitations that you saw then? The limitations that you're also seeing today in the new landscape. I started Statsbot in 2016. The original idea was to just make sort of a side project based off my initial project that I did at a company that I was working for
Starting point is 00:01:41 back then. And I was working for a company that was building software for schools. And we were using Slack a lot. And Slack was growing really fast. A lot of people were talking about Slack, you know, like Slack apps, chatbots in general. So I think it was, you know, like another wave of, you know, bots and all of that. We have one more wave right now, but it always comes in waves. So we were like living through one of these waves, and I wanted to build a bot that would give me information from the different places where like data lives to Slack. So it was like developer data, like New Relic, maybe some marketing data, Google Analytics, and then some just regular data like a production database or Salesforce sometimes.
Starting point is 00:02:22 And I wanted to bring it all into Slack because we were always chatting, you know, like in Slack and I wanted to see some stats in Slack. So that was the idea of Statsbot, right? Like bring stats to Slack. I built that as a first sort of a side project and I published it on Reddit and people started to use it even before Slack came up with that Slack application directory. So it was a little, you know, like a hackish way to install it, but people were still installing it.
Starting point is 00:02:48 So it was a lot of fun and then Slack kind of came up with that application directory and they reached out to me and they wanted to feature Statsbot because it was already one of the kind of widely used bots on Slack. So they featured me on this application directory front page, and I just got a lot of new users signing up for that. It was a lot of fun, I think, you know, but there was sort of a big limitation in terms of how you can process natural language,
Starting point is 00:03:14 because the original idea was to let people ask questions directly in Slack, right? Hey, show me my, you know, like opportunities closed last week or something like that. My co-founder, who kind of started helping me with this Slack application, him and I were trying to build a system to recognize that natural language, but it was, you know, we didn't have LLMs back then and all of that technology, so it was really hard to build the system, especially a system that can kind of, you know, like keep talking to you, like maintain some sort of a dialogue. It was a lot of like one-off requests and like it was a lot of hit and miss, right?
Starting point is 00:03:50 If you know how to construct a query in natural language, you will get a result back. But, you know, like it was not a system that was capable of, you know, like asking follow-up questions to try to understand what you actually want and then kind of finally, you know, like bring all this context together and go generate a SQL query, get the result back and all of that. So that was a really missing part, and I think right now that's, you know, like what the difference is. Right now I'm kind of bullish that if I would start Statsbot again, I probably would have a much better shot at it. But back then that was a big limitation. We kind of built Cube, right, as we were working on Statsbot, because we needed it.
Starting point is 00:04:27 What was the ML stack at the time? Were you trying to build your own natural language understanding models, like were there open source models that were good that you were trying to leverage? I think it was mostly a combination of a bunch of things, and we tried a lot of different approaches. The first version which I built was regex. It was working well. It's the same as I did.
Starting point is 00:04:51 I did option pricing when I was in finance, and I had a natural language pricing tool thing, and it was regex. It was just a lot of regex. Yeah. Yeah. And then my co-founder joined me, Pavel. He's much smarter than I am.
Starting point is 00:05:05 He has, like, a PhD in math, all of that. And he started to, like, do some stuff. I was like, no, you just do that stuff. I don't know. Like, I can do regex. And, you know, like, he started to do, like, some models and trying to either, you know, like, look at what we had on the market back then, or, you know, like, try to build
Starting point is 00:05:21 the different sort of, you know, like models. Again, we didn't have any foundation models in place, right? We wanted to try to use existing math, obviously, right? But it was not something where we can take the model and, you know, like try and run it. I think in 2019 we started to see more like of stuff, you know, like an ecosystem being built and then it eventually kind of, you know, like resulted in all these LLMs, like where we are right now. But back then in 2016, there was not much, you know, like available for just the people to build on top of. It was some academic research, right, kind of been happening.
Starting point is 00:05:53 But it was like very, very early, you know, like for something to actually be able to use. And then that became Cube, which was started just as an open source project. And I think I remember going on a walk with you in San Mateo in like 2020, something like that. And you were like, you have people reaching out to you who are like, hey, we use Cube in production. Like, I just need to give you some money, even though you guys are not a company. What's the story of Cube then from Statsbot to where you are today? We built Cube at Statsbot because we needed it. It was like the whole Statsbot stack was that we first tried to translate the
Starting point is 00:06:30 natural sort of language query into some sort of multidimensional query. It's like we were trying to understand, okay, people wanted to get active opportunities, right? What does it mean? Is it a metric? What is a dimension here? Because usually in analytics, you always, you know, like try to reduce everything down to the sort of, you know, like a multidimensional framework. So that was the first
Starting point is 00:06:51 step. And that's where, you know, like it didn't really work well because of all these limitations of us not having foundational technologies. But then from the multidimensional query, we wanted to go to SQL. And that's what the semantic layer was and what Cube was, essentially. So we built a framework where you would be able to map your data into these concepts, into these metrics. Because when people were coming to Statsbot, they were bringing their own data sets, right? And the big question was, how do we tell the system what active opportunities are for that specific user? How do we kind of, you know, like provide that context, how do we do that training? So that's why we came up with the idea of building the semantic layer. So people can actually define their metrics and then kind of use
Starting point is 00:07:36 them with Statsbot. So that's how we built Cube. At some point, we saw people started to see more value in Cube itself, you know, like kind of building the semantic layer and then using it to power different types of applications. So in 2019, we decided, it kind of feels like it might be a standalone product and a lot of people want to use it. Let's just try to open source it. So we took it out of Statsbot and open sourced it. Can I make sure that everyone has the same foundational knowledge?
Starting point is 00:08:04 The concept of a cube is not something that you invented. I think, you know, not everyone has the same background in analytics and data that all three of us do. Maybe you're going to explain like OLAP cube, hypercube, the brief history of cubes. Right. I'll try, you know, like a lot of like Wikipedia pages and like blog posts trying to go into the academics of it. So I'm trying to like cubes according to you. Yeah. So when we think about just a table in a database, the problem with the table, it's not multidimensional, meaning that in many cases, if we want to slice the data, we kind of need to end up with a different
Starting point is 00:08:38 table, right? Like think about when you're writing a SQL query to answer one question, SQL query always ends up with a table, right? So you write one SQL, you got one. Then you write, to answer a different question, you write a second query. So you're kind of getting a bunch of tables. So now let's imagine that we can kind of bring all these tables together into multidimensional table. And that's essentially a cube. So it's just like the way that we can have measures and dimensions that can potentially be
Starting point is 00:09:06 used at the same time from different angles. So initially a lot of your use cases were more BI-related. But you recently released a LangChain integration. There's obviously more and more interest in, again, using these models to answer data questions. So you've seen the ChatGPT code interpreter, which is renamed to, like, Advanced Data Analysis. What's kind of like the future of like the semantic layer in AI? You know, what are like some of the use cases that you're seeing?
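The cube idea from the exchange above (one set of rows, measures that can be sliced by any combination of dimensions) can be sketched in a few lines of Python. This is only an illustration: the rows, the column names, and the `slice_cube` helper are all made up here and are not Cube's actual API.

```python
from collections import defaultdict

# Hypothetical fact rows; the column names are invented for illustration.
ORDERS = [
    {"country": "US", "plan": "pro", "amount": 100},
    {"country": "US", "plan": "free", "amount": 0},
    {"country": "DE", "plan": "pro", "amount": 80},
    {"country": "DE", "plan": "pro", "amount": 120},
]

# Measures are aggregations; dimensions are the columns we can slice by.
MEASURES = {
    "revenue": lambda rows: sum(r["amount"] for r in rows),
    "order_count": lambda rows: len(rows),
}


def slice_cube(rows, measure, dimensions):
    """Aggregate one measure grouped by any combination of dimensions."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        groups[key].append(row)
    return {key: MEASURES[measure](group) for key, group in groups.items()}


# The same data answers different questions without writing a new query each time.
by_country = slice_cube(ORDERS, "revenue", ["country"])
by_country_plan = slice_cube(ORDERS, "order_count", ["country", "plan"])
```

Each call plays the role of one of the separate SQL result tables described above; the difference is that both are derived from a single definition of measures and dimensions.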
Starting point is 00:09:32 And what do you think is a good strategy to make it easier to do now the text-to-SQL you wanted to do seven years ago? Yeah. So, I mean, you know, when it started to happen, I was just like, oh my God, people are now building Statsbot with Cube. They just have a better technology for, you know, like natural language. So it kind of, it made sense to me, you know, like from the first moment I saw it. So I think it's something that, you know, like is happening right now. And chatbot is one of the use cases. I think, you know, like if you try to generalize it, the use case would be how do we use structured or tabular data with, you know, like AI models, right?
Starting point is 00:10:08 Like how do we turn the data and give the context to the data and then bring it to the model. And then the model can, you know, like give you answers, ask questions, do whatever you want. But the question is like how we go from just the data in your data warehouse, database, whatever, which is usually just tabular data, right, like in SQL-based warehouses, to some sort of, you know, like a context that the system can use. And if you're building this application, you have to do it. There's like no way you can get away with not doing this. You either map it manually or you come up with some framework or something else.
Starting point is 00:10:42 So our take is that, and my take is that, the semantic layer is just a really good place for this context to live. Because you need to give this context to the humans. You need to give that context to the AI system anyway, right? So that's why you define a metric once and then, you know, like you teach your AI system what this metric is about. What are some of the challenges of using tabular versus language data and some of the ways that having the semantic layer kind of makes that easier, maybe. Imagine you're a human, right, and you're going into, like, you're a new data analyst at a company and people just give you a warehouse with a bunch of tables,
Starting point is 00:11:18 and they tell you, okay, just try to make sense of this data. And you're going through all of these tables, and you're really, like, trying to make sense without any, you know, like additional context, sort of, like, some columns. In many cases, they might have weird names. Sometimes, you know, if they follow some kind of like a star schema, or like a Kimball style dimensions, maybe that would be easier because you would have facts and dimensions column,
Starting point is 00:11:40 but it's still, it's hard to understand and kind of make sense because it doesn't have descriptions, right? And then there is like a whole, like industry of like a data catalogs exist because the whole purpose of that to give context to the data so people can understand that.
Starting point is 00:11:55 And I think the same applies to the AI, right? Like, and the same challenge is that if you give it pure tabular data, it doesn't have this sort of context that it can read. So you sort of need to write a book or essay about your data and give that book to the system so it can understand it. Can you run through the steps of how that works today? So the initial part is like the natural language query.
Starting point is 00:12:19 Like what are the steps that happen in between to do model to semantic layer, semantic layer to SQL and all that flow? The first key step is to do some sort of indexing. That's what I was referring to like write a book about your data, right? describe in a text format what your data is about, right? Like what metrics it has, dimensions, what is the structures of that, what a relationship between these metrics, what the potential values of the dimensions,
Starting point is 00:12:49 so sort of, you know, like build a really good index as a text representation, and then turn it into embeddings in your, you know, like vector storage. Once you have that, then you can provide it as a context to the model. I mean, there are like a lot of options, like either fine-tune or sort of in-context learning, but somehow kind of give that as a context to the model, right? And then when this model has this context, it can create a query.
Starting point is 00:13:15 Now, the query, I believe, should be created against the semantic layer because it reduces the room for error. Because what usually happens is that your query to the semantic layer would be very simple. It would be like, give me that metric grouped by that dimension and maybe that filter should be applied. And then your real query for the warehouse, it might have like five joins, a lot of different techniques, like how to avoid fan out, fan traps, chasm traps, all of that stuff.
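The indexing step described here (write a text description of each metric and dimension, retrieve the relevant ones, hand them to the model as context) might look roughly like the following sketch. Everything in it is an assumption for illustration: the catalog entries are invented, and plain word overlap stands in for the embeddings and vector storage mentioned in the conversation.

```python
# Hypothetical semantic-layer metadata; names and descriptions are illustrative.
CATALOG = [
    {"name": "active_users", "kind": "measure",
     "description": "count of distinct users with activity in the period"},
    {"name": "revenue", "kind": "measure",
     "description": "sum of paid order amounts"},
    {"name": "signup_country", "kind": "dimension",
     "description": "country the user signed up from"},
]


def describe(member):
    """Render one catalog entry as a line of text for indexing."""
    return f"{member['kind']} {member['name']}: {member['description']}"


def tokens(text):
    """Lowercase word set; underscores and colons are treated as separators."""
    return set(text.lower().replace("_", " ").replace(":", " ").split())


def retrieve(question, catalog, top_k=2):
    """Rank entries by word overlap with the question (a stand-in for embeddings)."""
    q = tokens(question)
    ranked = sorted(catalog, key=lambda m: len(q & tokens(describe(m))), reverse=True)
    return ranked[:top_k]


def build_context(question, catalog):
    """Assemble the text that would be passed to the model as context."""
    lines = [describe(m) for m in retrieve(question, catalog)]
    return "Available members:\n" + "\n".join(lines) + f"\nQuestion: {question}"


context = build_context("how many active users by country", CATALOG)
```

With this context the model only has to pick a measure and a dimension by name, which is the "very simple" query against the semantic layer, rather than write the joins itself.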
Starting point is 00:13:45 And the bigger the query, the more room that the model can make an error, right? Like sometimes it could be a small error and then, you know, like your numbers are going to be off. But making a query against the semantic layer, that sort of reduces the error. So the model generates a SQL query and then it executes it against the semantic layer. And the semantic layer executes it against your warehouse and then sends the result all the way back to the application. And that can be done multiple times, because what we were missing with Statsbot was this ability
Starting point is 00:14:15 to have a conversation, right, with the model. You can ask a question and then the system can ask follow-up questions, you know, like then do a query to get some additional information, and based on this information, do a query again. And sort of, you know, like it can keep doing this stuff and then eventually maybe give you a big report that consists of a lot of data points. But the whole flow is that the system knows your data, because you already kind of did the indexing, and then it queries the semantic layer instead of a data warehouse directly. Maybe just to make it a little clear for people that haven't used a semantic layer before,
Starting point is 00:14:52 you can add definitions like revenue, where revenue is like select from customers and like join orders and then sum of the amount of orders, but in the semantic layer, you're kind of hiding all of that away. So when you do natural language to Cube, it's just select revenue from last week, and then it turns into a bigger query. One of the biggest difficulties around the semantic layer, for people who've never thought about this concept before, this all sounds super neat, until you have multiple stakeholders within a single company who all have different concepts of what revenue is. They all have a different concept of what an active user is, and then so they'll have like, you know, revenue revision one
Starting point is 00:15:32 by the sales team, you know, and then revenue revision one by the accounting team or tax team, I don't know. I feel like I always want semantic layer discussions to talk about the not so pretty parts of the semantic layer, because this is where effectively you ship your org chart in the semantic layer. I think the way I think about it is that at the end of the day, the semantic layer is a code base, and in Cube it's essentially a code base, right? It's just a set of YAML files with Python. I think code is never perfect, right? It's never going to be perfect. It will have a lot of, you know, like, revisions of code. We have version control, which makes it easier with the revisions. So I think we should treat our metrics and semantic layer as code. Right.
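To make the "metrics as code" point concrete, here is a minimal sketch of two competing revenue definitions living side by side in a versioned file, plus a helper that expands a small semantic query into the larger warehouse SQL. The table names, column names, metric names, and the `expand` function are all hypothetical; this is not Cube's data model syntax or its SQL generator.

```python
# Hypothetical metric definitions kept in version control; table and column
# names are invented for this sketch.
METRICS = {
    "revenue_sales": {
        "sql": "SUM(orders.amount)",
        "joins": ["JOIN orders ON orders.customer_id = customers.id"],
        "filters": [],
    },
    "revenue_finance": {
        "sql": "SUM(orders.amount)",
        "joins": ["JOIN orders ON orders.customer_id = customers.id"],
        "filters": ["orders.status = 'paid'"],
    },
}


def expand(metric, dimension):
    """Turn a small semantic query (metric + dimension) into full warehouse SQL."""
    m = METRICS[metric]
    parts = [
        f"SELECT {dimension}, {m['sql']} AS {metric}",
        "FROM customers",
        *m["joins"],
    ]
    if m["filters"]:
        parts.append("WHERE " + " AND ".join(m["filters"]))
    parts.append(f"GROUP BY {dimension}")
    return "\n".join(parts)


sales_sql = expand("revenue_sales", "customers.country")
finance_sql = expand("revenue_finance", "customers.country")
```

Because the two definitions differ by one line, the disagreement between teams shows up as a reviewable diff rather than as two spreadsheets that quietly drift apart.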
Starting point is 00:16:10 And then collaboration is a big part of it. You know, like, if there are like multiple teams that sort of have different opinions, let them collaborate on a pull request, you know, like and discuss that, like, why they think that should be calculated differently. Have an open conversation about it, you know, like where everyone can just discuss it, like an open source community, right? Like you go on GitHub and you talk about why that code is written the way it's written, right? It should be written differently.
Starting point is 00:16:33 And then hopefully at some point you can come up, you know, like to some definition. Now, if you still have to have multiple versions, right, it's code, right? You can still manage it. But I think the big part of that is that like we really need to treat it as a code base. Then it makes a lot of things easier,
Starting point is 00:16:49 not as a spreadsheet, you know, like hidden Excel files. The other thing is like then having the definition spread in the organization, like versus everybody trying to come up with their own thing. But yeah, I'm sure that when you talk to customers, there are people that have issues with the product and it's really like two people trying to define the same thing. One in sales that wants to look good. The other is like the finance team that wants to be conservative and they all have different definitions. How important is the natural language to people?
Starting point is 00:17:18 Obviously, you guys both work in modern data stack companies either now or before. There's going to be the whole wave of empowering data professionals. I think now a big part of the wave is removing the need for data professionals to always be in the loop and having non-technical folks do more of the work. Are you seeing that as a big push too with these models, like allowing everybody to interact with the data? I think it's a multidimensional question. That's an example of, you know, like where you have a lot inside the question. In terms of examples, I think a lot of people are building different, you know, like agents or chatbots,
Starting point is 00:17:51 we have a company that built an internal Slack bot that sort of answers questions, you know, like based on the data in a warehouse. And then like a lot of people kind of go in and, you know, like ask that chatbot questions. Is it like a real big use case? Maybe. Is it still like a toy pet project? Maybe that too right now. I think it's really hard to tell them apart at this point because there is a lot of
Starting point is 00:18:16 hype, you know, and just people building LLM stuff because it's cool and everyone wants to build something, you know, like, even at least a pet project. So that's what happened in Cube's community as well. We see a lot of people building a lot of cool stuff. And it probably will take some time for that stuff to mature and kind of to see like what the real, the best use cases are. But I think what I saw so far, one use case was building this chatbot.
Starting point is 00:18:38 And we have even one company that is building it as a service. So they essentially connect into Cube's semantic layer and then offer their like chatbot. So you can do it on the web, in Slack, so it can, you know, like answer questions based on data in your semantic layer, but I also see a lot of things like this just being built in-house. And there are other use cases, sort of automation, you know, like that agent checks on the data and then kind of performs some actions based, you know, like on changes in data. But the other dimension of your question is like, will it replace people or not?
Starting point is 00:19:13 I think, you know, like what I see so far in data specifically, you know, like a few use cases of LLM. I don't see Cube being part of that use case, but it's more like a copilot for data analyst, a copilot for data engineer, where you develop something, you develop a model, and it can help you to write SQL or something like that. So, you know, it can create boilerplate SQL, and then you can edit this SQL, which is fine, because you know how to edit SQL, right? So you're not going to make any mistake, but it will help you to just generate, you know, like a bunch of SQL that you write again and again, right, like boilerplate code. So sort of a co-pilot use case.
Starting point is 00:19:51 I think that's great and we'll see more of it. I think every platform that is building for data engineers will have some sort of co-pilot capabilities. And Cube included, we're building these co-pilot capabilities to help people build semantic layers easier. I think that's just a baseline for every engineering product right now to have some sort of, you know, like co-pilot capabilities. Then the other use case is a little bit more where Cube is being involved.
Starting point is 00:20:15 It's like, how do we enable access to data for non-technical people through natural language as an interface to data, right? Like visual dashboards, charts, it always has been an interface to data in every BI. Now I think we will see just a second interface as just kind of a natural language. So I think at this point, many BIs will add it as a commodity feature. It's like, Tableau will probably have a search bar at some point saying,
Starting point is 00:20:42 like, hey, ask me a question. I know that some of the, you know, like AWS QuickSight, they're about to announce features like this in their like BI. And I think Power BI will do that, especially with their deal with OpenAI. So every company, every BI, will have some sort of search capabilities built inside their BI. So I think that's just going to be a baseline feature for them as well. But that's where Cube can help because we can provide that context, right?
Starting point is 00:21:07 Do you know how, or do you have an idea for how these products will differentiate once you get the same interface? So right now there's like, you know, Tableau is like the super complicated one, and there's like Superset, which is like easier. Yeah, do you just see everything will look the same and then how do people differentiate? It's like they all have line chart, right? And they all have bar chart.
Starting point is 00:21:28 I feel like it's pretty much the same. And it's going to be fragmented as well. And every major vendor and most of the vendors will try to have some sort of natural language capabilities. And they might be a little bit different. Some of them will try to position the whole product around it. Some of them will just have them as a checkbox, right? So we'll see. But I don't think it's going to be something that will change the
Starting point is 00:21:53 BI market. You know, like something that can take the BI market and make it more consolidated versus, you know, like what we have right now. I think it still will remain fragmented. Let's talk a bit more about application use cases. So people also use Cube for kind of like analytics in their product, like dashboards and things like that. How do you see that changing, especially when it comes to like agents, you know, so there's like a lot of people trying to build agents for reporting, building agents for sales. If you're building a sales agent,
Starting point is 00:22:26 you need to know everything about the purchasing history of the customer, all of these things. Yeah, any thoughts there? What should all the AI engineers listening think about when implementing data into agents? Yeah, I think kind of, you know, like trying to solve for two problems. One is how to make sure that agents or the LLM model, right, has enough context about, you know, like tabular data.
Starting point is 00:22:50 And also, you know, like, how do we deliver updates to the context, which is also important because data is changing, right? So every time we change something upstream, we need to make sure we update that context in our vector database or something. And how do you make sure that the query is correct? You know, I think it's obviously a big pain in this whole, you know, like AI kind of, you know, like space right now. How do we make sure that we don't, you know, provide wrong answers?
Starting point is 00:23:16 but I think, you know, like being able to reduce the room for error as much as possible, that's what I would look for, you know, like to try to like minimize potential damage. And then our use case for Cube, it's been used a lot to power sort of customer-facing analytics. So I don't think much is going to change there. I feel like, again, more and more products will adopt natural language interfaces as sort of a part of that product as well. So we would be able to power these pieces to not only, you know, like charts, visuals, but also some sort of, you know, like summaries. Probably in the future you're going to open the page with some sort of stats and you will have a smart summary kind of generated by AI.
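The two concerns raised in this answer, giving the model enough context about tabular data and reducing the room for error, can be sketched roughly as follows. This is an illustration under assumptions, not how Cube or any particular library does it: `SEMANTIC_LAYER`, `build_context`, and `validate_request` are hypothetical names, and the measure and dimension definitions are made up.

```python
# Hypothetical semantic-layer metadata; all names here are assumptions.
SEMANTIC_LAYER = {
    "measures": {
        "revenue": "Total order amount",
        "active_users": "Users with activity in the period",
    },
    "dimensions": {
        "country": "Customer country",
        "plan": "Subscription plan",
    },
}


def build_context(layer: dict) -> str:
    """Render the definitions as text to place in the model's prompt.

    Regenerating this string whenever a definition changes upstream is one
    simple way to keep the model's context current.
    """
    lines = []
    for kind in ("measures", "dimensions"):
        for name, description in sorted(layer[kind].items()):
            # kind[:-1] turns "measures" into "measure", and so on.
            lines.append(f"{kind[:-1]} {name}: {description}")
    return "\n".join(lines)


def validate_request(layer: dict, request: dict) -> list[str]:
    """Return a list of problems; an empty list means every name is defined."""
    problems = []
    for name in request.get("measures", []):
        if name not in layer["measures"]:
            problems.append(f"unknown measure: {name}")
    for name in request.get("dimensions", []):
        if name not in layer["dimensions"]:
            problems.append(f"unknown dimension: {name}")
    return problems


context = build_context(SEMANTIC_LAYER)
ok = validate_request(
    SEMANTIC_LAYER, {"measures": ["revenue"], "dimensions": ["country"]}
)
bad = validate_request(SEMANTIC_LAYER, {"measures": ["is_active"]})
print(context)
print(ok, bad)
```

Checking a model-produced request against known names before anything is executed means a made-up field such as `is_active` is caught early rather than turning into a wrong answer.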
Starting point is 00:23:59 And that summary can be powered by Cube, right? Like, because the rest is already being powered by Cube. You know, we had Linus from Notion on the pod. And one of the ideas he had that I really like is kind of like thumbnails of text, kind of like, how do you like compress knowledge and then start to expand it? A lot of that comes into dashboards, you know, where like you have a lot of data. You have like a lot of charts and sometimes you just want to know, hey, this is like the three-line summary of it. Exactly.
Starting point is 00:24:26 Makes sense that you want to power that. How are you thinking about, yeah, the evolution of like the modern data stack in quotes, whatever that means today? What's like the future of what people are going to do? What's the future of like what models and agents are going to do for them? Do you have any thoughts? I feel like modern data stack sometimes is not very, I mean, it's obviously a big crossover between AI, you know, like ecosystem, AI infrastructure ecosystem, and then sort of data. But I don't think it's a full overlap.
Starting point is 00:24:56 So I feel like when, you know, like I'm looking at a lot of like what's happening in a modern data stack where like we use warehouses, we use BIs, you know, different like transformation tools, catalogs, like data quality tools, ETLs, all of that. I don't see a lot of it being impacted by AI specifically. I think, you know, that space is being impacted as much as any other space in terms of, yes, we'll have all these copilot capabilities, some AI capabilities here and there. But I don't see anything sort of dramatically, you know, being sort of, you know, changed or shifted because of, you know, like the AI wave. In terms of just in general data space, I think in the last two, three years, we saw an explosion, right? Like we got like a lot of tools, every vendor for every problem.
Starting point is 00:25:43 I feel like right now we should go through the cycle of consolidation. If Fivetran and dbt merge, they can be the Alteryx of a new generation or something like that. And, you know, probably some ETL tool too there. I feel it might happen. I mean, it's just natural waves, you know, like in cycles. I wonder if everybody is going to have their own co-pilot. The other thing I think about these models is like Swyx was at Airbyte and, yeah, there's Fivetran,
Starting point is 00:26:08 Fivetran versus Airbyte. I don't think it will mix very well. A lot of times these companies are doing the syntax work for you of like building the integration between your data store and like the app or another data store. I feel like now these models are pretty good at coming up with the integration themselves and like using the docs to then connect it to. So I'm really curious like in the future what that would look like.
Starting point is 00:26:29 And same with data transformation. I mean, you think about dbt and some of these tools. And right now you have to create rules to normalize and transform data, but in the future, I could see you explaining to the model how you want the data to be and then the model figuring out how to do the transformation. I think it all needs a semantic layer as far as figuring out what to do with it. You know, what's the data for and where it goes? Yeah, I think many of these, you know, like workflows will be augmented by, you know, like some sort of a copilot. You know, you can describe what transformation you want to see and it can generate
Starting point is 00:27:04 a boilerplate, right, of the transformation for you. Or even, you know, like kind of generate a boilerplate of a specific ETL driver or ETL integration. I think we're still not at the point where this code can be fully automated, so we still need a human in the loop, right? Like who can use this copilot? But in general, I think, yeah, data work and software engineering work can be augmented quite significantly with all that stuff. You know, the big thing with machine learning before was like, well, all of your data is bad.
Starting point is 00:27:34 You know, the data's not good for anything. And I think like now at least with these models, they have some knowledge of their own. And they can also tell you if your data is bad, which I think is like something that before you didn't have. Any cool apps that you've seen being built on Cube, like any kind of like AI native things that people should think about, new experiences, anything like that? Well, I see a lot of Slack bots. They all remind me of Statsbot, but you know, like I played with a few of them, they're much, much better than Statsbot. It feels like it's on the surface, right? It's just that use case that you really want.
Starting point is 00:28:06 You know, think about it: you're a data engineer in your company. Like everyone is, like, asking you, hey, can you pull that data for me? And you would be like, can I build a bot to replace myself? You know, like, so they can ping that bot instead. So it's like, that's why a lot of people are doing this. So I think it's the first use case that actually people are playing with. But I think inside that use case, people get creative. So I see bots that can actually have a dialogue with you.
Starting point is 00:28:30 So, you know, like you would come to that bot and say, hey, show me metrics. And the bot would be like, what kind of metrics? What do you want to look at? You will be like active users and then it would be like, how do you define active users? You want to see active users sort of by cohort? You want to see active users kind of changing behavior over time? A lot of like follow-up questions. So it tries to sort of, you know, like understand what exactly you want. And that's how many data analysts work, right? When people start to ask you something, you always try to understand what exactly they mean, because many people don't know how to ask correct questions about your data.
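The clarifying dialogue described here can be sketched as a tiny rule-based loop. This is only an illustration: the bots mentioned in the conversation presumably use a language model for this step, while the `METRICS` catalog and `next_reply` function below are hypothetical stand-ins that use plain string matching.

```python
# Hypothetical metric catalog: metric name -> ways it can be broken down.
METRICS = {
    "active_users": ["by cohort", "over time"],
    "revenue": ["by team", "over time"],
}


def next_reply(message: str) -> str:
    """Ask a follow-up question instead of guessing what the user meant."""
    # Normalize "active users" to "active_users" so it matches catalog keys.
    text = message.lower().replace(" ", "_")
    matches = [name for name in METRICS if name in text]
    if not matches:
        # The request is too vague: offer the metrics that actually exist.
        options = ", ".join(sorted(METRICS))
        return f"What kind of metrics do you want to look at? Available: {options}"
    metric = matches[0]
    views = " or ".join(METRICS[metric])
    return f"How do you want to see {metric}: {views}?"


first = next_reply("hey, show me metrics")
second = next_reply("active users")
print(first)
print(second)
```

The follow-up options come from the catalog rather than from free text, so the bot can only steer the user toward metrics that are actually defined.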
Starting point is 00:29:04 It's a sort of interesting spectrum. On one side of the spectrum, you know nothing. You're just like, hey, show me metrics. And on the other side of the spectrum, you know how to write SQL and you can write an exact query to your data warehouse, right? So many people are a little bit in the middle. And the data analysts, they usually have the knowledge about your data, and that's why they can ask follow-up questions
Starting point is 00:29:26 and understand what exactly you want. And I saw people building bots who can do that. That part is amazing. I mean, like generating SQL, all that stuff, it's okay, it's good, but when the bot can actually act like they know your data and they can ask follow-up questions, I think that's great. Are there any issues with the models and the way they understand numbers? One of the big complaints people have is like GPT,
Starting point is 00:29:51 at least 3.5, cannot do math. Have you seen any limitations and improvement? And also when it comes to which model to use, do you see most people use like GPT-4 because it's like the best at this kind of analysis? I think I saw people use all kinds of models. To be honest, it's usually GPT. So inside GPT it could be 3.5 or 4, right?
Starting point is 00:30:11 But it's not like I see a lot of something else, to be honest. Like, I mean, maybe some open source alternatives, but it feels like the market is being dominated by just ChatGPT. In terms of the problems, I think, chatting about it with a few people, so if math is required, they do the math, you know, like outside of, you know, like ChatGPT itself. So it would be like some additional Python scripts or something. When we're talking about production level use cases, it's quite a lot of Python code around,
Starting point is 00:30:39 you know, like your model to make it work, to be honest. It's like, it's not that magic that you just throw the model at it and it like can give you all these answers. For like a toy use case, the one we have on, you know, like our demo page or something, it works fine, it's great. But, you know, like if you want to do like a lot of post-processing, do the math on your own, you probably need to code it in Python anyway. That's what I see people doing. We heard the same from Harrison at LangChain that most people just use OpenAI. We did an "OpenAI has no moat" emergency podcast, and it was funny to like just see the reaction
Starting point is 00:31:10 that people had to that and how hard it actually is to break down some of the monopoly. What else should people keep in mind, Artem? You're kind of like at the cutting edge of this. You know, if I'm looking to build a data-driven AI application, I'm trying to build data into my AI workflows. Any mistakes people should avoid, any tips on the best stack to use, what tools to use? I would just recommend going to a warehouse as soon as possible. I think a lot of people feel that MySQL can be a warehouse, which can be maybe on like a lower scale, but definitely not from a performance perspective. So just kind of starting with a good warehouse, a query engine, a lakehouse, that's probably like something I would recommend starting from day zero. And
Starting point is 00:31:54 there are really good ways to do it very cheaply with open source technologies too, especially in a lakehouse architecture. I think, you know, I'm biased obviously, but using a semantic layer, preferably Cube, for, you know, like context. And other than that, I just feel it's a very interesting space in terms of the AI ecosystem. I see a lot of people using LangChain right now, which is great, you know, like, and we built an integration. But I'm sure the space will continue to evolve and, you know, like, we'll see a lot of interesting tools and maybe, you know, like, some tools would be a better fit for the job. I'm not aware of any right now, but it's always interesting to see how it evolves. Also, it's a little unclear, you know, like how all the infrastructure
Starting point is 00:32:33 around actually developing, testing, documenting, all that stuff will kind of evolve too. But, yeah, again, it's just like really interesting to see and observe, you know, what's happening in the space. So before we go to the lightning round, I wanted to ask you for your thoughts on embedded analytics. And in a sense, the kind of chatbots that people are inserting on their websites and building with LLMs is very much sort of end user programming or end user interaction with their own data. I love seeing embedded analytics. And for those who don't know, embedded analytics is basically user-facing dashboards where you can see your own data, right? Instead of the company seeing data across all their customers, it's an individual user seeing
Starting point is 00:33:16 their own data as a slice of the overall data that is owned by the platform that they're using. So I love embedded analytics. Actually, overwhelmingly, the observation that I've had is that people who try to build in this market fail to monetize. And I was wondering about your insights on why. I think overall the statement is true. It's really hard to monetize, you know, like in embedded analytics. That's why I would be more excited about the internal kind of BI use case or like companies who are building, you know, like chatbots for their internal data consumption or like internal workflows. Embedded analytics is hard to monetize because it's historically been dominated by the BI vendors, and we still see a lot of organizations
Starting point is 00:33:57 are using BI tools as vendors. And like what I was talking about, BI vendors adding natural language interfaces, they will probably add that to the embedded analytics capabilities as well, right? So they would be able to embed that too. So I think that's part of it. Also, you know, if you look at the embedded analytics market, the bigger the organization gets, the more custom, you know, like it becomes. And at some point, I see many organizations just stop using a vendor. And they just kind of build most of the stuff from scratch,
Starting point is 00:34:31 which is probably, you know, like the right way to do it. So it's sort of, you know, like you've got a market that is very capped at the top. And then also in that middle and small segment, you've got a lot of vendors trying, you know, like to compete for the buyers. And because, again, BI is very fragmented, embedded analytics, therefore, is fragmented also. So you're really going after the mid-market slice and then with a lot of other vendors competing for that.
Starting point is 00:35:00 So that's why it's historically been hard to monetize, right? I don't think AI is really going to change that, just because it's using a model; you just pay OpenAI, and that's it. Like, everyone can do that, right? So it's not much of a competitive advantage. So it's going to be more like a commodity feature that a lot of buyers would be able to leverage.
Starting point is 00:35:20 This is great, Artem. As usual, we've got our lightning round. So it's three questions. One is about acceleration, one on exploration, and then a takeaway. The acceleration thing is what's something that already happened in AI or maybe, you know, in data that you thought would take much longer. But it's already happening today. To be honest, all these foundational models. I thought that we had a lot of models that had been in production for like, you know, maybe a decade or so. And it was like very niche use cases, very vertical use cases,
Starting point is 00:35:50 just like very customized models. And even when we were building Statsbot back then in 2016, right, even back then we had some natural language models being deployed, like Google Translate or something. That was still sort of a model, right, but it was very customized for a specific use case. So I thought that would continue for like many years. We'll use AI, we'll have all these customized niche models. But now there are like foundational models.
Starting point is 00:36:17 They're like very generic now; they can serve many, many different use cases. So I think that is a big change, and I didn't expect that, to be honest. The next question is about exploration. What is one thing that you think is the most interesting unsolved question in AI? I think AI is a subset of software engineering in general, and it's sort of connected to the data as well. Because software engineering, as a discipline, it has quite a history. We built a lot of processes, you know, like toolkits and methodologies, how we approach that, right? And now AI, I don't think it's completely different, but it has some unique traits, you know, like it's quite non-deterministic, right, and kind of from many dimensions, and like other traits, which kind of may require different methodologies, may require different
Starting point is 00:37:07 approaches and a different toolkit. I don't know how much it's going to deviate from standard software engineering. I think many tools and practices that we developed for software engineering can be applied to AI and some of the data best practices can be applied as well. But it's like we got DevOps, right? It's just a bunch of tools, like an ecosystem. So now like AI kind of feels like it's shaping into that with a lot of its own, you know, like methodologies, practices and toolkit. So I'm really excited about it. And I think there are still a lot of unsolved questions. Again, how do we develop with that? How do we test? You know, like what are the best practices? What are the methodologies? So I think that would be interesting to see. Awesome. And then, yeah, our final message,
Starting point is 00:37:47 you have a big audience of engineers and technical folks, what's something you want everybody to remember, to think about, to explore? I mean, as someone who tried to build a chatbot, you know, like for analytics back then and kind of, you know, like looking at what people do right now, I think, yeah, just do that. I mean, it's working right now. With foundational models, it's actually possible now to build all those cool applications. I'm so excited to see, you know, like how much changed in the last six years
Starting point is 00:38:16 so that we actually now can build a smart agent. So I think that sort of, you know, would be the takeaway. And yeah, we, as humans in general, we're like, we really move technology forward. And it's fun to see, you know, like, just firsthand. Well, thank you so much for coming on, Artem. This is great.
