The Data Stack Show - 259: Too Big to Fail? The Hype, Hope, and Reality of AI with Kostas Pardalis of typedef

Episode Date: August 27, 2025

This week on The Data Stack Show, Brooks and John welcome back Kostas Pardalis, long-time co-host of The Data Stack Show and now Co-Founder of typedef. The group discusses the rapid evolution of AI and data infrastructure. The conversation also explores how AI is accelerating industry change, the challenges of integrating large language models (LLMs) into data workflows, and the limitations of current semantic layers. Kostas shares insights on building next-generation query engines, the importance of using familiar engineering paradigms, and the need to make AI seamless and almost invisible in user experiences. Key takeaways include the necessity of practical, incremental innovation, the reality behind AI hype, strategies for making advanced data tools accessible and reliable for engineers and businesses alike, and so much more.

Highlights from this week's conversation include:

Kostas's Background and Career Timeline (1:10)
Transition from RudderStack to Starburst Data (4:25)
AI Acceleration and Industry Impact (9:37)
AI Hype, Investment, and Polarized Reactions (12:05)
Historical Parallels and Tech Adoption (13:54)
AI Disrupting Tech Workers and Internal Drama (18:56)
Experimentation Phase and Future AI Applications (24:01)
Invisible AI and User Experience (28:21)
AI in Data Infrastructure and LLMs (34:24)
SQL, LLMs, and Engineering Solutions (36:35)
Standardization, Semantic Layers, and Data Modeling (41:01)
Introduction to typedef (45:49)
Productionizing AI Workloads with typedef (51:36)
Familiarity, Reliability, and Engineering Best Practices (57:24)
Security, Enterprise Concerns, and Open Source Models (1:00:48)
Final Thoughts and Takeaways (1:01:47)

The Data Stack Show is a weekly podcast powered by RudderStack, customer data infrastructure that enables you to deliver real-time customer event data everywhere it's needed to power smarter decisions and better customer experiences. Each week, we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to The Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show week in, week out, and provide you the valuable
Starting point is 00:00:38 content. RudderStack provides customer data infrastructure and is used by the world's most innovative companies to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. We are here with an extremely special guest, Kostas Pardalis from typedef, formerly a co-host of The Data Stack Show. Kostas, welcome back as a guest. Happy to be back here, Eric. It's always fun to be back, like, at The Data Stack Show, which I have some great memories of, like, being the co-host here with you and working with Brooks. So I'm super happy to be here.
Starting point is 00:01:24 Well, I know that longtime listeners are very familiar with you, but for the several million people, people that are new listeners since you left the show. Give them just a brief background on yourself. Oh, are you saying you're not like, I'm not that popular yet. Like, so everyone already knows about me. I thought they did. Oh, my goodness. Okay. All right. I'll do it, even if it's just a formality, but I'll say a few things about myself. So, yeah, I'm cautious. I'll keep it brief because I'm sure, like, we'll talk more about that, like, during So I've been building data infrastructure for, like, more than a decade now, especially, like, startups. So I, what I really enjoyed, like, the intersection of business building and technology building.
Starting point is 00:02:10 I do have, like, an engineering background. I can't escape that. But at the same time, I really enjoy, like, building products and companies. So naturally, I found something very technical to productize and commercialize. and sell to, like, technical personas. I'm working on something new right now, and I'm looking forward to like to talk more about that and what is happening in the world and our industry today.
Starting point is 00:02:38 Awesome. So, POSAS, really excited to hear about what you're doing now. We'll definitely dig into some AI topics. And then I will save this for the show, but I want a response from you to some of the news the last couple weeks about return on investment for AI. There's some stuff floating around where all these companies they're saying, you know, they've invested X amount of money and not seeing the return on AI.
Starting point is 00:03:00 So we've got to talk about that topic. And what do you want to dig into? Oh, 100%. I think talking about what has happened so far with AI, where we are today, and what's next? I think it's like very interesting. So definitely looking forward like to talking on that and like what's next? Because if anything, what we've seen so far is everything has been like extremely accelerated. like things that in other industries like probably take, I don't know,
Starting point is 00:03:29 like maybe years or even like decades like to happen with AI right now and take industry like things take like literally weeks or months. So what we are seeing is pretty normal, natural, and I think kind of gives us a hint of what is coming next. It's I think like super excited. Exciting. Awesome. Well, let's dig in and ja.
Starting point is 00:03:52 Let's do it. Okay, Kostas, we're going to save. the full story for a little bit later in the show, but give us just the brief timeline. So was it three years ago you left Rudder's Sack? Probably, no, something like that. Yeah, like, yeah. Yeah. left rather stack. I left primarily because I was looking for like some new stuff like to work
Starting point is 00:04:34 with just for the people that probably don't know about what I was doing until that time. Like since 2014 I pretty much have been working in like building infrastructure for data ingestion. So anything that's like related to let's say like the what was called back then like the data for the modern data stack. with, like, the ELT pattern and, like, extracting data and loading data in the data warehouse, but, like, all that stuff. But what I had been doing, the way that I kind of model, like, the industry, is like an onion, right?
Starting point is 00:05:12 Like, you have ingestion that needs, like, to happen in the periphery. That's what brings the data in from all, like, the places. But there's, like, in the core, there's, like, compute and storage, right? And usually these parts are, well, like, I would say we can't live without that. We can live without ingestion, probably, but we can't live without computer and storage, right? Like, if we do not have the better warehouse, our pipelines wouldn't be valuable. And then, like, also, like, some very interesting, like, technical problems around that. And I want to, like, to go deeper into these problems and this space.
Starting point is 00:05:51 And by the way, they're like challenges and opportunities there that both from, let's say, like the technology, like what it means like to build like something like Snowflake, for example, or like something like Spark, right? But also business related. Like we do see like the size of like, let's say the data warehousing like market compared like to the adjacent market. But it was like a little comparison there. Right. So that's what I wanted to do. I would like, I wouldn't like to go deeper into data infrastructure. And I was looking to join like a company that was building some kind of query engine around like that stuff.
Starting point is 00:06:31 I ended up like doing like Starburst data. Starburst data was still easy. They are commercializing Trino. Trino was initially created under the name like Presto. Presto and Trino like created around the same time with Spark. and they were, let's say, the Spark was, like, initially being primarily used for preparing the data when you have big data, and then you would use something like
Starting point is 00:06:59 to create the data, like, after the data has been, like, EPO. Amazing project, open source, like, for anyone with, like, interesting things to getting into that stuff, they should definitely check it. Like, it's probably one of the best projects for an Apple who wants to see how, like, databases and download systems work. And I joined for that. Plus, it was, it is a company with a very strong enterprise, go-to-market motion.
Starting point is 00:07:27 And I want to, like, do that because until, including, like, rather, like, my experience with go-to-market was primarily, like, like, from SMPs up to, like, market and more like the PLT kind of way of, like, growing. And I really want to, like, to see how enterprise also, like, works. And Trino and Star Business has been viewed like a more like a Fortune 100 company. So it was like a great way for me like to see also how that works. So joined there, I spent about a year, being a little bit selfies, to be honest. I knew that I'm joining primarily to learn and like find opportunities of what I'd like to start next
Starting point is 00:08:05 because I wanted like to start another company. And what I started realizing there was that the data infrastructure that we have today and the tooling that we have has been like designed and built pretty much like 12, 13 years ago under a completely different assumptions of like what the needs are of the market, right? The main reason that these systems were built are to serve BI, business diligence, right? Like reporting at scale. Of course, like with big data, but they are built with these use case that I can mind. Through like this decade, we have more use cases that stuff like emerging, we have a mail, we have a peddle analytics, data started actually becoming from,
Starting point is 00:08:50 let's say, something that we used to understand what we did, to actually being the product, right? And that's kind of like the intuition and like the experience like how that was like, okay, I think what is happening here is that we spent the past 15 years building SaaS and pretty much digitizing every activity that we have. This is done. Like, okay, how many new, like, sales forces we are going to build? Like, we still have the CRMs out there. Like, this thing is getting digitally already, like, digital.
Starting point is 00:09:28 Same thing with marketing. Same thing with, like, product design. Same thing with pretty much like everything, right? Same with, like, consumer side too. So how is this industry going to like to keep accelerating and growing, right? And the answer for me to that was, like, data, because now
Starting point is 00:09:46 that everything is digital, we create, like, all data, so we need to figure out ways to build products on top of the day. Like, the data will become, like, the product. And we are probably entering, like, the next decade is going to be all about how we deliver value over that.
Starting point is 00:10:02 And then AI happened, and it just accelerated everything, because if you think about what, like, AI, is all about working with data at the end of the day. Sure, you can use it like to create a new type of CRM or like create a new type of like a VRP or whatever. But at the end of the day, the reason that this is going to be better than what Shelford was
Starting point is 00:10:30 is because it's primarily going like to be building on top of data that they are being generated and they are used like. with models to do things that were not like possible to do before, like in an efficient way. Based on that, I mean, actually before the AI happened, like I decided that I won't like to go and start a company where we are going to build like the next generation of query engines that will make it like much more accessible to more people to go and build on top of data. We started with that and as we've been building and interacting like with design partners and seeing, like, what's going on like with AI,
Starting point is 00:11:13 we ended up building, like, a fusion, like a new type of all-up we re-entering, which outside of pure compute. It also considers inference, which is a new type of compute as a first-class citizen. And you can use this technology, like, to work with your data, both how you would do, like,
Starting point is 00:11:35 traditional using something like Snowflake or TagDB, but also mix in their LLMs and inference in a very controlled way and in a way that very something we are like to develop as to build. Man, that was a really comprehensive. That was like an LLM summary of the last couple of years. Yeah, I'm spending too much time with LLNs and they start affecting the way I talk. Yeah, I'm just to say I.
Starting point is 00:12:05 It's close to say I. Yeah. No, that was great. Okay, I have a zillion product questions, and I know John does too. But let's talk about, let's just talk about your perspective on what's happening with AI.
Starting point is 00:12:22 Like you said, it's, you know, compressing years into months and weeks. And, you know, it's interesting. If you read a lot of the posts and comments on Hacker News, It is, you know, opinions are very polarizing. You know, there's a lot of people who are, you know, you can sense fear. There's a lot of developers who, you know, have a deep sense of FOMO, you know, who are trying to navigate things, you know, and opinions are all over the place.
Starting point is 00:12:56 But you're actually, you know, building with AI as a core part of the compute engine that you've built. so what is you what's your disposition i guess the other component i would add that i think is really mind-boggling is the amount of money that's being poured into this which i think is hard for a lot of people to interpret just as far as you know is that feeling hype is that you know especially based on some of the product experiences so there you go an easy. That's a softball for your first question. Yeah. Okay. Where do you want to start from? Like, the, do you want to talk about the reactions that people have?
Starting point is 00:13:44 Yeah, I think it'd be an interesting place to start. Like, yeah. Are you surprised by the varied reactions? No, I'm not. I mean, like, there's, that's always the case, right? Like, I don't think that you're when you when something new comes out and it's not just like an incremental change to something that we already know and not familiar with I think humans tend to be like get like polarized you have like the people who are like oh yeah that's best thing ever happened like the humanity and then you have people who are like this is usually like you can see that like with electric cars
Starting point is 00:14:29 Right. I'm sure that the first people who started like buying Tesla were pretty much, you know, like just, they would find this thing like perfectly benefit was like breaking every two miles. Right. And then you have like the people who are like, okay, if I don't have like a V8 that wakes up everyone like around me, like, why would I have a car? Right. And like they are both valid. I mean, like I like both. Right. Like I do see that. joy of, like, you know, noisy V8. I also see the convenience or, like, a car that it's pretty much like an iPhone on wheels. I think that's, like, always the case until, you know, like, something gets, like, normalized and then it's just, like, everyone accepts it. I think people who leave, like, when the iPhone came out, I don't think, like, everyone who used it, we're like, oh, my God, this is, like, definitely people who are like, okay, like this thing is like
Starting point is 00:15:27 doesn't work like well enough. Like I remember like my first iPhone for example I was promised like this thing like to connect on Wi-Fi and actually for me it took like a couple of months until an update came out like that I actually managed on Wi-Fi, right? It wasn't
Starting point is 00:15:44 anything like related to what we have today. And okay for the even older people who experienced Internet when it just came out well I don't think that downloading anything back then was
Starting point is 00:16:00 reliable at all right like you would download something just to go it will take forever we go run it and oh shit this thing is corrupted I have to re-download
Starting point is 00:16:12 the whole thing right of course like 56K well there was like the 2400 before that 9600 let
Starting point is 00:16:24 these are just like numbers of like bytes per second. We're not talking about gigabits or like whatever we have. Yeah, yeah. Right. And the reason I'm saying that and I'm sure like I don't know.
Starting point is 00:16:38 Like when I was like start as a kid like interacting with the internet. Okay like my parents probably were thinking, oh it's like a new toy you know. I don't think like they could comprehend what this thing would become like 20 years later. Right. But for this to happen, it took a lot of investments. It took a dot-com thing to happen. It took a lot of engineering, like really hard engineering. And it took time. What do we see, I think, like today is that these times are getting compressed. And usually, because in a way, like money, the way that money works, especially like in investments and why people like raise money.
Starting point is 00:17:24 like for example, like people build some things because money is kind of like compressed time. Like we think that without money would take you, let's say, one year to do. If you raise money, you can probably do it like in three months. So why people see all these like huge amounts of money being pulled for that is because there is a raise to make things happen as fast as possible. And what took internet like 20 years, we try like to make it in like five years, right? So I think that's like the mental model at least I have when I'm trying like to like judge why like these amounts of money are like going into that. Of course there's also the thing that with this technology compared like to something like, I mean like with crypto you also have that. I think because you needed like people were investing in an infrastructure like to mine this stuff.
Starting point is 00:18:16 But you do have like huge also infrastructure investments, right? You need like telacenters, and even before that, you need energy. So there's like a lot of money that, like, required for that stuff. Now, there is one more thing, though, because the interesting thing is that it's not polarizing for people in general. It's polarizing for engineers too. And I think what's like the most interesting thing for me with AI is that if the first time master, I don't know, like decades, where tech workers are not disrupting other industries. They are disrupting themselves.
Starting point is 00:18:57 And that's scary, right? Yeah. So tech workers usually were like, oh, I'm coming into this industry. I'm digitizing this thing. And of course, the people who used to do that work before, they were like, oh, my, you are replacing me or like you're doing this or like you're doing that. But never at no point, anyone was like, oh, this is going to. to replace, like, the engineers themselves.
Starting point is 00:19:20 Now, there is a feeling that this might be happening. I don't agree with that, but I think, like, the polarization also, it's more interesting right now or the drama is more interesting because it's actually internal drama and internal disruption that is happening and to take industry itself. So it doesn't just, like, disrupt other industries, it disrupts itself. So I have a question, then, saying all that close, just in back to the funding thing, at what point, and I think maybe we're already there, is AI essentially?
Starting point is 00:19:48 too big to fail. There's too much money. So many people invested, like, we're going to make it to succeed. Like, doesn't, like, you know, everything, because there's something, there's like a human thing where, like, one of the reasons that, like, these things succeed is because everybody decided that we wanted it to succeed. Yeah. I mean, it's going to fail to meet some people's expectations for sure. Like, it can't, like, it can't meet everybody's expectations, but. Yeah. I think what we lack in these conversations, in my opinion, is like a definition. of like success and failure, right? Like, what does it mean for AI, like to fail, for example?
Starting point is 00:20:24 If we set the conversation like, okay, the goal of what we are doing right here is like to create like the remator who's going like to, I don't know, like, roll the world and we will just all retire as humanity. Yeah, like, of course, going to fail. Like, I don't see that happening like in the next like two or three years. Like, probably never will happen because there's, it's much more complex. on that, right? Like, even if you had that, if you have created that, the deploying that thing is like a human endeavor. And, like, the way that humanity works is like, it's certainly complicated.
Starting point is 00:21:03 So you can't just, like, reduce this whole process into a statement of, oh, when we have AGI, like, it's game over. Like, that's a goal. And, like, then we succeed. Without that, we don't. right so I think like we in my opinion like we can't talk about success and failure yet primarily because what we are doing right now is that okay we have a new thing out there this new thing has new capabilities that we didn't have before okay we are still trying to figure out the limits of this thing but most importantly we are trying to figure out the problems that make sense to solve
Starting point is 00:21:50 and what it is to make it viable, right? So, I can put it this way, like, there are problems today that you can solve with AI, but it's not viable because AI is still like too expensive
Starting point is 00:22:07 for the use cases. You have cases where you have new things that you couldn't do before that you can do it with AI, but it's not reliable enough, right, to put it into production, right? They're like, there's like, and there's still like a lot of stuff that we don't even know yet that we can solve it, like with this new technology. So it's still like an exploratory phase
Starting point is 00:22:37 of like trying to figure out what makes sense, like to do with this thing. What is like, let's say the killer app for this which I think like already being like deployed in some case and delivers value there but there are like other cases obviously where it fails and like as every other R&D project out there like there's going to be like a lot of failure like that's what R&D is right like failure it's like you have to embrace that like a lot of that stuff like are going like
Starting point is 00:23:07 to fail the difference in my opinion is that experimenting is quite cheap compared like to doing it in the past Right. If someone wanted, let's say, a couple of years ago to go and experiment with building, like incorporating like ML models to build like recommended like for their system. Like it couldn't be just an experiment. Like they would have like to make sure that this is going to work because it will be a big investment for them. Like you have to find the people. You have like to find the data. You have like to iterate on these. It takes months, maybe years. And, like, many times, like, what was happening was that we're faking that these things were, like, succeeding because we did invest individually, like, too much into them. It doesn't hurt, probably in the company, but it doesn't also mean that it's not, like, the value that we were expecting that's going to add, right? Right.
Starting point is 00:24:01 But I like to look, I think it's really fun to look back. So you're probably familiar with the TV show, The Jetsons, the old TV show, that. animated. So it's fun to look back, you know, the, it's the future looking, you know, what is future, what's the future going to be like? And the two things that come out at me from that, which is, you know, they originally made the show decades ago. Blind cars are part of that show and robots. And if you add, like, those are two, it's hard to think of like the things people thought would be now 30 years ago, 50 years ago, right? But it's helpful to like, to bring that up like so there's going to be some AI applications we're working on today
Starting point is 00:24:42 that it will be the equivalent of a flying car like we just haven't gotten there the physics don't work like we don't know how to solve like that problem or like the robotics like you know so far has been slower than a lot of people thought like we don't all have you know robots in our houses other than maybe vacuums right so like what does that look like so i think that's really interesting to because we're in this experimentation phase to think about which categories right now that we're throwing AI at which categories are going to hit the walls, they're going to be the future, you know, flying cars, for example. Yeah.
Starting point is 00:25:15 Yeah. First of all, I mean, you mentioned like robotics. I think robotics is like a big thing, like for many different reasons, not like only because of AI. I think there is traditional like robotics has been like a space that building goes like extremely slow, but it is a space that now has been like accelerating like a lot and new models of, like, building, like, there's open source robotics now, like, things like that. And I think that there's definitely going to be, I'll say that, like, a very interesting
Starting point is 00:25:47 intersection between, like, the robotics itself and AI and what together they can do. But I think, like, first of all, one of the things that I don't know, like, people, I think like they need, like, to take a step back and think a little bit about what has happened in the past, like, three years, like with AI. we are still trying to figure out what's the right way for us as humans like to interact with this thing, right?
Starting point is 00:26:12 Like, co-pilot came out and for a while, like, copilot was like the thing, right? Like, let's build like a copilot for everything.
Starting point is 00:26:19 Let's build a copilot for writing code. Let's build a copilot for Word and Excel. Let's build a copilot, I don't like, or whatever. And I think like what people like started realizing is that the copilot thing,
Starting point is 00:26:33 which pretty much means we have this model and then we have the human who is in the loop there to make sure that the model always stays like on track, it's not very efficient, right? Because like what happens is you have tasks, sure, like some of them like might be accelerated because you are using the copilot, but then you have like a human who instead of doing other things is like babysitting a model like to do something, right? So, of course, you are not going to see, like, crazy ROI there. Like, what's the point?
Starting point is 00:27:10 I mean, it's just, like, instead of typing, instead of the computer, like, on Excel, like, now you have someone who's, like, typing in free text, like, to a model, trying to convince the model to do the right thing, right? So that part of, like, the automation, I think, it became obvious that, like, it didn't work that well. There are, like, some cases where it works, but, like, it's not as global. like of Universal as, like, we would think it would be. Then we started, like, seeing new paradox of, like, how these things can be down. Like, at the end of the day, like, if someone tries to abstract what is happening,
Starting point is 00:27:44 is how we can see there, like, these models of, like, as we were, like, considered, like, software before, which is, okay, I want this thing, like, when it has a task, like, to just go and do it and come back and make sure that when it comes back, it's, like, the right thing. But the problem with models is that by their nature, they are not deterministic. things might go like the wrong way or things so we need like to figure out new ways to both like interact and also build systems out of this like an engineering problem at the end of the day like it's not like the science has been done like the thing is out there okay how do we need this thing like reliable at the end of the day yeah yeah i think you know one thing that that john and i have talked
Starting point is 00:28:25 about is that in a lot of cases and actually one interesting thing when we start to get into type deaf here in a little bit. Gassus, I think is really applicable to the API that you've built. But like, like you said, unfortunately, in my opinion, like having a co-pilot chatbot as like the thing that just everyone deployed in every possible way for every use case was really a bad start because I think the best use case is that a lot of it, or like, maybe a better way to say it would be, I think some of the best manifestations of this as far as user experience is that you won't really notice AI.
Starting point is 00:29:04 It's not like it's at the forefront, right? It's just sort of disappearing behind a user experience that feels like magically fluid or high context. I mean, it's going to hide an immense amount of complexity and make hard things seem really simple. Yeah, because as an example of that, like Netflix is one of my favorite examples of that. Like, the, like, brilliance of, like, their recommendation engine stuff they did, it's completely invisible to the user. Others, I'm like, oh, I might like to watch that,
Starting point is 00:29:37 you know? Like, no, like, those are the experiences I think will be fascinating to see, like, come into lots of different products, like, with AI. And I haven't seen as much of that yet. Well, I think you can see them, like, in some cases. Like, I'm sorry for the wrapping Eric, like, but, there are, like, some cases where it's not like, like, in development, for example, right? And again, you have something like cloud code which, okay, like, it is an experience on its own with its own limitations, right?
Starting point is 00:30:07 It doesn't mean that you just like throw this thing out there and like it's going to build like a whole Linux kernel on its own. But stuff like using models to do like a preliminary review of a new PR, right? Or actually using a model you as you do like a PR review. Like these things are accelerating. processes like a lot. Okay, now they are not replacing the engineer, right? And I don't think why this is a bad thing, but it does make the engineer like much more productive at the end of the day. Same thing with like shelf, like for example, right? Like, okay, like you want to go
Starting point is 00:30:47 and personalize like messages that you are going like to send like 200 people. Like in the past, if you wanted to do that, we'll take like, I don't know, like two hours probably like to go and do that. Now, it will probably take half an hour. Now, it doesn't replace the SDR, or some people might claim it does, but I think it's a very good idea, but it does make, like, the people more, like, productive. And I think that is the reason that what's, there was like a conversation, like, for a while that we're saying that the observation is when it comes like to impact to jobs, the main, the first layer of professional that's been affected by that's like middle management. And the reason for that is because when you, in the past, for every like five
Starting point is 00:31:38 SDRs, you probably needed like one sales manager. Now you need one sales manager for like 100 of them or like 50 of them, right? Because a lot of the stuff that you had to do to make sure that these people like were doing the right thing now can happen with a like much more efficiently. The same thing also, like, with customer support, which is, like, one of, like, the most common, like, use case where, like, AI is, like, heavily. Like, one of the things that the managers had to do was, like, to go through the recordings that the agents had and make sure that they were doing the right thing. That's super time consuming, right? Like, you literally have someone with, who works, like, for eight hours as an agent and talks, like, in total, I don't know, like, let's say three hours.
Starting point is 00:32:22 Someone had, like, to go through, like, three hours of transcript and figure out if they are doing the right thing. now they can do that for many more people in less time because they have the tooling to do that, right? So I think there is, like, impact happening out there. It's just that the way that the dream of AI is being sold, it's not, like, what is happening is not as sexy as the dream. We're going to take a quick break from the episode to talk about our sponsor, Rutter Sack.
Starting point is 00:32:51 Now, I could say a bunch of nice things as if I found a fancy new tool, But John has been implementing RudderStack for over half a decade. John, you work with customer event data every day and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go. Yeah, Eric, as you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RudderStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, and then send it to our downstream tool.
Starting point is 00:33:24 Now, rumor has it that you have implemented the longest-running production instance of Rudder Stack at six years in going. Yes, I can confirm that. And one of the reasons we picked Rudder Stack was that it does not store the data and we can live-stream data to our downstream tools. One of the things about the implementation that has been so common over all the years and with so many Rudder-Stack customers is that it wasn't a wholesale replacement of your stack. it fit right into your existing tool set. Yeah, and even with technical tools, Eric, things like Kafka or PubSub, but you don't have to have all that complicated
Starting point is 00:34:02 customer data infrastructure. Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more. Let's start to talk about data infrastructure because I really want to talk about type def, mainly because I got a demo of it.
Starting point is 00:34:19 I got a demo of it right before we started recording, so I'm all excited about it. But let's talk about data infrastructure, because I agree, you know, totally cost us that a lot of the significant impact that's happening isn't super sexy. Where are you seeing? I mean, obviously, you're building some of this with type deaf, but, and John, I would ask you this question, too, because you use, you know, you have a really good handle in the landscape and use new tools all the time.
Starting point is 00:34:46 Like, it's interesting because non-deterministic you know, having a nondeterministic tool for data infrastructure is really different than like summarize a transcript and give me the gist of it, right? Like you're not going to the threshold for making nondeterministic changes to a production system or to data that, you know, is business critical. You know, clearly there's a different threshold there. But what does the landscape look like with using LLMs and data infrastructure? Well, I have a really small annexate here that I'll share Eric, and I think it was interesting. So I occasionally do mentoring stuff, and I had a mentoring call earlier today,
Starting point is 00:35:27 and somebody's using an LLM to generate SQL to look at web analytics. I'm sure that happens all over the place, especially with startups, and it was a startup. So get on this call, and it was so funny, like, even a few months ago, I probably would have, like, walked them through, like, because then it really no sequel, right? So walk them through, like, and taught them a little bit about SQL. but I had actually thought them a little bit about prompting is like what I did. So like it was the simplest solve. They were getting a little loss in this query and essentially like it was a really short solve
Starting point is 00:36:00 of like, hey, break this down into CTEs. Like let me show you how to prompt it to make it use CTE instead of sub queries. So we did that and then said, all right, run each CTE. And if there's an error in the CTE, take that one part out, drop it in the new window, tell it to fix that piece, move it back over. And then like work through it. And we did it together and like, 15 minutes. She's like, oh, like, this is amazing. This is great. And it was just like something that even like six months ago, like that's not how I would have walked through, you know, a problem with somebody. So. Yeah. Yeah. I think of the implications of it. But yeah. Yeah. Yeah. It's interesting. I think like working with data is also like an interesting topic like when it comes like to a little names for a couple of different reasons. First of all, SQL was created because it was supposed to be used by.
Starting point is 00:36:49 business people, not technical people, right? So, like, it kind of resembles, like, natural language when you write it, right? It's a way, like, for how industry was, like, trying to create a BSL that could be used by, like, non-technical
Starting point is 00:37:05 people at the end of the day. Like, that was, like, the goal of that. Now, obviously, things get more and more complicated as we try to do more and more things, right? And, of course, like, when you start going to these people
Starting point is 00:37:20 who are supposed to be like business people or like business analysts or even like managers and you're suddenly like explain to them like use like terms of like CPEs or projections
Starting point is 00:37:31 or joins like what are you talking about? But it comes out that it's a good language for flames like to generate and for people like to debug because they are
Starting point is 00:37:46 usually end up like writing your logic, and because it is, like, data flow, the model is, like, data flow driven instead of, like, decision-driven, like, like, brands-driven. It's, you will get something back for your question that, okay, you can spend some time, like, understand what this thing is doing. Like, you don't have, like, to go through, like, thousands of clients of code, like, to figure out what's going on.
Starting point is 00:38:08 Now, having said that, at the same time, as with everything else with AI, people jumped directly, like, to the dream. And I'm like, okay, let's do, like, text of SQL. Right? Let's have like Eric saying, go there and be like, hey, how did my product team perform in the last quarter and expect something to come back that makes sense, right? We're not there yet. I don't know if we are going to get there. I think what will happen to your point. And I think, like, John, what you described is like great is that you need to, you have like this generalist, which is like the model that can do everything. good enough but if you want to do it like really good
Starting point is 00:38:53 as an output like you really have like to constrate it of how it's going like to operate right and you have to constrain it based on like the problems that you try to solve and its problem is different
Starting point is 00:39:05 so you need like a different context like it's not like something like generic that you can just put there and like it'll solve like every problem that's where we're engineering coming right so there are I think we are at the time where okay, we need to engineer
Starting point is 00:39:20 solutions. We need to sit down and for the problems that we are trying to solve, find the ways that these models can operate in good enough, like margins of errors and put them into production and keep improving
Starting point is 00:39:38 as we did in the past, right? That's what engineering has always been doing. No difference. I think one thing that I'd be interested in both of your opinions on is, I agree that we need to engineer solutions. I think part of that is in the modeling layer, right? So one of the challenges, if you think about an LLM writing SQL, is that the underlying data sets are wildly different even for the same basic use case, right? And so if there was a way to normalize on
Starting point is 00:40:16 you know a basic data model so you mentioned web analytics right well that's actually a fully known you know there are like standards you can use for that it's a fully known you know that's it's you know you have page views you have sessions you have whatever right those are all like almost ubiquitously defined terms right and so in fact if you weren't able to have a consistent underlying data model, then you would be setting the LLM up for success because it's, you know, it's not having to try to interpret like, like, you know, wildly different underlying data models to produce the same result. And I think about the same thing with frameworks, right? I mean, if you think about, you know, V0 from VERS, like, it's running,
Starting point is 00:41:12 it's generating next apps, right? I mean, that that framework is super, well-defined. There's a zillion examples, right? And so, like, within a certain set of parameters, it can do some pretty awesome stuff, you know, like with those guardrails there. So do you think we will also see a big part of this become a normalization or sort of standardization of underlying data in order to create an environment in which the LLM is set up better for success? No. The reason I'm saying that is because I think like, when it comes like to data and schemas and total stuff has been tried a lot in the past. And it always like failed because the problem with these things is that it's extremely hard, like, first of all, like to agree about like the semantics, like what it means.
Starting point is 00:42:07 Like there are like actually there's like a very rich literature out there and like scientific research on like how to model like specific domains. Like, like, especially like in archiving, for example. Like, if you go there, you will see that depending on, like, the type of, like, medium that you want to use. Like, they are very well-defined, like, schemas and, most important, like, semantics around, like, how do you digitize, like, book, right? Like, what are the parts that you break down? What are the metadata that you need for these things? Like, there is a lot of word that has been done. But, like, the problem with that stuff is that it's extremely hard, like, to put human,
Starting point is 00:42:45 want to agree upon these things. And for a good reason, it's not like because we're like a problematic species. It's just that all these things are very context sensitive and the way that I will do this thing like in my company, like might be very different compared to like how Eric does things like in his company. And if you want to agree on something, it has to be good enough for both of us without causing problems to any of us because of like whatever exist in there to satisfy like another stakeholder, right? So it's really hard. I think like the way that, and there's another thing there, which is continuity, right?
Starting point is 00:43:25 We are not just resetting. Like the enterprise, like go like to Bangal America. I don't like, how long like is Bangal America like operating? For a while they started with like IBM started building like the first mainframes or whatever, right? It's not like you can go in there and just like remove everything and put, something new in there. Like, you need to continue. You need to continue it, right?
Starting point is 00:43:48 So things that you know, it's really, I think what can happen is like a couple of different things. One is either you decide of how models should come up on consensus of like how to do things and you let the models like figure this out and you don't care at the end of the data model or you have another layer of abstraction, which is what semantic layers are. Right, like the whole concept of semantic layer is that, okay, I have my data on my data lake or data warehouse. I model this thing like in any way I want, but I centralize also like the semantics around the meaning of this data. So when I'm going to talk about revenue, it doesn't matter if I'm cost us from sales and Derek from marketing.
Starting point is 00:44:42 we are going to use the same definitions of what revenue is, right? Or we will have multiple different ones, but we would know which one each one of us is using. So the solution, usually like to these things is like to add abstractions, that's like how we've been doing it so far. And I think that's what is going like to happen now. The main difference is that so far we've been building the abstractions, considering one type of entity,
Starting point is 00:45:12 interacting with that, which is the human. We also have to make into account that we have another entity, which is the model, and the model needs a different experience than a human to interact with these systems. So we don't have only, like, user experience. Now we need also, like, I don't have a model experience, whatever, but this is the thing. All right, well, we have to use our remaining time to talk about type-deaf.
Starting point is 00:45:37 So I know you gave us a brief overview at the very beginning. But give us type-deaf in like two minutes. We have more than two minutes to talk about it. Yeah. So when we started type-diff, like our goal was to find, like to build the tooling that we need today to work with data. And when I'm talking, it sounds like very generic, but I'm, we started from like a very all-up perspective, right? What do we do with the data that we have, like, on our data lake or, like, our data warehouse, right? So we're not talking about, like, transactional use cases here, like, how you build your application with your performance database.
Starting point is 00:46:24 It's more about, okay, we have collected everything. What do we do with that now? Like, how do we build new applications on top of this data? Traditionally, they're like you're using systems like Spark, right? Yep. But Spark has started, like, showing its age because, again, as I said, like, at some, at the beginning, like these things were like built primarily with like the BI, like the business intelligence like use case in mind.
Starting point is 00:46:49 So when you try like to build them, builds, I don't like a recommender or like other types of like applications on top of your data, more customer facing things, it becomes hard to do it. The way that we've been solving it so far is by using talent, right? Like very specialized people who can make sure that this thing was going like to be. working properly, regardless of, like, what we throw on it. That's really hard, like, to scale outside of, like, the big area in a way, right? It's extremely hard to go and ask, like, every engineer out there to become an expert on, like, building and operating, like, distributed systems, especially, like, with data.
Starting point is 00:47:34 So we're like, okay, what's how we can solve that, like, how we can turn building applications, like with data, like a similar experience to how like phone end engineers and the backend engineers have with application development, right? What happens with MongoDB and Node.js becoming like a thing and node becoming like a thing
Starting point is 00:47:54 and suddenly we have this explosion of like millions of engineers like building things, right? But do it for data. That's how we started. To do that, we had like to build pretty much like from scratch in your query engine. We want to like to use familiar interfaces. So people can, but they have some experience with working with data, they can already, like, use it.
Starting point is 00:48:14 So we build on top of, like, the PISPARC API. We used, like, the data frame API as a paradigm because it's a good way to mix together imperative programming with declarative programming. So kind of have the best of both work, like from what you have with SQL, but also with, like, a language like Python. And then we had that. Well, I also like to make it serverless, but then, as we said, like, AI happened. So now we have, like, a new type of compute. So it's not all, like, the workload's completely changed. We don't have CPU is not the bottleneck anymore.
Starting point is 00:48:53 The bottleneck is all about reaching out to LLLM's and, like, hoping that we get something back. And also, we get something back. Do we know if this is correct? Like, that's not like a deterministic answer, right? So how do we engineer and put things like into production when we have, like, new workloads. So our next step was, okay, we are going to make inference, LLM inference, like a first classic engine. And we got kind of object of like, okay, how we can do that without having to introduce like completely new concept like the engineers. So we kind of introduced
Starting point is 00:49:28 like these new operators in the data frame API where as you have like a join before, now you have a semantic join. As you have like a filter before now you have like a semantic filter. And extends the operations that you already know how to do on data, but using both like natural language and also using unstructured data where something has to be inferred. It's not like explicit already in your data set. And then reducing all the, removing all the like hard work of like having to interact with inference engines, figuring out like back pressure, what to do with failure, all the things that are like extremely painful because these new technology
Starting point is 00:50:11 are still like young and many things haven't been figured out yet in terms of infrastructure but all these things end up like making working with them like unreliable enough to make it hard like to put into production. So our goal is like okay, the end
Starting point is 00:50:26 use type dev to build like AI applications both let's say like static applications that they have like a static execution graph but also agentic ones where you can let a model like decide
Starting point is 00:50:43 what to do based on like the tools that it has access to. Do it on data. So it's not like a generic environment that you can go and build let's say like any type of like a genetic workload there. Like if you want to go scrape the web and come back with insides
Starting point is 00:50:59 type def and fennick is not the way to do it. But if you won't like to implement that on top of your data warehouse data, then it's a great tool like to use and make it also really fast like to experiment because it's like very familiar
Starting point is 00:51:17 like to work wings and when you're ready like to get into production remove all like the boilerplate that someone is like to build in under like the monots the underlying infrastructure and making things like much more efficient at the end and more reliable like to put into production which is like quite a big problem right now
Starting point is 00:51:33 and why like many AI projects are like failing so I have to digest Yeah, it's a lot I know But it's hard to Talk about these things Without using a lot of words Yeah
Starting point is 00:51:49 You left us speechless No, we were both on me Yeah, can you go back to the semantic? Yes Can you imagine this like I want to talk a little bit on the semantic layer Because this has been a really fascinating one for me
Starting point is 00:52:02 Because I like your point a lot around like This we talked about earlier Historically, you've got BI tools Now we've got like, we've got maybe agents for first class citizens or people equipped with like AI tools. It's kind of another class of people. But back to the semantic layer, like there's a startup that I've followed their journey and talked to their founders a lot. And it's been interesting just to follow them where they were like really hard like semantic layer. Like it's not going to work at all without a semantic layer.
Starting point is 00:52:30 And then they were kind of and then like back to that like comment on like talent. It's like, well, how many companies are in a point where they have a mature enough warehouse and they have all this organized into, you know, a modeling tool like DBT? And they have like a mature semantic layer. Like even that number is like not super high. And so it's just interesting because even they, I think, have like gone back and thought like, well, but if we did kind of go back to text to SQL and think about like basically dynamically generated, you know, semantic layers.
Starting point is 00:53:02 So there's not as much like engineering involved. in that. So I wonder how many of those like reinventions will happen on like just pragmatically, right, where it's like, okay, this is how it should work. This is how it works best. We're going to have to go back and reinvent practically because like to our tan, like our tool addressable mark is not big enough. So we need to like go. Yeah. Yeah. I mean, I think that a lot of that stuff goes back like to kind of like what we're showing about like the continuity. Right. Like if if you have like a company that has. has been operating like a BI stack for a couple of years now, right? They probably have a code base of SQL that already exists there. And migrating that to like a semantic layer, which by the way, the semantic layer also needs to depopulate, that monads, right? Yes, you do add there like an abstraction that can probably make things better, assuming that it has been curated, right?
Starting point is 00:54:08 And most importantly, curate, like, someone has created it, right? That's, like, one of the reasons that traditional, like, the semantic layer is not something new. Like, has been around, like, for a very long time. But it was primarily, like, an enterprise thing. And it was an enterprise thing because the enterprise had the resources, like, to go and build and maintain these things, right? Now, can an LEMC help with that?
Starting point is 00:54:30 Maybe, I don't know. That's like something for the semantic layer people like to figure out. But at the end of the day, if you come to a team that already spends probably 40, 50% of their time, asking requests like, hey, I'm trying to calculate this thing. Do we already do it? And if yes, where, and can we update it to also add this new thing there? Because we also have a new source that we want to track SEO coming from related to it. And tell them, well, you can solve this.
Starting point is 00:55:01 if you go through like a six months project to build a semantic layer and educate also the whole company that they have like to use whatever we put in there yeah it's like super hard like even if on paper like it works you have to both change the organization behavior and to invest like in the technology resources that you don't already have so it's a hard sell right you need to I think in my opinion there's more of a product opinion you have to fix the problems
Starting point is 00:55:37 that already exist like what people carry from the past and make the transition easy if the transition is not easy to this new world that you are promising people wouldn't like it's too much and that's like part of like
Starting point is 00:55:51 why we build like type of the way we did is because if you try if you have to educate people a lot it's, you put a lot of risk in, like, what you are building. People don't have time, and you don't have the money also like to do it. So it has to be something that it's, like, very familiar for people, like, to use and makes it easy.
Starting point is 00:56:14 So all the decisions that we made is that familiar APIs for both humans and machines, right? Vice Park has been out there, like, for a long time. These models have been trained on that, so, like, the API is kind of, like, known. You can go and, like, ask it at the end of the, they like to build something on our framework and it will probably succeed, like after one or two iterations just because of this family are like with the syntax. So we need like to reduce the amount of like effort that people have to put in order to migrate into these new worlds. Because at the end of the day, like we kind of solve the same problems like in a better way.
Starting point is 00:56:52 But it will like to make this reality happen fast, we have like to help people migrate. also like fast. We can't just like promise a dream. We'll take them six months of implementation before they can't even like taste the dream. And that's what we are trying like to do, like with type of remove everything as much as possible that makes it really incompatible with what people already know. Like the same way that you would build like a pipeline in the past, like to process your data. Like you should do the same thing using LLMs without having to to learn like new concepts. That's, if we might not like to do that with type Dave,
Starting point is 00:57:34 from a product perspective, we are going to, I'll call it like a success. This is going to be a commercial success, lots of different conversation. But that's kind of like the goal, right? Do the things you are doing in the past, but in a much, much better way, because now transparently you can't use the LEMS
Starting point is 00:57:53 to do some of the stuff that would, like, extremely hard like to do before. But without compromising, on how you put things in the production, how you operate things, and how fast you can iterate on the problem you are trying to solve. I love it. Yeah, that was when you were giving me a demo earlier today,
Starting point is 00:58:15 that I think was, it was actually pretty surprising because when we talked about what would it take to productionize this for the use case we were discussing, it was just kind of it didn't really feel that unfamiliar. Yeah.
Starting point is 00:58:33 That is so. I mean, this kind of feels, you know, this feels very natural, right? Like, here's all the tables. You know, you have a pipeline set up. So yeah, I was,
Starting point is 00:58:47 yeah, that's super interesting. I didn't even really think about that. I just, my main thought was, oh, that's like, sounds way easier than I thought it was going. to sound. So hopefully that's commercial success. Sure. Yeah, yeah. It's on the way. A hundred percent. And I think like a positive side effect of using like familiar
Starting point is 00:59:08 paradox is that when things go wrong and of course things will go wrong, it will be easier for people like to reason about them and like figure out the issues and fix things. Again, I'll keep like kind of, I don't know, becoming like boring, but it is engineering at the end of the day. Like we've been spending so much much time building these best practices, these ways of like operating systems, operating unreliable systems in a reliably way. We just need like to use the same principles. And as you said, like put AI in there, but the AI should feel like almost magical. Like it shouldn't feel like, oh, now everything that I was doing is breaking because I'm trying to use this damn new
Starting point is 00:59:52 thing that's, I don't know why it breaks. And I think that goes back to what you were talking about with the use case. Awesome. Well, we are at the buzzer, as we like to say. Brooks is telling us we're out of time. So, Cacostas, I would love to have you come back on for a round two. And I want to do two things. Let's talk about some use cases that you're implementing for your customers.
Starting point is 01:00:17 And then the other thing that we didn't talk about that I would love to talk about, and this is just for me talking with, you know, some of our larger customers. and their restrictions on even using LLMs, you know, especially as it relates to certain types of data is a huge challenge, right? And I mean, you know, in the startup, you know, like you were saying, John, okay, this person's like, you know, just throwing SQL, you know,
Starting point is 01:00:44 probably straight into GPT and, you know, hearing data, you know, whatever, right? And it's like, okay, well, you, I mean, you cannot do that at a large company, right? And there are like a lot of security, like legitimate, you know, security concerns and other things like that. So I'd love to cover that too, Kastas,
Starting point is 01:01:00 because the types of workloads that you're running that's clearly a concern. Yeah, yeah, 100%. I think a lot of that stuff is being addressed and like I think it's getting easier. I like to use, like to find solutions that either through
Starting point is 01:01:16 using let's say proprietary like open source model that you only run or use like from the big providers, but, like, in very, like, secure ways. Like, it's something but, like, the big, like, open AI and, like, all these people are, like, this is kind of, like, a solved problem, like, at this point, I would say. And I would say that, like, most people probably end up using open source models,
Starting point is 01:01:43 not that much because of security, but more because of performance. Interesting. But that's, we can talk about that, yeah. Okay, it's a interesting topic. Love it. Thank you so much, guys. I loved it. And I'm looking forward to come back again.
Starting point is 01:02:00 Yeah, we'll do it soon. The Datastack show is brought to you by Rudderstack. Learn more at Rudderstack.com.
