The Data Stack Show - Re-Air: AI is All About Working with Data with Kostas Pardalis of typedef

Episode Date: November 12, 2025

This episode is a re-air of one of our most popular conversations from this year, featuring insights worth revisiting. Thank you for being part of the Data Stack community. Stay up to date with the latest episodes at datastackshow.com. Hosted by Simplecast, an AdsWizz company. See https://pcm.adswizz.com for information about our collection and use of personal data for advertising.

Transcript
Starting point is 00:00:00 Hey everyone, before we dive in, we wanted to take a moment to thank you for listening and being part of our community. Today, we're revisiting one of our most popular episodes in the archives, a conversation full of insights worth hearing again. We hope you enjoy it, and remember you can stay up to date with the latest content and subscribe to the show at datastackshow.com. Hi, I'm Eric Dodds. And I'm John Wessel. Welcome to The Data Stack Show. The Data Stack Show is a podcast where we talk about the technical, business, and human challenges involved in data work. Join our casual conversations with innovators and data professionals
Starting point is 00:00:36 to learn about new data technologies and how data teams are run at top companies. Before we dig into today's episode, we want to give a huge thanks to our presenting sponsor, RudderStack. They give us the equipment and time to do this show week in, week out, and provide you with valuable content. RudderStack provides customer data infrastructure and is used by the world's most innovative companies
Starting point is 00:01:03 to collect, transform, and deliver their event data wherever it's needed, all in real time. You can learn more at rudderstack.com. Welcome back to The Data Stack Show. We are here with an extremely special guest, Kostas Pardalis from typedef, formerly a co-host of The Data Stack Show. Kostas, welcome back as a guest.
Starting point is 00:01:27 Happy to be back here, Eric. It's always fun to be back on the show, which I have some great memories of, like, being the co-host here with you and working with Brooks. So I'm super happy to be here. Well, I know that longtime listeners are very familiar with you, but for the several million people that are new listeners since you left the show, give them just a brief background on yourself. Oh, are you saying, like, I'm not that popular yet, so not everyone already knows about me? I thought they did. Oh, my goodness.
Starting point is 00:02:03 Okay. All right. I'll do it, even if it's just a formality, but I'll say a few things about myself. So, yeah, I'm Kostas. I'll keep it brief because I'm sure, like, we'll talk more about that, like, during the show. So I've been building data infrastructure for, like, more than a decade now, especially, like, at startups. What I really enjoy is the intersection of business building and technology building. I do have, like, an engineering background. I can't escape that. But at the same time,
Starting point is 00:02:33 I really enjoy, like, building products and companies. So naturally, I found something very technical to productize and commercialize and sell to, like, technical personas. I'm working on something new right now, and I'm looking forward to talking more about that and what is happening in the world in our industry today. Awesome. So Kostas, really excited to hear about what you're doing now. We'll definitely dig into some AI topics. And then I will save this for the show, but I want a response from you to some of the news the last couple weeks about return on investment for AI. There's some stuff floating around where all these companies are saying, you know, they've invested X amount of money and not seen the return on AI. So we've got to
Starting point is 00:03:19 talk about that topic. And what do you want to dig into? Oh, 100%. I think talking about what has happened so far with AI, where we are today and what's next, I think it's, like, very interesting. So definitely looking forward to talking on that and, like, what's next. Because if anything, what we've seen so far is everything has been, like, extremely accelerated. Like, things that in other industries probably take, I don't know, like, maybe years or even decades to happen, with AI right now in the tech industry take, like, literally weeks or months. So what we are seeing is pretty normal, natural, and I think it kind of gives us a hint of what is coming next, which I think is super exciting. Awesome, well,
Starting point is 00:04:09 let's dig in. Yeah, let's do it. Okay, Kostas, we're going to save the full story for a little bit later in the show, but give us just the brief timeline. So was it three years ago you left RudderStack? Probably, no, something like that. Yeah. Yeah. Yeah. A little bit more than, like, three years, probably, like, closer to four, maybe. Yeah. Okay, so give us the timeline. What happened between the time you left and what you're doing today? Yeah. So I left RudderStack primarily because I was looking for, like, some new stuff to work with. Just for the people that probably don't know
Starting point is 00:04:56 about what I was doing until that time: like, since 2014, I pretty much have been working on, like, building infrastructure for data ingestion. So anything related to, let's say, what was called back then, like, the modern data stack, with, like, the ELT pattern and, like, extracting data and loading data into the warehouse, blah, like, all that stuff. But beyond what I had been doing, the way that I kind of model, like, the industry is, like, an onion,
Starting point is 00:05:30 right? Like, you have ingestion that needs, like, to happen in the periphery. That's what brings the data in from all, like, the places. But in the core, there's, like, compute and storage, right? And usually these parts are, well, like, I would say, we can't live without them. We can live without ingestion, probably, but we can't live without compute and storage, right? Like, if we did not have the data warehouse, our pipelines wouldn't be valuable. And there are also, like, some very interesting, like, technical problems around that.
Starting point is 00:06:05 And I wanted, like, to go deeper into these problems and this space. And by the way, there are, like, challenges and opportunities there, both from, let's say, like, the technology side, like, what it means to build something like Snowflake, for example, or something like Spark, right? But also business related. Like, we do see, like, the size of, let's say, the data warehousing market compared, like, to the adjacent markets. There's, like, little comparison there, right. So that's what I wanted to do. I was like, I would like to go deeper into data infrastructure. And I was looking to join, like, a company that was building some kind of query engine, around, like, that stuff. I ended up, like, joining Starburst Data. Starburst Data was, still is, commercializing Trino. Trino was initially created under the name, like, Presto. Presto and Trino were, like, created around the same time as Spark. And, let's say, Spark was, like, initially being primarily used for preparing the data when you have big data, and then you would use something like Presto to query the data
Starting point is 00:07:18 after the data has been, like, prepped. Amazing project, open source. Like, for anyone with, like, interest in getting into that stuff, they should definitely check it out. It's probably one of the best projects for anyone who wants to see how, like, databases and OLAP systems work. And I joined for that, plus it is a company with a very strong enterprise go-to-market motion, and I wanted, like, to do that because until and including, like, RudderStack,
Starting point is 00:07:49 my experience with go-to-market was primarily, like, from SMBs up to, like, mid-market, and more, like, the PLG kind of way of, like, growing. And I really wanted, like, to see how enterprise also, like, works. And Trino and Starburst have been used, like, more by, like, the Fortune 100 companies. So it was, like, a great way for me, like, to see also how that works.
Starting point is 00:08:11 So I joined there, spent about a year being a little bit selfish, to be honest. I knew that I was joining primarily to learn and, like, find opportunities of what, like, to start next, because I wanted, like, to start another company. And what I started realizing there was that the data infrastructure that we have today and the tooling that we have has been, like, designed and built pretty much, like, 12, 13 years ago, under completely different assumptions of, like, what the needs are of the market, right? The main reason that these systems were built was to serve BI, business intelligence, right?
Starting point is 00:08:51 Like, reporting at scale. Of course, like, with big data, but they are built with this use case, like, in mind. Through, like, this decade, we had more use cases that started, like, emerging. We have ML, we have predictive analytics. Data started actually going from, let's say, something that we used to understand what we did, to actually being the product, right? And that's kind of, like, the intuition and, like, the experience I had there. It was like, okay, I think what is happening here is that we spent the past 15 years building SaaS
Starting point is 00:09:29 and pretty much digitizing every activity that we have. This is done. Like, okay, how many new, like, Salesforces are we going to build? Like, we still have the CRMs out there. Like, this thing is already, like, digital. Same thing with marketing. Same thing with, like, product design. Same thing with pretty much, like, everything, right?
Starting point is 00:09:52 Same with, like, the consumer side too. So how is this industry going, like, to keep accelerating and growing, right? And the answer for me to that was, like, data. Now that everything is digital, we create, like, all the data. So we need to figure out ways to build products on top of the data. Like, the data will become, like, the product. And the next decade we are probably entering is going to be all about how we deliver value over that. And then AI happens, and it just accelerated everything. Because if we think about what, like, AI is,
Starting point is 00:10:29 it is all about working with data at the end of the day. Sure, you can use it, like, to create a new type of CRM or, like, create a new type of, like, an ERP or whatever. But at the end of the day, the reason that this is going to be better than what Salesforce was is because it's primarily going, like, to be built on top of data that is being generated and used, like, with models
Starting point is 00:10:59 to do things that were not, like, possible to do before, like, in an efficient way. Based on that, and actually before the AI thing happened, like, I decided that I wanted, like, to go and start a company where we are going to build, like, the next generation of query engines that will make it, like, much more accessible for more people to go and build on top of data. We started with that, and as we've been building and interacting, like, with design partners and seeing, like, what's going on, like, with AI, we ended up building, essentially, like, a new type of OLAP query engine
Starting point is 00:11:39 which, outside of pure compute, also considers inference, which is a new type of compute, as a first-class citizen. And you can use this technology, like, to work with your data, both how you would do, like, traditionally, using something like Snowflake or DuckDB, but also mixing in, like, LLMs and inference in a very controlled way, and in a way that's, like, very ergonomic for developers to build with.
Starting point is 00:12:07 Man. That was really comprehensive. That was like an LLM summary of the last couple of years. Yeah, I'm spending too much time with LLMs and they start, like, affecting the way I talk. Yeah. Yeah, yeah.
Starting point is 00:12:26 No, that was great. Okay, I have a zillion product questions, and I know John does too. But let's talk about, let's just talk about your perspective on what's happening with AI. Like you said, it's, you know, compressing years into months and weeks.
Starting point is 00:12:46 And, you know, it's interesting. If you read a lot of the posts and comments on Hacker News, opinions are very polarized. You know, there's a lot of people where, you know, you can sense fear. There's a lot of developers who, you know, have a deep sense of FOMO, you know, who are trying to navigate things, you know, and opinions are all over the place. But you're actually, you know, building with AI as a core part of the compute engine that you've built. So what is, what's your disposition? I guess the other component I would add that I think is really
Starting point is 00:13:31 mind-boggling is the amount of money that's being poured into this, which I think is hard for a lot of people to interpret, just as far as, you know, is that just hype, you know, especially based on some of the product experiences. So there you go. An easy, that's a softball for your first question. Yeah. Okay, where do you want to start from? Like, do you want to talk about the reactions that people have? Or?
Starting point is 00:14:03 Yeah, I think it'd be an interesting place to start. Like, yeah. Are you surprised by the varied reactions? No, I'm not. I mean, like, that's always the case, right? Like, I don't think that, when something new comes out, and it's not just, like, an incremental change to something that we already know and are familiar with, I think humans tend to, like, get, like, polarized.
Starting point is 00:14:36 You have, like, the people who are like, oh, yeah, that's the best thing that ever happened, like, to humanity. And then you have people who are like, this is useless. Like, you can see that, like, with electric cars, right? I'm sure that the first people who started, like, buying these cars were pretty much, you know, like, just, they would find this thing, like, perfect even if it was, like, breaking every two miles, right? And then you have, like, the people who are like, okay, if I don't have, like, a V8 that wakes up everyone, like, around me, like, why would I have a car?
Starting point is 00:15:11 Right. And, like, they are both valid. I mean, like, I like both, right? Like, I do see the joy of, like, you know, a noisy V8. I also see the convenience of, like, a car that's pretty much, like, an iPhone on wheels. I think that's, like, always the case until, you know, like, something gets, like, normalized, and then it's just, like, everyone accepts it. I think, for people who lived, like, when the iPhone came out, I don't think, like, everyone who used
Starting point is 00:15:40 it was like, oh my God. Like, there were definitely people who were like, okay, like, this thing, like, doesn't work well enough. Like, I remember my first iPhone, for example. I was promised, like, this thing would connect to Wi-Fi. And actually, for me,
Starting point is 00:15:57 it took, like, a couple of months until an update came out so that it actually managed to get on Wi-Fi, right? It wasn't anything, like, related to what we have today. And, okay, for the even older people who experienced the Internet
Starting point is 00:16:11 when it just came out, well, I don't think that downloading anything back then was reliable at all, right? Like, you would download something, and it would take forever. You'd go run it and, oh, shit, this thing is corrupted. I'd have to redownload the whole thing, right?
Starting point is 00:16:32 Of course, like, today we take it for granted. Back then there was, like, the 2400, and then the 9600. By the way, these are just, like, numbers of bits per second. We're not talking about gigabits or, like, whatever we have today, right? And the reason I'm saying that is,
Starting point is 00:16:55 I'm sure, like, I don't know, when I, like, started as a kid interacting with the internet, okay, like, my parents probably were thinking, oh, it's, like, a new toy, you know. I don't think, like, they could comprehend what this thing would become, like, 20 years later, right?
Starting point is 00:17:14 But for this to happen, it took a lot of investments, it took the dot-com thing to happen, it took a lot of engineering, like, really hard engineering, and it took time. What we see, I think, like, today is that these times are getting compressed. And usually, because in a way, like, money, the way that money works, especially, like, in investments, and why people, like, raise money, for example, like, to build something, is because money is kind of, like, compressed time. Like, something that without money would take you, let's say, one year to do, if you raise money, you can probably do it, like, in three months.
Starting point is 00:17:56 So the reason people see all these, like, huge amounts of money being poured into that is because there is a race to make things happen as fast as possible. And what took the internet, like, 20 years, we try, like, to make happen in, like, five years, right? So I think that's, like, the mental model at least I have when I'm trying, like, to judge why, like, these amounts of money are, like, going into that. Of course, there's also the thing that with this technology, compared, like, to something like, I mean, like, with crypto you also had that, I think, because people were investing in infrastructure, like, to mine this stuff. But here you do have, like, huge infrastructure investments too, right? You need, like, the data centers, and even before that, you need energy.
Starting point is 00:18:44 So there's, like, a lot of money that, like, is going toward that stuff. Now, there is one more thing, though, because the interesting thing is that it's not just polarizing for people in general. It's polarizing for engineers too. And I think what's, like, the most interesting thing for me with AI is that it's the first time after, I don't know, like, decades, where tech workers are not disrupting other industries. They are disrupting themselves. And that's scary, right? Yeah.
Starting point is 00:19:18 So tech workers usually were like, oh, I'm coming into this industry. I'm digitizing this thing. And of course, the people who used to do that work before, they were like, oh, like, you are replacing me, or, like, you're doing this, or, like, you're doing that. But never, at no point, was anyone like, oh, this is going to replace, like, the engineers themselves. Now, there is a feeling that this might be happening. I don't agree with that, but I think, like, the polarization also, it's more interesting right now, or the drama is more interesting,
Starting point is 00:19:48 because it's actually internal drama and internal disruption that is happening in the tech industry itself. So it doesn't just, like, disrupt other industries, it disrupts itself. So I have a question then, with all that said, Kostas, jumping back to the funding thing: at what point, and I think maybe we're already there, is AI essentially too big to fail? There's too much money, too many people invested, like, we're going to make it succeed. Like, build it, like, you know, everything. Because there's something, there's, like, a human thing where, like, one of the reasons that, like, these things succeed is because
Starting point is 00:20:20 everybody decided that we wanted it to succeed. Yeah. I mean, it's going to fail to meet some people's expectations for sure. Like, it can't, like, it can't meet everybody's expectations, but yeah. I think what we lack in these conversations, in my opinion, is, like, a definition of, like, success and failure, right? Like, what does it mean for AI, like, to fail, for example? If we set the conversation as, like, okay, the goal of what we are doing right here is, like, to create, like, the Terminator, who's going, like, to, I don't know, like, rule the world and we will just all retire as humanity,
Starting point is 00:20:57 yeah, like, of course it's going to fail. Like, I don't see that happening, like, in the next, like, two or three years. Like, it probably never will happen, because it's much more complex than that, right? Like, even if you had that, if you had created that, deploying that thing is, like, a human endeavor. And, like, the way that the human mind works is, like, it's certainly complicated. So you can't just, like, reduce this whole process into a statement of, oh, when we have AGI, like, it's game over. Like, that's the goal. And, like, then we succeed. Without that, we don't, right? So, I think, like, in my opinion, like, we can't talk about success and failure yet, primarily
Starting point is 00:21:43 because what we are doing right now is, okay, we have a new thing out there. This new thing has new capabilities that we didn't have before, okay? We are still trying to figure out the limits of this thing, but most importantly, we are trying to figure out the problems that make sense to solve, and what it takes to make it viable, right? So, to put it this way, like, there are problems today that you can solve with AI, but it's not viable, because AI is still, like, too expensive for many use cases. Right?
Starting point is 00:22:27 Yep. You have cases where you have new things that you couldn't do before, that you can do with AI, but it's not reliable enough to put into production, right? And there's still, like, a lot of
Starting point is 00:22:45 stuff that we don't even know yet that we can solve, like, with this new technology. So it's still, like, an exploratory phase of, like, trying to figure out what makes sense, like, to do with this thing. What is, like, let's say,
Starting point is 00:23:03 the killer app for this, which I think is already being deployed in some cases and delivers value there, but there are other cases, obviously, where it fails. And, like, as with every other R&D project out there, like, there's going to be, like, a lot of failure. Like, that's what R&D is, right? Like, failure is, like, you have to embrace that. Like, a lot of that stuff, like, is going, like, to fail.
Starting point is 00:23:24 The difference, in my opinion, is that experimenting is quite cheap compared, like, to doing it in the past, right? If someone wanted, let's say, a couple of years ago, to go and experiment with, like, incorporating, like, ML models to build, like, a recommender system, it couldn't be just an experiment. Like, they would have, like, to make sure that this is going to work, because it would be a big investment for them.
Starting point is 00:23:52 Like, you have to find the people. You have, like, to find the data. You have, like, to iterate on these. It takes months, maybe years. And, like, many times, like, what was happening was that we were faking that these things were, like, succeeding, because we had invested, individually, like, too much into them. It probably doesn't hurt the company, but it also doesn't mean that it adds, like, the value that we were expecting it to add. Right. Right. What I like to, I think it's really fun to look back. So you're probably familiar with the TV show The Jetsons, the old TV show, the animated one.
Starting point is 00:24:29 So it's fun to look back, you know, at the future looking, you know, what's the future going to be like? And the two things that jump out at me from that, which is, you know, they originally made the show decades ago, flying cars are part of that show, and robots. And if you add, like, those are two, it's hard to think of, like, the things people thought would exist now, 30 years ago, 50 years ago, right?
Starting point is 00:24:54 But it's helpful, like, to bring that up. Like, so there's going to be some AI applications we're working on today that will be the equivalent of a flying car. Like, we just haven't gotten there, the physics don't work, like, we don't know how to solve, like, that problem. Or, like, the robotics, like, you know, so far has been slower than a lot of people thought. Like, we don't all have, you know, robots in our houses, other than maybe vacuums, right? So, like, what does that look like? So I think that's really interesting too, because we're in this experimentation phase, to think about which categories right now that we're throwing AI at, which
Starting point is 00:25:28 categories are going to hit the walls, and which are going to be the future, you know, flying cars, for example. Yeah. Yeah. First of all, I mean, you mentioned, like, robotics. I think robotics is, like, a big thing, like, for many different reasons, not, like, only because of AI. I think, like, traditional robotics has been, like, a space
Starting point is 00:25:46 where building goes, like, extremely slow. But it is a space that now has been, like, accelerating, like, a lot. And there are new models of, like, building. Like, there's open source robotics now, like, things like that. And I think that there's definitely going to be, I'll say, like, a very interesting intersection between, like, the robotics itself and AI, and what together they can do. But I think, like, first of all, one of the things is, I don't know, like, people, I think, like, they need, like, to take a step back and think a little bit about what has happened in the past, like, three years, like, with AI. We are still trying, like, to figure out what's, like, the right
Starting point is 00:26:17 way for us as humans, like, to interact with this thing, right? Like, Copilot came out, and for a while, like, Copilot was, like, the thing, right? Like, let's build, like, a copilot for everything. Let's build a copilot for writing code. Let's build a copilot for Word and Excel.
Starting point is 00:26:42 Let's build a copilot for, I don't know, like, whatever. And I think, like, what people, like, started realizing is that the copilot thing, which pretty much means we have this model and then we have the human who is in the loop there to make sure that the model always stays, like, on track, it's not very efficient, right? Because, like, what happens
Starting point is 00:27:07 is you have tasks, sure, like, some of them, like, might be accelerated because you are using the copilot, but then you have, like, a human who, instead of doing other things, is, like, babysitting a model, like, to do something right. So of course you are not going to see, like, crazy
Starting point is 00:27:26 ROI there. Like, what's the point? I mean, it's just, like, instead of typing into the computer, like, on Excel, like, now you have someone who's, like, typing in free text, like, to a model, trying to convince the model to do the right thing, right? So that part of, like, the automation, I think it became obvious that, like, it didn't work that well. There are, like, some cases where it works, but, like, it's not as global or universal as, like, we would think it would be. Then we started, like, seeing new paradigms of, like, how these things can be done. Like, at the end of the day, like, if someone tries to abstract what is
Starting point is 00:28:02 happening, it's, like, how we can treat, like, these models, like, as we, like, treated, like, software before. Which is, okay, I want this thing, when it has a task, like, to just go and do it and come back, and make sure that when it comes back, it's, like, the right thing. But the problem with models is that
Starting point is 00:28:18 by their nature, they are not deterministic. So things might go, like, the wrong way. So we need, like, to figure out new ways to both, like, interact and also build systems out of this. It's, like, an engineering problem at the end of the day. Like, it's not, like, the science has been done. Like, the thing is out there. Okay. How do we make this thing, like, reliable at the end of the day?
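One small, concrete version of the engineering problem Kostas is describing is wrapping the non-deterministic model call in deterministic checks. The sketch below is a generic pattern, not any particular framework's API; the call_model stub and the sentiment task are hypothetical, just to show the shape of validate-and-retry.

```python
# Generic reliability wrapper around a non-deterministic model call:
# validate the output against a fixed contract, retry a bounded number
# of times, and fail loudly instead of passing junk downstream.
import json

def call_model(prompt: str) -> str:
    # Stand-in for a real LLM client; imagine it sometimes returns junk.
    return '{"sentiment": "positive"}'

def extract_sentiment(text: str, retries: int = 3) -> str:
    prompt = (
        'Classify the sentiment of this text as "positive" or "negative". '
        'Reply as JSON like {"sentiment": "..."}.\n\n' + text
    )
    for _ in range(retries):
        raw = call_model(prompt)
        try:
            value = json.loads(raw)["sentiment"]
            if value in ("positive", "negative"):  # deterministic check
                return value
        except (json.JSONDecodeError, KeyError, TypeError):
            pass  # malformed output counts as a failed attempt
    raise RuntimeError(f"no valid answer after {retries} attempts")

print(extract_sentiment("I love this product"))  # -> positive
```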
Starting point is 00:28:35 Yeah. Yeah. I think, you know, one thing that John and I have talked about is that, you know, in a lot of cases, and actually, one interesting thing, when we start to get into typedef here in a little bit, Kostas, I think this is really applicable to the API that you've built. But, like, like you said,
Starting point is 00:28:56 unfortunately, in my opinion, like, having a copilot chatbot as, like, the thing that just everyone deployed in every possible way for every use case was really a bad start. Because I think the best use case, or, like, maybe a better way to say it would be, I think some of the best manifestations of this, as far as user experience, is that you won't really notice AI. It's not like it's at the forefront, right? It's just sort of disappearing behind a user experience that feels, like, magically fluid or high context. I mean, it's going to hide an immense amount of complexity and make hard things seem really simple. Yeah, because as an example of that, like, Netflix is one of my
Starting point is 00:29:46 favorite examples of that. Like, the, like, brilliance of, like, their recommendation engine stuff they did, it's completely invisible to the user. Other than, like, oh, I might like to watch that, you know? Like, those are the experiences I think will be fascinating to see, like, come into lots of different products, like, with AI. And I haven't seen as much of that yet. Well, I think you can see them, like, in some cases. Like, sorry for interrupting, Eric, like, but there are, like, some cases, like, in development, for example, right? And again, you have something like Claude Code, which, okay, like, is an experience on its own with its own limitations, right? It doesn't mean that you just throw this thing out there and, like, it's going to build, like, a whole Linux kernel and so on.
Starting point is 00:30:31 But stuff like using models to do, like, a preliminary review of a new PR, right? Or actually using a model as you do, like, a PR review. Like, these things are accelerating processes, like, a lot. Okay, they are not replacing the engineer, right? And I don't see why this is a bad thing, but it does make the engineer, like, much more productive at the end of the day. Same thing with, like, sales, like, for example, right? Like, okay, like, you want to go and personalize, like, messages that you are going, like, to send, like, to 200 people.
Starting point is 00:31:10 Like, in the past, if you wanted to do that, it would take, like, I don't know, like, two hours probably, like, to go and do that. Now, it will probably take half an hour. Now, it doesn't replace the SDR, or some people might claim it does, but I don't think it does. It does make the people more productive, though. And I think that is the reason that, there was, like, a conversation, like, for a while where they were saying that the observation is, when it comes, like, to impact on jobs, the first layer of professionals that's been affected by that is, like, middle management. And the reason for that is because, in the past, for every, like, five SDRs, you probably needed, like, one sales manager. Now you need one sales manager for, like, 100 of them, or, like, 50 of them, right?
Starting point is 00:32:06 Because a lot of the stuff that you had to do to make sure that these people, like, were doing the right thing now can happen much more efficiently. The same thing also, like, with customer support, which is, like, one of, like, the most common, like, use cases where, like, AI is, like, heavily used. Like, one of the things that the managers had to do was, like, to go through the recordings that the agents had and make sure that they were doing the right thing. That's super time consuming, right? Like, you literally have someone who works, like, for eight hours as an agent and talks, like, in
Starting point is 00:32:38 total, I don't know, like, let's say three hours. Someone had, like, to go through, like, three hours of transcript and figure out if they were doing the right thing. Now they can do that, like, for many more people in less time, because they have, like, the tool to do that, right? So I think there is, like, impact happening out there. It's just that, the way that the dream of AI is being sold, what is happening is not as sexy as the dream. We're going to take a quick break from the episode to talk about our sponsor, RudderStack. Now, I could say a bunch of nice things as if I found a fancy new tool. But John
Starting point is 00:33:14 has been implementing RudderStack for over half a decade. John, you work with customer event data every day, and you know how hard it can be to make sure that data is clean and then to stream it everywhere it needs to go. Yeah, Eric, as you know, customer data can get messy. And if you've ever seen a tag manager, you know how messy it can get. So RudderStack has really been one of my team's secret weapons. We can collect and standardize data from anywhere, web, mobile, even server side, and then send it to our downstream tools. Now, rumor has it that you have implemented the longest running production instance of RudderStack, at six years and going. Yes, I can confirm that.
Starting point is 00:33:54 And one of the reasons we picked RudderStack was that it does not store the data, and we can live stream data to our downstream tools. One of the things about the implementation that has been so common over all the years and with so many RudderStack customers is that it wasn't a wholesale replacement of your stack. It fit right into your existing tool set. Yeah, and even with technical tools, Eric, things like Kafka or Pub/Sub, you don't have to have all that complicated customer data infrastructure.
Starting point is 00:34:24 Well, if you need to stream clean customer data to your entire stack, including your data infrastructure tools, head over to rudderstack.com to learn more. Let's start to talk about data infrastructure, because I really want to talk about typedef, mainly because I got a demo of it. I got a demo of it right before we started recording, so I'm all excited about it. But let's talk about data infrastructure, because I agree,
Starting point is 00:34:46 you know, totally, Kostas, that a lot of the significant impact that's happening isn't super sexy. Where are you seeing, I mean, obviously, you're building some of this with typedef, but, and John, I would ask you this question too, because you, you know, you have a really good handle on the landscape and use new tools all the time. Like, it's interesting, because non-deterministic, you know, having a non-deterministic tool for data infrastructure is really different than, like, summarize a transcript and give me the gist of it, right? Like, the threshold for making non-deterministic changes to a production system, or to data that, you know, is business critical, you know, clearly there's a different threshold there.
Starting point is 00:35:30 But what does the landscape look like with using LLMs in data infrastructure? Well, I have a really small anecdote here that I'll share, Eric, that I think is interesting. So I occasionally do mentoring stuff. And I had a mentoring call earlier today. And somebody's using an LLM to generate SQL to look at web analytics. I'm sure that happens all over the place, especially, you know, with startups. And it was a startup. So I get on this call. And it was so funny, like, even a few months ago, I probably would have, like, walked them through, because they didn't really know SQL, right? So, walked them through and taught them a little bit about SQL. But I actually taught them a little bit about prompting instead. That's what I did. So, like, it was
Starting point is 00:36:14 the simplest solve. They were getting a little lost in this query. And essentially, like, it was a really short solve of, like, hey, break this down into CTEs. Like, let me show you how to prompt it to make it use CTEs instead of subqueries. So we did that and then said, all right, run each CTE. And if there's an error in the CTE, take that one part out, drop it in a new window, tell it to fix that piece, move it back over. And then, like, work through it. And we did it together. And in, like, 15 minutes, she's like, oh, like, this is amazing. This works great. And it was just, like, something that even, like, six months ago, like, that's not how I would have walked through, you know, a problem with somebody. So yeah, yeah, I've got to think about the implications of it, but yeah.
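As a rough illustration of that debugging loop outside a chat window, here is a minimal Python sketch that checks LLM-generated CTEs one at a time. It assumes DuckDB, and the table and CTE definitions are made up; the point is that running each cumulative prefix isolates the failing CTE, so you can hand just that piece back to the model to fix.

```python
# Validate LLM-generated SQL one CTE at a time: run each cumulative
# prefix so an error points at a single CTE instead of the whole query.
import duckdb

ctes = [  # (name, body) pairs, in the order the LLM wrote them
    ("sessions", "SELECT session_id, user_id FROM pageviews GROUP BY 1, 2"),
    ("daily_counts", "SELECT user_id, COUNT(*) AS n FROM sessions GROUP BY 1"),
]

con = duckdb.connect("analytics.duckdb")  # hypothetical local database
for i in range(len(ctes)):
    prefix = ", ".join(f"{name} AS ({body})" for name, body in ctes[: i + 1])
    probe = f"WITH {prefix} SELECT * FROM {ctes[i][0]} LIMIT 5"
    try:
        con.execute(probe).fetchall()
        print(f"CTE '{ctes[i][0]}' ok")
    except Exception as err:
        # This is the one piece you paste back into the LLM to fix.
        print(f"CTE '{ctes[i][0]}' failed: {err}")
        break
```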
Starting point is 00:36:51 Yeah. Yeah, it's interesting. I think, like, working with data is also, like, an interesting topic, like, when it comes, like, to LLMs, for a couple of different reasons. First of all, SQL was created because it was supposed to be used by business people, not technical people, right? So, like, it kind of resembles, like, natural language when you write it, right? It's a way, like, how the industry was, like, trying to create
Starting point is 00:37:21 a DSL that could be used by, like, non-technical people at the end of the day. Like, that was, like, the goal of that. Now, obviously, things get more and more complicated as we try to do more and more things, right? And, of course, like, when you start going to these people who are supposed to be
Starting point is 00:37:40 like, business people, or, like, business analysts, or even, like, managers, and you suddenly, like, explain to them using terms like, like, CTEs or projections or joins, they're like, what are you talking about?
Starting point is 00:37:51 But it turns out that it's a good language for LLMs, like, to generate, and for people, like, to debug. Because you usually end up, like,
Starting point is 00:38:07 reading the logic, and because it is, like, data flow, the model is, like, data-flow driven instead of, like, decision-driven, like, like, branch-driven. You will get something back for your question that, okay, you can spend some time, like, understanding what this thing is doing. Like, you don't have, like, to go through, like, thousands of lines of code, like, to figure out what's going on. Now, having said that, at the same time, as with everything else with AI,
Starting point is 00:38:31 people jumped directly, like, to the dream. Like, okay, let's do, like, text-to-SQL, right? Let's have, like, Eric go there and be like, hey, how did my product team perform in the last quarter, and expect something to come back that makes sense, right? We're not there yet. I don't know if we are going to get there. I think what will happen, to your point, and I think, like, John, what you described is, like, great, is that you have, like, these generalists, which is, like, the model that can do everything good enough. But if you want to do it, like, really well as an output, like, you really have, like, to constrain how it's going, like, to operate, right? And you have to constrain it based on, like, the problems that you try to solve, and each problem is different. So,
Starting point is 00:39:24 you need, like, a different context. Like, it's not, like, something generic that you can just put there and, like, it'll solve, like, every problem. That's where engineering comes in, right? So, I think we are at the time where, okay, we need to engineer solutions. We need to sit down, and for the problems that we are trying to solve, find the ways that these models can operate in good enough, like, margins of error, and put them into production and keep improving, as we did in the past, right? Like, that's what engineering has always been doing.
Starting point is 00:40:01 No difference. I think one thing that I'd be interested in both of your opinions on is, I agree that we need to engineer solutions. I think part of that is in the modeling layer, right? So, like, one of the challenges, if you think about an LLM writing SQL, is that the underlying data sets
Starting point is 00:40:26 are, like, wildly different, even for the same basic use case, right? And so if there was a way to normalize on, you know, a basic data model. So you mentioned web analytics, right? Well, that's actually a fully known domain, you know, there are, like, standards you can use for that. It's a fully known, you know,
Starting point is 00:40:51 you have page views, you have sessions, you have whatever, right? Those are all, like, almost ubiquitously defined terms, right? And so, in fact, if you were able to have a consistent underlying data model, then you would be setting the LLM up for success, because it's not having to try to interpret, like, wildly different underlying data models to produce the same result. And I think about the same thing with frameworks, right? I mean, if you think about, you know, v0 from Vercel, like, it's generating Next.js, right? I mean, that framework is super well defined.
Starting point is 00:41:35 There's a zillion examples, right? And so, like, within a certain set of parameters, it can do some pretty awesome stuff, you know, like, with those guardrails there. So do you think we will also see a big part of this become a normalization, or sort of standardization, of underlying data, in order to create an environment in which the LLM is set up better for success? No. The reason I'm saying that is because I think, like, standardization, when it comes, like, to data and schemas and all that stuff, has been tried a lot in the past. And it has always, like, failed. Because the problem with these things is that it's extremely hard, like, first of all, like, to agree about, like, the semantics, like, what it means.
Starting point is 00:42:25 Like, actually, there's, like, a very rich literature out there, like, scientific research, on, like, how to model, like, specific domains. Like, especially, like, in archiving, for example. Like, if you go there, you will see that, depending on, like, the type of, like, medium that you want to use,
Starting point is 00:42:42 there are very well defined, like, schemas and, most importantly, semantics around, like, how do you digitize, like, a book, right? Like, what are the parts that you break it down into, what's the metadata that you need for these things. Like, there is a lot of work that has been done. But, like, the problem with that stuff is that it's extremely hard, like,
Starting point is 00:43:02 to get humans, like, to agree upon these things. And for a good reason. It's not, like, because we're, like, a problematic species. It's just that all these things are very context sensitive. And the way that I will do these things, like, in my company, like, might be very different compared to, like, how Eric does things, like, in his company. And if you want to agree on something, it has to be good enough for both of us, without causing problems to either of us because of, like, whatever exists in there to satisfy, like, another stakeholder, right? So it's really hard.
Starting point is 00:43:38 And there's another thing there, which is continuity, right? We are not just resetting. Like, the enterprise, like, go, like, to Bank of America. I don't know, how long, like, has Bank of America been operating? Probably they started when, like, IBM started building, like, the first mainframes or whatever, right? Like, you can't go in there and just, like, remove everything and put something new in there. Like, you need, yeah, like, you need, you need continuity, right? So,
Starting point is 00:44:09 really, I think what can happen is, like, a couple of different things. One is, either you decide how models should come to a consensus of, like, how to do things, and you let the models, like, figure this out, and you don't care at the end of the day about the data model. Or you have another layer of abstraction, which is what semantic layers are, right? Like, the whole concept of a semantic layer is that, okay, I have my data on my data lake or data warehouse, I model these things, like, in any way I want, but I also centralize, like, the semantics around the meaning of this data. So when I'm going to talk about revenue, it doesn't matter if I'm Kostas from sales and Eric from marketing. We are going to use the same definitions of what revenue is, right? Or we will have multiple different ones, but we would know which one each one of us is using. So the solution, usually, like, to these things is, like, to add abstractions. That's, like, how we've been doing it so far.
Starting point is 00:45:20 And I think that's what is going, like, to happen now. The main difference is that, so far, we've been building the abstractions considering one type of entity interacting with them, which is the human. We also have to take into account that we have another entity, which is the model. And the model needs a different experience than a human to interact with these systems. So we don't have only, like, user experience. Now we also need, I don't know, like, a model experience, whatever, but at least something like that.
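To ground that, here is a toy sketch of the "one definition of revenue" idea: a single metric definition that renders both as SQL for a human dashboard and as prompt context for a model. Every name and the schema here are hypothetical, not any particular semantic-layer product's format.

```python
# One shared metric definition, rendered two ways: canonical SQL for
# people and dashboards, and plain-text context for a model.
METRICS = {
    "revenue": {
        "sql": "SUM(order_total - refunds)",
        "table": "orders",
        "description": "Net revenue: order totals minus refunds, in USD.",
    },
}

def metric_query(name: str, group_by: str) -> str:
    """Render the canonical SQL for a metric, grouped by a dimension."""
    m = METRICS[name]
    return (
        f"SELECT {group_by}, {m['sql']} AS {name} "
        f"FROM {m['table']} GROUP BY {group_by}"
    )

def metric_context(name: str) -> str:
    """Render the same definition as prompt context for an LLM."""
    m = METRICS[name]
    return f"Metric '{name}': {m['description']} Computed as {m['sql']} over {m['table']}."

# Sales and marketing get the same number because both paths share one
# definition; a model consuming metric_context sees the same semantics.
print(metric_query("revenue", "region"))
print(metric_context("revenue"))
```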
Starting point is 00:45:51 All right. Well, we have to use our remaining time to talk about typedef. So I know you gave us a brief overview at the very beginning, but give us typedef in, like, two minutes. We have more than two minutes to talk about it. Yeah. So when we started typedef, like, our goal was to, like, build the tooling that we need today to work with data.
Starting point is 00:46:22 And when I'm talking, it sounds, like, very generic, but we started from, like, a very OLAP perspective, right? What do we do with the data that we have, like, on our data lake or, like, our data warehouse, right? So we're not talking about, like, transactional use cases here, like, how you build your application with your Postgres database. It's more about, okay, we have collected everything. What do we do with that now? Like, how do we build new applications on top of this data? Traditionally, there, like, you are using systems like Spark, right?
Starting point is 00:46:56 Yep. But Spark has started, like, showing its age, because, again, as I said, like, at some point at the beginning, like, these things were, like, built primarily with, like, the BI, like, the business intelligence use case in mind. So when you try, like, to go and build, I don't know, like, a recommender or, like, other types of applications on top of your data, more customer-facing things, it becomes hard to do it. The way that we've been solving it so far is by using talent, right? Like, very specialized people who can make sure that this thing is going to be working properly regardless of
Starting point is 00:47:32 like, what we throw at it. That's really hard, like, to scale outside of, like, the Bay Area, in a way, right? It's extremely hard to go and ask, like, every engineer out there to become an expert on, like, building and operating, like, distributed systems, especially, like, with data. So we were like, okay, how can we solve that? Like, how can we turn building applications, like, with data into a similar experience to what, like, frontend engineers and the backend engineers have with application development, right? What happened with MongoDB and, like, Node.js becoming, like, a thing. And suddenly we have this explosion of, like, millions of engineers, like, building things, right?
Starting point is 00:48:18 But do it for data. That's how we started. To do that, we had, like, to build pretty much, like, from scratch a new query engine. We wanted, like, to use familiar interfaces, so people who have some experience with working with data can already, like, use it. So we built on top of, like, the PySpark API. We used, like, the DataFrame API as a paradigm, because it's a good way to mix together
Starting point is 00:48:43 imperative programming with declarative programming, so you kind of have the best of both worlds, like, from what you have with SQL, but also with, like, a language like Python. And we also wanted, like, to make it serverless. But then, as we said, like, AI happened. So now we have, like, a new type of compute. The workloads completely changed. CPU is not the bottleneck anymore. The bottleneck is all about reaching out to LLMs and, like, hoping that we get something back.
Starting point is 00:49:17 And when we get something back, do we know if this is correct? Like, that's not, like, a deterministic answer, right? So how do we engineer and put things, like, into production when we have, like, these new workloads? So our next step was, okay, we are going to make inference, LLM inference, like, a first-class citizen. And we kind of obsessed over, like, okay, how can we do that without having to introduce, like, completely new concepts to, like, the engineers? So we kind of introduced, like, these new operators in the DataFrame API, where, as you had
Starting point is 00:49:51 like, a join before, now you have a semantic join; as you had, like, a filter before, now you have, like, a semantic filter. It extends the operations that you already know how to do on data, but using both, like, natural language and also using unstructured data, where something has to be inferred, it's not, like, explicit already in your data set. And then reducing, removing all the hard work of, like, having to interact with inference engines, figuring out, like, backpressure, what to do with failure, all the things that are, like, extremely painful, because these new technologies are still, like, young, and many
Starting point is 00:50:36 things haven't been figured out yet in terms of infrastructure. But all these things end up, like, making working with them, like, unreliable enough to make it hard to put into production. So our goal is, like, okay, at the end, use typedef to build, like, AI applications, both, let's say, like, static applications that have, like, a static execution graph, but also agentic ones, where you can let a model, like, decide what to do based on, like, the tools that it has access to. Do it on data. So, it's not, like, a generic environment that you can go and build, let's say, like, any type of, like, agentic workload there. Like, if you want to go scrape the web and come back with insights, typedef and fenic is not the way to do it. But if you want, like, to implement that on top of your, like, data warehouse data, then it's a great
Starting point is 00:51:28 tool to use. And it makes it also really fast, like, to experiment, because it's, like, very familiar, like, to work with. And when you're ready, like, to get into production, it removes all, like, the boilerplate that someone has, like, to build in order, like, to manage the underlying infrastructure, making things, like, much more efficient at the end, and more reliable, like, to put into production, which is, like, quite a big problem right now, and why, like, many AI projects are, like, failing. So I have to digest. Yeah, it's a lot. I know.
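To make the semantic-operator idea concrete, here is a toy sketch of mixing a classic filter with an LLM-backed one in a DataFrame-style API. It illustrates the pattern Kostas describes, not typedef's or fenic's documented interface; the Frame class, the operator names, and the stubbed llm call are all assumptions.

```python
# A DataFrame-like pipeline where one operator is deterministic and one
# delegates its predicate to a model. Everything here is illustrative.
from typing import Callable

def llm(prompt: str) -> str:
    # Stub for an inference call; it only inspects the text after "Text:".
    return "yes" if "refund" in prompt.split("Text:")[-1].lower() else "no"

class Frame:
    def __init__(self, rows: list[dict]):
        self.rows = rows

    def filter(self, pred: Callable[[dict], bool]) -> "Frame":
        # Classic deterministic filter, same as any DataFrame API.
        return Frame([r for r in self.rows if pred(r)])

    def semantic_filter(self, question: str, col: str) -> "Frame":
        # LLM-backed filter: keep rows where the model answers "yes".
        return Frame([
            r for r in self.rows
            if llm(f"{question}\n\nText: {r[col]}\nAnswer yes or no:") == "yes"
        ])

tickets = Frame([
    {"id": 1, "text": "I want a refund for my last order."},
    {"id": 2, "text": "How do I reset my password?"},
])

# Mix deterministic and inference-backed steps in one pipeline.
urgent = tickets.filter(lambda r: r["id"] > 0).semantic_filter(
    "Is this ticket asking for a refund?", col="text"
)
print([r["id"] for r in urgent.rows])  # -> [1]
```

A real engine's value is in everything this sketch skips: batching the inference calls, retries and backpressure, and validating the model's answers before they land in a production pipeline.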
Starting point is 00:51:59 It's hard to talk about these things without using a lot of words. Yeah. You left us speechless. No, we were both, you know... Yeah, can you go back to the semantic...? Yes. Can you go back to the, like, I want to talk a little bit about the semantic layer, because this has been a really fascinating one for me.
Starting point is 00:52:21 Because I like your point a lot around, like, as we talked about earlier, historically, you've got BI tools, and now we've got, like, we've got maybe agents as first-class citizens, or people equipped with, like, AI tools, as kind of another class of users. But back to the semantic layer, like, there's a startup where I've followed their journey and talked to their founders a lot, and it's been interesting just to follow them, where they were, like, really hardcore, like, semantic layer, like, it's not going to work at all without a semantic layer. And then they were kind of, and then, like, back to that comment on, like, talent, it's like, well, how many companies are at a point where
Starting point is 00:52:56 they have a mature enough warehouse, and they have all this organized into, you know, a modeling tool like dbt, and they have, like, a mature semantic layer? Like, even that number is, like, not super high. And so it's just interesting, because even they, I think, have, like, gone back and thought, like, well, what if we did kind of go back to text-to-SQL and think about, like, basically a dynamically generated, you know, semantic layer, so there's not as much, like, engineering involved in that. So I wonder how many of those,
Starting point is 00:53:26 like, reinventions will happen, like, just pragmatically, right? Where it's like, okay, this is how it should work, this is how it works best, but we're going to have to go back and reinvent pragmatically, because, like, our TAM, like, our total addressable market, is not big enough, so we need to, like, go, you know. Yeah. Yeah, I mean, I think a lot of that stuff goes back, like, to kind of, like, what we were saying about, like, the continuity, right? Like, if you have, like, a company that has been operating, like, a BI stack for a couple of years now, right? They probably have a code base of SQL that already exists there. And migrating that to, like, a semantic layer, which, by the way, the semantic layer also
Starting point is 00:54:13 needs to be populated, right? Yes, you do add there, like, an abstraction that can probably make things better, assuming that it has been curated, right? And most importantly, curated, like, someone has created it, right? That's, like, one of the reasons that, traditionally, like, the semantic layer is not something new. Like, it has been around, like, for a very long time. But it was primarily, like, an enterprise thing.
Starting point is 00:54:40 And it was an enterprise thing because the enterprise had the resources, like, to go and build and maintain these things, right? Now, can LLMs help with that? Maybe, I don't know. That's, like, something for the semantic layer people, like, to figure out. But at the end of the day, if you come to a team that already spends probably 40, 50% of their time answering requests like, hey, I'm trying to calculate this thing, do we already do it? And if yes, where? And can we update it to also add this new thing there?
Starting point is 00:55:11 Because we also have a new source that we want to track SEO coming from, related to it. And tell them, well, you can solve this if you go through, like, a six-month project to build a semantic layer and also educate the whole company that they have, like, to use whatever we put in there. Yeah, it's, like, super hard. Like, even if on paper, like, it works, you have to both change the organization's behavior and invest, like, in the technology resources that you don't already have. So it's a hard sell.
Starting point is 00:55:49 I think, in my opinion, and this is more of a product opinion, you have to fix the problems that already exist, like, what people carry from the past, and make the transition
Starting point is 00:56:00 easy. If the transition is not easy to this new world that you are promising, people won't... it's too much. Yep.
Starting point is 00:56:07 Yep. And that's, like, part of, like, why we built, like, typedef the way we did. Because if you have to educate people a lot,
Starting point is 00:56:17 you put a lot of risk in, like, what you are building. People don't have time, and you don't have the money also, like, to do it. So it has to be something that is, like, very familiar
Starting point is 00:56:29 for people, like, to use, and makes it easy. So all the decisions that we made were for familiar APIs, for both humans and machines, right? PySpark has been out there, like, for a long time.
Starting point is 00:56:41 These models have been trained on that. So, like, the API is kind of, like, known. You can go and, like, ask a model, at the end of the day, like, to build something on our framework, and it will probably succeed, like, after one or two iterations, just because of this familiarity, like, with the syntax. So we need, like, to reduce the amount of, like, effort that people have to put in, in order to migrate into these new worlds. Because, at the end of the day, like, we kind of solve the same problems, like, in a better way. But if we want, like, to make this reality happen fast, we
Starting point is 00:57:15 have, like, to help people migrate also, like, fast, right? We can't just, like, promise a dream that will take them six months of implementation before they can even, like, taste the dream. And that's what we are trying, like, to do with typedef: remove everything, as much as possible, that makes it really incompatible with what people already know. Like, the same way that you would build, like, a pipeline in the past, like, to process your data, like, you should do the same thing
Starting point is 00:57:49 using LLMs, without having to learn, like, new concepts. If we manage, like, to do that with typedef, from a product perspective, I'll call it, like, a success. Whether it's going to be a commercial success, that's, like, a different conversation. But that's kind of, like, the goal, right? Do the things you were doing in the past, but in a much, much better way,
Starting point is 00:58:09 because now, transparently, you can use LLMs to do some of the stuff that would be, like, extremely hard, like, to do before, but without compromising on how you put things into production, how you operate things, and how fast you can iterate on the problem you are trying to solve. I love it. Yeah, when you were giving me a demo earlier today,
Starting point is 00:58:33 that, I think, was actually pretty surprising. Because when we talked about what it would take to productionize this for the use case we were discussing, it just kind of, it didn't really feel that unfamiliar. Yeah. I mean, this kind of feels, you know, this feels very natural, right? Like, here are all the tables, you know, you have a pipeline set up. So yeah, that's super interesting. I didn't even really think about that. My main thought was, oh, that, like, sounds way easier than I thought it was going to sound. So hopefully that's commercial success.
Starting point is 00:59:18 Sure. Yeah. Yeah. It's on the way. 100%. And I think, like, a positive side effect of using, like, familiar paradigms is that when things go wrong, and of course things will go wrong, it will be easier for people, like, to reason about them and, like, figure out the issues and fix things. Again, I'll keep, like, kind of, I don't know, becoming, like, boring, but it is engineering at the end of the day. Like, we've been spending so much time building these best practices, these ways of, like, operating systems, operating unreliable systems in a reliable way. We just need, like, to use the same principles.
Starting point is 00:59:59 And as you said, like, put AI in there, but the AI should feel, like, almost magical. Like, it shouldn't feel like, oh, now everything that I was doing is breaking because I'm trying to use this damn new thing, and I don't know why it breaks. Yep. And I think that goes back to what you were talking about with the use case. Awesome. Well, we are at the buzzer, as we like to say. Brooks is telling us we're out of time.
Starting point is 01:00:25 'Cause I would love to have you come back on for a round two. And I want to do two things. Let's talk about some use cases that you're implementing for your customers. And then the other thing that we didn't talk about that I would love to talk about: just for me, talking with, you know, some of our larger customers, their restrictions on even using LLMs, you know, especially as it relates to certain types of data, is a huge challenge, right? And I mean, you know, in the startup, you know, like you were saying, John, okay, this person's, like, you know, just throwing SQL, you know, probably straight into GPT and, you know, sharing data, you know, whatever, right? And it's like, okay, well, you cannot do that at a large company, right? And there are, like, a lot of security, like, legitimate, you know, security concerns and other things like that.
Starting point is 01:01:16 So I'd love to cover that too, Kostas, because with the types of workloads that you're running, that's clearly a, you know, a concern. Yeah, yeah, 100%. I think, like, a lot of that stuff is being addressed. And, like, I think it's getting easier, like, to find solutions, either through using, let's say, proprietary or open source models that you run on your own, or using models, like, from the big providers but, like, in very, like, secure ways. Like, with the big ones, like, like, OpenAI and, like, all these people, this is kind of, like, a solved problem, like, at this point, I would say. And I would say
Starting point is 01:01:57 that, like, most people probably end up using open source models, not that much because of security, but more because of performance. Interesting. But that's, we can talk about that. Yeah. Okay, in round two. Love it. Thank you so much, guys. I loved it. I'm looking forward to coming back again. Yeah, we'll do it soon. The Data Stack Show is brought to you by RudderStack. Learn more at rudderstack.com.
