Disseminate: The Computer Science Research Podcast - Vikramank Singh | Panda: Performance Debugging for Databases using LLM Agents | #47

Episode Date: March 4, 2024

In this episode, Vikramank Singh introduces the Panda framework, aimed at refining Large Language Models' (LLMs) capability to address database performance issues. Vikramank elaborates on Panda's four components—Grounding, Verification, Affordance, and Feedback—illustrating how they collaborate to contextualize LLM responses and deliver actionable recommendations. By bridging the divide between technical knowledge and practical troubleshooting needs, Panda has the potential to revolutionize database debugging practices, offering a promising avenue for more effective and efficient resolution of performance challenges in database systems. Tune in to learn more!

Links: CIDR'24 Paper | Vikramank's LinkedIn

Transcript
Starting point is 00:00:00 Hello and welcome to Disseminate, a computer science research podcast. I'm your host, Jack Wardby. Quick reminder that if you do enjoy the show, please do consider supporting us through Buy Me A Coffee. It really helps us to keep making the show. Today, I'm joined by Vikramank Singh, who will be telling us everything we need to know about Panda, performance debugging for databases using LLM agents. Vikramank is on the AWS Redshift team, where he's an applied scientist. Welcome to the show, Vikramank. Hey, thanks. Thanks for having me. Good to be here.
Starting point is 00:00:54 Fantastic. The pleasure is all ours. So can you tell us a little bit more about yourself and, yeah, how you became interested in database management research? Sure. So, yeah, as you said, I'm currently working as an applied scientist here at Amazon. I'm part of this team called Redshift,
Starting point is 00:01:10 which is a data warehousing service in AWS. And I joined AWS probably three years ago. And that's my introduction to databases. So I actually had no background in databases as such.
Starting point is 00:01:25 So I did my undergrad in computer science, spent some time working at Facebook in computer vision, machine learning, software engineer, moved to Berkeley to do my master's. And when I did my master's, my research was mainly around reinforcement learning or control or decision-making systems. And then I moved to AWS.
Starting point is 00:01:44 And I joined a team which was primarily working on databases. So all the problem statements that we used to solve were related to databases, machine learning. But I think we sort of picture or pose ourselves as people who work on ML for systems. So we use machine learning to solve problems in the systems domain. So yeah, that's my introduction. That's how I got introduced to problems in databases. I initially started working with this team called RDS,
Starting point is 00:02:13 the Relational Database Services in AWS, and started working on some very interesting problems for RDS customers for about two years. And then I moved to Redshift and that's where I am. Awesome, that's fantastic. So you finally found your way to databases, to the holy grail, right? So you finally made your way, that's awesome stuff. So today we're going to be talking about LLMs and I know they're all the rage at the moment. So can you maybe kind of start off for the listener, giving us some sort of background on kind of what LLMs are and kind of, we can talk a little bit about performance debugging as well and why that's important and hard necessarily in databases.
Starting point is 00:02:52 Sure, yeah. So again, I'm no expert in LLMs and I will probably not dive too deep into what LLMs are and so on, but just to give a very high level stuff. The way I see it, LLMs are, I feel, are what we call generator models. When I say generator models, again, not to dive too deep into the technical stuff,
Starting point is 00:03:13 but they learn the distribution of the data that's trained. And when we talk about the generation process, just sampling from the distribution and sampling in some meaningful way. And when I say a meaningful way, in this case, it's an autoregressive way, where what you generate at T plus one probably depends on the past.
Starting point is 00:03:30 It can be the independent generation as well, where whatever you generate at T plus one has no relation to what you generate at T, but that's not autoregressive. So why we are interested in LLMs nowadays is because LLMs have been primarily used for languages, because they're language models. And language, by default, tend to be auto-regressive, or have a sequential behavior to them. So what you see at time t plus one is probably related to time t,
Starting point is 00:03:57 which is related to time minus one, t minus two, and so on. So I kind of look at them as auto-regressive generative modules. That's what LLMs are. And the reason they've been so fascinating is because of the use cases they've been applied to. So now, earlier we used to train these models like small corpus of data. But now that we have the right amount of resources in terms of compute and energy, we can train them on like probably a large chunk of internet. And then it turns out they can do some pretty interesting stuff so that's that's like a very uh high level very superficial way of explaining what the items are no that's fantastic because i have i have sort of i mean i've i've messed around with with chat gpt and it's fantastic and i find myself using it more
Starting point is 00:04:42 and more or um every day kind of for just various different tasks. I don't tell my mum, but for like writing the message in her birthday card, I found it very useful for that. So it's giving me inspiration anyway for me to then fine tune, shall we say. But yeah, so that was a fantastic description of kind of what they are. And that was really, really insightful. So I guess kind of with that then, so we've kind of got these LLMs and they're super useful.
Starting point is 00:05:08 How can we then apply that to performance debugging in databases? And kind of, yeah, tell us more about the performance debugging angle of your work. Yeah, I mean, that's the question that I've been asked before as well. So like, when you think of LLMs, databases are not the first thing
Starting point is 00:05:23 that you think of in your case, right? So why is there, what's the overlap between databases and more specifically debugging databases? And so let's talk about debugging databases first. And I was, again, as I said, before joining AWS, I had no clue about what data, I mean, I had to know what databases are, but not a deeper understanding of how they work
Starting point is 00:05:45 and how people actually use them in production systems. So when I started working at RDS, I saw a lot of customers, how they use their databases. And when I say customers, these are people, let's say they are database engineers in their own respective teams, in their own companies. These are database DevOpsops people a lot of development engineers who rely on the database maybe they maintain let's say tens of teams of
Starting point is 00:06:11 databases uh for their company and how how do they make sure that the database is performing uh at the level that they want so there are various tools out there that uh these people try to use or to monitor the health of the database. Now, again, the term health of the database is still not very well studied or defined. So let's say a health of database is something that you can think of like the rate at which a query executes on an average continues to remain the same or P90 or P95, whatever. So there are different metrics that you can use to measure the health of database. It can be how many number,
Starting point is 00:06:48 it can be number of connections over time. It can be query latency over time. It can be number of active sessions over time. The different definitions of what is the health of database. So you can pick any one of them. And there are tons of telemetry data. So this is a time series data.
Starting point is 00:07:03 So at every time point, you can measure the average latency of all your frames. So on Monday, you can measure Tuesday, Wednesday, and so on. So this is a time series data. And usually in production, these database engineers, they monitor tens to hundreds of these telemetry data. And this is usually what starts as a database debugging process. So when you want to understand how your database is doing, you look at the stash board, which has tons of elementary data.
Starting point is 00:07:34 And for each metric, you have some sort of threshold over time that you have set. I would feel like, okay, if my average latency is beyond this, I don't care. Things are good. And if my average latency goes above the threshold, then something bad is going on. So that's where it starts usually. And the impact of this is gigantic.
Starting point is 00:07:47 So like these dashboards that are monitored, that can have any impact on the business. Because if let's think about if that query is something, let's say you are some e-commerce company and some customer is trying to search for a product on their website and the page is taking too long to load. So the customer doesn't know what's going on in the front end, but on the back end, probably some query is taking too long
Starting point is 00:08:09 to read your entire table or something like that is happening. So your query latency shot up, the page is not loading and the customer moves on to, let's say, a rival website. So it has like significant impact on the company's revenue and business model as well. So yeah, the database debugging is important. So these database engineers, they constantly monitor these telemetry data.
Starting point is 00:08:28 And whenever they think something wrong is going on, some of their thresholds have been crossed. Then they look into why something is wrong. So they look into each telemetry data, understand why things are going wrong. And for that, they usually go about reading a lot of documentations. So now when we talk about documentations, that's where the natural language comes in. So they're like a bunch of
Starting point is 00:08:52 open source documentations, they're a bunch of really good handcrafted documentations created by AWS, if it's an AWS database. And that's usually where they start with. They start creating why on this specific database engine, if a query is running slower, what could be the possible reasons? What are some actions that they're taking? How can they fix it? Is there a parameter that they need to tune? Is there a query that they need to tune? And things like that. So it's a combination of looking at this telemetry data and creating the relevant troubleshooting documents is what I generally call as a debilitating process. And the problem with this process is, A, the telemetry is long enough,
Starting point is 00:09:31 so there are hundreds of telemetry data that you need to monitor, which is difficult for a human being. And B, the documentation is vast. So it's not just one document per metric, there will be hundreds of documents. And finding the the right document identifying the right solution from the document is not trivial so the combination of this is very complex and on top of it this needs to be done in real time and on a continuous basis so they need to monitor day in day out constantly keep doing this over and over again so the way i see llms helping them is to understand this large corpus of natural language data, which is the troubleshooting documents, and then try to help them identify the right answers. And we'll talk later about how you can combine telemetry data and troubleshooting documents together using LLMs. But this is one place where I feel LLMs can be brought into the picture and help
Starting point is 00:10:25 speed up the debugging process for databases. Awesome stuff. So just to recap there, we've kind of got on one side, we've got this whole like thousands of metrics that we're like in terms of today that was kind of hard for us to sort of kind of have all in our head at one time. And then you've got these massive amounts of sort of documentation
Starting point is 00:10:42 and then all of a sudden you've got a customer screaming at you because their query is not fast enough and they're going to lose a sale to a rival company. Exactly. Okay, yeah, I can understand the motivation here now very clearly as why this DevOps engineer would kind of want someone else to help them sort of kind of solve this problem quickly. So I guess that's a nice sort of segue into Panda. And yeah, so give us the high-level elevator pitch for Panda then. How is this going to solve my life better as a DevOps sort of DBA, someone running one of these managed database systems?
Starting point is 00:11:14 Sure, sure. Interesting. So let me think of what's the elevator pitch for Panda. So as I explained in the column, that's exactly what Panda is designed to solve. So when we thought of Panda, the goal of Panda is to answer this very specific question. And the question that we try to answer is, what are some of the essential building blocks that we need in order to safely deploy any Blackbox large language model for debugging databases in production. And the goal of the recommendation or the goal of output of this Panda is to make sure it generates accurate,
Starting point is 00:11:50 verifiable, actionable, and useful recommendations. So sure, you can put anything inside a language model that can generate some stuff that does make sense. But is it useful? Is it verifiable? Is it accurate? These are the questions that we want to answer with Panda. And that's like why we want to build Panda
Starting point is 00:12:07 and what exactly Panda is. Panda is a service or is a framework that combines the power of telemetry and documentation using language models. That's what Panda is. So it combines information. It's able to extract information from telemetry data. It's able to extract information from telemetry data. It's able to extract information from documentation. And then it's able
Starting point is 00:12:29 to combine those two pieces of information together to generate. That's what Panda is. So yeah, before we go any further, why Panda? Yeah, yeah. I'm sure I didn't spend too much time thinking about the name. And when I wrote this long sentence, sure I didn't spend too much time thinking about the name. And when I wrote this long sentence, I didn't have an acronym for it. So I just wrote what this thing does. And what it does was it was performance-stable for databases using language models, so large language model agents.
Starting point is 00:12:58 And when I started looking at this large sentence, I was thinking of one word description. I just picked random stuff from the sentence and made this word called panda and i felt like panda is a word that usually people are aware of in the computer science domain because of this and that library in python uh so it felt it came on naturally to me so i just picked that name yeah i like that yeah no i like that also as well because like i guess when i kind of said like panda the kind of cuddly the like and it's like something that's going to help you. And I'm struggling to debug this problem. I need to go and cuddle my Panda and he's going to help.
Starting point is 00:13:30 I don't know. I think it was something like that as well. Anyway. Awesome. Yeah, it's been interesting. Yeah. Yeah. That's kind of what I was thinking about it.
Starting point is 00:13:38 But anyway. Cool. So you listed the four properties there of kind of what you want this LLM sort of debugging database agent to have. So kind of taking those sort of design goals and the kind of the sort of the thing you wanted to build. How did you go about sort of like what's the architecture of Panda? How did you sort of go about realizing this sort of goal? Sure. Yeah.
Starting point is 00:14:07 So, yeah, the way started of thinking about this solution was okay we had one thing clear that uh we know from our from our domain experts we know usually what database engineers think about or what they what's the process what the process looks like when uh let's say, a database engineer starts debugging our problem. So we try to write that process down. That's what our starting point was. So what did they do? So, okay, we looked at, okay, they look at a bunch of telemetry data.
Starting point is 00:14:37 That seems important. Once they start looking at telemetry data, they have some sort of knowledge about what each metric means and how different metrics connect with it. So that list of information is relevant. Second, third, what do they do after that? So let's say they identified five out of 100 metrics that do seem anomalous. What do they do after that? So then you realize what they do is for each of those metrics, they look at their documentations.
Starting point is 00:15:04 So they have some predefined set of rich documentation, and they start looking at it. So that's the third thing we found. So our model needs to look back. Then when they start going through each of those documentations, they try to find things like what are some usual cases where these metrics become anomalous. So what are some usual uh cases where uh these metrics become anomalous so what are the what are some usual uh problems what are some mutual root causes for this problem once they identify that uh they
Starting point is 00:15:33 try to find what are some fixes for these solutions which are usually in the same documentations and once they find those fixes then they try to think which of these are feasible given the time span that i have given the given the time span that I have, given the condition that my database isn't right now. Is it feasible for me to, let's say, restart the database and tune a parameter, like change the value parameter? Or maybe it's not possible for me because of the production data. I can't switch the database off and restart it again.
Starting point is 00:15:59 Okay, so next solution, next solution, things like that. And they finally apply this fix on the database. So they're like multiple steps that we followed the database engineer of how they try to fix the problem and then replicate that exactly with a language model. So that was the design of thinking phase of how the system should eventually look like to sum it up at a very high level there are four key design components first is what we call grounding so we needed a module in Panda which we call this grounding component or grounding module now what this grounding module does is we know
Starting point is 00:16:43 that the telemetry is very important. You can pick any language model like GPT or LAMA or any language model and ask it a question about database and natural language and it will generate a bunch of recommendations. So it's not that we are training a language
Starting point is 00:17:00 model, that's not the goal. But the goal is to ground the language model more by giving it the right context of the database. So how do we do that? And what we found is that the database engineers, they use telemetry to do that.
Starting point is 00:17:14 So how do we connect? And language models are not trained on telemetry data. They're not trained on these large metrics. So how do you combine the telemetry data with language models
Starting point is 00:17:24 such that the language models can now understand the context of what the database is experiencing right now to generate answers? That's what we call grounding. So we needed one component called grounding. The second component we called was verification. So we believe the system should be able to verify the generated answers using some relevant sources and produce citations along with it. So the end user can eventually verify what the output is, where the output is coming from. Now, this is easier for humans because when database engineers, they try to fix the problem, they read these documentations and they know exactly the source of this documentation so they know exactly when
Starting point is 00:18:05 they don't know the whole the person who wrote it but they don't know the organization where the document is coming from there is some sort of trust behind the truthfulness of the documentation so you want to build that into the system third is what we call affordance so what do we believe is that if the recommendation that the Panda provided is actually true, the system should be able to estimate and inform the user about the consequence. This is also very important. So if let's say Panda says increase your number of CPUs from 16 to 32, what's the consequence of that action?
Starting point is 00:18:41 The consequence could be increasing cost. The consequence could be your query latency could go down. So we want the user to know what will happen, what's the counterfactual, or what will happen if they do this, if they do that, before applying that recommendation. And fourth, and the last component, is feed.
Starting point is 00:18:58 So we believe the system should be able to accept feedback from the user and improve over time. So if Panda is running on a database, there's a database engineer that says, okay, you sent this last time, I applied the documentation, can fix my issue. So you want the system to take that into account the next time it generates an answer. So these are like the four key principles, grounding, verification, affordance, and feedback that kind of build the Panda. Awesome. So grounding, verification, affordance, and feedback that kind of build the contract.
Starting point is 00:19:26 Awesome, yeah. So grounding, verification, affordance, feedback, kind of just running through them one by one then in a bit, be a little bit more depth. So on the grounding thing, because when you were speaking, I know you was kind of talking about these, these LLMs are great for sort of
Starting point is 00:19:37 kind of large natural, like large corpuses of natural language. Then, but then we kind of want to combine this with these metrics. And I was thinking about the time series, like, hang hang on a minute here how do you kind of resolve this fact that these things expect one type of data and then you've got this massive amounts of different type of data that's really important so how did you go about sort of bringing those two things together in the grounding that's yeah yeah exactly and that's what i think is one of the most uh
Starting point is 00:20:03 interesting contributions of Panda. And there's some there's some this area is, again, very, very active in terms of research. You'll find new interesting papers and ideas coming up every other day where people are trying to experiment with numbers, experiment with math and ellipse. So I forgot the name of the exact paper, but there's some recent paper that came from, if I'm not wrong, Google or Stanford, somewhere I forgot the name. But they showed that you can actually input the raw telemetry data in the prompt,
Starting point is 00:20:40 and it will generate statistical answers about the metrics. If I give it, let's say, a month-long numbers of data, which timestamps and some value, and ask, let's say, what was my average query latency on Monday? It's actually able to infer what Monday means from the data, extract the numbers, and sum it up and give you the answers. So there's some very interesting emergent behaviors coming from LLM, which proves that these are not just
Starting point is 00:21:10 random word generators. They're learning something much more interesting and fundamental underneath them. And that's a whole different concept people are having where are these language models actually learning a model of the world are they learning something more sophisticated underneath them that we don't know yet uh but we
Starting point is 00:21:31 feel like they're just generating next word but they're not just generating next word how are they generating the next word is very important that's a whole different story but yeah so coming back to this telemetry so yeah there's some very interesting work on how you can combine telemetry. So, yeah, there's some very interesting work on how you can combine telemetry with text, with language models. And what we did in our case, let me talk about that, is what we did is we took all the telemetry data. So, you have a database.
Starting point is 00:21:56 For that database, when a customer asks a question, what we can do is we have simple tools that can go and extract the telemetry data from the database. We don't need LLMs to do that. In the paper, we have a very complex architecture of the entire framework and there's one small component inside what we call grounding mechanism. Inside that grounding mechanism, there is one small component called feature extractors.
Starting point is 00:22:21 For those feature extractors, what they do is, there's no LLM involved yet. what they do is there's no lm involved yet what they do is for a given database uh they connect the database itself and extract all the relevant telemetry metrics when i say relevant they will involve them they extract all the telemetry metrics so for example there could be database parameters there could be configuration knobs sequence statistics all the active sessions, all the database founders and everything for the last seven days. So they extracted all this elementary data. Now, again, before involving language models, what we do is we run these statistical algorithms
Starting point is 00:22:52 on top of the metrics. So what we want is, we went back to the human. So we looked at how humans look at these numbers, look at these elementary data. These humans, although they don't look at everything, they look at metrics that behave anomalously. They look at metrics which behave differently than usual. So identifying what metric, when a metric is anomalous,
Starting point is 00:23:15 is not something we want LLM to do. It can be done very well with simple statistical algorithms like outlier detections or change point detection or segmentation algorithms and so on. So we did exactly that. So once we extract these elementary data, we ran some statistical algorithms to identify what are the metric that are anomalies
Starting point is 00:23:33 that are different than the usual behavior and how are they different? Are they spiked up? Are they spiked down? Are they level shift? What are the point anomalies, segment anomalies and so on so we expect all these features using statistical algorithms and then we have a separate module that converts
Starting point is 00:23:52 these features into one line summary and that's where elements come into picture so we we uh give the metric name we give the definition of this metric and we give the type of anomaly the metric has and ask the l which ended a one-line summary of this. We would say something like the metric called, pick some metric, number of rows. So a metric called number of rows for the SQL query has a spike in its last seven days and with a value of, let's say, 120,
Starting point is 00:24:21 whereas on average, the value seen for this metric on this database is 50. So there's like one summary. And we repeat this for all the metrics, and this gives us like a small prompt of what the anomalous metrics are, how are they anomalous, and why are they anomalous. And we put this prompt together, and then
Starting point is 00:24:38 we combine it with the rest of the prompt that we generate. That's where, that's how we convert the telemetry. Ah, so yeah yeah that makes total sense is a kind of a conversion where you kind of you go ahead and you kind of give it to the llm in a format that it can it can handle and awesome stuff cool so yeah i guess kind of being the next step with the verification when he was talking about that kind of this idea of the provenance of where the information's come from and kind of because there's this sort of a i mean i can go and chat
Starting point is 00:25:04 gpt and it can it can gap it can give me some garbage citations for something right and it can be like and it looks really like realistic and it looks great it looks like if i go on google scholar that the paper don't exist so yeah how did you kind of how did how did you counteract that sort of thing then because i guess it would be lms would be subject to the same sort of thing the ones you used so yeah how did you tackle that exactly that's uh that's exactly the problem with lms right so uh if you ask llm itself to generate a citation for the answer the recommendation that it made it will definitely give you a citation but most of the time the citation would be just a hypothetical citation or something that's garbage
Starting point is 00:25:43 and in fact there's a recent study uh we can link the paper in the description, but there's a recent study that showed that only 50%, 51.5% of LLMs generated answers truly support their citations. And this completely undermines the trustworthiness of the real world. And this is where these things become questionable of being useful in production or in the real world. I mean, they all sound very interesting and fascinating for research or for demos,
Starting point is 00:26:12 but then we can't reliably use them on a continuous basis in the real world if things like these happen. So this was like a challenging component to us as well. Now, what's the extreme end? So the best possible thing you can do is have a human verify everything. That's like the golden truth.
Starting point is 00:26:35 And it's possible, but it's like highly unscalable and very costly, right? So you can't have a human verify every single generated answers because it's yeah it's both tiring cost uh cost it's not cost effective and you can't have as many humans there are exports in databases to verify everything so how do we do that so uh we did we we did something that was in between inside the verification mechanism we have two things uh, two components, two subcomponents.
Starting point is 00:27:06 One component is what we call answer verification, and one is what we call source attribution. So for answer verification, we use natural language model itself. So what we do is we frame this problem as something, again, in literature, it's called natural language inference, NLI, as a natural language inference and a live as a natural language inference task where we reuse the pre-trained element to act as a verifier and produce a label as accept, reject or neutral given a hypothesis. And when I say hypothesis, hypothesis is the generated answer. So what we do is we give the generated answer to the element and then we also give a premise. Now premise is all the relevant troubleshooting documents.
Starting point is 00:27:46 So what we do is we basically ask Ellen if this is the answer and this is the context, what do you think does the answer come from this context or not? And give me an answer in terms of yes, no, or maybe, or accept, reject, or neutral. So what we expect the LLM to do is to look at the context
Starting point is 00:28:03 and try to see if there's an interesting overlap between the answer and the context. Now, this can be done without LLM as well. You can just do some very simple textual mapping where you can see how many words are common between the answer and the context. But the problem with that is that the answer that we are giving to the LLM is actually generated by the LLM. And LLMs are known to paraphrase or rephrase. So simple textual mapping might not be the very best solution because that would always give you
Starting point is 00:28:32 a very small amount of overlap between the generated answer and the context because the answer could be rephrased, could be paraphrased, and so on. So the goal of this natural language inference problem is to make the LM look at the context, look at the generated answer, and tell us if the generated answer
Starting point is 00:28:51 is coming from the context or not. And again, sure, this process is not optimum. I mean, it's not the best process because again, in some sense, you're asking the student to correct their own answer people, right? So you're saying, okay, you're asking the student to correct their own answer. So you're saying, okay, you wrote the answer. Now here is the true answer corrected.
Starting point is 00:29:11 And you can cheat. So there is a possibility of checking, which we're aware of. And one way of mitigating it is to make it do multiple times. So we try to randomize it. So instead of asking it only once, we ask it three times and try to then take an average of what the response is.
Starting point is 00:29:30 So it may cheat once, it may cheat twice, but we're expecting it to not cheat N number of times. And that N is high parameter that we can do. You can increase that number of repetitions as many times as possible. So yeah, this process is not perfect,
Starting point is 00:29:45 but it's something that kind of works in reality as of now. And once this answer is verified in some sense, where LLM feels that it is indeed representing, the generated answer is representing the true context, then we move to source attribution. And source attribution is something simpler because source attribution is now, we want to cite
Starting point is 00:30:05 the lines that are there in generated answer. And again, we kind of follow the same process. So source attribution can again be done
Starting point is 00:30:12 in two steps, two ways. One is you can go line by line in generated answer and give that line to a lem and ask
Starting point is 00:30:19 which exact paragraph in the context is this line taken from. And it can give the exact paragraph and it will cite the paragraph number or page number from the documentation. That's one way.
Starting point is 00:30:31 And second is exact text overlapping. So if you see the exact verbatim sentence taken from the doc, you can just pick the doc number, page number and cite it there. So use a combination of both and that will give you the source attribution and then move forward.
Starting point is 00:30:46 So yeah, these are the two components in verification. Awesome. So yeah, that whole kind of thing about it kind of not treating every time
Starting point is 00:30:53 is interesting. And I mean, we'll probably talk about this, the best end to choose when we talk about the implementation and your evaluation potentially.
Starting point is 00:31:03 But I mean, is there not a worry that sort of the more you do it it gets better at cheating and it cheats more the more you run it is that sort of a concern or is it is each kind of run sort of independent of the previous run yeah so we make sure that we don't run this in in the same context or in the in the sense that we don't we don't want to run the like each time we ask the same question it's not in the same chat or in the same sequence so we wipe it memory essentially every time and say do it again okay so it doesn't think ah i cheated last time i'm gonna shoot again yeah right okay cool you can even you can even play with the uh
Starting point is 00:31:42 that that subtle thing sense you can even play with the subtle things. You can even play with the different parameters in the language model that you can tune. So for example, you can vary the quotient of innovation in the model or the quotient of the rate, the temperature of, what's it, what I'm blanking out on that word, but I think when you want the model to be stricter or be more creative, you can tune these models to not be very creative where they they
Starting point is 00:32:11 try to generate something that's so if you in more technical terms you want the model to generate samples that are very highly likely from the distribution and not out of distribution so if the model is more creative it can generate things that it is slightly less trained on but if the model is less creative it will generate something that is exactly what it is trained on so when you answer verification we want the model to be extremely less creative so you don't want them to think or wander in the areas where they're not trained to wander in and that would prevent them from being so cheating is is if you think of a cheating is a creative mechanism they're trying to be very creative there and that's why they're cheating
Starting point is 00:32:50 so you want to control the the level of creativity in these models and if you can control that we can somehow push them to not cheat it's funny it's about cheating being a very creative endeavor i mean you hear these stories don't you people kind of going out to extreme lengths to get exam answers and stuff like i think you may just revise for it because you put enough like you've put a lot more work into be a good cheat right so yeah but anyway cool so yeah affordance let's talk about affordance so how do you go about kind of taking this and turning it into this is what the consequences would be if you did this in practice because obviously the the how do you put a cost on it in terms of like the financial impact or the performance impact yeah this is a big space it's often quite hard
Starting point is 00:33:35 to know what the consequences of an action will be is so yeah how did you go about sort of breaking that down and estimate the impact of something yeah Yeah, correct. And this, I would say, is the most controversial, the most raw component in the system is this mechanism. Because I would say we've not even thought of this component really well.
Starting point is 00:33:58 We're still at a very early stage. We're still thinking what this affordance mechanism could encompass. So, for example, we don't even know what an affordance mechanism could encompass. So like, for example, we want to, we don't even know what an affordance means here. For example,
Starting point is 00:34:09 if the model generates something like add an index on this table, that's a recommendation. How would you go about estimating the impact of adding an index to let's say something on query latency?
Starting point is 00:34:22 It's very difficult for, like forget about LLM doing this, it's very difficult to even build a statistical model, a database model to do that. What's the impact on P50 or P90 of your query latency if you add an index on this table? It's very difficult to build that statistical database model. So it's not clear at all how can LLM do this.
Starting point is 00:34:44 So for now, what we have in a for-else mechanism So it's not clear at all how can LLM do this. So for now, what we have in a for instance mechanism is it's extremely heavily guarded with a bunch of guardrails. So what we do, we start off with something in the answer generation. So when the model generates an answer, we force it to be actionable, to generate an answer that is actionable. So now what does actionable mean? And actionable mean we wanted to generate something
Starting point is 00:35:07 with respect to a parameter that can be tuned. So it can't say something like tune your number of CPUs. That's not actionable. Sure, it is actionable in the sense that you can tune it, but what to exactly? So we wanted to generate things like, since we are telling the system that my current number of CPUs is 16, I don't want to tell me increase the CPU.
Starting point is 00:35:29 I want you to tell me increase to what? So increase from 16 to 32, or increase from 16 to 64. So don't have the kind of answers where we force the model to generate. So you always want, you always force the model to generate something very specific
Starting point is 00:35:44 that can then be converted into a statistical equation mathematical formula and next and we can estimate the impact so now when i say impact estimation we always what as of now we always want to drive the estimate the impact estimate towards a specific metric and right now there are a bunch of methods that we care about we care about the average query latency, and we care about the average number of active sessions in the system at any point in time. So we want to estimate the impact of these actions in terms of these two metrics.
Starting point is 00:36:17 Now, as you say, how do you settle on those two metrics? What was the decision there of how to actually pick those two out? Because there's a lot of things, right? Correct, correct, correct. Yes, again, the choice is, I would say still a design choice here. We designed the system based on how we have seen
Starting point is 00:36:33 the RDS engineers or the customers think about their system. And usually what we have seen is people care about the metrics, like what's the average query latency and people really care about what are the sessions that are active and people really care about what are the sessions
Starting point is 00:36:45 that are active and what are they waiting for. So how many such are they waiting and what are they exactly waiting for? So that metric seems to be pretty simple, but pretty impactful when people monitor their performance. So we picked these two metrics, but yeah, these can be any metrics. Now, the impact estimation model
Starting point is 00:37:02 is a simple statistical model. But think of it as a function which takes as input the parameters and output the value of your query latency. So these are like simple statistical models that are trained on the field data. Now, there are two ways you can train this model. These models can be customer specific or they can be field level model when i say customer specific what i mean is that uh these and when i say model you can think of it as a simple neural model it can be a regression model uh any any sort of model random forest decision tree whatever so uh this regression model is trained on when it's a customer specific it's straight only on this customer's data on which we're running Panda.
Starting point is 00:37:47 So if Panda is being run on your database, it's only trained on your database. And the goal for that is to make sure that the regression model is trained on the metrics that's coming out of just your database. So how is your query latency changing with respect to CPUs? You just want to model that. That is customer-specific models. We can also have fleet-level models where we can, instead of estimating how query latency changes
Starting point is 00:38:16 with respect to CPU on-board database, we estimate it across the fleet, across all the customers. So that model will give you an average estimate. So like on an average, how does query latency is impacted if you increase CPU? But on an average, how is query latency impacted if you decrease CPU or if you, things like that. So that's what we call fleet models and customer specific models. And we have simple models on both these fronts that we try to plug into Panda and estimate the impact. But again again these are very very uh early stage right now where you don't have impactful uh models there yet so we are still trying to play around with simple regression models and trying to estimate impact
Starting point is 00:38:56 nice nice yeah it's obviously still sort of uh an early sort of project right in terms of its its life yeah it's like exactly but yeah it's not going to be the finished polished article yeah so no that's that's that's cool so yeah i guess there's one more one more component then and this is this the idea of feedback and getting the the model to improve over time so uh how does that kind of component look at the in the current iteration of panda sure uh so in fact this is the very first mechanism that the models, the input sees. So like when you query Panda with the question, the question is first sent to the feedback mechanism. And the goal of that is to think of feedback mechanism as a simple database that stores questions, answers and feedback over time. So whenever a final answer is sent back to the customer, customer is asked to give a feedback in form of plus one or minus one,
Starting point is 00:39:51 like thumbs up or thumbs down. And that feedback is again sent and saved in this database. So if let's say you've been using Panda for over a week now, and let's say you have seen about 100 questions, every one of those questions is stored in this database and this database is think of it as simple as like three columns and row and column database and you have three columns first column is your question second column is the answer that panda generated and third column is feedback which comes up in terms
Starting point is 00:40:19 now uh if you ask a new question to panda Panda will take that question and go back to this database and look for what's this exact question or a similar question being asked before. And if it does find a match in the database, then it will look at when was that question asked because the tell-tale aspect is important. Like if I ask something like one month ago, it's probably not relevant anymore. So we keep that window to one day. So if we find an exact match or a similar match between your new question and an existing question in that one-day window, you'll probably retrieve the answer
Starting point is 00:40:57 and send it back to the customer without going through the entire generation process, which might be a high-latency process. So that's like one use case of feedback can, where we can generate answers faster and they can be repetitive. They can be similar to what you've been asked before in the same one day window. And if it is, we'll surface the same answer. And then we will tell customer that this is an answer from the question that you asked
Starting point is 00:41:18 in the past. If this is not relevant, you can re-ask the question and mention that in the prompt, that you gave me an answer, I didn't like it, and generate a new one for me. And that would bypass the feedback with Ansible altogether. Ah, fascinating. It's kind of like a mini stack overflow, essentially, kind of in the whole thing as well.
Starting point is 00:41:36 So cool, yeah. So we've covered off all the components there. And this, I mean, there's a lot of various, a lot of moving parts to this system so can you can you tell us about the implementation briefly maybe and like how do all these things fit together what does the actual code look like and yeah what's the user interface like as well actually like is it simple is it just a bar that i enter text in yes tell us a little bit more about the implementation sure sure yeah uh so again the way we coded Panda up was not,
Starting point is 00:42:05 it was a very experimental coding in the sense that it was, it started off as a very small project, a very experimental project. So the code base that we have is not very gigantic code base. Everything is well thought of and written in a way that's all fail safe. It's still very experimental, very raw, very early stage right now. So that's that. But the interface of the user is kept to be very simple. So interface is as simple as a chat interface.
Starting point is 00:42:39 So everything is done in the backend. So once you've triggered Panda, it connects to your database. So again, in order to connect to the database, you need to give it the credentials of your database. That's a separate story. But once Panda is running, the interface is as simple as the chat interface. There's a question, there's a bot,
Starting point is 00:42:56 and there's your answer bot. So you keep asking questions, they answer each and every answer. It has its own citations. And if it has had, if the answer has an impact
Starting point is 00:43:08 estimation, we'll have it at the right point. Because not every answer can be, not every answer's impact can be
Starting point is 00:43:14 estimated. So the impact estimation won't be there for every answer. If we could find a relevant impact,
Starting point is 00:43:20 we can add it in the answer at the right point. The interface is very simple, very, it's very simple, simple like a chatbot like a chatbot yes interface we're all kind of familiar with yeah yeah cool so i mean let's talk some results then so you evaluate you've evaluated um so let's we can talk about
Starting point is 00:43:38 that as well how we went actually about approaching to all evaluating it and then yeah let's talk some results as well so let's start off with like how you went about evaluating it first then we can talk about the results sure uh so i think we discussed this at some point in our chat today where uh the the golden rule the highest level we can uphold panda is to is with the human evaluation right so again just the one thing we need to clarify first what what exactly are we evaluating? Are we saying, what's the standard we are holding Panda? Are we saying Panda is something that can generate answers better than human?
Starting point is 00:44:15 No. What are we saying is that Panda can generate answers better than an existing language model. That's the bar we want to test Panda on. And the reason we want to clarify that is because we don't want to compare the Panda answers to a human answer. We can't compare Panda to an existing database engineer and say, okay, this guy can beat this guy.
Starting point is 00:44:37 No. What we want to do is we want to compare Panda with an existing language model, which in this case was GPT-4. And we want to say that since Panda has all these four components, affordance, feedback, grounding, and verification, with all these four components, Panda can perform better as compared to language.
Starting point is 00:44:57 So that's a goal. So how we started to set up the experiment is since we don't have any ground truth label for any recommendation we do use humans to evaluate the answer so what we do is we pick three different humans with three different level of database knowledge a beginner and intermediate and advanced and we show them we we generate uh we picked 50 prompts uh 25 from a Postgres engine, 25 from a MySQL engine, and all these prompts
Starting point is 00:45:27 were engineered in a way that we have been seeing more commonly in how customers ask questions about their databases. So customers' questions are usually not very detailed. They're always very, I mean, in some sense, it's correct because they don't have,
Starting point is 00:45:43 not all users have a good understanding of what's going on in the database. So they usually ask questions which could be very generic, very high level, and they expect the system to infer all the element stuff on its own and generate an answer. So we intentionally kept the questions to be very high level and very short to see how LLM responds to those questions.
Starting point is 00:46:03 So we came with 50 prompts 25 uh posters 20 for my sql and then we generated the answer using panda and using gpt4 and we showed these two answers to these three evaluators and asked them to rank uh or score the two responses on three aspects. And those three aspects were, first was trust, second was you understanding, and third was usefulness. So trust as in, if you read these two answers, which one do you
Starting point is 00:46:36 generally tend to trust more? And second is, if you read these two answers, which one do you understand better? And third was, if you read these two answers, which one do you feel is more useful to you? And they're asked like your thumbs up and thumbs down on each of these three aspects. We average their scores.
Starting point is 00:46:53 And then that's the response table that we generate in this paper across these 50 evaluations. Awesome stuff. So yeah, I guess, yeah. What's the big reveal? Tell us, how did Panda do? Panda was found to be often, so let's go dimension by dimension. So on the dimension of trust,
Starting point is 00:47:13 experts found Panda to be at more than 90% of time a better candidate than GPT. And the reason for that was, I think, is the citation, which we also mentioned in the paper, that with every response, Panda generates citations, which is usually an indication of trust where the customer can now go, or the user can now go and look at a citation, go to that very link, very documentation, and read exactly in detail if they have any questions around the recommendation. So that kind of generates more trust. So for more than 90% of times, Panda was rated higher on the trust.
Starting point is 00:47:49 The interesting thing was understanding. So when we asked the scorers to label, try to score based on understanding which of the answer they understood more, the beginner scorer was around 40-60%. So they rated Panda 60% but GPT 40%. So there's not that big of a gap. And the reason for that was GPT answers, if you look for it,
Starting point is 00:48:15 they're highly verbose. It will probably generate like 100, 300, 400 word passage for you. And for expert or intermediate people who know exactly what they're looking for, this kind of highly verbose answer is a waste of time for them. They want something very specific, very actionable. But when you look at it from a beginner's lens,
Starting point is 00:48:36 reading their entire paragraph does make sense for them because now they kind of understand the problem much more better and they can ask follow-up questions. So that's where we found that beginner people, where in some cases liking uh the responses from gpt better because panda was being a bit too specific in that sense okay they said that uh tune parameter x now what
Starting point is 00:48:56 if they don't know what parameter x is or what if they don't yeah the pan they've never heard of parameter x so they need some more context somelevel stuff, and then probably narrowing down to the parameter, which GPT was able... So GPT-4 never gave the parameter name or the value, but gave a very good, verbose explanation of what are the possible problems that customer could face with respect to those details, and they found it to be more understandable.
Starting point is 00:49:24 So just real quick, do you think that could be something you could incorporate as part of the model kind of kind of the person asking the question could definitely be like hey i'm a beginner i'm an expert blah blah blah and that sort of like would definitely in um influence the verbosity of the of the output but yeah exactly exactly now that's exactly what we thought about in the experiment so uh we wanted to incorporate we want to incorporate that in the system where when you instantiate Panda on your database, you also want to assign a role for the user who is using Panda. So you want to assign the role in the sense like, who are you and what do we expect from
Starting point is 00:49:57 Panda? And Panda would take that into account in generating answers. If your role is a beginner, it would be very well-posed and it will have a lot of introductory stuff before coming down to the actual actionable details. So yeah, that was very useful in the experiment as well. And I think the third dimension was usefulness. On usefulness, again, we found the intermediate and advanced folks to have at least more than 90% of approval rate for Panda because they found it to be more actionable, very specific with the exact parameter names exact
Starting point is 00:50:25 parameter values and so on whereas gpt was again very highly verbose or very very very generic so yeah useful the useful dimension was similar to trust dimension but the interesting one was understanding which had something for us to learn from yeah for sure i mean it feels like it is a clear indication here of sort of the youthness and the efficacy of Panda kind of going forward. So I guess, where do you go next? And what's the next on the research end? Obviously, there's things to make it production grade and a lot of polishing to get everything kind of stitched together and working well. But yeah, what are your next steps? so i think uh it's it's panda is very far away from production i would say it's still very uh it's taking still taking baby steps it's still in the place where we are still trying to falsify hypotheses around language models so it's still very experimental right now i would say and there are a lot of interesting areas that we can exploit or we can take panda into so for example the first thing that we found in our analysis was Panda is,
Starting point is 00:51:28 so you can chat with Panda, but the process of debugging is very incremented. So debugging is never one shot. It's multi-shot, right? And the right answer of why a problem has occurred, the root cause could be multiple. There's not always only one root cause why a problem has occurred, the root cause could be multiple. There's not always only one root cause for a problem. So sure, Panda generates something that may be right,
Starting point is 00:51:51 but it's not the only right answer. What if something else happens? So let's say you're seeing an increase in your query latency and Panda said, okay, you have an increase in query latency. It's because you didn't have an index on this column in the table. Sure, the customer didn't have an index on that column in the statement. Sure, the customer didn't have an index on that column in the statement, but is that the root cause?
Starting point is 00:52:10 Or what if they're just increasing workload at that time? Or what if it's something related to locking? So there could be more than one answer to a question. And we want Pandas to be more human-like in that sense. So instead of like jumping to conclusions, we want Pandas to be iterative andlike in that sense. So instead of like jumping to conclusions, we want Panda to be iterator and think in all different directions. So in some sense,
Starting point is 00:52:30 we want the Panda to generate a hypothesis tree where each branch is one root cause and then go on and evaluate each branch and then generate a recommendation which is not just one single recommendation, just forcing you to take something or believe it believe but it should give you more like an analysis of all things possible and let the customer pick which one they feel is right yeah that's fascinating because yeah in practice these problems there can be layers upon layers and has there been sort of um kind of did you encounter
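Here is an illustrative sketch of that hypothesis-tree idea. The Hypothesis structure and the hard-coded evidence scores are made up; in practice the scoring of each branch would come from telemetry and the model, not fixed numbers.

```python
# An illustrative sketch of a hypothesis tree: each branch is a candidate
# root cause, scored and ranked instead of committing to a single answer.
# The structure and the hard-coded scores are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    cause: str                   # candidate root cause, e.g. "missing index"
    evidence_score: float = 0.0  # how well the telemetry supports it (0..1)
    children: list = field(default_factory=list)

def rank_causes(symptom: Hypothesis) -> list:
    """Collect every candidate cause under the symptom and rank them,
    so the customer sees all plausible explanations, not just one."""
    causes, stack = [], list(symptom.children)
    while stack:
        node = stack.pop()
        causes.append(node)
        stack.extend(node.children)
    return sorted(causes, key=lambda h: h.evidence_score, reverse=True)

# One observed symptom, several competing explanations.
latency_spike = Hypothesis("query latency increase", children=[
    Hypothesis("missing index on the filtered column", 0.6),
    Hypothesis("workload increase in the same window", 0.8),
    Hypothesis("lock contention", 0.3),
])

for h in rank_causes(latency_spike):
    print(f"{h.evidence_score:.1f}  {h.cause}")
```

The output is a ranked analysis of possible causes rather than a single recommendation, which matches the iterative, multi-shot debugging style described above.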
Starting point is 00:53:01 Yeah, that's fascinating, because in practice these problems can be layers upon layers. Did you encounter a case where you solve one problem but inadvertently cause another problem by solving that one, if that makes sense? Like, I don't know, you add the index and then that causes something else to blow up. How would you... Exactly. Yeah, exactly. That's exactly the kind of problem that Panda would fail at, right?
Starting point is 00:53:18 The answer that Panda generates is very confident, but it has no clue what that could trigger later on. Sure, we have the affordance component, which is the place where you can estimate the impact, and that's the goal of that component, exactly to solve this: if you recommend something, we want to know what the impact is. Is it going to trigger something else, or is it going to affect something else that people do care about? Things like that. At this point, Panda is not
Starting point is 00:53:50 able to solve these kinds of problems, where down the line things could break because of the recommendation that it made today, and we want to fix that before even thinking of production. Before unleashing it in the wild. Cool. Awesome. Still, though, this feels like it could have a really big impact going forward. So I guess, can
Starting point is 00:54:12 you elaborate on that a little bit? What impact do you think Panda could have in the future? Yeah, sure. The place where I feel Panda would have the most impact is reducing the amount of time database engineers or DevOps folks spend on debugging databases. For example, from the scenarios that we have seen in the real world with RDS customers,
Starting point is 00:54:38 we've seen customers spend about 10, 15, 20 minutes, even several hours, debugging one single problem. And the goal is that if, let's say, a customer on average spends 45 minutes debugging a problem, can Panda bring it down to, let's say, 5 or 10 minutes?
Starting point is 00:54:53 Saving that 15, 20 minutes, or half an hour, worth of effort on one single problem is significant, because it multiplies really fast. If a customer spends time debugging hundreds of problems a week, that's multiple hours' worth of their effort on these problems. And the interesting part there is that Panda's goal is not to replace the database engineer.
Starting point is 00:55:20 It's to assist them. And that distinction is very important, because once you make that distinction, the bar at which you qualify Panda as useful significantly lowers. What I mean is, you don't want Panda to be absolutely accurate or 100% perfect all the time. You want it to be accurate most of the time, because eventually the final action is to be made by the human. You want Panda to be accurate most of the time so that it can help the human narrow down the space and find the right fix immediately. If we were to say that Panda is to replace the database engineer, then the bar you would hold it to is extremely high, and
Starting point is 00:56:07 you might never reach that in a few years, or whatever. And that would kind
Starting point is 00:56:11 of make the whole thing moot, because then there are certain things that you can
Starting point is 00:56:17 never fix, or at least not fix in the near future, and this kind of work would never be
Starting point is 00:56:20 useful or made available to the public. But I think the fact that we are thinking of it more as an assistant rather than a replacement lowers the bar and allows for more creative and aggressive things to be done in this space. Yeah, it allows it to become
Starting point is 00:56:38 practical, right? It allows you to keep the human in the loop, allows it to augment the human, make the human more efficient. And then, obviously, once you're using it in practice, you start to gain insights, you learn things about it that maybe you wouldn't have thought of, rather than waiting for this sort of utopian future where we have this perfect human equivalent, right? So yeah, I definitely agree with you on that one. Cool. I mean, I don't know how long you've been working on this project. You said it's pretty early days still? It's very early, yeah, it's super early. I started working on this probably a year ago. Okay, cool, so about a year. Cool. So across that year, what's the most interesting thing you've learned
Starting point is 00:57:21 while working on Panda? What's been the biggest surprise? I think the biggest surprise was how much improvement you can make in... okay, before going there, I think one thing that I missed out in our conversation is what actual LLM Panda is using, because it is using an LLM. Yeah. And we never talked about what that LLM is. I've mentioned this in the paper, but we use GPT-3.5, which is an inferior model to GPT-4. So the goal is to show that Panda is not a new language model, right?
Starting point is 00:57:59 The goal of Panda is not to come up with yet another highly complicated billion-parameter language model. The goal of Panda is that if you pick any language model and plug Panda into it, can you improve on the vanilla model? Even if I pick, let's say, Claude v2, or if I pick a Llama model and plug Panda into it, can Panda plus Llama beat Llama alone? That's the goal, and that's what we're striving for.
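As a rough sketch of that model-agnostic framing: the wrapper below assumes a generic complete(prompt) interface and hypothetical ground and verify hooks. It only gestures at how Panda-style grounding and verification could wrap any base LLM; it is not the paper's actual code.

```python
# A rough sketch of the "plug Panda into any base LLM" idea. The BaseLLM
# protocol and the ground/verify hooks are assumptions that only gesture at
# the paper's grounding and verification stages; this is not the real code.

from typing import Callable, Protocol

class BaseLLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class PandaStyleWrapper:
    def __init__(self, llm: BaseLLM,
                 ground: Callable[[str], str],
                 verify: Callable[[str], bool]):
        self.llm = llm
        self.ground = ground    # e.g. attach relevant telemetry to the question
        self.verify = verify    # e.g. check the answer cites real metrics/parameters

    def debug(self, question: str) -> str:
        grounded = self.ground(question)       # add context before asking
        answer = self.llm.complete(grounded)
        if not self.verify(answer):            # one retry if verification fails
            answer = self.llm.complete(
                grounded + "\nBe specific: cite the telemetry and exact parameters.")
        return answer
```

The point of the design is that the base model is swappable: the same wrapper could sit around GPT-3.5, Claude, or Llama, and the comparison is always wrapper-plus-model versus the vanilla model.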
Starting point is 00:58:33 And that was really eye-opening to me, because when I started this piece I felt like GPT-4 or GPT-3.5 was already surpassing almost every expectation. But when we started to look, the answers seemed factually correct, but they were not very useful. That was eye-opening to me, because every time I asked something related to databases, it was able to give me a thousand-word answer where all of it made sense but none of it was useful. So it was kind of an aha moment, where you realize that, okay, this is something that can be exploited. People were going all bonkers with these LLMs coming around, saying they're super useful, highly productive, and so on, but when we dove deeper into it, we felt
Starting point is 00:59:22 there's this vast amount of improvement that can be made. And when we started working on it, just by adding telemetry information, just by adding simple statistical models around it, the kind of improvement in the responses
Starting point is 00:59:38 was very, very interesting to us. That's something we didn't expect to take that far. But that was something that was interesting. Yeah, that's fascinating. It kind of makes you wonder
Starting point is 00:59:49 how many other sorts of applications there can be that take a similar approach, right? Of taking an LLM, taking GPT, and then putting this sort of wrapper around it and building on top of it. Yeah, I'm sure we're going to see loads of really fascinating products and applications built on top of it
Starting point is 01:00:04 over the coming years, for sure. That's cool. And by the way, it's already started. Yeah. So yeah, just to piggyback on that thought, there's already a bunch of research on using LLMs as tool managers.
Starting point is 01:00:16 So instead of making LLMs do everything, make the LLM act as a manager, where it doesn't give you an answer, but it decides what is the right tool to pick to give you the answer. So in this case, think of it like when we convert telemetry to text: the LLM is not converting the telemetry to text, but the LLM is deciding what is the right algorithm to convert the telemetry. So do I run anomaly detection, do I
Starting point is 01:00:46 run outlier detection, do I run change point detection? What is the right algorithm? Because what you care about could be different. Let's say I only care about decreases or declines in the metrics; I don't care about any increase. That information is relevant, and if somehow the LLM can figure that out, it can query the right tool. So there's a bunch of interesting papers. One interesting paper I read recently is called Toolformer, where language models can teach themselves to use tools. That's also a very interesting area of research going on, where using LLMs as tool managers, or using LLMs as reasoners which can pick the right tool, is very interesting.
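A toy sketch of that tool-manager pattern might look like the following. The choose_tool stub stands in for the LLM call, and both detectors are naive hand-rolled examples, not the algorithms Panda actually uses.

```python
# A toy sketch of the LLM-as-tool-manager pattern: the model only names a
# tool, plain code runs it. choose_tool is a keyword stub standing in for
# the LLM call, and both detectors are naive hand-rolled examples.

import statistics

def detect_anomalies(series):
    """Flag points more than three standard deviations from the mean."""
    mu, sigma = statistics.mean(series), statistics.pstdev(series) or 1.0
    return [i for i, x in enumerate(series) if abs(x - mu) > 3 * sigma]

def detect_change_point(series):
    """Return the split index where the means before/after differ the most."""
    best_i, best_gap = None, 0.0
    for i in range(1, len(series)):
        gap = abs(statistics.mean(series[:i]) - statistics.mean(series[i:]))
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

TOOLS = {"anomaly_detection": detect_anomalies,
         "change_point_detection": detect_change_point}

def choose_tool(user_goal: str) -> str:
    # In the real pattern this would be an LLM call with the tool list in the
    # prompt; a keyword check keeps the sketch self-contained.
    wants_decline = any(w in user_goal.lower() for w in ("decline", "decrease", "drop"))
    return "change_point_detection" if wants_decline else "anomaly_detection"

latency_ms = [100, 101, 99, 100, 98, 70, 69, 71, 70, 68]
tool = choose_tool("I only care about declines in this metric")
print(tool, "->", TOOLS[tool](latency_ms))
```

The manager only routes; the chosen analysis runs as ordinary code, which is exactly the division of labour described above.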
Starting point is 01:01:27 Because if you think of it, LLMs are trained on this vast knowledge base, which makes them generate good answers on average. Because they're trained on a large amount of data, not on specific stuff, they're not experts, but they're good on average.
Starting point is 01:01:48 And people who are good on average, who are trained on these vast amounts of data, are good leaders. Or not leaders, rather good managers. Yeah. If you think of it, an LLM has seen a lot of things, which makes it a good manager.
Starting point is 01:02:05 So instead of asking it to solve every single fine detail in the problem, you can ask it: since you've seen so much, tell me who is the right person to fix it. And that's what an LLM is better at. Instead of asking it, what is the right answer?
Starting point is 01:02:20 Ask it, who is the right person to solve it? And the LLM will give you a better answer there. Wow, all the managers and exec-level people in the world better start worrying, then, if it can become an effective manager, right? Yeah. But that's true: that point about being average or above average across a broad spectrum of things, they tend to be good managers, good leaders. So that's a really interesting point. Cool. Yeah. So, tell me about your creative process, Vikramank. How do you go about thinking about ideas, generating ideas, and then
Starting point is 01:02:56 selecting which ones to work on for a long period of time? Yeah, so let us inside your brain. Yeah, it's an interesting one. I'm sure there are different strategies people use to test their creative self and to come up with these interesting ideas. I think the one that has worked for me is to try out things and fail fast, instead of trying to think of the best idea.
Starting point is 01:03:31 So the whole idea of Panda came up instantly when I started testing. When I started trying out ChatGPT, my guess was that it would probably give a much better answer than what it gives, because it's trained on so much data. So it should be better than that. But it gave some answers.
Starting point is 01:03:51 The first thing that I thought was, why is it not able to give this answer? And it's probably because it hasn't seen telemetry data. That's what's missing. So then I quickly built a demo: what if I'm able to supply telemetry data in some language form, can we improve it? So I felt like an iterative demoing process helped me a lot. If I had sat down and tried to design the entire architecture one year ago, I wouldn't have reached anywhere, because I couldn't see a lot of the problems that I saw after demoing stuff.
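A minimal sketch of that "telemetry in language form" experiment could look like this. The metric names, baselines, and summary format are made up for illustration; the resulting text is what would be supplied to the model alongside the user's question.

```python
# A minimal sketch of turning raw telemetry into language the model can read.
# Metric names, baselines, and the summary format are made up for illustration.

def summarise_metric(name, values, baseline):
    latest, peak = values[-1], max(values)
    change = (latest - baseline) / baseline * 100 if baseline else 0.0
    return (f"{name}: latest {latest:.2f}, peak {peak:.2f}, "
            f"{change:+.0f}% vs. baseline {baseline:.2f}")

telemetry = {
    "cpu_utilization_pct": ([42, 45, 88, 95, 93], 40.0),
    "buffer_cache_hit_ratio": ([0.98, 0.97, 0.71, 0.65, 0.62], 0.97),
}

# This summary is what would be supplied alongside the user's question.
summary = "\n".join(summarise_metric(name, values, baseline)
                    for name, (values, baseline) in telemetry.items())
print(summary)
```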
Starting point is 01:04:27 So I feel like a creative, fail-fast approach kind of helped me reach a complete architecture, at least. So in terms of the demoing, was it a case of demoing to the team, like, here's this, and then you got some feedback and thought afterwards, this didn't work, actually, or that could have been different, and that's the way you approached it? That's fascinating.
Starting point is 01:04:44 I've not had that answer before. That's really cool. Everyone answers this question differently, and it's brilliant to see how we're all so different. That's fascinating. Cool, I think I might try that, see how it goes for me. Awesome stuff, of not knowing what I don't know, is that the right phrase? Yeah, exactly, exactly. And it's very difficult to sit at t0
Starting point is 01:05:13 and predict t infinity. The faster you move towards t infinity, the more you find new ways of exploring. It's like these four components that we talk about in Panda:
Starting point is 01:05:26 it's not that we started out thinking of it that way. It's just that when we built the entire framework,
Starting point is 01:05:34 component by component, based on the feedback that we got, eventually we realized, okay,
Starting point is 01:05:40 this is what one component is, this is what the other component is. It's all working backwards instead of working forward. Yeah, awesome stuff. Cool. Yeah, anyway, it's
Starting point is 01:05:51 time for the last word now. So, what's the one takeaway you want the listener to get from this podcast today? I think I would say the last point that we discussed is probably an interesting one people can take away, because again, if you look at the work, I'm
Starting point is 01:06:15 neither an expert in databases nor an expert in language models, and we still came up with something interesting in this combined space just by quickly experimenting and failing on these crazy ideas. So I think that's one takeaway: you don't have to be an expert in either of the domains to come up with something useful. Because what's useful and what's interesting are two very different things,
Starting point is 01:06:45 because something that is absolutely uninteresting could be very useful, and something that is super interesting can be extremely useless. So I would say, if you want to build something, aim for useful, and eventually you'll realize that what you end up building
Starting point is 01:07:01 will become interesting at some point, because interesting is a state that evolves over time. If you have a useful problem, start building. Don't worry about how complex the model is going to be or how sophisticated you can train it to be.
Starting point is 01:07:18 Start with something very simple and try to solve it in that space, like Panda does, and eventually it will become interesting, because I'm sure you can never think of all the problems at the very start. You'll always encounter problems
Starting point is 01:07:32 that will make things more interesting. I think that's a brilliant line to finish on. So yeah, thank you so much, Vikramank. It's been an absolute pleasure to talk to you today. And if the listener wants to know more about Vikramank's work,
Starting point is 01:07:43 we'll put links to everything in the show notes. And yeah, thanks again, Vikramank. It was a fantastic episode, and we'll see you all next time for some more awesome computer science research. Thank you.
