Programming Throwdown - 155: The Future of Search with Saahil Jain

Episode Date: April 10, 2023

When it comes to untangling the complexities of what lies ahead for search engines in this age of AI, few are as deeply versed in the subject as You.com engineer Saahil Jain. Jason and Patrick talk with him in this episode about what search even is, what challenges lie ahead, and where the shift in paradigms can be found.

00:01:16 Introductions
00:02:06 How physics led Saahil to programming
00:07:20 Getting started at Microsoft
00:13:39 Analyzing human text input
00:22:22 The exciting paradigm shift in search
00:29:02 Rationales for direction
00:33:40 Image generation models
00:39:55 Knowledge bases
00:45:12 FIFA
00:49:29 Understanding the query's intent
00:51:18 Expectations
00:55:38 A need to stay connected to authority repositories
01:03:45 About working at You
01:08:18 Farewells

Resources mentioned in this episode:

Join the Programming Throwdown Patreon community today: https://www.patreon.com/programmingthrowdown?ty=h

Links:

Saahil Jain:
Website: http://saahiljain.me/
Email: saahil @ you.com
Github: https://github.com/saahil9jain/
Linkedin: https://www.linkedin.com/in/saahiljain/
Twitter: https://twitter.com/saahil9jain
RadGraph: https://arxiv.org/abs/2106.14463
VisualCheXbert: https://arxiv.org/abs/2102.11467

You.com:
Website: https://you.com/
Twitter: https://twitter.com/YouSearchEngine
Discord: https://discord.gg/f9jRFH5gHP

Others:
On Thorium: https://www.youtube.com/watch?v=ElulEJruhRQ

More Throwdown?
Check out these prior episodes:
E143: The Evolution of Search with Marcus Eagan: https://www.programmingthrowdown.com/2022/09/143-evolution-of-search-with-marcus.html
E94: Search at Etsy: https://www.programmingthrowdown.com/2019/10/episode-94-search-at-etsy.html

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/

Reach out to us via email: programmingthrowdown@gmail.com

You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM

Join the discussion on our Discord
Help support Programming Throwdown through our Patreon

★ Support this podcast on Patreon ★

Transcript
Starting point is 00:00:00 Programming Throwdown, episode 155: The Future of Search with Saahil Jain. Take it away, Patrick. Excited to be here for another episode. 155. I don't know. I guess we can remark on the number every time, but that habit, that habit. We should, we should hire someone to help write intros for us. Oh no, that's what we're going to get ChatGPT to do. Oh, I should hire a ghostwriter. I was listening to a podcast where they were talking about getting researchers, and then the researchers would basically feed them all their topics. And I was like, we've been doing this a very long time, Jason, but we've never been that professional. No, never happened.
Starting point is 00:00:52 I don't even know if we discussed it. Is it true that musicians don't write their own songs? Or is that just like a diss or something? It depends on the musician. Okay, that's a very good, very noncommittal answer. Is it true that Stack Overflow writes all your code, Jason? Yes. All right.
Starting point is 00:01:14 Well, we're going to welcome to the show Saahil. He is an engineer at You.com. Glad to have you here. Yeah, glad to be here. So the way we always kind of start this off with guests, you know, it's always an interesting story to learn a little bit about the different ways people got into tech, got into programming. So do you have like a first memory, like a first computer, or first time you did a programming problem, or what was like the earliest thing that sort of
Starting point is 00:01:39 got you excited about technology? Yeah, that's a good question. So actually, I think compared to most folks, especially nowadays, I probably got introduced to programming a little bit later compared to most folks. I never actually did much programming in middle school or even high school, for that matter. I think the first kind of introduction I got was really in college. Really, I think what I initially wanted to do was become more of a mechanical engineer. So I was, you know, in high school and stuff, I was really interested in physics. So I thought in some ways I would maybe focus on, you know, going down the applied physics or just physics engineering. But then I think I realized very quickly that I wasn't super good with my hands,
Starting point is 00:02:19 but I was much better at, you know, dealing with the world of abstractions. So I think I naturally gravitated a bit towards programming. I think the first memory was working on a project where i was using a very simple you know nearest neighbor classification algorithm to determine whether a cell is malignant or benign okay i think that was kind of a really fun experience because i think it showed me how you know useful and powerful programming can be and kind of a really fun experience because I think it showed me how useful and powerful programming can be. And it was a really simple thing. You just basically look at the different attributes of a cell and you match it to the nearest kind of cell you have in your data set. And you can see whether or not that one was willing or benign and then classify this one.
Starting point is 00:03:00 And you can benchmark the scores. I just remember that being one of my first core experiences. Yeah, I mean, I think that's pretty good. You know, I think it's interesting, you're right, a lot of folks do have experiences a lot earlier. And so the, I don't want to say sophistication, that sounds insulting. The level of complexity of the thing that they get exposed to when they start programming is a lot, a lot lower, just because, you know, when you're, like you said, you know, a middle schooler or an elementary student, you know, you can be exposed to a lot of programming ideas, but you don't have the math background. So for you to say, you know, so it's simple,
Starting point is 00:03:33 you look to your neighbors, I mean, even describing, you know, what the nine connected neighborhood is, or your eight neighbors, and like trying to express how in a Cartesian space, you know, one might be up or up into the right, you know, explaining that to someone, you know, I have a fifth grader. So explaining that to my fifth grader, like she gets it, but she doesn't really get it. And so, you know, the complexity at which you can sort of like, touch on the variety of subjects that programming really involves. I think it's a great story. I don't know if you were doing an actual malignant detection thing or whether it was a sort of like a simplified cell version with attributes but either way i mean just like a great a great topic oh yeah this is totally a toy problem okay all right it's not real this is not real at all this is getting the first programming assignment type thing i was going to
Starting point is 00:04:19 be impressed you had like open cv open and you were like looking at microscopy slides and a dye stain and trying to i was going to be like, that's a pretty high bar for first programming problem. No, no, not at all. No, no, that's still good. But yeah. And, you know, you touched on as well, mechanical engineering. And I think it's often forgotten by me, like even when I was in school, mechanical engineers still had to do a lot of programming.
Starting point is 00:04:42 They get exposed to it. And even the CAD software today has a lot of scripting or procedural elements that aren't that dissimilar. But as you said, I think there are some folks who sort of think they want to do one thing and switch to the other or back and forth. Very, very common. So you you were doing this first programming assignment, you got really engaged, presumably in a class where and then you just sort of like decided to kind of like pursue that more, like take more classes? Yeah, I think at that time, I was still unsure. For me, I think I, I probably decided a little bit later, even after that. But yeah, I think essentially, I was kind of exploring different topics in parallel, I knew I was in the I was basically
Starting point is 00:05:21 an engine in the engineering school, which, you know, limited the options a little bit. But I was very interested in energy systems and that type of stuff at the time as well, which I still am. More as a side hobby, I guess. But yeah, so I think, you know, I think I realized that, yeah, I guess, you know, programming is fun, just the joy of it. And in general, building software. I think the idea of artificial intelligence also always appealed to me. I think that was really what kind of drew me towards programming is kind of the appeal of, you know, building intelligent systems.
Starting point is 00:05:50 So that's really kind of the route that drew me in. I guess we could hit the energy systems. I'm not sure exactly what you meant, but I feel like we can summon a discussion about thorium at this point, just do like the tech hipster topics like thorium ai like this is the this is the classic thing so i don't know we don't have to we don't have to go in there but is that energy systems you know you say you're in a hobby that's sort of like energy plant productions or oh sorry i said intelligent systems oh intelligent systems oh i'm sorry oh never mind i'm just excited i want to talk about thorium see that's the
Starting point is 00:06:22 all right cool never mind well i just ignored that uh we'll keep going keep going on uh is that the same as tiberium no that's coming to conquer yeah oh okay i i do remember talking to uh to someone else about thorium a while back yeah i wonder what happened to that i remember there was a lot of excitement there no no no i no. I can't. I can't. All right, ho. We've got to keep going.
Starting point is 00:06:48 Yeah, yeah. Look it up. Look it up. There's YouTube videos. YouTube videos. Thorium. It's good stuff. No, don't get near it.
Starting point is 00:06:53 Don't get near it. But read about it. It's fine. So, yeah. So you got this interest in programming. And then did you end up, you know, when you sort of graduated school, did you take your first job as like a programming job or wasn't sort of at that level yet? Yeah. So when I was deciding, I was actually deciding.
Starting point is 00:07:10 I remember for, I guess, my first job, I was deciding between, I guess, a couple of startups and then Microsoft as well. Some of them were engineering roles. Some of them were product roles. I actually ended up for my first role entering as a product manager at Microsoft where I was working on, I guess, cloud infrastructure and Office 365. Oh, very nice. I don't know if it's still the case. People may not know. So we have a little bit differently than a lot of other software and that they have tight couple teams where you said like product managers, software engineers and at a time test engineers. Is that still was that still the sort of arrangement that they had going or is that something that they've sort of gone past? Yeah, I mean, I guess when I was there, which is also, I guess, a little bit of time ago at this point. But I think it really depends on the team within the company. I think at a big company like Microsoft, it's almost like you're working at a different company depending on what org you're on. And the structures are radically different across different teams.
Starting point is 00:08:19 The team I was on, I remember we did have a little bit of that set up. We had a lot of service engineers because we were more on the cloud infrastructure side. So we had software engineers, service engineers, and product managers. Nice. And so you were doing product management, but still had your eye on wanting to do more of a programming role and continue to do that?
Starting point is 00:08:39 Or how did it sort of shape up in your time in that role? Yeah, so I mean, I've always been interested in building things as well as in whatever form it may take. So. I mean, I've always been interested in, you know, building things as well as in whatever form it may take. So I think the way I've always viewed it is there's a bunch of different roles. If somebody's interested in tech, there's definitely a lot of different ways to contribute, one of which is engineering, but one of which is, you know, design. There's a whole host of, you know, cool ways to contribute, even if you're not necessarily
Starting point is 00:09:03 inclined to code or to engineer necessarily. But I think for me, I was always, you know, working on side projects or, you know, interested in writing code in addition to kind of, I guess, my day job at that time. So I think in some ways, I necessarily didn't really identify myself as a product manager as an engineer, but more just, you more just in the spirit of building products, whatever that may be. And then I think I ended up again kind of returning to kind of being interested in artificial intelligence.
Starting point is 00:09:32 So I think immediately after Microsoft, I ended up becoming a researcher. So I ended up, I guess, doing research at Stanford in a machine learning group. And I think that was a pretty important experience for me and kind of helped me, you know, better understand what I'm interested in. Nice. That's a pretty big transition, right?
Starting point is 00:09:54 To work for Microsoft and then go to be a researcher at Stanford. How did that sort of come about? Yeah, so I ended up going to grad school. So I think the way it came about was, I think in some sense, I was always interested in, you know, artificial intelligence. And I guess a little bit of at the time, I was also very interested in healthcare. And I still am. So I think healthcare and search, there's fascinating interactions between the two of them. And I think there's a lot of work to be done in improving health search. But really, I think I came in with the angle of I wanted to, you know, use AI to improve healthcare. I think there's a lot of different ways healthcare is broken. And I was also interested in language. So natural language processing has always been kind of of interest to me. So I think I ended up doing research kind of at the intersection of the two. A lot of it was, you know, how can we use, you know, language to improve health, whether that
Starting point is 00:10:43 means, you know, mining, you know, reports for label data to then train, you know, language to improve health, whether that means, you know, mining, you know, reports for label data to then train, you know, computer vision models for healthcare, to just thinking a little bit more about, you know, conversational agents in healthcare and, you know, bias, etc. So I think that was kind of what drew me. It was more just an interest in topics. I think the topic in that case was just artificial intelligence, natural language processing, like deep learning, healthcare, those types of, it was a little bit varied, but that's kind of the general. Yeah. And so what did you, what did you kind of pursue while you were there? What kind of research were you doing? Yeah. So I guess I was doing research. Yeah,
Starting point is 00:11:19 I guess at the intersection of healthcare and deep learning. Okay. So I guess, you know, some of the projects I ended up working on, I guess maybe the first one was, it was a project called Chexbert, which we ended up releasing. And essentially what that was is it was a radiology report labeling tool that, you know, used, at this time I think BERT was, it wasn't necessarily new, but it was maybe like one or two years after BERT. And a lot of the existing radiology report labelers were very heuristic based, which means they use like a lot of, you
Starting point is 00:11:49 know, hard coded rules, essentially. And they were being used to train a bunch of computer vision models, some of which were being tested in hospitals. So the idea was that, you know, we can very easily improve the quality of the labels, which will then downstream improve all the other models that are trained on the labels using kind of some of the advances in natural language processing. So we ended up kind of developing Chexpert, which we released. And, you know, it was a great project
Starting point is 00:12:12 and, you know, it's been used by other researchers for their projects in different ways. So I think that was kind of one flavor of research. The other one was building data sets. So I think I gained a lot of appreciation for, you know, how important data is in machine learning and AI. Like one of the datasets we built was called RadGraph.
Starting point is 00:12:32 Essentially it was basically a dataset of entities and relations and radiology reports annotated at more of a fine-grained level than previous datasets in that healthcare space. And the idea was that it can eventually help train multimodal models when matched with computer vision images. So that was kind of a couple flavors of work. And there were some other along those lines. Nice. So it may be obvious to many folks, but humor me, it's not obvious to me. So Jason's background is more machine learning by now. But you mentioned something here that I've heard. So I'll get your opinion on it, or maybe you can help me on this. So you mentioned, you know, looking at radiology reports and doing natural language processing.
Starting point is 00:13:15 When you say that, do you mean, and you mentioned like sort of computer vision as well. Are you looking at the sort of like scans of a radiology, like an x-ray? Or are you looking at like the human text that like a radiologist would enter? Yeah, that's a great question. And yeah, I think I was breezing through. So no, no, no, that's okay. No, no, that's why I definitely have clarified a bit more, because I definitely think it's not obvious now that I've said it. But essentially, I think what I was looking at was, in this context, the human written text. So oftentimes, when a radiologist, you know, looks at their patient or their x rays, they'll then write down clinical
Starting point is 00:13:52 notes. So in general, I was paying a lot of attention to clinical notes that doctors are writing, and how to essentially structure that information. Because it's fascinating, there's all these notes that we have that we've collected, you know, doctors have made over, you know, the course of decades, and it's all very unstructured. And when you have unstructured data that's in free text, it's very hard to use it for analytics, insights, training machine learning models. So there's a lot of value we can just get by structuring a lot of the data in healthcare. So that's kind of maybe the theme of the research a little bit is how can we, how can we leverage, you know, all these like decades of knowledge that doctors have inputted? How can we structure that and then use it to, you know, be a little bit more algorithmic in the future? Nice. Well, I was going to tee up
Starting point is 00:14:33 a question about natural language processing and maybe large language models and like diverse fields that aren't related to text. But this is also interesting. So we'll dig in on this one. And maybe we can get back to that later at some other point, if it comes back up again. But so you mentioned this, and you know, we hear this as well, a little bit in the in the machine learning, this taking unstructured text, right? So a radiologist writes what they I, please correct me if I'm wrong. But like, they sort of write in, you know, notes about, hey, I looked at this patient's chart, maybe they were given something that they were looking to try to diagnose. And then they're sort of writing, it's all freeform prose, almost, you know, I guess I would say, like, oh, I see that, you know, there's this thing here, or that thing there, it looks
Starting point is 00:15:17 like this, it could be that maybe follow up suggestions would be to do this or that, you know, or, you know, I deem it's okay. And this is very unstructured. And you mentioned sort of making it more, you know, hierarchical. You also mentioned sort of entities and relationships and sort of like some of these modeling. And as you were saying, some of it may have been previously like heuristic based, like searching, I assume for sort of keywords saying this keyword corresponds to that keyword. Therefore in this chart, there's like this entity of, you know, a specific kind of growth. I don't know the word. And it relates to this patient and this over here. And you're sort of saying just giving it to a machine learning, you know, rather than heuristic and allowing it to just sort of build those relationships.
Starting point is 00:15:59 What does that end up is maybe that's too broad of a question. What does it end up looking like you feed them text and the training I assume is the sort of desired model output of like the hierarchy that you expect from from that text, and you're asking it to sort of do the same thing? Yeah, yeah, that's a good question. And I think in some ways, when I say structure, clinical information, the North Star is to capture all of the meaning and all of the nuance in a clinical note. But there's a lot of work until we can kind of do that. So the way we started off was very simple. And we, you know, used existing schemas that, you know, our lab had actually developed. In this case, it was called like the Chexpert report and schema.
Starting point is 00:16:41 And really it consisted of like 14 different labels for different conditions. So this was kind of, you know, formed with a lot of interaction between, you know, machine learning researchers and also doctors. And it was basically, you know, a very simple task, you know, given a report, what are the positive, negative or absent mentions of a particular medical condition? So for example, pneumonia, pneumothorax, cardiomegaly, these are all examples of different medical conditions. And there was maybe, you know, a couple of labels, different labels that each of those conditions can have. And that was kind of the setup of the problem. And then in terms of how we trained it, really, this is also kind of
Starting point is 00:17:19 another interesting thing is that we actually use a lot of these heuristic systems in order to generate labels. And we actually train some of these, you know, more powerful models using kind of models from or labels from simpler models. So this is kind of known in machine learning as generally kind of, you know, weekly supervised learning, where you can kind of have, you know, noisy labels, essentially, and you can learn off of them. And then you can have a stronger set of high quality labels that were, you know, devised by radiologists, and then you fine tune on those, and then you end up getting better performance than the initial labeler that you even used for training. So in some ways, it's kind of like a student teacher model, where the student ends up ultimately outperforming the teacher. It's kind of one maybe way of thinking about it. Nice. Yeah, I hear that as like a recurring thing now where rather than sort of a homogenous
Starting point is 00:18:10 training supervised, you know, this data set in and then I just get my results out where like you're kind of mentioning, you may have an earlier stage that uses one approach for training. And then you mentioned sort of like refinement, but like, you know, kind of mixing and matching different training styles throughout the sort of larger, larger model seems to be a, something I see at least as a, as a layman and not in the machine learning space as something that I see as a recurring theme more recently. So that's interesting. Yeah, yeah, no, I mean, it's a, there's a lot of interesting work going on. All right. So you're not at Stanford now. So something happened after that. So you did your research, you got your degree, and then where did you go next? Yeah. So I guess while I was doing research and I was becoming interested in language
Starting point is 00:19:02 and natural language processing, at that point in time, you know, towards, I guess, the end and maybe overlapping a little bit, I had started working with u.com. So our founders, Brian and Richard, were both previously at Salesforce, which is kind of this, you know, I guess a big tech company. And then they had kind of a vision of starting a search engine. So I know Richard has been thinking of starting a search engine for a while. And he had left Salesforce at that time with Brian. And I think I had noticed somewhere that, you know, that they, you know, that they had, you know, posted about it. And I think I had reached out or maybe they had reached
Starting point is 00:19:38 out by some like alumni or whatever. And ended up, you know, joining and, you know, helping build u.com. And, you know, there was definitely a lot of like, you know, interesting problems, and there still is in the search space, which I think is what attracted me. So I think making the jump from research to working at a startup was something that felt a little bit natural, especially given the fact that it was such a, you know, ambitious problem. So I think that's something that appealed to me is the idea of working on something that, you know, is hard. It's definitely not easy to build a search engine, especially when, you know, it's a, there's a lot of people who, you know, have a good product out there. Google's a great product. So, but we thought that, you know, we can,
Starting point is 00:20:21 you know, still provide something of value. So I think that's kind of what sold me a little bit on, I guess, moving from academia more into industry slash startup world. Awesome. So, I mean, I think this sort of have an idea of search, you know, the equivalent, I guess, of your hotkey and then letter F for sort of like searching a document for text. We even had an episode where we talked about, you know, how you would do search in databases and this kind of thing. But we sort of not really covering the topic of search engine, right? So something where you're going to a website and trying to index all of publicly or maybe even not publicly accessible human information that's up on the web and allowing people to sort of find the needle in the haystack, right? Find the thing
Starting point is 00:21:15 they're looking for. This is my personal, I've no idea this is a very good definition of a search engine. But this is a topic that I think everyone bumps up against at some time. Like you said, we've all used Google or, you know, another search engine and sort of, you know, put in your text. And like you said, there's already, you know, you sort of mentioned your background was natural language processing. There are many pieces to that. I can think of a thousand ways to sort of like start into the conversation. So maybe I'll just, you know, kick it over to you. Like when you think about this space, is there a sort of approach that you think about the sort of high level components of, you know, sort of I type text in a box, or what is a text even look like or formatting to, you know, I get, you know, a link or even just the information that I'm looking for
Starting point is 00:22:01 on the internet? Like, how do you either think about what it is today or think about where you want to see it going? Yeah, yeah. I think, yeah, there's definitely a lot of excitement around search right now. I think search is, in some ways, it's been a little bit, I wouldn't say, I think it's always been evolving over the last couple of decades.
Starting point is 00:22:21 But I think right now is particularly a time when we're seeing almost like a paradigm shift in search, which I can kind of get into in a bit. But in general, like when I think about search, and I think you mentioned search over databases, I think when I think about search kind of in this concept that you're talking about, when we think about, you know, essentially, you know, search engines, I think one of the differences is basically in the types of information that you're searching over and the goals. So I think when you're doing something like a database lookup, you're basically looking over data that's already been, you know, very structured. And you're essentially kind of looking for almost like a very procedural method of finding like an exact match or something.
Starting point is 00:22:58 And I think that definitely has value and is very interesting. In general, when I talk about search, I think about, you know about the discipline most aligned with search would be kind of this area known as information retrieval. And in information retrieval, I think there's a couple characteristics, one of which is we're kind of doing search over a large corpus of data. So if it's a small corpus, it's not really a search problem.
Starting point is 00:23:20 A lot of search engine is the point is that you're searching over kind of like an infinite amount of documents or a very large amount of documents. That's one characteristic that makes it a bit different sometimes than other settings. And then the other one is that these documents tend to be very unstructured. So for example, web pages are super unstructured, information about news, et cetera. So in general, you're looking for unstructured information. There's a lot of it.
Starting point is 00:23:41 That's basically the setting in which search operates. And then in terms of, I think you mentioned, you know, search being about, you know, you type in something, you get some links. I think that's the way it's historically been. And I think, you know, when we look at existing systems, a lot of times it's become kind of very ad dominated. So you end up, you know, searching, you get links and you get links with ads above it. And in general, that's kind of in the paradigm.
Starting point is 00:24:03 And obviously there's been a lot of content as well. if you look at a lot of search engine, there's knowledge panels, etc, with kind of extracted content, which is very useful. But I think what we're going to start seeing is a move towards search being more about getting you the answer and letting you do things. So I just want to bring a search engine, we think about, you know, being a do engine, not just search engine, how can we let you do things to kind of achieve your goals? So that's kind of maybe one distinction that we see search going from. So we kind of see it also as maybe going away from like a list of blue links to a list of kind of, you know, different types of organizations of content. So this can be kind of just giving you the answer straight up. You know, it could be in the form of a chatbot. So
Starting point is 00:24:44 we've been thinking a lot about conversation. So I think we'll see search evolve in many different ways. But yeah, I definitely think we're moving away from the, you know, you search something, you get a bunch of blue links, you click through a bunch of them, and then you eventually find what you're looking for to being a more kind of user centric, like streamlined experience. I was going to just make a funny observation
Starting point is 00:25:05 that getting blue links isn't as bad as searching something and getting the purple links. And you're like, oh, no, I've already tried these. Like, I need new links. So, okay, so blue links is, I get what you're saying. But anyways, I recall, like, I'm thinking back to, you kind of mentioned this, like, if you have a small set of data, it's sort of a different set of problem where small i guess is a is a bit of a relative term
Starting point is 00:25:29 but i recall like early on i guess you know i'm kind of old so like when i would first you know sort of move from a card catalog at the library to like they had a computer and you could look up not just what books were in your library but in ours ours, it's a countywide library where I was growing up. So actually, there was many branches, and you could search and see what books were in stock across the entire library system in that county. And I recall one of the things, one, it was pretty terrible. Two, like you had to put in text that was pretty close to what you were searching for. And then you needed to tell it like, do you want it to search, you know, author, title, subject, like literally in a drop drop down like what field are you searching and then they would give these things which at my time i didn't really know by boolean operators like i want this and this or not this um and so you could sort of say like i want i'm trying to think of a good
Starting point is 00:26:20 example i want cooking but not food i don't know what that would be. But like, you know, I want, yeah, oh, shoes, but not sneakers. So like, I guess dress shoes or, you know, brake pad shoes. Are there shoes? I don't know. And so like there was this, you know, sort of Boolean, very structured, almost like programming concepts, these logical concepts that you needed to sort of go in and put down into the search box and then run your search. And you know, it would come back with inevitably, basically nothing. But you know, you would try to get one thing. And then you were sort of saying,
Starting point is 00:26:55 you know, to moving to a more conversational. So I think over time, we've seen a lot of changes where people have moved to, I guess, with sort of Google coming to be there, like, you can type in a word. And it used to be before Google, like you would type in that word, and you would just find websites that had that word repeated, like 1000 times in like white on white at the bottom of the page. And they were just saying, like, frequency, like, which was the most. And then, at least for me, like starting to use Google was like you type in a word and that word didn't even need to appear in the page you were looking for. Like it seemed to try to understand related words or concepts or, you know, what you were searching for. And to sort of I'm
Starting point is 00:27:35 giving my own narrative, but feel free to fill in gaps there. But like today, you sort of mentioned now it becomes almost more even conversational. Like I'm not even just typing words, like I'm typing whole questions or sentences in sometimes. Sometimes it works well. Sometimes it doesn't. I feel like often it's just ignoring a lot of the context you're trying to put in your sentence. But, you know, that's sort of, I feel like myself,
Starting point is 00:27:56 even how I interact with it, I guess the systems are training us in as much as we train the system, like giving you feedback about the quality of the links and teaching you, like, you got to give me better inputs. But I don't know, like, is there, do you feel like you mentioned this move to conversational? Do you feel like we're at an end to this sort of like, I don't know what the right word would be like, strict, I'm looking for this word or concept directly to appear to something
Starting point is 00:28:22 that's more akin to how we would ask a question to someone behind the desk at a bookstore or, you know, someone at a university? Is that the sort of like direction that you think we're kind of moving? Yeah, definitely. I think you pretty much nailed it with your analysis, especially of the card catalogs and maybe the bookstores you were looking for books at. I think that's a great example of kind of the general direction in which it's been evolving. I think in some ways, you know, the reason why it's bad is because it's a tough problem. So, you know, when you have really, you know, long user queries
Starting point is 00:28:53 with ambiguous context and a lot of assumptions, it can definitely be tricky in order to kind of surface the right document or result for you, or answer even. But I think that's definitely the direction in which we're heading. And a lot of this has been enabled by a ton of advancements that have happened in natural language processing over the last five, six years. It's actually been tremendous. I think there's been, like, multiple paradigm shifts, essentially, in terms of, you know, the way in which we can kind of deal with language with machine learning, essentially, that have kind of opened up new
Starting point is 00:29:25 possibilities. So yeah, I think if I were to give kind of a simple example of how we're moving in this direction, there's kind of this one concept in search known as semantic search. So, you know, oftentimes, I think what you're referring to, and a lot of classical information retrieval, is dealt around keyword search. And keyword search, no doubt, is still even important today. But the idea behind keyword search is that when you look up something, you basically look for the exact word in a document. And you can obviously do some math, kind of, you know, normalize over the length of the document and other types of things. But now we're actually also, you know, moving into a world in
Starting point is 00:30:00 which semantic search is becoming, you know, more important and better, essentially, where you can basically have a user's query or question and you can do what's called, or basically, essentially, what we call, embedding the question, in a way in which you can kind of understand the context a bit more and then find similar documents. So even if the keywords don't exactly match up,
Starting point is 00:30:24 but the spirit of the question is similar to kind of the answer, we could still, you know, surface those now. And there's also been subsequent works, you know, dealing with large language models, et cetera, that kind of, you know, even further pushed the possibilities. But yeah, I think to answer your question,
Starting point is 00:30:40 that was a long-winded way of saying yes. No, no, no, it's great. Yeah, so I guess, like, this keyword versus, I mean, I don't have the right terms for it, so I think that was really good, thank you. That helps me put words to it. I guess there are still times where, um, you can go to DuckDuckGo and sort of, like, put in, you know, I want to search the cpp reference documents, for, like, I know the function I'm using, I just always forget the order of inputs, you know, the order of the input parameters or whatever.
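A rough sketch of the keyword-versus-semantic contrast discussed here: keyword search looks for exact word overlap, while semantic search compares embedding vectors. The documents, the query, and the hand-made three-dimensional "embedding" vectors below are all invented for illustration; a real system would use a trained encoder model and a vector index.

```python
import math

# Two toy documents, plus hand-made 3-d "embeddings" standing in for
# what a real encoder model would produce (values are invented).
docs = {
    "bike_doc":  "repairing a punctured bicycle wheel",
    "pasta_doc": "best pasta recipes for beginners",
}
doc_vecs = {"bike_doc": [0.85, 0.20, 0.05], "pasta_doc": [0.02, 0.10, 0.95]}

query = "fix flat tire"
query_vec = [0.90, 0.10, 0.00]  # pretend output of the same encoder

# Keyword search: count exact word overlap between query and document.
keyword_hits = {
    name: len(set(query.split()) & set(text.split()))
    for name, text in docs.items()
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Semantic search: rank by vector similarity instead of word overlap.
semantic_scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
```

Here the query shares no words with either document, so keyword overlap finds nothing, yet the bicycle document still scores highest semantically because its vector points in nearly the same direction as the query's.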
Starting point is 00:31:06 So like I want, I know exactly what I want. I want this, you know, document, this, you know, thing. But it's still, you know, I don't know, for Python or for C++, maybe tens of thousands of pages. We don't even print out books now; you know, we used to have, like, you know, language manuals or whatever you'd flip through. But now, like, you couldn't do that. Like, it wouldn't be practical for, like, the C standard library.
Starting point is 00:31:24 It'd just be gigantic. Uh, or Boost or something. Um, you know, and so now, like, yeah, we still rely on maybe that keyword search, like, I actually do know what I'm looking for very specifically, and I just want you to, you know, sort of find that thing. But I feel like, uh, maybe that's true, but that's sort of the exception. Often, like I was saying, even the difference of not knowing semantic versus keyword: a lot of times, you were saying, like, nearest neighbors earlier, you may not know the term nearest neighbors is, like, the academic way of describing finding the nearest cell to you by distance for, you know, matching an attribute set. And so you try to go to the search
Starting point is 00:32:05 engine, and you're just trying to describe the problem: the "finding the closest thing to me" algorithm. And, you know, you're hoping that the search engine can sort of, like, deduce that you're looking for, like you said, a document that describes how to do that thing, and from there, that this concept becomes obvious, of, like, nearest neighbor search, and then maybe you can refine or go back. So I guess that's the semantics of it: like, understanding what you're asking, and then knowing the answer to that question and giving you useful results for it. Yeah, exactly. Yeah, that makes sense. Yeah, I think that's a good way of kind of rephrasing it in a little bit more of an understandable way. But yeah, I think there's a lot of value to be gained by also knowing when to do each. I think search, in some ways,
Starting point is 00:32:55 it's a very vague problem and it's a very all-encompassing one. So almost a lot of things that we do can be rephrased as search problems. You could think of all of coding as being one giant search problem, where you're given a goal and you have to figure out what to do. And I think we're slowly going to kind of keep bridging that gap. And we'll see search kind of expand its scope a little bit into more, you know, what we call do actions, instead of just giving you information: allowing you to do things, and eventually even doing things for you. So, you know, one example, you know, might be, and this is something that, and maybe this is kind of going in a little bit of an odd direction, but, you know, typically, you know, in search engines, you are getting content, but I think also now, with a lot of advancements, you could think about, you know, making it take action for you.
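The "finding the closest thing to me" algorithm mentioned a moment ago, nearest neighbor search, can be sketched as a brute-force scan. The place names and coordinates are made up for illustration; real systems use spatial indexes (k-d trees and the like) instead of checking every candidate.

```python
import math

# Toy 2-d coordinates for a handful of places (invented values).
places = {
    "library":   (1.0, 2.0),
    "cafe":      (4.0, 0.5),
    "bookstore": (1.5, 2.5),
}

def nearest_neighbor(query_point, candidates):
    """Brute force: measure the distance to every candidate, keep the closest."""
    return min(candidates, key=lambda name: math.dist(query_point, candidates[name]))

closest = nearest_neighbor((1.2, 2.2), places)  # "library" is nearest here
```

The semantic-search hope described above is that a user can type a plain-language description of this problem and be routed to the concept "nearest neighbor search" without ever knowing the term.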
Starting point is 00:33:40 So I'll give you a simple example. There's been a lot of advances in kind of image generation. So I'm sure, I don't know if, I'm sure a lot of the listeners have played with some of these really cool tools out there. So I think one of them would be things like Midjourney, Stable Diffusion. There's these models that you can give it text, and it'll literally create a really high-fidelity image for you. So when you type in a search engine now, or even in our chatbot, something like generate an image of a cat playing the piano, you'll be able to get that. So this is kind of a little bit of a wonky example. It doesn't really feel like a search problem.
Starting point is 00:34:14 But in some ways, I think we're going to start seeing a lot more of these types of commands and actions happening within the context of search. And similarly within programming, if you say, you know, I want to, you know, implement, I don't know, a binary tree with this unique case, you know, there may be, you know, the search engine might have some type of, you know, code generation module running, and it may just give you your code. That way, you don't have to search through a bunch of documentation and make it yourself. So I think what we're going to see is search is generally going to move a little bit more in this direction of allowing you to do things, as opposed to giving you information that you then have to read, synthesize yourself, and then do things.
Starting point is 00:35:01 So I think that's going to be another really interesting shift we'll see. Yeah, I guess that's an interesting extension. You said it might be a little bit of a weird direction, but I kind of see it as we were describing like a shift in the way that you provide the information to almost, I don't even want to call it a search engine because like you said, it sort of moves beyond that, but like into the text box, I want to put words or, I mean, to be fair, like I, we've been talking about text box and search engine. I mean, the same is true for any of the smart assistants that you talk to verbally. It almost becomes the same problem, right? It used to be, you had a very formulaic, I need to speak in this precise way.
Starting point is 00:35:34 And they were all very, like, singular points of engagement. Like, I asked for one thing, and if I ask again, there's no context. We were talking about that; Jason was mentioning that with ChatGPT. We were sort of going through it and sort of talking about the difference between remembering what you've been saying, the context, versus not remembering. And so I think when we go to the search engine text box or the smart assistants, that not only is there context outside of that text, like that you've
Starting point is 00:36:01 previously entered or things it knows about you, personalization is one of those. But then you're sort of describing, that's, like, if I think about that as, like, the left-hand inputs. But then now you're talking about the outputs. It's not, let me give you back a piece of information that already exists, so a Stack Overflow page, or cpp documentation, or Python documentation, or your email maybe, even, like, whatever it might be that you have in your corpus of documents. You're actually saying now, and you kind of mentioned it happens sometimes, like, sort of knowledge panels, or, like, image generation, where rather than saying, hey, I want a picture of a cat, and let me do an image search, where there's obviously smarts going on for knowing that this is a picture, there's the whole image processing stuff, that this is an image of a cat, to match to cat, because obviously most of those
Starting point is 00:36:49 images don't have the word cat in them, uh, in the image itself, so gaining that context. But then, as you were mentioning, going even one step further, which is, you know, using a lot of the advances we're seeing now with a lot of these systems to generate that response that makes sense. And so do you think that that's something where, I guess, one, that's, like, pretty cool; two, like, it could be problematic, in that, like, as unauthoritative as the web already is in ways, like, how do we sort of know that it makes sense?
Starting point is 00:37:21 And then is there, like, a thing where we feel that those machine learning systems hang off of the text directly, or is there some sort of, like, cleanup system that sits in between? Which is sort of how I think about it happening today: like, today we put text in, some sort of system sits in between to sort of, like, like you said, do this embedding and then search the embedding space, right, and sort of do this cleanup. And so does something exist where I put in what I want and it knows how to queue the systems well downstream, or do they just become end to end? We just have, like, these monolithic, I don't want to call it general AI, but these things that just know how to answer questions, generate text. I want to watch a Seinfeld episode, except they're on the moon. And it just, like, knows how to go create, like,
Starting point is 00:38:04 you know, a movie for me that's like Seinfeld, but on the moon. Yeah, this is a great question. And I think the answer is probably a bit of both. So I think in some ways, you know, you know, we definitely should not end up, you know, moving away from authoritative content. I think having content that has citations is very important. So I think, for example, if you're looking up information, sometimes it's very important to, you know, read like kind of the details and the source documentation, and to dig up answers for yourself, and not just kind of see what a search engine condenses and gives to you. So I think that can also introduce a new form of bias that can really be a little bit problematic. So I think, you know, the way I think about it is, it really
Starting point is 00:38:41 depends. I think search is such a diverse task. Even within search, there's so many different categories of types of things people are looking to accomplish from search. Some of those things, I think, make sense for generation. So, for example, if right now you ask a search engine to write me a poem, I think in that case it's okay if it doesn't necessarily cite, if it's giving you ideas. Or if you ask questions like, you know, who won the World Cup, in that case, if you can authoritatively just kind of say Argentina, that's the correct answer, and that's okay to kind of show there. And obviously citations are very important, but I think in some ways it's also important to, you know, surface documentation
Starting point is 00:39:20 and really ground truth material that allows people to kind of make sure they have, they're in touch with kind of reality in some ways and not just, you know, immediately ingesting anything an AI tells them. So I think that's the great balance. I think that's something we're trying to find as well is how do we balance the need for, you know, the user's desire to get quick content with kind of allowing them to be able to dig into facts themselves and have trustworthy information, like citations. So I think that's kind of maybe a balance we're thinking about
Starting point is 00:39:50 is how to integrate the two of them together. And the way we think about this is also that I think knowledge bases are still going to be very important in the future. So even if we live in a world where, you know, it seems like ChatGPT is able to answer all sorts of things, in some ways it's still going to be very important to make sure that, you know, we have knowledge bases that these AI systems can interface with. And I think that's ultimately going to be the way in which we solve this is that you'll have, you know, you'll ask ChatGPT something.
Starting point is 00:40:15 Or in this case, you know, maybe let's not use ChatGPT, but like with You.com even: you ask us something and we'll use a lot of AI to kind of generate the answer for you, but hopefully the generation will be backed very strongly by, you know, knowledge bases and authoritative information that we'll give to you with citations. And that's kind of the North Star, I think, for me at least, and for You.com at least: how to kind of bridge that gap. So I think what you pointed out is a really important, I think one of the toughest problems in search right now, and in generative AI in general: you know, what is the balance there? For what use cases is it OK to have pure generated content with no citations? When do you need citations?
Starting point is 00:40:54 So these are all really thorny questions that I think we're going to have to come to grips with. Yeah, it's a bit of a side tangent, so I'll indulge it briefly, and then we can return. But one of the things, when you were sort of talking there about citations and authoritativeness, and not to be, like, dystopian: it's actually very difficult to say what's true. So you can say, oh, who won the World Cup? Well, even that, I mean, we could get into it, but, like, I guess at a surface level it's pretty straightforward to sort of, like, find someone
Starting point is 00:41:29 who generally people would agree is authoritative. Like, oh, I could go to the FIFA website, and the FIFA website is determined to be, like, the source of truth for this information, and so whatever they say, and let's just ignore the fact that they could get hacked or whatever. But, like, there are other questions that are just very difficult to answer, or maybe, uh, contentious, or become political. I don't want to give any examples; like, it'll color the conversation too much. But I think there are just certain, you know, questions that you ask where actually it's very difficult. And not to tag on another, you know, I guess, like, mean thing to talk about, cryptocurrency, but I was reading about sort of how oracles work in cryptocurrency systems, where you want to do, like, a betting market.
Starting point is 00:42:15 So I want to bet on, you know, who won the presidential election, or who's going to win the World Cup. But of course, like, what system is going to provide the answer? And how do you resolve disputes where someone says, no, actually, that wasn't the answer? And so sometimes the question itself could be ill-posed, like it's not answerable in its current form, or not, uh, undisputedly answerable. And so you have this, like, staking of reputation, or in this case money, or, you know, coins. It's just a very interesting way to sort of start from a very decentralized "I trust no one". And then it just ultimately ends up boiling down to a vote, and who's willing to, like, put their sort of, like, resources, their coins, behind, you know, one side or the other,
Starting point is 00:42:54 and hope that they sort of understand what the general answer is going to be. Again, not trying to get into the cryptocurrency discussion per se, although I'm happy to go there. But I think just this thing you mentioned: with these systems of text generation, search engines, like, many systems are going to be trying to give answers that some people, who aren't going to listen to the warnings, are just going to take as truth. But even, how do they train themselves, and what is true? And whatever source you rely on can become an attack vector, where people attempt to overwhelm that source of truth
Starting point is 00:43:30 to, like, say something else, right? So, oh, the FIFA website, right? Like we said, like, it could get hacked. You could try to, like, you know, say that the FIFA website is somehow biased, we don't need to go in there, the bribery scandals, whatever, right? Like, oh, you can't actually trust them, you need to trust this other thing. And I think it's just very interesting that this
Starting point is 00:43:49 somewhat subtle question. Like, the hard problem seems to be just, how do you give an answer? But then when you peel back the onion layer once and say, like, okay, but what is the actual answer? Uh, so even if we all agreed on what the question was, are we actually able to agree on the answer? So again, I know that's a bit of a side tangent. But just when you were sort of saying it, it leads me to this: like, it's so exciting to see this progress where you can type in something, and a lot of times it's right or really close. And then you go that one step further, like, oh, wait a minute, hang on, there's still, like, some foundational-level questions where, you know, is that Gödel, the incompleteness theorem? Like, you can't actually have, like, fundamental axioms that, like, describe your
Starting point is 00:44:30 entire... like, some things are sort of, like, not provable. And this is somewhat dissatisfactory, that you can't build an entire... okay, more tangents, but that's just, like, what tipped me off when you were sort of talking. Yeah. Yeah. I mean, those are, I think you made a lot of good points. And yeah, I think in some ways, again, it's like a balance, because we have to make assumptions about the world. You know, we do this every day, and they could be wrong. I think the same is true of search: when we provide an answer, we will have to make assumptions. Like, for example, if you ask who won the World Cup, we have to base that on some type of information, especially, well, I guess, yeah, in order to give an answer.
Starting point is 00:45:09 And then the sources we choose, like, for example, you mentioned the FIFA website. In general, we can assume it's quite reputable. There's other websites, news outlets, etc. And I think in some ways you're right that we do have to make assumptions there. And those assumptions can often be wrong, right? Things can be changing. I think one of the key kind of ideas or principles is also to make sure that information can be cited as much as possible. So, for example, say that FIFA was hacked: the statement "Argentina won the World Cup" would then be wrong if it turns out that France actually won. But the statement "according to FIFA.com, Argentina won the World Cup", that's self-contained, and it's still somewhat right. So in some ways, you know, you can at least provide, like, answers that are rooted in independent context, where the user, or, you know, yeah, I guess the user in this case, would look at that answer. And even if it's wrong, it would be kind of self-contained
Starting point is 00:46:05 in its wrongness. And all the assumptions, you'd be able to investigate them within the answer. And I think that's kind of, in general, the importance of citations: being able to kind of dig in and understand. So, for example, if you look up, you know, what is the nutritional value of a banana, or how much protein is in a banana, and we just spit an answer out at you, that answer could be, I think in some ways, contested, right? Like, maybe it's not clear exactly how much protein is in a banana. Which study? What area of the world are you in? Like, the bananas are different depending on where they're grown.
Starting point is 00:46:40 So there's all this like nuance, right? So you have to make assumptions. And it seems like kind of one way we get around that is by trying to provide the sources so that if the user wants to dig in more, they can. Oftentimes, people won't want to, they just want a very rough answer. And it's okay. But you know, you don't know that ahead of time. So I think I think that's one kind of, you know, line of thinking that we're going about. But it's tricky. I think in general, it's somewhat of a historically unsolved problem. And it'll continue to be kind of a big issue in the future. Yeah, this sort of escape hatch you mentioned is really interesting.
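One way to picture the "self-contained" cited answers described above is to attach the source to the claim itself, so the attribution travels with the statement. A minimal sketch; the claim and source below are just examples, not how You.com actually represents answers.

```python
from dataclasses import dataclass

@dataclass
class CitedAnswer:
    claim: str
    source: str  # e.g. the domain the claim was extracted from

    def render(self) -> str:
        # Even if the underlying fact later turns out to be wrong, the
        # rendered statement stays self-contained: it says who asserted it.
        return f"According to {self.source}: {self.claim}"

answer = CitedAnswer(claim="Argentina won the 2022 World Cup", source="fifa.com")
```

Here `answer.render()` yields a statement that names its source, so a reader can investigate the assumption instead of taking the bare claim on faith.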
Starting point is 00:47:18 And people like to debate this explainable AI, right? That, like, being able to tell. So in your case, you may not know how it got it; it may not be able to explain how it got to the answer, maybe not important, but it's giving you the equivalent of what you would have expected before, which is, I click on the link and I go to the FIFA website. And like you said, you sort of get out of it because, hey, I used this information and I gave it back to you, and I maybe gave it to you in a more palatable form. I wonder if that adds complication.
Starting point is 00:47:46 And then like, but you mentioned sort of moving beyond, which is like, oh, I want to generate me the nearest neighbor's code with a limit of, you know, 50 meters with the data set. I mean, I could just go on and describe a problem and we could get generation that may touch on, you know, a hundred different, you know, GitHub repositories or whatever. Right. And so this sort of explanation there becomes a bit tricky. You can give your sources, I used all of these inputs, here's a list of 100. But if I go to any one of the 100, I'm not really able to fact check. In the code example,
Starting point is 00:48:17 it's a little easier because, one, hopefully you write unit tests, or you can compile the code and make sure it works, which isn't always true of the output of these systems today. But like, you know, you can kind of go through that. But the same topic sort of appears, right? Which is, like, you just mentioned, like, maybe a poem is subjective, so it's a little harder. But when you sort of touch across many things, does that end up becoming, I don't want to say, like, a limitation, like, forcing the system to be required to sort of cite its sources? It adds an extra step. Does it end up becoming tricky, or an issue itself? Like, how do you even rank those
Starting point is 00:48:52 sources? How do you, like, what order do they appear in? What's the correct or, you know, best order? It becomes, like, yet more things that the system has to do. Yeah, yeah. I mean, I think you're right that it is strictly, and somewhat, a harder problem to provide a generation with citations than to provide a generation that's somewhat ungrounded. So I think you're right that it is somewhat a harder problem. But I think it's a problem worth solving, and it's something that I think we're interested in kind of going deep on and trying to dig into. I think the other thing is, you know, I think it's important to be able to understand the intent of the query, I mean, and know when to kind of use which system. So I think, again, when it comes to search, these discussions can be kind of complicated because of the fact that, when we talk about search, search is not really one, you know, single constrained task. It's really a set of so many different types of tasks,
Starting point is 00:49:47 some of which are frivolous, some of which are extremely important, some of which are even borderline, like, life-or-death situations. And I think all of that is encompassed within search. So I think we basically need to be very good at deciding when to engage with one technique versus the other. And I think the other important piece is around user expectations. So I think when it comes to
Starting point is 00:50:10 search, I think it's important that users have the right expectations when they use the product. So I think, like, if you're in a life-or-death situation, I would not suggest that, you know, you ask ChatGPT what to do. Definitely good advice. Yes. I don't think OpenAI would suggest that either. So I think, you know, in some ways, it's really important, no matter how good a product is, to always have the right expectations with it. Like, if you're feeling sick, or you have some issue with health, I mean, you should probably still see a doctor. Obviously, you know, there's amazing tools now, and search can be very helpful for it. But at the end of the day,
Starting point is 00:50:43 having the user expectation should be that, you know, you should probably trust your medical professionals. Maybe one day it'll be true that, you know, we'll be so good at doing certain types of things with search that that won't be possible or won't be as necessary, but at least right now
Starting point is 00:50:57 healthcare professionals. So I think in some ways, like, the expectations are also important. So, you know, people should know how to use search. I think that's always been the case for search, right? Like, we use Google for very specific things. Like, if you ask Google, should I accept this job or not? We know what to do. So it's important to kind of have the right expectations in using a product. And it's also the creator
Starting point is 00:51:23 of the product's responsibility to make sure that the expectations are communicated appropriately, so people use it the right way. I think with AI, we're going to see this happen a lot more, where there's going to be a mismatch between kind of expectations and what the product can deliver. And that'll lead to some issues.
Starting point is 00:51:39 So I think we're trying to think hard about, you know, making sure that we can promote responsible usage of whatever we build. So I just wanted to jump in and ask a question about the structured data and how to kind of incorporate that into something like ChatGPT. And we've seen a lot of articles talking about, like, oh, I asked ChatGPT what seven plus three is and it said it was 13, you know. And so it doesn't, like, uh, it's its own self-contained neural network, which can't rely on things from the outside, so it can't go and ask Wolfram Alpha for an answer, right? And so I'm wondering, like, how do you, where do you see that going? Like, how could we train something which knows, like, to start writing
Starting point is 00:52:26 text, and then at some point to stop and say, I need to make a SQL query, and then put the result of a SQL query there and then keep writing text? Like, how is that? How do you think it's going to unfold? Yeah, that's a great question. And I think in some ways it depends on what the purpose is. So the purpose of kind of the model, and the user expectations around it, will determine to what degree you need to do that. So I think, you know, in some cases, I think in some ways, like, search can be kind of, it's a hard problem. In some ways it's definitionally impossible to be perfect at, because there's so much ambiguity in user requests as well. And you never really know somebody's true intent when they're asking a question,
Starting point is 00:53:08 but you can always try to approximate it. But I think basically going to your question around, like, I guess, if you rephrase it, I guess you were basically asking what again? Oh, yeah. I was wondering how, like, right now, ChatGPT, you know, it just keeps calling the neural network. Yeah, it gets like, okay, the dog jumped over the fence. I'm wondering if it could get to a point where it could say, you know, yeah, the net worth of Shaquille O'Neal is, And then instead of just calling the neural network again, it knows at this point that it needs to like make a SQL query to like net worth database or something like that.
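One rough way to picture what's being asked here, generation that knows when to consult a structured source, is a router that checks a knowledge base before falling back to a generative model. Everything below (the lookup table, the stub generator, the figures) is invented for illustration and is not how ChatGPT or any real system is implemented.

```python
# Hypothetical lookup table standing in for a live knowledge base or SQL store.
knowledge_base = {
    "net worth of shaquille o'neal": "$400 million (example figure)",
    "stock price of microsoft": "$280.57 (example figure)",
}

def generate(query: str) -> str:
    """Stub standing in for a generative language model call."""
    return f"[generated answer for: {query}]"

def answer(query: str) -> str:
    key = query.lower().strip().rstrip("?")
    if key in knowledge_base:
        # Factual and fast-changing: defer to the structured source.
        return knowledge_base[key]
    # Open-ended requests (poems, explanations): fall back to generation.
    return generate(query)

fact = answer("Net worth of Shaquille O'Neal?")
poem = answer("Write me a poem about search")
```

A production system would replace the dictionary with real retrieval and the stub with a model call, but the routing decision, structured lookup first and generation as fallback, is the same shape.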
Starting point is 00:53:52 Yeah, yeah. Okay, sorry. I got lost in a tangent, so thank you for bringing me back. No, it happens to all three of us. But basically, I think what you're talking about is something I mentioned a little bit earlier about knowledge bases being important. So it depends on what ChatGPT's goal
Starting point is 00:54:08 is. So in some cases, you know, they have an API that they allow other people to use, and they don't necessarily need to solve that problem. Maybe they will. But it does seem like it's the responsibility of people building the products to know when to use a tool and how to use it. The way I think about it, at least, is that it's going to be really important to know when to plug in knowledge bases. So if you're looking up the stock price of Microsoft, it's very unlikely that a neural network will be able to answer that question properly. Maybe one day, if it has a live feed plugged in. But even then, there needs to be something that's monitoring the real world and is
Starting point is 00:54:49 aware of events that are happening. So I think it's kind of clear, when you look at it that way, that knowledge bases are important, and we need some way of having AI interface with them. And that's maybe what I was talking about earlier. I think the way that we're going to go about this, and maybe the optimal way, is to really think about how to combine
Starting point is 00:55:09 some of these advances in AI with a lot of the knowledge that we've collected, crawled, and indexed as a search engine. So I think we'll need a mix of both, and we'll need to know essentially when to draw on one versus the other, and how to combine a lot of the advances in generative AI with a lot of the advances in traditional
Starting point is 00:55:29 and, I guess, neural information retrieval. So it's a hard problem. But yeah, I think you're right that it will be important to connect to valid, authoritative sources of information, so people can trust the content that we provide. Yeah, that makes sense. I think the hardest part is on the data curation side. If you're going to, for example, scrape Wikipedia: humans went in and wrote, like, Shaquille O'Neal's net worth is X, but they didn't make X
Starting point is 00:56:03 uh, some kind of token that is updated. It's just someone maybe looked it up on the internet, found it, and then hard-coded whatever the dollar amount is. And so we have to somehow reverse that process. We have to say, okay, this sentence from Wikipedia or whatever our input corpus is says, Shaquille's net worth is I don't know like 100 million or something and we have to be able to say okay that number actually
Starting point is 00:56:32 like there's a way I could get the latest version of that number like that number actually represents some kind of structure yeah I know that OpenAI spends a ton of time and energy on you know curating the data set and And I think that's going to be even more important going forward. Yeah, I agree. We also think a lot about curating data sets and having really good data to kind of build AI on and eventually, I guess, conversational search and search for. And I think the way we think about this at least is we have these ideas of u.com apps within our search engine. And, you know, a lot of those apps are, you know, ones that we've built in partnerships with other companies. We have, for example, a Stack Overflow app that, you know, provides kind of trusted programming content
Starting point is 00:57:19 that you can use to kind of supplement kind of generated code content. But I think in general, we also have this idea of like an open platform where, you know, we can have other people plug in and build apps with their own data. And eventually that will be used in conversational search as well. So I think when it comes to data curation, it's important. And I also think it's important that it's opened up. So there's not just one search engine really, you know, controlling all of kind of the data that's being curated,
Starting point is 00:57:50 but it's really kind of an open community in some ways that is, you know, providing data and, you know, updating the quality and users can kind of, you know, basically prioritize which sources of data that, you know, they like and they think is trustworthy. So I think that maybe went a little bit further than what you were suggesting, but I think it's a great point about data creation. No, it makes sense. That was a fantastic answer. So we're getting a little close to the end of time. So what I was going to just give us, we referenced this conversational search a couple of times. So I just wanted to give an opportunity
Starting point is 00:58:19 to talk about that before we sort of wrap up. But if I sort of take the words for what I assume that they mean, conversational search, where you're having more of a conversation rather than the traditional posing of search queries we sort of worked through. And then when you say conversational, I always think about it as two ways. So it's like, I'm asking a question, I'm getting some answer, and then maybe I'm giving refinements or feedback or additional context and getting right, you know, it becomes a two way dial. Am I am I in the right in the right vicinity of what we're talking about? Yeah. All right.
Starting point is 00:58:52 Awesome. And so for that, I mean, in some ways, I could see wanting to do that on some days and sometimes not. But I guess like if I start thinking about hard problems there and then, you know, maybe you can, you know, follow up with other hard problems or tell me that mine aren't really that hard. If you sort of have this ongoing conversation, sometimes I won't say that it's my wife, but definitely my wife. When I say, you know, have some conversation, she's and then the topic switches and you're not aware the topic has switched. Right. And so this even for I won't say full functioning adults, but like for humans,
Starting point is 00:59:26 I guess, to say generically that, you know, two humans conversing, the topic can switch and keeping up with the topic switches is actually a challenge, right? Not even that you're paying attention, despite whatever someone might say, like you're definitely listening, you're definitely paying attention. And the subtleness of a context switch, a topic switch, can be very difficult to discern. Yeah, I think this is definitely something we're thinking a lot about and trying to improve. But you're right, when we think about conversational search, one of the main differences between conversational search and normal search is that with normal search, you're typing in a new query essentially every time. And you're starting from scratch. You can't really leverage context that you have done previously and you have to basically make sure that whatever
Starting point is 01:00:09 you're typing in is self-contained and contains all the assumptions that you know you've learned along the way um so you know this is kind of often i guess there's different like you know there's technical words for kind of this this type of work but um i guess one of which is query rewriting is how do you essentially write the query? And this is what we do all the time when we're doing search, we're rewriting our queries in order to include more context, to include less context, in order to get the results we want.
Starting point is 01:00:34 With conversational search, you're essentially doing a chat and then you're doing another chat and you kind of expect the previous chat's context to apply or not apply, according to some assumptions. So, for example, if you look up what is the stock price of Microsoft and then you look up what is its price earning ratio, you're going to assume that you want the context from the previous chat to be relevant. But now let's say you suddenly say, who won the presidential election in, I don't know, 2008. You don't really need the context to apply and you wouldn't expect it to be related to Microsoft or anything. And the chat bot and conversational search should be able to handle that for you. I think it's a North Star goal. There's definitely cases where the context becomes really ambiguous. And it's ambiguous whether or not you're starting something new, or you're referring to kind of
Starting point is 01:01:25 things that you've talked about maybe even two or three previous chats ago and this is even chart for humans so it's definitely going to be hard for any conversational agent but this is kind of an active area of research that we're working on is how to think a lot about um you know being very good at um you know the conversational flow yeah i feel like so thinking to your example like i think you said 2008 like who who was elected president and 99.9 of people who say that those words want to know the like presidential election but the prior that you were just talking about microsoft could make it ambiguous if microsoft also held a shareholder election for president in that year right and so i i don't i don't think they did like let's just
Starting point is 01:02:05 say they did right like the term could apply although awkwardly and now you're forced with one like making sure that the chat understands there's two potential things and then how do you communicate that to the user in a like not uh incongruous way that like i i'm not sure which of the things you're talking because then it's super obvious right they know which one they intended but now all of a sudden you're stuck with a dilemma of either saying a non sequitur like oh well you know so and so won the US presidential election you know
Starting point is 01:02:34 and they're like well I was talking about Microsoft like it's a super like yeah like you said assumption is the word there there's a lot of assumption that you didn't necessarily put down and so how could anyone or any system have known it? Yeah, it seems like a very difficult problem. Yeah, yeah.
Starting point is 01:02:47 No, I think you made it. Yeah, well put. There's definitely assumptions that need to be made. And I think we have to make those assumptions clear as well. So if we're giving you the answer for the presidential election, there should be kind of some context in the answer that points that we're talking about the presidential election. Oh, that's even better. You just don't get the name.
Starting point is 01:03:09 Yeah. Wait, I didn't know he was president of microsoft so well awesome all right so i mean that was a pretty like whirlwind tour through a variety of different topics but i definitely learned some stuff so you know i had a super super good time we always like to give people a little bit of opportunity at the end to talk about you know sort of the company they work for sort of the culture there are you hiring um i can tee them up one by one you can just sort of go whatever you want but like let's start off like you know you mentioned you.com a few times you've been there for a few years now like how is it like working there you know would you recommend other people to join maybe that's a tf but like uh tell us a little bit about the the and the culture. Yeah, definitely.
Starting point is 01:03:45 So, yeah, I mean, I think working at u.com has definitely been, I think the best word to describe it is like an adventure. And I think we're, you know, still evolving and trying to figure out how to, you know, technology is changing, user needs are changing all the time, and how do we best meet those needs. And I think it's kind of an exciting space, search, conversational search, et cetera. So essentially, we're focused very deeply on these types of questions.
Starting point is 01:04:11 If you find them interesting, definitely reach out. I think our culture could probably be described as one that is collaborative, but also fast moving. So we definitely work at kind of the face of a startup. So we're very much focused on iterating fast, learning fast from our users and, you know, really, you know, building something. So I think that's kind of maybe one along with kind of collaboration. points yeah i mean if you're looking to to join or reach out you can definitely reach out to me at i guess sahil at u.com so s-a-h-i-l at u.com we'll have that if you didn't want to chat about search topics you can also feel free to message me i'm more than you know happy to chat about anything uh we've said chat too many times people are going to be confused what you mean
Starting point is 01:05:00 you get to give more no i'm just kidding uh so yeah i mean are you guys you guys have like uh physical locations are you doing most of virtual where are you guys at yeah so we're remote so we're fully remote um yeah since the start nice did anyone at your company move somewhere really exotic we recently just had someone move to hawaii oh really um i don't know i don't think anybody has necessarily moved anywhere exotic, although we do have employees like, you know, different places. Some of them, you know, outside of the country as well. Okay. All right. Yeah. It's, it's tempting. I mean, you could just go to Hawaii now. It's never didn't really think about it until someone said, I'm in Hawaii now. I said, Oh, yeah, okay, that's interesting. Yeah, I mean, now it could be I'm on vacation slash working or like,
Starting point is 01:05:49 I live here now. But yes. Yeah, definitely. There is a temptation, I think, for being displaced. I think a lot of companies, you know, to dwell on that, but I got, you know, in most companies, there's at least more flexibility or openness to that, even if it's only part time. And I really think that that's something that people should use and find new ways, it'll be a little difficult, right? Like, if I go on vacation with my family, they're normally used to me being present with them and on vacation the whole time. Now you have an option, like you could take, you know, two weeks somewhere else, you know, and visit and, you know, have them exposed, but you might be working part of the time. I think there's some interesting sort of like calibration
Starting point is 01:06:26 for everyone to do to each other. But I definitely think that's going to be an opportunity, at least for me, like I hope to be able to leverage. And I think a lot of folks will as well, like being able to, even if your culture is fully remote, making sure you're not just remote
Starting point is 01:06:39 in your house all the time, like taking opportunities to be remote in other places. So yeah, I think that's pretty awesome. Yeah. Anything else about you.com you want to, you want to sort of pitch, Sahil? So I assume interns, full-time, anything else that you want to kind of like encourage people to check out? Check out the website, of course. You know, I went, you can just go, you can try some of these things we were talking about today. They're already up there on the website.
Starting point is 01:07:04 Yeah, yeah. No, I would basically suggest, you know, if, you know, yeah, definitely interns are interested or full-time engineers, designers, marketers, really anybody is welcome to kind of reach out. And I think there's definitely a place for a lot of people here, especially people who are interested in, you know, making conversational search and search in general even better. And I think the other, yeah, I guess the other aspect would be to definitely check out u.com, check out UChat, which is our conversational search. And also if you're interested in kind of being part of the community, you can also join the community.
Starting point is 01:07:39 So if you go scroll to the bottom at u.com, there's a join community button. You can click that and we have a Slack group of kind of different users. I guess we call them beta users, but at this point they're, I guess, all sorts of users. And you're definitely welcome to kind of contribute, share your thoughts, ideas, and yeah, be part of the community.
Starting point is 01:07:59 Provide more training data for conversations. No, no, no. This is not about that. Yeah, sorry, no. This is not about that. Yeah, sorry, sorry. I shouldn't say that. Definitely. And that's you.com, just so no one types the letter U.
Starting point is 01:08:12 You.com. And yeah, check it out. Well, thank you all for being on the show. It's been a great time. I've really enjoyed talking about this. And thank you to all our listeners for hanging with us another episode. And we'll see you next time
Starting point is 01:08:26 music by eric barn dollar programming throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide an attribution to Patrick and I, and sharealike in kind.
