Programming Throwdown - 143: The Evolution of Search with Marcus Eagan
Episode Date: September 26, 2022

Finding something online might seem easy - but as Marcus Eagan tells it, it's not easy to get it right. In today's episode, MongoDB's Staff Product Manager on Atlas Search speaks with Jason and Patrick about his own journey in software development and how to best use search engines to capture user intent.

00:00:34 Introductions
00:01:30 Marcus's unusual origin story
00:05:10 Unsecured IoT devices
00:09:56 How security groupthink can compromise matters
00:12:48 The Target HVAC incident
00:17:32 Business challenges with home networks
00:21:51 Damerau-Levenshtein edit distance factor ≤ 2
00:23:58 How do people who do search talk about search
00:30:35 Inferring human intent before they intend it
00:46:13 Ben Horowitz
00:47:32 Seinfeld as an association exercise
00:52:27 What Marcus is doing at MongoDB
00:58:30 How MongoDB can help at any level
01:01:00 Working at MongoDB
01:08:14 Farewells

Resources mentioned in this episode:

Marcus Eagan:
Website: https://marcussorealheis.medium.com
The Future of Search Is Semantic & Lexical: https://marcussorealheis.medium.com/the-future-of-search-is-semantic-and-lexical-e55cc9973b63
13 Hard Things I Do To Be A Dope Product Manager: https://marcussorealheis.medium.com/13-hard-things-i-do-to-be-a-dope-database-product-manager-7064768505f8
Github: https://github.com/MarcusSorealheis
Twitter: https://twitter.com/marcusforpeace

MongoDB:
Website: https://www.mongodb.com/
Atlas: https://www.mongodb.com/cloud/atlas/register
Careers: https://www.mongodb.com/careers

Others:
Damerau-Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Lucene: https://lucene.apache.org/core/
Target HVAC Incident (2014, Archive Link): https://archive.is/Wnwob

Mergify:
Website: https://mergify.com/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/

Reach out to us via email: programmingthrowdown@gmail.com

You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM

Join the discussion on our Discord
Help support Programming Throwdown through our Patreon

★ Support this podcast on Patreon ★
Transcript
Welcome to another episode of the show, everybody. It's
kind of crazy watching the ticker on the episode count go up and up. I always forget how many of
these we've done. It's too many now. We're excited to be here with Marcus. Marcus is the staff
product manager for Atlas Search at MongoDB. Welcome to the show, Marcus. Thank you. Happy
to be here. Though we were doing a little bit of pre-show recording, I know we kind of hinted that we already got pretty excited.
Marcus was helping us understand search, something that personally I've always known as, like, Google, right?
Search engine.
But even though I've done database stuff before, my search always amounts to like equals equals
checks and even getting those wrong because it turns out, yeah, anyways, it's difficult.
So I'm excited
to learn some stuff today, and I'm glad Marcus is here to help us through this. But before we dig into that, we always like to ask people kind of how you got into tech. Your story, kind of like, you know, origin story, Marvel superhero, whatever it wants to be. It can be kind of boring, that's all right. But like, Marcus, how'd you kind of first get into tech? Was there, like, a moment you remember as, like, this is the first time I got excited about computers or programming? Our family's desktop computer, just, like, opening it up. Like, I got some tools, waited till my mom was at work and my dad was cooking, and I started unscrewing, like, the panel. It was a Compaq computer, I remember it like it was yesterday. And, like, when my mom got back from work, you know, I knew I had the same corporal punishment coming to me that I had the first three times. And I didn't mind; like, it was worth it to look inside the computer each time and, you know, pull something random out. But this time my dad intervened, and he was like, you know, maybe we should put him in this program. It was called DAPCEP. It was, like, Detroit area, like, community education something, or, like, pre-engineering, was the program.
It was, like, extracurricular. Every Saturday you'd go up to the university, which was just a mile from my house, University of Detroit, and start learning about a range of topics in engineering, ranging from building basic circuits to programming computers using QBasic. So that was certainly my first time when I was like, okay, this isn't random. There's some logic to this. It's actually pretty cool. And one thing led to another. There was an extracurricular class... Well, my brother was studying computer engineering at first, and I would, like, look through his books with bewilderment and curiosity. Then my girlfriend,
I mean, all we did was hold hands, we were 14. She, like, forced me to go to, like, web development with the calculus teacher in high school. So I went to his class, and, I mean, this stuff just clicked with me right away. In college, I was really focused on some of the... mostly other areas, primarily because I was much weaker in those areas, like research and writing. I found myself researching about computers and, like, the FCC and radio. So then I started to work on mesh networks with my friends, and, like, they're powering, like, internet for protesters in some of the big protests in New York City. We were, you know, then trying to provide internet to migrant communities in Brooklyn when we left school, and then, like, impoverished communities just around the country. It turns out it's super expensive to build your own internet infrastructure, even ad hoc internet infrastructure.
And it would be a challenge today.
There's a lot of cultural inertia
to go out and say,
invest in the hardware
and then invest in the maintenance burden
of maintaining a node in a mesh network.
I think one day that will be important in certain regions in the United States, particularly
like rural regions.
But there's many areas in the world where these sort of networks exist.
But like going from the web development to the networking stack, and, like, the QBasic, really sort of gave me a well-rounded foundation.
And then how I really started to learn was when I struck out on my own and started my own company focused on, like, security for home networks, like IoT. Like, when you plug in your router and then set up all the devices you got for Christmas: some of those devices were made smart very hastily, and they would include, you know, self-signed certificates, for instance, default passwords, or use weak encryption. And, like, your network is only as strong as the weakest link. And I saw a trend in 2015, 2014 really, of people
increasingly working from home and adding more of these devices to their networks. Like I would tell
people, like, your Dropcam, that's probably okay. But, like, the second- or third-tier IP camera? Who knows. Were you helping people, like, lock them down by, you know, kind of just applying best practices? Or were you doing, like, pen testing against them, sort of figuring out, hey, this third-party thing has some backdoor? Or a little bit of both? I'm just trying to kind of position, like, the kind of work you were doing. Yeah, so we were just providing, like, an IDS/IPS system, mostly focused on IDS. So, like, you'd actually buy a hardware device, and then we have a device that sits sort of like a sandbox in between your router and your modem. IDS is intrusion detection system? That's right. Okay. That's right. So, you know, we drop a Linux box, an embedded Linux system, in promiscuous mode in your home, and then... You tell people... like, if you go to someone's house and say, I'm going to put this promiscuous-mode network card in your house, I feel like they're going to think something, and it's not the right thing. Well, that's right. It's better to know who's in promiscuous mode on your network than to not know. Right. Because there may already be some promiscuous devices in there.
And it was just like it was a super hard problem.
I still don't know how big of a problem it is today, even with 100% of people in these knowledge positions, knowledge jobs, you know, working from home in some capacity. You know, like, I think the home remains an attack vector. But, like, that company was bought by another company kind of going after a mesh networking problem, which was funny. I was like, this is going to be really, really hard, folks. And they're like, no, just focus on the security and observability stuff. And I'm like, fine, fine. But, like, you know, I think that remains a hard problem. But as a part of that work, for months, we were just, you know, mostly using out-of-the-box threat signatures, some stuff we tuned. It's very difficult to do this. We were collecting a ton of data, and we had done some interesting things, but searching across these heterogeneous networks proved to be virtually impossible without a search engine.
Oh, OK. Well, hang on. There's a bit of a side topic.
I want to say, this thing you're saying, like, the home network remains a threat, and, like, all these IoT devices. And, like, I myself am unsure how much it is people hooking their IoT device up directly to their ISP's modem, and that's what's getting hacked, versus people with a sensible setup, you know, a router with some basic default firewall rules. And, like, I don't know on balance what it is. But I am shocked. I did a little, very little bit of security stuff in college, you know, just some basic, like, SQL injection, some, like, payload stack overflow stuff, just enough to kind of, like, okay, I get it. But there are people I work with, who went to, you know, four-year degrees at big-name schools, who are like... they don't get it. They're like, oh yeah, hackers are a thing, but, like, my home network's not a big deal. Like, that's not true. Like, you wouldn't actually encounter in the wild, like, a stack overflow that doesn't just crash your computer. Like, the belief that you can deliver a tailored payload automatically via, like, a worm or something: like, they know it exists, but they don't think it'll happen to them. And so I'm, like, shocked how even people who should know better, who even understand the mechanism of how these work, and, by the way, have seen each other's code, like, they still don't believe that that's, like, an ongoing threat. It's actually shocking to me. Yeah, I mean, that's a great question. I think that we as software engineers suffer from groupthink tremendously, and confirmation bias. It's like, if it compiled, it must be good, ship it, you know what I mean? And folks are moving fast. It's a super competitive space. There's a lot of capital flowing into software, so there's a lot of people, like, doing it, right? Chasing that.
And we have seen recently like the cryptocurrency industry is, well, it's great because it's like
you can express money directly in code, whereas like typically it's an abstraction, right?
It's like most ways programmers interface with money is a shallow copy. So like a shallow copy
for those who aren't familiar with this, like it's a reference to something. It doesn't actually
represent the thing that you're pointing to, right? Specifically in memory.
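The shallow-versus-deep-copy distinction Marcus is leaning on can be made concrete in a few lines of Python (a sketch; the variable names here are just for illustration):

```python
import copy

# A shallow copy duplicates the outer container but still references
# the same nested objects; a deep copy duplicates everything.
ledger = {"owner": "alice", "history": [100]}
shallow = copy.copy(ledger)
deep = copy.deepcopy(ledger)

ledger["history"].append(-25)

print(shallow["history"])  # the shallow copy sees the mutation: [100, -25]
print(deep["history"])     # the deep copy is independent: [100]
```

The shallow copy is "a reference to something" in exactly the sense above: it points at the same underlying object rather than holding its own.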
But like in this case, shallow copy of money is like, okay, I'm using Shopify.
I have a Shopify checkout button.
That Shopify checkout button talks to Stripe.
That Stripe API talks to First Data or some payfac, a payment facilitator. And that payfac talks to some mainframe stood up in, like, the 1960s.
Who knows what that's running? Somewhere in Atlanta, probably. And with like the cryptocurrencies,
what I found so fascinating from a security standpoint is you can point to, and I just
started learning Solidity early this year,
like for the second time,
this time it was a lot easier.
Ethereum's programming language.
Yeah, Solidity powers the EVM or allows you to interface with the EVM.
That's right, Ethereum virtual machine.
And like what I found so interesting and profound about it,
and I'm assuming this is how it is in all of them.
I don't, my knowledge is limited there, admittedly, is that you can point directly to value. It's like a deep copy that you're like,
this money exists really in the world. It can be spent for like goods and services and that's in
your code. And so the security mishaps, the mistakes that are inevitable to happen in our software impact people a lot faster.
In the case of that, there's like some folks working on like these bridges between two chains.
And that's a super risky endeavor that I'm staying away from all the way.
And like people lose 400 million and it's gone right away.
And people feel that they know it. They know it,
which is important to me for security. Because if you look at the target hack, you know,
you might remember that big target fiasco 2014, maybe 2013, where the HVAC system,
the customer, I mean, a contractor, was debugging the HVAC system. Their home network was compromised. Their VPN credentials were compromised. Folks with a command and control server somewhere were able to then skim credit cards, because they were on the same network as the HVAC system for whatever reason. I'm sure that person isn't employed by Target anymore. Not the HVAC person, they probably still are. The person who put those on the same network. You know, once those credit card numbers were taken and those credit cards were used, there was a chargeback to Target. Like, the customer, the credit card user, is just like, hey, wait a minute.
These aren't real. Like, or. Or how did they get my information?
Is it because of Target? Target was the last place I bought something and then all this stuff showed
up. So you can see how it's a little convoluted to the consumer, the end user of these computers,
like what was really at stake for me, whereas like it's going to become more
real for, for people today because of the connectedness and the cryptocurrency realm.
And I'm talking about not using exchanges. That's different. That's more like what we're used to,
but I'm saying like, like having a wallet and storing the primitives.
Yeah. I think the interesting two things I would say
like from listening to that about cryptocurrency,
which we reference obliquely all the time
because it's a hot topic.
But I'll say that this thing you mentioned about bridge
is the interesting part too,
is you don't need to put a bounty on them
because the bounty is already there.
Like the money is already there.
So like there are people financially incentivized
to game theoretically to attack those things
and to find the vulnerabilities and make themselves enormously rich, right?
So like it's this very like high stakes, deep copy game, as you kind of mentioned, like
if you transfer that thing, you own that thing and sort of them rolling it back, which is
a whole other thing we could talk about in decentralization.
But you're right.
I think that's really interesting.
And the second thing is the amount of, sort of, game theory and economics that I hear software engineers talk about. Okay, Jason, if you go back and listen to the show, has been talking about economics and his curiosity for 10 years. But, like, now, I feel it brings those things a lot more to the forefront, where people are engaging in a broader space of society than just: I write my program, it compiles, I do my web thing, and I go home. And I think outside of startups, people in big corporations, which I'm included in, you can sometimes shelter yourself, that you just write your code and it's this little, you know, widget in the big thing. And I think in cryptocurrency, whether or not it's good or bad net, like, this fact that software engineers think more broadly about societal impacts, I think is a good thing. I think, like, what will happen will happen, but that broader thinking, I think, is net good.
Yeah.
They also... NeurIPS, which is a conference for neural networks and AI and all of this, they actually forced, or not forced, but they put in their spec
that all the submissions to NeurIPS ought to have a section on societal impact.
And so it's really kind of bringing it to the forefront.
So I mean, it's something everyone has to be aware of now.
Yeah, I think that's so prescient, right?
Like people still want to talk about like cybersecurity and computer security, network security is like this specialized thing.
Today, in 2022, the vast majority of crimes are committed online.
Like way more pervasive than, you know, armed robbery ever was.
It's just because you can automate it,
you can scale it out on these hyperscaler cloud infrastructures.
Yeah. Have you seen the meme about that?
There's this meme where it says,
organized crime in the 1920s.
It's these Sicilian gangsters with the bowler hats and everything. And it's like organized crime in 2020 is just a call center.
Okay. Well, you had perfectly teed up a segue for intrusion detection and prevention and trying to
do pattern matching. And I heard it coming; you had the perfectly teed-up segue to search, and I kind of busted it because I wanted to kind of talk about that. So I'll re-tee it up for you. You were
talking about, you know, sort of looking at signatures of payloads and packets and trying to understand and compare and how it's a very difficult problem. And so
maybe we can pick it back up there. Yeah, it's like the home network was especially hard to try
to build a business around, because they're so different. Like, every home network is different. You know, some people are Xbox families, some people are PlayStation families. Some people are Oculus families, some people are HoloLens. You know, Xbox, Oculus, PlayStation, Nintendo Switch, they've got everything, plus, like, a smart toaster, for crying out loud, you know. And, like, a sous vide. I never get it. Yes. And so, like, differentiating noise from, you know, a high-fidelity threat indicator was extremely difficult. Doing it in a structured
manner was practically impossible for it to be cost efficient. And so search sort of,
it did two things. One thing was it enabled a broader swath of analysts, right? So like people who didn't necessarily know a query language,
you know, in terms of the JSON API or the syntax of MQL, like, they just needed to use a search box or a few search fields. And then the other one was, there were so many different fields. Like, there's no unified schema. Like, we could sort of enforce one in terms of what we were sending from our Linux devices; like, we had initially just Lua writing whatever was in standard out from the IDS to the central repository. And then later, the system we were working on was just like, throw everything in Elasticsearch. Some fields we knew about, some fields we didn't know about. And to filter on, like, a non-deterministic number of fields in a performant fashion, Apache Lucene, the subsystem there, was really powerful, and remains really powerful, for that kind of work: for log exploration and for, you know, finding a needle or a couple of needles in a haystack of several million logs, several billion logs.
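A toy sketch of why this works: an inverted index maps every token, from whatever field it came from, back to the documents containing it, so records with wildly different schemas can still be filtered together. (All the log records below are invented for illustration; Lucene and Elasticsearch do vastly more, but this is the core idea.)

```python
from collections import defaultdict

# Log records with non-uniform fields, as on a heterogeneous home network.
logs = [
    {"device": "router", "msg": "dns query blocked", "severity": "warn"},
    {"device": "camera", "msg": "self-signed certificate presented"},
    {"src_ip": "10.0.0.7", "msg": "port scan detected", "severity": "alert"},
]

# Build the inverted index: token -> set of document ids, across all fields.
index = defaultdict(set)
for doc_id, record in enumerate(logs):
    for value in record.values():
        for token in str(value).lower().split():
            index[token].add(doc_id)

def search(*terms):
    """Ids of docs containing every term, in any field."""
    hits = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*hits) if hits else set()

print(search("port", "scan"))  # {2}
```

Because the index is keyed by token rather than by field, documents never need to agree on a schema up front, which is exactly the "needle in a haystack of heterogeneous logs" situation described above.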
So this sort of like log data you're mentioning with indeterminate fields,
I mean, I guess that gets at people kind of talk about this difference between
structured data versus unstructured data.
So this kind of fields and set length and set things would be
if you were emitting your logs in a very, very, very standard fashion,
then maybe searching for a time range or whatever would be pretty easy.
But if you don't have control over some of that, but you're still collecting it up and looking,
that's where you would start to use something like Lucene? Yes, I would say so. But even if
you structured it, if there were a lot of different combinations and you wanted to give access to
people who had a long list of things they could look for, like maybe
thousands and thousands of potential search queries, then Lucene would also be good for that.
So I always kind of thought Lucene, and I guess I'm wrong, but that's okay, I'll admit it. So I always thought Lucene was, I knew, more sophisticated than this, but what I would say I kind of knew as fuzzy matching. Which is like, I put in a query term, and rather than this equals-equals thing, like I was mentioning in the beginning, which is how we all kind of start, it was doing something a little bit more, and a little bit fancy; okay, a lot fancier. But what you're saying is something more than that. Maybe it does that, but it also does this, like, more freeform text that you can input, sort of going across the different fields of your data. Yeah, that's right. I think the fuzzy matching is an important point, though, because
it goes back to like letting more people like a broader user base of analysts in this case, maybe
or SOC employees, security operations center employees, search. And that's because Lucene implements the Damerau-Levenshtein edit distance algorithm for determining, in a performant manner, if a token is an edit distance of less than or equal to two away from what exists in your corpus or your search text, like your collection. If it's less than or equal to two, it can quickly correct that query to find what you intended to type. That's how the fuzzy works. So, like, say you have a list of companies based in the Bay Area, maybe a few thousand companies based in the Bay Area, and Google is one of them. But you're on your phone, and maybe your fingers are fat like mine, so you type G-P-P-G-L-E. If you change one of those P's to an O, that's an edit distance of one; we've made one edit. You change the other one to an O, now you're at two. You've maxed out, but you match Google now.
And so that's something that Lucene does out of the box. It's pretty good at that. I haven't seen, you know, any other system as widely spread, as widely used rather, for that purpose. I guess now we're sort of shaping up, like, the field of search; I guess we're kind of entering that. Like, if we have, like, a database, and maybe people have seen, like, a SQL query: you're not going to be able to say, like, I want rows which equal Google, and have it match GPPGLE, right? So we're already talking about, like, one layer of more sophistication, where you're sort of, like... let's just call it row scanning, like going through every row and looking for stuff. But I guess, like, some of these things you're saying could be done there, row scanning, but also you needing to do sort of some kind of pre-indexing. And it feels like this is where we move from, like, a programmer doing something else who tacked it on and is trying to find something, to, sort of, like, let's call it, like, a field of study, like something more sophisticated, and building up the systems. Do I kind of have that space right? Like, I don't know. How do people who do search kind of talk about search? People that do it often talk about the differing index types.
So like B-trees have been around for a while and those existing databases and they allow you
to pinpoint, you know, a row based on a field and some matching criteria pretty efficiently, pretty fast.
Inverted index kind of turns it on its head where like you're looking through a bag of words
and you're trying to find all the documents in a collection, all the documents in an index that contain some word. And there's
ranking: TF-IDF. I mean, most people use BM25 today, which TF-IDF is a part of. TF-IDF, that's referring to term frequency times inverse document frequency. So you can think about it: if Marcus appears in one document in a 1,000-document corpus, but Marcus
appears several times in that document, that document is going to be surfaced higher than any... it's going to be the only document surfaced, because Marcus appears there, and many times;
but if Jason appears there, but also appears in another document, maybe the Jason
document about you, like, so you appear in this document as the co-host of a show that I appeared
on in the Marcus document, but you appear in the Jason document many times and you search for Jason,
the document where you appear many times
will appear first. And then the Marcus document will appear second because you appear more in
that second document. And it's about document frequency. But, like, the word "the" would not, right, like, would not be a strongly ranking keyword or search criterion for either of us, because of the inverse document frequency: like, it's punished by the fact that it's in every document. And so most people will pull that one out in most use cases, because it'll be categorized as a stop word.
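The ranking anecdote can be sketched numerically. This toy scorer uses a Lucene-style smoothed IDF; real engines use BM25, which adds document-length normalization and term-frequency saturation on top of the same ingredients. The two documents are invented for illustration:

```python
import math
from collections import Counter

docs = {
    "marcus_doc": "marcus talks about the search show marcus marcus jason".split(),
    "jason_doc": "jason hosts the show jason jason jason".split(),
}
N = len(docs)

def score(term, name):
    tf = Counter(docs[name])[term]                            # term frequency
    df = sum(1 for words in docs.values() if term in words)   # document frequency
    if df == 0:
        return 0.0
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))           # punishes common terms
    return tf * idf

def rank(term):
    return sorted(docs, key=lambda name: score(term, name), reverse=True)

print(rank("jason"))   # jason_doc first: "jason" appears there four times
print(rank("marcus"))  # marcus_doc first: "marcus" appears only there
```

Note how "the", which appears in both documents, gets a low IDF and therefore barely contributes; that is the stop-word effect described above.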
Oh, so this is... okay, in the, I have no idea what year, but, like, early search browsers, search engines, let's say, where you go to a website, you type in... you're looking for documents, I guess in this case that would be web pages. And you end up with those people at the bottom who would put, like, the same word repeated a thousand times, in the same color as the background of the web page or whatever. This is what they were trying to game: they were trying to make a certain word appear far more often in their web page so that it would go up the search results. So, okay, I see. That's very interesting. And when you say stuff like stop
words, and we're like Marcus, and we were talking about this a bit before, but it kind of gets into
a little bit understanding the structure of people's language. Like people talking normally
into this, typing normally into the search box,
rather than sort of curating what they're searching to what they think is in the documents.
So like, as a programmer, if I was going to search something programmable, like I would type in a
very specific enum name or something. But here, we're saying if people are using the word "the", they're kind of talking normally; they're just sort of writing what they would say into the search engine. Yeah. And I think that gets at what I think is the heart of search,
what drives so many people to search and makes it so compelling to me. What many people hope for,
some people explicitly don't want this, but many people hope for, is that computers can embody many human characteristics so they can help us with things.
Like, I don't want to paint my room. I want my computer to do it.
I don't, you know, one of them, you know, I don't want to do my homework.
I want my computer to do it. I don't want to cook dinner. I want my computer to do it.
I don't want to drive. I want my computer to do it.
And so like, it's no coincidence that like the leader in autonomous driving is Waymo,
which is a subsidiary, a spinoff at this point, of the largest search company in the world, right? Or computer vision, which also came out of search. It's because, with search, you can interface with
the computer in a way that you interface with humans day to day. What makes humans so special
is the ability for us to communicate. That's what I hear anyway. So maybe we could improve
our communication, but in a world where you can talk to your computers and get like meaningful responses is really powerful and at the heart of search.
Alexa, Siri, they're both powered, you know, to some extent by Lucene.
Today's sponsor is Mergify. Mergify is a tool for GitHub that prioritizes,
queues, automatically merges, comments,
rebases, updates, labels, backports,
closes, and assigns your pull requests.
Mergify features allow you to automate
what you would normally do manually.
You can secure your code using a merge queue,
automatically merge it, and many more features.
By saving time, you and your team can focus on projects that matter.
Mergify can coordinate with any CI and is fully integrated into GitHub.
They have a startup program that could give your company a 12-month credit to leverage Mergify.
That's up to $21,000 of value.
Start saving time.
Visit Mergify.com to sign up for a demo and get started. Or just follow the link in the show notes. Back to the episode.
Okay, so does Lucene help there? So when you say this, I'm going to mess up the letters: TF-IDF.
And you're sort of saying like, I'm taking the words written, maybe you allow for edit distance. But now when you start
to say, like, we talk as humans and communicate, I start to think like meaning or semantics or
like I'm saying a term and the thing that I want may not actually even contain the words I typed
in. Yeah, I think this is such a good question because I think about it all the time. We talk
about it internally all the time here at my company and on our team. And that is like inferring
human intent and inferring customer intent, user intent. it's a bit of a dark art.
You have to know something about that person to know where they're coming from, like their perspective.
So what makes surfacing this data for a user that's relevant to them before they even know what they're looking for, that's kind of what you're talking about.
It's like find something based on criteria
that doesn't even exist in the corpus; it's all context-driven, right? So if you ever go to, like,
Safari for the first time or if you're paranoid and only use incognito, like they'll ask you,
do you want to allow Google to know your location? And it's interesting. A lot of people, most people are
going to say allow. Some people are going to say don't allow. But if you say allow,
the likelihood is that the relevance of the results that Google shows you will increase.
Because if you search, let's say, Chinese food and Google knows you're on 52nd and Broadway, it's going to show
you Chinese food in Manhattan near 52nd and Broadway. It's not going to show you like the
really dank spots in Flushing. You know, there are some really good spots, if you want to hike out there. But, like, those spots are going to be considered, because they're rated so high, but the weight of proximity, geo-proximity, will pull these other nearby restaurants up. And even some restaurants that aren't necessarily Chinese restaurants, and just have dishes with similar ingredients, are also going to surface.
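One way this weighting can be sketched: blend the text-relevance score with a decaying geo-proximity boost, so a solid nearby match can outrank a better-rated one that is far away. All the restaurant names, scores, and distances below are made up for illustration, and real engines tune these functions far more carefully:

```python
import math

def geo_boost(distance_km, scale_km=5.0):
    # Gaussian-style decay: 1.0 at the user's location, near 0 far away.
    return math.exp(-((distance_km / scale_km) ** 2))

# Hypothetical candidates: a decent match nearby vs. a higher-rated one far off.
restaurants = [
    {"name": "Midtown Chinese", "text_score": 3.0, "distance_km": 0.5},
    {"name": "Flushing favorite", "text_score": 4.5, "distance_km": 15.0},
]

def blended(r):
    # Keep a floor weight so highly rated distant spots are still considered.
    return r["text_score"] * (0.3 + 0.7 * geo_boost(r["distance_km"]))

ranked = sorted(restaurants, key=blended, reverse=True)
print([r["name"] for r in ranked])  # nearby spot ranks first
```

The floor weight (`0.3` here) is the "those spots are going to be considered" part: distance discounts but never fully erases a strong text match.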
So to riff on this, because I think I've thought about this before too: if you typed in "Paris bakery" and the search engine knows nothing about you, should it show you bakeries in Paris? Probably. Like, without knowing anything about you, it should go to Paris and, using all these things you said, give you bakery results. But if you're, like... I don't know, around where I was in the Bay Area, there was a restaurant called Paris Bakery. And so, if it knows I'm in the Bay Area, rather than showing me something which could have been what I was looking for, like, "Paris bakery" should probably match the restaurant named Paris Bakery. But if I'm in Iowa, and there's no Paris Bakery, it probably should still show me the ones in France, not the one in California; that's a relatively, like, unknown, you know, kind of minor restaurant,
not a famous restaurant or anything. And so I think that dynamic you point out is like really interesting. Like, it's more than just knowing at a human level that like,
these words mean this thing, but that meaning isn't universal. A human at a certain time in
a certain place saying those words mean something different than another human in another time in
another place. Yeah, that's right. And like, I think a lot of people think about search in the context
of Google, which is like a generalized search engine, you know, sort of like Bing. I haven't
used Bing in several years, but, like, I think Bing's still out there. Yeah. Like, the problem changes if it's a domain-specific search. So if you're building search for your application, right? Because Paris, in a restaurant search app or a recipe search app or, you know, something along this line, or even a movie search app, Paris might be a synonym for France, or Parisian, French. And so then "Paris bakery" also means "France bakery" and "French bakery".
And for sure, there's many French bakeries everywhere
because the pastries are-
Or anywhere that serves croissants, yeah.
Yeah, right, right.
But there still are French bakeries,
even Parisian bakeries in places in the United States
and obviously
all over France and Paris.
I was just going to say, so maybe if we think about like, you know, I'm an engineer, I'm
building my program, I start to collect a whole bunch of data, you know, and I want,
like you said, either users or analysts need to be able to interact with all the data I'm
gathering.
I guess I should have picked a specific example, but I didn't.
And sort of, I know something about my domain. I know I want to allow this sort of
semantic kind of searching to take place and fuzzy matching and really empower people to
interact and surface results that they want from my specific thing. How do I go, like, what is the
what is the trajectory I would normally go along? So first, you might do like a text search and
SQL query, we talked about this, like, what is the kind of arc of how that happens as an engineer starts to
build up a set of things they want to search? Yeah. So the first thing is indexing
and running your first query: just a default index. Index, run your first query, have an experience, feel it. I'm a big proponent of,
as a product manager, really focusing on how it feels to use a product, right? Because I've been
in the place and I spend 20% of my time doing this today, continuing to do this, like banging
my head into a wall, like trying to really get at it. And when I do that,
I get a sense of where there's opportunity to improve. So when you are setting out to build
that, you need to do it. You need to bang your head really quickly, index some data,
query that index, understand the output, the format of the output, the information,
the metadata that you have exposed or available to you. Then secondly,
you need to refine that index to make it more suitable or useful for your use case, right?
So that might include pointers to a few synonyms collections. That might include language-specific
analyzers like English or French to strip away the stop words and handle the diacritics appropriately.
Diacritics are like those squiggly lines under the C in French.
Today I learned. Or like the accent over.
The A, I think that's right, in Spanish. And so the next thing: once you understand what your language analyzers are, you have your synonyms collection, and maybe some facet fields for the low-cardinality fields. Low cardinality is just how many unique values are in an index, in a corpus.
So, for example, for imagine a movie search engine, there's only like five or six genres
like comedy, horror, drama, fantasy, thriller, you know, something like that.
Or romance.
Or bromance, you know.
Right. You want to make that a facet field, because those are fields that the customers can filter on. Even though there are a hundred thousand movies, that's a good candidate because it's low cardinality relative to the number of documents, to help your customers whittle down to what they're looking for.
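The low-cardinality check Marcus describes can be sketched quickly. This is a minimal illustration, not a real search engine; the corpus, field names, and the 1% threshold are all made-up assumptions:

```python
from collections import Counter

def facet_candidates(docs, fields, max_ratio=0.01):
    """Fields with few unique values relative to the corpus size are
    good candidates for facet (filter) fields."""
    n = len(docs)
    out = []
    for field in fields:
        # Count distinct values for this field across the corpus.
        uniques = len(Counter(d.get(field) for d in docs if field in d))
        if uniques <= 10 or uniques / n <= max_ratio:
            out.append((field, uniques))
    return out

# Toy movie corpus: 100,000 documents, but only five genres.
genres = ["comedy", "horror", "drama", "fantasy", "thriller"]
movies = [{"title": f"Movie {i}", "genre": genres[i % 5]} for i in range(100_000)]
print(facet_candidates(movies, ["title", "genre"]))  # only genre qualifies
```

Here title has 100,000 unique values, so it fails the ratio test, while genre has only five and makes a good filter field.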
But so once you've tried it, you've made an index, the third step could be: go build your UI. But if you're building a really sophisticated system, the third step would be to start to try to understand the queries. So what's interesting about the Paris bakery example is, for a sophisticated system,
it might tag Paris as a location and bakery as a point of interest. It might even draw a polygon on the map. Lucene has these capabilities: via the lat/long query parser, it can draw a polygon.
Really? For Paris?
Yeah, for Paris. And you can read about tessellation in Lucene another day. It's probably the topic of a PhD, actually. I think a few people have done their PhDs in this field; Nick Knize, I think, is one. But yeah, you can draw a bound. You can draw a box. That is Paris. That's a polygon. And then bakery is a point of interest.
So you've got this geolocation and you've got this point of interest. Bakery is a place.
That's a thing. Paris is the constraint on it.
And so how you understand that query shapes like what results are returned.
So you have a geo, like a geo constraint. That would be a filter. You're only showing
bakeries in Paris and in bakery would be like a must have condition.
So like bakery must appear in these documents in my corpus.
And Paris is the only place where I'm looking.
This is in one use case.
Like it depends on what your application is.
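The query shape described here, a must-have text clause plus a non-scoring geo filter, could be written in the style of an Atlas Search compound query. This is a sketch: the field names (`category`, `location`) and the polygon coordinates are illustrative assumptions, not a real schema:

```python
# Rough GeoJSON polygon around central Paris (coordinates are illustrative,
# in longitude/latitude order; first and last points close the ring).
paris_polygon = {
    "type": "Polygon",
    "coordinates": [[
        [2.25, 48.81], [2.42, 48.81], [2.42, 48.90], [2.25, 48.90], [2.25, 48.81],
    ]],
}

# Compound query: "bakery" must match in the text index, and the geo
# constraint is a filter, so it narrows results without affecting scoring.
query = {
    "$search": {
        "compound": {
            "must": [{"text": {"query": "bakery", "path": "category"}}],
            "filter": [{"geoWithin": {"geometry": paris_polygon, "path": "location"}}],
        }
    }
}
print(sorted(query["$search"]["compound"]))
```

The split between `must` and `filter` mirrors what Marcus says: bakery is the thing you're looking for, Paris is only the constraint on where you're looking.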
And so, Paris, like, I mean, what you'll see is there are some limitations: you will get some documents you don't care about. There are going to be grocery stores that have the word bakery in them, but you really wanted to go to a specialty bakery. There are probably going to be coffee shops and cafes in that list, because maybe baked goods or fresh from X bakery will surface.
And so then like there's the next step once you have this query understanding, right, which is about taking the historical user data, like how people have interacted with their search results over time should influence like the future results.
But just so we're clear, understanding that query and constraining it to that geo field and then that point of interest requires a flexible schema, a flexible data model, right? So right then and there, we know that a relational database, in the traditional sense, is just not a good candidate for that.
I mean, listen, I know what some professors are saying at universities about this technology from the 60s, early 70s, about relational databases. They just don't fit the use case of modern computers today and how humans interface most of the time.
Right.
Because, like, Amazon, Google, they're not relying on relational databases much for these
things.
Right.
They're using like this flexible data model that can evolve to meet your needs.
Now we know our customers are querying by geo constraints.
So we need a geo field, geo filter field even.
Then now we know about this point of interest.
So let's have a point of interest field.
Once you have those, you take the user's interaction with the results to understand the
quality and even segmenting those users, right? So some people like to store like a hash of every
user. I mean, we know many companies that do that. I won't name their names. I think probably even
more valid to have buckets like personas, segments, where these are living categorizations, where some
people may be in one segment for part of their customer lifespan, and then they graduate or get
demoted to a different segment or get moved around. Maybe all segments are considered equal,
and they just get moved around over time as their interest and interaction history changes
but that feedback loop is so important to cultivating improved relevance. So you have the categorization, the query understanding. There are lots of NLP
libraries that can help you with that, and different techniques to help you: classification algorithms.
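The feedback loop Marcus outlines, turning historical interactions into boosts on future results, might be sketched like this. The interaction log, document names, and the boost formula are all illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical interaction log: (user, query, clicked document).
interactions = [
    ("u1", "paris bakery", "doc_croissant"),
    ("u2", "paris bakery", "doc_croissant"),
    ("u3", "paris bakery", "doc_grocery"),
]

def click_boosts(log):
    """Turn per-query click counts into multiplicative boost factors."""
    counts = defaultdict(lambda: defaultdict(int))
    for _user, query, doc in log:
        counts[query][doc] += 1
    return {
        q: {doc: 1.0 + c / sum(docs.values()) for doc, c in docs.items()}
        for q, docs in counts.items()
    }

def rerank(query, scored, boosts):
    """Multiply base relevance scores by historical click boosts."""
    b = boosts.get(query, {})
    return sorted(((d, s * b.get(d, 1.0)) for d, s in scored), key=lambda x: -x[1])

boosts = click_boosts(interactions)
# Two documents with equal base scores; clicks break the tie.
print(rerank("paris bakery", [("doc_grocery", 1.0), ("doc_croissant", 1.0)], boosts))
```

With equal base scores, the document users actually clicked on rises to the top; in practice you'd also bucket users into the segments Marcus mentions and keep per-segment boosts.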
Then the next one, you got the click history or the interaction history.
Maybe sounds a little less scary or, you know.
The interaction history, the engagement.
What is driving success of your user and their journey?
And then boost
those results, those keywords. Then the next one, which is sort of a buzzword frontier that I really
caution people not to jump into too heavily, definitely not head first because it'll cost you.
I think it's really important to try to
understand those first few steps and then move to this step, right? So you could call all of,
most of that you could categorize as lexical search, more or less, like matching, text matching,
fuzzy matching, based on lexical criteria, how things appear. Like once they get through the analysis chain,
that's once they've been mutated by Lucene analyzers
for easier retrieval.
So friendship, friendly, friends all match friend,
F-R-I-E-N-D, because that's the root,
that's the stem of it.
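The stemming idea can be shown with a toy suffix stripper. This is nothing like a real Lucene analyzer (Lucene uses proper stemmers like Porter's algorithm); it only illustrates how several surface forms collapse to one stem:

```python
def crude_stem(word):
    """Toy suffix stripper: maps friendship/friendly/friends to friend."""
    for suffix in ("ship", "ly", "s"):
        # Strip a known suffix, but keep a reasonable stem length.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("friendship", "friendly", "friends", "friend"):
    print(w, "->", crude_stem(w))  # all map to "friend"
```

At index time and at query time the same analysis runs, so a query for "friendly" retrieves documents containing "friends".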
So once you get through the analysis,
the indexing where it includes analysis
to the query understanding, to like understanding your users and how they interact.
Even maybe even some automated synonyms based on zero results, because nobody likes zero results.
You go into a store, you ask them, hey, where are these pants? And, you know, because the customer service representative just walks away and doesn't come back and never answers your question.
Like, you're going to let me walk out of here pantsless. Like, that's terrible. That's terrible.
Like, where are like if someone asks you a question, they they are counting on you for an answer. So like the zero results is like an area where I
have seen people have a tremendous impact, just like automating synonym generation or just going
in manually and adding synonyms for zero results. That's why I think the synonyms capability is so
important. But once you get through all of this, let me wave my hands and get everybody excited: it's this era of semantic search. I've been exploring one of the more challenging data sets to search, like explanations of concepts in search that are really advanced.
So one of the people you can blame for, like, Internet Explorer is a former product manager of Netscape: Ben Horowitz. He's also famous for making a ton of great investments in companies and startups early. But back in the day, he was a product manager at Netscape, and they invented, I mean, they popularized, the web browser. And he does this thing in his books where he starts with a rap quote, which I love. And so I'm trying to do that as well. The most recent I've done is from, like, 10 or 12 years ago, maybe. I mostly go further back than that.
And one of the techniques, this literary technique in poetry that's been used for, you know, as long as poetry has been studied, really highlighted for me how semantic search works.
It just sort of reminded me. So this is a big tee-up.
I'm ready. I'm ready.
Yeah. Drake, before he was famous, he had this mixtape, and he's freestyling. Now he's probably the most famous. I mean, I don't get to listen to him that much anymore. I listen to a lot of old stuff these days, electronic music and classical music,
but that's neither here nor there.
Back in the day, he said,
we're getting Seinfeld on some Jerry and Elaine queries.
Let's just say he said queries.
I replaced another word there.
Thank you.
And so we're getting Seinfeld documents on some Jerry and Elaine
queries. And so what he was describing is really how our brains do associations, like how we find
memories stored in our internal databases based on like cues or hints. And that's really how semantic search works. It's really
built on neural networks to train these language models that can then tell you that when Jerry and
Elaine, two first names, appear together, you may actually be looking for Seinfeld because Jerry and Elaine are
two characters in Seinfeld. So they occur together a lot. Like, so you train this language model on
thousands and thousands, billions of documents, really billions of documents where humans have
written texts. Maybe it's scripts, maybe it's reviews, critiques. Maybe it's, like, search applications for streaming.
You know, you have all these corpora, this rich body of text that contain both Seinfeld and Jerry and Elaine.
They might even say Jerry and Elaine Seinfeld. I don't know if that's their last names in the show. I don't watch the show.
But even I, when I hear Jerry and Elaine,
for some reason think Seinfeld.
And so that's really what semantic search is about.
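The association Marcus describes falls out of co-occurrence statistics. This toy corpus and the raw pair counts are a deliberately simplified stand-in for the billions of documents a real language model trains on:

```python
from collections import Counter
from itertools import combinations

docs = [
    "jerry and elaine argue at the seinfeld diner",
    "elaine and jerry seinfeld episode guide",
    "jerry seinfeld stand up special",
    "drake mixtape freestyle",
]

def cooccurrence(corpus):
    """Count, for every word pair, how many documents contain both."""
    pairs = Counter()
    for doc in corpus:
        # sorted() gives each unordered pair a canonical key.
        for a, b in combinations(sorted(set(doc.split())), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cooccurrence(docs)
print(pairs[("jerry", "seinfeld")])  # 3: the name and the show travel together
```

Because "jerry" and "elaine" keep appearing alongside "seinfeld", a model trained on such text learns to surface Seinfeld documents for a Jerry and Elaine query, even with no hand-written synonym.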
Our team is working on releasing
semantic search capabilities.
I mean, they'll be out. They'll be available to users that like reach out to me.
The reason why we're doing it like that is because I want feedback.
I really want to guide people to success.
I don't want them to just assume ML is going to be a panacea for their relevance or result
ordering problems because that's not how it is.
Like, it's really important for an engineer to remind themselves every day that ML doesn't always win.
No, come on. Come on.
Because, like, my grandmother was about ready to throw her shoe at the autonomous driver of a car we were in, you know, in the Bay Area years ago.
We rode in the car and it's trying to make an unprotected left. It's like, Grandma, Grandma, this is the first time it's had an incident. You know, aside from dodging one leaf sitting in the road, thinking it's an obstruction, this is the only thing that was really a problem. And I struggle to make unprotected lefts too, so let's cut it some slack.
I'd like to see what she does to you, too.
Yeah, exactly.
ML doesn't always win, but it can help.
It does have the ability to help. You don't even have to make a synonym that says Jerry and Elaine should match Seinfeld; you can just have this Jerry and Elaine query surface Seinfeld, just like
in our early testing of our feature using open source models. The thing about today,
the reason why we're releasing this feature now as opposed to two years ago, is that I personally felt the quality, the robustness, of the publicly available language models and search engines had reached a point where we could reliably let customers play with and experience the power of what's called semantic
search and not have to have like PhDs in data science or statistics or information science to
really benefit from these techniques. Like there's a ton of models out there from a variety of companies.
And we strongly believe in the open source search engine that powers most people's search applications that are using search engines, because a lot of people are still using databases, for crying out loud. For most people that use search engines to power their search applications, we felt like Lucene had come a long way.
And we also want to contribute to it more
at our company.
Well, that was a great setup.
So we moved through the arc of search
and then you kind of started alluding to
we and our company.
So tell us a little bit about where you're at, what your guys' product is, and give us the elevator pitch about what you're trying to do.
So I'm at MongoDB. And what I'm trying to do is exploit the flexibility and the cognitive continuity of the MongoDB document model
to enable the widest swath of people possible to build advanced search systems.
Based on the steps I just described, I wrote this on my blog recently: the cognitive steeplechase of working with a relational database and the throes of tables and rows.
It's like, so he's telling me I call an API, that API gives me JSON objects.
Then in my programming language, largely object oriented paradigm, I'm building business logic around these objects and
manipulating these objects. And then when I go to store it, I have to like effectively use a bunch
of spreadsheets or file cabinets. That's how tables and rows feel to me. You know, at my company, I made all the use in the world of Postgres and stuff. It was the cognitive dissonance. When you're a smaller company and you need to be as efficient as possible and move as fast as possible, or even for a personal project, or starting a new thing, the cognitive continuity pays serious dividends, because I get frustrated less.
It's like imagine you're writing some code.
You know, you've got your noise canceling headphones on signaling to the world.
Leave me alone.
Not a very good signal, but yes, I know your intent.
Yeah.
Like, please don't bother me. And so you are in the middle of a function that you're describing, or, you know, articulating, expressing. You're like 70% of the way through. If somebody taps you on the shoulder, pulls you out of the conversation you're having with your IDE or text editor for two minutes, you might have lost the next 15, 30 minutes. You're going to come back and be like, where was I? What line was I on? Where's my cursor? Is Vim in insert mode or not? I don't know what to do. I cannot touch the keyboard. Back away from the keyboard. Woosa, woosa. I mean, this is a real thing.
This is a real thing.
And so imagine nobody bothers you.
You just went from the API to the business logic and this object-oriented paradigm to the storage. It's the same thing. It's not tables and rows; I was just thinking about these objects. Boom, boom. And so as a part of your workflow, every 15 minutes,
you lose another 15 minutes. It's crazy to me that the discontinuity between the storage layer
and the logic layer or like the external system, the access layer, the data layer, if you will,
like the discontinuity between the storage layer and most systems.
So that's why I was like, okay, MongoDB, I need to go, I need to return there.
I built a startup on that continuity.
I know how much faster I was able to adapt and innovate and change and mutate and really push out features because I was
kind of on the same wavelength.
Like nobody was interrupting me.
I wasn't interrupting myself.
And so I was like, if I really want to get this out, if my dream is really to get the power of high-dimensional vector space to the world for a variety of
applications. It's not just a search box. Remember, search engines are the technology
underlying so many different technologies, whether it's like fraud prevention,
personalization, recommendations, or like drug discovery, gene profiling and descriptions,
like predicting interactions. Like this could be life-changing stuff. And so the way I'm going
to get this to the world is the path of least resistance, which I certainly see as the document
model that's been pioneered by MongoDB. At this point, they're not the only ones, because people have wised up. They understand now. But they've certainly built the most robust business around this value prop and certainly the most robust managed service, which I think is just
like, it has to be managed, because that's another thing. It's like, okay, if I get the document stuff figured out but then have to go think about racking and stacking servers, I feel like I'm in 2008 and AWS isn't a thing. It's like, come on. There are these managed services. You should use them. Engineering is, for me,
a quest to simplify. Constantly iterate to further simplify. And MongoDB really let me do that. I left. I was in grad school at the University of Michigan, at the School of Information, and I learned a lot of things really quickly. I didn't stay for very long because it was super complex. It's not that I didn't understand what was going on, but it was like, I need to go to the world and tell them: you don't need to do this. Like, I want some people to go to grad school. But with MongoDB, if you're in high school, you can get started.
Like there is a free tier and Atlas is free forever. If you're,
you're mid career, you're 40, you want to keep your skills sharp, you can go
get started. You can load a sample data set for free in a managed service nearby your house.
There's like 90 something regions supported all over the world. You can get started and you can
create a search index with a click of a button or one API call. And then that's really powerful to
me. At the end, it'll say you're all set. And then you get a little button. You can just click it, type into the
search box, view the query syntax, and you're off to the races. And so if we get all of that
out the way and make that part simple for you, then you can understand or even start down the
path of, hey, I want to run every query that comes in,
every text query that my users provide. I want to transform that into a dense vector and submit that
to MongoDB as a vector input and then calculate some similarity, whether it be Euclidean distance or dot product or cosine similarity of that query
vector to my corpus of vectors to surface documents I didn't even know were in this corpus
or relationships I wasn't even aware of. That's super powerful to me. And then you talk about audio, video, like next we're going to be ideas
if we ever get this Neuralink thing. So you're doing my job for me. I was just going to ask.
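The three similarity measures Marcus lists for comparing a query vector to a corpus of vectors can be sketched directly. The tiny 3-dimensional "embeddings" here are illustrative; real models emit hundreds or thousands of dimensions:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy query and document vectors (made up for illustration).
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]

print(cosine(query_vec, doc_close) > cosine(query_vec, doc_far))      # True
print(euclidean(query_vec, doc_close) < euclidean(query_vec, doc_far))  # True
```

Whichever metric you pick, the nearest vectors in the high-dimensional space are the documents the search surfaces, including relationships you didn't know were in the corpus.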
All right. But it sounds like, okay, so Atlas is the managed instance, managed service for MongoDB.
Is that right? Yeah, that's right. Okay, cool. So if people are listening and they want to try,
it sounds like, I mean, you already did that, I'm just repeating it, but they can go sign up on the website, put some example data in there, and start searching. They can really experience that. I think that's an awesome opportunity. For people listening for whom this is not your field today, or even if it is and you're interested, that firsthand experience is one of the best ways to learn, one of the best ways to immerse yourself. And as Marcus was describing, hit your head on the really obvious problems first before setting out to solve the hardest problem, and really advance your understanding and learn what you need to layer on next. I think those were wise words, wise words. Just great points. Thank you. The next question is, what about working at
MongoDB? How is working at MongoDB? Like, are you guys hiring? Do you do internships? Where
could people look? What's the kind of culture like? Just kind of give us the feel about a life
there. So we're permanently hiring. If that gives you any insight into how much work there is to do. We're powering mission-critical infrastructure for a lot of the world.
The whole world runs on MongoDB, seriously.
And so, so many applications you would never even think of.
There used to be this joke,
oh, MongoDB is web scale.
No, it's not.
It's like, whatever.
Yes, it is.
I mean, whatever you define it as,
there's a lot of people using MongoDB at super high volumes.
And so what that means is, as we move to support more of the demanding customers in the world, many of them, I think, something like four out of five financial transactions, something crazy like that, all these really large use cases mean we need a lot of help. And then we're also very focused on the developer. And so we don't go as fast as some of
these companies. Like you'll hear people talk about, as I get into the culture here, you'll
hear people talk about, you know, we ship to production, you know, 12 times a week on average. That's not how we operate because like we're focused
on your experience. Like, how does it feel to work with this database? If we ever start to break that cognitive continuity, which I think is our biggest advantage and worth so much more than people have realized, then I think it becomes
use of your data, to get to a place where you can innovate fast and really start to explore and push
the boundaries and iterate in a lot of features.
So you can rest assured we're not moving super fast in that way.
But relative, I mean, relative to the other database companies or infrastructure companies,
we move pretty fast.
And maybe it's because we have that continuity, right? Ourselves, we work with the database all the time, so we can move faster. We have just a tremendous set of internal tools and applications powered by MongoDB, and that is, I mean, that's one of our competitive advantages: our people working on these systems have the cognitive continuity.
Some teams lean sort of academic in nature. Some teams are newer and maybe play a little more fast and loose.
Doing a lot of experiments, like really trying to figure out what's the right direction, what's the right thing.
There's a mixture of roles ranging from like we have an education engineering team, which like builds like MongoDB University,
which is a free portal where you can
go and learn MongoDB if you're not being taught it in school yet, which I think everyone should
do it because it's free. Or if you're not being taught it on the job, definitely go to MongoDB
University. It's awesome. Helped me tremendously. And then we have Docs Platform. We maintain our
own Docs Platform. And then we have the managed service in the database and the search system and the mobile database and the serverless functions and
like charts. And some of these things go very, very, very in the weeds, you know. But you can rest assured, coming to a place like MongoDB, you're surrounded by world-class minds, world-class talent.
Like, the predominant storage engine in Oracle today was architected by Michael Cahill and Keith Bostic, who wrote it. It was called Sleepycat, and this was in the early 90s or something. Oracle bought Sleepycat. Then, in the 2010s I believe, Michael Cahill and Keith Bostic came back together, started a new company, and they built WiredTiger, a new storage engine. And MongoDB acquired the WiredTiger company.
Keith Bostic, who recently retired, told me that Oracle didn't want the design.
So it's like, which storage engine do you like? Would you prefer Sleepycat or WiredTiger?
It sounds like this great full circle.
So you're really covering all the bases.
Yeah. Yeah. I mean, MongoDB is the next generation of data storage and access, of information retrieval. And so I think that you come to MongoDB to be with the experts on the cutting edge, work with the fastest-growing companies and the fastest-growing database company alive.
Everyone is really nice.
They work really hard to make space for all kinds of people.
Like there are several people, including myself, who might be characterized as a little weird,
but it's fine. I spend all my day thinking. I mean, my brain is an inverted index at this point. And so sometimes I surface documents that you didn't intend to surface. And other times, it's exactly what you're looking for.
You kind of have to be tuned.
And then, you know, we have engineers
in New York City, our headquarters.
Dublin is our EU headquarters.
Dublin and, I mean, Barcelona and Berlin
are both growing engineering offices with openings.
And we have field engineers.
You can come here if you want to write just a little bit of code or a certain subset of code and work as a support engineer or consulting engineer or solutions architect.
If you don't feel you want to be only coding, right, with the software engineering organization.
We also have a large team in India. In Singapore, I believe, we have a pretty big engineering team, and in Australia, and also San Francisco and Austin. And we have field teams all over the world. You know, we follow the sun there. And so it's a pretty exciting place to be. It's very diverse.
I think we're getting better and pretty much every category of diversity. And we don't just
think about like what someone looks like. It's also like, what's their perspective?
Well, that was great, Marcus. And, you know, I think this has been super helpful to me. I mean,
I learned a lot through the podcast today.
This idea you said, it wasn't even like the topic of search, but this like cognitive continuity.
People who work with me hear me talk about like cognitive burden.
They feel like very similar concepts: really helping people make sure that when they're engaged in the flow, that's something worth optimizing for. Making sure that, whatever you're doing, for users, for people writing code, you reduce the number of simultaneous complex concepts that have to be held in your head.
And I'm with you there. And I think this has been really awesome.
I know the listeners are going to really enjoy hearing this tour de force
really of,
of search.
You covered,
I don't know,
probably a thousand Googleable terms and,
you know,
years of study probably to get to the bottom of all of it. So I really appreciate you
coming on and I really thank you for taking time. Thanks, Patrick. I really appreciate it.
Thanks, Jason. Really appreciate it. Yeah, thanks so much. This was absolutely amazing. I mean, there's a ton of terms here. I think we covered a whirlwind of things. So folks, if you have questions, we'll post Twitter handles.
You can add us.
You can add Marcus and ask questions and get engaged.
It's a really interesting topic.
Super, super important.
Yeah, I'll have to plug it again: if you want early access to semantic search capabilities, we're collecting, you know, people to get early access to those capabilities. Because, like I said, we really want to ensure that you're set up for success, that you're not confused, and that you can use it in a way that's cost-effective and allows you to really benefit from the technology.
Well, thank you, everyone. We'll see you next time.
Music by Eric Barndollar Programming Throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license.
You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and I and share alike in kind.