Programming Throwdown - 143: The Evolution of Search with Marcus Eagan
Episode Date: September 26, 2022

Finding something online might seem easy - but as Marcus Eagan tells it, it's not easy to get it right. In today's episode, MongoDB's Staff Product Manager on Atlas Search speaks with Jason and Patrick about his own journey in software development and how to best use search engines to capture user intent.

00:00:34 Introductions
00:01:30 Marcus's unusual origin story
00:05:10 Unsecured IoT devices
00:09:56 How security groupthink can compromise matters
00:12:48 The Target HVAC incident
00:17:32 Business challenges with home networks
00:21:51 Damerau-Levenshtein edit distance factor ≤ 2
00:23:58 How do people who do search talk about search
00:30:35 Inferring human intent before they intend it
00:46:13 Ben Horowitz
00:47:32 Seinfeld as an association exercise
00:52:27 What Marcus is doing at MongoDB
00:58:30 How MongoDB can help at any level
01:01:00 Working at MongoDB
01:08:14 Farewells

Resources mentioned in this episode:

Marcus Eagan:
Website: https://marcussorealheis.medium.com
The Future of Search Is Semantic & Lexical: https://marcussorealheis.medium.com/the-future-of-search-is-semantic-and-lexical-e55cc9973b63
13 Hard Things I Do To Be A Dope Product Manager: https://marcussorealheis.medium.com/13-hard-things-i-do-to-be-a-dope-database-product-manager-7064768505f8
Github: https://github.com/MarcusSorealheis
Twitter: https://twitter.com/marcusforpeace

MongoDB:
Website: https://www.mongodb.com/
Atlas: https://www.mongodb.com/cloud/atlas/register
Careers: https://www.mongodb.com/careers

Others:
Damerau-Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
Lucene: https://lucene.apache.org/core/
Target HVAC Incident (2014, Archive Link): https://archive.is/Wnwob

Mergify:
Website: https://mergify.com/

If you've enjoyed this episode, you can listen to more on Programming Throwdown's website: https://www.programmingthrowdown.com/

Reach out to us via email: programmingthrowdown@gmail.com

You can also follow Programming Throwdown on Facebook | Apple Podcasts | Spotify | Player.FM

Join the discussion on our Discord
Help support Programming Throwdown through our Patreon

★ Support this podcast on Patreon ★
Transcript
Welcome to another episode of the show, everybody. It's
kind of crazy watching the ticker on the episode count go up and up. I always forget how many of
these we've done. It's too many now. We're excited to be here with Marcus. Marcus is the staff
product manager for Atlas Search at MongoDB. Welcome to the show, Marcus. Thank you. Happy
to be here. Though we were doing a little bit of pre-show recording, I know we kind of hinted that we already got pretty excited.
Marcus was helping us understand search, something that personally I've always known as, like, Google, right?
Search engine.
But even though I've done database stuff before, my search always amounts to like equals equals
checks and even getting those wrong because it turns out, yeah, anyways, it's difficult.
So I'm excited
to learn some stuff today, and I'm glad Marcus is here to help us through this. But before we dig into that, we always like to ask people kind of how you got into tech. Your story, kind of like, you know, origin story, Marvel superhero, whatever it wants to be. It can be kind of boring, that's all right. But like, Marcus, how'd you kind of first get into tech? Was there, like, a moment you remember as, like, this is the first time I got excited about computers or programming? Our family's desktop computer, just, like, opening it up. Like, I got some tools, waited till my mom was at work and my dad was cooking, and I started unscrewing, like, the panel. It was a Compaq computer, I remember it like it was yesterday. And, like, when my mom got back from work, you know, I knew I had the same corporal punishment coming to me that I had the first three times. And I didn't mind; like, it was worth it to look inside the computer each time and, you know, pull something random out. But this time my dad intervened, and he was like, you know, maybe we should put him in this program. It was called DAPCEP. It was, like, Detroit area, like, community education something, or, like, pre-engineering, was the program.
It was, like, extracurricular. Every Saturday you'd go up to the university, which was just a mile from my house, University of Detroit, and start learning about a range of topics in engineering, ranging from building basic circuits to programming computers using QBasic. So that was certainly my first time when I was like, okay, this isn't random. There's some logic to this. It's actually pretty cool. And one thing led to another. There was an extracurricular class... Well, my brother was studying computer engineering at first, and I would, like, look through his books with bewilderment and curiosity. Then my girlfriend,
I mean, all we did was hold hands, we were 14. She, like, forced me to go to, like, web development with the calculus teacher in high school. So I went to his class, and, I mean, this stuff just clicked with me right away. In college, I was really focused on some of the... mostly other areas, primarily because I was much weaker in those areas, like research and writing. I found myself researching about computers and, like, the FCC and radio. So then I started to work on mesh networks with my friends, and, like, they're powering, like, internet for protesters in some of the big protests in New York City. We were, you know, then trying to provide internet to migrant communities in Brooklyn when we left school, and then, like, impoverished communities just around the country. It turns out it's super expensive to build your own internet infrastructure, even ad hoc internet infrastructure.
And it would be a challenge today.
There's a lot of cultural inertia
to go out and say,
invest in the hardware
and then invest in the maintenance burden
of maintaining a node in a mesh network.
I think one day that will be important in certain regions in the United States, particularly
like rural regions.
But there's many areas in the world where these sort of networks exist.
But like going from the web development to the networking stack, and, like, the QBasic, really sort of gave me a well-rounded foundation.
And then how I really started to learn was when I struck out on my own and started my own company focused on, like, security for home networks, like IoT. Like, when you plug in your router and then set up all the devices you got for Christmas: some of those devices were made smart very hastily, and they would include, you know, self-signed certificates, for instance, default passwords, or use weak encryption. And, like, your network is only as strong as the weakest link. And I saw a trend in 2015, 2014 really, of people
increasingly working from home and adding more of these devices to their networks. Like I would tell
people, like, your Dropcam, that's probably okay. But, like, the second- or third-tier IP camera? Who knows. Were you helping people, like, lock them down by, you know, kind of just applying best practices? Or were you doing, like, pen testing against them, sort of figuring out, hey, this third-party thing has some backdoor? Or a little bit of both? I'm just trying to kind of position, like, the kind of work you were doing. Yeah, so we were just providing, like, an IDS/IPS system, mostly focused on IDS. So, like, you'd actually buy a hardware device, and then we have a device that sits sort of like a sandbox in between your router and your modem. IDS is intrusion detection system? That's right. Okay. That's right. So, you know, we drop a Linux box, an embedded Linux system, in promiscuous mode in your home, and then... You tell people... like, if you go to someone's house and say, I'm going to put this promiscuous-mode network card in your house, I feel like they're going to think something, and it's not the right thing. Well, that's right. It's better to know who's in promiscuous mode on your network than to not know. Right. Because there may already be some promiscuous devices in there.
And it was just like it was a super hard problem.
I still don't know how big of a problem it is today, even with 100% of people in these knowledge positions, knowledge jobs, you know, working from home in some capacity. You know, like, I think the home remains an attack vector. But, like, that company was bought by another company kind of going after a mesh networking problem, which was funny. I was like, this is going to be really, really hard, folks. And they're like, no, just focus on the security and observability stuff. And I'm like, fine, fine. But, like, you know, I think that remains a hard problem. But as a part of that work, for months, we were just, you know, mostly using out-of-the-box threat signatures, some stuff we tuned. It's very difficult to do this. We were collecting a ton of data, and we had done some interesting things, but searching across these heterogeneous networks proved to be virtually impossible without a search engine.
Oh, OK. Well, hang on. There's a bit of a side topic.
I want to say, this thing you're saying, like, the home network remains a threat, and, like, all these IoT devices. And, like, I myself am unsure how much it is people hooking their IoT device up directly to their ISP's modem, and that's what's getting hacked, versus people with a sensible setup, you know, a router with some basic default firewall rules. And, like, I don't know on balance what it is. But I am shocked. I did a little, very little bit of security stuff in college, you know, just some basic, like, SQL injection, some, like, payload stack overflow stuff, just enough to kind of, like, okay, I get it. But there are people I work with, who went to, you know, four-year degrees at big-name schools, who are like... they don't get it. They're like, oh yeah, hackers are a thing, but, like, my home network's not a big deal. Like, that's not true. Like, you wouldn't actually encounter in the wild, like, a stack overflow that doesn't just crash your computer. Like, the belief that you can deliver a tailored payload automatically via, like, a worm or something: like, they know it exists, but they don't think it'll happen to them. And so I'm, like, shocked how even people who should know better, who even understand the mechanism of how these work, and, by the way, have seen each other's code, like, they still don't believe that that's, like, an ongoing threat. It's actually shocking to me. Yeah, I mean, that's a great question. I think that we as software engineers suffer from groupthink tremendously, and confirmation bias. It's like, if it compiled, it must be good, ship it, you know what I mean? And folks are moving fast. It's a super competitive space. There's a lot of capital flowing into software, so there's a lot of people, like, doing it, right? Chasing that.
And we have seen recently like the cryptocurrency industry is, well, it's great because it's like
you can express money directly in code, whereas like typically it's an abstraction, right?
It's like most ways programmers interface with money is a shallow copy. So like a shallow copy
for those who aren't familiar with this, like it's a reference to something. It doesn't actually
represent the thing that you're pointing to, right? Specifically in memory.
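The shallow-versus-deep-copy distinction Marcus is leaning on can be made concrete in a few lines of Python (a sketch; the variable names here are just for illustration):

```python
import copy

# A shallow copy duplicates the outer container but still references
# the same nested objects; a deep copy duplicates everything.
ledger = {"owner": "alice", "history": [100]}
shallow = copy.copy(ledger)
deep = copy.deepcopy(ledger)

ledger["history"].append(-25)

print(shallow["history"])  # the shallow copy sees the mutation: [100, -25]
print(deep["history"])     # the deep copy is independent: [100]
```

The shallow copy is "a reference to something" in exactly the sense above: it points at the same underlying object rather than holding its own.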
But like in this case, shallow copy of money is like, okay, I'm using Shopify.
I have a Shopify checkout button.
That Shopify checkout button talks to Stripe.
That Stripe API talks to First Data or some payfac, a payment facilitator. And that payfac talks to some mainframe stood up in, like, the 1960s.
Who knows what that's running? Somewhere in Atlanta, probably. And with like the cryptocurrencies,
what I found so fascinating from a security standpoint is you can point to, and I just
started learning Solidity early this year,
like for the second time,
this time it was a lot easier.
Ethereum's programming language.
Yeah, Solidity powers the EVM or allows you to interface with the EVM.
That's right, Ethereum virtual machine.
And like what I found so interesting and profound about it,
and I'm assuming this is how it is in all of them.
I don't, my knowledge is limited there, admittedly, is that you can point directly to value. It's like a deep copy that you're like,
this money exists really in the world. It can be spent for like goods and services and that's in
your code. And so the security mishaps, the mistakes that are inevitable to happen in our software impact people a lot faster.
In the case of that, there's like some folks working on like these bridges between two chains.
And that's a super risky endeavor that I'm staying away from all the way.
And like people lose 400 million and it's gone right away.
And people feel that they know it. They know it,
which is important to me for security. Because if you look at the target hack, you know,
you might remember that big target fiasco 2014, maybe 2013, where the HVAC system,
the customer, I mean, a contractor, was debugging the HVAC system. Their home network was compromised. Their VPN credentials were compromised. Folks with a command and control server somewhere were able to then skim credit cards, because they were on the same network as the HVAC system for whatever reason. I'm sure that person isn't employed by Target anymore. Not the HVAC person, they probably still are. The person who put those on the same network. You know, once those credit card numbers were taken and those credit cards were used, there was a chargeback to Target. Like, the customer, the credit card user, is just like, hey, wait a minute.
These aren't real. Like, or. Or how did they get my information?
Is it because of Target? Target was the last place I bought something and then all this stuff showed
up. So you can see how it's a little convoluted to the consumer, the end user of these computers,
like what was really at stake for me, whereas like it's going to become more
real for, for people today because of the connectedness and the cryptocurrency realm.
And I'm talking about not using exchanges. That's different. That's more like what we're used to,
but I'm saying like, like having a wallet and storing the primitives.
Yeah. I think the interesting two things I would say
like from listening to that about cryptocurrency,
which we reference obliquely all the time
because it's a hot topic.
But I'll say that this thing you mentioned about bridge
is the interesting part too,
is you don't need to put a bounty on them
because the bounty is already there.
Like the money is already there.
So like there are people financially incentivized
to game theoretically to attack those things
and to find the vulnerabilities and make themselves enormously rich, right?
So like it's this very like high stakes, deep copy game, as you kind of mentioned, like
if you transfer that thing, you own that thing and sort of them rolling it back, which is
a whole other thing we could talk about in decentralization.
But you're right.
I think that's really interesting.
And the second thing is the amount of, sort of, game theory and economics that I hear software engineers talk about. Okay, Jason, if you go back and listen to the show, has been talking about economics and his curiosity for 10 years. But, like, now, I feel it brings those things a lot more to the forefront, where people are engaging in a broader space of society than just: I write my program, it compiles, I do my web thing, and I go home. And I think outside of startups, people in big corporations, which I'm included in, you can sometimes shelter yourself, that you just write your code and it's this little, you know, widget in the big thing. And I think in cryptocurrency, whether or not it's good or bad net, like, this fact that software engineers think more broadly about societal impacts, I think is a good thing. I think, like, what will happen will happen, but that broader thinking, I think, is net good.
Yeah.
They also... NeurIPS, which is a conference for neural networks and AI and all of this, they actually forced, or not forced, but they put in their spec
that all the submissions to NeurIPS ought to have a section on societal impact.
And so it's really kind of bringing it to the forefront.
So I mean, it's something everyone has to be aware of now.
Yeah, I think that's so prescient, right?
Like people still want to talk about like cybersecurity and computer security, network security is like this specialized thing.
Today, in 2022, the vast majority of crimes are committed online.
Like way more pervasive than, you know, armed robbery ever was.
It's just because you can automate it,
you can scale it out on these hyperscaler cloud infrastructures.
Yeah. Have you seen the meme about that?
There's this meme where it says,
organized crime in the 1920s.
It's these Sicilian gangsters with the bowler hats and everything. And it's like organized crime in 2020 is just a call center.
Okay. Well, you had perfectly teed up a segue for intrusion detection and prevention and trying to
do pattern matching. And I heard it coming; you had the perfectly teed-up segue to search, and I kind of busted it because I wanted to kind of talk about that. So I'll re-tee it up for you. You were
talking about, you know, sort of looking at signatures of payloads and packets and trying to understand and compare and how it's a very difficult problem. And so
maybe we can pick it back up there. Yeah, it's like the home network was especially hard to try
to build a business around, because they're so different. Like, every home network is different. You know, some people are Xbox families, some people are PlayStation families. Some people are Oculus families, some people are HoloLens. You know, Xbox, Oculus, PlayStation, Nintendo Switch, they've got everything, plus, like, a smart toaster, for crying out loud, you know. And, like, a sous vide. I never get it. Yes. And so, like, differentiating noise from, you know, a high-fidelity threat indicator was extremely difficult. Doing it in a structured
manner was practically impossible for it to be cost efficient. And so search sort of,
it did two things. One thing was it enabled a broader swath of analysts, right? So like people who didn't necessarily know a query language,
you know, in terms of the JSON API or the syntax of MQL, like, they just needed to use a search box or a few search fields. And then the other one was, there were so many different fields. Like, there's no unified schema. Like, we could sort of enforce one in terms of what we were sending from our Linux devices; like, we had initially just Lua writing whatever was in standard out from the IDS to the central repository. And then later, the system we were working on was just like, throw everything in Elasticsearch. Some fields we knew about, some fields we didn't know about. And to filter on, like, a non-deterministic number of fields in a performant fashion, Apache Lucene, the subsystem there, was really powerful, and remains really powerful, for that kind of work: for log exploration and for, you know, finding a needle or a couple of needles in a haystack of several million logs, several billion logs.
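A toy sketch of why this works: an inverted index maps every token, from whatever field it came from, back to the documents containing it, so records with wildly different schemas can still be filtered together. (All the log records below are invented for illustration; Lucene and Elasticsearch do vastly more, but this is the core idea.)

```python
from collections import defaultdict

# Log records with non-uniform fields, as on a heterogeneous home network.
logs = [
    {"device": "router", "msg": "dns query blocked", "severity": "warn"},
    {"device": "camera", "msg": "self-signed certificate presented"},
    {"src_ip": "10.0.0.7", "msg": "port scan detected", "severity": "alert"},
]

# Build the inverted index: token -> set of document ids, across all fields.
index = defaultdict(set)
for doc_id, record in enumerate(logs):
    for value in record.values():
        for token in str(value).lower().split():
            index[token].add(doc_id)

def search(*terms):
    """Ids of docs containing every term, in any field."""
    hits = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*hits) if hits else set()

print(search("port", "scan"))  # {2}
```

Because the index is keyed by token rather than by field, documents never need to agree on a schema up front, which is exactly the "needle in a haystack of heterogeneous logs" situation described above.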
So this sort of like log data you're mentioning with indeterminate fields,
I mean, I guess that gets at people kind of talk about this difference between
structured data versus unstructured data.
So this kind of fields and set length and set things would be
if you were emitting your logs in a very, very, very standard fashion,
then maybe searching for a time range or whatever would be pretty easy.
But if you don't have control over some of that, but you're still collecting it up and looking,
that's where you would start to use something like Lucene? Yes, I would say so. But even if
you structured it, if there were a lot of different combinations and you wanted to give access to
people who had a long list of things they could look for, like maybe
thousands and thousands of potential search queries, then Lucene would also be good for that.
So I always kind of thought Lucene, and I guess I'm wrong, but that's okay, I'll admit it. So I always thought Lucene was, I knew, more sophisticated than this, but what I would say I kind of knew as fuzzy matching. Which is like, I put in a query term, and rather than this equals-equals thing, like I was mentioning in the beginning, which is how we all kind of start, it was doing something a little bit more, and a little bit fancy; okay, a lot fancier. But what you're saying is something more than that. Maybe it does that, but it also does this, like, more freeform text that you can input, sort of going across the different fields of your data. Yeah, that's right. I think the fuzzy matching is an important point, though, because
it goes back to like letting more people like a broader user base of analysts in this case, maybe
or SOC employees, security operations center employees, search. And that's because Lucene implements the Damerau-Levenshtein edit distance algorithm for determining, in a performant manner, if a token is an edit distance of less than or equal to two away from what exists in your corpus or your search text, like your collection. If it's less than or equal to two, it can quickly correct that query to find what you intended to type. That's how the fuzzy works. So, like, say you have a list of companies based in the Bay Area, maybe a few thousand companies based in the Bay Area, and Google is one of them. But you're on your phone, and maybe your fingers are fat like mine, so you type G-P-P-G-L-E. If you change one of those P's to an O, that's an edit distance of one; we've made one edit. You change the other one to an O, now you're at two. You've maxed out, but you match Google now.
And so that's something that Lucene does out of the box. It's pretty good at that. I haven't seen, you know, any other system as widely spread, as widely used rather, for that purpose. I guess now we're sort of shaping up, like, the field of search; I guess we're kind of entering that. Like, if we have, like, a database, and maybe people have seen, like, a SQL query: you're not going to be able to say, like, I want rows which equal Google, and have it match GPPGLE, right? So we're already talking about, like, one layer of more sophistication, where you're sort of, like... let's just call it row scanning, like going through every row and looking for stuff. But I guess, like, some of these things you're saying could be done there, row scanning, but also you needing to do sort of some kind of pre-indexing. And it feels like this is where we move from, like, a programmer doing something else who tacked it on and is trying to find something, to, sort of, like, let's call it, like, a field of study, like something more sophisticated, and building up the systems. Do I kind of have that space right? Like, I don't know. How do people who do search kind of talk about search? People that do it often talk about the differing index types.
So like B-trees have been around for a while and those existing databases and they allow you
to pinpoint, you know, a row based on a field and some matching criteria pretty efficiently, pretty fast.
Inverted index kind of turns it on its head where like you're looking through a bag of words
and you're trying to find all the documents in a collection, all the documents in an index that contain some word. And there's
ranking: TF-IDF. I mean, most people use BM25 today, which TF-IDF is a part of. TF-IDF, that's referring to term frequency times inverse document frequency. So you can think about it: if Marcus appears in one document in a 1,000-document corpus, but Marcus
appears several times in that document, that document is going to be surfaced higher than any... it's going to be the only document surfaced, because Marcus appears there, and many times;
but if Jason appears there, but also appears in another document, maybe the Jason
document about you, like, so you appear in this document as the co-host of a show that I appeared
on in the Marcus document, but you appear in the Jason document many times and you search for Jason,
the document where you appear many times
will appear first. And then the Marcus document will appear second because you appear more in
that second document. And it's about document frequency. But, like, the word "the" would not, right, like, would not be a strongly ranking keyword or search criterion for either of us, because of the inverse document frequency: like, it's punished by the fact that it's in every document. And so most people will pull that one out in most use cases, because it'll be categorized as a stop word.
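The ranking anecdote can be sketched numerically. This toy scorer uses a Lucene-style smoothed IDF; real engines use BM25, which adds document-length normalization and term-frequency saturation on top of the same ingredients. The two documents are invented for illustration:

```python
import math
from collections import Counter

docs = {
    "marcus_doc": "marcus talks about the search show marcus marcus jason".split(),
    "jason_doc": "jason hosts the show jason jason jason".split(),
}
N = len(docs)

def score(term, name):
    tf = Counter(docs[name])[term]                            # term frequency
    df = sum(1 for words in docs.values() if term in words)   # document frequency
    if df == 0:
        return 0.0
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))           # punishes common terms
    return tf * idf

def rank(term):
    return sorted(docs, key=lambda name: score(term, name), reverse=True)

print(rank("jason"))   # jason_doc first: "jason" appears there four times
print(rank("marcus"))  # marcus_doc first: "marcus" appears only there
```

Note how "the", which appears in both documents, gets a low IDF and therefore barely contributes; that is the stop-word effect described above.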
Oh, so this is... okay, in the, I have no idea what year, but, like, early search browsers, search engines, let's say, where you go to a website, you type in... you're looking for documents, I guess in this case that would be web pages. And you end up with those people at the bottom who would put, like, the same word repeated a thousand times, in the same color as the background of the web page or whatever. This is what they were trying to game: they were trying to make a certain word appear far more often in their web page so that it would go up the search results. So, okay, I see. That's very interesting. And when you say stuff like stop
words, and we're like Marcus, and we were talking about this a bit before, but it kind of gets into
a little bit understanding the structure of people's language. Like people talking normally
into this, typing normally into the search box,
rather than sort of curating what they're searching to what they think is in the documents.
So like, as a programmer, if I was going to search something programmable, like I would type in a
very specific enum name or something. But here, we're saying if people are using the word "the", they're kind of talking normally; they're just sort of writing what they would say into the search engine. Yeah. And I think that gets at what I think is the heart of search,
what drives so many people to search and makes it so compelling to me. What many people hope for,
some people explicitly don't want this, but many people hope for, is that computers can embody many human characteristics so they can help us with things.
Like, I don't want to paint my room. I want my computer to do it.
I don't, you know, one of them, you know, I don't want to do my homework.
I want my computer to do it. I don't want to cook dinner. I want my computer to do it.
I don't want to drive. I want my computer to do it.
And so like, it's no coincidence that like the leader in autonomous driving is Waymo,
which is a subsidiary, a spinoff at this point, of the largest search company in the world, right? Or computer vision, which also came out of search. It's because, with search, you can interface with
the computer in a way that you interface with humans day to day. What makes humans so special
is the ability for us to communicate. That's what I hear anyway. So maybe we could improve
our communication, but in a world where you can talk to your computers and get like meaningful responses is really powerful and at the heart of search.
Alexa, Siri, they're both powered, you know, to some extent by Lucene.
Today's sponsor is Mergify. Mergify is a tool for GitHub that prioritizes,
queues, automatically merges, comments,
rebases, updates, labels, backports,
closes, and assigns your pull requests.
Mergify features allow you to automate
what you would normally do manually.
You can secure your code using a merge queue,
automatically merge it, and many more features.
By saving time, you and your team can focus on projects that matter.
Mergify can coordinate with any CI and is fully integrated into GitHub.
They have a startup program that could give your company a 12-month credit to leverage Mergify.
That's up to $21,000 of value.
Start saving time.
Visit Mergify.com to sign up for a demo and get started. Or just follow the link in the show notes. Back to the episode.
Okay, so does Lucene help there? So when you say this, I'm going to mess up the letters: TF-IDF.
And you're sort of saying like, I'm taking the words written, maybe you allow for edit distance. But now when you start
to say, like, we talk as humans and communicate, I start to think like meaning or semantics or
like I'm saying a term and the thing that I want may not actually even contain the words I typed
in. Yeah, I think this is such a good question because I think about it all the time. We talk
about it internally all the time here at my company and on our team. And that is like inferring
human intent and inferring customer intent, user intent. it's a bit of a dark art.
You have to know something about that person to know where they're coming from, like their perspective.
So what makes surfacing this data for a user that's relevant to them before they even know what they're looking for, that's kind of what you're talking about.
It's like find something based on criteria
that doesn't even exist in the corpus; it's all context-driven, right? So if you ever go to, like,
Safari for the first time or if you're paranoid and only use incognito, like they'll ask you,
do you want to allow Google to know your location? And it's interesting. A lot of people, most people are
going to say allow. Some people are going to say don't allow. But if you say allow,
the likelihood is that the relevance of the results that Google shows you will increase.
Because if you search, let's say, Chinese food and Google knows you're on 52nd and Broadway, it's going to show
you Chinese food in Manhattan near 52nd and Broadway. It's not going to show you like the
really dank spots in Flushing. You know, there are some really good spots, if you want to hike out there. But, like, those spots are going to be considered, because they're rated so high, but the weight of proximity, geo-proximity, will pull these other nearby restaurants up. And even some restaurants that aren't necessarily Chinese restaurants, and just have dishes with similar ingredients, are also going to surface.
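One way this weighting can be sketched: blend the text-relevance score with a decaying geo-proximity boost, so a solid nearby match can outrank a better-rated one that is far away. All the restaurant names, scores, and distances below are made up for illustration, and real engines tune these functions far more carefully:

```python
import math

def geo_boost(distance_km, scale_km=5.0):
    # Gaussian-style decay: 1.0 at the user's location, near 0 far away.
    return math.exp(-((distance_km / scale_km) ** 2))

# Hypothetical candidates: a decent match nearby vs. a higher-rated one far off.
restaurants = [
    {"name": "Midtown Chinese", "text_score": 3.0, "distance_km": 0.5},
    {"name": "Flushing favorite", "text_score": 4.5, "distance_km": 15.0},
]

def blended(r):
    # Keep a floor weight so highly rated distant spots are still considered.
    return r["text_score"] * (0.3 + 0.7 * geo_boost(r["distance_km"]))

ranked = sorted(restaurants, key=blended, reverse=True)
print([r["name"] for r in ranked])  # nearby spot ranks first
```

The floor weight (`0.3` here) is the "those spots are going to be considered" part: distance discounts but never fully erases a strong text match.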
So to riff on this, because I think I've thought about this before too: if you typed in "Paris bakery" and the search engine knows nothing about you, should it show you bakeries in Paris? Probably. Like, without knowing anything about you, it should go to Paris and, using all these things you said, give you bakery results. But if you're, like... I don't know, around where I was in the Bay Area, there was a restaurant called Paris Bakery. And so, if it knows I'm in the Bay Area, rather than showing me something which could have been what I was looking for, like, "Paris bakery" should probably match the restaurant named Paris Bakery. But if I'm in Iowa, and there's no Paris Bakery, it probably should still show me the ones in France, not the one in California; that's a relatively, like, unknown, you know, kind of minor restaurant,
not a famous restaurant or anything. And so I think that dynamic you point out is like really interesting. Like, it's more than just knowing at a human level that like,
these words mean this thing, but that meaning isn't universal. A human at a certain time in
a certain place saying those words mean something different than another human in another time in
another place. Yeah, that's right. And like, I think a lot of people think about search in the context
of Google, which is like a generalized search engine, you know, sort of like Bing. I haven't
used Bing in several years, but, like, I think Bing's still out there. Yeah. Like, the problem changes if it's a domain-specific search. So if you're building search for your application, right? Because Paris, in a restaurant search app or a recipe search app or, you know, something along this line, or even a movie search app, Paris might be a synonym for France, or Parisian, French. And so then "Paris bakery" also means "France bakery" and "French bakery".
And for sure, there's many French bakeries everywhere
because the pastries are-
Or anywhere that serves croissants, yeah.
Yeah, right, right.
But there still are French bakeries,
even Parisian bakeries in places in the United States
and obviously
all over France and Paris.
I was just going to say, so maybe if we think about like, you know, I'm an engineer, I'm
building my program, I start to collect a whole bunch of data, you know, and I want,
like you said, either users or analysts need to be able to interact with all the data I'm
gathering.
I guess I should have picked a specific example, but I didn't.
And sort of, I know something about my domain. I know I want to allow this sort of
semantic kind of searching to take place and fuzzy matching and really empower people to
interact and surface results that they want from my specific thing. How do I go, like, what is the
what is the trajectory I would normally go along? So first, you might do like a text search and
SQL query, we talked about this, like, what is the kind of arc of how that happens as an engineer starts to
build up a set of things they want to search? Yeah. So the first thing is indexing
and running your first query: just a default index. Index, run your first query, have an experience, feel it. I'm a big proponent of,
as a product manager, really focusing on how it feels to use a product, right? Because I've been
in the place and I spend 20% of my time doing this today, continuing to do this, like banging
my head into a wall, like trying to really get at it. And when I do that,
I get a sense of where there's opportunity to improve. So when you are setting out to build
that, you need to do it. You need to bang your head really quickly, index some data,
query that index, understand the output, the format of the output, the information,
the metadata that you have exposed or available to you. Then secondly,
you need to refine that index to make it more suitable or useful for your use case, right?
So that might include pointers to a few synonyms collections. That might include language-specific
analyzers like English or French to strip away the stop words and handle the diacritics appropriately.
Diacritics are like those squiggly lines under the C in French.
Today I learned. Or like the accent over.
The A, I think that's right, in Spanish. And so the next thing: once you understand what your language analyzers are, you have your synonyms collection, and maybe some facet fields for the low-cardinality fields. Low cardinality is just how many unique values are in an index, in a corpus.
So, for example, for imagine a movie search engine, there's only like five or six genres
like comedy, horror, drama, fantasy, thriller, you know, something like that.
Or romance.
Or bromance, you know.
Right. You want to make that a facet field, because those are fields that the customers can filter on. Even though there are a hundred thousand movies, that's a good candidate because it's low cardinality relative to the number of documents, to help your customers whittle down to what they're looking for.
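The low-cardinality check Marcus describes can be sketched quickly. This is a minimal illustration, not a real search engine; the corpus, field names, and the 1% threshold are all made-up assumptions:

```python
from collections import Counter

def facet_candidates(docs, fields, max_ratio=0.01):
    """Fields with few unique values relative to the corpus size are
    good candidates for facet (filter) fields."""
    n = len(docs)
    out = []
    for field in fields:
        # Count distinct values for this field across the corpus.
        uniques = len(Counter(d.get(field) for d in docs if field in d))
        if uniques <= 10 or uniques / n <= max_ratio:
            out.append((field, uniques))
    return out

# Toy movie corpus: 100,000 documents, but only five genres.
genres = ["comedy", "horror", "drama", "fantasy", "thriller"]
movies = [{"title": f"Movie {i}", "genre": genres[i % 5]} for i in range(100_000)]
print(facet_candidates(movies, ["title", "genre"]))  # only genre qualifies
```

Here title has 100,000 unique values, so it fails the ratio test, while genre has only five and makes a good filter field.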
But so once you've tried it, you've made an index, the third step could be: go build your UI. But if you're building a really sophisticated system, the third step would be to start to try to understand the queries. So what's interesting about the Paris bakery example is, for a sophisticated system,
it might tag Paris as a location and bakery as a point of interest. It might even draw a polygon on the map. Lucene has these capabilities: via the lat/long query parser, it can draw a polygon.
Really? For Paris?
Yeah, for Paris. And you can read about tessellation in Lucene another day. It's probably the topic of a PhD, actually. I think a few people have done their PhDs in this field; Nick Knize, I think, is one. But yeah, you can draw a bound. You can draw a box. That is Paris. That's a polygon. And then bakery is a point of interest.
So you've got this geolocation and you've got this point of interest. Bakery is a place.
That's a thing. Paris is the constraint on it.
And so how you understand that query shapes like what results are returned.
So you have a geo, like a geo constraint. That would be a filter. You're only showing
bakeries in Paris and in bakery would be like a must have condition.
So like bakery must appear in these documents in my corpus.
And Paris is the only place where I'm looking.
This is in one use case.
Like it depends on what your application is.
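The query shape described here, a must-have text clause plus a non-scoring geo filter, could be written in the style of an Atlas Search compound query. This is a sketch: the field names (`category`, `location`) and the polygon coordinates are illustrative assumptions, not a real schema:

```python
# Rough GeoJSON polygon around central Paris (coordinates are illustrative,
# in longitude/latitude order; first and last points close the ring).
paris_polygon = {
    "type": "Polygon",
    "coordinates": [[
        [2.25, 48.81], [2.42, 48.81], [2.42, 48.90], [2.25, 48.90], [2.25, 48.81],
    ]],
}

# Compound query: "bakery" must match in the text index, and the geo
# constraint is a filter, so it narrows results without affecting scoring.
query = {
    "$search": {
        "compound": {
            "must": [{"text": {"query": "bakery", "path": "category"}}],
            "filter": [{"geoWithin": {"geometry": paris_polygon, "path": "location"}}],
        }
    }
}
print(sorted(query["$search"]["compound"]))
```

The split between `must` and `filter` mirrors what Marcus says: bakery is the thing you're looking for, Paris is only the constraint on where you're looking.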
And so, Paris, like, I mean, what you'll see is there are some limitations: you will get some documents you don't care about. There are going to be grocery stores that have the word bakery in them, but you really wanted to go to a specialty bakery. There are probably going to be coffee shops and cafes in that list, because maybe baked goods or fresh from X bakery will surface.
And so then like there's the next step once you have this query understanding, right, which is about taking the historical user data, like how people have interacted with their search results over time should influence like the future results.
But just so we're clear, understanding that query and constraining it to that geo field and then that point of interest requires a flexible schema, a flexible data model, right? So right then and there, we know that a relational database, in the traditional sense, is just not a good candidate for that.
I mean, listen, I know what some professors are saying at universities about this technology from the 60s, early 70s, about relational databases. They just don't fit the use case of modern computers today and how humans interface most of the time.
Right.
Because, like, Amazon, Google, they're not relying on relational databases much for these
things.
Right.
They're using like this flexible data model that can evolve to meet your needs.
Now we know our customers are querying by geo constraints.
So we need a geo field, geo filter field even.
Then now we know about this point of interest.
So let's have a point of interest field.
Once you have those, you take the user's interaction with the results to understand the
quality and even segmenting those users, right? So some people like to store like a hash of every
user. I mean, we know many companies that do that. I won't name their names. I think probably even
more valid to have buckets like personas, segments, where these are living categorizations, where some
people may be in one segment for part of their customer lifespan, and then they graduate or get
demoted to a different segment or get moved around. Maybe all segments are considered equal,
and they just get moved around over time as their interest and interaction history changes
but that feedback loop is so important to cultivating improved relevance. So you have the categorization, the query understanding. There are lots of NLP
libraries that can help you with that, and different techniques to help you: classification algorithms.
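The feedback loop Marcus outlines, turning historical interactions into boosts on future results, might be sketched like this. The interaction log, document names, and the boost formula are all illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical interaction log: (user, query, clicked document).
interactions = [
    ("u1", "paris bakery", "doc_croissant"),
    ("u2", "paris bakery", "doc_croissant"),
    ("u3", "paris bakery", "doc_grocery"),
]

def click_boosts(log):
    """Turn per-query click counts into multiplicative boost factors."""
    counts = defaultdict(lambda: defaultdict(int))
    for _user, query, doc in log:
        counts[query][doc] += 1
    return {
        q: {doc: 1.0 + c / sum(docs.values()) for doc, c in docs.items()}
        for q, docs in counts.items()
    }

def rerank(query, scored, boosts):
    """Multiply base relevance scores by historical click boosts."""
    b = boosts.get(query, {})
    return sorted(((d, s * b.get(d, 1.0)) for d, s in scored), key=lambda x: -x[1])

boosts = click_boosts(interactions)
# Two documents with equal base scores; clicks break the tie.
print(rerank("paris bakery", [("doc_grocery", 1.0), ("doc_croissant", 1.0)], boosts))
```

With equal base scores, the document users actually clicked on rises to the top; in practice you'd also bucket users into the segments Marcus mentions and keep per-segment boosts.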
Then the next one, you got the click history or the interaction history.
Maybe sounds a little less scary or, you know.
The interaction history, the engagement.
What is driving success of your user and their journey?
And then boost
those results, those keywords. Then the next one, which is sort of a buzzword frontier that I really
caution people not to jump into too heavily, definitely not head first because it'll cost you.
I think it's really important to try to
understand those first few steps and then move to this step, right? So you could call all of,
most of that you could categorize as lexical search, more or less, like matching, text matching,
fuzzy matching, based on lexical criteria, how things appear. Like once they get through the analysis chain,
that's once they've been mutated by Lucene analyzers
for easier retrieval.
So friendship, friendly, friends all match friend,
F-R-I-E-N-D, because that's the root,
that's the stem of it.
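The stemming idea can be shown with a toy suffix stripper. This is nothing like a real Lucene analyzer (Lucene uses proper stemmers like Porter's algorithm); it only illustrates how several surface forms collapse to one stem:

```python
def crude_stem(word):
    """Toy suffix stripper: maps friendship/friendly/friends to friend."""
    for suffix in ("ship", "ly", "s"):
        # Strip a known suffix, but keep a reasonable stem length.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ("friendship", "friendly", "friends", "friend"):
    print(w, "->", crude_stem(w))  # all map to "friend"
```

At index time and at query time the same analysis runs, so a query for "friendly" retrieves documents containing "friends".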
So once you get through the analysis,
the indexing where it includes analysis
to the query understanding, to like understanding your users and how they interact.
Even maybe even some automated synonyms based on zero results, because nobody likes zero results.
You go into a store, you ask them, hey, where are these pants? And, you know, because the customer service representative just walks away and doesn't come back and never answers your question.
Like, you're going to let me walk out of here pantsless. Like, that's terrible. That's terrible.
Like, where are like if someone asks you a question, they they are counting on you for an answer. So like the zero results is like an area where I
have seen people have a tremendous impact, just like automating synonym generation or just going
in manually and adding synonyms for zero results. That's why I think the synonyms capability is so
important. But once you get through all of this, let me wave my hands and get everybody excited: it's this era of semantic search. I've been exploring one of the more challenging data sets to search, like explanations of concepts in search that are really advanced.
So one of the people you can blame for, like, Internet Explorer is a former product manager of Netscape: Ben Horowitz. He's also famous for making a ton of great investments in companies and startups early. But back in the day, he was a product manager at Netscape, and they invented, I mean, they popularized, the web browser. And he does this thing in his books where he starts with a rap quote, which I love. And so I'm trying to do that as well. The most recent I've done is from, like, 10 or 12 years ago, maybe. I mostly go further back than that.
And one of the techniques, this literary technique in poetry that's been used for, you know, as long as poetry has been studied, really highlighted for me how semantic search works.
It just sort of reminded me. So this is a big tee-up.
I'm ready. I'm ready.
Yeah. Drake, before he was famous, he had this mixtape, and he's freestyling. Now he's probably the most famous. I mean, I don't get to listen to him that much anymore. I listen to a lot of old stuff these days, electronic music and classical music,
but that's neither here nor there.
Back in the day, he said,
we're getting Seinfeld on some Jerry and Elaine queries.
Let's just say he said queries.
I replaced another word there.
Thank you.
And so we're getting Seinfeld documents on some Jerry and Elaine
queries. And so what he was describing is really how our brains do associations, like how we find
memories stored in our internal databases based on like cues or hints. And that's really how semantic search works. It's really
built on neural networks to train these language models that can then tell you that when Jerry and
Elaine, two first names, appear together, you may actually be looking for Seinfeld because Jerry and Elaine are
two characters in Seinfeld. So they occur together a lot. Like, so you train this language model on
thousands and thousands, billions of documents, really billions of documents where humans have
written texts. Maybe it's scripts, maybe it's reviews, critiques. Maybe it's, like, search applications for streaming.
You know, you have all these corpora, this rich body of text that contain both Seinfeld and Jerry and Elaine.
They might even say Jerry and Elaine Seinfeld. I don't know if that's their last names in the show. I don't watch the show.
But even I, when I hear Jerry and Elaine,
for some reason think Seinfeld.
And so that's really what semantic search is about.
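The association Marcus describes falls out of co-occurrence statistics. This toy corpus and the raw pair counts are a deliberately simplified stand-in for the billions of documents a real language model trains on:

```python
from collections import Counter
from itertools import combinations

docs = [
    "jerry and elaine argue at the seinfeld diner",
    "elaine and jerry seinfeld episode guide",
    "jerry seinfeld stand up special",
    "drake mixtape freestyle",
]

def cooccurrence(corpus):
    """Count, for every word pair, how many documents contain both."""
    pairs = Counter()
    for doc in corpus:
        # sorted() gives each unordered pair a canonical key.
        for a, b in combinations(sorted(set(doc.split())), 2):
            pairs[(a, b)] += 1
    return pairs

pairs = cooccurrence(docs)
print(pairs[("jerry", "seinfeld")])  # 3: the name and the show travel together
```

Because "jerry" and "elaine" keep appearing alongside "seinfeld", a model trained on such text learns to surface Seinfeld documents for a Jerry and Elaine query, even with no hand-written synonym.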
Our team is working on releasing
semantic search capabilities.
I mean, they'll be out. They'll be available to users that like reach out to me.
The reason why we're doing it like that is because I want feedback.
I really want to guide people to success.
I don't want them to just assume ML is going to be a panacea for their relevance or result
ordering problems because that's not how it is.
Like, it's really important for an engineer to remind themselves every day that ML doesn't always win.
No, come on. Come on.
Because, like, my grandmother was about ready to throw her shoe at the autonomous driver of a car we were in, you know, in the Bay Area years ago.
We rode in the car and it's trying to make an unprotected left. It's like, Grandma, Grandma, this is the first time it's had an incident. You know, aside from dodging one leaf sitting in the road, thinking it's an obstruction, this is the only thing that was really a problem. And I struggle to make unprotected lefts too, so let's cut it some slack.
I'd like to see what she does to you, too.
Yeah, exactly.
ML doesn't always win, but it can help.
It does have the ability to help. You don't even have to make a synonym that says Jerry and Elaine should match Seinfeld; you can just have this Jerry and Elaine query surface Seinfeld, just like
in our early testing of our feature using open source models. The thing about today,
the reason why we're releasing this feature now as opposed to two years ago, is that I personally felt the quality, the robustness, of the publicly available language models and search engines had reached a point where we could reliably let customers play with and experience the power of what's called semantic
search and not have to have like PhDs in data science or statistics or information science to
really benefit from these techniques. Like there's a ton of models out there from a variety of companies.
And we strongly believe in the open source search engine that powers most people's search applications that are using search engines, because a lot of people are still using databases, for crying out loud. For most people that use search engines to power their search applications, we felt like Lucene had come a long way.
And we also want to contribute to it more
at our company.
Well, that was a great setup.
So we moved through the arc of search
and then you kind of started alluding to
we and our company.
So tell us a little bit about where you're at, what your guys' product is, and give us the elevator pitch about what you're trying to do.
So I'm at MongoDB. And what I'm trying to do is exploit the flexibility and the cognitive continuity of the MongoDB document model
to enable the widest swath of people possible to build advanced search systems.
Based on the steps I just described, I wrote this on my blog recently: the cognitive steeplechase of working with a relational database and the throes of tables and rows.
It's like, so he's telling me I call an API, that API gives me JSON objects.
Then in my programming language, largely object oriented paradigm, I'm building business logic around these objects and
manipulating these objects. And then when I go to store it, I have to like effectively use a bunch
of spreadsheets or file cabinets. That's how tables and rows feel to me. You know, at my company, I made all the use in the world of Postgres and stuff. It was the cognitive dissonance. When you're a smaller company and you need to be as efficient as possible and move as fast as possible, or even for a personal project, or starting a new thing, the cognitive continuity pays serious dividends, because I get frustrated less.
It's like imagine you're writing some code.
You know, you've got your noise canceling headphones on signaling to the world.
Leave me alone.
Not a very good signal, but yes, I know your intent.
Yeah.
Like, please don't bother me. And so you are in the middle of a function that you're describing, or, you know, articulating, expressing. You're like 70% of the way through. If somebody taps you on the shoulder, pulls you out of the conversation you're having with your IDE or text editor for two minutes, you might have lost the next 15, 30 minutes. You're going to come back and be like, where was I? What line was I on? Where's my cursor? Is Vim in insert mode or not? I don't know what to do. I cannot touch the keyboard. Back away from the keyboard. Woosa, woosa. I mean, this is a real thing.
This is a real thing.
And so imagine nobody bothers you.
You just went from the API to the business logic and this object-oriented paradigm to the storage. It's the same thing. It's not tables and rows; I was just thinking about these objects. Boom, boom. And so as a part of your workflow, every 15 minutes,
you lose another 15 minutes. It's crazy to me that the discontinuity between the storage layer
and the logic layer or like the external system, the access layer, the data layer, if you will,
like the discontinuity between the storage layer and most systems.
So that's why I was like, okay, MongoDB, I need to go, I need to return there.
I built a startup on that continuity.
I know how much faster I was able to adapt and innovate and change and mutate and really push out features because I was
kind of on the same wavelength.
Like nobody was interrupting me.
I wasn't interrupting myself.
And so I was like, if I really want to get this out, if my dream is really to get the power of high-dimensional vector space to the world for a variety of
applications. It's not just a search box. Remember, search engines are the technology
underlying so many different technologies, whether it's like fraud prevention,
personalization, recommendations, or like drug discovery, gene profiling and descriptions,
like predicting interactions. Like this could be life-changing stuff. And so the way I'm going
to get this to the world is the path of least resistance, which I certainly see as the document
model that's been pioneered by MongoDB. At this point, they're not the only ones, because people have wised up. They understand now. But they've certainly built the most robust business around this value prop and certainly the most robust managed service, which I think is just
like, it has to be managed, because that's another thing. It's like, okay, if I get the document stuff figured out but then have to go think about racking and stacking servers, I feel like I'm in 2008 and AWS isn't a thing. It's like, come on. There are these managed services. You should use them. Engineering is, for me,
a quest to simplify. Constantly iterate to further simplify. And MongoDB really let me do that. I left. I was in grad school at the University of Michigan, at the School of Information, and I learned a lot of things really quickly. I didn't stay for very long because it was super complex. It's not that I didn't understand what was going on, but it was like, I need to go to the world and tell them: you don't need to do this. Like, I want some people to go to grad school. But with MongoDB, if you're in high school, you can get started.
Like there is a free tier and Atlas is free forever. If you're,
you're mid career, you're 40, you want to keep your skills sharp, you can go
get started. You can load a sample data set for free in a managed service nearby your house.
There's like 90 something regions supported all over the world. You can get started and you can
create a search index with a click of a button or one API call. And then that's really powerful to
me. At the end, it'll say you're all set. And then you get a little button. You can just click it, type into the
search box, view the query syntax, and you're off to the races. And so if we get all of that
out the way and make that part simple for you, then you can understand or even start down the
path of, hey, I want to run every query that comes in,
every text query that my users provide. I want to transform that into a dense vector and submit that
to MongoDB as a vector input and then calculate some similarity, whether it be Euclidean distance or dot product or cosine similarity of that query
vector to my corpus of vectors to surface documents I didn't even know were in this corpus
or relationships I wasn't even aware of. That's super powerful to me. And then you talk about audio, video, like next we're going to be ideas
if we ever get this Neuralink thing. So you're doing my job for me. I was just going to ask.
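The three similarity measures Marcus lists for comparing a query vector to a corpus of vectors can be sketched directly. The tiny 3-dimensional "embeddings" here are illustrative; real models emit hundreds or thousands of dimensions:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy query and document vectors (made up for illustration).
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]

print(cosine(query_vec, doc_close) > cosine(query_vec, doc_far))      # True
print(euclidean(query_vec, doc_close) < euclidean(query_vec, doc_far))  # True
```

Whichever metric you pick, the nearest vectors in the high-dimensional space are the documents the search surfaces, including relationships you didn't know were in the corpus.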
All right. But it sounds like, okay, so Atlas is the managed instance, managed service for MongoDB.
Is that right? Yeah, that's right. Okay, cool. So if people are listening and they want to try,
it sounds like, I mean, you already did that, I'm just repeating it, but they can go sign up on the website, put some example data in there, and start searching. They can really experience that. I think that's an awesome opportunity. For people listening for whom this is not your field today, or even if it is and you're interested, that firsthand experience is one of the best ways to learn, one of the best ways to immerse yourself. And as Marcus was describing, hit your head on the really obvious problems first before setting out to solve the hardest problem, and really advance your understanding and learn what you need to layer on next. I think those were wise words, wise words. Just great points. Thank you. The next question is, what about working at
MongoDB? How is working at MongoDB? Like, are you guys hiring? Do you do internships? Where
could people look? What's the kind of culture like? Just kind of give us the feel about a life
there. So we're permanently hiring. If that gives you any insight into how much work there is to do. We're powering mission-critical infrastructure for a lot of the world.
The whole world runs on MongoDB, seriously.
And so, so many applications you would never even think of.
There used to be this joke,
oh, MongoDB is web scale.
No, it's not.
It's like, whatever.
Yes, it is.
I mean, whatever you define it as,
there's a lot of people using MongoDB at super high volumes.
And so what that means is, as we move to support more of the demanding customers in the world, many of them, I think, something like four out of five financial transactions, something crazy like that, all these really large use cases mean we need a lot of help. And then we're also very focused on the developer. And so we don't go as fast as some of
these companies. Like you'll hear people talk about, as I get into the culture here, you'll
hear people talk about, you know, we ship to production, you know, 12 times a week on average. That's not how we operate because like we're focused
on your experience. Like, how does it feel to work with this database? If we ever start to break that cognitive continuity, which I think is our biggest advantage and worth so much more than people have realized, then I think it becomes
use of your data, to get to a place where you can innovate fast and really start to explore and push
the boundaries and iterate in a lot of features.
So you can rest assured we're not moving super fast in that way.
But relative, I mean, relative to the other database companies or infrastructure companies,
we move pretty fast.
And maybe it's because we have that continuity, right? Ourselves, we work with the database all the time, so we can move faster. We have just a tremendous set of internal tools and applications powered by MongoDB, and that is, I mean, that's one of our competitive advantages: our people working on these systems have the cognitive continuity.
Some teams lean sort of academic in nature. Some teams are newer and maybe play a little more fast and loose.
Doing a lot of experiments, like really trying to figure out what's the right direction, what's the right thing.
There's a mixture of roles ranging from like we have an education engineering team, which like builds like MongoDB University,
which is a free portal where you can
go and learn MongoDB if you're not being taught it in school yet, which I think everyone should
do it because it's free. Or if you're not being taught it on the job, definitely go to MongoDB
University. It's awesome. Helped me tremendously. And then we have Docs Platform. We maintain our
own Docs Platform. And then we have the managed service in the database and the search system and the mobile database and the serverless functions and
like charts. And some of these things go very, very, very in the weeds, you know. But you can rest assured, coming to a place like MongoDB, you're surrounded by world-class minds, world-class talent.
Like, the predominant storage engine in Oracle today was architected by Michael Cahill and Keith Bostic, who wrote it. It was called Sleepycat, and this was in the early 90s or something. Oracle bought Sleepycat. Then, in the 2010s I believe, Michael Cahill and Keith Bostic came back together, started a new company, and they built WiredTiger, a new storage engine. And MongoDB acquired the WiredTiger company.
Keith Bostic, who recently retired, told me that Oracle didn't want the design.
So it's like, which storage engine do you like? Would you prefer Sleepycat or WiredTiger?
It sounds like this great full circle.
So you're really covering all the bases.
Yeah. Yeah. I mean, MongoDB is the next generation of data storage and access, of information retrieval. And so I think that you come to MongoDB to be with the experts on the cutting edge, work with the fastest-growing companies and the fastest-growing database company alive.
Everyone is really nice.
They work really hard to make space for all kinds of people.
Like there are several people, including myself, who might be characterized as a little weird,
but it's fine. I spend all my day thinking. I mean, my brain is an inverted index at this point. And so sometimes I surface documents that you didn't intend to surface. And other times, it's exactly what you're looking for.
You kind of have to be tuned.
And then, you know, we have engineers
in New York City, our headquarters.
Dublin is our EU headquarters.
Dublin and, I mean, Barcelona and Berlin
are both growing engineering offices with openings.
And we have field engineers.
You can come here if you want to write just a little bit of code or a certain subset of code and work as a support engineer or consulting engineer or solutions architect.
If you don't feel you want to be only coding, right, with the software engineering organization.
We also have a large team in India. In Singapore, I believe, we have a pretty big engineering team, and in Australia, and also San Francisco and Austin. And we have field teams all over the world. You know, we follow the sun there. And so it's a pretty exciting place to be. It's very diverse.
I think we're getting better and pretty much every category of diversity. And we don't just
think about like what someone looks like. It's also like, what's their perspective?
Well, that was great, Marcus. And, you know, I think this has been super helpful to me. I mean,
I learned a lot through the podcast today.
This idea you said, it wasn't even like the topic of search, but this like cognitive continuity.
People who work with me hear me talk about like cognitive burden.
They feel like very similar concepts: really helping people make sure that when they're engaged in the flow, that's something worth optimizing for. Making sure that, whatever you're doing, for users, for people writing code, you reduce the number of simultaneous complex concepts that have to be held in your head.
And I'm with you there. And I think this has been really awesome.
I know the listeners are going to really enjoy hearing this tour de force
really of,
of search.
You covered,
I don't know,
probably a thousand Googleable terms and,
you know,
years of study probably to get to the bottom of all of it. So I really appreciate you
coming on and I really thank you for taking time. Thanks, Patrick. I really appreciate it.
Thanks, Jason. Really appreciate it. Yeah, thanks so much. This was absolutely amazing. I mean, there's a ton of terms here. I think we covered a whirlwind of things. So folks, if you have questions, we'll post Twitter handles.
You can add us.
You can add Marcus and ask questions and get engaged.
It's a really interesting topic.
Super, super important.
Yeah, I'll have to plug it again: if you want early access to semantic search capabilities, we're collecting, you know, people to get early access to those capabilities. Because, like I said, we really want to ensure that you're set up for success, that you're not confused, and that you can use it in a way that's cost-effective and allows you to really benefit from the technology.
Well, thank you, everyone. We'll see you next time.
Music by Eric Barndollar Programming Throwdown is distributed under a Creative Commons Attribution Sharealike 2.0 license.
You're free to share, copy, distribute, and transmit the work, and to remix and adapt the work, but you must provide attribution to Patrick and I and share alike in kind.