Software at Scale - Software at Scale 25 - Rajesh Venkataraman: Senior Staff Software Engineer at Google
Episode Date: June 25, 2021

Rajesh Venkataraman is a Senior Staff Engineer at Google, where he works on Privacy and Personalization at Google Pay. He has had experience building and maintaining search systems for a large part of his career. He worked on natural language processing at Microsoft, the cloud inference team at Google, and released parts of the search infrastructure at Dropbox.

Apple Podcasts | Spotify | Google Podcasts

In this episode, we discuss the nuances and technology behind search systems. We go over search infrastructure - data storage and retrieval - as well as search quality - tokenization, ranking, and more. I was especially curious about how image search and other advanced search systems work internally with constraints for low latency, high search quality, and cost-efficiency.

Highlights

08:00 - Getting started building a search system - where to begin? Some history.
13:30 - Why we should use different hardware for different parts of a high-throughput search system.
17:00 - What goes on behind the scenes in a search system when it has to incorporate a picture or a PDF? The rise of transformers, not the Optimus Prime kind. We go on to discuss how transformers work at a very high level.
27:00 - The key idea for non-text search is being able to store, index, and search for vectors efficiently. Searches often involve nearest neighbor searches. Indexing involves techniques as simple as only storing the first few bits of each vector dimension in hashmaps.
34:00 - How search systems efficiently rebuild their inverted indices based on changing data; internationalization for search systems; search user interface design and research.
42:00 - How should a student interested in building a search system learn the best practices and techniques to do so?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
Transcript
Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications.
I'm your host, Utsav Shah, and thank you for listening.
Hey Rajesh, welcome to another episode of the Software at Scale podcast.
A quick intro for listeners, Rajesh has been working on search systems for a large part of his career.
He worked on natural language processing at Microsoft, the cloud inference team at Google,
and various parts of the search infrastructure at Dropbox. And now he's a Senior Staff Software Engineer at Google. Thank you, Rajesh, for being here.
Hello, Utsav. Thanks for inviting me on to this podcast.
I'm really glad to join you and talk to you about my experience in the software engineering world.
Perfect. So I think search has always been really interesting to me.
Just the fact that Google search and various other searches just work seamlessly.
And I know there's a lot of stuff going on behind the scenes.
So what got you first or initially interested in working on search? Did you just happen to fall into that and then your interest grew from there?
Yeah, that's actually exactly right. It was an accidental marriage, so to speak. I graduated from Columbia in 2008-09, and the stuff that I was doing at Columbia was
learning theory and circuit complexity theory. It was proving a lot of theorems and it so happened
that I was at that time interested in pursuing a PhD in the same field, but I was just not good enough.
And primarily I was not good enough because I had started late. So I thought that I could go to a
research organization, learn from some people there and then come back to academia. And the
research organization that I picked was called PowerSet. It was a company that came out of
Xerox PARC and was acquired by Microsoft when I was interning at Microsoft as a graduate student.
That is where I did a lot of traditional natural language processing work for search because they were
a search engine which worked on Wikipedia.
There were two groups of people there.
One group of people is what you would call the statistically minded natural language
people.
So take text, make n-grams out of it, compute statistics, and then use that in your search
scoring algorithm. The other group of
people who are now an endangered species, so to speak, in our world are the traditional natural
language people. They fall into various groups, but primarily you could call them semanticists.
So they have always believed that you can treat natural language the same way
or approximately the same way that we treat computer languages.
And although that is not how they started,
I mean, they believe that natural language has structure
and that structure is something that you can encode in software.
So you would have a morphological analyzer, and then for syntax,
you would have a grammar analyzer, much like Lex and Yacc that you would use for compilers.
And a large number of people who were really good at what they were doing there were people of this
kind. So I got to learn a lot about how to build something called finite state transducers at that time.
That's how I started on search. Then for a little while, I ended up working on generating snippets on search engine result pages. When you go to a search engine result page, under the URL,
you usually find some text which is extracted from that page to tell you whether or not
that particular URL
is relevant to your query.
How do you generate that text was the problem that I was working on.
So that's how I started on search.
And this was at Microsoft, right?
So was this for Bing or was this for something else?
So when I started, it was not called Bing. It was called live.com. And I was party to the group which made the transition from live.com to
bing.com.
And yes, I did work on Bing captions relevance, so to speak, which is generation of snippets
and how to compute relevance of the snippet
that you generate with the query itself that has been issued.
And then after a couple of years there, I joined Google.
But I did not start my career at Google in search.
I started in ads, actually.
It still had a strong scoring component to it. So a philosophical comment here,
search is everywhere. Whatever you try to do can be cast into a search problem, can be reduced to
a search problem. So you want to retrieve a large number of records matching a particular predicate. That is your quintessential
search problem. And so the ads problem that I was working on also had a search component to it.
And particularly from the systems side, Google is pretty good at using a small number of
well-architected, well-built systems for all purposes.
And so I got exposed to the systems that Google uses in order to build our software in ads first.
And after that project, I moved on to another project, which was actually in the search domain, where we were trying to improve the relevance of search results for everyone at Google based on information that is available in various sources.
And from there, one thing led to another, and I ended up on the team that built the Cloud Inference API,
which is also a search problem, and we ended up improving Google autocomplete
quite a lot using our system there.
So that was my work on search at Google.
And after that, when I left Google for a short while,
I did not do any search per se,
but still the systems were very much the same.
I ended up actually at a
private cloud company and we were building software that mimics what AWS and GCP provide in order to bring up virtual machines, set up software-defined networks, and so on, but in the data centers that companies actually own.
So it's a private cloud with a public cloud experience. And after that, when I joined
Dropbox, that is where I met you, Utsav, I worked on search there. And what we built was publicly
announced as Nautilus, which is your classic search system
in order to help users find the data
that they put on their Dropboxes.
And since then, I've been back in my hometown in India.
I came back, all the experience that I just described to you
was in the US.
And when I came back to India, I stopped doing search, so to speak, as in stopped working
on Google search.
But I've been working on payments and several of the problems that we solve also have a
search component to it, which brings life a full circle, I guess.
That's such a fascinating background.
So, I mean, maybe we can start with Dropbox
because it's probably like the smallest
in terms of like scale.
So let's say that you somehow have like a corpus
of all of people's Dropboxes
and you're tasked to build like a search system.
How would you go about thinking about the problem from an infrastructure perspective? What would you build first? What would be the key things that you'd be trying to measure to see if search is actually working or not, from both a user perspective and an infrastructure perspective?
Yeah, that's a very, very interesting question.
And chances are that if you ask the exact same question to every member of the search
team that I was part of, you will receive at least five or six different, completely
different opinions.
So that's the caveat.
Let me tell you what I think would have been good things to measure,
and hopefully they are measuring that right now.
Historically, search has several measures of relevance.
And if you go back about 15, 20 years, I think almost everybody was using a measure of relevance based on keywords,
and it was called TF-IDF,
term frequency-inverse document frequency. It's a way of computing how well the tokens in your search query match the documents that you retrieve. And that worked because for a very long time, that was the benchmark.
Basically, there was nothing else to compare to.
And so this worked very well.
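To make the TF-IDF idea concrete, here is a minimal sketch in Python; the toy corpus, the smoothed IDF formula, and the scoring function are illustrative choices rather than anything specific from the episode:

```python
import math
from collections import Counter

docs = {
    "d1": "birthday party photos of martin".split(),
    "d2": "home purchase closing documents".split(),
    "d3": "martin first day of school photos".split(),
}

def idf(term):
    # Inverse document frequency (smoothed): rarer terms get higher weight.
    n_containing = sum(1 for toks in docs.values() if term in toks)
    return math.log((1 + len(docs)) / (1 + n_containing)) + 1

def tf_idf_score(query, doc_tokens):
    # Sum of term frequency * IDF over the query tokens.
    tf = Counter(doc_tokens)
    return sum((tf[t] / len(doc_tokens)) * idf(t) for t in query.split())

query = "martin birthday photos"
ranked = sorted(docs, key=lambda d: tf_idf_score(query, docs[d]), reverse=True)
print(ranked)  # documents ordered by how well the query tokens match them
```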
But over time, people started becoming more semantic, in the sense that they might not be able to spell the word correctly,
or they might be thinking of a particular word, but mean something else. Sometimes it might not even be the same phrase that they use to look up the document.
Sometimes they can't even articulate it at all. They would just be able to, I'll give you an
example. A child, when they are trying to look for something, let's say that you have a Dropbox
full of children's photos, and you are trying to look up a particular experience. So let's call this
fictional child Martin and let's say that you want to look up a particularly
cute and funny face that Martin made when he was four years old at his birthday party.
Let's say that's what you want to actually look up. Incredibly hard
to articulate with tokens themselves. And so if you just keep measuring whether a person issued
a search query and then clicked on something, then you might actually miss these kinds of
actual genuine user needs. So to answer your question,
one of the things that I would start with,
especially when you're building a search system
on personal data like Dropbox itself,
is that, well, Dropbox itself has multiple versions, right?
There are multiple SKUs.
The business use might be significantly different
from the personal one. So I'm going to talk primarily from an end user personal standpoint first,
and then we could probably talk about the business side of it.
From a personal standpoint, I would definitely collect at least 10,000 opinions. I would run
a survey first to figure out what are people looking for when they look in Dropbox.
If you just provide people a search box and then look at the queries themselves,
you might miss things that they're not able to express as queries at all.
So I would definitely do something. Click-through rate is something that people historically track in order to figure out: if you show 10 results, what is the likelihood that users actually click on one of them?
And once they click on it, if they stay on the Dropbox property, then you might also
want to look at how long they spent on it as a proxy for whether it was relevant or
they just clicked on it and went back immediately.
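As a rough sketch of how those two signals could be computed from search logs, assuming a made-up log format of (query, clicked, seconds spent on the result):

```python
# Hypothetical log records: (query, clicked, seconds_on_result).
logs = [
    ("tax form 2020", True, 95.0),
    ("tax form 2020", False, 0.0),
    ("martin birthday", True, 3.0),   # quick bounce back: weak relevance signal
]

impressions = len(logs)
clicks = sum(1 for _, clicked, _ in logs if clicked)
ctr = clicks / impressions

# Dwell time as a proxy for relevance, averaged over clicked results only.
dwell_times = [seconds for _, clicked, seconds in logs if clicked]
avg_dwell = sum(dwell_times) / len(dwell_times)

print(f"CTR: {ctr:.2f}, average dwell time: {avg_dwell:.1f}s")
```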
So these are the things that I would definitely look at from a metrics perspective when I'm
doing anything on search. From an infrastructure perspective, Dropbox is very, very interesting.
I'm not saying this because I've worked on Dropbox. I'm saying this because I use Dropbox
and I use Google Drive. There's tons and tons of stuff that I put there
and I never look up again.
Or I might look it up in the immediate future
of when I actually added that document,
maybe for two weeks or a month or something like that.
But then I'll probably never touch it
for a very, very long time.
So for stuff like that, I would definitely think of tiers when I built this whole system.
And I would say that the most recent stuff goes into a particular tier and older stuff goes into a different tier where you can sacrifice latency,
but you have to be really, really relevant because that corpus is likely to be much bigger.
And in terms of concrete infrastructure itself, if I was building this today, I know that we built
it a certain way with Nautilus, but if I was building it today,
I would definitely have different kinds of hardware serving the two of them. I would have different SLAs, in the sense that the way the machines serving each tier are built would also be significantly different. And I suspect this is the part most software engineers who have worked on search would agree on. So from an infrastructure standpoint, I would really want beefy hardware to serve the more recent stuff, and not-so-beefy hardware for the older stuff, maybe commodity hardware.
A lot of what you read from Google's stuff that they publish is that they use commodity hardware
for everything.
And you're basically saying that for recent stuff,
maybe you just want beefier hardware.
And why do you think, what's the difference here?
Is it that you want to put all of this stuff in RAM
or maybe what am I missing?
That is one way to look at it. Recently, there are, when I say recent, this is probably seven
or eight years old at this point. But there are layers of storage that have come between
RAM and your spinning disks. So like SSDs maybe.
SSDs.
And I think Intel has a platform called Optane, NVRAM-based storage, which offers you better latency than SSD, but not as good as RAM, and so on.
So yeah, I would definitely say that more recent stuff
should stay in RAM so that you are able to,
because latency is paramount.
I'll just walk you through a concrete example
of something that I myself do fairly regularly.
So let us say that, or it doesn't have to be me, let's say that somebody is actually purchasing a home. So they get a bunch of documents to read about the home, photographs, they share them and so on. Chances are that they will upload it, they will look at it for
one week, two weeks, a month, two months, whatever it might be for them to close the deal. And then that's it. Maybe in another year or so, there is some activity, but very,
very likely after that, there is no activity on those documents at all. And when somebody is
trying to get a job done of that kind, you really want to prioritize latency. And so you do want to actually keep all this in as,
I mean, you want to map this into hardware
that is as fast as possible.
And RAM happens to be that thing at the moment.
And you will have interesting,
of course, RAM is not infinite. No hardware is infinite,
but RAM is particularly scarce. So you will have to come up with interesting algorithms
of how to represent these things in memory itself so that you're able to fit as much as possible
in the same amount of RAM while keeping the latency minimal.
Okay. Yeah. So let's maybe walk through that a bit. Let's say that people upload two kinds of things, like a PDF and a picture.
What do you actually end up storing on hardware?
Do you do some kind of trigrams or like, and then I guess you extract some metadata
on the picture and you store that.
What exactly is stored?
I think a decade ago, what you just said might have been exactly what you do for them.
So let's say people keep pictures and you tokenize the URL or the file name or whatever itself and keep them as n-grams for quicker access, or even synonyms, whatever number of combinations you can think of,
you do it. Or you layer it with a bunch of systems. So the user issues a query,
then you spell correct the query, and then you also apply synonyms on the query, and then you
send what you would call an expanded query to your actual retrieval system.
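A rough sketch of that layered flow, spell correction, then synonym expansion, then fan-out to retrieval; the lookup tables and the retrieval stub here are placeholders, not real components from Dropbox or Google:

```python
# Placeholder lookup tables; real systems would use learned models and services.
SPELL_FIXES = {"recipt": "receipt"}
SYNONYMS = {"receipt": ["invoice", "bill"], "photo": ["picture", "image"]}

def spell_correct(tokens):
    return [SPELL_FIXES.get(t, t) for t in tokens]

def expand(tokens):
    expanded = set(tokens)
    for t in tokens:
        expanded.update(SYNONYMS.get(t, []))
    return expanded

def search(raw_query, retrieve):
    # retrieve() stands in for the fan-out to the actual retrieval system.
    tokens = raw_query.lower().split()
    expanded_query = expand(spell_correct(tokens))
    return retrieve(expanded_query)

# Toy retrieval backend: return documents sharing any token with the query.
docs = {"doc1": {"receipt", "2021"}, "doc2": {"picture", "birthday"}}
retrieve = lambda q: [d for d, toks in docs.items() if q & toks]
print(search("recipt 2021", retrieve))  # -> ['doc1']
```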
So there are systems in place here, first to perform query expansion, and second to fan the expanded query out to your retrieval system and
fetch all the documents. You could do that. But pictures and video, non-textual data,
have undergone a very, very significant revolution, so to speak. We just don't see it as a revolution because we have been living through it.
But it is quite dramatic.
These days, sometimes what people do is just upload another photograph or another picture,
which is similar to what they're looking for.
And the system is able to retrieve all of it.
It could even be a description of the photograph, not even something that is mentioned
in the photograph or in the file name at all and you could still pull it out. Ultimately,
everything would translate to some kind of inverted index but how you actually get to
the inverted index for retrieval has changed quite dramatically in the last 10 or 12 years or
so thanks to techniques that are available through machine learning. And this includes
image-based machine learning as well as natural language processing. I mean, about, I want to say about six or seven years ago,
transformers first made their appearance in the minds of people.
And then they were productionized, I want to say,
around the time that I was at Dropbox.
Maybe around 2016 or '17 is when transformers started being publicly used for these kinds of applications. And they've changed things dramatically.
Things like BERT and OpenAI's GPT, et cetera, and the attention-based mechanism that they use, have changed how you look at retrieval itself.
You don't perform expansions only into the synonym space, so to speak.
You expand it into some vector space.
You basically generate floating point numbers, and then you start doing nearest neighbor
searches these days.
And there is a whole field of research that has been actively worked on by several people
on how to make these nearest neighbor searches
fast and things are all coming together now.
There are one set of systems which actually generate the vectors that you actually need
in order to represent the query itself.
And then there is another system which has already generated the vectors of all the data that has been uploaded, particularly photographs.
And now it comes down to figuring out which are the closest documents in this vector space itself.
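A minimal sketch of that final matching step, brute-force cosine similarity between a query vector and a matrix of document vectors using NumPy; real systems would use an approximate index, as discussed later:

```python
import numpy as np

# Pretend these came out of the two encoders described above.
doc_vectors = np.random.rand(1000, 128)   # 1000 documents, 128-dim embeddings
query_vector = np.random.rand(128)

# Cosine similarity is the dot product of L2-normalized vectors.
doc_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
q_norm = query_vector / np.linalg.norm(query_vector)
scores = doc_norm @ q_norm

top_k = np.argsort(-scores)[:10]   # indices of the 10 nearest documents
print(top_k, scores[top_k])
```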
Okay, that's super fascinating.
And maybe just to clarify my understanding.
So once you upload a photo to Dropbox or Google Photos or something, there's some service that takes this picture and converts it into some kind of vector
using machine learning. And then when somebody runs a search, like, okay, there's like the
regular synonyms and everything, but there's also something that converts that query into some kind
of vector space. And then there's a nearest neighbor search that's happening with all of these other things okay so maybe we can talk about that conversion right like going from a picture to a
vector and i don't know too much about transformers so if you can just walk us through it you know
what is a transformer like why is bert such a big deal and why why is everybody talking about gpd3
like what's actually happening behind the scenes? Before we go there,
I would like to clarify one thing.
And that is that I have not actually
personally worked on building these things yet.
But this is how I would build it,
is what I'm actually trying to say.
I have no idea whether Google does this
or Dropbox does this now,
or even when I was there.
But if I was actually doing it now, this is how I would approach it, because everything has come together.
So to answer your question, a transformer is basically a neural network which employs a mechanism called attention. And attention says, I'll be as intuitive as possible here, that let's say you have a long sentence and you want to draw some inference out of it. Say that Utsav Shah lives in San Francisco.
This is a fairly short sentence, but if you were to involve several actors in the sentence itself or a paragraph, however long you want it to be, chances are that the parts that matter for you to draw your inference are not the entire blob of text, but bits and pieces of that text itself.
And that's what attention actually tries to exploit. So essentially, attention layers
learn how to pay, for want of a better term, attention to various parts of the input stream itself in order to get to the outcome that you want to get to.
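A toy version of that mechanism, scaled dot-product attention over a short token sequence, just to make the "learn which parts of the input to weight" idea concrete; the projection matrices here are random, whereas in a real transformer they are learned:

```python
import numpy as np

def attention(Q, K, V):
    # Each output position is a weighted average of the values V, with weights
    # determined by how well its query matches every key (softmax of dot products).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Six tokens ("Utsav", "Shah", "lives", "in", "San", "Francisco"), 8-dim embeddings.
np.random.seed(0)
x = np.random.rand(6, 8)
Wq, Wk, Wv = (np.random.rand(8, 8) for _ in range(3))  # learned in practice

out, weights = attention(x @ Wq, x @ Wk, x @ Wv)
print(weights.round(2))  # row i: how much token i attends to each other token
```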
Okay.
So maybe like a longer example would be Utsav Shah lives in San Francisco
and is a software engineer.
And the attention layers will focus on my name, the location,
and my profession maybe.
In isolation, that is not sufficient to demonstrate what we want to say here.
But let's say that you are exploiting this sentence in order to match a job for you.
The fact that you live in San Francisco has some relevance,
but the most important thing there is the software engineer. And you might even go as far as saying
that the name itself is not relevant at all. It is where the job is and what job it is
that actually matters. So attention actually learns to perform these actions. It learns to understand that the facts that this individual, you in this context, happens to live in San Francisco and is a software engineer are the most relevant bits for a particular task.
And it's a supervised learning mechanism, usually.
And so the task itself that it is being used for is important. So if you're
performing a translation task, for example, translating an input sentence to an output sentence, originally it used to be done through a recurrent neural network system called sequence-to-sequence, which came out of Google. Transformers also came out of Google. Sequence-to-sequence used, if I remember correctly, either an LSTM- or a GRU-based mechanism, which are examples of recurrent neural networks.
There is a severe limitation with recurrent neural networks because of the fact that they
cannot remember stuff from very, very long ago.
And they process stuff sequentially.
It's harder to parallelize them.
Attention and transformers solve that problem.
They allow you to parallelize it and scale up the whole thing. And BERT is a big deal
because it scaled it up dramatically. OpenAI's GPT is also a big deal for the exact same reason. It was able to scale it up even more. And Google probably has something which is even bigger. I
don't know. But essentially, bigger is better is the thing that is going on right now.
And so this, because the bigger it is, the more data you can throw at it, and the longer data you
can throw at it, and it learns to pay attention to more parts of the data for, if I were to describe
it intuitively, that may not be exactly what is happening, but it is a reasonable way of thinking
about it. So yeah, and neural networks always end up generating floating point numbers everywhere
along the way. So if you take any arbitrary layer there, which is deep enough, and enough is an
arbitrary choice of yours, it has to be far away from the input, but not too close to the output.
You get a representation of what has been fed through the network.
It's essentially an encoding of whatever has been fed through the network.
And that is your vector.
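One common way to pull such a vector out of a pretrained transformer, sketched with the Hugging Face transformers library and mean pooling over the last hidden layer; the model choice and the pooling strategy are assumptions for illustration, not anything specific named in the episode:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (1, num_tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).squeeze(0)  # mean-pooled vector

vec = embed("funny face Martin made at his fourth birthday party")
print(vec.shape)  # torch.Size([768])
```

The same kind of function applied to documents, or to captions derived from images, would produce the document-side vectors that the query vector is matched against.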
Okay.
And how do you, I guess, once you have all of these vectors, right, like, you have to probably store them in like some compact way, so that you can easily retrieve a lot of them to do search or like to do like this nearest neighbor search on.
Okay.
Yeah.
And if you want, we can do a conversation on that. And maybe we will converge upon a solution for that.
Because I personally know of one system, but I don't want to describe that explicitly.
What I would do is, what almost anybody would do is, if I tell you that the vector, let's just say three numbers,
that the input vector that I'm trying to match is 10.5, 1,000.3, and 99.
This is the input vector that I'm trying to match.
Essentially, the other vector that I'm trying to match it against
happens to be 11, 999, and 76.
These are the two vectors that you're trying to match. One way to find similarity
between them is to just drop enough bits from all of them and then index it. I mean, you could build
an interval search system, right? This is something that most people would be familiar with
when they are talking about database indexes, essentially. So for the first dimension, every record between one and 10 lies in this particular space. For the second dimension, every record between some other range, and so on. And you could do it logarithmically, you could do it linearly, whatever you want.
Just by dropping bits, you can basically just index through. Exactly.
And that is one way to quickly do it.
And this is a surprisingly effective technique, actually.
Because if your space is sufficiently large, if your dimensionality is sufficiently large,
chances are that you will be able to build efficient indexes of this kind,
and you'll be able to match it against any arbitrary input vector. You could get results
within milliseconds for millions of records like this, is my guess.
And you might not get the closest numbers and that's fine because these are all approximations
anyway. And then you could do an exact distance computation once you have narrowed it down. So search is all about this: get a reasonable number of candidates and then do more interesting scoring on top of that.
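A rough sketch of that two-stage idea, coarse per-dimension bucketing to collect candidates, then exact distances only on the narrowed-down set; the bucket width and the union-of-buckets candidate rule are arbitrary illustrative choices:

```python
import math
from collections import defaultdict

BUCKET = 16  # coarse quantization: effectively drop the low-order detail of each dimension

def build_index(vectors):
    index = defaultdict(set)
    for doc_id, vec in vectors.items():
        for dim, value in enumerate(vec):
            index[(dim, int(value // BUCKET))].add(doc_id)
    return index

def query(index, vectors, q, top_k=2):
    # Stage 1: candidates are documents sharing at least one coarse bucket with the query.
    candidates = set()
    for dim, value in enumerate(q):
        candidates |= index.get((dim, int(value // BUCKET)), set())
    # Stage 2: exact distance computation only on the candidates.
    return sorted(candidates, key=lambda d: math.dist(q, vectors[d]))[:top_k]

vectors = {"a": [10.5, 1000.3, 99.0], "b": [11.0, 999.0, 76.0], "c": [500.0, 3.0, 7.0]}
idx = build_index(vectors)
print(query(idx, vectors, [10.0, 998.0, 90.0]))  # -> ['a', 'b']
```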
Yeah, okay, interesting. I actually want to learn more and more about transformers, but I'll basically do some research on my own, because I want to ask you: once you've built something like Dropbox, or search for Dropbox, I know that the constraints for another search system at, say, Microsoft or Google will be similar in some aspects but different in others. The first thing I can think of is that there are going to be ACLs on Dropbox search, which don't necessarily have to exist for Google search. But what are some similarities that you've seen?
Similarities abound, actually. Almost everything has to have a load balancing system in front of
it because people are going to issue queries and you want to actually ensure that the service is highly available. So that's why you put a load
balancer in front of it. Every system is going to have some kind of storage representation and
even though they have vastly different use cases, there are a small number of database systems, a small number of database system types, that you actually employ to build any kind of search system, for example.
Another similarity, we mentioned this very early in our conversation, is that you ultimately
have to build an inverted index of some kind, where you have to go from a bunch of tokens
to a bunch of documents.
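A minimal sketch of that core data structure, a map from token to the set of documents containing it, with a simple AND query on top; tokenization here is just whitespace splitting, purely for illustration:

```python
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def lookup(index, query):
    # Intersect posting lists: documents containing every query token.
    postings = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": "mortgage closing documents for the new home",
    "d2": "martin birthday party photos",
    "d3": "home insurance documents",
}
index = build_inverted_index(docs)
print(lookup(index, "home documents"))  # -> {'d1', 'd3'}
```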
And these systems are not that different. I mean, you might have different operating characteristics
depending upon what the scale is and what the access patterns are and so on.
But the overall design of the system from a conceptual standpoint is similar.
So let's leave Dropbox or Google. You have Elasticsearch in the open source world.
It uses an inverted index, which is implemented by a piece of open source software called Lucene.
It literally actually implements an inverted index. And when you think about indices for search, that is the one that is dominant.
I mean, there might be other kinds of indices, but inverted indexes are the dominant ones. People generally tend to make a trade-off and say that fresh stuff is computed through a streaming service.
You could think of any streaming service that the world uses.
Data is piped through the stream and somebody is a listener on that stream and updates the index online.
And you could also think of, you know, we talked about Dropbox data, which could be
stale, so to speak, right? I think stale is not the appropriate word, but which is older,
which is not the most recent, and which is not undergoing a lot of change. So if you are building an index on top of that, there is no reason to do it online. You can do it as a batch system.
And these systems, what the open source world calls Hadoop or Spark, or Google's own Cloud Dataflow, there is Apache Flink. All these things are examples of batch systems. And they form the heart of almost any large-scale search system that you can think of.
Interesting. Do you ever rebuild indices? So it sounds like there is an incremental part for recent data, for example. And are you rebuilding a brand new index, and you have two separate indices for live data and relatively older data, or is it just getting converged into one large thing?
you would typically not rebuild indices if nothing changes.
But that's not how the world works.
Stuff changes all the time.
I mean, you can't even say that the search algorithm that you were using three years ago
for retrieval or for scoring or whatever it might be
is the best one that you can use today.
So with that as the context,
you have to rebuild indexes, or indices, because maybe you did not even compute a certain part of the index that is necessary now. How do you do that? You have to rebuild the index. Index building,
the inverted index is a generic specification. What you put in the inverted index, you might change over time. And
in fact, it is a fantastic data structure for you to experiment with various things without actually
committing to everything upfront. I'll give you an example. Let us say that you're building web
search in 1999. Chances are that what you computed in your index included the page rank of every page.
Right? And every token mapped to something. Now circa 2002, 2003, let's say that somebody says,
hey, it would be nice if users did not have to spell correct their own queries if we could
fix it.
Maybe you want to update your index now with all the corrected spellings.
The data structure itself doesn't change, but you have to recompute the index now.
And use cases like this will keep coming up.
So you do recompute. And it's also good practice,
I think, from a software engineering perspective. So if there are bugs in your software, just
having the ability to recompute the whole thing very quickly is a good thing. And as
we know, when you write software at scale, bugs are inevitable.
Yes. So that makes me think about experimentation and building out new features. What is a good way to experiment and try out things like this? Because it's so hard to figure out, if you're working on, let's say, a spell-check query expansion, how do you know it's actually helping users and not making things worse?
And how do people
I guess build platforms for that?
Yeah, click-through rate is one of the most
commonly used
metrics to measure whether
users are actually finding
the things that they're looking for.
But you should always
back up quantitative stuff like this
with qualitative surveys.
You should genuinely ask
a fair sample of people
whether they are able to find
things more easily
now than before.
This is
going in the realm of human-computer
interaction and
user research.
It's a closed loop.
Software engineering doesn't happen in a vacuum.
You want to make sure that your user research experts actually design experiments,
user research experiments, to ask people to take reasonable actions,
actions that are part of their workflow, to locate whatever they're looking for.
And then you want to back it up with your quantitative metrics,
like whether the click-through rates are going up, what kind of documents are being clicked on, etc.
Okay. And I'm guessing that at the query expansion stage or at the retrieval stage is where you
will apply things like ACLs.
How would that fit in, in general?
And I guess that would be different for every single service that has to apply something
like ACLs.
I am not really sure how to answer that question because ACLs generally tend to apply on the
underlying data itself.
In Dropbox's case, that might be the actual document that somebody uploaded.
So until you get to the point of retrieval, I don't know how you can apply the ACL in
advance.
That's fair.
Yeah.
You could actually say that you have pre-computed ACLs of a certain kind and you only look within that subcorpus of data.
But that's an optimization.
Ultimately, the ACL has to apply on the document itself.
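A sketch of what applying the ACL at retrieval time could look like: retrieve candidates first, then keep only the documents the requesting user is allowed to see. The permission model here, a simple set of allowed user ids per document, is an assumption for illustration:

```python
# Hypothetical per-document ACLs: the set of user ids allowed to see each document.
ACLS = {
    "d1": {"alice"},
    "d2": {"alice", "bob"},
    "d3": {"carol"},
}

def can_see(user, doc_id):
    return user in ACLS.get(doc_id, set())

def acl_filtered_search(user, query, retrieve):
    # retrieve() is the underlying, ACL-unaware retrieval system.
    candidates = retrieve(query)
    return [d for d in candidates if can_see(user, d)]

retrieve = lambda q: ["d1", "d2", "d3"]  # stub: pretend everything matched the query
print(acl_filtered_search("bob", "tax documents", retrieve))  # -> ['d2']
```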
And checking for this ACL will be fairly expensive,
but I guess you just have to build something that can give you like an answer reasonably quickly.
Yeah.
I mean, there are table stakes, right?
You cannot violate certain principles.
Yeah.
Web search.
That is where web search differs from personal search.
Web search also may have notions of ACLs at certain points, but not always.
Usually it is the worldwide web.
If you can index it, you can show it.
That's all.
Or you can at least generate traffic to it.
Because ultimately somebody has to land on that landing page in order to see the content that is there.
It's not that way with private search.
So you can index everything because that is a prerequisite,
but it is not like everything that you can retrieve can actually be shown to
the user.
And how do you think about like localization and internationalization?
So like, do you just,
cause I'm sure you have to think about tokenizing data in separate ways.
And do you have to like even run like different like ML algorithms on different languages?
Or like, is that where you have things like transformers where you don't really particularly care?
Like, how does that exactly work if I had to build it out today?
I think that I would start out in that fashion today. If I was to actually internationalize something,
I would, I mean, this is the holy grail, that you have a single neural network which is able to deal with every language possible.
And then it doesn't matter.
The network, I mean, you can think of this as a query encoding step itself.
So the input to the network is the query in the user-specified way.
And the output could be some canonical way.
That is a translation problem.
And it doesn't matter what the input language is. Let us make it the problem of the
network to figure it out. You could also do this by stacking. You could say that first you have a
network to identify what language it is in, then you have a network to translate it into the
canonical form and both of these could be transformers and then you could use that to
perform your retrieval. Historically though, humans actually used to
write software and they still do. The software used to tokenize it into
a specific canonical format based on the particular language that it was
in and the detection itself was a deterministic algorithm that would run. It would say that this
is in, say, Japanese or Hindi or Italian. And then there would be different systems which you
could send it to in order to get back tokens in that particular language. And then you would
retrieve it on the basis of that. At the retrieval stage, once you come down to individual tokens, it almost doesn't matter
because it is just a matter of maintaining vocabulary large enough, which would fit
all kinds of tokens. The first two steps where you transform the query
into something that you can actually send to your retrieval system are the interesting parts.
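A sketch of that older, deterministic style of pipeline: detect the language, route to a per-language tokenizer, then hand the tokens to retrieval. Both the detector and the tokenizers here are crude placeholders, not real production components:

```python
def detect_language(text):
    # Placeholder detector: real systems use trained classifiers or much more
    # robust character-set heuristics than this.
    if any("\u3040" <= ch <= "\u30ff" for ch in text):   # Japanese kana
        return "ja"
    if any("\u0900" <= ch <= "\u097f" for ch in text):   # Devanagari (Hindi)
        return "hi"
    return "en"

TOKENIZERS = {
    "en": lambda t: t.lower().split(),
    "hi": lambda t: t.split(),
    # Japanese has no spaces; a real tokenizer would do proper word segmentation.
    "ja": lambda t: list(t),
}

def tokenize_query(text):
    lang = detect_language(text)
    return lang, TOKENIZERS[lang](text)

print(tokenize_query("Birthday photos"))  # ('en', ['birthday', 'photos'])
```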
Gotcha.
And maybe like a last question in this theme, which is, let's say I'm a college student
and I've studied like a machine learning course.
I've maybe done like one course in neural networks and I've taken one distributed systems
course.
Do I have all of like the theoretical knowledge I need in order to implement a system like
this?
Or what are some key concepts that you would recommend that people actually study before trying to implement their own brand new search system if they had to?
I wouldn't actually start off that way at all. I would actually start by building a search system, because you pick a search problem and then you discover things along the way. The important thing to remember is that abstractions go a long way in software engineering.
Sometimes you don't even know how the storage system works.
It just works.
So when you're using MySQL, for example, very, very few people probably understand how the
exact indexing occurs in the database backend.
But it's not that important because they just need to understand the failure modes and the operating characteristics of the database itself.
And they probably do.
So I would say start by building a search system. If you've taken a machine learning course,
then at least you will be able to narrow down to what are the kinds of algorithms
that you want to use for scoring.
If you've taken a distributed systems course,
you probably know how to pick the right database
in order to keep your data itself.
And then if you haven't taken
how to build a search engine course,
by the way, courses like this used to exist
when I was in grad school. Maybe
they still do. I don't know.
If you haven't taken that, then
I would just start by
taking some open source
search system and reading the code
and seeing how people
are using it.
And
I would get ideas on what courses I need to take after that.
Perfect, cool. Then I want to ask you a little bit about moving back to India. So you've worked in software in both the US and in India, and I guess only a certain number of people have that experience. What has been your take on it? You lived in the US for more than 10 years and then you moved back. So how does the industry differ? What about the culture, and just anything else?
So I've actually never worked at an organization which is servicing somebody outside of the geography that I'm in.
So when I was working in the US,
I was working on a product
which was predominantly based out of the US, as in, that's where its users were predominantly based.
And in India also, when I've been working,
since I came back,
I have been working on Google Pay for India.
So my user base is primarily here.
As far as software development processes are concerned, the world has converged almost completely, I would say.
I don't experience any significant difference in how I write software here than how I used to write it there.
And that's actually the
beauty of software and the internet itself. It allows you to be wherever you want to be and do
the same thing that you are interested in doing. I really like writing software and I like it the
same way here that I used to like it in the US. I get angry the same way at my computer when it
doesn't work. And it helps that it is the same kind of computer that I use here too.
And this experience is similar to almost everybody that I've spoken to.
It's fantastic.
It allows you to go wherever you want to be and do exactly the same thing. Processes wise, I guess you could say that in the US itself, particularly because I was
in the Bay Area for a very long time, it was a huge tech hub.
So you could just go into a bar and start talking software engineering and somebody
around you would probably contribute to that conversation.
It is similar where I live in India now.
In Bangalore, there are a lot of software engineers and if I were to go into some specific
areas, starting a conversation about software engineering itself would be fairly simple.
So culturally also, there isn't that much of a difference between what you find here
and what I used to find there.
Cool.
And I guess,
do you think that's like an artifact of,
you know, working at like a large US company
or do you think even Indian companies now
are becoming very similar?
I think you basically end up embodying the requirements of the profession that
you're in and the requirements of our profession basically don't allow you to
think too much about where you are.
It is about what you do and how you do it. I mean, this might sound pithy,
but actually the thing is that it's the same programming languages, it is the same tools,
and it is the same systems. So you don't have that much leeway in
evolving different styles, so to speak.
So whether it is by necessity or whether it is natural, I don't know.
But I don't think that there is much difference between them.
And the same headaches for when things don't compile.
Yeah, it's the same.
Or when you should have logged something,
but you did not log it, and then you have deployed it to production, but now you want to debug something. These problems are identical around the globe.
Okay, and maybe a final question. So you mentioned that you were planning on going back to academia and you stayed in industry. Do you have any advice for any young listeners who are making that decision, thinking about going into academia today versus staying in industry? How would you encourage people to think about that?
It really depends. I think that academia has a certain opportunity cost associated with it, but if you're interested enough to do it, then the opportunity cost is irrelevant, so to speak.
A famous person that I used to work with used to claim he worked on the original Mac with Jobs. So he used to claim that by going to grad school, his financial opportunity cost was some large number. Okay, and that's probably true. If you go into academia, you probably have to
pay a certain financial opportunity cost. But there is probably a different kind of freedom
that you get in academia versus in industry.
And actually, people like to project industry versus academia as: academia is freer and industry is more restrictive.
That's not really true. What is actually true is there are different kinds of freedoms that you get in the two worlds. The very fact that you have access to the kind of resources that we do in the industry
and now in the realm that we work in, the kind of data that we get to process, and even the kind of
people that we can help with the systems that we build, it's very different from what is possible
in academia. I think that particularly when it comes to computer
science and computer engineering, you won't be able to tell the difference between academia and
industry in about two more decades, especially with MOOCs coming up all over the place. People
are teaching themselves quite effectively thanks to the internet. And I don't think that there will be too much of a gap
between what academics work on
and what industry people are able to work on.
My favorite example of this is
Jeremy Howard used to be the chief of data science,
I think, at UCSF. And before that, he was the president of Kaggle.
And he teaches a course called fast.ai. He maintains a website called fast.ai.
By any stretch of imagination, he's not an academic academic. He actually teaches it for engineers, people who can code.
But the kind of stuff that he and his colleagues churn out is not at all different from what you get from top academic labs. So I think that if there is a young person who is listening to this,
just tell them to not fixate on how they actually end up.
Pick what they want to do and then figure out where it is possible for them to do it rather than pick where they want to go and then figure out what they want to do there.
I think that makes a lot of sense.
And that resonates.
If you're doing like distributed systems research
or like machine learning research,
the larger companies are doing such interesting things.
And it's not super easy to do that in different places
or like in academia sometimes.
Cool.
Well, thank you so much for being a guest. I think this has been a lot of fun.
Thank you so much for having me on. I really hope that everybody is safe and sound.
I hope so too. All right.