Software at Scale 25 - Rajesh Venkataraman: Senior Staff Software Engineer at Google

Episode Date: June 25, 2021

Rajesh Venkataraman is a Senior Staff Software Engineer at Google, where he works on privacy and personalization at Google Pay. He has built and maintained search systems for a large part of his career: he worked on natural language processing at Microsoft, on the Cloud Inference team at Google, and on various parts of the search infrastructure at Dropbox.

In this episode, we discuss the nuances and technology behind search systems. We go over search infrastructure (data storage and retrieval) as well as search quality (tokenization, ranking, and more). I was especially curious about how image search and other advanced search systems work internally under constraints of low latency, high search quality, and cost-efficiency.

Highlights

08:00 - Getting started building a search system: where to begin? Some history.

13:30 - Why we should use different hardware for different parts of a high-throughput search system.

17:00 - What goes on behind the scenes when a search system has to incorporate a picture or a PDF? The rise of transformers (not the Optimus Prime kind). We go on to discuss how transformers work at a very high level.

27:00 - The key idea for non-text search is being able to store, index, and search vectors efficiently. Searches often involve nearest-neighbor searches. Indexing involves techniques as simple as storing only the first few bits of each vector dimension in hashmaps.

34:00 - How search systems efficiently rebuild their inverted indices as their data changes; internationalization for search systems; search user-interface design and research.

42:00 - How should a student interested in building a search system learn the best practices and techniques to do so?

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

Transcript
Starting point is 00:00:00 Welcome to Software at Scale, a podcast where we discuss the technical stories behind large software applications. I'm your host, Utsav Shah, and thank you for listening. Hey Rajesh, welcome to another episode of the Software at Scale podcast. A quick intro for listeners: Rajesh has been working on search systems for a large part of his career. He worked on natural language processing at Microsoft, the cloud inference team at Google, and various parts of the search infrastructure at Dropbox. And now he's a Senior Staff Software Engineer at Google. Thank you, Rajesh, for being here. Hello, Utsav. Thanks for inviting me on to this podcast. I'm really glad to join you and talk to you about my experience in the software engineering world.
Starting point is 00:00:53 Perfect. So I think search has always been really interesting to me, just the fact that Google search and various other searches just work seamlessly, and I know there's a lot of stuff going on behind the scenes. So what got you initially interested in working on search? Did you just happen to fall into it, and then your interest grew from there? Yeah, that's actually exactly right. It was an accidental marriage, so to speak. I graduated from Columbia in 2008-09, and the stuff that I was doing at Columbia was learning theory and circuit complexity theory. It was proving a lot of theorems. And it so happened
Starting point is 00:01:38 that I was at that time interested in pursuing a PhD in the same field, but I was just not good enough. And primarily I was not good enough because I had started late. So I thought that I could go to a research organization, learn from some people there and then come back to academia. And the research organization that I picked was called PowerSet. It was a company that came out of Xerox PARC and was acquired by Microsoft when I was interning at Microsoft as a graduate student. That is where I did a lot of traditional natural language processing work for search because they were a search engine which worked on Wikipedia. There were two groups of people there.
Starting point is 00:02:30 One group of people is what you would call the statistically minded natural language people. So take text, make engrams out of it, compute statistics, and then use that in your search scoring algorithm. The other group of people who are now an endangered species, so to speak, in our world are the traditional natural language people. They fall into various groups, but primarily you could call them semanticists. So they have always believed that you can treat natural language the same way or approximately the same way that we treat computer languages.
Starting point is 00:03:11 And although that is not how they started, I mean, they believe that natural language has structure and that structure is something that you can encode in software. So you would have a morphological analyzer and then for syntax, you would have a morphological analyzer and then for syntax, you will have a grammar analyzer, much like Lex and Yak that you would use for compilers. And a large number of people who were really good at what they were doing there were people of this kind. So I got to learn a lot about how to build something called finite state transducers at that time.
Starting point is 00:03:52 That's how I started on search. Then for a little while, I ended up working on generating snippets on search engine result pages. When you go to a search engine result page, under the URL, you usually find some text which is extracted from that page to tell you whether or not that particular URL is relevant to your query. How do you generate that text was the problem that I was working on. So that's how I started on search. And was this at this was at Microsoft, right? So was this for Bing or was this for something else?
Starting point is 00:04:21 So when I started, it was not called Bing. It was called live.com. And I witnessed, I was party to the group which made the transition from live.com to bing.com. And yes, I did work on Bing captions relevance, so to speak, which is generation of snippets and how to compute relevance of the snippet that you generate with the query itself that has been issued. And then after a couple of years there, I joined Google. But I did not start my career at Google in search. I started in ads, actually.
Starting point is 00:05:02 It still had a strong scoring component to it. So a philosophical comment here, search is everywhere. Whatever you try to do can be cast into a search problem, can be reduced to a search problem. So you want to retrieve a large number of records matching a particular predicate. That is your quintessential search problem. And so the ads problem that I was working on also had a search component to it. And particularly from the systems side, Google is pretty good at using a small number of well-architected, well-built systems for all purposes. And so I got exposed to the systems that Google uses in order to build our software in ads first. And after that project, I moved on to another project, which was actually in the search domain, where we were trying to improve relevance of everything at Google based on information that is available in various sources
Starting point is 00:06:26 can be used to improve the relevance of search results for everyone. And from there, one thing led to another, and I ended up on the team that built the Cloud Inference API, which is also a search problem, and we ended up improving Google autocomplete quite a lot using our system there. So that was my work on search at Google. And after that, when I left Google for a short while, I did not do any search per se, but still the systems were very much the same.
Starting point is 00:07:04 I ended up actually at a private cloud company and we were building software that mimics what AWS and GCP provide in order to bring up virtual machines or setting up software defined networks and so on, but in the data centers that companies actually own. So it's a private cloud with a public cloud experience. And after that, when I joined Dropbox, that is where I met you, Utsav, I worked on search there. And what we built was publicly announced as Nautilus, which is your classic search system in order to help users find the data that they put on their Dropboxes.
Starting point is 00:07:54 And since then, I've been back in my hometown in India. I came back, all the experience that I just described to you was in the US. And when I came back to India, I stopped doing search, so to speak, as in stopped working on Google search. But I've been working on payments and several of the problems that we solve also have a search component to it, which brings life a full circle, I guess. That's such a fascinating background.
Starting point is 00:08:26 So, I mean, maybe we can start with Dropbox because it's probably like the smallest in terms of like scale. So let's say that you somehow have like a corpus of all of people's Dropboxes and you're tasked to build like a search system. And how would you like go about thinking about the problem from like an infrastructure perspective? Like what would you go about thinking about the problem from an infrastructure perspective?
Starting point is 00:08:46 What would you build first? What would be the key things that you'd be trying to measure to see if search is actually working or not? And from both from a user perspective, infrastructure perspective. Yeah, that's a very, very interesting question. And chances are that if you ask the exact same question to every member of the search team that I was part of, you will receive at least five or six different, completely different opinions. So that does the caveat. Let me tell you what I think would have been good things to measure,
Starting point is 00:09:28 and hopefully they are measuring that right now. Historically, search has several measures of relevance. And if you go back about 15, 20 years, I think almost everybody was using a measure of relevance based on keywords, and it was called TF-IDF, term frequency inverse document frequency. It's a way of computing how relevant the tokens in your search query or how well the tokens in your search query match the documents themselves that you retrieve. And that worked because for a very long time, that was the benchmark. Basically, there was nothing else to compare to.
Starting point is 00:10:09 And so this worked very well. But over time, people started becoming more semantic. And in the sense that they might not be able to spell the word correctly, or they might be thinking of a particular word, but mean something else. Sometimes it might not even be the same phrase that they use to look up the document. Sometimes they can't even articulate it at all. They would just be able to, I'll give you an example. A child, when they are trying to look for something, let's say that you have a Dropbox full of children's photos, and you are trying to look up a particular experience. So let's call this fictional child Martin and let's say that you want to look up a particularly
Starting point is 00:10:56 cute and funny face that Martin made when he was four years old at his birthday party. Let's say that's what you want to actually look up. Incredibly hard to articulate with tokens themselves. And so if you just keep measuring whether a person issued a search query and then clicked on something, then you might actually miss these kind of actual genuine user needs. So to answer your question, one of the things that I would start with, especially when you're building a search system on personal data like Dropbox itself,
Starting point is 00:11:35 is that, well, Dropbox itself has multiple versions, right? There are multiple SKUs. The business use might be significantly different from the personal one. So I'm going to talk primarily from an end user personal standpoint first, and then we could probably talk about the business side of it. From a personal standpoint, I would definitely collect at least 10,000 opinions. I would run a survey first to figure out what are people looking for when they look in Dropbox. If you just provide people a search box and then look at the queries themselves,
Starting point is 00:12:13 you might miss things that they're not able to express as queries at all. So I would definitely do something. Click-through rate is something that people historically track in order to figure out whether if you show 10 results, what is the likelihood that users actually click on it? And once they click on it, if they stay on the Dropbox property, then you might also want to look at how long they spent on it as a proxy for whether it was relevant or they just clicked on it and went back immediately. So these are the things that I would definitely look at from a metrics perspective when I'm
Starting point is 00:12:48 doing anything on search. From an infrastructure perspective, Dropbox is very, very interesting. I'm not saying this because I've worked on Dropbox. I'm saying this because I use Dropbox and I use Google Drive. There's tons and tons of stuff that I put there and I never look up again. Or I might look it up in the immediate future of when I actually added that document, maybe for two weeks or a month or something like that. But then I'll probably never touch it
Starting point is 00:13:23 for a very, very long time. So for stuff like that. But then I'll probably never touch it for a very, very long time. So for stuff like that, I would definitely think of tiers when I built this whole system. And I would say that the most recent stuff goes into a particular tier and older stuff goes into a different tier where you can sacrifice latency, but you have to be really, really relevant because that corpus is likely to be much bigger. And in terms of concrete infrastructure itself, if I was building this today, I know that we built it a certain way with Nautilus, but if I was building it today, I would definitely have different kinds of hardware serving the two of them. I would have different SLAs in the sense that the kind of software that ensures that the machine serving this is built would also be significantly different.
Starting point is 00:14:28 And I suspect that this part, most software engineers would agree on who have worked on search. So from an infrastructure, I would really want B2 hardware to serve the more recent stuff, and not so beefy hardware for the older stuff, maybe commodity hardware. A lot of what you read from Google's stuff that they publish is that they use commodity hardware for everything. And you're basically saying that for recent stuff, maybe you just want beefier hardware. And why do you think, what's the difference here?
Starting point is 00:14:59 Is it that you want to put all of this stuff in RAM or maybe what am I missing? That is one way to look at it. Recently, there are, when I say recent, this is probably seven or eight years old at this point. But there are layers of storage that have come between RAM and your spinning disks. So like SSDs maybe. SSDs. And I think Intel has a platform called Optane, NVRAM-based storage, which offers you better latency than SSD, but lesser than that of RAM and so on.
Starting point is 00:15:44 So yeah, I would definitely say that more recent stuff should stay in RAM so that you are able to, because latency is paramount. I'll just walk you through a concrete example of something that I myself do fairly regularly. So let us say that, or it doesn't have to be me. Let's say that somebody is actually purchasing a home. So they buy them, they get a bunch of documents to read about the home photographs, they share it and so on. Chances are that they will upload it, they will look at it for
Starting point is 00:16:19 one week, two weeks, a month, two months, whatever it might be for them to close the deal. And then that's it. Maybe in another year or so, there is some activity, but very, very likely after that, there is no activity on those documents at all. And when somebody is trying to get a job done of that kind, you really want to prioritize latency. And so you do want to actually keep all this in as, I mean, you want to map this into hardware that is as fast as possible. And RAM happens to be that thing at the moment. And you will have interesting, of course, RAM is not infinite. No hardware is infinite,
Starting point is 00:17:06 but RAM is particularly scarce. So you will have to come up with interesting algorithms of how to represent these things in memory itself so that you're able to fit as much as possible in the same amount of RAM while keeping the latency minimal. Okay. Yeah. So let's maybe walk through that a bit. Let's say that people upload two kinds of things, like a PDF and a picture. What do you actually end up storing on hardware? Do you do some kind of trigrams or like, and then I guess you extract some metadata on the picture and you store that. What exactly is stored?
Starting point is 00:17:41 I think a decade ago, what you just said might have been exactly what you do for them. So let's say people keep pictures and you tokenize the URL or the file name or whatever itself and keep them as engrams for quicker access or even synonyms, whatever number of combinations you can think of, you do it. Or you layer it with a bunch of systems. So the user issues a query, then you spell correct the query, and then you also apply synonyms on the query, and then you send what you would call an expanded query to your actual retrieval system. So there are systems in place here, first to perform query expansions, and the second to fan out the query expansion to your retrieval system and
Starting point is 00:18:31 fetch all the documents. You could do that. But pictures and video, non-textual data, have undergone a very, very significant revolution, so to speak. We just don't see it as a revolution because we have been living through it. But it is quite dramatic. These days, sometimes what people do is just upload another photograph or another picture, which is similar to what they're looking for. And the system is able to retrieve all of it. It could even be a description of the photograph, not even something that is mentioned in the photograph or in the file name at all and you could still pull it out. Ultimately,
Starting point is 00:19:12 everything would translate to some kind of inverted index but how you actually get to the inverted index for retrieval has changed quite dramatically in the last 10 or 12 years or so thanks to techniques that are available through machine learning. And this includes image-based machine learning as well as natural language processing. I mean, about, I want to say about six or seven years ago, transformers first made their appearance in the minds of people. And then they were productionized, I want to say, around the time that I was at Dropbox. Maybe around 2016 or 17 is when transformers became publicly,
Starting point is 00:20:05 started being publicly used for these kinds of applications. And they've changed things dramatically. Things like BERT and OpenGPT, et cetera, have the attention-based mechanism that they use have changed how you look at retrieval itself. You don't perform expansions only into the synonym space, so to speak. You expand it into some vector space. You basically generate floating point numbers, and then you start doing nearest neighbor searches these days. And there is a whole field of research that has been actively worked on by several people
Starting point is 00:20:44 on how to make these nearest neighbor searches fast and things are all coming together now. There are one set of systems which actually generate the vectors that you actually need in order to represent the query itself. And then there is another system which has already generated the vectors of all the data that has been uploaded, particularly photographs. And now it comes down to figuring out which are the closest documents in this vector space itself. Okay, that's super fascinating. And maybe just to clarify my understanding.
Starting point is 00:21:21 So once you upload a photo to Dropbox or Google Photos or something, there's some service that takes this picture and converts it into some kind of vector using machine learning. And then when somebody runs a search, like, okay, there's like the regular synonyms and everything, but there's also something that converts that query into some kind of vector space. And then there's a nearest neighbor search that's happening with all of these other things okay so maybe we can talk about that conversion right like going from a picture to a vector and i don't know too much about transformers so if you can just walk us through it you know what is a transformer like why is bert such a big deal and why why is everybody talking about gpd3 like what's actually happening behind the scenes? Before we go there, I would like to clarify one thing.
Starting point is 00:22:08 And that is that I have not actually personally worked on building these things yet. But this is how I would build it, is what I'm actually trying to say. I have no idea whether Google does this or Dropbox does this now, or even when I was there. But if I was actually doing it now, this is how I would approach it, because everything has come together.
Starting point is 00:22:31 So to answer your question, Transformer is basically a neural network, which employs a mechanism called attention and attention says I'll be as intuitive as possible here which is that let's say that you have a long sentence and you want to draw some inference out of it say that Utsav Shah lives in San Francisco. This is a fairly short sentence, but if you were to involve several actors in the sentence itself or a paragraph, however long you want it to be, chances are that the parts that make sense for you to draw your inference are not the entire blob of, is not the entire blob of text, but bits and pieces of that text itself. And that's what attention actually tries to exploit. So essentially, attention layers learn how to pay, for want of a better term, attention to various parts of the input stream itself in order to get to the outcome that you want to get to. Okay.
Starting point is 00:23:51 So maybe like a longer example would be Utsav Shah lives in San Francisco and is a software engineer. And the attention layers will focus on my name, the location, and my profession maybe. In isolation, that is not sufficient to demonstrate what we want to say here. But let's say that you are exploiting this sentence in order to match a job for you. The fact that you live in San Francisco has some relevance, but the most important thing there is the software engineer. And you might even go as far as saying
Starting point is 00:24:32 that the name itself is not relevant at all. It is where the job is and what job it is that actually matters. So attention actually learns to perform these actions. It learns to understand that the fact that this individual, you, in this context, happens to live in San Francisco and software engineer are the most relevant bits for a particular task. And it's a supervised learning mechanism, usually. And so the task itself that it is being used for is important. So if you're performing a translation task, for example, then the fact that you are translating an input sentence to an output sentence, originally it used to be done through a recurrent neural network system called sequence-to-sequence, which came out of Google. Transformers also came out of Google, but it was essentially to say that Sequence to Sequence used, if I remember correctly, I think it used either an LSTM or a GRU-based mechanism, which are examples of recurrent neural networks.
Starting point is 00:25:40 There is a severe limitation with recurrent neural networks because of the fact that they cannot remember stuff from very, very long ago. And they process stuff sequentially. It's harder to parallelize them. Attention and transformers solve that problem. They allow you to parallelize it and scale up the whole thing. And BERT is a big deal because it scaled it up dramatically. OpenGPT is also a big deal because of the exact same reason. It was able to scale it up even more. And Google probably has something which is even bigger. I
Starting point is 00:26:21 don't know. But essentially, bigger is better is the thing that is going on right now. And so this, because the bigger it is, the more data you can throw at it, and the longer data you can throw at it, and it learns to pay attention to more parts of the data for, if I were to describe it intuitively, that may not be exactly what is happening, but it is a reasonable way of thinking about it. So yeah, and neural networks always end up generating floating point numbers everywhere along the way. So if you take any arbitrary layer there, which is deep enough, and enough is an arbitrary choice of yours, it has to be far away from the input, but not too close to the output. You get a representation of what has been fed through the network.
Starting point is 00:27:12 It's essentially an encoding of whatever has been fed through the network. And that is your vector. Okay. Okay. And how do you, I guess, once you have all of these vectors, right, like, you have to probably store them in like some compact way, so that you can easily retrieve a lot of them to do search or like to do like this nearest neighbor search on. Okay. Yeah. And if you want, we can do a conversation on that. And maybe we will converge upon a solution for that.
Starting point is 00:27:46 Because I personally know of one system, but I don't want to describe that explicitly. What I would do is, what almost anybody would do is, if I tell you that the vector, let's just say three numbers, that the input vector that I'm trying to match is 10.5, 1,000, 0.3, and 99. This is the input vector that I'm trying to match. Essentially, the other vector that I'm trying to match it against happens to be 11, 999, and 76. These are the two vectors that you're trying to match. One way to find similarity between them is to just drop enough bits from all of them and then index it. I mean, you could build
Starting point is 00:28:36 an interval search system, right? This is something that most people would be familiar with when they are talking about database indexes, essentially. So for the first dimension, every record between one and 10 lies in this particular space. For the second dimension, every record between and you could do it logarithmically, you could do it linearly, whatever you want. Just by dropping bits, you can basically just index through. Exactly. And that is one way to quickly do it. And this is a surprisingly effective technique, actually. Because if your space is sufficiently large, if your dimensionality is sufficiently large,
Starting point is 00:29:22 chances are that you will be able to build efficient indexes of this kind, and you'll be able to match it against any arbitrary input vector. You could get results within milliseconds for millions of records like this, is my guess. And you might not get the closest numbers and that's fine because these are all approximations anyway. And then you could do an exact distance computation once you have narrowed down so search is all about this get a reasonable number of candidates and then do more interesting scoring on top of that yeah okay interesting yeah i actually want to learn like more and more about transformers but i think i'll basically i'll do some research on my own because i want to ask you about once you've built something
Starting point is 00:30:15 like dropbox or like search for dropbox i know that the constraints for another search system at like microsoft to google will be similar in some aspects but different than others like the first thing i can think of is there's going to be ACLs on Dropbox search, which don't necessarily have to exist for Google search. But what are some similarities that you've seen? Similarities abound, actually. Almost everything has to have a load balancing system in front of it because people are going to issue queries and you want to actually ensure that the service is highly available. So that's why you put a load balancer in front of it. Every system is going to have some kind of storage representation and even though they have vastly different use cases, there are a small number of database systems that you actually employ.
Starting point is 00:31:07 The small number of database system types also that you actually employ to build any kind of search system, for example. Another similarity, we mentioned this very early in our conversation, is that you ultimately have to build an inverted index of some kind, where you have to go from a bunch of tokens to a bunch of documents. And these systems are not that different. I mean, you might have different operating characteristics depending upon what the scale is and what the access patterns are and so on. But the overall design of the system from a conceptual standpoint is similar.
Starting point is 00:31:48 So let's leave Dropbox or Google. You have Elasticsearch in the open source world. It uses an inverted index, which is implemented by a piece of open source software called Lucene. It literally actually implements an inverted index. And when you think about indices for search, that is the one that is dominant. I mean, there might be other kinds of indices, but inverted indexes are the people generally tend to make a trade-off and say that fresh stuff is computed through a streaming service. You could think of any streaming service that the world uses. Data is piped through the stream and somebody is a listener on that stream and updates the index online. And you could also think of, you know, we talked about Dropbox data, which could be stale, so to speak, right? I think stale is not the appropriate word, but which is older,
Starting point is 00:33:06 which is not the most recent, and which is not undergoing a lot of change. So if you are building an index on top of that, there is no reason to do it online. You can do it as a batch system. And these systems, what the open source world likes to call Hadoop or Spark or Google's own cloud data flow, there is Apache Flink. All these things are examples of bad systems. And they form the heart of almost any large scale search system that you can think of. Interesting. Would the do you ever like rebuild indices? So it sounds like there is like an incremental part for recent data for example and are you rebuilding like a brand new index and you have two separate indices for like live data and like relatively older data or is it just getting converged into one large thing
Starting point is 00:33:56 you would typically not rebuild indices if nothing changes. But that's not how the world works. Stuff changes all the time. I mean, you can't even say that the search algorithm that you were using three years ago for retrieval or for scoring or whatever it might be is the best one that you can use today. So with that as the context, you have to rebuild indexes
Starting point is 00:34:25 because, or indices, because maybe you did not even compute a certain part of the index that is necessary now. How do you do that? You have to rebuild the index. Index building, the inverted index is a generic specification. What you put in the inverted index, you might change over time. And in fact, it is a fantastic data structure for you to experiment with various things without actually committing to everything upfront. I'll give you an example. Let us say that you're building web search in 1999. Chances are that what you computed in your index included the page rank of every page. Right? And every token mapped to something. Now circa 2002, 2003, let's say that somebody says, hey, it would be nice if users did not have to spell correct their own queries if we could
Starting point is 00:35:26 fix it. Maybe you want to update your index now with all the corrected spellings. The data structure itself doesn't change, but you have to recompute the index now. And use cases like this will keep coming up. So you do recompute. And it's also good practice, I think, from a software engineering perspective. So if there is bugs in your software, just having the ability to recompute the whole thing very quickly is a good thing. And as we know, when you write software at scale, bugs are inevitable.
Starting point is 00:36:03 Yes. So that makes me think about you know experimentation and building out new features right so what is like a good way to you know experiment and try out like this because it's so hard to figure out when if you're if you're working on let's say like a spell check expansion query in a sense how do you know it's actually helping users and it's not making things worse? And how do people I guess build platforms for that? Yeah, click-through rate is one of the most commonly used
Starting point is 00:36:35 metrics to measure whether users are actually finding the things that they're looking for. But you should always back up quantitative stuff like this with qualitative surveys. You should genuinely ask a fair sample of people
Starting point is 00:36:51 whether they are able to find things more easily now than before. This is going in the realm of human-computer interaction and user research. It's a closed loop.
Starting point is 00:37:06 Software engineering doesn't happen in a vacuum. You want to make sure that your user research experts actually design experiments, user research experiments, to ask people to take reasonable actions, actions that are part of their workflow, to locate whatever they're looking for. And then you want to back it up with your quantitative metrics, like whether the click-through rates are going up, what kind of documents are being clicked on, etc. Okay. And I'm guessing that at the query expansion stage or at the retrieval stage is where you will apply things like ACLs.
Starting point is 00:37:51 How would that fit in, in general? And I guess that would be different for every single service that has to apply something like ACLs. I am not really sure how to answer that question because ACLs generally tend to apply on the underlying data itself. In Dropbox's case, that might be the actual document that somebody uploaded. So until you get to the point of retrieval, I don't know how you can apply the ACL in advance.
Starting point is 00:38:23 That's fair. Yeah. You could actually say that you have pre-computed ACLs of a certain kind and you only look within that subcorpus of data. But that's an optimization. Ultimately, the ACL has to apply on the document itself. And checking for this ACL will be fairly expensive, but I guess you just have to build something that can give you like an answer reasonably quickly. Yeah.
Starting point is 00:38:50 I mean, there are table stakes, right? You cannot violate certain principles. Yeah. Web search. That is where web search differs from personal search. Web search also may have notions of ACLs at certain points, but not always. Usually it is the worldwide web. If you can index it, you can show it.
Starting point is 00:39:13 That's all. Or you can at least generate traffic to it. Because ultimately somebody has to land on that landing page in order to see the content that is there. It's not that way with private search. So you can index everything because that is a prerequisite, but it is not like everything that you can retrieve can actually be shown to the user. And how do you think about like localization and internationalization?
Starting point is 00:39:40 So like, do you just, cause I'm sure you have to think about tokenizing data in separate ways. And do you have to like even run like different like ML algorithms on different languages? Or like, is that where you have things like transformers where you don't really particularly care? Like, how does that exactly work if I had to build it out today? I think that I would start out in that fashion today. If I was to actually internationalize something, I would start out in that fashion today. If I was to actually internationalize something, I would, I mean, this is the holy grail, that you have a single
Starting point is 00:40:11 network, which is able to, a neural network, which is able to deal with every language possible. And then it doesn't matter. The network, I mean, you can think of this as a query encoding step itself. So the input to the network is the query in the user-specified way. And the output could be some canonical way. That is a translation problem. And it doesn't matter what the input language is. Let us make it the problem of the
Starting point is 00:40:46 network to figure it out. You could also do this by stacking. You could say that first you have a network to identify what language it is in, then you have a network to translate it into the canonical form and both of these could be transformers and then you could use that to perform your retrieval. Historically though, humans actually used to write software and they still do. The software used to tokenize it into a specific canonical format based on the particular language that it was in and the detection itself was a deterministic algorithm that would run. It would say that this is in, say, Japanese or Hindi or Italian. And then there would be different systems which you
Starting point is 00:41:36 could send it to in order to get back tokens in that particular language. And then you would retrieve it on the basis of that. At the retrieval stage, once you come down to individual tokens, it almost doesn't matter because it is just a matter of maintaining vocabulary large enough, which would fit all kinds of tokens. The first two steps where you transform the query into something that you can actually send to your retrieval system are the interesting parts. Gotcha. And maybe like a last question in this theme, which is, let's say I'm a college student and I've studied like a machine learning course.
Starting point is 00:42:14 I've maybe done like one course in neural networks and I've taken one distributed systems course. Do I have all of like the theoretical knowledge I need in order to implement a system like this? Or what are some like key concepts that you would recommend that people actually study all of like the theoretical knowledge I need in order to implement a system like this? Or what are some like key concepts that you would recommend that people actually study before trying to implement like their own brand new search system if they had to? I wouldn't actually start off it started that way at all, I would actually start by building a search system. So because such problem and then you discover things along the way. The important thing to remember is that abstractions go a long way in software engineering.
Starting point is 00:42:53 Sometimes you don't even know how the storage system works. It just works. So when you're using MySQL, for example, very, very few people probably understand how the exact indexing occurs in the database backend. But it's not that important because they just need to understand the failure modes and the operating characteristics of the database itself. And they probably do. So I would say start by building a search system. If you've taken a machine learning course, then at least you will be able to narrow down to what are the kinds of algorithms
Starting point is 00:43:29 that you want to use for scoring. If you've taken a distributed systems course, you probably know how to pick the right database in order to keep your data itself. And then if you haven't taken how to build a search engine course, by the way, courses like this used to exist when I was in grad school. Maybe
Starting point is 00:43:48 they still do. I don't know. If you haven't taken that, then I would just start by taking some open source search system and reading the code and seeing how people are using it. And
Starting point is 00:44:02 I would get ideas on what courses I need to take after that perfect cool then i want to ask you a little bit about moving back to india right so you worked in software in both the us and in india and like not a lot of people i guess there's a there's certainly there's a certain amount of people who have that experience but what has been your take on it you know i think you've lived here you've lived in the u.s for more than 10 years and then you move back yeah so how does the industry like differ like what about like the culture and just like anything else so i've actually never worked at an organization which is servicing somebody outside of the geography that I'm in. So when I was working in the US,
Starting point is 00:44:50 I was working on a product which was predominantly based out of the US as in it's the users that predominantly was based. And in India also, when I've been working, since I came back, I have been working on Google Pay for India. So my user base is primarily here. As far as software development processes are concerned, the world has converged almost completely, I would say.
Starting point is 00:45:17 I don't experience any significant difference in how I write software here than how I used to write it there. And that's actually the beauty of software and the internet itself. It allows you to be wherever you want to be and do the same thing that you are interested in doing. I really like writing software and I like it the same way here that I used to like it in the US. I get angry the same way at my computer when it doesn't work. And it helps that it is the same kind of computer that I use here too. And this experience is similar to almost everybody that I've spoken to. It's fantastic.
Starting point is 00:45:54 It allows you to go wherever you want to be and do exactly the same thing. Processes wise, I guess you could say that in the US itself, particularly because I was in the Bay Area for a very long time, it was a huge tech hub. So you could just go into a bar and start talking software engineering and somebody around you would probably contribute to that conversation. It is similar where I live in India now. In Bangalore, there are a lot of software engineers and if I were to go into some specific areas, starting a conversation about software engineering itself would be fairly simple. So culturally also, there isn't that much of a difference between what you find here
Starting point is 00:46:47 and what I used to find there. Cool. And I guess, do you think that's like an artifact of, you know, working at like a large US company or do you think even Indian companies now are becoming very similar? I think you basically end up embodying the requirements of the profession that
Starting point is 00:47:12 you're in and the requirements of our profession basically don't allow you to think too much about where you are. It is about what you do and how you do it. I mean, this might sound pithy, but actually the thing is that it's the same programming languages, it is the same tools, and it is the same systems. So you don't have that much leeway in evolving different styles, so to speak. So whether it is by necessity or whether it is natural, I don't know. But I don't think that there is much difference between them.
Starting point is 00:47:54 And the same headaches for when things don't compile. Yeah, it's the same. Or when you should have logged something, but you did not log it, and then you have deployed it to production, but now you should have logged something but you did not log it and then you have deployed it to production but now you want to debug something these problems are identical around the globe okay and maybe a final question so you mentioned that you were planning on going back to academia and you stayed in industry do you have any advice for like any young listeners on you know who are making that decision thinking about going into academia today versus staying in industry like how would you
Starting point is 00:48:32 encourage people to think about that it really depends i think that academia has a certain opportunity cost associated with it but if you're interested enough to do it, then the opportunity cost is irrelevant, so to speak. A famous person that I used to work with used to claim he worked on the Mac, the original Mac with the jobs. So he used to claim that by going to grad school, he financially his opportunity cost was some large number. Okay and that's probably true if you go into academia you probably have to pay a certain financial opportunity cost. But there is probably a different kind of freedom that you get in academia versus in industry. And the industry, actually, the people like to project industry versus academia as academia is freer and industry is more restrictive.
Starting point is 00:49:37 That's not really true. What is actually true is there are different kinds of freedoms that you get in the two worlds. The very fact that you have access to the kind of resources that we do in the industry and now in the realm that we work in, the kind of data that we get to process, and even the kind of people that we can help with the systems that we build, it's very different from what is possible in academia. I think that particularly when it comes to computer science and computer engineering, you won't be able to tell the difference between academia and industry in about two more decades, especially with MOOCs coming up all over the place. People are teaching themselves quite effectively thanks to the internet. And I don't think that there will be too much of a gap between what academics work on
Starting point is 00:50:29 and what industry people are able to work on. My favorite example of this is Jeremy Howard used to be the chief of data science, I think, at UCSF. And before that, he was the president of Kaggle. And he teaches a course called fast.ai. He maintains a website called fast.ai. By any stretch of imagination, he's not an academic academic he actually teaches it for engineers people who can code but the kind of
Starting point is 00:51:07 stuff that he and his colleagues churn out is not at all different from
Starting point is 00:51:15 what you get from top academic labs so I think that if there is a young
Starting point is 00:51:22 person who is listening to this just tell them to not fix it on I think that if there is a young person who is listening to this, just tell them to not fixate on how they actually end up. Pick what they want to do and then figure out where it is possible for them to do it rather than pick where they want to go and then figure out what they want to do there. I think that makes a lot of sense. And that resonates.
Starting point is 00:51:45 If you're doing like distributed systems research or like machine learning research, the larger companies are doing such interesting things. And it's not super easy to do that in different places or like in academia sometimes. Cool. Well, thank you so much for being a guest. I think this has been a lot of fun.
Starting point is 00:52:06 Thank you so much for being a guest I think this has been a lot of fun thank you so much for having me on I really hope that everybody is safe and sound I hope so too alright
