Orchestrate all the Things - Foursquare moves to the future with a Geospatial Knowledge Graph. Featuring Distinguished Engineer Vikram Gundeti
Episode Date: May 11, 2023

From a consumer-oriented application, Foursquare has evolved to a data and product provider for enterprises. The next steps in its evolution will be powered by the Foursquare Graph. If the name Foursquare rings a bell, it means you were around in the 2010s. Your only resort to plausible deniability would be if you are a data professional - although that's not an either/or proposition. In the 2010s, Foursquare was a consumer-oriented mobile application. The premise was simple: people would check in at different locations and get gamified rewards. Their location data would be shared with Foursquare and used for services such as recommendations. Facebook and Yelp got the lion's share of that market, but Foursquare is still around. In addition to having 9 billion-plus visits monthly from 500 million unique devices, Foursquare's data is used to power the likes of Apple, Uber and Coca-Cola. Today the company announced Foursquare Graph, what it dubs the industry's first application of graph technology to geospatial data. I caught up with Vikram Gundeti, Distinguished Engineer at Foursquare, to learn more about what kind of data Foursquare deals with, what it does with that data, and how using graph is going to help. Article published on Orchestrate all the Things
Transcript
Welcome to Orchestrate All the Things. I'm George Anadiotis and we'll be connecting the dots together.
Stories about technology, data, AI and media, and how they flow into each other, shaping our lives.
From a consumer-oriented application, Foursquare has evolved to a data and product provider for the enterprise.
The next steps in its evolution will be powered by the Foursquare Graph.
If the name Foursquare rings a bell, it means you were around in the 2010s.
Your only resort to plausible deniability would be if you are a data professional.
In the 2010s, Foursquare was a consumer-oriented mobile application.
The premise was simple.
People would check in at different locations and get gamified rewards.
Their location data would be shared with Foursquare and used for
services such as recommendations. Facebook and Yelp got the lion's share out of that market,
but Foursquare is still around. In addition to having 9 billion plus visits monthly
from 500 million unique devices, Foursquare's data is used to power the likes of Apple,
Uber and Coca-Cola. Today the company announced Foursquare Graph,
what it dubs the industry's first application
of graph technology to geospatial data.
I caught up with Vikram Gundeti,
distinguished engineer at Foursquare,
to learn more about what kind of data Foursquare deals with,
what it does with that data,
and how using graph is going to help.
So: Vikram, Distinguished Engineer at Foursquare, focusing on our enterprise products. I'm primarily focused on driving the tech strategy for our enterprise products.
And Foursquare Graph is one of the key initiatives I've been focused on.
I've been with Foursquare for a little over a year and a half.
And prior to that, I was with Amazon for about 15 years.
I was fortunate enough to be the first engineer on Alexa
and spent about 10 years in that org,
working on different systems within Alexa.
Foursquare is in an interesting space today.
Most people know Foursquare as the app that popularized check-ins back in the 2010s. In the process of building this app, Foursquare accrued a lot of interesting assets. We have a huge database of all places in the real world. And the important distinction from any other places data is that these are moderated by real humans, because, going back to the Foursquare app's origins, all the data was crowdsourced.
And then we have an understanding
of how people move across different places
when they are checking in and providing this information.
We have models that allow us to map any GPS coordinate to the place
that people are in with high certainty, factoring in all the error and noise within the GPS signals as well.
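To make that concrete, here is a minimal sketch of what snap-to-place scoring could look like: rank candidate venues by how plausible the reported fix is, given each venue's position and the fix's accuracy radius. The Gaussian weighting, coordinates and venue names are illustrative assumptions, not Foursquare's actual model.

```python
import math

def snap_score(fix, venue, accuracy_m):
    """Gaussian likelihood of a GPS fix given a venue's location."""
    dist = math.hypot(fix[0] - venue[0], fix[1] - venue[1])  # metres, local planar frame
    return math.exp(-0.5 * (dist / accuracy_m) ** 2)

fix = (0.0, 0.0)  # reported position
candidates = {"cafe": (12.0, 5.0), "gym": (40.0, -30.0)}

# Pick the venue that best explains the fix, given a 15 m accuracy radius.
best = max(candidates, key=lambda name: snap_score(fix, candidates[name], accuracy_m=15.0))
print(best)  # -> cafe
```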
In the late 2010s, Foursquare did a pivot to enterprise and started with two primary data products: Places and Visits. And then we did a bunch of acquisitions, the notable ones being Factual, Placed, and Unfolded. Factual brought in a large dataset of places: where Foursquare had its places dataset created by humans, Factual provided places from digital sources, like websites and other providers. So it was a good blend for that product.
Unfolded is a visualization product that helps visualize
and analyze large scale geospatial data sets.
Placed augmented our business: we have a marketers' business line focused on providing attribution. What that means is, when you run an ad campaign, you want to understand whether people actually ended up going to the store. Foursquare leverages its movement data and provides these metrics, so that people spending on ad campaigns can optimize their spend based on whether the ad is successful or not. So that's where Foursquare
is today.

Great, thanks for the intro. Interesting that you specifically mentioned attribution, because that's something that caught my attention. One of the reasons being that I was actually curious. I mean, I've been covering the broader data and analytics space for a while, and one of the topics that I actually remember, because I've done a few stories on it, was precisely that: attribution, and specifically attribution for radio commercials. For people who are not familiar, and I was also not familiar with that, I was like, okay, so how can you possibly know the real effect of your radio commercial and whether it drives people to do something or not? It turns out that you can do it, and location data actually plays an important part in that. So, yeah, you've already started painting a picture
that has lots of data and actually lots of different data.
You mentioned acquisitions, for example.
That's a typical scenario through which you end up with lots of data sources that aren't necessarily integrated and actually are also hard to integrate.
So maybe that already paves the way towards what you're announcing tomorrow: the Foursquare Graph.
Would you like to kind of briefly talk about what kind of data does Foursquare deal with and what
kind of data infrastructure do you have in place for that? So for example, when those acquisitions
were brought in-house, how did you manage that data integration that you had to do?
Yeah, I think the Foursquare Graph is an outcome of a series of observations: solving specific pain points within our company, and also how customers were using our data,
how much time they were spending before they were able to extract insights
for their business-specific use cases.
So, to talk about the use cases: we had data coming in from three different directions. If you take the Places data, our Foursquare stack was primarily an online stack, based off of Scala web services and MongoDB, and we had an offline stack that was primarily an S3-based data lake with a Hive metastore for all of our pipelines, plus Airflow and Luigi, and we were heavy users of Spark and EMR. And then we brought in the Factual pipeline, the Factual integration, where we got in a lot of data, and that also used a lot of Spark and EMR, but had completely different kinds of principles because of the kind of data they were dealing with, right? For instance, Factual focused primarily on offline data, so digital sources, rather than online sources like the human-generated data at Foursquare. We went through a series of evolutions, a series of steps and incremental lessons, and finally converged some of these places datasets.
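As an aside, the offline stack described here, Spark jobs resolving tables through a Hive metastore over an S3 data lake, might look roughly like the following PySpark sketch; the table and column names are hypothetical, not Foursquare's actual schema.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("places-refresh")
    .enableHiveSupport()  # resolve table names via the Hive metastore
    .getOrCreate()
)

# Read a places table registered in the metastore, backed by S3.
places = spark.table("lake.places")

# Hypothetical transformation: keep only human-verified records.
verified = places.filter(places.moderation_status == "verified")

# Write the result back to the lake as Parquet.
verified.write.mode("overwrite").parquet("s3://example-bucket/places_verified/")
```

A job like this would then be scheduled by an orchestrator such as Airflow or Luigi, as mentioned above.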
But the broader problem we had was: you have the Places dataset, which was in one vertical stack serving a places product. We had the Visits dataset that we are generating, which has a loose dependency on Places, because you need places to snap a visit to a location. And it was in a separate vertical stack. One of the first symptoms that led to this whole evolution and thought process was that we had to launch this attribute called census block: what is the census block where this place is, or where is the census block where this visit was captured, or something like that.
To add that attribute, because these stacks were very siloed, with different technologies and all that, we had to do two times the work. And there were two key lessons that we got out of this exercise. One was that, because these were built as vertical products, whatever is the internal data model also reflects the external data model: if I need to add a census block, I need to actually change the Places data, and similarly on the Visits side. So the first lesson was that we need more composability. Something like the census block should not require us to change the Places stack or the Visits stack. It should be more dynamic, because all it is is derived from the lat-long associated with the place or the visit, and we should be able to compose products. So the whole notion of composability came into play in terms of how we are structuring this.
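For illustration only: a composable enrichment like the census block could be derived from the lat-long with a point-in-polygon join against public census geometries, along these lines. The file paths and the `name` column are hypothetical; `GEOID` is the identifier used in TIGER/Line shapefiles.

```python
import geopandas as gpd

# Census block polygons, e.g. from a public TIGER/Line shapefile.
blocks = gpd.read_file("census_blocks.shp")[["GEOID", "geometry"]]

# Places (or visits), each represented as a point geometry.
places = gpd.read_file("places.geojson")

# Point-in-polygon join: attach the enclosing block's GEOID to each point,
# without touching the places or visits stacks themselves.
enriched = gpd.sjoin(places, blocks, how="left", predicate="within")
print(enriched[["name", "GEOID"]].head())
```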
The other lesson is, I come from a services background, and in services you have interfaces and APIs. In the data world, we haven't been as disciplined about APIs or interfaces in general. And I partly blame that on, well, not blame, but I think it's a natural progression of how the evolution of technologies took place in the 2010s, where there was heavy usage of key-value stores and then MapReduce- and Spark-like technologies, which eliminated the emphasis on "hey, what is the schema of your data?", because you can manage it at the application level: you write adaptations at the application level and you throw more compute at the problem, because the data is stationary and all that, right? Prior to the 2010s, I mean, I'm not going back to relational databases, but at least there was an aspect of relational data which was: think about the model, think about the access patterns, and be very deliberate about how people are using them. So that was missing. And part of the other challenge that we
saw as a result of that was, we had a lot of denormalized copies
of the same data. So it was always difficult to identify
which one is the authoritative source. So the other symptom was: hey, we need to have a common data exchange wherein, if I need access to certain data, I can get it from that exchange as the authoritative source, rather than relying on internal data models.
And this was evident for us.
I'll give one example.
Our places data sets use roads and buildings data from OSM
to improve our lat-long scores.
We make sure if a lat-long is falling
in the middle of a road, we try to figure out
how to snap it to the right location based
on various models.
The same data, if I now want to use it in our visits models, it's a huge pain. Again, I have to ingest it through a different pipeline.
So making sure that whatever data our internal teams produce is accessible in a central exchange
was one other insight that led to, hey,
we need to think about how we have these data
sets organized in a single location.
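A toy illustration of that snapping idea, not Foursquare's actual models: if a fix lands implausibly on a road, move it to the closest point on the nearest building footprint. All geometries and thresholds here are made up.

```python
from shapely.geometry import Point, Polygon, LineString
from shapely.ops import nearest_points

road = LineString([(0, 0), (10, 0)])
buildings = [
    Polygon([(2, 1), (4, 1), (4, 3), (2, 3)]),     # storefront A
    Polygon([(6, -3), (8, -3), (8, -1), (6, -1)]),  # storefront B
]

raw_fix = Point(3, 0.1)  # noisy GPS fix landing on the road

if road.distance(raw_fix) < 0.5:  # implausibly close to the roadway
    # Snap to the closest point on the nearest building footprint.
    nearest = min(buildings, key=lambda b: b.distance(raw_fix))
    snapped, _ = nearest_points(nearest, raw_fix)
    print(snapped)  # POINT (3 1)
```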
Till this point, we talked about composability of data sets, then having a common exchange.
Then the next step came for us when we talked to our customers who are consuming Visits about what they are doing. What they're doing is going out and finding more datasets that they can join our data with, whether it is boundaries, like admin boundaries, like census blocks, demographics, weather, a bunch of different pieces of information. And as a result, purely from a business perspective, when we hand over the data, the time to value for our customers is much longer; the cycle is much longer. So we started thinking about how we can reduce that time to value, in terms of reducing the time they spend before they are able to derive value from our data. From that, and from insights into what kind of analyses our customers are doing, came an idea. We were able to classify those analyses into two key things. One is
what I call a spatial aggregation, which is you want to aggregate the data and the relationships
of different pieces of data across different spatial dimensions.
What I mean is, I can ask questions now with our knowledge graph, because of the way we organized our data, to say: hey, where are people living in this neighborhood typically going for groceries? How far is the typical commute distance? Where are they spending their time? Which categories of places are they spending time at over weekends? And from this you can infer insights. Let's say you're trying to purchase a home and you want to find out if this is a pet-friendly neighborhood: if people in this location are frequenting veterinary hospitals or Petcos or something like that, you can know that, hey, this is kind of a pet-friendly location. So those are the kinds of insights that we could generate by connecting different pieces of data and aggregating across spatial dimensions.
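A minimal sketch of such a spatial aggregation, with an entirely synthetic schema and data: given visits tagged with a home neighborhood and a venue category, profile where residents spend their weekends.

```python
import pandas as pd

visits = pd.DataFrame({
    "home_neighborhood": ["Astoria", "Astoria", "Astoria", "Chelsea"],
    "venue_category":    ["Grocery", "Vet", "Vet", "Grocery"],
    "is_weekend":        [True, True, False, True],
})

# Aggregate weekend visits by neighborhood and venue category.
profile = (
    visits[visits.is_weekend]
    .groupby(["home_neighborhood", "venue_category"])
    .size()
    .rename("visit_count")
    .reset_index()
)
print(profile)  # e.g. how often Astoria residents hit the vet on weekends
```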
Then the other dimension that we found was the time dimension, on which there are two things. One is: how did a specific thing change over time? For instance, if you're looking at a neighborhood, we can look at different snapshots of the data over a period of time to say, hey, the dwell time of people visiting this particular location increased, or the density of visits increased. And that leads to something like: hey, is there a new development there? Is there a new mall here? We can derive those insights. And this is useful for businesses, right? If Amazon wants to open up a new locker, or someone at Starbucks wants to open up a new location, the identification of a new development becomes important in that sense. And the other kind of analysis is point-in-time analysis: we have information about events or concerts happening at places. How
does the visitation of the nearby businesses change when there is a
football game going on? What kind of consumption patterns does that clientele
have, right? Do I need to stock more pizza?
Do I need to stock some other thing or something like that?
That kind of a thing also helps businesses make important decisions.
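A small sketch of that temporal angle: compare dwell-time snapshots month over month and flag venues with a jump. The schema, numbers and the 25% threshold are all made up for illustration.

```python
import pandas as pd

snapshots = pd.DataFrame({
    "venue_id":      ["v1", "v1", "v2", "v2"],
    "month":         ["2023-03", "2023-04", "2023-03", "2023-04"],
    "avg_dwell_min": [12.0, 19.5, 8.0, 8.2],
})

pivot = snapshots.pivot(index="venue_id", columns="month", values="avg_dwell_min")
pivot["pct_change"] = (pivot["2023-04"] - pivot["2023-03"]) / pivot["2023-03"]

# Flag venues whose average dwell time jumped more than 25%:
# a possible signal of a new development nearby.
print(pivot[pivot.pct_change > 0.25])
```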
So these are the dimensions of the problem, both from a customer perspective and an internal acceleration perspective, that led to this idea of the knowledge graph, or the Foursquare Graph.
Great, thanks. And well, to summarize, let's say, what you've mentioned so far,
well, Foursquare does a number of things. First, you have lots of different data sets that
your clients can use. You also have APIs that enable access to those datasets.
And then you also have, well, I would call products that are built on top of the former two.
Plus, you have lots of different datasets sitting in lots of different data management systems. So, to begin with, you wanted to have a unified view across all of your data sets.
And then there's what you referred to in the last part: spatio-temporal analysis and analytics, built on top of the integration of all of that data.
Well, obviously, this is a pain point that many organizations have to deal with. The particular aspect in your case is precisely that you have to deal with this spatio-temporal analysis, which is something
not everyone has to deal with. So I guess the first part, the data integration part,
is what led you towards using knowledge graph technology. And I'm going to ask you about
the specifics of this technology because
as you probably know, and many of the people who are listening also know, when we talk about graphs, like with any other technology, there are lots of different options, different data models, and lots of different things to think about to find the best solution for you. So, the first part of the question I wanted to ask is:
well, what drove the decision to build a knowledge graph
and what kind of choices did you have to make
and what were the parameters that you considered
to arrive at the solution that you finally chose?

So what drove the decision to
build this knowledge graph? I think as I laid out earlier, the initial part of it was how can we
accelerate innovation within the company, right? And because we came together from
different acquisitions, like as a company, there was a lot of data
that was not even discoverable.
For instance, we get a lot of ground truth data
from our mobile SDKs and apps, which
drive our model creation for some of the things
that we do, model training.
If all of that is not accessible in a single location,
then we are losing out on
building more robust models. So part of it was solving the discovery problem and making sure
that we are building off of a common platform, common base. Any exchange of data is happening
through that common base and any publishing of data is also happening through that common base. So that was one key dimension. And to talk about the specifics of the technology,
we did a lot of exploration. I think as soon as we started thinking about this as a graph which has
a lot of different relationships. Some of the relationships that we capture: what are users' home locations, work locations,
what are their tastes,
what are the places they're visiting on a regular basis
and all that, right?
So it gets large very quickly. And then that's coupled with:
oh, we need to get boundaries datasets, we need to get weather datasets, we need to get roads and buildings datasets to augment and improve our current products. We started naively with: hey, this seems like a graph, let's go find some graph databases. And
one of the challenges we found with native graph databases is that they solve the relationship angle of it,
like in terms of being able to connect multiple nodes and traverse that.
But the temporal angle of that is a challenge, right? Like if you want a graph that evolves over time.
And then the other challenge with graph databases is that you introduce a new paradigm of querying, and a new language and all that. When we have to go to different customers, that becomes a bit of a problem: there's a learning curve, and even though once you understand the basics it's easier, it's just a different thing than standard SQL. So we explored options, and what we ended up with is kind of a hybrid model. We found a technology partner we are working with to build the temporal angle, which looks more like a traditional data warehouse, but also to be able to overlay relationships, mine information from relationships, applying graph algorithms and all that. It was an interesting choice, and we will publish a blog post talking through some of those choices we had to make. But it's still early days for us. We have a lot more to discover in terms of what kinds of analyses we want to run, and it's a continuous optimization process for us.
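A rough sketch of that hybrid idea: relationships kept in warehouse-style tables that can be filtered with ordinary predicates on time, with a graph view overlaid only when a graph algorithm is needed. The schema is hypothetical, and networkx merely stands in for whatever engine is actually used.

```python
import pandas as pd
import networkx as nx

edges = pd.DataFrame({
    "src":    ["user1", "user1", "user2"],
    "dst":    ["cafe_a", "gym_b", "cafe_a"],
    "month":  ["2023-04", "2023-04", "2023-03"],
    "visits": [7, 3, 5],
})

# Warehouse-style temporal filter first...
april = edges[edges.month == "2023-04"]

# ...then overlay a graph view and run graph algorithms on the slice.
g = nx.from_pandas_edgelist(april, "src", "dst", edge_attr="visits")
print(nx.degree_centrality(g))
```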
Okay, interesting. So it sounds like, precisely because of your specific needs,
let's say, you ended up with a bespoke solution. And yes, you're right. Both spatial and temporal reasoning
are kind of challenging, to put it mildly,
when using graph.
It's not just when using graph,
but it's especially challenging when using graph.
So I can understand why you ended up with something that sounds not precisely off the shelf, but more like something built for purpose, tailored specifically to your needs, right?
Yep.
Okay.
So you mentioned something about, well, people that use your APIs potentially having to familiarize themselves
with a new paradigm.
So did you actually avoid having to do that?
So your API has not been affected by the new graph that you built?
Yeah. So the graph that we have right now is still internal, powering our existing products; we haven't exposed it externally yet.
So we are at a point where we have a central backbone now
that will augment our existing products.
Whether it is, as I said, in our APIs, in our Places APIs,
we provide Place recommendations.
We now will be able to provide justifications around
why we are recommending this place in a more explainable way
rather than just saying, hey, this has a five-star rating.
We can say, for instance, that people working in this particular locality typically go here for lunch,
which is a much stronger signal than saying, hey,
this restaurant has a four-star rating or something of that sort
in this neighborhood.
So those are the kinds of insights we can surface through our mobile apps. If you take our visual analytics
products, like Studio: one of the big things in Studio, where we have embraced the H3 technology, is that you can join different datasets to derive insights from that.
For instance, one of the common things
that people who are doing spatial analysis do
is site selection.
And you want to build a site suitability score based on joining various different datasets. It could be the demographics of the neighborhood, it could be the density of POIs within that neighborhood, the average dwell time within a particular location, and a bunch of different factors, right?
We can index all the data that we have to H3 cells at different granularities and be able to join it in Studio,
so that you can write these kinds of analytic functions
and visualize them on a map.
So we can build things like a site suitability score based on data derived from our core datasets, mapped to H3 cells at different granularities and aggregated, and visualize that. Then you can overlay them on a map to see the heat map
of, oh, this seems like a good location, or this site has the highest score, and use that to drive analysis.
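For a flavor of the H3 mechanics, here is a toy example using the open source h3-py library (v4 API); the per-cell features and scoring weights are made up for illustration, not Foursquare's methodology.

```python
import h3

# Index a point of interest to an H3 cell at resolution 9
# (hexagons of roughly 0.1 km^2).
cell = h3.latlng_to_cell(40.7128, -74.0060, 9)

# Pretend per-cell features joined in from different datasets.
features = {
    cell: {"poi_density": 120, "avg_dwell_min": 14.0, "median_income": 72_000},
}

def suitability(f):
    """Toy site-suitability score: a weighted sum of normalized features."""
    return (0.5 * f["poi_density"] / 100
            + 0.3 * f["avg_dwell_min"] / 10
            + 0.2 * f["median_income"] / 100_000)

scores = {c: suitability(f) for c, f in features.items()}
print(scores)  # one score per H3 cell, ready to render as a heat map
```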
So that's how we can augment our analytics products. Then, for our data customers who are still using our data directly, one of the things we can do is provide more augmenting information that we can now derive
from this data. For instance, if you take our places datasets: instead of just saying "here are the places", we can add context. We can already today provide the parent relationship with a mall, but now we can say that this place is in a mall that has 50 different locations and is near this intersection. That's the kind of extrapolation of context we can now provide, just by way of combining all these datasets together.
And for our visits customers, we can provide
higher-order insights by way of doing
the spatial aggregations or the temporal analysis.
So it opens us up. The key part of the graph is that it's not an external-facing product; it's more an internal platform that accelerates development, helps us serve existing customers, and creates new segments of customers for us.

Okay, I see. Well, interesting. I picked up a couple of points, both from what you just mentioned in your last reply, but also from some of the things you mentioned previously.
So, for example, you talked about relationships between different data points and actually defining those relationships and how doing that helps you have a better picture. And I think this directly ties to something you just mentioned previously.
So providing justification for, for example, recommendations.
So reasoning.
Those are two of the specific strong points about using knowledge graph technology.
And it seems like you are leveraging those precisely to your needs.
So if you hadn't already mentioned that what you're using is a bespoke solution,
I would be tempted to say that you're probably using some knowledge graph platform or graph
database technology of the sort because these are precisely the things that they let you do.
I think though that we're close to the top of the hour, so we'll have to wrap up.
So the last question I have for you is basically where do you see Foursquare going with that?
So if you can just provide a little bit of a background.
So how has your journey been so far?
So how long did it take you, and what are the next steps in this evolution?

Let's see. Yeah, so
before that, I want to add one point about the knowledge graph which is very important: we are in a very privacy-centric landscape, right? The importance of privacy gets the utmost attention these days.
And location is a key element of that.
What we are doing with our knowledge graph
is also centering all our insights
around locations, not individuals.
In the advertising business, if you want to target a certain class of people today, you organize all your information around individuals and target those individuals. But what we can do with the knowledge graph is identify places where you can find people of certain cohorts at different times, and then change how you advertise. If you have billboards, you can change what you show on that billboard, because there is a certain category of people that will be passing through that location. Those are the kinds of things that we are doing.
And privacy is a key element for us in this whole thing.
In terms of what's ahead for us,
I think what we have been able to do with the graph is build the backbone.
Now we are connecting our different products to improve the experiences of our products.
And the journey till here was the toughest part, right?
Because what it involved and entailed was identification of all the assets and different data sets that we had across the company,
bringing them into a central location, into a common infrastructure, and having people build off of that infrastructure. That was the hard part. Now what we are trying to do is work through the
different products we have and how we can augment the experiences and make those products more
compelling and figure out what new products we could build off of this particular stack.
And the interesting part for us,
I think this has always been one of the exciting parts
for me about Foursquare is all our products
actually help each other in some sense,
like there is a flywheel, continuous flywheel for us.
So for instance, our places data: we get data from our users, whether they're checking in or using our APIs, and that gives feedback that goes into our places data, which gets improved.
Or the ground truth generated goes into our graph,
and that improves our models.
And that, in turn, helps us provide better experiences back to our customers,
whether it is through our APIs or analytics platforms or other things. So that's what I'm
most excited about. We've been able to create this flywheel of products centered around our
knowledge graph and that's an interesting part for us. Great. Well, just to add to what you said, indeed I think the
first step in use cases like yours is usually harder because it also forces
you to work on your internal data structures and basically do like a full
data audit and data catalog of your assets which is not always the easiest
thing to do but as you're
finding out, it usually pays off. So just one extra question to wrap this up. Are you going
to have a period during which your legacy infrastructure and your new Foursquare Graph
infrastructure are going to run in parallel or are you switching directly to graph? I think we're doing this
incrementally. I mean, the first step is getting all the assets into a single location, which we
have done. Now there are teams that will benefit from consuming those assets. An example of this is our visits models: to improve the next iteration of our visits models, we'll be using a lot more data from the graph. So that creates a pattern that automatically eliminates existing sources: if a team had a dependency on a bespoke table, that is going to be removed, and now they're going to take a dependency on the graph. So some of this is going to be incremental,
some of this is going to be driven by new product development, new product features and all. Okay, great. Thanks a lot for this conversation and well,
good luck with everything going forward. Thank you. Thanks a lot. Thanks George. Thanks for
sticking around. For more stories like this, check the link in bio and follow Linked Data Orchestration.