Orchestrate all the Things - Foursquare moves to the future with a Geospatial Knowledge Graph. Featuring Distinguished Engineer Vikram Gundeti
Episode Date: May 11, 2023

From a consumer-oriented application, Foursquare has evolved to a data and product provider for enterprises. The next steps in its evolution will be powered by the Foursquare Graph. If the name Foursquare rings a bell, it means you were around in the 2010s. Your only resort to plausible deniability would be if you are a data professional - although that's not an either/or proposition. In the 2010s, Foursquare was a consumer-oriented mobile application. The premise was simple: people would check in at different locations and get gamified rewards. Their location data would be shared with Foursquare and used for services such as recommendations. Facebook and Yelp got the lion's share of that market, but Foursquare is still around. In addition to having 9 billion-plus visits monthly from 500 million unique devices, Foursquare's data is used to power the likes of Apple, Uber and Coca-Cola. Today the company announced Foursquare Graph, what it dubs the industry's first application of graph technology to geospatial data. I caught up with Vikram Gundeti, Distinguished Engineer at Foursquare, to learn more about what kind of data Foursquare deals with, what it does with that data, and how using graph is going to help. Article published on Orchestrate all the Things
Transcript
Welcome to Orchestrate All the Things. I'm George Anadiotis and we'll be connecting the dots together.
Stories about technology, data, AI and media, and how they flow into each other, shaping our lives.
From a consumer-oriented application, Foursquare has evolved to a data and product provider for the enterprise.
The next steps in its evolution will be powered by the Foursquare Graph.
If the name Foursquare rings a bell, it means you were around in the 2010s.
Your only resort to plausible deniability would be if you are a data professional.
In the 2010s, Foursquare was a consumer-oriented mobile application.
The premise was simple.
People would check in at different locations and get gamified rewards.
Their location data would be shared with Foursquare and used for
services such as recommendations. Facebook and Yelp got the lion's share out of that market,
but Foursquare is still around. In addition to having 9 billion plus visits monthly
from 500 million unique devices, Foursquare's data is used to power the likes of Apple,
Uber and Coca-Cola. Today the company announced Foursquare Graph,
what it dubs the industry's first application
of graph technology to geospatial data.
I caught up with Vikram Gundeti,
distinguished engineer at Foursquare,
to learn more about what kind of data Foursquare deals with,
what it does with that data,
and how using graph is going to help.
So: Vikram, Distinguished Engineer at Foursquare, focusing on our enterprise products. I'm primarily focused on driving the tech strategy for our enterprise products.
And Foursquare Graph is one of the key initiatives I've been focused on.
I've been with Foursquare for a little over a year and a half.
And prior to that, I was with Amazon for about 15 years.
I was fortunate enough to be the first engineer on Alexa
and spent about 10 years in that org,
working on different systems within Alexa.
Foursquare is in an interesting space today.
Most people know Foursquare as the app that popularized check-ins back in the 2010s. In the process of building this app, Foursquare accrued a lot of interesting assets. We have a huge database of all places in the real world. And the important distinction from any other places data is that these are moderated by real humans, because, going back to the Foursquare app's origins, all the data was crowdsourced.
And then we have an understanding
of how people move across different places
when they are checking in and providing this information.
We have models that allow us to map any GPS coordinate to the place
that people are in with high certainty, factoring in all the error and noise within the GPS signals as well.
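To make that concrete, here is a minimal sketch of what snap-to-place scoring could look like: rank candidate venues by how plausible the reported fix is, given each venue's position and the fix's accuracy radius. The Gaussian weighting, coordinates and venue names are illustrative assumptions, not Foursquare's actual model.

```python
import math

def snap_score(fix, venue, accuracy_m):
    """Gaussian likelihood of a GPS fix given a venue's location."""
    dist = math.hypot(fix[0] - venue[0], fix[1] - venue[1])  # metres, local planar frame
    return math.exp(-0.5 * (dist / accuracy_m) ** 2)

fix = (0.0, 0.0)  # reported position
candidates = {"cafe": (12.0, 5.0), "gym": (40.0, -30.0)}

# Pick the venue that best explains the fix, given a 15 m accuracy radius.
best = max(candidates, key=lambda name: snap_score(fix, candidates[name], accuracy_m=15.0))
print(best)  # -> cafe
```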
In the late 2010s, Foursquare did a pivot to enterprise and started with two primary data products: Places and Visits. And then we did a bunch of acquisitions, the notable ones being Factual, Placed, and Unfolded. Factual brought in a large dataset of places: where Foursquare had its places dataset created by humans, Factual provided places from digital sources, like websites and other providers. So it was a good blend for that product.
Unfolded is a visualization product that helps visualize
and analyze large scale geospatial data sets.
Placed augmented our business: we have a marketers' business line focused on providing attribution. What that means is, when you run an ad campaign, you want to understand whether people actually ended up going to the store. Foursquare leverages its movement data and provides these metrics, so that people spending on ad campaigns can optimize their spend based on whether the ad is successful or not. So that's where Foursquare
is today.

Great, thanks for the intro. Interesting that you specifically mentioned attribution, because that's something that caught my attention. One of the reasons being that I was actually curious. I mean, I've been covering the broader data and analytics space for a while, and one of the topics that I actually remember, because I've done a few stories on it, was precisely that: attribution, and specifically attribution for radio commercials. For people who are not familiar, and I was also not familiar with that, I was like, okay, so how can you possibly know the real effect of your radio commercial and whether it drives people to do something or not? It turns out that you can do it, and location data actually plays an important part in that. So, yeah, you've already started painting a picture
that has lots of data and actually lots of different data.
You mentioned acquisitions, for example.
That's a typical scenario through which you end up with lots of data sources that aren't necessarily integrated and actually are also hard to integrate.
So maybe that already paves the way towards what you're announcing tomorrow: the Foursquare Graph.
Would you like to kind of briefly talk about what kind of data does Foursquare deal with and what
kind of data infrastructure do you have in place for that? So for example, when those acquisitions
were brought in-house, how did you manage that data integration that you had to do?
Yeah, I think the Foursquare Graph is an outcome of a series of observations: solving specific pain points within our company, and also how customers were using our data,
how much time they were spending before they were able to extract insights
for their business-specific use cases.
So, to talk about the use cases: we had data coming in from three different directions. If you take the Places data, our Foursquare stack was primarily an online stack, based off of Scala web services and MongoDB, and we had an offline stack that was primarily an S3-based data lake with a Hive metastore for all of our pipelines, plus Airflow and Luigi, and we were heavy users of Spark and EMR. And then we brought in the Factual pipeline, the Factual integration, where we got in a lot of data, and that also used a lot of Spark and EMR, but had completely different kinds of principles because of the kind of data they were dealing with, right? For instance, Factual focused primarily on offline data, so digital sources, rather than online sources like the human-generated data at Foursquare. We went through a series of evolutions, a series of steps and incremental lessons, and finally converged some of these places datasets.
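As an aside, the offline stack described here, Spark jobs resolving tables through a Hive metastore over an S3 data lake, might look roughly like the following PySpark sketch; the table and column names are hypothetical, not Foursquare's actual schema.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("places-refresh")
    .enableHiveSupport()  # resolve table names via the Hive metastore
    .getOrCreate()
)

# Read a places table registered in the metastore, backed by S3.
places = spark.table("lake.places")

# Hypothetical transformation: keep only human-verified records.
verified = places.filter(places.moderation_status == "verified")

# Write the result back to the lake as Parquet.
verified.write.mode("overwrite").parquet("s3://example-bucket/places_verified/")
```

A job like this would then be scheduled by an orchestrator such as Airflow or Luigi, as mentioned above.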
But the broader problem we had was: you have the Places dataset, which was in one vertical stack serving a places product. We had the Visits dataset that we are generating, which has a loose dependency on Places, because you need places to snap a visit to a location. And it was in a separate vertical stack. One of the first symptoms that led to this whole evolution and thought process was that we had to launch this attribute called census block: what is the census block where this place is, or where is the census block where this visit was captured, or something like that.
To add that attribute, because these stacks were very siloed, with different technologies and all that, we had to do two times the work. And there were two key lessons that we got out of this exercise. One was that, because these were built as vertical products, whatever is the internal data model also reflects the external data model: if I need to add a census block, I need to actually change the Places data, and similarly on the Visits side. So the first lesson was that we need more composability. Something like the census block should not require us to change the Places stack or the Visits stack. It should be more dynamic, because all it is is derived from the lat-long associated with the place or the visit, and we should be able to compose products. So the whole notion of composability came into play in terms of how we are structuring this.
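For illustration only: a composable enrichment like the census block could be derived from the lat-long with a point-in-polygon join against public census geometries, along these lines. The file paths and the `name` column are hypothetical; `GEOID` is the identifier used in TIGER/Line shapefiles.

```python
import geopandas as gpd

# Census block polygons, e.g. from a public TIGER/Line shapefile.
blocks = gpd.read_file("census_blocks.shp")[["GEOID", "geometry"]]

# Places (or visits), each represented as a point geometry.
places = gpd.read_file("places.geojson")

# Point-in-polygon join: attach the enclosing block's GEOID to each point,
# without touching the places or visits stacks themselves.
enriched = gpd.sjoin(places, blocks, how="left", predicate="within")
print(enriched[["name", "GEOID"]].head())
```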
The other lesson is, I come from a services background, and in services you have interfaces and APIs. In the data world, we haven't been as disciplined about APIs or interfaces in general. And I partly blame that on, well, not blame, but I think it's a natural progression of how the evolution of technologies took place in the 2010s, where there was heavy usage of key-value stores and then MapReduce- and Spark-like technologies, which eliminated the emphasis on "hey, what is the schema of your data?", because you can manage it at the application level: you write adaptations at the application level and you throw more compute at the problem, because the data is stationary and all that, right? Prior to the 2010s, I mean, I'm not going back to relational databases, but at least there was an aspect of relational data which was: think about the model, think about the access patterns, and be very deliberate about how people are using them. So that was missing. And part of the other challenge that we
saw as a result of that was, we had a lot of denormalized copies
of the same data. So it was always difficult to identify
which one is the authoritative source. So the other symptom was: hey, we need to have a common data exchange wherein, if I need access to certain data, I can get it from that exchange as the authoritative source, rather than relying on internal data models.
And this was evident for us.
I'll give one example.
Our places data sets use roads and buildings data from OSM
to improve our lat-long scores.
We make sure if a lat-long is falling
in the middle of a road, we try to figure out
how to snap it to the right location based
on various models.
The same data, if I now want to use it in our visits models, it's a huge pain. Again, I have to ingest it through a different pipeline.
So making sure that whatever data our internal teams produce is accessible in a central exchange
was one other insight that led to, hey,
we need to think about how we have these data
sets organized in a single location.
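A toy illustration of that snapping idea, not Foursquare's actual models: if a fix lands implausibly on a road, move it to the closest point on the nearest building footprint. All geometries and thresholds here are made up.

```python
from shapely.geometry import Point, Polygon, LineString
from shapely.ops import nearest_points

road = LineString([(0, 0), (10, 0)])
buildings = [
    Polygon([(2, 1), (4, 1), (4, 3), (2, 3)]),     # storefront A
    Polygon([(6, -3), (8, -3), (8, -1), (6, -1)]),  # storefront B
]

raw_fix = Point(3, 0.1)  # noisy GPS fix landing on the road

if road.distance(raw_fix) < 0.5:  # implausibly close to the roadway
    # Snap to the closest point on the nearest building footprint.
    nearest = min(buildings, key=lambda b: b.distance(raw_fix))
    snapped, _ = nearest_points(nearest, raw_fix)
    print(snapped)  # POINT (3 1)
```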
Till this point, we talked about composability of data sets, then having a common exchange.
Then the next step came for us when we talked to our customers who are consuming Visits about what they are doing. What they're doing is going out and finding more datasets that they can join our data with, whether it is boundaries, like admin boundaries, like census blocks, demographics, weather, a bunch of different pieces of information. And as a result, purely from a business perspective, when we hand over the data, the time to value for our customers is much longer; the cycle is much longer. So we started thinking about how we can reduce that time to value, in terms of reducing the time they spend before they are able to derive value from our data. From that, and from insights into what kind of analyses our customers are doing, came an idea. We were able to classify those analyses into two key things. One is
what I call a spatial aggregation, which is you want to aggregate the data and the relationships
of different pieces of data across different spatial dimensions.
What I mean is, I can ask questions now with our knowledge graph, because of the way we organized our data, to say: hey, where are people living in this neighborhood typically going for groceries? How far is the typical commute distance? Where are they spending their time? Which categories of places are they spending time at over weekends? And from this you can infer insights. Let's say you're trying to purchase a home and you want to find out if this is a pet-friendly neighborhood: if people in this location are frequenting veterinary hospitals or Petcos or something like that, you can know that, hey, this is kind of a pet-friendly location. So those are the kinds of insights that we could generate by connecting different pieces of data and aggregating across spatial dimensions.
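A minimal sketch of such a spatial aggregation, with an entirely synthetic schema and data: given visits tagged with a home neighborhood and a venue category, profile where residents spend their weekends.

```python
import pandas as pd

visits = pd.DataFrame({
    "home_neighborhood": ["Astoria", "Astoria", "Astoria", "Chelsea"],
    "venue_category":    ["Grocery", "Vet", "Vet", "Grocery"],
    "is_weekend":        [True, True, False, True],
})

# Aggregate weekend visits by neighborhood and venue category.
profile = (
    visits[visits.is_weekend]
    .groupby(["home_neighborhood", "venue_category"])
    .size()
    .rename("visit_count")
    .reset_index()
)
print(profile)  # e.g. how often Astoria residents hit the vet on weekends
```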
Then the other dimension that we found was the time dimension, on which there are two things. One is: how did a specific thing change over time? For instance, if you're looking at a neighborhood, we can look at different snapshots of the data over a period of time to say, hey, the dwell time of people visiting this particular location increased, or the density of visits increased. And that leads to something like: hey, is there a new development there? Is there a new mall here? We can derive those insights. And this is useful for businesses, right? If Amazon wants to open up a new locker, or someone at Starbucks wants to open up a new location, the identification of a new development becomes important in that sense. And the other kind of analysis is point-in-time analysis: we have information about events or concerts happening at places. How
does the visitation of the nearby businesses change when there is a
football game going on? What kind of consumption patterns does that clientele
have, right? Do I need to stock more pizza?
Do I need to stock some other thing or something like that?
That kind of a thing also helps businesses make important decisions.
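A small sketch of that temporal angle: compare dwell-time snapshots month over month and flag venues with a jump. The schema, numbers and the 25% threshold are all made up for illustration.

```python
import pandas as pd

snapshots = pd.DataFrame({
    "venue_id":      ["v1", "v1", "v2", "v2"],
    "month":         ["2023-03", "2023-04", "2023-03", "2023-04"],
    "avg_dwell_min": [12.0, 19.5, 8.0, 8.2],
})

pivot = snapshots.pivot(index="venue_id", columns="month", values="avg_dwell_min")
pivot["pct_change"] = (pivot["2023-04"] - pivot["2023-03"]) / pivot["2023-03"]

# Flag venues whose average dwell time jumped more than 25%:
# a possible signal of a new development nearby.
print(pivot[pivot.pct_change > 0.25])
```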
So these are the dimensions of the problem, both from a customer perspective and an internal acceleration perspective, that led to this idea of the knowledge graph, or the Foursquare Graph.
Great, thanks. And well, to summarize, let's say, what you've mentioned so far,
well, Foursquare does a number of things. First, you have lots of different data sets that
your clients can use. You also have APIs that enable access to those datasets.
And then you also have, well, I would call products that are built on top of the former two.
Plus, you have lots of different datasets sitting in lots of different data management systems. So, to begin with, you wanted to have a unified view across all of your data sets.
And then there's what you referred to in the last part: spatio-temporal analysis and analytics, built on top of the integration of all of that data.
Well, obviously, this is a pain point that many organizations have to deal with. The particular aspect in your case is precisely that you have to deal with this spatio-temporal analysis, which is something
not everyone has to deal with. So I guess the first part, the data integration part,
is what led you towards using knowledge graph technology. And I'm going to ask you about
the specifics of this technology because
as you probably know, and many of the people who are listening also know, when we talk about graphs, like with any other technology, there are lots of different options, different data models, and lots of different things to think about to find the best solution for you. So, the first part of the question I wanted to ask is:
well, what drove the decision to build a knowledge graph
and what kind of choices did you have to make
and what were the parameters that you considered
to arrive at the solution that you finally chose?

So what drove the decision to
build this knowledge graph? I think as I laid out earlier, the initial part of it was how can we
accelerate innovation within the company, right? And because we came together from
different acquisitions, like as a company, there was a lot of data
that was not even discoverable.
For instance, we get a lot of ground truth data
from our mobile SDKs and apps, which
drive our model creation for some of the things
that we do, model training.
If all of that is not accessible in a single location,
then we are losing out on
building more robust models. So part of it was solving the discovery problem and making sure
that we are building off of a common platform, common base. Any exchange of data is happening
through that common base and any publishing of data is also happening through that common base. So that was one key dimension. And to talk about the specifics of the technology,
we did a lot of exploration. I think as soon as we started thinking about this as a graph which has
a lot of different relationships. Some of the relationships that we capture: what are users' home locations, work locations,
what are their tastes,
what are the places they're visiting on a regular basis
and all that, right?
So it gets large very quickly. And then that's coupled with:
oh, we need to get boundaries datasets, we need to get weather datasets, we need to get roads and buildings datasets to augment and improve our current products. We started naively with: hey, this seems like a graph, let's go find some graph databases. And
one of the challenges we found with native graph databases is that they solve the relationship angle of it,
like in terms of being able to connect multiple nodes and traverse that.
But the temporal angle of that is a challenge, right? Like if you want a graph that evolves over time.
And then the other challenge with graph databases is that you introduce a new paradigm of querying, and a new language and all that. When we have to go to different customers, that becomes a bit of a problem: there's a learning curve, and even though once you understand the basics it's easier, it's just a different thing than standard SQL. So we explored options, and what we ended up with is kind of a hybrid model. We found a technology partner we are working with to build the temporal angle, which looks more like a traditional data warehouse, but also to be able to overlay relationships, mine information from relationships, applying graph algorithms and all that. It was an interesting choice, and we will publish a blog post talking through some of those choices we had to make. But it's still early days for us. We have a lot more to discover in terms of what kinds of analyses we want to run, and it's a continuous optimization process for us.
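A rough sketch of that hybrid idea: relationships kept in warehouse-style tables that can be filtered with ordinary predicates on time, with a graph view overlaid only when a graph algorithm is needed. The schema is hypothetical, and networkx merely stands in for whatever engine is actually used.

```python
import pandas as pd
import networkx as nx

edges = pd.DataFrame({
    "src":    ["user1", "user1", "user2"],
    "dst":    ["cafe_a", "gym_b", "cafe_a"],
    "month":  ["2023-04", "2023-04", "2023-03"],
    "visits": [7, 3, 5],
})

# Warehouse-style temporal filter first...
april = edges[edges.month == "2023-04"]

# ...then overlay a graph view and run graph algorithms on the slice.
g = nx.from_pandas_edgelist(april, "src", "dst", edge_attr="visits")
print(nx.degree_centrality(g))
```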
Okay, interesting. So it sounds like, precisely because of your specific needs,
let's say, you ended up with a bespoke solution. And yes, you're right. Both spatial and temporal reasoning
are kind of challenging, to put it mildly,
when using graph.
It's not just when using graph,
but it's especially challenging when using graph.
So I can understand why you ended up with something that sounds not precisely off the shelf, but more like something built for purpose, tailored specifically to your needs, right?
Yep.
Okay.
So you mentioned something about, well, people that use your APIs potentially having to familiarize themselves
with a new paradigm.
So did you actually avoid having to do that?
So your API has not been affected by the new graph that you built?
Yeah. So the graph that we have right now is still internal, powering our existing products; we haven't exposed it externally yet.
So we are at a point where we have a central backbone now
that will augment our existing products.
Whether it is, as I said, in our APIs, in our Places APIs,
we provide Place recommendations.
We now will be able to provide justifications around
why we are recommending this place in a more explainable way
rather than just saying, hey, this has a five-star rating.
We can say, for instance, that people working in this particular locality typically go here for lunch,
which is a much stronger signal than saying, hey,
this restaurant has a four-star rating or something of that sort
in this neighborhood.
So those are the kinds of insights we can surface through our mobile apps. If you take our visual analytics
products, like Studio: one of the big things in Studio, where we have embraced the H3 technology, is that you can join different datasets to derive insights from that.
For instance, one of the common things
that people who are doing spatial analysis do
is site selection.
And you want to build a site suitability score based on joining various different datasets. It could be the demographics of the neighborhood, it could be the density of POIs within that neighborhood, the average dwell time within a particular location, and a bunch of different factors, right?
We can index all the data that we have to H3 cells at different granularities and be able to join it in Studio,
so that you can write these kinds of analytic functions
and visualize them on a map.
So we can build things like a site suitability score based on data derived from our core datasets, mapped to H3 cells at different granularities and aggregated, and visualize that. Then you can overlay them on a map to see the heat map
of, oh, this seems like a good location, or this site has the highest score, and use that to drive analysis.
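For a flavor of the H3 mechanics, here is a toy example using the open source h3-py library (v4 API); the per-cell features and scoring weights are made up for illustration, not Foursquare's methodology.

```python
import h3

# Index a point of interest to an H3 cell at resolution 9
# (hexagons of roughly 0.1 km^2).
cell = h3.latlng_to_cell(40.7128, -74.0060, 9)

# Pretend per-cell features joined in from different datasets.
features = {
    cell: {"poi_density": 120, "avg_dwell_min": 14.0, "median_income": 72_000},
}

def suitability(f):
    """Toy site-suitability score: a weighted sum of normalized features."""
    return (0.5 * f["poi_density"] / 100
            + 0.3 * f["avg_dwell_min"] / 10
            + 0.2 * f["median_income"] / 100_000)

scores = {c: suitability(f) for c, f in features.items()}
print(scores)  # one score per H3 cell, ready to render as a heat map
```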
So that's how we can augment our analytics products. Then, for our data customers who are still using our data directly, one of the things we can do is provide more augmenting information that we can now derive
from this data. For instance, if you take our places datasets: instead of just saying "here are the places", we can add context. We can already today provide the parent relationship with a mall, but now we can say that this place is in a mall that has 50 different locations and is near this intersection. That's the kind of extrapolation of context we can now provide, just by way of combining all these datasets together.
And for our visits customers, we can provide
higher-order insights by way of doing
the spatial aggregations or the temporal analysis.
So it opens us up. The key part of the graph is that it's not an external-facing product; it's more an internal platform that accelerates development, helps us serve existing customers, and creates new segments of customers for us.

Okay, I see. Well, interesting. I picked up a couple of points, both from what you just mentioned in your last reply, but also from some of the things you mentioned previously.
So, for example, you talked about relationships between different data points and actually defining those relationships and how doing that helps you have a better picture. And I think this directly ties to something you just mentioned previously.
So providing justification for, for example, recommendations.
So reasoning.
Those are two of the specific strong points about using knowledge graph technology.
And it seems like you are leveraging those precisely to your needs.
So if you hadn't already mentioned that what you're using is a bespoke solution,
I would be tempted to say that you're probably using some knowledge graph platform or graph
database technology of the sort because these are precisely the things that they let you do.
I think though that we're close to the top of the hour, so we'll have to wrap up.
So the last question I have for you is basically where do you see Foursquare going with that?
So if you can just provide a little bit of a background.
So how has your journey been so far?
So how long did it take you, and what are the next steps in this evolution?

Let's see. Yeah, so
before that, I want to add one point about the knowledge graph which is very important: we are in a very privacy-centric landscape, right? The importance of privacy gets the utmost attention these days.
And location is a key element of that.
What we are doing with our knowledge graph
is also centering all our insights
around locations, not individuals.
In the advertising business, if you want to target a certain class of people today, you organize all your information around individuals and target those individuals. But what we can do with the knowledge graph is identify places where you can find people of certain cohorts at different times, and then change how you advertise. If you have billboards, you can change what you show on that billboard, because there is a certain category of people that will be passing through that location. Those are the kinds of things that we are doing.
And privacy is a key element for us in this whole thing.
In terms of what's ahead for us,
I think what we have been able to do with the graph is build the backbone.
Now we are connecting our different products to improve the experiences of our products.
And the journey till here was the toughest part, right?
Because what it involved and entailed was identification of all the assets and different data sets that we had across the company,
bringing them into a central location, into a common infrastructure, and having people build off of that infrastructure. That was the hard part. Now what we are trying to do is work through the
different products we have and how we can augment the experiences and make those products more
compelling and figure out what new products we could build off of this particular stack.
And the interesting part for us,
I think this has always been one of the exciting parts
for me about Foursquare is all our products
actually help each other in some sense,
like there is a flywheel, continuous flywheel for us.
So for instance, our places data: we get data from our users, whether they're checking in or using our APIs, and that gives feedback that goes into our places data, which gets improved.
Or the ground truth generated goes into our graph,
and that improves our models.
And that, in turn, helps us provide better experiences back to our customers,
whether it is through our APIs or analytics platforms or other things. So that's what I'm
most excited about. We've been able to create this flywheel of products centered around our
knowledge graph and that's an interesting part for us. Great. Well, just to add to what you said, indeed I think the
first step in use cases like yours is usually harder because it also forces
you to work on your internal data structures and basically do like a full
data audit and data catalog of your assets which is not always the easiest
thing to do but as you're
finding out, it usually pays off. So just one extra question to wrap this up. Are you going
to have a period during which your legacy infrastructure and your new Foursquare Graph
infrastructure are going to run in parallel or are you switching directly to graph? I think we're doing this
incrementally. I mean, the first step is getting all the assets into a single location, which we
have done. Now there are teams that will benefit from consuming those assets. An example of this is our visits models: to improve the next iteration of our visits models, we'll be using a lot more data from the graph. So that creates a pattern that automatically eliminates existing sources: if a team had a dependency on a bespoke table, that is going to be removed, and now they're going to take a dependency on the graph. So some of this is going to be incremental,
some of this is going to be driven by new product development, new product features and all. Okay, great. Thanks a lot for this conversation and well,
good luck with everything going forward. Thank you. Thanks a lot. Thanks George. Thanks for
sticking around. For more stories like this, check the link in bio and follow Linked Data Orchestration.