The Data Stack Show - 42: Scaling Data Science with Ryan Boyer of Shipt
Episode Date: June 30, 2021

Highlights from this week's episode include:

- Ryan's full circle path from stocking shelves at Target to using data science for a company owned by Target (2:00)
- Building great tools and wielding them effectively (5:04)
- Changes at Shipt since being acquired (9:29)
- How people's bias impacts models built by data scientists (12:30)
- The different data sources Shipt incorporates (22:02)
- How Ryan's work as a data scientist has changed as Shipt has grown (25:29)
- How data science helps marketing (31:38)
- Improving search experience (34:23)
- Shipt's evolving data stack (38:27)
- New trends in data science (47:06)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
We have a guest on the show today I'm particularly excited to talk to because I'm a customer
of theirs and it's the company Shipt.
They do grocery delivery and now all sorts of other stuff.
And actually, we've been customers in our household for a long time.
So I remember when they got acquired by Target.
And one thing that I'm really interested to ask Ryan about,
who's a data scientist at Shipt, is just around the complexity.
So if you open up the Shipt app and use it, there's so much going on there, even just
from the consumer side.
And I can't imagine the challenge of dealing with all the different data sets that they
have in terms of building models and sort of just managing the entire data science practice.
So complexity is my burning question.
Costas?
Data.
I'm pretty sure they have like a lot,
a lot of data that they're working with.
And I want to see both how they grew
from the first days until today
in terms of like the data itself
and the infrastructure behind it.
And what are the challenges around that?
And keep in mind that we are talking about a marketplace here, which always complicates things, although we tend to see only one side, the side of the marketplace that we are part of. So I'm pretty sure that he will have very interesting information to share about how important data is in growing marketplaces.
Absolutely. Well, let's jump in and talk with Ryan Boyer from the data science team at Shipt.
All right, Ryan Boyer, welcome to the Data Stack Show. Thank you so much for having me.
I'm really excited to be here.
Oh man, we have so many things to ask you about, but why don't you just give us a brief
background?
So where did your career start out, and then what was the pathway that led you to data science at Shipt?
Yeah, this is a great story.
So I got a math degree at Clemson University for my undergrad, and then, as opposed to going to grad school like I initially planned, I upped and moved to Bozeman, Montana, where I became a ski bum, a very terrible ski bum, and stocked store shelves at Target for about a year. Six years later, here I am working at a company owned by Target. So it's very much come full circle.
How I got into data science and how I specifically ended up at Shipt is a little more direct.
After learning that I was a bad ski bum and wanted to use my brain a lot more, I went
back to grad school, got a degree in systems and information engineering, focusing a lot
on data science, math, statistics, and then ended up in Birmingham, Alabama, because it was where my wife grew up, and wanted to do data science in a small Southern town. And there were really only one or two options. And I got lucky and joined what was, at that point in time, a decent-size startup named Shipt. And it has been rocket-ship growth ever since then. I was the third person to hold the title of data scientist, and now I'm on a team with, I think, 50 people in the data science organization, and we're always hiring as far as I can tell. So it's been a lot of fun.
Yeah, that's great. We encourage our guests
to tell our audience when they're hiring. And it seems like data science and data engineering roles are just in huge demand.
One question for you.
So, and this is, I just love the story of you stocking shelves while being a ski bum.
Did that influence sort of the way that you thought about solving problems around stocking
and in-stock items when you were working on that from an
actual data science standpoint at Shipt? I would say it certainly helped, right? Like I understood
that Target doesn't just get one truck a week. You know, there's lots of trucks a week that come
at different times. And so there was some like domain expertise I could bring to that problem.
But I would say that the bigger thing that I learned,
honestly, throughout all of my undergrad career,
and especially through my time as being a poor ski bum,
poor in the sense I wasn't very good at it,
was like how central people are to data science, right?
And so I would say that really has kind of been like the key driving thing for me as a data scientist
is how can I make a model or systems that work for people and with people?
So interesting. Can you give us just one example of sort of what it looks like to go from model to individual in some of the work that you've done?
Like, just a practical example for our listeners?
Yeah, so I will say this is probably the hardest thing in data science, in my opinion: managing that stage of a project.
So we can talk about the out-of-stocks.
We can get into the model more later
if y'all are interested.
Basically at Shipt, I built a model
that predicts whether a product
is out-of-stock in real time.
You know, from a data science side, they get a score between zero and one, basically a
probability of it being out of stock.
That's great.
And I can do metrics about how effective that is and all kinds of things.
But the question is, like, how can I use that to improve the lives of Shipt's shoppers or Shipt's members, who are either picking the groceries in the store or ordering them on our e-commerce app? And so there's a lot of discussion about how to do that
well and effectively and manage their expectations. I personally believe that data science models
should be thought of as tools and not solutions. Like no one looks at a house and then thanks the
hammer, right? They thank the carpenter who used the hammer to build the house well, right?
But I feel like in data science, it's like, oh man, deep learning, neural networks, gradient boosted trees. Like we can solve all of our problems with these cool tools. And I would say, no, you can build lots of great tools that can be really effective at your problem, but you still need to wield them effectively.
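To make that concrete, here is a minimal sketch, in Python, of the kind of tool Ryan is describing: a classifier whose output is a probability between zero and one that people and systems downstream still have to wield well. The features, data, and model choice are invented for illustration and are not Shipt's actual implementation.

```python
# A hypothetical out-of-stock scorer; everything here is illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Invented features: hours since the item last sold at this store,
# recent substitution rate, and a day-of-week demand signal.
X_train = np.array([
    [2.0, 0.05, 0.8],
    [48.0, 0.60, 0.2],
    [12.0, 0.30, 0.5],
    [72.0, 0.75, 0.1],
])
y_train = np.array([0, 1, 0, 1])  # 1 = the item turned out to be out of stock

model = GradientBoostingClassifier().fit(X_train, y_train)

# Downstream systems get a probability between 0 and 1, a tool for
# shoppers and apps to act on, not a final answer.
score = model.predict_proba(np.array([[36.0, 0.50, 0.3]]))[0, 1]
print(f"P(out of stock) = {score:.2f}")
```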
Yeah, I think we see a pattern here we've seen before, right? Like, actually two patterns. One is how you actually productize data science, which is not that obvious in the end. And we will have the opportunity, I think, to discuss more on this today. And the other thing, Eric, that we have talked about a lot before is that data science, machine learning, and all that stuff, it's more of a tool; they augment, and they have to work together with people. It's not a replacement for the stuff that we are doing as people. So I'm pretty sure that you remember all these discussions we had in the past and other episodes around that.
Absolutely. Yeah. It's been a huge thing with data science.
Actually, we've had multiple data scientists on the show, and it's been really encouraging that the most common perspective is that the human element of data science is the key determinant in whether data science is actually effective or successful.
Yeah, I don't think we are going to see the Terminator anytime soon.
No, me either. Yeah, I would also say the human element is important on both ends, right? Like, getting data science to production, it really matters that you have a company and a culture who is bought in, willing to invest, willing to work with you, and willing to buy into the vision of how a data science tool can be used. So there's that front-end part of data science being successful. And then there's the part of just, can you build a data science model that affects your business, or the people who use your business, in a way that supports and helps them, as opposed to just trying to take away all control and what they like about the business in order to give them the best outcomes.
Sure.
We had someone on the show who talked a lot about AI and how people sometimes can have a fear of AI and blame the technology. And he said, if you see sort of negative results from that, you have to remember there's a human behind it, whether it's building a model or approving it, which was very thought-provoking.
Yeah. I mean, I would also say sometimes us people who are building things can miss things too. And so there are sometimes mistakes, but I don't think that's an excuse for a data scientist or a data science professional to build something that is manipulative. But yeah, there are people building things, and in my opinion, we'll be building things for a long time.
Sure. Yeah. Well, I want to get
into the technical side of things. And I know Costas has a bunch of questions there, but
I think as a segue getting into
that, one thing that would be really interesting to hear about is you joining Shipt pre-acquisition, the third person on the team with the title of data scientist, and you're going to have a huge team by the end of the year. I'd love to know what has changed significantly post-acquisition and what hasn't changed that much.
Yeah. So like you said, I was the third person to hold the title of data scientist when I joined. Our data science team was like five people and a manager. We're like 50-ish people now, and we'll probably be a hundred at the end of the year if we can find the talent we need. So feel free to apply if you're interested. The main thing that I would say is that change is just constant at a
company that is growing as fast as Shipt has been. And that is the truth for how we do data science.
When I first joined, we were very scrappy and had little oversight. And it was kind of awesome, because it was just like, I'd write some code and be like, cool, do you like it? And you'd say, yeah. And I'd be like, okay. And we'd roll it out, you know, we'd deploy it and we'd see what happens. And we'd learn from that.
Of course, now we are much bigger and much more at scale, and we have a much more rigorous system for deploys and reviews and checks, and for understanding how it's going to affect things. But there still is, in my opinion, this desire to learn through experimentation, to learn as much as we can and to go as fast as we can, with a little more cautiousness as well. So I would say a lot has changed, but a lot has stayed the same.
That's really encouraging to hear that,
to hear that you still feel like the startup mindset
and agility is still there
because that's often something you hear people
sort of bemoan post-acquisition is,
you know, we're part of a big company now
and it feels like we're part of a big company. But that's really encouraging.
Yeah. I would say it's gotten harder to be as agile.
Like, you know, no one told me I couldn't do anything back in the day, and I just did things. Now we have people who are, you know, trying to figure out what's best, and there is, you know, a desire to move the large ship in one direction. But I do feel that data science is something that you're just going to fail at a lot of times. Like, you're going to build models that are not going to work. You're going to run a statistical analysis and you're not going to find anything. And if you lose that ability to learn by doing, like not starting a project until everyone's on board with it, or sometimes not even running a model in a small production test until you feel very confident about how it's going to behave, you're going to have a lot of trouble moving quickly in the data science space.
This is great, Ryan.
I want to go back a little bit in our conversation, to the part where we were all discussing AI, machine learning, and the impact that it has on our lives.
And I want to ask you about something very specific, and this is bias.
So I want to hear from you.
First of all, help us understand a little bit better how bias is introduced, how it is, let's say, represented or goes into the models. And based on your experience, what kind of impact can bias have on the end user of any model that a data scientist can build?
Yeah. So just to confirm, you're talking about
what I would call people bias, as opposed to statistical bias, the mathematical term, correct?
Yes, yes.
Okay, okay. Just making sure. I would be surprised, and I would be like, I don't know, it's been a long time since I thought about things at that statistical level.
Yeah, yeah.
Bias, and I believe this as a person before we get to me as a data scientist, bias to me is just something that is innate to the human experience, right? Like, you don't know what you don't know, and it's really hard to understand what you don't know. And to me,
a lot of the ways that bias enters a modeling process or an analytical process is through that
unknown. You're unaware that the sample of your data set only represents people from the Southeast, or you're unaware of something like that. And then in that process, you end up building a model that may be biased towards a certain member or customer type or segment of your business. Like, one of the classic ones you hear about in banking is using zip codes, and zip codes end up being racially discriminatory, because if your model ends up, you know, not giving someone a loan because of their zip code, it can often be that the zip code is predominantly a certain race, and you end up having bias built into the model. So as a data scientist, our job is, in my mind, to identify representative data samples before you start building a model and account for that bias upfront. And we're never going to be perfect. Like, that's
another thing that I feel like can be hard with data science models: they're never going to be 100% accurate. But we need to make a best-faith effort to control for bias in our data, control for bias in the features of our models, and ensure that we are building things that, I mean, treat others as you want to be treated and are fair in their execution.
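One concrete, basic form of that control is simply comparing model outcomes across groups before anything ships. A minimal sketch with invented data and a hypothetical tolerance; real fairness audits go much deeper than this:

```python
# Compare a model's approval rates across groups; illustrative data only.
import pandas as pd

scored = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],
    "approved": [1, 1, 0, 1, 0, 0],
})

# Approval rate per group. A large gap is a signal to investigate
# features (like zip code) that may proxy for a protected attribute.
rates = scored.groupby("group")["approved"].mean()
print(rates)

gap = rates.max() - rates.min()
if gap > 0.2:  # hypothetical tolerance
    print(f"Warning: approval-rate gap of {gap:.0%} between groups")
```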
That's amazing, actually. It's a very interesting and fascinating topic. And I think what is most important about this topic is when we start talking about bias and how it can be introduced, because there are humans behind these models, right?
Yeah. It's a human creation and it reflects us, right?
At the same time, though, people are not that aware of that. When we are talking about AI, like the public out there, and we are not talking about the engineers or the data scientists, right? They think that it's some kind of solution that gives the absolute truth, or that it will always operate as we are used to with our cell phones, right? Which are reliable and all that stuff.
So how can we communicate that to the public out there? And how can we, both as data scientists and as product managers who are productizing these models, build experiences that can educate, in a way, let's say, the people out there to feel more comfortable with this new way of interacting with technology, which includes mistakes? It includes bias, right?
I really like the word
you used, educate. I really believe that for anything new to be successful, the people who
are championing it have to also be educators of that domain. For data science to be used in new places at Shipt, I have to be an educator about what data science can do, because it is unknown to others, and sometimes to me, but that's a different discussion. I really believe that the interventions driven by a model, or the experiences that a model drives, especially as they're new, need to either have an education component to them or be a gradual transition to that distant future, or whatever. Like,
the idea I always think of when learning about this kind of idea is, you know, when they introduced elevators, long before any of us were born, people were terrified to get on elevators, because this was a brand new idea of going up and down in a machine, and who knows what's going to happen. And to assuage those fears, they had elevator operators, little dudes who were going to push the button to go up and down. And that gave you some comfort in this new system: as we figure out how it's going to work, we can educate you. And then of course, now elevators are wholly complex automated systems.
I think data science is the same way. It's always going to be a challenge to deploy the cutting edge in a way that is comfortable to people.
What you can do though, is make small steps towards that
and work to educate in the process of releasing new models and new experiences.
Yeah. What's your feeling so far? Do you think we are doing a good job educating the people out there?
You guys? Yeah, you're doing great. I feel like there's a lot of hype about data science.
And I mean, I think part of being a data scientist is being a skeptic. Like, that's one of the things that makes you successful: is this data really saying what you're telling me it's saying? I think that there's a lot of opportunity for data science to solve a lot of important problems in the world. I don't think it's this magic solution or the silver bullet. And, you know, like you said, we're not going to have Terminators walking around anytime soon, but we kind of already have cyborgs, right? We've got people with pacemakers and all that kind of stuff, right?
And we think of that as normal.
I think that there's going to be a gradual rollout of advances in technology, including data science, and it will come at a slow enough pace that we'll only realize it in hindsight, like, oh yeah, cyborgs walk among us, these guys with pacemakers and transplants and all that stuff to survive. I think we'll feel the same way about data science in 10 or 20 years.
I think one of the challenges with data science, and we've talked about this before, is that in terms of the public brand of data science and, you know, machine learning and artificial intelligence, when it's done really well, the experience is simple and congruent for the user. And so you want to think about it like a Rube Goldberg machine, right? You know, this is funny. There's an old movie called Chitty Chitty Bang Bang, and there's this really complicated machine that literally just cracks an egg and then puts it on a plate for breakfast. It goes through this really complex process, but the result is simple, right? It's just, you know, you have breakfast. And data science is the same way.
And so it's really hard for the average person to appreciate the complexity that goes into
something that just means that their recommendations
make really logical sense, you know, in an app or something.
Yeah, I would add to that. I think that most data science models solve very simple problems too, right? It's, you know, predicting whether this product is a good recommendation or not, or predicting whether a person will still be a member of a subscription service in 30 days. The Rube Goldberg complexity part
in the interaction comes from how you use those, in my opinion. And when you start stacking models
together and pairing them with email marketing or ads or recommendations or changing how an app
performs. Like that's where the complexity comes to my mind. Like obviously there's the complexity
on the front end of cleaning your data, making sure it's representative, avoiding bias, doing
your due diligence to do data science well and ethically. But the complexity is so much more than the data science itself.
Speaking of complexity, one thing I wanted to ask you about is, and this really plays off of what we just talked about.
So we use Shipt in our household.
It's a great service.
We love it.
And at a very high level, it's so simple, right? It's you open an
app, you choose the groceries that you want, and then someone delivers the groceries to you. It's
so nice and simple. But before the show, I was making a Shipt order. And thinking about it through, you know, just the lens of data science and sort of your role and the show, I realized this is so complex. I mean, there's so many moving parts here. The app itself, I think, is very well designed, because there's a ton going on, especially on the
mobile side, you have to fit so many possible sort of decisions into a small screen. But then I
realized that's just one side of it, right? I'm the consumer on
the e-commerce side. And then there's an entirely different experience for the person who's picking
the groceries and then delivering them. And so can you speak a little bit to the complexity?
I'm sure there are things that I'm not even imagining, but it seems like a pretty wild
set of data that you have coming in.
Yeah, I'll say up front that with what I call the data estate at Shipt, I'll never have this quality of problems, quality of data, volume of data, you know, just greenfield problems to solve anywhere else. It's just so big and so vast and so complex, and it's been great as a data scientist. I would also say you've identified two of the several sides of our multi-sided marketplace.
We also partner with retailers to get their product inventory data into our app, and partner with CPG brands like Coca-Cola or Pepsi to get up-to-date nutrition information and sales and coupons and stuff into our app. So we really have like four parties kind of converging on this space of Shipt, all
trying to make business exchanges, if that makes sense. It's extremely complicated in that there's
just so many people, so many different priorities and what we have to do to be successful as a
business and as data scientists is prioritize, like fundamentally just prioritize what is the most important for us to do now, because we certainly can't do it all.
Absolutely. And what does that process look like? I'd love to know, sort of as a team, and I know it's beyond just the data science team, because you're working with probably all sorts of other teams, but I'd just love for our listeners to hear: what does that process look like? How do you prioritize the work that data science does? And what does that decision-making process look like internally?
Yeah, that's the hardest part. I mean, I can try
to talk about how it is, but I think it's constantly growing and changing, especially as we as a company grow and our business changes.
You know, when I joined Shipt, we were in like 20 cities and we offered just shop and deliver.
So now we are in, I think, 45 to 48 states in the United States. We offer shop and deliver.
We offer delivery only where a retailer picks it, makes the basket for you and our shopper
just picks it up and drives it.
We have four or five other kinds of like business models.
We deliver from places beyond grocery stores, Target, places like Party City, I think a
couple of sporting goods stores.
I can't even keep up with it.
Like the business has changed so much that the main thing I would say about prioritization is that it's not a one-time thing.
It is a process, and it is an ongoing process, and it can be painful to have something that
you've worked on all of a sudden, like not be priority anymore and to be shifting gears. But
I think that's necessary to be successful.
In terms of how it actually happens,
it's getting a lot of people in a room together to hash it out and talk about it.
And then at the end of the day,
someone's got to make a decision
and hopefully the group can collectively
come to a consensus.
But as we know, sometimes people disagree
and it takes leadership to help guide the ship.
Ryan, you've mentioned that there's a lot of change that has happened in the company, right? Since you joined, because you also joined at a very early stage and the company grew really, really fast. Can you tell us a little bit about how your work as a data scientist changed and how it was affected by this growth?
Yeah. So I would say the first thing that changed is that now I'm focused on a much smaller section of the business and a smaller problem scope. You know, back in the day, I did infrastructure things. I DBA'd a database or two. I built dashboards. I did everything.
I wore so many hats and also worked with so many different components of the business:
marketing, finance, accounting, engineering, operations, product. I was everywhere.
As the company has grown, I don't do as much with internal tools. And my focus has been much more
on the operation side
or things that kind of take place across the operation side. So maybe like we talked about
out of stocks, right? That kind of spans the basket building on our member customer side
and to the shopper side. So my scope has narrowed and that's been great because
I've been able to go much more in depth with these
problems. The solutions we were providing back in the day when I started were all very simple.
We tried to be pragmatic about it. There's no use in spending two months extra to get a 5%
improvement, right? Let's get something simple. Let's get it out there. I think we still
embrace that ideal, but we just have much more opportunity to tackle harder problems. And so we
get opportunities to invest in more complicated methodologies, more complicated problems, and
hopefully bigger and more important solutions for our business.
So would you say that, let's say, the value of data science as an organization, as a function inside the company, has shifted compared to how it was at the beginning? Or is it just the scale that has changed?
In the early days, and to be fair we still very much are this way, kind of a core driver for Shipt is opportunity. We want to give people an opportunity to have more time with their family by not having to go to the store. We want to give our shoppers an opportunity to earn more income, or supplemental income, or a full-time job to provide for them and their family. Early on at Shipt, that was the core of our business. And it was small enough that we could manage it in, I would say, a simple way.
Simple technology, simple rules, simple operations.
Not that it was simple.
It was very complex.
But we didn't need to rely on data science as much. As we've scaled and grown, data science and engineering have become so much more
critical to the success of Shipt, being able to function at scale, being able to be efficient
across the wide variety of businesses so that we can still be relational at our core and
provide opportunity to people at our core.
Do you think there is a time that it's too early for a company to invest in data science
based on your experience?
I would say no.
But what I would say is that investing in data science often really means investing in data science foundations: data engineers, analytics, getting to a place where analytics are driving the business as opposed to reactively interpreting things the business attempts. All of that is so important, and to me, that is what investing in data science means for a younger company. That sets the stage for the fancy data science that we all think of when we say data science: advanced models, statistical analysis, that kind of stuff.
So from what I understand, at Shipt,
data science is also like a big part of the product, right?
Like there are features on the product
that are actually driven by data science.
Yes.
And we will discuss more about this in a bit.
But before we go there,
are there other functions of the company right now that benefit from having a very strong data science team?
Yeah, absolutely. So features for members and shoppers is one you identified. Obviously, marketing and retention can benefit
a lot from data science and just trying to understand who our customers are and how we can
make them happy effectively. We have a lot of natural language processing data science problems
that shipped as well. All of the products that we get from our retail partners and from third-party sources and trying to enrich those,
it's really challenging to know if this package of goldfish that Target sells is the same as this
package of goldfish that Winn-Dixie sells. There's a lot of natural language processing problems
there of cleaning those up, identifying their brands, identifying if they are the same product across stores and getting our
data catalog in a way that is standardized across locations.
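To make the matching problem concrete, here is a rough sketch of one simple first pass: character n-gram TF-IDF similarity between product titles. The catalogs here are invented, and Shipt's real pipeline is certainly more involved than this:

```python
# Is this goldfish package at one retailer the same as that one at another?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog_a = ["Pepperidge Farm Goldfish Cheddar Crackers 6.6 oz"]
catalog_b = [
    "PEPP FARM GLDFSH CHDR 6.6OZ",        # vowels stripped, receipt-style
    "Great Value Cheddar Crackers 7 oz",  # a different house brand
]

# Character n-grams are robust to the abbreviated, vowel-less titles
# that come in from some retailer feeds.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vec.fit_transform(catalog_a + catalog_b)

sims = cosine_similarity(matrix[0:1], matrix[1:])
for title, sim in zip(catalog_b, sims[0]):
    print(f"{sim:.2f}  {title}")
```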
There's plenty of finance and accounting modeling and forecasting components. And then I'd also say there's a big operations component, just like, we have a marketplace and we need to make sure that supply and demand are balanced.
How do we hire shoppers?
How do we match shoppers and orders?
All those need to be done within the context of who we want to be as a business and how we want to value our shoppers and value our members.
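As a toy illustration, shopper-order matching can be framed as a classic assignment problem. The cost matrix below is just hypothetical travel minutes; the real decision weighs many more factors, and Shipt's actual method is not public:

```python
# Match each shopper to one order while minimizing total cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = invented minutes for shopper i to reach order j
cost = np.array([
    [10, 25, 40],
    [30, 12, 22],
    [45, 28,  9],
])

shoppers, orders = linear_sum_assignment(cost)
for s, o in zip(shoppers, orders):
    print(f"shopper {s} -> order {o} ({cost[s, o]} min)")
```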
But data science plays a key role in all of those things at Shipt.
Yeah, it's super interesting.
I want to ask you another question to try and make Eric happy.
So my question is...
It's impossible, Costas.
We'll see.
We'll see.
We'll see.
So Ryan, can you give us a tip, or help us understand how data science can help marketing, especially in a way that is not that obvious to most people, people like me out there who are not actively working with marketing or data science?
I will say that this is probably the area I have spent the least amount of time at Shipt focusing on, but a common way I've seen data science used across multiple companies is for subscription
services, identifying likelihood of churn.
And so you can build a data science model that predicts if a subscriber of your service
will still be there, still be a member in 30 days or 90 days or 15 days, whatever time
interval you want. Coming out of that, you
can get an understanding of who you need to target for retention. And this can be as simple as
reaching out to them on the phone and asking how they're doing, like for a small, you know, SaaS company. Or for something like Shipt, this could be something like extending a discount or,
you know, giving them a $5 credit to try to get them to re-engage with the service.
Those interventions obviously need to be domain specific.
But if you can understand who is appreciating your service and who is not appreciating your service, you can begin to try to figure out why and how you can fix that problem.
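A minimal sketch of the churn model Ryan outlines, with invented features and data: predict whether a member will still be subscribed in 30 days, then flag likely churners for an intervention.

```python
# Hypothetical retention model; features and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: orders in last 30 days, days since last order.
X = np.array([[8, 2], [1, 25], [5, 6], [0, 40], [3, 12], [7, 3]])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = still a member 30 days later

model = LogisticRegression().fit(X, y)

# Score a current member. A low retention probability flags who to
# reach out to, whether by phone call or a $5 credit.
p_retain = model.predict_proba(np.array([[2, 20]]))[0, 1]
if p_retain < 0.5:
    print(f"P(retained) = {p_retain:.2f}: consider a re-engagement offer")
```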
What do you think, Eric? Is this something useful for marketing?
Yeah, I'm very happy. Thank you. Made my day. So if you put yourself in the shoes of someone in marketing, hearkening back to my previous life, I think the challenge you have at scale is that the analytics tools that you're using aren't built to predict churn with sort of like custom inputs, right? And you can't really do it in a spreadsheet, because it's way too much volume. So you can sort of anecdotally look at individual customer journeys to try and give yourself an idea of what types of things might be causing churn. But it's pretty hard tactically to achieve sort of a statistically significant view as a marketer, you know, if you're not really, really good at SQL. But even then, you're sort of at a large company dealing with, you know, access to databases and all that sort of stuff. So absolutely, especially at scale. I mean, I can't imagine, you know, trying to crunch data at a company like Shipt because of how much there is. So yeah, I think there's huge value in that. Ryan, one thing you mentioned,
actually, and I'm interested in this kind of from a marketing product standpoint, but you had
mentioned prior to the show, a sort of cold start problem with search. And I'd love to hear the
story around that tactically. I think our audience would love to hear about that. Could you talk
about that particular problem and how you solved it?
Yeah. So back in the day, the good old days, Shipt was small. And I think I was hired before we had an engineer who was responsible for search. We had an engineering team, and they had solved search at some point, but we didn't have a search engineer. And so the problem that we had was
every time we launched a new retail partner, we had a brand new catalog of data.
No one had ever seen it before. No one had ever bought anything from it before. I mean,
yes, people had bought goldfish at prior retailers, but how should we show search results
from a new catalog? How do we handle things like house brands or things that are unique regionally, those differences, when basically we had nothing?
I'll start by saying that our search team has since solved this better than I did early in the Shipt lifetime, and so my solution has now been deprecated and laid to rest, and we are all better off for it.
But what we did to solve this problem was basically
built what I would call a human in the loop tool that allowed us to use machine learning and then
polish it at the very end to give a great search experience on day one for our new retail partners.
What we first did was some advanced natural language processing stuff to compare products from existing stores that we had sold, to products from new stores that we had never shown, seen, or sold any products for. Getting a little tangential here, but I'll be upfront and say that UPCs are neither universal nor unique. So the idea of understanding exactly which product at store A is the same at store B is not as simple as you'd think it would be. And if you've ever looked at your receipt and seen them, you know, take all the vowels out of a product name and give you the price, sometimes our data comes in like that, or at least it used to.
So there's this fundamental problem of identifying first what new products were similar to old products and then sorting them based on that similarity and inferring from the old product search rankings where the new products should be.
So that's a very high level of how we did it. Technically, I would say we built an in-house KNN, k-nearest neighbors, clustering model, and then inferred search results off the clusters. And then we had a suite of tools that allowed us to go in and check: you know, people always buy bananas, so does bananas come up near the top of the list? What happens if you search for cheese? Does it look right? And we were able to just manually clean it up.
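A simplified sketch of that cold-start idea: vectorize old and new catalog titles, find each new product's nearest known neighbor, and let it inherit that product's search ranking before humans polish the results. All names and data here are illustrative, not Shipt's deprecated system:

```python
# Infer search rankings for a brand-new catalog from a known one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

old_titles = ["bananas organic", "cheddar cheese block", "whole milk gallon"]
old_rank = [1, 12, 3]  # existing search ranks (lower = shown higher)
new_titles = ["house brand cheddar cheese", "bananas bunch"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
old_vecs = vec.fit_transform(old_titles)
new_vecs = vec.transform(new_titles)

nn = NearestNeighbors(n_neighbors=1).fit(old_vecs)
_, idx = nn.kneighbors(new_vecs)

# Each new product inherits the rank of its closest known product;
# a human-in-the-loop pass then cleans up the obvious misses.
for title, i in zip(new_titles, idx[:, 0]):
    print(f"{title!r} ~ {old_titles[i]!r}, inferred rank {old_rank[i]}")
```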
Sure. And did you see pretty significant improvements to people's sort of initial search experiences?
Yeah. So we definitely
saw improvements to initial search experiences. It's a challenging thing to test and measure
because every new retailer is different. So there's a lot of confounding variables.
But we did see significant improvements in search conversion rates,
both compared to what we had been doing beforehand, which was people manually identifying the top thousand items and sorting them at that point in time. That's old Shipt: search rankings were just, you know, top to bottom, and filter and order was kind of how it worked.
The other big benefit we had is that the time to market for solving the search problem was drastically cut down.
It would take, you know, our catalog team multiple people, multiple hours, four to eight hours, to just initialize search for a new retailer. With the data science model, a data scientist was able to do it in an hour of active time, plus whatever computational time it took, and get better results while saving a lot of man-hours in the process.
That's amazing, Ryan. Can you share with us a little bit more information about the data stack?
We started from very early Shipt with that last conversation, moving up to more modern Shipt.
The data stack has evolved over time. Today, we use Snowflake as our data warehouse solution. We have Postgres databases, which may or may not be used by the data science team. Our data scientists use Tableau as a BI tool. We also use dbt a lot for data engineering purposes.
But all of our like actual model deployment processes, taking the out of stock model and
running it in production so that it feeds real-time systems, all of that is built in-house. And we are building a team at Shipt right now to really build the next-generation model deployment stuff. An ML platform is kind of what we're calling it: build the next-generation ML platform for Shipt, because we're still running on some of the stuff that we hacked together in a couple of afternoons several years ago. So there's a lot of excitement there.
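As a generic illustration of what running a model in production to feed real-time systems can look like, here is a bare-bones scoring-service sketch. This is a common pattern, not Shipt's in-house platform, which isn't public; the endpoint, fields, and placeholder scoring logic are all invented:

```python
# A minimal real-time model-scoring service (hypothetical throughout).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    hours_since_last_sale: float
    substitution_rate: float

def score(f: Features) -> float:
    # Stand-in for a real trained model loaded at startup.
    return min(1.0, 0.01 * f.hours_since_last_sale + f.substitution_rate)

@app.post("/out-of-stock-score")
def out_of_stock_score(features: Features) -> dict:
    # Real-time callers (the shopper app, substitution logic) get a
    # probability back and decide what to do with it.
    return {"score": score(features)}

# Run with: uvicorn service:app --reload
```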
Yeah, that's super interesting, actually. We had a show a couple of weeks ago with Tecton.
Tecton, yeah, you probably know them. I've met some of the guys at Tecton. They're doing some really cool stuff over there.
Yeah, yeah. And what I found very interesting is that in this space that we call feature stores, which, I mean, okay, I think there's still a lot of confusion about what these things are, there's not a lot of, let's say, open-source solutions out there. Actually, there's only one, which I think is called Feast.
Any plans from your side to open source anything?
I don't know, is the honest answer. Personally, I would love to open source things. I think that sounds fun and satisfying. I think that, more than likely, our team will be relying on and building on top of a lot of the existing open-source tools that are out there and then tweaking them for our needs.
Yeah, makes sense.
I would say that an ML platform is a huge competitive advantage these
days in the data science space.
After the people part of data science, the next hardest part is actually integrating it with all of the things.
Like how do you take that tool and use it effectively?
How do you do it in real time?
So that's my understanding and expectation of why there's not a lot of open source machine learning platforms out there.
It's because to the people who have built it and done it well,
it helps them succeed
and helps them outlast their competitors.
Yeah, I think that's an excellent point.
And I think it's a very good explanation
of why this is happening.
And I think it explains why
even traditional companies, like really big companies that traditionally have a lot of open-source presence, like Netflix for example, even they haven't made public the feature stores that they have built. You see a lot of talks about it, presentations and all that stuff, but none of these are open-sourced yet. And I think it's an excellent point that you are making: it's actually a very competitive advantage that companies have by keeping these systems in-house. So it makes total sense. So you shared with us the data
stack that you have. Are there any like specific tools that are used only by the data scientists?
How do you build and how do you iterate on your models? Like are there any frameworks that you
are using for that? Libraries? Anything specific that you would like to share there with us?
So the first thing I'll say is I think that we are evolving there, and we'll have a lot of new tools to better set up our model-building systems down the road. We've looked into all kinds of things, from MLflow to Kubeflow, for artifact storage and model-iteration storage. There's a lot of opportunity out there, and we're trying to decide what we want to build in-house, what we want to pay someone for, and what we want to use open source for.
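Since MLflow comes up as one of the options they evaluated, here is a hedged illustration of its tracking workflow, logging parameters, metrics, and the fitted model per run. The run name and toy model are invented; this is not Shipt's setup:

```python
# Track one model iteration with MLflow's tracking API.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]

with mlflow.start_run(run_name="oos-model-v2"):  # hypothetical name
    model = LogisticRegression(C=0.5).fit(X, y)
    mlflow.log_param("C", 0.5)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Store the serialized model as a run artifact for later
    # comparison or deployment.
    mlflow.sklearn.log_model(model, "model")
```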
In terms of tools that we commonly use, Shipt's data science is very ambidextrous. We use both R and Python as the problem needs. I will say that anytime we get into that real-time space, where we start having to think about feature stores and think about APIs, Python's going to win out there. But oftentimes we find that R is much more helpful at that exploratory data analysis phase. And we do have internal packages for both R and Python that allow us to very easily communicate with all of our data stores and write and push data to them, as well as our cloud provider. In terms of other tools and the process flow, I think what we
really want to do is build our systems in a way that data science can iterate independently of the rest of the business.
Obviously, that's not okay for us to do in all cases.
But if I'm building a recommendation engine, the goal would be for me to build it in a way that it communicates consistently with engineering.
And then I can begin iterating on it however I want to improve it.
We work with product managers and the business so they're aware of the changes, but apart from the current systems.
So we really embrace that kind of microservice idea.
That'd be the engineering component of it, like microservices at Shipt. And we really strive to build it simple at first and then iterate and learn as we launch and run.
This is great.
Last question from my side
and then I'll let Eric ask any questions he might have.
What's the relationship with data engineering
and how you work together with them
and how is the function defined inside?
Yeah, so I would say there are actually kind of two data engineering groups at Shipt. One data engineering group at Shipt is all about getting data that our partners provide
and getting it into our system so that we can sell the products they have.
And that is a lot of data.
And historically, we've worked very closely with that group just in terms of building solutions to clean, standardize, and understand the product data that's coming in from our partners.
Predicting what brand a product is if it doesn't come tagged with a brand, that kind of thing. The other group is all about getting data to be stored in our data warehouse and transforming that data into things that can be used by data scientists for analytics, or by others in the company for analytics. We work closely with them, though I think that that team is going to continue to
scale even more as our group is growing. A lot of the challenge that we have at Shipt with data
and growing fast is things change, and it's hard to know when they change. So if engineering
changes the way they're solving a certain problem in the business,
like it can be challenging for us to know that that happened way downstream.
Our data engineering team is crucial for handling those changes and ensuring that we get clear
and ready data.
And they're a bunch of great guys.
I love them all a lot.
Ryan, one question on the data engineering side.
Did you, going back to sort of the early days
before maybe the data engineering team
was as big as it is today,
did the data science team actually do
some of the data engineering work as well?
Or has there always been a sort of clear delineation
of responsibility?
There has not, and in a lot of ways, there still isn't. Like, the data I need to build an out-of-stock model, right? It's not going to be present in this perfect form where I can just select star from my table, you know, and roll with it. There's still a lot of data engineering
that we have to do as data scientists
to build our models and build the pipelines
that feed and serve those models, and to run the analytics from those models back into a place where they can be analyzed later. But I will say that the demarcation is much cleaner today than it was in the past. Very early on, I did a lot of data engineering, and that was just a necessary thing for data science to work and function at that time, because there weren't as many dedicated resources for internal data engineering.
Sure. Super interesting. Well,
we're close to time here, so we'll ask one more question before we hop off. What are you most
excited about in terms of trends in data science that you kind of see on the front lines doing the work every day?
Yeah, so that's actually a really good question and a really hard question. One thing that I
really am excited about, and this is broad industry right now, is I feel like some of the
hype is dying down. There was this idea that data science was going to solve all of our problems, that self-driving cars were going to be here. And they're not, and all our problems haven't been solved yet.
And so we're at a point where we're kind of coming to terms with what data science can do. And we're really beginning, as an industry, to make tangible steps forward, as opposed to having to dance around the hype and the expectations.
So that's one thing that really excites me. I think the other thing is that
people are becoming more and more receptive to data science being used in effective ways. And
we are really learning as an industry, how to do data science effectively. Like y'all talked about
this earlier: lots of people have come on and talked about the human element of data science and how important that is. And as an industry, we're really starting to realize that, and best practices are being developed. A whole bunch of companies have popped up to provide services for MLOps and how we can do data science at scale and monitor data science at scale. Like, we're coming out of the kind of wild-west early days of data hype into a more steady and stable industry with some more best practices around.
That's not to say that we are stable
and there's not a wild west component of this,
but I feel like it's much more clear
how to solve a lot of common problems today than it was when I started.
I've also learned a lot in that time too.
Well, it's interesting, even if you think about when you started at Shipt and, you know, to today, the number of new tools that have been introduced that make a lot of these things easier, or tools that have been developed internally. You sort of have this maturing of the discipline where some of the technical problems are getting out of the way, and you can focus on the deeper problems that you're actually trying to solve, as opposed to building the infrastructure that makes it easier for you to solve them.
Absolutely. Yeah. And I mean, an example of how this changed for us: early on at Shipt, we used Airflow, the batch job orchestration tool. It was pretty much the only option at that point in time. We've revisited that. We're still using Airflow, but we've revisited it and had discussions about whether that's right for us these days. And there are now six or ten options, and plenty more that I don't know about, that each meet that same general need but do it in slightly different ways, with slightly different targets, slightly different niches. And from that, it's just wonderful to have opportunities and choices to figure out how you want to do things for your business, and to lean on the expertise of others.
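For readers who haven't used it, here is a minimal sketch of the batch-orchestration pattern Ryan describes, in Airflow. The task names and schedule are invented, but the DAG structure is the standard Airflow idiom:

```python
# A two-step nightly pipeline: build features, then score the catalog.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pulling features from the warehouse")

def score_products():
    print("scoring the catalog with the out-of-stock model")

with DAG(
    dag_id="nightly_out_of_stock_scoring",  # hypothetical DAG
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_features", python_callable=extract_features
    )
    score = PythonOperator(
        task_id="score_products", python_callable=score_products
    )
    extract >> score  # score only after features are ready
```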
Absolutely. Well, Ryan, it has been such a pleasure to have you on the show.
So interesting to hear about everything that's going on at Shipt. Congratulations on the success and best of luck in hiring, doubling the size of the team before the end of the year. That's
a tall order. Yeah. Hiring is hard. I'm excited for it though.
We need all the help we can get. Cool. Well, we'd love to check back in with you on a future episode
and thanks again. Thank you so much. I really appreciate it, Eric. Thanks Costas.
Thank you, Ryan. It was great.
As always, a fascinating conversation. I think one of my big takeaways was hearing about how things have stayed the same in
many ways going through a huge acquisition by such a large company like Target. That was just really
cool to hear. I mean, obviously there's more sort of structure being part of a larger company,
but it's really neat to hear. A lot of times you'll hear the opposite story where, you know,
a company gets acquired and sort of your ability to be agile early on dissolves and it's not as gratifying to be part
of the team anymore. But I didn't get that sense at all. And it just makes me really happy to hear
that that was sort of managed well and that they can still have that startup type feel to some
extent at a big company.
Yeah, I'm a little bit disappointed, to be honest, Eric. We have another data scientist who said that we are not going to see the Terminator anytime soon. So it's a long time until 2030. So that's true. But yeah, regardless of that, it was a great
conversation with Ryan. And I think it's amazing to hear from people of like what kind of impact data science can have in a company and how many different aspects of the company it can affect.
And I think, from what I understand during the conversation we had, Shipt is such a case. You have internal users, you have the product running on it. Pretty much every stakeholder around the company is affected by data science.
And I hope that we will have more and more opportunities in the future to communicate and educate the people out there about how data science is an important part of any tech company today. And not only tech, actually, any company.
And I think one theme that's been recurring
is the human element of data science,
which I think has been really interesting to hear about.
And Ryan brought that up without us even bringing it up.
And that's just been a constant theme with all of our guests,
which is, I think, both fascinating and encouraging.
Yeah, absolutely.
All right.
Well, until next time,
thank you for joining us on the
Data Stack Show. Make sure to subscribe on your favorite podcast app. You'll get notified of new episodes every week, and we'll catch you on the next one. The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at rudderstack.com.