The Data Stack Show - 42: Scaling Data Science with Ryan Boyer of Shipt
Episode Date: June 30, 2021

Highlights from this week's episode include:

- Ryan's full circle path from stocking shelves at Target to using data science for a company owned by Target (2:00)
- Building great tools and wielding them effectively (5:04)
- Changes at Shipt since being acquired (9:29)
- How people's bias impacts models built by data scientists (12:30)
- The different data sources Shipt incorporates (22:02)
- How Ryan's work as a data scientist has changed as Shipt has grown (25:29)
- How data science helps marketing (31:38)
- Improving search experience (34:23)
- Shipt's evolving data stack (38:27)
- New trends in data science (47:06)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Transcript
The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution.
Thanks for joining the show today.
We have a guest on the show today I'm particularly excited to talk to because I'm a customer
of theirs and it's the company Shipt.
They do grocery delivery and now all sorts of other stuff.
And actually, we've been customers in our household for a long time.
So I remember when they got acquired by Target.
And one thing that I'm really interested to ask Ryan about,
who's a data scientist at Shipt, is just around the complexity.
So if you open up the Shipt app and use it, there's so much going on there, even just
from the consumer side.
And I can't imagine the challenge of dealing with all the different data sets that they
have in terms of building models and sort of just managing the entire data science practice.
So complexity is my burning question.
Costas?
Data.
I'm pretty sure they have like a lot,
a lot of data that they're working with.
And I want to see both how they grew
from the first days until today
in terms of like the data itself
and the infrastructure behind it.
And what are the challenges around that?
And keep in mind that we are talking about a marketplace here, which always complicates things, although we tend to see only one side, the side of the marketplace that we are part of. So I'm pretty sure that he will have very interesting information to share about how important data is in growing marketplaces.
Absolutely. Well, let's jump in and talk with Ryan Boyer from the data science team at Shipt.
All right, Ryan Boyer, welcome to the Data Stack Show. Thank you so much for having me.
I'm really excited to be here.
Oh man, we have so many things to ask you about, but why don't you just give us a brief
background?
So where did your career start out, and then what was the pathway that led you to data science at Shipt?
Yeah, this is a great story.
So I got a math degree at Clemson University for my undergrad, and then, as opposed to going to grad school like I initially planned, I upped and moved to Bozeman, Montana, where I became a ski bum, a very terrible ski bum, and stocked store shelves at Target for about a year. Six years later, here I am working at a company owned by Target. So it's very much come full circle.
How I got into data science and how I specifically ended up at Shipt is a little more direct.
After learning that I was a bad ski bum and wanted to use my brain a lot more, I went
back to grad school, got a degree in systems and information engineering, focusing a lot
on data science, math, statistics, and then ended up in Birmingham, Alabama, because it was where my wife grew up, and wanted to do data science in a small Southern town. And there were really only one or two options. And I got lucky and joined what was, at that point in time, a decent-size startup named Shipt. And it has been rocket-ship growth ever since then. I was the third person to hold the title of data scientist, and now I'm on a team with, I think, 50 people in the data science organization, and we're always hiring as far as I can tell. So it's been a lot of fun.
Yeah, that's great. We encourage our guests
to tell our audience when they're hiring. And it seems like data science and data engineering roles are just in huge demand.
One question for you.
So, and this is, I just love the story of you stocking shelves while being a ski bum.
Did that influence sort of the way that you thought about solving problems around stocking
and in-stock items when you were working on that from an
actual data science standpoint at Shipt? I would say it certainly helped, right? Like I understood
that Target doesn't just get one truck a week. You know, there's lots of trucks a week that come
at different times. And so there was some like domain expertise I could bring to that problem.
But I would say that the bigger thing that I learned,
honestly, throughout all of my undergrad career,
and especially through my time as being a poor ski bum,
poor in the sense I wasn't very good at it,
was like how central people are to data science, right?
And so I would say that really has kind of been like the key driving thing for me as a data scientist
is how can I make a model or systems that work for people and with people?
So interesting. Can you give us just one example of sort of what it looks like to go from model to individual in some of the work that you've done?
Like, just a practical example for our listeners?
Yeah, so I will say this is probably the hardest thing in data science, in my opinion: managing that stage of a project.
So we can talk about the out-of-stocks.
We can get into the model more later
if y'all are interested.
Basically at Shipt, I built a model
that predicts whether a product
is out-of-stock in real time.
You know, from a data science side, they get a score between zero and one, basically a
probability of it being out of stock.
That's great.
And I can do metrics about how effective that is and all kinds of things.
But the question is, like, how can I use that to improve the lives of Shipt's shoppers or Shipt's members, who are either picking the groceries in the store or ordering them on our e-commerce app? And so there's a lot of discussion about how to do that
well and effectively and manage their expectations. I personally believe that data science models
should be thought of as tools and not solutions. Like no one looks at a house and then thanks the
hammer, right? They thank the carpenter who used the hammer to build the house well, right?
But I feel like in data science, it's like, oh man, deep learning, neural networks, gradient boosted trees. Like we can solve all of our problems with these cool tools. And I would say, no, you can build lots of great tools that can be really effective at your problem, but you still need to wield them effectively.
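To make that concrete, here is a minimal sketch, in Python, of the kind of tool Ryan is describing: a classifier whose output is a probability between zero and one that people and systems downstream still have to wield well. The features, data, and model choice are invented for illustration and are not Shipt's actual implementation.

```python
# A hypothetical out-of-stock scorer; everything here is illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Invented features: hours since the item last sold at this store,
# recent substitution rate, and a day-of-week demand signal.
X_train = np.array([
    [2.0, 0.05, 0.8],
    [48.0, 0.60, 0.2],
    [12.0, 0.30, 0.5],
    [72.0, 0.75, 0.1],
])
y_train = np.array([0, 1, 0, 1])  # 1 = the item turned out to be out of stock

model = GradientBoostingClassifier().fit(X_train, y_train)

# Downstream systems get a probability between 0 and 1, a tool for
# shoppers and apps to act on, not a final answer.
score = model.predict_proba(np.array([[36.0, 0.50, 0.3]]))[0, 1]
print(f"P(out of stock) = {score:.2f}")
```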
Yeah, I think we see a pattern here we've seen before, right? Like, actually two patterns. One is how you actually productize data science, which is not that obvious in the end. And we will have the opportunity, I think, to discuss more on this today. And the other thing, Eric, that we have talked about a lot before is that data science, machine learning, and all that stuff, it's more of a tool; they augment, and they have to work together with people. It's not a replacement for the stuff that we are doing as people. So I'm pretty sure that you remember all these discussions we had in the past and other episodes around that.
Absolutely. Yeah. It's been a huge thing with data science.
Actually, we've had multiple data scientists on the show, and it's been really encouraging that the most common perspective is that the human element of data science is the key determinant in whether data science is actually effective or successful.
Yeah, I don't think we are going to see the Terminator anytime soon.
No, me either. Yeah, I would also say the human element is important on both ends, right? Like, getting data science to production, it really matters that you have a company and a culture who is bought in, willing to invest, willing to work with you, and willing to buy into the vision of how a data science tool can be used. So there's that front-end part of data science being successful. And then there's the part of just, can you build a data science model that affects your business, or the people who use your business, in a way that supports and helps them, as opposed to just trying to take away all control and what they like about the business in order to give them the best outcomes.
Sure.
We had someone on the show who talked a lot about AI and how people sometimes can have a fear of AI and blame the technology. And he said, if you see sort of negative results from that, you have to remember there's a human behind it, whether it's building a model or approving it, which was very thought-provoking.
Yeah. I mean, I would also say sometimes us people who are building things can miss things too. And so there are sometimes mistakes, but I don't think that's an excuse for a data scientist or a data science professional to build something that is manipulative. But yeah, there are people building things, and in my opinion, we'll be building things for a long time.
Sure. Yeah. Well, I want to get
into the technical side of things. And I know Costas has a bunch of questions there, but
I think as a segue getting into
that, one thing that would be really interesting to hear about is you joining Shipt pre-acquisition, the third person on the team with the title of data scientist, and you're going to have a huge team by the end of the year. I'd love to know what has changed significantly post-acquisition and what hasn't changed that much.
Yeah. So like you said, I was the third person to hold the title of data scientist when I joined. Our data science team was like five people and a manager. We're like 50-ish people now, and we'll probably be a hundred at the end of the year if we can find the talent we need. So feel free to apply if you're interested. The main thing that I would say is that change is just constant at a
company that is growing as fast as Shipt has been. And that is the truth for how we do data science.
When I first joined, we were very scrappy and had little oversight. And it was kind of awesome, because it was just like, I'd write some code and be like, cool, do you like it? And you'd say, yeah. And I'd be like, okay. And we'd roll it out, you know, we'd deploy it and we'd see what happens. And we'd learn from that.
Of course, now we are much bigger and much more at scale, and we have a much more rigorous system for deploys and reviews and checks, and for understanding how it's going to affect things. But there still is, in my opinion, this desire to learn through experimentation, to learn as much as we can and to go as fast as we can, with a little more cautiousness as well. So I would say a lot has changed, but a lot has stayed the same.
That's really encouraging to hear that,
to hear that you still feel like the startup mindset
and agility is still there
because that's often something you hear people
sort of bemoan post-acquisition is,
you know, we're part of a big company now
and it feels like we're part of a big company. But that's really encouraging.
Yeah. I would say it's gotten harder to be as agile.
Like, you know, no one told me I couldn't do anything back in the day, and I just did things. Now we have people who are, you know, trying to figure out what's best, and there is, you know, a desire to move the large ship in one direction. But I do feel that data science is something that you're just going to fail at a lot of times. Like, you're going to build models that are not going to work. You're going to run a statistical analysis and you're not going to find anything. And if you lose that ability to learn by doing, like not starting a project until everyone's on board with it, or sometimes not even running a model in a small production test until you feel very confident about how it's going to behave, you're going to have a lot of trouble moving quickly in the data science space.
This is great, Ryan.
I want to go back a little bit in our conversation, to the part where we were all discussing AI, machine learning, and the impact that it has on our lives.
And I want to ask you about something very specific, and this is bias.
So I want to hear from you.
First of all, help us understand a little bit better how bias is introduced, how it is, let's say, represented or goes into the models. And based on your experience, what kind of impact can bias have on the end user of any model that a data scientist can build?
Yeah. So just to confirm, you're talking about
what I would call people bias, as opposed to statistical bias, the mathematical term, correct?
Yes, yes.
Okay, okay. Just making sure. I would be surprised, and I would be like, I don't know, it's been a long time since I thought about things at that statistical level.
Yeah, yeah.
Bias, and I believe this as a person before we get to me as a data scientist, bias to me is just something that is innate to the human experience, right? Like, you don't know what you don't know, and it's really hard to understand what you don't know. And to me,
a lot of the ways that bias enters a modeling process or an analytical process is through that
unknown. You're unaware that the sample of your data set only represents people from the Southeast, or you're unaware of something like that. And then in that process, you end up building a model that may be biased towards a certain member or customer type or segment of your business. Like, one of the classic ones you hear about in banking is using zip codes, and zip codes end up being racially discriminatory, because if your model ends up, you know, not giving someone a loan because of their zip code, it can often be that the zip code is predominantly a certain race, and you end up having bias built into the model. So as a data scientist, our job is, in my mind, to identify representative data samples before you start building a model and account for that bias upfront. And we're never going to be perfect. Like, that's
another thing that I feel like can be hard with data science models: they're never going to be 100% accurate. But we need to make a best-faith effort to control for bias in our data, control for bias in the features of our models, and ensure that we are building things that, I mean, treat others as you want to be treated and are fair in their execution.
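One concrete, basic form of that control is simply comparing model outcomes across groups before anything ships. A minimal sketch with invented data and a hypothetical tolerance; real fairness audits go much deeper than this:

```python
# Compare a model's approval rates across groups; illustrative data only.
import pandas as pd

scored = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B"],
    "approved": [1, 1, 0, 1, 0, 0],
})

# Approval rate per group. A large gap is a signal to investigate
# features (like zip code) that may proxy for a protected attribute.
rates = scored.groupby("group")["approved"].mean()
print(rates)

gap = rates.max() - rates.min()
if gap > 0.2:  # hypothetical tolerance
    print(f"Warning: approval-rate gap of {gap:.0%} between groups")
```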
That's amazing, actually. It's a very interesting and fascinating topic. And I think what is most important about this topic is when we start talking about bias and how it can be introduced, because there are humans behind these models, right?
Yeah. It's a human creation and it reflects us, right?
At the same time, though, people are not that aware of that. When we are talking about AI, like the public out there, and we are not talking about the engineers or the data scientists, right? They think that it's some kind of solution that gives the absolute truth, or that it will always operate as we are used to with our cell phones, right? Which are reliable and all that stuff.
So how can we communicate that to the public out there? And how can we, both as data scientists and as product managers who are productizing these models, build experiences that can educate, in a way, let's say, the people out there to feel more comfortable with this new way of interacting with technology, which includes mistakes? It includes bias, right?
I really like the word
you used, educate. I really believe that for anything new to be successful, the people who
are championing it have to also be educators of that domain. For data science to be used in new places at Shipt, I have to be an educator about what data science can do, because it is unknown to others, and sometimes to me, but that's a different discussion. I really believe that the interventions driven by a model, or the experiences that a model drives, especially as they're new, need to either have an education component to them or be a gradual transition to that distant future, or whatever. Like,
the idea I always think of when learning about this kind of idea is, you know, when they introduced elevators, long before any of us were born, people were terrified to get on elevators, because this was a brand new idea of going up and down in a machine, and who knows what's going to happen. And to assuage those fears, they had elevator operators, little dudes who were going to push the button to go up and down. And that gave you some comfort in this new system: as we figure out how it's going to work, we can educate you. And then of course, now elevators are wholly complex automated systems.
I think data science is the same way. It's always going to be a challenge to deploy the cutting edge in a way that is comfortable to people.
What you can do though, is make small steps towards that
and work to educate in the process of releasing new models and new experiences.
Yeah. What's your feeling so far? Do you think we are doing a good job educating the people out there?
You guys? Yeah, you're doing great. I feel like there's a lot of hype about data science.
And I mean, I think part of being a data scientist is being a skeptic. Like, that's one of the things that makes you successful: is this data really saying what you're telling me it's saying? I think that there's a lot of opportunity for data science to solve a lot of important problems in the world. I don't think it's this magic solution or the silver bullet. And, you know, like you said, we're not going to have Terminators walking around anytime soon, but we kind of already have cyborgs, right? We've got people with pacemakers and all that kind of stuff, right?
And we think of that as normal.
I think that there's going to be a gradual rollout of advances in technology, including data science, and it will come at a slow enough pace that we'll only realize it in hindsight, like, oh yeah, cyborgs walk among us, these guys with pacemakers and transplants and all that stuff to survive. I think we'll feel the same way about data science in 10 or 20 years.
I think one of the challenges with data science, and we've talked about this before, is that in terms of the public brand of data science and, you know, machine learning and artificial intelligence, when it's done really well, the experience is simple and congruent for the user. And so you want to think about it like a Rube Goldberg machine, right? You know, this is funny. There's an old movie called Chitty Chitty Bang Bang, and there's this really complicated machine that literally just cracks an egg and then puts it on a plate for breakfast. It goes through this really complex process, but the result is simple, right? It's just, you know, you have breakfast. And data science is the same way.
And so it's really hard for the average person to appreciate the complexity that goes into
something that just means that their recommendations
make really logical sense, you know, in an app or something.
Yeah, I would add to that. I think that most data science models solve very simple problems too, right? It's, you know, predicting whether this product is a good recommendation or not, or predicting whether a person will still be a member of a subscription service in 30 days. The Rube Goldberg complexity part
in the interaction comes from how you use those, in my opinion. And when you start stacking models
together and pairing them with email marketing or ads or recommendations or changing how an app
performs. Like that's where the complexity comes to my mind. Like obviously there's the complexity
on the front end of cleaning your data, making sure it's representative, avoiding bias, doing
your due diligence to do data science well and ethically. But the complexity is so much more than the data science itself.
Speaking of complexity, one thing I wanted to ask you about is, and this really plays off of what we just talked about.
So we use Shipt in our household.
It's a great service.
We love it.
And at a very high level, it's so simple, right? It's you open an
app, you choose the groceries that you want, and then someone delivers the groceries to you. It's
so nice and simple. But before the show, I was making a Shipt order. And thinking about it through, you know, just the lens of data science and sort of your role and the show, I realized this is so complex. I mean, there's so many moving parts here. The app itself, I think, is very well designed, because there's a ton going on, especially on the
mobile side, you have to fit so many possible sort of decisions into a small screen. But then I
realized that's just one side of it, right? I'm the consumer on
the e-commerce side. And then there's an entirely different experience for the person who's picking
the groceries and then delivering them. And so can you speak a little bit to the complexity?
I'm sure there are things that I'm not even imagining, but it seems like a pretty wild
set of data that you have coming in.
Yeah, I'll say up front that with what I call the data estate at Shipt, I'll never have this quality of problems, quality of data, volume of data, you know, just greenfield problems to solve anywhere else. It's just so big and so vast and so complex, and it's been great as a data scientist. I would also say you've identified two of the several sides of our multi-sided marketplace.
We also partner with retailers to get their product inventory data into our app, and partner with CPG brands like Coca-Cola or Pepsi to get up-to-date nutrition information and sales and coupons and stuff into our app. So we really have like four parties kind of converging on this space of Shipt, all
trying to make business exchanges, if that makes sense. It's extremely complicated in that there's
just so many people, so many different priorities and what we have to do to be successful as a
business and as data scientists is prioritize, like fundamentally just prioritize what is the most important for us to do now, because we certainly can't do it all.
Absolutely. And what does that process look like? I'd love to know, sort of as a team, and I know it's beyond just the data science team, because you're working with probably all sorts of other teams, but I'd just love for our listeners to hear: what does that process look like? How do you prioritize the work that data science does? And what does that decision-making process look like internally?
Yeah, that's the hardest part. I mean, I can try
to talk about how it is, but I think it's constantly growing and changing, especially as we as a company grow and our business changes.
You know, when I joined Shipt, we were in like 20 cities and we offered just shop and deliver.
So now we are in, I think, 45 to 48 states in the United States. We offer shop and deliver.
We offer delivery only where a retailer picks it, makes the basket for you and our shopper
just picks it up and drives it.
We have four or five other kinds of like business models.
We deliver from places beyond grocery stores, Target, places like Party City, I think a
couple of sporting goods stores.
I can't even keep up with it.
Like the business has changed so much that the main thing I would say about prioritization is that it's not a one-time thing.
It is a process, and it is an ongoing process, and it can be painful to have something that
you've worked on all of a sudden, like not be priority anymore and to be shifting gears. But
I think that's necessary to be successful.
In terms of how it actually happens,
it's getting a lot of people in a room together to hash it out and talk about it.
And then at the end of the day,
someone's got to make a decision
and hopefully the group can collectively
come to a consensus.
But as we know, sometimes people disagree
and it takes leadership to help guide the ship.
Ryan, you've mentioned that there's a lot of change that has happened in the company, right? Since you joined, because you also joined at a very early stage and the company grew really, really fast. Can you tell us a little bit about how your work as a data scientist changed and how it was affected by this growth?
Yeah. So I would say the first thing that changed is that now I'm focused on a much smaller section of the business and a smaller problem scope. You know, back in the day, I did infrastructure things. I DBA'd a database or two. I built dashboards. I did everything.
I wore so many hats and also worked with so many different components of the business:
marketing, finance, accounting, engineering, operations, product. I was everywhere.
As the company has grown, I don't do as much with internal tools. And my focus has been much more
on the operation side
or things that kind of take place across the operation side. So maybe like we talked about
out of stocks, right? That kind of spans the basket building on our member customer side
and to the shopper side. So my scope has narrowed and that's been great because
I've been able to go much more in depth with these
problems. The solutions we were providing back in the day when I started were all very simple.
We tried to be pragmatic about it. There's no use in spending two months extra to get a 5%
improvement, right? Let's get something simple. Let's get it out there. I think we still
embrace that ideal, but we just have much more opportunity to tackle harder problems. And so we
get opportunities to invest in more complicated methodologies, more complicated problems, and
hopefully bigger and more important solutions for our business.
So would you say that, let's say, the value of data science as an organization, as a function inside the company, has shifted compared to how it was at the beginning? Or is it just the scale that has changed?
In the early days, and to be fair we still very much are this way, kind of a core driver for Shipt is opportunity. We want to give people an opportunity to have more time with their family by not having to go to the store. We want to give our shoppers an opportunity to earn more income, or supplemental income, or a full-time job to provide for them and their family. Early on at Shipt, that was the core of our business. And it was small enough that we could manage it in, I would say, a simple way.
Simple technology, simple rules, simple operations.
Not that it was simple.
It was very complex.
But we didn't need to rely on data science as much. As we've scaled and grown, data science and engineering have become so much more
critical to the success of Shipt, being able to function at scale, being able to be efficient
across the wide variety of businesses so that we can still be relational at our core and
provide opportunity to people at our core.
Do you think there is a time that it's too early for a company to invest in data science
based on your experience?
I would say no.
But what I would say is that investing in data science often really means investing in data science foundations: data engineers, analytics, getting to a place where analytics are driving the business as opposed to reactively interpreting things the business attempts. All of that is so important, and to me, that is what investing in data science means for a younger company. That sets the stage for the fancy data science that we all think of when we say data science: advanced models, statistical analysis, that kind of stuff.
So from what I understand, at Shipt,
data science is also like a big part of the product, right?
Like there are features on the product
that are actually driven by data science.
Yes.
And we will discuss more about this in a bit.
But before we go there,
are there other functions of the company right now that benefit from having a very strong data science team?
Yeah, absolutely. So features for members and shoppers is one you identified. Obviously, marketing and retention can benefit
a lot from data science and just trying to understand who our customers are and how we can
make them happy effectively. We have a lot of natural language processing data science problems
that shipped as well. All of the products that we get from our retail partners and from third-party sources and trying to enrich those,
it's really challenging to know if this package of goldfish that Target sells is the same as this
package of goldfish that Winn-Dixie sells. There's a lot of natural language processing problems
there of cleaning those up, identifying their brands, identifying if they are the same product across stores and getting our
data catalog in a way that is standardized across locations.
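To make the matching problem concrete, here is a rough sketch of one simple first pass: character n-gram TF-IDF similarity between product titles. The catalogs here are invented, and Shipt's real pipeline is certainly more involved than this:

```python
# Is this goldfish package at one retailer the same as that one at another?
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

catalog_a = ["Pepperidge Farm Goldfish Cheddar Crackers 6.6 oz"]
catalog_b = [
    "PEPP FARM GLDFSH CHDR 6.6OZ",        # vowels stripped, receipt-style
    "Great Value Cheddar Crackers 7 oz",  # a different house brand
]

# Character n-grams are robust to the abbreviated, vowel-less titles
# that come in from some retailer feeds.
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
matrix = vec.fit_transform(catalog_a + catalog_b)

sims = cosine_similarity(matrix[0:1], matrix[1:])
for title, sim in zip(catalog_b, sims[0]):
    print(f"{sim:.2f}  {title}")
```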
There's plenty of finance and accounting modeling and forecasting components. And then I'd also say there's a big operations component, just like, we have a marketplace and we need to make sure that supply and demand are balanced.
How do we hire shoppers?
How do we match shoppers and orders?
All those need to be done within the context of who we want to be as a business and how we want to value our shoppers and value our members.
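As a toy illustration, shopper-order matching can be framed as a classic assignment problem. The cost matrix below is just hypothetical travel minutes; the real decision weighs many more factors, and Shipt's actual method is not public:

```python
# Match each shopper to one order while minimizing total cost.
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = invented minutes for shopper i to reach order j
cost = np.array([
    [10, 25, 40],
    [30, 12, 22],
    [45, 28,  9],
])

shoppers, orders = linear_sum_assignment(cost)
for s, o in zip(shoppers, orders):
    print(f"shopper {s} -> order {o} ({cost[s, o]} min)")
```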
But data science plays a key role in all of those things at Shipt.
Yeah, it's super interesting.
I want to ask you another question to try and make Eric happy.
So my question is...
It's impossible, Costas.
We'll see.
We'll see.
We'll see.
So Ryan, can you give us a tip, or help us understand how data science can help marketing, especially in a way that is not that obvious to most people, people like me out there who are not actively working with marketing or data science?
I will say that this is probably the area I have spent the least amount of time at Shipt focusing on, but a common way I've seen data science used across multiple companies is for subscription
services, identifying likelihood of churn.
And so you can build a data science model that predicts if a subscriber of your service
will still be there, still be a member in 30 days or 90 days or 15 days, whatever time
interval you want. Coming out of that, you
can get an understanding of who you need to target for retention. And this can be as simple as
reaching out to them on the phone and asking how they're doing, like for a small, you know, SaaS company. Or for something like Shipt, this could be something like extending a discount or,
you know, giving them a $5 credit to try to get them to re-engage with the service.
Those interventions obviously need to be domain specific.
But if you can understand who is appreciating your service and who is not appreciating your service, you can begin to try to figure out why and how you can fix that problem.
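A minimal sketch of the churn model Ryan outlines, with invented features and data: predict whether a member will still be subscribed in 30 days, then flag likely churners for an intervention.

```python
# Hypothetical retention model; features and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features: orders in last 30 days, days since last order.
X = np.array([[8, 2], [1, 25], [5, 6], [0, 40], [3, 12], [7, 3]])
y = np.array([1, 0, 1, 0, 0, 1])  # 1 = still a member 30 days later

model = LogisticRegression().fit(X, y)

# Score a current member. A low retention probability flags who to
# reach out to, whether by phone call or a $5 credit.
p_retain = model.predict_proba(np.array([[2, 20]]))[0, 1]
if p_retain < 0.5:
    print(f"P(retained) = {p_retain:.2f}: consider a re-engagement offer")
```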
What do you think, Eric? Is this something useful for marketing?
Yeah, I'm very happy. Thank you. Made my day. So if you put yourself in the shoes of someone in marketing, hearkening back to my previous life, I think the challenge you have at scale is that the analytics tools that you're using aren't built to predict churn with sort of like custom inputs, right? And you can't really do it in a spreadsheet, because it's way too much volume. So you can sort of anecdotally look at individual customer journeys to try and give yourself an idea of what types of things might be causing churn. But it's pretty hard tactically to achieve sort of a statistically significant view as a marketer, you know, if you're not really, really good at SQL. But even then, you're sort of at a large company dealing with, you know, access to databases and all that sort of stuff. So absolutely, especially at scale. I mean, I can't imagine, you know, trying to crunch data at a company like Shipt because of how much there is. So yeah, I think there's huge value in that. Ryan, one thing you mentioned,
actually, and I'm interested in this kind of from a marketing product standpoint, but you had
mentioned prior to the show, a sort of cold start problem with search. And I'd love to hear the
story around that tactically. I think our audience would love to hear about that. Could you talk
about that particular problem and how you solved it?
Yeah. So back in the day, the good old days, Shipt was small. And I think I was hired before we had an engineer who was responsible for search. We had an engineering team, and they had solved search at some point, but we didn't have a search engineer. And so the problem that we had was
every time we launched a new retail partner, we had a brand new catalog of data.
No one had ever seen it before. No one had ever bought anything from it before. I mean,
yes, people had bought goldfish at prior retailers, but how should we show search results
from a new catalog? How do we handle things like house brands or things that are unique regionally, those differences, when basically we had nothing?
I'll start by saying that our search team has since solved this better than I did early in the Shipt lifetime, and so my solution has now been deprecated and laid to rest, and we are all better off for it.
But what we did to solve this problem was basically
built what I would call a human in the loop tool that allowed us to use machine learning and then
polish it at the very end to give a great search experience on day one for our new retail partners.
What we first did was some advanced natural language processing stuff to compare products from existing stores that we had sold, to products from new stores that we had never shown, seen, or sold any products for. Getting a little tangential here, but I'll be upfront and say that UPCs are neither universal nor unique. So the idea of understanding exactly which product at store A is the same at store B is not as simple as you'd think it would be. And if you've ever looked at your receipt and seen them, you know, take all the vowels out of a product name and give you the price, sometimes our data comes in like that, or at least it used to.
So there's this fundamental problem of identifying first what new products were similar to old products and then sorting them based on that similarity and inferring from the old product search rankings where the new products should be.
So that's a very high level of how we did it. Technically, I would say we built an in-house KNN, k-nearest neighbors, clustering model, and then inferred search results off the clusters. And then we had a suite of tools that allowed us to go in and check: you know, people always buy bananas, so does bananas come up near the top of the list? What happens if you search for cheese? Does it look right? And we were able to just manually clean it up.
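A simplified sketch of that cold-start idea: vectorize old and new catalog titles, find each new product's nearest known neighbor, and let it inherit that product's search ranking before humans polish the results. All names and data here are illustrative, not Shipt's deprecated system:

```python
# Infer search rankings for a brand-new catalog from a known one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

old_titles = ["bananas organic", "cheddar cheese block", "whole milk gallon"]
old_rank = [1, 12, 3]  # existing search ranks (lower = shown higher)
new_titles = ["house brand cheddar cheese", "bananas bunch"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
old_vecs = vec.fit_transform(old_titles)
new_vecs = vec.transform(new_titles)

nn = NearestNeighbors(n_neighbors=1).fit(old_vecs)
_, idx = nn.kneighbors(new_vecs)

# Each new product inherits the rank of its closest known product;
# a human-in-the-loop pass then cleans up the obvious misses.
for title, i in zip(new_titles, idx[:, 0]):
    print(f"{title!r} ~ {old_titles[i]!r}, inferred rank {old_rank[i]}")
```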
Sure. And did you see pretty significant improvements to people's sort of initial search experiences?
Yeah. So we definitely
saw improvements to initial search experiences. It's a challenging thing to test and measure
because every new retailer is different. So there's a lot of confounding variables.
But we did see significant improvements in search conversion rates,
both compared to what we had been doing beforehand, which was people manually identifying the top thousand items and sorting them at that point in time. That's old Shipt: search rankings were just, you know, top to bottom, and filter and order was kind of how it worked.
The other big benefit we had is that the time to market for solving the search problem was drastically cut down.
It would take, you know, our catalog team multiple people, multiple hours, four to eight hours, to just initialize search for a new retailer. With the data science model, a data scientist was able to do it in an hour of active time, plus whatever computational time it took, and get better results while saving a lot of man-hours in the process.
That's amazing, Ryan. Can you share with us a little bit more information about the data stack?
We started from very early Shipt with that last conversation, moving up to more modern Shipt.
The data stack has evolved over time. Today, we use Snowflake as our data warehouse solution. We have Postgres databases, which may or may not be used by the data science team. Our data scientists use Tableau as a BI tool. We also use dbt a lot for data engineering purposes.
But all of our like actual model deployment processes, taking the out of stock model and
running it in production so that it feeds real-time systems, all of that is built in-house. And we are building a team at Shipt right now to really build the next-generation model deployment stuff. An ML platform is kind of what we're calling it: build the next-generation ML platform for Shipt, because we're still running on some of the stuff that we hacked together in a couple of afternoons several years ago. So there's a lot of excitement there.
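As a generic illustration of what running a model in production to feed real-time systems can look like, here is a bare-bones scoring-service sketch. This is a common pattern, not Shipt's in-house platform, which isn't public; the endpoint, fields, and placeholder scoring logic are all invented:

```python
# A minimal real-time model-scoring service (hypothetical throughout).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    hours_since_last_sale: float
    substitution_rate: float

def score(f: Features) -> float:
    # Stand-in for a real trained model loaded at startup.
    return min(1.0, 0.01 * f.hours_since_last_sale + f.substitution_rate)

@app.post("/out-of-stock-score")
def out_of_stock_score(features: Features) -> dict:
    # Real-time callers (the shopper app, substitution logic) get a
    # probability back and decide what to do with it.
    return {"score": score(features)}

# Run with: uvicorn service:app --reload
```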
Yeah, that's super interesting, actually. We had a show a couple of weeks ago with Tecton.
Tecton, yeah, you probably know them. I've met some of the guys at Tecton. They're doing some really cool stuff over there.
Yeah, yeah. And what I found very interesting is that in this space that we call feature stores, which, I mean, okay, I think there's still a lot of confusion about what these things are, there's not a lot of, let's say, open-source solutions out there. Actually, there's only one, which I think is called Feast.
Any plans from your side to open source anything?
I don't know, is the honest answer. Personally, I would love to open source things. I think that sounds fun and satisfying. I think that, more than likely, our team will be relying on and building on top of a lot of the existing open-source tools that are out there and then tweaking them for our needs.
Yeah, makes sense.
I would say that an ML platform is a huge competitive advantage these
days in the data science space.
After the people part of data science, the next hardest part is actually integrating it with all of the things.
Like how do you take that tool and use it effectively?
How do you do it in real time?
So that's my understanding and expectation of why there's not a lot of open source machine learning platforms out there.
It's because to the people who have built it and done it well,
it helps them succeed
and helps them outlast their competitors.
Yeah, I think that's an excellent point.
And I think it's a very good explanation
of why this is happening.
And I think it explains why
even traditional companies, like really big companies that traditionally have a lot of open-source presence, like Netflix for example, even they haven't made public the feature stores that they have built. You see a lot of talks about it, presentations and all that stuff, but none of these are open-sourced yet. And I think it's an excellent point that you are making: it's actually a very competitive advantage that companies have by keeping these systems in-house. So it makes total sense. So you shared with us the data
stack that you have. Are there any like specific tools that are used only by the data scientists?
How do you build and how do you iterate on your models? Like are there any frameworks that you
are using for that? Libraries? Anything specific that you would like to share there with us?
So the first thing I'll say is I think that we are evolving there, and we'll have a lot of new tools to better set up our model-building systems down the road. We've looked into all kinds of things, from MLflow to Kubeflow, for artifact storage and model-iteration storage. There's a lot of opportunity out there, and we're trying to decide what we want to build in-house, what we want to pay someone for, and what we want to use open source for.
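Since MLflow comes up as one of the options they evaluated, here is a hedged illustration of its tracking workflow, logging parameters, metrics, and the fitted model per run. The run name and toy model are invented; this is not Shipt's setup:

```python
# Track one model iteration with MLflow's tracking API.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]

with mlflow.start_run(run_name="oos-model-v2"):  # hypothetical name
    model = LogisticRegression(C=0.5).fit(X, y)
    mlflow.log_param("C", 0.5)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Store the serialized model as a run artifact for later
    # comparison or deployment.
    mlflow.sklearn.log_model(model, "model")
```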
In terms of tools that we commonly use, Shipt's data science is very ambidextrous. We use both R and Python as the problem needs. I will say that anytime we get into that real-time space, where we start having to think about feature stores and think about APIs, Python's going to win out there. But oftentimes we find that R is much more helpful at that exploratory data analysis phase. And we do have internal packages for both R and Python that allow us to very easily communicate with all of our data stores and write and push data to them, as well as our cloud provider. In terms of other tools and the process flow, I think what we
really want to do is build our systems in a way that data science can iterate independently of the rest of the business.
Obviously, that's not okay for us to do in all cases.
But if I'm building a recommendation engine, the goal would be for me to build it in a way that it communicates consistently with engineering.
And then I can begin iterating on it however I want to improve it.
We work with product managers and the business so they're aware of the changes, but apart from the current systems.
So we really embrace that kind of microservice idea.
That'd be the engineering component of it, like microservices at Shipt. And we really strive to build it simple at first and then iterate and learn as we launch and run.
This is great.
Last question from my side
and then I'll let Eric ask any questions he might have.
What's the relationship with data engineering
and how you work together with them
and how is the function defined inside?
Yeah, so I would say there are actually kind of two data engineering groups at Shipt. One data engineering group at Shipt is all about getting data that our partners provide
and getting it into our system so that we can sell the products they have.
And that is a lot of data.
And historically, we've worked very closely with that group just in terms of building solutions to clean, standardize, and understand the product data that's coming in from our partners.
Predicting what brand a product is if it doesn't come tagged with a brand, that kind of thing. The other group is all about getting data to be stored in our data warehouse and transforming that data into things that can be used by data scientists for analytics, or by others in the company for analytics. We work closely with them, though I think that that team is going to continue to
scale even more as our group is growing. A lot of the challenge that we have at Shipt with data
and growing fast is things change, and it's hard to know when they change. So if engineering
changes the way they're solving a certain problem in the business,
like it can be challenging for us to know that that happened way downstream.
Our data engineering team is crucial for handling those changes and ensuring that we get clear
and ready data.
And they're a bunch of great guys.
I love them all a lot.
Ryan, one question on the data engineering side.
Did you, going back to sort of the early days
before maybe the data engineering team
was as big as it is today,
did the data science team actually do
some of the data engineering work as well?
Or has there always been a sort of clear delineation
of responsibility?
There has not, and in a lot of ways, there still isn't. Like, the data I need to build an out-of-stock model, right? It's not going to be present in this perfect form where I can just select star from my table, you know, and roll with it. There's still a lot of data engineering
that we have to do as data scientists
to build our models and build the pipelines
that feed and serve those models, and to run the analytics from those models back into a place where they can be analyzed later. But I will say that the demarcation is much cleaner today than it was in the past. Very early on, I did a lot of data engineering, and that was just a necessary thing for data science to work and function at that time, because there weren't as many dedicated resources for internal data engineering.
Sure. Super interesting. Well,
we're close to time here, so we'll ask one more question before we hop off. What are you most
excited about in terms of trends in data science that you kind of see on the front lines doing the work every day?
Yeah, so that's actually a really good question and a really hard question. One thing that I
really am excited about, and this is broad industry right now, is I feel like some of the
hype is dying down. There was this idea that data science was going to solve all of our problems, that self-driving cars were going to be here. And they're not, and all our problems haven't been solved yet.
And so we're at a point where we're kind of coming to terms with what data science can do. And we're really beginning, as an industry, to make tangible steps forward, as opposed to having to dance around the hype and the expectations.
So that's one thing that really excites me. I think the other thing is that
people are becoming more and more receptive to data science being used in effective ways. And
we are really learning as an industry, how to do data science effectively. Like y'all talked about
this earlier: lots of people have come on and talked about the human element of data science and how important that is. And as an industry, we're really starting to realize that, and best practices are being developed. A whole bunch of companies have popped up to provide services for MLOps and how we can do data science at scale and monitor data science at scale. Like, we're coming out of the kind of wild-west early days of data hype into a more steady and stable industry with some more best practices around.
That's not to say that we are stable
and there's not a wild west component of this,
but I feel like it's much more clear
how to solve a lot of common problems today than it was when I started.
I've also learned a lot in that time too.
Well, it's interesting, even if you think about when you started at Shipt and, you know, to today, the number of new tools that have been introduced that make a lot of these things easier, or tools that have been developed internally. You sort of have this maturing of the discipline where some of the technical problems are getting out of the way, and you can focus on the deeper problems that you're actually trying to solve, as opposed to building the infrastructure that makes it easier for you to solve them.
Absolutely. Yeah. And I mean, an example of how this changed for us: early on at Shipt, we used Airflow, the batch job orchestration tool. It was pretty much the only option at that point in time. We've revisited that. We're still using Airflow, but we've revisited it and had discussions about whether that's right for us these days. And there are now six or ten options, and plenty more that I don't know about, that each meet that same general need but do it in slightly different ways, with slightly different targets, slightly different niches. And from that, it's just wonderful to have opportunities and choices to figure out how you want to do things for your business, and to lean on the expertise of others.
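For readers who haven't used it, here is a minimal sketch of the batch-orchestration pattern Ryan describes, in Airflow. The task names and schedule are invented, but the DAG structure is the standard Airflow idiom:

```python
# A two-step nightly pipeline: build features, then score the catalog.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pulling features from the warehouse")

def score_products():
    print("scoring the catalog with the out-of-stock model")

with DAG(
    dag_id="nightly_out_of_stock_scoring",  # hypothetical DAG
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_features", python_callable=extract_features
    )
    score = PythonOperator(
        task_id="score_products", python_callable=score_products
    )
    extract >> score  # score only after features are ready
```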
Absolutely. Well, Ryan, it has been such a pleasure to have you on the show.
So interesting to hear about everything that's going on at Shipt. Congratulations on the success and best of luck in hiring, doubling the size of the team before the end of the year. That's
a tall order. Yeah. Hiring is hard. I'm excited for it though.
We need all the help we can get. Cool. Well, we'd love to check back in with you on a future episode
and thanks again. Thank you so much. I really appreciate it, Eric. Thanks Costas.
Thank you, Ryan. It was great.
As always, a fascinating conversation. I think one of my big takeaways was hearing about how things have stayed the same in
many ways going through a huge acquisition by such a large company like Target. That was just really
cool to hear. I mean, obviously there's more sort of structure being part of a larger company,
but it's really neat to hear. A lot of times you'll hear the opposite story where, you know,
a company gets acquired and sort of your ability to be agile early on dissolves and it's not as gratifying to be part
of the team anymore. But I didn't get that sense at all. And it just makes me really happy to hear
that that was sort of managed well and that they can still have that startup type feel to some
extent at a big company.
Yeah, I'm a little bit disappointed, to be honest, Eric. We have another data scientist who said that we are not going to see the Terminator anytime soon. So it's a long time until 2030. So that's true. But yeah, regardless of that, it was a great
conversation with Ryan. And I think it's amazing to hear from people of like what kind of impact data science can have in a company and how many different aspects of the company it can affect.
And I think, from what I understand during the conversation we had, Shipt is such a case. You have internal users, you have the product running on it. Pretty much every stakeholder around the company is affected by data science.
And I hope that we will have more and more opportunities in the future to communicate and educate the people out there about how data science is an important part of any tech company today. And not only tech, actually, any company.
And I think one theme that's been recurring
is the human element of data science,
which I think has been really interesting to hear about.
And Ryan brought that up without us even bringing it up.
And that's just been a constant theme with all of our guests,
which is, I think, both fascinating and encouraging.
Yeah, absolutely.
All right.
Well, until next time,
thank you for joining us on the
Data Stack Show. Make sure to subscribe on your favorite podcast app. You'll get notified of new episodes every week, and we'll catch you on the next one. The Data Stack Show is brought to you by RudderStack, the complete customer data pipeline solution. Learn more at rudderstack.com.