Drill to Detail - Drill to Detail Ep.32 'What Really Works in eCommerce' With Special Guest Will Browne
Episode Date: June 26, 2017. Mark is joined by Qubit colleague Will Browne to talk about his recent academic paper co-authored with Mike Swarbrick Jones on what techniques are most successful in influencing and retaining eCommerce customers, using analytics and statistical analysis-at-scale to analyze and understand 20 billion user journeys held in Qubit's Google Cloud Platform-hosted Customer Data Store.
Transcript
So hello and welcome to this week's episode of Drill to Detail and this time I'm pleased
to be joined by two guests that actually work just a few desks down from me at the company
I'm working with at the moment called Qubit. So Will and Mike, welcome to the show.
Hi Mark, thank you very much.
Hi. So Will and Mike were the authors of a new research paper that Qubit brought out this week,
which actually I thought was very interesting and very pertinent to the kind of big data world and
the kind of the analytics world that I work in and listen to. Actually, it was also featured
in Hacker News, which was particularly interesting. And it's also particularly topical because of the news last week that Amazon were acquiring Whole Foods Market and expanding once again from their e-commerce empire, and, you know, putting a lot of concern really in the minds of other e-commerce companies that are trying to compete. So Will, give us a summary first of all of what the paper was and actually what you two do.
And then what we'll do then, we'll start to drill down after that into a bit of the history of this kind of industry.
And then kind of what Qubit are doing and what the paper is about.
So just kind of a summary of what you do, first of all.
Yeah, of course. So me and Mike both work with the data science team at Qubit.
And what we've done with this particular paper is look at a whole load of experiments we've done. Qubit works with 200-plus clients, and we aim to try and improve revenue by running experiments. Now, obviously not all of these experiments work, so we thought we'd go through the 50,000 or so that we'd done and try and understand which of them worked and which of them didn't. To do that we had to categorize them. We did some interesting statistics, used some very interesting statistical models to do this, and at the end of the day we managed to work out that some types of treatments, such as a button colour change, resizing an element or messing around with an image, don't tend to do very much, whilst other types of treatment, the ones that affect how you perceive the value of a product, do. And so we tried to go into as much depth as we possibly could to understand which of these treatments did something and which were probably a waste of time, as we think.
OK, so, Will, there's a lot to unpack in that kind of intro there, really.
And I'm conscious that, you know, obviously, when I came to work with Qubit in the area I'm working with you in now, I was familiar with the data platforms you guys use to land data in and the general techniques that are used to do the research you're doing. But there's a lot of language in there that people wouldn't understand and wouldn't really get. So let's take a step back.
And so if you think back to probably what a lot of people who work in IT are familiar with, they're familiar with kind of e-commerce sites and they're familiar with building websites and so on.
But there's a whole industry, isn't there, that Qubit kind of came out of, but now it's in a different area.
But a whole industry of trying to get those things to be more valuable and more productive and so on for the site owners.
I mean, talk us a bit through the history of that and some of the words you've been talking about.
Yeah, of course. So a long time ago, I think it was 2001, the first experiment was done, by Google I believe, and they were looking to try and change how people interacted with the search term. Now, these kinds of experiments online are very simple: you have one version of a website and then you have an improved version of the website, or what you hope to be an improved version of the website.
You randomly send some people to one side or the other,
and you find out whether the improved version caused changes in behavior that you wanted.
If it did, you could measure that, perform a statistical test, and say,
great, this is actually better than that.
Let's continue and move on forwards with the improved version.
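As a minimal sketch of what that "perform a statistical test" step might look like, here is a simple two-proportion z-test on conversion counts; the visitor numbers are made up for illustration and this is just one common way of making the "is this actually better" call, not anything taken from the episode or the paper.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test results: visitors randomly split between two versions
visitors    = np.array([10_000, 10_000])   # control, variant
conversions = np.array([300, 345])         # purchases observed in each group

# Two-sided z-test on the difference in conversion rate
stat, p_value = proportions_ztest(conversions, visitors)

rates = conversions / visitors
print(f"control: {rates[0]:.2%}, variant: {rates[1]:.2%}, p-value: {p_value:.3f}")
# A small p-value suggests the difference is unlikely to be pure chance,
# so you'd keep the variant and move forward with it.
```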
But at the beginning of this, obviously, it was quite hard to do anything other than simple changes of text and buttons and colours, so these tests were all very visual. And people actually did have a lot of success, I think, in the early days of the internet: by changing words and changing colours and learning how people would interact with a browser, you could learn a lot. Gradually, as people have been able to collect more data
about people, about what's going on the site,
you move away from things such as resizing elements
and changing navigation structure to,
okay, well, what are other people doing on the site right now?
How can we get the information that we're collecting
about our users and use it to drive persuasive changes
on the site to make them want to purchase more?
So we've seen this journey, really,
from the basic cosmetic changes, through to, okay, let's try and optimize for increasing things like conversion rate, which is the proportion of people who actually go on to purchase, and finally on to things like, well, how can we make more people purchase, and how can we make them spend more using the data that we collect? So we've got to that kind of stage now, I think.
Okay, so what you're describing there is what's termed A/B testing, wasn't it, at the start there?
Yes.
Where you've got one version and a test version. And those things you call tests, they're really when you run these different versions, do the stats on it, and try to understand whether the result was statistically significant and so on. And I suppose that area grew to be what you call A/B and A/B/n testing, and multivariate testing, a whole world of trying those things. But fundamentally, I guess, what you're doing is just fiddling around with things and trying to understand which of the different placements on the screen are the best. There's nothing particular about the viewer there; it's just trying different places, isn't it?
Yeah, it's exactly that.
It's about the designs.
Initially, it was a lot about how people want to change design with the assumption that changing design would have an impact on user behavior.
And by doing A-B testing, what you actually find out is
what are the kind of things that you can change
that actually cause real change in user behavior.
And for a long time they were done particularly badly, they weren't very well structured experiments, shall we say. And for a lot of people that was fine; people really enjoyed the fact that you got some data back, and any data was good. But now I think people are much keener to say, okay, we want the right data, we want the results of our experiments, our A/B tests, to be valid, and we want to be able to trust, moving forward, that those kinds of results actually mean something for the bottom line, which for a long time they didn't.
Okay, okay. So the work that you and Mike do, you're in the research department, aren't you, at Qubit? So what kind of techniques and data and things do you work with in that area? Because you really are, you know, examples of actual data scientists at work, aren't you?
Yes, I've been on the research team for about two and a half years now.
Before that, I did a PhD in maths.
I've been a data scientist for about four years now.
So we mainly work in Python.
We do a lot of SQL queries.
Our data infrastructure is all built on Google Cloud services.
We use a lot of techniques from all sorts of areas of statistics.
We particularly like Bayesian statistics at Qubit.
We also do a lot of machine learning, that kind of thing, and a lot of our day-to-day is more about building our data-driven products for Qubit rather than this kind of analysis.
Okay. And so Will, you're a PM, aren't you, for some of the products that Qubit build on this kind of research?
Yes, precisely. So I work within the product management team; I used to be a data scientist at Qubit as well,
and we try and use the machine learning techniques
we have to try and build solutions and products
that actually cause changes in behavior
because, I mean, that's what it all comes down to, I think.
You can use machine learning
to try and make the products you're building better,
try and make people more likely to spend more
and make these changes that are more persuasive.
But it's just one possible way
of making people do more things.
And some of the best techniques we've seen actually don't; they certainly don't need machine learning.
A really good example,
everyone's familiar with product recommendations,
and they're a good thing.
They definitely have an impact.
They are a positive thing.
They increase the amount of money that people tend to spend.
But the size of that effect is quite surprising when compared to something as simple as "you have four left in stock", a message that just tells you there are a few items left in stock. One requires a lot of data, a lot of innovation, algorithms, a massive pipeline. The other just requires you to know what stock's available on site.
And what's interesting about this analysis that we've done, the research, is that it really puts into stark contrast that there's a difference between how complex and sophisticated the machine learning is and the end result in terms of user behavior, and those things can be completely independent. Which is really fascinating, because it becomes all about finding out what works rather than what's the cleverest thing you can do. And at Qubit we try and do a bit of both, because we want to be able to do the cleverest things we can, but we understand that they're not always the best way of providing value and causing a persuasive change on a website.
Okay, fantastic. And Will, actually, you're the person I sat next to when I first arrived here.
Yeah, exactly.
So, fantastic. Let's get into this paper then. Just outline for us what this research paper was about and what the drivers were. I mean, you've mentioned a bit there about trying to find out what works, but give us a bit of background as to the thinking behind it. We'll get into the details in a minute, but how was it done, and what was the reason for it?
Well, I think I started doing something similar to this probably four years ago.
It was just to see, well, is anything that we're doing working?
And that was the first question.
Is A-B testing a good idea at all?
Which turned out it was, which is good.
And then it kind of gradually grew to be, okay, well, can we help our professional services team? Because here at Qubit we have a professional services team that go out to try and improve conversion rates and improve the amount people spend for each of our clients. Can we help them be better at their jobs by telling them the kinds of things that work? And we did a bit of that; I think we had some pretty good success there, because we helped focus the team on the kinds of techniques that actually drive value. Then last year we spent a lot of time doing it in a slightly more sophisticated way, and this year Mike's really taken those ideas and run with them and turned them into something very sophisticated.
Yeah, and I would say the scale of these things has really grown. I remember we did one of these maybe a couple of years ago, and there was maybe something like 60 experiments,
which we examined in some detail.
And this time around, we got that number up to something like 6,000, 7,000,
something like that.
Okay, okay.
So I guess probably someone listening might think,
well, okay, this is a piece of marketing,
or this is something that is just kind of some numbers,
which they've kind of played around with and made to look how you want it to look. But tell us how you've done it, because, you know, it's been audited by PricewaterhouseCoopers, it's been done at scale. Talk us through the methodology a little bit, again as an example of a research project done at scale like this.
Yeah, I mean, no one's really going to believe you if you, as a marketing personalization and experimentation vendor, say, yes, we do good experiments. No one's particularly going to believe that. We really wanted it to be the first really trustworthy and assured and honest and transparent case of saying, look, there's all these different ways you can do things, and we've really tried to get across what you can expect from doing these kinds of treatments. That was the idea. We want to change the industry from being about, oh, you know, you can get a 30% uplift in conversion rate, or, hey, you can get a 35% uplift in revenue by doing this one simple thing, because I don't think anyone in the industry really believes that. I think a lot of people who work in e-commerce will understand that these claims are based on maybe a one-off; they're a massive statistical outlier. So we thought it'd be interesting: instead of talking about the statistical outliers and the possibles, why don't we talk about what you could expect to get? What is the most likely outcome of you doing some of this work?
Okay, okay. That sounds good. So take us through some of the highlights of it, because obviously there's a lot in the academic paper, but what are some of the things in there that were expected, and what did you not expect, and so on?
So we kind of had a fairly good idea about what we were going to get going into it,
because we had done these analyses before,
and we had some feelings about things like, so we call it scarcity, this is where you're saying there's only three or something left in stock. Things like urgency, this is where you have a countdown timer counting down to, you know, you only have three hours left to order to get next-day delivery, or something like this. But there was also something we call social proof, where you're talking about what other people are doing. This is fairly new, and we were quite interested to see how that went.
So before we get too far into the details on that, you've got a few things there: you've got social proof, you've got scarcity and so on. What are they, how data-driven are they, and why do people think they have an effect, really?
I mean, this boils down to work done ages ago by people who are nothing to do with the e-commerce world. There was a great book by Robert Cialdini on the principles of persuasion, where he broke down lots of sales techniques into basically authority, scarcity, and he uses another kind of version of social proof, which I think he calls consensus or something like that.
And they're very well-known techniques in sales.
And there's a lot of evidence behind these.
There's theoretical evidence, and then there's data-driven evidence of this working in the real world. So it's really just applying the same techniques that people use day-to-day selling cars, shoes and washing machines to the world of e-commerce. And the shift here is from thinking that it's going to be the user interface that causes the changes in behavior, to persuasive messaging: it seems to be the same things that have made us want to buy fruit from a market stall for the last 10,000 years; the same principles still apply.
So the biggest winners, in terms of what we found... well, everything that we're doing is in terms of how much we add to the average basket spend for each visitor. So for each visitor who arrives on the site, whether or not they buy anything, we just want to know the average amount that they spend.
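As a minimal illustration of that revenue-per-visitor metric (with made-up numbers, not figures from the paper), you divide total revenue by all visitors, buyers and non-buyers alike, so an uplift in either conversion rate or basket size shows up in the same number:

```python
# Hypothetical figures for illustration only
visitors = 50_000            # everyone who arrived on the site, buyers or not
orders   = 1_500             # visitors who actually purchased
revenue  = 120_000.00        # total spend across those orders

revenue_per_visitor = revenue / visitors          # includes the non-buyers
conversion_rate     = orders / visitors
average_order_value = revenue / orders

print(f"RPV: {revenue_per_visitor:.2f}")          # 2.40 per visitor
print(f"conversion: {conversion_rate:.2%}, AOV: {average_order_value:.2f}")
```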
And so the things which we found were best were scarcity, so this is, you know, "only three left in stock"; scarcity was about a 3% uplift. We had social proof, which was about a 2% uplift, and urgency was about 1.5%. And the other major finding that we had, and we weren't particularly surprised about this, was that the simple UI changes that people make basically have no impact on average. So just changing the color of an element on the page has literally a zero percent average uplift. Which isn't to say that every single color change will have a 0% effect; just that the average is completely neutral when you change the colour of something.
And also that factors in the cost, presumably,
of actually doing the work as well and that sort of thing.
Yeah.
And I think one of the ones that people were a bit surprised about
was calls to action.
So this is where you're changing the wording on a website to be more suggestive, so rather than saying, you know, "complete your order" or something like that, you'll change the wording to "start your adventure" or something more colorful like that. And our professional services team, I think they're quite keen on these; I think they had a feeling that they would have an impact, but again they basically have no impact, they were basically neutral.
Interesting. Is that because you think the effect of that has been diluted over the years? I mean, I remember that was a thing you did, right, a few years ago. Is it that people are more used to that now, or are there probably examples where it does work?
I mean, there are examples where it does work, and I think the reason people hear about those examples and get quite excited about them is that it seems like a very easy thing to change: it's just the wording, it's a simple thing you can change. So people test them a lot on the off chance they're going to have a big effect, and that's probably what happens: enough people have tested them that really you're just messing around with very small changes in wording which don't mean much. I think the examples that tend to have worked in the past have been changing the wording from "make a reservation" to "continue", and you can see that those have a very different meaning, and so they cause a very different change in behavior. So that's the example where it definitely does work, and I think we've seen cases where those changes do something, but on average what we've seen is they don't, because people don't do things like that; they tend to focus on smaller changes which don't have the same effect.
Okay, okay. So let's take two areas that are data-driven and use things like machine learning and so on. I mean, if we look at the uplift and the benefit of things like product recommendations, because they're a classic thing, aren't they, that everyone learns when learning machine learning and big data and so on. What did you find, really, with the effect of those and the usefulness of those?
So product recommendations
were fairly interesting. I mean, it's worth pointing out that with product recommendations there are different ways you can do them. You can either put them on the product listing pages, so when someone clicks into a product you'll have a set of "you might also like these"; but another way that people use them is, once someone has reached the basket page, it's "people who liked this also liked this". So what we found with product recommendations is that they were actually fairly neutral in terms of getting more people to convert. So,
if you weren't going to buy anything, on average, product recommendations didn't really help with that.
But we did find that out of the people who did buy things, they tended to spend a little bit more.
So product recommendations managed to raise the revenue by making customers buy more.
And the effect was not huge.
The effect was about half a percent,
but it was one of the few treatments
that did have a reliably positive effect.
Okay, okay.
And what about, I mean, obviously working at Qubit,
there's other players in the industry as well
that are using data from visitors' actual activity
and preferences and so on.
What was the finding for that kind of work, personalization and so on? Was there much uplift in that?
So we did have a look at that. We have an idea that you can segment experiences, or segment tests, or not segment them, based on customer activity and visitor behavior. And from the analysis we did, it's much more subtle there, because you have different gradations of segmentation. You could say that maybe a mobile user versus a desktop user is a very different user, but is that really the same as visitor preference? Because you might think something more like, well, have they bought t-shirts before, is a much better indicator. So there's definitely a scale of how personalized you might think these experiences are.
But from a crude split of either they are segmented or they're not,
we found that on average, the expected impact
of the segmented version was three times higher.
So it went from 0.3% in terms of uplift to 0.9%,
which is interesting.
It may well be indicative of what we see in the future.
It may not be. It's a good step on the journey towards personalization, which I think is what a lot of people are trying to get towards. But it does kind of show, I think, that if we delved into it, there are going to be good versions of segments, so segments that are useful and actually are differentiated from the rest of the population on the site, and kinds of personalization that aren't.
The example we tend to give is,
you could change the color of the button for a user based on their first name.
That would be very personalized, very detailed,
based on really interesting user behavior,
but that's very unlikely to have any impact in terms of how much they spend.
So there's going to be good versions and bad versions of this.
And so putting them all under the same kind of umbrella
of how do you use visitor preferences,
it can basically be done well or it can be done badly,
like everything else.
And what techniques did you use really?
I mean, the actual, I suppose, kind of method of doing this,
how did that work out really?
So, the first thing I'll say is that it was an awful lot of work. It was very, very difficult. We basically had to sit down with them, go through in excruciating detail every single step that we were going to do, and then they went away and came up with what tests, basically, they could perform on the data that we gave them
so that they could satisfy themselves
that we had sort of done the methodology
as we said that we had done it.
And there were various ways of doing this. So, for example, we use a fairly sophisticated statistical model to boil down all of the different tests into one score.
That's quite a good explanation of what I can see in the paper, which is a Bayesian hierarchical model, I mean?
I think sort of, yeah, exactly.
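For readers curious what a Bayesian hierarchical model of experiment uplifts can look like in code, here is a minimal, illustrative PyMC sketch with made-up numbers. It is a standard partial-pooling setup, not the actual model from the Qubit paper; the data, priors and variable names are assumptions for the example.

```python
import numpy as np
import pymc as pm

# Made-up uplift estimates (revenue-per-visitor uplift, in %) and standard errors
# for a handful of experiments that all belong to the same treatment category.
uplift_hat = np.array([1.2, -0.4, 3.1, 0.2, 1.8])   # observed uplift per experiment
std_err    = np.array([1.0, 0.8, 1.5, 0.6, 1.2])     # per-experiment uncertainty

with pm.Model() as model:
    # Category-level parameters: the "typical" uplift and how much experiments vary around it
    mu  = pm.Normal("mu", mu=0.0, sigma=5.0)
    tau = pm.HalfNormal("tau", sigma=5.0)

    # Each experiment has its own true uplift, partially pooled towards the category mean
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(uplift_hat))

    # The observed estimates are noisy measurements of those true uplifts
    pm.Normal("obs", mu=theta, sigma=std_err, observed=uplift_hat)

    trace = pm.sample(2000, tune=1000, target_accept=0.9, random_seed=42)

# The posterior for mu is the pooled, shrinkage-adjusted view of what this category
# of treatment typically does, rather than any single outlier result.
print(trace.posterior["mu"].mean().item())
```

The appeal of this kind of structure is that one lucky or unlucky experiment gets shrunk towards the category average, which is one way to "boil down all of the different tests into one score" per treatment type.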
I mean, I guess the point is that to do this kind of work there's lots of testing, and testing of the testing, going on as well. And actually what's interesting is that the uplifts are fairly, well, they're quite small as well, I guess.
Yeah. I mean, we were saying before about how, you know, in the industry the numbers that people quote are always so ridiculously large, like 30 percent. And when we released this paper, actually, I think it went the other way: people were really surprised about how low the numbers were, a little bit incredulous even that the numbers could possibly be that low. So, yeah, I guess you can't really win with these things sometimes.
Yeah, but there is a noticeable uplift, isn't there, when this is done properly. And I noticed in some of the material that has gone around this kind of report, you've talked about six percent. I mean,
what's that six percent, what does that mean really?
So that six percent was in some of the materials. We basically just thought, well, we are seeing these kinds of 0.2%, 0.4%, 2% uplifts in revenue per visitor, but what about the cumulative impact of all of these? So for a single client of ours, or anyone really, who is running an optimization campaign, how can we understand what the total effect is of running lots of these different experiments over time? And we found that people who use the kinds of techniques that we found to work, unsurprisingly, would have multiple versions of each of these different types of techniques on their site running at any one time, still running them as experiments, and the cumulative effect of all those experiments at the same time led to a proportional uplift in on-site total revenue of three, four, five, six, eight percent in some cases. And that's interesting, because the size of a two percent uplift on a subsection of your site is not a big impact; it's almost not worth doing if you're just going to do that one single thing. But when you start combining all of these techniques together, you do end up getting a revenue uplift that seems worth it.
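As a rough back-of-the-envelope illustration of that cumulative effect (assuming, purely for the sake of the example, that the individual uplifts stack multiplicatively and don't overlap, which real experiments won't guarantee), combining a few of the average effect sizes quoted above lands in that same single-digit range:

```python
# Hypothetical stacking of the average uplifts discussed above, assuming
# independent, multiplicative effects: a simplification, not a guarantee.
uplifts = {
    "scarcity":        0.030,   # ~3% revenue-per-visitor uplift
    "social proof":    0.020,   # ~2%
    "urgency":         0.015,   # ~1.5%
    "recommendations": 0.005,   # ~0.5%
}

combined = 1.0
for name, uplift in uplifts.items():
    combined *= (1.0 + uplift)

print(f"combined uplift: {combined - 1.0:.1%}")   # roughly 7.2%
```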
And of course, two percent of a large amount, I mean two percent of, I don't know, Amazon's numbers, is a big amount, isn't it, really?
Yeah, and for those people this is where scale is all-important, because for someone like Amazon, probably changing the color of a button, if it had a 0.01% effect, would be worth doing. But for most e-commerce vendors, if it's not above a one percent impact, then it's probably not.
Okay, okay. So, in terms of what the paper tells you and the impact on the industry, you know, I'm very conscious that there was the Amazon news the other week with Whole Foods and so on. What kind of messages and lessons and nuggets of information are there in this for, say, an e-commerce business? What's the implication of this, and the message, really?
Well, if I was going to go out on a campaign, if I wanted quick wins, I would use the techniques we've found to be things that have an impact. There are certainly some other things that you definitely should be testing anyway; there's a lot of hygiene testing. So a lot of the cosmetic and UX changes we've talked about, they don't have an impact on average, but if you're doing a big site redesign it's still worth running them as a test, because these things had a fairly wide spread of outcomes, high or low, so there was a lot of variance associated with these kinds of experiments, which means that if you didn't test it, you run the risk of having a negative impact without realizing that you've actually had a negative impact on your site.
So there's some hygiene and comfort reasons
for doing these kinds of experiments as well.
But once you've got past the initial stage of these quick wins, as it were,
I mean, there's a whole lot of other experiments
that didn't fit neatly into categories, which are more specific to each individual site. And as you learn more about your users and collect more data about them, then I think you can start being slightly more sophisticated about what you're trying to do. You can still lean on these techniques, but there will be specific things, based on your most loyal and most profitable users, that matter to you. And I think once you've got past that initial stage of, yes, we've got these initial uplifts, you've got to start focusing on the data you have about your users to try and come up with those differentiated, really important personalization techniques, which are going to be: we found these differentiated user groups and we need to show them different things to get the most out of them.
Okay, okay. And that, just as an aside, is what Qubit does, isn't it, really? So obviously the product investment and so on is in that area, but it's a general piece of advice, is that correct?
Yeah, I mean, I think it makes sense: there are going to be things that work across the board, and then when you get more data you can personalize; otherwise you're not going to get the returns.
Sure. Okay. And Mike, any thoughts or feedback you've had, in terms of doing this piece of work and being the lead data scientist on this? Any thoughts or advice or comments, really, for the analysts and the big data people listening in here?
I think it's interesting to note that there's always so much uncertainty in e-commerce.
You can measure things to a degree of accuracy where you can say,
we're 95% sure that adding this scarcity message has a positive impact.
But the problem with e-commerce is that not many people have enough data to say, you know, we managed to raise revenue by somewhere between 3.5% and 4%; most people just don't have enough data for that. So I'd say, just always be thinking about the uncertainties of your measurements, and, you know, work really hard to try and remove that feeling that you have some certainty there.
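To make that concrete, here is a small, illustrative sketch (with simulated traffic numbers, not data from the paper) of a bootstrap confidence interval on a revenue-per-visitor uplift; with modest traffic the interval is wide enough that a headline point estimate on its own is misleading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-visitor spend for a control and a variant (most visitors spend nothing)
def simulate_visitors(n, conversion_rate, mean_order_value):
    buys = rng.random(n) < conversion_rate
    return np.where(buys, rng.exponential(mean_order_value, size=n), 0.0)

control = simulate_visitors(20_000, 0.030, 80.0)
variant = simulate_visitors(20_000, 0.031, 81.0)   # a genuinely small improvement

# Bootstrap the relative uplift in revenue per visitor
uplifts = []
for _ in range(2_000):
    c = rng.choice(control, size=control.size, replace=True).mean()
    v = rng.choice(variant, size=variant.size, replace=True).mean()
    uplifts.append(v / c - 1.0)

low, high = np.percentile(uplifts, [2.5, 97.5])
print(f"estimated uplift: {np.mean(uplifts):.1%}  (95% interval: {low:.1%} to {high:.1%})")
# With this much traffic the interval is wide and will usually straddle zero,
# which is exactly the kind of uncertainty being described above.
```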
So, Will, where would somebody get hold of this paper if they're interested in it?
So, you can either search for it on Hacker News, which would probably be the way you'd find the academic paper. You could Google "what works in e-commerce", which I think it will come up with. Or the easiest and most sensible way is to visit the Qubit website, so that'll be qubit.com, and check out the research area there. And there are two versions of the paper: there's a marketing-friendly version, which is good and tells the story in much more of a narrative way.
And then there's the academic paper,
which has more detailed information about each of the tables,
each of the treatments,
and the kind of effects you can expect to see.
Of course, me and Mike would always recommend
you read the academic version,
but others might want to read the marketing.
Excellent.
Well, look, thanks, Mike.
Thanks, Will, for this.
I mean, it's been excellent speaking to you.
And yeah, really interesting to see a kind of large-scale data science project at work, and a bit of an insight into the e-commerce world, which is, you know, a big user of data and very data-driven as well. So it's been great to speak to you both.
Thanks for having us.