Drill to Detail - Drill to Detail Ep.32 'What Really Works in eCommerce' With Special Guest Will Browne
Episode Date: June 26, 2017. Mark is joined by Qubit colleague Will Browne to talk about his recent academic paper co-authored with Mike Swarbrick Jones on what techniques are most successful in influencing and retaining eCommerce customers, using analytics and statistical analysis-at-scale to analyze and understand 20 billion user journeys held in Qubit's Google Cloud Platform-hosted Customer Data Store.
Transcript
So hello and welcome to this week's episode of Drill to Detail and this time I'm pleased
to be joined by two guests that actually work just a few desks down from me at the company
I'm working with at the moment called Qubit. So Will and Mike, welcome to the show.
Hi Mark, thank you very much.
Hi. So Will and Mike were the authors of a new research paper that Qubit brought out this week,
which actually I thought was very interesting and very pertinent to the kind of big data world and
the kind of the analytics world that I work in and listen to. Actually, it was also featured
in Hacker News, which was particularly interesting. And it's also particularly topical because of the news last week that Amazon were acquiring Whole Foods Market and expanding once again from their e-commerce empire, and, you know, putting a lot of concern really in the minds of other e-commerce companies that are trying to compete. So Will, give us a summary first of all of what the paper was and actually what you two do.
And then what we'll do then, we'll start to drill down after that into a bit of the history of this kind of industry.
And then kind of what Qubit are doing and what the paper is about.
So just kind of a summary of what you do, first of all.
Yeah, of course. So me and Mike both work with the data science team at Qubit.
And what we've done with this particular paper is look at a whole load of experiments we've done. Qubit works with 200-plus clients, and we aim to try and improve revenue by running experiments. Now, obviously not all of these experiments work, so we thought we'd go through the 50,000 or so that we'd done and try and understand which of them worked and which of them didn't. To do that we had to categorize them. We did some interesting statistics, used some very interesting statistical models to do this, and at the end of the day we managed to work out that some types of treatments, such as a button colour change, resizing an element or messing around with an image, don't tend to do very much, whilst other types of treatment, the ones that affect how you perceive the value of a product, do. And so we tried to go into as much depth as we possibly could to understand which of these treatments did something and which were probably a waste of time, as we think.
OK, so, Will, there's a lot to unpack in that kind of intro there, really.
And I'm conscious that, you know, obviously, when I came to work with Qubit in the area I'm working with you in now, I was familiar with the data platforms you guys use to land data in and the general techniques that are used to do the research you're doing. But there's a lot of language in there that people wouldn't understand and wouldn't really get. So let's take a step back.
And so if you think back to probably what a lot of people who work in IT are familiar with, they're familiar with kind of e-commerce sites and they're familiar with building websites and so on.
But there's a whole industry, isn't there, that Qubit kind of came out of, but now it's in a different area.
But a whole industry of trying to get those things to be more valuable and more productive and so on for the site owners.
I mean, talk us a bit through the history of that and some of the words you've been talking about.
Yeah, of course. So a long time ago, I think it was 2001, the first experiment was done, by Google I believe, and they were looking to try and change how people interacted with the search term. Now, these kinds of experiments online are very simple: you have one version of a website and then you have an improved version of the website, or what you hope to be an improved version of the website.
You randomly send some people to one side or the other,
and you find out whether the improved version caused changes in behavior that you wanted.
If it did, you could measure that, perform a statistical test, and say,
great, this is actually better than that.
Let's continue and move on forwards with the improved version.
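As a minimal sketch of what that "perform a statistical test" step might look like, here is a simple two-proportion z-test on conversion counts; the visitor numbers are made up for illustration and this is just one common way of making the "is this actually better" call, not anything taken from the episode or the paper.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B test results: visitors randomly split between two versions
visitors    = np.array([10_000, 10_000])   # control, variant
conversions = np.array([300, 345])         # purchases observed in each group

# Two-sided z-test on the difference in conversion rate
stat, p_value = proportions_ztest(conversions, visitors)

rates = conversions / visitors
print(f"control: {rates[0]:.2%}, variant: {rates[1]:.2%}, p-value: {p_value:.3f}")
# A small p-value suggests the difference is unlikely to be pure chance,
# so you'd keep the variant and move forward with it.
```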
But at the beginning of this, obviously, it was quite hard to do anything other than simple changes of text and buttons and colours, so these tests were all very visual. And people actually did have a lot of success, I think, in the early days of the internet: by changing words and changing colours and learning how people would interact with a browser, you could learn a lot. Gradually, as people have been able to collect more data
about people, about what's going on the site,
you move away from things such as resizing elements
and changing navigation structure to,
okay, well, what are other people doing on the site right now?
How can we get the information that we're collecting
about our users and use it to drive persuasive changes
on the site to make them want to purchase more?
So we've seen this journey, really,
from the basic cosmetic changes, through to, okay, let's try and optimize for increasing things like conversion rate, which is the proportion of people who actually go on to purchase, and finally on to things like, well, how can we make more people purchase, and how can we make them spend more using the data that we collect? So we've got to that kind of stage now, I think.
Okay, so what you're describing there is what's termed A/B testing, wasn't it, at the start there?
Yes.
Where you've got one version and a test version. And those things you call tests, they're really when you run these different versions, do the stats on it, and try to understand whether the result was statistically significant and so on. And I suppose that area grew to be what you call A/B and A/B/n testing, and multivariate testing, a whole world of trying those things. But fundamentally, I guess, what you're doing is just fiddling around with things and trying to understand which of the different placements on the screen are the best. There's nothing particular about the viewer there; it's just trying different places, isn't it?
Yeah, it's exactly that.
It's about the designs.
Initially, it was a lot about how people want to change design with the assumption that changing design would have an impact on user behavior.
And by doing A-B testing, what you actually find out is
what are the kind of things that you can change
that actually cause real change in user behavior.
And for a long time they were done particularly badly, they weren't very well structured experiments, shall we say. And for a lot of people that was fine; people really enjoyed the fact that you got some data back, and any data was good. But now I think people are much keener to say, okay, we want the right data, we want the results of our experiments, our A/B tests, to be valid, and we want to be able to trust, moving forward, that those kinds of results actually mean something for the bottom line, which for a long time they didn't.
Okay, okay. So the work that you and Mike do, you're in the research department, aren't you, at Qubit? So what kind of techniques and data and things do you work with in that area? Because you really are, you know, examples of actual data scientists at work, aren't you?
Yes, I've been on the research team for about two and a half years now.
Before that, I did a PhD in maths.
I've been a data scientist for about four years now.
So we mainly work in Python.
We do a lot of SQL queries.
Our data infrastructure is all built on Google Cloud services.
We use a lot of techniques from all sorts of areas of statistics.
We particularly like Bayesian statistics at Qubit.
We also do a lot of machine learning, that kind of thing, and a lot of our day-to-day is more about building our data-driven products for Qubit rather than this kind of analysis.
Okay. And so Will, you're a PM, aren't you, for some of the products that Qubit build on this kind of research?
Yes, precisely. So I work within the product management team; I used to be a data scientist at Qubit as well,
and we try and use the machine learning techniques
we have to try and build solutions and products
that actually cause changes in behavior
because, I mean, that's what it all comes down to, I think.
You can use machine learning
to try and make the products you're building better,
try and make people more likely to spend more
and make these changes that are more persuasive.
But it's just one possible way
of making people do more things.
And some of the best techniques we've seen actually don't; they certainly don't need machine learning.
A really good example,
everyone's familiar with product recommendations,
and they're a good thing.
They definitely have an impact.
They are a positive thing.
They increase the amount of money that people tend to spend.
But the size of that effect is quite surprising when compared to something as simple as "you have four left in stock", a message that just tells you there are a few items left in stock. One requires a lot of data, a lot of innovation, algorithms, a massive pipeline. The other just requires you to know what stock's available on site.
And what's interesting about this analysis that we've done, the research, is that it really puts into stark contrast that there's a difference between how complex and sophisticated the machine learning is and the end result in terms of user behavior, and those things can be completely independent. Which is really fascinating, because it becomes all about finding out what works rather than what's the cleverest thing you can do. And at Qubit we try and do a bit of both, because we want to be able to do the cleverest things we can, but we understand that they're not always the best way of providing value and causing a persuasive change on a website.
Okay, fantastic. And Will, actually, you're the person I sat next to when I first arrived here.
Yeah, exactly.
So, fantastic. Let's get into this paper then. Just outline for us what this research paper was about and what the drivers were. I mean, you've mentioned a bit there about trying to find out what works, but give us a bit of background as to the thinking behind it. We'll get into the details in a minute, but how was it done, and what was the reason for it?
Well, I think I started doing something similar to this probably four years ago.
It was just to see, well, is anything that we're doing working?
And that was the first question.
Is A-B testing a good idea at all?
Which turned out it was, which is good.
And then it kind of gradually grew to be, okay, well, can we help our professional services team? Because here at Qubit we have a professional services team that go out to try and improve conversion rates and improve the amount people spend for each of our clients. Can we help them be better at their jobs by telling them the kinds of things that work? And we did a bit of that; I think we had some pretty good success there, because we helped focus the team on the kinds of techniques that actually drive value. Then last year we spent a lot of time doing it in a slightly more sophisticated way, and this year Mike's really taken those ideas and run with them and turned them into something very sophisticated.
Yeah, and I would say the scale of these things has really grown. I remember we did one of these maybe a couple of years ago, and there was maybe something like 60 experiments,
which we examined in some detail.
And this time around, we got that number up to something like 6,000, 7,000,
something like that.
Okay, okay.
So I guess probably someone listening might think,
well, okay, this is a piece of marketing,
or this is something that is just kind of some numbers,
which they've kind of played around with and made to look how you want it to look. But tell us how you've done it, because, you know, it's been audited by PricewaterhouseCoopers, it's been done at scale. Talk us through the methodology a little bit, again as an example of a research project done at scale like this.
Yeah, I mean, no one's really going to believe you if you, as a marketing personalization and experimentation vendor, say, yes, we do good experiments. No one's particularly going to believe that. We really wanted it to be the first really trustworthy and assured and honest and transparent case of saying, look, there's all these different ways you can do things, and we've really tried to get across what you can expect from doing these kinds of treatments. That was the idea. We want to change the industry from being about, oh, you know, you can get a 30% uplift in conversion rate, or, hey, you can get a 35% uplift in revenue by doing this one simple thing, because I don't think anyone in the industry really believes that. I think a lot of people who work in e-commerce will understand that these claims are based on maybe a one-off; they're a massive statistical outlier. So we thought it'd be interesting: instead of talking about the statistical outliers and the possibles, why don't we talk about what you could expect to get? What is the most likely outcome of you doing some of this work?
Okay, okay. That sounds good. So take us through some of the highlights of it, because obviously there's a lot in the academic paper, but what are some of the things in there that were expected, and what did you not expect, and so on?
So we kind of had a fairly good idea about what we were going to get going into it,
because we had done these analyses before,
and we had some feelings about things like, so we call it scarcity, this is where you're saying there's only three or something left in stock. Things like urgency, this is where you have a countdown timer counting down to, you know, you only have three hours left to order to get next-day delivery, or something like this. But there was also something we call social proof, where you're talking about what other people are doing. This is fairly new, and we were quite interested to see how that went.
So before we get too far into the details on that, you've got a few things there: you've got social proof, you've got scarcity and so on. What are they, how data-driven are they, and why do people think they have an effect, really?
I mean, this boils down to work done ages ago by people who are nothing to do with the e-commerce world. There was a great book by Robert Cialdini on the principles of persuasion, where he broke down lots of sales techniques into basically authority, scarcity, and he uses another kind of version of social proof, which I think he calls consensus or something like that.
And they're very well-known techniques in sales.
And there's a lot of evidence behind these.
There's theoretical evidence, and then there's data-driven evidence of this working in the real world. So it's really just applying the same techniques that people use day-to-day selling cars, shoes and washing machines to the world of e-commerce. And the shift here is from thinking that it's going to be the user interface that causes the changes in behavior, to persuasive messaging: it seems to be the same things that have made us want to buy fruit from a market stall for the last 10,000 years; the same principles still apply.
So the biggest winners, in terms of what we found... well, everything that we're doing is in terms of how much we add to the average basket spend for each visitor. So for each visitor who arrives on the site, whether or not they buy anything, we just want to know the average amount that they spend.
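As a minimal illustration of that revenue-per-visitor metric (with made-up numbers, not figures from the paper), you divide total revenue by all visitors, buyers and non-buyers alike, so an uplift in either conversion rate or basket size shows up in the same number:

```python
# Hypothetical figures for illustration only
visitors = 50_000            # everyone who arrived on the site, buyers or not
orders   = 1_500             # visitors who actually purchased
revenue  = 120_000.00        # total spend across those orders

revenue_per_visitor = revenue / visitors          # includes the non-buyers
conversion_rate     = orders / visitors
average_order_value = revenue / orders

print(f"RPV: {revenue_per_visitor:.2f}")          # 2.40 per visitor
print(f"conversion: {conversion_rate:.2%}, AOV: {average_order_value:.2f}")
```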
And so the things which we found were best were scarcity, so this is, you know, "only three left in stock"; scarcity was about a 3% uplift. We had social proof, which was about a 2% uplift, and urgency was about 1.5%. And the other major finding that we had, and we weren't particularly surprised about this, was that the simple UI changes that people make basically have no impact on average. So just changing the color of an element on the page has literally a zero percent average uplift. Which isn't to say that every single color change will have a 0% effect; just that the average is completely neutral when you change the colour of something.
And also that factors in the cost, presumably,
of actually doing the work as well and that sort of thing.
Yeah.
And I think one of the ones that people were a bit surprised about
was calls to action.
So this is where you're changing the wording on a website to be more suggestive, so rather than saying, you know, "complete your order" or something like that, you'll change the wording to "start your adventure" or something more colorful like that. And our professional services team, I think they're quite keen on these; I think they had a feeling that they would have an impact, but again they basically have no impact, they were basically neutral.
Interesting. Is that because you think the effect of that has been diluted over the years? I mean, I remember that was a thing you did, right, a few years ago. Is it that people are more used to that now, or are there probably examples where it does work?
I mean, there are examples where it does work, and I think the reason people hear about those examples and get quite excited about them is that it seems like a very easy thing to change: it's just the wording, it's a simple thing you can change. So people test them a lot on the off chance they're going to have a big effect, and that's probably what happens: enough people have tested them that really you're just messing around with very small changes in wording which don't mean much. I think the examples that tend to have worked in the past have been changing the wording from "make a reservation" to "continue", and you can see that those have a very different meaning, and so they cause a very different change in behavior. So that's the example where it definitely does work, and I think we've seen cases where those changes do something, but on average what we've seen is they don't, because people don't do things like that; they tend to focus on smaller changes which don't have the same effect.
Okay, okay. So let's take two areas that are data-driven and use things like machine learning and so on. I mean, if we look at the uplift and the benefit of things like product recommendations, because they're a classic thing, aren't they, that everyone learns when learning machine learning and big data and so on. What did you find, really, with the effect of those and the usefulness of those?
So product recommendations
were fairly interesting. I mean, it's worth pointing out that with product recommendations there are different ways you can do them. You can either put them on the product listing pages, so when someone clicks into a product you'll have a set of "you might also like these"; but another way that people use them is, once someone has reached the basket page, it's "people who liked this also liked this". So what we found with product recommendations is that they were actually fairly neutral in terms of getting more people to convert. So,
if you weren't going to buy anything, on average, product recommendations didn't really help with that.
But we did find that out of the people who did buy things, they tended to spend a little bit more.
So product recommendations managed to raise the revenue by making customers buy more.
And the effect was not huge.
The effect was about half a percent,
but it was one of the few treatments
that did have a reliably positive effect.
Okay, okay.
And what about, I mean, obviously working at Qubit,
there's other players in the industry as well
that are using data from visitors' actual activity
and preferences and so on.
What was the finding for that kind of work, personalization and so on? Was there much uplift in that?
So we did have a look at that. We have an idea that you can segment experiences, or segment tests, or not segment them, based on customer activity and visitor behavior. And from the analysis we did, it's much more subtle there, because you have different gradations of segmentation. You could say that maybe a mobile user versus a desktop user is a very different user, but is that really the same as visitor preference? Because you might think something more like, well, have they bought t-shirts before, is a much better indicator. So there's definitely a scale of how personalized you might think these experiences are.
But from a crude split of either they are segmented or they're not,
we found that on average, the expected impact
of the segmented version was three times higher.
So it went from 0.3% in terms of uplift to 0.9%,
which is interesting.
It may well be indicative of what we see in the future.
It may not be. It's a good step on the journey towards personalization, which I think is what a lot of people are trying to get towards. But it does kind of show, I think, that if we delved into it, there are going to be good versions of segments, so segments that are useful and actually are differentiated from the rest of the population on the site, and kinds of personalization that aren't.
The example we tend to give is,
you could change the color of the button for a user based on their first name.
That would be very personalized, very detailed,
based on really interesting user behavior,
but that's very unlikely to have any impact in terms of how much they spend.
So there's going to be good versions and bad versions of this.
And so putting them all under the same kind of umbrella
of how do you use visitor preferences,
it can basically be done well or it can be done badly,
like everything else.
And what techniques did you use really?
I mean, the actual, I suppose, kind of method of doing this,
how did that work out really?
So, the first thing I'll say is that it was an awful lot of work. It was very, very difficult. We basically had to sit down with them, go through in excruciating detail every single step that we were going to do, and then they went away and came up with what tests, basically, they could perform on the data that we gave them
so that they could satisfy themselves
that we had sort of done the methodology
as we said that we had done it.
And there were various ways of doing this. So, for example, we use a fairly sophisticated statistical model to boil down all of the different tests into one score.
That's quite a good explanation of what I can see in the paper, which is a Bayesian hierarchical model, I mean?
I think sort of, yeah, exactly.
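For readers curious what a Bayesian hierarchical model of experiment uplifts can look like in code, here is a minimal, illustrative PyMC sketch with made-up numbers. It is a standard partial-pooling setup, not the actual model from the Qubit paper; the data, priors and variable names are assumptions for the example.

```python
import numpy as np
import pymc as pm

# Made-up uplift estimates (revenue-per-visitor uplift, in %) and standard errors
# for a handful of experiments that all belong to the same treatment category.
uplift_hat = np.array([1.2, -0.4, 3.1, 0.2, 1.8])   # observed uplift per experiment
std_err    = np.array([1.0, 0.8, 1.5, 0.6, 1.2])     # per-experiment uncertainty

with pm.Model() as model:
    # Category-level parameters: the "typical" uplift and how much experiments vary around it
    mu  = pm.Normal("mu", mu=0.0, sigma=5.0)
    tau = pm.HalfNormal("tau", sigma=5.0)

    # Each experiment has its own true uplift, partially pooled towards the category mean
    theta = pm.Normal("theta", mu=mu, sigma=tau, shape=len(uplift_hat))

    # The observed estimates are noisy measurements of those true uplifts
    pm.Normal("obs", mu=theta, sigma=std_err, observed=uplift_hat)

    trace = pm.sample(2000, tune=1000, target_accept=0.9, random_seed=42)

# The posterior for mu is the pooled, shrinkage-adjusted view of what this category
# of treatment typically does, rather than any single outlier result.
print(trace.posterior["mu"].mean().item())
```

The appeal of this kind of structure is that one lucky or unlucky experiment gets shrunk towards the category average, which is one way to "boil down all of the different tests into one score" per treatment type.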
I mean, I guess the point is that to do this kind of work there's lots of testing, and testing of the testing, going on as well. And actually what's interesting is that the uplifts are fairly, well, they're quite small as well, I guess.
Yeah. I mean, we were saying before about how, you know, in the industry the numbers that people quote are always so ridiculously large, like 30 percent. And when we released this paper, actually, I think it went the other way: people were really surprised about how low the numbers were, a little bit incredulous even that the numbers could possibly be that low. So, yeah, I guess you can't really win with these things sometimes.
Yeah, but there is a noticeable uplift, isn't there, when this is done properly. And I noticed in some of the material that has gone around this kind of report, you've talked about six percent. I mean,
what's that six percent, what does that mean really?
So that six percent was in some of the materials. We basically just thought, well, we are seeing these kinds of 0.2%, 0.4%, 2% uplifts in revenue per visitor, but what about the cumulative impact of all of these? So for a single client of ours, or anyone really, who is running an optimization campaign, how can we understand what the total effect is of running lots of these different experiments over time? And we found that people who use the kinds of techniques that we found to work, unsurprisingly, would have multiple versions of each of these different types of techniques on their site running at any one time, still running them as experiments, and the cumulative effect of all those experiments at the same time led to a proportional uplift in on-site total revenue of three, four, five, six, eight percent in some cases. And that's interesting, because the size of a two percent uplift on a subsection of your site is not a big impact; it's almost not worth doing if you're just going to do that one single thing. But when you start combining all of these techniques together, you do end up getting a revenue uplift that seems worth it.
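As a rough back-of-the-envelope illustration of that cumulative effect (assuming, purely for the sake of the example, that the individual uplifts stack multiplicatively and don't overlap, which real experiments won't guarantee), combining a few of the average effect sizes quoted above lands in that same single-digit range:

```python
# Hypothetical stacking of the average uplifts discussed above, assuming
# independent, multiplicative effects: a simplification, not a guarantee.
uplifts = {
    "scarcity":        0.030,   # ~3% revenue-per-visitor uplift
    "social proof":    0.020,   # ~2%
    "urgency":         0.015,   # ~1.5%
    "recommendations": 0.005,   # ~0.5%
}

combined = 1.0
for name, uplift in uplifts.items():
    combined *= (1.0 + uplift)

print(f"combined uplift: {combined - 1.0:.1%}")   # roughly 7.2%
```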
And of course, two percent of a large amount, I mean two percent of, I don't know, Amazon's numbers, is a big amount, isn't it, really?
Yeah, and for those people this is where scale is all-important, because for someone like Amazon, probably changing the color of a button, if it had a 0.01% effect, would be worth doing. But for most e-commerce vendors, if it's not above a one percent impact, then it's probably not.
Okay, okay. So, in terms of what the paper tells you and the impact on the industry, you know, I'm very conscious that there was the Amazon news the other week with Whole Foods and so on. What kind of messages and lessons and nuggets of information are there in this for, say, an e-commerce business? What's the implication of this, and the message, really?
Well, if I was going to go out on a campaign, if I wanted quick wins, I would use the techniques we've found to be things that have an impact. There are certainly some other things that you definitely should be testing anyway; there's a lot of hygiene testing. So a lot of the cosmetic and UX changes we've talked about, they don't have an impact on average, but if you're doing a big site redesign it's still worth running them as a test, because these things had a fairly wide spread of outcomes, high or low, so there was a lot of variance associated with these kinds of experiments, which means that if you didn't test it, you run the risk of having a negative impact without realizing that you've actually had a negative impact on your site.
So there's some hygiene and comfort reasons
for doing these kinds of experiments as well.
But once you've got past the initial stage of these quick wins, as it were,
I mean, there's a whole lot of other experiments
that didn't fit neatly into categories, which are more specific to each individual site. And as you learn more about your users and collect more data about them, then I think you can start being slightly more sophisticated about what you're trying to do. You can still lean on these techniques, but there will be specific things, based on your most loyal and most profitable users, that matter to you. And I think once you've got past that initial stage of, yes, we've got these initial uplifts, you've got to start focusing on the data you have about your users to try and come up with those differentiated, really important personalization techniques, which are going to be: we found these differentiated user groups and we need to show them different things to get the most out of them.
Okay, okay. And that, just as an aside, is what Qubit does, isn't it, really? So obviously the product investment and so on is in that area, but it's a general piece of advice, is that correct?
Yeah, I mean, I think it makes sense: there are going to be things that work across the board, and then when you get more data you can personalize; otherwise you're not going to get the returns.
Sure. Okay. And Mike, any thoughts or feedback you've had, in terms of doing this piece of work and being the lead data scientist on this? Any thoughts or advice or comments, really, for the analysts and the big data people listening in here?
I think it's interesting to note that there's always so much uncertainty in e-commerce.
You can measure things to a degree of accuracy where you can say,
we're 95% sure that adding this scarcity message has a positive impact.
But the problem with e-commerce is that not many people have enough data to say, you know, we managed to raise revenue by somewhere between 3.5% and 4%; most people just don't have enough data for that. So I'd say, just always be thinking about the uncertainties of your measurements, and, you know, work really hard to try and remove that feeling that you have some certainty there.
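To make that concrete, here is a small, illustrative sketch (with simulated traffic numbers, not data from the paper) of a bootstrap confidence interval on a revenue-per-visitor uplift; with modest traffic the interval is wide enough that a headline point estimate on its own is misleading.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated per-visitor spend for a control and a variant (most visitors spend nothing)
def simulate_visitors(n, conversion_rate, mean_order_value):
    buys = rng.random(n) < conversion_rate
    return np.where(buys, rng.exponential(mean_order_value, size=n), 0.0)

control = simulate_visitors(20_000, 0.030, 80.0)
variant = simulate_visitors(20_000, 0.031, 81.0)   # a genuinely small improvement

# Bootstrap the relative uplift in revenue per visitor
uplifts = []
for _ in range(2_000):
    c = rng.choice(control, size=control.size, replace=True).mean()
    v = rng.choice(variant, size=variant.size, replace=True).mean()
    uplifts.append(v / c - 1.0)

low, high = np.percentile(uplifts, [2.5, 97.5])
print(f"estimated uplift: {np.mean(uplifts):.1%}  (95% interval: {low:.1%} to {high:.1%})")
# With this much traffic the interval is wide and will usually straddle zero,
# which is exactly the kind of uncertainty being described above.
```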
So, Will, where would somebody get hold of this paper if they're interested in it?
So, you can either search for it on Hacker News, which would probably be the way you'd find the academic paper. You could Google "what works in e-commerce", which I think it will come up with. Or the easiest and most sensible way is to visit the Qubit website, so that'll be qubit.com, and check out the research area there. And there are two versions of the paper: there's a marketing-friendly version, which is good and tells the story in much more of a narrative way.
And then there's the academic paper,
which has more detailed information about each of the tables,
each of the treatments,
and the kind of effects you can expect to see.
Of course, me and Mike would always recommend
you read the academic version,
but others might want to read the marketing.
Excellent.
Well, look, thanks, Mike.
Thanks, Will, for this.
I mean, it's been excellent speaking to you.
And yeah, really interesting to see a kind of large-scale data science project at work, and a bit of an insight into the e-commerce world, which is, you know, a big user of data and very data-driven as well. So it's been great to speak to you both.
Thanks for having us.