The Data Stack Show - 79: All About Experimentation with Che Sharma of Eppo
Episode Date: March 16, 2022

Highlights from this week's conversation include:

- Che's background and career journey (4:23)
- Coherence between hemispheres in the human brain (6:58)
- Raising Airbnb above primitive A/B testing technology (8:54)
- Economic thinking in Airbnb's data science practice (14:24)
- Dealing with multiple pipelines (16:48)
- Eppo's role in recognizing statistically significant data (20:01)
- Defining "experiment" (23:25)
- Types of experiments (25:57)
- The workflow journey (27:18)
- Dealing with metric silos (34:21)
- Why we still need to innovate today (37:03)
- Where experimentation can be used (39:36)
- How big a sample size should be (43:29)
- How to self-educate to get the maximum value (45:39)
- Bridging the gap between data engineers and data scientists (48:14)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.
Transcript
Discussion (0)
Welcome to the Data Stack Show.
Each week we explore the world of data by talking to the people shaping its future.
You'll learn about new data technology and trends and how data teams and processes are run at top companies.
The Data Stack Show is brought to you by Rudderstack, one platform for all your customer data pipelines.
Learn more at rudderstack.com.
And don't forget, we're hiring for all sorts of roles. You have the chance to meet Costas and I
live in person coming up soon in Austin, Texas. We're both going to be at Data Council Austin.
The event is the 23rd and 24th of March, but both of our companies are hosting a happy hour on the 22nd, the night
before the event.
So you can come out and have a drink with Kostas and I.
Kostas, why should our visitors join us live in Austin?
For tequila, of course.
That could make things very interesting.
I mean, yeah, it's a happy hour.
People should come.
It's before the main event.
So without getting tired from the event or anything, come over there, meet in person,
something that we all miss because of all this mess with COVID.
Have some fun, talk about what we are doing and yeah, relax and have fun.
It's going to be a great time.
Learn more at datastackshow.com.
There'll be a banner there you can click on
to register for the happy hour
and we will see you in Austin in March.
Welcome to the Data Stack Show.
Today, we are going to talk with Che from Eppo,
which is an experimentation platform
that actually runs on your warehouse, which is super interesting.
Can't wait to hear about that. But he has done some unbelievable things.
Most notably, he really built out a lot of the experimentation framework and technology at Airbnb pretty early on.
So he was there for four or five years. Costas, I'm really interested to know,
I want to ask him when he started at Airbnb, you know, the experimentation frameworks or
experimentation products, as we know them now, were nowhere near the level of sophistication,
right? So they probably had to build a bunch of stuff inside of Airbnb. And I think one of the side consequences of that
is, when you think about testing, that you're sort of creating an additional data source that
ultimately needs to be layered in with, you know, say product data or marketing data or whatever.
And that's a pretty complex technical challenge. So I'm going to ask him how he
sorted through that at Airbnb. How about you?
Yeah, I think I'd like to get a little bit more into the fundamentals of experimentation,
mainly because I think many people, you know, especially in product management,
take experimentation very lightly. They think it's just, like, a tool, a kind of oracle that just tells you what you should be doing, right? Which obviously is not the case.
And especially if you are in a company that doesn't have anyone with background in statistics
or a data scientist or even an analyst, I think it can be very hard for someone to interpret correctly and
make sure that the experiments they are running are being done right.
So I'd love to learn more about that.
What is an experiment?
How should we run them?
What's the theory behind them?
Why should we trust them? You know,
like all that stuff. And yeah, also see like if we can get some resources from Che about how we
can learn more. I think for product managers, especially it's important to get educated
before we start using these tools. I agree. Well, let's jump in and see what he has to say.
Che, welcome to the Data Stack Show. We're so excited to chat with you.
Yeah, I'm excited to talk with all three of you.
Okay. Well, so much to cover today. You've done some amazing things at companies like Airbnb and Webflow, but just give us your brief background and what you're doing today. Yeah, absolutely. So my name is Che. I'm the founder and CEO of Eppo,
which is a next-gen A/B experimentation platform, one that's specifically focused on the analytics
and reporting side of the workflow. A lot of it comes from my experience building these systems
at, like you said, Airbnb, Webflow, a bunch of other places. My background is in data science.
I was the fourth data scientist at Airbnb. I stayed there for about
five years, from 2012 to 2017, which has really guided a lot of the way I think around the data ecosystem,
and seeing how it played out at Webflow sort of validated a lot of those
things. So some concrete things to know about Airbnb, you know, it's founded by designers.
And so that kind of leads into the way that the company likes to do
work. It comes much more from a Steve Jobs, like incubate something, focus on UX, and then release
it with grand announcements more than a sort of iterative data-driven measurement metrics approach
like a Zuckerberg or a Bezos type of thing. And what that really means as a data team is that in addition to needing to build all of these capabilities, you also had to win over the
org, you know, into a certain way of thinking, to believing metrics matter. So in addition to
building infrastructure, we had to essentially win a culture war. And it was really interesting to
see how that all played out.
To me, the biggest piece of solving that problem was experimentation because
experimentation as a function just unlocked so much in addition to this concrete
ROI of measuring things,
it also fundamentally changes the way you do work where suddenly people have this
really intimate connection with metrics,
where you understand exactly what you drove, what you didn't drive, how your work plays into it.
And it also unlocks this culture of entrepreneurialism, where people can take risks,
try stuff out, validate ideas without winning political battles. So this combination of
just incredible business ROI, plus changing your corporate culture to one that's kind of
a little bit more fun was really
what led me to start Eppo. Very cool, super interesting. Okay, and I have to ask one thing, and I apologize
for the surprise here because we didn't talk about this in show prep, but you did some research on
the human brain. Uh-huh. And I read that it was on coherence between hemispheres. And because we are all like super nerds on this show,
can you just tell us a brief bit about that?
Like what were you studying between hemispheres? That's so interesting.
Oh yeah. Yeah. It was fun. So, you know, as a bit of background,
at university,
I studied electrical engineering with a focus on signal processing.
And then I studied statistics. And
I really thought it was a cool way of understanding information and being able to make statements
about it. I came across this researcher in the Stanford psychology department, who was trying to
see if there was a different way of understanding the brain where instead of just looking at some
MRI and seeing what lights up and just seeing where the blood flows, if instead you said maybe the way the brain works is not to increase
blood flow, it's from synchronizing things. So while there are, like, two parts that are just kind of
going in different directions or whatever, when they are focused, they just lock, and suddenly their
communication makes way. And it was this kind of, I didn't know anything about neuro, right?
I was an electrical engineering student.
So I, like, one of the great things about being a statistician is you get to play in everyone's backyard and understand their fields.
And so this is my way of, like, learning a little bit about brain research.
So it was really tough:
statistical methods to say, how do you make a hypothesis test around synchronization of neurons? But yeah, it was very
cool. Like, you know, we, I was only working on it for about six months, so I can't quite tell how
the research evolved over time, but it was cool to learn about the field. Fascinating. Well, I'll
have to dig into that more. Okay. Let's, let's dig into data stuff. So we have tons of questions
about the technical side, but
one thing I want to ask to start out. So let's go back to Airbnb. And when you joined, I'm just
trying to go back in my mental history here, dusting off some pages. But back then, the
A/B testing technology was pretty primitive compared to what it is today.
So how did you begin to think about that?
Like, did you evaluate vendors or like how did you just start out building it yourself?
Yeah, at Airbnb, you know, we always had a bias to build over buy.
I think you can kind of see the number of open source projects out of there. You know, one of my colleagues at Eppo used to work at Snowflake and, you know, Snowflake
has forever been like, why are you like spending so much time and energy and stuff on rolling
your own infrastructure at this stage?
So in any case, Airbnb has always had its own biases.
We kind of knew from the start, we're going to build our own experimentation framework.
One engineer built a feature flagging thing, like, fairly quickly. That wasn't too much time. But then
this one data scientist and an engineer decided they wanted to build
the end-to-end workflow and
UI. So at Airbnb, the first team to run experiments
was the search team. This was back in, I think, late
2012, 2013.
They were, and this is pretty typical.
Most companies, I think the first team to run an experiment is either a machine learning team
or a growth team.
For me, it was a machine learning team.
Every time you iterate on a model,
you want to see, did it drive revenue?
You know, not just did you drive like a click,
but did you drive revenue?
And crucially, if you're a machine learning team, you need that evidence if you're going to like hire four more engineers,
like, you know, to show what you did. So that is kind of the earliest place of investment.
Once that team started showing more success, then other teams started adopting it, the growth team,
the rest of the marketplaces team. And we started quickly seeing how the teams
that adopted experimentation, like wholesale, like really deeply started showing clear success.
One of the really formative experiences for me was this search team basically re-inflected
Airbnb's growth. Most companies, they start on this crazy rocket ship thing. Airbnb was going 3x,
3x, 3x, and then it was like
2.7x, 2.5x, whatever. This team
was able to re-accelerate.
Interesting. They broke the logarithmic
plane.
Exactly. It was clear it was
this team because they ran experiments and proved it.
Interesting. It's a really
amazing coalition
building moment.
I always say, if you're going to spread
experimentation culture, start with some kind of teams that are going to adopt it really readily,
and don't try to push it on everyone else until you've shown success. Interesting. Are there any,
like, is there anything, do you remember any, or like one example of a test that that team ran
that was kind of like a, oh yeah, you know, this is, you know, huge? Absolutely, there were a bunch of them. Okay, let me talk through a
few. One of them, which I think was just a great example of how you draw that, right, how you imagine
this all playing out. So this data scientist, he looked at the data and saw there were all
these Airbnb hosts who basically never accepted anyone.
You know, this was back in the day before Instant Book was a really large percent of Airbnb traffic.
Sure.
And instead, you had to request to every host, am I allowed to stay on this date?
I'm bringing my dog with me.
I'm showing up late or whatever.
Is that okay?
And people would say yes or no.
And there were some hosts who just literally never said yes.
And so this person noticed that these hosts were essentially
dragging down the marketplace because there's all these people
who would spend all this effort vetting an Airbnb listing,
message these hosts, get a no, and be like,
oh well. So we lost a lot of those folks.
And so from there, he ran an experiment that took the host denial rate
and steadily demoted people,
and eventually took them off the system
if they were too strong of naysayers.
And that was like a huge success.
It moved metrics.
And that was one of the earliest examples.
Like this was back in 2013 or something
before a lot of other experiments had come out.
And so that was one of those early examples of like,
oh, I think we have something here. Because this sort of strategic, analysis-bent work, it can be hard to win over
every level of hierarchy to get it done. But if you can run this experiment and show it's powerful,
that'll get reinvestment. So there are examples like that. But then every company has these
examples, these little changes that seem so cosmetic.
There's like no way they could matter that much.
And then they just blow the metrics out.
In the case of Airbnb, this engineer ran this experiment where when you click on an Airbnb listing, it opens it in a new tab.
So that was the entire change.
When you click on the Airbnb listing, it opens in a new tab.
And that thing was the largest metric improvement of any experiment over five years.
Because it turns out, it's like very obvious about it, right?
Like, yeah, you do that.
You don't lose your search context.
Sure.
No.
Yeah.
Because you want to click on it.
Like, of course, we all do that, right?
It's like, I want to do that, right?
But man, I think it boosted bookings by like
two or three percent. It was like a very, very big win, and just one little change. And it's exactly the
sort of thing, absent experimentation, like, there's no way people would have noticed that that was a
big deal. The design team probably would have been like, ah, it looks kind of ugly, I don't know, like, you know,
hesitation. Yeah, just every company has these stories, which I always think is fascinating
because it's not just random chance. There's important lessons here. Like not losing your
search context matters a lot in Airbnb system. Yeah, totally. One quick question. I know
two more questions. I'm monopolizing here and I know Costas has a bunch of questions.
First one, it's interesting to think about the economics, right?
So when you talked about, you know, hosts who never replied, like that's almost like,
you know, calling a store and saying, do you have this?
And then you go to the store and it's not there, right?
And so over time, it's like artificial supply, which is an economic problem.
Did you have to apply a lot of economic thinking in the data science practice at Airbnb? 100%. The person who did that analysis
wasn't a PhD economist. Oh, okay.
100%. So I actually think that like, you know, data science, there's a lot of skills that go
into it. There's like a straight engineering piece of just how do you make reliable, robust
data systems. But when you talk about the, the, the ultimate goal of data science,
and this is something I always try to like kind of confirm and validate for
people is that the whole reason you start a data team is not to like have a
data warehouse or the modern data stack or whatever.
The whole point is to make better decisions.
So you need to understand what data, what analyses, what can I do
that's going to lead to better decisions? And economists, they have a lot of background in that sort
of thing. It makes a lot of sense. Yeah, absolutely. Okay. This is going to be kind
of a detailed question, but hopefully it sets the stage for Kostas to ask a bunch of technical
questions as well. But one thing I'm interested in, and I'm
just thinking about our listeners who maybe are dealing with like a vendor A-B testing tool,
or maybe have like built something themselves, or even just trying to think about how to process
this. So you said, you know, someone built a simple feature flagging mechanism at Airbnb.
So one of my questions is, and this is sort of a problem
that every company faces, or at least, you know, my purview, which is limited. So maybe not every
company, but okay. So you have feature flagging in the context of like testing and data science,
but then you have this problem of, you kind of want that feature flag to be available in multiple
places, but generally you're also
running, like, a separate product analytics, you know, sort of infrastructure and set of pipelines.
You have growth, you have customer success, you know, there are all those components.
How did you deal with that from a technical standpoint? Right. Because, you know, you hear
about building your own feature flagging thing and it's like, does that actually make it harder to
deal with all these other pipelines as well? It's a great call out. So it touches on
what I would call the modern experimentation stack, which is that to run experiments, you basically
have these pieces, right? You have one piece, which is feature flagging or randomization. So that's the
start of the technical stack, which is, users arrive, they've got to be put in groups. And that's actually it. That's where it ends.
And so you'll see tools like your Optimizely or LaunchDarkly, which pretend that data warehouses don't exist.
They just let you do feature flagging and then, okay, let the data scientists sort it out.
And that's kind of the gap we're trying to fill. So our product today actually does not have feature flagging at all, although we'll probably be building it pretty soon. Instead, what we rely on is this basic separation
of where feature flagging meets analytics,
which is the data warehouse.
So all of these feature flagging tools,
even if they don't directly give you data,
it's very easy to build your own instrumentation
and get that data into your Snowflake
or BigQuery or Redshift or whatever.
And as tools like RudderStack show,
there's this amazing new ability
to get everything into the warehouse.
And so it's a nice central point to operate off of
for applications like ourselves.
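As a concrete illustration of the "build your own instrumentation" point, the assignment data this kind of analysis needs downstream is tiny: who was exposed to which variant of which experiment, and when. The event shape below is hypothetical, a minimal sketch rather than any tool's actual schema, and the pipeline call is only shown as a comment:

```python
import json
from datetime import datetime, timezone

def assignment_event(user_id: str, experiment: str, variant: str) -> dict:
    """The minimal record experiment analysis needs once it lands in the warehouse."""
    return {
        "event": "experiment_assignment",
        "user_id": user_id,
        "experiment": experiment,
        "variant": variant,
        "assigned_at": datetime.now(timezone.utc).isoformat(),
    }

event = assignment_event("user_42", "open_listing_in_new_tab", "treatment")
print(json.dumps(event))
# Ship it through whatever already lands events in Snowflake/BigQuery/Redshift,
# e.g. your event pipeline's track() call (hypothetical, not a specific API).
```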
So, you know, in the modern experimentation stack,
you've got feature flagging, you have metrics,
which inevitably are operating out of the warehouse.
You have a bunch of pipelines
to kind of intermingle these things,
calculate your quantities, run your diagnostics,
do your investigations and then reporting,
which is this very public kind of cross-functionally consumed interface,
which is answering, how are my experiments doing? Yeah, totally.
All right.
I'm going to be so rude to Costas and actually just continue to ask you questions because I can't stop. I can't help myself.
Yes. Okay. So, yeah. So this is so interesting to me. And I'm coming at this, just so you're aware, as, you know, someone who's worked in marketing and done a lot of data stuff and used a lot of A/B testing tools. So it seems like the
package solutions for AB testing, like their value comes a lot from basically sort of handling the
statistical analysis as a service, right? Like, they suck all the data in. Okay, whatever, you do your test,
it says, okay, this is like variation one. And then it tells you like, okay, this is going to
improve your conversion rate or not. Right. But the challenge has been that they keep all the
data trapped inside of their like particular system, which I think inherently limits the
value of that data, because ultimately you want to
actually see that data in the context of all the other data you have. Is that kind of the idea
behind EPO is that you're, you know, you're sort of not, you know, creating obfuscation around the
data itself and just providing like a, yes, this is statistically significant or not.
There's a few pieces that go into what makes Eppo Eppo. I think specifically with regards
to the data, where it lives, what it comes from, sort of thing, I think our standpoint, which is what you
will see as a principle at Airbnb or Netflix or Google or whatever, is that there should be a single
definition of, like, what is revenue, right? The data teams are singularly focused on defining
that thing. What is revenue? What is a purchase? What is this subscription upgrade or whatever?
And the natural home of those things is your data warehouse. So the real, you know, there's two
points of failure with most of the existing systems. One is that they create their own
parallel data warehouse. So suddenly they got their own idea of what revenue is, right?
And it's hard to really sync it up with your own.
And in addition to revenue itself, you want to split revenue by a bunch of other things,
right?
By marketing channel or by persona or whatever.
So that's one thing is that having an incomplete and parallel version of what data is, drives
data teams insane.
It's like I spent so much time trying to define this thing.
And, you know, here's the system over there telling a PM that they increase revenue when the revenue does not even include this other source.
Like that is inaccessible by the system.
So that's one piece of it.
And, you know, that probably gets exacerbated by different business models.
If you have multiple points of sale, then like trying to instrument each one separately doesn't really make sense.
You have to centralize it in a data warehouse.
If you use Stripe, Stripe is not a set of data that is accessible by those systems.
A bunch of things like that.
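To make the single-definition idea concrete, one purely illustrative pattern (not Eppo's actual configuration format, and the table and column names are invented) is a small registry where each metric is defined once, as SQL against the warehouse, and every downstream consumer reads that one definition:

```python
# Hypothetical metric registry: dashboards, forecasts, and the experimentation
# tool all read "revenue" from here instead of re-deriving their own version.
METRICS = {
    "revenue": {
        "sql": """
            SELECT user_id, order_date AS event_date, amount_usd AS value
            FROM analytics.orders          -- invented table name
            WHERE status = 'completed'
        """,
        "owner": "data-team",
        "dimensions": ["marketing_channel", "persona"],  # ways you want to slice it
    },
}

def metric_sql(name: str) -> str:
    """Single source of truth: every tool asks the registry for the definition."""
    return METRICS[name]["sql"]

print(metric_sql("revenue"))
```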
That's definitely a core piece of it. But the other piece is almost like a design principle around, organizationally, how should experiments be done.
Because one of the things that I run into all the time is you see a lot of companies that will,
they'll match their organization to their tools instead of matching their tools to their
organization. And it's really unfortunate, because it means it puts a lot of stress on high-
expertise statistician types, economists, and the like, to actually be able to run experiments in the way that you're supposed to, to follow
statistical protocols, to have good procedures.
Whereas a tool should just enable those sort of things.
So the way we operate is that a lot of companies might have one, two, three, what I've called
experiment specialists who have opinions around what metrics matter. Here's how
they're defined. Here's what statistical regime we're going to use. And we want EPPO to let them
scale out that knowledge where they can do a one-time definition, say here are the rules of
engagement. And then going forward, some junior PM fresh out of college, never done this before,
can just operate within the system, turn the crank, and be like, look, just by using
the system, I am following all the guidelines that I'm supposed to. I have to say, I mean, I know that
I'm probably going to disappoint Eric a little bit because he's expecting me to ask something
very technical, probably, but I want to start with something very basic, and I want to ask you,
what is an experiment? Because we keep talking about, you know, experimentation platforms and all these things.
But let's start from the basics.
Like what is an experiment?
What defines that?
I love it because it's one of those things where the basic questions are the most technical, actually.
I'm going to give you my simple answer and then I'm going to give you my galaxy brain answer. So the simple answer is an experiment.
It's a methodology that you probably learned about in grade school, where if you have a theory of what is driving change,
you take a group of people or a group of something, you flip coins a bunch of times, flip them into multiple groups,
and then you measure who did
better. You take one of those groups, like, you know, I think that making people do a
morning walk every day will lead to lower diabetes, whatever. You have one group, tell them to do
a walk every day, and then you measure how much diabetes you got. So that's the basic
methodology. You know, irrespective of
what type of A/B experiment, it's basically that you need to have some random way of dividing
people into groups. You need some way to measure success, which are metrics. And then you can try
out different ideas for what drives success. Now, my galaxy brain approach to this, you know, is that an experiment is anything that has a kind of before-and-after comparison group that says, like, did this group do better than that group?
And what's interesting in the world today is that if you look at Airbnb or Netflix, there are a bunch of products that you ship that don't lend themselves well to A/B experiments that let you
kind of divide the world up. Like, think of a pricing experiment: are you going to give half
the people one price and half another price for a very kind of well-known product? Or if you're
Netflix and you launch Stranger Things, like, you know, that's actually the most important decision
Netflix will make in a year: you know, did we get an ROI on Stranger Things? And so there's a
kind of rich suite of causal inference methods that try to figure out, like, you know,
did metrics move once you control for a bunch of other factors? And the galaxy brain answer is that,
well, that's also kind of decision science, which fits right under it.
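To make the simple answer above concrete, here is a minimal sketch of that methodology in Python: randomly assign units to two groups, expose one to the change, and check whether the difference in a metric is larger than chance would explain. The numbers are simulated purely for illustration; in practice the outcomes would come from your warehouse.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_users = 10_000

# Step 1: "flip a coin" for every user -- 0 = control, 1 = treatment.
assignment = rng.integers(0, 2, size=n_users)

# Step 2: observe a success metric per user (simulated here with a small
# lift for the treatment group; in reality this is a warehouse query).
base_rate, lift = 0.10, 0.01
converted = (rng.random(n_users) < np.where(assignment == 1, base_rate + lift, base_rate)).astype(float)

control, treatment = converted[assignment == 0], converted[assignment == 1]

# Step 3: did the treatment group do better than chance would explain?
result = stats.ttest_ind(treatment, control, equal_var=False)
print(f"control rate:   {control.mean():.3f}")
print(f"treatment rate: {treatment.mean():.3f}")
print(f"p-value:        {result.pvalue:.4f}")
```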
That's super interesting.
And I mean, for as long as I can remember hearing, like, the term experiments, at least,
we are always talking about, like, A/B testing, right?
Which as you said, as you described right now,
like it's about splitting like your population into two
and run the experiment there.
Is this the only way that we can do experiments?
So it gets back to the simple word experiment and the galaxy brain version.
To do a basic kind of A-B test, you do need some random or uncorrelated with the metric way of dividing people into groups.
So the nice thing about these online platforms is that dividing people into groups randomly is actually a well-solved problem.
It's actually probably the easiest part of the workflow.
So if you have the ability to randomly split people into groups, it's kind of the best way to do it.
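As an aside on why random assignment is considered a solved problem online: one common pattern (a sketch here, not necessarily how Eppo or any particular feature-flagging tool implements it) is to hash the user ID together with the experiment name, so the same user always gets the same variant without storing any state, and different experiments split users independently.

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple = ("control", "treatment")) -> str:
    """Deterministic bucketing: same user + experiment always yields the same variant."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000  # 0..999
    return variants[bucket * len(variants) // 1000]  # equal split across variants

# The split is stable per experiment and independent across experiments.
print(assign_variant("user_42", "open_listing_in_new_tab"))
print(assign_variant("user_42", "host_denial_demotion"))
```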
Now, there's a kind of depth to this topic.
Like what happens when there are some users in the group
who have a way outlier disproportionate effect on everything?
You know, you can try to randomize them,
but they are just going to overpower everything.
How do you deal with that?
There's a set of methods called stratified sampling,
variance reduction techniques.
There's a bunch of ways to do it, but there are ways in which the random
sampling thing can break.
And again, it falls back on tools like Eppo to try to make you aware
of them.
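For readers curious about the variance-reduction techniques Che mentions: one widely published approach is CUPED, which came out of Microsoft's experimentation group. It adjusts each unit's metric using pre-experiment data so persistent outliers add less noise. A rough sketch, not a description of what Eppo ships:

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Variance-reduced metric: y adjusted by the pre-experiment covariate x."""
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustration: pre-period spend (x) predicts in-experiment spend (y),
# so removing the predictable part shrinks the variance a lot.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=5_000)
y = 0.8 * x + rng.normal(scale=5.0, size=5_000)
y_adj = cuped_adjust(y, x)
print(f"variance before: {y.var():.1f}   after CUPED: {y_adj.var():.1f}")
```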
Okay.
And you mentioned that this is maybe, like, the easiest part of the
workflow.
So what is this workflow?
So take us on a journey of working with the product itself.
How do we start, and what do we have to do until we get a result?
Yeah, absolutely.
So let me walk you through what I would call the experiment lifecycle.
And then from there, I'm happy to dive into how Eppo touches on all of it. The start of an experiment is to have a basic alignment. I'm
like, what are we trying to do here? Like, are we trying to increase purchases? Are we trying to
reduce customer service tickets? You know, what is our overall goal? Just to have some idea of
like, this is where our goal is to be. And, you know, the corollary to that is that you need a metric for it, right? You need, in some ways, like, what is a customer service ticket,
what is a purchase. From there, the second stage is you need to come up with hypotheses of, how am I
going to drive that metric? You know, is it that we want to reduce complexity and reduce friction?
do you want to increase urgency, increase desire?
Are there social proof things? You just come up with a big list of stuff, right? Of saying like,
here's all my ideas of how I think we can improve things. And from there, there's a kind of basic product approach, you know, what is expected impacts for each one, what is expected complexity,
et cetera. So you come up with hypotheses and you have some way of deciding which ones you want to do.
From there, you have to design
an experiment, and designing an experiment
is both the product side
of UI,
UX type of thing, but also
there's a statistical piece, which
is called a power analysis.
So basically, to actually
get signal out of this change,
how many people do you need,
how long do you need to wait to actually get it? So, you know, you need to have enough sample size,
you need to be able to get that sort of signal. So that's what I'll call part of experiment
design. From there, you have to implement it, and implementing the experiment is where you touch on
the feature flagging side. It's also where there's straight product building, like, you hopefully implement it without a bug. You know, hopefully it's a
clean experiment. You didn't break it on iOS or something, or push some important,
you know, design asset below the fold. So there's implementation details there.
From there, the experiment runs for a period of time and you want to observe it and make sure
it's healthy. Now, this is one of those tricky things where, you know, experimentation has this central
issue where, from a statistical standpoint, you shouldn't peek too much.
You shouldn't stop an experiment early.
You shouldn't really examine it too closely until it's done.
But that's just not a reality for most organizations.
You know, the real political capital is being spent on this thing.
You can't afford to let an unhealthy or unproductive experiment take up weeks of product time.
And so figuring out how to navigate that is also something that, you know, at Eppo we have
some opinions on the way to do it.
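The peeking problem is easy to see with a small simulation: run A/A tests where there is no real effect, check a naive p-value at many interim looks, and the false-positive rate climbs well above the nominal 5%. This only illustrates the statistical issue; it is not a description of how Eppo monitors experiments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments, n_users, looks = 1_000, 5_000, 20
false_positives = 0

for _ in range(n_experiments):
    # A/A test: both groups come from the same distribution, so any
    # "significant" result is a false positive by construction.
    a = rng.normal(size=n_users)
    b = rng.normal(size=n_users)
    checkpoints = np.linspace(n_users // looks, n_users, looks, dtype=int)
    # Peek at every checkpoint and stop at the first p < 0.05.
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05 for n in checkpoints):
        false_positives += 1

print(f"false-positive rate with peeking: {false_positives / n_experiments:.1%}")
# Substantially above the 5% you'd expect from a single, pre-committed look.
```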
From there, you reach an end of the experiment, you have to make a decision.
And a lot of times that decision might be simple.
It's just like, did the metric go up or down?
But sometimes it gets complicated.
Like what happens if one metric goes up
and one metric goes down?
Like what happens if revenue goes up
and customer support tickets goes up?
What do you do there?
So kind of navigating that decision,
that metric hierarchy is kind of a big thing here.
And then from there, you have to make a decision, right?
You know, actually go forward and say, I'm going to launch it, I'm not going to launch it, sort of thing, and record that
for posterity. So there's a bunch of stuff involved here, and I always think it's just
very telling that the commercial tools touch on, like, such a limited number of it.
So, uh, about this experiment lifecycle,
which sounds like something quite complicated, to be honest,
there are like quite a few steps there and many different steps where
something can go wrong.
So how much of this lifecycle can you cover right now with the product
that you have?
Yeah, we, right now we are,
we do a lot of the post-implementation details.
So you like, actually that's not true.
So the things we do do are first, we let you build up this large corpus of metrics.
So that includes your important stuff like revenue purchases, et cetera.
It also includes your more bespoke stuff like widget clicks or whatever.
And so all that becomes addressable in my system.
And if you want to see like for revenue,
like how did all my experiments do in history?
Like what were the biggest revenue drivers,
least revenue drivers, whatever. You have those views there.
We also, from there,
we help you kind of after implementation to say like,
how long should this experiment be run, this power analysis thing.
So we automate that, make that self-serve for you.
And we automate all these diagnostics so that if there's some issue with randomization, it's not actually random, then we will alert you on that, make you aware of it.
If there's a precipitous metric drop, we'll make you aware of it. And then at the
end for kind of guiding your decision, we have a sort of opinionated design interface that is meant
to say like, again, if you have some junior PM fresh out of college and they need to make a
decision here, how can you lead them in the right direction? You know, allow people to incorporate
their opinions on what metrics matter
and what metrics are sort of exploratory.
From there, all the experiments get compounded
into this like knowledge base.
So, you know, here's where you can look
through all the past experiments,
see has anyone tried this before,
understand the picture of like
what have been the big drivers historically.
So that's where our system touches today.
I think as time goes on, we're
going to be reaching further and further back into the planning cycle. So today we do power analysis
once it started. Pretty soon we're going to do power analysis before it starts. So you can
actually do a kind of scoring of the complexity of these experiments going in and then just kind
of group experiments according to, like, themes, right? Like, here's a bunch of experiments that are all related to driving search rankings,
personalization, policy, or something.
So you mentioned metrics a lot, like, in our conversation
so far, and it seems like having a good understanding and a good definition, shared among all the stakeholders, of what
the metric is, is quite important for the, I mean, for interpreting the results of whatever experiment
you are doing. And I can't stop thinking of all the different places that we keep metrics in an
organization. We keep talking a lot about data silos, but there are also like
metric silos, right? We don't really talk about that, but it's really easy to create the same
metric, to recreate the same metric in many different places with even like slightly different
semantics, but this might make like a big difference. So how do you deal with this problem and
how do you relate with this whole movement that we hear a lot lately about metric layers,
metric repositories, and how your product works with that?
Yeah, absolutely. So I have been very heartened to see that all these metric layers and metric
groups have been taking off, right? Because I'm obviously a big believer in it. And experimentation
systems get a lot more powerful when companies have a clear definition of metrics. So, you know,
I see our system integrating well with those metric layers. You know, we are one of many
downstream processes that should operate off the single definition source.
So, you know, there's a little bit of let's see which ones catch on and, you know, what the right
integrations to do are, but it's definitely in our interest to play well with them. In terms of
how we deal with them, I think there's two things. So one is experimentation as a practice gets more powerful by the diversity
of your metrics. And to give an example of that, I'll tell you a story from Airbnb.
There was this, there were these two teams. One team was focused on driving this instant book
feature. So I've mentioned to you before, it's very annoying to have folks have to approve you.
So it's great to, if a host just says, look, I'm just going to accept everyone.
And that became a strategic thing that we're trying to improve.
So it started out, there was like, what, like two or 3% of hosts who had it.
And today it's like 80, 85%, like a really, really huge change.
Yup.
And so there was one team that's just running a bunch
of experiments against that.
Simultaneously, there was another team which was trying to make it so that when you use Airbnb you sign up much earlier on than you
currently do. So Airbnb is an app where you can get all the way to the checkout page and never create a
user account, right? And so they were experimenting with various ways of incentivizing people to
create a user account early on. And these teams were in different parts of the building, you know, teams that might have
hung out socially but
not really sharing roadmaps and stuff like that too much. It turns out that experiments that drive
sign-up rates will actually have a crazy effect on driving up Instant Book rates, because this
Instant Book feature was gated behind sign-up. So it's the exact example of where things,
like it's a limited surface area.
These people's metrics have interactions between them.
So if you have some ability to say,
like, I am the business travel team,
all I care about is Airbnb business travel.
I just want to see how every experiment affects it.
Like these sort of views become super important.
So from our standpoint, we're happy that there's been a standardization of metrics.
Philosophically, we are 100% in the direction of saying companies should have a single source
of truth around them.
We're aiming to build off those systems.
Experimentation is very metrics-hungry.
Makes a lot of sense.
And OK, experimentation platforms have been around for quite a while.
It's not something necessarily new. Why do we need to innovate today and bring something new
in the market? What's the reason behind that? Yeah, so I think there's a few answers here.
So one is that experimentation has existed
in this feature-centric world for much longer
than what you might see at an Airbnb
or Microsoft back in the day.
So I think that the scope of what experimentation includes
has widened, where you now include core business metrics.
It's no longer a CRO consultant who's trying to drive signups off your marketing page.
It's now, no, you're trying to drive core OKRs at the company by an experimentation
strategy.
But part of the reason that that has been enabled is the rise of cloud infrastructure,
where suddenly a lot more people are working off these kind of common set of tools that
make it very easy to do
these complex workflows. You know, I think of Optimizely as a company who, like, they might
have even been a little bit ahead of its time, you know, 2008 when they started. Like, I could
not build Eppo in 2008, because instead of integrating with Snowflake, Redshift, and BigQuery or whatever,
I would have to integrate with, like, Teradata and Hive clusters and, you know,
Pig clusters, whatever, all sorts of, like, SAS or something. You know, it would have been a much tougher argument to say, how do we integrate with these databases? So the entire analytics
side was just really hard before, in a way that has now become
possible to deal with. Also, now most companies are operating off of AWS or GCP or something like
that. And so to have a sort of cloud infra place where you can kind of quickly turn experiments on
and off, put them up and down, have this very iterative
process, this continuous integration environment is now just much more common than they used to be.
Makes sense. And additionally, the first thing that comes to mind when someone has heard the
words experimentation platform before is product. It's a tool that is heavily used by product teams.
Where else can we use experimentation?
Or do you think that it's like a tool that is reserved only for product
managers?
Yeah, it's,
I think experimentation today probably has two big homes.
One is on product development, which is,
and I think that's a much more expansive definition than just growth and ML teams. I'm talking about literally changes through code,
you know, as an easy way to do it. The other big place I've seen a lot of experimentation
is marketing. You know, if you think of ad campaigns and some of the management of growth
marketing, that's another place that's very experiment heavy.
So those are probably the two biggest buckets.
I think, you know, where else to grow from there?
Experimenting on kind of operational teams is something you're starting to see more companies dabble with,
such as like a sales team or a customer service team.
Like if you think of like UPS, you know,
they can experiment on their fleet of drivers or whatever
so that's kind of an emerging area from my standpoint the the product side of things
and the growth marketing side of things are just these like hugely growing industries
now that we have product-led growth, you know, we have more bottoms-up, self-serve motions, etc.,
that make it just really attractive. Yeah, and, like, as a person who has worked in product,
I always thought of, and by the way,
my background is mainly B2B products, right?
So the first time that I tried to use
an experimentation platform in a B2B product,
I felt miserable, to be honest.
And like I developed this kind of way of thinking
that experimentation platforms are mainly for B2C companies
because you need to have volume there
that will drive these experiments.
Is this true?
You know, the thing with sample size and experimentation
is that it's really around what sort of effect size they're trying to get.
So if you are running experiments when you don't have too much sample size, like there's still value in just preventing horrendous bugs, right?
There's sort of a, look, I just want to make sure, I want to know if a metric dropped like 10%, you know, if there's some very major issue. So when you look at B2B enterprise companies, you see experimentation play out much more as hygiene,
as like we just need to make sure that everything is healthy
or that if someone has outlier success, we know about it.
But I would say that the business models
that are much more levered on experimentation
comprehensively beyond that are these consumer,
pro-consumer companies.
Basically, you arrive at a website and just sign up and purchase.
So, you know, today, you know, every startup has to have a kind of focus to start with.
We like to focus a lot on these consumer prosumer companies.
Yeah.
Yeah.
Makes a lot of sense.
And let's say I'm a founder, and, like, I've started building my company, right, and my product,
and I'm looking for product-market fit. Okay, where should I introduce an experimentation
platform? Is this something that I should be using before product-market fit, after? Is it something
that can help me, like, find product-market fit faster? You're also a founder.
So what's your opinion on that?
Yeah, I mean, my opinion is that basically once you have the sample size,
the ROI in experimentation is so clear that you should really be doing it.
It's basically saying, do you want to measure product changes or not?
Because that's the basic answer.
And the answer is clearly yes.
So now, it happens to be the case that to have sample sizes that let you run experiments really well probably means you have some amount of product-
market fit. But I think the main thing is just sample size. It's like, can you actually run an
experiment at all? Because once you can, it's just really clear you should. So can you give us, like,
I mean, some kind of sense around what this sample size should be?
Yeah, what you should think about is what is the most common behavior that you care a lot about?
So maybe you don't care too much about signups, right? But maybe you do. If you're, say, Webflow,
maybe you care about people publishing sites or something like
that. You know, that's not exactly a subscription, it's not your North Star revenue-based metric,
but it is something that, through all of Webflow's history, they have noticed driving publishes
is a powerful thing. So what you might want to do is to say, okay, I have this many users on Webflow,
I have this many signups every day,
and of those people, here's how many are publishing. From there, I can plug it into these, you know,
online calculators for power analysis. We will be building our own. There are different ways
to conceive of it. And then from there, it's just saying, like, you know, what is your comfort level
with running an experiment for three months or two months or one month or two weeks or whatever?
And once you have an answer to that, that feels comfortable,
which is that you're not going to lose a lot of product time, product speed,
by waiting and being blocked on this experiment for this amount of time, then you should do it.
You should absolutely do it.
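The power-analysis calculation Che refers to is standard; here is a sketch using the usual two-proportion approximation. The baseline rate and the lift to detect are made-up numbers, there only so you can plug in your own values.

```python
from math import ceil, sqrt
from scipy.stats import norm

def sample_size_per_group(baseline: float, lift: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per variant to detect `lift` over a `baseline` rate (two-sided test)."""
    p1, p2 = baseline, baseline + lift
    p_bar = (p1 + p2) / 2
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# e.g. 10% of signups publish a site today and you want to detect a 1-point lift:
n = sample_size_per_group(baseline=0.10, lift=0.01)
print(f"~{n:,} users per group")
# Divide by your daily eligible users to see whether that means two weeks or three months.
```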
Yeah. One last question, and then I'll let Eric ask any questions he has.
I think it's clear that having a team or a person in the company
that knows about statistics is very useful when you operate these tools.
But that's not always the case, right?
Especially when we are talking about younger companies.
I can't imagine many companies that when they start,
they have like a data
scientist there, unless the product is very related to data science. What would you recommend
to a founder or to a product manager that doesn't have access to any resources,
statistical resources? How should they educate themselves in order to maximize the value that
they can get from these tools?
For example, you are talking about power analysis, right? What can I learn about power analysis? How
can I do this thing? Yeah, absolutely. It's funny you mentioned that because one of the things we're
going to be working on is a modern experimentation guide. One of the things that's sort of tough is
that there's a lot of content on the web that's all fragmented everywhere right yeah so it would be nice to kind of compile that the i think there are let's say two or three resources
I would recommend if you're a product manager and you're trying to get educated in experimentation.
So the first, I think the Reforge program is great. It's basically a product
management MBA for product managers, is, I think, how they describe themselves.
It naturally tends to lean a lot more on quantitative methods and experimentation than many other places.
So I think that's a really great thing. I think it's great use of your learning budget.
Another great use of your corporate learning budget, if you have it, is Lenny's newsletter.
So Lenny Rachitsky, an Airbnb alum of ours and also
an investor. He has that Slack group related to the newsletter that is really, really informative.
You know, I've personally learned a lot from it and there's a lot of experimentation talk. I like
to contribute there. So that'd be another great resource. And then the third is probably the
closest thing you can call to an experimentation Bible is the book by Ronny
Kohavi, Online Controlled Experiments. Ronny Kohavi, for those who don't know,
was one of a pioneering team of experiment scientists at Microsoft. So Microsoft, especially
back in the day, really pushed the edge, pushed the frontier on what was possible with these online platforms.
Ronny Kohavi is probably the one who did the most evangelizing of that in his book
and his talks and stuff like that.
And so that's a great resource to read through.
Very readable.
And I believe now he actually even has an online course.
So that might also be a great venue.
And of course, for any of your listeners, I love chatting with people.
So if you want to just email me or DM me on Twitter, I'd love to chat through whatever
experimentation topics you have on your mind.
Cool.
Super helpful.
Okay.
One more question, Che.
And hopefully this doesn't push us over time, but we've learned so much about sort of testing
frameworks and methodologies.
You've seen both sides, the infrastructure side and the testing side. And so I'm thinking about
our listeners who, you know, are maybe a data engineer who isn't as close to the data science
team, you know, or machine learning side. Could you just help us understand how should data
engineers who aren't as close to that work
with data scientists? Like, what are your thoughts there? What do you have that could be helpful for
those listeners? Yeah, absolutely. And so I think the starting place to establish is to just say
that part of the reason I started an experimentation company is that experimentation just unlocks so
many of the things that a data team wants to do.
You know, data teams, again, they exist to drive good decision-making with a value system that
tends to be more metrics-based and data-based than other teams. And experimentation is just an
incredible cultural export of those values to the organization. It really just helps people
engage with data in a much more meaningful way
that does not require nearly as much cognitive load to build it into your
decision-making process.
So I think a starting point is to just say, like, if you're a data engineer,
your work will be a lot more impactful in the organization if you have an
experimentation program.
So that's a starting point.
In terms of how do you enable it, I think if you are a data engineer, it's very similar
to what you might see in most other data engineering practices.
The things you can do is provide great definitions of metrics for things that drive the business.
Candidates for things where if this goes up,
we're all in great shape as a business.
And then to create great APIs to them.
So is it very easy for data scientists or PMs
to utilize those metrics?
Easy to plug them into tools like Eppo?
Do you have great ways for people to build an affinity
and understand what drives them?
I think that those same topics that probably apply to pretty much everything else a data engineer does,
you know, apply heavily to experimentation. Okay, super helpful. I also think that the
accessibility piece for data engineers is really helpful, right? Because it's hard actually just
to collect the data a lot of times, right. And, and get everything in one place.
Do you have a story that sticks out to you about, you know, a context of a data team where, you know, maybe they were sort of behind the scenes and then experimentation or something
happened where all of a sudden it was like, wow, you know, like you're our best friends now.
Oh yeah, absolutely. Experimentation tends to really, you know, create new relationships in
the org. In terms of where the work just became a lot more visible, you know, at Webflow, we
spent a lot of time trying to quantify when has someone onboarded, when has someone activated to
the system. And if anyone has used the product, it's quite complex, right? It looks a lot like
Photoshop.
It's got buttons everywhere.
It's brimming with power, but you might have to take a Coursera class or something to learn how to use it.
So the learning curve is like this big issue.
And we have significant data resources going towards trying to find levers to improve it.
Now, the thing is that you can have a lot of different theories around how to improve
activation, but it really helps if you can just show that your theories are correct because you
built an experiment against it and you drove the metric. So there was this growth team that,
you know, spent a lot of time trying to craft activation metrics and levers. But once they were
running experiments, you know, the whole product org was aware of the experiments that were being
run and how they were trending. We started seeing more product teams want to run experiments
themselves as a result, which is great. I always say one of the cool things, the blessings and the curse
of experimentation, is that it's a very public process. It draws a lot of eyeballs. It's
like, it has very much of that man-in-the-arena quality
where this team is going to go
and they're going to expose themselves
to whether they were successful or not
in a way that most product teams don't expose themselves.
So it's great.
The right type of product team lives off that stuff.
And then if you're a data engineer or data practice,
you've got to feed good data into that process. You know, your tables have to be
pristine. They have to arrive on time. You know, meeting an SLA on a data engineering pipeline
becomes much more critical for an experiment. And, you know, that I think it naturally leads
to more resources in that area. Absolutely. Okay. One last thing.
If someone's interested in looking at Epo, where should they go?
Just go to geteppo.com, www.geteppo.com.
That has details about the product and also has a link to reach out.
I'm also, like I said, I love chatting with people, whether on Twitter, Slack, LinkedIn,
or whatever.
So you can reach out to me, Chetan Sharma, on any of those mediums.
I'd love to get in touch.
I love talking to anyone who's interested in experimentation, no matter what the maturity stage or readiness for a product like ours is.
So I would love to chat with whoever.
Awesome.
Well, Che, thank you so much for the time.
We learned so much and just really appreciate you taking the time to chat with us.
Absolutely.
It's been a pleasure.
You know, I'm just constantly struck. I think every single
show, I just am amazed by how smart the people that we get to talk to are and what they've done.
And Che, of course, is no different. Studying synchronization between brain hemispheres,
you know, and then building a statistics practice inside of Airbnb. Pretty amazing. I think, you know, here's my takeaway and this isn't,
this isn't super intellectual, but it's enjoyable, hopefully. I really appreciate that,
even though it's clear that Che is bringing a huge amount of sort of knowledge and experience
into building this technology that does some pretty complex
things, especially on the statistics side. He acknowledged that, you know what, it's like the
small, dumb things that can make the biggest difference in this world, right? Like opening
a link in a new tab. And it was funny just hearing him talk about that and seeing him smile because,
you know, that's like as simple as it gets.
But it was the winningest experiment, you know, over a five year period at Airbnb.
So I just really appreciate that.
Like we can throw as much, you know, math and technology at this as we want.
And sometimes it's just a really simple idea that wins the day.
Yeah, yeah, 100%.
I think one of the misconceptions around experimentation is
that the experimentation process is going to tell you what to do, which is not the case.
You have to come up with a hypothesis. You have to come up with what matters and why.
The experimentation platform and methodology is there to support you in
the decision that you are going to make. And that's what we have to keep in mind, and that's how I
think we should be using these tools, like, as another tool that can help us make the right
decision at the end. And one of the things that he said that I think is super important is that these platforms
and these methodologies also provide the credibility that is needed in order to communicate
more effectively whatever decisions you propose to the rest of the stakeholders. So that's what I
keep from the conversation today. I think it was a way to, let's say, give a very realistic description of what an experimentation platform
is, what we can achieve with that, and what to expect from them.
I agree.
Well, thank you for joining us and learning about A-B testing today.
Lots of great shows coming up, and we'll catch you on the next one.
We hope you enjoyed this episode of the Data Stack Show.
Be sure to subscribe on your favorite podcast app to get notified about new episodes every week.
We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com.
That's E-R-I-C at datastackshow.com. The show is brought to you by RudderStack, the CDP for developers.
Learn how to build a CDP on your data warehouse at rudderstack.com.