Drill to Detail - Drill to Detail Ep.55 'Snowplow, Data Pipelines and Event-Level Digital Analytics' with Special Guest Yali Sassoon
Episode Date: May 21, 2018
Mark Rittman is joined by Yali Sassoon from Snowplow to talk about data pipelines and Hadoop in the cloud; how web analytics evolved from counting pageviews to today's event-level analysis of consumer behavior across all digital channels; why digital analytics is hard but interesting; and Snowplow's approach to building a successful hybrid open-source/commercial software business that competes successfully with megavendors such as Google and Adobe.
Snowplow website
Snowplow Insights commercial hosted service details
Snowplow Open-Source
Evolving Your Pipeline - Yali Sassoon - Snowplow Berlin Meetup #3
Snowplow on Looker
Transcript
So welcome to another episode of Drill to Detail, the podcast series about big data
analytics and data management in the cloud, and I'm your host, Mark Rittman.
So my guest in today's episode is Yali Sassoon, who some of you may know from the company
he founded, Snowplow Analytics.
So, Yali, welcome to the show. I take it you're also from London as well?
That's right. I'm a fellow Londoner.
Excellent. Well, it's good to meet you, finally. I've read a lot about you, and I've seen you a lot on YouTube videos and, of course, the products that you founded as well. But tell us just briefly: who are you, what was your route into Snowplow, and start off a little bit with
what Snowplow is and what you do. So Snowplow is a data collection platform. The idea is that the
technology makes it easy for any company that wants to collect data about how they're engaging with
their users across all different platforms and channels, web, mobile, different marketing
channels like email, sensors and wearables, smart TV, and also offline platforms. So to build
a data set that describes all those interactions in one place, and to ship that data to a data warehouse
so that the company can use that data to build real insight into who this user is and how they
can serve them best. And then to make that data available in real time so that the company can
take that data and use it to better engage that user,
lead to better outcomes, make better decisions.
Okay.
So how did you, I mean, taking a step back,
how did you get into doing this?
And what was the work you were doing before Snowplow
that led you to have this idea and found this company?
So my whole life I've been working with data of one sort or another.
So I had a kind of background in natural sciences and physics in particular.
Then I studied sort of the history of science and philosophy of science.
I've always been interested in how people have tried to build understanding of different things,
and the role that data has played in building that insight and that understanding.
So I worked most of my life as a consultant, as a strategy consultant, as an operational consultant.
And I was always interested in digital.
So I did a stint working for OpenAds, which became OpenX and is still OpenX today. But in the years directly prior to starting Snowplow,
Alex, my co-founder, and I, we had our own little boutique consultancy and we were working with
companies that largely weren't digital natives. They were mostly companies that had built
successful businesses in the offline world and were transitioning to more of a digital world. And a lot of the work
we were doing was helping them with the digital product development piece, which was the piece
that was really new to them. And a big part of that was showing them that in digital, you had
this opportunity to use data, to collect data and use data about how your users actually engage
with your digital product to help
inform that product development process. So it's what you would today call product analytics. But
remember, this is sort of 10 years ago, when the idea of product analytics was much less
well understood than it is today. So we were working with these companies, and
nearly all of them had either a Google Analytics or an Omniture setup;
Omniture has since become Adobe.
And we found there were lots of things we wanted to do with the data.
We wanted to ask questions to help the product development process
that GA, Google Analytics, and Omniture weren't
really built to answer. And we wanted to combine that data with other data sets: CRM
data, other sorts of offline marketing data, and so on. And again, that
wasn't possible. So we had all these desires around using what for us was this incredibly rich, interesting data set,
and all these frustrations, because we typically couldn't use the data that we wanted
in the way that we wanted for those clients.
And that was one of the main drivers for creating Snowplow.
Okay. So, I mean, I came into the digital world about sort of two years ago,
and I was surprised at how separate that was and how different it was to doing analytics on, say,
kind of enterprise data, you know, data warehouses and that sort of thing.
You know, I came from a world of working with things like Oracle and Cognos and that sort of thing.
And I was surprised at this whole kind of separate world
of digital analytics and product analytics and so on that was out there
that used tools like Google Analytics and Omniture and so on.
It's quite a separate world, isn't it,
from that kind of enterprise BI world, really?
Yeah, it's a totally separate world.
And it always surprised me. I always considered myself a data guy,
and I always considered that web data was just another pot of data,
so it always grated on me that there was another set of tools
where that data lived. But the thing that really frustrated me,
and this is really different if you're from a data warehousing world, or even if you're just
a consultant who's used to working in Excel 24/7:
if you're in an Excel world,
you start off with a question, and then you go and get the data that you need to
answer your question.
You reshape the data;
you do whatever's required to get the data to fit your question.
Whereas in digital analytics,
the expectation is that you log into a UI and the answers to your questions
are there.
And it's actually, in digital analytics,
people have come to let the tools define the questions
that they should be asking, instead of starting off by thinking,
what questions should I be answering?
And then taking the data and fitting the data to the question.
Because, for whatever reason,
and we can discuss it if it's interesting, digital
analytics has been kind of built around these solutions, and the makers of those solutions
have largely defined the questions and the analysis that have been performed in digital
analytics, in a way that isn't true in the rest of analytics at large.
That's interesting.
I mean, certainly one of the first impressions that I got when I went into digital was, I guess,
how much more analytically minded a lot of the analysts were
that I worked with in these kind of companies.
So I was surprised at how much SQL was used, for example,
as opposed to sort of using kind of graphical tools.
And the other thing that was really interesting
was the complete dominance of Google Analytics as well.
I mean, it really was, and it still is,
I suppose, the dominant kind of tool.
And that's almost the de facto definition of everything.
People think that analytics is Google Analytics
in this world as well, or certainly they did.
Did you find that as well?
Yeah, I mean, it's extraordinary
what the guys at Google have achieved in terms
of the dominance of that analytics solution.
And look, Google Analytics is brilliant for a whole raft of things,
and Google have done an incredible service for the world by making that product free.
But the flip side is that, especially with the free product,
and for years that's all there was, it has really defined what digital analytics is
for the vast majority of digital analysts, even today.
So what was then the problem that you saw needed to be solved, that led to Snowplow? You said there were limitations in web analytics tools and
so on, but what led you to go from consulting, which can be quite a nice business to
be in, in a way, to actually go and start to build a product, and
feel the pain enough to do that? What was it that really prompted you to
start Snowplow, and to think it was worth starting something new and fresh, beyond sort of the standards, really?
So we were very frustrated by those limitations, as I described. But then,
on the other side, we were really excited about the new crop of what you'd call big data tools; they were new then. From my time at OpenX, I knew that if you want to build a data warehouse for
a real-time display ad exchange, it needs to be able to scale to billions
of events. And in Europe, when we were looking at the problem, you just couldn't find companies
that had solved that problem at that scale, because at the time
that I was there, sort of 2006, 2007, tools like Hadoop were still pretty new, and they
didn't really have much adoption. There was more adoption in the US than there was in Europe, but we
were well behind the curve. So I, and Alex as well, were sort of incredibly excited that all these
limitations around the volume of data that you could work with, that we'd kind of lived with
for our whole working lives, suddenly open source frameworks like Hadoop and new services like
Amazon Web Services were suddenly making that possible, and actually making that pretty easy.
So we were frustrated by the limitations of traditional web analytics tools.
We were excited that suddenly we could start collecting and querying data at scales that wasn't possible before.
And the sort of nudge that we needed to start building Snowplow,
it's not like we made a decision to stop consulting
and start to build a company.
We were consultants and we realized we,
I think we were at the pub
and we had a bunch of different people
that we were having drinks with.
And one of them told us how Quantcast logged data
across their network of websites with the Quantcast tags, by serving a pixel from a CDN, recording those CDN logs, and then parsing them using EMR.
And it was just like a light bulb moment. Wow, that is so simple and so potentially powerful. And so the next day we took Piwik,
which is the open source alternative to Google Analytics.
We took their JavaScript tracker
and we forced it to fetch a pixel on CloudFront
and we switched on CloudFront logging
and we wrote a Hive deserializer
so you could run SQL queries against the CloudFront logs.
And that's all that the first version of Snowplow
was. It was put
together in a day.
We published it on GitHub.
We did some blogging about it.
Really, we were just kind of amazed that,
with relatively minimal effort,
suddenly there was a general purpose framework.
It was so raw then that I'm embarrassed calling it a tool.
But there was something that was out there that anybody could take
and collect granular event-level data.
And remember, that was the thing that we'd wanted to be able to get out of Google Analytics
for all these different clients that we'd been working for over those years
and we'd never been able to.
And suddenly there wasn't a vendor mediating our access
or a client's access to their own data.
They could collect their own data on Amazon Web Services at any scale
and run any query against it.
I mean, running the queries was a complete, you know,
ballache: firing up an EMR cluster
and writing a Hadoop job.
But it was possible, and that was a real sort of
light bulb moment.
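The first version Yali describes, a pixel served from CloudFront with the CDN access logs parsed into events, can be sketched in a few lines. This is a hedged illustration, not Snowplow's actual Hive deserializer: the log line below is a simplified CloudFront-style record, and the field layout and query-string parameter names are assumptions made for the example.

```python
from urllib.parse import parse_qs, unquote

# A simplified CloudFront-style access log line (tab-separated). The field
# layout here is an assumption for illustration; real CloudFront logs carry
# more fields in a documented order.
LOG_LINE = ("2012-02-24\t19:01:02\tLHR3\t192.0.2.10\tGET\t"
            "d1234.cloudfront.net\t/i\t200\t"
            "e=pv&page=Homepage&uid=user-123")

def parse_pixel_hit(line):
    """Turn one CDN log line for the tracking pixel into an event record."""
    date, time, edge, ip, method, host, uri, status, query = line.split("\t")
    params = {k: unquote(v[0]) for k, v in parse_qs(query).items()}
    return {
        "timestamp": f"{date} {time}",
        "ip": ip,
        "event_type": params.get("e"),   # e.g. 'pv' for a page view
        "page": params.get("page"),
        "user_id": params.get("uid"),
    }

event = parse_pixel_hit(LOG_LINE)
```

The tracking data rides along in the pixel request's query string, so "collection" is nothing more than serving a static file and keeping the logs, which is what made the design feel so simple and powerful.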
It's a game-changer.
I mean, there's a lot you've said in there
that's kind of interesting.
And again, a lot of this was new to me
when I came to work at Qubit where I am now.
You mentioned there a couple of things that are interesting. One was the JavaScript tracker, and I think that's
something that most people from my old world wouldn't know, so it'd be worth
maybe elaborating a little bit on what that is. But the other thing is event-level tracking, and
that seems to be the thing. Like you said, the ability to track each
individual behavioral interaction on a website, as opposed to page views and so on,
that level of extra detail is a massive difference, isn't it?
What's the benefit, as far as you're concerned, of going down to that event level, really?
What did that really open up for customers and for the industry? So with getting down to event-level data, the key benefit is the ability to determine how you want to aggregate up the data.
So if you're working with session-level data, or if you're working with user-level data, you're fundamentally limited in what you can do with the data, because there's a load of logic that has been applied to get to that aggregate data set. So if you don't agree with the logic,
if the logic doesn't fit with your business, then you're sort of stuck. And in the case
of web data, really the most obvious way that that aggregation didn't work
was that all that aggregation was done based on cookie IDs, this idea of a
unique visitor. That really just gives you a view of a
particular browser. But what we wanted to do, in nearly all cases, was understand people, and
typically people engage with sites across multiple devices.
And so the ability to define who a user is, and be able to accurately measure what that user's doing
across different devices, is really, really powerful, especially if you want to start
joining that behavioral data with other user-level data.
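The point about choosing your own aggregation can be made concrete with a tiny, invented example: the same person seen on two devices. Per-cookie aggregation, the classic web analytics view, counts three "visitors"; aggregating the same event-level records by a login-based user ID counts the two actual people. The records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical event-level records: "alice" appears on two devices, tied
# together by a login-based user ID rather than a per-browser cookie.
events = [
    {"cookie_id": "c-aaa", "user_id": "alice", "device": "laptop", "event": "view"},
    {"cookie_id": "c-bbb", "user_id": "alice", "device": "phone",  "event": "purchase"},
    {"cookie_id": "c-ccc", "user_id": "bob",   "device": "laptop", "event": "view"},
]

# Because we hold the raw events, we can aggregate either way after the fact.
by_cookie = defaultdict(list)
by_user = defaultdict(list)
for e in events:
    by_cookie[e["cookie_id"]].append(e["event"])
    by_user[e["user_id"]].append(e["event"])
```

With only a pre-aggregated, per-cookie data set, the per-user view on the right-hand side would be impossible to reconstruct; that is exactly the limitation Yali describes.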
We also found a load of limitations with sessionization.
The whole idea of a session sort of dates back to the web in the 90s,
where you could just do one thing at a time.
There were no multi-tabs.
There wasn't that much you could do apart from click around from document to document and um and the and so the definition
of a session is you know i've stopped doing i haven't done in 30 minutes and when i start doing
something again i must be doing something new that might have worked back then but it really
it really doesn't work in in in today's world and so for all sorts of analysis the the aggregate
data that we were dealing with um out of out of google
analytics by and large was uh it was very hard to do what we wanted with it okay and you said i mean
you've been talking about i mean right back at the start you talked about not just kind of i suppose
web websites for tracking things on there all forms of digital interaction i mean just what
what's the what what was the kind of next area you moved into so you're obviously tracking uh you know uh stuff for a website first of all but how
quickly did it kind of expand into other digital interactions as well and sort of multi-channel
stuff and multi-device um so there were two so um we we went on this evolution we started very much
as a data warehousing solution for web data,
sort of clickstream data warehouse.
And then in terms of moving from that to a general event data collection
across platforms and channels, there were two bits to that.
So one of the early customers that we worked with was in the games industry,
and they wanted to track, as all games companies do,
event-level data out of the game. So instead of loading a page, viewing a product, adding
a product to basket, buying the product, they were interested in building a castle,
planting some crops, forming an alliance, declaring war. The underlying events were just
totally different, and the data points
that they wanted to collect with each of those events were totally different. And what
we realized, which seems obvious now in retrospect but took us a while back then, was that the
structure of data that we were used to dealing with, we'd sort of accepted as the default, but that had been
baked into the web analytics tools. And again, that was a structure of data that
was built around the web in the 1990s. And in the 2000s and 2010s, when you had mobile, and you
had people doing much more through digital platforms.
You live your whole life through digital platforms.
You flirt and fall in love on dating sites.
You manage your finances on finance apps.
You plan your holidays.
You manage your health.
You track where you run, where you cycle, all these different things.
The activities that you conduct on these platforms
are much, much more varied. And so the data that describes those activities needs to be much,
much more varied. And the structure of that data needs to be much, much more flexible. And so
what we built into Snowplow, and it's the functionality that I'm sort of proudest of, is the ability for each of
our users to define their own events and their own entities. So if you're a games company, you can
say: this is the universe of things that a player can do in my universe, and these are
the data points that I want to track with each of those events; these are the different entities
they might interact with, so I want to describe those entities,
and these are the different properties of those entities. Then obviously we've got a load
of standard definitions, all schemas effectively, that are publicly
available, but each of our users can define their own. It means that two Snowplow users
can track radically different user
journeys, and you can look at the two data sets and they look totally different, even though
they're being collected through the same underlying technology. So that was one
big thing, because it meant suddenly you could collect data that described events that
didn't look like web events. But the second thing that
we needed to make possible was to collect data from a non-web environment. So, you asked me about
the JavaScript tracker earlier. Web tracking has traditionally been done primarily via a
JavaScript SDK that sits in your browser and listens for things happening: changes to the DOM, web
pages being loaded.
And when those things happen, it triggers requests to your data pipeline.
And in those requests, it posts the data that describes those events that have occurred. There was an alternative model of sort of parsing the logs
that the web servers produced to serve the website,
but JavaScript tracking is primarily the way that web tracking is done.
So, to make it possible to track events from other locations,
we had to release a whole host of other trackers; we call them SDKs, so that you could track events from any kind of environment.
That's, you know, mobile with an Objective-C tracker and an Android tracker and then a whole host of server-side trackers.
You know, Ruby, Python, Scala, Java, PHP, and all the rest.
And so the combination of having trackers
for all your different environments
and the ability to define your own events
meant that suddenly this was a generic event data capture platform.
It wasn't tied to a specific platform,
a specific type of event data.
And that was really, really, really important.
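The "define your own events and entities" idea can be sketched as a self-describing event: the payload carries a reference to the schema that defines it, so two companies can push completely different events through the same pipeline. The schema URI and field names below are invented for illustration (Snowplow's real schemas live in its Iglu registries), and the validation is deliberately minimal.

```python
import json

# A sketch of a self-describing event for the games example in the
# conversation. The schema URI is made up for illustration.
event = {
    "schema": "iglu:com.examplegames/declare_war/jsonschema/1-0-0",
    "data": {"player_id": "p-42", "target_alliance": "northern-realms"},
}

def validate(event, required_fields):
    """Minimal check that the event's data matches its declared shape."""
    return all(field in event["data"] for field in required_fields)

ok = validate(event, ["player_id", "target_alliance"])
payload = json.dumps(event)  # what a tracker/SDK might send to the pipeline
```

The key design choice is that the pipeline itself stays generic: it routes and stores any payload, and the schema reference is what lets downstream consumers know how to interpret each event.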
So you mentioned, right back at the
start, tying identity across different channels, so that
when you go in via your phone, or you do something else, or you go in by different
routes, you can tie this together and understand the kind of bigger picture. But that
cross-device understanding of identity, that's quite a challenge, isn't it? How do you
approach that? And is that still a problem, do you think, or is it something that's now been solved? What are your
thoughts on that?
No, I think it's a solvable problem, but it's quite difficult to solve in
a generalizable way. What we find is that each one of our clients will solve it in a slightly different way.
And there are patterns that you can sort of spot across them.
So in terms of the technology, what our tech lets our clients do is collect very, very granular event data from all different platforms and channels. And with each event, we do
our best, through a combination of automatic tracking and very, very flexible schemas,
to enable our customers to collect the broadest range of device and user identifiers
possible. So on the web, we make it easy to automatically track a first-party cookie ID, a third-party cookie ID,
a fingerprint if you want, an IP address,
all those different things.
And then if the client has ways of generating
their own user IDs, and often they do
if a user logs in or if there are ways
of getting users
to identify themselves, then that can be passed in. So step one
is making it possible to record all those different identifiers. And then step two
is building a process, some business logic, that ties all those different identifiers together.
And so solving the identity stitching piece is two steps, really.
It's first working out how to reliably identify users. A lot of that comes down to the relationship between our clients and their customers, the consumers that use their service, and coming up with ways to really incentivize their users to reveal who they are,
so that the user gets a better user experience.
You tell us who you are, and then because we can recognize you across different browsers and platforms,
we can provide you a better, more tailored personal experience or whatever that is.
And then it's the technology being flexible enough
and granular enough that you can implement your own identity stitching algorithm
on top, to build that single customer view. For other customers that don't
necessarily have a direct relationship with their consumers, they might need to integrate our tech with different identity providers and services: services that are good at fingerprinting, or
services like Parable that are good at spotting a single mobile device across different apps and
browsers, or services like Drawbridge that are good at mapping different third-party cookie IDs
and saying, you know, all these different third-party cookie IDs are really the same user.
Okay, so you said back at
the start you put all the code on GitHub, and I know it was an MVP at the time; it was
something that, you know, you were surprised it worked and it solved the problem and so on. But
what's the commercial model now for Snowplow? I know your code and your product are
available, I think, on an open source or freemium basis, but how does it work now? How do you make money and grow, I suppose,
really?
There are two sides to the business. So there is a paid-for
platform, which is built on top of the open source, and I can talk a little bit about the
difference between the paid-for platform and the open source. And there's also a professional
services side of the business. It's pretty common as an open source company
to have a professional services component. With Snowplow, that is particularly important, because
the tech doesn't solve a specific business problem.
The technology is very horizontal.
It provides a foundation that then makes it easier to start solving these problems.
It's still not necessarily easy to solve any of these problems.
So if you want to do attribution right,
if you want to better serve your customers,
if you want to better use data to drive your product development process,
then collecting a very, very high-quality data set
is sort of step one, and Snowplow does that.
But then taking that data and using that data
to solve any of those problems is a big step,
and it's not necessarily an easy
step. So having a services team that can go into our clients and help and show them how to use the
technology to solve those problems, and to solve them quickly and iteratively and demonstrate
business value, is really important for a lot of our clients, because it makes our tech more accessible. And it's really
important for us, because it means we're not limited to selling
into the types of data-sophisticated companies that have big internal data teams that know what
to do with what is effectively a really big fire hose of data that our tech can deliver.
We can actually help them solve specific goals
and help develop their internal competency around that data.
So services are really important.
And then the paid-for platform is really important.
So the idea there is that the open source is scalable
and it's robust, and we're committed to open source.
We sort of firmly believe that as a company,
one of the most valuable things you can do is collect data
and you should own that data and control that data
and you shouldn't have to rely on a vendor.
You shouldn't be at the mercy of a vendor
who can potentially lock you in, because that customer data is potentially
your secret sauce, and the intelligence you build on it should hopefully be your
secret sauce too. You also owe it to your customers, your data subjects, to
collect data right and use that data for the mutual benefit of you and your customers, and not have a vendor tell you how to do that.
So having open source is really important.
We think there needs to be an open source platform that lets people do that.
What we found, though, is that to use our open source successfully,
companies need to be pretty good at the data engineering and DevOps side of things.
And the idea with the paid-for platform is that we're giving the same power and control to companies
that either don't necessarily have that type of expertise, or, if they do, they want to use those resources for something else. We
want to make it possible for an analyst, or a data-savvy marketer, or a product manager or product team
that aren't necessarily rich in data engineering and DevOps resources,
to take control of their data the way that the open source lets the more engineering-led companies do. So we've built a UI and a set of hosted services, so that the
whole experience is much, much easier. They sign in, they want a new pipeline,
everything's set up in a couple of hours, and they can focus on actually doing things with the
data rather than setting up and running the pipes.
Okay.
That sounds very similar to Imply; I had the founder of Imply on the show, I think.
And they've done a similar thing with Druid.
They don't really manage the pipeline as such, but they certainly offer it as a managed service.
And they solved the problem for me of how to get Druid working in a reasonable amount of time.
But they take care of a lot more of it as well.
And that's a good model, actually. I mean, it's a model where you get the feedback
into the product and the actual core SDK, but you also help people who, you know, want to focus on
other things, really, rather than actually data engineering.
Yeah, I'm a big fan of
the Imply guys, and yeah, the model is very similar. And actually, we're pretty keen
on getting an integration between Snowplow and Imply.
Interesting. That'd be really interesting. Yeah, that'd be interesting.
So moving on a bit, I watched a couple of your YouTube videos where you've been presenting on Snowplow.
And there was one that really resonated with me, where you talked about, you said, digital analytics is really interesting but really hard.
And that resonated with me because it's very true.
The models are complex.
The use cases are complex.
Why did you say that?
What is it about digital analytics that makes it surprisingly hard
to be productive and successful with?
That's a big question.
I think there are a few different reasons.
The first is digital data is very heterogeneous.
So digital experiences, as we talked about earlier, are really broad.
The types of experience that you have on a jobs board are nothing like the types of experiences you have when you're browsing your national newspaper in the morning. And so the types of questions that you're going to have of that data are going to vary. Even if you're a marketer,
there might be similarities about the questions you ask, if you're trying
to drive more companies to advertise on your jobs board, or more applicants to come,
but they're going to look different, and the way you're going to use that data to answer those questions is going to be different, than if you're
a newspaper that's ad-funded or subscription-based or both.
So it's domain-specific, isn't it? It's very domain-specific.
You know, if you're going to work as an analyst in digital marketing,
for example, you've got to really know your stuff there. It's a very specialized area, with
very specialized kinds of questions you have to answer, and data models, and that sort of thing.
You've got to understand the business and the business context and all the things that you have to do if you want to be any sort of analyst.
But on top of that, there's just a huge amount of technical knowledge to acquire.
You can't really work in digital unless you've got a good working knowledge of
your JavaScript and how data is collected from there. Typically you're combining that with other
data sets, so you need to understand where that data is collected, you need to understand
how that data has been processed, how that data is being surfaced. You've then got a plethora of
tools to take that data and try and answer your question, and that's where the
sort of second challenge comes up. As analysts, we benefit from
decades of work developing analytics techniques, going back to the
development of statistics, and then you've got all these different technologies and machine learning and so on.
And as analysts, we feel like there's a whole bunch of tools: we're really empowered to go and do things with data.
But actually, a lot of those tools don't work particularly well with event data.
So if you're working with data sets that are measurements,
so you're running an experiment
and you're measuring the outcome of that experiment,
you're measuring things like temperature or whatever,
or you're dealing with a very uniform data set
like transactions, for example,
then things are very, very easy.
You've got a really wide range of statistical techniques.
You can compute averages, mins, maxes.
You can pivot the data.
You can run all kinds of statistical techniques, regression, et cetera,
on the data.
There's loads you can do with it.
If you're dealing with event data, actually, you can't do any of those things with the underlying data.
When you're looking at event-level data, you know that somebody did something and then
somebody did something else and somebody did something else.
Maybe they're making decisions, or maybe they're trying to get to a particular end point. It's often not clear who the person is carrying out these actions, it's not clear what they're trying to do, and it's not clear to what extent what they're doing is driven by the design of the digital product they're engaging with, or by their intention. And you can't answer those questions with any of the techniques that I've described. Even counting events, or averaging them, doesn't really make sense. All the usual functions that we use on data don't actually work on event-level data.
You want to take this data and you want to start figuring out who these people are, what
are they doing, and then start measuring them by are they successful at what they're doing?
Are they failing at what they're doing?
If they're failing, why are they failing?
If they're more likely to go one way than another way, why is that?
And for that sort of question, there aren't techniques or approaches or tools that just let you answer those out of the box.
So do you think that's why people still use SQL a lot within this kind of industry? It struck me, because I was used to everybody in my world using a graphical tool, doing the very simple aggregations and so on that you talked about, but I suppose the nature of a lot of the data, and the complexity of those questions, and the fundamentally different way of coming to your conclusions, meant that SQL was used a lot more than I expected. Did you find that at all as well?
Yeah, I totally agree with that, and I think that's exactly why. SQL is just one way of solving that problem,
and it's actually not a very elegant way, because it's very hard to write a SQL query that says, show me all the people who did A, then B, but then didn't do C. SQL is still built around grouping things and computing sums and mins and maxes and count distincts and all the rest. You can start doing some of this analysis with SQL, which is why people do, and you end up writing these sort of horrendous window functions. That's ugly as sin, and there just aren't other tools to let you do this stuff easily.
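As a small sketch of the kind of sequence query he's describing (using SQLite in memory and invented event rows), even the simple "did A, then B, but never C" case already needs a self-join on timestamps plus a correlated `NOT EXISTS`, rather than a plain `GROUP BY`:

```python
import sqlite3

# Made-up event data: (user, event name, timestamp).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event TEXT, ts INTEGER);
    INSERT INTO events VALUES
        ('u1', 'A', 1), ('u1', 'B', 2),                   -- A then B, no C
        ('u2', 'A', 1), ('u2', 'B', 2), ('u2', 'C', 3),   -- did C as well
        ('u3', 'B', 1), ('u3', 'A', 2);                   -- B before A
""")

# "Did A, then later B, and never C" as SQL: a self-join for the
# ordering constraint, and a correlated subquery for the exclusion.
query = """
    SELECT DISTINCT a.user_id
    FROM events a
    JOIN events b
      ON b.user_id = a.user_id AND b.event = 'B' AND b.ts > a.ts
    WHERE a.event = 'A'
      AND NOT EXISTS (
          SELECT 1 FROM events c
          WHERE c.user_id = a.user_id AND c.event = 'C'
      )
"""
users = [row[0] for row in conn.execute(query)]
print(users)  # → ['u1']
```

Extending this to longer funnels, or to "what did they do instead?", is where the horrendous window functions tend to come in.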
So I think that's why people are using SQL. That's why people are using Spark, that's why people are using R and Python, and so on. If you want the freedom to analyze the data in the way you want, you don't want the tool telling you what to do. Those are the best tools we have, but it's still not easy.
Okay. Another quote you had from your presentations: you said that in digital analytics you need smart people, and that there are companies out there building tools to try and make it possible for people who aren't smart to do digital analytics, but that Snowplow is about enabling smart people. I think we just covered that first bit, by saying the tools out there now aren't really suitable, but how is Snowplow enabling smart people? What's the kind of thing it particularly does, as far as you're concerned?
So for me, the tension is always between democratizing data, making it possible for more people to do things with data, and not dumbing the data down.
So an easy way to democratize data is to dumb it down.
If you hide a lot of the complexity, if you make it seem simpler,
then it's more likely that more people will start doing things with the data.
But the problem is, if you hide some of the complexity, some of the intricacies in the data, then you're disempowering the smart people. Smart people is sort of a loaded term, but I mean people who want to do data right, people who are like, I want to know if there are irregularities in the data, because that's an opportunity, potentially there's something there for me to understand. So we don't want to hide anything. Empowering smart people, for us, means, rather than dumbing anything down, rather than hiding anything, making it possible for a smart person using the data to view any part of the data, any part of the processing, and to understand exactly what's going on.
So, for example, a silly example, but it illustrates the point.
We don't filter out suspected bots and spiders
from the data set that we deliver.
We just label them or provide tools so that our users can see, hey, this
is probably a bot or a spider, because that's interesting.
That might mean there are a whole bunch of use cases where actually you might be interested in exactly how many robots are crawling your website and what they're up to.
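As an illustration of that label-don't-filter idea (this is a hypothetical sketch with an invented user-agent heuristic, not Snowplow's actual enrichment logic), suspected bots get flagged rather than dropped, and the choice of which view to take stays with the analyst:

```python
import re

# Crude, hypothetical heuristic: flag user agents that self-identify as crawlers.
BOT_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)

# Made-up page-view events.
events = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "page": "/home"},
    {"user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)", "page": "/home"},
    {"user_agent": "AhrefsBot/7.0", "page": "/pricing"},
]

# Label every event; delete nothing.
for e in events:
    e["suspected_bot"] = bool(BOT_PATTERN.search(e["user_agent"]))

# Downstream, different users take different views of the same full data set.
humans = [e for e in events if not e["suspected_bot"]]  # reporting human traffic
bots = [e for e in events if e["suspected_bot"]]        # studying the crawlers
print(len(humans), len(bots))  # → 1 2
```

Both questions — "how many humans visited?" and "who is crawling us?" — remain answerable, because the filtering decision was deferred rather than baked in.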
And if some of them are committing ad fraud, or scraping your data for nefarious purposes, that's really important, and you want to know about it. So it's tempting to sweep that under the rug, to filter them out, so that when you're reporting the number of users you're excluding them, and that's the right thing to do if you're interested in the number of human people on the website.
But we don't want to make any assumptions for our users.
Our users are smart,
and we trust them that if we give them the full data set,
and if they want, they can come to us for guidance,
they'll know what to
do with the data to treat it uh right to answer the specific question that they want and depending
on the question that might involve slicing and dicing the data different ways ignoring different
subsets of of the data but it's presumptuous of any vendor to do that ahead of speaking to a user
of understanding the question the question that a user wants to ask or um uh and even the assumption that there's there's one kind of
canonical way of presenting the data that's going to meet all that user's needs is is is pretty
outrageous you're you're really disempowering your user even though you're making it easier
because you're effectively dumbing things down okay so so the last thing i want to talk to you
about was was taking this forward.
I mean, it sounds like you've done fantastically well to build something that's got this great open source stroke commercial model, and you sound like you're solving the problem really well. But then there comes the question of growth, and how you perhaps compete with the Googles and the Adobes of this world. People might say to you, for example, well, this is good as a pipeline, but do you do A/B testing on top as well? What's your strategy going forward for competing, staying relevant, growing, and that sort of thing?
Well, that's another really big question. Where do I start? What's exciting about the space that we're in is that there's so much scope for innovation. In the time we've been around, if you think about how far the industry has come: when we started, the number one question that I'd get going around different digital analytics conferences was, why would I want to warehouse my data? What report does that let me run that I can't get out of Google Analytics? And we never get that question anymore. Even Google, by launching Premium, which became 360, acknowledged that there's a ton of stuff you can do if you can access the underlying data, especially in a cloud data warehouse that lets you query it flexibly and in a performant way. So I think people today widely recognize the value of warehousing your data. But if I look around, I think there's still a whole bunch of challenges that companies have around acting on the data in real time, and making decisions in real time. The situation today is that there are a lot of people who understand the value in being able to do that, and have the aspiration to do that, but it's technically very difficult. It requires a lot of engineering that most people in marketing or in product development don't necessarily have access to. That reminds me of the state of digital analytics six years ago, when it came to the warehousing piece.
As an industry, people have got much, much better at the idea of warehousing their clickstream data, their digital data, their mobile data, their web data, etc.
Joining that with other data sets and doing things, that's now a well-trodden path.
I think we need to go on a similar journey with real-time data processing.
So I think that's a big opportunity for us at Snowplow.
I think coming up with ways to solve some of those really difficult problems that make digital data hard to work with is a really fruitful area of research.
So we talked a little bit about some of the limitations
that even SQL has in letting you work with this underlying data.
And we're doing a fair amount of R&D with graph databases, for example,
to see if that's a better paradigm potentially for working
with this data. So that's another area that we're really excited about. And then I think there's a
lot that we can do in the short term to empower more product teams, more marketing teams,
more editorial teams that want to be data-driven
and don't have necessarily that engineering resource.
So we're starting to provide a better experience for those users,
but there's a lot, lot further for us to go.
So there's a huge amount of interesting problems to keep solving, and hopefully the more of them we solve, the more Snowplow grows.
Yeah, you mentioned graph databases. I was chatting with, I think, Nick Schrock, who was behind GraphQL, about potentially coming on the show at one point, and we were talking about the use of GraphQL in this kind of context, defining over APIs, over schemas, and using it for analytics and so on. That sounds interesting. Where do you think the potential is with graph databases, around that sort of thing?
What would it be able to solve for you, do you think? I think it would make it a lot, lot simpler to query event data. So if you model it
in a graph, then running those queries, show me all the users who've done A, then B, then haven't done C, and then go and do something with those who've just done A, they sort of fall out of a well-designed graph model a lot more easily than writing the SQL queries.
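As a sketch of that idea (plain Python standing in for an actual graph database, with invented event chains), once each user's events are modeled as an ordered chain of nodes, the "A then B but never C" query becomes a simple walk rather than a self-join:

```python
# Hypothetical graph-style model: each user node points to an ordered
# chain of event nodes. A query is just a traversal of that chain.
user_events = {
    "u1": ["A", "B"],        # A then B, never C
    "u2": ["A", "B", "C"],   # did C as well
    "u3": ["B", "A"],        # B before A
}

def followed_path(chain, first, then, never):
    """Walk the chain: `first` must be followed later by `then`, with no `never` anywhere."""
    if never in chain:
        return False
    try:
        # Look for `then` strictly after the first occurrence of `first`.
        return then in chain[chain.index(first) + 1:]
    except ValueError:  # `first` never happened for this user
        return False

matches = [u for u, chain in user_events.items() if followed_path(chain, "A", "B", "C")]
print(matches)  # → ['u1']
```

The traversal reads almost like the question itself, which is the appeal he's pointing at; a real graph database would add indexing, persistence, and richer relationships between entities on top of this shape.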
And that means that aggregating over the data in that graph world might be a lot easier. It might be a better paradigm for writing those jobs, and it might also be computationally more efficient. But I'm more excited about it because I think it's a very natural way of modeling event data, and it's a richer way, because you can start modeling the relationships between the entities and how those relationships change over time.
Interesting. Well, I'm conscious of time anyway
for you, but it's been fantastic to speak to you, Yali. Obviously, having read about what you do and the products and so on, it's great to speak to the brains behind it. How do people find out more about Snowplow, if they want to maybe download the product and get started, that sort of thing?
Oh, they should come to our website at snowplowanalytics.com and check out our GitHub at github.com/snowplow/snowplow.
Excellent. That's really good.
Well, it's been great speaking to you.
Thank you very much for coming on the show and have a good evening.
And yeah, thanks. Thanks, Yali.
Thank you.