Drill to Detail - Drill to Detail Ep.55 'Snowplow, Data Pipelines and Event-Level Digital Analytics' with Special Guest Yali Sassoon
Episode Date: May 21, 2018
Mark Rittman is joined by Yali Sassoon from Snowplow to talk about data pipelines and Hadoop in the cloud; how web analytics evolved from counting pageviews to today's event-level analysis of consumer behavior across all digital channels; why digital analytics is hard but interesting; and Snowplow's approach to building a successful hybrid open-source/commercial software business that competes successfully with megavendors such as Google and Adobe.
Snowplow website
Snowplow Insights commercial hosted service details
Snowplow Open-Source
Evolving Your Pipeline - Yali Sassoon - Snowplow Berlin Meetup #3
Snowplow on Looker
Transcript
So welcome to another episode of Drill to Detail, the podcast series about big data
analytics and data management in the cloud, and I'm your host, Mark Rittman.
So my guest in today's episode is Yali Sassoon, who some of you may know from the company
he founded, Snowplow Analytics.
So, Yali, welcome to the show. I take it you're also from London as well?
That's right. I'm a fellow Londoner.
Excellent. Well, it's good to meet you, finally. I've read a lot about you, and I've seen you a lot on YouTube videos and, of course, the products that you founded as well. But tell us just briefly: who are you, what was your route into Snowplow, and start off a little bit with
what Snowplow is and what you do. So Snowplow is a data collection platform. The idea is that the
technology makes it easy for any company that wants to collect data about how they're engaging with
their users across all different platforms and channels, web, mobile, different marketing
channels like email, sensors and wearables, smart TV, and also offline platforms. So to build
a data set that describes all those interactions in one place, and to ship that data to a data warehouse
so that the company can use that data to build real insight into who this user is and how they
can serve them best. And then to make that data available in real time so that the company can
take that data and use it to better engage that user,
lead to better outcomes, make better decisions.
Okay.
So how did you, I mean, taking a step back,
how did you get into doing this?
And what was the work you were doing before Snowplow
that led you to have this idea and found this company?
So my whole life I've been working with data of one sort or another.
So I had a kind of background in natural sciences and physics in particular.
Then I studied sort of the history of science and philosophy of science.
I've always been interested in how people have tried to build understanding of different things,
and the role that data has played in building that insight and that understanding.
So I worked most of my life as a consultant, as a strategy consultant, as an operational consultant.
And I was always interested in digital.
So I did a stint working for OpenAds, which became OpenX and is still OpenX today. But in the years directly prior to starting Snowplow,
Alex, my co-founder, and I, we had our own little boutique consultancy and we were working with
companies that largely weren't digital natives. They were mostly companies that had built
successful businesses in the offline world and were transitioning to more of a digital world. And a lot of the work
we were doing was helping them with the digital product development piece, which was the piece
that was really new to them. And a big part of that was showing them that in digital, you had
this opportunity to use data, to collect data and use data about how your users actually engage
with your digital product to help
inform that product development process. So it's what you would today call product analytics. But
remember, this is sort of 10 years ago, when the idea of product analytics was much less
well understood than it is today. So we were working with these companies, and
nearly all of them had either a Google Analytics or an Omniture setup;
Omniture has since become Adobe.
And we found there were lots of things we wanted to do with the data.
We wanted to ask questions to help the product development process
that GA, Google Analytics, and Omniture weren't
really built to answer. And we wanted to combine that data with other data sets: CRM
data, other sorts of offline marketing data, and so on. And again, that
wasn't possible. So we had all these desires around using what for us was this incredibly rich, interesting data set,
and all these frustrations, because we typically couldn't use the data that we wanted
in the way that we wanted for those clients.
And that was one of the main drivers for creating Snowplow.
Okay. So, I mean, I came into the digital world about sort of two years ago,
and I was surprised at how separate that was and how different it was to doing analytics on, say,
kind of enterprise data, you know, data warehouses and that sort of thing.
You know, I came from a world of working with things like Oracle and Cognos and that sort of thing.
And I was surprised at this whole kind of separate world
of digital analytics and product analytics and so on that was out there
that used tools like Google Analytics and Omniture and so on.
It's quite a separate world, isn't it,
from that kind of enterprise BI world, really?
Yeah, it's a totally separate world.
And it always surprised me. I always considered myself a data guy,
and I always considered that web data was just another pot of data,
so it always grated on me that there was another set of tools
where that data lived. But the thing that really frustrated me,
and this is really different if you're from a data warehousing world, or even if you're just
a consultant who's used to working in Excel 24/7:
if you're in an Excel world,
you start off with a question, and then you go and get the data that you need to
answer your question.
You reshape the data;
you do whatever's required to get the data to fit your question.
Whereas in digital analytics,
the expectation is that you log into a UI and the answers to your questions
are there.
And it's actually, in digital analytics,
people have come to let the tools define the questions
that they should be asking, instead of starting off by thinking,
what questions should I be answering?
And then taking the data and fitting the data to the question.
Because, for whatever reason,
and we can discuss it if it's interesting, digital
analytics has been kind of built around these solutions, and the makers of those solutions
have largely defined the questions and the analysis that have been performed in digital
analytics, in a way that isn't true in the rest of analytics at large.
That's interesting.
I mean, certainly one of the first impressions that I got when I went into digital was, I guess,
how much more analytically minded a lot of the analysts were
that I worked with in these kind of companies.
So I was surprised at how much SQL was used, for example,
as opposed to sort of using kind of graphical tools.
And the other thing that was really interesting
was the complete dominance of Google Analytics as well.
I mean, it really was, and it still is,
I suppose, the dominant kind of tool.
And that's almost the de facto definition of everything.
People think that analytics is Google Analytics
in this world as well, or certainly they did.
Did you find that as well?
Yeah, I mean, it's extraordinary
what the guys at Google have achieved in terms
of the dominance of that analytics solution.
And look, Google Analytics is brilliant for a whole raft of things,
and Google have done an incredible service for the world by making that product free.
But the flip side is that, especially with the free product,
and for years that's all there was, it has really defined what digital analytics is
for the vast majority of digital analysts, even today.
So what was then the problem that you saw needed to be solved, that led to Snowplow? You said there were limitations in web analytics tools and
so on, but what led you to go from consulting, which can be quite a nice business to
be in, in a way, to actually go and start to build a product, and
feel the pain enough to do that? What was it that really prompted you to
start Snowplow, and to think it was worth starting something new and fresh, beyond sort of the standards, really?
So we were very frustrated by those limitations, as I described. But then,
on the other side, we were really excited about the new crop of what you'd call big data tools; they were new then. From my time at OpenX, I knew that if you want to build a data warehouse for
a real-time display ad exchange, it needs to be able to scale to billions
of events. And in Europe, when we were looking at the problem, you just couldn't find companies
that had solved that problem at that scale, because at the time
that I was there, sort of 2006, 2007, tools like Hadoop were still pretty new, and they
didn't really have much adoption. There was more adoption in the US than there was in Europe, but we
were well behind the curve. So I, and Alex as well, were sort of incredibly excited that all these
limitations around the volume of data that you could work with, that we'd kind of lived with
for our whole working lives, suddenly open source frameworks like Hadoop and new services like
Amazon Web Services were suddenly making that possible, and actually making that pretty easy.
So we were frustrated by the limitations of traditional web analytics tools.
We were excited that suddenly we could start collecting and querying data at scales that wasn't possible before.
And the sort of nudge that we needed to start building Snowplow,
it's not like we made a decision to stop consulting
and start to build a company.
We were consultants and we realized we,
I think we were at the pub
and we had a bunch of different people
that we were having drinks with.
And one of them told us how Quantcast logged data
across their network of websites with the Quantcast tags, by serving a pixel from a CDN, recording those CDN logs, and then parsing them using EMR.
And it was just like a light bulb moment. Wow, that is so simple and so potentially powerful. And so the next day we took Piwik,
which is the open source alternative to Google Analytics.
We took their JavaScript tracker
and we forced it to fetch a pixel on CloudFront
and we switched on CloudFront logging
and we wrote a Hive deserializer
so you could run SQL queries against the CloudFront logs.
And that's all that the first version of Snowplow
was. It was put
together in a day.
We published it on GitHub.
We did some blogging about it.
Really, we were just kind of amazed that,
with relatively minimal effort,
suddenly there was a general purpose framework.
It was so raw then that I'm embarrassed calling it a tool.
But there was something that was out there that anybody could take
and collect granular event-level data.
And remember, that was the thing that we'd wanted to be able to get out of Google Analytics
for all these different clients that we'd been working for over those years
and we'd never been able to.
And suddenly there wasn't a vendor mediating our access
or a client's access to their own data.
They could collect their own data on Amazon Web Services at any scale
and run any query against it.
I mean, running the queries was a complete, you know,
ballache: firing up an EMR cluster
and writing a Hadoop job.
But it was possible, and that was a real sort of
light bulb moment.
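The first version Yali describes, a pixel served from CloudFront with the CDN access logs parsed into events, can be sketched in a few lines. This is a hedged illustration, not Snowplow's actual Hive deserializer: the log line below is a simplified CloudFront-style record, and the field layout and query-string parameter names are assumptions made for the example.

```python
from urllib.parse import parse_qs, unquote

# A simplified CloudFront-style access log line (tab-separated). The field
# layout here is an assumption for illustration; real CloudFront logs carry
# more fields in a documented order.
LOG_LINE = ("2012-02-24\t19:01:02\tLHR3\t192.0.2.10\tGET\t"
            "d1234.cloudfront.net\t/i\t200\t"
            "e=pv&page=Homepage&uid=user-123")

def parse_pixel_hit(line):
    """Turn one CDN log line for the tracking pixel into an event record."""
    date, time, edge, ip, method, host, uri, status, query = line.split("\t")
    params = {k: unquote(v[0]) for k, v in parse_qs(query).items()}
    return {
        "timestamp": f"{date} {time}",
        "ip": ip,
        "event_type": params.get("e"),   # e.g. 'pv' for a page view
        "page": params.get("page"),
        "user_id": params.get("uid"),
    }

event = parse_pixel_hit(LOG_LINE)
```

The tracking data rides along in the pixel request's query string, so "collection" is nothing more than serving a static file and keeping the logs, which is what made the design feel so simple and powerful.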
It's a game-changer.
I mean, there's a lot you've said in there
that's kind of interesting.
And again, a lot of this was new to me
when I came to work at Qubit where I am now.
You mentioned there a couple of things that are interesting. One was the JavaScript tracker, and I think that's
something that most people from my old world wouldn't know, so it'd be worth
maybe elaborating a little bit on what that is. But the other thing is event-level tracking, and
that seems to be the thing. Like you said, the ability to track each
individual behavioral interaction on a website, as opposed to page views and so on,
that level of extra detail is a massive difference, isn't it?
What's the benefit, as far as you're concerned, of going down to that event level, really?
What did that really open up for customers and for the industry? So with getting down to event-level data, the key benefit is the ability to determine how you want to aggregate up the data.
So if you're working with session-level data, or if you're working with user-level data, you're fundamentally limited in what you can do with the data, because there's a load of logic that has been applied to get to that aggregate data set. So if you don't agree with the logic,
if the logic doesn't fit with your business, then you're sort of stuck. And in the case
of web data, really the most obvious way that that aggregation didn't work
was that all that aggregation was done based on cookie IDs, this idea of a
unique visitor. That really just gives you a view of a
particular browser. But what we wanted to do, in nearly all cases, was understand people, and
typically people engage with sites across multiple devices.
And so the ability to define who a user is, and be able to accurately measure what that user's doing
across different devices, is really, really powerful, especially if you want to start
joining that behavioral data with other user-level data.
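The point about choosing your own aggregation can be made concrete with a tiny, invented example: the same person seen on two devices. Per-cookie aggregation, the classic web analytics view, counts three "visitors"; aggregating the same event-level records by a login-based user ID counts the two actual people. The records below are hypothetical.

```python
from collections import defaultdict

# Hypothetical event-level records: "alice" appears on two devices, tied
# together by a login-based user ID rather than a per-browser cookie.
events = [
    {"cookie_id": "c-aaa", "user_id": "alice", "device": "laptop", "event": "view"},
    {"cookie_id": "c-bbb", "user_id": "alice", "device": "phone",  "event": "purchase"},
    {"cookie_id": "c-ccc", "user_id": "bob",   "device": "laptop", "event": "view"},
]

# Because we hold the raw events, we can aggregate either way after the fact.
by_cookie = defaultdict(list)
by_user = defaultdict(list)
for e in events:
    by_cookie[e["cookie_id"]].append(e["event"])
    by_user[e["user_id"]].append(e["event"])
```

With only a pre-aggregated, per-cookie data set, the per-user view on the right-hand side would be impossible to reconstruct; that is exactly the limitation Yali describes.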
We also found a load of limitations with sessionization.
The whole idea of a session sort of dates back to the web in the 90s,
where you could just do one thing at a time.
There were no multi-tabs.
There wasn't that much you could do apart from click around from document to document and um and the and so the definition
of a session is you know i've stopped doing i haven't done in 30 minutes and when i start doing
something again i must be doing something new that might have worked back then but it really
it really doesn't work in in in today's world and so for all sorts of analysis the the aggregate
data that we were dealing with um out of out of google
analytics by and large was uh it was very hard to do what we wanted with it okay and you said i mean
you've been talking about i mean right back at the start you talked about not just kind of i suppose
web websites for tracking things on there all forms of digital interaction i mean just what
what's the what what was the kind of next area you moved into so you're obviously tracking uh you know uh stuff for a website first of all but how
quickly did it kind of expand into other digital interactions as well and sort of multi-channel
stuff and multi-device um so there were two so um we we went on this evolution we started very much
as a data warehousing solution for web data,
sort of clickstream data warehouse.
And then in terms of moving from that to a general event data collection
across platforms and channels, there were two bits to that.
So one of the early customers that we worked with was in the games industry,
and they wanted to track, as all games companies do,
event-level data out of the game. So instead of loading a page, viewing a product, adding
a product to basket, buying the product, they were interested in building a castle,
planting some crops, forming an alliance, declaring war. The underlying events were just
totally different, and the data points
that they wanted to collect with each of those events were totally different. And what
we realized, which seems obvious now in retrospect but took us a while back then, was that the
structure of data that we were used to dealing with, we'd sort of accepted as the default, but that had been
baked into the web analytics tools. And again, that was a structure of data that
was built around the web in the 1990s. And in the 2000s and 2010s, when you had mobile, and you
had people doing much more through digital platforms.
You live your whole life through digital platforms.
You flirt and fall in love on dating sites.
You manage your finances on finance apps.
You plan your holidays.
You manage your health.
You track where you run, where you cycle, all these different things.
The activities that you conduct on these platforms
are much, much more varied. And so the data that describes those activities needs to be much,
much more varied. And the structure of that data needs to be much, much more flexible. And so
what we built into Snowplow, and it's the functionality that I'm sort of proudest of, is the ability for each of
our users to define their own events and their own entities. So if you're a games company, you can
say: this is the universe of things that a player can do in my universe, and these are
the data points that I want to track with each of those events; these are the different entities
they might interact with, so I want to describe those entities,
and these are the different properties of those entities. Then obviously we've got a load
of standard definitions, all schemas effectively, that are publicly
available, but each of our users can define their own. It means that two Snowplow users
can track radically different user
journeys, and you can look at the two data sets and they look totally different, even though
they're being collected through the same underlying technology. So that was one
big thing, because it meant suddenly you could collect data that described events that
didn't look like web events. But the second thing that
we needed to make possible was to collect data from a non-web environment. So, you asked me about
the JavaScript tracker earlier. Web tracking has traditionally been done primarily via a
JavaScript SDK that sits in your browser and listens for things happening: changes to the DOM, web
pages being loaded.
And when those things happen, it triggers requests to your data pipeline.
And in those requests, it posts the data that describes those events that have occurred. There was an alternative model of sort of parsing the logs
that the web servers produced to serve the website,
but JavaScript tracking is primarily the way that web tracking is done.
So, to make it possible to track events from other locations,
we had to release a whole host of other trackers; we call them SDKs, so that you could track events from any kind of environment.
That's, you know, mobile with an Objective-C tracker and an Android tracker and then a whole host of server-side trackers.
You know, Ruby, Python, Scala, Java, PHP, and all the rest.
And so the combination of having trackers
for all your different environments
and the ability to define your own events
meant that suddenly this was a generic event data capture platform.
It wasn't tied to a specific platform,
a specific type of event data.
And that was really, really, really important.
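The "define your own events and entities" idea can be sketched as a self-describing event: the payload carries a reference to the schema that defines it, so two companies can push completely different events through the same pipeline. The schema URI and field names below are invented for illustration (Snowplow's real schemas live in its Iglu registries), and the validation is deliberately minimal.

```python
import json

# A sketch of a self-describing event for the games example in the
# conversation. The schema URI is made up for illustration.
event = {
    "schema": "iglu:com.examplegames/declare_war/jsonschema/1-0-0",
    "data": {"player_id": "p-42", "target_alliance": "northern-realms"},
}

def validate(event, required_fields):
    """Minimal check that the event's data matches its declared shape."""
    return all(field in event["data"] for field in required_fields)

ok = validate(event, ["player_id", "target_alliance"])
payload = json.dumps(event)  # what a tracker/SDK might send to the pipeline
```

The key design choice is that the pipeline itself stays generic: it routes and stores any payload, and the schema reference is what lets downstream consumers know how to interpret each event.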
So you mentioned, right back at the
start, tying identity across different channels, so that
when you go in via your phone, or you do something else, or you go in by different
routes, you can tie this together and understand the kind of bigger picture. But that
cross-device understanding of identity, that's quite a challenge, isn't it? How do you
approach that? And is that still a problem, do you think, or is it something that's now been solved? What are your
thoughts on that?
No, I think it's a solvable problem, but it's quite difficult to solve in
a generalizable way. What we find is that each one of our clients will solve it in a slightly different way.
And there are patterns that you can sort of spot across them.
So in terms of the technology, what our tech lets our clients do is collect very, very granular event data from all different platforms and channels. And with each event, we do
our best, through a combination of automatic tracking and very, very flexible schemas,
to enable our customers to collect the broadest range of device and user identifiers
possible. So on the web, we make it easy to automatically track a first-party cookie ID, a third-party cookie ID,
a fingerprint if you want, an IP address,
all those different things.
And then if the client has ways of generating
their own user IDs, and often they do
if a user logs in or if there are ways
of getting users
to identify themselves, then that can be passed in. So step one
is making it possible to record all those different identifiers. And then step two
is building a process, some business logic, that ties all those different identifiers together.
And so solving the identity stitching piece is two steps, really.
It's first working out how to reliably identify users. A lot of that comes down to the relationship between our clients and their customers, the consumers that use their service, and coming up with ways to really incentivize their users to reveal who they are,
so that the user gets a better user experience.
You tell us who you are, and then because we can recognize you across different browsers and platforms,
we can provide you a better, more tailored personal experience or whatever that is.
And then it's the technology being flexible enough
and granular enough that you can implement your own identity stitching algorithm
on top, to build that single customer view. For other customers that don't
necessarily have a direct relationship with their consumers, they might need to integrate our tech with different identity providers and services: services that are good at fingerprinting, or
services like Parable that are good at spotting a single mobile device across different apps and
browsers, or services like Drawbridge that are good at mapping different third-party cookie IDs
and saying, you know, all these different third-party cookie IDs are really the same user.
Okay, so you said back at
the start you put all the code on GitHub, and I know it was an MVP at the time; it was
something that, you know, you were surprised it worked and it solved the problem and so on. But
what's the commercial model now for Snowplow? I know your code and your product are
available, I think, on an open source or freemium basis, but how does it work now? How do you make money and grow, I suppose,
really?
There are two sides to the business. So there is a paid-for
platform, which is built on top of the open source, and I can talk a little bit about the
difference between the paid-for platform and the open source. And there's also a professional
services side of the business. It's pretty common as an open source company
to have a professional services component. With Snowplow, that is particularly important, because
the tech doesn't solve a specific business problem.
The technology is very horizontal.
It provides a foundation that then makes it easier to start solving these problems.
It's still not necessarily easy to solve any of these problems.
So if you want to do attribution right,
if you want to better serve your customers,
if you want to better use data to drive your product development process,
then collecting a very, very high-quality data set
is sort of step one, and Snowplow does that.
But then taking that data and using that data
to solve any of those problems is a big step,
and it's not necessarily an easy
step. So having a services team that can go into our clients and help and show them how to use the
technology to solve those problems, and to solve them quickly and iteratively and demonstrate
business value, is really important for a lot of our clients, because it makes our tech more accessible. And it's really
important for us, because it means we're not limited to selling
into the types of data-sophisticated companies that have big internal data teams that know what
to do with what is effectively a really big fire hose of data that our tech can deliver.
We can actually help them solve specific goals
and help develop their internal competency around that data.
So services are really important.
And then the paid-for platform is really important.
So the idea there is that the open source is scalable
and it's robust, and we're committed to open source.
We sort of firmly believe that as a company,
one of the most valuable things you can do is collect data
and you should own that data and control that data
and you shouldn't have to rely on a vendor.
You shouldn't be at the mercy of a vendor
who can potentially lock you in, because that customer data is potentially
your secret sauce, and the intelligence you build on it should hopefully be your
secret sauce too. You also owe it to your customers, your data subjects, to
collect data right and use that data for the mutual benefit of you and your customers, and not have a vendor tell you how to do that.
So having open source is really important.
We think there needs to be an open source platform that lets people do that.
What we found, though, is that to use our open source successfully,
companies need to be pretty good at the data engineering and DevOps side of things.
And the idea with the paid-for platform is that we're giving the same power and control to companies
that either don't necessarily have that type of expertise, or, if they do, they want to use those resources for something else. We
want to make it possible for an analyst, or a data-savvy marketer, or a product manager or product team
that aren't necessarily rich in data engineering and DevOps resources,
to take control of their data the way that the open source lets the more engineering-led companies do. So we've built a UI and a set of hosted services, so that the
whole experience is much, much easier. They sign in, they want a new pipeline,
everything's set up in a couple of hours, and they can focus on actually doing things with the
data rather than setting up and running the pipes.
Okay.
That sounds very similar to Imply; I had the founder of Imply on the show, I think.
And they've done a similar thing with Druid.
They don't really manage the pipeline as such, but they certainly offer it as a managed service.
And they solved the problem for me of how to get Druid working in a reasonable amount of time.
But they take care of a lot more of it as well.
And that's a good model, actually. I mean, it's a model where you get the feedback
into the product and the actual core SDK, but you also help people who, you know, want to focus on
other things, really, rather than actually data engineering.
Yeah, I'm a big fan of
the Imply guys, and yeah, the model is very similar. And actually, we're pretty keen
on getting an integration between Snowplow and Imply.
Interesting. That'd be really interesting. Yeah, that'd be interesting.
So moving on a bit, I watched a couple of your YouTube videos where you've been presenting on Snowplow.
And there was one that really resonated with me, where you talked about, you said, digital analytics is really interesting but really hard.
And that resonated with me because it's very true.
The models are complex.
The use cases are complex.
Why did you say that?
What is it about digital analytics that makes it surprisingly hard
to be productive and successful with?
That's a big question.
I think there are a few different reasons.
The first is digital data is very heterogeneous.
So digital experiences, as we talked about earlier, are really broad.
The types of experience that you have on a jobs board are nothing like the types of experiences you have when you're browsing your national newspaper in the morning. And so the types of questions that you're going to have of that data are going to vary. Even if you're a marketer,
there might be similarities about the questions you ask, if you're trying
to drive more companies to advertise on your jobs board, or more applicants to come,
but they're going to look different, and the way you're going to use that data to answer those questions is going to be different, than if you're
a newspaper that's ad-funded or subscription-based or both.
So it's domain-specific, isn't it? It's very domain-specific.
You know, if you're going to work as an analyst in digital marketing,
for example, you've got to really know your stuff there. It's a very specialized area, with
very specialized kinds of questions you have to answer, and data models, and that sort of thing.
You've got to understand the business and the business context and all the things that you have to do if you want to be any sort of analyst.
But on top of that, there's just a huge amount of technical knowledge to acquire.
You can't really work in digital unless you've got a good working knowledge of
your JavaScript and how data is collected from there. Typically you're combining that with other
data sets, so you need to understand where that data is collected, you need to understand
how that data has been processed, how that data is being surfaced. You've then got a plethora of
tools to take that data and try and answer your question, and that's where the
sort of second challenge comes up. As analysts, we benefit from
decades of work developing analytics techniques, going back to the
development of statistics, and then you've got all these different technologies and machine learning and so on.
And as analysts, we feel like there's a whole bunch of tools: we're really empowered to go and do things with data.
But actually, a lot of those tools don't work particularly well with event data.
So if you're working with data sets that are measurements,
so you're running an experiment
and you're measuring the outcome of that experiment,
you're measuring things like temperature or whatever,
or you're dealing with a very uniform data set
like transactions, for example,
then things are very, very easy.
You've got a really wide range of statistical techniques.
You can compute averages, mins, maxes.
You can pivot the data.
You can run all kinds of statistical techniques, regression, et cetera,
on the data.
There's loads you can do with it.
If you're dealing with event data, actually, you can't do any of those things with the underlying data.
When you're looking at event-level data, you know that somebody did something and then
somebody did something else and somebody did something else.
Maybe they're making decisions, or maybe they're trying to get to a particular end point. It's often not clear who the person is carrying out these actions, it's not clear what they're trying to do, and it's not clear to what extent what they're doing is driven by the design of the digital product they're engaging with, or by their intention. And you can't answer those questions with any of the techniques that I've described. Even counting events, or averaging them, doesn't really make sense. All the usual functions that we use on data don't actually work on event-level data.
You want to take this data and you want to start figuring out who these people are, what
are they doing, and then start measuring them by are they successful at what they're doing?
Are they failing at what they're doing?
If they're failing, why are they failing?
If they're more likely to go one way than another way, why is that?
And for that sort of question, there aren't techniques or approaches or tools that just let you answer those out of the box.
So do you think that's why people still use SQL a lot within this kind of industry? It struck me, because I was used to everybody in my world using a graphical tool, doing the very simple aggregations and so on that you talked about, but I suppose the nature of a lot of the data, and the complexity of those questions, and the fundamentally different way of coming to your conclusions, meant that SQL was used a lot more than I expected. Did you find that at all as well?
Yeah, I totally agree with that, and I think that's exactly why. SQL is just one way of solving that problem,
and it's actually not a very elegant way, because it's very hard to write a SQL query that says, show me all the people who did A, then B, but then didn't do C. SQL is still built around grouping things and computing sums and mins and maxes and count distincts and all the rest. You can start doing some of this analysis with SQL, which is why people do, and you end up writing these sort of horrendous window functions. That's ugly as sin, and there just aren't other tools to let you do this stuff easily.
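As a small sketch of the kind of sequence query he's describing (using SQLite in memory and invented event rows), even the simple "did A, then B, but never C" case already needs a self-join on timestamps plus a correlated `NOT EXISTS`, rather than a plain `GROUP BY`:

```python
import sqlite3

# Made-up event data: (user, event name, timestamp).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id TEXT, event TEXT, ts INTEGER);
    INSERT INTO events VALUES
        ('u1', 'A', 1), ('u1', 'B', 2),                   -- A then B, no C
        ('u2', 'A', 1), ('u2', 'B', 2), ('u2', 'C', 3),   -- did C as well
        ('u3', 'B', 1), ('u3', 'A', 2);                   -- B before A
""")

# "Did A, then later B, and never C" as SQL: a self-join for the
# ordering constraint, and a correlated subquery for the exclusion.
query = """
    SELECT DISTINCT a.user_id
    FROM events a
    JOIN events b
      ON b.user_id = a.user_id AND b.event = 'B' AND b.ts > a.ts
    WHERE a.event = 'A'
      AND NOT EXISTS (
          SELECT 1 FROM events c
          WHERE c.user_id = a.user_id AND c.event = 'C'
      )
"""
users = [row[0] for row in conn.execute(query)]
print(users)  # → ['u1']
```

Extending this to longer funnels, or to "what did they do instead?", is where the horrendous window functions tend to come in.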
So I think that's why people are using SQL. That's why people are using Spark, that's why people are using R and Python, and so on. If you want the freedom to analyze the data in the way you want, you don't want the tool telling you what to do. Those are the best tools we have, but it's still not easy.
Okay. Another quote you had from your presentations: you said that in digital analytics you need smart people, and that there are companies out there building tools to try and make it possible for people who aren't smart to do digital analytics, but that Snowplow is about enabling smart people. I think we just covered that first bit, by saying the tools out there now aren't really suitable, but how is Snowplow enabling smart people? What's the kind of thing it particularly does, as far as you're concerned?
So for me, the tension is always between democratizing data, making it possible for more people to do things with data, and not dumbing the data down.
So an easy way to democratize data is to dumb it down.
If you hide a lot of the complexity, if you make it seem simpler,
then it's more likely that more people will start doing things with the data.
But the problem is, if you hide some of the complexity, some of the intricacies in the data, then you're disempowering the smart people. Smart people is sort of a loaded term, but I mean people who want to do data right, people who are like, I want to know if there are irregularities in the data, because that's an opportunity, potentially there's something there for me to understand. So we don't want to hide anything. Empowering smart people, for us, means, rather than dumbing anything down, rather than hiding anything, making it possible for a smart person using the data to view any part of the data, any part of the processing, and to understand exactly what's going on.
So, for example, a silly example, but it illustrates the point.
We don't filter out suspected bots and spiders
from the data set that we deliver.
We just label them or provide tools so that our users can see, hey, this
is probably a bot or a spider, because that's interesting.
That might mean there are a whole bunch of use cases where actually you might be interested in exactly how many robots are crawling your website and what they're up to.
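As an illustration of that label-don't-filter idea (this is a hypothetical sketch with an invented user-agent heuristic, not Snowplow's actual enrichment logic), suspected bots get flagged rather than dropped, and the choice of which view to take stays with the analyst:

```python
import re

# Crude, hypothetical heuristic: flag user agents that self-identify as crawlers.
BOT_PATTERN = re.compile(r"bot|spider|crawler", re.IGNORECASE)

# Made-up page-view events.
events = [
    {"user_agent": "Mozilla/5.0 (Windows NT 10.0)", "page": "/home"},
    {"user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)", "page": "/home"},
    {"user_agent": "AhrefsBot/7.0", "page": "/pricing"},
]

# Label every event; delete nothing.
for e in events:
    e["suspected_bot"] = bool(BOT_PATTERN.search(e["user_agent"]))

# Downstream, different users take different views of the same full data set.
humans = [e for e in events if not e["suspected_bot"]]  # reporting human traffic
bots = [e for e in events if e["suspected_bot"]]        # studying the crawlers
print(len(humans), len(bots))  # → 1 2
```

Both questions — "how many humans visited?" and "who is crawling us?" — remain answerable, because the filtering decision was deferred rather than baked in.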
And if some of them are committing ad fraud, or scraping your data for nefarious purposes, that's really important, and you want to know about it. So it's tempting to sweep that under the rug, to filter them out, so that when you're reporting the number of users you're excluding them, and that's the right thing to do if you're interested in the number of human people on the website.
But we don't want to make any assumptions for our users.
Our users are smart,
and we trust them that if we give them the full data set,
and if they want, they can come to us for guidance,
they'll know what to
do with the data to treat it uh right to answer the specific question that they want and depending
on the question that might involve slicing and dicing the data different ways ignoring different
subsets of of the data but it's presumptuous of any vendor to do that ahead of speaking to a user
of understanding the question the question that a user wants to ask or um uh and even the assumption that there's there's one kind of
canonical way of presenting the data that's going to meet all that user's needs is is is pretty
outrageous you're you're really disempowering your user even though you're making it easier
because you're effectively dumbing things down okay so so the last thing i want to talk to you
about was was taking this forward.
I mean, it sounds like you've done fantastically well to build something that's got this great open source stroke commercial model, and you sound like you're solving the problem really well. But then there comes the question of growth, and how you perhaps compete with the Googles and the Adobes of this world. People might say to you, for example, well, this is good as a pipeline, but do you do A/B testing on top as well? What's your strategy going forward for competing, staying relevant, growing, and that sort of thing?
Well, that's another really big question. Where do I start? What's exciting about the space that we're in is that there's so much scope for innovation. In the time we've been around, if you think about how far the industry has come: when we started, the number one question that I'd get going around different digital analytics conferences was, why would I want to warehouse my data? What report does that let me run that I can't get out of Google Analytics? And we never get that question anymore. Even Google, by launching Premium, which became 360, acknowledged that there's a ton of stuff you can do if you can access the underlying data, especially in a cloud data warehouse that lets you query it flexibly and in a performant way. So I think people today widely recognize the value of warehousing your data. But if I look around, I think there's still a whole bunch of challenges that companies have around acting on the data in real time, and making decisions in real time. The situation today is that there are a lot of people who understand the value in being able to do that, and have the aspiration to do that, but it's technically very difficult. It requires a lot of engineering that most people in marketing or in product development don't necessarily have access to. That reminds me of the state of digital analytics six years ago, when it came to the warehousing piece.
As an industry, people have got much, much better at the idea of warehousing their clickstream data, their digital data, their mobile data, their web data, etc.
Joining that with other data sets and doing things, that's now a well-trodden path.
I think we need to go on a similar journey with real-time data processing.
So I think that's a big opportunity for us at Snowplow.
I think coming up with ways to solve some of those really difficult problems that make digital data hard to work with is a really fruitful area of research.
So we talked a little bit about some of the limitations
that even SQL has in letting you work with this underlying data.
And we're doing a fair amount of R&D with graph databases, for example,
to see if that's a better paradigm potentially for working
with this data. So that's another area that we're really excited about. And then I think there's a
lot that we can do in the short term to empower more product teams, more marketing teams,
more editorial teams that want to be data-driven
and don't have necessarily that engineering resource.
So we're starting to provide a better experience for those users,
but there's a lot, lot further for us to go.
So there's a huge amount of interesting problems to keep solving, and hopefully the more of them we solve, the more Snowplow grows.
Yeah, you mentioned graph databases. I was chatting with, I think, Nick Schrock, who was behind GraphQL, about potentially coming on the show at one point, and we were talking about the use of GraphQL in this kind of context, defining over APIs, over schemas, and using it for analytics and so on. That sounds interesting. Where do you think the potential is with graph databases, around that sort of thing?
What would it be able to solve for you, do you think? I think it would make it a lot, lot simpler to query event data. So if you model it
in a graph, then running those queries, show me all the users who've done A, then B, then haven't done C, and then go and do something with those who've just done A, they sort of fall out of a well-designed graph model a lot more easily than writing the SQL queries.
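As a sketch of that idea (plain Python standing in for an actual graph database, with invented event chains), once each user's events are modeled as an ordered chain of nodes, the "A then B but never C" query becomes a simple walk rather than a self-join:

```python
# Hypothetical graph-style model: each user node points to an ordered
# chain of event nodes. A query is just a traversal of that chain.
user_events = {
    "u1": ["A", "B"],        # A then B, never C
    "u2": ["A", "B", "C"],   # did C as well
    "u3": ["B", "A"],        # B before A
}

def followed_path(chain, first, then, never):
    """Walk the chain: `first` must be followed later by `then`, with no `never` anywhere."""
    if never in chain:
        return False
    try:
        # Look for `then` strictly after the first occurrence of `first`.
        return then in chain[chain.index(first) + 1:]
    except ValueError:  # `first` never happened for this user
        return False

matches = [u for u, chain in user_events.items() if followed_path(chain, "A", "B", "C")]
print(matches)  # → ['u1']
```

The traversal reads almost like the question itself, which is the appeal he's pointing at; a real graph database would add indexing, persistence, and richer relationships between entities on top of this shape.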
And that means that aggregating over the data in that graph world might be a lot easier. It might be a better paradigm for writing those jobs, and it might also be computationally more efficient. But I'm more excited about it because I think it's a very natural way of modeling event data, and it's a richer way, because you can start modeling the relationships between the entities and how those relationships change over time.
Interesting. Well, I'm conscious of time anyway
for you, but it's been fantastic to speak to you, Yali. Obviously, having read about what you do and the products and so on, it's great to speak to the brains behind it. How do people find out more about Snowplow, if they want to maybe download the product and get started, that sort of thing?
Oh, they should come to our website at snowplowanalytics.com and check out our GitHub at github.com/snowplow/snowplow.
Excellent. That's really good.
Well, it's been great speaking to you.
Thank you very much for coming on the show and have a good evening.
And yeah, thanks. Thanks, Yali.
Thank you.