Drill to Detail - Drill to Detail Ep.33 'Building Out Analytics Functions in Startups' With Special Guest Tristan Handy

Episode Date: July 3, 2017

In this episode Mark is joined by Tristan Handy from Fishtown Analytics to talk about building-out analytics functions in high-growth startups and three related blog posts on this topic....

Transcript
Starting point is 00:00:00 Hello and welcome to another episode of Drill to Detail and today I'm joined by Tristan Handy from Fishtown Analytics who I got to know through the world of Looker but then I found operates in the same sort of startup space that I work in but in a slightly different way to the way I work. So Tristan, why don't you introduce yourself to everybody on the show and let us know kind of what you do and how you got here. Sure. Thanks so much for having me. It's great to be here. My name is Tristan Handy and I'm the founder and CEO of Fishtown Analytics. I've been working in data for about a decade and a half now, and I guess I've been focused on startups and data for, I guess, since about 2009. I was the first analyst at Squarespace back when Squarespace was a tiny little company. Helped them raise their first big A round and then went on to be the executive at two different startups.
Starting point is 00:01:08 Most recently at RJ Metrics, I ran the marketing team. And we kind of participated in this real fascinating development of the BI tech stack over the past five years. And so developed a lot of strong opinions about BI technology and how analytics should be done. And I left with three other folks and we've started our own consulting company to put some of those ideas in practice. Okay. Okay. So you say Squarespace, it's interesting. All of my websites run on Squarespace. So it's brilliant, isn't it? It's really, really good. I mean, it's a great design and the IT behind it is good as well. And so that's kind of interesting as well. But what you do, Tristan, is interesting because we've come across each other through,
Starting point is 00:01:50 I suppose, the kind of the data engineering kind of conversations and various kind of, I suppose, new world BI development kind of conversations. But you actually provide analytics consulting to the actual startups themselves, don't you? Yeah, that's exactly what we do. The problem that we found that most startups had, and most of our clients at RJ Metrics were startups, they didn't have a software problem, they had a people problem. There's just not enough talent out there today that knows how to operate modern BI technologies. There's a very deep enterprise space, and IT consultants abound in that universe. But startups are kind of new to the BI space because the technology didn't exist previously for them to play with it. It was, you know, write queries on MySQL or build spreadsheets or nothing. So we're trying to fill this gap of
Starting point is 00:02:53 helping startups deploy this new technology. Okay, okay. So yeah, it's a subtle difference there, isn't it? I mean, the work that I'm currently doing at the moment is I'm working with a startup to build analytics products that they then offer to their customers, but you're actually looking at the analytics those startups themselves use. And it is interesting, there were three blog posts that you wrote recently, and these are the real kind of reasons I wanted to speak to you. There were three very kind of, you know, opinionated, which is good, blog posts that you wrote about the sort of things you do and the problems you're seeing in the market. And the first one was, I suppose, a startup founder's guide to analytics. So that was a kind of, we'll go into it in a second, you know, there
Starting point is 00:03:27 was a good kind of set of steps in there, and setting the kind of scene for that. And then we had the steps to setting up a modern SaaS-based BI infrastructure, which is very kind of relevant. And then you talked about the kind of workflow itself within, you know, within startups. And I thought all three of them were very good and actually very relevant to stuff that I was working on at the time. So for the first of those posts that you did, the startup founder's guide to analytics, just summarize for the listeners what that was about and what was the motivation to write that, and then we'll go into some of the details. Yeah, so I kind of alluded before that this is all kind of new stuff for startup folks. And so I feel like a lot
Starting point is 00:04:08 of the software vendors in the space want you to just kind of dive in and do it. And to a certain extent, that's good. It's good to have a bias for action. But there's a lot to know. And frequently, we've found folks kind of doing things out of order, caring about the wrong thing at the wrong time, like hiring a data scientist before you had any data to analyze just because you thought that you needed a data scientist. [...] you know, track a business from zero to 500 employees and say, what software should you buy, what people should you hire, and how should you be doing analytics at these different phases? Okay. So in the blog post you went through the kind of phases in growth, I suppose, for a startup, and you talked about, you know, what was appropriate at those phases. So let's kind of start at the beginning. So if you think about the founding stage, really, so this year I'm talking
Starting point is 00:05:08 just to be clear, we're talking largely about, say, e-commerce and sort of web and digital kind of startups here. But, you know, the founding stage: describe that, and describe what mistakes you see, but also what is appropriate in that kind of phase. Sure. I think that a lot of times in the early stages, either your problem is data collection, you haven't instrumented anything, or it is doing too much. I think that a lot of startup founders are very data-driven naturally. They want to know what's going on with their business. And I think that sometimes they overdo it on analytics too early, which creates this maintenance burden where maybe you work a lot on a given analytics setup and it's working great, but your business changes over time. And if you don't have the manpower to kind of keep that up to date, then the reports will get stale and no one will look at them and you'll have just wasted a bunch of time.
Starting point is 00:06:13 So things like a data warehouse and enterprise BI tools and so on would not be appropriate at this stage, you're saying? I don't think so. I think that if you've got fewer than 10 employees, you should install Google Analytics. Make sure that you've done a decent job of that and then, you know, do what you can with it. Okay. So actually Google Analytics is an interesting topic because, again, coming probably more from the enterprise world myself, I suppose I was unaware of how ubiquitous Google Analytics is and how much value there is in that as well. I mean, just for people who are listening to this who aren't
Starting point is 00:06:43 from the kind of e-commerce world, just describe a little bit about Google Analytics and why it's so good, really. Gosh. Google Analytics is the most loved and most hated analytics product in the world. It's the Microsoft Access of e-commerce, isn't it, really, you know, in some respects? Or Excel, in a way. For sure. I think that the basic GA implementation is you install the tracking pixel on your website and it tells you visitor behavior. You can go deeper than that with Universal Analytics. You can install it in your mobile application. You can get some more sophisticated reports. [...] a tool that, unless you pay for GA Premium and get all the data loaded into BigQuery, it's a visual tool and you'll run up against the ends of the universe in terms of what kinds of
Starting point is 00:07:32 questions you can ask it. Okay. So would you at this point expect the startup to hire somebody to work with analytics at this point, or is it going to be a founder task? I think it's a founder task at this point. Maybe you've got a marketing person who's doing a bunch of marketing analytics, but mostly it's your founders get stuck with this. Okay, so next stage then, very early stage you put it. I mean, you talk in there about what you do at that stage
Starting point is 00:07:57 and you mentioned things like net promoter score. I mean, what is different about this next stage and what happens there really? Sure, so your team's growing a little bit and you're probably not speaking to each of these people every day. So you need to be focused a little bit more on empowering these people to do their jobs. And in the future, that's going to take the form of a BI stack. It's going to look more like a data warehouse and a BI tool. But for now, I think these people have jobs to do and most of them are not gonna know SQL.
Starting point is 00:08:30 Most of them are not gonna really have any BI skills. So what's important to do is hold them accountable to use the reporting in the tools that they use every day. So if they're a salesperson, they need to build reports in Salesforce. I think a lot of folks export the data and they go to town with Excel. And I think that's really a terrible idea. Okay, so we've got a couple of stages next, you've got early stage and mid stage in your
Starting point is 00:08:57 blog post. And I guess this is kind of where it gets interesting. So this is potentially kind of where someone like you might come in, I guess. This is also when you start to have people talking about things like, let's redo the analytics in a kind of more structured way and so on. You've then got this whole new thing about data engineers as well. When you go into these places, what do you see as common mistakes, and how do you sort of make sense of it all and get them in the right direction, really? What's the tricks to it all?
Starting point is 00:09:23 Sure. What's the tricks to it all? Gosh. Well, what's the problems you see? For myself, I have a job. Do you find sometimes that people are a little bit too clever for their own good when it comes to these sort of things or what? Yeah, totally. So I think that there are a lot of kind of standard principles that you should think about when you're doing this stuff.
Starting point is 00:09:54 One is write as little code yourself as you can. If you want a BI stack today, the reason that you have the ability to even have a BI stack at 25 employees is that there are so many tools that you can just kind of pull off the shelf and all the integrations just kind of work. So there are some founders and engineers at this stage that have a bias to build it themselves. I think that's one of the things that we see people make a big mistake on. So I think that step one here is plugging together various tools and putting your data stack together. Then step two really is hiring. In order to kind of get this stuff in a good place, it can't just be, you know, 10% of everyone's job. There has to be a person who's pushing this forward. And this is a thing that I've started thinking
Starting point is 00:10:46 a lot about because they're just, I really think that hiring is the biggest problem in analytics today. And it doesn't matter if it is a very large enterprise or if it's a young startup, there's just not enough people who know how to do this stuff. Yeah. I mean, I think there's a distinction as well. I mean, there's a distinction between analytics and things like machine learning and data science and so on. But there's also a set of one thing I found doing this kind of work myself is that there are a lot of things that are different going into this world. And I think that you need to be kind of open minded to actually sometimes it does make sense to build things. There are areas of kind of this work.
Starting point is 00:11:24 There are areas of this world that you've not heard of before; for me it was kind of things like e-commerce analytics and so on. But there are also kind of eternal truths, really, as well. And I think something I found is that you end up rediscovering a lot of the things that are these eternal truths doing this kind of work. I mean, having to make the case for analytics is quite an important thing as well, isn't it? Have you found that when you go into places, there is actually a general lack of understanding of what the value of analytics is for a company, really? So we don't have that conversation a lot. It's not that there's no one in the world who still thinks that, but we just don't find ourselves in those conversations, which is good because I feel like, you know, maybe that's a fair conversation to have ten
Starting point is 00:12:08 years ago but if you're still thinking about that today then you just haven't lived in a world where people are using data well or perhaps you've done it but you've had a I suppose I mean like you say very few people within this world have not used analytics but maybe they've done it and not found it to be actionable or valuable and so on I mean have you found sometimes that you do need to kind of go through and establish some of these basic things and think about how, you know, even things like basic things like planning and budgeting processes, or kind of, you know, how do we do internal reporting and so on? Is that part of what you do as well? So we're very focused on the pure, like, BI part, which is, you know, counting things and adding things.
Starting point is 00:12:48 And, yeah, totally. The work that we do is an input to financial models that get put together by CFOs and shared with investors. But we're generally not making those forecasts. Okay. Okay. So what about, I mean, again, something I found interesting is I suppose startups will often focus on like you say building something that's kind of new and they'll be using for example I don't know sort of airflow or stuff like that and actually maybe actually a more traditional technology
Starting point is 00:13:16 and a more kind of a more kind of I don't know you know easy to understand out of the shelf technology better and what's your thoughts on that, really? On airflow versus? Well, just in general, the fact that a lot of startups will think about the engineering tasks rather than actually what they're trying to do with this. Got it. Yeah, I really, I do agree that it is possible to get lost in the technology kind of forever. But the, and when I'm talking about this question of hiring at this stage, it really is that person who bridges the gap.
Starting point is 00:13:52 It's very possible to find engineers, you know, there, we could all use more engineers in our companies, but they're out there. And there are also plenty of marketers. The question is, who can you find that can understand how to put a data stack together, but maybe not build it by hand themselves, and can understand how a marketing campaign runs, even if maybe they don't run them themselves. And you can combine those two sets of knowledge to actually do effective marketing analytics. And the same for finance or for operate whatever okay okay and you say that's the kind of role that also is a separate role to the person who is kind of doing building the reports and so on i mean do you think it's
Starting point is 00:14:33 important to still have someone out there building reports and so on working with that consultant it's quite hard to do both really isn't it um i think that the so at at the early stage, so maybe 40 employees, something like that, you hire the person who maybe you call them your head of BI. And maybe two or three years down the road, they'll have six people working under them and you'll call them a VP. But for now, you just call them your head of BI. That person is usually, maybe they've got an MBA, maybe they've got some Excel skills and some light SQL, and they're going to pick it up on the job because they're a super smart person. And they're going to build out the basic stuff themselves and then scale the team over time. Okay. And the final stages of this kind of blog post in particular, where you talked about mid-stage and kind of growth, and you talked about the importance in the mid-stage of SQL data modeling and governance and versioning.
Starting point is 00:15:30 I mean, tell us about that, and why is that important, and how do you introduce that into the conversation? Sure. So one of my biggest pet peeves has become copying and pasting. It is unbelievable how analysts are so used to copying and pasting. So, you know, you send an Excel document to somebody else, they use that as the starting point for their own Excel document, they build off of that. But if the core definition of a metric changes, all of these decentralized analyses are not going to get updated. And what ends up happening is that everybody has their own copy of the metrics. Nobody agrees with each other. And it kind of grinds all of this to a halt. So we, you know, if software engineers wrote code like that, literally we wouldn't have any software applications that actually worked.
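To make the copy-and-paste problem concrete, here is a minimal Python sketch, not taken from the episode, in which a metric is defined once and every report calls that one definition. The metric name `active_customers`, the order fields, and the 30-day window are all hypothetical choices made for illustration.

```python
from datetime import date, timedelta

# Hypothetical sample data; in practice this would come from the warehouse.
ORDERS = [
    {"customer_id": 1, "ordered_on": date(2017, 6, 20)},
    {"customer_id": 2, "ordered_on": date(2017, 5, 1)},
    {"customer_id": 1, "ordered_on": date(2017, 6, 28)},
    {"customer_id": 3, "ordered_on": date(2017, 6, 10)},
]

# Assumed definition of "active". Change it here and every report follows.
ACTIVE_WINDOW_DAYS = 30


def active_customers(orders, as_of):
    """Single shared definition: distinct customers with an order in the window."""
    cutoff = as_of - timedelta(days=ACTIVE_WINDOW_DAYS)
    return len({o["customer_id"] for o in orders if cutoff < o["ordered_on"] <= as_of})


def marketing_report(orders, as_of):
    # Reuses the shared metric instead of re-implementing it.
    return {"report": "marketing", "active_customers": active_customers(orders, as_of)}


def finance_report(orders, as_of):
    # Same call, so the two teams cannot drift apart on the definition.
    return {"report": "finance", "active_customers": active_customers(orders, as_of)}
```

Because both reports call `active_customers`, a change to the window is made in one place and can be tracked in source control, which is the opposite of each analyst carrying a private copy of the formula.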
Starting point is 00:16:27 And I really think that analytics is moving in that direction as well, where you need to think about your analytics applications as scalable pieces of software that you need SLAs, you need source control, you need to build them modularly. And copy paste has to just die. Okay, okay. So the reason that I think we got to know each other was because of Looker. And Looker is an interesting BI tool take on this kind of world, isn't it? I mean, obviously you've got in there the ability to put stuff in GitHub and so on, but you've also got the data modeling side and you've got the kind of SQL side and so on. I mean, you're using Looker currently aren't you in some projects and
Starting point is 00:17:07 what's your kind of thoughts on that, really? Yeah, I like Looker a lot, and one of the first blog posts I wrote after starting Fishtown was about how much we liked Looker. Yeah, that's what got my attention at the time. Yeah, yeah. So the nice thing for us is that we can think very structurally about, you know, what is your data? What does it look [...] end. And since we've built out that LookML model, they can drag and drop and create reports without having to think too much about how to optimize a Redshift query or anything like that. Okay. And so, I mean, this is an area that is your main focus of business.
Starting point is 00:17:58 I mean, how has it worked out building a business in this area? I mean, selling consultancy into a startup, I've always thought it's been quite a hard thing because people there are quite kind of build it themselves and smart how has it gone running a business and starting a business in that space yeah i you know uh uh last march i was talking to my wife about um hey maybe i'm gonna try to start this business and my goal was uh hey maybe i can pay my own salary I really had no idea what to expect because you know I ran several teams at startups and and we
Starting point is 00:18:33 had you know at startup sometimes you hire a design consultant sometimes you hire like a performance marketing like an AdWords consultant but but really startups aren't used to hiring consultants. The thing that has worked out really well for us is that we all come from this ecosystem. And so we know all of the people making the technology. And none of them want to have services businesses. They all want to build software. And so we have ended up getting I would
Starting point is 00:19:06 say a majority of our customers from the ETL tools, from the BI tools, from the data warehouse tools. Okay, that's interesting. So that actually is quite a nice lead into the second blog post you wrote. So you wrote one about, where are we here, I'm just going to find it, the steps and the tools in setting up a modern SaaS-based BI infrastructure. So just tell us a little bit, first of all, about, again, the motivation for this, but what's your general approach or general picture of what you do in this kind of space before we get into the detail? Sure. So data is data. Like, you can have a data workflow that's set up on top of CSVs in, you know, S3, and they can all be processed by Jupyter Notebooks. There's infinite ways of having a data pipeline. The reason that we do things in
Starting point is 00:19:58 the way that we do is because we need them to be very hands-off. We don't want to think about them. And as you get bigger and have more customized needs, maybe you want to do things differently. But these are recommendations for companies who are not trying to have a team of 25 people maintaining this stuff. So we always think of the analytic database as the center of all of your analytics.
Starting point is 00:20:24 Don't analyze data in Excel. Don't do random things in Tableau extracts. Step one is always get data into an analytic database. So the question then becomes, how do you get it there? And there's several off-the-shelf ETL tools, Fivetran, Stitch, Alooma. We use Stitch probably the most. And then once your data is in your analytic database, then you've got choices to make around what your BI tool is. So that's kind of the process, the way that we think about it. Okay. So there's kind of different layers to the stack you talked about, and it's been a recurring theme in a lot of the kind of podcasts we've been doing recently. But, so, we talked about the database first of all and, you know, in this blog
Starting point is 00:21:07 post you talked about using these analytic, elastic kind of MPP databases like Redshift and BigQuery and Snowflake. I mean, the company I'm at actually went off Redshift into BigQuery. Is Redshift still kind of popular out there? Is it still being used a lot in this startup space? From what I know, Redshift is still the 800-pound gorilla, and not just within startups. Like, I think that it is the dominant cloud database for all sizes right now. And it's hard to say that that's bad. Like Redshift, I think is, what, gosh, it's four years old now. And I think that mostly the way that it's showing its age is around concurrency.
Starting point is 00:21:57 So if you have very differing concurrency and load on your warehouse at different points in the day. You might, with Redshift, have to really make hard choices about who gets to use that resource and when. If you go with Snowflake or if you go with BigQuery, they do a much better job of solving that problem. Okay. Okay. Do you find that the, I mean, I know obviously not every business you work in is around data and so on, but do you tend to find that the databases they use for their internal reporting and analytics are the same as they use for the customer ones? I mean, or are they as good? I mean, what typically is that like, really? So for internal analytics purposes, I think that really the three that you named are the ones that are in use. We don't do a ton of work with customer analytics, you know, embedded stuff in applications.
Starting point is 00:22:53 But I think that there you do have a much broader array of possible options. You throw in Elasticsearch, throw in even hosted services like Keen.io. Keen has a great API for doing stuff like this. Now, you can absolutely spin up a Redshift instance and use that for embedded analytics, and we've helped folks do that. I don't think it's really built for that use case quite as much. Okay, okay. So we've got that. I mean, I think the database is a fairly kind of easy topic, really.
Starting point is 00:23:29 But ETL is a recurring topic we've been coming back to in this podcast recently. And I think it's been driven by a lot of the kind of move towards things like data engineering, things like Kafka, things like Apache Airflow we talked about earlier on. To my mind, ETL is the biggest area people can get themselves into, a bit of a mess, really, internally on projects. I mean, what's your take on doing ETL with an internal kind of startup projects and what's your tools of choice and so on? Sure, yeah, and I totally agree with you.
Starting point is 00:23:58 There's a lot of topics in there to unbundle, but yeah, I mean, what's your take on that? Yeah, it's so easy to just kind of like get yourself lost in the forest and be like, how did I get here? So we think about the three letters ETL in two different stages. We like to separate the E and the L from the T. There are a large number of reasons why you might not want to build your pipeline like this, but we load all raw data into the analytic data warehouse as stage one.
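The split described here, extract and load first, transform afterwards inside the warehouse, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for the analytic warehouse, and the table names `raw_orders` and `orders_clean` and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    (1, "2017-06-01", "1999", "complete"),
    (2, "2017-06-02", "500", "cancelled"),
    (3, "2017-06-02", "2500", "complete"),
]


def extract_and_load(conn, rows):
    """E + L: copy source rows into a raw table with no business logic applied."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(id INTEGER, ordered_on TEXT, amount_cents TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """T: build a cleaned model from the raw table, entirely in SQL."""
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT id,
               ordered_on,
               CAST(amount_cents AS INTEGER) / 100.0 AS amount
        FROM raw_orders
        WHERE status = 'complete'
        """
    )


conn = sqlite3.connect(":memory:")  # stand-in for the analytic warehouse
extract_and_load(conn, RAW_ORDERS)
transform(conn)
print(conn.execute("SELECT id, amount FROM orders_clean ORDER BY id").fetchall())
```

Keeping the raw table untouched means the transform step can be rewritten and rerun later without going back to the source system.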
Starting point is 00:24:34 So the question really becomes, how do you write the job that gets the data from where it sits into the warehouse? It's based on the fact that a lot of people in these companies are software engineers, do they write them themselves? Is it a good move to write your code yourself, really? So I think that the way to think about that is what can you do to get the most for free? Sign up for as little maintenance as possible because inevitably this stuff breaks. So the best option is somebody's got an off-the-shelf integration.
Starting point is 00:25:09 And there's so many products out there now that have off-the-shelf, like move data to a data warehouse. I mentioned before we use Stitch a lot. So that's option number one. Option number two, I think, is there's this emerging platform called Singer that Stitch is kind of the sponsor of, but it's a totally open source way of doing ETL. And it's essentially a community oriented approach to this maintenance problem where folks are building API integrations with various data sources and kind of sharing them with that community. So we've built about five of those for clients. And the nice thing about that is they can build it once and then the community maintains it.
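For readers who have not seen Singer, the sketch below shows the general shape of what a tap writes: one JSON message per line, typed as SCHEMA, RECORD or STATE. It is a hand-rolled illustration assuming the message layout in the public Singer specification; it does not use any Singer library, and the `users` stream, its fields, and the `last_user_id` bookmark are hypothetical.

```python
import json
import sys

# Hypothetical source rows; a real tap would page through an API instead.
USERS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]


def build_messages(rows, stream="users"):
    """Return the SCHEMA, RECORD and STATE messages for one stream, in order."""
    messages = [{
        "type": "SCHEMA",
        "stream": stream,
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
        },
    }]
    for row in rows:
        messages.append({"type": "RECORD", "stream": stream, "record": row})
    # STATE lets the next run resume after the last id seen (bookmark name is made up).
    last_id = max((r["id"] for r in rows), default=None)
    messages.append({"type": "STATE", "value": {"last_user_id": last_id}})
    return messages


def emit(messages, out=sys.stdout):
    """Write one JSON document per line so a downstream target can read the stream."""
    for message in messages:
        out.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    emit(build_messages(USERS))
```

Because the output is plain lines of JSON on standard output, a tap written by one person can be piped into a target written by someone else, which is what makes the shared-maintenance model possible.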
Starting point is 00:25:57 So that's like your second tier of get it for free. And then if you really have to, you can build the whole thing from scratch. Okay. So we had Maxime Beauchemin on the podcast recently talking about Apache Airflow, for example. I mean, have you had any exposure to that? I mean, what's your kind of take on that, really? Yeah. Airflow is amazing. It is so incredibly capable.
Starting point is 00:26:20 And if you're a data engineer and you have like an unbounded problem set, Airflow, I think is the tool that you definitely want to use. I think that it actually takes a while for you to need to get there. And obviously at Airbnb, they're one of the most data sophisticated organizations literally in the entire world. So if you're working in a startup, you're probably not Airbnb quite yet. So the thing that Airflow does so well is it gives you access to a DAG, a directed acyclic graph. So a way of processing dependencies that kind of has a start and an end. And that's generally how ETL jobs are constructed. We think that that DAG concept is something that data analysts should be able to take advantage of as well. And so we're actually building an
Starting point is 00:27:20 open source tool called dbt, data build tool, that allows you to construct these SQL-only data dependency graphs, and they get built completely in your data warehouse. So you don't need a Spark cluster, you don't need, you know, a big EC2 server farm. You really run it from your local machine and it builds all of these data models in your warehouse. Okay. Okay. So do you ever see any of the big ETL tools being used in these companies, the Informaticas of this world and that sort of thing? Do you ever see the kind of point-and-click tools, expensive enterprise ETL tools, being used at all? So I really have not. I know that when I was at RJ Metrics, we kind of looked into Informatica as a potential partner. And so that was really my only exposure to it.
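The idea mentioned a moment earlier, a dependency graph of SQL-only models built inside the warehouse, can be illustrated with a short Python sketch. This is not dbt and does not reproduce its syntax or internals: the dependencies are declared by hand, SQLite stands in for the warehouse, and the model names and data are hypothetical. It only shows the core mechanic of ordering SQL models by a directed acyclic graph and running them in that order.

```python
import sqlite3
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models: name -> (upstream dependencies, SELECT statement).
MODELS = {
    "top_customers": (
        {"customer_revenue"},
        "SELECT customer_id FROM customer_revenue WHERE revenue >= 100",
    ),
    "customer_revenue": (
        {"orders_clean"},
        "SELECT customer_id, SUM(amount) AS revenue FROM orders_clean GROUP BY customer_id",
    ),
    "orders_clean": (
        set(),
        "SELECT id, customer_id, amount FROM raw_orders WHERE status = 'complete'",
    ),
}


def build_order(models):
    """Return model names so that every model comes after the models it depends on."""
    graph = {name: deps for name, (deps, _sql) in models.items()}
    return list(TopologicalSorter(graph).static_order())


def run(conn, models):
    """Materialize each model as a table, in dependency order, using only SQL."""
    for name in build_order(models):
        conn.execute(f"CREATE TABLE {name} AS {models[name][1]}")


conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, customer_id INTEGER, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, 10, 80.0, "complete"), (2, 10, 40.0, "complete"),
     (3, 11, 30.0, "complete"), (4, 11, 500.0, "cancelled")],
)
run(conn, MODELS)
```

All the heavy lifting happens in the database; the script on the local machine only decides the order and sends the statements, which matches the point about not needing a separate compute cluster.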
Starting point is 00:28:13 But it's a beast. It's quite a lot of work to set up. Yeah. Although, obviously, it's very powerful. Yeah. I mean, there was an interesting blog post that somebody wrote. I'm trying to find the details here actually, a gentleman called Jeff Magnusson,
Starting point is 00:28:28 who wrote a blog post a while ago called Engineers Shouldn't Write ETL, a guide to building a high-functioning data science department. And the general thrust of it was interesting. It was that the worst people to write ETL code are engineers because they're thinkers rather than doers.
Starting point is 00:28:42 And the danger is that each kind of engineer will kind of try to kind of, you know, to introduce a new paradigm or will kind of be trying to solve problems in a very innovative way. Whereas actually a lot of ETL is just basic stuff. I mean, do you think that's kind of valid or is that an interesting observation?
Starting point is 00:28:57 Totally. And I remember that I, like, read that post. Oh, brilliant. I think he's at Stitch Fix. Yes. I love that post. And if anyone's listening to this and you haven't read that blog post, like just pause it and go read that blog post. So just outline it again then for us.
Starting point is 00:29:12 What was it about and what was the point of it from your side then really? He was saying he actually focused a lot on the human capital reasons for this like the engineers software engineers who are good at their jobs they do not like to just do really boring stuff and some sometimes ETL writing ETL code is is rote and boring and that doesn't lend itself to having a happy high functioning team hmm yeah exactly I it's an interesting post-read. I thought it was a good counterpoint, really. Well, a supplementary point to the thing that Maxime talked about. But so another area, I mean, I've just come off a call,
Starting point is 00:29:53 and I was recording a podcast interview with Dan McClary from the BigQuery team, and we were talking about data modeling and, I suppose, transferring some of the things we knew from data warehousing into BigQuery. And I recently almost came unstuck with a BigQuery project with joins, for example. I mean, what's your take on data modeling in this kind of world? And is it different? Is it something you bear in mind differently? Or what's your view on this? Yeah, so I have to admit that my experience in the very large data set size world is only in post-Redshift. I never worked with tables of more than 10 million rows
Starting point is 00:30:33 in Oracle and MySQL. So I kind of maybe got lucky that I didn't have to deal with some of these old data warehousing techniques. I've read the books and I think about them like, gosh, I'm so glad I didn't have to think too hard about that. Yes. So people will ask, what's your take on data modeling in client calls? And generally, our answer is write really clean code, write code that is readable and that other people can maintain really easily. And if you run into performance problems, by and large, can just take the data
Starting point is 00:31:30 as long as you don't do obviously stupid things. Yeah, definitely, definitely. I mean, I don't know if you noticed, there's a company, Snowflake, out there that do obviously a cloud-based elastic kind of database for data warehousing, but it has some of the characteristics of kind of big data as well. I mean, you had any experience with Snowflake? What's your thoughts on that, really? Yeah, so we're just, we had, we got our first client on Snowflake at the beginning of this year, and we just spun up our second one. We like Snowflake a lot. And I want to kind of
Starting point is 00:32:01 tie into that blog post that you just wrote, joins in BigQuery. And the magic of BigQuery is that it can split your processing jobs across sometimes thousands of nodes. And that's amazing because you can process essentially any data set pretty quickly. The problem, though, is having the necessary data on the same node when you want to join one data set to another. And that does make joining less performant. So there are plenty of ways that you can architect, and you did a great job pointing this out, how you architect your BigQuery data to prevent you from needing to do these joins. I think that Snowflake is this nice, it's elastic, and you have the ability to spin up a bunch of compute nodes, but it's not like thousands of them. So it doesn't have the same problem with doing joins in the way that you mentioned.
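One way to picture "architecting your data so you don't need the join" is to fold the child rows into the parent record ahead of time, so each record already carries what a query needs. The sketch below does that in plain Python with made-up order and line-item data. It only illustrates the idea, it is not BigQuery code, and every name in it (orders, line_items, nest_line_items) is hypothetical rather than something from the episode or the blog post.

```python
from collections import defaultdict

# Hypothetical parent and child data sets that would normally be joined.
orders = [
    {"order_id": 1, "customer": "acme"},
    {"order_id": 2, "customer": "globex"},
]
line_items = [
    {"order_id": 1, "sku": "A", "price": 10.0},
    {"order_id": 1, "sku": "B", "price": 5.0},
    {"order_id": 2, "sku": "A", "price": 10.0},
]


def nest_line_items(orders, line_items):
    """Embed each order's line items in the order record itself."""
    items_by_order = defaultdict(list)
    for item in line_items:
        items_by_order[item["order_id"]].append(
            {"sku": item["sku"], "price": item["price"]}
        )
    return [{**order, "items": items_by_order[order["order_id"]]} for order in orders]


nested = nest_line_items(orders, line_items)

# Order totals now come from a single record each, with no join at query time.
totals = {o["order_id"]: sum(i["price"] for i in o["items"]) for o in nested}
print(totals)
```

The trade-off is the usual one for denormalized data: the join cost is paid once when the nested records are built, in exchange for storing the child data inside every parent.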
Starting point is 00:33:00 Yeah, excellent. Yeah, it's interesting. I mean, sorry, Snowflake is, to my mind, on one hand, very clever, you know, in that they've managed to get the best of both worlds. But it's also interesting to rebuild what is essentially a kind of on-premise data warehouse technology in the cloud, elastically. I mean, you kind of wonder to yourself, given that its primary market is kind of data warehousing, whether it's been worth it introducing and reintroducing things like constraints and so on in there. I mean, they seem to have built a technology that is clever, but you wonder whether it's needed in
Starting point is 00:33:33 this new kind of setup, really. I don't know. Yeah, I think that with Redshift in 2013 we got a tool that was pretty damn good enough for most use cases. And yet we're going to continue to push SQL-based data warehouses further and further over the next decade. And I think that tools will continue to look more and more like BigQuery and less and less like Redshift. But at the same time, I just don't know that BigQuery is quite there yet. It still requires a little more thinking than sometimes I'd like. Okay, so let's get away from technology here.
Starting point is 00:34:13 And the last of the blog posts you wrote was really good. And it was about, I suppose, the method and process by which startups then do their analytics. And I think you talked about it and called it kind of the analytics workflow. So again, what was the background to this and what were you trying to talk about? And let's go through some of the details. Sure. The thing that we observed while we were at RJ Metrics, and so RJ had a little over 400 clients at the time.
Starting point is 00:34:40 And we had that kind of collective knowledge of all of these companies. And you realize that still no one's doing analytics perfectly and sometimes not even like that well. And it's not really a tooling problem. It is that they're working in particular ways that don't end up adding up to make, you know, insight that everybody has access to and is always current and all of that. So we just kind of asked the question, how should people be producing analytics? What's the workflow that they should be using? And that blog post was kind of our answer to that. Okay. And I agree with you. I mean, I'm very conscious we spent 40 minutes of this conversation
Starting point is 00:35:24 talking about different tools and so on, but it's not what you've got, it's how you use it really, isn't it? And I think something I've observed is analytics in all companies really, you know, but particularly startups and so on, is very kind of tactical, it's spotty, it's not systematic and so on. And it's often in silos and not collaborative. I mean, I think the first thing you talked about in this blog post was saying analytics is collaborative. I mean, what do you mean by that? And what prompted that? And what are you trying to say there? So let's say that you've got a team of five analysts. The default behavior in this kind of setting is that some manager asks one of the analysts to get a report on something. And that analyst starts from a blank sheet of paper,
Starting point is 00:36:13 and they query the raw data from scratch, and they build up this report. And sometimes that will take the form of a 200-plus line-long SQL statement that only they can read, and even they forget how it works a week later. And so that becomes very fragile very quickly. And so the core insight there is that you should collaboratively in this team of five people build up this ever-growing layer of business logic and everybody should be accessing this same library of existing business logic as opposed to starting from scratch every time. That's interesting. So by business logic do you also mean things like common definitions and metrics? Yeah so the way that we do it, it always takes the form of database tables and views that are materialized in your data warehouse. So you build them with,
Starting point is 00:37:07 so let's say you've got an orders table. We were talking about e-commerce before. So you've got an orders table and you want to get revenue out of that. This is like the most typical thing ever. But there are some test records in there and you need to filter them out of literally every query you ever write. So instead of querying the orders table directly, make a view on top of that that filters out these test records and then everybody can totally forget that they even exist. So that's like a very simple example, but you can find opportunities all over the place to do that kind of thing. Okay. So going back to the conversation I had with Maxime about Airflow and Superset, one of the discussions we had was whether you should try and build out a semantic model, like a business
Starting point is 00:37:49 model for the business. And his argument was that in startups it's very hard to get a common definition, agreement on metrics and the structure of data and so on. I'm not entirely sure on that myself. I think there is value in doing that. I mean, what's your take on trying to build some kind of common business model for the business that describes things in a kind of standard way? Yeah, so I listened to that episode and I heard him say that and I flagged that in my brain too. You know, I'm not saying it's wrong, but it's an interesting kind of point of view, isn't it? Yeah, totally. And I have not worked at a company of the scale of Airbnb, so I'm sure that they have their own challenges that they're optimizing for. From my perspective, if you don't have a semantic model, then you're going to really run into kind of organizational challenges around what is true and what is not true.
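The orders example from a moment ago, a view that hides test records so nobody has to remember the filter, might look something like the sketch below. It uses Python's built-in sqlite3 module as a stand-in for a real warehouse, and the table, column, and view names (orders, is_test, orders_clean) are made up for illustration; none of them come from the episode.

```python
import sqlite3

# In-memory SQLite stands in for the data warehouse (an assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical raw orders table that still contains test records.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, is_test INTEGER)"
)
conn.executemany(
    "INSERT INTO orders (id, amount, is_test) VALUES (?, ?, ?)",
    [(1, 20.0, 0), (2, 35.5, 0), (3, 999.0, 1)],
)

# Shared business logic: a view that filters the test records out once.
conn.execute(
    "CREATE VIEW orders_clean AS SELECT id, amount FROM orders WHERE is_test = 0"
)


def total_revenue(connection):
    """Sum revenue from the cleaned view rather than from the raw table."""
    row = connection.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders_clean"
    ).fetchone()
    return row[0]


revenue = total_revenue(conn)
print(revenue)
```

Every analyst who queries orders_clean gets the same definition of a real order, which is the small-scale version of the shared semantic model being discussed here.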
Starting point is 00:38:47 So, yeah, I think that that model is kind of the core of your analytics at a company. And so, yeah, I guess I come down on the opposite side of that. But where should that model be, do you think? Do you think it should be in a tool like Looker or it should be at a lower level in the data warehouse. And the reason that we do it there is that when you build models in dbt, any tool that connects to your data warehouse has access to that same library of business logic, business models. So then Looker can connect to those, but we also really like Mode Analytics for a lot of use cases, and Mode can connect to that same shared data.
Starting point is 00:39:52 You can connect to Jupyter notebooks and run data science jobs. So we try to push things to that layer when it makes sense. But then at the Looker level, we like to define the metrics, the joins, the calculations, so that Looker knows how to take that model data and turn it into reports and users can point and click with it. Okay. What about metadata? I mean, in my old days of data warehousing, we had drilled into us that metadata was important and so on there. And we don't hear it talked about so explicitly in this kind of new world. Is it something you ever talk about with customers, and is it something that is a part of the projects you do? So the word metadata can mean a lot. Yeah, I was about to say, yeah, so there's lots of things. I mean, go through what you think it could mean in various cases and where there's
Starting point is 00:40:37 value and not value. Gosh. That's not a test, sorry. I mean, you know, things like data lineage and things like what is the meaning of some measure and that sort of thing. Yeah, sure. So that honestly isn't something we think a ton about. I think that source control ends up solving for a decent amount of that because if something changes, you can just look at the blame for it and you can say, OK, well, yeah, that used to be something else. I guess that speaks to the kind of velocity of change in these organizations, isn't it, that that is one of the things that is important. I mean, we certainly, on a project we're working on, we've found a lot of value in having things like data dictionaries that would kind of give us the actual kind of meaning of a column and so on. But beyond that, things like data lineage and anything more than that is never going to get done because it's just not a priority, really.
Starting point is 00:41:26 Yeah, I think that software developers really like to think about how do you produce code that is self-documenting. So, you know, you've got things like Javadoc that can create whole applications for you that create the documentation. To the extent that we can, we write code that is readable. And when inevitably some sections of it are more complicated, we will document it inline with comments. And then you can produce assets like a visualization of your DAG that end up kind of helping folks see the bigger picture. So I completely agree that if you're thinking about documentation, that's a big deal. Yeah. Yeah. What about quality and numbers adding up? I mean, again,
Starting point is 00:42:19 the world is kind of fast moving and we've got kind of like particularly lots of data and lots of things in these environments and so on. How important is the accuracy of data, do you think, on projects, and how much emphasis is put on that on projects that you've seen? Yeah, I think that it is kind of an endemic problem today where, you know, you have more data than ever, you have more ability to store it. So people end up capturing a bunch of data and putting it in a data warehouse and then really having no idea, you know, is it clean, is it good, whatever, until you go to use it for analysis and you realize, holy crap, this is not in good shape. So one of the things that we do quite a lot and we've made much easier with dbt is data unit testing.
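As a rough sketch of what data unit tests of this kind can look like, the example below runs two common checks against a tiny in-memory database: a column that should never be null, and a key that should always have a parent record in another table. This is not dbt's own syntax or API. It uses Python's built-in sqlite3 module, and the table names, column names, and helper functions (null_count, orphan_count) are hypothetical.

```python
import sqlite3


def null_count(conn, table, column):
    """Count rows where a column that should never be null is null."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]


def orphan_count(conn, child, fk, parent, pk):
    """Count child rows whose foreign key has no matching parent row."""
    query = (
        f"SELECT COUNT(*) FROM {child} c "
        f"LEFT JOIN {parent} p ON c.{fk} = p.{pk} "
        f"WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL"
    )
    return conn.execute(query).fetchone()[0]


# Made-up data: one order has no customer, one points at a missing customer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(10, 1), (11, None), (12, 99)])

results = {
    "orders.customer_id is never null": null_count(conn, "orders", "customer_id"),
    "orders.customer_id has a parent customer": orphan_count(
        conn, "orders", "customer_id", "customers", "id"
    ),
}
for name, bad_rows in results.items():
    status = "PASS" if bad_rows == 0 else "FAIL"
    print(f"{status}: {name} ({bad_rows} bad rows)")
```

Each test is just a query that counts offending rows, so a scheduler can run the whole set on a regular basis and raise an alert whenever any count is above zero.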
Starting point is 00:43:07 So essentially defining standard tests on top of the data that's in your warehouse and allowing you to run those in a scheduled, consistent way. And alerting you if for some reason a field you're counting on to never be null ends up being null. Or a key that's supposed to map to another table for some reason doesn't have a parent record. And it's not super hard to write tests like that. The hard thing is to make sure that you've done it in a way that is lightweight and maintainable. And people can actually do it, because getting people to write tests like this can be really unpleasant, naggy. Okay. And I noticed that there's a project that you're involved in, the Analyst Collective project, and I think
Starting point is 00:43:55 that's bringing together sort of dbt and things you're working on. Again, just maybe explain what that is and the other components, that analytics and data generator and so on. Yeah, the Analyst Collective was actually kind of the precursor to Fishtown Analytics. This core of people were thinking about these kinds of problems as we were seeing the industry evolve at RJ Metrics, and we decided to kind of create this space to build open source code and write about the solutions that we thought people should adopt to these problems. Okay. And so, I mean, in general, how much do you and your company participate in these kind of projects and generally the kind of community scene and that sort of thing? Yeah, so the answer to that is as much as we humanly possibly can. I have a real, and maybe I'm an idealist, and I've talked to Lloyd Tabb at Looker who, he was a big open source person in the early days of open source and has different views than I do on this. So maybe he knows better than I do.
Starting point is 00:45:11 But I really think that data technology is just so important to your organization that it doesn't seem to me to make sense to have it locked up in a closed source environment. So the BI stack is, I think, moving further and further towards open source. And the layer that is still mostly closed source is the actual visualization. Looker and Mode are both closed source, and Superset is a great example of an open source alternative. I don't think Superset is quite where, you know, Mode or Looker are, but I'm really excited to see that part of the ecosystem evolve. Okay. So just to kind of round things off, really, I mean, you've kind of danced around, obviously, what your company does and so on. Just give us a bit of a kind of a two-minute thing on what Fishtown Analytics does and, I guess, how you engage with customers and how people will contact you if they're interested after hearing this. Sure.
Starting point is 00:46:19 So Fishtown Analytics is an analytics consultancy that serves high growth venture funded startups. We'll work with companies even after just a seed round and all the way up through IPO. So we'll, you know, if you're very early, we'll help you set up your data warehouse and connect your data and do some basic reporting. If you are much further along, then we'll help you build custom ETL jobs and we will help you do custom attribution models and writing custom Spark jobs. So we'll kind of span that entire gap. We work completely in sprints. Every sprint is two weeks and you can cancel any time. So the goal is just to be really agile and easy to work with. And we are really optimized to just kind of getting in and doing the work and having fun in the process because I actually do think a lot of this work is really fun to do right now. It's great. I mean, I love it. It's great, isn't it? I mean, I think certainly looking at your
Starting point is 00:47:22 web presence and looking at your articles and so on, it comes across as somebody who kind of gets the technology, but also gets the kind of the basic ideas behind analytics and also how to run software development projects. And I think that's a good combination really, isn't it? I think, you know, the technical knowledge, the kind of common sense, and actually the kind of the understanding at very root level of how analytics works. I really appreciate that. Yes, good. Excellent. Well, look, it's been great speaking to you. How do people find you on the web and how do they contact you?
Starting point is 00:47:45 Sure. Just find us at fishtownanalytics.com, and if you fill out that form on that website, I will get it and probably respond to you even if it's 3 a.m. Excellent. It's been great speaking to you, Tristan. Thank you very much for coming on the show, and have a good evening. Thanks, you too.
