Drill to Detail - Drill to Detail Ep.33 'Building Out Analytics Functions in Startups' With Special Guest Tristan Handy
Episode Date: July 3, 2017
In this episode Mark is joined by Tristan Handy from Fishtown Analytics to talk about building out analytics functions in high-growth startups and three related blog posts on this topic....
Transcript
Hello and welcome to another episode of Drill to Detail and today I'm joined by Tristan Handy
from Fishtown Analytics who I got to know through the world of Looker but then I found operates in
the same sort of startup space that I work in but in a slightly different way to the way I work. So Tristan, why don't you introduce
yourself to everybody on the show and let us know kind of what you do and how you got here.
Sure. Thanks so much for having me. It's great to be here. My name is Tristan Handy and I'm
the founder and CEO of Fishtown Analytics. I've been working in data for about a decade and a half now, and I guess I've been focused on startups and data since about 2009.
I was the first analyst at Squarespace back when Squarespace was a tiny little company.
Helped them raise their first big A round and then went on to be the executive at two different startups.
Most recently at RJ Metrics, I ran the marketing team. And we kind of participated in this real
fascinating development of the BI tech stack over the past five years. And so developed a lot of
strong opinions about BI technology and how analytics
should be done. And I left with three other folks and we've started our own consulting company to
put some of those ideas in practice. Okay. Okay. So you say Squarespace,
it's interesting. All of my websites run on Squarespace. So it's brilliant, isn't it? It's
really, really good. I mean, it's a great design and the IT behind it is good as well. And so
that's kind of interesting as well. But what you do, Tristan, is interesting because we've come across each other through,
I suppose, the kind of the data engineering kind of conversations and various kind of, I suppose, new world BI development kind of conversations.
But you actually provide analytics consulting to the actual startups themselves, don't you? Yeah, that's exactly what we do. The problem that we found that most startups had, and most of
our clients at RJ Metrics were startups, they didn't have a software problem, they had a people
problem. There's just not enough talent out there today that knows how to operate modern BI technologies.
There's a very deep enterprise space, and IT consultants abound in that universe.
But startups are kind of new to the BI space because the technology didn't exist previously for them to play with it.
It was, you
know, write queries on MySQL or build spreadsheets or nothing. So we're trying to fill this gap of
helping startups deploy this new technology. Okay, okay. So yeah, it's a subtle difference
there, isn't it? I mean, the work that I'm currently doing at the moment is I'm working
with a startup to build analytics products that they then offer to their customers, but you're actually looking at the analytics those startups themselves
use. And it is interesting, there were three blog posts that you wrote recently. These are the real
reasons I wanted to speak to you. There were three very, you know, opinionated, which is
good, blog posts that you wrote about the sort of things you do and the problems you're
seeing in the market. The first one was, I suppose, a startup founder's guide
to analytics. We'll go into it in a second, but there
was a good set of steps in there, setting the scene for that. Then we
had the steps to setting up a modern SaaS-based BI infrastructure, which is very relevant.
And then you talked about the workflow itself within, you know, startups.
I thought all three of them were very good and actually very relevant to stuff that I
was working on at the time. So for the first of those posts that you did,
the startup founder's guide to analytics, just summarize for the listeners
what that was about and what was the motivation to write it, and then we'll go into some of the
details. Yeah, so I kind of alluded before that this is all kind of new stuff for startup folks. And so I feel like a lot
of the software vendors in the space want you to just kind of dive in and do it. And to a certain
extent, that's good. It's good to have a bias for action. But there's a lot to know. And frequently,
we've found folks kind of doing things out of order, caring about the wrong thing at the wrong time, like hiring a data scientist before you had any data to analyze just because you thought that you needed a data scientist. So the post tries to, you know, track a business from zero to 500 employees and say what kinds of things should
you do: what software should you buy, what people should you hire, and how should you be doing
analytics at these different phases. Okay. So in the blog post
you went through the phases in growth, I suppose, for a startup, and you
talked about, you know, what was appropriate at those phases. So let's
start at the beginning. If you think about the founding stage, and here I'm talking,
just to be clear, largely about, say, e-commerce and web and digital
startups, describe the founding stage and describe what is appropriate,
what mistakes you see, but what is appropriate in that phase. Sure. I think that a lot of times in the early stages, either your problem is data collection,
you haven't instrumented anything, or it is doing too much. I think that a lot of startup founders
are very data-driven naturally. They want to know what's going on with their business. And I think that
sometimes they overdo it on analytics too early, which creates this maintenance burden where
maybe you work a lot on a given analytics setup and it's working great, but your business changes
over time. And if you don't have the manpower to kind of keep that up to date, then the reports will get stale and no one will look at them and you'll have just wasted a bunch of time.
So things like a data warehouse and enterprise BI tools and so on would not be appropriate at this stage, you're saying?
I don't think so.
I think that if you've got fewer than 10 employees, you should install Google Analytics.
Make sure that you've done a decent job of that and then, you know, do what you can with it.
Okay, so actually Google Analytics is an interesting topic because again, coming
from probably more of an enterprise world myself, I suppose I was
unaware of how ubiquitous Google Analytics is and how much value there is
in that as well. I mean, just for people who are listening to this who aren't
from the e-commerce world, just describe a little bit about Google Analytics and why it's so good,
really. Gosh. Google Analytics is the most loved and most hated analytics product in the world.
It's the Microsoft Access of e-commerce, isn't it, really? You know, in some respects, or Excel in a way.
For sure. I think that the basic GA implementation is you install the tracking pixel on your website and it tells you visitor behavior.
You can go deeper than that with Universal Analytics.
You can install it in your mobile application.
You can get some more sophisticated reports. But it's a tool that, unless you pay for GA Premium and get all the data loaded into BigQuery,
is a visual tool, and you'll run up against the ends of the universe in terms of what kinds of
questions you can ask it. Okay. So would you at this point expect the startup to hire somebody
to work with analytics at this point, or is it going to be a founder task? I think it's a founder
task at this point. Maybe you've got a marketing person
who's doing a bunch of marketing analytics,
but mostly it's your founders get stuck with this.
Okay, so next stage then,
very early stage you put it.
I mean, you talk in there about what you do at that stage
and you mentioned things like net promoter score.
I mean, what is different about this next stage
and what happens there really?
Sure, so your team's growing a little bit and you're probably not speaking to each of these people every day.
So you need to be focused a little bit more on empowering these people to do their jobs.
And in the future, that's going to take the form of a BI stack.
It's going to look more like a data warehouse and a BI tool. But for now, I think these people have jobs to do
and most of them are not gonna know SQL.
Most of them are not gonna really have any BI skills.
So what's important to do is hold them accountable
to use the reporting in the tools that they use every day.
So if they're a salesperson,
they need to build reports in Salesforce.
I think a lot of folks export the data and they go to town with Excel. And I think
that's really a terrible idea.
Okay, so we've got a couple of stages next: you've got early stage and mid stage in your
blog post. And I guess this is kind of where it gets interesting. So this is potentially
kind of where someone like you might come in, I guess. This is also when you start to have people talking about things like,
let's redo the analytics in a kind of more structured way and so on there.
You've then got the kind of this whole new thing about data engineers as well and so on.
How do you, when you go into these places, what do you see as common mistakes
and how do you sort of make sense of it all and get them in the right direction really?
What's the tricks to it all?
Sure. What's the tricks to it all?
Gosh.
Well, what's the problems you see?
For myself, I have a job.
Do you find sometimes that people are a little bit too clever for their own good when it comes to these sort of things or what?
Yeah, totally.
So I think that there are a lot of kind of standard principles
that you should think about when you're doing this stuff.
One is write as little code yourself as you can. If you want a BI stack today, the reason that you have the ability to even have a BI stack at 25 employees is that there are so many tools
that you can just kind of pull off the shelf and all the integrations just kind of work.
So there are some founders and engineers at this stage that have a bias to build it themselves.
I think that's one of the things that we see people make a big mistake on.
So I think that step one here is plugging together various tools and putting
your data stack together. Then step two really is hiring. In order to kind of get this stuff in a
good place, it can't just be, you know, 10% of everyone's job. There has to be a person who's
pushing this forward. And this is a thing that I've started thinking
a lot about, because I really think that hiring is the biggest problem in
analytics today. And it doesn't matter if it is a very large enterprise or if it's a
young startup, there's just not enough people who know how to do this stuff.
Yeah. I mean, I think there's a distinction as well.
I mean, there's a distinction between analytics and things like machine learning and data science and so on.
But one thing I found doing this kind of work myself is that there are a lot of things that are different going into this world.
And I think that you need to be open-minded: sometimes it actually does make sense to build things.
There are areas of this world that you've not heard of before; for me it was things like
e-commerce analytics and so on. But there are also eternal truths, really, and
something I found is that you end up rediscovering a lot of
these eternal truths doing this kind of work. I mean, having to make the case for analytics is
quite an important thing as well, isn't it? Have you found that when you go into places,
there is actually a general lack of understanding of what the value of analytics is for a company, really?
So we don't have that conversation a lot.
It's not that there's no one in the world who still thinks that, but we just don't find ourselves in those conversations, which is good, because I feel like, you know, maybe that was a fair conversation to have ten
years ago. But if you're still thinking about that today, then you just haven't
lived in a world where people are using data well. Or perhaps you've done it, but,
I suppose, I mean, like you say, very few people within this world
have not used analytics, but maybe they've done it and not found it to be
actionable or valuable and so on. I mean, have you found sometimes that you do need to
go through and establish some of these basic things and think about, you know, even
basic things like planning and budgeting processes, or, you know, how do we do
internal reporting and so on? Is that part of what you do as well? So we're very focused on the pure, like, BI part, which is, you know, counting things and adding things.
And, yeah, totally.
The work that we do is an input to financial models that get put together by CFOs and shared with investors.
But we're generally not making those forecasts.
Okay.
Okay.
So what about, I mean, again, something I found interesting is, I suppose, startups will often focus on, like you say, building something
that's new, and they'll be using, for example, I don't know, Airflow
or stuff like that, when actually maybe a more traditional technology,
a more, I don't know, easy-to-understand,
off-the-shelf technology would be better. What are your thoughts on that, really?
On Airflow versus?
Well, just in general, the fact that a lot of startups will think about the engineering tasks
rather than actually what they're trying to do with this.
Got it. Yeah, I really, I do agree that it is possible to get lost in the technology kind of forever.
But the, and when I'm talking about this question of hiring at this stage, it really is that
person who bridges the gap.
It's very possible to find engineers, you know, there, we could all use more engineers
in our companies, but they're out there.
And there are also plenty of marketers. The question is, who can you find that can
understand how to put a data stack together, but maybe not build it by hand themselves,
and can understand how a marketing campaign runs, even if maybe they don't run them themselves.
And you can combine those two sets of knowledge to actually do effective marketing analytics.
And the same for finance or for operations, whatever. Okay, okay. And you say that's a separate
role to the person who is building the reports and so on. I mean, do you think it's
important to still have someone out there building reports and so on, working with that consultant?
It's quite hard to do both really, isn't it? I think that at the early stage, so maybe 40 employees, something like that, you hire the person who maybe you call your head of BI.
And maybe two or three years down the road, they'll have six people working under them and you'll call them a VP.
But for now, you just call them your head of BI. That person is usually, maybe they've got an MBA, maybe they've got some Excel skills
and some light SQL, and they're going to pick it up on the job because they're a super smart person.
And they're going to build out the basic stuff themselves and then scale the team over time.
Okay. And the final stages of this kind of blog post in particular, where you talked about
mid-stage and kind of growth, and you talked about the importance in the mid-stage of SQL data modeling and governance and versioning.
I mean, tell us about that, and why is that important, and how do you introduce that into the conversation?
Sure. So one of my biggest pet peeves has become copying and pasting. It is unbelievable how analysts are so used to copying and pasting.
So, you know, you send an Excel document to somebody else, they use that as the starting
point for their own Excel document, they build off of that. But if the core definition of a metric changes, all of these decentralized analyses are not going to get updated.
And what ends up happening is that everybody has their own copy of the metrics.
Nobody agrees with each other.
And it kind of grinds all of this to a halt.
So we, you know, if software engineers wrote code like that, literally we wouldn't have any software applications that actually worked.
And I really think that analytics is moving in that direction as well, where you need to think about your analytics applications as scalable pieces of software that you need SLAs, you need source control, you need to build them modularly.
And copy paste has to just die.
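To make the copy-and-paste point concrete, here is a minimal sketch of what a single shared metric definition could look like, as opposed to each analyst carrying their own copy. Everything in it (the `Metric` class, the `net_revenue` metric, the order fields) is a hypothetical illustration and is not taken from the episode or from any particular tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Order = Dict[str, Any]


@dataclass(frozen=True)
class Metric:
    """One agreed definition of a metric, kept in source control."""
    name: str
    description: str
    compute: Callable[[List[Order]], float]


def _net_revenue(orders: List[Order]) -> float:
    # Hypothetical business rule: refunded orders do not count as revenue.
    return sum(o["amount"] for o in orders if o["status"] != "refunded")


# The only place where the definition lives. Changing it here changes it
# for every report that uses it.
METRICS = {
    "net_revenue": Metric(
        name="net_revenue",
        description="Sum of order amounts, excluding refunded orders",
        compute=_net_revenue,
    ),
}


def metric(name: str, orders: List[Order]) -> float:
    """Look up a metric by name and compute it on the given orders."""
    return METRICS[name].compute(orders)


orders = [
    {"amount": 100.0, "status": "paid"},
    {"amount": 40.0, "status": "refunded"},
    {"amount": 60.0, "status": "paid"},
]

# Two different reports reuse the same definition instead of pasting it.
marketing_report = metric("net_revenue", orders)
finance_report = metric("net_revenue", orders)
print(marketing_report, finance_report)
```

Because both reports call the same function, a change to the refund rule is made once and cannot drift between teams.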
Okay, okay.
So the reason that I think we got to know each other was because of Looker.
And Looker is an interesting BI tool take on this kind of world, isn't it?
I mean, obviously you've got in there the ability to put stuff in GitHub and so on,
but you've also got the data modeling side and you've got the kind of SQL side and so on.
I mean, you're using Looker currently, aren't you, in some projects? And
what are your thoughts on that, really? Yeah, I like Looker a lot, and one of the first
blog posts I wrote after starting Fishtown was about how much we liked Looker. Yeah, that's what got
my attention at the time. Yeah. So the nice thing for us is that we can think very structurally about, you know, what is your data? What does it look ... end. And since we've built out that LookML model,
they can drag and drop and create reports
without having to think too much about
how to optimize a Redshift query or anything like that.
Okay. And so, I mean, this is an area
that is your main focus of business.
I mean, how has it worked out building a business in this area?
I mean, selling consultancy into a startup,
I've always thought it's
been quite a hard thing because people there are quite build-it-themselves and smart.
How has it gone running a business and starting a business in that space? Yeah, you know,
last March I was talking to my wife about, hey, maybe I'm going to try to start this business,
and my goal was, hey, maybe I can pay my own salary. I really had no
idea what to expect because, you know, I ran several teams at startups, and
at startups sometimes you hire a design consultant, sometimes you
hire, like, a performance marketing, like an AdWords consultant, but really
startups aren't used to hiring consultants.
The thing that has worked out really well for us is that we all come from this ecosystem.
And so we know all of the people making the technology.
And none of them want to have services businesses.
They all want to build software.
And so we have ended up getting, I would
say, a majority of our customers from the ETL tools, from the BI tools, from the data warehouse tools.
Okay, that's interesting. So that actually is quite a nice lead into the second blog post
you wrote. So you wrote one about, where are we here, I'm just going to find it: what are the steps
and the tools in setting up a modern SaaS-based BI infrastructure. So just tell us a little bit, first of all, about the motivation for this, but what's your
general approach or general picture of what you do in this space
before we get into the detail? Sure. So data is data. Like, you can have a data workflow that's set
up on top of CSVs in, you know, S3, and they can all be processed by
Jupyter Notebooks. There's infinite ways of having a data pipeline. The reason that we do things in
the way that we do is because we need them to be very hands-off. We don't want to think about them.
And as you get bigger and have more customized needs,
maybe you want to do things differently.
But these are recommendations for companies
who are not trying to have a team of 25 people
maintaining this stuff.
So we always think of the analytic database
as the center of all of your analytics.
Don't analyze data in Excel. Don't do random things
in Tableau extracts. Step one is always get data into an analytic database. So the question then
becomes, how do you get it there? And there's several off-the-shelf ETL tools, Fivetran, Stitch,
Alooma. We use Stitch probably the most. And then once your data is in your analytic
database, then you've got choices to make around what your BI tool is. So that's kind of the process,
the way that we think about it. Okay. So there's kind of different layers to the stack you talked
about, and it's been a recurring theme in a lot of the kind of podcasts we've been doing recently.
So we talked about the database first of all, and, you know, in this blog
post you talked about using these analytic, elastic MPP databases like Redshift and
BigQuery and Snowflake. The company I'm at actually went off Redshift onto
BigQuery. Is Redshift still popular out there? Is it still being used
a lot in this startup space? From what I know, Redshift is still the 800-pound gorilla, and not just within startups.
Like, I think that it is the dominant cloud database for all sizes right now. And
it's hard to say that that's bad.
Like Redshift, I think is, what, gosh, it's four years old now.
And I think that mostly the way that it's showing its age is around concurrency.
So if you have very differing concurrency and load on your warehouse at different points in the day.
You might, with Redshift, have to really make hard choices about who gets to use that resource and when.
If you go with Snowflake or if you go with BigQuery, they do a much better job of solving that problem.
Okay. Okay. Do you find that the, I mean, I know obviously not every business you work in is around data and so on, but do you tend to find that the databases they use for their internal reporting and analytics are the same as they use for the customer ones?
I mean, or are they as good?
I mean, what typically is that like, really?
So for internal analytics purposes, I think that really the three that you named are the ones that are in use.
We don't do a ton of work with customer analytics, you know, embedded stuff in applications.
But I think that there you do have a much broader array of possible options.
You throw in Elasticsearch, throw in even hosted services like Keen.io.
Keen has a great API for doing stuff like this.
Now, you can absolutely spin up a Redshift instance and use that for embedded analytics, and we've helped folks do that.
I don't think it's really built for that use case quite as much.
Okay, okay.
So we've got that.
I mean, I think the database is a fairly kind of easy topic, really.
But ETL is a recurring topic we've been coming back to in this podcast recently.
And I think it's been driven by a lot of the kind of move towards things like data engineering, things like Kafka, things like Apache Airflow we talked about earlier on.
To my mind, ETL is the biggest area where people can get themselves into
a bit of a mess, really, internally on projects.
I mean, what's your take on doing ETL
within internal startup projects,
and what are your tools of choice and so on?
Sure, yeah, and I totally agree with you.
There's a lot of topics in there to unbundle,
but yeah, I mean, what's your take on that?
Yeah, it's so easy to just kind of like
get yourself lost in the forest and be like, how did I get here?
So we think about the three letters ETL in two different stages.
We like to separate the E and the L from the T.
There are a large number of reasons why you might not want to build your pipeline like this,
but we load all raw data into the analytic data warehouse as stage one.
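The split described here, extract and load first, transform afterwards inside the warehouse, can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the analytic warehouse, and the table, view, and column names are made up.

```python
import sqlite3

# Hypothetical raw export from a source system: everything arrives as text.
raw_rows = [
    ("1", "ann", "12.50"),
    ("2", "bob", "4.00"),
    ("3", "ann", "3.50"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the analytic warehouse

# E + L: land the data exactly as extracted, with no cleanup on the way in.
conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# T: a separate, later step, written in SQL and run inside the warehouse.
conn.execute(
    """
    CREATE VIEW customer_revenue AS
    SELECT customer, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_orders
    GROUP BY customer
    """
)

revenue = conn.execute(
    "SELECT customer, revenue FROM customer_revenue ORDER BY customer"
).fetchall()
print(revenue)
```

Keeping the raw table untouched means the transformation can be rewritten and rerun later without going back to the source system.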
So the question really becomes,
how do you write the job that gets the data from where it sits into the warehouse?
Based on the fact that a lot of people in these companies are software engineers,
do they write them themselves?
Is it a good move to write your code yourself, really?
So I think that the way to think about that is what can you do to get the most for free?
Sign up for as little maintenance as possible because inevitably this stuff breaks.
So the best option is somebody's got an off-the-shelf integration.
And there's so many products out there now that have off-the-shelf,
like move data to a data warehouse.
I mentioned before we use Stitch a lot.
So that's option number one.
Option number two, I think, is there's this emerging platform called Singer that Stitch is kind of the sponsor of, but it's a totally open source way of doing ETL.
And it's essentially a community oriented approach to this maintenance problem where folks are building API integrations with various data sources and kind of sharing them with that community.
So we've built about five of those for clients.
And the nice thing about that is they can build it once and then the community maintains it.
So that's like your second tier of get it for free.
And then if you really have to, you can build the whole thing from scratch.
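For a sense of what one of those shared integrations involves, here is a heavily simplified sketch of a Singer-style "tap": a program that writes JSON messages, one per line, for a downstream loader to consume. It uses the `SCHEMA`, `RECORD`, and `STATE` message types from the Singer specification, but the `users` stream, its fields, and the shape of the state value are hypothetical, and a real tap would also handle configuration, paging, and errors.

```python
import json
import sys
from typing import Any, Dict, List

Row = Dict[str, Any]


def build_messages(rows: List[Row]) -> List[Dict[str, Any]]:
    """Build the message sequence for one hypothetical 'users' stream."""
    messages: List[Dict[str, Any]] = [
        {
            "type": "SCHEMA",
            "stream": "users",
            "key_properties": ["id"],
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "email": {"type": "string"},
                },
            },
        }
    ]
    last_id = None
    for row in rows:
        messages.append({"type": "RECORD", "stream": "users", "record": row})
        last_id = row["id"]
    # The state value is tap-defined; here it is just a bookmark so the next
    # run could resume after the last row that was sent.
    messages.append({"type": "STATE", "value": {"users": {"last_id": last_id}}})
    return messages


def run_tap(rows: List[Row]) -> None:
    """Write one JSON message per line to standard output."""
    for message in build_messages(rows):
        sys.stdout.write(json.dumps(message) + "\n")


if __name__ == "__main__":
    run_tap([
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
    ])
```

Because the output is plain JSON lines, the program that loads the data into the warehouse can be written and maintained separately from the one that extracts it.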
Okay. So we had Maxime Beauchemin on the podcast recently talking about Apache Airflow, for example.
I mean, have you had any exposure to that?
I mean, what's your kind of take on that, really?
Yeah.
Airflow is amazing.
It is so incredibly capable.
And if you're a data engineer and you have like an unbounded problem set,
Airflow, I think is the tool that you definitely want to use. I think that it actually takes a
while for you to need to get there. And obviously at Airbnb, they're one of the most data
sophisticated organizations literally in the entire world. So if you're working in a startup, you're probably not Airbnb quite yet.
So the thing that Airflow does so well is it gives you access to a DAG, a directed acyclic graph.
So a way of processing dependencies that kind of has a start and an end.
And that's generally how ETL jobs are constructed. We think that that DAG concept is something that
data analysts should be able to take advantage of as well. And so we're actually building an
open source tool called dbt, data build tool, that allows you to construct these
SQL-only data dependency graphs, and they get built completely in your data warehouse. So you
don't need a Spark cluster, you don't need, you know, a big EC2 server farm. You really run it from
your local machine and it builds all of these data models in your warehouse.
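The idea of a SQL-only dependency graph built inside the warehouse can be sketched as follows. This is not dbt's code or interface: the `models` dictionary, with dependencies declared by hand, is a hypothetical stand-in, SQLite plays the role of the warehouse, and Python's standard `graphlib` (3.9 or later) does the ordering.

```python
import sqlite3
from graphlib import TopologicalSorter

# Hypothetical models: each is a SELECT statement plus the models it reads.
# They are deliberately listed out of order; the DAG decides the build order.
models = {
    "top_customers": {
        "deps": ["customer_revenue"],
        "sql": "SELECT customer FROM customer_revenue WHERE revenue > 10",
    },
    "customer_revenue": {
        "deps": ["stg_orders"],
        "sql": "SELECT customer, SUM(amount) AS revenue "
               "FROM stg_orders GROUP BY customer",
    },
    "stg_orders": {
        "deps": [],
        "sql": "SELECT id, customer, CAST(amount AS REAL) AS amount "
               "FROM raw_orders",
    },
}


def build_order(models):
    """Return model names so that every model comes after its dependencies."""
    graph = {name: set(spec["deps"]) for name, spec in models.items()}
    return list(TopologicalSorter(graph).static_order())


def build(conn, models):
    """Materialize each model as a table, in dependency order."""
    order = build_order(models)
    for name in order:
        conn.execute(f"CREATE TABLE {name} AS {models[name]['sql']}")
    return order


conn = sqlite3.connect(":memory:")  # stand-in for the data warehouse
conn.execute("CREATE TABLE raw_orders (id TEXT, customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "ann", "12.50"), ("2", "bob", "4.00"), ("3", "ann", "3.50")],
)

order = build(conn, models)
top = conn.execute("SELECT customer FROM top_customers").fetchall()
print(order, top)
```

All of the heavy lifting happens in the database; the local process only works out the order and sends SQL statements.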
Okay. Okay. So do you ever see kind of any of the big ETL tools being used in these companies,
Informatica's in this world and that sort of thing? Do you ever see the kind of point and click tools, expensive enterprise ETL tools being used at all?
So I really have not. I know that when I was at RJ Metrics, we kind of looked into Informatica as a potential partner.
And so that was really my only exposure to it.
But it's a beast.
It's quite a lot of work to set up.
Yeah.
Although, obviously, it's very powerful.
Yeah.
I mean, there was an interesting blog post that somebody wrote.
I'm trying to find the details here actually,
a gentleman called Jeff Magnusson,
who wrote a blog post a while ago
called Engineers Shouldn't Write ETL,
a guide to building a high-functioning
data science department.
And the general thrust of it was interesting.
It was that the worst people to write ETL code
are engineers because they're thinkers
rather than doers.
And the danger is that each kind of engineer
will kind of try to kind of, you know,
to introduce a new paradigm
or will kind of be trying to solve problems
in a very innovative way.
Whereas actually a lot of ETL is just basic stuff.
I mean, do you think that's kind of valid
or is that an interesting observation?
Totally.
And I remember reading that post.
Oh, brilliant.
I think he's at Stitch Fix.
Yes.
I love that post.
And if anyone's listening to this and you haven't read that blog post, like just pause it and go read that blog post.
So just outline it again then for us.
What was it about and what was the point of it from your side then really?
He actually focused a lot on the human capital reasons for this. Like, software engineers who are good at their
jobs do not like to just do really boring stuff, and sometimes
writing ETL code is rote and boring, and that doesn't lend itself to having a
happy, high-functioning team. Hmm, yeah, exactly. It's an interesting post to read.
I thought it was a good counterpoint, really.
Well, a supplementary point to the thing that Maxime talked about.
But so another area, I mean, I've just come off a call,
and I was recording a podcast interview with Dan McClary from the BigQuery team,
and we were talking about data modeling and, I suppose,
transferring some of the things we knew from data warehousing into BigQuery.
And I recently almost came unstuck with a BigQuery project with joins, for example. I mean, what's your take on data
modeling in this kind of world? And is it different? Is it something you bear in mind
differently? Or what's your view on this? Yeah, so I have to admit that my experience in the very large data set size world
is only in post-Redshift.
I never worked with tables of more than 10 million rows
in Oracle and MySQL.
So I kind of maybe got lucky that I didn't have to deal
with some of these old data warehousing techniques. I've read the books and
I think about them like, gosh, I'm so glad I didn't have to think too hard about that.
Yes.
So people will ask, what's your take on data modeling in client calls? And generally,
our answer is write really clean code, write code that is readable and that other people can maintain really easily.
And if you run into performance problems, by and large, the warehouse can just take the data
as long as you don't do obviously stupid things.
Yeah, definitely, definitely.
I mean, I don't know if you noticed, there's a company, Snowflake, out there
that do obviously a cloud-based elastic kind of database for data warehousing,
but it has some of the characteristics of kind of big data as well.
I mean, you had any experience with Snowflake? What's your thoughts on that,
really? Yeah, so we got our first client on Snowflake at the beginning of
this year, and we just spun up our second one. We like Snowflake a lot. And I want to kind of
tie into that blog post that you just wrote, joins in BigQuery.
And the magic of BigQuery is that it can split your processing jobs across sometimes thousands of nodes.
And that's amazing because you can process essentially any data set pretty quickly.
The problem, though, is having the necessary data on the same node when you want to join one data set to another.
And that does make joining less performant.
So there are plenty of ways that you can architect, and you did a great job pointing this out, how you architect your BigQuery data to prevent you from needing to do these joins. I think that Snowflake is this nice, it's elastic,
and you have the ability to spin up a bunch of compute nodes, but it's not like thousands of
them. So it doesn't have the same problem with doing joins in the way that you mentioned.
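The point about needing matching rows on the same node can be illustrated with a toy hash-partitioned join. This is a generic sketch of the technique and makes no claim about how BigQuery or Snowflake implement joins internally; the node count and the sample rows are made up.

```python
from collections import defaultdict


def partition(rows, key, num_nodes):
    """Send each row to a node chosen by hashing its join key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes


def distributed_join(left, right, key, num_nodes):
    """Join two datasets by first co-locating rows that share a key."""
    # This repartitioning step is the expensive part on a real cluster:
    # rows have to move across the network before any node can join them.
    left_parts = partition(left, key, num_nodes)
    right_parts = partition(right, key, num_nodes)

    joined = []
    for left_rows, right_rows in zip(left_parts, right_parts):
        # Each node now joins only the rows it holds locally.
        index = defaultdict(list)
        for r in right_rows:
            index[r[key]].append(r)
        for l in left_rows:
            for r in index.get(l[key], []):
                joined.append({**l, **r})
    return joined


orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 1, "amount": 7},
]
customers = [
    {"customer_id": 1, "name": "Ann"},
    {"customer_id": 2, "name": "Bob"},
]

result = distributed_join(orders, customers, "customer_id", num_nodes=4)
print(sorted((r["name"], r["amount"]) for r in result))
```

Denormalizing the data up front, for example by storing the customer name on each order, avoids the repartitioning step entirely, which is the kind of architecture choice mentioned above.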
Yeah, excellent. Yeah, it's interesting. I mean, Snowflake is, to my mind,
on one hand, very clever, you know, in that they've managed to get the best of both worlds.
But it's also interesting to rebuild what is essentially a
kind of on-premise data warehouse technology in the cloud, elastically. I mean, you kind of
wonder to yourself, given that its primary market is data warehousing,
whether it's been worth it
reintroducing things like constraints and so on in there. I mean, they seem to have
built a technology that is clever, but you wonder whether it's needed in
this new kind of setup, really. I don't know. Yeah, I think that
with Redshift in 2013 we got a tool that was pretty damn good enough for most use cases.
And yet we're going to continue to push SQL-based data warehouses further and further over the next decade.
And I think that tools will continue to look more and more like BigQuery and less and less like Redshift.
But at the same time, I just don't know that BigQuery is quite there yet.
It still requires a little more thinking than sometimes I'd like.
Okay, so let's get away from technology here.
And the last of the blog posts you wrote was really good.
And it was about, I suppose, the method and process by which startups then do their analytics.
And I think you talked about it and called it kind of the analytics workflow.
So again, what was the background to this?
And what were you trying to talk about?
And let's go through some of the details.
Sure.
The thing that we observed while we were at RJ Metrics, and so RJ had a little over 400 clients at the time.
And we had that kind of collective knowledge
of all of these companies. And you realize that
still no one's doing analytics perfectly and sometimes not even like that well. And it's not
really a tooling problem. It is that they're working in particular ways that don't end up adding up to make, you know, insight that everybody has
access to and is always current and all of that. So we just kind of asked the question,
how should people be producing analytics? What's the workflow that they should be using?
And that blog post was kind of our answer to that.
Okay. And I agree with you. I mean, I'm very conscious we spent 40 minutes of this conversation
talking about different tools and so on, but it's not what you've got,
it's how you use it really, isn't it? And I think something I've observed is that analytics in
all companies really, you know, but particularly startups and so on, is very kind of tactical.
It's spotty, it's not systematic and so on. And it's often in silos and not
collaborative. I mean, I think the first thing you talked about in this blog post was saying analytics is collaborative. I mean, what do you mean by that? And what prompted
that? And what are you trying to say there? So let's say that you've got a team of five analysts,
the default behavior in this kind of setting is that some manager asks one of the analysts to get a report on something.
And that analyst starts from a blank sheet of paper,
and they query the raw data from scratch, and they build up this report. And sometimes that will take the form of a 200-plus line-long SQL statement
that only they can read, and even they forget how it works a week later.
And so that becomes very fragile very quickly.
And so the core insight there is that you should collaboratively in this team of five people build up this ever-growing layer of business logic and everybody should be accessing this same library of existing business logic as opposed to starting from
scratch every time. That's interesting. So by business logic do you also mean things
like common definitions and metrics? Yeah so the way that we do it, it always
takes the form of database tables and views that are materialized in your data
warehouse.
So let's say you've got an orders table. We were talking about e-commerce before. So
you've got an orders table and you want to get revenue out of that. This is like the most
typical thing ever. But there are some test records in there and you need to filter them
out of literally every query you ever write. So instead of querying the orders table directly, make a view on top of that that filters out these
test records and then everybody can totally forget that they even exist. So that's like a
very simple example, but you can find opportunities all over the place to do that kind of thing.
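The view described above can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for a data warehouse. The `is_test` column and the `orders_clean` view name are assumptions made for illustration; the conversation does not say how test records are actually flagged.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        amount   REAL NOT NULL,
        is_test  INTEGER NOT NULL DEFAULT 0  -- hypothetical flag for test records
    );
    INSERT INTO orders (order_id, amount, is_test) VALUES
        (1, 100.0, 0),
        (2, 250.0, 0),
        (3, 999.0, 1);

    -- The shared piece of business logic: analysts query this view,
    -- not the raw table, so the filter is written exactly once.
    CREATE VIEW orders_clean AS
        SELECT order_id, amount FROM orders WHERE is_test = 0;
""")

# Revenue from the raw table includes the test order; the view does not.
raw_revenue = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
clean_revenue = conn.execute("SELECT SUM(amount) FROM orders_clean").fetchone()[0]
print(raw_revenue, clean_revenue)
```

Because the filter lives in the view, a later change to what counts as a test record only has to be made in one place.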
Okay. So going back to the conversation I had with Maxime about Airflow and Superset, one
of the discussions we had was whether you should try and build out a semantic model, like a business
model for the business. And his argument was that in startups it's very hard to get
agreement on a common definition of metrics and the structure of data and so on. I'm not
entirely sure on that myself. I think there is value in doing that. I mean, what's your take on
trying to build some kind of common business model for the business that describes
things in a kind of standard way?
Yeah, so I listened to that episode and I heard him say that, and I flagged that in my brain too.
I'm not saying it's wrong, but it's an interesting kind of point of view, isn't it?
Yeah, totally. And I have not worked at a company of the scale of Airbnb, so I'm sure that they have their own challenges that they're optimizing for.
From my perspective, if you don't have a semantic model, then you're going to really run into kind of organizational challenges around what is true and what is not true.
So, yeah, I think that that model is kind of the core of your analytics at a company.
And so, yeah, I guess I come down on the opposite side of that.
But where should that model be, do you think?
Do you think it should be in a tool like Looker, or should it be at a lower level in the data warehouse? And the reason that
we do it there is that when you build models in dbt, any tool that connects to your data warehouse
has access to that same library of business logic, business models. So then Looker can connect to
those, but we also really like Mode Analytics for a lot of use cases,
and Mode can connect to that same shared data.
You can connect to Jupyter notebooks and run data science jobs.
So we try to push things to that layer when it makes sense. But then at the Looker level, we like to define the metrics, the joins, the calculations, so that Looker knows how to take that
model data and turn it into reports and users can point and click with it.
Okay. What about metadata? I mean, in my old days of data warehousing, we had drilled into
us that metadata was important and so on there. And we don't hear it talked about so explicitly
in this kind of new world. Is it something you ever talk about with customers, and is it something that is a part of the projects you do?
So the word metadata can mean a lot.
Yeah, I was about to say. So there's lots of things. I mean, go through what you think it could mean in various cases, and where there's value and not value.
Gosh.
That's not a test, sorry. I mean, you know, things like data lineage, and things like what is the meaning of some measure, and that sort of thing.
Yeah, sure. So that honestly isn't something we think a ton about.
I think that source control ends up solving for a decent amount of that because if something changes, you can just look at the blame for it and you can say, OK, well, yeah, that used to be something else.
I guess that speaks to the kind of velocity of change in these organizations, isn't it? That is one of the things that is important. I mean, certainly on a project we're working on, we've
found a lot of value in having things like data dictionaries that would kind of give us the actual
meaning of a column and so on. But beyond that, things like data lineage and anything more
than that is never going to get done, because it's just not a priority, really.
Yeah, I think that software developers really like to think about how you produce code that is self-documenting.
So, you know, you've got things like Javadoc that can create whole applications for you that create the documentation. To the extent that we can, we write code that is readable.
And when inevitably some sections of it are more complicated,
we will document it in line with comments.
And then you can produce assets like a visualization of your DAG
that end up kind of helping folks see the bigger picture.
So I completely agree that if you're thinking about documentation, that's a big deal.
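As a rough illustration of the DAG mentioned above, the sketch below orders a handful of models by their dependencies using Python's standard-library `graphlib` (Python 3.9 or later). The model names and the dependency map are hypothetical; a real project would derive them from the model code itself rather than list them by hand.

```python
from graphlib import TopologicalSorter

# Hypothetical models, each mapped to the tables or models it selects from.
dependencies = {
    "raw_orders": set(),
    "raw_customers": set(),
    "orders_clean": {"raw_orders"},
    "customer_revenue": {"orders_clean", "raw_customers"},
}

# A topological order guarantees every model appears after its parents.
build_order = list(TopologicalSorter(dependencies).static_order())

# Print a plain-text view of the graph: each model and what feeds it.
for model in build_order:
    parents = ", ".join(sorted(dependencies[model])) or "(source)"
    print(f"{model} <- {parents}")
```

Even a plain-text listing like this lets someone new to the project see which models feed which without reading every query.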
Yeah. Yeah. What about quality and numbers adding up? I mean, again,
the world is kind of fast moving and we've got kind of like particularly lots of data and lots
of things in these environments and so on. How important is the accuracy of data, do you think,
on projects, and how much emphasis is put on that on projects that you've seen?
Yeah, I think that it is kind of an endemic problem today where, you know,
you have more data than ever, you have more ability to store it. So people end up capturing a bunch of data
and putting it in a data warehouse and then really having no idea, you know, is it clean, is it good,
whatever, until you go to use it for analysis and you realize, holy crap, this is not in good shape.
So one of the things that we do quite a lot, and that we've made much easier with dbt, is data unit testing.
So essentially defining standard tests on top of the data that's in your warehouse and allowing you to run those in a scheduled, consistent way.
And alerting you if for some reason a field you're counting on to never be null ends up being null.
Or a key that's supposed to map to another table for some reason doesn't have a parent record.
And it's not super hard to write tests like that.
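The two checks described here, a field that should never be null and a key that should always have a parent record, can be sketched in plain Python with SQLite. This is not how dbt implements its tests; the table names, column names and the helper functions `count_nulls` and `count_orphans` are hypothetical and only illustrate the idea.

```python
import sqlite3

# In-memory SQLite database standing in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES (10, 1, 50.0), (11, NULL, 20.0), (12, 99, 30.0);
""")

def count_nulls(conn, table, column):
    """Rows where a column that should never be null is null."""
    # Identifiers are interpolated directly, so pass trusted names only.
    sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(sql).fetchone()[0]

def count_orphans(conn, child, fk, parent, pk):
    """Rows whose non-null foreign key has no matching parent record."""
    sql = f"""
        SELECT COUNT(*) FROM {child} AS c
        LEFT JOIN {parent} AS p ON c.{fk} = p.{pk}
        WHERE c.{fk} IS NOT NULL AND p.{pk} IS NULL
    """
    return conn.execute(sql).fetchone()[0]

# Each test returns the number of offending rows; zero means the test passes.
results = {
    "orders.customer_id is not null": count_nulls(conn, "orders", "customer_id"),
    "orders.customer_id has a parent": count_orphans(
        conn, "orders", "customer_id", "customers", "customer_id"
    ),
}
for name, bad_rows in results.items():
    print("FAIL" if bad_rows else "PASS", name, bad_rows)
```

Running a script like this on a schedule and alerting on any non-zero count is one lightweight way to get the consistent checks described above.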
The hard thing is to make sure that you've done it in a way that is lightweight and maintainable.
And people can actually do it, because getting people to write tests like this can be really unpleasant and naggy.
Okay. And I noticed that there's a project that you're involved in, the Analyst Collective project, and I think
that's bringing together dbt and things you're working on. Again, just maybe explain what
that is and the other components, that analytics and data generator and so on.
Yeah, the Analyst Collective was actually kind of the precursor to Fishtown Analytics. This core of people were thinking about these
kinds of problems as we were seeing the industry evolve at RJ Metrics, and we decided to kind of create this space to build open source code and write about the solutions that we thought people should adopt to these problems.
Okay. And so, I mean, in general, how much do you and your company participate in these kind of projects and generally the kind of community scene and that sort of thing?
Yeah, so the answer to that is as much as we humanly possibly can. I have a real,
and maybe I'm an idealist, and I've talked to Lloyd Tabb at Looker who, he was a big open source person in the early days of open source and has different views than I do on this.
So maybe he knows better than I do.
But I really think that data technology is just so important to your organization that it doesn't seem to me to make sense to have it locked up in a closed source environment.
So the BI stack is, I think, moving further and further towards open source.
And the layer that is still mostly closed source is the actual visualization.
Looker and Mode are both closed source, and Superset is a great example of an
open source alternative. I don't think Superset is quite where Mode or Looker are, but
I'm really excited to see that part of the ecosystem evolve.
Okay. So just to kind of round things off, really, I mean, you've kind of danced around, obviously, what your company does and so on.
Just give us a bit of a kind of a two-minute thing on what Fishtown Analytics does and, I guess, how you engage with customers and how people can contact you if they're interested after hearing this.
Sure.
So Fishtown Analytics is an analytics consultancy that serves high growth venture funded startups.
We'll work with companies even after just a seed round and all the way up through IPO.
So, you know, if you're very early, we'll help you set up your data warehouse and connect your data and do some basic reporting. If you are much further along, then we'll help you build custom ETL jobs and we will help you do custom attribution models and write custom Spark jobs.
So we'll kind of span that entire gap. We work completely in sprints. Every sprint is two weeks
and you can cancel any time. So the goal is just to be really agile and easy to work
with. And we are really optimized for just kind of getting in and doing the work and having fun in
the process because I actually do think a lot of this work is really fun to do right now.
It's great. I mean, I love it. It's great, isn't it? I mean, I think certainly looking at your
web presence and looking at your articles and so on, it comes across as somebody who kind of gets the technology,
but also gets the kind of the basic ideas behind analytics
and also how to run software development projects.
And I think that's a good combination really, isn't it?
I think, you know, the technical knowledge, the kind of common sense,
and actually the kind of the understanding at very root level of how analytics works.
I really appreciate that.
Yes, good. Excellent. Well, look, it's been great speaking to you. How do people find you on the web and how do they contact you?
Sure. Just find us at fishtownanalytics.com, and if you fill out the form on that website, I
will get it and probably respond to you even if it's 3 a.m.
Excellent. It's been great speaking to you, Tristan. Thank you very much for coming on the show, and have a good evening.
Thanks, you too.