Drill to Detail - Drill to Detail Ep.80 'Data Architecture and Data Teams at Hubspot' with Special Guest James Densmore

Episode Date: May 5, 2020

Mark Rittman is joined in this episode by HubSpot's Director of Data Infrastructure James Densmore to talk about distributed and remote-friendly data teams, DevOps with dbt and Apache Airflow, and career path options for data engineers.

Links mentioned in the episode:

How should I structure my data team? A look inside HubSpot, Away, M.M. LaFleur, and more (dbt Blog)
Software Engineers do not Need to Become Managers to Thrive (Data Liftoff Blog)
The Misunderstood Data Engineer (Data Liftoff Blog)
Modular ELT (Data Liftoff Blog)
Test SQL Pipelines against Production Clones using DBT and Snowflake (Dan Gooden)
HubSpot Data Actions, Harvest Analytical Workflows and Looker Data Platform (Rittman Analytics Blog)

Transcript
Starting point is 00:00:00 So welcome to Drill to Detail and I'm your host Mark Rittman. So as founder as well as technical lead for an analytics consultancy, my day is typically spent half in dbt and the other half in HubSpot. And so I was particularly interested to find today's guest, HubSpot's Director of Data Infrastructure, James Densmore, talking about his local dbt meetup in Boston the other day on Twitter. And as I've been fascinated over the past few years thinking about what kind of data analytics infrastructure SaaS companies like HubSpot must have, and how someone working in that type of role thinks about data analytics, data infrastructure, and data engineering, I'm very pleased to be joined today
Starting point is 00:00:49 by James. So welcome to the show, James. It's great to have you here. Yeah, thank you. It's great to be here. So James, just tell, if anybody who, well, if anyone doesn't know you, just tell us what you do, what your role is now, and how you got into what you're doing now. Yeah, so my role right now is Director of Data Infrastructure at HubSpot. And what that means primarily is focusing on the data infrastructure for the BI and analytics kind of core team at HubSpot. And I kind of had a journey to get there; it started really in software engineering. So I know people get into data from a lot of different directions.
Starting point is 00:01:24 Some people come in, you know, they're more business oriented. Some come from a technical background. And I was very technical. Came out of a computer science program, software engineer. And I was at a company called Wayfair, which is now a pretty big e-commerce company. But at the time, it was a pretty small startup in Boston, Massachusetts, and I was a web developer. And the sort of growing BI team was very small there. And they needed some help getting started with some more of the technical work behind their clickstream database, which back in, you know, about 2010 was still quite a lift, in the olden days, if you will. So yeah, I got started there and just took an opportunity to kind of like literally move across the floor of the office and work in the BI team. And I was working with, you know, we were on Microsoft SQL Server Analysis Services.
Starting point is 00:02:18 So, like, you know, building OLAP cubes. And we were just starting to think about migrating from just a pure SQL database to Netezza, which I believe got bought out by IBM since then. So that was kind of my entry into BI. And since then, I've moved on to a couple of different roles, leading teams at Safari Books, which is O'Reilly Media, as well as another startup. And then in between, did some consulting at a company called Data Liftoff that I started, and, yeah, that kind of led me to HubSpot, and that's where I am today. You mentioned there Netezza and Analysis Services. So, I mean, OLAP as a kind of concept and technology, it's not something you hear about much these days, but, you know, at the time it was very interesting, wasn't it? And the performance and the speed and the interactivity that you got from those kind
Starting point is 00:03:07 of databases and those tools was very good, wasn't it? Yeah, and I will say that I had never heard of it before I kind of jumped into that role. And, you know, with my background in software engineering, you know, you're in relational databases, and you're not thinking about the speed and performance of those large data sets from an analysis standpoint. So I'll be honest, it took me a little while to get my head around it. What is this thing doing? And then once you get it, you can almost never really move past that way of thinking. Even today, when I'm not building OLAP cubes, I still think dimensionally with data. So I think anyone that's worked with that kind of keeps it with them, I've found. Yeah, exactly. And I always say that dimensional modeling and Kimball and so on are things that not many people know about really, or certainly a surprisingly small number of people know about when they're working with, I suppose, tools like dbt and doing data modeling and so on these days, but those kind of golden rules about how you design a dimensional model, and, I suppose, in a way, the different roles of the different layers in the sort of data infrastructure and data architecture, they're really useful skills to have and really useful concepts to understand, aren't they? Absolutely. I mean, even today, when, you know, we're building out, like I said, you know, I use dbt, or, you know, HubSpot does, and I've used it in the past. It is one of those things, even building data models in that sort of modern, if you will, way of doing things, I still love to see candidates who have, you know, knowledge of Kimball or some kind of understanding of dimensional modeling, because I think there's some best practices there that, whether we kind of connect the dots now or not, we're still doing those things, even in new ways. Yeah. And I suppose e-commerce and web analytics and so on back in those days
Starting point is 00:04:57 were the really big, I suppose, use cases and consumers of data that led to things like Netezza. I mean, what was it like working with that volume of data then, which arguably would be laughable now, but it led to you having to go and get, like, multi-node MPP systems, you know? What was it like working in that space at the time? Right, yeah, nowadays it's, you know, something that doesn't seem so crazy, but at the time, you know, we're thinking about a clickstream database for a large e-commerce company. That was the coolest thing. It was so difficult to go from these smaller data sets
Starting point is 00:05:32 that we were accustomed to working with and then apply the same characteristics to our work with the larger ones. And that was my first realization of, it wasn't just about optimizing anymore. It was about thinking of a totally new way of doing things. And yeah, I mean, that data, we started to, you know, you sort of started the old way, you know, which was like, yeah, how can we just, you know, put more indexes on our tables
Starting point is 00:05:58 and partition, you know, our cubes and things like that to, you know, to actually process and get any value out of this data we have to think about it uh in a more modern way and not just doing things incrementally better so that was the first time i i sort of ran into an architectural change um and you know that that has kind of carried forward as well okay and and so then you moved on to the work at safari books and i noticed um that you've worked with things like redshift and python and postgres and so on there so that struck me again as a bit of a change in terms of the products you were using and the way in which you develop and so on
Starting point is 00:06:35 I mean, what was it like moving from, I suppose, you know, the world of things like Netezza and very structured data sources and so on into working with those tools and, I suppose, more cloud technologies and open source? Yeah, that was the first time I really dove into cloud. And, you know, so when I started at Safari, we were really just putting together a centralized BI team for the first time. You know, the company had kind of spread that out a bit and had been doing a lot of this work, but not under one umbrella. And we had been using a lot of different technologies. And I saw the opportunity to say, we could do something like Netezza, which is extremely expensive. For those that haven't used it, you literally get a rack, a server rack that you have to manage type of thing. And we wanted to
Starting point is 00:07:26 kind of move beyond our Postgres data warehouse, which is what we had, and think about how we were going to scale. And Redshift was really kind of just getting, it may have been around for a couple years, but really had started to reach a level of maturity that we wanted to take a risk on it. And honestly, the biggest problem at that time was convincing everybody, especially, you know, security and legal to say, we're going to go to the cloud with our data. You know, it's a little bit easier today, but at that time, still a lot of pushback. So, you know, we took advantage of that. And really, it kind of changed a lot for all of us working there.
Starting point is 00:08:05 You know, we were just immediately able to start to think about not just Redshift, but, you know, what else do we move to the cloud? And what else can we do, you know, in a more scalable way than we were doing on our own infrastructure? So did you find that your team were the ones that led the move into the cloud, you know, analytics and data and so on? Or was it the fact that the rest of Safari books was moving into the cloud made that then acceptable and easier for you to do?
Starting point is 00:08:30 Who led it? Was it you or was it the rest of the business? It was mostly the data team. I think we were the first ones to go there. You know, there were other teams that were, I would say, kind of poking around more like third-party SaaS tools, rather than thinking about the cloud infrastructure as where they were going to build their own type of thing.
Starting point is 00:08:51 Okay. And what was Pentaho like to work with? I mean, again, I suppose there's a whole range of products in that suite, but it's something I've had an eye on in the past. But what was your kind of experience with Pentaho then? Yeah, I think it was good at what it did, which was provide an open source analytics platform where you could build things like OLAP cubes, reports. You know, I think it had a bad reputation as far as it wasn't as glossy as things like Tableau and other tools. But I found that, you know, we started, there's an open source version and then there was a commercial version. And we did move toward the commercial one.
Starting point is 00:09:31 And I think they did a pretty good job of saying, you know, here's what the community version offers. The community was really strong. And so that was a tool that I think was another one where it was kind of a stepping stone for us, you know, in between people were using Excel sheets, right? Kind of just connecting down to the data sources or doing flat files. But we weren't quite ready to move towards some of the bigger enterprise BI tools.
Starting point is 00:09:58 So it really kind of filled that gap for a long time. And we got a lot out of it. Okay. Okay. And so you used Redshift. Did you ever get, I suppose, diverted or looking at things like Hadoop at all? So do you ever look at things like Hive and stuff like that? Or did you go straight to Redshift as your sort of solution? Yeah, we actually did both. So we had a Hadoop cluster, which was managed on-prem. So we had a really great SysOps team there, which is one reason we hadn't moved to the cloud yet.
Starting point is 00:10:27 And so they were managing our Hadoop cluster, which for me, that's always the hardest part is the administration of the Hadoop cluster. But we use that to process some of our larger datasets and then take some of that output and send it to our warehouse. So we were running those in parallel. And I still find that today,
Starting point is 00:10:45 a lot of companies are still doing that. They've moved to Hadoop or Snowflake, or sorry, they've moved to Redshift or Snowflake or BigQuery, but they're still using Hadoop or something similar to do some of their processing. Okay. And was it around this time that your interest in data infrastructure and data engineering came out? What led to that particular focus in your career, really? Yeah, I want to say it was probably Redshift. It was probably, not Redshift in particular, but the advent of really taking advantage of the cloud and just the way that that changed what we could do with data. It became less of a, less of, you know, an, you know, analysis first for me as a, you know, it was, we can do a lot of this analysis because of the technology where before the technology was holding us back. And so it just felt like a world where,
Starting point is 00:11:40 you know, you could do all these things that you've always wanted to do. For me, obviously, that was only a couple of years into my data career, but there was a lot of pent-up demand for that kind of stuff. Excellent. Excellent. And so data liftoff then. So tell us a bit about that and I suppose how that started and what that is now and I suppose your work as a consultant and in that kind of area, really.
Starting point is 00:12:04 For years, I had this dream of helping companies that didn't have a data team, whether that would be because of funding or because they just didn't have the right leadership in place to identify the need, and really help them kind of get off the ground. It's data liftoff. And it was something that I had been kicking around for a while. And I
Starting point is 00:12:27 finally decided to just go forward and do it as a consultancy. And the reason I did that was, there's something about being new to a company. You're naive in a good way, right? As you know, like working with companies, you come in and you're not held back by the way things were always done or, you know, a lot of that history that once you've been around for a while just becomes part of you. And so it was a great opportunity to try to go out and help companies really kind of step back from where they were and look for what could be done to get them kind of moving in the direction where they could eventually become self-sustaining. So it was less about going in and trying to stay for a long time as it was about looking for needs, building a plan, and then actually helping build the initial infrastructure, which they would then own. And as I eventually moved into HubSpot not that long ago, I've kind of kept it around as more of a content library,
Starting point is 00:13:27 but also something where I'm trying to share as much content, both free and low-cost content, for individuals who are just getting into that business. Because I did find along the way that there's a lot of great, you know, analysts, data analysts, who if they were just a little more technical, could help their company, you know, on the infrastructure side, or the other way around, you know, maybe a data engineer, who they just know so much about the infrastructure side, but need a little bit more seasoning on the analysis or data science end of things. And so it's sort of been kept around, and I found a lot
Starting point is 00:14:05 of people were interested in that kind of content of, you know, they have some data background but they want to move to an adjacent role in data. There's definitely an itch there. Okay, okay. So let's talk about HubSpot then. So I mentioned HubSpot earlier on, it's a tool that I use quite a bit, but I know that HubSpot is more than just maybe the part that I use. But just maybe, for anyone who doesn't know what it is, just talk about the company and what its business is really. And I suppose the role of data, or certainly give us a flavor of what data means to HubSpot. Yeah, so HubSpot is, you know, it's a public company.
Starting point is 00:14:42 It's been around for, you know, 15 plus years, which seems like a long time now when I say it. But, you know, I think they're best known for their CRM, which is, I believe, what you're using. And an all-purpose business to help your business grow. It's all about helping our customers grow. And so it's moved beyond the CRM into what we call different hubs. So there's a sales hub, a service hub, a marketing hub, a CMS hub that we just launched. And all this is connected to your CRM, you know, to manage your business across both the sales, you know, the support, the marketing, all those aspects. So you can do anything from build a website, build landing pages, A-B tests, you know, send marketing emails.
Starting point is 00:15:40 And really it all becomes one integrated product on the HubSpot platform. So it's getting to be a big company. We're over 3,300 people now based out of Cambridge, Massachusetts, but we have a large office in Dublin, Ireland as well, and a few other small offices around the world. And then we're building up our remote capabilities, which couldn't come at a better time, to be honest, now that we're all distributed anyway. So, you know, we're really all over the world and now, you know, all at our homes. Okay.
Starting point is 00:16:11 I mean, so I use HubSpot to, as I say, store all our customer details and particularly the fact that you can add in custom properties and all these kinds of things. So I link it to segment, I link it to everything else. And, you know, even I have a huge amount of data in there um so the volume of data that you must deal with in aggregate must must be immense really i mean just give us a sort of sense of of of the scale of the sort of things that you do really and um i suppose you know how much data you're typically working with on a daily
Starting point is 00:16:39 basis. Yeah, and I think it's important to distinguish between the analytics data we collect and then our customers' data. So the data that you're storing, we don't bring that into our analytics infrastructure. You own your data. That's certainly something that we don't want to analyze at that level of detail. And so one of the challenges is, because of that volume of data overall, as well as all the data we're collecting, you know, just as far as our own analytics, you know, we use HubSpot to sell our product as well. So we have our own customer data in there.
Starting point is 00:17:18 And so one of the challenges is filtering through A, what's valuable, and B, what can we actually use, you know, for decision-making in an ethical way. So, you know, like I said, our customers' data, our customers' data about our customers' customers is off limits there. But data about our own customers,
Starting point is 00:17:38 certainly, you know, that's data that just like you can, you know, move your HubSpot data into other tools and move data into HubSpot. We're doing that at a massive scale. And it's pretty neat to be running your own business on your product, something that I haven't been a part of before. And so that data is, as you can imagine, is it's all sorts, right? You know, we have well-structured data, like our, you know, just like our list of customers and all that,
Starting point is 00:18:10 our list of prospects. But we also have a lot of, you know, raw text data. We have, you know, JSON event data, because as we were talking before about clickstream databases, you know, we still collect all sorts of analytics events on all of our sites.
Starting point is 00:18:27 We do a lot of inbound read content, so we do a lot of blogging. It kind of fits with our narrative. And that data is really valuable. So there's a lot out there that it's really anything that you might get out of a traditional CRM. But then you add on all these other components that we're running our business on, we have a massive customer support team and we're using our product to manage that. So there's a lot there that it's everything that you can imagine. But I would say our largest data sets are definitely around our web analytics and events and really what all those customers and all just
Starting point is 00:19:02 the visitors to our blog and other sites are visiting. Okay. So what was, I mean, it sounds like a silly question to ask you now, but what was it that attracted you to work at HubSpot? I mean, I imagine what you just said to me is the reason, but why did you end up there and what kind of made you maybe give up the idea you had with, you know, with Data Liftoff and commit to working with HubSpot? Yeah, I would say I was really sold on... There's two things. One is HubSpot as a company has a tremendous reputation
Starting point is 00:19:31 and it's well-deserved. It's just a company that really cares about their employees and customers. And I know a lot of companies say that, but spending time there, it's really true. And I think it's a culture that I don't run across very often, top to bottom. And so that was just the opportunity to work for that kind of
Starting point is 00:19:52 organization was always tempting. But what really got me in on the data side was the fact that HubSpot right now is investing a lot in their data infrastructure and their data team. It's clearly a strategic priority. And I've always been attracted in the past to building a group from the ground up. Did that at Safari, did that at a company called Degree that I was at, focused on data science more there. But this was an opportunity to come in at a time when an existing team was doing a really good job. And on top of that, they were going to get investment to really, you know, add leadership and add, you know, engineers and PMs and everything that you would need to really build out that organization. So I had never really been,
Starting point is 00:20:39 I'd never seen an opportunity where I could jump in to something that was already, you know, the groundwork was already there, but then have resources that you'd never have at a startup. You know, you just, you don't walk into a startup typically and have the ability to grow your team like that from day one. So there's a lot of unique challenges in that, but for me, it was something I'd never done before. And for me, that's a really great learning opportunity. Okay. Okay. That's a great lead-in really to the next thing I want to talk to you about, which is how a team's structured in a SaaS business like HubSpot. Tell us,
Starting point is 00:21:20 paint a picture of what the data team looks like and I suppose how that team interacts with the rest of the business. And maybe just talk about what your role is in there as well, just give us some kind of context. Yeah, so HubSpot has a pretty unique model, um, not totally unique, but a little bit different than I've been exposed to in the past, where I've been part of centralized BI teams, um, which definitely work up to a certain point, where, you know, you have your data infrastructure team within it, you know, doing the more technical work, so your data engineering and all that. You're building your data warehouse on that team. And then you have analysts on that team. And then I've been at companies where it's completely decentralized, where really you have a core data infrastructure team that's handling your ingestions, maybe some transformations, but really just making sure that, you know, your warehouse has the basic building blocks, and then the analysts are distributed in each department. HubSpot, we're doing it a little bit differently. And this is where we're really growing. We're kind of a hybrid model where our core business intelligence team, we have a data infrastructure team, which is what I oversee, which is our data engineers doing our ETL work, as well as what we call data warehouse engineers, who are building our core data assets.
Starting point is 00:22:33 So they're more in the fundamentals of the data warehouse, taking that ingested data and building out our first line of what we think of as vetted data models. So ones that we trust, we have great validation on. And then we're then working with, we have some analytics engineers, which is a job title that I think has been popularized by DBT. So we have some of those on our team who are, so we have a small number of those on the centralized team working on what I would consider kind of like the global data assets for the company. But then we have a large community of distributed analysts across the company, hundreds of them, who are also empowered to build their own data models.
Starting point is 00:23:19 We use Looker as our dashboarding tool and visualization. So they're also building their Looker models and dashboards. So we're trying to do it both ways, which honestly is a challenge, but it's something that does scale really well if you're well-coordinated across the company. And I think we are that kind of company that even though we're on the larger side, we have a pretty tight culture amongst those analysts in our business intelligence group that it does work. So it is a little bit unique, but it's something that it can just scale as we continue to grow the company.
Starting point is 00:23:56 Okay. So in terms of technology, I mean, is it homegrown technology you use? Do you use off the shelf? At that scale, in the kind of work you do, how does the actual technology look? Yeah, it's a mix. Um, you know, I think, well, I'll talk about the tool set first. So, um, most of our data ingestions, you know, a lot of that is homegrown. We're doing a lot of custom work, both from internal systems as well as third-party SaaS tools that we use out there. And then a lot of our – we use a lot of open-source tools. We use Apache Airflow for data orchestration. So certainly one that a lot of people are familiar with.
Starting point is 00:24:40 So we use Snowflake as our data warehouse. And I know that's growing in popularity. I think we were one of the earlier adopters. So we've actually had it around for quite a while now. It's pretty comfortable there. And then on the top of our stack is Looker, as I mentioned. And that's where all our analysts and users are. So in between all that, to kind of grow it all together,
Starting point is 00:25:03 there is a lot of homegrown data ingestions and processing and validation. Where we see a good fit, we always kind of look at a tool that's already there. I think we tend a little bit more to the build versus buy, but it really is. We take every new project that comes along and we we have that conversation we don't jump you know some companies jump right to like what's the sass tool that will do it for me and some always build uh we try to make it a conversation we have up front of is there you know is there a great tool out there is there an open source framework or is it something we need to build ourselves so i guess probably another big factor on that is the cost of running it day to day i mean that
Starting point is 00:25:45 again about the volume you process and analyze. I mean, how much does cost, and it's maybe a silly question really, how much does cost drive your choice over this and how you build things? Well, I think with things like Snowflake, you think about it in a different way than you used to. Um, you know, when we're talking about, uh, we have conversations, and Snowflake is the first time you can really do this at scale, and think, you know, what will the cost be for me to run, you know, double the size warehouse, or, you know, things like that, where you're starting to, you can break down costs into tiny increments. And I think other products are moving in that direction as well. So we have that discussion with a lot of others.
Starting point is 00:26:26 You see things like Stitch, you know, they're in a volume-based pricing model. You know, so for doing that ETL work, you know, that's always a conversation that you have to have of, well, this is great, you know, but what's the volume pricing as you start to get these larger data sets? And what makes more sense to have an engineer customize. But it is a big part of it. But I think the other part of it is, what fits well with the rest of our systems and what limitations are there if we were to go out and replace a custom setup with tooling.
Starting point is 00:27:01 So cost is definitely one component, but it's not the only one for us. Okay, okay. So, I mean, you mentioned earlier on about, you've got a distributed or sort of hybrid model where you've got analysts that are out in the business and working for you as well, and you're using dbt and using all these kind of things. So how do you, I suppose, in a way, take the ideas you've got on the structure and the workflow and so on you've got at the center and then have that used in the business? And maybe are there times in which maybe you do things differently in the business? How does scaling out that idea around analytics engineering work in a company like HubSpot? Well, some of it has been pretty organic, uh, it's actually come from our analyst community. So, you know, I would say, even though it's distributed, like I said, I think those analysts, even from different teams, do a really good job of kind of collaborating, knowledge sharing, you know, going to some of the same meetups. So there's a lot of ideas from that community that are generated and come back to our team, where they might say, you know, hey, we'd love to invest in this, or we'd love to, you know, make sure we have tests that check for that. And we'll certainly help support that.
Starting point is 00:28:12 Some of it comes internally and, you know, we'll do things, you know, we've kind of teamed up with those analysts on how we do code reviews, you know, and like how we start to get some standardization around documenting data models and tables and columns. And so we, we try to make sure we provide the tooling and we let that community make use of it. But it, again, it does work both ways. And it's great when an analyst comes to you and says, ah, you know, dbt is great. How do I, you know, how do I get this to be more widely adopted or something like that so it does become it's almost like an internal uh you know if you think about like some of these communities that have sprung up dbt uh not just a tool but it's it's a great community we kind of
Starting point is 00:28:58 have that system internally, and that's really the only way we can make it work. You know, it's not something that, across a large company, you can set, you know, big directives and make sure everybody's following them. It really has to be a little bit more organic, and everyone has to buy in. Okay, so do you have one warehouse for the entire HubSpot, or is it kind of more distributed? I mean, what's your, I suppose, data architecture and warehouse architecture like in that respect? Yeah, we certainly, you know, those kind of core data assets, that's in a single warehouse that we own. We make sure we have certain SLAs around that. But the great thing about Snowflake, it's really easy to kind of spin up more of like a data model.
Starting point is 00:29:37 And so we certainly can do that as well. And it gives the analysts, you know, that are in each department a lot of autonomy. If they had to wait on, that's been the pushback on these more traditional centralized BI teams. It's been, I just want to add this one column, right? And I have to wait. And now that the tooling out there, like dbt and other tools, have caught up and really empowered analysts, they don't want to wait anymore, and they really don't have to. So we try to make sure we have the right balance, you know, we can have our core data warehouse, but there's no reason that these kind of satellite data marts can't exist, and tools like Snowflake make that possible. Okay, so, I mean, in the last episode we recorded, I had Drew Banin on, and we were talking about, I suppose, the applicability of tools like dbt and, I suppose, in general these modern data stack tools that maybe had their roots in startups, and how well those concepts still apply to larger organizations that have, you know, I suppose, more governance and more requirements around things like, I don't know, things like sort of production support and all that. I mean, how well do you think these tools fit with a company like HubSpot? And how much have you been tempted, I suppose, by the big vendors that come in and say, well, actually, now you're bigger, you should be looking at a solution from IBM or Oracle or whatever? Yeah, I think tools like dbt have a great chance
Starting point is 00:31:03 in the larger organizations. You know, there are certainly some, I would say, you know, if you think about the really large, you know, you know, we're, so we're about 3300 plus people. You know, we're not so big that, and we're not in a highly regulated industry. So we do have some advantages where we can still be pretty nimble. And so I'm always less tempted by the really big enterprise, you know, tools that want to come in and more excited about things like dbt. And certainly we, you know, data governance and, you know, security and data validation and all these different things, they matter a lot to us. But I think for something like dbt, they know that, you know, and that community is really built around kind of, you know, setting good methodologies on how you model data, but also giving you a lot of the ability to plug in your own tools and your own systems.
Starting point is 00:31:59 And that's what we've been able to do with DBT as we use it is, you know, we can still, you know, run tests and validate data and uh ensure that you know models are run we can orchestrate them through airflow if we wanted to or whatever orchestration tool we want it really is a nice product in terms of it's not just a product it truly is a platform i think that's the kind of things i always look for so so you mentioned you mentioned airflow a couple of times there And I think a question a lot of people have is, what problem does that solve beyond what you can do with dbt, say? And how do you get that balance of where do you put the functionality?
Starting point is 00:32:35 How do the two tools link together in your organization? And why do you have both of them there, really? Yeah, we look at Airflow as more of a general orchestration tool. So whether it's building our data models and managing when those are built and dependencies, it's great for that. But it's also great for managing downstream dependencies. you might want to use it to, you know, fire off some extraction loads early in the process or run some tests, you know, run like a more custom test suite. We also have, you know, we have a mix of kind of legacy data models, if you will, and dbt data models. So Airflow is nice in that we can use Airflow to tell dbt to execute its run and let dbt build out its own DAG dynamically.
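A minimal sketch of what that hand-off can look like in practice, for anyone reading along: an Airflow DAG whose dbt steps simply shell out to the dbt CLI, leaving dbt to resolve its own model-level dependencies. The paths, schedule and task names here are illustrative assumptions, not HubSpot's actual setup.

```python
# Illustrative sketch only: an Airflow DAG that hands model building off to dbt.
# The project path, schedule and task names are assumptions, not HubSpot's setup.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/analytics/dbt"  # assumed location of the dbt project

with DAG(
    dag_id="daily_warehouse_build",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Upstream work Airflow is well suited for: custom extracts, file loads, etc.
    extract_loads = BashOperator(
        task_id="extract_loads",
        bash_command="python /opt/analytics/ingest/run_extracts.py",
    )

    # Hand off to dbt, which builds and runs its own DAG of models internally.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt run",
    )

    # Validate the freshly built models with dbt's schema and data tests.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"cd {DBT_PROJECT_DIR} && dbt test",
    )

    extract_loads >> dbt_run >> dbt_test
```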
Starting point is 00:33:27 But we can also have our DAG of other tables and other data models we build in a different way. So I think Airflow, it's great as a generalized tool. It certainly doesn't make it as easy to solve, you know, a lot of the dependency challenges that dbt just does for you, you're right, you know, within your data models. And I think the more people use dbt, the less likely they'll want to go and kind of build out their own dependencies and manage those in Airflow. But it's still a very useful tool. And for us, something that we've had around for a while and I think we'll continue to use. So how do you share knowledge and things like data dictionaries and data catalogs and so on with the rest of the business?
Starting point is 00:34:18 How do you share that knowledge and promote, I suppose, use of the right metrics and so on? What was your thoughts around that? Yeah, it gets harder as the organization gets bigger. And, you know, I've been at some really small companies where, you know, the challenge was more, when do we find the time to write documentation, right? That's always the thing that people struggle to find time, but they know is important. At a larger organization, I think we do a good job of writing it. But it's about how do we publicize it and socialize across the organization. So we do have a data catalog, something we built ourselves. And it's something that as part of our, I mentioned code reviews, as new data models are built, we sort of have a check to say, is this in the data catalog? How is it documented? Have we established the owner for it? Does it have tests? And so we're trying to do more and more of not just writing documentation and putting information in our data catalog, but standardizing how we do it so people know where to find
Starting point is 00:35:16 it. So we have a great internal wiki at HubSpot that we use for everything, which is great, but it also makes specific things a little bit harder to find. So as we, as we build out documentation, especially on like a new data model, or, or even building out more like business requirements around, you know, why did we build this in the first place, you know, that should be in there. We're trying to standardize how those are, are written and tied together together and make sure that people know where to start when they're looking for an answer of, you know, what does this metric mean?
Starting point is 00:35:49 Or what are the dependencies of these tables? And so we're not only doing the documentation, but we're also trying to build tools that make it easier to find and easier to generate as well. So there's a lot of investments we're making right now in that tooling just to make it, you know, we want to make it easy for people to not only write that information, but also find that information. Okay.
Starting point is 00:36:16 So again, I suppose the volume of data, but you've got a lot of data points and a lot of people using it. How do you ensure that it all actually adds up correctly in the end and people can trust the data? I mean, what's your strategy around things like testing and generally making sure that everything is trusted and consistent, really? Yeah, it really is about, you know, as we're building out any kind of, you know, I've been calling it a data asset. I don't know if that's the right word, but a new data model or even a set of models, you know, around a particular either business line or product feature. Early in the process, we try to establish how we're going to measure consistency and validity of data.
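A purely hypothetical sketch of one way to encode that kind of consistency check, alongside the fixed schema tests a tool like dbt provides: a small function that flags a metric value sitting unusually far from its recent history. The threshold and the sample numbers below are invented for illustration.

```python
# Hypothetical sketch: flag a metric value that sits far outside its recent history.
# The threshold and sample numbers are invented for illustration only.
from statistics import mean, stdev


def is_suspicious(history, latest, max_z=3.0):
    """Return True if `latest` is more than `max_z` standard deviations away from
    the mean of `history` (recent daily values for the same metric)."""
    if len(history) < 7:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return latest != history[0]
    return abs(latest - mean(history)) / spread > max_z


# Example: fourteen trailing days of a revenue-style metric, then today's value.
trailing_days = [102.0, 98.5, 101.2, 99.7, 103.4, 100.8, 97.9,
                 101.5, 99.2, 102.7, 100.1, 98.8, 101.9, 100.4]
print(is_suspicious(trailing_days, latest=131.6))  # True: worth a human look
print(is_suspicious(trailing_days, latest=101.1))  # False: within normal bounds
```

A check like this is better treated as a prompt for a human look than a hard failure, since, as discussed here, a large real-world event can move a metric legitimately.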
Starting point is 00:37:01 So it's great that we have tooling where we can run tests, you know, and dbt has, you know, test capabilities. There's all these things you can do, but as you mentioned, you kind of have to know what you're looking for and what to test for. So we try to make that part of the requirements gathering process of, you know, what are the things that we expect in this data? You know, revenue is a great example. Every company has this challenge, right? You know, you try to make sure that your revenue numbers are right. Every company you work for has multiple ways to measure revenue because there's, you know, different business lines, there's different ways you can slice it. There's, you know, accounting views by time period. And so really understanding what those are. And I find when we're making the
Starting point is 00:37:45 documentation, by that point, you really know what you're testing for as well. You know, if you've been able to truly define what a metric is, you can go back and write tests to make sure. Where it gets a little challenging is making sure it's within certain bounds, or doing some statistical testing, you know, to see if there's fluctuation on a certain metric, you know, over time. That's been challenging in today's age, you know, when large events in the world happen, metrics change wildly and set off all those alarms and you go, oh, wow, you know, does that really, is that an alarm or is it not? But, you know, we try to do everything from, we really try to start with the basics of understanding what the value, or sorry, what the metric means, write some tests to that, but then do
Starting point is 00:38:31 some more of those generalized, you know, statistical tests, and really kind of looking for outliers and other things like that. Okay, so what balance do you have between, so you mentioned Looker a couple of times earlier on, and obviously Looker can itself be a repository of, I suppose, business information and business logic and calculations and so on. And how do you make the decision between what goes in Looker and what goes in the underlying data warehouse and dbt and so on? What was the kind of thinking around that? You know, we try to start with, you know, who's going to use the data and for what purpose. And, um, because we have that distributed team all able to build off of our warehouse, if it's data that's key to the business as a whole, we try to bring that down to the warehouse and then allow Looker to derive different views into it. If it's data that's very specific to a single analysis or, you know, product feature or something, individual team,
Starting point is 00:39:30 we're going to give them more leeway to kind of build that either in Looker or their own data mart, because it's really not shared. You know, there's not those shared dependencies. There are times where even a very specific analysis is extremely, it's either extremely important to get exactly right, or it's a really heavy lift in terms of processing.
Starting point is 00:40:07 And in those cases, we try to go and bring that down as far down on the stack, if you will, as we can. We don't want to make Looker the place where it's generating absolutely crazy SQL statements and hitting the warehouse. If we can say, oh, and we look for those, we have automated alerts that look for what we call our most expensive queries, which in Snowflake is actually true. And we try to bring those down and build the aggregates and build the data models to enable that so it is both a risk assessment as well as one that's more of like a you know kind of cost of processing assessment okay so um i've used um i suppose you i've used snowflake on a couple of projects where um dbt has been involved and and and and we've been thinking about how we do i suppose things like cicd and and how we can leverage some of the features in, say, Snowflake around, I suppose, virtual warehouses and things like that. How do you, are there any particular ways in which you leverage the special features of Snowflake for when you do kind of development and testing and maybe creating virtual environments for things like that?
Starting point is 00:41:02 How does that kind of work? And is there anything you do around that area? Yeah, I would say one of the most helpful things that we can do is just immediately spin up like a clone of a warehouse. And that's something that, you know, coming from having worked with Redshift a lot is really not feasible.
Starting point is 00:41:18 You know, right now we can be working on a specific project and, you know, we don't want to just do it in our kind of global dev environment because it's just too big of a change or it's going to go on for a long time to just be able to say i'm going to clone this this warehouse and work on it has been uh the amount of time saving there and uh really the what it's allowed in collaboration of development as well where you can you know invite other people to use your clone.
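A minimal, hypothetical sketch of that clone-based workflow, using the Snowflake Python connector; every account, database, role and warehouse name below is invented, and the warehouse resizing at the end is the other feature discussed just after this.

```python
# Hypothetical sketch of a clone-based dev workflow in Snowflake.
# Account, database, role and warehouse names are all invented.
import os
import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="ANALYTICS_DEV",
)
cur = conn.cursor()

# Zero-copy clone of the production database: near-instant, and it shares storage
# with the original until the clone starts to diverge.
cur.execute("CREATE DATABASE ANALYTICS_DEV_CLONE CLONE ANALYTICS")

# Let teammates collaborate against the same clone.
cur.execute("GRANT USAGE ON DATABASE ANALYTICS_DEV_CLONE TO ROLE ANALYTICS_DEV")

# Dynamic resizing works the same way: size compute up around a heavy build,
# then size it back down once the job has finished.
cur.execute("ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'XLARGE'")
# ... run the heavy job ...
cur.execute("ALTER WAREHOUSE TRANSFORM_WH SET WAREHOUSE_SIZE = 'XSMALL'")

cur.close()
conn.close()
```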
Starting point is 00:41:46 That has been just amazing. The other thing we take advantage of a lot is being able to dynamically scale the warehouse. So again, I'm a big fan of Redshift, but it is, once you've sized your warehouse, that's kind of where you're at. And we can do things like certain times a day, scale up the warehouse, because we know more people are using our Looker reports or there's more development going on.
Starting point is 00:42:11 Or we have a really critical high volume job to run. You know, we'll just scale that up, you know, and then once it's done, scale it back down, you know, right in place. Those kind of features are they totally change the way you run your development process and allow us to to do things that we just could never do before okay okay so um so when you this obvious question i've got now is when you recruit you know when you recruit people to the team um to work with you and to work and when hubspot recruit people what kind of qualities are you looking for and what sort of people do you typically um bring on board and um how does their onboarding kind of take place and how did your onboarding take place really well in the recruitment phase you know we're looking we really have tried to
Starting point is 00:42:55 build a really diverse and caring team and i i mean that in terms of you know we like to bring in people who are collaborative, work well together. We don't necessarily bring in a lot of kind of, you know, people that would be great, sort of, you know, the individual who doesn't want to be part of a team. It's very much a team approach to development. And not just in data, but also on our software development teams. So that matters a lot to us. But we also look for people who are ready for change. HubSpot's a really dynamic company. We're always changing. We adjust to things in the market. We adjust to things that we see internally. We make strategic changes. And so I think it matters a lot that people are excited about what we're doing right now, but also excited about whatever comes next.
Starting point is 00:43:46 That matters just as much as skill sets. From a skill perspective, that's another thing that changes over time. As I've gone through my work history, there's a lot there, but certainly we look for fundamental skills in each role. Then the onboarding process at HubSpot is really great. It's unique. I went through it not that long ago. And your first couple of weeks, you really learn about HubSpot as a company, but also HubSpot as a product. And at first, people are like, well, do I need to know? I'm not selling HubSpot. I'm a software engineer. But once you understand how it all works, you're able to think about the data in a different way. You're able to really understand all the data that you're seeing come in,
Starting point is 00:44:35 what that means, you know, to the product. And it's not just a assembly line of ingesting data and modeling it and then waiting for a specialist in some department to make sense of it. So it's, you know, it's a very heavy onboarding process in terms of time, but it's extremely well run and you come out of it feeling like you've been there for a lot longer than a couple of weeks. Okay. So what about, I mean, with the situation at the moment with coronavirus and so on, there's been a focus on remote working. I know I think you've got a particular interest in remote working in the past. Is that something that works with data teams? Is it something that works with HubSpot at the moment?
Starting point is 00:45:13 What are your thoughts on that? Well, for HubSpot, it definitely works. And, you know, we were grateful that we had been really putting ourselves in a position to be remote friendly for the last couple of years at least. So we have a couple of folks on my team that are already full-time remote. I live not too far from the office, but far enough that I like to work from home a few days a week back when we could go in the office. And so we were really set up for this as best as you can be, at least from a remote work
Starting point is 00:45:44 perspective. In general, and I should have mentioned that earlier, that's another thing that really attracted me to HubSpot too, was that that's something I am a strong believer in, that it works well. And it also really does give employees even more autonomy and more control over their own life. And I think with data teams, I think it works great. I think it's the kind of work where it is collaborative, but it's also there is the need to have uninterrupted time. And one thing that's great about remote work is as long as you have the right setup
Starting point is 00:46:21 and you're prepared for it, you can certainly get that time a little bit easier than being in an open plan office. And with collaboration, I think it's all about the way we communicate. Whether it's in the office or at home, spread out, the way that we communicate via Slack and Zoom, and quite honestly, how we write. I think long-form writing has been something that I've used to communicate, whether I'm in the office or, you know, remote. And in data, that's often a great way, not just to communicate, but also to truly understand the problem. So a lot of that documentation you talked about earlier on, that's a method of
Starting point is 00:47:03 communication, not just reference. And so something that as a remote company, we's a method of communication, not just reference. And so something that as a remote company, we spend a lot of time, you know, making sure we're doing that. Okay. Okay. And so, I mean, you've obviously got data liftoff and you've written blogs before and so on. Do you find in general that's helped your career and helped you understand how concepts work and by being able to communicate to people, then you understand it better yourself, really. Absolutely. I can't count the number of times I've said, I'm going to write a blog post on a topic and think I understand it until I start writing the blog post. And then days later, after I've had to go and research all sorts of things, I finish up the post and then I come out of it and I know it a lot better than
Starting point is 00:47:45 I did coming in. So for me, in some ways, blogging and any kind of writing teaches me a ton and I hope it teaches other people too. But it's certainly a way to really, if you want to find out if you really know something, try to write a really in-depth blog post on it. And every time I've done it, it's proven whether I know enough yet or not. Yeah, I mean, I suppose HubSpot itself with its roots in that kind of area. I mean, I suppose HubSpot is quite supportive of blogging and generally, I suppose, the community
Starting point is 00:48:19 and putting content out there really as a way of, I suppose, understanding yourself, but also as a way of marketing what you do. Yeah, yeah. And HubSpot is, and I think it's also, there's that, before I joined HubSpot, you know, I very much found myself reading a lot of content that HubSpot put out, you know, the way that content is generated really to be helpful, right. They, you know, we're one of the pioneers in inbound marketing, and that concept has really stuck with me. It's not about writing about, you know, something that's helpful to the business. It's how do I write things that are helpful to either my potential customers or my current customers? And that's just totally changed the way that I thought about blogging. It was never really interesting to me until I kind of thought about it that way. And that's what I've tried to do with Data Liftoff. Yeah, I tend to find as well that, you know, what goes around comes around. And I think if you share your knowledge with the community, then you get more back in return for that straight away in terms of other, maybe sort of, you know, maybe kind of collaboration on what you're writing about.
Starting point is 00:49:21 But just generally, I find that the more you put in, the more you get out really. And it's, I think for people like us, maybe not yourself, but myself, who are maybe not the most obvious salespeople, a good way of actually communicating what you can do is to write about it and explain the kind of thinking behind it. And that acts as a great form of, um, kind of marketing, in a way that we're comfortable with really. Certainly. Yeah. I've always, you know, being on the end of, as a buyer for a long time in software, I've really always appreciated companies where, you know, I've been able to come to them and say, you know, hey, I've read about your product. I understand it. I'm interested. Right. And I know that salespeople love those inbound leads too. So it's great being on both sides of
Starting point is 00:50:02 it. It's just a little bit more, it's setting up that initial connection and kind of taking away that cold aspect to it, which I'd never been a big fan of cold selling. When I was doing data liftoff and consulting, I never tried to cold sell my services. It's just not something I'm comfortable doing. Exactly. Exactly. So you've been doing this, you've been in this industry for a while now, and you're still, it sounds like you're still fairly technical. I mean, what's your's your thoughts on I suppose the engineering career path and does it always end up being a manager or does it have you managed to sort of um to keep your hands you know in what you're doing really what's your thoughts on the career path for engineers
Starting point is 00:50:38 really well I'm really passionate about engineers not having to go into management, but being given the opportunity to go into management. I know that traditionally, at a lot of companies, it seemed engineers could really, they would hit a wall unless they would jump into management, you know, so they'd become a great engineer and sort of be rewarded by managing the team and getting to do less of what they love to do and what they were good at. And that was, I've seen a lot of people burn out because of that. You know, they never wanted to manage people. They love writing code. They love architecting systems.
Starting point is 00:51:12 And so I want to make sure there's a career path for people like that. And I know HubSpot is very supportive of that. You know, we have very clear, both individual contributor, we call them, you know, career paths where you don't need to manage people, but you can continue to grow, take on more technical responsibility and management career paths. And some people bounce back and forth. You know, some people, I've worked with several people over my career who got into management, didn't like it, went back to engineering and fell right back into place. So, you know, I think both are great for me. I've always been a little bit more drawn toward management.
Starting point is 00:51:50 I still love to code when I can, you know, and mostly now that's as a hobby rather than at work, unfortunately. But, you know, it is something that I think it's great to kind of keep your skills sharp. But once you make the decision to go into management, it definitely is a clear trade-off. I think people just need to be explicit about saying, you know, I just can't do both. You know, doing two things well is really challenging
Starting point is 00:52:17 for most of us, including myself. So I like that there's multiple paths, very supportive. And I love that people can experiment. Companies should give them the space to do that. Okay. Well, it's been great speaking to you, James. But how do people find out about Data Liftoff then and get access to some of your blog posts and things you've written?
Starting point is 00:52:35 Yeah, so you can go to dataliftoff.com or there's a Twitter account as well. It's at Dataliftoff. And I try to still post there as often as I can and pretty soon I'm planning on getting some guest posts as well. So there'll be content from more of the data community
Starting point is 00:52:50 in the coming months as well. Fantastic. Well, it's been great speaking to you. And thank you very much for coming on the show. And yeah, thanks very much, James. Yeah, thank you. It's been awesome. Thank you.
