The Data Stack Show - 170: Discussing Data Roles and Solving Data Problems with Katie Bauer of GlossGenius

Episode Date: December 27, 2023

Highlights from this week’s conversation include:

- The evolution of the data scientist role (1:03)
- Common problems in different companies (2:05)
- Measuring and curating content on Reddit (4:29)
- The challenges of working with unstructured content at Reddit and Twitter (11:03)
- Lessons learned from Reddit and applying them at Twitter (13:17)
- Data challenges and customer behavior analysis at GlossGenius (20:16)
- How the data scientist's role has changed over time (25:10)
- The essence of the data scientist/engineer role (29:00)
- Dynamics and overlaps between different data roles (32:09)
- The perfect data team for Twitter (34:19)
- Building a data team at a startup like GlossGenius (36:36)
- The right time to bring in a dedicated data person in a startup (38:52)
- The analytics engineer role (46:25)
- Challenges in implementing telemetry (50:31)
- Final thoughts and takeaways (52:24)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack, visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by RudderStack, the CDP for developers. You can learn more at rudderstack.com. We're here with Katie Bauer from GlossGenius. Katie, welcome to the show. Thanks for having me. Excited to be here. All right. You've had an amazing career in data. Give us the quick overview.
Starting point is 00:00:37 Yeah. I got into data science in the early 2010s. I've spent time in a bunch of different places, working at a natural language search startup and social media. And now I am leading the data team at a vertical SaaS startup in the beauty and salon space. Awesome. Well, so excited to chat with you, Costas. What topics do you want to dig into? Oh, we have plenty to chat about. First of all, my favorite topic, which is always about definitions, right? So I'd love to hear what the data scientist role is, like what's the definition behind that? And more importantly, like how it has changed. Like we have Katie here today that has experienced a lot, like, the evolution of the role.
Starting point is 00:01:22 So I'd love like to hear from her how like things have changed. And also talk a little bit more about how the role is different, like in different companies, different sizes, different products, right? And how also like the data scientists overlaps or not. We'll see. Katie will tell us. With other roles that have to do with data. So it will be like super educational for me at least,
Starting point is 00:01:53 but I have a feeling that it's also going to be like for our audience. What about you, Katie? Is there something in your mind that you would like to chat about today? Yeah, something I think would be sort of an interesting connection to what you're talking about is what kind of problems occur across all these different types of companies. What is something I've had to do at multiple jobs or what types of problems appear in more than one context? Yeah, let's go and do it.
Starting point is 00:02:22 What do you think, Eric? Let's dig in. All right, we'll start where we always do. So tell us how you got into data at the beginning of your career and then sort of what you've done in between and then what you're doing at Glass Genius today. Sure. So like a lot of people who ended up in the data science world in the early 2010s, I was in an academic role looking to do something different and decided I'd go try this tech thing. My background was linguistics and I ended up working at a startup as a linguistic data set, like annotator and curator. Did that for a while. It built a lot of my foundational data skills. And when that company was done, I ended up working
Starting point is 00:03:06 as an analyst and a data scientist for a couple of years in ad tech before I ended up moving into social media, which is where I spent a lot of my career. I joined as an early data scientist at Reddit. It was about 200 people at the time, and we had no data science prior to that point. So that was an interesting journey being there through a period of hyper growth. And then after that, got pulled into a role at Twitter by a friend who was working there at the time and spent a couple of years there until notable news events pulled me away and brought me to my current role at GlossGenius, where I run the data team. GlossGenius is a Series C now company. It was a little over 100 when I joined,
Starting point is 00:04:00 and now we're more than twice that. The data team was a single person. Now it's grown pretty substantially over the course of the past year. I really kind of spent the past year getting the team up and running, hiring people, building the data stack, et cetera, and kind of looking long-term for the company at what we might do commercially with the data that we have. Awesome. Okay, I want to dig into sort of each of those career phases. Let's move on to Reddit.
Starting point is 00:04:22 So social media, I mean, what a topic in and of itself. The data science behind it, what a topic. I'd love to know what were your big takeaways? Because if you think about, you worked in, you know, sort of like moderator, user, in that sort of space within Reddit, And you have all this user-generated content, which is fascinating to me from a data science standpoint, because you have all this unstructured data. How do you derive the meaning? There's sort of lanes within content that should be allowed in certain subs, all that sort of stuff, right? What were the big lessons that you took away from sort of, I mean, I kind of view that, and maybe this is wrong. It's a wild west in many ways, right? User generated content, unstructured data, like things that are subjective, that sort
Starting point is 00:05:17 of need to be objectively implemented in terms of standards. So what were your big takeaways? Yeah, I mean, I guess like one thing to say on the subjectivity, one thing I developed a really strong allergy for working at Reddit long enough was the idea that we would define quality content. Like there were like a lot of ways to like talk about like, is content doing something illegal? Is it like distracting? Is it off topic? But like, is this like good content was always a question people hadn't like there's no way to measure that we tried so many different ways and it just was always a dead end because it's very subjective so this is interesting so i mean
Starting point is 00:05:57 obviously reddit creates some content but it's primarily user-generated content was there a desire for good content? I mean, was that sort of a... That's interesting to me because it's all user-generated, right? Yeah. Well, and there's a lot of different angles on this. One, people want to see content that's engaging so they'll come back to the site.
Starting point is 00:06:20 Yeah, yeah. So it's a Rexus product fundamentally. You need to match content with people's interests and like being able to figure out what is like a good thing to show someone is like the start. Of their Reddit experience was like a perennial question. Like how do we make sure people find their quote unquote home as fast as possible so that they get the site and come back. But there's also like, like a big thing while I was there and a big initiative
Starting point is 00:06:44 that I participated in was trying to flesh out other content verticals on the site that were a little underdeveloped. So, for example, people for a long time thought of Reddit as like a place for gaming or political discussions or things like that, memes. And a big thing that we focused on was trying to make it a more broadly appealing website. So like trying to get more beauty content, for example, or like just recipes or things about like families. Like it was really like a way of trying to make it more of broadly appealing website, which if you think about it from the mechanics of the business as a company that was monetized by, and like still is monetized by placing ads, having more broadly appealing
Starting point is 00:07:27 content meant you could have more advertisers. So like that was like the financial incentive, but it's also something that helps you attract different people. And it was tremendously successful. Like the website is way more mainstream than it used to be by trying to help curate and amplify content that we wanted
Starting point is 00:07:43 to make sure people knew was on the site. Super interesting. And just to dig into that one step deeper, how did you measure that? I mean, there are different ways to sort of measure the quality of content, you know, and I mean, obviously, that's subjective. But in terms of like, how would you know that the beauty section of the site is working, you know, or the recipe section of the site is working? What were the things as a data scientist that you would say, okay, you know, we're making the right recommendations? Like, how did you know that? Because obviously, you feel strongly about the subjectivity of good content, but you have to have some objective measure to build these models for recommendations. Yeah. And I guess maybe a general lesson to take from this was that it was really a bunch of separate problems and we needed to be able to tell like what the separate problems were so another very
Starting point is 00:08:45 big initiative we had was to have good category labels for subreddits which it sounds easy on its face to say like yeah the gaming subreddit is about gaming makeup addiction is about makeup but there's like a long tail of weird stuff on reddit which i don't necessarily mean in like a unsavory way there's just stuff that's like a weird joke that doesn't make sense. Yeah. Super niche. So like to really understand like what parts of the site we're doing while we needed good category labels. And that was something that we actually couldn't really programmatically do. We ended up having to use a human in the loop program to do that, where we would get informed users to give ratings of what they thought something was., where we would get informed users to give ratings of
Starting point is 00:09:25 what they thought something was. And then we would do some stats to reconcile the ratings in a rater agreement for anyone who's curious and figure out which ones was their strong agreement upon. And if there was strong disagreement, we'd go back and get more annotations. And that ended up being the most scalable way to get high quality labels. But once we had those labels, it unlocked a lot of different things where it would tell you like kind of like where are most of your comments, where are most of your posts, or just any type of engagement really, because upvotes are important for Reddit too. But then like you can like start figuring out like, is it the same type of people who are all engaging this? Like how broadly appealing is is it is there crossover between different genres and like an effect that we started observing over
Starting point is 00:10:10 time was like there would be for lack of a better term events on the website that would suck up oxygen from everything else or it's like there was kind of like a baseline amount of activity you would see on a regular basis and like if there's some controversial event in the moderation world, for example, suddenly all of the posting and commenting behavior for that day would be on one post or something. And it helped us realize, yeah, you do need more people. It does need to be more broadly appealing. So there are different people with diversified interests to kind of, I guess, make sure that your eggs are not all in one basket from a content perspective. Yeah. Fascinating. I mean, I can't imagine the Wall Street bets situation.
Starting point is 00:10:55 Yeah, that was after my time. But as that was happening, I was like, oh man, I can only imagine what's happening internally. I can only imagine what you were thinking was happening there. Okay, let's move on to the next social media role. And what's fascinating to me here is that you're dealing with a bunch of sort of unstructured content, subjectivity at Reddit, and then you move into a role at Twitter, I guess now it's called X, obviously, where you were building tooling for teams that were doing similar work to what you were doing at Reddit. Why did you want to move into a role where you were building tooling as
Starting point is 00:11:42 opposed to sort of working on the models that were sort of driving the decisions around the end user experience? Yeah. Partly, it was the team that I joined. The team that I was a part of when I started at Twitter was working very closely with their finance team on narratives for Wall Street. And my initial mandate when I joined was to help create a relatively objective narrative around product velocity, which meant we had to spend a lot of time measuring the quality of the internally developed technology at the company. And Twitter has a lot of it. They became a large company at a point where there was not an off-the-shelf tool to build, say, a huge key value store.
Starting point is 00:12:25 So they created their own instead of using something open source. And over time, that kind of morphed into being more purely focused on the infrastructure organization and helping engineering managers and product managers in that organization to understand the impact of the different things they were doing and measure and evaluate and make better decisions about where to invest. I also kind of didn't want to work on a consumer facing role after Reddit. Like I was a big Reddit user and I am a big Twitter user even still. And like I working in a consumer related role at Reddit was something that like over time I started having kind of a hard time separating my feelings about a product that I really liked from the business about the product. So I kind of wanted to step back and do something that was a little bit different,
Starting point is 00:13:12 but it ended up being very fascinating in and of its own right. And what were some of the big lessons that you learned about? Did you take lessons from what you needed at Reddit to Twitter as you were building tooling for people who were doing similar roles? I mean, what were some of the, I mean, obviously there were a bunch of internal builds maybe that, you know, sort of needed to change over time, but what were some of the big things that you took with you from Reddit to Twitter? Well, the thing that I mentioned a moment ago about a problem that you think is one problem is many problems. That was something I went in thinking about actively. And it was not only
Starting point is 00:13:52 reinforced, it was made much more nuanced. Something that was interesting about joining Reddit at such an early stage is that we had basically nothing to work with from a data science perspective. Like there was data, but nothing was aggregated. We didn't really know what was in any of the data sets that we were going to work with. And coming into my role at Twitter, the thing that my team was doing was also something that had not been done at the company before. So it was kind of a similar thing where it was just, we don't know what's in these data sets. We don't really know what we're going to find and people don't know how to think about this. So I ended up spending a lot of time like helping people understand why you would measure things in the first place or like why certain types of measurement were important versus just the observability dashboards engineers were used to looking at. And like also related to the lesson about categorizing subreddits, it ended up being really important to start developing segmentations of different types of developers at Twitter
Starting point is 00:14:52 to kind of take apart some of the problems. Because we had a bunch of data about what they were doing, like all their different developer tools. But like when you viewed it all in aggregate, it would be impossible to figure out what you should do. But like once you started segmenting to these are backend developers, these are machine learning engineers, that ended up being a great way to like figure out more targeted problems to solve that were a lot more tractable as opposed to just having the mandate of there's 4,000 engineers at this company. Help them. Like it's who can we help the most first? How can we describe their problems in ways that they understand? It just makes it more targeted and easier to
Starting point is 00:15:34 relate to for the end customer. Yeah. Super interesting. Okay. One more question about Twitter. How did you collect that data, right? I mean mean i think about your experience at reddit where you know have some sort of telemetry running on the site and you know you know you're collecting all that data into a data store and you know of course you know sort of running your process from there but thinking about internal data collecting telemetry on those sorts of things seems like a challenging technical problem in and of itself. How did you think about that? Yeah, it was a very multifaceted problem because different tools had different, I guess, ways that you would measure them too. Some of them are, like there's a web UI, some of them are a command line tool. Fortunately, someone relatively early on in the company history,
Starting point is 00:16:27 by which I don't know if it was relatively early on, but it was like for at least four or five years, someone had instrumented a lot of our developer tools, at least that were maintained by a certain organization. So like we did have a lot of data and it was often incomplete or not exactly what we wanted. But that was a big source of some of the data that we looked at. We also would go into the history of their version control system and pull out information about what languages are people using, how often are they committing code, and that sort of thing.
Starting point is 00:16:57 Yeah. That's actually very forward thinking of someone to put telemetry in on the dev tools early on. Yeah, I was very grateful. Yeah. Oh, man. Because that is not the case in many cases. Okay. Tell us about GlossGenius. Can you just dig a little bit deeper? I know you gave us an overview, but dig a little bit deeper into what does the product do? Who are the customers, et cetera? Sure. Gloss Genius is a vertical SaaS company targeted at independent owners of beauty salons. So that's everything from hair and nails to things that are maybe like a little less expected, like med spas and injections. The product is sort of an all-in-one platform to help them manage their business. Like kind of our tagline is like they can focus on the craft and the art of what they do.
Starting point is 00:17:47 And they've got a business manager and their product. Like that's not our literal tagline, but I'm paraphrasing a little bit. But it's like a booking website. It helps them with outreach to customers, like texting and email marketing. It helps them take payments, keep cards on file, helps them get paid faster, a variety of different things. And so when we think about one of your customers, are we talking about, you know, there's a point of sale system where, you know, someone comes in for their appointment and then they check out and that's in the calendar system
Starting point is 00:18:25 at the point of sale and so they can accept the payment. And then there's a booking thing that sends the email. All that stuff is baked into the platform. Yeah. Interesting. Okay. Do users, the customer of your customer, do they create an account? How does that work? Yeah, they don't, which is an interesting thing about our platform compared to maybe some other companies in the space is that right now, the customers of our customers, they may know who we are, but they don't necessarily. We're very tightly aligned with the salon owner, which is a good thing for incentive alignment and maybe suggests future opportunities for us in terms of areas where we can expand.
Starting point is 00:19:11 Yeah, that's interesting because you have like the mind-body sort of model where it's sort of centralizing and there's the, I mean, you're B2B2C. And so some models sort of anchor around the end consumer and then sort of commoditize the provider. Okay, that's super interesting. One thing that I would love to dig into just a little bit is what you think about on a daily basis or the horrible cliche way to say this is what keeps you up at night.
Starting point is 00:19:43 But when I think about Glass Genius, it's a closed system. And what I mean by that is it doesn't, for your customer, you're not selling this concept of the modern data stack, you know, that's like completely modular, you know, and like you need to ingest this data with this tool and model it with this tool and blah, blah, blah. Right. It's no, you need to ingest this data with this tool and model it with this tool and blah, blah, blah.
Starting point is 00:20:05 Right. It's no, you need to like, I mean, I'm assuming spin this up. Like you can send things to your customers to schedule. You can communicate with them. You can accept payments, reschedule, et cetera. And of course, I'm sure there are like back-end analytics that your customers have too, right? That you provide to them.
Starting point is 00:20:31 Which is super interesting. Like what advantages do you have being a closed system from that standpoint? Like what are sort of the, you know, what are the things that you like most about working with a system that you actually kind of have control? I mean, in some ways, it seems like the dream where it's like, wow, like you have all of the data for your customer in one place, which seems great. But it's almost like, oh, now I have this.
Starting point is 00:20:58 Like, what do I do with it? Yeah. I mean, it is good to have a problem with maybe having more than you're prepared to dig through. Yeah, that's a great way to say. Like there's a lot of instances where a customer enters something in themselves and if they spell haircut differently than some other person, like haircut with a space versus haircut as one word, like working with that data, like reconciling and figuring out like how to process those as the same thing is kind of an ongoing thing for us. But like a lot of our problems are really about like, how do we make sure we know
Starting point is 00:21:46 what our customers are really doing with the product? Because like with any product that you spend some time working on, you have ideas about what people should be doing. But then when you look at the behavior, you're like, wait, what is this? Like sometimes they're doing something. You're like, what are you doing or trying to do sometimes uh it's because like you found a bug and sometimes it's because a user is doing something interesting and creative and you like want to find a way to help them turn that into a real like product feature which in my opinion like for product analytics specifically is one of the coolest things about the role is like really knowing what's happening interesting which like there's a ton of data legwork you need to do to enable something like that.
Starting point is 00:22:29 But like another common theme threading throughout a lot of my career is like trying to figure out like, how do you make the problem smaller? For us, like we work with all different types of beauty professionals and something we're thinking about right now is what's different between say someone who does hair versus someone who provides waxing. These people have different needs and different business models, and there are different expectations for different types of salons. And making sure that we can differentiate that in our data will reveal things to us that we want to be able to act on. And we are looking at this to be clear, but it's something that we need to make sure that we're always focusing on and trying to find the
Starting point is 00:23:13 right segments within our business to know what are the distinct personas that we're serving and what are their needs and how do we build a better product for them? Yeah, super interesting. Now, one thing that... So there's a product analytics aspect, which I know is probably an oversimplification, but understanding what's really going on, I think, is a great way to describe it, the words that she used. There's also this other interesting aspect. And I have to think that as a data leader, you're thinking about, okay, if we have a bunch of hair salons and we learn things about what's working well, you can build data products that you could use to sort of push value back.
Starting point is 00:23:56 How much of like, are you thinking about that a ton? I mean, that's got to be one of the more exciting potential things that you're thinking about? Yeah. I'm definitely thinking about it a lot. It's kind of part of the long-term mandate that I have is to make sure that we're putting our data to good commercial uses, whether it's through our product or otherwise. And something that we're starting to do that I'm pretty excited about is just use our data to find economic trends. And we've been in a couple of trade publications talking about, yeah, it's fall. So that means there's lots of pumpkin facials or that sort of thing. And right now it's a bit more fun and entertainment than it is necessarily something we're going to productize, but it's sort of a way of like
Starting point is 00:24:43 exploring what we can do right now and build demand and awareness of the valuable data that we have about beauty, which is a huge industry, as I'm sure I don't need to tell anyone. But there's a lot that we can learn from the data that we have. And we're looking for ways to put it to good use and also to make sure people know all the exciting stuff that we have. Awesome. All right. Well, Costas, I've been monopolizing the mic. Please, please jump in.
Starting point is 00:25:13 Okay, Eric, you were having like an amazing conversation. I really enjoyed like listening to show. Katie, you've been like in the data science like profession for quite a while now. And many things have changed. And I'd love to hear from you, like give us like a little bit of like a glimpse into like this journey, like how the role itself like changed, right. From what was like a couple of years ago to what it is today. Yeah, that is a topic I could go on about for a while. And something that is interesting to me too,
Starting point is 00:25:50 as I try to learn more about the history of roles, like what we call data science or maybe AI engineers now, is people have been doing this for a really long time. But the specific tasks they're applying to and the tools they use for it are maybe a little bit different. When I started, the hot thing was data science and deep learning. Then we kind of went through an era of modern data stack. And maybe analytics engineer was like the hot job title at that point. And now we're kind of back on deep learning, although we call it AI now, because what is
Starting point is 00:26:25 chat GPT except transformer? And now we're calling the people AI engineers. And like fundamentally, I think the thread through all of these things is companies have a bunch of data and they want to put it to use for something. And the idea of how we do that changes partly based on what's trendy. But I think it's also kind of a reaction to previous waves where you think about something like data science. It's kind of a reaction to maybe more traditional BI and the idea that everything has to be really curated by a
Starting point is 00:26:58 person. But then you have the promise of all these algorithms that are going to find amazing insights for you. And then like that enthusiasm kind of fades. We shift the pendulum back to data needs to be curated by people. We get into the modern data stack era, which is really oriented around analytics. And as that kind of people get disillusioned about that, you move back into this idea that AI, like some kind of algorithm, is going to do all the curation for you. I feel like that is kind of a trend I feel like is probably going to continue for all of human history, or at least as long as we have people working with data, is that we're going to get disillusioned with one thing, try the other thing. And we will gradually keep making progress as we move through all of these hype cycles but what we call these
Starting point is 00:27:45 people is changing constantly but still it's the same thing yeah i think so what's the essence of their role like what is the goal like let's say someone had to choose to be a data scientist or AI engineer, whatever you want to call it today. Why they would do it? Actually, let me ask the question in a different way. What kind of things this person should be interested in so at the end they would be happy? Because that's why we work at the end, to be happy. So, help us a little bit about in this
Starting point is 00:28:27 because I think one of the problems with all these changes in the titles and creating new categories of professions that are adjacent and overlap and all that stuff, it's really confusing for people who are like, okay, they want to enter, let's say, the profession to understand what I'm going to be doing at the end. Because it's one thing, like, first of all, it's like one thing to call someone a scientist,
Starting point is 00:28:51 it's another thing to call someone like an engineer, for example, right? Like there's a reason we have different words there. So tell us a little bit more about that. Yeah, I guess like if there are truly two roles in data, we have all these different titles and theoretically job families, but if there are truly two roles, I think it's more of the scientist exploratory knowledge-oriented type and then the builder type, which often gets called engineer. So that maybe is partly what the pendulum is swinging back and forth. But if you want to be an ai
Starting point is 00:29:25 engineer for example you're probably going to be doing something in production you're probably going to be trying how like to figure out how to use data like you don't really care that much about how it's structured to create an end product experience or something like an analytics engineer like the structure of the data is the main thing that you care about. And you care a lot about getting it into the right place and shape so that it is oriented around what you know. I see that as a bit more of a knowledge-oriented role, even though it has engineer in the title. But it's one where understanding the business logic is really important. Whereas an AI engineer less so like it's not necessarily about like modeling a business process as much as it is like trying to create something that people can interact with
Starting point is 00:30:11 by which I mean people who are probably external to your company so like maybe this is also an internal versus external dynamic that I'm describing or like operations versus production but I kind of see those as like the two main profiles. And like whatever is trendy at the time maybe says something about like a hype cycle or what people are excited about from technology. Right. So we talked about like the data scientist. We talked about like the role and how it has changed. Right.
Starting point is 00:30:42 And how it is today. And I love the perspective that you give on how you help us. You're like a true data scientist, I think, at the end. You're looking for patterns and you found the pattern there. You said we swing from one to the other. So we know also we can extrapolate in the future what is going to happen. So that makes total sense. But one of the things that I personally, at least like I find like so interesting when it comes like to like data in general,
Starting point is 00:31:12 and like there are like plenty of different professions that they need like to come together in order like to deliver at the end, like some kind of product, right? We have data engineers, we have a male engineer, we have the data scientists, we have the BI analysts, we have the business ops people that they also now they work with the data. We even have production engineering that now more and more the data becomes part of the product itself. So they need to get access to that stuff and
Starting point is 00:31:41 deliver something over there. How do you see like the dynamics between these different roles and how they're like defined and what kind of overlaps exist? Because I'm pretty sure that like in your experience, you've seen like a lot of overlap out of necessity in many cases, because one role might be missing and someone else might need like to do the job for them. But how do you see see these dynamics there today? Yeah. I guess a hot take I have is that I think
Starting point is 00:32:13 data science would actually be a really good name for all of these jobs collectively because I feel like they're all... I don't know if it's like a spectrum. Maybe it's like some kind of higher dimensional space where you are moving through these different attributes that people have. But like my general theory on like specialization and data roles is that the smaller the team and the smaller the company, the less you need to be specialized.
Starting point is 00:32:40 And as the scale of what you're doing becomes bigger, there's just like more ways in which it can be complex. And as things become more complex, like the different components of the problem necessitate that you have people who like think about narrow pieces of it. So to give an example from my own career, on a lot of early data teams or like early stage projects, although I've been a data scientist, I've mostly done data engineering where I'm building pipelines.
Starting point is 00:33:09 I'm trying to define my own metrics and build all of the infrastructure that I needed for myself. Or like even in some cases, written production code as if I were a member of the engineering team. And like as I've been at really big companies, my job has become more specific. A lot of the data science team at Twitter, for example, was really hyper-focused on experimentation. They were almost pure statisticians sometimes, which that would never happen at a company like the one I work for now, because it would be... We don't even have the scale of user base that we would need, like the level of sophistication and experiments that Twitter did. So it's partly like you're required by your environment to focus on particular parts of
Starting point is 00:33:55 the problem or develop specialized skills or just exercise specialized skills. But it's always helpful to be aware of like the parallel skills to reach into them and use them as you need them. And if you have a big team where there's a lot of people who can focus on narrow parts of the problem, you need to figure out how to coordinate and work well together. But if you don't, it's a lot easier to be able to just do things for yourself, if possible. Yeah, that makes sense. So let's say... And we'll take the the different uh companies as an example like just for the states that they are in right like and let's start like with twitter
Starting point is 00:34:34 if you had like to design let's say the perfect data team for like a company like twitter right what this team would look like? I mean, one thing I'll say is that Twitter would have been a great place for data mesh. They were far along in their data journey for that's really been reasonable for them to pivot to. But like for that, like it's such a big company with so many different concepts
Starting point is 00:35:02 that like having a centralized team can be kind of weird or kind of a strange fit and pushing some of the domain knowledge into specific teams would have made a lot of sense there just because it's not reasonable for one person to hold everything in their head at least with tooling as it is today. So like, I would say that you probably would want to have maybe, maybe there there's like a overall leader for data science at Twitter, which there was not when I was there. And you have maybe sort of like a hub and spoke model. Like I like hub and spoke models where you can get kind of like a standardized quality for the whole company. But like,
Starting point is 00:35:46 if a company really becomes huge enough, you almost start having separate companies within the same company, like Microsoft, for example, like there's no way you could have a centralized data team. So the bigger you become as a company, perhaps you start getting like a fractal thing where you have like a local data team that does everything that they need for that area that has a certain structure and then it's totally separate from some other part of the company that has the same structure whatever structure works for them yeah that makes sense and then moving to something like reddit and like i'm curious about reddit because also like the unique moment that you came into the company right like so where there is
Starting point is 00:36:27 let's say a lot of opportunity but there's no structure and it's like a startup but like i don't know like with like a large scale like startup in a way right where you take all the problems of like the startup and you just exaggerate them like to the scale of something like ready. So what do you, how like a data team would look like in there?
Starting point is 00:36:53 Yeah. I think the right thing to do in that case is to really look at essentially what team would benefit from having data savvy in it. Like not every team
Starting point is 00:37:04 is going to be super quantitative, but some should be. Like growth, the team that I worked with when I joined was the home feed team. Like these are things that it's like, it needs to be measured. It needs to be quantified. And that is probably where you should put your data team first is an area where they can be impactful. The wrong thing to do is to try and cover everything all at once. Like you need to have targeted areas so you can make progress. And like for people who are not familiar with working with a data team,
Starting point is 00:37:40 like it helps them understand what they get by working with a data team. Because you have examples of what it looks like when you go super deep, whereas it can be kind of underwhelming when you're just kind of spread vaguely across everything and what about the startup like as you are like now right and the reason i'm asking is because that's like the space where i also have like the most experience too but i'm coming also like from companies like all my experience like i'm B2B companies, where you don't have that much data in the beginning. So it's always kind of like a question of when is the right time to bring a dedicated person that can work on the data.
Starting point is 00:38:23 Because now we have data and before we didn't, but like in B2C is a little bit different because it tends like to get, let's say data at scale, like much earlier, like in the life cycle of the company, but still there is like the right time for that too, right? Like I wouldn't assume that you get the data science at your first hire when you start like a company, right? So when is the right time to do that? And how, let's say, the team would look like?
Starting point is 00:38:52 Yeah. I guess my answer is kind of related to the answer I gave earlier about specialization. As a company grows, at least to start, data is probably something that a product manager is doing or some biz ops person is doing on the side. And eventually it grows and becomes complex enough that it needs to be someone's full-time job. I don't know if I have a unified theory of this, but I would say it's probably around a time when your user base starts scaling or maybe you have some early product market fit. Then you want to grow. That's a very good time to start bringing in data because you have to be more serious about measuring things.
Starting point is 00:39:28 For Gloss Genius, data really originally was part of our finance function. And it makes sense. Your investor metrics are pretty important for a company and they also need to be right. So the level of accuracy necessitates someone really paying attention to it. As we've expanded from there, it's kind of been trying to figure out like what are the most important business verticals? What's the right staffing model and where do we put people? I have a pretty strong opinion that there should be a business case for every like dedicated embedded data person you hire. So it's usually, there needs to be some combination of like, I identify a need for this and a stakeholder that they would be working with
Starting point is 00:40:10 also needs to agree with me that they need that. And like the model that we have right now, at least in terms of embedded resources, is we have someone working with product. We have someone working with go-to-market, which is both marketing and sales. And we have someone working with our customer experience team. And then we also have analytics engineers and data engineers.
Starting point is 00:40:32 And some would say our team is like a little bottom heavy in the number of analytics engineers we have and data engineers we have relative to analysts. But part of that is because I believe having a well-maintained data stack requires it to be someone's full-time job. If someone's always trying to fight off stakeholder requests, there's never going to be someone thinking about optimizing your DBT models, for example, or refactoring the way things are written when they were written quickly and then end up not scaling. What's the difference between an analytical engineer and a data engineer? In many cases, there probably isn't one. The divide that we have is that the data engineers are really focused on ingestion and managing Snowflake. So it's a combination of managing it via vendors as well as writing custom lambdas in some
Starting point is 00:41:19 cases. And then analytics engineers are much more purely focused on the business logic pieces of modeling metrics, modeling concepts that are relevant to the business that need to be spoken about and measured consistently. Oh, okay. And where is the boundaries between the two in terms of the data models? Because the data coming, you can have data that's, okay, structured, but still very unstructured, right? Like event data, like for example, like from customer directions, right? So you ingest them into tables. From that point to the point where, let's say, an analyst can like start working with
Starting point is 00:42:01 them, like there's quite a few steps in many cases of modeling that need to happen. So where is where the data engineer stops and where the analytical engineer stops? Yeah, that's a good question. Our boundary is really a concept called a staging model, which is sort of like a lightly transformed bit of raw data. And that's stuff like removing PII, as well as maybe like if there's some code in a production database that is just a number, like translating it to its actual English equivalent. And that's kind of connected to the ingestion piece. So data comes into the warehouse. It is connected to, I guess, very loose, high-level business concept and maybe the metrics or rather the tables that people
Starting point is 00:42:54 query for building more fact and dimension style models. It's decoupled from the source, so we can also swap in sources relatively easily if we need to change them, which has happened. But basically it is data engineers making sure that data gets into the warehouse in a way that is compliant and protecting customer privacy. And then analytics engineers take that and then use it to model important concepts. I should say also that analysts write dbt models sometimes or models in Looker, but it is generally considered more of a prototype and not recommended that it's used for anything super serious because it's not really built in a way that's intended to scale.
Starting point is 00:43:40 Mm-hmm. Okay. And how is... So how is the loop closing there, right? You have the data engineer, that's the engineer, let's say. Then you get the analytical engineer, do the modeling, right? But the models are always dynamic creatures by nature, right? Like, as the business changes, like, that model also, like, changes, right? And then you have the analyst who is, like, using these models to go and do, like, ask their questions, right? And get answers. But, and the reason I'm asking is because exactly of what you said, that sometimes, like, the analyst can also create a model. And I think that's where we see that the need to go and change the models on the backend,
Starting point is 00:44:30 let's say, is exactly because something new has been created. And that usually happens from the analyst. It's not like the data engineer knows what is changing. Or even the analytical engineer knows. So how is this life cycle happening and how let's say well structured it is right or it's more of an ad hoc process i guess like right now it the team is small enough and we're still building out foundations enough that like it's not like we have a super well-oiled
Starting point is 00:45:05 machine of every time we release a new product feature, we have something ingested and modeled and then in a dashboard somewhere. Mostly, we're very prioritized and focused on things that are important to company strategy and big initiatives. So it ends up being a little ad hoc and we're always looking for ways to optimize the process but so far it's not something that is happening at a pace at which we cannot manage it just through conversations and we're looking for ways to make sure that it's more streamlined and fewer people are involved but lots of things are in motion at the company right now yeah yeah 100 have you seen like a different way? Is it different
Starting point is 00:45:45 like in bigger companies? Because I would assume that the flow itself remains pretty much the same, right? It's not that much different. You have the different roles, but at the end, change needs to be fed back to the back end from the front end. And the front end is usually a data scientist or an analyst or someone who is building the product or have the interface with the problem that is getting solved. So how does it happen? How have you seen it happen in other commands like Twitter or Reddit or any other commands that you have like experience from there. Yeah. I would say the thing that I most commonly see different from this chain that I've described is a lack of the analytics engineer. It's something that I have been pulled back into doing that as a data scientist, or maybe a data engineer is is more purely infrastructurally oriented or doing something that is way more production oriented than maybe just pulling data into a data warehouse for analytics.
Starting point is 00:46:53 Maybe they do both. But then there's someone who needs to actually build a performant data set for internal usage and dashboards and analysis and experimentation. And there's no one to do it other than like a data scientist who doesn't really know like the right principles for it or an analyst in some cases and like because it is not the thing that they're interested in doing and they're not trained in how to do it right they just do kind of a fast version that ends up breaking or becoming a problem later and this job title gets called a lot of different things at Twitter that was called data engineer. We had two different kinds of data engineers. So I don't really think the
Starting point is 00:47:30 title is that important, but I do think it's important to have someone who is primarily responsible for using business logic to shape data that is then used for other purposes at the company. And it might not even always be used by someone with a data job title. It could be someone on a performance marketing team who's working with this well-curated data. The specific form that it takes will vary on the company and how it's structured and how analytical the end stakeholders are. But generally, I find that those are three key things in a like end-to-end, like data is created, data is used for some kind of decision-making purposes. It gets moved somewhere where it can be used, usually a cloud data warehouse. Someone needs to transform it, and then someone needs to use it.
Starting point is 00:48:17 And the most consistent between all of those is the data engineer moving things into the database where people query it. Who does the transformation piece and who does the analysis piece I think varies a lot by company. Yep, 100%. And one last question from me and then I'll give the microphone back to Eric. Is there like something like a tool or let's say like a technology that you would like to see existing or being used more, let's say, in the environment like that you operate like today? What would be your wish for like something that would make your life like a data scientist, right? Or your data scientists that work with you
Starting point is 00:49:02 like much better and easier the tool that i want which it's maybe like something else like i could see data contracts maybe being argued as this but something that i would really like actually is not a tool for data people it would be a tool that creates a better developer experience for engineers implementing telemetry. So much hinges on your event data. And a lot of times the engineers putting it in, like can't really QA it very well. They usually don't have good context for what it will end up being used for. And there's usually so much event data that it's really hard to document what's happening to like give them a guide to work with so i really like a tool that improves the developer experience to make it easier for
Starting point is 00:49:50 engineers to put in telemetry that's interesting and when you say telemetry because okay there are like many like definitions of that are you talking about telemetry like in the sense of like sre things of telemetry no i guess i'm using the term a little imprecisely. I just mean when people are putting eventing into an app, your clickstream data, your behavioral data, that sort of thing. Just auto-track and everything will be fine. Nothing ever went wrong in that world. 100%. Okay.
Starting point is 00:50:23 I don't know, Eric. I think you should have a lot to say about that right? I was coming like from Rutherford I thought that problem was solved no? Oh man event data is so tough no it is that's super interesting Gideon I think it's really challenging because ultimately I think that what is difficult is that instrumentation
Starting point is 00:50:44 we like to think about it as a technical problem, but it's ultimately a relationship between people in the organization, right? I think you said it well when it's like, well, the engineer who's instrumenting it doesn't necessarily have all the context for how it's going to be used. I think the other thing that I would argue is that when you initially do instrumentation, it's kind of like the person at Twitter who implemented a bunch of telemetry that wasn't really heavily used. But then when you got there, you were like, yes, thank you. It's not perfect, but thank you. And so I think one of the hard things with telemetry is that to do it really well, you need to be very explicit.
Starting point is 00:51:29 But in many cases, it's about option value. And that requires multiple team members to think way farther ahead than the immediate need. But there's some linting and some other things that can certainly make that easier. But yeah, that's tough because event data gets messy faster than anything else. And boy, if you want to see a nasty warehouse, event instrumentation gone wrong is your quickest path to get there. Definitely. I'm sure we all have detailed horror stories about what we've seen
Starting point is 00:52:06 yes okay well we're at the buzzer as we like to say but katie one more question for you we've talked so much about data if you had to just completely leave data and do a completely different career what would you do i guess i think it would be fun to maybe have a catering business where someone gives you like a really specific like theme and you figure out how to make a dinner for them. That would be cool. Like figuring out the logistics piece of it, as well as like coming up with a creative dinner theme for them. Yeah, I love it. That's tricky. My wife is a florist and perishables are, that's like a crazy, it's crazy to work with, right? Like you're on a timeline and.
Starting point is 00:52:52 Yeah. Super interesting. Well, thank you so much, Katie, for joining the show. We learned so much. We'd love to have you back. Thanks again. Yeah, thank you. We hope you enjoyed this episode of the Data Stack
Starting point is 00:53:05 Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, ericdodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
