The Data Stack Show - 137: Data Collection Secrets & The Search Data Problem with Josh Wills

Episode Date: May 10, 2023

Highlights from this week’s conversation include:
Josh’s background in data working at Google, Slack, and other companies (1:21)
The need and process for high quality data (4:33)
Digging into auction... code (14:03)
Joining Slack and working in the early days of the company (18:00)
Not fighting the last war in data (25:42)
Building a product, while using the product (30:35)
Transitioning to the search team at Slack (36:50)
Usage patterns of search (41:21)
Josh’s work in helping build DuckDB (46:20)
Having the right toolset to increase precision and efficiency (52:42)
Final thoughts and takeaways (56:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Transcript
Starting point is 00:00:00 Welcome to the Data Stack Show. Each week we explore the world of data by talking to the people shaping its future. You'll learn about new data technology and trends and how data teams and processes are run at top companies. The Data Stack Show is brought to you by Rudderstack, the CDP for developers. You can learn more at rudderstack.com. Welcome back to the Data Stack Show. Today, I'm actually flying solo. Costas is caught in an airport and cannot get to a quiet place to record. So I am going to chat with Josh Wills. He has an amazing story. He worked at Google, he worked at Slack, and he started his
Starting point is 00:00:43 career as an analyst and then sort of got deeper and deeper into data science and data engineering. So I want to try to hit on all those topics. He also is a regular contributor to DuckDB. And we'll sneak that in if we can. I am most interested in hearing about lessons learned from Josh because he was at Google at a really interesting time, at Slack at a really interesting time, and has experienced sort of hyperscale and being the first person to build out a data team and a data infrastructure in several different contexts.
Starting point is 00:01:17 And so I think there's a ton that we can learn from him. And then hopefully we can nerd out too. He's a brilliant guy and has used every tool under the sun. And so hopefully we can dig into the technical stuff. So let's dive in and talk with Josh Wills. Josh, welcome to the Data Stack Show. So many things to talk about.
Starting point is 00:01:36 So thanks for giving us some of your time. Hey, Eric, my pleasure. Thanks so much for having me. You've done so many things with data. Before the show, you described it as sort of, uh, you know, the villainous descent, you know, sort of deeper into the stack, which, you know, may end up with you creating silicon wafers. But give us the quick recap of the villainous descent. Oh yeah, that's a great question, man. So I think I started out in life, I suppose, I was a
Starting point is 00:02:07 math major in college and I thought I was going to go be like a math professor or something like that. But it turned out that, like, I was good at math, but I wasn't like that good at math, you know what I mean? I wasn't like math-professor good at math. So unfortunately I had to, like, you know, get a job as a software engineer and stuff like that, which was again, fine. I liked computers and I liked data and I liked to analyze things. Right. Which discipline of math did you study, just out of curiosity? Oh, I mean, I studied like, I did like effectively a survey of math for all the time. I'm trying to think of like, well, I mean, when I got to college, I really wanted to be an analytic number theorist, like Ramanujan was basically my hero and stuff like
Starting point is 00:02:42 that. I wanted to prove the Riemann hypothesis, was kind of my goal. I think, Eric, you know, like we're going to talk about my career today and all this, you know, interesting stuff I've done. I basically think of myself as a failed mathematician, more or less. Like, it's like my mental model of who I am, you know, like that guy who like couldn't hack it basically is me, you know. But no, um, so yeah, I took like statistics in college and I liked it and it sort of, you know, appealed to me and stuff. And so my first job out of college was working at IBM and I was analyzing wafer processing data. So I was like working for like those very hardware engineers we were talking about before the show, and I was analyzing like the manufacturing process, basically. Like, is this chip going to work? How fast can we run it? Like, I mean, megahertz and stuff like that, gigahertz, whatever. of like at what point I realized, like it sort of maybe slowly dawned on me over the course of like five or ten years, that the key to really good analysis was really good data and that I could
Starting point is 00:03:51 do all the fancy crazy statistics I wanted, you know, to like compensate for the deficiencies of the data, but really like 99% of the time my effort was better spent getting better data. Like that was the whole thing. And that's my villain origin story. I love, um, Alice Waters, the chef who has like this, you know, high quality ingredients, simply prepared. That's the California like style of food, right? And that's very much been my approach to data analysis: high quality data, simply prepared. That's like good analysis and stuff. And that was a big switch for me.
Starting point is 00:04:27 Like, again, when I was a kid in school and doing all this fancy math stuff, I was like, oh yeah, you know, we clearly need to use the Lebesgue integral here to solve, you know, like just sort of nonsense like this, right, just to show off like how smart and technical I was, or at least in a statsy kind of way. But really, like, yeah, that didn't end up, you know, yielding the outcomes I was after and stuff, whereas high quality data did. When you think about the data space, you've done so many data things at like really large companies, some small companies. Like when you look at this space today, how much effort do you think is going into the data, like getting
Starting point is 00:05:03 high quality data versus actually like doing the math, right? Because we hear about doing the math more because that's flashy, but I suspect like a huge amount of the expenditure, both in sort of time and money. I mean, I completely agree. You know, like it's, it isn't sexy. It's not like whatever at all. That's a great question.
Starting point is 00:05:24 It's a great question. Is it like, I don't know. I mean, 80, 20, I don't know. I mean, I'm not asking you to put a percentage on it, but is this well, I like what, like what I'm trying to say here, I think is like, there's the data we get that's relatively easy to get, right. Which is the data we have easy access to. So it's like, I hook up Fivetran and it pumps my data from my production database
Starting point is 00:05:46 into my data warehouse. And I analyze, right? And then maybe I have some logs data. Maybe I hook up Segment. Maybe I hook up, you know, like RudderStack. Maybe I hook up Snowplow, like all those kind of things. I'm trying to be fair to all the vendors in the space. Yeah, yeah, seriously.
Starting point is 00:06:01 I mean, like Debezium or whatever. Yeah, exactly. And then I analyze the shit out of that data. But then how often is it, like, am I going in and saying, you know, if I have this additional bit of information that is available during the context of this request but isn't being logged right now, it's not being stored anywhere,
Starting point is 00:06:18 if I could grab it and introduce it here, I'd be able to do, like, way more powerful stuff. And then it's kind of like crickets at that point right no one's talking about that kind of stuff like how are you how are you bridging that gap how am i like and i think like my like it's like the secret of my success i guess as a sort of data engineering person is like i am the data engineer who is more than happy to go into the front-end code base and get the field i want and send the pr to the front end engineer and get it merged and get it piped like through the rest of
Starting point is 00:06:50 my system so that like my analysts can go do data with it and like that's just like what i do and i'm not i don't know i don't know why more people don't do this i guess is it like is it intimidating is it like Is it scary? Is it going to be mean to you and tell you your code's not very good? I don't care. I'm not a software engineer. I'm a data scientist. I don't identify as a software engineer. Is my code crappy? Yeah, probably.
Starting point is 00:07:17 I don't give a shit. I want my data. I don't care. You know what I mean? I've been very fortunate in my life to do a bad job of lots and lots of different things and have much more competent engineers come along and fix it for me once it proved to be useful i think you know but yeah i mean i think it's that last like we talk a lot about the tools we talk a lot about the databases the visualizations we talk a lot about the
Starting point is 00:07:38 methodology, we talk about how we collect data, but we don't talk about going and getting the additional data, going above and beyond. That's what we don't talk about. And that bums me out because that, to me, is the quiet, thankless, but ultimately heroic work that makes the real difference, I think, in how well we do and the impact that we can have as data people. Yeah. Yeah.
Starting point is 00:08:01 Yeah. I love it. I love it. 100%. Okay. So you realize as an analyst, you know, I need to spend more time on data quality because, you know, I can do the math, but really... Yeah, that is the big deal. So yeah, like data collection. To me it's like, data people are like, there's governance, there's quality, there's provenance. There's like, me, I literally know what this data means because I collected it myself,
Starting point is 00:08:30 right? Like I am out there foraging in the fields, in the forest, for the mushrooms that are going into the mushroom soup of the dish I am preparing, right? That is my approach to like data collection and data analysis, right? There's no governance thing, there's no game of telephone between like me and the data, you know, all this kind of stuff. It's just like, literally, I collected this, I know it's good, I know what it's supposed to look like because I went and read the code that generated it or wrote the code that generated it, right? That to me is what makes the difference. Yeah, that's kind of what I'm talking about. And again, in terms of people who do
Starting point is 00:09:05 this, it's pretty much just me as far as I'm concerned. Like, there may be other people who do it, but like I haven't run into that many of them, you know. So yeah. Following that path leads you, like, okay, so you leave analyst world and... Yeah, that's right. What was your next role? Were you sort of that, like, I'm crossing frontiers as a data engineer and getting all this data? ... analyst, like as a statistician. That was my sort of career ladder there. But then over time I transitioned over to the software engineering ladder, just because the leverage I had as a person who was willing to like get into the ad auction code itself and just literally grab the data I wanted and make it available to everybody on the team was so much higher than me doing any one-off analysis, right? Like just by far, obviously, right? So I made that transition there, right? When I left
Starting point is 00:10:14 Google, I went to Cloudera. At the time, I was kind of unfortunate. Like a lot of other people, I had been roped into working on what eventually became Google Plus, like Google's kind of Facebook killer social thing, and it was just, for any number of reasons, just an absolute nightmare project to work on. So I quit and went to work at Cloudera because I was, I loved the tools that we had at Google for working with, you know, large datasets. And I, I wanted everybody else, you know, to be able to use those tools as well for all kinds of different problems. And so I went to do that at Cloudera. And so I still talked about like how we did this stuff at Google, but it really was a period of time where I was very much away from data collection and doing, I was building tools and stuff.
Starting point is 00:10:53 Right. And then going to Slack brought me back to like, okay, I need to build a data collection system that is akin to what we had at Google so that we can go do so that people like me can do the kind of work that I did there, which is go collect instrument, get high quality immutable copies of every single thing that happens and do all this stuff with it downstream and stuff. And so that was what brought me kind of back to that work to being like militantly focused on data collection and like getting, getting high quality data from the source and all that kind of stuff. So yeah, that's roughly how that went. Yeah.
Starting point is 00:11:26 So I want to zero in on a couple of those steps. So at Google, so you started as an analyst. When you talked about going in to get the ad auction data, was that political at all? No. Okay. It was, to Google's credit, it was super not. It was super not.
Starting point is 00:11:47 If you were in a technical role at Google, like an engineer or like a statistician, we all did. I mean, I think we all did code reviews. We all used the same source control system. We would source control and review our analysis scripts. We would source control and review our code. The thing I think it's important to know is, like, the reason analysts have a hard time getting engineers to do this kind of work of just, like, add this field and copy it from, like, this record to this record is that this is not, like, high, you know, status engineering work. Right? Like, no one's going to see this.
Starting point is 00:12:26 It's not doing some badass algorithmic backend thing. Like, you're not going to talk about this at a conference. It's not sexy work. But to an analyst, this is everything. This is oxygen. This is everything. As an analyst, I am thrilled to do this work. Like, I am thrilled to go copy the stupid field from, like, you know, protocol buffer A to protocol buffer B. Like, nothing makes me happier. Yeah. And so, yeah, absolutely, it wasn't... It was also, I think, funny. You know, at the time, Google's auction code was just such a disaster. Like, back when I did it, it was just so... There's some parts of Google's code base that are pristine, elegant, gorgeous, and other parts are just, you know, absolutely toxic waste dumps, right? Yeah.
Starting point is 00:13:10 and this was the auction itself was one of them and so that was another reason why no one really stopped me because it was like yeah i mean if you want to go wander in there sure yeah good luck basically The house of horrors. Yeah, exactly. Exactly. And again, not high status work, but by doing it, I became the expert on how the auction actually works. Like how, how did it really, how, like, how did things actually work? How did all the various pieces and stuff interact with each other to ultimately determine the position and price of an ad on google.com and that turned out to be a very valuable thing because if like something went wrong or you know i was the one who could tell like okay this is these are all the ads we priced incorrectly and this is the compensation we needed to do and this is how we prevent it from happening again and all that kind of stuff that turned out to become like a very like i mean i find this happens like not all the
Starting point is 00:14:02 time, but more often than you'd expect, like doing like shit low status work can often become this kind of thing where it becomes a very crucial role because you're willing to do something that other people aren't, and in doing so you create a lot of value. Again, it's just not always the case, but in my career it's happened a few times. Anyway. Did you... OK, so you're digging into the auction code. Did you think a lot or sort of go down the path of thinking through sort of the, I don't want to say it's like almost like higher level economic principles that govern. I know you're looking at the code, like, is this served? But an auction by nature is an economic system, right? And you have supply and demand. And did you dig into that?
Starting point is 00:14:49 Yeah, so it was funny. When I was in Austin, I lived in Austin for a few years after college, just before I came out to California to work for Google. And I was getting kind of bored of my job. And I was also like just kind of that sort of masochistic 20-something idiot that likes to just do things that are impressive. So I started graduate school at the University of Texas at the same time, um, and so I was studying like optimization theory and statistics and stuff like that. I took a class in mechanism design, which is an economics discipline of like how do you design auctions and stuff like that. So I actually had an interest in this area coming in, and I was especially interested to like go to Google to do auction stuff real,
Starting point is 00:15:26 like for real money. Yeah, it's the real... Yeah. So I had like a decent academic background in this stuff, just in the sense that I like took a graduate level course on auction design. And then at Google they had, of course, like literal real world auction experts, people who've done PhDs and papers and stuff like that that I could work with and call on for their expertise. And again, they weren't in the code, but they were the experts who could kind of explain how things were supposed to work and analyze stuff and all that. So anyway, so I was kind of the bridge between the engineering team and the like research sort of analytics folks.
Starting point is 00:15:59 Yeah. Fascinating. Fascinating. I bet. I mean, gosh, I bet that was really fun. It was super fun. I did it for a number of years and you know i quit honestly i left to go do something else really just kind of when you reached a point where like it's sort of a bummer but it's like the limiting factor on how interesting an ad auction can be is really like advertisers we just like so much more interesting
Starting point is 00:16:22 computational like like how much would you know like one of the things is, if you do a... again, I'm just going to keep using you guys as an example. If I go to Google and I do a query for RudderStack, then I'm going to see all of your competitors listed, and that's on the top of the page, right? Everyone does this to everyone else and stuff. You go query Snowflake, there's a Databricks ad, right? How much should you be willing to pay to not have those advertisers show up? Like, this is a thing you can do. There's a whole auction theory around this. Like this is a sort of product you can buy, but it's very complicated, and it's so complicated that, like, you know, you would need a PhD in economics for me to like explain to you how you should bid on this kind of thing, right? But for a computer, it's relatively simple. Anyway, so like, yeah, I don't know. It was one that we went through this project where
Starting point is 00:17:10 we built like this very cool, awesome auction simulator that would let us like back test for advertisers different ad strategies and all kinds of cool shit like that. But it kind of ended up like flailing over the fact that we couldn't figure out a way to make it easy enough for advertisers to use it to make decisions. Like it kind of came down to, like, those sort of math tests you do on the SAT where they show you like a table or a chart and they ask you questions about it and stuff. It was kind of like that kind of thing. It's like we could show them data from it, but they wouldn't be able to draw the right conclusions from it. And it's like, okay, well, this is not actively doing what we want.
Starting point is 00:17:46 And it's such a bummer because it's like this massive power imbalance between very large advertisers, or Amazons or eBays, who have teams of people like me doing this stuff all the time and can optimize prices down to the fraction of a cent. And then just tiny little mom and pop shops on Shopify or whatever who are just bidding a dollar and they have. Right. It's just, anyway, it's a whole thing. It's okay. I, you know, it's just like, I could talk about this stuff for hours, but I'm sure there's other stuff we could talk about too.
Starting point is 00:18:14 Super interesting. Okay. Let's jump to Slack. So you joined Slack. How big were they when you joined? It was 240 people. 240. It was pretty big. By my standards, that's a big company. Like Cloudera was like 80. Yeah, anyway. But the data team was small. Data team was small. There was one data engineer.
Starting point is 00:18:38 And then I was hired to be the director of data engineering. One data engineer? He started two months, roughly two months before I joined. It was like, yeah, when he came on board to start building the early data infrastructure. That's right. What did they do for analytics? I mean, that's like wild to hear.
Starting point is 00:19:00 Yeah, so I think there's a few things here, right? I think for a number of years, I think Slack was well known as the fastest growing enterprise software company ever. And one of the things that's kind of underappreciated about being the fastest growing whatever is that you don't really actually have to do anything, Eric. You don't have to do anything. You can just like kind of sit back and just people will keep coming and showing up and using your product. Like there's, you don't need to like, you don't need to A-B test anything. Marketing attribution. I don't. Doesn't matter. Right. Exactly.
Starting point is 00:19:32 Right. I mean, they had a marketing stack, right? They had, you know, kind of like, I don't remember all the things they tried. There was Optimizely at some point. There was Marketo. There was Amplitude. It was all the things, right? Yeah.
Starting point is 00:19:46 In fact, one of my main jobs really was just throwing all that stuff out, actually, from Slack's stack. One, because no one was really using it. Again, because the product just grew. You didn't have to do anything, right? I'm being facetious here when I say this, right? Obviously, they worked very hard on the product. Slack
Starting point is 00:20:04 is the poster child of product-led growth. Like, it was their idea that, like, just make the product really good and people will just use it, right? And they A/B tested landing pages, and again, there's like things here, but by and large you did not need a massively sophisticated data infrastructure to do any of this kind of stuff, right? Anyway, so yeah, I started building it really primarily to provide the infrastructure we needed for when that stopped working, as much as anything. Like once you sort of, you know, fill your initial target market and it's time to start growing beyond that to the enterprise and all these other people and stuff like that, you do need to get serious about all this kind of stuff. And so I was building sort of the infrastructure to make that.
Starting point is 00:20:47 And then also the infrastructure for like machine learning applications and stuff like that. That was my other sort of major sort of area of focus was like making it possible for us to do, you know, search ranking and retrieval optimization and stuff like that. Like making all those things possible requires, you know, really good data infrastructure. So that was what I set out to build yeah yeah absolutely so okay you come in there's one data engineer there
Starting point is 00:21:12 yep how did you decide where to start right i mean is it yeah yeah it's a i mean growing very quickly of course like you've lived so many times. Once teams start getting data that's useful, it creates an insatiable appetite. Totally. I mean, I did it basically wrong, Eric, in every which way. And I mean, I talked about this a little bit a few years ago where I was going through the list of mistakes I made
Starting point is 00:21:41 building data at Slack. The thing I started with really was data collection, and it was data collection of the form that we had at Google. So at Google, like everything at Google is a protocol buffer. Protocol buffers is kind of a venerable binary format. People use it for a bunch of different things these days. So I used a variation of it that Facebook had created called Thrift at Slack because it
Starting point is 00:22:06 was kind of a better fit for our stack. And so what I said essentially was, okay, at the time, like the data warehouse that existed was like Parquet files in S3. We were using AWS, we were using EMR, and we were running kind of a Netflix style data architecture, like spin up Hadoop clusters and Presto and stuff like that to use stuff, all these kinds of things. And so I said, from now on, going forward, the data warehouse will only accept data records that have a Thrift schema associated with them. This is it. This is the only thing that we will read. You can't send us JSON anymore.
Starting point is 00:22:40 You can't, like everything must come in with a Thrift schema. That was like the rule. And I was again just incredibly fortunate to be at the company at a time where I could make that kind of decision and get buy-in from all the stakeholders, because there was no one basically to stop me. And so that was like the very first thing I decided to do, and that was the only thing I did right. But, you know, it worked out in the sense that that same kind of rule and protocol still exists today. And Slack can go reprocess all their data from 2015 going forward because it's all Thrift records. They can transform it to anything they want. And this is a very good thing.
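The "schema or nothing" rule Josh describes can be illustrated with a small sketch. This is not Slack's pipeline and it does not use the Thrift library; it is a plain-Python stand-in in which a frozen dataclass plays the role of a Thrift struct. The names MessageSent, SCHEMAS, SchemaError, and accept are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Type


# Hypothetical event schema; a plain-Python stand-in for a Thrift struct.
@dataclass(frozen=True)
class MessageSent:
    user_id: int
    channel_id: int
    ts_ms: int


# Hypothetical registry: the only record types the warehouse will accept.
SCHEMAS: Dict[str, Type] = {"MessageSent": MessageSent}


class SchemaError(ValueError):
    """Raised when a record has no registered schema or does not match it."""


def accept(schema_name: str, payload: Dict[str, Any]):
    """Return a typed, immutable record, or raise SchemaError."""
    schema = SCHEMAS.get(schema_name)
    if schema is None:
        # Schemaless data (for example, ad hoc JSON) is rejected outright.
        raise SchemaError(f"no schema registered for {schema_name!r}")
    expected = {f.name: f.type for f in fields(schema)}
    if set(payload) != set(expected):
        raise SchemaError(f"fields {sorted(payload)} do not match {sorted(expected)}")
    for name, typ in expected.items():
        if not isinstance(payload[name], typ):
            raise SchemaError(f"field {name!r} is not of type {typ.__name__}")
    return schema(**payload)


record = accept("MessageSent", {"user_id": 1, "channel_id": 2, "ts_ms": 3})
```

Because every stored record is tied to a named schema, old data can later be re-read and transformed with confidence about its shape, which is the benefit Josh attributes to the rule.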
Starting point is 00:23:17 Everything else I did wrong, though. I did not, first of all... Do one thing right. Yeah, well, sometimes it turns out to be like, I, you know, I used to work for a guy named Jeff Hammerbacher, and he gave me the advice about management. And he was like,
Starting point is 00:23:29 if you hire the right people and you motivate them properly, you incentivize them properly to do the work, you can do everything else wrong and you'll still be okay. And I feel like I was a very like prominent example of that management philosophy in action. Because I did that, but then I did
Starting point is 00:23:45 everything else wrong, more or less. In terms of things I did wrong, I did not have a sort of flagship customer. When I first got to Slack, I was doing kind of peanut butter style, let's help out the growth team a little bit, let's help out the platform team a little bit, let's help out the performance team a little bit,
Starting point is 00:24:01 let's help out the machine learning team a little bit. And so I didn't have a flagship customer who was there, and it was like I was there for them and they were there for me and we were going to go build this stuff, right? So I didn't do that. I was spreading stuff kind of all over the place, and that was a terrible mistake, and I regret that. I insisted on building everything ourselves and using open source everything and running everything ourselves and stuff, and that was also a terrible mistake, because I'm like running, you know, bleeding edge Spark versions. And this is like back at like Spark 1.4 and stuff like that.
Starting point is 00:24:32 Right. And so I'm spending all this time debugging Spark issues that don't have a Stack Overflow answer and stuff like that. When I should have just been like, you know, writing an enormous check to Snowflake and then like letting them go do this for me, like things like that. Right. So, yeah, fundamentally, like Eric, I came in with a bunch of preconceived notions of like, I am going to make Slack's data systems look as much like Google's as humanly possible. Because I know that if all this data stuff looks like Google's data stuff, then I can be successful and I will be successful. This will be good for me. And I see this like all the time in early stage startup employees that come from big companies.
Starting point is 00:25:11 It's like if you could just take this function and make it work exactly like the function worked at my previous company, I could do all this awesome stuff for you. Right. And it's like a fairly classic error, as opposed to taking the time to understand the culture, the needs of the company, like all that kind of stuff, right? Like what's right for this place, not what was right for the last place, right? I made like kind of all of those mistakes. And so, yeah, I don't hold myself up as, like, to the extent that I got anything right, it was just kind of a coincidence or like an accident, you know what I mean? As opposed to like I was thinking things through from first principles or I'm like some kind of management or technical visionary. I wasn't. I was just lucky
Starting point is 00:25:48 that this one decision actually turned out to be right because i can show this litany of other decisions that turned out to be wrong so yeah yeah anyway do you feel like that dynamic you know i've heard it called sort of like fighting the last war yeah yeah analogy absolutely yeah you mentioned a few things that fall into that category in terms of like the technology side did you also bring like any cultural or sort of management elements of that from google yeah i did i did oh yeah i think the management one and i ran this with a few other googlers who came to slack google is very big into like technical leadership and management being in the same person being like this the same sort of like the same human is both like the technical
Starting point is 00:26:37 leader and the manager for a group or a function or something like that and engineering directors at google are expected to be very technical all that kind of stuff and it's like it's deeply not the case or at least was not the case when i was there like technical leadership is a separate function separate person from human management like the managerial aspect of things and that was a culture shock for me because i was not like that it did not operate that way and so really just didn't kind of like naturally fit into slack's management culture which didn't want to operate that way i think it's interesting i've again like like most things with time and perspective i come to appreciate the virtues of both systems google system is good from like a communications perspective and that like all of the information that's kind of critical in the interactions between like technical decisions management decisions and stuff are all in one person's head and stuff like that
Starting point is 00:27:28 and that saves a lot of time and, you know, means fewer meetings and less overhead and stuff like that. The downside, and kind of the most pernicious one, is really that the Google approach is kind of ripe for abuse. If you work for one of these technical lead managers, there is essentially one person who has a tremendous amount of control over your career, in terms of what you do and how you do it and stuff. And that's not good. You know? So there was, I would say, and this has obviously been dealt with over the years at Google, but there was a lot more questionable behavior, I would say, from certain technical lead managers at Google
Starting point is 00:28:07 that was just absolutely not tolerated at all at Slack. Like, if anything like that happened, you were just gone. You were gone that day. And that was in many ways much better, I think. It's kind of a checks and balances system to prevent abuses of power. Again, no place is perfect by any stretch, but it was significantly better, I would say. So anyway. Yeah. Yeah, that makes sense. Combinations of authority, you know, people will naturally act out of self-interest. And so when you combine different modes, different realms of authority, the consequences are greater.
Starting point is 00:28:50 Absolutely. Exactly. Again, the good is better and the bad is much worse. And again, it's just like choosing what is right and what makes sense for your organization. I think this is kind of what strategy and company culture and all this kind of stuff is about. It's like, if you want to do something and everybody wants to do it, like, let's ship high-quality software. Yeah, obviously, great. Everyone wants to ship high-quality software. That's not interesting. That doesn't say anything about you. But what are you willing to give
Starting point is 00:29:17 up to do that? Are you willing to hold timelines indefinitely? What are you giving up? If you're not giving anything up, it doesn't mean anything, right? Yep. And so, yeah, I think again, in my old age and my experience and stuff, it's being able to see those trade-offs and kind of understand that stuff. The other thing I've always wanted to ask a CEO, I guess, is, I would love to find a CEO of a company who could honestly tell me what the most important function of their company was. I've never actually got a CEO to honestly tell me that. I don't think. I can ask anyone who works there
Starting point is 00:29:49 and they can clearly tell me what the most important function is. We are an engineering-centric company. We are a sales-centric company. We are a design-centric company. It's unambiguous to everyone who works there, but just a CEO that would just say it would just be so refreshing to me.
Starting point is 00:30:03 Interesting. But I love all my children the same. Of course, everything's important. It's all important. Yeah, but it's not though. It's just deeply not true. Right. It's never true.
Starting point is 00:30:12 Right. Yeah. Anyway. Yeah. That's a, yeah, it's interesting. I mean, yeah. Feature requests, right. Or generally like that's a great proxy for like what's actually important to a company.
Starting point is 00:30:23 Like exactly. It's the tail wagging the dog. Exactly. In one way or another. 100%. I've seen that over and over. 100%. Totally.
Starting point is 00:30:32 Yeah. Okay. You actually changed roles at Slack. I did. You were doing data engineering. But before... I want to dig into that because that's super interesting. Okay.
Starting point is 00:30:43 And you went to work on search. But before we jump into search, one thing that's really interesting to me about companies like Slack is that you are sort of a daily user of the product that you're building infrastructure for. Yeah. We had Erik Bernhardsson, who was at Spotify really early on, and he talked about this visceral experience of building, it's like, okay, we're all using Spotify for 12 hours a day while we're building it. Yeah. What was that dynamic like at Slack? Oh, that's a great question, man. It's so nice of you to ask, because I don't think I've been asked that before. It was amazing, in almost every way. I guess the things I would say here: Slack used Slack for everything. Everything. Every process at a company that would normally be handled by Workday or
Starting point is 00:31:45 email or documents was done via Slack. It was done via Slack integrations. All of our data analytics, like we built our own kind of in-house dashboarding tool, like Mode or something like that, but it had very deep Slack integrations so that you could interact with charts and visualizations and, you know,
Starting point is 00:32:03 and again, we basically backdoored all this stuff into the Slack product because we were Slack and we could do that, right? Yeah. So yeah, everything was done in Slack. I used to, you know, I kind of miss it, actually, I used to analyze Slack usage at different companies to kind of understand, you know, it was hard. Slack users didn't really, Slack teams didn't really churn. And so churn was, like, always kind of tricky for us.
Starting point is 00:32:29 But, like, we were trying to understand, like, good Slack usage, like, good Slack usage versus bad Slack usage. Like, there are people who have Slack but use it essentially, like, just for DMs. Like, they don't use channels and stuff, right? Yeah. And I consider that, like, bad Slack usage
Starting point is 00:32:42 because, like, what's the difference between using Slack and using any other DM client, right? Slack as a company, as a user of its own product, was kind of off by itself in terms of the number of channels consumed per user per day. The number of times people visited a channel and read messages and that stuff was just off the charts. The joke at Slack was always that Slack the product was always optimized for whatever size company Slack was at the time, you know? So when Slack was 10 people, Slack was best for 10-person teams. When Slack was 240 people, it was best for 240-person teams, and so on and so forth, kind of up the stack. And the hard part, and this is something that I think, like, I don't want to,
Starting point is 00:33:30 I don't want to pick on anybody here, but dbt has gone through this. dbt is one of my angel investments and I do a lot of dbt stuff. And so I know a lot about dbt. And dbt has gone through this hypergrowth thing that Slack went through as well, in the data space. And they're kind of the poster child for this stuff, right? And Pedram, Pedram Navid, who
Starting point is 00:33:50 was on this show a little while ago, I think, right? He wrote a very famous "We need to talk about dbt" kind of blog post. It was really good, and he said a lot of things that were deeply true. I think what's hard for folks who aren't at these hypergrowth companies to understand is how all the wheels come off of everything, all the time, when you're trying to grow, especially when you're trying to grow into very large enterprises. Yeah. And so for Slack, for us, that was really when IBM adopted Slack and we went from the largest team being like 8,000 people to the largest team being like 100,000 people. Dude, everything broke. Everything broke for a year. Every single thing, right? All we were doing was just fighting fires and making this
Starting point is 00:34:38 thing work for IBM. There was honestly no engineering capacity for anything else, right? Yeah. And again, that's one customer as a flagship, but it's all your processes. It's everything. Everything is constantly breaking all the time. Yep. And it's hard to see that externally. All you see externally is like, wow, Slack's not really shipping any features, blah, blah, blah. And it's like, we're actually shipping an enormous amount of features, but it's nothing you can see because you're not on a hundred-thousand-person team, right? Yeah. And that's just incredibly hard and exhausting, and I feel like at this point in my career, I don't want to do that again. Yeah. I've had that experience, and I have a lot of friends who were at Notion and Airtable and other companies like this.
Starting point is 00:35:27 And I'm just like, yeah, you know, y'all I'm so happy for you guys. That's great. But I've done that and I don't, I don't feel the need to do that again. I'm okay with just like, you know, pretty regular growth. I don't need to experience that again. I got my taste. I'm fine. Yeah.
Starting point is 00:35:43 Yeah. Yeah. That. Yeah. That makes sense. I've heard, I've heard it described as sort of appreciating the physics of the system, right? Like expansion rates, you know, causes chaos, heat, destruction, you know, like, and I would imagine at Slack that, you know, cause you know, maybe if you think about like infrastructure where there's not a lot of, you know, user interface elements, right? It's like, okay, well, that's a little bit different.
Starting point is 00:36:16 But I would imagine with Slack, like the physics are actually like catching everything from the UI down to what you were doing on the data collection side on fire because it's like, okay, well, IBM needs this thing. It makes sense. Well, how do we prioritize that against all of these other feature requests and PRDs and everything that everyone else has? Totally. Culturally, you have great product managers who are like,
Starting point is 00:36:43 I just got deprioritized. I can't imagine. Yeah. You don't want to, Eric. It's not, I mean, it's not like, I don't know. Anyway. Yeah, yeah, yeah. It's not fun, I'd say. Yeah, exactly. Yeah. I mean, it's fascinating. It's absolutely fascinating. Okay, so how did you get interested in search? So you went to work on search at Slack. Can you describe that transition and like how that came about? Yeah, so when we were talking about all the things that were breaking at Slack and search
Starting point is 00:37:18 was actually like pretty high on the list of things that were breaking, kind of circa really late 2016, I think, and then 2017. So the original Slack search stack was built on Solr, like Solr for, you know, open source search technology and stuff like that, a much older version of it, and we needed to do a whole bunch of things to upgrade it. And the data engineering part of the problem was that we needed a way to build a sort of historical search index for Slack
Starting point is 00:37:53 that was sort of optimized. So how do I explain this? There are two kinds of search problems people have, broadly speaking. There is write-intensive search, which is like Elasticsearch, okay, like logs ingestion, or Splunk or something like that. That's write intensive. So for write-intensive search, writes and updates to the index outweigh queries by like a factor of 100. And then the rest of search, like e-commerce search or web search, is read-intensive search, right? So writes are relatively rare; everything is optimized for reads. Slack, amusingly, is kind of both in some ways. So unlike Elasticsearch, where you keep maybe a week or two of logs, right? Yeah. At Slack you actually have
Starting point is 00:38:36 everything forever right so messages are getting written at a super fast rate, like 100x, right? However, typically speaking, about 99% of messages never change within like 10 minutes of being written. There will be no changes, no modifications to them ever, right? No edits. So you can treat the historical index as a read-only index and then have a real-time index for just the stuff that's happening right now. So you need to do both and then kind of unify them together. And so the problem of how do you take the existing write index and restructure it to become a read optimized index
Starting point is 00:39:12 and like just reorganize the data and sort of build everything for reads is essentially a data engineering problem. And it's like the data engineering problem that like Google had to solve to build web search. And so the data engineering team I was managing was working on this problem when I was managing it. And I was just kind of, again, because I'm like super technical and can't like,
Starting point is 00:39:31 basically can never tear my fingers off the keyboard, basically, when it came to programming. When I left management, I basically pushed the engineers who were working on the problem out of the way and took over the problem from them because I wanted to do it so badly. This is again, kind of, you know, more examples of me being not a great manager director type person, but nonetheless. So yeah, that was the gigantic data pipeline problem that I had to solve: how do I take all this data and restructure it into a read-optimized format to fix a bunch of performance issues we were having? Performance was the big problem. Like, the p95 query latency at Slack was like
Starting point is 00:40:11 five seconds or something like that back in 2017. Like, you'd type a query and it would take five seconds to get a response, which is basically infinity for all intents and purposes, right? I actually remember this. Yeah. It started to degrade pretty significantly once your space got... Yes, exactly. Exactly. Cause it really was because all of your data was being served off of a single server that was, you know, like co-hosted with a bunch of other teams. It was like a multi-tenant kind of instance with like essentially no sharding whatsoever, like replication, but no sharding. And so if you had like a noisy neighbor, like you were basically screwed,
Starting point is 00:40:48 like your search just wouldn't work because someone else was like so dominant. And again, because there was no dedicated sort of write-centric infrastructure and read-centric infrastructure, it was all the same thing. And so like, you know, on the write side, you're just taking data as fast as you can
Starting point is 00:41:02 and making it searchable. But doing that is not a great strategy for making it fast, right? So anyway, all this kind of stuff was what we built, was what we fixed. It was great. It was so hard. I loved it. I can't imagine. I mean, it is interesting.
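The split Josh describes (a read-only historical index plus a small real-time index, unified at query time) can be sketched roughly as follows. This is a minimal illustration, not Slack's implementation: the names `TwoTierSearch`, `build_index`, and `tokenize` are hypothetical, and a plain in-memory inverted index stands in for Solr.

```python
from collections import defaultdict


def tokenize(text):
    """Naive whitespace tokenizer; a real system would use an analyzer."""
    return text.lower().split()


def build_index(messages):
    """Batch-build an inverted index (term -> set of message ids).

    Stands in for the offline pipeline that produces a read-optimized
    historical index from messages that no longer change.
    """
    index = defaultdict(set)
    for msg_id, text in messages.items():
        for term in tokenize(text):
            index[term].add(msg_id)
    return index


class TwoTierSearch:
    """Hypothetical sketch: historical (read-only) + real-time (write-heavy)."""

    def __init__(self, historical_messages):
        # Built once in bulk and then only read from while serving queries.
        self.historical = build_index(historical_messages)
        # Receives every new message as it is written.
        self.realtime = defaultdict(set)

    def write(self, msg_id, text):
        """New messages only ever touch the small real-time index."""
        for term in tokenize(text):
            self.realtime[term].add(msg_id)

    def search(self, term):
        """Query both tiers and unify the results."""
        term = term.lower()
        hits = self.historical.get(term, set()) | self.realtime.get(term, set())
        return sorted(hits)


if __name__ == "__main__":
    search = TwoTierSearch({1: "deploy failed", 2: "lunch plans"})
    search.write(3, "deploy fixed")
    print(search.search("deploy"))
```

The design choice being illustrated is only the separation of concerns: writes never touch the large index, so it can be laid out purely for fast reads.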
Starting point is 00:41:17 I hadn't thought about how that's a multidimensional problem, right? Of course not. Because usually you have sort of a one-dimensional search problem because of the nature of your business, right? That's right. Okay. So can we talk a little, you mentioned something really interesting. So let's hear like e-commerce to Slack. So e-commerce, like inventory is changing. I mean, search in many ways is almost like ephemeral unless you think about durable patterns around a category or historical archive for a company that in many cases becomes
Starting point is 00:42:10 like a reference for the work that people do every day. Like that is, how did you, like when you thought, when you studied sort of the usage patterns of search, how did you think about the... Maybe I'll direct the question a little bit. How did you think about the problem in terms of, I need to find that file versus, okay, there's this giant log of historical information that provides a lot of context. Even those are very different search problems. That is deeply true.
Starting point is 00:42:44 Absolutely. So when I was working on search at Slack, again, we were really just mostly focused on this kind of performance, latency kind of problem. So we were doing fairly minimal relevance optimization, other than just kind of keeping the system from, you know, catching on fire and burning down and stuff like that. So this would be very classic search relevance, like BM25 ranking with a time decay factor and stuff like that.
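For readers unfamiliar with the idea, "BM25 ranking with a time decay factor" can be sketched as below. This is an illustration under stated assumptions, not Slack's actual scoring: the parameters `k1=1.2` and `b=0.75` are common textbook defaults, and the multiplicative exponential decay with a 30-day half-life is a hypothetical choice made for the example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def time_decay(age_days, half_life_days=30.0):
    """Assumed decay: the weight halves every `half_life_days`."""
    return 0.5 ** (age_days / half_life_days)


def rank(query_terms, docs, ages_days, half_life_days=30.0):
    """Return document positions ordered by BM25 score times recency weight."""
    base = bm25_scores(query_terms, docs)
    final = [s * time_decay(a, half_life_days) for s, a in zip(base, ages_days)]
    return sorted(range(len(docs)), key=lambda i: final[i], reverse=True)


if __name__ == "__main__":
    docs = [
        ["deploy", "failed", "again"],  # old message
        ["deploy", "failed", "again"],  # same text, recent
        ["lunch", "plans"],
    ]
    print(rank(["deploy"], docs, ages_days=[300, 1, 0]))
```

With identical text, the two "deploy" messages get the same BM25 score, so the decay factor alone decides that the newer one ranks first.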
Starting point is 00:43:13 Basically, we weren't doing anything special. We were doing something any out-of-the-box Solr person could do. At the same time we were, you know, fixing search, we started hiring a relevance engineering team, right, to start doing a much richer and deeper set of analyses and stuff, and build a dedicated ranking service that would personalize search for you based on, like, who do you interact with, what channels do you interact with historically, right? That kind of stuff. Introducing embeddings and sort of vector search techniques to rescore and understand what a message is about, broadly speaking, is also a very important kind of problem. Files are much easier
Starting point is 00:43:55 than messages in Slack. So files: lots and lots of context, lots of information. Messages: little tiny snippets. So the kind of search document in Slack isn't just a single message. It's really a kind of collection of timed messages that occur in the same relatively narrow window that you use for ranking and relevance and stuff like that. Again, it was very crude when I was there. It was just, like, grab the most recent N messages within some window of time and then call that the document. My understanding is it's become much more sophisticated over time, again using embeddings to say, yeah, these messages are about the same issue, and that kind of stuff, right? You've
Starting point is 00:44:33 gotten, like, they've gotten much better at it over time. The big thing to understand about search is that it's kind of like what you said, like a little tiny startup that has 10 people doesn't really use Slack search because you can just ask somebody, right? You know, like you, you DM everyone in the company. And then on the flip side,
Starting point is 00:44:52 like IBM, IBM uses the hell out of search because you have no idea in general. Yeah, exactly. Like it's just, you can't do that at IBM. Like it's just not an option. Right.
Starting point is 00:45:02 So people use the hell out of search. So it's something like three-fourths of searches are done by the largest of the large companies on Slack. It just sort of dominates the usage and stuff like that. And that transition over time, you kind of basically see as an organization grows. I'm trying to think where the knee in the graph is, but basically it goes exponential. Search is very low at 10 people, and then at 100 people, it's marginal. But then you hit 10,000.
Starting point is 00:45:32 It's going straight up, basically, in the number of searches they do. And so again, it's a relatively small fraction of Slack users, but it's also a relatively high fraction of the most valuable users. It's the large enterprise deals and stuff that are doing the most searches.
Starting point is 00:45:52 That makes total sense. The context is really difficult. It makes sense that files are much easier because from a ranking perspective, it's like, someone in authority above me approved this in either a channel or a DM.
Starting point is 00:46:12 And I remember they said, that sounds fine. And I'm trying to find that, right? Yeah, yeah, totally. That sort of context is super difficult. Absolutely. Completely agree. Yeah. Okay, I saved the best for last.
Starting point is 00:46:28 Sure, sure. We can keep going, but let's talk about DuckDB. Oh, okay. If you want to. I don't talk about DuckDB very much anywhere, so this is a nice change of pace for me. Well, it's super, it's really interesting to me, even being, you know. I'm kidding, Eric. I talk about DuckDB, like, all the time. Like, literally all day. All I do all day is talk about it. So yeah,
Starting point is 00:46:50 but I'm the reply guy on Twitter, like, saying, have you tried using DuckDB for that? Like, that's me. But what's interesting is you've been doing that for a long time. I know. But it wasn't cool. It's still not, Eric. It's definitely, people are super annoyed. I think it's cool. I think it's cool. Thank you. That's generous of you. Why did it become cool,
Starting point is 00:47:12 you know, sort of in the recent past? Like it's been around for a while. Yep. And you've been talking about it for some time and you've built, you know, dbt-duckdb and small things. Yep. And so you've been working on this for a while, but why do you think it's become popular?
Starting point is 00:47:31 So I, it's a good question. It's a good question. There's a lot of things, Eric, and I don't, I, so I've not personally used RudderStack, so I can't like comment on this. Right. But there are a lot of tools, a lot of software tools that are kind of like experiential goods is how I describe them. How do I sell you? How do I market to you on like why you should go skydiving?
Starting point is 00:47:54 Skydiving is an experiential good, right? You either do it and it's life changing or it's not, right? That kind of thing where it's terrifying and you die, whatever, right? Yeah. This is true of so many software tools. Slack is an experiential good, right? I mean, again, if you described Slack to me in 2013, it's like, hey, we're going to take IRC, this thing from the 70s that nerds use, and we're going to make it kind of pretty and we're going to put it on your phone, I'd be like, you know, great, good luck with that.
Starting point is 00:48:21 like you know call me how that works out. Right. Yeah. Chrome had this experience for me. Like, do you remember like using like, like downloading Chrome for the first time? And it was faster, just consistently faster. Right. Like that kind of stuff.
Starting point is 00:48:34 Way less loaded than it is today. Yeah, exactly. You know, again, like talking about like dbt was this experience for a lot of people. It's like, it's an experiential good.
Starting point is 00:48:44 You have to try it. There's a tool, an observability tool I love called Honeycomb, that is absolutely fantastic. They're doing kind of traces and sort of deep analysis of requests and stuff. But again, it's an experiential good. You have to try it to really understand how much better it is. And so again, tools like this have these kind of long, you know, flat curves, and then things go exponential as more and more people try it. And then the more you hear about people trying it, you want to try it yourself. And it turns out that it does live up to the hype. Chrome lives up to the hype, that kind of stuff. It's these overnight successes that
Starting point is 00:49:21 are years and years in the making of like, yeah, that kind of stuff. I think, you know, what makes DuckDB so great to me is really just like very solid kind of like engineering fundamentals, like this is not it. Like there's, I don't want to say like, there's nothing innovative about it. There's tons of innovative stuff about it, but like from a principles, like database architecture principles perspective, it's a vector volcano model. There's no just-in-time compilation using LLVM or anything like that.
Starting point is 00:49:52 There's none of this other fancy stuff. Facebook has this thing called Velox now. And it's super cool, fancy, hardware custom compilation. There's some cool-ass computer science stuff in here, but you can't just download it and use it. It's not that kind of party. DuckDB is just like, you can just
Starting point is 00:50:12 download it and use it with your regular boring-ass CSV, Parquet, JSON files, and it will just make your life better. And that's it. It's just exactly like what you're doing, just better. You don't have to run a server. You don't have to upload any data. And it really is just like, you know, first of all, I would just say, Hannes and Mark, the co-founders of the project, DuckDB Labs, and CWI and the other ones before that, not only are they fantastic, exceptionally
Starting point is 00:50:42 skilled database people, and I think, you know, I don't know, you know, as I do, but database people are, like, the best. They are the best software engineers. It is the highest, they're off like 100, right? They are also just two of the nicest people in the world, and they were so utterly kind and welcoming to me when I just started hacking on their stuff back in 2020 and was like, hey, I want to do this weird thing with dbt and I need some stuff, I need to go make some changes to the database. They were super nice and welcoming to me, who had not written C++ in anger in, like, I don't know, 10 years or so at that point. To make a sort of community and a product and experience that was super open, super welcoming, very strong kind
Starting point is 00:51:25 of yes-and kind of vibes and stuff, in a lot of software cultures that are very, like, no, fuck you, kind of thing, in a lot of ones. Oh, sure. And so just a combination of technical expertise with kindness and community building. And then, I mean, I feel bad, man, I could just prattle on about how great DuckDB is all the time. But the thing that's just become so cool about it is how they're really just free to hack and try new cool ideas without having to worry about, it's not that they don't worry about compatibility, they do, like their test suite is extensive and phenomenal and stuff, but they're free to try new things and just see what works, in a way that just doesn't feel like it's true of a lot
Starting point is 00:52:13 of data vendors and stuff right now, I guess. And I'm not totally sure why that's true. Like, why are they still free and other data vendors aren't? But they are, and it's just incredibly refreshing, I think. Yeah. Anyway. Yeah, I'll stop there. Anyway. Yeah. Well, I actually want to, you've used so many data tools, right? You know, it's like you said, okay, we're going to build these things ourselves, we're going to do all these open source, we're going to run like, you know, these really early releases of Spark, it causes all these problems, et cetera. So your opinion carries so much weight, but I want to dig into that intangible just a little bit more, right? And I'm going to use an analogy.
Starting point is 00:52:55 So like my dad's a mechanic and he has, you know, a set of really nice tools and it's like, okay, you know, it's a ratchet, right? And it's like, well, okay know, it's a ratchet, right? And it's like, well, okay, a ratchet is a ratchet, right? And it's like, yeah, but like, you know, he has this really amazing ratchet, right? And it's like, no one uses it. It's just his, right?
Starting point is 00:53:15 And it's like, okay, when he lets me use it, it's like, oh, wow. Like, it's so precise. It's hard to describe. Like, it's a ratchet, you know? Yeah. How would you ratchet. Like you said, it's an experiential good. That's so interesting because it's very difficult to communicate about that without trying it. You're an advocate for it, obviously. But that's such a fascinating dynamic to me where there's almost like an ergonomic element to it that's difficult to describe in words. I completely agree. I mean, I
Starting point is 00:53:52 think it's like, it also I think the ratchet analogy is an interesting one. And I'm trying to like, I think where DuckDB benefits is from how easy it is to get going. It really is like pip install duckdb, import duckdb, connect,
Starting point is 00:54:10 query stuff, query CSVs, right? It asks sort of so little of you to get going with it compared to almost any other tool. Like even other open source tools. There's got to be a special circle of hell reserved for these sort of, quote, open source tools that make you sign up and create an account in the cloud, and again, apologies if RudderStack does that, I hope you don't,
Starting point is 00:54:32 before you're allowed to use them. I have some tools that do this in mind and I hate them. But yeah, it asks so little of you in order to just get rolling with it and try it out and see if it works and stuff. And again, I just think that's like, that's super key for this kind of stuff. It's almost like those like little friction points are just ironed out of the way, but it's actually hard to describe what feels like flow, right? It just is better. That's right. That's right. It just is better. That's true. That's right. Exactly.
Starting point is 00:55:06 And then kind of like, I mean, it's just like you said, like you feel it in your hands, like the way it feels in your hand. You're just like, or the analogy I would use would be like a good carving knife. Like if you've ever had the difference
Starting point is 00:55:15 of like, you know, carving like steak or whatever, like a really good, or a really good knife. Yes. Versus like kind of a mediocre knife. It's like, this is like a quality again, it looks like a knife.
Starting point is 00:55:27 You can't tell. Sure. Yeah. Yeah. You hold it, the weight. It's just, it's just better.
Starting point is 00:55:31 Yeah. I mean, that's exactly it. Yeah. It's like, I'm not working as hard and I can have more precision, but that's like, yeah,
Starting point is 00:55:40 it's interesting. Fascinating. Well, Josh, we are at the buzzer. Sweet. This has been amazing. You have to come back on when Costas...
Starting point is 00:55:52 Because I feel like we covered maybe like 15% of what I wrote down. Gotcha. But thank you so much. We'd be happy to. Thanks so much for giving us your time. Man, my pleasure, Eric. Thanks so much for having me. I appreciate it.
Starting point is 00:56:04 And like I said, we'd be happy to come back on anytime. Always good to talk shop, okay? Let's do it. What a fascinating conversation with Josh Wills. I think one of the biggest takeaways from his time at IBM, Google, Slack, and other companies
Starting point is 00:56:23 is that he has this tenacious curiosity that compels him to build his data projects by actually writing code in areas where he doesn't have any authority or, as he says, sort of any skill, which is really interesting. He talked about at Google going in and actually updating the code base for Google's ad auction for search in order to get the right data to do data analysis and setting up data engineering pipelines there, which was, you know, that's very intrepid in and of itself. But even with DuckDB, which is what he's been working on recently, you know, he said he hadn't written C++ in 10 years, right?
Starting point is 00:57:20 And he's submitting pull requests and collaborating with the people who started DuckDB because he just has this intrepid curiosity. So really incredible episode. So much to learn about tool adoption, growing a data team, learning from your mistakes and curiosity. But I think the big takeaway is that, you know, his intrepid curiosity is really what sort of led him to all of his success. So great episode. Subscribe if you haven't, tell a friend, and we will catch you on
Starting point is 00:57:52 the next one. We hope you enjoyed this episode of the Data Stack Show. Be sure to subscribe on your favorite podcast app to get notified about new episodes every week. We'd also love your feedback. You can email me, Eric Dodds, at eric at datastackshow.com. That's E-R-I-C at datastackshow.com. The show is brought to you by Rudderstack, the CDP for developers. Learn how to build a CDP on your data warehouse at rudderstack.com.
