Coding Blocks - Technical Challenges of Scale at Twitter
Episode Date: November 21, 2022
We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.
Transcript
You're listening to Coding Blocks, episode 198.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you'd like to find your podcasts.
Visit us at codingblocks.net where you can find the show notes, examples, discussion,
other stuff, rants, ramblings, whatever.
Yes, indeed.
Send your feedback, questions, and rants to comments at codingblocks.net and you can follow
us on Twitter at CodingBlocks.
And hey, we got a website at CodingBlocks.net and there's
links at the top of the page. Go to other
websites that we curate or
I don't know, contribute to, I guess.
I don't know.
Yeah. Hey, I'm Joe
Zach, by the way. I'm
Michael Outlaw. And I
am Alan Underwood.
And you know what they say about cliffhangers so what's the topic for the evening
what do they say about cliffhangers
so anxiety has built up so for this episode we didn't have any new reviews
since uh last episode.
So, you know, we'll have to like, I guess beg.
Maybe Jay-Z needs to do the beg.
Got it.
Are you, is this getting to you yet, Alan?
Do we beg right now?
Yeah, I mean, sure.
Let's do the awkward part up front.
Let's do it.
No, I meant the fact that
Alan asked and I just
walked away from it.
How did you not get that that was the cliffhanger?
I got it.
A lousy limbo player
walked into a bar.
They did?
Yeah.
Yeah.
You got the punchline right in the thing.
I did.
It was good, right?
Yeah.
Yeah.
Yeah.
That's thanks to my son.
That one got me the other day.
I giggled a little bit.
Yeah.
All right.
So, so for the topic of this particular show, we're going to be talking about this thing
that you might've heard of called Twitter.
Um, not the craziness going on, on the interwebs and with the company and all that
kind of stuff.
More about the technical challenges that they've faced as they've grown over
their,
what?
18 years?
No,
19,
something like that.
So we're going to do that.
But first we got a little bit of news.
So outlaw,
you want to read us the review names?
I already did that part.
Yeah,
you did.
Yeah. I didn't like it. Yeah, I didn't like it.
Oh, you didn't like it?
Okay, well, hold on.
Let me read it again.
All right.
Jay-Z, you got something up here?
Yeah.
Hey, Game Jam.
It's officially time to start talking about it.
So, yeah, I am super pumped about that.
I love doing it every year.
It's going to be year number three.
It's going to be better than ever.
We are officially soliciting.
Solicitating? I think it's a word.
No, soliciting.
I think that's it.
I'm feeling goofy tonight.
Soliciting theme ideas.
So you got an idea?
Shoot, email, text, tweet, whatever, and I'll get on the list, and we'll start voting soon.
It's just a lot of fun.
Now, how come, as a member of this three-party show, I didn't know that there was an official time that we start talking about the Game Jam?
Uh, yeah, probably because I don't share my notes. I should.
What makes it official?
That's because this is around when we talked about it last year.
Okay. So I have a little timeline, the whole, it's like, three months out we do this, two months out we do this, this is when the emails go out.
And you guys are hyper organized.
I have to be, otherwise I can't get anything done. There's no in between.
That's probably true. So if you're listening to this and you're like, oh, that sounds like me too, let me go ahead and tell you right now: make yourself a note, leave us a review, because that's why you forgot to do it already.
That's, that's true. That's technically true.
You don't have to do it now, just put it on the list.
Yeah, put it on the list. Yeah, you'll get to it.
Hey, so one last little thing here. I found this today, I came across this today. And you know, there are people that think that we know everything about what we do because we're 198 episodes into talking about coding, right?
Man, so I've been dealing with a particular problem that is driving me absolutely up a wall, right? Something was encoded some way, and it went through some sort of encoding, decoding somewhere, and trying to get that thing back to its original state is driving me absolutely crazy.
Man, I learned stuff in this one article that was written 19 years ago by Joel on Software. You've probably heard of him. But the title of this article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
No excuses.
No excuses.
If you don't know all this stuff about UTF-8, UTF-7, if you didn't know that was a thing, then you should read this.
ASCII,
ANSI, ISO, whatever the other ones are. If you don't know this stuff, go read this article.
It will absolutely do you a huge favor. I learned so many things today reading this article that I just never knew about. Like, I didn't know that when you do UTF-8, um, it can zero-pad the leading characters, or you can leave them off. I didn't know that. And that matters, because if you encode something to UTF-8, those leading zeros might be there, they might not be, but now trying to go back to where you started from can be an absolute nightmare, because those characters may or may not be there.
So it's a, it's really an interesting read, and you will learn so much about why this stuff even exists.
So definitely, after you're done listening to this on your drive or whatever, go to the show notes, codingblocks.net/episode198, and find this link for the Joel on Software article and read it.
It will absolutely do awesome things for you, just understanding how this stuff works.
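For anyone following along in the show notes, here's a minimal Python sketch of the kind of encode/decode round trip being described; the text and the Latin-1 mix-up are illustrative assumptions, not details from the episode.

```python
# A minimal sketch (not from the episode) of bytes being encoded one way
# and decoded another, which is how text gets mangled in transit.

original = "café naïve"                    # text with non-ASCII characters

utf8_bytes = original.encode("utf-8")      # correct: UTF-8 bytes

# Somewhere downstream a system wrongly assumes the bytes are Latin-1:
mangled = utf8_bytes.decode("latin-1")
print(mangled)                             # cafÃ© naÃ¯ve  -- classic mojibake

# Recovering the original is only possible if you know the exact mis-step
# that was applied; guessing wrong just mangles it further.
recovered = mangled.encode("latin-1").decode("utf-8")
print(recovered == original)               # True
```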
And it's very timely too, because, I mean, this article was written... Oh, you said 19 years ago. Yeah. I take that back.
Okay. Yeah. Yeah, it's been a minute.
Oh, we're allowed to forget stuff. It's fine, it's fine.
Oh man, look it up. Absolutely crazy. So yeah, um, good stuff, really good stuff there.
Um, so I guess with that, let's go ahead and get into the show. So we'll have some links in the resources for this. So if you don't know this exists, we've talked about various different, um, engineering blogs out there. Like I know in the past, Outlaw, I've talked about like the Uber one. We truly love that and all the stuff that they put out into
the world. We've talked about Netflix with chaos monkey and all the other things they've added. Twitter also has a fantastic engineering
blog and I believe, is it just blog.twitter.com? Can you remember? Blog.twitter.com. Yep. Yeah.
So if you don't know that exists and you do a lot with big data, especially this is a fantastic set of read throughs to where you can
learn from people that arguably have the most scale problems besides Facebook on the planet.
So, um, definitely go check that out, but we're going to hit on one here called Scaling Data Access by Moving an Exabyte of Data to Google Cloud. Now, here's the thing that I want to lead up with.
I started with this link. And the problem is, you have to kind of understand some of the history
of where they've been and all that before you can even get to this article. So even though I started
here, almost everything that we're going to talk about in this episode has nothing to do with this
particular article. It's what we'll be talking about today is an article they linked to at the very top
that talked about what they did to improve their scaling and ability to be able to look
at data analysis within the company.
So just a heads up.
Oh, sorry.
Go ahead.
No, just finish your thought.
Just a heads up. So we'll probably be coming back to this other one after this episode and talk a little bit more about things that have changed, since we go over what we're talking about in this one.
And in this one, you said they were moving how much data again?
Um, an exabyte of data. Yeah.
But how much data? Because you had to solve for exa, right? So what was exa?
It took a second.
It took a second.
You thought I had something serious
to say? I did. I did.
You totally got me.
So I don't know if you guys read this stuff, because I, I literally just went and picked out one of the blog posts and started going through it. Um, so we'll just chat about this thing as we go.
So, not to lie to you, I did not, I did not know that we were going to go... I did not read this one.
Okay, okay. I didn't.
All right. So, so we'll chat about this stuff.
Yes.
We'll chat about this as we go.
So in 2019,
we're at the tail end of 2022 now,
right?
So we've had three more years since this.
I can only imagine that things have gotten even bigger.
So just keep in mind,
this was three years ago,
but it was written this year though.
The article was written this year.
The article was written this year, but he's actually talking about the numbers from 2019 directly. So he didn't mention what the new numbers are, but they said in 2019, over 100 million people per day would visit Twitter feeds, right? Now, they didn't say whether it was from a website
or from an app or whatever, but just imagine 100 million people dialing in to that, what do they
call it, the fire hose or whatever, to where they're getting data out. That's a whole lot.
If you recall, too, just to back up for a moment, I think we talked about this as part of the
designing data intensive applications architecture, right?
Where some of the problems that Twitter has in terms of putting together a timeline when you would go to Twitter, right?
Like, Jay-Z, you might be able to speak to this better than I can, but there was issues of like, if I updated my feed, then Twitter had like one strategy for how they would put my message out there on Alan or Jay-Z's feed.
But if the other Jay-Z, who might have a couple more followers, put something out, it was a different strategy for how instead of blasting it out to everyone's queue it would be a on read or as
needed kind of read right something like that am i saying this right jay-z yes if you think like
you know if you were to kind of design twitter from scratch you know and just not really think
about the problems that they've run into or what you know about in terms of scale you'd probably
kind of design it like a content management system or a blog or something like that where you would
say okay i'm a user i go to the page and i go and i fetch some stuff out of the database and i show it to you right not hard it's
just kind of standard like web development stuff but twitter has the celebrity problem where uh
they actually have a whole blog just on taylor swift she released a new album that you know
is doing like phenomenally well and yeah's is really good. You should check it out.
I do.
I was just kidding. Of course, everyone's heard.
Hashtag Swifty.
Yeah.
They actually have some numbers on it.
Anyway, I didn't really get those numbers together.
Taylor Swift put out a new album. Everyone's talking about it. So many people follow her.
So whenever she makes a tweet, that's a lot of people that need to get that stuff. So Twitter kind of came up with a strategy they call fan-out, where basically when Taylor Swift tweets, they actually run out and go and update a bunch of feeds.
So that rather than going and kind of trying to cobble together these feeds from a database or something, the feeds are pre-generated, so you can have quick, real-time home timelines.
And that's part of Twitter's mission
is to get you data really quickly.
They want you to have low latency,
up-to-the-minute data so that your feed,
if you both follow Taylor Swift,
we want to see those tweets come in around the same time
and they want to keep a conversation flowing.
They want to keep it fast.
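A rough Python sketch of the two timeline strategies being contrasted here, fan-out-on-write for most users versus fan-out-on-read for celebrity accounts; the data structures and the follower threshold are illustrative assumptions, not Twitter's actual implementation.

```python
# Illustrative sketch of fan-out-on-write vs. fan-out-on-read.
# All names, thresholds, and storage choices here are assumptions for the example.
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000          # hypothetical cutoff

followers = defaultdict(set)             # user -> set of follower ids
home_timelines = defaultdict(list)       # follower id -> pre-materialized tweets
celebrity_tweets = defaultdict(list)     # celebrity id -> their recent tweets

def post_tweet(user, tweet):
    if len(followers[user]) < CELEBRITY_THRESHOLD:
        # Fan-out on write: push the tweet into every follower's timeline now.
        for f in followers[user]:
            home_timelines[f].append(tweet)
    else:
        # Celebrity: too many followers to push to; store once, merge on read.
        celebrity_tweets[user].append(tweet)

def read_timeline(user, following):
    # Fan-out on read: merge the pre-built timeline with any celebrity tweets.
    merged = list(home_timelines[user])
    for followed in following:
        merged.extend(celebrity_tweets.get(followed, []))
    return merged
```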
So I bring that up because just to kind of like frame the conversation here, right?
That we're talking about like, you know, you said 100 million people per day.
But those 100 million people aren't, it's not the same problem that's being solved for, for each of those people as they view or post to Twitter.
Right.
Totally.
So that's the background of this, of this, you know, the, the, the domain of the problem.
Well, so that's the domain of the overall problem.
What we're going to be talking about more is what they tried to do for their
internal customers, the people in marketing and accounting and all that kind of stuff on how
they could see trends and analytics of what's happening with the various different tweets and
stuff out there. So check this out. For every tweet and user action that somebody does, whether
you like something or quote something or whatever, it creates an event.
So very similar to like the Kafka stuff that we've talked about in the past, right? You generate an
event and then that's used by machine learning and it's used by employees for analytics. So every
single thing that's done generates some sort of event that happens that goes out into their data pools. And one of the things that they ran into
is they wanted to make it,
and they actually said they wanted to democratize
data analysis to the people within Twitter
so that they could go and query things
the way that makes sense for them, right?
Like your marketing team's going to care about things
differently than maybe a customer retention team does or an engineering team or whatever, right?
Like everybody has a slice or a view of the data they want to see, and they wanted to make it to
where they could easily go do it without getting engineering involved. So it was kind of similar.
I think we talked about, like, as it relates to Uber, right? Uber was doing something kind of similar to that, where they were, you know, you'd have like the one big data lake, but then they didn't mind having these smaller, uh, database offshoots from that, where, like, you know, one team might want, like, accounts receivable or billing or whatever might need a different set of data than Uber Eats might want, a different set of data related to, like, go-to-market, that kind of thing, right? Versus the real-time stuff. So that's the kind of stuff you're talking about in relation to democratizing the data, so that the different parts of Twitter could use it differently, right?
Very much, yeah.
So really what they talk about here
is they kind of do have this big data lake. They don't use that term as much in the article,
but they basically said that they had various different technologies that were used for data
analysis. They had scalding, which I'd never heard of before. But if you wanted to use that,
it required programming knowledge. Then they said another problem was having data spread across multiple systems without
a simple way to access it, which is similar to what Uber was talking about, right?
Like they were trying to get data into one big area so that everybody could access it.
Um, so what they were talking about to start off this particular thing was, Hey, they want
to move things into Google cloud, um, particularly because they wanted to use BigQuery.
Because if you're not familiar with BigQuery, I actually copied their kind of simple summary on their own page.
It says it's a cost-effective, serverless, multi-cloud enterprise data warehouse to power your data-driven innovation.
So if you're not familiar with a BigQuery, kind of basically what you do is you ingest data into this thing.
And they've already got, you know, basically massively scalable storage behind the scenes.
So as you ingest data into it, it indexes it into its own format that it knows behind the scenes, if I remember correctly.
And then it allows you to use regular SQL queries to process and spec like, you know, terabytes,
petabytes of data. So you could do all that with BigQuery without having to worry about
infrastructure and all that other kind of stuff, right? So that's what they were trying to move to.
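As a point of reference, here's a minimal sketch of running standard SQL against BigQuery from Python using the google-cloud-bigquery client; the project, dataset, and table names are made-up placeholders, not anything from Twitter's setup.

```python
# Minimal sketch of querying BigQuery with standard SQL from Python.
# Project, dataset, and table names are made-up placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT event_type, COUNT(*) AS event_count
    FROM `my-analytics-project.events.tweet_events`
    WHERE event_date = '2019-01-01'
    GROUP BY event_type
    ORDER BY event_count DESC
"""

# BigQuery handles the storage and scale-out; you just submit SQL.
for row in client.query(sql).result():
    print(row.event_type, row.event_count)
```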
And then another thing that I had never even heard of until I started reading this article, and I'm curious if you guys had, was this thing called Data Studio. So if you hit that link, well, actually not the link that's there, if you go to datastudio.google.com, you'll be on a page where you can automatically start trying to create reports based off data that you've already got. So Data Studio looks to be available to everybody. I don't know what you get charged for it, or if you get charged anything.
Wait, this is just going against my Google Drive?
Exactly. That's what I'm saying. You could actually point it at other data sources and stuff. So it's pretty interesting. I'd never heard of it, but they are using Data Studio in conjunction with BigQuery, so that BigQuery has the data they want, they can run SQL queries out of that thing, and then use Data Studio to create visualizations, reports, tell stories with the data that they already have and processed in BigQuery.
So Google, man, like, what the heck? Right. I mean, it's actually really cool.
And if you look, they've got like, they've got sample data sets here. They've got one that's, um, what did I just click on, a world population? Um, right now there's 7.1 billion people in the world.
No, we crossed eight. It was announced this week. It was announced this week that, that they believe that we have crossed 8 billion people now on the planet.
That's, that's absolutely crazy. Oh, this says as of 2013. So I guess they populated this with old data. Um, and maybe it's not even real. But, but yeah, I mean, this is, this is really interesting. Like I said, I didn't even know it existed. Like, they, they have so many things in the Google ecosphere that it's just, it's almost impossible to know them all.
So anyway, with that, what they were basically trying to do is make it so it'd be easier for managers, people that just know general SQL, or maybe some developers or whatever, to be able to access this data and,
and make it show what they needed to look at.
They just want to be able to slice and dice data,
like generate like a chart.
They want to be able to drive a chart or something off of it. Right.
So charts.
Yeah.
But one of my favorite things about Twitter is trending.
So you can sign on Twitter and say,
Hey,
what's trending now?
And it shows you like,
here's the things that you're interested in that's trending.
So maybe I'll see stuff about like music or whatever.
You can go to the general news.
It's just like entertainment.
They kind of break it down by category.
So like, you know, for example, Taylor Swift puts an album out.
It's probably going to be under news.
It'll be under entertainment.
It's going to be under my interests.
But it's not going to be under sports.
And they've got a whole big data wing that's always kind of working on figuring out what that is.
And you think about it, like how crazy is it that you can go and see that they kind of boil down all the tweets of the day and say, hey, election is trending.
Or they take all that noise, all that mess, all that stuff that people are talking about and say, hey, Taylor Swift's new album Midnight's came out.
Or hey, you know, Super Bowl or whatever.
They take the crazy stuff that people tweet, 140 characters,
and they figure out the subject, they figure out how to count it, and they do a great job of it.
Well, now it's more than 140, right?
I was going to say it's a little bit more than that now.
Yeah.
But still, yeah, it's absolutely insane.
The point is lost on me because you got the number wrong.
Right.
When you consider how many tweets go out in a minute across the world,
the fact that they're,
they're able to do that stuff for the trends is absolutely amazing.
So this is where we need to take a step back into history.
So they actually lay out what,
what they had,
what their strategies were starting back in 2011.
So this is actually the history of data warehousing at Twitter.
In 2011, they did data analysis with Vertica, and they were using Hadoop for their storage.
Data was ingested using Pig.
And if you know anything about Hadoop and the way that stuff had to come in, you had to use a MapReduce process.
And Pig sort of makes that easier for ingesting data.
So 2012, they went away from Pig and they picked up Scalding. And what they actually said in the article is that it uses Scala APIs that were geared towards creating complex pipelines. It said it was sort of easy to create these complex pipelines, and it was also easy to test.
So that,
that makes a whole lot of sense.
I mean,
I know the three of us have been in the,
uh,
in the real time streaming data world and it's not easy sometimes,
right?
No.
And Hey,
I got a little tip for you.
Uh,
don't try Googling scalding versus pig.
Really?
Don't do it.
Yeah.
Apparently,
uh,
it'll be a scalding pig.
Yeah, like, I didn't know that was a thing. It's all, it's a whole big thing. I think we need to have like some kind of a banner for, like, when you're going to announce something crazy like that, Jay-Z. Like, there should be, like, instead of a spoiler, there should be, like, a, you know, gross warning coming.
Yeah, don't do this.
Yeah, I... whoa, what, what's going on? What are we talking about? Yeah, right.
So one of the problems with scalding, though,
is it's difficult for people that just have, like,
general SQL skills to pick up.
Like, they said that the learning curve is pretty extreme on that.
So fast forward four years from 2012 to 2016,
they start using Presto, PrestoDB, to access Hadoop data using SQL. Now we've talked about
Presto on here and we'll get into some interesting things a little bit further into the notes here
about that. But if you haven't heard the past episodes, Presto kind of allows you to
pick any number of various different storage technologies. In this case, it was Hadoop,
right? That they're using here.
You can use it to pipe into JDBC databases.
You can use it to pipe into GCS or Google Storage, AWS S3 Storage, all kinds of stuff. So is Scalding, like this was made by Twitter, it looks like.
Oh, really?
Am I wrong?
I'm looking.
Hold.
Yeah, Twitter open source. Sure enough.
Okay.
Interesting.
So, you know, for those
who are working at Twitter, they're like, of course we already
know about Scalding. But for the rest of us,
they're like, oh.
Brand new stuff. And it's open sourced.
So if you want to use it. Yeah, I'm looking at it on
GitHub. That's why. And it came up
as like a Twitter GitHub account.
And I'm like, wait a minute.
My spidey sense is tingling.
Right?
They have a really cool icon logo for it.
It's an elephant blowing flames out of its trunk.
Because it's Hadoop, right?
Isn't the Hadoop logo an elephant?
It is.
It is.
Yes.
So along with using Presto to access Hadoop data using SQL, they were also using Spark for ad hoc data science and machine learning.
So now, two years later, 2018, they're using Scalding for their production pipeline.
So, you know, transforming data, pushing stuff around.
And they're using Scalding in conjunction with Spark
for ad hoc data science and machine learning.
So not a ton changed there.
What did is they now have Vertica and Presto
for ad hoc interactive SQL analysis.
And they introduced Druid
for interactive exploratory access to time series data.
Okay, there's so many technologies.
So Presto, if I remember right, was the one where there was like a Facebook derivative. One was called Presto and one was called PrestoDB, right?
So, no, the two: PrestoDB was the original one. Presto SQL was the one that somebody forked. That, that was super confusing, because you'd go search for something for Presto, and sometimes you'd land on Presto SQL, and sometimes you'd land on PrestoDB.
But yes, that was the one that was created by Facebook.
Yes, Facebook to query basically just about any storage technology.
I say just about any.
A lot of storage technology is out there using SQL language.
But what was... wasn't it, like, Presto and PrestoDB? 'Cause, like, I'm Googling, and I'm seeing, like, literally, Presto was developed by Facebook, but there's a prestodb.io. I don't think it was Presto SQL. Was it?
Yeah, it really was Presto SQL. It was, it was Trino, which is like a newer kind of evolution.
Yeah, it's been a minute since I looked at all this stuff.
Oh wait, now I see it on the Wikipedia page. It does say Presto, including PrestoDB and Presto SQL, which has been, which has been rebranded to Trino.
Okay, so that's, that's what they did, because people were probably getting annoyed, just like
I was back in the day when I was dealing with it.
Well,
yeah.
Cause I remember like we were looking at that for some reason and I don't
remember if like this was maybe a followup from like,
you know,
or like a fallout from like us looking at one of the Uber engineering blogs
or something.
Maybe that's how we got turned on to Presto.
I don't remember now it's been so long,
but like when you said it,
I had kind of had like a little,
Oh God,
a little PTSD kicked in there.
And I'm like,
what is that?
And really all it was is the Presto sequel.
They just forked Presto DB and then started going off and doing development
in their own direction,
which I guess is now Trino.
So,
but the cool part about Presto, if, if you haven't heard it before,
is like I said, it'll allow you to query basically
kind of any data source out there.
And that's cool.
But the actual magic that made it what it is
is you can join data across disparate data sources, right?
So if you have data stored up in GCS
and then you have some lookup data
stored in a Postgres database,
you can basically say,
hey, select everything from my GCS data source
and join it on my lookup information from Postgres SQL.
And it will do it in a distributed manner, to where it'll pull the data into its own processing nodes and join the data there, and then give you back the data set.
So you could basically write a SQL join against anything just about that you can connect to.
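As a rough illustration of that cross-source join idea, here's a small Python sketch using the Trino client (the rebranding of Presto SQL that comes up a bit later); the host, catalogs, schemas, and table names are made up, and the exact connection details would depend on your cluster.

```python
# Rough sketch of a cross-source join through Trino (formerly Presto SQL).
# The host, catalogs, schemas, and table names below are made-up examples.
from trino.dbapi import connect

conn = connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One table lives in object storage (e.g. a Hive catalog over GCS/S3),
# the other in a Postgres database -- Trino joins them in its own workers.
cur.execute("""
    SELECT u.country, COUNT(*) AS clicks
    FROM hive.web.click_events AS e
    JOIN postgresql.public.users AS u
      ON e.user_id = u.id
    GROUP BY u.country
    ORDER BY clicks DESC
""")

for country, clicks in cur.fetchall():
    print(country, clicks)
```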
Yeah.
And so there was also another one that Jay-Z and I remember we looked at, we did it as
part of like a, you know, come watch us stumble on a live stream.
I mean,
no,
learn with us.
I think that's what we call it.
I don't remember,
but we definitely stumbled,
but it was on Apache drill,
right?
So it was a similar kind of thing where like,
you know,
with,
with these technologies,
you could,
it could,
it didn't even have to be a database.
It could be like,
I have a CSV over here.
I have an Excel spreadsheet over here.
I have a SQL server database over there,
an Oracle database,
a press,
uh,
uh,
you know,
the GCS bucket,
whatever these different data sources were,
you could like set up these connectors to it and then magically query it.
And,
and I remember drill was pretty good about like determining the types to like
it would,
it would figure out the data types and be like,
nah,
we got this.
We know what the quote schema should be for this thing that isn't really a table.
Yeah. I remember you and I, Outlaw, were playing with drill quite a bit. And honestly,
it seemed like it was a little bit more impressive from that discovery phase that you're talking
about than Presto was,
but it just didn't have the, I guess, the Facebook backing.
Or Twitter.
Right.
Yeah, so it didn't have the same mind share.
But, yeah, I mean, an amazing, amazing tool,
and it's still used by a lot of stuff out there.
So Druid, if you've not heard of Druid,
if you're trying to analyze time series data,
that is a super powerful analytics platform.
I thought it was an OLAP database.
It is.
It's OLAP for time series data.
Oh,
OLAP specifically for time series.
Yes.
I didn't know about the time series aspect.
Yeah,
I didn't either.
When you look at the ingestion on a lot of that stuff,
you actually have to do it on a time series type basis.
Now there may be hacks around it,
but that's what it was designed for.
Um, there's, there's a lot of competing technologies out there now, like Pinot, um, Roundhouse... what's the, uh, clicky house, ClickHouse, something like that.
Well, Pinot would just be another, like, OLAP database, right? But it's not specific to time series.
Because when we talk about a time series database,
the one that's in our face these days would be Prometheus.
Prometheus, yeah.
Right?
I'm going to think of something like that.
Yeah.
Yeah, that's what it was originally designed for.
Like I said, they do allow for tons and tons of dimensions,
but it usually has to be sliced up by time.
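For a sense of what "sliced up by time" looks like in practice, here's a hedged sketch of a Druid native timeseries query issued over HTTP from Python; the broker address, datasource, and metric names are illustrative guesses, not anything from Twitter's setup.

```python
# Illustrative Druid native "timeseries" query; datasource, broker host,
# and metric names are made-up placeholders.
import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "tweet_events",
    "granularity": "hour",
    "intervals": ["2019-01-01/2019-01-02"],
    "aggregations": [
        {"type": "count", "name": "events"},
    ],
}

# Druid brokers accept native JSON queries on /druid/v2.
resp = requests.post(
    "http://druid-broker.example.com:8082/druid/v2",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
    timeout=30,
)

for bucket in resp.json():
    print(bucket["timestamp"], bucket["result"]["events"])
```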
So on top of that, they also used Tableau,
which if you haven't heard of that,
it's a very popular commercial piece of software out there that allows you to connect things and query them and visualize.
Dashboards.
Yep, dashboards.
Zeppelin and then Pivot for data visualization.
So I've never used Zeppelin or Pivot, so I don't know what those are like.
So, I listen to Zeppelin. I mean, it's been a while. I mean, they got some cool, you know, they're not new, but, yeah, I don't know if they're still touring. I think they were going to, but Taylor Swift kicked them off the platform to buy tickets. She was trending.
Yeah.
Oh, you know, uh, Taylor, Taylor Swift. I'm glad you mentioned her. Oh, uh, in the last 10 years, you know, she's averaged more than 75,000 tweets a day. Just about, sorry, about Taylor Swift.
I was going to say, how about the time? I was like, yeah, whoa, wait, what about the G program? I was like, wait a minute.
She puts out 75,000 tweets a day and does all these amazing albums and stuff?
Yeah.
I'm in there.
I'm not productive enough if I'm not accomplishing what she's accomplishing then.
Wait, you just said for the past 10 years, she averages 75,000 tweets a day, 365 days a year.
About her.
About her.
About her.
That's a total of 329 million tweets. Now, that doesn't count the 4 million tweets that she got in 24 hours, uh, when her album Midnights was released.
Golly, man.
Yeah. And, um, Twitter got a blog post out the next day with analytics about the tweets, and, like, how people were, uh, you know, what they were tweeting about, the top three songs that people were tweeting about in reference to the album. Like, all right, it's pretty amazing stuff.
Hear me out.
Hear me out.
You bet.
You're about to brag about our stats.
Aren't you?
Hear me out.
This is our challenge.
Dear listener.
We want to overtake Taylor Swift on Twitter.
So get on Twitter.
Social, you know, like, send your tweet, mention us with hashtag CodingBlocks or at CodingBlocks, whatever, whatever suits your fancy.
Either way, it's going to, like, Twitter will, you know, rank it all the same. They'll, they'll figure that out. They'll know that it means the same thing.
And, and let's see if we can't, if we can't take the top spot. I think we got it. I think we, I think we can do this.
I believe in us.
I believe in you. And I believe in us. We can, we can make this happen.
We're going to get ones of tweets.
Is that how you say that?
Yeah.
Yeah.
It's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a, the largest community on Twitter. Oh, really? Yeah. That's pretty interesting. It makes sense.
441 million unique followers.
And I'm not really clear on what they consider their communities,
but I think it has to do with basically interests that they figure out about you.
Like, for example, I mentioned that they somehow figured out the kinds of music
that I like, and they have put me in these various communities about that.
And so sometimes they kind of throw me tweets that they think I'm going to be interested in.
Oh, that's pretty cool.
Alright. Well, now
that we've been stomped into oblivion
with our ones of tweets.
Tens.
Alan, tens. We might get tens.
That'd be exciting if it happened.
So another thing about Taylor actually is
I'm kidding. You're just trying to
crush our souls now. Jay-Z's just
going to hit us up with Taylor Swift facts all night.
That's a Taylor Swift comment or fact.
That's right.
Oh, man.
All right.
So they were already doing all this stuff, right?
They had data flowing.
They had all these ways to get reports, Tableau, Zeppelin, Pivot, all that kind of stuff.
So why the change?
Well, their big thing is they wanted to simplify their analytical tools for their internal employees. That's really what it boiled down to. So that's where BigQuery came in. Now,
they did say there were challenges. And I mean, the three of us have worked on three different
cloud platforms at this point, right? Azure, AWS, and GCP.
And I guess even more.
Well, professionally. Not including, like, non-professional work, you know, play.
Yeah, right.
And I mean, you could even count in Linode and DigitalOcean and some other stuff, right?
Which I don't guess they're quite the same, but there's always challenges, right?
Like every page that you go to on a a cloud sites like use this service it's so
easy and you get in there you're like man there's nothing easy about this i don't know why it seems
like it should be so easy but it never did so they had challenges starting with i can't even
differentiate your icon from all the other services you have aw AWS. Oh, yeah, yeah.
I wasn't trying to call anyone out.
This is awkward now.
We had a whole episode about it,
or at least a big chunk of an episode about it.
So one of the things that they had to do
is they had to develop their own infrastructure
to reliably ingest data,
large amounts of data into BigQuery.
So that's worth calling out. Twitter
did basically everything on-prem. They didn't do cloud computing stuff, not massively, right?
And that's similar to Stack Overflow. We've talked about this in the past.
Like Stack Overflow even had a page up that showed their, the gist of their overall infrastructure and how things worked.
And the reason they said is they spent as much on their entire infrastructure as what one month
would cost them if they ran it in the cloud. Right. And I have to imagine that's the same
exact reason why Twitter does everything on prem. And then it started porting things to the cloud
that makes sense to make their lives easier.
Right.
So just wanted to call that out.
Man.
Talk about another one.
I just found,
I was trying to find the specific link that you were talking,
you were referring to where like stack overflow would show like,
Hey,
this is our SQL server.
You know,
there's a Redis cache in front of it.
Things like that.
Stack overflow also has a blog for all their fun engineering challenges.
Oh, that's excellent.
stackoverflow.blog/engineering.
Most excellent.
Yeah, we'll have to dig into that one too.
So while he's looking for that other link, some of the other things they had to worry about,
they had to support company-wide data management.
They needed to implement access controls, which I think you would, I would hope you know why, right?
Like you don't want me accessing private user data somewhere.
Or tweeting on behalf of people or being able to see direct messages or something like that.
Yeah, totally. Ensuring the customer privacy.
They needed to build sources or build systems for resource allocation, monitoring, chargeback.
So if you work for a large corporation, you're probably aware of how this works, right?
Let's say that you are in AWS or GCP.
Usually departments get charged for the things that they're using, right? Like it's not
the company as a whole, because they want to find out, Hey, is engineering blowing out the budget?
Or is it accounting over here? That's that's using so many of these resources that are costing us
money. So they actually charge it back to the various different areas. So they had to build
those systems to get that stuff in place. So in 2018, when we mentioned the Tableau, Zeppelin, all that kind
of stuff, in 2018, they rolled out their alpha release of this GCP infrastructure, the BigQuery,
Data Studio, all that kind of stuff. And what they did, and it's kind of interesting,
this is actually a really good way to approach things. Just from a product management software development mindset, they basically put out the most frequently used tables that people would be interested in. So they didn't try and put everything online at once, right? They said, hey, this is what I know that most of our internal customers are going to use. And they went with that. In that group,
they had over 250 users internal in the company from engineering,
finance,
marketing,
and sometime, they didn't have the date in here, but they said it was near the time that they wrote the, or that, the blog post was live, they had a month where they had 8,000 queries that processed over a hundred petabytes of data, not including scheduled reports.
So these were ad hoc queries that were run. And so people ended up loving it, right? Like they
saw that the people were using it,
8,000 queries with over 100 petabytes of data process.
Like that's a lot of usage.
And so with that,
they proved out that people did want to use the platform.
And so from that point, they decided,
hey, okay, let's push forward with this.
8,000 queries though in total.
I mean, like that sounds low, right?
For 250 users.
Yeah.
This is not, like, customer facing.
This is like people actually like running queries at work, you know?
Right.
This is me trying to find out the trends or whatever.
Yeah.
Yeah.
So.
God, I'm such a jerk.
That's pretty good.
So check this out.
They also, I have a link in the, in the show notes for this,
but they have a really nice diagram, a very simple diagram of kind of what the data flow was
getting from their on-prem into BigQuery. And, and I'll summarize it here, but I highly recommend
going and taking a look at the picture because it'll give you a little bit more detail.
So basically what they did is they pushed data into GCS, or Google Cloud Storage if you didn't know that particular acronym, from their on-premise Hadoop data clusters. And once they pushed it up to GCS, they then used Airflow, I think it's Apache Airflow, if I remember right, to move that data from GCS into BigQuery.
And then once it was in BigQuery,
that's where they would use Data Studio so that all the end users could actually go create reports
that they'd want to look at, right?
Like things that they'd want to pull up later.
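A hedged sketch of what a GCS-to-BigQuery step like that could look like as an Airflow DAG; the bucket, dataset, table names, and schedule are made-up placeholders, the operator import path assumes a recent apache-airflow-providers-google package, and the real pipeline Twitter built was certainly more involved.

```python
# Hedged sketch of an Airflow DAG that loads files landed in GCS into BigQuery.
# Bucket, dataset, table, and schedule are placeholders, not Twitter's values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_events",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",   # name may vary slightly by Airflow version
    catchup=False,
) as dag:
    load_events = GCSToBigQueryOperator(
        task_id="load_tweet_events",
        bucket="example-onprem-export-bucket",
        source_objects=["events/{{ ds }}/*.avro"],
        source_format="AVRO",
        destination_project_dataset_table="my-analytics-project.events.tweet_events",
        write_disposition="WRITE_APPEND",
    )
```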
What the heck does Apache Airflow do?
Airflow is a platform created by the community to programmatically, yeah, some word, author, schedule, and monitor workflows.
What?
It's amazing. Apache has so many projects around both OLAP and just, like, streaming-type DAG stuff. Directed acyclic graph, something very directional.
It's Python. It's Python-based. It's really popular in GCP.
Open Source Workflow Management Platform
for Data Engineering Pipelines.
Anyone with Python knowledge can develop a workflow.
Airflow does not limit the scope of your pipeline,
so you can use it to build machine learning models, transfer data, blah, blah, blah.
Yeah. So I'm sure that was hyper complex, right? But that's why I said the, the diagram's worth looking at, just to see sort of the gist of what they were doing. And I'm sure there was a ton of work that happened to make all this, you know, really, really go live.
I found that, um, link, by the way, for the, uh, Stack Overflow, uh, infrastructure. They're still, like, not on, like, a cloud-based solution. Nine web servers, four SQL Servers, two Redis servers. But I mean, with that, they're putting out 55 terabytes of data a month. I mean, you know, I'm sure Twitter would be like, whatever, right? Our 240 characters is way more than that. But, um, I mean, it's still super impressive what Stack Overflow is doing, um, without it. So easy.
the point is is like i'll have a link in the, in the resources we like,
there'll be a link for the stack overflow one.
And the language they use.
Oh,
that their stuff's written in? Stack, you mean?
Yeah.
Are you,
are you trying to pick on my boy C sharp?
No,
I'm actually excited about it.
I love that.
Yeah.
No,
C#, ASP.NET MVC.
And their homepage loads in 12.2 milliseconds.
And their questions pages load in 18.3 milliseconds.
That's amazing.
All right.
So I think it's Jay-Z's turn.
Oh, God.
Jay-Z.
All right.
Tell you what.
I'll give you a discount.
If you give us a four-star review, we're going to treat it like a five.
Why?
We'll give you the full five-star thank you for four-star and up recommendations.
Because we love it.
It helps us out.
And, yeah, it's really good news for the show.
That's how podcasts grow.
It keeps us going.
It's the, I don't know, steam in our turbines or
however cars work. I don't know. Listen,
I'm telling you guys, if you all
go out to Twitter and you tweet about
Coding Blocks, I'm sure that we'll get that
many more listeners and subscribers
and we can grow the show and
it'll be better. And so
that's how it works, right? Am I doing this
right? I think so.
I think I got it. I think the quality has increased with every person, every new person
that's listened to the show. Yeah. Right. That's definitely happened. And when we used to all
record in person, I could have reached over and hit the mute button on Jay-Z's mic before he said
that nonsense, but I can't virtually do that.
Yeah, so now we have to deal with this.
Yes. Yeah. I, I know when I go look at a podcast, and, uh, you know, I'm looking for something new to listen to, that, you know, as long as it's four stars and up, you know, it's good. That's all we're asking of you. All of us are asking, four-star review.
You know, you were talking about the Taylor Swift facts and, you know, going, going, listening to her music and all that.
So I went to a Foo Fighters concert recently.
It was Everlong.
So long.
All right.
Well, it's time for my favorite portion of the show.
Survey says.
All right.
So let's play a little bit of feud here
This is, what, episode 198? So, Jay-Z, guess what, you get to go first this time. So, uh, let me see. I, myself... We asked a hundred people: name qualities of a bad boss. And you just name, just name one. I'll see who gets... uh...
I mean, okay, this is tough. This is a hard one. I,
uh,
I mean,
micromanaging is all that they think of.
I just keep coming back to that.
Okay.
Mean.
That's the,
I was thinking that too.
Okay.
So micromanager was the number one answer with 29 respondents. So that's 29 points on the board for Joe.
Alan, angry was the number three answer at 20.
I'll take it. I'll take it. That's not a bad showing.
That's, that's not a bad showing. What was number two?
Uh, okay, we'll run down the list. Micromanager; incompetent, number two, with 24; okay, irresponsible, 14; and oblivious, number... 13, or, I'm sorry, 13, uh, respondents.
All right. So my next question, for you and Alan,
you'll go first this time.
All right.
Name a bad job for someone who's afraid of heights.
Like how you put your answer.
You like that?
Yeah.
You already put your score in.
That's cheating.
You didn't get that.
A pilot.
A pilot.
Good answer.
Dang.
I'm trying to think what the electricians, you know, for the power company, I don't know what you call them.
Oh, like they're called pole climbers?
Is it linemen or something?
Linemen, yeah.
I think that's right.
Yeah.
Okay.
Pilot was the number two answer with 37 respondents.
Hey,
I was close on my number.
Yeah.
I'll,
I'll consider linemen as construction worker.
I think that's fair.
Yeah.
Number three answer 16 points.
All right.
Alan takes the lead.
Yeah.
This is getting interesting.
I got to, like, use a formula now, like, start... we're getting into big numbers. Let's see, put that there. All right, so, uh, for the win, last question here. Let me find it, where did I put it? Here we go. All right, for the last one. Jay-Z, this is your chance, you go first.
name a type of
building where it always
seems to be cold
always seems to be cold
mhm
also tough
I think I've
got an answer I'm thinking if I can
get a better one.
I feel like I should buzz in.
I should be allowed to buzz in.
Before you buzz in, I'm going to go ahead and say doctor's office.
Okay.
Okay.
Meat packing facility.
Okay.
I mean, a meat packing facility doesn't seem to be cold.
It is cold. Right, right. That's what I'm saying. Oh, I see where you're coming with it doesn't seem to be cold. It is cold.
Right, right. That's what I'm saying.
Oh, I see where you're going with it. You're like being logical.
It's cold.
If your job's to work in a freezer, guess what? It's going to seem cold.
It's going to be cold, yeah.
Well, so Jay-Z doctor's office was the number one answer.
Oh, man. I just got destroyed.
And Alan, what was it, a meat packing facility? Was not on the list at all.
Dang it.
Nothing even remotely close, so you get a big fat zero for that. Now, the score leading up to this was Alan in the lead, 57 to 45, but Jay-Z's number one there gives him a commanding lead, because he just walked away with 44 points on that one, for a final score of 89 to 57. Jay-Z takes the win.
Yeah, I couldn't even, I couldn't even have caught up with the second. What was the second?
Pretty sure
that means that Jay-Z has to buy the next round.
Is that what we were playing for?
Yeah, I think so.
Doctor's office, number one, 44.
Work was number two at 19.
Yeah, I'd have lost.
I guess maybe I should have considered your meat packing thing.
Yeah, that's work.
Work.
Yeah.
But all of these are somebody's work.
A doctor's office is somebody's work.
Yeah. The next, next one, classroom, 14. And lastly, the DMV, number four. Four people said that.
That's amazing.
Yeah, they spent a lot of time in the DMV.
A long time, right? These are the people with DUIs.
There are a lot more.
Yeah, you know, a DMV is probably a U.S. thing, huh?
Oh, yeah.
I don't even know what it stands for.
Something motor vehicles, right?
Department of Motor Vehicles.
Department of Motor Vehicles, yeah.
You go there for your license.
I don't even know what else.
Maybe passports.
I don't know.
Your tag.
Yeah, it's usually licenses for people that are, that were drinking and driving.
I've seen, I've seen that in there before.
No,
I swear.
I promise you.
That's why,
that's why I made the drinking reference.
There was somebody in there who,
the last time I went,
he was driven there by somebody.
It was like his third time having to come get his license back.
That's not scary.
The only time I've ever gone is to get the license plate, like your tag renewals.
Mine's not at the DMV.
I always had to go to the courthouse for that.
Oh, wait.
I guess that's the tag office I'm thinking of.
Yeah, the tag office.
Yeah.
DMV was only renewing your license.
Well, I guess we don't have a DMV in this state then.
Yeah, we do.
No, we don't have a DMV in this state then. Yeah, we do. No, we don't have a DMV.
There's no Georgia DMV.
Totally.
I think it's actually dmv.ga.org or something.
Georgia DMV.
I'm going to the Googles.
Yeah, we don't.
It's the Department of Driver Services.
Ah, Driver Services.
It's different.
Technicality.
It's different.
It is. It is different. Technicality is different. It is.
It is different.
Florida doesn't have one either.
Florida doesn't? DMV is like
a California or New York thing,
right? Like, that's where...
Yeah, here they call it the Florida Department
of Highway Safety and Motor Vehicles.
That's way too much. Yeah. Yeah, it's DMV.
DMV. That's what it is.
Wow.
Excellent song, by the way.
All right.
So I'm going to share a little secret if I can.
May I?
A little tip.
This is going to be like an early tip of the week.
So I recently had to change my password, right?
And it said that it required it to be eight characters long.
So I picked Snow White and the Seven Dwarfs.
That's a free tip right there.
It's pretty good. I like it. Don't, don't reuse that, though. Somebody will know it now.
All right. So, oh wait, I said seven characters? I meant eight characters.
You said eight characters.
Okay. You said it. You said Snow White and the Seven Dwarfs. Yeah.
It added up. It added up. Oh, okay.
All right. So getting it back into what Google was doing, they were shooting for.
Yeah, Twitter.
Oh, yeah, Twitter.
Twitter with Google.
Sorry, Twitter with Google services and whatnot.
So they were shooting for ease of use.
One of the big things was BigQuery was easy to use because it didn't
require anybody to install anything. They could navigate it all through the web UI, right? Like
they just log into their Google account and life was good. There were a few things that people had
to onboard with. I mean, I know the three of us, when we first started with GCP, you have to learn
about processes, resources, tagging, that kind of stuff. And so they actually created some internal educational materials to get people sort of up to speed on that.
And then after that, people were kind of up and running.
So that's really nice.
Now, they did look at various different things.
And this goes back to the airflow thing.
And this is why I wanted to at least note it earlier.
So loading data into BigQuery, right?
So we already said that they were using airflow, right?
They looked at several things.
So Google has a thing called Google Cloud Composer.
And basically what that is, is a managed airflow.
So airflow being an Apache project, you can set it up and run it on your own VMs or
whatever, right? Like that's on you. And that means you're managing infrastructure, which
you're trying usually to get away from when you're doing Google or cloud services in general.
Cloud Composer was supposed to do that for you, but they couldn't use it because they needed to
use what they referred to as domain restricted sharing.
And that basically meant that only if you're logged in as Twitter, can you see some of this stuff?
And it didn't offer that.
So they couldn't use it.
They tried to use Google Data Transfer Service, DTS.
It wasn't flexible enough to have data pipelines that had dependency.
So I think what they meant here is,
say you have a data pipeline that kicks off and runs something.
Hey, when that thing's done, trigger another one to run.
Hey, when that one's done, trigger another one, right?
Or wait for certain things to be ready before you can do.
So I think that's what they were talking about there,
and it just wasn't flexible enough.
And so that's why they ended up using Apache Airflow.
And again, they had to set that thing up,
get it running on their own, configure it all, all that kind of garbage. And then they were able to set up the services that they needed. Once they had data in BigQuery, and this is kind
of interesting, this reminds me sort of of what we were talking about with the Uber blog back in the
day, is once they got it into BigQuery, let's say that they needed
to transform some of that data. Well, they would basically create jobs that would use regular SQL
queries to do those data transformations, right? So they load the data all in there. They need to
polish it up. All right, run a job, have a SQL query, batch it out and put it into another data
set. That's for simple stuff. For the more complex things,
then they would go back to Airflow again
or use Cloud Composer with Cloud Dataflow.
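Backing up to the simpler case for a second, here's a hedged sketch of what an in-BigQuery SQL transformation job like that might look like from Python; again, the project, dataset, and table names are placeholders, not Twitter's.

```python
# Hedged sketch of a batch transformation done entirely inside BigQuery:
# read from a raw table, clean it up with SQL, write to another dataset.
# Project, dataset, and table names are made-up placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

destination = bigquery.TableReference.from_string(
    "my-analytics-project.curated.daily_engagement"
)
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",   # replace the previous output
)

sql = """
    SELECT user_id,
           DATE(event_timestamp) AS event_date,
           COUNTIF(event_type = 'like')    AS likes,
           COUNTIF(event_type = 'retweet') AS retweets
    FROM `my-analytics-project.events.tweet_events`
    WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY user_id, event_date
"""

client.query(sql, job_config=job_config).result()   # wait for the batch job
```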
And if I remember right,
Dataflow, we looked at that at one point
and that allows you to do things like,
it wasn't Flink.
What was underneath Dataflow?
Was it using Flink?
That doesn't sound familiar to me.
Didn't we look at that back in the day, Jay-Z?
Yeah, I'm trying to remember.
It was one of the streaming ones.
It's not Flink.
It was Spark.
But I thought there was some Flink extension or something you could do.
It might be.
Yeah, I can't remember.
But it basically allowed you to do data streaming type things in a managed pipeline that you didn't really have to mess with.
You write the code, and it would run it for you.
Man, it wasn't Flink.
I cannot read.
There was a language behind it.
Oh, but you're not talking about Dataproc.
Beam, Apache Beam.
That's what it was.
You could write your things in Apache Beam, put that into Dataflow,
and then that would run and do your data streaming.
Yeah, I'm all confused now.
Dataflow and Dataproc, where are they thinking?
Come on.
Yeah, man.
Naming's hard, right?
Yeah, that's true.
Even for infrastructure and services.
Yeah, I'm not good at it either.
I shouldn't be throwing stones.
No, I've got some badly named variables all over the place.
All right, so the next one up, performance.
Like, hey, if you're going to try and get a bunch of people to buy into your platform, it probably needs to work well.
So this is a big one.
And I know that Outlaw and I, when we were first looking at this kind of stuff, you have to know the lines and the boundaries for the
different technologies you're using. BigQuery is not for low latency, high throughput queries,
or for low latency time series type analysis, meaning you can't put a petabyte of data in it,
run a query and expect it to come back in sub second times. It's not how BigQuery works. It's not built for that. It is for being able to
run SQL queries that can process over huge amounts of data. And we already said earlier,
I think they ran so many queries over 100 petabytes of data, right? And their goal was they wanted their BigQuery queries to return results within one minute.
So it's pretty interesting.
They went about this by basically allowing their customers... Well, first, and these are kind of backwards, even, even in the paragraph up there that they had, they had their engineering team analyze 800-plus queries,
each processing around a terabyte of data each to sort of see what the times were going to be
when they came back. And then using that information, they actually allowed their
internal customers to reserve a number of slots. A slot, in GCP terms, is a unit of computational capacity to execute a query. So here's the interesting thing. What they did is, when you're
running on cloud services, and I'd imagine they're all sort of the same in this regard. I mean,
you guys correct me if I'm wrong here, but there's spot pricing that says, Hey, I just want to pay for what I use. And then there's fixed flat pricing, right? Like, Hey, I'm going to pay for X number of slots every month, right? Like, um, just set me aside a hundred slots and I'm going to pay a flat price for that. As opposed to,
you know, Hey, if I write a thousand queries and they hit these things and it could use,
you know, I don't know, 2000 slots or whatever. So they went with this fixed price thing and then
they were able to see, Hey, how many slots do I need to use for a particular query to get it to return in less than a minute?
Right.
And then they use that out.
And then different teams within the organization, within Twitter organization could say, hey, reserve me this number of slots, which would then get billed back to their department.
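As a small aside, if you wanted to do the kind of per-query sizing analysis described above, BigQuery's dry-run mode is one way to estimate how much data a query would scan before committing slots to it; this sketch is purely illustrative, not how Twitter's team actually did their analysis.

```python
# Hedged sketch: estimate how much data a query would process (dry run)
# before actually spending slots on it. Table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
    SELECT event_type, COUNT(*) AS n
    FROM `my-analytics-project.events.tweet_events`
    GROUP BY event_type
"""

job = client.query(sql, job_config=job_config)   # returns immediately, runs nothing
print(f"Would process {job.total_bytes_processed / 1e12:.2f} TB")
```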
But they could run their queries and submit at times. And I'm pretty sure most services have that type of
feature, right? The reserved, like I know AWS, Google Cloud, Azure, all of them. If you reserve
VMs, you pay a lower price for it because you're guaranteeing those cloud services that, Hey, you're going to buy this much per month, right? Whereas if you're doing spot pricing, you might use it way less, but
they're going to jack the price up on you because they want to get their money for that price for
the time that you're using it. I thought the reserved pricing was way more expensive for a V
like going back to your VM example. Not usually. If you say that you're going to reserve it for like a month,
if you say that, and it's not even a month,
I want to say with a lot of those cloud services,
if you say that you're going to use it for a year,
it's usually way cheaper.
Yeah, you're right.
Just looking at, like, that thing on the AWS blurb: EC2 Reserved Instances provide a significant discount, up to 72%, compared to On-Demand pricing. Right, because they know that if you're doing on-demand pricing, probably you're going to try and use an hour a day, right? Like, I'm just, you know, throwing a number out there. Um, but if you're going to reserve something, you're going to have that thing for 24 hours.
So you're kind of guaranteeing the money as opposed to it's almost it's the inverse problem of what Twitter was trying to solve.
Right. Like they didn't want to do the spot pricing because they didn't want the fluctuation in the bill.
So they wanted to do the reserve pricing so that they could at least plan for their budgets.
And I think it's the reverse problem for AWS and Azure and all them, right? Like, hey, if you'll tell me that you're going to use this, then at least I know I got money coming in, you know? So it's kind of a push and pull in that regard.
I was thinking, though, like what a weird time we live in, where, like, you know, it's not enough to just be able to query the data. Instead, you've got to think, like, hey, how many of these slots do you need for that query? And you're like, a slot? Like a what? You mean a CPU? No, a slot. I said what I meant. Answer my question. Don't go making up questions.
Yeah, man. And I'm sure that's super complex, right? Because you don't know how much data you're querying necessarily, and how many CPUs you're going to need, and what RAM, because you don't want to have to think about that, right? So they had to come up with a new term.
So yeah. All right, so data governance. Now this is interesting. This is really important, right? Like every company should care about this. Every developer should care about this.
Emphasis on the should. You don't have to, but you probably should.
You probably should. So Twitter was focused on discoverability, access control, security, and privacy. So, data and discovery management. They, and this is really cool in my opinion, um, they extended their DAL, their data access layer, to work with both their on-prem and their GCP data. So that enabled users to use a single API to query all their sets of data, right? So just imagine, hey, I want to, um, pull a list of, you know, users that use this feature, or the count of users that use this feature.
It can go across everything.
Their on-prem Hadoop data sets and their GCP stuff.
Like that's really cool.
This goes back to our Presto and Drill conversation from earlier.
Yep.
Yep.
Next.
I wonder, like, I'm sorry, but I wonder, like, how complex that was, though. Like, was it just, like, you know, you gave an example of, like, you know, some users, for example. Like, was that DAL limited to, like, the use case? Like, okay, you want users specifically? Yes, I know how to go and get that out of, uh, the data that we have in GCP, and I know how to go and get that out of the on-prem stuff. But if you want, like, more ad hoc kind of things, like maybe it's like, whoa, whoa, whoa, whoa.
Yeah, I don't, I mean, it'd be interesting to know what their implementation behind the scenes was exactly, if they had some, like, sort of GraphQL thing where people would just do some willy-nilly query. That'd be insane.
The reason why, the reason why I question it, though, is because, like, as you were describing it, I immediately thought, like, wait a minute, did they recreate Presto or Drill? Like, why would you recreate something that you were, A, already using, and B, maybe this should have been, ah, it already exists, right?
Well, that's why, that was, like, you know, what immediately came to mind. So check it out. Like, I didn't put this in the notes, but they actually, and we'll get to it in a second in terms of just the bullet point, but they have this thing where it would register data sets, right? Um, and I'm just going to read this bit here because maybe it'll make more sense.
We use scheduled jobs to enumerate BigQuery data sets and register them with the data access layer.
So that's part of it, which is Twitter's metadata store. Users will annotate data sets with privacy information and also specify retention for scrubbing. We are evaluating the performance and cost of two, oh, well, that stuff didn't matter. So that registering thing, right? Like they had something that would automatically push those data sets down into the DAL, into their metadata store. So maybe it was good enough to be able to live query these different things for you, assuming that they push the right metadata down there for their software.
That sounds kind of awesome. I mean, like, talk about, like, uh, the service discoverability type of pattern, right? Like now your data is, like, saying, hey, I'm available for you to query, and I'm going to, like, let you know. Isn't that awesome?
Yeah, that was pretty cool. Yeah.
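Just to make that a little more concrete, a scheduled job like the one they describe could look something like this. It's only a sketch: the metadata-store endpoint and record shape are made up, since the blog post doesn't show the actual DAL API, and only the BigQuery calls are real library calls:

```python
# Rough sketch of a scheduled "register new data sets" job, loosely modeled on
# the blog's description. The metadata-store/DAL endpoint is entirely hypothetical.
from google.cloud import bigquery
import requests

DAL_ENDPOINT = "https://metadata.example.internal/datasets"  # made-up URL

def register_bigquery_datasets():
    client = bigquery.Client()
    for ds in client.list_datasets():               # enumerate all data sets in the project
        dataset = client.get_dataset(ds.reference)  # fetch full metadata
        record = {
            "name": dataset.dataset_id,
            "project": dataset.project,
            "labels": dict(dataset.labels or {}),   # where privacy annotations might live
            "location": dataset.location,
        }
        # Push (or upsert) the record into the internal metadata store / DAL.
        requests.post(DAL_ENDPOINT, json=record, timeout=10)

if __name__ == "__main__":
    register_bigquery_datasets()
```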
So, um, the other things that they did: they had to control access to the data, which makes total sense.
They needed to use the domain restricted sharing so that only people that were logged in with a Twitter account would have access to it.
Right. They needed to make sure that data didn't leak out somewhere.
They used VPC service controls. So that basically prevented data exfiltration.
And it also allowed them to lock down from what known IP ranges people could come in.
So, excuse me, like if your company has a VPN, like Palo Alto is a popular one, right?
If you log in, you're probably on a known set of IP ranges.
And so by doing that, you're only going to have access to that VPC. If you can get in there,
um, the triple-A: authentication, authorization, and auditing. For authentication, they used GCP user accounts. Pretty simple. Makes sense, right? And that was for ad hoc queries. For anything that
was like a production load that maybe ran on a schedule or something like that, then they use
Google service accounts. Pretty common in a cloud type environment. Authorization. This was pretty
interesting. I don't think I'd ever thought about it because I haven't been that deep down into
like BigQuery. But each data set had an owner service
account. And then every one of those data sets also had an individual reader group, right? So
if you needed access to a particular data set, assuming it was something that was highly
sensitive, then you'd have to be added into that particular reader group to even be able to see
that data set. So it's kind of a nice way of making it to where you don't have to write a bunch of complex logic to be able to
access those data sets. You're either in the group to read it or you're not. So that's pretty neat.
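In BigQuery terms, that per-data-set reader group is just an access entry on the data set. Something along these lines would grant a group read access; the group email and data set name are placeholders for illustration, not anything from the article:

```python
# Sketch: add a "reader group" to a single BigQuery data set's access list.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-analytics-project.sensitive_events")  # placeholder

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="sensitive-events-readers@example.com",  # the actively monitored group
    )
)
dataset.access_entries = entries

# Only the access_entries field is updated; everything else is left alone.
client.update_dataset(dataset, ["access_entries"])
```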
And then auditing, this is them kind of eating their own dog food. So what they would do is
anytime a BigQuery query ran, they would take the Stackdriver logs from that execution,
which had a bunch of detailed information in it,
feed it back into a BigQuery data set so they could analyze it later if they
needed to.
All right.
Well, that gets kind of circular.
Yeah, a little bit.
As I said, eating the dog food, eating your own dog food.
Your Stackdriver queries in BigQuery become excessive,
and then those logs make it back into BigQuery.
I would imagine they filter those at some point.
Oh, yeah, that would make sense.
Well, maybe not.
This is what happens when you let me get in charge of engineering.
Okay, listen.
Query everything.
Right.
Do it all.
Multiple times.
Oh, man.
So ensuring proper handling of private data.
So this was pretty interesting.
So this is why they say they registered all the data sets.
So if you had a new data set that was generated up there, it would auto-register with the DAL.
And then that way, any access to that data set was going through the DAL, is what I'd imagine is happening.
They didn't call it out directly, but that would make sense.
They would annotate private data.
Right. So if if you had a column in there, there was like a first name or something and they would say, hey, this is private.
They used proper retention.
This is a big one. If you've heard anything about GDPR and all that kind of stuff, like data privacy concerns and all that, you have to be very explicit about how long you're
going to keep this data around. And you're also supposed to say how you're going to use it.
So I guess, okay, well, finish out this section.
All right. So this last one here is making sure that they scrub and remove any data that a user deletes.
So if you go in there and you delete something off your Twitter feed or if you deleted a tweet, then they needed to make sure that they also deleted it up in their data storage.
All right.
So they had this data governance, the AAAs, authentication, authorization, auditing, ensuring the handling of private data, blah, blah, blah. Like, you know, think of it as, like, a checklist. Like, can we? Yep, we did that, we did that, we did that, we did that, we did that. And yet, remember that 18-year-old that hacked in and, like, said, send all your Bitcoin to here? Do you remember that hack? I do not. It was like last year?
Yeah, it was last year that the teen, Florida teen.
So, you know, I mean, Jay-Z, we're looking at you.
We're looking at you.
But yeah, he took control of some well-known accounts and used them to solicit Bitcoin.
Oh, yeah.
Like celebrities, right?
Yeah.
So like one of them was Apple.
And he said, hey, we're giving back.
We support Bitcoin and believe you should too.
And if you send Bitcoin, all Bitcoin sent to the address below will be sent back doubled.
You never heard about this?
No. But did he actually take over user accounts? Because that's definitely different than hacking into their data warehouse, right?
Yeah. Okay, that's fair enough, fair enough. So yeah, so the data warehouse they locked down for, like, analytics, but, like, live account stuff, they're like, I don't care. Privacy? Like, whatever. They probably used that Snow White and the Seven Dwarfs password. That's what, that's what got them. Wait, that's my guess. Don't give up on the password, man. I told you that in confidence. Yeah, well, my bad.
All right, so now they actually do have different categories for their data sets, which, this makes sense too. Um, these are all good things to sort of keep in mind when you're doing stuff like this, especially dealing with user private data. So highly sensitive data sets were
available as needed with the least privilege. Um, and so this is the one where they had individual
reader groups that were actively monitored. So if you needed access to some data that had
sensitive data in it, you had to be added to a very specific group and they knew everything that you
were doing there and they were watching it actively, right?
Like it wasn't some passive query that was going to happen later.
What would be the highly sensitive data sets there?
What are we talking about here?
What, what classifies as a highly sensitive data set?
So they didn't say, but my guess is it would be things like first name, last name, right? Like if it's Taylor Swift, um, so it's not the contents of the message necessarily.
Cause I was questioning, like, are we talking about the, like, the, you know, the DMs that are, like, you know, person to person, not the public tweets, right? It's public. Well, what about, um, so, uh, you know, they probably have information on, like, I don't know, um, maybe it's for, like, the verified users, like their contacts or phone numbers or whatnot, you know. Maybe that would be considered highly sensitive.
Well, I was thinking, um, yeah, I don't know, groups of people. Like if maybe they're working with the federal government on tracking down a cell of potential terrorists or something, then they don't want, uh, you know, people to figure out that they know, who, you know, they're trying to hide the information that they know about those people because they're working with the government or something. You know, something like that.
Are you speaking from experience, Jay-Z?
Yeah, I mean, you know, I'm just, I got my tinfoil hat on, you know, so they can't see me. Right.
It's good that you know that.
Well, also keep in mind, a lot of these sounded like they were event type data, right?
So it might have been that, you know, Alan Underwood clicked the heart on this one.
And so, you know, the message that I clicked the heart on, you know that I clicked the heart.
And then you know my name's Alan.
So the medium sensitivity ones, they anonymize things. So you still have user-ish type stuff in there, but it's hashed. And they actually said it's a one-way hash. So, so if you're trying to get things down to a user level, like, you know, how many individual people did this particular action, they could do it, but they couldn't actually see who it was, right? So anonymized. That sounds similar to what I think IMDB did years ago with some, or Netflix did, with some sort of contest or something.
Low sensitivity, all regular user information is removed, so you won't be able to get, like, granular level type stuff. And then public sensitivity, anybody can get it.
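A minimal sketch of what that kind of one-way hashing can look like, just to illustrate the idea; the salt handling and field names here are assumptions, not Twitter's actual scheme:

```python
# Sketch: anonymize a user identifier with a keyed one-way hash so you can
# still count distinct users without being able to recover who they were.
import hashlib
import hmac

# In practice the key/salt would live in a secrets manager, not in code.
PEPPER = b"not-a-real-secret"

def anonymize_user_id(user_id: str) -> str:
    # HMAC-SHA256 is one-way: same input -> same token, but no way back.
    return hmac.new(PEPPER, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

events = [
    {"user_id": "12345", "action": "like"},
    {"user_id": "12345", "action": "retweet"},
    {"user_id": "67890", "action": "like"},
]

scrubbed = [{**e, "user_id": anonymize_user_id(e["user_id"])} for e in events]
distinct_users = len({e["user_id"] for e in scrubbed})
print(distinct_users)  # 2 -- the counts still work, the identities don't
```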
So, and then that was where that paragraph that I started reading earlier comes in: anytime a new data set is added, then they have a scheduled task to go auto-register these things with the DAL.
Okay.
All right. Yeah. I got a question, but I want to reserve it until we get past this next section. Okay. I think this, this last section here is on cost
and this is really interesting. So what they said is when they started moving up to BigQuery,
remember they already had PrestoDB in play.
And they said that the cost was roughly the same for querying PrestoDB versus BigQuery.
Now, the important thing here is it's for querying.
And PrestoDB, keep in mind, they were managing all that infrastructure on-prem.
BigQuery is all managed for you, right?
It's just a service you use.
They said that there were additional costs associated with storing data in GCS and BigQuery.
And that was something that always kind of bugged me a little bit too,
is a lot of times you'd have to put the data in GCS. And then when you ingest that into BigQuery,
BigQuery is also storing it again in its own engine. So you're kind of getting double hit with that. So there were additional costs on top of that.
But for a lot of people that want to use BigQuery, that's probably worth it because you're looking for that processing power that you're not going to be able to do without setting up a bunch of infrastructure yourself.
We already talked about they use flat rate pricing so it didn't fluctuate. And there was one very interesting situation that I find extremely curious, and I'd love to know more about it, but they just put a line in here.
In some situations when they're querying tens of petabytes of data, it was more cost effective to
use PrestoDB than to use the GCS storage and BigQuery.
Yeah, I wonder what was different about that.
I know why.
Yeah, the only thing I can think is that what they call it slot,
that slot cost was probably crazy for having to pour through petabytes of data.
That's like it probably just took so long and used up so many of those storage slot or
those slot cost units that it just had to be crazy.
Whereas you sort of have a fixed cost of Presto if you're managing a cluster
on-prem,
right?
That's the only thing that makes sense to me.
Yeah.
And all of this was in relation to moving an exabyte of data. This was what started it.
And yeah, this is the why, and where they ended up, with an exabyte of data being moved into Google Cloud. Isn't that correct? And this is all for internal querying purposes.
How many pounds of hard drive would that be? Oh my God, man. Right? If they shipped hard drives with that much data, like how much would that be? What would the shipping cost be?
Yeah, I'm just wondering, like, is that like a dump truck worth of hard drives? Is it dump trucks? Is it airplanes worth? I don't know. I don't know if there's an easier way to figure it out than to just brain-dead it and figure it out.
All right, so hold on.
I just Googled this because that's what you do.
How many terabytes are in an exabyte?
There are 1 million terabytes in an exabyte. So if we assume, I mean, I'd say most data centers aren't running, like, um, 14-terabyte drives because they're too expensive, right? They want something cheaper. So let's say, out of their 1 million, divided by, what, eight terabytes is probably common. I was thinking five. All right, well, that's 125,000 drives. Yeah.
How many drives fit in a garbage truck? Who doesn't know?
So if we were to come at this from a different direction, then, because there's one way of doing this where, like, the drives aren't active, they're just literally packaged up and boxed and sent, right? And that's going to be, you know, you're going to have a higher compression rate of drives. So, like, really, now we're talking about, well, how effectively am I going to ship those drives? Am I going to put them in, like, you know, consumer-grade packaging where there's a lot of packaging material around them, or am I going to, like, squeeze some of that in closer, you know, to get that tighter? Um, so now you're just like, well, what's the size of the drive, period? And also, like, we're assuming hard drives and not SSDs, because of the density, how much more storage you can get in a spinning hard drive versus an SSD, but the SSDs are going to be lighter. And those are things, you can get those super tiny now.
But coming at it from a different approach, if you were to look at a comparison of, like, the AWS Snow, uh, Snowmobile service, where AWS drives a truck to your location when you need to move exabytes of data into AWS and you want to use their system to do it fast, it's a hundred petabytes per truck.
Okay, there you go.
But that's a live working truck, though.
Yeah.
Like, you got a lot of hardware.
Is that like a 40-foot tractor trailer truck?
That's got to be what it is, right?
I don't know.
It's a 45-foot container.
Yeah. Good Lord, man. All right, so there's a thousand petabytes in, uh, one exabyte, and how many petabytes did you say was in that truck? 100. So it's 10 big trucks. That's crazy.
Assuming that you wanted the data, like, actually, like, pluggable and ready. Right.
Yeah. Yeah. That's, that's insane, man. Yeah. That's crazy.
Also crazy that such a service exists.
Like how many times, like how many customers does Amazon have to where that's actually a need? Right. That they're like, yeah, no, we do, we do this a lot. Uh, you know, hey, here's your, here's your punch card, and, uh, you know, the 10th one's free.
It is a 45-foot truck, by the way. So it's 10, like, you know, tractor trailers.
Yeah. Man, that's, that's insane. Yep.
Eight foot wide, nine point, uh, six foot tall, 45 foot long, cube or square.
So you said eight feet wide, times 9.5 tall. Uh-oh. We're going to calculate some volume. Times 45 feet. That's 3,420 cubic feet of space available for these. I'm sure it's not packed all the way to the brim, but yeah, that's, that's a lot of space. Is it worth it?
I imagine, I imagine that that 45-foot container is basically like a moving data center.
Yeah, that's exactly what it is. A bunch of servers, a bunch of hard drives. Like, hey, we, uh, we need some backup power for our generators, but otherwise, where's your Wi-Fi?
Yeah.
Could you give me the guest password, please?
Snow White and the Seven Dwarfs.
That's right.
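If you want to check the napkin math from that back-and-forth, it only takes a few lines. The 8 TB drive size is just the number thrown around in the conversation, and the 100 PB per truck figure is the advertised Snowmobile capacity:

```python
# Back-of-the-envelope math from the conversation above.
EXABYTE_TB = 1_000_000          # 1 EB = 1,000,000 TB = 1,000 PB
DRIVE_TB = 8                    # assumed drive size from the discussion
SNOWMOBILE_PB = 100             # advertised capacity per Snowmobile truck

drives = EXABYTE_TB / DRIVE_TB
trucks = (EXABYTE_TB / 1_000) / SNOWMOBILE_PB

print(f"{drives:,.0f} drives")   # 125,000 drives
print(f"{trucks:,.0f} trucks")   # 10 Snowmobiles
```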
Hey, so somebody has some fun questions in here.
Oh, yeah.
So Elon Musk recently ended up buying Twitter.
That was a whole big fiasco.
What?
I'm not going to get into the details of that.
Yeah, you're going to have to Google it.
I didn't know about that.
Yeah, it's been kind of a thing.
But one of the things that happened is there were a ton of layoffs, right?
The company had like 7,000 employees, I think.
And they got rid of like half.
And so there was a lot of discussion on Twitter and a lot of other places saying,
oh my gosh, how does Twitter have so many employees?
Like, I could write it in a weekend.
And so I was curious, like, you know, obviously we just spent a lot of time talking about a lot of the other things that Twitter does besides just, like, a simple content management system.
Like, we talked about kind of pulling data out of the database.
But I just thought it might be fun to kind of bring up and say, like, could you build Twitter in a weekend? Anybody that can build Twitter in a weekend, I say just go for it. What's holding you back then? Yeah, you've had years to do it.
Yeah, you know, I think it's ridiculous. Like, obviously, uh, this is something that happens a lot. I think developers will often take a service and boil it down to one simple part of the use case and then think that that's all there is.
I remember when Dropbox came out, there were a lot of people being like, I could have done this with a NAS and rsync for free.
And so there's all these people putting out instructions on how to replace Dropbox.
Somehow, Dropbox is a hugely successful company that even has
several companies that spawned to compete
with it. They're all doing very well.
It's just kind of funny to me
that people talk about it.
I think anytime you're thinking that you could just build X in a weekend, it's because all you think it is is what you see, or your view of it. You don't think about the
financials, the billing, the advertisements, the machine learning, like all these things that are really necessary to making that thing successful.
You know, it's not just the way you primarily interact with it.
Even just that part, here, being on a platform and then being able to follow each other and tweet and see each other's stuff, that alone would take some time, especially because it's all live streaming. You go up there and things are constantly popping up, new, fresh, everywhere. Like, just that alone is already more than a weekend's worth of work, right? Not even to take into account authentication, authorization,
all that kind of garbage.
And then you start building on top of it.
Hey, once you get past a few hundred users,
your problems just got way different, right?
Yeah.
I mean, it would take me a weekend just to set up my DevOps pipeline.
Yeah, man.
Yeah, it's kind of ridiculous.
But it did remind me of a tweet that I just saw a couple days ago from David Whitney,
who's an interesting person on Twitter.
You should follow him on DevSpace.
And I'm going to paraphrase here because they used a naughty word.
It says, the more I think about it, the 10x developer trope is less rock star and more crappy cover bands
than any of those people would like to admit.
And I think that's a really good point.
It's like a lot of times when people think or people talk about,
you know,
these kinds of great accomplishments or what they could do in a weekend,
you're thinking about this,
like smoke and mirrors kind of demo,
basically standing up something that's made out of,
you know,
balsa wood and paper that just is totally fragile and unmaintainable.
And it's just not nearly as robust and significant as the real thing.
And so I just, all the talk on Twitter of people talking about building Twitter in a weekend
are usually just thinking about a UI and a simple database,
and it's just not the same thing.
You know, it's funny, even that.
So you had mentioned, or we mentioned on the last episode,
there's some thing that people are installing or using
that's like a Twitter replacement type thing.
What was it?
Oh, Mastodon.
Yeah.
Mastodon.
Even just setting something like that up can take a day, right?
Let alone programming the thing.
So that's where I think it's so crazy that developers,
especially experienced developers, will go out there and make a statement like, oh, I could totally make this in a weekend.
We are a confident bunch that are also opinionated and a little sure of ourselves.
That's how it always starts.
We're like, I could do that in a weekend, and then we'll start. And then, you know, we get lost. We'll go down some rabbit hole of authentication. You know, this is the one that we always tease Alan about. We'll fall down some rabbit hole of authentication and then come up for air, like, 18 days later. And like, wait, the weekend's over. Yeah. What was I doing? I don't even remember anymore.
Yeah, totally. It's so easy to be dismissive, which is ridiculous. And I think it's dismissive of all the hard work that's gone into it, but that's a side note.
So do you guys want to create a Twitter?
You want to do it? I think we can do it in the weekend.
I can't even set up a Mastodon one.
Sounds like we already got the
architecture right here. That's right.
I think individually we couldn't do it, but the three of us
together, we could build Twitter in a weekend.
Definitely.
Yeah.
Long weekend.
Yeah.
And this was just another kind of a little bit of a kind of insight into what's going on there.
Another thing that was tweeted out, and this one was actually tweeted by Elon, who mentioned that.
I'm going to paraphrase another tweet here.
He'd like to apologize for Twitter being super slow in many countries because the app is doing more than a thousand poorly batched RPCs just to render a home timeline.
Which, you know, like that's what he tweeted.
And there's been some talk on, you know, whether or not that's true or, you know, like how accurate that statement is and who knows.
But I will say that I am not surprised to hear that there are a whole lot of calls being made to external services. And so if you tell me that the home timeline is making, you know, potentially a thousand calls to render, like, I believe it. You know, I'm not, like, so disgustingly shocked by it. Like, I could see that sort of thing kind of happening if you take the request and you think about all the various things that kind of spin off of it. Like we talked about just the analytics side of it, like just knowing that someone refreshed the timeline, and all the various services that, you know, end up going through the pipelines until they end up in their final, you know, data stores. It can be just a ton of data moving around. And so, you know, while I don't know that it's necessarily a thousand, I'm not surprised to hear that it's a whole bunch.
Yeah.
And we'll have links to that in the show notes as things are referenced.
There's some interesting comments in this thread.
I will say that.
Yeah.
Yeah, it's juicy. So, you know, if you want, uh, you know, a juicy read here, one of the developers that worked on it responded to, uh, claims Elon made. That person has since been fired. But there's been a whole lot of interesting stuff. I'll actually have a link to a news bite, uh, like a news article about the whole thing that's got links to all the various tweets. It kind of covers the drama.
There's been a lot of drama. A little bit of drama.
Yeah, I've been avoiding Twitter, but I do love the technological side of it, because it is seriously one of the most insane engineering things that exists. I mean, the amount of throughput those people have.
Yeah. And you know, it's hugely influential, too.
Just, like, if you think about, like, when's the last time you saw a bus or a truck or something and had the little blue bird on it?
Every commercial, you know, you see a commercial for aspirin and it's got the little, like, follow us on Twitter.
It really is a big part of, like, how people communicate and just, you know, it's been a big part of kind of modern culture.
I mean, the term hashtag only exists because of Twitter, right?
Like, it's insane.
It's definitely like a fascinating set of problems
that they created for themselves, right?
Because like you said, this didn't exist before.
But yeah, like trying to deal with these things in real time and, you know, yeah, it's insane.
Yeah. I was just thinking there, like when you mentioned the bus example, I didn't know where you were going. And I was like, man, when's the last time I saw a bus? I think it was being driven by Sandra Bullock and Keanu Reeves, and nobody wanted to be on that bus. So maybe that's where he's going.
Yeah. Yeah, yeah.
All right. Well, we'll have a bunch of links in the resources we like section for this, uh, this episode.
And with that,
we head into Alan's favorite portion of the show.
It's the tip of the week.
All right.
And here I've got a tip about, uh, Kubernetes. So, you know, I love k9s, the command-line interface
that I use for just doing all sorts of Kubernetes stuff all the time.
Love it.
It's great.
Because of that and because of how much I love it,
how comfortable I am with that tool,
I've really not looked at other ways of kind of interacting with Kubernetes
until fairly recently when someone convinced me
to install
the VS Code plugin for Kubernetes. I just didn't think I really needed it, but I really like it, turns out. So it kind of gives you, like, almost like a directory-browsing kind of layout in the left nav for, uh, finding your resources and navigating your contexts and things like that. It's just kind of nice.
And of course, it's also really nice to be able to right-click on a pod or something and attach to it, which I'm sure everyone here,
at least, I don't know about listeners,
but the three of us have definitely done various things
with attaching VS Code to a container or to a remote server,
done their live sharing type features.
It's basically, you can do stuff like that with it where you can kind of connect to a server
and then open up VS Code.
And it's as if you're working on that machine and you can open a terminal in it.
You can do all sorts of cool stuff locally, which is just really convenient.
And so I am kind of a little bummed that I put off using it so long
because it has been really handy and has been a nice complement to having k9s for kind of shooting in, looking at logs, shelling in, that sort of stuff.
So it's nice to have more options and I'm glad to have it.
One thing I did want to mention is you have to be careful.
One thing I really like about k9s is it does not change your global context when you change context in k9s.
So, for example, I work in several different contexts throughout the day.
And I keep my local context always set to just my local instance, my local cluster.
And so if I ever need to run a script or anything, I always have to pass the context that I want.
And if I make a mistake and don't pass the context or don't pass the namespace, it just
affects my local, which is great.
But in Kubernetes plugin and Visual Studio Code, it's easy to kind of double click something
and not realize that you've changed your local context to a different namespace or something.
So you just have to be careful because I like to keep that always set to something that
can't damage too badly.
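If you script against clusters from Python, the same "always be explicit about the context" habit looks something like this; the context and namespace names are placeholders, not anything from the show:

```python
# Sketch: always load an explicit kube context instead of relying on whatever
# the global current-context happens to be pointing at.
from kubernetes import client, config

# Explicitly pick the harmless local context (placeholder name).
config.load_kube_config(context="docker-desktop")

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="default")
for pod in pods.items:
    print(pod.metadata.name)
```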
Yeah, that's scary.
I wonder why, on all the things that they're running behind the scenes, they didn't just pass the context with the command. That kind of stinks.
But yeah, most people running kubectl commands are used to kind of dealing with that problem, so, you know, it doesn't bother me too much. It's just something I've gotten to kind of take for granted because I just, right, you know, always use k9s.
Yeah, Visual Studio Code is just an amazing tool. It's so good. It really is.
So, uh, for my tip of the week, as you were speaking, it reminded me: we've talked about iTerm2 as a favorite, um, terminal replacement, you know, on macOS. And, uh, I talked about it before. So it's been mentioned before in a couple of episodes, episodes 147 and 161. So if you're not using it, um, you need to go back and listen to those episodes. I don't know why, why aren't you using it yet? But, um, in one of the episodes, I want to say it was the 161 episode, uh, double checking. Yep, it was 161.
Um, I had mentioned using the split windows, right? And, like, my preference is to split the windows, uh, vertically. So you can just, uh, Command-D and it'll split it out, and you can, like, have multiples of these.
Right. And so as you were describing your Kubernetes environment with Visual Studio Code, I was like, oh yeah, you know what? It, like, dawned on me this week, like, my favorite view right now has been, you know, so we've talked about, like, these widescreen monitors that we have, you know, we've grown to love them, right? Because you can have, you know, a lot of documents open and see things at one time. Well, for my Kubernetes workflow, my favorite pattern has been to have iTerm, but split three times. So I have three windows. My leftmost window, because we love Skaffold, so my leftmost window I'm using for Skaffold. My rightmost window I'm using for k9s. And then my middle one is, like, all of the, you know, ad hoc commands that I want to type in, you know, randomly in there.
So, yeah, I just wanted to like give another shout out to iTerm or iTerm2 specifically, I guess.
But yeah, because it just makes life so easy.
And that wasn't even the tip of the week that I planned to discuss.
So that's just a freebie right there.
So you got two out of me.
You got two freebies.
The first one was a good password, right?
I think that was the first one.
And then, and then iTerm2. All right. So my real tip of the week, though, was, uh, so I learned of this today, and I don't know if you guys have, but I gotta get more into this, uh, cause this looks super promising. But, uh, there is now a kafkactl command line tool, so that you can do all your, you know, Kafka management using this tool.
And the beauty of it is if for those who are like, wait a minute, but there's already like, you know, a bunch of scripts that Kafka comes included.
You know, you just go into your bin folder of your broker, for example,
or your, your connector, whatever. And you know, there's a Kafka topics shell script.
That's like super cryptic. And you got to like, do I provide the bootstrap server or do I provide
the ZooKeeper? Wait, when does it matter? Do I need both? Well, with the kafkactl command, you don't need to do any of that. And, like, it's got just things that make sense, like verbs that make sense for what you want to do. Right. So it's very Kubernetes-like in that regard, from a CLI perspective. So I saw this today and was just, like, mesmerized by it. I was like, oh, that just looks awesome. And so, uh, I wanted to share that.
That's most excellent, man. I was actually looking for, for something that Dave Follett
had mentioned. So, um, I think on the previous episode, uh, there was, there was something about
seeing the process that actually had a
handle on a file, right? And he had mentioned something, he actually sent a correction. I think what he had told us wasn't exactly right, and I hadn't double-checked it, but he sent me something. So I can't find it; if I can, I'll get it in the show notes for this so that it'll be down there
in the tips. Um, so where he was, uh, cat-ing the process ID under /proc. Yeah, it was something a little
bit different. He said that it wasn't actually 100% spot on what he had said before. Um,
but we'll give him a pass because he has given us a lot of good tips. So, um, I'll try and get
that in there. Also a note on the previous shopping episode, I talked about the Roku
streaming stick 4k plus, right?
And outlaw and I got into a little conversation about, well, does it actually support Atmos?
So it's really weird on their website.
It definitely does not show that it supports Dolby Atmos, right?
Like it says like Dolby HD plus or something.
I can't remember.
Um, so I actually did a test on my soundbar that has
Atmos and all kinds of other stuff. And I tried content that was both Atmos, um, stereo, um,
just DTS surround and all of them. And every single one of them coming through the Roku
registered properly on the soundbar. So if it was Atmos, it showed Atmos on the soundbar.
If it was stereo,
it showed stereo.
So it's at least passing it through.
I think,
and what I was saying last time is I don't think it decodes Atmos through the
Roku,
but it'll pass the signal through.
So I believe that's what's going on.
I haven't looked it up,
but I did see that it would show up properly in both places.
Well, that's the thing that was so confusing to me.
Like I went back and looked at it too out of curiosity because they had some like weird
wording for it.
I don't remember it now off the top of my head, but because, and you mentioned it just
again about the decoding and I'm like, yeah, none of them are decoding it.
That's what the, that's what your receiver is doing because the decoding
is actually deciding like, Oh, this is supposed to go to that channel. And that's supposed to go
to that channel. You know, like that's the decoding, right? Well, sort of, man, this is
where things get really confusing. So all confusing is what we do here, right? Well, I mean, for years,
like with receivers, one of the reasons you would upgrade your AVR at home is because it had more decoding or more codecs that it could handle, right?
Like DTS, DTH, DTS HD, whatever.
And so if a signal got passed to it and it could read that signal as DTS HD and it supported it, then it would actually play it in that.
If it couldn't, then it would basically fall back to some standard AC three type thing or
whatever.
Right. Like the stream might have, like, here's the two or three or four different things that it's available in. It's available in Atmos, it's available in, uh, you know, 5.1, it's available in stereo. And so, yeah, it's going to try, uh, you know, it's like a protocol agreement, like a handshake, you know, a TLS handshake. It's going to try to, like, find the best one, right? And then it'll, like, successively go down the list. And that's why it confused me when you were talking about the decoding, because I'm like, well, they're all passing through, right? Like even, like, an Apple TV is passing it through.
So yeah, what's interesting is I think what the Roku stick will do is it can actually
decode things to DTS HD plus or whatever the things that it supports, but it will pass along
the original stream information. So if your receiver can handle Atmos, then it'll do it, but it will not decode Atmos and try and send any information to the TV or receiver or whatever saying, do this.
So I just wanted to say that. So if you do have something that's capable of doing Atmos or
whatever, that Roku will actually pass it along and it will get picked up from it. So again,
for like when it was on sale and probably during black Friday, it's going to be like 30 bucks or 25 bucks, $25, man. It's a stupid good deal on a streaming
thing. So, um, at any rate, all right. So my, my tip of the week,
so I got this from, I think, that same meeting that Outlaw got the other thing from today, um, which was really good. Kafka, I think by default, and I may be wrong, I should have looked it up before I actually even said this, but I want to say the message size is supposed to be one megabyte. Does that sound right?
I think it is the one, yeah, I think that is right, yes.
Yeah, so here's where I'm going with this. All right. Thank you. Um, so yes, what that means is,
okay. One megabyte. So by default, you can send messages up to one megabyte in size to Kafka and
it'll write it. If you send something bigger, it'll basically blow up and it won't write it
because it's like, Hey, I can't do it. It's not going to truncate it. It's just not going to
write the record. Um, and we ran in situations where we actually needed more than that,
or we thought we did. And there's an interesting thing you can do. So you have in Kafka producers
and consumers, producers are the things that are writing things to Kafka consumers are things that
are basically, you know, listening to and getting messages from Kafka in your producer, you have the ability to
turn on compression actually on a message by message basis if you want to. Um, but the
interesting thing is they have several different types and I've got a link here. You can do no
compression, which I think is the default. You can do gzip, Snappy, or LZ4 compression. So if you had
a message that was too large, it was greater than
one megabyte. Let's say it was a JSON blob. You could likely use some Gzip or Snappy compression
to squeeze that thing down before you even send it to Kafka. And then you might be fine. So you
may not need to increase the size of your default messages that Kafka can handle because there are some downsides
to doing that, right? Like Kafka is supposed to be really fast. And if you increase the actual
size of the messages, it's writing longer per thing that it's doing potentially. So this might
be a really good solution for you. Not only that, but keep in mind that the way you size a Kafka
cluster is based on the amount of bandwidth that you expect to go through. So if you're increasing the size of your message, then you're likely going to impact the size
of your cluster.
Yeah, you're rethinking.
Or what you need to have.
And Kafka cluster sizing can be a pain in the butt, because if you decide, like, oh, we have five brokers today, and then tomorrow you decide, you know what, we need seven, taking advantage of those extra two brokers now, um, becomes a hassle. Uh, it becomes tedious to, like, rekey messages to spread things out across that, because it's deterministically, you know, deciding where the key belongs for that piece of data. And if you are changing the infrastructure around, then, you know, you've got to rekey things. So that's why, like, one of the recommendations related to sizing is, um, uh, you know, that you size the cluster to last for two years and then, you know, come back at it after that. Although there is another thing, too. So, okay, here's your third freebie from me.
Wait,
before you go to that one real quick,
one other thing just to finish this up.
So the interesting thing about when you write these compressed messages,
the compression information is stored with the message.
So when it's written to Kafka,
Kafka gets this,
this crushed up compressed thing.
It just writes it.
Whatever consumes it will see the metadata about what compression technology was used.
And so it decompresses it at the consumer level.
So it can actually truly be on a message by message basis.
So it's really, really cool how they baked this in.
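As a concrete example, here's roughly what turning that on looks like with the kafka-python client. The topic name and payload are made up, and whether compression alone keeps you under the default 1 MB limit obviously depends on your data:

```python
# Sketch: produce a large JSON blob with producer-side compression enabled.
# The broker stores the compressed batch; consumers decompress it transparently
# based on the codec recorded with the message batch.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",          # also: "snappy", "lz4", "zstd", or None
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

big_event = {"user": "example", "payload": ["lots", "of", "repetitive", "json"] * 10_000}

producer.send("events", value=big_event)  # hypothetical topic name
producer.flush()
```

One caveat worth knowing: the size limit is checked against what actually gets sent, so whether you still need to bump the broker and producer size settings depends on how well your particular payloads compress.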
All right, go ahead.
Tip three.
So there's another one that I haven't had a chance to like dig into
in great detail yet, but it's been on my radar now for, for quite a minute. But, um, I think it was
created by LinkedIn, but it's called cruise control and it's for Kafka to help manage large
Kafka clusters at scale. And so, um, you know, some of the things that I was describing,
like might be like, there might be somebody who's like, oh, no, that's old advice.
Like now, with Cruise Control, you know, you can easily add and remove brokers, you know, because that's literally listed as one of the things that you could do: rebalance the cluster easily using Cruise Control, and things like that. So I haven't had a chance to dig into it. So that's one of the things I've been kind of curious about, is, like, when they say that it'll do it, okay, but like, how quickly is it?
Because I remember from past experiences trying to like rekey messages in a topic because you
want to like change partitions or something like, like, you know, based on the amount of data I had
at the time, which was a large data set,
but just done for testing purposes.
And it was like an eight-hour ordeal to redo it.
It was not an exabyte.
Yeah, yeah, right.
So I mean, that's why I'm kind of curious to see,
like, okay, well, what is all this doing?
So I don't know. I'll put that out there.
You know, it'll be in, in the, in the links and, you know,
maybe we'll all learn something new. That's amazing. And some cool technology.
So yeah, whatever. All right. Well, Hey, let me ask you this.
If you watch an Apple store get robbed,
are you an eyewitness?
Okay.
Just asking for a friend.
I like it.
Yeah.
One last question.
How do you fix a broken pumpkin?
Smashing is all I think with my mother.
No,
it's gotta be something about squash.
Oh, yeah, good. A pumpkin patch.
Oh, jeez.
That's even better. Excellent.
Thank you, MikeRG, for those.
Excellent.
And now we head into
Jay-Z's favorite portion of the show.
Goodbye. It's the end of the show.
Subscribe to us on iTunes, Spotify,
wherever you like to find your podcasts,
and be sure to leave us a review.
I know Jay-Z said he was going to give you a freebie,
that if you gave us a four, we'd treat it as a five,
but I don't know.
I call shenanigans on that.
You want to give us the five, right?
I mean, am I wrong?
Yeah, the only way that four happens is if you accidentally slipped and clicked the button when you were hovering over the four. That's the only thing that makes sense.
All right, so I think you clicked the four because you're still upset because you were trying to buy your Taylor Swift tickets and you didn't get them in time, and you're upset and you're taking that anger out on us. And I don't think that's a good look on you. Don't take your aggression out on us. That's not fair to us. We didn't do it, right? It's not our fault that Jay-Z bought all the Taylor Swift tickets.
What if he only liked 80% of the show?
Ah, why, Jay-Z? Why?
That's because you didn't listen to the other 20%. I mean, this is not our fault. You tuned out, right?
That's right. Come on, hook us up.
All right. So, hey, while you're up there at codingblocks.net, make sure you check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to Joe.
I did not see that coming.
Yeah, I'm in slack.
That was great.
My bad.
No, did you?
Hey, it's your turn, Joe.
Oh, hey. Yeah, and make sure to follow us on Twitter, while it still exists, at CodingBlocks, or head over to codingblocks.net, where you can find our social links at the top of the page.
The times, they are a-changing. Who knows?