Coding Blocks - Technical Challenges of Scale at Twitter
Episode Date: November 21, 2022
We take a peek into some of the challenges Twitter has faced while solving data problems at large scale, while Michael challenges the audience, Joe speaks from experience, and Allen blindsides them both.
Transcript
You're listening to Coding Blocks, episode 198.
Subscribe to us on iTunes, Spotify, Stitcher, wherever you'd like to find your podcasts.
Visit us at codingblocks.net where you can find the show notes, examples, discussion,
other stuff, rants, ramblings, whatever.
Yes, indeed.
Send your feedback, questions, and rants to comments at codingblocks.net and you can follow
us on Twitter at CodingBlocks.
And hey, we got a website at CodingBlocks.net and there's
links at the top of the page. Go to other
websites that we curate or
I don't know, contribute to, I guess.
I don't know.
Yeah. Hey, I'm Joe
Zach, by the way. I'm
Michael Outlaw. And I
am Alan Underwood.
And you know what they say about cliffhangers so what's the topic for the evening
what do they say about cliffhangers
so anxiety has built up so for this episode we didn't have any new reviews
since uh last episode.
So, you know, we'll have to like, I guess beg.
Maybe Jay-Z needs to do the beg.
Got it.
Are you, is this getting to you yet, Alan?
Do we beg right now?
Yeah, I mean, sure.
Let's do the awkward part up front.
Let's do it.
No, I meant the fact that
Alan asked and I just
walked away from it.
How did you not get that that was the cliffhanger?
I got it.
A lousy limbo player
walked into a bar.
They did?
Yeah.
Yeah.
You got the punchline right in the thing.
I did.
It was good, right?
Yeah.
Yeah.
Yeah.
That's thanks to my son.
That one got me the other day.
I giggled a little bit.
Yeah.
All right.
So, so for the topic of this particular show, we're going to be talking about this thing
that you might've heard of called Twitter.
Um, not the craziness going on, on the interwebs and with the company and all that
kind of stuff.
More about the technical challenges that they've faced as they've grown over
their,
what?
18 years?
No,
19,
something like that.
So we're going to do that.
But first we got a little bit of news.
So outlaw,
you want to read us the review names?
I already did that part.
Yeah,
you did.
Yeah. I didn't like it. Yeah, I didn't like it.
Oh, you didn't like it?
Okay, well, hold on.
Let me read it again.
All right.
Jay-Z, you got something up here?
Yeah.
Hey, Game Jam.
It's officially time to start talking about it.
So, yeah, I am super pumped about that.
I love doing it every year.
It's going to be year number three.
It's going to be better than ever.
We are officially soliciting.
Solicitating? I think it's a word.
No, soliciting.
I think that's it.
I'm feeling goofy tonight.
Soliciting theme ideas.
So you got an idea?
Shoot, email, text, tweet, whatever, and I'll get on the list, and we'll start voting soon.
It's just a lot of fun.
Now, how come, as a member of this three-party show, I didn't know that there was an official time that we start talking about the Game Jam?
Uh, yeah, probably because I don't share my notes. I should.
What makes it official?
That's because this is around when we talked about it last year.
Okay. So I have a little timeline, the whole, it's like, three months out we do this, two months out we do this, this is when the emails go out.
And you guys are hyper organized.
I have to be, otherwise I can't get anything done. There's no in between.
That's probably true. So if you're listening to this and you're like, oh, that sounds like me too, let me go ahead and tell you right now: make yourself a note, leave us a review, because that's why you forgot to do it already.
That's, that's true. That's technically true.
You don't have to do it now, just put it on the list.
Yeah, put it on the list. Yeah, you'll get to it.
Hey, so one last little thing here. I found this today, I came across this today. And you know, there are people that think that we know everything about what we do because we're 198 episodes into talking about coding, right?
Man, so I've been dealing with a particular problem that is driving me absolutely up a wall, right? Something was encoded some way, and it went through some sort of encoding, decoding somewhere, and trying to get that thing back to its original state is driving me absolutely crazy.
Man, I learned stuff in this one article that was written 19 years ago by Joel on Software. You've probably heard of him. But the title of this article is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.
No excuses.
No excuses.
If you don't know all this stuff about UTF-8, UTF-7, if you didn't know that was a thing, then you should read this.
ASCII,
ANSI, ISO, whatever the other ones are. If you don't know this stuff, go read this article.
It will absolutely do you a huge favor. I learned so many things today reading this article that I just never knew about. Like, I didn't know that when you do UTF-8, um, it can zero-pad the leading characters, or you can leave them off. I didn't know that. And that matters, because if you encode something to UTF-8, those leading zeros might be there, they might not be, but now trying to go back to where you started from can be an absolute nightmare, because those characters may or may not be there.
So it's a, it's really an interesting read, and you will learn so much about why this stuff even exists.
So definitely, after you're done listening to this on your drive or whatever, go to the show notes, codingblocks.net/episode198, and find this link for the Joel on Software article and read it.
It will absolutely do awesome things for you, just understanding how this stuff works.
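For anyone following along in the show notes, here's a minimal Python sketch of the kind of encode/decode round trip being described; the text and the Latin-1 mix-up are illustrative assumptions, not details from the episode.

```python
# A minimal sketch (not from the episode) of bytes being encoded one way
# and decoded another, which is how text gets mangled in transit.

original = "café naïve"                    # text with non-ASCII characters

utf8_bytes = original.encode("utf-8")      # correct: UTF-8 bytes

# Somewhere downstream a system wrongly assumes the bytes are Latin-1:
mangled = utf8_bytes.decode("latin-1")
print(mangled)                             # cafÃ© naÃ¯ve  -- classic mojibake

# Recovering the original is only possible if you know the exact mis-step
# that was applied; guessing wrong just mangles it further.
recovered = mangled.encode("latin-1").decode("utf-8")
print(recovered == original)               # True
```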
And it's very timely too, because, I mean, this article was written... Oh, you said 19 years ago. Yeah. I take that back.
Okay. Yeah. Yeah, it's been a minute.
Oh, we're allowed to forget stuff. It's fine, it's fine.
Oh man, look it up. Absolutely crazy. So yeah, um, good stuff, really good stuff there.
Um, so I guess with that, let's go ahead and get into the show. So we'll have some links in the resources for this. So if you don't know this exists, we've talked about various different, um, engineering blogs out there. Like I know in the past, Outlaw, I've talked about like the Uber one. We truly love that and all the stuff that they put out into
the world. We've talked about Netflix with chaos monkey and all the other things they've added. Twitter also has a fantastic engineering
blog and I believe, is it just blog.twitter.com? Can you remember? Blog.twitter.com. Yep. Yeah.
So if you don't know that exists and you do a lot with big data, especially this is a fantastic set of read throughs to where you can
learn from people that arguably have the most scale problems besides Facebook on the planet.
So, um, definitely go check that out, but we're going to hit on one here called Scaling Data Access by Moving an Exabyte of Data to Google Cloud. Now, here's the thing that I want to lead up with.
I started with this link. And the problem is, you have to kind of understand some of the history
of where they've been and all that before you can even get to this article. So even though I started
here, almost everything that we're going to talk about in this episode has nothing to do with this
particular article. It's what we'll be talking about today is an article they linked to at the very top
that talked about what they did to improve their scaling and ability to be able to look
at data analysis within the company.
So just a heads up.
Oh, sorry.
Go ahead.
No, just finish your thought.
Just a heads up. So we'll probably be coming back to this other one after this episode and talk a little bit more about things that have changed, since we go over what we're talking about in this one.
And in this one, you said they were moving how much data again?
Um, an exabyte of data. Yeah.
But how much data? Because you had to solve for exa, right? So what was exa?
It took a second.
It took a second.
You thought I had something serious
to say? I did. I did.
You totally got me.
So I don't know if you guys read this stuff, because I, I literally just went and picked out one of the blog posts and started going through it. Um, so we'll just chat about this thing as we go.
So, not to lie to you, I did not, I did not know that we were going to go... I did not read this one.
Okay, okay. I didn't.
All right. So, so we'll chat about this stuff.
Yes.
We'll chat about this as we go.
So in 2019,
we're at the tail end of 2022 now,
right?
So we've had three more years since this.
I can only imagine that things have gotten even bigger.
So just keep in mind,
this was three years ago,
but it was written this year though.
The article was written this year.
The article was written this year, but he's actually talking about the numbers from 2019 directly. So he didn't mention what the new numbers are, but they said in 2019, over 100 million people per day would visit Twitter feeds, right? Now, they didn't say whether it was from a website
or from an app or whatever, but just imagine 100 million people dialing in to that, what do they
call it, the fire hose or whatever, to where they're getting data out. That's a whole lot.
If you recall, too, just to back up for a moment, I think we talked about this as part of the
designing data intensive applications architecture, right?
Where some of the problems that Twitter has in terms of putting together a timeline when you would go to Twitter, right?
Like, Jay-Z, you might be able to speak to this better than I can, but there was issues of like, if I updated my feed, then Twitter had like one strategy for how they would put my message out there on Alan or Jay-Z's feed.
But if the other Jay-Z, who might have a couple more followers, put something out, it was a different strategy for how instead of blasting it out to everyone's queue it would be a on read or as
needed kind of read right something like that am i saying this right jay-z yes if you think like
you know if you were to kind of design twitter from scratch you know and just not really think
about the problems that they've run into or what you know about in terms of scale you'd probably
kind of design it like a content management system or a blog or something like that where you would
say okay i'm a user i go to the page and i go and i fetch some stuff out of the database and i show it to you right not hard it's
just kind of standard like web development stuff but twitter has the celebrity problem where uh
they actually have a whole blog just on taylor swift she released a new album that you know
is doing like phenomenally well and yeah's is really good. You should check it out.
I do.
I was just kidding. Of course, everyone's heard.
Hashtag Swifty.
Yeah.
They actually have some numbers on it.
Anyway, I didn't really get those numbers together.
Taylor Swift put out a new album. Everyone's talking about it. So many people follow her.
So whenever she makes a tweet, that's a lot of people that need to get that stuff. So Twitter kind of came up with a strategy they call fan-out, where basically when Taylor Swift tweets, they actually run out and go and update a bunch of feeds.
So that rather than going and kind of trying to cobble together these feeds from a database or something, the feeds are pre-generated, so you can have quick, real-time home timelines.
And that's part of Twitter's mission
is to get you data really quickly.
They want you to have low latency,
up-to-the-minute data so that your feed,
if you both follow Taylor Swift,
we want to see those tweets come in around the same time
and they want to keep a conversation flowing.
They want to keep it fast.
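A rough Python sketch of the two timeline strategies being contrasted here, fan-out-on-write for most users versus fan-out-on-read for celebrity accounts; the data structures and the follower threshold are illustrative assumptions, not Twitter's actual implementation.

```python
# Illustrative sketch of fan-out-on-write vs. fan-out-on-read.
# All names, thresholds, and storage choices here are assumptions for the example.
from collections import defaultdict

CELEBRITY_THRESHOLD = 1_000_000          # hypothetical cutoff

followers = defaultdict(set)             # user -> set of follower ids
home_timelines = defaultdict(list)       # follower id -> pre-materialized tweets
celebrity_tweets = defaultdict(list)     # celebrity id -> their recent tweets

def post_tweet(user, tweet):
    if len(followers[user]) < CELEBRITY_THRESHOLD:
        # Fan-out on write: push the tweet into every follower's timeline now.
        for f in followers[user]:
            home_timelines[f].append(tweet)
    else:
        # Celebrity: too many followers to push to; store once, merge on read.
        celebrity_tweets[user].append(tweet)

def read_timeline(user, following):
    # Fan-out on read: merge the pre-built timeline with any celebrity tweets.
    merged = list(home_timelines[user])
    for followed in following:
        merged.extend(celebrity_tweets.get(followed, []))
    return merged
```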
So I bring that up because just to kind of like frame the conversation here, right?
That we're talking about like, you know, you said 100 million people per day.
But those 100 million people aren't, it's not the same problem that's being solved for, for each of those people as they view or post to Twitter.
Right.
Totally.
So that's the background of this, of this, you know, the, the, the domain of the problem.
Well, so that's the domain of the overall problem.
What we're going to be talking about more is what they tried to do for their
internal customers, the people in marketing and accounting and all that kind of stuff on how
they could see trends and analytics of what's happening with the various different tweets and
stuff out there. So check this out. For every tweet and user action that somebody does, whether
you like something or quote something or whatever, it creates an event.
So very similar to like the Kafka stuff that we've talked about in the past, right? You generate an
event and then that's used by machine learning and it's used by employees for analytics. So every
single thing that's done generates some sort of event that happens that goes out into their data pools. And one of the things that they ran into
is they wanted to make it,
and they actually said they wanted to democratize
data analysis to the people within Twitter
so that they could go and query things
the way that makes sense for them, right?
Like your marketing team's going to care about things
differently than maybe a customer retention team does or an engineering team or whatever, right?
Like everybody has a slice or a view of the data they want to see, and they wanted to make it to
where they could easily go do it without getting engineering involved. So it was kind of similar.
I think we talked about, like, as it relates to Uber, right? Uber was doing something kind of similar to that, where they were, you know, you'd have like the one big data lake, but then they didn't mind having these smaller, uh, database offshoots from that, where, like, you know, one team might want, like, accounts receivable or billing or whatever might need a different set of data than Uber Eats might want, a different set of data related to, like, go-to-market, that kind of thing, right? Versus the real-time stuff. So that's the kind of stuff you're talking about in relation to democratizing the data, so that the different parts of Twitter could use it differently, right?
Very much, yeah.
So really what they talk about here
is they kind of do have this big data lake. They don't use that term as much in the article,
but they basically said that they had various different technologies that were used for data
analysis. They had scalding, which I'd never heard of before. But if you wanted to use that,
it required programming knowledge. Then they said another problem was having data spread across multiple systems without
a simple way to access it, which is similar to what Uber was talking about, right?
Like they were trying to get data into one big area so that everybody could access it.
Um, so what they were talking about to start off this particular thing was, Hey, they want
to move things into Google cloud, um, particularly because they wanted to use BigQuery.
Because if you're not familiar with BigQuery, I actually copied their kind of simple summary on their own page.
It says it's a cost-effective, serverless, multi-cloud enterprise data warehouse to power your data-driven innovation.
So if you're not familiar with a BigQuery, kind of basically what you do is you ingest data into this thing.
And they've already got, you know, basically massively scalable storage behind the scenes.
So as you ingest data into it, it indexes it into its own format that it knows behind the scenes, if I remember correctly.
And then it allows you to use regular SQL queries to process and spec like, you know, terabytes,
petabytes of data. So you could do all that with BigQuery without having to worry about
infrastructure and all that other kind of stuff, right? So that's what they were trying to move to.
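As a point of reference, here's a minimal sketch of running standard SQL against BigQuery from Python using the google-cloud-bigquery client; the project, dataset, and table names are made-up placeholders, not anything from Twitter's setup.

```python
# Minimal sketch of querying BigQuery with standard SQL from Python.
# Project, dataset, and table names are made-up placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

sql = """
    SELECT event_type, COUNT(*) AS event_count
    FROM `my-analytics-project.events.tweet_events`
    WHERE event_date = '2019-01-01'
    GROUP BY event_type
    ORDER BY event_count DESC
"""

# BigQuery handles the storage and scale-out; you just submit SQL.
for row in client.query(sql).result():
    print(row.event_type, row.event_count)
```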
And then another thing that I had never even heard of until I started reading this article, and I'm curious if you guys had, was this thing called Data Studio. So if you hit that link, well, actually not the link that's there, if you go to datastudio.google.com, you'll be on a page where you can automatically start trying to create reports based off data that you've already got. So Data Studio looks to be available to everybody. I don't know what you get charged for it, or if you get charged anything.
Wait, this is just going against my Google Drive?
Exactly. That's what I'm saying. You could actually point it at other data sources and stuff. So it's pretty interesting. I'd never heard of it, but they are using Data Studio in conjunction with BigQuery, so that BigQuery has the data they want, they can run SQL queries out of that thing, and then use Data Studio to create visualizations, reports, tell stories with the data that they already have and processed in BigQuery.
So Google, man, like, what the heck? Right. I mean, it's actually really cool.
And if you look, they've got like, they've got sample data sets here. They've got one that's, um, what did I just click on, a world population? Um, right now there's 7.1 billion people in the world.
No, we crossed eight. It was announced this week. It was announced this week that, that they believe that we have crossed 8 billion people now on the planet.
That's, that's absolutely crazy. Oh, this says as of 2013. So I guess they populated this with old data. Um, and maybe it's not even real. But, but yeah, I mean, this is, this is really interesting. Like I said, I didn't even know it existed. Like, they, they have so many things in the Google ecosphere that it's just, it's almost impossible to know them all.
So anyway, with that, what they were basically trying to do is make it so it'd be easier for managers, people that just know general SQL, or maybe some developers or whatever, to be able to access this data and,
and make it show what they needed to look at.
They just want to be able to slice and dice data,
like generate like a chart.
They want to be able to drive a chart or something off of it. Right.
So charts.
Yeah.
But one of my favorite things about Twitter is trending.
So you can sign on Twitter and say,
Hey,
what's trending now?
And it shows you like,
here's the things that you're interested in that's trending.
So maybe I'll see stuff about like music or whatever.
You can go to the general news.
It's just like entertainment.
They kind of break it down by category.
So like, you know, for example, Taylor Swift puts an album out.
It's probably going to be under news.
It'll be under entertainment.
It's going to be under my interests.
But it's not going to be under sports.
And they've got a whole big data wing that's always kind of working on figuring out what that is.
And you think about it, like how crazy is it that you can go and see that they kind of boil down all the tweets of the day and say, hey, election is trending.
Or they take all that noise, all that mess, all that stuff that people are talking about and say, hey, Taylor Swift's new album Midnight's came out.
Or hey, you know, Super Bowl or whatever.
They take the crazy stuff that people tweet, 140 characters,
and they figure out the subject, they figure out how to count it, and they do a great job of it.
Well, now it's more than 140, right?
I was going to say it's a little bit more than that now.
Yeah.
But still, yeah, it's absolutely insane.
The point is lost on me because you got the number wrong.
Right.
When you consider how many tweets go out in a minute across the world,
the fact that they're,
they're able to do that stuff for the trends is absolutely amazing.
So this is where we need to take a step back into history.
So they actually lay out what,
what they had,
what their strategies were starting back in 2011.
So this is actually the history of data warehousing at Twitter.
In 2011, they did data analysis with Vertica, and they were using Hadoop for their storage.
Data was ingested using Pig.
And if you know anything about Hadoop and the way that stuff had to come in, you had to use a MapReduce process.
And Pig sort of makes that easier for ingesting data.
So 2012, they went away from Pig and they picked up Scalding. And what they actually said in the article is that it uses Scala APIs that were geared towards creating complex pipelines. It said it was sort of easy to create these complex pipelines, and it was also easy to test.
So that,
that makes a whole lot of sense.
I mean,
I know the three of us have been in the,
uh,
in the real time streaming data world and it's not easy sometimes,
right?
No.
And Hey,
I got a little tip for you.
Uh,
don't try Googling scalding versus pig.
Really?
Don't do it.
Yeah.
Apparently,
uh,
it'll be a scalding pig.
Yeah, like, I didn't know that was a thing. It's all, it's a whole big thing. I think we need to have like some kind of a banner for, like, when you're going to announce something crazy like that, Jay-Z. Like, there should be, like, instead of a spoiler, there should be, like, a, you know, gross warning coming.
Yeah, don't do this.
Yeah, I... whoa, what, what's going on? What are we talking about? Yeah, right.
So one of the problems with scalding, though,
is it's difficult for people that just have, like,
general SQL skills to pick up.
Like, they said that the learning curve is pretty extreme on that.
So fast forward four years from 2012 to 2016,
they start using Presto, PrestoDB, to access Hadoop data using SQL. Now we've talked about
Presto on here and we'll get into some interesting things a little bit further into the notes here
about that. But if you haven't heard the past episodes, Presto kind of allows you to
pick any number of various different storage technologies. In this case, it was Hadoop,
right? That they're using here.
You can use it to pipe into JDBC databases.
You can use it to pipe into GCS or Google Storage, AWS S3 Storage, all kinds of stuff. So is Scalding, like this was made by Twitter, it looks like.
Oh, really?
Am I wrong?
I'm looking.
Hold.
Yeah, Twitter open source. Sure enough.
Okay.
Interesting.
So, you know, for those
who are working at Twitter, they're like, of course we already
know about Scalding. But for the rest of us,
they're like, oh.
Brand new stuff. And it's open sourced.
So if you want to use it. Yeah, I'm looking at it on
GitHub. That's why. And it came up
as like a Twitter GitHub account.
And I'm like, wait a minute.
My spidey sense is tingling.
Right?
They have a really cool icon logo for it.
It's an elephant blowing flames out of its trunk.
Because it's Hadoop, right?
Isn't the Hadoop logo an elephant?
It is.
It is.
Yes.
So along with using Presto to access Hadoop data using SQL, they were also using Spark for ad hoc data science and machine learning.
So now, two years later, 2018, they're using Scalding for their production pipeline.
So, you know, transforming data, pushing stuff around.
And they're using Scalding in conjunction with Spark
for ad hoc data science and machine learning.
So not a ton changed there.
What did is they now have Vertica and Presto
for ad hoc interactive SQL analysis.
And they introduced Druid
for interactive exploratory access to time series data.
Okay, there's so many technologies.
So Presto, if I remember right, was the one where there was like a Facebook derivative. One was called Presto and one was called PrestoDB, right?
So, no, the two: PrestoDB was the original one. Presto SQL was the one that somebody forked. That, that was super confusing, because you'd go search for something for Presto, and sometimes you'd land on Presto SQL, and sometimes you'd land on PrestoDB.
But yes, that was the one that was created by Facebook.
Yes, Facebook to query basically just about any storage technology.
I say just about any.
A lot of storage technology is out there using SQL language.
But what was... wasn't it, like, Presto and PrestoDB? 'Cause, like, I'm Googling, and I'm seeing, like, literally, Presto was developed by Facebook, but there's a prestodb.io. I don't think it was Presto SQL. Was it?
Yeah, it really was Presto SQL. It was, it was Trino, which is like a newer kind of evolution.
Yeah, it's been a minute since I looked at all this stuff.
Oh wait, now I see it on the Wikipedia page. It does say Presto, including PrestoDB and Presto SQL, which has been, which has been rebranded to Trino.
Okay, so that's, that's what they did, because people were probably getting annoyed, just like
I was back in the day when I was dealing with it.
Well,
yeah.
Cause I remember like we were looking at that for some reason and I don't
remember if like this was maybe a followup from like,
you know,
or like a fallout from like us looking at one of the Uber engineering blogs
or something.
Maybe that's how we got turned on to Presto.
I don't remember now it's been so long,
but like when you said it,
I had kind of had like a little,
Oh God,
a little PTSD kicked in there.
And I'm like,
what is that?
And really all it was is the Presto sequel.
They just forked Presto DB and then started going off and doing development
in their own direction,
which I guess is now Trino.
So,
but the cool part about Presto, if, if you haven't heard it before,
is like I said, it'll allow you to query basically
kind of any data source out there.
And that's cool.
But the actual magic that made it what it is
is you can join data across disparate data sources, right?
So if you have data stored up in GCS
and then you have some lookup data
stored in a Postgres database,
you can basically say,
hey, select everything from my GCS data source
and join it on my lookup information from Postgres SQL.
And it will do it in a distributed manner, to where it'll pull the data into its own processing nodes and join the data there, and then give you back the data set.
So you could basically write a SQL join against anything just about that you can connect to.
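As a rough illustration of that cross-source join idea, here's a small Python sketch using the Trino client (the rebranding of Presto SQL that comes up a bit later); the host, catalogs, schemas, and table names are made up, and the exact connection details would depend on your cluster.

```python
# Rough sketch of a cross-source join through Trino (formerly Presto SQL).
# The host, catalogs, schemas, and table names below are made-up examples.
from trino.dbapi import connect

conn = connect(host="trino.example.com", port=8080, user="analyst")
cur = conn.cursor()

# One table lives in object storage (e.g. a Hive catalog over GCS/S3),
# the other in a Postgres database -- Trino joins them in its own workers.
cur.execute("""
    SELECT u.country, COUNT(*) AS clicks
    FROM hive.web.click_events AS e
    JOIN postgresql.public.users AS u
      ON e.user_id = u.id
    GROUP BY u.country
    ORDER BY clicks DESC
""")

for country, clicks in cur.fetchall():
    print(country, clicks)
```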
Yeah.
And so there was also another one that Jay-Z and I remember we looked at, we did it as
part of like a, you know, come watch us stumble on a live stream.
I mean,
no,
learn with us.
I think that's what we call it.
I don't remember,
but we definitely stumbled,
but it was on Apache drill,
right?
So it was a similar kind of thing where like,
you know,
with,
with these technologies,
you could,
it could,
it didn't even have to be a database.
It could be like,
I have a CSV over here.
I have an Excel spreadsheet over here.
I have a SQL server database over there,
an Oracle database,
a press,
uh,
uh,
you know,
the GCS bucket,
whatever these different data sources were,
you could like set up these connectors to it and then magically query it.
And,
and I remember drill was pretty good about like determining the types to like
it would,
it would figure out the data types and be like,
nah,
we got this.
We know what the quote schema should be for this thing that isn't really a table.
Yeah. I remember you and I, Outlaw, were playing with drill quite a bit. And honestly,
it seemed like it was a little bit more impressive from that discovery phase that you're talking
about than Presto was,
but it just didn't have the, I guess, the Facebook backing.
Or Twitter.
Right.
Yeah, so it didn't have the same mind share.
But, yeah, I mean, an amazing, amazing tool,
and it's still used by a lot of stuff out there.
So Druid, if you've not heard of Druid,
if you're trying to analyze time series data,
that is a super powerful analytics platform.
I thought it was an OLAP database.
It is.
It's OLAP for time series data.
Oh,
OLAP specifically for time series.
Yes.
I didn't know about the time series aspect.
Yeah,
I didn't either.
When you look at the ingestion on a lot of that stuff,
you actually have to do it on a time series type basis.
Now there may be hacks around it,
but that's what it was designed for.
Um, there's, there's a lot of competing technologies out there now, like Pinot, um, Roundhouse... what's the, uh, clicky house, ClickHouse, something like that.
Well, Pinot would just be another, like, OLAP database, right? But it's not specific to time series.
Because when we talk about a time series database,
the one that's in our face these days would be Prometheus.
Prometheus, yeah.
Right?
I'm going to think of something like that.
Yeah.
Yeah, that's what it was originally designed for.
Like I said, they do allow for tons and tons of dimensions,
but it usually has to be sliced up by time.
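For a sense of what "sliced up by time" looks like in practice, here's a hedged sketch of a Druid native timeseries query issued over HTTP from Python; the broker address, datasource, and metric names are illustrative guesses, not anything from Twitter's setup.

```python
# Illustrative Druid native "timeseries" query; datasource, broker host,
# and metric names are made-up placeholders.
import json
import requests

query = {
    "queryType": "timeseries",
    "dataSource": "tweet_events",
    "granularity": "hour",
    "intervals": ["2019-01-01/2019-01-02"],
    "aggregations": [
        {"type": "count", "name": "events"},
    ],
}

# Druid brokers accept native JSON queries on /druid/v2.
resp = requests.post(
    "http://druid-broker.example.com:8082/druid/v2",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
    timeout=30,
)

for bucket in resp.json():
    print(bucket["timestamp"], bucket["result"]["events"])
```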
So on top of that, they also used Tableau,
which if you haven't heard of that,
it's a very popular commercial piece of software out there that allows you to connect things and query them and visualize.
Dashboards.
Yep, dashboards.
Zeppelin and then Pivot for data visualization.
So I've never used Zeppelin or Pivot, so I don't know what those are like.
So, I listen to Zeppelin. I mean, it's been a while. I mean, they got some cool, you know, they're not new, but, yeah, I don't know if they're still touring. I think they were going to, but Taylor Swift kicked them off the platform to buy tickets. She was trending.
Yeah.
Oh, you know, uh, Taylor, Taylor Swift. I'm glad you mentioned her. Oh, uh, in the last 10 years, you know, she's averaged more than 75,000 tweets a day. Just about, sorry, about Taylor Swift.
I was going to say, how about the time? I was like, yeah, whoa, wait, what about the G program? I was like, wait a minute.
She puts out 75,000 tweets a day and does all these amazing albums and stuff?
Yeah.
I'm in there.
I'm not productive enough if I'm not accomplishing what she's accomplishing then.
Wait, you just said for the past 10 years, she averages 75,000 tweets a day, 365 days a year.
About her.
About her.
About her.
That's a total of 329 million tweets. Now, that doesn't count the 4 million tweets that she got in 24 hours, uh, when her album Midnights was released.
Golly, man.
Yeah. And, um, Twitter got a blog post out the next day with analytics about the tweets, and, like, how people were, uh, you know, what they were tweeting about, the top three songs that people were tweeting about in reference to the album. Like, all right, it's pretty amazing stuff.
Hear me out.
Hear me out.
You bet.
You're about to brag about our stats.
Aren't you?
Hear me out.
This is our challenge.
Dear listener.
We want to overtake Taylor Swift on Twitter.
So get on Twitter.
Social, you know, like, send your tweet, mention us with hashtag CodingBlocks or at CodingBlocks, whatever, whatever suits your fancy.
Either way, it's going to, like, Twitter will, you know, rank it all the same. They'll, they'll figure that out. They'll know that it means the same thing.
And, and let's see if we can't, if we can't take the top spot. I think we got it. I think we, I think we can do this.
I believe in us.
I believe in you. And I believe in us. We can, we can make this happen.
We're going to get ones of tweets.
Is that how you say that?
Yeah.
Yeah.
It's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a,
it's a, the largest community on Twitter. Oh, really? Yeah. That's pretty interesting. It makes sense.
441 million unique followers.
And I'm not really clear on what they consider their communities,
but I think it has to do with basically interests that they figure out about you.
Like, for example, I mentioned that they somehow figured out the kinds of music
that I like, and they have put me in these various communities about that.
And so sometimes they kind of throw me tweets that they think I'm going to be interested in.
Oh, that's pretty cool.
Alright. Well, now
that we've been stomped into oblivion
with our ones of tweets.
Tens.
Alan, tens. We might get tens.
That'd be exciting if it happened.
So another thing about Taylor actually is
I'm kidding. You're just trying to
crush our souls now. Jay-Z's just
going to hit us up with Taylor Swift facts all night.
That's a Taylor Swift comment or fact.
That's right.
Oh, man.
All right.
So they were already doing all this stuff, right?
They had data flowing.
They had all these ways to get reports, Tableau, Zeppelin, Pivot, all that kind of stuff.
So why the change?
Well, their big thing is they wanted to simplify their analytical tools for their internal employees. That's really what it boiled down to. So that's where BigQuery came in. Now,
they did say there were challenges. And I mean, the three of us have worked on three different
cloud platforms at this point, right? Azure, AWS, and GCP.
And I guess even more.
Well, professionally. Not including, like, non-professional work, you know, play.
Yeah, right.
And I mean, you could even count in Linode and DigitalOcean and some other stuff, right?
Which I don't guess they're quite the same, but there's always challenges, right?
Like every page that you go to on a a cloud sites like use this service it's so
easy and you get in there you're like man there's nothing easy about this i don't know why it seems
like it should be so easy but it never did so they had challenges starting with i can't even
differentiate your icon from all the other services you have aw AWS. Oh, yeah, yeah.
I wasn't trying to call anyone out.
This is awkward now.
We had a whole episode about it,
or at least a big chunk of an episode about it.
So one of the things that they had to do
is they had to develop their own infrastructure
to reliably ingest data,
large amounts of data into BigQuery.
So that's worth calling out. Twitter
did basically everything on-prem. They didn't do cloud computing stuff, not massively, right?
And that's similar to Stack Overflow. We've talked about this in the past.
Like Stack Overflow even had a page up that showed their, the gist of their overall infrastructure and how things worked.
And the reason they said is they spent as much on their entire infrastructure as what one month
would cost them if they ran it in the cloud. Right. And I have to imagine that's the same
exact reason why Twitter does everything on prem. And then it started porting things to the cloud
that makes sense to make their lives easier.
Right.
So just wanted to call that out.
Man.
Talk about another one.
I just found,
I was trying to find the specific link that you were talking,
you were referring to where like stack overflow would show like,
Hey,
this is our SQL server.
You know,
there's a Redis cache in front of it.
Things like that.
Stack overflow also has a blog for all their fun engineering challenges.
Oh, that's excellent.
stackoverflow.blog/engineering.
Most excellent.
Yeah, we'll have to dig into that one too.
So while he's looking for that other link, some of the other things they had to worry about,
they had to support company-wide data management.
They needed to implement access controls, which I think you would, I would hope you know why, right?
Like you don't want me accessing private user data somewhere.
Or tweeting on behalf of people or being able to see direct messages or something like that.
Yeah, totally. Ensuring the customer privacy.
They needed to build sources or build systems for resource allocation, monitoring, chargeback.
So if you work for a large corporation, you're probably aware of how this works, right?
Let's say that you are in AWS or GCP.
Usually departments get charged for the things that they're using, right? Like it's not
the company as a whole, because they want to find out, Hey, is engineering blowing out the budget?
Or is it accounting over here? That's that's using so many of these resources that are costing us
money. So they actually charge it back to the various different areas. So they had to build
those systems to get that stuff in place. So in 2018, when we mentioned the Tableau, Zeppelin, all that kind
of stuff, in 2018, they rolled out their alpha release of this GCP infrastructure, the BigQuery,
Data Studio, all that kind of stuff. And what they did, and it's kind of interesting,
this is actually a really good way to approach things. Just from a product management software development mindset, they basically put out the most frequently used tables that people would be interested in. So they didn't try and put everything online at once, right? They said, hey, this is what I know that most of our internal customers are going to use. And they went with that. In that group,
they had over 250 users internal in the company from engineering,
finance,
marketing,
and sometime, they didn't have the date in here, but they said it was near the time that they wrote the, or that, the blog post was live, they had a month where they had 8,000 queries that processed over a hundred petabytes of data, not including scheduled reports.
So these were ad hoc queries that were run. And so people ended up loving it, right? Like they
saw that the people were using it,
8,000 queries with over 100 petabytes of data process.
Like that's a lot of usage.
And so with that,
they proved out that people did want to use the platform.
And so from that point, they decided,
hey, okay, let's push forward with this.
8,000 queries though in total.
I mean, like that sounds low, right?
For 250 users.
Yeah.
This is not, like, customer facing.
This is like people actually like running queries at work, you know?
Right.
This is me trying to find out the trends or whatever.
Yeah.
Yeah.
So.
God, I'm such a jerk.
That's pretty good.
So check this out.
They also, I have a link in the, in the show notes for this,
but they have a really nice diagram, a very simple diagram of kind of what the data flow was
getting from their on-prem into BigQuery. And, and I'll summarize it here, but I highly recommend
going and taking a look at the picture because it'll give you a little bit more detail.
So basically what they did is they pushed data into GCS, or Google Cloud Storage if you didn't know that particular acronym, from their on-premise Hadoop data clusters. And once they pushed it up to GCS, they then used Airflow, I think it's Apache Airflow, if I remember right, to move that data from GCS into BigQuery.
And then once it was in BigQuery,
that's where they would use Data Studio so that all the end users could actually go create reports
that they'd want to look at, right?
Like things that they'd want to pull up later.
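A hedged sketch of what a GCS-to-BigQuery step like that could look like as an Airflow DAG; the bucket, dataset, table names, and schedule are made-up placeholders, the operator import path assumes a recent apache-airflow-providers-google package, and the real pipeline Twitter built was certainly more involved.

```python
# Hedged sketch of an Airflow DAG that loads files landed in GCS into BigQuery.
# Bucket, dataset, table, and schedule are placeholders, not Twitter's values.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

with DAG(
    dag_id="gcs_to_bigquery_events",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",   # name may vary slightly by Airflow version
    catchup=False,
) as dag:
    load_events = GCSToBigQueryOperator(
        task_id="load_tweet_events",
        bucket="example-onprem-export-bucket",
        source_objects=["events/{{ ds }}/*.avro"],
        source_format="AVRO",
        destination_project_dataset_table="my-analytics-project.events.tweet_events",
        write_disposition="WRITE_APPEND",
    )
```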
What the heck does Apache Airflow do?
Airflow is a platform created by the community to programmatically, yeah, some word, author, schedule, and monitor workflows.
What?
It's amazing. Apache has so many projects around both OLAP and just, like, streaming-type DAG stuff. Directed acyclic graph, something very directional.
It's Python. It's Python-based. It's really popular in GCP.
Open Source Workflow Management Platform
for Data Engineering Pipelines.
Anyone with Python knowledge can develop a workflow.
Airflow does not limit the scope of your pipeline,
so you can use it to build machine learning models, transfer data, blah, blah, blah.
Yeah. So I'm sure that was hyper complex, right? But that's why I said the, the diagram's worth looking at, just to see sort of the gist of what they were doing. And I'm sure there was a ton of work that happened to make all this, you know, really, really go live.
I found that, um, link, by the way, for the, uh, Stack Overflow, uh, infrastructure. They're still, like, not on, like, a cloud-based solution. Nine web servers, four SQL Servers, two Redis servers. But I mean, with that, they're putting out 55 terabytes of data a month. I mean, you know, I'm sure Twitter would be like, whatever, right? Our 240 characters is way more than that. But, um, I mean, it's still super impressive what Stack Overflow is doing, um, without it. So easy.
the point is is like i'll have a link in the, in the resources we like,
there'll be a link for the stack overflow one.
And the language they use.
Oh,
that their stuff's written in? Stack, you mean?
Yeah.
Are you,
are you trying to pick on my boy C sharp?
No,
I'm actually excited about it.
I love that.
Yeah.
No,
C#, ASP.NET MVC.
And their homepage loads in 12.2 milliseconds.
And their questions pages load in 18.3 milliseconds.
That's amazing.
All right.
So I think it's Jay-Z's turn.
Oh, God.
Jay-Z.
All right.
Tell you what.
I'll give you a discount.
If you give us a four-star review, we're going to treat it like a five.
Why?
We'll give you the full five-star thank you for four-star and up recommendations.
Because we love it.
It helps us out.
And, yeah, it's really good news for the show.
That's how podcasts grow.
It keeps us going.
It's the, I don't know, steam in our turbines or
however cars work. I don't know. Listen,
I'm telling you guys, if you all
go out to Twitter and you tweet about
Coding Blocks, I'm sure that we'll get that
many more listeners and subscribers
and we can grow the show and
it'll be better. And so
that's how it works, right? Am I doing this
right? I think so.
I think I got it. I think the quality has increased with every person, every new person
that's listened to the show. Yeah. Right. That's definitely happened. And when we used to all
record in person, I could have reached over and hit the mute button on Jay-Z's mic before he said
that nonsense, but I can't virtually do that.
Yeah, so now we have to deal with this.
Yes. Yeah. I, I know when I go look at a podcast, and, uh, you know, I'm looking for something new to listen to, that, you know, as long as it's four stars and up, you know, it's good. That's all we're asking of you. All of us are asking, four-star review.
You know, you were talking about the Taylor Swift facts and, you know, going, going, listening to her music and all that.
So I went to a Foo Fighters concert recently.
It was Everlong.
So long.
All right.
Well, it's time for my favorite portion of the show.
Survey says.
All right.
So let's play a little bit of feud here
This is, what, episode 198? So, Jay-Z, guess what, you get to go first this time. So, uh, let me see. I, myself... We asked a hundred people: name qualities of a bad boss. And you just name, just name one. I'll see who gets... uh...
I mean, okay, this is tough. This is a hard one. I,
uh,
I mean,
micromanaging is all that they think of.
I just keep coming back to that.
Okay.
Mean.
That's the,
I was thinking that too.
Okay.
So micromanager was the number one answer with 29 respondents. So that's 29 points on the board for Joe.
Alan, angry was the number three answer at 20.
I'll take it. I'll take it. That's not a bad showing.
That's, that's not a bad showing. What was number two?
Uh, okay, we'll run down the list. Micromanager; incompetent, number two, with 24; okay, irresponsible, 14; and oblivious, number... 13, or, I'm sorry, 13, uh, respondents.
All right. So my next question, for you and Alan,
you'll go first this time.
All right.
Name a bad job for someone who's afraid of heights.
Like how you put your answer.
You like that?
Yeah.
You already put your score in.
That's cheating.
You didn't get that.
A pilot.
A pilot.
Good answer.
Dang.
I'm trying to think what the electricians, you know, for the power company, I don't know what you call them.
Oh, like they're called pole climbers?
Is it linemen or something?
Linemen, yeah.
I think that's right.
Yeah.
Okay.
Pilot was the number two answer with 37 respondents.
Hey,
I was close on my number.
Yeah.
I'll,
I'll consider linemen as construction worker.
I think that's fair.
Yeah.
Number three answer 16 points.
All right.
Alan takes the lead.
Yeah.
This is getting interesting.
I got to, like, use a formula now, like, start... we're getting into big numbers. Let's see, put that there. All right, so, uh, for the win, last question here. Let me find it, where did I put it? Here we go. All right, for the last one. Jay-Z, this is your chance, you go first.
name a type of
building where it always
seems to be cold
always seems to be cold
mhm
also tough
I think I've
got an answer I'm thinking if I can
get a better one.
I feel like I should buzz in.
I should be allowed to buzz in.
Before you buzz in, I'm going to go ahead and say doctor's office.
Okay.
Okay.
Meat packing facility.
Okay.
I mean, a meat packing facility doesn't seem to be cold.
It is cold. Right, right. That's what I'm saying. Oh, I see where you're coming with it doesn't seem to be cold. It is cold.
Right, right. That's what I'm saying.
Oh, I see where you're going with it. You're like being logical.
It's cold.
If your job's to work in a freezer, guess what? It's going to seem cold.
It's going to be cold, yeah.
Well, so Jay-Z doctor's office was the number one answer.
Oh, man. I just got destroyed.
And Alan, what was it, a meat packing facility? Was not on the list at all.
Dang it.
Nothing even remotely close, so you get a big fat zero for that. Now, the score leading up to this was Alan in the lead, 57 to 45, but Jay-Z's number one there gives him a commanding lead, because he just walked away with 44 points on that one, for a final score of 89 to 57. Jay-Z takes the win.
Yeah, I couldn't even, I couldn't even have caught up with the second. What was the second?
Pretty sure
that means that Jay-Z has to buy the next round.
Is that what we were playing for?
Yeah, I think so.
Doctor's office, number one, 44.
Work was number two at 19.
Yeah, I'd have lost.
I guess maybe I should have considered your meat packing thing.
Yeah, that's work.
Work.
Yeah.
But all of these are somebody's work.
A doctor's office is somebody's work.
Yeah. The next, next one, classroom, 14. And lastly, the DMV, number four. Four people said that.
That's amazing.
Yeah, they spent a lot of time in the DMV.
A long time, right? These are the people with DUIs.
There are a lot more.
Yeah, you know, a DMV is probably a U.S. thing, huh?
Oh, yeah.
I don't even know what it stands for.
Something motor vehicles, right?
Department of Motor Vehicles.
Department of Motor Vehicles, yeah.
You go there for your license.
I don't even know what else.
Maybe passports.
I don't know.
Your tag.
Yeah, it's usually licenses for people that are, that were drinking and driving.
I've seen, I've seen that in there before.
No,
I swear.
I promise you.
That's why,
that's why I made the drinking reference.
There was somebody in there who,
the last time I went,
he was driven there by somebody.
It was like his third time having to come get his license back.
That's not scary.
The only time I've ever gone is to get the license plate, like your tag renewals.
Mine's not at the DMV.
I always had to go to the courthouse for that.
Oh, wait.
I guess that's the tag office I'm thinking of.
Yeah, the tag office.
Yeah.
DMV was only renewing your license.
Well, I guess we don't have a DMV in this state then.
Yeah, we do.
No, we don't have a DMV in this state then. Yeah, we do. No, we don't have a DMV.
There's no Georgia DMV.
Totally.
I think it's actually dmv.ga.org or something.
Georgia DMV.
I'm going to the Googles.
Yeah, we don't.
It's the Department of Driver Services.
Ah, Driver Services.
It's different.
Technicality.
It's different.
It is. It is different. Technicality is different. It is.
It is different.
Florida doesn't have one either.
Florida doesn't? DMV is like
a California or New York thing,
right? Like, that's where...
Yeah, here they call it the Florida Department
of Highway Safety and Motor Vehicles.
That's way too much. Yeah. Yeah, it's DMV.
DMV. That's what it is.
Wow.
Excellent song, by the way.
All right.
So I'm going to share a little secret if I can.
May I?
A little tip.
This is going to be like an early tip of the week.
So I recently had to change my password, right?
And it said that it required it to be eight characters long.
So I picked Snow White and the Seven Dwarfs.
That's a free tip right there.
It's pretty good. I like it. Don't, don't reuse that, though. Somebody will know it now.
All right. So, oh wait, I said seven characters? I meant eight characters.
You said eight characters.
Okay. You said it. You said Snow White and the Seven Dwarfs. Yeah.
It added up. It added up. Oh, okay.
All right. So getting it back into what Google was doing, they were shooting for.
Yeah, Twitter.
Oh, yeah, Twitter.
Twitter with Google.
Sorry, Twitter with Google services and whatnot.
So they were shooting for ease of use.
One of the big things was BigQuery was easy to use because it didn't
require anybody to install anything. They could navigate it all through the web UI, right? Like
they just log into their Google account and life was good. There were a few things that people had
to onboard with. I mean, I know the three of us, when we first started with GCP, you have to learn
about processes, resources, tagging, that kind of stuff. And so they actually created some internal educational materials to get people sort of up to speed on that.
And then after that, people were kind of up and running.
So that's really nice.
Now, they did look at various different things.
And this goes back to the airflow thing.
And this is why I wanted to at least note it earlier.
So loading data into BigQuery, right?
So we already said that they were using airflow, right?
They looked at several things.
So Google has a thing called Google Cloud Composer.
And basically what that is, is a managed airflow.
So airflow being an Apache project, you can set it up and run it on your own VMs or
whatever, right? Like that's on you. And that means you're managing infrastructure, which
you're trying usually to get away from when you're doing Google or cloud services in general.
Cloud Composer was supposed to do that for you, but they couldn't use it because they needed to
use what they referred to as domain restricted sharing.
And that basically meant that only if you're logged in as Twitter, can you see some of this stuff?
And it didn't offer that.
So they couldn't use it.
They tried to use Google Data Transfer Service, DTS.
It wasn't flexible enough to have data pipelines that had dependency.
So I think what they meant here is,
say you have a data pipeline that kicks off and runs something.
Hey, when that thing's done, trigger another one to run.
Hey, when that one's done, trigger another one, right?
Or wait for certain things to be ready before you can do.
So I think that's what they were talking about there,
and it just wasn't flexible enough.
And so that's why they ended up using Apache Airflow.
And again, they had to set that thing up,
get it running on their own, configure it all, all that kind of garbage. And then they were able to set up the services that they needed. Once they had data in BigQuery, and this is kind
of interesting, this reminds me sort of of what we were talking about with the Uber blog back in the
day, is once they got it into BigQuery, let's say that they needed
to transform some of that data. Well, they would basically create jobs that would use regular SQL
queries to do those data transformations, right? So they load the data all in there. They need to
polish it up. All right, run a job, have a SQL query, batch it out and put it into another data
set. That's for simple stuff. For the more complex things,
then they would go back to Airflow again
or use Cloud Composer with Cloud Dataflow.
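Backing up to the simpler case for a second, here's a hedged sketch of what an in-BigQuery SQL transformation job like that might look like from Python; again, the project, dataset, and table names are placeholders, not Twitter's.

```python
# Hedged sketch of a batch transformation done entirely inside BigQuery:
# read from a raw table, clean it up with SQL, write to another dataset.
# Project, dataset, and table names are made-up placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

destination = bigquery.TableReference.from_string(
    "my-analytics-project.curated.daily_engagement"
)
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",   # replace the previous output
)

sql = """
    SELECT user_id,
           DATE(event_timestamp) AS event_date,
           COUNTIF(event_type = 'like')    AS likes,
           COUNTIF(event_type = 'retweet') AS retweets
    FROM `my-analytics-project.events.tweet_events`
    WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY user_id, event_date
"""

client.query(sql, job_config=job_config).result()   # wait for the batch job
```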
And if I remember right,
Dataflow, we looked at that at one point
and that allows you to do things like,
it wasn't Flink.
What was underneath Dataflow?
Was it using Flink?
That doesn't sound familiar to me.
Didn't we look at that back in the day, Jay-Z?
Yeah, I'm trying to remember.
It was one of the streaming ones.
It's not Flink.
It was Spark.
But I thought there was some Flink extension or something you could do.
It might be.
Yeah, I can't remember.
But it basically allowed you to do data streaming type things in a managed pipeline that you didn't really have to mess with.
You write the code, and it would run it for you.
Man, it wasn't Flink.
I cannot read.
There was a language behind it.
Oh, but you're not talking about Dataproc.
Beam, Apache Beam.
That's what it was.
You could write your things in Apache Beam, put that into Dataflow,
and then that would run and do your data streaming.
Yeah, I'm all confused now.
Dataflow and Dataproc, where are they thinking?
Come on.
Yeah, man.
Naming's hard, right?
Yeah, that's true.
Even for infrastructure and services.
Yeah, I'm not good at it either.
I shouldn't be throwing stones.
No, I've got some badly named variables all over the place.
All right, so the next one up, performance.
Like, hey, if you're going to try and get a bunch of people to buy into your platform, it probably needs to work well.
So this is a big one.
And I know that Outlaw and I, when we were first looking at this kind of stuff, you have to know the lines and the boundaries for the
different technologies you're using. BigQuery is not for low latency, high throughput queries,
or for low latency time series type analysis, meaning you can't put a petabyte of data in it,
run a query and expect it to come back in sub second times. It's not how BigQuery works. It's not built for that. It is for being able to
run SQL queries that can process over huge amounts of data. And we already said earlier,
I think they ran so many queries over 100 petabytes of data, right? And their goal was they wanted their BigQuery queries to return results within one minute.
So it's pretty interesting.
They went about this by basically allowing their customers... Well, first, and these are kind of backwards, even, even in the paragraph up there that they had, they had their engineering team analyze 800-plus queries,
each processing around a terabyte of data each to sort of see what the times were going to be
when they came back. And then using that information, they actually allowed their
internal customers to reserve a number of slots. A slot, in GCP terms, is a unit of computational capacity to execute a query. So here's the interesting thing. What they did is, when you're
running on cloud services, and I'd imagine they're all sort of the same in this regard. I mean,
you guys correct me if I'm wrong here, but there's spot pricing that says, Hey, I just want to pay for what I use. And then there's fixed flat pricing, right? Like, Hey, I'm going to pay for X number of slots every month, right? Like, um, just set me aside a hundred slots and I'm going to pay a flat price for that. As opposed to,
you know, Hey, if I write a thousand queries and they hit these things and it could use,
you know, I don't know, 2000 slots or whatever. So they went with this fixed price thing and then
they were able to see, Hey, how many slots do I need to use for a particular query to get it to return in less than a minute?
Right.
And then they use that out.
And then different teams within the organization, within Twitter organization could say, hey, reserve me this number of slots, which would then get billed back to their department.
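As a small aside, if you wanted to do the kind of per-query sizing analysis described above, BigQuery's dry-run mode is one way to estimate how much data a query would scan before committing slots to it; this sketch is purely illustrative, not how Twitter's team actually did their analysis.

```python
# Hedged sketch: estimate how much data a query would process (dry run)
# before actually spending slots on it. Table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
    SELECT event_type, COUNT(*) AS n
    FROM `my-analytics-project.events.tweet_events`
    GROUP BY event_type
"""

job = client.query(sql, job_config=job_config)   # returns immediately, runs nothing
print(f"Would process {job.total_bytes_processed / 1e12:.2f} TB")
```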
But they could run their queries and submit at times. And I'm pretty sure most services have that type of
feature, right? The reserved, like I know AWS, Google Cloud, Azure, all of them. If you reserve
VMs, you pay a lower price for it because you're guaranteeing those cloud services that, Hey, you're going to buy this much per month, right? Whereas if you're doing spot pricing, you might use it way less, but
they're going to jack the price up on you because they want to get their money for that price for
the time that you're using it. I thought the reserved pricing was way more expensive for a V
like going back to your VM example. Not usually. If you say that you're going to reserve it for like a month,
if you say that, and it's not even a month,
I want to say with a lot of those cloud services,
if you say that you're going to use it for a year,
it's usually way cheaper.
Yeah, you're right.
Just looking at, like, that thing on the AWS blurb: EC2 Reserved Instances provide a significant discount, up to 72%, compared to On-Demand pricing. Right, because they know that if you're doing on-demand pricing, probably you're going to try and use an hour a day, right? Like, I'm just, you know, throwing a number out there. Um, but if you're going to reserve something, you're going to have that thing for 24 hours.
So you're kind of guaranteeing the money as opposed to it's almost it's the inverse problem of what Twitter was trying to solve.
Right. Like they didn't want to do the spot pricing because they didn't want the fluctuation in the bill.
So they wanted to do the reserve pricing so that they could at least plan for their budgets.
And I think it's the reverse problem for AWS and Azure and all them, right? Like, hey, if you'll tell me that you're going to use this, then at least I know I got money coming in, you know? So it's kind of a push and pull in that regard.
I was thinking, though, like what a weird time we live in, where, like, you know, it's not enough to just be able to query the data. Instead, you've got to think, like, hey, how many of these slots do you need for that query? And you're like, a slot? Like a what? You mean a CPU? No, a slot. I said what I meant. Answer my question. Don't go making up questions.
Yeah, man. And I'm sure that's super complex, right? Because you don't know how much data you're querying necessarily, and how many CPUs you're going to need, and what RAM, because you don't want to have to think about that, right? So they had to come up with a new term.
So yeah. All right, so data governance. Now this is interesting. This is really important, right? Like every company should care about this. Every developer should care about this.
Emphasis on the should. You don't have to, but you probably should.
You probably should. So Twitter was focused on discoverability, access control, security, and privacy. So, data and discovery management. They, and this is really cool in my opinion, um, they extended their DAL, their data access layer, to work with both their on-prem and their GCP data. So that enabled users to use a single API to query all their sets of data, right? So just imagine, hey, I want to, um, pull a list of, you know, users that use this feature, or the count of users that use this feature.
It can go across everything.
Their on-prem Hadoop data sets and their GCP stuff.
Like that's really cool.
This goes back to our Presto and Drill conversation from earlier.
Yep.
Yep.
Next.
I wonder, like, I'm sorry, but I wonder, like, how complex that was, though. Like, was it just, like, you know, you gave an example of, like, you know, some users, for example. Like, was that DAL limited to, like, the use case? Like, okay, you want users specifically? Yes, I know how to go and get that out of, uh, the data that we have in GCP, and I know how to go and get that out of the on-prem stuff. But if you want, like, more ad hoc kind of things, like maybe it's like, whoa, whoa, whoa, whoa.
Yeah, I don't, I mean, it'd be interesting to know what their implementation behind the scenes was exactly, if they had some, like, sort of GraphQL thing where people would just do some willy-nilly query. That'd be insane.
The reason why, the reason why I question it, though, is because, like, as you were describing it, I immediately thought, like, wait a minute, did they recreate Presto or Drill? Like, why would you recreate something that you were, A, already using, and B, maybe this should have been, ah, it already exists, right?
Well, that's why, that was, like, you know, what immediately came to mind. So check it out. Like, I didn't put this in the notes, but they actually, and we'll get to it in a second in terms of just the bullet point, but they have this thing where it would register data sets, right? Um, and I'm just going to read this bit here because maybe it'll make more sense.
We use scheduled jobs to enumerate BigQuery data sets and register them with the data access layer.
So that's part of it, which is Twitter's metadata store. Users will annotate data sets with privacy information and also specify retention for scrubbing. We are evaluating the performance and cost of two, oh, well, that stuff didn't matter. So that registering thing, right? Like they had something that would automatically push those data sets down into the DAL, into their metadata store. So maybe it was good enough to be able to live query these different things for you, assuming that they push the right metadata down there for their software.
That sounds kind of awesome. I mean, like, talk about, like, uh, the service discoverability type of pattern, right? Like now your data is, like, saying, hey, I'm available for you to query, and I'm going to, like, let you know. Isn't that awesome?
Yeah, that was pretty cool. Yeah.
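Just to make that a little more concrete, a scheduled job like the one they describe could look something like this. It's only a sketch: the metadata-store endpoint and record shape are made up, since the blog post doesn't show the actual DAL API, and only the BigQuery calls are real library calls:

```python
# Rough sketch of a scheduled "register new data sets" job, loosely modeled on
# the blog's description. The metadata-store/DAL endpoint is entirely hypothetical.
from google.cloud import bigquery
import requests

DAL_ENDPOINT = "https://metadata.example.internal/datasets"  # made-up URL

def register_bigquery_datasets():
    client = bigquery.Client()
    for ds in client.list_datasets():               # enumerate all data sets in the project
        dataset = client.get_dataset(ds.reference)  # fetch full metadata
        record = {
            "name": dataset.dataset_id,
            "project": dataset.project,
            "labels": dict(dataset.labels or {}),   # where privacy annotations might live
            "location": dataset.location,
        }
        # Push (or upsert) the record into the internal metadata store / DAL.
        requests.post(DAL_ENDPOINT, json=record, timeout=10)

if __name__ == "__main__":
    register_bigquery_datasets()
```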
So, um, the other things that they did: they had to control access to the data, which makes total sense.
They needed to use the domain restricted sharing so that only people that were logged in with a Twitter account would have access to it.
Right. They needed to make sure that data didn't leak out somewhere.
They used VPC service controls. So that basically prevented data exfiltration.
And it also allowed them to lock down from what known IP ranges people could come in.
So, excuse me, like if your company has a VPN, like Palo Alto is a popular one, right?
If you log in, you're probably on a known set of IP ranges.
And so by doing that, you're only going to have access to that VPC. If you can get in there,
um, the triple-A: authentication, authorization, and auditing. For authentication, they used GCP user accounts. Pretty simple. Makes sense, right? And that was for ad hoc queries. For anything that
was like a production load that maybe ran on a schedule or something like that, then they use
Google service accounts. Pretty common in a cloud type environment. Authorization. This was pretty
interesting. I don't think I'd ever thought about it because I haven't been that deep down into
like BigQuery. But each data set had an owner service
account. And then every one of those data sets also had an individual reader group, right? So
if you needed access to a particular data set, assuming it was something that was highly
sensitive, then you'd have to be added into that particular reader group to even be able to see
that data set. So it's kind of a nice way of making it to where you don't have to write a bunch of complex logic to be able to
access those data sets. You're either in the group to read it or you're not. So that's pretty neat.
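In BigQuery terms, that per-data-set reader group is just an access entry on the data set. Something along these lines would grant a group read access; the group email and data set name are placeholders for illustration, not anything from the article:

```python
# Sketch: add a "reader group" to a single BigQuery data set's access list.
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("my-analytics-project.sensitive_events")  # placeholder

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="sensitive-events-readers@example.com",  # the actively monitored group
    )
)
dataset.access_entries = entries

# Only the access_entries field is updated; everything else is left alone.
client.update_dataset(dataset, ["access_entries"])
```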
And then auditing, this is them kind of eating their own dog food. So what they would do is
anytime a BigQuery query ran, they would take the Stackdriver logs from that execution,
which had a bunch of detailed information in it,
feed it back into a BigQuery data set so they could analyze it later if they
needed to.
All right.
Well, that gets kind of circular.
Yeah, a little bit.
As I said, eating the dog food, eating your own dog food.
Your Stackdriver queries in BigQuery become excessive,
and then those logs make it back into BigQuery.
I would imagine they filter those at some point.
Oh, yeah, that would make sense.
Well, maybe not.
This is what happens when you let me get in charge of engineering.
Okay, listen.
Query everything.
Right.
Do it all.
Multiple times.
Oh, man.
So ensuring proper handling of private data.
So this was pretty interesting.
So this is why they say they registered all the data sets.
So if you had a new data set that was generated up there, it would auto-register with the DAL.
And then that way, any access to that data set was going through the DAL, is what I'd imagine is happening.
They didn't call it out directly, but that would make sense.
They would annotate private data.
Right. So if if you had a column in there, there was like a first name or something and they would say, hey, this is private.
They used proper retention.
This is a big one. If you've heard anything about GDPR and all that kind of stuff, like data privacy concerns and all that, you have to be very explicit about how long you're
going to keep this data around. And you're also supposed to say how you're going to use it.
So I guess, okay, well, finish out this section.
All right. So this last one here is making sure that they scrub and remove any data that a user deletes.
So if you go in there and you delete something off your Twitter feed or if you deleted a tweet, then they needed to make sure that they also deleted it up in their data storage.
All right.
So they had this data governance, the AAAs, authentication, authorization, auditing, ensuring the handling of private data, blah, blah, blah. Like, you know, think of it as, like, a checklist. Like, can we? Yep, we did that, we did that, we did that, we did that, we did that. And yet, remember that 18-year-old that hacked in and, like, said, send all your Bitcoin to here? Do you remember that hack? I do not. It was like last year?
Yeah, it was last year that the teen, Florida teen.
So, you know, I mean, Jay-Z, we're looking at you.
We're looking at you.
But yeah, he took control of some well-known accounts and used them to solicit Bitcoin.
Oh, yeah.
Like celebrities, right?
Yeah.
So like one of them was Apple.
And he said, hey, we're giving back.
We support Bitcoin and believe you should too.
And if you send Bitcoin, all Bitcoin sent to the address below will be sent back doubled.
You never heard about this?
No. But did he actually take over user accounts? Because that's definitely different than hacking into their data warehouse, right?
Yeah. Okay, that's fair enough, fair enough. So yeah, so the data warehouse they locked down for, like, analytics, but, like, live account stuff, they're like, I don't care. Privacy? Like, whatever. They probably used that Snow White and the Seven Dwarfs password. That's what, that's what got them. Wait, that's my guess. Don't give up on the password, man. I told you that in confidence. Yeah, well, my bad.
All right, so now they actually do have different categories for their data sets, which, this makes sense too. Um, these are all good things to sort of keep in mind when you're doing stuff like this, especially dealing with user private data. So highly sensitive data sets were
available as needed with the least privilege. Um, and so this is the one where they had individual
reader groups that were actively monitored. So if you needed access to some data that had
sensitive data in it, you had to be added to a very specific group and they knew everything that you
were doing there and they were watching it actively, right?
Like it wasn't some passive query that was going to happen later.
What would be the highly sensitive data sets there?
What are we talking about here?
What, what classifies as a highly sensitive data set?
So they didn't say, but my guess is it would be things like first name, last name, right? Like if it's Taylor Swift, um, so it's not the contents of the message necessarily.
Cause I was questioning, like, are we talking about the, like, the, you know, the DMs that are, like, you know, person to person, not the public tweets, right? It's public. Well, what about, um, so, uh, you know, they probably have information on, like, I don't know, um, maybe it's for, like, the verified users, like their contacts or phone numbers or whatnot, you know. Maybe that would be considered highly sensitive.
Well, I was thinking, um, yeah, I don't know, groups of people. Like if maybe they're working with the federal government on tracking down a cell of potential terrorists or something, then they don't want, uh, you know, people to figure out that they know, who, you know, they're trying to hide the information that they know about those people because they're working with the government or something. You know, something like that.
Are you speaking from experience, Jay-Z?
Yeah, I mean, you know, I'm just, I got my tinfoil hat on, you know, so they can't see me. Right.
It's good that you know that.
Well, also keep in mind, a lot of these sounded like they were event type data, right?
So it might have been that, you know, Alan Underwood clicked the heart on this one.
And so, you know, the message that I clicked the heart on, you know that I clicked the heart.
And then you know my name's Alan.
So the medium sensitivity ones, they anonymize things. So you still have user-ish type stuff in there, but it's hashed. And they actually said it's a one-way hash. So, so if you're trying to get things down to a user level, like, you know, how many individual people did this particular action, they could do it, but they couldn't actually see who it was, right? So anonymized. That sounds similar to what I think IMDB did years ago with some, or Netflix did, with some sort of contest or something.
Low sensitivity, all regular user information is removed, so you won't be able to get, like, granular level type stuff. And then public sensitivity, anybody can get it.
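A minimal sketch of what that kind of one-way hashing can look like, just to illustrate the idea; the salt handling and field names here are assumptions, not Twitter's actual scheme:

```python
# Sketch: anonymize a user identifier with a keyed one-way hash so you can
# still count distinct users without being able to recover who they were.
import hashlib
import hmac

# In practice the key/salt would live in a secrets manager, not in code.
PEPPER = b"not-a-real-secret"

def anonymize_user_id(user_id: str) -> str:
    # HMAC-SHA256 is one-way: same input -> same token, but no way back.
    return hmac.new(PEPPER, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

events = [
    {"user_id": "12345", "action": "like"},
    {"user_id": "12345", "action": "retweet"},
    {"user_id": "67890", "action": "like"},
]

scrubbed = [{**e, "user_id": anonymize_user_id(e["user_id"])} for e in events]
distinct_users = len({e["user_id"] for e in scrubbed})
print(distinct_users)  # 2 -- the counts still work, the identities don't
```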
So, and then that was where that paragraph that I started reading earlier comes in: anytime a new data set is added, then they have a scheduled task to go auto-register these things with the DAL.
Okay.
All right. Yeah. I got a question, but I want to reserve it until we get past this next section. Okay. I think this, this last section here is on cost
and this is really interesting. So what they said is when they started moving up to BigQuery,
remember they already had PrestoDB in play.
And they said that the cost was roughly the same for querying PrestoDB versus BigQuery.
Now, the important thing here is it's for querying.
And PrestoDB, keep in mind, they were managing all that infrastructure on-prem.
BigQuery is all managed for you, right?
It's just a service you use.
They said that there were additional costs associated with storing data in GCS and BigQuery.
And that was something that always kind of bugged me a little bit too,
is a lot of times you'd have to put the data in GCS. And then when you ingest that into BigQuery,
BigQuery is also storing it again in its own engine. So you're kind of getting double hit with that. So there were additional costs on top of that.
But for a lot of people that want to use BigQuery, that's probably worth it because you're looking for that processing power that you're not going to be able to do without setting up a bunch of infrastructure yourself.
We already talked about they use flat rate pricing so it didn't fluctuate. And there was one very interesting situation that I find extremely curious, and I'd love to know more about it, but they just put a line in here.
In some situations when they're querying tens of petabytes of data, it was more cost effective to
use PrestoDB than to use the GCS storage and BigQuery.
Yeah, I wonder what was different about that.
I know why.
Yeah, the only thing I can think is that what they call it slot,
that slot cost was probably crazy for having to pour through petabytes of data.
That's like it probably just took so long and used up so many of those storage slot or
those slot cost units that it just had to be crazy.
Whereas you sort of have a fixed cost of Presto if you're managing a cluster
on-prem,
right?
That's the only thing that makes sense to me.
Yeah.
And all of this was in relation to moving an exabyte of data. This was what started it.
And yeah, this is the why, and where they ended up, with an exabyte of data being moved into Google Cloud. Isn't that correct? And this is all for internal querying purposes.
How many pounds of hard drive would that be? Oh my God, man. Right? If they shipped hard drives with that much data, like how much would that be? What would the shipping cost be?
Yeah, I'm just wondering, like, is that like a dump truck worth of hard drives? Is it dump trucks? Is it airplanes worth? I don't know. I don't know if there's an easier way to figure it out than to just brain-dead it and figure it out.
All right, so hold on.
I just Googled this because that's what you do.
How many terabytes are in an exabyte?
There are 1 million terabytes in an exabyte. So if we assume, I mean, I'd say most data centers aren't running, like, um, 14-terabyte drives because they're too expensive, right? They want something cheaper. So let's say, out of their 1 million, divided by, what, eight terabytes is probably common. I was thinking five. All right, well, that's 125,000 drives. Yeah.
How many drives fit in a garbage truck? Who doesn't know?
So if we were to come at this from a different direction, then, because there's one way of doing this where, like, the drives aren't active, they're just literally packaged up and boxed and sent, right? And that's going to be, you know, you're going to have a higher compression rate of drives. So, like, really, now we're talking about, well, how effectively am I going to ship those drives? Am I going to put them in, like, you know, consumer-grade packaging where there's a lot of packaging material around them, or am I going to, like, squeeze some of that in closer, you know, to get that tighter? Um, so now you're just like, well, what's the size of the drive, period? And also, like, we're assuming hard drives and not SSDs, because of the density, how much more storage you can get in a spinning hard drive versus an SSD, but the SSDs are going to be lighter. And those are things, you can get those super tiny now.
But coming at it from a different approach, if you were to look at a comparison of, like, the AWS Snow, uh, Snowmobile service, where AWS drives a truck to your location when you need to move exabytes of data into AWS and you want to use their system to do it fast, it's a hundred petabytes per truck.
Okay, there you go.
But that's a live working truck, though.
Yeah.
Like, you got a lot of hardware.
Is that like a 40-foot tractor trailer truck?
That's got to be what it is, right?
I don't know.
It's a 45-foot container.
Yeah. Good Lord, man. All right, so there's a thousand petabytes in, uh, one exabyte, and how many petabytes did you say was in that truck? 100. So it's 10 big trucks. That's crazy.
Assuming that you wanted the data, like, actually, like, pluggable and ready. Right.
Yeah. Yeah. That's, that's insane, man. Yeah. That's crazy.
Also crazy that such a service exists.
Like how many times, like how many customers does Amazon have to where that's actually a need? Right. That they're like, yeah, no, we do, we do this a lot. Uh, you know, hey, here's your, here's your punch card, and, uh, you know, the 10th one's free.
It is a 45-foot truck, by the way. So it's 10, like, you know, tractor trailers.
Yeah. Man, that's, that's insane. Yep.
Eight foot wide, nine point, uh, six foot tall, 45 foot long, cube or square.
So you said eight feet wide, times 9.5 tall. Uh-oh. We're going to calculate some volume. Times 45 feet. That's 3,420 cubic feet of space available for these. I'm sure it's not packed all the way to the brim, but yeah, that's, that's a lot of space. Is it worth it?
I imagine, I imagine that that 45-foot container is basically like a moving data center.
Yeah, that's exactly what it is. A bunch of servers, a bunch of hard drives. Like, hey, we, uh, we need some backup power for our generators, but otherwise, where's your Wi-Fi?
Yeah.
Could you give me the guest password, please?
Snow White and the Seven Dwarfs.
That's right.
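If you want to check the napkin math from that back-and-forth, it only takes a few lines. The 8 TB drive size is just the number thrown around in the conversation, and the 100 PB per truck figure is the advertised Snowmobile capacity:

```python
# Back-of-the-envelope math from the conversation above.
EXABYTE_TB = 1_000_000          # 1 EB = 1,000,000 TB = 1,000 PB
DRIVE_TB = 8                    # assumed drive size from the discussion
SNOWMOBILE_PB = 100             # advertised capacity per Snowmobile truck

drives = EXABYTE_TB / DRIVE_TB
trucks = (EXABYTE_TB / 1_000) / SNOWMOBILE_PB

print(f"{drives:,.0f} drives")   # 125,000 drives
print(f"{trucks:,.0f} trucks")   # 10 Snowmobiles
```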
Hey, so somebody has some fun questions in here.
Oh, yeah.
So Elon Musk recently ended up buying Twitter.
That was a whole big fiasco.
What?
I'm not going to get into the details of that.
Yeah, you're going to have to Google it.
I didn't know about that.
Yeah, it's been kind of a thing.
But one of the things that happened is there were a ton of layoffs, right?
The company had like 7,000 employees, I think.
And they got rid of like half.
And so there was a lot of discussion on Twitter and a lot of other places saying,
oh my gosh, how does Twitter have so many employees?
Like, I could write it in a weekend.
And so I was curious, like, you know, obviously we just spent a lot of time talking about a lot of the other things that Twitter does besides just, like, a simple content management system.
Like, we talked about kind of pulling data out of the database.
But I just thought it might be fun to kind of bring up and say, like, could you build Twitter in a weekend? Anybody that can build Twitter in a weekend, I say just go for it. What's holding you back then? Yeah, you've had years to do it.
Yeah, you know, I think it's ridiculous. Like, obviously, uh, this is something that happens a lot. I think developers will often take a service and boil it down to one simple part of the use case and then think that that's all there is.
I remember when Dropbox came out, there were a lot of people being like, I could have done this with a NAS and rsync for free.
And so there's all these people putting out instructions on how to replace Dropbox.
Somehow, Dropbox is a hugely successful company that even has
several companies that spawned to compete
with it. They're all doing very well.
It's just kind of funny to me
that people talk about it.
I think anytime you're thinking that you could just build X in a weekend, it's because all you think it is is what you see, or your view of it. You don't think about the
financials, the billing, the advertisements, the machine learning, like all these things that are really necessary to making that thing successful.
You know, it's not just the way you primarily interact with it.
Even just that part, here, being on a platform and then being able to follow each other and tweet and see each other's stuff, that alone would take some time, especially because it's all live streaming. You go up there and things are constantly popping up, new, fresh, everywhere. Like, just that alone is already more than a weekend's worth of work, right? Not even to take into account authentication, authorization,
all that kind of garbage.
And then you start building on top of it.
Hey, once you get past a few hundred users,
your problems just got way different, right?
Yeah.
I mean, it would take me a weekend just to set up my DevOps pipeline.
Yeah, man.
Yeah, it's kind of ridiculous.
But it did remind me of a tweet that I just saw a couple days ago from David Whitney,
who's an interesting person on Twitter.
You should follow him on DevSpace.
And I'm going to paraphrase here because they used a naughty word.
It says, the more I think about it, the 10x developer trope is less rock star and more crappy cover bands
than any of those people would like to admit.
And I think that's a really good point.
It's like a lot of times when people think or people talk about,
you know,
these kinds of great accomplishments or what they could do in a weekend,
you're thinking about this,
like smoke and mirrors kind of demo,
basically standing up something that's made out of,
you know,
balsa wood and paper that just is totally fragile and unmaintainable.
And it's just not nearly as robust and significant as the real thing.
And so I just, all the talk on Twitter of people talking about building Twitter in a weekend
are usually just thinking about a UI and a simple database,
and it's just not the same thing.
You know, it's funny, even that.
So you had mentioned, or we mentioned on the last episode,
there's some thing that people are installing or using
that's like a Twitter replacement type thing.
What was it?
Oh, Mastodon.
Yeah.
Mastodon.
Even just setting something like that up can take a day, right?
Let alone programming the thing.
So that's where I think it's so crazy that developers,
especially experienced developers, will go out there and make a statement like, oh, I could totally make this in a weekend.
We are a confident bunch that are also opinionated and a little sure of ourselves.
That's how it always starts.
We're like, I could do that in a weekend, and then we'll start. And then, you know, we get lost. We'll go down some rabbit hole of authentication. You know, this is the one that we always tease Alan about. We'll fall down some rabbit hole of authentication and then come up for air, like, 18 days later. And like, wait, the weekend's over. Yeah. What was I doing? I don't even remember anymore.
Yeah, totally. It's so easy to be dismissive, which is ridiculous. And I think it's dismissive of all the hard work that's gone into it, but that's a side note.
So do you guys want to create a Twitter?
You want to do it? I think we can do it in the weekend.
I can't even set up a Mastodon one.
Sounds like we already got the
architecture right here. That's right.
I think individually we couldn't do it, but the three of us
together, we could build Twitter in a weekend.
Definitely.
Yeah.
Long weekend.
Yeah.
And this was just another kind of a little bit of a kind of insight into what's going on there.
Another thing that was tweeted out, and this one was actually tweeted by Elon, who mentioned that.
I'm going to paraphrase another tweet here.
He'd like to apologize for Twitter being super slow in many countries because the app is doing more than a thousand poorly batched RPCs just to render a home timeline.
Which, you know, like that's what he tweeted.
And there's been some talk on, you know, whether or not that's true or, you know, like how accurate that statement is and who knows.
But I will say that I am not surprised to hear that there are a whole lot of calls being made to external services. And so if you tell me that the home timeline is making, you know, potentially a thousand calls to render, like, I believe it. You know, I'm not, like, so disgustingly shocked by it. Like, I could see that sort of thing kind of happening if you take the request and you think about all the various things that kind of spin off of it. Like we talked about just the analytics side of it, like just knowing that someone refreshed the timeline, and all the various services that, you know, end up going through the pipelines until they end up in their final, you know, data stores. It can be just a ton of data moving around. And so, you know, while I don't know that it's necessarily a thousand, I'm not surprised to hear that it's a whole bunch.
Yeah.
And we'll have links to that in the show notes as things are referenced.
There's some interesting comments in this thread.
I will say that.
Yeah.
Yeah, it's juicy. So, you know, if you want, uh, you know, a juicy read here, one of the developers that worked on it responded to, uh, claims Elon made. That person has since been fired. But there's been a whole lot of interesting stuff. I'll actually have a link to a news bite, uh, like a news article about the whole thing that's got links to all the various tweets. It kind of covers the drama.
There's been a lot of drama. A little bit of drama.
Yeah, I've been avoiding Twitter, but I do love the technological side of it, because it is seriously one of the most insane engineering things that exists. I mean, the amount of throughput those people have.
Yeah. And you know, it's hugely influential, too.
Just, like, if you think about, like, when's the last time you saw a bus or a truck or something and had the little blue bird on it?
Every commercial, you know, you see a commercial for aspirin and it's got the little, like, follow us on Twitter.
It really is a big part of, like, how people communicate and just, you know, it's been a big part of kind of modern culture.
I mean, the term hashtag only exists because of Twitter, right?
Like, it's insane.
It's definitely like a fascinating set of problems
that they created for themselves, right?
Because like you said, this didn't exist before.
But yeah, like trying to deal with these things in real time and, you know, yeah, it's insane.
Yeah. I was just thinking there, like when you mentioned the bus example, I didn't know where you were going. And I was like, man, when's the last time I saw a bus? I think it was being driven by Sandra Bullock and Keanu Reeves, and nobody wanted to be on that bus. So maybe that's where he's going.
Yeah. Yeah, yeah.
All right. Well, we'll have a bunch of links in the resources we like section for this, uh, this episode.
And with that,
we head into Alan's favorite portion of the show.
It's the tip of the week.
All right.
And here I've got a tip about, uh, Kubernetes. So, you know, I love k9s, the command-line interface
that I use for just doing all sorts of Kubernetes stuff all the time.
Love it.
It's great.
Because of that and because of how much I love it,
how comfortable I am with that tool,
I've really not looked at other ways of kind of interacting with Kubernetes
until fairly recently when someone convinced me
to install
the VS Code plugin for Kubernetes. I just didn't think I really needed it, but I really like it, turns out. So it kind of gives you, like, almost like a directory-browsing kind of layout in the left nav for, uh, finding your resources and navigating your contexts and things like that. It's just kind of nice.
And of course, it's also really nice to be able to right-click on a pod or something and attach to it, which I'm sure everyone here,
at least, I don't know about listeners,
but the three of us have definitely done various things
with attaching VS Code to a container or to a remote server,
done their live sharing type features.
It's basically, you can do stuff like that with it where you can kind of connect to a server
and then open up VS Code.
And it's as if you're working on that machine and you can open a terminal in it.
You can do all sorts of cool stuff locally, which is just really convenient.
And so I am kind of a little bummed that I put off using it so long
because it has been really handy and has been a nice complement to having k9s for kind of shooting in, looking at logs, shelling in, that sort of stuff.
So it's nice to have more options and I'm glad to have it.
One thing I did want to mention is you have to be careful.
One thing I really like about k9s is it does not change your global context when you change context in k9s.
So, for example, I work in several different contexts throughout the day.
And I keep my local context always set to just my local instance, my local cluster.
And so if I ever need to run a script or anything, I always have to pass the context that I want.
And if I make a mistake and don't pass the context or don't pass the namespace, it just
affects my local, which is great.
But in Kubernetes plugin and Visual Studio Code, it's easy to kind of double click something
and not realize that you've changed your local context to a different namespace or something.
So you just have to be careful because I like to keep that always set to something that
can't damage too badly.
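If you script against clusters from Python, the same "always be explicit about the context" habit looks something like this; the context and namespace names are placeholders, not anything from the show:

```python
# Sketch: always load an explicit kube context instead of relying on whatever
# the global current-context happens to be pointing at.
from kubernetes import client, config

# Explicitly pick the harmless local context (placeholder name).
config.load_kube_config(context="docker-desktop")

v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(namespace="default")
for pod in pods.items:
    print(pod.metadata.name)
```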
Yeah, that's scary.
I wonder why, on all the things that they're running behind the scenes, they didn't just pass the context with the command. That kind of stinks.
But yeah, most people running kubectl commands are used to kind of dealing with that problem, so, you know, it doesn't bother me too much. It's just something I've gotten to kind of take for granted because I just, right, you know, always use k9s.
Yeah, Visual Studio Code is just an amazing tool. It's so good. It really is.
So, uh, for my tip of the week, as you were speaking, it reminded me: we've talked about iTerm2 as a favorite, um, terminal replacement, you know, on macOS. And, uh, I talked about it before. So it's been mentioned before in a couple of episodes, episodes 147 and 161. So if you're not using it, um, you need to go back and listen to those episodes. I don't know why, why aren't you using it yet? But, um, in one of the episodes, I want to say it was the 161 episode, uh, double checking. Yep, it was 161.
Um, I had mentioned using the split windows, right? And, like, my preference is to split the windows, uh, vertically. So you can just, uh, Command-D and it'll split it out, and you can, like, have multiples of these.
Right. And so as you were describing your Kubernetes environment with Visual Studio Code, I was like, oh yeah, you know what? It, like, dawned on me this week, like, my favorite view right now has been, you know, so we've talked about, like, these widescreen monitors that we have, you know, we've grown to love them, right? Because you can have, you know, a lot of documents open and see things at one time. Well, for my Kubernetes workflow, my favorite pattern has been to have iTerm, but split three times. So I have three windows. My leftmost window, because we love Skaffold, so my leftmost window I'm using for Skaffold. My rightmost window I'm using for k9s. And then my middle one is, like, all of the, you know, ad hoc commands that I want to type in, you know, randomly in there.
So, yeah, I just wanted to like give another shout out to iTerm or iTerm2 specifically, I guess.
But yeah, because it just makes life so easy.
And that wasn't even the tip of the week that I planned to discuss.
So that's just a freebie right there.
So you got two out of me.
You got two freebies.
The first one was a good password, right?
I think that was the first one.
And then, and then iTerm2. All right. So my real tip of the week, though, was, uh, so I learned of this today, and I don't know if you guys have, but I gotta get more into this, uh, cause this looks super promising. But, uh, there is now a kafkactl command line tool, so that you can do all your, you know, Kafka management using this tool.
And the beauty of it is if for those who are like, wait a minute, but there's already like, you know, a bunch of scripts that Kafka comes included.
You know, you just go into your bin folder of your broker, for example,
or your, your connector, whatever. And you know, there's a Kafka topics shell script.
That's like super cryptic. And you got to like, do I provide the bootstrap server or do I provide
the ZooKeeper? Wait, when does it matter? Do I need both? Well, with the kafkactl command, you don't need to do any of that. And, like, it's got just things that make sense, like verbs that make sense for what you want to do. Right. So it's very Kubernetes-like in that regard, from a CLI perspective. So I saw this today and was just, like, mesmerized by it. I was like, oh, that just looks awesome. And so, uh, I wanted to share that.
That's most excellent, man. I was actually looking for, for something that Dave Follett
had mentioned. So, um, I think on the previous episode, uh, there was, there was something about
seeing the process that actually had a
handle on a file, right? And he had mentioned something, he actually sent a correction. I think what he had told us wasn't exactly right, and I hadn't double-checked it, but he sent me something. So I can't find it; if I can, I'll get it in the show notes for this so that it'll be down there
in the tips. Um, so where he was, uh, cat-ing the process ID under /proc. Yeah, it was something a little
bit different. He said that it wasn't actually 100% spot on what he had said before. Um,
but we'll give him a pass because he has given us a lot of good tips. So, um, I'll try and get
that in there. Also a note on the previous shopping episode, I talked about the Roku
streaming stick 4k plus, right?
And outlaw and I got into a little conversation about, well, does it actually support Atmos?
So it's really weird on their website.
It definitely does not show that it supports Dolby Atmos, right?
Like it says like Dolby HD plus or something.
I can't remember.
Um, so I actually did a test on my soundbar that has
Atmos and all kinds of other stuff. And I tried content that was both Atmos, um, stereo, um,
just DTS surround and all of them. And every single one of them coming through the Roku
registered properly on the soundbar. So if it was Atmos, it showed Atmos on the soundbar.
If it was stereo,
it showed stereo.
So it's at least passing it through.
I think,
and what I was saying last time is I don't think it decodes Atmos through the
Roku,
but it'll pass the signal through.
So I believe that's what's going on.
I haven't looked it up,
but I did see that it would show up properly in both places.
Well, that's the thing that was so confusing to me.
Like I went back and looked at it too out of curiosity because they had some like weird
wording for it.
I don't remember it now off the top of my head, but because, and you mentioned it just
again about the decoding and I'm like, yeah, none of them are decoding it.
That's what the, that's what your receiver is doing because the decoding
is actually deciding like, Oh, this is supposed to go to that channel. And that's supposed to go
to that channel. You know, like that's the decoding, right? Well, sort of, man, this is
where things get really confusing. So all confusing is what we do here, right? Well, I mean, for years,
like with receivers, one of the reasons you would upgrade your AVR at home is because it had more decoding or more codecs that it could handle, right?
Like DTS, DTH, DTS HD, whatever.
And so if a signal got passed to it and it could read that signal as DTS HD and it supported it, then it would actually play it in that.
If it couldn't, then it would basically fall back to some standard AC three type thing or
whatever.
Right. Like the stream might have, like, here's the two or three or four different things that it's available in. It's available in Atmos, it's available in, uh, you know, 5.1, it's available in stereo. And so, yeah, it's going to try, uh, you know, it's like a protocol agreement, like a handshake, you know, a TLS handshake. It's going to try to, like, find the best one, right? And then it'll, like, successively go down the list. And that's why it confused me when you were talking about the decoding, because I'm like, well, they're all passing through, right? Like even, like, an Apple TV is passing it through.
So yeah, what's interesting is I think what the Roku stick will do is it can actually
decode things to DTS HD plus or whatever the things that it supports, but it will pass along
the original stream information. So if your receiver can handle Atmos, then it'll do it, but it will not decode Atmos and try and send any information to the TV or receiver or whatever saying, do this.
So I just wanted to say that. So if you do have something that's capable of doing Atmos or
whatever, that Roku will actually pass it along and it will get picked up from it. So again,
for like when it was on sale and probably during black Friday, it's going to be like 30 bucks or 25 bucks, $25, man. It's a stupid good deal on a streaming
thing. So, um, at any rate, all right. So my, my tip of the week,
so I got this from, I think, that same meeting that Outlaw got the other thing from today, um, which was really good. Kafka, I think by default, and I may be wrong, I should have looked it up before I actually even said this, but I want to say the message size is supposed to be one megabyte. Does that sound right?
I think it is the one, yeah, I think that is right, yes.
Yeah, so here's where I'm going with this. All right. Thank you. Um, so yes, what that means is,
okay. One megabyte. So by default, you can send messages up to one megabyte in size to Kafka and
it'll write it. If you send something bigger, it'll basically blow up and it won't write it
because it's like, Hey, I can't do it. It's not going to truncate it. It's just not going to
write the record. Um, and we ran in situations where we actually needed more than that,
or we thought we did. And there's an interesting thing you can do. So you have in Kafka producers
and consumers, producers are the things that are writing things to Kafka consumers are things that
are basically, you know, listening to and getting messages from Kafka in your producer, you have the ability to
turn on compression actually on a message by message basis if you want to. Um, but the
interesting thing is they have several different types and I've got a link here. You can do no
compression, which I think is the default. You can do gzip, Snappy, or LZ4 compression. So if you had
a message that was too large, it was greater than
one megabyte. Let's say it was a JSON blob. You could likely use some Gzip or Snappy compression
to squeeze that thing down before you even send it to Kafka. And then you might be fine. So you
may not need to increase the size of your default messages that Kafka can handle because there are some downsides
to doing that, right? Like Kafka is supposed to be really fast. And if you increase the actual
size of the messages, it's writing longer per thing that it's doing potentially. So this might
be a really good solution for you. Not only that, but keep in mind that the way you size a Kafka
cluster is based on the amount of bandwidth that you expect to go through. So if you're increasing the size of your message, then you're likely going to impact the size
of your cluster.
Yeah, you're rethinking.
Or what you need to have.
And Kafka cluster sizing can be a pain in the butt, because if you decide, like, oh, we have five brokers today, and then tomorrow you decide, you know what, we need seven, taking advantage of those extra two brokers now, um, becomes a hassle. Uh, it becomes tedious to, like, rekey messages to spread things out across that, because it's deterministically, you know, deciding where the key belongs for that piece of data. And if you are changing the infrastructure around, then, you know, you've got to rekey things. So that's why, like, one of the recommendations related to sizing is, um, uh, you know, that you size the cluster to last for two years and then, you know, come back at it after that. Although there is another thing, too. So, okay, here's your third freebie from me.
Wait,
before you go to that one real quick,
one other thing just to finish this up.
So the interesting thing about when you write these compressed messages,
the compression information is stored with the message.
So when it's written to Kafka,
Kafka gets this,
this crushed up compressed thing.
It just writes it.
Whatever consumes it will see the metadata about what compression technology was used.
And so it decompresses it at the consumer level.
So it can actually truly be on a message by message basis.
So it's really, really cool how they baked this in.
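As a concrete example, here's roughly what turning that on looks like with the kafka-python client. The topic name and payload are made up, and whether compression alone keeps you under the default 1 MB limit obviously depends on your data:

```python
# Sketch: produce a large JSON blob with producer-side compression enabled.
# The broker stores the compressed batch; consumers decompress it transparently
# based on the codec recorded with the message batch.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="gzip",          # also: "snappy", "lz4", "zstd", or None
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

big_event = {"user": "example", "payload": ["lots", "of", "repetitive", "json"] * 10_000}

producer.send("events", value=big_event)  # hypothetical topic name
producer.flush()
```

One caveat worth knowing: the size limit is checked against what actually gets sent, so whether you still need to bump the broker and producer size settings depends on how well your particular payloads compress.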
All right, go ahead.
Tip three.
So there's another one that I haven't had a chance to like dig into
in great detail yet, but it's been on my radar now for, for quite a minute. But, um, I think it was
created by LinkedIn, but it's called cruise control and it's for Kafka to help manage large
Kafka clusters at scale. And so, um, you know, some of the things that I was describing,
like might be like, there might be somebody who's like, oh, no, that's old advice.
Like now, with Cruise Control, you know, you can easily add and remove brokers, you know, because that's literally listed as one of the things that you could do: rebalance the cluster easily using Cruise Control, and things like that. So I haven't had a chance to dig into it. So that's one of the things I've been kind of curious about, is, like, when they say that it'll do it, okay, but like, how quickly is it?
Because I remember from past experiences trying to like rekey messages in a topic because you
want to like change partitions or something like, like, you know, based on the amount of data I had
at the time, which was a large data set,
but just done for testing purposes.
And it was like an eight-hour ordeal to redo it.
It was not an exabyte.
Yeah, yeah, right.
So I mean, that's why I'm kind of curious to see,
like, okay, well, what is all this doing?
So I don't know. I'll put that out there.
You know, it'll be in, in the, in the links and, you know,
maybe we'll all learn something new. That's amazing. And some cool technology.
So yeah, whatever. All right. Well, Hey, let me ask you this.
If you watch an Apple store get robbed,
are you an eyewitness?
Okay.
Just asking for a friend.
I like it.
Yeah.
One last question.
How do you fix a broken pumpkin?
Smashing is all I think with my mother.
No,
it's gotta be something about squash.
Oh, yeah, good. A pumpkin patch.
Oh, jeez.
That's even better. Excellent.
Thank you, MikeRG, for those.
Excellent.
And now we head into
Jay-Z's favorite portion of the show.
Goodbye. It's the end of the show.
Subscribe to us on iTunes, Spotify,
wherever you like to find your podcasts,
and be sure to leave us a review.
I know Jay-Z said he was going to give you a freebie,
that if you gave us a four, we'd treat it as a five,
but I don't know.
I call shenanigans on that.
You want to give us the five, right?
I mean, am I wrong?
Yeah, the only way that four happens is if you accidentally slipped and clicked the button when you were hovering over the four. That's the only thing that makes sense.
All right, so I think you clicked the four because you're still upset because you were trying to buy your Taylor Swift tickets and you didn't get them in time, and you're upset and you're taking that anger out on us. And I don't think that's a good look on you. Don't take your aggression out on us. That's not fair to us. We didn't do it, right? It's not our fault that Jay-Z bought all the Taylor Swift tickets.
What if he only liked 80% of the show?
Ah, why, Jay-Z? Why?
That's because you didn't listen to the other 20%. I mean, this is not our fault. You tuned out, right?
That's right. Come on, hook us up.
All right. So, hey, while you're up there at codingblocks.net, make sure you check out our show notes, examples, discussions, and more, and send your feedback, questions, and rants to Joe.
I did not see that coming.
Yeah, I'm in slack.
That was great.
My bad.
No, did you?
Hey, it's your turn, Joe.
Oh, hey. Yeah, and make sure to follow us on Twitter, while it still exists, at CodingBlocks, or head over to codingblocks.net, where you can find our social links at the top of the page.
The times, they are a-changing. Who knows?