Programming Throwdown - Hadoop
Episode Date: December 26, 2012
This show covers Hadoop, a set of several languages and libraries for working with big data. Tools of the show: Emacs and Chrome Browser Sync. Books of the show: Hadoop: The Definitive Guide ...http://tinyurl.com/cp3mw32 and Anathem http://tinyurl.com/cas8bux
Transcript
Hosting provided by Host Tornado.
They offer website hosting packages, dedicated servers, and VPS solutions.
HostT.net.
Programming Throwdown, Episode 23, Hadoop.
Take it away, Jason.
Hey, everybody.
So we actually got a pretty awesome question from the audience that we want to kind of start the show with.
We have an audience?
I thought it was just you and me and our moms.
So yeah, our mom had this awesome question.
It was, how's it going?
Wait, wait.
Our mom doesn't make any sense.
Oh, yeah.
No.
Never mind.
Your mom?
This sounds like we're making bad mom jokes, but I'm not really.
Yeah.
I think I just dissed my own mom.
Anyway, so Marco Aurelio sent in a question,
and he's looking at doing some C# .NET coding for a website that he's interested in making.
His question was, he knows that with server-side stuff, the code runs and the CPU is used on the server.
So if you have one user, you're using some amount of CPU, and if you have a thousand users, you're using maybe not a thousand times as much, but the CPU load scales up, right?
So his question is, is it true that big websites like eBay and Amazon and things like that run things on the server side?
And how do they do that? And how does the server not explode?
And then along those lines, what languages should he focus on learning?
Yeah. So, I mean, it sounds like he had a question there. He has an idea.
And so his idea was to build a website, but he's worried about it scaling and what technology he's going to choose.
Yeah, totally.
So I have a slightly different answer, and mine is that I think sometimes in the tech community, and as engineers, you over-engineer things.
So this is a classic problem.
And so having the issue where you can't scale your website because you're growing too fast would be a very good problem to have.
I mean, there are certain things you don't want to do to aggravate it, or especially
if you get to the point where somebody is funding you or you're making a lot of money
and your website goes down and people are depending on you.
That's a really bad problem.
Right.
But when you just have an idea, it is more important to get the idea out there, to start working on it, and to find other issues, like, okay, I need to shift this idea slightly, or I need to do this, than to spend tons and tons of money.
If you ever read about some of these people, like the ones that started Google, it was two guys and some university equipment.
Or Mark Zuckerberg at Facebook, it was just running in his dorm room off his computer, right?
I mean, none of these things started with, like, I'm going to start with 10,000 computers
and I'm going to serve millions of...
I mean, that's how you go bankrupt, right?
Did you know, so sorry to interject,
but Stack Overflow, the website that has, you know,
many questions and answers for programmers,
it actually started with the two founders
answering all the questions themselves.
Nice.
Yeah, isn't that crazy?
So they just, and I don't even,
I think they were talking about it.
They didn't even have a website.
Like they had this form where you'd click submit,
and then they would create a .html file with your question and answer.
Like that's how they started.
And now Stack Overflow is gigantic.
There's like probably thousands of questions a second, right?
Right, yeah.
And I mean, I think even you take that example, right?
So there's an idea, and you can read a lot about this.
A lot has been written.
And this isn't exactly an entrepreneurial show.
I don't think I said that correctly.
I think it's pretty entrepreneurial.
But okay, we're kind of, okay, all right, all right.
But minimum viable product, that's kind of the term, right?
So the idea there is, in fact, don't even make a backend, to Jason's example here of Stack Overflow, whether it be urban legend or true, the internet shall tell us.
But, you know, if you just make a front end that looks like the website, where people can try it out and test it, you can get a ton of feedback and save yourself days, months, years of development time.
So that's kind of where to start.
And then from there, Jason will address how that works.
Like, how does one serve millions of users a second?
An hour?
A day.
Millions of users a day.
Let's go with that.
So, yeah, I mean, but to Patrick's point, just to wrap that up,
even when I was making Trivipedia, that trivia Wikipedia thing,
I started off with just a flat file,
and you just scan through that file for everything.
So if I needed to find a user by name, I had to search through the whole file of thousands of users or hundreds of users to find that one. There's no index.
Millions of users?
Now there's what, like 2,000 monthly active users. And so let's say there's 100,000 users in the system.
So I couldn't do that anymore.
Like I couldn't just scan through all the usernames.
And so I had to, you know, I ended up making ZombieDB,
which is open source.
And I think I posted on the podcast on G+,
on the podcast's page about that.
But so I had to actually make a database.
But that came months later, right?
And if no one had been interested in Trivipedia,
then I would have saved myself the trouble of having to make that database, right?
And I probably saved myself the trouble of doing other things,
which I would have done if there were millions of users, right?
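The difference between scanning a flat file and using an index, as described above, can be sketched in a few lines. This is only an illustration, not how ZombieDB works: the record layout, the `find_by_scan` and `build_index` helpers, and the generated user names are all assumptions made up for the example.

```python
# Hypothetical user records; the names and fields are invented for illustration.
users = [{"name": f"user{i}", "score": i % 100} for i in range(100_000)]


def find_by_scan(records, name):
    """Flat-file style lookup: check every record until the name matches."""
    for record in records:
        if record["name"] == name:
            return record
    return None


def build_index(records):
    """Build a name -> record dictionary once, so later lookups skip the scan."""
    return {record["name"]: record for record in records}


index = build_index(users)

# Both lookups find the same record, but the scan touches up to 100,000
# entries per query while the index does a single dictionary lookup.
scanned = find_by_scan(users, "user99999")
indexed = index.get("user99999")
```

The scan is perfectly fine for a few hundred users, which is the point being made here: the index only becomes worth building once the data has actually grown.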
So yeah, to Patrick's point, you totally want to start small.
Don't worry too much about your server blowing up or anything like that.
But now let's answer the question from a tech standpoint.
How does this work?
Do all these guys like Amazon and Google and eBay,
do they all just write JavaScript or do they do stuff on the server?
So they do things on the server,
and it works through a process called sharding.
And Patrick has already helped me out here,
but I'm going to take a first crack at it.
Typically, you'll go to, let's say, Amazon.com.
What'll happen is that'll go to a front-end server.
So this server will get your request, and it'll say,
and there's several front-end servers all around the globe,
so you'll go to the front-end server that's closest to you.
So it'll take your request, and it'll say,
okay, what servers are pretty lightly loaded?
And the servers themselves, the back ends, are constantly telling the front end, hey, I'm not busy, or hey, I'm slammed processing people's orders, right?
So the front-end server will look for a server that's not busy and then redirect you to that server.
So then your request will go through to that server, which will do all the things like
query the database, see if you're logged in, do all that stuff.
And keep in mind the database itself is on another server.
So you have...
Set of servers.
Yeah, set of servers.
And so there's this other process going on where the database has a front end, which, you know, keeps the load pretty balanced among all those database servers.
And so this process of having all of these machines sort of working together and having the least used machine, you know, handling your request is what keeps websites like Amazon and eBay up and running.
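A minimal sketch of the routing just described, where back ends report how busy they are and the front end hands each request to the least busy one. The `Backend` and `FrontEnd` classes and the idea of measuring load as a count of active requests are assumptions for illustration; real load balancers use more signals than this.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active_requests: int = 0  # load most recently reported by this back end


class FrontEnd:
    """Routes each incoming request to the least-loaded back end it knows of."""

    def __init__(self, backends):
        self.backends = list(backends)

    def report_load(self, name, active_requests):
        """Called by a back end to say how busy it currently is."""
        for backend in self.backends:
            if backend.name == name:
                backend.active_requests = active_requests

    def route(self):
        """Pick the back end with the fewest active requests and assign one more."""
        target = min(self.backends, key=lambda b: b.active_requests)
        target.active_requests += 1
        return target.name


front_end = FrontEnd([Backend("web-1", 5), Backend("web-2", 1), Backend("web-3", 3)])
chosen = front_end.route()  # "web-2", the least busy server at this moment
```

The same pattern can be repeated one layer down, with a front end sitting in front of the set of database servers.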
And what we're going to talk about on this show, Hadoop, is more for batch processing.
It's not something that you're going to do in real time when you're accepting a web request.
But a big part of doing anything on the web or with big data involves managing many different machines
and routing work to one machine or the other.
That's right.
Yep.
And so, I mean, yeah, and I think there exists server backend technology
for many of the most common languages for people to write in.
I mean, there's even Node.js, so you can write JavaScript on the backend
and use that as a server.
You know, all the major ones, I mean, Java, C++,
I would assume C# as well.
I mean, all these things have back-end pieces that you can write to.
There's middleware kind of stuff, let's call it, that handles this load balancing and splitting. And then you write your server in your language on the back end.
So, I mean, it's slightly more than what language to use.
It's about understanding the underlying problem.
But you don't really know where the scaling problems are going to be until you are almost done, right?
So let's take Slashdot, right?
They have news stories which are being displayed, but their bottlenecks are going to be very different
from a site like eBay, which is handling transactions and counting down auctions,
or Amazon, which has to handle when something sells out
and they have no more and they don't want to oversell something.
There are amounts of traffic,
there are hits on data.
I mean, those things are going to have different bottlenecks
at different points along the growth curve of users
and on how many transactions are being handled
and that kind of stuff.
So, I mean, if you solve it too early,
you risk solving the wrong problem.
Yeah, that totally makes sense.
So, all right, well, on to our news.
News.
I have the first one.
All right.
Okay, yeah, so I'm doing the first one.
I saw an article linked today on Hacker News,
which is interesting, about the Fast Fourier Transform.
So this is kind of like an FYI.
It's not exactly news. It's not something updated. But it was a pretty good article. I kind of enjoyed reading it.
So I have a little bit of background with what the Fourier transform is.
So without getting into the deep math, this guy does a good job of explaining it thoroughly, but if you just kind of skim read it, you can also get the gist.
So Fourier...
There's code too.
Yes, and he even posts more details.
This website's actually very interesting. We'll have to take a look at more of it and see if there are any other articles that would be really interesting to you guys, or you guys can just take a look and figure it out on your own.
He covers a wide range of topics.
But a Fourier transform, in brief, is going from the time domain to the frequency domain.
So if you have some data that changes over time, and the example he uses is an audio recording, you have digital samples over time.
And if you're interested in what the most common or most prominent frequencies in that sound clip are, then you want to transform from that time data to frequency data, which says, you know, at 80 kilohertz...
I don't know if that's too high. I'm not good with my audio.
I think that's right.
Okay, 40. So, okay, at some number of hertz, you know that this is the peak signal.
So, like, okay, maybe that gives you some information, or you want to do some sort of mutation of the signal and then transform it back.
But this is something that's very common.
You'll come across it in a large number of domains: image processing, audio processing.
I mean, even stuff that seems completely unrelated,
it can come up.
But yeah.
Yeah, compression of all sorts and things like that.
Yeah.
And this article talks about a specific version,
the fast Fourier transform,
and also the discrete version of that.
So it goes from the math to the practical, and there are a lot of implications, so it's a pretty good treatment.
So if you've ever heard of that and wondered what it was, or you're trying to learn more about it, I would definitely check out this article. It seemed like a pretty good description.
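To make the time-domain to frequency-domain idea concrete, here is a small sketch that finds the most prominent frequency in a list of samples. It uses the plain discrete Fourier transform, which is slower than the fast Fourier transform from the article but produces the same spectrum. The 50 Hz test tone, the 800 Hz sample rate, and the function names are assumptions chosen for the example, not anything taken from the article.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a list of real samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dominant_frequency(samples, sample_rate):
    """Return the frequency in hertz of the strongest bin, ignoring bin 0 (the average)."""
    spectrum = dft(samples)
    half = len(samples) // 2  # bins above this mirror the lower half for real input
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# Hypothetical input: 160 samples of a pure 50 Hz tone recorded at 800 samples per second.
sample_rate = 800
samples = [math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(160)]
peak_hz = dominant_frequency(samples, sample_rate)
```

With 160 samples at 800 samples per second, each frequency bin is 5 Hz wide, so the 50 Hz tone falls exactly on one bin and stands out clearly from the rest.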
Yeah, it's totally awesome. This is great. I'll have to give this a read.
Cool.
So I'm doing the next article. It's on the OUYA.
Oh yeah!
Oh yeah!
Oh, no, that's the guy who crashes through the window, the Kool-Aid Man.
Yeah, the Kool-Aid Man. The Kool-Aid Man's coming out with a console.
What? Pretty awesome.
Yeah, hopefully it's, like, bowl-shaped and you could pour.
But anyways, so the OUYA is coming out. I'm pretty excited about it. I haven't ordered one because Patrick has scared me off of Kickstarter.
But I'm super interested in the OUYA.
I still might order one off their website, even though the Kickstarter is finished.
So is it available for ordering off their website?
Yeah, you can actually pre-order it.
Is it the same price?
A little bit more?
I'll find out.
Oh, that's okay.
But basically, you can order on the website.
It comes, I think, in March or April.
But it is a...
For people who missed the episode on the OUYA,
it is a...
I'd just describe it as an Android-powered console.
So it's a console with a controller and all this good stuff,
but it's totally Android.
So any existing Android games,
you can play using the touchpad on the controller.
And any new Android games will be able to support the hardware buttons of the OUYA.
And it's pretty awesome because we've never had this, right?
We should say it hooks up to the TV, so it's not like a phone.
Oh, yeah.
It's a little box that plugs into your TV, just like Xbox.
Yeah, or PS3 or something.
We've never had an open system.
So a little bit of history on consoles.
The way the PlayStation and Nintendo,
Wii and Xbox, the way these things work is,
well, unless you're Nintendo and you're awesome,
but let's talk about everybody besides Nintendo.
The way these work is you lose money on the console.
So every time you buy a PlayStation,
Sony loses money.
Well, initially.
I don't know if that normally holds true over the whole life of the console.
Oh, because they're cheaper.
Okay.
Continue.
I'm sorry.
Interruption.
So the point is, at least they don't make a ton of money, suffice it to say.
That's not the main way they make money.
Right.
So they actually make money by charging developers for a dev kit.
So, for example, if you want to make a PlayStation 3 game,
then you buy this dev kit.
And the dev kit has a little bit of hardware,
but I don't even know if they have hardware nowadays.
It's probably all in software or firmware.
But they'll send you probably a PS3 with some special firmware.
But they'll charge you for the dev kit,
I think it's like $200,000 for a dev kit, and the only reason why the price is so high is so
they can recoup some of the money. And the idea is, people who buy your game are going to
give you money, and you're going to pass some of that money along
to the hardware manufacturer.
Well, I mean, so it's worse than that, right? So, I mean, it's that for every copy sold,
they have to, they put DRM on the game.
So if you don't have... I guess this is almost slightly different
from normal DRM, which is that if I have my own media,
but that media is not signed by Nintendo,
I can't play my media on the Nintendo device.
Right.
So you have to go to Nintendo, and they, you know, I'm doing air quotes here, they test your software
to make sure it's not going to break the console, or isn't a virus, or, you know, that it doesn't crash.
It's good, right?
And then basically, with their permission, you can turn around and you're allowed to sell that game.
But you owe them, you know, X percent or a dollar amount.
Oh, that's true.
Per everything that you sign.
Yeah. I believe that's a very common way.
Yeah, I think there's an initial, like, one or 200K investment just to keep the small people out of the market, and then there's also a rev share.
So this is from talking to a friend of mine who works for EA, a long time ago, so it might be different now.
But yeah, so the point of it is, Patrick and I can't make a game for the PlayStation 3, right? I mean, we just don't have those resources at hand.
Right. So you have to be a big publisher, or Nintendo has to come to you.
World of Goo is a common example, where it's a really popular indie game. They did really well in Flash, and I think they were out on mobile even before the Wii, and then Nintendo contacted them and some agreement was arranged, etc.
The thing about the OUYA is it's a completely open platform.
So any game that anybody puts out for Android, like Trivipedia, you could play it on the OUYA. Any Android program out there will run.
And so it really just opens up the game console experience to the whole world of developers, and that's what's really exciting to me.
And the general trend has been along those lines, right?
So even the Xbox, there's, what is it, Xbox Live Arcade?
Yeah.
Okay, maybe I'm butchering it. I think PlayStation has something similar, and the Wii U has something similar, and maybe even the Wii did, right?
So there are the main games that you go to Toys R Us or Target or Best Buy, or I don't know where everybody goes these days, whatever the local equivalent is in your country, to buy, right?
There's those, which is kind of what we're talking about, but then there's these others. But those still have limitations.
Like, I remember reading something about when Minecraft came out for the Xbox, right?
So they had some people doing this, and then they had an update, and then it was like, oh, if they wanted to update it, like, patch it again, you're supposed to go through all this testing again. But that testing costs money.
So they wanted to just push early and push often, but that was not compatible with how the money was set up, and so it caused kind of a problem.
Yeah, yep, totally.
And so sometimes people leave kind of broken games because they just don't have money to push a fix for it, and that's bad.
And the other problem too is, although there is an Xbox Live Arcade, there's a developer ecosystem, but not to the extent of these open platforms.
Like it could be.
Yeah. Like, for example, we talked last year about Cocos2d-x, which is not a game, it's just a library that you can use to make games.
Well, for you to do that on Xbox Live Arcade, I'm assuming you would have to be able to publish the library. I'm not quite sure how that would work.
But on an open platform, you just give all the source code, and the person can
take your source code and link in the Android libraries, and it just kind of works.
The ecosystem is much more healthy on Android.
And so this is a very positive development, right?
So like OUYA is on track.
They started shipping the developer consoles,
which are very similar to the regular consoles,
but already kind of like opened up so the developers can do what they need to do,
run unsigned code, that kind of stuff.
And so this is a positive Kickstarter story. Yay, Kickstarter.
So hopefully they'll stay on track, and early next year we'll be talking about the reviews of Jason's OUYA console.
Yeah, totally.
All right, so I think you have one now.
All right, so, future of the web. The future of the web would be HTML8.
Oh, no, no, man. Maybe that's far in the future. HTML5?
Yeah, at this rate, that's like the year 3000.
It's finalized.
You know, this is one of those stories.
I mean, it's kind of interesting just to talk about that.
I think it's kind of slow, the ratification of these things.
Like, we've already kind of entered the HTML 5 zone in my mind.
Like, it's generally used, right?
But yet only just now did everybody kind of agree on what that meant to some extent,
which still doesn't really mean anything
because all sorts of reasons.
But you hear this, like, remember when wireless Wi-Fi in,
802.11n was coming out, right?
It was DraftN.
You could buy DraftN routers.
Oh, yeah, I remember that.
And DraftN routers weren't guaranteed to be compatible with N.
And then, like, a year or two later, right,
then it was like, oh, now you can get the actual N ones.
Yeah, or I remember too, like C++, right? Even
that went through this whole thing about, oh, we're going to come up with a new C++0x, and it took over
10 years, and then it was finally C++11 or whatever. It's hilarious.
So, I mean, sometimes we forget that the world moves on past these standards bodies,
but the standards bodies do play an important role in saying, this is
what HTML5 really is and really contains.
But even then it's not done.
Like they still have other work to do to kind of make sure everything's okay.
It's not like finalized doesn't mean like it's done.
It means, like, oh, now we can move to testing and compatibility and actually
making it an official HTML5 thing.
Yeah, it's really just almost like a social or political thing.
Let's say everyone complains about Internet Explorer, right?
Especially older versions of Internet Explorer. Like, Internet Explorer 5 doesn't do what
Firefox and Chrome of the same era do. Or maybe 5 is too old, but you see where I'm going with this.
So this is their way of saying, HTML5 is real, you can make content for it, and if someone doesn't support it, that's on them.
And so it's like a political thing to motivate people to switch.
And I'm pretty sure if you're using an old browser and you go to a website nowadays that has HTML5 content,
they'll actually have a little link saying, hey, your browser is super old.
You can't display this website correctly, and you should go get a new browser.
Whereas before this specification was complete,
they would just try and make it support everything.
Yeah, and I mean, we should say also this is by the W3C Foundation
or W3C Console.
I don't know what the C there is.
Oh, consortium.
Okay, thank you for saving me.
The World Wide Web Consortium.
Yes, okay.
And it is a good thing.
It is, you know, that they've finalized it.
And also, they've taken time to announce that they've begun the draft of HTML 5.1.
Woo!
Oh, man.
So did they keep the peer-to-peer?
HTML 5 was supposed to have peer-to-peer support.
But, like, no browsers have actually implemented it to this day.
But, I mean, basically what that means is you could have a BitTorrent client totally written in HTML.
Like, you just go to a website and you just start torrenting on the site. It's just craziness.
And I know I read an article where they said, you know, no browser will ever
support this.
Yeah, I don't know.
I don't see anything about it in my research that I'm doing right now.
Oh, yeah.
All right.
But it's OK.
It's all right.
Maybe we'll have a future topic about this.
Yeah, totally.
All right.
So I think you're on.
You're up next.
Tell me about some sweet desserts.
Some sweet desserts? Well, I've just been eating some raspberry pie.
Oh!
Yes.
Been eating from the raspberry pie store.
You bought raspberry pie at the store?
I did.
It was delicious.
Yeah, it was one of those Safeway specials,
but you have to have the card where they harass you,
you know, the Safeway club card.
So, like, now it's like every time I go,
they want me to buy another raspberry pie.
I'm like, no!
No, you people!
So, Patrick is like... So okay, so the Raspberry Pi store has just launched, what, today or a couple of days ago.
But it's a pretty cool idea.
If you own a Raspberry Pi device, you can go to this store and download apps. I'm assuming you somehow transfer them to the device, or maybe you download them with the device itself, and then you can just start using them.
So they have a number of free apps. There's Freeciv, the Civilization clone, and some other free and open-source games that seem to be ubiquitous. Like, Battle for Wesnoth seems to be on every possible thing.
I have Battle for Wesnoth on my Gumstix, my phone, like three computers.
So a lot of these open source games,
it's just as soon as something new comes out,
they try and get themselves on there.
But there's a number of awesome things
that you can install.
Some are free, some cost money.
But if you have a Raspberry Pi
or if you're thinking of getting one,
definitely check out the store.raspberrypi.com
and see what's available.
So is the Raspberry Pi the Ouya before the Ouya?
That's a good question, right?
I mean, a lot of these are games.
It's like a store and a game.
I mean, it doesn't run Android,
but I mean, shy of that.
And it's not nice and bundled.
Yeah, that's a good point.
Actually, a coworker of mine,
I was telling him about the Android Stick PC that we talked about on the last show and how I want to get one for Christmas.
And he was saying.
Wait, do you really think you're going to get, I'm just like, you're going to buy one for yourself for Christmas?
No.
We should talk about things you want for Christmas.
But like, I normally, those kinds of things is like nobody, like really?
Like who's going to.
So this is kind of, so my parents, ever since I was maybe like seven, they've always just asked me what I want for Christmas because they don't know anything about computers and things like that.
So as tradition holds, my mom asked me what I wanted for Christmas.
And you explained to her, like, go to this specific page.
I had to give her a link.
Yeah, yeah.
I had to give her a tiny URL.
Cool. But, yeah, so my coworker said, well, you know, if you get this,
then you're stuck with Android.
You should get a Raspberry Pi or a BeagleBoard or something.
And that is kind of true.
Like, if you get Raspberry Pi, then you have just a pure Linux shell with,
you know, you could sudo apt-get.
You can install packages.
You can do whatever you want.
If you have Android, then you're stuck in the Android ecosystem.
And so I haven't yet decided which route I'm going to go.
You already sent the link, dude.
I haven't sent it yet.
Oh.
Oh, it's sitting in my saved messages but not sent in Gmail.
So, yeah, I'm holding on.
Okay.
All right.
Well, I guess we'll get an update about this.
This is the last one we do.
Sure, everyone wants to know about my mom's five.
We're recording this before.
Well, no, we want to find out which camp you fell into,
which you decided.
Not necessarily which you got for Christmas.
Gotcha.
But I don't know if we'll release this before or after.
We're recording this before Christmas.
That's true.
Yeah, we're recording it in advance
because we're both going to be gone for Christmas and New Year's.
So we're recording some shows in advance.
So we will find out.
So we will have already known what we will have gotten.
Yeah, when you guys listen to this, I'll be playing with my...
Oh, we just caused a quantum... what is it in Back to the Future where you've broken the space-time continuum? I don't know.
Anyways, yes.
Time for Tool of the Show.
Tool of the show. This is tool of the show.
So my tool of the show, this is the tool that I use more than anything. More hours of my day are spent using this tool than probably any other program or anything.
Sledgehammer?
That's a measuring tape.
It's called choose now.
So my tool of the show is Emacs. And Emacs is a web editor, or sorry, Emacs is a text editor.
Web editor, I don't know where I was going with that. Text editor. But it is so much more than that.
It has a ton of features, a ton of macros. You can have, like, 20 files open in the editor at the same time. Well, you can have infinity files open at the same time.
You can do things like find all occurrences of "Patrick" and change it to "gives Jason a hard time" in all 20 files. You could just type Escape and then find, replace, all files, or whatever.
So it has a bunch of little macro support, and it has all these cool tricks. It has a color theme picker.
You have colors?
So you start off with...
You're blowing my mind.
I know, this is insane.
You start off with a white background and black text, which I just can't deal with. So I picked the Clarity theme, which has a black background with text that's a grayish, not completely white, because that's too much contrast. Like a nice soft gray.
Coming next week: Jason's favorite Emacs theme.
And just a bunch of cool macros. There's an
Emacs wiki which has a ton of things to make your life easier.
Like, for example, just a short anecdote.
You might have something like a file which has a number, colon, and then some text.
And you just want the text after the colon.
You just want to get rid of the number and the colon, right?
So if you're just using Notepad, you'd have to do this by hand, line by line, right?
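The cleanup described here, dropping a leading number and colon from every line, is the kind of job a single regular-expression replace handles. A sketch of the same idea in Python rather than Emacs follows; the sample lines and the `strip_number_prefix` helper are made up for illustration.

```python
import re

# Matches optional spaces, a run of digits, a colon, and any spaces after it,
# but only at the very start of the line.
LEADING_NUMBER = re.compile(r"^\s*\d+\s*:\s*")


def strip_number_prefix(line):
    """Remove a leading 'number:' prefix, leaving the rest of the line untouched."""
    return LEADING_NUMBER.sub("", line, count=1)


# Hypothetical input lines; note that the colon is not always in the same column.
lines = ["1: first entry", "22:second entry", "  303 : time is 12:30"]
cleaned = [strip_number_prefix(line) for line in lines]
# cleaned is ["first entry", "second entry", "time is 12:30"]
```

Because the pattern is anchored to the start of the line, it does not care which column the colon lands in, and later colons in the text are left alone. Inside Emacs, a similar pattern such as `^[0-9]+: *` could be given to `M-x replace-regexp` with an empty replacement.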
But in Emacs, you just...
Couldn't you just do it from the shell?
Yeah, you could write a shell script to cat the file, go through sed and awk, and things like that. But in Emacs, you actually do a regular expression replace, almost trivially.
Or you could use something that had a column editor.
A column editor. But what if the colon moved columns?
Oh, that's true. Okay, that would be bad. Yeah, all right, you didn't fully specify the problem.
So you can tell Patrick thinks that Emacs is for people who are on magic mushrooms or something.
Come on, man, where are you going with this? They're trying to make us lose our clean rating.
What is it? So clean, but not like...
What editors are you on, Patrick?
I don't edit my code, man. I just think it.
Oh, man. Like, you're writing binary?
I just have telepathy with the computer.
Oh, it's pretty... Do you have all those headsets?
Yes, that was such a fad.
I know, I wanted to get one of those things to read your brainwaves and hack that. I tried one in Best Buy a year or two ago
and it had this platformer game where if you thought about moving right, it moved right.
And I could never get it to work.
But you have to train. I mean, almost all of them have always shown that you have to go through a set of exercises.
Oh, yeah. Like the old Dragon dictation thing, where you'd have to, like, translate the words you spoke or whatever, or write them down.
You'd have to read a chapter of Shakespeare, an act of Shakespeare, to train it.
So that's the same thing. You need to go through some thing to train it, at least now.
I did that, but I can be pretty scatterbrained, and so I kind of think that I wasn't focusing on the training part of it.
Just remember, there is no game.
Yeah, there is no spoon.
There is no spoon.
So what's your tool of the show?
Well, okay, so I dodged your question, but I happen to actually be in that weird camp of nerds who thinks an IDE is, like, an amazing invention of modern times, where I can have 20 files open, I can have colors, and I don't need these crazy escape key combination macros.
That's my camp.
I do appreciate that sometimes you have to open up a shell and do stuff there, or you're SSH'd into a computer.
Yes, I understand.
But that's a thing of last resort.
Yes. Okay, so in other words, you're in the 21st century.
But I've just offended all of our audience.
No, I'm pretty sure I might be the only person amongst our audience who uses Emacs, I think.
All right, here's my thing, and I'm really going to get in trouble now, but it's okay, I'm going to say it anyways because that's how I roll.
All right. So I think people like the idea of being really, really nerdy, so they just say these things, and they're not true.
Like what? Like "vi or die"?
Oh yeah. I mean, that's not an actual phrase. I just couldn't think of one. But, you know, like, oh, I only use vi for everything. Or like, oh, I only... what's the... there's like Cult of something. I don't know.
All these things, right?
Like, oh, I only use Emacs.
And I'm sure there are people who do.
And I'm sure that's legitimate.
But I really feel like most people, it's not really true.
Yeah. Same thing as, like, I remember all these people I'd talk to, like, oh, Linux this, Linux that.
And it's like, what do you run at home?
Like, oh, Windows.
Yeah, right.
It's like, I don't have a problem with that. Just, like, own up. Like, if somebody makes something good, it's good. You don't
have to be like, oh, it's made by Microsoft, it must be terrible. Yeah. Like, oh, come on. I mean, if it's good,
it's good. There is a high level of zealotry with Emacs and vi. It's true. A lot of people hate, like,
the iPhone and the iPad. Like, oh, they're terrible. Right. And it's like, in reality, okay, fine, you
might think that, but I think for most people it's just, oh, it's because it's Apple. Yeah. It doesn't matter if, like, you
could provably show that it was the best thing. People would still be like, oh, it's Apple, it
can't be good. And then vice versa too. Other people are like, oh, it's not made by Apple, it's got to be
terrible. Yeah. So I think nerds, for some reason, or geeks, really fall into this trap of, like,
we say we want to be open-minded, we want
people to accept our thing, but in reality we try to, like... I think everyone is, though.
Even, like, I had a friend growing up who was really into skateboarding and he was always
like, oh, you have a Variflex? That's a piece of crap, just because it's a Variflex, you know what I mean? So
I think everybody who's really into something has this, like, brand loyalty. But yeah, so the thing about tech, though, is tech is very functional.
So like to have brand loyalty for something
that doesn't necessarily have a style or the style doesn't matter.
Like it's one thing to be like, I really like, you know, express jeans.
And it's like you do it knowing that
they're not going to like make you walk faster, you know.
But it's like, oh, I really like Emacs.
And then, for vi too, to make the argument, like, Emacs makes me a better coder, which I don't believe,
then now you're turning something
that is, like, a style, like a personal preference, into, like, oh, I'm functionally a better
programmer. But even if people say, like, oh, I'm more productive using Emacs than an IDE, their problem is, like, okay, well, you know, did you try all the IDEs? Well, no.
Did you try one? Did you stick with it long enough? And if you kind of dig, a lot of times it's
like, well, you know, I tried it for like a week and I just couldn't stand it. Well, I mean, did you
pick up Emacs and start using it hyper-productively within a week? No. Yeah. I mean, I doubt it, right?
Maybe if you're like some savant, I don't know. Maybe. But like, no.
Okay, anyways, my tool of the week is Chrome Browser Sync.
It's pretty awesome.
It's not really like a tool of the week. It's totally a tool of the week.
I totally copped out on all these.
It's okay.
So I really like this.
No, your Snapseed tool was awesome.
I'm still using that.
Oh, good.
I downloaded it during the show.
And you said one of our other tool of the week you were still using.
Yeah, I'm still using KeePass.
Yeah.
So your tools are a winner.
All right, all right.
Better than Emacs.
All right.
I'm so in trouble.
I extend the olive branch and I get my hand cut off.
I'm sorry.
All right, anyways.
So Chrome browser thing, if you don't know,
Chrome is a browser made by Google.
Yep.
And one feature that I've been using Chrome,
and I like it
I mean
but I like Chrome
and one feature
that I've decided
I really like
is across my devices
that if you have Chrome
and you sign into it
which
I mean we can get
to all the debate
but alright whatever
so you sign into it
and it'll synchronize
your bookmarks
it'll synchronize
you can see what tabs
are open
from your other computers.
It's a really good feature.
I think now, as somebody was pointing out,
Firefox ships it by default or it's an add-on or something.
I don't remember.
But that browser sync functionality,
and whatever browser, I happen to use Chrome,
so that's my tool of the week.
But they may all have it by now.
Yeah, browser sync is awesome.
Turn it on. It's awesome.
It even works with mobile. Right, so on my iPhone I have Chrome and I can look at, oh, what was that
page I was looking up on my computer at home? And I just go, like, my home computer, and like, oh, that
tab, you know, open it. It's amazing. Like, especially I notice it a lot when I'm using
directions. Like, I'll be at home and I'll look up directions for a place, and then I don't want to...
like, I mean, it's not a big deal to retype the address,
but it's so much easier to just pull up your phone and say, show me my desktop tabs, and the directions are there. Like, done.
Ready to go. How do you enter all those key combinations for Emacs on your phone?
Okay
Escape key. Okay, I'm done. That's the last one, I'm done. Although imagine if there was an Emacs editor
for the phone, how hard that would be. Just a custom keyboard. Yeah, that's true, you'd need a keyboard
that had, like, Control-S or something. I'm sorry, I'm done with the joke. Anyways, browser...
All right, that's pretty fun. Time for book of the show. Book of the show.
That was pretty awesome, Echo.
Okay, I'll go first.
You went first last time.
Did you write that in Emacs or was that a script on Audacity?
I wrote it in Csound.
In Csound.
Okay, you guys can look that up later.
Because you didn't get the joke.
Should I go first?
No, you go first.
So I don't have a programming one.
Jason's is going to be related to our topic.
That's why he's going to go second.
Okay.
So I like sci-fi, science fiction.
Totally.
And fantasy as well.
This one, it's science fiction, but it kind of, yeah, a little bit of fantasy kind of as well. But mostly science fiction.
And that is Anathem.
This is a book by Neal Stephenson. It's really long.
Like, I mean, I've warned you. The whole time, I thought that I misread this and
I thought it was Anthem by Ayn Rand. I thought that's where you were going with this. Okay, Anathem
by Neal Stephenson. And it's a really long book. I think it's, like, bordering on a thousand pages or something. Oh, wow.
so here
the paperback
is 981 pages
so it's really long
but it's really good
so this is gonna sound like
Patrick why is this
the best book
you have to stick it out
through the first
couple hundred pages
first couple hundred pages
are a little bit rough
but it is worth it
it's like
a huge investment
but like when you make
the investment
of like oh okay
what's going on here, it's really good. I don't want to do any spoilers about the farther part of the
story, but it starts off and you're in a monastery, and it just starts describing, like, a whole world
which is slightly different than ours, and you kind of begin to pick up that there's
some things which are different but kind of the same. And Neal Stephenson just goes into, like, great detail
about like all the functioning of this monastery.
And it's not, lest you get confused,
it's not like a religious monastery
like you would think of today
or in kind of our world.
So you kind of got to stick with it
and see it's something slightly different.
Again, I'm trying not to give away
because some people really want to have
all that surprise of the book.
So I don't want to give more than even the back cover would say. But I'd check it out, it's really good. And like I said, you gotta stick it out for the first
little bit. It got good reviews. Yeah. You just gotta know you're gonna read
the whole book. Like, just commit: I'm gonna read this whole book. And it's only 11 bucks. That's
not bad. I mean, that's like, what, a cent a page?
Yeah.
So, I mean, you can get on, you know, e, the Kindle or whatever through Amazon here is like $6.
Oh, even better.
And it's good, you know.
Or even the mass market paperback.
I don't know what the difference is.
It's only $9.
Oh, right.
So, okay.
We're sitting here like reading Amazon prices on the air.
This is fascinating podcasting.
But check it out.
It's a really good book.
I mean, I really like it.
And if anybody has read it or does read it after this recommendation, like, let us know.
Like, any of the books.
Oh, that's a good point.
So for people who bought the book of the previous show and people who buy this book, et cetera.
Or already have it or whatever.
Yeah, or already have it.
You know, if you like the book or if you hate the book, let us know.
Yeah, I'd be curious to see if you have a better book.
Let us know.
Post on G+.
Or if you hate this whole segment.
If you hate the show, post on G+.
Some people might be like,
oh, Head First Design Patterns is decent,
but I also really like this other design pattern book.
Feel free to post on the G+,
and start a thread where you're only going to help out
the rest of the community by contributing.
Yeah, and if you want to leave a really horrible comment
about what we should be fixing on iTunes, that's cool.
Rate it five stars because those are the ones we read.
And then just, like, tear it up.
It's fine.
Yeah, if you rate it five stars, then we'll mark it as useful.
There we go.
Just kidding.
Okay.
All right, sorry.
So that's Anathem by Neal Stephenson.
Anathem with an extra A.
I mean, I'm pretty sure that's how you say it.
I'm not exactly sure.
No, you're right.
Yeah, well, it's definitely not Anthem because that's missing an A.
So my book is Hadoop, The Definitive Guide.
And I've actually been reading this book for a while now,
trying to sort of understand more of the Hadoop internals
and the way Hadoop does different things on the backend.
So as somebody who – I feel like I can call myself a MapReduce expert.
I might get destroyed for that by somebody who's a crazy expert.
But I know a ton about MapReduce.
I work a lot with MapReduce and Hadoop internals and things like that.
So you just set yourself up. You know that, right?
Yeah, I know. Bring it on. I feel like I can take it. I accept the challenge,
unless we get, like,
Jeff Dean or Sanjay
Ghemawat, the original creators of
MapReduce. Unless they comment.
That would be awesome. Yeah, I could take
abuse from them. It'd be worth it.
so
you guys that would be programmingthrowdown
at gmail.com
or on our G Plus page.
Any of you want to say anything?
Go ahead.
So I feel like I've been doing a lot with MapReduce.
We could have them on the show to argue with you.
Oh, yeah.
We should see.
They're in the Bay Area somewhere.
Okay.
Yeah.
All right.
So this book, even as somebody who does a lot of MapReduce and Hadoop, the book is incredibly useful.
It's a great reference.
It has a lot of examples.
It really quickly answers the question, why do you use Hadoop,
which we'll hopefully answer ourselves.
But it explains it in a lot of detail.
And it has a ton of extra content.
So if you like this show and you're interested in Hadoop
and things we talk about, this has a ton of great examples. And it's got a cool elephant on the cover. Dude, it's pretty epic. It must be awesome. Now, do
you know why it has an elephant on the cover? I think so. Hadoop is the name of the creator's
pet elephant toy. Yeah, like his child's elephant. Oh, his child's, okay. Yeah. So it
wasn't created by a five-year-old whose pet elephant he named Hadoop. That would be epic.
That would be pretty awesome.
It would be the most epic five-year-old ever.
That was you as a five-year-old.
Oh, I wish.
Okay, on to our programming language.
Hadoop.
Okay, so I have to caveat that.
I said programming language.
It's not really a programming language.
No, it's...
Hadoop.
But that's okay because I noticed our title
isn't Programming Language Throwdown.
Oh.
It only hit me this week.
That's totally true.
Is that weird?
We've been doing 23 of these episodes now,
and it never hit me.
When we named the show,
we talked about programming languages,
but really the title is Programming Throwdown.
So, I mean,
we're totally in our wheelhouse here.
You could definitely throw down Hadoop versus, like, AllReduce or Spark or some of these other, you know, frameworks. There's
definitely a throwdown to be had here. You already threw it down with Emacs, now you're throwing down
that you're an expert on Hadoop. Let's bring it. Yeah, get 'er done. So, Hadoop... It's not even that late. Did I tell you I'm actually getting... No, no, we gotta say
it real quick. Okay, all right. I'm getting a bib. So, I want to get a bib for a baby.
Like the kind your hose hooks up to on the faucet? No, no, like the kind where they puke and it
lands on the thing, the kind they wear. Whatever. I want to get one of those, and I want it to say...
For you? For me.
And I want it to say,
"I ate and made Hadoop."
I'm just getting the stare down from Patrick
I don't get it
I'll write it in Emacs
Let me think about it
If I start laughing in a few minutes
Okay
Well it ends in OOP
Is it a poop joke?
Yes it is a poop joke
Okay now I'm getting a different kind of stare.
It's like a despondence.
Okay, so let's give some history on Hadoop.
Hadoop was based on MapReduce, which was something created by, as I mentioned,
Jeff Dean and Sanjay Ghemawat at Google.
So they created something called MapReduce.
They wrote out a paper explaining how MapReduce works.
And then Doug Cutting, who is another engineer somewhere in the Bay Area,
I think he made a startup.
Is that right?
Yeah, and he was working on an open source version of it.
He was inspired by the paper.
He began working on it.
And then at some point, Yahoo either bought his team or he joined Yahoo.
That was sometime around 2006-ish.
And so then he kept working on it while he was at Yahoo.
So he actually, the whole reason why he created Hadoop was because he wanted to support Lucene,
which is an open source reverse index.
And we actually talked about Lucene in the mailbag episode of this show,
answering the question posed by the
person who wanted to create his own, like he wanted to index the web for Magic the Gathering.
So yeah, Lucene is a reverse index meant for indexing thousands, millions of pages.
And Hadoop was something that Doug wrote to support that.
So at the time, it was kind of meant for that and it was sort of myopic, but
since then it's grown, like, to an amazingly large scale. How big? The
largest Hadoop file system that we know of is run by Facebook, and it is a hundred petabytes.
Do you know how much a petabyte is? More than a terabyte. How much more than a terabyte?
A hundred. A thousand. Ten thousand. That reminded me of that Monty Python where he's like, "Blue. I
mean yellow." No. So yes, a petabyte is a thousand terabytes. So all you guys have, like, a terabyte
hard drive at home. So a thousand of those, that's what... A hundred thousand of them. Well, yeah, you're right,
it has to be more than that. A hundred thousand petabytes... A hundred thousand terabytes. Wouldn't
it be... It would be a hundred petabytes. A hundred, oh, right. All right, yeah. So Facebook has a hundred
thousand of those hard drives that they're using to store all the data, at least.
So it might be replicated. That's true. Right, I mean, that's just one, and that's only what they've owned up to. They may have bigger. Yeah,
that's true. They may be lying, right? Like, it might actually be smaller. Yep. Or it could even be
that, you know, somebody else has one that's, like, just way bigger and they just don't
want to talk about it. I mean, this stuff's a competitive advantage. Yeah, yeah, that's true. It's
true. Okay, so you want to discuss some of the
features? What are the things that...
One thing we should mention in the beginning is
Hadoop has become almost
like an umbrella term.
Hadoop is the MapReduce for
doing this distributed processing,
but people also refer to Hadoop
as this collection
of tools, with
Hadoop being one of the things in
Hadoop, which is really confusing. Let's talk about some of the other features,
similar tools that come with Hadoop. Also, changing my answer again. I was being
silly. It's 1024, right? Isn't it? Because isn't it a power of two? Yeah, you're right.
We have no idea what we're doing. Oh, man. So, yeah, there's... So, okay, so let's do a little bit first.
MapReduce.
Okay.
We use this term a little prematurely.
So what is MapReduce?
So, yeah, I'll try and explain MapReduce.
Yeah, that's a good point.
So, first of all, let's do a little bit about distributed computing.
Back in the day, people used to do, and they still do,
but people used to be totally consumed with MPI and things
like that.
MPI stands for message passing interface.
And so if you had a lot of data and you
wanted to do something with that data,
you would break the data up into pieces.
Then you would pass each piece to a different computer.
And then the computers
would probably, like, send you back some results, and you would put the results together. And then if
you passed the message to the computer and it died, your whole process would blow up. Well,
you would have to have code to handle it. Yeah, or you had to say, oh, this machine died, so I need to
pass the message to somebody else. And so you had to be completely consumed
with all of these meta things that didn't really
matter to solving what you wanted to solve.
So it turns out that not every problem
can be what's called embarrassingly parallel, where
you just send chunks of data to different machines.
But almost all algorithms can be implemented
using something called MapReduce.
And so the idea is you take your data,
you break it up into chunks, then for each chunk,
you execute a map on the chunk, which takes this chunk of data,
could be a sentence on a web page, or it could be a URL,
or it could be any atom of data, and then returns a key-value pair.
So for example, let's say you wanted a count for every word on the internet. The map could scan through a
page and spit out a bunch of key-value pairs where the key is the word
and the value is the number of times you saw it on that page. So if you saw, like,
"the fox jumped over the dog," then there's two "the"s, so one of your key-value pairs
would be "the" comma two. Another one
would be "fox" comma one, etc. So now all of these key-value pairs generated by all these mappers,
they're all kind of floating somewhere in your cluster. So you have something called a shuffle
that takes all of the keys that are in common and all the key value pairs that have the same key and
squishes them together. So if you have 10 websites with Fox as the key, you'll just have one key with Fox.
And then instead of a value, now you'll have a list of values,
where that list is how many times Fox was seen on all these websites.
Then you have, that's the shuffle phase.
Then you have the reduce phase, where you collapse that key list of values down to one answer.
So in this example, you would take Fox and then 2, 1, 2, 1, 2, 3, 2, 4, and you would add up all those numbers together.
And you would say for Fox, for the whole Internet, it's whatever, 13 million or something like that.
And the cool thing about this is that all these pieces
can happen in parallel. Like, the part that collapses all of the fox values and the part that collapses
all the bear values could happen on two different machines, and the part that scans website A and
scans website B can happen on two different machines, and it all just kind of works. And so,
as long as you're... and the MapReduce code also handles all these things, like,
oh, the machine that was processing Fox dies.
We need to send Fox to another machine.
They've dealt with all that for you, so you don't have to do it.
So that's a short summary of MapReduce.
And Hadoop is an open source version of this.
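The three phases described above (map, shuffle, reduce) can be sketched on a single machine. This is only a toy illustration of the word-count example from the episode, not Hadoop's API; the function names `map_page`, `shuffle`, `reduce_counts`, and `word_count` are made up for this sketch, and a real cluster would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_page(page_text):
    """Map: turn one page into (word, count-on-this-page) pairs."""
    counts = defaultdict(int)
    for word in page_text.lower().split():
        counts[word] += 1
    return list(counts.items())


def shuffle(pairs):
    """Shuffle: gather every value that shares a key into one list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(word, values):
    """Reduce: collapse one key's list of values down to a single answer."""
    return word, sum(values)


def word_count(pages):
    """Run map, shuffle, and reduce in sequence over a list of page texts."""
    mapped = [pair for page in pages for pair in map_page(page)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, values) for word, values in grouped.items())
```

Calling `word_count` on a list of page strings returns one total per word. Each `map_page` call only looks at its own page and each `reduce_counts` call only looks at its own key, which is what lets a framework spread them across machines.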
All right.
So, I mean, part of this is going to be if you're going to –
you kind of started with something that was a little bit presumptuous, dare I say,
which is you start with everybody reading all the tasks,
reading one page of the Internet.
Where do you store something that's every page of the Internet?
You can't just, like, store that on, like, NTFS or, you know, ext, you know, a file
partition like this. I mean, it's so big. Yeah. Well, not only is the size large, but the number of files is
ridiculous. Oh yeah, right. So you could run into problems. So, I mean, one thing that has to go along
with this is you have to have a distributed file system. So first of all, you can't house all the
data on one computer, right? But then on top of that, even, like, how you would access it in a way that is efficient, so that you can have these
tens, hundreds, thousands, millions of map threads working on this, you need a way to be able to
retrieve that quickly and reliably and in a sharded manner, so that, you know, that can take
place very quickly. And they've handled that as well with the Hadoop file system. Yep, totally. Yep. So the Hadoop file system operates in 64-megabyte
chunks, and that chunk could have one file in it, it could have a piece of a file, it could have
10,000 files in it if the files are very small. And so versus, you know, a regular file system, which
uses, you know, B-trees and red-black trees and things like that,
this uses something totally different that's meant for gigantic files, millions of tiny files, all of these things.
And it expects you to access the files using something like Hadoop
versus a file system which doesn't really know how you're going to access the files.
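To make the "64-megabyte chunks" idea concrete, here is a rough sketch of cutting one large file into fixed-size pieces. It only models the "piece of a file" case from the description above; the name `split_into_blocks` is hypothetical, and nothing here reflects how the real file system places or replicates its blocks.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned in the episode


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that together cover file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks
```

Under this model a 200 MB file becomes three full 64 MB pieces plus one 8 MB piece, and each piece could be handed to a different mapper.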
There's also HBase. HBase is a column oriented database written in Hadoop. So you know
whereas HDFS is a file system, HBase is a database. So if you need to you know get
rows by key and if you need to find all keys where the first name is this, if you
need to do all the secondary indices, all that stuff,
but you still need the distributed coolness from HDFS,
which HBase is built on,
then you can use HBase and get that.
So all this code that's running,
I mean, if we begin to talk about hundreds, thousands of servers
or even more, I don't know how high you could go,
how high do you want to go, I guess,
you're going to have to have a way to keep this.
So machines are going to die.
If you talk about that many machines
and some of this data can take a long time to process,
I mean, machines are going to die.
You might want to add new machines.
You want to increase your capacity.
I mean, that would be a huge mess in the olden days.
I mean, imagine if you're doing MPI
and you have to deal with that, right?
I mean, it would take you months to have to deal with the, oh, this machine died, so I'm going to send it to that machine, right?
I mean, that could be such a nightmare, right?
Or what if the machine that's sending the things dies?
I mean, you don't want to have to restart the whole thing just because of that.
So dealing with that, right?
That could be a nightmare. And so Hadoop's thing which handles that is called Zookeeper,
going along the lines of pet animals and things like that.
And so, yeah, Zookeeper is a distributed coordination service.
And so in really short, what that means is somebody can write some data to Zookeeper,
and somebody else, if they try to write data
and the data collides, Zookeeper sort of
figures all that out.
So if two machines
both want to take the same
job at the same time, Zookeeper
says, no, look, only one of you guys get
this job. Things like that.
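The "only one of you guys gets this job" behavior can be sketched with a lock inside a single process. This is an assumption-heavy toy: `JobBoard` and `try_claim` are invented names, and it is not ZooKeeper's API, which coordinates separate machines over a network rather than threads in one program.

```python
import threading


class JobBoard:
    """Toy coordinator: the first worker to claim a job wins, others are refused."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}

    def try_claim(self, job_id, worker):
        """Return True only for the first worker to claim job_id."""
        with self._lock:
            if job_id in self._owners:
                return False
            self._owners[job_id] = worker
            return True

    def owner(self, job_id):
        """Return the worker that holds job_id, or None if unclaimed."""
        with self._lock:
            return self._owners.get(job_id)
```

However many workers race on `try_claim` for the same job, exactly one of them gets `True`, so the job runs once.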
So like we said,
I mean, Hadoop is kind of, when people
talk about it, they're kind of talking about all these kind of tools, which are kind of all necessary to get together and enable you to be able to run those MapReduces.
And so, I mean, kind of now talking about the strengths, one of the things that we talked about is you bring up more computers, great.
You can have more threads and the threads can run sooner.
They don't have to be serialized. Yep. And so, I mean, really, you can just scale this, dot dot dot.
Yeah, like, just, you know, however much, I guess, time or money you have determines
how big you can go, or need to go, or have to go. Yeah, yeah, it's amazing. I mean, you
could be using the Amazon Web Services,
which run Hadoop, right? So they have a version that they've altered, I guess, or changed. Elastic...
Elastic MapReduce or whatever, right? Yeah, right, which is very similar. Yeah, so they
have their version, which runs on their infrastructure. So, like, you know, for instance, they
may not... I think the way it works, and I'm speaking a little out of my expertise here, but it's, you know, instead of having, like, a Hadoop file system, they have their version of the same thing, right?
Their elastic file storage, which they kind of hide from you, so you don't have to worry about it.
You just put data there.
And you can run the Elastic MapReduce, which runs on that.
So, it's very similar and shares a lot of stuff with Hadoop, but it's not, isn't exactly the same.
Yeah, yeah, totally.
I don't know if it uses the same API or how that works.
I don't know.
Yeah.
I didn't look into that.
No, yeah, I haven't done that with Amazon.
But the other cool thing is it's fault tolerant.
So, you know, all these crazy corner cases that you'd have to code up yourself,
they've done that for you.
And they'll even do crazy things that you wouldn't think about doing.
But, like, one thing Hadoop will do is, let's say you have 20 machines that are in your little Hadoop cluster, right, and you're running a job,
and Hadoop detects, the ZooKeeper detects, you're only running one job and the job had eight
workers, like, let's say there's eight mappers. So Hadoop will say, well, you know,
if one of these machines dies, you have to, like, start the whole thing over again.
So let's just run like since you only have eight, let's just duplicate all this work, like run it on two machines.
And if one of them dies, it's OK.
Or another thing they'll do is they'll know, oh, you know, you might be able to like pass some data or you might be able to arrange and do a little bit of the reducing in the mapper.
Like these crazy optimizations that we don't really think about
because we're too busy focused on the algorithm,
they've done all that grunt work, you know,
and they've done all these awesome optimizations.
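The "do a little bit of the reducing in the mapper" optimization can be illustrated with the word-count example. The sketch below is a plain-Python comparison with made-up function names, not the framework's own mechanism: one mapper emits a pair per word occurrence, the other pre-sums on its own page first, so fewer pairs need to be shuffled while the final totals stay the same.

```python
from collections import Counter


def map_without_combining(page_text):
    """Emit one (word, 1) pair for every word occurrence on the page."""
    return [(word, 1) for word in page_text.lower().split()]


def map_with_combining(page_text):
    """Pre-sum inside the mapper: one (word, count) pair per distinct word."""
    return list(Counter(page_text.lower().split()).items())


def total_counts(pairs):
    """Stand-in for shuffle plus reduce: add up the values for each word."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return dict(totals)
```

On a page with many repeated words, `map_with_combining` produces a much shorter list than `map_without_combining`, yet `total_counts` gives the same answer for both.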
Sort of like using C++ instead of assembly, right?
Exactly the same.
Yeah, like, MPI is so low-level that you just end up having to do
way too much. And that feeds into one of the features I skipped over that you were
particularly a little bit passionate about: Crunch. Oh, yeah. So you want to talk about Crunch? Patrick's
actually used Crunch quite a bit. Yeah, well, you overstate, sir. I am not so emblazonedly emboldened to say that.
Not to call you a liar, but I'll just... Okay.
So what Crunch is, is Crunch allows you to,
instead of having to use the kind of off-the-shelf nature
of a lot of the stuff that MapReduce does
or kind of fit into their paradigm,
kind of even going again on what Jason said
extending it even a little further, even more
flexibility in how you do these things, and
just allow the computer to handle it, right? Let the computers
handle it: just, you take
care of this, right? So Crunch is an
API for allowing you to, at an even higher level,
say, like, here's generally the data
flow I want, and here's the operations
I want to perform,
and all right, go. And so if you
kind of drew out, like, oh, these two data things need to flow together, I need to do
some operation on them, and then once that's done I make some calculation, and then I need that to
flow with another piece of data, you know, and then I want to kind of join those together in, like, an
inner join fashion or something, right? All these kinds of notions could become, like, multiple MapReduce Hadoop jobs that we need to run, and they're,
you know, things that you might have to string together very carefully. But Crunch allows you
to just kind of specify, like, this is the way I want it to go, and it'll handle, like, oh, I could
merge these two together and run them at the same time. Or, oh, I need to spin up a new Hadoop run for this thing or for that thing.
And it kind of allows you to do that much more simply and easily
and in a way that's flexible, just allowing the computer to handle it.
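The idea of describing a data flow first and letting the library decide how to run it can be sketched with a lazy collection. Everything here is hypothetical: `LazyCollection`, `parallel_do`, `plan`, and `materialize` are names invented for this toy, which runs in one process and is not Crunch's actual API. It only shows one planning decision, merging consecutive element-wise steps into a single pass.

```python
class LazyCollection:
    """Records element-wise operations instead of running them right away."""

    def __init__(self, source, stages=()):
        self._source = source
        self._stages = tuple(stages)

    def parallel_do(self, fn):
        """Describe a step; nothing runs yet. Returns a new collection."""
        return LazyCollection(self._source, self._stages + (fn,))

    def plan(self):
        """Report how the recorded steps would be executed."""
        return f"1 pass over the data running {len(self._stages)} fused stage(s)"

    def materialize(self):
        """Actually run: every recorded stage is applied in one pass."""
        results = []
        for item in self._source:
            for fn in self._stages:
                item = fn(item)
            results.append(item)
        return results
```

The caller chains `parallel_do` calls to say what should happen, and only `materialize` triggers work, at which point the planner is free to combine steps rather than running one job per step.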
Yeah, yeah.
And Crunch feels more like Java, you know,
because you have these, like, PTables,
which basically feel a lot like hash maps in Java. So you can kind of
take your existing Java code, and if you need it to run on, like, a thousand machines, you know, it's not
just a copy-paste, you can't just change HashMap to PTable, there's more to it than that. But
it has things like get all the values, like, get a list of all the values, that exist
in HashMap and also in PTable.
And so it kind of feels more like you're just doing
native Java.
So it's good for people who want to get started with Hadoop
and aren't used to writing this kind of stuff.
Yeah, so that idea, like PTable or PGroupedTable,
and it allows you to say like, oh, and everything in that,
you know, in parallel do this thing, right?
Like go through and perform some operation.
So that's what I was kind of talking about,
maybe a little too high of a level.
But that's how you kind of, it gives you this API,
but it really is kind of almost classes
and things that make it just look like, to Jason's point,
it looks like you're writing Java code
on some fancy hash map.
And in reality, you're instructing Crunch
to in the back end figure out how to arrange the map
reduces to be able to do that. Yeah, totally, totally. So there are some weaknesses to MapReduce.
No! I know, after we sang the praises you'd think it would be just a panacea that would just be a
cure-all, but no. One of the problems with MapReduce is it takes a long time to spin up and spin down.
And so what we mean when we say that is spin up refers to, for example, let's say you're building a multi-threaded program.
So the first thing you'll do is make the thread pool and actually ask the operating system for the threads.
So that's kind of like our spin up time, right? And so to ask the operating system for some threads happens in what?
Milliseconds? Maybe even less than that, microseconds.
But to ask MapReduce for 100 machines to do work on or to ask a Hadoop Zookeeper for 100 machines is going to take on the order of seconds at least.
I mean, it really just depends on your cluster, but it's not going to take milli- or microseconds, right? So if you have, you know, a MapReduce job that adds two numbers, like, goes through two files and then adds the numbers together, and the files are a meg each, you might just want to do that in C++, you know, because by the time you spin up the machines, you know, send numbers to the machines and, you know,
then compute the result and then store the results somewhere.
Yeah, your C++ program would have run like 40,000 times or something.
So there is a problem there.
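A back-of-the-envelope way to see that trade-off, as a small Python sketch. The model (a fixed spin-up cost plus work split evenly across machines) and the numbers (10 seconds of spin-up, 100 machines) are assumptions picked for illustration, not measurements of any real cluster.

```python
def cluster_seconds(work_seconds, machines, startup_seconds):
    # Fixed spin-up cost plus the work split evenly across machines.
    return startup_seconds + work_seconds / machines


def worth_distributing(work_seconds, machines, startup_seconds):
    # True only when the cluster finishes sooner than one machine would.
    return cluster_seconds(work_seconds, machines, startup_seconds) < work_seconds


# Assumed numbers: 10 s of spin-up, 100 machines.
tiny_job = worth_distributing(0.01, 100, 10.0)    # e.g. adding two 1 MB files
big_job = worth_distributing(36000.0, 100, 10.0)  # ten hours of single-machine work
print(tiny_job, big_job)
```

Under these assumptions the tiny job loses badly to a single machine, while the ten-hour job finishes in roughly six minutes on the cluster.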
Yeah, there is something called worker pools.
And the idea is, you know, if I have a job that I want to run over and over again, I
can sort of prep Hadoop and I can say, look, Hadoop,
you're going to do this thing a hundred times.
So get ready for that.
And then it'll do some optimizations there.
Also, we talked about all these wonderful things
of bringing on new computers
and bringing down computers
and all these stages and divide it
and just, oh, do this little bit of work
and then do this little bit of work.
And that helps it be really scalable.
But I mean, as you can imagine, all the data that's in your program, all those classes and everything, need to be able to be serialized out, and then when they're read back in they need to be deserialized. And sometimes, you know, it can be said, like, oh, I'm going to do this on the same machine, but a lot of times you have to do it anyways, because you don't know, like, oh, am I going to be doing this on the same shard or not, or is it going to go somewhere else?
Yeah.
And so it's just, like, if that takes a long time because you have, especially, some nasty intricacies, it could be a problem. But even just in general, like, that's going to be overhead.
Yeah. And then the other thing, and there's some theoretical reasons why this is true that I'm not going to get into because it takes a long time, but there's a lot of materialization. So, you know, each of your mappers, going back to our word count example, each of those things that scans one page of the web has to put those results, all those key-value pairs, on disk, on that Hadoop file system. And then the reducer, the shuffler, sorry, that's putting together all the keys, has to read those from disk and then put the reduced key-value lists on disk. And then the reducer has to read from disk and then put the answer on disk, right?
If you were just writing some C++ program, you would just put everything in memory, or maybe you would only put, you know, a bit of that on disk, like one stage of that on disk. So there's a lot of disk I/O when you're using Hadoop.
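To make the serialization and materialization steps concrete, here is a single-machine Python sketch of the word count example in which every stage writes its output to a file and the next stage reads it back. The file names, the JSON-lines format, and the helper names are all assumptions made for this sketch; Hadoop itself uses its own file system and serialization formats.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_stage(path, records):
    # Serialize every record to disk, one JSON document per line.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_stage(path):
    # Deserialize the records a previous stage left on disk.
    with open(path) as f:
        return [json.loads(line) for line in f]


def word_count(pages, workdir):
    map_out = os.path.join(workdir, "map.jsonl")
    shuffle_out = os.path.join(workdir, "shuffle.jsonl")
    reduce_out = os.path.join(workdir, "reduce.jsonl")

    # Map: emit (word, 1) for every word, then write to disk.
    write_stage(map_out, [[w, 1] for page in pages for w in page.split()])

    # Shuffle: read from disk, group values by key, write to disk.
    groups = defaultdict(list)
    for word, one in read_stage(map_out):
        groups[word].append(one)
    write_stage(shuffle_out, sorted(groups.items()))

    # Reduce: read from disk, sum each list, write the answer to disk.
    write_stage(reduce_out, [[w, sum(ones)] for w, ones in read_stage(shuffle_out)])
    return dict(read_stage(reduce_out))


with tempfile.TemporaryDirectory() as workdir:
    counts = word_count(["the cat sat", "the dog sat down"], workdir)
print(counts)
```

An in-memory version would need none of the write_stage and read_stage calls, which is the overhead being described.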
There are some tools to sort of make Hadoop a little easier to use.
One of them is Pig.
Pig and Hive.
And so they're good if you don't want to have to write, like, raw Hadoop code. You can just put your data in Hive, or put your data in a flat file like a text file, and then you use something like Pig or Hive. They have their own programming languages, which are much more terse than Hadoop.
Yeah, I mean, with a lot of these things, just the planning can take a long time, even. It's like figuring out, like, oh, how exactly do I want to divide my data? You know, getting my data into the right place, in the right format, you know, that can take a lot of time.
Yeah, totally.
And so these tools will help you with that.
Yep.
So we have Avro.
I have no idea what that does, so that's going to be all you, buddy.
Oh, Avro is like Thrift.
It's like a serialization thing.
Oh, okay, okay.
Yeah, yeah.
We already talked about this.
Yeah, totally.
I'm reading your points.
So what is Hadoop used for?
What is MapReduce used for?
Everything. Everything. Everything.
I think in the original white paper on MapReduce they said that it was, I don't remember, I think it was over half of computations at Google were part of a MapReduce. I think it was some very large number.
Oh, interesting.
But yeah. So, and Google is like a huge company. So, you know,
and I think I'm pretty sure Facebook and all these companies have probably similar statistics.
It's kind of interesting to see what companies will say what, right? It's kind of like a
little bit of a game, like, oh, hey, we're doing this. Well, we're doing, you know, but
you know, it's like, who knows what percentage of the actual value those things are? They're just kind of one-upping each other.
Yeah, but only, like, slightly.
And others are just keeping quiet in the corner, and it's like, oh, are they embarrassed, or are they like, oh, this is ridiculous, I'm just laughing, right?
Like, yeah, it's kind of interesting. But I mean, no, seriously, I mean, MapReduce, it pops up all over the place, like, um, you know, even for doing, like, image processing.
Like, oh, you want to, like, do a whole bunch of, you know, look for objects in something.
Or you want to run the same little algorithm on things.
I mean, people have done MapReduce to do that.
People have done MapReduce even for, like, I saw one guy who was writing a little bit of a tutorial. Like, I think he was trying to convert, like, old PDFs and, like, do OCR on them.
Like, he had a whole bunch of them. Each page of old books that had been scanned was either just an image or a PDF.
He wanted to do OCR on all of them and then put them somewhere.
But you could imagine if you have millions and millions of these pages, it would just
be a nightmare to figure out a C++ framework and structure or Java structure to do that.
That could just take a long time.
But that's right in the wheelhouse of MapReduce.
Yeah, yeah, totally. Yeah, there's a ton of things you can do with MapReduce. The biggest thing is, you know, let's say you're an indie developer, you're writing code out of your house, right? But you want to do really big data kind of things. Like, for example, let's say you have a website, you have a few thousand people on your website, and you're capturing all their clicks and things like that.
Let's just say your website gets a thousand people a day, and each person clicks once a minute. So there's how many minutes in a day? I don't know. Let's just say they click once in a day. Okay, so you have a thousand clicks a day. That's still 30,000 clicks a month. So if you wanted to go back through a year's worth of data, that's what, 365,000 clicks? That's a lot of data.
And so, you know, you don't even want to do that on your machine, because it's going to take forever. So if you wrote a MapReduce, you could, you know, run it on your machine on, like, one day's data, make sure it works, and then say, look, Amazon, you know, you have this elastic cloud that's totally awesome, you know, chug on this data, like, run this MapReduce that I wrote. And, you know, Amazon will charge you however much that costs, maybe five bucks or something.
Or you do it accidentally the wrong way, and then what do you do? I don't know.
You just accidentally misconfigured something
and it's really big.
It's like some really
outlandishly large number.
You probably can set a budget, I'm sure.
I'm sure.
All of a sudden, now you can run that same
program on a thousand computers that
you don't have to own. It's pretty awesome.
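For reference, the click arithmetic from that example can be worked through in a few lines of Python, including the once-a-minute case that was set aside. The function name and the 30-day month are assumptions for this sketch.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def clicks(users, clicks_per_user_per_day, days):
    # Total clicks = users x clicks per user per day x number of days.
    return users * clicks_per_user_per_day * days


# The simplified case from the conversation: one click per user per day.
per_month = clicks(1000, 1, 30)
per_year = clicks(1000, 1, 365)

# The case that was set aside: one click per user per minute.
per_year_busy = clicks(1000, MINUTES_PER_DAY, 365)
print(per_month, per_year, per_year_busy)
```

At one click per day the year comes to 365,000 clicks, as stated; at one click per minute it comes to about 525 million, which is closer to the scale where a cluster starts to pay off.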
I think that's a wrap for this episode. Yeah. Hadoop is great. Learn it.
It's fantastic. We hope you will have or will have had a good Christmas. Oh, yeah, that's right.
Yeah. In the future, we hope Christmas will be awesome. Will have been awesome. Will have been
awesome. That's pretty, pretty wild. I don't know.
All right, till next time.
See you later.
The intro music is Axo by Binärpilot. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide attribution to Patrick and I and share alike in kind.