Programming Throwdown - Hadoop

Episode Date: December 26, 2012

This show covers Hadoop, a set of several languages and libraries for working with big data. Tools of the show: Emacs and Chrome Browser Sync. Books of the show: Hadoop: The Definitive Guide ...http://tinyurl.com/cp3mw32 and Anathem http://tinyurl.com/cas8bux

Transcript
Starting point is 00:00:00 Hosting provided by Host Tornado. They offer website hosting packages, dedicated servers, and VPS solutions. HostT.net. Programming Throwdown, Episode 23, Hadoop. Take it away, Jason. Hey, everybody. So we actually got a pretty awesome question from the audience that we want to kind of start the show with. We have an audience?
Starting point is 00:00:30 I thought it was just you and me and our moms. So yeah, our mom had this awesome question. It was, how's it going? Wait, wait. Our mom doesn't make any sense. Oh, yeah. No. Never mind.
Starting point is 00:00:40 Your mom? This sounds like we're making bad mom jokes, but I'm not really. Yeah. I think I just dissed my own mom. Anyway, so Marco Aurelio sent in a question, and he's looking at doing some C# .NET coding for a website that he's interested in making. And he was sort of interested. His question was, you know, he knows that with, like, server-side stuff, the
Starting point is 00:01:06 code and the CPU is used on the server. And so if you have one user, you're using some amount of CPU. If you have a thousand users, you're using, you know, maybe not a thousand times as much, but the CPU scales up, right? So his question is, how can big websites... like, is it true that big websites like eBay and Amazon and things like that run things on the server side? And how do they do that? And how does the server not explode? And then along those lines, what languages should he focus on learning? Yeah. So, I mean, it sounds like he had a question there. He has an idea. And so his idea was to kind of build a website, but he's worried about it scaling and what technology
Starting point is 00:01:42 he's going to choose. Yeah, totally. So I have a slightly different answer, and mine is that I think sometimes in the tech community, and as engineers, you know, you over-engineer things. So this is a classic problem. And so having the issue where you can't scale your website because you're growing too fast would be, like, a very good problem to have. I mean, there are certain things you don't want to do to aggravate it, especially if you get to the point where somebody is funding you, or you're making a lot of money and your website goes down and people are depending on you. That's a really bad problem.
Starting point is 00:02:16 Right. But when you just have an idea, it is more important to get the idea out there, to start working on it, to find other issues, like, okay, I need to shift this idea slightly, or I need to do this, than to spend tons and tons of money trying. You know, like, if you ever read about some of these people that started, like, Google, it was two guys and some university equipment. Or, you know, Mark Zuckerberg at Facebook, it was just, like, running in his dorm room off his computer, right? Like, I mean, none of these things started with
Starting point is 00:02:43 like I'm going to start with 10,000 computers and I'm going to serve millions of, I mean, that's how you go bankrupt, right? Did you know, so sorry to interject, but Stack Overflow, the website that has, you know, many questions and answers for programmers, it actually started with the two founders answering all the questions themselves.
Starting point is 00:03:00 Nice. Yeah, isn't that crazy? So they just, and I don't even, I think they were talking about it. They didn't even have a website. Like, they had this form where you'd click submit, and then they would create a .html file with your question and answer. Like, that's how they started.
Starting point is 00:03:14 And now Stack Overflow is gigantic. There's like probably thousands of questions a second, right? Right, yeah. And I mean, I think even you take that example, right? So there's an idea, and you can read a lot about this. A lot has been written. And this isn't exactly an entrepreneurial show. I don't think I said that correctly.
Starting point is 00:03:28 I think it's pretty entrepreneurial. But okay, we're kind of, okay, all right, all right. But minimally viable product, that's kind of like a word, right? So the idea there is, in fact, don't even make a back end. To Jason's example here of Stack Overflow, whether it be urban legend or true, the internet shall tell us. But, you know, if you just make a front end that looks like the website, where people can kind of try it out and test it, you can get a ton of feedback and save yourself days, months, years of development
Starting point is 00:03:54 time. So that's kind of where to start. And then, um, you know, from there you can kind of, Jason will kind of address like how, how does that work? Like how does one serve millions of users a second? An hour? A day. Millions of users a day. Let's go with that. So, yeah, I mean, but to Patrick's point, just to wrap that up, even when I was making the Trivipedia, that trivia Wikipedia thing,
Starting point is 00:04:19 I started off with just having a flat file, and you just scan through that file for everything. So if I needed to find a user by name, I had to search through the whole file of, like, thousands of users or hundreds of users to find that one. There's no index. Millions of users? Now there's, what, like 2,000 monthly active users. And so let's say there's 100,000 users in the system. So I couldn't do that anymore. Like, I couldn't just scan through all the usernames. And so I ended up making ZombieDB, which is open source. And I think I posted on the podcast on G+,
Starting point is 00:04:57 on the podcast's page about that. But so I had to actually make a database. But that came months later, right? And if no one was interested in Trivipedia, then I would have saved myself the trouble of having to make that database, right? And I probably saved myself the trouble of doing other things, which I would have done if there were millions of users, right? So yeah, to Patrick's point, you totally want to start small.
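The flat-file lookup Jason describes, and the index he later needed, can be sketched roughly as follows. This is only an illustration under stated assumptions: the `users` records and the function names are hypothetical, an in-memory list stands in for the flat file, and nothing here reflects how ZombieDB actually works.

```python
# Hypothetical user records, standing in for the lines of a flat file.
users = [
    {"name": "alice", "score": 10},
    {"name": "bob", "score": 25},
    {"name": "carol", "score": 7},
]


def find_by_scan(records, name):
    """Linear scan: check every record until the name matches."""
    for record in records:
        if record["name"] == name:
            return record
    return None


def build_index(records):
    """Build a name -> record lookup table once, up front."""
    return {record["name"]: record for record in records}


def find_by_index(index, name):
    """Indexed lookup: one dictionary access instead of a full scan."""
    return index.get(name)


index = build_index(users)
print(find_by_scan(users, "bob"))
print(find_by_index(index, "bob"))
```

The scan is fine for a few hundred users and costs nothing to build, which is the point being made here. The index only pays off once the list is large enough that walking all of it on every request becomes the slow part.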
Starting point is 00:05:18 Don't worry too much about your server blowing up or anything like that. But now let's answer the question from a tech standpoint. How does this work? Do all these guys like Amazon and Google and eBay, do they all just write JavaScript or do they do stuff on the server? So they do things on the server, and it works through a process called sharding. And Patrick has already helped me out here,
Starting point is 00:05:42 but I'm going to take a first crack at it. Typically, you'll go to, let's say, Amazon.com. What'll happen is that'll go to a front-end server. So this server will get your request, and it'll say, and there's several front-end servers all around the globe, so you'll go to the front-end server that's closest to you. So it'll take your request, and it'll say, okay, what servers are pretty
Starting point is 00:06:05 you know, lightly loaded? And the servers themselves, the back ends, are constantly telling the front end, hey, you know, I'm not busy, or hey, I'm slammed processing people's orders, right? So the front-end server will look for a server that's not busy and then redirect you to that server. So then your request will go through to that server, which will do all the things like query the database, see if you're logged in, do all that stuff. And keep in mind the database itself is on another server. So you have... Set of servers.
Starting point is 00:06:39 Yeah, set of servers. And so there's this other process going on where the database has a front end, which, you know, keeps the load pretty balanced among all those database servers. And so this process of having all of these machines sort of working together and having the least used machine, you know, handling your request is what keeps websites like Amazon and eBay up and running. And a big part of doing this... you know, what we're going to talk about on this show, Hadoop, is more for, like, batch processing. It's not things that you're going to do, you know, in real time when you're accepting a web request,
Starting point is 00:07:14 but a big part of doing anything on the web or with big data involves sort of managing many different machines and routing work to one machine or the other. That's right. Yep. And so, I mean, yeah, and I think there exists server back-end technology for many of the most common languages for people to write in. I mean, there's even Node.js, so you can write JavaScript on the back end
Starting point is 00:07:38 and use that as a server. You know, all the major ones, I mean, Java, C++, I would assume C# as well. I mean, all these things have back-end pieces that you can write to. There's middleware kind of stuff, let's call it, that handles this load balancing and splitting. And then you write your server in your language on the back. So, I mean, it's slightly more than what language to use. It's kind of like understanding this underlying problem.
Starting point is 00:08:04 But you don't really know where the scaling problem is going to be until you really are kind of almost done, right? So if you think about something like Slashdot, let's take Slashdot, right? So they have news stories which are being displayed, but their bottlenecks are going to be very different than a site like eBay, which is handling transactions and counting down auctions, or Amazon, which has to handle when something sells out and they have no more and they don't want to oversell something. These things are, there's amounts of traffic,
Starting point is 00:08:32 there's hits on data. I mean, those things are going to have different bottlenecks at different points along the growth curve of users and on how many transactions are being handled and that kind of stuff. So, I mean, you really kind of, if you solve it too early, you risk solving the wrong problem. Yeah, that totally makes sense.
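The routing described earlier in this segment, where back ends report how busy they are and the front end sends each request to the least busy one, might look something like the toy sketch below. The `Backend` and `FrontEnd` classes and their methods are hypothetical names chosen for illustration; real load balancers also deal with health checks, timeouts, and network hops that are left out here.

```python
class Backend:
    """A back-end server that reports how busy it is (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.load = 0  # most recent load figure for this server

    def report_load(self, load):
        """The back end tells the front end its current load."""
        self.load = load


class FrontEnd:
    """Routes each request to the least-loaded back end."""

    def __init__(self, backends):
        self.backends = list(backends)

    def pick_backend(self):
        # min() returns the first back end with the smallest load.
        return min(self.backends, key=lambda backend: backend.load)

    def route(self, request):
        backend = self.pick_backend()
        backend.load += 1  # count the new request against that server
        return backend.name


servers = [Backend("a"), Backend("b"), Backend("c")]
servers[0].report_load(5)
servers[1].report_load(1)
servers[2].report_load(3)

front_end = FrontEnd(servers)
print(front_end.route("GET /orders"))
```

With the loads reported above, the request goes to server "b", since it reported the lowest figure. The same idea applies one layer down, where the database has its own front end spreading queries across the database servers.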
Starting point is 00:08:48 So, all right, well, on to our news. News. I have the first one. All right. Okay, yeah, so I'm doing the first one. I saw an article linked today on Hacker News, which is interesting, about the Fast Fourier Transform. So this is kind of like an FYI.
Starting point is 00:09:03 It's not exactly news. It's not something updated. But it was a pretty good article. I kind of enjoyed reading it. So I have a little bit of background with what the Fourier transform is. So without getting into the deep math, and this guy does a good job of explaining it thoroughly, but if you just kind of skim it, you know, you can also kind of get the gist. So Fourier... There's code, too. Yes, and he even posts, like, more detailed... This website's actually very interesting. We'll have to take a look at more and see if there's any other articles that would be really interesting to you guys, or you guys just take a look and figure it out on your own. But he covers a wide range of topics. But a Fourier
Starting point is 00:09:36 transform, in brief, is going from the time domain to the frequency domain. So if you have some data that changes over time, and the example he uses is an audio recording, so you have digital samples over time, and you're interested to say what are the most common or most prominent frequencies in that sound clip, then you want to transform from that time data to frequency data, which says, you know, at 80 kilohertz... I don't know if that's too high. I'm not good with my audio. I think that's right. Okay, 40. So, okay, some number of hertz, you know, that this is the peak signal. So, like, oh, okay, you know, maybe that gives you some information, or you want to do some sort of mutation of the signal and then transpose it back. But this is something that's very common,
Starting point is 00:10:21 you'll come across a lot in a large amount of domains, image processing, audio processing. I mean, even stuff that seems completely unrelated, it can come up. But yeah. Yeah, compression of all sorts and things like that. Yeah. And this article talks about a specific kind of version of the fast Fourier transform
Starting point is 00:10:42 and also the discretized, discrete version of that. So going from math to kind of practical there, and a lot of implications. So it's a pretty good treatment. So if you've ever heard of that and wondered what it was, or you're trying to learn more about it, I would definitely check out this article. Seemed like a pretty good description. Yeah, it's totally awesome. This is great. I'll have to give this a read. Cool. So I'm doing the next article. It's on the OUYA. Oh yeah! Oh yeah! Oh no, that's, uh, that's that guy who crashes through the window, the Kool-Aid Man. Yeah, the Kool-Aid Man. The Kool-Aid Man's coming out with a console? What? Pretty awesome. Yeah, hopefully it's like bullshit and you could pour... But anyways, so the OUYA is coming out. I'm pretty excited about it. I haven't ordered one because Patrick has scared me off of Kickstarter.
Starting point is 00:11:28 But I'm super interested in OUYA. I still might order one off their website, even though the Kickstarter is finished. So is it available for ordering off their website? Yeah, you can actually pre-order it. Is it the same price? A little bit more? I'll find out. Oh, that's okay.
Starting point is 00:11:43 But basically, you can order on the website. It comes, I think, in March or April. But it is a... For people who missed the episode on the OUYA, it is a... I'd just describe it as an Android-powered console. So it's a console with a controller and all this good stuff, but it's totally Android.
Starting point is 00:12:01 So any existing Android games, you can play using the touchpad on the controller. And any new Android games will be able to support the hardware buttons of the OUYA. And it's pretty awesome because we've never had this, right? We should say it hooks up to the TV, so it's not like a phone. Oh, yeah. It's a little box that plugs into your TV, just like Xbox. Yeah, or PS3 or something.
Starting point is 00:12:22 We've never had an open system. So a little bit of history on consoles. The way the PlayStation and Nintendo Wii and Xbox, the way these things work is, well, unless you're Nintendo and you're awesome, but let's talk about everybody besides Nintendo. The way these work is you lose money on the console. So every time you buy a PlayStation,
Starting point is 00:12:43 Sony loses money. Well, initially. I don't know if that doesn't normally hold true over the whole life of the console. Oh, because they're cheaper. Okay. Continue. I'm sorry. Interruption.
Starting point is 00:12:54 So the point is, at least they don't make a ton of money, suffice it to say. That's not the main way they make money. Right. So they actually make money by charging developers for a dev kit. So, for example, if you want to make a PlayStation 3 game, then you buy this dev kit. And the dev kit has a little bit of hardware, but I don't even know if they have hardware nowadays.
Starting point is 00:13:16 It's probably all in software or firmware. But they'll send you probably a PS3 with some special firmware. But they'll charge you for the dev kit. I think it's like $200,000 for a dev kit. And the only reason why the price is so high is so that they can recoup some of the money. And the idea is, you know, people who buy your game are going to give you money, and you're going to pass some of that money along to the hardware manufacturer. Well, I mean, so it's worse than that, right? So, I mean, it's that for every copy sold, they have to, they put DRM on the game.
Starting point is 00:13:48 So if you don't have, but it's, I guess this is kind of almost slightly different than normal DRM, which is that if I have my own media, but that media is not signed by Nintendo, I can't play my media on the Nintendo device. Right. So you have to go to Nintendo and they, you know,
Starting point is 00:14:05 I'm doing air quotes here. They test your software to make sure it's not going to break the console or isn't a virus or, you know, anyone,
Starting point is 00:14:12 it doesn't crash. It's good, right? And then basically with their permission, then you can turn around and you're allowed to sell that game.
Starting point is 00:14:19 But you owe them, you know, X percent or dollar amount. Oh, that's true. Per everything that you sign. Yeah. That's the most common. I believe that's like a very common way. Yeah, I think there's an initial
Starting point is 00:14:29 like one or 200K investment just to keep, like, the small people out of the market, and then there's also a rev share. So this is from talking to a friend of mine who works for EA, like, a long time ago, so it might be different now. But yeah, so the point of it is, you know, Patrick and I can't make a game for the PlayStation 3, right? I mean, we just don't have those resources at hand, right? So you have to be a big publisher, or Nintendo has to come to you. Like, World of Goo is a common example, where they're a really popular indie game. They did really well in Flash, and I think they were out on mobile even before the Wii, and then Nintendo contacted them and some agreement was arranged, et cetera. The thing about the OUYA is it's a completely open platform. So, like, any game that anybody puts out for Android, like
Starting point is 00:15:15 Trivipedia, you could play it on the OUYA. Like, any Android program out there will run. And so it really just sort of opens up the game console experience to the whole world of developers, and that's what's really exciting to me. And the general trend has been along those lines, right? So, like, even the Xbox, there's, like, was it Xbox Live Arcade? Yeah. Okay, maybe I'm butchering it. I think PlayStation has something similar, and Wii U has something similar, and even maybe the Wii did, right? So the main games that you go to, you know, Toys R Us or Target or Best Buy or, I don't know where everybody goes these days, whatever the local equivalent is in your country, to buy a game, right? There's
Starting point is 00:15:54 those, which is kind of what we're talking about, but then there's these others. But those still have limitations. Like, I remember reading something about when Minecraft came out for the Xbox, right? So they had some people doing this, and then they had an update, and then it was like, oh, if they wanted to update it, like, a patch, you're supposed to go through all this testing again, but that testing costs money. So it's like, they wanted to just push it early and push it often, but that was not compatible with how the money was set up, and so it caused kind of a problem. Yeah, yep, totally. And so sometimes people leave kind of, like, broken games because they just don't have money to push a fix for
Starting point is 00:16:30 it, and that's bad. And the other problem, too, is although there is an Xbox Live Arcade, there isn't a developer... there's a developer ecosystem, but not to the extent that these open platforms could be. Yeah, yeah. Like, for example, we talked last year about that Cocos2d-x, which is, it's not a game, it's just a library that you can use to make games. Well, for you to do that on Xbox Live Arcade, I'm assuming you would have to be able to, like, publish the library. I'm not quite sure how that would work. But on an open platform, you just give all the source code, and the person can take your source code and link in the Android libraries, and it just kind of works. The ecosystem is much more healthy on Android. And so this is a very positive development, right?
Starting point is 00:17:10 So, like, OUYA is on track. They started shipping the developer consoles, which are very similar to the regular consoles but already kind of opened up so the developers can do what they need to do, run unsigned code, that kind of stuff. And so this is a positive Kickstarter story. Yay, Kickstarter. So hopefully they'll stay on track, and early next year we'll be talking about the reviews of Jason's OUYA console. Yeah, totally. All right, so I think you have one now. All right, so, future of the web. The
Starting point is 00:17:38 future of the web would be HTML8. Oh, no, no, man. Maybe that's far in the future. HTML5? Yeah, at this rate, that's like the year 3000. It's finalized. You know, this is one of those stories. I mean, it's kind of interesting just to talk about that. I think it's kind of slow, the ratification of these things. Like, we've already kind of entered the HTML5 zone in my mind. Like, it's generally used, right? But yet only just now did everybody kind of agree on what that meant to some extent,
Starting point is 00:18:07 which still doesn't really mean anything, for all sorts of reasons. But you hear this, like, remember when wireless, Wi-Fi N, 802.11n was coming out, right? It was Draft N. You could buy Draft N routers. Oh, yeah, I remember that. And Draft N routers weren't guaranteed to be compatible with N.
Starting point is 00:18:21 And then, like, a year or two later, right, then it was like, oh, now you can get the actual N ones. Yeah. Or I remember, too, like C++, right? Even that went through this whole thing about, oh, we're going to come up with a new C++0x, and it took over 10 years, and then it was like, finally, C++11 or whatever. It's hilarious. So, I mean, sometimes we forget that the world moves on past these standards bodies, but the standards bodies do play an important role in kind of saying, like, this is, you know, what HTML5 really is and really contains. But even then it's not done.
Starting point is 00:18:54 Like, they still have other work to do to kind of make sure everything's okay. Finalized doesn't mean, like, it's done. It means, like, oh, now we can move to testing and compatibility and, you know, actually making it an official HTML5 thing. Yeah, it's really just almost like a social or political thing. It's basically, you know, like, let's say everyone complains about Internet Explorer, right? Especially older versions of Internet Explorer. Like, Internet Explorer 5 doesn't do what, you know, Firefox and Chrome of the same era do. Or maybe 5 is too old, but you see where I'm going with this. So this is their way of saying, you know, HTML5 is real, you
Starting point is 00:19:33 can make content for it, and if someone doesn't support it, that's on them, you know. And so it's like a political thing to motivate people to switch. And I'm pretty sure if you're using an old browser and you go to a website nowadays that has HTML5 content, they'll actually have a little link saying, hey, your browser is super old. You can't display this website correctly, and you should go get a new browser. Whereas before this specification was complete, they would just try and make it support everything.
Starting point is 00:20:09 Yeah, and I mean, we should say also this is by the W3C Foundation or W3C Console. I don't know what the C there is. Oh, consortium. Okay, thank you for saving me. The World Wide Web Consortium. Yes, okay. And it is a good thing.
Starting point is 00:20:25 It is, you know, that they've finalized it. And also, they've taken time to announce that they've begun the draft of HTML 5.1. Woo! Oh, man. So did they keep the peer-to-peer? HTML5 was supposed to have peer-to-peer support. But, like, no browsers have actually implemented it to this day. But, I mean, so basically what that means is you could have a BitTorrent client totally written in HTML.
Starting point is 00:20:53 Like, you just go to a website and you just start torrenting, like, on the site. You know, it's just craziness. And I know I read an article where they said, you know, no browser will ever support this. Yeah, I don't know. I don't see anything about it in my research that I'm doing right now.
Starting point is 00:21:12 It's all right. Maybe we'll have a future topic about this. Yeah, totally. All right. So I think you're on. You're up next. Tell me about some sweet desserts. Some sweet desserts? Well, I've just been eating some raspberry pie.
Starting point is 00:21:23 Oh! Yes. Been eating from the Raspberry Pi store. You bought raspberry pie at the store? I did. It was delicious. Yeah, it was one of those Safeway specials, but you have to have the card where they harass you,
Starting point is 00:21:36 you know, the Safeway club card. So, like, now it's like every time I go, they want me to buy another raspberry pie. I'm like, no! No, you people! So, Patrick is like... So, okay, so the Raspberry Pi store has just launched, what, today or a couple of days ago? But it's a pretty cool idea. If you own a Raspberry Pi device, you can go to this store, you can download apps. I'm assuming you
Starting point is 00:22:01 somehow transfer them to the device. Maybe you download them with the device itself. And then you can just start using them. So they have a number of free apps. There's Freeciv, the Civilization clone, and some other free and open-source games that seem to be ubiquitous. Like, Battle for Wesnoth seems to be on every possible thing. Like, I have Battle for Wesnoth on my Gumstix, my phone, like, three computers. So a lot of these open source games, it's just as soon as something new comes out, they try and get themselves on there.
Starting point is 00:22:32 But there's a number of awesome things that you can install. Some are free, some cost money. But if you have a Raspberry Pi or if you're thinking of getting one, definitely check out the store.raspberrypi.com and see what's available. So is the Raspberry Pi the Ouya before the Ouya?
Starting point is 00:22:49 That's a good question, right? I mean, a lot of these are games. It's like a store and a game. I mean, it doesn't run Android, but I mean, shy of that. And it's not nice and bundled. Yeah, that's a good point. Actually, a coworker of mine,
Starting point is 00:23:02 I was telling him about the Android Stick PC that we talked about on the last show and how I want to get one for Christmas. And he was saying. Wait, do you really think you're going to get, I'm just like, you're going to buy one for yourself for Christmas? No. We should talk about things you want for Christmas. But like, I normally, those kinds of things is like nobody, like really? Like who's going to. So this is kind of, so my parents, ever since I was maybe like seven, they've always just asked me what I want for Christmas because they don't know anything about computers and things like that.
Starting point is 00:23:31 So as tradition holds, my mom asked me what I wanted for Christmas. And you explained to her, like, go to this specific page. I had to give her a link. Yeah, yeah. I had to give her a tiny URL. Cool. But, yeah, so my coworker said, well, you know, if you get this, then you're stuck with Android. You should get a Raspberry Pi or a BeagleBoard or something.
Starting point is 00:23:52 And that is kind of true. Like, if you get a Raspberry Pi, then you have just a pure Linux shell with, you know, you could sudo apt. You can install packages. You can do whatever you want. If you have Android, then you're stuck in the Android ecosystem. And so I haven't yet decided which route I'm going to go. You already sent the link, dude.
Starting point is 00:24:11 I haven't sent it yet. Oh. Oh, it's sitting in my saved messages but not sent in Gmail. So, yeah, I'm holding on. Okay. All right. Well, I guess we'll get an update about this. This is the last one we do.
Starting point is 00:24:24 Sure, everyone wants to know about my mom's five. We're recording this before. Well, no, we want to find out which camp you fell into, which you decided. Not necessarily which you got for Christmas. Gotcha. But I don't know if we'll release this before or after. We're recording this before Christmas.
Starting point is 00:24:38 That's true. Yeah, we're recording it in advance because we're both going to be gone for Christmas and New Year's. So we're recording some shows in advance. So we will find out. So we will have already known what we will have gotten. Yeah, when you guys listen to this, I'll be playing with my
Starting point is 00:24:54 oh, we just caused a quantum... what is it in Back to the Future, where you've broken the space-time continuum? I don't know. Anyways, yes. Time for Tool of the Show. Tool of the show. This is tool of the show. So my tool of the show, this is like the tool that I use more than anything. Like, more hours of my day are spent using this tool than probably any other program or anything. Sledgehammer? That's a measuring tape. It's called choose now.
Starting point is 00:25:27 So my tool of the show is Emacs. And, uh, Emacs is a web editor... or sorry, Emacs is a text editor. Web editor, I don't know where I was going with that. Text editor. But it is so much more than that. It has a ton of features, a ton of macros. You can have, like, 20 files open in the editor at the same time. Well, you can have infinity files open at the same time. You can do things like find all occurrences of "Patrick" and change it to "gives Jason a hard time" in all 20 files. Like, you could just type escape and then find, replace all files, or whatever. So it has a bunch of, like, little macro support, and it has all these, like, cool tricks. It has a color theme picker. You have colors? So you start off with... You're blowing my mind. I know, this is insane.
Starting point is 00:26:18 You start off with, like, a white background and black text, which I just can't deal with. So I picked the Clarity theme, which has black text, like, a grayish, not completely white, because that's too much contrast. Like a nice soft gray. Coming next week: Jason's favorite Emacs theme. And just a bunch of cool macros. There's an Emacs wiki, which has a ton of things to make your life easier. Like, for example, just a short anecdote. You might have something like a file which has a number, colon, and then some text. And you just want the text after the colon.
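The cleanup in this anecdote, dropping a leading number and colon from every line, is the kind of job a regular-expression replace handles in one pass. Below is a rough Python analogue, not Emacs itself; the pattern, the `strip_number_prefix` name, and the sample text are assumptions made for illustration, and the pattern assumes each line starts with digits followed by a colon.

```python
import re

# Assumed line shape: optional spaces, digits, optional spaces, a colon.
LEADING_NUMBER = re.compile(r"^\s*\d+\s*:\s*")


def strip_number_prefix(text):
    """Remove a leading 'number:' from each line, keeping the rest."""
    cleaned = [LEADING_NUMBER.sub("", line) for line in text.splitlines()]
    return "\n".join(cleaned)


sample = "1: first note\n22:second note\n303 : third: with colon"
print(strip_number_prefix(sample))
```

Because the pattern is anchored at the start of the line, only the first number and colon are removed, so a colon that appears later in the text is left alone and the number can be any width.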
Starting point is 00:26:56 You just want to get rid of the number and the colon, right? So if you're just using Notepad, you'd have to do this by hand, line by line, right? But in Emacs, you... You could just do it from the shell. Yeah, you could write, like, a shell script to, you know, cat the file, go through sed and awk and things like that. But in Emacs, you actually do, like, a regular expression replace, like, almost trivially. Or you could use something that had a column editor. A column editor, but what if, like, the colon moved columns? Oh, that's true. Okay, that would be bad. Yeah. All right, you didn't fully specify the problem. So you can tell Patrick
Starting point is 00:27:32 thinks that Emacs is for people who, you know, are on magic mushrooms or something. Come on, man, where are you going with this? They're trying to make us lose our clean rating. What is it, so clean... But, like, what editors are you on, Patrick? I don't edit my code, man. I just think it. Oh, man. Like, you're writing binary? I just have telepathy with the computer. Oh, it's pretty... Do you have all those headsets? Yes. That was such a fad. I know, I wanted to get one of those things to, like, read your brainwaves and, like, hack that?
Starting point is 00:28:08 where if like you thought about moving right, it moved right. And I could never get it to work. But you have to train. I mean, almost all of them have always shown like you have to go through
Starting point is 00:28:15 a set of like exercises. Oh, yeah. Like the old dragon dictionary thing where you'd have to like translate the words you spoke or whatever or write them down. You'd have to read
Starting point is 00:28:24 like a chapter of Shakespeare to an act of Shakespeare to like train it. So that's the same thing. Like you need to go through some thing to train at least now. I did that, but maybe like I could be pretty scatterbrained. And so like I kind of think that when I was like I wasn't focusing on the training part of it. Just remember, there is no game. Yeah, there is no spoon. There is no spoon. So what's your tool of the show? Well, okay, so I dodged your question, but I happen to
Starting point is 00:28:51 actually be in that weird camp of nerds who thinks an IDE is like an amazing invention of modern times, where I can have 20 files open, I can have colors, and I don't need these crazy escape key combination macros. That's my camp. I do appreciate that sometimes you have to open up a shell and you have to do stuff there, or you're SSH'd into a computer. Yes, I understand. But that's like a thing of last resort. Yes. Okay, so in other words, you're in the 21st century. But I've just offended
Starting point is 00:29:31 all of our audience. No, I'm pretty sure I might be the only person amongst our audience who uses Emacs, I think. All right, here's my thing, and I'm really gonna get in trouble now, but it's okay, I'm gonna say it anyways because that's how I roll. All right. So I think people like the idea of being really, really nerdy, so they just say these things and they're not true. Like what? Like "vi or die." Oh, yeah. I mean, that's not an actual phrase, I just couldn't think of one. But like, you know, like, oh, I only use vi for everything. Or like, oh, I only, what's the, there's like cult of something. I don't know. All these things, right?
Starting point is 00:30:07 Like, oh, I only use Emacs. And I'm sure there are people who do. And I'm sure that's legitimate. But I really feel like most people, it's not really true. Yeah. I like, same thing as like, I remember there's all these people talk to like, oh, Linux this, Linux that. And it's like, what do you run at home? Like, oh, Windows. Yeah, right.
Starting point is 00:30:23 It's like, I don't have a problem with that. Just, like, own up. Like, if somebody makes something good, it's good. You don't have to be like, oh, it's made by Microsoft, it must be terrible. Yeah. Like, oh, come on. I mean, if it's good, it's good. There is a high level of zealotry with Emacs and vi. It's true. A lot of people hate, like, the iPhone and the iPad. Like, oh, they're terrible, right? And it's like, in reality, okay, fine, you might think that, but I think for most people it's just, oh, it's because it's Apple. Yeah. It doesn't matter if you could provably show that it was the best thing, people would still be like, oh, it's Apple, it can't be good. And then vice versa, too. Other people were like, oh, it's not made by Apple, it's got to be
Starting point is 00:30:57 terrible yeah so i think nerds like for some reason or geeks like really fall into this trap of like like we say we want to be open-minded we want people to accept our thing but it's like in reality we try to like i think everyone is though even like you could be i had a friend growing up who's really into skateboarding and he was always like oh you have a veriflex that's a piece of crap just because it's a veriflex you know i mean so i think everybody who's like really into something has this like brand loyalty. But yeah, so the thing about tech though is tech is very functional. So like to have brand loyalty for something that doesn't necessarily have a style or the style doesn't matter.
Starting point is 00:31:33 Like it's one thing to be like, I really like, you know, Express jeans. And it's like you do it knowing that they're not going to like make you walk faster, you know. But it's like, oh, I really like Emacs. And then to make the argument like, Emacs makes me a better coder, which I don't believe, then now you're turning something that is like a style, like a personal preference, into like, oh, I'm functionally like a better
Starting point is 00:32:00 programmer. But even if people say, like, oh, I'm more productive using Emacs than an IDE, their problem is like, okay, well, you know, did you try all the IDEs? Well, no. Did you try one? Did you stick with it long enough? And if you kind of dig, a lot of times it's like, well, you know, I tried it for like a week and I just couldn't stand it. Well, I mean, did you pick up Emacs and start using it hyper-productively within a week? No. Yeah, I mean, I doubt it, right? Maybe if you're like some savant, I don't know. Maybe. But like, no. Okay, anyways, my tool of the week is Chrome Browser Sync. It's pretty awesome. It's not really like a tool of the week. It's totally a tool of the week.
Starting point is 00:32:31 I totally copped out on all these. It's okay. So I really like this. No, your Snapseed tool was awesome. I'm still using that. Oh, good. I downloaded it during the show. And you said one of our other tool of the week you were still using.
Starting point is 00:32:41 Yeah, I'm still using KeePass. Yeah, so your tools are a winner. All right, all right. Better than Emacs. All right. I'm so in trouble. I extend the olive branch and I get my hand cut off.
Starting point is 00:32:51 I'm sorry. All right, anyways. So Chrome browser thing, if you don't know, Chrome is a browser made by Google. Yep. And one feature that I've been using Chrome, and I like it I mean
Starting point is 00:33:05 but I like Chrome and one feature that I've decided I really like is across my devices that if you have Chrome and you sign into it which
Starting point is 00:33:15 I mean we can get to all the debate but alright whatever so you sign into it and it'll synchronize your bookmarks it'll synchronize you can see what tabs
Starting point is 00:33:23 are open from your other computers. It's a really good feature. I think now, as somebody was pointing out, Firefox ships it by default or it's an add-on or something. I don't remember. But that browser sync functionality, and whatever browser, I happen to use Chrome,
Starting point is 00:33:37 so that's my tool of the week. But they may all have it by now. Yeah, browser sync is awesome. Turn it on. It's awesome. It even works with mobile. Right, so on my iphone i have the chrome and i can look at what oh what was that page i was looking up on my computer at home and i just go like my home computer and like oh that tab you know bring it open it's amazing like especially i notice it a lot when i'm using directions like i'll be at home and i'll look up directions for a place and then i don't want to
Starting point is 00:34:03 like, I mean, it's not a big deal to retype the address, but it's so much easier to just pull up your phone and say, show me my desktop tabs, and the directions are there. Like, done. Ready to go. How do you enter all those key combinations for Emacs on your phone? Okay. Escape key. Okay, I'm done. That's the last one, I'm done. Escape key. I'm done. Imagine if there was an Emacs editor for the phone, how hard that would be. Just a custom keyboard. Yeah, that's true. You'd need a keyboard that had like Control-S or something. I'm sorry, I'm done with the joke. Anyways, browser. All right, that's pretty fun. Time for book of the show. Book of the show.
Starting point is 00:34:46 That was pretty awesome, Echo. Okay, I'll go first. You went first last time. Did you write that in Emacs or was that a script on Audacity? I wrote it in Csound. In Csound. Okay, you guys can look that up later. Because you didn't get the joke.
Starting point is 00:35:02 Should I go first? No, you go first. So I don't have programming. Jason is going to be related to our topics. That's why he's going to go second. Okay. So I like sci-fi, science fiction. Totally.
Starting point is 00:35:13 And fantasy as well. This one, it's science fiction, but it kind of, yeah, a little bit of fantasy kind of as well. But mostly science fiction. And that is Anathem. This is a book by Neal Stephenson. It's really long. Like, I mean... The whole time, I've warned you, I thought that I misread this and I thought it was Anthem by Ayn Rand. I thought that's where you were going with this. Okay. Anathem by Neal Stephenson. And it's a really long book. I think it's like bordering on like a thousand pages or something. Oh, wow. So here,
Starting point is 00:35:46 the paperback is 981 pages so it's really long but it's really good so this is gonna sound like Patrick why is this the best book you have to stick it out
Starting point is 00:35:55 through the first couple hundred pages first couple hundred pages are a little bit rough but it is worth it it's like a huge investment but like when you make
Starting point is 00:36:02 the investment of like, oh, okay, what's going on here, it's really good. I don't want to do any spoilers about the farther part of the story, but it starts off and you're in a monastery, and it just starts describing like a whole world which is slightly different than ours. And you kind of begin to pick up that there's some things which are different but kind of the same. And Neal Stephenson just goes into great detail about all the functioning of this monastery. And it's not, lest you get confused,
Starting point is 00:36:31 it's not like a religious monastery like you would think of today or in kind of our world. So you kind of got to stick with it and see it's something slightly different. Again, I'm trying not to give away because some people really want to have all that surprise of the book.
Starting point is 00:36:47 So I don't want to give more than even the back cover would say. But I'd check it out. It's really good. And like I said, you gotta stick it out for the first little bit. It got good reviews. Yeah. You just gotta know you're gonna read the whole book. Like, just commit: I'm gonna read this whole book. And it's only 11 bucks. That's not bad. I mean, that's like, what, a cent a page? Yeah. So, I mean, you can get it on, you know, the Kindle or whatever through Amazon, here it's like $6. Oh, even better. And it's good, you know.
Starting point is 00:37:11 Or even the mass market paperback. I don't know what the difference is. It's only $9. Oh, right. So, okay. We're sitting here like reading Amazon prices on the air. This is fascinating podcasting. But check it out.
Starting point is 00:37:21 It's a really good book. I mean, I really like it. And if anybody has read it or does read it after this recommendation, like, let us know. Like, any of the books. Oh, that's a good point. So for people who bought the book of the previous show and people who buy this book, et cetera. Or already have it or whatever. Yeah, or already have it.
Starting point is 00:37:36 You know, if you like the book or if you hate the book, let us know. Yeah, I'd be curious to see if you have a better book. Let us know. Post on G+. Or if you hate this whole segment. If you hate the show, post on G+. Some people might be like, oh, Head First Design Patterns is decent,
Starting point is 00:37:52 but I also really like this other design pattern book. Feel free to post on the G+, and start a thread where you're only going to help out the rest of the community by contributing. If you want to leave a really horrible comment about what we should be fixing on iTunes, that's cool. Rate it five stars, because those are the community by contributing. Yeah, and if you want to leave a really horrible comment about what we should be fixing on iTunes, that's cool. Rate it five stars because those are the ones we read. And then just, like, tear it up.
Starting point is 00:38:11 It's fine. Yeah, if you rate it five stars, then we'll mark it as useful. There we go. Just kidding. Okay. All right, sorry. So that's Anathem by Neal Stephenson. Anathem with an extra A.
Starting point is 00:38:23 I mean, I'm pretty sure that's how you say it. I'm not exactly sure. No, you're right. Yeah, well, it's definitely not Anthem because that's missing an A. So my book is Hadoop, The Definitive Guide. And I've actually been reading this book for a while now, trying to sort of understand more of the Hadoop internals and the way Hadoop does different things on the backend.
Starting point is 00:38:47 So as somebody who, I feel like I can call myself a MapReduce expert. I might get destroyed for that by somebody who's a crazy expert. But I know a ton about MapReduce. I work a lot with MapReduce and Hadoop internals and things like that. So you just set yourself up. You know that, right? Yeah, I know. I'll bring it on. I feel like I can take it. I accept the challenge. Unless we get like Jeff Dean or Sanjay
Starting point is 00:39:10 Ghemawat, the original creators of MapReduce. Unless they comment. That would be awesome. Yeah, I could take abuse from them. It'd be worth it. So you guys, that would be programmingthrowdown at gmail.com or on our G Plus page.
Starting point is 00:39:27 Any of you want to say anything? Go ahead. So I feel like I've been doing a lot with MapReduce. We could have them on the show to argue with you. Oh, yeah. We should see. They're in the Bay Area somewhere. Okay.
Starting point is 00:39:38 Yeah. All right. So this book, even as somebody who does a lot of MapReduce and Hadoop, the book is incredibly useful. It's a great reference. It has a lot of examples. It really quickly answers the question, why do you use Hadoop, which we'll hopefully answer ourselves. But it explains it in a lot of detail.
Starting point is 00:39:56 And it has a ton of extra content. So if you like this show and you're interested in Hadoop and things we talk about, this has a ton of great examples. And it's got a cool elephant on the cover dude it's pretty epic it must be awesome now do you know why it has an elephant on the cover i think so hadoop is the name of the creator's pet elephant toy yeah like his his child's elephant oh his child's okay yeah so so it wasn't created by a five-year-old whose pet elephant he named Hadoop. That would be epic. That would be pretty awesome. It would be the most epic five-year-old ever.
Starting point is 00:40:29 That was you as a five-year-old. Oh, I wish. Okay, on to our programming language. Hadoop. Okay, so I have to caveat that. I said programming language. It's not really a programming language. No, it's...
Starting point is 00:40:43 Hadoop. But that's okay because I noticed our title isn't Programming Language Throwdown. Oh. It only hit me this week. That's totally true. Is that weird? We've been doing 23 of these episodes now,
Starting point is 00:40:53 and it never hit me. When we named the show, we talked about programming languages, but really the title is Programming Throwdown. So, I mean, we're totally in our wheelhouse here. You could definitely throw down Hadoop versus, like, AllReduce or Spark or some of these other, you know, frameworks. There's definitely a throwdown to be had here. You already threw it down with Emacs, now you're throwing down
Starting point is 00:41:15 that you're an expert on Hadoop. Let's bring it. Yeah, get 'er done. So Hadoop... It's not even that late. Did I tell you I'm actually getting... No, no, I gotta say it real quick. Okay. All right. I'm getting a bib. So I want to get a bib for a baby. Like the kind your hose hooks up to on the faucet? No, no, like the kind where they puke and it lands on the thing, that they wear, whatever. I want to get one of those. And I want it to say... For you? For me. And I want it to say "I ate and made Hadoop." I'm just getting the stare down from Patrick.
Starting point is 00:41:54 I don't get it I'll write it in Emacs Let me think about it If I start laughing in a few minutes Okay Well it ends in OOP Is it a poop joke? Yes it is a poop joke
Starting point is 00:42:03 Okay now I'm getting a different kind of stare. It's like a despondence. Okay, so let's give some history on Hadoop. Hadoop was based on MapReduce, which was something created by, as I mentioned, Jeff Dean and Sanjay Ghemawat at Google. So they created something called MapReduce. They wrote out a paper explaining how MapReduce works. And then Doug Cutting, who is another engineer somewhere in the Bay Area,
Starting point is 00:42:31 I think he made a startup. Is that right? Yeah, and he was working on an open source version of it. He was inspired by the paper. He began working on it. And then at some point, Yahoo either bought his team or he joined Yahoo. That was sometime around 2006-ish. And so then he kept working on it while he was at Yahoo.
Starting point is 00:42:49 So he actually, the whole reason why he created Hadoop was because he wanted to support Lucene, which is an open source reverse index. And we actually talked about Lucene in the mailbag episode of this show, answering the question posed by the person who wanted to create his own, like he wanted to index the web for Magic the Gathering. So yeah, Lucene is a reverse index meant for indexing thousands, millions of pages. And Hadoop was something that Doug wrote to support that. So at the time, it was kind of meant for that and it was sort of myopic but
Starting point is 00:43:26 since then it's grown. Like, it's at an amazingly large scale. How big? The largest Hadoop file system that we know of is run by Facebook, and it is a hundred petabytes. Do you know how much a petabyte is? More than a terabyte. How much more than a terabyte? A hundred? A thousand? Ten thousand? That reminded me of that Monty Python where he's like, "Blue... I mean, yellow!" So yes, a petabyte is a thousand terabytes. So all you guys have like a terabyte hard drive at home, so a thousand of those. That's what, a hundred thousand of them? Well, yeah, you're right, it has to be more than that. A hundred thousand petabytes... a hundred thousand terabytes, wouldn't it be? It would be a hundred petabytes. Oh, right. All right. Yeah, so Facebook has a hundred
Starting point is 00:44:15 thousand of those hard drives that they're using to store all the data, at least. It might be replicated. That's true, right? I mean, that's just one. And that's only what they've owned up to. They may have bigger. Yeah, that's true. They may be lying, right? Like, it might actually be smaller. Yep. Or it could even be that, you know, somebody else has one that's just way bigger and they just don't want to talk about it. I mean, this stuff is a competitive advantage. Yeah, that's true. Okay, so you want to discuss some of the features? What are the things that... One thing we should mention in the beginning is
Starting point is 00:44:50 Hadoop has become almost like an umbrella term. Hadoop is the MapReduce for doing this distributed processing, but people also refer to Hadoop as this collection of tools, with Hadoop being one of the things in
Starting point is 00:45:06 Hadoop, which is really confusing. Let's talk about some of the other features, similar tools that come with Hadoop. Also, changing my answer again, I was being silly. It's 1024, right? Isn't it? Because isn't it a power of two? Yeah, you're right. We have no idea what we're doing. Oh, man. So, yeah, there's... So, okay, so let's do a little bit first. MapReduce. Okay. We used this term a little prematurely. So what is MapReduce?
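On the terabyte question: both 1000 and 1024 are in use, depending on whether you count in decimal units (the ones printed on drive boxes) or binary units. A small worked sketch of the drive-count arithmetic, assuming the 100 petabyte figure quoted in the episode and one-terabyte home drives; it ignores replication.

```python
TB = 10 ** 12    # decimal terabyte
PB = 10 ** 15    # decimal petabyte = 1000 TB
TIB = 2 ** 40    # binary tebibyte
PIB = 2 ** 50    # binary pebibyte = 1024 TiB

cluster_bytes = 100 * PB            # the figure quoted in the episode
drives_needed = cluster_bytes // TB  # one-terabyte drives, no replication

print(PB // TB, PIB // TIB, drives_needed)
```

Either way, a hundred petabytes comes out at roughly a hundred thousand one-terabyte drives.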
Starting point is 00:45:34 So, yeah, I'll try and explain MapReduce. Yeah, that's a good point. So, first of all, let's do a little bit about distributed computing. Back in the day, people used to do, and they still do, but people used to be totally consumed with MPI and things like that. MPI stands for message passing interface. And so if you had a lot of data and you
Starting point is 00:45:57 wanted to do something with that data, you would break the data up into pieces. Then you would pass each piece to a different computer. And then the computers would probably send you back some results and you would put the results together. And then if you passed the message to the computer and it died, your whole process would blow up. Well, you would have to have code to handle it. Yeah, or you had to say, oh, this machine died, so I need to pass the message to somebody else. And so you had to be completely consumed
Starting point is 00:46:26 with all of these meta things that didn't really matter to solving what you wanted to solve. So it turns out that not every problem can be what's called embarrassingly parallel, where you just send chunks of data to different machines. But almost all algorithms can be implemented using something called MapReduce. And so the idea is you take your data,
Starting point is 00:46:52 you break it up into chunks, then for each chunk, you execute a map on the chunk, which takes this chunk of data, could be a sentence on a web page, or it could be a URL, or it could be any atom of data, and then returns a key-value pair. So for example, let's say you wanted, for every word on the internet, the number of times it appears. A mapper could take one web page, and it could scan through that page and then spit out a bunch of key-value pairs where the key is the word and the value is the number of times you saw it on that page. So if you saw, like, "the fox jumped over the dog," then there's two "the"s, so one of your key-value pairs
Starting point is 00:47:44 would be "the", 2. Another one would be "fox", 1, etc. So now all of these key-value pairs generated by all these mappers, they're all kind of floating somewhere in your cluster. So you have something called a shuffle that takes all of the keys that are in common and all the key-value pairs that have the same key and squishes them together. So if you have 10 websites with "fox" as the key, you'll just have one key with "fox". And then instead of a value, now you'll have a list of values, where that list is how many times "fox" was seen on all these websites. So that's the shuffle phase.
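The word-count example can be sketched in a few lines. This covers the map and shuffle steps just described plus the reduce step explained next. It is a single-machine illustration with made-up function names and sample pages, not Hadoop's API, and it does none of the distribution or failure handling.

```python
from collections import Counter, defaultdict

def map_page(page_text):
    """Map: one page in, a list of (word, count-on-this-page) pairs out."""
    counts = Counter(page_text.lower().split())
    return list(counts.items())

def shuffle(pairs):
    """Shuffle: gather every value that shares a key into one list."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)

def reduce_counts(word, counts):
    """Reduce: collapse one key's list of values down to a single total."""
    return word, sum(counts)

pages = ["the fox jumped over the dog", "the bear saw the fox"]
mapped = [pair for page in pages for pair in map_page(page)]
grouped = shuffle(mapped)
totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)
```

Each call to `map_page` depends on only one page and each call to `reduce_counts` on only one key, which is what lets a real cluster spread them across machines.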
Starting point is 00:48:20 Then you have the reduce phase, where you collapse that key's list of values down to one answer. So in this example, you would take "fox" and then 2, 1, 2, 1, 2, 3, 2, 4, and you would add up all those numbers together. And you would say for "fox", for the whole Internet, it's whatever, 13 million or something like that. And the cool thing about this is that all these pieces can happen in parallel. Like, the part that collapses all of the "fox" values and the part that collapses all the "bear" values could happen on two different machines. And the part that scans website A and scans website B can happen on two different machines. And it all just kind of works. And the MapReduce code also handles all these things like,
Starting point is 00:49:07 oh, the machine that was processing Fox dies. We need to send Fox to another machine. They've dealt with all that for you, so you don't have to do it. So that's a short summary of MapReduce. And Hadoop is an open source version of this. All right. So, I mean, part of this is going to be if you're going to – you kind of started with something that was a little bit presumptuous, dare I say,
Starting point is 00:49:31 which is you start with everybody reading all the tasks, reading one page of the Internet. Where do you store something that's every page of the Internet? You can't just, like, store that on, like, NTFS or, you know, ext, you know, a file partition like this. I mean, it's so big. Yeah, well, not only is the size large, but the number of files is ridiculous. Oh, yeah, right. So you could run into problems. So, I mean, one thing that has to go along with this is you have to have a distributed file system. So first of all, you can't house all the data on one computer, right? But then on top of that, even like how you would access it in a way that is efficient, so that you can have these
Starting point is 00:50:09 tens, hundreds, thousands, millions of map threads working on this. You need a way to be able to retrieve that quickly and reliably and in a sharded manner so that, you know, that can take place very quickly. And they've handled that as well with the Hadoop file system. Yep, totally. So the Hadoop file system operates in 64 megabyte chunks, and that chunk could have one file in it, it could have a piece of a file, it could have 10,000 files in it if the files are very small. And so versus, you know, a regular file system, which uses, you know, B-trees and red-black trees and things like that, this uses something totally different that's meant for gigantic files, millions of tiny files, all of these things. And it expects you to access the files using something like Hadoop
Starting point is 00:50:58 versus a file system which doesn't really know how you're going to access the files. There's also HBase. HBase is a column oriented database written in Hadoop. So you know whereas HDFS is a file system, HBase is a database. So if you need to you know get rows by key and if you need to find all keys where the first name is this, if you need to do all the secondary indices, all that stuff, but you still need the distributed coolness from HDFS, which HBase is built on, then you can use HBase and get that.
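Going back to the 64 megabyte chunks for a moment, here is a rough sketch of what cutting one large file into fixed-size blocks looks like. This is plain arithmetic under the chunk size quoted in the episode, with a hypothetical helper name; it is not how HDFS itself is implemented.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size quoted in the episode

def block_sizes(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of this length fills."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block is only partly filled
    return sizes

one_gb = 1024 ** 3
blocks = block_sizes(one_gb + 1)
print(len(blocks), blocks[-1])
```

Fixed-size blocks are what let different machines each hold, and each work on, a different piece of the same file.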
Starting point is 00:51:36 So all this code that's running, I mean, if we begin to talk about hundreds, thousands of servers or even more, I don't know how high you could go, how high do you want to go, I guess, you're going to have to have a way to keep this. So machines are going to die. If you talk about that many machines and some of this data can take a long time to process,
Starting point is 00:51:51 I mean, machines are going to die. You might want to add new machines. You want to increase your capacity. I mean, that would be a huge mess in the olden days. I mean, imagine if you're doing MPI and you have to deal with that, right? I mean, it would take you months to have to deal with the, oh, this machine died, so I'm going to send it to that machine, right? I mean, that could be such a nightmare, right?
Starting point is 00:52:13 Or what if the machine that's sending the things dies? I mean, you don't want to have to restart the whole thing just because of that. So dealing with that, right? That could be a nightmare. And so Hadoop's thing which handles that is called Zookeeper, going along the lines of pet animals and things like that. And so, yeah, Zookeeper is a distributed coordination service. And so in really short, what that means is somebody can write some data to Zookeeper, and somebody else, if they try to write data
Starting point is 00:52:46 and the data collides, Zookeeper sort of figures all that out. So if two machines both want to take the same job at the same time, Zookeeper says, no, look, only one of you guys get this job. Things like that. So like we said,
Starting point is 00:53:02 I mean, Hadoop is kind of, when people talk about it, they're kind of talking about all these kind of tools, which are kind of all necessary to get together and enable you to be able to run those MapReduces. And so, I mean, kind of now talking about the strengths, one of the things that we talked about is you bring up more computers, great. You can have more threads and the threads can run sooner. They don't have to be serialized yep and so i mean it really you can just scale this dot dot dot yeah like just you know whatever like however much i guess time or money you have determines uh you know how big you can go or you need to go or have to go yeah yeah it's amazing i mean you could be on using the amazon web services
Starting point is 00:53:45 which run hadoop right so they have a a version that they've altered i guess or changed elastic elastic map reduce or whatever right yeah right which is is very similar yeah so they have a they have their version which runs on their infrastructure so they have so like you know for instance they may not i don't i think the way it works and and I'm speaking a little out of my expertise here, but it's, you know, instead of having, like, a Hadoop file system, they have their version of the same thing, right? Their elastic file storage, which they kind of hide from you, so you don't have to worry about it. You just put data there. And you can run the Elastic MapReduce, which runs on that. So, it's very similar and shares a lot of stuff with Hadoop, but it's not, isn't exactly the same.
Starting point is 00:54:22 Yeah, yeah, totally. I don't know if it uses the same API or how that works. I don't know. Yeah. I didn't look into that. No, yeah, I haven't done that with Amazon. But the other cool thing is it's fault tolerant. So, you know, all these crazy corner cases that you'd have to code up yourself,
Starting point is 00:54:38 they've done that for you. And they'll even do crazy things that you wouldn't think about doing. But, like, one thing Hadoop will do is, let's say you have 20 machines that are in your little Hadoop cluster, right, and you're running a job, and Hadoop detects, the Zookeeper detects, you're only running one job and the job had eight workers, like, let's say there's eight mappers. So Hadoop will say, well, you know, if one of these machines dies, you have to start the whole thing over again. So since you only have eight, let's just duplicate all this work, like run it on two machines. And if one of them dies, it's OK.
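The "run it on two machines" idea can be sketched roughly as follows. This is a toy illustration with hypothetical names, not Hadoop's scheduler: the "machines" are plain functions, a dead machine is modeled as a raised `RuntimeError`, and the copies run one after another so the result is deterministic, where a real cluster would run them at the same time.

```python
def run_with_backup(task, workers):
    """Give the same task to each worker in turn; return the first result."""
    failures = 0
    for worker in workers:
        try:
            return worker(task)
        except RuntimeError:  # stands in for "the machine died"
            failures += 1
    raise RuntimeError(f"all {failures} copies of {task!r} failed")

def dead_machine(task):
    raise RuntimeError("machine died")

def healthy_machine(task):
    return f"{task}: done"

result = run_with_backup("map-chunk-3", [dead_machine, healthy_machine])
print(result)
```

The job only fails if every copy fails, which is the trade being described: spare machines are spent on duplicate work in exchange for not restarting.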
Starting point is 00:55:14 Or another thing they'll do is they'll know, oh, you know, you might be able to like pass some data or you might be able to arrange and do a little bit of the reducing in the mapper. Like these crazy optimizations that we don't really think about because we're too busy focused on the algorithm, they've done all that grunt work, you know, and they've done all these awesome optimizations. Sort of like using C++ instead of assembly, right? Exactly the same. Yeah, like MPI is so low that like you just have to end up doing
Starting point is 00:55:46 like way too much. And that feeds into one of the features I skipped over, that you were particularly a little bit passionate about: Crunch. Oh, yeah. So you want to talk about Crunch? Patrick's actually used Crunch quite a bit. Yeah, well, you overstate, sir. I am not so emboldened to say that. But, okay. Not to call you a liar, but I'll just... Okay. So what Crunch is, is Crunch allows you to, instead of having to use the kind of off-the-shelf nature of a lot of the stuff that MapReduce does
Starting point is 00:56:22 or kind of fit into their paradigm, kind of even going again on what Jason said extending it even a little further even more flexibility in how you do these things and just allow the computer to handle it right the computers to handle it just so you take care of this right so crunch is an API for allowing you to add an even higher level
Starting point is 00:56:37 say like here's generally the data flow I want and here's the operations I want to perform and alright go and so um if you if you kind of drew out like oh i need the data these two data things need to flow together i need to do some operation on them and then once that's done i make some calculation and then i need that to flow with another piece of data you know and then i want to kind of join those together in like an inner join fashion or something right like all these kinds of notions could become they're like multiple map reduce hadoop jobs that we need to run and there are
Starting point is 00:57:10 you know, things that you might have to string together very carefully. But Crunch allows you to just kind of specify, like, this is the way I want it done, and it'll handle, like, oh, I could merge these two together and run them at the same time. Or, oh, I need to spin up a new Hadoop run for this thing or for that thing. And it kind of allows you to do that much more simply and easily and in a way that's flexible, just allowing the computer to handle it. Yeah, yeah. And Crunch feels more like Java, you know, because you have these, like, PTables,
Starting point is 00:57:42 which basically feel a lot like hash maps in Java. So you can kind of take your existing Java code, and if you need it to run on, like, a thousand machines, you know, it's not just a copy-paste, and you can't just change HashMap to PTable, there's more to it than that. But it has things like get all the values, like, get a list of all the values: that exists in HashMap and also in PTable. And so it kind of feels more like you're just doing native Java. So it's good for people who want to get started with Hadoop
Starting point is 00:58:12 and aren't used to writing this kind of stuff. Yeah, so that idea, like PTable or PGroupedTable, it allows you to say, like, oh, on everything in that, you know, in parallel, do this thing, right? Like, go through and perform some operation. So that's what I was kind of talking about, maybe at a little too high of a level. But that's how you kind of, it gives you this API,
Starting point is 00:58:30 but it really is kind of almost classes and things that make it just look like, to Jason's point, it looks like you're writing Java code on some fancy hash map. And in reality, you're instructing Crunch to, in the back end, figure out how to arrange the MapReduces and the... to be able to do that. Yeah, totally, totally. So there are some weaknesses to MapReduce. No! I know, after we sang the praises, you'd think it would be just a panacea that would just be a
Starting point is 00:58:58 cure-all. But no. One of the problems with MapReduce is it takes a long time to spin up and spin down. And so what we mean when we say that is, spin-up refers to, for example, let's say you're building a multi-threaded program. So the first thing you'll do is make the thread pool and actually ask the operating system for the threads. So that's kind of like our spin-up time, right? And so to ask the operating system for some threads happens in what? Milliseconds? Maybe even less than that, microseconds. But to ask MapReduce for 100 machines to do work on, or to ask a Hadoop Zookeeper for 100 machines, is going to take on the order of seconds at least. I mean, it really just depends on your cluster, but it's not going to take milli- or microseconds, right? So if you have, you know, a MapReduce job that adds two numbers, like, goes through two files and then adds the numbers together, and the files are a meg each, you might just want to do that in C++, you know, because by the time you spin up the machines, you know, send numbers to the machines and, you know, then compute the result and then store the results somewhere.
Starting point is 01:00:08 Yeah, your C++ program would have run like 40,000 times or something. So there is a problem there. Yeah, there is something called worker pools. And the idea is, you know, if I have a job that I want to run over and over again, I can sort of prep Hadoop and I can say, look, Hadoop, you're going to do this thing a hundred times. So get ready for that. And then it'll do some optimizations there.
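The spin-up point and the worker-pool idea can be put into a rough back-of-the-envelope model. This is only a sketch under made-up assumptions: the 10-second spin-up, the 1-millisecond job, and the function name `total_seconds` are all hypothetical and are not measurements of any real Hadoop cluster.

```python
def total_seconds(spin_up_s, work_s, runs, reuse_workers):
    """Total time for `runs` identical jobs.

    If reuse_workers is True the spin-up cost is paid once;
    otherwise every single run pays it again.
    """
    spin_ups = 1 if reuse_workers else runs
    return spin_ups * spin_up_s + runs * work_s


# Made-up numbers, for illustration only.
SPIN_UP_S = 10.0   # assumed time to get machines ready
WORK_S = 0.001     # assumed time for the tiny "add two files" job
RUNS = 100         # "you're going to do this thing a hundred times"

cold = total_seconds(SPIN_UP_S, WORK_S, RUNS, reuse_workers=False)
warm = total_seconds(SPIN_UP_S, WORK_S, RUNS, reuse_workers=True)

print(f"fresh workers every run: {cold:.1f} s")
print(f"workers kept warm:       {warm:.1f} s")
```

With numbers like these the fixed cost dominates a tiny job, which is why keeping workers around between runs helps, and why a single small computation may be better off as a plain local program.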
Starting point is 01:00:32 Also, we talked about all these wonderful things of bringing on new computers and bringing down computers and all these stages and divide it and just, oh, do this little bit of work and then do this little bit of work. And that helps it be really scalable. But I mean, as you can imagine,
Starting point is 01:00:44 all the data that's in your program, all those classes and everything, need to be able to be serialized out, and then when they're read back in they need to be deserialized. And sometimes, you know, it can be said, like, oh, I'm going to do this on the same machine, but a lot of times you have to do it anyways, because you don't know, like, oh, am I going to be doing this on the same shard or not, or is it going to go somewhere else? Yeah. And so it's just, like, if that takes a long time, especially because you have some nasty intricacies, it could be a problem. But even just in general, like, that's going to be overhead. Yeah. And then the other thing, and there's some theoretical reasons why this is true that I'm not going to get into because it takes a long time, but
Starting point is 01:01:18 there's a lot of materialization. So, you know, each of your mappers, going back to our word count example, each of those things that scans one page of the web has to put those results, all those key-value pairs, on disk, on that Hadoop file system. And then the reducer, the shuffler, sorry, that's putting together all the keys has to read those from disk and then put the reduced key-value lists on disk, and then the reducer has to read from disk and then put the answer on disk, right? If you were just writing some C++ program, you would just put everything in memory,
Starting point is 01:01:54 or maybe you would only put, you know, a bit of that on disk, like one stage of that on disk. So there's a lot of disk I/O when you're using Hadoop. There are some tools to sort of make Hadoop a little easier to use. One of them is Pig. Pig and Hive. And so they're good for if you don't want to have to write, like, raw Hadoop code: you can just put your data in Hive or put your data in a flat file like a text file, then you use something like Pig or Hive. They have their own programming
Starting point is 01:02:28 languages, which are much more terse than Hadoop. Yeah, I mean, with a lot of these things, that planning can take a long time, even. It's like figuring out, like, oh, how exactly do I want to divide my data? Yeah. You know, getting my data into the right place in the right format, you know, that can take a lot of time. Yeah, totally. And so these tools will help you with that. Yep.
Starting point is 01:02:50 So we have Avro. I have no idea what that does, so that's going to be all you, buddy. Oh, Avro is like Thrift. It's like a serialization thing. Oh, okay, okay. Yeah, yeah. We already talked about this. Yeah, totally.
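To make the serialization step concrete, here is a minimal round-trip sketch. It is not Avro: it uses Python's standard `json` module purely as a stand-in, and the record fields and the helper names `serialize` and `deserialize` are hypothetical. Avro itself uses schemas and a compact binary encoding, which this sketch does not attempt to reproduce.

```python
import json


def serialize(record):
    """Turn an in-memory record into bytes that could be written to disk or sent to another machine."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize(payload):
    """Rebuild the in-memory record from the bytes produced by serialize()."""
    return json.loads(payload.decode("utf-8"))


# Hypothetical key-value record, like one a word-count mapper might emit.
record = {"word": "hadoop", "count": 3}

payload = serialize(record)      # bytes leaving one stage
restored = deserialize(payload)  # record arriving at the next stage

print(len(payload), "bytes on the wire")
print(restored == record)
```

Every record that crosses a stage boundary pays this encode and decode cost, which is the overhead described earlier in the episode.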
Starting point is 01:03:02 I'm reading your points. So what is Hadoop used for? What is MapReduce used for? Everything. Everything. Everything. I think in the original white paper on MapReduce they said that it was, I don't remember, I think it was over half of computations at Google were part of a MapReduce. I think it was some very large number. Oh, interesting. But yeah. So, and Google is like a huge company. So, you know, and I think I'm pretty sure Facebook and all these companies have probably similar statistics. It's kind of interesting to see what companies will say what, right? It's kind of like a
Starting point is 01:03:38 little bit of a game, like, oh, hey, we're doing this. Well, we're doing, you know... But, you know, it's like, who knows what percentage of the actual value those things are? They're just, like, kind of one-upping each other. Yeah. But only, like, slightly. And others are just keeping quiet in the corner, and it's like, oh, are they embarrassed, or do they, like, oh, this is ridiculous, I'm just laughing, right? Like, yeah, it's kind of interesting. But, I mean, no, seriously, I mean, MapReduce, it pops up all over the place. Like, you know, even for doing, like, image processing. Like, oh, you want to, like, do a whole bunch of, you know, look for objects in something.
Starting point is 01:04:11 Or you want to run the same little algorithm on things. I mean, people have done MapReduce to do that. People have done MapReduce even for, like, I saw one guy who was writing a little bit of a tutorial. Like, I think he was trying to convert, like, old PDFs and, like, do OCR on them. Like, he had a whole bunch of them. Each page of old books that had been scanned was either just an image or a PDF. You wanted to do OCR on all of them and then put them somewhere. But you could imagine if you have millions and millions of these pages, it would just be a nightmare to figure out a C++ framework and structure or Java structure to do that. That could just take a long time.
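The scanned-pages job is a map with no real reduce: every page can be processed independently. A minimal local sketch of that shape follows, assuming a placeholder `ocr_page` function that just upper-cases text. It is hypothetical and does no real OCR, and a thread pool on one machine is standing in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page):
    """Hypothetical stand-in for a real OCR call: upper-cases the 'scanned' text."""
    page_id, raw = page
    return page_id, raw.upper()


# Made-up input: (page id, pretend scan contents).
pages = [
    (1, "call me ishmael"),
    (2, "some years ago"),
    (3, "never mind how long"),
]

# Each page is handled on its own, so the work can be spread over any number of workers.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = dict(pool.map(ocr_page, pages))

for page_id in sorted(results):
    print(page_id, results[page_id])
```

Because no page depends on any other, the same per-page function could in principle be handed to a MapReduce framework as the map step, with the framework deciding which machine handles which pages.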
Starting point is 01:04:44 But that's right in the wheelhouse of MapReduce. Yeah, yeah, totally. Yeah, there's a ton of things you can do with MapReduce. The biggest thing is, you know, especially as sort of, like, let's say you're an indie developer, you're writing code out of your house, right, but you want to do really big-data kind of things. Like, for example, let's say you have a website, you have a few thousand people on your website, and you're capturing all their clicks and things like that. Let's just say your website has a thousand people a day, and on average each person clicks once a minute. So there's how many minutes in a day? I don't know.
Starting point is 01:05:20 Let's just say they click once in a day. Okay, so you have a thousand clicks a day. That's still 30,000 clicks a month. So if you wanted to go back through a year's worth of data, that's, what, 365,000 clicks? That's a lot of data. And so, you know, you don't even want to do that on your machine because it's going to take forever. So if you wrote a MapReduce, you could, you know, run it on your machine on, like, one day's data, make sure it works, and then say, look, Amazon, you know, you have this elastic cloud that's totally awesome, you know, chug on this data, like, run this MapReduce that I wrote. And, you know, Amazon will charge you however much that costs, maybe five bucks or something. Or you do it accidentally the wrong way, and then what do you do? I don't know.
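The "run it on one day's data on your own machine first" step might look like the sketch below. It is a toy, in-memory imitation of map, shuffle, and reduce, not Hadoop code. The `user,page` log format, the sample lines, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict


def map_click(line):
    """Map step: one log line 'user,page' becomes one (page, 1) pair."""
    _user, page = line.strip().split(",")
    yield page, 1


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_clicks(page, counts):
    """Reduce step: add up the clicks for one page."""
    return page, sum(counts)


def run_local(lines):
    """Run map, shuffle, and reduce in memory on a small sample."""
    pairs = (pair for line in lines for pair in map_click(line))
    grouped = shuffle(pairs)
    return dict(reduce_clicks(page, counts) for page, counts in grouped.items())


# Made-up sample standing in for one day's click log.
one_day = ["alice,/home", "bob,/home", "alice,/about"]
print(run_local(one_day))

# The episode's arithmetic: 1,000 clicks a day over a year.
clicks_per_year = 1000 * 365
print(clicks_per_year)
```

Once the map and reduce functions give sensible answers on a small sample, the same logic is what you would hand to a cluster for the full year of data.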
Starting point is 01:06:06 You just accidentally misconfigured something and it's really big. It's like some really outlandishly large number. You probably can set a budget, I'm sure. I'm sure. All of a sudden, now you can run that same program on a thousand computers that
Starting point is 01:06:21 you don't have to own. It's pretty awesome. I think that's a wrap for this episode. Yeah. Hadoop is great. Learn it. It's fantastic. We hope you will have or will have had a good Christmas. Oh, yeah, that's right. Yeah. In the future, we hope Christmas will be awesome. Will have been awesome. Will have been awesome. That's pretty, pretty wild. I don't know. All right, till next time. See you later. The intro music is Axo by Binärpilot. Programming Throwdown is distributed under a Creative Commons Attribution-ShareAlike 2.0 license. You're free to share, copy, distribute, transmit the work, to remix, adapt the work, but you must provide attribution to Patrick and I
Starting point is 01:07:05 and uh share alike in kind
