The Changelog: Software Development, Open Source - NoSQL Smackdown! (Interview)

Episode Date: March 18, 2010

While at SXSW Interactive, Adam and Wynn got to attend the Data Cluster Meetup hosted by Rackspace and Infochimps. Things got a bit rowdy when the panel debated features of Cassandra, CouchDB, MongoDB and Amazon SimpleDB and started throwing dirt at everybody else’s favorite NoSQL databases.

Transcript
Starting point is 00:00:00 Hey, I'm Chris Anderson. I'm a big Changelog listener, and so are you. I work on CouchDB, and this is the CouchDB theme song. Bum, bum, ba-ba-ba-bum, ba-ba-ba-bum, CouchDB! Psh, psh, CouchDB! Psh, psh, CouchDB! Psh, psh, relax! Welcome to The Changelog, episode 0.1.8. I'm Adam Stacoviak. And I am Wynn Netherland. This is The Changelog. We cover what's fresh and new in the world of open source. If you found us on iTunes, we're also on the web at thechangelog.com. For a real-time view of open source, check out tail.thechangelog.com. Also head over to github.com/explore, where you'll find some trending repos, some featured repos from our blog, as well as all of the audio podcasts from this year. If you're on Twitter, follow us at changelogshow, not thechangelog.
Starting point is 00:01:05 And I am adamstac. And I am pengwynn, P-E-N-G-W-Y-N-N. Whiskey, Yankee, November, November. Cool episode this week. So much fun participating in the NoSQL Big Data Smackdown, brought to you by Infochimps and Rackspace at South by Southwest. Let's give a hand to J. Chris first for that awesome CouchDB theme song. Yeah, that was pretty wild.
Starting point is 00:01:29 And you liked the dance even better. Yeah, there was a dance company with that. Did we get it on video? I don't think we did. It's like half Running Man, half Wedding Chicken song. Yeah, it was some oddity. I don't know. It had some tigger in there, too, somewhere bouncing all around.
Starting point is 00:01:45 Such a fun time at the SmackDown. You know, the lineup for the SmackDown, we got together and debated the merits of each of these higher-end NoSQL data stores. Stu Hood from Cassandra, Jan Lehnardt from CouchDB, myself representing MongoDB as the best fanboy that I could be, and Werner Vogels from Amazon, the CTO of Amazon, which I found out later after he joined us from the peanut gallery about, what, five minutes in? It was about five minutes in, and he came in and started putting a smackdown on you guys. He did. Clearly the alpha geek in the room. We did get a video of the event. Hopefully we can do something fun with that and post it later. Yeah. Well, we would have done this with the video in mind, because the intention was actually to post a video of it. But, man, I had some trouble getting that video exported in the right way, and I hate video. It's hard to do it when you're on-site at a conference anyway.
Starting point is 00:02:40 We just got back from South by Southwest within the last 48 hours or so, trying to catch our breath from a fun conference. What a fun conference down in Austin. Yeah, it was an awesome conference. There's been a lot of good people. Geez, we met Tony Hsieh passing by. We met Guy Kawasaki. We met up with Techno Weenie, a.k.a. Rick Olson, and met up with a lot of people.
Starting point is 00:03:02 It was a lot of fun. A couple of guys flagged us down saying, hey, I listen to The Changelog. So it made us feel good that we have at least a dozen listeners out there or so. Gave out some shirts. We did. Kind of bummed that we won't get to JSConf this year, it looks like. So write your congressman, voice your outrage that we won't cover the coolest JavaScript conference this year. Yeah, a sudden turn of events leaves us in a lurch, not being able to make it there.
Starting point is 00:03:25 No solution in sight yet, but I'm hopeful. We've got a busy docket this year. Chirp and Texas JS and perhaps RailsConf. We'll see. A full docket already, and we're hoping to get out to OSCON in July. Get at OSCON and let them know that The Changelog needs to represent at the conference. That's right. Yeah, sometimes the crowd can help us, I suppose, right?
Starting point is 00:03:49 Speaking of crowd, you know, the crowd is fully involved in the SmackDown, as you'll hear in the interview, as is kind of a rogue bird that was sitting right by the microphone. Yeah, the whole time. I did hear that bird. Well, it's a fun episode. Should we get to it? Yeah. Questions that you have immediately. Hopefully, they'll be relevant.
Starting point is 00:04:19 But, you know, anything. Anything you're curious about. So, right now, we have Cassandra, MongoDB, and CouchDB represented. Everyone's claiming that they're NoSQL. Nobody really knows what the exact definition of NoSQL is. It means many different things. It means data models. It means replication. It means what else? Scaling. So first up, let's talk a little bit about the data model. So Cassandra has a really interesting data model that allows massively wide rows. And I mean, we're all document stores, right? So what do you guys think about huge documents?
Starting point is 00:05:03 Define huge. Huge, like as large as a machine can fit. They're awesome. Yay. I guess the definition of a document differs between Cassandra and CouchDB. I don't know that much about Cassandra, but CouchDB can handle big documents, though we tend to think of them as small items, small entities that you handle individually. I believe Mongo's got a 4 gig per document limit.
Starting point is 00:05:31 Not 4 gig, 4 meg. No, 4 gig. Megabytes, sorry. Megabytes, right. Right, 4 megs. Okay, and so all these other competitors use JSON and are kind of untyped. I don't like that.
Starting point is 00:05:47 I don't like that they're untyped. I mean, because you can do massively interesting things if you have typed data. It might be sorted, so you can slice little pieces out of it. That's something that Cassandra provides. So if you have a document, that allows it to grow large, because you might want to get just a little piece of your document out. Mongo actually uses the BSON spec, so it is pseudotyped at the file system level. So it's not just strings in the database. You have ints, you have other database types, and you can even store files in the file system, GridFS.
Starting point is 00:06:20 Yeah, CouchDB uses JSON. It does have a bunch of data types. The nice thing about JSON is it's a subset of all the programming languages that we're using. It's the lowest common denominator that everybody can serialize objects into and exchange. So I can have a Java object and serialize it into JSON, load it up in Python, and just work with it without having thousands and thousands of lines of code that do some object translation between some arcane format and something else. So JSON is really, really good for data interchange.
Starting point is 00:06:47 But it's slow! What? Oh, so he says it's slow. So Bob Ippolito from the Python community wrote a JSON module for Python, which is actually faster than the protocol buffers Google claims are so fast. So shut up.
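As a rough illustration of the interchange point made above, the following Python sketch round-trips a record through JSON using only the standard library. The record and its field names are made up for the example; nothing here is specific to CouchDB or any other store discussed on the panel.

```python
import json

# A hypothetical record; the field names are invented for illustration.
record = {
    "name": "shopping-cart",
    "items": [{"sku": "A-100", "qty": 2}, {"sku": "B-200", "qty": 1}],
    "total": 41.5,
    "paid": False,
    "coupon": None,
}

# Serialize to a JSON string, the form in which it could be stored or sent.
encoded = json.dumps(record, sort_keys=True)

# Any JSON-capable language could parse this string; here Python reads it back.
decoded = json.loads(encoded)

# JSON carries a small set of types: object, array, string, number, boolean, null.
types_seen = {key: type(value).__name__ for key, value in decoded.items()}
```

The round trip keeps the basic types (list, float, bool, null) without any schema being declared up front, which is the sense in which the panel calls JSON a lowest common denominator.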
Starting point is 00:07:07 All right. So I believe we disagree on typed, and that's fine with me. I want to add to the typed point. The web is not really typed, and people who are using the web are usually not... I hope most of them are not computer scientists. Sorry to all the computer scientists, and me. It's great that you use the web,
Starting point is 00:07:28 but the web is something that enables everybody to share data or to express themselves in some way. And having to teach them about data types is really just an arcane artifact of programming. They shouldn't think about how do I store arrays and objects and stuff. They should just stuff whatever they have into a database. They should not think about it. I would hope that the people that are developing web applications are computer scientists,
Starting point is 00:07:50 but maybe not. I don't know. They shouldn't be. So there was an argument yesterday, I guess, that the iPhone app store has over 100,000 applications on it. And when GeoCities was shut down, it had over 100,000 websites on it. So it's a different magnitude of scale if you let everybody participate in the open web. So you definitely do want to have everybody who has an interest in doing web stuff doing web stuff, and not restrict it to the, like,
Starting point is 00:08:18 getting-it-right computer science types, because I'm really bored of all the stuff we come up with. And the amateurs, who really have no clue what everything is about, are doing the really cool and interesting applications. That's a lot of crap. Awesome. Let me explain why.
Starting point is 00:08:33 So how many programming languages do you know that are completely untyped? Except for Perl. Yeah, so everybody that's developing applications actually has types. And suddenly you go to the database and all your types disappear. That's not true. JSON defines types.
Starting point is 00:08:52 You can do that. But you don't have to worry about it as much. You don't have to go up front and define, okay, I will need an array of integers to store whatever list I'm having. You don't have to think about that that much. That is more natural for a programmer, and for a non-programmer even more so. Yeah, but you force everybody to rewrite all their programs
Starting point is 00:09:10 with these things in mind. So? Yeah, well, that's a lot of work. There seem to be quite a few programs out there. You know, GUCDs exist, and all these other things. So what about this compatibility stuff? Why do you guys force everybody to rewrite all their applications? I'm just playing devil's advocate here.
Starting point is 00:09:28 Absolutely. Also, Hadoop is completely unstructured by default. You throw anything in it and then you process it. You can do something similar with CouchDB. I don't know why I'm defending CouchDB, but I just want to point that out. Type doesn't always win. Fantastic. So I think, for many of these things, Colin, it's not actually about type. It's also, when you look at consistency models
Starting point is 00:09:52 and things like that, that traditional database applications are used to a very different model. So you have to rewrite your applications according to the new model. Whether it's type or whether it's the consistency model or other things around it, suddenly you force everybody to rethink. Now, that may be a good thing, but it definitely hurts adoption.
Starting point is 00:10:16 Yeah, so let's talk about consistency really quickly. Let's talk about the different models. Cassandra has kind of a peer-to-peer model that comes from Werner's brainchild, Dynamo, where any node can accept a write, and then if enough nodes have accepted the write, then the write succeeds, otherwise it doesn't. And at read time, you resolve all that.
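The "enough nodes have accepted the write" rule described here can be sketched as a toy quorum in Python. This is an in-memory illustration under stated assumptions, not Cassandra's or Dynamo's actual implementation: the `Replica` class, the integer version numbers, and the "highest version wins" read resolution are all simplifications invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One storage node; `up` simulates whether it is reachable."""
    up: bool = True
    data: dict = field(default_factory=dict)  # key -> (version, value)


def quorum_write(replicas, key, value, version, w):
    """Send the write to every reachable replica; succeed only if at least w accept."""
    acks = 0
    for replica in replicas:
        if replica.up:
            replica.data[key] = (version, value)
            acks += 1
    # No rollback on failure: replicas that accepted keep the value.
    return acks >= w


def quorum_read(replicas, key, r):
    """Ask reachable replicas; require r of them, then resolve by highest version."""
    reachable = [replica for replica in replicas if replica.up]
    if len(reachable) < r:
        raise RuntimeError("not enough replicas reachable for the read quorum")
    answers = [replica.data[key] for replica in reachable if key in replica.data]
    if not answers:
        return None
    return max(answers, key=lambda pair: pair[0])[1]


nodes = [Replica(), Replica(), Replica()]
first_ok = quorum_write(nodes, "cart:42", ["book"], version=1, w=2)

nodes[0].up = False  # one node goes down; two acknowledgements still satisfy w=2
second_ok = quorum_write(nodes, "cart:42", ["book", "pen"], version=2, w=2)

nodes[0].up = True   # the node returns holding the stale version 1
latest = quorum_read(nodes, "cart:42", r=2)
```

With three replicas, choosing w=2 and r=2 means every read quorum overlaps every write quorum, which is why the stale copy on the returning node does not win at read time.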
Starting point is 00:10:37 I don't know how I feel about the CouchDB and Mongo models. I mean, Mongo hasn't actually figured that part out, right? That's right. There's a two-second delay before you actually commit to the database. Is that what you're talking about? No. So I believe Mongo is still master-slave type replication. Sure, sure.
Starting point is 00:10:55 I must admit, these guys are core committers to these projects, and I'm just a Microsoft fan. I mean, a Mongo fanboy. I used to be a Microsoft fanboy. Uh-oh. The M word. Define a little bit what you're talking about. See if I can defend it just as a mere user.
Starting point is 00:11:11 So if you have a data center in Washington and you have another data center in California, you can do a write in one of those data centers, and even if the other data center is down, depending on your tunables, you can still succeed that write, because no one of those nodes is actually responsible at a given time.
Starting point is 00:11:30 There's no one dedicated to a particular key. Right. In this case, advantage Cassandra, but I would argue in most applications that's not needed. Actually, the whole consistency model, the eventual consistency model, doesn't come from the fact that this is something that you want at the application level. It is basically abstractions from the implementation leaking up. There are two reasons for replication. Either you do it for gaining fault tolerance or you do it for getting a higher level of
Starting point is 00:12:04 concurrency, so you can get better read throughput or write throughput or whatever. For those two reasons, you have to replicate. If you have to replicate, you have to make a decision: do I write to all replicas, and guarantee to write to all replicas, such that my reads are always consistent?
Starting point is 00:12:21 That might not perform well, because on writes you pay a huge cost, and especially if you cannot get your quorum, you may have to fail your writes. And there's a number of applications where that may not be useful. So these are things that are actually leaking up from the implementation through the APIs. If everybody could get a choice, everybody would want strong consistency. It's just that strong consistency means
Starting point is 00:12:49 that you have to take a lot of other trade-offs. The main one is not being able to get much write throughput, and the other one is that there's a number of failure scenarios in which you will be dead in the water. So are you saying that Dynamo wasn't user friendly? No, absolutely not. No, no, so there's a range of things.
Starting point is 00:13:10 I think Dynamo is one of the systems that predates a number of these, and where we made consistency models explicit. It's not that we were the first ones to go for an eventually consistent system. I think most relational databases give you eventual consistency, you just don't know it. If you use replication in a traditional database, such as a commercial one, there is a window while the logs are shipped.
Starting point is 00:13:35 If you read from the slave, you don't get consistency. There is always a delay. Why was Dynamo not user-friendly? Not just because of the consistency level, but also because you had to have the key to go to the database. There is no way to do a list to find out what my keys are. You have to have a key, and the key normally comes from somewhere else, for example from a customer database. When we developed Dynamo, we had the shopping cart as the use case.
Starting point is 00:14:08 That meant you went to the database, to the storage system, and you had a key. That is why S3 is a user-friendly storage system, but Dynamo is not so user-friendly. With S3 you can do a list, a prefix list, on what my keys are and then find out. That, for example, is not in... What was the name of that other one? Isn't S3 built on Dynamo? What?
Starting point is 00:14:40 Isn't S3 built on Dynamo? That is not clear. Is S3 built on Dynamo? No comment. Damn it. Um, no, show up. Nobody kicked the plug out. Uh oh. Man! Waiting for it.
Starting point is 00:14:58 Show up! We weren't expecting to get started with this one, were you? You got stuck over here. So the answer is no, because if you were an engineer developing this, you would figure it out. You would know that if you have to do a list operation on top of this, that's a completely different internal architecture. So are Dynamo and Dynamo principles used throughout all of these systems at Amazon? Where we have to get enormous scale, yes.
Starting point is 00:15:30 The system itself, as we described it, was mainly built for the shopping cart first. But all of these things consist of modules that are being reused throughout the whole company. It's more the principles that matter than the actual implementation. So I would say that Cassandra is actually a little bit more user friendly in that case because we... I'm sorry? Yeah, okay. We're not using hashing in order to determine where a key lives.
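The contrast drawn here, hashed key placement versus an ordered layout that supports list and prefix-list operations, can be sketched in Python. This is a toy model, not the internals of Dynamo, S3, or Cassandra: the key names, the four-node count, and the use of SHA-256 for placement are assumptions made for the example.

```python
import bisect
import hashlib

# Hypothetical keys; the "cart:" and "user:" prefixes are invented for illustration.
keys = ["user:bob", "cart:1002", "user:alice", "cart:1001", "user:carol"]


def prefix_scan(sorted_keys, prefix):
    """Return every key starting with prefix, using binary search on sorted keys."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        matches.append(key)
    return matches


def hashed_node(key, node_count):
    """Pick a node from a hash of the key; neighbouring keys scatter across nodes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


# Ordered layout: a prefix list is one seek plus a short sequential walk.
sorted_keys = sorted(keys)
user_keys = prefix_scan(sorted_keys, "user:")

# Hashed layout: each key's home depends only on its hash, so answering
# "which keys start with user:" would mean asking every node.
placement = {key: hashed_node(key, 4) for key in keys}
```

In the hashed layout a lookup by exact key stays cheap, which fits the shopping cart case described above, while listing requires either a separate index or a scan of all nodes.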
Starting point is 00:15:53 So you can actually do those list operations. You can basically treat it like you would Bigtable from Google and get a list of all of your keys. I imagine you can do that with the competitors, but Cassandra's implementation is better. So I like that you guys all focus on the big data problem on the massive scale and on all the websites that have these problems, which are like seven. CouchDB is more like the personal
Starting point is 00:16:18 database, something that you can use for whatever you want to do. It doesn't force you to have these big thoughts. But if you start out small, CouchDB allows you to grow gradually with whatever usage pattern you have. So I think these guys are building Ferraris and dragsters, and we're building the Honda
Starting point is 00:16:39 of databases, one that everybody can use and get along with for a long, long time. Absolutely, absolutely. There's a reason that Couch rhymes with ouch. For anybody that's used Mongo coming from CouchDB, it's just night and day as far as the ease of use and getting set up, getting the server installed, finding wrappers for your language of choice.
Starting point is 00:17:01 Suddenly, I don't have to know what I'm going to ask for up front. It reminds me of the Seinfeld episode when Kramer's doing the Moviefone, and he says, why don't you just tell me the movie you want to watch, right? It's the same thing with your views up front. You have to materialize these up front, but with Mongo, I can just sling them on the fly.
Starting point is 00:17:17 So you're saying that indexes magically appear with no performance cost? Well, indexes are one thing, right? But views are something totally different, right? So when I set an index, yeah, it takes time for that index to kick in, right? It takes time for it, right? Nobody can cheat. Right, but I can get around it if I have a low, you know, edge case where I need to do a query. With Couch, I have to know what that view is.
Starting point is 00:17:36 You can have that all query. Yeah, but realistically, anything beyond, you know, dynamic and Couch just, in my own experience, just haven't been, you know, it's like all in one. Oh, you should try new versions then. Okay, maybe I need to upgrade. So let me tell you what, all of these guys suck. Yes! Because you should not run your own database.
Starting point is 00:18:00 That time has passed. These guys force you to run your own database, to manage replication, to go dive deep into that. This should all be delivered as a service. How can you, you know... The cloud is awesome, but what do you do if your DSL provider craps out? What do you do when the 3G is out?
Starting point is 00:18:18 What if you're on AT&T and you have no more coverage? How do you reach the cloud? You're dead in the water with a really great cloud that nobody can reach. You go to a bar and you have a few beers, you come back and you're... Exactly.
Starting point is 00:18:31 And your customers will leave you left and right if you're offline. Well, you know, given that we've been doing this for a while, I think our customers... I kind of have an idea what customers do in this particular case. Exactly, but again, you're one of the... No, no, no, no, no. But what is more, if you aggregate all these customers that we have,
Starting point is 00:18:48 whether you have S3, SimpleDB, EBS, and all the other services, you should no longer... I mean, these guys, you are wasting your time. And I love it. You know, this stuff, I really love this data stuff and these databases. I would build 10 more Dynamos. Because it is really, really cool. But you're not solving your customers' problems. Because you're forcing them to have a lot of operational
Starting point is 00:19:10 skills. I agree with you wholeheartedly. People who use databases want to use stuff; they shouldn't think about the database as something they would use. Part of the thing we're doing is abstracting the database away to build very, very cool applications, and getting CouchDB into all the deployments you can think of, so that when you want to build something, it's already there and you can just use it. You don't have to think about it.
Starting point is 00:19:34 Like my mom should be able to run a CouchDB server without knowing that she runs a CouchDB server. That's the thing we're trying to do. We should not run a database. That's nothing you would ever want to do. Remember, guys, if you guys have any questions, just throw up a hand. I mean, I'm sure you have comments. Also,
Starting point is 00:19:49 the comments on the seven biggest sites, like, don't you guys want to be the next biggest site, or you want to be the number one site? So why would you build... Actually, I would like to really argue against the seven biggest sites. If I look at the amount of data, I mean, everybody here,
Starting point is 00:20:05 because you all think about big data, right? That's why we're here. And how many of you work not for the biggest seven sites? I think most of you, don't you? Yeah? Yeah. So there's tons of data out there. Everybody has this big data.
Starting point is 00:20:19 The time when small data sets were normal is over. I mean, everybody has petabyte data sets, and that's only the start of it. It's also a control thing. If Amazon, if Google, if Apple, if all these people own all your data, Facebook, for example, they own all your data,
Starting point is 00:20:33 they own all the URLs. The web wasn't meant to be a couple of big silos. People should be in control of their own data. They should be able to use their own data as they see fit. They should be able to put it under the URLs they control instead of being under the... whatever these guys are doing to screw them over. It's a big privacy issue.
Starting point is 00:20:52 It's another... Oh, it definitely is. There's a bunch of privacy laws in Germany that mean I cannot use S3 to store... Yes, you can. Oh, you can't? Yes, you can. Okay, stop. Stop.
Starting point is 00:21:04 Yes. Let me... The September 1 law in Germany, the new privacy law, has the definition of a data processor where, yes or no, you can use that. And so with S3, you can use it as a data processor. I'm thinking about a very specific policy where in Germany, if I'm asking somebody to delete all my data, he needs to be able to prove that everything is deleted.
Starting point is 00:21:28 If it's somewhere in the cloud, stored somewhere in some data center that the U.S. government has access to at any time... No, that's not true. As you know, at least Amazon does, and I don't know whether that goes for the other guys, we comply with Safe Harbor rules, which means that if you follow the Data Protection Directive of the EU, we follow Safe Harbor rules. There's an explicit list of things that you have to do when the government comes to you and asks for access to this data, which is to notify you, to give you the ability to retrieve your data or to remove your data before others get access. There's very explicit rules around this.
Starting point is 00:22:03 Werner, the rest of us agree against you in the sense that we're open source. So anyone can host our application, right? But they still have to host it. Yeah, we'd love to use you for your virtual private server. No, no, I'm more than happy that you guys run Cassandra and Mongo and others on Amazon EC2, which you can do with EBS,
Starting point is 00:22:22 and tons of people do these kind of things, yet you're wasting your time because that's not what you should be doing. You should be building better value for your customers. And it is by not focusing on your database. That's what we're doing with the local databases. We're giving, like, take Salesforce as an example. Everybody who's using Salesforce is making a lot of money. If Salesforce goes down, an entire industry is unable to use stuff.
Starting point is 00:22:43 If you have an offline version of Salesforce that would, for example, use... How often did Salesforce go down? Oh, it does happen. Sorry, sorry, sorry. Salesforce doesn't even have to go down for that. You need to have a connection to Salesforce. And again, if your cable provider, your DSL provider craps out, who's a happy Comcast customer?
Starting point is 00:23:01 Exactly. So are you arguing that Amazon should allow you to download the entire Amazon database and then shop locally? It should allow users to buy things locally? Yes! Yes! Can I hack it on the plane?
Starting point is 00:23:18 I guess that's the question. When you fly back from South by Southwest, can I hack the application? But another question, so we go back to the seven biggest sites thing for a moment. I hear NoSQL often, and the popular blog post a couple weeks ago talking about how MySQL can
Starting point is 00:23:34 scale and the NoSQL line and all of this stuff. We're kind of where we were with Web 2.0 a few years ago, where we had this term out here that we hadn't really defined. How many people think NoSQL means big and scaling? How many people think non-relational, schemaless? A few more hands. I think that's the distinction: we've got to put some sort of definition to this term NoSQL
Starting point is 00:23:56 so that when we have smackdowns like this, we can agree what we're arguing about. I think NoSQL is about choice of data storage, which Werner says we're wasting our time on. But if I'm building an application that needs very fast logging, I'm looking at Memcache and Redis and MongoDB. If I need something that has offline peer-to-peer replication, I'm looking at CouchDB. If I'm looking at something that needs to be hosted
Starting point is 00:24:18 and I shouldn't think about it, I look at S3 and the other stuff Amazon and other people are doing. If I have hundreds and thousands of servers that I need to keep busy, I look at Hadoop or Cassandra. I would like to point out that this is a big data meetup. And SimpleDB, I think, has a 10 gigabyte limit? Per domain. Per domain. Well, OK.
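Working within a per-domain cap like the one mentioned here generally means the application spreads items across several domains itself. A minimal Python sketch of that idea follows; the domain naming scheme (`events_00` and so on) and the count of eight domains are hypothetical choices for the example, and no SimpleDB API is called.

```python
import zlib

DOMAIN_COUNT = 8  # hypothetical number of domains chosen by the application


def domain_for(item_key, domain_count=DOMAIN_COUNT):
    """Map an item key to one of several domains using a stable hash."""
    bucket = zlib.crc32(item_key.encode("utf-8")) % domain_count
    return f"events_{bucket:02d}"


# The same key always lands in the same domain, so reads know where to look.
assignments = {f"user-{n}": domain_for(f"user-{n}") for n in range(1000)}
domains_used = sorted(set(assignments.values()))
```

A fixed modulus keeps the sketch short, but it also means changing the number of domains later would move most keys, which is one of the operational costs of partitioning by hand.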
Starting point is 00:24:36 So basically what you have to do is you have to do your own partitioning. Right. I'm not saying that any of this... Do your own partitioning. Okay, so let me... So when I think about NoSQL, and given that we have some history in this, what I think came before NoSQL was that for a very long time,
Starting point is 00:24:56 for any data storage, whether it was a database or whether it was just storage, the default application to use there, the default service to use there, was a relational database, because there was nothing else. Maybe you can have some B2B, but those are basically the only two choices. Now, what drove us to start building different types of databases is that if you look closer at how your processing works,
Starting point is 00:25:21 and you can decompose processing into different steps, you see that for most of those different steps, you have different data storage requirements. And for each of those different requirements, you can find a very dedicated solution that is capable of being very fast and very reliable, while doing the generic thing, throwing all requirements into one big bucket,
Starting point is 00:25:44 you end up with something, actually technology that was developed in the 80s, that we expect to get 21st century scaling and performance out of. That's impossible. And so now that's the thing that drives it. And if I look at the things that Amazon offers, it's not necessarily that I think that SimpleDB is the one and only table solution. No.
Starting point is 00:26:03 It is a bucket of tools that you get these days. You have S3, you have SimpleDB, you have maybe you want to run your own, you want some rank caching. But the most important is that we now have a whole range of solutions that people can pick from. So you said impossible. I wouldn't say impossible.
Starting point is 00:26:18 I would say just not discovered yet. I mean, you can't... Or relational. Well, there are ways to implement things that are somewhat relational on top of these stores. Without breaking the model. So, for example, if you want to
Starting point is 00:26:31 implement everything, like inner transactions, like multi-level views, and all of these kind of things, without breaking existing applications, you cannot... If we could build an absolutely infinitely scalable relational database and keep all of the running programs
Starting point is 00:26:49 intact, we would have done it, right? Absolutely. And the CAP theorem, I'm sure everyone has heard of it. CAP theorem. We all agree. I don't think any of us have transactions, so, I mean, we'll just skip right over that, right? So now, SimpleDB has transactions now? Atomic gets and puts, isn't it? No, you have conditional puts, which are actually in line with eventual consistency. Here, if you cannot figure out the consistency model, you ask the system to do this for you. And remember, under the covers, SimpleDB is still an
Starting point is 00:27:25 eventually consistent system. It is just that there's a number of operations on top of that with a different failure model that you can use both. Let's talk data size real briefly. I'd just like to point out that we have users of Cassandra that are storing multiple terabytes of data per node.
Starting point is 00:27:43 So... How many? How do you guys respond to that? How many users are doing that? We have at least three or four. Twitter and Digg and Reddit are probably all using multiple gigabytes. Facebook. We've got a couple of things. Meebo and the BBC, one of the biggest ones,
Starting point is 00:27:59 that do multi-terabyte sizes. Maybe not on a single box. CouchDB definitely supports that, but these guys haven't run into the... They aren't one of the seven biggest, so they're not there yet. So how many of those sites started on a system of that scale?
Starting point is 00:28:14 And would we have them today if they had? If they had seen the future, they probably would have started on Cassandra. Oh, magic! I don't know. We like to believe that it's only the big sites that have big data. But think about anybody that builds a Facebook game these days. That means that you can go from zero to 25 million users in a month.
Starting point is 00:28:36 Yeah, startup. And so imagine all the logging you need to do, all the objects you need to keep around, all these things you need to do. Running into petabytes of data is something that you do very, very, very quickly these days. You run a marketing campaign on the web. A marketing campaign is no longer just a website. It's a website. It's video. It's user-contributed content. It's casual gaming. It is integration into social networks. All of these things. That's a modern marketing campaign. All the
Starting point is 00:29:02 data that those things generate, you're talking about terabytes of data very quickly. That's a good point. So let's talk about some of the use cases and the scenarios that you would need in most applications that do just that thing. Can you, Cassandra or Couch, and I don't know the answer to this, update documents partially?
Starting point is 00:29:23 Do I have to get the whole document before I can update it? Are those positional updates and incrementals and things? Okay. Are those new? No, it's been in for a while. Okay, okay. And, yeah, Cassandra can. I mean, we can have very, very large rows,
Starting point is 00:29:35 so obviously you're not pushing the whole row at once. People can just insert more things. People build indexes within a single row for the rest of their data. What sort of operators are built in to do that for you automatically, to do things like incrementing a value, adding something to an array, updating a key in a hash? Let me call you Bob for a second. You write a JavaScript function for that. Yeah, I mean, I can drop down and do that in Mongo too, but I mean, there's convenience
Starting point is 00:29:59 is what I'm pointing out. We have a very small standard library of these functions, but users have not asked for that a lot. So it is expandable, of course, but we don't have a lot of that yet. Okay, so let me ask you guys another question. So you're open source. So if you put out a new release, do your customers have to take your database down?
Starting point is 00:30:21 No. No. So explain how you do it. So what did you expect? So we, CouchDB has a very robust file format and everything since the last three versions uses the same file format so you don't have to do any upgrades. So the server can deal with the same thing that you're actually doing. So CouchDB has a very robust file system, file system storage model that has been stable for a couple of versions so whenever you upgrade you never have to change anything with your existing setup. On top of
Starting point is 00:30:47 that, CouchDB is built in Erlang. Who's a fan of Erlang here? Erlang has the capacity, has a feature that allows you to upgrade a version at run time so it can run two versions at the same time while serving a database without having to
Starting point is 00:31:03 take it down, so it has live upgrades built in. So, Cassandra, I mean, we're changing the file format soon. You will have to restart the cluster. I mean, saying that you'll never have to change the file format is kind of... So what happens then if you have 10,000 nodes running? You can do a rolling restart. Yeah?
Starting point is 00:31:25 That's the thing. How long does this rolling restart take? Just from a practical point of view. We've done these things a few times. I mean, what else do operations folks have to do? I mean, it's fun. No, I'm kidding. Honestly...
Starting point is 00:31:38 So this is one more case why you should not be worried about this stuff. Use this storage as a service, man. This is old-fashioned. This is so 1990s. I disagree. So with Cassandra, you can run a single node incredibly easy. You can get a second node started incredibly easy. It can be anywhere. They're data center locality aware. We have 45 node installs. Twitter is running on 45 nodes. But yeah, Facebook had 150. It's easy enough to grow your cluster that I think it may be more difficult to use EC2
Starting point is 00:32:14 than it is to manage your Cassandra cluster. I'd like to tag in J. Chris from the CouchDB project who's got a few things to say. Sorry about that. I'm breaking the rules. But I just wanted to bring the debate up a notch. So all this they've been talking about as far as I'm concerned, this is geek stuff. I don't care. I'm a developer. I write Erlang. But you shouldn't have to worry about any of this.
Starting point is 00:32:36 Your database is yours. It lives at the edge. It's your data. Replication means any copy of the data. You can move it around. You can build workflows on top of replication. None of these guys, they're all zigging. We're zagging. So I would really, I would like to see some people talking about the use case. I want to share photos with grandma and I don't want to ask Mark Zuckerberg for any favors. Stuff it in S3 because then you get URLs that are just completely addressable from the web.
Starting point is 00:33:04 You don't need any intermediary servers. You know, key value, just web addressable stuff. That's the way to go. Does Grandma know how to use Curl? I mean, I don't know. I assume you're going to develop an app for her. But next up, so in terms of performance, I don't even know if we need to talk about it, because I think Cassandra has you guys topped.
Starting point is 00:33:30 So CouchDB doesn't optimize for single query performance, so everything might just be fast enough, not as fast as it can get, but by the properties that Erlang systems come with, it can handle thousands, tens of thousands, maybe a hundred thousand concurrent connections and have a constant stream of performance out of whatever your hardware supports without falling over. I would argue Mongo is fast enough. I mean, Cassandra is good to be on there. You don't have a concurrency story. What if a thousand users hit you at the same time? You're dead in the morning. They're just caching. They're written in C, I mean, that's a good... right? Aren't they fast enough? It doesn't scale concurrently. So how easy is it to also quickly hook up a CDN to your store?
Starting point is 00:34:17 Okay, does one flip of the bit suddenly have caching all over the world? Not over the world, and I don't believe it. So with S3 [unclear] at 125,000 transactions a second. So Cassandra can do 25,000 requests per second per node. Audience questions before that? Talk about transactions. When do you need transactions? Banks have relational databases because they have transactions. Banks, so you're going to have
Starting point is 00:34:51 entities that need transactions that will always be relational databases. I'll repeat the question. He's asking about transactions. Do people need transactions? Raise your hands. Okay, so that's your answer. Wait, wait. First of all, transactions have nothing to do with relational databases. The fact that they were offered by the same particular tool, that's a different thing. Transactions is just that you get a number of guarantees, ACID guarantees about the update of your data.
Starting point is 00:35:20 That's all. They have nothing to do with relational databases. Also, NoSQL is about using the right tool for the job. Yeah. Exactly. So there are engines for building transactions. ZooKeeper is one of them. It's open source as well. That's how you should do your transactions,
Starting point is 00:35:33 and that's how a lot of people do. So let's talk a little bit about ecosystems. Cassandra, I mean, I'm going to name drop. Cassandra has a few good installs. Twitter, CloudKick. Alphabetically, CloudKick is at the top. Also, Twitter, Facebook. What do you guys got?
Starting point is 00:35:54 I got the BBC. I got Canonical. It's not as big as you can get, but we probably have a few more installs than you guys have. It's been awesome stepping out to the list. Disqus is out there. SourceForge moved to MongoDB. Of course, tail.thechangelog.com
Starting point is 00:36:09 is how you keep up with open source software. Someone tweeted a remark saying that when S3 is down, I tell you to go to the bar. That was a joke. Before someone thinks that that is an official company statement. I think
Starting point is 00:36:30 just with any other database or whatever, you protect yourself at multiple different levels. You use caching. You be intelligent about where you store your data. S3 also gives you multiple zones and availability zones where you can actually store your data. There's many techniques that you can use
Starting point is 00:36:46 to protect yourself from these kind of failures. That's the real answer to giving what you're all funnier. So how about wide area replication? I mean, people are geographically distributed. Cassandra supports wide area replication. It's kind of native. How do people accomplish that with your stores? Or do they fall down? CouchDB has multi-master replication built in, which has been built with geographical distribution in mind. So we just have to have it.
Starting point is 00:37:23 I believe currently Mongo is master-slave, but I believe master-master is coming. Yeah, but it never works with your data model. Multiple regions where each of the regions is guaranteed to store the data independently. Store your data in the EU, it's guaranteed to only stay in the EU,
Starting point is 00:37:39 not even metadata about the data will ever leave the EU. So you get geographical replication for free. News goes, disappears off the earth, which actually appears to be happening at the moment. You know, you still have the other side, and you still have Asia and things like that as well.
Starting point is 00:37:58 Yeah, so Bigtable recently had an outage. I guess it was App Engine. And I love Google because they're very, very open about what went down. Amazon is too. So what went down was that their master and slave replication basically got out of sync between data centers. So I don't know how I feel about master-slave replication. Can we break that?
Starting point is 00:38:20 Is Mongo planning to break that? Good question. Good question. Good question. Yeah, it can break it. It's not that hard to break it. But, I mean, there's advantages on all sides. You know, as always, the CAP theorem is not that it forces us to use a particular consistency model. You get to make the trade-offs.
Starting point is 00:38:39 I think, you know, Cassandra, as well as Dynamo before that, one of the exercises in Dynamo was really to make sure that we were giving the hands of the developers the choice to go for, do you want to be really highly available or are you willing to sacrifice some of your consistency model there? And there's knobs which you can use there. Plus, I think, actually,
Starting point is 00:39:00 the biggest innovation in all of this was something called sloppy quorum. The fact that you could take writes even if your quorum is down. And you could always, always write to the system. Here, if a customer wants to put something in his shopping cart, you're going to tell him, no, the storage system failed, timed out. No, it just always works. I don't actually have a response to that.
Starting point is 00:39:23 I guess that's possible in CouchDB, but only because no node actually knows whether it's responsible for something. But I think we had a question. I think the one question you really need to answer with NoSQL is why do I want to do this instead of staying with MySQL and buying a bigger machine? I can buy a 10-terabyte RAM 504 machine
Starting point is 00:39:43 that will run MySQL or Oracle or whatever just fine. Why do I want a NoSQL? So the question was why not just scale up? Why scale out? I would say that the answer when it comes to Oracle is price. That's very, very clear. We're all open source. And when it comes to MySQL, eventually that machine goes down
Starting point is 00:40:03 and you have some sloppy situation where you have to either use patches to MySQL, eventually that machine goes down and you have some sloppy situation where you have to either use patches to MySQL to not lose your data or you have to implement something else. You have your ops team implement DRBD or something. It's back to the big data versus the schemaless questions. And I think if you're comparing something like Mongo to MySQL, I think it's a more fair comparison
Starting point is 00:40:22 because it's a... Back a little bit. You mentioned highly available. I would mention highly productive. A lot of applications now, let's face it, a lot of the data that you use, you're not creating in-house. You're consuming APIs
Starting point is 00:40:35 from other places. A lot of that is coming from JSONs. It's coming from other hashes that are up in the sky. Using something like MySQL, then you have to model that schema and stash those. Using a NoSQL store, you can just stash the hash. So, in answer to that question, when I see somebody writing a Ruby app or a Java app or anything with a middle-tier application layer, I look at that as a huge waste of time.
Starting point is 00:41:02 With CouchApps, it's just the browser and the CouchDB, right? You've got a jQuery guy, and you've got somebody who knows how to keep the server from dying. That's all it takes. And, you know, Werner's on that same case. He's got HTTP. I agree with that. Just having an HTTP-based database means that you don't need all that crap in the middle.
Starting point is 00:41:20 Well, I was actually arguing that you'd have to have a range of things to be able to pick from. And one of them is an HTTP accessible. But coming back to your question, actually, I think there's a number of use cases, and especially where it is about existing software, where you still may want to run your relational database. When we built the first services at Amazon Storage Services, we did not offer a relational storage service.
Starting point is 00:41:44 Why not? Because we thought that that would send the wrong signal because you rebuild, build scalable apps. You don't do that. However, there's a ton of applications. If you use Ruby and you use ActiveRecord or any standard ORM kind of tool, they all
Starting point is 00:41:59 want to talk to MySQL. And so here you have a whole range of developers that just want to focus on writing Ruby. They don't want to run databases or whatever, and they don't care about what the backend is. And that's your argument about very small databases, very small datasets, they don't care. As soon as you have the scale, as soon as reliability becomes an issue, all of these kind of things, then it turns out that relational databases have their limitations at points that will hurt you.
Starting point is 00:42:27 You bring up an excellent point about ActiveRecord. I don't want to bitch about it. It's really, really cool for what it does. But it is a thing of simplicity. CouchDB is built with simplicity in mind. And the thing that we have in the store here is that ActiveRecord, the last time I looked, had around 25,000 lines of Ruby code. And I know this is apples and whatever bungalows in a comparison. CouchDB comes at around 15,000 lines of code.
Starting point is 00:42:48 So our entire database is smaller than the wrapper you're using to solve your programming issues. So we compress the stack by using pure CouchApps without ORMs, without all the middleware crap that you could find bugs in that takes a long time to use. It's just boring and slow
Starting point is 00:43:04 and I don't know. It just plain sucks. I'm sorry. I welcome this renaissance we've got for JavaScript and all these NoSQL databases that are embracing this language of the past. But the problem of Couch is the fact that you have to do everything in JavaScript.
Starting point is 00:43:19 Talk to Apple, Google, and Mozilla about the language of the past. I'm a JavaScript fanboy. Listen to Changelog. We mention Node.js on every episode. But the problem with Couch is you have to drop down to MapReduce and JavaScript to do anything. Anything of consequence, you have to drop down to JavaScript. And, you know, I'm familiar with JavaScript. I love JavaScript. But I know a lot of the folks that I work with feel like you have to have hazmat gloves to touch JavaScript.
Starting point is 00:43:43 Yeah, that's cool. But they're like the CS majors of everybody. People who tinker with the web, designers who just started to use jQuery, they're comfortable with that. They can use Couch. You're going to get a web designer on a Couch database? Yes. Oh, okay.
Starting point is 00:43:54 Just like I was saying. I have a lot of them. And his name is Bob. But, so guys, we have to start wrapping up. Is it a quick one? Yeah, it's about data modeling. A lot of people who view relational databases always think about their data models.
Starting point is 00:44:08 I just have the impression when you talk about these NoSQL databases that you just set data modeling at the side and you don't think about how you model. So the question was about data models, and there are names for what we do. In a relational database, you typically want to normalize, and in a non-relational database, you want to denormalize, and it's really just that simple. So you duplicate, but that's fine, is what we say. So, and in closing statements, let's just talk about what you would use if
Starting point is 00:44:35 you couldn't use your own product. So you're implementing something, it's going to be a perfect fit for CouchDB, but your manager says, no, you can't use CouchDB. So what's your next choice? I think I'd probably use Persevere, which is actually written in JavaScript. There you go. Cool. So because it's written in JavaScript and all the other languages are boring,
Starting point is 00:45:02 what does it scale? So if I couldn't use Cassandra, I'd have to say Riak's really interesting, but the kind of closed source project they have going on. And Voldemort's interesting too, but they don't have ordered keys. And I love ordered keys. But yeah, so maybe Riak. Maybe Couch. I couldn't use Mongo.
Starting point is 00:45:27 Perhaps Couch, but it depends on the scenario. We need to do dynamic queries or something like that. That was just a pain point for a lot of several apps that we went through. I would like to probably check out Redis or some of the other systems that probably should have been up here. Let me just say that I hope I didn't deter anybody from using MongoDB. I'm just a fanboy, like I said. I wish somebody from 10gen could have been here to adequately represent the database.
Starting point is 00:45:49 It's a cool database. Actually, I think the one database that was left out, which I think is very different from these ones, is Neo4j. I think that if you look at databases that build things for a very specific domain, where you have graphs, where actually all your data is structured as graphs, take any social network or anything with multiple relationships and multiple connections,
Starting point is 00:46:16 Neo4j absolutely rocks in that sense. And I think they deserve, you know, partitioning that is a pretty marked thing to do. Partitioning the graph, it's a computer science problem, but it's not so... No, no, so hey, why don't we build one? Yeah, exactly. So if I can't use... I don't know, I'll go to a bar for a few hours, and I'll come back, and then
Starting point is 00:46:37 people will say I can use S3 again. Alright, so we're cooling down. Any other questions, guys? If not, guys, let's all say thank you to the non-relational database smackdown! Thank you for listening to this edition of The Change Log. Point your browser to tail.thechangelog.com to find out what's going on right now in open source.
Starting point is 00:47:06 Also, be sure to head to github.com forward slash explore to catch up on trending and feature repos, as well as the latest episodes of The Changelog. As if no passion shown Was mine alone Open, I'm open For us to try To bring it back, bring it back to Open, I'm open Bye.
