The Changelog: Software Development, Open Source - Document Cloud and Underscore.js (Interview)

Episode Date: December 6, 2009

Jeremy Ashkenas, the Lead Developer at DocumentCloud, talks about their effort to revolutionize the way media organizations gather news. Jeremy discusses their open source projects CloudCrowd, Underscore.j...s, and JAMMIT that they've released along the way.

Transcript
Starting point is 00:00:00 Hello and welcome to the ChangeLog episode 0.0.5. My name is Adam Stacoviak. And I am Wynn Netherland. We've got a great interview today with Jeremy Ashkenas from Document Cloud. Yeah. I think that one turned out really well. Some exciting projects coming out of Document Cloud. We're five episodes into this podcast.
Starting point is 00:00:33 So how close are we to figuring out our format? I think we're getting there. I think it's an iterative process, but lots of small little tweaks along the way, light little tweaks. But I think the format of having the weekly and then the weekly roundup and then also having interviews coupled into that is a nice fit. It would be nice to have some guest contributors come on to the show too. So we're pioneering agile podcasting. Yeah.
Starting point is 00:00:58 Who would you like to see come on as a guest contributor, Wynn? You know, a lot of names out there. I don't want to share probably any of them in case they're too good to come on our little show. On our little show. Well, I mean we were – I guess we're somewhat little. I mean we got just a little over 100 followers in the last few days. I think that's – Yeah, zero to 100 in a week is not bad.
Starting point is 00:01:21 Yeah, that's real nice. And certainly the blog article on GitHub.com, their blog, helped us out a lot. I really think the podcast will take off when the community gets to embrace it and the news is more than just what we're scouring to find. We've got the community crowdsourcing this deal.
Starting point is 00:01:42 So if you've got a great story out there, what's new and exciting and open source, just submit to submit@thechangelog.com via email or just go out to the website, thechangelog.com/submit. We'd love to get that news up on the site. Yeah, absolutely. I'm looking forward to it. We haven't gotten any submissions yet, and that's kind of a shame. Not that we need people to start contributing, but it would be nice to have somebody alert us besides us just kind of picking up what we find. Yeah, I'd like to see what pools of information people are drawing from outside of the ones I'm fishing in. How about you?
Starting point is 00:02:22 Yeah, no, I agree. I mean I don't want to be Ruby-centric, and I don't want to be very language-specific. I want to be agnostic about what we're doing here, and I think that's always our approach, but you and I tend to just jump in those worlds, and those are the ones that are most fresh to us. So if you've got something out there in a different language, let us know.
Starting point is 00:02:42 Absolutely. Well, we've got a great interview today with Jeremy Ashkenas from DocumentCloud. We talked about three of his great projects, and I think it's a really dynamic interview. They're doing some exciting things in the media primary news source space. So how about we get to it? Yeah, let's get to it. Enjoy the show. All right, we're here with Jeremy Ashkenas, and Jeremy is with Document Cloud.
Starting point is 00:03:09 Jeremy, explain a little bit about what Document Cloud is and what it's doing. Sure. So Document Cloud is a new project that I'm really happy to have started with in August. It's a grant funded generously by the Knight Foundation for a two-year project to help make the primary source documents that the New York Times and the Washington Post and the Chicago Tribune and all of these major news organizations are gathering when they're writing their stories, to help make these primary source documents that you get from the government, you get from Freedom of Information Act requests, you get from, you know, good investigative reporting, to make those public and to make them searchable online, to make them able to be embedded alongside news articles for context and to make richer stories. And one of the nice perks of this project is that the Knight Foundation has mandated that everything that we produce be open source and be released open source. So as we've been going along, I've been trying to split off the sort of atomic chunks of the Document Cloud project as little open source projects and release them, and it's gotten a great response so far.
Starting point is 00:04:17 And we've had a whole bunch of community contribution that has really helped improve the three things that we've released so far, being CloudCrowd, which is a parallel processing sort of framework for Ruby that's a little bit MapReduce inspired, although a little bit more practically oriented, I think, for your day-to-day workflow than a pure MapReduce like Hadoop. Jammit, which is an asset packager plugin for Rails that we just launched a couple weeks ago. And Underscore.js, which is a collection of functional programming helpers for jQuery to give you those Ruby-style map, inject, select, fold left, fold right kinds of array and object
Starting point is 00:04:59 functions that you don't always have cross-browser in JavaScript, but it's very nice to have as kind of a standard library base. Awesome. Those are three exciting projects to open the gate with. How about yourself? What's your role at Document Cloud? It's kind of my pet project at the moment. We're looking actually, which I should mention here in case anyone out there is listening, not necessarily in New York, but we're looking to hire more help, both with JavaScript and with Ruby, Postgres, EC2 backend stuff.
Starting point is 00:05:29 But right now it's just me building out the initial prototype of it. So I'm the lead developer, I guess, is my technical job title. So the Knight Foundation, why did they – I mean, I kind of understand why, but do you have some background to why they wanted everything to be open sourced? It's part of the mandate. So they have this thing called the Knight News Challenge, and the idea is to fund interesting technology slash journalism projects to help figure out what the future of journalism is going to end up looking like. So they were the ones who funded EveryBlock to the tune of about a million dollars a couple years ago, which is, I think, their biggest name. But they fund five or ten projects, most of which are smaller scale than, say, a Document Cloud or an EveryBlock. And then the idea is that you end up producing pieces of technology that can help newsrooms transition to the Internet age.
Starting point is 00:06:21 And so to that end, everything that you do has to be open source code. That is in the contract, I think. Everything that the grant money is spent on is supposed to be towards the creation of these open source news projects. That's wild. So EveryBlock, too. I didn't realize that they supported that as well. That's off topic, but
Starting point is 00:06:40 EveryBlock is an awesome project. Yeah, that's why EveryBlock did that big code dump at the end of the project before they sold themselves was because that was the contract. I'm behind the news. I didn't hear that they sold themselves. Yeah, they were bought by MSNBC for an unknown amount.
Starting point is 00:06:56 So that was a nice exit for the team after the grant funding because this is a two-year grant, and at the end of the two years, we're going to have to figure out how to continue the project. So we don't have specific plans yet, but EveryBlock's method was to get bought by MSNBC, who's going to continue it. Wow. So what's your team size like? You said, is it just you, or do you have more people in your team now? We recently hired our second full-time person who's working on the administrative and dealing
Starting point is 00:07:20 with all these news organizations who have signed up. On the documentcloud.org website, there's a list of partner orgs, but it's many of the major news organizations in the country, along with magazines like The New Yorker and The Atlantic Monthly and things of that nature. And I guess the overseas stuff is starting to expand a little bit more as well. There's been some interest in the UK. So she's our second full-time person. There's the three founders, Eric and Scott at ProPublica, and Aaron, who's the editor of the interactive news section at the New York Times, were the ones who got the grant in the first place. So
Starting point is 00:07:58 they don't have too much time to devote to the project from day to day because they've still got their day jobs. But they are the, I guess, they're sort of the board. Can you tell us about how this project got started? I guess I wasn't too much involved in the conceiving of the project stage. I got hired after the grant was a sure thing. So it's sort of been in the works for a long time. I think the three of them originally had the idea to make these primary source documents that are sort of, you know, passing through the filing cabinets of the New York Times, for example, to make
Starting point is 00:08:30 them public and to make them accessible online and wanted to start a project to make that happen. So a big part of this is, I don't know if you guys have seen the document viewer that the New York Times does for a lot of their sources. For example, when they had a big Guantanamo project, they released a couple thousand. They started out as PDFs, but they became these sort of JavaScript, HTML web plugins on the Times' website where you could search through the court transcripts and the prison records of these inmates and keep track of
Starting point is 00:09:02 what exactly was going on on a detainee-to-detainee basis. So that particular piece of software, the Document Viewer, that they're using to embed the stories on the web without having to just download PDFs, the Times is donating to this project. So part of what I've been working on has been integrating that with the Document Cloud prototype. And there's a new version of it that should be coming out shortly that you'll be able to find on the NewYorkTimes.com in a couple weeks. That is pretty cool. It's got a Google Books-like infinite scroll kind of a setup for these documents,
Starting point is 00:09:33 and it's pretty nicely designed. So that's in the works right now. Do you see Document Cloud primarily being involved in the government space? It's the primary source document space. So it's all of these people, all these organizations whose mission is to uncover primary source documents. So whether that means it's government records or it's internal corporation memos or emails or anything, I guess, that becomes a primary document of record, I think we're interested
Starting point is 00:10:02 in. And then beyond that, we might end up opening it up to more things like watchdog groups who are gathering these things. And, yeah. So you mentioned these three projects, CloudCrowd, Underscore.js, and Jammit. Are all three of these your creation, or explain a little bit how each came about? Yep, they're three direct extractions from the Document Cloud prototype that I've been working on over the course of the fall. So one of our first problems was that importing PDFs into Document Cloud is a pretty slow, sort of painful process because you've got to split
Starting point is 00:10:38 apart a PDF into a number of pages, and you've got to convert each page into both its full text and its images in different sizes to display it inside of the document viewer. And then you've got to... And part of Document Cloud is that we're actually using the OpenCalais web service to do semantic indexing of the documents. So we end up knowing what people and what places
Starting point is 00:10:59 and what organizations and what terms are mentioned within a document. You can search across that kind of stuff. So we have to go to OpenCalais and get that information back. And all of this is a very time-consuming, expensive process. So CloudCrowd, which is our parallel processing framework, is sort of a generic, you have a job you need to get done in Ruby, and you can maybe parallelize it to a certain extent.
Starting point is 00:11:21 And you'd like to do it in as parallel a fashion as possible. So the CloudCrowd primitives are: you write a Ruby script, you write a class that has at least a process method, and the process method is the parallel part of the computation. And it's all sort of web-based. So there's a REST API. It comes as a gem, and when you install the gem, you get servers and nodes. And the server is the central thing that manages all of the work, and the nodes are the actual machines that are performing the work. And when you install your action, all you have to do is say, okay, if I'm on a machine that's doing the work, what is the parallel part of the work that I'm going to do? And then you send it a URL to a file, in our case a PDF, although it could be a JSON document or some other kind of
Starting point is 00:12:05 XML document or some other kind of information. And then you can do the processing on those documents in parallel. So in our case, we're doing the PDFs in parallel. And then the MapReduce plays in, in that if you define more than just a process action, if you define a split and a merge, the split at the beginning will take a single input and divide it up into many to all be run in parallel across that process method. And then the merge will take back the results of what came out of all of your process calls and merge it back into a single result for convenient consumption back at the other end when you get pinged back when your job finishes. So in our case, that means you take a PDF, you split it up into chunks of pages using CloudCrowd. Each, you know, five or 10 page chunk gets processed in parallel to get the images out,
Starting point is 00:12:50 to get the text out, to get the entities out through OpenCalais. And then at the end, merge back together into a single archive that we can import back into the prototype. So in that, using this, we can, you know, install this gem on many EC2 machines if we need to, and spin up CloudCrowd nodes very easily and start distributing the work out. So this can happen in a reasonably fast fashion. Is it EC2 and S3 only or does it work with any sort of cloud platform? It works with any sort of... So there's actually no dependency on EC2. It's only on HTTP and REST. So you could install it on whatever kind of box you'd like.
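The split, process, and merge steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not CloudCrowd's actual Ruby API: the function names `split`, `process`, `merge`, and `run_job`, the chunk size, and the local thread pool standing in for remote worker nodes are all assumptions made for the example, and the per-page work is a placeholder for the real text and image extraction.

```python
from concurrent.futures import ThreadPoolExecutor


def split(pages, chunk_size=5):
    """Divide one input (a list of pages) into chunks that can run in parallel."""
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]


def process(chunk):
    """The parallel part: a placeholder for extracting text and images per page."""
    return [page.upper() for page in chunk]


def merge(results):
    """Combine the per-chunk results back into a single result, in order."""
    return [page for chunk in results for page in chunk]


def run_job(pages, chunk_size=5, workers=4):
    """Hypothetical driver: split once, process chunks concurrently, merge once."""
    chunks = split(pages, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so merge sees chunks in page order.
        results = list(pool.map(process, chunks))
    return merge(results)
```

Calling `run_job(pages, chunk_size=5)` on a twelve-page input would produce three chunks that are processed independently and then reassembled in page order.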
Starting point is 00:13:26 It's nice on EC2 because you can spin up and down these nodes on the fly very easily. There is an S3 file system backend built in because that's what we've been using. When it transfers files between different machines, this has always been a problem in Hadoop. In Hadoop, you have to install this Hadoop FS where there's a common shared file system that all of the nodes can write to. So the Cloud Crowd default file system backend is to use S3 as that sort of common shared file space. So when you're done, when they're done with an intermediate work unit, it'll save that work unit to S3. And then in the merge step later on, it can pull that from S3 and continue the processing without having to worry about transferring, about which particular node has which copy of which file. But there's also a file system
Starting point is 00:14:08 back end. So if you're just doing it on one box, if you only have one machine that you're doing work on, you can use the file system backend and it'll be faster. Or if you've got something like GFS or GlusterFS set up where you have a shared mounted networked file system, you can use that also for a faster than S3 performance. This is not technical really at all, but I'm curious about the kind of comments you get about the ASCII art in your README. For CloudCrowd? I don't know. Not too many comments on the ASCII art.
Starting point is 00:14:36 People have been more taken with the diagrams that are in the wiki than the ASCII art. I missed that part. Adam's an ASCII art fan and was convinced by looking at the README for Underscore.js that you had ripped off his signature ASCII art. Did I say ripped off? No, it's kind of funny though because your Underscore.js ASCII art is – if you go and look at – I guess – I don't know if you have any sites out there now that actually do it, but at the top of every web document, we put this ASCII art that says handcrafted, and I think it was the exact same ASCII art font, I guess if that's what you call it. I'm just getting it from this. There's this generator page that does it for you where you can just type it.
Starting point is 00:15:18 Yeah, I use the generator page. It's probably the same one. Probably is. Jeremy, I had not noticed the wikis on these projects because normally I use the GitHub wikis. These are beautiful. So the art, explain a little bit about where the diagrams come from.
Starting point is 00:15:34 I guess, so only one of them has a wiki. So CloudCrowd has a wiki and Jammit and Underscore have pages. And I'm still trying to figure out how to document these projects correctly. I think I might stick to the plain HTML. But in any case, so the art on the CloudCrowd wiki is what you're asking about? Sure, yeah, the example PDF processing artwork.
Starting point is 00:15:54 Yeah, so Cloud Crowd really needs some hand-drawn diagrams and they're usually a lot nicer than if you spit out a UML or something because you can actually sort of illustrate what's going on. And I think that CloudCrowd really needs some explanation because you're talking about a complicated system where you have multiple machines. I think, you know, at minimum, you're kind of talking about three different logical machines.
Starting point is 00:16:15 You have your application that is making the request. You have your central CloudCrowd server, and then you have the server where the work's being done. So it gets a bit involved, and so it's nice to be able to draw it out, sketch it out on paper and to show... These are your original drawings? Yep. Awesome. What are you doing to
Starting point is 00:16:31 do the workers, the background jobs and stuff? What are we... What's the question? What are you using to do the worker part of it, the cloud nodes, the physical machines with teams of... So it's all just Ruby. So the idea is that you install this, for Cloud Crowd, you install this gem
Starting point is 00:16:48 and it comes with sort of baked-in Sinatra servers that are able to listen for work requests and then start doing it. So what you do is you install your action, which is just a Ruby class. It's just a script that knows how to do a process method. And then the node will receive requests to do work and it'll run that action if that action is specified. So in our
Starting point is 00:17:10 case, we have an action called process PDFs, but your action might be called encode video and you would have your Ruby script that knows how to do the video encoding and then save that back to S3. So if you look inside the wiki, there's a page called the job API that details all the sorts of built-in methods when you create an action, the kinds of, or I'm sorry, not the job API page, but the writing an action page that details all of the built-in methods that you have. So you have little, you know, it's a really sort of minimal conveniences. You have ways to get the input, and if the input looks like a URL, then it'll pre-download it for you so that by the time your action starts, it'll be ready to go on the local file system, and you can start manipulating it. You can start encoding your
Starting point is 00:17:54 video. You can start resizing your JPEG. You can start processing your PDF. You can pass an arbitrary JSON hash of options to any action. I thought that was a convenient way to be able to configure, to make actions a little bit customizable. So you can imagine if you had an image resizing action that you wrote using, say, GraphicsMagick or ImageMagick to do it efficiently, you could have in your options hash,
Starting point is 00:18:16 you could have the sizes and the image types that you wanted to get back out. And then the other important method that you get when you're writing a custom action is save, where you call save and you pass it a path on the local file system to your finished video or image or PDF. And it'll save that back to the asset store, which is usually S3, but could be the file system like we already discussed. And then it gives back a URL, which can be used to access it, which then gets sent back to your app. So is CloudCrowd in the same space as other projects like Delayed Job or Resque?
Starting point is 00:18:55 Resque actually, I think, overlaps it to a good extent, which is interesting because I didn't know anything about it when we released it, and CloudCrowd was out for about a month before Resque showed up. And I'm not sure if I would have just used Resque if it had been out before we had started working with CloudCrowd. The main difference between, well, so DelayedJob and BackgroundJob are both simpler alternatives where you're basically just starting up daemons, but there's not this whole distributed sort of queue thing set up. Resque and CloudCrowd both have
Starting point is 00:19:24 central queues from which work is parceled out to a whole bunch of workers. And I think the main difference is that with Resque, you have an atomic sort of job, and it's more like BackgroundJob where you're saying, do this thing. And with CloudCrowd, you actually have this kind of built-in MapReduce primitive where you can have a split and a process and a merge, and it'll sort of automatically parallelize the processing to a certain extent. But that's certainly something that we could contribute maybe to Resque. That's why I was asking you about what you were using in that part,
Starting point is 00:19:52 like BackgroundJob or why you went the route of, I guess, writing it all yourself, right? Yeah. You mean instead of using BackgroundJob? Well, yeah, instead of using something that was out there already, you know, to do queuing processing or background jobs or just job handling in general, why you chose to go the route of writing it yourself versus using something that's out there already and able to use? It was, well, I mean, it kind of had a funny genesis in terms of how it got started because there was sort of an internal project at the Times that was taking the beginning steps towards having a distributed image processing system because they need to do a lot of image resizing. And this was sort of the generalization
Starting point is 00:20:29 of that. So I didn't actually start it. I kind of inherited it and then fleshed it out. But BackgroundJob, I don't think, really fits the same niche that Resque or CloudCrowd do. And I think that Resque and CloudCrowd do overlap to a large extent. And if Resque had been out, then I might've just used that instead of trying to do this thing. Well, it's good to have choices, right? Yep, it is. Talk to us a little bit about how Underscore.js came about. Sure.
Starting point is 00:20:57 So underscore is another extraction. The idea, I guess, behind it is that it's sort of all the things that, you know, jQuery gives you a great, it sort of levels the playing field. You know, you're stepping into a naked browser page, and if you have jQuery, there's a whole lot of things you can do. And Underscore is kind of finishing off, I think, you know, sort of where jQuery starts. Like if you, at least in terms of my personal use, if you hand me jQuery, you can start being productive immediately because that's about all that you need to have a solid JavaScript foundation. And it looks like other people sort of feel the same way because there's been a decent amount of interest in getting underscore
Starting point is 00:21:34 available in CommonJS and Node and Rhino and all these sort of backend server-side JS systems as kind of a standard, I guess, foundation for doing all of the functional array and object and collection manipulation that you need to do so frequently. I'm a big fan of it. I implemented a new feature in the footer of my blog to pull in the reading list from Readernaut using jQuery and Underscore to do a lot of the parsing of the JSON that comes back from the service. And it just felt natural as a Rubyist to drop in and use these methods and use the templating that is built in. I'm a big fan of this framework.
Starting point is 00:22:18 I think it's going to take off. I hope so. I don't think we have enough. I'd like to think that they're going to take off and they're going to have some the day, you've got to get back to work on Document Cloud proper and making that prototype as solid as you can. But it's nice to put it out there and to have it be picked up and run with. Yeah, for sure. Can you talk about Jammit, where that came about? Can you give us the backstory? So Jammit is another extraction.
Starting point is 00:23:02 So in the Document Cloud prototype, I was thinking about how we were going to be packaging assets. And it had been sort of a problem for me with Rails projects in the past. So the Document Cloud interface is extremely JavaScript heavy. It's basically a JavaScript application and Rails is kind of a skinny backend.
Starting point is 00:23:21 And then the database is more significant because you have to do all the searching of these documents. But the Rails layer is actually very skinny. And most of the rendering of views and the, and the, you know, client side validation logic, you have to validate in the server too, but do it on the client first. And there's actually a full sort of MVC stack in the client. So we have models of, of users and of documents and of saved searches and of labels and of metadata. All of these things are real first-class models in JavaScript in the client using underscore to sort of manipulate them. And then we have this sort of rich, tabbed document searching journalist workspace UI in a client.
Starting point is 00:24:02 Where, as a journalist, you can search through the documents and you can load up the viewer and you can do saved searches and you can organize them under labels and you can visualize them. It uses Canvas to do some neat little visualizations of the connections between related documents and the people that are mentioned in more than one document. And so basically, at the end of the day,
Starting point is 00:24:19 you have a huge amount of JavaScript because it's an entire application getting sent down to the client. And in the past, I had had some frustrations using the Rails asset packager to try to manage a large number of small JavaScripts into reasonably efficient parcel downloads. So we had had to customize that a little bit at my previous job. And I figured that I would just extract that into Jammit. So Jammit tries to be a relatively comprehensive asset packager for Rails that is easy to configure, so it uses directory globs
Starting point is 00:24:50 instead of having to specify every single JavaScript. You can just have a specific views directory full of all kinds of tiny 10 or 20 line views, and then just say in your directory glob, just say views/*.js, and you'll get all of them included all the time. So you don't have to worry about it. That's what I'm curious about asset packagers, that you have to specify each individual one you want to.
Starting point is 00:25:11 Exactly. And then in development, you're trying to make your app, and every time you change your JavaScript file or rename it, you have to go restart your server and change assets.yaml. It's a pain. So, yeah. Or write a rake task that does the packaging for you. It's a pain. I mean, you shouldn't have to.
Starting point is 00:25:28 So the idea here is that if you have an ordered, unique list of directory globs, so they all get, so if you're talking about a specific package, it's going to expand all of the globs in order. It's going to take the unique set of files, and in the end you can keep things ordered the way you want. So you can say, first give me just jQuery, then just give me Underscore, and then give me javascripts/*.*.
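The ordered, unique glob expansion described here can be sketched as follows. This is an illustrative Python version, not Jammit's own code: the function name `expand_package` and the example paths are hypothetical, and sorting each glob's matches is an assumption made so the output is deterministic.

```python
import glob
import os


def expand_package(patterns, root="."):
    """Expand glob patterns in order, keeping only the first occurrence of each file."""
    seen = set()
    ordered = []
    for pattern in patterns:
        # Sort each glob's matches so the result does not depend on directory order.
        for path in sorted(glob.glob(os.path.join(root, pattern))):
            if path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered
```

With patterns such as `javascripts/jquery.js`, `javascripts/underscore.js`, and `javascripts/*.*`, the two named files come first and every other file follows exactly once, because a file that is already listed is skipped when a later glob matches it again.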
Starting point is 00:25:49 Give me everything else after that. It's interesting that you have built-in support for JavaScript templates, and you list a number of options here, from John Resig's micro-templating to Underscore's built-in templating that we mentioned earlier, Prototype's support, and also Mustache.js from defunkt. Any preference or views on the four of those? I think there's really good cases to be made for different ones.
Starting point is 00:26:15 As with most things JavaScript, there isn't really a standard, and there's lots of different competing ways to do it. I wanted to support JavaScript templates out of the box because that's one thing that if you're using JavaScript templates seriously in your web applications, you need to have good asset packaging support for them. Because basically every time you load the page, you've got to rebuild all of your JavaScript templates and send them down as a single JST file, I guess. So I wanted Jammit to be able to do it conveniently. But in terms of the actual template method, I don't think that I am too qualified to know about all the different ones. I know there's really a whole gazillion.
Starting point is 00:26:51 There's PureJS, and there's a whole bunch of different methods out there. A lot of people like sort of inserting hidden DOM elements onto the page and then using those actual DOM elements as templates. The ones that I've been more familiar with are more like strings with interpolation like ERB, which is what the micro-templating that we're using and that Underscore uses is similar to. It's a lot like ERB, but with JavaScript instead of Ruby in your tags. So where do these names come from?
Starting point is 00:27:19 Jammit, you've got these very unique names. Are you part of that naming process, or is that something that you inherited as well? I'm part of it. Where did the names come from? Yeah, these are awesome names. I mean, you look at other people in the space too, like ThoughtBot.
Starting point is 00:27:37 They've got some really unique names behind their open source projects. I just wonder where Jammit comes from and kind of the thought process behind these cool names. You try to find something that's evocative of what the actual app is, but not too clunky or acronym-y. So I don't know. Spend a couple hours with it kicking around in the back of your head until you find something that sounds appropriate. I'm not too sure about CloudCrowd. I keep tripping over it every time I try to say it too many times fast in a row.
Starting point is 00:28:01 It's kind of a tongue twister. But yeah. You know, one of the things that impressed Adam and me when we looked at Jammit and Underscore was just the handcrafted nature of the documentation. Are these your themes? Did you rip this off from somewhere? The themes? Yeah.
Starting point is 00:28:18 Rip this off somewhere. Listen to you. He's calling you a thief. I called you a thief with your ASCII art. You called me a thief first, right? He comes on the show, he's stealing my ASCII art, and you're still in... Geez.
Starting point is 00:28:34 Well, I don't think the documentation is as much thievery as maybe some of the ideas. None of this stuff is particularly new. Underscore.js has a lot of ideas from Prototype and a lot of partial implementation sharing of what Prototype and jQuery are doing in terms of their collection manipulation. And of course, the idea of having a Rails asset packager isn't new, and the idea of having
Starting point is 00:28:57 a Rails or a Ruby distributed job system isn't new either. But the documentation, I think, is new. I didn't grab that from anywhere. Yeah, the reason I say that is most developers, if we write documentation,
Starting point is 00:29:12 they tend to not be that pretty to look at. And both of these sites are informative, minimalistic, and just look great. And so if this is totally your design, then kudos, because it really does a good job of selling the project without having to dig into the source to see how things operate. Yeah, well, I appreciate that. I think that's a big part of why they're actually, you know, I think we have over a thousand watchers between all of these projects now on GitHub and between, you know, people watching Document Cloud and people watching Document Cloud-related projects, which is great. I think a lot of that has to do with having solid documentation out of the gate.
Starting point is 00:29:49 When people first see a project, what they're going to judge it by is what they start reading about it. Either that's a blog post explaining it, or hopefully it's the official docs, and the official docs are good enough for them to get their feet wet and to start messing with it. Since you mentioned blog posts, do you have any blog posts out there going deeper into some of the stuff that we're talking about? No, I wish I did. It would be nice to. I think that it would be good to start putting on documentcloud.org some blogs about design decisions as to why things are the way they are in terms of Jammit and how it packages assets or CloudCrowd and how it distributes jobs.
Starting point is 00:30:22 But no, I haven't gotten around to any of that yet. Good idea, though. I mean, I think that one of the things we cling to jumping into acceptance of an open source project is first, what does it do? Why do I care? Second is, where's the documentation? How deep is it? And how informative is it?
Starting point is 00:30:39 And three, does it actually solve the problem I'm trying to solve, right? So, some blog posts. Yeah, amen to that. If any listeners feel like writing something, that would be much appreciated. Do you have anything in Scoop, anything cool coming up that you just have to mention?
Starting point is 00:30:55 Actually, we do. So the next little, it's a bit smaller, I think, in scope than our previous ones, but the next Document Cloud open source release, I think, is going to be a project called PDF Pieces coming out in a day or two that makes it easier to take a PDF and to pull it apart into all of its component pieces and then, you know, things that you can then index and put on the web and make searchable. So you'll be able to do, as a command line, you'll be able to do PDF pieces, pages or images or text and explode the PDF apart
Starting point is 00:31:26 into its UTF-8 full text or into PNGs or GIFs or JPEGs of each page or into single page PDFs, if that's what you need. And it'll also pull out some of the metadata so you can find out, you know, the title and the author and the producer and things like that of the PDF. So what this is, is just going to be a Ruby gem that wraps the excellent Apache PDFBox Java library. And so under the covers, it's actually shelling out to special little Java classes that are doing the actual work. So it's pretty nice and efficient.
Starting point is 00:31:58 And you can pass it, say, a PDF and tell it to give you back all the images for that document at 700 and 1,000 pixels wide, in both JPEG and PNG forms. And it'll do all that for you in a single JVM run. So you don't have to keep going back and forth between Ruby and Java doing it for every page. So that's the next thing on our plates. That's all very interesting. We normally wrap each show by asking the guest, what's on your open source radar? So any projects out there other than the ones that are coming out of Document Cloud that excite you? Yes, absolutely.
Starting point is 00:32:35 So I think the big thing that I'm excited about but that I can't quite see myself getting into yet, which is kind of like a tease, I guess, is all of the server-side JavaScript stuff that's happening. Because I think we're at the point now with a lot of projects that are more interesting technically on the client side than they are in the server. And you're doing a lot of great visualization and computation, a lot of great interaction with real MVC stacks, with real models in JavaScript. And it's really a source of duplication and pain to be duplicating all of these models. You write it once in Ruby to do the validations
Starting point is 00:33:12 and to do the manipulation where you're asking a document what its metadata is and what people it talks about and that kind of thing. You're doing that both in Ruby on the server as well as on the client in JavaScript. And to be able to have one language where you can share the models and just send down JSON data and you can have the same operations and the same validations running both on the server and the client, I think would be
Starting point is 00:33:32 really, really, really useful. So I'm just kind of waiting for someone to write the complete comprehensive Rails equivalent in one of these server JS platforms, whether it ends up being Node or Narwhal on Rhino or something else on custom V8 that has a complete story of how do you do your parallel processes, how do you do your file interactions, how do you talk to a database, how do you interface with other C or Java libraries,
Starting point is 00:34:01 as the case may be. And once someone has all that figured out and we've got a good server-side platform, I think that it'll become an instant no-brainer to build large-scale web applications, you know, in JavaScript end-to-end. So I'm kind of waiting for that. I can't justify it for Document Cloud as a project
Starting point is 00:34:16 because I don't think it's there yet, but I think it's coming soon, maybe within a year or so. If anybody wants to get a hold of you, what's the best way to reach out to you? Are you on Twitter? What's your handle, email? I'm actually not on Twitter. People like to message through GitHub, which works pretty well, or you can do
Starting point is 00:34:32 jeremy at documentcloud.org. And just to mention the thing that I said at the beginning: if you're, you know, a talented JavaScript or Ruby programmer and you are interested in working on projects that have a mandate to be open-sourced,
Starting point is 00:34:47 then we'd love to hear from you. So yeah, you can send me an email at jeremy at documentcloud.org. And so they can also go to github.com forward slash documentcloud, and they can hit you from there. Is that your user, or do you have your own user? Yep, I'm pretty much that one, too. So yeah, that'll work also. Awesome.
Starting point is 00:35:04 Well, it was awesome having you on the show, Jeremy. Thank you very much for taking the time to chat with us. You're probably awesome. Thanks a lot. It's a pleasure having you. It's a pleasure being on. I appreciate it. Thank you for listening to this edition of The Change Log.
Starting point is 00:35:23 Be sure to tune in weekly for what's fresh and new in open source. Also, visit thechangelog.com to follow along, subscribe to the feed, and more. Thank you for listening.
