The Changelog: Software Development, Open Source - Document Cloud and Underscore.js (Interview)
Episode Date: December 6, 2009
Jeremy Ashkenas, the Lead Developer at DocumentCloud, talks about their effort to revolutionize the way media organizations gather news. Jeremy discusses their open source projects CloudCrowd, Underscore.j...s, and JAMMIT that they've released along the way.
Transcript
Hello and welcome to the ChangeLog episode 0.0.5.
My name is Adam Stacoviak.
And I am Wynn Netherland.
We've got a great interview today with Jeremy Ashkenas from Document Cloud.
Yeah.
I think that one turned out really well.
Some exciting projects coming out of Document Cloud.
We're five episodes into this podcast.
So how close are we to figuring out our format?
I think we're getting there.
I think it's an iterative process, but lots of small little tweaks along the way,
light little tweaks.
But I think the format of having the weekly and then the weekly roundup and then also having interviews coupled into that is a nice fit.
It would be nice to have some guest contributors come on to the show too.
So we're pioneering agile podcasting.
Yeah.
Who would you like to see come on as a guest contributor, Wynn?
You know, a lot of names out there.
I don't want to share probably any of them in case they're too good to come on our little show.
On our little show.
Well, I mean we was – I guess we're somewhat little.
I mean we got just a little over 100 followers in the last few days.
I think that's –
Yeah, zero to 100 in a week is not bad.
Yeah, that's real nice.
And certainly the blog article on GitHub.com,
their blog, helped us out a lot.
I really think the podcast will take off
when the community gets to embrace it
and the news is more than just
what we're scouring to find.
We've got the community crowdsourcing this deal.
So if you've got a great story out there,
what's new and exciting in open source, just submit to submit@thechangelog.com via email or just go out to the website, thechangelog.com/submit.
We'd love to get that news up on the site.
Yeah, absolutely.
I'm looking forward to it.
We haven't gotten any submissions yet, and that's kind of a shame. Not that we need people to start contributing, but it would be nice to have somebody alert us besides us just kind of picking up what we find.
Yeah, I'd like to see what pools of information people are drawing from outside of the ones I'm fishing in.
How about you?
Yeah, no, I agree.
I mean I don't want to be Ruby-centric,
and I don't want to be very language-specific.
I want to be agnostic about what we're doing here,
and I think that's always our approach,
but you and I tend to just jump in those worlds,
and those are the ones that are most fresh to us.
So if you've got something out there in a different language, let us know.
Absolutely.
Well, we've got a great interview today with Jeremy Ashkenas from DocumentCloud.
We talked about three of his great projects, and I think it's a really dynamic interview.
They're doing some exciting things in the media primary news source space.
So how about we get to it?
Yeah, let's get to it.
Enjoy the show.
All right, we're here with Jeremy Ashkenas, and Jeremy is with Document Cloud.
Jeremy, explain a little bit about what Document Cloud is and what it's doing.
Sure. So Document Cloud is a new project that I'm really happy to have started with in August.
It's a two-year project, funded generously by a grant from the Knight Foundation, to help make public the primary source documents that the New York Times and the Washington Post and the Chicago Tribune and all of these major news organizations are gathering when they're writing their stories: the documents that you get from the government, from Freedom of Information Act requests, from, you know, good investigative reporting. The idea is to make those public and to make them searchable online,
to make them able to be embedded alongside news articles for context and to make richer stories.
And one of the nice perks of this project is that the Knight Foundation has mandated that
everything that we produce be open source and be released open source.
So as we've been going along, I've been trying to split off the sort of atomic chunks of the Document Cloud project
as little open source projects and release them, and it's gotten a great response so far.
And we've had a whole bunch of community contribution that has really helped improve the three things that we've released so far. Those are
CloudCrowd, which is a parallel processing sort of framework for Ruby that's a little bit
MapReduce inspired, although a little bit more practically oriented, I think, for your day-to-day
workflow than a pure MapReduce like Hadoop; Jammit, which is an asset packager plugin for Rails that
we just launched a couple weeks ago; and Underscore.js, which is a collection of functional
programming helpers for jQuery to give you those Ruby-style
map, inject, select, fold left, fold right
kinds of array and object
functions that you don't always have cross-browser in JavaScript, but
that are very nice to have as kind of a standard library base.
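For readers who have not met these helpers, here is a rough sketch of the map, select, inject (fold left), and fold right operations Jeremy lists. It is written in Python rather than JavaScript and is not Underscore's API; the `line_items` data and the `fold_right` helper are made up for illustration.

```python
from functools import reduce

# Made-up sample data for illustration.
line_items = [
    {"name": "pen", "price": 2, "qty": 10},
    {"name": "ink", "price": 15, "qty": 1},
    {"name": "pad", "price": 4, "qty": 0},
]

# map: transform every element.
names = list(map(lambda item: item["name"], line_items))

# select (filter): keep only the elements that pass a test.
in_stock = [item for item in line_items if item["qty"] > 0]

# inject / fold left: combine elements into one value, left to right.
total = reduce(lambda acc, item: acc + item["price"] * item["qty"], in_stock, 0)


def fold_right(fn, items, initial):
    """Hypothetical helper: like inject, but consuming the list from the right."""
    return reduce(lambda acc, item: fn(item, acc), reversed(items), initial)


joined = fold_right(lambda name, acc: name + acc, names, "")
```

The point of a library like this is that the same handful of operations work the same way everywhere, so collection code reads as a chain of small transformations instead of hand-written loops.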
Awesome. Those are three exciting projects to open the gate with.
How about yourself? What's your role at Document Cloud?
It's kind of my pet project at the moment.
We're looking actually, which I should mention here in case anyone out there is listening,
not necessarily in New York, but we're looking to hire more help,
both with JavaScript and with Ruby, Postgres, EC2 backend stuff.
But right now it's just me building out the initial prototype of it.
So I'm the lead developer, I guess, is my technical job title.
So the Knight Foundation, why did they – I mean, I kind of understand why, but do you have some background to why they wanted everything to be open sourced?
It's part of the mandate.
So they have this thing called the Knight News Challenge, and the idea is to fund interesting technology slash journalism projects to help figure out what the future of journalism is going to end up looking like.
So they were the ones who funded EveryBlock to the tune of about a million dollars a couple years ago, which is, I think, their biggest name.
But they fund five or ten projects, most of which are smaller scale than, say, a Document Cloud or an EveryBlock.
And then the idea is that you end up producing pieces of technology that can help newsrooms transition to the Internet age.
And so to that end, everything that you do has to be open source code.
That is in the contract, I think.
Everything that the grant money is spent on
is supposed to be towards the creation of these
open source news projects.
That's wild. So EveryBlock, too.
I didn't realize that they supported
that as well. That's off topic, but
EveryBlock is an awesome
project. Yeah, that's why EveryBlock
did that big code dump
at the end of the project before they sold themselves
was because that was the contract.
I'm behind the news.
I didn't hear that they sold themselves.
Yeah, they were bought by MSNBC for an unknown amount.
So that was a nice exit for the team after the grant funding
because this is a two-year grant,
and at the end of the two years,
we're going to have to figure out how to continue the project.
So we don't have specific plans yet, but EveryBlock's method was to get
bought by MSNBC, who's going to continue it. Wow. So what's your team size like? You said,
is it just you, or do you have more people in your team now?
We recently hired our second full-time person who's working on the administrative side and dealing
with all these news organizations who have signed up. On the documentcloud.org website, there's a list of partner orgs,
but it's many of the major news organizations in the country,
along with magazines like The New Yorker and The Atlantic Monthly and things of that nature.
And I guess the overseas stuff is starting to expand a little bit more as well.
There's been some interest in the UK.
So she's our second full-time person. There's the
three founders, Eric and Scott at ProPublica, and Aaron, who's the editor of the interactive
news section at the New York Times, were the ones who got the grant in the first place. So
they don't have too much time to devote to the project from day to day because they've still got
their day jobs. But they are the, I guess, they're sort of the board.
Can you tell us about how this project got started?
I guess I wasn't too much involved in the conceiving of the project stage.
I got hired after the grant was a sure thing.
So it's sort of been in the works for a long time.
I think the three of them originally had the idea to make these primary source documents that are sort of,
you know, passing through the filing cabinets of the New York Times, for example, to make
them public and to make them accessible online and wanted to start a project to make that
happen.
So a big part of this is, I don't know if you guys have seen the document viewer that
the New York Times does for a lot of their sources.
For example,
when they had a big Guantanamo project, they released a couple thousand. They started out
as PDFs, but they became these sort of JavaScript, HTML web plugins on the Times' website where you
could search through the court transcripts and the prison records of these inmates and keep track of
what exactly was going on on a detainee-to-detainee basis.
So that particular piece of software, the Document Viewer, that they're using to embed
the stories on the web without having to just download PDFs, the Times is donating to this project.
So part of what I've been working on has been integrating that with the Document Cloud prototype.
And there's a new version of it that should be coming out shortly
that you'll be able to find on the NewYorkTimes.com in a couple weeks.
That is pretty cool.
It's got a Google Books-like infinite scroll kind of a setup for these documents,
and it's pretty nicely designed.
So that's in the works right now.
Do you see Document Cloud primarily being involved in the government space?
It's the primary source document space.
So it's all of these people, all these organizations whose mission is to uncover primary source
documents.
So whether that means it's government records or it's internal corporation memos or emails
or anything, I guess, that becomes a primary document of record, I think we're interested
in.
And then beyond that, we might end up opening it up to more things like watchdog groups who are gathering these things.
And, yeah.
So you mentioned these three projects, CloudCrowd, Underscore.js, and Jammit.
Are all three of these your creation, or explain a little bit how each came about?
Yep, they're three direct extractions from the Document Cloud prototype that
I've been working on over the course of the fall. So one of our first problems was that importing
PDFs into Document Cloud is a pretty slow, sort of painful process because you've got to split
apart a PDF into a number of pages, and you've got to convert each page into both its full text
and its images in different sizes to display it inside of the document viewer.
And then you've got to...
And part of this Document Cloud
is that we're actually using the OpenCalais web service
to do semantic indexing of the documents.
So we end up knowing what people and what places
and what organizations and what terms are mentioned
within a document.
You can search across that kind of stuff.
So we have to go to OpenCalais and get that information back.
And all of this is a very time-consuming, expensive process.
So CloudCrowd, which is our parallel processing framework,
is sort of a generic, you have a job you need to get done in Ruby,
and you can maybe parallelize it to a certain extent.
And you'd like to do it in as parallel a fashion as possible.
So the CloudCrowd primitives are: you write a Ruby script, a class that has at least
a process method, and the process method is the parallel part of the computation.
And it's all sort of web-based. So there's a REST API. It comes as a gem, and when you install
the gem, you get servers and nodes. The server is the central thing that manages all of the work, and the nodes are the actual machines that
are performing the work. And when you install your action, all you have to do is say, okay, if I'm on a
machine that's doing the work, what is the parallel part of the work that I'm going to do? And then you
send it a URL to a file, in our case a PDF, although it could be a JSON document or some other kind of
XML document or some other kind of information. And then you can do the processing on those
documents in parallel. So in our case, we're doing the PDFs in parallel. And then the MapReduce
plays in, in that if you define more than just a process action, if you define a split
and a merge, the split at the beginning will take a single input and divide it up into many to all
be run in parallel across that process method. And then the merge will take back the results of
what came out of all of your process calls and merge it back into a single result for convenient
consumption back at the other end when you get pinged back when your job finishes.
So in our case, that means you take a PDF, you split it up into chunks of pages using CloudCrowd. Each, you know, five or 10 page chunk gets processed in parallel to get the images out,
to get the text out, to get the entities out through OpenCalais. And then at the end, it all gets
merged back together into a single archive that we can import back into the prototype.
So using this, we can, you know, install this gem on many EC2 machines if we need to,
and spin up CloudCrowd nodes very
easily and start distributing the work out. So this can happen in a reasonably fast fashion.
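The split, process, merge shape described here can be sketched in a few lines. This is a minimal illustration in Python, not CloudCrowd itself (which is Ruby and distributes work over HTTP): a thread pool in one process stands in for the worker nodes, and the `WordCount` action, `run_job` function, and chunk size are all hypothetical names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class WordCount:
    """Hypothetical action: counts the words in a list of page texts."""

    def split(self, pages, chunk_size=5):
        # Divide a single input into many independent work units.
        return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]

    def process(self, chunk):
        # The parallel part: runs once per work unit.
        return sum(len(page.split()) for page in chunk)

    def merge(self, results):
        # Combine the per-chunk results back into a single result.
        return sum(results)


def run_job(action, job_input, workers=4):
    """Stand-in for the central server: split, fan out, then merge."""
    chunks = action.split(job_input)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(action.process, chunks))
    return action.merge(results)


pages = ["one two three"] * 12
total_words = run_job(WordCount(), pages)
```

An action that only defines `process` would skip the first and last steps; adding `split` and `merge` is what turns a plain background job into the MapReduce-style flow Jeremy describes.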
Is it EC2 and S3 only or does it work with any sort of cloud platform?
It works with any sort of... So there's actually no dependency on EC2. It's only on HTTP and REST.
So you could install it on whatever kind of box you'd like.
It's nice on EC2 because you can spin up and down these nodes on the fly very easily.
There is an S3 file system backend built in because that's what we've been using.
When it transfers files between different machines, this has always been a problem in Hadoop.
In Hadoop, you have to install this Hadoop FS where there's a common shared file system that all of the nodes can write to. So the Cloud Crowd default file system backend is to use
S3 as that sort of common shared file space. So when you're done, when they're done with an
intermediate work unit, it'll save that work unit to S3. And then in the merge step later on, it can
pull that from S3 and continue the processing without having to worry about transferring,
about which particular node has which copy of which file. But there's also a file system
backend. So if you're just doing it on one box, if you only have one machine that you're
doing work on, you can use the file system backend and it'll be faster. Or if you've got
something like GFS or GlusterFS set up where you have a shared mounted networked file system,
you can use that also for a faster than S3 performance.
This is not technical really at all, but I'm curious to the kind of comments you get about the ASCII art in your readme.
The cloud crowd?
I don't know.
Not too many comments on the ASCII art.
People have been more taken with the diagrams that are in the wiki than the ASCII art.
I missed that part.
Adam's an ASCII art fan and was convinced by looking at the README for Underscore.js that you had ripped off his signature ASCII art.
Did I say ripped off?
No, it's kind of funny though because your underscore.js ASCII art is – if you go and look at – I guess – I don't know if you have any sites out there now that actually do it, but at the top of every web document, we put this ASCII art that says handcrafted,
and I think it was the exact same ASCII art font, I guess if that's what you call it.
I'm just getting it from this.
There's this generator page that does it for you where you can just type it.
Yeah, I use the generator page.
It's probably the same one.
Probably is.
Jeremy, I had not noticed the wikis on these projects
because normally I use the GitHub wikis.
These are beautiful.
So the art, explain a little bit about
where the diagrams come from.
I guess, so only one of them has a wiki.
So CloudCrowd has a wiki and Jammit
and Underscore have pages.
And I'm still trying to figure out
how to document these projects correctly.
I think I might stick to the plain HTML.
But in any case, so the art on the Cloud Crowd is what you're asking about?
Sure, yeah, the example PDF processing artwork.
Yeah, so Cloud Crowd really needs some hand-drawn diagrams
and they're usually a lot nicer than if you spit out a UML or something
because you can actually sort of illustrate what's going on.
And I think that CloudCrowd really needs some explanation
because you're talking about a complicated system
where you have multiple machines.
I think, you know, at minimum,
you're kind of talking about three different logical machines.
You have your application that is making the request.
You have your central CloudCrowd server,
and then you have the server where the work's being done.
So it gets a bit involved,
and so it's nice to be able to draw it out, sketch
it out on paper and to show... These are your original
drawings? Yep.
Awesome. What are you doing to
do the workers, the background jobs and
stuff? What are we...
What's the question? What are you using to do
the worker part of it, the cloud nodes,
the physical machines with teams of...
So it's all just Ruby.
So the idea is that you install this,
for Cloud Crowd, you install this gem
and it comes with sort of baked-in Sinatra servers
that are able to listen for work requests
and then start doing it.
So what you do is you install your action,
which is just a Ruby class.
It's just a script that knows how to do a process method.
And then the node will
receive requests to do work and it'll run that action if that action is specified. So in our
case, we have an action called process PDFs, but your action might be called encode video and you
would have your Ruby script that knows how to do the video encoding and then save that back to S3.
So if you look inside the wiki, there's a page called the job API that details all the sorts of built-in methods
when you create an action, the kinds of, or I'm sorry, not the job API page, but the writing
an action page that details all of the built-in methods that you have. So you have
little, you know, it's a really sort of minimal conveniences. You have ways to get the input, and if the input looks like a URL,
then it'll pre-download it for you so that by the time your action starts, it'll be ready to go on
the local file system, and you can start manipulating it. You can start encoding your
video. You can start resizing your JPEG. You can start processing your PDF. You can pass an
arbitrary JSON hash of options to any action. I thought that was a convenient way to be able to
configure,
to make actions a little bit customizable.
So you can imagine if you had an image resizing action
that you wrote using, say, GraphicsMagick or ImageMagick
to do it efficiently,
you could have in your options hash,
you could have the sizes and the image types
that you wanted to get back out.
And then the other important method that you get
when you're writing a custom action is save,
where you call save and you pass it a path on the local file system to your finished video or image or PDF.
And it'll save that back to the asset store, which is usually S3, but could be the file system like we already discussed.
And then it gives back a URL, which can be used to access it, which then gets sent back to your app.
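As a rough illustration of the conveniences described here (a local input file, an options hash, and a save call that returns a URL), the following Python sketch uses a local directory as the asset store. The class names `FilesystemStore` and `UppercaseText`, the `suffix` option, and the file names are all assumptions made for the example; they are not CloudCrowd's actual classes or method signatures.

```python
import shutil
import tempfile
from pathlib import Path


class FilesystemStore:
    """Hypothetical asset store that copies results into a shared directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, local_path):
        # Copy the finished file into the store and hand back a URL for it.
        destination = self.root / Path(local_path).name
        shutil.copy(local_path, destination)
        return destination.resolve().as_uri()


class UppercaseText:
    """Hypothetical action: the input is a local file, options tweak behavior."""

    def __init__(self, store, options=None):
        self.store = store
        self.options = options or {}

    def process(self, input_path):
        text = Path(input_path).read_text()
        # The options hash makes the action customizable per job.
        suffix = self.options.get("suffix", ".out")
        output = Path(input_path).with_suffix(suffix)
        output.write_text(text.upper())
        return self.store.save(output)


work_dir = Path(tempfile.mkdtemp())
source = work_dir / "memo.txt"
source.write_text("primary source")
store = FilesystemStore(work_dir / "store")
url = UppercaseText(store, {"suffix": ".upper"}).process(source)
```

Swapping the store for one that uploads to S3 would leave the action untouched, which is the appeal of keeping `save` as the only place that knows where results live.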
So is CloudCrowd in the same space as other projects like Delayed Job or Resque?
Resque actually, I think, overlaps it to a good extent,
which is interesting because I didn't know anything about it when we released it,
and CloudCrowd was out for about a month before Resque showed up.
And I'm not sure if I would have just used Resque if it had been out before we had started
working with CloudCrowd. The main difference is that,
well, so Delayed Job and Background Job are both
simpler alternatives where you're basically just starting up daemons, but there's not this whole
distributed sort of queue thing set up. Resque and CloudCrowd both have
central queues from which
work is parceled out to a whole bunch of workers. And I think the main difference is that
with Resque, you have an atomic sort of job, and it's more like Background Job where you're saying,
do this thing. And with CloudCrowd, you actually have this kind of built-in MapReduce primitive
where you can have a split and a process and a merge, and it'll sort of automatically parallelize
the processing to a certain extent.
But that's certainly something that we could contribute maybe to Resque.
That's why I was asking you about what you were using in that part,
like background job or why you went the route of, I guess, writing it all yourself, right?
Yeah. You mean instead of using background job?
Well, yeah, instead of using something that was out there already, you know, to do queuing or background jobs or just job handling in general,
why you chose to go the route of writing it yourself versus using something that's out there already
and ready to use?
Well, I mean, it kind of had a funny genesis in terms of how it got
started because there was sort of an internal project at the Times that was taking the beginning
steps towards having a distributed image processing
system because they need to do a lot of image resizing. And this was sort of the generalization
of that. So I didn't actually start it. I kind of inherited it and then fleshed it out.
But Background Job, I don't think, really fits the same niche that Resque or CloudCrowd do.
And I think that Resque and CloudCrowd do overlap to a large extent. And if Resque had been out,
then I might've just used that instead of trying to do this thing.
Well, it's good to have choices, right?
Yep, it is.
Talk to us a little bit about how Underscore.js came about.
Sure.
So underscore is another extraction.
The idea, I guess, behind it is that it's sort of all the things that,
you know, jQuery gives you a great, it sort of levels the playing field.
You know, you're stepping into a naked browser page, and if you have jQuery, there's a whole lot of things you can do.
And Underscore is kind of finishing off, I think, you know, sort of where jQuery starts.
Like if you, at least in terms of my personal use, if you hand me jQuery, you can start being productive immediately because that's about all
that you need to have a solid JavaScript foundation. And it looks like other people
sort of feel the same way because there's been a decent amount of interest in getting underscore
available in CommonJS and Node and Rhino and all these sort of backend server-side JS
systems as kind of a standard, I guess, foundation for doing all of the functional
array and object and collection manipulation that you need to do so frequently.
I'm a big fan of it.
I implemented a new feature in the footer of my blog to pull in the reading list from
Read or Not using jQuery and Underscore to do a lot of the parsing of the JSON that comes back from the service.
And it just felt natural as a Rubyist to drop in and use these methods and use the templating that is built in.
I'm a big fan of this framework.
I think it's going to take off.
I hope so. I don't think we have enough. I'd like to think that they're going to take off and they're going to have some... At the end of the day, you've got to get back to work on Document Cloud proper
and making that prototype as solid as you can.
But it's nice to put it out there and to have it be picked up and run with.
Yeah, for sure.
Can you talk about Jammit, where that came about?
Can you give us the backstory?
So Jammit is another extraction.
So in the Document Cloud prototype,
I was thinking about how we were going to be packaging assets.
And it had been sort of a problem for me
with Rails projects in the past.
So the Document Cloud interface
is extremely JavaScript heavy.
It's basically a JavaScript application
and Rails is kind of a skinny backend.
And then the database is more significant
because you have to do all the searching of these documents.
But the Rails layer is actually very skinny. And most of the rendering
of views and the, and the, you know, client side validation logic, you have to validate in the
server too, but do it on the client first. And there's actually a full sort of MVC stack in the
client. So we have models of, of users and of documents and of saved searches and of labels and of metadata.
All of these things are real first-class models in JavaScript in the client using underscore to sort of manipulate them.
And then we have this sort of rich, tabbed document searching journalist workspace UI in the client,
where, as a journalist, you can search through the documents and you can load up the viewer
and you can do saved searches
and you can organize them under labels
and you can visualize them.
It uses Canvas to do some neat little visualizations
of the connections between related documents
and the people that are mentioned in more than one document.
And so basically, at the end of the day,
you have a huge amount of JavaScript
because it's an entire application
getting sent down to the client.
And in the past, I had had some frustrations using the Rails asset packager to try to manage
a large number of small JavaScripts into reasonably efficient parcel downloads.
So we had had to customize that a little bit before my previous job. And I figured that I
would just extract that into Jammit. So Jammit tries to be a relatively comprehensive asset packager for Rails
that is easy to configure, so it uses directory globs
instead of having to specify every single JavaScript.
You can just have a specific views directory
full of all kinds of tiny 10 or 20 line views,
and then just say in your directory glob,
just say views slash star dot js,
and you'll get all of them included all the time.
So you don't have to worry about it.
That's what I'm curious about asset packagers, that you have to specify each individual one you want to.
Exactly.
And then in development, you're trying to make your app, and every time you change your JavaScript file or rename it,
you have to go restart your server and change assets.yaml.
It's a pain.
So yeah.
Or write a Rake task that does the packaging for you.
It's a pain.
I mean, you shouldn't have to.
So the idea here is that if you have an ordered, unique list of directory globs,
so they all get, so if you're talking about a specific package,
it's going to expand all of the globs in order.
It's going to take the unique set of files,
and in the end you can keep things ordered the way you want.
So you can say, first give me just jQuery,
then just give me underscore,
and then give me JavaScript slash star dot star.
Give me everything else after that.
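The "ordered, unique list of globs" idea can be sketched as follows. This is a Python illustration, not Jammit's implementation: the `expand_package` function and the file names are hypothetical, and it matches patterns against an in-memory list with `fnmatch` (whose `*` also matches `/`, unlike a shell glob) so the example stays self-contained.

```python
from fnmatch import fnmatchcase


def expand_package(patterns, available_files):
    """Expand glob patterns in order, keeping the first occurrence of each file."""
    ordered = []
    seen = set()
    for pattern in patterns:
        # Sort within a pattern so the result is deterministic.
        for path in sorted(available_files):
            if fnmatchcase(path, pattern) and path not in seen:
                seen.add(path)
                ordered.append(path)
    return ordered


# Made-up file listing for illustration.
files = [
    "javascripts/views/search.js",
    "javascripts/underscore.js",
    "javascripts/app.js",
    "javascripts/jquery.js",
]
package = expand_package(
    ["javascripts/jquery.js", "javascripts/underscore.js", "javascripts/*.*"],
    files,
)
```

Because a file is only added the first time it matches, the catch-all pattern at the end picks up everything else without duplicating the libraries that were pinned to the front.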
It's interesting that you have built-in support for JavaScript templates,
and you list a number of options here,
from John Resig's micro-templating to Underscore's built-in templating
that we mentioned earlier, Prototype's support,
and also Mustache.js from defunkt.
Any preference or views on the four of those?
I think there's really good cases to be made for different ones.
As with most things JavaScript, there isn't really a standard,
and there's lots of different competing ways to do it.
I wanted to support JavaScript templates out of the box
because that's one thing that if you're using JavaScript templates seriously in your web applications, you need to have good asset packaging support for them.
Because basically every time you load the page, you've got to rebuild all of your JavaScript templates and send them down as a single JST file, I guess.
So I wanted Jammit to be able to do it conveniently. But in terms of the actual template method,
I don't think that I am too qualified to know about all the different ones.
I know there's really a whole gazillion.
There's PureJS, and there's a whole bunch of different methods out there.
A lot of people like sort of inserting hidden DOM elements onto the page
and then using those actual DOM elements as templates.
The ones that I've been more familiar with are more like strings with interpolation like ERB,
which is what the micro-templating that we're using
and that Underscore uses is similar to.
It's a lot like ERB, but with JavaScript instead of Ruby in your tags.
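To make "strings with interpolation like ERB" concrete, here is a deliberately tiny Python sketch. It only substitutes named values into `<%= name %>` tags; the real micro-templating approach evaluates arbitrary JavaScript inside the tags, which this does not attempt. The `render` function and the sample template are made up for illustration.

```python
import re

# Matches tags of the form <%= name %>, capturing the name.
TAG = re.compile(r"<%=\s*(\w+)\s*%>")


def render(template, context):
    """Replace each <%= name %> tag with the matching value from context."""
    return TAG.sub(lambda match: str(context[match.group(1)]), template)


html = render(
    "<li><%= title %> (<%= pages %> pages)</li>",
    {"title": "Memo", "pages": 12},
)
```

The alternative mentioned above, cloning hidden DOM elements, keeps templates as real markup; the string approach keeps them as plain text that an asset packager can bundle like any other file.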
So where do these names come from?
Jammit, you've got these very unique names.
Are you part of that naming process,
or is that something that you inherited as well?
I'm part of it.
Where did the names come from?
Yeah, these are awesome names.
I mean, you look at other people in the space too, like ThoughtBot.
They've got some really unique names behind their open source projects.
I just wonder where Jammit comes from and kind of the thought process behind these cool names.
You try to find something that's evocative of what the actual app is, but not too clunky or acronym-y.
So I don't know.
Spend a couple hours with it kicking around in the back of your head until you find something
that sounds appropriate.
I'm not too sure about Cloud Crowd.
I keep tripping over it every time I try to say it too many times fast in a row.
It's kind of a tongue twister.
But yeah.
You know, one of the things that impressed Adam and me when we looked at Jammit and Underscore
was just the handcrafted nature of the documentation.
Are these your themes?
Did you rip this off from somewhere?
The themes?
Yeah.
Rip this off from somewhere.
Listen to you.
He's calling you a thief.
I called you a thief with your ASCII art.
You called me a thief first, right?
We're lit when it was a thief.
He comes on the show, he's stealing my ASCII art, and you're stealing...
Geez.
Well, I don't think the documentation is as much thievery as maybe some of the ideas.
None of this stuff is particularly new.
Underscore.js has a lot of ideas from Prototype
and a lot of partial
implementation sharing of what Prototype
and jQuery are doing in terms of their
collection manipulation. And of course, the idea of having a Rails
asset packager isn't new, and the idea of having
a Rails
or a Ruby distributed
job system isn't new either.
But the documentation, I think,
is new.
I didn't grab that from anywhere.
Yeah, the reason I say that is most developers,
if we write documentation,
they tend to not be that pretty to look at.
And both of these sites are informative,
minimalistic, and just look great.
And so if this is totally your design,
then kudos, because it really does a good job
of selling the project without having to dig into the source to see how things operate.
Yeah, well, I appreciate that.
I think that's a big part of why, you know, I think we have over a thousand watchers between all of these projects now on GitHub, between people watching Document Cloud and people watching Document Cloud-related projects, which is great. I think a lot of that has to do with having solid documentation out of the gate.
When people first see a project, what they're going to judge it by is what they start reading about it.
Either that's a blog post explaining it, or hopefully it's the official docs,
and the official docs are good enough for them to get their feet wet and to start messing with it.
Are there any, since you mentioned blog posts,
do you have any deep blog posts out there going deeper into some of the stuff that we're talking about?
No, I wish I did.
It would be nice to.
I think that it would be good to start putting some blog posts on documentcloud.org about design decisions, as to why things are the way they are in terms of Jammit and how it packages assets or CloudCrowd and how it distributes jobs.
But no, I haven't gotten around to any of that yet.
Good idea, though.
I mean, I think that one of the things we look at when deciding to adopt an open source project
is first, what does it do?
Why do I care?
Second is, where's the documentation?
How deep is it?
And how informative is it?
And three, does it actually solve the problem
I'm trying to solve, right?
So, some blog posts.
Yeah, amen to that.
If any listeners feel like writing something,
that would be much appreciated.
Do you have any scoop,
anything cool coming up that you just have to mention?
Actually, we do.
So the next little, it's a bit smaller, I think,
in scope than our previous ones,
but the next Document Cloud open source release,
I think, is going to be a project called PDF Pieces coming out in a day or two that makes it
easier to take a PDF and to pull it apart into all of its component pieces and then, you know,
things that you can then index and put on the web and make searchable. So you'll be able to do,
as a command line, you'll be able to do PDF pieces, pages or images or text and explode the PDF apart
into its UTF-8 full text or into PNGs or GIFs or JPEGs of each page or into single page PDFs,
if that's what you need. And it'll also pull out some of the metadata so you can find out,
you know, the title and the author and the producer and things like that of the PDF.
So what this is, is just going to be a Ruby gem
that wraps the excellent Apache PDFBox Java library.
And so under the covers, it's actually shelling out
to special little Java classes that are doing the actual work.
So it's pretty nice and efficient.
And you can pass it, say, a PDF and tell it to give you back
all the images for that document in 700 and 1,000
pixels wide, as well as both JPEG and PNG forms. And it'll do all that for you in a single JVM
loop. So you don't have to keep going back and forth between Ruby and Java doing it for every
page. So that's the next thing on our plates.
That's all very interesting.
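The single-JVM batching described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual tool: the jar name `extractor.jar`, the class name `ExtractImages`, the `--output` flag, and the function `buildExtractCommand` are all hypothetical. The sketch builds one command line covering every requested width and format, so the Java side would be started once per document rather than once per page or size.

```typescript
// Hypothetical sketch: none of these names or flags come from the real tool.
interface ImageRequest {
  pdfPath: string;
  widths: number[];   // e.g. [700, 1000] pixels wide
  formats: string[];  // e.g. ["jpg", "png"]
}

// Build ONE java command line that covers every width/format combination,
// so the JVM is launched a single time for the whole document.
function buildExtractCommand(req: ImageRequest): string[] {
  // "extractor.jar" and "ExtractImages" are placeholder names (assumptions).
  const args: string[] = ["java", "-cp", "extractor.jar", "ExtractImages"];
  for (const width of req.widths) {
    for (const format of req.formats) {
      // "--output" is an invented flag used purely for illustration.
      args.push("--output", width + ":" + format);
    }
  }
  args.push(req.pdfPath);
  return args;
}

const command = buildExtractCommand({
  pdfPath: "report.pdf",
  widths: [700, 1000],
  formats: ["jpg", "png"],
});
console.log(command.join(" "));
```

The alternative, one process launch per page and per size, would pay the JVM startup cost many times over, which is the back-and-forth the batching is meant to avoid.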
We normally wrap each show by asking the guest, what's on your open source radar?
So any projects out there other than the ones that are coming out of Document Cloud that excite you?
Yes, absolutely.
So I think the big thing that I'm excited about but that I can't quite see myself getting into yet, which is kind of like a tease, I guess, is all of the server-side
JavaScript stuff that's happening. Because I think we're at the point now with a lot of projects that
are more interesting technically on the client side than they are in the server. And you're
doing a lot of great visualization and computation, a lot of great interaction with real MVC stacks,
with real models in JavaScript. And it's really a source of duplication and pain to be duplicating all of these models. You write it once in Ruby to do the validations and to do the manipulation where you're asking a document what its metadata is and what people it talks about and that kind of thing. You're doing that both in Ruby on the server as well as on the client in JavaScript. And to be able to have one
language where you can share the models and just send down JSON data and you can have the same
operations and the same validations running both on the server and the client, I think would be
really, really, really useful. So I'm just kind of waiting for someone to write the complete
comprehensive Rails equivalent in one of these server JS platforms, whether it ends up being Node or Narwhal on Rhino
or something else on custom V8
that has a complete story of how do you do
your parallel processes,
how do you do your file interactions,
how do you talk to a database,
how do you interface with other C or Java libraries,
as the case may be.
And once someone has all that figured out
and we've got a good server-side platform,
I think that it'll become an instant no-brainer
to build large-scale web applications,
you know, in JavaScript end-to-end.
So I'm kind of waiting for that.
I can't justify it for Document Cloud as a project
because I don't think it's there yet,
but I think it's coming soon, maybe within a year or so.
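As a minimal sketch of the shared-model idea described above, assuming a single language on both sides: the validation is written once, and the model travels between server and client as plain JSON. The `DocumentRecord` shape, its field names, and the functions below are hypothetical and only illustrate the pattern; they are not taken from any Document Cloud code.

```typescript
// Hypothetical document model; the field names are illustrative only.
interface DocumentRecord {
  title: string;
  author: string;
  pageCount: number;
}

// One validation function, written once, that both the server code and
// the browser code could import, instead of a Ruby copy plus a JavaScript copy.
function validateDocument(doc: DocumentRecord): string[] {
  const errors: string[] = [];
  if (doc.title.trim() === "") errors.push("title is required");
  if (doc.author.trim() === "") errors.push("author is required");
  if (Math.floor(doc.pageCount) !== doc.pageCount || doc.pageCount < 1) {
    errors.push("pageCount must be a positive integer");
  }
  return errors;
}

// The model crosses the wire as plain JSON in either direction.
function serialize(doc: DocumentRecord): string {
  return JSON.stringify(doc);
}

function deserialize(json: string): DocumentRecord {
  return JSON.parse(json) as DocumentRecord;
}

// Simulate one side sending a record and the other side validating it.
const wire = serialize({ title: "Budget memo", author: "", pageCount: 12 });
console.log(validateDocument(deserialize(wire)));
```

Because the same `validateDocument` runs wherever the JSON lands, a rule changed in one place applies on both ends.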
If anybody wants to get a hold of you,
what's the best way to reach out to you?
Are you on Twitter? What's your handle? Email?
I'm actually not on Twitter. People like to message through GitHub, which works pretty well, or you can do jeremy at documentcloud.org.
And just to mention the thing that I said at the beginning: if you're a talented JavaScript or Ruby programmer and you are interested in working on projects that have a mandate to be open-sourced, then we'd love to hear from you.
So yeah, you can send me an email at jeremy at documentcloud.org.
And so they can also go to github.com forward slash documentcloud,
and they can hit you from there.
Is that your user, or do you have your own user?
Yep, I'm pretty much that one, too.
So yeah, that'll work also.
Awesome.
Well, it was awesome having you on the show, Jeremy.
Thank you very much for taking the time to chat with us.
You're probably awesome.
Thanks a lot.
It's a pleasure having you.
It's a pleasure being on.
I appreciate it.
Thank you for listening to this edition of The Changelog.
Be sure to tune in weekly for what's fresh and new in open source.
Also, visit thechangelog.com to follow along, subscribe to the feed, and more.
Thank you for listening.