The Changelog: Software Development, Open Source - PubSubHubBub and the Real-Time Web (Interview)
Episode Date: October 5, 2010. Wynn chatted with Julien Genestoux (GitHub/Twitter) from Superfeedr about PubSubHubbub, XMPP, WebSockets, and the real-time web.
Transcript
Welcome to the ChangeLog episode 0.3.7. I'm Adam Stachowiak.
And I'm Wynn Netherland. This is the ChangeLog. We cover what's fresh and new in the world of open source.
If you found us on iTunes, we're also on the web at thechangelog.com.
We're also up on GitHub.
Head to github.com forward slash explore.
You'll find some trending repos, some featured repos from our blog, as well as the audio podcasts.
If you're on Twitter, follow Change Log Show.
Not the Change Log.
And I'm Adam Stac.
And I'm Pengwinn, P-E-N-G-W-I-N-N.
Fun episode this week.
Talked to Julien Genestoux from Superfeedr.
Talked about the real-time web feeds and more, huh?
The real-time web.
You know, a lot of new technologies, it's changing landscapes.
So we talked about PubSubHubbub, which is a real-time web protocol, also XMPP,
and touched briefly on WebSockets and also like the Twitter streams.
Yeah, a lot of this stuff is really just going to the real-time web,
but it's pretty intense, WebSockets, Node, and everything else.
We'll drop a link in the show notes.
There's a really cool demo of Superfeedr in action,
pulling all the live check-ins from Gowalla,
and someone's hooked it up to WebSockets and Chrome,
and you can see a real-time Google map with all the check-ins from Gowalla.
It's really interesting.
Wow, that sounds fun.
Fun episode. Should we get to it?
Let's do it.
We're chatting today with Julien Genestoux from Superfeedr to talk about the real-time web and
PubSubHubbub. Julien, why don't you introduce yourself and let the folks know who you are and
why they should care. Sure. Hello, I am Julien Genestoux, as you said.
I am a French dude living in San Francisco, and I created a monster called Superfeedr,
which actually aims at making the web real-time.
Basically, that's what we do.
And we do that using the PubSubHubbub protocol, but also a few older ones like XMPP and its venerable
PubSub extension, as well as a few other techniques.
So as some background, what makes this technology important nowadays?
So there's different approaches.
The very technical approach is it saves bandwidth.
The current way of building services, when you want to interact with another service,
is to build something that polls these other services.
So basically every two seconds, you're going to go check the API,
check the RSS feed, check the content.
It works fine when you have a few endpoints,
but when you start having thousands, tens of thousands, hundreds of thousands of endpoints,
it's just a mess.
And you're wasting a lot of bandwidth of your own,
a lot of CPU time,
as well as wasting the bandwidth
of the third-party service that you're actually querying.
It's really like the kid in the backseat
who always asks,
are we there yet?
Are we there yet?
Are we there yet?
Every two seconds.
It's very annoying,
and it can be fixed by having an approach
where you can say,
hey, all right, listen, kiddo, I'm going to tell you when we get there, so do not ask anymore.
And that's really what PubSubHubbub and the real-time web are aiming at doing.
There's also another way of seeing things.
So the first wave of the web was really a read web.
So you would go on the web and you would read stuff, learn stuff.
So in a way, most of the first websites, the media sites, were actually just this.
Then we had the blogs, the first kind of sharing websites, like Flickr, stuff like this, which would be the write part.
And now we're entering a third phase where you're not only doing read and write, but you're also subscribing to content.
Saying, hey, all right, on Twitter, I am following people, which is like subscribing to people.
On Facebook, I'm subscribing to my friends' stream.
And you could really go, like, much beyond this in, like, seeing that search engines actually subscribe to sites to index their content and stuff like this.
So there's different ways of seeing the thing.
So you mentioned a couple of protocols, PubSubHubbub and XMPP.
What's the payload look like for this type of messaging?
So in both cases, well, XMPP is actually based on XML.
So there it's mostly XML, everything is XML.
Right now, PubSubHubbub is also aiming only at fixing the issue of Atom and RSS feeds. We're working with the team there on supporting other types of data,
like JSON, like other types.
It's not in the spec yet.
I hope it's going to come very soon,
because a lot of our users actually ask for it.
So I guess one of the real-world scenarios is you've got a feed that you want to check,
and you want to be notified when that feed updates.
So how does the protocol work?
So the first thing when you say "you" is to define who "you" is.
In the case of PubSubHubbub or XMPP, it's not an end user.
It's really like another service.
So the most common use case is like, hey, Google Reader needs to know when a feed is updated because it wants to show this to its users.
And the way to do this, if the feed, or actually the feed's publisher, uses the PubSubHubbub protocol, is to use a third party called the hub and tell it, hey, hub, please tell me whenever the content is updated. And the hub's job is basically to listen to the publisher
so that the publisher tells them, hey, the content has been updated,
and then fan out the update to all the subscribers,
to all the Google readers out there who said, hey, I want this content.
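The subscription step described here, a subscriber like Google Reader telling the hub "please notify me about this feed", is, per the PubSubHubbub 0.3 spec, a form-encoded HTTP POST to the hub. A minimal Python sketch of building that request body; the callback and topic URLs below are made-up examples, not real endpoints:

```python
from urllib.parse import urlencode

def build_subscribe_request(callback, topic, mode="subscribe"):
    """Build the form-encoded body a subscriber POSTs to the hub.

    Parameter names follow the PubSubHubbub 0.3 spec; the URLs
    passed in below are hypothetical examples.
    """
    params = {
        "hub.callback": callback,  # URL the hub pushes updates to
        "hub.mode": mode,          # "subscribe" or "unsubscribe"
        "hub.topic": topic,        # the feed URL being subscribed to
        "hub.verify": "async",     # how the hub verifies intent
    }
    return urlencode(params)

body = build_subscribe_request(
    "https://subscriber.example.com/callback",
    "https://blog.example.com/feed.atom",
)
print(body)
```

POSTing this body to the hub (as `application/x-www-form-urlencoded`) is essentially the whole handshake, apart from the hub calling the subscriber back to verify intent.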
Is there any notion of discovery around finding hubs for content?
Yes.
So the way the protocol works right now is basically the publisher who publishes the feed defines which hub will get its content in real time.
So in the RSS feed itself, you will have a link node, or, well, it's an atom:link node, sorry, with a rel="hub" attribute.
And then the href of this link is actually the URL of the hub.
So the discovery is done inside the feeds themselves, which is good because it means that basically
having a PubSubHubbub feed is no different from having a regular RSS or Atom feed, which means
that you can really build on top of these and you're not breaking past software and applications
that were not using PubSubHubbub.
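As a sketch of that discovery step: the hub URL can be pulled out of an Atom feed with nothing but the standard library. The feed below is a hypothetical example with made-up URLs:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A minimal Atom feed advertising its hub via <link rel="hub">,
# as PubSubHubbub discovery prescribes (both URLs are invented).
feed_xml = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <link rel="self" href="https://blog.example.com/feed.atom"/>
  <link rel="hub" href="https://hub.example.com/"/>
</feed>"""

def discover_hub(xml_text):
    """Return the href of the first rel="hub" link, or None."""
    root = ET.fromstring(xml_text)
    for link in root.findall(ATOM + "link"):
        if link.get("rel") == "hub":
            return link.get("href")
    return None

print(discover_hub(feed_xml))  # https://hub.example.com/
```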
So from the publisher's side, they just have to annotate their feed
with the special link that, I guess, publishes where the hub is
and then what is involved in setting up the hub?
And then the publisher, the first job of the publisher is to set up this discovery.
And it's also to ping the hub saying,
Hey,
all right,
this content has been updated.
This content has been updated.
The hub then,
which is this third party in charge of fanning out the subscriptions,
will get these pings.
If somebody subscribes to the content,
it will go fetch the feed.
So the notification from the publisher to the hub is actually light.
It means it just tells them there is something.
And then the hub can decide whether or not to go fetch the feed
based on whether there are subscriptions.
It will then diff the content to know what's new versus what's old
and then publish in a fat way.
So it's actually sending the content to all the subscribers.
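That light-ping-then-diff cycle can be sketched in a few lines of Python. The entry ids and callback URLs here are hypothetical, and a real hub delivers over HTTP rather than through a local function:

```python
def new_entries(seen_ids, fetched_entries):
    """Diff a freshly fetched feed against entry ids already seen;
    only the genuinely new entries get fanned out."""
    return [e for e in fetched_entries if e["id"] not in seen_ids]

def fan_out(entries, subscribers, deliver):
    """Push each new entry to every subscriber (the 'fat' publish:
    the content itself is sent, not just a ping)."""
    for callback in subscribers:
        for entry in entries:
            deliver(callback, entry)

# Hypothetical data standing in for one ping-fetch-diff cycle.
seen = {"urn:entry:1", "urn:entry:2"}
fetched = [{"id": "urn:entry:2"}, {"id": "urn:entry:3"}]

fresh = new_entries(seen, fetched)
sent = []
fan_out(fresh, ["https://a.example/cb", "https://b.example/cb"],
        lambda cb, e: sent.append((cb, e["id"])))
print(fresh)      # only urn:entry:3 is new
print(len(sent))  # delivered once per subscriber
```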
So does the publisher have to be involved?
Is there any such thing as a third-party hub where, let's say I wanted GitHub's public
timeline and XML, do they have to be involved?
Yes, you would need GitHub to actually designate their own hub and say, hey, all right, this
is where you can get our content in real time.
The way it works right now, Google...
So a lot of feeds are actually already PubSubHubbub-enabled,
so you might already be using this protocol without really knowing it.
There's three big hubs out there.
The biggest one, which is the first historical one,
is the Google Hub.
It was built by two engineers at Google,
and the goal was to make all the Google-owned feeds real-time.
So it involves FeedBurner feeds,
Google Reader shared feeds,
Google Buzz feeds,
Blogger feeds,
a lot of feeds like this.
And it's actually also an open hub,
so if you have your own service,
you can also designate them as the hub.
There is a second hub, which is basically the WordPress.com hub. So WordPress.com implemented
their own hub. So it's both a publisher and a hub in this case, and you cannot use it from the
outside world. So if you've got your own little WordPress.org blog, you cannot really use it.
And then the third solution is the solution provided by Superfeedr, which is basically, hey, like the Google Hub.
So it's a hub that is public to anyone.
You can designate it.
And it's actually branded to your publisher site.
So we host the hubs for people like Tumblr, Posterous, Gowalla, Six Apart, tons of others like BuzzFeed, and we're working on very interesting use cases with e-commerce websites and stuff like this.
So one of the more interesting demos that I saw
was a Gowalla feed
powered by Superfeedr that
was hooked up to a WebSockets
app in the browser where it showed
all of the check-ins in real time
in Google Chrome. What do
technologies like this mean for
a new era in web design?
Sure. So what you must not forget is that PubSubHubbub is a server-to-server protocol.
So it's really like, hey, from the blogger server to a hub and then to Google Reader.
So the end user doesn't really see it, which means that it doesn't come to the browser.
But then when you have something that comes to the server, you can really easily build
something that achieves the last mile.
And we call the last mile the thing which comes from, like, a browser,
sorry, a server to a browser, a server to an iPhone,
a server to an iPad, a server to, you name it, basically,
any type of devices that is connected to the web.
So what we built for Gowalla is a very, very simple example.
It's like, since we get all the
notifications for all their feeds, we have some
kind of a firehose, right? So we
also, Superfeedr has this thing called track
which enables you to
filter feeds on different
criteria, so filter notification
on different criteria.
So instead of saying, hey, I want the
Gowalla feed of
Wynn's updates, you can say I want any Gowalla update within two miles of Austin.
And you would get that pushed to you as if it was a real feed on the site.
And we also obviously have some kind of a firehose.
So you can say, hey, I want any Gowalla update and get them. What we built after this is basically this little Node server
that does WebSockets
and that turns PubSubHubbub notifications
into WebSocket notifications.
So when you connect the browser, you can just get an update,
subscribe to any feed,
and then whenever they arrive,
you can show that on your browser the way you want.
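The track-style filtering described above, "any Gowalla update within two miles of Austin" rather than one user's feed, boils down to applying predicates to each notification before pushing it on. A hedged sketch of that idea; the update records and coordinates are invented for illustration:

```python
def matches(update, predicates):
    """True if an update satisfies every filter predicate."""
    return all(pred(update) for pred in predicates)

# Hypothetical check-in notifications, not real Gowalla data.
updates = [
    {"user": "wynn", "lat": 30.27, "lng": -97.74},    # near Austin
    {"user": "julien", "lat": 37.77, "lng": -122.42}, # San Francisco
]

# "Within ~2 miles of Austin" approximated as a small lat/lng box.
near_austin = [
    lambda u: abs(u["lat"] - 30.27) < 0.03,
    lambda u: abs(u["lng"] - (-97.74)) < 0.03,
]

filtered = [u["user"] for u in updates if matches(u, near_austin)]
print(filtered)  # ['wynn']
```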
Very interesting.
So how similar is that set up to what Twitter is doing with a lot of their real-time streams?
So Twitter is basically doing all this in a single proprietary stack.
So basically you can subscribe to, I think, the streaming API with your own libraries.
It doesn't use WebSocket.
It doesn't use anything that is part of the open web.
Maybe OAuth is actually the only open thing
out of Twitter, I would say.
And it's sad because if you want to build something
that is a little bit more than just using
their own streaming API,
I don't know, building some kind of server-to-server process,
you cannot.
It's really hard to build something
that subscribes to thousands of users on Twitter.
They make it hard on purpose
because they obviously don't want to make this kind of data available
because they're selling it.
So I think it's like similar technologies
in terms of what you can do with it.
But on one end, you've got some kind of private proprietary stack.
And on the other end, you've got some very open stack,
which is like an open protocol defined by a community of people from a lot of different companies, whether it's Microsoft, Google, even Facebook is part of it now and people like this.
So what are the pros and cons between PubSubHubbub and XMPP?
So they solve two different types of issues.
And actually, PubSubHubbub came much later.
I mean, XMPP is, I think, like 11 or 12 years old now.
And PubSubHubbub is barely a year and a half.
So the idea when they built PubSubHubbub was like, hey, all right, we've got these awesome PubSub patterns on the web, like XMPP's.
One of the early designers of the protocol, Brad Fitzpatrick, actually built his own XMPP server. So he was really convinced of the interest of having this. At the same time, he
also found out that basically XMPP is just too different from your regular web technologies.
It's too different from the web stack, which meant that a lot of people were really scared about it
and had a lot of issues scaling services with this because they didn't know how it works, or even if they did, it was just too different from HTTP. So they built
basically the whole PubSub pattern on top of HTTP, because even though you could do it with XMPP,
people wouldn't, because it was just too complex.
What about reliability? If I'm not there to catch a feed when it updates, do I hear it?
So it really is up to the hub that you would use.
So actually, none of the public hubs at the moment,
whether it's the Superfeedr hub,
the PubSubHubbub Google App Engine hub, or the WordPress hub,
has some kind of, how can I say, storage of the entries
and is able to actually resend you the data
when you're back.
Because we deal with massive amounts of data.
Superfeedr currently pushes 30 million
Atom updates per day.
So if you're off for like just an hour,
we might already store 1 million Atom entries
just for you.
So it's not really easy to scale this.
But we're working on like storing the data
so that whenever you come back, when your endpoint is available again,
we'll push that to you in a way that hopefully won't take you down again.
So this is primarily a real-time update feature, but if you've got to catch everything that comes from a feed,
it's just a single tool, I guess, in the stack instead of it being your primary?
I'm not sure I understand the question.
So if you absolutely have to have all the data coming out of a
separate service, I guess
you could always just poll that feed independently,
right? Yes, of course.
You can still poll the feed from time to time
to make sure. But I mean, if you
really, really need to get the data all the time,
I would suggest making sure that
your service is not going to be offline anyway.
I mean, obviously
you can be offline for two seconds
and then it's a big deal
because you might miss something.
So you might want to poll.
But if you're offline at any time,
you will miss some data.
We deal with feeds sometimes
that are very high frequency updates.
So it means that some feeds
might have like an entry every minute
and they just have 10 entries.
So it means that after 10 minutes,
you might have lost some content,
even if you pulled it.
Make sense?
Sure.
So when people say, hey, what happens if I'm offline?
It's like, I'm sorry, but there is no perfect solution if you're offline.
You'd rather make sure that your service is never going to be offline.
There's a ton of techniques to actually do this.
I mean, one of them is to just process everything in an asynchronous way
so that you put every message in a queue, and then you deal with the queue.
So whenever your workers have issues, you can still store the data in the queue
and then process it whenever you're better.
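The queue-everything approach he describes can be sketched with the standard library's queue module: accept and enqueue notifications cheaply, and let workers drain the backlog once they recover. The names here are illustrative, not Superfeedr's actual code:

```python
import queue

# Incoming notifications go straight into a queue, so a slow or
# temporarily broken worker never drops data on the floor.
inbox = queue.Queue()

def receive(notification):
    """Accept a notification; enqueueing is cheap and never waits
    on downstream processing."""
    inbox.put(notification)

def drain(process):
    """Process the whole backlog, e.g. after a worker recovers.
    Returns how many notifications were handled."""
    handled = 0
    while not inbox.empty():
        process(inbox.get())
        handled += 1
    return handled

for n in ["update-1", "update-2", "update-3"]:
    receive(n)

processed = []
count = drain(processed.append)
print(count)  # 3
```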
So I'm looking at your GitHub repo and see a lot of Ruby out there.
What sort of languages do you speak?
So we built most of Superfeedr on top of XMPP,
which means that basically any component in our architecture
is a little XMPP worker who sends presence,
which is one of the three XMPP verbs, I would say,
to other workers saying, hey, I'm here.
Please send me some work.
So other workers will send some work.
So the whole bus is XMPP.
Then each of the workers is actually using different languages,
different techniques, I would say, based on what they do.
So our parsers, for example, are built with some C at the very core of it.
And then on top of that, a lot of Ruby to make the rules.
So Superfeedr actually does something
that is mapping the different RSS formats.
So if you're using RSS,
I mean, if you're subscribing to an RSS feed,
an Atom feed, and a FeedBurner feed,
you might see different items.
And rather than let you deal with
the complexity of these different formats,
we just normalize it to Atom.
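As a toy illustration of that normalization, here's a mapping of an RSS 2.0 item's field names onto rough Atom equivalents. Superfeedr's real rules are far more involved; this mapping and the sample item are only an approximation for illustration:

```python
# Approximate RSS 2.0 -> Atom field-name mapping; real-world
# normalization also has to handle dates, namespaces, enclosures, etc.
RSS_TO_ATOM = {
    "title": "title",
    "link": "link",
    "description": "summary",
    "pubDate": "published",
    "guid": "id",
}

def normalize(rss_item):
    """Rename the known RSS fields to Atom names, dropping the rest."""
    return {RSS_TO_ATOM[k]: v for k, v in rss_item.items()
            if k in RSS_TO_ATOM}

item = {"title": "Hello", "guid": "urn:uuid:1234", "description": "Hi"}
print(normalize(item))
```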
And these rules are actually written in Ruby
in our parsers. We also use
a lot, I mean, I think
pretty much only EventMachine,
because obviously we do a lot of networking stuff,
and waiting on sockets
would just not make any sense for us.
So we use EventMachine,
which is the reactor pattern. So everybody's
talking about Node at the moment.
EventMachine is pretty much like one of the grandparents of Node.
I would say Python's Twisted is the other grandparent of Node.
Yeah, we covered all three of those in the changelog recently,
and I'm fascinated by it.
It seems like a new pattern for developing web applications,
and I think Node might be benefiting just from the fact that
since there was no set of libraries out there when it started, everything could be built from the ground up to support async.
What sort of problems did you find using Ruby to do that with EventMachine and having libraries that would support it?
So we still have some issues with EventMachine based on the fact that some implementations are not there yet. Actually, one of the biggest issues is very interesting.
The DNS resolution inside EventMachine is still synchronous,
which means it's actually blocking the reactor.
So it's really bad.
So we actually created our own little resolver in an async way.
And we hope that at some point,
EventMachine will include some kind of async DNS resolution.
We also find issues where libraries for most of the,
I would say, very recent data stores are not either up-to-date or even present.
I'm thinking about Redis, for example.
We had to basically kind of hack a lot
on top of what was the first initial attempt
to make sure that it would still work
with newer versions of Redis.
I know that Cassandra doesn't have
an EventMachine implementation either.
Mongo has had issues in the past as well.
I mean, like the driver,
the asynchronous driver was not really complete
in terms of features
compared to the regular blocking driver.
You kind of walked into that subject.
So let's talk about NoSQL for a minute.
What's your favorite platform out there?
Redis.
We absolutely love Redis.
I mean, there's always a big debate about, hey, all right,
as long as you store everything in memory, it's easy.
So basically Redis is doing something easy.
And what I usually tell people is like,
yeah, they do something easy,
but they actually made the decision to do this.
And not a lot of data stores actually made that decision.
So we use Redis as much as we can.
We still have a few missing features from Redis.
The biggest one is obviously the cluster mode.
So I know that antirez, who is the maintainer of Redis,
is actually working on this.
And it should be live by the end of the year.
But we might actually have to use Mongo, and we already started evaluating this,
for specific things where we really need some kind of clusterable approach where adding a server would just double or increase the size of our store
rather than do some kind of sharding, which was really becoming and is still a big deal for us right now.
We're trying to get antirez on the show to talk about Redis.
Hopefully we'll put that together soon.
But, you know, it's amazing how many of these new NoSQL stores
support JavaScript out of the box.
Well, I mean, what do you mean?
Like in terms of like they use JSON for the data structures and stuff like this?
JSON for the data structures and then a lot of the APIs with Couch and Mongo, you know, are written in JavaScript.
Yeah, well, so Redis is different in that regard.
I think like Redis doesn't have any like native JavaScript thing.
I mean, what I find interesting about Redis as well is that they took a very, very low-level approach.
Just installing Redis is very simple.
You just have to download the code and just run make, and that's it.
You don't need any third-party libraries or things like this.
So the approach that they had as well was like, hey, all right, we're going to build this very high-performance thing.
So we need to control the whole chain. So we really need to make sure that we're not reusing any complex libraries
that would make Redis much slower or much bigger or much harder to maintain.
It looks like the web development landscape has changed quite a bit
in the last few years.
It used to just be you would have a front-end architecture,
a back-end architecture, but now it seems like you have to have
a NoSQL solution for a lot of these things and a queuing solution.
What of these queue systems have you played with,
I guess like Resque on top of Redis and some others?
So we use RabbitMQ.
I haven't really played with Resque.
I'm not sure how you pronounce it.
We should definitely give it a look.
We use RabbitMQ, and we do not use a lot of queue systems.
I mean, XMPP actually has a lot of features that could be implemented via a queue system,
so we don't really use that a lot.
It's funny, you mentioned the pronunciation of Resque there with, I guess, your French,
so "resque" would be the French pronunciation.
But that's an important part of creating
an open source project, is
coming up with a name that kind of brands the thing.
So, Superfeedr, where did that come from?
Basically, it's like
let's make feeds
better, so it's kind of like
if they're better, they're like super,
right? So they're like super feeds.
And the machine that makes these
feeds super is actually a super feeder in a way.
So that's the way we built it.
It's fun because it was actually initially a component of another, much bigger application.
And when we started implementing this component, which was supposed to be just a small bit of the whole system,
we found out that it was actually kind of an endless, I mean, hole, where we would dig something and then find something else and
dig further and dig further and dig further, so much that at some point we said, hey,
all right, why don't we just do this and make sure that we do it fine, and then maybe in like 10 years
or 15 years we'll find something to build on top of. So let's talk about Superfeedr for a moment
and your monetization strategy.
So do you charge publishers or subscribers or both?
Neither and both.
So the PubSubHubbub pattern is really an open web pattern.
So you should not charge anyone.
And we do not charge.
So we make the content from Tumblr real time, and you can get that in real time for free.
And that's implementing the PubSubHubbub protocol.
So in a way, you don't even need to know that it's actually using Superfeedr.
However, there's still a massive proportion of feeds out there.
I would say something like 70% or 80% of them are not PubSubHubbub-enabled.
And for this, you would need some kind of third-party application to do the polling for you,
if you don't want to do the polling, and push it to you as if it were PubSubHubbub. So I'm not sure that makes any sense, but
the idea is like, hey, all right, you've got 100 feeds. Out of those, 20 of them are actually
PubSubHubbub-enabled, so you can subscribe to the designated hub and get the content pushed to you. Then
you have the 80 more feeds, so how do you deal with them? Some people, and that's actually what
they've been doing for years, build some kind of pollers.
So like, all right, fine, we're going to build something that polls the feed.
And whenever there's a new protocol, a new way of getting the content, or whenever there's a new flavor of RSS or Atom, we just implement the extra layer to make sure that our 80 remaining feeds are being dealt with correctly. The other approach, and that's what we're trying to convince people of,
is like, all right, you've seen how easy it is to deal with PubSubHubbub feeds
with these 20 feeds that you're dealing with.
Why not have some kind of third party push that to you
as if they were all PubSubHubbub?
So that's basically what Superfeedr does.
So we just implement this polling or
all these techniques
to avoid polling. We
implement the data normalization on top of it
and then we push it to you as if
it was PubSubHubbub. Obviously
there is some kind of cost involved with this.
So we are actually going to charge
for the content when we push it to you. But it's really
cheap. Like you can
get
a couple million notifications
for less than $100 a month.
So one of your repos out on GitHub is popular feeds,
and it's a text file with over 4,200 feeds in this thing.
Does this power any sort of process at Superfeedr,
or is this just out there for informational purposes?
No, it's only for informational purposes.
So we actually had a lot of our users say,
hey, all right, we want kind of a fire hose
of the blogosphere.
Like we want the top 50, the top 100,
the top 10,000, the top 100,000 feeds pushed to us.
And it was really hard to tell them,
like here is the top 1,000 feeds.
We have no idea because it's not our job.
Our job is to distribute the content.
So actually last weekend, we
worked on kind of identifying a list of popular feeds
based on, like, TechMeme,
and, sorry, a few other services out there
that actually list feeds and OPML files,
Technorati, AllTop, and a few other services,
to identify which ones were popular.
Then we kind of, like, I'm going to say,
matched that up with the Superfeedr data.
Because since we have a lot of feeds,
we nearly have 3 million now,
we know which ones are actually subscribed to
by more than one user, right?
So we kind of matched all the data together
to identify kind of a short list
of what we think are the most popular feeds out there.
Wow, 3 million feeds.
Yeah, I just noticed the counter on the homepage.
3.1 million feeds out there.
Those are entries, I guess.
So these are actually entries, and that's billion.
Yes, I just noticed billion. Wow.
We currently push like about 30 million a day.
Unbelievable.
Yeah, it's a lot of data.
And the interesting thing about Superfeedr is,
like, most services actually deal with scalability
in terms of, hey, how can I reply
to as many requests per second as possible?
And we actually do the exact opposite.
How can we push as much content per second as we can?
How can we push more data
rather than how can we deal with incoming requests?
Gotcha.
Well, this is the part of the show
where we kind of turn it around
and ask our guests what's on your open source radar, what out there in open source land has you excited that you just want to play with?
So definitely Redis is one of my loved, how can I say, I mean, most loved projects right now.
I mean, I'm trying to build some stuff with them.
They have a PubSub feature.
So it means that basically you can build with Redis a way to
subscribe to items, and whenever something
is published there, you get a notification.
Node.js is also something that I've
been playing with a lot in the
past few weeks.
And this is really because we're moving
hosts right now. I love Chef as well.
It's something that not
everybody might know, but it's
a solution that helps
you deploy and manage the configuration of your servers
so when you have one or two
it's not that big of a deal
but when you start having tens, thirty,
I mean, like 10, 30, 50, or 80 like Superfeedr,
it's really starting to become a mess
to deal with the different configurations
the different versions
the different roles
of each of the different servers that you might
use and stuff like this.
I would definitely recommend anyone with more
than maybe three servers look into
Chef because it's really great.
Yep, pretty much
these three are our
current love projects. Chef from Opscode,
it is an awesome piece of software.
We should do a DevOps show
on the Changelog pretty soon.
I think it would be interesting, yeah.
And it's one of the things
where basically I had really no knowledge
before I started Superfeedr.
And it felt like,
oh my, how am I going to,
I mean, I'm going to spend
like two-thirds of my days
dealing with patches
or configuration that I need to update
or anything like this.
And having Chef has been like, all right, I can just put the recipes,
which is really like the way it works,
like put the recipes of what a server is, and it just builds it.
And whenever I need to update, I just change one thing there,
and it builds and updates all the servers.
It's saving like months of work.
You know, and also moving to the cloud has just made this type of skill set
that more valuable
because you need
reproducible processes
that you can set
these servers up,
you know?
Definitely.
Well, there's still
a few differences.
Like, I mean,
we are moving right now
from a host
to another host.
So, I mean,
moving from Slicehost
to Linode
and it's,
we still have, like,
some issues.
We actually had to update
a few of our Chef recipes
maybe because we did it wrong the first time.
But, I mean, hopefully and ideally at some point
we will be able to have this very, very generic way
of describing the servers and describing the IPs
and stuff like this,
so that you just plug in a new IP and a root password
and it just deploys whatever on any cloud service
and you can actually do benchmarks.
And that's one of the things that I want to really work on
in the coming weeks as well,
is to do some kind of, like, Chef recipes
to deploy a very basic script
in an identical way over different providers,
whether it's Rackspace, Slicehost, Linode, EC2,
I mean, you name it.
And then just get all the results back
and make sure that we always use
the most performant machine per dollar spent.
Very interesting. Well, thanks for joining us today, Julien. We certainly appreciate it.
Thanks for having me. It was great.