Zeitgeist 1.0 – An Intelligent RSS News Aggregator

Posted in Java, Software Development by Dan on November 26th, 2009

I recently signed-up for GitHub. Compared to Java.net or Sourceforge, it provides a much lower barrier of entry for code hosting.  There’s no need to wait an indeterminate period of time for somebody to approve your project, you just upload it. And because it’s a DVCS, it’s easy for other people to fork your projects and submit patches. Open Source project hosting has become so straightforward, thanks to sites like GitHub and the Bazaar-based Launchpad, that it encourages developers to open up code that they might otherwise have kept to themselves. After all, why bother with local repositories and back-ups when you can get somebody else to do it for you and get free web-hosting and issue-tracking too?

I have a number of trivial and incomplete projects hosted in local Subversion repositories. I am slowly adding to GitHub those that have any worthwhile substance to them. I’m making no promises about the quality of this code, and I don’t intend to spend much time supporting it, but I’m putting it out there in case somebody might have a use for it.

First up is Zeitgeist. This is a small Java library/application for identifying common topics among a set of news articles downloaded from RSS feeds. It’s sort of like what Google News does. There is a basic HTML publisher included that generates a web page for displaying the current top news stories, including relevant pictures.

You give the program a list of RSS feeds that cover a certain topic (maybe world news, or music news, or a particular sport) and it uses non-negative matrix factorisation to detect similarities in the article contents and to group the articles by topic. The original idea comes from Programming Collective Intelligence.

The default HTML output looks a bit like the image below, but you could customise it with CSS or by hacking the default templates to modify what information is included (for example, you could add an excerpt instead of just displaying headlines).

The algorithm is not infallible and how well it works depends a lot on the feeds that you select. It’s also non-deterministic, so if you run it multiple times with the same input you will get variations in the output.  Perhaps Zeitgeist is not that useful in it’s current form but it could be used for adding on-topic news headlines to a website or as the basis for something more advanced.