thraxil.org:

more fun with RSS

by anders pearson Tue 15 Apr 2003 14:26:03

i am an information junkie. every day i read a lot of websites. there are dozens of news sites, weblogs, and comics that i read on a regular basis. i also like fresh information; i like to know about something as soon as possible. i also have a job that i'd like to keep and every once in a while i'd like to at least pretend that i have a life. so i don't really have time to visit each of these sites one or more times a day to check whether they've been updated with any new content.
luckily, there's this great standard called <a href="http://backend.userland.com/rss">RSS</a> which has really taken off lately, and has made my life much easier. RSS is simply a way of making your content available in a standardized machine-readable format. it becomes useful when you use software that can aggregate multiple RSS sources in a single location.
there are dozens of RSS aggregators out there. just do a <a href="http://www.google.com/search?q=rss+aggregator&sourceid=mozilla-search&start=0&start=0">google search</a> and see for yourself. most of them, such as <a href="http://ranchero.com/netnewswire/">NetNewsWire</a>, run as regular desktop applications. that model's no good for me since i do some of my surfing from home and some from work: if i've read an item while at work, it would still show up as new on a desktop aggregator at home. the kind that i really wanted was server-side, with a central database and a browser-based interface. there are plenty of those out there too, but none of the ones i looked at felt quite right to me; either they were too slow, they didn't have features i wanted, their interface didn't mesh with how i surf, or they had way too many features that i didn't need and that only got in the way. so, because i'm a perfectionist and a control freak, and i know Perl, i spent a couple hours writing my own.
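to make the central-database idea concrete, here is a minimal python sketch. it is not Corral's actual code (which is Perl); it only illustrates why keeping read state on the server, keyed by user, means an item marked as read from one browser stays read everywhere. the table names, function names, and the use of sqlite as a stand-in database are all assumptions made for illustration, and subscriptions are left out entirely.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two hypothetical tables: known items and per-user read marks."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            feed_url TEXT NOT NULL,
            link TEXT NOT NULL,
            title TEXT,
            UNIQUE (feed_url, link)
        );
        CREATE TABLE IF NOT EXISTS read_items (
            username TEXT NOT NULL,
            item_id INTEGER NOT NULL REFERENCES items(id),
            PRIMARY KEY (username, item_id)
        );
    """)
    return db


def store_items(db, feed_url, entries):
    """Insert (link, title) pairs; items seen on an earlier fetch are skipped."""
    db.executemany(
        "INSERT OR IGNORE INTO items (feed_url, link, title) VALUES (?, ?, ?)",
        [(feed_url, link, title) for link, title in entries],
    )
    db.commit()


def unread(db, username):
    """Return (id, title, link) for every item this user has not marked as read."""
    rows = db.execute(
        """SELECT i.id, i.title, i.link FROM items i
           WHERE NOT EXISTS (
               SELECT 1 FROM read_items r
               WHERE r.username = ? AND r.item_id = i.id)
           ORDER BY i.id""",
        (username,),
    )
    return rows.fetchall()


def mark_read(db, username, item_ids):
    """Record that the user has read the given items."""
    db.executemany(
        "INSERT OR IGNORE INTO read_items (username, item_id) VALUES (?, ?)",
        [(username, item_id) for item_id in item_ids],
    )
    db.commit()
```

calling store_items() again with overlapping entries (as an hourly fetch would) adds only the items not seen before, and unread() returns the same answer no matter which machine the request comes from.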
at the time i was also looking for any excuse to try my hand at writing an application as a full-fledged <a href="http://perl.apache.org/">mod_perl</a> handler rather than a collection of CGI or Apache::Registry scripts. the simple web-based aggregator i wrote, i call Corral. you just tell it to subscribe to a bunch of RSS feeds, and it fetches them hourly, picks out any new items from the feeds and presents them to you in an easy-to-read format. when you've read the items, you mark them as read and they won't show up again. once it's set up, it's very simple to use. i've been using it myself for the last month or so and now i don't know how i functioned without it.
anyway, i figured the least i could do would be to give other people access to it so they can experience the wonders of RSS first-hand as well. so, if you want to beta-test Corral, just do the following: go <a href="http://xnoybis.ccnmtl.columbia.edu/corral/">create a new account</a>, then log in to that account. at that point you shouldn't be subscribed to any feeds so you won't see much. follow the link for 'subscribe to existing feeds'. it will take you to a list of a bunch of feeds that i (or my alpha testers) have already added to the system. select (with control-click) the ones that you want to subscribe to and hit the 'subscribe to selected feeds' button. you will probably now see a bunch of items listed on your main page. once you've read them, just hit the 'mark all items as read' button at the bottom (or select individual items and mark just those ones as read). this main page functions sort of like the INBOX you are probably used to for your email: once you've marked items as read, they won't show up in your INBOX anymore. adding a new RSS feed is a two-step process: first add it via the 'add new feed' link, then make sure to subscribe to it. Corral also has the notion of categories for feeds. if you find them useful and can figure them out on your own, go ahead and use them; otherwise, just ignore them.
Corral runs on my workstation at work. it should be pretty fast but be warned that since it isn't a production server, it could occasionally be offline or just really slow (like if i'm compiling any big projects or playing UT or something).
once you start to learn to enjoy Corral or any other RSS aggregator, you will soon find yourself frustrated that some site you like to read doesn't provide an RSS feed (and hence, can't be aggregated). first of all, make <em>sure</em> it doesn't have an RSS feed; sometimes they're just hidden. if it's a weblog hosted on livejournal, it has a feed. all you have to do is append '/rss' to the end of the url. e.g., lani's journal <a href="http://www.livejournal.com/users/kpilo">http://www.livejournal.com/users/kpilo</a> would have its RSS feed at <a href="http://www.livejournal.com/users/kpilo/rss">http://www.livejournal.com/users/kpilo/rss</a>. if it isn't a livejournal site, there's still hope. look at the source for the web page, and look in the &lt;head&gt; for a tag like &lt;link rel="alternate" type="application/rss+xml" title="RSS" href="http://www.miromi.org/mt/blog/index.rdf" /&gt;. if it has one of those, you're in business.
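both of those checks are easy to automate. the following is a rough python sketch, using only the standard library, that pulls feed urls out of a page's link tags and builds the livejournal '/rss' url. the class and function names are made up for this example, it only looks for the application/rss+xml type mentioned above, and it does not bother to check that the tag actually sits inside the head.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# the only feed type this sketch recognizes (the one named in the text above)
RSS_TYPE = "application/rss+xml"


class FeedLinkFinder(HTMLParser):
    """Collect href values from <link rel="alternate" type="application/rss+xml">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing tags like <link ... /> through here
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("type", "").lower() == RSS_TYPE:
            if attr.get("href"):
                # resolve relative hrefs against the page's own url
                self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return a list of advertised RSS feed urls found in the given html text."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


def livejournal_feed(journal_url):
    """Build the feed url for a livejournal weblog by appending '/rss'."""
    return journal_url.rstrip("/") + "/rss"
```

to use it on a live site you would first download the page (for instance with urllib.request) and pass the decoded text to find_feeds() along with the page's url.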
if you still don't have any luck, you have about three options left: 1) harass the owner of the site and get them to add an RSS feed (in many weblog authoring tools or CMSs, this is a simple addition), 2) learn <a href="http://www.perl.org/">Perl</a> or <a href="http://www.python.org/">python</a> and write your own script to scrape the site and generate your own feed (if you do this in Perl, i highly recommend that you look into the <a href="http://search.cpan.org/dist/XML-RSS/lib/RSS.pm">XML::RSS</a>, <a href="http://search.cpan.org/author/GAAS/HTML-Parser-3.27/lib/HTML/TokeParser.pm">HTML::TokeParser</a>, and <a href="http://search.cpan.org/author/GAAS/libwww-perl-5.69/lib/LWP/Simple.pm">LWP::Simple</a> modules), or 3) use another little tool that i set up. <a href="http://xnoybis.ccnmtl.columbia.edu/fenris/">fenris</a> just watches a given url and creates a very simple RSS feed that has a new item any time the site changes at all. it's very stupid and can easily be fooled by things like ad banners, which occasionally make the page appear to be changed even when the content is really the same. but it's very fast and doesn't require any programming. using fenris should be pretty much self-explanatory by now.
so if you have too much time on your hands (or potentially too little) and have managed to make it this far, past all those paragraphs of technical talk, feel free to give Corral a spin. if you find it useful, or have any ideas for how it could be improved, let me know. source code is forthcoming once i get around to packaging it up.
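for the curious, the idea behind fenris can be sketched in a few lines of python. this is not fenris's actual code, just an illustration of the approach described above: fingerprint the page, compare against the last fingerprint seen, and emit one RSS item per detected change. all the names here are hypothetical, and the choice of sha256 and of RSS 2.0 output is an assumption.

```python
import hashlib
import xml.etree.ElementTree as ET
from email.utils import formatdate


def fingerprint(page_bytes):
    """Reduce a downloaded page to a short digest that changes when the page does."""
    return hashlib.sha256(page_bytes).hexdigest()


def record_change(history, page_bytes, timestamp):
    """Append (digest, timestamp) to history if the page differs from the
    last snapshot. Returns True when a change was recorded."""
    digest = fingerprint(page_bytes)
    if history and history[-1][0] == digest:
        return False
    history.append((digest, timestamp))
    return True


def build_feed(url, history):
    """Render the change history as a very simple RSS 2.0 document, newest first."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "changes to " + url
    ET.SubElement(channel, "link").text = url
    ET.SubElement(channel, "description").text = "one item per detected change"
    for digest, timestamp in reversed(history):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = "page changed"
        ET.SubElement(item, "link").text = url
        # digest plus time, so a page that flips back to an old state still gets a fresh guid
        guid = ET.SubElement(item, "guid", isPermaLink="false")
        guid.text = "%s@%d" % (digest[:16], int(timestamp))
        ET.SubElement(item, "pubDate").text = formatdate(timestamp, usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

because it hashes the raw bytes of the page, this sketch is fooled by rotating ad banners in exactly the way described above; doing better means stripping the volatile parts of the page before fingerprinting, which is where the real programming starts.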
TAGS: rss aggregator corral newsfeeds

comments

anders, you are once again, the man. That app is looking very slick. Beta testing it will save me so much time each day. When the project gets to a comfortable development stage, I'd love to see the code for it. very impressive. &lt;envy&gt;I wish I could develop something that useful for the world to use.&lt;/envy&gt;
the code for corral is <a href="http://xnoybis.ccnmtl.columbia.edu/corral.tar.gz">here</a> if you really want to check it out. not exactly designed for portability, but should be clean enough to figure out. if you actually want to try running it, you'll also need some <a href="http://xnoybis.ccnmtl.columbia.edu/ccnmtl_libs.tar.gz">libs</a>.
hmm, and thus web serfing approaches USENET. <br>there is the theory of the mobius, where nntp becomes a loup.
i've rewritten in python the part of Corral that goes out and actually downloads and parses all the feeds on a regular basis. i rewrote it in python mostly so i could make use of <a href="http://www.diveintomark.org/">Mark Pilgrim</a>'s <a href="http://diveintomark.org/projects/feed_parser/">ultra liberal parser</a>, which has some nice features like support for Pie/Atom/Echo/whatever, support for almost, but not quite valid feeds, and full <a href="http://diveintomark.org/archives/2003/07/21/atom_aggregator_behavior_http_level.html">support for HTTP response codes</a>. primarily though, i like the idea of letting someone else do the hard work of keeping that code bug-free and cutting edge. the only part of the porting that was at all painful was the little bit of database programming. python just doesn't have an interface to postgresql that is nearly as nice to use as perl's DBI module.
