Digg Top Stories Today as RSS

From Stack Overflow

A few months ago, Digg released a new version of their site. The release added a bunch of new stuff but killed off the one RSS feed I used the most. They still have RSS feeds of all sorts of things (the front page, your friends, "live" feeds of search terms), but the top stories feed is gone.

The problem with their main feeds (front page or individual container) is that they are giant and very active. If I leave my computer sitting over the weekend (or even all day), it picks up new stories from the feed every hour and caches them until I return. I then end up with 200+ stories to sort through. Most of those stories are crap, as "front page" != "quality story." I would often just ignore the whole batch, marking them all as read, which makes that particular feed useless to me.

The top stories feed was great because it changed only a little over the day as various stories fought for top billing. Returning to a computer that had been idle for a day would bring up 20-30 stories: a good, manageable amount, considering I knew most of them would be fairly high quality.

A few weeks ago, I put in a feature request asking for the top stories feed back. After all, the page is still there, but its RSS link points to the main RSS feed, so it shouldn't be too difficult, right? Today, I got fed up with waiting for them to fix it and wrote my own script to give me that page as an RSS feed. It's a simple shell script (with embedded XSL) that requires curl, tidy, and xsltproc. A wrapper shell script (basically "diggtoday.sh > blah.xml") can be dropped in /etc/cron.hourly to make things easier.

#!/bin/sh
# diggtoday.sh / brian@netninja.com
# Takes digg.com top stories of the day and converts it to an RSS feed.
# Digg used to have this as a feature, but dropped it from their 2.0 release.
LINK=http://digg.com/view/all/popular/24hours

#check dependencies
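# command -v is the POSIX way to locate a command; it prints nothing if the command is missing
XSLTPROC=`command -v xsltproc`
CURL=`command -v curl`
TIDY=`command -v tidy`
# Errors go to stderr so they never end up in the generated feed
if [ -z "$XSLTPROC" ]; then
	echo "xsltproc does not appear to be installed" >&2
	exit 1
fi
if [ -z "$CURL" ]; then
	echo "curl does not appear to be installed" >&2
	exit 1
fi
if [ -z "$TIDY" ]; then
	echo "tidy does not appear to be installed" >&2
	exit 1
fi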

# Output xslt
# The quoted delimiter keeps the shell from expanding anything inside the stylesheet
cat > /tmp/diggtoday.$$.xsl <<'_HERE_'
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns="http://blogs.law.harvard.edu/tech/rss">

<xsl:template match="/">
  <rss version="2.0">
  <channel>
  <title>Digg Top Stories Today</title>
  <link>http://digg.com/view/all/popular/today</link>
  <description>Top stories today, parsed from digg.com</description>
  <language>en-us</language>
  <generator>diggtoday.sh</generator>
  <ttl>60</ttl>
  <xsl:apply-templates/>
  </channel>
  </rss>
</xsl:template>

<xsl:template match="xhtml:div[@class='news-body']">
	<item>
	<title><xsl:value-of select="normalize-space(xhtml:h3/xhtml:a)" /></title>
	<link>http://digg.com<xsl:value-of select="normalize-space(*/xhtml:a[@class='more']/@href)" /></link>
	<guid>http://digg.com<xsl:value-of select="normalize-space(*/xhtml:a[@class='more']/@href)" /></guid>
	<description><xsl:value-of select="xhtml:p[not(@class)]" /></description>
	</item>
</xsl:template>

<xsl:template match="*">
  <xsl:apply-templates/>
</xsl:template>

<xsl:template match="text()" />

</xsl:stylesheet>
_HERE_

# Get source
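# -f makes curl exit non-zero on HTTP errors instead of saving the error page
curl -s -f -A "diggtoday.sh @ http://stackoverflow.org/wiki/Digg_Top_Stories_Today_as_RSS" "$LINK" > /tmp/diggtoday.$$.html
RC=$?
if [ $RC -ne 0 ]; then
	echo "ERROR $RC downloading" >&2
	rm -f /tmp/diggtoday.$$.html /tmp/diggtoday.$$.xsl
	exit 1
fi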
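# Strip a few attributes from the page, then convert it to XHTML with tidy
sed -e 's/ name="share0"//g' \
    -e 's/ id="share0"//g' \
    -e 's/ id="title0"//' \
    /tmp/diggtoday.$$.html \
    | "$TIDY" -asxhtml -f /dev/null \
    > /tmp/diggtoday.$$.xml
RC=$?
# tidy exits with 1 when there were only warnings and 2 when there were errors
if [ $RC -ge 2 ]; then
	echo "ERROR $RC tidying" >&2
	rm -f /tmp/diggtoday.$$.html /tmp/diggtoday.$$.xsl /tmp/diggtoday.$$.xml
	exit 1
fi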
xsltproc /tmp/diggtoday.$$.xsl /tmp/diggtoday.$$.xml
rm -f /tmp/diggtoday.$$.html /tmp/diggtoday.$$.xsl /tmp/diggtoday.$$.xml
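
The script keeps its scratch files under predictable /tmp/diggtoday.$$.* names and removes them only on the success path. One possible hardening, shown here as a sketch and not as part of the original script, is a private scratch directory that is cleaned up on every exit. It assumes a mktemp that accepts -d with a template (true of the GNU and BSD versions, but mktemp is not part of POSIX). WORKDIR, XSL, HTML and XML are illustrative names.

```shell
#!/bin/sh
# Sketch only: a private scratch directory instead of /tmp/diggtoday.$$.*
# Assumes mktemp supports "-d" with a template (GNU and BSD do).
WORKDIR=$(mktemp -d "${TMPDIR:-/tmp}/diggtoday.XXXXXX") || exit 1

# Remove the directory however the script ends, including the error exits.
trap 'rm -rf "$WORKDIR"' EXIT
# Turn common signals into a normal exit so the EXIT trap still runs.
trap 'exit 1' HUP INT TERM

# Hypothetical replacements for the three hard-coded /tmp paths.
XSL="$WORKDIR/feed.xsl"
HTML="$WORKDIR/page.html"
XML="$WORKDIR/page.xml"
```

With this in place, the rest of the script would write to "$XSL", "$HTML" and "$XML", and the explicit rm -f calls could be dropped.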

crontab:

#M H DoM M DoW cmd
0 * * * * /usr/local/bin/diggtodaycron.sh
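
The wrapper itself is not listed on this page. A minimal sketch of what diggtodaycron.sh could look like follows. It writes the feed to a temporary file first and renames it into place, so a failed run keeps the previous feed instead of leaving an empty one. The publish_feed helper and the paths in the usage note are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of a wrapper such as diggtodaycron.sh; publish_feed is a hypothetical helper.
# Usage: publish_feed DEST COMMAND [ARGS...]
# Runs COMMAND, captures its stdout, and replaces DEST only if the
# command succeeded and produced some output.
publish_feed() {
	dest=$1
	shift
	tmp="$dest.tmp.$$"
	if "$@" > "$tmp" && [ -s "$tmp" ]; then
		# Same directory as DEST, so the rename replaces the feed in one step
		mv "$tmp" "$dest"
	else
		# Keep the previous feed and report failure to the caller
		rm -f "$tmp"
		return 1
	fi
}
```

The wrapper would then end with a call such as publish_feed /var/www/diggtoday.xml /usr/local/bin/diggtoday.sh, where both paths are placeholders for wherever the feed and the script live on your system.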