A script to generate RSS feeds for wlu.ca
So RSS has been popular on the web for about a bajillion years and table layouts have been passé for even longer. Yet Wilfrid Laurier University’s website still uses tables and still does not have RSS feeds for its news items. I’ve griped about this before, and attempted a primitive version of a screen scraper for the website, but changes to the site tended to break that version.
Using BeautifulSoup and PyRSS2Gen, both Python libraries, I’ve created a new screen scraper that should be fairly robust thanks to BeautifulSoup’s forgiving HTML parsing engine.
So far, the scraper produces RSS feeds for the main news page and the Physics and Computer Science department news page. Any other news pages on the Laurier site can be scraped by adding a line to the script file. Let me know if there’s a WLU news page you would like me to scrape.
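As a rough sketch of what "adding a line" could look like, the pages can be kept in a small table of URL and filename pairs and walked in a loop. The names FEEDS and generate_all are made up for this illustration, and the grp_id=99 entry is a placeholder rather than a real department page. The parse_page argument stands in for the script's parsePage function.

```python
# Hypothetical feed table: each entry pairs a news-listing URL with an output file.
FEEDS = [
    ("http://www.wlu.ca/news_listing.php", "wlu_main.xml"),
    ("http://www.wlu.ca/news_listing.php?grp_id=2", "wlu_physcomp.xml"),
    # To scrape another page, add one more pair here. This grp_id value and
    # filename are placeholders, not a real WLU news page.
    ("http://www.wlu.ca/news_listing.php?grp_id=99", "wlu_example.xml"),
]


def generate_all(feeds, parse_page):
    """Call parse_page(url, filename) for every feed; return the filenames handled."""
    written = []
    for url, filename in feeds:
        parse_page(url, filename)
        written.append(filename)
    return written
```

Passing the parser in as an argument keeps the loop easy to try out without touching the network, for example by handing it a function that only records its arguments.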
RSS Feeds:
Main Page
Physics and Computer Science
Code:
#!/usr/bin/env python
# Written for Python 2 (urllib2, print statements) and BeautifulSoup 3.
import sys
import os
import datetime
import time
import urllib2

import PyRSS2Gen
import BeautifulSoup

debug = False


def parsePage(url, filename):
    rss_items = []
    prefix = "http://www.wlu.ca/"
    document = urllib2.urlopen(url)
    souped_doc = BeautifulSoup.BeautifulSoup(document)
    main_title = souped_doc.head.title.contents[0]
    # News items live in tables with class "news", one item per row.
    for table in souped_doc('table', 'news'):
        for tr in table('tr'):
            # Skip rows with no cells and rows whose first cell spans columns.
            if tr.td is not None and not tr.td.has_key('colspan'):
                # Second cell holds the linked title; first cell holds the date.
                title = "".join([str(i) for i in tr('td')[1].a.contents])
                date = datetime.datetime(*(time.strptime(tr('td')[0].contents[0], "%b %d/%y")[0:6]))
                link = tr('td')[1].a['href']
                # Relative links get the site prefix.
                if not link.startswith("http"):
                    link = prefix + link
                guid = PyRSS2Gen.Guid(link)
                if debug:
                    print "Title: %s" % title
                    print "Date: %s" % date
                    print "URL: %s" % link
                    print "Guid: %s" % guid
                    print "===================="
                rss_items.append(PyRSS2Gen.RSSItem(title=title,
                                                   link=link,
                                                   description="",
                                                   guid=guid,
                                                   pubDate=date))
    # Build the feed once every table has been scanned, then write it out.
    output = PyRSS2Gen.RSS2(title=main_title,
                            link=url,
                            description="",
                            lastBuildDate=datetime.datetime.now(),
                            items=rss_items)
    output.write_xml(open(filename, "w"))


if __name__ == "__main__":
    parsePage("http://www.wlu.ca/news_listing.php", "wlu_main.xml")
    parsePage("http://www.wlu.ca/news_listing.php?grp_id=2", "wlu_physcomp.xml")
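
The two fiddliest steps in the script are the date parsing and the link fix-up, so here they are pulled out on their own as a small sketch. The helper names parse_news_date and absolute_link are made up for this example, and the sample date and href are invented values in the format the script expects, not real entries from the site.

```python
import datetime
import time

PREFIX = "http://www.wlu.ca/"  # same prefix the script uses


def parse_news_date(text):
    """Turn a listing date such as 'Mar 05/09' into a datetime at midnight."""
    # strptime returns a struct_time; its first six fields are
    # year, month, day, hour, minute, second.
    return datetime.datetime(*time.strptime(text, "%b %d/%y")[0:6])


def absolute_link(href):
    """Prepend the site prefix unless the link already starts with 'http'."""
    if not href.startswith("http"):
        href = PREFIX + href
    return href


# Invented sample values, for illustration only.
print(parse_news_date("Mar 05/09"))
print(absolute_link("news_detail.php?id=1"))
```

Note that the "%b" directive matches month abbreviations in the current locale, so this assumes the script runs under an English (or default C) locale.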
