Checking winning numbers like a CS student

Most web pages are written primarily for people rather than machines. Extracting information from a web page so that an application can process it further is therefore quite tedious in most cases. However, if you are able to extract the right parts of an HTML page, you can build pretty cool things that make your life easier (well, sort of...).

For several years now, it has been kind of a family tradition to buy Advent calendars (for those of you who have never heard of this: such a calendar has 24 little doors hiding small pieces of chocolate as a "countdown" until Christmas) from a charity organization that cooperates with the city of Schwabach (the town I grew up in). The cool thing about this particular calendar is that each one has a unique winning number printed on it. Each day about ten numbers are drawn, and you can win pretty cool stuff if you have the right number on your calendar (btw: all of these prizes are contributed by local companies).

The winning numbers that were drawn are published in the local newspaper and on the charity organization's web page. Since I do not receive the newspaper myself and did not want to check the web page manually each and every day, I decided to write a small bash script that checks the winning numbers automatically. I wanted the script to notify me via email if one of our numbers was drawn, so it had to extract the winning numbers along with some other information (the prize, its value, who sponsored it, and so on) from the website.

For those of you who are interested in my script, this is what it looks like:

#!/bin/bash

for limit in '0' '5' '10' '15' '20' '25' ; do
    wget -q -O - "http://www.lions-schwabach.de/index.php?option=com_content&view=category&layout=blog&id=45&Itemid=169&limitstart=$limit" | tr -d '\n\r' | grep -oP "<tr><td height.*?ff0000;\">[0-9]*<\/span>.*?</tr>" --color=never | sed 's/  */ /g' | sed 's/\&amp;/\&/g' | awk -F "</td>" 'function extr(str) { match(str, />[^<>]*<\/span>/); return substr(str, RSTART+1, RLENGTH-8); } {printf("%4d : %s von %s (Wert: %s Euro)\n", extr($2), extr($3), extr($5), extr($4));}'
done

Well, that's pretty nasty, isn't it? Of course I know that this script is "write-only" and that you could solve this problem much more easily and elegantly using more sophisticated tools. However, it was quite fun to write this script, and isn't that what it is all about?

I do not want to go into much detail here, but basically the script consists of a loop (which is necessary because the winning numbers are spread across multiple pages). Inside the loop, a single HTML page is downloaded via wget and processed by standard tools like tr, grep, sed and awk to extract the desired information.
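To make the pipeline a bit easier to follow, here is a more readable sketch of the same processing steps, applied to a made-up table row instead of the live page. The sample HTML, the function name extract_rows and the company name are assumptions for illustration only; the real page's markup may well differ. It also assumes GNU grep, since the -P option (Perl-style regular expressions) is not available everywhere.

```shell
#!/bin/bash
# Hypothetical sample row; the markup of the real page is an assumption.
sample_html='<table>
<tr><td height="20"><span>1</span></td>
<td><span style="color: #ff0000;">1234</span></td>
<td><span>Gutschein  &amp; Buch</span></td>
<td><span>50</span></td>
<td><span>Beispiel GmbH</span></td></tr>
</table>'

# Read HTML on stdin and print one formatted line per winning number.
extract_rows() {
    # 1. Join everything into one line so a table row can be matched as a whole.
    tr -d '\n\r' |
    # 2. Keep only rows that contain a red (ff0000) number; needs GNU grep.
    grep -oP '<tr><td height.*?ff0000;">[0-9]*</span>.*?</tr>' |
    # 3. Collapse runs of spaces and decode the &amp; entity.
    sed -e 's/  */ /g' -e 's/&amp;/\&/g' |
    # 4. Split each row into cells and print the text inside each <span>.
    awk -F '</td>' '
        function extr(str) {
            match(str, />[^<>]*<\/span>/)
            # Skip the leading ">" and drop the trailing "</span>".
            return substr(str, RSTART + 1, RLENGTH - 8)
        }
        { printf("%4d : %s von %s (Wert: %s Euro)\n", extr($2), extr($3), extr($5), extr($4)) }'
}

printf '%s\n' "$sample_html" | extract_rows
```

In the original one-liner, wget -q -O - writes the downloaded page to standard output, so it takes the place of the printf line here.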

I wrapped this snippet in a few more lines of code that filter the resulting list for our own winning numbers and compare the new list to the old one to find out whether one of us has won a prize recently (in which case an email is sent to my address). This script is executed once a day by a cron job running on my Raspberry Pi. Now I only have to wait for the first email to come...
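The wrapper itself is not shown here, but the idea of filtering and comparing can be sketched roughly as follows. This is not the actual wrapper: the function name find_new_wins, the file names, the numbers and the mail step are all assumptions, and the demo uses made-up data in a temporary directory.

```shell
#!/bin/bash
# Sketch only: file names, numbers and the mail step are assumptions.

# Print the wins that belong to one of our numbers and were not reported yet.
#   $1: file with all extracted wins, one per line ("1234 : prize ...")
#   $2: file with our own numbers, one per line
#   $3: file with wins that were already reported (created if missing)
find_new_wins() {
    local all_wins=$1 own_numbers=$2 seen=$3
    touch "$seen"
    # First file: remember our numbers. Second file: keep matching lines.
    # Then drop every line that is already listed in the "seen" file.
    awk 'FILENAME == ARGV[1] { own[$1 + 0]; next } ($1 + 0) in own' \
        "$own_numbers" "$all_wins" | grep -Fxv -f "$seen"
}

# Example run with made-up data.
demo_dir=$(mktemp -d)
printf '%s\n' '1234' '4711' > "$demo_dir/own.txt"
printf '%s\n' \
    '  17 : Gutschein von Beispiel GmbH (Wert: 20 Euro)' \
    '1234 : Buch von Muster AG (Wert: 50 Euro)' \
    '4711 : Korb von Beispiel GmbH (Wert: 30 Euro)' > "$demo_dir/all.txt"
printf '%s\n' '4711 : Korb von Beispiel GmbH (Wert: 30 Euro)' > "$demo_dir/seen.txt"

new_wins=$(find_new_wins "$demo_dir/all.txt" "$demo_dir/own.txt" "$demo_dir/seen.txt")
if [ -n "$new_wins" ]; then
    echo "$new_wins"
    # On a real setup one could mail the result and remember it, for example:
    #   echo "$new_wins" | mail -s "Advent calendar win" me@example.com
    #   echo "$new_wins" >> "$demo_dir/seen.txt"
fi
rm -r "$demo_dir"
```

Keeping the already reported wins in a plain text file means the daily cron run only produces output (and therefore an email) when something new shows up. The mail command in the comment assumes that a working mail setup exists on the machine.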