CompSci 108
Spring 2009
Software Design and Implementation

Tag Clouds

A Tag Cloud is a visual representation of the content of a text, in which words that occur more often are displayed more prominently. Here are some examples:

Specification

Write a Python program that, given a text file, generates an HTML tag cloud of its words that can be viewed within a web browser.

Words are white-space delimited strings from the specified file with their leading and trailing punctuation stripped off. Words should be compared in a case-insensitive manner, e.g., ENERGY, energy, and Energy are three occurrences of the same word. Since very common words like "a", "the", and "i" will skew the results, your program should ignore those words. A list of words to ignore is given in this file: common.txt. To further consolidate the unique words within the file, you can use Porter's stemming algorithm to determine that the words "end", "ending", and "ends" are all forms of the same word.
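As a rough illustration of the word rules above, here is a minimal sketch that uses only the standard library. The function names (extract_words, count_words) and the sample sentence are hypothetical, and the ignore set is typed in by hand; in your program it would be read from common.txt, whose exact layout is not assumed here. Stemming is left out.

```python
import string
from collections import Counter


def extract_words(text, ignored=frozenset()):
    """Yield lowercase words with leading and trailing punctuation stripped."""
    for token in text.split():  # split() with no arguments splits on any white space
        word = token.strip(string.punctuation).lower()
        if word and word not in ignored:  # skip empty strings and ignored words
            yield word


def count_words(text, ignored=frozenset()):
    """Return a Counter mapping each distinct word to its frequency."""
    return Counter(extract_words(text, ignored))


# Hypothetical sample input; a real run would use the contents of a text file.
sample = 'ENERGY, energy, and "Energy" are the same word; the end.'
ignored_words = {"a", "the", "i", "and", "are"}
counts = count_words(sample, ignored_words)
print(counts["energy"])  # the three spellings are counted as one word
```

Note that string.punctuation only covers ASCII punctuation, so typographic quotes or dashes in a Project Gutenberg file may need extra handling.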

The different words found in the file should be printed in HTML in alphabetical order. The size of each word should be based on its frequency within the file. To make the output readable, you should divide the top N most frequently occurring words into a few different size groups and associate each word with a specific group. When writing the output, define some CSS font styles, rather than hard-coding the specific font-size into each tag. To further enhance the output, you can color the words based on when they appear within the file, i.e., a word will be darker if it appears more often later in the file and lighter if most of its occurrences are at the beginning.
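One possible way to turn word counts into size groups and CSS-styled HTML is sketched below. This is only an illustration under several assumptions: the names size_groups and render_cloud, the class names size0, size1, ..., the point sizes, and the linear scaling between the lowest and highest count are all hypothetical choices, not requirements of the assignment. Coloring by position is not shown.

```python
import html


def size_groups(counts, top_n=100, num_groups=5):
    """Map each of the top_n most frequent words to a size group index.

    Group 0 is the smallest font and num_groups - 1 the largest. Groups are
    assigned by linear scaling between the lowest and highest count kept.
    """
    top = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top_n]
    if not top:
        return {}
    high, low = top[0][1], top[-1][1]
    spread = max(high - low, 1)  # avoid dividing by zero when all counts are equal
    return {
        word: min((count - low) * num_groups // spread, num_groups - 1)
        for word, count in top
    }


def render_cloud(groups, num_groups=5, base_pt=10, step_pt=6):
    """Return an HTML page with the words in alphabetical order, sized via CSS classes."""
    css = "\n".join(
        f".size{g} {{ font-size: {base_pt + g * step_pt}pt; }}"
        for g in range(num_groups)
    )
    spans = " ".join(
        f'<span class="size{group}">{html.escape(word)}</span>'
        for word, group in sorted(groups.items())  # sorted by word, i.e. alphabetically
    )
    return (
        "<html>\n<head>\n<style>\n" + css + "\n</style>\n</head>\n"
        "<body>\n<p>" + spans + "</p>\n</body>\n</html>\n"
    )


# Hypothetical counts; a real run would use the frequencies computed from a file.
counts = {"prince": 40, "state": 22, "war": 10, "arms": 4}
groups = size_groups(counts, top_n=3, num_groups=3)
page = render_cloud(groups, num_groups=3)
print(page)
```

Because each word only carries a class name, changing the look of the whole cloud means editing the style rules in one place rather than every tag. Other grouping schemes, such as equal-sized buckets by rank, would fit the specification just as well.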

These text files, obtained from Project Gutenberg, are provided to show off your program:

Here is a Tag Cloud of the top 100 words for Machiavelli's "The Prince":

able acquired affairs afterwards alexander although among arms army attack became become born brought cannot castruccio chapter citizens concerning considered death di died difficulties done duke either electronic enemy entirely fear florence florentines forces fortune foundation france francesco friends given having held hold infantry italy kept killed king kingdom less lost lucca machiavelli messer necessary neither nevertheless nor nothing opportunity order orsini ought pagolo pisa pope power prince princes principalities principality project reason reputation rome secure seen sent shall soldiers son state states subjects taken terms themselves therefore thousand thus upon valour war whilst whom wise wish works yet

As much as possible, your program should reflect good Python programming style rather than read like a Java program that has been translated into Python. To get you started, you can use the code from this website, which does not represent good (Python) programming style. You can either start by refactoring it or start from scratch.

Resources