CompSci 6
Fall 2008
Program Design and Analysis

Tag Clouds

A Tag Cloud is a visual representation of the content of a text. Here are some examples:

Specification

Write a program that, given a text file, generates an HTML tag cloud of its words that can be viewed within a web browser.

Words are white-space delimited strings from the specified file, with any leading and trailing punctuation stripped off. Words should be compared in a case-insensitive manner, e.g., ENERGY, energy, and Energy are three occurrences of the same word. Since very common words such as "a", "the", and "i" will skew the results, your program should ignore them. A list of words to ignore is given in this file: common.txt.
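The word rules above can be sketched in a few lines of Python. This is only one possible reading of the specification, not a reference solution: the function names `count_words` and `load_ignore_words` are made up for illustration, and the loader assumes common.txt holds white-space separated words, which the handout does not state.

```python
import string
from collections import Counter

def load_ignore_words(path):
    """Read the ignore list. ASSUMPTION: words are white-space separated."""
    with open(path, encoding="utf-8") as f:
        return {w.lower() for w in f.read().split()}

def count_words(text, ignore):
    """Count words case-insensitively, skipping anything in `ignore`."""
    counts = Counter()
    for token in text.split():  # white-space delimited strings
        # Strip punctuation from both ends only; inner characters
        # (as in "don't") are left alone. Then fold to lower case.
        word = token.strip(string.punctuation).lower()
        if word and word not in ignore:
            counts[word] += 1
    return counts

# Small illustrative input; a real run would read the text file instead.
sample = 'ENERGY, energy, and "Energy!" are the same; a don\'t is kept whole.'
ignore = {"a", "the", "i", "and", "are", "is"}
counts = count_words(sample, ignore)
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes or dashes in a text would need extra handling.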

The distinct words found in the file should be printed in HTML in alphabetical order. The size of each word should be based on its frequency within the file. To make the output readable, you should divide the 100 most frequently occurring words into five different size groups and associate each word with a specific group.
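The handout does not say how the five groups should be formed, so the sketch below makes one choice among several: it splits the range between the lowest and highest count in the top 100 into five equal-width bins. The names `size_group` and `tag_cloud_html` and the pixel values in `FONT_SIZES` are assumptions made for illustration.

```python
from collections import Counter
from html import escape

# ASSUMPTION: hypothetical font sizes in pixels, smallest group first.
FONT_SIZES = [12, 16, 22, 30, 40]

def size_group(count, lo, hi, groups=5):
    """Map a count in [lo, hi] to a group 1..groups using equal-width bins."""
    if hi == lo:
        return 1  # all words equally frequent: use a single group
    return min(groups, 1 + (count - lo) * groups // (hi - lo))

def tag_cloud_html(counts, top=100):
    """Return an HTML paragraph of the `top` most frequent words, alphabetized."""
    top_words = Counter(counts).most_common(top)
    if not top_words:
        return "<p></p>"
    freqs = [c for _, c in top_words]
    lo, hi = min(freqs), max(freqs)
    spans = []
    for word, count in sorted(top_words):  # alphabetical by word
        px = FONT_SIZES[size_group(count, lo, hi, len(FONT_SIZES)) - 1]
        spans.append(f'<span style="font-size: {px}px">{escape(word)}</span>')
    return "<p>" + " ".join(spans) + "</p>"

# Tiny made-up counts, just to exercise the functions.
demo_html = tag_cloud_html({"prince": 11, "war": 6, "arms": 1, "state": 3})
```

A rank-based split (20 words per group) would be an equally defensible reading; equal-width bins tend to put a few very frequent words in the largest group and many words in the smallest.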

These text files, obtained from Project Gutenberg, are provided to show off your program:

Here is a Tag Cloud of the top 100 words for Machiavelli's "The Prince":

able acquired affairs afterwards alexander although among arms army attack became become born brought cannot castruccio chapter citizens concerning considered death di died difficulties done duke either electronic enemy entirely fear florence florentines forces fortune foundation france francesco friends given having held hold infantry italy kept killed king kingdom less lost lucca machiavelli messer necessary neither nevertheless nor nothing opportunity order orsini ought pagolo pisa pope power prince princes principalities principality project reason reputation rome secure seen sent shall soldiers son state states subjects taken terms themselves therefore thousand thus upon valour war whilst whom wise wish works yet