CPS 108, Spring 2004, Run, Right, Fast


Find partners at this link.

This assignment uses wordlines.cpp and sortdemo.cpp as starting points.

The assignment has three parts. For each part you'll write a program that will read a file and determine the words that occur most frequently in the file. Words are delimited by whitespace and should not have leading or trailing punctuation. All characters should be converted to lowercase equivalents.
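One way to read the rule about punctuation and case is sketched below. The function name normalizeWord is hypothetical, and the sketch assumes that "punctuation" means whatever std::ispunct reports in the default locale, that punctuation inside a word (as in "don't") is kept, and that a token made only of punctuation becomes an empty string that the caller can skip.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of wordlines.cpp): strips leading and
// trailing punctuation from a whitespace-delimited token and lowercases
// what remains. Returns "" if the token is all punctuation.
std::string normalizeWord(const std::string& token)
{
    std::size_t first = 0;
    std::size_t last = token.size();
    // Skip punctuation at the front of the token.
    while (first < last &&
           std::ispunct(static_cast<unsigned char>(token[first]))) {
        ++first;
    }
    // Skip punctuation at the back of the token.
    while (last > first &&
           std::ispunct(static_cast<unsigned char>(token[last - 1]))) {
        --last;
    }
    std::string word = token.substr(first, last - first);
    // Convert every remaining character to its lowercase equivalent.
    for (char& ch : word) {
        ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return word;
}
```

With this sketch, a token read with operator>> would be passed through normalizeWord before it is counted, and empty results would be ignored.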

The program should print the n words that occur most frequently, where n defaults to 20, but is otherwise a parameter to the program. If no filename is specified, the program should read from standard input (cin). The examples below show how the program can be used.

Usage


   wordcount

   wordcount -f filename -c wordcount
   
   wordcount --file[=filename] --count[=wordcount]

For example, the following are all valid uses.
   wordcount < ~/data/poe.txt

   wordcount -c 30 < ~/data/poe.txt

   wordcount --file=/u/ola/data/poe.txt --count=30

   wordcount -f ~/data/poe.txt -c 30
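The four forms above can be handled with a small hand-written loop. The following is a minimal sketch, not a required design: the Options struct and parseOptions function are hypothetical names, unrecognized arguments are silently ignored, and the count is converted with std::atoi without error checking.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical container for the settings described under Usage.
struct Options {
    std::string filename;   // empty means read from standard input
    int count = 20;         // number of words to print
};

// Parses arguments of the forms -f name, -c n, --file=name, --count=n.
// Assumption: anything else is ignored in this sketch.
Options parseOptions(const std::vector<std::string>& args)
{
    Options opts;
    const std::string filePrefix = "--file=";
    const std::string countPrefix = "--count=";
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "-f" && i + 1 < args.size()) {
            opts.filename = args[++i];
        } else if (arg == "-c" && i + 1 < args.size()) {
            opts.count = std::atoi(args[++i].c_str());
        } else if (arg.compare(0, filePrefix.size(), filePrefix) == 0) {
            opts.filename = arg.substr(filePrefix.size());
        } else if (arg.compare(0, countPrefix.size(), countPrefix) == 0) {
            opts.count = std::atoi(arg.substr(countPrefix.size()).c_str());
        }
    }
    return opts;
}
```

In main, the arguments could be passed as std::vector<std::string>(argv + 1, argv + argc). Note that the shell expands ~ and handles the < redirection before the program sees its arguments.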

Output

For versions 2.0 and 3.0 the output is specified as follows (this was not specified at the time of the 1.0 submission).

Each line of output should contain a count, followed by two spaces, followed by a word. The count is the number of times the word occurs in the input being processed. The most frequently occurring word should be printed first and the least frequently occurring word last (at most n lines are printed, where n is a parameter to the program that defaults to 20). The counts should be right-justified so that the words are aligned by their first letter.

100  blueberry
 12  apple
 11  berry
 11  cherry
  7  watermelon
  6  orange
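The right-justified layout shown above can be produced with std::setw once the width of the largest count is known. This is a sketch under assumptions: formatCounts is a hypothetical name, the entries are assumed to be sorted already, and the function returns a string instead of printing so that it is easy to check.

```cpp
#include <algorithm>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: formats (word, count) pairs, already sorted from
// most to least frequent, one pair per line.
std::string formatCounts(const std::vector<std::pair<std::string, int> >& entries)
{
    // Find the number of digits in the widest count.
    std::size_t width = 0;
    for (const auto& entry : entries) {
        width = std::max(width, std::to_string(entry.second).size());
    }
    std::ostringstream out;
    for (const auto& entry : entries) {
        // setw pads on the left by default, which right-justifies the count.
        out << std::setw(static_cast<int>(width)) << entry.second
            << "  " << entry.first << '\n';
    }
    return out.str();
}
```

A program could then write std::cout << formatCounts(entries); to print the result.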

The command-line option --width[=number] or -w number should output number count/word pairs on each line (except the last line, which may not be full). Each count is followed by two spaces, and counts are right-justified in each column (as above). Each column is separated from the next column by four spaces (between the longest entry in one column and the largest number in the next column).


  wordcount --file=foo.txt --width=3
  wordcount -f foo.txt -w 3

These could generate output as follows. Note that there are four spaces between the 'y' in blueberry and the first '1' in the count of 11 for berry in the next column. There are also four spaces between the 'y' in cherry and the '6' in the count for orange in the next column.

100  blueberry    11  berry     7  watermelon
 12  apple        11  cherry    6  orange
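A sketch of the multi-column layout follows. It rests on an interpretation, not on a stated rule: the sample lists the words down each column (100 and 12 in the first column, then 11 and 11, then 7 and 6), so the sketch uses ceil(n / width) rows and fills each column top to bottom. When the number of words is not a multiple of the width, this ordering can leave more than one line short, so check the intended ordering before relying on it. The names Entry and formatColumns are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Entry;   // (word, count)

// Hypothetical helper: lays out entries (already sorted, most frequent
// first) in `columns` columns, filling each column top to bottom.
std::string formatColumns(const std::vector<Entry>& entries, std::size_t columns)
{
    if (entries.empty() || columns == 0) {
        return "";
    }
    const std::size_t rows = (entries.size() + columns - 1) / columns;

    // Widest count and widest word in each column.
    std::vector<std::size_t> countWidth(columns, 0), wordWidth(columns, 0);
    for (std::size_t i = 0; i < entries.size(); ++i) {
        const std::size_t col = i / rows;
        countWidth[col] = std::max(countWidth[col],
                                   std::to_string(entries[i].second).size());
        wordWidth[col] = std::max(wordWidth[col], entries[i].first.size());
    }

    std::ostringstream out;
    for (std::size_t row = 0; row < rows; ++row) {
        for (std::size_t col = 0; col < columns; ++col) {
            const std::size_t i = col * rows + row;
            if (i >= entries.size()) {
                break;   // no entry here, so none in later columns either
            }
            const std::string count = std::to_string(entries[i].second);
            const std::string& word = entries[i].first;
            // Right-justify the count, then two spaces, then the word.
            out << std::string(countWidth[col] - count.size(), ' ')
                << count << "  " << word;
            // Pad to the column's widest word plus four spaces, but only
            // when another entry follows on this line.
            const std::size_t next = (col + 1) * rows + row;
            if (col + 1 < columns && next < entries.size()) {
                out << std::string(wordWidth[col] - word.size() + 4, ' ');
            }
        }
        out << '\n';
    }
    return out.str();
}
```

Called with the six fruit entries and a width of 3, this sketch produces the two lines of the sample, with no trailing spaces after the last word on each line.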

Versions

  1. Version 1.0 should work correctly reading from standard input and determining the 20 words that occur most frequently (break ties alphabetically). You can process command-line options, but you don't need to. The grade will be based only on correctness, but comments on design will be given.

  2. Version 2.0 should process command-line options and be designed using classes to facilitate alternative implementations with minimal rewriting. The grade will be based 50% on design and 50% on conforming to specifications.

  3. Version 3.0 should be as fast as possible. It will be graded 80% on speed and 20% on design. Programs that produce incorrect output will receive little credit.
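The ordering rule in Version 1.0 (most frequent first, ties broken alphabetically) can be written as a comparison function. The sketch below assumes the counts have already been collected in a std::map<std::string, int>, for example with ++counts[word]; the names Entry, moreFrequent, and topWords are hypothetical, and other containers or sorting strategies are equally valid.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, int> Entry;   // (word, count)

// Orders entries by descending count, breaking ties alphabetically.
bool moreFrequent(const Entry& a, const Entry& b)
{
    if (a.second != b.second) {
        return a.second > b.second;
    }
    return a.first < b.first;
}

// Hypothetical helper: returns the n most frequent words in `counts`,
// or all of them if there are fewer than n distinct words.
std::vector<Entry> topWords(const std::map<std::string, int>& counts, std::size_t n)
{
    std::vector<Entry> entries(counts.begin(), counts.end());
    n = std::min(n, entries.size());
    // Only the first n positions need to end up in sorted order.
    std::partial_sort(entries.begin(), entries.begin() + n, entries.end(),
                      moreFrequent);
    entries.resize(n);
    return entries;
}
```

std::partial_sort is used here because only the first n entries are needed; std::sort with the same comparison function would give the same first n entries.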

Grading, Submit, Due Dates

Information accessible here.
Owen L. Astrachan
Last modified: Thu Jan 29 11:52:11 EST 2004