CPS 100, Test 2, Practicum

Due date: April 14, 8:00 am, absolutely no extensions

This part of the test should require an hour or two of work. You are to work completely by yourself. All communication with TAs/UTAs/Professors/Colleagues/Unknowns should be done via the newsgroup duke.cs.cps100. First you should copy all the files from ~ola/cps100/maps; these files are also accessible here.

Copy these files into a directory you create (call the directory maps). Then compile by typing:
    make usewords
You should create a link to the literary text files by typing:
    ln -s ~ola/data data
Then you can run the program, and when prompted for an input file, type data/poe.txt or data/hamlet.txt.

Run usewords (4 points)

You should create a README file in which you answer the questions below based on running usewords. In your README file you should also include the number of hours you worked on this assignment.
  1. Using romeo.txt, hamlet.txt, and tempest.txt, which two words occur more than 2% of the time in all three files? List how many times each of the two words occurs in each of the three files.

  2. Which of the three plays has the most unique words (and how many unique words does it have)?

  3. How many seconds does it take to process hawthorne.txt and what is the average word length?


Keep hash statistics (6 points) OPTIONAL! Extra Credit

The hash table used in hmap.cc and declared in hmap.h is a vector of linked lists. Each element of a linked list is declared to be Node<Pair<Key,Value> >, which in usewords.cc makes it Node<Pair<string,int> >, since each string is mapped to the number of times the string occurs in a file. The templated type Pair is declared in map.h; it has two fields: first (the key) and second (the value).
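To make the layout described above concrete, here is a minimal sketch of a vector of linked lists that maps each word to its count. It is not the code in hmap.cc: std::vector stands in for the course vector class, the Pair and Node definitions are simplified stand-ins, and SketchHash, CountWord, CountOf, and FreeTable are hypothetical names invented for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for the course types (assumptions, not the real
// declarations from map.h and hmap.h).
template <class Key, class Value>
struct Pair {
    Key first;     // the key
    Value second;  // the value
};

template <class T>
struct Node {
    T info;
    Node* next;
};

typedef Node<Pair<std::string, int> > WordNode;

// Hypothetical hash function, for illustration only. numBuckets must be > 0.
unsigned int SketchHash(const std::string& s, unsigned int numBuckets) {
    unsigned int h = 0;
    for (std::size_t i = 0; i < s.size(); i++) {
        h = h * 31 + static_cast<unsigned char>(s[i]);
    }
    return h % numBuckets;
}

// Record one occurrence of word: increment its count if it is already in
// the chain for its bucket, otherwise add a new node at the front.
void CountWord(std::vector<WordNode*>& table, const std::string& word) {
    unsigned int b = SketchHash(word, static_cast<unsigned int>(table.size()));
    for (WordNode* p = table[b]; p != 0; p = p->next) {
        if (p->info.first == word) {
            p->info.second++;
            return;
        }
    }
    WordNode* n = new WordNode;
    n->info.first = word;
    n->info.second = 1;
    n->next = table[b];
    table[b] = n;
}

// Return how many times word was recorded, or 0 if it is not in the table.
int CountOf(const std::vector<WordNode*>& table, const std::string& word) {
    unsigned int b = SketchHash(word, static_cast<unsigned int>(table.size()));
    for (const WordNode* p = table[b]; p != 0; p = p->next) {
        if (p->info.first == word) return p->info.second;
    }
    return 0;
}

// Delete every node and leave all buckets empty.
void FreeTable(std::vector<WordNode*>& table) {
    for (std::size_t k = 0; k < table.size(); k++) {
        while (table[k] != 0) {
            WordNode* dead = table[k];
            table[k] = dead->next;
            delete dead;
        }
    }
}
```

A table with 7 buckets could be created as std::vector<WordNode*> table(7, static_cast<WordNode*>(0)); calling CountWord(table, "the") twice would then make CountOf(table, "the") return 2.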

Add a new member function to the class HMap whose prototype is shown below (add this to hmap.h).

    void HashStats() const;

In the file hmap.cc, implement this function using the header:

    template <class Key, class Value>
    void HMap<Key,Value>::HashStats() const
    {
        // add code here
    }

This function should compute and print the length of the longest chain and the average chain length. In calculating the average chain length, count only chains that have at least one node; i.e., if there are 7,001 hash "buckets" but only 30 have linked lists/chains, divide by 30 when calculating the average chain length.

Use the code below as a starting point for calculating statistics.

    int numChains = 0;
    int k;
    for(k=0; k < myList.Length(); k++)
    {
        Node<Pair<Key,Value> > * ptr = myList[k];
        if (ptr != 0)
        {
            numChains++;
        }
    }

When your function works, run the program on hawthorne.txt and melville.txt and include statistics for these files in your README file. For extra credit, run the program with both a smaller and a larger number of buckets: currently there are 7,001 buckets. Use both 3,001 and 13,001 (which are both prime numbers). Include statistics in your README file for these numbers (4 points extra credit).
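The chain-length arithmetic can be tried out separately from HMap. The sketch below is one possible approach, not the required solution: it works on a plain std::vector of hand-built lists, and IntNode, ChainStats, ComputeChainStats, and PrintChainStats are hypothetical names that do not appear in the course files.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Stand-in list node (assumption); the real table stores Node<Pair<Key,Value> >.
struct IntNode {
    int info;
    IntNode* next;
};

struct ChainStats {
    int numChains;   // number of buckets holding at least one node
    int longest;     // length of the longest chain
    double average;  // total nodes / numChains, or 0 if every bucket is empty
};

// Walk every bucket once, measuring the length of each chain.
ChainStats ComputeChainStats(const std::vector<IntNode*>& buckets) {
    ChainStats s = {0, 0, 0.0};
    int totalNodes = 0;
    for (std::size_t k = 0; k < buckets.size(); k++) {
        int len = 0;
        for (const IntNode* p = buckets[k]; p != 0; p = p->next) {
            len++;
        }
        if (len > 0) {
            s.numChains++;  // empty buckets are not counted in the average
            totalNodes += len;
        }
        if (len > s.longest) {
            s.longest = len;
        }
    }
    if (s.numChains > 0) {
        s.average = static_cast<double>(totalNodes) / s.numChains;
    }
    return s;
}

void PrintChainStats(const ChainStats& s) {
    std::cout << "non-empty chains: " << s.numChains << std::endl;
    std::cout << "longest chain:    " << s.longest << std::endl;
    std::cout << "average length:   " << s.average << std::endl;
}
```

Dividing by numChains rather than by the number of buckets is what keeps empty buckets out of the average; the check for numChains > 0 avoids dividing by zero on an empty table.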


Sorting Words (extra credit) (6 points)

There's a line in usewords.cc that is commented out and can be used to print all the entries in the hash table:

    // uncomment line below to print all entries in table
    // map.Apply(Print);

However, since the hash table isn't sorted, this will print words in the order they occur in the hash table. Write a new class that inherits from MapBase, which you'll use to print the words in the hash table sorted alphabetically.

To do this, you'll implement the class MapSort declared below which uses WordInfo also shown:

    struct WordInfo
    {
        string word;
        int count;      // how many times word occurs
    };

    class MapSort : public MapBase<string,int>
    {
      public:
        MapSort();
        virtual void Function(string & key, int & value);
        virtual void Report();
      private:
        int myCount;                // # different words
        Vector<WordInfo> myList;
    };

You'll need to implement the constructor for MapSort; it should construct the vector myList and initialize myCount. Every time MapSort::Function is called, you'll need to add another entry to myList, growing it as necessary. When MapSort::Report is called, you'll sort myList, then print it. You should sort it alphabetically, but print both the word and the number of times the word occurs.
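The collect-then-sort pattern described above might look like the following sketch. It does not use the course headers: MapBaseSketch is an assumed stand-in for MapBase, std::vector replaces Vector (so growing is automatic through push_back), std::sort does the sorting, and the names MapSortSketch, WordLess, and Entries are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct WordInfo {
    std::string word;
    int count;  // how many times word occurs
};

// Assumed shape of the base class; the real MapBase may differ.
template <class Key, class Value>
class MapBaseSketch {
public:
    virtual ~MapBaseSketch() {}
    virtual void Function(Key& key, Value& value) = 0;
    virtual void Report() = 0;
};

// Orders entries by word using string operator<, which compares character
// codes (so uppercase letters come before lowercase ones).
bool WordLess(const WordInfo& a, const WordInfo& b) {
    return a.word < b.word;
}

class MapSortSketch : public MapBaseSketch<std::string, int> {
public:
    MapSortSketch() : myCount(0) {}

    // Called once per table entry: remember the word and its count.
    virtual void Function(std::string& key, int& value) {
        WordInfo w;
        w.word = key;
        w.count = value;
        myList.push_back(w);
        myCount++;
    }

    // Sort the collected entries by word, then print word and count.
    virtual void Report() {
        std::sort(myList.begin(), myList.end(), WordLess);
        for (int k = 0; k < myCount; k++) {
            std::cout << myList[k].word << "\t" << myList[k].count << std::endl;
        }
    }

    // Hypothetical accessor, added only so the result can be inspected.
    const std::vector<WordInfo>& Entries() const { return myList; }

private:
    int myCount;                   // # different words
    std::vector<WordInfo> myList;
};
```

With the course Vector class, Function would also have to check whether myList is full and enlarge it before storing the new entry; std::vector hides that step.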

Submit

To submit use:
    submit100 test2 README usewords.cc hmap.h hmap.cc Makefile
Your usewords.cc program should show hash statistics when compiled and run.
Owen L. Astrachan
Last modified: Sat Apr 12 14:43:21 EDT 1997