Lab 9: CPS100e: Unique Words Encore Une Fois

Change into your cps100e directory and create a lab9 subdirectory. Then change into this directory and copy files by typing what's below (don't forget the trailing dot).

      cp ~ola/cps100e/lab9/* .

Make a link to several text data files. To do this type

    ln -s ~ola/data data

Then you'll be able to use the directory data to access several big text and data files.

If you type ls you should see the directory data and the files Makefile, sequence.h, sequence.cc, seqiterator.h, seqiterator.cc, lookup.cc, and linelook.cc.

Introduction

In this lab you'll count words (one more time). But you'll use a templated sequence class to do the counting. You'll also be making a concordance which can be used to determine where in a document a given word occurs.

As in the previous lab words are converted to lowercase letters so the word Here and here will be counted as the same word. A word is consecutive non-white-space characters. So the word "more" is considered a different word than the word more because the quotes are considered part of the first word.

For example, if the file test contains the following:

Here are a few words.
and here are a few more.
The output of the program is a list of words and the number of occurences of each word, followed by the number of unique words, the total number of words, and the time to insert all the unique words from the file into the data structure. Output for the file test would be:
enter name of file: test
here		2
are		2
a		2
few		2
words.		1
more.		1

total # words = 10
total # unique words = 6
time = 0 secs.

For this lab, unique words will be stored in a templated sequence rather than in a doubly linked list as was the case in the previous lab. The sequence is implemented using linked-lists, but you don't need to know this in doing this lab.


Part 1: Adding words at the Back

A sequence of Info structs will be used in lookup.cc (see below). The function Update is used to update the count of how many times each word occurs, this is similar to what was done in last week's lab. Notice that both the Sequence and SequenceIterator variables are templated.

  struct Info
  {
      string word;
      int count;       // # of times word occurs
  };

  void
  Update(Sequence & seq,const string & key)
  // postcondition: returns true if seq contains a struct with
  //                word == key, otherwise returns false
 {
      SequenceIterator<Info> iter(seq);

      for(iter.First(); ! iter.IsDone(); iter.Next())
      {
          if (iter.Current().word == key)
  	  {
	      iter.Current().count++;
	      return;
	  }
      }
      // word not found, add to list
      Info dummy;
      dummy.word = key;
      dummy.count = 1;
      seq.Prepend(dummy);
  }

1.1: Time the program

Compile and run the program lookup.cc . Run the program on the three data files: poe.txt, twain.txt, and melville.txt (hamlet.txt will take a long time but you can try it if you'd like to). For these three, type "data/" in front of the file name, e.g. "data/poe.txt". The output just prints how long the process takes, the words are not printed. You are to add code so that an iterator is used in main to print the first 30 words of the sequence. If you define the iterator properly, you can use code as shown below to print the first 30 words.

    for(iter.First(); !iter.IsDone(); iter.Next())
    {
        count++;
        if (count > 30) break;                // print first 30 only
        cout << iter.Current().count << "\t"
	     << iter.Current().word << endl;
    }
You'll also need to modify the function Update to keep track of the number of unique words. This will involve adding another parameter to the function. Put the number of unique words, total number of (non-unique) words, and the total time for update for each of these files in a README file.

Change the call to Prepend for the sequence used in the function Update to Append . Rerun the program and time the results for the three data files. Include these results in the README file.


Part 2: Vectors of Sequences

For this program, use your modified lookup.cc program, but copy it first:

       cp lookup.cc lookup2.cc

Instead of using one sequence, you'll now modify the program lookup2.cc to use a vector of sequences, one for each character, e.g., 'a' and 'Z' (and all other characters as well) but you'll need to account for words that start with any character. For this you'll need a vector of 256 sequences. You can define this vector as shown below.

    const int MAX_INDEX = 256;
    Vector <Sequence<Info>  > list(MAX_INDEX);
Notice the space between the two > symbols; this is necessary to avoid confusion with the stream extraction operator >> .

When calling the function Update you'll need to pass the appropriate sequence (according to the first letter of the word being updated). For example, if the word is "bubble", it will be updated in the 'b'-sequence, which is accessible as list['b'] where list is the vector of sequences. In general, you can use list[word[0]] to specify the appropriate sequence since word[0] is the first character of the string word.

You'll also need to modify the code that prints all the words to print all 256 sequences (some may not have any words in them, but the iterator will do the right thing.)

Time the same three files and include the results in your README file. Then change the number of sequences from 256 to 1001 (this should involve changing a constant). Now, instead of using word[0] to determine which sequence will store a word, use the arithmetic expression below for variable index .

    int len = word.length();
    int index = (len * (word[0] + word[len-1])) % MAX_INDEX;
Note that the modulus operator is used to ensure that the indexing expression is a valid vector index.

Time the program again for the same three data files. In your README file try to explain the difference in timings.

When you turn in this program, turn in the version with the vector of 1001 elements


Extra Credit

Create a new file lookup3.cc (you'll need to copy one of the previous lookup programs to start).

Rather than storing words and how many times they occur, for this modification you'll store the line number on which every word occurs. To do this you'll read the data one line at a time, then get the words from a line using an input string stream. You'll modify the Info struct so that in addition to storing a word and how many times the word occurs, the struct will also store a list of line numbers on which the word occurs. You can use a sequence of integers for this, each time a new occurrence of a word is found, use Append to add the line number to the end of the sequence.

In making this modification you should store only the first 20 line numbers on which a word occurs, not all the line numbers.

To read words a line at a time, use the loop in the file linelook.cc . You will probably want to copy your working program from either Part 1 or Part 2 then cut and paste the new loop for reading lines and extracting words.


Submitting the lab

To turn in programs: When your modified programs work, you can submit them, along with a README describing how long it took you and what your impressions of the lab are. Type (where N is your section number: 1, 2, or 3):

      submit100e lab9secN README lookup.cc lookup2.cc 

(include the extra-credit program if you did it).

The README file should include your name, how long you worked on the program, and anyone you received significant help from. You should also include any comments you have about the lab.

You should receive a message telling you that the programs were submitted correctly.