CPS 100e, Fall 2009, Counting Words

In general, the rank of a word is its location in the sorted-by-frequency order, e.g., the most frequently occurring word has rank 1, the second-most frequently occurring word has rank 2, and so on. The frequency of a word is the number of occurrences of the word. Zipf's law states that in English and other language texts the rank and frequency are related by the following equation where C and alpha are constants that depend on the text being analyzed, F is frequency, and r is rank. Zipf's law applies to cities and populations, notes in music, and proteins in DNA.
    F = C/ralpha

In this in-class assignment, you will answer questions about different methods to count the number of unique words and then write code to print the top k words.

Questions on this in-class assignment refer to the code in the wordcount classes shown here.

Name _________________________ 

login id: ____________________

    Counting the number of Unique Words

  1. The code for determining unique words in SlowUniqueCounter.java has a bug. The value returned is not, in general, correct. Describe what the bug is, how to fix it, and a data file for which the current version will return a correct result.
    
    
    
    
    
    
    
    
    
    
    
    
    
  2. The code in SortingUniqueCounter has a bug. Describe a simple fix for the bug, and why the fix is needed.
    
    
    
    
    
    
    
    
    
  3. (4 points) If the following line from SortingUniqueCounter
                if (! list[k].equals(last)){
    
    is replaced by this line:
                if (list[k] != last){
    
    the program compiles and runs, but indicates that the Melville has 14,352 different words rather than 4,255 different words. Explain this output (note that there are 14,353 words in the Melville file).
    
    
    
    
    
    
    
    
    
    
  4. Provide a plausible explanation for why the timings for the code in SortingUniqueCounter and SetUniqueCounter are similar.
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
  5. As shown in the sample output a run based on Hawthorne's The Scarlet Letter was stopped. Based on the statistics shown, and the fact that there are 85,753 total words, but 14,123 different words in the text, develop estimates for how long the program will take for each of the three classes. Provide reasons for each of your estimates.
    
    
    
    

    Print Commonly Occurring Words

  6. Consider ReadStuff.java, we want to complete the code, so we can print out the number of occurrences for each word. We would like the count of words to accurately reflect our notion of what a word is. How does the normalizeWord method help? Where would it fail?
    
    
    
    
    
    

  7. Why does the WordCount constructor initialize myCount to one?
    
    
    
  8. Explain the purpose of the following line from processWords in ReadStuff.java. myWordsCounted = (WordCount[]) counts.toArray(new WordCount[0]);
    
    
    
    
  9. Complete the processWords method to store each word read from the file in an ArrayList of WordCount objects that store the unique words and their frequency. If a word is found (seen before) its frequency is incremented, otherwise the word is added to the ArrayList with a frequency of one.

  10. Suppose each time two WordCount objects are compared during the lookup shown above that the cost of comparing the objects is $0.01 (one cent) and that it takes 0.001 seconds (one millisecond). In a file of 11 different words, the 11th word processed incurs a cost of ten cents (to check against the 10 words already stored) and takes 0.01 seconds (10 * 0.001). What is the cost of processing all 11 words in time and money? Calculate by adding the costs of each of the 11 different words.
    
    
    
    
    
    
  11. What is the cost in time and money in processing a file of 21 different/unique words? Show your reasoning.
    
    
    
    
  12. Try to develop a formula for the cost in time and money to process a file of N different words. Express your answers in terms of N (this is not simple to do, try it).
    
    
    
  13. Suppose a file consists of 500 occurrences of 'the' followed by 500 occurrences of 'cat'. What's the cost in time and money to process this file?
    
    
    
  • The output when the program is run on Melville's Bartleby the Scrivener follows.
        processing/counting 14353 strings
        # different = 4103
        598   the
        439   i
        418   to
        369   and
        353   of
        333   a
        263   in
        217   his
        208   he
        196   my
        183   was
        182   that
        136   not
        127   it
        114   at
        113   but
        106   with
        105   for
        100   as
        95    would
    
    Note that the words are printed sorted by frequency. The array of WordCount objects is sorted after it is created. Explain what part of the code causes the sort to be done by frequency, breaking ties by alphabetical order.