CPS 100e, Fall 2009, Counting Words
In general, the rank of a word is its location in the
sorted-by-frequency order, e.g., the most frequently occurring word has
rank 1, the second-most frequently occurring word has rank 2, and so
on. The frequency of a word is the number of occurrences of the
word. Zipf's law states that in English and other language texts the
rank and frequency are related by the following equation where C and
alpha are constants that depend on the text being analyzed, F is
frequency, and r is rank. Zipf's law applies to cities and populations,
notes in music, and proteins in DNA.
F = C/ralpha
In this in-class assignment, you will answer questions about different
methods to count the number of unique words and then write code to print
the top k words.
Questions on this in-class assignment refer
to the code in the wordcount classes
shown here.
Name _________________________
login id: ____________________
Counting the number of Unique Words
- The code for determining unique words in
SlowUniqueCounter.java has a bug. The value returned is
not, in general, correct. Describe what the bug is, how to fix it, and a
data file for which the current version will return a correct result.
-
The code in
SortingUniqueCounter has a bug. Describe a
simple fix for the bug, and why the fix is needed.
- (4 points)
If the following line from
SortingUniqueCounter
if (! list[k].equals(last)){
is replaced by this line:
if (list[k] != last){
the program compiles and runs, but indicates that the Melville
has 14,352 different words rather than 4,255 different words. Explain
this output (note that there are 14,353 words in the Melville file).
-
Provide a plausible explanation for why the timings
for the code in
SortingUniqueCounter and SetUniqueCounter
are similar.
-
As shown in the sample
output a run based on Hawthorne's The Scarlet Letter was
stopped. Based on the statistics shown, and the fact that there are
85,753 total words, but 14,123 different words in the text, develop
estimates for how long the program will take for each of the three
classes. Provide reasons for each of your estimates.
Print Commonly Occurring Words
- Consider ReadStuff.java, we want to
complete the code, so we can print out the number of occurrences
for each word. We would like the count of words to accurately
reflect our notion of what a word is. How does the
normalizeWord method help? Where would it fail?
- Why does the
WordCount constructor
initialize myCount to one?
- Explain the purpose of the following line from
processWords in ReadStuff.java.
myWordsCounted = (WordCount[]) counts.toArray(new WordCount[0]);
- Complete the
processWords method to store each word read from the file in an
ArrayList of WordCount objects that store the unique words and
their frequency. If a word is found (seen before) its frequency is
incremented, otherwise the word is added to the ArrayList with a
frequency of one.
- Suppose each time two WordCount objects are compared during
the lookup shown above that the cost of comparing the objects is $0.01
(one cent) and that it takes 0.001 seconds (one millisecond). In a file
of 11 different words, the 11th word processed incurs a cost of ten
cents (to check against the 10 words already stored) and takes 0.01
seconds (10 * 0.001). What is the cost of processing all 11 words in
time and money? Calculate by adding the costs of each of the 11
different words.
- What is the cost in time and money in processing a file
of 21 different/unique words? Show your reasoning.
- Try to develop a formula for the cost in time and money to process
a file of N different words. Express your answers in terms
of N (this is not simple to do, try it).
- Suppose a file consists of 500 occurrences of 'the' followed by
500 occurrences of 'cat'. What's the cost in time and money to process
this file?
The output when the program is run on Melville's
Bartleby the Scrivener follows.
processing/counting 14353 strings
# different = 4103
598 the
439 i
418 to
369 and
353 of
333 a
263 in
217 his
208 he
196 my
183 was
182 that
136 not
127 it
114 at
113 but
106 with
105 for
100 as
95 would
Note that the words are printed sorted by frequency. The array of
WordCount objects is sorted after it is created. Explain
what part of the code causes the
sort to be done by frequency, breaking ties by alphabetical order.