For this part of the assignment you're provided a class Anaword that is used instead of the struct Ana used in Part I of this assignment simpleanagram.cpp. The class Anaword encapsulates a string and its normalized version, and provides methods for printing and comparison. It's a class-ized version of the struct, but shows how to overload operators using a class in C++ and provides mechanims for comparing two different normalizing methods.
In the assignment below you'll find first a brief description of how a normalizer works and what it is, then a summary of what you'll do, then a more complete description of the three things you'll do.
There are two programs you'll be using: one to test whether the normalizer works, testnormal.cpp and one to time the different normalizers, timenormal.cpp.
The class SortNormal is a subclass/implementation of the class Normalizer; SortNormal sorts the characters in a word (just as was done in part I of this assigment). This sorting-normalizer is returned when client code asks the NormFactory for a normalizer. You can see, for example, in anaword.cpp that the constructor requests a normalizer that will be used by all Anaword objects (the normalizer is shared Anaword objects since it is static).
The implementation of SortNormal you're given works correctly except for words containing punctuation. Running the program testnormal.cpp will show this since two failures are reported.
You can assume that strings passed to the normalizer contain only upper and lower case letters or punctuation.
When fixed, running testnormal should print nothing since all tests will pass.
The data files should consist of words that are all anagrams of each other. For example, the data file below is acceptable, though it won't show much difference in the two algorithms. The strings in the file should be anagrams of each other, but they don't have to be real "words" (found in a dictionary).
stop pots opts ptso tsop
You should compile from the commandline using
make testnormalor from Eclipse by adding a target of testnormal as explained in the Eclipse make tutorial
Run the program, either by selecting testnormal.exe in Eclipse or from the commandline:
./testnormal
You should see two failures, correct by removing punctuation as described above, recompile, re-run until nothing is printed since all tests pass.
The class SortNormal you're given uses a sorted form of a word as the normal-form that compares equal for anagrams. The method is described in the header file sortnormal.h. For example, the sorted form of "bagel" and "gable" is the same, it's the string "abegl". This sorted string is returned by the function SortNormal::normalize and used by an Anaword object when it's normalized with SortNormal.
You're to implement a new class, HistoNormal that uses a fingerprint/signature/histogram to determine if two strings are anagrams. The new class should be a subclass of Normalize (just as SortNormal is). You must create files histonormal.h and histonormal.cpp for the declaration and implementation of the class.
For example the string "bagel" and "gable" have the same normalized/canonical signature histogram:
1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0
Since the function Normalizer::normalize must return a string, the histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be
1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0
Note: colons separate the letter counts, this means the number of colons is one less than the number of counts.
The code in ostrexample.cpp shows how an ostringstream can be used to write to a string like it's a stream. You can find information about the class ostringstream in Section 9.2.3 of Tapestry (pp 413-414).
When normalizing you should convert all characters to lowercase and you should ignore punctuation, this should be done in the code you write in HistoNormal::normalize.
The string returned should have 26 counts, one for each of 'a'-'z', and each count should be separated from others by a colon, there should be neither a leading nor a trailing colon, colons should separate numbers.
Your code should examine every character in the string being normalized once. You may find the functions tolower and ispunct from #include<cctype> useful, see Tapestry section 9.1.2 (page 401).
You should test your class by running testnormal.cpp to use the HistoNormal class you implemented. However, you must do one more thing to ensure that the Anaword class uses the new normalizer; you'll need to make a modifiction to the NormFactory implementation as described below and you'll need to modify the Makefile.
Currently this code returns a SortNormal object. You'll need to change it to return a HistoNormal object. You can do this by changing the assignment to the static ourNormalizer and adding a #include.
make dependThen compile the program as follows.
make testnormal
You should time how long it takes to find all anagrams in a file (similar to what you did in Part I) using both the sort-normalizing method and the histogram-normalizing method. You can do this using the timenormal.cpp program you're given. It will read a file, time how long normalizing all the words in the file takes, and report the number of anagram groups found and the time taken.
From the commandline type
make timenormal
and from Eclipse add a target of timenormal and then compile. You can test by providing a filename on the commandline or at the prompt for a file name, e.g.,
prompt> timenormal bigfile.txt OR prompt> timenormal enter file name: bigfile.txt // output appears here.
Your goal is to create two data files, both should report only one anagram group found.
All the words in the file should be anagrams of each other and all should be different.
The data files should be good for one normalizing method and bad for the other. For the purposes of this assignment "good" means under one second, and "bad" means at least 5 times longer than good.
These data files may be large. Rather than submitting the files you must write programs to create the data files. You're given files to start, named genbadsort.cpp and genbadhisto.cpp. These programs, when executed, will create files, respectively, named badsort.txt and badhisto.txt. Each file should be bad for one normalizing method and good for the other. For example, running
timenormal badsort.txtshould report "bad" times for the SortNormal method and "good" times for the HistoNormal method. Both data files created should have one anagram group: all the words in the files are anagrams of each other.
In writing your programs to generate data files you may find it useful to look at two programs.
make genbadhistoor create an Eclipse make Target named genbadhisto to compile. When the program runs it should create text files as noted above. The program-skeletons you're given will create files, but you must write to these files.
Criteria | Points |
---|---|
Punctuation in SortNormal | 4 |
implementation of HistoNormal | 16 |
this includes quality of code, quality of algorithm/method used, correctness, comments. | |
code for generating data files | 10 |
README write-up/analysis | 6 |
To submit use either Eclipse and anapartII or
submit_cps100 anapartII Makefile *.h *.cpp READMEBe sure to submit .h files, .cpp files and your Makefile!!! (and, of course, your README).