CPS 100, Spring 2004, Anagram Part II

Information on copying or snarfing files for Anagram Part II.

For this part of the assignment you're provided a class Anaword that is used instead of the struct Ana used in Part I of this assignment (simpleanagram.cpp). The class Anaword encapsulates a string and its normalized version, and provides methods for printing and comparison. It's a class-based version of the struct, but it also shows how to overload operators using a class in C++ and provides a mechanism for comparing two different normalizing methods.

Overview of What You'll Do

You'll look at an inheritance hierarchy of two different ways of comparing normalized forms of a word to determine if one word is an anagram of another. You'll implement the two ways, compare them, and write up your findings.

In the assignment below you'll find first a brief description of how a normalizer works and what it is, then a summary of what you'll do, then a more complete description of the three things you'll do.

Overview of Normal Forms and Programs

The code you'll start with uses a sorted form for comparing, using "abegl" for both "bagel" and "gable" as in Part I. However, the sorted form is obtained by calling a Normalizer object as described below.

There are two programs you'll be using: one to test whether the normalizer works, testnormal.cpp, and one to time the different normalizers, timenormal.cpp.

The class SortNormal is a subclass/implementation of the class Normalizer; SortNormal sorts the characters in a word (just as was done in Part I of this assignment). This sorting-normalizer is returned when client code asks the NormFactory for a normalizer. You can see, for example, in anaword.cpp that the normalize function requests a normalizer that will be used by all Anaword objects (the normalizer is shared by all Anaword objects since it is static).

  void Anaword::normalize()
  // postcondition: myNormalizedWord is normalized version of myWord
  {
      if (ourNormalizer == 0) {
          ourNormalizer = NormFactory::getNormalizer();
      }
      myNormalizedWord = ourNormalizer->normalize(myWord);
  }
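The relationship between the word class, the normalizer hierarchy, and the factory can be sketched as follows. This is only an illustration: the names SketchNormalizer, SketchSortNormal, SketchFactory, and SketchWord are hypothetical stand-ins, and the signatures are assumptions rather than the declarations in the files you're given.

```cpp
#include <algorithm>
#include <string>

// Hypothetical interface: a normalizer maps a word to a canonical form.
class SketchNormalizer {
public:
    virtual ~SketchNormalizer() {}
    virtual std::string normalize(const std::string& word) const = 0;
};

// Hypothetical subclass: the canonical form is the sorted characters.
class SketchSortNormal : public SketchNormalizer {
public:
    std::string normalize(const std::string& word) const override {
        std::string copy(word);
        std::sort(copy.begin(), copy.end());
        return copy;
    }
};

// Hypothetical factory: the one place that decides which subclass is used.
class SketchFactory {
public:
    static const SketchNormalizer* getNormalizer() {
        static SketchSortNormal instance;  // created once, then reused
        return &instance;
    }
};

// Hypothetical word class: all objects share one static normalizer.
class SketchWord {
public:
    explicit SketchWord(const std::string& word) : myWord(word) {
        if (ourNormalizer == nullptr) {
            ourNormalizer = SketchFactory::getNormalizer();
        }
        myNormalizedWord = ourNormalizer->normalize(myWord);
    }
    const std::string& normalized() const { return myNormalizedWord; }
private:
    static const SketchNormalizer* ourNormalizer;
    std::string myWord;
    std::string myNormalizedWord;
};

const SketchNormalizer* SketchWord::ourNormalizer = nullptr;
```

Because client code only ever asks the factory for a normalizer, swapping in a different subclass requires a change in one place.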

The implementation of SortNormal you're given works correctly except for words containing punctuation. Running the program testnormal.cpp will show this since two failures are reported.

Summary of What to Do

You're to do three things for part II of this assignment.

1. Fix the code in sortnormal.cpp so that it ignores punctuation. You can do this by adding one line, an appropriate call to StripPunc from the Tapestry strutils.h functions.
   You can assume that strings passed to the normalizer contain only upper and lower case letters or punctuation.
   When fixed, running testnormal should print nothing since all tests will pass.
2. Implement a subclass of Normalizer named HistoNormal that uses the fingerprint/signature method of producing a canonical form for which anagrams compare as equal. More details on this are provided below.
3. Analyze the runtime of the HistoNormal class for finding all anagrams compared to the SortNormal class. You'll need to write two programs, described below, to create data files that illustrate the differences between the two methods. The programs will generate data files, one for each Normalizer, that illustrate a case when each method is significantly faster than the other. Your README file should include your analysis of the running times of the two Normalizer subclasses. More details on this analysis are provided below.
   The data files should consist of words that are all anagrams of each other. For example, the data file below is acceptable, though it won't show much difference in the two algorithms. The strings in the file should be anagrams of each other, but they don't have to be real "words" (found in a dictionary).
```
   stop pots opts ptso tsop
```

Fix Code: No Punctuation

You'll modify sortnormal.cpp for this part of the assignment.

You should compile from the commandline using

  make testnormal

or from Eclipse by adding a target named testnormal, as described in the Eclipse make tutorial.

Run the program, either by selecting testnormal.exe in Eclipse or from the commandline:

  ./testnormal

You should see two failures. Correct them by removing punctuation as described above, then recompile and re-run until nothing is printed, which means all tests pass.
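For comparison, here is a minimal standard-library sketch of a sorted form that ignores punctuation. It does not use StripPunc or any Tapestry code; the function name sortedWithoutPunctuation is hypothetical, and the sketch leaves the case of letters unchanged.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the characters of word in sorted order with punctuation dropped.
// Hypothetical helper, not part of the assignment's code.
std::string sortedWithoutPunctuation(const std::string& word) {
    std::string result(word);
    // Erase-remove idiom: move the characters to keep forward, trim the rest.
    result.erase(std::remove_if(result.begin(), result.end(),
                                [](unsigned char ch) {
                                    return std::ispunct(ch) != 0;
                                }),
                 result.end());
    std::sort(result.begin(), result.end());
    return result;
}
```

With this sketch, "bagel!" and "g.a-b'le" both produce "abegl".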

Subclass: Fingerprint Normalizer

You'll create and implement histonormal.h and histonormal.cpp in this part of the assignment.

The class SortNormal you're given uses a sorted form of a word as the normal form that compares equal for anagrams. The method is described in the header file sortnormal.h. For example, the sorted form of "bagel" and "gable" is the same: the string "abegl". This sorted string is returned by the function SortNormal::normalize and used by an Anaword object when it's normalized with SortNormal.

You're to implement a new class, HistoNormal, that uses a fingerprint/signature/histogram to determine if two strings are anagrams. The new class should be a subclass of Normalizer (just as SortNormal is). You must create files histonormal.h and histonormal.cpp for the declaration and implementation of the class.

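For example, the strings "bagel" and "gable" have the same normalized/canonical signature histogram:

 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0

A word with three a's, two r's, and one each of d, k, and v (for example "aardvark") has the histogram:

 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0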

Since the function Normalizer::normalize must return a string, the histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be

 1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0

Note: colons separate the letter counts, so the number of colons is one less than the number of counts.

The code in ostrexample.cpp shows how an ostringstream can be used to write to a string as if it were a stream. You can find information about the class ostringstream in Section 9.2.3 of Tapestry (pp 413-414).
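As a small illustration of the ostringstream idea (this is a sketch, not the code in ostrexample.cpp), the hypothetical function joinCounts below writes a list of counts to a string, placing a separator between counts but not before the first or after the last.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Writes counts to a string, separated by the given character.
// Hypothetical helper for illustration only.
std::string joinCounts(const std::vector<int>& counts, char separator) {
    std::ostringstream out;
    for (std::size_t k = 0; k < counts.size(); ++k) {
        if (k > 0) {
            out << separator;  // separator only between counts
        }
        out << counts[k];
    }
    return out.str();          // the characters written so far, as a string
}
```

For example, joining the counts 1, 1, 0 with a colon gives "1:1:0".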

When normalizing you should convert all characters to lowercase and you should ignore punctuation. This should be done in the code you write in HistoNormal::normalize.

The string returned should have 26 counts, one for each of 'a'-'z', with each count separated from the others by a colon. Colons should only separate numbers: there should be neither a leading nor a trailing colon.

Your code should examine every character in the string being normalized once. You may find the functions tolower and ispunct from #include <cctype> useful; see Tapestry section 9.1.2 (page 401).
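The counting step can be sketched as a single pass over the string. The function name letterCounts is hypothetical, and this sketch makes two assumptions: it uses isalpha to skip anything that isn't a letter (rather than testing ispunct), and it relies on 'a'-'z' being contiguous, as they are in ASCII.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Returns 26 counts, one per letter 'a'-'z', ignoring case and non-letters.
// Hypothetical helper; each character of word is examined exactly once.
std::vector<int> letterCounts(const std::string& word) {
    std::vector<int> counts(26, 0);
    for (char raw : word) {
        unsigned char ch = static_cast<unsigned char>(raw);
        if (std::isalpha(ch)) {
            counts[std::tolower(ch) - 'a']++;  // assumes contiguous 'a'-'z'
        }
    }
    return counts;
}
```

With this sketch "bagel" and "Gable!" yield the same 26 counts, which is what makes the histogram usable as a canonical form.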

You should test your class by running testnormal so that it uses the HistoNormal class you implemented. However, you must do one more thing to ensure that the Anaword class uses the new normalizer: you'll need to make a modification to the NormFactory implementation as described below, and you'll need to modify the Makefile.

Using the New Normalizer

To use the new normalizer you'll need to change the code in the normfactory.cpp implementation.
Currently this code returns a SortNormal object. You'll need to change it to return a HistoNormal object. You can do this by changing the assignment to the static ourNormalizer and adding a #include.
You'll need to modify the Makefile to add histonormal.cpp to the line in which the .cpp files are specified for the target testnormal. You can add histonormal.cpp to the line for the timenormal target as well.
If you're using the command line, then after modifying the Makefile type the command below to update the Makefile (ignore the warnings/errors):
  make depend
Then compile the program as follows.
```
  make testnormal
```
If you're using Eclipse, modify the Makefile to add histonormal.cpp to all the target .cpp lines and then compile using the target testnormal you added earlier.

Analyzing the Methods

You'll write code in genbadsort.cpp and genbadhisto.cpp for this part of the assignment, and you'll write descriptions of your methods in your README.

You should time how long it takes to find all anagrams in a file (similar to what you did in Part I) using both the sort-normalizing method and the histogram-normalizing method. You can do this using the timenormal.cpp program you're given. It will read a file, time how long normalizing all the words in the file takes, and report the number of anagram groups found and the time taken.
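The measurement can be sketched as below. This is not the code in timenormal.cpp: the names TimingResult, timeNormalizer, and sortedForm are hypothetical, std::chrono stands in for whatever timer the provided program uses, and counting distinct normalized forms is an assumption about how anagram groups are counted.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical result record: number of groups and elapsed seconds.
struct TimingResult {
    std::size_t groups;
    double seconds;
};

// Example normalizer for the sketch: sorted characters.
std::string sortedForm(const std::string& word) {
    std::string copy(word);
    std::sort(copy.begin(), copy.end());
    return copy;
}

// Times only the normalizing step, then counts distinct normalized forms.
TimingResult timeNormalizer(
        const std::vector<std::string>& words,
        const std::function<std::string(const std::string&)>& normalize) {
    std::vector<std::string> forms;
    forms.reserve(words.size());

    auto start = std::chrono::steady_clock::now();
    for (const std::string& w : words) {
        forms.push_back(normalize(w));
    }
    auto stop = std::chrono::steady_clock::now();

    std::set<std::string> distinct(forms.begin(), forms.end());
    return { distinct.size(),
             std::chrono::duration<double>(stop - start).count() };
}
```

Calling timeNormalizer with the words stop, pots, opts, ptso, tsop and sortedForm reports one group.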

From the commandline type

   make timenormal

and from Eclipse add a target of timenormal and then compile. You can test by providing a filename on the commandline or at the prompt for a file name, e.g.,

  prompt> timenormal bigfile.txt

OR

  prompt> timenormal
  enter file name: bigfile.txt

  // output appears here.

Your goal is to create two data files. For both files, timenormal should report only one anagram group found.

All the words in the file should be anagrams of each other and all should be different.
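Before timing a generated file it can help to check these two requirements. The function below is a hypothetical checker, not part of the provided code; it compares sorted characters, so it assumes the words contain no punctuation and use consistent case.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

// Returns true if every word is different from the others and all words
// are anagrams of the first one. Hypothetical helper for self-checking.
bool isSingleGroupOfDistinctWords(const std::vector<std::string>& words) {
    if (words.empty()) {
        return false;
    }
    std::string reference(words[0]);
    std::sort(reference.begin(), reference.end());

    std::set<std::string> seen;
    for (const std::string& w : words) {
        if (!seen.insert(w).second) {
            return false;              // duplicate word
        }
        std::string sorted(w);
        std::sort(sorted.begin(), sorted.end());
        if (sorted != reference) {
            return false;              // not an anagram of the first word
        }
    }
    return true;
}
```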

The data files should be good for one normalizing method and bad for the other. For the purposes of this assignment "good" means under one second, and "bad" means at least 5 times longer than good.

These data files may be large. Rather than submitting the files you must write programs to create the data files. You're given files to start, named genbadsort.cpp and genbadhisto.cpp. These programs, when executed, will create files, respectively, named badsort.txt and badhisto.txt. Each file should be bad for one normalizing method and good for the other. For example, running

  timenormal badsort.txt

should be bad (slow) when SortNormal is the normalizer and good (fast) when HistoNormal is the normalizer.

In writing your programs to generate data files you may find it useful to look at two programs.

permana.cpp tries all permutations of a string in a brute-force anagram-finding method we talked about in class. You can use the ideas/code in this program to print all permutations of a given word. This can lead to a data file that's good for one normalizing method and bad for another.
shuffle.cpp is program 8.4 (Page 354) in Tapestry. The code in function Shuffle shows how to re-arrange the elements in a vector randomly. You can use this idea to re-arrange the elements in a string. By choosing a long string to start with (and shuffling it) you can create a data file that's good for one normalizing method and bad for another (output several shuffled copies of a long string).
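Both ideas can be sketched with the standard library. This is not the code in permana.cpp or shuffle.cpp: the function names are hypothetical, std::next_permutation and std::shuffle replace the techniques in those programs, and the seed parameter is an assumption added so runs can be repeated.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <set>
#include <string>
#include <vector>

// All distinct rearrangements of word, in lexicographic order.
std::vector<std::string> allPermutations(std::string word) {
    std::sort(word.begin(), word.end());   // start from the smallest ordering
    std::vector<std::string> result;
    do {
        result.push_back(word);
    } while (std::next_permutation(word.begin(), word.end()));
    return result;
}

// Up to `copies` different shuffles of base. A set discards repeats, and
// the attempt limit keeps the loop finite when base has few rearrangements.
std::vector<std::string> shuffledCopies(std::string base, std::size_t copies,
                                        unsigned seed) {
    std::mt19937 gen(seed);
    std::set<std::string> unique;
    std::size_t attempts = 0;
    while (unique.size() < copies && attempts < 10 * copies) {
        std::shuffle(base.begin(), base.end(), gen);
        unique.insert(base);
        ++attempts;
    }
    return std::vector<std::string>(unique.begin(), unique.end());
}
```

The first function produces many short words (24 for "stop") and the second produces a few long ones. Which kind of file is bad for which normalizer is for you to determine from your timings.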

Compiling genbadXX programs

From the commandline type, for example,

   make genbadhisto

and from Eclipse add a target of genbadhisto and then compile.

Writeup in README

Your README should describe the methods used in your genbadXX programs and should include timings for both of them.

Grading

Part II is worth 36 points. The points will be awarded as follows:

Criteria                            Points
Punctuation in SortNormal           4
Implementation of HistoNormal       16
  (this includes quality of code, quality of algorithm/method used, correctness, comments)
Code for generating data files      10
README write-up/analysis            6

Submit

You should submit all .h and .cpp files, your Makefile that compiles your programs, and a README. In your README you should include

How long you spent on the assignment.
Names of people you spoke to about the assignment.
A description of the timings for the two different normalizer methods. This description should be clear and complete. In all likelihood that means you'll need more than a single sentence describing the differences and why they occur.

To submit use either Eclipse and anapartII or

 submit_cps100 anapartII Makefile *.h *.cpp README

Be sure to submit .h files, .cpp files and your Makefile!!! (and, of course, your README).

Owen L. Astrachan

Last modified: Thu Jan 22 12:19:39 EST 2004