CPS 100, Spring 2004, Anagram Part II


Information on copying or snarfing files for Anagram Part II.

For this part of the assignment you're provided a class Anaword that is used instead of the struct Ana used in Part I of this assignment simpleanagram.cpp. The class Anaword encapsulates a string and its normalized version, and provides methods for printing and comparison. It's a class-ized version of the struct, but shows how to overload operators using a class in C++ and provides mechanims for comparing two different normalizing methods.

Overview of What You'll Do

You'll look at an inheritance hierarchy of two different ways of comparing normalized forms of a word to determine if one word is an anagram of another. You'll implement the two ways, compare them, and write up your findings.

In the assignment below you'll find first a brief description of how a normalizer works and what it is, then a summary of what you'll do, then a more complete description of the three things you'll do.


Overview of Normal Forms and Programs

The code you'll start with uses a sorted form for comparing, using "abegl" for "bagel" and "gable" as in part I. But, the sorted form is obtained from calling a Normalizer object as described below.

There are two programs you'll be using: one to test whether the normalizer works, testnormal.cpp and one to time the different normalizers, timenormal.cpp.

The class SortNormal is a subclass/implementation of the class Normalizer; SortNormal sorts the characters in a word (just as was done in part I of this assigment). This sorting-normalizer is returned when client code asks the NormFactory for a normalizer. You can see, for example, in anaword.cpp that the constructor requests a normalizer that will be used by all Anaword objects (the normalizer is shared Anaword objects since it is static).

void Anaword::normalize() // postcondition: mySortedWord is sorted version of myWord { if (ourNormalizer == 0) { ourNormalizer = NormFactory::getNormalizer(); } myNormalizedWord = ourNormalizer->normalize(myWord); }

The implementation of SortNormal you're given works correctly except for words containing punctuation. Running the program testnormal.cpp will show this since two failures are reported.

Summary of What to Do

You're to do three things for part II of this assignment.
  1. Fix the code in sortnormal.cpp so that it ignores punctuation. You can do this by adding one line, an appropriate call to StripPunc from the Tapestry, strutils.h functions.

    You can assume that strings passed to the normalizer contain only upper and lower case letters or punctuation.

    When fixed, running testnormal should print nothing since all tests will pass.

  2. Implement a subclass of Normalizer named HistoNormal that uses the fingerprint/signature method of producing a canonical form for which anagrams compare as equal. More details on this are provided below.

  3. Analyze the runtime of the HistoNormal class for finding all anagrams compared to the SortNormal class. You'll need to write two programs, described below, to create data files that illustrate the differences between the two methods. The programs will generate data files, one for each Normalizer, that illustrate a case when each method is significantly faster than the other. Your README file should include your analysis of the running times of the two Normalizer subclasses. More details on this analysis are provided below.

    The data files should consist of words that are all anagrams of each other. For example, the data file below is acceptable, though it won't show much difference in the two algorithms. The strings in the file should be anagrams of each other, but they don't have to be real "words" (found in a dictionary).

       stop pots opts ptso tsop
    

  1. Fix Code: No Punctuation
  2. You'll modify sortnormal.cpp for this part of the assignment.

    You should compile from the commandline using

      make testnormal
    
    or from Eclipse by adding a target of testnormal as explained in the Eclipse make tutorial

    Run the program, either by selecting testnormal.exe in Eclipse or from the commandline:

      ./testnormal
    

    You should see two failures, correct by removing punctuation as described above, recompile, re-run until nothing is printed since all tests pass.

  3. Subclass: Fingerprint Normalizer
  4. You'll create and implement histonormal.h and histonormal.cpp in this part of the assignment.

    The class SortNormal you're given uses a sorted form of a word as the normal-form that compares equal for anagrams. The method is described in the header file sortnormal.h. For example, the sorted form of "bagel" and "gable" is the same, it's the string "abegl". This sorted string is returned by the function SortNormal::normalize and used by an Anaword object when it's normalized with SortNormal.

    You're to implement a new class, HistoNormal that uses a fingerprint/signature/histogram to determine if two strings are anagrams. The new class should be a subclass of Normalize (just as SortNormal is). You must create files histonormal.h and histonormal.cpp for the declaration and implementation of the class.

    For example the string "bagel" and "gable" have the same normalized/canonical signature histogram:

     1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    
    which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
     3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0
    

    Since the function Normalizer::normalize must return a string, the histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be

     1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0
    

    Note: colons separate the letter counts, this means the number of colons is one less than the number of counts.

    The code in ostrexample.cpp shows how an ostringstream can be used to write to a string like it's a stream. You can find information about the class ostringstream in Section 9.2.3 of Tapestry (pp 413-414).

    When normalizing you should convert all characters to lowercase and you should ignore punctuation, this should be done in the code you write in HistoNormal::normalize.

    The string returned should have 26 counts, one for each of 'a'-'z', and each count should be separated from others by a colon, there should be neither a leading nor a trailing colon, colons should separate numbers.

    Your code should examine every character in the string being normalized once. You may find the functions tolower and ispunct from #include<cctype> useful, see Tapestry section 9.1.2 (page 401).

    You should test your class by running testnormal.cpp to use the HistoNormal class you implemented. However, you must do one more thing to ensure that the Anaword class uses the new normalizer; you'll need to make a modifiction to the NormFactory implementation as described below and you'll need to modify the Makefile.

    Using the New Normalizer

    When all tests pass, you're done with this part.

  5. Analyzing the Methods
  6. You'll write code in genbadsort.cpp and genbadhisto.cpp for this part of the assignment, and you'll write descriptions of your methods in your README.

    You should time how long it takes to find all anagrams in a file (similar to what you did in Part I) using both the sort-normalizing method and the histogram-normalizing method. You can do this using the timenormal.cpp program you're given. It will read a file, time how long normalizing all the words in the file takes, and report the number of anagram groups found and the time taken.

    From the commandline type

       make timenormal
    

    and from Eclipse add a target of timenormal and then compile. You can test by providing a filename on the commandline or at the prompt for a file name, e.g.,

      prompt> timenormal bigfile.txt
    
    OR
    
      prompt> timenormal
      enter file name: bigfile.txt
    
      // output appears here.
    

    Your goal is to create two data files, both should report only one anagram group found.

    All the words in the file should be anagrams of each other and all should be different.

    The data files should be good for one normalizing method and bad for the other. For the purposes of this assignment "good" means under one second, and "bad" means at least 5 times longer than good.

    These data files may be large. Rather than submitting the files you must write programs to create the data files. You're given files to start, named genbadsort.cpp and genbadhisto.cpp. These programs, when executed, will create files, respectively, named badsort.txt and badhisto.txt. Each file should be bad for one normalizing method and good for the other. For example, running

      timenormal badsort.txt
    
    should report "bad" times for the SortNormal method and "good" times for the HistoNormal method. Both data files created should have one anagram group: all the words in the files are anagrams of each other.

    In writing your programs to generate data files you may find it useful to look at two programs.

    Compiling genbadXX programs

    The programs are standalone programs, they don't use any normalizing or anaword classes, just Tapestry classe and standard C++ classes and functions. This means you can compile from the commandline with:
       make genbadhisto
    
    or create an Eclipse make Target named genbadhisto to compile. When the program runs it should create text files as noted above. The program-skeletons you're given will create files, but you must write to these files.

    Writeup in README

    You should write an explanation of what your timings show in the README file you submit. Specifically, you should describe the data files generated by your genbadXX programs, and you should include timings for both of them. You should explain why the times are good/bad for each normalizing method for each data file. You descriptions should make it clear that you understand why the timings are different and why each method can be good in one case and bad in another.

Grading

Part II is worth 36 points. The points will be awarded as follows:

Criteria Points
Punctuation in SortNormal 4
implementation of HistoNormal 16
this includes quality of code, quality of algorithm/method used, correctness, comments.
code for generating data files 10
README write-up/analysis 6

Submit

You should submit all .h and .cpp files, your Makefile that compiles your programs, and a README. In your README you should include

To submit use either Eclipse and anapartII or

 submit_cps100 anapartII Makefile *.h *.cpp README
Be sure to submit .h files, .cpp files and your Makefile!!! (and, of course, your README).


Owen L. Astrachan
Last modified: Thu Jan 22 12:19:39 EST 2004