CPS 100 Fall 2002: Assignment #2: Anagrams

Due: Tuesday, Sept. 17 by 11pm

30 points

Two words are anagrams if they contain the same letters. For example, bagel and gable are anagrams as are glare, lager, large, and regal.

Word puzzles like the Jumble shown below, and found in many newspapers are based on anagrams too. You can also play jumbles online at jumble.com.

This assignment involves writing code and reasoning about two different methods for using a program to solve jumbles.


Jumbles

You're given a small program that prompts the user for a word and then prints all the words that are anagrams of the user-entered word. The program uses a dictionary/file specified when it's started. You must do three things with the code you're given:

  1. Use the code as the basis for writing another program named allanagrams that prints a list of all anagrams in a file/dictionary specified by the user. Each sequence of anagrams is printed on one line, and every line printed should have at least two words that are anagrams. For example, the output of the program might be as follows (depending on the dictionary/file read).
     prompt> allanagrams datafile
     gazer graze 
     gases sages 
     heals leash shale 
     heaps phase shape 
     hares hears share shear 
     earth hater heart 
     haste hates heats 
     haves shave 
     arise raise 
     lakes leaks 
    
    
  2. Implement a clas HistoNormal that uses the fingerprint/signature method of producing a canonical form for which anagrams compare as equal. More details on this are provided below.

  3. Analyze the runtime of the HistoNormal class for finding all anagrams compared to the SortNormal class. Provide two data files, one for each Normalizer that illustrates a case when the method is significantly faster than the other.

Implementing allanagrams.cpp

You're given the files/code below, all accessible in the code directory and on the acpub system in ~rodger/cps100/anagram. You can copy this code into the directory for this assignment using the instructions below.
  cd
  mkdir cd cps100
  mkdir anagram
  cp ~rodger/cps100/anagram/* .
  ln -s ~rodger/data/words .
(The link created on the last line is to a large data file of words, you should use smaller files to test your program.)

Your job is to study this code, understand it, then implement the program allanagrams described above (see step 1 of what you must do). You can write any code/classes you want to, the output of your program should be a list of all the anagrams in a data file read by the program. Each line is a list of words that are anagrams, each line printed contains at least two words (don't print a word that has no anagram other than itself).

Ideas on Finding Anagrams

The easiest way to find anagrams is to use the concepts from readwords3.cpp, a program we discussed in class.

Fingerprint Normalizer

The class SortNormal you're given uses a sorted form of a word as the normal-form that compares equal for anagrams. The method is described in the header file sortnormal.h. For example, the sorted form of "bagel" and "gable" is the same, it's the string "abegl". This sorted string is returned by the function SortNormal::normalize and used by an Anaword object when it's normalized with SortNormal.

You're to implement a new class, HistoNormal that uses a fingerprint/signature/histogram to determine if two strings are anagrams. For example the string "bagel" and "gable" have the same normalized/canonical signature histogram:

 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0

Since the function Normalizer::normalize must return a string, the histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be

 1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0

The code in ostrexample.cpp shows how an ostrstream can be used to write to a string like it's a stream. You can find information about the class ostrstream (which should be called ostringstream, but g++ is lagging in the standard) in Section 9.2.3 of Tapestry (pp 413-414).

When normalizing you should convert all characters to lowercase and you should ignore punctuation, this should be done in the Normalizer code, not (necessarily) in client code. The string returned should have 26 counts, one for each of 'a'-'z', and each count should be separated from others by a colon, there should be neither a leading nor a trailing colon, colons should separate numbers.

You should test your class by using it to find all the anagrams using the program allanagrams.cpp you wrote. You can also test it using findana.cpp provided to you, or you can write your own test program. Good test programs can receive extra credit. They must be turned in and described in your README file in order to receive the extra credit.

Analyzing the Methods

You should time (use a CTimer object) how long it takes to find all anagrams using both the sort-normalizing method and the histogram-normalizing method. You should write an explanation of what your timings show in the README file you submit. You should also create two data files, one which is faster to process using allanagrams.cpp with a SortNormal normalizer and one which is faster to process using a HistoNormal normalizer. The differences in timings should be remarkable, i.e., the more drastic the difference the better. You may need to write code to help create the data files, or you may be able to create them by editting by hand.

Hints

  1. As given, the Makefile for the assignment has this line as part of the target for making allanagrams.
    EXEC   	  = allanagrams
    SRC_FILES = allanagrams.cpp anaword.cpp sortnormal.cpp histonormal.cpp
    

    If you're compiling using make allanagrams, and you don't have a file histonormal.cpp, you'll get an error message as shown below:

      make: *** No rule to make target `histonormal.o', needed by `allanagrams'.
      make: Target `allanagrams' not remade because of errors.
      make: warning:  Clock skew detected.  Your build may be incomplete.
    
    
    To fix this problem, either create a file for histonormal.cpp so it will compile. Alternatively, remove the word histonormal.cpp from the Makefile for allanagrams. You'll need to add it back later when you have a histonormal.cpp.

  2. Your program should be reasonably efficient. Taking more than one minute to process the 45,000 words in ~rodger/data/words isn't reasonable. However, this is a small part of the final grade. Note: the technique used in readwords3.cpp of sorting and looking for runs of equal words will also work for finding all anagrams, and it's reasonably efficient.

  3. ideally you won't repeat the same word twice as an anagram, i.e., you won't print as anagrams:
       the the
    
    Even more ideally, you won't repeat words as anagrams if they differ only in case, i.e., hello Hello. However, this is a small portion of the overall grade, so it shouldn't be a major concern. Note that using a StringSet object (see Tapestry) is one way of coping with this problem.

Grading

This assignment is worth 30 points. The points will be awarded roughly as follows:

Criteria Points
implementation of HistoNormal 5
method/code in allanagrams.cpp 15
this includes quality of code, quality of algorithm/method used, correctness, comments.
README write-up/analysis 6
data files 4
Extra Credit: (great test program) 2

Submit

You should submit all .h and .cpp files, your Makefile that compiles your allanagrams.cpp and any other programs you want the graders to look at. In your README, which you must submit, you should include

You must also submit two data files as described above. Your README should describe these data files and why each is remarkably better for one normalizer than another.

To submit (your data files can have any name, but the example below uses some name to make it clear you need to submit data files).

 submit_cps100 anagram README Makefile *.h *.cpp datafile1 datafile2

If submit_cps100 doesn't work, try ~rodger/bin/submit_cps100.


Susan H. Rodger
Last modified: Thu Sep 5 10:28:48 EST 2002