CPS 100, Fall 2000, Anagrams

Two words are anagrams if they contain the same letters. For example, bagel and gable are anagrams as are glare, lager, large, and regal.

Word puzzles like the Jumble shown below, and found in many newspapers are based on anagrams too.

Preliminaries

Create a directory anagram in your cps100 directory. Copy files from ~ola/cps100/anagram into your anagram directory. If you type 'ls' you should see.

You should also create a link to a large file of words that you can use to time/test your program when it's working. Use smaller files to test when developing the program.

  ln -s ~ola/data/words words

Do not run the program as it is (without implementing) Anaword::equal and Anaword::less since the call to QuickSort will cause an infinite loop.

Program Description

You'll write a program, part of which is given to you, that reads a file of words and generates as output all the anagrams in the file.

Each line of output should contain words that are anagrams of each other, for example:


 gazer graze 
 gases sages 
 heals leash shale 
 heaps phase shape 
 hares hears share shear 
 earth hater heart 
 haste hates heats 
 haves shave 
 arise raise 
 lakes leaks 

Your program should deal with words that contain punctuation. You can ignore these words, but your program should not die/stop if such a word is encountered.

You should make your own small test files, but check your final program with the input file words accessible on acpub as ~ola/data/words. (This file is /usr/dict/words from Linux.)

Coding and Algorithm

You must use the class Anaword whose declaration is given in the file anaword.h. You will need to write the implementation of this class in the file anaword.cpp although this has been started for you.

An Anaword object is constructed from a string, and prints as the string, but is compared using a normalized or canonical form created by counting the number of times each letter in the word occurs (more on this below). For example, the code fragment below prints the two lines of output shown.

Anaword a("bagel"); Anaword b("gable"); cout << a << " " << b << endl; if (a == b) cout << "they're ananagrams!" << endl; Output as shown:
 bagel gable
 they're anagrams!
The objects a and b are equal because the operator == is overloaded for Anaword objects and returns true for "bagel" and "gable" since both words have the same normalized/canonical form of letter-signature:
 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0

You must implement the member functions declared and described in anaword.h so that the real word (e.g., "bagel") is used for printing, but the canonical/normal form of the letter-signature is used for comparison using == and <

To do this you'll need to implement all the functions declared in anaword.hthat are not implemented in the anaword.cpp file you're given. This includes

For your program to run efficiently your implementations of Anaword::equal and Anaword::less should return true/false after making the minimal number of comparisons possible. Treat the vector myCounts as a 26-digit number so that, for example, "egg" > "ego" since the signatures are

  egg:   00001020000000000000000000
  ego:   00001010000000100000000000

The determination that "egg" is larger can be made after seven comparisons (the count of g's in egg is greater than the count of g's in ego and 'g' is the seventh letter of the alphabet).

A Working Program

You'll need to add code to doana.cpp to yield a working anagram program. This requires completing the function FindAnagrams using the idea below
  1. Sort the vector of Anaword objects using quicksort. You can do this with QuickSort(list, list.size()); because Anaword ojbects can be compared using <, ==, and <=. Access to QuickSort is via #include "sortall.h".

  2. All anagrams are now adjacent in the vector. Make one pass over the elements looking for adjacent elements that are equal. You must write code to determine when two or more adjacent words are equal and print anagrams one set per line as shown at the beginning of this assignment.

Faster Anagrams

You'll now develop another method of canonicalizing an Anaword that's faster than the letter-signature method. First, when you've got the program working, you should add code to time how long it takes to find and print all anagrams in ~ola/data/words which is words, a file of 45,402 words. Write down the name of the machine you used and the time it takes and include these in your README file you submit. You should then create copies of both anaword.h and anaword.cpp by typing
 cp anaword.h anawordfinger.h
 cp anaword.cpp anawordfinger.cpp
Now you have a copy of your (hopefully) correct Anaword class and implementation. You'll be re-implementing Anaword using another technique, but you'll need to submit both versions so you must make a copy.

Instead of using the letter-signature method you'll store a sorted form of the word and use this to compare Anawords. For example, the sorted form of "bagel" and "gable" is the same, it's the string "abegl". This string is used for relational comparisons. To do this you'll need to make a few changes:

  1. Remove the declaration below from the private section of Anaword. tvector<int> myCounts; // canonicalized form Then you'll add a new declaration for the sorted word: string mySortedWord; // canonicalized form

  2. The function Anaword::equal is now one line: return mySortedWord == rhs.mySortedWord;

  3. The function Anaword::less is similar: return mySortedWord < rhs.mySortedWord;

  4. The only other change needed is the function Anaword::normalize In this function you should sort the letters in mySortedWord which is a copy of myWord. To sort, use the code from selection sort which can be found in both the Sedgwick and Astrachan texts.

  5. Be sure to change the comments in anaword.h to reflect the new method used.
When the program works, time it again on the same machine you timed the original program. Write the time in the README file you create. You should also write a paragraph or explaining why you think the timings are different.

Grading

This assignment is worth 15 points. Style of the code/program counts for 5/15, correctness is 8/10 and the README is 2/10.

Submit

You'll submit a README that describes the timings of both versions of the program, includes information about how much time you spent on the assignment, and a list of people with whom you discussed the program.

Submit all .h and .cpp files and the README file. You do not need to submit the Makefile unless you modified it.

To submit use

 submit_cps100 anagram README *.h *.cpp

If submit_cps100 doesn't work, try ~ola/bin/submit100

Extra Credit

This is worth four points.

Instead of sorting using < and == for Anawords, create a different ordering so that when anagrams are printed shorter words are printed first and longer words printed last. To do this, create a function object as described in pages 543-549 of the Tapestry text. This object will be used when sorting the vector of Anawords, e.g., you'll write

AnaLenComparer analen; ... QuickSort(list,list.size(), analen);

The struct/class AnaLenComparer should compare two Anaword objects, using the length of the strings in the objects as the first criteria of comparison, and using Anaword::less only if the strings have the same length. For example:

Anaword a("tear"); Anaword b("ace"); Anaword c("race"); Here we want b to be "less" than a since it has fewer characters, but c is less than a since although they have the same number of characters, the canonicalized forms show that "acer" < "aert".


Owen L. Astrachan
Last modified: Thu Sep 7 14:33:37 EDT 2000