Compsci 100, Spring 2006, Anagram Part II


To get the code to start this project, use the Ambient/snarf plugin for anagramPartII (alternatively you can browse the code directory). The site for Compsci 100, Spring 2006, for your snarf browser is www.cs.duke.edu/courses/spring06/cps100/snarf which you can enter by right-clicking in the snarf-site browser window of Eclipse.

This is a groups/pair project. Submit one project per group.


The purpose of Part II is to explore alternate methods for determining if two words are anagrams and to understand some principles of object-oriented programming that will facilitate exploring the tradeoffs in alternate algorithms for finding anagrams.

You'll do three things in this part of the assignment.

  1. Modify the Anaword class to have a two-parameter constructor: a word and its normalized form. The normalized form will be created in your model using the new NormalizerFactory (details below).

  2. Create one new normalizer: a way of determining if words are anagrams. This is described below. You'll create one from scratch, and you're given two that work completely.

  3. Develop methods/experiments that highlight the performance characteristics of two of the normalizers: make each one very good and each one very bad. Write up in your README the details of this experiment and your findings.

1. Normalizers

In the original code for Anaword constructing an Anaword object triggered a call to the private/protected normalize method which then created a sorted-version of the string used in construction. Thus each Anaword contains both the original String, e.g., "teacher" and a sorted form of that string, e.g., "aceehrt". These are stored, respectively, in the instance variables myWord and mySortedWord.

Here's the relevant code.

public class Anaword { protected String myWord; protected String mySortedWord; public Anaword(String s) { myWord = s; normalize(); } ... protected void normalize() { char [] temp = myWord.toCharArray(); Arrays.sort(temp); mySortedWord = new String(temp); } ... Now the normalization code will be encapsulated in a new class: SortNormalizer that is part of hierarchy of normalizers as shown below. Your code will use one of these normalizers and pass the normalized string to the Anaword constructor so that two strings are passed in your new version rather than one as in the old version of Anaword.

*

The sorted form a word, e.g., "abegl" is the same for both "bagel" and "gable", is created by the SortNormalizer method normalize --- note that this code is the same as the code in the original Anaword class, but the sorted string is returned. we know that for "bagel" and "gable" is the same, it's the string "abegl".

Here's code from the original AnagramModel method setWords

protected void setWords(Iterator<String> words) { myWords = new ArrayList<Anaword>(); while (words.hasNext()) { myWords.add(new Anaword(words.next())); } } Using an Anaword object with two parameters you'd write this code to use a SortNormalizer instead. protected void setWords(Iterator<String> words) { myWords = new ArrayList<Anaword>(); INormalizer normer = new SortNormalizer(); while (words.hasNext()) { String s = words.next(); myWords.add(new Anaword(s,normer.normalize(s))); } } Now the model is responsible for creating the sorted or normalized form of a word.

2. New Normalizer

You are to implement a new class, FingerPrintNormalizer that uses a fingerprint/signature/histogram to determine if two strings are anagrams. The new class should implement the INormalizer interface as shown in the diagram above.

FingerPrint/Histogram Normalizer

For example the string "bagel" and "gable" have the same normalized/canonical signature histogram:
 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
 3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0

Note that the histogram can also be a String. The histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be

 1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0

Note: colons separate the letter counts, this means the number of colons is one less than the number of counts.

When normalizing you should convert all characters to lowercase and you should ignore punctuation, this should be done in the code you write in the normalize method for FingerPrintNormalizer.

The string returned should have 26 counts, one for each of 'a'-'z', and each count should be separated from others by a colon, there should be neither a leading nor a trailing colon, colons should separate numbers.

Your code should examine every character in the string being normalized once.

You may find the static methods Character.toLowerCase and Character.isLetter useful.

Final Uses of Normalizers

Using your (from Part I) class AllAnagramModel you should make sure that your new FingerPrintNormalizer normalizer is finding the same anagrams as the original SortNormalizer class. Then you'll need to use the NormalizerFactory class to get a normalized form of a word rather than using the INormalizer directly. Here's a final version of the code that scans all words and uses the factory to get a normalized word (from AnagramModel). protected void setWords(Iterator<String> words) { myWords = new ArrayList<Anaword>(); while (words.hasNext()) { String s = words.next(); String n = NormalizerFactory.getNormalized(s); myWords.add(new Anaword(s,n)); } } The NormalizerFactory uses whatever normalizer it's been set up with. For example, here's a main that sets it up to use the TextingNormalizer which makes two words "anagrams" if they have the same sequence of cell-phone keypresses to create them: public static void main(String[] args){ NormalizerFactory.setNormalizer(new TextingNormalizer()); SimpleViewer av = new SimpleViewer("anagram finder", ""); AnagramModel model = new AllAnagramModel(); av.setModel(model); model.initialize(new Scanner(model.getClass().getResourceAsStream("/lowerwords.txt"))); } This will show some interesting "anagrams", e.g., as follows

 pistol shrunk
 aims bins chop
 kiss lips lisp


3. Experiments

You should time how long it takes to find all anagrams in several files using a modified version of your AllAnagramModel from Part I. To do this you can use TimingView and TimingMain classes. These will read a number of files specified by the names array In TimingMain and report the number of anagram groups found and the time taken. You'll want to change the INormalizer used by the NormalizerFactory when performing the experiments outlined here when you run the program using different files/normalizers. You can run with different normalizers in one run by re-running the experiments after setting the NormalizerFactory normalizer between runs.

Depending on how you wrote the AllAnagramModel in Part I you'll likely need to change it. First, do NOT do the extra-credit, fancy sorting of the output for this experiment if you did that in part I. Rather than creating an ArrayList of Strings, where each string is a collection of anagrams, as you likely did in part I, you should create one ArrayList of Objects, where each Object represents a set of anagrams. Simply create an ArrayList of strings for each group of anagrams, and this ArrayList can be the Object added to the collection that is passed an IView update method. In summary: you want to make sorting the Anaword objects than scanning for equal ones, the only computationally-intensive part of your AllAnagramModel code. Do not concatenate strings and do not sort the list of anagrams.

Your goal is to create two data files.

The data files should be good for one normalizing method and bad for the other. For the purposes of this assignment the timings should differ by a factor of at least two.

You will create badsort.txt that results in "bad" times when using the SortNormalizer and "good" times for FingerPrintNormalizer, while badfinger.txt should result in the opposite behavior, i.e., good times for SortNormalizer.

These data files may be large. A better method than just submitting manually generated files is to write programs to create the data files. Rather than submitting the files you may write programs to create the data files for extra credit.

You can get extra credit if all the words in the file are anagrams of each other, and more bonus points if all the words are different, though that's hard (or impossible).

Writeup in README

You should write an explanation of what your timings show in the README file you submit. Specifically, you should describe your data files and you should include timings for both of them. You should explain why the times are good/bad for each normalizing method for each data file. You descriptions should make it clear that you understand why the timings are different and why each normalizing method can be good in one case and bad in another.

Grading

Part II is worth 22 points. The points will be awarded as follows:

Criteria Points
using Normalizers with Anaword 4
implementation of FingerPrintNormalizer 8
This category includes quality of code, quality of algorithm/method used, correctness, comments.
README write-up/analysis 10
Extra credit max 3

Submit

You should submit all source code, test files, and a README with Eclipse using assignment name anagrampartII. In your README you should include


Owen L. Astrachan
Last modified: Sun Jan 29 14:54:02 EST 2006