www.cs.duke.edu/courses/spring06/cps100/snarf which you
can enter by right-clicking in the snarf-site browser window of Eclipse.
This is a groups/pair project. Submit one project per group.
The purpose of Part II is to explore alternate methods for determining if two words are anagrams and to understand some principles of object-oriented programming that will facilitate exploring the tradeoffs in alternate algorithms for finding anagrams.
You'll do three things in this part of the assignment.
Anaword class to have a two-parameter
constructor: a word and its normalized form. The normalized form will be
created in your model using the new NormalizerFactory
(details below).
In the original code for Anaword
constructing an Anaword object triggered a call to the
private/protected normalize method which then created a
sorted-version of the string used in construction. Thus each
Anaword contains both the original String, e.g., "teacher"
and a sorted form of that string, e.g., "aceehrt". These are stored,
respectively, in the instance variables myWord and
mySortedWord.
Here's the relevant code.
SortNormalizer that is part of hierarchy of normalizers as
shown below. Your code will use one of these normalizers and pass the
normalized string to the Anaword constructor so that two
strings are passed in your new version rather than one as in the old
version of Anaword.
The sorted form a word, e.g., "abegl" is the same for both "bagel" and
"gable", is created by the SortNormalizer method
normalize --- note that this code is the same as the code
in the original Anaword class, but the sorted string is
returned. we know that for "bagel" and "gable" is the same, it's the
string "abegl".
Here's code from the original AnagramModel
method setWords
Anaword object with two parameters you'd write
this code to use a SortNormalizer instead.
You are to implement a new class,
FingerPrintNormalizer
that uses a fingerprint/signature/histogram to
determine if two strings are anagrams. The new class should
implement the
INormalizer interface as shown
in the diagram above.
1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0which indicates one 'a', one 'b', one 'e', one 'g', and one 'l'. The signature of aardvark, for example, is
3 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0
Note that the histogram can also be a String. The histogram should be converted to a string with each count separated from others by a colon (or other character). For example, the normalized form of "bagel" and "gable" would be
1:1:0:0:1:0:1:0:0:0:0:1:0:0:0:0:0:0:0:0:0:0:0:0:0:0
Note: colons separate the letter counts, this means the number of colons is one less than the number of counts.
When normalizing you should convert all characters to lowercase
and you should ignore punctuation, this should be
done in the code you write in the normalize method for
FingerPrintNormalizer.
The string returned should have 26 counts, one for each of 'a'-'z', and each count should be separated from others by a colon, there should be neither a leading nor a trailing colon, colons should separate numbers.
Your code should examine every character in the string being normalized once.
You may find the static methods Character.toLowerCase and
Character.isLetter useful.
FingerPrintNormalizer normalizer is
finding the same anagrams as the original SortNormalizer
class. Then you'll need to use the NormalizerFactory class to get a
normalized form of a word rather than using the INormalizer
directly. Here's a final version of the code that scans all words and
uses the factory to get a normalized word (from AnagramModel).
NormalizerFactory uses whatever normalizer it's been
set up with. For example, here's a main that sets it up to
use the TextingNormalizer
which makes two words "anagrams" if they have the same sequence of
cell-phone keypresses to create them:
pistol shrunk aims bins chop kiss lips lisp
You should time how long it takes to find all anagrams in
several files using a modified version of
your AllAnagramModel from Part I. To do this you can use TimingView and TimingMain classes. These will read a
number of files specified by the names array In
TimingMain and report the number of anagram groups found
and the time taken. You'll want to change the INormalizer
used by the NormalizerFactory
when performing the experiments outlined here when you run the program
using different files/normalizers. You can run with different
normalizers in one run by re-running the experiments after setting
the NormalizerFactory normalizer between runs.
Depending on how you wrote the
Your goal is to create two data files.
The data files should be good for one
normalizing method and bad for the other. For the purposes of this
assignment the timings should differ by a factor of at least two.
You will create badsort.txt that results in "bad" times when
using the
These data files may be large. A better method than just submitting
manually generated files is to write programs to create the data
files. Rather than submitting the files you may write programs to create
the data files for extra credit.
You can get extra credit if all the words in the file are
anagrams of each other, and more bonus points if all the words are
different, though that's hard (or impossible).
AllAnagramModel in Part
I you'll likely need to change it. First, do NOT do the
extra-credit, fancy sorting of the output for this experiment if you did
that in part I. Rather than creating an ArrayList of
Strings, where each string is a collection of anagrams, as you likely
did in part I, you should create one ArrayList of Objects,
where each Object represents a set of anagrams. Simply create an
ArrayList of strings for each group of anagrams, and this ArrayList can
be the Object added to the collection that is passed an
IView update method. In summary: you want to make sorting
the Anaword objects than scanning for equal ones, the only
computationally-intensive part of your AllAnagramModel
code.
Do not concatenate strings and do not sort the list of anagrams.
SortNormalizer and "good" times for
FingerPrintNormalizer, while badfinger.txt should
result in the opposite behavior, i.e., good times for
SortNormalizer.
Writeup in README
You should write an explanation of what your timings show in the README
file you submit. Specifically, you should describe your data files
and you should include
timings for both of them. You should explain why the times are good/bad
for each normalizing method for each data file. You descriptions should
make it clear that you understand why the timings are different and why
each normalizing method can be good in one case and bad in another.
Grading
Part II is worth 22 points. The points will be awarded
as follows:
Criteria Points
using Normalizers with Anaword
4
implementation of FingerPrintNormalizer 8
This category includes quality of code, quality
of algorithm/method used, correctness, comments.
README write-up/analysis 10
Extra credit max 3
Submit
You should submit all source code, test files, and a README with Eclipse
using assignment name anagrampartII.
In your README
you should include
Owen L. Astrachan
Last modified: Sun Jan 29 14:54:02 EST 2006