KWIC, Mastery, CPS 108, Fall 1999

This mastery comes in two parts. The first part is required. The second part is optional. Each part is worth 20 points, though the second twenty points will be harder to earn than the first twenty.

WOOFII

This mastery is required. You must do this mastery on your own.

Please submit a compressed tar file containing all source code and a README. You should include a design document, either in the electronic submission or as a hardcopy that you turn in (for this assignment there's no folder/binder of material to turn in).

 submit108 woofii woofii.tgz
You must write a program that reads a textfile and records the different words, and generates as output a list of words and the line numbers on which each word occurs.

You can invoke the program in different ways:

The program will be graded 40% on design/clarity, 30% on correctness, and 30% on speed. Benchmark times for two text files will be posted; you must beat the benchmark times to earn more than 10% of the 30% allotted to speed. To beat the times you will likely need an STL map (or something better or equivalent in efficiency) and C-style strings (and maybe C-style I/O), so design your program to make it easy to change how I/O is done.
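As a starting point, here is a minimal sketch of the word-to-line-number index built on an STL map. The names LineIndex and buildIndex are hypothetical, lines are assumed to be numbered from 1, and tokens are split on whitespace only (no punctuation or case handling). Reading from a std::istream rather than a file name is one way to keep the I/O easy to replace later; it is not the only reasonable design.

```cpp
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Maps each word to the sorted set of line numbers on which it occurs.
// (LineIndex and buildIndex are illustrative names, not part of the assignment.)
using LineIndex = std::map<std::string, std::set<int>>;

// Builds the index from any input stream, so the I/O layer (file, string,
// or a faster hand-written reader) can change without touching this logic.
// Assumption: lines are numbered starting at 1.
LineIndex buildIndex(std::istream& in) {
    LineIndex index;
    std::string line;
    int lineNumber = 0;
    while (std::getline(in, line)) {
        ++lineNumber;
        std::istringstream words(line);
        std::string word;
        // operator>> splits on whitespace; punctuation stays attached here.
        while (words >> word) {
            index[word].insert(lineNumber);
        }
    }
    return index;
}
```

Because std::map keeps its keys ordered, iterating over the result yields the words in sorted order with no extra sorting step.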

Words are white-space delimited sequences of alphanumeric characters with leading/trailing punctuation removed (internal punctuation is OK). Letters should be converted to lowercase.
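One way to read this rule, sketched under the assumption that "punctuation" means any non-alphanumeric character at either end of a token. The function name normalizeWord is hypothetical.

```cpp
#include <cctype>
#include <string>

// Strips leading and trailing non-alphanumeric characters from a
// whitespace-delimited token and lowercases what remains. Internal
// punctuation is kept. Returns an empty string when the token contains
// no alphanumeric character at all.
std::string normalizeWord(const std::string& token) {
    std::size_t first = 0;
    std::size_t last = token.size();
    // The casts to unsigned char avoid undefined behavior for negative chars.
    while (first < last &&
           !std::isalnum(static_cast<unsigned char>(token[first]))) {
        ++first;
    }
    while (last > first &&
           !std::isalnum(static_cast<unsigned char>(token[last - 1]))) {
        --last;
    }
    std::string word = token.substr(first, last - first);
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}
```

For instance, under this reading the token "Hello," becomes hello, don't is left unchanged, and a token made only of punctuation becomes empty and can be skipped.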

KWIC

This mastery is NOT required. You may work with one other person.

Please submit a compressed tar file containing all source code and a README. You should include a design document, either in the electronic submission or as a hardcopy that you turn in (for this assignment there's no folder/binder of material to turn in).

 submit108 kwic kwic.tgz
A Key Word in Context index is useful in looking up titles, words, and other things. Words that aren't key words are ignored in generating a KWIC index. For example, if the words to ignore are "the, of, and, as, a" and a list of titles is:
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man

A KWIC-index of these titles might be given by:

                      a portrait of the ARTIST as a young man 
                                    the ASCENT of man 
                                        DESCENT of man 
                             descent of MAN 
                          the ascent of MAN 
                                the old MAN and the sea 
    a portrait of the artist as a young MAN 
                                    the OLD man and the sea 
                                      a PORTRAIT of the artist as a young man 
                    the old man and the SEA 
          a portrait of the artist as a YOUNG man 
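The index above can be produced roughly as follows. This is only a sketch: KwicEntry, join, and kwicIndex are hypothetical names, titles are split on whitespace with no punctuation handling, and entries with the same key word are assumed to keep the order in which the titles were given (which is what the sample output shows).

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// One index entry: the text before the key word, the key word itself in
// upper case, and the text after it.
struct KwicEntry {
    std::string before;
    std::string keyword;
    std::string after;
};

// Joins words[from, to) with single spaces.
std::string join(const std::vector<std::string>& words,
                 std::size_t from, std::size_t to) {
    std::string out;
    for (std::size_t i = from; i < to; ++i) {
        if (!out.empty()) out += ' ';
        out += words[i];
    }
    return out;
}

// Builds one entry per non-ignored word of each title, sorted by key word.
std::vector<KwicEntry> kwicIndex(const std::vector<std::string>& titles,
                                 const std::set<std::string>& ignore) {
    std::vector<KwicEntry> entries;
    for (const std::string& title : titles) {
        // Split the title into lowercase words.
        std::vector<std::string> words;
        std::istringstream in(title);
        std::string w;
        while (in >> w) {
            for (char& c : w) {
                c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            }
            words.push_back(w);
        }
        for (std::size_t i = 0; i < words.size(); ++i) {
            if (ignore.count(words[i]) != 0) continue;  // not a key word
            std::string key = words[i];
            for (char& c : key) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
            entries.push_back({join(words, 0, i), key,
                               join(words, i + 1, words.size())});
        }
    }
    // stable_sort keeps input order among entries with equal key words.
    std::stable_sort(entries.begin(), entries.end(),
                     [](const KwicEntry& a, const KwicEntry& b) {
                         return a.keyword < b.keyword;
                     });
    return entries;
}
```

To line the key words up in one column, print each entry's before text right-justified in a field as wide as the longest before text, then the key word, then the after text.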

A concordance is similar to a KWIC index. For example, two lines from the online copy of the Bible are reproduced below. Assume that these are lines 10 and 11.

And the earth was without form, and void; and darkness was
upon the face of the deep.
These lines might generate a concordance as follows:
      void and DARKNESS was upon the       10-11
        of the DEEP                        11
       and the EARTH was without form      10
      upon the FACE of the deep            11
   was without FORM and void and           10
  darkness was UPON the face of            10-11
      form and VOID and darkness was       10
     earth was WITHOUT form and void       10
In this concordance, words of fewer than three letters are not considered key words and aren't listed in the concordance. Write a program that generates a concordance from a textfile. Storage efficiency is an important consideration in designing your program, but the correctness and design of the program are more important. Storage efficiency is much more important than speed (though programs that are unbelievably slow will lose some points).
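One possible approach to storage efficiency (an assumption on my part, not a requirement) is to store each distinct word exactly once and represent the text as a sequence of small integer ids. The class WordTable below is a hypothetical sketch of that idea.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stores each distinct word once. The text itself can then be kept as a
// vector of int ids (plus a line number per occurrence) instead of
// repeated strings. WordTable is an illustrative name only.
class WordTable {
public:
    WordTable() = default;
    // Copying is disabled because words_ holds pointers into ids_.
    WordTable(const WordTable&) = delete;
    WordTable& operator=(const WordTable&) = delete;

    // Returns the id for `word`, adding it to the table on first sight.
    int intern(const std::string& word) {
        auto result = ids_.emplace(word, static_cast<int>(words_.size()));
        if (result.second) {
            // New word: remember where its single stored copy lives.
            // Pointers to unordered_map elements stay valid across rehashing.
            words_.push_back(&result.first->first);
        }
        return result.first->second;
    }

    // Returns the word for an id previously returned by intern().
    const std::string& word(int id) const { return *words_[id]; }

    // Number of distinct words seen so far.
    std::size_t size() const { return words_.size(); }

private:
    std::unordered_map<std::string, int> ids_;
    std::vector<const std::string*> words_;
};
```

Whether this actually saves space depends on how often words repeat in the input. Measure before committing to it.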

The output should be sorted by keyword, with keywords capitalized and other words in lower case. Punctuation not internal to words should be ignored.

The context for each key word is two words before the key word and three words after the key word. Ideally these values will be configurable in your program.
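A sketch of a configurable context window, assuming the whole text is available as one flat vector of already-normalized words. The function name contextLine and its default arguments are hypothetical, and line-number ranges are left out.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Formats the context of words[pos]: up to `before` words preceding it,
// the key word in upper case, and up to `after` words following it.
// The window is clipped at the start and end of the text.
std::string contextLine(const std::vector<std::string>& words, std::size_t pos,
                        std::size_t before = 2, std::size_t after = 3) {
    std::size_t first = pos >= before ? pos - before : 0;
    std::size_t last = std::min(words.size(), pos + after + 1);
    std::string out;
    for (std::size_t i = first; i < last; ++i) {
        std::string w = words[i];
        if (i == pos) {
            for (char& c : w) {
                c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
            }
        }
        if (!out.empty()) out += ' ';
        out += w;
    }
    return out;
}
```

Applied to the two sample lines above, the word "earth" at position 2 yields "and the EARTH was without form", and the final word "deep" yields "of the DEEP" because there is nothing after it.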

Your program should include the option of reading a file of words; these words will be ignored in determining whether a word is a key word. More generally, the test for whether a word is a key word is a criterion that will change, and your program should be able to cope with that.


Owen L. Astrachan
Last modified: Fri Sep 10 13:06:11 EDT 1999