CPS 100, Spring 1997:
KWIC: Key Word in Context, or, Searching Kwicly (30 points)

(This problem appeared in a different format in the Internet Programming Contest.)

Due Date: Early bonus: February 24
Final Due Date: March 1

This assignment will provide practice with classes, linked lists, vectors, streams, trees, pointers, sorting, reading from files using getline, writing classes, and iterative enhancement.



(A Makefile, .h and .cc files, and sample input files are accessible in ~ola/cps100/kwic on the acpub system. Be sure to create a subdirectory kwic for this problem and to set the permissions for access by prof/uta/ta by typing fs setacl kwic ola:cps100 read.)

The files available for use in this assignment are shown below. Some of the .h files have public sections only.

Introduction

Searching and sorting are prototypical computer applications. For this assignment you'll write a program that organizes titles (or sentences) for efficient "human search" based on different key words. Given a list of titles and a list of words to ignore, you are to write a program that generates a KWIC (Key Word In Context) index of the titles. In a KWIC-index, a title is listed once for each keyword that occurs in the title. The KWIC-index is alphabetized by keyword. Keywords are any words that are "real words", not words to ignore like "the", "of", "and", etc. A list of words to ignore is included in each input file; all words that aren't in the list of words to ignore are key words.

For example, if words to ignore are the, of, and, as, a and the list of titles is:

Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man

A KWIC-index of these titles is given by:

                  
                  a portrait of the ARTIST as a young man 
                                the ASCENT of man 
                                    DESCENT of man 
                         descent of MAN 
                      the ascent of MAN 
                            the old MAN and the sea 
a portrait of the artist as a young MAN 
                                the OLD man and the sea 
                                  a PORTRAIT of the artist as a young man 
                the old man and the SEA 
      a portrait of the artist as a YOUNG man 

Each title is listed as many times as there are key words in the title. For example, "A Portrait of the Artist As a Young Man" is listed four times, once each for "portrait", "artist", "young", and "man".
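The rule above can be sketched as a small function that reports which words of a title are key words. This is only an illustration, not part of the provided code: the names ToLower and KeywordPositions are hypothetical, and the sketch assumes the title has already been split into words and uses the standard std::set and std::vector rather than the course classes.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns a lower-case copy of word.
std::string ToLower(const std::string & word)
{
    std::string result = word;
    for (std::size_t i = 0; i < result.size(); i++)
    {
        unsigned char ch = static_cast<unsigned char>(result[i]);
        result[i] = static_cast<char>(std::tolower(ch));
    }
    return result;
}

// Hypothetical helper: returns the 0-based positions of the key words
// in a title, i.e., of every word that is not in the ignore set.
// Assumes the ignore set holds lower-case words.
std::vector<int> KeywordPositions(const std::vector<std::string> & titleWords,
                                  const std::set<std::string> & ignore)
{
    std::vector<int> positions;
    for (std::size_t i = 0; i < titleWords.size(); i++)
    {
        // case is irrelevant when deciding whether to ignore a word
        if (ignore.count(ToLower(titleWords[i])) == 0)
        {
            positions.push_back(static_cast<int>(i));
        }
    }
    return positions;
}
```

For "A Portrait of The Artist As a Young Man" with the ignore words the, of, and, as, a, this yields four positions (1, 4, 7, 8), one per listing of the title in the index.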

Input/Output

Your program should read from a file whose name you enter when you run the program. Legal input files contain a list of words to ignore (one per line) followed by a list of titles (one per line). The string :: on a line by itself is used to separate the list of words to ignore from the list of titles. Each of the words to ignore appears in lower-case letters on a line by itself. Each title appears on a line by itself and may consist of mixed-case (upper and lower) letters. Words in a title are separated by whitespace.

No characters other than 'a'--'z', 'A'--'Z', and white space will appear in the input.
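One way to read this format with getline is sketched below. The struct KwicInput and the function ReadKwicInput are hypothetical names chosen for illustration, and skipping empty lines is an assumption that the format description does not require. The sketch uses standard containers rather than the classes you'll write for the assignment.

```cpp
#include <istream>
#include <set>
#include <sstream>   // only needed for the in-memory usage described below
#include <string>
#include <vector>

// Hypothetical container for the two parts of an input file.
struct KwicInput
{
    std::set<std::string> ignoreWords;   // lines before "::"
    std::vector<std::string> titles;     // lines after "::", in input order
};

// Reads words to ignore until a line containing only "::",
// then reads titles until end of input.
KwicInput ReadKwicInput(std::istream & input)
{
    KwicInput result;
    std::string line;
    bool inTitles = false;
    while (std::getline(input, line))
    {
        if (line.empty())
        {
            continue;                    // assumption: blank lines carry no data
        }
        if (!inTitles && line == "::")
        {
            inTitles = true;             // separator: titles follow
        }
        else if (inTitles)
        {
            result.titles.push_back(line);
        }
        else
        {
            result.ignoreWords.insert(line);
        }
    }
    return result;
}
```

Because the parameter is an istream, the same function works with an ifstream opened on the file name the user enters or with an istringstream built from a string, which is convenient for testing.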

The Output

The output should be a KWIC-index of the titles, with each title appearing once for each keyword in the title, and with the KWIC-index alphabetized by keyword. If a word appears more than once in a title, each instance is a potential keyword. In other words, the title A Rose is a Rose is an Aphorism would appear three times, once for each occurrence of Rose and once for Aphorism (assuming a, is, and an are in the list of words to ignore).

The keyword should appear in all upper-case letters. All other words in a title should be in lower-case letters. Case (upper or lower) is irrelevant when determining if a word is to be ignored. Titles should be roughly centered as shown above, with all key words capitalized and lined up somewhere near the middle of an 80 column screen (don't worry about this part at first). Assume titles will fit on a line. Don't worry about handling weird cases; just assume that the longest title will fit properly.
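The formatting rules can be sketched as a function that builds one line of the index. The name FormatKwicLine and its parameters are hypothetical, and the column at which key words line up is left to the caller since the handout only asks for "somewhere near the middle". Treat this as one possible approach rather than the required print function.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: builds one line of the index. Every word is
// lower-cased except the word at keyIndex, which is upper-cased and
// positioned so that it starts at column keywordColumn (0-based).
std::string FormatKwicLine(const std::vector<std::string> & words,
                           std::size_t keyIndex,
                           std::size_t keywordColumn)
{
    std::string before;   // text to the left of the key word
    std::string rest;     // the key word and everything after it
    for (std::size_t i = 0; i < words.size(); i++)
    {
        std::string word = words[i];
        for (std::size_t j = 0; j < word.size(); j++)
        {
            unsigned char ch = static_cast<unsigned char>(word[j]);
            int converted = (i == keyIndex) ? std::toupper(ch) : std::tolower(ch);
            word[j] = static_cast<char>(converted);
        }
        if (i < keyIndex)
        {
            before += word + " ";
        }
        else
        {
            if (i > keyIndex)
            {
                rest += " ";
            }
            rest += word;
        }
    }
    // Pad on the left; if the left part is already too long, add no padding.
    std::size_t pad = 0;
    if (before.size() < keywordColumn)
    {
        pad = keywordColumn - before.size();
    }
    return std::string(pad, ' ') + before + rest;
}
```

With keywordColumn set to 36, the words of "The Ascent of Man" and keyIndex 1 produce 32 spaces followed by "the ASCENT of man".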

Titles in the KWIC-index with the same keyword should appear in the same order as they appeared in the input file. In the case where multiple instances of a word are keywords in the same title, the keywords should be capitalized in left-to-right order. A sort that maintains the original order of elements with equal keys is called a stable sort. Insertion sort is stable. The code for insertion sort can be found in the Tapestry text and in Weiss; it is reproduced below for a vector of ints.

void InsertSort(Vector<int> & a, int numElts)
// precondition: a contains numElts ints
// postcondition: elements of a are sorted in non-decreasing order
{
    int k,loc;
    int hold;
    for(k=1; k < numElts; k++)
    {
        hold = a[k];     // hold the k-th element
        loc = k;
        // shift other elements right
        while (0 < loc && hold < a[loc-1])
        {
            a[loc] = a[loc-1];
            loc--;
        }
        a[loc] = hold;   // store kept element in hole created
    }
}
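To see why stability matters here, the same algorithm can be applied to records that carry a key word along with other data. The sketch below is an adaptation for illustration only: the struct Entry is hypothetical, and std::vector stands in for the course Vector class.

```cpp
#include <string>
#include <vector>

// Hypothetical record: a key word plus the position of its title in the input.
struct Entry
{
    std::string keyword;
    int titleIndex;
};

// Insertion sort on the keyword field only.
// postcondition: a is sorted by keyword in non-decreasing order, and
//                entries with equal keywords keep their original order
void InsertSort(std::vector<Entry> & a)
{
    for (std::size_t k = 1; k < a.size(); k++)
    {
        Entry hold = a[k];
        std::size_t loc = k;
        // The comparison is a strict <, so hold never moves past an
        // entry with an equal keyword. This is what makes the sort stable.
        while (0 < loc && hold.keyword < a[loc-1].keyword)
        {
            a[loc] = a[loc-1];
            loc--;
        }
        a[loc] = hold;
    }
}
```

If the entries are created in input order, all entries for "man" come out in the order their titles appeared in the file. Changing the comparison to <= would still sort by keyword but would no longer be stable.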

Sample Input

is
the
of
and
as
a
but
::
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A Man is a Man but Bubblesort IS A DOG

Corresponding Output

                  a portrait of the ARTIST as a young man 
                                the ASCENT of man 
                 a man is a man but BUBBLESORT is a dog 
                                    DESCENT of man 
 a man is a man but bubblesort is a DOG 
                         descent of MAN 
                      the ascent of MAN 
                            the old MAN and the sea 
a portrait of the artist as a young MAN 
                                  a MAN is a man but bubblesort is a dog 
                         a man is a MAN but bubblesort is a dog 
                                the OLD man and the sea 
                                  a PORTRAIT of the artist as a young man 
                the old man and the SEA 
      a portrait of the artist as a YOUNG man 


Coding Requirements and Help

You'll be using several classes to implement this assignment. One of the goals of the assignment is to get you used to using several classes in one program. The classes are shown in the diagram below. Each class is described after the diagram, but you should see the header files for the classes too (which we provide).

The dashed lines show part of the private sections of the classes; you'll need to fill in the details.


You can represent a title as either a vector of words or a linked list of words. The choice is up to you. In either case the diagram below shows the relationship between Titles, KwicTitles, and words stored in a WordList. The title The Old Man and the Sea generates three KwicTitle objects represented by the following diagram:

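[Diagram: a Title for The Old Man and the Sea stored as six string pointers into a WordList, with three KwicTitle objects that each point to the title and to one keyword.]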

A KwicTitle has two pointers for private data. The first pointer, myTitle, points to a title; the second pointer, myKeyWord, points to a string that is a keyword. In the diagram above you can see that a title for The Old Man and the Sea consists of six string pointers. Three of the strings are also pointed to as keywords of a KwicTitle.
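The pointer relationships described above might look like the following in code. Only the member names myTitle and myKeyWord come from this handout; the structs are simplified stand-ins for the real classes in the header files, and myWords and KeywordLess are hypothetical names. The comparison assumes words are stored in lower case.

```cpp
#include <string>
#include <vector>

// Assumed layout: a title is a sequence of pointers to shared strings.
struct Title
{
    std::vector<const std::string *> myWords;
};

// The two pointers described above. Neither pointer owns what it points to.
struct KwicTitle
{
    const Title * myTitle;           // the title this entry belongs to
    const std::string * myKeyWord;   // one of the strings in that title
};

// Hypothetical comparison for sorting: orders entries by keyword only.
bool KeywordLess(const KwicTitle & lhs, const KwicTitle & rhs)
{
    return *lhs.myKeyWord < *rhs.myKeyWord;
}
```

For The Old Man and the Sea, myWords holds six pointers, and the two occurrences of "the" can point to the same stored string. The three KwicTitle objects all share one myTitle pointer and differ only in myKeyWord.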

WordList: Minimizing Storage

An object of the class WordList is used to store words. Two WordList objects are used: one to store the list of words to ignore and one to store all the words that appear in titles. This way each word is stored once, but can be pointed to a hundred times (e.g., for 100 different titles). This yields a significant space savings. The WordList class has only two operations: Find a word and Add a word. In both cases a pointer is returned --- see the wordlist.h header file. In this assignment the class WordList is an abstract class. We have provided an implementation called VWordList that uses sorted vectors. You will implement a class TreeWordList that inherits from WordList. You should implement this part of the project last since we haven't talked about inheritance yet. Your program should work with either VWordList or TreeWordList classes since the member functions of the Title class take WordList parameters so any kind of WordList will work (just as any kind of stream can be passed to a parameter that's an istream.)
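A rough sketch of how an abstract WordList and a tree-based subclass could fit together is shown below. The signatures of Find and Add are assumptions made for illustration (the real ones are in wordlist.h), as are the Node struct and the member names. The binary search tree is one possible design, not the required one.

```cpp
#include <string>

// Assumed interface; check wordlist.h for the actual declarations.
class WordList
{
  public:
    virtual ~WordList() {}
    // returns a pointer to the stored copy of word, or 0 if not stored
    virtual const std::string * Find(const std::string & word) const = 0;
    // stores word if it is not already stored;
    // returns a pointer to the single stored copy
    virtual const std::string * Add(const std::string & word) = 0;
};

// Sketch of a binary search tree implementation.
class TreeWordList : public WordList
{
  public:
    TreeWordList() : myRoot(0) {}
    virtual ~TreeWordList() { Clear(myRoot); }

    virtual const std::string * Find(const std::string & word) const
    {
        const Node * current = myRoot;
        while (current != 0)
        {
            if (word < current->info)      current = current->left;
            else if (current->info < word) current = current->right;
            else                           return &current->info;
        }
        return 0;
    }

    virtual const std::string * Add(const std::string & word)
    {
        Node ** link = &myRoot;            // pointer to the link to follow
        while (*link != 0)
        {
            if (word < (*link)->info)      link = &(*link)->left;
            else if ((*link)->info < word) link = &(*link)->right;
            else                           return &(*link)->info;
        }
        *link = new Node(word);            // attach new leaf at empty link
        return &(*link)->info;
    }

  private:
    struct Node
    {
        std::string info;
        Node * left;
        Node * right;
        Node(const std::string & s) : info(s), left(0), right(0) {}
    };
    Node * myRoot;

    static void Clear(Node * t)            // deletes all nodes below t
    {
        if (t != 0)
        {
            Clear(t->left);
            Clear(t->right);
            delete t;
        }
    }

    TreeWordList(const TreeWordList &);              // copying is disabled
    TreeWordList & operator=(const TreeWordList &);
};
```

Each node is allocated separately, so a pointer returned by Add stays valid as more words are added. Adding the same word twice returns the same pointer, which is what lets many titles share one copy of a word. Code that takes a WordList & parameter can be handed a TreeWordList without any change.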

Implementing KWIC

We provide several .h files with partial class declarations. These declarations are partial because in many cases the private section is not filled in; you'll have to do this. You are free to design other classes that we haven't mentioned. You may find the steps below useful in getting the program to work.
  1. Implement enough functionality of the class Kwic and the class Title so that you can read information from a file and print the raw (non key-worded) titles read in.

    You must use a vector of pointers to titles, i.e., Vector<Title *> myTitles in the private section of the class Kwic. If you don't use pointers to titles you'll have problems with one of the later modifications.

    To build a title, you'll need to use an istrstream variable. These are input string streams; information is available in the Tapestry book (see page 443). The example in readnums.cc, which is from the Tapestry book, shows how to read and count all the words in a string. Based on this code you should be able to read individual words from a string where the words are separated by whitespace.

  2. Once you have the main parts of Kwic and Title implemented, turn to the class KwicTitle as declared in the header file ktitle.h. Except for the printing function, this is relatively straightforward. Don't worry at first about making the print function generate nice/justified output; you can make the output pretty after you know the program is working.

  3. After implementing Kwic, Title, and KwicTitle you should have a working program (you'll need to sort based on keywords, but a keyword knows how to compare itself with other keywords). At this point you should turn to implementing the class TreeWordList which uses a binary search tree to store strings rather than a sorted vector. You'll need to write .h and .cc files for this class. To test this class you'll need to modify the Makefile so that the TreeWordList class is used instead of VWordList. You'll need to modify the class Kwic too since it is the class that actually creates the WordList objects used in the program.
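For the word-reading part of step 1, the idea can be sketched as follows. This sketch uses std::istringstream, the standard-library string stream that serves the same purpose as the istrstream named in step 1. Which one your compiler supports is an assumption you should check, and the function name SplitWords is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a title into its whitespace-separated words.
std::vector<std::string> SplitWords(const std::string & title)
{
    std::istringstream input(title);   // treat the string as an input stream
    std::vector<std::string> words;
    std::string word;
    while (input >> word)              // >> skips any amount of whitespace
    {
        words.push_back(word);
    }
    return words;
}
```

Since operator >> skips blanks and tabs alike, a title with irregular spacing still yields one entry per word. In your program each word read this way would be looked up in, or added to, a WordList rather than stored as a separate copy.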

Grading Standards

This assignment is worth 30 points. Points will be awarded as follows:

Behavior                                  Points
Generates KWIC-index                          10
Sorted Properly                                4
Handles duplicate key words in title           2
Nice output (centered)                         2
Memory Efficient                               4
Coding Style (uses classes, comments)          6
README                                         2


Submission

You should create a README file for this and all assignments. All README files should include your name as well as the name(s) of anyone with whom you collaborated on the assignment and the amount of time you spent.

To submit your assignment, type:

   submit100 kwic README kwic.cc Makefile ...
Be sure to submit ALL .cc and .h files as well as your Makefile.