(This problem appeared in a different format in the Internet Programming Contest.)
Early bonus due date: February 24
Final due date: March 1
This assignment will provide practice with classes, linked lists, vectors, streams, trees, pointers, sorting, reading from files using getline, writing classes, and iterative enhancement.
Table of Contents
[ Introduction | Input/Output | Coding | Grading | Submitting ]
Searching and sorting are prototypical computer applications. For this assignment you'll write a program that organizes titles (or sentences) for efficient "human search" based on different keywords. Given a list of titles and a list of words to ignore, you are to write a program that generates a KWIC (Key Word In Context) index of the titles. In a KWIC-index, a title is listed once for each keyword that occurs in the title. The KWIC-index is alphabetized by keyword. Keywords are "real words", that is, any words other than words to ignore such as "the", "of", and "and". A list of words to ignore is included in each input file; all words that aren't in that list are keywords.
For example, if words to ignore are the, of, and, as, a and the list of titles is:
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A KWIC-index of these titles is given by:
                      a portrait of the ARTIST as a young man
                                    the ASCENT of man
                                        DESCENT of man
                             descent of MAN
                          the ascent of MAN
                                the old MAN and the sea
    a portrait of the artist as a young MAN
                                    the OLD man and the sea
                                      a PORTRAIT of the artist as a young man
                    the old man and the SEA
          a portrait of the artist as a YOUNG man
Each title is listed as many times as there are keywords in the title. For example, "A Portrait of the Artist As a Young Man" is listed four times, once each for "portrait", "artist", "young", and "man".
Your program should read from a file whose name you enter when you run the program. Legal input files contain a list of words to ignore (one per line) followed by a list of titles (one per line). The string :: on a line by itself separates the list of words to ignore from the list of titles. Each of the words to ignore appears in lower-case letters on a line by itself. Each title appears on a line by itself and may consist of mixed-case (upper and lower) letters. Words in a title are separated by whitespace.
No characters other than 'a'--'z', 'A'--'Z', and white space will appear in the input.
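The input format just described can be read with getline. The following is a minimal sketch, not the required design: the struct KwicInput and the function readKwicInput are hypothetical names, the code uses standard C++11 containers rather than the course's Vector class, and skipping blank lines is an assumption the format description does not address.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical holder for the two halves of an input file.
struct KwicInput {
    std::vector<std::string> ignoreWords;  // lines before "::"
    std::vector<std::string> titles;       // lines after "::"
};

// Reads words to ignore until a line containing only "::",
// then treats every remaining line as a title.
KwicInput readKwicInput(std::istream& in) {
    KwicInput result;
    std::string line;
    bool inTitles = false;
    while (std::getline(in, line)) {
        // Assumption: tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (!inTitles && line == "::") {
            inTitles = true;  // separator found; titles follow
            continue;
        }
        if (line.empty()) continue;  // assumption: blank lines are skipped
        (inTitles ? result.titles : result.ignoreWords).push_back(line);
    }
    return result;
}
```

Because the function takes an istream, it can be handed an ifstream opened on the file name the user enters, or a string stream when testing.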
The output should be a KWIC-index of the titles, with each title appearing once for each keyword in the title, and with the KWIC-index alphabetized by keyword. If a word appears more than once in a title, each instance is a potential keyword. For example, the title A Rose is a Rose is an Aphorism would appear three times (once for each occurrence of Rose and once for Aphorism).
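One way to locate every keyword occurrence in a title is to record the position of each word that is not on the ignore list. The sketch below is illustrative only: lowerCase and keywordPositions are hypothetical helpers, the ignore list is held in a std::set rather than the assignment's WordList, and words are lower-cased before lookup so that case does not affect the ignore check.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Returns a lower-case copy of s.
std::string lowerCase(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Returns the index of every word in the title that is a keyword,
// in left-to-right order. A repeated word yields one index per occurrence.
std::vector<std::size_t> keywordPositions(
        const std::vector<std::string>& words,
        const std::set<std::string>& ignore) {
    std::vector<std::size_t> positions;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (ignore.count(lowerCase(words[i])) == 0) {
            positions.push_back(i);
        }
    }
    return positions;
}
```

With the ignore words a, is, and an, the title A Rose is a Rose is an Aphorism yields three positions: both occurrences of Rose and the single occurrence of Aphorism.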
The keyword should appear in all upper-case letters. All other words in a title should be in lower-case letters. Case (upper or lower) is irrelevant when determining if a word is to be ignored. Titles should be roughly centered as shown above, with each keyword capitalized and justified somewhere near the middle of an 80 column screen (don't worry about this part at first). Assume titles will fit on a line; don't worry about handling unusual cases, just assume that the longest title will fit properly.
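The formatting rules above can be sketched as a single function that builds one line of the index. This is an assumption-laden illustration: formatEntry, toLower, and toUpper are hypothetical names, and starting the keyword at column 40 is just one reading of "near the middle of an 80 column screen".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Returns a lower-case copy of s.
std::string toLower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Returns an upper-case copy of s.
std::string toUpper(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    }
    return s;
}

// Builds one index line: words[keyIndex] is upper-cased, every other word
// is lower-cased, and the line is left-padded so the keyword starts at
// keywordColumn (0-based). If the text before the keyword is longer than
// keywordColumn, no padding is added.
std::string formatEntry(const std::vector<std::string>& words,
                        std::size_t keyIndex,
                        std::size_t keywordColumn = 40) {
    std::string before;  // words left of the keyword, each followed by a space
    std::string rest;    // the keyword and everything after it
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i < keyIndex) {
            before += toLower(words[i]) + " ";
        } else {
            if (i > keyIndex) rest += " ";
            rest += (i == keyIndex) ? toUpper(words[i]) : toLower(words[i]);
        }
    }
    std::size_t pad =
        before.size() < keywordColumn ? keywordColumn - before.size() : 0;
    return std::string(pad, ' ') + before + rest;
}
```

Aligning on the start of the keyword keeps the keyword column easy to scan; getting the unpadded text right first and adding the padding later matches the advice not to worry about centering at the start.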
Titles in the KWIC-index with the same keyword should appear in the same order as they appeared in the input file. In the case where multiple instances of a word are keywords in the same title, the keywords should be capitalized in left-to-right order. A sort that maintains the original order of elements with equal keys is called a stable sort. Insertion sort is stable. The code for insertion sort can be found in the Tapestry text and in Weiss; it is reproduced below for a vector of ints.
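A minimal insertion sort for a vector of ints looks roughly like the following. This sketch is written for illustration with std::vector; it is not copied from Tapestry or Weiss, and the name insertionSort is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Sorts a into non-decreasing order using insertion sort.
// An element is shifted right only while it is strictly greater than the
// value being inserted, so equal elements never pass each other. That
// strict comparison is what makes the sort stable.
void insertionSort(std::vector<int>& a) {
    for (std::size_t k = 1; k < a.size(); ++k) {
        int hold = a[k];      // value to insert into the sorted prefix a[0..k-1]
        std::size_t loc = k;  // candidate position for hold
        while (loc > 0 && a[loc - 1] > hold) {
            a[loc] = a[loc - 1];  // shift the larger element one slot right
            --loc;
        }
        a[loc] = hold;
    }
}
```

To sort index entries instead of ints, the same loop applies with the element type changed and the comparison made on the keyword; keeping the comparison strict preserves input order among entries with equal keywords.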
Sample input file:
is
the
of
and
as
a
but
::
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A Man is a Man but Bubblesort IS A DOG
KWIC-index for the sample input:
                      a portrait of the ARTIST as a young man
                                    the ASCENT of man
                     a man is a man but BUBBLESORT is a dog
                                        DESCENT of man
     a man is a man but bubblesort is a DOG
                             descent of MAN
                          the ascent of MAN
                                the old MAN and the sea
    a portrait of the artist as a young MAN
                                      a MAN is a man but bubblesort is a dog
                             a man is a MAN but bubblesort is a dog
                                    the OLD man and the sea
                                      a PORTRAIT of the artist as a young man
                    the old man and the SEA
          a portrait of the artist as a YOUNG man
You'll be using several classes to implement this assignment. One of the goals of the assignment is to get you used to using several classes in one program. The classes are shown in the diagram below. Each class is described after the diagram, but you should see the header files for the classes too (which we provide).
The dashed lines show part of the private sections of the classes; you'll need to fill in the details.
A KwicTitle has two pointers for private data. The first pointer, myTitle, points to a title; the second pointer, myKeyWord, points to a string that is a keyword. In the diagram above you can see that a title for The Old Man and the Sea consists of six string pointers. Three of the strings are also pointed to as keywords of a KwicTitle.
An object of the class WordList is used to store words. Two WordList objects are used: one to store the list of words to ignore and one to store all the words that appear in titles. This way each word is stored once but can be pointed to a hundred times (e.g., by 100 different titles), which yields significant space savings. The WordList class has only two operations: Find a word and Add a word. In both cases a pointer is returned (see the wordlist.h header file). In this assignment the class WordList is an abstract class. We have provided an implementation called VWordList that uses sorted vectors. You will implement a class TreeWordList that inherits from WordList. You should implement this part of the project last since we haven't talked about inheritance yet. Your program should work with either the VWordList or the TreeWordList class: the member functions of the Title class take WordList parameters, so any kind of WordList will work (just as any kind of stream can be passed to a parameter that's an istream).
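The idea of storing each word once and handing out pointers can be illustrated with a small stand-in class. Everything here is an assumption made for illustration: SimpleWordList is a hypothetical name, the Find and Add signatures are guesses (the real ones are in wordlist.h), and a std::set is used in place of the sorted vector or tree the assignment calls for.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for a WordList; the real interface is in wordlist.h.
class SimpleWordList {
public:
    // Returns a pointer to the stored copy of word, or nullptr if absent.
    const std::string* Find(const std::string& word) const {
        auto it = myWords.find(word);
        return it == myWords.end() ? nullptr : &*it;
    }

    // Stores word if it is new; either way, returns a pointer to the
    // single stored copy.
    const std::string* Add(const std::string& word) {
        return &*myWords.insert(word).first;
    }

private:
    // std::set is node-based, so the address of a stored string stays
    // valid while other words are added.
    std::set<std::string> myWords;
};
```

Adding "man" twice returns the same pointer both times, so any number of titles can share one stored copy of the word. A container that relocates its elements as it grows would invalidate pointers handed out earlier, which is worth keeping in mind when reading the provided implementation.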
You must use a vector of pointers to titles, i.e., Vector<Title *> myTitles in the private section of the class Kwic. If you don't use pointers to titles you'll have problems with one of the later modifications.
To build a title, you'll need to use an istrstream variable. These are input string streams; information is available in the Tapestry book (see page 443). The example in readnums.cc, which is from the Tapestry book, shows how to read and count all the words in a string. Based on this code you should be able to read individual words from a string where the words are separated by whitespace.
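Reading whitespace-separated words from a string can be sketched as follows. Note the substitution: this sketch uses std::istringstream from the standard header sstream instead of istrstream, and splitWords is a hypothetical name, not part of the provided code or of readnums.cc.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits a title into its words. The extraction operator >> skips any
// run of whitespace, so extra spaces and tabs between words are harmless.
std::vector<std::string> splitWords(const std::string& title) {
    std::istringstream input(title);
    std::vector<std::string> words;
    std::string word;
    while (input >> word) {
        words.push_back(word);
    }
    return words;
}
```

The loop has the same shape as reading words from cin or a file, which is why a string stream is convenient here: a line obtained with getline can be processed word by word with the familiar stream operations.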
This assignment is worth 30 points. Points will be awarded as follows:
| Behavior | Points |
|---|---|
| Generates KWIC-index | 10 |
| Sorted Properly | 4 |
| Handles duplicate key words in title | 2 |
| Nice output (centered) | 2 |
| Memory Efficient | 4 |
| Coding Style (uses classes, comments) | 6 |
| README | 2 |
You should create a README file for this and all assignments. All README files should include your name as well as the name(s) of anyone with whom you collaborated on the assignment and the amount of time you spent.
To submit your assignment, type:
submit100 kwic README kwic.cc Makefile ...
Be sure to submit ALL .cc and .h files as well as your Makefile.