Due Date: Early bonus: Wednesday, October 2, midnight
Final Due Date: Monday, October 7, midnight
This assignment will provide practice with structs, vectors, sorting, reading from files using getline, streams, writing classes, and iterative enhancement.
For example, if words to ignore are
the, of, and, as, a and the list
of titles is:
A KWIC-index of these titles is given by:
Each title is listed as many times as there are key words in the title.
For example, "A Portrait of the Artist As a Young Man" is listed four
times, once each for "portrait", "artist", "young", and "man".
There will be no more than 100 words to ignore, no more than than 500
titles, and no more than 50,000 characters in the titles and words to
ignore combined. No characters other than 'a'--'z', 'A'--'Z', and white
space will appear in the input.
The keyword should appear in all upper-case
letters. All other words in a title should be in lower-case letters.
Case (upper or lower) is irrelevant when determining if a word is to be
ignored. Titles should be roughly centered as shown above with all key
words capitalized and justified somewhere near the middle of an 80
column screen (don't worry about this part at first). Assume titles
will fit on a line, don't worry about handling weird cases, just handle
cases assuming that the longest title will fit properly.
Titles in the KWIC-index with the same keyword should appear in the same
order as they appeared in the input file.
In the case where multiple
instances of a word are keywords in the same title, the keywords should
be capitalized in left-to-right order.
A sort that maintains the original order of elements with
equal keys is called a stable sort. Insertion sort is stable.
The code for insertion sort can be found in the Tapestry text,
it is reproduced below for a vector of ints.
In the diagram below the title is stored as 4 KwicTitle objects,
once for each keyword. Note that when two two KwicTitles are compared
(using less than: operator <)
the index of the
keyword determines which string in the KwicTitles are compared (you may
need to think about this).
Introduction
Searching and sorting are prototypical computer applications. For this
assignment you'll write a program that organizes titles (or sentences)
for efficient "human search" based on different key words.
Given a list of titles and a list of words to ignore, you are to
write a program that generates a KWIC (Key Word In Context) index of the
titles. In a KWIC-index, a title is listed once for each keyword that
occurs in the title. The KWIC-index is alphabetized by keyword.
Keywords are any words that are not listed in the list of words to ignore.
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
a portrait of the ARTIST as a young man
the ASCENT of man
DESCENT of man
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man
Input/Output
Your program should read from a file whose name
you enter when you run the program.
Legal input files contain a list of words to ignore (one per line)
followed by a list of titles (one per line)
The string :: on a line by itself is used to
separate the list of words to ignore from the list of titles. Each of
the words to ignore appears in lower-case letters on a line by itself
and is no more than 10 characters in length. Each title appears on a
line by itself and may consist of mixed-case (upper and lower) letters.
Words in a title are separated by whitespace. No title contains more
than 25 words.
The Output
The output should be a KWIC-index of the titles, with each title
appearing once for each keyword in the title, and with the KWIC-index
alphabetized by keyword. If a word appears more than once in a title,
each instance is a potential keyword. In other words the title
A Rose is a Rose is an Aphorism would appear three times (once
for each occurrence of Rose and once for Aphorism.)
Sample Input
is
the
of
and
as
a
but
::
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A Man is a Man but Bubblesort IS A DOG
Corresponding Output
a portrait of the ARTIST as a young man
the ASCENT of man
a man is a man but BUBBLESORT is a dog
DESCENT of man
a man is a man but bubblesort is a DOG
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
a MAN is a man but bubblesort is a dog
a man is a MAN but bubblesort is a dog
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man
Coding Requirements and Help
Ideally you will maintain only one copy of each title, you will not
store a title once for each keyword although you will print a title
once for each keyword. You may choose not to worry about this. This
assignment is worth 24 points and 4 of the points are for minimizing
storage by storing titles only once.
As a first pass, you may decide to store each
title once for each occurrence of a keyword. That might lead to the
following declarations. A diagram below
shows how the struct KwicTitle is used
to store the title The Old Man and the Sea.
Minimizing Storage
One option for storing titles once is to use a vector of titles, storing
each title once in the vector (of course the titles may be vectors
of strings, but this isn't a problem --- you can also make the titles
structs that contain a vector of strings).
Then you can replace myTitle in
the declaration of KwicTitle with an index into the vector of
titles. With this solution there will still be four KwicTitle
objects for The Sun Also Rises but myTitle is now an
index to a title rather than a title (again, think about this carefully
and ask questions.)
Developing a Class
You'll probably find it useful to develop a class to solve this problem.
For example, public member functions could include Read and
PrintIndex. There will probably be several private member
functions that will be called from PrintIndex and that will
call each other. For example, you might store the words to ignore in a
Vector<string> myIgnore and then write a function as
shown below to search this vector.
Grading Standards
This assignment is worth 24 points. Points will be awarded as follows:
Behavior | Points |
---|---|
Generates KWIC-index | 6 |
Sorted Properly | 2 |
Handles duplicate key words in title | 2 |
Nice output (centered) | 2 |
Memory Efficient | 4 |
Coding Style (uses classes, comments) | 6 |
README | 2 |
To submit your assignment, type: