For example, if words to ignore are the, of, and, as, a and the list of titles is:
Descent of Man The Ascent of Man The Old Man and The Sea A Portrait of The Artist As a Young Man
A KWIC-index of these titles is given by:
Each title is listed as many times as there are key words in the title.
For example, "A Portrait of the Artist As a Young Man" is listed four
times, once each for "portrait", "artist", "young", and "man".
There will be no more than 100 words to ignore, no more than than 500
titles, and no more than 50,000 characters in the titles and words to
ignore combined. No characters other than 'a'--'z', 'A'--'Z', and white
space will appear in the input.
The keyword should appear in all upper-case
letters. All other words in a title should be in lower-case letters.
Case (upper or lower) is irrelevant when determining if a word is to be
ignored. Titles should be roughly centered as shown above with all key
words capitalized and justified somewhere near the middle of an 80
column screen (don't worry about this part at first). Assume titles
will fit on a line, don't worry about handling weird cases, just handle
cases assuming that the longest title will fit properly.
Titles in the KWIC-index with the same keyword should appear in the same
order as they appeared in the input file. In the case where multiple
instances of a word are keywords in the same title, the keywords should
be capitalized in left-to-right order. A sort that maintains
the original order of elements with equal keys is called a stable
sort. Insertion sort is stable. The code for insertion sort can
be found in most texts and in the program sortall.cpp it is reproduced
below for a vector of ints.
Ideally you will maintain only one copy of each title, you will not
store a title once for each keyword although you will print a title
once for each keyword.
As a first pass, you may decide to store each
title once for each occurrence of a keyword. That might lead to the
following declarations. A diagram below
shows how the struct KwicTitle is used
to store the title The Old Man and the Sea.
In the diagram below the title is stored as 4 KwicTitle objects,
once for each keyword. Note that when two two KwicTitles are compared
(using less than: operator <)
the index of the
keyword determines which string in the KwicTitles are compared (you may
need to think about this).
a portrait of the ARTIST as a young man
the ASCENT of man
DESCENT of man
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man
Input/Output
Your program should read from a file whose name
you enter when you run the program.
Legal input files contain a list of words to ignore (one per line)
followed by a list of titles (one per line)
The string :: on a line by itself is used to
separate the list of words to ignore from the list of titles. Each of
the words to ignore appears in lower-case letters on a line by itself
and is no more than 10 characters in length. Each title appears on a
line by itself and may consist of mixed-case (upper and lower) letters.
Words in a title are separated by whitespace. No title contains more
than 25 words.
The Output
The output should be a KWIC-index of the titles, with each title
appearing once for each keyword in the title, and with the KWIC-index
alphabetized by keyword. If a word appears more than once in a title,
each instance is a potential keyword. In other words the title
A Rose is a Rose is an Aphorism would appear three times (once
for each occurrence of Rose and once for Aphorism.)
Sample Input
is
the
of
and
as
a
but
::
Descent of Man
The Ascent of Man
The Old Man and The Sea
A Portrait of The Artist As a Young Man
A Man is a Man but Bubblesort IS A DOG
Corresponding Output
a portrait of the ARTIST as a young man
the ASCENT of man
a man is a man but BUBBLESORT is a dog
DESCENT of man
a man is a man but bubblesort is a DOG
descent of MAN
the ascent of MAN
the old MAN and the sea
a portrait of the artist as a young MAN
a MAN is a man but bubblesort is a dog
a man is a MAN but bubblesort is a dog
the OLD man and the sea
a PORTRAIT of the artist as a young man
the old man and the SEA
a portrait of the artist as a YOUNG man
Coding Requirements and Help
Some suggestions on how to develop a solution for this problem are given
below. These suggestions use structs and vectors, with the design left
largely to you. Another set of suggestions shows a much more
class-oriented approach, this
class approach is linked here.
Minimizing Storage
One option for storing titles once is to use a vector of titles, storing
each title once in the vector (of course the titles may be vectors
of strings, but this isn't a problem --- you can also make the titles
structs that contain a vector of strings).
Then you can replace myTitle in
the declaration of KwicTitle with an index into the vector of
titles. With this solution there will still be four KwicTitle
objects for The Sun Also Rises but myTitle is now an
index to a title rather than a title.
Developing a Class
You'll probably find it useful to develop a class to solve this problem.
For example, public member functions could include Read and
PrintIndex. There will probably be several private member
functions that will be called from PrintIndex and that will
call each other. For example, you might store the words to ignore in a
Vector<string> myIgnore and then write a function as
shown below to search this vector.
Owen L. Astrachan
Last modified: Thu Jun 12 10:58:46 EDT