A concordance is an index that lists every occurrence of a word in a given text together with the words immediately before and after it. Whereas the index at the back of a book lists words alphabetically and only points to where they appear, a concordance shows each occurrence of each word in the context from which it came. Its function is to bring together, or 'concord', the passages of a text that use a given word. A concordance is particularly useful for studying a work of literature through a particular word, phrase, or theme: it shows exactly how often a word occurs, or that it does not occur at all, which helps in seeing how themes recur within a text and how they relate to one another. The term 'concordance' is usually applied to literary and linguistic studies, but a concordance is also useful to programmers as a cross-reference system, letting a team keep track of every reference to, for example, a variable name across all the files that make up a project.
Write a program that, given a set of files on the command line, generates a concordance, also called a Key Word In Context (KWIC) index. You may use either Java or C++ as your programming language; however, it is suggested that you use your weaker language.
By default, you should output every distinct word in the given files in alphabetical order, with its context (the three words before and the three words after it, if they exist), followed by the line numbers that span the full context of the word, and finally the name of the file in which the word appears. See the sample output below.
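The context and line-span rule above can be sketched as follows. This is a minimal illustration, not a required design: the names `ContextWindow`, `Token`, `window`, and `lineSpan` are hypothetical, and it assumes the words of a file have already been read into a list that remembers each word's 1-based line number.

```java
import java.util.ArrayList;
import java.util.List;

final class ContextWindow {
    /** One word exactly as it appeared in the file, with its 1-based line number. */
    static final class Token {
        final String text;
        final int line;

        Token(String text, int line) {
            this.text = text;
            this.line = line;
        }
    }

    /** Returns up to `before` tokens, the keyword at index i, and up to `after` tokens. */
    static List<Token> window(List<Token> tokens, int i, int before, int after) {
        int from = Math.max(0, i - before);              // clip at the start of the file
        int to = Math.min(tokens.size(), i + after + 1); // clip at the end of the file
        return tokens.subList(from, to);
    }

    /** Formats the lines covered by a window as "7" or "7-8". */
    static String lineSpan(List<Token> window) {
        int first = window.get(0).line;
        int last = window.get(window.size() - 1).line;
        return first == last ? String.valueOf(first) : first + "-" + last;
    }

    public static void main(String[] args) {
        // Illustrative two-line input, not taken from the assignment.
        String[] lines = { "the quick brown fox", "jumps over the lazy dog" };
        List<Token> tokens = new ArrayList<>();
        for (int n = 0; n < lines.length; n++) {
            for (String word : lines[n].split("\\s+")) {
                tokens.add(new Token(word, n + 1));
            }
        }
        List<Token> w = window(tokens, 3, 3, 3); // index 3 is "fox"
        StringBuilder sb = new StringBuilder();
        for (Token t : w) {
            sb.append(t.text).append(' ');
        }
        System.out.println(sb.toString().trim() + " " + lineSpan(w));
    }
}
```

Because the window is clipped at both ends of the token list, a keyword near the start or end of a file simply gets a shorter context.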
Every occurrence of a word across all files should generate a line of output, in which the word appears capitalized with its context. Suppose the word giant occurs twenty times in the given input. It should then appear in the concordance twenty times. But in what order should those lines appear? Ideally, the occurrences of GIANT are printed in the same order in which they occur, i.e., the line numbers for the lines in which the key word GIANT occurs appear in increasing order. Again, see the sample output below.
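One way to get both orderings at once is sketched below, under stated assumptions: a sorted map keeps the key words alphabetical, and each word's list keeps its occurrences in the order they were added. The class name `ConcordanceIndex` and its methods are hypothetical, and the sketch assumes occurrences are added in reading order (file by file, line by line).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class ConcordanceIndex {
    // Keys are lower-cased words; a TreeMap iterates its keys in sorted order.
    private final Map<String, List<String>> entries = new TreeMap<>();

    /** Records one occurrence. Call in reading order so each list stays in order of occurrence. */
    void add(String word, String file, int line) {
        entries.computeIfAbsent(word.toLowerCase(Locale.ROOT), k -> new ArrayList<>())
               .add(line + " " + file);
    }

    /** Returns one "WORD line file" string per occurrence, words alphabetical. */
    List<String> rows() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : entries.entrySet()) {
            for (String where : e.getValue()) {
                out.add(e.getKey().toUpperCase(Locale.ROOT) + " " + where);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ConcordanceIndex index = new ConcordanceIndex();
        index.add("giant", "a.txt", 2);  // hypothetical file name
        index.add("Apple", "a.txt", 3);
        index.add("GIANT", "a.txt", 7);
        for (String row : index.rows()) {
            System.out.println(row);
        }
    }
}
```

The context words are left out here to keep the focus on ordering; a fuller version would store them alongside the line and file.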
For the purposes of this program, a word is any sequence of non-whitespace characters separated from other words by whitespace (or by the start or end of the file). Leading or trailing punctuation should not be included in the word, though internal punctuation should be. The sentence below contains eight distinct words: how, are, you, i, asked, can't, complain, replied (you occurs twice). The case of the words should be disregarded as well, i.e., "Apple" is the same as "apple" for comparison purposes.
"How are you?" I asked. "Can't complain," you replied.
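The word rule can be sketched as below. The assignment does not define "punctuation" precisely, so this sketch assumes it means any character that is not a letter or digit; the names `WordNormalizer` and `normalize` are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class WordNormalizer {
    /**
     * Strips leading and trailing punctuation, keeps internal punctuation,
     * and lower-cases the result for comparison purposes.
     * Assumption: punctuation is any character that is not a letter or digit.
     */
    static String normalize(String token) {
        int start = 0;
        int end = token.length();
        while (start < end && !Character.isLetterOrDigit(token.charAt(start))) {
            start++;
        }
        while (end > start && !Character.isLetterOrDigit(token.charAt(end - 1))) {
            end--;
        }
        return token.substring(start, end).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String sentence = "\"How are you?\" I asked. \"Can't complain,\" you replied.";
        Set<String> distinct = new LinkedHashSet<>(); // keeps first-seen order
        for (String token : sentence.split("\\s+")) {
            String word = normalize(token);
            if (!word.isEmpty()) {
                distinct.add(word);
            }
        }
        // prints [how, are, you, i, asked, can't, complain, replied]
        System.out.println(distinct);
    }
}
```

A token made up entirely of punctuation normalizes to the empty string, which the caller skips.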
The word whose context is given is printed in all-caps with every word of the context printed as it appeared in the file.
By default, your program should work as described above. However, if there is a file called kwic.properties in the directory where the program is being run, then it should be able to customize the output based on the following options.
| before=# | maximum number of words of context to print before the keyword, default = 3 |
| after=# | maximum number of words of context to print after the keyword, default = 3 |
| min=# | minimum number of letters in a word to be considered a keyword in the concordance, default = 3 |
The format of the properties file is one value per line, in any order; any option may be omitted, in which case its default applies. Each option's name is separated from its value by an equals sign, =. See the example below.
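Since this name=value format matches what `java.util.Properties` reads, a Java solution could load the options roughly as sketched here. The class `KwicOptions` and its helpers are hypothetical, and falling back to the default when a value is not a number is an assumption the assignment does not specify.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

final class KwicOptions {
    final int before;
    final int after;
    final int min;

    KwicOptions(Properties p) {
        before = intOption(p, "before", 3);
        after = intOption(p, "after", 3);
        min = intOption(p, "min", 3);
    }

    /** Reads an integer option; a missing or malformed value yields the default (assumption). */
    static int intOption(Properties p, String key, int fallback) {
        String value = p.getProperty(key);
        if (value == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Parses name=value lines from any reader. */
    static KwicOptions load(Reader in) {
        Properties p = new Properties();
        try {
            p.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new KwicOptions(p);
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("kwic.properties"); // looked up in the current directory
        KwicOptions options;
        if (Files.exists(file)) {
            try (Reader r = Files.newBufferedReader(file)) {
                options = load(r);
            }
        } else {
            options = new KwicOptions(new Properties()); // all defaults
        }
        System.out.println("before=" + options.before
                + " after=" + options.after + " min=" + options.min);
    }
}
```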
Two lines from an online copy of the Bible are reproduced below. They are saved as the first two lines of the file data/simple.txt.
And the earth was without form, and void; and darkness was
upon the face of the deep.
The properties file that exists within the directory is reproduced below.
min=4
before=2
Then your program should generate the output below.
   void; and DARKNESS was upon the    1-2  data/simple.txt
      of the DEEP                     2    data/simple.txt
     And the EARTH was without form,  1    data/simple.txt
    upon the FACE of the deep.        2    data/simple.txt
 was without FORM and void; and       1    data/simple.txt
darkness was UPON the face of         1-2  data/simple.txt
   form, and VOID and darkness was    1    data/simple.txt
   earth was WITHOUT form, and void;  1    data/simple.txt
The capitalized words represent the key words and should be lined up, i.e., centered, in some manner similar to what's shown above in the sample output.
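One possible way to line up the key words is to right-align the preceding context in a fixed-width column, so that every key word starts at the same position. The sketch below does this with `String.format`; the class `KwicFormatter`, the method `formatRow`, and the column widths are hypothetical choices, and a real program would probably compute the widths from the longest context it has to print.

```java
import java.util.Locale;

final class KwicFormatter {
    /**
     * Builds one output row. The preceding context is right-aligned in
     * leftWidth columns so that all key words start in the same column.
     */
    static String formatRow(String before, String keyword, String after,
                            String lines, String file,
                            int leftWidth, int rightWidth) {
        String right = keyword.toUpperCase(Locale.ROOT)
                + (after.isEmpty() ? "" : " " + after);
        // A width of 0 is not a valid format width, so clamp to at least 1.
        String pattern = "%" + Math.max(1, leftWidth) + "s %-"
                + Math.max(1, rightWidth) + "s  %-5s %s";
        return String.format(pattern, before, right, lines, file);
    }

    public static void main(String[] args) {
        // Widths 12 and 23 are illustrative and fit the two rows below.
        System.out.println(formatRow("darkness was", "upon", "the face of",
                "1-2", "data/simple.txt", 12, 23));
        System.out.println(formatRow("of the", "deep", "",
                "2", "data/simple.txt", 12, 23));
    }
}
```

The line-number column is padded as well, so the file names also start in a common column as long as each context fits within its width.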