CompSci 108
Fall 2008
Software Design and Implementation

The Google website claims to index more than eight billion pages (as of August 28, 2005). The Google-Desktop application will index nearly all the files on a Windows machine; Apple's spotlight tool does the same under OS X/10.4 The prefix nano represents 10-9 or one-billionth of some quantity. However, for this assignment you will be indexing and searching somewhere between one and several hundred files. You will do the assignment in several stages, first individually, then in teams of two.

Specification

Write a Java program that, given a directory of text files, creates an index that facilitates searching for words within those files. Searching for words should provide a list of files in which those words appear, ranked in some order. For example, a search query for red fox should return a rank-ordered list of files in which both the words red and fox appear. The rank-order should be based on how close the words are physically in the file. The highest ranked file will have all the words in the query a minimal distance apart. For example, if all the words occur between the tenth and fifteenth word in some file, then the rank score (i.e., the difference or range) would be 5. You may assume that the words in the query are given in the order they will appear in the file; in other words, you do not have check permutations of the query. If there is only one word in the query, then the files should be ranked based on the number of times the word appears in each file.

Your program should be able to output its index to a text file so that it can be saved, read in, or added to by future runs of the program. Initially, the format of the output will consist of each word in each file printed in alphabetical order with a list of numbers that represent where each occurrence of the word appears in the file. So, if a file has 1,000 unique words, it will result in 1,000 lines of output for that file.

word:file:title:word #

For example, this might be part of the output of your program.

windmill:996.txt:Don Quixote:1282 1575
windmills:996.txt:Don Quixote:1575 2471 3750 3765 3775 3795 3906 20986 22654 24587
Words

For the purposes of this program, assume all texts are in English and that a word is any sequence of characters separated from other words by white space or a colon, :, that begins and ends with a letter. Thus, leading or trailing punctuation should not be included in the word, though internal punctuation (other than a colon) should be. For example, the sentence below contains nine words: fred, how, are, you, i, asked, can't, complain, replied. The case of the words should be disregarded as well, i.e., for comparison purposes "Apple" is the same as "apple".

 Fred:"How are you?" I asked. "Can't complain," you replied.
Files

For the purposes of this program, assume you will be indexing only text files (i.e., those with a .txt suffix). Furthermore, each file you read will be a text file from The Project Gutenberg website. In particular, to start you will be using thirty of the recent top-100 downloaded files. These files are browsable here.

These files have a particular format where the title and author of the work whose etext is stored in the text file are accessible near the top of the file with specific text-tags associated with them. You will need to figure out these tags and use them when reading to determine title and author. Some files do not have readily discernible titles or etext beginnings, but where possible, you should not index the preamble before the etext beginning. If a file does contain a title, print <no title> instead of a title (include the angle brackets).

Design Goals

In general, you should start by creating the simplest program that completes the specifications. Once you have a working program and tests that demonstrate the program works, you should refactor it to improve its design. However, since a design cannot be judged without a context, you should consider how your design will minimize changes to the code when implementing the following possible extensions:

Next in importance to your grade, your project should be thoroughly tested to prove to the course staff that your confidence in it is justified. You should include whatever data files, unit tests, or other driver programs (as well as documentation on how to use them) you have used to test your program in your submission.

Resources

The following online resources may help you think about indexing algorithms.