
Team Edition
In addition to the original specifications, your team must add the following functionality to your program. These features emphasize specific design issues and are meant to help you think about what it means to "hard-code" things within your program and how to provide an interface that gives programmers access to that flexibility when using your program.
Specifications
A typical architecture for many program designs is one that divides a program's execution into three stages: input, data is provided to the program; process, that data is transformed; and output, the program displays the results of transforming that data. This input/process/output (IPO) model of programming is used in simple programs like this one as well as in million-line programs that forecast the weather or predict stock market fluctuations. Your final program should be clearly separated into three independent modules such that each contains one or more classes that make it flexible enough to accommodate a variety of options without requiring either of the other modules to change. To do this, you must think carefully about what the result of each step is so that it can be safely received by the next step.
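The separation described above can be sketched as three stages that only agree on the data passed between them. This is a minimal illustration, not a required design: the names `read_words`, `count_words`, `format_plain`, and `run` are hypothetical, and a real solution would likely use classes rather than plain functions.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stage signatures; the assignment does not prescribe them.
Reader = Callable[[str], Iterable[str]]              # source -> words
Indexer = Callable[[Iterable[str]], Dict[str, int]]  # words -> word counts
Formatter = Callable[[Dict[str, int]], str]          # counts -> display text


def read_words(source: str) -> List[str]:
    """Input stage: here the 'source' is simply a string split on white space."""
    return source.split()


def count_words(words: Iterable[str]) -> Dict[str, int]:
    """Process stage: count occurrences of each lower-cased word."""
    counts: Dict[str, int] = {}
    for word in words:
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


def format_plain(counts: Dict[str, int]) -> str:
    """Output stage: one 'word: count' line per word, sorted alphabetically."""
    return "\n".join(f"{word}: {n}" for word, n in sorted(counts.items()))


def run(source: str, reader: Reader, indexer: Indexer, formatter: Formatter) -> str:
    """Wire the stages together; any stage can be swapped without touching the others."""
    return formatter(indexer(reader(source)))


print(run("the cat the hat", read_words, count_words, format_plain))
```

Because `run` depends only on the stage signatures, replacing `format_plain` with a formatter that emits HTML would not require any change to the input or process stages.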
The requirements for each module are described below:
Input
Your program should be able to read text files from a variety of sources. For example,
- for this version, a word is defined as any alphanumeric sequence of characters separated from other words by white space, a colon, two or more dashes, or two or more periods. Words should also begin and end with a letter, so leading or trailing punctuation or digits should not be included in the word, though internal punctuation should be.
- it should be able to accept a single text file or a website as input in addition to reading a directory of etext files. All local pages (i.e., from the same host) of the given web page should be indexed. Your program should not index the HTML tags or the HTML header, just the content words. If the given file is not formatted as a web page or as a Project Gutenberg etext file, then every word in it should be indexed.
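One possible reading of the word definition above is sketched below. It assumes that "letter" means an ASCII letter and that case is preserved; both are assumptions rather than part of the specification, and `tokenize` is a hypothetical name.

```python
import re
from typing import List

# Separators named in the specification: white space, a colon,
# two or more dashes, or two or more periods.
_SEPARATORS = re.compile(r"\s+|:|-{2,}|\.{2,}")
# Runs of non-letters at the start or end of a token (ASCII letters assumed).
_LEADING = re.compile(r"^[^A-Za-z]+")
_TRAILING = re.compile(r"[^A-Za-z]+$")


def tokenize(text: str) -> List[str]:
    """Split text into words that begin and end with a letter.

    Internal punctuation (as in "well-known" or "it's") is kept;
    leading or trailing punctuation and digits are stripped.
    """
    words = []
    for token in _SEPARATORS.split(text):
        word = _TRAILING.sub("", _LEADING.sub("", token))
        if word:  # tokens with no letters at all, such as "42", vanish here
            words.append(word)
    return words


print(tokenize("well-known fact: 3rd time--it's 42... done."))
```

Note that a single dash or a single period does not separate words under this definition, which is why "well-known" survives as one word.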
Process
Your program should be flexible in how it chooses its words to index and how it ranks its search results. For example,
- it should be able to exclude words from the index by their length (e.g., shorter than 3 letters), by their presence in a list of words (e.g., those in this file of common words), or by regular expressions (e.g., all words starting with 'a').
- it should be able to rank search results based on a variety of criteria: alphabetically by title, number of times all of the query words occur, or most recent time the file was modified. Moreover, all orderings should be allowed to be reversed.
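The two kinds of flexibility above can be sketched with interchangeable filter and ranking functions. Everything named here (`WordFilter`, `shorter_than`, `in_word_list`, `matches`, `keep_words`, `Result`, `RANKINGS`, `rank`) is hypothetical and only illustrates the idea that a new option should be a new small piece, not a change to existing code.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# A filter returns True when the word should be EXCLUDED from the index.
WordFilter = Callable[[str], bool]


def shorter_than(n: int) -> WordFilter:
    return lambda word: len(word) < n


def in_word_list(stop_words: Iterable[str]) -> WordFilter:
    lowered = {w.lower() for w in stop_words}
    return lambda word: word.lower() in lowered


def matches(pattern: str) -> WordFilter:
    compiled = re.compile(pattern)
    return lambda word: compiled.fullmatch(word) is not None


def keep_words(words: Iterable[str], filters: List[WordFilter]) -> List[str]:
    """Keep only the words that no filter excludes."""
    return [w for w in words if not any(f(w) for f in filters)]


@dataclass
class Result:
    title: str
    hits: int        # number of times the query words occur
    modified: float  # last-modified time, e.g. seconds since the epoch


# Each ranking criterion is just a sort key; adding one is a single new entry.
RANKINGS: Dict[str, Callable[[Result], object]] = {
    "title": lambda r: r.title.lower(),
    "hits": lambda r: r.hits,
    "modified": lambda r: r.modified,
}


def rank(results: List[Result], by: str, reverse: bool = False) -> List[Result]:
    """Sort ascending by the chosen key; reverse=True flips any ordering."""
    return sorted(results, key=RANKINGS[by], reverse=reverse)
```

With these keys, `rank(results, "modified", reverse=True)` lists the most recently modified file first, and `rank(results, "hits", reverse=True)` lists the document with the most occurrences first.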
Output
Your program should be flexible in the formatting of the output of the search results. For example,
- it should output the search results as a web page (an HTML file) that lists the title of each document followed by the search words in context within that document (e.g., as Google presents its search results).
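A minimal sketch of such an HTML formatter follows. It assumes each document is available as a `(title, text)` pair and matches the query as a case-insensitive substring rather than a whole word; the names `snippet` and `render_results`, the context width, and the exact markup are all illustrative choices, not requirements.

```python
import html
import re
from typing import List, Tuple


def snippet(text: str, query_word: str, width: int = 30) -> str:
    """Return escaped context around the first match, with the match in <b> tags."""
    match = re.search(re.escape(query_word), text, flags=re.IGNORECASE)
    if match is None:
        return ""
    start = max(0, match.start() - width)
    end = min(len(text), match.end() + width)
    before = html.escape(text[start:match.start()])
    hit = html.escape(match.group(0))
    after = html.escape(text[match.end():end])
    return f"...{before}<b>{hit}</b>{after}..."


def render_results(query_word: str, documents: List[Tuple[str, str]]) -> str:
    """Build one HTML page listing each matching title and its context."""
    items = []
    for title, text in documents:
        context = snippet(text, query_word)
        if context:  # documents without a match are left out
            items.append(f"<li><h3>{html.escape(title)}</h3><p>{context}</p></li>")
    body = "\n".join(items)
    heading = f"<h1>Results for {html.escape(query_word)}</h1>"
    return f"<html><body>{heading}\n<ul>\n{body}\n</ul></body></html>"
```

Escaping the title and the surrounding text with `html.escape` keeps characters such as `<` in the source documents from being interpreted as markup in the results page.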
Extra Credit
There are many possible extensions to the basic specifications; some are listed below. From the standpoint of your grade, the most important thing is that your program is designed well (i.e., that it is possible to index new kinds of documents simply by creating new subclasses and adding O(1) lines to your existing code to include the new classes). This means your design should be open to adding new kinds of documents while closed to changing the index and search code.
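One way to keep the existing code closed to change is to let each document subclass register itself, so that supporting a new kind of document means writing only the new subclass. The sketch below is an assumption about how that could look, not the required design: `DocumentReader`, `HtmlReader`, `PlainTextReader`, and `reader_for` are hypothetical names, and the tag stripping is deliberately crude (it does not, for instance, drop the HTML header).

```python
import re
from typing import List, Type


class DocumentReader:
    """Hypothetical base class: subclasses say what they accept and how to read it."""

    registry: List[Type["DocumentReader"]] = []

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Every subclass registers itself when it is defined.
        DocumentReader.registry.append(cls)

    @classmethod
    def accepts(cls, raw: str) -> bool:
        raise NotImplementedError

    def content(self, raw: str) -> str:
        raise NotImplementedError


class HtmlReader(DocumentReader):
    @classmethod
    def accepts(cls, raw: str) -> bool:
        return "<html" in raw.lower()

    def content(self, raw: str) -> str:
        return re.sub(r"<[^>]+>", " ", raw)  # crude tag stripping, for illustration


class PlainTextReader(DocumentReader):
    """Fallback: defined last so it is tried last; every word is indexed."""

    @classmethod
    def accepts(cls, raw: str) -> bool:
        return True

    def content(self, raw: str) -> str:
        return raw


def reader_for(raw: str) -> DocumentReader:
    """Pick the first registered reader that accepts the document."""
    for cls in DocumentReader.registry:
        if cls.accepts(raw):
            return cls()
    raise ValueError("no reader accepts this document")
```

The index and search code would call only `reader_for` and `content`, so neither needs to change when a reader for a new format is added. The trade-off in this sketch is that registration order matters, which is why the catch-all reader is defined last.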
Next in importance to your grade is that your project be thoroughly tested, to prove to the course staff that your confidence in it is justified. Your submission should include whatever data files, unit tests, or other driver programs you have used to test your program, as well as documentation on how to use them. If you do all of the above well, the maximum grade you can receive is an A-.
Finally, the extensions given below are intended to stretch your design further and to differentiate your program from others in order to capture the global nanoGoogle market. Your team should agree on one area of extensions to focus on if you want to be considered for a grade in the A range. These extensions must further the good design of your program and not simply be hacks of code added at the last minute. If you do not have time to implement an extension, partial extra credit may be given for an excellent justification of how your design either already supports adding such a feature or how it would need to be changed to support it.
- Graphical Front End. Build a GUI that allows the user easy access to your program's features.
- User-friendly Queries. Include standard features such as spell checking, using synonyms, and word stemming to provide better search results.
- Index Management. Allow users to update the index without rebuilding it from scratch, such as removing individual documents or indexing only updated documents.
- Optimized Back End. Minimize the storage needed for the index by compressing it, such as using numbers to refer to file names and titles or writing only the different letters from one word to the next.
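As an illustration of "writing only the different letters from one word to the next", the sketch below stores each word of a sorted list as the length of the prefix it shares with the previous word plus the remaining letters (a technique often called front coding). The function names are hypothetical, and how the pairs would be written to disk is left open.

```python
import os
from typing import List, Tuple


def front_encode(sorted_words: List[str]) -> List[Tuple[int, str]]:
    """Encode each word as (shared prefix length with previous word, new suffix)."""
    encoded = []
    previous = ""
    for word in sorted_words:
        shared = len(os.path.commonprefix([previous, word]))
        encoded.append((shared, word[shared:]))
        previous = word
    return encoded


def front_decode(encoded: List[Tuple[int, str]]) -> List[str]:
    """Rebuild the original word list from the (shared, suffix) pairs."""
    words = []
    previous = ""
    for shared, suffix in encoded:
        word = previous[:shared] + suffix
        words.append(word)
        previous = word
    return words


words = ["search", "searched", "searching", "seat"]
print(front_encode(words))
```

The savings depend on the word list being sorted, since neighboring words in sorted order tend to share long prefixes.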
However, note that the amount of extra credit will be in proportion to the amount of intellectual effort needed to implement the option. For example, adding yet another way to filter index words would not be worth very much because your design should already support it. Of course, a well-tested, perfectly working program that has fewer features (but plenty of clear paths to easy expansion) is always worth more than the leaky kitchen sink.
In short, to maximize your grade, you should implement enough variety in your program to clearly demonstrate that your design supports further such extensions.
Resources
The following online resources may help improve your overall design.
- MapReduce: Simplified Data Processing on Large Clusters by J. Dean and S. Ghemawat
- On the Criteria to be used in Decomposing a System into Modules by D.L. Parnas
- Component Programming - a fresh look at software components by M. Jazayeri
Deliverables
- Tuesday, September 30.
Submit a README containing the address of your website that contains:
- a name for your team's "company"
- a description of your team's shared vision of the project (think of this as the advertising blurb on the box in which your software will be sold)
- program documentation that justifies which team member's implementation you intend to use, notes the specific changes you think will be necessary in the current code to implement this project's new features, and estimates how long this project will take you to code.
- Sunday, October 5. Submit a program that, at a minimum, should be able to
- read from or write to multiple formats (not necessarily both and not necessarily perfectly)
- sort the output in a variety of ways
- Friday, October 10. Submit the final version of your program, including all programmer documentation.
- Thursday, October 16. Your individual project analysis is due.