
The Google website claims to index over one trillion pages. The Google-Desktop application will index nearly all the files on a Windows machine; Apple's spotlight tool does the same under OS X/10.4 The prefix nano represents 10-9 or one-billionth of some quantity. However, for this assignment you will be indexing and searching somewhere between one and several hundred files.
Specification
In teams, write a Python module that creates an index that facilitates searching for words within a set of files. Searching for words should provide a list of files in which those words appear, ranked in some order. For example, a search query for red fox should return a rank-ordered list of files in which both the words red and fox appear. The rank-order should be based on how close the words are physically in the file. The highest ranked file will have all the words in the query a minimal distance apart. For example, if all the words occur between the tenth and fifteenth word in some file, then the rank score (i.e., the difference or range) would be 5. You may assume that the words in the query are given in the order they will appear in the file; in other words, you do not have check permutations of the query. If there is only one word in the query, then the files should be ranked based on the number of times the word appears in each file.
Your program should be able to index a variety of text files (each with a different extension), such as plain text files (.txt), Project Gutenberg book files (.ebk), mail messages (.mail), or web pages (.html), from a variety of sources, such as a directory, web site (must start with http:), or as a stand-alone file. Some of these files contain header data that should not be indexed, but might provide useful meta-data about the file, such as its title, author, date, etc that may be searched separately. Some of these files also contain special tag text that should not be indexed at all. Common words should not be indexed in any case. These files may also have different criteria for what it means to be a word, for example, in a mail message '@' probably denotes an email address, whereas in a text file it might be ignored.
In addition to the ranking algorithm described in the example, your program should provide a variety of ways to rank search results based on a variety of criteria: alphabetically by title, number of times all of the query words occur, or most recent time the file was modified. Moreover, all orderings should be allowed to be reversed.
Your program should be able to output the search results in a variety of formats, from just a list of documents matched to a Google style list that includes the context in which the searched words appear.
While you could create a stand-alone program that provides some interface (e.g., text-based, web-based, or GUI-based) to indexing and searching, all that is necessary is to create a Python module that can be imported and interacted with in the Python shell. This also means that for your program to function, you do not need to save the generated index to one or more files before searching it. However, if you want to save an index to search later without having to re-build it, you need to be able to save it; also, if you index a large number of files, your data structure may not fit in memory, in which case it would be useful to be able to save it. Thus, it is required that there is an option to save an index for a set of files and recreate your index from that saved format. Your team may choose any file format to represent the index you want, but creating an especially optimized one is one of the optional features.
Some example files to get you started are available here.
Design Criteria
There are many extensions to the basic specifications possible; some are listed below. From the stand point of your grade, the most important thing is that your program is designed well (i.e., that it is possible to index new kinds of documents simply by adding O(1) line to your existing code to include the new functionality). This means your design should be open to adding new kinds of input documents, new kinds of rankings, and new kinds of output formats, while closed to changing the index and search code.
Next in importance to your grade is your project should be thoroughly tested to prove to the course staff that your confidence in it is justified. You should include whatever data files, unit tests, or other driver programs (as well as documentation on how to use them) you have used to test your program in your submission. If you do all of the above well, the maximum grade you can receive is an A-.
Finally, the extensions given below are intended to stretch your design further and to differentiate your program from others in order to capture the global nanoGoogle market, your team should agree on one area of extensions to focus on if you want to be considered for a grade in the A range. These extensions must further the good design of your program and not simply be hacks of code added at the last minute. If you do not have time to implement an extension, partial extra credit may be given for excellent justification of how your design either supports adding such a feature already or how it would need to changed sufficiently to support such a feature.
- Advanced Query Language. Include query keywords, such as AND, OR, NOT, as well as support for searching specific meta-data fields, such as title, date, etc.
- User-friendly Queries. Include standard features such as spell checking, using synonyms, and word stemming to provide better search results.
- Ranking Preferences. Instead of simply providing hard-coded ranking formulas, provide a means for users to influence your ranking system, such as preferring mail messages come before text files all other things being equal, or more recent documents rank higher than older documents.
- Index Management. Allow users to update the index without rebuilding it from scratch, such removing individual documents or indexing only updated documents.
- Optimized Back End. Minimize the storage needed for the index by compressing it, such as using numbers to refer to file names and titles or writing only the different letters from one word to the next.
However, note that the amount of extra credit will be in proportion to the amount of intellectual effort needed to implement the option. For example, adding yet another way to filter index words would not be worth very much because your design should already support it. Of course, a well-tested, perfectly working program that has fewer features (but plenty of clear paths to easy expansion) is always worth more than the leaky kitchen sink.
In short, to maximize your grade, you should implement enough variety in your program to clearly demonstrate that your design supports further such extensions.
Resources
The following online resources may help you think about indexing algorithms.
- A History of Search Engines b W. Sonnenreich
- The Anatomy of a Large Scale Hypertextual Web Search Engine by S. Brin and L. Page
- Glimpse: A Tool for Searching Entire File Systems by U. Manber and S. Wu
- MapReduce: Simplified Data Processing on Large Clusters by J. Dean and S. Ghemawat
- On the Criteria to be used in Decomposing a System into Modules by D.L. Parnas
- Component Programming - a fresh look at software components by M. Jazayeri
Deliverables
- Monday, January 26. Submit a README containing the address of your website that contains:
- a name for your team's "company"
- a description of your team's shared vision of the project, specifically what required and extra credit features your team is most interested in implementing
- issues that arise as you try to pin down the requirements, e.g., vague, ambiguous, conflicting requirements, or concerns that you have about specific features
- an estimate of how long this project will take your team to code this project
- Monday, February 2. Submit a
program that, at a minimum, should be able to
- read from or write to multiple formats (not necessarily both and not necessarily perfectly)
- sort the output in a variety of ways
- Monday, February 9. Submit the final version of program, including all programmer documentation.
- Thursday, February 12. Your individual project analysis is due.
- Friday, March 6. Submit the final revised version of program, including all programmer documentation.