This page discusses only certain security implications of the indexing done by the search engine. For more general information, please see our Duke Search Engine FAQ.
The index used by a search engine is created from data gathered by a program called a web crawler (also known as a spider, robot, or bot). The crawler used by Google (not OIT's Google appliance) runs outside of Duke's Internet domain, so restrictions on sites and pages implemented through .htaccess files are enforced as expected.
In contrast, the web crawler used by OIT's Google search appliance runs from a Duke IP address; that is, it is within Duke's Internet domain. Therefore, if a group of web pages is restricted but remains accessible to all of Duke, those pages will be seen and indexed by the Duke crawler. And while the indexed pages themselves will still be inaccessible to viewers outside of Duke, there's a catch: the OIT search appliance caches copies of the pages it indexes, and these cached copies will then be visible to anyone on the Web!
If pages are restricted so that they are viewable only from within the C.S. network or within specific other departments' networks (anything narrower than duke.edu), then they will not be accessible to the OIT crawler, and they will be neither indexed nor cached.
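The difference between a campus-wide restriction and a department-only restriction can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not how the web server actually evaluates .htaccess rules: the address ranges, the crawler address, and the helper is_allowed are all hypothetical placeholders, not real Duke values.

```python
import ipaddress

# Hypothetical address ranges, for illustration only (not real Duke networks).
CAMPUS_NET = ipaddress.ip_network("10.0.0.0/8")    # stands in for "all of Duke"
DEPT_NET = ipaddress.ip_network("10.20.0.0/16")    # stands in for one department

# Hypothetical client addresses.
CRAWLER_IP = "10.99.1.5"     # on campus, but outside the department network
OUTSIDE_IP = "203.0.113.7"   # somewhere off campus


def is_allowed(client_ip: str, allowed_net: ipaddress.IPv4Network) -> bool:
    """Return True if the client address falls inside the allowed network."""
    return ipaddress.ip_address(client_ip) in allowed_net


# Campus-wide rule: the on-campus crawler gets in, so pages can be indexed and cached.
print(is_allowed(CRAWLER_IP, CAMPUS_NET))
# Department-only rule: the same crawler is refused.
print(is_allowed(CRAWLER_IP, DEPT_NET))
# Off-campus visitors are refused by the campus-wide rule.
print(is_allowed(OUTSIDE_IP, CAMPUS_NET))
```

Under these assumed ranges, the crawler passes the campus-wide check but fails the department-only check, which is why only the narrower restriction keeps pages out of the appliance's cache.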
There are four general methods that can be used to limit access by the search appliance; each has some drawbacks that should be considered. The Lab Staff will be happy to answer any detailed questions you might have about which method is right for you.
Please note that any of these methods can be subverted by a person within Duke's Internet domain, or within your department, who indexes the pages and makes them accessible on the Web.
Search engine standards outline a set of guidelines that web site administrators can use to promote or prevent indexing of various parts of their sites. Access control is administered centrally for the website through a file named robots.txt. In addition, META tags can be placed in individual pages. These controls provide considerable flexibility in deciding what gets indexed and what does not. More information about this method can be found at http://www.robotstxt.org.
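As a rough illustration of how a well-behaved crawler interprets robots.txt, the sketch below feeds a small example file to Python's standard urllib.robotparser module and asks whether two URLs may be fetched. The file contents, the /private/ directory, the www.example.edu host, and the "examplebot" user-agent name are all made up for this example; they are not the settings of any real Duke site or crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt asking all crawlers to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "examplebot" is a placeholder user-agent name, not a real crawler.
blocked = parser.can_fetch("examplebot", "http://www.example.edu/private/notes.html")
allowed = parser.can_fetch("examplebot", "http://www.example.edu/index.html")

print(blocked)  # the /private/ page is disallowed
print(allowed)  # the home page is not covered by any Disallow rule
```

Note that this check is performed by the crawler itself: robots.txt is a request that cooperating crawlers honor, not an access restriction enforced by the web server.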
However, we do not generally recommend this method because of the following drawbacks.
In the past, the most common method of preventing unauthorized access to web pages was an .htaccess file limiting access to computers on a certain network. However, because the OIT web crawler runs from within the campus network, this method now requires additional configuration. If pages are restricted to be viewable only from within duke.edu, they will be directly inaccessible to viewers outside of Duke. But because the search engine caches copies of the pages it indexes, these cached copies will be visible to people outside of Duke!
As of March 2008, this is even more problematic because OIT has not stabilized the name of the search engine, so the rules needed to deny access to this machine change over time.
The most secure way to protect pages is to require password authentication. This is accomplished using the authentication mechanism provided by the .htaccess file. There are three basic authentication methods:
Finally, pages can avoid being indexed if no other page links to them directly. However, this method fails if other users link to your page and their pages are indexed, so it is not considered reliable.