Limiting Access by the Duke/OIT Web Crawler

The Duke Office of Information Technology (OIT) currently maintains a Google Search Appliance as the search solution for all of www.duke.edu. While this allows members of the Duke community to search for online material more easily, it also raises concerns because of the large amount of access-protected material that is being distributed via the web.

This page discusses only certain security implications of the indexing done by the search engine. For more general information, please see our Duke Search Engine FAQ.

Concerns for Web Authors

The index used by a search engine is created from data gathered by a program called a web crawler (AKA a spider, robot, or bot). The crawler used by Google (not OIT's Google appliance) is outside of the Duke internet domain, so restrictions to sites and pages implemented through .htaccess files are enforced as expected.

In contrast, the web crawler used by OIT's Google search appliance is running from a Duke IP address; that is, it is within Duke's Internet domain. Therefore, if a group of web pages is restricted, but is set to be accessible by all of Duke, then those pages will be seen and indexed by the Duke crawler. And while the indexed pages will still be inaccessible to viewers outside of Duke, there's a catch: the OIT search appliance will cache copies of pages it indexes, and these cached copies will then be visible to anyone on the Web!

If pages are restricted to only be viewable from within the C.S. network or within specific other departments' networks (anything other than duke.edu), then they will not be accessible to the OIT crawler, and they will not be indexed and/or cached.

How to Limit Access

There are four general methods that can be used to limit access by the search appliance; each has some drawbacks which should be considered. The Lab Staff will be happy to answer any detailed questions you might have about which method is right for you.

Please note that any of these methods can be subverted by a person within Duke's Internet domain - or within your department - who indexes the pages and makes them accessible on the web.

The Robots Exclusion Protocol

Search engine standards outline a set of guidelines which web site administrators can use to promote or prevent indexing of various parts of their sites. The access control is centrally administered for the website using a file named robots.txt. In addition, META tags can be placed in individual pages. These controls provide a lot of flexibility in deciding what gets indexed and what does not. More information about this method can be found at http://www.robotstxt.org.

However, we do not generally recommend this method due to the following drawbacks:

  1. There is only one robots.txt per site. Users relying on this file would need to coordinate with the Lab Staff prior to making any structural changes to their documents. Users administering virtual web sites or running their own servers may find this method acceptable.
  2. The contents of the robots.txt file are enforced by the search engine and not by the web server. The OIT search engine will honor the site robots.txt file and "robots" META tags in page headers, by OIT policy. However, if no other access controls are used, non-Duke search engines indexing the site are not guaranteed to obey this protocol.
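
To illustrate the second drawback, the following is a minimal sketch of how a cooperating crawler might consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module. The robots.txt contents, the example.edu URLs, and the "example-crawler" user-agent name are all hypothetical and chosen only for illustration; they are not the actual Duke configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text the way a well-behaved crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawler_may_fetch(parser, user_agent, url):
    """Return True if the parsed rules permit this crawler to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for url in (
        "http://www.example.edu/private/notes.html",
        "http://www.example.edu/public/index.html",
    ):
        print(url, crawler_may_fetch(parser, "example-crawler", url))
```

Note that this check runs inside the crawler, not on the web server. A crawler that never performs it can still request the disallowed pages, which is why robots.txt should not be the only access control.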

IP restricted access

In the past, the most common method of preventing unauthorized access to web pages was through the use of an .htaccess file limiting access to computers on a certain network. However, because the OIT web crawler runs from within the campus network, this method now requires additional configuration. If pages are restricted to be only viewable from within duke.edu, they will be directly inaccessible to viewers outside of Duke. But because the search engine will cache copies of pages it indexes, these cached copies will be visible to people outside of Duke!

As of March 2008, this is even more problematic because OIT has not stabilized the name of the search engine, so the rules needed to deny access to this machine change over time.
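
The logic behind this problem can be sketched in a few lines of Python using the standard ipaddress module. This is only a simplified model of network-based access control, not Apache's actual rule evaluation. The network and host addresses are hypothetical values taken from the ranges reserved for documentation; they are not real Duke addresses.

```python
import ipaddress

# Hypothetical values from the reserved documentation ranges.
CAMPUS_NETWORK = ipaddress.ip_network("198.51.100.0/24")
CRAWLER_ADDRESS = ipaddress.ip_address("198.51.100.25")


def is_allowed(client, allowed_networks, denied_hosts=()):
    """Simplified model: an explicit host denial wins, then network allows apply."""
    address = ipaddress.ip_address(client)
    if address in denied_hosts:
        return False
    return any(address in network for network in allowed_networks)


if __name__ == "__main__":
    # A campus-wide allow rule also admits the crawler, since it is on campus.
    print(is_allowed("198.51.100.25", [CAMPUS_NETWORK]))
    # A visitor from outside the campus network is refused.
    print(is_allowed("203.0.113.7", [CAMPUS_NETWORK]))
    # The crawler is refused only when it is denied explicitly.
    print(is_allowed("198.51.100.25", [CAMPUS_NETWORK], denied_hosts=[CRAWLER_ADDRESS]))
```

The explicit denial only works while the crawler keeps the same address or name, which is why an unstable name for the search engine makes this approach hard to maintain.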

Password protected access

The most secure method for protecting pages is to require password authentication. This is accomplished using the authentication mechanism provided by the .htaccess file. There are three basic authentication methods:

  1. Shibboleth, the preferred method, uses Duke NetIDs to control access. Access can be limited to a specified group of users, or enabled for any person with a valid NetID.
  2. Webauth is a method similar to Shibboleth, but it is no longer fully supported and its use is deprecated.
  3. Apache Basic Authentication can be used to grant access to users who do not have Duke NetIDs. Users will need to maintain and distribute the access credentials to the users they wish to have access. Due diligence on the part of the user will be needed to maintain the security and viability of this method. Please contact the Lab Staff for more information if you wish to use this method.
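
As a rough illustration of why password protection keeps a crawler out, the sketch below models the check behind HTTP Basic Authentication in Python. A request without valid credentials is refused no matter which network it comes from. The user name, password, and function names are hypothetical. For simplicity the sketch compares plain-text passwords, whereas a real Apache setup would normally keep hashed entries in a password file.

```python
import base64
import hmac

# Hypothetical credentials, stored in plain text only to keep the sketch short.
CREDENTIALS = {"guest": "s3cret"}


def basic_auth_header(username, password):
    """Build the Authorization header value a browser sends for Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def is_authorized(header, credentials):
    """Return True only if the header carries a known user name and password."""
    if not header or not header.startswith("Basic "):
        return False
    try:
        raw = base64.b64decode(header[len("Basic "):], validate=True)
        decoded = raw.decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    username, separator, password = decoded.partition(":")
    if not separator:
        return False
    expected = credentials.get(username)
    if expected is None:
        return False
    # Constant-time comparison of the supplied and expected passwords.
    return hmac.compare_digest(expected.encode("utf-8"), password.encode("utf-8"))


if __name__ == "__main__":
    # A crawler that sends no credentials is refused.
    print(is_authorized(None, CREDENTIALS))
    # A visitor with the distributed credentials is admitted.
    print(is_authorized(basic_auth_header("guest", "s3cret"), CREDENTIALS))
```

Because the header is only base64-encoded, not encrypted, anyone who obtains it can recover the password. This is one reason careful handling of the credentials matters with this method.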

Stealth

Finally, pages can avoid being indexed by not being directly linked to from other pages. However, this method will fail if other users link to your page and their pages are indexed. This method is not considered to be reliable.