Hierarchical Web proxy caches descended from the Harvest project [3] are now in wide use. Popular Harvest descendents include the commercial product NetCache [4] and the public-domain Squid caches [20]. Hierarchical Web caches are built as a collective of independent proxy caching servers that share their cache contents through a simple multicast query mechanism using the Internet Cache Protocol (ICP) [19]. The National Laboratory for Applied Network Research has engineered a structure of Squid caches spanning the United States, and is coordinating with participants across the world to build a ``global mesh'' of Web proxy caches.
However, our experiments have shown that hierarchical caches have scalability and coverage problems that manifest themselves in large cache configurations [9]. Other researchers have confirmed these observations [17,6]. The CRISP project is exploring alternative Web cache structures that provide more scalable and efficient structures for large, geographically dispersed collective Web caches. The key idea behind the CRISP Web cache architecture is to supplement each caching server's local cache directory (i.e. the index of all the documents the server holds locally) with an independently managed global directory that allows it to locate objects cached at other servers. This frees the individual caching servers from responding to directory queries, and allows the global directory to scale independently of the caches themselves.
A CRISP cache (shown in Figure 1) consists of a set of caching servers (proxies), which directly field requests from clients/browsers, and a mapping service, which is responsible for maintaining the global directory of objects stored in the caching servers. On a local cache miss, each caching server queries the mapping service to determine if the object is resident elsewhere in the collective cache. If the mapping service reports a hit, then the object is fetched directly from the peer caching server. Caching servers keep the global directory current by reporting object fetches and evictions to the mapping service.
The key design issue for a CRISP cache is the structure of the mapping service. Several factors influence the placement of proxies for a particular environment; likewise, different proxy organizations dictate different structures for the mapping service. The first CRISP prototype [10] was based on a set of independent mapping servers, each maintaining an exclusive portion of the global directory. This model is well-suited to typical ISP environments in which proxies are ``close'' to each other with respect to network latency, and the cost of a mapping service probe is small compared to the total perceived client latency. In this paper we describe extensions to this structure that replicate some or all of the global directory in order to reduce query latency in large cache configurations.
As a basis for evaluating the CRISP cache architecture, we implemented several CRISP variants in Crispy Squid, an extension to a recent Squid release. Crispy Squid introduces a configurable CRISP mapping service to replace the multicast ICP probes used in Squid and other Harvest-derived caches. Crispy Squid caches can deliver hit ratios equal to or better than ICP caches, with lower query costs for both hits and misses. Moreover, a replicated global directory can reduce query latency further by removing network communication from the critical path of probes, at the expense of higher overhead to store and manage the replicas. Crispy Squid supports several alternatives for managing replicated directories that exploit the observed properties of Web access to reduce this overhead, while delivering close to ideal hit ratios.
Our primary approach to evaluating Web cache prototypes is to subject them to traffic loads corresponding to real Web access traces, which are now widely available. To support our experiments we have developed a suite of tools called Proxycizer for replaying Web traces under controlled conditions. Proxycizer accommodates multiple trace formats, allows a choice of request timing properties for average-case or worst-case analysis, and can emulate a variety of Web server characteristics.
This paper is organized as follows. Section 2 summarizes the problems with hierarchical ICP caches, and outlines a range of CRISP alternatives that respond to the factors influencing Web cache design. Section 3 describes the implementation of configurable CRISP caches in Crispy Squid. Section 4 describes our experimental methodology and the Proxycizer package for evaluating performance of Web caches by subjecting them to trace-driven load. We conclude in Section 5.