2 Why Collective Caches?

 
  
Figure 1: Reference breakdown for 25-day trace.
[figure: pie chart, pics/fullbreakdowninfinite-pie.eps]

In this section we present reference studies of the 25-day DEC trace to motivate the collective architectures explored in this paper. All of our studies except those using real Web proxies use a basic LRU replacement scheme; the behavior of real proxies may differ somewhat when objects are invalidated due to age, HTTP expiry indicators (not recorded in the trace), and similar factors.
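
For concreteness, the basic LRU replacement policy used in these studies can be sketched as a small data structure. This is only an illustrative sketch, not the simulator used for the trace studies; the class and method names are our own. (TypeScript is used here purely for illustration; a Map preserves insertion order, so the first key is always the least recently used.)

    // Illustrative LRU cache keyed by URL; not the paper's simulator.
    class LRUCache<K, V> {
      private entries = new Map<K, V>();
      constructor(private capacity: number) {}

      get(key: K): V | undefined {
        const value = this.entries.get(key);
        if (value !== undefined) {
          // Refresh recency: re-insert so the entry becomes most recently used.
          this.entries.delete(key);
          this.entries.set(key, value);
        }
        return value;
      }

      put(key: K, value: V): void {
        if (this.entries.has(key)) {
          this.entries.delete(key);
        } else if (this.entries.size >= this.capacity) {
          // Evict the least recently used entry (the first key in iteration order).
          const lru = this.entries.keys().next().value as K;
          this.entries.delete(lru);
        }
        this.entries.set(key, value);
      }
    }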

Figure 1 shows that almost 89% of the references in this trace are to static documents that can be cached. About 70% of all references are potential hits; the purpose of any Web caching architecture is to serve as many of these accesses as possible from the cache at low cost.

The trace data provides further evidence to support a central premise of our work, that the highest hit ratios are achieved by shared caches serving large user communities:

Although the 25-day trace shows a clear benefit to large, shared caches, experience has shown that the scalability of individual proxy servers is limited. While we can expand proxy server capacity by scaling the hardware and streamlining the software, a simpler and ultimately lower-cost solution is to install additional proxy servers for portions of the reference stream. The following subsections consider three basic choices for structuring distributed Web caches.

2.1 Option 1: Isolated Proxy Caches


  
Figure 4: Best achievable 25-day hit ratios for isolated proxies, with round-robin client assignment and ideal per-user caches.
[figure: bar chart, pics/fulluncoopbreakdowns-bar.eps]

It is easy to build a scalable Web cache by distributing users across a collection of isolated caching servers. However, this structure sacrifices most of the benefits of a shared cache. Figure 4 shows the best hit ratios achievable by such an ``uncooperative'' collective cache as a function of the number of independent caching servers, with balanced round-robin client assignment. As in Figure 2, these shared proxies supplement ideal per-user caches.

Adding isolated proxies to serve a given client population causes hit ratios to decline because it decreases the probability that a client accessing a shared object for the first time can benefit from another client's previous reference to the object. A collective cache of 32 isolated proxies (540 clients per proxy) can yield hits for less than half of the shared references; the absolute achievable hit ratio for the trace drops to 21%, down from 53% for a monolithic shared cache.

2.2 Option 2: Partitioned Cache

A straightforward way to build a collective cache that avoids the problems with isolated proxies is to partition the document set across the servers by ``homing'' each document on a specific proxy using a URL hashing function [18]. We refer to this structure as a Partitioned Proxy Cache (PPC). A similar cache organization is used in NetCache [4] (the commercial continuation of the Harvest Cache project). With a good hashing function, PPC can deliver hit ratios comparable to those of a monolithic shared cache, while balancing load among the proxies. PPC is supported by the proxy auto-configuration [2] feature of popular browsers, in which the browser executes a user-provided script to select the proxy to query for a particular reference. Netscape's facility can be programmed to mask failures using a timeout/failover/retry mechanism.
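
As an illustration of hash-based proxy selection, the sketch below mimics the structure of a proxy auto-configuration script. Real auto-configuration scripts are plain JavaScript functions with the FindProxyForURL(url, host) signature shown here; the proxy names, the toy hash function, and the failover order are hypothetical and are not drawn from NetCache or from this paper.

    // Hypothetical proxy list; any number of PPC servers could be listed.
    const proxies = [
      "proxy0.example.edu:3128",
      "proxy1.example.edu:3128",
      "proxy2.example.edu:3128",
      "proxy3.example.edu:3128",
    ];

    // Toy string hash over the URL; any well-mixing hash function will do.
    function urlHash(url: string): number {
      let h = 0;
      for (let i = 0; i < url.length; i++) {
        h = (h * 31 + url.charCodeAt(i)) >>> 0;
      }
      return h;
    }

    // Home each URL on one proxy; listing a second proxy and DIRECT lets the
    // browser mask a failed proxy with its timeout/failover/retry mechanism.
    function FindProxyForURL(url: string, host: string): string {
      const h = urlHash(url);
      const home = proxies[h % proxies.length];
      const backup = proxies[(h + 1) % proxies.length];
      return `PROXY ${home}; PROXY ${backup}; DIRECT`;
    }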

Pure PPC suffers from several weaknesses that motivate the more sophisticated schemes in this paper. First, it relies on a hash function that distributes objects and references evenly among the proxies. This is difficult to achieve in practice, and it may be necessary to adjust the proxy sizes to match observed reference patterns, or to run the cache below capacity to minimize queuing delays at the most heavily loaded caching servers. Second, all participants must be made aware of a new hash function whenever proxies join or leave a PPC cache, and performance will degrade temporarily as the proxies flush, reorganize, or rebuild their contents to accommodate the new hash function.
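
To see why membership changes are disruptive, consider simple modulo hashing: growing the collective from N to N+1 proxies re-homes roughly N/(N+1) of all objects, so nearly every object must be flushed or migrated. The toy helper below, with an assumed hash function and no connection to any particular proxy implementation, measures that effect on a list of URLs.

    // Toy illustration: with modulo hashing, changing the proxy count re-homes
    // most objects (about N/(N+1) of them when growing from N to N+1 proxies).
    function toyHash(url: string): number {
      let h = 0;
      for (let i = 0; i < url.length; i++) h = (h * 31 + url.charCodeAt(i)) >>> 0;
      return h;
    }

    function fractionRehomed(urls: string[], oldCount: number, newCount: number): number {
      const moved = urls.filter(
        (u) => toyHash(u) % oldCount !== toyHash(u) % newCount
      ).length;
      return moved / urls.length;
    }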

Most importantly, PPC may impose high latency and network overhead for both hits and misses, since the proxy handling an access is predetermined by the referenced document, independent of the user issuing the reference or that user's location in the network. This is a problem primarily if the individual proxies are dispersed in the network, as we envision for caches serving large user communities.

2.3 Option 3: Collective Cache with Remote Queries

The third option is to partition the clients into groups, each bound to an autonomous caching server as in option 1, but also to supplement the collective with a remote querying mechanism. The purpose of the query mechanism is to bridge the gap in hit ratios between monolithic shared caches (Figure 2) and uncooperative collective caches (Figure 4), balancing the benefits of a shared cache with the scalability of isolated proxy servers.

On a local cache miss, a participating server probes the collective cache for the missing document before contacting a Web server to service the miss. This structure presents an opportunity to reduce latency and network traffic by (1) placing each proxy along the network path between the users it serves and the wide-area network link, and (2) placing the objects most likely to yield hits into caching servers closest to the users most likely to reference them. Moreover, the auto-configuration mechanism can be used to implement a dynamic binding of clients to proxies in order to balance load or mask proxy failures.
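
A minimal sketch of this miss path is shown below. The helper names (peerHas, fetchFromPeer, fetchFromOrigin) and the sequential probe loop are placeholders of our own; the actual query mechanism, and whether probes are broadcast or directed by a directory, is exactly the design question taken up in the next section.

    // Sketch of the request path at one proxy in a collective cache with
    // remote queries; all helper functions are hypothetical placeholders.
    interface LocalCache {
      lookup(url: string): string | null;
    }

    async function handleRequest(
      cache: LocalCache,
      peers: string[],
      url: string,
      peerHas: (peer: string, url: string) => Promise<boolean>,
      fetchFromPeer: (peer: string, url: string) => Promise<string>,
      fetchFromOrigin: (url: string) => Promise<string>
    ): Promise<string> {
      // 1. Local hit: serve directly from this proxy.
      const local = cache.lookup(url);
      if (local !== null) return local;

      // 2. Local miss: probe the collective before leaving the cache.
      for (const peer of peers) {
        if (await peerHas(peer, url)) {
          return fetchFromPeer(peer, url); // remote hit within the collective
        }
      }

      // 3. Collective miss: fetch from the origin Web server.
      return fetchFromOrigin(url);
    }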

All of the collective cache architectures considered in the remainder of this paper use this option. At issue is the question of which query mechanism delivers the best hit ratios with the lowest cost and network overhead. The next section compares two fundamental implementation choices for queries in collective caches.

