Proxycizer 0.95 source code walk-through
The purpose of this document is to give an overview of the code layout and programming style of the package, and to make it easier for anyone (including myself) to extend it. If there isn't much documentation on this page for a particular class or program, chances are you'll find plenty of comments in the source, so please look there too. This document tries to stay pretty high level, and describe the relationship between classes, not necessarily the implementation or interfaces themselves.
The code is split between two directories:
- libsrc, which contains the code for the Proxycizer library, containing nearly all classes used in the package, and
- progsrc, which contains the stand-alone programs' top-level procedures (such as main).
Library (libsrc/)
Proxy classes
These classes simulate aspects of a proxy (typically a Web proxy, but can be used in other contexts).
- Proxy.h
- All classes below inherit from the abstract base class Proxy, which mainly exports a Request() method to accept object requests. The Proxy class itself does nothing. It merely provides a common interface; derived classes must do all the work. A ProxyStats object is instantiated in this class; most derived classes add their own stats to this so just calling the ProxyStats::Print() method on it will print out stats for all classes involved. Also maintains a unique unsigned integer id for each instantiated Proxy.
- ProxyCache.{h,cc}
- Sticks a basic cache with LRU replacement onto the proxy. Can be constructed using any of the following constructors:
- ProxyCache(unsigned long cachemax, long objectsizemax = -1);
- ProxyCache(Graph * gin, Vertex * vin, unsigned long cachemax, long objectsizemax = -1);
- ProxyCache(HTableSelfHashKeyPtr<ObjID, CacheElem *> * cachepin, unsigned long cachemax, long objectsizemax = -1);
- ProxyCache(Graph * gin, Vertex * vin, HTableSelfHashKeyPtr<ObjID, CacheElem *> * cachepin, unsigned long cachemax, long objectsizemax = -1);
A ProxyCache stores the objects (represented by the type CacheElem, described below) in a hash table cachep. The hash table is indexed by ObjID * (see below). So, to find out if an object is in the cache, you generate an ObjID *, use it to index into the hash table, and get back the corresponding CacheElem * if the object exists.
ProxyCache can operate in two distinct modes. They are distinguished by whether the ProxyCaches share a common hash table or not. In the case where they do not share a hash table, the CacheElem pointers stored in the hash table point to ProxyCacheElem objects. Otherwise, a shared hash table contains pointers to BorgCacheElem objects. The second (shared) mode may be more efficient.
A ProxyCache knows what mode it is operating in by the way it is constructed. If it is not passed a pointer to a hash table (as in the first two constructors listed above), then it assumes it is a normal proxy, and it allocates its own private hash table. If the constructor is given a hash table, then we assume we are in Borg cache mode, and the given hash table is assumed to be shared among several proxies.
Since the CacheElem classes are defined only in ProxyCache.h, they will be described here:
- class ProxyCacheElem
- This class describes an object stored in the cache. It contains only the info needed for cache maintenance and request validation, such as a object identifier (ObjID *) (with enough info to validate incoming object requests), the time at which this object was entered into the cache, the size of the object, and as a virtue of deriving from DLListElem, it includes linked list pointers to allow the ProxyCache to perform fast LRU replacement.
- class BorgCacheElem
- This class contains a Vector of all the locations in which the object is stored. Each location is represented by a LocInfo object, which is merely derived from ProxyCacheElem, but extended with a pointer to the Proxy that is storing it.
So, to sum up, while a non-Borg cache has a collection of ProxyCacheElems in its hash table, a Borg cache has a collection of lists of ProxyCacheElems (or, more accurately, a derived class LocInfo) in its hash table.
In addition, ProxyCache has graph support (if Proxycizer is compiled with the Stanford GraphBase) which is enabled when a Graph object is given in the constructor; this functionality is described in a later section.
- ProxyCCP.{h,cc}
(This replaces ProxyCacheQueryable in earlier versions.) Derived from ProxyCache, this class represents proxies that can cooperate with each other. When you construct a ProxyCCP object, you have to send a pointer to a protocol object of type CCP, which implements the actual cooperation mechanisms. This class overrides the ProxyCache::GetFromElsewhere() method to ask the protocol object if it knows where to get the object. If so, the request goes to that proxy, otherwise we do an upcall to the default ProxyCache::GetFromElsewhere().
The Added, Discarded, and Touched hooks (i.e. they are called when the specified event occurs) from ProxyCache are just redirected into the protocol object.
- ProxyDB.h
- This proxy uses a db database to return objects of the correct size. The database should be either an uncompressed file of fixed-length records, ordered sequentially by URL, or generated by proxylog2db. This file is out of date, is not compiled by default, and likely does not compile.
Cache cooperation protocol (CCP) classes
These are the protocol objects used in ProxyCCP.
- CCP.{h,cc}
- All cache cooperation protocol objects must implement this interface.
- CCPICP.{h,cc}
- CCPPSD.{h,cc}
- CCPRD.{h,cc}
- CCPRPDSD.{h,cc}
- CCPVC.{h,cc}
- These are various implementations of the CCP interface, implementing different mechanisms for cache cooperation.
Object Identifiers
These classes identify an object in a ProxyCache, are used to hash into the ProxyCache's hash table, and also to validate object requests.
- ObjID.h
- ObjIDMin.h
- ObjIDHTTP.h
Though ProxyCache stores object identifiers as ObjID *, it generates these from the UberLogEntry using two different UberLogEntry methods, NewObjIDHash() and NewObjIDValidate(); the identifiers returned by the former are used as the "keys" in the hash table, and the latter are stored in the CacheElems themselves to be used for validating incoming requests with the cached object (e.g. is the start time of the incoming request greater than the "expires" time in the cached object?). The "hashing" IDs are merely used for the hash function, so can potentially be smaller than the "validating" IDs, saving space. ObjIDMin is an example of a "hashing" ID and ObjIDHTTP is an example of a "validating" ID. When you write your own ObjIDs, you can certainly use the same ID type as the return types for your own UberLogEntry.
You may ask, why can't we use the same ObjID * for both the key and the one stored in the CacheElem, since they're both pointers? We could, but when we're in Borg cache mode, there may be several ProxyCacheElems for every key in the hash table, one for each location, and the IDs for each location may be different (time entered, for example). Even if they're the same, using different objects makes memory management easier (i.e. we know when its safe to delete either pointer). Using a reference counted ObjID type might be a reasonable approach here; this is under investigation.
Trace readers
These classes are used to iterate over traces.
- TraceReader.{h,cc}
- This base class takes care of opening files/pipes and providing iterator methods. It should templated on a type (typically derived from LogEntry) that provides the method GetOps(), which should return a pointer to a new object of type LogEntryOps. If constructed using a filename, it will also accept compressed files (assuming the file has an extension .gz). Typical use of a TraceReader:
for (trp->First(); !trp->Done(); trp->Next()) { const UberLogEntry & entry = trp->Current(); }- TraceReaderSequence.{h,cc}
- Use this class to treat several proxy trace logs as one large (concatenated) log. The Open() method can be called more than once to set up a series of log files. TraceReaderSequence is a TraceReader, and acts just like one. Even First() works correctly.
- BufferedTraceReader.{h,cc}
- BufferedTraceReader keeps a buffer cache of entries while walking through the log. The "B-" versions of the iterator methods add elements to the buffer cache, and allow the user to traverse the buffer cache, reaccessing previously seen elements. Elements are deleted with BDelete(); any element in the buffer cache can be deleted using this method.
Trace writers
- TraceWriter.{h,cc}
- This is fairly self explanatory. Similar to TraceReader, but exports a Write() method to write a entry (of templated type, usually derived from LogEntry) to a file.
Log entry classes
The opening of files is done by the TraceReader classes described previously, but the reading and parsing of a log entry is delegated to these classes and their corresponding -Ops classes. The data classes are all derived from LogEntry. They represent the data contained in one log entry. The operation classes are all derived from LogEntryOps. These define a mechanism for reading/writing their corresponding data class from/to their external representations via a file/pipe/stream/buffer. A LogEntryOps class exports the following methods:
- ReadFirst(LogEntry **lepp, const void *data, size_t *len)
- Read(LogEntry **lepp, const void *data, size_t * len)
- ReadFirst(LogEntry **lepp, FILE *fp)
- Read(LogEntry **lepp, FILE *fp)
- ReadFirst(LogEntry **lepp, istream & is)
- Read(LogEntry **lepp, istream & is)
- WriteFirst(const LogEntry *lep, void *data, size_t * len)
- Write(const LogEntry *lep, void *data, size_t * len)
- WriteFirst(const LogEntry *lep, FILE *fp)
- Write(const LogEntry *lep, FILE *fp)
- WriteFirst(const LogEntry *lep, ostream & os)
- Write(const LogEntry *lep, ostream & os)
All "read" methods return a new LogEntry in *lepp if *lepp == NULL. Otherwise they assume that *lepp points to enough storage to hold the resulting LogEntry. UberLogEntrys are a special case. See UberLogEntry for details.
Each TraceReader will only use one LogEntryOps object, which may construct and allocate many LogEntrys. This enables the "long-lived" LogEntryOps object to keep state through iteration over the many LogEntrys in a file. See UberLogEntry for an example.
- Mergeable.h
- LogEntry.h
- LogEntryText.{h,cc}
These are the abstract LogEntry classes. LogEntry inherits a comparison operator operator <(const Mergeable &) from Mergeable, which uses the _orderval member to enable any set of objects in any of these classes to be ordered (by time, by default).
LogEntryText and LogEntryTextOps provide some common infrastructure for reading text-based log files (meaning all but DEC and UCB log files).
- LogEntryCrispySquidAcc.{h,cc}
- LogEntryDEC.{h,cc}
- LogEntryHarvestAcc.{h,cc}
- LogEntryHarvestHier.{h,cc}
- LogEntrySimclient.{h,cc}
- LogEntrySquidAcc.{h,cc}
- LogEntryUCB.{h,cc}
These files contain the data and operation classes for each supported log type.
- UberLogEntry.{h,cc}
The UberLogEntryOps class attempts to automatically determine what type of trace it is looking at during the first read attempt, and generates the correct type of UberLogEntry-derived log entry objects. If it can't determine the type from the first 8K of the file, it bails with an error message. This class can usually detect Squid (Crispy or not) access logs, Harvest (Crispy or not) access/hierarchy logs, simclient logs, UCB Home-IP logs, and defaults to DEC logs when all else fails.
The "read" methods in UberLogEntryOps work a little differently than most LogEntryOps classes. UberLogEntrys store a pointer to a "real" LogEntry of the type automatically determined on the first read. If the LogEntry **lepp argument in a "read" method points to NULL, a new "real" LogEntry is allocated. However, if it points to valid storage, then that storage is used for the "real" LogEntry, not for the enclosing UberLogEntry. A new UberLogEntry is allocated in every case, and is returned in *lepp. So the value of *lepp always changes on success.
For those that wish to specify the storage for the UberLogEntry, too, the copy(void * uledata, void * realdata) method will copy the current UberLogEntry into storage pointed to by uledata and the encapsulated "real" LogEntry will be copied into storage pointed to by realdata. copy() assumes that the pointers are valid and that the storage they point to is large enough to store the resulting data structures. See sortproxytrace.cc for an example of why this may be useful.
Statistics
- Stats.h
Based on Jason Kastner's perl module Statistics::Descriptive.pm, this class accepts data and will spit out various stats on demand. Stats throws away data after it has calculated and saved the info that it needs. FullStats, however, saves the given data, and enables other statistics, such as Median, Mode, and histograms.
- ProxyStats.h
This class is used by the Proxy class and its subclasses. It provides a way to store (groups of) counters, and a Print() function to print them out nicely. Counters in the same "bracket" (group) are printed separately from other brackets, and counters are also labeled with their fraction (percentage) of the sum of all the other counters in the bracket.
Miscellaneous utilities
Many of these classes are provided courtesy of Owen Astrachan. No doubt, a few of these have gone through enough modification to be totally unrecognizable to him.
- Pair.h
- LList.{h,cc}
- DLList.{h,cc}
- Table.h
- HTable.{h,cc}
- Iterator.h
- HIterator.{h,cc}
- Vector.h
Miscellaneous utilities, 2ème Partie
- EventManager.{h,cc}
Used by simclient and webulator, this class provides an event-driven infrastructure, where events can be triggered by file events or at given times.
- DNSCache.{h,cc}
Used by some Proxy classes because some systems don't provide DNS caches of their own, and generate DNS queries on every gethostbyname().
- Allocator.h
Templated type-specific memory allocator/deallocator.
- tentry_v2.{h,cc}
A wrapper for proxytrace2txt_v2.h
- exiterr.h
An exit() that returns the line number as the error code.
- vtoh.h
Little-endian (VAX) to host byte-order functions.
Programs (progsrc/)
Proxy simulation
- runproxies.cc
runproxies encapsulates the functionality of several earlier proxy simulation programs (e.g. deepharvest, rpdsd, etc.). It allows you to choose a proxy architecture and a proxy cooperation protocol and stream traces through it. Several options are available; run runproxies without arguments to see the help message. runproxies also can enable graph/topology support in ProxyCache.
runproxies has three phases; setup, simulation, and printing statistics. Most of the simulation work is done in clientproxy.cc.
- clientproxy.{h,cc}
- The function clientstoproxies grabs entries from trace files and sends them to a list of proxies. Requests can be assigned to proxies in several ways, either by client, where clients are assigned round-robin to proxies the first time they appear in the trace file, or by reference, where each request is sent to a server in round-robin fashion, or by hashing the URLs among the proxies. Also, separate trace files could be used, one per proxy; in this case, we grab entries from the trace files in order of the entry's start time.
Trace analysis
- distances.{h,cc}
- distances is a horrible name for a very useful tool. It runs through trace files and spits out sundry information about sharing and hit ratio properties of the trace; e.g. how many objects/references are used by n clients, how many compulsory misses are there, what is the ideal hit ratio given a particular cache size and number of proxies, etc.
- countthings.{h,cc}
- A template to walk through traces and do simple analysis, like counting clients, etc.
- clientrates.{h,cc}
- Find out the rates at which clients request objects.
Trace manipulation
- filterlogs.{h,cc}
- Remove all entries that could be filtered by ideal per-user caches.
- sortproxytrace.{h,cc}
- Sort trace files by any of several elements (time, url, etc.) or merge several files together by time. Sorting requires the TPIE library.
- log2txt.cc
- Print out a human-readable representation of a trace file.
- correlate.cc
- Takes simclient logs and traces from proxies used in a simclient experiment and generates a one-line summary of each request, even for requests that bounced between several proxies.
- splitlogs.cc
- Split a trace file into several trace files, by client subsets (e.g. every fourth client seen).
- makefaster.cc
- Stretch or compress time so that the events in a trace happen at a faster or slower rate. Only works with DEC traces; was used for simclient, but functionality is subsumed into simclient's --timefactor option.
- proxylog2db.cc
- Convert a trace file into a database that can be used by webulator or ProxyDB.
Traffic generators/receivers
- simclient.cc
- Generates real traffic (based on trace files) to a Web proxy.
- webulator.{h,cc}
- A fake web server that is used to receive web requests and return dummy objects with fixed size or with object lengths stored in a database.
Miscellaneous
- opts.{h,c}
- These files implement a module for command-line options parsing. It is written in C, compiled by the C compiler, and does not depend on any other Proxycizer code (though it does make use of config.h generated by configure to see if it can support long long types, and to enable support for size_t, time_t and such. So, in theory, this is a portable command-line option parser that can be used in any C or C++ program. See runproxies.cc for an example of how to use this.
Syam Gadde Last modified: Fri Feb 11 19:11:48 EST 2000