COD: Cluster-On-Demand
Computer Science Department
Duke University



COD is a cluster site manager for the Open Resource Control Architecture (Orca) project in the NICL lab at Duke Computer Science.

Clustering inexpensive computers is an effective way to obtain reliable, scalable computing power for network services and compute-intensive applications. Since clusters have a high initial cost of ownership, including space, power conditioning, and cooling equipment, leasing or sharing access to a common cluster is an attractive solution when demands vary over time. Shared clusters offer economies of scale and more effective use of resources by multiplexing.

Users of a shared cluster should be free to select the software environments that best support their needs. Cluster-On-Demand (COD) is a system that enables rapid, automated, on-the-fly partitioning of a physical cluster into multiple independent virtual clusters. A virtual cluster is a group of physical or virtual machines configured for a common purpose, with associated user accounts and storage resources, a user-specified software environment, and a private IP address block and DNS naming domain. COD virtual clusters are dynamic: their node allotments may change according to competing demands or resource availability.
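As a rough illustration of the definition above, a virtual cluster can be modeled as a small record of its naming domain, address block, and software environment, plus a node set that grows and shrinks over time. This is a minimal sketch, not COD's actual data model: the class name, fields, and methods are all hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of a virtual cluster; names are illustrative, not COD's API.
final class VirtualCluster {
    private final String dnsDomain;   // private DNS naming domain
    private final String ipBlock;     // private IP address block, CIDR notation
    private final String image;       // user-specified software environment
    private final Set<String> nodes = new LinkedHashSet<>();

    VirtualCluster(String dnsDomain, String ipBlock, String image) {
        this.dnsDomain = dnsDomain;
        this.ipBlock = ipBlock;
        this.image = image;
    }

    // Node allotments are dynamic: nodes join and leave over time.
    boolean addNode(String nodeName) { return nodes.add(nodeName); }
    boolean removeNode(String nodeName) { return nodes.remove(nodeName); }

    Set<String> nodes() { return Collections.unmodifiableSet(nodes); }
    String ipBlock() { return ipBlock; }
    String image() { return image; }

    // Fully qualified name of a node inside this cluster's DNS domain.
    String fqdn(String nodeName) { return nodeName + "." + dnsDomain; }
}
```

For example, `new VirtualCluster("vc1.example.org", "10.1.0.0/24", "linux-base")` followed by `addNode("node01")` yields a cluster whose node resolves as `node01.vc1.example.org`; the domain, address block, and image name here are made-up values.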


COD is implemented as a secure Web service that coordinates standard services for network administration: PXE/DHCP network booting, the Domain Name System (DNS), the Network File System (NFS) automounter, and user authentication and account lookup through Pluggable Authentication Modules (PAM) and the Name Service Switch (NSS). The COD service at a cluster site is accessible through a Web-based front end, and also programmatically through a service protocol interface.

COD is a cornerstone of our research on service-oriented architectures for networked utility computing. It is written as a Shirako plug-in component for managing physical and virtual machines, as part of the Cereus project. Shirako provides generic functionality for leasing any type of shared resource; COD implements the specific functionality for sharing machines, which are described by their CPU speed, memory allotment, disk space, and filesystem image and are partitioned into isolated virtual clusters. Because Shirako provides pluggable policies for resource arbitration between competing users, COD itself is policy-agnostic: it provides the fundamental mechanisms for automating the deployment (and re-deployment) of machines within a cluster site.
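The separation between mechanism and pluggable policy can be sketched as an interface that chooses a machine and a mechanism that simply applies whatever the policy returns. The types below are hypothetical stand-ins; the real Shirako and COD interfaces are not shown on this page, and first-fit is used only as an example policy.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical description of a machine by the attributes named in the text.
final class Machine {
    final String id;
    final int cpuMhz;
    final int memoryMb;
    final int diskGb;

    Machine(String id, int cpuMhz, int memoryMb, int diskGb) {
        this.id = id;
        this.cpuMhz = cpuMhz;
        this.memoryMb = memoryMb;
        this.diskGb = diskGb;
    }
}

// The policy decides which free machine satisfies a request.
interface AssignmentPolicy {
    Optional<Machine> select(List<Machine> free, int minCpuMhz, int minMemoryMb, int minDiskGb);
}

// One possible policy: take the first free machine that meets every minimum.
final class FirstFitPolicy implements AssignmentPolicy {
    @Override
    public Optional<Machine> select(List<Machine> free, int minCpuMhz, int minMemoryMb, int minDiskGb) {
        return free.stream()
                .filter(m -> m.cpuMhz >= minCpuMhz
                        && m.memoryMb >= minMemoryMb
                        && m.diskGb >= minDiskGb)
                .findFirst();
    }
}
```

Swapping in a different `AssignmentPolicy` implementation (best-fit, priority-based, and so on) would change which machine is chosen without touching the code that deploys it.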

COD was conceived in 2001 to control physical machines with database-driven network booting (PXE/DHCP). The physical booting machinery is now familiar: in addition to controlling the IP address bindings assigned by PXE/DHCP, the node driver controls boot images and options by generating configuration files served via TFTP to standard bootloaders (e.g., GRUB). Our current Java-based implementation (begun in 2004) also manages virtual machines using the Xen hypervisor. Supporting both physical and virtual machines offers useful flexibility: blocks of physical machines can be assigned dynamically to boot Xen and then added to a resource pool for dynamic instantiation of virtual machines.
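To make the idea of generated boot configuration concrete, the sketch below renders a per-host file in PXELINUX's configuration format. PXELINUX is used here only as a stand-in because its format is compact; the page names GRUB as an example bootloader and does not say which file formats COD's node driver actually emits. The class name, the `cod` label, and all parameter values are hypothetical.

```java
import java.util.Locale;

// Hypothetical generator for a per-host network boot configuration file.
final class BootConfig {
    // PXELINUX looks up per-host files named "01-" (Ethernet) plus the
    // MAC address in lowercase hex with dashes instead of colons.
    static String configFileName(String mac) {
        return "01-" + mac.toLowerCase(Locale.ROOT).replace(':', '-');
    }

    // Render a single boot entry selecting the kernel, initrd, and kernel arguments.
    static String render(String kernel, String initrd, String kernelArgs) {
        return String.join("\n",
                "DEFAULT cod",
                "LABEL cod",
                "  KERNEL " + kernel,
                "  APPEND initrd=" + initrd + " " + kernelArgs,
                "");
    }
}
```

A driver following this pattern would write the rendered text to `pxelinux.cfg/<configFileName>` under the TFTP root, so that changing a node's boot image amounts to rewriting one small file before the node reboots.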

Figure 1: Diagram of a COD site.
A COD site authority drives cluster reconfiguration by manipulating data stored in a back-end directory server through the Lightweight Directory Access Protocol (LDAP) and by initiating and monitoring state transitions of the machines under its control. The COD LDAP schema extends the RFC 2307 standard for an LDAP-based Network Information Service. Standard open-source services exist to administer networks from an LDAP directory server (see links below). The DNS server for the site is an LDAP-enabled version of the standard BIND9, and for physical booting we use an LDAP-enabled DHCP server from the Internet Systems Consortium. In addition, guest nodes have read access to an LDAP subtree describing the containing virtual cluster. Guest nodes configured to run Linux use an LDAP-enabled version of AutoFS to mount NFS file systems, and a PAM/NSS module that retrieves user logins from LDAP.
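As an illustration of what a directory entry for a node might look like, the sketch below builds an LDIF record using object classes and attributes defined in RFC 2307 (`ipHost` with `ipHostNumber`, `ieee802Device` with `macAddress`) on top of the standard structural class `device`. COD's own schema extensions are not shown on this page, so the `ou=Hosts` container, the class name, and the example values are assumptions. The sketch also assumes host names contain no characters that need escaping in a DN.

```java
// Hypothetical helper that renders one host as an LDIF entry.
final class HostEntry {
    static String toLdif(String hostName, String baseDn, String ip, String mac) {
        // Assumed layout: hosts live under an "ou=Hosts" container below the base DN.
        String dn = "cn=" + hostName + ",ou=Hosts," + baseDn;
        return String.join("\n",
                "dn: " + dn,
                "objectClass: device",          // structural class carrying cn
                "objectClass: ipHost",          // RFC 2307: host name to IP address
                "objectClass: ieee802Device",   // RFC 2307: MAC address
                "cn: " + hostName,
                "ipHostNumber: " + ip,
                "macAddress: " + mac,
                "");
    }
}
```

Because LDAP-enabled DNS and DHCP servers read such entries directly, updating the IP or MAC attribute of an entry is enough to change what those services hand out; no separate configuration files need to be regenerated.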

COD should be comfortable for cluster site operators to adopt, especially if they already use RFC 2307/LDAP for administration. The directory server is authoritative: if the COD site authority fails, the disposition of the cluster is unaffected until it recovers. Operators may override the COD server with tools that access the LDAP configuration database directly.

Project Status

We use the LDAP-based COD approach to manage our "Devil Cluster" within the Duke Computer Science Department, which comprises over 300 servers with 18 terabytes of disk space. COD currently supports virtual clusters composed of Xen virtual machines and physical machines.

Our current COD prototype (CODv3) consists of about 4000 lines of Java code based on the Shirako toolkit for service-oriented resource leasing. COD itself includes scriptable resource drivers (node drivers) for service managers and COD site authority servers (see Figure 1), policy modules for node assignment at the site, and components to manage IP and DNS name spaces.

The support for virtual machines consists primarily of a modified node driver plugin and a new authority-side policy plugin to assign virtual machine images to physical machines. Only a few hundred lines of code know the difference between physical and virtual machines. The virtual node driver controls booting by opening a secure connection to a privileged domain on the Xen node, and issuing commands to instantiate and control Xen virtual machines.
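The control path described above can be sketched as a helper that builds the remote command lines a driver would run on the Xen node. This is only an assumption about the transport and tooling: the page says "a secure connection to a privileged domain" without naming SSH, and `xm` is simply the Xen management command of that period. The class and method names are hypothetical, and nothing is executed here.

```java
import java.util.List;

// Hypothetical builder for commands sent to the privileged domain (dom0).
final class XenCommands {
    // Start a guest from a Xen domain configuration file on the remote node.
    static List<String> createVm(String dom0Host, String configPath) {
        return List.of("ssh", "root@" + dom0Host, "xm", "create", configPath);
    }

    // Request a clean shutdown of a running guest.
    static List<String> shutdownVm(String dom0Host, String domainName) {
        return List.of("ssh", "root@" + dom0Host, "xm", "shutdown", domainName);
    }

    // Forcibly terminate a guest that does not shut down.
    static List<String> destroyVm(String dom0Host, String domainName) {
        return List.of("ssh", "root@" + dom0Host, "xm", "destroy", domainName);
    }
}
```

A caller could pass any of these lists to `new ProcessBuilder(command).start()` and check the exit status to learn whether the state transition succeeded.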

We expect an initial code release soon.

Many of our ongoing research efforts are focused on policy-based resource management using COD as an example plugin module. Please refer to the Shirako and Cereus project pages for more information.

Publications and Presentations

COD Architecture, Design, Implementation

  • "Sharing Networked Resources with Brokered Leases", David Irwin, Jeff Chase, Laura Grit, Aydan Yumerefendi, David Becker, and Ken Yocum, USENIX Technical Conference, June 2006, Boston, Massachusetts. [pdf]
  • "Virtual Machine Hosting for Networked Clusters: Building the Foundations for 'Autonomic' Orchestration", Laura Grit, David Irwin, Aydan Yumerefendi, and Jeff Chase, First International Workshop on Virtualization Technology in Distributed Computing (VTDC), November 2006. [pdf]
  • The original 2003 COD paper, describing a previous implementation: "Dynamic Virtual Clusters in a Grid Site Manager", Jeff Chase, David Irwin, Laura Grit, Justin Moore, and Sara Sprenkle, Twelfth IEEE Symposium on High Performance Distributed Computing (HPDC), June 2003, Seattle, Washington. [pdf] (talk slides: [ppt] [pdf])

COD Inside

Ancient History

  • "Cluster-On-Demand (COD)", Justin Moore, Work-in-Progress talk at USENIX 2002. [ppt]

  • "Managing Mixed-Use Clusters with Cluster-On-Demand", Justin Moore, David Irwin, Laura Grit, Sara Sprenkle, and Jeff Chase, Duke University Technical Report, CS-2002-07, November 2002. [pdf]

Software Downloads

Below are links for the open source components leveraged by COD to automate virtual cluster isolation and administration. These links include HOWTOs and downloads for all components used by COD.

General Information
   Various LDAP tools

Network Booting
   Network Booting with PXE
   LDAP patch for DHCP

LDAP-enabled DNS, PAM, and NSS
   LDAP-PAM module
   LDAP-NSS module
   LDAP-enabled BIND

LDAP-enabled Automounting using AutoFS
   LDAP patch for AutoFS

Packages used by COD to access LDAP and invoke configuration actions

Funding

  • ANI-03-30658 - Dynamic Virtual Clusters, part of the NSF Middleware Initiative
  • NSF CNS-0509408 - Virtual Playgrounds: Making Virtual Distributed Computing Real, in collaboration with the Globus Virtual Workspaces Project
  • EIA-99-72879 - Research Infrastructure Grant
  • ANI-01-26231 - Request Routing for Network Services

Project Members