• Storage Systems

    • Automating Storage Service Benchmarking
      This work focuses on the problem of workbench automation for storage server benchmarking. The goal is to develop an automated benchmarking system that plans, configures, sequences and executes benchmarking experiments on a common hardware pool. The activity is coordinated by a workbench controller that can consider various factors including accuracy vs. cost tradeoffs, availability of hardware resources, deadlines, and the results reaped from previous experiments.

      Experiments with a prototype evaluate alternatives for controller policies in the context of NFS server benchmarking using a configurable synthetic workload generator. The prototype obtains saturation throughput (peak rate) measures for NFS server configurations under various workloads, extending the widely practiced NFSOPS standard measure for file server performance. We show how the automated controller can plan experiments to obtain peak rate measures with a target confidence level and accuracy at low cost. Obtaining the peak rate for a given workload and server configuration is a key building block for systematic mapping of the performance behavior (response surface) across a space of workloads and configurations. We illustrate how the controller can employ established principles of response surface methodology to prune the multi-dimensional sample space, and obtain peak rate measures more efficiently by seeding the search from nearby points in the response surface. (Details)
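
      The idea of measuring to a target confidence level and accuracy can be illustrated with a minimal sketch. This is not the prototype's controller policy: the function names, the normal-approximation interval, and the simulated measurement are assumptions made for illustration. The loop repeats a trial until the confidence interval for the mean is narrower than a chosen fraction of the mean, so stable measurements stop early and cost fewer runs.

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_with_confidence(run_trial, confidence=0.95, accuracy=0.05,
                         min_runs=3, max_runs=50):
    """Repeat run_trial() until the interval half-width <= accuracy * |mean|.

    Returns (mean, half_width, runs_used). The interval uses a normal
    approximation (an assumption; a t-based interval is more appropriate
    when only a handful of runs are available).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    samples = []
    while True:
        samples.append(run_trial())
        n = len(samples)
        if n < max(2, min_runs):
            continue  # need at least two samples for a standard deviation
        mean = statistics.mean(samples)
        half_width = z * statistics.stdev(samples) / math.sqrt(n)
        if half_width <= accuracy * abs(mean) or n >= max_runs:
            return mean, half_width, n


def simulated_peak_rate(rng, true_rate=1000.0, noise=30.0):
    """Hypothetical stand-in for one benchmark run (ops/sec with noise)."""
    return rng.gauss(true_rate, noise)


if __name__ == "__main__":
    rng = random.Random(0)
    mean, half_width, runs = mean_with_confidence(
        lambda: simulated_peak_rate(rng))
    print(f"peak rate ~ {mean:.1f} +/- {half_width:.1f} after {runs} runs")
```

      Tightening `accuracy` or raising `confidence` increases the number of runs, which is the accuracy vs. cost tradeoff a controller has to weigh.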

    • NFS Observability Tool
      In the summer of 2006 I worked at Sun Microsystems in the NFS group. We designed and implemented nfsperf, a new observability utility for performance and data management in network storage systems. nfsperf enables in-depth understanding of NFS workloads and NFS performance to meet high-level objectives; the tool reports performance and workload metrics on a per-file, per-user, per-directory, and per-client basis. nfsperf is currently an OpenSolaris project.

    • Information Lifecycle Management
      In the summer of 2005 I worked at IBM Almaden Research Center in the autonomic storage computing group. I worked on the problem of Information Lifecycle Management (ILM), which manages data over its lifetime. The goal was to align data with storage devices based on the data's business value. We designed and developed a prototype that classifies data and storage according to their metadata attributes, and suggests an informed data placement scheme. We showed that our approach resulted in significant cost savings for different real-world data sets. (paper, IBM technical report)

    • Transparent Namespace Splitting
      In the summer of 2003 I worked at IBM Almaden Research Center on designing and implementing transparent namespace splitting in the Storage Tank (SAN-FS) file system.

    • IP versus non-IP SANs
      Periodic order-of-magnitude jumps in Ethernet bandwidth regularly reawaken interest in TCP/IP transport protocol offload. This time the jump to 10-Gigabit Ethernet coincides with the emergence of new network storage protocols (iSCSI and DAFS), and vendors are combining these with offload NICs to position IP as a competitor to Fibre Channel and other SAN interconnects. But what benefits will offload show for application performance? Several recent studies have presented conflicting data to argue that offload either does or does not benefit applications, but the evidence from empirical studies is often little better than anecdotal. The principles that determine the results are not widely understood, except for the first principle: Your Mileage May Vary.

      This work outlines fundamental performance properties of transport offload and other techniques for low-overhead I/O in terms of four key ratios that capture the CPU-intensity of the application and the relative speeds of the host, NIC device, and network path. The study also reflects the role of offload as an enabler for direct data placement, which eliminates some communication overheads rather than merely shifting them to the NIC. The analysis applies to Internet services, streaming data, and other scenarios in which end-to-end throughput is limited by network bandwidth or processing overhead rather than latency. (paper, slides)

• High Performance Computing

    • Active Learning of Cost Models for Scientific Applications
      In this work, we present the NIMO system, which automatically learns cost models for predicting the execution time of computational-science applications running on large-scale networked utilities such as computational grids. Accurate cost models are important for selecting efficient plans for executing these applications on the utility. Computational-science applications are often scripts (written, e.g., in languages like Perl or Matlab) connected using a workflow-description language, and therefore pose different challenges from modeling the execution of plans for declarative queries with well-understood semantics. NIMO generates appropriate training samples for these applications to learn fairly accurate cost models quickly using statistical learning techniques. NIMO's approach is active and noninvasive: it actively deploys and monitors the application under varying conditions, and obtains its training data from passive instrumentation streams that require no changes to the operating system or applications. Our experiments with real scientific applications demonstrate that NIMO significantly reduces the number of training samples and the time needed to learn fairly accurate cost models. (Details)

    • Low-overhead Messaging on Gigabit Ethernet
      Modern interconnects like Myrinet and Gigabit Ethernet offer speeds of gigabits per second (Gb/s), which puts the onus of reducing communication latency on the messaging software. With the advent of programmable NICs, many aspects of protocol processing can be offloaded from kernel and user space to the NIC, leaving the host processor free to dedicate more cycles to the application. No host-offload messaging system exists for Gigabit Ethernet. We present a new Ethernet Message Passing (EMP) protocol for Gigabit Ethernet. This protocol is not only OS-bypass but also zero-copy. It has been implemented using the multi-CPU Alteon NICs for Gigabit Ethernet.

      The two CPUs of the Alteon NIC raise an open question: can the performance of user-level protocols be improved by taking advantage of a multi-CPU NIC? To answer it, we parallelize and pipeline the basic EMP protocol. Such parallelization raises many intrinsic issues, and we explore different parallelization and pipelining schemes to enhance the performance of the basic EMP protocol. The performance results indicate that parallelizing the receive path of the protocol can deliver 964 Mbps of bandwidth, close to the maximum achievable on Gigabit Ethernet. To the best of our knowledge, this is the first research in the literature to exploit the capabilities of multi-CPU NICs to improve the performance of user-level protocols. The results demonstrate significant potential for designing scalable, high-performance clusters with Gigabit Ethernet. (Details)

• Web Services

    • Experiment-Driven Management of Web Services
      Database-backed Web services (e.g., Amazon, eBay, Yahoo!) play an important role in our daily lives. The performance P of a Web service S can be a complex function of its workload W, resource allocation R, and the large number of configuration parameters C that affect S. Furthermore, P may be dictated by unknown interactions among W, R, and C. We propose a systematic approach for discovering these dependencies and interactions accurately and comprehensively, to process four basic queries that arise in Web-service management. Our approach is based on planning a small set of experiments that observe P for selected <W, R, C> combinations. We propose a planning algorithm that leverages techniques from statistical design of experiments and active machine learning, and describe a harness we implemented to conduct experiments with chosen <W, R, C> combinations. Our empirical evaluation using two multitier Web services demonstrates the feasibility and usefulness of the experiment-driven approach. (paper in submission)
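
      As a hedged illustration of the design-of-experiments idea, the sketch below builds a textbook two-level full factorial design over three coded factors and estimates each factor's main effect on P. It is not the planning algorithm proposed in the paper; the function names and the synthetic response function are assumptions chosen only to make the example runnable.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of coded levels (-1 = low, +1 = high)."""
    return [dict(zip(factors, combo))
            for combo in product((-1, 1), repeat=len(factors))]


def main_effects(design, responses):
    """Main effect = mean response at +1 minus mean response at -1."""
    effects = {}
    for name in design[0]:
        high = [y for run, y in zip(design, responses) if run[name] == 1]
        low = [y for run, y in zip(design, responses) if run[name] == -1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects


def synthetic_performance(run):
    """Hypothetical stand-in for observing P in one experiment."""
    w, r, c = run["W"], run["R"], run["C"]
    return 100 + 10 * w - 5 * r + 0 * c + 2 * w * r  # includes a W-R interaction


if __name__ == "__main__":
    design = full_factorial(["W", "R", "C"])        # 2^3 = 8 experiments
    responses = [synthetic_performance(run) for run in design]
    print(main_effects(design, responses))
```

      A full factorial design grows exponentially with the number of factors, which is why planning a small, well-chosen subset of experiments matters when C contains many configuration parameters.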

    • Communication Oriented Architecture
      In this work, we characterized a commercial network-intensive application (the Apache Web server) under heavy load. We used oprofile to profile application performance in terms of CPI, cache misses, and branch mispredictions. We compared our findings with those for gcc, and our results indicate that current processor architectures do not have inherent support for network-intensive applications. (Details)