A summary of my research experience.

Rhythm: Harnessing Data Parallel Hardware for Server Workloads

Growing web traffic demands higher server throughput while preserving energy efficiency and total cost of ownership. Present work on data center efficiency focuses primarily on the data center as a whole, using off-the-shelf hardware for individual servers. Server capacity is typically increased by adding more machines, which is cheap up front but inefficient in the long run in terms of energy and area.

Our work builds on the observation that server workload execution patterns are not completely unique across requests. We present Rhythm, a framework for high-throughput servers that exploits similarity across requests to improve server performance and power/energy efficiency by launching data-parallel executions for request cohorts. An implementation of the SPECweb Banking workload using Rhythm on NVIDIA GPUs provides a basis for evaluating both software and hardware for future cohort-based servers. Our evaluation of Rhythm on future server platforms shows that it achieves 4x the throughput (requests/sec) of a Core i7 at efficiencies (requests/Joule) comparable to a dual-core ARM Cortex-A9. A Rhythm implementation that generates transposed responses achieves 8x the i7 throughput while processing 2.5x more requests/Joule than the A9.
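
To make the cohort idea concrete, the sketch below groups incoming requests by handler so that each batch follows the same control path and can be dispatched to a data-parallel back end. This is an illustrative sketch only; the names (the Request handler attribute, dispatch_simd) are hypothetical and do not reflect Rhythm's actual API.

    from collections import defaultdict

    def form_cohorts(requests, key=lambda r: r.handler):
        """Group requests expected to share an execution path."""
        cohorts = defaultdict(list)
        for req in requests:
            cohorts[key(req)].append(req)
        return cohorts

    def serve(requests, dispatch_simd):
        # One data-parallel launch per cohort: every lane runs the
        # same handler code over a different request's data (SPMD).
        for handler, cohort in form_cohorts(requests).items():
            dispatch_simd(handler, cohort)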

Pairing, State trees and Sub-blocks: Mechanisms for Defect Tolerant Caches

Memory arrays occupy a dominant share of the transistor budget on a chip, making defect tolerance in memories increasingly important. Defects in on-chip memories can be caused by near-threshold operation or an unreliable fabrication process, and future scaled CMOS technology generations are predicted to see a large increase in process variability and other fabrication challenges that make a target yield costly to achieve. There is therefore a need for solutions that enable reliable and efficient operation of the chip at extremely high defect rates while minimizing complexity and overhead.

In this work we present a defect-tolerant cache design that uses byte scavenging to greedily pair blocks across the entire cache, yielding fully or partially functional blocks. We reduce the state-storage overhead of this scheme from 16% to ~5% via a novel method called State Trees. We also propose a cache hierarchy design and implementation that uses sub-blocks to maximize performance while ensuring reliable cache operation in the presence of partially functional blocks. Our experiments on the SPEC CPU2000 benchmarks show a graceful degradation of performance with increasing defect rates: our best approach retains 99% of chip performance (IPC) at 0.1% defects, 90% at 1% defects, and 60% at 5% defects. Using a simple energy model to gauge the applicability of our approach to a voltage scaling setting, we estimate 85% energy savings at near-threshold operation.
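
The pairing mechanism can be illustrated with a small sketch: two blocks can be paired when their byte-level defect maps do not overlap, so each byte can be scavenged from whichever partner is functional at that position. This is a simplified illustration under assumed data structures (one defect bitmask per block), not the paper's implementation.

    def can_pair(defects_a, defects_b):
        """Blocks pair if no byte position is defective in both
        (each defect map is a bitmask, bit i = byte i is bad)."""
        return (defects_a & defects_b) == 0

    def greedy_pair(defect_maps):
        """Greedily pair blocks across the cache; unpaired blocks
        with defects remain partially functional."""
        pairs, used = [], set()
        for i, da in enumerate(defect_maps):
            if i in used or da == 0:
                continue  # fully functional blocks need no partner
            for j in range(i + 1, len(defect_maps)):
                if j not in used and can_pair(da, defect_maps[j]):
                    pairs.append((i, j))
                    used.update((i, j))
                    break
        return pairs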

No Free Lunch: Maximizing Storage Resilience

Defect tolerance in memory arrays is increasingly important due to scaled CMOS, voltage scaling, and wear-out in emerging non-volatile memories. This paper explores the design space of storage resilience by first defining a taxonomy that captures the important parameters of existing approaches while also exposing a large design space with many areas yet unexplored. We examine one portion of this design space, which uses error-correcting trees (ECT) to maintain state and repair information for defective portions of the memory array and gracefully degrades capacity as defects increase. Conservative, worst-case evaluations of this approach for an L1 cache show that ECT can maintain up to 40% of cache capacity at 5% defects, while ECP-12 at equal area overhead has no remaining cache capacity. Performance simulations show that ECT can also gracefully degrade performance as defects increase.
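
As a rough illustration of the tree idea (a simplified toy, not the ECT encoding from the paper): a binary tree over the array can record, at each internal node, whether its subtree still contains usable capacity, so lookups skip fully defective regions in O(1) and capacity degrades with the defect population rather than failing outright.

    class ECTNode:
        """Toy tree over an array: leaves mark defective chunks,
        internal nodes cache whether any capacity survives below."""
        def __init__(self, lo, hi, defective):
            self.lo, self.hi = lo, hi
            if hi - lo == 1:
                self.left = self.right = None
                self.usable = not defective[lo]
            else:
                mid = (lo + hi) // 2
                self.left = ECTNode(lo, mid, defective)
                self.right = ECTNode(mid, hi, defective)
                self.usable = self.left.usable or self.right.usable

        def usable_chunks(self):
            if not self.usable:
                return []  # whole subtree pruned at once
            if self.left is None:
                return [self.lo]
            return self.left.usable_chunks() + self.right.usable_chunks()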

Artist Identification for Renaissance Paintings

Current work in author identification is primarily directed towards music classification; identifying artists by analyzing features of their work has only recently gained interest. As a multiclass classification problem, it admits many potentially applicable machine learning approaches. We extend present work in this area, which uses Naïve Bayes classifiers and multi-class SVMs, by selecting a more distinctive set of paintings across prolific artists. We initially use a color histogram as our features, and then analyze more advanced features such as histograms of oriented gradients (HOG). Because the number of features is large relative to the number of paintings, we use PCA to project the features onto the directions of highest variance. We apply several multi-class classification techniques, including Naïve Bayes, Linear Discriminant Analysis, Logistic Regression, K-Means, and SVMs, and achieve a maximum classification accuracy of 65% when assigning an unseen painting to one of 5 artists. (pdf)
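
A minimal version of this pipeline, assuming scikit-learn and images supplied as RGB arrays (the bin count, PCA threshold, and kernel here are illustrative choices, not the project's exact settings):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    def color_histogram(image, bins=16):
        """Per-channel color histogram of an HxWx3 RGB array."""
        return np.concatenate([
            np.histogram(image[..., c], bins=bins, range=(0, 255),
                         density=True)[0]
            for c in range(3)
        ])

    def evaluate(images, artist_labels):
        X = np.stack([color_histogram(img) for img in images])
        # PCA keeps the highest-variance directions, taming the
        # feature count relative to the small number of paintings.
        model = make_pipeline(PCA(n_components=0.95),
                              SVC(kernel="linear"))
        return cross_val_score(model, X, artist_labels, cv=5).mean()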

A Performance and Power Evaluation of Speculative Dynamic Vectorization

With the end of Dennard scaling, processor clock frequencies no longer increase unbounded. Under such conditions, speculative approaches based on the repetition of instructions in loops are an attractive way to increase IPC. We consider two such approaches in this paper, speculative dynamic vectorization of all instructions and speculative dynamic vectorization of loads only, based on load speculation, and analyze the power penalties that come as a trade-off for the increased performance. We find that the first approach increases power by almost 125% for a modest performance increase of 8% in IPC, whereas the second increases power by about 97% for a 10% increase in performance. These penalties primarily arise from the history table accesses needed for prediction and from the vector register file used to store the speculated values. (pdf)
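
Those numbers imply a net loss in energy efficiency, which the quick calculation below makes explicit (performance per watt relative to the baseline, assuming the reported IPC and power deltas compose directly):

    # Relative performance-per-watt vs. the unmodified baseline.
    for name, perf_gain, power_gain in [
        ("vectorize all instructions", 0.08, 1.25),
        ("vectorize loads only",       0.10, 0.97),
    ]:
        perf_per_watt = (1 + perf_gain) / (1 + power_gain)
        print(f"{name}: {perf_per_watt:.2f}x baseline perf/W")
    # -> roughly 0.48x and 0.56x: both trade efficiency for IPC.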

Implementation of Speech Recognition in Resource Constrained Environments

With the emergence of ubiquitous computing powered by state-of-the-art technology, there is a constant need for more convenient ways to input data and commands to a computing device. As computing devices shrink rapidly, more sophisticated human-computer interfaces such as speech are fast evolving. This project attempts to bridge the gap between current speech recognition technology and embedded systems. The initial objective is to implement a speech recognition engine using Hidden Markov Models, developed as efficient MATLAB code on a PC; this phase produces a limited-domain recognition engine spanning numerals only. The subsequent step is to port this engine to a resource-constrained environment such as an FPGA kit, with the long-term aim of eliminating the PC altogether and building a stand-alone system. The recognition engine should be extensible to span the entire vocabulary of the English language. (pdf)
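
A minimal sketch of the recognition core, one HMM per digit scored with the Viterbi algorithm over quantized acoustic observations (shown in Python for consistency with the other sketches on this page; the project's implementation is in MATLAB, and the model layout here is an assumption):

    import numpy as np

    def viterbi_loglik(obs, log_pi, log_A, log_B):
        """Best-path log-likelihood of an observation sequence under
        one HMM (log_pi: initial, log_A: transition, log_B[s, o]:
        emission log-probabilities; obs: discrete symbol indices)."""
        v = log_pi + log_B[:, obs[0]]
        for o in obs[1:]:
            v = np.max(v[:, None] + log_A, axis=0) + log_B[:, o]
        return v.max()

    def recognize_digit(obs, digit_models):
        """Classify by the digit whose HMM scores the sequence highest;
        digit_models maps a digit to its (log_pi, log_A, log_B)."""
        return max(digit_models,
                   key=lambda d: viterbi_loglik(obs, *digit_models[d]))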