Ques
Project Summary
The increasing complexity, scale, and dynamics of networked computing
systems make it hard for users and system administrators to understand
and control these systems. Recent studies indicate that a significant
fraction of user time gets wasted because of unexpected system
slowdowns, crashes, and application errors. Business-critical systems
often have hundreds of components---e.g., applications, databases,
servers, routers---whose performance depends on thousands of intricate
and time-varying dependencies and parameters. The Ques project
aims to arrest and reverse the dangerous spiral towards unwieldy
systems, high administrative costs, and frustrated users.
Ques is supported generously by NSF CAREER Award Number 0644106, startup
funds from Duke, two faculty awards from IBM, and an equipment grant
(jointly with three other Duke faculty members) from IBM.
Project Details
Ques tackles system management through innovative data management.
Ques treats a computing system as a rich source of data about system
configuration and activity, available typically as continuous, rapid,
and time-varying data streams. The system data---e.g.,
multidimensional time-series of performance and utilization metrics,
control and data-flow paths of requests, and error messages---is
collected in an efficient and controlled fashion. Ques gives users
and administrators the ability to pose a broad range of
system-management queries over this data:
- Health monitoring, e.g., which applications cause the most I/O?
- Change (anomaly) detection, e.g., alert me when resource-usage
patterns change significantly.
- Diagnosis, e.g., why is it taking 10 seconds to add an item to a
shopping cart?
- Forecasting, e.g., what will the system throughput be 1 hour from
now?
- What-if, e.g., how will processor utilization change if the
database cache size is increased by 10%?
- Recommendation, e.g., what resource allocation to a
hurricane-prediction workflow will guarantee its completion within 30
minutes?
Ques-Querying addresses challenges in developing simple and
intuitive ways to express such queries---e.g., using a visual
interface, declarative query language, or keyword search---and
processing the queries automatically and efficiently using execution
plans. These plans use statistical (e.g., neural network) and
performance (e.g., queuing network) models learned from system data as
well as operators for data transformation (e.g., feature selection)
and inference. We have developed algorithms that quickly navigate the
huge plan space of models, model parameters, and transformations
using techniques like cost estimation---estimating plan accuracy and
execution time using statistics---and active learning---executing
sample plans for learning purposes. Ques-Querying also incorporates
Ques-Search, a desktop and Internet search engine for asking questions
about computer problems using keywords or natural language (e.g., why
is my browser slow?). Ques-Search personalizes search by adding
relevant context (e.g., system and application state) to user queries
automatically and securely.
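As a rough illustration of cost-based plan selection, the sketch below picks the candidate plan with the highest estimated accuracy among those whose estimated execution time fits a time budget. It is a minimal sketch under stated assumptions, not the Ques algorithm: the `Plan` class, the `choose_plan` function, the time-budget criterion, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate plan: a model plus a data transformation."""
    model: str
    transformation: str
    est_accuracy: float  # estimated accuracy in [0, 1] (assumed to come from statistics)
    est_seconds: float   # estimated execution time in seconds


def choose_plan(plans: Iterable[Plan], time_budget: float) -> Optional[Plan]:
    """Return the most accurate plan whose estimated time fits the budget.

    Ties on accuracy are broken in favor of the faster plan.
    Returns None if no plan fits the budget.
    """
    feasible = [p for p in plans if p.est_seconds <= time_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: (p.est_accuracy, -p.est_seconds))


# Illustrative estimates only; these are not measurements from Ques.
candidates = [
    Plan("neural network", "feature selection", 0.93, 40.0),
    Plan("queuing network", "none", 0.85, 2.0),
    Plan("neural network", "none", 0.90, 25.0),
]
best = choose_plan(candidates, time_budget=30.0)
```

With a 30-second budget, the first candidate is excluded for its estimated running time, and the remaining plan with the higher estimated accuracy is chosen. A real optimizer would also have to account for uncertainty in the estimates themselves.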
Ques-Control is an ambitious next step beyond Ques-Querying that aims to
enable automated control of complex computing systems under changing
conditions, based on policies specified by system administrators.
Like Ques-Querying, Ques-Control learns models of system behavior from
data collected passively or through active perturbation. Given a set
of system policies P, Ques-Control derives a controller---an execution
plan based on sensing, actuation, and feedback---that enforces P at all times.
Ques-Control poses interesting challenges in policy-interface design,
acquiring the right training data to model specific system behavior
quickly, robustness to bursty workloads, and proactive system tuning.
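To make the idea of a controller built from sensing, actuation, and feedback concrete, here is a minimal sketch of a proportional feedback loop that adjusts a resource allocation to hold utilization at a target. Everything in it is an assumption made for illustration: the policy ("keep utilization at `target`"), the toy system model (utilization equals demand divided by allocation), and the names `control_step` and `simulate`. It does not describe how Ques-Control derives its controllers.

```python
def control_step(allocation: float, measured: float, target: float,
                 gain: float = 0.5, lo: float = 1.0, hi: float = 64.0) -> float:
    """One feedback iteration: compare the sensed value with the target
    and scale the allocation in proportion to the relative error."""
    error = measured - target
    new_allocation = allocation * (1.0 + gain * error / target)
    # Clamp to the allowed allocation range (hypothetical limits).
    return max(lo, min(hi, new_allocation))


def simulate(demand: float, allocation: float, target: float, steps: int) -> float:
    """Run the closed loop against a toy system model (an assumption):
    utilization = demand / allocation, saturating at 1.0."""
    for _ in range(steps):
        utilization = min(1.0, demand / allocation)                  # sensing
        allocation = control_step(allocation, utilization, target)  # actuation
    return allocation


# Under this toy model the loop settles where demand / allocation == target,
# that is, at an allocation of demand / target.
final_allocation = simulate(demand=6.0, allocation=4.0, target=0.6, steps=50)
```

In this toy model, once utilization is no longer saturated each step halves the distance to the settling point, so the loop converges quickly. Bursty workloads and inaccurate models are exactly where such a simple loop would struggle.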
Ques seeks to advance the state of the art in our ability to
understand and control computing systems in a number of ways:
- No current system-management product supports Ques's broad range of
queries or their combinations: health monitoring, anomaly detection,
diagnosis, forecasting, what-if, and recommendation. Furthermore,
Ques targets a comprehensive long-term solution for system management
by automating the generation of plans for executing queries and
enforcing policies. This approach requires extensive collection and
analysis of system configuration and activity data---e.g., performance
metrics, resource utilization, execution and stack traces, error
messages, workload, network packets, source code, and help
manuals---both passively and through controlled system perturbation.
- Busy data centers generate more than 1 terabyte of log data per
day. More fine-grained logging can make this volume 100x larger. To
query such massive time-varying datasets, Ques is pushing the envelope
of data-stream technology where data is modeled as continuous streams
and queried using a "you-get-one-look" approach. In addition, Ques
supports controlled data collection to balance the inherent
cost-accuracy tradeoff.
- Ques is removing technical barriers to automated policy-driven
control of computing systems. Recent industrial initiatives like
IBM's Autonomic Computing and Microsoft's Dynamic Systems Initiative
highlight the pressing need for such control.
- Current system-management products are usually of little help to
desktop users facing unexpected system slowdowns or misbehaving
applications. Ques makes system management accessible to the large
and diverse class of users and developers who administer their own
systems.
- Current system-management products have fairly rigid interfaces and
require considerable system expertise to use. Ques is rethinking the
system-management interface for a broad spectrum of potential users.
As one example, system administrators need effective ways to input
domain knowledge, while desktop users prefer keyword queries on a
personalized engine for desktop and Internet search.
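The "you-get-one-look" stream model mentioned in the list above can be illustrated with a small one-pass computation. The sketch below keeps a running mean and variance using Welford's algorithm, so each reading is seen once and never stored, and flags readings that deviate strongly from the running baseline. It is only a sketch under stated assumptions: the class and function names, the 3-standard-deviation threshold, the warm-up length, and the sample readings are hypothetical, and this is not the change-detection method used in Ques.

```python
import math
from typing import Iterable, Iterator, Tuple


class RunningStats:
    """Welford's one-pass algorithm for the mean and sample variance."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(stream: Iterable[float], threshold: float = 3.0,
                     warmup: int = 5) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings more than `threshold` standard
    deviations away from the running mean, looking at each reading once."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        if stats.n >= warmup and stats.stddev > 0:
            if abs(x - stats.mean) > threshold * stats.stddev:
                yield i, x
                continue  # keep flagged readings out of the baseline
        stats.update(x)


# Hypothetical utilization readings with one obvious spike.
readings = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
alerts = list(detect_anomalies(readings))
```

Because the state is three numbers regardless of stream length, the same loop works on an unbounded stream. The trade-off is that a fixed threshold on a global baseline cannot distinguish a transient spike from a lasting shift in usage patterns.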
We are committed to building a fully functional prototype of Ques
and deploying it in real-world settings. With each novel component of
Ques, we will: (i) perform the research and evaluation using a
prototype in a testbed setting, with both synthetic and real
applications and data, (ii) demonstrate the prototype at a leading
conference, (iii) make the demonstration available publicly on the
Internet, (iv) do a real-world deployment and user studies if there is
sufficient interest, and (v) release the source code publicly. The
effectiveness of Ques will be tested by deploying it to manage
workloads on a virtualized, service-oriented, and on-demand computing
platform on our departmental research-computing cluster. We have also
had encouraging preliminary discussions with the administrators of a
university-wide production cluster used heavily for
computational-science applications. We have established industrial
collaborations (IBM) with the eventual goal of transferring technology
from Ques to industrial-strength system-management products.
Project Members
- Shivnath Babu, Assistant Professor, Duke Computer Science
- Songyun Duan, Ph.D. Candidate, Duke Computer Science
- Peter Franklin, Undergraduate, Duke University
- Dongdong Zhao, M.S. Candidate, Duke Computer Science
Collaborators
- Ashraf Aboulnaga, Assistant Professor, Computer Science, University of Waterloo
- Jeff Chase, Professor, Duke Computer Science
- Brent Miller, Autonomic Computing Group, IBM
- Kamesh Munagala, Assistant Professor, Duke Computer Science
- Sandeep Uttamchandani, IBM Almaden Research Center
- Jun Yang, Assistant Professor, Duke Computer Science
Alumni
- Brian Cook, M.S., now at IBM
- Garrett Bressler, now at Brown University
- Piyush Shivam, Ph.D., first employment at Sun Microsystems
Publications
- P. Shivam, V. Marupadi, J. Chase, and S. Babu.
Cutting Corners: Workbench Automation for Server Benchmarking
In Proc. of the 2008 USENIX Annual Technical Conference, June 2008 (To appear)
- S. Duan and S. Babu.
Guided Problem Diagnosis through Active Learning
In Proc. of the International Conference on Autonomic Computing (ICAC), June 2008 (To appear)
- S. Babu, S. Duan, and K. Munagala.
Processing Diagnosis Queries: A Principled and Scalable Approach
Poster at the International Conference on Data Engineering (ICDE), April 2008.
- M. Ahmad, A. Aboulnaga, S. Babu, and K. Munagala.
QShuffler: Getting the Query Mix Right
Poster at the International Conference on Data Engineering (ICDE), April 2008.
- S. Duan and S. Babu.
Processing Forecasting Queries
In Proc. of the International Conference on Very Large Databases (VLDB), September 2007
- B. Chandramouli, C. Bond, S. Babu, and J. Yang.
Query Suspend and Resume
In Proc. of the 2007 ACM Intl. Conf. on Management of Data (SIGMOD), June 2007
- A. Yumerefendi, P. Shivam, D. Irwin, P. Gunda, L. Grit, A. Demberel, J. Chase, and S. Babu.
Towards an Autonomic Computing Testbed
In Workshop on Hot Topics in Autonomic Computing (HotAC), June 2007
- B. Cook, S. Babu, G. Candea, and S. Duan.
Towards Self-Healing Multitier Services
In Second Intl. Workshop on Self-Managing Database Systems (SMDB), April 2007
- B. Chandramouli, C. Bond, S. Babu, and J. Yang.
On Suspending and Resuming Dataflows (poster).
In Proc. of IEEE International Conference on Data Engineering (ICDE), April 2007
- P. Shivam, S. Babu, and J. Chase.
Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications
In Proc. of the International Conference on Very Large Databases (VLDB), September 2006
- P. Shivam, S. Babu, and J. Chase.
Active Sampling for Accelerated Learning of Performance Models
In Proc. of the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques (SysML), June 2006
- P. Shivam, S. Babu, and J. Chase.
Learning Application Models for Utility Resource Planning
In Proc. of IEEE International Conference on Autonomic Computing (ICAC), June 2006
- S. Babu, P. Bizarro, and D. DeWitt.
Proactive Re-optimization
In Proc. of the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005
The Rio system described in this paper was demonstrated at SIGMOD 2005, June 2005.
- S. Babu and P. Bizarro.
Adaptive Query Processing in the Looking Glass
In Proc. of the Second Biennial Conference on Innovative Data Systems Research (CIDR), January 2005
Demonstrations
- P. Shivam, A. Demberel, P. Gunda, D. Irwin, L. Grit, A. Yumerefendi, S. Babu, and J. Chase.
Automated and On-Demand Provisioning of Virtual Machines for Database Applications
Demonstrated at the 2007 ACM Intl. Conf. on Management of Data (SIGMOD 2007), June 2007
- S. Duan and S. Babu.
Proactive Identification of Performance Problems
Demonstrated at the 2006 ACM Intl. Conf. on Management of Data (SIGMOD 2006), June 2006
- S. Babu, P. Bizarro, and D. DeWitt.
Proactive Re-optimization with Rio
Demonstrated at the 2005 ACM Intl. Conf. on Management of Data (SIGMOD 2005), June 2005
Technical Reports in Submission
- P. Shivam, S. Babu, S. Duan, P. Gunda, A. Demberel, D. Irwin, and J. Chase.
Experiment-Driven Management of Web Services
June 2007