Computer Science Research Profile:
Statistical Privacy

From the Fall 2012 issue of Threads

Ashwin Machanavajjhala, the department's newest faculty member, specializes in algorithms for storing and analyzing big data. He joined the Systems and Architecture research group on July 1, and three weeks later he submitted a research proposal to the Early Career Development Program at the National Science Foundation (NSF). On January 29 he heard that NSF will fund his project, titled "PROTEUS: A Practical and Rigorous Toolkit for Privacy."

Machanavajjhala smiled widely on receiving the news of the NSF grant, known as a "CAREER" award. Even before that big moment, a quick, informal poll of members of the department showed that Machanavajjhala does not often frown. Indeed, he is kind enough to chuckle tolerantly when asked -- and one assumes this has been a fairly regular occurrence in his seven months here -- for a detailed explanation of how to pronounce his last name. "Try it this way: Mä' chä nä vəj ä lä," he responds. Then he goes on to request staff, fellow faculty, students, and writers of web profiles, "Please, call me Ashwin!"

The assistant professor describes his primary research interest as: "Statistical privacy, or the problem of disclosing aggregate statistics about data collected from individuals, while ensuring that sensitive information at an individual's level is not disclosed." As Ashwin points out, there is a growing emphasis in the world at large on utilizing and disseminating aggregate statistics from demographic data, health records, internet activity, energy usage, communication patterns, social interactions, etc. "The ubiquity of data in these realms has, over the last few years, revolutionized e-commerce and the advertising and health industries, and has facilitated advances in science and public policy. However, in many cases data cannot be publicly released or analyzed as collected, because identities (or other sensitive attributes of individuals in the data) could be either directly revealed or inferred."

Ashwin aims to overcome the privacy challenges inherent in mining these broad, diverse veins of data by formulating mathematical definitions for privacy on one hand, while on the other developing efficient algorithms with provable guarantees of privacy.

Before coming to Duke he worked on these issues as a Senior Research Scientist in the Knowledge Management group at Yahoo! Research in Santa Clara, CA. He believes he benefitted a lot from his experience in industry. Productive collaborations and opportunities to work with real-life data at Yahoo! provided rich terrain on which to explore solutions to the problems in his field, he affirms. Ultimately, however, he wanted to come back to academia - and here he is.

Appropriately enough, in September one of his first official assignments for the department was to give a talk to graduate students. The topic: The differences and similarities they may experience working in industry as opposed to a university setting. He was also kind enough to discuss the disparate job application processes in the two arenas!

Ashwin vividly remembers and cherishes his own time as a student. "Algorithms was my favorite course, and Professor C. Pandu Rangan, who taught the course, became my undergraduate thesis advisor," he recounts. The mentoring he received from Professor Rangan attracted him to algorithms and research, and later spurred him to continue both at the graduate level. He first studied Computer Science and Engineering at the Indian Institute of Technology in Madras, India, finishing with a BTech in 2002. He then went on to complete his Master's and Doctoral degrees in Computer Science at Cornell University, graduating in 2008. Professor Johannes Gehrke was his advisor. His PhD dissertation received the 2008 ACM SIGMOD (the Association for Computing Machinery's Special Interest Group on Management of Data) Jim Gray Dissertation Award, Honorable Mention. Ashwin's dissertation was one of the first works on privacy to underscore the importance of understanding how much background knowledge about individuals a given data-seeking 'adversary' already possesses. In it Ashwin developed a formal mathematical definition of privacy, called "l-diversity." This work facilitated the publication of data while ensuring privacy against adversaries with limited background knowledge. During the past five years the l-diversity paper has been cited in over 1,000 articles in journals and at conferences.

He continues to be inspired by his field's particular combination of theory and practical application. It's clear he can't wait to pass his excitement along to his students. "In my research group students can look forward to working on some foundational theoretical problems defining privacy -- its limitations as well as its utility -- in a variety of application settings, including social networks, location tracking, genomics, etc." For those with a passion for systems-oriented research, Ashwin adds, "They will also be investigating the best way to implement privacy -- whether in the operating system/cloud itself or at the level of user applications." Another research direction he and his students will be pursuing is privacy-aware data integration -- where a researcher might link his smaller dataset with private datasets on the cloud (e.g., search logs or medical records) to help learn new insights.

In his classes this year students will undertake group projects, and he will encourage them to explore the practical aspects of the field. "Expertise in privacy issues and technologies is a vital skill for students planning to work in any industry - web, medical, or otherwise," he asserts.

Last fall he taught a class on Privacy in a Mobile-Social World, discussing the challenges faced in developing privacy-preserving algorithms for real-world scenarios. Now in spring 2013 Ashwin is offering another course he has created, Algorithms for Big-Data Management, which will provide an introduction to algorithm design for "staggeringly" large datasets. "It will specifically focus on integrating information from multiple datasets in the context of parallel and streaming algorithms, large-scale machine learning, and crowd-sourcing."

As part of his PhD research Ashwin worked with Professor Gehrke to find real-life problems to help demonstrate existing issues with privacy. It turned out that John Abowd, Professor of Economics and of Computing and Information Science right there at Cornell, was looking for a new algorithm to help protect data in a U.S. Census Bureau data product called 'OnTheMap.' At the time, the existing school of thought underlying such tools held that data were secure, because an adversary did not have the exact algorithm used to protect the data. Ashwin showed that this was not enough. He also demonstrated both the need for and potential to develop algorithms that could successfully ensure privacy, even when the adversary did have access to the algorithm. "Making the algorithm public not only avoids 'security by obscurity,' it also allows statisticians to perform more informed analyses," the talented researcher tells us. He developed an algorithm that first fit a hierarchical Bayesian model to the original data, and then sampled a new set of data points from a slightly perturbed version of the model. Sampling from a model ensures that the newly created synthetic population preserves the statistical properties of the original data. It also preserves more statistical information than other techniques, like coarsening or perturbing, while still guaranteeing a strong notion of privacy called differential privacy, which means that even an adversary with access to the parameters used by the algorithm can't learn sensitive information about individuals. Ashwin's algorithm made OnTheMap the first application of its type with provable privacy guarantees. It is still used by the U.S. Department of Commerce to manage and distribute its census data.

To prove truly fruitful, analysis of such data must take into account and utilize correlations among groups of individuals. Another important milestone in Ashwin's research career so far was his proof of a 'No Free Lunch' theorem, which showed that any algorithm not taking into account assumptions about an adversary's background knowledge of how the data is generated could not both guarantee privacy and provide any useful information about the data at the same time. An important consequence of this result is that one can guarantee privacy only against certain limited classes of adversaries. Ashwin illustrated that state-of-the-art privacy tools ensure privacy of individuals only when an adversary believes the data of every individual has no correlation with the data of other individuals. This severely restricts the applicability of such tools, since all useful datasets include correlations.

Ashwin's 'No Free Lunch' result thus opened up a whole new area of research into the theory and implementation of privacy in the analysis of correlated data. One goal for Ashwin's research group, which his brand new NSF CAREER grant will help fund, is to develop a practical tool kit for privacy to allow researchers to analyze individuals' data from social networks and genome datasets, among other sources, without revealing sensitive information about the human beings in question. To that end, Ashwin has begun building his group here at Duke, focusing on data privacy, systems for massive data analytics, and statistical methods for information extraction and entity resolution. He hopes to draw students from both the Computer and Statistical Science departments.

Return to Reseach Profiles home page