Research Projects

Differentially Private Training Corpus Synthesis

Speaker: Ben Stoddard
stodds at cs.duke.edu
Date: Monday, May 6, 2013
Time: 2:00pm - 3:00pm
Location: D344 LSRC, Duke

Abstract

The wealth of publicly available training and benchmark data has allowed researchers to refine and tune the design and performance of standard classification methods to a great degree. Unfortunately, most publicly available training data does not include private data from the medical, social, web, and communication domains, all of which are of great importance to many fields of research. If this sort of data were actually used to create a classifier, it would pose an immediate threat to the privacy of those included in the training data. An adversary with sufficient knowledge could use a released classifier to deduce information about the members of the data set, thus breaching their privacy.

The goal of this proposed project is to produce a framework for creating synthetic training data from an actual training set while providing the formal guarantees of differential privacy. Additionally, we will consider feature selection in cases where the data is sparse or high-dimensional, so that as little noise as possible is added to the released corpus while as much utility as possible is preserved. Within such a framework, the synthetic data, or any classifier trained on it, could be released to the public, since each would satisfy differential privacy with respect to the original training corpus.
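As a rough illustration of the general idea only, and not the framework proposed in this talk, the sketch below uses a standard textbook baseline: perturb a histogram of the training records with Laplace noise, then sample synthetic records from the noisy histogram. The function names (`laplace_noise`, `synthesize`), the toy records, and the choice of add/remove-one-record neighboring data sets (which gives the histogram a sensitivity of 1) are all assumptions made for this example. The domain of possible records must be fixed in advance, independently of the data.

```python
import random
from collections import Counter
from itertools import product


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def synthesize(records, domain, epsilon, n_synthetic, seed=None):
    """Hypothetical helper: sample synthetic records from a noisy histogram.

    `domain` lists every possible record value and must not depend on the
    private data. Assumes neighboring data sets differ by adding or removing
    one record, so each histogram has L1 sensitivity 1.
    """
    rng = random.Random(seed)
    counts = Counter(records)

    # Laplace mechanism: add noise with scale 1/epsilon to every cell,
    # including cells that are empty in the real data.
    noisy = [counts.get(cell, 0) + laplace_noise(1.0 / epsilon, rng)
             for cell in domain]

    # Post-processing (clamping, sampling) does not weaken the guarantee.
    weights = [max(0.0, value) for value in noisy]
    if sum(weights) == 0:
        weights = [1.0] * len(domain)  # fall back to uniform sampling

    return rng.choices(domain, weights=weights, k=n_synthetic)


# Toy example: records are (feature, label) pairs over a small fixed domain.
domain = list(product(["low", "high"], ["neg", "pos"]))
private_records = ([("low", "neg")] * 40 + [("high", "pos")] * 35
                   + [("low", "pos")] * 5)
synthetic = synthesize(private_records, domain, epsilon=1.0,
                       n_synthetic=50, seed=0)
```

The number of histogram cells grows exponentially with the number of features, so on sparse or high-dimensional data most cells hold only noise. That is one reason a scheme like this needs careful feature selection before it becomes useful.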

Advisor(s): Ashwin Machanavajjhala
Landon Cox, Ronald Parr, Jun Yang