## Hidden Hamiltonian Cycle Recovery via Linear Programming

One of the most pressing challenges in genomics is to reconstruct a long and contiguous DNA sequence from short DNA subsequences (contigs). Enabled by very recent developments in sequencing technology one can now obtain linkage information that is statistically correlated with the true contig ordering, so that many links are observed between neighboring pairs of contigs, while relatively few links are observed for non-neighboring pairs.

Representing contigs as vertices and observed numbers of links between contigs as edge weights, the problem of contig assembly reduces to a Travelling Salesman Problem (TSP) in a weighted complete graph of size $n$ with a hidden Hamiltonian cycle corresponding to the true ordering. We assume a statistical model where the edge weights on the hidden Hamiltonian cycle are drawn independently from a distribution $P_n$, while the remaining edge weights are drawn independently from $Q_n$. Despite the worst-case intractability of solving the TSP, we show that a simple linear programming (LP) relaxation, namely the fractional 2-factor (F2F) LP, recovers the hidden Hamiltonian cycle with probability tending to one as $n \to \infty$ provided that $\alpha_n - \log n \to \infty$, where $\alpha_n \triangleq -2 \log \int \sqrt{d P_n d Q_n}$. This condition is information-theoretically optimal in the sense that, under mild distributional assumptions, $\alpha_n \geq (1+o(1)) \log n$ is necessary for any algorithm to succeed regardless of the computational cost.

Departing from the usual proof techniques based on dual witness construction, the analysis relies on the combinatorial characterization (in particular, the half-integrality) of the extreme points of the F2F polytope. Represented as bicolored multi-graphs, these extreme points are further decomposed into simpler "blossom-type'' structures for the large deviation analysis and counting arguments. Evaluation of the algorithm on real data shows improvements over existing approaches.

This is joint work with Vivek Bagaria (Stanford), Jian Ding (Penn), David Tse (Stanford), and Yihong Wu (Yale).

Jiaming Xu is an assistant professor in the Fuqua School of Business at Duke University which he joined in July 2018. Before that, he was an assistant professor in the Krannert School of Management at Purdue University from August 2016 to June 2018, a research fellow with the Simons Institute for the Theory of Computing, UC Berkeley from January 2016 to June 2016, and a postdoctoral fellow with the Statistics Department, The Wharton School, University of Pennsylvania from January 2015 to December 2015. He received the Ph.D. degree from UIUC in 2014 under the supervision of Prof. Bruce Hajek, the M.S. degree from UT-Austin in 2011, and the B.E. degree from Tsinghua University in 2009, all in Electrical and Computer Engineering. His research interests include data science, high-dimensional statistical inference, information theory, convex and non-convex optimization, queueing theory, and game theory.