Alexander Hartemink: Software

Software

Whenever we design new algorithms, we write code so we can evaluate them and then use them for solving problems in systems biology. Occasionally, these algorithms are of sufficient general use that we have taken the extra time to carefully re-implement them as software packages that are better documented, more efficient, more extensible, and more user-friendly than our standard research-grade code. The software packages that fall into this category are listed below. Each package has its own site, linked below. In each case, the software is available with complete source code under a non-commercial use license. If you are interested in commercial licensing opportunities, please contact us.

Banjo

Banjo (Bayesian Network Inference with Java Objects) is a highly efficient, configurable, and cluster-deployable Java package for the inference of static or dynamic Bayesian networks. Banjo is currently limited to discrete variables; however, it can discretize continuous data for you, and is modular and extensible so that new components can be written to handle continuous variables if you wish. The modular design also allows you to mix and match various inference algorithm components to implement different learning procedures, ranging from simulated annealing with random local moves to greedy hillclimbing with all local moves, as well as create new ones.
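
As a rough illustration of one of the search strategies mentioned above, greedy hill climbing over all local moves, here is a minimal Python sketch. It is not Banjo code (Banjo is a Java package) and does not use Banjo's API: the function names are hypothetical, and the score function is assumed to be supplied by the caller, where a real run would use a data-driven network score.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle.

    Repeatedly strips nodes that have no incoming edge from the
    remaining nodes; if none can be stripped, a cycle is left.
    """
    remaining = set(nodes)
    while remaining:
        roots = {
            n for n in remaining
            if not any(child == n and parent in remaining
                       for parent, child in edges)
        }
        if not roots:
            return False
        remaining -= roots
    return True


def neighbors(nodes, edges):
    """Yield every graph one local move away: add, delete, or reverse an edge."""
    for parent, child in permutations(nodes, 2):
        edge = (parent, child)
        if edge in edges:
            yield edges - {edge}                        # delete the edge
            yield (edges - {edge}) | {(child, parent)}  # reverse the edge
        elif (child, parent) not in edges:
            yield edges | {edge}                        # add the edge


def greedy_hill_climb(nodes, score, start=frozenset()):
    """Take the best-scoring acyclic local move until no move improves the score."""
    current = frozenset(start)
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbors(nodes, current):
            candidate = frozenset(candidate)
            if not is_acyclic(nodes, candidate):
                continue
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        if best is None:  # local optimum reached
            return current, current_score
        current, current_score = best, best_score
```

A simulated annealing variant would replace the "best move" step with a randomly chosen move that is accepted with a temperature-dependent probability; swapping one such component for another is the kind of mixing and matching the modular design is meant to support.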

SMLR

SMLR (Sparse Multinomial Logistic Regression) is an efficient implementation of a true multiclass probabilistic classifier based on the well-studied multinomial logistic regression framework. Within this framework, we adopt a Bayesian perspective, enabling us to incorporate a Laplacian prior (related to LASSO) which promotes the learning of a sparse weight vector. The result is a classifier that can operate either directly on input features and perform automatic feature selection (embedded, not filter or wrapper), or with a kernel and perform automatic sample selection (much like the SVM). The objective function is convex so it has a unique global optimum. SMLR software implements a suite of bound-optimization algorithms that we have developed to find this unique optimum efficiently, even when the number of samples or features is large (at least tens of thousands).
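
To make the objective concrete, the following Python sketch evaluates a multinomial logistic negative log-likelihood with an L1 penalty, which is what a Laplacian prior contributes to the negative log posterior (up to a constant). It is an illustration under stated assumptions, not the SMLR implementation: the function names, the penalty weight `lam`, and the generic soft-thresholding helper are hypothetical, and the bound-optimization algorithms in the SMLR software are not reproduced here.

```python
import math


def softmax_probs(weights, x):
    """Class probabilities for one sample; weights holds one weight vector per class."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def penalized_neg_log_likelihood(weights, X, y, lam):
    """Negative log-likelihood plus lam times the L1 norm of all weights."""
    nll = -sum(math.log(softmax_probs(weights, x)[label])
               for x, label in zip(X, y))
    l1 = sum(abs(w_j) for w in weights for w_j in w)
    return nll + lam * l1


def soft_threshold(value, threshold):
    """Shrink a weight toward zero, setting it exactly to zero when it is small.

    This is the generic proximal step for an L1 penalty, shown only to
    illustrate why such a penalty yields sparse weight vectors.
    """
    return math.copysign(max(abs(value) - threshold, 0.0), value)
```

Weights that the penalty shrinks to exactly zero correspond to features (or, with a kernel, samples) that the classifier ignores, which is the sense in which selection is embedded in training.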

PRIORITY

PRIORITY is a tool for de novo motif discovery in the context of transcription factor (TF) binding sites. It implements a new approach to motif discovery in which informative priors over sequence positions are used to guide the search. Although this approach will work for any motif model and any search/optimization strategy, the initial version of PRIORITY adopts a PSSM model and collapsed Gibbs sampling. PRIORITY is packaged with priors designed to measure how likely each sequence position is to be bound by three specific structural classes of TFs: basic leucine zipper, forkhead, and basic helix loop helix. In addition to discovering TF binding sites and a motif model for those binding sites, PRIORITY also predicts the structural class of the TF recognizing the binding sites.

COMPETE

COMPETE predicts the quantitative occupancy level of DNA binding factors—including transcription factors, nucleosomes, and the origin recognition complex—that compete to bind along the genome. The prediction reflects the quantitative occupancy of each factor at each genomic position, and is computed as a weighted average over the entire thermodynamic ensemble of all potential binding configurations. Each of those configurations has a certain probability, which itself depends on the different sequence affinities and concentrations of the various factors in the model. The goal of the COMPETE software package is to be high performance, flexible, and extensible.
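
The ensemble average described above can be illustrated on a toy scale with the Python sketch below. It enumerates every non-overlapping binding configuration on a very short sequence by brute force, which is feasible only for tiny examples and is not how the COMPETE package computes its predictions. The function names and the `weights` table (assumed here to hold one statistical weight per factor and start position, standing in for concentration times sequence affinity) are hypothetical.

```python
def enumerate_configs(length, footprints):
    """Yield every non-overlapping placement as a list of (factor, start) pairs."""
    def extend(pos):
        if pos >= length:
            yield []
            return
        # Option 1: leave position `pos` unbound.
        yield from extend(pos + 1)
        # Option 2: start a factor at `pos` if its footprint fits.
        for factor, size in footprints.items():
            if pos + size <= length:
                for rest in extend(pos + size):
                    yield [(factor, pos)] + rest
    yield from extend(0)


def occupancy(length, footprints, weights):
    """Probability that each factor covers each position, averaged over the ensemble.

    footprints: factor name -> number of positions the factor covers
    weights:    factor name -> list of statistical weights, indexed by start position
    """
    partition = 0.0
    covered = {factor: [0.0] * length for factor in footprints}
    for config in enumerate_configs(length, footprints):
        # The weight of a configuration is the product of its binding-event weights;
        # the empty configuration has weight 1.
        w = 1.0
        for factor, start in config:
            w *= weights[factor][start]
        partition += w
        for factor, start in config:
            for i in range(start, start + footprints[factor]):
                covered[factor][i] += w
    return {factor: [v / partition for v in vals]
            for factor, vals in covered.items()}
```

For example, `occupancy(3, {"nuc": 2}, {"nuc": [1.0, 1.0, 1.0]})` considers three configurations (nothing bound, a factor covering positions 0 and 1, or a factor covering positions 1 and 2), so the middle position is covered in two of the three equally weighted configurations.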

RoboCOP

RoboCOP automatically produces chromatin occupancy profiles (COPs) from any combination of chromatin accessibility data: ideally fragment data from paired-end MNase-seq, but alternatively accessibility data from ATAC-seq or DNase-seq. RoboCOP uses a multivariate hidden Markov model (HMM) to compute a probabilistic occupancy landscape of nucleosomes and hundreds of TFs genome-wide at single-nucleotide resolution. The link above is to a GitHub repository for RoboCOP, which is primarily implemented in Python and C, but calls some functions in R.

TOP

TOP predicts the quantitative occupancy of hundreds of transcription factors (TFs) from a single DNase- or ATAC-seq experiment, allowing one to efficiently study how TF binding changes genome-wide across cell types, over time, or across varying genetic backgrounds. TOP, which stands for TF occupancy profiler, uses a Bayesian hierarchical regression framework and is implemented in R. The link above is to a GitHub repository; precomputed tracks of quantitative predictions of genome-wide occupancy for hundreds of TF × cell type combinations are being uploaded and will be made available shortly.

MILLIPEDE

MILLIPEDE uses DNase-seq data to predict whether a given site in the genome is likely to be bound by a transcription factor. Potential binding sites are determined based on low-threshold TF motif matching, and then DNase data at the site and at its upstream and downstream flanking regions are provided to a trained logistic regression classifier to predict whether the site is indeed bound in this experiment. The model benefits from supervision, but semi-supervised and unsupervised variants work nearly as well. MILLIPEDE bins the DNase data to reduce the size of the parameter space and guard against overfitting. Its name is a pun on the fact that it has at least an order of magnitude fewer parameters than the popular CENTIPEDE model that preceded it.
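
The binning idea can be sketched in a few lines of Python. This is a hedged illustration rather than the MILLIPEDE code: the function names, the number of bins, the log transform of the binned counts, and the weights are all assumptions made for the example, and a trained model would supply its own parameters.

```python
import math


def bin_window(cuts, flank, n_flank_bins):
    """Collapse per-base DNase cut counts into a handful of summed bins.

    cuts is assumed to cover [upstream flank | motif | downstream flank],
    with both flanks `flank` bases long. Each flank is split into
    `n_flank_bins` equal-width bins and the motif forms a single bin.
    """
    motif_len = len(cuts) - 2 * flank
    if motif_len <= 0 or flank % n_flank_bins != 0:
        raise ValueError("window too short or flank not divisible into bins")
    width = flank // n_flank_bins
    upstream = [sum(cuts[i:i + width]) for i in range(0, flank, width)]
    motif = [sum(cuts[flank:flank + motif_len])]
    down_start = flank + motif_len
    downstream = [sum(cuts[i:i + width])
                  for i in range(down_start, len(cuts), width)]
    return upstream + motif + downstream


def bound_probability(features, weights, intercept):
    """Logistic regression score for one candidate site.

    The log(1 + x) transform of the binned counts is an assumption of
    this sketch, used to tame the heavy tail of count data.
    """
    z = intercept + sum(w * math.log1p(x) for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With two bins per flank, a window of any length is reduced to five features, so the classifier needs only five weights and an intercept instead of one parameter per base.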

NucID

NucID (Nucleosome Identification using DNase) is Python software that uses DNase-seq data to map nucleosome positions genome-wide. Nucleosome positions are identified on the basis of nucleosome scores that are computed from single-end DNase-seq read counts using a Bayes-factor–based method. The nucleosome scores reflect the relative posterior probability that a given nucleosome-sized window (147 bp) is occupied by a nucleosome versus not. The link above is to a GitHub repository that also contains a Jupyter notebook demonstrating how to use the software. Pre-computed tracks of genome-wide nucleosome scores are also available here.

CLOCCS

CLOCCS (Characterizing Loss of Cell Cycle Synchrony) is a branching process model that precisely characterizes how a population of synchronized cells loses synchrony as they repeatedly progress through the cell division cycle. Parameters of the model capture imperfections in the initial synchrony, gradual loss of synchrony due to cell-to-cell variation in cell cycle progression, and losses of synchrony due to potential asymmetric cell division. The model produces precise estimates of the fraction of the cell population at any stage of the cell cycle at any point in time (including times not measured). These estimates are determined by an MCMC fit to observational data, which can be flow cytometric measurements of DNA content and/or counts of binary markers of progression, for example budding index or other measures arising from fluorescence microscopy. The model can also account for the fact that some small but non-negligible number of cells may be dead or halted during the experiment (and thus not progressing through the cell cycle like the other cells, muddying the observational data).

DECONV

Our DECONV algorithm takes the parameter estimates from CLOCCS software, described above, and uses them to deconvolve time-course data collected from a population of cells during a cell cycle synchrony-release experiment. In so doing, the algorithm learns a more accurate view of dynamic cell-cycle processes, free from the convolution effects associated with imperfect cell synchronization. Through wavelet-basis regularization, our DECONV method sharpens signal without sharpening noise, and can remarkably increase both the dynamic range and the temporal resolution of time-series data. The link above is to a website sharing the results of this algorithm applied to a yeast cell-cycle transcription time course.