COMPSCI 260, Fall 2020

Overview

A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and basic gene expression analysis. Methods include dynamic programming, indexing, hidden Markov models, and elementary supervised and unsupervised machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the computer language Python.

The course will require students to program often in Python. Students coming in to the course must already know how to program in some computer language, but it need not be Python. If it is not Python, students will be expected to come quickly up to speed in Python on their own. Additionally, students should be comfortable with mathematical thinking and formulas, and should have had some exposure to basic probability as well as molecular or cellular biology; however, the course has no formal course prerequisites, and quick refreshers of relevant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.

Staff

Professor Alex Hartemink

Webpage: http://www.cs.duke.edu/~amink
Email: amink at cs.duke.edu
Office Location: LSRC D239
Office Phone: (919) 660-6514

Rachel Kositsky, TA	Email: rachel.kositsky at duke.edu
Sneha Mitra, TA	Email: sneha.mitra at duke.edu
Trung Tran, TA	Email: trung.tran at duke.edu
Abbey List, UTA	Email: abbey.list at duke.edu
JonMark Pintas, UTA	Email: jonmark.pintas at duke.edu
Michael Williams, UTA	Email: michael.williams3 at duke.edu
Shreyas Kulkarni, UTA	Email: shreyas.kulkarni at duke.edu
Siyun Lee, UTA	Email: siyun.lee at duke.edu
Vicki Lu, UTA	Email: vicki.lu at duke.edu

Office hours

Office hours with TAs and UTAs will be held online at the following times (starting on Friday 21 August); at least one session is available each day, with two sessions on Tuesdays and Wednesdays. Directions for how to access office hours over Zoom will be posted on Piazza. Remember that you will get help quickest by asking your questions on Piazza (in fact, your question may already be answered there).

Sunday, 6-8pm: Rachel Kositsky
Monday, 9-11pm: JonMark Pintas
Tuesday, 4-6pm: Vicki Lu
Tuesday, 8-10pm: Michael Williams
Wednesday, 4-6pm: Trung Tran
Wednesday, 8-10pm: Shreyas Kulkarni
Thursday, 12-2pm: Sneha Mitra
Friday, 1:30-3:30pm: Siyun Lee
Saturday, 2-4pm: Abbey List

If you would like to speak with the instructor about anything, you are welcome to send an email to schedule a meeting at a time that is convenient for you.

Logistics

The class meets on Tuesdays and Thursdays 10:15–11:30AM on Zoom. The Zoom link can be found on the COMPSCI 260 site within Sakai.

Course schedule

Note: The course schedule may change subtly from time to time. Always check the web page for the most up-to-date schedule.

Session	Date	Instructor	Topic	Assignment (out/due)
1	Tue 18 Aug	AH	Course introduction; SARS-CoV-2 genome introduction	PS0 out
2	Thu 20 Aug	AH	Gene/genome organization; SARS-CoV-2 genome revisited	PS1 out
3	Tue 25 Aug	AH	Algorithm introduction; Time and space resources	PS0 due
4	Thu 27 Aug	AH	Analyzing algorithms; Designing efficient algorithms	PS1 due; PS2 out
5	Tue 01 Sep	AH	Divide-and-conquer introduction and exploration
6	Thu 03 Sep	AH	Divide-and-conquer fails; Memoization; Dynamic programming	PS2 due; PS3 out
7	Tue 08 Sep	AH	DNA sequencing; Genome assembly
8	Thu 10 Sep	AH	Genome assembly; HGP and Celera	PS3 due
9	Tue 15 Sep	AH	Short-read mapping; Suffix arrays; BWT	PS4 out
10	Thu 17 Sep	AH	BWT, continued; FM-index
11	Tue 22 Sep	AH	Sequence variation and alignment; Global alignment
12	Thu 24 Sep	AH	Traceback; Aligning sequences with affine gap scores	PS4 due; PS5 out
13	Tue 29 Sep	AH	Affine gap alignment traceback; Local alignment
14	Thu 01 Oct	AH	Local alignment traceback; FASTA and BLAST heuristics	PS5 due; PS6 out
15	Tue 06 Oct	AH	Phylogenetic trees; Time and distance
16	Thu 08 Oct	AH	Building phylogenetic trees (UPGMA and NJ)	PS6 due; PS7 out
17	Tue 13 Oct	AH	Probability; Discrete and continuous random variables
18	Thu 15 Oct	AH	Joint, marginal, and conditional; Bayes rule	PS7 due
19	Tue 20 Oct	AH	Models; Parameter estimation: ML, MAP, PME	PS8 out
20	Thu 22 Oct	AH	Factoring; Graphical models; Markov models
21	Tue 27 Oct	AH	Hidden Markov models
22	Thu 29 Oct	AH	Viterbi decoding and traceback	PS8 due; PS9 out
23	Tue 03 Nov	AH	Posterior decoding and traceback
24	Thu 05 Nov	AH	Estimating HMM parameters; Baum-Welch; GENSCAN	PS9 due; PS10 out
25	Tue 10 Nov	AH	Further HMM extensions and applications
26	Thu 12 Nov	AH	Course summary; Course evaluations	PS10 due

AH: Alex Hartemink

Advanced HMM applications papers

GENSCAN: Burge and Karlin 1997. This paper reports the key result of Burge's Ph.D. dissertation; as an aside, the senior author, Samuel Karlin, is the same fellow that helped develop the significance statistics for BLAST database searching.
The original profile-HMM paper (1994!), which led to software tools like SAM and HMMer, protein domain databases like Pfam and InterPro, and eventually inspired the PSI-BLAST heuristic.
A review of the impact of profile-HMMs in 1996 and again in 1998. As an aside, the latter paper mentions a new database called BLOCKS, which was later used to estimate substitution frequencies in aligned protein domains, leading to BLOCKS-derived SUbstitution Matrices, which we now know as BLOSUM matrices.
TMHMM, an elegant HMM model for finding and understanding the structure of transmembrane proteins (those that are not soluble in the cytoplasm but are embedded in a lipid bilayer membrane like the cell membrane, Golgi, vacuole, ER, etc.).

Tree of life papers

Seminal papers developing sequence alignment and database search

Graph search

An overview of the DFS and BFS algorithms for visiting the nodes of a graph.

Example illustrating BWT, SA, and rudimentary FM-index

An example illustrating the Burrows-Wheeler Transform (BWT) of a short genomic text, and relating that to the simplest version of a suffix array, as well as to the beginnings of the FM-index data structure.

Early "next-generation" genome sequencing technology documents

Papers debating the merits of shotgun sequencing the whole human genome

Papers reporting a newly sequenced genome

Closest pair of points

A careful description of the algorithm for finding the closest pair of points in O(n log n) time. This is from the 2nd edition of "Introduction to Algorithms" (fondly known as CLRS).

Learn more about sorting

If you'd like to learn a bit more about sorting—how different algorithms work and how they compare in practical terms for specific kinds of inputs—check out this cool demo site I visited during lecture. Also, here are some fun videos: quickly visualizing the execution of 15 sorting algorithms and a Hungarian folk dance version of bubble sort (if you can't get enough of that, there are many more).

Master theorem for solving certain recurrence relations

The recurrence relations that arise in analyzing divide-and-conquer algorithms commonly take on a certain form in which the running time for a problem of size n can be expressed in terms of the running time of a copies of a problem that is b times smaller (i.e., size n/b), plus some extra work (which might depend on n). In such cases, a powerful master theorem can help you solve just such a recurrence.

Python tutorial materials

All the PDF slides, Python exercises, and solutions we provided during the Python tutorials, Parts 1 and 2, are available for download here.

For shoring up your biology background

Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:

Molecular Biology for Computer Scientists, a short primer by Larry Hunter (from the early 1990's but still an excellent brief distillation).
The Chemistry of Life, a short primer by Michael Behe.
Molecular Biology of the Cell, a huge classic textbook by Bruce Alberts, et al., is available for free online at NCBI. It is a good resource for biological questions and background, but can only by accessed via a search interface. If you don't understand a term or concept, you can try searching here.

Textbooks mentioned in class

The various books mentioned in class are summarized here; each is linked to Amazon where you can read more (these are not affiliate links). Note that none of these books is compulsory for the class, though you may benefit from one or more. As for the books on Python, many resources are now available free online, even complete textbooks downloadable as PDFs (you'll save trees (unless you print them)).

Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids: This book covers a good chunk of the material that will be covered in the course with a distinctly probabilistic focus. It is very well written and at a fairly advanced level. It is an excellent reference for folks continuing on in this area.
https://www.amazon.com/exec/obidos/ASIN/0521629713/
An Introduction to Bioinformatics Algorithms: This book by Jones and Pevzner covers many of the topics that will be covered in class, but organizes the material around algorithmic methods rather than biological problems. So one chapter presents many problems across bioinformatics that exploit string matching, etc. While not complete for this course, it is a nice, readable, undergraduate textbook.
https://www.amazon.com/exec/obidos/ASIN/0262101068/
Bioinformatics Algorithms: An Active Learning Approach: A more recent book by Pevzner and Compeau, this time an attempt to capture/complement the teaching that they did together as part of a Coursera course. I have not yet had a chance to examine it too closely, but the course received high praise.
https://www.amazon.com/exec/obidos/ASIN/0990374637/
Introduction to Computational Genomics: A Case Studies Approach: Hahn was a grad student at Duke in the lab of Greg Wray and took this very class once upon a time. I love the case study approach: it makes for very interesting reading. At the end of the day, I decided not to require the book because it doesn't go into much depth, but in terms of overlap with course content, it comes pretty close so it may be useful for some.
https://www.amazon.com/exec/obidos/ASIN/0521671914/

Directions for setting up Anaconda Python and PyCharm IDE

In this class, we will write all our code using Python 3.8, the latest version of which can be downloaded free for any OS from Anaconda. Anaconda (or its minimalist cousin Miniconda) sets up Python in a special environment to prevent it from conflicting with other versions of Python you may have installed. It also includes a tool called conda for managing Python environments and adding new packages (though we will not be using any in this course).

We encourage everyone to develop their Python code using the PyCharm IDE from JetBrains, which is free for educational use. We provide clear directions for setting up the PyCharm IDE, and can offer assistance for students that use it, but if you prefer to develop your code using another IDE, you are free to go your own way.

Complete directions for setting all this up can be found here.

Directions for Problem Set 0

Once you have set up a Python 3.8 environment, and have the PyCharm IDE configured to use it, you are ready to complete a “toy” problem set, which we call PS0. This will familiarize you with how course problem sets are structured, and will confirm that you are ready to download problem sets, write Python code, and submit your work to Gradescope. Completing PS0 on time is worth 3% of your final course grade.

All problem sets will be available at URLs announced within Piazza, so be sure to configure access to Piazza and look for the post announcing the release of PS0.

When you are ready to submit a problem set, you will do so directly to our course Gradescope site.

Academic integrity

All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. In particular, be reminded that it is not acceptable to take the ideas/work of another and represent it as one's own, even if paraphrased. Ideas/work taken from others—including Internet sources, peers in the class, peers from outside the class—must always be appropriately cited.

Violations of academic integrity will be taken very seriously. At a minimum, assignments in which a student either receives inappropriate input from others or provides inappropriate input to others will be graded as 0. In addition, all violations will be discussed with the Dean who directs the Office of Student Conduct.

Collaboration policy

Unless expressly granted otherwise, the entirety of every problems set should be completed individually; no collaboration is permitted. If you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, in order to make a bit more progress, your first course of action should be to post a question on Piazza, or to speak to the instructor or TAs.

If for any reason you consult your peers outside of Piazza, such an interaction must be one of consultation and not collaboration: hints to help overcome a small obstacle rather than answers—after consultation, it is expected that you should still have plenty of thinking to do. In addition, if you do happen to consult with another student, both of you must cite this.

Note that posting your work or our course materials, especially our solutions, onto a repository accessible to other students—whether a publicly accessible one like GitHub or a less public repository nevertheless accessible to other students, now or in the future—is a violation of the collaboration policy (and also copyright law). It is considerate to avoid sharing course materials that might tempt others to violate the course collaboration policy and thereby their academic integrity.

Extension policy

You will generally have one week to work on each problem set (although in a few cases, we will give you an early start). With the exception of PS0, all problem sets will be due on Thursday evening at 5pm. If you turn in work after the 5pm deadline, there will be a late penalty amounting to 10% of the total number of available points if you are 0–12 hours late, and 20% if you are 12–24 hours late. No work will be graded if it is more than 24 hours late.

That said, this semester will be challenging for everyone in different ways and at different times, and in acknowledgement of that, we want to be flexible and accommodating. To that end, all students will have their two largest late penalties waived. Put another way, you may elect to turn in your work up to 24 hours late twice this semester without penalty.

Grading policy

Problem Set 0: 3%: Before the main problem sets are released, we release an initial “toy” problem set to ensure everyone is familiar with downloading problem sets, editing Python files, and submitting their completed work on time to Gradescope. One key element of this simple problem set will be a survey asking you to reflect on why you are taking this course.
Problem Sets 1–10: 90%: All students are expected to complete ten problem sets over the semester, each contributing equally to this component of the grade. These will be issued at regularly weekly intervals, but we will incorporate two breaks during the semester so that it doesn't feel like the work is relentless.
Participation and engagement: 5%: All students are expected to regularly attend class online (and watch the recording of class any day they happen to be absent). You are also expected to be active on Piazza, posting questions, notes, and answers: helping each other as questions arise and raising interesting points for further conversation. Finally, please make use of office hours if you need help understanding something better, and please feel free to ask questions at any point in class—if the material is unclear, or if it simply leads you to wonder about a new connection. Whether something is troubling or exciting you about the material, please do not hesitate to speak up about it.
Course evaluation surveys: 2%: At the end of the semester, you will be asked to complete two course evaluation surveys, one the official university evaluation and one a customized Google survey, each contributing equally to this component of the grade.

Approach to grading

We have designed problem sets in the class to permit you to explore the material, and to develop deeper understanding of the material through that exploration. I ask you to focus on the ideas and the learning rather than on the points and the credit; put another way, adopt a perspective of how you can work to satisfy your expectations rather than work to satisfy the instructor's expectations.

That said, when it comes to grading, we still need to assign points and credit: this is unfortunately unavoidable. However, we have designed our approach to assigning credit in an attempt to be consistent with the perspective of the previous paragraph, and the approach is perhaps a little different from what you may be familiar with in other classes. Specifically, I have asked the graders to frame their grading in terms of ‘positive earning’ rather than ‘negative error’.

What do I mean by this? Well, a ‘negative error’ approach is one in which one assumes one's work will earn full credit unless there are mistakes present. Under such an approach, graders are negatively tasked with finding mistakes and errors, and taking away points for any they find.

I have inverted this by choosing to adopt a ‘positive earning’ approach in which an empty problem set earns no points, and students earn more points as they demonstrate deeper levels of mastery of the material and challenge. Under such an approach, graders are instead positively tasked with finding ways that students should earn credit for deeply engaging the material.

A corollary of the ‘negative error’ approach is that unless a student makes a mistake, they are entitled to the highest number of points possible. Conversely, a corollary of the ‘positive earning’ approach is that it is possible for a student to not make any mistakes yet still not earn the highest number of points possible. For example, this can happen if a student minimally engages the material, and while not making any mistakes, never demonstrates mastery or depth of understanding. Our ‘positive earning’ approach not only focuses on the positive instead of the negative, but it also leaves room to grant more credit to students who engage the material more deeply.

I write all this because if you find that you did a problem without making a mistake, but got only +6 when some other student may have gotten +8, it doesn't necessarily mean that something is wrong (though it might be). It could mean that there were some interesting ways to engage the problem you didn't explore that the other student did. An analogy might be from a video game like Mario Brothers: you can successfully rescue the princess but still not end up with the highest score because someone can score higher if they take the time to explore a pipe that leads in a new direction. Analogously, earning the highest number of points possible usually requires more than just ‘no mistakes’; it also requires demonstration of mastery and engagement. We use rubrics to apply these judgments consistently across the class, and the rubrics are not pre-determined: our rubrics adapt to give credit for the new ways we see students engaging a problem.

Distributing grades

Students will submit their problem set work directly to our course Gradescope site. After grading each assignment, results will be available to students within Gradescope. Once scores are finalized in Gradescope, they will move into the gradebook on Sakai, where they will accumulate throughout the semester. Storing scores in the gradebook will be our primary use of Sakai (though I have also put there links to all the other various course websites).

Piazza

This class uses Piazza for course announcements, communication, and discussion. The Piazza system is highly catered to getting you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can provide responses, as well as benefit from the responses.

Posting a question to Piazza is the fastest way to get help, and it's also the most efficient way for us to provide help, because if two people have the same question, we only need to answer it once. On that note, don't forget to do a quick keyword search to see if your question has already been answered before posting it: the fastest answer is the one that's already there!

To enroll yourself in the Piazza site for this class, you will first need to log in to the COMPSCI 260 site on Sakai. Once in Sakai, you should select Piazza from the menu on the left: You will be taken to Piazza in a new tab and prompted to log in there (or create a new Piazza account if you do not yet have one).

Once you have enrolled yourself in the Piazza site by accessing it via Sakai, you no longer need to go through Sakai to access it in the future. You can instead directly visit our class page at this URL: https://piazza.com/duke/fall2020/compsci260/home.

IMPORTANT DETAILS:

Many of you will be in the situation that you've enrolled (or been asked to enroll) in Piazza using more than one email address (e.g., netid@duke.edu and first.last@duke.edu). This can lead to multiple accounts, or mismatches between Piazza and Sakai, etc. The best way to handle this is to decide on whatever will be your primary email address, log in to your Piazza account with that address, and then under Account/Email Settings (look for the gear icon in the upper right), add the other email address(es) to your account. This will merge your multiple accounts into one, and make sure that you're able to be properly enrolled in a Piazza course no matter what email address is used to invite you.
Please add your full name (both first and last) to your Piazza account: We can't properly alphabetize the roster without this. Your first name should be the name you would like us to use when addressing you (it need not be your given first name if that's not what you go by).

Data expedition challenge: Signal, noise, and bias in yeast MNase-seq data

This is an optional challenge for students interested in applying what we have learned in class to a real computational genomics research problem; practicing the skills of using Python or R (or any other tool you wish) to visualize, analyze, model, and interpret real genomic data; and exploring the science linking chromatin structure and transcriptional regulation. Since this problem represents an open challenge for the genomics community, you are free to choose the approaches you use to analyze the data, as well as the questions you explore. Creative projects are highly encouraged. You may work in small teams (2-3 is ideal). For all submissions we receive by the deadline of 15 Dec 2020, we will provide feedback, and will also designate a best project as well as a most creative project. There will be (simple) prizes!

Data description

In this data expedition challenge, we will explore next-generation sequencing reads from MNase-seq experiments in yeast. The data was generated to detect genome-wide binding locations of various kinds of DNA-binding proteins. The MNase-seq data sets were collected at Duke as part of our ongoing computational genomic research collaboration with the lab of Prof. David MacAlpine in the Department of Pharmacology and Cancer Biology.

Biological background

DNA-binding proteins, including nucleosomes and transcription factors (TFs), play essential roles in gene regulation, and their locations along the genome help give us clues about how genes are regulated. Recently, a new MNase-seq protocol was developed by the MacAlpine group at Duke in conjunction with the Henikoff group at the University of Washington. [1] The basic idea is that genomic locations not bound by proteins are accessible to micrococcal nuclease (MNase) and are therefore more sensitive to MNase digestion. Conversely, genomic locations bound by proteins are less sensitive to MNase digestion.

Consequently, if we sequence the ends of the fragments that remain after MNase digestion, and map the paired sequencing reads that arise, we should be able to see where MNase was able to digest/cut the genome, revealing something about the binding locations of DNA-binding proteins along the genome. It is important to note that the genome of each individual cell in a population may be in a slightly different occupancy/protection state. We collect data from a population of cells so this experiment is sampling the different protection states present in the cell population.

Complicating the issue further, MNase is also known to have a nucleotide-specific bias as it digests DNA, meaning that it tends to cleave/digest certain sequences more than others. For example, it prefers to digest A/T nucleotides compared to G/C (its bias is actually a bit more subtle/complex than that, which is a nice model selection challenge you can explore: what is the simplest model that captures well this bias?). To give you further information about this sequence bias, we are also providing MNase digestion data of naked (deproteinized) DNA in vitro which will allow for the development of models to quantify such bias (because with this data, the variation in cutting that you see is only the result of the MNase interacting with the naked DNA and is not influenced by protein protection).

Data sets

Usually, sequencing reads are stored in files of fastq format. In this case, we downloaded two large yeast MNase-seq read fastq files: in vivo yeast MNase-seq read files generated by Henikoff et al. [1], and in vitro yeast MNase-seq read files generated by Deniz et al. [2] for use in quantifying MNase digestion bias. Both files contain short sequencing reads, of length 25 and 54 base pairs, respectively. The total number of reads in each file is on the order of 100 million. For reference, the yeast genome contains 16 chromosomes whose total size is approximately 12.5 million base pairs.

To analyze those sequencing reads, you would typically first need to map the reads to a reference genome, using tools like BOWTIE. However, to simplify this challenge, we have already performed this mapping step for you. We are thus providing you one tab-delimited text file for each of the first 12 yeast chromosomes, named with ChrI to ChrXII (yeast geneticists like Roman numerals); we will reserve the remaining 4 yeast chromosomes to evaluate your submitted results. Each file contains the start and end genome coordinates of all the reads mapped to that chromosome, one read per line. You may notice that the distances between the start and end coordinates are larger than 25 or 54 base pairs. That is because the MNase-seq experiments produce paired-end reads and we are indicating the coordinates of the spanned fragment from which the two reads come. So the provided coordinates are the start coordinate of one read, along with the start coordinate of its mated read on the opposite strand; or, put another way, the first and last nucleotide of the fragment.

It is reasonable to think of the start and end coordinates as nucleotides just beyond which MNase cleaved the DNA, while the sequence between the start and end coordinates was not digested by MNase. We also provide the whole yeast genome sequence (sacCer2 2008 version, in separate fasta files) if you wish to extract the actual sequence around the cleavage sites based on the provided coordinates.

All data files for this challenge are available from:

Potential research questions and challenges you can explore

You will need to do some independent exploration to figure out what to do next. You may want to read more about the MNase enzyme and how it works, or what is known about it. You probably want to get more info about the MNase-seq protocol, as described in the original paper. [1] Then you can start exploring one or multiple of the following, depending on what suits your fancy, or you may have other ideas of your own:

Investigate the bias of the MNase enzyme in digesting different nucleotides. (You can use the in vitro data to visualize and then model the distributions of read counts conditioned on the immediate nucleotides at the cutting site. However, because MNase is a protein that contacts more than one nucleotide of DNA, the interaction between MNase and DNA sequence perhaps spans more than one nucleotide, and more sophisticated models based on multiple nucleotides around the cutting site can also be explored.)
Study the distribution of MNase-seq read counts around transcription factor binding motifs to predict transcription factor binding sites: Are there specific signals (also called "footprints") left by bound transcription factors? Are these signals specific to each TF, or are there a few canonical footprint types (clusters of footprints shared by sets of TFs)? Or is there a single kind of footprint that just looks different for different TFs because of the role of MNase bias?
Study the distribution of MNase-seq read counts around nucleosomes: Can we define a nucleosome footprint to predict nucleosomal binding locations?
Visualize and compare the in vivo and in vitro MNase digestion profiles for a few transcription factors to explore the contribution of bias to the in vivo signal.
Anything else you find particularly intriguing or feel compelled to discover.

Potential tasks and approaches for data analysis

Visualize MNase digestion signals along the genome in a UCSC genome browser. This will allow you to see how they are distributed and how they compare to other tracks in the genome browser (like locations of coding regions, TSSs, nucleosomes, TFs, etc.).
Build PSSM (position-specific scoring matrix) or other more sophisticated statistical models to quantify the MNase digestion bias.
Visualize the digestion profiles (footprints) around TF motifs and nucleosomes using scientific plot-generating code (Python or R).
Anything else you find helpful in your explorations.

Good luck, and have fun on this expedition!

References (original sources of data)

Henikoff, J.G. et al. (2011) Epigenome characterization at single base-pair resolution. Proc. Natl. Acad. Sci. U.S.A. 108, 18318—18323.
Deniz, O. et al. (2011) Physical properties of naked DNA influence nucleosome positioning and correlate with transcription start and termination sites in yeast. BMC Genomics 12, 489.

COMPSCI 260: Introduction to Computational Genomics