A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the scripting language Perl.
The course will require students to program in Perl. Students coming into the course should know how to program in some language, but it need not be Perl. Students should also have had some exposure to basic probability, statistics, and molecular or cellular biology; however, the course has no formal prerequisites, and significant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.
Professor Alex Hartemink
Alex Hartemink
Michael Mayhew
Daphne Ezer
If these office hours do not work for you, please post questions via Piazza, or send any of us an email to schedule an alternate time.
The class meets on Wednesdays and Fridays from 10:05 until 11:20 in LSRC, Room D243.
Note: The course schedule may change subtly from time to time. Always check the web page for the most up-to-date schedule.
| Session | Date | Instructor | Topic | Assignment |
|---|---|---|---|---|
| 1 | Wed 31 Aug | AH | Course introduction; Structure of SARS genome | |
| 2 | Fri 02 Sep | AH | Molecular biology primer: DNA, RNA and protein | PS1 out |
| 3 | Wed 07 Sep | AH | Gene/genome organization; SARS genome revisited | |
| 4 | Fri 09 Sep | AH | Algorithm analysis and design | |
| 5 | Wed 14 Sep | AH | Divide-and-conquer | |
| 6 | Fri 16 Sep | AH | Divide-and-conquer fails | PS1 due; PS2 out |
| 7 | Wed 21 Sep | AH | Memoization; Dynamic programming | |
| 8 | Fri 23 Sep | AH | Greedy algorithms; Sequence variation | |
| 9 | Wed 28 Sep | AH | The alignment problem; Aligning sequences globally | |
| 10 | Fri 30 Sep | AH | Local alignment; Aligning sequences with affine gap scores | PS2 due; PS3 out |
| 11 | Wed 05 Oct | DW | FASTA and BLAST heuristics | |
| 12 | Fri 07 Oct | AH | DNA and genome sequencing; Human Genome Project and Celera | |
| 13 | Wed 12 Oct | AH | Genome assembly | |
| 14 | Fri 14 Oct | AH | Next-gen sequencing; Indexes and short-read alignment | PS3 due; PS4 out |
| 15 | Wed 19 Oct | LB | Tour of the Duke Genome Sequencing facility | |
| 16 | Fri 21 Oct | AH | Tree of life; Phylogenetics and comparative genomics | |
| 17 | Wed 26 Oct | RG | Building phylogenetic trees (UPGMA and NJ) | |
| 18 | Fri 28 Oct | RG | Unsupervised machine learning: clustering | PS4 due; PS5 out |
| 19 | Wed 02 Nov | RG | Supervised machine learning: classification | |
| 20 | Fri 04 Nov | AH | Probability; Discrete and continuous random variables; Infinity | |
| 21 | Wed 09 Nov | AH | Joint, marginal, conditional; Bayes rule; Parameter estimation | |
| 22 | Fri 11 Nov | AH | Markov and hidden Markov models (HMMs) | PS5 due; PS6 out |
| 23 | Wed 16 Nov | AH | Estimating HMM parameters; Viterbi decoding | |
| 24 | Fri 18 Nov | AH | Posterior decoding | |
| | Wed 23 Nov | | THANKSGIVING BREAK | |
| | Fri 25 Nov | | THANKSGIVING BREAK | |
| 25 | Wed 30 Nov | AH | Baum-Welch and optimization; HMMs for finding spliced genes | PS6 due; PS7 out |
| 26 | Fri 02 Dec | AH | Motif finding; Multiple sequence alignment | |
| 27 | Wed 07 Dec | AH | Profile HMMs; Course evaluations | |
| 28 | Fri 09 Dec | AH | Course summary | PS7 due |
AH: Alex Hartemink, RG: Raluca Gordân, DW: Debbie Winter, LB: Lisa Bukovnik
Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:
Here is the SARS genome handout from class. Here is a text file containing the SARS genome (Tor2 isolate). Have fun parsing it! You can also find it, and a lot more information about it, in GenBank. Visit the GenBank entry and see what you can learn.
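As a rough idea of what "parsing it" could look like, here is a minimal sketch. It is written in Python for brevity, although the course itself uses Perl. It assumes the text file is in FASTA format (a `>` header line followed by sequence lines), which this page does not actually state. The function names `parse_fasta`, `base_counts`, and `gc_content` are hypothetical, and the input is a toy string rather than real SARS data.

```python
from collections import Counter


def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are upper-cased so 'a' and 'A' count together.
            records[header].append(line.upper())
    return {name: "".join(chunks) for name, chunks in records.items()}


def base_counts(sequence):
    """Count how many times each character occurs in the sequence."""
    return Counter(sequence)


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


if __name__ == "__main__":
    # Toy input (an assumption for illustration), not the real genome file.
    example = ">toy_sequence hypothetical example\nATGC\nggcc\n"
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), dict(base_counts(seq)), gc_content(seq))
```

To try it on a real file, you could read the file's contents into a string and pass that to `parse_fasta`. If the file turns out to be in GenBank flat-file format instead, the parsing step would need to change.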
The various books mentioned in class are all listed in the "External Links" section within Blackboard; each is linked to Amazon where you can read more.
As discussed in class, the GENSCAN tool was first described in this paper, which you may enjoy reading (or at least skimming). To guide you, the most interesting parts (from the point of view of CPS 160) are the methods, which in biology papers are located at the very end, after the introduction, results, and discussion (why this is done is a matter for another time). So you may want to start with the right column of page 85.
As an aside, the senior author of the GENSCAN paper, Samuel Karlin, is the same fellow who helped develop the significance statistics for BLAST database searching.
More help on specific topics:
Let's try snarfing and running your first Perl program in Eclipse.
First ensure that you have the right perspective in Eclipse. The Perl perspective will give you a less cluttered set of windows with the smaller Navigator and Outline on the left and the main editor window on the right.
The Ambient plug-in allows you to browse for and download code online using a tool called "Snarf". For each problem set, we will provide you with some code as a framework and possibly some data files, and Snarf will allow you to import these files into your local copy of Eclipse. To snarf your first program, follow the directions below.
You can repeat step 2 every time you edit and save the program.
For each assignment, we will provide a codebase for you to work from, which you will always be able to import by "Snarfing".
You should also notice that PerlDoc is available from within Eclipse. Just select any Perl keyword in your program (for example, you can select "print") and then choose "Help > Perldoc" (or use the associated key shortcut). This will open a tab "PerlDoc" that contains the help page for the print function. Once the PerlDoc window is open, you can search for other keywords right in the PerlDoc window.
Now modify the program and then submit the code from within Eclipse.
You can submit as many times as you like, and everything will be stored on the server each time, without overwriting previous submissions. Thus, if you realize that you did something wrong at the last minute, you can simply resubmit. In general, we will only look at your last submission, so when you resubmit a project, please resubmit all the relevant files, not just the ones you modified.
All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. Violations of academic integrity will be taken very seriously. In particular, be reminded that it is not acceptable to take the ideas/work of another and pass it off as one's own, even if paraphrased. Ideas taken from others, whether peers in the class or not, must always be appropriately cited.
Unless expressly granted in the problem set, all problems should be completed individually; no collaboration is permitted. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, you should consult others to make progress; that is better than giving up entirely. Your first course of action is to speak to the instructor or TAs, or to post a question on Piazza. If for any reason you consult your peers outside of Piazza, the interaction must be one of consultation and not collaboration: hints rather than answers. After consultation, you should still have some thinking to do. In addition, if you consult with another student, both of you must cite this.
Students generally have two weeks to work on each problem set. This is not because two weeks are generally required, but (1) to allow students who start early sufficient time to reflect on problems where they have reached an impasse (solving a problem often involves some gestation period before things become clear), and (2) to give students flexibility in when they complete their work while they juggle requirements and commitments of many sorts during the semester.
Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, this rule has two exceptions:
This term we will be using Piazza for course announcements, communication, and discussion. Piazza is designed to get you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can benefit from the responses.
You can find our class page at: http://www.piazza.com/duke/fall2011/cps160.