CompSci 260: Introduction to Computational Genomics
Overview
A computational perspective on the exploration and analysis of genomic and genome-scale information. Provides an integrated introduction to genome biology, algorithm design and analysis, and probabilistic and statistical modeling. Topics include genome sequencing, genome sequence assembly, local and global sequence alignment, sequence database search, gene and motif finding, phylogenetic tree building, and gene expression analysis. Methods include dynamic programming, indexing and hashing, hidden Markov models, and elementary machine learning. Development of practical experience with handling, analyzing, and visualizing genomic data using the scripting language Perl.
The course will require students to program in Perl. Students coming into the course should know how to program in some computer language, but it need not be Perl. Students should also have had some exposure to basic probability, statistics, and molecular or cellular biology; however, the course has no formal course prerequisites, and significant background will be provided. Please speak to the instructor if you are unsure about your background. This course is a valid elective in both biology and computer science.
Instructor
Professor Alex Hartemink
Office hours
Alex Hartemink, Instructor
Email: amink at cs.duke.edu
Office Hours: Thursday 11:30AM–12:30PM (after class), in LSRC D239 [but not on 18 October]; and also by appointment.
Abrita Chakravarty, TA
Email: abrita at cs.duke.edu
Office Hours: Monday 11:00AM–12:00PM and Wednesday 12:30–1:30PM, in LSRC D301.
Kenneth Hoehn, UTA
Email: kenneth.hoehn at duke.edu
Office Hours: Monday 6:00–8:00PM, in the Link (map).
Nikhil Saxena, UTA
Email: nikhil.saxena at duke.edu
Office Hours: Tuesday 7:00–9:00PM, in the Link (map).
If these office hours do not work for you, please post questions via Piazza, or send any of us an email to schedule an alternate time.
Logistics
The class meets on Tuesdays and Thursdays 10:05–11:20AM in 116 Old Chemistry (on the Main West Quad, near Bostock Library).
Course schedule
Note: The course schedule may change subtly from time to time. Always check the web page for the most up-to-date schedule.
| Session | Date | Instructor | Topic | Assignment |
| --- | --- | --- | --- | --- |
| 1 | Tue 28 Aug | AH | Course introduction; SARS genome introduction | |
| 2 | Thu 30 Aug | AH | Molecular biology primer: DNA, RNA and protein | PS1 out |
| 3 | Tue 04 Sep | AH | Gene/genome organization; SARS genome revisited | |
| 4 | Thu 06 Sep | AH | Algorithms and their analysis | |
| 5 | Tue 11 Sep | AH | Algorithm design; Divide-and-conquer introduction | |
| 6 | Thu 13 Sep | AH | Divide-and-conquer, and its failure | PS1 due; PS2 out |
| 7 | Tue 18 Sep | AH | Memoization; Dynamic programming | |
| 8 | Thu 20 Sep | AH | Greedy algorithms; Sequence variation | |
| 9 | Tue 25 Sep | AH | The alignment problem; Aligning sequences globally | |
| 10 | Thu 27 Sep | AH | Traceback; Aligning sequences with affine gap scores | PS2 due; PS3 out |
| 11 | Tue 02 Oct | AH | Local alignment; Database similarity searching | |
| 12 | Thu 04 Oct | AH | FASTA and BLAST heuristics; DNA and genome sequencing | |
| 13 | Tue 09 Oct | AH | Genome assembly; Human Genome Project and Celera | |
| 14 | Thu 11 Oct | AH | Next-gen sequencing; Indexes and short-read alignment | PS3 due; PS4 out |
| | Tue 16 Oct | | FALL BREAK | |
| 15 | Thu 18 Oct | AH | Suffix trees; Tree of life and phylogenomics | |
| 16 | Tue 23 Oct | OF | Tour of Duke's sequencing core facility (meet at 119 BioSci) | |
| 17 | Thu 25 Oct | AH | Building phylogenetic trees (UPGMA and NJ) | PS4 due; PS5 out |
| 18 | Tue 30 Oct | AH | Unsupervised machine learning: clustering | |
| 19 | Thu 01 Nov | AH | Supervised machine learning: classification | |
| 20 | Tue 06 Nov | AH | Probability; Discrete and continuous random variables; Infinity | |
| 21 | Thu 08 Nov | AH | Joint, marginal, conditional; Bayes rule; Parameter estimation | PS5 due; PS6 out |
| 22 | Tue 13 Nov | AH | Markov and hidden Markov models (HMMs) | |
| | Thu 15 Nov | | CLASS CANCELLED: SICK | |
| 23 | Tue 20 Nov | AH | Estimating HMM parameters; Viterbi decoding | |
| | Thu 22 Nov | | THANKSGIVING BREAK | |
| 24 | Tue 27 Nov | AH | Viterbi and posterior decoding | |
| 25 | Thu 29 Nov | AH | Baum-Welch and optimization; HMMs for finding spliced genes | PS6 due; PS7 out |
| 26 | Tue 04 Dec | AH | PSSMs and profile HMMs | |
| 27 | Thu 06 Dec | AH | Course summary and evaluations | PS7 due |
AH: Alex Hartemink, OF: Olivier Federigo and the sequencing core facility staff
GENSCAN paper
Tree of life papers
Suffix trees
Genome sequencing technology papers
Papers reporting a newly sequenced genome
Papers debating the merits of shotgun sequencing the whole human genome
Graph search
An overview of the DFS and BFS algorithms
Severe acute respiratory syndrome (SARS)
Here is the SARS genome handout from class. Here is a text file containing the SARS genome (Tor2 isolate). Have fun parsing it! You can also find it, and a lot more information about it, in GenBank: visit the Genbank entry and see what else you can learn.
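If you would like a starting point for parsing, here is a minimal sketch that counts how often each base appears. It makes several assumptions that are not stated on this page: that the file is plain text, that any line beginning with ">" is a FASTA-style header to skip, and that every other line holds raw sequence letters. The subroutine name `count_bases` and the toy sequence are made up for illustration; the toy sequence is not the real genome.

```perl
use strict;
use warnings;

# Count the bases in a sequence read from a filehandle.
# Assumption: lines starting with '>' are FASTA-style headers and are skipped;
# every other line is treated as raw sequence letters.
sub count_bases {
    my ($fh) = @_;
    my %count;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^>/;      # skip a header line, if present
        $line =~ s/\s+//g;          # drop any whitespace, including stray \r
        $count{ uc $_ }++ for split //, $line;
    }
    return \%count;
}

# Placeholder data, NOT the real genome: a tiny made-up sequence held in memory.
my $example = ">toy sequence\nATGCGC\natta\n";
open my $fh, '<', \$example or die "Cannot open in-memory file: $!";
my $counts = count_bases($fh);
close $fh;

# Print the counts in alphabetical order of base.
for my $base (sort keys %$counts) {
    print "$base: $counts->{$base}\n";
}
```

To try this on the downloaded genome file instead, open the filehandle on the file's path rather than on the in-memory string; the rest of the sketch stays the same.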
For shoring up your biology background
Here are a few different kinds of resources for those with less biology background, ranging from the comprehensive to a basic overview:
Textbooks mentioned in class
The various books mentioned in class are summarized here; each is linked to Amazon where you can read more. Note that none of these books is compulsory for the class, though you may benefit from one or more. As for the books on Perl, many resources are now available free online, even complete textbooks downloadable as PDFs (you'll save trees, unless you print them); we have collected some of those for you here.
- Introduction to Computational Genomics: A Case Studies Approach
- This is a very recently published book. Hahn was a grad student at Duke in the lab of Greg Wray and took my class once. I love the case study approach: it makes for very interesting reading. At the end of the day, I decided not to require the book because it's not a perfect match for the course, but in terms of the course content, it comes the closest of any book to date, so it might be useful for those who learn much better from a book.
http://www.amazon.com/exec/obidos/ASIN/0521671914/
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids
- This book covers a lot of the material that will be covered in the course. It is very well written but is at a fairly advanced level and has a distinctly probabilistic focus. It is an excellent reference for folks continuing on in this area.
http://www.amazon.com/exec/obidos/ASIN/0521629713/
- An Introduction to Bioinformatics Algorithms
- This is a fairly new book by Jones and Pevzner that covers many of the topics that will be covered in class, but organizes the material around algorithmic themes rather than biological themes. So one chapter presents many algorithms across bioinformatics that exploit clustering algorithms, etc. While not complete, it seems a pretty nice reference and is probably a better choice than the book by Pevzner from a few years ago.
http://www.amazon.com/exec/obidos/ASIN/0262101068/
- Beginning Perl, Second Edition
- It seems (according to Amazon reviews) that this may be the best book for people starting out with Perl, all things considered. While Learning Perl (the llama, below) was the de facto standard for many years, this now seems to be viewed as a more useful choice. But I have not used it myself so I can't speak authoritatively on the matter.
http://www.amazon.com/exec/obidos/ASIN/159059391X/
- Learning Perl, Third Edition
- The classic guide to learning Perl (llama). Cited for its continual insights, humorous style, and unpretentious tone, this was for many years almost universally recommended as the place to begin when learning Perl for the first time. Now the field is more crowded and Beginning Perl, Second Edition (above) may be preferred. A fifth edition is also available, while some are in love with the second edition (though the Perl there is quite out of date).
http://www.amazon.com/exec/obidos/ASIN/0596001320/
- Programming Perl, Third Edition
- An authoritative and definitive guide for serious Perl programmers (camel). Written by the author of Perl, this book is not the place to start if you're unfamiliar with Perl (and especially not if you're unfamiliar with programming in any language) but once you understand something of Perl, and you want to go to the next level, this book is cited as indispensable.
http://www.amazon.com/exec/obidos/ASIN/0596000278/
- Beginning Perl for Bioinformatics
- This is a good book for a biologist who has not yet programmed before (tadpole): the examples are relevant to bioinformatics and are worked out in great detail; unfortunately, as a result, the actual information content is somewhat low given the length of the book. So if you already know quite a bit about programming, you might find this book a bit too simple.
http://www.amazon.com/exec/obidos/ASIN/0596000804/
- Mastering Perl for Bioinformatics
- This book (frog) builds on Beginning Perl for Bioinformatics (the tadpole, above) by introducing a number of advanced Perl topics for bioinformatics. The book seems fine, but is not well-targeted to this particular course in that it spends a lot of effort on informatics tasks that are not covered in this course, like connecting to public databases, running web services, and using the BioPerl modules. That said, if you are interested in these tasks for other reasons, or want a gentle introduction to BioPerl specifically, this is a good book to have.
http://www.amazon.com/exec/obidos/ASIN/0596003072/
Perl tutorial slides
Online Perl resources
More help on specific topics:
- Referencing/dereferencing cheat sheet: A cheat sheet from Abrita Chakravarty with examples for referencing and dereferencing arrays and hashes
- References: Short tutorial on the syntax of references in Perl
- Common Perl data structures: A completely excellent guide on how to use arrays of arrays, arrays of hashes, hashes of hashes, and hashes of arrays in Perl (the table of contents is particularly helpful for pointing you to just what you need if you only want a quick reference)
- Regular expressions:
- Quick refresher on Perl regular expressions
- Detailed tutorial on Perl regular expressions
- Debugging:
- Short pieces on common syntax errors, logic errors, and using the strict option to catch undeclared variables.
- Using the Perl debugger in EPIC
- A tutorial on using the Perl debugger from the command line.
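As a quick taste of the topics listed above, the following sketch shows referencing, dereferencing, a hash of arrays, and a regular expression match in a few lines. It is only an illustration and is not taken from any of the linked resources; the variable names and the short DNA string are made up, and the codon lists are ordinary standard-genetic-code entries used as sample data.

```perl
use strict;
use warnings;

# Referencing: a backslash takes a reference to an existing array.
my @bases     = ('A', 'C', 'G', 'T');
my $bases_ref = \@bases;

# Dereferencing: @{...} recovers the whole array, -> reaches one element.
my $n_bases = scalar @{$bases_ref};
my $first   = $bases_ref->[0];

# Hash of arrays: each key maps to an anonymous array reference.
my %codons_for = (
    Met => ['ATG'],
    Trp => ['TGG'],
    Lys => ['AAA'],
);
push @{ $codons_for{Lys} }, 'AAG';   # add a second codon for lysine

# Regular expression: collect every occurrence of ATG in a made-up string.
my $dna  = 'CCATGAAATGG';
my @hits = $dna =~ /ATG/g;

print "Number of bases: $n_bases (first is $first)\n";
print "Lys codons: @{ $codons_for{Lys} }\n";
print "ATG occurrences: ", scalar(@hits), "\n";
```

With `use strict` in place, misspelling a variable name such as `$bases_ref` is reported at compile time rather than silently creating a new variable.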
Directions for setting up Perl, Java, Eclipse, and Eclipse plugins
- ensure you have Perl installed (so you can run the programs you write)
- the latest version is 5.16.0, but anything 5.8.8 or higher should be fine
- Perl is already installed on Macs (5.8.8 on Leopard, 5.10.0 on Snow Leopard, 5.12.3 on Lion, 5.12.4 on Mountain Lion)
- Perl is almost surely already installed on any Unix machine; you can check the version by typing "perl -v" in a terminal
- Perl is not installed on Windows (unless you did it yourself), so get the ActivePerl version from here (to keep things simple in the class, please choose the latest 5.12 version, not the newer 5.14 or 5.16 versions)
- ensure you have Java installed (so you can run Eclipse)
- the latest version is 7, but anything 5 or higher should be fine
- Java is already installed on nearly all Macs (except perhaps the newest ones), almost surely on Unix machines, and quite possibly on Windows PCs
- if Java 5 or newer isn't installed, you can find the latest version here
- you only need a JRE (Java Runtime Environment) to run Eclipse, but if you want to write Java programs later, you can get the full JDK, which includes the JRE
- install Eclipse (an environment for writing and running your software programs)
- the latest version is 4.2 (nicknamed Juno), but we will be using version 3.7.2 (nicknamed Indigo)
- you can select the "Eclipse Classic 3.7.2" package for your operating system from here
- choose the 32- or 64-bit version to match your particular version of Java
- Eclipse isn't packaged with an installer; you simply unarchive it and move the resulting "eclipse" folder to C:\ (Windows) or to Applications (Mac)
- install the EPIC plugin within Eclipse (so you can create and run Perl programs within Eclipse)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type "http://e-p-i-c.sf.net/updates" and press Enter
- you may need to wait up to a minute until the "Pending..." is replaced by "EPIC Main Components"
- select EPIC Main Components (by checking the box next to it) and click "Next >" down at the bottom
- the next steps to finish the installation are straightforward; if you receive a warning about unsigned content, proceed anyway
- restart Eclipse for changes to take effect
- install the Ambient plugin within Eclipse (so you can snarf and submit files for class)
- open Eclipse and access the Help menu
- select "Install New Software..."
- in the "Work with:" box, type "http://www.cs.duke.edu/csed/ambient/update" and press Enter
- you may need to wait a number of seconds until the "Pending..." is replaced by "Ambient"
- select Ambient (by checking the box next to it) and click "Next >" down at the bottom
- the next steps to finish the installation are straightforward; if you receive a warning about unsigned content, proceed anyway
- restart Eclipse for changes to take effect
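Once everything is installed, one way to confirm that your Perl meets the 5.8.8 floor mentioned above is a short script like the sketch below. It relies on the built-in variables `$]` (the interpreter version as a number) and `$^V`; the subroutine name `version_ok` is made up for illustration.

```perl
use strict;
use warnings;

# $] holds the running interpreter's version as a number,
# for example 5.012004 for Perl 5.12.4.
my $minimum = 5.008008;   # 5.8.8, the floor suggested in the setup directions

# Return 1 if $version is at least $min, and 0 otherwise.
sub version_ok {
    my ($version, $min) = @_;
    return $version >= $min ? 1 : 0;
}

printf "Running Perl %vd\n", $^V;
print version_ok($], $minimum)
    ? "Version is new enough.\n"
    : "Version is older than 5.8.8.\n";
```

Running this from within Eclipse also doubles as a check that EPIC is finding the Perl interpreter you intended.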
Snarfing and running a sample Perl program
Let's try snarfing and running your first Perl program in Eclipse.
First ensure that you have the right perspective in Eclipse. The Perl perspective will give you a less cluttered set of windows with the smaller Navigator and Outline on the left and the main editor window on the right.
- Select "Window > Open Perspective > Other..."
- Select "Perl", then hit OK.
- You should see a "Perl" box highlighted in the upper right corner of your window. If, in the future, your screen setup looks odd, check that you are in the Perl perspective.
The Ambient plug-in allows you to browse for and download code online using a tool called "Snarf". For each problem set, we will provide you with some code as a framework and possibly some data files, and Snarf will allow you to import these files into your local copy of Eclipse. To snarf your first program, follow the directions below.
- Snarf in the Snarfing Sample project.
- Open Eclipse
- Select "Ambient > Download (Snarf) a Project..."
- This should open a new tab at the bottom called "Snarfer Site Browser". If it does not:
- Select "Window > Show View > Other..."
- Click "Ambient" then select "Snarfer Site Browser" and hit OK
- Right-click within the "Snarfer Site Browser" window, and select "New Site"
- In the window, type "http://www.cs.duke.edu/courses/fall12/compsci260/snarf/"
- Expand the project site "CompSci 260, Fall 2012" and click through the list by expanding the folders until you find "Snarfing Sample (1.0)", and double click on it
- Click the "Install Project" button, and in the window that pops up, check the "use default workspace location" box, and click "Finish"
- The "Import project" window will come up; leave the fields unchanged (in particular, leave "Use the downloaded .project file" selected) and click "Finish"
- You can then double click the "Snarfing Sample" project in the "Navigator" window, on the upper left of the main editor window and then double click "first.pl"
- Now we will try running the simple Perl program that you snarfed.
- Click on the "Run" icon on the toolbar (the green circle with the white triangle pointing right) to run the program; this should create a "Console" tab in the bottom right pane and the results of the program should be printed in it. If the console does not appear:
- Select "Window > Show View > Other..."
- Click "General" then select "Console" and hit OK
- Alternatively, select "Run" from the Run menu
- Alternatively, right click anywhere within the body of the program to see the context menu and then click on "Run As > Perl Local"
You can repeat the run step above every time you edit and save the program.
For each assignment, we will provide a codebase for you to work from, which you will always be able to import by "Snarfing".
You should also notice that PerlDoc is available from within Eclipse. Just select any Perl keyword in your program (for example, you can select "print") and then choose "Help > Perldoc" (or use the associated key shortcut). This will open a tab "PerlDoc" that contains the help page for the print function. Once the PerlDoc window is open, you can search for other keywords right in the PerlDoc window.
Editing a sample Perl program and submitting a project
Now modify the program and then submit the code from within Eclipse.
- Modify the file "first.pl" to print out both the minimum and the maximum value:
- Add another line: print "The maximum value is: $maxval.\n";
- Save the file
- Run "first.pl" again, to see if it prints out the maximum.
- Now test submitting your new program along with the other files in the project:
- Select "Ambient > Submit a Project for Grading..." to bring up the submit window.
- First you must choose the class and assignment you wish to submit to, so click on "compsci260", select the "test.of.submit" folder as your destination, and then click "Next".
- Select "Submit a single project", choose the project you wish to submit (in this case "Snarfing Sample"), and then hit "Next". Alternatively, if you do not want to submit the entire project, you can choose "Submit from the file system" and then you'll have an option to select and deselect the various files in the project (or elsewhere).
- Once you've got the project and/or files that you want selected, choose "Finish". You will be asked to enter your Duke NetID and password. Congratulations, you have submitted your project!
You can submit as many times as you like, and everything will be stored on the server each time, without overwriting previous submissions. Thus, if you realize that you did something wrong at the last minute, you can simply resubmit. In general, we will only look at your last submission, so when you resubmit a project, please resubmit all the relevant files, not just the ones you modified.
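For reference, the editing step above might leave you with something like the following. This is a hypothetical stand-in, not the actual "first.pl": the list of values and the names `@values` and `$minval` are assumptions, and only the final print line comes from the directions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for first.pl; the real file's contents may differ.
my @values = (7, 3, 11, 5);

# Track the smallest and largest values seen so far.
my ($minval, $maxval) = ($values[0], $values[0]);
for my $value (@values) {
    $minval = $value if $value < $minval;
    $maxval = $value if $value > $maxval;
}

print "The minimum value is: $minval.\n";
print "The maximum value is: $maxval.\n";   # the line added in the editing step
```

Note that Perl interpolates `$maxval` inside the double-quoted string, and the period after it is printed literally.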
Academic integrity
All students are expected to abide by generally accepted standards of academic integrity. This includes all the various aspects of Duke's Community Standard. Violations of academic integrity will be taken very seriously. In particular, be reminded that it is not acceptable to take the ideas/work of another and pass it off as one's own, even if paraphrased. Ideas taken from others, whether peers in the class or not, must always be appropriately cited.
Collaboration policy
Unless collaboration is expressly permitted in the problem set, all problems should be completed individually; no collaboration is permitted. However, if you have worked for a while on a particular problem and have encountered a mental wall, and if you have banged your head against said wall for a while, you should consult others to make progress; that is better than giving up entirely. Your first course of action is to speak to the instructor or TAs, or to post a question on Piazza. If for any reason you consult your peers outside of Piazza, it should remain understood that such an interaction must be one of consultation and not collaboration: hints rather than answers. After consultation, it is expected that you should still have some thinking to do. In addition, if you happen to consult with another student, both of you must cite this.
Extension policy
Students generally have two weeks to work on problem sets—not because two weeks are generally required, but 1) to allow students who start early sufficient time to reflect/ruminate on problems where an impasse has been reached (the thought process through which students go while solving a problem will often include some gestation period before things become clear) and 2) to provide flexibility as to when students complete their work while they juggle requirements and commitments of many sorts during the semester.
Given this latter point, students should not request extensions for turning in their work beyond the two weeks already allotted. However, this rule has two exceptions:
- If you are ill for a non-trivial length of time, you may choose to submit a short-term illness notification to the deans; I am notified during this process, at which point we can work out an extension if necessary.
- Everyone invariably has some two-week interval that is especially tough, so students are allowed, once during the semester, to use an extra 48 hours to turn in their work. If you are exercising this option for a specific problem set, just indicate such when you turn in your work. It is entirely up to you when you want to use this one free extension; when you do, you are trusted to not consult the solutions if they happen to be posted before you turn in your work.
Grading policy
- Problem sets: 84%
- All students will be expected to complete seven problem sets over the semester, each contributing about equally to this component of the grade.
- Participation: 16%
- Students are expected to attend class regularly and participate in discussions. They are also expected to be engaged via Piazza, posting questions or notes, as well as helping each other as questions arise, or raising interesting points for further conversation. Students should feel comfortable asking questions at any point in class—whether the material is unclear, or simply if it leads you to wonder about a new connection. The instructor encourages an interactive classroom so if something is troubling or exciting you, do not hesitate to speak up about it.
Grades
Grades for all work will be recorded and available to students via the course Sakai site. Posting grades will be our only use of Sakai.
Piazza
This term we will be using Piazza for course announcements, communication, and discussion. The Piazza system is designed to get you help quickly and efficiently from classmates, the TAs, and the instructor. Rather than emailing questions to the teaching staff, please post your questions on Piazza so everyone can benefit from the responses.
You can find our class page at: https://piazza.com/class#fall2012/compsci260.