Compsci 100, Fall 2009, DNA Splicing

This assignment was developed in 2002, used for two years, then resurrected in the spring of 2008 and used since then. It has turned into a very different assignment over that time. The changes put more emphasis on discussing tradeoffs than the original assignment did, and were motivated in part by opportunities (and differences) provided by Java compared to C++. The assignment is meant to illustrate pragmatically the benefits of using a linked list. This version became a nifty assignment in 2009.

Background on DNA, Restriction Enzymes, and PCR

Restriction Enzymes

(source: http://www.astbury.leeds.ac.uk/gallery/leedspix.html)

This background is interesting, but not really needed to do the assignment. There are some good stories here, but if you want to get to the assignment, you can skip this stuff.

In this assignment you'll experiment with different implementations of a simulated restriction enzyme cutting (or cleaving) a DNA molecule. Three scientists shared the Nobel Prize in 1978 for the discovery of restriction enzymes. They're also an essential part of the process called PCR (polymerase chain reaction), which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993.

Kary Mullis, the inventor of PCR, is an interesting character. To learn more about him, see this archived copy of a 1992 interview in Omni Magazine, this 1994 interview as part of virus myth, and his personal website, which includes information about his autobiography Dancing Naked in the Mind Field. You can read his free Nobel autobiography as well.

You can see animations and explanations of both restriction enzymes and PCR at DnaTube and Cold Spring Harbor Dolan DNA Learning Center.

The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure. The linked list code you'll write and reason about is an example of a chunk list. You can do more work with chunk lists for extra credit.

What You do for This Assignment

Restriction enzymes cut a strand of DNA at a specific location, the binding site, typically separating the DNA strand into two pieces. In the real chemical process a strand can be split into several pieces at multiple binding sites; we'll simulate this by repeatedly dividing a strand.

Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates each occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, leaving either blunt or sticky ends as described below. In the simulation there's no difference between a blunt and sticky end, and we'll use a single strand of DNA in the simulation rather than the double-helix/double-strand that's found in the physical/real process.

Restriction enzymes have two properties or features: the pattern of DNA that marks a site at which separation occurs and a number/index that indicates how many characters/nucleotides of the pattern attach to the left-part of the split strand. For example, the adjacent diagram shows a strand split by EcoRI. The pattern for EcoRI is "gaattc" and the index of the split is one indicating that the first nucleotide/character of the restriction enzyme adheres to the left part of the split.
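The two properties can be modeled directly in code. The following is only a sketch: the class name EnzymeCut and the method cutOnce are hypothetical and are not part of the code supplied with the assignment. It splits a strand at the first occurrence of the pattern, leaving the first index characters of the pattern on the left piece.

```java
// Hypothetical helper for illustration; not part of the supplied assignment code.
final class EnzymeCut {

    /**
     * Splits dna at the first occurrence of pattern. The first `index`
     * characters of the pattern stay attached to the left piece and the
     * remainder of the pattern begins the right piece. If the pattern does
     * not occur, a one-element array holding the whole strand is returned.
     */
    static String[] cutOnce(String dna, String pattern, int index) {
        int site = dna.indexOf(pattern);
        if (site < 0) {
            return new String[] { dna };
        }
        int split = site + index;
        return new String[] { dna.substring(0, split), dna.substring(split) };
    }

    public static void main(String[] args) {
        // EcoRI: pattern "gaattc", index 1
        String[] pieces = cutOnce("aatccgaattcgtatc", "gaattc", 1);
        System.out.println(pieces[0]); // aatccg
        System.out.println(pieces[1]); // aattcgtatc
    }
}
```

With index 1 the leading g of the pattern adheres to the left piece, and the remaining aattc starts the right piece.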

In some experiments, and in the simulation you'll run, another strand of DNA will be spliced into the separated strand. The strand spliced in matches the separated strand at each end as shown in the diagram below where the spliced-in strand matches with G on the left and AATTC on the right as you view the strands.

When the spliced-in strand joins the split strand we see a new, recombinant strand of DNA as shown below. The shaded areas indicate where the original strand was cleaved/cut by the restriction enzyme.

Your code will be a software simulation of this recombinant process: the restriction enzyme will cut a strand of DNA and new DNA will be spliced in to create a recombinant strand of DNA. In the simulation the code simply replaces every occurrence of the restriction enzyme with new genetic material/DNA; your code models the process with what is essentially string replacement.

Simulation/Alternate Implementations

This code is from the class SimpleStrand you're given. The String representing DNA, which is instance variable myInfo, is split at every occurrence of the string parameter enzyme. The spaces are added before the split as shown below to ensure that characters representing the enzyme that are found at the beginning or ending of the DNA are found as part of calling .split(enzyme).
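A minimal sketch of the split-based approach described here follows. It is an assumption-laden illustration rather than the SimpleStrand code itself: the class name SplitSplice and the static method signature are hypothetical, and the strand is passed as a parameter instead of being read from myInfo.

```java
// Hypothetical illustration of the split-and-rejoin idea; not the supplied SimpleStrand.
final class SplitSplice {

    /**
     * Replaces every occurrence of enzyme in dna with splicee by padding the
     * strand with a space on each side, splitting on the enzyme, and joining
     * the pieces with the splicee. Note that split treats enzyme as a regular
     * expression, which is harmless for plain nucleotide letters.
     */
    static String cutAndSplice(String dna, String enzyme, String splicee) {
        // The padding keeps an occurrence at either end from being dropped.
        String[] pieces = (" " + dna + " ").split(enzyme);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            if (i > 0) {
                result.append(splicee); // copies all characters of splicee
            }
            result.append(pieces[i]);
        }
        // Strip the two padding spaces added above.
        return result.substring(1, result.length() - 1);
    }

    public static void main(String[] args) {
        System.out.println(cutAndSplice("aatccgaattcgtatc", "gaattc", "GTGATAATTC"));
        // aatccGTGATAATTCgtatc
    }
}
```

Each call to append copies every character of the splicee into the builder, which is why the cost of one splice grows with the size of the spliced-in strand.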

As the spliced-in strand splicee grows in size the code above will take longer to execute even with the same original strand of DNA and the same restriction enzyme. Creating the recombinant strand using the code above is an O(N) operation where N is the size of the resulting, recombinant strand (you have to justify this in your README). In making the O(N) claim we're ignoring the time to find all the breaks, which is O(T) for a strand with T characters/nucleotides.
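One way to observe this growth is to hold the original strand and enzyme fixed and time the splice for larger and larger splicees. The harness below is a rough sketch under stated assumptions: SpliceBenchmark, repeat, and timeSplice are hypothetical names, String.replace stands in for the String-based splice, and single-run timings like these are noisy because of JIT warm-up and garbage collection.

```java
// Hypothetical timing harness; String.replace is a stand-in for the String-based splice.
final class SpliceBenchmark {

    /** Builds a string consisting of unit repeated the given number of times. */
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder(unit.length() * times);
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    /** Returns { length of the recombinant strand, elapsed nanoseconds }. */
    static long[] timeSplice(String dna, String enzyme, String splicee) {
        long begin = System.nanoTime();
        String recombinant = dna.replace(enzyme, splicee);
        long elapsed = System.nanoTime() - begin;
        return new long[] { recombinant.length(), elapsed };
    }

    public static void main(String[] args) {
        // T = 16,000 characters with B = 1,000 occurrences of the enzyme.
        String dna = repeat("aatccgaattcgtatc", 1000);
        for (int s = 256; s <= 16384; s *= 4) {
            long[] r = timeSplice(dna, "gaattc", repeat("a", s));
            System.out.printf("S=%d N=%d time=%.3f ms%n", s, r[0], r[1] / 1e6);
        }
    }
}
```

For a strand of T characters with B breaks, an enzyme of length E, and a splicee of length S, the recombinant strand has N = T - B*E + B*S characters, so N (and the time) should grow roughly linearly in S once S dominates.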

As part of this assignment you must develop an alternate implementation of DNA. Instead of using a simple String to represent the DNA/enzymes, you'll use a linked-list implementation that makes the complexity of the splicing independent of the size of the spliced-in strand. Each splice operation, simulated by the call to append above for SimpleStrand, should be O(1) rather than O(S) for a splicee strand with S characters/base-pairs. In your new implementation, the complexity of creating the recombinant strand will be O(B) where B is the number of breaks/splits created by the restriction enzyme. For a recombinant strand of size N where N >> B (>> means much bigger than) this is significantly more efficient both in time and (especially) memory. In making the O(B) claim we're ignoring the time to find all the breaks, which is O(T) for a strand with T characters/nucleotides.

Implementation Specifics

You'll be developing/coding a class LinkStrand that implements a Java interface IDnaStrand. The class simulates cutting a strand of DNA by a restriction enzyme and appending/splicing-in a new strand. The code supplied with this project gives specifics as to the interfaces, and the howto for this assignment supplies more details.

You must use a linked list to support the operations. Specifically, the class LinkStrand should maintain pointers to a linked list used to represent a strand. You should keep and maintain a pointer to the first node of the linked list and to the last node of the linked list. These pointers are maintained as class invariants: the property of pointing to first/last nodes must hold after any method in the class executes (and thus before any method in the class executes). A strand of DNA is initially represented by a linked list with one node; the node stores one string representing the entire strand of DNA. Thus initially the instance variables myFirst and myLast will point to the same node.

Every linked list representing DNA maintains pointers to the first and last nodes of the linked list. Initially, before any cuts/splices have been made, both myFirst and myLast point to the same node since there is only one node even if it contains thousands of characters representing DNA. The diagram below shows a list with at least two nodes in it. If any nodes are appended, the value of myLast must be updated to ensure that it correctly points to the last node of the new list.
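The invariant can be sketched as follows. This is not the required LinkStrand and it does not implement IDnaStrand: the class name LinkStrandSketch, the nested Node class, the mySize field, and the nodeCount method are assumptions made for illustration. Only the names myFirst and myLast come from the description above.

```java
// Hypothetical sketch of the first/last invariant; not the required LinkStrand class.
final class LinkStrandSketch {

    private static final class Node {
        final String info; // one chunk of DNA, possibly thousands of characters
        Node next;

        Node(String info) {
            this.info = info;
        }
    }

    private Node myFirst; // invariant: always the first node
    private Node myLast;  // invariant: always the last node
    private long mySize;  // assumption: total characters, tracked so size() is O(1)

    LinkStrandSketch(String dna) {
        myFirst = new Node(dna);
        myLast = myFirst; // one node, so both pointers refer to it
        mySize = dna.length();
    }

    /** Appends one chunk in O(1); no characters are copied. */
    void append(String dna) {
        Node node = new Node(dna);
        myLast.next = node;
        myLast = node; // restore the invariant before returning
        mySize += dna.length();
    }

    long size() {
        return mySize;
    }

    int nodeCount() {
        int count = 0;
        for (Node n = myFirst; n != null; n = n.next) {
            count++;
        }
        return count;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node n = myFirst; n != null; n = n.next) {
            sb.append(n.info);
        }
        return sb.toString();
    }
}
```

Because append only links a new node after myLast and then moves myLast, its cost does not depend on the length of the appended string.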

The diagram below shows the results of cutting an original strand of DNA at three points and then splicing-in the strand "GTGATAATTC" at each of the locations at which the original strand was cut. Since splicing into a linked list is a constant-time, O(1) operation, this implementation should be more efficient in time and space when compared to the String implementation.
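To make the O(B) claim concrete, here is one possible way to build the recombinant list. It is a sketch, not the required solution: LinkedSpliceSketch and its methods are hypothetical names, the breaks are located with indexOf on a single String, and every spliced-in node refers to the same splicee String object instead of copying it.

```java
// Hypothetical sketch of a linked-list cut-and-splice; not the required LinkStrand.
final class LinkedSpliceSketch {

    private static final class Node {
        final String info;
        Node next;

        Node(String info) {
            this.info = info;
        }
    }

    private Node myFirst;
    private Node myLast;

    /** O(1) append that keeps myFirst/myLast correct, even for an empty list. */
    private void append(String chunk) {
        Node node = new Node(chunk);
        if (myFirst == null) {
            myFirst = node;
        } else {
            myLast.next = node;
        }
        myLast = node;
    }

    /** Builds a list with one shared-splicee node per break, plus the pieces between breaks. */
    static LinkedSpliceSketch cutAndSplice(String dna, String enzyme, String splicee) {
        if (enzyme.isEmpty()) {
            throw new IllegalArgumentException("enzyme must be non-empty");
        }
        LinkedSpliceSketch result = new LinkedSpliceSketch();
        int start = 0;
        int site = dna.indexOf(enzyme, start);
        while (site >= 0) {
            if (site > start) {
                // Copying the pieces is part of the O(T) search cost.
                result.append(dna.substring(start, site));
            }
            result.append(splicee); // stores a reference only: O(1) per break
            start = site + enzyme.length();
            site = dna.indexOf(enzyme, start);
        }
        if (start < dna.length() || result.myFirst == null) {
            result.append(dna.substring(start));
        }
        return result;
    }

    int nodeCount() {
        int count = 0;
        for (Node n = myFirst; n != null; n = n.next) {
            count++;
        }
        return count;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        for (Node n = myFirst; n != null; n = n.next) {
            sb.append(n.info);
        }
        return sb.toString();
    }
}
```

The number of nodes is at most 2B + 1, so the work beyond the search for breaks is proportional to B no matter how long the splicee is. Converting the list back to a single String with toString is still O(N).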

Grading

Description	Points
Benchmark code and explanation for O(N)	10 points
`LinkStrand` that works as described: both correctly and efficiently	20 points
Benchmark code and explanation for O(B) in creating recombinant strands	10 points
