this is a pair/group assignment
This current assignment was developed in 2002, used for two years, then resurrected in the Spring of 2008 and used since then. It's turned into a very different assignment over that time. These differences leverage discussion of tradeoffs more than the original assignment, and were motivated in part by opportunities (and differences) provided by Java compared to C++. The assignment is meant to illustrate pragmatically the benefits of using a linked list. This version became a nifty assignment in 2009.
This background is interesting, but not really needed to do the assignment. There are some good stories here, but if you want to get to the assignment, you can skip this stuff.
In this assignment you'll experiment with different implementations of a simulated restriction enzyme cutting (or cleaving) a DNA molecule. Three scientists shared the Nobel Prize in 1978 for the discover of restriction enzymes. They're also an essential part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993.
Kary Mullis, the inventor of PCR, is an interesting character. To see more about him see this archived copy of a 1992 interview in Omni Magazine, this 1994 interview as part of virus myth, his personal website which includes information about his autobiography Dancing Naked in the Mind Field, though you can read this free Nobel autobiography as well.
You can see animations and explanations of both restriction enzymes and PCR at DnaTube and Cold Spring Harbor Dolan DNA Learning Center.
The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure. The linked list code you'll write and reason about is an example of a chunk list. You can do more work with chunk lists for extra credit.
In particular you'll need to do three things.
SimpleStrand.cutAndSplice
that uses
String.split
to find all occurrences of an enzyme and
replaces the enzyme with another strand of DNA. The benchmarking is
done by the class
DNABenchMark
and is explained in more detail
below. You'll run virtual experiments in code to show two things:
SimpleStrand.cutAndSplice
is O(N)
where N is the size of the recombined strand
returned. In your README describe the results of the process and
reasoning you use to demonstrate the O(N). You
should have empirical results and timings that demonstrate the O(N) behavior. Graphing the data can be useful in
showing behavior.
DNABenchMark
,
should only use strings whose lengths are a power of
two. Report
this result in your README by providing the power-of-two string
you can use without running out of memory with the input file
ecoli.dat which has 646 cut points in an original strand of
4,639,221 base-pairs with the restriction enzyme EcoRI: "gaattc".
Describe how you determined this result and how long your program takes on your machine to create the largest recombinant strand constructed using a splicee string whose length is a power of 2 before memory is an issue and your program crashes. If you double the size of the heap available to the Java runtime (-Xmx1024M) does your machine support the next power-of-two strand (i.e., the one that couldn't run with -Xmx512M)? If so, how long does it take with ecoli.dat. You should also run with -Xmx2048M if your machine has enough memory to support this.
Be sure to report on running with consecutive powers of two for both -Xmx arguments and size-of-string. If your machine doesn't support -Xmx512M start with 256M instead.
For example, here's the output generated by running
DNABenchMark
on the data file ecoli.dat
dna length = 4639221 SimpleStrand: splicee 256 time 0.314 before 4,639,221 after 4,639,221 recomb 4,800,471 SimpleStrand: splicee 512 time 0.253 before 4,639,221 after 4,639,221 recomb 4,965,591 SimpleStrand: splicee 1,024 time 0.235 before 4,639,221 after 4,639,221 recomb 5,295,831 SimpleStrand: splicee 2,048 time 0.253 before 4,639,221 after 4,639,221 recomb 5,956,311 SimpleStrand: splicee 4,096 time 0.254 before 4,639,221 after 4,639,221 recomb 7,277,271 SimpleStrand: splicee 8,192 time 0.287 before 4,639,221 after 4,639,221 recomb 9,919,191 SimpleStrand: splicee 16,384 time 0.375 before 4,639,221 after 4,639,221 recomb 15,203,031 SimpleStrand: splicee 32,768 time 0.498 before 4,639,221 after 4,639,221 recomb 25,770,711 SimpleStrand: splicee 65,536 time 0.784 before 4,639,221 after 4,639,221 recomb 46,906,071 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) at java.lang.StringBuilder.append(StringBuilder.java:119) at SimpleStrand.cutAndSplice(SimpleStrand.java:58) at DNABenchMark.strandSpliceBenchmark(DNABenchMark.java:77) at DNABenchMark.main(DNABenchMark.java:109)
LinkStrand
that implements
the
IDnaStrand
interface so that your implementation
passes the JUnit tests in
TestStrand
.
See the DNA howto for help with JUnit. Passing
these tests doesn't guarantee full credit since the tests are
about correctness, not about efficiency. Your code should avoid
recalculating values that can be determined by other means. In
particular the length of a LinkStrand
strand should
be calculated in O(1) time once the
strand is created. Implementing LinkStrand
is the
bulk of the coding work for this assignment. You'll need to
implement every method and use the JUnit tests to help determine
if your methods are correctly implemented.
LinkStrand
in the benchmarking code, change the
string
"SimpleStrand" to "LinkStrand" in the
main
method of the class
DNABenchMark
class.
You'll need to make many runs to show this O(B) behavior, not just one. To start, note that there are 646 breaks for ecoli.dat when using EcoRI "gaattc" as the restriction enzyme. You can vary the number of breaks by constructing your own genomic data, or by re-using the string read from the data file ecoli.dat
Describe in your README how you show this and make sure your submitted code supports your experiments and conclusion.
Restriction enzymes cut a strand of DNA at a specific location, the binding site, typically separating the DNA strand into two pieces. In the real chemical process a strand can be split into several pieces at multiple binding sites, we'll simulate this by repeatedly dividing a strand.
Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates each occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, leaving either blunt or sticky ends as described below. In the simulation there's no difference between a blunt and sticky end, and we'll use a single strand of DNA in the simulation rather than the double-helix/double-strand that's found in the physical/real process.
In some experiments, and in the simulation you'll run, another strand of DNA will be spliced into the separated strand. The strand spliced in matches the separated strand at each end as shown in the diagram below where the spliced-in strand matches with G on the left and AATTC on the right as you view the strands.
When the spliced-in strand joins the split strand we see a new, recombinant strand of DNA as shown below. The shaded areas indicate where the original strand was cleaved/cut by the restriction enzyme.
Your code will be a software simulation of this recombinant process: the restriction enzyme will cut a strand of DNA and new DNA will be spliced-in to to create a recombinant strand of DNA. In the simulation the code simply replaces every occurrence of the restriction enzyme with new genetic material/DNA --- your code models the process with what is essentially string replacement.
splicee
to create a recombinant strand. The
strand spliced-in replaces the enzyme. In the simulation
the enzyme is removed each time it
occurs/finds a binding site. The characters representing
the enzyme are replaced by splicee
. (This simulates the
process of splitting the original DNA by the restriction enzyme,
leaving sticky/blunt ends, and binding the new DNA to the split
site. However, in the simulation the restriction enzyme is removed.)
This code is from the class
SimpleStrand
you're given. The
String representing DNA, which is instance variable
myInfo
, is split at every
occurrence of the string parameter enzyme
. The spaces are
added before the split as shown below to ensure that characters
representing the enzyme that are found at the beginning or ending
of the DNA are found as part of calling
.split(enzyme)
.
As the spliced-in strand splicee
grows in size the code
above will take longer to execute even with the same original strand of
DNA and the same restriction enzyme. Creating the recombinant strand
using the code above is an O(N) operation
where N is the size of the resulting,
recombinant strand (you have to justify this in your README). In
making the O(N) claim we're ignoring the time
to find all the breaks, which is O(T) for a
span
with T characters/nucleotides.
As part of this assignment you must develop an alternate implementation
of DNA. Instead of using a simple String to represent
the DNA/enzymes, you'll use a linked-list implementation
that makes the complexity of the
splicing independent of the size of the spliced-in strand. Each splice
operation, simulated by the call to append
above
for SimpleStrand
, should be
O(1) rather than O(S) for a splicee strand with S
characters/base-pairs. In your new implementation, the complexity of
creating the recombinant strand will be O(B) where B is the number
of breaks/splits created by the restriction enzyme. For a recombinant
strand of size N where N
>> B (>> means much bigger than) this is
significantly more efficient both in time and (especially) memory.
In
making the O(B) claim we're ignoring the time
to find all the breaks, which is O(T) for a
span
with T characters/nucleotides.
You'll be developing/coding a class LinkStrand
that implements
a Java interface IDnaStrand
. The class
simulates cutting a strand of DNA
by a restriction enzyme and appending/splicing-in a new strand. The code
supplied with this project gives specifics as to the interfaces and the
howto for this assignment also supplies more
details.
You must use a linked-list to support the operations -- specifically the
class LinkStrand
should maintain pointers to a linked list
used to represent a strand. You should keep and maintain a pointer to
the first Node of the linked list and to the last node of the linked
list. These pointers are maintained as class invariants -- the
property of pointing to first/last nodes must hold after any method in
the class executes (and thus before any method in the class executes).
A Strand of DNA is initially representing by a linked list with one
Node, the Node stores one string representing the entire strand of DNA.
Thus initially the instance variables myFirst
and
myLast
will point to the same node.
Every linked list representing DNA maintains pointers to the first and
last nodes of the linked list. Initially, before any cuts/splices
have been made,
both myFirst
and myLast
point to the same node since there is only
one node even if it contains thousands of characters
representing DNA. The diagram below shows a list with at
least two nodes in it. If any nodes are appended, the value of
myLast
must be updated to ensure that it correctly points
to the last node of the new list.
The diagram below shows the results of cutting an original strand of DNA at three points and then splicing-in the strand "GTGATAATTC" at each of the locations at which the original strand was cut. Since splicing into a linked list is a constant-time, O(1) operation this implementation should be more efficient in time and space when compared to the String implementation.
initializeFrom
using the
same code (either copied or re-use the common code in one method).
When creating or initializing a new LinkStrand
only one
node is created, the entire string representing the DNA is in
one node. When initializing, be sure to initialize the length of the
simulated strand as well.
append
should not concatenate strings,
but can create new nodes. If you're appending a
LinkStrand
you don't need to create any nodes, you
can simply relink the appended to nodes to the nodes in the list
being appended to (think about that). If you're appending
another kind of strand, e.g., SimpleStrand
, convert
the strand to a string and append it -- see the implementation
in SimpleStrand
for details.
cutAndSplice
assume there's
only one node, though it might contain a huge string of DNA. If
there's more than one node you can throw an exception, e.g.,:
cutAndSplice
should not call
String.split
. Instead you should repeatedly call
String.indexOf
using the two-parameter version that
specifies the index at which the search will start. Each time
the code finds an occurrence of the enzyme, you splice in
new nodes
to the linked
list/Strand that's returned in the method you're writing. One
model is partially shown below.
LinkStrand.cutWith
a call to cutWith
has a side-effect
of altering the strand being cut (the strand changes to represent the
portion before the enzyme cleave/cut. The method also returns a
newly-created strand representing the portion after the cut.
If no cut occurs because the restriction enzyme finds no binding site
then the strand being cut is unaltered and an empty or zero-length DNA
strand is returned. As part of the code provided in this project we
provide a class SimpleStrand
that implements the
IDnaStrand
interface on which you can model your
linked-list implementation.
description | points |
---|---|
Benchmark code and explanation for O(N) | 10 points |
LinkStrand that works as described: both correctly and
efficiently.
| 20 points |
Benchmark code and explanation for O(B) in creating recombinant strands. | 10 points |
Submit your project using the assignment name linked-dna.