Due: Tuesday, Oct. 1 by 11pm
LATE DAY EXTENDED: 10% off by Monday, Oct 7, 11pm
30 points
Since Test 1 is on Oct. 3, if you do not complete this assignment by Oct. 1, the 2-day lateness is extended, your program will be 10% late up until Monday, Oct. 7, 11pm.
For this problem you'll write a program that simulates the action of a restriction enzyme cutting (or cleaving) a DNA molecule. This is one part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993. The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure.
You'll be implementing a class using linked-lists to replace an existing class that uses strings. The increase in efficiency will illustrate the effectiveness of certain operations on linked lists.
For information the controversial scientist Kary Mullis (born in North Carolina) see the links below.
For an explanation of the chemical process of cleaving with restriction enzymes see these links.
Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates the first occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, removing the matched pattern from the DNA strand. (In the real process the matched pattern isn't removed).
For example
aatccgaattcgtatc DNA strand gaattc restriction enzyme aatcc gtatc divided strand
Note that the first occurrence of the pattern specified by the restriction enzyme is located in the DNA strand. The DNA strand is split at this location into two strands.
In the simulation/code you'll write, a DNA strand will be replaced by the portion of the strand that comes before the location where the restriction enzyme is found. A new Strand will be generated for the portion of the strand that comes after the location at which the enzyme is found. Although in a chemical experiment a strand might be literally cut in two, as part of this simulation cutting a strand will result in generating one new strand, and replacing the existing strand with another. This simulates the real experiment since the number of Strand objects will increase by one as a result of cutting with a restriction enzyme (except when no cut can be made). For the class Strand the function that does this is is Strand::cutWith (note: the class Strand is an abstract base class, so this function is implemented in subclasses like StringStrand and LinkStrand, not in the parent class.)
Strand * Strand::cutWith(Restrictor& enzyme) // post: if enzyme occurs in this strand, then replace // this strand with portion occurring before enzyme // and return new Strand representing portion occurring // after the enzymeFor example, the code below will compile and generate the output that follows.
Restrictor ecori("gaattc"); Strand * s = new StringStrand("aatccgaattcgtatc"); Strand * after = s->cutWith(ecori); cout << "s = " << *s << endl; cout << "new strand = " << *after << endl;Output of this code is
s = aatcc new strand = gtatcNote that after the cut, the strand s becomes the portion of the strand that comes before the location of the restriction enzyme pattern. A new strand is returned for the portion of the original strand that comes after the cut. The output (hopefully) shows this.
If a strand is cut with an enzyme that doesn't occur in the strand, the original strand is unchanged and the new strand returned is empty (a strand whose size is zero).
int cutCount = 0; Strand * S = ..... // initailize S as a strand Restrictor enzyme("..."); // make enzyme a Restrictor object do { Strand * after = S->cutWith(enzyme); cout << cutCount << "\t" << *S << endl; // part before cut S = after; // then keep cutting } while (S->size() != 0) {This code is the basis for the code in the clss StrandBench that simulates a lab/bench experiment.
cd cd cps100 mkdir dna cd dna cp ~rodger/cps100/dna/* .
You should link three data files to your directory, use links instead of copying because these files are large. When testing your code use a data file you create, not the data files here. When you're confident your code works, you can run on these files. Only the file ecoli.dat should take a really long time for StringStrand.
ln -s ~rodger/data/ecoli.dat ln -s ~rodger/data/ecolimed.dat
You can see that the StringStrand::cutWith function passes some tests by running testcut which should print nothing if the implementation being tested passes.
make testcut testcut no output here if tests pass
You're to implement a class LinkStrand that stores a strand as a linked list, with one character 'a', 'g', 't', or 'c', per node of the list. You should implement the same functions specified in stringstrand.h, but for linked lists rather than strings.
You must use a header node for the linked list representation. Your class must maintain three data fields:
LinkStrand::LinkStrand() : myFirst(new Node('g',0)), // char value doesn't matter, header myLast(myFirst), myCount(0) // post: this strand is empty, size() == 0 { }This creates a dummy/header node, keeps two pointers to it, and initializes a count to zero. For example, adding a new value to the end of the list maintained would require:
In implementing LinkStrand::cutWith you'll need to find the location in the linked list/strand at which the enzyme occurs. This will most likely be an O(MN) operation where the strand has N characters and the enzyme has M characters. To see this, consider the pseudocode below for StringStrand::cutWith (not the one used in stringstrand.cpp). Note that the LinkedStrand::cutWith method will have the same complexity as StringStrand::cutWith for finding the place at which a cut is made, but the complexity of generating the new LinkStrand object should be O(1) rather than O(N) for the StringStrand method which must generate a new string to construct the StringStrand.
The key insight into creating the new LinkStrand object is to use Nodes already in the Strand being cut. This can be done by creating an empty LinkStrand object and then manipulating the private data fields of this new object within the LinkStrand::cutWith function.
It's essential that once found the complexity of creating the new Strand is O(1) and not O(N).
This is loop-version/pseudocode of what's actually in stringstand.cpp, but that code uses string::find which hides the loop Strand * StringStrand::cutWith(const Restrictor& enzyme) { int len = myString.length(); string estr = enzyme.toString(); int elen = estr.length(); // start search at each char in this strand // at which a match could be found for(int k=0; k <= lastGoodIndex; k++) { // O(N) // look for occurrence of enzyme pattern // by comparing elen chars starting at index k // in this strand with the chars in the enzyme // this could be done in a loop without calling // string::find for(...) { // O(M) } if (enzyme match found at index k) { Strand * retval = new StringStrand(myString.substr(k+elen, len)); myString = myString.substr(0, k); return retval; } } return empty; }
You'll probably want to devise your own tests to verify that the read/print/size functions work correctly. There's an entry in the Makefile for a target anything that you can replace with another test program if you want to.
If the tests in testcut.cpp pass, you should change dnasimulate.cpp to use the new Strand subclass. Before doing this you should time the StringStrand implementation as follows.
/bin/time dnasimulate ecolimed.dat > medstr.outThis runs on the medium-sized ecoli data file, times the execution and stores the results in the data file medstr.out.
After changing the dnasimulate.cpp file to use a LinkStrand object instead of a StringStrand object (don't forget to modify the Makefile) you can time the run again:
/bin/time dnasimulate ecolimed.dat > medlink.outThen you can compare the output files using the Unix command diff:
diff medstr.out medlink.outIf the files are the same (expected) then nothing will be printed. Otherwise information about the line numbers at which the files differ will be printed.
You may want to time the StringStrand version on the full ecoli.dat data file. Be warned, however that on a Teer machine the program took more than 10 minutes to run to process the entire ecoli file. The LinkStrand version of the program took 8.6 seconds to run.
If you actively work with one other person, please include both names in the README and make one submission.
You should also include an explanation of the timings you observe and why the linked list implementation is so much faster even though finding the locations at which to divide/cleave the strands is O(MN) for both implementations.
submit_cps100 dna *.h *.cpp Makefile README
Criteria | Points |
---|---|
LinkStrand read/write/construct | 8 |
LinkStrand cutWith | 12 |
this includes quality of code, quality of algorithm/method used, correctness, comments. | |
README write-up/analysis | 6 |
generic README/tests/etc. | 4 |