CPS 100, Spring 2004, DNA/Gene Splicing

Overview

For this problem you'll write a program that simulates the action of a restriction enzyme cutting (or cleaving) a DNA molecule. This is one part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993. The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure.

You'll be implementing a class using linked-lists to replace an existing class that uses strings. The increase in efficiency will illustrate the effectiveness of certain operations on linked lists.

For information on the eccentric scientist Kary Mullis (born in North Carolina) see the links below.

For an explanation of the chemical process of cleaving with restriction enzymes see these links.

Simulation

Restriction enzymes cut a strand of DNA at a specific location, separating the strand into two pieces. In the real chemical process a strand can be split into several pieces, we'll simulate this by repeatedly dividing a strand.

Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates the first occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, removing the matched pattern from the DNA strand. (In the real process the matched pattern isn't removed).

For example


     aatccgaattcgtatc        DNA strand

          gaattc             restriction enzyme

     aatcc      gtatc        divided strand

Note that the first occurrence of the pattern specified by the restriction enzyme is located in the DNA strand. The DNA strand is split at this location into two strands.

In the simulation/code you'll write, a DNA strand will be replaced by the portion of the strand that comes before the location where the restriction enzyme is found. A new Strand will be generated for the portion of the strand that comes after the location at which the enzyme is found. Although in a chemical experiment a strand might be literally cut in two, as part of this simulation cutting a strand will result in generating one new strand, and replacing the existing strand with another. This simulates the real experiment since the number of Strand objects will increase by one as a result of cutting with a restriction enzyme (except when no cut can be made). For the class Strand the function that does this is is Strand::cutWith (note: the class Strand is an abstract base class, so this function is implemented in subclasses like StringStrand and LinkStrand, not in the parent class.)


  Strand * Strand::cutWith(Restrictor& enzyme)
  // post: if enzyme occurs in this strand, then replace
  //       this strand with portion occurring before enzyme
  //       and return new Strand representing portion occurring
  //       after the enzyme

For example, the code below will compile and generate the output that follows.


    Restrictor ecori("gaattc");

    Strand * s = new StringStrand("aatccgaattcgtatc");
    Strand * after = s->cutWith(ecori);
    cout << "s = " << *s << endl;
    cout << "new strand = " << *after << endl;

Output of this code is

    s = aatcc
    new strand = gtatc

Note that after the cut, the strand s becomes the portion of the strand that comes before the location of the restriction enzyme pattern. A new strand is returned for the portion of the original strand that comes after the cut.

If a strand is cut with an enzyme that doesn't occur in the strand, the original strand is unchanged and the new strand returned by cutWith is empty (a strand whose size is zero).

Cleaving at Every Location

The code below will divide a strand S repeatedly, creating new strands, until S can no longer be divided. Each piece that results from a cut is printed.

   int cutCount = 0;
   Strand * str = .....       // initialize str as a strand
   Restrictor enzyme("...");  // make enzyme a Restrictor object

   do {

       Strand * after = str->cutWith(enzyme);
       cout << cutCount << "\t" << *str << endl;   // part before cut 
       str = after;                                // then keep cutting
       cutCount++;
   } while (str->size() != 0) {

This code is the basis for the code in the clss StrandBench that simulates a lab/bench experiment.

Code you Design and Write for the DNA Assignment

See snarf/copy files information on what files you're provided for this assignment.

You're given an implementation/subclass of Strand, called StringStrand that works, but is inefficient for repeated cleaving. Your task is to implement a new subclass using linked lists as the basis for storing DNA sequences. The subclass, LinkStrand will be tested and then timed and compared to the class you're given.

You must implement, test, and compare the implementations.

This will provide practice with linked lists, with inheritance, and with testing.

The class `LinkStrand`

The class StringStrand does not have an efficient cutWith method because cutting a string of n-characters could result in creating two strings of n/2 characters, for example. This means that cutting an n-character string is potentially an O(n) function in addition to finding the location at which to cut. Also, cutting a string generates two new strings which takes up memory.

You can see that the StringStrand::cutWith function passes some tests by running testcut which should print nothing if the implementation being tested passes.

   make testcut
   testcut
   no output here if tests pass

In Eclipse you should add a target for testcut as described in the Eclipse make tutorial.

You're to implement a class LinkStrand that stores a strand as a linked list, with one character 'a', 'g', 't', or 'c', per node of the list. You should implement the same functions specified in stringstrand.h, but using linked lists to store DNA character/bases rather than strings.

Specifics of Linked List Implementation

Your class must be named LinkStrand, it must be a subclass of Strand, and it must be implemented in linkstrand.h and linkstrand.cpp.

You must use a header node for the linked list representation. Your class must maintain three data fields:

a pointer to the header node,
a pointer to the last node of the list,
a count of how many nodes are in the list (that represent values in the strand, the header node doesn't count.)

For example, here's the code for my default constructor which should provide some ideas.

 LinkStrand::LinkStrand()
     : myFirst(new Node('x',0)),  // char value doesn't matter, header
       myLast(myFirst),
       myCount(0)
// post: this strand is empty, size() == 0
{
    
}

This creates a dummy/header node, keeps two pointers to it, and initializes a count to zero. For example, adding a new value to the end of the list maintained would require:

making myLast->next point at the new node/value
update myLast to point at the new last node
update the count to reflect the addition of a new value

In implementing LinkStrand::cutWith you'll need to find the location in the linked list/strand at which the enzyme occurs. This will most likely be an O(mn) operation where the strand has n characters and the enzyme has m characters.

In general your LinkStrand methods have the following requirements.

The LinkedStrand::cutWith method will have the same complexity as StringStrand::cutWith for finding the place at which a cut is made, but the complexity of generating the new LinkStrand object must be O(1) rather than O(n) for the StringStrand method which must generate a new string to construct the StringStrand.
The LinkedStrand::size method should run in O(1) time (not dependent on number of bases/characters in the Strand).
The LinkeStrand::toString method will probably be inefficient, that's fine (O(n) for an n-character strand is ok).

The key insight into creating the new LinkStrand object is to use Nodes already in the Strand being cut. This can be done by creating an empty LinkStrand object and then manipulating the private data fields of this new object within the LinkStrand::cutWith function.

You can manipulate/assign to the private data fields of the LinkStrand object you create because the cutWith method is a LinkStrand member-function, so it can assign values to private data fields of any LinkStrand object.

To see that the complexity of finding the location to cut is O(mn), consider the pseudocode below for StringStrand::cutWith, note that This is loop-version/pseudocode of what's actually in stringstand.cpp, but that code uses string::find which hides the loop shown below.


  Strand * StringStrand::cutWith(const Restrictor& enzyme)
  {
    int len     = myString.length();
    string estr = enzyme.toString();
    int elen    = estr.length();

    // start search at each char in this strand
    // at which a match could be found

    for(int k=0; k <= lastGoodIndex; k++) {   // O(N)

        // look for occurrence of enzyme pattern
        // by comparing elen chars starting at index k
        // in this strand with the chars in the enzyme
        // this could be done in a loop without calling
        // string::find

 
	for(...) {   // O(M)

	}
	if (enzyme match found at index k) {

	    Strand * retval = 
                     new StringStrand(myString.substr(k+elen, len));
	    myString = myString.substr(0, k);
	    return retval;
	}
    }
    return empty;
  }

Implementing and Testing LinkStrand

Before testing the cutWith method you should test the other methods: reading, converting to a string, and size.

To test your LinkStrand class you should use testcut.cpp to test different member functions. The main function you're given looks like what's below, so that initially only the toString method of the Strand subclass is tested.

int main()
{
    doToString();
 //  doReading();
 //  doSize();
 //  doCutting();

}

You'll need to modify all the testing functions to use LinkStrand rather than StringStrand. You can do this by replacing all occurrences of one string with another using Replace in the Edit menu of Emacs and the find/replace option in Eclipse (in Emacs you'll be prompted for strings in the minibuffer).

You'll need to add linkstrand.cpp to the Makefile, then run make depend and make testcut (no make depend in Eclipse).

Errors in reading and toString can cause problems, so be sure these are right before testing cutWith

Simulating Cutting with `LinkStrand`

When the tests in testcut.cpp pass, you should change dnasimulate.cpp to use LinkStrand, the new Strand subclass. Before doing this you should time the StringStrand implementation as follows.

   dnasimulate ecolimed.dat medstr.out

This runs on the medium-sized ecoli data file, stores the output in a data file medstr.out and prints the runtime for the cutting sequence of the simulation.

After changing the dnasimulate.cpp file to use a LinkStrand object instead of a StringStrand object (don't forget to modify the Makefile) you can time the run again:

   dnasimulate ecolimed.dat medlink.out

Then you can compare the output files using the Unix command diff:

  diff medstr.out medlink.out

You can run diff from a Unix/shell prompt, or from a command prompt in windows using the same syntax shown above.

If the files are the same (expected) then nothing will be printed. Otherwise information about the line numbers at which the files differ will be printed.

You may want to time the StringStrand version on the full ecoli.dat data file. Be warned, however that on a Teer machine the program took more than 10 minutes to run to process the entire ecoli file. My LinkStrand version of the program took under 8 seconds to run.

What to Submit

You should submit all your .h and .cpp files and your Makefile. Be sure you have both linkstrand.h and linkstrand.cpp files included in these. In your README file, also submitted, you should include a list of people with whom you consulted, the number of hours spent on the assignment, and your thoughts of the assignment.

In your README you should also include an explanation of the timings you observe and why the linked list implementation is so much faster even though finding the locations at which to divide/cleave the strands is O(mn) for both implementations.

  submit_cps100 dna *.h *.cpp Makefile README

If you work with a partner, you should include both names in the README file and submit once. You may not choose your own partner, if you want a partner we will assign one to you at random. If you work on your own, you're expected to do the work on your own, though you can talk about high-level details with others (documented in your README file).

Grading

This assignment is worth 36 points. The points will be awarded roughly as follows:

Criteria Points
LinkStrand read/write/construct 12
LinkStrand cutWith 18
this includes quality of code, quality of algorithm/method used, correctness, comments.
README write-up/analysis 6

Criteria	Points
LinkStrand read/write/construct	12
LinkStrand cutWith	18
this includes quality of code, quality of algorithm/method used, correctness, comments.
README write-up/analysis	6

Owen Astrachan

Last modified: Wed Jan 28 23:03:47 EST 2004