CPS 100 Fall 2002: Assignment #4: Gene Splicing/DNA

Due: Tuesday, Oct. 1 by 11pm

LATE DAY EXTENDED: 10% off by Monday, Oct 7, 11pm

30 points

Since Test 1 is on Oct. 3, if you do not complete this assignment by Oct. 1, the 2-day lateness is extended, your program will be 10% late up until Monday, Oct. 7, 11pm.

Overview

Note: You are encouraged to work with one other person in the class on this assignment to engage in "pair programming". Pair programming means you must work together the whole time. You sit down at one computer, one of you types (the driver) and the other one watches closely, makes suggestions and looks for mistakes (the observer). If you do this, you will make one submission with both names in the README file.

For this problem you'll write a program that simulates the action of a restriction enzyme cutting (or cleaving) a DNA molecule. This is one part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993. The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure.

You'll be implementing a class using linked-lists to replace an existing class that uses strings. The increase in efficiency will illustrate the effectiveness of certain operations on linked lists.

For information the controversial scientist Kary Mullis (born in North Carolina) see the links below.

For an explanation of the chemical process of cleaving with restriction enzymes see these links.

  1. restriction enzymes

  2. restriction enzymes

  3. restruction enzymes

Simulation

Restriction enzymes cut a strand of DNA at a specific location, separating the strand into two pieces. In the real chemical process a strand can be split into several pieces, we'll simulate this by repeatedly dividing a strand.

Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates the first occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, removing the matched pattern from the DNA strand. (In the real process the matched pattern isn't removed).

For example


     aatccgaattcgtatc        DNA strand

          gaattc             restriction enzyme

     aatcc      gtatc        divided strand

Note that the first occurrence of the pattern specified by the restriction enzyme is located in the DNA strand. The DNA strand is split at this location into two strands.

In the simulation/code you'll write, a DNA strand will be replaced by the portion of the strand that comes before the location where the restriction enzyme is found. A new Strand will be generated for the portion of the strand that comes after the location at which the enzyme is found. Although in a chemical experiment a strand might be literally cut in two, as part of this simulation cutting a strand will result in generating one new strand, and replacing the existing strand with another. This simulates the real experiment since the number of Strand objects will increase by one as a result of cutting with a restriction enzyme (except when no cut can be made). For the class Strand the function that does this is is Strand::cutWith (note: the class Strand is an abstract base class, so this function is implemented in subclasses like StringStrand and LinkStrand, not in the parent class.)


  Strand * Strand::cutWith(Restrictor& enzyme)
  // post: if enzyme occurs in this strand, then replace
  //       this strand with portion occurring before enzyme
  //       and return new Strand representing portion occurring
  //       after the enzyme

For example, the code below will compile and generate the output that follows.

    Restrictor ecori("gaattc");

    Strand * s = new StringStrand("aatccgaattcgtatc");
    Strand * after = s->cutWith(ecori);
    cout << "s = " << *s << endl;
    cout << "new strand = " << *after << endl;

Output of this code is
    s = aatcc
    new strand = gtatc

Note that after the cut, the strand s becomes the portion of the strand that comes before the location of the restriction enzyme pattern. A new strand is returned for the portion of the original strand that comes after the cut. The output (hopefully) shows this.

If a strand is cut with an enzyme that doesn't occur in the strand, the original strand is unchanged and the new strand returned is empty (a strand whose size is zero).

Cleaving at Every Location

The code below will divide a strand S repeatedly, creating new strands, until S can no longer be divided. Each piece that result from a cut is printed.

   int cutCount = 0;
   Strand * S = .....         // initailize S as a strand
   Restrictor enzyme("...");  // make enzyme a Restrictor object

   do {

       Strand * after = S->cutWith(enzyme);
       cout << cutCount << "\t" << *S << endl;   // part before cut 
       S = after;                                // then keep cutting

   } while (S->size() != 0) {

This code is the basis for the code in the clss StrandBench that simulates a lab/bench experiment.


Code you Design and Write

You're given the classes/files below, also accessible on the acpub system in ~rodger/cps100/dna from which you can copy them as follows.
  cd
  cd cps100
  mkdir dna
  cd dna
  cp ~rodger/cps100/dna/* .

You should link three data files to your directory, use links instead of copying because these files are large. When testing your code use a data file you create, not the data files here. When you're confident your code works, you can run on these files. Only the file ecoli.dat should take a really long time for StringStrand.

   ln -s ~rodger/data/ecoli.dat
   ln -s ~rodger/data/ecolimed.dat

The class LinkStrand

The class StringStrand doesn't have an efficient cutWith method because cutting a string of n-characters could result in creating two strings of n/2 characters, for example. This means that cutting an n-character string is potentially an O(n) function in addition to finding the location at which to cut. Also, cutting a string generates two new strings which takes up memory.

You can see that the StringStrand::cutWith function passes some tests by running testcut which should print nothing if the implementation being tested passes.

   make testcut
   testcut
   no output here if tests pass

You're to implement a class LinkStrand that stores a strand as a linked list, with one character 'a', 'g', 't', or 'c', per node of the list. You should implement the same functions specified in stringstrand.h, but for linked lists rather than strings.

Specifics of Linked List Implementation

Your class must be named LinkStrand, it must be a subclass of Strand, and it must be implemented in linkstrand.h and linkstrand.cpp.

You must use a header node for the linked list representation. Your class must maintain three data fields:

  1. a pointer to the header node,
  2. a pointer to the last node of the list,
  3. a count of how many nodes are in the list (that represent values in the strand, the header node doesn't count.)
For example, here's the code for my default constructor which should provide some ideas.
 LinkStrand::LinkStrand()
     : myFirst(new Node('g',0)),  // char value doesn't matter, header
       myLast(myFirst),
       myCount(0)
// post: this strand is empty, size() == 0
{
    
}

This creates a dummy/header node, keeps two pointers to it, and initializes a count to zero. For example, adding a new value to the end of the list maintained would require:

In implementing LinkStrand::cutWith you'll need to find the location in the linked list/strand at which the enzyme occurs. This will most likely be an O(MN) operation where the strand has N characters and the enzyme has M characters. To see this, consider the pseudocode below for StringStrand::cutWith (not the one used in stringstrand.cpp). Note that the LinkedStrand::cutWith method will have the same complexity as StringStrand::cutWith for finding the place at which a cut is made, but the complexity of generating the new LinkStrand object should be O(1) rather than O(N) for the StringStrand method which must generate a new string to construct the StringStrand.

The key insight into creating the new LinkStrand object is to use Nodes already in the Strand being cut. This can be done by creating an empty LinkStrand object and then manipulating the private data fields of this new object within the LinkStrand::cutWith function.

It's essential that once found the complexity of creating the new Strand is O(1) and not O(N).

This is loop-version/pseudocode of what's actually
in stringstand.cpp, but that code uses string::find which
hides the loop

  Strand * StringStrand::cutWith(const Restrictor& enzyme)
  {
    int len     = myString.length();
    string estr = enzyme.toString();
    int elen    = estr.length();

    // start search at each char in this strand
    // at which a match could be found

    for(int k=0; k <= lastGoodIndex; k++) {   // O(N)

        // look for occurrence of enzyme pattern
        // by comparing elen chars starting at index k
        // in this strand with the chars in the enzyme
        // this could be done in a loop without calling
        // string::find

 
	for(...) {   // O(M)

	}
	if (enzyme match found at index k) {

	    Strand * retval = 
                     new StringStrand(myString.substr(k+elen, len));
	    myString = myString.substr(0, k);
	    return retval;
	}
    }
    return empty;
  }

Testing LinkStrand

To test your LinkStrand class you should use testcut.cpp to test the cut function, modified to use LinkStrand. You'll need to add linkstrand.cpp the the Makefile, then run make depend and make testcut.

You'll probably want to devise your own tests to verify that the read/print/size functions work correctly. There's an entry in the Makefile for a target anything that you can replace with another test program if you want to.

If the tests in testcut.cpp pass, you should change dnasimulate.cpp to use the new Strand subclass. Before doing this you should time the StringStrand implementation as follows.

   /bin/time dnasimulate ecolimed.dat > medstr.out
This runs on the medium-sized ecoli data file, times the execution and stores the results in the data file medstr.out.

After changing the dnasimulate.cpp file to use a LinkStrand object instead of a StringStrand object (don't forget to modify the Makefile) you can time the run again:

   /bin/time dnasimulate ecolimed.dat > medlink.out
Then you can compare the output files using the Unix command diff:
  diff medstr.out medlink.out
If the files are the same (expected) then nothing will be printed. Otherwise information about the line numbers at which the files differ will be printed.

You may want to time the StringStrand version on the full ecoli.dat data file. Be warned, however that on a Teer machine the program took more than 10 minutes to run to process the entire ecoli file. The LinkStrand version of the program took 8.6 seconds to run.

What to Submit

You should submit all your .h and .cpp files. You should have both linkstrand.h and linkstrand.cpp files included in these. In your README file, also submitted, you should include a list of people with whom you consulted, the number of hours spent on the assignment, and your thoughts of the assignment.

If you actively work with one other person, please include both names in the README and make one submission.

You should also include an explanation of the timings you observe and why the linked list implementation is so much faster even though finding the locations at which to divide/cleave the strands is O(MN) for both implementations.

  submit_cps100 dna *.h *.cpp Makefile README

Grading

This assignment is worth 30 points. The points will be awarded roughly as follows:

Criteria Points
LinkStrand read/write/construct 8
LinkStrand cutWith 12
this includes quality of code, quality of algorithm/method used, correctness, comments.
README write-up/analysis 6
generic README/tests/etc. 4


Susan H. Rodger
Last modified: Wed Sep 25 12:34:55 EDT 2002