Compsc 4g, Fall 2005, Shotgun

Please snarf the project shotgun-final. These classes provide a complete, though very slow implementation of a whole genome shotgun simulation. For your assignment you'll develop alternative implementations to improve the efficiency of the program.

You'll need to create a CVS directory for each group (see the help/resources page). You'll also need to ad the biojava jar file to your project's classpath

You can test the program by running MainShotgun.java and reading the file simplefasta.txt which should create this strand:

    tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
The strand has a label as well that is printed when you run the program.

You can also test with the largefasta.txt file which should run quickly. However, if you test with the hugefasta.txt file, you'll find that your program takes a long time to run. On my dual-processor G5 Mac, the current program constructs a single strand from this large data set in 637 seconds. You'll work to improve this time.


In general, you'll need to keep a careful log of the kind of machine you're running on and how long each run takes.

Initial Modifications

If you look at the code in ShotgunReconstructor you'll see that the merge threshold is hard-wired to be 12. You'll need to refactor code (introduce parameters, instance variables, etc.) so that the threshold value can be set in the main method from MainShotgun.java and then be passed appropriately when the genome is reconstructed.

Then you should make several runs using the largefasta.txt data file to determine the minimal and maximal thresholds that result in one strand being reconstructed. In the README/final report you produce you should indicate how you determined these numbers. Do the same for the simplefasta.txt data file.

Efficient and New Implementations

  1. Develop a new implementation of the IStrand interface that has a faster merge method. This class should be called FastStrand. You should use regionMatches and/or other string methods to determine if two strands overlap. Your goal is to make to develop an implementation that is correct, but which yields much faster shotgun reconstruction. The points/score you earn on this part are based 80% on correctness and 20% on the efficiency of your code (it should be much faster).

    In this new version you will not use regular expressions to determine if two strings overlap, you'll check if the end of one strand matches the beginning of another strand by using the String regionMatches method. You'll need to be careful to ensure that you handle the threshold properly. You can use the results of your minimal and maximal thresholds from the previous section in your new implementation. On my dual G5 mac that took 637 seconds for the slow implementation the region-matches implementation took 9.686 seconds. That's a lot faster!.

    You'll need to create a new IStrandFactory implementation as well, and configure the reading class appropriately from the main launching program. You should re-run all your previous simulations to see that your new implementation is correct. Be sure to include in your README/final report all the results you obtain.

    Faster Still

  2. Now modify ShotgunReconstructor by creating a new subclass that extends it. Call the new subclass NonReducingReconstructor.java. In this class, the doGun method will not call reduce, it will only call merge. You'll create a new implementation of IStrand based on the FastStrand class you created earlier. Call this new class NoReduceFastStrand.

    In this class, you'll move the contains logic/code into the merge method of the new IStrand object you're creating. The idea is that if the entire other strand matches somewhere in this strand you should not create a new strand, but should return this, the strand containing the other strand. You'll need to think carefully about what the preceding sentences say.

    All String matching should be done with regionMatches, you should not create substrings nor should you call indexOf.

    With the new NonReducingReconstructor combined with your NoReduceFastStrand class you'll see lots of time saved since you'll avoid iterating over the strings twice: once for contains and once for merge.

    In your final README/write-up you should discuss if your refactoring was successful, and whether it reduced the time for completing a shotgun reconstruction. Be sure to rerun all experiments and report on what times you observe.

Bonus

For A or A+ credit you should develop an implementation based on the biojava Sequence, SymbolList, and Edit classes. These can be found in the biojava API, in the org.biojava.bio.seq package. The idea is to create an IStrand implementation that is based on the Sequence returned when reading, this will avoid, perhaps, creating strings for storing the DNA strands.

The new IStrand implementation will store the Sequence, and use it and other biojava objects rather than Strings to do the merging and containing. You'll need to examine base-pairs in a sequence one-at-a-time, you might want to develop a method similar to the String method regionMatches for this purpose.

I'll help with this, I haven't done it (as of Nov 2005) to know how hard it is.


Owen L. Astrachan
Last modified: Fri Dec 2 13:22:30 EST 2005