You'll need to create a CVS directory for each group (see the help/resources page). You'll also need to ad the biojava jar file to your project's classpath
You can test the program by running MainShotgun.java and reading the file simplefasta.txt which should create this strand:
tgaaaattcctttctattttaggcccatgcaatggcattagggcggttaa
The strand has a label as well that is printed when you run
the program.
You can also test with the largefasta.txt file which should run quickly. However, if you test with the hugefasta.txt file, you'll find that your program takes a long time to run. On my dual-processor G5 Mac, the current program constructs a single strand from this large data set in 637 seconds. You'll work to improve this time.
In general, you'll need to keep a careful log of the kind of machine you're running on and how long each run takes.
If you look at the code in ShotgunReconstructor you'll see that the merge threshold is hard-wired to be 12. You'll need to refactor code (introduce parameters, instance variables, etc.) so that the threshold value can be set in the main method from MainShotgun.java and then be passed appropriately when the genome is reconstructed.
Then you should make several runs using the largefasta.txt data file to determine the minimal and maximal thresholds that result in one strand being reconstructed. In the README/final report you produce you should indicate how you determined these numbers. Do the same for the simplefasta.txt data file.
regionMatches and/or other string methods
to determine if two strands overlap. Your goal is to make
to develop an implementation that is correct, but which yields
much faster shotgun reconstruction. The points/score you
earn on this part are based 80% on correctness and 20%
on the efficiency of your code (it should be much faster).
In this new version you will not use regular expressions to determine if
two strings overlap, you'll check if the end of one strand matches the
beginning of another strand by using the String
regionMatches method. You'll need to be careful to ensure
that you handle the threshold properly. You can use the results of your
minimal and maximal thresholds from the previous section in your new
implementation. On my dual G5 mac that took 637 seconds for the slow
implementation the region-matches implementation took 9.686
seconds. That's a lot faster!.
You'll need to create a new IStrandFactory implementation as well, and configure the reading class appropriately from the main launching program. You should re-run all your previous simulations to see that your new implementation is correct. Be sure to include in your README/final report all the results you obtain.
NonReducingReconstructor.java. In this class, the
doGun method will not call reduce, it will
only call merge. You'll create a new implementation of
IStrand based on the FastStrand class you
created earlier. Call this new class NoReduceFastStrand.
In this class, you'll move the contains logic/code into the
merge method of the new IStrand object you're
creating. The idea is that if the entire other strand matches
somewhere in this strand you should not create a new strand, but should
return this, the strand containing the other
strand. You'll need to think carefully about what the preceding
sentences say.
All String matching should be done with
With the new
In your final README/write-up you should discuss if your refactoring was
successful, and whether it reduced the time for completing a shotgun
reconstruction. Be sure to rerun all experiments and report on what
times you observe.
regionMatches, you
should not create substrings nor should you call indexOf.
NonReducingReconstructor combined with
your NoReduceFastStrand class you'll see
lots of time saved since you'll avoid iterating over
the strings twice: once for contains and once for merge.
IStrand implementation that is based on
the Sequence returned when reading, this will avoid, perhaps, creating
strings for storing the DNA strands.
The new IStrand implementation will store
the Sequence, and use it and other biojava objects rather than
Strings to do the merging and containing. You'll need to
examine base-pairs in a sequence one-at-a-time, you might want
to develop a method similar to the String method
regionMatches for this purpose.
I'll help with this, I haven't done it (as of Nov 2005) to know how hard it is.