Due Date: Early bonus: Wed, March 12
Final Due Date: Friday, March 14
This assignment is mostly an empirical study of different sorting
algorithms whose implementations are given to you. Most of the
assignment involves analyzing the algorithms on different kinds of
input and making conclusions supported by the evidence you glean from
your studies.
This assignment is amenable to group work since much of it involves
gathering data. Although there's not much coding, there's still a fair
amount.
Goals
The README file for this assignment is substantial --- be sure you
read the assignment carefully to see what should be included in it (you
are also welcome to submit a report instead of the README file, but this
isn't necessary).
Sortall Table of contents
[ O(n log n) sorts | New Quicksort Partition | Submit | Extra Credit ]
The file sortall.cc will be used as a framework around which different sorting algorithms will run. Currently sortall will do the following:
The program uses templated functions to sort. Any class or type that supports the standard comparison operators (<=, >=, ==, !=, >, <) can be sorted.
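To illustrate what "templated" means here, the following is a minimal sketch of a templated insertion sort. It is not the code from sortall.cc; the name InsertionSortSketch is hypothetical, and the sketch uses only operator< from the comparison operators listed above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not taken from sortall.cc: sorts any vector whose
// element type supports operator<.
template <typename T>
void InsertionSortSketch(std::vector<T>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        T item = a[i];          // element to insert into the sorted prefix
        std::size_t j = i;
        // Shift larger elements one slot to the right.
        while (j > 0 && item < a[j - 1]) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = item;
    }
}
```

The same function can be called on a std::vector<int> or a std::vector<std::string> without any change, because the compiler generates one version per element type.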
First change into your cps100 subdirectory (type pwd
to verify where you are). Create a
sortall subdirectory by typing mkdir sortall and change
into this subdirectory (be sure to check that you're
in the sortall subdirectory.) Now copy the files for the assignment
(don't forget the . when copying).
You should see the files listed below (these are links to the files in case you use Netscape, and for users outside of Duke).
Compile the first version of the program by typing make sortall. Run the program; the output should look like the following:
sortall will also create two output files, selectint.data and selectstr.data, which should contain just the numbers that were printed on the screen. For example, selectint.data should look similar to the following:
You can check this by loading selectint.data into xemacs, or by typing cat selectint.data in the xterm window.
You will need to create two data files (one for sorting ints and one for strings) for three different sorts:
There will be a total of six files, two for each sort: one for ints and one for strings. You must modify sortall.cc so that each sort is run for different sized vectors. You should time vectors of size 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, and 5000 elements.
You will need to do this in one run of your program. Also, you will need to time each sort on the same initial "random" vectors to make valid comparisons between the different sorting algorithms. Once you have finished this, you will use gnuplot to generate graphs of your data.
To add new sorts, you'll need to add new SortBench variables. For example, the variables below can be used to time insertion sort:
The three arguments to the SortBench constructor are, respectively,
Be sure to sort an unsorted vector with each sort, and this should be the same unsorted vector in order to make valid comparisons between the sorts. To do this you'll assign the saved unsorted vectors, storeInt and storeString to aInt and aString before sorting.
You should probably try running the program for vectors of size 500 and 1000 first. It takes a while for all the sorts to run for vectors sized from 500 to 5000, so be sure your program is generating all six data files properly before doing the final timings.
To use gnuplot just type gnuplot at the UNIX prompt. Now you should see the gnuplot prompt: gnuplot> rather than your normal prompt. To plot a datafile you should type the following:
If you mess up typing, you'll have to type the command again. You can abbreviate linespoints as linesp and you can abbreviate with as w. (To quit gnuplot type quit at the gnuplot prompt.)
The number pairs 1 1 and 3 3 indicate the style of line and point to use. For example, you can add another file by using a comma after the 5 5, specifying bubblestr.data, and specifying "linespoints 2 2".
This command should generate a plot on the screen. Now to label the axes type:
Finally you will create a version of the plot to print by typing
You should put your observations on these sorts in your README file (see below for what to submit). You'll need to turn in a hardcopy of your graphs (either in class, or to a TA, UTA, or Professor for CPS 100).
These are "smart" pointers because as pointers they require less time to swap/move since only pointers are moved rather than entire strings being re-copied. The Print function is used to facilitate debugging and for the PrintArray function in sortall.cc.
You should implement the SmartStrPtr class and plot times for strings compared to smart strings. You should do this for both the O(n^2) sorts and the O(n log n) sorts described below. You should include timings in your README and an explanation of the behavior of smart string pointers compared to strings. You'll need to write another RandLoad function to create a vector of smart string pointers. For each element, create a new string using new, with the same logic for creating a random string that's shown in the RandLoad function for strings.
There are several faster sorts. Most of them are discussed in Weiss and we will go over some of them in class.
In doing this part of the assignment you must analyze two sorts as described below.
First analyze the faster sorts by modifying the program sortall so that it sorts arrays of size 10,000, 20,000, ..., 50,000 using the faster sorts. DO NOT USE THE O(n^2) sorts! They will take too long. You can plot the data and then include the data as part of your README file showing how long each sort takes for ints, strings, and smart strings. The quicksort and mergesort functions are named QuickSort and MergeSort, respectively.
(In the changes here don't call the output files storing times the same names you used in the first part of the assignment --- you might change the file names to quick2.data, for example.)
Change sortall.cc so that it uses only quicksort and only with a vector of 10,000 elements (so don't use mergesort). Recompile the program and then run it by typing sortall 100. This makes the RandLoad function generate only numbers from 0 to 99, but there will still be 10,000 of these numbers. You should notice a big increase in the time to sort the integers compared with your previous run of quicksort on a vector of 10,000 elements. You'll need to put the timings in your README and explain the timings.
To "fix" this problem, you're going to write a new version of quicksort that splits the array into 3 sections: one less than the pivot, one equal to the pivot, and one greater than the pivot.
You should write a new function Pivot3. You should use Pivot as a model for this function. Pivot3 splits the array into the 3 sections. The section equal to the pivot does NOT need to be recursively sorted. The prototype for Pivot3 may be different from the function Pivot because instead of returning a pivot you'll need to return two values: one for the end of the less-than section and one for the beginning of the greater than section. A diagram that might help you develop the code is given below.
If the k-th element is equal to the pivot you can swap it into the end of the equal section much as the original pivot function works. If the k-th element is less than the pivot, you can first swap it into the beginning of the equal section (and bump less), then swap the new k-th element, which is now equal to the pivot, into the end of the equal section (bumping equal). You'll probably need to think about this to get it right. To initialize less and equal note that there will be one element equal to the pivot (that's the pivot element) and NO elements less than the pivot element (so less must be initialized to a location not in the range of locations being partitioned). Write a new function quicksort3 that calls Pivot3 instead of Pivot.
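The swapping scheme described above can be sketched as follows. This is one possible reading, not the required solution: it assumes the pivot is the first element of the range (the Pivot function in sortall.cc may choose differently), it works only on int vectors, it omits the CUTOFF switch to insertion sort, and the names Pivot3Sketch and QuickSort3Sketch are hypothetical.

```cpp
#include <utility>
#include <vector>

// Partition a[first..last] into less-than, equal, and greater-than sections.
// Returns {end of the less-than section, beginning of the greater-than section}.
std::pair<int, int> Pivot3Sketch(std::vector<int>& a, int first, int last) {
    int pivot = a[first];
    int less = first - 1;   // no elements less than the pivot yet
    int equal = first;      // the pivot itself is the only equal element
    for (int k = first + 1; k <= last; ++k) {
        if (a[k] == pivot) {
            ++equal;
            std::swap(a[equal], a[k]);   // append to the equal section
        } else if (a[k] < pivot) {
            ++less;
            std::swap(a[less], a[k]);    // a[k] now holds a pivot-equal value
            ++equal;
            std::swap(a[equal], a[k]);   // move it to the end of the equal section
        }
        // Elements greater than the pivot stay where they are.
    }
    return {less, equal + 1};
}

// Recursively sort only the less-than and greater-than sections.
void QuickSort3Sketch(std::vector<int>& a, int first, int last) {
    if (first >= last) return;
    std::pair<int, int> p = Pivot3Sketch(a, first, last);
    QuickSort3Sketch(a, first, p.first);
    QuickSort3Sketch(a, p.second, last);
}
```

Note that less starts at first - 1, a location outside the range being partitioned, which matches the initialization hint in the paragraph above.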
Check your new sorting function to make sure you haven't lost elements. Use the PrintArray function on a randomly generated vector of size 40 to make sure that your vector is sorted. Note: When quicksort is applied to small vectors (size 20 or less) it calls insertionsort, so your pivot code may not be tested unless you change the value of the constant CUTOFF to be 0.
You will probably want to implement a CheckSorted function to determine if your new quicksort is working. It will be difficult to earn full credit without implementing such a function. CheckSorted might, for example, take two vectors and decide whether one is a sorted version of the other, which confirms both that no numbers are lost when sorting and that the final vector is sorted.
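A minimal sketch of such a check is shown below. It assumes that std::sort from the standard library can serve as a trusted reference, and the name CheckSortedSketch and its two-vector signature are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Returns true when result holds exactly the elements of original,
// arranged in nondecreasing order.
template <typename T>
bool CheckSortedSketch(const std::vector<T>& original,
                       const std::vector<T>& result) {
    std::vector<T> expected = original;            // leave the caller's data alone
    std::sort(expected.begin(), expected.end());   // trusted reference ordering
    return expected == result;                     // same size, same elements, same order
}
```

Comparing against a reference ordering catches both kinds of failure at once: a vector that is out of order and a vector that is in order but has lost or duplicated elements.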
When you've finished debugging quicksort3, run the new n log n sorts for integer vectors only using sizes 10000, 12000, 14000, ..., 20000 using mergesort, quicksort, and quicksort3. Make sure you type sortall 100 for your tests so that you're sorting integers constrained to be between 0 and 99.
submit100 sortall README sortall.cc
The final version of sortall.cc should include your revised pivot function and smart string pointer code. It doesn't matter what your function main does.
Remember that every assignment must have a README file submitted with it (please use all capital letters). Include your name, the date, and an estimate of how long you worked on the assignment in the README file. You must also include a list of names of all those people with whom you collaborated on the assignment.
In your README file you should include the output generated by your sorting programs and an explanation of why quicksort does so badly when the numbers sorted are constrained to be less than 100 and why the modified partition code improves the timings. You should also explain as best you can the reasons for the differences observed between different O(n^2) sorts and why sorting strings, ints, and smart strings yields different results. Whenever possible try to give good explanations about why observed behavior occurs rather than merely recording the observed behavior. In particular, you should account for why bubble sort does so poorly and why sorts do better on different kinds of data.
Bucket sort works when sorting integers in a limited range (and, on computers, all integers are in a limited range.) In the routine BucketSort this range is specified by the additional parameter radix as noted in the comments of the routine. (This means that you CANNOT use BucketSort with the class SortBench as the class is written since the BucketSort function doesn't have the right signature/prototype.) For example, if all the numbers being sorted are in the range 0--9 (the value of radix would be 10), then the diagram below shows how "bucket" counts are determined from an array and then used to "sort" the array.
Note that the count in each bucket indicates how many occurrences of each number appear in the original array and can be used in a straightforward manner to "store" numbers in the sorted array. The numbers are not being re-arranged as with other sorts, the count array is used to generate an array that has the same number of occurrences of each number that appeared in the original array.
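The counting idea described above can be sketched as follows. This is not the BucketSort routine supplied with the assignment; the name BucketSortSketch is hypothetical, and the sketch assumes every value in the vector is at least 0 and less than radix.

```cpp
#include <cstddef>
#include <vector>

// Assumes 0 <= value < radix for every element of a.
void BucketSortSketch(std::vector<int>& a, int radix) {
    std::vector<int> count(radix, 0);
    for (int value : a) ++count[value];      // tally occurrences of each number

    std::size_t next = 0;                    // next slot to fill in a
    for (int value = 0; value < radix; ++value) {
        for (int c = 0; c < count[value]; ++c) {
            a[next++] = value;               // regenerate the value count[value] times
        }
    }
}
```

For the vector 3 0 9 3 1 with a radix of 10, the counts for 0, 1, and 9 are each 1 and the count for 3 is 2, so the regenerated vector is 0 1 3 3 9.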
When sortall is invoked it interprets any argument as the radix used to determine the range of numbers (see the main routine). For example, sortall 1000 indicates that all numbers will be in the range 0--999. The default radix is 10,000.
Shell Sort
Shell sort is described in Weiss. The basic idea is to do a sequence of insertion sorts, but to "look" at elements that are far apart. In insertion sort an element is inserted into its proper position relative to all other elements by examining all other elements. In shell sort, an element might be inserted into the proper position relative to every 100th element rather than every element. Then elements are inserted into proper position relative to every 50th element, every 25th element, and so on until at the last stage of shell sort a regular insertion sort takes place. Because many elements are moved before this final stage, the sort is much more efficient than insertion sort. There are many more details of this algorithm in Weiss. The increments used in this version of shell sort are known as Hibbard's increments; they are of the form 1, 3, 7, ..., 2^k - 1.
Radix Sort
Use the function RadixSort, defined in extrasorts.cc, to time how long it takes to sort vectors of integers. Use vectors of size 10,000 to 100,000 in increments of 5,000 and compare the time to quicksort (so be sure that you sort the same vectors with QuickSort and RadixSort).
Submit a README file with the timings, and turn in a hardcopy of a graph comparing the two sorts (use gnuplot or some other plotting program). To submit, use submit100 sortall.xtra.