Using Lightweight Threads (Fibers) for Sorting
Designing a sort algorithm to make efficient use of a few (4 to 8) functional units requires careful analysis, particularly if the performance, in total cycles used, or in useful instructions per cycle, is to be maximized. We consider here the algorithms MERGESORT, QUICKSORT, and RADIXSORT, and attempt to calculate their performance in total cycles used, as a function of N, the number of items to be sorted, M, the width of each item in bits, and P, the number of functional units used. The execution model used is that of a super-scalar processor with "perfect" memory (so each memory reference takes one cycle, as does each arithmetic operation), but with a long pipeline, so that a mispredicted branch takes 8 cycles. We also allow the processor to switch from one fiber to another in one cycle, and allow a new fiber to be started in one cycle, as well. Although our processor is capable of performing out-of-order execution within one fiber, we do not assume that this introduces any parallelism. Instead, execution proceeds by repeatedly fetching P instructions from one fiber on each cycle, until a branch is reached. The branch may be marked "wait" (in which case instruction fetch is stopped for 8 cycles), "dismiss" (in which case instruction fetch proceeds with the next fiber on the next cycle), or "predict" (in which case the hardware predicts its destination, and is assessed a penalty of 8 cycles if wrong). Performance is assessed by assuming that a fiber's first instruction can execute 8 cycles after being fetched, and subsequent instructions execute in order, one per cycle, following that first instruction. Instructions from other fibers may execute in parallel with these, to a maximum of P at one time.
The execution semantics of this processor differ somewhat from the semantics of previous versions of "threads", for in this model, each sequence of instructions taken from one fiber, and ending with a branch, executes logically entirely alone -- instructions from other fibers do not interleave with those of one such instruction sequence. The processor insures that the order in which instructions are fetched controls the logical order (in terms of inter-instruction dependency) of their execution. If an instruction A depends on the result of an instruction B which was fetched before A, then the processor delays execution of A until B's result is ready.
There are at least two schemes for programming this processor. The first assigns separate "tasks" to each fiber, where the tasks each represent a pre-calculated "split" of the overall problem into separately solvable pieces, each of which take about the same amount of execution time. The second scheme seems unique to this type of processor -- a "crowd" execution. In crowd execution, the original sequential algorithm is used, but a separate fiber is assigned to execute each iteration of the algorithm. Assume each such iteration operates on a different datum, x, by obtaining x via an assignment statement like x = A[I++]; Then if two different fibers execute this statement as part of a single instruction sequence then each will obtain x values from different locations in array A, assuming x is a variable private to each fiber, while A and I are shared among those fibers. No explicit synchronization operations are needed to ensure this behavior. However, performance is limited to at most one execution of the statement per cycle, by the "dependence preserving" rule the hardware obeys.
Example:
Consider a QUICKSORT algorithm which copies elements from A[I..J] to B[I..K]B[K+1]B[K+2..J], where the elements of A < s go to B[I..K], the splitter goes to B[K+1], and the elements of A >= s go to B[K+2..J]. Here s is a randomly-selected element of A. In the second stage of this QUICKSORT algorithm, B[I..K+1] can be split, using the same algorithm back into A[I..K+1], and B[K+2..J] can be split back into A[K+2..J], forming 4 or more regions of A. Elements in the I-th regions of A are all <= elements in the (I+1)-th. Each region of A can then be sorted, using any algorithm, and the result will be a sorted list. This version of QUICKSORT, using QUICKSORT to sort each sub-list, then looks (approximately) like the algorithm below WHICH HAS NOT BEEN TESTED:
/* Sort A[I..J] into A[Ib..Ib+J-I] */
Qsort(I,J,Ib) {
float s=A[I], x;
int Jb = Ib+J-I, I0=I, Ib0=Ib, Jb0=Jb;
if (I>J) return;
L0: while (I<=J) {
if ((x = A[I++])<s) A[Ib++] = s;
else A[Jb--] = s;
}
Qsort(Ib0,Ib-1,I0);
Qsort(Ib,Jb0,I);
}
The algorithm above needs to be tested: its end-tests may be off by one, and one needs to ensure that the pieces to be sorted recursively are each smaller than the original, regardless of the order of elements in A[I..J]. I will assume any changes needed are small. Then the algorithm exhibits some interesting properties:
if ((x = A[I++])<s) A[Ib++] = s;
else A[Jb--] = s;
If s, x, and pointers to A[I], A[Ib], and A[Jb] were kept in registers, this loop body would require 6 instructions (plus one to end the fiber), to execute. The dependence of one I++ operation on the same operation in another fiber limits the algorithm's speed to at least one cycle per iteration. The recursive calls of Qsort() should be executed by only one fiber, which would then apply "crowd" execution to each sub-problem.
MERGESORT seems to be harder to redesign for efficient operation on this processor. The MERGE step takes some 8 instructions per item, one of which is an unpredictable branch (to tell which of the sorted sub-lists contains the next output item). Each such comparison it executes depends on the outcome of the previous comparison, so some 16 cycles are needed, per item. The last merge stage then takes some 16N cycles to execute, running about 1/2 instructions per cycle. (Earlier MERGE stages can assign one fiber to each sub-list to be sorted, and thus can execute in parallel.)
Msort(I,J, R) {
if (I>=J) return;
Mid = (J-I)/2;
Msort(Mid+1,J, R+Mid+1-I);
Msort(I,Mid, Mid+1);
merge(R+Mid+1-I,R+J-I,Mid+1,Mid+1+Mid-I, R);
}
merge(I,J,K,L, R) {
register int x, y;
if (I>J) {
while (K<=L) A[R++] = A[J++];
return;
}
x = A[I++];
if (K>L) {
while (I<=J) A[R++] = A[I++];
return;
}
y = A[K++];
while (1) {
if (x<=y) {
A[R++] = x;
if (I>J) {
while (K<=L) A[R++] = A[J++];
return;
}
x = A[I++];
else {
A[R++] = y;
if (K>L) {
while (I<=J) A[R++] = A[I++];
return;
}
y = A[K++];
}
}
}
Linguistic Considerations
While the various strategies these algorithms might use seem clear, expressing them in understandable code is not. Consider Qsort(). What statements should be added to the language, to allow either "crowd" execution, or "assign one fiber per sub-task" execution to be specified? How should shared versus private variables be identified? How should the code a fiber is to execute be expressed? Should "critical sections" be marked in some way, so that if the code were to run under "standard" thread semantics, it would run correctly? What linguistic features would allow us to say "start one fiber running THIS code", "start N fibers running THIS code"? Should fibers be given separate subroutine call stacks? (Note that you can optimize starting N fibers at once, even if only P < N fibers can actually exist in the hardware registers at one time: The hardware keeps track of how many have been started, and begins a new task whenever an old one dies, using the same activation record.) Should fiber semantics demand that EVERY active fiber execute one "instruction segment" before the current fiber resumes? How can fibers delay themselves, without using hardware resources?
Subroutine call linguistic structures are probably adequate to specify use of these facilities, but such linguistic structures will not permit the compiler to check the resulting code for correctness, since the subroutine calls can't convey the programmer's intent to the compiler. Is this a problem? Are programs written using these strategies more likely to contain errors than would straightforward sequential programs?
We will discuss these issues in one or two classes, then adopt some makeshift solution, and proceed to design and test several real algorithms for this processor. In this process, I hope we identify mistakes in the initial language design and concepts, which we will at first carefully work around by hand, and then incorporate into a later language design.
Example problems:
Initial Language Design
I will make available Java files which document and implement an initial cut at a language design. This initial cut simulates the proper form of execution for fibers, by calling a method D(<test>) whenever a Boolean expression <test> is to be followed by a branch that is a "branch-dismiss", i.e., one that allows other fibers to run before being completed. This version has no provision for "timing" execution, and no provision for penalizing branches which must wait 8 cycles to be resolved. The files implement a version of QUICKSORT, but one which isn't nearly as efficient as the one outlined. One or more teams will have to take responsibility for extending the test-bed for the language, while other teams implement several of the algorithms or problems listed above, or other problems that may illustrate how the language and processor design can be used. I am interested in both problems which show that the processor IS valuable, and those that show that the processor can't solve some problem efficiently, or understandably.