Compsci 100, Fall 2009, Burrows-Wheeler

Snarf the burrows-wheeler project via Eclipse, or browse the code directory for the files you need. You'll need a working Huffman Coding/Compression program to do this project. See the howto for help.

Background and Genesis of Assignment

[Photo: David Wheeler; photo source]
The Burrows-Wheeler transform has been the basis of a programming assignment in Princeton's COS 226 course for several years---but this Duke assignment emphasizes the block nature of the transform. The transform was originally invented and developed by David Wheeler (pictured left), but its publication and modern development can be attributed to Mike Burrows (pictured right), as gleaned from Burrows' foreword to an issue of Theoretical Computer Science dedicated to David Wheeler, who died in 2004.
The Data Compression Conference did not like the paper [about the transform], perhaps put off by my rather pragmatic presentation. However, Mark Nelson drew attention to it in a Dr. Dobbs article, and that was enough to ensure its survival. I was embarrassed to see that Mark Nelson called it the "Burrows-Wheeler Transform", even though I tried to make it clear that the algorithm was David's. Of course, David was far too polite to mention this.
Interestingly, both Wheeler and Burrows are better known for accomplishments other than the transform.

Wheeler is credited with inventing the subroutine while working on the EDSAC, one of the first computers. Subroutines were conceptually simple but difficult to implement; without them we wouldn't be able to design and implement large programs at all.

[Photo: Mike Burrows; photo source]

Burrows co-invented Alta Vista, arguably the predecessor of Google and the first large-scale successful search engine. For this and other work he won the Mark Weiser Award for "contributions that are highly creative, innovative, and possibly high-risk, in keeping with the visionary spirit of Mark Weiser."

From his London Times obituary we find the following about Wheeler:

Wheeler was an inspiring teacher who helped to develop computer science teaching at Cambridge from its inception in 1953, when the Diploma in Computer Science was launched as the world's first taught course in computing. Many of his research students now occupy senior positions in major computer companies, or have made their own significant contributions to computer science, including the development of new programming languages.

From an online-article about Burrows that is no longer accessible from Stanford, we find the following:

Mike is not just another hip, smart young Googler. He's one of the pioneers of the information age. His invention of Alta Vista helped open up an entire new route for the information highway that is still far from fully explored. His work history, intertwined with the development of the high-tech industry over the past two decades, is distinctly a tale of scientific genius.

Mike knew he wanted to be a scientist early on and focused on it at school. "I strongly prefer the British educational system," he says. "We begin to specialize at age 14. That means after age 14 I didn't have to study history, geography or any social sciences I was bad at. I was able to focus on the subjects I liked. Early specialization gives you more time to concentrate and makes it easier for you to succeed."

The Assignment

You are to implement the block-transform-compression method known as the Burrows-Wheeler Transform (BWT) using Huffman Compression/Coding as the compression part of the transform. Basically the compression algorithm can be described by the following process.

BWT for encoding/compressing

  1. Read the input file in blocks or chunks, e.g., 65,536 bytes at a time.

  2. For each block that is read, emit a compressed version of the block as follows:

    • Use the BWT to sort and obtain the last column as described below and in the howto pages. You'll also need the index of the original text; we'll call this the first index because it's the index of the original, or first, text before the sorting phase of BWT. The last column is represented as a char[], an array of char values (though all in the range 0-255). A sketch of this step and the move-to-front step appears after this list.

    • Use move-to-front to transform the last column into a more compressible form; you'll have a new char[] for the transformed data.

    • Compress this data using a ByteArrayInputStream and Huffman Coding as described in the howto pages. Write the compressed data and the first index to the output file/stream.

    • Remember that you'll be doing these three steps for each block/chunk your program reads.

  3. Repeat beginning with step 1 until the original file has been completely read -- this means you'll process the file in blocks or chunks until done. The last chunk may be smaller than the other chunks.
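As a rough, illustrative sketch of the BWT and move-to-front steps above (assuming a naive sort of all rotations and char values in the range 0-255), the following Java shows one way the pieces fit together. The class and method names (BwtSketch, forwardBwt, moveToFront) are made up for this example; your real code belongs in the provided BurrowsWheeler class and your IHuffProcessor implementation.

    import java.util.Arrays;

    // Illustrative sketch only: a naive forward BWT plus move-to-front encode.
    // The class and method names are hypothetical, not the assignment's
    // BurrowsWheeler API.
    public class BwtSketch {

        // Returns the last column of the sorted rotations of chunk; firstIndex[0]
        // is set to the row that holds the original (unrotated) chunk.
        static char[] forwardBwt(char[] chunk, int[] firstIndex) {
            int n = chunk.length;
            Integer[] rows = new Integer[n];     // each rotation identified by its start offset
            for (int i = 0; i < n; i++) rows[i] = i;

            // Sort rotations lexicographically; each comparison may look at up to
            // n characters, which is slow but fine for seeing how the transform works.
            Arrays.sort(rows, (a, b) -> {
                for (int k = 0; k < n; k++) {
                    char ca = chunk[(a + k) % n], cb = chunk[(b + k) % n];
                    if (ca != cb) return ca - cb;
                }
                return 0;
            });

            char[] last = new char[n];
            for (int i = 0; i < n; i++) {
                int start = rows[i];
                last[i] = chunk[(start + n - 1) % n]; // last character of this rotation
                if (start == 0) firstIndex[0] = i;    // the row equal to the original text
            }
            return last;
        }

        // Move-to-front: replace each char (0-255) by its index in a running
        // ordering of all 256 values, then move that value to the front.
        static char[] moveToFront(char[] data) {
            char[] order = new char[256];
            for (char c = 0; c < 256; c++) order[c] = c;

            char[] out = new char[data.length];
            for (int i = 0; i < data.length; i++) {
                char c = data[i];
                int pos = 0;
                while (order[pos] != c) pos++;             // current position of c
                out[i] = (char) pos;                       // emit the index
                System.arraycopy(order, 0, order, 1, pos); // shift earlier values right
                order[0] = c;                              // move c to the front
            }
            return out;
        }
    }

A cleverer sort would be faster on 65,536-byte blocks, but the naive comparator is enough to see what the last column and the first index are.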

BWT for decoding/uncompressing

You'll uncompress using the following process.

  1. If the magic number at the beginning of a file specifies BWT, use that method for uncompressing; if it specifies normal Huffman compression, use your regular Huffman uncompression code instead.

  2. Read the compressed file in groups. Each group corresponds to one of the chunks written in the compression step. The groups can be of variable length because of varying compression ratios, but each group corresponds to a chunk of the block size used during compression (e.g., 65,536 bytes), except perhaps for the last group, which could represent a smaller chunk. For each chunk do the following:

    • Read an int value representing the first index for the compressed chunk.

    • Read the tree/values needed for Huffman uncompressing, i.e., the header written during the BWT compression phase.

    • Read the bits to uncompress using the tree that was read or created, e.g., regular Huffman coding, but use a ByteArrayOutputStream for uncompressing.

    • Convert the byte[] to a char[] so that you can then undo the move-to-front transform.

    • After undoing move-to-front, use the inverse BWT to recreate the original text; you'll need the first index as well as the char[] from the previous step (see the sketch after this list).

  3. Repeat until the compressed file/stream has been uncompressed, see the howto pages for more details.
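The sketch below illustrates the last two bullets (undoing move-to-front and inverting the BWT), again assuming char values in the range 0-255. The names undoMoveToFront and inverseBwt are illustrative; the provided BurrowsWheeler class may organize these steps differently.

    // Illustrative decode-side sketch: undo move-to-front, then invert the BWT.
    // Method names are hypothetical, not the assignment's BurrowsWheeler API.
    public class BwtDecodeSketch {

        // Undo move-to-front: each value is an index into the current ordering;
        // look up the char, then move it to the front.
        static char[] undoMoveToFront(char[] indices) {
            char[] order = new char[256];
            for (char c = 0; c < 256; c++) order[c] = c;

            char[] out = new char[indices.length];
            for (int i = 0; i < indices.length; i++) {
                int pos = indices[i];
                char c = order[pos];
                out[i] = c;
                System.arraycopy(order, 0, order, 1, pos); // shift, then move c to front
                order[0] = c;
            }
            return out;
        }

        // Invert the BWT: given the last column and the first index, rebuild the
        // original chunk. A counting sort of the last column produces the first
        // (sorted) column and, for each row, the row whose rotation starts one
        // position later in the original text.
        static char[] inverseBwt(char[] last, int firstIndex) {
            int n = last.length;
            int[] count = new int[257];
            for (char c : last) count[c + 1]++;
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];

            int[] next = new int[n];        // next[i]: row whose rotation starts one position later
            char[] firstCol = new char[n];  // the sorted last column
            for (int j = 0; j < n; j++) {
                int pos = count[last[j]]++;
                next[pos] = j;
                firstCol[pos] = last[j];
            }

            char[] out = new char[n];
            int row = firstIndex;           // start at the row holding the original text
            for (int t = 0; t < n; t++) {
                out[t] = firstCol[row];
                row = next[row];
            }
            return out;
        }
    }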

The Code

If you snarf the assignment or browse the code directory you'll get a new GUI HuffViewer class with an additional menu item to transform an input file to an output file using BWT. The GUI will now call a method transform specified by the IHuffProcessor to encode/transform via BWT and Huffman coding (with move-to-front). You'll need to write this transform code in the class you have that implements the IHuffProcessor interface. There's information in the howto pages on this.

In the code you'll write you'll use methods from the BurrowsWheeler class -- there are three comments labeled TODO in that class; you'll need to write code in those places in addition to the code you write in your IHuffProcessor.transform method.

To uncompress a file compressed by BWT you'll need to look at the magic number, then call a method you write in your IHuffProcessor class to unhuff/untransform the data. There's more in the howto pages, but basically you'll need to reorganize your Huffman compressing/uncompressing code so you can call it appropriately when using BWT.
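As a small illustration of that reorganization, the sketch below shows one way to bridge between the char[] data used by BWT/move-to-front and stream-based Huffman code: wrap a char[] chunk in a ByteArrayInputStream before compressing, and convert the bytes collected in a ByteArrayOutputStream back to a char[] after uncompressing. The helper names toStream and toChars are made up for this example and are not part of IHuffProcessor.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    // Sketch of converting between the char[] data used by BWT/move-to-front and
    // the streams your Huffman code reads and writes. Helper names are made up.
    public class BridgeSketch {

        // Wrap a char[] (values 0-255) as a stream a Huffman compressor can read.
        static ByteArrayInputStream toStream(char[] chunk) {
            byte[] bytes = new byte[chunk.length];
            for (int i = 0; i < chunk.length; i++) {
                bytes[i] = (byte) chunk[i];          // safe because values are 0-255
            }
            return new ByteArrayInputStream(bytes);
        }

        // Convert the bytes collected while uncompressing into a ByteArrayOutputStream
        // back to a char[] so move-to-front can be undone. Mask with 0xFF because
        // Java bytes are signed.
        static char[] toChars(ByteArrayOutputStream uncompressed) {
            byte[] bytes = uncompressed.toByteArray();
            char[] chunk = new char[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                chunk[i] = (char) (bytes[i] & 0xFF);
            }
            return chunk;
        }
    }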

Empirical Analysis

Please submit a README in which you analyze both the time to compress and the amount of compression for both text files and non-text files, e.g., the directories calgary and waterloo, respectively, from the original Huffman assignment. Ideally you'll experiment with at least two block/chunk sizes to see how that affects time and the amount of compression yielded by BWT.