Compsci 100, Fall 2009, Burrows-Wheeler

Snarf the burrows-wheeler project via Eclipse, or browse the code directory for the files you need. You'll need a working Huffman Coding/Compression program to do this project. See the howto for help.

Background and Genesis of Assignment

[Photo: David Wheeler; photo source]
The Burrows-Wheeler transform has been the basis of a programming assignment in Princeton's COS 226 course for several years---but this Duke assignment emphasizes the block nature of the transform. The transform was originally invented and developed by David Wheeler (pictured left), but its publication and modern development can be attributed to Mike Burrows (pictured right), as gleaned from Burrows' foreword to an issue of Theoretical Computer Science dedicated to David Wheeler, who died in 2004.
The Data Compression Conference did not like the paper [about the transform], perhaps put off by my rather pragmatic presentation. However, Mark Nelson drew attention to it in a Dr. Dobbs article, and that was enough to ensure its survival. I was embarrassed to see that Mark Nelson called it the "Burrows-Wheeler Transform", even though I tried to make it clear that the algorithm was David's. Of course, David was far too polite to mention this.
Interestingly, both Wheeler and Burrows are better known for accomplishments other than the transform.

Wheeler is credited with inventing the subroutine while working on the EDSAC, one of the first computers. Subroutines were conceptually simple but difficult to implement; without them we wouldn't be able to design and implement large programs at all.

[Photo: Mike Burrows; photo source]

Burrows co-invented Alta Vista, arguably the predecessor of Google and the first large-scale successful search engine. For this and other work he won the Mark Weiser Award for "contributions that are highly creative, innovative, and possibly high-risk, in keeping with the visionary spirit of Mark Weiser."

From his London Times obituary we find the following about Wheeler:

Wheeler was an inspiring teacher who helped to develop computer science teaching at Cambridge from its inception in 1953, when the Diploma in Computer Science was launched as the world's first taught course in computing. Many of his research students now occupy senior positions in major computer companies, or have made their own significant contributions to computer science, including the development of new programming languages.

From an online-article about Burrows that is no longer accessible from Stanford, we find the following:

Mike is not just another hip, smart young Googler. He's one of the pioneers of the information age. His invention of Alta Vista helped open up an entire new route for the information highway that is still far from fully explored. His work history, intertwined with the development of the high-tech industry over the past two decades, is distinctly a tale of scientific genius.

Mike knew he wanted to be a scientist early on and focused on it at school. "I strongly prefer the British educational system," he says. "We begin to specialize at age 14. That means after age 14 I didn't have to study history, geography or any social sciences I was bad at. I was able to focus on the subjects I liked. Early specialization gives you more time to concentrate and makes it easier for you to succeed."

The Assignment

You are to implement the block-transform-compression method known as the Burrows-Wheeler Transform (BWT) using Huffman Compression/Coding as the compression part of the transform. Basically the compression algorithm can be described by the following process.

BWT for encoding/compressing

  1. Read the input file in blocks or chunks, e.g., 65,536 bytes at a time.

  2. For each block that is read, emit a compressed version of the block as follows:

    • Use the BWT to sort and obtain the last column as described below and in the howto pages. You'll also need the index of the original text; we'll call this the first index because it's the index of the original, or first, text before the sorting phase of BWT. The last column is represented as a char[], an array of char values (though all in the range 0-255). A sketch of this step and the move-to-front step appears after this list.

    • Use move-to-front to transform the last column into a more compressible form; you'll have a new char[] for the transformed data.

    • Compress this data using a ByteArrayInputStream and Huffman Coding as described in the howto pages. Write the compressed data and the first index to the output file/stream.

    • Remember that you'll be doing these three steps for each block/chunk your program reads.

  3. Repeat beginning with step 1 until the original file has been completely read -- this means you'll process the file in blocks or chunks until done. The last chunk may be smaller than the other chunks.
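As a rough, illustrative sketch of the BWT and move-to-front steps above (assuming a naive sort of all rotations and char values in the range 0-255), the following Java shows one way the pieces fit together. The class and method names (BwtSketch, forwardBwt, moveToFront) are made up for this example; your real code belongs in the provided BurrowsWheeler class and your IHuffProcessor implementation.

    import java.util.Arrays;

    // Illustrative sketch only: a naive forward BWT plus move-to-front encode.
    // The class and method names are hypothetical, not the assignment's
    // BurrowsWheeler API.
    public class BwtSketch {

        // Returns the last column of the sorted rotations of chunk; firstIndex[0]
        // is set to the row that holds the original (unrotated) chunk.
        static char[] forwardBwt(char[] chunk, int[] firstIndex) {
            int n = chunk.length;
            Integer[] rows = new Integer[n];     // each rotation identified by its start offset
            for (int i = 0; i < n; i++) rows[i] = i;

            // Sort rotations lexicographically; each comparison may look at up to
            // n characters, which is slow but fine for seeing how the transform works.
            Arrays.sort(rows, (a, b) -> {
                for (int k = 0; k < n; k++) {
                    char ca = chunk[(a + k) % n], cb = chunk[(b + k) % n];
                    if (ca != cb) return ca - cb;
                }
                return 0;
            });

            char[] last = new char[n];
            for (int i = 0; i < n; i++) {
                int start = rows[i];
                last[i] = chunk[(start + n - 1) % n]; // last character of this rotation
                if (start == 0) firstIndex[0] = i;    // the row equal to the original text
            }
            return last;
        }

        // Move-to-front: replace each char (0-255) by its index in a running
        // ordering of all 256 values, then move that value to the front.
        static char[] moveToFront(char[] data) {
            char[] order = new char[256];
            for (char c = 0; c < 256; c++) order[c] = c;

            char[] out = new char[data.length];
            for (int i = 0; i < data.length; i++) {
                char c = data[i];
                int pos = 0;
                while (order[pos] != c) pos++;             // current position of c
                out[i] = (char) pos;                       // emit the index
                System.arraycopy(order, 0, order, 1, pos); // shift earlier values right
                order[0] = c;                              // move c to the front
            }
            return out;
        }
    }

A cleverer sort would be faster on 65,536-byte blocks, but the naive comparator is enough to see what the last column and the first index are.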

BWT for decoding/uncompressing

You'll uncompress using the following process.

  1. If the magic number at the beginning of a file specifies BWT, use that method for uncompressing; if it specifies normal Huffman compression, use your regular Huffman uncompression code instead.

  2. Read the compressed file in groups. Each group corresponds to one of the chunks written in the compression step. The groups can be of variable length because of varying compression ratios, but each group corresponds to a chunk of the block size used during compression (e.g., 65,536 bytes), except perhaps for the last group, which could represent a smaller chunk. For each chunk do the following:

    • Read an int value representing the first index for the compressed chunk.

    • Read the tree/values needed for Huffman uncompressing, i.e., the header written during the BWT compression phase.

    • Read the bits to uncompress using the tree that was read or created, e.g., regular Huffman coding, but use a ByteArrayOutputStream for uncompressing.

    • Convert the byte[] to a char[] so that you can then undo the move-to-front transform.

    • After undoing move-to-front, use the inverse BWT to recreate the original text; you'll need the first index as well as the char[] from the previous step (see the sketch after this list).

  3. Repeat until the compressed file/stream has been uncompressed, see the howto pages for more details.
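The sketch below illustrates the last two bullets (undoing move-to-front and inverting the BWT), again assuming char values in the range 0-255. The names undoMoveToFront and inverseBwt are illustrative; the provided BurrowsWheeler class may organize these steps differently.

    // Illustrative decode-side sketch: undo move-to-front, then invert the BWT.
    // Method names are hypothetical, not the assignment's BurrowsWheeler API.
    public class BwtDecodeSketch {

        // Undo move-to-front: each value is an index into the current ordering;
        // look up the char, then move it to the front.
        static char[] undoMoveToFront(char[] indices) {
            char[] order = new char[256];
            for (char c = 0; c < 256; c++) order[c] = c;

            char[] out = new char[indices.length];
            for (int i = 0; i < indices.length; i++) {
                int pos = indices[i];
                char c = order[pos];
                out[i] = c;
                System.arraycopy(order, 0, order, 1, pos); // shift, then move c to front
                order[0] = c;
            }
            return out;
        }

        // Invert the BWT: given the last column and the first index, rebuild the
        // original chunk. A counting sort of the last column produces the first
        // (sorted) column and, for each row, the row whose rotation starts one
        // position later in the original text.
        static char[] inverseBwt(char[] last, int firstIndex) {
            int n = last.length;
            int[] count = new int[257];
            for (char c : last) count[c + 1]++;
            for (int i = 0; i < 256; i++) count[i + 1] += count[i];

            int[] next = new int[n];        // next[i]: row whose rotation starts one position later
            char[] firstCol = new char[n];  // the sorted last column
            for (int j = 0; j < n; j++) {
                int pos = count[last[j]]++;
                next[pos] = j;
                firstCol[pos] = last[j];
            }

            char[] out = new char[n];
            int row = firstIndex;           // start at the row holding the original text
            for (int t = 0; t < n; t++) {
                out[t] = firstCol[row];
                row = next[row];
            }
            return out;
        }
    }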

The Code

If you snarf the assignment or browse the code directory you'll get a new GUI HuffViewer class with an additional menu item to transform an input file to an output file using BWT. The GUI will now call a method transform specified by the IHuffProcessor to encode/transform via BWT and Huffman coding (with move-to-front). You'll need to write this transform code in the class you have that implements the IHuffProcessor interface. There's information in the howto pages on this.

In the code you'll write you'll use methods from the BurrowsWheeler class -- there are three comments labeled TODO in that class; you'll need to write code in those places in addition to the code you write in your IHuffProcessor.transform method.

To uncompress a file compressed by BWT you'll need to look at the magic number, then call a method you write in your IHuffProcessor class to unhuff/untransform the data. There's more in the howto pages, but basically you'll need to reorganize your Huffman compressing/uncompressing code so you can call it appropriately when using BWT.
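As a small illustration of that reorganization, the sketch below shows one way to bridge between the char[] data used by BWT/move-to-front and stream-based Huffman code: wrap a char[] chunk in a ByteArrayInputStream before compressing, and convert the bytes collected in a ByteArrayOutputStream back to a char[] after uncompressing. The helper names toStream and toChars are made up for this example and are not part of IHuffProcessor.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    // Sketch of converting between the char[] data used by BWT/move-to-front and
    // the streams your Huffman code reads and writes. Helper names are made up.
    public class BridgeSketch {

        // Wrap a char[] (values 0-255) as a stream a Huffman compressor can read.
        static ByteArrayInputStream toStream(char[] chunk) {
            byte[] bytes = new byte[chunk.length];
            for (int i = 0; i < chunk.length; i++) {
                bytes[i] = (byte) chunk[i];          // safe because values are 0-255
            }
            return new ByteArrayInputStream(bytes);
        }

        // Convert the bytes collected while uncompressing into a ByteArrayOutputStream
        // back to a char[] so move-to-front can be undone. Mask with 0xFF because
        // Java bytes are signed.
        static char[] toChars(ByteArrayOutputStream uncompressed) {
            byte[] bytes = uncompressed.toByteArray();
            char[] chunk = new char[bytes.length];
            for (int i = 0; i < bytes.length; i++) {
                chunk[i] = (char) (bytes[i] & 0xFF);
            }
            return chunk;
        }
    }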

Empirical Analysis

Please submit a README in which you analyze both the time to compress and the amount of compression for both text files and non-text files, e.g., the directories calgary and waterloo, respectively, from the original Huffman assignment. Ideally you'll experiment with at least two block/chunk sizes to see how that affects time and the amount of compression yielded by BWT.