CompSci 100e, Fall 2006, Huffman Coding


See the FAQ


You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and login id of each person in your group in the README file that belongs with the submission. Do not wait until the last minute to start this assignment.

Every group member must submit a README, only one group member submits the code


Overview

For this assignment you'll implement several clasess to build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program.

The Huff class is a simple main that launches a GUI with a connected model that you'll write. You may have a similar UnHuff class/main, but it's possible to incorporate all the functionality into the model you write.

Most of the work will be in the models you implement and in the classes you implement that correspond to interfaces you're given as part of this assignment.


Part One

For part one you should implement a class that implements the IHuffModel interface and so that all methods except write work.

Specifically, for part one your IHuffModel implementation class must be connected to the GUI/Viewer and must implement three of the methods from the IHuffModel interface:

Huffman Coding Overview

This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

Programming Advice

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!

Design, develop, and test so that you have a working program at each step.

Build a program by adding working and tested pieces.

Getting Started

Checkout the assignments/huff assignment via Eclipse, or browse the code directory.

huff: The Compression Program

You'll probably use several classes to create the main programs. The classes and methods you write are completely up to you, but a good design will receive maximal points.

You're given several interfaces as part of the code framework. The only interface you must use is IHuffModel. However, the other interfaces will help you achieve a good design.

To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.

First, the process of Huffman compression is discussed, then some details about the program follow.

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.

More Details Here


Part 2: Writing and Reading Compressed Files


To complete the program, you'll need to write a compressed file based on the encodings and you'll need to read a compressed file and uncompress it.

When you uncompress, the uncompression program/methods must be able to recreate the Huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. Using the interfaces described in Part I, and here, will help ensure you don't duplicate too much code in writing the compress/uncompress classes and code.

Your compression and uncompression methods work together. A file compressed using someone else's compression code probably can't be uncompressed by your code unless you agree on several standards that aren't part of this assignment.

For Part 2, the GUI has been redesigned, though your model doesn't need to change. Here are two shots that show the revised menus in the GUI which are the only visible changes. The old menu is on the left, the new menu on the right.

Writing a Compressed File

There are three steps in writing a compressed file from the information your code found in Part I (the counts and encodings).

  1. Write a magic number at the beginning of the compressed file. This is explained more fully below, but your code uses the magic number to determine if a compressed file was created by your program when uncompressing.

  2. Write information after the magic number that allows the Huffman tree to be recreated. The simplest thing to do here is write ALPH_SIZE counts as int values, but you can also write the tree. This is described more fully below.

  3. Write the bits needed to encode each character of the input file. For example, if the coding for 'a' is "01011" then your code will have to write 5 bits, in the order 0, 1, 0, 1, 1 every time the program is compressing/encoding the chunk 'a'.

Magic Numbers

One easy way to ensure that both programs work in tandem is to write a magic number at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in the BitInputStream and BitOutputStream classes.

For example, in my working program I have the following lines in different parts of my class that implements. the IHuffHeader interface.

// write out the magic number out.write(BITS_PER_INT, MAGIC_NUMBER); then in another part of the class (in another method) int magic = in.read(BITS_PER_INT); if (magic != MAGIC_NUMBER){ throw new IOException("magic number not right"); }

In general, a file with the wrong magic number should not generate an error, but should notify the user. For example, in my program the exception above ultimately causes the user to see what's shown below. This is because the exception is caught and the viewer's showError method called appropriately. Your code should at least print a message, and ideally generate an error dialog as shown.

Storing the Huffman Tree

For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) int values or you can try to be inventive and save space. For example, it is possible to store just chunk/character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 256 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
   0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary).

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (think about the 20-questions/animal program).

Even storing only non-zero counts qualifies as space-savings. To store non-zero counts, however, you'll need to store the character/chunk being counted and this might not save space. In my non-saving-space code, my header is written by the following code. Note that BITS_PER_INT is 32 in Java.

for(int k=0; k < ALPH_SIZE; k++){ out.write(BITS_PER_INT, myCounter.getCount(k)); } This header is then read as follows, this doesn't do much, but shows how reading/writing the header are related. for(int k=0; k < ALPH_SIZE; k++){ int bits = in.read(BITS_PER_INT); charcounter.set(k,bits); }

In my code to read/write the header as a tree, the resulting header is much smaller. Both reading/writing are done in classes that implement IHuffHeader, but that's a guide, not a requirement.


Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be shown indicating that this is the case. Here's a screen shot from what happens in my program.

If the user forces compression using the Options menu as shown below then compression occurs even if the compressed file is bigger.

Determining If Compression Happens

To determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

The IHuffHeader interface specifies a method headerSize to help with keeping the code for headers in one place.


Coding Details

There are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.

No char variables

Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.

     int k = 'A';
     System.out.println( (char) k);

Pseudo-EOF character

(for more details, see the complete huffman coding discussion and see the faq.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

int bits; while ((bits = input.read(1) != -1) { // process bits } To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file using such a technique. int bits; while (true) { bits = input.read(1); if (bits == -1) { throw new IOException("unexpected end of input file"); } else { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // the character is pseudo-EOF, then decompression done if ( (bits & 1) == 0) // read a 0, go left in tree else // read a 1, go right in tree if (at leaf-node in tree) { if (leaf-node stores pseudo-eof char) break; // out of loop else write character stored in leaf-node } } } When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized during the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit-reading method returns -1) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test decompression with plain (uncompressed) files. My program generates this error when such a file is found.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file HuffConstants the number of characters counted is specified by ALPH_SIZE which has value 256. Although only 256 values can be represented by 8 bits, these values are between 0 and 255, inclusive. One character is used as the pseudo-EOF character -- it must be a value not-representable with 8-bits, the smallest such value is 256 which requires 9 bits to represent. However, ideally your program should be able to work with n-bit chunks, not just 8-bit chunks.


Grading

The program is worth 50 points.
Huffman Coding Grading Standards
description points
compression of any text file 10 points
compression of any file (including binary files) 5 points
decompression 10 points
robustness (does unhuff program crash on non-huffed files?) 5 points
program style (class design/implementation, program design) 10 points
test programs/char count/char codings 10 points