CompSci 100, Fall 2006, Huffman Coding


See the FAQ


You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and login id of each person in your group in the README file that belongs with the submission. Do not wait until the last minute to start this assignment.

Every group member must submit a README, only one group member submits the code

Snarf the huff project via Eclipse, or browse the code directory you can see the javadoc here.


Overview

For this assignment you'll implement several clasess to build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of reading a compressed file or a normal file specified by choosing a menu-option in the GUI-based program.

The Huff class is a simple main that launches a GUI with a connected model that you'll write. This will eventually be the single application/GUI you use to compress/huff files or decompress/unhuff files.

Most of the work will be in the classes you implement that correspond to interfaces you're given as part of this assignment. You're free to ignore the interfaces and implement only the model described below, but the interfaces have been designed to help identify and develop code in chunks so you don't do too much at once (this is a conceptually large program.)

You're writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

Programming Advice

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!

Design, develop, and test so that you have a working program at each step.

Build a program by adding working and tested pieces.


The Huffman Compression Program

You'll probably implement several classes to create the program. The classes and methods you write are completely up to you, but a good design will receive maximal points.

You're given several interfaces as part of the code framework. The only interface you must use is IHuffModel. However, the other interfaces will help you achieve a good design.

To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code. The first three steps are the basis for Part I which you should design, implement, and test before proceeding to the next part and step four.

First, the process of Huffman compression is discussed, then some details about the program follow.

  1. To compress a file, count how many times every character occurs in a file. These counts are used to build weighted nodes that will be leaves in the Huffman tree. The word character is used, but we mean 8-bit chunk and this chunk-size could change. Do not use any variables of type char in your program. Use only ints.

    You might want to use an ICharCounter class to do the counting, but you don't have to.

  2. From these counts build the Huffman tree. First create one node per character, weighted with the number of times the character occurs, and insert each node into a priority queue. Then choose two minimal nodes, join these nodes together as children of a newly created node, and insert the newly created node into the priority queue. The new node is weighted with the sum of the two minimal nodes taken from the priority queue. Continue this process until only one node is left in the priority queue. This is the root of the Huffman tree.

    You might want to use an ITreeMaker class to create the tree, but you don't have to --- however, the code to create the tree must live somewhere.

  3. Create a table or map of characters (8-bit chunks) to codings. The table of encodings is formed by traversing the path from the root of the Huffman tree to each leaf, each root-to-leaf path creates an encoding for the value stored in the leaf. When going left in the tree append a zero to the path; when going right append a one. All characters/encoding bit pairs may be stored in some kind of table or map to facilitate easy retrieval later.

    The table can be an array of the appropriate size (roughly 256, but be careful of PSEUDO_EOF) or you can use a Map subclass. An array is the simplest approach for this part of the huff/compress process, using a Map is not necessary.

    You might want to use an IHuffEncoder class to create the table from the tree, but you don't have to -- however, the code to create the table/map from the tree must live somewhere.

  4. Finally, read the input file a second time. For each character/8-bit chunk read, write the encoding of the character (obtained from the map of encodings) to the compressed file. This part is NOT necessary for Part I

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.

Part I

For part one you should implement a class that implements the IHuffModel interface so that all methods except write work. You don't actually turn in Part I separately, but it's a very good idea to make this part work before proceeding to Part II and completion of the entire program.

More Part I details are here, but read the general description below first.

Specifically, for part one your IHuffModel implementation class must be connected to the GUI/Viewer and must implement three of the methods from the IHuffModel interface:

For more information about Part I code and design, see the details here.


Part II: Writing and Reading Compressed Files

To complete the program, you'll need to write a compressed file based on the encodings and you'll need to read a compressed file and uncompress it.

When you uncompress, the uncompression program/methods must be able to recreate the Huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. Using the interfaces described in Part I, and here, will help ensure you don't duplicate too much code in writing the compress/uncompress classes and code.

Your compression and uncompression methods work together. A file compressed using someone else's compression code probably can't be uncompressed by your code unless you agree on several standards that aren't part of this assignment.

Writing a Compressed File

There are three steps in writing a compressed file from the information your code stored/maintained in Part I (the counts and encodings).

  1. Write a magic number at the beginning of the compressed file. This is explained more fully below, but your code uses the magic number to determine if a compressed file was created by your program when uncompressing.

  2. Write information after the magic number that allows the Huffman tree to be recreated. The simplest thing to do here is write ALPH_SIZE counts as int values, but you can also write the tree. This is described more fully below.

  3. Write the bits needed to encode each character of the input file. For example, if the coding for 'a' is "01011" then your code will have to write 5 bits, in the order 0, 1, 0, 1, 1 every time the program is compressing/encoding the chunk 'a'.

Magic Numbers

One easy way to ensure that both programs work in tandem is to write a magic number at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in the BitInputStream and BitOutputStream classes.

For example, in my working program I have the following lines in different parts of my class that implements the IHuffHeader interface.

// write out the magic number out.write(BITS_PER_INT, MAGIC_NUMBER); then in another part of the class (in another method) int magic = in.read(BITS_PER_INT); if (magic != MAGIC_NUMBER){ throw new IOException("magic number not right"); }

In general, a file with the wrong magic number should not generate an error, but should notify the user. For example, in my program the exception above ultimately causes the user to see what's shown below. This is because the exception is caught and the viewer's showError method called appropriately. Your code should at least print a message, and ideally generate an error dialog as shown.

Storing the Huffman Tree

For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) int values or you can try to be inventive and save space. For example, it is possible to store just chunk/character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 256 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
   0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary).

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (think about the 20-questions/animal program).

Even storing only non-zero counts qualifies as space-savings. To store non-zero counts, however, you'll need to store the character/chunk being counted and this might not save space. In my non-saving-space code, my header is written by the following code. Note that BITS_PER_INT is 32 in Java.

for(int k=0; k < ALPH_SIZE; k++){ out.write(BITS_PER_INT, myCounter.getCount(k)); } This header is then read as follows, this doesn't do much, but shows how reading/writing the header are related. for(int k=0; k < ALPH_SIZE; k++){ int bits = in.read(BITS_PER_INT); charcounter.set(k,bits); }

In my code to read/write the header as a tree, the resulting header is much smaller. Both reading/writing are done in classes that implement IHuffHeader, but that's a guide, not a requirement.


Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be shown indicating that this is the case. Here's a screen shot from what happens in my program.

If the user forces compression using the Options menu as shown below then compression occurs even if the compressed file is bigger.

Determining If Compression Happens

To determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

The IHuffHeader interface specifies a method headerSize to help with keeping the code for headers in one place.


Coding Details

There are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.

No char variables

Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.

     int k = 'A';
     System.out.println( (char) k);

Pseudo-EOF character

(for more details, see the complete huffman coding discussion and see the faq.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

int bits; while ((bits = input.read(1) != -1) { // process bits } To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file using such a technique. int bits; while (true) { bits = input.read(1); if (bits == -1) { throw new IOException("unexpected end of input file"); } else { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // the character is pseudo-EOF, then decompression done if ( (bits & 1) == 0) // read a 0, go left in tree else // read a 1, go right in tree if (at leaf-node in tree) { if (leaf-node stores pseudo-eof char) break; // out of loop else write character stored in leaf-node } } } When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized during the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit-reading method returns -1) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test decompression with plain (uncompressed) files. My program generates this error when such a file is found.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file IHuffConstants the number of characters counted is specified by ALPH_SIZE which has value 256. Although only 256 values can be represented by 8 bits, these values are between 0 and 255, inclusive. One character is used as the pseudo-EOF character -- it must be a value not-representable with 8-bits, the smallest such value is 256 which requires 9 bits to represent. However, ideally your program should be able to work with n-bit chunks, not just 8-bit chunks.


Grading

The program is worth 55 points.
Huffman Coding Grading Standards
description points
compression of any text file 10 points
compression of any file (including binary files) 5 points
decompression 10 points
robustness (does unhuff program crash on non-huffed files?) 5 points
program style (class design/implementation, program design) 10 points
char count/char codings (Part I) 10 points
README 5 points

Your README file should include the names of all the people with whom you collaborated, and the TAs/UTAs you consulted with. You should include an estimate of how long you spent on the program and what your thoughts are about the assignment.

Submit your README and all of your source code using Eclipse with assignment name huff. Remember that each person in a group should submit a separate README, this must include the names of the people in the group. Only one group member submits code.