CompSci 100: Program Design & Analysis II(Fall 2007) | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Huffman CodingSee the FAQ You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and NetID of each person in your group in the README file that belongs with the submission. Do not wait until the last minute to start this assignment. Every group member must submit a README, only one group member submits the code Snarf the huff project via Eclipse, or browse the code directory you can see the javadoc here.
OverviewFor this assignment you'll implement several clasess to build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of reading a compressed file or a normal file specified by choosing a menu-option in the GUI-based program.
The
Most of the work will be in the classes you implement that correspond to interfaces you're given as part of this assignment. You're free to ignore the interfaces and implement only the model described below, but the interfaces have been designed to help identify and develop code in chunks so you don't do too much at once (this is a conceptually large program.) You're writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
Programming AdviceBecause the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!Design, develop, and test so that you have a working program at each step. Build a program by adding working and tested pieces.
The Huffman Compression ProgramYou'll probably implement several classes to create the program. The classes and methods you write are completely up to you, but a good design will receive maximal points.
You're given several interfaces as part of the code framework. The only
interface you must use is To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code. The first three steps are the basis for Part I which you should design, implement, and test before proceeding to the next part and step four. First, the process of Huffman compression is discussed, then some details about the program follow.
To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.
For part one you should implement a class that implements the More Part I details are here, but read the general description below first.
Specifically, for part one your
For more information about Part I code and design, see the details here.
Part II: Writing and Reading Compressed FilesTo complete the program, you'll need to write a compressed file based on the encodings and you'll need to read a compressed file and uncompress it. When you uncompress, the uncompression program/methods must be able to recreate the Huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. Using the interfaces described in Part I, and here, will help ensure you don't duplicate too much code in writing the compress/uncompress classes and code. Your compression and uncompression methods work together. A file compressed using someone else's compression code probably can't be uncompressed by your code unless you agree on several standards that aren't part of this assignment.
Writing a Compressed FileThere are three steps in writing a compressed file from the information your code stored/maintained in Part I (the counts and encodings).
Magic NumbersOne easy way to ensure that both programs work in tandem is to write a magic number at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in the BitInputStream and BitOutputStream classes. For example, in my working program I have the following lines in different parts of my class that implements the IHuffHeader interface.
In general, a file with the wrong magic number should not generate an
error, but should notify the user. For example, in my program the
exception above ultimately causes the user to see what's shown
below. This is because the exception is caught and the viewer's
Storing the Huffman TreeFor decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) int values or you can try to be inventive and save space. For example, it is possible to store just chunk/character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 256 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character). For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary). Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (think about the 20-questions/animal program).
Even storing only non-zero counts qualifies as space-savings. To store
non-zero counts, however, you'll need to store the character/chunk being
counted and this might not save space. In my non-saving-space code, my
header is written by the following code. Note that
In my code to read/write the header as a tree, the resulting header
is much smaller. Both reading/writing are done in classes that
implement
Forcing CompressionIf compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be shown indicating that this is the case. Here's a screen shot from what happens in my program.
If the user forces compression using the Options menu as shown below then compression occurs even if the compressed file is bigger.
Determining If Compression HappensTo determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.
The
Coding DetailsThere are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.No char variablesNote, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.
int k = 'A';
System.out.println( (char) k);
Pseudo-EOF character(for more details, see the complete huffman coding discussion and see the faq.) The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.
In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created. In the file IHuffConstants the
number of characters counted is specified by
GradingThe program is worth 55 points.
Your README file should include the names of all the people with whom you collaborated, and the TAs/UTAs you consulted with. You should include an estimate of how long you spent on the program and what your thoughts are about the assignment. Submit your README and all of your source code using Eclipse with assignment name huff. Remember that each person in a group should submit a separate README, this must include the names of the people in the group. Only one group member submits code.
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Last updated Fri Nov 30 19:51:55 EST 2007 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||