Snarf the huff project via Eclipse or browse the code directory you can see the javadoc here. See the Huff howto for complete information on the assignment.
You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and login id of each person in your group in the README file that belongs with the submission.
There are many techniques used to compress digital data. This assignment covers the Huffman coding algorithm. Several algorithms for data compression have been patented, e.g., to use the MP3 codec to compress audio (which uses Huffman encoding as one of the steps in the algorithm) legitimately you must pay for a license from mp3licensing.com which coordinates licensing at least in the US and Europe.
Huffman coding was invented by David Huffman while he was a graduate student at MIT in 1950 when given the option of a term paper or a final exam. For details see this 1991 Scientific American Article. In an autobiography Huffman had this to say about the epiphany that led to his invention of the coding method that bears his name:
"-- but a week before the end of the term I seemed to have nothing to show for weeks of effort. I knew I'd better get busy fast, do the standard final, and forget the research problem. I remember, after breakfast that morning, throwing my research notes in the wastebasket. And at that very moment, I had a sense of sudden release, so that I could see a simple pattern in what I had been doing, that I hadn't been able to see at all until then. The result is the thing for which I'm probably best known: the Huffman Coding Procedure. I've had many breakthroughs since then, but never again all at once, like that. It was very exciting."
Huffman's original paper is available, though it's a tough read. The Wikipedia reference is extensive as is this online material developed as one of the original Nifty Assignments. Both jpeg and mp3 encodings use Huffman Coding as part of their compression algorithms. In this assignment you'll implement a complete program to compress and uncompress data using Huffman coding.
For this assignment you'll build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of compressing a file or uncompressing a file specified by choosing a menu-option in the GUI front-end to the code you write. Abstractly you're writing a program to read an input file and create a corresponding output file --- either from uncompressed to compressed or vice versa.
The Huff class is a simple
main that launches a GUI with a connected IHuffProcessor
implementation. The implementation corresponds to a model in the
model-view architecture we've been using in class: the view/GUI makes calls
on the model/IHuffProcessor methods which in turn
may display information
in the view/GUI. The code you write will also create files of compressed
or uncompresse data when the GUI-front end calls methods you will write.
You'll implement methods and store state in your
IHuffProcessor implementation so that it can either
compress/huff or uncompress/unhuff. You're welcome to implement
additional classes as well, but you don't need to.
You're writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding. Be sure to read that explanation, the notes from class, and refer appropriately to the howto for this assignment.
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like winrar or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
The Howto compression section has
complete information on how to create a compressed file. Basically you
first create a Huffman tree to derive per-character encodings, then you
write bits based on these encodings. The Huff main program
has a GUI front-end whose menu offers three choices: count characters,
compress, uncompress (and quit as a fourth choice). You can't compress
until you can count/create a tree, so make sure
counting/tree-creation/encoding works. The howto
document has details on how the program works.
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, it will be difficult to get a completely working program. The howto development section has more information on incremental developement.
You should run the program HuffMark which will read every file in a directory and compress it to another file in the same directory with a ".hf" suffix. You may want to modify this benchmarking program to print more data than it currently does, and to run it on both the calgary directory which represents the Calgary Corpus, a standard compression suite of files for empirical analysis, you can see this reference for comparisons on the Calgary Corpus and on the waterloo directory which is a collection of .tiff images used in some compression benchmarking. You can, of course, run on other data/collections. Be sure to check whether the file you compress and uncompress is the same as the original. See the howto diff section for more information.
The benchmarking program skips files with .hf suffixes, but you may want to remove these eventually. In your README you should discuss your benchmark results and provide some insight as to their meaning. Your analysis is worth nearly 25% of your grade on this assignment, so you should try to come to some conclusions in addition to simply listing your results.
If you use different methods to store information in the header of your compressed file, e.g., you store the tree and you also store counts, then reporting on differences is a good idea. Information on how to create the header is in the howto.
| description | points |
|---|---|
| compression/decompression of any text file | correctness (half) |
| compression/decompression of any file (including binary files) | correctness (half) |
| robustness (crash on non-huffed files, force compression..) | engineering (third) |
| program style (design, documentation, etc.) | engineering (third) |
| header information | engineering (third) |
| empirical analysis | analysis |
| README | must be complete for any credit |
Your README file should include the names of all the people with whom you collaborated, and the TAs/UTAs you consulted with. You should include an estimate of how long you spent on the program and what your thoughts are about the assignment.
Submit your README and all of your source code using Eclipse with assignment name huff. Remember that each person in a group should submit a separate README, this must include the names of the people in the group. Only one group member submits code and analysis.