See the FAQ
Every group member must submit a README, only one group member submits the code
Snarf the huff project via Eclipse, or browse the code directory you can see the javadoc here.
For this assignment you'll implement several clasess to build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program. However, there is really just a single program with the choice of reading a compressed file or a normal file specified by choosing a menu-option in the GUI-based program.
The
Huff
class is a simple main that launches a GUI with a
connected model that you'll write. This will eventually
be the single application/GUI you use to compress/huff files or
decompress/unhuff files.
Most of the work will be in the classes you implement that correspond to interfaces you're given as part of this assignment. You're free to ignore the interfaces and implement only the model described below, but the interfaces have been designed to help identify and develop code in chunks so you don't do too much at once (this is a conceptually large program.)
You're writing code based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
Design, develop, and test so that you have a working program at each step.
Build a program by adding working and tested pieces.
You're given several interfaces as part of the code framework. The only
interface you must use is
IHuffModel. However, the other interfaces will help you
achieve a good design.
To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code. The first three steps are the basis for Part I which you should design, implement, and test before proceeding to the next part and step four.
First, the process of Huffman compression is discussed, then some details about the program follow.
You might want to use
an ICharCounter
class to do the counting, but you don't have to.
You might want to use
an ITreeMaker
class to create the tree, but you don't have to --- however, the
code to create the tree must live somewhere.
The table can be an array of the appropriate size (roughly 256, but be careful of PSEUDO_EOF) or you can use a Map subclass. An array is the simplest approach for this part of the huff/compress process, using a Map is not necessary.
You might want to use
an IHuffEncoder
class to create the table from the tree,
but you don't have to -- however, the
code to create the table/map from the tree must live somewhere.
To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.
For part one you should implement a class that implements the IHuffModel interface so
that all methods except write work. You don't actually turn in
Part I separately, but it's a very good idea to make this part work
before proceeding to Part II and completion of the entire program.
More Part I details are here, but read the general description below first.
Specifically, for part one your
IHuffModel implementation
class must be connected to the GUI/Viewer and must
implement three of the methods from the
IHuffModel
interface:
initialize.
This involves reading a stream and updating local state. You'll need to
store counts of characters in some local state (you may want to
investigate the
ICharCounter interface.
The GUI should display something
similar to the screen shot below when the reading
starts and when it is nearly finished --- this will happen
automatically if you read from the InputStream
passed to initialize.
showCounts.
This should call the view's update method, displaying
per-character counts via the view/display. The counts represent the data
read during initialization. Here's a screen shot. Recall that you pass a
Collection, e.g., an ArrayList to the
HuffViewer.update method, where the list contains strings
that will be displayed. The strings should contain all int values with
associated counts for chars/ints in the alphabet (e.g., 0-255). The
screen shot below on the left shows how
to choose "show character counts" from the Options
menu; the screen on the right shows character counts for the file
kjv10.txt, note that there are 17,911 occurrences of an
uppercase 'A' whose unicode/ASCII value is 65.
|
|
showCodings.
Displaying per-character encodings via view/display. Each line shown gives a character/chunk number and the encoding for that value based on the huffman tree that comes from the counts. You'll need to make a tree and a map/table (see below) to get these encodings. In the screen shot you can see the encoding for 'A' is smaller in length then encoding for 'F' (ASCII/unicode 70) which makes sense given the number of occurrences shown above.
setViewer.
You must implement this method, or the viewer will fail to attach itself to the model. You'll need to store a viewer in the model's private state.
For more information about Part I code and design, see the details here.
To complete the program, you'll need to write a compressed file based on the encodings and you'll need to read a compressed file and uncompress it.
When you uncompress, the uncompression program/methods must be able to recreate the Huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. Using the interfaces described in Part I, and here, will help ensure you don't duplicate too much code in writing the compress/uncompress classes and code.
Your compression and uncompression methods work together. A file compressed using someone else's compression code probably can't be uncompressed by your code unless you agree on several standards that aren't part of this assignment.
ALPH_SIZE counts as int values, but you can
also write the tree. This is described more fully below.
One easy way to ensure that both programs work in tandem is to write a magic number at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in the BitInputStream and BitOutputStream classes.
For example, in my working program I have the following lines in different parts of my class that implements the IHuffHeader interface.
In general, a file with the wrong magic number should not generate an
error, but should notify the user. For example, in my program the
exception above ultimately causes the user to see what's shown
below. This is because the exception is caught and the viewer's
showError method called appropriately. Your code should
at least print a message, and ideally generate an error dialog as shown.
For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) int values or you can try to be inventive and save space. For example, it is possible to store just chunk/character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 256 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).
For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
0 0 1 001100001 1 000100000 1 001110100 |
|
The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary).
Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (think about the 20-questions/animal program).
Even storing only non-zero counts qualifies as space-savings. To store
non-zero counts, however, you'll need to store the character/chunk being
counted and this might not save space. In my non-saving-space code, my
header is written by the following code. Note that
BITS_PER_INT is 32 in Java.
In my code to read/write the header as a tree, the resulting header
is much smaller. Both reading/writing are done in classes that
implement
IHuffHeader,
but that's a guide, not a requirement.
If the user forces compression using the Options menu as shown below then compression occurs even if the compressed file is bigger.
The
IHuffHeader interface
specifies a method headerSize to help with keeping the code
for headers in one place.
Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.
int k = 'A';
System.out.println( (char) k);
(for more details, see the complete huffman coding discussion and see the faq.)
The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.
In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
In the file IHuffConstants the
number of characters counted is specified by ALPH_SIZE
which has value 256. Although only 256 values can be represented by
8 bits, these values are between 0 and 255, inclusive. One character is
used as the pseudo-EOF character -- it must be a value not-representable
with 8-bits, the smallest such value is 256 which requires 9 bits to
represent. However, ideally your program should be able to work with
n-bit chunks, not just 8-bit chunks.
| description | points |
|---|---|
| compression of any text file | 10 points |
| compression of any file (including binary files) | 5 points |
| decompression | 10 points |
| robustness (does unhuff program crash on non-huffed files?) | 5 points |
| program style (class design/implementation, program design) | 10 points |
| char count/char codings (Part I) | 10 points |
| README | 5 points |
Your README file should include the names of all the people with whom you collaborated, and the TAs/UTAs you consulted with. You should include an estimate of how long you spent on the program and what your thoughts are about the assignment.
Submit your README and all of your source code using Eclipse with assignment name huff. Remember that each person in a group should submit a separate README, this must include the names of the people in the group. Only one group member submits code.