See the FAQ
Every group member must submit a README, only one group member submits the code
For this assignment you'll implement several clasess to build what are conceptually two programs: one to compress (huff) and the other to uncompress (unhuff) files that are compressed by the first program.
The
Huff
class is a simple main that launches a GUI with a
connected model that you'll write. You may have a similar
UnHuff class/main, but it's possible to incorporate all the
functionality into the model you write.
Most of the work will be in the models you implement and in the classes you implement that correspond to interfaces you're given as part of this assignment.
For part one you should implement a class that implements
the IHuffModel
interface and so that all methods
except write work.
Specifically, for part one your
IHuffModel implementation
class must be connected to the GUI/Viewer and must
implement three of the methods from the
IHuffModel
interface:
initialize.
This involves reading a stream and updating local state. You'll need to
store counts of characters in some local state (you may want to
investigate the
ICharCounter interface.
The GUI should display something
similar to the screen shots below when the reading
starts and when it is nearly finished --- this will happen
automatically if you read from the InputStream
passed to initialize.
|
|
showCounts.
This should call the view's update method, displaying
per-character counts via the view/display. The counts represent the data
read during initialization. Here's a screen shot. Recall that you pass a
Collection, e.g., an ArrayList to the
HuffViewer.update method, where the list contains strings
that will be displayed. The strings should contain all int values with
associated counts for ints in the alphabet (e.g., 0-255).
showCodings.
Displaying per-character encodings via view/display. Each line shown gives a character/chunk number and the encoding for that value based on the huffman tree that comes from the counts. You'll need to make a tree and a map/table (see below) to get these encodings.
setViewer.
You must implement this method, or the viewer will fail to attach itself to the model. You'll need to store a viewer in the model's private state.
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
Design, develop, and test so that you have a working program at each step.
Build a program by adding working and tested pieces.
You're given several interfaces as part of the code framework. The only
interface you must use is
IHuffModel. However, the other interfaces will help you
achieve a good design.
To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.
First, the process of Huffman compression is discussed, then some details about the program follow.
You might want to use
an ICharCounter
class to do the counting, but you don't have to.
You might want to use
an ITreeMaker
class to create the tree, but you don't have to -- however, the
code to create the tree must live somewhere.
The table can be an array of the appropriate size (roughly 256, but be careful of PSEUDO_EOF) or you can use a Map subclass. An array is the simplest approach for this part of the huff/compress process, using a Map is not necessary.
You might want to use
an IHuffEncoder
class to create the table from the tree,
but you don't have to -- however, the
code to create the table/map from the tree must live somewhere.
To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.
To complete the program, you'll need to write a compressed file based on the encodings and you'll need to read a compressed file and uncompress it.
When you uncompress, the uncompression program/methods must be able to recreate the Huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. Using the interfaces described in Part I, and here, will help ensure you don't duplicate too much code in writing the compress/uncompress classes and code.
Your compression and uncompression methods work together. A file compressed using someone else's compression code probably can't be uncompressed by your code unless you agree on several standards that aren't part of this assignment.
For Part 2, the GUI has been redesigned, though your model doesn't need to change. Here are two shots that show the revised menus in the GUI which are the only visible changes. The old menu is on the left, the new menu on the right.
|
|
ALPH_SIZE counts as int values, but you can
also write the tree. This is described more fully below.
One easy way to ensure that both programs work in tandem is to write a magic number at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in the BitInputStream and BitOutputStream classes.
For example, in my working program I have the following lines in different parts of my class that implements. the IHuffHeader interface.
In general, a file with the wrong magic number should not generate an
error, but should notify the user. For example, in my program the
exception above ultimately causes the user to see what's shown
below. This is because the exception is caught and the viewer's
showError method called appropriately. Your code should
at least print a message, and ideally generate an error dialog as shown.
For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) int values or you can try to be inventive and save space. For example, it is possible to store just chunk/character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 256 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).
For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
0 0 1 001100001 1 000100000 1 001110100 |
|
The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary).
Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero (think about the 20-questions/animal program).
Even storing only non-zero counts qualifies as space-savings. To store
non-zero counts, however, you'll need to store the character/chunk being
counted and this might not save space. In my non-saving-space code, my
header is written by the following code. Note that
BITS_PER_INT is 32 in Java.
In my code to read/write the header as a tree, the resulting header
is much smaller. Both reading/writing are done in classes that
implement
IHuffHeader,
but that's a guide, not a requirement.
If the user forces compression using the Options menu as shown below then compression occurs even if the compressed file is bigger.
The
IHuffHeader interface
specifies a method headerSize to help with keeping the code
for headers in one place.
Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.
int k = 'A';
System.out.println( (char) k);
(for more details, see the complete huffman coding discussion and see the faq.)
The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.
In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
In the file HuffConstants the
number of characters counted is specified by ALPH_SIZE
which has value 256. Although only 256 values can be represented by
8 bits, these values are between 0 and 255, inclusive. One character is
used as the pseudo-EOF character -- it must be a value not-representable
with 8-bits, the smallest such value is 256 which requires 9 bits to
represent. However, ideally your program should be able to work with
n-bit chunks, not just 8-bit chunks.
| description | points |
|---|---|
| compression of any text file | 10 points |
| compression of any file (including binary files) | 5 points |
| decompression | 10 points |
| robustness (does unhuff program crash on non-huffed files?) | 5 points |
| program style (class design/implementation, program design) | 10 points |
| test programs/char count/char codings | 10 points |