You will implement two programs:
So applying unhuff to a file compressed by huff must exactly reproduce the original file. This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this online explanation.
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program! Design, develop, and test so that you have a working program at each step.
Build a program by adding working and tested pieces.
To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file, or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff.
The first thing unhuff should do is verify that the file being uncompressed was originally compressed by your huff program. In particular, your program should not crash if it is given a program to uncompress that was not compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program.
One easy way to ensure that both programs work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h. A sample program which uses bitops.cpp is given in bitread.cpp.
For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, its possible to store just character counts and recreate the codes from the counts. Its also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).
For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
0 0 1 001100001 1 000100000 1 001110100 |
|
The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that are represent 97, the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32, the ASCII value of a space. Then next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116.
Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero.
The huff program should write its compressed output to a file generated from the original file name by adding the suffix .hf to the file name. For example, the command below should compress the file foo storing the result in foo.hf.
huff foo
The user should have the option of invoking huff with an argument -f which forces compression even when the compressed file is bigger. For example:
prompt< huff foo
no compression yielded, no output file foo.hf written
prompt< huff -f foo
Here, the second time huff is executed the compression is
forced because of the -f argument. It's fine for your program to accept
the -f command-line argument only when it appears as the first argument.
Finally, if the user specifies the -h option, your program should output the options that it supports and then exit. This allows the user to see how you expect the program to be used.
Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.
int k = 'A';
cout << char(k);
The program showargs.cpp shows how command line arguments can be read and processed by your program. Do not spend time worrying how to process command-line arguments until you are sure the program works, this is not the most important part of the program. You can add command-line parameters after everything works.
(for more details, see the complete huffman coding discussion.)
The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.
In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character [the value of PSEUDO-EOF could be 256, but it's set for 257 for historical reasons]. Your program should be able to work with n-bit chunks, not just 8-bit chunks.
A heap-based, templated priority queue class tpqueue is provided (in the standard location with libtapestry files, see tpq.h. and tpq.cpp.). The program tpqtest.cpp shows how to use the class with pointers.
(for more details, see the complete huffman coding discussion.)
To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:
Some ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.
For example, one possibility for a character counting class is shown below:
You'll need at least ALPH_SIZE counters (typically this is 256, see globals.h). There are more than 256 possible characters if you include the pseudo-EOF character
Since the values stored in the map are constrained to be between 0 and 257, you can simpley store bitpattern/strings in a vector and index them directly. There's nothing at all inferior about this approach compared to using a map.
Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.
Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.
You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.
We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing
this code is similar to the loop shown below which uses the standard get function to read a char from a stream.
When using writebits to write a specified number of bits, some bits may not be written immediately because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flushbits is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly. If, for some reason you create an obstream object referenced by a pointer, make sure to delete the object so that its destructor is called.
Although readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:
An example program using the bitreading classes is provided for you to study, it is called bitread.cpp. Because Huffman coding requires two passes over the input file you will need to use the method rewind that works with object variables of type ibstream.
The effect of ib.rewind() is to allow the input bit stream ib to be ''read again'', i.e., a loop that reads until no more bits are available (checking the return-value of readbits) will be able to read the entire file again.
g++ -g -o foo foo.cpp
invokes the g++ compiler (program) with four command-line
arguments: -g, -o, foo, and foo.cpp. Command-line arguments are passed
to the program in a C-style array of strings along with a count of the
number of command-line arguments passed. The header for main
must be altered in a program that processes command-line arguments
(see HowtoA in Tapestry for more details.) The program showargs.cpp reproduced below illustrates how
command-line parameters can be processed.
prompt> showargs program showargs has no command-line args prompt> showargs -o fish -b fowl showargs has 5 command-line args: 1-th arg is: -o 2-th arg is: fish 3-th arg is: -b 4-th arg is: fowlAs shown in the code fragment above, and in the sample file showargs.cpp, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in showargs.cpp. Then you can use the C++ string functions you are familiar with to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style strings.
| description | points |
|---|---|
| compression of any text file | 10 points |
| compression of any file (including binary files) | 5 points |
| decompression | 10 points |
| robustness (does unhuff program crash on non-huffed files?) | 5 points |
| program style (class design/implementation, program design) | 10 points |
When this single file is unhuffed, all the files originally compressed should be uncompressed/unhuffed.
You'll need to store information about file names in the huffed file to know what to use for names when unhuffing. The original names must be used when uncompressing.
This extra credit is worth 10 points.
submit_cps100 huff README *.h *.cpp MakefileThe Makefile should be able to handle both make huff and make unhuff. The sample Makefile in ~ola/cps100/huff will do this. You can alter the different lines of the Makefile as you need to. Don't forget to run make depend if you change the Makefile.