The resulting program will be a complete and useful compression program, although perhaps not as powerful as the standard Unix programs compress or zip, which use different algorithms that permit a higher degree of compression than Huffman coding. You should take care to develop the program incrementally. If you try to write the whole program at once, you probably will not get a completely working program! Design, develop, and test so that you have a working program at each step.
Your program will also need to accept command-line parameters that specify the filename (rather than prompting the user or reading from cin). More information on command-line parameters is given below.
Because the compression scheme involves reading and writing a bit at a time rather than a char at a time, the program can be hard to debug. To facilitate the design/code/debug cycle, an iterative-enhancement development of the program is suggested below. You are by no means constrained to adopt this approach, but if your final program doesn't work, it may prove useful to be able to show the useful, working programs you built along the way.
You should copy these files to get started. In particular, the file globals.h provides declarations needed by both huff and unhuff programs (compression/uncompression programs). The file filenames.cc shows (very rudimentarily) how to parse command line arguments.
You are to write two programs:
huff.cc: this program compresses a file.
unhuff.cc: this program uncompresses a (previously compressed) file.
Both programs may eventually read from a file specified by the user on the command line, but initially the user can be prompted for the filename, or you can read input from cin and use I/O redirection to read from a file (if this doesn't make sense, ask). The program huff.cc should write its (compressed) output to a file that is also specified in some way (by the user, on the command line, or to cout). The line below might compress the file foo.cc, storing the compressed version in foo.cc.H.
huff foo.cc foo.cc.H
If your program reads/writes from cin/cout, this could be done using I/O redirection:
huff < foo.cc > foo.cc.H
As described below in the development section, you MUST use a class-based approach to this problem. This means, for example, that you should implement at least one class. Some suggestions are provided in the development section.
You must write a program huff.cc that will compress a file so that it can be uncompressed by unhuff.cc. In writing huff.cc you'll implement at least one class; this class will also provide support for the decompression program unhuff.cc.
If compressing a file would result in a file larger than the file being compressed (this is always possible), then no compressed file should be created and a message should be printed indicating that this is the case. The user should have the option of invoking huff with an argument -f, which forces compression even when the compressed file is bigger. For example:
huff foo.cc foo.cc.H
huff -f foo.cc foo.cc.H
Here, the second time huff is executed the compression is forced because of the -f argument. Only the first argument to the program can be the -f argument. We'll talk about processing command-line arguments in class, but see filenames.cc too. You may need to use C-style strings to process command-line arguments, but this can be avoided if each element of argv[] (an array of C-style strings) is stored in a string. Again, see filenames.cc for details.
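The decision described above can be made before any output is written, once the frequency counts and code lengths are known. The following is only a sketch under stated assumptions: the names CompressedBits and ShouldCompress are hypothetical, headerBits stands for whatever your own file format stores before the encoded data, and the output is assumed to be padded to whole eight-bit bytes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: total bits in the compressed file, assuming the
// format is "header, then one code per character read".
// freqs[k] is the count of symbol k, codeLengths[k] the length of its code.
long CompressedBits(const std::vector<long>& freqs,
                    const std::vector<int>& codeLengths,
                    long headerBits)
{
    long total = headerBits;
    for (std::size_t k = 0; k < freqs.size() && k < codeLengths.size(); k++)
    {
        total += freqs[k] * codeLengths[k];
    }
    return total;
}

// Hypothetical helper: compress unless the result would be larger than the
// original, or always when -f was given.
bool ShouldCompress(long originalBytes, long compressedBits, bool force)
{
    long compressedBytes = (compressedBits + 7) / 8;   // round up to whole bytes
    return force || compressedBytes <= originalBytes;
}
```

If ShouldCompress returns false, the program would print its message and exit without creating the output file.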
The program should be robust enough not to crash if it is given a file to uncompress that wasn't compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program. There are a variety of methods that you can use to ensure this works, but it will most likely always be possible to "fool" the program.
One easy way to ensure that compression and decompression work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a Huffman-compressed file (for some N). The corresponding uncompression program first reads N bits; if the bits don't represent the "magic number", then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h, whose use is shown in bitread.cc.
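As an illustration of the magic-number idea, here is a minimal sketch that uses ordinary C++ streams rather than the bitops.h classes (whose full interface isn't shown here). The constant MAGIC_NUMBER, its value, the choice N = 32, and the function names WriteMagic and HasMagic are all assumptions made for the example.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical magic value: any fixed pattern that huff and unhuff agree on.
const std::uint32_t MAGIC_NUMBER = 0x48554646;   // the ASCII bytes "HUFF"

// Write the magic number as four bytes, most significant byte first.
void WriteMagic(std::ostream& out)
{
    for (int shift = 24; shift >= 0; shift -= 8)
    {
        out.put(static_cast<char>((MAGIC_NUMBER >> shift) & 0xFF));
    }
}

// Return true only if the next four bytes of in match the magic number.
bool HasMagic(std::istream& in)
{
    std::uint32_t value = 0;
    for (int k = 0; k < 4; k++)
    {
        char ch;
        if (!in.get(ch))
        {
            return false;    // fewer than four bytes: not a huffed file
        }
        value = (value << 8) | static_cast<unsigned char>(ch);
    }
    return value == MAGIC_NUMBER;
}
```

unhuff would call something like HasMagic first and print an error instead of decoding when it returns false. A file that merely happens to start with the same bits will still pass this check, which is why it remains possible to fool the program.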
CountCharFreqs
FreqsToTree
TreeToTable
You'll need to provide more functions and decide on what private data members are needed.
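To make the roles of FreqsToTree and TreeToTable concrete, here is one possible sketch written as free functions. It is not the required design: the HuffNode layout, the comparator HeavierThan, and the choice to store codes as strings of '0' and '1' are all assumptions for illustration, and your class will need its own data members and memory management.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Hypothetical node type; the real HuffNode declaration may differ.
struct HuffNode
{
    int value;          // symbol value, or -1 for an internal node
    long weight;        // total count of all symbols in this subtree
    HuffNode* left;
    HuffNode* right;
    HuffNode(int v, long w, HuffNode* l = nullptr, HuffNode* r = nullptr)
        : value(v), weight(w), left(l), right(r) {}
};

// Comparator that puts the node with the smallest weight on top of the queue.
struct HeavierThan
{
    bool operator()(const HuffNode* a, const HuffNode* b) const
    {
        return a->weight > b->weight;
    }
};

// Build a Huffman tree from counts; returns nullptr if every count is zero.
// Nodes are allocated with new and never freed in this sketch.
HuffNode* FreqsToTree(const std::vector<long>& freqs)
{
    std::priority_queue<HuffNode*, std::vector<HuffNode*>, HeavierThan> pq;
    for (std::size_t k = 0; k < freqs.size(); k++)
    {
        if (freqs[k] > 0)
        {
            pq.push(new HuffNode(static_cast<int>(k), freqs[k]));
        }
    }
    if (pq.empty())
    {
        return nullptr;
    }
    while (pq.size() > 1)       // repeatedly join the two lightest trees
    {
        HuffNode* a = pq.top(); pq.pop();
        HuffNode* b = pq.top(); pq.pop();
        pq.push(new HuffNode(-1, a->weight + b->weight, a, b));
    }
    return pq.top();
}

// Record the path to each leaf: '0' for a left branch, '1' for a right branch.
void TreeToTable(const HuffNode* t, const std::string& path,
                 std::map<int, std::string>& table)
{
    if (t == nullptr)
    {
        return;
    }
    if (t->left == nullptr && t->right == nullptr)
    {
        table[t->value] = path;
        return;
    }
    TreeToTable(t->left, path + "0", table);
    TreeToTable(t->right, path + "1", table);
}
```

Note that a tree with only one leaf gives that leaf an empty code in this sketch, a case you would need to handle separately (for example, when compressing an empty file).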
Some steps that may be useful in developing the program are described below. It's important to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you'll find it easier to isolate bugs and you'll be more likely to get a program working faster.
Note that there are 256 different 8-bit values, so you'll need a vector that implements 256 different counters. (There are 257 possible characters if you include the pseudo-EOF character described below.)
In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character.
Every time a file is compressed, the count of the number of times that pseudo-EOF occurs should be one; this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-EOF character with a count of 1 must be explicitly created.
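The counting step might look like the sketch below. HUFF_ALPH_SIZE and its value 257 come from the description of globals.h above; the constant is redefined here only so the example stands alone. The name PSEUDO_EOF and its value 256 are assumptions, as is the use of a plain istream for input.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Value taken from the description of globals.h; defined here only so the
// sketch is self-contained. Use the declaration in globals.h in real code.
const int HUFF_ALPH_SIZE = 257;
// Hypothetical name and value for the extra symbol after the 256 byte values.
const int PSEUDO_EOF = 256;

// Count how often each 8-bit value occurs in the input.
std::vector<long> CountCharFreqs(std::istream& in)
{
    std::vector<long> freqs(HUFF_ALPH_SIZE, 0);
    char ch;
    while (in.get(ch))
    {
        // Convert to unsigned char so values 128..255 index correctly.
        freqs[static_cast<unsigned char>(ch)]++;
    }
    freqs[PSEUDO_EOF] = 1;      // explicitly created, never read from the file
    return freqs;
}
```

The cast to unsigned char matters for binary files: on systems where char is signed, bytes above 127 would otherwise become negative indices.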
Pseudo-EOF character
The operating system buffers output; that is, output to disk actually occurs only when some internal buffer is full. In particular, it is not possible to write just a single bit to a file: all output is actually done in "chunks", e.g., in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can't be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program explicitly writes exactly 61 bits, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for these "extra" bits, when reading one bit at a time you cannot simply read bits until there are no more, since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use a loop that simply reads bits until a read fails.
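The loop to avoid has the shape `while (input.Readbits(1, bits)) { ... }` with no other exit, because it would also decode the padding bits. A loop that stops at the pseudo-EOF leaf instead is sketched below. The BitSource class is only a stand-in written for this example, not the ibstream from bitops.h, and Node, PSEUDO_EOF, and Decode are hypothetical names.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a bit-reading stream (NOT the real ibstream): serves bits from
// a byte vector, most significant bit first, and fails when bytes run out.
class BitSource
{
  public:
    explicit BitSource(const std::vector<unsigned char>& bytes)
        : myBytes(bytes), myPos(0) {}

    bool ReadBit(int& bit)
    {
        if (myPos >= myBytes.size() * 8)
        {
            return false;
        }
        bit = (myBytes[myPos / 8] >> (7 - myPos % 8)) & 1;
        myPos++;
        return true;
    }

  private:
    std::vector<unsigned char> myBytes;
    std::size_t myPos;
};

// Hypothetical tree node: a leaf has two null children.
struct Node
{
    int value;
    const Node* zero;
    const Node* one;
};

const int PSEUDO_EOF = 256;     // assumed value for the extra symbol

// Decode until the pseudo-EOF leaf is found. Returns false if the bits run
// out first, which means the file is not well-formed.
// Assumes root is an internal node.
bool Decode(BitSource& in, const Node* root, std::string& out)
{
    const Node* current = root;
    int bit;
    while (in.ReadBit(bit))
    {
        current = (bit == 0) ? current->zero : current->one;
        if (current == nullptr)
        {
            return false;       // malformed tree
        }
        if (current->zero == nullptr && current->one == nullptr)
        {
            if (current->value == PSEUDO_EOF)
            {
                return true;    // padding bits after this point are never read
            }
            out += static_cast<char>(current->value);
            current = root;
        }
    }
    return false;
}
```

With codes a = 0, b = 10, and pseudo-EOF = 11, the text "ab" is written as the bits 01011 followed by three padding bits. Decode stops after the fifth bit, whereas a read-until-failure loop would go on to decode the padding as three more a's.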
When a compressed file is written, the last bits written should be the bits that correspond to the pseudo-EOF char. You'll have to write these bits explicitly. These bits will be recognized by the program unhuff.cc and used in the decompression process. In particular, a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit-reading function returns false), the compressed file is not well-formed.
There are some suggestions in Weiss to avoid reading files twice, but you don't have to use them (you have to stop doing extra stuff at some point.)
In either case, you'll need to create and use a template.cc file with the proper class instantiation for each templated class used in your program (similar to what was done in the word-ladder assignment). A sample file template.cc is provided. You won't have to change this if you store pointers to HuffNodes in your priority queue.
To see how the Readbits routine works, note that the code segment below is functionally equivalent to the Unix cat foo command --- it reads BITS_PER_WORD bits at a time (which is 8 bits as shown in bitops.h) and echoes what is read.
    int inbits;
    ibstream ibs("foo");
    // Readbits returns false when no more bits can be read
    while (ibs.Readbits(BITS_PER_WORD, inbits))
    {
        cout.put(inbits);
    }
This code is similar to the loop shown below, which uses the standard get function to read a char from cin.
    char inbits;
    while (cin.get(inbits))
    {
        cout.put(inbits);
    }
Note that although Readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in the code above). To access just the right-most bit, a bitwise and (&) can be used: the expression (inbits & 1) is 1 exactly when the right-most bit is set.
Alternatively, the function KthBit can be used to extract a specific bit from an int; see the specification in bitops.h and note that the right-most bit is the first bit. In creating Huffman codes, you may find it useful to use the shiftleft operator (<<) and the bitwise or operator (|). Be careful in using shiftleft (and shiftright) because of potential confusion with the stream operators. If you fully parenthesize expressions, trouble can usually be avoided. An example program using the bit-reading classes is provided for you to study; it is called bitread.cc.
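The bit operations mentioned above can be sketched as follows. The helper names AppendBit, BitAt, and BitsToString are hypothetical. In particular, BitAt is not the KthBit from bitops.h; it only follows the same convention that the right-most bit is bit 1.

```cpp
#include <string>

// Append one bit (0 or 1) to the right end of code using << and |.
int AppendBit(int code, int bit)
{
    return (code << 1) | (bit & 1);
}

// Hypothetical helper: bit k of value, where k = 1 is the right-most bit.
int BitAt(int value, int k)
{
    return (value >> (k - 1)) & 1;
}

// Render the low numBits bits of code as text, left-most bit first.
std::string BitsToString(int code, int numBits)
{
    std::string s;
    for (int k = numBits; k >= 1; k--)
    {
        s += (BitAt(code, k) == 1) ? '1' : '0';
    }
    return s;
}
```

Because an int does not remember leading zeros, a code built this way must be stored together with its length: the codes 1 and 001 are both the int value 1. When printing, write `cout << (code << 1)` with the parentheses so that the shift is not parsed as a second stream insertion.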
When using Writebits to write a specified number of bits, some bits may not be written immediately because of buffering. To ensure that all bits are written you may need to call Flushbits. Flushbits is called automatically by the obstream destructor, so the bits are flushed if your program exits properly. You can, however, flush explicitly, although you probably don't need to.
Note that some buffering is done with Readbits as well, but you shouldn't need to worry about this. The function Flushbits only needs to be called once. If you call Flushbits more than once when writing the compressed file you will likely get erroneous bits when reading them back in. (Calling twice at the end of the program is ok, but do NOT flush in the middle of writing the compressed file.)
The array argv is an array of C-style strings. These strings are just pointers to characters, with a special NUL character '\0' to signify the end of a C-style string. You do NOT need to know this to manipulate these char * C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable, as shown in filenames.cc. Then you can use "standard" C++ string functions to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style char * strings.
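The sketch below shows one way to apply this advice to huff's arguments. It is not the code in filenames.cc: the struct HuffOptions and the function ParseArgs are hypothetical, and defaulting the output name to the input name plus ".H" is an assumption modeled on the foo.cc.H example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the parsed command line.
struct HuffOptions
{
    bool force;             // true if -f was the first argument
    std::string inName;
    std::string outName;
    bool ok;                // false if the arguments could not be understood
};

HuffOptions ParseArgs(int argc, char* argv[])
{
    // Copy each C-style string into a C++ string; argv[0] is the program name.
    std::vector<std::string> args;
    for (int k = 1; k < argc; k++)
    {
        args.push_back(std::string(argv[k]));
    }

    HuffOptions opts;
    opts.force = false;
    opts.ok = false;

    std::size_t next = 0;
    if (next < args.size() && args[next] == "-f")   // only the first may be -f
    {
        opts.force = true;
        next++;
    }

    std::size_t remaining = args.size() - next;
    if (remaining == 1 || remaining == 2)
    {
        opts.inName = args[next];
        // Assumed default: append ".H" when no output name is given.
        opts.outName = (remaining == 2) ? args[next + 1] : opts.inName + ".H";
        opts.ok = true;
    }
    return opts;
}
```

In main, you would call ParseArgs(argc, argv) and print a usage message when ok is false. Comparing with == and concatenating with + work here only because the arguments were first copied into C++ strings.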
submit100e huff huff.cc unhuff.cc huffstuff.h Makefile ... README
where the Makefile can be used to make the executables huff and unhuff, and the README file includes an explanation of how information is stored in compressed files so that uncompression works. The README file should include the standard information about how much time you spent on different parts of the assignment, what was frustrating or enjoyable about the assignment, and a list of collaborators with whom you worked (if any). Be sure to submit any other files that are needed by your program, including the bitops files ONLY if you modified them in some way.
| description | points |
|---|---|
| compression of any text file | 8 points |
| compression of any file (including binary files) | 3 points |
| decompression | 5 points |
| user-interface (how easy and intuitive program is to use) | 3 points |
| robustness (does unhuff program crash on non-huffed files?) | 3 points |
| program style (class design/implementation, program design) | 8 points |