CPS 100, Huffman Coding

Huffman Codes
CPS 100, Fall 1995

Final programs due: December 7, 8:00 am
You are STRONGLY urged to work with someone and turn in one submission

Introduction

This assignment involves implementing a compression/uncompression program based on the greedy Huffman algorithm discussed in Weiss (380 -- 384) and in class.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as the standard Unix program compress which uses a different algorithm permitting a higher degree of compression than Huffman coding. You should take care to develop the program in an incremental fashion. You might be asked to provide a GUI interface to it, so the program should be simple to modify. Your program will also need to accept command-line parameters that specify the filename (rather than prompting the user or reading from cin).

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, an ``iterative enhancement'' development of the program is suggested below. You are by no means constrained to adopt this approach, but it might prove useful if your final program doesn't work to show useful and working programs along the way.

Assignment Details

The files for this assignment can be found in ~ola/cps100/huff You should copy these files to get started. In particular, the file globals.h provides declarations needed by both huff and unhuff programs (compression/uncompression programs). The file filenames.cc shows (very rudimentarily) how to parse command line arguments.

You are to write two programs: huff.cc and unhuff.cc that are used (respectively) to compress and uncompress files using Huffman coding. Both programs may eventually read from a file specified by the user on the command line, but initially the user can be prompted for the program or you can read input from cin and use I/O redirection to read from a file (if this doesn't make sense, ask). The program huff.cc should write its (compressed) output to a file also specified in some way (by user, on command line, or to cout). The line below might compress the file foo.cc storing the compressed version in foo.cc.H .

               huff foo.cc foo.cc.H

If your program reads/writes from cin/cout this could be done using:

               huff < foo.cc > foo.cc.H

The program huff.cc

If compressing a file would result in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be printed indicating that this is the case. The user should have the option of invoking huff with an argument -f which forces compression even when the compressed file is bigger. For example:

      teer4> huff myprog.c mprog.c.H
      no compression yielded no output file written
      teer4> huff -f myprog.c myprog.c.H

where the second time huff is executed the compression is forced because of the -f argument. Only the first argument to the program can be the -f argument. We'll talk about processing command line arguments (and see filenames.cc ), you'll need to use C-style strings to do this. Don't spend time worrying how to process command-line arguments until you're sure the program works, this isn't the most important part of the program.

The program unhuff.cc

The program should be robust enough not to crash if it is given a program to uncompress that wasn't compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program. There are a variety of methods that you can use to ensure this works, but it will most likely always be possible to ``fool'' the program.

Developing the programs

You may want to develop the programs in stages to ease debugging and decrease program development time. One approach is suggested below, but modifications of this approach are certainly possible.

Create a program huff.cc that reads 8 bits at a time (using bit-reading routines in bitops.cc --- see below) and counts how many times each 8-bit sequence occurs, printing out the results. Since Huffman coding requires counts of each ``character'' this is a useful first stage that can be easily verified as correct.
Build the weighted binary Huffman tree used to compress the file from the table of character counts. Print out the code (and length) for each character. Using small examples (like the ones we do in class) verify that the tree is correct.
Read the file again and write a coded version of the file using ascii-represented integers (rather than bits). You will also need to write out the table of codes and lengths in order to be able to create a program unhuff.cc --- write this program so that it can ``decompress'' an ascii-Huffman coded file. You may find that writing debugging routines such as PrintTable is very helpful. You can call routines from within the debugger gdb by writing:
```
                  call PrintTable()
```
Note that writing unhuff.cc will require a routine to create the huffman tree from the table of code lengths and numbers.
Modify huff.cc so that it calculates whether compression will result in saving space.
Change the programs so that bits are written (and read) using the routines in bitops.cc instead of ascii-coded Huffman codes. You're now finished!

Some Details

You will need to create an array to store ``character'' counts (character is in quotes because counts of 8-bit sequences are counted, but there are 8 bits in a C/C++ char). Each element of this array will be intialized to zero, the final counts will be the weights of the one node trees in the forest of trees that's used to construct the Huffman tree.

In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used to represent end-of-file using a pseudo-EOF character.

Every time a file is compressed the count of the the number of times that pseudo-EOF occurs should be one --- this should be done explicitly in the part of huff.cc that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. These bits will be recognized by the program unhuff.cc and used in the decompression process. In particular, when using unhuff a well-formed compressed file will be terminated with the encoded form of the pseudo-EOF char and a real EOF will NOT be encountered (think about this). In other words, when decompressing (unhuffing) you'll read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate. If reading a bit fails (returns EOF from bit reading operations) the compressed file is not well-formed.

For decompression to work with Huffman coding, information must be stored in the compressed file that allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, it's possible to store just counts and recreate the codes. It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable. If you use a successful space-saving technique you can earn several points of extra credit (up to 5). There are some suggestions in Weiss for this.

Reading Files Twice

Finally, because Huffman coding requires two passes over the input file you will need to use the method Rewind that works with object variables of type ibstream --- see the documentation in bitops.h . The effect of ib.Rewind() is to allow the input bit stream ib to be ``read again'', i.e., a loop that checks for EOF will be able to read the entire file again --- see the example file bitread.cc.

Priority Queues

You should use (or modify) the templated priority queue class given in Weiss (page 643 in Chapter 20). All code from the book can be found in ~ola/weiss . Alternatively, a templated priority queue developed for the class can be used. The difference is that the Weiss priority queue requires that the template parameter represent an ordered type (so that min can be determined). The code provided in pqueue.cc and pqueue.h has an int priority that is not "part" of the class/type being inserted into the priority queue.

In either case, you'll need to either include the source code (you can #include this from the .h header file) or you'll need to create a template.cc file to instantiate the templated classes.

What to submit

Submit this program using

  submit100 huff huff.cc unhuff.cc Makefile etc. README

where the Makefile can be used to make executables huff and unhuff and the README file includes an explanation as to how information is stored in compressed files so that uncompression works. The README file should include the standard information about how much time you spent on different parts of the assignment, what was frustrating or enjoyable about the assignment and a list of collaborators with whom you worked (if any). Be sure to submit any other files that are needed by your program including the bitops files ONLY if you modified them in some way.

Using bitops.cc

In order to read and write in a bit-at-a-time manner, the file bitops.cc and the corresponding header file bitops.h must be used. Two classes are specified for reading/writing bits-at-a-time: ibstream and obstream, respectively.

Bit read/write subprograms

To see how the Readbits routine works, note that the code segment below is functionally equivalent to the Unix cat foo command --- it reads BITS_PER_WORD bits at a time (which is 8 bits as shown in bitops.h) and echoes what is read.

    int inbits;
    ibstream ibs("foo");
    while ( (inbits = ibs.Readbits(BITS_PER_WORD)) != EOF)
    {
        cout.put(inbits);
    }

this code is similar to the loop shown below which uses the standard get function to read from cin.

    int inbits;
    while ((inbits = cin.get()) != EOF)
    {
        cout.put(inbits);
    }

Note that although Readbits can be called to read a single bit at a time (by setting the int parameter to 1), the result returned by the function is an int. In order to access just the right-most bit a bitwise and & can be used:

    int inbits;
    ifstream ibs("filename");
    inbits = ibs.Readbits(1);
    if ( (inbits & 1) == 1)
        // read the bit 1 
    else
        // read the bit 0

Alternatively, the function KthBit can be used to extract a specific bit from an int --- see the specification in bitops.h and note that the right-most bit is the first bit. You may find it useful in creating Huffman codes to use the shiftleft operator: <<. and the bit-wise or operator: |. Be careful in using shiftleft (and shiftright) because of potential confusion with stream operators. If you fully parenthesize expressions trouble can usually be avoided. An example program using the bitreading classes is provided for you to study, it is called bitread.cc.

Writing Everything

Executing the C++ statement cout.put(7) results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits correspond to the character '7'). Executing cout << 7. results in 32 bits being written because a C/C++ int uses 32 bits. Executing obs.Writebits(7,3) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 111.

When using Writebits to write a specified number of bits, some bits may not be written because of some buffering that takes place. To ensure that all bits are written you should call Flushbits. Note that some buffering is done with Readbits as well, but you shouldn't need to worry about this. The function Flushbits only needs to be called once.

Grading

**Huffman Coding Grading Standards**
description	points
compression of any text file	8 points
compression of any file	3 points
decompression	5 points
user-interface	3 points
robustness	3 points
program style	3 points

Huffman Codes CPS 100, Fall 1995