CPS 100, Fall 2000, Huffman Coding


See the FAQ
You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and login id of each person in your group in the README file that belongs with the submission. Do not wait until the last minute to start this assignment.

You will implement two programs:

So applying unhuff to a file compressed by huff must exactly reproduce the original file. This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this online explanation.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program! Design, develop, and test so that you have a working program at each step.

Build a program by adding working and tested pieces.


Getting Started

The following files are primarily examples showing how to use some of the tools needed for this assignment. These files can be found in ~ola/cps100/huff. You may want to copy these files to get started. The file bitread.cpp shows how to read and echo a file a bit-at-a-time. The file showargs.cpp shows how to use command lines arguments. The file globals.h provides constants needed by both programs.

huff: The Compression Program

You must write huff, a program that compresses a file so that it can be uncompressed by unhuff. To do this, you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file, or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff.


unhuff: The Decompression Program

You must also write unhuff, a program that uncompresses any compressed files created by your huff program. The program unhuff must be able to recreate the huffman tree which was used to compress the file originally.

The first thing unhuff should do is verify that the file being uncompressed was originally compressed by your huff program. In particular, your program should not crash if it is given a program to uncompress that was not compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program.

One easy way to ensure that both programs work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h. A sample program which uses bitops.cpp is given in bitread.cpp.

Storing the Huffman Tree

For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, its possible to store just character counts and recreate the codes from the counts. Its also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
   0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that are represent 97, the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32, the ASCII value of a space. Then next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116.

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero.

Program Interface

Both programs need to read from a file specified by the user. You can prompt the user for the file name, but the program must also accept the name of the file given on the command-line (see below).

The huff program should write its compressed output to a file generated from the original file name by adding the suffix .hf to the file name. For example, the command below should compress the file foo storing the result in foo.hf.

    huff foo

Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be printed indicating that this is the case.

The user should have the option of invoking huff with an argument -f which forces compression even when the compressed file is bigger. For example:

      prompt< huff foo
      no compression yielded, no output file foo.hf written
      prompt< huff -f foo
Here, the second time huff is executed the compression is forced because of the -f argument. It's fine for your program to accept the -f command-line argument only when it appears as the first argument.

Finally, if the user specifies the -h option, your program should output the options that it supports and then exit. This allows the user to see how you expect the program to be used.

Determining If Compression Happens

To determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

Coding and Algorithmic Details

There are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.

Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.

     int k = 'A';
     cout << char(k);

Command line arguments

The program showargs.cpp shows how command line arguments can be read and processed by your program. Do not spend time worrying how to process command-line arguments until you are sure the program works, this is not the most important part of the program. You can add command-line parameters after everything works.

Pseudo-EOF character

(for more details, see the complete huffman coding discussion.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

int bits; while (input.readbits(1, bits)) { // process bits } To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file using such a technique. int bits; while (true) { if (! input.readbits(1, bits)) { cerr << "should not happen! trouble reading bits" << endl; } else { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // the character is pseudo-EOF, then decompression done if ( (bits & 1) == 0) // read a 0, go left in tree else // read a 1, go right in tree if (at leaf-node in tree) { if (leaf-node stores pseudo-eof char) break; // out of loop else write character stored in leaf-node } } } When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized by the program unhuff and used in the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit reading function returns false) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test unhuff with plain (uncompressed) files.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character [the value of PSEUDO-EOF could be 256, but it's set for 257 for historical reasons]. Your program should be able to work with n-bit chunks, not just 8-bit chunks.

Priority Queues

The easiest way to build the Huffman tree is to use a priority queue since it supports getmin and deletemin operations which are used in creating the Huffman tree. It always returns the minimum count character (or subtree of characters). When more than one character has the same count, it picks one to return (it always picks them in the same order, but it does so in no particular order). It turns out this is not a problem for the decompressor as long as both programs use the same implementation of a priority queue.

A heap-based, templated priority queue class tpqueue is provided (in the standard location with libtapestry files, see tpq.h. and tpq.cpp.). The program tpqtest.cpp shows how to use the class with pointers.

Creating a Table from a Huffman-tree

(for more details, see the complete huffman coding discussion.)

To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:

Implementing and Debugging

It's probably a good idea to create several classes to help manage the complexity in these programs. Because the same data structures need to used to ensure that a file compressed using your huff algorithm can be decompressed, you should be able to share several parts of the implementation. You can use classes to exploit this similarity. For example, in writing huff you can implement several classes, usually a class will take care of one part of the Huffman algorithm. These classes should also provide support for the decompression program unhuff.

Some ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.

You will probably find it useful to implement debugging functions in your classes. These are functions you use in the debugging stages, but not in the final version of the program. For example, you could implement CharCounter::print to print all characters and frequencies read from a source file --- this could use the code fragment shown above with the CharCounter description. Although you could write such a printing/testing function outside the class, encapsulating the testing/debugging functions within the class makes the class more coherent.

Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.

Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.

You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.

We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing

Using bitops.cpp

In order to read and write in a bit-at-a-time manner, two classes are declared in the header file bitops.h: ibstream and obstream, respectively, for reading and reading.

Bit read/write subprograms

To see how the readbits routine works, note that the code segment below is functionally equivalent to the Unix command cat foo --- it reads BITS_PER_WORD bits at a time (which is 8 bits as defined in bitops.h) and echoes what is read. int inbits; ibstream ibs("data/poe.txt"); while (ibs.readbits(BITS_PER_WORD, inbits)) { cout.put(inbits); // put writes one character }

this code is similar to the loop shown below which uses the standard get function to read a char from a stream.

int inbits; ifstream input("data/poe.txt"); while (input.get(inbits)) { cout.put(inbits); } EAT Note that executing the C++ statement cout.put('7') results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits correspond to the character '7'). Executing cout << 7. results in 32 bits being written because a C/C++ int uses 32 bits. Executing obs.writebits(3,7) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 000111.

When using writebits to write a specified number of bits, some bits may not be written immediately because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flushbits is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly. If, for some reason you create an obstream object referenced by a pointer, make sure to delete the object so that its destructor is called.

Although readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:

int inbits; ifstream ibs("filename"); ibs.readbits(1, inbits); if ((inbits & 1) == 1) // do stuff because the bit read was 1 else // do stuff because the bit read was 0 Alternatively, you can mod by 2, e.g., inbits % 2 and check to see if the remainder is 0 or 1 to determine if the right-most bit is 0 or 1. Using bitwise-and is faster than using mod.

An example program using the bitreading classes is provided for you to study, it is called bitread.cpp. Because Huffman coding requires two passes over the input file you will need to use the method rewind that works with object variables of type ibstream.

The effect of ib.rewind() is to allow the input bit stream ib to be ''read again'', i.e., a loop that reads until no more bits are available (checking the return-value of readbits) will be able to read the entire file again.

Command-line Arguments

When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. For example,
 
      g++ -g -o foo foo.cpp
invokes the g++ compiler (program) with four command-line arguments: -g, -o, foo, and foo.cpp. Command-line arguments are passed to the program in a C-style array of strings along with a count of the number of command-line arguments passed. The header for main must be altered in a program that processes command-line arguments (see HowtoA in Tapestry for more details.) The program showargs.cpp reproduced below illustrates how command-line parameters can be processed. #include <iostream> #include <string> using namespace std; int main(int argc, char * argv[]) { if (argc == 1) { cout << "program " << argv[0] << " has no command-line args" << endl; } else { int k; string arg; cout << argv[0] << " has " << argc << " command-line args:" << endl; for(k=1; k < argc; k++) { arg = argv[k]; // always use string to access argv[k] cout << k << "-th arg is: " << arg << endl; } } } Here are two runs:
prompt> showargs

program showargs has no command-line args

prompt> showargs -o fish -b fowl

showargs has 5 command-line args:
1-th arg is: -o
2-th arg is: fish
3-th arg is: -b
4-th arg is: fowl

As shown in the code fragment above, and in the sample file showargs.cpp, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in showargs.cpp. Then you can use the C++ string functions you are familiar with to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style strings.

Grading

Grading

The programs are worth 40 points.
Huffman Coding Grading Standards
description points
compression of any text file 10 points
compression of any file (including binary files) 5 points
decompression 10 points
robustness (does unhuff program crash on non-huffed files?) 5 points
program style (class design/implementation, program design) 10 points

Extra Credit

For extra credit, in addition to compressing a single file the huff/unhuff programs should be able to compress an entire directory (flat, not recursively). This means your program should process the name of a directory and then compress all the files in the directory, creating a single huffed file that stores each compressed file.

When this single file is unhuffed, all the files originally compressed should be uncompressed/unhuffed.

You'll need to store information about file names in the huffed file to know what to use for names when unhuffing. The original names must be used when uncompressing.


Be sure to test the program by uncompressing into an empty directory so that you don't overwrite files accidentally.

This extra credit is worth 10 points.

Submit

To submit use
 submit_cps100 huff README *.h *.cpp Makefile

The Makefile should be able to handle both make huff and make unhuff. The sample Makefile in ~ola/cps100/huff will do this. You can alter the different lines of the Makefile as you need to. Don't forget to run make depend if you change the Makefile.
Owen L. Astrachan
Last modified: Mon Nov 20 16:23:02 EST 2000