CPS 100 Fall 2002: Assignment #7: Huffman Coding

Due: Thursday, Nov. 21 by 11pm

LATE DAY EXTENDED: 10% off by Monday, Nov 25, 11pm

LAST DAY TO TURN IN: 25% off by Monday, Dec 2, 11pm

50 points

Overview

Note: You are encouraged to work with one other person in the class on this assignment to engage in "pair programming". Pair programming means you must work together the whole time. You sit down at one computer, one of you types (the driver) and the other one watches closely, makes suggestions and looks for mistakes (the observer). If you do this, you will make one submission with both names in the README file.
If you work in a pair, you must both submit an evaluation of the pair. See the final pair evaluation page.
See the Notes

Do not wait until the last minute to start this assignment. There will be no extensions. The program is being handed out early.


Overview

You will implement two compression/uncompression programs:

Your executables must have those names, the Makefile is set up for this to happen, though you'll have to add the right .cpp files to the Makefile.

You're also required to create two test programs you must submit. More details on these below. The test program names are

Applying unhuff to a file compressed by huff must exactly reproduce the original file. This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this detailed online explanation.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!

Design, develop, and test so that you have a working program at each step.

Build a program by adding working and tested pieces.

To help you with this mode of programming you'll be required to submit two test programs described below. You should complete these test programs as part of the development process you use in creating the huff and unhuff programs.

Getting Started

The following files are primarily examples showing how to use some of the tools needed for this assignment. These files can be found in ~rodger/cps100/huff on the acpub system. You may want to copy these files to get started. The file bitread.cpp shows how to read and echo a file a bit-at-a-time (or several bits-at-a-time).

The file showargs.cpp shows how to use command lines arguments. The file globals.h provides constants needed by both programs.


huff: The Compression Program

You must write huff, a program that compresses a file so that it can be uncompressed by unhuff. The executable is named huff, but you'll probably use several .cpp and .h files to create the executable. The classes and functions you write are completely up to you, but a good design will receive maximal points.

To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.

First, the process of Huffman compression is discussed, then some details about the program follow.

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.

Huff Program Details

Your executable program must be named huff. When it compresses a file named foo, it must store the compressed version in a file named foo.hf. In general, any compressed file has a .hf suffix and otherwise the same name as the original program.

See below for command-line arguments that affect the huff compression program.


unhuff: The Decompression Program

You must also write unhuff, a program that uncompresses any compressed files created by your huff program. The program unhuff must be able to recreate the huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. The executable must be named unhuff, but you'll use several .cpp and .h files to create this executable.

The first thing unhuff should do is verify that the file being uncompressed was originally compressed by your huff program. In particular, your program should not crash if it is given a program to uncompress that was not compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program.

Magic Numbers

One easy way to ensure that both programs work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h. A sample program which uses bitops.cpp is given in bitread.cpp.

Storing the Huffman Tree

For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, its possible to store just character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
   0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that are represent 97, the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32, the ASCII value of a space. Then next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116.

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero.


Program Interface

Both programs need to read from a file specified by the user. You can prompt the user for the file name, but the program must also accept the name of the file given on the command-line with the --file option (see below).

The huff program should write its compressed output to a file generated from the original file name by adding the suffix .hf to the file name. For example, the command below should compress the file foo storing the result in foo.hf.

    huff --file foo

The unhuff program should write its uncompressed output to a file with a .uhf suffix. This means uncompressing foo.hf should create foo.hf.uhf (which should be the same as the file named foo with no suffix.)

Specifying the File

The command-line flag --file is followed by the name of the file to be compressed (or uncompressed). If there is no --file argument the user should be prompted for the name of the file.

Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be printed indicating that this is the case.

However, the user should have the option of invoking huff with an argument --force which forces compression even when the compressed file is bigger. For example:

      prompt> huff --file foo
      no compression yielded, no output file foo.hf written
      prompt> huff --force --file foo
Here, the second time huff is executed the compression is forced because of the --force argument.

No Suffix

In the unhuff program, the default behavior is to uncompress a file named foo.hf into foo.hf.uhf. However, with the --nosuffix command-line option no suffix will be added. Instead, uncompressing foo.hf should create foo (the original file). Do NOT attempt this option until you're sure your compression program works.

The line below should create a file named foo that's the uncompressed version of foo.hf. Without the --nosuffix option the file created would be named foo.hf.uhf.

prompt> unhuff --nosuffix --file foo.hf

User-accessible Help

Finally, if the user specifies the --help option, your program should output the options that it supports and then exit. This allows the user to see how you expect the program to be used. For example, (should work with unhuff program as well):
prompt> huff --help

Huffman compression, options are:
--force create compressed file even if larger (huff only)
--nosuffix (unhuff) don't use .uhf suffix
--help print this help
--file followed by file to be compressed (or uncompressed)

Determining If Compression Happens

To determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

Coding and Algorithmic Details

There are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.

Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.

     int k = 'A';
     cout << char(k);

The sections below illustrate the following ideas.

Pseudo-EOF character

(for more details, see the complete huffman coding discussion.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

int bits; while (input.readbits(1, bits)) { // process bits } To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file using such a technique. int bits; while (true) { if (! input.readbits(1, bits)) { cerr << "should not happen! trouble reading bits" << endl; } else { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // the character is pseudo-EOF, then decompression done if ( (bits & 1) == 0) // read a 0, go left in tree else // read a 1, go right in tree if (at leaf-node in tree) { if (leaf-node stores pseudo-eof char) break; // out of loop else write character stored in leaf-node } } } When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized by the program unhuff and used in the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit reading function returns false) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test unhuff with plain (uncompressed) files.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZE which has value 257. Although only 256 values can be represented by 8 bits, one character is used as the pseudo-EOF character [the value of PSEUDO-EOF could be 256, but it's set for 257 for historical reasons]. Your program should be able to work with n-bit chunks, not just 8-bit chunks.

Priority Queues

The easiest way to build the Huffman tree is to use a priority queue since it supports getmin and deletemin operations which are used in creating the Huffman tree. It always returns the minimum count character (or subtree of characters). When more than one character has the same count, it picks one to return (it always picks them in the same order, but it does so in no particular order). It turns out this is not a problem for the decompressor as long as both programs use the same implementation of a priority queue.

A heap-based, templated priority queue class tpqueue is provided (in the standard location with libtapestry files, see tpq.h. and tpq.cpp.). The program tpqtest.cpp shows how to use the class with pointers.

Creating a Table from a Huffman-tree

(for more details, see the complete huffman coding discussion.)

To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:

Implementing and Debugging

It's a good idea to create several classes to help manage the complexity in these programs. Because the same data structures need to used to ensure that a file compressed using your huff algorithm can be decompressed, you should be able to share several parts of the implementation. You can use classes to exploit this similarity. For example, in writing huff you can implement several classes, usually a class will take care of one part of the Huffman algorithm. These classes should also provide support for the decompression program unhuff.

Some ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.

Debugging Code

You will probably find it useful to implement debugging functions in your classes. These are functions you use in the debugging stages, but not in the final version of the program. For example, you could implement CharCounter::print to print all characters and frequencies read from a source file --- this could use the code fragment shown above with the CharCounter description. Although you could write such a printing/testing function outside the class, encapsulating the testing/debugging functions within the class makes the class more coherent.

Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.

Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.

You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.

We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing

Using bitops.cpp

In order to read and write in a bit-at-a-time manner, two classes are declared in the header file bitops.h: ibstream and obstream, respectively, for reading and reading.

Bit read/write subprograms

To see how the readbits routine works, note that the code segment below is functionally equivalent to the Unix command cat foo --- it reads BITS_PER_WORD bits at a time (which is 8 bits as defined in bitops.h) and echoes what is read. int inbits; ibstream ibs("data/poe.txt"); while (ibs.readbits(BITS_PER_WORD, inbits)) { cout.put(inbits); // put writes one character }

this code is similar to the loop shown below which uses the standard get function to read a char from a stream.

int inbits; ifstream input("data/poe.txt"); while (input.get(inbits)) { cout.put(inbits); }

Note that executing the C++ statement cout.put('7') results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits correspond to the character '7'). Executing cout << 7. results in 32 bits being written because a C/C++ int uses 32 bits. Executing obs.writebits(3,7) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 000111.

When using writebits to write a specified number of bits, some bits may not be written immediately because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flushbits is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly. If, for some reason you create an obstream object referenced by a pointer, make sure to delete the object so that its destructor is called.

Although readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:

int inbits; ifstream ibs("filename"); ibs.readbits(1, inbits); if ((inbits & 1) == 1) // do stuff because the bit read was 1 else // do stuff because the bit read was 0 Alternatively, you can mod by 2, e.g., inbits % 2 and check to see if the remainder is 0 or 1 to determine if the right-most bit is 0 or 1. Using bitwise-and is faster than using mod.

An example program using the bitreading classes is provided for you to study, it is called bitread.cpp. Because Huffman coding requires two passes over the input file you will need to use the method rewind that works with object variables of type ibstream.

The effect of ib.rewind() is to allow the input bit stream ib to be ''read again'', i.e., a loop that reads until no more bits are available (checking the return-value of readbits) will be able to read the entire file again.

Command-line Arguments

The program showargs.cpp shows how command line arguments can be read and processed by your program. Do not spend time worrying how to process command-line arguments until you are sure the program works, this is not the most important part of the program. You can add command-line parameters after everything works.

When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. For example,

 
      g++ -g -o foo foo.cpp
invokes the g++ compiler (program) with four command-line arguments: -g, -o, foo, and foo.cpp. Command-line arguments are passed to the program in a C-style array of strings along with a count of the number of command-line arguments passed. The header for main must be altered in a program that processes command-line arguments (see HowtoA in Tapestry for more details.) The program showargs.cpp illustrates how command-line parameters can be processed.

Here are two runs:

prompt> showargs

program showargs has no command-line args

prompt> showargs -o fish -b fowl

showargs has 5 command-line args:
1-th arg is: -o
2-th arg is: fish
3-th arg is: -b
4-th arg is: fowl

As shown in the code fragment above, and in the sample file showargs.cpp, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in showargs.cpp. Then you can use the C++ string functions you are familiar with to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style strings.

Grading

The programs are worth 50 points.
Huffman Coding Grading Standards
description points
compression of any text file 10 points
compression of any file (including binary files) 5 points
decompression 10 points
robustness (does unhuff program crash on non-huffed files?) 5 points
program style (class design/implementation, program design) 10 points
test programs 10 points

Extra Credit

Directory Compression

For extra credit, in addition to compressing a single file the huff/unhuff programs should be able to compress an entire directory (flat, not recursively). This means your program should process the name of a directory and then compress all the files in the directory, creating a single huffed file that stores each compressed file.

When this single file is unhuffed, all the files originally compressed should be uncompressed/unhuffed.

You'll need to store information about file names in the huffed file to know what to use for names when unhuffing. The original names must be used when uncompressing (though with the .uhf suffix unless you change this with the --nosuffix flag).

You must note in your README that you've done this extra credit assignment and it must be executed using a --dir flag. This means when invoked as shown below, all the files in directory foodir will be compressed and a single archive file named foodir.hf will be created.

 prompt> huff --dir foodir
Be sure to test the program by uncompressing into an empty directory so that you don't overwrite files accidentally.

For information on reading the files in a directory, see the classes DirStream and DirEntry described in Chapter 10 of Tapestry.

This extra credit is worth 10 points and must be turned in with your program. Include a note in your README that you did the extra credit.

More Compression

For an extra 20 points, read this description of LZW or Lempel/Ziv/Welch compression and implement it. This extra credit is turned in separately and must be submitted by Wednesday, Dec. 4, 2002 at 11pm.

Submit

If you worked in a pair, see the final pair evaluation page.

To submit use

 submit_cps100 huff README *.h *.cpp Makefile

The Makefile should be able to handle both make huff and make unhuff. The sample Makefile in ~rodger/cps100/huff will do this. You can alter the different lines of the Makefile as you need to. Don't forget to run make depend if you change the Makefile.

Extra Credit Submission

You can submit LZW extra credit by Dec. 4, 2002.

  submit_cps100 huffextra README *.h *.cpp Makefile

Susan H. Rodger
Last modified: Sat Nov 2 14:09:05 EST 2002