CPS 100, Spring 2004, Huffman Coding

See the FAQ

You are urged to work in groups of two. Each group should submit ONE program per group. Be sure to include name and login id of each person in your group in the README file that belongs with the submission. Do not wait until the last minute to start this assignment.

Overview

You will implement two compression/uncompression programs:

huff: compresses a file
unhuff: uncompresses a file compressed by huff

Your executables must have those names (huff and unhuff), the Makefile is set up for this to happen, though you'll have to create files with the proper names and perhaps modify the Makefile.

You're also required to create two test programs you must submit. More details on these below. The test program names are

testcount.cpp, uses an ibstream to count 8-bit chunks/characters.
testhuff.cpp, prints huffman codes for 8-bit chunks/characters.

Applying unhuff to a file compressed by huff must exactly reproduce the original file.

Huffman Coding Overview

This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this detailed online explanation of Huffman Coding.

The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.

Programming Advice

Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!

Design, develop, and test so that you have a working program at each step.

Build a program by adding working and tested pieces.

To help you with this mode of programming you'll be required to submit two test programs described below. You should complete these test programs as part of the development process you use in creating the huff and unhuff programs.

Getting Started

Information on files provided to you and how to access them is accessible via the snarf page

`huff`: The Compression Program

You must write huff, a program that compresses a file so that it can be uncompressed by unhuff. The executable is named huff, but you'll probably use several .cpp and .h files to create the executable. The classes and functions you write are completely up to you, but a good design will receive maximal points.

To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.

First, the process of Huffman compression is discussed, then some details about the program follow.

To compress a file, count how many times every character occurs in a file. These counts are used to build weighted nodes that will be leaves in the Huffman tree.

character

8-bit chunk

Do not use any variables of type char in your program.

From these counts, build the Huffman tree. To do this, first create one node per character, weighted with the number of times the character occurs, insert each node into a priority queue. Then choose two minimal nodes, join these nodes together as children of a newly created node, and insert the newly created node into the priority queue. The new node is weighted with the sum of the two minimal nodes taken from the priority queue. Continue this process until only one node is left in the priority queue. This is the root of the Huffman tree.

Create a table or map of characters (8-bit chunks) to codings. The table of encodings is formed by traversing the path from the root of the Huffman tree to each leaf, each root-to-leaf path creates an encoding for the value stored in the leaf. When going left in the tree append a zero to the path; when going right append a one. All characters/encoding bit pairs may be stored in some kind of table or map to facilitate easy retrieval later.
The table can be a vector of the appropriate size (roughly 256, but be careful of PSEUDO_EOF) or you can use a tmap subclass. A vector is the simplest approach for this part of the huff/compress process.

Finally, read the input file a second time. For each character/8-bit chunk read, write the encoding of the character (obtained from the map of encodings) to the compressed file.

To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.

Huff Program Details

Your executable program must be named huff. When it compresses a file named foo, it must store the compressed version in a file named foo.hf. In general, any compressed file has a .hf suffix and otherwise the same name as the original program.

See below for command-line arguments that affect the huff compression program.

`unhuff`: The Decompression Program

You must also write unhuff, a program that uncompresses any compressed files created by your huff program. The program unhuff must be able to recreate the huffman tree which was used to compress the file originally, then use this tree to recreate the original (uncompressed) file. The executable must be named unhuff, but you'll use several .cpp and .h files to create this executable.

The first thing unhuff should do is verify that the file being uncompressed was originally compressed by your huff program. In particular, your program should not crash if it is given a program to uncompress that was not compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program.

Magic Numbers

One easy way to ensure that both programs work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h. A sample program which uses bitops.cpp is given in bitread.cpp.

Storing the Huffman Tree

For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, its possible to store just character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).

For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)

0 0 1 001100001 1 000100000 1 001110100

The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that represent 97 (001100001 is 97 in binary), the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32 (000100000 is 32 in binary), the ASCII value of a space. The next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116 (001110100 is 116 in binary).

Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero.

Program Interface

Both programs need to read from a file specified by the user. You can prompt the user for the file name, but the program must also accept the name of the file given on the command-line with the --file option (see below).

The huff program should write its compressed output to a file generated from the original file name by adding the suffix .hf to the file name. For example, the command below should compress the file foo storing the result in foo.hf.

    huff --file=foo

The unhuff program should write its uncompressed output to a file with a .uhf suffix. This means uncompressing foo.hf should create foo.hf.uhf (which should be the same as the file named foo with no suffix.)

Specifying the File

The command-line flag --file is followed by the name of the file to be compressed (or uncompressed). If there is no --file argument the user should be prompted for the name of the file.

    huff --file=foo

Forcing Compression

If compressing a file results in a file larger than the file being compressed (this is always possible) then no compressed file should be created and a message should be printed indicating that this is the case. Make sure you don't create an empty file in this case, don't open a file (obstream, for example) until you know you want to write it. Or, remove the file if you need to --- see the faq for details.

However, the user should have the option of invoking huff with an argument --force which forces compression even when the compressed file is bigger. For example:

      prompt> huff --file=foo
      no compression yielded, no output file foo.hf written
      prompt> huff --force --file=foo

Here, the second time huff is executed the compression is forced because of the --force argument.

No Suffix

In the unhuff program, the default behavior is to uncompress a file named foo.hf into foo.hf.uhf. However, with the --nosuffix command-line option no suffix will be added. Instead, uncompressing foo.hf should create foo (the original file). Do NOT attempt this option until you're sure your compression program works.

The line below should create a file named foo that's the uncompressed version of foo.hf. Without the --nosuffix option the file created would be named foo.hf.uhf.

prompt> unhuff --nosuffix --file=foo.hf

User-accessible Help

Finally, if the user specifies the --help option, your program should output the options that it supports and then exit. This allows the user to see how you expect the program to be used. For example, (should work with unhuff program as well):

prompt> huff --help

Huffman compression, options are:
--force create compressed file even if larger (huff only)
--nosuffix (unhuff) don't use .uhf suffix
--help print this help
--file followed by file to be compressed (or uncompressed)

Determining If Compression Happens

To determine if compression results in a smaller file, you'll need to calculate the number of characters/chunks in the original file (your program will compute this by determining character/chunk counts). The size of the compressed file can be calculated from the same counts using the size of each character's encoded number of bits. You must also remember to calculate the file-header information stored in the compressed program. To be more precise, if there are 52 A's, and each A requires 4 bits to encode, then the A's contribute 52*4 = 108 bits to the compressed file. You'll need to make calculations like this for all characters.

Coding and Algorithmic Details

If you're using Eclipse you'll need to add several make targets (there are at least four in the Makefile provided, see below for names). Use the Eclipse make tutorial for help.

There are many details that you will need to think about as you code these programs. Some of these are discussed here with some alternatives proposed.

Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.

     int k = 'A';
     cout << char(k);

The sections below illustrate the following ideas.

Pseudo-EOF character
Priority Queues
Converting Huffman tree to Table
Implementing and Debugging the program
Bitops/reading and writing bits
Command line arguments

Pseudo-EOF character

(for more details, see the complete huffman coding discussion and see the faq.)

The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.

int bits; while (input.readbits(1, bits)) { // process bits } To avoid this problem, you can use a pseudo-EOF character and write a loop that stops when the pseudo-EOF character is read in (in compressed form). The code below is pseudo-code for reading a compressed file using such a technique. int bits; while (true) { if (! input.readbits(1, bits)) { cerr << "should not happen! trouble reading bits" << endl; } else { // use the zero/one value of the bit read // to traverse Huffman coding tree // if a leaf is reached, decode the character and print UNLESS // the character is pseudo-EOF, then decompression done if ( (bits & 1) == 0) // read a 0, go left in tree else // read a 1, go right in tree if (at leaf-node in tree) { if (leaf-node stores pseudo-eof char) break; // out of loop else write character stored in leaf-node } } } When a compressed file is written the last bits written should be the bits that correspond to the pseudo-EOF char. You will have to write these bits explicitly. These bits will be recognized by the program unhuff and used in the decompression process. This means that your decompression program will never actually run out of bits if it's processing a properly compressed file (you may need to think about this to really believe it). In other words, when decompressing you will read bits, traverse a tree, and eventually find a leaf-node representing some character. When the pseudo-EOF leaf is found, the program can terminate because all decompression is done. If reading a bit fails because there are no more bits (the bit reading function returns false) the compressed file is not well formed. Your program should cope with files that are not well-formed, be sure to test for this, i.e., test unhuff with plain (uncompressed) files.

In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.

In the file globals.h the number of characters counted is specified by HUFF_ALPH_SIZEwhich has value 256. Although only 256 values can be represented by 8 bits, these values are between 0 and 255, inclusive. One character is used as the pseudo-EOF character -- it must be a value not-representable with 8-bits, the smallest such value is 256 which requires 9 bits to represent. Howver, ideally your program should be able to work with n-bit chunks, not just 8-bit chunks.

Priority Queues

The easiest way to build the Huffman tree is to use a priority queue since it supports getmin and deletemin operations which are used in creating the Huffman tree. It always returns the minimum count character (or subtree of characters). When more than one character has the same count, it picks one to return (it always picks them in the same order, but it does so in no particular order). It turns out this is not a problem for the decompressor as long as both programs use the same implementation of a priority queue.

A heap-based, templated priority queue class tpqueue is provided (in the standard location with libtapestry files, see tpq.h. and tpq.cpp.). The program tpqtest.cpp shows how to use the class with pointers. (You can compile this with either make tpqtest or adding a Make target of tpqtest in Eclipse).

Creating a Table from a Huffman-tree

(for more details, see the complete huffman coding discussion.)

To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:

Use a string. This makes it easy to add a character (using +) to a string during tree traversal and makes it possible to use string as BitPattern. Your program may be slow because appending characters to a string (in creating the bit pattern) and accessing characters in a string (in writing 0's or 1's when compressing) is slower than the next approach.
Alternatively you can store an integer for the bitwise coding of a character. You need to store the length of the code too so your code will be able to differentiate between 01001 and 00101. However, using an int restricts root-to-leaf paths to be at most 32 edges long since an int holds 32 bits. In a pathological file, a Huffman tree could have a root-to-leaf path of over 100. Because of this problem, you should use strings to store paths rather than ints. A slow correct program is better than a fast incorrect program.

Implementing and Debugging

It's a good idea to create several classes to help manage the complexity in these programs. Because the same data structures need to used to ensure that a file compressed using your huff algorithm can be decompressed, you should be able to share several parts of the implementation. You can use classes to exploit this similarity. For example, in writing huff you can implement several classes, usually a class will take care of one part of the Huffman algorithm. These classes should also provide support for the decompression program unhuff.

Some ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.

Counting Characters

A class could be responsible for creating the initial counts of how many times each character occurs. You will be required to design, implement, and test the class in isolation from the rest of the huff program. Doing this for each of several classes will help both in developing huff and in re-using code in writing the uncompress program unhuff.

For example, one possibility for using a character counting class is shown below:

#include "globals.h" #include "charcounter.h" CharCounter cc; cc.readFile("data/poe.txt"); // count chars in file int k; for (k = 0; k < ALPH_SIZE; k++) { int occs = cc.count(k); if (occs > 0) { cout << char(k) << " " << occs << endl; } } cout << "pseudo-eof count: " << cc.count(PSEUDO_EOF) << endl; Here the member function CharCounter::readFile reads and counts all the characters in a file. The function CharCounter::count returns the number of occurrences of a given character. You might also use CharCounter to read compressed-file headers too in unhuff.

You'll need at least ALPH_SIZE counters (typically this is 256, see globals.h). There are more than 256 possible characters if you include the pseudo-EOF character

The testcount Test Program

You must write code (at least a .cpp file) that will create an executable named testcount (the Makefile is set up for this). that when run as shown below will print 256 lines. The n-th line should be a count of how many times the 8-bit character/chunk whose ASCII value is n occurs. Many of the lines will be zero. For example, since the character 'a' has an ASCII value of 97, the 98-th line (remember, start with 0) will be the count of how many a's occur in the file read.

Your program must read the file using an ibstream object. The program should process the file specfied on the command-line as shown below.

 prompt> testcount poe.txt
 0
 0
...
...
 7
 5
...
...
 15
 32
...
...

Mapping characters to Codes

A class that represents the map of character and encoding bit pattern pairs. You can use one of the templated map classes in tmap.h (see tmap.h and a demo program usemap.cpp) or write your own map-like class.

Since the values stored in the map are constrained to be between 0 and 256, inclusive, you can simply store bitpattern/strings in a vector and index them directly. There's nothing at all inferior about this approach compared to using a map. In fact, using a vector is in many ways simpler, is as fast as it can be, and is easy to implement (so it's a good choice!).

The testhuff Test Program

You must create an executable program named testhuff (the Makefile is set up for this) that when run on a file will produce the huffman codes (0's and 1's) for every character read. The output should be 257 lines, one for every eight-bit chunk/character and the last line for the PSEUDO-EOF character. The n-th line of output should be the huffman code for the character whose ASCII value is n. For example:

 prompt> testhuff poe.txt
 010101
 10101111 
 ...
 ...

Your program must accept the name of the file as a command-line argument as shown in the run above.

Debugging Code

You will probably find it useful to implement debugging functions in your classes. These are functions you use in the debugging stages, but not in the final version of the program. For example, you could implement CharCounter::print to print all characters and frequencies read from a source file --- this could use the code fragment shown above with the CharCounter description. Although you could write such a printing/testing function outside the class, encapsulating the testing/debugging functions within the class makes the class more coherent.

Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.

Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.

You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.

We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing

Using bitops.cpp

In order to read and write in a bit-at-a-time manner, two classes are declared in the header file bitops.h: ibstream and obstream, respectively, for reading and reading.

Bit read/write subprograms

To see how the readbits routine works, note that the code segment below is functionally equivalent to the Unix command cat foo --- it reads BITS_PER_WORD bits at a time (which is 8 bits as defined in bitops.h) and echoes what is read. int inbits; ibstream ibs("data/poe.txt"); while (ibs.readbits(BITS_PER_WORD, inbits)) { cout.put(inbits); // put writes one character }

this code is similar to the loop shown below which uses the standard get function to read a char from a stream.

int inbits; ifstream input("data/poe.txt"); while (input.get(inbits)) { cout.put(inbits); } EAT Note that executing the C++ statement cout.put('7') results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits correspond to the character '7'). Executing cout << 7. results in 32 bits being written because a C/C++ int uses 32 bits. Executing obs.writebits(3,7) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 000111.

When using writebits to write a specified number of bits, some bits may not be written immediately because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flushbits is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly. If, for some reason you create an obstream object referenced by a pointer, make sure to delete the object so that its destructor is called.

Although readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:

int inbits; ifstream ibs("filename"); ibs.readbits(1, inbits); if ((inbits & 1) == 1) // do stuff because the bit read was 1 else // do stuff because the bit read was 0 Alternatively, you can mod by 2, e.g., inbits % 2 and check to see if the remainder is 0 or 1 to determine if the right-most bit is 0 or 1. Using bitwise-and is faster than using mod.

An example program using the bitreading classes is provided for you to study, it is called bitread.cpp. Because Huffman coding requires two passes over the input file you will need to use the method rewind that works with object variables of type ibstream.

The effect of ib.rewind() is to allow the input bit stream ib to be ''read again'', i.e., a loop that reads until no more bits are available (checking the return-value of readbits) will be able to read the entire file again.

Command-line Arguments

The program showargs.cpp shows how command line arguments can be read and processed by your program. Do not spend time worrying how to process command-line arguments until you are sure the program works, this is not the most important part of the program. You can add command-line parameters after everything works.

When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. For example,

 
      g++ -g -o foo foo.cpp

invokes the g++ compiler (program) with four command-line arguments: -g, -o, foo, and foo.cpp. Command-line arguments are passed to the program in a C-style array of strings along with a count of the number of command-line arguments passed. The header for main must be altered in a program that processes command-line arguments (see HowtoA in Tapestry for more details.) The program showargs.cpp illustrates how command-line parameters can be processed.

Here are two runs:

prompt> showargs

program showargs has no command-line args

prompt> showargs --o=fish --b=fowl --foo

showargs has 3 command-line args:
arg is: --o=fish, option is --o, value is fish
arg is: --b=fowl, option is --b, value is fowl
arg is: foo

As shown in the code fragment above, and in the sample file showargs.cpp, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in showargs.cpp. Then you can use the C++ string functions you are familiar with to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style strings.

Grading

The programs are worth 50 points.

**Huffman Coding Grading Standards**
description	points
compression of any text file	10 points
compression of any file (including binary files)	5 points
decompression	10 points
robustness (does unhuff program crash on non-huffed files?)	5 points
program style (class design/implementation, program design)	10 points
test programs	10 points

Submit

To submit use

 submit_cps100 huff README *.h *.cpp Makefile

The Makefile should be able to handle both make huff and make unhuff. The sample Makefile provides a very basic starting point. You can alter the different lines of the Makefile as you need to. Don't forget to run make depend if you change the Makefile (from Unix, in Eclipse you don't run make depend).

Owen Astrachan

Last modified: Thu Apr 1 18:06:28 EST 2004