# Huffman Coding From a Teaching Perspective

## Data Structures and Topics Used in the Assignment

To ensure that students are adequately prepared to do this assignment, ideally you will have covered the topics listed below. You can, however, skip some of these if you provide classes/modules to students that can be used while insulating students from specific knowledge of how the classes work.

• Binary Trees The structure used in creating per-character Huffman codes is really more of a trie, but from a student perspective (and from a teaching perspective) it can be treated as a binary tree.

Students will need to implement the following operations:

• Create the tree. Nodes must be created and subtrees joined to the nodes. A priority queue is used in determining the structure of the tree.

• Traverse the tree recording every root-to-leaf path. The edges taken for each root-to-leaf path determine the coding of the character stored in the leaf node.

• Optionally perform a pre-order traversal to write the tree, and read it back. This is only needed if the tree is stored in the header of a compressed file.

• Perform a post-order traversal to delete all tree nodes when the tree is no longer needed.

• Priority Queue A priority queue is used in the greedy algorithm for creating the tree. Students might use a priority queue whose implementation is provided, or might implement one (e.g., using an unsorted vector or a heap). Performance isn't a concern in the priority queue implementation because fewer than 1,000 inserts/deletemins are performed.

• Map Students do not need to use a map class (e.g., an STL map in C++ or a HashMap in the Java 1.2 Collections package). It's very easy for students to store character counts in a vector and to store encodings stored as strings in a vector as well (vectors indexed by character). However, it is also possible to use a map if maps are studied.

• string or bit operations When character encodings are created, each edge of a root-to-leaf path is treated as a 0 or a 1 and concatenated to form an encoding for the character in a leaf node. The 0's and 1's can be treated as characters and concatenated to a string being built to represent the encoding. Alternatively, the 0's and 1's can be treated as bit values and "catenatd" to an int by shifting the int and bitwise-anding the 0 and 1 with the resulting value. If bit operations are used the number of bits must be maintained in order to differentiate the numbers 0010 and 00010 which are numerically the same, but different as strings and as encodings of root-to-leaf paths.

## Providing infrastructure for Students

At a minimum, you will need to provide the classes used to read and write 1 bit (or n-bits) at a time. A few simple example programs are typically enough to show students how these bit-stream classes are used. We call the classes ibstream  for input bit stream and obstream  for output bit stream.

Optionally, you can provide design ideas and/or implementations of some classes for students to use. When we originally gave this assignment we asked students to implement a priority queue class. Since most books provide some implementation, and often an efficient heap-based implementation, we now provide the priority queue class and expect students to use it without understanding (for the purposes of this assignment) its implementation.

We have given this assignment with no suggestions for any design or problem decomposition and with some minimal hints for some classes that might be useful. We stress that code that's used in the huff/unhuff (compression/uncompression) programs should not be duplicated in two different source code files, but factored into common functions or classes that are used by both programs.

We also have provided several kinds of solutions (currently these are only available in C++).

• A mostly functional decomposition with minimal use of classes.

• A simple class-based solution.

• A more complex (and poorly designed from an OO perspective) class-based solution that allows for different methods of storing header information: frequency-based, tree-based, and English-based (assuming character frequencies as in Shakespeare's plays).

We use this solution in a later course to discuss its drawbacks and motivate design patterns.

• A (hopefully) well-designed OO solution that uses patterns to fix the problems with the design mentioned in the previous bullet. Functionally this solution is the same as the previous solution.

Owen L. Astrachan