Due: Thursday, Nov. 21 by 11pm
LATE DAY EXTENDED: 10% off by Monday, Nov 25, 11pm
LAST DAY TO TURN IN: 25% off by Monday, Dec 2, 11pm
50 points
Do not wait until the last minute to start this assignment. There will be no extensions. The program is being handed out early.
You will implement two compression/uncompression programs:
Your executables must have those names, the Makefile is set up for this to happen, though you'll have to add the right .cpp files to the Makefile.
You're also required to create two test programs you must submit. More details on these below. The test program names are
Applying unhuff to a file compressed by huff must exactly reproduce the original file. This involves implementing both programs based on the greedy Huffman algorithm discussed in class and in this detailed online explanation.
The resulting program will be a complete and useful compression program although not, perhaps, as powerful as standard programs like compress or zip which use slightly different algorithms permitting a higher degree of compression than Huffman coding.
Because the compression scheme involves reading and writing in a bits-at-a-time manner as opposed to a char-at-a-time manner, the program can be hard to debug. In order to facilitate the design/code/debug cycle, you should take care to develop the program in an incremental fashion. If you try to write the whole program at once, you probably will not get a completely working program!
Design, develop, and test so that you have a working program at each step.
Build a program by adding working and tested pieces.
To help you with this mode of programming you'll be required to submit two test programs described below. You should complete these test programs as part of the development process you use in creating the huff and unhuff programs.
The file showargs.cpp shows how to use command lines arguments. The file globals.h provides constants needed by both programs.
To compress you should use the Huffman coding algorithm. The algorithm has four steps. You should understand each of these steps before starting to code.
First, the process of Huffman compression is discussed, then some details about the program follow.
To uncompress the file later, you must recreate the same Huffman tree that was used to compress (so the codes you send will match). This tree might be stored directly in the compressed file (e.g., using a preorder traversal), or it might be created from character counts stored in the compressed file. In either case, this information must be coded and transmitted along with the compressed data (the tree/count data will be stored first in the compressed file, to be read by unhuff. There's more information below on storing/reading information to re-create the tree.
See below for command-line arguments that affect the huff compression program.
The first thing unhuff should do is verify that the file being uncompressed was originally compressed by your huff program. In particular, your program should not crash if it is given a program to uncompress that was not compressed using the corresponding Huffman compression program. The robustness of the program will be an important criterion in grading the program.
One easy way to ensure that both programs work in tandem is to write a "magic number" at the beginning of a compressed file. This could be any number of bits that are specifically written as the first N bits of a huffman-compressed file (for some N). The corresponding decompression program first reads N bits, if the bits do not represent the "magic number" then the compressed file is not properly formed. You can read/write bits using the classes declared in bitops.h. A sample program which uses bitops.cpp is given in bitread.cpp.
For decompression to work with Huffman coding, information must be stored in the compressed file which allows the Huffman tree to be re-created so that decompression can take place. There are many options here. You can store all codes and lengths as normal (32 bit) C/C++ ints or you can try to be inventive and save space. For example, its possible to store just character counts and recreate the codes from the counts (i.e., store 256 counts, one for each 8-bit character). It's also possible to store code-lengths and codes using bit-at-a-time operations. Any solution to storing information in the compressed file is acceptable, but full credit requires some attempt to save space/storage. Space-saving techniques are defined as those using less space than simply storing 257 counts as 32 bit ints. One useful technique is to write the tree to the file using a preorder traversal. You can use a 0 or 1 bit to differentiate between internal nodes and leaves, for example. The leaves must store character values (in the general case using 9-bits because of the pseudo-eof character).
For example, the sequence of 0's and 1's below represents the tree on the right (if you write the 0's and 1's the spaces wouldn't appear, the spaces are only to make the bits more readable to humans.)
0 0 1 001100001 1 000100000 1 001110100 |
The first 0 indicates a non-leaf, the second 0 is the left child of the root, a non-leaf. The next 1 is a leaf, it is followed by 9 bits that are represent 97, the ASCII code for 'a'. Then there's a 1 for the right child of the left child of the root, it stores 32, the ASCII value of a space. Then next 1 indicates the right child of the root is a leaf, it stores the ASCII value for a 't' which is 116.
Your program can write these bits using a standard pre-order traversal. You can then read them by reading a bit, then recursively reading left/right subtrees if the bit is a zero.
The huff program should write its compressed output to a file generated from the original file name by adding the suffix .hf to the file name. For example, the command below should compress the file foo storing the result in foo.hf.
huff --file foo
The unhuff program should write its uncompressed output to a file with a .uhf suffix. This means uncompressing foo.hf should create foo.hf.uhf (which should be the same as the file named foo with no suffix.)
However, the user should have the option of invoking huff with an argument --force which forces compression even when the compressed file is bigger. For example:
prompt> huff --file foo no compression yielded, no output file foo.hf written prompt> huff --force --file fooHere, the second time huff is executed the compression is forced because of the --force argument.
The line below should create a file named foo that's the uncompressed version of foo.hf. Without the --nosuffix option the file created would be named foo.hf.uhf.
prompt> unhuff --nosuffix --file foo.hf
prompt> huff --help Huffman compression, options are: --force create compressed file even if larger (huff only) --nosuffix (unhuff) don't use .uhf suffix --help print this help --file followed by file to be compressed (or uncompressed)
Note, do not use any variables of type char! You should use int variables when you think you might need a char everywhere in your program. The only time you might want to use a char variable is to print for debugging purposes -- you can cast an int to a printable char as shown in the code fragment below.
int k = 'A'; cout << char(k);
The sections below illustrate the following ideas.
(for more details, see the complete huffman coding discussion.)
The operating system will buffer output, i.e., output to disk actually occurs when some internal buffer is full. In particular, it is not possible to write just one single bit to a file, all output is actually done in "chunks", e.g., it might be done in eight-bit chunks. In any case, when you write 3 bits, then 2 bits, then 10 bits, all the bits are eventually written, but you can not be sure precisely when they're written during the execution of your program. Also, because of buffering, if all output is done in eight-bit chunks and your program writes exactly 61 bits explicitly, then 3 extra bits will be written so that the number of bits written is a multiple of eight. Because of the potential for the existence of these "extra" bits when reading one bit at a time, you cannot simply read bits until there are no more left since your program might then read the extra bits written due to buffering. This means that when reading a compressed file, you CANNOT use code like this.
In Huffman trees/tables you use in your programs, the pseudo-EOF character/chunk always has a count of one --- this should be done explicitly in the code that determines frequency counts. In other words, a pseudo-char EOF with number of occurrences (count) of 1 must be explicitly created.
In the file globals.h the number of
characters counted is specified by HUFF_ALPH_SIZE which has
value 257. Although only 256 values can be represented by 8 bits, one
character is used as the pseudo-EOF character [the value of PSEUDO-EOF
could be 256, but it's set for 257 for historical reasons]. Your
program should be able to work with n-bit chunks, not just 8-bit chunks.
Priority Queues
The easiest way to build the Huffman tree is to use a priority queue
since it supports getmin and deletemin operations
which are used in creating the Huffman tree. It always returns the
minimum count character (or subtree of characters). When more than one
character has the same count, it picks one to return (it always picks
them in the same order, but it does so in no particular order). It
turns out this is not a problem for the decompressor as long as both
programs use the same implementation of a priority queue.
A heap-based, templated priority queue class tpqueue is provided (in the standard location with libtapestry files, see tpq.h. and tpq.cpp.). The program tpqtest.cpp shows how to use the class with pointers.
(for more details, see the complete huffman coding discussion.)
To create a table or map of coded bit values for each character you'll need to traverse the Huffman tree (e.g., inorder, preorder, etc.) making an entry in the table each time you reach a leaf. For example, if you reach a leaf that stores the character 'C', following a path left-left-right-right-left, then an entry in the 'C'-th location of the map should be set to 00110. You'll need to make a decision about how to store the bit patterns in the map. At least two methods are possible for implementing what could be a class/struct BitPattern:
Some ideas for classes are given below. You may decide to combine some of these into one class, or you may decide to implement more classes. These are a start and some suggestions, not requirements.
For example, one possibility for using a character counting class is shown below:
You'll need at least ALPH_SIZE counters (typically this is
256, see globals.h).
There are more than 256 possible characters if you include
the pseudo-EOF character
The testcount Test Program
You must write code (at least a .cpp file) that will create an
executable named testcount (the Makefile is set up for this).
that when run as shown below will print 256 lines. The n-th line should
be a count of how many times the 8-bit character/chunk whose
value is n occurs. Many
of the lines will be zero. For example, since the character 'a'
has an ASCII value of 97, the 98-th line (remember, start with 0)
will be the count of how many a's occur in the file read.
Your program must read the file using an ibstream object. The program should process the file specfied on the command-line as shown below.
prompt> testcount poe.txt 0 0 ... ... 7 5 ... ... 15 32 ... ...
Since the values stored in the map are constrained to be between 0 and
257, you can simply store bitpattern/strings in a vector and index them
directly. There's nothing at all inferior about this approach compared
to using a map. In fact, using a vector is in many ways simpler, is as
fast as it can be, and is easy to implement (so it's a good choice!).
The testhuff Test Program
You must create an executable program named testhuff
(the Makefile is set up for this) that when run on a file will
produce the huffman codes (0's and 1's) for every character read.
The output should be 257 lines, one for every eight-bit chunk/character
and the last line for the PSEUDO-EOF character. The n-th line of output
should be the huffman code for the character whose value is n. For
example:
prompt> testhuff poe.txt 010101 10101111 ... ...Your program must accept the name of the file as an command-line argument as shown in the run above.
Similarly, you may want to implement a function to print the Huffman tree to help you determine if the tree you've built is correct.
Designing debugging functions as part of the original program will make the program development go more quickly since you will be able to verify that pieces of the program, or certain classes, work properly. Building in the debugging scaffolding from the start will make it much easier to test and develop your program. When testing, use small examples of test files maybe even as simple as "go go gophers" that help you verify that your program and classes are functioning as intended.
You might want to write encoding bits out first as strings or printable int values rather than as raw bits of zeros and ones which won't be readable except to other computer programs. A Compress class, for example, could support printAscii functions and printBits to print in human readable or machine readable formats.
We cannot stress enough how important it is to develop your program a few steps at a time. At each step, you should have a functioning program, although it may not do everything the first time it's run. By developing in stages, you may find it easier to isolate bugs and you will be more likely to get a program working faster. In other words, do not write hundreds of lines of code before compiling and testing
this code is similar to the loop shown below which uses the standard get function to read a char from a stream.
Note that executing the C++ statement cout.put('7') results in 8 bits being written because a C/C++ char uses 8 bits (the 8 bits correspond to the character '7'). Executing cout << 7. results in 32 bits being written because a C/C++ int uses 32 bits. Executing obs.writebits(3,7) results in 3 bits being written (to the output bit-stream obs) --- all the bits are 1 because the number 7 is represented in base two by 000111.
When using writebits to write a specified number of bits, some bits may not be written immediately because of some buffering that takes place. To ensure that all bits are written, the last bits must be explicitly flushed. The function flushbits is called automatically if your program exits properly when the obstream destructor is called. You should not need to call it explicitly. If, for some reason you create an obstream object referenced by a pointer, make sure to delete the object so that its destructor is called.
Although readbits can be called to read a single bit at a time (by setting the first parameter to 1), the second parameter to the function is an int. You'll need to be able to access just one bit of this int (inbits in code above). In order to access just the right-most bit a bitwise and & can be used:
An example program using the bitreading classes is provided for you to study, it is called bitread.cpp. Because Huffman coding requires two passes over the input file you will need to use the method rewind that works with object variables of type ibstream.
The effect of ib.rewind() is to allow the input bit stream ib to be ''read again'', i.e., a loop that reads until no more bits are available (checking the return-value of readbits) will be able to read the entire file again.
When a C/C++ program is invoked, arguments are often given as command-line arguments/parameters, i.e., options entered on the same line as the program. For example,
g++ -g -o foo foo.cppinvokes the g++ compiler (program) with four command-line arguments: -g, -o, foo, and foo.cpp. Command-line arguments are passed to the program in a C-style array of strings along with a count of the number of command-line arguments passed. The header for main must be altered in a program that processes command-line arguments (see HowtoA in Tapestry for more details.) The program showargs.cpp illustrates how command-line parameters can be processed.
Here are two runs:
prompt> showargs program showargs has no command-line args prompt> showargs -o fish -b fowl showargs has 5 command-line args: 1-th arg is: -o 2-th arg is: fish 3-th arg is: -b 4-th arg is: fowlAs shown in the code fragment above, and in the sample file showargs.cpp, all programs have at least one command-line argument, the name of the program being run. This is stored as the first entry in the array argv (the first entry has index zero). The array argv is an array of c-style strings. These strings are just pointers to characters, with a special NUL-character '\0' to signify the last character in a C-style string. You do NOT need to know this to manipulate these C-style strings. The easiest thing to do is to assign each element of argv to a C++ string variable as shown in showargs.cpp. Then you can use the C++ string functions you are familiar with to manipulate the values, e.g., you can call length(), you can use substr(), you can concatenate strings with +, etc. None of these operations work with C-style strings.
description | points |
---|---|
compression of any text file | 10 points |
compression of any file (including binary files) | 5 points |
decompression | 10 points |
robustness (does unhuff program crash on non-huffed files?) | 5 points |
program style (class design/implementation, program design) | 10 points |
test programs | 10 points |
When this single file is unhuffed, all the files originally compressed should be uncompressed/unhuffed.
You'll need to store information about file names in the huffed file to know what to use for names when unhuffing. The original names must be used when uncompressing (though with the .uhf suffix unless you change this with the --nosuffix flag).
You must note in your README that you've done this extra credit assignment and it must be executed using a --dir flag. This means when invoked as shown below, all the files in directory foodir will be compressed and a single archive file named foodir.hf will be created.
prompt> huff --dir foodirBe sure to test the program by uncompressing into an empty directory so that you don't overwrite files accidentally.
For information on reading the files in a directory, see the classes DirStream and DirEntry described in Chapter 10 of Tapestry.
This extra credit is worth 10 points and must be turned in with your program. Include a note in your README that you did the extra credit.
For an extra 20 points, read this description of LZW or Lempel/Ziv/Welch compression and implement it. This extra credit is turned in separately and must be submitted by Wednesday, Dec. 4, 2002 at 11pm.
To submit use
submit_cps100 huff README *.h *.cpp MakefileThe Makefile should be able to handle both make huff and make unhuff. The sample Makefile in ~rodger/cps100/huff will do this. You can alter the different lines of the Makefile as you need to. Don't forget to run make depend if you change the Makefile.
Extra Credit Submission
You can submit LZW extra credit by Dec. 4, 2002.
submit_cps100 huffextra README *.h *.cpp Makefile