CompSci 18S - Spring 2011 - Compression


Classwork 6 (10 pts)

March 14, 2011 (Happy Pi Day!)

For group work, please turn in one sheet of paper with all the names of those who participated in your group.

Problem

We will address the compression of files. If you want to send a file over the web, it is faster if you can first compress the file to make it smaller and then decompress it at the other end. We will focus on text files. Note that there are 256 characters and each character can be recognized by 8 bits (0 or 1) in binary using ASCII coding scheme. For example, here are some of the 256 ASCII encodings in bits.

a 011000001
b 011000010
c 011000011
d 011000100
? 00111111
@ 010000000
A 010000001

We will look at a very small file to focus on encoding, but you can imagine files that are much bigger.









Part 1

Your task is to encode the short file that has "duke blue devils" in it.

  1. If you use the ASCII encoding above with 8 bits per letter, how many bits are needed?


  2. Perhaps you can use fewer bits for compressing this file because you are not using all 256 ASCII characters, but a much smaller number of characters. Come up with an encoding scheme that uses as few bits as possible.

    1. Show the encoding scheme (what the code is for each letter).

      letter encoding of letter letter encoding of letter letter encoding of letter letter encoding of letter
      . .. . ... .
      . .. . ... .
      . .. . ... .
      . .. . ... .

    2. Using this encoding scheme, give the encoding of "duke blue devils".



    3. How many bits are needed to encode "duke blue devils" and how does it compare to using ASCII 8 bit codes?



































    Part 2

    Now let's try to get a "smaller" encoding.

      First notice how many times each character is used in the file "duke blue devils".

      letter number occurences letter number occurences letter number occurences letter number occurences
      . .. . ... .
      . .. . ... .
      . .. . ... .
      . .. . ... .

    1. Using this information, give an encoding scheme that uses fewer bits then you did last time. First explain the idea on how you plan to use fewer bits.







      Give the encoding for each letter in "duke blue devils" in the table.

      letter encoding of letter letter encoding of letter letter encoding of letter letter encoding of letter
      . .. . ... .
      . .. . ... .
      . .. . ... .
      . .. . ... .

    2. Give the encoding of "duke blue devils"



    3. Explain how someone at the other end could "uncompress" your file if they had your encoding table.

















    Part 3

    Huffman Coding builds a structure in which characters that are used a lot are near the top and characters not used much are near the bottom. The tree is built by starting with all the characters and the number of times they appear in the file. The smallest two numbers are joined together.

    For example if the characters a and b are both used once, they can be joined together and the structure is labeled with their sum.

    If the character e occurs twice, it can be joined with the previous structure and labeled with the sum of the two.

    In general, two characters or structures are joined together if they have the smallest two values. This continues until there is just one structure. Then codes are assigned to letters by walking down the structure. Going left means "0" and going right means "1". This way each character has a unique code. In the example above with the stucture for a, b and e, the code for a is 00, b is 01 and e is 1.













    1. Use Huffman Coding. Give the table showing the number of occurences for each character in "baseball is not bball"

      letter number occurences letter number occurences letter number occurences letter number occurences
      . .. . ... .
      . .. . ... .
      . .. . ... .
      . .. . ... .

    2. Give the structure of all characters that places those characters closer to the top if they are used more often.












    3. Give the encoding table using this scheme.
      letter encoding of letter letter encoding of letter letter encoding of letter letter encoding of letter
      . .. . ... .
      . .. . ... .
      . .. . ... .
      . .. . ... .

    4. Give the encoding of "baseball is not bball".




    5. Compare the Huffman conding with the 8-bit encoding of "baseball is not bball". How many bits are saved?