L5  Integer Properties of Characters

C “char” (characters) are a form of integer in C programs – specifically, they are treated as 8-bit integers.  If “signed”, their value lies between –128 and +127.  If “unsigned”, between 0 and 255.  Each character value represents a code which can be “printed” – displayed on a terminal or printed on paper in familiar form.  Representing characters as integers internally allows C programs to use all kinds of integer operations on characters, and thus process “text” in complex ways.

 

Some possibilities for processing characters:

  1. A character can be used as the index into an array.  This allows a program to:
    1. Determine what kind of character a particular input character is – letter, digit, operations symbol (like +, -, /, etc) quickly.  The technique used is to initialize an array, say Classify[256], to hold codes the programmer has chosen – maybe:
                     1 for letter, 2 for digits, 3 for “+”, 4 for “-“, and so on
      To classify a character x that has just been read in, write i=Classify[x];.  This sets i to the “character class code” for x, allowing the program to decide what to do with that character very quickly.
    2. Record data about how many times a particular character occurs in the input, or on what lines it occurs, so statistics useful in cryptography can be computed.
  2. Arithmetic can be used to combine characters for various purposes:
    1. Pairs of characters can be combined into s single integer, so you can compute statistics on how many times a particular “digraph: appears in text.
    2. All the characters can be combined into a large integer, which can be used as a “signature” for some text.  This integer, called a “checksum”, can be used to see if the text has been altered since being written to a file – record the text, and its checksum separately.  Then, re-compute the checksum when the file is read, and see if it matches the recorded checksum.
    3. All the characters in a given word can be combined, and the resulting checksum used in a rapid lookup process called “hash table lookup”.  This technique allows a program to rapidly combine information about all occurrences of a given word in some text – for instance, to determine if a requested keyword appears in that text, or to print a “concordance” of the text – a kind of index which lists, for each word that occurs, what lines it occurs in.

 

Hash Table Lookup

 

Several situations in which it is valuable to be able to combine information about all the occurrences of a given word or string of characters in text.  A “hash table” is a “lookup table” on steroids – you store each string to be found in a separate slot of the table, but instead of simply searching from the beginning of the table one slot at a time, you use a “hash function” to find out where the string you are looking for is probably located.  This makes the lookup process fast.  Entering data into such a table actually uses something very similar to the lookup process, so the “enter()” action is just as fast as the lookup().

 

Hash table package:

  1. Build an array H of some size, N.  Each entry in H will hold one “word” to be found later.
  2. From the word W to be found or entered in the table, compute a “hash function” I=h(W).  The hash function h() is supposed to be produce an integer between 0 and N-1, which is computed from all the characters in W.  The h is chosen so that the words you expect as inputs produce hash function values which are “spread out” over all the values 0 to N-1, more or less evenly. 
    1. H[] probably should not contain the actual character strings forming the words.  Instead, H can contain pointers to those strings.

                                                                 i.       The strings themselves can be stored in a separate array, maybe 10000 characters in size, one string immediately following another, each ended with ‘\0’ characters.

    1. H[] should be initialized to NULL pointers.
  1. enter(W) can start searching H at h(W).  enter can compare each word in H with W.  If it finds a match, enter can return the index where the match occurred.  If enter finds a NULL at position j, it knows W was not previously entered into H, and can enter W at j.  Thus “enter” and “lookup” are essentially the same programs.  Each can simply return the location where W is now in the table.  The returned index can be used to index a “parallel” array Info[], which can hold whatever information you want to compile about each word.
    1. Don’t stop the search at the end of H[].  Instead, treat H as a “circular” table, in which 0 follows position N-1.
    2. Keep at least one empty slot in H[], and exit the program if too many different words occur in the input.