L5 Integer Properties of Characters
C “char” (characters) are a form of integer in C programs –
specifically, they are treated as 8-bit integers. If “signed”, their value lies between –128 and +127. If “unsigned”, between 0 and 255. Each character value represents a code which
can be “printed” – displayed on a terminal or printed on paper in familiar
form. Representing characters as
integers internally allows C programs to use all kinds of integer operations on
characters, and thus process “text” in complex ways.
Some possibilities for processing characters:
- A character
can be used as the index into an array.
This allows a program to:
- Determine
what kind of character a particular input character is – letter, digit,
operations symbol (like +, -, /, etc) quickly. The technique used is to initialize an array, say
Classify[256], to hold codes the programmer has chosen – maybe:
1 for letter, 2 for
digits, 3 for “+”, 4 for “-“, and so on
To classify a character x that has just been read in, write
i=Classify[x];. This sets i to
the “character class code” for x, allowing the program to decide what to
do with that character very quickly.
- Record
data about how many times a particular character occurs in the input, or
on what lines it occurs, so statistics useful in cryptography can be
computed.
- Arithmetic
can be used to combine characters for various purposes:
- Pairs
of characters can be combined into s single integer, so you can compute
statistics on how many times a particular “digraph: appears in text.
- All
the characters can be combined into a large integer, which can be used as
a “signature” for some text. This
integer, called a “checksum”, can be used to see if the text has been
altered since being written to a file – record the text, and its checksum
separately. Then, re-compute the
checksum when the file is read, and see if it matches the recorded
checksum.
- All
the characters in a given word can be combined, and the resulting
checksum used in a rapid lookup process called “hash table lookup”. This technique allows a program to
rapidly combine information about all occurrences of a given word in some
text – for instance, to determine if a requested keyword appears in that
text, or to print a “concordance” of the text – a kind of index which
lists, for each word that occurs, what lines it occurs in.
Hash Table Lookup
Several situations in which it is valuable to be able to
combine information about all the occurrences of a given word or string of
characters in text. A “hash table” is a
“lookup table” on steroids – you store each string to be found in a separate
slot of the table, but instead of simply searching from the beginning of the
table one slot at a time, you use a “hash function” to find out where the
string you are looking for is probably located. This makes the lookup process fast. Entering data into such a table actually uses something very
similar to the lookup process, so the “enter()” action is just as fast as the
lookup().
Hash table package:
- Build
an array H of some size, N. Each
entry in H will hold one “word” to be found later.
- From
the word W to be found or entered in the table, compute a “hash function”
I=h(W). The hash function h() is
supposed to be produce an integer between 0 and N-1, which is computed
from all the characters in W. The
h is chosen so that the words you expect as inputs produce hash function
values which are “spread out” over all the values 0 to N-1, more or less
evenly.
- H[]
probably should not contain the actual character strings forming the
words. Instead, H can contain
pointers to those strings.
i.
The strings themselves can be stored in a separate array,
maybe 10000 characters in size, one string immediately following another, each
ended with ‘\0’ characters.
- H[]
should be initialized to NULL pointers.
- enter(W)
can start searching H at h(W).
enter can compare each word in H with W. If it finds a match, enter can return the index where the
match occurred. If enter finds a
NULL at position j, it knows W was not previously entered into H, and can
enter W at j. Thus “enter” and “lookup”
are essentially the same programs.
Each can simply return the location where W is now in the
table. The returned index can be
used to index a “parallel” array Info[], which can hold whatever
information you want to compile about each word.
- Don’t
stop the search at the end of H[].
Instead, treat H as a “circular” table, in which 0 follows position
N-1.
- Keep
at least one empty slot in H[], and exit the program if too many
different words occur in the input.