Randomized Algorithms
* Universal hashing Section
8.4, 8.5
* Perfect hashing
* Occupancy problems
==============================================================================
HASHING
-------
Set up: have a domain/universe M, and some subset S of M
of points we
care
about (|S| << |M|). Want to
map M into a much smaller set N
(e.g.,
|N| = O(|S|)), in a way that's easy to calculate and results in
few
collisions in S.
Function h: M --> N is called a hash function.
Lots of motivations for hashing. One is the Dictionary problem:
- static: set S of keys (e.g., key is a word, or name)
and associated
with each
key is an object (e.g., definition, phone number). Want to
store S so
can do efficient lookups.
- dynamic: sequence of insert(key,object), find(key),
delete(key) requests.
Goal: Idea of hashing: maintain a small table, and use
hash function h
to map keys into this table. If h behaves randomly, shouldn't get too
many collisions.
Interesting because very basic technique in computer
science where
idea of randomness is critical. Even if in practice you don't use
randomness in selecting the function (as in the methods
we will prove
things about), the motivation/inspiration is random
behavior.
Simple fact: if M is sufficiently large (|M| >
|S|*|N|), then for any
h, there exists S such that all elements of S collide.
So, need to either choose h based on S, or make
randomness assumptions
on S, or choose h using some randomness.
For convenience, lets abuse notation and let N = |N| and
M = |M|.
Universal hash functions
------------------------
Definition: H is a 2-universal family (set) of hash
functions from M to
N if: for all x,y in M (x != y),
Prob[h(x)
= h(y)] <= 1/N
h<-H
similar to pairwise independence. Even more similar is notion of
"strongly 2-universal".
H is "strongly 2-universal" if for all x_1 !=
x_2 in M, all y_1, y_2
in N,
Prob[h(x_1)=y_1
& h(x_2)=y_2] = 1/N^2.
Why are universal hash functions useful?
----------------------------------------
Say we use hash function to store set S in table of size
n, where,
say, we handle collisions with linked list. Then, cost of
find(x) is
proportional to length of list at h(x). So, let
C_{x,y} be indicator random variable for x and
y having collision.
C_{x,S}
be length of list at h(x).
Then,
E[C_xy] <= 1/N, so
E[C_xS]
< 1 + |S|/N
So, just need linear table size to have O(1) expected
time per operation.
How to construct? Will give 2 ways.
-----------------
Way #1: Let's assume that M, N are powers of 2: M =
{0,1}^m and N = {0,1}^n.
H = { m x n 0-1 matrices}. h(x) = hx, in GF2 (addition is mod 2).
Note: h(0) = 0. So, could remove 0 from domain if want
strong 2-universal.
Claim: For x!=y, Pr[h(x) = h(y)] = 1/|N|.
Proof: Say x and y differ in ith coordinate, and say that
y_i = 1.
Let
z = y, but with z_i = 0. (So, it may be that z=x).
Then,
h(y) = h(z) + h_i, where h_i is the ith column of h.
Notice
that h(z) is independent from h_i, So,
Pr[h(y)=h(x)]
= Pr[h_i = h(x) - h(z)] = 1/2^n.
In fact, we've shown that for any r, prob[h(y) = h(x) +
r] = 1/N,
independent of h(x). So, so long as x!=0, we have strong 2-universality.
So, this is a nice, easy family of hash functions. One problem is
that it requires log(N)*log(M) random bits.
Here's another 2-universal family:
Here, lets let M = {0,...,m-1} and N = {0,...,n-1}.
Pick prime p >= m (or, think of just rounding m up to
nearest prime). Define
h_{a,b}(x)
= ( (a*x + b) mod p ) mod n.
H
= {h_ab | a,b in Z_p and a != 0}
Claim: H is a 2-universal family.
Proof: Let's fix r != s and calculate, for x!=y,
Pr[[(ax
+ b) = r (mod p)] AND [(ay+b) = s (mod p)]].
-> it
must be that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which
has
exactly one solution (mod p) since Z_p* is a field.
-> given
this value of a, we must have b = r-ax (mod p).
So,
this probability is 1/[p(p-1)].
(What is the probability for r=s? Ans: 0)
Now, how many pairs r!=s are there in {0,...,p-1} such
that r = s (mod n)?
-> have p
choices for r, and then at most ceiling(p/n)-1 choices for
s ("-1" since we disallow s=r). The product is
at most p(p-1)/n
Putting this all together,
Pr[(a*x
+ b mod p) mod n = (a*y + b mod p) mod n]
<=
p(p-1)/n * [1/(p(p-1))] = 1/n.
QED
Perfect hashing
---------------
Say we want to hash a fairly stable set. E.g., a real dictionary, or
lookup table.
So, we can pick the hash function based on the set S.
Can we get
O(1) time WORST-CASE?
--> Perfect hashing. (h is perfect for S if it causes no collisions).
1 way: Say O(|S|^2) space is OK. Then, just do a universal hash
function into table of size n=|S|^2. For any x!=y, we have
Pr[h(x)=h(y)]
<= 1/n.
So,
Pr[Exists
x!=y that collide] <= |S|(|S|-1)/(2n) < 1/2.
So, just pick random h in H and try it. Each time have at
least 1/2
chance of success.
If fail, try again.
What if we only want to use O(|S|) space?
Let's let n=|S| and try a universal hash function.
Let B_i = size of ith bin.
What is the expected value of sum_i (B_i)^2?
sum_i
(B_i)^2 = sum_(x,y) C_xy
= n + sum_(x !=
y) C_x,y < 2n.
So, any ideas....?
For at least 1/2 of h's, this is at most 4n. Now, if B_j = t, can
hash into t^2 elements perfectly with random h. So, final hash
function is a tree of depth 2, into a table of size 4n
with no collisions.