Randomized Algorithms

 

* Universal hashing                 Section 8.4, 8.5

* Perfect hashing

* Occupancy problems

==============================================================================

 

HASHING

-------

Set up: have a domain/universe M, and some subset S of M of points we

     care about (|S| << |M|).  Want to map M into a much smaller set N

     (e.g., |N| = O(|S|)), in a way that's easy to calculate and results in

     few collisions in S.

 

Function h: M --> N is called a hash function.

 

Lots of motivations for hashing.  One is the Dictionary problem:

 

- static: set S of keys (e.g., key is a word, or name) and associated

  with each key is an object (e.g., definition, phone number). Want to

  store S so can do efficient lookups.

 

- dynamic: sequence of insert(key,object), find(key), delete(key) requests.

 

Idea of hashing: maintain a small table, and use a hash function h

to map keys into this table.  If h behaves randomly, shouldn't get too

many collisions.

 

Interesting because very basic technique in computer science where

idea of randomness is critical.  Even if in practice you don't use

randomness in selecting the function (as in the methods we will prove

things about), the motivation/inspiration is random behavior.

 

Simple fact: if M is sufficiently large (|M| > |S|*|N|), then for any

h, there exists a set S of the given size all of whose elements collide

(pigeonhole: some bucket of N receives more than |S| elements of M).

 

So, need to either choose h based on S, or make randomness assumptions

on S, or choose h using some randomness.
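
The simple fact above can be checked mechanically. A minimal sketch, assuming
h is any fixed Python function and the universe is small enough to scan; the
name colliding_set is illustrative, not from the notes.

```python
from collections import defaultdict

def colliding_set(h, universe, s_size):
    """Scan the universe and return s_size points that h sends to one bucket.

    By pigeonhole this always succeeds when
    len(universe) > s_size * (number of buckets).
    Returns None if no bucket ever collects s_size points.
    """
    buckets = defaultdict(list)
    for x in universe:
        bucket = buckets[h(x)]
        bucket.append(x)
        if len(bucket) == s_size:
            return bucket
    return None
```

E.g., colliding_set(lambda x: (7 * x + 3) % 5, range(21), 4) returns four
points with the same hash value, since 21 > 4*5. This is why a fixed h cannot
be good for every S.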

 

 

For convenience, let's abuse notation and let N = |N| and M = |M|.

 

Universal hash functions

------------------------

 

Definition: H is a 2-universal family (set) of hash functions from M to

N if: for all x,y in M (x != y),

 

     Prob[h(x) = h(y)] <= 1/N

     h<-H

 

This is similar to pairwise independence.  Even more similar is the notion of

"strongly 2-universal".

 

H is "strongly 2-universal" if for all x_1 != x_2 in M, all y_1, y_2

in N,

     Prob[h(x_1)=y_1 & h(x_2)=y_2]  =  1/N^2.

 

 

Why are universal hash functions useful?

----------------------------------------

 

Say we use a random h from H to store set S in a table of size N, where,

say, we handle collisions with linked list. Then, cost of find(x) is

proportional to length of list at h(x).  So, let

     C_{x,y}  be indicator random variable for x and y having collision.

     C_{x,S} be length of list at h(x).

 

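     Then, E[C_xy] <= 1/N for y != x.  Since C_xS = sum_{y in S} C_xy (the

     y = x term, if present, contributes 1), linearity of expectation gives

     E[C_xS] < 1 + |S|/N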

 

So, just need linear table size to have O(1) expected time per operation.
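
A minimal sketch of such a chained table, assuming integer keys and the
((a*x + b) mod p) mod n family that these notes define further on. The class
name ChainedTable and the choice p = 2^31 - 1 are illustrative assumptions.

```python
import random

class ChainedTable:
    """Dictionary with chaining; h is drawn at random when the table is built."""

    def __init__(self, num_buckets, prime=2_147_483_647, rng=None):
        # Assumption: keys are integers in {0, ..., prime-1}; prime = 2^31 - 1.
        rng = rng or random.Random()
        self.n = num_buckets
        self.p = prime
        self.a = rng.randrange(1, prime)   # a != 0
        self.b = rng.randrange(prime)
        self.buckets = [[] for _ in range(num_buckets)]

    def _h(self, key):
        return ((self.a * key + self.b) % self.p) % self.n

    def insert(self, key, obj):
        chain = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                   # overwrite an existing key
                chain[i] = (key, obj)
                return
        chain.append((key, obj))

    def find(self, key):
        # Cost is proportional to the length of the list at h(key).
        for k, obj in self.buckets[self._h(key)]:
            if k == key:
                return obj
        return None

    def delete(self, key):
        idx = self._h(key)
        self.buckets[idx] = [(k, o) for (k, o) in self.buckets[idx] if k != key]
```

With num_buckets proportional to |S|, the bound above says each list scanned
by find has O(1) expected length.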

 

 

How to construct? Will give 2 ways.

-----------------

Way #1: Let's assume that M, N are powers of 2: M = {0,1}^m and N = {0,1}^n.

 

H = { n x m 0-1 matrices}.  h(x) = hx, over GF(2) (addition is mod 2).

 

Note: h(0) = 0. So, could remove 0 from domain if want strong 2-universal.

 

Claim: For x!=y, Pr[h(x) = h(y)] = 1/|N|.

Proof: Say x and y differ in ith coordinate, and say that y_i = 1.

     Let z = y, but with z_i = 0. (So, it may be that z=x).

     Then, h(y) = h(z) + h_i, where h_i is the ith column of h.

     Notice that h(x) and h(z) are both independent of h_i (since x_i = z_i = 0), so

     Pr[h(y)=h(x)] = Pr[h_i = h(x) - h(z)] = 1/2^n.

 

In fact, we've shown that for any r, prob[h(y) = h(x) + r] = 1/N,

independent of h(x).  So, so long as x != 0 (so that h(x) is itself uniform),

we have strong 2-universality.

 

So, this is a nice, easy family of hash functions.  One problem is

that it requires log(N)*log(M) random bits.
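
For small m and n the claim can be checked by brute force over all 2^(n*m)
matrices. A sketch, assuming bit vectors are represented as Python tuples and
h as a tuple of n rows of m bits; the function names are illustrative.

```python
from itertools import product

def matrix_hash(h, x):
    """Compute hx over GF(2): h is a tuple of n rows, each a tuple of m bits."""
    return tuple(sum(hij & xj for hij, xj in zip(row, x)) % 2 for row in h)

def collision_probability(m, n, x, y):
    """Exact Pr_h[h(x) = h(y)] over all n x m 0-1 matrices (small m, n only)."""
    total = collisions = 0
    for bits in product((0, 1), repeat=n * m):
        # Cut the flat bit string into n rows of length m.
        h = tuple(bits[r * m:(r + 1) * m] for r in range(n))
        total += 1
        collisions += matrix_hash(h, x) == matrix_hash(h, y)
    return collisions / total
```

E.g., collision_probability(3, 2, (1, 0, 1), (0, 0, 1)) gives 0.25 = 1/2^n,
as the claim predicts.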

 

Here's another 2-universal family:

 

Here, let's let M = {0,...,m-1} and N = {0,...,n-1}.

Pick prime p >= m (or, think of just rounding m up to nearest prime).  Define

 

     h_{a,b}(x) = ( (a*x + b) mod p ) mod n.

 

     H = {h_ab | a,b in Z_p and a != 0}

 

Claim: H is a 2-universal family.

 

Proof: Let's fix r != s and calculate, for x!=y,

     Pr[[(ax + b) = r (mod p)] AND [(ay+b) = s (mod p)]].

 

 -> it must be that ax-ay = r-s (mod p), so a = (r-s)/(x-y) mod p, which

     has exactly one solution (mod p) since Z_p is a field; a != 0 since r != s.

 

 -> given this value of a, we must have b = r-ax (mod p).

 

     So, this probability is 1/[p(p-1)].

 

(What is the probability for r=s?  Ans: 0)

 

Now, how many pairs r!=s are there in {0,...,p-1} such that r = s (mod n)?

 -> have p choices for r, and then at most ceiling(p/n)-1 choices for

s ("-1" since we disallow s=r).  The product is at most p(p-1)/n, since

ceiling(p/n) - 1 <= (p-1)/n.

 

Putting this all together,

 

     Pr[(a*x + b mod p) mod n = (a*y + b mod p) mod n]

 

          <= p(p-1)/n * [1/(p(p-1))] = 1/n.

QED
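
The bound can also be confirmed by enumerating the whole family for small
parameters. A sketch, assuming p is prime and p >= m; exact fractions are used
to avoid rounding, and the function names are illustrative.

```python
from fractions import Fraction

def h_ab(a, b, p, n, x):
    return ((a * x + b) % p) % n

def worst_collision_probability(p, n, m):
    """Max over x != y in {0..m-1} of Pr_{a,b}[h_ab(x) = h_ab(y)], by enumeration."""
    family = [(a, b) for a in range(1, p) for b in range(p)]  # a != 0
    worst = Fraction(0)
    for x in range(m):
        for y in range(x + 1, m):
            hits = sum(h_ab(a, b, p, n, x) == h_ab(a, b, p, n, y)
                       for a, b in family)
            worst = max(worst, Fraction(hits, len(family)))
    return worst
```

For any prime p >= m the returned value should be at most 1/n, e.g.
worst_collision_probability(11, 3, 10) <= Fraction(1, 3).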

 

Perfect hashing

---------------

Say we want to hash a fairly stable set.  E.g., a real dictionary, or

lookup table.  So, we can pick the hash function based on the set S.

Can we get  O(1) time WORST-CASE?

 

--> Perfect hashing.  (h is perfect for S if it causes no collisions).

 

One way: Say O(|S|^2) space is OK.  Then, just do a universal hash

function into table of size n=|S|^2.  For any x!=y, we have

 

     Pr[h(x)=h(y)] <= 1/n.

So,

     Pr[Exists x!=y that collide] <= |S|(|S|-1)/(2n) < 1/2.

 

So, just pick random h in H and try it. Each time have at least 1/2

chance of success.  If fail, try again.
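
The pick-and-retry loop, sketched under the assumption that keys are distinct
integers below the prime p = 2^31 - 1 and that h comes from the
((a*x + b) mod p) mod n family; find_perfect_hash is an illustrative name.

```python
import random

def find_perfect_hash(keys, p=2_147_483_647, rng=None):
    """Retry random h_{a,b} into a table of size |S|^2 until no two keys collide.

    Returns (a, b, n, tries).  Each try succeeds with probability > 1/2,
    so the expected number of tries is less than 2.
    """
    rng = rng or random.Random()
    keys = list(keys)
    n = max(1, len(keys) ** 2)
    tries = 0
    while True:
        tries += 1
        a, b = rng.randrange(1, p), rng.randrange(p)
        slots = {((a * x + b) % p) % n for x in keys}
        if len(slots) == len(keys):        # all keys landed in distinct slots
            return a, b, n, tries
```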

 

What if we only want to use O(|S|) space?

Let's let n=|S| and try a universal hash function.

 

Let B_i = size of ith bin.

 

What is the expected value of sum_i (B_i)^2?

 

     sum_i (B_i)^2 = sum_(x,y) C_xy      (over ordered pairs x,y in S)

 

                = n + sum_(x != y) C_xy,

 

     and E[sum_(x != y) C_xy] <= n(n-1)/n = n-1, so E[sum_i (B_i)^2] < 2n.

 

So, any ideas....?

 

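By Markov's inequality, for at least 1/2 of the h's, this sum is at most 4n.

Now, if B_j = t, can hash bin j perfectly into a second-level table of size

t^2 with a random h, as above.  So, the final hash function is a two-level

scheme (a tree of depth 2): a top-level table of size n plus second-level

tables of total size at most 4n, with no collisions.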
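
A sketch of the two-level construction, assuming distinct integer keys below
the prime P = 2^31 - 1 and the ((a*x + b) mod P) mod n family at both levels.
All names here (draw, apply_hash, build_two_level, contains) are illustrative.

```python
import random

P = 2_147_483_647  # assumed prime; keys must lie in {0, ..., P-1}

def draw(n, rng):
    """Draw a random h_{a,b} into n slots, returned as the triple (a, b, n)."""
    return rng.randrange(1, P), rng.randrange(P), n

def apply_hash(h, x):
    a, b, n = h
    return ((a * x + b) % P) % n

def build_two_level(keys, rng=None):
    rng = rng or random.Random()
    keys = list(keys)
    n = max(1, len(keys))
    # Level 1: retry until sum_i B_i^2 <= 4n (probability >= 1/2 per try).
    while True:
        top = draw(n, rng)
        bins = [[] for _ in range(n)]
        for x in keys:
            bins[apply_hash(top, x)].append(x)
        if sum(len(b) ** 2 for b in bins) <= 4 * n:
            break
    # Level 2: a bin of size t gets its own collision-free table of size t^2.
    second = []
    for b in bins:
        size = len(b) ** 2
        while True:
            h = draw(max(1, size), rng)
            table = [None] * size
            ok = True
            for x in b:
                j = apply_hash(h, x)
                if table[j] is not None:   # collision: redraw h for this bin
                    ok = False
                    break
                table[j] = x
            if ok:
                second.append((h, table))
                break
    return top, second

def contains(structure, x):
    """Worst-case O(1) lookup: two hash evaluations and one comparison."""
    top, second = structure
    h, table = second[apply_hash(top, x)]
    return bool(table) and table[apply_hash(h, x)] == x
```

Only the construction is randomized; once built, every lookup takes the same
constant number of steps.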