Randomized Algorithms
* Analysis of local optimization algorithms
(work in
[Aldous-Vazirani] and [Dimitriou-Impagliazzo]
see: FOCS '94, pages
492-501, and STOC '96, pages 304-313
)
===========================================================================
In practice, if you're given an instance of a hard
problem to solve,
you'll typically use some sort of local optimization
algorithm. In
general, doing this involves 3 steps:
1) setting up some solution space, and attaching a
"goodness" or
"value" to each solution. Typically the size of this space
is exponential in the size of the instance.
2) Defining some notion of "neighbor" among the
solutions in the
space.
(I.e., so we now have a graph with nodes as solutions; of
course, this graph is only defined implicitly, since it
has
exponentially many nodes).
3) doing local search to try to find the best solution in
the space.
For instance
* MAX-SAT: Here (1) could be the space of all
assignments, where the
value of an assignment is just the number of clauses it
satisfies. We
can do (2) by saying that we're allowed to move from one assignment
to another by flipping a bit (i.e., 2 assignments are
neighbors if
they differ in one bit position).
* TSP: Here a natural solution space is the set of all
tours, with
value equal to the length of the tour (so value is now
"badness").
Then, one notion of neighbor might be that you're allowed
to take two
directed edges a-->b and c-->d in the tour and
replace them by a-->c
and b-->d.
(This is sometimes called the 2-OPT rule).
* In machine learning, (1) corresponds to the
"hypothesis space" --
the set of legal hypotheses under consideration. Most learning
algorithms correspond to some sort of natural
neighborhood structure
on this space and some natural local search algorithm.
Often much of the effort is in areas (1) and (2). But, we're going to
ignore that and instead focus on (3): what can we say
about the
behaviour of different local search algorithms, based on
combinatorial
properties of the graph structure we're searching in.
Let's make a few simplifying assumptions
(1) values of solutions are integers in some range [0,d],
where d=poly(n).
(2) A solution has at most poly(n) neighbors.
(3) If a solution has value i, then its neighbors have
value either
i-1
or i+1.
(e.g., MAX SAT satisfies (1) and (2). Not (3) but (3) is
really just
to simplify matters.)
So, we can think of the solution space as a layered
graph, and our
goal is to get to the deepest layer.
--> Interesting special case of tree, and how one
might use a tree to
model general structure of a search graph.
Here are three interesting classes of algorithms:
(1) Hill-climbing: begin at top. while not at a leaf, go to a random
child. When you get to a leaf, halt. (and then start over at the top)
(2) simulated-annealling style: like (1), but you also
allow yourself
to move up with some probability. Or, DFS: do hill-climbing, but
back up to previous decision point when you get stuck.
(3) survival-of-the-fittest: run many particles in
parallel. In
particular, let's define GW1:
-
begin with B particles at the start.
-
if all particles are at leaves, then halt. Otherwise, move each
particle at a leaf to the location of a
random non-leaf particle.
-
move each particle to a random child.
E.g., what about this kind of tree:
o
/ \
o o
/ \
o o
/ \
o o
/ \
o o
/ \
o o
In this case, GW1 only fails if on some branch point, all
particles go
in the wrong direction (so long as this doesn't happen,
the particles
will then regroup in the second step of the
algorithm). So, the
probability of failure is at most d*(1/2)^B. So, we succeed
w.h.p. for B >> log(d).
Here's a generalization:
o
/ \
T o
|
o
/ \
o T
|
o
/ \
T o
|
...
Where "T" is a complete binary tree of some
depth d' (say d'=sqrt{d}),
and "|" represents a path of length d'+1.
In this case, the same analysis holds for GW1, but now
both
DFS and simulated annealing (and hill climbing) fail.
Another interesting case is a branching process:
given a node, flip a coin: with prob gamma < 1/2 make
it a leaf, and
with prob 1-gamma, give it two children. Then cut this off at depth d.
(If you look at the number of coins still to flip, this
is like a
random walk, starting at 1, where each coin flip moves
you 1 step to
the left with prob gamma and 1 step to the right with
prob 1-gamma.)
--> steady state will be at most O(log B) particles
per node, and at
least B/c nodes occupied for some constant c.
Will GW1 work for all trees having at P=poly(d) modes per
level? No.
E.g.,
o
|\
o o
| |\
o o o
| | |\
o o o o
|
o
If depth is >> log(B), then w.h.p., we'll end up
with no particles to
the right.
How to modify?
Idea: redistribute particles according to their
degree. Equivalently,
look at the union of the children of the current nodes
occupied and
distribute the B particles uniformly among those
children.
This rule will keep particles evenly distributed among
the nodes at
any given level.
(Note: we now fail on the previous class of trees,
though.)
Generalization: what if we have a search graph where we
can group the
solutions in the search space into
"supernodes", such that all
solutions in a given supernode have the same value and
furthermore:
(1) Each supernode C at level i has |C| > n_i/poly(n),
where n_i is
the
number of solutions of value i.
(2) The bipartite graph between any supernode C at level
i and the
union
of its children at level i+1 is a good expander.
(3) n_{i-1}/n_i < poly(n).
(by (2) we mean that for any set S of solutions at value
i+1,
|N(N(S))
- S| > const*|S| for |S| < n_{i+1}/2.)
Then,
* If all nodes have the same up-degree, we can use the
same tree
algorithm, except after we move the B particles to the
children, we
then do a random walk on the bipartite graph between the
level i and
level i+1 nodes.
This will randomize the positions of the B
particules among the children nodes. The reason we want a random
sample is that this ensures that each child supernode
receives a
number of particles in proportion to its size.
* If all nodes DON'T have the same up-degree, then we are
over-counting those child nodes of higher degree. We can compensate
by, at the end, redistributing our B particles with
probability
inversely proportional to the up-degrees of the
nodes. Everything
will work fine so long as B is sufficiently larger than
the maximum
degree of any node.