Clustering Algorithms
In this project, we will explore different algorithms to cluster
data items. Clustering is the process of automatically detecting items
that are similar to one another and grouping them together. Since the
notion of a group is fuzzy,
there are various algorithms for clustering that differ in their
measure of quality of a clustering, and in their running time. The goal
of this project is to implement some of these algorithms and compare
their performance on different data sets, both in terms of running
time, and solution quality.
Part I: Agglomerative Clustering
In agglomerative clustering,
each data point is initially placed in a cluster by itself. At each
step, two clusters with the highest similarity score are merged. The
output of agglomerative clustering is a tree, where the
leaves are the data items. At any intermediate step, the clusters so
far are different trees. When two clusters are merged by the
algorithm, the respective trees are merged by making the roots of
these trees the left and right children of a new node (so that the
total number of trees decreases by 1). In the end, the process yields a
single tree. See here for an example.
This algorithm closely resembles Kruskal's algorithm
for Minimum Spanning Tree construction. In Kruskal's algorithm, at each
step the smallest edge connecting two components is added till there is
only one component left. Each new edge added merges two components.
This can be seen as clustering where the similarity score between
two components is the negative of the length of the shortest edge between
any two points in the two components.
Many different clustering
algorithms can be obtained by varying the similarity measures between
clusters. Some examples of similarity measures are the following. Here,
A and B are two clusters, and sim(A,B) is the similarity measure between them. The similarity measure sim(u,v) between input data items u and v is specified as input. Typically, the similarity is a number between 0 and 1, with larger values implying greater similarity.
Single-linkage (Least similarity): sim(A,B) = min { sim(u,v) : u in A, v in B }
Complete-linkage (Maximum similarity): sim(A,B) = max { sim(u,v) : u in A, v in B }
Average-linkage: sim(A,B) = (1 / (|A| |B|)) * sum { sim(u,v) : u in A, v in B }
Centroid similarity: sim(A,B) is the similarity between the centroids (element-wise averages) of A and B, treating each item as a point in a vector space.
Ward's similarity: clusters A and B are considered more similar the smaller the increase in total within-cluster variance that would result from merging them.
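To make the merging process concrete, here is a minimal C++ sketch (one possible way to code it, not a required design) of agglomerative clustering using the single-linkage measure above. The toy 4-item similarity matrix, the Node/clusterSim names, and the quadratic scan over cluster pairs are illustrative choices only; reading artists50.in/dataAm50.in, the other four measures, and printing the tree are left out.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

// One node of the clustering tree: leaves are data items, internal nodes record merges.
struct Node { int left, right, item; };   // left/right are child indices (-1 for a leaf)

// Single-linkage: similarity of two clusters = least similarity over all cross pairs.
double clusterSim(const vector<int>& A, const vector<int>& B,
                  const vector<vector<double>>& sim) {
    double s = 1.0;                       // similarities are assumed to lie in [0,1]
    for (int u : A) for (int v : B) s = min(s, sim[u][v]);
    return s;
}

int main() {
    // Toy instance with 4 items; in the real project this matrix comes from dataAm*.in.
    vector<vector<double>> sim = {{1.0, 0.9, 0.1, 0.2},
                                  {0.9, 1.0, 0.2, 0.1},
                                  {0.1, 0.2, 1.0, 0.8},
                                  {0.2, 0.1, 0.8, 1.0}};
    int n = (int)sim.size();

    vector<Node> tree;               // grows as merges happen; the last entry is the root
    vector<vector<int>> members;     // data items in each currently active cluster
    vector<int> root;                // tree index of each active cluster's root
    for (int i = 0; i < n; ++i) {
        tree.push_back({-1, -1, i});
        members.push_back({i});
        root.push_back(i);
    }

    while (members.size() > 1) {
        // Find the pair of active clusters with the highest similarity (naive search).
        int bi = 0, bj = 1;
        double best = -1.0;
        for (int i = 0; i < (int)members.size(); ++i)
            for (int j = i + 1; j < (int)members.size(); ++j) {
                double s = clusterSim(members[i], members[j], sim);
                if (s > best) { best = s; bi = i; bj = j; }
            }
        // Merge them: their roots become the children of a new tree node.
        tree.push_back({root[bi], root[bj], -1});
        members[bi].insert(members[bi].end(), members[bj].begin(), members[bj].end());
        root[bi] = (int)tree.size() - 1;
        members.erase(members.begin() + bj);
        root.erase(root.begin() + bj);
        cout << "merged two clusters at similarity " << best << "\n";
    }
}
```

The naive scan recomputes cluster similarities from scratch at every step; identifying and removing this bottleneck is part of the efficiency discussion asked for below.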
Input: The input is a collection of 50 artists, specified in artists50.in. The similarity between each pair of artists is specified in the file dataAm50.in. Of these 50 artists, a collection of 25 are chosen, and the respective artists and similarity measures are in the files artists25.in and dataAm25.in. These data sets have been generated using the similar artists API here.
Output: Output one line per cluster, which contains the ids of the points belonging to that cluster. See below for an example.
Experiments:
Code
up agglomerative clustering with each of the above similarity measures
as different subroutines. Run these algorithms (5 of them) on the two
data sets and output the clustering tree. Rate the algorithms in terms
of running time and quality of solutions. Compare the robustness
of the different similarity measures by checking how similar the
clusterings are for the 25 items across the two data sets. In the first
data set, the 25 items are padded with 25 outliers, which can possibly
change the clustering of the 25 items themselves. A similarity measure
is robust if the clustering of these items does not change by much. How
will you quantitatively measure the robustness of the clustering?
For
assessing quality, you can use any available information about the
artists. What would be a good quantitative measure for quality of the
clustering? For your reference, here is a possible clustering of the 25
artists that was obtained using an entirely different clustering
algorithm:
Cluster 1: 2 3 4 10 11 12 13 14 17 18 22
Cluster 2: 15 19 20 21 23 24
Cluster 3: 5 6 7
Cluster 4: 0 1 8 9 16
This is the same clustering as above, but using the indices for the 50-artist data set:
Cluster 1: 18 25 5 8 20 23 44 10 31 19 34
Cluster 2: 49 26 37 38 48 36
Cluster 3: 12 13 14
Cluster 4: 16 4 3 15 29
In addition to the code, you need to produce a document explaining what you did. You will be graded on two dimensions.
- Efficiency: What
is the main bottleneck in the running time? Can you devise data
structures or heuristics to avoid these bottlenecks? How do the
heuristics affect the quality of the solution produced?
- Experiments and Discussion: How do the different clustering measures compare in terms of quality of solutions and robustness? What kinds of data
are desirable for each similarity measure? What are the drawbacks of
each similarity measure?
Part II: Center-based Clustering
The
agglomerative clustering method discussed above constructs a tree of
clusters, where the leaves are the data items. In center-based
clustering, the items are endowed with a distance function instead of a
similarity function, so that the more similar two items are, the
shorter their distance is. The output of the clustering algorithm is K
centers (which are quite often data items themselves). Each center
serves as the representative of a cluster. Each data item is placed in
the cluster whose corresponding center it is closest to. The goal is to
find those centers such that each data item travels as little distance
as possible to reach its center: In some sense, this would correspond
to each cluster being as tight and closely-knit as possible around the
corresponding center.
Formally, the K-medians problem is defined as follows: let X denote the collection of data items, so that |X| = n. For data items u,v, let d(u,v) denote the distance between u and v. The goal of K-medians clustering is to find a subset C of X, so that |C| = K, and the following function is optimized: Cost(C) = sum_{u in X} min_{c in C} d(u,c).
Basically, for any collection C of centers, each data point u is assigned to the closest center. The objective sums this distance over all u, and the goal is to find that collection C for which this objective is minimized.
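For reference, the objective is straightforward to evaluate once the distances are in memory. The sketch below (the `cost` name and the toy 3-point distance matrix are just for illustration) computes Cost(C) exactly as defined above; you will want some such routine for the experiments in this part.

```cpp
#include <algorithm>
#include <iostream>
#include <limits>
#include <vector>
using namespace std;

// Cost(C): each point pays the distance to its nearest center in C.
double cost(const vector<vector<double>>& d, const vector<int>& C) {
    double total = 0.0;
    for (size_t u = 0; u < d.size(); ++u) {
        double nearest = numeric_limits<double>::max();
        for (int c : C) nearest = min(nearest, d[u][c]);
        total += nearest;
    }
    return total;
}

int main() {
    // Toy 3-point instance; d[u][v] is the distance between items u and v.
    vector<vector<double>> d = {{0, 1, 4}, {1, 0, 2}, {4, 2, 0}};
    cout << cost(d, {0}) << " " << cost(d, {0, 2}) << "\n";   // prints "5 1"
}
```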
Since solving this problem seems to require searching over all possible choices of K
centers, it is NP-Complete. (This observation is not a formal proof; to obtain one, try
a reduction from the set cover problem.) In view of the
difficulty in finding the "optimal" clustering, the goal is to design
heuristics that work reasonably fast and output a clustering of
reasonable quality. We list below several approaches to solving
this problem. Your goal is to code these approaches up, and
compare their performance in terms of running time and solution quality
on several data sets.
K-Means Algorithm:
- Choose K centers at random from X. Call this set C.
- Map each point in X to its closest center in C. This yields a set of clusters, one for each center.
- For each cluster so obtained, compute the 1-median. This
is that point in the cluster for which the sum of the distances to the
remaining points in the cluster is as small as possible. This yields
one 1-median for each of the K clusters. These are the new centers; call this set C.
- Repeat Step (2) using the new set C.
- In each iteration, keep track of the objective function for the current choice of C. Stop when the objective does not decrease by much from one iteration of Steps (2-4) to the next. (Why does K-Means eventually terminate?)
- Repeat for a few random choices in Step (1), and take the best solution so obtained. (Why is this step important?)
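A minimal sketch of one iteration of Steps (2)-(3), assuming the full pairwise distance matrix is available as `d` and representing centers by their indices into X, might look as follows (the function name and the no-empty-cluster simplification are illustrative; random restarts, the stopping test, and I/O are omitted):

```cpp
#include <limits>
#include <vector>
using namespace std;

// One pass of Steps (2)-(3): assign every point to its closest center, then replace
// each center by the 1-median of its cluster. d is the full pairwise distance matrix,
// and centers holds indices of the current centers (assumed distinct, so that no
// cluster comes out empty).
vector<int> kMediansStep(const vector<vector<double>>& d, const vector<int>& centers) {
    int n = (int)d.size(), K = (int)centers.size();

    // Step (2): map each point to its closest center.
    vector<vector<int>> clusters(K);
    for (int u = 0; u < n; ++u) {
        int best = 0;
        for (int j = 1; j < K; ++j)
            if (d[u][centers[j]] < d[u][centers[best]]) best = j;
        clusters[best].push_back(u);
    }

    // Step (3): the 1-median of a cluster is the member minimizing the sum of
    // distances to the other members.
    vector<int> newCenters(K);
    for (int j = 0; j < K; ++j) {
        double bestSum = numeric_limits<double>::max();
        for (int cand : clusters[j]) {
            double s = 0.0;
            for (int v : clusters[j]) s += d[cand][v];
            if (s < bestSum) { bestSum = s; newCenters[j] = cand; }
        }
    }
    return newCenters;
}
```

Repeating this step while tracking Cost(C) gives the whole algorithm; note that the 1-median recomputation is quadratic in the cluster size, which is the first place to look when addressing efficiency.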
Local Search Algorithm:
This algorithm is somewhat different from K-means. An analysis of this algorithm is presented in
this paper. Let Cost(C) = sum_{u in X} min_{c in C} d(u,c) denote the objective function when the set of centers is C.
- Initially, C is chosen as in Step (1) above.
- For each u in X and c in C, compute Cost(C - {c} + {u}). Find (u,c) for which this cost is smallest, and replace C with C - {c} + {u}.
- Stop when Cost(C) does not change significantly after one iteration of Step (2).
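A naive version of one local-search round, under the same assumptions as the K-Means sketch above (distance matrix `d` in memory, centers stored as indices), could look like this; the `cost` and `bestSwap` names are illustrative.

```cpp
#include <algorithm>
#include <limits>
#include <vector>
using namespace std;

// K-medians objective: each point pays the distance to its nearest center in C.
double cost(const vector<vector<double>>& d, const vector<int>& C) {
    double total = 0.0;
    for (size_t u = 0; u < d.size(); ++u) {
        double nearest = numeric_limits<double>::max();
        for (int c : C) nearest = min(nearest, d[u][c]);
        total += nearest;
    }
    return total;
}

// One round of Step (2): try every swap of a current center for a non-center point
// and apply the best one. Returns false if no swap improves the objective.
bool bestSwap(const vector<vector<double>>& d, vector<int>& C) {
    int n = (int)d.size();
    vector<int> base = C;                        // centers before any tentative swap
    double best = cost(d, C);
    int bestI = -1, bestU = -1;
    for (int i = 0; i < (int)C.size(); ++i) {
        for (int u = 0; u < n; ++u) {
            if (find(base.begin(), base.end(), u) != base.end()) continue;  // already a center
            C[i] = u;                            // tentatively swap u in for center i
            double val = cost(d, C);
            if (val < best) { best = val; bestI = i; bestU = u; }
        }
        C[i] = base[i];                          // undo the tentative swap
    }
    if (bestI < 0) return false;
    C[bestI] = bestU;                            // apply the best improving swap
    return true;
}
```

Each round examines O(nK) candidate swaps and re-evaluates the full objective for each, which is precisely the bottleneck the next question asks you to attack.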
How can you speed up the swap operation by avoiding searching over all (c,u) pairs? Is this scheme superior in practice to simply repeating K-Means with different random starting points? What about in theory?
Online Algorithm:
The following is based on a clever algorithm due to
Adam Meyerson and is presented in Section 2 of
this paper. The algorithm assigns a cost f for each center, and finds a set of centers C that optimizes Cost(C) + f * |C|, where Cost(C) = sum_{u in X} min_{c in C} d(u,c) as before. It then tries out different values of f till it finds one that opens around K centers.
- Start with f being very small (close to zero), and let Q = 1.1 (or any other number slightly bigger than 1).
- Order the points randomly.
- The set C is initially empty.
- Consider the points in the random order constructed in Step (2).
- If the distance from the current point to the closest center is D, then with probability min(D/f, 1) add this point to the set C.
- When all points have been considered, if |C| > K, then increase f by a factor of Q and repeat Steps (3-6); else output C and the associated clustering and STOP.
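Here is one possible sketch of Steps (1)-(6) for 2-D points in the Part II input format; the Pt struct, the function names, the starting facility cost, and the fixed random seed are illustrative choices, and the final assignment of points to the returned centers is omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <random>
#include <vector>
using namespace std;

struct Pt { double x, y; };

double dist(const Pt& a, const Pt& b) { return hypot(a.x - b.x, a.y - b.y); }

// One pass over the points in a fixed random order: open the current point as a
// center with probability min(D/f, 1), where D is its distance to the nearest
// center opened so far (the very first point is always opened, since D is infinite).
vector<int> onePass(const vector<Pt>& pts, const vector<int>& order, double f, mt19937& rng) {
    uniform_real_distribution<double> coin(0.0, 1.0);
    vector<int> C;
    for (int idx : order) {
        double D = numeric_limits<double>::max();
        for (int c : C) D = min(D, dist(pts[idx], pts[c]));
        if (coin(rng) < min(D / f, 1.0)) C.push_back(idx);
    }
    return C;
}

// Outer loop (Steps 1-6): grow the facility cost f by a factor Q until at most K
// centers are opened for the same random order of the points.
vector<int> meyersonCenters(const vector<Pt>& pts, int K, double f = 1e-6, double Q = 1.1) {
    mt19937 rng(12345);                           // fixed seed, for reproducibility only
    vector<int> order(pts.size());
    for (int i = 0; i < (int)order.size(); ++i) order[i] = i;
    shuffle(order.begin(), order.end(), rng);     // Step (2): one random order, reused

    vector<int> C = onePass(pts, order, f, rng);
    while ((int)C.size() > K) {                   // Step (6): too many centers opened
        f *= Q;
        C = onePass(pts, order, f, rng);
    }
    return C;
}
```

Note that the random order is drawn once and reused for every value of f; rethinking how much of the work really has to be redone when f changes is the optimization hinted at below.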
What
heuristic can you implement after Step (6) to improve the quality of
the solution produced? What is the running time of the above algorithm?
How can you optimize the performance by avoiding repeating all
computations for every value of
f?
Greedy Clustering:
Consider the following clustering algorithm:
- Start with an arbitrary point (maybe randomly chosen) and include it in C.
- Repeat K-1 times: Pick that data point for which the closest point in C is as far away as possible, and include this point in C.
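A sketch of this farthest-point rule, again for 2-D points and with the same illustrative Pt/dist helpers (repeated so the snippet stands alone):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
using namespace std;

struct Pt { double x, y; };
double dist(const Pt& a, const Pt& b) { return hypot(a.x - b.x, a.y - b.y); }

// Greedy (farthest-point) clustering: start from one center, then K-1 times add the
// point whose distance to its nearest already-chosen center is largest.
vector<int> greedyCenters(const vector<Pt>& pts, int K, int start = 0) {
    int n = (int)pts.size();
    vector<int> C = {start};
    vector<double> nearest(n);                    // distance of each point to the set C
    for (int u = 0; u < n; ++u) nearest[u] = dist(pts[u], pts[start]);

    for (int t = 1; t < K; ++t) {
        int far = 0;
        for (int u = 1; u < n; ++u)               // farthest point from the current centers
            if (nearest[u] > nearest[far]) far = u;
        C.push_back(far);
        for (int u = 0; u < n; ++u)               // update distances to the enlarged C
            nearest[u] = min(nearest[u], dist(pts[u], pts[far]));
    }
    return C;
}
```

Maintaining each point's distance to its nearest chosen center incrementally, as above, keeps the total work linear in n per added center.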
What
performance metric is this algorithm designed to optimize? Note that it
does not optimize for the K-medians objective, but something slightly
different. What is the running time of the algorithm, and how do the
solutions produced compare with the previous algorithms?
Faster Implementations:
One
can obtain faster implementations of all the above algorithms by
randomly sampling a smaller "representative" set of points and
clustering just these. An algorithm based on sampling and clustering is presented in Section 4 of
this paper.
After Step (6), you can run a K-medians algorithm on the centers to
obtain exactly K centers. Run this algorithm using the above algorithms
as a subroutine. What is the degradation in performance? What is the
gain in running time?
Input: Use the code here to generate point sets with around 1000 to 10,000 points, which are mixtures of K Gaussians in 2 dimensions. (Remove the #include "malloc.h" from the source code if you get a compilation error.) Choose K
= 3 or 4. As a standard input format for the clustering procedure, the
first line of input should have the number of points, and then, each
new line should be the x and y coordinate (separated by spaces) of a
new point.
Output:
The output should be the same format as the input, but each line should
be pre-pended with the cluster number (use numbers 1,2,3,...) to which
that point belongs. You can visualize the data, as well as the clusters
using
either gnuplot, or any other software (Matlab, Mathematica) you are
comfortable with. No points will be given for developing fancy visual interfaces.
Experiments: Compare the performance of K-medians and agglomerative clustering methods as a function of:
- How elliptical the Gaussians are. This can be controlled by making the
covariance term comparable in magnitude to the variance terms (or by making
the two variance terms very different). K-medians works best when the
Gaussians are nearly spherical, that is, when the variance terms are
comparable to each other and the covariance term is small.
- The separation between the means
of the Gaussians. If the means are close by, the Gaussians overlap and
this confuses most clustering algorithms.
- The
difference between the weights of the clusters. If there is a cluster
with very small weight, many algorithms have trouble discerning it as a
separate cluster.
Things to Ponder: What
is the trade-off between the running time and performance of the
various implementations of K-medians? How does the faster
implementation scale compared to the rest as the data set size
increases? How can the different implementations be combined to improve
performance? Read and understand the
theoretical analyses/properties of K-means, local search, the online
algorithm, the greedy algorithm, and the
fast implementation, and summarize these in your own words. Why do
K-Means and local search procedures eventually terminate? When is
K-medians superior to agglomerative methods, and vice versa? How can we
choose the "correct" value of K in K-median clustering methods?
As
in part I, you will need to produce the code and a document detailing
what you did. You will be graded on how you addressed efficiency
issues, and on the quality of experiments and discussion.
Part III: Mixture Models
Until
now, we assumed no information about what the true underlying
clustering was. Suppose instead someone told us that the data was
generated by a mixture of Gaussians (as in the code used to generate
the input above). Can this information be used to devise a better
clustering algorithm? The answer is the EM algorithm, which generalizes
the K-Means algorithm described above. Instead of starting with a crude
guess of the centers and refining these iteratively, the EM algorithm
works as follows (descriptions here, here, and here):
- Start with any estimate of the mean and covariance matrices of each of the K Gaussians.
- Compute (by Bayes' rule) the probability of each data point being generated by each of the K Gaussians.
- Use these probabilities to assign each data point probabilistically to one of K clusters.
- Compute
the sample mean and sample covariance matrices of each of these
clusters, and use these new estimates to repeat Steps (2-4).
- What termination criterion must one use here?
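To make Steps (2)-(4) concrete, here is a minimal C++ sketch of one EM iteration for a mixture of 2-D Gaussians. The Pt/Gauss structs, the explicit 2x2 covariance parameterization, and the absence of any underflow safeguards are simplifications for illustration; initialization (Step 1), the termination test (Step 5), and the final hard assignment of points to clusters are left to you.

```cpp
#include <cmath>
#include <vector>
using namespace std;

const double PI = 3.14159265358979323846;

struct Pt { double x, y; };

// One 2-D Gaussian component: mean (mx,my), covariance [[sxx, sxy], [sxy, syy]], weight w.
struct Gauss { double mx, my, sxx, sxy, syy, w; };

// Density of a 2-D Gaussian at point p, using the closed-form 2x2 inverse.
double pdf(const Gauss& g, const Pt& p) {
    double det = g.sxx * g.syy - g.sxy * g.sxy;
    double dx = p.x - g.mx, dy = p.y - g.my;
    double q = (g.syy * dx * dx - 2 * g.sxy * dx * dy + g.sxx * dy * dy) / det;
    return exp(-0.5 * q) / (2 * PI * sqrt(det));
}

// One EM iteration: the E-step computes responsibilities by Bayes' rule (Steps 2-3),
// the M-step refits each component's weight, mean, and covariance (Step 4).
void emStep(const vector<Pt>& pts, vector<Gauss>& comps) {
    int n = (int)pts.size(), K = (int)comps.size();
    vector<vector<double>> r(n, vector<double>(K));   // r[i][k] = P(component k | point i)

    for (int i = 0; i < n; ++i) {                     // E-step
        double z = 0.0;                               // (a robust version guards z against underflow)
        for (int k = 0; k < K; ++k) { r[i][k] = comps[k].w * pdf(comps[k], pts[i]); z += r[i][k]; }
        for (int k = 0; k < K; ++k) r[i][k] /= z;
    }

    for (int k = 0; k < K; ++k) {                     // M-step
        double W = 0, mx = 0, my = 0;
        for (int i = 0; i < n; ++i) { W += r[i][k]; mx += r[i][k] * pts[i].x; my += r[i][k] * pts[i].y; }
        mx /= W;  my /= W;
        double sxx = 0, sxy = 0, syy = 0;
        for (int i = 0; i < n; ++i) {
            double dx = pts[i].x - mx, dy = pts[i].y - my;
            sxx += r[i][k] * dx * dx;  sxy += r[i][k] * dx * dy;  syy += r[i][k] * dy * dy;
        }
        comps[k] = {mx, my, sxx / W, sxy / W, syy / W, W / n};
    }
}
```

If you freeze every covariance matrix to sigma^2 times the identity and replace the soft responsibilities with a hard assignment to the most likely component, the M-step reduces to recomputing cluster means, which is exactly the K-Means special case described next.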
Note
that K-Means is a special case of EM, where we assume the Gaussians are
spherical with the same covariance matrix, and deterministically assign
the data point to the most
likely Gaussian in Step (3). Therefore, K-Means works best when the
Gaussians are identical in shape and spherical, but not very well for
elliptical Gaussians of differing shapes. As with K-Means, this procedure
eventually terminates. (Try to understand why!)
Input/Output: The input and output formats should be the same as Part II.
Experiments:
Compare
the performance of the EM algorithm on the same data sets as were used
in Part II. How will you quantitatively measure the quality of the
clustering? Note that you should expect the EM algorithm to work
better, since it is assuming the data is drawn from a Gaussian mixture.
Construct a data set (clearly one that is not generated by a Gaussian
mixture model) where the EM algorithm does not work as well as
agglomerative clustering methods.
The criteria for grading are the same as for Parts I and II.
Part IV: MapReduce Implementations
This
part is optional, more open-ended, and for the adventurous. Suppose you
have a large amount of data that you need to cluster, and have a cloud
computing system available. What would be a good clustering algorithm
to implement; how would you implement it; and what would be the
efficiency considerations? This paper has a model for MapReduce, and this paper
has a possible algorithm for K-Medians that works with MapReduce. Can
you implement EM or agglomerative clustering efficiently on a MapReduce
system? I do not need the implementation; just a description of
the algorithm and an analysis of running time and solution quality will
do.
Grading
- Correctness of code: 20%
- Efficiency of implementation: 30%
- Experiments (creativity and conclusions drawn): 30%
- Document writing: 20%
General Comments:
- You can work in groups of two. In the document, please list the contributions of each team member.
- You can use C, C++, or Java for coding. I will not give extra points for fancy visualization.
No programming in Matlab or using pre-packaged clustering tools!!!
- We
will use an automated code checker to detect copying. You can discuss
ideas with each other, but code them up individually. Same holds for
writing the document.
- This project is open-ended, so there is
no single "right answer". It is mainly designed to give you an idea of
what algorithm design is like in real life.