K-means: widely used clustering technique! ,Initialization: blind random on input data!
Drawback: very sensitive to choice of initial clustercenters (seeds)!
Local optimal can be arbitrarily bad wrt. objective function, compared to global optimal clustering
Idea: spread the k initial cluster centers away from each other.!
O(log k)-competitive with the optimal clustering" substantial convergence time speedups (empirical)!
C - Sample a point uniformly at random from X
While `C´ < k do
Sample x € X with probability prop, to DSquare (x)
c <- C U {x}
end while
c € c: Cluster Center
x € X: Data Point'D(x) distance between x and nearest Ck that has already chosen
Test dataset
200 Clustering runs, each with and without k-means initialization
Measure RSS (Intra-Class variance)
K.Means optimal clustering 115 times (57.5%)
Implementation Test Dataset: 4 Square (n=16)

Expected: 4 nice Cluster



Evaluation on Test Dataset!
• 200 clustering runs, each with and without kmeans++ initialization!
• Measure RSS (intra-class variance)!
• K-means! optimal clustering 115 times (57.5%) !
• K-means++ ! optimal clustering 182 times (91%)!
Comparison of the frequency distribution of RSS values between k-means and k-means
++ on the evaluation dataset (n=200)!

Comparison of the frequency distribution of RSS values between k-means and k-means
++ on the UCI real world dataset (n=500)!

Drawback: very sensitive to choice of initial clustercenters (seeds)!
Local optimal can be arbitrarily bad wrt. objective function, compared to global optimal clustering
Idea: spread the k initial cluster centers away from each other.!
O(log k)-competitive with the optimal clustering" substantial convergence time speedups (empirical)!
C - Sample a point uniformly at random from X
While `C´ < k do
Sample x € X with probability prop, to DSquare (x)
c <- C U {x}
end while
c € c: Cluster Center
x € X: Data Point'D(x) distance between x and nearest Ck that has already chosen
Test dataset
200 Clustering runs, each with and without k-means initialization
Measure RSS (Intra-Class variance)
K.Means optimal clustering 115 times (57.5%)
Implementation Test Dataset: 4 Square (n=16)
Expected: 4 nice Cluster
Evaluation on Test Dataset!
• 200 clustering runs, each with and without kmeans++ initialization!
• Measure RSS (intra-class variance)!
• K-means! optimal clustering 115 times (57.5%) !
• K-means++ ! optimal clustering 182 times (91%)!
Comparison of the frequency distribution of RSS values between k-means and k-means
++ on the evaluation dataset (n=200)!
Comparison of the frequency distribution of RSS values between k-means and k-means
++ on the UCI real world dataset (n=500)!
Comments
Post a Comment