Guest blog post by Vincent Granville
Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.
Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?
Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.
- create a 2-dim table with the following rows: number of clusters in row #1, and percentage of variance explained by clusters in row #2.
- compute 3rd differences
- maximum for 3rd differences (if much higher than other values) determine number of clusters
This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).
1 2 3 4 5 6 7 8 9 ==> number of clusters
40 65 80 85 88 90 91 91 ==> percentage of variance explained by clusters
25 15 5 3 2 1 0 ==> 1st difference
-10 -10 -2 -1 -1 -1 ==> 2nd difference
0 8 1 0 0 ==> 3rd difference
The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.
If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.