Subscribe to our Newsletter

Guest blog post by Vincent Granville

Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.

Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?

clans_clusters


Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.
 

Methodology:

  • create a 2-dim table with the following rows: number of clusters in row #1, and percentage of variance explained by clusters in row #2. 
  • compute 3rd differences
  • maximum for 3rd differences (if much higher than other values) determine number of clusters

This is based on the fact that the piece-wise linear plot of number of cluster versus percentage of variance explained by clusters is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).

Example:
 1   2   3   4   5   6   7   8   9   ==> number of clusters

   40  65  80  85  88  90  91  91    ==> percentage of variance explained by clusters 

     25  15   5   3   2   1   0      ==> 1st difference     

      -10  -10  -2  -1  -1  -1       ==> 2nd difference

          0    8   1   0   0         ==> 3rd difference

The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.

Note:

If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.

E-mail me when people leave their comments –

You need to be a member of DataViz to add comments!

Join DataViz

Featured Blog Posts - DSC

Webinar Series

Follow Us

@DataScienceCtrl | RSS Feeds