*Guest blog post by Vincent Granville*

Here I propose a solution that can be automated and does not require visual inspection by a human being. The solution can thus be applied to billions of clustering problems, automated and processed in batch mode.

Note that the concept of cluster is a fuzzy one. How do you define a cluster? How many clusters in the chart below?

Nevertheless, in many applications, there's a clear optimum number of clusters. The methodology described here will solve all easy and less easy cases, and will provide a "don't know" answer to cases that are ambiguous.** **

**Methodology**:

- create a 2-dim table with the following rows: number of clusters in row #1, and percentage of variance explained by clusters in row #2.
- compute 3rd differences
- maximum for 3rd differences (if much higher than other values) determine number of clusters

This is based on the fact that the piece-wise linear plot of *number of cluster* versus *percentage of variance explained by clusters* is a convex function with an elbow point, see chart below. The elbow point determines the optimum number of clusters. If the piece-wise linear function is approximated by a smooth curve, the optimum would be the point vanishing the 4-th derivative of the approximating smooth curve. This methodology is simply an application of this "elbow detection" technique in a discrete framework (the number of clusters being a discrete number).

**Example**:

1 2 3 4 5 6 7 8 9 ==> number of clusters

40 65 80 85 88 90 91 91 ==> percentage of variance explained by clusters

25 15 5 3 2 1 0 ==> 1st difference

-10 -10 -2 -1 -1 -1 ==> 2nd difference

0 8 1 0 0 ==> 3rd difference

The optimum number of cluster in this example is 4, corresponding to maximum = 8 in the 3rd differences.

**Note**:

If you have already a strong minimum in the 2nd difference (not the case here), you don't need to go to 3rd difference: stop at level 2.

## Comments