
We are investigating a metric that measures the presence or absence of structure or patterns in data sets. Its purpose is to measure the strength of the association between two variables, and it generalizes the modern correlation coefficient in a few ways:

  • It applies to non-numeric data, for instance a list of pairs of keywords, with a number attached to each pair, measuring how close to each other the two keywords are.
  • It detects relationships that are not necessarily functional (for instance, points distributed in a very unusual domain such as a sphere that has holes in it, and where holes contain smaller spheres that are part of the domain itself).
  • It also works with traditional, numeric bivariate observations.

 Curious pattern: 3-D waves created by 2-D circular motions of each dot

The structuredness coefficient, which we denote as w, is not yet fully defined; we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working under the following framework:

  • We have a data set with n points. For simplicity, let's consider for now that these n points are n vectors (x, y) where x, y are real numbers.
  • For each pair of points {(x,y), (x',y')} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
  • We order all the distances d and compute the distance distribution, based on these n points.
  • Leave-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points.
  • We compare the distribution computed on n points with the n distributions computed on n-1 points.
  • We repeat this procedure, this time with n-2 points, then n-3, n-4 and so on.
  • You would assume that if there is no pattern, these distance distributions (for successive numbers of retained points) would show a behavior that uniquely characterizes the absence of structure, and that this behavior can be identified via simulations. Any deviation from it would indicate the presence of a structure. The pattern-free behavior would also be independent of the underlying point distribution or domain, which is a very important point. All of this would have to be established or tested, of course.
  • It would be interesting to test whether this metric can identify patterns such as fractal distribution / fractal dimension. Would it be able to detect patterns in time series?
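
The framework above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the definition of w: the article does not say how two distance distributions should be compared, so the sketch uses the largest gap between their empirical cumulative distributions (a Kolmogorov-Smirnov-style statistic). The function names, the sample size of 40 and the two test data sets (uniform noise and points on a circle) are all made up for the example.

```python
import math
import random
from itertools import combinations


def pairwise_distances(points):
    """Sorted Euclidean distances over all unordered pairs of 2-D points."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


def ks_gap(a, b):
    """Largest gap between the empirical CDFs of two sorted samples.

    Assumption: this Kolmogorov-Smirnov-style gap is one possible way to
    compare two distance distributions; the article does not prescribe one.
    """
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        # Advance both pointers past every value <= t, so that
        # i / len(a) and j / len(b) are the two CDF values at t.
        while i < len(a) and a[i] <= t:
            i += 1
        while j < len(b) and b[j] <= t:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def leave_one_out_gaps(points):
    """Compare the n-point distribution with each (n-1)-point distribution."""
    full = pairwise_distances(points)
    return [
        ks_gap(full, pairwise_distances(points[:k] + points[k + 1:]))
        for k in range(len(points))
    ]


rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumed test data: unstructured noise versus points lying on a circle.
noise = [(rng.random(), rng.random()) for _ in range(40)]
circle = [(math.cos(t), math.sin(t))
          for t in (rng.uniform(0, 2 * math.pi) for _ in range(40))]

for name, pts in (("noise", noise), ("circle", circle)):
    gaps = leave_one_out_gaps(pts)
    print(name, "mean leave-one-out gap:", round(sum(gaps) / len(gaps), 4))
```

Removing two or more points at a time follows the same pattern, with `combinations` used to choose the points to drop. Whether the resulting gaps actually separate structured from unstructured data is exactly what would have to be tested by simulation.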

Note that this type of structuredness coefficient makes no assumption about the shape of the underlying domains in which the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points etc. They might even be non-numeric domains (e.g. if the data consists of keywords).
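
To make the non-numeric case concrete, here is a minimal sketch in which the data are keywords and the only input is a table of pairwise distances, with no coordinates at all. The keywords, the distance values and the function names are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

# Hypothetical keyword distances (smaller means closer). These values are
# invented for the example and do not come from any real data set.
KEYWORDS = ["data", "science", "mining", "cooking"]
DIST = {
    frozenset(("data", "science")): 0.2,
    frozenset(("data", "mining")): 0.3,
    frozenset(("data", "cooking")): 0.9,
    frozenset(("science", "mining")): 0.4,
    frozenset(("science", "cooking")): 0.8,
    frozenset(("mining", "cooking")): 0.95,
}


def distance_distribution(items, dist):
    """Sorted distances over all unordered pairs, read from a lookup table."""
    return sorted(dist[frozenset(pair)] for pair in combinations(items, 2))


def leave_one_out(items, dist):
    """Distance distribution obtained after removing each item in turn."""
    return {
        left_out: distance_distribution(
            [k for k in items if k != left_out], dist)
        for left_out in items
    }


print("all keywords:", distance_distribution(KEYWORDS, DIST))
for left_out, distribution in leave_one_out(KEYWORDS, DIST).items():
    print("without", left_out, "->", distribution)
```

Because only the pairwise table is consulted, the same two functions would work for any kind of object for which a proximity measure is available.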

Originally posted on AnalyticBridge.com, by Dr. Vincent Granville
