We are investigating a metric that measures the presence or absence of a structure or pattern in data sets. Its purpose is to measure the strength of the association between two variables, and it generalizes the modern correlation coefficient in a few ways:

- It applies to non-numeric data, for instance a list of pairs of keywords, with a number attached to each pair measuring how close the two keywords are to each other.
- It detects relationships that are not necessarily functional (for instance, points distributed in a very unusual domain such as a sphere that has holes in it, where the holes contain smaller spheres that are part of the domain itself).
- It also works with traditional, numeric bivariate observations.

*Curious pattern: 3-D waves created by 2-D circular motions of each dot*

The structuredness coefficient, let's denote it as *w*, is not yet fully defined; we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working under the following framework:

- We have a data set with n points. For simplicity, let's assume for now that these n points are n vectors (x, y) where x, y are real numbers.
- For each pair of points {(x,y), (x',y')} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
- We order all the distances d and compute the distance distribution, based on these n points.
- Leaving-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points.
- We compare the distribution computed on n points with the n distributions computed on n-1 points.
- We repeat this procedure, but this time with n-2, then n-3, n-4 points, etc.
- If there is no pattern, one would expect these distance distributions (for successive values of n) to exhibit a behavior that uniquely characterizes the absence of structure, and that can be identified via simulations. Any deviation from this behavior would indicate the presence of a structure. The pattern-free behavior would also be independent of the underlying point distribution or domain, which is a very important point. All of this would have to be established or tested, of course.
- It would be interesting to test whether this metric can identify patterns such as a fractal distribution or fractal dimension. Would it be able to detect patterns in time series?
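
The leave-one-out steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the definition of *w*. The post does not say how two distance distributions should be compared, so the sketch uses the largest gap between their empirical cumulative distribution functions (a Kolmogorov-Smirnov-style statistic). The function names `pairwise_distances`, `ecdf_gap` and `leave_one_out_gaps` are hypothetical, and the `distance` argument is a placeholder that could be swapped for a keyword proximity metric.

```python
import itertools
import math
import random
from bisect import bisect_right


def pairwise_distances(points, distance=math.dist):
    """Sorted distances between all unordered pairs of points."""
    return sorted(distance(p, q) for p, q in itertools.combinations(points, 2))


def ecdf_gap(a, b):
    """Largest vertical gap between the empirical CDFs of two sorted samples.

    Assumption: this comparison rule is our choice, not taken from the post.
    """
    gap = 0.0
    for t in a + b:
        fa = bisect_right(a, t) / len(a)  # share of distances in a that are <= t
        fb = bisect_right(b, t) / len(b)  # share of distances in b that are <= t
        gap = max(gap, abs(fa - fb))
    return gap


def leave_one_out_gaps(points, distance=math.dist):
    """Compare the distribution on n points with each of the n distributions on n-1 points."""
    full = pairwise_distances(points, distance)
    gaps = []
    for i in range(len(points)):
        reduced = pairwise_distances(points[:i] + points[i + 1:], distance)
        gaps.append(ecdf_gap(full, reduced))
    return gaps


# Two toy data sets with n = 40: unstructured noise, and points on a circle.
n = 40
rng = random.Random(0)
noise = [(rng.random(), rng.random()) for _ in range(n)]
circle = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

print("mean gap, uniform noise:", sum(leave_one_out_gaps(noise)) / n)
print("mean gap, circle:       ", sum(leave_one_out_gaps(circle)) / n)
```

Repeating the same comparison after removing two, three, or more points, and over many simulated pattern-free data sets, would be one way to explore whether a characteristic "no structure" behavior exists. Whether this particular gap statistic separates structured from unstructured data is exactly the kind of question that still has to be tested.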

Note that this type of structuredness coefficient makes no assumption about the shape of the underlying domains where the n points are located. These domains could be smooth, bumpy, made up of lines, made up of dual points, etc. They might not even be numeric at all (e.g. if the data consists of keywords).

*Originally posted on AnalyticBridge.com, by Dr. Vincent Granville*
