### Visualization: From chaos to clusters - statistical modeling without models

Originally posted on AnalyticBridge.com by Dr. Vincent Granville

Here I provide the mathematics, explanations, and source code used to produce the data and the moving clusters in the From chaos to clusters video series.

A little bit of history on how the project started:

1. Interest in astronomy, visualization and how physics models apply to business problems
2. Interest in systems that produce clusters (as well as birth and death processes) and in visualizing cluster formation with videos rather than charts
3. Creating art: videos with sound and images, both synchronized and generated from data (coming soon). Maybe I'll be able to turn your business data into a movie (artistic, insightful, or both)! I'm already at the point where I can produce the video frames faster than they are delivered to the streaming device. I call this FRT, for faster than real time.

What is a statistical model without a model?

There's actually a generic mathematical model behind the algorithm, but nobody cares about the model: the algorithm was created first, without a mathematical model in mind. Initially, I had a gravitational model in mind, but I eventually abandoned it because it was not producing what I expected.

This illustrates a new trend in data science: we care less and less about modeling, and more and more about results. My algorithm has a number of parameters and features that can be fine-tuned to produce anything you want, be it a simulation of a Neyman-Scott cluster process or a simulation of some no-name stochastic process.

It's a bit similar to how modern climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to rock climbing on Mars in the future.

You can fine-tune the parameters to:

1. Achieve the best fit between simulated data and real business (or other) data, using traditional goodness-of-fit testing and sensitivity analysis. Note that the simulated data represents a realization (an instance, for object-oriented people) of a spatio-temporal stochastic process.
2. Once the parameters are calibrated, perform predictions (if you speak statistician language) or extrapolations (if you speak mathematician language).
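To make step 1 concrete, here is a minimal Python sketch of one way to score the fit between a simulated and an observed point pattern: compare their nearest-neighbor distance distributions with a two-sample Kolmogorov-Smirnov statistic. The helper names (`nearest_neighbor_distances`, `ks_statistic`) and the choice of summary statistic are my own illustration, not part of the original Perl code.

```python
import random

def nearest_neighbor_distances(points):
    """Distance from each point to its closest other point."""
    dists = []
    for i, (x, y) in enumerate(points):
        best = min(((x - p) ** 2 + (y - q) ** 2) ** 0.5
                   for j, (p, q) in enumerate(points) if j != i)
        dists.append(best)
    return sorted(dists)

def ks_statistic(sample_a, sample_b):
    """Max gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in sorted(set(a + b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

random.seed(1)
observed  = [(random.random(), random.random()) for _ in range(100)]
simulated = [(random.random(), random.random()) for _ in range(100)]
d = ks_statistic(nearest_neighbor_distances(observed),
                 nearest_neighbor_distances(simulated))
print(round(d, 3))  # smaller gap = better fit between the two point patterns
```

In a calibration loop, you would re-run the cluster simulation for each candidate parameter set and keep the one minimizing this statistic.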

So how does the algorithm work?

It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points attract each other (attraction is stronger between close neighbors), so over time they group into clusters.

The algorithm has the following components:

1. Creation of n random fixed points (n=100) on [-0.5, 1.5] x [-0.5, 1.5]. This window is four times bigger than the one containing the mobile points, to eliminate edge effects on the mobile points. These fixed points (they never move) also act as a sort of dark matter: they are invisible, never shown in the video, but they are the glue that prevents the whole system from collapsing onto itself and converging to a single point.
2. Creation of m random mobile points (m=500) on [0,1] x [0,1].
3. Main loop (200 iterations). At each iteration, we compute the distance d between each mobile point (x,y) and each of its m-1 mobile neighbors and n fixed neighbors. A weight w is computed as a function of d, with a special weight for the point (x,y) itself. The updated (x,y) is then the weighted sum aggregated over all points, and we do this for each point (x,y) at each iteration. The weights are chosen so that their sum over all points is always 1. In other words, we replace each point with a convex linear combination of all points.
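The three steps above can be sketched in Python as follows. This is a loose translation of the algorithm, not the original Perl source: the inverse-power decay in `weight` and the value of `self_weight` are my assumptions, since the article does not specify the exact weight function.

```python
import random

def weight(d, decay=2.0, eps=0.05):
    """Attraction weight as a function of distance: closer points weigh more.
    The inverse-power decay is an assumption, not the author's exact formula."""
    return 1.0 / (eps + d) ** decay

def simulate(n_fixed=100, m_mobile=500, iterations=200, self_weight=5.0):
    # Step 1: fixed "dark matter" points on the larger window
    fixed  = [(random.uniform(-0.5, 1.5), random.uniform(-0.5, 1.5))
              for _ in range(n_fixed)]
    # Step 2: mobile points on the unit square
    mobile = [(random.random(), random.random()) for _ in range(m_mobile)]
    # Step 3: main loop, replacing each point by a convex combination
    for _ in range(iterations):
        updated = []
        for i, (x, y) in enumerate(mobile):
            pts, wts = [(x, y)], [self_weight]   # special weight for the point itself
            for j, (p, q) in enumerate(mobile):
                if j != i:
                    d = ((x - p) ** 2 + (y - q) ** 2) ** 0.5
                    pts.append((p, q)); wts.append(weight(d))
            for (p, q) in fixed:
                d = ((x - p) ** 2 + (y - q) ** 2) ** 0.5
                pts.append((p, q)); wts.append(weight(d))
            total = sum(wts)                     # normalize so weights sum to 1:
            wts = [w / total for w in wts]       # the update is a convex combination
            updated.append((sum(w * p for w, (p, q) in zip(wts, pts)),
                            sum(w * q for w, (p, q) in zip(wts, pts))))
        mobile = updated
    return mobile
```

Because every update is a convex combination of points inside [-0.5, 1.5] x [-0.5, 1.5], the mobile points can never leave that window; each frame of the video would be a snapshot of `mobile` at one iteration.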

Special features

• If the weight for (x,y) [the point being updated] is very high at a given iteration, then (x,y) will barely move.
• We have tested negative weights (especially for the point being updated) and we liked the results better. A delicate amount of negative weight also further prevents the system from collapsing, and introduces a bit of chaos.
• Occasionally, one point is replaced by a brand new, random point, rather than updated using the weighted sum of neighbors. We call this event a "birth". It happens for less than 1% of all point updates, and it happens more frequently at the beginning. Of course, you can play with these parameters.

In the source code, the birth process (for point $k) is simply encoded as:

```perl
if (rand() < 0.1/(1+$iteration)) {  # birth and death: probability decays over iterations
    $tmp_x[$k] = rand();            # re-spawn point $k at a random location
    $tmp_y[$k] = rand();
    $rebirth[$k] = 1;               # flag the point as reborn
}
```

In the source code, in the inner loop over $k, the point ($x, $y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). Also, in a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point in the weighted average formula used to update ($x, $y). The function distance accepts four arguments ($x, $y, $p, $q), computes the distance d between the two points, and returns $weight, the weight w.
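The `distance` routine described above can be mirrored in Python as a small function taking the same four arguments and returning the neighbor's unnormalized weight. The decay shape is again an assumption; only the signature (point being updated, then neighbor) comes from the description of the Perl code.

```python
import math

def distance_weight(x, y, p, q, decay=2.0, eps=0.05):
    """Mirror of the Perl `distance` function: takes the point being updated
    (x, y) and a neighbor (p, q), computes the distance d between them, and
    returns the neighbor's (unnormalized) weight. The inverse-power decay
    with a small eps floor is an assumed choice, not the author's formula."""
    d = math.hypot(x - p, y - q)
    return 1.0 / (eps + d) ** decay
```

The eps floor keeps the weight finite when a neighbor coincides with the point being updated; dividing each weight by the sum over all neighbors then yields the convex combination used in the update step.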
