Originally posted on AnalyticBridge.com by Dr. Vincent Granville

Here I provide the mathematics, explanations, and source code used to produce the data and the moving clusters in the "From chaos to clusters" video series.

A little bit of history on how the project started:

  1. Interest in astronomy, visualization and how physics models apply to business problems
  2. Research on how urban growth could be modeled by the gravitational law
  3. Interest in systems that produce clusters (as well as birth and death processes) and in visualizing cluster formation with videos rather than charts
  4. Creating art: videos with synchronized sound and images, both generated from data (coming soon). Maybe I'll be able to turn your business data into a movie (artistic, insightful, or both)! I'm already at the point where I can produce video frames faster than they are delivered to the streaming device; I call this FRT, for faster than real time.

What is a statistical model without a model?

There is actually a generic mathematical model behind the algorithm, but nobody cares about the model: the algorithm was created first, without a mathematical model in mind. Initially I had a gravitational model in mind, but I eventually abandoned it because it did not produce what I expected.

This illustrates a new trend in data science: we care less and less about modeling, and more and more about results. My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want - be it a simulation of a Neyman-Scott cluster process, or a simulation of some no-name stochastic process.

It's a bit similar to how modern rock climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to rock climbing on Mars in the future.

You can fine-tune the parameters to:

  1. Achieve the best fit between simulated data and real business data (or other data), using traditional goodness-of-fit testing and sensitivity analysis (a minimal sketch follows this list). Note that the simulated data represents a realization (an instance, for object-oriented people) of a spatio-temporal stochastic process.
  2. Once the parameters are calibrated, perform predictions (if you speak statistician language) or extrapolations (if you speak mathematician language).
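
As a crude illustration of step 1 (not taken from the author's source code), one could bin both point sets on a grid and compare cell counts. The grid size, the chi-square-like score, and the function name chi_square_cells below are all my assumptions:

# Hypothetical goodness-of-fit check: bin simulated and observed points
# on a g x g grid over [0,1] x [0,1] and compare the cell counts.
sub chi_square_cells {
  my ($g, $sim_x, $sim_y, $obs_x, $obs_y) = @_;   # array references
  my (@sim, @obs);
  for my $i (0 .. $#$sim_x) {
    my $cx = int($g * $sim_x->[$i]); $cx = 0 if $cx < 0; $cx = $g - 1 if $cx >= $g;
    my $cy = int($g * $sim_y->[$i]); $cy = 0 if $cy < 0; $cy = $g - 1 if $cy >= $g;
    $sim[$cx][$cy]++;
  }
  for my $i (0 .. $#$obs_x) {
    my $cx = int($g * $obs_x->[$i]); $cx = 0 if $cx < 0; $cx = $g - 1 if $cx >= $g;
    my $cy = int($g * $obs_y->[$i]); $cy = 0 if $cy < 0; $cy = $g - 1 if $cy >= $g;
    $obs[$cx][$cy]++;
  }
  my $chi2 = 0;
  for my $cx (0 .. $g - 1) {
    for my $cy (0 .. $g - 1) {
      my $s = $sim[$cx][$cy] // 0;
      my $o = $obs[$cx][$cy] // 0;
      $chi2 += ($s - $o)**2 / ($o + 1);   # +1 avoids division by zero
    }
  }
  return $chi2;   # smaller score = better fit
}

One would rerun the simulation with new parameter values and keep the set that minimizes the score; varying one parameter at a time gives a simple form of sensitivity analysis.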

So how does the algorithm work?

It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points are attracted to each other (attraction is stronger between close neighbors), and thus, over time, they group into clusters.

The algorithm has the following components:

  1. Creation of n random fixed points (n=100) on [-0.5, 1.5] x [-0.5, 1.5]. This window is 4 times bigger than the one containing the mobile points, to eliminate edge effects impacting the mobile points. These fixed points (they never move) also act as some sort of dark matter: they are invisible, they are not represented in the video, but they are the glue that prevents the whole system from collapsing onto itself and converging to a single point.
  2. Creation of m random mobile points (m=500) on [0,1] x [0,1].
  3. Main loop (200 iterations). At each iteration, we compute the distance d between each mobile point (x,y) and each of its m-1 mobile neighbors and n fixed neighbors. A weight w is computed as a function of d, with a special weight for the point (x,y) itself. The updated (x,y) is then the weighted sum aggregated over all points, and we do this for each point (x,y) at each iteration. The weight is such that the sum of weights over all points is always 1. In other words, we replace each point with a convex linear combination of all points (see the sketch after this list).
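
Below is a minimal sketch of this main loop, consistent with the description above. The structure (n fixed points, m mobile points, weighted averages, synchronous updates) comes from the article; the decay formula inside the distance function is my assumption, as the actual formula lives in the linked source code, and the birth-and-death step discussed below is omitted:

#!/usr/bin/perl
# Sketch of the clustering algorithm: n invisible fixed points, m mobile
# points, each mobile point repeatedly replaced by a weighted average of
# all points. The decay formula in distance() is assumed, not the
# author's actual formula.
use strict; use warnings;

my $n = 100;      # fixed points (the invisible "dark matter")
my $m = 500;      # mobile points shown in the video
my $niter = 200;  # main loop iterations

my (@fixed_x, @fixed_y, @moving_x, @moving_y);
for my $j (0 .. $n - 1) {    # fixed points on [-0.5, 1.5] x [-0.5, 1.5]
  $fixed_x[$j] = -0.5 + 2 * rand();
  $fixed_y[$j] = -0.5 + 2 * rand();
}
for my $k (0 .. $m - 1) {    # mobile points on [0, 1] x [0, 1]
  $moving_x[$k] = rand();
  $moving_y[$k] = rand();
}

for my $iteration (0 .. $niter - 1) {
  my (@tmp_x, @tmp_y);
  for my $k (0 .. $m - 1) {
    my ($x, $y) = ($moving_x[$k], $moving_y[$k]);
    my ($sum_w, $new_x, $new_y) = (0, 0, 0);
    for my $l (0 .. $m - 1) {   # the m-1 mobile neighbors, plus $k itself
      my ($p, $q) = ($moving_x[$l], $moving_y[$l]);
      my $w = distance($x, $y, $p, $q);
      $sum_w += $w; $new_x += $w * $p; $new_y += $w * $q;
    }
    for my $l (0 .. $n - 1) {   # the n fixed neighbors
      my ($p, $q) = ($fixed_x[$l], $fixed_y[$l]);
      my $w = distance($x, $y, $p, $q);
      $sum_w += $w; $new_x += $w * $p; $new_y += $w * $q;
    }
    # dividing by $sum_w makes the weights sum to 1, so each point is
    # replaced by a convex linear combination of all points
    $tmp_x[$k] = $new_x / $sum_w;
    $tmp_y[$k] = $new_y / $sum_w;
  }
  @moving_x = @tmp_x;   # synchronous update: all points move at once
  @moving_y = @tmp_y;
}

# Computes the distance d between (x,y) and (p,q) and returns the weight
# w as a decreasing function of d; the point itself (d = 0) automatically
# receives the largest, "special" weight. The 1/(0.001 + d^2) decay is a
# placeholder assumption.
sub distance {
  my ($x, $y, $p, $q) = @_;
  my $d2 = ($x - $p)**2 + ($y - $q)**2;
  return 1 / (0.001 + $d2);
}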

Special features

  • If the weight for (x,y) [the point being updated] is very high at a given iteration, then (x,y) will barely move.
  • We have tested negative weights (especially for the point being updated) and liked the results better. A delicate amount of negative weight also further prevents the system from collapsing and introduces a bit of chaos (see the sketch after this list).
  • Occasionally, one point is replaced by a brand new, random point, rather than being updated as the weighted sum of its neighbors. We call this event a "birth". It happens for less than 1% of all point updates, and it happens more frequently at the beginning. Of course, you can play with these parameters.
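
For illustration, here is one hypothetical way to wire negative self-weights into the inner loop of the sketch above. The factor -0.05 is an assumed tuning knob, not the author's value; since the weights are renormalized afterwards they still sum to 1, but the combination becomes affine rather than strictly convex:

# Hypothetical variant inside the loop over $l in the sketch above:
# the point being updated gets a small negative weight instead of its
# large natural self-weight.
my $w = ($l == $k)
  ? -0.05 * distance($x, $y, $x, $y)   # -0.05: assumed tuning parameter
  : distance($x, $y, $p, $q);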

In the source code, the birth process (for point $k) is simply encoded as:

if (rand()<0.1/(1+$iteration)) { # birth and death: rate decays over time
  $tmp_x[$k]=rand();   # replace point $k with a brand new
  $tmp_y[$k]=rand();   # random point on [0,1] x [0,1]
  $rebirth[$k]=1;      # flag the point as reborn
}
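
As a sanity check on the stated frequencies: the birth probability for a given point is 0.1/(1+$iteration), that is, 10% at the very first iteration, decaying quickly afterwards. Averaged over the 200 iterations, it is about 0.1 × H(200)/200 ≈ 0.3% (H being the harmonic number), which is indeed below 1% of all point updates, with most births concentrated at the beginning.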

In the source code, in the inner loop over $k, the point ($x, $y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). In a loop over $l, one level deeper, ($p, $q), referenced as point $l, is the neighboring point used in the weighted-average formula that updates ($x, $y). The function distance accepts four arguments ($x, $y, $p, $q), computes the distance d between the two points, and returns $weight, the weight w associated with d.

Click here to view source code.
