
Originally posted by Dr. Vincent Granville

Here I provide the mathematics, explanations, and source code used to produce the data and the moving clusters in the "From chaos to clusters" video series.

A little bit of history on how the project started:

  1. Interest in astronomy, visualization and how physics models apply to business problems
  2. Research on how urban growth could be modeled by the gravitational law
  3. Interest in systems that produce clusters (as well as birth and death processes) and in visualizing cluster formation with videos rather than charts
  4. Creating art: videos with sound and images synchronized, both generated from data (coming soon). Maybe I'll be able to turn your business data into a movie (artistic, insightful, or both)! I'm already at the point where I can produce the video frames faster than they are delivered to the streaming device. I call it FRT, for faster than real time.

What is a statistical model without a model?

There's actually a generic mathematical model behind the algorithm. But nobody cares about the model; the algorithm was created first, without a mathematical model in mind. Initially I had a gravitational model in mind, but I eventually abandoned it, as it was not producing what I expected.

This illustrates a new trend in data science: we care less and less about modeling, and more and more about results. My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want, be it a simulation of a Neyman-Scott cluster process or a simulation of some no-name stochastic process.

It's a bit similar to how modern climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to rock climbing on Mars in the future.

You can fine-tune the parameters to:

  1. Achieve the best fit between simulated data and real business (or other) data, using traditional goodness-of-fit testing and sensitivity analysis. Note that the simulated data represents a realization (an instance, for object-oriented people) of a spatio-temporal stochastic process.
  2. Once the parameters are calibrated, perform predictions (if you speak statistician language) or extrapolations (if you speak mathematician language).

So how does the algorithm work?

It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points get attracted to each other (attraction is stronger to closest neighbors) and thus over time, they group into clusters.

The algorithm has the following components:

  1. Creation of n random fixed points (n=100) on [-0.5, 1.5] x [-0.5, 1.5]. This window is 4 times bigger than the one containing the mobile points, to prevent edge effects from impacting the mobile points. These fixed points (they never move) also act as a sort of dark matter: they are invisible and not represented in the video, but they are the glue that prevents the whole system from collapsing onto itself and converging to a single point.
  2. Creation of m random mobile points (m=500) on [0,1] x [0,1].
  3. Main loop (200 iterations). At each iteration, we compute the distance d between each mobile point (x,y) and each of its m-1 mobile neighbors and n fixed neighbors. A weight w is computed as a function of d, with a special weight for the point (x,y) itself. The updated (x,y) is then the weighted sum aggregated over all points, and we do that for each point (x,y) at each iteration. The weights are such that they always sum to 1 over all points. In other words, we replace each point with a convex linear combination of all points.

Special features

  • If the weight for (x,y) [the point being updated] is very high at a given iteration, then (x,y) will barely move.
  • We have tested negative weights (especially for the point being updated) and liked the results better. A delicate amount of negative weight also further prevents the system from collapsing and introduces a bit of chaos.
  • Occasionally, one point is replaced by a brand new, random point rather than updated using the weighted sum of its neighbors. We call this event a "birth". It happens for less than 1% of all point updates, and more frequently at the beginning. Of course, you can play with these parameters.

In the source code, the birth process (for point $k) is simply encoded as:

if (rand() < 0.1/(1+$iteration)) { # birth and death

In the source code, in the inner loop over $k, the point ($x, $y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). Also, in a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point in the weighted-average formula used to update ($x, $y). The function distance accepts the four arguments ($x, $y, $p, $q), computes the distance d, and returns $weight, the weight w.
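The original source code is in Perl; as a rough illustration, here is a self-contained Python sketch of the algorithm described above. The weight function, the multiplier on the self-weight, and the demo parameters are my own assumptions, not the exact values from the original code:

```python
import random

def weight(x, y, p, q):
    # Hypothetical weight: decays with squared distance, so attraction
    # is stronger to the closest neighbors (the original uses a Perl
    # function named distance for this).
    return 1.0 / (0.01 + (x - p) ** 2 + (y - q) ** 2)

def simulate(n=100, m=500, iterations=200, seed=42):
    rng = random.Random(seed)
    # n fixed "dark matter" points on the larger window [-0.5, 1.5] x [-0.5, 1.5].
    fixed = [(rng.uniform(-0.5, 1.5), rng.uniform(-0.5, 1.5)) for _ in range(n)]
    # m mobile points on [0, 1] x [0, 1].
    mobile = [(rng.random(), rng.random()) for _ in range(m)]
    for it in range(iterations):
        updated = []
        for k, (x, y) in enumerate(mobile):
            if rng.random() < 0.1 / (1 + it):
                # Birth: replace the point with a brand new random point.
                updated.append((rng.random(), rng.random()))
                continue
            others = [pt for j, pt in enumerate(mobile) if j != k] + fixed
            ws = [weight(x, y, p, q) for (p, q) in others]
            w_self = 5.0 * max(ws)  # special, high weight for (x, y) itself
            total = w_self + sum(ws)
            # Convex combination: the normalized weights sum to 1.
            new_x = (w_self * x + sum(w * p for w, (p, _) in zip(ws, others))) / total
            new_y = (w_self * y + sum(w * q for w, (_, q) in zip(ws, others))) / total
            updated.append((new_x, new_y))
        mobile = updated
    return mobile

# Small demo run (the video uses n=100, m=500 and 200 iterations).
points = simulate(n=20, m=50, iterations=10)
```

Because every update is a convex combination of points inside [-0.5, 1.5] x [-0.5, 1.5], the mobile points can never leave that window.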

Click here to view source code.


What is big data - Infographics by Bernard Marr

About : Bernard Marr is a globally recognized expert in strategic metrics and data. He helps companies manage, measure, analyze and improve performance. His new book is: Big Data: Using Smart Big Data, Analytics and Metrics To Make Bette...


Bottom line on Data Visualization

Guest blog post by Michael Bryan

The market for data visualization software has bloomed. I'm suspicious.

Companies like Tableau, Spotfire, SAS Visual Analytics, Qlik and Zoomdata are positioning their tools far beyond traditional business intelligence. Capabilities for graphically navigating data, recognizing patterns and finding relationships are growing in both functional and economic scope. These new tools can provide charting forms only imagined in the last decade, like word clouds, circular hierarchies, tree maps and stream graphs. Check out the D3 (data-driven documents) JavaScript library for inspiration. All this innovation raises a critical question:

Is data visualization

  • an entirely new dimension of data management
  • a subject within analytics, emerging with new tools
  • a rebranding of old subjects like business intelligence, dashboards and reporting
  • or something else?

On the one hand, visual is not new. In 1983, Tufte wrote "The Visual Display of Quantitative Information", which has never stopped selling. In 1987, Rockart and De Long offered "Executive Support Systems" and launched the very user-centric EIS age. Comshare, IRI, Pilot and Arbor Software launched the 90's OLAP generation with its own concepts. And over the last decade, we've seen familiar players in Business Intelligence leapfrog each other, continually competing on presentation. Face it - "visual" sells software.

Analytic visuals aren't new either. Archimedes had charts. He just used a pen. The statistical suites all have rough but ready graph capabilities. Basic, and un-pretty, plots are among the first steps of exploratory data analysis. So, treating data visualization as innovative comes with a very high burden of proof.

I could buy an argument that Big Data has Big Graphics as a consequence. We are capturing more detail, we can store it, and we need to study it. So data science has grounds to need advanced visuals. But I'm guessing that data visualization licenses outnumber data scientists a hundredfold.

Tufte's original book included a famous duck.   The picture illustrated, for Tufte, irrelevant and useless presentation.  So far, I haven't seen reasoning to treat data visualization as much more than a next generation duck.


New in Plotly: Interactive Graphs with IPython

Originally posted on Data Science Central, by Matthew Sundquist.

New! Plotly lets you style interactive graphs in IPython. Then, you can share your Notebook or your Plotly graph. It's like having the NYTimes graphics department inside your IPython.

You can also get these Notebooks on the Plotly GitHub page, where you can see more documentation.

Here's a preview of how it looks to have your code, data, and graph all interactively available. See the live version.


We are investigating a metric that measures the presence or absence of a structure or pattern in data sets. The purpose is to measure the strength of the association between two variables; the metric generalizes the classical correlation coefficient in a few ways:

  • It applies to non-numeric data, for instance a list of pairs of keywords with a number attached to each pair, measuring how close the two keywords are to each other
  • It detects relationships that are not necessarily functional (for instance, points distributed in a very unusual domain, such as a sphere with holes in it, where the holes contain smaller spheres that are part of the domain itself)
  • It also works with traditional, numeric bivariate observations

 Curious pattern: 3-D waves created by 2-D circular motions of each dot

The structuredness coefficient, let's denote it as w, is not yet fully defined; we are working on this right now. You are welcome to help us come up with a great, robust, simple, easy-to-compute, easy-to-understand, easy-to-interpret metric. In a nutshell, we are working within the following framework:

  • We have a data set with n points. For simplicity, let's consider for now that these n points are n vectors (x, y) where x, y are real numbers.
  • For each pair of points {(x,y), (x',y')} we compute a distance d between the two points. In a more general setting, it could be a proximity metric between two keywords.
  • We order all the distances d and compute the distance distribution, based on these n points
  • Leave-one-out: we remove one point at a time and compute the n new distance distributions, each based on n-1 points
  • We compare the distribution computed on n points, with the n ones computed on n-1 points
  • We repeat this iteration, but this time with n-2, then n-3, n-4 points etc.
  • One would expect that if there is no pattern, these distance distributions (for successive values of n) would exhibit some kind of behavior uniquely characterizing the absence of structure, a behavior that can be identified via simulations. Any deviation from this behavior would indicate the presence of a structure. And the pattern-free behavior would be independent of the underlying point distribution or domain, a very important point. All of this would have to be established or tested, of course.
  • It would be interesting to test whether this metric can identify patterns such as fractal distribution / fractal dimension. Would it be able to detect patterns in time series?
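As a toy illustration of this framework, here is a Python sketch comparing the full distance distribution with the n leave-one-out distributions. The choice of a Kolmogorov-Smirnov-type statistic and the averaging into a single score are my own assumptions; as noted above, the actual metric w is not yet defined, and a no-structure baseline would still have to be calibrated via simulation:

```python
import math
import random
from bisect import bisect_right
from itertools import combinations

def distance_distribution(points):
    # Sorted list of all pairwise distances among the points.
    return sorted(math.dist(a, b) for a, b in combinations(points, 2))

def ks_statistic(sorted_a, sorted_b):
    # Largest gap between the two empirical CDFs (a KS-type statistic in [0, 1]).
    pooled = sorted(set(sorted_a) | set(sorted_b))
    return max(
        abs(bisect_right(sorted_a, t) / len(sorted_a)
            - bisect_right(sorted_b, t) / len(sorted_b))
        for t in pooled
    )

def structure_score(points):
    # Average KS distance between the n-point distance distribution and
    # the n leave-one-out (n-1 point) distributions.
    full = distance_distribution(points)
    loo = (distance_distribution(points[:i] + points[i + 1:])
           for i in range(len(points)))
    scores = [ks_statistic(full, d) for d in loo]
    return sum(scores) / len(scores)

rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(15)]
# To be compared against simulated no-structure baselines, per the framework above.
score = structure_score(uniform)
```

Repeating the same computation on n-2, n-3, ... points, as the framework suggests, only requires calling `structure_score` on successively truncated point sets.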

Note that this type of structuredness coefficient makes no assumption about the shape of the underlying domain where the n points are located. The domain could be smooth, bumpy, made up of lines, made up of dual points, etc. It might not even be numeric at all (e.g., if the data consists of keywords).

Originally posted by Dr. Vincent Granville


The Risks and Limitations of Visualization

Guest blog post by Radhika Subramanian

Today’s need to leverage unprecedented amounts of available information has resulted in a flood of tools, services and models claiming to surface insights from Big Data. One model in particular, visualization, has received a lot of attention lately because of its abilities to organize and present information. However, visualization is actually one of the biggest barriers to insight because it places the burden of discovery on the user, and any tool that places the burden on the analyst is a game-stopper.

Data visualization is the study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information. Humans are better equipped to consume visual data than text. As we know, a picture is worth a thousand words.

While visualization tools are interesting, they rely on human evaluation to extract insight and knowledge. The problem with this is that people often see what they are looking for and miss the breakthrough evidence they are actually seeking. It’s human nature: we see what we are conditioned to see and miss the fact that a gorilla just danced through the living room. But that’s just the beginning. The more severe limitation of visualization is it can only represent two or three dimensions before the amount of information is overwhelming. Visualizing a network of 10-100 friends is fine, but what happens when the data approaches one billion? Thus, while it is certainly a good test for small samples, it is not a sustainable method to gain insight into large volumes of shifting data.

In a previous blog, I wrote that given today’s explosion of “Big Data,” companies need more advanced methods for leveraging their data – methods that don’t rely solely on tribal knowledge, personal experience or best guesses. Like data mining, visualization is limited to manual endeavors. Why limit company success to antiquated methods that by design fail to leverage the data for all it’s worth? It’s time to usher in new methods and new technologies for transforming the enterprise from reactive (based on guesstimates, hunches, and flawed insight) to proactive (based on data-driven, actionable insight).


Datascape - Immersive 3D Data Visualisation

Guest blog post by David Burden

With the launch last week of Datascape I thought it would be worth putting an MD’s perspective on the product – how we got here, what the philosophy is that lies behind it, and where we hope to go with it. For a more formal view of the academic and commercial background see our Immersive Data Visualisation white paper.

Datascape has undoubtedly grown out of Daden's virtual world heritage – and my own interest in data and data visualisation. Over the years we've used virtual world platforms such as VRML, Active Worlds, Second Life and OpenSim to create a variety of data visualisations, probably culminating in our original Datascape virtual command centre (which won a prize at the US Government's Federal Virtual World Challenge), and the visualisation of Twitter data we did in OpenSim for the Royal Wedding in 2011. These examples and experiments, and those of others, together with an MOD-funded research project we did in 2011 with Aston University, quantitatively comparing immersive and non-immersive 3D visualisation spaces, convinced us that there was definitely something in immersive data visualisation.

In moving from ideas and demonstrators to a full-blown product, I think there are four key ideas that have informed our journey.



Datascape is about immersion. It is about putting you inside your data, allowing you to move around and through your data and view it from any angle, from inside or out. When in navigation mode there is no user interface – there is just your data (possibly the ultimate expression of Edward Tufte's Data Ink idea). This sense of immersion appears to help the brain see the patterns and anomalies in the data, because the data behaves like the real world – it stays still whilst your eye travels through it.



Datascape does not constrain you. If you want to map latitude to colour and longitude to shape, you can do it. The heart of Datascape is the mapping screen, where you assign the fields in the data to the features of a plot point – its position, rotation, shape, size, colour, image and labels. With a full set of spreadsheet-like functions at your disposal, and self-populating look-up tables, the plots you can produce probably really are only limited by your imagination. That flexibility does mean that initially there might be a bit more to learn, but we'll be posting "recipes" and "how-tos" on our web site to help you create the more common visualisations, and as we release successive versions of Datascape we may well start including wizards and templates that get you more directly to those common views.



Given that we needed good graphics and processing capability, we took the decision early on that this would initially be a PC application, not something for the web or your tablet. However, by basing Datascape on Unity we have a path available to develop web and/or tablet versions of Datascape if the demand is there. We have also been keeping a watching brief on HTML5 and WebGL, and one feature under serious consideration is being able to export your completed workspace as a standalone HTML5 virtual world to share more easily with friends and colleagues.



One thing we have found as we begin to look at more and more data in Datascape is that we may need a new visual language to describe what we are doing with data in a 3D space. In 2D we are all used to line graphs and bar charts, pie charts and scatter plots. Whilst we can do these in 3D as well, they do not (except for the last) typically take fullest advantage of the medium.

For instance, one problem we've found in 3D is that whilst the virtual space lets us plot a long line of data stretching off into the distance, looking at the whole line is hard; you have to scroll as you do in 2D, unless we compress it (but then we lose the detail that the spread-out 3D display brings). One solution we have found is to plot the data as a cylinder, or even as a spiral, with the viewer in the centre. You can then take in a lot of data in one go, and just fly up and down the cylinder to other data – which is typically an easier action to control than horizontal flight. What other standard forms will we find, and how will we determine which form suits which type of data, and which type of enquiry?
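The cylinder idea above is simple trigonometry. Here is a minimal Python sketch, assuming a coordinate convention of my own (viewer at the origin, y pointing up); the function name and parameters are hypothetical, not Datascape's actual API:

```python
import math

def cylinder_layout(values, points_per_turn=100, radius=10.0, turn_height=2.0):
    """Wrap a long 1-D series around the viewer: each turn is a ring of
    points_per_turn samples, successive turns stack vertically, and the
    data value nudges each point's height."""
    pts = []
    for i, v in enumerate(values):
        angle = 2 * math.pi * (i % points_per_turn) / points_per_turn
        x = radius * math.cos(angle)
        z = radius * math.sin(angle)
        y = (i // points_per_turn) * turn_height + v
        pts.append((x, y, z))
    return pts

# For a spiral rather than stacked rings, let angle and height grow continuously:
# angle = 2*pi*i/points_per_turn and y = turn_height*i/points_per_turn + v.
layout = cylinder_layout([math.sin(i / 10) for i in range(500)])
```

Every point lands at the same distance from the viewer's vertical axis, so flying up and down the cylinder keeps the whole series equally close to the eye.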

Another difference is axes. In 2D the axes form a frame in which your data sits – and the same for non-immersive 3D cubes. But in an immersive space you are usually inside the data and the axes are nowhere in sight. So how do we maintain orientation within the data, and understand where the data points sit on the axes (that is if we actually need enumerated axes). There are no doubt a wide number of solutions to explore, and within Datascape we have distant XYZ markers so you can easily tell which direction you are looking in, whenever you hover over a point it can tell you its X,Y and Z values, and you can also have the point drop reference lines down to the axis or reference planes. One other thing we have tried, but not perfected enough to release, is a 3D compass, and another that we are looking at for future releases is the use of mini-maps, not just as a top-down (XZ plane) view but also as YZ and YX views as well. But can you cope with seeing your data in four directions at once?



We thought long and hard about this tag line, just as we did about whether or not to have avatars. We didn't put avatars in the single-user version since we felt that a) you get enough of a sense of immersion from the navigation alone, and b) for most corporate users we spoke to, avatars are still a turn-off and too closely associated with gaming environments. However, "virtual world" (most emphatically in lower case) did seem by far the most appropriate way to describe what you can create with Datascape: a virtual world populated solely by you and your data.

In multi-user mode we do provide you with a very basic humanoid avatar – but it is very much a place-holder, a glyph, for where you are in the world and what direction you are looking in. We deliberately kept clear of an avatar that was human enough for you to start worrying about what gender, race or age it was, and what clothes it should wear! The resulting avatar is enough to let you know where your colleagues are and what they are looking at, no more, but even so it's not long before you're playing hide and seek amongst the data.

Going forward we may well increase the virtual world sense – for those who want it – with better avatars, more 3D scenery in which to place your visualisations (closer to the original Datascape), and persistency and controlled sharing of your data and workspaces. But let's start simply, and with something that everyone can hopefully relate to.


So hopefully that gives you some insight into our thinking as we developed Datascape, and some clues as to where we might take it in the future. Please download it (there is a free community version with a 6,000-point limit and a paid pro version with a 65,000-point limit, although we have had it running with up to 250k points) and give it a try, and hopefully it will open up a whole new world of data visualisation for you.


Big data and Data Visualization

Like many people of a certain age, my first exposure to the term dashboard was when I developed one for monitoring corrective and preventive actions!

I have realised that dashboard design itself is now the essence of simplicity and cutting-edge technology, and stylish with it too, arousing passions about what makes a great interface for analysis.
When it comes to software applications and websites, dashboards are all around us too!

The era of Big Data has arrived, but most organizations are still unprepared. Enterprises erroneously believe, and act as though, big data is a passing fad and nothing has really changed. But big data is not a temporary thing, and by acting as if it were, companies are missing out on tremendous opportunities by not focusing on this technology.

So what is it?

As many of us know, an enterprise application dashboard is a one-stop shop for information. It's a page made up of portlets or regions, grouping related information into displays of graphs, charts, and graphics of different kinds. Dashboards visualize a breadth of information spread over a large range of activities in an application or functional area.

There are numerous case studies explaining how visual representations locate and leverage valuable insights from large sets of structured or unstructured data (i.e., big data), prompt better questions, and lead to better decisions.

Does it serve the purpose?

Yes! Dashboards, when designed well, aggregate structured and unstructured data into meaningful visual displays and representations, using analytical formulas over the available data sets at the backend to do the analysis and derivation work that users used to do with notepads, calculators or spreadsheets to find out what has changed or needs attention.

Dashboards over large amounts of data enable users to prioritize work and manage exceptions by taking lightweight actions immediately from the page, or by drilling down to explore and do more in a transactional or analytics work area, if necessary.

The design of dashboards over very large amounts of data, on the other hand, is much more open to interpretation. Most of these big data dashboards are simply a series of graphs, charts, gauges, or other visual indicators that a user has chosen to monitor, some of which may be strategically important, but others of which may not be. Even if a strategic link exists, it may not be clear to the person monitoring the dashboard, since the objective statements, which explain what achievement is desired, are typically not present on dashboards.

Why this?

I found it interesting that there are separate infographics and data visualization categories. My interpretation is that the entries in the infographics section are static and illustrated, while those in the data visualization section are generated and data-driven.

Nowadays, big data can be explored through data visualization, using superior tools and techniques to present and analyze the available data.

On the other hand, it is economical in terms of space and would probably work in almost every case, which are two things that dashboards should be good at. So while I wouldn't have used it myself, I can understand why this decision was made. What makes a dashboard, or any other information-based design, successful is neither the design execution nor the clever information analysis and visualization technique.

These kinds of dashboards, in the end, are meant to be useful and to solve a specific problem. Dashboards represent a powerful means of communication for business users nowadays, when companies accumulate large amounts of data. Those visually compressed representations of only the most important data are used for tracking.

DataViz, in my view!

These data visualizations can unintentionally bias the viewer as a result of the choices made in visual method; sometimes a visualization fails because it does not account for the viewer's assumptions (cultural ones, for instance: is red a good or a bad color?).

One interesting thing I always think about is creating visualizations that let the human eye discover something that can't be discovered by a program. But there will be a challenge in showing enough data to give a sense of context while providing enough detail to enable understanding.

What's next?

Whenever a visualization is built on big data, once the data visualization designer is aware of the simple principles of presenting data on a screen, they can apply them to any report or graph, data analysis or information dashboard, without changing its context or meaning. Only then will it provide a powerful means of making sense of data. When done properly, data visualization will make us think, compare data, read stories out of our data, and put data in the right context, ultimately helping decision-makers make the right decisions regardless of the type or amount of data available.

Do you have any thoughts on this? I look forward to hearing from you!


Guest Blog post by Nilesh Jethwa

Data is bits and bytes, and visualization has the power to tell the story in multiple forms. Today I wish to share two different visualizations of the same dataset.

Here is the link to the dataset and the dashboard as shown below




And here is the second dashboard visual, using a choropleth:





You can analyze the pros and cons of both kinds of visuals. Both are pretty interesting, and the key point is what story each one is trying to tell.


14 questions about data visualization tools

Guest blog post by Vincent Granville

Questions to ask when considering visualization tools:

  1. How do you define and measure the quality of a chart?
  2. Which tools allow you to produce interactive graphs or maps?
  3. Which tools do you recommend for big data visualization?
  4. Which visualization tools can be accessed via an API, in batch mode? (for instance, to update earthquake maps every 5 minutes, or stock prices every second)
  5. What do you think of Excel? And Python or Perl graph libraries? And R?
  6. Are there any tools that allow you to easily produce videos of your data (e.g. to show how fraud cases or diseases spread over time)?
  7. In Excel you can update your data: then your model and charts get updated right away. Are there any alternatives to Excel, offering the same features, but having much better data modeling capabilities?
  8. How do you produce nice graph structures - e.g. to visually display Facebook connections?
  9. What is a heat map? When does it make sense to use one?
  10. How do you draw "force-directed graphs"?
  11. Good tools for raster images? for vector images? for graphs? for decision trees? for fractals? for time series? for stock prices? for maps? for spatial data?
  12. How can you integrate R with other graphical packages?
  13. How do you represent 5 dimensions (e.g. time, volume, category, price, location) in a simple 2-dimensional graph? Or is it better to represent fewer dimensions if your goal is to communicate a message to executives?
  14. Why are the visualization tools used by mathematicians and operations research practitioners (e.g. Matlab) not the same as the tools used by data scientists? Is it because of the type of data, or just for historical reasons?


This is a guest blog post.

Have you ever wanted to quickly and visually share some data with your colleagues or with the world, and struggled with the available tools? And after sharing the data, what if a viewer wants to zoom in on a specific location, city or town to see what's going on there?

Google Fusion Tables is a free tool to show your data on a map & allow viewers to zoom in on the specific areas they want to explore further. vHomeInsurance, a data-driven home insurance analysis service, has detailed location data on home insurance rates & has used that data to create this guide to representing home insurance rates visually on a map with Google Fusion Tables.

1. Google Fusion Table Home Page

To start using Google Fusion Tables, you must have a Google account. After you have created a Google account, or if you already have one, go to:  & click on the Create a Fusion Table link to begin.

2. Get your data into Google Fusion Tables

We have three choices for importing data into Google Fusion Tables: 1) upload a file, 2) import from a Google Spreadsheet, or 3) start from an empty table.

For this guide, we choose the Google Spreadsheet option. Choose the Google spreadsheet you want to import data from and click Next. You can choose the other options as well and click next.

Before completing the import into Google Fusion Tables, make sure to check the column names and other details.

3. Naming your table & data

Once the data is imported, make sure to give the appropriate table names, your licensing attribution & other details.

4. Map the “Location” field to the appropriate column

This is where the rubber hits the road: Google geocodes your data to work out which place is where.

For instance, if you have data on Brooklyn homeowner insurance rates & want Google Maps to show it in the appropriate location, then Google Fusion Tables needs to figure out the geo-coordinates for that data. To tell Google Fusion Tables which column to geocode, we need to change the appropriate column's type to "Location". This can be done through the following steps:

  1. Hover over the column name that has the location data and click on the downward pointing arrow
  2. Click on Change
  3. On the page that comes up, choose "Location" for the type & then hit Save

5. Show the Geocoded data on a Map

Now we need to actually show the data on a map. To do that, click on "Add Map". The actual geocoding & rendering may take some time, depending on the number of rows in your table.


6. Configure Your Map Markers

The default representation of a place on the map is a red circle, but we need to customize it to make it more meaningful. As an example, in the home insurance world, home insurance in Chicago is $888, so we give it a yellow marker, whereas home insurance in Phoenix is cheaper at $596, and we represent that with a green marker. In the configure map section, click on "Change feature styles" and then configure various buckets and associated color markers for the different values.

You can see a screenshot of the finished map below.

A detailed map is available to zoom in for various home insurance rates on vHomeInsurance.

About the Author:

The vHomeInsurance team are experts in home insurance rates data analysis & research.


Please join us on February 3, 2015 at 9am PT for our latest Data Science Central Webinar Event: 
Avoiding Data Pitfalls: Gaps Between Data and Reality sponsored by Tableau Software.

Space is limited.
Reserve your Webinar seat now

Have you ever been fooled by data? In this Webinar we will cover common pitfalls that anyone who works with data has fallen into. Find out what these pitfalls look like and how to avoid them. The pitfalls range from philosophical to technical, and from analytical to visual. 

Utilizing trusted techniques and visualization tools, we'll help you learn how to avoid these common mistakes and steer clear of otherwise uncomfortable pitfalls.


Presented by: Ben Jones of Tableau Software

Hosted by: Tim Matteson, Cofounder, Data Science Central

Title:  Avoiding Data Pitfalls: Gaps between Data and Reality
Date:  Tuesday, February 3, 2015
Time:  9:00 AM - 10:00 AM PT


Again, space is limited, so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.


Read more…

Guest blog post by Alex Jones.

Post adapted from Correlation vs Causation: Visualization, Statistics, and Intuition!

As someone who has a tendency to think in numbers, I love when success is quantifiable.

However, I suppose that means I must accept defeat (or in true statistician fashion-- try to discredit the correlation) when the numbers don't demonstrate what I had hoped for or intuitively believed!

With that, I decided to look into how my working at Cameron relates to the company's stock price. Alongside this analysis, I'll include a quick demo of scaling and data manipulation for visualization.

Of course, this post is meant to highlight one of the basic lessons of statistics in a mildly entertaining way.

To begin, I pulled Stock Price over my first ~90 Days. Since the market is only open on business days, it fits perfectly with the number of days worked.

If only every analysis were this convenient! From there, I merely added a column that counts the number of days.

Eventually the data looked something like this:

Neat! Now, let's graph Adjusted Close Price vs Days Worked.

Super! As you can see in this graph, there's obviously no Relationship!

Not so fast. Let's regress Stock Price on Days Worked.

It's important to realize that while visualization is a phenomenal tool and incredibly insightful way to ingest data, it's not the whole story.

Blasphemy! For the sake of this article, humor my logical leaps.

With an R-squared of .88 and a P-value that's zero out to 42 decimal places, traditional statistics would say we are incredibly confident about these results!
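For reference, the slope, intercept, and R-squared of a simple one-variable regression like this can be computed by hand. A minimal Python sketch, using made-up numbers rather than the actual stock series:

```python
# Illustrative data: days worked vs. share price (not the real series)
days = [1, 2, 3, 4, 5]
price = [75.7, 75.4, 75.1, 74.9, 74.4]

n = len(days)
mean_x = sum(days) / n
mean_y = sum(price) / n

# Sums of squares and cross-products around the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, price))
sxx = sum((x - mean_x) ** 2 for x in days)
syy = sum((y - mean_y) ** 2 for y in price)

slope = sxy / sxx                    # least-squares slope
intercept = mean_y - slope * mean_x  # least-squares intercept
r_squared = sxy ** 2 / (sxx * syy)   # coefficient of determination

print(slope, intercept, r_squared)
```

The same numbers come out of any statistics package; writing them out once just makes the later "undo the transformations" argument easier to follow.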

So what do all those numbers really say? Well one interpretation would be that we can explain Stock Price by:

StockPrice = $75.99 - $0.29672 × (NumberDaysAlexHasWorked)

That's a heck of a deal! I cost a little under 30 cents a day...

WRONG. That's per share. Since the company currently has ~197.45M Shares Outstanding, that means, based on these statistically significant results, I cost $58,587,364 per day.
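The per-day figure is just the slope times the share count. Taking the coefficients reported above at face value:

```python
slope = -0.29672                   # dollars per share, per day worked
shares_outstanding = 197_450_000   # ~197.45M shares

# Implied market-cap change for each additional day worked
cost_per_day = abs(slope) * shares_outstanding
print(f"${cost_per_day:,.0f} per day")  # $58,587,364 per day
```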

Well this is awkward... 

Quick! Let's see if we can perform some "Transformations" on the data to get a "Better result". 

First, let's scale Stock Price from 0 (lowest price) to 1 (highest price). To do so, we'll get the Minimum value and the Maximum value. With those, we'll be able to get the Spread/Span.

That calculation is simply Spread = Maximum - Minimum. Simple enough!

Now how do we scale every datapoint? Great question.

We'll take (Stock Price - Minimum) / Spread. Boom! Scaled.
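That scaling recipe in code (the prices here are illustrative stand-ins, not the actual series):

```python
prices = [66.2, 71.5, 75.9, 63.4, 58.1]  # illustrative adjusted close prices

minimum, maximum = min(prices), max(prices)
spread = maximum - minimum  # Spread = Maximum - Minimum

# (Stock Price - Minimum) / Spread maps the lowest price to 0, the highest to 1
scaled = [(p - minimum) / spread for p in prices]

print(min(scaled), max(scaled))  # 0.0 1.0
```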

Now let's graph that!

Oh great! No relationship! Just as I wanted.

Whoa whoa whoa... that doesn't seem right. Ok, then what do you propose? Scale the number of days worked?

Well, I guess we could try that. Same formula/process applied to days worked.

Ok, so maybe there's a relationship here... I suppose we should Invert days worked so that the lines go in the same general direction.

See how the orange line (Days Worked) currently starts at 0 and goes to 1? Let's flip that. How? We'll apply the formula Inverted Days Worked = 1 - Scaled Days Worked. Now the line is flipped!
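And the flip itself is a one-liner on the already-scaled series (illustrative values):

```python
scaled_days = [0.0, 0.25, 0.5, 0.75, 1.0]  # days worked, already scaled to [0, 1]

# Inverted Days Worked = 1 - Scaled Days Worked
inverted_days = [1 - d for d in scaled_days]

print(inverted_days)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```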

Let's Graph them.

Holy Moly. I see it now.

So now we have taken two vectors of differing relative magnitudes, scaled them to an equivalent range, and controlled for directionality, thereby enabling a linear depiction of the relationship and a more intuitive visualization!

Sorry, that was unnecessary. Nerdiness got the best of me.

So what then does this mean? Now what are the results of the regression?!

You better sit down for this... The regression results, in absolute terms, are EXACTLY the same. Even though the equation (shown on the final graph) is apparently different, once we "undo" all the scaling and transformations, and we get the numbers back into their original values... they will be the exact same as the original!

Hm.. Why is that? Because we're just transforming data! We're not changing the underlying geometry of the relationships. Relatively speaking, the data remains holistically the same. We didn't pick out one data point and change JUST that one. We changed all of them at the same time.
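You can check this numerically: the correlation (and hence R-squared) survives min-max scaling of either variable, and flipping a variable only changes the correlation's sign. A small sketch with made-up data:

```python
# Made-up stand-ins for stock price and days worked
x = [1, 2, 3, 4, 5, 6]
y = [75.7, 75.4, 75.0, 74.6, 74.1, 73.9]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def minmax(a):
    """Scale a sequence to the range [0, 1]."""
    lo, hi = min(a), max(a)
    return [(u - lo) / (hi - lo) for u in a]

r_raw = pearson(x, y)
r_scaled = pearson(minmax(x), minmax(y))
r_flipped = pearson([1 - u for u in minmax(x)], minmax(y))

# R-squared is identical in all three cases; flipping only changes the sign of r
print(round(r_raw ** 2, 9) == round(r_scaled ** 2, 9) == round(r_flipped ** 2, 9))
```

Min-max scaling is a positive affine transformation, and inversion is a negative one, which is exactly why the cube analogy below works: the geometry of the relationship is untouched.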

In other words, we're just moving the data's perspective in a multi-dimensional space, relative to US the viewers. You can zoom, stretch, angle, compress, and turn data in any way you want!

Let's take a second and think about this. For a moment, think of our data as a cube-- just to help conceptualize what's going on.

If we turn, flip, invert, scale, zoom out, or angle the cube in any way-- has the cube itself changed? Absolutely not. It's the exact same cube!

We're simply looking at it from a different perspective. So when we transform a "Data Vector / Cube" (as long as we "undo" those changes when we analyze the data in real terms)-- we're just finding that perfect angle to tell our story and create a compelling visual. That's powerful and exciting!

Victory is mine! Data hath been conquered!

Even with these marvelous findings, we must address the issue of primary concern--Correlation vs Causation! Based on statistics--"data-driven" results--and the interpretation we proposed earlier, I'm the worst!

However, that's a myopic approach to statistics. Rather-- I bet you there's a 3rd variable indicative of the movement of stock price. What does days worked really represent? It is merely a count of the past ~90 days. So what else has happened in that period?

Well, if we consider the fact that the company is a major oilfield services firm, or pick our heads up and look at the companies and markets around us, we quickly realize the missing link is the price of oil (at least I certainly hope so!).

What you should realize is that these relationships aren't always evident or obvious! In fact, visualizations in their raw form could disguise relationships! Statistics is still a subjective science-- subject to the availability of information and robustness of the analyst's forethought and interpretation!

More importantly, we can identify the importance that the macro oil market plays in stock price, rather than otherwise extraneous relationships! For brevity's sake, we'll omit another full analysis saga.

Most importantly, this should help to exemplify one of the most exciting value potentials of "Big Data". Essentially, we now have access to incredible amounts of information relative to the "Universal Variable" of time. With that point to relate on, we can now see how major indexes, markets, events, weather patterns, customer announcements, etc. interrelate!

As we move towards an even smaller and more interconnected world, expect to see more "Universal" data points (and actively promote them; in the long run, it'll make your analysis more resilient and dynamic!).

Thanks for reading!

Follow Alex Jones


Read more…

Interactive Data Visualization for the Web, by Scott Murray, O’Reilly (2013), has a Free online version.

An introduction to D3 for people new to programming and web development, published by O’Reilly. “Explaining tricky technical topics with aplomb (and a little cheeky humor) is Scott Murray’s forte. If you want to dive into the world of dynamic visualization using web standards, even if you are new to programming, this book is the place to start.” - Mike Bostock, creator of D3.

From O’Reilly website: "This step-by-step guide is ideal whether you’re a designer or visual artist with no programming experience, a reporter exploring the new frontier of data journalism, or anyone who wants to visualize and share data. Create and publish your own interactive data visualization projects on the Web—even if you have little or no experience with data visualization or web development. It’s easy and fun with this practical, hands-on introduction. Author Scott Murray teaches you the fundamental concepts and methods of D3, a JavaScript library that lets you express data visually in a web browser. Along the way, you’ll expand your web programming skills, using tools such as HTML and JavaScript"

This online version of Interactive Data Visualization for the Web includes 44 examples that show you how best to represent your interactive data. For instance, you'll learn how to create a simple force layout with 10 nodes and 12 edges; in the interactive online version, you can click and drag the nodes to see the diagram react.

Read more…

35 books on Data Visualization

1. The Visual Display of Quantitative Information, by Edward Tufte (Graphics Press, 1983), 197 pages. A modern classic. Tufte teaches the fundamentals of graphics, charts, maps and tables. "A visual Strunk and White" (The Boston Globe). Includes 250 delightfully entertaining illustrations, all beautifully printed.
Read more…

Beyond The Visualization Zoo

The best document I have read on visualization is called "A Tour Through The Visualization Zoo" by Jeffrey Heer, Michael Bostock, Vadim Ogievetsky. It's a must-read picture book for aspiring Data Scientists. Most of the graphics from this post are examples of the Tour taken from the d3 gallery.
Read more…
The top tech companies by market capitalization are IBM, HP, Oracle, Microsoft, Cisco, SAP, EMC, Apple, Amazon and Google. All of the top tech companies are selected based on their current market capitalization, with the exception of Yahoo. The year 2014 is not included as part of this analysis. Data: the source of this data is public financial records from
Read more…
"A picture is worth a thousand words" or in the case of Data Science, we could say "A picture is worth a thousand statistics". Interactive Data Visualization or Visual Analytics has become one of the top trends in transforming business intelligence (BI) as technologies based on Visual Analytics have moved into widespread use.
Read more…
