
14 questions about data visualization tools

Guest blog post by Vincent Granville

Questions to ask when considering visualization tools:

  1. How do you define and measure the quality of a chart?
  2. Which tools allow you to produce interactive graphs or maps?
  3. Which tools do you recommend for big data visualization?
  4. Which visualization tools can be accessed via an API, in batch mode? (for instance, to update earthquake maps every 5 minutes, or stock prices every second)
  5. What do you think of Excel? And Python or Perl graph libraries? And R?
  6. Are there any tools that allow you to easily produce videos of your data (e.g. to show how fraud cases or diseases spread over time)?
  7. In Excel, when you update your data, your model and charts update right away. Are there any alternatives to Excel that offer the same feature but have much better data modeling capabilities?
  8. How do you produce nice graph structures - e.g. to visually display Facebook connections?
  9. What is a heat map? When does it make sense to use one?
  10. How do you draw "force-directed graphs"?
  11. Good tools for raster images? for vector images? for graphs? for decision trees? for fractals? for time series? for stock prices? for maps? for spatial data?
  12. How can you integrate R with other graphical packages?
  13. How do you represent 5 dimensions (e.g. time, volume, category, price, location) in a simple 2-dimensional graph? Or is it better to represent fewer dimensions if your goal is to communicate a message to executives?
  14. Why are the visualization tools used by mathematicians and operations research practitioners (e.g. Matlab) not the same as the tools used by data scientists? Is it because of the type of data, or just historical reasons?
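On question 9: a heat map encodes a matrix of values as colors, and makes sense when the overall pattern matters more than individual numbers. A minimal sketch in Python with matplotlib; the weekday-by-hour traffic data here is invented for illustration:

```python
# Minimal heat map: color-code a 7x24 matrix (weekday x hour of day).
# The visit counts are randomly generated for illustration only.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.integers(0, 100, size=(7, 24))  # visits per weekday/hour

fig, ax = plt.subplots()
im = ax.imshow(data, cmap="viridis", aspect="auto")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Weekday (0 = Monday)")
ax.set_title("Site visits by weekday and hour")
fig.colorbar(im, label="Visits")
fig.savefig("heatmap.png")
```

A heat map works best when you want to spot structure across two binned or categorical axes at once, e.g. traffic peaking on weekday mornings.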


Read more…

This is a guest blog post.

Ever wanted to quickly and visually share some data with your colleagues or with the world, and struggled with the tools available? And after sharing the data, what if the viewer wants to zoom in on a specific location, city, or town to see what's going on there?

Google Fusion Tables is a free tool that shows your data on a map and lets viewers zoom in on the specific areas they want to explore further. vHomeInsurance, a data-driven home insurance analysis service, has detailed location data on home insurance rates and used it to create this guide to representing home insurance rates visually on a map with Google Fusion Tables.

1. Google Fusion Table Home Page

To start using Google Fusion Tables, you must have a Google account. After you have created one, or if you already have one, go to:  and click the Create a Fusion Table link to begin.

2. Get your data into Google Fusion Tables

We have three choices for importing data into Google Fusion Tables:

  1. Upload a file
  2. Import from Google Spreadsheets
  3. Start from an empty table

For this guide, we choose the Google Spreadsheet option. Choose the Google spreadsheet you want to import data from and click Next. The other options work the same way: select your source and click Next.

Before confirming the import, check the column names and other details in the preview.

3. Naming your table & data

Once the data is imported, make sure to give the table an appropriate name, add your licensing attribution, and fill in the other details.

4. Map the “Location” field to the appropriate column

This is where the rubber hits the road: Google geocodes your data so it knows which place is where.

For instance, if you have data on Brooklyn homeowner insurance rates and want Google Maps to show it in the appropriate location, Google Fusion Tables needs to figure out the geo-coordinates for that data. To tell Google Fusion Tables which column to geocode, change the appropriate column's type to “Location”. This can be done through the following steps:

  1. Hover over the column name that has the location data and click on the downward pointing arrow
  2. Click on Change
  3. On the page that appears, choose “Location” as the type, then hit Save

5. Show the Geocoded data on a Map

Now we need to actually show the data on a map. To do that, click “Add Map”. The geocoding and rendering may take some time, depending on the number of rows in your table.


6. Configure Your Map Markers

The default marker for a place on the map is a red circle, but we can customize markers to make them more meaningful. As an example, in the home insurance world, home insurance in Chicago is $888, so we give it a yellow marker, whereas home insurance in Phoenix is cheaper at $596, which we represent with a green marker. In the Configure map section, click Change feature styles, then configure the various buckets and associated marker colors for the different values.
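The bucketing that Change feature styles performs can be sketched in code. This is not Fusion Tables' API, just an illustration of the bucket logic; the bucket boundaries are made up, while the Phoenix and Chicago rates come from the example above:

```python
# Sketch of value-to-marker-color bucketing (hypothetical boundaries).
def marker_color(rate, buckets=((600, "green"), (900, "yellow"))):
    """Return a marker color for an insurance rate.

    buckets is a sequence of (upper_bound, color) pairs, checked in
    order; anything above the last bound falls through to red.
    """
    for upper, color in buckets:
        if rate < upper:
            return color
    return "red"

# Values from the post: Phoenix $596 -> green, Chicago $888 -> yellow.
print(marker_color(596))   # green
print(marker_color(888))   # yellow
print(marker_color(1200))  # red
```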

You can see a screenshot of the finished map below.

A detailed, zoomable map of the various home insurance rates is available on vHomeInsurance.

About the Author:

The vHomeInsurance team are experts in home insurance rate data analysis and research.


Read more…

Please join us on February 3, 2015 at 9am PT for our latest Data Science Central Webinar Event: 
Avoiding Data Pitfalls: Gaps Between Data and Reality sponsored by Tableau Software.

Space is limited.
Reserve your Webinar seat now

Have you ever been fooled by data? In this Webinar we will cover common pitfalls that anyone who works with data has fallen into. Find out what these pitfalls look like and how to avoid them. The pitfalls range from philosophical to technical, and from analytical to visual. 

Using trusted techniques and visualization tools, we'll help you learn how to avoid these common mistakes and sidestep otherwise uncomfortable pitfalls.


Ben Jones of Tableau Software

Hosted by: Tim Matteson, Cofounder, Data Science Central

Title:  Avoiding Data Pitfalls: Gaps between Data and Reality
Date:  Tuesday, February 3, 2015
Time:  9:00 AM - 10:00 AM PT


Again, space is limited, so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.


Read more…

Guest blog post by Alex Jones.

Post adapted from Correlation vs Causation: Visualization, Statistics, and Intuition!

As someone who has a tendency to think in numbers, I love when success is quantifiable.

However, I suppose that means I must accept defeat (or in true statistician fashion-- try to discredit the correlation) when the numbers don't demonstrate what I had hoped for or intuitively believed!

With that, I decided to look into how my working at Cameron relates to the company's stock price. Alongside this analysis, I'll include a quick demo of scaling and data manipulation for visualization.

Of course, this post is meant to highlight one of the basic lessons of statistics in a mildly entertaining way.

To begin, I pulled Stock Price over my first ~90 Days. Since the market is only open on business days, it fits perfectly with the number of days worked.

If only every analysis was this convenient! From there, I merely added a column that counts the number of days.

Eventually the data looked something like this:

Neat! Now, let's graph Adjusted Close Price vs Days Worked.

Super! As you can see in this graph, there's obviously no Relationship!

Not so fast. Let's regress Stock Price on Days Worked.

It's important to realize that while visualization is a phenomenal tool and incredibly insightful way to ingest data, it's not the whole story.

Blasphemy! For the sake of this article, humor my logical leaps.

With an R squared of .88 and a p-value that is zero out to 42 decimal places, traditional statistics would say we are incredibly confident about these results!

So what do all those numbers really say? Well one interpretation would be that we can explain Stock Price by:

StockPrice = $75.99 - $0.29672 × (NumberDaysAlexHasWorked)

That's a heck of a deal! I cost a little under 30 cents a day...

WRONG. That's per share. Since the company currently has ~197.45M Shares Outstanding, that means, based on these statistically significant results, I cost $58,587,364 per day.
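The arithmetic checks out, and is a one-liner to verify; the slope and share count below are the figures quoted above:

```python
# Verify the per-day "cost": a slope of -$0.29672 per share per day
# worked, multiplied by ~197.45M shares outstanding (from the post).
slope_per_share = 0.29672
shares_outstanding = 197_450_000

cost_per_day = slope_per_share * shares_outstanding
print(round(cost_per_day))  # 58587364
```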

Well this is awkward... 

Quick! Let's see if we can perform some "Transformations" on the data to get a "Better result". 

First, let's scale Stock Price from 0 (lowest price) to 1 (highest price). To do so, we'll get the minimum value and maximum value. With those, we can compute the spread (span).

That calculation is simply Spread= Maximum - Minimum. Simple enough! 

Now how do we scale every datapoint? Great question.

For each value, we'll take (Stock Price - Minimum) / Spread. Boom! Scaled.
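The scaling recipe above, as a small helper; the prices are made up for illustration:

```python
# Min-max scaling: (x - minimum) / spread, mapping values into [0, 1].
def scale(values):
    lo, hi = min(values), max(values)
    spread = hi - lo
    return [(v - lo) / spread for v in values]

prices = [55.0, 60.0, 75.0, 70.0]  # made-up stock prices
print(scale(prices))  # [0.0, 0.25, 1.0, 0.75]
```

The lowest value always maps to 0 and the highest to 1, which is exactly what lets two series of different magnitudes share one axis.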

Now let's graph that!

Oh great! No relationship! Just as I wanted.

Whoa whoa whoa... that doesn't seem right. OK, then what do you propose? Scale the number of days worked?

Well I guess we could try that. Same formula/ process applied to days worked.

Ok, so maybe there's a relationship here... I suppose we should Invert days worked so that the lines go in the same general direction.

See how the Orange line (Days worked) currently starts at 0 and goes to 1. Let's flip that. How? We'll apply the formula Inverted Days Worked= 1-Scaled Days Worked. Now the line is flipped!

Let's Graph them.

Holy Moly. I see it now.

So now we have taken two vectors of differing relative magnitudes, scaled them to an equivalent range, and controlled for directionality, thereby enabling a linear depiction of the relationship and a more intuitive visualization!

Sorry, that was unnecessary. Nerdiness got the best of me.

So what then does this mean? Now what are the results of the regression?!

You better sit down for this... The regression results, in absolute terms, are EXACTLY the same. Even though the equation (shown on the final graph) is apparently different, once we "undo" all the scaling and transformations, and we get the numbers back into their original values... they will be the exact same as the original!

Hm.. Why is that? Because we're just transforming data! We're not changing the underlying geometry of the relationships. Relatively speaking, the data remains holistically the same. We didn't pick out one data point and change JUST that one. We changed all of them at the same time.
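This claim is easy to check numerically: min-max scaling and the 1 - x flip are both affine transformations, and R² is unchanged by them (only the sign of the correlation can flip). A toy demonstration with invented data:

```python
# R^2 is invariant under affine transforms such as min-max scaling
# and the "1 - x" flip; only the correlation's sign can change.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

days = list(range(10))
price = [76 - 0.3 * d + (-1) ** d * 0.5 for d in days]  # noisy downtrend

r_raw = pearson_r(days, price)  # negative: price falls as days rise

lo, hi = min(price), max(price)
scaled = [(p - lo) / (hi - lo) for p in price]      # min-max scaled
flipped_days = [1 - d / max(days) for d in days]    # scaled and flipped

r_transformed = pearson_r(flipped_days, scaled)
# Flipping days flips the sign of r; r^2 is identical either way.
print(round(r_raw**2, 6) == round(r_transformed**2, 6))  # True
```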

In other words, we're just moving the data's perspective in a multi-dimensional space, relative to US the viewers. You can zoom, stretch, angle, compress, and turn data in any way you want!

Let's take a second and think about this. For a moment, think of our data as a cube-- just to help conceptualize what's going on.

If we turn, flip, invert, scale, zoom out, or angle the cube in any way-- has the cube itself changed? Absolutely not. It's the exact same cube!

We're simply looking at it from a different perspective. So when we transform a "Data Vector / Cube" (as long as we "undo" those changes when we analyze the data in real terms)-- we're just finding that perfect angle to tell our story and create a compelling visual. That's powerful and exciting!

Victory is mine! Data hath been conquered!

Even with these marvelous findings, we must address the issue of primary concern--Causation vs Correlation! Based on statistics--- "data driven" results, and the interpretation we proposed earlier-- I'm the worst!

However, that's a myopic approach to statistics. Rather-- I bet you there's a 3rd variable indicative of the movement of stock price. What does days worked really represent? It is merely a count of the past ~90 days. So what else has happened in that period?

Well, if we consider the fact that the company is a major oilfield services firm-- or pick our heads up and look at the companies and markets around us-- we quickly realize the missing link is the price of oil (at least I certainly hope so!).

What you should realize is that these relationships aren't always evident or obvious! In fact, visualizations in their raw form could disguise relationships! Statistics is still a subjective science-- subject to the availability of information and robustness of the analyst's forethought and interpretation!

More importantly, we can identify the importance that the macro oil market plays in stock price, rather than chasing otherwise extraneous relationships! For brevity's sake, we'll omit another full analysis saga.

Most importantly, this should help to exemplify one of the most exciting value potentials of "Big Data". Essentially, we now have access to incredible amounts of information relative to the "universal variable" of time. With that common point to relate on, we can now see how major indexes, markets, events, weather patterns, customer announcements, etc. interrelate!

As we move towards an even smaller and more interconnected world, expect to see more "Universal" data points-- (and actively promote them, in the long run, it'll make your analysis more resilient and dynamic!).

Thanks for reading!

Follow Alex Jones


Read more…

Interactive Data Visualization for the Web, by Scott Murray, O’Reilly (2013), has a Free online version.

An introduction to D3 for people new to programming and web development, published by O’Reilly. “Explaining tricky technical topics with aplomb (and a little cheeky humor) is Scott Murray’s forte. If you want to dive into the world of dynamic visualization using web standards, even if you are new to programming, this book is the place to start.” - Mike Bostock, creator of D3.

From O’Reilly website: "This step-by-step guide is ideal whether you’re a designer or visual artist with no programming experience, a reporter exploring the new frontier of data journalism, or anyone who wants to visualize and share data. Create and publish your own interactive data visualization projects on the Web—even if you have little or no experience with data visualization or web development. It’s easy and fun with this practical, hands-on introduction. Author Scott Murray teaches you the fundamental concepts and methods of D3, a JavaScript library that lets you express data visually in a web browser. Along the way, you’ll expand your web programming skills, using tools such as HTML and JavaScript"

This online version of Interactive Data Visualization for the Web includes 44 examples that will show you how to best represent your interactive data. For instance, you'll learn how to create this simple force layout with 10 nodes and 12 edges. Click and drag the nodes below to see the diagram react.

Read more…

35 books on Data Visualization

1. The Visual Display of Quantitative Information. Author: Edward Tufte. Publisher: Graphics Press, 1983. Pages: 197. A modern classic. Tufte teaches the fundamentals of graphics, charts, maps and tables. "A visual Strunk and White" (The Boston Globe). Includes 250 delightfully entertaining illustrations, all beautifully printed.
Read more…

Beyond The Visualization Zoo

The best document I have read on visualization is called "A Tour Through The Visualization Zoo" by Jeffrey Heer, Michael Bostock, Vadim Ogievetsky. It's a must-read picture book for aspiring Data Scientists. Most of the graphics from this post are examples of the Tour taken from the d3 gallery.
Read more…
The top tech companies by market capitalization are IBM, HP, Oracle, Microsoft, Cisco, SAP, EMC, Apple, Amazon and Google. All of the top tech companies are selected based on their current market capitalization, with the exception of Yahoo. The year 2014 is not included as part of this analysis. Data: the source of this data is the public financial records from
Read more…
"A picture is worth a thousand words" or in the case of Data Science, we could say "A picture is worth a thousand statistics". Interactive Data Visualization or Visual Analytics has become one of the top trends in transforming business intelligence (BI) as technologies based on Visual Analytics have moved into widespread use.
Read more…
