
Guest blog post by Mike Kennedy

Analytics can be used in a variety of ways to demonstrate real business value for organizations. Talent analytics, defined as the measurement of an organization's talent, is no exception.

Imagine being able to use analytics to forecast, at a glance, how four different teams within a large organization may interact with each other.

With today's talent analytics technology, you're able to identify trends and forecast employees' behavior at the team (or organizational) level. CEOs and human capital executives can then leverage this data to align talent assets strategically. See below for a screenshot.

Are these analytics useful for your organization?

Comment below or via @talentanalytics on Twitter.

 

Read more…

Why is Data now called “Big”?: Part 1

Ever since man could count, we have used data to make sense of the world around us, measuring phenomena through the correcting lens of statistics and facts.

The amount of data we’ve traditionally been able to collect and store has been comparatively small, and notoriously difficult to handle when we were swamped with too much of the stuff, which is why population censuses are still conducted only rarely. A traditional data analyst’s stock-in-trade was the classic database table, with its neat rows and columns, from which one could extrapolate meaningful insights, albeit from a comparatively small sample that was sorta representative.

Jump cut to 2015 and data’s “small”, “neat” and “sorta” has been expunged by “big”, “messy” and “spot-on”. And as the era of the ‘datasphere’ dawns, the definition of data has had to be rewritten and a new set of (super-sized) rules issued. We are learning to play the ‘data’ game afresh, with the realization that it has now made its way from being a back-room activity to front-page news.

Like many things born out of the Digital Age, we have once more been wrong-footed by something thought to be immune to re-invention. In this case, it’s data. Its rapid change in identity has brought with it a new set of possibilities to produce, collect and analyse information that’s proving to be nothing short of transformational. And the game-changer in all this is volume. Hence the word “Big”. In the same way that ‘social’ has become shorthand for describing how we interact with each other, “Big” similarly describes the relationship we now have with data.

To give Big some context, 90% of the data in the world today has been created in the last 2 years alone. That’s not just Big, that’s awesome!

Google alone processes 24 petabytes of data from 3 billion daily searches (a petabyte = 1,000,000 gigabytes) which is thousands of times the quantity of all printed material stored in the U.S. Library of Congress. In fact, if all the world’s current stored data were turned into books they would cover the entire surface of the United States some 52 layers thick.

Data has gone from being a comparatively small, static puddle of information to an ever expanding ocean of facts within a very short space of time. We are also about to be hit by another tsunami-like wave of unstructured data as the always “on” smart world envisaged by the Internet of Things becomes more and more of a daily reality.

In Part 2 of “Why is Data Now called Big?” we’ll take a look at big data’s game-changing attributes of Volume, Velocity and Variety, and see how these characteristics are helping us to better predict the actions of our colleagues and customers.

Written by – @jasonburrows - Non-executive Sales Representative at Future Fresh Ltd.

Read more…

Guest blog post by Kenneth C Black

Introduction

I've been writing a Tableau- and Alteryx-focused blog on WordPress for 1.5 years and hadn't thought of writing anything here on DSC. I just completed a two-part series, my 99th and 100th blog posts, that discusses solving problems using innovative approaches with Alteryx and Tableau. They are longer than usual but offer good insight into my background and why I write a technical blog.

My blog is focused on solving business and science problems using Alteryx, Tableau and R. I write a lot about technical techniques that I have developed using these tools. I have extensive experience with Tableau, beginning in 2008. I have less experience with Alteryx and R but I am deeply involved in using them for some amazing applications. 

To view my two most recent posts, click the following links: 

How and Why Alteryx and Tableau Allow Me to Innovate

  • Part 1: You will eventually learn why I have such passion for using Alteryx and Tableau and why I feel compelled to write about these technologies.
  • Part 2: I focus on explaining how I see creativity and innovation being expressed through the use of Alteryx and Tableau. I explore how the combined usage of these tools allows us to solve complex business and science problems. I think this is important to document because we are now working on problems that were not possible to solve just a few years ago. It is my goal over the next 100 blog posts to document some of the concepts we have developed to accomplish great things.

To review my blog post history and search my site, you can use my Tableau Public workbook; click here.

Links to My Favorite Posts:

Follow me on Twitter | My Blogging Site

Read more…

Guest blog post by Eric Cai

Update on February 2, 2014:

Harlan also noted in the comment below that any truncated kernel density estimator (KDE) from density() in R does not integrate to 1 over its support set.  Thanks to Julian Richer Daily for suggesting on AnalyticBridge to scale any truncated kernel density estimator (KDE) from density() by its integral to get a KDE that integrates to 1 over its support set.  I have used my own function for trapezoidal integration to do so, and this has been added below.

I thank everyone for your patience while I took the time to write a post about numerical integration before posting this correction.  I was in the process of moving between jobs and cities when Harlan first brought this issue to my attention, and I had also been planning a major expansion of this blog since then.  I am glad that I have finally started a series on numerical integration to provide the conceptual background for the correction of this error, and I hope that they are helpful.  I recognize that this is a rather late correction, and I apologize for any confusion.

Update on July 15, 2013:

Thanks to Harlan Nelson for noting on AnalyticBridge that the ozone concentrations for both New York and Ozonopolis are non-negative quantities, so their kernel density plot should have non-negative support sets.  This has been corrected in this post by

- defining new variables called max.ozone and max.ozone2

- using the options "from = 0" and "to = max.ozone" or "to = max.ozone2" in the density() function when defining density.ozone and density.ozone2 in the R code.

This post was originally published on my blog.  Please visit The Chemical Statistician for more posts on statistics, machine learning, data analysis and R programming, especially in application to chemistry!

For the sake of brevity, this post has been created from the second half of a previous long post on kernel density estimation.  This second half focuses on constructing kernel density plots and rug plots in R.  The first post focused on the conceptual foundations of kernel density estimation.

Introduction

This post follows the recent introduction of the conceptual foundations of kernel density estimation.  It uses the “Ozone” data from the built-in “airquality” data set in R and the previously simulated ozone data for the fictitious city of “Ozonopolis” to illustrate how to construct kernel density plots in R.  It also introduces rug plots, shows how they can complement kernel density plots, and shows how to construct them in R.

This is another post in a recent series on exploratory data analysis, which has included posts on descriptive statistics, box plots, violin plots, the conceptual foundations of empirical cumulative distribution functions (eCDFs), and how to plot empirical CDFs in R.

[Figure: kernel density plot with rug plot of the New York ozone data]

Read the rest of this post to learn how to create the above combination of a kernel density plot and a rug plot!

Example: Ozone Pollution Data from New York and Ozonopolis

Recall that I used 2 sets of ozone data in my last post about box plots.  One came from the “airquality” data set that is built into R.  I simulated the other one and named its city of origin “Ozonopolis”.  Here are the code and the plot of the kernel density estimates (KDEs) of the 2 ozone pollution data sets.  I used the default settings in density() – specifically, I used the normal (Gaussian) kernel and the “nrd0” method of choosing the bandwidth.  I encourage you to try the other settings.  I have used the set.seed() function so that you can replicate my random numbers.

Thanks to Harlan Nelson and Julian Richer Daily, I have learned that the KDE from density() does not integrate to 1 if the support set is truncated with the “from = ” or “to = ” options.  To correct this problem, I have used trapezoidal integration to integrate the resulting KDE and divide the KDE by that integral; this scaled KDE will integrate to 1.  In my correction below, I saved the function in an R script called “trapezoidal integration.R” in my working directory, and I then called it via the source() function.
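
The trapezoidal integration function itself is not reproduced in this excerpt. A minimal sketch of what such a function might look like (the actual function saved in "trapezoidal integration.R" may differ) is:

# rough sketch of a trapezoidal integration function, assuming x is sorted
# and y contains the corresponding density values; returns the approximate integral
trapezoidal.integration = function(x, y)
{
     if (length(x) != length(y)) stop('x and y must have the same length')
     # each trapezoid has area 0.5 * (x[i+1] - x[i]) * (y[i] + y[i+1])
     sum(0.5 * diff(x) * (head(y, -1) + tail(y, -1)))
}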

##### Kernel Density Estimation 
##### By Eric Cai - The Chemical Statistician

# clear all variables in the workspace
rm(list = ls(all.names = TRUE))

# set working directory
setwd('INSERT YOUR WORKING DIRECTORY PATH HERE')

# extract "Ozone" data vector for New York
ozone = airquality$Ozone

# calculate the number of non-missing values in "ozone"
n = sum(!is.na(ozone))

# calculate mean, variance and standard deviation of "ozone" by excluding missing values
mean.ozone = mean(ozone, na.rm = T)
var.ozone = var(ozone, na.rm = T)
sd.ozone = sd(ozone, na.rm = T)
max.ozone = max(ozone, na.rm = T)

# simulate ozone pollution data for ozonopolis
# set seed for you to replicate my random numbers for comparison
set.seed(1)

ozone2 = rgamma(n, shape = mean.ozone^2/var.ozone+3, scale = var.ozone/mean.ozone+3)
max.ozone2 = max(ozone2)

# obtain values of the kernel density estimates
density.ozone = density(ozone, na.rm = T, from = 0, to = max.ozone)
density.ozone2 = density(ozone2, na.rm = T, from = 0, to = max.ozone2)

# access function for trapezoidal integration
source('trapezoidal integration.r')

# scale the kernel density estimates by their integrals over their support sets
kde = density.ozone$y
support.ozone = density.ozone$x
integral.kde = trapezoidal.integration(support.ozone, kde)
kde.scaled = kde/integral.kde

kde2 = density.ozone2$y
support.ozone2 = density.ozone2$x
integral.kde2 = trapezoidal.integration(support.ozone2, kde2)
kde.scaled2 = kde2/integral.kde2

# number of points used in density plot
n.density1 = density.ozone$n
n.density2 = density.ozone2$n

# bandwidth in density plot
bw.density1 = density.ozone$bw
bw.density2 = density.ozone2$bw


# plot kernel density estimates and export as PNG image
png('kernel density plot ozone.png')

plot(support.ozone2, kde.scaled2, ylim = c(0, max(kde.scaled)), main = 'Kernel Density Estimates of Ozone \n in New York and Ozonopolis', xlab = 'Ozone (ppb)', ylab = 'Density', lty = 1, pch = 1, col = 'orange')

points(support.ozone, kde.scaled, pch = 2, col = 'blue')

# add legends to state sample sizes and bandwidths; notice use of paste()
legend(100, 0.015, paste('New York: N = ', n.density1, ', Bandwidth = ', round(bw.density1, 1), sep = ''), bty = 'n')

legend(100, 0.013, paste('Ozonopolis: N = ', n.density2, ', Bandwidth = ', round(bw.density2, 1), sep = ''), bty = 'n')

# add legend to label plots
legend(115, 0.011, c('New York', 'Ozonopolis'), pch = c(2,1), col = c('blue', 'orange'), bty = 'n')

dev.off()

[Figure: kernel density estimates of ozone in New York and Ozonopolis]

It is clear that Ozonopolis has more ozone pollution than New York.  The right-skewed shapes of both curves also suggest that the normal distribution may not be suitable.  (If you read my blog post carefully, you will already see evidence of a different distribution!)  In a later post, I will use quantile-quantile plots to illustrate this.  Stay tuned!

To give you a better sense of why the density plots have higher “bumps” at certain places, take a look at the following plot of the ozone pollution just in New York.  Below the density plot, you will find a rug plot – a plot of tick marks along the horizontal axis indicating where the data are located.  Clearly, there are more data in the neighbourhood between 0 and 50, where the highest “bump” is located.  Use the rug() function to get the rug plot in R.



# plot KDE with rug plot and export as PNG image
png('kernel density plot with rug plot ozone New York.png')
plot(support.ozone, kde.scaled, main = 'Kernel Density Plot and Rug Plot of Ozone \n in New York', xlab = 'Ozone (ppb)', ylab = 'Density')
rug(ozone)

dev.off()

[Figure: kernel density plot and rug plot of ozone in New York]

 


Read more…

Guest blog post by Venky Rao

In today's post, we explore the use of decision trees in evidence based medicine.  In 1996 David Sackett wrote that "Evidence-based medicine is the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients" [Source: Wikipedia].
 
For our analysis, we start with a data set containing records for a number of patients, all of whom suffered from the same illness.  Each of these patients responded well to one of five medications.  We will use a decision tree to understand what factors in each patient's history led to them responding well to one specific medication over the others.  We will then use our findings to generate a set of evidence-based rules or policies that doctors can follow to treat this illness in future patients.  As part of our analysis, we will also explore how to interpret decision trees.
 
Let us first look at our data set:
 
 
As can be seen, the data set contains information about the age and gender of each patient along with Blood Pressure, Cholesterol, Sodium and Potassium levels.  Finally, we have information about the drug the patient responded well to.
 
Next we use a couple of graphical outputs to understand our data better.  We first look at the distribution of the data set broken down by the drug that was used to treat the illness:
 
 
From the distribution graph above, we can see that drug Y is the most popular drug in treating the illness while there are a few cases that were treated by drugs B and C as well.  In each of the cases, the distribution of male and female patients appears to be approximately equal.  So it does not appear that gender is a factor in determining which drug will work in treating the illness.
 
Next we generate a scatter plot using the two continuous variables (Sodium and Potassium levels) as the X and Y axes and use Drug as the overlay variable:
 
From the scatter plot, one thing appears clear: where Potassium levels are 0.05 or below, the drug that works best is Drug Y, whereas that drug rarely works in cases where the Potassium levels exceed 0.05.  Another thing that is clear is that the ratio of Sodium to Potassium levels has a bearing on which drug works well.  Based on this conclusion, we derive a new field in our data set as follows:
 
 
Our new data set is as follows:
 
 
We are now ready to create a decision tree model based on our data set.  In order to do this, we first identify our target variable (Drug) and our predictor variables as follows:
 
 
We then add a C5.0 modeling node to the data to create our decision tree.  Upon running our model, we observe that the most important predictors in determining which drug to use are as follows:
 
 
As can be seen from this chart, the ratio of Sodium to Potassium is the most important predictor in determining which drug should be used to treat the illness.  This reinforces the insight we obtained from the scatter plot above.  Another insight that is reinforced is that the gender of the patient is not an important predictor in determining which drug should be prescribed to treat the illness.
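
The original analysis is built in a visual workflow tool and its screenshots are not reproduced here. As a rough, hypothetical equivalent in R using the C50 package (the data frame name "patients" and the column names Age, Sex, BP, Cholesterol, Na, K and Drug are assumptions, not taken from the article), the same steps might look like:

# hedged sketch only: derive the Na-to-K ratio and fit a C5.0 decision tree in R
# assumes a data frame "patients" with columns Age, Sex, BP, Cholesterol, Na, K, Drug
library(C50)

# derive the Sodium-to-Potassium ratio, mirroring the scatter-plot insight above
patients$Na_to_K = patients$Na / patients$K

# target variable: Drug (must be a factor); predictors: the remaining fields
patients$Drug = as.factor(patients$Drug)
drug.tree = C5.0(Drug ~ Age + Sex + BP + Cholesterol + Na_to_K, data = patients)

# inspect the fitted tree, its rules and the variable importance scores
summary(drug.tree)
C5imp(drug.tree)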
 
Next we take a look at the decision tree generated by the C5.0 algorithm:
 
 
The decision tree above has broken down the entire data set based on the important predictors and has identified the exact situations in which a specific drug should be prescribed to treat the illness.  We interpret the tree as follows:
 
Node 0
 
Node 0 is simply a distribution of the entire data set based on the drug used to treat the illness.  This displays the same output shown by our distribution chart above.
 
Node 1 and Node 9
 
Next, the decision tree breaks down the list based on the most important predictor variable: the ratio of Sodium to Potassium.  The threshold identified by the C5.0 algorithm is 14.64.  Where the ratio of Sodium to Potassium exceeds 14.64, 100% of the patients respond well to Drug Y (Node 9).  Since we have clearly identified those patients that respond well to Drug Y, Node 9 is a terminal node, i.e. no further analysis is required.
 
In Node 1, on the other hand, we see that Drug X is the most popular drug; however, other drugs have also been used.  Since there is no clear answer as to the best drug, the algorithm continues the analysis based on the next most important predictor variable, Blood Pressure.
 
Node 2, Node 5 and Node 8
 
The decision tree then creates three new nodes based on the Blood Pressure levels of the patients.  Where Blood Pressure is Normal, 100% of the patients respond well to Drug X (Node 8).  Since we have clearly identified those patients that respond well to Drug X, Node 8 is a terminal node, i.e. no further analysis is required.
 
Nodes 2 and 5 however need further analysis.  Node 2 consists of patients that responded well to Drugs A and B while Node 5 consists of patients that responded well to Drugs C and X.
 
Node 3 and Node 4
 
We first examine Node 2 in further detail.  The decision tree breaks this category down by Age.  Where the age of the patient is less than or equal to 50 years old, the drug that works best in 100% of the cases is Drug A.  Since we have clearly identified those patients that respond well to Drug A, Node 3 is a terminal node, i.e. no further analysis is required.  Similarly, where the age of the patient exceeds 50 years old, the drug that works best in 100% of the cases is Drug B.  Since we have clearly identified those patients that respond well to Drug B, Node 4 is also a terminal node, i.e. no further analysis is required.
 
Node 6 and Node 7
 
Finally, we examine Node 5 in further detail.  The decision tree breaks this category down by Cholesterol.  Where the cholesterol level is normal, the drug that works best in 100% of the cases is Drug X.  Since we have clearly identified those patients that respond well to Drug X, Node 6 is a terminal node, i.e. no further analysis is required.  Similarly, where the cholesterol level is high, the drug that works best in 100% of the cases is Drug C.  Since we have clearly identified those patients that respond well to Drug C, Node 7 is also a terminal node, i.e. no further analysis is required.
 
Evidence based rules for drug prescription
 
Based on this decision tree, we are able to generate the following evidence-based rules for drug prescription:
 
Drug A
 
Prescribe Drug A where the ratio of Sodium to Potassium is 14.64 or below, Blood Pressure falls in the Node 2 branch, and the patient is 50 years old or younger.
 
Drug B
 
Prescribe Drug B where the ratio of Sodium to Potassium is 14.64 or below, Blood Pressure falls in the Node 2 branch, and the patient is older than 50.
 
Drug C
 
Prescribe Drug C where the ratio of Sodium to Potassium is 14.64 or below, Blood Pressure falls in the Node 5 branch, and Cholesterol is high.
 
Drug X
 
Prescribe Drug X where the ratio of Sodium to Potassium is 14.64 or below and Blood Pressure is normal, or where Blood Pressure falls in the Node 5 branch and Cholesterol is normal.
 
Drug Y
 
Prescribe Drug Y where the ratio of Sodium to Potassium exceeds 14.64.
Based on these rules, both doctors and nurse practitioners have a clearly defined set of guidelines to treat patients in order to obtain improved patient outcomes leading to efficiencies, lower costs and improved patient satisfaction with healthcare providers.
Read more…
Space is limited.
Reserve your Webinar seat now
 
Please join us on April 28th, 2015 at 9am PDT for our latest Data Science Central Webinar Event: The Beautiful Science of Data Visualization sponsored by Tableau Software.

Seeing and understanding data is richer than creating a collection of queries, dashboards, and workbooks. We'll see how visual and cognitive science explain what makes data visualization so deeply satisfying. Why does a collection of bars, lines, colors, and boxes become surprisingly powerful and meaningful? How does fluid interaction with data views multiply our intelligence? Three decades of research into the beautiful science of data visualization explain why history has converged at this moment, and why interactive data visualization has brought us to the verge of an exciting new revolution.

Speaker:
Jeff Pettiross, User Experience Manager, Tableau Software

Hosted by: Tim Matteson, Cofounder, Data Science Central
 
Title:  The Beautiful Science of Data Visualization
Date:  Tuesday, April 28th, 2015
Time:  9:00 AM - 10:00 AM PDT
 
Again, space is limited, so please register early:
Reserve your Webinar seat now
 
After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

Guest blog.

D3.js is a JavaScript library for illustrating data using open web standards such as HTML, SVG & CSS. The advantage of D3 is that you can use it to represent data based on freely available web standards instead of proprietary formats.

We will illustrate the power of the data visualization framework using a complex & interactive Chord Diagram to show linkages between various designers of Apple’s design team.

We analyzed 380 patents of Apple’s head designer, Jonathan Ive. We then identified the colleagues who were listed most often as co-inventors on those 380 patents.

The full interactive D3 Chord Visualization is available here.

The above diagram shows Daniel J Coster as the co-inventor who worked with Ive on the most patents; he is represented with the largest slice. The slices get smaller as the number of patents co-invented with Ive decreases. For example, Steve Jobs appears as a co-inventor on only 161 of the 380 patents that Ive is listed on.

The beauty of the D3 framework is the interactive nature of the charts. When we click on Steve Jobs’s name, the various chords and their relative widths show the number of patents that were co-invented by Jobs & other colleagues. This interactivity is hard to obtain in other data visualization frameworks that are not proprietary and rely only on open web standards. 
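
For readers who prefer R (the language used elsewhere on this page), a similar static chord diagram can be sketched with the circlize package. The matrix below is purely illustrative: apart from the 161 Ive-Jobs patents mentioned above, the counts and the "Others" group are made up for the example.

# hedged sketch: a chord diagram of co-invention counts using the circlize package
# the numbers below are illustrative placeholders, not the actual patent data
library(circlize)

designers = c("Ive", "Coster", "Jobs", "Others")
m = matrix(c(  0, 200, 161, 120,
             200,   0,  90,  80,
             161,  90,   0,  60,
             120,  80,  60,   0),
           nrow = 4, byrow = TRUE,
           dimnames = list(designers, designers))

# chord widths are proportional to the shared patent counts
chordDiagram(m)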

The JavaScript code for the chord visualization is described here:

http://bl.ocks.org/mbostock/4062006

For other types of D3 visualizations, check out the gallery here:

https://github.com/mbostock/d3/wiki/Gallery

The GitHub repository with examples is given here:

https://github.com/mbostock/d3/wiki

===

This post is by Pansop

Read more…

Stimulation through Gamification

As creative thinkers we are always finding new ways to stimulate and engage our audience within our products.

Our design team has a combined total of over 10 years’ experience working in the E-Learning sector, and we have found that using Gamification within digital training significantly improves the learning rate and sustains screen engagement over longer periods of time.

To explain: Gamification is the concept of applying game-based mechanics to non-game applications, making them more engaging and interactive. Gamification motivates users throughout their training, encouraging them to achieve their goals and training objectives. This is done by tapping into their desire to succeed.

It is important to monitor the performance of the user to ensure they achieve the results and objectives required to successfully complete the training. This can be done by creating a personal account for the user, where they can collect awards and achievements for their learning. By giving the user the chance to earn rewards, they show increased performance and engagement throughout the game.

The activities and content around Gamification can vary depending on the application.

RecycleBank is an online learning library with E-learning courses that educate, inspire and reward users for sustainable living and recycling. Each user is rewarded with points by completing the training course, which can be exchanged for discounts through e-commerce websites or local stores. Rewarding the user encourages progression and development naturally, while maintaining more focus and engagement throughout.

 https://www.recyclebank.com/ – 2015

Foldit is an online puzzle game, part of an experimental research project developed by the University of Washington’s Center for Game Science. The objective of the game is to fold the structure of specific proteins using the tools within the game. The game allows people from all over the world to play and compete in solving tasks, which helps gather research and information used towards fighting AIDS. A major breakthrough came in 2011, when the crystal structure of an M-PMV retroviral protease was deciphered in 10 days. This game is more interactive than a simple ‘click and tell’ E-learning course, but it is still based around the user earning a high score and competing with other users to achieve the best results.

https://www.youtube.com/watch?v=lGYJyur4FUA – 2015

Both of these examples show how Gamification can strengthen E-learning and big data. It has been predicted that by 2015 over 70% of Global 2000 organizations will have at least one Gamification-based application, and over 71% of companies expect big data to create a significant increase in sales and revenue over the next 12 months.

We are continuing to create new game-based applications and digital training courses that benefit our customers and fulfil their big data needs. Over the coming months we have been given the opportunity to experiment and develop applications using new technology, including virtual reality. This technology is rapidly becoming available to many companies, and could open new doors for training and Gamification.

Read more…

Top 10 bloggers writing about Tableau

Guest blog post by Kenneth C Black

For the past two years, Tableau Software has been publishing a monthly list it calls the "Best of Tableau Web". Essentially, this list of nearly 500 publications represents the best articles, blog posts, techniques, etc. created by users of Tableau software. The selections have been made by Tableau employees and represent an unbiased assessment of the web-based activity related to Tableau Software.

All available sources have been compiled into a database and a quantitative analysis of the information has been completed. Emerging from the noise of this data is a list of the top 10 most consistently selected Tableau bloggers, comprising two companies and eight individuals. Due to the rapid growth and expansion of Tableau worldwide, a number of emerging blogging superstars are also identified.

Companies

Individuals

Click here to read full article.  

Read more…

Guest blog post by Tricia Aanderud

The first thing many customers want to do is make their SAS BI Dashboard look more like "them". They want their organizational branding (their logo, colors, and corporate identity). If that task has fallen to you, here's some guidance on how to accomplish it.

What the Heck is the Flex Theme Designer?


The SAS BI Dashboard is an Adobe Flash application. It can be customized using the SAS Theme Designer for Flex. The Theme Designer is also a Flash-based application, available from the browser, that allows you to control the look of the SAS Flash-based applications at your site. Moreover, you can move themes between environments. Here's a screenshot that shows the application. From the left side (User Interface Components) you can change the various elements, such as the colors of buttons, menus, or the landing page. The right pane shows how the various changes appear to the user.

Read the entire post

Read more…

Simple solutions to make videos with R

Guest blog post by Vincent Granville

I'm talking about streaming data displayed in video rather than chart format, like 200 scatter plots continuously updated, as in my recent video series from chaos to clusters, consisting of three parts:

In this article, I explain and illustrate how to produce these videos. You don't need to be a data scientist to understand.

Here's one frame from one version of video clip #3.

Here's the solution:

1. Produce the data that you want to visualize

Using Python, R, Perl, Excel, SAS or any other tool, produce a text file called rfile.txt, with 4 columns:

  • k: frame number
  • x: x-coordinate
  • y: y-coordinate
  • z: color associated with (x,y)

Or download my rfile.txt (sample data to visualize) if you want to exactly replicate my experiment. To access the most recent data, source code (R or Perl), new videos and explanations about the data, click here.

2. Run the following R script

Note that the first variable in the R script (as well as in my rfile.txt) is labeled iter: it is associated with an iterative algorithm that produces an updated data set of 500 (x,y) observations at each iteration. The fourth field is called new: it indicates if point (x,y) is new or not, for a given (x,y) and given iteration. New points appear in red, old ones in black.

# read the data to visualize and extract the frame (iteration) number
vv<-read.table("c:/vincentg/rfile.txt",header=TRUE);
iter<-vv$iter;

# loop over the 200 frames; each frame is plotted and saved as a PNG image
for (n in 0:199) {

  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];

  # new points (z = 1) are plotted in red, old points (z = 0) in black
  plot(x,y,xlim=c(0,1),ylim=c(0,1),pch=20,col=1+z,xlab="",ylab="",axes=FALSE);

  # save the current frame as Zorm_<n>.png in the target directory
  dev.copy(png,filename=paste("c:/vincentg/Zorm_",n,".png",sep=""));
  dev.off();

}

3. Producing the video

I can see 4 different ways to produce the video. When you run the R script, the following happens:

  • 200 images (scatter plots) are produced and displayed sequentially on the R Graphic window, in a period of about 30 seconds.
  • 200 images (one for each iteration or scatter plot) are saved as Zorm_0.png, Zorm_1.png, ..., Zorm_199.png in the target directory (c:/vincentg/ on my laptop)

The four options to produce the video are as follows:

  • Cave-man style: film the R Graphic frame sequence with your cell phone.
  • Semi cave-man style: use a screen-cast tool (e.g. Active Presenter) to capture the streaming plots displayed on the R Graphic window. 
  • Use Adobe or other software to automatically assemble the 200 Zorm*.png images produced by R.
  • Read this article about other solutions (open source ffmpeg or the ImageMagick library); a minimal ffmpeg sketch follows after this list. See also animation: An R Package for Creating Animations and Demonstrating Statistical Methods, published in the Journal of Statistical Software (April 2013 edition).
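
As a rough illustration of the last option, the saved frames can be stitched together by calling ffmpeg from within R. This assumes ffmpeg is installed and on the system path; the frame rate and output file name are arbitrary choices, not part of the original workflow.

# minimal sketch, assuming ffmpeg is installed and on the system path;
# stitches the Zorm_*.png frames produced above into an MP4 video
setwd("c:/vincentg");
system("ffmpeg -framerate 10 -start_number 0 -i Zorm_%d.png -pix_fmt yuv420p chaos_to_clusters.mp4");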

More details about what my video represents are coming soon. You can read this as a starting point, and watch three versions of my video: one posted on Analyticbridge, one version posted on YouTube, and one version produced with the Active Presenter screencast (2 MB download).

Note about Active Presenter

I used Active Presenter screen-cast software (free edition), as follows:

  1. I let the 200 plots automatically show up in fast motion in the R Graphics window (here's the R code, and the original 4MB dataset is available here as a text file)
  2. I selected with Active Presenter the area I wanted to capture (a portion of the R Graphic window, just like for a screenshot, except that here it captures streaming content rather than a static image)
  3. I clicked on Stop when finished, exported to WMV format, and uploaded the result to a web server for you to access

I created two new, better quality videos using Active Presenter:

From chaos to clusters (Part 2): View video on Analyticbridge | YouTube

From chaos to clusters (Part 3): View video on  Analyticbridge | YouTube

These are based on a data set with 2 additional columns, that you can download as a 7 MB text file or as a 3 MB compressed file. It also uses the following, different R script:

vv<-read.table("c:/vincentg/rfile.txt",header=TRUE);

iter<-vv$iter;

for (n in 0:199) {

  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];
  u<-vv$d2init[iter == n];
  v<-vv$d2last[iter == n];

  plot(x,y,xlim=c(0,1),ylim=c(0,1),pch=20+z,cex=3*u,col=rgb(z/2,0,u/2),xlab="",ylab="",axes=TRUE);

  Sys.sleep(0.05); # sleep 0.05 second between each iteration

}

To produce the second video, replace the plot function by

plot(x,y,xlim=c(0,1),ylim=c(0,1),pch=20,cex=5,col=rgb(z,0,0),xlab="",ylab="",axes=TRUE);

This new R script has the following features (compared with the previous R script):

  • I have removed the dev.copy and dev.off calls, to stop producing the png images on the hard drive (we don't need them here since we use screen-casts). Producing the png files slows down the whole process, and creates flickering videos. Thus this step removes most of the flickering.
  • I use the function Sys.sleep to make a short pause between each frame. Makes the video smoother.
  • I use rgb(r, g, b) inside the plot command to assign a color to each dot: (x, y) gets assigned a color that is a function of z and u, at each iteration.
  • The size of the dot (cex), in the plot command, now depends on the variable u: that's why you see bubbles of various sizes, that grow bigger or shrink.

Note that d2init (fourth column in the rfile2.txt input data used to produce the video) is the distance between location of (x,y) at current iteration, and location at first iteration; d2last (fifth column) is the distance between the current and previous iterations, for each point. The point will be colored in a more intense blue if it made a big move between now and previous iteration.

The function rgb(r, g, b) accepts 3 parameters r, g, b with values between 0 and 1, representing the intensity respectively in the red, green and blue frequencies. For instance rgb(0,0,1) is blue, rgb(1,1,0) is yellow, rgb(0.5,0.5,0.5) is grey, rgb(1,0,1) is purple. Make sure 0 <= r, g, b <=1 otherwise this stuff will crash.

Conclusions

Enjoy, and hopefully you can replicate my steps and impress your boss! It did not cost me any money. By the way, which version of the video do you like best? Of course, I'm going to play more with these tools, and see how to produce better videos - including via optimizing my Perl script to produce slow-moving, rectangular frames. Stay tuned!

I'm also wondering if instead of producing this as a video, it might be faster, more efficient to just simply access the graphic memory with low level code (maybe in old C), and update each point in turn, directly in the graphic memory. Or maybe have a Web app (SaaS) doing the job: it would consist of an API accepting frames (or better, R code) as input, and producing the video as output.

The whole process - producing the output data, running the R script, producing the video - took less than 5 minutes. Wondering if someone ever created an infinite video: one that goes on non-stop with thousands of new frames added every hour. I can actually produce my frames (in my video)  faster than they are delivered by the streaming device. This is really crazy - I could call it faster than real time (FRT).


Read more…

Guest blog post by Mirko Krivanek

Interesting infographics from CrowdFlower. In the hot category, I would add data plumbing, sensor data to better predict earthquakes, weather or solar flares, predictive analytics for flu and other health or environmental issues, automating data science and man-made statistical analyses, pricing optimization for medical procedures, customized drugs, car traffic optimization via sensor data, properly trained data scientists involved in decisions and replacing business analysts, and the death of the data silo.


Read more…

40 maps that explain the Internet

Great article published in Vox.com. Here are a few of the most spectacular ones.

1. Privatization of the Internet backbone (1994)

2. Chrome taking over the world (animated image: click on picture to see animation)

3. Places with no broadband access in 2011

4. Fiber optic cables

Read full article here (40 maps total). Also, you can find a great selection of articles about data visualization by clicking here.


Read more…

Great graphic diagrams

Guest blog post by Vincent Granville

Everybody talks about beautiful charts and graphs, usually produced with Tableau or R; see the links below. They are definitely great, and I guess publishers (including us!) love to regularly show such graphs to our readers because they generate a lot of reactions. 

But what about graphic diagrams (see examples below), the kind of stuff you typically produce with Visio? Why do very few people mention them? They are certainly as important as infographics for communication purposes, although they are used in the early stage (design) of data science projects, rather than in the final stages (sharing insights). They are certainly very useful to represent database schemas, algorithms, workflows, complex hierarchical processes, patent illustrations, dashboard designs and essentially any type of architecture associated with data science projects. And even for marketing purposes.

By the way, which software packages produce these diagrams (besides database GUIs and Visio)? Can Tableau, R or SAS produce them? See five random illustrations below: one produced with JMP (a decision tree), one with Visio (a click-scoring diagram from one of my patent documents), and the first three - I am not sure how they were produced (they are not from me; the first one might be from O'Reilly).


Read more…

Guest blog post by Analytic Girl

See this niche chart, made with Tableau and produced by the Meteoritical Society. Below is a screen shot. The original chart is interactive (you can zoom in, etc.)

Download the entire data set: it's a 7 MB spreadsheet of 34,513 meteorites, last updated in 2012, with the following fields:

  • place
  • type_of_meteorite
  • mass_g
  • fell_found
  • year
  • database
  • coordinate_1
  • coordinates_2
  • cartodb_id
  • created_at
  • updated_at
  • year_date
  • longitude
  • latitude
  • geojson

Useful to make forecasts, broken down by meteorite size.
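
As a rough starting point for such forecasts, the counts can be tabulated by year and size class in R. The file name below is a placeholder, and the size breaks are arbitrary; only the mass_g and year columns come from the field list above.

# hedged sketch: tabulate meteorite counts by year and size class
# "meteorites.csv" is a placeholder file name; mass_g and year come from the field list above
met = read.csv("meteorites.csv")

# bin mass_g (grams) into broad size classes
met$size = cut(met$mass_g, breaks = c(0, 100, 10000, Inf),
               labels = c("small", "medium", "large"))

# counts per year and size class, a simple basis for forecasting by meteorite size
counts = table(met$year, met$size)
tail(counts)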


Read more…

Deriving Value with Data Visualization Tools

Guest blog post by Gabriel Lowy

Financial institutions, like many other industries, are grappling with how best to harness and extract value from big data.  Enabling users to either “see the story” or “tell their story” is the key to deriving value with data visualization tools, especially as data sets continue to grow.

With terabytes and petabytes of data flooding organizations, legacy architectures and infrastructures are becoming overmatched when it comes to storing, managing and analyzing big data.  IT teams are ill-equipped to deal with the rising requests for different types of data, specialized reports for tactical projects and ad hoc analytics.  Traditional business intelligence (BI) solutions, where IT presents slices of data that are easier to manage and analyze or creates pre-conceived templates that only accept certain types of data for charting and graphing, miss the potential to capture deeper meaning and to enable proactive, or even predictive, decisions from big data.

Out of frustration and under pressure to deliver results, user groups increasingly bypass IT.  They procure applications or build custom ones without IT’s knowledge.  Some go so far as to acquire and provision their own infrastructure to accelerate data collection, processing and analysis.  This time-to-market rush creates data silos and potential GRC (governance, regulatory, compliance) risks.

Users accessing cloud-based services – increasingly on devices they own – cannot understand why they face so many hurdles in trying to access corporate data.  Mashups with externally sourced data such as social networks, market data websites or SaaS applications are virtually impossible, unless users possess the technical skills to integrate different data sources on their own.

Steps to visualize big data success

Architecting from users’ perspective with data visualization tools is imperative for management to visualize big data success through better and faster insights that improve decision outcomes.  A key benefit is how these tools change project delivery.  Since they allow value to be visualized rapidly through prototypes and test cases, models can be validated at low cost before algorithms are built for production environments.  Visualization tools also provide a common language by which IT and business users can communicate.

To help shift the perception of IT from being an inhibiting cost center to a business enabler, it must couple data strategy to corporate strategy.  As such, IT needs to provide data in a much more agile way.  The following tips can help IT become integral to how their organizations provide users access to big data efficiently without compromising GRC mandates: 

  • Aim for context.  The people analyzing data should have a deep understanding of the data sources, who will be consuming the data, and what their objectives are in interpreting the information.  Without establishing context, visualization tools are less valuable. 
  • Plan for speed and scale.  To properly enable visualization tools, organizations must identify the data sources and determine where the data will reside.  This should be determined by the sensitive nature of the data.  In a private cloud, the data should be classified and indexed for fast search and analysis.  Whether in a private cloud or a public cloud environment, clustered architectures that leverage in-memory and parallel processing technologies are most effective today for exploring large data sets in real-time.
  • Assure data quality.  While big data hype is centered on the volume, velocity and variety of data, organizations need to focus on the validity, veracity and value of the data more acutely.  Visualization tools and the insights they can enable are only as good as the quality and integrity of the data models they are working with.  Companies need to incorporate data quality tools to assure that data feeding the front end is as clean as possible.
  • Display meaningful results.  Plotting points on a graph or chart for analysis becomes difficult when dealing with massive data sets of structured, semi-structured and unstructured data.  One way to resolve this challenge is to cluster data into a higher-level view where smaller groups of data are exposed.  By grouping the data together, a process referred to as “binning”, users can more effectively visualize the data (see the short sketch after this list).
  • Deal with outliers.  Graphical representations of data using visualization tools can uncover trends and outliers much faster than tables containing numbers and text.  Humans are innately better at identifying trends or issues by “seeing” patterns.  In most instances, outliers account for 5% or less of a data set.  While small as a percentage, when working with very large data sets these outliers become difficult to navigate.  Either remove the outliers from the data (and therefore the visual presentation) or create a separate chart just for the outliers.  Users can then draw conclusions from viewing the distribution of data as well as the outliers.  Isolating outliers may help reveal previously unseen risks or opportunities, such as detecting fraud, changes in market sentiment or new leading indicators.
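
As a small illustration of the binning idea (the data here are simulated, not from the article), a large numeric variable can be grouped into a handful of bins before plotting with a few lines of R:

# hedged sketch of binning, using simulated data for illustration
set.seed(1)
x = rexp(1e6, rate = 0.1)        # one million skewed observations

# group the raw values into 20 equal-width bins
bins = cut(x, breaks = 20)

# plot the bin counts instead of a million individual points
barplot(table(bins), las = 2, cex.names = 0.6,
        main = "Binned view of a large data set", ylab = "Count")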

Where visualization is heading

Data visualization is evolving from the traditional charts, graphs, heat maps, histograms and scatter plots used to represent numerical values measured against one or more dimensions.  The trend toward hybrid enterprise data structures, which mesh traditional structured data usually stored in a data warehouse with unstructured data derived from a wide variety of sources, allows measurement against much broader dimensions.

As a result, expect to see greater intelligence in how these tools index results.  Also expect to see improved dashboards with game-style graphics.  Finally, expect to see more predictive qualities that anticipate user data requests, with personalized memory caches to aid performance.  This continues the trend toward self-service analytics, where users define the parameters of their own inquiries on ever-increasing sources of data. 

 

Read more…

72 Infographics about big data

Guest blog post by Mirko Krivanek

From BigData-Startups. The infographic below is just one of them.

Here's the list:

  1. How The USA Federal Government Thinks Big With Data
  2. Are You Ready For The Future of the Internet of Things?
  3. How Big Data Centers Impact the Environment
  4. A Look Into How Data Centers Actually Work
  5. How Big Data Gives Retailers a Competitive Edge and Boosts Growth
  6. How We Are Heading Towards a Smart Planet with The Internet of Things
  7. How Data Disasters Can Seriously Harm Your Company
  8. How Data Mining & Decision Support Systems Can Create A Powerful Marketing Strategy
  9. Five Myths Marketers Believe About Big Data
  10. The Explosion of the Internet of Things
  11. What Are The Trends: A Big Data Survey
  12. How The Internet of Things Will Make Our World Smart- Infographic
  13. Big Data Analytics Trends for 2014
  14. How Big Data Will Improve Decision Making in Your Organisation
  15. How M2M Data Will Have a Major Impact by 2020
  16. The World’s Most Unusual Data Centers
  17. 7 Ways Big Data Could Revolutionize Our Lives by 2020
  18. 5 Ways To Become Extinct As Big Data Evolves
  19. Why Marketers Should Stop Worrying And Start Loving Big Data
  20. In The World Of Digital Storage, Size Does Matter
  21. Maximize Online Sales With Product Recommendations
  22. Understanding The Various Sources of Big Data
  23. The Who, What and Why of Big Data
  24. Customers Sharing Their Personal Data Should Be Cared For
  25. Is Bad Data A Hazard For Your Customer Experience
  26. What The Consumer Really Thinks Of Data Privacy
  27. Financial Services Firms Leveraging Big Data
  28. A Reality Gap Exists With Big Data Initiatives
  29. Smart Cities Turn Big Data Into Insight
  30. Using Big Data To Predict Dengue Fever And Malaria Outbreaks
  31. Keeping Track Of The People Keeping Track Of You
  32. Are European Companies Ready For Big Data?
  33. How Are Companies Organising Their Big Data Initiatives
  34. 10 Greatest Challenges Preventing Businesses From Capitalizing On Big Data
  35. What Will The World Look Like When We Connect The Unconnected
  36. What Is The Value Of The Internet Of Things
  37. What Are The Real Costs Of A Data Breach
  38. How Google Applies Big Data To Know You
  39. How To Put Big Data To Work
  40. How To Become More Competitive With Big Data
  41. How Big Data Can Help To Minimize Attacks On Your Digital Assets
  42. How Can Big Data Improve Education
  43. A Closer Look Into The Future Big Data Ecosystem
  44. Data Lovers vs. Data Haters
  45. How The Internet Of Things Will Create A Smart World
  46. 8 Industries That Could Benefit From Big Data
  47. Five Steps To Data-Driven Marketing
  48. Big Data is Big Business in Banking
  49. The body as a source of big data
  50. Is your data secure?
  51. The viability of big data
  52. A visualization of the world’s largest data breaches
  53. The long road to become a big data scientist
  54. The illustrious big data scientist
  55. What data do the five largest tech companies collect
  56. Getting sales and marketing aligned with big data
  57. Big Data in the Supply Chain
  58. The promise of big personalization
  59. Understanding the growing world of bytes
  60. The history of predictive analytics
  61. The Big Data Industry Atlas
  62. CIOs and Big Data
  63. How retailers can deal with big data
  64. The importance of Big Data Governance
  65. Big Data and the possibilities with Hadoop
  66. Big Data brings big benefits, but what are the costs of too much data?
  67. Giving employees access to big data has big potential
  68. Big Data will transform healthcare
  69. Five opportunities to effectively and efficiently analyze big data
  70. Challenging the traditional RDBMS Status Quo
  71. Big data is Big News
  72. Big Data Snapshot

View infographics at http://www.bigdata-startups.com/big-data-infographics/


Read more…

Guest blog post by Nilesh Jethwa

In this article we perform analytics on a huge dataset available from https://www.pwcmoneytree.com. PwC MoneyTree provides 20 years of venture capital investment data, from 1995 onward. Having data that goes this far back in history should give us enough to extract the necessary analytical juice out of it.

VC Investment in Billions from 1995 through 2014

The year 2000 was definitely the peak of VC investment craziness. A whopping $105 billion was pumped into startups, bringing them quickly to IPO. Since the crash of 2000, investment has never reached even half the mark of that year. Again, look at the big jump between 1999 and 2000, a real indicator of investment frenzy!

Let us dig deeper and see which industries were favoured by the VCs

VC investment by industry

We can see that the Software industry was the largest receiver of VC investment back in 2000, and it has been gaining attention since 2009, as indicated by the widening mouth.

Let us get more specific and see how the investment pattern has changed between the peak of 2000 and the current time, 2014

Change in investment pattern between 2000 and 2014

Notice that the high-flying targets of 2000, namely Telecommunications and Networking, receive almost zero attention today. Back in 2000, all the wireless and networking companies were investing heavily to build today's network. The software industry is producing and benefiting more, and is reaping the investments that were made during the early 2000 period.

Now compare the bars for the "Biotechnology" industry: it is the only industry where investment has reached its 2000 level. 

Times have changed, and so have the investment favorites in 2014

VC favourites for 2014

Finally let us compare the number of deals happening between today and 2000

 

During the peak of 2000, as many as 8,000 deals were registered; compare that to 2014, which stands at roughly half that value, about 4,000 deals. 

 

Here is a quick tool to compare the VC investment favourites by year; just select the year

 

Read more…

Shooting stars

Guest blog post by Vincent Granville

This is a follow up to our video series From chaos to clusters, made with data points moving over time to form clusters, and produced with open source and home-made data science algorithms.

See below two frames from the new video, now featuring line segments connecting a current point to its location in the previous frame. These line segments are overwritten and change constantly from iteration to iteration, creating a "shooting stars" visual effect when you watch the video.

Towards the end of the video, the clusters are well formed (though they are also moving, especially the one in the bottom right corner) and points coming from outside are progressively attracted to the nearest cluster: you can see them quickly getting close and then being absorbed. 

Here are the two new videos:

Download the data file rfile3.txt used to produce these videos (also available in compressed format). These videos are based on the following R script (a more complex version of our initial R script):

R Source code

vv<-read.table("c:/vincentg/rfile3.txt",header=TRUE);
iter<-vv$iter;
for (n in 1:199) {
  x<-vv$x[iter == n];
  y<-vv$y[iter == n];
  z<-vv$new[iter == n];
  u<-vv$d2init[iter == n];
  v<-vv$d2last[iter == n];
  p<-vv$x[iter == n-1];
  q<-vv$y[iter == n-1];
  u[u>1]<-1;
  v[v>0.10]<-0.10;
  s=1/sqrt(1+n);
  if (n==1) {
    plot(p,q,xlim=c(-0.08,1.08),ylim=c(-0.08,1.09),pch=20,cex=0,col=rgb(1,1,0),xlab="",ylab="",axes=TRUE  );
  }

  points(p,q,col=rgb(1-s,1-s,1-s),pch=20,cex=1);
  segments(p,q,x,y,col=rgb(0,0,1));
  points(x,y,col=rgb(z,0,0),pch=20,cex=1);
  Sys.sleep(5*s);
  segments(p,q,x,y,col=rgb(1,1,1));
}

segments(p,q,x,y,col=rgb(0,0,1)); # arrows segments
points(x,y,col=rgb(z,0,0),pch=20,cex=1);


Read more…

Originally posted on AnalyticBridge.com by Dr. Vincent Granville

Here I provide the mathematics, explanations and source code to produce the data and moving clusters in the From chaos to clusters video series.

A little bit of history on how the project started:

  1. Interest in astronomy, visualization and how physics models apply to business problems
  2. Research on how urban growth could be modeled by the gravitational law
  3. Interest in systems that produce clusters (as well as birth and death processes) and in visualizing cluster formation with videos rather than charts
  4. Creating art: videos with sound and images synchronized, both generated using data (coming soon). Maybe I'll be able to turn your business data into a movie (either artistic or insightful, or both)! I'm already at the point where I can produce the video frames faster than they are delivered on the streaming device. I call it FRT, for faster than real time.

What is a statistical model without a model?

There's actually a generic mathematical model behind the algorithm. But nobody cares about the model; the algorithm was created first, without a mathematical model in mind. Initially, I had a gravitational model in mind, but I eventually abandoned it as it was not producing what I expected.

This illustrates a new trend in data science: we care less and less about modeling, but more and more about results. My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want - be it a simulation of a Neyman-Scott cluster process, or a simulation of some no-name stochastic process.

It's a bit similar to how modern rock climbing has evolved: focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to rock climbing on Mars in the future.

You can fine tune the parameters to

  1. Achieve best fit between simulated data and real business (or other data), using traditional goodness-of-fit testing and sensitivity analysis. Note that the simulated data represents a realization (an instance for object-oriented people) of a spatio-temporal stochastic process.
  2. Once the parameters are calibrated, perform predictions (if you speak statistician language) or extrapolations (if you speak mathematician language).

So how does the algorithm work?

It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points get attracted to each other (attraction is stronger to closest neighbors) and thus over time, they group into clusters.

The algorithm has the following components:

  1. Creation of n random fixed points (n=100) on [-0.5, 1.5] x [-0.5, 1.5]. This window is 4 times bigger than the one containing the mobile points, to eliminate edge effects impacting the mobile points. These fixed points (they never move) also act as some sort of dark matter: they are invisible, they are not represented in the video, but they are the glue that prevents the whole system from collapsing onto itself and converging to a single point.
  2. Creation of m random mobile points (m=500) on [0,1] x [0,1].
  3. Main loop (200 iterations). At each iteration, we compute the distance d between each mobile point (x,y) and each of its m-1 mobile neighbors and n fixed neighbors. A weight w is computed as a function of d, with a special weight for the point (x,y) itself. Then the updated (x,y) is the weighted sum aggregated over all points, and we do that for each point (x,y) at each iteration. The weight is such that the sum of weights over all points is always 1. In other words, we replace each point with a convex linear combination of all points (a small R sketch of one such update follows below).
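
The following is a hedged R sketch of one such iteration; the weight function 1/(1+d)^2 and the self_weight handling are illustrative choices, not the author's actual formulas.

# hedged sketch of one iteration of the update described above;
# the weight function and self_weight are illustrative, not the author's actual choices
update_points = function(mob, fixed, self_weight = 0.5) {
  m = nrow(mob)
  all_pts = rbind(mob, fixed)   # mobile points plus fixed "dark matter" points
  new_mob = mob
  for (k in 1:m) {
    d = sqrt((all_pts[,1] - mob[k,1])^2 + (all_pts[,2] - mob[k,2])^2)
    w = 1 / (1 + d)^2                                    # closer neighbors get larger weights
    w[k] = self_weight * sum(w[-k]) / (1 - self_weight)  # special weight for the point itself
    w = w / sum(w)                                       # weights sum to 1: convex combination
    new_mob[k,] = c(sum(w * all_pts[,1]), sum(w * all_pts[,2]))
  }
  new_mob
}

# example: m = 500 mobile points in [0,1] x [0,1], n = 100 fixed points in [-0.5,1.5] x [-0.5,1.5]
set.seed(1)
mobile = cbind(runif(500), runif(500))
fixed = cbind(runif(100, -0.5, 1.5), runif(100, -0.5, 1.5))
mobile = update_points(mobile, fixed)   # one iteration; the video runs about 200 of these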

Special features

  • If the weight for (x,y) [the point being updated] is very high at a given iteration, then (x,y) will barely move.
  • We have tested negative weights (especially for the point being updated) and we liked the results better. A delicate amount of negative weights also further prevents the system from collapsing and introduces a bit of chaos.
  • Occasionally, one point is replaced by a brand new, random point, rather than updated using the weighted sum of neighbors. We call this event a "birth". It happens for less than 1% of all point updates, and it happens more frequently at the beginning. Of course, you can play with these parameters.

In the source code, the birth process (for point $k) is simply encoded as:

if (rand()<0.1/(1+$iteration)) { # birth and death
  $tmp_x[$k]=rand();
  $tmp_y[$k]=rand();
  $rebirth[$k]=1;
}

In the source code, in the inner loop over $k, the point ($x,$y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). Also, in a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point when computing the weighted average formula used to update ($x, $y). The distance d is computed using the function distance, which accepts four arguments ($x, $y, $p, $q) and returns $weight, the weight w.

Click here to view source code.


Read more…
