Subscribe to our Newsletter

Featured Posts (202)

Sort by

Guest blog post by Denis Rasulev

An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.

Images are clickable to open hi-res versions.



Original post covers a lot more details and for those who want to pursue more analysis on their own: everything in the post - the data, software, and code - is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.

Here is link to the original post: link

Read more…

Principal Component Analysis using R

Guest blog post by suresh kumar gorakala

Curse of Dimensionality:

One of the most commonly faced problems while dealing with data analytics problem such as recommendation engines, text analytics is high-dimensional and sparse data. At many times, we face a situation where we have a large set of features and fewer data points, or we have data with very high feature vectors. In such scenarios, fitting a model to the dataset, results in lower predictive power of the model. This scenario is often termed as the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, often reduces the effects of the curse of dimensionality. 

In this blog, we will discuss about principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in data of high dimension.

Principal component analysis:

Consider below scenario:

The data, we want to work with, is in the form of a matrix A of mXn dimension, shown as below, where Ai,j represents the value of the i-th observation of the j-th variable.


Thus the N members of the matrix can be identified with the M rows, each variable corresponding to N-dimensional vectors. If N is very large it is often desirable to reduce the number of variables to a smaller number of variables, say k variables as in the image below, while losing as little information as possible. 

Mathematically spoken, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

The algorithm when applied linearly transforms m-dimensional input space to n-dimensional (n < m) output space, with the objective to minimize the amount of information/variance lost by discarding (m-n) dimensions. PCA allows us to discard the variables/features that have less variance.

Technically speaking, PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component.

In the above image, u1 & u2 are principal components wherein u1 accounts for highest variance in the dataset and u2 accounts for next highest variance and is orthogonal to u1.

PCA implementation in R:

For today’s post we use crimtab dataset available in R. Data of 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", 9.5" ...) correspond to midpoints of intervals of finger lengths whereas the 22 column names ("142.24", "144.78"...) correspond to (body) heights of 3000 criminals, see also below. 

142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.8 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9.9 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
182.88 185.42 187.96 190.5 193.04 195.58
9.4 0 0 0 0 0 0
9.5 0 0 0 0 0 0
9.6 0 0 0 0 0 0
9.7 0 0 0 0 0 0
9.8 0 0 0 0 0 0
9.9 0 0 0 0 0 0
[1] 42 22
'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

[1] 3000

[1] "142.24" "144.78" "147.32" "149.86" "152.4" "154.94" "157.48" "160.02" "162.56" "165.1" "167.64" "170.18" "172.72" "175.26" "177.8" "180.34"
[17] "182.88" "185.42" "187.96" "190.5" "193.04" "195.58"

let us use apply() to the crimtab dataset row wise to calculate the variance to see how each variable is varying.


We observe that column “165.1” contains maximum variance in the data. Applying PCA using prcomp(). 

pca =prcomp(crimtab)

Note: the resultant components of pca object from the above code corresponds to the standard deviations and Rotation. From the above standard deviations we can observe that the 1st PCA explained most of the variation, followed by other pcas’. Rotation contains the principal component loadings matrix values which explains /proportion of each variable along each principal component.

Let’s plot all the principal components and see how the variance is accounted with each component.

par(mar = rep(2, 4)) plot(pca) 

Clearly the first principal component accounts for maximum information.
Let us interpret the results of pca using biplot graph. Biplot is used to show the proportions of each variable along the two principal components.

#below code changes the directions of the biplot, if we donot include the below two lines the plot will be mirror image to the below one. pca$rotation=-pca$rotation pca$x=-pca$x biplot (pca , scale =0) 

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which represent how the feature space varies along the principal component vectors.

From the plot, we can see that the first principal component vector, PC1, more or less places equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than the 160.02 and 162.56 features. 

In the second principal component, PC2 places more weight on 160.02, 162.56 than the 3 features, "165.1, 167.64, and 170.18" which are less correlated with them. 

Complete Code for PCA implementation in R: 

data("crimtab") #load data
head(crimtab) #show sample data
dim(crimtab) #check dimensions
str(crimtab) #show structure of the data
apply(crimtab,2,var) #check the variance accross the variables
pca =prcomp(crimtab) #applying principal component analysis on crimtab data
par(mar = rep(2, 4)) #plot to show variable importance
'below code changes the directions of the biplot, if we donot include
the below two lines the plot will be mirror image to the below one.'
biplot (pca , scale =0) #plot pca components using biplot in r
view rawPCA using R hosted with ❤ by GitHub

So by now we understood how to run the PCA, and how to interpret the principal components, where do we go from here? How do we apply the reduced variable dataset? In our next post we shall answer the above questions.

Originally posted here

Read more…

Guest blog post by Jean Villedieu

Following the Mediator scandal, France adopted in 2011 a Sunshine Act. For the first time we have data on the presents and contracts awarded to health care professionals by pharmaceutical companies. Can we use graph visualization to understand these dangerous ties?

Dangerous ties

Pharmaceutical companies in France and in other countries use presents and contracts to influence the prescriptions of health care professionals. This has posed ethical problems in the past.

In France, 21 persons are currently prosecuted for their role in the Mediator scandal, a drug that was recently banned. Some of them are accused of having helped the drug manufacturer obtain an authorization to sell its drug and later fight its ban in exchange for money.

In the US, GlaxoSmithKline was condemned to pay $3 billion in the largest health-care fraud settlement in US history. Before the settlement, GlaxoSmithKline paid various experts to fraudulently market the benefits of its drugs.

Such problems arose in part because of a lack of transparency in the ties between pharmaceutical companies and health-care professionals. With open data now available can we change this?

Moving the data to Neo4j

Regards Citoyens, a French NGO, parsed various sources to build the first database documenting the financial relationships between health care providers and pharmaceutical manufacturers.

That database covers a period from January 2012 to June 2014. It contains 495 951 health care professionals (doctors, dentists, nurses, midwives, pharmacists) and 894 pharmaceutical companies. The contracts and presents represent a total of 244 572 645 €.

The original data can be found on the Regards Citoyens website.

The data is stored in one large CSV file. We are going to use graph visualization to understand the network formed by the financial relationships between pharmaceutical companies and health care professionals.

First we need to move the data into a Neo4j graph database: view rawsunshine_import.cql

Now the data is stored in Neo4j as a graph (download it here). It can be searched, explored and visualized through Linkurious.

Unfortunately, names in the data have been anonymized by Regards Citoyens following pressure from the CNIL (the French Commission nationale de l’informatique et des libertés).

Who is Sanofi giving money to?

Let’s start our data exploration with Sanofi, the French biggest pharmaceutical company. If we search Sanofi through Linkurious we can see that it is connected to 57 765 professionals. Let’s focus on the 20 Sanofi’s contacts who have the most connections.

sanofi's connections

Sanofi’s top 20 connections.


Among these entities there are 19 doctors in general medicine and one student. We can quickly grasp which professions Sanofi is targeting by coloring the health care professionals according to their profession:

19 doctors among Sanofi's top 20 connections

19 doctors among Sanofi’s top 20 connections.

In a click, we can filter the visualization to focus on the doctors. We are now going to color them according to their region of origin.

Region of origin of Sanofi's 19 doctors

Region of origin of Sanofi’s 19 doctors.

Indirectly, the health care professionals Sanofi connects to via presents also tell us about its competitors. Let’s look at who else has given presents to the health care professionals befriended by Sanofi.

sanofi's competitors network

Sanofi’s contacts (highlighted in red) are also in touch with other pharmaceutical companies.


Zooming in, we can see Sanofi is at the center of a very dense network next to Bristol-Myers Quibb, Pierre Gabre, Lilly or Astrazeneca for example. According to the Sunshine dataset, Sanofi’s is competing with these companies.

We can also see an interesting node. It is a student who has received presents from 104 pharmaceutical companies including companies that are not direct competitors of Sanofi.

A successful student

A successful student.

Why has he received so much attention? Unfortunately all we have is an ID (02b0d3726458ef46682389f2ac7dc7af).

Sanofi could identify the professionals its competitors have targeted and perhaps target them too in the future.

Who has received the most money from pharmaceutical companies in France?

Neo4j includes a graph query language called Cypher. Through Cypher we can compute complex graph queries and get results in seconds.

We can for example identify the doctor who has received the most money from pharmaceutical companies:

//Doctor who has received the most money

The doctor behind the ID 2d92eb1e795f7f538556c59e48aaa7c1 has received 77 480€ from 6 pharmaceutical companies.

wealthy doctor

The relationships are colored according to the money they represent. St Jude Medical has over 70 231€ to Dr 2d92eb1e795f7f538556c59e48aaa7c1.

Perhaps next time they receive a prescription from Dr 2d92eb1e795f7f538556c59e48aaa7c1, his patients would like to know about his relationship with St Jude Medical. Unfortunately today the Sunshine data is anonymous.

We can also find the most generous pharmaceutical company.

//Company which has distributed the most money
RETURN a, sum(r.totalDECL) as total

Novartis Pharma has awarded 12 595 760€ to various entities.

top 5 novartis

The 5 entities receiving the most money from Novartis.


When we look closer, we can see that the 5 entities which have received the most money from Novartis Pharma are 5 NGOs.

24f3287da6ab125862249416bc91f9c4 has received 75 000€

24f3287da6ab125862249416bc91f9c4 has received 75 000€.

Come meet us at GraphConnect in London, the biggest graph event in Europe. It is sponsored by Linkurious and you can use “Linkurious30″ to register and get a 30%discount!

The Sunshine dataset offers a rare glimpse into the practice of pharmaceutical companies and how they use money to influence the behavior of health care professionals. Unfortunately for citizens looking for transparency, the data is anonymized. Perhaps it will change in the future?

Read more…

Guest blog post by Laetitia Van Cauwenberge

This article focuses on cases such as Facebook and protein interaction networks. The article was written by By Paul Scherer (paulmorio) and submitted as a research paper to HackCambridge. What makes this article interesting is the fact that it compares five clustering techniques for this type of problems:

  • K Clique Percolation - A clique merging algorithm. Given a set kk, the algorithm goes on to produce kk clique clusters and merge them (percolate) as necessary.
  • MCode - seed growth approach to finding dense subgraphs
  • DP Clustering - seed growth approach to finding dense subgraphs similar to MCODE but has an internal representation of weights in the edges, and the stopiing condition is different.
  • IPCA - Modified DPClus Algorithm which focuses on maintaining the diameter of a cluster (defined as the maximum shortest distance between all pairs of vertices, rather than its density.
  • CoAch - Combined Approach with finding a small number of cliques as complexes first and then growing them.

The articles also provides great visualizations such as the one below:

In the original article, these visualizations are interactive, and you will find out which software was used to produce them.

Below is the summary (written by the original author):


For my submission to HackCambridge I wanted to spend my 24 hours learning something new in accordance with my interests. I was recently introduced to protein interaction networks in my Bioinfomartics class, and during my review of machine learning techniques for an exam noticed that we study many supervised methods, but no unsupervised methods other than the k means clustering. Thus I decided to combine the two interests by clustering the Protein interaction networks with unsupervised clustering techniques and communicate my learning, results, and visualisations using the Beaker notebook.

The study of protein-protein interactions (PPIs) determined by high-throughput experimental techniques has created karge sets of interaction data and a new need for methods allowing us to discover new information about biological function. These interactions can be thought of as a large-scale network, with nodes representing proteins and edges signifying an interaction between two proteins. In a PPI network, we can potentially find protein complexes or functional modules as densely connected subgraphs. A protein complex is a group of proteins that interact with each other at the same time and place creating a quaternary structure. Functional modules are composed of proteins that bind each other at different times and places and are involved in the same cellular process. Various graph clustering algorithms have been applied to PPI networks to detect protein complexes or functional modules, including several designed specifically for PPI network analysis. A select few of the most famous and recent topographical clustering algorithms were implemented based on descriptions from papers, and applied to PPI networks. Upon completion it was recognized that it is possible to apply these to other interaction networks like friend groups on social networks, site maps, or transportation networks to name a few.

I decided to Graphistry's GPU cluster to visualize the large networks with the kind permission of Dr. Meyerovich. (Otherwise I would have likely not finished on time given the specs of my machine) and communicate my results and learning process

The full version with mathematical formulas, detailed descriptions, and source code, can be found here. For more articles about clustering, click here. This link will give you access to the following articles:

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Big Data Insights - IT Support Log Analysis

Guest blog post by Pradeep Mavuluri

This post brings forth to the audience, few glimpses (strictly) of insights that were obtained from a case of how predictive analytic's helped a fortune 1000 client to unlock the value in their huge log files of the IT Support system. Going to quick background, a large organization was interested in value added insights (actionable ones) from thousands of records logged in the past, as they saw both expense increase at no higher productivity.


As, most of us know in these business scenarios end-users will be much interested in out-of-knowledge, strange and unusual things that may not be captured from regular reports. Hence, here data scientist job not only ends at finding un-routine insights, but, also needs to do a deeper dig for its root cause and suggest best possible actions for immediate remedy (knowledge of domain or other best practices in industry will help a lot). Further, as mentioned earlier, only few of those has been shown/discussed here and all the analysis has been carried out using R Programming Language components viz., R-3.2.2RStudio (favorite IDE)ggplot2 package for plotting.

The first graph (below one) is a time series calendar heat map adopted from Paul Bleicher, shows us the number of tickets raised day-wise over every week of each month for the last year (green and its light shades represent less numbers, where as red and its shades represent higher numbers).


Herein, if one carefully observe the above graph, it will be very evident for us that, except for the month of April &amp; December, all other months have sudden increase in the number of tickets raised over last Saturday's and Sunday's; and this was more clearly visible at Quarter ends of March, June, September (also at November which is not a Quarter end). One can think of this as unusual behavior as numbers raising at non-working days. Before, going into further details, lets also look at one more graph (below), which depicts solved duration in minutes on x-axis and their respective time taken through a horizontal time line plot.

The above solved duration plot show us that out of all records analyzed 71.87% belong to "Request for Information" category and they have been solved within few minutes of tickets raised (that's why we cannot see a line plot for this category as compared to others). So, what's happened here actually was a kind of spoof, because of lack of automation in their systems. In simple words, it was found that there doesn't exists a proper documentation/guidance for many of applications they were using; such situation was taken as advantage for increasing the number of tickets (i.e. nothing but, pushing for more tickets even for basic information in the month ends and quarter ends, which resulted in month end openings which in turn forced them to close immediately). Discussed one here is one of those among many which has been presented with possible immediate remedies which can be easily actionable.

Visual Summarization:


Original Post

Read more…

Guest blog post by Petr Travkin

Part 1. Business scenarios.

I have spent many hours planning and executing in-company self-service BI implementation. This enabled me to gain several insights. Now that the ideas became mature enough and field-proven, I believe they are worth sharing. No matter how far you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt of describing feasible scenarios would provide a decent foundation.

All scenarios presume that IT plays its main role by owning the infrastructure, managing scalability, data security, and governance. I tried to elaborate every aspect of possible solution leaving behind all marketing claims of the vendor.

Scenario 1. Tableau Desktop + departmental/cross-functional data schemas.

This scenario involves gaining insights by data analysts on a daily basis. They might be either independent individuals or a team. Business users’ interaction with published workbooks is applicable, but limited to simple filtering.

User categories: professional data analysts;

Technical skills: intermediate/advanced SQL, intermediate/advanced Tableau;

Tableau training: 2-3 days full time (preferably) or continuous self-learning from scratch;

Licenses: Tableau Desktop.


  • Pure self-service BI approach with no IT involved in data analysis;
  • Vast range of data available for analysis with almost no limits;
  • Fast response for complex ad-hoc business problems.


  • Requires highly skilled data analysts;
  • Most likely involves Tableau training on query performance optimisation on a particular data source (e.g. Vertica).


  • Create a “sandbox” that allows data analysts to query and collaborate on their own and without supervision. Further promotion of workbooks to production is welcome.

Scenario 2. Tableau Desktop + custom data marts.

In this scenario, business users are fully in charge of data analysis. IT provides custom data marts.

User categories: business users, line-managers;

Technical skills: basic SQL, basic/intermediate Tableau;

Tableau training: two or three 2-3h sessions + ad-hoc support on daily basis;

Licenses: Tableau Desktop + Server Interactors.


  • Easy access to data for ad-hoc analysis;
  • Self-answering critical business questions;
  • Self-publishing for further ad-hoc access across multiple devices.


  • Adding any data involves IT support;
  • Requires elaborated data dictionaries.


  • Make requirements gathering a collaborative and iterative process with regular communication. That would ensure well-timed data delivery and quality;
  • Deliver training in 2-3 wisely structured sections with 2-3 week breaks for business users to have time for playing with software, along with generating needs for the new skills.
  • Focus on reach visualisations, not tables.

Scenario 3. Tableau Server Web Edit + workbook templates

This scenario fully relies on data models published by data analysts and powerful Web Edit features of Tableau Server.

User categories: line-managers, top managers;

Technical skills: Tableau basics;

Tableau training: one 30 min demo session + ad-hoc support;

License: Server Interactor.


  • No special training;
  • Fast Tableau adoption with basic, but powerful Self-service BI capabilities (Web Edit);
  • Thin client access via any Desktop Web Browser;
  • Could serve as a foundation for self-service BI adoption among C-Suite.


  • High level of accuracy for data preparation and template development;
  • Any changes in the data model require development and republishing of a template.


  • Try to select the most proactive and “data hungry” line manager or executive, who could help to spread the word;
  • Investigate analytical needs, ensure availability of a subject matter expert;
  • Start with simple visualisations, but be ready to increase complexity;
  • Provide as much ad-hoc assistance as you can.

In my next post, I would like to throw light on some technical aspects and limitations of each scenario.

I highly appreciate any comments and looking forward to know about your experience.

Read more…

Guest blog post by Eduardo Siman

Fortune 500 companies are investing staggering amounts into data visualization. Many have opted for Tableau, Qlik, MicroStrategy, etc. but some have created their own in HTML5, full stack JavaScript, Python, and R. Leading CIOs and CTOs are obsessed with being the first adopters in whatever is next in data visualization. 

The next frontier in data visualization is clearly immersive experiences. The 2014 paper "Immersive and Collaborative Data Visualization Using Virtual Reality Platforms" written by CalTech astronomers is a staggeringly large step in the right direction. In fact, I am shocked that 1 year later I have not seen a commercial application of this technology. You can read it here:

The key theme that I hear at technology conferences lately is the need to focus on analytics, visualization and data exploration. The advent of big data systems such as Hadoop and Spark has made it

Picture Source: VR 2015 IEEE Virtual Reality International Conference 

possible - for the first time ever - to store Petabytes of data on commodity hardware and process this data, as needed, in a fault tolerant and incredibly quick fashion. Many of us fail to understand the full implications of this inflection point in the history of computing. 

Storage is decreasing in cost every year, to the point where you can now have multiple GB on a USB drive that 10 years ago you could only store a few MBs. Gigabit internet is being installed in cities all over the world. Spark uses the concept of in memory distributed computation to perform at 10X map reduce for gigantic datasets and is already being used in production by Fortune 50 companies. Tableau, Qlik, MicroStrategy, Domo, etc. have gained tremendous market share as companies that have implemented Hadoop components such as HDFS, Hbase, Hive, Pig, and Map Reduce are starting to wonder "How I can I visualize that data?" 

Now think about VR - probably the hottest field in technology at this moment. It has been more than a year since Facebook bought Oculus for 2Billion and we have seen Google Cardboard burst onto the scene. Applications from media companies like the NY Times are already becoming part of our every day lives. This month at the CES show in Las Vegas, dozens of companies were showcasing virtual reality platforms that improve on the state of the art and allow for a motion-sickness free immersive experience. 

All of this combines into my primary hypothesis - this is a great time to start a company that would provide the capability for immersive data visualization environments to businesses and consumers. I personally believe that businesses and government agencies would be the first to fully engage in this space on the data side, but there is clearly an opportunity in gaming on the consumer side.

Personally, I have been so taken by the potential of this idea that I wrote a post in this blog about the “feeling” of being in one of these immersive VR worlds.

The post describes what it would be like to experience data with not only vision, but touch and sound and even smell. 

Just think about the possibilities of examining streaming data sets, that currently are being analyzed with tools such as Storm, Kafka, Flink, and Spark Streaming as a river flowing under you!

The strength of the water can describe the speed of the data intake, or any other variable that is represented by a flow - stock market prices come to mind. 

The possibilities for immersive data experiences are absolutely astonishing. The CalTech astronomers have already taken the first step in that direction, and perhaps there is a company out there that is already taking the next step. That being said, if this sounds like an exciting venture to you, DM me on twitter @Namenode5 and we can talk. 

Read more…

Big data landscape 2016 - Infographic

Guest blog post by Laetitia Van Cauwenberge

Great infographic about the big data / analytics / data science / deep learning / BI ecosystem. Created by @Mattturk, @Jimrhao and @firstmarkcap. Click on the image to zoom in. 

This infographics features the following components:

  • Infrastructure
  • Analytics
  • Applications
  • Cross-Infrastructure/Analytics
  • Open Source
  • Data Sources & APIs

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

You cannot argue that the marketing landscape has changed drastically over the years - gone are the days of throwing different strategies on the wall to see which one would stick. Instead, companies today are equipped with enough information about their customers and tactics thanks to the big data boom of recent years. In fact, according to Bob Evans, senior vice-president of communications for Oracle, large companies that have not put together complete plans for managing big data expect to lose $71.2 million every year. This is just one example of the extreme impact the big data is having on businesses of all sizes across all industries.

Data is everywhere, from phones and personal computers to NASA database. Some of it may be used for business - some of it may not. In fact, almost 80% of information gained by an organization is unstructured and remains unused without the right software. Fortunately, with recent software boom, small businesses can reap the benefits of the excessive amounts of online and offline information.

Even though most big data conversations concern companies with resources to hire experts and research firms, for those who know where to look, there are several ways for SMBs to gather, analyze and make sense of the information they already have. Here you have a few solutions that could help small companies make better business decisions and compete with larger enterprises in the ever-evolving marketplace.


The tools you are probably already using provide you with a rich source of information. InsightSquared is a sales performance analytics tool, designed to prevent you from wasting your time, mining and analyzing your own data using one spreadsheet after another. It connects to popular business solutions, such as Salesforce, Google Analytics and QuickBooks to automatically gather and extract the information you need.


If you currently do not have any rich data sources, conducting research may be the right thing for you. Qualtrics software lets you conduct a wide range of surveys and studies to gain better insights to guide your decision-making. It offers you customer, employee and market insights in real-time. In addition, you have mobile surveys, online samples and academic research.  


If you want to create dashboards and analyze corporate data without the help of the IT department, here's a solution for you.  When it comes to next-generation smart data discovery, Panorama is a global leader. It is the first business intelligence software that uses Automated Insights and Social Decision Making to allow users to have insights with greater relevancy.


Once limited to companies with large resources, credit card transactions are full of unique and vital data. Customer intelligence company Tranzlogic, makes this info available to small and medium establishments, for a reasonable price. It provides the information that merchants can use to launch better marketing campaigns, measure sales performance and write better business plans.

The transition from an instinct-driven company to an analytics-driven company is one SMB owners everywhere must embrace. New solutions are coming to the market every day, and thankfully, more and more of them are being created for the specific needs of smaller organizations. Finding the right IT solutions can help make it practical and inexpensive to benefit from the opportunity big data affords.

Originally posted on Data Science Central

Read more…

Originally written by Nigel Higgs on LinkedIn Pulse.

We who have been in the data sphere a while and in and around Data Governance will have seen the pitch-decks, watched the webinars, read the blogs and attended the conferences.  Some of us will have hired the staff, taken sage advice from expensive consultants and kicked off programmes to get the organisation up the Data governance maturity curve. It's almost like a religion, Data Governance is so clearly the answer why can't everybody in the organisation see it? It's a no-brainer. Unfortunately and speaking as a Data Governance practitioner for far too many years I can honestly say that I have yet to see a fully functioning enterprise-wide Data Governance implementation. Look, I appreciate that could be down to my incompetence, but I know this is not an isolated or unique sentiment. Lots of peers, colleagues and people far smarter than me have been preaching the benefits of data administration, data architecture, data governance, or whatever it will be called next, for many years and yet many of them struggle to come up with success stories. In fact when pressed they often don't have any!

So why so much denial? Einstein is reputed to have said something along the lines of 'the definition of insanity is to keep doing the same thing and expect the outcome to be different'. It is also reputed to be the most wrongly attributed and quoted platitude on the planet! But hey this is a LinkedIn post and like most of my writings nobody will read it.

What's that got to do with Data Governance? Well, 'Outside In Data Governance' is about approaching the problem from a different angle. There is little doubt the problem Data Governance is trying to solve is very real. Very few organisations know what data they have got, what it means, where it is, who is responsible for it or what its quality is?

But how to solve the problem? What I typically hear is that you need to write a policy, form committees, define processes, assign roles and then everything will be working like clockwork within months - data governed, quality data delivered to users and the organisation flying up the data maturity curve. But is that what happens, does the story painted in the pitch-decks come into reality? Sadly, it very rarely if ever does.

What is needed is a value driven approach. Start with who are we doing a Data Governance approach for? We are doing it for the business users. Then ask what are they interested in? They are interested in something that makes their lives easier right now. So 'Outside In Data Governance' starts with a single business report and works back from there. Answer those fundamental questions (what, where, who and how good?) about the fields and outputs on the report and make that knowledge accessible. You could do this with a simple Excel based approach or maybe a Wiki or SharePoint; but pretty soon you will need some tooling to really make it scalable and responsive to increasing demands for more reports to be included in the scope. There are ways to do this in a 'proof of concept' environment and demonstrate the benefits before committing to spend. A friend of mine is fond of saying it’s easier to ask for forgiveness that for permission'. In this case he is right. There are browser based tools that sit outside your firewall and can offer this try before you buy approach. 

This is what a value driven and lean approach is all about. If what you do in this small scale doesn't get traction then what makes you think a £250k project will end up any better? Start small, ensure you get honest feedback from users at every iteration of your solution and focus on delivering value. If you bring the data users with you then they will demand the capability is extended. Beat the Einstein quote and start from the 'Outside In'.

Originally posted on Data Science Central

Read more…

Guest blog post by Eduardo Siman

You are driving down the highway. As your gaze moves between cars and trees and open sky a filtering thought hits the periphery of your conscious mind. It feels fresh yet somehow part of a thought pattern you’ve had before. A new version, perhaps, of a past obsession. Or a new obsession, emerging from years of exploration in seemingly unrelated fields. How many of us can identify with this feeling of being gently entranced by a sequence of thoughts that had seemingly faded into the past? Perhaps they encourage us to call an old friend who has been out of touch, or to dig into a textbook from yesteryear. 

I suppose you could call it nostalgia, but that word has a connotation of sadness and loss that does not properly fit here. We don’t understand much about how the brain truly functions, let alone the intricacies of consciousness. But perhaps we can think of these ebbing and waning thought cycles as waves, even curves, on a two dimensional plane that I would like to call a Mind Map.

In discussing this abstraction with my friend @openmylab in Sydney, we have been drawing 10 year Mind Maps of our own brains in order to identify areas of intellectual passion as well as intersections between seemingly disparate curves that have become inflection points in our lives. In both cases we started with a few rules to make the map visibly appealing and uncluttered. First, it would only have 3 mind curves - that is 3 major thought patterns or areas of engagement. Second, these curves had to be somewhat sinusoidal in that they couldn’t be everyday thoughts, but rather major epochs of intellectual pursuit that reached a peak, declined and once again reemergence in a surprising way. In doing this exercise we learned quite a bit about our own mind and the territory it has charted over the last ten years. 

Let’s start with @openmylab’s mind map: 

 Focus on curves 2 and 4 in the year 2000. We see a clear focus on learning programming languages and operating systems in 2 - some front end web development, some UNIX, a bit of Java, even dreamweaver and the ubiquitous SQL and C#. Now look at 4 - here is a clear statistical computing route, perhaps it would be called data science now, with all the usual suspects - Python, R, SPSS, Matlab. Between the two curves we have the makings of a rare gem of a software engineer. Seems clear enough right? This person should develop statistical apps and deploy them on front end environments? Maybe. 

But don’t forget these are SEPARATE mind curves. They don’t represent an encapsulated and refined goal or objective. These are distinct areas of interest and if they intersect and yield a common product -that’s great - but certainly not required. Tracing these curves we see how they transform into current areas of interest: quantum computing and the Internet of things. It would have been quite difficult to trace the origins of these current intellectual pursuits without tracing their mental predecessors.

 Now let’s explore my own mind map, keeping in mind it’s points of commonality and distinction from @openmylab’s map:

My map starts with two core academic pursuits: general relativity/quantum mechanics at the top and analytic number theory at the bottom. In addition there is another curve that starts out weak but gains steam as the years go by. This last curve is technology and all of it’s fascinating applications. So how does this play out? The first intersection occurs in 2007 - here is where I discover that back when Einstein and Montgomery where roaming the halls of the Institute of Advanced Studies in Princeton, a chance encounter led to an unexpected revelation. It turns out that the zeros of the zeta function bear a striking similarity to the eigenvalues of a random hermitician matrix. Did I lose you? I did. 

Ok let’s move ahead. By 2012 my obsession with Riemann intersects with my technological pursuits and I discover computational number theory. Unfortunately computers cannot prove the Riemann hypothesis, which is almost certainly true. But they could Dis-prove it! (In the very unlikely case that it isn’t true) And finally, as with @openmylab, there is a shift to applications of technology in the modern world - Big Data,IoT, fintech,machine learning. A possible point of intersection with the physics curve seems a few years away with the advent of quantum computing. 

 In both cases we see a current interest in so called “hot” or “trending” technologies with a long (10 to 15 year) history of predecessor interests and an amalgamation of distinct intellectual flows. Yet the two maps are quite different - one is the story of a software engineer who loves statistics and eventually finds himself enthralled in the world of robotics, data science, and quantum computing. The other is a physics nerd who realizes that computational methods can help bring concepts to life and dives into the visualization and practical application of abstract concepts. 

 And of course the most critical intersection, the one between the two mind maps, occurs in an area that isn’t even present on the maps: social media. It is on Twitter that the concept is exposed, developed, shared, refined, and discussed. Between Miami and Sydney, in real time. The mind map of the world has come along way to make this interaction possible. I encourage you to create your own mind map and explore the hidden mind curves of your intellectual past. If you feel comfortable doing so, please share s picture with me @namenode5 and with @openmylab on Twitter. Happy exploring!!

Read more…

Six categories of Data Scientists

We are now at 9 categories after a few updates. Just like there are a few categories of statisticians (biostatisticians, statisticians, econometricians, operations research specialists, actuaries) or business analysts (marketing-oriented, product-oriented, finance-oriented, etc.) we have different categories of data scientists. First, many data scientists have a job title different from data scientist, mine for instance is co-founder. Check the "related articles" section below to discover 400 potential job titles for data scientists.

Categories of data scientists

  • Those strong in statistics: they sometimes develop new statistical theories for big data, that even traditional statisticians are not aware of. They are expert in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling and other related techniques.
  • Those strong in mathematics: NSA (national security agency) or defense/military people working on big data, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) as they collect, analyse and extract value out of data.
  • Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API's, Analytics as a Service, optimization of data flows, data plumbing.
  • Those strong in machine learning / computer science (algorithms, computational complexity)
  • Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
  • Those strong in production code development, software engineering (they know a few programming languages)
  • Those strong in visualization
  • Those strong in GIS, spatial data, data modeled by graphs, graph databases
  • Those strong in a few of the above. After 20 years of experience across many industries, big and small companies (and lots of training), I'm strong both in stats, machine learning, business, mathematics and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separated (the silo mentality). Indeed, that's the very reason why data science was created.

Most of them are familiar or expert in big data. 

There are other ways to categorize data scientists, see for instance our article on Taxonomy of data scientists. A different categorization would be creative versus mundane. The "creative" category has a better future, as mundane can be outsourced (anything published in textbooks or on the web can be automated or outsourced - job security is based on how much you know that no one else know or can easily learn). Along the same lines, we have science users (those using science, that is, practitioners; often they do not have a PhD), innovators (those creating new science, called researchers), and hybrids. Most data scientists, like geologists helping predict earthquakes, or chemists designing new molecules for big pharma, are scientists, and they belong to the user category. 

Implications for other IT professionals

You (engineer, business analyst) probably do already a bit of data science work, and know already some of the stuff that some data scientists do. It might be easier than you think to become a data scientist. Check out our book (listed below in "related articles"), to find out what you already know, what you need to learn, to broaden your career prospects.

Are data scientists a threat to your job/career? Again, check our book (listed below) to find out what data scientists do, if the risk for you is serious (you = the business analyst, data engineer or statistician; risk = being replaced by
a data scientist who does everything) and find out how to mitigate the risk (learn some of the data scientist skills from our book, if you perceive data scientists as competitors)

Originally posted on Data Science Central

Related articles

Read more…

Guest blog post by Tony Agresta

Organizations are struggling with a fundamental challenge – there’s far more data than they can handle.  Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making but is this a reality for most companies?  The big data tidal wave is transforming the database management industry, employee skill sets, and business strategy as organizations race to unlock meaningful connections between disparate sources of data.

Graph Databases are rapidly gaining traction in the market as an effective method for deciphering meaning but many people outside the space are unsure of what exactly this entails. Generally speaking, graph databases store data in a graph structure where entities are connected through relationships to adjacent elements. The Web is a graph; also your friend-of-a-friend network and the road network are graphs.

The fact is, we all encounter the principles of graph databases in many aspects of our everyday lives, and this familiarity will only increase. Consider just a few examples:

  • Facebook, Twitter and other social networks all employ graphs for more specific, relevant search functionality.  Results are ranked and presented to us to help us discover things.
  • By 2020, it is predicted that the number of connected devices will reach nearly 75 billion globally. As the Internet of Things continues to grow, it is not the devices themselves that will dramatically change the ways in which we live and work, but the connections between these devices. Think healthcare, work productivity, entertainment, education and beyond.
  • There are over 40,000 Google searches processed every second. This results in 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Online search is ubiquitous in terms of information discovery. As people not only perform general Google searches, but search for content within specific websites, graph databases will be instrumental in driving more relevant, comprehensive results. This is game changing for online publishers, healthcare providers, pharma companies, government and financial services to name a few.
  • Many of the most popular online dating sites leverage graph database technology to cull through the massive amounts of personal information users share to determine the best romantic matches. Why is this?  Because relationships matter.

In the simplest terms, graph databases are all about relationships between data points. Think about the graphs we come across every day, whether in a business meeting or news report.   Graphs are often diagrams demonstrating and defining pieces of information in terms of their relations to other pieces of information.

Traditional relational databases can easily capture the relationship between two entities but when the object is to capture “many-to-many” relationships between multiple points of data, queries take a long time to execute and maintenance is quite challenging.  For instance, if you wanted to search for friends on many social networks that both attended the same university AND live in San Francisco AND share at least three mutual friends. Graph databases can execute these types of queries instantly with just a few lines of code or mouse clicks. The implications across industries are tremendous.

Graph databases are gaining in popularity for a variety of reasons.  Many are schema-less allowing you to manage your data more efficiently.   Many support a powerful query language, SPARQL. Some allow for simultaneous graph search and full-text search of content stores. Some exhibit enterprise resilience, replication and highly scalable simultaneous reads and writes.  And some have other very special features worthy of further discussion.

One specialized form of graph database is an RDF triplestore.  This may sound like a foreign language, but at the root of these databases are concepts familiar to all of us.    Consider the sentence, “Fido is a dog.” This sentence structure – subject-predicate-object – is how we speak naturally and is also how data is stored in a triplestore. Nearly all data can be expressed in this simple, atomic form.  Now let’s take this one step further.  Consider the sentence, “All dogs are mammals.” Many triplestores can reason just the way humans can.   They can come to the conclusion that “Fido is a mammal.” What just happened?  An RDF triplestore used its “reasoning engine” to infer a new fact.  These new facts can be useful in providing answers to queries such as “What types of mammals exist?”  In other words, the “knowledge base” was expanded with related, contextual information.    With so many organizations interested in producing new information products, this process of “inference” is a very important aspect of RDF triplestores.  But where do the original facts come from?

Since documents, articles, books and e-mails all contain free flowing text, imagine a technology where the text can be analyzed with results stored inside the RDF triplestore for later use.  Imagine a technology that can create the semantic triples for reuse later.  The breakthrough here is profound on many levels: 1) text mining can be tightly integrated with RDF triplestores to automatically create and store useful facts and 2) RDF triplestores not only manage those facts but they also “reason” and therefore extend the knowledge base using inference.

Why is this groundbreaking?  The full set of reasons extends beyond the scope of this article but here are some of the most important:

Your unstructured content is now discoverable allowing all types of users to quickly find the exact information for which they are searching.  This is a monumental breakthrough since so much of the data that organizations stockpile today exist as dark data repositories.

We said earlier that RDF triplestores are a type of graph database.  By their very nature, the triples stored inside the graph database (think “facts” in the form of subject-predicate-object) are connected. “Fido is a dog.  All dogs are mammals.  Mammals are warm blooded.  Mammals have different body temperatures, etc…”  The facts are linked.  These connections can be measured.   Some entities are more connected than others just like some web pages are more connected to other web pages.   Because of this, metrics can be used to rank the entries in a graph database. One of the most popular (and first) algorithms used at Google is “Page Rank” which counts the number and quality of links to a page – an important metric in assessing the importance of web page.   Similarly, facts inside a triplestore can be ranked to identify important interconnected entities with the most connected ordered first.   There are many ways to measure the entities but this is one very popular use case.

With billions of facts referencing connected entities inside a graph database, this information source can quickly become the foundation for knowledge discovery and knowledge management.  Today, organizations can structure their unstructured data, add additional free facts from Linked Open Data sets, combine all of this with a controlled vocabulary, thesauri, taxonomies or ontologies which, to one degree or another, are used to classify the stored entities and depict relationships.  Real knowledge is then surfaced from the results of queries, visual analysis of graphs or both.  Everything is indexed inside the triplestore.

Graph databases (and specialized versions called native RDF triplestores that embody reasoning power) show great promise in knowledge discovery, data management and analysis.   They reveal simplicity within complexity.  When combined with text mining, their value grows tremendously.   As the database ecosystem continues to grow, as more and more connections are formed, as unstructured data multiplies with fury, the need to analyze text and structure results inside graph databases is becoming an essential part of the database ecosystem.  Today, these combined technologies are available and not just reserved for the big search engines providers.  It may be time for you to consider how to better store, manage, query and analyze your own data.  Graph databases are the answer.

If there is interest, you can learn more about these approaches under the resources section of

Read more…

24 Uses of Statistical Modeling (Part I)

Guest blog post by Vincent Granville

Here we discuss general applications of statistical models, whether they arise from data science, operations research, engineering, machine learning or statistics. We do not discuss specific algorithms such as decision trees, logistic regression, Bayesian modeling, Markov models, data reduction or feature selection. Instead, I discuss frameworks - each one using its own types of techniques and algorithms - to solve real life problems.   

Most of the entries below are found in Wikipedia, and I have used a few definitions or extracts from the relevant Wikipedia articles, in addition to personal contributions.

Source for picture: click here

1. Spatial Models

Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial dependency leads to the spatial auto-correlation problem in statistics since, like temporal auto-correlation, this violates standard statistical techniques that assume independence among observations

2. Time Series

Methods for time series analyses may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and recently wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In time domain, correlation analyses can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in frequency domain.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.

Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.

3. Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Survival models are used by actuaries and statisticians, but also by marketers designing churn and user retention models.

Survival models are also used to predict time-to-event (time from becoming radicalized to turning into a terrorist, or time between when a gun is purchased and when it is used in a murder), or to model and predict decay (see section 4 in this article).

4. Market Segmentation

Market segmentation, also called customer profiling, is a marketing strategy which involves dividing a broad target market into subsets of consumers,businesses, or countries that have, or are perceived to have, common needs, interests, and priorities, and then designing and implementing strategies to target them. Market segmentation strategies are generally used to identify and further define the target customers, and provide supporting data for marketing plan elements such as positioning to achieve certain marketing plan objectives. Businesses may develop product differentiation strategies, or an undifferentiated approach, involving specific products or product lines depending on the specific demand and attributes of the target segment.

5. Recommendation Systems

Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item.

6. Association Rule Learning

Association rule learning is a method for discovering interesting relations between variables in large databases. For example, the rule { onions, potatoes } ==> { burger }  found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. In fraud detection, association rules are used to detect patterns associated with fraud. Linkage analysis is performed to identify additional fraud cases: if credit card transaction from user A was used to make a fraudulent purchase at store B, by analyzing all transactions from store B, we might find another user C with fraudulent activity. 

7. Attribution Modeling

An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Google Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. Macro-economic models use long-term, aggregated historical data to assign, for each sale or conversion, an attribution weight to a number of channels. These models are also used for advertising mix optimization.

8. Scoring

Scoring model is a special kind of predictive models. Predictive models can predict defaulting on loan payments, risk of accident, client churn or attrition, or chance of buying a good. Scoring models typically use a logarithmic scale (each additional 50 points in your score reducing the risk of defaulting by 50%), and are based on logistic regression and decision trees, or a combination of multiple algorithms. Scoring technology is typically applied to transactional data, sometimes in real time (credit card fraud detection, click fraud).

9. Predictive Modeling

Predictive modeling leverages statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects, after the crime has taken place. They may also used for weather forecasting, to predict stock market prices, or to predict sales, incorporating time series or spatial models. Neural networks, linear regression, decision trees and naive Bayes are some of the techniques used for predictive modeling. They are associated with creating a training set, cross-validation, and model fitting and selection.

Some predictive systems do not use statistical models, but are data-driven instead. See example here

10. Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Unlike supervised classification (below), clustering does not use training sets. Though there are some hybrid implementations, called semi-supervised learning.

11. Supervised Classification

Supervised classification, also called supervised learning, is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called label, class or category). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. 

Examples, with an emphasis on big data, can be found on DSC. Clustering algorithms are notoriously slow, though a very fast technique known as indexation or automated tagging will be described in Part II of this article.

12. Extreme Value Theory

Extreme value theory or extreme value analysis (EVA) is a branch of statistics dealing with the extreme deviations from the median of probability distributions. It seeks to assess, from a given ordered sample of a given random variable, the probability of events that are more extreme than any previously observed. For instance, floods that occur once every 10, 100, or 500 years. These models have been performing poorly recently, to predict catastrophic events, resulting in massive losses for insurance companies. I prefer Monte-Carlo simulations, especially if your training data is very large. This will be described in Part II of this article.

Click here to read Part II.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Analyse TB data using network analysis

Guest blog post by Tim Groot

Analyse TB data using network analysis


In a very interesting publication from Jose A. Dianes on tuberculosis (TB) cases per country it was shown that dimension reduction is achieved using Principal Component Analysis (PCA) and Cluster Analysis ( By showing that the first principal component corresponded mainly to the mean value of TB cases and the second mainly to the change over the used time span, it become clear that the first two PCA-components have a real physical meaning. This is not necessarily the case for PCA constructs an orthogonal basis, by making linear combinations of the original measurements, of which the eigen vectors are orederd in a decending order. Though, this method may not work with data having different types of variables. The scripts in this article are written in R.


Finding correlations in the time trend is a better way to monitor the correspondence between countries. Correlation shows similarities in the trend between countries and is sensitive to deviations from the main trend. Grouping countries based on similarities can give insight in the mechanism behind the trend and opens a way to find effective measures for the illness. Or a hidden measure may have a good causal relation but was not identified yet.

The necessary libraries to use are:

library(RCurl) # reading data

library(igraph) # network plot


Loading required data from and process existing cases file analogous to Jose A. Dianes.

existing_cases_file <-


existing_df <- read.csv(text = existing_cases_file, row.names=1, stringsAsFactor=F)

existing_df[c(1,2,3,4,5,6,15,16,17,18)] <-

  lapply( existing_df[c(1,2,3,4,5,6,15,16,17,18)],

          function(x) { as.integer(gsub(',', '', x) )})

countries <- rownames(existing_df)

meantb <- rowMeans(existing_df)

Create the link-table from the correlation matrix, filtered for the duplicates and the 1’s on the diagonal. The lower triangle function was used here.

cortb <- cor(t(existing_df))

cortb <- cortb*lower.tri(cortb)

links <- data.frame(NULL, ncol(3))

for(i in 1:length(countries)){

  links <- rbind(links,data.frame(countries, countries[i], cortb[,i], meantb,



names(links) <- c('c1','c2','cor','meanc1','meanc2')

links <- links[links$cor != 0,]

A network graph of this link-table will result in one uniform group because each country is still liked to all others.

g <-, directed=FALSE)


The trend is formed from a period of only 18 years. Correlation may therefore not be a strong function to separate the trends of the countries. For a longer span of years correlation will perform better as separator. The trends in this data are generally the same, they are decreasing. Therefore a high limit for the level of correlation is used (0.90).

The link-table is filtered for correlations larger than 0.9 and create a network graph.

links <- links[links$cor > 0.9,]


g <-, directed=FALSE)


fgc <- cluster_fast_greedy(g)


## [1] 5

The countries now appear to split-up into 5 groups, three large clusters and two small ones.


By plotting time-trends of the groups, a grouping in the trends is visible.

trendtb <-

for(group in 1:length(fgc)){

  sel <- trendtb[,as.character(unlist(fgc[group]))]

  plot(x=NA, y=NA , xlim=c(1990,2007), ylim=range(sel), main= paste("Group", group),

     xlab='Year', ylab = 'TB cases per 100K distribution')

  for(i in names(sel)){

    points(1990:2007,sel[,i],type='o', cex = .1, lty = "solid",

           col = which(i == names(sel)))


  if(group %in% c(4,5)) legend('topright', legend = names(sel), lwd=2,



In group 4 and 5 pretty particular trends are selected. Group 4 consist of countries with a maximum amount of TB-cases in the period 1996 to 2003 and the two countries in group 5 show a dramatic drop in TB-cases at 1996 which is followed by a large increase. This latter trend should be explained by meta data about the dataset.


## $`4`

## [1] "Lithuania" "Zambia"    "Belarus"   "Bulgaria"  "Estonia" 


## $`5`

## [1] "Namibia"  "Djibouti"

Group 4 consists of former USSR-countries, though, Zambia is an exception. This trend could be explained by social problems during the collapse of the USSR and for Zambia this trend should be explained by political changes too.

The range in TB-cases the first three graphs is too large to see the similarities within the groups. Dividing them with the mean gives a better view on the trend.

for(group in 1:3){

  sel <- trendtb[,as.character(unlist(fgc[group]))]

  selavg <- meantb[as.character(unlist(fgc[group]))]

  plot(x=NA, y=NA , xlim=c(1990,2007), ylim=range(t(sel)/selavg),

       main= paste("Group", group), xlab='Year',

       ylab = 'TB cases per 100K distribution')

  for(i in names(sel)){

    points(1990:2007,sel[,i]/selavg[i],type='o', cex = .1, lty = "solid",

           col= which(i == names(sel)))



Now the difference between group 1 and 3 better visible, group 1 groups countries tending towards sigmoid-trends while group 3 consist of countries with a more steady decay in TB-cases. Countries in group 2 show an increasing sigmoid-like trend.

In the groups western- and development-countries are mixed. For western countries the amount of TB-cases are low and one TB-case more or less may flip the trend, again a better separation will be found for a larger span in time.


## $`1`

##  [1] "Albania"                          "Argentina"                      

##  [3] "Bahrain"                          "Bangladesh"                     

##  [5] "Bosnia and Herzegovina"           "China"                          

##  [7] "Costa Rica"                       "Croatia"                        

##  [9] "Czech Republic"                   "Korea, Dem. Rep."               

## [11] "Egypt"                            "Finland"                        

## [13] "Germany"                          "Guinea-Bissau"                  

## [15] "Honduras"                         "Hungary"                        

## [17] "India"                            "Indonesia"                      

## [19] "Iran"                             "Japan"                           

## [21] "Kiribati"                         "Kuwait"                         

## [23] "Lebanon"                          "Libyan Arab Jamahiriya"         

## [25] "Myanmar"                          "Nepal"                          

## [27] "Pakistan"                         "Panama"                         

## [29] "Papua New Guinea"                 "Philippines"                    

## [31] "Poland"                           "Portugal"                       

## [33] "Puerto Rico"                      "Singapore"                      

## [35] "Slovakia"                         "Syrian Arab Republic"           

## [37] "Thailand"                         "Macedonia, FYR"                 

## [39] "Turkey"                           "Tuvalu"                         

## [41] "United States of America"         "Vanuatu"                        

## [43] "West Bank and Gaza"               "Yemen"                          

## [45] "Micronesia, Fed. Sts."            "Saint Vincent and the Grenadines"

## [47] "Viet Nam"                         "Eritrea"                        

## [49] "Jordan"                           "Tunisia"                        

## [51] "Monaco"                           "Niger"                          

## [53] "New Caledonia"                    "Guam"                           

## [55] "Timor-Leste"                      "Iraq"                           

## [57] "Mauritius"                        "Afghanistan"                     

## [59] "Australia"                        "Cape Verde"                     

## [61] "French Polynesia"                 "Malaysia"                       


## $`2`

##  [1] "Botswana"                 "Burundi"                

##  [3] "Cote d'Ivoire"            "Ethiopia"               

##  [5] "Guinea"                   "Rwanda"                 

##  [7] "Senegal"                  "Sierra Leone"           

##  [9] "Suriname"                 "Swaziland"              

## [11] "Tajikistan"               "Zimbabwe"               

## [13] "Azerbaijan"               "Georgia"                

## [15] "Kenya"                    "Kyrgyzstan"             

## [17] "Russian Federation"       "Ukraine"                

## [19] "Tanzania"                 "Moldova"                

## [21] "Burkina Faso"             "Congo, Dem. Rep."       

## [23] "Guyana"                   "Nigeria"                

## [25] "Chad"                     "Equatorial Guinea"      

## [27] "Mozambique"               "Uzbekistan"             

## [29] "Kazakhstan"               "Algeria"                

## [31] "Armenia"                  "Central African Republic"


## $`3`

##  [1] "Barbados"                 "Belgium"                 

##  [3] "Bermuda"                  "Bhutan"                 

##  [5] "Bolivia"                  "Brazil"                 

##  [7] "Cambodia"                 "Cayman Islands"         

##  [9] "Chile"                    "Colombia"               

## [11] "Comoros"                  "Cuba"                   

## [13] "Dominican Republic"       "Ecuador"                

## [15] "El Salvador"              "Fiji"                   

## [17] "France"                   "Greece"                 

## [19] "Haiti"                    "Israel"                 

## [21] "Laos"                     "Luxembourg"             

## [23] "Malta"                    "Mexico"                 

## [25] "Morocco"                  "Netherlands"            

## [27] "Netherlands Antilles"     "Nicaragua"              

## [29] "Norway"                   "Peru"                   

## [31] "San Marino"               "Sao Tome and Principe"  

## [33] "Slovenia"                 "Solomon Islands"        

## [35] "Somalia"                  "Spain"                  

## [37] "Switzerland"              "United Arab Emirates"   

## [39] "Uruguay"                  "Antigua and Barbuda"    

## [41] "Austria"                  "British Virgin Islands" 

## [43] "Canada"                   "Cyprus"                 

## [45] "Denmark"                  "Ghana"                  

## [47] "Guatemala"                "Ireland"                

## [49] "Italy"                    "Jamaica"                

## [51] "Mongolia"                 "Oman"                   

## [53] "Saint Lucia"              "Seychelles"             

## [55] "Turks and Caicos Islands" "Virgin Islands (U.S.)"  

## [57] "Venezuela"                "Maldives"               

## [59] "Trinidad and Tobago"      "Korea, Rep."            

## [61] "Andorra"                  "Anguilla"               

## [63] "Belize"                   "Mali"

Read more…

7 Ingredients for Great Visualizations

Great article by Bernard Marr. Here we present a summary. A link to the full article is provided below.

Source for picture: click here

1. Identify your target audience. 

2. Customize the data visualization. 

3. Give the data visualization a clear label or title. 

Source for picture: click here

4. Link the data visualization to your strategy. 

5. Choose your graphics wisely. 

6. Use headings to make the important points stand out. 

Source for picture: click here

7. Add a short narrative where appropriate.  

Click here to read full article. For other articles on visualization science and art, as well as best practices, click here.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Contributed by David Comfort. David took NYC Data Science Academy 12 week full time Data Science Bootcamp program between Sept 23 to Dec 18, 2015. The post was based on his first class project(due at 2nd week of the program).


Hans Rosling gave a famous TED talk in 2007, “The Best Stats You’ve Ever Seen”. Rosling is a Professor of International Health at the Karolinska Institutet in Stockholm, Sweden and founded the Gapminder Foundation.


To visualise his talk, he and his team at Gapminder developed animated bubble charts, also known as motion charts. Gapminder developed the Trendalzyer data visualization software, which was subsequently acquired by Google in 2007.

The Gapminder Foundation is a Swedish NGO which promotes sustainable global development by increased use and understanding of statistics about social, economic and environmental development.

The purpose of my data visualization project was to visualize data about long-term economic, social and health statistics. Specifically, I wanted to extract data sets from Gapminder using an R package, googlesheets, munge these data sets, and combine them into one dataframe, and then use the GoogleVis R package to visualize these data sets using a Google Motion chart.

The finished product can be viewed at this page and a screen capture demonstrating the features of the interactive data visualization is at:


World Bank Databank

World Bank Databank

2) Data sets used for Data Visualization

The following Gapminder datasets, several of which were adapted from the World Bank Databank, were accessed for data visualization:

  • Child mortality - The probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births.
  • Democracy score  - Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.
  • Income per person - Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). The inflation and differences in the cost of living between countries has been taken into account.
  • Life expectancy at birth - Life expectancy at birth (years) with projections. The average number of years a newborn child would live if current mortality patterns were to stay the same. Observations after 2010 are based on projections by the UN.
  • Population, Total - Population is available in Gapminder World in the indicator “population, total”, which contains observations between 1700 to 2012. It is also available in the indicator “population, with projections”, which also includes projections up to 2100 as well as data for several countries before 1700.
  • Country and Dependent Territories Lists with UN Regional Codes - These lists are the result of merging data from two sources, the Wikipedia ISO 3166-1 article for alpha and numeric country codes, and the UN Statistics site for countries' regional, and sub-regional codes.

3) Reading in the Gapminder data sets using Google Sheets R Package

  • The googlesheets R package allows one to access and manage Google spreadsheets from within R.
  • googlesheets Basic Usage (vignette)
  • Reference manual
  • The registration functions gs_title(), gs_key(), and gs_url() return a registered sheet as a googlesheets object, which is the first argument to practically every function in this package.

First, install and load googlesheets and dplyr, and access a Gapminder Google Sheet by the URL and get some information about the Google Sheet:

Note: Setting the parameter lookup=FALSE will block authenticated API requests.

A utility function, extract_key_from_url(), helps you get and store the key from a browser URL:

You can access the Google Sheet by key:

Once, one has registered a worksheet, then you can consume the data in a specific worksheet (“Data” in our case) within the Google Sheet using the gs_read() function (combining the statements with dplyr pipe and using check.names=FALSE so it deals with the integer column names correctly and doesn’t append an “x” to each column name):

You can also target specific cells via the range = argument. The simplest usage is to specify an Excel-like cell range, such asrange = “D12:F15” or range = “R1C12:R6C15”.

But a problem arises since check.names=FALSE does not work with this statement (problem with package?). So a workaround would be to pipe the data frame through the dplyr rename function:

However, for purposes, we will ingest the entire worksheet and not target by cells.

Let’s look at the data frame. We need to change the name of the first column from “GDP per capita” to “Country”.

We need to change the name of the first column from “GDP per capita” to “Country”.

Let’s download the rest of the datasets

4) Read in the Countries Data Set

We want to segment the countries in the data sets by region and sub-region. However, the Gapminder data sets do not include these variables. Therefore, one can download the ISO-3166-Countries-with-Regional-Codes data set from github which includes the ISO country code, country name, region, and sub-region.

Use rCurl to read in directly from Github and make sure you read in the “raw” file, rather than Github’s display version.

Note: The Gapminder data sets do not include ISO country codes, so I had to clean the countries data set with the corresponding country names used in the Gapminder data sets.

5) Reshaping the Datasets

We need to reshape the data frames. For the purposes of reshaping our data frames, we can divide the variables into two groups: identifier, or id, variables and measured variables. In our case, id variables include the Country and Years, whereas the measured variables are the GDP per capita, life expectancy, etc..

We can further abstract and “say there are only id variables and a value, where the id variables also identify what measured variable the value represents.”

For example, we could represent a data set, which has two id variables, subject and time:


where each row represents one observation of one variable. This operation is called melting (and can be achieved by using the melt function of the Reshape package).

Compared to the former table, the latter table has a new id variable “variable”, and a new column “value”, which represents the value of that observation. See the paper, Reshaping data with the reshape package, by Hadley Wickham, for more clarification.

We now have the data frame in a form in which there are only id variables and a value.

Let’s reshape the data frames ( child_mortality, democracy_score, life_expectancy, population):

6) Combining the Datasets

Whew, now we can finally all the datasets using a left_join:

7) What is GoogleVis

[caption id="attachment_8125" align="alignnone" width="690"]googlevis An overview of a GoogleVis Motion Chart[/caption]

8) Implementing GoogleVis

Implementing GoogleVis is fairly easy. The design of the visualization functions is fairly generic.

The name of the visualization function is gvis + ChartType. So for the Motion Chart we have:

9) Parameters for GoogleVis

  • data: a data.frame
  • idvar: the id variable , “Country” in our case.
  • timevar: the time variable for the plot, “Years” in our case.
  • xvar: column name of a numerical vector in data to be plotted on the x-axis.
  • yvar: column name of a numerical vector in data to be plotted on the y-axis.
  • colorvar: column name of data that identifies bubbles in the same series. We will use “Region” in our case.
  • sizevar - values in this column are mapped to actual pixel values using the sizeAxis option. We will use this for “Population”.

The output of a googleVis function (gvisMotionChart in our case) is a list of lists (a nested list) containing information about the chart type, chart id, and the html code in a sub-list split into header, chart, caption and footer.

10) Data Visualization

Let’s plot the GoogleVis Motion Chart.

[caption id="attachment_8127" align="alignnone" width="650"]Gapminder Data Visualization using GoogleVis Gapminder Data Visualization using GoogleVis[/caption]

Note: I had an issue with embedding the GoogleVis motion plot in a WordPress blog post, so I will subsequently feature the GoogleVis interactive motion plot on a separate page at


Here is a screen recording of the data visualization produced by GoogleVis of the Gapminder datasets:


11) What are the Key Lessons of the Gapminder Data Visualization Project?

  • Hans Rosling and Gapminder have made a big impact on data visualization and how data visualization can inform the public about wide misperceptions.
  • The googlesheets R package allows for easy extraction of data sets which are stored in Google Sheets.
  • The different steps involved in reshaping and joining multiple data sets can be a little cumbersome. We could have used dplyr pipes more.
  • It would be a good practice for Gapminder to include the ISO country code in each of their data sets.
  • There is a need for a data set which lists country names, their ISO codes, as well as other categorical information such as Geographic Regions, Income groups, Landlocked, G77, OECD, etc.
  • It is relatively easy to implement a GoogleVis motion chart using R. However, it is difficult to change the configuration options. For instance, I was unable to make a simple change to the chart by adding a chart title.
  • Google Motion Charts provide a great way to visualize several variables at once and be a great teaching tool for all sorts of data.
  • For instance, one can visualize the Great Divergence between the Western world and China, India or Japan, whereby the West had much faster economic growth, with attendant increases in life expectancy and other health indicators.



Originally posted on Data Science Central

Read more…

How to make any plot with ggplot2?


Guest blog post by Selva Prabhakaran

Ggplot2 is the most elegant and aesthetically pleasing graphics framework available in R. It has a nicely planned structure to it. This tutorial focusses on exposing this underlying structure you can use to make any ggplot. But, the way you make plots in ggplot2 is very different from base graphics making the learning curve steep. So leave what you know about base graphics behind and follow along. You are just 5 steps away from cracking the ggplot puzzle.


The distinctive feature of the ggplot2 framework is the way you make plots through adding ‘layers’. The process of making any ggplot is as follows.

1. The Setup

First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function, where df is a dataframe that contains all features needed to make the plot. This is the most basic step. Unlike base graphics, ggplot doesn’t take vectors as arguments.

Optionally you can add whatever aesthetics you want to apply to your ggplot (inside aes() argument) - such as X and Y axis by specifying the respective variables from the dataset. The variable based on which the color, size, shape and stroke should change can also be specified here itself. The aesthetics specified here will be inherited by all the geom layers you will add subsequently.

If you intend to add more layers later on, may be a bar chart on top of a line graph, you can specify the respective aesthetics when you add those layers.

Below, I show few examples of how to setup ggplot using in the diamonds dataset that comes with ggplot2 itself. However, no plot will be printed until you add the geom layers.

ggplot(diamonds) # if only the dataset is known. ggplot(diamonds, aes(x=carat)) # if only X-axis is known.
ggplot(diamonds, aes(x=carat, y=price)) # if both X and Y axes are fixed for all layers.ggplot(diamonds, aes(x=carat, color=cut)) # color will now vary based on `cut`

The aes argument stands for aesthetics. ggplot2 considers the X and Y axis of the plot to be aesthetics as well, along with color, size, shape, fill etc. If you want to have the color, size etc fixed (i.e. not vary based on a variable from the dataframe), you need to specify it outside the aes(), like this.

ggplot(diamonds, aes(x=carat), color="steelblue")

See this color palette for more colors.

2. The Layers

The layers in ggplot2 are also called ‘geoms’. Once the base setup is done, you can append the geoms one on top of the other. The documentation provides a compehensive list of all available geoms.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + geom_smooth() # Adding scatterplot geom (layer1) and smoothing geom (layer2).

We have added two layers (geoms) to this plot - the geom_point() and geom_smooth(). Since the X axis Y axis and the color were defined in ggplot() setup itself, these two layers inherited those aesthetics. Alternatively, you can specify those aesthetics inside the geom layer also as shown below.

ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price, color=cut)) # Same as above but specifying the aesthetics inside the geoms.

Notice the X and Y axis and how the color of the points vary based on the value of cut variable. The legend was automatically added. I would like to propose a change though. Instead of having multiple smoothing lines for each level of cut, I want to integrate them all under one line. How to do that? Removing the color aesthetic from geom_smooth()layer would accomplish that.

library(ggplot2) ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price)) # Remove color from geom_smooth

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut)) + geom_smooth() # same but simpler


Here is a quick challenge for you. Can you make the shape of the points vary with color feature?

Though setting up took us quite a bit of code, adding further complexity such as the layers, distinct color for each cut etc was easy. Imagine how much code you would have had to write if you were to make this in base graphics? Thanks to ggplot2!

# Answer to the challenge.
ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=color)) + geom_point()

3. The Labels

Now that you have drawn the main parts of the graph. You might want to add the plot’s main title and perhaps change the X and Y axis titles. This can be accomplished using the labs layer, meant for specifying the labels. However, manipulating the size, color of the labels is the job of the ‘Theme’.

library(ggplot2) gg <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + labs(title="Scatterplot", x="Carat", y="Price") # add axis labels and plot title. print(gg)


The plot’s main title is added and the X and Y axis labels capitalized.


Note: If you are showing a ggplot inside a function, you need to explicitly save it and then print using the print(gg), like we just did above.


4. The Theme


Almost everything is set, except that we want to increase the size of the labels and change the legend title. Adjusting the size of labels can be done using the theme() function by setting the plot.titleaxis.text.x and axis.text.y. They need to be specified inside the element_text(). If you want to remove any of them, set it to element_blank() and it will vanish entirely.


Adjusting the legend title is a bit tricky. If your legend is that of a color attribute and it varies based in a factor, you need to set the name using scale_color_discrete(), where the color part belongs to the color attribute and the discrete because the legend is based on a factor variable.

gg1 <- gg + theme(plot.title=element_text(size=30, face="bold"), axis.text.x=element_text(size=15), axis.text.y=element_text(size=15), axis.title.x=element_text(size=25), axis.title.y=element_text(size=25)) + scale_color_discrete(name="Cut of diamonds") # add title and axis text, change legend title.

print(gg1) # print the plot

If the legend shows a shape attribute based on a factor variable, you need to change it using scale_shape_discrete(name="legend title"). Had it been a continuous variable,  (name="legend title") instead.

So now, Can you guess the function to use if your legend is based on a fill attribute on a continuous variable?

The answer is scale_fill_continuous(name="legend title").

5. The Facets

In the previous chart, you had the scatterplot for all different values of cut plotted in the same chart. What if you want one chart for one cut?

gg1 + facet_wrap( ~ cut, ncol=3) # columns defined by 'cut'

facet_wrap(formula) takes in a formula as the argument. The item on the RHS corresponds to the column. The item on the LHS defines the rows.

gg1 + facet_wrap(color ~ cut) # row: color, column: cut


In facet_wrap, the scales of the X and Y axis are fixed to accomodate all points by default. This would make comparison of attributes meaningful because they would be in the same scale. However, it is possible to make the scales roam free making the charts look more evenly distributed by setting the argument scales=free.

gg1 + facet_wrap(color ~ cut, scales="free") # row: color, column: cut

For comparison purposes, you can put all the plots in a grid as well using facet_grid(formula).

gg1 + facet_grid(color ~ cut) # In a grid

Note, the headers for individual plots are gone leaving more space for plotting area.

This post was originally published at The full post is available in this ggplot2 tutorial.
Read more…

Exploring the VW scandal with graph analysis

Guest blog post by Zygimantas Jacikevicius

Our software partners Linkurious were excited to dive in to the VW emissions scandal case we have prepared and documented their findings in this great article, keep reading or visit the source of this article Here


The German car manufacturer admitted cheating emissions tests. Which companies are impacted directly or indirectly by these revelations? James Phare from Data To Value  used graph analysis and open source information to unravel the impact of the VW scandal on its customers, partners and shareholders.

 The VW scandal makes headlines


Since September 18 when VW admitted to modifying its cars to disguise their true emissions performances, the news has made headlines worldwide. In all this noise, it’s difficult to piece together the important information necessary to assess the event’s impact.

Our partner Data To Value helps organisations turn data into their most valuable asset. Using a combination of semantic analysis and Linkurious, the Data To Value’s team were able to investigate the ramifications of the VW scandal and show how it impacts its customers, partners, suppliers and shareholders.

To do this Data To Value used open source intelligence (OSINT) from sources like newspapers, social media and blogs. Semantic analysis helps analyse that type of vast volume of unstructured or semi-structured data to extract:

  • relevant entities (persons, companies, locations, etc.)
  • sentiment (is a blog post sympathetic or aggressive?)
  • topics (is this article related to the production of apples or to the company Apple?)


The resulting data can be represented in various ways. For example, Data To Value built a visualization to show the evolution of the general sentiment toward VW during the crisis.


This approach is useful but doesn’t help us understand how various entities linked to VW (directly and indirectly) are impacted by the scandal. Traditional tools and techniques are ill adapted for that type of questions. That’s where graph analysis is useful. It helps see relationships which are hard to find using other techniques and other data structures.


In our case for example, Data to Value chose to represent the data collected about the VW crisis in the following graph model:

In the picture above we see how some of different entities are linked:

  • a company or an investor own shares of another company
  • an engine powers a car model
  • a car model is sold by (in the range of) a company
  • a company is a fleet buyer of another company
  • a supplier supplies a company

The graph model gives a highly structured view of the information. We can now use it to start asking questions.



Investigating the VW network to find hidden insights


First let’s start by visualizing the motors involved in the scandal and in which car models they are used.

We can see how a single motor can be used in many car models. The “R3 1^4l 66kW(90PS) 230Nm 4V TDI CR” motor for example is being used by the Rapid Scapeceback, Toledo, A1, A1 Sportback, Fabia, 250 OSA Polo, Fabia Combi, Polo, Ibiza Sport Tourer, Ibiza 3 Turer, Ibiza 5 Turer.

The Linkurious graph visualization helps us make sense of how the motors and car models are tied. Now that we understand this, we can add an extra set of relationships. The relationships between the car models and the manufacturers.

At this point, we can notice that a car like Skoda’s Superb is using 3 motors involved in the scandal.

Although most customers don’t associate Skoda’s cars with Volkswagen we can see that there are relationships between the two car manufacturers. Two other car manufacturers also appear in the data: Seat and Audi.

Let’s expand the relationships of our car manufacturers to reveal more information. A few new details emerge.

Various car rental companies are major customers of Volkswagen. Companies like Hertz, Europcar and Avis are buying cars manufactured by Volkswagen, Audi, Skoda and Seat.

Daimler and BMW, the other German car manufacturers use Robert Bosch, a VW supplier and a company also involved in the scandal. Although to date these companies have denied using the software illegally in the same way as VW.

Let’s filter the graph to focus on the shareholding relationships now.

Audi, Seat, Skoda and Volkswagen are all owned by Volkswagen AG. Investors in Volkswagen AG include hedge fund Blackrock and sovereign funds Qatar Investment Authority and Norwegian Investment Authority.



Simplifying investment research with graph analysis



Data to Value’s research in the VW scandal has applications both in the long only asset management space and also in the hedge funds space for both managing risks and also identifying investment opportunities or developing a graph-based trading strategy. For example, based on the insights uncovered, the shares of indirectly affected companies could be shorted in order to achieve returns. Although not directly exposed to the crisis, these companies are indirectly affected by the reputational impact, decline in profitability of VW and unplanned costs if the matter is not resolved effectively e.g. fleet buyers bearing modification costs themselves or seeing reduced demand for VW rentals.


“The same insights could also help a risk manager understand a portfolio or entire investment manager’s risk profile associated with systemic or counterparty risks. Graph technology is perfect for performing these types of network analysis” explains Phare.


Research processes are traditionally very manual and document driven. Analysts look at broker’s reports, newspapers, market data. This makes sense when the velocity of the information is low but doesn’t scale up to today’s high data volume. Techniques like machine learning and semantic analysis can automate a lot of this work and make the job of analysts easier. Combined with this kind of self-service graph analysis enabled by Linkurious this presents an innovative approach for investment managers.

Read more…

Featured Blog Posts - DSC

Webinar Series

Follow Us

@DataScienceCtrl | RSS Feeds