Guest blog post by Eduardo Siman

You are driving down the highway. As your gaze moves between cars and trees and open sky, a fleeting thought hits the periphery of your conscious mind. It feels fresh yet somehow part of a thought pattern you’ve had before. A new version, perhaps, of a past obsession. Or a new obsession, emerging from years of exploration in seemingly unrelated fields. How many of us can identify with this feeling of being gently entranced by a sequence of thoughts that had seemingly faded into the past? Perhaps they encourage us to call an old friend who has been out of touch, or to dig into a textbook from yesteryear.

I suppose you could call it nostalgia, but that word has a connotation of sadness and loss that does not properly fit here. We don’t understand much about how the brain truly functions, let alone the intricacies of consciousness. But perhaps we can think of these waxing and waning thought cycles as waves, even curves, on a two-dimensional plane that I would like to call a Mind Map.

In discussing this abstraction with my friend @openmylab in Sydney, we have been drawing 10-year Mind Maps of our own brains in order to identify areas of intellectual passion as well as intersections between seemingly disparate curves that have become inflection points in our lives. In both cases we started with a few rules to make the map visually appealing and uncluttered. First, it would only have 3 mind curves, that is, 3 major thought patterns or areas of engagement. Second, these curves had to be somewhat sinusoidal in that they couldn’t be everyday thoughts, but rather major epochs of intellectual pursuit that reached a peak, declined, and once again reemerged in a surprising way. In doing this exercise we learned quite a bit about our own minds and the territory they have charted over the last ten years.

Let’s start with @openmylab’s mind map: 

Focus on curves 2 and 4 in the year 2000. We see a clear focus on learning programming languages and operating systems in 2: some front-end web development, some UNIX, a bit of Java, even Dreamweaver, and the ubiquitous SQL and C#. Now look at 4. Here is a clear statistical computing route, perhaps it would be called data science now, with all the usual suspects: Python, R, SPSS, Matlab. Between the two curves we have the makings of a rare gem of a software engineer. Seems clear enough, right? This person should develop statistical apps and deploy them on front-end environments? Maybe.

But don’t forget these are SEPARATE mind curves. They don’t represent an encapsulated and refined goal or objective. These are distinct areas of interest, and if they intersect and yield a common product, that’s great, but it is certainly not required. Tracing these curves, we see how they transform into current areas of interest: quantum computing and the Internet of Things. It would have been quite difficult to trace the origins of these current intellectual pursuits without tracing their mental predecessors.

Now let’s explore my own mind map, keeping in mind its points of commonality and distinction from @openmylab’s map:

My map starts with two core academic pursuits: general relativity/quantum mechanics at the top and analytic number theory at the bottom. In addition there is another curve that starts out weak but gains steam as the years go by. This last curve is technology and all of its fascinating applications. So how does this play out? The first intersection occurs in 2007. Here is where I discover that back when Dyson and Montgomery were roaming the halls of the Institute for Advanced Study in Princeton, a chance encounter led to an unexpected revelation: the zeros of the zeta function bear a striking similarity to the eigenvalues of a random Hermitian matrix. Did I lose you? I did.

Ok, let’s move ahead. By 2012 my obsession with Riemann intersects with my technological pursuits and I discover computational number theory. Unfortunately, computers cannot prove the Riemann hypothesis, which is almost certainly true. But they could disprove it, in the very unlikely case that it is false! And finally, as with @openmylab, there is a shift to applications of technology in the modern world: Big Data, IoT, fintech, machine learning. A possible point of intersection with the physics curve seems a few years away, with the advent of quantum computing.

In both cases we see a current interest in so-called “hot” or “trending” technologies with a long (10 to 15 year) history of predecessor interests and an amalgamation of distinct intellectual flows. Yet the two maps are quite different. One is the story of a software engineer who loves statistics and eventually finds himself enthralled by the world of robotics, data science, and quantum computing. The other is a physics nerd who realizes that computational methods can help bring concepts to life and dives into the visualization and practical application of abstract concepts.

And of course the most critical intersection, the one between the two mind maps, occurs in an area that isn’t even present on the maps: social media. It is on Twitter that the concept is exposed, developed, shared, refined, and discussed, between Miami and Sydney, in real time. The mind map of the world has come a long way to make this interaction possible. I encourage you to create your own mind map and explore the hidden mind curves of your intellectual past. If you feel comfortable doing so, please share a picture with me @namenode5 and with @openmylab on Twitter. Happy exploring!


Six categories of Data Scientists

We are now at 9 categories after a few updates. Just like there are a few categories of statisticians (biostatisticians, statisticians, econometricians, operations research specialists, actuaries) or business analysts (marketing-oriented, product-oriented, finance-oriented, etc.) we have different categories of data scientists. First, many data scientists have a job title different from data scientist, mine for instance is co-founder. Check the "related articles" section below to discover 400 potential job titles for data scientists.

Categories of data scientists

  • Those strong in statistics: they sometimes develop new statistical theories for big data that even traditional statisticians are not aware of. They are experts in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, predictive modeling, and other related techniques.
  • Those strong in mathematics: NSA (national security agency) or defense/military people working on big data, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) as they collect, analyse and extract value out of data.
  • Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API's, Analytics as a Service, optimization of data flows, data plumbing.
  • Those strong in machine learning / computer science (algorithms, computational complexity)
  • Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
  • Those strong in production code development, software engineering (they know a few programming languages)
  • Those strong in visualization
  • Those strong in GIS, spatial data, data modeled by graphs, graph databases
  • Those strong in a few of the above. After 20 years of experience across many industries, big and small companies (and lots of training), I'm strong in stats, machine learning, business, and mathematics, and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separate (the silo mentality). Indeed, that's the very reason why data science was created.

Most of them are familiar with, or expert in, big data.

There are other ways to categorize data scientists; see for instance our article on the taxonomy of data scientists. A different categorization would be creative versus mundane. The "creative" category has a better future, as mundane work can be outsourced (anything published in textbooks or on the web can be automated or outsourced; job security is based on how much you know that no one else knows or can easily learn). Along the same lines, we have science users (those using science, that is, practitioners; often they do not have a PhD), innovators (those creating new science, called researchers), and hybrids. Most data scientists, like geologists helping predict earthquakes or chemists designing new molecules for big pharma, are scientists, and they belong to the user category.

Implications for other IT professionals

You (engineer, business analyst) probably already do a bit of data science work, and already know some of what data scientists do. It might be easier than you think to become a data scientist. Check out our book (listed below in "related articles") to find out what you already know and what you need to learn to broaden your career prospects.

Are data scientists a threat to your job/career? Again, check our book (listed below) to find out what data scientists do, whether the risk for you is serious (you = the business analyst, data engineer or statistician; risk = being replaced by a data scientist who does everything), and how to mitigate the risk (learn some of the data scientist skills from our book, if you perceive data scientists as competitors).

Originally posted on Data Science Central

Related articles


Guest blog post by Tony Agresta

Organizations are struggling with a fundamental challenge – there’s far more data than they can handle.  Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making but is this a reality for most companies?  The big data tidal wave is transforming the database management industry, employee skill sets, and business strategy as organizations race to unlock meaningful connections between disparate sources of data.

Graph Databases are rapidly gaining traction in the market as an effective method for deciphering meaning but many people outside the space are unsure of what exactly this entails. Generally speaking, graph databases store data in a graph structure where entities are connected through relationships to adjacent elements. The Web is a graph; also your friend-of-a-friend network and the road network are graphs.

The fact is, we all encounter the principles of graph databases in many aspects of our everyday lives, and this familiarity will only increase. Consider just a few examples:

  • Facebook, Twitter and other social networks all employ graphs for more specific, relevant search functionality.  Results are ranked and presented to us to help us discover things.
  • By 2020, it is predicted that the number of connected devices will reach nearly 75 billion globally. As the Internet of Things continues to grow, it is not the devices themselves that will dramatically change the ways in which we live and work, but the connections between these devices. Think healthcare, work productivity, entertainment, education and beyond.
  • There are over 40,000 Google searches processed every second. This results in 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Online search is ubiquitous in terms of information discovery. As people not only perform general Google searches, but search for content within specific websites, graph databases will be instrumental in driving more relevant, comprehensive results. This is game changing for online publishers, healthcare providers, pharma companies, government and financial services to name a few.
  • Many of the most popular online dating sites leverage graph database technology to cull through the massive amounts of personal information users share to determine the best romantic matches. Why is this?  Because relationships matter.

In the simplest terms, graph databases are all about relationships between data points. Think about the graphs we come across every day, whether in a business meeting or news report.   Graphs are often diagrams demonstrating and defining pieces of information in terms of their relations to other pieces of information.

Traditional relational databases can easily capture the relationship between two entities, but when the objective is to capture "many-to-many" relationships between multiple points of data, queries take a long time to execute and maintenance is quite challenging. For instance, suppose you wanted to search a social network for friends who attended the same university AND live in San Francisco AND share at least three mutual friends with you. Graph databases can execute these types of queries instantly, with just a few lines of code or mouse clicks. The implications across industries are tremendous.
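To make the "mutual friends" query concrete, here is a toy sketch in plain Python of the kind of traversal a graph database answers natively. The people, schools, and adjacency sets are invented purely for illustration:

```python
# Hypothetical friendship graph as adjacency sets, plus node attributes.
friends = {
    "ann": {"bob", "cat", "dan", "eve"},
    "bob": {"ann", "cat", "dan", "eve"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "bob", "cat"},
    "eve": {"ann", "bob"},
}
university = {"ann": "Stanford", "bob": "Stanford", "cat": "Stanford",
              "dan": "MIT", "eve": "Stanford"}
city = {"ann": "San Francisco", "bob": "San Francisco", "cat": "San Francisco",
        "dan": "San Francisco", "eve": "Oakland"}

def matches(person):
    """Friends of `person` who attended the same university, live in
    San Francisco, and share at least three mutual friends with `person`."""
    return {
        f for f in friends[person]
        if university[f] == university[person]
        and city[f] == "San Francisco"
        and len(friends[f] & friends[person]) >= 3
    }

result = matches("ann")  # only "bob" satisfies all three conditions here
```

In SQL this would require self-joins on a friendship table for every hop; on a graph structure it is a local neighborhood walk, which is why such queries stay fast as the data grows.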

Graph databases are gaining in popularity for a variety of reasons. Many are schema-less, allowing you to manage your data more efficiently. Many support a powerful query language, SPARQL. Some allow for simultaneous graph search and full-text search of content stores. Some exhibit enterprise resilience, replication, and highly scalable simultaneous reads and writes. And some have other very special features worthy of further discussion.

One specialized form of graph database is an RDF triplestore.  This may sound like a foreign language, but at the root of these databases are concepts familiar to all of us.    Consider the sentence, “Fido is a dog.” This sentence structure – subject-predicate-object – is how we speak naturally and is also how data is stored in a triplestore. Nearly all data can be expressed in this simple, atomic form.  Now let’s take this one step further.  Consider the sentence, “All dogs are mammals.” Many triplestores can reason just the way humans can.   They can come to the conclusion that “Fido is a mammal.” What just happened?  An RDF triplestore used its “reasoning engine” to infer a new fact.  These new facts can be useful in providing answers to queries such as “What types of mammals exist?”  In other words, the “knowledge base” was expanded with related, contextual information.    With so many organizations interested in producing new information products, this process of “inference” is a very important aspect of RDF triplestores.  But where do the original facts come from?
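The inference step described above can be sketched in a few lines. This is a toy forward-chaining loop over subject-predicate-object triples, not a real RDF reasoner; the predicate names `is_a` and `subclass_of` are stand-ins for the RDF/RDFS vocabulary:

```python
# Asserted facts, as (subject, predicate, object) triples.
triples = {
    ("Fido", "is_a", "Dog"),
    ("Dog", "subclass_of", "Mammal"),
    ("Mammal", "subclass_of", "Animal"),
}

def infer(triples):
    """Forward-chain two rules until no new facts appear:
    x is_a C  and  C subclass_of D   =>  x is_a D
    C subclass_of D and D subclass_of E  =>  C subclass_of E"""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in facts:
            for (s2, p2, o2) in facts:
                if p == "is_a" and p2 == "subclass_of" and o == s2:
                    new.add((s, "is_a", o2))
                if p == "subclass_of" and p2 == "subclass_of" and o == s2:
                    new.add((s, "subclass_of", o2))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

kb = infer(triples)
# ("Fido", "is_a", "Mammal") is now in kb, exactly the inferred fact
# from the "Fido is a dog" example above.
```

A production triplestore applies many more entailment rules and does this at scale, but the principle, deriving new triples from stored ones, is the same.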

Since documents, articles, books and e-mails all contain free flowing text, imagine a technology where the text can be analyzed with results stored inside the RDF triplestore for later use.  Imagine a technology that can create the semantic triples for reuse later.  The breakthrough here is profound on many levels: 1) text mining can be tightly integrated with RDF triplestores to automatically create and store useful facts and 2) RDF triplestores not only manage those facts but they also “reason” and therefore extend the knowledge base using inference.

Why is this groundbreaking?  The full set of reasons extends beyond the scope of this article but here are some of the most important:

Your unstructured content is now discoverable, allowing all types of users to quickly find the exact information for which they are searching. This is a monumental breakthrough, since so much of the data that organizations stockpile today exists as dark data repositories.

We said earlier that RDF triplestores are a type of graph database. By their very nature, the triples stored inside the graph database (think "facts" in the form of subject-predicate-object) are connected. "Fido is a dog. All dogs are mammals. Mammals are warm blooded. Mammals have different body temperatures, etc." The facts are linked. These connections can be measured. Some entities are more connected than others, just like some web pages are more connected to other web pages. Because of this, metrics can be used to rank the entries in a graph database. One of the most popular (and first) algorithms used at Google is "PageRank", which counts the number and quality of links to a page, an important metric in assessing the importance of a web page. Similarly, facts inside a triplestore can be ranked to identify important interconnected entities, with the most connected ordered first. There are many ways to measure the entities, but this is one very popular use case.
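As an illustration, here is a minimal PageRank-style power iteration over a tiny directed graph of entities. The graph, the damping factor of 0.85, and the dangling-node handling are conventional assumptions, not anything specific to a particular triplestore:

```python
# Illustrative entity graph: node -> set of nodes it links to.
links = {
    "Fido":   {"Dog"},
    "Dog":    {"Mammal"},
    "Cat":    {"Mammal"},
    "Mammal": {"Animal"},
    "Animal": set(),
}

def pagerank(links, d=0.85, iters=50):
    """Simple power iteration; dangling nodes spread their rank uniformly."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        dangling = sum(rank[u] for u in nodes if not links[u]) / n
        new = {}
        for v in nodes:
            incoming = sum(rank[u] / len(links[u]) for u in nodes if v in links[u])
            new[v] = (1 - d) / n + d * (incoming + dangling)
        rank = new
    return rank

r = pagerank(links)
# "Mammal" collects links from both Dog and Cat, so it outranks either one.
```

The same idea, rank by number and quality of incoming connections, carries over directly to ranking entities inside a triplestore.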

With billions of facts referencing connected entities inside a graph database, this information source can quickly become the foundation for knowledge discovery and knowledge management.  Today, organizations can structure their unstructured data, add additional free facts from Linked Open Data sets, combine all of this with a controlled vocabulary, thesauri, taxonomies or ontologies which, to one degree or another, are used to classify the stored entities and depict relationships.  Real knowledge is then surfaced from the results of queries, visual analysis of graphs or both.  Everything is indexed inside the triplestore.

Graph databases (and specialized versions called native RDF triplestores that embody reasoning power) show great promise in knowledge discovery, data management and analysis.   They reveal simplicity within complexity.  When combined with text mining, their value grows tremendously.   As the database ecosystem continues to grow, as more and more connections are formed, as unstructured data multiplies with fury, the need to analyze text and structure results inside graph databases is becoming an essential part of the database ecosystem.  Today, these combined technologies are available and not just reserved for the big search engines providers.  It may be time for you to consider how to better store, manage, query and analyze your own data.  Graph databases are the answer.

If there is interest, you can learn more about these approaches under the resources section of


24 Uses of Statistical Modeling (Part I)

Guest blog post by Vincent Granville

Here we discuss general applications of statistical models, whether they arise from data science, operations research, engineering, machine learning or statistics. We do not discuss specific algorithms such as decision trees, logistic regression, Bayesian modeling, Markov models, data reduction or feature selection. Instead, I discuss frameworks - each one using its own types of techniques and algorithms - to solve real life problems.   

Most of the entries below are found in Wikipedia, and I have used a few definitions or extracts from the relevant Wikipedia articles, in addition to personal contributions.


1. Spatial Models

Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial dependency leads to the spatial auto-correlation problem in statistics since, like temporal auto-correlation, it violates standard statistical techniques that assume independence among observations.

2. Time Series

Methods for time series analyses may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and recently wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In time domain, correlation analyses can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in frequency domain.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.

Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.
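As a concrete instance of the time-domain methods mentioned above, the sample autocorrelation at lag k can be computed directly. A small pure-Python sketch, with a made-up series that repeats roughly every six steps:

```python
def autocorr(x, k):
    """Sample autocorrelation of series x at lag k."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k))
    return cov / var

series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2]  # rough period-6 cycle
# autocorr(series, 6) is strongly positive (the cycle repeats),
# while autocorr(series, 3) is negative (peaks align with troughs).
```

Frequency-domain methods such as spectral analysis expose the same periodicity as a peak in the spectrum rather than in the correlogram.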

3. Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Survival models are used by actuaries and statisticians, but also by marketers designing churn and user retention models.

Survival models are also used to predict time-to-event (time from becoming radicalized to turning into a terrorist, or time between when a gun is purchased and when it is used in a murder), or to model and predict decay (see section 4 in this article).
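The workhorse nonparametric estimate behind many of these questions is the Kaplan-Meier survival curve. Here is a minimal sketch of how it is computed; the durations and censoring flags are invented for illustration:

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each observed event time.
    events[i] = 1 means the event (death/failure) was observed at
    durations[i]; 0 means the subject was right-censored then."""
    s = 1.0
    curve = []
    times = sorted(set(t for t, e in zip(durations, events) if e == 1))
    for t in times:
        at_risk = sum(1 for d in durations if d >= t)          # still under observation
        died = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        s *= 1 - died / at_risk                                # survival drops at each event
        curve.append((t, s))
    return curve

# Six subjects: events at t=2, 3, 5, 8; censoring at t=3 and t=8.
curve = kaplan_meier([2, 3, 3, 5, 8, 8], [1, 1, 0, 1, 1, 0])
```

Censored subjects count toward the at-risk set until they drop out but never as deaths, which is exactly why survival analysis cannot be replaced by a plain proportion.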

4. Market Segmentation

Market segmentation, also called customer profiling, is a marketing strategy which involves dividing a broad target market into subsets of consumers,businesses, or countries that have, or are perceived to have, common needs, interests, and priorities, and then designing and implementing strategies to target them. Market segmentation strategies are generally used to identify and further define the target customers, and provide supporting data for marketing plan elements such as positioning to achieve certain marketing plan objectives. Businesses may develop product differentiation strategies, or an undifferentiated approach, involving specific products or product lines depending on the specific demand and attributes of the target segment.

5. Recommendation Systems

Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the 'rating' or 'preference' that a user would give to an item.

6. Association Rule Learning

Association rule learning is a method for discovering interesting relations between variables in large databases. For example, the rule { onions, potatoes } ==> { burger } found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. In fraud detection, association rules are used to detect patterns associated with fraud. Linkage analysis is performed to identify additional fraud cases: if a credit card transaction from user A was used to make a fraudulent purchase at store B, then by analyzing all transactions from store B we might find another user C with fraudulent activity.
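The two standard measures behind a rule like { onions, potatoes } ==> { burger } are support and confidence. A minimal sketch with invented baskets:

```python
# Hypothetical market baskets.
baskets = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "beer"},
]

def support(itemset):
    """Fraction of baskets containing every item in itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket)."""
    return support(lhs | rhs) / support(lhs)

sup = support({"onions", "potatoes", "burger"})        # 2 of 5 baskets -> 0.4
conf = confidence({"onions", "potatoes"}, {"burger"})  # 2 of the 3 lhs baskets -> 2/3
```

Algorithms such as Apriori simply enumerate itemsets whose support and confidence clear chosen thresholds.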

7. Attribution Modeling

An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Google Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. Macro-economic models use long-term, aggregated historical data to assign, for each sale or conversion, an attribution weight to a number of channels. These models are also used for advertising mix optimization.

8. Scoring

A scoring model is a special kind of predictive model. Predictive models can predict defaulting on loan payments, risk of accident, client churn or attrition, or the chance of buying a good. Scoring models typically use a logarithmic scale (each additional 50 points in your score reducing the risk of defaulting by 50%), and are based on logistic regression and decision trees, or a combination of multiple algorithms. Scoring technology is typically applied to transactional data, sometimes in real time (credit card fraud detection, click fraud).
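The logarithmic scale described above can be made concrete. Credit scorecards conventionally halve the *odds* of default for every fixed number of points; the anchor score and base odds below are arbitrary assumptions chosen just to show the mechanics:

```python
import math

BASE_SCORE = 600      # assumed anchor score...
BASE_ODDS = 1 / 20    # ...mapped to assumed 1:20 odds of default
PDO = 50              # "points to double the odds": +50 points halves default odds

def default_odds(score):
    """Odds of default on the logarithmic score scale."""
    return BASE_ODDS * 2 ** (-(score - BASE_SCORE) / PDO)

def default_probability(score):
    odds = default_odds(score)
    return odds / (1 + odds)

# A score of 650 carries half the default odds of 600, and 700 a quarter,
# mirroring the "each additional 50 points" rule of thumb in the text.
```

This is just the logistic-regression log-odds output rescaled into points, which is why the text ties scoring to logistic regression.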

9. Predictive Modeling

Predictive modeling leverages statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects after the crime has taken place. They may also be used for weather forecasting, to predict stock market prices, or to predict sales, incorporating time series or spatial models. Neural networks, linear regression, decision trees and naive Bayes are some of the techniques used for predictive modeling. They are associated with creating a training set, cross-validation, and model fitting and selection.

Some predictive systems do not use statistical models, but are data-driven instead. See example here

10. Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Unlike supervised classification (below), clustering does not use training sets, though there are some hybrid approaches, called semi-supervised learning.
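As a self-contained illustration of the clustering idea, here is a compact one-dimensional k-means (Lloyd's algorithm) with k=2; the data points and the crude initialization are invented:

```python
def kmeans_1d(xs, k=2, iters=20):
    """Toy 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its group; repeat."""
    # Crude init: pick spread-out points from the sorted data.
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

# Two well-separated clumps; no labels are supplied, the grouping emerges.
centers, groups = kmeans_1d([1.0, 1.2, 0.8, 9.9, 10.1, 10.0])
```

Note that nothing here resembles a training set: the algorithm discovers structure from similarity alone, which is the defining contrast with supervised classification below.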

11. Supervised Classification

Supervised classification, also called supervised learning, is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called label, class or category). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. 

Examples, with an emphasis on big data, can be found on DSC. Clustering algorithms are notoriously slow, though a very fast technique known as indexation or automated tagging will be described in Part II of this article.

12. Extreme Value Theory

Extreme value theory or extreme value analysis (EVA) is a branch of statistics dealing with the extreme deviations from the median of probability distributions. It seeks to assess, from a given ordered sample of a given random variable, the probability of events that are more extreme than any previously observed: for instance, floods that occur once every 10, 100, or 500 years. These models have recently performed poorly at predicting catastrophic events, resulting in massive losses for insurance companies. I prefer Monte-Carlo simulations, especially if your training data is very large. This will be described in Part II of this article.
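One way to read the Monte-Carlo preference stated above is resampling: estimate the chance that an annual maximum exceeds a threshold by repeatedly assembling synthetic years from observed data. The sketch below is a hedged illustration of that general idea, not the author's specific method; the "observed" levels, thresholds, and trial counts are all invented:

```python
import random

random.seed(42)
# Stand-in for a large training set of observed daily water levels.
observed_daily_levels = [random.gauss(100, 15) for _ in range(5000)]

def p_annual_max_exceeds(threshold, days=365, trials=2000):
    """Monte-Carlo estimate of P(max of a resampled 'year' > threshold)."""
    hits = 0
    for _ in range(trials):
        year = random.choices(observed_daily_levels, k=days)  # bootstrap a year
        if max(year) > threshold:
            hits += 1
    return hits / trials

# e.g. p_annual_max_exceeds(160) estimates how often a "160-level" year occurs.
```

A known limitation, which parametric EVT addresses, is that plain resampling can never produce a value more extreme than anything in the observed data.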

Click here to read Part II.


Analyse TB data using network analysis

Guest blog post by Tim Groot

Analyse TB data using network analysis


In a very interesting publication by Jose A. Dianes on tuberculosis (TB) cases per country, it was shown that dimension reduction can be achieved using Principal Component Analysis (PCA) and cluster analysis. By showing that the first principal component corresponds mainly to the mean number of TB cases, and the second mainly to the change over the time span used, it became clear that the first two PCA components have a real physical meaning. This is not necessarily the case, since PCA merely constructs an orthogonal basis from linear combinations of the original measurements, with the eigenvectors ordered by descending eigenvalue. Moreover, this method may not work with data having different types of variables. The scripts in this article are written in R.


Finding correlations in the time trend is a better way to monitor the correspondence between countries. Correlation shows similarities in the trend between countries and is sensitive to deviations from the main trend. Grouping countries based on similarities can give insight into the mechanism behind the trend and opens a way to find effective measures against the illness. A hidden factor may also have a strong causal relation but not yet have been identified.

The necessary libraries to use are:

library(RCurl) # reading data

library(igraph) # network plot


Load the required data and process the existing-cases file, analogous to Jose A. Dianes.

existing_cases_file <- getURL("...")  # data URL omitted in the original post

existing_df <- read.csv(text = existing_cases_file, row.names = 1, stringsAsFactors = FALSE)

existing_df[c(1,2,3,4,5,6,15,16,17,18)] <-
  lapply(existing_df[c(1,2,3,4,5,6,15,16,17,18)],
         function(x) { as.integer(gsub(',', '', x)) })

countries <- rownames(existing_df)

meantb <- rowMeans(existing_df)

Create the link-table from the correlation matrix, filtered for the duplicates and the 1’s on the diagonal. The lower triangle function was used here.

cortb <- cor(t(existing_df))

cortb <- cortb * lower.tri(cortb)

links <- data.frame()
for(i in 1:length(countries)){
  links <- rbind(links, data.frame(countries, countries[i], cortb[,i],
                                   meantb, meantb[i]))
}

names(links) <- c('c1','c2','cor','meanc1','meanc2')

links <- links[links$cor != 0,]

A network graph of this link-table would result in one uniform group, because each country is still linked to all the others.

g <-, directed = FALSE)


The trend covers a period of only 18 years, so correlation may not be a strong function to separate the trends of the countries; for a longer span of years correlation would perform better as a separator. The trends in this data are generally the same: they are decreasing. Therefore a high limit for the level of correlation is used (0.90).

The link-table is filtered for correlations larger than 0.9, and a network graph is created.

links <- links[links$cor > 0.9,]

g <-, directed = FALSE)

fgc <- cluster_fast_greedy(g)

length(fgc)

## [1] 5

The countries now appear to split up into 5 groups: three large clusters and two small ones.


By plotting time-trends of the groups, a grouping in the trends is visible.

trendtb <-

for(group in 1:length(fgc)){
  sel <- trendtb[, as.character(unlist(fgc[group]))]
  plot(x = NA, y = NA, xlim = c(1990, 2007), ylim = range(sel),
       main = paste("Group", group),
       xlab = 'Year', ylab = 'TB cases per 100K distribution')
  for(i in names(sel)){
    points(1990:2007, sel[,i], type = 'o', cex = .1, lty = "solid",
           col = which(i == names(sel)))
  }
  if(group %in% c(4,5)) legend('topright', legend = names(sel), lwd = 2,
                               col = seq_along(names(sel)))
}

In groups 4 and 5, quite particular trends are selected. Group 4 consists of countries with a maximum number of TB cases in the period 1996 to 2003, and the two countries in group 5 show a dramatic drop in TB cases around 1996, followed by a large increase. This latter trend should be explained by metadata about the dataset.


## $`4`

## [1] "Lithuania" "Zambia"    "Belarus"   "Bulgaria"  "Estonia" 


## $`5`

## [1] "Namibia"  "Djibouti"

Group 4 consists of former USSR countries, though Zambia is an exception. This trend could be explained by social problems during the collapse of the USSR; for Zambia the trend should be explained by political changes instead.

The range in TB cases in the first three graphs is too large to see the similarities within the groups. Dividing each country's series by its mean gives a better view of the trend.

for(group in 1:3){

  sel <- trendtb[,as.character(unlist(fgc[group]))]

  selavg <- meantb[as.character(unlist(fgc[group]))]

  plot(x=NA, y=NA, xlim=c(1990,2007), ylim=range(t(sel)/selavg),
       main=paste("Group", group), xlab='Year',
       ylab='TB cases per 100K distribution')

  for(i in names(sel)){
    points(1990:2007, sel[,i]/selavg[i], type='o', cex=.1, lty="solid",
           col=which(i == names(sel)))
  }
}



Now the difference between groups 1 and 3 is better visible: group 1 contains countries tending towards sigmoid trends, while group 3 consists of countries with a steadier decay in TB cases. Countries in group 2 show an increasing sigmoid-like trend.

Within the groups, developed and developing countries are mixed. For western countries the number of TB cases is low, and one TB case more or less may flip the trend; again, a better separation would be found over a larger span of time.


## $`1`

##  [1] "Albania"                          "Argentina"                      

##  [3] "Bahrain"                          "Bangladesh"                     

##  [5] "Bosnia and Herzegovina"           "China"                          

##  [7] "Costa Rica"                       "Croatia"                        

##  [9] "Czech Republic"                   "Korea, Dem. Rep."               

## [11] "Egypt"                            "Finland"                        

## [13] "Germany"                          "Guinea-Bissau"                  

## [15] "Honduras"                         "Hungary"                        

## [17] "India"                            "Indonesia"                      

## [19] "Iran"                             "Japan"                           

## [21] "Kiribati"                         "Kuwait"                         

## [23] "Lebanon"                          "Libyan Arab Jamahiriya"         

## [25] "Myanmar"                          "Nepal"                          

## [27] "Pakistan"                         "Panama"                         

## [29] "Papua New Guinea"                 "Philippines"                    

## [31] "Poland"                           "Portugal"                       

## [33] "Puerto Rico"                      "Singapore"                      

## [35] "Slovakia"                         "Syrian Arab Republic"           

## [37] "Thailand"                         "Macedonia, FYR"                 

## [39] "Turkey"                           "Tuvalu"                         

## [41] "United States of America"         "Vanuatu"                        

## [43] "West Bank and Gaza"               "Yemen"                          

## [45] "Micronesia, Fed. Sts."            "Saint Vincent and the Grenadines"

## [47] "Viet Nam"                         "Eritrea"                        

## [49] "Jordan"                           "Tunisia"                        

## [51] "Monaco"                           "Niger"                          

## [53] "New Caledonia"                    "Guam"                           

## [55] "Timor-Leste"                      "Iraq"                           

## [57] "Mauritius"                        "Afghanistan"                     

## [59] "Australia"                        "Cape Verde"                     

## [61] "French Polynesia"                 "Malaysia"                       


## $`2`

##  [1] "Botswana"                 "Burundi"                

##  [3] "Cote d'Ivoire"            "Ethiopia"               

##  [5] "Guinea"                   "Rwanda"                 

##  [7] "Senegal"                  "Sierra Leone"           

##  [9] "Suriname"                 "Swaziland"              

## [11] "Tajikistan"               "Zimbabwe"               

## [13] "Azerbaijan"               "Georgia"                

## [15] "Kenya"                    "Kyrgyzstan"             

## [17] "Russian Federation"       "Ukraine"                

## [19] "Tanzania"                 "Moldova"                

## [21] "Burkina Faso"             "Congo, Dem. Rep."       

## [23] "Guyana"                   "Nigeria"                

## [25] "Chad"                     "Equatorial Guinea"      

## [27] "Mozambique"               "Uzbekistan"             

## [29] "Kazakhstan"               "Algeria"                

## [31] "Armenia"                  "Central African Republic"


## $`3`

##  [1] "Barbados"                 "Belgium"                 

##  [3] "Bermuda"                  "Bhutan"                 

##  [5] "Bolivia"                  "Brazil"                 

##  [7] "Cambodia"                 "Cayman Islands"         

##  [9] "Chile"                    "Colombia"               

## [11] "Comoros"                  "Cuba"                   

## [13] "Dominican Republic"       "Ecuador"                

## [15] "El Salvador"              "Fiji"                   

## [17] "France"                   "Greece"                 

## [19] "Haiti"                    "Israel"                 

## [21] "Laos"                     "Luxembourg"             

## [23] "Malta"                    "Mexico"                 

## [25] "Morocco"                  "Netherlands"            

## [27] "Netherlands Antilles"     "Nicaragua"              

## [29] "Norway"                   "Peru"                   

## [31] "San Marino"               "Sao Tome and Principe"  

## [33] "Slovenia"                 "Solomon Islands"        

## [35] "Somalia"                  "Spain"                  

## [37] "Switzerland"              "United Arab Emirates"   

## [39] "Uruguay"                  "Antigua and Barbuda"    

## [41] "Austria"                  "British Virgin Islands" 

## [43] "Canada"                   "Cyprus"                 

## [45] "Denmark"                  "Ghana"                  

## [47] "Guatemala"                "Ireland"                

## [49] "Italy"                    "Jamaica"                

## [51] "Mongolia"                 "Oman"                   

## [53] "Saint Lucia"              "Seychelles"             

## [55] "Turks and Caicos Islands" "Virgin Islands (U.S.)"  

## [57] "Venezuela"                "Maldives"               

## [59] "Trinidad and Tobago"      "Korea, Rep."            

## [61] "Andorra"                  "Anguilla"               

## [63] "Belize"                   "Mali"


7 Ingredients for Great Visualizations

Great article by Bernard Marr. Here we present a summary. A link to the full article is provided below.


1. Identify your target audience. 

2. Customize the data visualization. 

3. Give the data visualization a clear label or title. 


4. Link the data visualization to your strategy. 

5. Choose your graphics wisely. 

6. Use headings to make the important points stand out. 


7. Add a short narrative where appropriate.  

Click here to read full article. For other articles on visualization science and art, as well as best practices, click here.



Contributed by David Comfort. David took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Sept 23 and Dec 18, 2015. This post is based on his first class project, due in the 2nd week of the program.


Hans Rosling gave a famous TED talk in 2007, “The Best Stats You’ve Ever Seen”. Rosling is a Professor of International Health at the Karolinska Institutet in Stockholm, Sweden and founded the Gapminder Foundation.


To visualise his talk, he and his team at Gapminder developed animated bubble charts, also known as motion charts. Gapminder developed the Trendalyzer data visualization software, which was subsequently acquired by Google in 2007.

The Gapminder Foundation is a Swedish NGO which promotes sustainable global development by increased use and understanding of statistics about social, economic and environmental development.

The purpose of my data visualization project was to visualize data about long-term economic, social and health statistics. Specifically, I wanted to extract data sets from Gapminder using an R package, googlesheets, munge these data sets and combine them into one dataframe, and then use the googleVis R package to visualize them with a Google Motion Chart.

The finished product can be viewed at this page and a screen capture demonstrating the features of the interactive data visualization is at:


World Bank Databank

2) Data sets used for Data Visualization

The following Gapminder datasets, several of which were adapted from the World Bank Databank, were accessed for data visualization:

  • Child mortality - The probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births.
  • Democracy score  - Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.
  • Income per person - Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). Inflation and differences in the cost of living between countries have been taken into account.
  • Life expectancy at birth - Life expectancy at birth (years) with projections. The average number of years a newborn child would live if current mortality patterns were to stay the same. Observations after 2010 are based on projections by the UN.
  • Population, Total - Population is available in Gapminder World in the indicator “population, total”, which contains observations between 1700 to 2012. It is also available in the indicator “population, with projections”, which also includes projections up to 2100 as well as data for several countries before 1700.
  • Country and Dependent Territories Lists with UN Regional Codes - These lists are the result of merging data from two sources, the Wikipedia ISO 3166-1 article for alpha and numeric country codes, and the UN Statistics site for countries' regional, and sub-regional codes.

3) Reading in the Gapminder data sets using Google Sheets R Package

  • The googlesheets R package allows one to access and manage Google spreadsheets from within R.
  • googlesheets Basic Usage (vignette)
  • Reference manual
  • The registration functions gs_title(), gs_key(), and gs_url() return a registered sheet as a googlesheets object, which is the first argument to practically every function in this package.

First, install and load googlesheets and dplyr, and access a Gapminder Google Sheet by the URL and get some information about the Google Sheet:

Note: Setting the parameter lookup=FALSE will block authenticated API requests.

A utility function, extract_key_from_url(), helps you get and store the key from a browser URL:
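For illustration, the key is the path segment between /d/ and the next slash, which is exactly the piece extract_key_from_url() returns. The regular expression below is a stand-in for demonstration (with a placeholder URL), not the package's implementation:

```r
# A Google Sheets browser URL carries the document key after "/d/".
url <- "https://docs.google.com/spreadsheets/d/KEY123abc/pub"
key <- sub(".*/d/([^/]+).*", "\\1", url)
key  # "KEY123abc"
```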

You can access the Google Sheet by key:

Once one has registered a worksheet, you can consume the data in a specific worksheet (“Data” in our case) within the Google Sheet using the gs_read() function (combining the statements with a dplyr pipe, and using check.names=FALSE so that the integer column names are handled correctly and an “x” is not appended to each column name):

You can also target specific cells via the range = argument. The simplest usage is to specify an Excel-like cell range, such as range = “D12:F15” or range = “R1C12:R6C15”.

But a problem arises since check.names=FALSE does not work with this statement (problem with package?). So a workaround would be to pipe the data frame through the dplyr rename function:

However, for our purposes, we will ingest the entire worksheet and not target specific cells.

Let’s look at the data frame. We need to change the name of the first column from “GDP per capita” to “Country”.


Let’s download the rest of the datasets

4) Read in the Countries Data Set

We want to segment the countries in the data sets by region and sub-region. However, the Gapminder data sets do not include these variables. Therefore, one can download the ISO-3166-Countries-with-Regional-Codes data set from github which includes the ISO country code, country name, region, and sub-region.

Use rCurl to read in directly from Github and make sure you read in the “raw” file, rather than Github’s display version.

Note: The Gapminder data sets do not include ISO country codes, so I had to clean the countries data set with the corresponding country names used in the Gapminder data sets.

5) Reshaping the Datasets

We need to reshape the data frames. For the purposes of reshaping our data frames, we can divide the variables into two groups: identifier, or id, variables and measured variables. In our case, id variables include the Country and Years, whereas the measured variables are the GDP per capita, life expectancy, etc..

We can further abstract and “say there are only id variables and a value, where the id variables also identify what measured variable the value represents.”

For example, we could represent a data set, which has two id variables, subject and time:


where each row represents one observation of one variable. This operation is called melting (and can be achieved by using the melt function of the Reshape package).

Compared to the former table, the latter table has a new id variable “variable”, and a new column “value”, which represents the value of that observation. See the paper, Reshaping data with the reshape package, by Hadley Wickham, for more clarification.

We now have the data frame in a form in which there are only id variables and a value.

Let’s reshape the data frames (child_mortality, democracy_score, life_expectancy, population):
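As a minimal sketch of that reshaping step, here is the wide-to-long melt done with base R's reshape() (the data values and column names are made up for illustration; the post itself uses the reshape package's melt()):

```r
# Wide format: one column per year, as in the Gapminder sheets.
gdp_wide <- data.frame(Country = c("Sweden", "Kenya"),
                       "1990" = c(100, 40),
                       "1991" = c(110, 42),
                       check.names = FALSE)

# Long format: one row per (Country, Year) observation.
gdp_long <- reshape(gdp_wide, direction = "long",
                    varying = list(c("1990", "1991")),
                    v.names = "GDP_per_capita",
                    timevar = "Year", times = c(1990, 1991),
                    idvar = "Country")
rownames(gdp_long) <- NULL
```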

6) Combining the Datasets

Whew, now we can finally combine all the datasets using a left_join:
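Base R's merge(..., all.x = TRUE) behaves like the dplyr left_join used here; a tiny hedged sketch with made-up values shows the join semantics:

```r
gdp  <- data.frame(Country = c("Sweden", "Kenya"), Year = 1990,
                   GDP_per_capita = c(100, 40))
life <- data.frame(Country = "Sweden", Year = 1990,
                   Life_expectancy = 77.5)

# Left join on the shared id variables: every row of gdp is kept,
# and Kenya's missing life expectancy becomes NA.
combined <- merge(gdp, life, by = c("Country", "Year"), all.x = TRUE)
```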

7) What is GoogleVis

An overview of a GoogleVis Motion Chart

8) Implementing GoogleVis

Implementing GoogleVis is fairly easy. The design of the visualization functions is fairly generic.

The name of the visualization function is gvis + ChartType. So for the Motion Chart we have:

9) Parameters for GoogleVis

  • data: a data.frame
  • idvar: the id variable, “Country” in our case.
  • timevar: the time variable for the plot, “Years” in our case.
  • xvar: column name of a numerical vector in data to be plotted on the x-axis.
  • yvar: column name of a numerical vector in data to be plotted on the y-axis.
  • colorvar: column name of data that identifies bubbles in the same series. We will use “Region” in our case.
  • sizevar: values in this column are mapped to actual pixel values using the sizeAxis option. We will use this for “Population”.

The output of a googleVis function (gvisMotionChart in our case) is a list of lists (a nested list) containing information about the chart type, chart id, and the html code in a sub-list split into header, chart, caption and footer.

10) Data Visualization

Let’s plot the GoogleVis Motion Chart.
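Assuming a combined data frame with the columns listed in section 9, the call is a sketch along these lines (the toy data frame and its values are placeholders, not the project's actual data):

```r
library(googleVis)

# Hypothetical combined data frame with the columns described above.
combined <- data.frame(Country = c("Sweden", "Kenya"),
                       Year = c(1990, 1990),
                       GDP_per_capita = c(100, 40),
                       Life_expectancy = c(77.5, 59.6),
                       Region = c("Europe", "Africa"),
                       Population = c(8.5e6, 2.3e7))

motion <- gvisMotionChart(combined,
                          idvar    = "Country",
                          timevar  = "Year",
                          xvar     = "GDP_per_capita",
                          yvar     = "Life_expectancy",
                          colorvar = "Region",
                          sizevar  = "Population")
```

plot(motion) then opens the interactive chart in a browser; the object itself is the nested list of HTML fragments described above.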

Gapminder Data Visualization using GoogleVis

Note: I had an issue with embedding the GoogleVis motion plot in a WordPress blog post, so I will subsequently feature the GoogleVis interactive motion plot on a separate page at


Here is a screen recording of the data visualization produced by GoogleVis of the Gapminder datasets:


11) What are the Key Lessons of the Gapminder Data Visualization Project?

  • Hans Rosling and Gapminder have made a big impact on data visualization and on how data visualization can inform the public and correct widespread misperceptions.
  • The googlesheets R package allows for easy extraction of data sets which are stored in Google Sheets.
  • The different steps involved in reshaping and joining multiple data sets can be a little cumbersome. We could have used dplyr pipes more.
  • It would be a good practice for Gapminder to include the ISO country code in each of their data sets.
  • There is a need for a data set which lists country names, their ISO codes, as well as other categorical information such as Geographic Regions, Income groups, Landlocked, G77, OECD, etc.
  • It is relatively easy to implement a GoogleVis motion chart using R. However, it is difficult to change the configuration options. For instance, I was unable to make a simple change to the chart by adding a chart title.
  • Google Motion Charts provide a great way to visualize several variables at once and can be a great teaching tool for all sorts of data.
  • For instance, one can visualize the Great Divergence between the Western world and China, India or Japan, whereby the West had much faster economic growth, with attendant increases in life expectancy and other health indicators.



Originally posted on Data Science Central


How to make any plot with ggplot2?

Guest blog post by Selva Prabhakaran

ggplot2 is the most elegant and aesthetically pleasing graphics framework available in R. It has a nicely planned structure. This tutorial focuses on exposing the underlying structure you can use to make any ggplot. The way you make plots in ggplot2 is, however, very different from base graphics, making the learning curve steep. So leave what you know about base graphics behind and follow along. You are just 5 steps away from cracking the ggplot puzzle.

The distinctive feature of the ggplot2 framework is the way you make plots through adding ‘layers’. The process of making any ggplot is as follows.

1. The Setup

First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function, where df is a dataframe that contains all features needed to make the plot. This is the most basic step. Unlike base graphics, ggplot doesn’t take vectors as arguments.

Optionally you can add whatever aesthetics you want to apply to your ggplot (inside aes() argument) - such as X and Y axis by specifying the respective variables from the dataset. The variable based on which the color, size, shape and stroke should change can also be specified here itself. The aesthetics specified here will be inherited by all the geom layers you will add subsequently.

If you intend to add more layers later on, may be a bar chart on top of a line graph, you can specify the respective aesthetics when you add those layers.

Below, I show a few examples of how to set up a ggplot using the diamonds dataset that comes with ggplot2 itself. However, no plot will be printed until you add the geom layers.

ggplot(diamonds)  # if only the dataset is known.
ggplot(diamonds, aes(x=carat))  # if only X-axis is known.
ggplot(diamonds, aes(x=carat, y=price))  # if both X and Y axes are fixed for all layers.
ggplot(diamonds, aes(x=carat, color=cut))  # color will now vary based on `cut`

The aes argument stands for aesthetics. ggplot2 considers the X and Y axis of the plot to be aesthetics as well, along with color, size, shape, fill etc. If you want to have the color, size etc fixed (i.e. not vary based on a variable from the dataframe), you need to specify it outside the aes(), like this.

ggplot(diamonds, aes(x=carat), color="steelblue")

See this color palette for more colors.

2. The Layers

The layers in ggplot2 are also called ‘geoms’. Once the base setup is done, you can append the geoms one on top of the other. The documentation provides a comprehensive list of all available geoms.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + geom_smooth() # Adding scatterplot geom (layer1) and smoothing geom (layer2).

We have added two layers (geoms) to this plot: geom_point() and geom_smooth(). Since the X axis, Y axis and color were defined in the ggplot() setup itself, these two layers inherited those aesthetics. Alternatively, you can specify those aesthetics inside the geom layer, as shown below.

ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price, color=cut)) # Same as above but specifying the aesthetics inside the geoms.

Notice the X and Y axes and how the color of the points varies based on the value of the cut variable. The legend was added automatically. I would like to propose a change though: instead of having multiple smoothing lines, one for each level of cut, I want to integrate them all under one line. How to do that? Removing the color aesthetic from the geom_smooth() layer accomplishes that.

library(ggplot2)
ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price))  # Remove color from geom_smooth

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut)) + geom_smooth() # same but simpler

Here is a quick challenge for you. Can you make the shape of the points vary with color feature?

Though the setup took us quite a bit of code, adding further complexity such as the layers and a distinct color for each cut was easy. Imagine how much code you would have had to write to make this in base graphics. Thanks, ggplot2!

# Answer to the challenge.
ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=color)) + geom_point()

3. The Labels

Now that you have drawn the main parts of the graph, you might want to add the plot’s main title and perhaps change the X and Y axis titles. This can be accomplished using the labs layer, meant for specifying the labels. However, manipulating the size and color of the labels is the job of the ‘Theme’.

library(ggplot2)
gg <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + labs(title="Scatterplot", x="Carat", y="Price")  # add axis labels and plot title.
print(gg)

The plot’s main title is added and the X and Y axis labels capitalized.

Note: If you are showing a ggplot inside a function, you need to explicitly save it and then print it using print(gg), like we just did above.

4. The Theme

Almost everything is set, except that we want to increase the size of the labels and change the legend title. Adjusting the size of labels can be done using the theme() function by setting plot.title, axis.text.x and axis.text.y. They need to be specified inside element_text(). If you want to remove any of them, set it to element_blank() and it will vanish entirely.

Adjusting the legend title is a bit tricky. If your legend is that of a color attribute and it varies based on a factor, you need to set the name using scale_color_discrete(), where the color part belongs to the color attribute and the discrete part is because the legend is based on a factor variable.

gg1 <- gg + theme(plot.title=element_text(size=30, face="bold"), axis.text.x=element_text(size=15), axis.text.y=element_text(size=15), axis.title.x=element_text(size=25), axis.title.y=element_text(size=25)) + scale_color_discrete(name="Cut of diamonds")  # add title and axis text, change legend title. 

print(gg1) # print the plot

If the legend shows a shape attribute based on a factor variable, you need to change it using scale_shape_discrete(name="legend title"). Had it been a continuous variable, you would use scale_shape_continuous(name="legend title") instead.

So now, Can you guess the function to use if your legend is based on a fill attribute on a continuous variable?

The answer is scale_fill_continuous(name="legend title").

5. The Facets

In the previous chart, you had the scatterplot for all different values of cut plotted in the same chart. What if you want one chart for one cut?

gg1 + facet_wrap( ~ cut, ncol=3)  # columns defined by 'cut'

facet_wrap(formula) takes in a formula as the argument. The item on the RHS corresponds to the column. The item on the LHS defines the rows.

gg1 + facet_wrap(color ~ cut)  # row: color, column: cut

In facet_wrap, the scales of the X and Y axes are fixed by default to accommodate all points. This makes comparison of attributes meaningful because they share the same scale. However, you can let the scales roam free, making the charts look more evenly distributed, by setting the argument scales="free".

gg1 + facet_wrap(color ~ cut, scales="free")  # row: color, column: cut

For comparison purposes, you can put all the plots in a grid as well using facet_grid(formula).

gg1 + facet_grid(color ~ cut)   # In a grid

Note that the headers for the individual plots are gone, leaving more space for the plotting area.

This post was originally published at The full post is available in this ggplot2 tutorial.


Exploring the VW scandal with graph analysis

Guest blog post by Zygimantas Jacikevicius

Our software partners Linkurious were excited to dive into the VW emissions scandal case we prepared, and they documented their findings in this great article. Keep reading, or visit the source of this article here.


The German car manufacturer admitted to cheating on emissions tests. Which companies are impacted, directly or indirectly, by these revelations? James Phare from Data To Value used graph analysis and open source information to unravel the impact of the VW scandal on its customers, partners and shareholders.

 The VW scandal makes headlines


Since September 18 when VW admitted to modifying its cars to disguise their true emissions performances, the news has made headlines worldwide. In all this noise, it’s difficult to piece together the important information necessary to assess the event’s impact.

Our partner Data To Value helps organisations turn data into their most valuable asset. Using a combination of semantic analysis and Linkurious, the Data To Value team was able to investigate the ramifications of the VW scandal and show how it impacts VW's customers, partners, suppliers and shareholders.

To do this, Data To Value used open source intelligence (OSINT) from sources like newspapers, social media and blogs. Semantic analysis helps analyse this vast volume of unstructured or semi-structured data to extract:

  • relevant entities (persons, companies, locations, etc.)
  • sentiment (is a blog post sympathetic or aggressive?)
  • topics (is this article related to the production of apples or to the company Apple?)


The resulting data can be represented in various ways. For example, Data To Value built a visualization to show the evolution of the general sentiment toward VW during the crisis.


This approach is useful, but it doesn’t help us understand how the various entities linked to VW (directly and indirectly) are impacted by the scandal. Traditional tools and techniques are ill-adapted to these types of questions. That’s where graph analysis is useful: it helps us see relationships which are hard to find using other techniques and other data structures.


In our case for example, Data to Value chose to represent the data collected about the VW crisis in the following graph model:

In the picture above we see how some of the different entities are linked:

  • a company or an investor own shares of another company
  • an engine powers a car model
  • a car model is sold by (in the range of) a company
  • a company is a fleet buyer of another company
  • a supplier supplies a company

The graph model gives a highly structured view of the information. We can now use it to start asking questions.
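To make the graph model concrete, here is a toy slice of it built with the igraph R package (used earlier on this page). The edge list below is made up from relationships mentioned in this article and is only an illustrative sketch, not Data To Value's actual data or tooling:

```r
library(igraph)

# A few of the entity relationships described above, as an edge list.
edges <- data.frame(
  from = c("Qatar Investment Authority", "Volkswagen AG", "Volkswagen AG",
           "Robert Bosch", "Hertz"),
  to   = c("Volkswagen AG", "Audi", "Skoda", "Volkswagen", "Volkswagen"),
  rel  = c("owns_shares", "owns", "owns", "supplies", "fleet_buyer"),
  stringsAsFactors = FALSE
)
g_vw <- graph_from_data_frame(edges, directed = TRUE)

# Entities one hop away from Volkswagen AG, in either direction:
neighbors(g_vw, "Volkswagen AG", mode = "all")$name
```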



Investigating the VW network to find hidden insights


First let’s start by visualizing the motors involved in the scandal and in which car models they are used.

We can see how a single motor can be used in many car models. The “R3 1.4l 66kW(90PS) 230Nm 4V TDI CR” motor, for example, is used in the Rapid Spaceback, Toledo, A1, A1 Sportback, Fabia, 250 OSA Polo, Fabia Combi, Polo, Ibiza Sport Tourer, Ibiza 3 Turer and Ibiza 5 Turer.

The Linkurious graph visualization helps us make sense of how the motors and car models are tied. Now that we understand this, we can add an extra set of relationships. The relationships between the car models and the manufacturers.

At this point, we notice that a car like Skoda’s Superb uses 3 motors involved in the scandal.

Although most customers don’t associate Skoda’s cars with Volkswagen, we can see that there are relationships between the two car manufacturers. Two other car manufacturers also appear in the data: Seat and Audi.

Let’s expand the relationships of our car manufacturers to reveal more information. A few new details emerge.

Various car rental companies are major customers of Volkswagen. Companies like Hertz, Europcar and Avis are buying cars manufactured by Volkswagen, Audi, Skoda and Seat.

Daimler and BMW, the other German car manufacturers, use Robert Bosch, a VW supplier and a company also involved in the scandal, although to date these companies have denied using the software illegally in the same way as VW.

Let’s filter the graph to focus on the shareholding relationships now.

Audi, Seat, Skoda and Volkswagen are all owned by Volkswagen AG. Investors in Volkswagen AG include the asset manager BlackRock and the sovereign funds Qatar Investment Authority and Norwegian Investment Authority.



Simplifying investment research with graph analysis



Data to Value’s research into the VW scandal has applications both in the long-only asset management space and in the hedge fund space, for managing risks as well as identifying investment opportunities or developing a graph-based trading strategy. For example, based on the insights uncovered, the shares of indirectly affected companies could be shorted in order to achieve returns. Although not directly exposed to the crisis, these companies are indirectly affected by the reputational impact, the decline in VW's profitability, and unplanned costs if the matter is not resolved effectively, e.g. fleet buyers bearing modification costs themselves or seeing reduced demand for VW rentals.


“The same insights could also help a risk manager understand a portfolio or entire investment manager’s risk profile associated with systemic or counterparty risks. Graph technology is perfect for performing these types of network analysis” explains Phare.


Research processes are traditionally very manual and document-driven: analysts look at brokers’ reports, newspapers and market data. This makes sense when the velocity of information is low, but it doesn’t scale to today’s high data volumes. Techniques like machine learning and semantic analysis can automate much of this work and make analysts’ jobs easier. Combined with the kind of self-service graph analysis enabled by Linkurious, this presents an innovative approach for investment managers.


Visualize Your Data Using

Guest blog post by Vozag

Creating interactive visualizations of your data for the web is a cakewalk: all you need to do is import your spreadsheet and start generating your interactive visualizations.

You need to sign in before you can start creating visualizations. If you don't have an account, signing up is easy: all you need is a name, an email address and a password, and the account is created immediately.

Add your data

Importing data is very easy. You can import any spreadsheet by clicking the ‘Import spreadsheet’ link on the right side of the home screen. There is an option for importing either csv or excel files. You can also choose to import data directly from a Google spreadsheet by pasting its URL.


Once you choose a file to import and click open, you are taken to a screen showing a number of rows of your data. Note that each row in the spreadsheet is considered to be a page. You can set which column is to be used as the title for each page and how each column should be interpreted. You may also choose to ignore some columns depending upon your requirements.

Once you are done, you can finish importing the data by clicking the ‘Start import’ button on the top right of the page. The time taken to import the data depends upon the number of rows and the type of data you are importing.



Explore Your Collection

Once the data is imported, it is shown in a dashboard. You can start exploring it by clicking the ‘Explore your collection’ link just under the collection name.

By default, the data is shown as a table. You can change the visualization using the drop-down button on the right side of the explore page.


Apart from a table format, you have the option of visualizing your data as a list, grid, pie chart, bar chart, scatter plot, map and a few other forms.

Generate and save visualizations


Generating map visualizations is very easy, given that the tool can convert addresses into latitude-longitude pairs and plot them on the map directly. If you have worked on map visualizations before, you will appreciate how much pain this saves. Take care to provide full addresses, though, or the plotted points will be erratic.

Get started by choosing the map visualization in the drop-down. You can choose the columns you want to show in the pop-up for a location on the map, the column of numeric data if you are plotting a heat map, and the column on which you want to sort the data. Most importantly, you have to choose the column that holds the location information; the visualization is generated accordingly. You can then add the visualization to any of your pages.

Given above is a heat map showing home values in the top 200 counties in USA.

Share your work

Once your visualizations are generated, you can email them, share them on social networking platforms like Facebook, Twitter and LinkedIn, or embed them in any of your web pages by clicking the ‘Share and embed’ button in the top right corner of the page.

An interactive version of the map is available here.

Read more…

Guest blog post by Jean Villedieu

The European Union is giving free access to detailed information about public purchasing contracts. That data describes which European institutions are spending money, for what and who benefits from it. We are going to explore the network of public institutions and suppliers and look for interesting patterns with graph visualization.

Public spending in Europe is a €1.3 trillion business

Every year, more than €1.3 trillion in contracts are awarded by public entities in Europe. In an effort to make these public contracts more transparent, the European Union has decided to make the tender announcements public. The information can be found online through the EU’s open data portal. OpenTED, a non-profit organization, has gone a step further and made the data available in CSV format.

Public contracts are complex, though. Each involves at least one commercial entity which is awarded a contract by a public authority. The public entity may be acting as a “delegate” for another public entity, and the contract can be disputed in a certain jurisdiction of appeal.

We have multiple entities and relationships. What this means is that the tenders data describe a graph or network. We are going to explore the tenders graph with Neo4j and Linkurious.

Modeling public contracts as a graph

We will focus on the 2015 tenders. There are 73,269 tenders in one single CSV file with 45 columns.

tenders excel

We decided to model the graph in the following way:

tender data model neo4j

The graph model above highlights the relationships between the contracts, appeal bodies, operators, delegates and authorities in our data.

To put it in a Neo4j database, we wrote a script that you can see here.

This script takes the 2015 tenders data and turns it into a Neo4j graph database with 161,541 nodes and 536,936 edges. Now that it’s in Neo4j, we can search, explore and visualize it with Linkurious. It’s time to start asking the data questions!
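Before any graph tooling, the raw CSV can already answer simple questions. As a sanity check on the import, here is a stdlib-Python sketch that builds an in-memory authority-to-contract adjacency and lists the largest contracts. The column names (`authority_official_name`, `operator_official_name`, `contract_initial_value_cost_eur`) are assumptions about the OpenTED dump, not its documented schema:

```python
import csv
from collections import defaultdict

def load_tender_graph(path):
    """Build an authority -> [(value_eur, operator)] adjacency from the CSV.

    The column names used here are assumed, not taken from a documented schema.
    """
    awards = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            value = row.get("contract_initial_value_cost_eur")
            if value in (None, "", "null"):
                continue  # skip contracts with no disclosed value
            awards[row["authority_official_name"]].append(
                (float(value), row["operator_official_name"])
            )
    return awards

def biggest_contracts(awards, n=10):
    """Flatten the adjacency and return the n largest contracts."""
    flat = [
        (value, authority, operator)
        for authority, contracts in awards.items()
        for value, operator in contracts
    ]
    return sorted(flat, reverse=True)[:n]
```

The top-10 question is answered far more naturally in Cypher once the data is in Neo4j, but a flat pass over the file is a useful cross-check of the import.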

The biggest tenders and the authorities and companies which are involved

As a first step, let’s try to identify the big public contracts awarded in Europe in 2015 and the organizations they involved. To get that answer, we’ll use Cypher (the Neo4j query language) within Linkurious.

Here’s how to find the 10 biggest contracts and the public authorities and commercial providers they involved.

// The top 10 biggest contracts and who they involve
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
WHERE b.contract_initial_value_cost_eur <> 'null'
RETURN a, c, b
ORDER BY b.contract_initial_value_cost_eur DESC
LIMIT 10

The result is the following graph.

biggest public contracts

We can see, for example, that the Ministry of Defence has awarded a large contract to Babcock International Group Plc.

Missed connections

The graph structure of the data also allows us to spot missed opportunities. Take KPMG, for example. It is one of the biggest recipients of public contracts in our dataset. What other contracts could the company have been awarded?

To answer that question, we can identify which of KPMG’s customers awarded contracts to its competitors.

Let’s identify the biggest missed opportunities of KPMG:

// KPMG’s biggest missed opportunities
MATCH (a:OPERATOR {official_name: 'KPMG'})-[:IS_OPERATOR_OF]->(b:CONTRACT)<-[:IS_AUTHORITY_OF]-(customer:AUTHORITY)-[:IS_AUTHORITY_OF]->(lost_opportunity:CONTRACT)<-[:IS_OPERATOR_OF]-(competitor:OPERATOR)
WHERE NOT (a)-[:IS_OPERATOR_OF]->(lost_opportunity)
RETURN a, b, customer, lost_opportunity, competitor

We can visualize the result in Linkurious:

KPMG's network.


KPMG is highlighted in red. It is connected to 10 contracts by 9 customers. These customers have awarded 246 other contracts to 181 firms.

This visualization could be used by KPMG to identify its “unfaithful” customers and its competitors. Specifically, we may want to filter the visualization to focus on contracts for services similar to the ones KPMG offers. To do that we will use the CPV code, a procurement taxonomy of products and services.

Here is a visualization filtered to only display contracts with CPV codes similar to KPMG’s contracts:

Filtering on contracts KPMG could have won.


If we zoom in, we can see for example that GCS Uniha, one of KPMG’s customers, has also awarded contracts to some of KPMG’s competitors (Ernst and Young, PwC and Deloitte).

GCS Uniha has also awarded contracts to Ernst and Young, PwC and Deloitte.


Exploring the French IT sector

Finally, we can use Linkurious to visualize a particular economic sector. Let’s focus for example on IT spending in France. Who are the biggest spenders in that sector? Which companies are capturing these contracts? Finally what are the relationships between all these organizations?

Using a Cypher query, we can identify the customers and suppliers linked to IT contracts (which all have a CPV code starting with “48”):

//The French public IT market
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
// 'country' is an assumed property name
WHERE b.contract_cpv_code STARTS WITH '48' AND b.country = 'FR'
RETURN a, b, c

We can visualize the result directly in Linkurious.

Public IT contracts in France in 2015.


If we zoom in, we can see that Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015, mostly to Dalkia and Idex énérgies.

Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015.


We have dived into €1.3 trillion of public spending using Neo4j and Linkurious. We were able to identify some key actors of the public contracts market and map interesting ecosystems. Want to explore and understand your graph data? Simply try the demo of Linkurious!

Read more…

Space is limited.
Reserve your Webinar seat now

Web and graphic design principles are tremendously useful for creating beautiful, effective dashboards. In this latest Data Science Central Webinar event, we will consider how common design mistakes can diminish visual effectiveness. You will learn how placement, weight, font choice, and practical graphic design techniques can maximize the impact of your visualizations.

Speaker: Dave Powell, Solution Architect -- Tableau
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central

Again, space is limited, so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

The current state of machine intelligence 2.0

Guest blog post by Laetitia Van Cauwenberge

Interesting O'Reilly article, although focused on the business implications, not the technical aspects. I really liked the infographic, as well as these statements:

  • Startups are considering beefing up funding earlier than they would have, to fight inevitable legal battles and face regulatory hurdles sooner.
  • Startups are making a global arbitrage (e.g., health care companies going to market in emerging markets, drone companies experimenting in the least regulated countries).
  • The “fly under the radar” strategy. Some startups are being very careful to stay on the safest side of the grey area, keep a low profile, and avoid the regulatory discussion as long as possible.

This is a long article with the following sections:

  • Reflections on the landscape
  • Achieving autonomy
  • The new (in)human touch
  • 50 shades of grey markets 
  • What’s your (business) problem?
  • The great verticalization
  • Your money is nice, but tell me more about your data

Note that we are ourselves working on data science 2.0. Some of our predictions can be found here. And for more on machine intelligence (not sure how it is different from machine learning), click on the following links:

To read the full article, click here. Below is the infographic from the original article.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Comparing Data Science and Analytics

You may think that all big data experts are created equal, but nothing could be further from the truth. Yet the terms “data scientist” and “business analyst” are often used interchangeably. It’s a common and confusing use of terminology, which is why [email protected], a master’s in business analytics program, created this infographic to bring further clarity to the two roles.

Both business analysts and data scientists are experts in the use of big data, but they have different types of educational backgrounds, usually work in different types of settings and use their skills and knowledge in completely different ways.

Reflective of the increasing need to extract value from the mountain of big data at our fingertips, business analysts are in much higher demand, with a predicted job growth of 27 percent over the next decade. They dig into a variety of sources to pull data for analysis of past, present and future business performance, and then take those results and use the most effective tools to translate them for business leaders. They typically have educational backgrounds in specialties like business and the humanities.

In contrast, data scientists are digital builders. They have strong educational backgrounds in computer science, mathematics and technology, and they use statistical programming to build the framework for gathering and using the data, creating and implementing the algorithms to do it. Such algorithms help with decision-making, data management, and the creation of visualizations that explain the data they gather.

To find out more about who will fit the bill for your organization, dig into the infographic to make sure the big data expert you’re hiring is the right one to meet your needs.

Originally posted on Data Science Central

Read more…

Two great visualizations about data science

Guest blog post by Laetitia Van Cauwenberge

The first one is about the difference between Data Science, Data Analysis, Big Data, Data Analytics, and Data Mining:

According to a tweet, the source is a website that is itself very interesting, but I could not find the article in question. In any case, I love the picture above; please share it if you agree with it. If someone knows the link to the original article, please post it in the comments section.

And now another nice picture about the history of big data and data science:

Here I have the reference for the picture: click here

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Main Trends for IT in 2016 (infographic)

Guest blog and infographic from Glorium Technologies

A Spiceworks report identifies four main IT trends for the upcoming year, 2016.

  1. Companies’ revenue is set to grow while IT budgets remain the same.
  2. As a consequence, CIOs will have to do more with less.
  3. Even though security concerns are going to increase, security expenses are to stay flat.
  4. The leading driver of 2016 IT budgets is product end of life.


On average, each company plans to increase IT spending by at most $2,000 (comparing 2015 and 2016, worldwide). Among respondents, 42% say their budget will remain the same, 38% say it will increase, and 10% expect it to decrease. The reason for this outcome is a willingness to keep costs low. IT staffing is not set to increase either: 59% report no changes, 34% plan to increase headcount, and 4% plan to decrease it.


The main priorities of the 2016 IT budget are: hardware expenses (37%), software expenses (31%), cloud-based projects (14%) and managed services (13%).


Surprisingly, desktops, not laptops, will take first place in 2016. Here is how hardware expenses are to be distributed: 21% desktops, 19% servers, 16% laptops, 10% networking, 6% external storage, 6% mobile devices and tablets.


Software expenses are divided more evenly: 15% each goes to virtualization, OS and productivity software; 10% each to CRM, backup and databases; and 9% to security.
The main motto for expenses: “If it ain’t broke, don’t fix it”.


Here are the main reasons for companies to invest in IT: end of life, growth or additional needs, upgrades or refresh cycles, end-user need, project need, budget availability, application compatibility, and new technologies or features.


Responses to the question “Is your security at a decent level?” are rather surprising: 61% do not conduct security audits, 59% believe that their investments in security are not adequate, for 51% security is not a 2016 priority, and 48% regard their business processes as not adequately protected.




  1. Companies’ revenue is going to increase, while IT budgets are not going to change.
  2. CIOs will have to solve strategic issues for the future with budgets from the past.
  3. Product end of life is the main driver of IT investments.
  4. Even though companies understand that they do not invest enough in security, security budgets will stay rather low.


The infographic was prepared based upon Spiceworks’ annual report on IT budgets and tech trends.

Originally posted on Data Science Central

Read more…

The Amazing Ways Uber Is Using Big Data

Guest blog post by Bernard Marr

Uber is a smartphone-app based taxi booking service which connects users who need to get somewhere with drivers willing to give them a ride. The service has been hugely controversial, due to regular taxi drivers claiming that it is destroying their livelihoods, and concerns over the lack of regulation of the company’s drivers.

Source for picture: Mapping a city’s flow using Uber data

This hasn’t stopped it from also being hugely successful – since being launched to purely serve San Francisco in 2009, the service has been expanded to many major cities on every continent except for Antarctica.

The business is rooted firmly in Big Data and leveraging this data in a more effective way than traditional taxi firms have managed has played a huge part in its success.

Uber’s entire business model is based on the very Big Data principle of crowd sourcing. Anyone with a car who is willing to help someone get to where they want to go can offer to help get them there.  

Uber holds a vast database of drivers in all of the cities it covers, so when a passenger asks for a ride, it can instantly match them with the most suitable drivers.

Fares are calculated automatically, using GPS, street data and the company’s own algorithms which make adjustments based on the time that the journey is likely to take. This is a crucial difference from regular taxi services because customers are charged for the time the journey takes, not the distance covered.

Surge pricing

These algorithms monitor traffic conditions and journey times in real time, meaning prices can be adjusted as demand for rides changes and as traffic conditions make journeys likely to take longer. This encourages more drivers to get behind the wheel when they are needed, and to stay at home when demand is low. The company has applied for a patent on this method of Big Data-informed pricing, which it calls “surge pricing”.

This algorithm-based approach with little human oversight has occasionally caused problems – it was reported that fares were pushed up sevenfold by traffic conditions in New York on New Year’s Eve 2011, with a journey of one mile rising in price from $27 to $135 over the course of the night.

This is an implementation of “dynamic pricing” – similar to that used by hotel chains and airlines to adjust price to meet demand – although rather than simply increasing prices at weekends or during public holidays, it uses predictive modelling to estimate demand in real time.  
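The mechanics can be sketched as a demand/supply multiplier applied to a time-based fare. This is a toy illustration, not Uber’s actual (proprietary and patented) algorithm; the function names, the linear scaling and the 3x cap are all assumptions:

```python
def surge_multiplier(ride_requests, available_drivers, cap=3.0):
    """Toy demand/supply multiplier: no surge while supply covers demand,
    then scale linearly with the demand/supply ratio, up to a cap."""
    if available_drivers == 0:
        return cap
    return max(1.0, min(cap, ride_requests / available_drivers))

def fare(base, per_minute, minutes, multiplier):
    """Time-based fare, reflecting that riders are charged for journey time."""
    return (base + per_minute * minutes) * multiplier
```

For example, `fare(2.0, 0.5, 20, surge_multiplier(20, 10))` doubles a $12 ride to $24 when requests outnumber drivers two to one.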


Changing the way we book taxis is just a part of the grand plan though. Uber CEO Travis Kalanick has claimed that the service will also cut the number of private, owner-operated automobiles on the roads of the world’s most congested cities. In an interview last year he said that he thinks the car-pooling UberPool service will cut the traffic on London’s streets by a third.

UberPool allows users to find others near them who, according to Uber’s data, often make similar journeys at similar times, and offer to share a ride with them. According to their blog, introducing this service became a no-brainer when their data told them the “vast majority of [Uber trips in New York] have a look-a-like trip – a trip that starts near, ends near, and is happening around the same time as another trip”.

Other initiatives either trialled or due to launch in the future include UberChopper, offering helicopter rides to the wealthy, UberFresh for grocery deliveries and Uber Rush, a package courier service.

Rating systems

The service also relies on a detailed rating system – users can rate drivers, and vice versa – to build up trust and allow both parties to make informed decisions about who they want to share a car with.

Drivers in particular have to be very conscious of keeping their standards high – a leaked internal document showed that those whose score falls below a certain threshold face being “fired” and not offered any more work.

They have another metric to worry about, too – their “acceptance rate”. This is the number of jobs they accept versus those they decline. Drivers were told they should aim to keep this above 80%, in order to provide a consistently available service to passengers.
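Taken together, the two metrics amount to a simple screening rule. The sketch below is hypothetical: only the 80% acceptance-rate target comes from the text above, while the 4.6 rating floor and the function names are illustrative assumptions:

```python
def acceptance_rate(accepted, declined):
    """Share of offered jobs the driver accepted."""
    total = accepted + declined
    return accepted / total if total else 0.0

def driver_status(avg_rating, accepted, declined,
                  rating_floor=4.6, acceptance_floor=0.80):
    """Flag drivers below either threshold. The 80% acceptance target is
    from the leaked guidance; the 4.6 rating floor is purely illustrative."""
    if avg_rating < rating_floor:
        return "at risk: rating below threshold"
    if acceptance_rate(accepted, declined) < acceptance_floor:
        return "at risk: acceptance rate below target"
    return "ok"
```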

Uber’s response to protests over its service by traditional taxi drivers has been to attempt to co-opt them, by adding a new category to its fleet. UberTaxi - meaning you will be picked up by a licensed taxi driver in a registered private hire vehicle - joined UberX (ordinary cars for ordinary journeys), UberSUV (large cars for up to 6 passengers) and UberLux (high end vehicles) as standard options.

Regulatory pressure and controversies

It will still have to overcome legal hurdles – the service is currently banned in a handful of jurisdictions including Brussels and parts of India, and is receiving intense scrutiny in many other parts of the world. Several court cases are underway in the US regarding the company’s compliance with regulatory procedures.  

Another criticism is that because credit cards are the only payment option, the service is not accessible to a large proportion of the population in less developed nations where the company has focused its growth.

But given its popularity wherever it has launched around the world, there is a huge financial incentive for the company to press ahead with its plans for revolutionising private travel.

 If regulatory pressures do not kill it, then it could revolutionise the way we travel around our crowded cities – there are certainly environmental as well as economic reasons why this would be a good thing.

Uber is not alone – it has competitors offering similar services on a (so far) smaller scale, such as Lyft, Sidecar and Haxi. If a deregulated private hire market emerges through Uber’s innovation, it will be hugely valuable, and competition among these upstarts will be fierce. We can expect the winners to be those who make the best use of the data available to them to improve the service they offer to their customers.


Case study: how Uber uses big data – a nice, in-depth case study of how they have based their entire business model on big data, with some practical examples and some mention of the technology used.

I hope you found this post interesting. I am always keen to hear your views on the topic and invite you to comment with any thoughts you might have.

About the author: Bernard Marr is a globally recognized expert in analytics and big data. He helps companies manage, measure, analyze and improve performance using data.

His new book is Big Data: Using Smart Big Data, Analytics and Metrics To Make Better Decisions and Improve Performance. You can read a free sample chapter here.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

The 3Vs that define Big Data

Guest blog post by Diya Soubra

As I studied the subject, the following three terms stood out in relation to Big Data.

Variety, Velocity and Volume.

In marketing, the 4Ps define all of marketing using only four terms.
Product, Promotion, Place, and Price.

I claim that the 3Vs above define big data in a similar fashion.
These three properties describe the expansion of a data set along various fronts, to the point where it merits being called big data: an expansion that is accelerating and generating yet more data of various types.

The plot above, using three axes, helps visualize the concept.

Data Volume:
The size of available data has been growing at an increasing rate. This applies to companies and to individuals. A text file is a few kilobytes, a sound file is a few megabytes, while a full-length movie is a few gigabytes.
More sources of data are added on a continuous basis. For companies, in the old days, all data was generated internally by employees. Currently, the data is generated by employees, partners and customers. For a group of companies, the data is also generated by machines. For example, hundreds of millions of smartphones send a variety of information to the network infrastructure. This data did not exist five years ago.
More sources of data with a larger size of data combine to increase the volume of data that has to be analyzed. This is a major issue for those looking to put that data to use instead of letting it just disappear.
Petabyte data sets are common these days, and the exabyte scale is not far away.

Large Synoptic Survey Telescope (LSST).
“Over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.”
72 hours of video are uploaded to YouTube every minute

There is a corollary to Parkinson’s law that states: “Data expands to fill the space available for storage.” (’s_law)

This is no longer true since the data being generated will soon exceed all available storage space.

Data Velocity:
Initially, companies analyzed data using a batch process. One takes a chunk of data, submits a job to the server and waits for delivery of the result. That scheme works when the incoming data rate is slower than the batch processing rate and when the result is useful despite the delay. With the new sources of data such as social and mobile applications, the batch process breaks down. The data is now streaming into the server in real time, in a continuous fashion and the result is only useful if the delay is very short.
140 million tweets per day on average (and more in 2012).

I have not yet determined how data velocity may continue to increase since real time is as fast as it gets. The delay for the results and analysis will continue to shrink to also reach real time.
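The batch-versus-streaming distinction is easy to see in code. A batch job computes one answer after reading everything; a streaming computation keeps constant state and has an up-to-date answer at every instant. A minimal sketch of the streaming side:

```python
def streaming_mean(stream):
    """Yield the running mean after every element: one pass over the data,
    O(1) memory, and an answer available at every instant."""
    total = 0.0
    for count, x in enumerate(stream, start=1):
        total += x
        yield total / count
```

A batch job over the same data would only produce the final value after the last element arrives; the generator above produces an intermediate result per element, which is the property real-time analysis needs.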

Data Variety:
From Excel tables and databases, data has evolved to lose its structure and to add hundreds of formats: pure text, photo, audio, video, web, GPS data, sensor data, relational databases, documents, SMS, PDF, Flash, etc. One no longer has control over the input data format. Structure can no longer be imposed as in the past in order to keep control over the analysis. As new applications are introduced, new data formats come to life.

Google uses smart phones as sensors to determine traffic conditions.
In this application they are most likely reading the speed and position of millions of cars to construct the traffic pattern in order to select the best routes for those asking for driving directions. This sort of data did not exist on a collective scale a few years ago.

The 3Vs together describe a set of data and a set of analysis conditions that clearly define the concept of big data.


So what is one to do about this?

So far, I have seen two approaches.
1. Divide and conquer using Hadoop
2. Brute force using an “appliance” such as SAP HANA (High-Performance Analytic Appliance)

In the divide-and-conquer approach, the huge data set is broken down into smaller parts (HDFS) and processed (MapReduce) in a parallel fashion using thousands of servers.

As the volume of the data increases, more servers are added and the process runs in the same manner. Need a shorter delay for the result? Add more servers again. Given that, with the cloud, server power is effectively infinite, it is really just a matter of cost: how much is it worth to get the result in a shorter time?

One has to accept that not ALL data analysis can be done with Hadoop. Other tools are always required.
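The divide-and-conquer pattern behind MapReduce can be sketched in a few lines of plain Python: each “worker” counts words in its own chunk independently, and a reduce step merges the partial counts. Hadoop adds distribution, scheduling and fault tolerance on top of exactly this pattern:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Each worker counts words in its own split of the data, independently."""
    return Counter(chunk.split())

def reduce_phase(partial_counts):
    """Merge the per-chunk counts into one global result."""
    return reduce(lambda a, b: a + b, partial_counts, Counter())

def word_count(chunks):
    # In Hadoop, the chunks would be HDFS blocks spread across many machines;
    # here they are just strings processed in sequence.
    return reduce_phase(map_phase(chunk) for chunk in chunks)
```

Because the map phase has no shared state, adding servers means handing each one more chunks, which is why the approach scales by simply adding machines.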

For the brute force approach, a very powerful server with terabytes of memory is used to crunch the data as one unit. The data set is compressed in memory. For example, for a Twitter data flow that is pure text, the compression ratio may reach 100:1. A 1TB IBM SAP HANA can then load a data set of 100TB in memory and do analytics on it.

IBM has a 100TB unit for demonstration purposes.

Many other companies are filling in the gap between these two approaches by releasing all sorts of applications that address different steps of the data processing sequence plus the management and the system configuration.

Read more…

The Data Science Industry: Who Does What

Guest blog post by Laetitia Van Cauwenberge

Interesting infographic produced by an organisation offering R and data science training. Click here to see the original version. I would add that one of the core competencies of the data scientist is automating the process of data analysis, as well as creating applications that run automatically in the background, sometimes in real time, e.g.

  • to find and bid on millions of Google keywords each day (eBay and Amazon do that, and most of these keywords have little or no historical performance data, so keyword aggregation algorithms - putting keywords in buckets - must be used to find the right bid based on expected conversion rates),
  • buy or sell stocks,
  • monitor networks and generate automated alerts sent to the right people (to warn about a potential fraud, etc.)
  • or to recommend products to a user, identify optimum pricing, manage inventory, or identify fake reviews (a problem that Amazon and Yelp have failed to solve to this day)  

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Google is a prolific contributor to Open source. Here is a list of 4 open source & cloud projects from Google focusing on analytics, machine learning, data cleansing & visualization.


TensorFlow™ is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.
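The “data flow graph” idea, building a graph of operations first and pushing values through its edges later, can be illustrated with a toy evaluator. This is not TensorFlow’s API, just a minimal pure-Python sketch of the concept:

```python
class Node:
    """One operation in a toy data flow graph. Edges are the `inputs`
    references; values flow along them when the graph is evaluated."""

    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def evaluate(self, feed):
        if self.op == "const":
            return feed[self]  # placeholder nodes are fed concrete values
        args = [node.evaluate(feed) for node in self.inputs]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        raise ValueError("unknown op: " + self.op)

# Build the graph for y = (a + b) * c first; run it later with data fed in.
a, b, c = Node("const"), Node("const"), Node("const")
y = Node("mul", Node("add", a, b), c)
```

Here `y.evaluate({a: 2, b: 3, c: 4})` returns 20. Separating graph construction from execution is what lets a framework like TensorFlow place different nodes on CPUs, GPUs or mobile devices before any data flows.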


OpenRefine (formerly Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data. Please note that since October 2nd, 2012, Google is not actively supporting this project, which has been rebranded OpenRefine. Project development, documentation and promotion are now fully supported by volunteers.

Google Charts

Google Charts provides a perfect way to visualize data on your website. From simple line charts to complex hierarchical tree maps, the chart gallery provides a large number of ready-to-use chart types.

The most common way to use Google Charts is with simple JavaScript that you embed in your web page. You load some Google Chart libraries, list the data to be charted, select options to customize your chart, and finally create a chart object with an id that you choose. Then, later in the web page, you create a <div> with that id to display the Google Chart.

Automatic Statistician

Making sense of data is one of the great challenges of the information age we live in. While it is becoming easier to collect and store all kinds of data, from personal medical data, to scientific data, to public data, and commercial data, there are relatively few people trained in the statistical and machine learning methods required to test hypotheses, make predictions, and otherwise create interpretable knowledge from this data. The Automatic Statistician project aims to build an artificial intelligence for data science, helping people make sense of their data.

This article is compiled by Jogmon.

Originally posted on Data Science Central

Read more…

Webinar Series

Follow Us

@DataScienceCtrl | RSS Feeds