
Guest blog post by Eduardo Siman

Fortune 500 companies are investing staggering amounts in data visualization. Many have opted for Tableau, Qlik, MicroStrategy, and the like, but some have built their own tools in HTML5, full-stack JavaScript, Python, and R. Leading CIOs and CTOs are obsessed with being the first adopters of whatever comes next in data visualization.

The next frontier in data visualization is clearly immersive experiences. The 2014 paper "Immersive and Collaborative Data Visualization Using Virtual Reality Platforms," written by Caltech astronomers, is a large step in the right direction. In fact, I am shocked that a year later I have not seen a commercial application of this technology. You can read it here:

The key theme that I hear at technology conferences lately is the need to focus on analytics, visualization, and data exploration. The advent of big data systems such as Hadoop and Spark has made it possible - for the first time ever - to store petabytes of data on commodity hardware and to process that data, as needed, in a fault-tolerant and incredibly quick fashion. Many of us fail to understand the full implications of this inflection point in the history of computing.

Picture source: VR 2015 IEEE Virtual Reality International Conference

Storage costs are falling every year, to the point where a USB drive that could hold only a few megabytes ten years ago now holds many gigabytes. Gigabit internet is being installed in cities all over the world. Spark uses in-memory distributed computation to run roughly 10X faster than MapReduce on gigantic datasets and is already being used in production by Fortune 50 companies. Tableau, Qlik, MicroStrategy, Domo, and others have gained tremendous market share as companies that have implemented Hadoop components such as HDFS, HBase, Hive, Pig, and MapReduce start to wonder, "How can I visualize that data?"

Now think about VR - probably the hottest field in technology at this moment. It has been more than a year since Facebook bought Oculus for $2 billion, and we have seen Google Cardboard burst onto the scene. Applications from media companies like the NY Times are already becoming part of our everyday lives. This month at CES in Las Vegas, dozens of companies showcased virtual reality platforms that improve on the state of the art and allow for a motion-sickness-free immersive experience.

All of this leads to my primary hypothesis: this is a great time to start a company that provides immersive data visualization environments to businesses and consumers. I personally believe that businesses and government agencies would be the first to fully engage in this space on the data side, but there is clearly an opportunity in gaming on the consumer side.

Personally, I have been so taken by the potential of this idea that I wrote a post in this blog about the “feeling” of being in one of these immersive VR worlds.

The post describes what it would be like to experience data with not only vision, but touch and sound and even smell. 

Just think about the possibility of examining streaming datasets - currently analyzed with tools such as Storm, Kafka, Flink, and Spark Streaming - as a river flowing under you!

The strength of the water can describe the speed of the data intake, or any other variable that is represented by a flow - stock market prices come to mind. 

The possibilities for immersive data experiences are absolutely astonishing. The Caltech astronomers have already taken the first step in that direction, and perhaps there is a company out there that is already taking the next step. That being said, if this sounds like an exciting venture to you, DM me on Twitter @Namenode5 and we can talk.


Big data landscape 2016 - Infographic

Guest blog post by Laetitia Van Cauwenberge

Great infographic about the big data / analytics / data science / deep learning / BI ecosystem, created by @Mattturk, @Jimrhao and @firstmarkcap.

This infographic features the following components:

  • Infrastructure
  • Analytics
  • Applications
  • Cross-Infrastructure/Analytics
  • Open Source
  • Data Sources & APIs



There is no arguing that the marketing landscape has changed drastically over the years - gone are the days of throwing strategies at the wall to see which one sticks. Companies today are equipped with rich information about their customers and tactics, thanks to the big data boom of recent years. In fact, according to Bob Evans, senior vice president of communications for Oracle, large companies that have not put together complete plans for managing big data can expect to lose $71.2 million every year. This is just one example of the extreme impact big data is having on businesses of all sizes across all industries.

Data is everywhere, from phones and personal computers to NASA databases. Some of it may be useful for business - some of it may not. In fact, almost 80% of the information an organization collects is unstructured and remains unused without the right software. Fortunately, with the recent software boom, small businesses can reap the benefits of these vast amounts of online and offline information.

Even though most big data conversations concern companies with the resources to hire experts and research firms, there are several ways for SMBs who know where to look to gather, analyze and make sense of the information they already have. Here are a few solutions that could help small companies make better business decisions and compete with larger enterprises in the ever-evolving marketplace.


InsightSquared

The tools you are probably already using provide a rich source of information. InsightSquared is a sales performance analytics tool designed to keep you from wasting time mining and analyzing your own data in one spreadsheet after another. It connects to popular business solutions such as Salesforce, Google Analytics and QuickBooks to automatically gather and extract the information you need.


Qualtrics

If you do not currently have any rich data sources, conducting your own research may be the right move. Qualtrics software lets you run a wide range of surveys and studies to gain better insights to guide your decision-making. It offers customer, employee and market insights in real time, and adds mobile surveys, online samples and academic research.


Panorama

If you want to create dashboards and analyze corporate data without the help of the IT department, here is a solution for you. When it comes to next-generation smart data discovery, Panorama is a global leader: it is the first business intelligence software to use Automated Insights and Social Decision Making to deliver more relevant insights to users.


Tranzlogic

Once limited to companies with large resources, credit card transactions are full of unique and vital data. Customer intelligence company Tranzlogic makes this information available to small and medium establishments for a reasonable price. It provides information that merchants can use to launch better marketing campaigns, measure sales performance and write better business plans.

The transition from an instinct-driven company to an analytics-driven company is one SMB owners everywhere must embrace. New solutions are coming to the market every day, and thankfully, more and more of them are being created for the specific needs of smaller organizations. Finding the right IT solutions can help make it practical and inexpensive to benefit from the opportunity big data affords.

Originally posted on Data Science Central


Originally written by Nigel Higgs on LinkedIn Pulse.

Those of us who have been in the data sphere a while, in and around Data Governance, will have seen the pitch-decks, watched the webinars, read the blogs and attended the conferences. Some of us will have hired the staff, taken sage advice from expensive consultants and kicked off programmes to get the organisation up the Data Governance maturity curve. It's almost like a religion: Data Governance is so clearly the answer, so why can't everybody in the organisation see it? It's a no-brainer. Unfortunately, speaking as a Data Governance practitioner of far too many years, I can honestly say that I have yet to see a fully functioning enterprise-wide Data Governance implementation. I appreciate that could be down to my incompetence, but I know this is not an isolated or unique sentiment. Lots of peers, colleagues and people far smarter than me have been preaching the benefits of data administration, data architecture, data governance, or whatever it will be called next, for many years, and yet many of them struggle to come up with success stories. In fact, when pressed, they often don't have any!

So why so much denial? Einstein is reputed to have said something along the lines of 'the definition of insanity is to keep doing the same thing and expect the outcome to be different'. It is also reputed to be the most wrongly attributed and quoted platitude on the planet! But hey this is a LinkedIn post and like most of my writings nobody will read it.

What's that got to do with Data Governance? Well, 'Outside In Data Governance' is about approaching the problem from a different angle. There is little doubt that the problem Data Governance is trying to solve is very real. Very few organisations know what data they have, what it means, where it is, who is responsible for it or what its quality is.

But how to solve the problem? What I typically hear is that you need to write a policy, form committees, define processes, assign roles and then everything will be working like clockwork within months - data governed, quality data delivered to users and the organisation flying up the data maturity curve. But is that what happens? Does the story painted in the pitch-decks become reality? Sadly, it very rarely, if ever, does.

What is needed is a value-driven approach. Start with who we are doing Data Governance for: we are doing it for the business users. Then ask what they are interested in: something that makes their lives easier right now. So 'Outside In Data Governance' starts with a single business report and works back from there. Answer those fundamental questions (what, where, who and how good?) about the fields and outputs on the report and make that knowledge accessible. You could do this with a simple Excel-based approach, or maybe a Wiki or SharePoint; but pretty soon you will need some tooling to really make it scalable and responsive to increasing demands for more reports to be included in the scope. There are ways to do this in a 'proof of concept' environment and demonstrate the benefits before committing to spend. A friend of mine is fond of saying 'it's easier to ask for forgiveness than for permission'. In this case he is right. There are browser-based tools that sit outside your firewall and can offer this try-before-you-buy approach.

This is what a value-driven and lean approach is all about. If what you do at this small scale doesn't get traction, then what makes you think a £250k project will end up any better? Start small, ensure you get honest feedback from users at every iteration of your solution and focus on delivering value. If you bring the data users with you, they will demand the capability is extended. Beat the Einstein quote and start from the 'Outside In'.

Originally posted on Data Science Central


Guest blog post by Eduardo Siman

You are driving down the highway. As your gaze moves between cars and trees and open sky a filtering thought hits the periphery of your conscious mind. It feels fresh yet somehow part of a thought pattern you’ve had before. A new version, perhaps, of a past obsession. Or a new obsession, emerging from years of exploration in seemingly unrelated fields. How many of us can identify with this feeling of being gently entranced by a sequence of thoughts that had seemingly faded into the past? Perhaps they encourage us to call an old friend who has been out of touch, or to dig into a textbook from yesteryear. 

I suppose you could call it nostalgia, but that word has a connotation of sadness and loss that does not properly fit here. We don’t understand much about how the brain truly functions, let alone the intricacies of consciousness. But perhaps we can think of these ebbing and waning thought cycles as waves, even curves, on a two dimensional plane that I would like to call a Mind Map.

In discussing this abstraction with my friend @openmylab in Sydney, we have been drawing 10-year Mind Maps of our own brains in order to identify areas of intellectual passion as well as intersections between seemingly disparate curves that have become inflection points in our lives. In both cases we started with a few rules to make the map visually appealing and uncluttered. First, it would have only 3 mind curves - that is, 3 major thought patterns or areas of engagement. Second, these curves had to be somewhat sinusoidal in that they couldn't be everyday thoughts, but rather major epochs of intellectual pursuit that reached a peak, declined and once again reemerged in a surprising way. In doing this exercise we learned quite a bit about our own minds and the territory they have charted over the last ten years.

Let’s start with @openmylab’s mind map: 

Focus on curves 2 and 4 in the year 2000. We see a clear focus on learning programming languages and operating systems in curve 2 - some front-end web development, some UNIX, a bit of Java, even Dreamweaver, and the ubiquitous SQL and C#. Now look at curve 4 - here is a clear statistical computing route, perhaps what we would now call data science, with all the usual suspects: Python, R, SPSS, Matlab. Between the two curves we have the makings of a rare gem of a software engineer. Seems clear enough, right? This person should develop statistical apps and deploy them in front-end environments? Maybe.

But don't forget these are SEPARATE mind curves. They don't represent an encapsulated and refined goal or objective. These are distinct areas of interest, and if they intersect and yield a common product - that's great - but it is certainly not required. Tracing these curves we see how they transform into current areas of interest: quantum computing and the Internet of Things. It would have been quite difficult to trace the origins of these current intellectual pursuits without tracing their mental predecessors.

Now let's explore my own mind map, keeping in mind its points of commonality and distinction from @openmylab's map:

My map starts with two core academic pursuits: general relativity/quantum mechanics at the top and analytic number theory at the bottom. In addition there is another curve that starts out weak but gains steam as the years go by. This last curve is technology and all of its fascinating applications. So how does this play out? The first intersection occurs in 2007 - here is where I discover that back when Hugh Montgomery was visiting the Institute for Advanced Study in Princeton, a chance encounter with Freeman Dyson led to an unexpected revelation. It turns out that the zeros of the zeta function bear a striking similarity to the eigenvalues of a random Hermitian matrix. Did I lose you? I did.

Ok, let's move ahead. By 2012 my obsession with Riemann intersects with my technological pursuits and I discover computational number theory. Unfortunately, computers cannot prove the Riemann hypothesis, which is almost certainly true. But they could disprove it (in the very unlikely case that it isn't true)! And finally, as with @openmylab, there is a shift to applications of technology in the modern world: Big Data, IoT, fintech, machine learning. A possible point of intersection with the physics curve seems a few years away with the advent of quantum computing.

In both cases we see a current interest in so-called "hot" or "trending" technologies with a long (10 to 15 year) history of predecessor interests and an amalgamation of distinct intellectual flows. Yet the two maps are quite different - one is the story of a software engineer who loves statistics and eventually finds himself enthralled in the world of robotics, data science, and quantum computing. The other is a physics nerd who realizes that computational methods can help bring concepts to life and dives into the visualization and practical application of abstract concepts.

And of course the most critical intersection, the one between the two mind maps, occurs in an area that isn't even present on the maps: social media. It is on Twitter that the concept is exposed, developed, shared, refined, and discussed. Between Miami and Sydney, in real time. The mind map of the world has come a long way to make this interaction possible. I encourage you to create your own mind map and explore the hidden mind curves of your intellectual past. If you feel comfortable doing so, please share a picture with me @namenode5 and with @openmylab on Twitter. Happy exploring!


Six categories of Data Scientists

We are now at 9 categories after a few updates. Just like there are a few categories of statisticians (biostatisticians, statisticians, econometricians, operations research specialists, actuaries) or business analysts (marketing-oriented, product-oriented, finance-oriented, etc.) we have different categories of data scientists. First, many data scientists have a job title different from data scientist, mine for instance is co-founder. Check the "related articles" section below to discover 400 potential job titles for data scientists.

Categories of data scientists

  • Those strong in statistics: they sometimes develop new statistical theories for big data, that even traditional statisticians are not aware of. They are expert in statistical modeling, experimental design, sampling, clustering, data reduction, confidence intervals, testing, modeling, predictive modeling and other related techniques.
  • Those strong in mathematics: NSA (national security agency) or defense/military people working on big data, astronomers, and operations research people doing analytic business optimization (inventory management and forecasting, pricing optimization, supply chain, quality control, yield optimization) as they collect, analyse and extract value out of data.
  • Those strong in data engineering, Hadoop, database/memory/file systems optimization and architecture, API's, Analytics as a Service, optimization of data flows, data plumbing.
  • Those strong in machine learning / computer science (algorithms, computational complexity)
  • Those strong in business, ROI optimization, decision sciences, involved in some of the tasks traditionally performed by business analysts in bigger companies (dashboards design, metric mix selection and metric definitions, ROI optimization, high-level database design)
  • Those strong in production code development, software engineering (they know a few programming languages)
  • Those strong in visualization
  • Those strong in GIS, spatial data, data modeled by graphs, graph databases
  • Those strong in a few of the above. After 20 years of experience across many industries, big and small companies (and lots of training), I'm strong in stats, machine learning, business and mathematics, and more than just familiar with visualization and data engineering. This could happen to you as well over time, as you build experience. I mention this because so many people still think that it is not possible to develop a strong knowledge base across multiple domains that are traditionally perceived as separate (the silo mentality). Indeed, that's the very reason why data science was created.

Most of them are familiar with, or expert in, big data.

There are other ways to categorize data scientists; see for instance our article on the Taxonomy of data scientists. A different categorization would be creative versus mundane. The "creative" category has a better future, as mundane work can be outsourced (anything published in textbooks or on the web can be automated or outsourced - job security rests on how much you know that no one else knows or can easily learn). Along the same lines, we have science users (those using science, that is, practitioners; often they do not have a PhD), innovators (those creating new science, called researchers), and hybrids. Most data scientists, like geologists helping predict earthquakes or chemists designing new molecules for big pharma, are scientists who belong to the user category.

Implications for other IT professionals

You (the engineer or business analyst) probably already do a bit of data science work and already know some of what data scientists do. It might be easier than you think to become one. Check out our book (listed below in "related articles") to find out what you already know and what you need to learn to broaden your career prospects.

Are data scientists a threat to your job or career? Again, check our book (listed below) to find out what data scientists do, whether the risk to you is serious (you = the business analyst, data engineer or statistician; risk = being replaced by a data scientist who does everything), and how to mitigate that risk (learn some data scientist skills from our book if you perceive data scientists as competitors).

Originally posted on Data Science Central

Related articles


Guest blog post by Tony Agresta

Organizations are struggling with a fundamental challenge – there’s far more data than they can handle.  Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making but is this a reality for most companies?  The big data tidal wave is transforming the database management industry, employee skill sets, and business strategy as organizations race to unlock meaningful connections between disparate sources of data.

Graph databases are rapidly gaining traction in the market as an effective method for deciphering meaning, but many people outside the space are unsure of what exactly this entails. Generally speaking, graph databases store data in a graph structure where entities are connected through relationships to adjacent elements. The Web is a graph; so are your friend-of-a-friend network and the road network.

The fact is, we all encounter the principles of graph databases in many aspects of our everyday lives, and this familiarity will only increase. Consider just a few examples:

  • Facebook, Twitter and other social networks all employ graphs for more specific, relevant search functionality.  Results are ranked and presented to us to help us discover things.
  • By 2020, it is predicted that the number of connected devices will reach nearly 75 billion globally. As the Internet of Things continues to grow, it is not the devices themselves that will dramatically change the ways in which we live and work, but the connections between these devices. Think healthcare, work productivity, entertainment, education and beyond.
  • There are over 40,000 Google searches processed every second. This results in 3.5 billion searches per day and 1.2 trillion searches per year worldwide. Online search is ubiquitous in terms of information discovery. As people not only perform general Google searches, but search for content within specific websites, graph databases will be instrumental in driving more relevant, comprehensive results. This is game changing for online publishers, healthcare providers, pharma companies, government and financial services to name a few.
  • Many of the most popular online dating sites leverage graph database technology to cull through the massive amounts of personal information users share to determine the best romantic matches. Why is this?  Because relationships matter.

In the simplest terms, graph databases are all about relationships between data points. Think about the graphs we come across every day, whether in a business meeting or news report.   Graphs are often diagrams demonstrating and defining pieces of information in terms of their relations to other pieces of information.

Traditional relational databases can easily capture the relationship between two entities, but when the object is to capture "many-to-many" relationships between multiple points of data, queries take a long time to execute and maintenance is quite challenging. Suppose, for instance, you wanted to find people on a social network who attended the same university as you AND live in San Francisco AND share at least three mutual friends with you. Graph databases can execute these types of queries instantly with just a few lines of code or mouse clicks. The implications across industries are tremendous.
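To make that "many-to-many" query concrete, here is a plain-Python sketch of it over a toy graph held in adjacency sets. The names, attributes and thresholds are all invented for illustration; a real graph database would express this pattern in a query language and answer it with index-backed traversals rather than the full scan shown here.

```python
# Toy social graph: friendship edges as adjacency sets, plus profiles.
friends = {
    "ana":  {"ben", "carl", "dana", "eve"},
    "ben":  {"ana", "carl", "dana", "eve"},
    "carl": {"ana", "ben", "dana"},
    "dana": {"ana", "ben", "carl", "eve"},
    "eve":  {"ana", "ben", "dana"},
}
profile = {
    "ana":  {"university": "Stanford", "city": "San Francisco"},
    "ben":  {"university": "Stanford", "city": "San Francisco"},
    "carl": {"university": "Stanford", "city": "Portland"},
    "dana": {"university": "Stanford", "city": "San Francisco"},
    "eve":  {"university": "MIT", "city": "San Francisco"},
}

def match(person, city="San Francisco", min_mutual=3):
    """Find people with the same university AND the given city AND
    at least `min_mutual` mutual friends with `person`."""
    uni = profile[person]["university"]
    hits = []
    for other, attrs in profile.items():
        if other == person:
            continue
        if attrs["university"] != uni or attrs["city"] != city:
            continue
        # Mutual friends = intersection of the two adjacency sets.
        mutual = (friends[person] & friends[other]) - {person, other}
        if len(mutual) >= min_mutual:
            hits.append(other)
    return sorted(hits)
```

Here `match("ana")` returns `["ben", "dana"]`: carl fails the city test and eve the university test, while ben and dana each share three mutual friends with ana.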

Graph databases are gaining in popularity for a variety of reasons.  Many are schema-less allowing you to manage your data more efficiently.   Many support a powerful query language, SPARQL. Some allow for simultaneous graph search and full-text search of content stores. Some exhibit enterprise resilience, replication and highly scalable simultaneous reads and writes.  And some have other very special features worthy of further discussion.

One specialized form of graph database is an RDF triplestore.  This may sound like a foreign language, but at the root of these databases are concepts familiar to all of us.    Consider the sentence, “Fido is a dog.” This sentence structure – subject-predicate-object – is how we speak naturally and is also how data is stored in a triplestore. Nearly all data can be expressed in this simple, atomic form.  Now let’s take this one step further.  Consider the sentence, “All dogs are mammals.” Many triplestores can reason just the way humans can.   They can come to the conclusion that “Fido is a mammal.” What just happened?  An RDF triplestore used its “reasoning engine” to infer a new fact.  These new facts can be useful in providing answers to queries such as “What types of mammals exist?”  In other words, the “knowledge base” was expanded with related, contextual information.    With so many organizations interested in producing new information products, this process of “inference” is a very important aspect of RDF triplestores.  But where do the original facts come from?

Since documents, articles, books and e-mails all contain free flowing text, imagine a technology where the text can be analyzed with results stored inside the RDF triplestore for later use.  Imagine a technology that can create the semantic triples for reuse later.  The breakthrough here is profound on many levels: 1) text mining can be tightly integrated with RDF triplestores to automatically create and store useful facts and 2) RDF triplestores not only manage those facts but they also “reason” and therefore extend the knowledge base using inference.
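As a toy illustration of that text-mining-to-triplestore loop (not any particular product's API), the sketch below parses the article's example sentences into subject-predicate-object triples and applies a single inference rule to derive a new fact, the way a reasoning engine does with far richer rule sets:

```python
import re

# A minimal triplestore: facts as (subject, predicate, object) tuples,
# a naive extractor for "X is a Y" / "All Xs are Ys" sentences, and a
# one-rule reasoner that infers class membership to a fixpoint.
triples = set()

def extract(sentence):
    """Turn simple sentences into triples (illustrative patterns only)."""
    s = sentence.strip().rstrip(".").lower()
    m = re.match(r"(\w+) is a (\w+)", s)
    if m:
        triples.add((m.group(1), "is_a", m.group(2)))
        return
    m = re.match(r"all (\w+?)s? are (\w+?)s?$", s)
    if m:
        triples.add((m.group(1), "subclass_of", m.group(2)))

def infer():
    """If (x is_a A) and (A subclass_of B), add (x is_a B); repeat."""
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(triples):
            if p != "is_a":
                continue
            for (a, q, b) in list(triples):
                if q == "subclass_of" and a == o and (s, "is_a", b) not in triples:
                    triples.add((s, "is_a", b))
                    changed = True

extract("Fido is a dog.")
extract("All dogs are mammals.")
infer()
```

After running, the store holds the derived triple `("fido", "is_a", "mammal")` alongside the two extracted facts; production reasoners apply standardized rule sets (RDFS, OWL) in the same expand-to-fixpoint fashion.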

Why is this groundbreaking?  The full set of reasons extends beyond the scope of this article but here are some of the most important:

Your unstructured content is now discoverable, allowing all types of users to quickly find the exact information they are searching for. This is a monumental breakthrough, since so much of the data that organizations stockpile today exists as dark data repositories.

We said earlier that RDF triplestores are a type of graph database. By their very nature, the triples stored inside the graph database (think "facts" in the form of subject-predicate-object) are connected. "Fido is a dog. All dogs are mammals. Mammals are warm blooded. Mammals have different body temperatures, etc." The facts are linked. These connections can be measured. Some entities are more connected than others, just as some web pages are more connected than other web pages. Because of this, metrics can be used to rank the entries in a graph database. One of the most popular (and first) algorithms used at Google is PageRank, which counts the number and quality of links to a page - an important metric in assessing the importance of a web page. Similarly, facts inside a triplestore can be ranked to identify important interconnected entities, with the most connected ordered first. There are many ways to measure the entities, but this is one very popular use case.
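The ranking idea can be sketched with a bare-bones PageRank computed by power iteration. The tiny link graph and the damping factor below are invented for illustration; production systems run this at web scale with many refinements.

```python
# PageRank by power iteration: each node repeatedly shares its rank
# equally among its out-links; damping mixes in a uniform baseline.
def pagerank(links, damping=0.85, iterations=100):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {}
        for n in nodes:
            incoming = sum(rank[m] / len(links[m]) for m in nodes if n in links[m])
            new[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new
    return rank

# Invented entity graph echoing the "Fido" facts in the text.
links = {
    "fido":   {"dog"},
    "dog":    {"mammal"},
    "cat":    {"mammal"},
    "mammal": {"animal"},
    "animal": {"mammal"},  # back-link so rank does not all drain away
}
ranks = pagerank(links)
```

In this toy graph "mammal" comes out on top, since it collects links from "dog", "cat" and "animal"; total rank stays at 1 because every node redistributes all of its rank each round.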

With billions of facts referencing connected entities inside a graph database, this information source can quickly become the foundation for knowledge discovery and knowledge management.  Today, organizations can structure their unstructured data, add additional free facts from Linked Open Data sets, combine all of this with a controlled vocabulary, thesauri, taxonomies or ontologies which, to one degree or another, are used to classify the stored entities and depict relationships.  Real knowledge is then surfaced from the results of queries, visual analysis of graphs or both.  Everything is indexed inside the triplestore.

Graph databases (and specialized versions called native RDF triplestores that embody reasoning power) show great promise in knowledge discovery, data management and analysis.   They reveal simplicity within complexity.  When combined with text mining, their value grows tremendously.   As the database ecosystem continues to grow, as more and more connections are formed, as unstructured data multiplies with fury, the need to analyze text and structure results inside graph databases is becoming an essential part of the database ecosystem.  Today, these combined technologies are available and not just reserved for the big search engines providers.  It may be time for you to consider how to better store, manage, query and analyze your own data.  Graph databases are the answer.

If there is interest, you can learn more about these approaches under the resources section of


24 Uses of Statistical Modeling (Part I)

Guest blog post by Vincent Granville

Here we discuss general applications of statistical models, whether they arise from data science, operations research, engineering, machine learning or statistics. We do not discuss specific algorithms such as decision trees, logistic regression, Bayesian modeling, Markov models, data reduction or feature selection. Instead, I discuss frameworks - each one using its own types of techniques and algorithms - to solve real life problems.   

Most of the entries below are found in Wikipedia, and I have used a few definitions or extracts from the relevant Wikipedia articles, in addition to personal contributions.

Source for picture: click here

1. Spatial Models

Spatial dependency is the co-variation of properties within geographic space: characteristics at proximal locations appear to be correlated, either positively or negatively. Spatial dependency leads to the spatial auto-correlation problem in statistics since, like temporal auto-correlation, it violates standard statistical techniques that assume independence among observations.
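The standard statistic for the spatial auto-correlation just described is Moran's I. Here is a hand-rolled version on an invented four-location chain, with a binary spatial weight matrix marking which locations are neighbours:

```python
# Moran's I: +1-ish when neighbouring locations have similar values,
# negative when neighbours alternate, near 0 under spatial independence.
def morans_i(values, weights):
    """values: observations per location; weights: spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Four locations in a chain: 0-1, 1-2, 2-3 are neighbours.
chain_W = [[0, 1, 0, 0],
           [1, 0, 1, 0],
           [0, 1, 0, 1],
           [0, 0, 1, 0]]
clustered   = [10.0, 10.0, 2.0, 2.0]  # similar neighbours
alternating = [10.0, 2.0, 10.0, 2.0]  # dissimilar neighbours
```

Here `morans_i(clustered, chain_W)` is positive (1/3) while `morans_i(alternating, chain_W)` is -1, which is exactly the positive-or-negative co-variation the paragraph describes.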

2. Time Series

Methods for time series analysis may be divided into two classes: frequency-domain methods and time-domain methods. The former include spectral analysis and, more recently, wavelet analysis; the latter include auto-correlation and cross-correlation analysis. In the time domain, correlation analyses can be made in a filter-like manner using scaled correlation, thereby mitigating the need to operate in the frequency domain.

Additionally, time series analysis techniques may be divided into parametric and non-parametric methods. The parametric approaches assume that the underlying stationary stochastic process has a certain structure which can be described using a small number of parameters (for example, using an autoregressive or moving average model). In these approaches, the task is to estimate the parameters of the model that describes the stochastic process. By contrast, non-parametric approaches explicitly estimate the covariance or the spectrum of the process without assuming that the process has any particular structure.

Methods of time series analysis may also be divided into linear and non-linear, and univariate and multivariate.
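As a small time-domain illustration (Python, synthetic data): the sample auto-correlation at lag k compares the series with a copy of itself shifted by k steps, which is how periodic structure shows up in the auto-correlation analysis mentioned above.

```python
import numpy as np

def autocorr(x, lag):
    """Sample auto-correlation of a series at a given lag (time-domain method)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return (z[:-lag] * z[lag:]).sum() / (z ** 2).sum()

# a purely seasonal series that repeats every 4 steps
x = np.tile([10.0, 12.0, 14.0, 12.0], 20)

r4 = autocorr(x, 4)   # series matches itself shifted by one full period
r2 = autocorr(x, 2)   # peaks align with troughs half a period away
```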

3. Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival? Survival models are used by actuaries and statisticians, but also by marketers designing churn and user retention models.

Survival models are also used to predict time-to-event (time from becoming radicalized to turning into a terrorist, or time between when a gun is purchased and when it is used in a murder), or to model and predict decay (see section 4 in this article).
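A toy sketch of the standard Kaplan-Meier estimator (Python, with invented failure data): it answers "what proportion survives past time t?" while correctly handling censored observations, i.e. subjects still alive or machines still running when the study ends.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival curve.

    times  -- observed duration for each subject
    events -- 1 if the event (death, failure, churn) occurred, 0 if censored
    """
    surv, curve = 1.0, {}
    at_risk = len(times)
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e == 1)
        if deaths:
            surv *= 1 - deaths / at_risk              # survive this event time
            curve[t] = surv
        at_risk -= sum(1 for tt in times if tt == t)  # drop deaths and censored
    return curve

# five machines: failures at t=2, 4, 5; still running (censored) at t=3, 6
curve = kaplan_meier([2, 3, 4, 5, 6], [1, 0, 1, 1, 0])
```

The censored machines still count as "at risk" up to the time they leave the study, which is exactly the information an ordinary average would throw away.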

4. Market Segmentation

Market segmentation, also called customer profiling, is a marketing strategy which involves dividing a broad target market into subsets of consumers, businesses, or countries that have, or are perceived to have, common needs, interests, and priorities, and then designing and implementing strategies to target them. Market segmentation strategies are generally used to identify and further define the target customers, and to provide supporting data for marketing plan elements such as positioning, in order to achieve certain marketing plan objectives. Businesses may develop product differentiation strategies, or an undifferentiated approach, involving specific products or product lines, depending on the specific demand and attributes of the target segment.
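In its simplest form, segmentation is just a partition of customers by shared attributes. The Python sketch below uses invented spend thresholds and toy customers purely for illustration; in practice, segments come from clustering or market research rather than hard-coded cut-offs.

```python
def segment(customers):
    """Assign each customer to a coarse segment from spend tier and region.

    customers -- list of (name, annual_spend, region) tuples
    """
    segments = {}
    for name, spend, region in customers:
        tier = ("premium" if spend >= 1000 else
                "standard" if spend >= 200 else
                "budget")
        segments.setdefault((tier, region), []).append(name)
    return segments

segs = segment([("Acme", 5000, "EU"), ("Bob", 150, "US"), ("Cara", 450, "US")])
```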

5. Recommendation Systems

Recommender systems or recommendation systems (sometimes replacing "system" with a synonym such as platform or engine) are a subclass of information filtering systems that seek to predict the 'rating' or 'preference' that a user would give to an item.
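A bare-bones sketch of the rating-prediction task (Python, made-up ratings): user-based collaborative filtering predicts a user's rating of an unseen item as a similarity-weighted average of other users' ratings of that item. Real recommenders add mean-centering, regularization, and matrix factorization on top of this.

```python
import math

def cosine(a, b):
    """Cosine similarity between two users over the items both have rated."""
    common = [i for i in a if i in b]
    if not common:
        return 0.0
    num = sum(a[i] * b[i] for i in common)
    den = (math.sqrt(sum(a[i] ** 2 for i in common)) *
           math.sqrt(sum(b[i] ** 2 for i in common)))
    return num / den

def predict(user, item, ratings):
    """Predict user's rating of item as a similarity-weighted average of the
    ratings other users gave that item (user-based collaborative filtering)."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other != user and item in theirs:
            w = cosine(ratings[user], theirs)
            num += w * theirs[item]
            den += abs(w)
    return num / den if den else None

ratings = {
    "ann": {"m1": 5, "m2": 4},
    "ben": {"m1": 5, "m2": 4, "m3": 5},
    "cat": {"m1": 1, "m2": 1, "m3": 2},
}
p = predict("ann", "m3", ratings)   # ann has not rated m3 yet
```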

6. Association Rule Learning

Association rule learning is a method for discovering interesting relations between variables in large databases. For example, the rule { onions, potatoes } ==> { burger }  found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. In fraud detection, association rules are used to detect patterns associated with fraud. Linkage analysis is performed to identify additional fraud cases: if a credit card transaction from user A was used to make a fraudulent purchase at store B, then by analyzing all transactions from store B we might find another user C with fraudulent activity. 
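Sticking with the supermarket rule above, a minimal Python sketch computes the two standard measures of a rule's interestingness: support (how often the full itemset appears) and confidence (how often the right-hand side follows the left-hand side). The baskets are invented.

```python
def rule_stats(transactions, lhs, rhs):
    """Support and confidence of the rule lhs => rhs over a list of baskets."""
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    support = n_both / len(transactions)        # frequency of the full itemset
    confidence = n_both / n_lhs if n_lhs else 0.0  # P(rhs | lhs)
    return support, confidence

baskets = [
    ["onions", "potatoes", "burger"],
    ["onions", "potatoes", "burger", "beer"],
    ["onions", "potatoes"],
    ["milk", "bread"],
]
support, confidence = rule_stats(baskets, ["onions", "potatoes"], ["burger"])
```

Algorithms such as Apriori and FP-Growth exist to enumerate rules like this efficiently when the number of items is large.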

7. Attribution Modeling

An attribution model is the rule, or set of rules, that determines how credit for sales and conversions is assigned to touchpoints in conversion paths. For example, the Last Interaction model in Google Analytics assigns 100% credit to the final touchpoints (i.e., clicks) that immediately precede sales or conversions. Macro-economic models use long-term, aggregated historical data to assign, for each sale or conversion, an attribution weight to a number of channels. These models are also used for advertising mix optimization.
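Two of these rules sketched in Python on an invented conversion path: Last Interaction (all credit to the final touchpoint, as in the Google Analytics example above) and, for contrast, a linear rule that splits credit evenly.

```python
def last_interaction(path, value):
    """Assign 100% of a conversion's value to the final touchpoint."""
    return {path[-1]: value}

def linear(path, value):
    """Split the conversion's value evenly across every touchpoint."""
    share = value / len(path)
    credit = {}
    for touch in path:
        credit[touch] = credit.get(touch, 0.0) + share
    return credit

path = ["display ad", "email", "search click"]
li = last_interaction(path, 100.0)   # all 100 to 'search click'
lin = linear(path, 100.0)            # one third of the credit per touchpoint
```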

8. Scoring

A scoring model is a special kind of predictive model. Predictive models can predict defaulting on loan payments, risk of accident, client churn or attrition, or the chance of buying a good. Scoring models typically use a logarithmic scale (each additional 50 points in your score reducing the risk of defaulting by 50%), and are based on logistic regression and decision trees, or a combination of multiple algorithms. Scoring technology is typically applied to transactional data, sometimes in real time (credit card fraud detection, click fraud).
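A sketch of that logarithmic scale in Python. The anchor score, anchor odds, and "points to halve the odds" below are illustrative choices rather than an industry standard; the sketch halves the odds of default with every step, which for small probabilities roughly halves the risk.

```python
import math

def score(p_default, base_score=600, base_odds=1.0, pdo=50):
    """Map a default probability to a points score on a logarithmic scale.

    Illustrative anchors: a score of `base_score` corresponds to `base_odds`
    odds of default, and every `pdo` additional points halves those odds.
    """
    odds = p_default / (1 - p_default)          # odds of defaulting
    return base_score + pdo * math.log(base_odds / odds, 2)

s_even = score(0.5)      # 1:1 odds -> anchor score of 600
s_half = score(1 / 3)    # odds 0.5 -> 650: 50 more points, half the odds
```

The probability fed into this mapping would typically come from a logistic regression or a tree ensemble, as described above.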

9. Predictive Modeling

Predictive modeling leverages statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modeling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive models are often used to detect crimes and identify suspects after the crime has taken place. They may also be used for weather forecasting, to predict stock market prices, or to predict sales, incorporating time series or spatial models. Neural networks, linear regression, decision trees and naive Bayes are some of the techniques used for predictive modeling. They are associated with creating a training set, cross-validation, and model fitting and selection.

Some predictive systems do not use statistical models, but are data-driven instead. See example here

10. Clustering

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Unlike supervised classification (below), clustering does not use training sets, though there are some hybrid implementations, called semi-supervised learning.
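A self-contained sketch of the best-known clustering algorithm, k-means (Lloyd's algorithm), in Python on toy 2-D points. Note that no labels appear anywhere, which is what "no training set" means in practice.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm): alternate assignment and update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                 # assign each point to its nearest centre
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):   # move centres to cluster means
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, clusters

points = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.0, 9.9)]
centers, clusters = kmeans(points, 2)   # recovers the two obvious groups
```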

11. Supervised Classification

Supervised classification, also called supervised learning, is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called label, class or category). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. 

Examples, with an emphasis on big data, can be found on DSC. Clustering algorithms are notoriously slow, though a very fast technique known as indexation or automated tagging will be described in Part II of this article.
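The inference step can be illustrated with the simplest possible supervised classifier, 1-nearest-neighbour (Python, toy labeled data): here the "inferred function" is just "copy the label of the closest training example".

```python
def nearest_neighbour(train, x):
    """1-nearest-neighbour classification: return the label of the training
    example closest (in squared Euclidean distance) to the new input x."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    features, label = min(train, key=lambda pair: dist2(pair[0], x))
    return label

# tiny labeled training set: (feature vector, class)
train = [((1.0, 1.0), "spam"), ((1.2, 0.9), "spam"),
         ((6.0, 7.0), "ham"), ((6.5, 6.8), "ham")]

cls = nearest_neighbour(train, (6.2, 7.1))
```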

12. Extreme Value Theory

Extreme value theory or extreme value analysis (EVA) is a branch of statistics dealing with extreme deviations from the median of probability distributions. It seeks to assess, from a given ordered sample of a given random variable, the probability of events that are more extreme than any previously observed - for instance, floods that occur once every 10, 100, or 500 years. These models have recently performed poorly at predicting catastrophic events, resulting in massive losses for insurance companies. I prefer Monte-Carlo simulations, especially if your training data is very large. This will be described in Part II of this article.
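For contrast with a full EVT fit, here is the purely empirical estimate (Python, synthetic flood data): the return period of a threshold is the reciprocal of its observed exceedance frequency. This only works inside the observed range; extreme value theory exists precisely to extrapolate beyond the largest observation, typically by fitting a GEV or generalized Pareto tail.

```python
def return_period(annual_maxima, threshold):
    """Empirical return period (in years) of exceeding `threshold`:
    the reciprocal of the observed exceedance probability."""
    exceed = sum(1 for x in annual_maxima if x > threshold)
    if exceed == 0:
        raise ValueError("threshold never exceeded in the sample")
    return len(annual_maxima) / exceed

# 50 years of synthetic annual flood peaks (metres)
levels = [3.0 + 0.1 * i for i in range(50)]
t = return_period(levels, 7.45)   # exceeded 5 times in 50 years
```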

Click here to read Part II.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Analyse TB data using network analysis

Guest blog post by Tim Groot



In a very interesting publication by Jose A. Dianes on tuberculosis (TB) cases per country, it was shown that dimension reduction can be achieved using Principal Component Analysis (PCA) and cluster analysis. By showing that the first principal component corresponded mainly to the mean number of TB cases, and the second mainly to the change over the time span used, it became clear that the first two PCA components have a real physical meaning. This is not necessarily the case, since PCA constructs an orthogonal basis by making linear combinations of the original measurements, with the eigenvectors ordered in descending order of explained variance. Moreover, this method may not work with data having different types of variables. The scripts in this article are written in R.


Finding correlations in the time trend is a better way to monitor the correspondence between countries. Correlation shows similarities in the trend between countries and is sensitive to deviations from the main trend. Grouping countries based on similarities can give insight into the mechanism behind the trend and opens a way to find effective measures against the illness, or to uncover a hidden factor with a causal relation that has not yet been identified.

The necessary libraries to use are:

library(RCurl) # reading data

library(igraph) # network plot


Load the required data and process the existing-cases file analogously to Jose A. Dianes.

existing_cases_file <-


existing_df <- read.csv(text = existing_cases_file, row.names=1, stringsAsFactors=F)

existing_df[c(1,2,3,4,5,6,15,16,17,18)] <-

  lapply( existing_df[c(1,2,3,4,5,6,15,16,17,18)],

          function(x) { as.integer(gsub(',', '', x) )})

countries <- rownames(existing_df)

meantb <- rowMeans(existing_df)

Create the link table from the correlation matrix, filtering out the duplicate pairs and the 1's on the diagonal; the lower.tri() function is used for this.

cortb <- cor(t(existing_df))

cortb <- cortb*lower.tri(cortb)

links <- data.frame()

for(i in 1:length(countries)){

  links <- rbind(links, data.frame(countries, countries[i], cortb[,i], meantb,

                                   meantb[i]))

}

names(links) <- c('c1','c2','cor','meanc1','meanc2')

links <- links[links$cor != 0,]

A network graph of this link table will result in one uniform group because each country is still linked to all others.

g <- graph_from_data_frame(links, directed=FALSE)


The trend is formed from a period of only 18 years. Correlation may therefore not be a strong discriminator between the trends of the countries; for a longer span of years, correlation will perform better as a separator. The trends in this data are generally the same: they are decreasing. Therefore a high threshold for the level of correlation is used (0.90).

The link table is filtered for correlations larger than 0.9 and a network graph is created.

links <- links[links$cor > 0.9,]


g <- graph_from_data_frame(links, directed=FALSE)


fgc <- cluster_fast_greedy(g)

length(fgc)

## [1] 5

The countries now appear to split up into 5 groups: three large clusters and two small ones.


By plotting time-trends of the groups, a grouping in the trends is visible.

trendtb <- data.frame(t(existing_df), check.names=FALSE)

for(group in 1:length(fgc)){

  sel <- trendtb[,as.character(unlist(fgc[group]))]

  plot(x=NA, y=NA , xlim=c(1990,2007), ylim=range(sel), main= paste("Group", group),

     xlab='Year', ylab = 'TB cases per 100K distribution')

  for(i in names(sel)){

    points(1990:2007,sel[,i],type='o', cex = .1, lty = "solid",

           col = which(i == names(sel)))

  }

  if(group %in% c(4,5)) legend('topright', legend = names(sel), lwd=2,

                               col = seq_along(names(sel)))

}

In groups 4 and 5, rather particular trends are selected. Group 4 consists of countries with a maximum number of TB cases in the period 1996 to 2003, and the two countries in group 5 show a dramatic drop in TB cases around 1996, followed by a large increase. This latter trend would need to be explained by metadata about the dataset.


## $`4`

## [1] "Lithuania" "Zambia"    "Belarus"   "Bulgaria"  "Estonia" 


## $`5`

## [1] "Namibia"  "Djibouti"

Group 4 consists mostly of former USSR countries; Zambia is the exception. The trend for the former could be explained by social problems during the collapse of the USSR, while for Zambia it would have to be explained by political changes as well.

The range in TB cases in the first three graphs is too large to see the similarities within the groups. Dividing the curves by their mean gives a better view of the trend.

for(group in 1:3){

  sel <- trendtb[,as.character(unlist(fgc[group]))]

  selavg <- meantb[as.character(unlist(fgc[group]))]

  plot(x=NA, y=NA , xlim=c(1990,2007), ylim=range(t(sel)/selavg),

       main= paste("Group", group), xlab='Year',

       ylab = 'TB cases per 100K distribution')

  for(i in names(sel)){

    points(1990:2007,sel[,i]/selavg[i],type='o', cex = .1, lty = "solid",

           col= which(i == names(sel)))

  }

}

Now the difference between groups 1 and 3 is better visible: group 1 contains countries tending towards sigmoid trends, while group 3 consists of countries with a steadier decay in TB cases. Countries in group 2 show an increasing sigmoid-like trend.

Western and developing countries are mixed within the groups. For western countries the number of TB cases is low, so one case more or less may flip the trend; again, a better separation would be found over a larger span of time.


## $`1`

##  [1] "Albania"                          "Argentina"                      

##  [3] "Bahrain"                          "Bangladesh"                     

##  [5] "Bosnia and Herzegovina"           "China"                          

##  [7] "Costa Rica"                       "Croatia"                        

##  [9] "Czech Republic"                   "Korea, Dem. Rep."               

## [11] "Egypt"                            "Finland"                        

## [13] "Germany"                          "Guinea-Bissau"                  

## [15] "Honduras"                         "Hungary"                        

## [17] "India"                            "Indonesia"                      

## [19] "Iran"                             "Japan"                           

## [21] "Kiribati"                         "Kuwait"                         

## [23] "Lebanon"                          "Libyan Arab Jamahiriya"         

## [25] "Myanmar"                          "Nepal"                          

## [27] "Pakistan"                         "Panama"                         

## [29] "Papua New Guinea"                 "Philippines"                    

## [31] "Poland"                           "Portugal"                       

## [33] "Puerto Rico"                      "Singapore"                      

## [35] "Slovakia"                         "Syrian Arab Republic"           

## [37] "Thailand"                         "Macedonia, FYR"                 

## [39] "Turkey"                           "Tuvalu"                         

## [41] "United States of America"         "Vanuatu"                        

## [43] "West Bank and Gaza"               "Yemen"                          

## [45] "Micronesia, Fed. Sts."            "Saint Vincent and the Grenadines"

## [47] "Viet Nam"                         "Eritrea"                        

## [49] "Jordan"                           "Tunisia"                        

## [51] "Monaco"                           "Niger"                          

## [53] "New Caledonia"                    "Guam"                           

## [55] "Timor-Leste"                      "Iraq"                           

## [57] "Mauritius"                        "Afghanistan"                     

## [59] "Australia"                        "Cape Verde"                     

## [61] "French Polynesia"                 "Malaysia"                       


## $`2`

##  [1] "Botswana"                 "Burundi"                

##  [3] "Cote d'Ivoire"            "Ethiopia"               

##  [5] "Guinea"                   "Rwanda"                 

##  [7] "Senegal"                  "Sierra Leone"           

##  [9] "Suriname"                 "Swaziland"              

## [11] "Tajikistan"               "Zimbabwe"               

## [13] "Azerbaijan"               "Georgia"                

## [15] "Kenya"                    "Kyrgyzstan"             

## [17] "Russian Federation"       "Ukraine"                

## [19] "Tanzania"                 "Moldova"                

## [21] "Burkina Faso"             "Congo, Dem. Rep."       

## [23] "Guyana"                   "Nigeria"                

## [25] "Chad"                     "Equatorial Guinea"      

## [27] "Mozambique"               "Uzbekistan"             

## [29] "Kazakhstan"               "Algeria"                

## [31] "Armenia"                  "Central African Republic"


## $`3`

##  [1] "Barbados"                 "Belgium"                 

##  [3] "Bermuda"                  "Bhutan"                 

##  [5] "Bolivia"                  "Brazil"                 

##  [7] "Cambodia"                 "Cayman Islands"         

##  [9] "Chile"                    "Colombia"               

## [11] "Comoros"                  "Cuba"                   

## [13] "Dominican Republic"       "Ecuador"                

## [15] "El Salvador"              "Fiji"                   

## [17] "France"                   "Greece"                 

## [19] "Haiti"                    "Israel"                 

## [21] "Laos"                     "Luxembourg"             

## [23] "Malta"                    "Mexico"                 

## [25] "Morocco"                  "Netherlands"            

## [27] "Netherlands Antilles"     "Nicaragua"              

## [29] "Norway"                   "Peru"                   

## [31] "San Marino"               "Sao Tome and Principe"  

## [33] "Slovenia"                 "Solomon Islands"        

## [35] "Somalia"                  "Spain"                  

## [37] "Switzerland"              "United Arab Emirates"   

## [39] "Uruguay"                  "Antigua and Barbuda"    

## [41] "Austria"                  "British Virgin Islands" 

## [43] "Canada"                   "Cyprus"                 

## [45] "Denmark"                  "Ghana"                  

## [47] "Guatemala"                "Ireland"                

## [49] "Italy"                    "Jamaica"                

## [51] "Mongolia"                 "Oman"                   

## [53] "Saint Lucia"              "Seychelles"             

## [55] "Turks and Caicos Islands" "Virgin Islands (U.S.)"  

## [57] "Venezuela"                "Maldives"               

## [59] "Trinidad and Tobago"      "Korea, Rep."            

## [61] "Andorra"                  "Anguilla"               

## [63] "Belize"                   "Mali"

Read more…

7 Ingredients for Great Visualizations

Great article by Bernard Marr. Here we present a summary. A link to the full article is provided below.

Source for picture: click here

1. Identify your target audience. 

2. Customize the data visualization. 

3. Give the data visualization a clear label or title. 


4. Link the data visualization to your strategy. 

5. Choose your graphics wisely. 

6. Use headings to make the important points stand out. 


7. Add a short narrative where appropriate.  

Click here to read full article. For other articles on visualization science and art, as well as best practices, click here.


Read more…

Contributed by David Comfort. David took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Sept 23 and Dec 18, 2015. This post was based on his first class project (due in the second week of the program).


Hans Rosling gave a famous TED talk in 2007, “The Best Stats You’ve Ever Seen”. Rosling is a Professor of International Health at the Karolinska Institutet in Stockholm, Sweden and founded the Gapminder Foundation.


To visualise his talk, he and his team at Gapminder developed animated bubble charts, also known as motion charts. Gapminder developed the Trendalyzer data visualization software, which was subsequently acquired by Google in 2007.

The Gapminder Foundation is a Swedish NGO which promotes sustainable global development by increased use and understanding of statistics about social, economic and environmental development.

The purpose of my data visualization project was to visualize data about long-term economic, social and health statistics. Specifically, I wanted to extract data sets from Gapminder using an R package, googlesheets, munge these data sets, and combine them into one dataframe, and then use the GoogleVis R package to visualize these data sets using a Google Motion chart.

The finished product can be viewed at this page and a screen capture demonstrating the features of the interactive data visualization is at:


World Bank Databank


2) Data sets used for Data Visualization

The following Gapminder datasets, several of which were adapted from the World Bank Databank, were accessed for data visualization:

  • Child mortality - The probability that a child born in a specific year will die before reaching the age of five if subject to current age-specific mortality rates. Expressed as a rate per 1,000 live births.
  • Democracy score  - Overall polity score from the Polity IV dataset, calculated by subtracting an autocracy score from a democracy score. It is a summary measure of a country's democratic and free nature. -10 is the lowest value, 10 the highest.
  • Income per person - Gross Domestic Product per capita by Purchasing Power Parities (in international dollars, fixed 2011 prices). The inflation and differences in the cost of living between countries has been taken into account.
  • Life expectancy at birth - Life expectancy at birth (years) with projections. The average number of years a newborn child would live if current mortality patterns were to stay the same. Observations after 2010 are based on projections by the UN.
  • Population, Total - Population is available in Gapminder World in the indicator “population, total”, which contains observations between 1700 to 2012. It is also available in the indicator “population, with projections”, which also includes projections up to 2100 as well as data for several countries before 1700.
  • Country and Dependent Territories Lists with UN Regional Codes - These lists are the result of merging data from two sources, the Wikipedia ISO 3166-1 article for alpha and numeric country codes, and the UN Statistics site for countries' regional, and sub-regional codes.

3) Reading in the Gapminder data sets using Google Sheets R Package

  • The googlesheets R package allows one to access and manage Google spreadsheets from within R.
  • googlesheets Basic Usage (vignette)
  • Reference manual
  • The registration functions gs_title(), gs_key(), and gs_url() return a registered sheet as a googlesheets object, which is the first argument to practically every function in this package.

First, install and load googlesheets and dplyr, and access a Gapminder Google Sheet by the URL and get some information about the Google Sheet:

Note: Setting the parameter lookup=FALSE will block authenticated API requests.

A utility function, extract_key_from_url(), helps you get and store the key from a browser URL:

You can access the Google Sheet by key:

Once one has registered a worksheet, you can consume the data in a specific worksheet (“Data” in our case) within the Google Sheet using the gs_read() function (combining the statements with a dplyr pipe, and using check.names=FALSE so that it handles the integer column names correctly and doesn’t append an “X” to each column name):

You can also target specific cells via the range = argument. The simplest usage is to specify an Excel-like cell range, such as range = “D12:F15” or range = “R1C12:R6C15”.

But a problem arises since check.names=FALSE does not work with this statement (problem with package?). So a workaround would be to pipe the data frame through the dplyr rename function:

However, for our purposes, we will ingest the entire worksheet rather than target specific cells.

Let’s look at the data frame. We need to change the name of the first column from “GDP per capita” to “Country”.


Let’s download the rest of the datasets.

4) Read in the Countries Data Set

We want to segment the countries in the data sets by region and sub-region. However, the Gapminder data sets do not include these variables. Therefore, one can download the ISO-3166-Countries-with-Regional-Codes data set from github which includes the ISO country code, country name, region, and sub-region.

Use rCurl to read in directly from Github and make sure you read in the “raw” file, rather than Github’s display version.

Note: The Gapminder data sets do not include ISO country codes, so I had to clean the countries data set with the corresponding country names used in the Gapminder data sets.

5) Reshaping the Datasets

We need to reshape the data frames. For the purposes of reshaping our data frames, we can divide the variables into two groups: identifier, or id, variables and measured variables. In our case, id variables include the Country and Years, whereas the measured variables are the GDP per capita, life expectancy, etc..

We can further abstract and “say there are only id variables and a value, where the id variables also identify what measured variable the value represents.”

For example, we could represent a data set, which has two id variables, subject and time:


where each row represents one observation of one variable. This operation is called melting (and can be achieved using the melt function of the reshape package).

Compared to the former table, the latter table has a new id variable “variable”, and a new column “value”, which represents the value of that observation. See the paper, Reshaping data with the reshape package, by Hadley Wickham, for more clarification.

We now have the data frame in a form in which there are only id variables and a value.

Let’s reshape the data frames ( child_mortality, democracy_score, life_expectancy, population):

6) Combining the Datasets

Whew, now we can finally combine all the datasets using a left_join:

7) What is GoogleVis

googleVis is an R package that provides an interface between R and the Google Charts API, so that interactive charts (such as motion charts) can be built directly from R data frames.

An overview of a GoogleVis Motion Chart

8) Implementing GoogleVis

Implementing GoogleVis is fairly easy, and the design of the visualization functions is generic.

The name of the visualization function is gvis + ChartType. So for the Motion Chart we have:

9) Parameters for GoogleVis

  • data: a data.frame.
  • idvar: the id variable, “Country” in our case.
  • timevar: the time variable for the plot, “Years” in our case.
  • xvar: column name of a numerical vector in data to be plotted on the x-axis.
  • yvar: column name of a numerical vector in data to be plotted on the y-axis.
  • colorvar: column name of data that identifies bubbles in the same series; we will use “Region” in our case.
  • sizevar: values in this column are mapped to actual pixel values using the sizeAxis option; we will use this for “Population”.

The output of a googleVis function (gvisMotionChart in our case) is a list of lists (a nested list) containing information about the chart type, chart id, and the html code in a sub-list split into header, chart, caption and footer.

10) Data Visualization

Let’s plot the GoogleVis Motion Chart.

Gapminder Data Visualization using GoogleVis

Note: I had an issue with embedding the GoogleVis motion plot in a WordPress blog post, so I will subsequently feature the GoogleVis interactive motion plot on a separate page at


Here is a screen recording of the data visualization produced by GoogleVis of the Gapminder datasets:


11) What are the Key Lessons of the Gapminder Data Visualization Project?

  • Hans Rosling and Gapminder have made a big impact on data visualization and how data visualization can inform the public about wide misperceptions.
  • The googlesheets R package allows for easy extraction of data sets which are stored in Google Sheets.
  • The different steps involved in reshaping and joining multiple data sets can be a little cumbersome. We could have used dplyr pipes more.
  • It would be a good practice for Gapminder to include the ISO country code in each of their data sets.
  • There is a need for a data set which lists country names, their ISO codes, as well as other categorical information such as Geographic Regions, Income groups, Landlocked, G77, OECD, etc.
  • It is relatively easy to implement a GoogleVis motion chart using R. However, it is difficult to change the configuration options. For instance, I was unable to make a simple change to the chart by adding a chart title.
  • Google Motion Charts provide a great way to visualize several variables at once and be a great teaching tool for all sorts of data.
  • For instance, one can visualize the Great Divergence between the Western world and China, India or Japan, whereby the West had much faster economic growth, with attendant increases in life expectancy and other health indicators.



Originally posted on Data Science Central

Read more…

How to make any plot with ggplot2?

Guest blog post by Selva Prabhakaran

ggplot2 is the most elegant and aesthetically pleasing graphics framework available in R, and it has a nicely planned structure. This tutorial focuses on exposing the underlying structure you can use to make any ggplot. However, the way you make plots in ggplot2 is very different from base graphics, which makes the learning curve steep. So leave what you know about base graphics behind and follow along: you are just 5 steps away from cracking the ggplot puzzle.

The distinctive feature of the ggplot2 framework is the way you make plots through adding ‘layers’. The process of making any ggplot is as follows.

1. The Setup

First, you need to tell ggplot what dataset to use. This is done using the ggplot(df) function, where df is a dataframe that contains all features needed to make the plot. This is the most basic step. Unlike base graphics, ggplot doesn’t take vectors as arguments.

Optionally you can add whatever aesthetics you want to apply to your ggplot (inside aes() argument) - such as X and Y axis by specifying the respective variables from the dataset. The variable based on which the color, size, shape and stroke should change can also be specified here itself. The aesthetics specified here will be inherited by all the geom layers you will add subsequently.

If you intend to add more layers later on, may be a bar chart on top of a line graph, you can specify the respective aesthetics when you add those layers.

Below, I show a few examples of how to set up a ggplot using the diamonds dataset that comes with ggplot2 itself. However, no plot will be printed until you add the geom layers.

ggplot(diamonds)  # if only the dataset is known
ggplot(diamonds, aes(x=carat))  # if only X-axis is known
ggplot(diamonds, aes(x=carat, y=price))  # if both X and Y axes are fixed for all layers
ggplot(diamonds, aes(x=carat, color=cut))  # color will now vary based on `cut`

The aes argument stands for aesthetics. ggplot2 considers the X and Y axis of the plot to be aesthetics as well, along with color, size, shape, fill etc. If you want to have the color, size etc fixed (i.e. not vary based on a variable from the dataframe), you need to specify it outside the aes(), like this.

ggplot(diamonds, aes(x=carat), color="steelblue")

See this color palette for more colors.

2. The Layers

The layers in ggplot2 are also called ‘geoms’. Once the base setup is done, you can append the geoms one on top of the other. The documentation provides a comprehensive list of all available geoms.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + geom_smooth() # Adding scatterplot geom (layer1) and smoothing geom (layer2).

We have added two layers (geoms) to this plot: geom_point() and geom_smooth(). Since the X axis, Y axis, and color were defined in the ggplot() setup itself, these two layers inherited those aesthetics. Alternatively, you can specify those aesthetics inside the geom layer as well, as shown below.

ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price, color=cut)) # Same as above but specifying the aesthetics inside the geoms.

Notice the X and Y axes and how the color of the points varies based on the value of the cut variable. The legend was added automatically. I would like to propose a change though. Instead of having a separate smoothing line for each level of cut, I want to integrate them all into one line. How to do that? Removing the color aesthetic from the geom_smooth() layer would accomplish that.

library(ggplot2)
ggplot(diamonds) + geom_point(aes(x=carat, y=price, color=cut)) + geom_smooth(aes(x=carat, y=price))  # Remove color from geom_smooth

ggplot(diamonds, aes(x=carat, y=price)) + geom_point(aes(color=cut)) + geom_smooth() # same but simpler

Here is a quick challenge for you. Can you make the shape of the points vary with the color feature?

Though the setup took us quite a bit of code, adding further complexity such as extra layers and a distinct color for each cut was easy. Imagine how much code you would have had to write to make this in base graphics. Thanks, ggplot2!

# Answer to the challenge.
ggplot(diamonds, aes(x=carat, y=price, color=cut, shape=color)) + geom_point()

3. The Labels

Now that you have drawn the main parts of the graph, you might want to add the plot’s main title and perhaps change the X and Y axis titles. This can be accomplished using the labs layer, meant for specifying the labels. However, manipulating the size and color of the labels is the job of the ‘Theme’.

library(ggplot2)
gg <- ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + labs(title="Scatterplot", x="Carat", y="Price")  # add axis labels and plot title
print(gg)

The plot’s main title is added and the X and Y axis labels capitalized.

Note: If you are showing a ggplot inside a function, you need to explicitly save it to an object and then print it using print(gg), like we just did above.

4. The Theme

Almost everything is set, except that we want to increase the size of the labels and change the legend title. Adjusting the size of labels can be done using the theme() function by setting plot.title, axis.text.x and axis.text.y. They need to be specified inside element_text(). If you want to remove any of them, set it to element_blank() and it will vanish entirely.

Adjusting the legend title is a bit tricky. If your legend is for a color attribute that varies based on a factor, you need to set the name using scale_color_discrete(), where the color part refers to the color attribute and the discrete part is because the legend is based on a factor variable.

gg1 <- gg + theme(plot.title=element_text(size=30, face="bold"),
                  axis.text.x=element_text(size=15),
                  axis.text.y=element_text(size=15),
                  axis.title.x=element_text(size=25),
                  axis.title.y=element_text(size=25)) +
  scale_color_discrete(name="Cut of diamonds")  # adjust title and axis text, change legend title

print(gg1) # print the plot

If the legend shows a shape attribute based on a factor variable, you need to change it using scale_shape_discrete(name="legend title"). Had it been a continuous variable, you would use scale_shape_continuous(name="legend title") instead.

So now, can you guess the function to use if your legend is based on a fill attribute of a continuous variable?

The answer is scale_fill_continuous(name="legend title").

5. The Facets

In the previous chart, you had the scatterplot for all different values of cut plotted in the same chart. What if you want one chart for one cut?

gg1 + facet_wrap( ~ cut, ncol=3)  # columns defined by 'cut'

facet_wrap(formula) takes a formula as its argument. The item on the RHS of the formula defines the columns; the item on the LHS defines the rows.

gg1 + facet_wrap(color ~ cut)  # row: color, column: cut

In facet_wrap, the scales of the X and Y axes are fixed by default to accommodate all points. This makes comparison of attributes meaningful because they are on the same scale. However, you can let the scales roam free, making the charts look more evenly distributed, by setting the argument scales="free".

gg1 + facet_wrap(color ~ cut, scales="free")  # row: color, column: cut

For comparison purposes, you can put all the plots in a grid as well using facet_grid(formula).

gg1 + facet_grid(color ~ cut)   # In a grid

Note that the headers for the individual plots are gone, leaving more space for the plotting area.

This post was originally published as part of a longer piece; the full post is available in this ggplot2 tutorial.

Read more…

Exploring the VW scandal with graph analysis

Guest blog post by Zygimantas Jacikevicius

Our software partners Linkurious were excited to dive into the VW emissions scandal case we prepared, and they documented their findings in this great article. Keep reading, or visit the source of this article here.


The German car manufacturer admitted to cheating emissions tests. Which companies are impacted, directly or indirectly, by these revelations? James Phare from Data To Value used graph analysis and open-source information to unravel the impact of the VW scandal on its customers, partners and shareholders.

 The VW scandal makes headlines


Since September 18 when VW admitted to modifying its cars to disguise their true emissions performances, the news has made headlines worldwide. In all this noise, it’s difficult to piece together the important information necessary to assess the event’s impact.

Our partner Data To Value helps organisations turn data into their most valuable asset. Using a combination of semantic analysis and Linkurious, the Data To Value team was able to investigate the ramifications of the VW scandal and show how it impacts the company's customers, partners, suppliers and shareholders.

To do this, Data To Value used open-source intelligence (OSINT) from sources like newspapers, social media and blogs. Semantic analysis helps analyse that vast volume of unstructured or semi-structured data to extract:

  • relevant entities (persons, companies, locations, etc.)
  • sentiment (is a blog post sympathetic or aggressive?)
  • topics (is this article related to the production of apples or to the company Apple?)
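As a rough illustration of that extraction step, here is a toy sketch in Python; the entity list, the negative-word list and the scoring rule are hypothetical stand-ins, not how a real semantic analysis pipeline works:

```python
# Toy entity/sentiment extraction over a snippet of text.
# KNOWN_ENTITIES and NEGATIVE_WORDS are illustrative placeholders.
KNOWN_ENTITIES = {"Volkswagen", "Audi", "Skoda", "Seat", "Bosch"}
NEGATIVE_WORDS = {"cheating", "scandal", "illegal", "disguise"}

def analyze(text):
    # Crude tokenization: strip the punctuation we care about, split on whitespace.
    tokens = text.replace(",", " ").replace(".", " ").split()
    entities = sorted({t for t in tokens if t in KNOWN_ENTITIES})
    negatives = sum(1 for t in tokens if t.lower() in NEGATIVE_WORDS)
    return {"entities": entities,
            "sentiment": "negative" if negatives > 0 else "neutral"}

print(analyze("Volkswagen admitted cheating emissions tests, a scandal for Audi too."))
```

A production pipeline would use a trained named-entity recognizer and sentiment classifier instead, but the shape of the output (entities plus a sentiment label per document) is the same.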


The resulting data can be represented in various ways. For example, Data To Value built a visualization to show the evolution of the general sentiment toward VW during the crisis.


This approach is useful but doesn’t help us understand how the various entities linked to VW (directly and indirectly) are impacted by the scandal. Traditional tools and techniques are ill-adapted to that type of question. That’s where graph analysis is useful. It helps us see relationships which are hard to find using other techniques and other data structures.


In our case for example, Data to Value chose to represent the data collected about the VW crisis in the following graph model:

In the picture above we see how some of the different entities are linked:

  • a company or an investor owns shares of another company
  • an engine powers a car model
  • a car model is sold by (in the range of) a company
  • a company is a fleet buyer of another company
  • a supplier supplies a company

The graph model gives a highly structured view of the information. We can now use it to start asking questions.
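To make the model concrete, here is a minimal sketch of it as a plain edge list in Python; the node names and connections are invented for illustration and simply follow the relationship types in the bullets above:

```python
# The graph model as (source, relationship, target) triples.
# Rows are illustrative, not taken from the actual dataset.
edges = [
    ("EA 189 engine", "POWERS", "Polo"),
    ("EA 189 engine", "POWERS", "Fabia"),
    ("Polo", "SOLD_BY", "Volkswagen"),
    ("Fabia", "SOLD_BY", "Skoda"),
    ("Hertz", "FLEET_BUYER_OF", "Volkswagen"),
    ("Robert Bosch", "SUPPLIES", "Volkswagen"),
]

def targets(source, rel):
    """All nodes reached from `source` via the relationship `rel`."""
    return [dst for src, r, dst in edges if src == source and r == rel]

# Which car models does a given engine power?
print(targets("EA 189 engine", "POWERS"))  # ['Polo', 'Fabia']
```

A graph database such as Neo4j stores the same kind of triples but indexes them so that traversals like targets() stay fast across millions of edges.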



Investigating the VW network to find hidden insights


First let’s start by visualizing the motors involved in the scandal and in which car models they are used.

We can see how a single motor can be used in many car models. The “R3 1^4l 66kW(90PS) 230Nm 4V TDI CR” motor, for example, is used in the Rapid Spaceback, Toledo, A1, A1 Sportback, Fabia, 250 OSA Polo, Fabia Combi, Polo, Ibiza Sport Tourer, Ibiza 3 Turer and Ibiza 5 Turer.

The Linkurious graph visualization helps us make sense of how the motors and car models are tied. Now that we understand this, we can add an extra set of relationships. The relationships between the car models and the manufacturers.

At this point, we can notice that a car like Skoda’s Superb is using 3 motors involved in the scandal.

Although most customers don’t associate Skoda’s cars with Volkswagen, we can see that there are relationships between the two car manufacturers. Two other car manufacturers also appear in the data: Seat and Audi.

Let’s expand the relationships of our car manufacturers to reveal more information. A few new details emerge.

Various car rental companies are major customers of Volkswagen. Companies like Hertz, Europcar and Avis are buying cars manufactured by Volkswagen, Audi, Skoda and Seat.

Daimler and BMW, the other German car manufacturers, use Robert Bosch, a VW supplier and a company also involved in the scandal, although to date these companies have denied using the software illegally in the same way as VW.

Let’s filter the graph to focus on the shareholding relationships now.

Audi, Seat, Skoda and Volkswagen are all owned by Volkswagen AG. Investors in Volkswagen AG include asset manager BlackRock and the sovereign funds Qatar Investment Authority and Norwegian Investment Authority.



Simplifying investment research with graph analysis



Data to Value’s research into the VW scandal has applications in the long-only asset management space as well as the hedge fund space, both for managing risks and for identifying investment opportunities or developing a graph-based trading strategy. For example, based on the insights uncovered, the shares of indirectly affected companies could be shorted in order to achieve returns. Although not directly exposed to the crisis, these companies are indirectly affected by the reputational impact, the decline in VW's profitability, and unplanned costs if the matter is not resolved effectively (e.g. fleet buyers bearing modification costs themselves, or seeing reduced demand for VW rentals).


“The same insights could also help a risk manager understand a portfolio or entire investment manager’s risk profile associated with systemic or counterparty risks. Graph technology is perfect for performing these types of network analysis” explains Phare.


Research processes are traditionally very manual and document-driven. Analysts look at brokers' reports, newspapers and market data. This makes sense when the velocity of the information is low, but it doesn't scale to today's high data volume. Techniques like machine learning and semantic analysis can automate much of this work and make analysts' jobs easier. Combined with the kind of self-service graph analysis enabled by Linkurious, this presents an innovative approach for investment managers.

Read more…

Visualize Your Data Using

Guest blog post by Vozag

Creating interactive visualizations of your data for the web is a cakewalk; all you need to do is import your spreadsheet and start generating your interactive visualizations.

You need to sign in to start creating visualizations. If you don't have an account, signing up is easy. All you need is a name, an email id and a password, and the account is created immediately.

Add your data

Importing data is very easy. You can import any spreadsheet by clicking the ‘Import spreadsheet’ link on the right side of the home screen. There is an option for importing either csv or excel files. You can also choose to import data directly from a Google spreadsheet by pasting its URL.


Once you choose a file to import and click open, you are taken to a screen where you are shown a number of rows from your data. Note that each row in the spreadsheet is considered to be a page. You can set which column is to be used as the title for each page and how each column should be interpreted. You may also choose to ignore some columns depending on your requirements.

Once you are done, you can finish importing the data by clicking the ‘Start import’ button at the top right of the page. The time taken to import the data depends on the number of rows and the type of data you are importing.



Explore Your Collection

Once the data is imported, it is shown in a dashboard. You can start exploring your data by clicking the ‘Explore your collection’ link, which is just under the collection name.

By default, the data is shown as a table. You can change the visualization by clicking the drop down button available on the right side of the explore page.


Apart from a table format, you have the option of visualizing your data as a list, grid, pie chart, bar chart, scatter plot, map and a few other forms.

Generate and save visualizations


Generating map visualizations is very easy, given that addresses can be converted into latitude-longitude pairs and plotted on the map directly. If you have worked on generating map-related visualizations before, you will appreciate that this saves a lot of pain. However, proper care should be taken to give the full address, or the generated plots will be erratic.

Get started by choosing the map visualization in the drop-down. You can choose the columns you want to show in the pop-up for a location on the map; choose the column of numeric data if you are plotting a heat map; and choose the column on which you want to sort the data from the options given. Most importantly, you have to choose the column which has the location information, and the visualization will be generated accordingly. You can choose to add the visualization to any of your pages.

Shown above is a heat map of home values in the top 200 counties in the USA.

Share your work

Once your visualizations are generated, you can choose to email them, share them on social networking platforms like Facebook, Twitter and LinkedIn, or embed them in any of your web pages by clicking the ‘Share and embed’ button in the top right corner of the page.

An interactive visualization of the map is given here.

Read more…

Guest blog post by Jean Villedieu

The European Union is giving free access to detailed information about public purchasing contracts. That data describes which European institutions are spending money, for what and who benefits from it. We are going to explore the network of public institutions and suppliers and look for interesting patterns with graph visualization.

Public spending in Europe is a €1.3 trillion business

Every year, more than €1.3 trillion in contracts are awarded by public entities in Europe. In an effort to make these public contracts more transparent, the European Union has decided to make the tender announcements public. The information can be found online through the EU’s open data portal. OpenTED, a non-profit organization, has gone a step further and made the data available in CSV format.

Public contracts are complex though. Each involves at least one commercial entity which is awarded a contract by a public authority. The public entity may be acting as a “delegate” for another public entity. The contract can be disputed in a certain jurisdiction of appeal.

We have multiple entities and relationships. What this means is that the tenders data describes a graph, or network. We are going to explore the tenders graph with Neo4j and Linkurious.

Modeling public contracts as a graph

We will focus on the 2015 tenders. There are 73,269 tenders in one single CSV file with 45 columns.

tenders excel

We decided to model the graph in the following way:

tender data model neo4j

The graph model above highlights the relationships between the contracts, appeal bodies, operators, delegates and authorities in our data.

To put it in a Neo4j database, we wrote a script that you can see here.

This script takes the 2015 tenders data and turns it into a Neo4j graph database with 161,541 nodes and 536,936 edges. Now that it’s in Neo4j, we can search, explore and visualize it with Linkurious. It’s time to start asking questions of the data!
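The overall shape of such an import script can be sketched as follows; the column names and rows below are simplified, invented stand-ins for the 45 real columns, and a real run would write to Neo4j rather than build in-memory sets:

```python
import csv
import io

# Derive authority/operator/contract nodes and IS_AUTHORITY_OF /
# IS_OPERATOR_OF edges from tender rows (inline sample data for illustration).
sample = io.StringIO(
    "authority,operator,contract_id\n"
    "Ministry of Defence,Babcock International Group Plc,C1\n"
    "GCS Uniha,KPMG,C2\n"
)

nodes, edge_list = set(), []
for row in csv.DictReader(sample):
    nodes.update([row["authority"], row["operator"], row["contract_id"]])
    edge_list.append((row["authority"], "IS_AUTHORITY_OF", row["contract_id"]))
    edge_list.append((row["operator"], "IS_OPERATOR_OF", row["contract_id"]))

print(len(nodes), len(edge_list))  # 6 nodes, 4 edges
```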

The biggest tenders and the authorities and companies which are involved

As a first step, let’s try to identify the big public contracts awarded in Europe in 2015 and the organizations they involved. In order to get that answer, we’ll use Cypher (the Neo4j query language) within Linkurious.

Here’s how to find the 10 biggest contracts and the public authorities and commercial providers they involved.

// The top 10 biggest contracts and who they involve
// (MATCH clause reconstructed; labels and relationship types follow the later queries)
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
WHERE b.contract_initial_value_cost_eur <> 'null'
RETURN a, c, b
ORDER BY b.contract_initial_value_cost_eur DESC
LIMIT 10

The result is the following graph.

biggest public contracts

We can see, for example, that the Ministry of Defence has awarded a large contract to Babcock International Group Plc.

Missed connections

The graph structure of the data also allows us to spot missed opportunities. Let’s take KPMG, for example. It’s one of the biggest recipients of public contracts within our dataset. What other contracts could the company have been awarded?

To answer that question, we can identify which of KPMG’s customers awarded contracts to its competitors.

Let’s identify the biggest missed opportunities of KPMG:

// KPMG’s biggest missed opportunities
MATCH (a:OPERATOR {official_name: 'KPMG'})-[:IS_OPERATOR_OF]->(b:CONTRACT)<-[:IS_AUTHORITY_OF]-(customer:AUTHORITY)-[:IS_AUTHORITY_OF]->(lost_opportunity:CONTRACT)<-[:IS_OPERATOR_OF]-(competitor:OPERATOR)
WHERE NOT (a)-[:IS_OPERATOR_OF]->(lost_opportunity)
RETURN a, b, customer, lost_opportunity, competitor

We can visualize the result in Linkurious:

KPMG’s network.

KPMG is highlighted in red. It is connected to 10 contracts by 9 customers. These customers have awarded 246 other contracts to 181 firms.

This visualization could be used by KPMG to identify its “unfaithful” customers and its competitors. Specifically, we may want to filter the visualization to focus on contracts for services similar to the ones KPMG offers. To do that we will use the CPV code, a procurement taxonomy of products and services.
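The idea behind that filter can be sketched as a prefix match on CPV codes; the codes, contract IDs and the 4-character prefix length below are all assumptions for illustration:

```python
# Keep only contracts whose CPV code shares a prefix with one of
# the firm's own contracts (illustrative codes and IDs).
kpmg_cpv = {"79212000", "79210000"}   # hypothetical audit/accounting codes
candidates = [
    ("C10", "79212300"),   # similar prefix: audit-related
    ("C11", "48000000"),   # different prefix: software
]

def similar(code, own_codes, prefix_len=4):
    """True if `code` matches any of `own_codes` on the first `prefix_len` characters."""
    return any(code[:prefix_len] == c[:prefix_len] for c in own_codes)

kept = [cid for cid, code in candidates if similar(code, kpmg_cpv)]
print(kept)  # ['C10']
```

In Linkurious itself, this filtering would presumably be done with a Cypher WHERE clause on the contract's CPV property rather than in client code.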

Here is a visualization filtered to display only contracts with CPV codes similar to KPMG’s contracts:

Filtering on contracts KPMG could have won.

If we zoom in, we can see, for example, that GCS Uniha, one of KPMG’s customers, has also awarded contracts to some of KPMG’s competitors (Ernst and Young, PwC and Deloitte).

GCS Uniha has also awarded contracts to Ernst and Young, PwC and Deloitte.

Exploring the French IT sector

Finally, we can use Linkurious to visualize a particular economic sector. Let’s focus for example on IT spending in France. Who are the biggest spenders in that sector? Which companies are capturing these contracts? Finally what are the relationships between all these organizations?

Using a Cypher query, we can identify the customers and suppliers linked to IT contracts (which all have a CPV code starting with “48”):

//The French public IT market
// (MATCH clause reconstructed; the country property name is assumed)
MATCH (a:AUTHORITY)-[:IS_AUTHORITY_OF]->(b:CONTRACT)<-[:IS_OPERATOR_OF]-(c:OPERATOR)
WHERE b.contract_cpv_code STARTS WITH "48" AND b.country = 'FR'
RETURN a, b, c

We can visualize the result directly in Linkurious.

Public IT contracts in France in 2015.

If we zoom in, we can see that Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015, mostly to Dalkia and Idex énérgies.

Conseil Général des Bouches du Rhône has awarded 9 IT contracts in 2015.

We have dived into €1.3 trillion of public spending using Neo4j and Linkurious. We were able to identify some key actors in the public contracts market and map interesting ecosystems. Want to explore and understand your own graph data? Simply try the Linkurious demo!

Read more…

Space is limited.
Reserve your Webinar seat now

Web and graphic design principles are tremendously useful for creating beautiful, effective dashboards. In this latest Data Science Central Webinar event, we will consider how common design mistakes can diminish visual effectiveness. You will learn how placement, weight, font choice, and practical graphic design techniques can maximize the impact of your visualizations.

Speaker: Dave Powell, Solution Architect -- Tableau
Hosted by: Bill Vorhies, Editorial Director -- Data Science Central

Again, Space is limited so please register early:
Reserve your Webinar seat now

After registering you will receive a confirmation email containing information about joining the Webinar.
Read more…

The current state of machine intelligence 2.0

Guest blog post by Laetitia Van Cauwenberge

Interesting O'Reilly article, although focused on the business implications, not the technical aspects. I really liked the infographic, as well as these statements:

  • Startups are considering beefing up funding earlier than they would have, to fight inevitable legal battles and face regulatory hurdles sooner.
  • Startups are making a global arbitrage (e.g., health care companies going to market in emerging markets, drone companies experimenting in the least regulated countries).
  • The “fly under the radar” strategy. Some startups are being very careful to stay on the safest side of the grey area, keep a low profile, and avoid the regulatory discussion as long as possible.

This is a long article with the following sections:

  • Reflections on the landscape
  • Achieving autonomy
  • The new (in)human touch
  • 50 shades of grey markets 
  • What’s your (business) problem?
  • The great verticalization
  • Your money is nice, but tell me more about your data

Note that we are ourselves working on data science 2.0. Some of our predictions can be found here. And for more on machine intelligence (not sure how it is different from machine learning), click on the following links:

To read the full article, click here. Below is the infographic from the original article.

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

Comparing Data Science and Analytics

You may think that all big data experts are created equal, but nothing could be further from the truth. Yet the terms “data scientist” and “business analyst” are often used interchangeably. It’s a common and confusing use of terminology, which is why [email protected], a master's in business analytics program, created this infographic to help bring further clarity to the two roles.

Both business analysts and data scientists are experts in the use of big data, but they have different types of educational backgrounds, usually work in different types of settings, and use their skills and knowledge in completely different ways.

Reflective of the increasing need to extract value from the mountain of big data at our fingertips, business analysts are in much higher demand—with a predicted job growth of 27 percent over the next decade. They dig into a variety of sources to pull data for analysis of past, present and future business performance, and then take those results and use the most effective tools to translate them to business leaders. They typically have educational backgrounds in specialties like business and humanities.

In contrast — data scientists are digital builders. They have strong educational backgrounds in computer science, mathematics and technology and use statistical programming to actually build the framework for gathering and using the data by creating and implementing algorithms to do it. Such algorithms help with decision-making, data management, and the creation of visualizations to help explain the data that they gather.

To find out more about who will fit the bill for your organization, dig into the infographic to make sure the big data expert you’re hiring is the right one to meet your needs.

Originally posted on Data Science Central

Read more…

Two great visualizations about data science

Guest blog post by Laetitia Van Cauwenberge

The first one is about the difference between Data Science, Data Analysis, Big Data, Data Analytics, and Data Mining:

The source for this one is, according to a tweet, a website I found very interesting, although I could not find the article in question there. In any case, I love the picture above, so please share it if you agree with it. If someone knows the link to the original article, please post it in the comment section.

And now another nice picture about the history of big data and data science:

Here I have the reference for the picture: click here


Read more…
