
There are many ways to select features from a given data set, and it is always a challenge to pick the ones with which a particular algorithm will work best. Here I will consider data from monitoring the performance of physical exercises with wearable accelerometers, for example, wrist bands.

The data for this project come from this source:

In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants. They were asked to perform barbell lifts correctly, marked as "A", and incorrectly with four typical mistakes, marked as "B", "C", "D" and "E". The goal of the project is to predict the manner in which they did the exercise.

There are 52 numeric variables and one classification variable, the outcome. We can plot density graphs for the first six features, which are in effect smoothed-out histograms.

We can see that the data behave in complicated ways. Some features are bimodal or even multimodal. These properties could be caused by differences in participants' size or training level, or something else, but we do not have enough information to check. Nevertheless, it is clear that our variables do not follow a normal distribution, so we are better off with algorithms that do not assume normality, such as trees and random forests. We can visualize how such algorithms work in the following way: as finding vertical lines that divide the areas under the curves such that the areas to the right and to the left of the line differ significantly across outcomes.
There are a number of ways to distinguish functions analytically on an interval in functional analysis. The most suitable here seems to be the area between curves, scaled with respect to the size of the curves. For every feature I consider all pairs of density curves to find out whether any pair is sufficiently different. Here is my final criterion:
If some pair of curves for a feature satisfies the criterion, the feature is chosen for prediction. As a result I got 21 features for a random forests algorithm, which yielded 99% accuracy both for the model itself and on a validation set. I checked how many variables are needed for the same accuracy with PCA preprocessing: 36. Mind you, those variables are scaled and rotated, and all 52 original features are still used to construct them, so more effort is needed to build a prediction and to explain it. With the method above it is easier, since areas under curves represent numbers of observations.
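The criterion itself appeared as an image in the original post and was lost here. Its idea can be sketched as follows: for each feature, compute the area between every pair of per-class density curves, scaled by the total area under both curves, and keep the feature if some pair exceeds a threshold. The helper name and the 0.5 threshold below are assumptions for illustration, not the author's originals.

```r
# Sketch of the pairwise density-separation idea (helper name and the
# threshold are assumed, not taken from the original post)
density_separation <- function(x, y, n = 512) {
  # Evaluate both class densities on a common grid
  rng <- range(c(x, y))
  dx <- density(x, from = rng[1], to = rng[2], n = n)
  dy <- density(y, from = rng[1], to = rng[2], n = n)
  step <- diff(dx$x[1:2])
  area_between <- sum(abs(dx$y - dy$y)) * step   # area between the curves
  area_total   <- sum(dx$y + dy$y) * step        # scales the criterion
  area_between / area_total                      # lies in [0, 1]
}

# Demo on the built-in iris data: petal length separates two species
# far better than sepal width does
sep_petal <- density_separation(iris$Petal.Length[iris$Species == "setosa"],
                                iris$Petal.Length[iris$Species == "virginica"])
sep_sepal <- density_separation(iris$Sepal.Width[iris$Species == "setosa"],
                                iris$Sepal.Width[iris$Species == "virginica"])

# A feature would be kept if any class pair exceeds a chosen threshold, e.g. 0.5
```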

Originally posted on Data Science Central

Analysis of Fuel Economy Data

Paul Grech

October 5, 2015

Contributed by Paul Grech. Paul took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post was based on his first class project (due in the second week of the program).


Analyse fuel economy ratings in the automotive industry.

Compare the vehicle efficiency of American automotive manufacturer Cadillac with the automotive industry as a whole.

Sept 2014 - “We cannot deny the fact that we are leaving behind our traditional customer base,” de Nysschen said. “It will take several years before a sufficiently large part of the audience who until now have been concentrating on the German brands will find us in their consideration set.” Cadillac’s President - Johan de Nysschen

Compare the vehicle efficiency of Cadillac with its self-declared competition, the German luxury market.

What further comparisons will display insight into EPA ratings?

Analysis Overview

  1. Automotive Industry
  2. Cadillac vs Automotive Industry
  3. Cadillac vs German Luxury Market
  4. Cadillac vs German Luxury Market by Vehicle Class

Importing the Data

Import the data and select the columns needed for the analysis. Then replace all zeros in the city and highway MPG data with NA, as zeros would skew the results and we should not perform calculations on data that are not present.

# Import Data and convert to Dplyr data frame
library(dplyr)
FuelData <- read.csv("", stringsAsFactors = FALSE)
FuelData <- tbl_df(FuelData)

# Create data frame including information necessary for analysis
# (VClass is retained for the vehicle-class analysis in section 4)
FuelDataV1 <- select(FuelData,
                     mfrCode, year, make, model,
                     engId, eng_dscr, cylinders, displ, sCharger, tCharger,
                     trans_dscr, trany, drive,
                     startStop, phevBlended,
                     city08, comb08, highway08, VClass)

# Replace zero values in MPG data with NA
FuelDataV1$city08[FuelDataV1$city08 == 0] <- NA
FuelDataV1$comb08[FuelDataV1$comb08 == 0] <- NA
FuelDataV1$highway08[FuelDataV1$highway08 == 0] <- NA

1: Automotive Industry

Visualize city and highway EPA ratings of the entire automotive industry.


How have EPA ratings for city and highway improved across the automotive industry as a whole?

Note: there is no need to include combined MPG, since combined is simply a percentage-based calculation (defaulting to 60/40) that can be adjusted on the website.

IndCityMPG <- group_by(FuelDataV1, year) %>%
  summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "Industry") %>%
  mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
  summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "Industry") %>%
  mutate(., MPGType = "Highway")

Comp.Ind <- rbind(IndCityMPG, IndHwyMPG)

ggplot(data = Comp.Ind, aes(x = year, y = MPG, linetype = MPGType)) +
  geom_point() + geom_line() + theme_bw() +
  ggtitle("Industry\n(city & highway MPG)")



The visualization shows relatively poor EPA ratings throughout the 1980s, 1990s, and early-to-mid 2000s, with the first drastic improvement occurring around 2008. One significant event in that period was the recession hitting America: consumers having less disposable income, along with increased oil prices, likely fueled competition to develop fuel-efficient powertrains across the automotive industry as a whole.

2: Cadillac vs Automotive Industry

Visualize Cadillac's city and highway EPA ratings with that of the automotive industry.


How does Cadillac perform when compared to the automotive industry as a whole?
IndCityMPG <- group_by(FuelDataV1, year) %>%
  summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "Industry") %>%
  mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
  summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "Industry") %>%
  mutate(., MPGType = "Highway")
CadCityMPG <- filter(FuelDataV1, make == "Cadillac") %>%
  group_by(., year) %>%
  summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "Cadillac") %>%
  mutate(., MPGType = "City")
CadHwyMPG <- filter(FuelDataV1, make == "Cadillac") %>%
  group_by(., year) %>%
  summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "Cadillac") %>%
  mutate(., MPGType = "Highway")

Comp.Ind.Cad <- rbind(IndCityMPG, IndHwyMPG, CadCityMPG, CadHwyMPG)

ggplot(data = Comp.Ind.Cad, aes(x = year, y = MPG, color = Label, linetype = MPGType)) +
  geom_point() + geom_line() + theme_bw() +
  scale_color_manual(name = "Cadillac / Industry", values = c("blue", "#666666")) +
  ggtitle("Cadillac vs Industry\n(city & highway MPG)")



Cadillac was chosen as a brand of interest because they are currently redefining their brand as a whole. It is important to analyze past performance to have a complete understanding of how Cadillac has been viewed for several decades.

In 2002, Cadillac dropped to its lowest performance. Why? Because the entire fleet was built around the same 4.6L V8 mated to a 4-speed automatic transmission, or as some would say, a slush-box. The image Cadillac had at the time was of a retirement vehicle, shipped to its owner's new retirement home in Florida, with a soft ride, smooth power delivery, and no performance. With the latest generation of Cadillacs being performance-oriented, beginning with the LS2-sourced CTS-V and now including the ATS-V and CTS-V along with several other V-Sport models, a rebranding is crucial to appeal to a new market of buyers.

Also interesting to note: although more performance models are being produced, fuel efficiency is not lacking. The gap noted above has narrowed even as performance models have multiplied, two trends that rarely align.

3: Cadillac vs German Luxury Market

Cadillac has recently targeted the German luxury market consisting of the following manufacturers:
  • Audi
  • BMW
  • Mercedes-Benz


How does Cadillac perform when compared with the German Luxury Market?
# Calculate Cadillac average Highway / City MPG past 2000
CadCityMPG <- filter(CadCityMPG, year > 2000)
CadHwyMPG <- filter(CadHwyMPG, year > 2000)

# Calculate Audi average Highway / City MPG
AudCityMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "Audi") %>%
  mutate(., MPGType = "City")
AudHwyMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "Audi") %>%
  mutate(., MPGType = "Highway")

# Calculate BMW average Highway / City MPG
BMWCityMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "BMW") %>%
  mutate(., MPGType = "City")
BMWHwyMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "BMW") %>%
  mutate(., MPGType = "Highway")

# Calculate Mercedes-Benz average Highway / City MPG
MbzCityMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
  mutate(., Label = "Merc-Benz") %>%
  mutate(., MPGType = "City")
MbzHwyMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
  group_by(., year) %>%
  summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
  mutate(., Label = "Merc-Benz") %>%
  mutate(., MPGType = "Highway")

# Concatenate all Highway/City MPG data for:
# v.s. German Competitors
CompGerCadCity <- rbind(CadCityMPG, AudCityMPG, BMWCityMPG, MbzCityMPG)
CompGerCadHwy <- rbind(CadHwyMPG, AudHwyMPG, BMWHwyMPG, MbzHwyMPG)

ggplot(data = CompGerCadCity, aes(x = year, y = MPG, color = Label)) +
  geom_line() + geom_point() + theme_bw() +
  scale_color_manual(name = "Cadillac vs German Luxury Market",
                     values = c("#333333", "#666666", "blue", "#999999")) +
  ggtitle("CITY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")

ggplot(data = CompGerCadHwy, aes(x = year, y = MPG, color = Label)) +
  geom_line() + geom_point() + theme_bw() +
  scale_color_manual(name = "Cadillac vs German Luxury Market",
                     values = c("#333333", "#666666", "blue", "#999999")) +
  ggtitle(label = "HIGHWAY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")



Mr. Ellinghaus, a German who came to Cadillac in January from pen maker Montblanc International after more than a decade at BMW, said he has spent the past 11 months doing "foundational work" to craft an overarching brand theme for Cadillac's marketing, which he says relied too heavily on product-centric, me-too comparisons.

“In engineering terms, it makes a lot of sense to benchmark the cars against BMW,” Mr. Ellinghaus said. But he added: “From a communication point of view, you must not follow this rule.”

Despite Mr. Ellinghaus's comments, the end goal is for consumers to compare Cadillac with Audi, BMW, and Mercedes-Benz. That this is already happening is a huge success for a company that, only ten years ago, would never have been mentioned in the same sentence as the German luxury market.

The visualization shows that Cadillac is rated on par with its German competitors and, at the same time, has not had any significant dips, unlike the other manufacturers. The continued increase in performance combined with the rebranding signals that Cadillac is on a path to success.

4: Cadillac vs German Luxury Market by Vehicle Class

Every manufacturer has its strengths and weaknesses. It is important to assess these attributes, by vehicle class, to determine where increased R&D spending is needed and where a competitive advantage can be maintained.


In what vehicle class is Cadillac excelling or falling behind?
# Filter only Cadillac and the German luxury market
German <- filter(FuelDataV1, make %in% c("Cadillac", "Audi", "BMW", "Mercedes-Benz"))
# Group vehicle classes into more generic classes
# (the grouped-class column name was lost in the original post; "VClass2" is assumed)
German$VClass2 <- ifelse(grepl("Compact", German$VClass, ignore.case = TRUE), "Compact",
                  ifelse(grepl("Wagons", German$VClass), "Wagons",
                  ifelse(grepl("Utility", German$VClass), "SUV",
                  ifelse(grepl("Special", German$VClass), "SpecUV", German$VClass))))

# Focus on vehicle model years past 2000
German <- filter(German, year > 2000)
# Vans, Passenger Type is specific to one company and is not needed for this analysis
German <- filter(German, VClass2 != "Vans, Passenger Type")

IndClass <- filter(German, make %in% c("Audi", "BMW", "Mercedes-Benz")) %>%
  group_by(., VClass2, year) %>%
  summarize(., AvgCity = mean(city08, na.rm = TRUE), AvgHwy = mean(highway08, na.rm = TRUE))
CadClass <- filter(German, make %in% c("Cadillac")) %>%
  group_by(., VClass2, year) %>%
  summarize(., AvgCity = mean(city08, na.rm = TRUE), AvgHwy = mean(highway08, na.rm = TRUE))

##### Join tables #####
CadIndClass <- left_join(IndClass, CadClass, by = c("year", "VClass2"))
CadIndClass$DifCity <- CadIndClass$AvgCity.y - CadIndClass$AvgCity.x
CadIndClass$DifHwy  <- CadIndClass$AvgHwy.y - CadIndClass$AvgHwy.x

ggplot(CadIndClass, aes(x = year, ymax = DifCity, ymin = 0)) +
  geom_linerange(color = 'grey20', size = 0.5) +
  geom_point(aes(y = DifCity), color = 'blue') +
  geom_hline(yintercept = 0) +
  theme_bw() +
  facet_wrap(~ VClass2) +
  ggtitle("Cadillac vs German Luxury Market\n(city mpg by class)") +
  xlab("Year") +
  ylab("MPG Difference")

ggplot(CadIndClass, aes(x = year, ymax = DifHwy, ymin = 0)) +
  geom_linerange(color = 'grey20', size = 0.5) +
  geom_point(aes(y = DifHwy), color = 'blue') +
  geom_hline(yintercept = 0) +
  theme_bw() +
  facet_wrap(~ VClass2) +
  ggtitle("Cadillac vs German Luxury Market\n(highway mpg by class)") +
  xlab("Year") +
  ylab("MPG Difference")



The visualization above displays the delta between Cadillac's fuel economy ratings and the average of its competitors (Audi, BMW, Mercedes-Benz). Positive values mean Cadillac is above the competitor average; negative values, below.

Cadillac lags across all vehicle classes, possibly because the same powertrains are used across multiple chassis.


Conclusion & Continued Analysis

  1. There is a clear improvement in EPA ratings as federal emission standards drive innovation toward increased fleet fuel economy. It is important for automotive manufacturers to continue innovating and pushing for increased efficiency.
  2. Further analysis of the following areas would provide further research opportunities:
    • Drivetrain vs. MPG
    • Sales data
    • Consumer reaction to new marketing strategies
    • Consumer demand for product or badge

Originally posted on Data Science Central


Guest blog post by Vincent Granville

There have been many variations on this theme: defining big data with 3 Vs (or more, including velocity, variety, volume, veracity, and value), as well as other representations such as the data science alphabet.

Here's an interesting Venn diagram that tries to define statistical computing (a sub-field of data science) with 7 sets and 9 intersections:

It was published in a scholarly paper entitled Computing in the Statistics Curricula (PDF document). Enjoy!


Learning R in Seven Simple Steps

Originally posted on Data Science Central

Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other R resources can be found here, and R Source code for various problems can be found here. A data science cheat sheet can be found here, to get you started with many aspects of data science, including R.

Learning R can be tricky, especially if you have no programming experience or are more familiar with point-and-click statistical software than with a real programming language. This learning path is mainly for novice R users who are just getting started, but it also covers some of the latest changes in the language that might appeal to more advanced users.

Creating this learning path was a continuous trade-off between being pragmatic and exhaustive. There are many excellent (free) resources on R out there, and unfortunately not all could be covered here. The material presented here is a mix of relevant documentation, online courses, books, and more that we believe is best to get you up to speed with R as fast as possible.

Data Video produced with R: click here and also here for source code and to watch the video. More here.

Here is an outline:

  • Step 0: Why you should learn R
  • Step 1: The Set-Up
  • Step 2: Understanding the R Syntax
  • Step 3: The core of R -> packages
  • Step 4: Help?!
  • Step 5: The Data Analysis Workflow
    • 5.1 Importing Data
    • 5.2 Data Manipulation
    • 5.3 Data Visualization
    • 5.4 The stats part
    • 5.5 Reporting your results
  • Step 6: Become an R wizard and discover exciting new stuff

Step 0: Why you should learn R

R is rapidly becoming the lingua franca of data science. Having its origins in academia, today you will spot it in an increasing number of business settings as well, where it competes with commercial incumbents such as SAS, STATA, and SPSS. Each year R gains in popularity, and in 2015 IEEE listed it among the top ten programming languages.

This implies that demand for individuals with R knowledge is growing, so learning R is definitely a smart career investment (according to this survey, R is even the highest-paying skill). This growth is unlikely to plateau in the next years, with large players such as Oracle & Microsoft stepping up by including R in their offerings.

Nevertheless, money should not be the only driver when deciding to learn a new technology or programming language. Luckily, R has a lot more to offer than a solid paycheck. By engaging with R, you will become part of a highly diverse and interesting community. R is used in fields as diverse as finance, genomic analysis, real estate, and paid advertising, and all of these fields actively contribute to its development. You will encounter a diverse set of examples and applications on a daily basis, keeping things interesting and letting you apply your knowledge to a wide range of problems.

Have fun!

Step 1: The Set-Up

Before you can actually start working in R, you need to download a copy to your local computer. R is continuously evolving, and different versions have been released since R was born in 1993, with (funny) names such as World-Famous Astronaut and Wooden Christmas-Tree. Installing R is pretty straightforward, and binaries are available for Linux, Mac, and Windows from the Comprehensive R Archive Network (CRAN).

Once R is installed, you should consider installing one of R's integrated development environments as well (although you could also work with the basic R console if you prefer). Two fairly established IDEs are RStudio and Architect. In case you prefer a graphical user interface, check out R-commander.

Step 2: Understanding the R Syntax

Learning the syntax of a programming language like R is very similar to the way you would learn a natural language like French or Spanish: by practice & by doing. One of the best ways to learn R by doing is through the following (online) tutorials:
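Whichever tutorial you choose, the first lessons all start from the same basics; here is a minimal taste of the syntax (the example values are made up):

```r
# Vectors are R's basic data structure, and operations are vectorized
revenues <- c(120, 95, 240, 310)
revenues * 1.1                  # multiplies every element, no loop needed

# Functions are first-class objects
above <- function(x, cutoff) x[x > cutoff]
above(revenues, 100)            # 120 240 310

# sapply() applies a function over a vector and simplifies the result
sapply(1:3, function(i) i^2)    # 1 4 9
```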

Next to these online tutorials there are also some very good introductory books and written tutorials to get you started:

Step 3: The core of R -> packages

Every R package is simply a bundle of code that serves a specific purpose and is designed to be reusable by other developers. In addition to the primary codebase, packages often include data, documentation, and tests. As an R user, you can simply download a particular package (some are even pre-installed) and start using its functionalities. Everyone can develop R packages, and everyone can share their R packages with others.

The above is an extremely powerful concept and one of the key reasons R is so successful as a language and as a community. You don't need to do all the hard-core programming yourself or understand every complex detail of a particular algorithm or visualization; you can simply use the out-of-the-box functions that come with the relevant package as an interface to such functionality. As such, it is useful to have an understanding of R's package ecosystem.

Many R packages are available from the Comprehensive R Archive Network, and you can install them using the install.packages function. What is great about CRAN is that it associates packages with particular tasks via Task Views. Alternatively, you can find R packages on Bioconductor, GitHub, and Bitbucket.
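As a minimal sketch of that workflow (the package name is only an example; the install line is commented out so the snippet stays offline-friendly):

```r
# One-time install from CRAN (uncomment to run; fetches over the network)
# install.packages("ggplot2")

# Load an installed package into the current session
library(tools)    # tools ships with every R installation, so this always works

# List the packages currently attached to the search path
search()
```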

Looking for a particular package and corresponding documentation? Try Rdocumentation, where you can easily search packages from CRAN, github and bioconductor.

Step 4: Help?!

You will quickly find out that for every R question you solve, five new ones will pop up. Luckily, there are many ways to get help:

  • Within R you can make use of the built-in help system. For example, the command `?plot` will show you the documentation for the plot function.
  • R puts a big emphasis on documentation. The previously mentioned Rdocumentation is a great website for browsing the documentation of different packages and functions.
  • Stack Overflow is a great resource for seeking answers to common R questions or for asking questions yourself.
  • There are numerous blogs & posts on the web covering R, such as KDnuggets and R-bloggers.
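The built-in help system mentioned above has a few entry points worth memorizing (a quick sketch):

```r
?plot               # open the help page for plot()
help("mean")        # the long form of the same thing
apropos("plot")     # list objects whose names contain "plot"
example(mean)       # run the examples from mean()'s help page
```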

Step 5: The Data Analysis Workflow

Once you have an understanding of R's syntax, the package ecosystem, and how to get help, it's time to focus on how R can be useful for the most common tasks in the data analysis workflow.

5.1 Importing Data

Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import all sorts of data formats into R; the hard part is that different types often need a different approach:

If you want to learn more on how to import data into R check an online Importing Data into R tutorial or  this post on data importing.
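For the most common case, a comma-separated file, base R's read.csv is enough (the file written here is just a stand-in so the snippet is self-contained):

```r
# Write a small CSV to a temp file, then read it back: the most common route
path <- tempfile(fileext = ".csv")
write.csv(data.frame(make = c("Audi", "BMW"), mpg = c(24, 26)),
          path, row.names = FALSE)

cars <- read.csv(path, stringsAsFactors = FALSE)
str(cars)   # 2 observations of 2 variables

# Other formats need other readers, e.g. readRDS() for .rds files
# or read.table() for whitespace-delimited text
```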

5.2 Data Manipulation

Performing data manipulation with R is a broad topic as you can see in for example this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations:

  • The tidyr package for tidying your data.
  • The stringr package for string manipulation.
  • When working with data-frame-like objects, it is best to make yourself familiar with the dplyr package (try this course). However, in case of heavy data wrangling tasks, it makes more sense to check out the blazingly fast data.table package (see this syntax cheatsheet for help).
  • When working with times and dates, install the lubridate package, which makes this a bit easier.
  • Packages like zoo, xts and quantmod offer great support for time series analysis in R.
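As a sketch of what these packages' core verbs do, here are base-R equivalents of three common dplyr operations (dplyr's own filter/group_by/summarise/mutate read almost the same; the toy data are made up):

```r
df <- data.frame(make = c("Audi", "BMW", "Audi"),
                 mpg  = c(24, 26, 30),
                 stringsAsFactors = FALSE)

# dplyr: filter(df, make == "Audi")
audi <- df[df$make == "Audi", ]

# dplyr: group_by(df, make) %>% summarise(avg = mean(mpg))
avg <- aggregate(mpg ~ make, data = df, FUN = mean)

# dplyr: mutate(df, kpl = mpg * 0.425)  (miles/gallon to km/litre)
df$kpl <- df$mpg * 0.425
```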

5.3 Data Visualization

One of the main reasons R is the favorite tool of data analysts and scientists is its data visualization capabilities. Tons of beautiful plots are created with R, as shown by all the posts on FlowingData, such as this famous Facebook visualization.

Credit card fraud scheme featuring time, location, and loss per event, using R: click here for source

If you want to get started with visualizations in R, take some time to study the ggplot2 package, one of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 makes intensive use of the grammar of graphics and as a result is very intuitive to use (you continuously build up your graph piece by piece, so it's a bit like playing with Lego). There are tons of resources to get you started, such as this interactive coding tutorial, a cheatsheet, and an upcoming book by Hadley Wickham.
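A minimal sketch of that layered style, using the built-in mtcars data (requires the ggplot2 package to be installed):

```r
library(ggplot2)

# Each "+" adds one more layer to the same plot object:
# data and aesthetics first, then geoms, then labels
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Weight vs fuel economy",
       x = "Weight (1000 lbs)", y = "Miles per gallon")

print(p)   # draws the plot
```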

Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:

If you want to see more packages for visualizations see the CRAN task view. In case you run into issues plotting your data this post might help as well.

Next to the "traditional" graphs, R can handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with a package such as ggmap. Other great packages are choroplethr, developed by Ari Lamstein of Trulia, and the tmap package. Take this tutorial on Introduction to visualising spatial data in R if you want to learn more.

5.4 The stats part

In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:

Note that these resources are aimed at beginners. If you want to go more advanced, you can look at the many resources for machine learning with R. Books such as Mastering Machine Learning with R and Machine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice them. Furthermore, there are some very interesting blogs to kickstart your ML knowledge, like Machine Learning Mastery or this post.

5.5 Reporting your results

One of the best ways to share your models, visualizations, etc. is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner through HTML, Word, PDF, ioslides, etc. This 4-hour tutorial on Reporting with R Markdown explains the basics of R Markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.
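A minimal R Markdown document looks like this (a sketch; render it with rmarkdown::render() or the Knit button in RStudio):

````markdown
---
title: "My Analysis"
output: html_document
---

Some prose explaining the analysis.

```{r}
# Any R chunk is executed and its output embedded in the report
summary(mtcars$mpg)
```
````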

Step 6: Become an R wizard and discover exciting new stuff

R is a fast-evolving language. Its adoption in academia and business is skyrocketing, and consequently the rate of new features and tools within R is rapidly increasing. These are some of the new technologies and packages that excite us the most:

Once you have some experience with R, a great way to level up your R skillset is the free book Advanced R by Hadley Wickham. In addition, you can start practicing your R skills by competing with fellow Data Science Enthusiasts on Kaggle, an online platform for data-mining and predictive modelling competitions. Here you have the opportunity to work on fun cases such as this titanic data set.

To end, you are now probably ready to start contributing to R yourself by writing your own packages. Enjoy!



New book on data mining and statistics

New book:

Numeric Computation and Statistical Data Analysis on the Java Platform (by S.Chekanov)

710 pages. Springer International Publishing AG. 2016. ISBN 978-3-319-28531-3.


About this book: Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language.

Originally posted on Data Science Central


Dealing with Outliers is like searching for a needle in a haystack

This is a guest repost by Jacob Joseph.

An outlier is an observation or point that is distant from other observations/points. But how would you quantify the distance of an observation from other observations to qualify it as an outlier? Outliers are also described as observations whose probability of occurring is low. But, again, what constitutes low?

There are parametric and non-parametric methods for identifying outliers. Parametric methods assume some underlying distribution, such as the normal distribution, whereas non-parametric methods have no such requirement. Additionally, you can do a univariate analysis, studying a single variable at a time, or a multivariate analysis, studying more than one variable at the same time.

Which approach and which analysis is the right answer? Unfortunately, there is no single right answer; it depends on the end purpose of identifying the outliers. You may want to analyze the variable in isolation, or use it among a set of variables to build a predictive model.

Let’s try to identify outliers visually.

Assume we have the data for Revenue and Operating System for Mobile devices for an app. Below is the subset of the data:

How can we identify outliers in the Revenue?

We shall try to detect outliers using parametric as well as non-parametric approach.

Parametric Approach

Comparison of Actual, Lognormal and Normal Density Plot

The x-axis in the plot above represents revenue, and the y-axis the probability density of the observed revenue values. The density curve for the actual data is shaded in pink, the normal distribution in green, and the log-normal distribution in blue. The probability density for the actual distribution is calculated from the observed data, whereas the normal and log-normal densities are computed from the observed mean and standard deviation of the revenues.

Outliers can be identified by calculating the probability of an observation occurring, or by calculating how far the observation is from the mean. For example, under a normal distribution, observations more than 3 standard deviations from the mean could be classified as outliers.
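The 3-standard-deviation rule from the paragraph above, sketched on synthetic revenue data (the numbers are made up for illustration, not the post's data):

```r
set.seed(1)
# 100 ordinary revenues plus one planted extreme value
revenue <- c(rnorm(100, mean = 20000, sd = 5000), 95000)

# z-score: how many standard deviations each point is from the mean
z <- (revenue - mean(revenue)) / sd(revenue)

# Flag anything more than 3 standard deviations away
revenue[abs(z) > 3]
```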

In the above case, if we assume a normal distribution, there could be many outlier candidates, especially observations with revenue beyond 60,000. The log-normal fit does a better job than the normal, but only because the underlying actual distribution has the characteristics of a log-normal. This is not the general case, since determining the underlying distribution or its parameters beforehand is extremely difficult. One could infer the parameters by fitting a curve to the data, but a change in the underlying mean and/or standard deviation due to new incoming data will change the location and shape of the curve, as observed in the plots below:

Comparison of Density Plot for change in mean and standard deviation for Normal Distribution
Comparison of Density Plot for change in mean and standard deviation for LogNormal Distribution

The above plots show the shift in location or the spread of the density curve based on an assumed change in mean or standard deviation of the underlying distribution. It is evident that a shift in the parameters of a distribution is likely to influence the identification of outliers.

Non-Parametric Approach

Let’s look at a simple non-parametric approach like a box plot to identify the outliers.

Non Parametric approach to detect outlier with box plots (univariate approach)

In the box plot shown above, we can identify 7 observations, marked in green, which could be classified as potential outliers; they lie beyond the whiskers.
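The whisker rule the plot applies can be written out directly: a point is a candidate outlier if it falls more than 1.5 × IQR beyond the quartiles (toy numbers here, not the post's data):

```r
revenue <- c(12, 15, 14, 13, 16, 18, 17, 90)   # 90 looks suspicious

q   <- quantile(revenue, c(0.25, 0.75))        # first and third quartiles
iqr <- q[2] - q[1]
lo  <- q[1] - 1.5 * iqr
hi  <- q[2] + 1.5 * iqr

revenue[revenue < lo | revenue > hi]           # -> 90
# boxplot(revenue)$out applies the same rule using hinges
```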

In the data, we have also been given information on the OS. Would we identify the same outliers if we plot the revenue by OS?

Non Parametric approach to detect outlier with box plots (bivariate approach)

In the above box plot we are doing a bivariate analysis, taking two variables at a time (a special case of multivariate analysis). There appear to be 3 outlier candidates for iOS and none for Android, due to the difference in the distribution of Revenue between Android and iOS users. So, analyzing the Revenue variable on its own (univariate analysis), we identified 7 outlier candidates, which dropped to 3 when a bivariate analysis was performed.

Both parametric and non-parametric approaches can be used to identify outliers, depending on the characteristics of the underlying distribution. If the mean accurately represents the center of the distribution and the data set is large enough, a parametric approach can be used, whereas if the median better represents the center of the distribution, a non-parametric approach is more suitable.

Dealing with outliers in a multivariate scenario becomes all the more tedious. Clustering, a popular data mining technique and a non-parametric method could be used to identify outliers in such a case.
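As a toy sketch of that clustering route, here is a naive k-means in pure Python that flags points lying far from their cluster centroid (the data are synthetic and the thresholding rule is an illustrative choice, not a standard method):

```python
import math

def kmeans_outliers(points, k=2, iters=20, z=3.0):
    """Cluster points with a naive k-means (Lloyd's algorithm, first-k
    initialization), then flag points whose distance to their centroid is
    more than `z` standard deviations above the mean distance."""
    centroids = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    mu = sum(dists) / len(dists)
    sd = (sum((d - mu) ** 2 for d in dists) / len(dists)) ** 0.5
    return [p for p, d in zip(points, dists) if d > mu + z * sd]

# Two tight synthetic clusters plus one far-away point
cluster_a = [(i * 0.1, j * 0.1) for i in range(4) for j in range(5)]
cluster_b = [(10 + i * 0.1, 10 + j * 0.1) for i in range(4) for j in range(5)]
print(kmeans_outliers(cluster_a + cluster_b + [(50.0, 50.0)]))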

Originally posted on Data Science Central


Guest blog post by ahmet taspinar

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in the test set (a dataset whose entries have not been labelled yet) with the model constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP), the field of science whose goal is to make machines (computers) understand (written) human language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it assumes conditional independence of its features. This simplification makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to lower accuracy. A direct improvement on the NB classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly.

This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

1. Regression Analysis

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let's say we have a dataset containing n datapoints; X = ( x^{(1)}, x^{(2)}, .., x^{(n)} ). For each of these (input) datapoints there is a corresponding (output) y^{(i)}-value. Here the x-datapoints are called the independent variables and y the dependent variable; the value of y^{(i)} depends on the value of x^{(i)}, while the value of x^{(i)} may be freely chosen without any restriction imposed on it by any other variable.
The goal of Regression Analysis is to find a function f(X) which best describes the correlation between X and Y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_{\theta}(x).




If we can find such a function, we can say we have successfully built a Regression model. If the input data lives in a 2D space, this boils down to finding a curve which fits through the datapoints. In the 3D case we have to find a plane, and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset Y which contains the final grades of n students. Dataset X contains the values of the independent variables. Our initial assumption is that the final grade depends only on studying time. The variable x^{(i)} therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:



If the result looks like the figure on the left, then we are out of luck: the points seem to be distributed randomly and there is no correlation between Y and X at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes it.


This function could for example be:

h_{\theta}(X) = \theta_0+ \theta_1 \cdot x

or
h_{\theta}(X) = \theta_0 + \theta_1 \cdot x^2

where the \theta_i are the parameters of our model.
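For the linear hypothesis h_{\theta}(X) = \theta_0 + \theta_1 \cdot x, the best-fitting parameters have a closed-form least-squares solution. A minimal sketch in Python, with invented hours-studied/grade data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for h(x) = theta0 + theta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    theta1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
             / sum((x - x_mean) ** 2 for x in xs)
    theta0 = y_mean - theta1 * x_mean
    return theta0, theta1

# Hypothetical data: hours studied vs final grade
hours  = [2, 4, 6, 8, 10]
grades = [55, 65, 75, 85, 95]
theta0, theta1 = fit_line(hours, grades)
print(theta0, theta1)   # → 45.0 5.0
```

For the quadratic hypothesis the same least-squares idea applies, just with x^2 as an additional regressor.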


1.1. Multivariate Regression

Evaluating the results from the previous section, we may find them unsatisfying: the function does not correlate with the datapoints strongly enough. Our initial assumption is probably incomplete. Taking only the studying time into account is not enough; the final grade also depends on how much the students slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by X = ( (x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), .., (x_1^{(n)}, x_2^{(n)}) ). In this dataset x_1^{(i)} indicates how many hours student i has studied and x_2^{(i)} indicates how many hours they have slept.
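Although this excerpt stops before the Logistic Regression section, the claim above that the classifier "can be implemented with a few lines of code" holds up. A minimal sketch in Python, training binary logistic regression by stochastic gradient descent on the log-loss (the toy data are invented):

```python
import math

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Binary logistic regression trained with stochastic gradient descent.
    X: list of feature vectors, y: list of 0/1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                     # gradient of log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    z = sum(wj * xj for wj, xj in zip(w, xi)) + b
    return 1 if z >= 0 else 0

# Toy data: the label is 1 when the two feature values are large
X = [[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
print([predict(w, b, xi) for xi in X])   # → [0, 0, 0, 1, 1, 1]
```

The weight vector w is exactly the set of feature weights that Maximum Entropy estimates directly, instead of assuming conditional independence as Naive Bayes does.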

See the rest of the blog here, including Linear vs Non-linear, Gradient Descent, Logistic Regression, and Text Classification and Sentiment Analysis.


The Curse of #DataViz

With the wide array of amazing data visualization tools out there - SAP Lumira, Qlik, Domo, Tableau - you would think that the world has moved towards a graphical understanding of reality.

Yet my experience has been that when people are faced with the task of using a fancy new graph or visualization in their work, their first reaction is… to freak out. Bear with me: I am not implying that most people can't handle a nice bar chart, a three-dimensional pie chart or even some animated multi-colored craziness.

What I am saying is that most people don’t know where to go from a data visualization. Are they supposed to take a screenshot and send it to their boss? Should they print it out? Should they click the buttons for 10 hours until it gets boring? Should they make a decision?

Ok - ideally they should make a decision. But what if that person is not empowered to make decisions, or if they need to check with their boss first? Also, what if they don’t trust the data behind the graphics? They are taking a pretty big bet when they write that email to the whole department saying “hey this chart shows we should be doing this!”

In a way, this is one instance of a phenomenon that will become more and more prevalent as computers make recommendations to humans and we have to follow up with the so-called “human assist”, that is, the final point in the decision-making process.

The visualization can only take you so far, but you need to drink the water. 

Originally posted on Data Science Central


Guest blog post by Takashi J. OZAKI

I wrote a blog post inspired by Jamie Goode's book "Wine Science: The Application of Science in Winemaking".

In this book, Goode argues that a reductionist approach cannot explain the relationship between chemical ingredients and the taste of wine. Indeed, we know that not all high-alcohol wines are excellent, although in general high-alcohol wines are believed to be good. Usually the taste of wine is affected by a complicated balance of many components such as sweetness, acidity, tannin, density and others, each given by corresponding chemical entities.

However, I think (and probably many other data science experts agree) that this is not a limitation of the reductionist approach, but a limitation of univariate modeling. To illustrate this, I performed a series of multivariate modeling experiments with random forests and other models on the "Wine Quality" dataset from the UCI Machine Learning repository.

As a result, a random forest classifier predicted the tasting score of wine better than intuitive univariate modeling. At the same time, it also revealed some hidden and complicated dynamics between chemical ingredients and the taste of wine. I believe that modern multivariate modeling, such as machine learning, can reveal far more complicated relationships between chemical ingredients and taste.
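A minimal sketch of this multivariate approach in Python, using scikit-learn's built-in wine recognition dataset as a stand-in for the UCI Wine Quality data analyzed in the post (the original analysis was done in R, and the accuracy figure will differ):

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Chemical measurements as features, wine class as target
X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("accuracy:", rf.score(X_te, y_te))

# Feature importances hint at which chemical measurements drive the prediction
for name, imp in sorted(zip(load_wine().feature_names, rf.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")
```

The feature importances are where the "hidden dynamics" show up: several chemical variables contribute jointly, which no single-variable model can capture.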

See my blog post below for more details.


Guest blog post by Denis Rasulev

An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.




The original post covers a lot more details, and for those who want to pursue more analysis on their own: everything in the post (the data, software, and code) is freely available. Full instructions to download and analyze the data yourself are available on GitHub.

Here is a link to the original post: link


Principal Component Analysis using R

Guest blog post by suresh kumar gorakala

Curse of Dimensionality:

One of the most commonly faced problems while dealing with data analytics tasks, such as recommendation engines or text analytics, is high-dimensional and sparse data. Many times we face a situation where we have a large set of features and fewer data points, or we have data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power. This scenario is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality.

In this blog we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in data of high dimension.

Principal component analysis:

Consider the following scenario:

The data we want to work with is in the form of a matrix A of dimension m×n, shown below, where A_{i,j} represents the value of the i-th observation of the j-th variable.


Thus the matrix can be viewed as m observations, each a row vector of n variables. If n is very large it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.

Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

The algorithm, when applied, linearly transforms the n-dimensional input space to a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding (n − k) dimensions. PCA allows us to discard the directions that have less variance.

Technically speaking, PCA uses an orthogonal projection of possibly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

In the above image, u1 and u2 are principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest variance and is orthogonal to u1.
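For intuition, here is the two-dimensional case in pure Python: PCA reduces to eigen-decomposing the 2×2 covariance matrix, and the eigenvector of the largest eigenvalue is exactly u1 (the post's own analysis uses R's prcomp; this sketch is only illustrative):

```python
import math

def pca_2d(points):
    """PCA for 2-D data: eigen-decompose the 2x2 covariance matrix directly."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] via trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)   # largest eigenvalue
    l2 = tr - l1
    # Eigenvector for l1: the direction of greatest variance (first PC)
    if abs(sxy) > 1e-12:
        v = (l1 - syy, sxy)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (v[0] / norm, v[1] / norm), (l1, l2)

# Points stretched along the line y = x: the first PC should be ~(0.707, 0.707)
pts = [(-2, -2.1), (-1, -0.9), (0, 0.1), (1, 1.1), (2, 1.9)]
pc1, (l1, l2) = pca_2d(pts)
print(pc1)
```

The two eigenvalues are the variances along u1 and u2; their ratio shows how much information is retained if the second component is discarded.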

PCA implementation in R:

For today’s post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.

142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.8 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9.9 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
182.88 185.42 187.96 190.5 193.04 195.58
9.4 0 0 0 0 0 0
9.5 0 0 0 0 0 0
9.6 0 0 0 0 0 0
9.7 0 0 0 0 0 0
9.8 0 0 0 0 0 0
9.9 0 0 0 0 0 0
[1] 42 22
'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

[1] 3000

[1] "142.24" "144.78" "147.32" "149.86" "152.4" "154.94" "157.48" "160.02" "162.56" "165.1" "167.64" "170.18" "172.72" "175.26" "177.8" "180.34"
[17] "182.88" "185.42" "187.96" "190.5" "193.04" "195.58"

Let us use apply() on the crimtab dataset column-wise (margin 2) to calculate the variance and see how each variable varies.


We observe that column “165.1” contains the maximum variance in the data. Now let us apply PCA using prcomp().

pca =prcomp(crimtab)

Note: the resultant components of the pca object from the above code are the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the other components. Rotation contains the principal component loadings matrix, whose values explain the contribution of each variable along each principal component.

Let’s plot all the principal components and see how much variance each component accounts for.

par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for the most information.
Let us interpret the results of the PCA using a biplot, which shows the loadings of each variable along the two principal components.

# the two lines below change the direction of the biplot; if we do not
# include them the plot will be a mirror image of the one below
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows are the loading vectors, which show how the feature space varies along the principal component directions.

From the plot, we can see that the first principal component vector, PC1, more or less places equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than the 160.02 and 162.56 features. 

In the second principal component, PC2 places more weight on 160.02 and 162.56, which are less correlated with the three features 165.1, 167.64, and 170.18.

Complete Code for PCA implementation in R: 

data("crimtab")           # load data
head(crimtab)             # show sample data
dim(crimtab)              # check dimensions
str(crimtab)              # show structure of the data
apply(crimtab, 2, var)    # check the variance across the variables
pca <- prcomp(crimtab)    # apply principal component analysis on crimtab data
par(mar = rep(2, 4))
plot(pca)                 # plot to show variance explained by each component
# the two lines below change the direction of the biplot; without them
# the plot would be a mirror image of the one shown above
pca$rotation <- -pca$rotation
pca$x <- -pca$x
biplot(pca, scale = 0)    # plot pca components using biplot in r

So by now we understand how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? In our next post we shall answer these questions.

Originally posted here


Guest blog post by Jean Villedieu

Following the Mediator scandal, France adopted a Sunshine Act in 2011. For the first time, we have data on the presents and contracts awarded to health care professionals by pharmaceutical companies. Can we use graph visualization to understand these dangerous ties?

Dangerous ties

Pharmaceutical companies in France and in other countries use presents and contracts to influence the prescriptions of health care professionals. This has posed ethical problems in the past.

In France, 21 people are currently being prosecuted for their role in the scandal around Mediator, a drug that was recently banned. Some of them are accused of having helped the drug manufacturer obtain an authorization to sell the drug, and later fight its ban, in exchange for money.

In the US, GlaxoSmithKline was ordered to pay $3 billion in the largest health-care fraud settlement in US history. Before the settlement, GlaxoSmithKline had paid various experts to fraudulently market the benefits of its drugs.

Such problems arose in part because of a lack of transparency in the ties between pharmaceutical companies and health-care professionals. With open data now available can we change this?

Moving the data to Neo4j

Regards Citoyens, a French NGO, parsed various sources to build the first database documenting the financial relationships between health care providers and pharmaceutical manufacturers.

That database covers the period from January 2012 to June 2014. It contains 495 951 health care professionals (doctors, dentists, nurses, midwives, pharmacists) and 894 pharmaceutical companies. The contracts and presents represent a total of 244 572 645 €.

The original data can be found on the Regards Citoyens website.

The data is stored in one large CSV file. We are going to use graph visualization to understand the network formed by the financial relationships between pharmaceutical companies and health care professionals.

First we need to move the data into a Neo4j graph database using the import script sunshine_import.cql.

Now the data is stored in Neo4j as a graph (download it here). It can be searched, explored and visualized through Linkurious.

Unfortunately, names in the data have been anonymized by Regards Citoyens following pressure from the CNIL (the French Commission nationale de l’informatique et des libertés).

Who is Sanofi giving money to?

Let’s start our data exploration with Sanofi, France’s biggest pharmaceutical company. If we search for Sanofi through Linkurious, we can see that it is connected to 57 765 professionals. Let’s focus on the 20 of Sanofi’s contacts who have the most connections.

sanofi's connections

Sanofi’s top 20 connections.


Among these entities there are 19 doctors in general medicine and one student. We can quickly grasp which professions Sanofi is targeting by coloring the health care professionals according to their profession:

19 doctors among Sanofi's top 20 connections

19 doctors among Sanofi’s top 20 connections.

In a click, we can filter the visualization to focus on the doctors. We are now going to color them according to their region of origin.

Region of origin of Sanofi's 19 doctors

Region of origin of Sanofi’s 19 doctors.

Indirectly, the health care professionals Sanofi connects to via presents also tell us about its competitors. Let’s look at who else has given presents to the health care professionals befriended by Sanofi.

sanofi's competitors network

Sanofi’s contacts (highlighted in red) are also in touch with other pharmaceutical companies.


Zooming in, we can see Sanofi at the center of a very dense network, next to Bristol-Myers Squibb, Pierre Fabre, Lilly or AstraZeneca, for example. According to the Sunshine dataset, Sanofi is competing with these companies.

We can also see an interesting node. It is a student who has received presents from 104 pharmaceutical companies including companies that are not direct competitors of Sanofi.

A successful student

A successful student.

Why has he received so much attention? Unfortunately all we have is an ID (02b0d3726458ef46682389f2ac7dc7af).

Sanofi could identify the professionals its competitors have targeted and perhaps target them too in the future.

Who has received the most money from pharmaceutical companies in France?

Neo4j includes a graph query language called Cypher. Through Cypher we can compute complex graph queries and get results in seconds.

We can for example identify the doctor who has received the most money from pharmaceutical companies:

//Doctor who has received the most money
//(the MATCH clause below is a reconstruction: the node labels are assumptions,
//only the totalDECL relationship property appears in the original post)
MATCH (d:Doctor)<-[r]-(:Company)
RETURN d, sum(r.totalDECL) AS total
ORDER BY total DESC LIMIT 1

The doctor behind the ID 2d92eb1e795f7f538556c59e48aaa7c1 has received 77 480€ from 6 pharmaceutical companies.

wealthy doctor

The relationships are colored according to the money they represent. St Jude Medical has given over 70 231€ to Dr 2d92eb1e795f7f538556c59e48aaa7c1.

Perhaps the next time his patients receive a prescription from Dr 2d92eb1e795f7f538556c59e48aaa7c1, they would like to know about his relationship with St Jude Medical. Unfortunately, today the Sunshine data is anonymous.

We can also find the most generous pharmaceutical company.

//Company which has distributed the most money
//(the MATCH clause below is reconstructed; the node label is an assumption)
MATCH (a:Company)-[r]->()
RETURN a, sum(r.totalDECL) as total
ORDER BY total DESC LIMIT 1

Novartis Pharma has awarded 12 595 760€ to various entities.

top 5 novartis

The 5 entities receiving the most money from Novartis.


When we look closer, we can see that the 5 entities which have received the most money from Novartis Pharma are 5 NGOs.

24f3287da6ab125862249416bc91f9c4 has received 75 000€

24f3287da6ab125862249416bc91f9c4 has received 75 000€.

Come meet us at GraphConnect in London, the biggest graph event in Europe. It is sponsored by Linkurious, and you can use “Linkurious30” to register and get a 30% discount!

The Sunshine dataset offers a rare glimpse into the practice of pharmaceutical companies and how they use money to influence the behavior of health care professionals. Unfortunately for citizens looking for transparency, the data is anonymized. Perhaps it will change in the future?


Guest blog post by Laetitia Van Cauwenberge

This article focuses on cases such as Facebook and protein interaction networks. The article was written by Paul Scherer (paulmorio) and submitted as a research paper to HackCambridge. What makes this article interesting is that it compares five clustering techniques for this type of problem:

  • K-Clique Percolation - a clique merging algorithm. Given a size k, it finds k-cliques and merges (percolates) them into clusters as necessary.
  • MCODE - a seed-growth approach to finding dense subgraphs.
  • DPClus - a seed-growth approach to finding dense subgraphs, similar to MCODE, but with an internal representation of edge weights and a different stopping condition.
  • IPCA - a modified DPClus algorithm which focuses on maintaining the diameter of a cluster (defined as the maximum shortest distance between all pairs of vertices) rather than its density.
  • CoAch - a combined approach that first finds a small number of cliques as complexes and then grows them.

The article also provides great visualizations such as the one below:

In the original article, these visualizations are interactive, and you will find out which software was used to produce them.

Below is the summary (written by the original author):


For my submission to HackCambridge I wanted to spend my 24 hours learning something new in accordance with my interests. I was recently introduced to protein interaction networks in my Bioinformatics class, and during my review of machine learning techniques for an exam I noticed that we study many supervised methods, but no unsupervised methods other than k-means clustering. Thus I decided to combine the two interests by clustering protein interaction networks with unsupervised clustering techniques and communicating my learning, results, and visualisations using the Beaker notebook.

The study of protein-protein interactions (PPIs) determined by high-throughput experimental techniques has created large sets of interaction data and a new need for methods allowing us to discover new information about biological function. These interactions can be thought of as a large-scale network, with nodes representing proteins and edges signifying an interaction between two proteins. In a PPI network, we can potentially find protein complexes or functional modules as densely connected subgraphs. A protein complex is a group of proteins that interact with each other at the same time and place, creating a quaternary structure. Functional modules are composed of proteins that bind each other at different times and places and are involved in the same cellular process. Various graph clustering algorithms have been applied to PPI networks to detect protein complexes or functional modules, including several designed specifically for PPI network analysis. A select few of the most famous and recent topographical clustering algorithms were implemented based on descriptions from papers, and applied to PPI networks. Upon completion it was recognized that it is possible to apply these to other interaction networks like friend groups on social networks, site maps, or transportation networks, to name a few.

I decided to use Graphistry's GPU cluster, with the kind permission of Dr. Meyerovich, to visualize the large networks (otherwise I would likely not have finished on time given the specs of my machine) and to communicate my results and learning process.

The full version with mathematical formulas, detailed descriptions, and source code, can be found here. For more articles about clustering, click here. This link will give you access to the following articles:



Big Data Insights - IT Support Log Analysis

Guest blog post by Pradeep Mavuluri

This post brings to the audience a few glimpses of the insights obtained from a case in which predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of their IT support system. As quick background: a large organization was interested in value-added, actionable insights from thousands of records logged in the past, as they saw expenses increase with no gain in productivity.


As most of us know, in these business scenarios end users are most interested in the strange and unusual findings that may not be captured by regular reports. Hence the data scientist's job does not end at finding un-routine insights; it also requires digging deeper for root causes and suggesting the best possible actions for immediate remedy (knowledge of the domain, or of best practices in the industry, helps a lot). Further, as mentioned earlier, only a few of these findings are shown and discussed here, and all the analysis was carried out with R: R 3.2.2, RStudio (favorite IDE), and the ggplot2 package for plotting.

The first graph (below) is a time series calendar heat map, adapted from Paul Bleicher, showing the number of tickets raised day-wise over every week of each month for the last year (green and its light shades represent low numbers, whereas red and its shades represent high numbers).


If one carefully observes the above graph, it is evident that, except for April & December, all months show a sudden increase in the number of tickets raised on the last Saturdays and Sundays; this is most clearly visible at the quarter ends of March, June and September (and also in November, which is not a quarter end). One can think of this as unusual behavior, since the numbers rise on non-working days. Before going into further details, let us also look at one more graph (below), which depicts solved duration in minutes on the x-axis and the respective time taken as a horizontal timeline plot.

The above solved-duration plot shows that, of all records analyzed, 71.87% belong to the "Request for Information" category and were solved within a few minutes of the tickets being raised (which is why we cannot see a line plot for this category compared to the others). What happened here was, in effect, a gaming of the system, enabled by the lack of automation. In simple words, it was found that proper documentation and guidance did not exist for many of the applications in use; this situation was exploited to inflate the number of tickets (pushing tickets even for basic information at month ends and quarter ends, which produced month-end openings that were then closed immediately). The case discussed here is one among many that were presented, together with possible immediate remedies that are easily actionable.

Visual Summarization:


Original Post


Guest blog post by Petr Travkin

Part 1. Business scenarios.

I have spent many hours planning and executing an in-company self-service BI implementation. This enabled me to gain several insights. Now that the ideas have become mature enough and field-proven, I believe they are worth sharing. No matter how far you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt at describing feasible scenarios provides a decent foundation.

All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance. I have tried to elaborate every aspect of each possible solution, leaving aside the marketing claims of the vendor.

Scenario 1. Tableau Desktop + departmental/cross-functional data schemas.

This scenario involves data analysts gaining insights on a daily basis. They might be either independent individuals or a team. Business users’ interaction with published workbooks is possible, but limited to simple filtering.

User categories: professional data analysts;

Technical skills: intermediate/advanced SQL, intermediate/advanced Tableau;

Tableau training: 2-3 days full time (preferably) or continuous self-learning from scratch;

Licenses: Tableau Desktop.


Pros:

  • Pure self-service BI approach with no IT involved in data analysis;
  • Vast range of data available for analysis with almost no limits;
  • Fast response for complex ad-hoc business problems.

Cons:

  • Requires highly skilled data analysts;
  • Most likely involves Tableau training on query performance optimisation on a particular data source (e.g. Vertica).

Recommendations:

  • Create a “sandbox” that allows data analysts to query and collaborate on their own and without supervision. Further promotion of workbooks to production is welcome.

Scenario 2. Tableau Desktop + custom data marts.

In this scenario, business users are fully in charge of data analysis. IT provides custom data marts.

User categories: business users, line-managers;

Technical skills: basic SQL, basic/intermediate Tableau;

Tableau training: two or three 2-3h sessions + ad-hoc support on daily basis;

Licenses: Tableau Desktop + Server Interactors.


  • Easy access to data for ad-hoc analysis;
  • Self-answering critical business questions;
  • Self-publishing for further ad-hoc access across multiple devices.


  • Adding any data involves IT support;
  • Requires elaborated data dictionaries.


Recommendations:
  • Make requirements gathering a collaborative and iterative process with regular communication; this ensures well-timed data delivery and quality;
  • Deliver training in 2-3 wisely structured sessions with 2-3 week breaks, so business users have time to play with the software and develop a need for new skills;
  • Focus on rich visualisations, not tables.

Scenario 3. Tableau Server Web Edit + workbook templates.

This scenario fully relies on data models published by data analysts and powerful Web Edit features of Tableau Server.

User categories: line-managers, top managers;

Technical skills: Tableau basics;

Tableau training: one 30 min demo session + ad-hoc support;

License: Server Interactor.


Pros:
  • No special training required;
  • Fast Tableau adoption with basic but powerful self-service BI capabilities (Web Edit);
  • Thin-client access via any desktop web browser;
  • Could serve as a foundation for self-service BI adoption among the C-suite.


Cons:
  • Requires a high level of accuracy in data preparation and template development;
  • Any change in the data model requires development and republishing of the template.


Recommendations:
  • Try to select the most proactive and “data hungry” line manager or executive, who can help spread the word;
  • Investigate analytical needs and ensure availability of a subject matter expert;
  • Start with simple visualisations, but be ready to increase complexity;
  • Provide as much ad-hoc assistance as you can.

In my next post, I will shed light on some technical aspects and limitations of each scenario.

I highly appreciate any comments and look forward to hearing about your experience.


Guest blog post by Eduardo Siman

Fortune 500 companies are investing staggering amounts in data visualization. Many have opted for Tableau, Qlik, MicroStrategy, etc., but some have created their own in HTML5, full-stack JavaScript, Python, and R. Leading CIOs and CTOs are obsessed with being the first adopters of whatever is next in data visualization.

The next frontier in data visualization is clearly immersive experiences. The 2014 paper "Immersive and Collaborative Data Visualization Using Virtual Reality Platforms" written by CalTech astronomers is a staggeringly large step in the right direction. In fact, I am shocked that 1 year later I have not seen a commercial application of this technology. You can read it here:

The key theme that I hear at technology conferences lately is the need to focus on analytics, visualization and data exploration. The advent of big data systems such as Hadoop and Spark has made it possible - for the first time ever - to store petabytes of data on commodity hardware and process this data, as needed, in a fault-tolerant and incredibly quick fashion. Many of us fail to understand the full implications of this inflection point in the history of computing.

[Picture source: VR 2015 IEEE Virtual Reality International Conference]

Storage costs are decreasing every year, to the point where a USB drive can now hold multiple GB where 10 years ago it could only store a few MB. Gigabit internet is being installed in cities all over the world. Spark uses in-memory distributed computation to run at 10X MapReduce speeds on gigantic datasets and is already used in production by Fortune 50 companies. Tableau, Qlik, MicroStrategy, Domo, etc. have gained tremendous market share as companies that have implemented Hadoop components such as HDFS, HBase, Hive, Pig, and MapReduce start to wonder "How can I visualize that data?"
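For readers new to the MapReduce pattern mentioned here, it can be sketched in a few lines of plain Python. This is only a toy illustration: Hadoop and Spark distribute the same two phases across a cluster, with Spark keeping the intermediate results in memory between stages rather than spilling them to disk.

```python
from collections import Counter
from functools import reduce

# Each line stands in for one partition of a much larger dataset.
lines = ["big data on commodity hardware",
         "spark keeps data in memory",
         "hadoop spills data to disk"]

# Map phase: every partition emits its own partial word counts.
mapped = [Counter(line.split()) for line in lines]

# Reduce phase: merge the partial counts into a single result.
totals = reduce(lambda a, b: a + b, mapped)

print(totals["data"])  # "data" appears once in each of the three lines
```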

Now think about VR - probably the hottest field in technology at this moment. It has been more than a year since Facebook bought Oculus for $2 billion, and we have seen Google Cardboard burst onto the scene. Applications from media companies like the NY Times are already becoming part of our everyday lives. This month at the CES show in Las Vegas, dozens of companies showcased virtual reality platforms that improve on the state of the art and allow for a motion-sickness-free immersive experience.

All of this combines into my primary hypothesis: this is a great time to start a company providing immersive data visualization environments to businesses and consumers. I personally believe that businesses and government agencies would be the first to fully engage on the data side, but there is clearly an opportunity in gaming on the consumer side.

Personally, I have been so taken by the potential of this idea that I wrote a post in this blog about the “feeling” of being in one of these immersive VR worlds.

The post describes what it would be like to experience data with not only vision, but touch and sound and even smell. 

Just think about the possibilities of examining streaming data sets - currently analyzed with tools such as Storm, Kafka, Flink, and Spark Streaming - as a river flowing under you!

The strength of the water can represent the speed of data intake, or any other variable expressed as a flow - stock market prices come to mind.
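As a sketch of how such a mapping might work, here is a hypothetical flow_strength function that turns the recent event rate of a stream into a number a rendering engine could use for the strength of the water. The function name, the window size, and the max_rate scaling constant are all my own illustrative assumptions, not part of any real streaming API.

```python
# Hypothetical: map the throughput of an event stream to a
# visual "flow strength" in [0, 1].
def flow_strength(event_times, window=5.0, max_rate=100.0):
    """Events per second over the trailing window, scaled to [0, 1]."""
    if not event_times:
        return 0.0
    now = max(event_times)
    recent = [t for t in event_times if now - t <= window]
    rate = len(recent) / window          # events per second
    return min(rate / max_rate, 1.0)     # clamp to the renderer's range

# 50 events spread over the last 5 seconds -> 10 events/s -> strength 0.1
ticks = [i * 0.1 for i in range(50)]
print(flow_strength(ticks))
```

In a real system the timestamps would arrive from the stream processor (Kafka, Spark Streaming, etc.) and the strength value would drive the VR renderer each frame.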

The possibilities for immersive data experiences are absolutely astonishing. The CalTech astronomers have already taken the first step in that direction, and perhaps there is a company out there that is already taking the next step. That being said, if this sounds like an exciting venture to you, DM me on twitter @Namenode5 and we can talk. 


Big data landscape 2016 - Infographic

Guest blog post by Laetitia Van Cauwenberge

Great infographic about the big data / analytics / data science / deep learning / BI ecosystem. Created by @Mattturk, @Jimrhao and @firstmarkcap. Click on the image to zoom in. 

This infographic features the following components:

  • Infrastructure
  • Analytics
  • Applications
  • Cross-Infrastructure/Analytics
  • Open Source
  • Data Sources & APIs



There is no arguing that the marketing landscape has changed drastically over the years - gone are the days of throwing different strategies at the wall to see which one would stick. Instead, companies today are equipped with rich information about their customers and tactics, thanks to the big data boom of recent years. In fact, according to Bob Evans, senior vice-president of communications for Oracle, large companies that have not put together complete plans for managing big data expect to lose $71.2 million every year. This is just one example of the extreme impact big data is having on businesses of all sizes across all industries.

Data is everywhere, from phones and personal computers to NASA databases. Some of it may be used for business - some of it may not. In fact, almost 80% of the information gathered by an organization is unstructured and remains unused without the right software. Fortunately, with the recent software boom, small businesses can reap the benefits of these excessive amounts of online and offline information.

Even though most big data conversations concern companies with the resources to hire experts and research firms, there are several ways for SMBs that know where to look to gather, analyze and make sense of the information they already have. Here are a few solutions that could help small companies make better business decisions and compete with larger enterprises in the ever-evolving marketplace.


The tools you are probably already using provide a rich source of information. InsightSquared is a sales performance analytics tool designed to save you from mining and analyzing your own data one spreadsheet at a time. It connects to popular business solutions such as Salesforce, Google Analytics and QuickBooks to automatically gather and extract the information you need.


If you currently do not have any rich data sources, conducting research may be the right move. Qualtrics software lets you conduct a wide range of surveys and studies to gain better insights to guide your decision-making. It offers customer, employee and market insights in real time, along with mobile surveys, online samples and academic research.


If you want to create dashboards and analyze corporate data without the help of the IT department, here's a solution for you. When it comes to next-generation smart data discovery, Panorama is a global leader: it is the first business intelligence software to use Automated Insights and Social Decision Making to give users more relevant insights.


Once limited to companies with large resources, credit card transactions are full of unique and vital data. Customer intelligence company Tranzlogic makes this information available to small and medium establishments for a reasonable price, providing information that merchants can use to launch better marketing campaigns, measure sales performance and write better business plans.

The transition from an instinct-driven company to an analytics-driven company is one SMB owners everywhere must embrace. New solutions are coming to the market every day, and thankfully, more and more of them are being created for the specific needs of smaller organizations. Finding the right IT solutions can help make it practical and inexpensive to benefit from the opportunity big data affords.

Originally posted on Data Science Central


Originally written by Nigel Higgs on LinkedIn Pulse.

Those of us who have been in the data sphere a while, in and around Data Governance, will have seen the pitch-decks, watched the webinars, read the blogs and attended the conferences. Some of us will have hired the staff, taken sage advice from expensive consultants and kicked off programmes to get the organisation up the Data Governance maturity curve. It's almost like a religion: Data Governance is so clearly the answer, why can't everybody in the organisation see it? It's a no-brainer. Unfortunately, speaking as a Data Governance practitioner of far too many years, I can honestly say that I have yet to see a fully functioning enterprise-wide Data Governance implementation. I appreciate that could be down to my incompetence, but I know this is not an isolated or unique sentiment. Lots of peers, colleagues and people far smarter than me have been preaching the benefits of data administration, data architecture, data governance, or whatever it will be called next, for many years, and yet many of them struggle to come up with success stories. In fact, when pressed, they often don't have any!

So why so much denial? Einstein is reputed to have said something along the lines of 'the definition of insanity is to keep doing the same thing and expect the outcome to be different'. It is also reputed to be the most wrongly attributed and quoted platitude on the planet! But hey this is a LinkedIn post and like most of my writings nobody will read it.

What's that got to do with Data Governance? Well, 'Outside In Data Governance' is about approaching the problem from a different angle. There is little doubt that the problem Data Governance is trying to solve is very real: very few organisations know what data they have, what it means, where it is, who is responsible for it or what its quality is.

But how to solve the problem? What I typically hear is that you need to write a policy, form committees, define processes and assign roles, and then everything will be working like clockwork within months: data governed, quality data delivered to users and the organisation flying up the data maturity curve. But is that what happens? Does the story painted in the pitch-decks become reality? Sadly, it very rarely if ever does.

What is needed is a value-driven approach. Start with: who are we doing Data Governance for? We are doing it for the business users. Then ask: what are they interested in? They are interested in something that makes their lives easier right now. So 'Outside In Data Governance' starts with a single business report and works back from there. Answer those fundamental questions (what, where, who and how good?) about the fields and outputs on the report and make that knowledge accessible. You could do this with a simple Excel-based approach, or maybe a wiki or SharePoint, but pretty soon you will need some tooling to make it scalable and responsive to increasing demands for more reports to be included in the scope. There are ways to do this in a 'proof of concept' environment and demonstrate the benefits before committing to spend. A friend of mine is fond of saying 'it's easier to ask for forgiveness than for permission'. In this case he is right. There are browser-based tools that sit outside your firewall and can offer this try-before-you-buy approach.
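A minimal Excel- or CSV-based starting point can be as simple as one row per report field answering the what/where/who/how-good questions. The column names and the example row below are purely illustrative, not a prescribed schema:

```python
import csv
import io

# One row per report field: what it means, where it comes from,
# who owns it, and how good it is. All values here are made up.
FIELDS = ["field", "definition", "source_system", "owner", "quality_score"]

rows = [
    {"field": "net_revenue",
     "definition": "Invoiced revenue minus credits",
     "source_system": "finance_dw.sales_fact",
     "owner": "Finance BI team",
     "quality_score": "0.97"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])  # header row of the dictionary
```

The point is not the format but the habit: once each field on the single chosen report has a row like this, the dictionary can grow report by report as demand proves the value.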

This is what a value-driven and lean approach is all about. If what you do at this small scale doesn't get traction, then what makes you think a £250k project will end up any better? Start small, ensure you get honest feedback from users at every iteration of your solution, and focus on delivering value. If you bring the data users with you, they will demand the capability be extended. Beat the Einstein quote and start from the 'Outside In'.

Originally posted on Data Science Central

