
Guest blog post by Ujjwal Karn

I created an R package for exploratory data analysis. You can read about it and install it here.  

The package contains several tools for performing an initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as for performing different kinds of analyses, such as the univariate, bivariate, and multivariate investigation that is the first step of any predictive modeling pipeline. The package can be used to get a good sense of a dataset before jumping into building predictive models.

The package is constantly under development and more functionalities will be added soon. Pull requests to add more functions are welcome!

The functions currently included in the package are mentioned below:

  • numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
  • charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
  • Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
  • removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
  • bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata
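The package's source is the authoritative reference, but as a rough sketch of what a numSummary-style helper does under the hood (base R, not the package's actual code):

```r
# Sketch: detect all numeric columns and return summary statistics for each.
numSummarySketch <- function(mydata) {
  num.cols <- names(mydata)[sapply(mydata, is.numeric)]
  t(sapply(mydata[num.cols], function(x) c(
    n       = sum(!,
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    min     = min(x, na.rm = TRUE),
    median  = median(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    missing = sum(
numSummarySketch(iris)  # one row per numeric column of the built-in iris data
```

A charSummary-style helper would follow the same pattern with `is.character` and counts of unique values instead.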

More functions to be added soon. Any feedback on improving this is welcome!


Originally posted on Data Science Central

Contributed by Belinda Kanpetch, an architecture graduate student at Columbia University. With a strong sense of urban design, she is fascinated by urban installation art and eager to use any element that can improve urban space. To gather information systematically for her work, she took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from April 11th to July 1st, 2016. This post is based on her first class project (due in the 2nd week of the program).

Why Street Trees?

The New York City street tree can sometimes be taken for granted or go unnoticed. Located along paths of travel, they stand steady and patient, quietly going about their business of filtering out pollutants in our air, bringing us oxygen, providing shade during the warmer months, blocking winds during cold seasons, and relieving our sewer systems during heavy rainfall. All of this while beautifying our streets and neighborhoods. Some recent studies have found a link between the presence of street trees and lower stress levels in urban citizens.

So what makes a street tree different from any other tree? Mainly its location. A street tree is defined as any tree that lives within the public right of way; not in a park or on private property. Although they reside in the public right of way (or within the jurisdiction of The Department of Transportation) they are the property of and cared for by the NYC Department of Parks and Recreation.

With the intent to understand the data and explore what the data was telling me I started with some very basic questions:

  • How many street trees are there in Manhattan?
  • How many different species are there?
  • What is the general condition of the street trees?
  • What is the distribution of species by community district?
  • Is there a connection between median income of a community district to the number of street trees?

The Dataset

The dataset used for this exploratory visualization was downloaded from the NYC Open Data Portal and was collected as part of TreesCount! 2015, a street tree census maintained by the NYC Department of Parks and Recreation. The first census was conducted in 1995, and one has been conducted every 10 years since by trained volunteers.

Some challenges with this dataset involved missing or awkwardly encoded values:

  • 2,285 observations had an unclassifiable species type.
  • 487 observations had unclassifiable community districts.
  • Geographic information (longitude and latitude) came as character strings that had to be split into separate variables.
  • Species codes were given as 4-letter codes without any reference to genus, species, or cultivar, and I had to find another dataset to decipher them.
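The coordinate split can be done in a couple of lines of base R. These are made-up example strings; the dataset's actual column format may differ slightly:

```r
# Coordinates arrive as single character strings; split them into numeric columns.
coords <- c("POINT (-73.966 40.781)", "POINT (-73.985 40.748)")
parts <- read.table(text = gsub("POINT \\(|\\)", "", coords),
                    col.names = c("longitude", "latitude"))
parts$latitude   # now numeric, ready for mapping
```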

Visualizing the data

A quick summary of the dataset revealed a total of 51,660 trees in Manhattan, with 91 identifiable species plus one ‘species’ made up of missing values.

A bar plot of all 92 species gave an interesting snapshot of the range in total number of trees per species. It was quite obvious that one species has a dominant presence. To get a better understanding of the counts and of which species are common, I broke them down by quartiles and plotted them.

Plotting the first quartile (< 3.75) revealed that there are several species of which only a single tree exists in Manhattan!

The distribution within the 4th quartile (181.75 < total < 11,529) was informative in that it helped to visualize the dominance of two specific species, the Honeylocust and the Ornamental Pear, which make up 23% and 15% of all the trees in Manhattan respectively. Close behind were Ginkgo trees with 9.47% and London Planes with 7.8%. This quartile also contained the missing-species group ‘0’.
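The quartile breakdown itself takes only a few lines of base R. Synthetic counts are used here, since the census data isn't attached to the post:

```r
# Synthetic species counts standing in for the census tallies.
set.seed(1)
species.counts <- c(sample(1:3, 20, replace = TRUE),    # many rare species
                    sample(10:200, 8, replace = TRUE),  # mid-range species
                    3000, 11000)                        # two dominant species
q <- quantile(species.counts)                           # quartile breaks
by.quartile <- split(species.counts,
                     cut(species.counts, breaks = unique(q),
                         include.lowest = TRUE))
sapply(by.quartile, length)   # how many species fall in each quartile
```

Each element of `by.quartile` can then be bar-plotted separately, as in the figures above.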

A palette of the top 4 species in Manhattan.

Looking at trees by Community District

I wanted to look at community districts as opposed to zip codes because, in my opinion, community districts are more representative of community cohesiveness and character. So I plotted the distribution by community district and tree condition.

Plotting the species distribution by community board using a facet grid helped visualize species that were not dominant in the previous graphs. It would be interesting to look further into what those species are and why they are more dominant within some community districts and not others.

Attempts at mapping

The ultimate goal was to map each individual tree location on a map of Manhattan with the community districts outlined or shaded in. I attempted to plot them using leaflet, bringing in shape files and converting them to a data frame, and ggplot, but none of these yielded anything useful. The only visualization I managed was with qplot, which took over 2 hours to render.


Originally posted on Data Science Central

By: Rawi Nanakul and Marnie Morales 

Rawi will present these ideas during a live webinar on May 24th at 9 AM PT / 12 PM ET. Get your questions answered in real-time during this one hour event. Register here 

We create, interpret, and experience stories every day, whether we realize it or not. Our brains are constantly receiving input and stringing things together in order for us to make sense of the world. While our brains create countless stories, only the few great ones stay with us. These make us cry, laugh, or embrace a new perspective.

Understanding how our brains interpret the world can help us become better storytellers. That’s where neuroscience comes in. The field of neuroscience covers anything that studies the nervous system, from studies on molecules within nerve endings to data processing, to even complex social behaviors like economics.

Take the Reader from the Known to the Unknown

So let’s put our brains to the test. Take a look at this image for a few seconds. What do you see?

We know very little about this scene. But because our brains crave structure, we still try to see the story. We take things we know—boxing gloves, children, and a corner man—and try to infer what the unknown might be.

A good story takes us from the Known to the Unknown. This simple premise is the key to telling stories for the brain. Let’s apply this concept to a comic. Why a comic? Comics are similar to data stories in that they present a sequence of panes containing different data points that lead you through a story.

Credit: xkcd


Known:

Election year is coming up.
The common joke of “if X wins, then I am leaving the country.”

Unknown (Punchline):

Dying in Canada = real.
Canada is the matrix.

What did we do in the course of reading the comic? Let's look at some basic brain anatomy to understand what our brain does when reading something like this.

Good Stories Activate More Parts of Our Brain

As you look at the comic, the prefrontal cortex in your frontal lobe kicks into gear, and your brain’s cognitive control goes to work. You're also processing data that comes into your brain as visual input. From your eyes, that data is sent to the primary visual cortex at the back of your brain and onward along two processing streams: the "what" and the "where" pathways.

The "what" pathway (in purple) uses detailed visual information to identify what we see. It pieces together the lines and figures that add up to the comic's characters. It also recognizes the letters and words, and helps decipher their meaning with the help of additional cortical regions like Wernicke's Area, a part of our language system.

The "where" pathway (in green) processes where things are in space. We know this data stream is important and active during reading because adults with reading disabilities like dyslexia often have disrupted functioning of this pathway.

So when we're interpreting visual information, we're activating quite a bit of our brains to make sense of the data we're presented.

Things get more complex from there, because as we interpret the stories we see, even more brain areas become active. Part of the way we comprehend stories is through a simulation of what we see. So you can potentially activate parts of your brain involved in motor control or your sense of touch.

And imagine if you connect emotionally to the story you're reading. You'll be activating areas of your brain involved in emotion (the limbic system). So when reading a good story, whether it's prose, a comic strip, or a data-driven story, you have the potential to get almost global activation of your brain. And the most impactful and memorable stories are those that engage us most.

Channel Your Inner Oddball

Now that we know some of the anatomy, let’s look at the behavioral applications of what we know. Take a look at the figures and read them from left to right. Which one is not like the others? We can quickly see which figure is out of place. Our eyes jump right to it.

How did we know which one was the oddball figure without anyone telling us what it looked like? We had already established a baseline that our initial figure was the normal figure. And when the outlier was presented, we knew right away that it didn't belong.

This experiment is a common attentional process test called the oddball paradigm. A baseline is presented through repetition, then an oddball is presented. This should remind you of our Known-to-Unknown formula that I mentioned earlier. By creating a strong baseline, when the oddball—or an unexpected twist or climax—occurs, we are prepared for it and enjoy it.

Our brain is processing the information based on our experience of the information input. Below is a figure of an ERP, or event-related potential. ERPs are averaged waveforms that measure electrical activity from your scalp. We can use them to measure reaction speed to attentional processing.

Olichney, Nanakul, et al. 2012

In the left figure, we see the brain's response to standard stimuli (each tick mark is 100 ms). You see that we have relatively flat lines after the initial peak. The flat lines are expected because standard stimuli are essentially noise, and our mind zones out once they have been normalized.

The figure on the right shows the response to the oddball, or target, tone, with a peak at 300 ms (also known as a P300). This peak comes from our brain detecting the oddball and concluding that this is the item to pay attention to. It is only possible when a clear baseline has been established.

What This Means for Storytelling

The example above shows us we have to lay down a good foundation and logical progression to get to our peak. Without structure, our audience will experience our story as noise and tune out, like our figure on the left.

When creating your own stories, remember that the brain craves structure and loves oddballs. The brain processes information by using what it already knows to infer what a new piece of information might be. Therefore, making it as easy as possible for the brain to understand the story is key to delivering a successful climax or twist.

Now that you have some basic understanding of brain anatomy and neuroscience, try applying the lessons learned to your data stories. Create dashboards that engage the senses through pleasing designs, shapes, color, text, and interactivity. Embrace the oddball paradigm by clearly establishing a baseline before delivering your findings. That way, the audience’s mind will be primed to attend to it. And their brains will help them remember your story as one of the few good ones.

Learn More about Storytelling with Data

Rawi Nanakul will present these ideas during a live webinar on May 24th. Register here 


The bagged trees algorithm is a commonly used classification method. By resampling our data and creating trees for the resampled data, we can get an aggregated vote of classification predictions. In this blog post I will demonstrate how bagged trees work by visualizing each step.
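As a rough illustration of the mechanics (not the post's own code, which visualizes full trees), here is bagging with crude one-split decision stumps in base R: each bootstrap resample fits a stump, and the stumps' predictions are aggregated by majority vote.

```r
# Bagging sketch: B bootstrap resamples, one decision stump each, majority vote.
bagged.predict <- function(x, y, newx, B = 25) {
  votes <- replicate(B, {
    idx <- sample(length(x), replace = TRUE)           # bootstrap resample
    split <- mean(x[idx])                              # crude one-split "stump"
    left  <- names(which.max(table(y[idx][x[idx] <= split])))
    right <- names(which.max(table(y[idx][x[idx] >  split])))
    ifelse(newx <= split, left, right)                 # stump's vote per point
  })
  apply(matrix(votes, ncol = B), 1,                    # aggregate the B votes
        function(v) names(which.max(table(v))))
}
set.seed(42)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))       # two separated classes
y <- rep(c("A", "B"), each = 50)
pred <- bagged.predict(x, y, newx = c(-0.5, 3.5))
pred  # the aggregated vote classifies -0.5 as "A" and 3.5 as "B"
```

Real implementations grow full trees (e.g. with rpart or randomForest), but the resample-then-vote structure is the same.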

Visualizing Bagged Trees as Approximating Borders, Part 1

Visualizing Bagged Trees, Part 2

Conclusion: Other tree aggregation methods differ in how they grow trees, and some compute a weighted average. But in the end we can visualize the result of such an algorithm as borders between classified sets, shaped as connected perpendicular segments in this 2-dimensional case. In higher dimensions these become rectangular pieces of hyperplanes that are perpendicular to each other.


Contributed by Bin Lin. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Jan 11th to Apr 1st, 2016. This post is based on his second class project (due in the 4th week of the program).


Consumption patterns are an important driver of development in the industrialized world, and consumer price changes reflect the economic performance and household incomes of a country. This project focuses on food price changes. The goals of the project were:

  • Utilize Shiny for interactive visualization (the Shiny app is hosted at
  • Explore food price changes over time, from 1974 to 2015.
  • Compare food price changes to All-Items price changes (All-items include all consumer goods and services, including food).
  • Compare Consumer Food Price Changes vs. Producer Price Changes (producer price changes are the average change in prices paid to domestic producers for their output).




Consumer Food Price Changes Dataset:

  • Data dimension: 42 rows x 21 columns
  • Missing data: There are 2 missing values in the column of "Eggs"

Producer Price Changes Dataset:

  • Data dimension: 42 rows x 17 columns
  • Missing data: There are 25 missing values in the column of "Processed.fruits.vegetables".

Consumer Food Categories:

  • Data dimension: 20 rows x 2 columns

Data Analysis and Visualization:

Food Consumption Categories:

Food consumption is broken out into 20 categories. Among all of them, the categories with the highest shares of consumer expenditures are (see Figure 1 and Figure 2):

  • Food.away.from.Home (eating out): 40.9%
  • Other.foods: 10.5% (note this is the sum of the rest of the uncategorized foods)
  • Cereals.and.bakery.products: 8.0%
  • Nonalcoholic.beverages: 6.7%
  • Dairy.products: 6.3%
  • Beef.and.veal: 4.1%
  • Fresh.fruits: 4.0%

The high share of nonalcoholic beverages/soft drinks (6.7%) seems concerning, as high consumption of soft drinks may pose health risks.

Figure 1: Pie Chart on Food Categories Share of Consumer Expenditures

Figure 2: Bar Chart on Food Categories Share of Consumer Expenditures 

Food Price Changes over Time:

The Consumer Price Index (CPI) is a measure that examines average change over time in the prices paid by consumers for goods and services. It is calculated by taking price changes for each item in the predetermined basket of goods and averaging them; the goods are weighted according to their importance. Changes in CPI are used to assess price changes associated with the cost of living.
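The "weighted according to importance" step boils down to a weighted average of per-item price changes. A toy illustration with made-up numbers (these weights and changes are not from the dataset):

```r
# Toy CPI-style calculation: weight each item's price change by its basket share.
price.change <- c(eggs = 0.12, pork = -0.05, bread = 0.02)  # year-over-year
weight       <- c(eggs = 0.20, pork = 0.30,  bread = 0.50)  # basket shares, sum to 1
cpi.change <- sum(price.change * weight)
cpi.change  # 0.019, i.e. a +1.9% change in the index
```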

In the Shiny app, I created a line chart that shows price changes for different food categories, selected from a drop-down list. See Figure 3 for a screenshot of the different food categories' price changes over time.

As I was looking at the food price changes, I noticed a dramatic increase during the late 70s. After reviewing the history of the 1970s, I found that a lot happened during that period, including the "Great Inflation".

Figure 3: Screenshot of Different Food Categories Price Changes over Time

Yearly Food Category Price Changes:

To view the food price changes for each category in a given year, I created a bar chart in the Shiny app. Users can select a year with a slider, and the chart shows each category's price change for that year. I actually created two bar charts side-by-side so users can compare the food price changes between any two years.

A quick look at the year 2015 shows that the price of "Eggs" had the biggest increase, while the price of "Pork" dropped the most. In fact, many food categories dropped in price. Compared to 2015, the year 2014 had fewer categories with price drops; the price of "Beef and Veal" had the biggest increase.

Figure 4: Screenshot of Food Category Price Changes by Year

Food Price Changes vs All-Items Price Changes

The Consumer Price Index (CPI) for food is a component of the all-items CPI. That led me to compare the two. From the line chart, I observed:

  • Food price changes mostly align with all-items price changes.
  • Food price inflation has outpaced the economy-wide inflation in recent years.

Figure 5: Price Changes in All-Items vs Price Changes in Food

Food Price Changes vs Producer Price Changes

According to the United States Department of Agriculture (USDA), changes in farm-level and wholesale-level PPIs are of particular interest in forecasting food CPIs. Therefore, I created a chart showing overall food price changes vs. producer price changes. Users can choose one or more producer food categories.

The chart shows that food price changes mostly align with producer price changes. However, farm-level milk, cattle, and wheat prices have fluctuated since 2000 without affecting the overall food price change very much. Though their impact on overall food prices was small, I suspect they may have affected individual food categories. I would like to add a drop-down list to let users select categories from the consumer food categories.

Figure 6: Food Price Changes vs Producer Price Changes

Correlation Tile Map:

To see the relationship among the different categories in terms of price changes, I created a correlation tile map.
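A correlation tile map reduces to computing a correlation matrix and drawing it as colored tiles. A base-R sketch on the built-in mtcars data (the post used the food price change columns and, presumably, ggplot2's geom_tile):

```r
# Pairwise correlations; "pairwise.complete.obs" copes with missing values
# like those in the Eggs and Processed.fruits.vegetables columns.
m <- cor(mtcars[, 1:4], use = "pairwise.complete.obs")

# Base-R tile map of the correlation matrix.
image(seq_len(ncol(m)), seq_len(nrow(m)), m, axes = FALSE,
      xlab = "", ylab = "", main = "Correlation tile map")
axis(1, seq_len(ncol(m)), colnames(m))
axis(2, seq_len(nrow(m)), rownames(m))
```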


Food prices have been increasing, though by varying percentages. Since 1990, food price changes have stayed small from year to year. The degree of food price inflation varies depending on the type of food.

Looking ahead to 2016, ERS predicts food-at-home (supermarket) prices to rise 2.0 to 3.0 percent - a rate of inflation that remains in line with the 20-year historical average of 2.5 percent. For future work, I would like to fit a time-series model to predict the price changes for the coming five years.

Again, this project was done in Shiny, and most of the information in this blog post comes from the Shiny app.

Originally posted on Data Science Central


There are many ways to choose features from given data, and it is always a challenge to pick the ones with which a particular algorithm will work best. Here I consider data from monitoring the performance of physical exercises with wearable accelerometers, for example, wrist bands.

The data for this project come from this source:

In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants, who were asked to perform barbell lifts correctly (marked "A") and incorrectly with four typical mistakes (marked "B", "C", "D", and "E"). The goal of the project is to predict the manner in which they did the exercise.

There are 52 numeric variables and one classification variable, the outcome. We can plot density graphs, which are in effect smoothed-out histograms, for the first 6 features.

We can see that the data's behavior is complicated. Some of the features are bimodal or even multimodal. These properties could be caused by participants' different sizes or training levels, or by something else, but we do not have enough information to check. Nevertheless, it is clear that our variables do not follow a normal distribution, so we are better off with algorithms that do not assume normality, like trees and random forests. We can visualize how these algorithms work as finding vertical lines that divide the areas under the curves such that the areas to the right and to the left of the line differ significantly between outcomes.
There are a number of ways in functional analysis to distinguish functions analytically on an interval. The most suitable here seems to be the area between curves, which clearly should be scaled with respect to the size of the curves. For every feature I consider all pairs of density curves to find out whether they are sufficiently different. Here is my final criterion:
If some pair of curves for a feature satisfies the criterion, the feature is chosen for prediction. As a result I got 21 features for a random forests algorithm, which yielded 99% accuracy both for the model itself and on a validation set. I checked how many variables we would need for the same accuracy with PCA preprocessing, and the answer was 36. Mind you, those variables are scaled and rotated, and the same original 52 features are still used to construct them, so more effort is needed to construct a prediction and to explain it. With the method above it is easier, since areas under curves represent numbers of observations.
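The criterion itself is not reproduced in the post, but the scaled area between two density curves can be computed along these lines (my reading of the approach; the author's exact threshold is not shown, so none is asserted here):

```r
# L1 area between two kernel density estimates over a common grid.
# Each density integrates to ~1, so the result already lies in [0, 2]:
# near 0 for similar distributions, near 2 for well-separated ones.
density.distance <- function(x1, x2) {
  rng <- range(x1, x2)
  d1 <- density(x1, from = rng[1], to = rng[2], n = 512)
  d2 <- density(x2, from = rng[1], to = rng[2], n = 512)
  sum(abs(d1$y - d2$y)) * diff(d1$x[1:2])
}
set.seed(7)
far  <- density.distance(rnorm(200), rnorm(200, mean = 3))  # well separated
near <- density.distance(rnorm(200), rnorm(200))            # same distribution
c(far = far, near = near)
```

In practice one would compute this for every pair of per-class density curves of a feature and keep the feature if any pair exceeds the chosen threshold.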

Originally posted on Data Science Central

Analysis of Fuel Economy Data

Paul Grech

October 5, 2015

Contributed by Paul Grech. Paul took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from Sept 23 to Dec 18, 2015. This post is based on his first class project (due in the 2nd week of the program).


Analyse fuel economy ratings in the automotive industry.

Compare vehicle efficiency of American automotive manufacturer Cadillac with the automotive industry as a whole.

Sept 2014 - “We cannot deny the fact that we are leaving behind our traditional customer base,” de Nysschen said. “It will take several years before a sufficiently large part of the audience who until now have been concentrating on the German brands will find us in their consideration set.” Cadillac’s President - Johan de Nysschen

Compare vehicle efficiency of American automotive manufacturer Cadillac with its self-declared competition, the German luxury market.

What further comparisons will provide insight into EPA ratings?

Analysis Overview

  1. Automotive Industry
  2. Cadillac vs Automotive Industry
  3. Cadillac vs German Luxury Market
  4. Cadillac vs German Luxury Market by Vehicle Class

Importing the Data

Import the data and filter the rows needed for analysis. Then remove all zeros included in the city and highway MPG data, as these would skew results, replacing them with NA so that calculations are not performed on missing data.

# Load libraries, import data, and convert to a dplyr data frame
library(dplyr)
library(ggplot2)
FuelData <- read.csv("", stringsAsFactors = FALSE)
FuelData <- tbl_df(FuelData)

# Create data frame including information necessary for analysis
# (VClass is included because the class-level analysis below groups on it)
FuelDataV1 <- select(FuelData,
mfrCode, year, make, model,
engId, eng_dscr, cylinders, displ, sCharger, tCharger,
trans_dscr, trany, drive,
startStop, phevBlended,
city08, comb08, highway08, VClass)

# Replace zero values in MPG data with NA
FuelDataV1$city08[FuelDataV1$city08 == 0] <- NA
FuelDataV1$comb08[FuelDataV1$comb08 == 0] <- NA
FuelDataV1$highway08[FuelDataV1$highway08 == 0] <- NA

1: Automotive Industry

Visualize city and highway EPA ratings of the entire automotive industry.


How have EPA ratings for city and highway improved across the automotive industry as a whole?

Note: There is no need to include combined MPG, as combined is simply a percentage-based calculation, defaulting to a 60/40 split but adjustable on the website.
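For reference, a combined figure could be reconstructed from city and highway with a hypothetical helper like the one below. This sketch uses a weighted harmonic mean (the standard way to combine MPG figures, since fuel use per mile is what averages linearly) and the 60/40 split mentioned above; the EPA's official weighting details may differ.

```r
# Hypothetical helper: combined MPG as a weighted harmonic mean of city/highway.
combinedMPG <- function(city, hwy, city.share = 0.60) {
  1 / (city.share / city + (1 - city.share) / hwy)
}
round(combinedMPG(20, 30), 1)  # a 20/30 MPG car combines to roughly 23 MPG
```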

IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")

Comp.Ind <- rbind(IndCityMPG, IndHwyMPG)

ggplot(data = Comp.Ind, aes(x = year, y = MPG, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
ggtitle("Industry\n(city & highway MPG)")



Data visualization shows relatively poor EPA ratings throughout the 1980s, 1990s, and early-to-mid 2000s, with the first drastic improvement occurring around 2008. One significant event around this time was the recession hitting America. Consumers having less disposable income, along with increased oil prices, likely fueled competition to develop fuel-efficient powertrains across the automotive industry as a whole.

2: Cadillac vs Automotive Industry

Visualize Cadillac's city and highway EPA ratings with that of the automotive industry.


How does Cadillac perform when compared to the automotive industry as a whole?
IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")
CadCityMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "City")
CadHwyMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "Highway")

Comp.Ind.Cad <- rbind(IndCityMPG, IndHwyMPG, CadCityMPG, CadHwyMPG)

ggplot(data = Comp.Ind.Cad, aes(x = year, y = MPG, color = Label, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
scale_color_manual(name = "Cadillac / Industry", values = c("blue","#666666")) +
ggtitle("Cadillac vs Industry\n(city & highway MPG)")



Cadillac was chosen as a brand of interest because they are currently redefining their brand as a whole. It is important to analyze past performance to have a complete understanding of how Cadillac has been viewed for several decades.

In 2002, Cadillac dropped to its lowest performance. Why did this occur? Because the entire fleet was built around the same 4.6L V8 mated to a 4-speed automatic transmission, or as some would say... slush-box. The image Cadillac had at this time was of a retirement vehicle to be shipped to its owner's new retirement home in Florida, with a soft ride, smooth power delivery, and no performance. With the latest generation of Cadillacs being performance-oriented, beginning with the LS2-sourced CTS-V and now including the ATS-V and CTS-V along with several other V-Sport models, a rebranding is crucial in order to appeal to a new market of buyers.

Also interesting to note is that although an increasing number of performance models is being produced, fuel efficiency is not lacking. The gap noted above has decreased even as performance models have multiplied, a combination not often found to align.

3: Cadillac vs German Luxury Market

Cadillac has recently targeted the German luxury market consisting of the following manufacturers:
  • Audi
  • BMW
  • Mercedes-Benz


How does Cadillac perform when compared with the German Luxury Market?
# Calculate Cadillac average Highway / City MPG past 2000
CadCityMPG <- filter(CadCityMPG, year > 2000)
CadHwyMPG <- filter(CadHwyMPG, year > 2000)

# Calculate Audi average Highway / City MPG
AudCityMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "City")
AudHwyMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "Highway")

# Calculate BMW average Highway / City MPG
BMWCityMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "City")
BMWHwyMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "Highway")

# Calculate Mercedes-Benz average Highway / City MPG
MbzCityMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "City")
MbzHwyMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "Highway")

# Concatenate all Highway/City MPG data for:
# v.s. German Competitors
CompGerCadCity <- rbind(CadCityMPG, AudCityMPG, BMWCityMPG, MbzCityMPG)
CompGerCadHwy <- rbind(CadHwyMPG, AudHwyMPG, BMWHwyMPG, MbzHwyMPG)

ggplot(data = CompGerCadCity, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle("CITY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")

ggplot(data = CompGerCadHwy, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle(label = "HIGHWAY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")



Mr. Ellinghaus, a German who came to Cadillac in January from pen maker Montblanc International after more than a decade at BMW, said he has spent the past 11 months doing “foundational work” to craft an overarching brand theme for Cadillac’s marketing, which he says relied too heavily on product-centric, me-too comparisons.

“In engineering terms, it makes a lot of sense to benchmark the cars against BMW,” Mr. Ellinghaus said. But he added: “From a communication point of view, you must not follow this rule.”

Despite comments made by Mr. Ellinghaus, the end goal is for consumers to compare Cadillac with Audi, BMW, and Mercedes-Benz. The fact that this is already happening is a huge success for a company that, only ten years ago, would never have been mentioned in the same sentence as the German luxury market.

Data visualization shows that Cadillac is rated on par with its German competitors and, at the same time, has not had any significant dips, unlike all the other manufacturers. The continued increase in performance combined with the rebranding signify that Cadillac is on a path to success.

4: Cadillac vs German Luxury Market by Vehicle Class

Every manufacturer has its strengths and weaknesses. It is important to assess and recognize these attributes to best determine where an increase in R&D spending is needed and where to maintain a competitive advantage for the consumer by vehicle class.


In what vehicle class is Cadillac excelling or falling behind?
# Filter only Cadillac and german luxury market
German <- filter(FuelDataV1, make %in% c("Cadillac", "Audi", "BMW", "Mercedes-Benz"))
# Group vehicle classes into more generic classes
German$VClass <- ifelse(grepl("Compact", German$VClass, ignore.case = T), "Compact",
ifelse(grepl("Wagons", German$VClass), "Wagons",
ifelse(grepl("Utility", German$VClass), "SUV",
ifelse(grepl("Special", German$VClass), "SpecUV", German$VClass))))

# Focus on vehicle model years past 2000
German <- filter(German, year > 2000)
# The "Vans, Passenger Type" class is specific to one manufacturer and is not needed for this analysis
German <- filter(German, VClass != "Vans, Passenger Type")

IndClass <- filter(German, make %in% c("Audi", "BMW", "Mercedes-Benz")) %>%
group_by(VClass, year) %>%
summarize(AvgCity = mean(city08), AvgHwy = mean(highway08))
CadClass <- filter(German, make %in% c("Cadillac")) %>%
group_by(VClass, year) %>%
summarize(AvgCity = mean(city08), AvgHwy = mean(highway08))

##### Join tables #####
CadIndClass <- left_join(IndClass, CadClass, by = c("year", "VClass"))
CadIndClass$DifCity <- (CadIndClass$AvgCity.y - CadIndClass$AvgCity.x)
CadIndClass$DifHwy <- (CadIndClass$AvgHwy.y - CadIndClass$AvgHwy.x)

ggplot(CadIndClass, aes(x = year, ymax = DifCity, ymin = 0)) + 
geom_linerange(color = 'grey20', size = 0.5) +
geom_point(aes(y = DifCity), color = 'blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~ VClass) +
ggtitle("Cadillac vs German Luxury Market\n(city mpg by class)") +
xlab("Year") +
ylab("MPG Difference")

ggplot(CadIndClass, aes(x = year, ymax = DifHwy, ymin = 0)) + 
geom_linerange(color = 'grey20', size = 0.5) +
geom_point(aes(y = DifHwy), color = 'blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~ VClass) +
ggtitle("Cadillac vs German Luxury Market\n(highway mpg by class)") +
xlab("Year") +
ylab("MPG Difference")



The above data visualization displays the delta between Cadillac and the average fuel economy rating of its competitors (Audi, BMW, Mercedes-Benz). Positive values indicate Cadillac is above the competitive average; negative values, below it.

Cadillac underperforms across all vehicle classes. One possible reason is that the same powertrains are being used across multiple chassis.


Conclusion & Continued Analysis

  1. There is a clear improvement in EPA ratings as federal emission standards drive innovation for increased fleet fuel economy. It is important for automotive manufacturers to continue innovation and push for increased efficiency.
  2. The following areas offer opportunities for further research:
    • Drivetrain vs. MPG
    • Sales data
    • Consumer reaction to new marketing strategies
    • Consumer demand for product or badge

Originally posted on Data Science Central

Read more…

Guest blog post by Vincent Granville

There have been many variations on this theme - defining big data with 3 Vs (or more, including velocity, variety, volume, veracity, value), as well as other representations such as the data science alphabet.

Here's an interesting Venn diagram that tries to define statistical computing (a sub-field of data science) with 7 sets and 9 intersections:

It was published in a scholarly paper entitled Computing in the Statistics Curricula (PDF document). Enjoy!

Read more…

Learning R in Seven Simple Steps

Originally posted on Data Science Central

Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other R resources can be found here, and R Source code for various problems can be found here. A data science cheat sheet can be found here, to get you started with many aspects of data science, including R.

Learning R can be tricky, especially if you have no programming experience or are more familiar working with point-and-click statistical software versus a real programming language. This learning path is mainly for novice R users that are just getting started but it will also cover some of the latest changes in the language that might appeal to more advanced R users.

Creating this learning path was a continuous trade-off between being pragmatic and exhaustive. There are many excellent (free) resources on R out there, and unfortunately not all could be covered here. The material presented here is a mix of relevant documentation, online courses, books, and more that we believe is best to get you up to speed with R as fast as possible.

Data Video produced with R: click here and also here for source code and to watch the video. More here.

Here is an outline:

  • Step 0: Why you should learn R
  • Step 1: The Set-Up
  • Step 2: Understanding the R Syntax
  • Step 3: The core of R -> packages
  • Step 4: Help?!
  • Step 5: The Data Analysis Workflow
    • 5.1 Importing Data
    • 5.2 Data Manipulation
    • 5.3 Data Visualization
    • 5.4 The stats part
    • 5.5 Reporting your results
  • Step 6: Become an R wizard and discovering exciting new stuff

Step 0: Why you should learn R

R is rapidly becoming the lingua franca of Data Science. Having its origins in academia, you will spot it today in an increasing number of business settings as well, where it competes with commercial incumbents such as SAS, Stata and SPSS. R gains popularity every year, and in 2015 IEEE listed it among the top ten programming languages.

This implies that the demand for individuals with R knowledge is growing, and consequently learning R is definitely a smart career investment (according to this survey, R is even the highest-paying skill). This growth is unlikely to plateau in the coming years, with large players such as Oracle & Microsoft stepping up by including R in their offerings.

Nevertheless, money should not be the only driver when deciding to learn a new technology or programming language. Luckily, R has a lot more to offer than a solid paycheck. By engaging yourself with R, you will become familiar with a highly diverse and interesting community. R is being used for a diverse set of tasks in finance, genomic analysis, real estate, paid advertising, and much more, and all these fields are actively contributing to the development of R. You will encounter a diverse set of examples and applications on a daily basis, keeping things interesting and giving you the ability to apply your knowledge to a diverse range of problems.

Have fun!

Step 1: The Set-Up

Before you can actually start working in R, you need to download a copy of it on your local computer. R is continuously evolving and different versions have been released since R was born in 1993 with (funny) names such as World-Famous Astronaut and Wooden Christmas-Tree. Installing R is pretty straightforward and there are binaries available for Linux, Mac and Windows from the Comprehensive R Archive Network (CRAN).

Once R is installed, you should consider installing one of R’s integrated development environments as well (although you could also work with the basic R console if you prefer). Two fairly established IDEs are RStudio and Architect. In case you prefer a graphical user interface, you should check out R-commander.

Step 2: Understanding the R Syntax

Learning the syntax of a programming language like R is very similar to the way you would learn a natural language like French or Spanish: by practice & by doing. One of the best ways to learn R by doing is through the following (online) tutorials:

Next to these online tutorials there are also some very good introductory books and written tutorials to get you started:

Step 3: The core of R -> packages

Every R package is simply a bundle of code that serves a specific purpose and is designed to be reusable by other developers. In addition to the primary codebase, packages often include data, documentation, and tests. As an R user, you can simply download a particular package (some are even pre-installed) and start using its functionalities. Everyone can develop R packages, and everyone can share their R packages with others.

The above is an extremely powerful concept and one of the key reasons R is so successful as a language and as a community. Namely, you don’t need to do all the hard-core programming yourself or understand every complex detail of a particular algorithm or visualization. You can simply use the out-of-the-box functions that come with the relevant package as an interface to such functionalities. As such, it is useful to have an understanding of R’s package ecosystem.

Many R packages are available from the Comprehensive R Archive Network, and you can install them using the install.packages function. What is great about CRAN is that it associates packages with particular tasks via Task Views. Alternatively, you can find R packages on Bioconductor, GitHub and Bitbucket.

Looking for a particular package and its documentation? Try Rdocumentation, where you can easily search packages from CRAN, GitHub and Bioconductor.

Step 4: Help?!

You will quickly find out that for every R question you solve, five new ones will pop-up. Luckily, there are many ways to get help:

  • Within R you can make use of its built-in help system. For example, the command `?plot` will provide you with the documentation on the plot function.
  • R puts a big emphasis on documentation. The previously mentioned Rdocumentation is a great website to look at the documentation of different packages and functions.
  • Stack Overflow is a great resource for seeking answers to common R questions or for asking questions yourself.
  • There are numerous blogs & posts on the web covering R, such as KDnuggets and R-bloggers.

Step 5: The Data Analysis Workflow

Once you have an understanding of R’s syntax, the package ecosystem, and how to get help, it’s time to focus on how R can be useful for the most common tasks in the data analysis workflow.

5.1 Importing Data

Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import all sorts of data formats into R; the hard part is that different types often need a different approach:

If you want to learn more about how to import data into R, check the online Importing Data into R tutorial or this post on data importing.

5.2 Data Manipulation

Performing data manipulation with R is a broad topic as you can see in for example this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations:

  • The tidyr package for tidying your data.
  • The stringr package for string manipulation.
  • When working with data-frame-like objects, it is best to make yourself familiar with the dplyr package (try this course). However, in case of heavy data wrangling tasks, it makes more sense to check out the blazingly fast data.table package (see this syntax cheatsheet for help).
  • When working with times and dates install the lubridate package which makes it a bit easier to work with these.
  • Packages like zoo, xts and quantmod offer great support for time series analysis in R.

5.3 Data Visualization

One of the main reasons R is the favorite tool of data analysts and scientists is its data visualization capabilities. Tons of beautiful plots are created with R, as shown by all the posts on FlowingData, such as this famous Facebook visualization.

Credit card fraud scheme featuring time, location, and loss per event, using R: click here for source

If you want to get started with visualizations in R, take some time to study the ggplot2 package, one of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 makes intensive use of the grammar of graphics, and as a result is very intuitive to use (you’re continuously building up parts of your graphs, so it’s a bit like playing with Lego). There are tons of resources to get you started, such as this interactive coding tutorial, a cheatsheet and an upcoming book by Hadley Wickham.

Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:

If you want to see more packages for visualizations see the CRAN task view. In case you run into issues plotting your data this post might help as well.

Next to the “traditional” graphs, R is able to handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Map with a package such as ggmap. Other great packages are choroplethr, developed by Ari Lamstein of Trulia, and the tmap package. Take this tutorial on Introduction to visualising spatial data in R if you want to learn more.

5.4 The stats part

In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:

Note that these resources are aimed at beginners. If you want to go more advanced, there are multiple resources for machine learning with R. Books such as Mastering Machine Learning with R and Machine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice them. Furthermore, there are some very interesting blogs to kickstart your ML knowledge, like Machine Learning Mastery or this post.

5.5 Reporting your results

One of the best ways to share your models, visualizations, etc. is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner through HTML, Word, PDF, ioslides, etc. This 4-hour tutorial on Reporting with R Markdown explains the basics of R Markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.

Step 6: Become an R wizard and discovering exciting new stuff

R is a fast-evolving language. Its adoption in academia and business is skyrocketing, and consequently the rate of new features and tools within R is rapidly increasing. These are some of the new technologies and packages that excite us the most:

Once you have some experience with R, a great way to level up your R skillset is the free book Advanced R by Hadley Wickham. In addition, you can start practicing your R skills by competing with fellow Data Science Enthusiasts on Kaggle, an online platform for data-mining and predictive modelling competitions. Here you have the opportunity to work on fun cases such as this titanic data set.

To end, you are now probably ready to start contributing to R yourself by writing your own packages. Enjoy!

DSC Resources

Additional Reading

Follow us on Twitter: @DataScienceCtrl | @AnalyticBridge

Read more…

New book on data mining and statistics

New book:

Numeric Computation and Statistical Data Analysis on the Java Platform (by S.Chekanov)

710 pages. Springer International Publishing AG. 2016. ISBN 978-3-319-28531-3.


About this book: Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language.

Originally posted on Data Science Central

Read more…

Dealing with Outliers is like searching a needle in a haystack

This is a guest repost by Jacob Joseph.

An outlier is an observation or point that is distant from other observations/points. But how would you quantify the distance of an observation from other observations to qualify it as an outlier? Outliers are also referred to as observations whose probability of occurring is low. But, again, what constitutes "low"?

There are parametric and non-parametric methods for identifying outliers. Parametric methods involve the assumption of some underlying distribution, such as the normal distribution, whereas there is no such requirement with the non-parametric approach. Additionally, you could do a univariate analysis, studying a single variable at a time, or a multivariate analysis, studying more than one variable at the same time, to identify outliers.

Which approach and which analysis is the right answer? Unfortunately, there is no single right answer. It depends on the end purpose of identifying such outliers. You may want to analyze the variable in isolation, or use it among a set of variables to build a predictive model.

Let’s try to identify outliers visually.

Assume we have data on Revenue and Operating System of mobile devices for an app. Below is a subset of the data:

How can we identify outliers in the Revenue?

We shall try to detect outliers using parametric as well as non-parametric approach.

Parametric Approach

Comparison of Actual, Lognormal and Normal Density Plot

The x-axis in the above plot represents Revenue, and the y-axis the probability density of the observed Revenue values. The density curve for the actual data is shaded in pink, the normal distribution in green, and the log-normal distribution in blue. The probability density for the actual distribution is calculated from the observed data, whereas for both the normal and log-normal distributions it is computed from the observed mean and standard deviation of the Revenues.

Outliers can be identified by calculating the probability of the occurrence of an observation, or by calculating how far the observation is from the mean. For example, in the case of a normal distribution, observations more than three standard deviations from the mean could be classified as outliers.
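The standard-deviation rule just described can be sketched in a few lines; here in Python for concreteness, with invented revenue figures:

```python
import statistics

def sigma_outliers(values, k=3):
    """Flag values more than k standard deviations from the mean."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > k * sd]

# invented revenue figures with one extreme value
revenues = [120, 135, 128, 140, 122, 131, 5000]
print(sigma_outliers(revenues, k=2))  # [5000]
```

Note a known weakness of this rule: with the usual k = 3, the extreme point inflates the standard deviation enough to mask itself in this small sample, which is why k = 2 is used in the call above.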

In the above case, if we assume a normal distribution, there could be many outlier candidates, especially among observations with revenue beyond 60,000. The log-normal plot does a better job than the normal distribution, but only because the underlying actual distribution has the characteristics of a log-normal distribution. This cannot be generalized, since determining the underlying distribution or its parameters beforehand (a priori) is extremely difficult. One could infer the parameters by fitting a curve to the data, but a change in the underlying parameters like the mean and/or standard deviation due to new incoming data will change the location and shape of the curve, as observed in the plots below:

Comparison of density plots for changes in mean and standard deviation: Normal distribution
Comparison of density plots for changes in mean and standard deviation: Log-normal distribution

The above plots show the shift in location or the spread of the density curve based on an assumed change in mean or standard deviation of the underlying distribution. It is evident that a shift in the parameters of a distribution is likely to influence the identification of outliers.

Non-Parametric Approach

Let’s look at a simple non-parametric approach like a box plot to identify the outliers.

Non Parametric approach to detect outlier with box plots (univariate approach)

In the box plot shown above, we can identify 7 observations, which could be classified as potential outliers, marked in green. These observations are beyond the whiskers. 
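The whisker rule behind a box plot (points beyond 1.5 × IQR past the quartiles) can be sketched as follows; the numbers are invented for illustration, and the quartiles use a simple median-of-halves convention (conventions vary between tools):

```python
def iqr_outliers(values, k=1.5):
    """Flag points beyond the box-plot whiskers: k * IQR past the quartiles.
    Quartiles here use a simple median-of-halves convention."""
    s = sorted(values)
    n = len(s)

    def median(xs):
        m = len(xs)
        return xs[m // 2] if m % 2 else (xs[m // 2 - 1] + xs[m // 2]) / 2

    q1 = median(s[: n // 2])          # median of the lower half
    q3 = median(s[(n + 1) // 2:])     # median of the upper half
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# invented revenue figures
print(iqr_outliers([10, 12, 11, 13, 12, 14, 11, 90]))  # [90]
```

Unlike the standard-deviation rule, the fences here depend on quartiles, so a single extreme point cannot mask itself.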

In the data, we have also been provided information on the OS. Would we identify the same outliers if we plot Revenue by OS?

Non Parametric approach to detect outlier with box plots (bivariate approach)

In the above box plot, we are doing a bivariate analysis, taking two variables at a time, which is a special case of multivariate analysis. It seems that there are 3 outlier candidates for iOS whereas there are none for Android. This is due to the difference in the distribution of Revenues between Android and iOS users. So, analyzing the Revenue variable on its own (univariate analysis), we identified 7 outlier candidates, which dropped to 3 when a bivariate analysis was performed.

Both the parametric and the non-parametric approach can be used to identify outliers, depending on the characteristics of the underlying distribution. If the mean accurately represents the center of the distribution and the data set is large enough, the parametric approach can be used; if the median represents the center of the distribution, the non-parametric approach is more suitable.

Dealing with outliers in a multivariate scenario becomes all the more tedious. Clustering, a popular data mining technique and a non-parametric method, can be used to identify outliers in such a case.
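As a rough sketch of the clustering idea (one scheme among many, not necessarily what the author used): run k-means and treat points that end up in very small clusters as outlier candidates. Everything below — the data, k, and the minimum cluster size — is invented for illustration:

```python
import math

def kmeans_assign(points, k, iters=10):
    """Tiny Lloyd's algorithm: returns the cluster label of each point.
    Naive init (first k points) for illustration; real implementations
    use smarter seeding such as k-means++."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid
        for j, p in enumerate(points):
            labels[j] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # update step: mean of each cluster's members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return labels

def small_cluster_outliers(points, k=3, min_size=2):
    """Points landing in clusters smaller than min_size are outlier candidates."""
    labels = kmeans_assign(points, k)
    sizes = {c: labels.count(c) for c in set(labels)}
    return [p for p, lab in zip(points, labels) if sizes[lab] < min_size]

# two tight groups plus one isolated point (invented data)
pts = [(0, 0), (10, 10), (50, 50), (1, 0), (0, 1), (11, 10), (10, 11)]
print(small_cluster_outliers(pts))  # the isolated (50, 50) forms its own cluster
```

Distance-to-centroid thresholds are another common variant; either way, no distributional assumption is required, which is what makes the approach non-parametric.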

Originally posted on Data Science Central

Read more…

Guest blog post by ahmet taspinar

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in the test set (a dataset whose entries have not yet been labelled) with the model constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP), the field of science whose goal is to make machines (computers) understand (written) human language. You could think of Text Categorization, Sentiment Analysis, Spam detection and Topic Categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it assumes conditional independence of its features. This simplification makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to lower accuracy. A direct improvement on the NB classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly.

This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

1. Regression Analysis

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let's say we have a dataset containing n datapoints: X = ( x^{(1)}, x^{(2)}, .., x^{(n)} ). For each of these (input) datapoints there is a corresponding (output) y^{(i)}-value. Here the x-datapoints are called the independent variables and y the dependent variable; the value of y^{(i)} depends on the value of x^{(i)}, while the value of x^{(i)} may be freely chosen without any restriction imposed on it by any other variable.
The goal of Regression analysis is to find a function f(X) which can best describe the correlation between X and Y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_{\theta}(x).




If we can find such a function, we can say we have successfully built a Regression model. If the input data lives in a 2D space, this boils down to finding a curve which fits through the datapoints. In the 3D case we have to find a plane, and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset Y which contains the final grade of n students. Dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on studying time. The variable x^{(i)} therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:


[Figures: regression_left2, regression_right2]

If the results look like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between Y and X at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes it.


This function could for example be:

h_{\theta}(X) = \theta_0+ \theta_1 \cdot x


h_{\theta}(X) = \theta_0 + \theta_1 \cdot x^2

where \theta are the dependent parameters of our model.
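For the first (linear) hypothesis, the parameters \theta_0 and \theta_1 have a closed-form least-squares solution: \theta_1 is the covariance of x and y divided by the variance of x, and \theta_0 = \bar{y} - \theta_1 \bar{x}. A minimal sketch in Python, with invented study-hours data:

```python
def fit_linear(xs, ys):
    """Closed-form least-squares fit of h(x) = theta0 + theta1 * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    theta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
              / sum((x - mx) ** 2 for x in xs))
    theta0 = my - theta1 * mx
    return theta0, theta1

# invented data: grades lying exactly on grade = 2 + 0.5 * hours
hours = [2, 4, 6, 8, 10]
grades = [3.0, 4.0, 5.0, 6.0, 7.0]
t0, t1 = fit_linear(hours, grades)
print(t0, t1)  # recovers theta0 = 2.0, theta1 = 0.5
```

For the quadratic hypothesis (or more parameters) the same idea generalizes to solving the normal equations, which is what library routines do under the hood.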


1.1. Multivariate Regression

Evaluating the results from the previous section, we may find them unsatisfying: the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete. Taking only the studying time into account is not enough. The final grade does not only depend on studying time, but also on how much the students slept the night before the exam. Now the dataset contains an additional variable representing sleeping time. Our dataset is then given by X = ( (x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), .., (x_1^{(n)}, x_2^{(n)}) ). In this dataset x_1^{(i)} indicates how many hours student i has studied and x_2^{(i)} indicates how many hours he has slept.

See the rest of the blog here, including Linear vs Non-linear, Gradient Descent, Logistic Regression, and Text Classification and Sentiment Analysis.

Read more…

The Curse of #DataViz

With the wide array of amazing data visualization tools out there - SAP Lumira, Qlik, Domo, Tableau - you would think that the world has moved towards a graphical understanding of reality.

Yet my experience has been that when people are faced with the task of using a fancy new graph or visualization in their work, their first reaction is… to freak out. Now bear with me. I am not implying that most people can’t handle a nice bar chart or a three-dimensional pie chart or even some animated multi-colored craziness.

What I am saying is that most people don’t know where to go from a data visualization. Are they supposed to take a screenshot and send it to their boss? Should they print it out? Should they click the buttons for 10 hours until it gets boring? Should they make a decision?

Ok - ideally they should make a decision. But what if that person is not empowered to make decisions, or if they need to check with their boss first? Also, what if they don’t trust the data behind the graphics? They are taking a pretty big bet when they write that email to the whole department saying “hey this chart shows we should be doing this!”

In a way, this is one representation of a phenomenon which will become more and more prevalent as computers make recommendations to humans and we have to follow up with the so-called “human assist” - that is, the final point in the decision-making process.

The visualization can only take you so far, but you need to drink the water. 

Originally posted on Data Science Central

Read more…

Guest blog post by Takashi J. OZAKI

I wrote a blog post inspired by Jamie Goode's book "Wine Science: The Application of Science in Winemaking".

In this book, Goode argued that a reductionist approach cannot explain the relationship between chemical ingredients and the taste of wine. Indeed, we know that not all high-alcohol wines are excellent, although in general high-alcohol wines are believed to be good. Usually the taste of wine is affected by a complicated balance of many components such as sweetness, acid, tannin, density and others that are given by corresponding chemical entities.

However, I think (and probably many other data science experts agree) that this is not a limitation of the reductionist approach, but a limitation of univariate modeling. To illustrate this, I performed a series of multivariate modeling with random forest and other models on the "Wine Quality" dataset from the UCI Machine Learning Repository.

As a result, a random forest classifier predicted the tasting score of wine better than intuitive univariate modeling. At the same time, it also revealed some hidden and complicated dynamics between chemical ingredients and the taste of wine. I believe that modern multivariate modeling such as machine learning can reveal more of this complicated relationship.

See my blog post below for more details.

Read more…

Guest blog post by Denis Rasulev

An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.




The original post covers a lot more detail, and for those who want to pursue more analysis on their own, everything in the post - the data, software, and code - is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.

Here is a link to the original post: link

Read more…

Principal Component Analysis using R

Guest blog post by suresh kumar gorakala

Curse of Dimensionality:

One of the most commonly faced problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional and sparse data. Many times we face a situation where we have a large set of features and fewer data points, or data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power. This is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality.

In this blog, we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Principal component analysis:

Consider the following scenario:

The data we want to work with is in the form of an m×n matrix A, shown below, where A[i,j] represents the value of the i-th observation of the j-th variable.


Thus the n columns of the matrix can be identified with the n variables, each corresponding to an m-dimensional vector of observations. If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.

Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance under any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

When applied, the algorithm linearly transforms the m-dimensional input space to an n-dimensional (n < m) output space, with the objective of minimizing the amount of information/variance lost by discarding (m−n) dimensions. PCA allows us to discard the variables/features that have low variance.

Technically speaking, PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component.

In the above image, u1 and u2 are principal components: u1 accounts for the highest variance in the dataset, and u2 accounts for the next highest variance and is orthogonal to u1.
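To make this concrete, here is a minimal numerical sketch (in Python with synthetic data; the post itself works in R): projecting centered data onto the eigenvectors of its covariance matrix yields uncorrelated components whose variances are the eigenvalues, ordered from largest to smallest.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations of 3 correlated variables
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.1]])

Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # covariance matrix of the variables
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # sort components by variance, descending
components = eigvecs[:, order]

scores = Xc @ components                 # project data onto principal components
variances = scores.var(axis=0, ddof=1)
print(variances)                         # decreasing: PC1 has the largest variance
```

Dropping all but the first k columns of `scores` is exactly the dimensionality reduction described above.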

PCA implementation in R:

For today’s post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.

142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.8 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9.9 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
182.88 185.42 187.96 190.5 193.04 195.58
9.4 0 0 0 0 0 0
9.5 0 0 0 0 0 0
9.6 0 0 0 0 0 0
9.7 0 0 0 0 0 0
9.8 0 0 0 0 0 0
9.9 0 0 0 0 0 0
[1] 42 22
'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

[1] 3000

[1] "142.24" "144.78" "147.32" "149.86" "152.4" "154.94" "157.48" "160.02" "162.56" "165.1" "167.64" "170.18" "172.72" "175.26" "177.8" "180.34"
[17] "182.88" "185.42" "187.96" "190.5" "193.04" "195.58"

Let us use apply() on the crimtab dataset column-wise to calculate the variance and see how much each variable varies.


We observe that column “165.1” contains the maximum variance in the data. Next, we apply PCA using prcomp().

pca = prcomp(crimtab)

Note: the resulting pca object from the above code contains the standard deviations of the components and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the remaining components. The rotation is the matrix of principal component loadings, which gives the contribution of each variable to each principal component.
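As a rough sketch of what prcomp() computes (illustrated here in Python on made-up data: prcomp() essentially performs an SVD of the centered data, so pca$sdev corresponds to the singular values scaled by 1/sqrt(n-1) and pca$rotation to the right singular vectors), the squared standard deviations, normalized, give each component's share of the total variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 100 observations of 4 variables with very different scales
X = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

Xc = X - X.mean(axis=0)
# SVD of the centered data, analogous to what prcomp() does internally
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
sdev = s / np.sqrt(Xc.shape[0] - 1)      # analogous to pca$sdev
rotation = Vt.T                          # analogous to pca$rotation (loadings)

prop_var = sdev**2 / np.sum(sdev**2)     # proportion of variance per component
print(np.round(prop_var, 3))             # decreasing, sums to 1
```

The first entry of `prop_var` is the share of total variance captured by PC1, which is what the bar chart below visualizes.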

Let’s plot all the principal components and see how the variance is accounted with each component.

par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for the maximum information.
Let us interpret the results of PCA using a biplot, which shows the contribution of each variable along the two principal components.

# The code below flips the signs of the biplot; if we do not include
# the two lines below, the plot will be a mirror image of the one shown.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which represent how the feature space varies along the principal component vectors.

From the plot, we can see that the first principal component vector, PC1, places more or less equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than with the 160.02 and 162.56 features.

The second principal component, PC2, places more weight on 160.02 and 162.56, which are less correlated with the three features 165.1, 167.64, and 170.18.

Complete Code for PCA implementation in R: 

data("crimtab")          # load data
head(crimtab)            # show sample data
dim(crimtab)             # check dimensions
str(crimtab)             # show structure of the data
apply(crimtab, 2, var)   # check the variance across the variables
pca = prcomp(crimtab)    # apply principal component analysis on crimtab data
par(mar = rep(2, 4))
plot(pca)                # plot to show component importance
# The two lines below flip the signs of the biplot; without them
# the plot would be a mirror image of the one shown above.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)   # plot pca components using a biplot in R

By now we understand how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? In our next post we shall answer these questions.

Originally posted here


Guest blog post by Jean Villedieu

Following the Mediator scandal, France adopted a Sunshine Act in 2011. For the first time, we have data on the presents and contracts awarded to health care professionals by pharmaceutical companies. Can we use graph visualization to understand these dangerous ties?

Dangerous ties

Pharmaceutical companies in France and in other countries use presents and contracts to influence the prescriptions of health care professionals. This has posed ethical problems in the past.

In France, 21 people are currently being prosecuted for their role in the scandal around Mediator, a drug that was recently banned. Some of them are accused of having helped the drug’s manufacturer obtain an authorization to sell it, and later fight its ban, in exchange for money.

In the US, GlaxoSmithKline was ordered to pay $3 billion in the largest health-care fraud settlement in US history. Before the settlement, GlaxoSmithKline had paid various experts to fraudulently market the benefits of its drugs.

Such problems arose in part because of a lack of transparency in the ties between pharmaceutical companies and health-care professionals. With open data now available can we change this?

Moving the data to Neo4j

Regards Citoyens, a French NGO, parsed various sources to build the first database documenting the financial relationships between health care providers and pharmaceutical manufacturers.

That database covers a period from January 2012 to June 2014. It contains 495 951 health care professionals (doctors, dentists, nurses, midwives, pharmacists) and 894 pharmaceutical companies. The contracts and presents represent a total of 244 572 645 €.

The original data can be found on the Regards Citoyens website.

The data is stored in one large CSV file. We are going to use graph visualization to understand the network formed by the financial relationships between pharmaceutical companies and health care professionals.

First we need to move the data into a Neo4j graph database with an import script (sunshine_import.cql).

Now the data is stored in Neo4j as a graph (download it here). It can be searched, explored and visualized through Linkurious.

Unfortunately, names in the data have been anonymized by Regards Citoyens following pressure from the CNIL (the French Commission nationale de l’informatique et des libertés).

Who is Sanofi giving money to?

Let’s start our data exploration with Sanofi, the biggest French pharmaceutical company. If we search for Sanofi through Linkurious, we can see that it is connected to 57 765 professionals. Let’s focus on the 20 of Sanofi’s contacts who have the most connections.


Sanofi’s top 20 connections.


Among these entities there are 19 doctors in general medicine and one student. We can quickly grasp which professions Sanofi is targeting by coloring the health care professionals according to their profession:


19 doctors among Sanofi’s top 20 connections.

In a click, we can filter the visualization to focus on the doctors. We are now going to color them according to their region of origin.


Region of origin of Sanofi’s 19 doctors.

Indirectly, the health care professionals Sanofi connects to via presents also tell us about its competitors. Let’s look at who else has given presents to the health care professionals befriended by Sanofi.


Sanofi’s contacts (highlighted in red) are also in touch with other pharmaceutical companies.


Zooming in, we can see Sanofi is at the center of a very dense network, next to Bristol-Myers Squibb, Pierre Fabre, Lilly and AstraZeneca, for example. According to the Sunshine dataset, Sanofi is competing with these companies.

We can also see an interesting node. It is a student who has received presents from 104 pharmaceutical companies including companies that are not direct competitors of Sanofi.


A successful student.

Why has he received so much attention? Unfortunately all we have is an ID (02b0d3726458ef46682389f2ac7dc7af).

Sanofi could identify the professionals its competitors have targeted and perhaps target them too in the future.

Who has received the most money from pharmaceutical companies in France?

Neo4j includes a graph query language called Cypher. Through Cypher we can compute complex graph queries and get results in seconds.

We can for example identify the doctor who has received the most money from pharmaceutical companies:

//Doctor who has received the most money

The doctor behind the ID 2d92eb1e795f7f538556c59e48aaa7c1 has received 77 480€ from 6 pharmaceutical companies.
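Outside Neo4j, the same aggregation can be sketched in plain Python on a hypothetical miniature of the CSV (the column names here are invented for illustration and differ from those in the real file):

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of the Sunshine CSV; the real column names differ
raw = """professional_id,company,amount_eur
doc1,A,50000
doc1,B,27480
doc2,A,1200
doc3,C,300
doc2,C,800
"""

# Sum the amounts per health care professional
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["professional_id"]] += float(row["amount_eur"])

# Pick the professional with the highest total
top = max(totals, key=totals.get)
print(top, totals[top])   # doc1 77480.0
```

Cypher expresses the same sum-and-rank logic declaratively over the graph, without the manual loop.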


The relationships are colored according to the money they represent. St Jude Medical has given over 70 231€ to Dr 2d92eb1e795f7f538556c59e48aaa7c1.

Perhaps next time they receive a prescription from Dr 2d92eb1e795f7f538556c59e48aaa7c1, his patients would like to know about his relationship with St Jude Medical. Unfortunately today the Sunshine data is anonymous.

We can also find the most generous pharmaceutical company.

//Company which has distributed the most money
RETURN a, sum(r.totalDECL) as total

Novartis Pharma has awarded 12 595 760€ to various entities.


The 5 entities receiving the most money from Novartis.


When we look closer, we can see that the 5 entities which have received the most money from Novartis Pharma are 5 NGOs.


24f3287da6ab125862249416bc91f9c4 has received 75 000€.

Come meet us at GraphConnect in London, the biggest graph event in Europe. It is sponsored by Linkurious and you can use “Linkurious30” to register and get a 30% discount!

The Sunshine dataset offers a rare glimpse into the practice of pharmaceutical companies and how they use money to influence the behavior of health care professionals. Unfortunately for citizens looking for transparency, the data is anonymized. Perhaps it will change in the future?


Guest blog post by Laetitia Van Cauwenberge

This article focuses on cases such as Facebook and protein interaction networks. It was written by Paul Scherer (paulmorio) and submitted as a research paper to HackCambridge. What makes this article interesting is that it compares five clustering techniques for this type of problem:

  • K-Clique Percolation - a clique-merging algorithm. Given an integer k, the algorithm finds k-cliques and merges (percolates) them into clusters as necessary.
  • MCODE - a seed-growth approach to finding dense subgraphs.
  • DPClus - a seed-growth approach to finding dense subgraphs, similar to MCODE, but with an internal representation of edge weights and a different stopping condition.
  • IPCA - a modified DPClus algorithm which focuses on maintaining the diameter of a cluster (defined as the maximum shortest distance between all pairs of vertices) rather than its density.
  • CoAch - a combined approach that first finds a small number of cliques as complex cores and then grows them.
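The k-clique percolation idea in the first bullet can be sketched in a few lines of plain Python; this is a toy implementation with brute-force clique enumeration for small graphs, not the article's code:

```python
from itertools import combinations

def k_clique_communities(edges, k):
    """K-clique percolation: enumerate k-cliques, then merge cliques
    sharing k-1 nodes into communities."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Brute-force enumeration of k-cliques (fine for small graphs)
    cliques = [c for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: merge cliques that overlap in k-1 nodes
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(set(cliques[i]) & set(cliques[j])) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, c in enumerate(cliques):
        communities.setdefault(find(i), set()).update(c)
    return list(communities.values())

# Two triangles sharing an edge percolate into one community of 4 nodes;
# a separate triangle forms its own community.
edges = [(0, 1), (1, 2), (0, 2), (1, 3), (2, 3), (4, 5), (5, 6), (4, 6)]
print(k_clique_communities(edges, 3))
```

The seed-growth methods (MCODE, DPClus, IPCA) differ mainly in that they expand from a dense seed vertex instead of enumerating cliques up front.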

The article also provides great visualizations such as the one below:

In the original article, these visualizations are interactive, and you will find out which software was used to produce them.

Below is the summary (written by the original author):


For my submission to HackCambridge I wanted to spend my 24 hours learning something new in accordance with my interests. I was recently introduced to protein interaction networks in my Bioinformatics class, and during my review of machine learning techniques for an exam I noticed that we study many supervised methods, but no unsupervised methods other than k-means clustering. Thus I decided to combine the two interests by clustering protein interaction networks with unsupervised clustering techniques and communicating my learning, results, and visualisations using the Beaker notebook.

The study of protein-protein interactions (PPIs) determined by high-throughput experimental techniques has created large sets of interaction data and a new need for methods that allow us to discover new information about biological function. These interactions can be thought of as a large-scale network, with nodes representing proteins and edges signifying an interaction between two proteins. In a PPI network, we can potentially find protein complexes or functional modules as densely connected subgraphs. A protein complex is a group of proteins that interact with each other at the same time and place, creating a quaternary structure. Functional modules are composed of proteins that bind each other at different times and places and are involved in the same cellular process. Various graph clustering algorithms have been applied to PPI networks to detect protein complexes or functional modules, including several designed specifically for PPI network analysis. A select few of the most famous and recent topological clustering algorithms were implemented based on descriptions from papers and applied to PPI networks. Upon completion it was recognized that it is possible to apply these to other interaction networks, such as friend groups on social networks, site maps, or transportation networks, to name a few.

I decided to use Graphistry's GPU cluster to visualize the large networks, with the kind permission of Dr. Meyerovich (otherwise I would likely not have finished on time, given the specs of my machine), and to communicate my results and learning process.

The full version, with mathematical formulas, detailed descriptions, and source code, can be found here. For more articles about clustering, click here.


Big Data Insights - IT Support Log Analysis

Guest blog post by Pradeep Mavuluri

This post brings to the audience a few glimpses (strictly glimpses) of the insights that were obtained when predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of their IT support system. As quick background, a large organization was interested in value-added, actionable insights from the thousands of records logged in the past, as they saw expenses increase with no gain in productivity.


As most of us know, in these business scenarios end users are most interested in the out-of-the-ordinary, strange and unusual things that may not be captured by regular reports. Hence the data scientist's job does not end at finding un-routine insights; one also needs to dig deeper for their root causes and suggest the best possible actions for immediate remedy (knowledge of the domain or of other best practices in the industry helps a lot). Further, as mentioned earlier, only a few of those insights are shown and discussed here, and all of the analysis was carried out using the R programming language: R 3.2.2, RStudio (favorite IDE), and the ggplot2 package for plotting.

The first graph (below) is a time series calendar heat map adapted from Paul Bleicher; it shows the number of tickets raised day-wise over every week of each month for the last year (green and its light shades represent low numbers, whereas red and its shades represent high numbers).


If one carefully observes the above graph, it is evident that, except for the months of April and December, all months show a sudden increase in the number of tickets raised on the last Saturdays and Sundays; this is most clearly visible at the quarter ends of March, June and September (and also in November, which is not a quarter end). One can think of this as unusual behavior, since the numbers rise on non-working days. Before going into further details, let's also look at one more graph (below), which depicts solved duration in minutes on the x-axis, with a horizontal timeline plot for the time taken by each category.

The above solved-duration plot shows us that, of all the records analyzed, 71.87% belong to the "Request for Information" category and have been solved within a few minutes of the tickets being raised (which is why we cannot see a line plot for this category, in contrast to the others). What actually happened here was a kind of gaming of the system, made possible by a lack of automation. In simple words, it was found that proper documentation/guidance did not exist for many of the applications in use; this situation was exploited to inflate the number of tickets (i.e., pushing tickets even for basic information at month ends and quarter ends, which resulted in month-end openings that in turn were closed immediately). The insight discussed here is one among many that were presented, together with possible immediate remedies that are easily actionable.

Visual Summarization:


Original Post


Guest blog post by Petr Travkin

Part 1. Business scenarios.

I have spent many hours planning and executing an in-company self-service BI implementation, which enabled me to gain several insights. Now that the ideas have become mature enough and field-proven, I believe they are worth sharing. No matter how far you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt at describing feasible scenarios provides a decent foundation.

All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance. I have tried to elaborate every aspect of each possible solution, leaving behind all marketing claims of the vendor.

Scenario 1. Tableau Desktop + departmental/cross-functional data schemas.

This scenario involves data analysts gaining insights on a daily basis. They might be independent individuals or a team. Business users’ interaction with published workbooks is possible, but limited to simple filtering.

User categories: professional data analysts;

Technical skills: intermediate/advanced SQL, intermediate/advanced Tableau;

Tableau training: 2-3 days full time (preferably) or continuous self-learning from scratch;

Licenses: Tableau Desktop.


Pros:
  • Pure self-service BI approach with no IT involved in data analysis;
  • Vast range of data available for analysis with almost no limits;
  • Fast response for complex ad-hoc business problems.


Cons:
  • Requires highly skilled data analysts;
  • Most likely involves Tableau training on query performance optimisation on a particular data source (e.g. Vertica).


Recommendations:
  • Create a “sandbox” that allows data analysts to query and collaborate on their own and without supervision. Further promotion of workbooks to production is welcome.

Scenario 2. Tableau Desktop + custom data marts.

In this scenario, business users are fully in charge of data analysis. IT provides custom data marts.

User categories: business users, line-managers;

Technical skills: basic SQL, basic/intermediate Tableau;

Tableau training: two or three 2-3h sessions + ad-hoc support on daily basis;

Licenses: Tableau Desktop + Server Interactors.


Pros:
  • Easy access to data for ad-hoc analysis;
  • Self-answering critical business questions;
  • Self-publishing for further ad-hoc access across multiple devices.


Cons:
  • Adding any data involves IT support;
  • Requires elaborated data dictionaries.


Recommendations:
  • Make requirements gathering a collaborative and iterative process with regular communication; this ensures well-timed data delivery and quality;
  • Deliver training in 2-3 wisely structured sessions with 2-3 week breaks, so that business users have time to play with the software and discover the need for new skills;
  • Focus on rich visualisations, not tables.

Scenario 3. Tableau Server Web Edit + workbook templates

This scenario fully relies on data models published by data analysts and powerful Web Edit features of Tableau Server.

User categories: line-managers, top managers;

Technical skills: Tableau basics;

Tableau training: one 30 min demo session + ad-hoc support;

License: Server Interactor.


Pros:
  • No special training required;
  • Fast Tableau adoption with basic, but powerful Self-service BI capabilities (Web Edit);
  • Thin client access via any Desktop Web Browser;
  • Could serve as a foundation for self-service BI adoption among C-Suite.


Cons:
  • Requires a high level of accuracy in data preparation and template development;
  • Any changes in the data model require development and republishing of a template.


Recommendations:
  • Try to select the most proactive and “data hungry” line manager or executive, who could help to spread the word;
  • Investigate analytical needs, ensure availability of a subject matter expert;
  • Start with simple visualisations, but be ready to increase complexity;
  • Provide as much ad-hoc assistance as you can.

In my next post, I would like to throw light on some technical aspects and limitations of each scenario.

I highly appreciate any comments and look forward to hearing about your experience.

