
Guest blog post by Chris Atwood

Recently, I rediscovered a TED Talk by David McCandless, a data journalist, called “The beauty of data visualization.” It’s a great reminder of how charts (though scary to many) can help you tell an actionable story about a topic in a way that bullet points alone usually cannot. If you have not seen the talk, I recommend you take a look for some inspiration about visualizing big ideas.

 

In any social media report you make for the brass, several types of charts can help summarize the performance of your social media channels; the most common ones are bar charts, pie/donut charts and line graphs. They are tried and true but often overused, and they are not always the best way to visualize the data that informs and justifies your strategic decisions. Below are some less common charts to help you tell the story of your social media strategy's ROI.

 

For our examples here, we’ll primarily be examining a brand’s Facebook page for different types of analyses on its owned post performance.

 

Scatter plots

Figure 1: Total engagement vs total reach, colored by post type (Facebook Insights)

What they are: Scatter plots plot two variables against each other to help users determine whether there is a correlation or relationship between them.

 

Why they’re useful:  One of the most powerful aspects of a scatter plot is its ability to show nonlinear relationships between variables. They also help users get a sense of the big picture. In the example above, we’re looking for any observable correlations between total engagement (Y axis) and total reach (X axis) that can guide this Facebook page’s strategy. The individual dots are colored by the post type — status update (green), photo (blue) or video (red).

 

This scatter plot shows that engagement and reach have a direct relationship for photo posts, because those points form a fairly clear, straight line from the bottom left to the upper right. For other types of posts the relationships are less clear, although it can be noted that video posts have extremely high reach even though their engagement is typically low.
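If you want to build a chart like this yourself, here is a minimal sketch in R with ggplot2; the data frame and column names below are made up to stand in for an export of post-level Facebook Insights data:

library(ggplot2)

# Hypothetical post-level export: one row per post with reach, engagement and post type
posts <- data.frame(
  reach      = c(1200, 5400, 800, 15000, 2300, 40000),
  engagement = c(60, 240, 35, 300, 110, 150),
  post_type  = c("Photo", "Photo", "Status", "Video", "Photo", "Video")
)

ggplot(posts, aes(x = reach, y = engagement, colour = post_type)) +
  geom_point() +
  labs(title = "Total engagement vs total reach by post type",
       x = "Total reach", y = "Total engagement")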

 

Box plots

Figure 2: Total reach benchmark by post type (Facebook Insights)

 

What they are: Box plots show the statistical distribution of different categories in your data, and let you compare them against one another and establish a benchmark for a certain variable. They are not commonly used because they’re not always pretty, and sometimes can be a bit confusing to read without the right context.

 

Why they're useful: Box plots are an excellent way to display key performance indicators. Each category (with more than one post) is drawn as a series of lines and rectangles; the box spans what's called the interquartile range (IQR), and the whiskers extend beyond it. When you look at all the posts, you can split the values into groups called quartiles (or percentiles) based on the distribution of the values. You can use the median, the boundary at the top of the second quartile, as a benchmark for "average" performance.
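As a rough illustration, both the quartiles and the box plot itself can be produced in R; this sketch reuses the made-up posts data frame from the scatter plot example above:

library(ggplot2)

# 25%, 50% (median) and 75% values give the box for one post type
quantile(posts$reach[posts$post_type == "Photo"])

ggplot(posts, aes(x = post_type, y = reach)) +
  geom_boxplot() +
  labs(title = "Total reach benchmark by post type",
       x = "Post type", y = "Total reach")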

 

In this example, we're once again looking at different post types on a brand's Facebook page and seeing what the total reach is like for each. For videos (red), you can see that the lower boundary for reach is higher than the majority of photo posts, and that there are no outliers. Photos, however, tell a different story. The first quartile is very short, while the fourth quartile is much longer. Since many of the posts sit well above the second quartile, you know that many of these posts are performing above the benchmark. The dots above the whisker indicate outliers, i.e., posts that fall outside the expected range. You should take a closer look at outliers to see what you can learn from what they have in common (seasonality/timing, imagery, topic, audience targeting, or word choices).

Heat maps

Figure 3: Average total engagement per day by post type (Facebook Insights)

 

What they are: Heat maps are a great way to determine, for example, which posts have the highest engagement or impressions, on average, on a given day. Heat maps take two categories of data and compare a single quantitative variable across them (average total reach, average total engagement, etc.).

 

Why they're useful: Differences in shade show how the values in each column differ from one another. If the shades are all light, there is not a large difference in the values from category to category; if a column contains both light and dark colors, the values differ widely (more interesting!).

 

You could run a similar analysis to see what times  of day your posts get the highest engagement or reach, and find the answer to the classic question, “When should I post for the highest results?” You can also track competitors this way, to see how their content performs throughout the day or on particular days of the week. You can time your own posts around when you think shared audiences may be paying less attention to competitors, or make a splash during times with the best performance.

 

In the above example, you can see three post types from a brand's Facebook page categorized by their average total engagement on each day of the week. Based on the chart, photo performance does not vary much from day to day. Looking back at the data from the previous box plot, we know that photo posts are the most common post type and make up a large share of the data set; we can conclude that users are accustomed to seeing those posts, so they perform about the same day to day. We also see that video posts perform either far above or far below average, and that the best day for this brand to post videos is typically Thursday.
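A minimal sketch of how a day-by-post-type heat map like this could be built in R with dplyr and ggplot2, using made-up engagement numbers in place of the Insights export:

library(dplyr)
library(ggplot2)

# Hypothetical post-level data with the posting day recorded for each post
posts_by_day <- data.frame(
  day        = rep(c("Mon", "Tue", "Wed", "Thu", "Fri"), times = 3),
  post_type  = rep(c("Photo", "Video", "Status"), each = 5),
  engagement = runif(15, 20, 300)
)

# Average engagement per day and post type, then one coloured tile per combination
heat <- posts_by_day %>%
  group_by(day, post_type) %>%
  summarise(avg_engagement = mean(engagement))

ggplot(heat, aes(x = day, y = post_type, fill = avg_engagement)) +
  geom_tile() +
  labs(title = "Average total engagement per day by post type",
       fill = "Avg engagement")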

Tree maps

Figure 4: Average total engagement by content pillar and post type (Facebook Insights)

 

What they are: Tree maps display qualitative, hierarchical information, data that branches from a single trunk out to many leaves, as a set of nested rectangles. Tree maps typically have three main components that help you tell what's going on: the size of each rectangle, its relative color and the hierarchy.

 

Why they’re useful: Tree maps are a fantastic way to get a high-level look at your social data and figure out where you want to dig in for further analysis. In this example, we’re able to compare the average total engagement between different post types, broken out by content pillar.

For our brand's Facebook page, we have trellised the data by post type (figure 4); in other words, we created a visualization that comprises three smaller visualizations, so we can see how the post type affects the average total engagement for each content pillar. It answers the question, "Do my videos in category X perform differently than my photos in the same category?" You can also see that the rectangles vary in size from content pillar to content pillar; they are sized by the number of posts in each subset. Finally, they are colored by the average total engagement for that content pillar's subset of the post type. The darker the color, the higher the engagement.
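A sketch of a comparable treemap using the treemap package in R; the content pillars, post counts and engagement figures below are invented purely for illustration:

library(treemap)

# Hypothetical summary: one row per post type x content pillar
pillars <- data.frame(
  post_type      = rep(c("Photo", "Video", "Status"), each = 3),
  content_pillar = rep(c("Timely", "Education", "Event"), times = 3),
  n_posts        = c(24, 10, 8, 6, 4, 3, 1, 1, 1),
  avg_engagement = c(310, 180, 220, 450, 90, 130, 40, 25, 15)
)

# Rectangles sized by number of posts, coloured by average engagement, nested by post type
treemap(pillars,
        index  = c("post_type", "content_pillar"),
        vSize  = "n_posts",
        vColor = "avg_engagement",
        type   = "value",
        title  = "Average total engagement by content pillar and post type")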

 

We immediately learn that posts in the status trellis aren't performing anywhere near the other post types (it only has one post), and that photos have the greatest number of content pillars, or the greatest variety in topic. You can see from the visualization that you will want to spend more of your energy digging into why posts in the Timely, Education and Event categories perform well as both photos and videos.

 

TL;DR: Better Presentations are made with Better Charts

In your next analysis, you shouldn't disregard the tried-and-true bar charts, pie graphs and line charts. However, these four visualizations may offer a more succinct way to summarize your data and help you explain the performance of your campaigns. They'll also make your reports and wrap-ups look distinctive when used correctly. Although other chart types are also useful for better analyses and presentations, the ones discussed here are fairly simple to build, and nearly all of them can be created in Microsoft Excel or in visualization/analysis software such as TIBCO's Spotfire.

Read more…

Guest blog post by Divya Parmar

To once again demonstrate the power of MySQL (download), MySQL Workbench (download), and Tableau Desktop (free trial version can be downloaded here), I wanted to walk through another data analysis example. This time, I found a Medicare dataset publicly available on Data.gov and imported it using the Import Wizard as seen below.

 

 

Let’s take a look at the data: it has hospital location information, measure name (payment for heart attack patient, pneumonia patient, etc), and payment information.

 

I decided to look at the difference in lower and higher payment estimates for heart attack patients for each state to get a sense of variance in treatment cost. I created a query and saved it as a view.

 

One of the convenient features of Tableau Desktop is the ability to connect directly to MySQL, so I used that connection to load my view directly into Tableau.

  

 I wanted to see how the difference between lower and higher payment estimate varies by state. Using Tableau’s maps and geographic recognition of the state column, I used a few drag-and-drop moves and a color fill to complete the visualization.

You can copy the image itself to use elsewhere, choosing to add labels and legends if necessary. Enjoy. 

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here. He can also be found on LinkedIn and Twitter.

Read more…

Guest blog post by Ujjwal Karn

I created an R package for exploratory data analysis. You can read about it and install it here.  

The package contains several tools to perform initial exploratory analysis on any input dataset. It includes custom functions for plotting the data as well as for performing different kinds of analyses, such as univariate, bivariate and multivariate investigation, which is the first step of any predictive modeling pipeline. This package can be used to get a good sense of any dataset before jumping into building predictive models.

The package is constantly under development and more functionalities will be added soon. Pull requests to add more functions are welcome!

The functions currently included in the package are listed below, followed by a short usage sketch:

  • numSummary(mydata) function automatically detects all numeric columns in the dataframe mydata and provides their summary statistics
  • charSummary(mydata) function automatically detects all character columns in the dataframe mydata and provides their summary statistics
  • Plot(mydata, dep.var) plots all independent variables in the dataframe mydata against the dependent variable specified by the dep.var parameter
  • removeSpecial(mydata, vec) replaces all special characters (specified by vector vec) in the dataframe mydata with NA
  • bivariate(mydata, dep.var, indep.var) performs bivariate analysis between dependent variable dep.var and independent variable indep.var in the dataframe mydata
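Here is a small usage sketch on a made-up data frame, assuming the package has been installed from the link above and loaded (its name is omitted here, and the exact argument handling may differ from this sketch):

# library(...)   # load the package installed from the link above

df <- data.frame(score = c(10, 14, 9, 21, 17),
                 group = c("a", "b", "a", "b", "a"),
                 stringsAsFactors = FALSE)

numSummary(df)                                   # stats for the numeric column(s)
charSummary(df)                                  # stats for the character column(s)
Plot(df, dep.var = "score")                      # plot independent variables against score
bivariate(df, dep.var = "score", indep.var = "group")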

More functions to be added soon. Any feedback on improving this is welcome!

Read more…

Originally posted on Data Science Central

Contributed by Belinda Kanpetch, a current architecture graduate student at Columbia University. With a strong sense of urban design, she is fascinated by urban installation art and eager to use any element that can improve urban space. To gather information systematically to apply in her work, she took the NYC Data Science Academy 12-week full-time Data Science Bootcamp from April 11th to July 1st, 2016. This post is based on her first class project (due in the 2nd week of the program).

Why Street Trees?

The New York City street tree can sometimes be taken for granted or go unnoticed. Located along paths of travel, they stand steady and patient, quietly going about their business of filtering pollutants out of our air, bringing us oxygen, providing shade during the warmer months, blocking winds during cold seasons, and relieving our sewer systems during heavy rainfall. All of this while beautifying our streets and neighborhoods. Some recent studies have even found a link between the presence of street trees and lower stress levels in urban residents.

So what makes a street tree different from any other tree? Mainly its location. A street tree is defined as any tree that lives within the public right of way; not in a park or on private property. Although they reside in the public right of way (or within the jurisdiction of The Department of Transportation) they are the property of and cared for by the NYC Department of Parks and Recreation.

With the intent to understand the data and explore what the data was telling me I started with some very basic questions:

  • How many street trees are there in Manhattan?
  • How many different species are there?
  • What is the general condition of the street trees?
  • What is the distribution of species by community district?
  • Is there a connection between median income of a community district to the number of street trees?

The Dataset

The dataset used for this exploratory visualization was downloaded from the NYC Open Data Portal and was collected as part of TreeCount!2015, a street tree census maintained by the NYC Department of Parks and Recreation. The first census was conducted in 1995 and it has been repeated every 10 years by trained volunteers.

Some challenges with this dataset involved missing values in the form of unidentifiable species types: there were 2,285 observations with an unclassifiable species type and 487 observations with unclassifiable community districts. In addition, the geographic information (longitude and latitude) was stored as character strings that had to be split into separate variables, and species codes were given as 4-letter codes with no reference to genus, species, or cultivar, so I had to find another dataset to decipher them.

Visualizing the data

A quick summary of the dataset revealed a total of 51,660 trees in Manhattan, with 91 identifiable species plus one 'species' representing missing values.

A bar plot of all 92 species gave an interesting snapshot of the range in the total number of trees per species. It was quite obvious that one species has a dominant presence. To get a better understanding of the counts and of which species are common, I broke them down by quartiles and plotted them.
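A sketch of this quartile breakdown in R, using made-up per-species counts (in the real census data these come from grouping the tree records by species and counting rows):

library(ggplot2)

species_counts <- data.frame(
  species = paste("Species", 1:8),
  n       = c(1, 2, 3, 12, 150, 900, 4800, 11529)   # invented counts
)

quantile(species_counts$n)    # quartile breakpoints of trees per species

# Bar plot of the species falling in the first quartile (the rarest species)
first_q <- subset(species_counts, n <= quantile(species_counts$n, 0.25))
ggplot(first_q, aes(x = reorder(species, n), y = n)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Species", y = "Number of street trees")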

Plotting the first quartile (< 3.75) revealed that there are several species of which only a single tree exists in Manhattan!

The distribution within the 4th quartile (181.75 < total < 11529) was informative in that it helped to visualize the dominance of two specific species, the Honeylocust and the Ornamental Pear, which make up 23% and 15% of all the trees in Manhattan respectively. Coming in close were Ginkgo trees with 9.47% and London Plane with 7.8%. This quartile also contained the missing-species group '0'.

A palette of the top 4 species in Manhattan.

Looking at trees by Community District

I wanted to look at community districts as opposed to zip codes because, in my opinion, community districts are more representative of community cohesiveness and character. So I plotted the distribution by community district and tree condition.

Plotting the species distribution by community board using a facet grid helped visualize other species that did not show up as dominant in the previous graphs. It would be interesting to look further into what those species are and why they are more dominant within some community districts and not others.
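A minimal sketch of such a faceted species plot with ggplot2, using an invented tree-level data frame in place of the census data:

library(ggplot2)

trees <- data.frame(
  community_district = rep(c("CD 1", "CD 2", "CD 3"), each = 40),
  species            = sample(c("Honeylocust", "Ornamental Pear", "Ginkgo", "London Plane"),
                              120, replace = TRUE)
)

# One panel per community district, counting trees of each species
ggplot(trees, aes(x = species)) +
  geom_bar() +
  facet_grid(. ~ community_district) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))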

Attempts at mapping

The ultimate goal was to map each individual tree location on a map of Manhattan with the community districts outlined or shaded in. I attempted to plot them using leaflet (bringing in shape files and converting them to a data frame) and ggplot, but neither yielded anything useful. The only visualization I was able to get was with qplot, which took over 2 hours to render.

Read more…

The bagged trees algorithm is a commonly used classification method. By resampling our data and creating a tree for each resampled dataset, we can get an aggregated vote of classification predictions. In this blog post I will demonstrate how bagged trees work by visualizing each step.
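To make the resampling-and-voting idea concrete, here is a minimal from-scratch sketch in R (not the author's visualization code), using rpart trees on the built-in iris data and a simple majority vote:

library(rpart)

set.seed(1)
n_trees <- 25
idx     <- sample(nrow(iris), 100)
train   <- iris[idx, ]
test    <- iris[-idx, ]

# Grow one classification tree per bootstrap sample of the training data
trees <- lapply(1:n_trees, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Each tree votes on every test observation; the majority vote is the bagged prediction
votes  <- sapply(trees, function(tr) as.character(predict(tr, test, type = "class")))
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged == test$Species)   # accuracy of the aggregated vote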

Visualizing Bagged Trees as Approximating Borders, Part 1

Visualizing Bagged Trees, Part 2

Conclusion: Other tree aggregation methods differ in how they grow trees, and they may compute a weighted average. But in the end we can visualize the result of such an algorithm as borders between classified sets in the shape of connected perpendicular segments, as in this 2-dimensional case. In higher dimensions these become multidimensional rectangular pieces of hyperplanes that are perpendicular to each other.

Read more…

Contributed by Bin Lin. He took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Jan 11th and Apr 1st, 2016. This post is based on his second class project (due in the 4th week of the program).

Introduction:

Consumption patterns are an important driver of the development of the industrialized world, and consumer price changes reflect the economic performance and household income of a country. In this project, the focus is on food price changes. The goals of the project were:

  • Utilize Shiny for interactive visualization (the Shiny app is hosted at https://blin02.shinyapps.io/food_price_changes/)
  • Explore food price changes over time from 1974 to 2015.
  • Compare food price changes to All-Items price changes (All-items include all consumer goods and services, including food).
  • Compare Consumer Food Price Changes vs. Producer Price Changes (producer price changes are the average change in prices paid to domestic producers for their output).

Data:

Resource:

Summary:

Consumer Food Price Changes Dataset:

  • Data dimension: 42 rows x 21 columns
  • Missing data: There are 2 missing values in the column of "Eggs"

Producer Price Changes Dataset:

  • Data dimension: 42 rows x 17 columns
  • Missing data: There are 25 missing values in the column of "Processed.fruits.vegetables".

Consumer Food Categories:

  • Data dimension: 20 rows x 2 columns

Data Analysis and Visualization:

Food Consumption Categories:

Food consumption is broken out into 20 categories. Among all of them, the categories with high share based on consumer expenditures are (see Figure 1 and Figure 2):

  • Food.away.from.Home (eating out): 40.9%
  • Other.foods: 10.5% (note this is the sum of the remaining uncategorized foods)
  • Cereals.and.bakery.products: 8.0%
  • Nonalcoholic.beverages: 6.7%
  • Dairy.products: 6.3%
  • Beef.and.veal: 4.1%
  • Fresh.fruits: 4.0%

The high share of nonalcoholic beverages/soft drinks (6.7%) seems concerning, as high consumption of soft drinks might pose health risks.

Figure 1: Pie Chart on Food Categories Share of Consumer Expenditures

Figure 2: Bar Chart on Food Categories Share of Consumer Expenditures 

Food Price Changes over Time:

The Consumer Price Index (CPI) is a measure that examines average change over time in the prices paid by consumers for goods and services. It is calculated by taking price changes for each item in the predetermined basket of goods and averaging them; the goods are weighted according to their importance. Changes in CPI are used to assess price changes associated with the cost of living.
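As a toy illustration of that weighting idea (not the official CPI methodology, and with made-up numbers), a weighted average of category price changes can be computed in R like this:

price_change <- c(Food.away.from.home = 2.9, Cereals = 1.1, Dairy = -1.3)       # % changes (invented)
weight       <- c(Food.away.from.home = 0.409, Cereals = 0.080, Dairy = 0.063)  # expenditure shares
weighted.mean(price_change, weight)   # overall change implied by these categories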

In the Shiny app, I created a line chart that shows price changes for the food categories selected from a drop-down list. See Figure 3 for a screenshot of the price changes of different food categories over time.
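A minimal sketch of how such a drop-down-driven line chart can be wired up in Shiny; the data frame, column names and category values below are invented for illustration:

library(shiny)
library(ggplot2)

# Hypothetical long-format data: one row per year and category
food <- data.frame(
  year       = rep(1974:2015, times = 2),
  category   = rep(c("Eggs", "Dairy products"), each = 42),
  pct_change = runif(84, -5, 15)          # stand-in values
)

ui <- fluidPage(
  selectInput("cat", "Food category", choices = unique(food$category)),
  plotOutput("priceChart")
)

server <- function(input, output) {
  output$priceChart <- renderPlot({
    ggplot(subset(food, category == input$cat), aes(x = year, y = pct_change)) +
      geom_line() +
      labs(title = paste("Price changes over time:", input$cat), y = "Percent change")
  })
}

shinyApp(ui, server)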

As I was looking at the food price changes, I noticed that there was a dramatic increase during the late 70s. After reviewing the history of the 1970s, I found that a lot happened during that period, including the "Great Inflation".

Figure 3: Screenshot of Different Food Categories Price Changes over Time

Yearly Food Category Price Changes:

To view the food price changes for each category in a given year, I created a bar chart in the Shiny app. Users can select a year from a slider; the chart then shows the price change of each food category for that year. I actually created two bar charts side by side in case users want to compare food price changes between any two years.

A quick look at 2015 shows that the price of "Eggs" had the biggest increase, while the price of "Pork" dropped the most. In fact, many food categories saw their prices drop. Compared with 2015, 2014 had fewer categories with falling prices, and the price of "Beef and Veal" had the biggest increase.

Figure 4: Screenshot of Food Category Price Changes by Year

Food Price Changes vs All-Items Price Changes

The Consumer Price Index (CPI) for food is a component of the all-items CPI, which led me to compare the two. From the line chart, I observed:

  • Food price changes mostly align with all-items price changes.
  • Food price inflation has outpaced the economy-wide inflation in recent years.

Figure 5: Price Changes in All-Items vs Price Changes in Food

Food Price Changes vs Producer Price Changes

According to the United States Department of Agriculture (USDA), changes in farm-level and wholesale-level PPIs are of particular interest in forecasting food CPIs. Therefore, I created a chart to show overall food price changes vs producer price changes. Users can choose one or more producer food categories.

From the chart, food price changes mostly align with producer price changes. However, farm-level milk, cattle and wheat prices have fluctuated since 2000 without affecting the overall food price change very much. Though their impact on the overall food price was small, I suspect they may have affected individual food categories. I would like to add a new drop-down list to allow users to select food categories from the consumer food categories.

Figure 6: Food Price Changes vs Producer Price Changes

Correlation Tile Map:

To see the relationship among the different categories in terms of price changes, I created a correlation tile map.
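One way to sketch such a correlation tile map in R is with cor() plus ggplot2's geom_tile; the category columns below are simulated stand-ins for the yearly price-change data:

library(ggplot2)
library(reshape2)

# Simulated wide data: one column of yearly % changes per category
set.seed(3)
price_changes <- as.data.frame(matrix(rnorm(42 * 4), ncol = 4,
                                      dimnames = list(NULL, c("Eggs", "Pork", "Beef", "Dairy"))))

corr_long <- melt(cor(price_changes, use = "pairwise.complete.obs"),
                  varnames = c("cat1", "cat2"), value.name = "correlation")

ggplot(corr_long, aes(x = cat1, y = cat2, fill = correlation)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) +
  labs(title = "Correlation of yearly price changes between categories")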

Conclusion:

Food prices have been increasing, by varying percentages. Since 1990, food price changes have stayed within a small percentage range. The degree of food price inflation varies depending on the type of food.

Looking ahead to 2016, ERS predicts food-at-home (supermarket) prices will rise 2.0 to 3.0 percent, a rate of inflation that remains in line with the 20-year historical average of 2.5 percent. For future work, I would love to fit a time-series model to predict the price changes for the coming five years.

Again, this project was done in Shiny, and most of the information in this blog post comes from the Shiny app, https://blin02.shinyapps.io/food_price_changes/.

Originally posted on Data Science Central

Read more…

There are many ways to choose features from given data, and it is always a challenge to pick the ones with which a particular algorithm will work best. Here I will consider data from monitoring the performance of physical exercises with wearable accelerometers, for example wrist bands.

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har.

In this project, researchers used data from accelerometers on the belt, forearm, arm, and dumbbell of a few participants. They were asked to perform barbell lifts correctly, marked as "A", and incorrectly with four typical mistakes, marked as "B", "C", "D" and "E". The goal of the project is to predict the manner in which they did the exercise.

There are 52 numeric variables and one classification variable, the outcome. We can plot density graphs for the first 6 features, which are in effect smoothed-out histograms.

We can see that the data's behavior is complicated. Some of the features are bimodal or even multimodal. These properties could be caused by participants' different sizes or training levels or something else, but we do not have enough information to check. Nevertheless, it is clear that our variables do not follow a normal distribution. Therefore we are better off with algorithms which do not assume normality, like trees and random forests. We can visualize how these algorithms work in the following way: as finding vertical lines which divide the areas under the curves such that the areas to the right and to the left of the line are significantly different for different outcomes.
There are a number of ways in Functional Analysis to distinguish functions analytically on an interval. The most suitable here seems to be to consider the area between curves. Clearly, we should scale it with respect to the size of the curves. For every feature I consider all pairs of density curves to find out if they are sufficiently different. Here is my final criterion:
If there is a pair of curves for a feature which satisfies it, then the feature is chosen for prediction. As a result I got 21 features for a random forest algorithm, which yielded 99% accuracy both for the model itself and on a validation set. I checked how many variables are needed for the same accuracy with PCA preprocessing, and it was 36. Mind you that those variables are scaled and rotated, and that we still use the same original 52 features to construct them, so more effort is needed to construct a prediction and to explain it. With the above method it is easier, since areas under the curves represent numbers of observations.
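The exact criterion is not reproduced here, but the underlying idea, the area between two class-conditional density curves scaled by the size of the curves, can be sketched in R as follows (the scaling choice below is an assumption, not the author's exact formula):

# Area between the density curves of one feature for a pair of classes,
# scaled by the average area under the two curves
area_between <- function(x, class, class_a, class_b, n = 512) {
  d_a  <- density(x[class == class_a], from = min(x), to = max(x), n = n)
  d_b  <- density(x[class == class_b], from = min(x), to = max(x), n = n)
  step <- diff(d_a$x[1:2])
  sum(abs(d_a$y - d_b$y)) * step / ((sum(d_a$y) + sum(d_b$y)) * step / 2)
}

# Example with simulated accelerometer-like values for two classes
set.seed(1)
x     <- c(rnorm(200, 0, 1), rnorm(200, 1.5, 1))
class <- rep(c("A", "B"), each = 200)
area_between(x, class, "A", "B")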


Originally posted on Data Science Central
Read more…



Analysis of Fuel Economy Data

Paul Grech

October 5, 2015


Contributed by Paul Grech. Paul took the NYC Data Science Academy 12-week full-time Data Science Bootcamp program between Sept 23 and Dec 18, 2015. This post is based on his first class project (due in the 2nd week of the program).

Scope:


Analyse fuel economy ratings in the automotive industry.

Compare the vehicle efficiency of American automotive manufacturer Cadillac with that of the automotive industry as a whole.

Sept 2014 - “We cannot deny the fact that we are leaving behind our traditional customer base,” de Nysschen said. “It will take several years before a sufficiently large part of the audience who until now have been concentrating on the German brands will find us in their consideration set.” Cadillac’s President - Johan de Nysschen http://www.autonews.com/article/20140915/RETAIL03/140919894/cadillacs-new-chief-vows-no-retreat-on-pricing-strategy

Compare the vehicle efficiency of American automotive manufacturer Cadillac with its self-declared competition, the German luxury market.

What further comparisons will provide insight into EPA ratings?

Analysis Overview

  1. Automotive Industry
  2. Cadillac vs Automotive Industry
  3. Cadillac vs German Luxury Market
  4. Cadillac vs German Luxury Market by Vehicle Class

Importing the Data


Import the FuelEconomy.gov data and filter the rows needed for the analysis. Then remove all zeros from the city and highway MPG data, as they would skew the results, replacing them with NA so that calculations are not performed on data that is not present.
library(lsr)
library(dplyr)
library(ggplot2)

# Import Data and convert to Dplyr data frame
FuelData <- read.csv("Project1.data/FuelEconomyGov.csv", stringsAsFactors = FALSE)
FuelData <- tbl_df(FuelData)

# Create data frame including information necessary for analysis
FuelDataV1 <- select(FuelData,
mfrCode, year, make, model,
engId, eng_dscr, cylinders, displ, sCharger, tCharger,
trans_dscr, trany, drive,
startStop, phevBlended,
city08, comb08, highway08,
VClass)

# Replace Zero values in MPG data with NA
FuelDataV1$city08[FuelDataV1$city08 == 0] <- NA
FuelDataV1$comb08[FuelDataV1$comb08 == 0] <- NA
FuelDataV1$highway08[FuelDataV1$highway08 == 0] <- NA

1: Automotive Industry


Visualize city and highway EPA ratings of the entire automotive industry.

Question:


How have EPA ratings for city and highway improved across the automotive industry as a whole?

Note: No need to include combined as combined is simply a percentage based calculation defaulting to 60/40 but can be adjusted on the website.

# VISUALIZE INDUSTRY EPA RATINGS
IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")

Comp.Ind <- rbind(IndCityMPG, IndHwyMPG)

ggplot(data = Comp.Ind, aes(x = year, y = MPG, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
ggtitle("Industry\n(city & highway MPG)")


(Plot: Industry, city & highway MPG)

Conclusion:


The data visualization shows relatively poor EPA ratings throughout the 1980s, 1990s and early-to-mid 2000s, with the first drastic improvement occurring around 2008. One significant event around this time was the recession hitting America. Consumers having less disposable income, along with increased oil prices, likely fueled competition to develop fuel-efficient powertrains across the automotive industry as a whole.

2: Cadillac vs Automotive Industry


Visualize Cadillac's city and highway EPA ratings with that of the automotive industry.

Question:


How does Cadillac perform when compared to the automotive industry as a whole?
# COMPARE INDUSTRY EPA RATINGS FOR CITY AND HIGHWAY WITH THAT OF CADILLAC
IndCityMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "City")
IndHwyMPG <- group_by(FuelDataV1, year) %>%
summarise(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Industry") %>%
mutate(., MPGType = "Highway")
CadCityMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "City")
CadHwyMPG <- filter(FuelDataV1, make == "Cadillac") %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Cadillac") %>%
mutate(., MPGType = "Highway")

Comp.Ind.Cad <- rbind(IndCityMPG, IndHwyMPG, CadCityMPG, CadHwyMPG)

ggplot(data = Comp.Ind.Cad, aes(x = year, y = MPG, color = Label, linetype = MPGType)) +
geom_point() + geom_line() + theme_bw() +
scale_color_manual(name = "Cadillac / Industry", values = c("blue","#666666")) +
ggtitle("Cadillac vs Industry\n(city & highway MPG)")

(Plot: Cadillac vs Industry, city & highway MPG)

Conclusion:


Cadillac was chosen as a brand of interest because they are currently redefining their brand as a whole. It is important to analyze past performance to have a complete understanding of how Cadillac has been viewed for several decades.

In 2002, Cadillac dropped to its lowest performance. Why did this occur? Because the entire fleet was made up of the same 4.6L V8 mated to a 4-speed automatic transmission, or as some would say, a slush-box. The image Cadillac had at this time was of a retirement vehicle to be shipped to its owner's new retirement home in Florida, with a soft ride, smooth power delivery and no performance. With the latest generation of Cadillacs being performance oriented, beginning with the LS2-sourced CTS-V and now including the ATS-V and CTS-V along with several other V-Sport models, a rebranding is crucial in order to appeal to a new market of buyers.


It is also interesting to note that although more performance models are being produced, fuel efficiency is not falling behind. The gap noted above has decreased even as the number of performance models has increased, two trends not often found to align.

3: Cadillac vs German Luxury Market


Cadillac has recently targeted the German luxury market consisting of the following manufacturers:
  • Audi
  • BMW
  • Mercedes-Benz

Question:


How does Cadillac perform when compared with the German Luxury Market?
# Calculate Cadillac average Highway / City MPG past 2000
CadCityMPG <- filter(CadCityMPG, year > 2000)
CadHwyMPG <- filter(CadHwyMPG, year > 2000)

# Calculate Audi average Highway / City MPG
AudCityMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "City")
AudHwyMPG <- filter(FuelDataV1, make == "Audi", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Audi") %>%
mutate(., MPGType = "Highway")

# Calculate BMW average Highway / City MPG
BMWCityMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "City")
BMWHwyMPG <- filter(FuelDataV1, make == "BMW", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "BMW") %>%
mutate(., MPGType = "Highway")

# Calculate Mercedes-Benz average Highway / City MPG
MbzCityMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(city08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "City")
MbzHwyMPG <- filter(FuelDataV1, make == "Mercedes-Benz", year > 2000) %>%
group_by(., year) %>%
summarize(., MPG = mean(highway08, na.rm = TRUE)) %>%
mutate(., Label = "Merc-Benz") %>%
mutate(., MPGType = "Highway")

# Concatenate all Highway/City MPG data for:
# v.s. German Competitors
CompGerCadCity <- rbind(CadCityMPG, AudCityMPG, BMWCityMPG, MbzCityMPG)
CompGerCadHwy <- rbind(CadHwyMPG, AudHwyMPG, BMWHwyMPG, MbzHwyMPG)

ggplot(data = CompGerCadCity, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle("CITY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")

(Plot: City MPG, Cadillac vs Audi vs BMW vs Mercedes-Benz)
ggplot(data = CompGerCadHwy, aes(x = year, y = MPG, color = Label)) + 
geom_line() + geom_point() + theme_bw() +
scale_color_manual(name = "Cadillac vs German Luxury Market",
values = c("#333333", "#666666", "blue","#999999")) +
ggtitle(label = "HIGHWAY MPG\n(Cad vs Audi vs BMW vs Mercedes-Benz)")

(Plot: Highway MPG, Cadillac vs Audi vs BMW vs Mercedes-Benz)

Conclusion:


“Mr. Ellinghaus, a German who came to Cadillac in January from pen maker Montblanc International after more than a decade at BMW, said he has spent the past 11 months doing "foundational work" to craft an overarching brand theme for Cadillac's marketing, which he says relied too heavily on product-centric, me-too comparisons.

“In engineering terms, it makes a lot of sense to benchmark the cars against BMW,” Mr. Ellinghaus said. But he added: “From a communication point of view, you must not follow this rule.” http://adage.com/article/cmo-strategy/ellinghaus-cadillac-a-luxury-brand-makes-cars/296016/


Despite the comments made by Mr. Ellinghaus, the end goal is for consumers to compare Cadillac with Audi, BMW and Mercedes-Benz. The fact that this is already happening is a huge success for a company which, only ten years ago, would never have been mentioned in the same sentence as the German luxury market.

The data visualization shows that Cadillac is rated on par with its German competitors and, at the same time, has not had any significant dips, unlike all the other manufacturers. The continued increase in performance combined with the rebranding signify that Cadillac is on a path to success.

4: Cadillac vs German Luxury Market by Vehicle Class


Every manufacturer has its strengths and weaknesses. It is important to assess and recognize these attributes to best determine where an increase in R&D spending is needed and where to maintain a competitive advantage for the consumer by vehicle class.

Question:


In what vehicle class is Cadillac excelling or falling behind?
# Filter only Cadillac and german luxury market
German <- filter(FuelDataV1, make %in% c("Cadillac", "Audi", "BMW", "Mercedes-Benz"))
# Group vehicle classes into more generic classes
German$VClass.new <- ifelse(grepl("Compact", German$VClass, ignore.case = T), "Compact",
ifelse(grepl("Wagons", German$VClass), "Wagons",
ifelse(grepl("Utility", German$VClass), "SUV",
ifelse(grepl("Special", German$VClass), "SpecUV", German$VClass))))

# Focus on vehicle model years past 2000
German <- filter(German, year > 2000)
# Vans, Passenger Type are only specific to one company and are not needed for this analysis
German <- filter(German, VClass.new != "Vans, Passenger Type")

# INDUSTRY
IndClass <- filter(German, make %in% c("Audi", "BMW", "Mercedes-Benz")) %>%
group_by(VClass.new, year) %>%
summarize(AvgCity = mean(city08), AvgHwy = mean(highway08))
# CADILLAC
CadClass <- filter(German, make %in% c("Cadillac")) %>%
group_by(VClass.new, year) %>%
summarize(AvgCity = mean(city08), AvgHwy = mean(highway08))

##### Join tables #####
CadIndClass <- left_join(IndClass, CadClass, by = c("year", "VClass.new"))
CadIndClass$DifCity <- (CadIndClass$AvgCity.y - CadIndClass$AvgCity.x)
CadIndClass$DifHwy <- (CadIndClass$AvgHwy.y - CadIndClass$AvgHwy.x)

ggplot(CadIndClass, aes(x = year, ymax = DifCity, ymin = 0) ) + 
geom_linerange(color='grey20', size=0.5) +
geom_point(aes(y=DifCity), color = 'blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~VClass.new) +
ggtitle("Cadillac vs Germany Luxury Market\n(city mpg by class)") +
xlab("Year") +
ylab("MPG Difference")

(Plot: Cadillac vs German Luxury Market, city MPG by class)
ggplot(CadIndClass, aes(x = year, ymax = DifHwy, ymin = 0) ) + 
geom_linerange(color='grey20', size=0.5) +
geom_point(aes(y=DifHwy), color='blue') +
geom_hline(yintercept = 0) +
theme_bw() +
facet_wrap(~VClass.new) +
ggtitle("Cadillac vs German Luxury Market\n(highway mpg by class)") +
xlab("Year") +
ylab("MPG Difference")

(Plot: Cadillac vs German Luxury Market, highway MPG by class)

Conclusion:


The above data visualization displays the delta between Cadillac's fuel economy ratings and the average of its competitors (Audi, BMW, Mercedes-Benz). Positive values mean Cadillac is above the average competition; negative values mean it is below.

There is a lack of performance across all vehicle classes. The reason may be that the same powertrains are being used across multiple chassis.

 

Conclusion & Continued Analysis

  1. There is a clear improvement in EPA ratings as federal emission standards drive innovation for increased fleet fuel economy. It is important for automotive manufacturers to continue innovation and push for increased efficiency.
  2. Further analysis in the following areas would provide greater research opportunity:
    • Drivetrain v.s. MPG
    • Sales data
    • Consumer reaction to new marketing strategies
    • Consumer demand for product or badge

Originally posted on Data Science Central

Read more…

Guest blog post by Vincent Granville

There have been many variations on this theme: defining big data with 3 Vs (or more, including velocity, variety, volume, veracity and value), as well as other representations such as the data science alphabet.

Here's an interesting Venn diagram that tries to define statistical computing (a sub-field of data science) with 7 sets and 9 intersections:

It was published in a scholarly paper entitled Computing in the Statistics Curricula (PDF document). Enjoy!

Read more…

Learning R in Seven Simple Steps

Originally posted on Data Science Central

Guest blog post by Martijn Theuwissen, co-founder at DataCamp. Other R resources can be found here, and R Source code for various problems can be found here. A data science cheat sheet can be found here, to get you started with many aspects of data science, including R.

Learning R can be tricky, especially if you have no programming experience or are more familiar working with point-and-click statistical software versus a real programming language. This learning path is mainly for novice R users that are just getting started but it will also cover some of the latest changes in the language that might appeal to more advanced R users.

Creating this learning path was a continuous trade-off between being pragmatic and exhaustive. There are many excellent (free) resources on R out there, and unfortunately not all could be covered here. The material presented here is a mix of relevant documentation, online courses, books, and more that we believe is best to get you up to speed with R as fast as possible.

Data Video produced with R: click here and also here for source code and to watch the video. More here.

Here is an outline:

  • Step 0: Why you should learn R
  • Step 1: The Set-Up
  • Step 2: Understanding the R Syntax
  • Step 3: The core of R -> packages
  • Step 4: Help?!
  • Step 5: The Data Analysis Workflow
    • 5.1 Importing Data
    • 5.2 Data Manipulation
    • 5.3 Data Visualization
    • 5.4 The stats part
    • 5.5 Reporting your results
  • Step 6: Become an R wizard and discovering exciting new stuff

Step 0: Why you should learn R

R is rapidly becoming the lingua franca of Data Science. Having its origins in academia, you will spot it today in an increasing number of business settings as well, where it is a contender to commercial incumbents such as SAS, STATA and SPSS. Each year R gains in popularity, and in 2015 IEEE listed R among its top ten languages.

This implies that the demand for individuals with R knowledge is growing, and consequently learning R is definitely a smart investment career-wise (according to this survey, R is even the highest-paying skill). This growth is unlikely to plateau in the coming years, with large players such as Oracle and Microsoft stepping up by including R in their offerings.

Nevertheless, money should not be the only driver when deciding to learn a new technology or programming language. Luckily, R has a lot more to offer than a solid paycheck. By engaging with R, you will become familiar with a highly diverse and interesting community. R is used for a diverse set of tasks in finance, genomic analysis, real estate, paid advertising, and much more, and all these fields are actively contributing to the development of R. You will encounter a diverse set of examples and applications on a daily basis, keeping things interesting and giving you the ability to apply your knowledge to a wide range of problems.

Have fun!

Step 1: The Set-Up

Before you can actually start working in R, you need to download a copy of it on your local computer. R is continuously evolving and different versions have been released since R was born in 1993 with (funny) names such as World-Famous Astronaut and Wooden Christmas-Tree. Installing R is pretty straightforward and there are binaries available for Linux, Mac and Windows from the Comprehensive R Archive Network (CRAN).

Once R is installed, you should consider installing one of R's integrated development environments as well (although you could also work with the basic R console if you prefer). Two fairly established IDEs are RStudio and Architect. In case you prefer a graphical user interface, you should check out R-commander.

Step 2: Understanding the R Syntax

Learning the syntax of a programming language like R is very similar to the way you would learn a natural language like French or Spanish: by practice & by doing. One of the best ways to learn R by doing is through the following (online) tutorials:

Next to these online tutorials there are also some very good introductory books and written tutorials to get you started:

Step 3: The core of R -> packages

Every R package is simply a bundle of code that serves a specific purpose and is designed to be reusable by other developers. In addition to the primary codebase, packages often include data, documentation, and tests. As an R user, you can simply download a particular package (some are even pre-installed) and start using its functionalities. Everyone can develop R packages, and everyone can share their R packages with others.

The above is an extremely powerful concept and one of the key reasons R is so successful as a language and as a community. Namely, you don't need to do all the hard-core programming yourself or understand every complex detail of a particular algorithm or visualization. You can simply use the out-of-the-box functions that come with the relevant package as an interface to such functionality. As such, it is useful to have an understanding of R's package ecosystem.

Many R packages are available from the Comprehensive R Archive Network, and you can install them using the install.packages function. What is great about CRAN is that it associates packages with particular tasks via Task Views. Alternatively, you can find R packages on Bioconductor, GitHub and Bitbucket.
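For example, installing and loading a CRAN package (dplyr is used here purely as an example) looks like this:

install.packages("dplyr")            # download and install from CRAN
library(dplyr)                       # load it into the current session

rownames(installed.packages())[1:5]  # a peek at what is already installed
update.packages(ask = FALSE)         # bring installed packages up to date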

Looking for a particular package and corresponding documentation? Try Rdocumentation, where you can easily search packages from CRAN, github and bioconductor.

Step 4: Help?!

You will quickly find out that for every R question you solve, five new ones will pop-up. Luckily, there are many ways to get help:

  • Within R you can make use of its built-in help system. For example the command  `?plot` will provide you with the documentation on the plot function.
  • R puts a big emphasis on documentation. The previously mentioned Rdocumentation is a great website to look up the documentation of different packages and functions.
  • Stack Overflow is a great resource for seeking answers on common R questions or to ask questions yourself.
  • There are numerous blogs & posts on the web covering R, such as KDnuggets and R-bloggers.

Step 5: The Data Analysis Workflow

Once you have an understanding of R’s syntax, the package ecosystem, and how to get help, it’s time to focus on how R can be useful for the most common tasks in the data analysis workflow

5.1 Importing Data

Before you can start performing analysis, you first need to get your data into R. The good thing is that you can import all sorts of data formats into R; the hard part is that different types often need a different approach:

If you want to learn more on how to import data into R check an online Importing Data into R tutorial or  this post on data importing.

5.2 Data Manipulation

Performing data manipulation with R is a broad topic, as you can see in, for example, this Data Wrangling with R video by RStudio or the book Data Manipulation with R. This is a list of packages in R that you should master when performing data manipulations; a short dplyr sketch follows the list:

  • The tidyr package for tidying your data.
  • The stringr package for string manipulation.
  • When working with data-frame-like objects it is best to make yourself familiar with the dplyr package (try this course). However, in the case of heavy data wrangling tasks, it makes more sense to check out the blazingly fast data.table package (see this syntax cheatsheet for help).
  • When working with times and dates install the lubridate package which makes it a bit easier to work with these.
  • Packages like zoo, xts and quantmod offer great support for time series analysis in R.
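As promised above, a short dplyr sketch on a built-in dataset, chaining a filter, a new column and a grouped summary:

library(dplyr)

mtcars %>%
  filter(cyl %in% c(4, 6)) %>%            # keep 4- and 6-cylinder cars
  mutate(kpl = mpg * 0.425) %>%           # miles per gallon to km per litre
  group_by(cyl) %>%
  summarise(avg_kpl = mean(kpl), n = n())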

5.3 Data Visualization

One of the main reasons R is the favorite tool of data analysts and scientists is its data visualization capabilities. Tons of beautiful plots are created with R, as shown by all the posts on FlowingData, such as this famous Facebook visualization.

Credit card fraud scheme featuring time, location, and loss per event, using R: click here for source

If you want to get started with visualizations in R, take some time to study the ggplot2 package, one of the (if not the) most famous packages in R for creating graphs and plots. ggplot2 makes intensive use of the grammar of graphics, and as a result is very intuitive to use (you're continuously building up parts of your graph, so it's a bit like playing with Lego). There are tons of resources to get you started, such as this interactive coding tutorial, a cheatsheet and an upcoming book by Hadley Wickham.
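A small example of that layer-by-layer style on a built-in dataset: data and aesthetic mappings first, then geoms, then labels and a theme:

library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders") +
  theme_bw()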

Besides ggplot2 there are multiple other packages that allow you to create highly engaging graphics and that have good learning resources to get you up to speed. Some of our favourites are:

If you want to see more packages for visualizations see the CRAN task view. In case you run into issues plotting your data this post might help as well.

Next to the “traditional” graphs, R is able to handle and visualize spatial data as well. You can easily visualize spatial data and models on top of static maps from sources such as Google Maps and Open Street Maps with a package such as ggmap. Another great package is choroplethr, developed by Ari Lamstein of Trulia, or the tmap package. Take this tutorial on Introduction to visualising spatial data in R if you want to learn more.

5.4 The stats part

In case you are new to statistics, there are some very solid sources that explain the basic concepts while making use of R:

Note that these resources are aimed at beginners. If you want to go more advanced, you can look at the multiple resources there are for machine learning with R. Books such as Mastering Machine Learning with R and Machine Learning with R explain the different concepts very well, and online resources like the Kaggle Machine Learning course help you practice the different concepts. Furthermore, there are some very interesting blogs to kickstart your ML knowledge, like Machine Learning Mastery or this post.

5.5 Reporting your results

One of the best ways to share your models, visualizations, etc. is through dynamic documents. R Markdown (based on knitr and pandoc) is a great tool for reporting your data analysis in a reproducible manner through html, word, pdf, ioslides, etc. This 4-hour tutorial on Reporting with R Markdown explains the basics of R Markdown. Once you are creating your own markdown documents, make sure this cheat sheet is on your desk.

Step 6: Become an R wizard and discovering exciting new stuff

R is a fast-evolving language. Its adoption in academia and business is skyrocketing, and consequently the rate of new features and tools within R is rapidly increasing. These are some of the new technologies and packages that excite us the most:

Once you have some experience with R, a great way to level up your R skillset is the free book Advanced R by Hadley Wickham. In addition, you can start practicing your R skills by competing with fellow Data Science Enthusiasts on Kaggle, an online platform for data-mining and predictive modelling competitions. Here you have the opportunity to work on fun cases such as this titanic data set.

To end, you are now probably ready to start contributing to R yourself by writing your own packages. Enjoy!


Read more…

New book on data mining and statistics

New book:

Numeric Computation and Statistical Data Analysis on the Java Platform (by S.Chekanov)

710 pages. Springer International Publishing AG. 2016. ISBN 978-3-319-28531-3.  http://www.springer.com/us/book/9783319285290

Book S.V.Chekanov 2016

About this book: Numerical computation, knowledge discovery and statistical data analysis integrated with powerful 2D and 3D graphics for visualization are the key topics of this book. The Python code examples powered by the Java platform can easily be transformed to other programming languages, such as Java, Groovy, Ruby and BeanShell. This book equips the reader with a computational platform which, unlike other statistical programs, is not limited by a single programming language.

Originally posted on Data Science Central

Read more…

Dealing with Outliers is like searching a needle in a haystack

This is a guest repost by Jacob Joseph.

An outlier is an observation or point that is distant from other observations/points. But how would you quantify the distance of an observation from other observations to qualify it as an outlier? Outliers are also referred to as observations whose probability of occurring is low. But, again, what constitutes low?

There are parametric methods and non-parametric methods that are employed to identify outliers. Parametric methods involve the assumption of some underlying distribution, such as the normal distribution, whereas there is no such requirement with the non-parametric approach. Additionally, you could do a univariate analysis, studying a single variable at a time, or a multivariate analysis, where you study more than one variable at the same time to identify outliers.

The question arises: which approach and which analysis is the right answer? Unfortunately, there is no single right answer. It depends on the end purpose for which you are identifying such outliers. You may want to analyze the variable in isolation, or use it among a set of variables to build a predictive model.

Let’s try to identify outliers visually.

Assume we have the data for Revenue and Operating System for Mobile devices for an app. Below is the subset of the data:

How can we identify outliers in the Revenue?

We shall try to detect outliers using parametric as well as non-parametric approach.

Parametric Approach

Comparison of Actual, Lognormal and Normal Density Plot

The x-axis, in the above plot, represents the Revenues and the y-axis, probability density of the observed Revenue value. The density curve for the actual data is shaded in ‘pink’, the normal distribution is shaded in 'green' and  log normal distribution is shaded in 'blue'. The probability density for the actual distribution is calculated from the observed data, whereas for both normal and log-normal distribution is computed based on the observed mean and standard deviation of the Revenues.

Outliers could be identified by calculating the probability of the occurrence of an observation, or by calculating how far the observation is from the mean. For example, in the case of a normal distribution, observations more than 3 standard deviations from the mean could be classified as outliers.
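A quick sketch of that rule in R, on simulated revenue-like values (the data and the 3-standard-deviation cut-off are assumptions, not the author's dataset):

set.seed(42)
revenue <- rlnorm(500, meanlog = 9, sdlog = 1)   # skewed, revenue-like values

z_score  <- (revenue - mean(revenue)) / sd(revenue)
outliers <- revenue[abs(z_score) > 3]            # flag points beyond 3 standard deviations
length(outliers)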

In the above case, if we assume a normal distribution, there could be many outlier candidates, especially among observations with revenue beyond 60,000. The log-normal plot does a better job than the normal distribution, but only because the underlying actual distribution has the characteristics of a log-normal distribution. This is not the general case, since determining the underlying distribution or its parameters beforehand (a priori) is extremely difficult. One could infer the parameters of the data by fitting a curve to it, but a change in the underlying parameters like the mean and/or standard deviation due to new incoming data will change the location and shape of the curve, as observed in the plots below:

Comparison of density plots for changes in mean and standard deviation, Normal distribution
Comparison of density plots for changes in mean and standard deviation, Lognormal distribution

The above plots show the shift in location or the spread of the density curve based on an assumed change in mean or standard deviation of the underlying distribution. It is evident that a shift in the parameters of a distribution is likely to influence the identification of outliers.

Non-Parametric Approach

Let’s look at a simple non-parametric approach like a box plot to identify the outliers.

Non Parametric approach to detect outlier with box plots (univariate approach)

In the box plot shown above, we can identify 7 observations, which could be classified as potential outliers, marked in green. These observations are beyond the whiskers. 
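The box-plot rule can be applied directly in R as well; this sketch reuses the simulated revenue vector from the parametric example above:

q   <- quantile(revenue, c(0.25, 0.75))
iqr <- diff(q)                                   # interquartile range
whisker_outliers <- revenue[revenue < q[1] - 1.5 * iqr | revenue > q[2] + 1.5 * iqr]

length(boxplot.stats(revenue)$out)               # boxplot.stats() applies the same 1.5 * IQR rule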

In the data, we have also been provided information on the OS. Would we identify the same outliers if we plot the Revenue based on OS?

Non Parametric approach to detect outlier with box plots (bivariate approach)

In the above box plot, we are doing a bivariate analysis, taking 2 variables at a time, which is a special case of multivariate analysis. There are 3 outlier candidates for iOS, whereas there are none for Android. This is due to the difference in the distribution of Revenues for Android and iOS users. So, analyzing the Revenue variable on its own (univariate analysis), we identified 7 outlier candidates, which dropped to 3 candidates when a bivariate analysis was performed.

Both Parametric as well as Non-Parametric approach could be used to identify outliers based on the characteristics of the underlying distribution. If the mean accurately represents the center of the distribution and the data set is large enough, parametric approach could be used whereas if the median represents the center of the distribution, non-parametric approach to identify outliers is suitable.

Dealing with outliers in a multivariate scenario becomes all the more tedious. Clustering, a popular data mining technique and a non-parametric method could be used to identify outliers in such a case.

Originally posted on Data Science Central

Read more…

Guest blog post by ahmet taspinar

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in the test set (a dataset whose entries have not been labelled yet) with the model constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP), the field of science whose goal is to make machines (computers) understand (written) human language. You could think of Text Categorization, Sentiment Analysis, Spam Detection and Topic Categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it assumes conditional independence of its features. This simplification makes the NB classifier easy to implement, but it is unrealistic in most cases and leads to lower accuracy. A direct improvement on the NB classifier is an algorithm that does not assume conditional independence but tries to estimate the weight vectors (feature values) directly.

This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea which can be implemented in a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

1. Regression Analysis

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let's say we have a dataset containing n datapoints; X = ( x^{(1)}, x^{(2)}, .., x^{(n)} ). For each of these (input) datapoints there is a corresponding (output) y^{(i)}-value. Here the x-datapoints are called the independent variables and y the dependent variable; the value of y^{(i)} depends on the value of x^{(i)}, while the value of x^{(i)} may be chosen freely, without any restriction imposed on it by any other variable.
The goal of Regression analysis is to find a function f(X) which can best describe the correlation between X and Y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_{\theta}(x).

 

Figure: the correlation (hypothesis) function.

 

If we can find such a function, we can say we have successfully built a Regression model. If the input data lives in a 2D space, this boils down to finding a curve which fits through the datapoints. In the 3D case we have to find a plane, and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset Y which contains the final grades of n students. Dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on studying time. The variable x^{(i)} therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:

 

Figures: final grade vs. studying time, with no visible correlation (left) and a strong correlation (right).

If the result looks like the figure on the left, then we are out of luck: the points appear to be distributed randomly and there is no correlation between Y and X at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes it.

 

This function could for example be:

h_{\theta}(X) = \theta_0+ \theta_1 \cdot x

or

h_{\theta}(X) = \theta_0 + \theta_1 \cdot x^2

where \theta are the parameters of our model.
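As a quick illustration (hypothetical, simulated data, not from the original post), both candidate hypothesis functions can be fitted with lm() in R:

set.seed(7)
hours <- runif(50, 0, 10)                      # x: hours studied
grade <- 2 + 0.8 * hours + rnorm(50, sd = 1)   # y: final grade (simulated)

linear    <- lm(grade ~ hours)                 # h_theta(x) = theta_0 + theta_1 * x
quadratic <- lm(grade ~ I(hours^2))            # h_theta(x) = theta_0 + theta_1 * x^2

coef(linear)      # the fitted theta_0 and theta_1
coef(quadratic)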

 

1.1. Multivariate Regression

Evaluating the results from the previous section, we may find them unsatisfying: the function does not correlate with the datapoints strongly enough. Our initial assumption is probably incomplete. Taking only the studying time into account is not enough; the final grade also depends on how much the students slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by X = ( (x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), .., (x_1^{(n)}, x_2^{(n)}) ). In this dataset x_1^{(i)} indicates how many hours student i has studied and x_2^{(i)} indicates how many hours he has slept.
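Continuing the earlier sketch (still simulated, hypothetical data), the multivariate hypothesis is fitted the same way, now with two independent variables:

set.seed(7)
hours_studied <- runif(50, 0, 10)
hours_slept   <- runif(50, 3, 9)
grade <- 1 + 0.7 * hours_studied + 0.4 * hours_slept + rnorm(50, sd = 1)

multi <- lm(grade ~ hours_studied + hours_slept)   # h_theta(x) = theta_0 + theta_1 * x_1 + theta_2 * x_2
coef(multi)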

See the rest of the blog here, including Linear vs Non-linear, Gradient Descent, Logistic Regression, and Text Classification and Sentiment Analysis.

Read more…

Guest blog post by Denis Rasulev

An open-source exploration of the city's neighborhoods, nightlife, airport traffic, and more, through the lens of publicly available taxi and Uber data.


Pick-ups:

Drop-offs:

The original post covers a lot more detail, and for those who want to pursue more analysis on their own, everything in the post - the data, software, and code - is freely available. Full instructions to download and analyze the data for yourself are available on GitHub.

Here is a link to the original post: link

Read more…

Principal Component Analysis using R

Guest blog post by suresh kumar gorakala

Curse of Dimensionality:

One of the most commonly faced problems in data analytics tasks such as recommendation engines and text analytics is high-dimensional and sparse data. We often face a situation where we have a large set of features and few data points, or data with very high-dimensional feature vectors. In such scenarios, fitting a model to the dataset results in lower predictive power of the model. This scenario is often termed the curse of dimensionality. In general, adding more data points or decreasing the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality.


In this blog, we will discuss principal component analysis (PCA), a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Principal component analysis:

Consider below scenario:

The data we want to work with is in the form of a matrix A of dimension m × n, shown below, where A_{i,j} represents the value of the i-th observation of the j-th variable.

 

Thus each of the m rows of the matrix can be identified with an n-dimensional vector of observations. If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.

Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

When applied, the algorithm linearly transforms the n-dimensional input space to a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n - k) dimensions. PCA allows us to discard the variables/features that have less variance.


Technically speaking, PCA uses orthogonal projection of highly correlated variables to a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This linear transformation is defined in such a way that the first principal component has the largest possible variance. It accounts for as much of the variability in the data as possible by considering highly correlated features. Each succeeding component in turn has the highest variance using the features that are less correlated with the first principal component and that are orthogonal to the preceding component.

In the above image, u1 & u2 are principal components wherein u1 accounts for highest variance in the dataset and u2 accounts for next highest variance and is orthogonal to u1.

PCA implementation in R:

For today's post we use the crimtab dataset available in R. It contains data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", ...) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", ...) correspond to (body) heights of the 3000 criminals; see also below.

head(crimtab)
142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.8 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9.9 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
182.88 185.42 187.96 190.5 193.04 195.58
9.4 0 0 0 0 0 0
9.5 0 0 0 0 0 0
9.6 0 0 0 0 0 0
9.7 0 0 0 0 0 0
9.8 0 0 0 0 0 0
9.9 0 0 0 0 0 0
dim(crimtab)
[1] 42 22
str(crimtab)
'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
[1] "142.24" "144.78" "147.32" "149.86" "152.4" "154.94" "157.48" "160.02" "162.56" "165.1" "167.64" "170.18" "172.72" "175.26" "177.8" "180.34"
[17] "182.88" "185.42" "187.96" "190.5" "193.04" "195.58"


Let us use apply() on the crimtab dataset column-wise (MARGIN = 2) to calculate the variance and see how each variable varies.

apply(crimtab,2,var) 


We observe that column “165.1” has the maximum variance in the data. Now let's apply PCA using prcomp().

pca = prcomp(crimtab)
pca

Note: the components of the pca object returned by the above code are the standard deviations and the rotation. From the standard deviations we can observe that the 1st principal component explains most of the variation, followed by the others. Rotation contains the principal component loadings matrix, which gives the proportion of each variable along each principal component.
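A quick additional check (not shown in the original post) is summary(), which reports the proportion of variance explained by each component:

summary(pca)
# The "Proportion of Variance" row shows how dominant PC1 is, and the
# "Cumulative Proportion" row shows how many components are worth keeping.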


Let’s plot all the principal components and see how the variance is accounted with each component.

par(mar = rep(2, 4))
plot(pca)


Clearly the first principal component accounts for the maximum information.
Let us interpret the results of PCA using a biplot. A biplot is used to show the proportions of each variable along the two principal components.

# The code below changes the directions of the biplot; if we do not include
# these two lines the plot will be a mirror image of the one below.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)

The output of the preceding code is as follows:


In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component vectors.


From the plot, we can see that the first principal component vector, PC1, more or less places equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than the 160.02 and 162.56 features. 


The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.

Complete Code for PCA implementation in R: 

# prcomp() and biplot() are part of base R (the stats package), so no extra library is needed.
data("crimtab")           # load data
head(crimtab)             # show sample data
dim(crimtab)              # check dimensions
str(crimtab)              # show structure of the data
sum(crimtab)              # total number of observations (3000)
colnames(crimtab)         # the height intervals
apply(crimtab, 2, var)    # check the variance across the variables
pca = prcomp(crimtab)     # apply principal component analysis to the crimtab data
par(mar = rep(2, 4))
plot(pca)                 # plot to show variable importance
# The code below changes the directions of the biplot; if we do not include
# these two lines the plot will be a mirror image of the one above.
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)    # plot the PCA components using a biplot


So by now we understand how to run PCA and how to interpret the principal components. Where do we go from here? How do we apply the reduced-variable dataset? In our next post we will answer these questions.

Originally posted here

Read more…

Guest blog post by Jean Villedieu

Following the Mediator scandal, France adopted a Sunshine Act in 2011. For the first time we have data on the presents and contracts awarded to health care professionals by pharmaceutical companies. Can we use graph visualization to understand these dangerous ties?

Dangerous ties

Pharmaceutical companies in France and in other countries use presents and contracts to influence the prescriptions of health care professionals. This has posed ethical problems in the past.

In France, 21 people are currently being prosecuted for their role in the Mediator scandal, which involves a drug that was recently banned. Some of them are accused of having helped the drug manufacturer obtain an authorization to sell its drug, and later fight its ban, in exchange for money.

In the US, GlaxoSmithKline was ordered to pay $3 billion in the largest health-care fraud settlement in US history. Before the settlement, GlaxoSmithKline paid various experts to fraudulently market the benefits of its drugs.

Such problems arose in part because of a lack of transparency in the ties between pharmaceutical companies and health-care professionals. With open data now available, can we change this?

Moving the data to Neo4j

Regards Citoyens, a French NGO, parsed various sources to build the first database documenting the financial relationships between health care providers and pharmaceutical manufacturers.

That database covers a period from January 2012 to June 2014. It contains 495 951 health care professionals (doctors, dentists, nurses, midwives, pharmacists) and 894 pharmaceutical companies. The contracts and presents represent a total of 244 572 645 €.

The original data can be found on the Regards Citoyens website.

The data is stored in one large CSV file. We are going to use graph visualization to understand the network formed by the financial relationships between pharmaceutical companies and health care professionals.

First we need to move the data into a Neo4j graph database using an import script (sunshine_import.cql).

Now the data is stored in Neo4j as a graph (download it here). It can be searched, explored and visualized through Linkurious.

Unfortunately, names in the data have been anonymized by Regards Citoyens following pressure from the CNIL (the French Commission nationale de l’informatique et des libertés).

Who is Sanofi giving money to?

Let’s start our data exploration with Sanofi, France's biggest pharmaceutical company. If we search for Sanofi in Linkurious we can see that it is connected to 57 765 professionals. Let’s focus on the 20 of Sanofi's contacts with the most connections.


Sanofi’s top 20 connections.

 

Among these entities there are 19 doctors in general medicine and one student. We can quickly grasp which professions Sanofi is targeting by coloring the health care professionals according to their profession:


19 doctors among Sanofi’s top 20 connections.

In a click, we can filter the visualization to focus on the doctors. We are now going to color them according to their region of origin.


Region of origin of Sanofi’s 19 doctors.

Indirectly, the health care professionals Sanofi connects to via presents also tell us about its competitors. Let’s look at who else has given presents to the health care professionals befriended by Sanofi.


Sanofi’s contacts (highlighted in red) are also in touch with other pharmaceutical companies.

 

Zooming in, we can see that Sanofi is at the center of a very dense network, next to Bristol-Myers Squibb, Pierre Fabre, Lilly and AstraZeneca, for example. According to the Sunshine dataset, Sanofi is competing with these companies.

We can also see an interesting node. It is a student who has received presents from 104 pharmaceutical companies including companies that are not direct competitors of Sanofi.


A successful student.

Why has he received so much attention? Unfortunately all we have is an ID (02b0d3726458ef46682389f2ac7dc7af).

Sanofi could identify the professionals its competitors have targeted and perhaps target them too in the future.

Who has received the most money from pharmaceutical companies in France?

Neo4j includes a graph query language called Cypher. Through Cypher we can compute complex graph queries and get results in seconds.

We can for example identify the doctor who has received the most money from pharmaceutical companies:

//Doctor who has received the most money
MATCH (a:DOCTOR)
WHERE a.totalDECL IS NOT NULL
RETURN a
ORDER BY a.totalDECL DESC
LIMIT 50

The doctor behind the ID 2d92eb1e795f7f538556c59e48aaa7c1 has received 77 480€ from 6 pharmaceutical companies.


The relationships are colored according to the money they represent. St Jude Medical has given over 70 231€ to Dr 2d92eb1e795f7f538556c59e48aaa7c1.

Perhaps next time they receive a prescription from Dr 2d92eb1e795f7f538556c59e48aaa7c1, his patients would like to know about his relationship with St Jude Medical. Unfortunately today the Sunshine data is anonymous.

We can also find the most generous pharmaceutical company.

//Company which has distributed the most money
MATCH (a:PHARMA)-[r:IS_LINKED_TO]->(b:DOCTOR)
RETURN a, sum(r.totalDECL) as total
ORDER BY total DESC
LIMIT 5

Novartis Pharma has awarded 12 595 760€ to various entities.


The 5 entities receiving the most money from Novartis.

 

When we look closer, we can see that the 5 entities which have received the most money from Novartis Pharma are 5 NGOs.


24f3287da6ab125862249416bc91f9c4 has received 75 000€.

Come meet us at GraphConnect in London, the biggest graph event in Europe. It is sponsored by Linkurious and you can use “Linkurious30” to register and get a 30% discount!

The Sunshine dataset offers a rare glimpse into the practice of pharmaceutical companies and how they use money to influence the behavior of health care professionals. Unfortunately for citizens looking for transparency, the data is anonymized. Perhaps it will change in the future?

Read more…

Guest blog post by Laetitia Van Cauwenberge

This article focuses on cases such as Facebook and protein interaction networks. It was written by Paul Scherer (paulmorio) and submitted as a research paper to HackCambridge. What makes this article interesting is that it compares five clustering techniques for this type of problem:

  • K-Clique Percolation - a clique-merging algorithm. Given a value k, the algorithm produces k-clique clusters and merges (percolates) them as necessary.
  • MCODE - a seed-growth approach to finding dense subgraphs.
  • DP Clustering - a seed-growth approach to finding dense subgraphs, similar to MCODE, but with an internal representation of edge weights and a different stopping condition.
  • IPCA - a modified DPClus algorithm which focuses on maintaining the diameter of a cluster (defined as the maximum shortest distance between all pairs of vertices) rather than its density.
  • CoAch - a combined approach which first finds a small number of cliques as complexes and then grows them.

The article also provides great visualizations, such as the one below:

In the original article, these visualizations are interactive, and you will find out which software was used to produce them.

Below is the summary (written by the original author):

Summary

For my submission to HackCambridge I wanted to spend my 24 hours learning something new in accordance with my interests. I was recently introduced to protein interaction networks in my Bioinformatics class, and during my review of machine learning techniques for an exam I noticed that we study many supervised methods but no unsupervised methods other than k-means clustering. Thus I decided to combine the two interests by clustering protein interaction networks with unsupervised clustering techniques and communicating my learning, results, and visualisations using the Beaker notebook.

The study of protein-protein interactions (PPIs) determined by high-throughput experimental techniques has created large sets of interaction data and a new need for methods allowing us to discover new information about biological function. These interactions can be thought of as a large-scale network, with nodes representing proteins and edges signifying an interaction between two proteins. In a PPI network, we can potentially find protein complexes or functional modules as densely connected subgraphs. A protein complex is a group of proteins that interact with each other at the same time and place, creating a quaternary structure. Functional modules are composed of proteins that bind each other at different times and places and are involved in the same cellular process. Various graph clustering algorithms have been applied to PPI networks to detect protein complexes or functional modules, including several designed specifically for PPI network analysis. A select few of the most famous and recent topographical clustering algorithms were implemented based on descriptions from papers, and applied to PPI networks. Upon completion it was recognized that it is possible to apply these to other interaction networks such as friend groups on social networks, site maps, or transportation networks, to name a few.

I decided to use Graphistry's GPU cluster to visualize the large networks, with the kind permission of Dr. Meyerovich (otherwise I would likely not have finished on time, given the specs of my machine), and to communicate my results and learning process.

The full version, with mathematical formulas, detailed descriptions, and source code, can be found here. For more articles about clustering, click here.


Read more…

Big Data Insights - IT Support Log Analysis

Guest blog post by Pradeep Mavuluri

This post offers a few glimpses of the insights obtained from a case where predictive analytics helped a Fortune 1000 client unlock the value in the huge log files of their IT support system. As quick background, a large organization was interested in actionable, value-added insights from the thousands of records logged in the past, as they saw expenses increase with no gain in productivity.

 

As most of us know, in these business scenarios end users are most interested in unexpected, strange and unusual findings that may not be captured in regular reports. Hence the data scientist's job does not end at finding un-routine insights; they also need to dig deeper for root causes and suggest the best possible actions for immediate remedy (knowledge of the domain, or of other best practices in the industry, helps a lot). Further, as mentioned earlier, only a few of those insights are shown and discussed here, and all the analysis was carried out with the R programming language: R 3.2.2, RStudio (favorite IDE) and the ggplot2 package for plotting.


The first graph (below) is a time series calendar heat map, adapted from Paul Bleicher, showing the number of tickets raised per day over every week of each month for the last year (green and its light shades represent low numbers, whereas red and its shades represent high numbers). A rough sketch of how such a heat map can be built with ggplot2 follows.
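For illustration only (the original plotting code is not shown in the post), assuming a data frame tickets with a date column (hypothetical name) and an English locale:

library(ggplot2)

daily <- as.data.frame(table(date = tickets$date))    # tickets per day
daily$date    <- as.Date(daily$date)
daily$month   <- factor(months(daily$date), levels = month.name)
daily$weekday <- factor(weekdays(daily$date),
                        levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
                                   "Friday", "Saturday", "Sunday"))
daily$week    <- as.integer(format(daily$date, "%U"))  # week of the year

ggplot(daily, aes(x = week, y = weekday, fill = Freq)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "green", high = "red") +   # green = low, red = high, as in the post
  facet_wrap(~ month, scales = "free_x") +
  labs(x = "Week of year", y = NULL, fill = "Tickets")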

 

Herein, if one carefully observes the above graph, it is evident that, except for the months of April & December, all other months show a sudden increase in the number of tickets raised on the last Saturdays and Sundays; this is most clearly visible at the quarter ends of March, June and September (and also in November, which is not a quarter end). One can think of this as unusual behavior, as the numbers rise on non-working days. Before going into further details, let's also look at one more graph (below), which depicts solved duration in minutes on the x-axis, with a horizontal timeline plot for each category.

The above solved-duration plot shows that, of all the records analyzed, 71.87% belong to the "Request for Information" category, and these were solved within a few minutes of the tickets being raised (which is why we cannot see a line plot for this category compared to the others). What happened here was, in effect, a kind of spoofing, enabled by a lack of automation in their systems. In simple words, it was found that proper documentation/guidance did not exist for many of the applications in use; this situation was exploited to increase the number of tickets (i.e., raising tickets even for basic information at month ends and quarter ends, which resulted in month-end openings that were in turn closed immediately). The case discussed here is one among many that were presented with possible immediate remedies that are easily actionable.


Visual Summarization:

 


Original Post

Read more…

Guest blog post by Petr Travkin

Part 1. Business scenarios.

I have spent many hours planning and executing an in-company self-service BI implementation. This enabled me to gain several insights. Now that the ideas have become mature enough and field-proven, I believe they are worth sharing. No matter how far along you are in toying with potential approaches (possibly you are already in the thick of it!), I hope my attempt at describing feasible scenarios provides a decent foundation.

All scenarios presume that IT plays its main role by owning the infrastructure and managing scalability, data security, and governance. I have tried to elaborate every aspect of each possible solution, leaving aside the vendor's marketing claims.

Scenario 1. Tableau Desktop + departmental/cross-functional data schemas.

This scenario involves gaining insights by data analysts on a daily basis. They might be either independent individuals or a team. Business users’ interaction with published workbooks is applicable, but limited to simple filtering.

User categories: professional data analysts;

Technical skills: intermediate/advanced SQL, intermediate/advanced Tableau;

Tableau training: 2-3 days full time (preferably) or continuous self-learning from scratch;

Licenses: Tableau Desktop.

Pros:

  • Pure self-service BI approach with no IT involved in data analysis;
  • Vast range of data available for analysis with almost no limits;
  • Fast response for complex ad-hoc business problems.

Cons:

  • Requires highly skilled data analysts;
  • Most likely involves Tableau training on query performance optimisation on a particular data source (e.g. Vertica).

Advice:

  • Create a “sandbox” that allows data analysts to query and collaborate on their own and without supervision. Further promotion of workbooks to production is welcome.

Scenario 2. Tableau Desktop + custom data marts.

In this scenario, business users are fully in charge of data analysis. IT provides custom data marts.

User categories: business users, line-managers;

Technical skills: basic SQL, basic/intermediate Tableau;

Tableau training: two or three 2-3h sessions + ad-hoc support on daily basis;

Licenses: Tableau Desktop + Server Interactors.

Pros:

  • Easy access to data for ad-hoc analysis;
  • Self-answering critical business questions;
  • Self-publishing for further ad-hoc access across multiple devices.

Cons:

  • Adding any data involves IT support;
  • Requires elaborated data dictionaries.

Advice:

  • Make requirements gathering a collaborative and iterative process with regular communication. That would ensure well-timed data delivery and quality;
  • Deliver training in 2-3 well-structured sessions with 2-3 week breaks, so business users have time to play with the software and develop a need for new skills;
  • Focus on rich visualisations, not tables.

Scenario 3. Tableau Server Web Edit + workbook templates

This scenario fully relies on data models published by data analysts and powerful Web Edit features of Tableau Server.

User categories: line-managers, top managers;

Technical skills: Tableau basics;

Tableau training: one 30 min demo session + ad-hoc support;

License: Server Interactor.

Pros:

  • No special training;
  • Fast Tableau adoption with basic, but powerful Self-service BI capabilities (Web Edit);
  • Thin client access via any Desktop Web Browser;
  • Could serve as a foundation for self-service BI adoption among C-Suite.

Cons:

  • Requires a high level of accuracy in data preparation and template development;
  • Any changes in the data model require development and republishing of a template.

Advice:

  • Try to select the most proactive and “data hungry” line manager or executive, who could help to spread the word;
  • Investigate analytical needs, ensure availability of a subject matter expert;
  • Start with simple visualisations, but be ready to increase complexity;
  • Provide as much ad-hoc assistance as you can.

In my next post, I would like to throw light on some technical aspects and limitations of each scenario.

I highly appreciate any comments and look forward to hearing about your experience.

Read more…
