Originally posted on Data Science Central

Original post published to DataScience+

In this post I will show how to collect data from a webpage and analyze or visualize it in R. For this task I will use the rvest package and will get the data from Wikipedia. I got the idea to write this post from Fisseha Berhane.
I will get the prevalence of obesity in the United States from a Wikipedia page and then plot it on a map. Let's begin by loading the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)


Download the data from Wikipedia.

## LOAD THE DATA ####
obesity = read_html("https://en.wikipedia.org/wiki/Obesity_in_the_United_States")
obesity = obesity %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(fill = TRUE)


The first line of code reads the page from Wikipedia, and the pipe that follows extracts the first table on the page and converts it into a data frame in R.
Here is the head of our data.

head(obesity)
State and District of Columbia Obese adults Overweight (incl. obese) adults
1 Alabama 30.1% 65.4%
2 Alaska 27.3% 64.5%
3 Arizona 23.3% 59.5%
4 Arkansas 28.1% 64.7%
5 California 23.1% 59.4%
6 Colorado 21.0% 55.0%
Obese children and adolescents Obesity rank
1 16.7% 3
2 11.1% 14
3 12.2% 40
4 16.4% 9
5 13.2% 41
6 9.9% 51


The data frame looks good; now we need to clean it to get it ready for plotting.

## CLEAN THE DATA ####
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : chr "30.1%" "27.3%" "23.3%" "28.1%" ...
$ Overweight (incl. obese) adults: chr "65.4%" "64.5%" "59.5%" "64.7%" ...
$ Obese children and adolescents : chr "16.7%" "11.1%" "12.2%" "16.4%" ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...

# remove the % and make the data numeric
for (i in 2:4) {
  obesity[, i] = gsub("%", "", obesity[, i])
  obesity[, i] = as.numeric(obesity[, i])
}
# check data again
str(obesity)
'data.frame': 51 obs. of 5 variables:
$ State and District of Columbia : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ Obese adults : num 30.1 27.3 23.3 28.1 23.1 21 20.8 22.1 25.9 23.3 ...
$ Overweight (incl. obese) adults: num 65.4 64.5 59.5 64.7 59.4 55 58.7 55 63.9 60.8 ...
$ Obese children and adolescents : num 16.7 11.1 12.2 16.4 13.2 9.9 12.3 14.8 22.8 14.4 ...
$ Obesity rank : int 3 14 40 9 41 51 49 43 22 39 ...


Fix the names of variables by removing the spaces.

names(obesity)
[1] "State and District of Columbia" "Obese adults"
[3] "Overweight (incl. obese) adults" "Obese children and adolescents"
[5] "Obesity rank"

names(obesity) = make.names(names(obesity))
names(obesity)
[1] "State.and.District.of.Columbia" "Obese.adults"
[3] "Overweight..incl..obese..adults" "Obese.children.and.adolescents"
[5] "Obesity.rank"


Now, it's time to load the map data (map_data() comes with ggplot2 but requires the maps package to be installed).

# load the map data
states = map_data("state")
str(states)
'data.frame': 15537 obs. of 6 variables:
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ subregion: chr NA NA NA NA ...


We will merge the two datasets (obesity and states) by region, so we first need to create a new variable, region, in the obesity dataset.

# create a new variable name for state
obesity$region = tolower(obesity$State.and.District.of.Columbia)


Merge the datasets.

states = merge(states, obesity, by="region", all.x=T)
str(states)
'data.frame': 15537 obs. of 11 variables:
$ region : chr "alabama" "alabama" "alabama" "alabama" ...
$ long : num -87.5 -87.5 -87.5 -87.5 -87.6 ...
$ lat : num 30.4 30.4 30.4 30.3 30.3 ...
$ group : num 1 1 1 1 1 1 1 1 1 1 ...
$ order : int 1 2 3 4 5 6 7 8 9 10 ...
$ subregion : chr NA NA NA NA ...
$ State.and.District.of.Columbia : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
$ Obese.adults : num 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 30.1 ...
$ Overweight..incl..obese..adults: num 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 65.4 ...
$ Obese.children.and.adolescents : num 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 16.7 ...
$ Obesity.rank : int 3 3 3 3 3 3 3 3 3 3 ...

Plot the data


Finally we will plot the prevalence of obesity in adults.

## MAKE THE PLOT ####
# adults
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                      guide = "colorbar", na.value = "black",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Prevalence of Obesity in Adults") +
  coord_map()


Here is the plot for adults:
[Figure: map of obesity prevalence in adults]
Similarly, we can plot the prevalence of obesity in children.

# children
ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.children.and.adolescents)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                      guide = "colorbar", na.value = "black",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Prevalence of Obesity in Children") +
  coord_map()


Here is the plot for children:
[Figure: map of obesity prevalence in children]
If you would like to show the state names on the map, use the code below to create a new dataset with the label positions.

statenames = states %>%
  group_by(region) %>%
  summarise(
    long = mean(range(long)),
    lat = mean(range(lat)),
    group = mean(group),
    Obese.adults = mean(Obese.adults),
    Obese.children.and.adolescents = mean(Obese.children.and.adolescents)
  )


Then add this layer to the ggplot code above:

geom_text(data=statenames, aes(x = long, y = lat, label = region), size=3)
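
For example, combining it with the adults map gives something like this (a sketch that simply reuses the states and statenames objects created above):

ggplot(states, aes(x = long, y = lat, group = group, fill = Obese.adults)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Percent", low = "#feceda", high = "#c81f49",
                      guide = "colorbar", na.value = "black",
                      breaks = pretty_breaks(n = 5)) +
  # state names placed at the approximate center of each state
  geom_text(data = statenames, aes(x = long, y = lat, label = region), size = 3) +
  labs(title = "Prevalence of Obesity in Adults") +
  coord_map()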


That's all. I hope you learned something useful today.

Read more…

Guest blog post by Vimal Natarajan

Introduction

The City and County of San Francisco launched an official open data portal called SF OpenData in 2009 as a product of its official open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents, and more. Under the category of Public Safety, the portal contains the list of SFPD incidents since Jan 1, 2003.
In my previous post I performed an exploratory time-series analysis on the crime incidents data to identify any patterns.
In this post I have performed an exploratory geo analysis on the crime incidents data to identify any patterns based on the San Francisco Police Department District classification.

Data

The data for this analysis was downloaded from the City and County of San Francisco's publicly available OpenData website, SF OpenData. The crime incidents database has data recorded from 2003 to the present. I downloaded the full data and performed my analysis for the period from 2003 to 2015, filtering out the data from 2016. There are nearly 1.9 million crime incidents in this dataset.
I have performed minimal data processing on the downloaded raw data to facilitate my analysis.


SFPD Police Districts

There are 10 police districts in the City and County of San Francisco. I have categorized my analysis based on these Police Districts.
•    Bayview
•    Central
•    Ingleside
•    Mission
•    Northern
•    Park
•    Richmond
•    Southern
•    Taraval
•    Tenderloin

Analysis

Crimes in Police District over the Years

The following plot depicts the number of crimes recorded from 2003 through the end of 2015, categorized by SFPD Police District.

By analyzing the plot above, we can arrive at the following insights:

  • Southern district has the highest number of crimes over the years, and the number of crimes has been increasing in the last few years. In addition, Central district also saw a sharp increase in the number of crimes in the last few years. Northern district, despite a steady decline in crimes from 2003 to 2010, has seen a sharp increase since 2011.
  • Park and Richmond districts have had the lowest number of crimes over these years.
  • Only Mission and Tenderloin have seen a steady decline in the number of crimes in recent years.


Crimes in Police District by Hour of the Day

The following plot depicts the number of crimes recorded by the hour and categorized by the SFPD Police Districts.

By analyzing the plot above, we can arrive at the following insights:

  • The number of crimes declines steadily from midnight, reaches its lowest point in the early morning hours, then increases and peaks around 6 PM. This is the same insight we arrived at in my previous analysis; even with the data categorized by police district, the pattern holds.
  • As seen in the previous plot, Park and Richmond districts have the lowest number of crimes throughout the day.
  • As highlighted in red in the plot above, the maximum number of crimes occurs in Southern district around 6 PM.


Crimes in Police District by Day of Week

The following plot depicts the number of crimes recorded during different days of the week and categorized by the SFPD Police Districts.

By analyzing the plot above, we can arrive at the following insights:

  • In general, there are fewer crimes during the weekends than on weekdays across all districts. The only exception is Central, where more crimes happen during the weekend, particularly on Saturdays. One possible reason could be that more people are around the Pier on Saturdays.
  • Crimes usually peak on Fridays across all districts, with the exception of Tenderloin, where Wednesday seems to have the most crimes.
  • As highlighted in red in the plot above, Fridays in Southern district have the maximum number of crimes. Taking the previous plot into account as well, Friday around 6 PM in Southern district appears to be the most dangerous time with regard to the number of crimes.

Technology

This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.
The data processing and plots were done using the R libraries ggplot2 and dplyr.

Read more…

Guest blog post by Klodian

Original post is published at DataScience+

Recently, I became interested in scraping data from webpages, such as Wikipedia, and visualizing it with R. As I did in my previous post, I use the rvest package to get the data from the webpage and the ggplot2 package to visualize it.
In this post, I will map life expectancy for White and African-American populations in the US.
Load the required packages.

## LOAD THE PACKAGES ####
library(rvest)
library(ggplot2)
library(dplyr)
library(scales)

Import the data from Wikipedia.

## LOAD THE DATA ####
le = read_html("https://en.wikipedia.org/wiki/List_of_U.S._states_by_life_expectancy")
le = le %>%
  html_nodes("table") %>%
  .[[2]] %>%
  html_table(fill = TRUE)

Now I have to clean the data. The comments below explain the role of each step.

## CLEAN THE DATA ####
# select only the columns with data
le = le[c(1:8)]
# get the column names from the 3rd row
names(le) = le[3, ]
# delete the rows and columns I am not interested in
le = le[-c(1:3), ]
le = le[, -c(5:7)]
# rename the 4th and 5th columns
names(le)[c(4, 5)] = c("le_black", "le_white")
# make the variables numeric
le = le %>%
  mutate(
    le_black = as.numeric(le_black),
    le_white = as.numeric(le_white))

Since there are some differences in life expectancy between White and African-American populations, I will calculate the difference and map it.

le = le %>% mutate(le_diff = (le_white - le_black))

I will load the map data and merge the datasets together.

## LOAD THE MAP DATA ####
states = map_data("state")
# create a new variable name for state
le$region = tolower(le$State)
# merge the datasets
states = merge(states, le, by = "region", all.x = TRUE)

Now it's time to make the plot. First, I will plot life expectancy for African Americans in the US. For a few states we don't have data, so I will color them grey.

## MAKE THE PLOT ####
# Life expectancy in African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()

Here is the plot:
[Figure: map of life expectancy in African Americans]

The code below is for the White population in the US.

# Life expectancy in White American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_white)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "Gray",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in White") +
  coord_map()

Here is the plot:
[Figure: map of life expectancy in White Americans]

Finally, I will map the difference in life expectancy between White and African American people in the US.

# Differences in Life expectancy between White and African American
ggplot(states, aes(x = long, y = lat, group = group, fill = le_diff)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Differences in Life Expectancy between \nWhite and African Americans by States in US") +
  coord_map()

Here is the plot:
[Figure: map of the difference in life expectancy]
On my previous post I got a comment asking to add a pop-up effect when hovering over the states. This is a simple task, as Andrea explained in his comment. What you have to do is install the plotly package, create a ggplot object, and then pass it to ggplotly(map_plot) to plot it.

library(plotly)
map_plot = ggplot(states, aes(x = long, y = lat, group = group, fill = le_black)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(name = "Years", low = "#ffe8ee", high = "#c81f49",
                      guide = "colorbar", na.value = "#eeeeee",
                      breaks = pretty_breaks(n = 5)) +
  labs(title = "Life expectancy in African American") +
  coord_map()
ggplotly(map_plot)

Here is the plot:
[Figure: interactive plotly map of life expectancy in African Americans]
That's all! Leave a comment below if you have any questions.

Original post: Map the Life Expectancy in United States with data from Wikipedia

Read more…

Why Visualize Data?

Guest blog post by Surendran B

If a picture is worth a thousand words, then what about a neat data visualization? Displaying information in graphics to generate better insights is not a new phenomenon, but, with the advent of technology and increased access to data, it has become far more prominent. Once restricted to analysis of economics, finance, and science, data visualization has emerged as an industry of its own.

There are now multiple tools to visualize data (like SocialCops Viz), competitions for visualizations, and even data visualization artists. From a haunting depiction of gun violence in America and an assessment of India’s RTE Act to a series of charts highlighting a footballer’s greatness and a collection of maps depicting India’s size, data visualization transcends fields — all data can be visualized.

Data in the eye of the beholder

Why visualize data? Beyond the mere aesthetic attraction of a beautiful graphic, data visualization matters because it can be extremely helpful. It is easier for the human brain to process large volumes of data through visuals rather than text. Studies have shown that humans find it easier to distinguish line length, shape orientation, and color — collectively known as pre-attentive attributes — than to read a series of numbers. This is because around two-thirds of our brain’s neurons are dedicated solely to vision. This makes it easier and quicker to interpret information visually.

Take the example of India’s GDP since Independence. The narrative is a familiar one. After Independence, Indian economic growth was anaemic (rarely hovering above 5%). It was only in the late 1980s and 1990s that growth really accelerated, driven by a tremendous increase in services output. Below are a table and graph conveying this same information, but the graph tells the story in a way that’s both more intuitive and informative.

Data visualization allows us to identify these sorts of trends, along with problems and possible solutions. This makes it a valuable tool for anyone — especially those working in policy. Policymakers use swathes of data across sectors to make important decisions. In this ocean of information, data visualization can quickly show them what needs to be refined or aborted. In a country as diverse and large as India, data visualization’s ability to show data effectively and quickly is paramount.

Data visualization for policy

The Government of India has embraced the potential of data visualization. As part of its Open Data platform, public users are encouraged to create their own visualizations to show different perspectives on government performance. Some ministries are going a step further by implementing their own data visualization initiatives. For instance, the Ministry of Rural Development has created a dashboard on MGNREGA implementation (the Indian government’s flagship workfare program). The dashboard has an intuitive interface that provides administrators with real-time visualized summaries of the program’s performance at all administrative levels (from gram panchayats to the center).

At SocialCops, we work with all sorts of decision makers to create intuitive data visualizations. For example, district collectors and other government officials use our platform to create dashboards, which provide important insights to help officials identify pain points, assess their progress, and target important schemes or initiatives to the places that need them most. Individual users can also create maps to better understand their data using our Viz tool.

Below is an example of a data visualization from the Socio Economic Caste Census — a nation-wide survey of socio-economic conditions of households across India.

The map above shows the proportion of rural households with kuccha roofs (thatched, plastic, or hand-made tiled roofs) in every district in India. The same data could have been described in words — for example, districts in Chhattisgarh, Odisha, and Madhya Pradesh have the highest proportion of kuccha roofs while districts in the South have the lowest, and so on. However, all that information — and more — is revealed through a quick glance at the map.

Many people now say that we are living in an era of big data. By some estimates, we are generating 2.5 quintillion (that’s 18 zeros) bytes of data daily. This data can be used for amazing policies and initiatives. However, for this to happen, data has to be managed and interpreted correctly.

This article was originally published here

Read more…

4 Potential Problems With Data Visualization

Originally posted on Data Science Central

Big data has been a big topic for a few years now, and it’s only going to grow bigger as we get our hands on more sophisticated forms of technology and new applications in which to use them. The problem now is beginning to shift; originally, tech developers and researchers were all about gathering greater quantities of data. Now, with all this data in tow, consumers and developers are both eager for new ways to condense, interpret, and take action on this data.

One of the newest and most talked-about methods for this is data visualization, a system of reducing or illustrating data in simplified, visual ways. The buzz around data visualization is strong and growing, but is the trend all it’s cracked up to be?

The Need for Data Visualization

There’s no question that data visualization can be a good thing, and it’s already helped thousands of marketers and analysts do their jobs more efficiently. Human abilities for pattern recognition tend to revolve around sensory inputs—for obvious reasons. We’re hard-wired to recognize visual patterns at a glance, but not to crunch complex numbers and associate those numbers with abstract concepts. Accordingly, representing complex numbers as integrated visual patterns would allow us to tap into our natural analytic abilities.

The Problems With Visualization

Unfortunately, there are a few current and forthcoming problems with the concept of data visualization:

  1. The oversimplification of data. One of the biggest draws of visualization is its ability to take big swaths of data and simplify them to more basic, understandable terms. However, it’s easy to go too far with this; trying to take millions of data points and confine their conclusions to a handful of pictorial representations could lead to unfounded conclusions, or completely neglect certain significant modifiers that could completely change the assumptions you walk away with. As an example not limited to the world of data, consider basic real-world tests, such as alcohol intoxication tests, which try to reduce complex systems to simple “yes” or “no” results—as Monder Law Group points out, these tests can be unreliable and flat-out inaccurate.

  2. The human limitations of algorithms. This is the biggest potential problem, and also the most complicated. Any algorithm used to reduce data to visual illustrations is based on human inputs, and human inputs can be fundamentally flawed. For example, a human developing an algorithm may highlight different pieces of data that are “most” important to consider, and throw out other pieces entirely; this doesn’t account for all companies or all situations, especially if there are data outliers or unique situations that demand an alternative approach. The problem is compounded by the fact that most data visualization systems are rolled out on a national scale; they evolve to become one-size-fits-all algorithms, and fail to address the specific needs of individuals.

  3. Overreliance on visuals. This is more of a problem with consumers than it is with developers, but it undermines the potential impact of visualization in general. When users start relying on visuals to interpret data, which they can use at-a-glance, they could easily start over-relying on this mode of input. For example, they may take their conclusions as absolute truth, never digging deeper into the data sets responsible for producing those visuals. The general conclusions you draw from this may be generally applicable, but they won’t tell you everything about your audiences or campaigns.

  4. The inevitability of visualization. Already, there are dozens of tools available to help us understand complex data sets with visual diagrams, charts, and illustrations, and data visualization is too popular to ever go away. We’re on a fast course to visualization taking over in multiple areas, and there’s no real going back at this point. To some, this may not seem like a problem, but consider some of the effects—companies racing to develop visualization products, and consumers only seeking products that offer visualization. These effects may feed into user overreliance on visuals, and compound the limitations of human errors in algorithm development (since companies will want to go to market as soon as possible).

There’s no stopping the development of data visualization, and we’re not arguing that it should be stopped. If it’s developed in the right ways, it can be an extraordinary tool for development in countless different areas—but collectively, we need to be aware of the potential problems and biggest obstacles data visualization will need to overcome. 

Read more…

Guest blog post by Manav pietro

These days, customer experience, data, and brand strategy are gaining a lot of importance in marketing. Customer experience and data analysis both play a bigger role, and marketers are spending more time on the broader business strategy instead of just focusing on advertising. These trends are summarized in the infographic titled “Let’s Talk about Customer Experience”.

According to a Gartner study, in the coming years most companies are expected to compete predominantly on the basis of customer experience. Delivering a satisfying, excellent customer experience is the new battleground for brands. A good customer experience encourages customer loyalty. If a brand fails to provide a good customer experience, its customers are likely to go elsewhere. Therefore, businesses are spending more time and effort developing customer experience strategies. As a result, the need to hire a customer experience officer has garnered much more attention.

The key role of a customer experience officer or executive is to oversee marketing communications, internal relations, community relations, investor relations, and various other interactions between an organization and its customers. To know more about customer experience, along with related information and facts, please refer to the infographic below.

Read more…

Originally posted on Data Science Central

With the enormous amounts of data generated in the technology era, data scientists have become an increasingly needed vocation. The US just named its first Chief Data Scientist and all the top companies are hiring their own. Yet due to the novelty of this profession, many are not entirely aware of the many career possibilities that come with being a data scientist. Those in the field can look forward to a promising career and excellent compensation. To learn more about what you can do with a career as a data scientist, check out this infographic created by Rutgers University’s Online Master of Information.


Data Scientist Career Trends

Persons interested in pursuing a career in this line of work should be prepared to go the distance in terms of their education. If we look at the current crop of data specialists, we will see that nearly half of them, 48%, have a PhD. A further 44% have earned their master’s degree, while only 8% have a bachelor’s degree. It is clear that a solid academic background will help immensely, both in gaining the knowledge required for this career and in impressing the important gatekeepers at various companies.

Common Certifications

Getting certified is another good strategy in creating an excellent resume that will draw offers from the best names in the industry. There are four common certifications that are currently available. These are the Certified Analytics Professional (CAP), the Cloudera Certified Professional: Data Scientist (CCP-DS), the EMC: Data Science Associate (EMCDSA), and the SAS Certified Predictive Modeler. Each of these is geared towards specific competencies. Learn more about them to find out the best ones to take for the desired career path.

Job Experience

The explosion of data is a fairly recent phenomenon aided by digital computing and the Internet. Massive amounts of information are now being collected every day and companies are trying to make sense of these. The pioneers have been around for a while but the bulk of the scientists working with data have been on the job for only four years or less at 76%. It’s a good time to enter the field for those who want to be trailblazers in a fresh and exciting area of technology.

Common Responsibilities

There are plenty of issues that are yet to be cleared up with data possibly providing a clear answer once and for all. In this field, practitioners are often relied upon to conduct research on open-ended industry and organization questions. They may also extract large volumes of data from various sources which is a non-trivial task. Then they must clean and remove irrelevant information to make their collections usable. 

Once everything has been primed, the scientists then begin their analysis to check for weaknesses, trends and opportunities. The clues are all in their hands. They simply have to look for the markers and make intelligent connections. Those who are into development can create algorithms that will solve problems and build new automation tools. After they have compiled all of their findings, they must then effectively communicate the results to the non-technical members of the management. 

Expected Salary

Data scientists are well-compensated for their technical skills. Their average earnings will depend on their years of experience in the field. Entry-level workers with less than 5 years under their belt can expect to earn around $92,000 annually. With almost a decade in data analysis, a person can take home $109,000 per year. Experienced scientists with nearly two decades in this career get about $121,000. The most respected pioneers earn $145,000 a year or more. The median salary was found to be $116,840 in 2016.

Career Possibilities

There are several industries with high demand for data scientists. It should be no surprise that the largest employer is the technology sector with about 41%. This is followed by 13% who work in marketing, 11% in corporate setting, 9% in consulting, 7% in health care, and 6% in financial services. The rest are scattered across government, academia, retail and gaming.

Job Roles

At their chosen workplace, they often take on more than one job role. Around 55.9% act as researchers for their company, mining the data for valuable information. Another common task is business management, with 40.2% saying they work in this capacity. Many are asked by their employer to use their skills as developers (36.5%) and creatives (36.3%).

Career Profile of US Chief Data Scientist

Dr. DJ Patil was an undergrad in Mathematics at the University of California, San Diego before earning his PhD in Applied Mathematics at the University of Maryland. There he used his skills to improve numerical weather forecasting by NOAA using their open datasets. He has written numerous publications that highlight the important applications of data science. In fact, he co-coined the term data scientist. His efforts have led to global recognition, including an award at the 2014 World Economic Forum. In 2015, he was appointed as the US Chief Data Scientist.

His work experiences have enabled him to use his skills in various industries. For instance, he was the Vice president of Product at RelateIQ, Head of Data Products and Chief Security Officer at LinkedIn, Data Scientist in Residence at Greylock Partners, Director of Strategy at eBay, Assistant Research Scientist at the University of Maryland, and AAAS Policy Fellow at the Department of Defense. 

Job Growth and Demand

Projections for this career are rosy, with well-known publications hailing it as the next big thing. Glassdoor named it the Top Job in America for 2016. The Harvard Business Review called it the Sexiest Job of the 21st Century. The good news for those who are thinking about starting on this path is that there’s plenty of room for new people. Nearly 80% of data scientists report a shortage in their field. They need reinforcement given the volume of work that they have to do. In fact, the projected growth over the next decade is at 11%, which is higher than the 7% estimated growth for all occupations.

Expert Tips

According to the experts, interested individuals must do these three things if they wish to succeed in the field: spend time learning effective analytics communication, consider relocation, and interact with other data scientists. The first is crucial as this involves highly technical work with results that need to be understood by non-technical managers. The second is a practical move with 75% of available jobs located on the East and West Coasts. The third is an advice common to all fields: widen your network, learn from your peers, and create future opportunities.

Read original article here.


Read more…

13 Great Data Science Infographics

Originally posted on Data Science Central

Most of these infographics are tutorials covering various topics in big data, machine learning, visualization, data science, Hadoop, R or Python, typically intended for beginners. Some are cheat sheets and can be nice summaries for professionals with years of experience. Some, popular a while back (you will find one example here), were designed as periodic tables.

For Geeks 

For Business People

Infographics Repositories

Read more…

Twitter Analytics using Tweepsmap

Guest blog post by Salman Khan

This morning I saw #tweepsmap on my Twitter feed and decided to check it out. Tweepsmap is a neat tool that can analyze any Twitter account from a social network perspective. It can create interactive maps showing where the followers of a Twitter account reside, segment followers, and even show who unfollowed you!

Here is my Followers map generated by country.

You can create the followers map based on city and state as well.

Tweepsmap also provides demographic information such as languages, occupation, and gender, but it relies on the Twitter user having entered this information in their Twitter profile.

There is also a hashtag and keyword analyzer that reports on the most prolific tweeters, locations of tweets, tweets vs. retweets, and so on. I used their free report, which is limited to a maximum of 100 tweets, to analyze the trending hashtag #BeautyAndTheBeast. For some reason, the #BeautyAndTheBeast hashtag is really popular in Brazil: out of the 100 tweets, 26 were from Brazil and 20 from the USA. You can see that 3 out of 5 of the top influencers with the most followers are tweeting in Portuguese. Other visualizations included the tweets vs. retweets numbers and the distribution reach of the tweeters. I was even able to make the report public so you can check it out here. Remember it only analyzes 100 tweets, so don't draw any conclusions from it!

If you are doing research on social media, or are a business that wants to learn more about competitors and customers, Tweepsmap helps you analyze specific Twitter accounts as well! Of course, we all know there is no such thing as a free lunch, so this is a paid feature!

From what I saw by tinkering with the pricing calculator on their page, the analysis of a Twitter account with more than 2.5M followers will cost a flat fee of $5K. I tried a few Twitter accounts to see how much each would cost based on the number of followers and found that the cost per follower was $0.002. So if you wanted to get Twitter data on Hans Rosling, it would cost you $642, as he has 320,956 followers (642 / 320,956 = 0.002).
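
In R, that observed pricing works out to something like this (a rough sketch based only on the numbers above; the flat fee and per-follower rate are just what the calculator appeared to show):

# estimated Tweepsmap report cost, based on the pricing observed above
estimate_cost = function(followers, rate = 0.002, flat_fee = 5000, cap = 2.5e6) {
  if (followers > cap) flat_fee else followers * rate
}

estimate_cost(320956)  # ~642, matching the Hans Rosling example
estimate_cost(3e6)     # 5000, the flat fee above 2.5M followers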


Overall, this looks like a neat tool to get started with when analyzing Twitter data and using that information to maximize the returns on your tweets. I have only mentioned a few of their tools above; they have other features like Best Time to Tweet, which analyzes your audience, Twitter history, time zones, and so on to predict when you will get the most out of your tweet. Check out their website for more info here.

Read more…

Originally posted on Data Science Central

Written by sought-after speaker, designer, and researcher Stephanie D. H. Evergreen, Effective Data Visualization shows readers how to create Excel charts and graphs that best communicate data findings. This comprehensive how-to guide functions as a set of blueprints—supported by research and the author’s extensive experience with clients in industries all over the world—for conveying data in an impactful way. Delivered in Evergreen’s humorous and approachable style, the book covers the spectrum of graph types available beyond the default options, how to determine which one most appropriately fits specific data stories, and easy steps for making the chosen graph in Excel.

The book is available here.


Read more…

Guest blog post by Max Wegner

What’s the first thing you think of when you hear the phrase “artificial intelligence”? Perhaps it’s the HAL 9000 from 2001: A Space Odyssey, or maybe it’s chess Grandmaster Garry Kasparov losing to IBM’s Deep Blue supercomputer. While those are indeed examples of artificial intelligence, examples of AI in the real world of today are a bit more mundane and a whole lot less sinister.

In fact, many of us use AI, in one form or another, in our everyday lives. The personal assistant on your smartphone that helps you locate information, the facial recognition software on Facebook photos, and even the gesture control on your favourite video game are all examples of practical AI applications. Rather than being a part of a dystopian world view in which the machines take over, current AI makes our lives a whole lot more convenient by carrying out simple tasks for us.

What’s more, there’s a lot of money flowing into a lot of companies working on AI developments. This means that in the near future, we could see even more practical uses for AI, from smart robots to smart drones and more.

To give you a better understanding of the current state of AI, our friends at appcessories.co.uk have put together this helpful Artificial Intelligence infographic. It will give you the full rundown, from categories to geography to finances. Check it out, and you’ll see why AI is so essential to our everyday lives, and why the future of AI looks so bright.

Read more…

Originally posted on Data Science Central

This article was posted by Bethany Cartwright. Bethany is the blog team's Data Visualization Intern. She spends most of her time creating infographics and other visuals for blog posts.

Whether you’re writing a blog post, putting together a presentation, or working on a full-length report, using data in your content marketing strategy is a must. Using data helps enhance your arguments by making your writing more compelling. It gives your readers context. And it helps provide support for your claims.

That being said, if you’re not a data scientist yourself, it can be difficult to know where to look for data and how best to present that data once you’ve got it. To help, below you'll find the tools and resources you need to source credible data and create some stunning visualizations.

Resources for Uncovering Credible Data

When looking for data, it’s important to find numbers that not only look good, but are also credible and reliable.

The following resources will point you in the direction of some credible sources to get you started, but don’t forget to fact-check everything you come across. Always ask yourself: Is this data original, reliable, current, and comprehensive?

Tools for Creating Data Visualizations

Now that you know where to find credible data, it’s time to start thinking about how you’re going to display that data in a way that works for your audience.

At its core, data visualization is the process of turning basic facts and figures into a digestible image --  whether it’s a chart, graph, timeline, map, infographic, or other type of visual. 

While understanding the theory behind data visualization is one thing, you also need the tools and resources to make digital data visualization possible. Below we’ve collected 10 powerful tools for you to browse, bookmark, or download to make designing data visuals even easier for your business.

To check all this information, click here. For more articles about data visualization, click here.


Read more…

Guest blog post by Mike Waldron

Originally posted on Data Science Central

This blog was originally published on the AYLIEN Text Analysis blog

We wanted to gather and analyze news content in order to look for similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.

Note: For a more technical, in-depth and interactive representation of this project, check out the Jupyter notebook we created. This includes sample code and more in depth descriptions of our approach.

The Approach

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and tSNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists

Creating Parse Trees

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence, according to some pre-defined grammar. Here’s an example;

For example, with a simple sentence like “The cat sat on the mat”, a parse tree might look like this (in bracketed form, roughly (S (NP The cat) (VP sat (PP on (NP the mat))))):



Thankfully parsing our extracted headlines isn’t too difficult. We used the Pattern Library for Python to parse the headlines and generate our parse trees.

Data

In total we gathered about 700 article headlines for both journalists using the AYLIEN News API which we then analyzed using Python. If you’d like to give it a go yourself, you can grab the Pickled data files directly from the GitHub repository (link), or by using the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them together with some basic information about the headline in the same Python object.

Then using a sequence similarity metric, we compared all of these headlines two by two, to build a similarity matrix.
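
To make the idea concrete, here is a rough sketch of such a metric in R (the original analysis was done in Python; the longest-common-subsequence ratio and the chunk-type sequences below are only an illustration of the kind of comparison involved):

# similarity between two sequences of chunk types: 2 * LCS length / total length
lcs_length = function(a, b) {
  m = length(a); n = length(b)
  d = matrix(0, nrow = m + 1, ncol = n + 1)
  for (i in seq_len(m)) {
    for (j in seq_len(n)) {
      d[i + 1, j + 1] = if (a[i] == b[j]) d[i, j] + 1 else max(d[i, j + 1], d[i + 1, j])
    }
  }
  d[m + 1, n + 1]
}

chunk_similarity = function(a, b) 2 * lcs_length(a, b) / (length(a) + length(b))

# two hypothetical headline structures, as chunk-type sequences
chunk_similarity(c("NP", "VP", "PP", "NP"), c("NP", "VP", "NP"))  # about 0.86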

Visualizations

To visualize headline similarities for Akin we generated a 2D scatter plot, with the hope that similarly structured headlines would be grouped close together on the graph.

To achieve this, we first reduced the dimensionality of our similarity matrix using tSNE and applied K-Means clustering to find groups of similar headlines. We also used a nice viz library, as outlined below:

  • tSNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Plotted the actual chart using Bokeh



The chart above shows a number of dense groups of headlines, as well as some sparse ones. Each dot on the graph represents a headline, as you can see when you hover over one in the interactive version. Similar titles are, as you can see, grouped together quite cleanly. Some of the groups that stand out are:

  • The circular group left of center typically consists of short, snappy stock-update headlines such as “Viacom is crashing”.
  • The large circular group on the top right consists mostly of announcement-style headlines such as “Here come the…” formats.
  • The small green circular group towards the bottom left contains similar headlines that use the same phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.

Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.


Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors. The yellow dots represent our Celebrity focused author and the blue our finance guy.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also covers some of the headlines from the second author, such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.

Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them quite interesting, there were some areas that we thought could be improved. Some of the weaknesses of our approach, and ways to improve them, are:

 - Using entire parse trees instead of just the chunk types

 - Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistic-aware one too)

 - Better pre-processing to identify and normalize Named Entities, etc.

Next up..

In our next post, we’re going to study the correlations between various headline structures and some external metrics like number of Shares and Likes on Social Media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy Celebrity style headlines would probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or want to see the sample code we used head over to our Jupyter notebook.

Read more…

Guest blog post by Jeff Pettiross

For almost as long as we have been writing, we’ve been putting meaning into maps, charts, and graphs. Some 1,300 years ago, Chinese astronomers recorded the position of the stars and the shapes of the constellations. The Dunhuang star maps are the oldest preserved atlas of the sky:

More than 500 years ago, the residents of the Marshall Islands learned to navigate the surrounding waters by canoe in the daytime—without the aid of stars. These master rowers learned to recognize the feel of the currents reflecting off the nearby islands. They visualized their insights on maps made of sticks, rocks, and shells.

In the 1800s, Florence Nightingale used charts to explain to government officials how treatable diseases were killing more soldiers in the Crimean War than battle wounds. She knew that pictures would tell a more powerful story than numbers alone:

Why Visualized Data Is So Powerful

Since long before spreadsheets and graphing software, we have communicated data through pictures. But we’ve only begun, in the last half-century, to understand why visualizations are such effective tools for seeing and understanding data.

It starts with the part of your brain called the visual cortex. Located near the bony lump at the back of your skull, it processes input from your eyes. Thanks to the visual cortex, our sense of sight provides information much faster than the other senses. We actually begin to process what we see before we think about it.

This is sound from an evolutionary perspective. The early human who had to stop and think, “Hmm, is that a jaguar sprinting toward me?” probably didn’t survive to pass on their genes. There is a biological imperative for our sense of sight to override cognition—in this case, for us to pay sharp attention to movement in our peripheral vision.

Today, our sight is more likely to save us on a busy street than on the savannah. Moving cars and blinking lights activate the same peripheral attention, helping us navigate a complicated visual environment. We see other cues on the street, too. Bright orange traffic cones mark hazards. Signs call out places, directions, and warnings. Vertical stripes on the street indicate lanes while horizontal lines indicate stop lines.

We have designed a rich, visual system that drivers can comprehend quickly, thanks to perceptual psychology. Our visual cortex is attuned to color hues (like safety orange), position (signs placed above road), and line orientation (lanes versus stop lines). Research has identified other visual features. Size, clustering, and shape also help us perceive our environment almost immediately.

What This Means for Us Today

Fortunately, our offices and homes tend to be safer than the savannah or the highway. Regardless, our lightning-quick sense of vision jumps into action even when we read email, tweets, or websites. And that right there is why data visualization communicates so powerfully and immediately: It takes advantage of these visual features, too.

A line graph immediately reveals upward or downward changes, thanks to the orientation of each segment. The axes of the graph use position to communicate values in relationship to each other. If there are multiple, colored lines, the color hue lets us rapidly tell the lines apart, no matter how many times they cross. Bar charts, maps with symbols, area graphs—these all use the visual superhighway in our brains to communicate meaning.

The early pioneers of data visualization were led by their intuition to use visual features like position, clustering, and hue. The longevity of those works is a testament to their power.

We now have software to help us visualize data and to turn tables of facts and figures into meaningful insights. That means anyone, even non-experts, can explore data in a way that wasn’t possible even 20 years ago. We can, all of us, analyze the world’s growing volume of data, spot trends and outliers, and make data-driven decisions.

Today, we don’t just have charts and graphs; we have the science behind them. We have started to unlock the principles of perception and cognition so we can apply them in new ways and in various combinations. A scatter plot can leverage position, hue, and size to visualize data. Its data points can interactively filter related charts, allowing the user to shift perspectives in their analysis by simply clicking on a point. Animating transitions as users pivot from one idea to the next brings previously hidden differences to the foreground. We’re building on the intuition of the pioneers and the conclusions of science to make analysis faster, easier, and more intuitive.
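
As a small illustration of that idea (my own sketch, not from the original post), a single ggplot2 scatter plot can encode several variables at once through position, hue, and size:

library(ggplot2)

# position (x, y), hue (color), and size each carry a different variable
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp)) +
  geom_point(alpha = 0.8) +
  labs(x = "Weight", y = "Miles per gallon", color = "Cylinders", size = "Horsepower")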

When humanity unlocked the science behind fire and magnets, we learned to harness chemistry and physics in new ways. And we revolutionized the world with steam engines and electrical generators.

Humanity is now at the dawn of a new revolution, and intuitive tools are putting the beautiful science of data visualization into the hands of millions of users.

I’m excited to see where you take all of us next.

Note: This post first appeared in VentureBeat.

Read more…

Investigating Airport Connectedness

Guest blog post by SupStat

Contributed by Sricharan Maddineni, a neuroscientist with a strong passion and talent for data science. He took the NYC Data Science Academy 12-week boot camp program between January 11th and April 1st, 2016. This post is based on his second project, posted on February 16th (due in the 4th week of the program). He acquired publicly available transportation data, consulted social media along the way, and visualized the economic and business insights he found.

Why Are Airports Important?

(Photo by theprospect.net)

Aviation infrastructure has been a bedrock of the United States economy and culture for many decades, and it was the first instrument through which we connected with the world. Before the invention of flight, humans were inexorably confined by the immenseness of Earth's oceans.

All the disdain and unpleasantries we endure on flights are quickly forgotten once we safely land at our destinations and realize we have just been transported to a new place on our vast planet. Every time I have flown and landed in a new country or city, I am overwhelmed with feelings of how beautiful our world is and how much I wish I could visit every corner of our planet. My love of aviation has led me to investigate the connectedness of United States airports and the passenger-disparity between the developed and developing countries.

The App

The interactive map can be used as a tool to investigate the connectedness of the US airports. Users can choose from a list of airports including LAX, JFK, IAD and more to visualize the connections out of that airport. The 'Airport Connections' table shows us the combinations of connections by Airline Carrier. For example, we can see that American Airlines (AA) had 8058 flights out of LAX to JFK (2009 dataset). The 'Carriers' table shows us the total flights out of LAX by American Airlines (76,670).

If we select Hartsfield-Jackson Atlanta International, we see that it is the most connected airport in the United States. *Please note that I am not plotting all possible connections, just major airport connections and only within the United States (the map would be filled solid if I plotted all connections!). The size of the airport bubble is calculated by the number of connections. Therefore, all large bubbles are international airports, and smaller bubbles are regional/domestic airports.

I also plotted Voronoi tessellations between the airports, using one nearest neighbor, to show the area differences between airports on the East Coast, West Coast, and in the Midwest. The largest polygons are found in the Midwest because airports are far apart in all directions. These airports are generally more connected as well since they are connecting the east and west coast (see Denver International or Salt Lake City International). Clicking on a Voronoi polygon brings up the nearest airport within that area.

Why is it important for countries to improve their airport infrastructure?

Looking at the Motion/Bubble Chart, we observe that developing countries travel horizontally whereas developed countries travel vertically. This indicates that developed countries' populations have remained steady, but they have seen a rise in passenger travelers. On the flip side, developing countries have seen their populations boom, but the number of air travelers has remained stagnant.

Most importantly, countries moving upward show noticeable gains in GDP whereas countries moving horizontally show minimal gains over the last four decades (GDP is represented by the size of the bubble). We can also notice that airline passenger counts plunge during recessions for first world countries but remain comparatively steady for developing countries (1980, 2000, 2009). We can interpret this to mean that developing countries are not as connected to the rest of the world since their economies are unaffected by global economic crises.

Passenger Counts during weekends and Holidays

The calendar heatmap shows us the Daily flight count in the United States. We can recognize that airlines operate significantly fewer flights on Saturdays and National Holidays such as July 4th and Thanksgiving. The days leading up to and after National Holidays show an increase in flights as expected. Looking carefully, you can also notice there are fewer flights on Tuesdays and Wednesdays, and there are more flights during the summer season.

If you select a day on the calendar, a table shows us the top 20 Airline carrier flight counts on that day. Southwest, American Airlines, SkyWest, and Delta seem to operate the most airlines in the United States.

            

The Data

1. Interactive Map

I utilized comprehensive datasets provided by the United States Department of Transportation and Open Data by Socrata that allowed me to map airport connections in the United States. The first airport dataset included airport locations (city/state) and their latitude and longitude degrees, and the second dataset included the airport connections (LAX - JFK, LAX-SFO, ...). First, I used these datasets to calculate the size of the airport based on how many connections each had.

https://github.com/nycdatasci/bootcamp004_project/tree/master/Project2-Shiny/Sri_New

2. Motion Chart

The second analysis was done using the airline passenger, population, and GDP numbers for the world's countries over the last 45 years. Most of the work here was in transforming the three datasets provided by the World Bank from wide to long. See the code below.

https://github.com/nycdatasci/bootcamp004_project/tree/master/Project2-Shiny/Sri_New

3. Calendar Chart

Lastly, I used the Transtats database to obtain the daily flight counts by Airline Carrier for the years 2004-2007. Some transformation was done to create two separate data frames - flight counts per day and flight counts per carrier. While trying to calculate flight counts by day, I tried this code:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, month) %>% summarise(sum = n()) 

I knew there was an error by looking at the resulting heatmap, but I didn't realize it was showing me a cumulative sum by month rather than the daily flight count, so I took to Twitter to see if I could get help diagnosing my problem. I tweeted Jeff Weis, who appeared as an aviation analyst on CNN during the Malaysian Airlines MH370 disappearance, and he caught my mistake! After he pushed me in the right direction, I corrected my code to:

f2007_2 <- f2007 %>% group_by(UniqueCarrier, date) %>% summarise(count = n())

The Code

Creating Voronoi Polygons
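The full implementation is in the repository linked above; as a rough sketch, the Voronoi tiles can be computed from the airport coordinates with the deldir package (assuming an airports data frame with name, lon, and lat columns, which are hypothetical names):

library(deldir)

vor   <- deldir(airports$lon, airports$lat)   # compute the tessellation
tiles <- tile.list(vor)                       # one tile (polygon) per airport

# flatten the tiles into a data frame of polygon vertices for plotting
polys <- lapply(seq_along(tiles), function(i) {
  data.frame(airport = airports$name[i],
             lon     = tiles[[i]]$x,
             lat     = tiles[[i]]$y)
})
voronoi_df <- do.call(rbind, polys)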

Connection Lines

The second step was creating the line connections between the airports. To do this, I used the addPolylines function in Leaflet to add connecting lines between airports, filtered by user input. input$Input1 catches the user-selected airport and subsets the dataset to all origin airports that equal the selection. The gcIntermediate function (from the geosphere package) makes those lines curved.
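A minimal sketch of that server-side logic, assuming airports (name, lon, lat) and routes (origin, dest) data frames with hypothetical column names:

library(leaflet)
library(geosphere)
library(dplyr)

output$map <- renderLeaflet({
  sel   <- filter(routes, origin == input$Input1)   # routes out of the chosen airport
  start <- filter(airports, name == input$Input1)
  m <- leaflet() %>% addTiles()
  for (d in sel$dest) {
    end <- filter(airports, name == d)
    # intermediate great-circle points give the curved look
    arc <- gcIntermediate(c(start$lon, start$lat), c(end$lon, end$lat),
                          n = 50, addStartEnd = TRUE)
    m <- addPolylines(m, lng = arc[, 1], lat = arc[, 2], weight = 1)
  }
  m
})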

Calendar JSON capture

The calendar chart required two parameters: whichdatevar, which reads the date column, and numvar, which plots the value for each day on the calendar. I then used the gvis.listener.jscode option to capture the user-selected date and filter the dataset for the table.
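A hedged sketch of the calendar chart and the date-capture listener, assuming a daily_counts data frame with date and flights columns (hypothetical names) and using the argument names from the googleVis documentation:

library(googleVis)

cal <- gvisCalendar(daily_counts,
                    datevar = "date",
                    numvar  = "flights",
                    options = list(
                      height = 400,
                      # send the clicked date back to Shiny as input$selected_date
                      gvis.listener.jscode =
                        "var d = data.getValue(chart.getSelection()[0].row, 0);
                         Shiny.onInputChange('selected_date', d.toDateString());"
                    ))
# in the Shiny server: output$calendar <- renderGvis({ cal })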

To experience Sricharan Maddineni's interactive Shiny app, see the GitHub repository linked above.

Read more…

Guest blog post by Irina Papuc

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.

The supply of able ML designers has yet to catch up to this demand. A major reason for this is that ML is just plain tricky. This tutorial introduces the basics of Machine Learning theory, laying down the common themes and concepts, making it easy to follow the logic and get comfortable with the topic.

What is Machine Learning?

So what exactly is “machine learning” anyway? ML is actually a lot of things. The field is quite vast and is expanding rapidly, being continually partitioned and sub-partitioned ad nauseam into different sub-specialties and types of machine learning.

There are some basic common threads, however, and the overarching theme is best summed up by this oft-quoted statement made by Arthur Samuel way back in 1959: “[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.”

And more recently, in 1997, Tom Mitchell gave a “well-posed” definition that has proven more useful to engineering types: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

“A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.” -- Tom Mitchell, Carnegie Mellon University

So if you want your program to predict, for example, traffic patterns at a busy intersection (task T), you can run it through a machine learning algorithm with data about past traffic patterns (experience E) and, if it has successfully “learned”, it will then do better at predicting future traffic patterns (performance measure P).

The highly complex nature of many real-world problems, though, often means that inventing specialized algorithms that will solve them perfectly every time is impractical, if not impossible. Examples of machine learning problems include, “Is this cancer?”, “What is the market value of this house?”, “Which of these people are good friends with each other?”, “Will this rocket engine explode on take off?”, “Will this person like this movie?”, “Who is this?”, “What did you say?”, and “How do you fly this thing?”. All of these problems are excellent targets for an ML project, and in fact ML has been applied to each of them with great success.

ML solves problems that cannot be solved by numerical means alone.

Among the different types of ML tasks, a crucial distinction is drawn between supervised and unsupervised learning:

  • Supervised machine learning: The program is “trained” on a pre-defined set of “training examples”, which then facilitate its ability to reach an accurate conclusion when given new data.

  • Unsupervised machine learning: The program is given a bunch of data and must find patterns and relationships therein.

We will primarily focus on supervised learning here, but the end of the article includes a brief discussion of unsupervised learning with some links for those who are interested in pursuing the topic further.

Supervised Machine Learning

In the majority of supervised learning applications, the ultimate goal is to develop a finely tuned predictor function h(x) (sometimes called the “hypothesis”). “Learning” consists of using sophisticated mathematical algorithms to optimize this function so that, given input data x about a certain domain (say, square footage of a house), it will accurately predict some interesting value h(x) (say, market price for said house).

In practice, x almost always represents multiple data points. So, for example, a housing price predictor might take not only square-footage (x1) but also number of bedrooms (x2), number of bathrooms (x3), number of floors (x4), year built (x5), zip code (x6), and so forth. Determining which inputs to use is an important part of ML design. However, for the sake of explanation, it is easiest to assume a single input value is used.

So let’s say our simple predictor has this form:

h(x) = θ₀ + θ₁x

where θ₀ and θ₁ are constants. Our goal is to find the perfect values of θ₀ and θ₁ to make our predictor work as well as possible.

Optimizing the predictor h(x) is done using training examples. For each training example, we have an input value x_train, for which a corresponding output, y, is known in advance. For each example, we find the difference between the known, correct value y and our predicted value h(x_train). With enough training examples, these differences give us a useful way to measure the “wrongness” of h(x). We can then tweak h(x) by tweaking the values of θ₀ and θ₁ to make it “less wrong”. This process is repeated over and over until the system has converged on the best values for θ₀ and θ₁. In this way, the predictor becomes trained, and is ready to do some real-world predicting.

A Simple Machine Learning Example

We stick to simple problems in this post for the sake of illustration, but the reason ML exists is because, in the real world, the problems are much more complex. On this flat screen we can draw you a picture of, at most, a three-dimensional data set, but ML problems commonly deal with data with millions of dimensions, and very complex predictor functions. ML solves problems that cannot be solved by numerical means alone.

With that in mind, let’s look at a simple example. Say we have the following training data, wherein company employees have rated their satisfaction on a scale of 1 to 100:

First, notice that the data is a little noisy. That is, while we can see that there is a pattern to it (i.e. employee satisfaction tends to go up as salary goes up), it does not all fit neatly on a straight line. This will always be the case with real-world data (and we absolutely want to train our machine using real-world data!). So then how can we train a machine to perfectly predict an employee’s level of satisfaction? The answer, of course, is that we can’t. The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

It is somewhat reminiscent of the famous statement by British mathematician and professor of statistics George E. P. Box that “all models are wrong, but some are useful”.

The goal of ML is never to make “perfect” guesses, because ML deals in domains where there is no such thing. The goal is to make guesses that are good enough to be useful.

ML builds heavily on statistics. For example, when we train our machine to learn, we have to give it a statistically significant random sample as training data. If the training set is not random, we run the risk of the machine learning patterns that aren’t actually there. And if the training set is too small (see law of large numbers), we won’t learn enough and may even reach inaccurate conclusions. For example, attempting to predict company-wide satisfaction patterns based on data from upper management alone would likely be error-prone.

With this understanding, let's give our machine the data above and let it learn from it. First we have to initialize our predictor h(x) with some reasonable values of θ₀ and θ₁. Now our predictor looks like this when placed over our training set:

If we ask this predictor for the satisfaction of an employee making $60k, it would predict a rating of 27:

It’s obvious that this was a terrible guess and that this machine doesn’t know very much.

So now, let's give this predictor all the salaries from our training set, and take the differences between the resulting predicted satisfaction ratings and the actual satisfaction ratings of the corresponding employees. If we perform a little mathematical wizardry (which I will describe shortly), we can calculate, with very high certainty, that values of 13.12 for θ₀ and 0.61 for θ₁ are going to give us a better predictor.

And if we repeat this process, say 1500 times, our predictor will end up looking like this:

At this point, if we repeat the process, we will find that θ₀ and θ₁ no longer change by any appreciable amount, and thus we see that the system has converged. If we haven't made any mistakes, this means we've found the optimal predictor. Accordingly, if we now ask the machine again for the satisfaction rating of the employee who makes $60k, it will predict a rating of roughly 60.

Now we’re getting somewhere.

A Note on Complexity

The above example is technically a simple problem of univariate linear regression, which in reality can be solved by deriving a simple normal equation and skipping this “tuning” process altogether. However, consider a predictor that looks like this:

This function takes input in four dimensions and has a variety of polynomial terms. Deriving a normal equation for this function is a significant challenge. Many modern machine learning problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients. Predicting how an organism’s genome will be expressed, or what the climate will be like in fifty years, are examples of such complex problems.

Many modern ML problems take thousands or even millions of dimensions of data to build predictions using hundreds of coefficients.

Fortunately, the iterative approach taken by ML systems is much more resilient in the face of such complexity. Instead of using brute force, a machine learning system “feels its way” to the answer. For big problems, this works much better. While this doesn’t mean that ML can solve all arbitrarily complex problems (it can’t), it does make for an incredibly flexible and powerful tool.

Gradient Descent - Minimizing “Wrongness”

Let's take a closer look at how this iterative process works. In the above example, how do we make sure θ₀ and θ₁ are getting better with each step, and not worse? The answer lies in our “measurement of wrongness” alluded to previously, along with a little calculus.

The wrongness measure is known as the cost function (a.k.a. loss function), J(θ). The input θ represents all of the coefficients we are using in our predictor, so in our case θ is really the pair θ₀ and θ₁. J(θ) gives us a mathematical measurement of how wrong our predictor is when it uses the given values of θ₀ and θ₁.

The choice of the cost function is another important piece of an ML program. In different contexts, being “wrong” can mean very different things. In our employee satisfaction example, the well-established standard is the linear least squares function:

J(θ) = 1/(2m) · Σᵢ (h(xᵢ) − yᵢ)², summed over all m training examples

With least squares, the penalty for a bad guess goes up quadratically with the difference between the guess and the correct answer, so it acts as a very “strict” measurement of wrongness. The cost function computes an average penalty over all of the training examples.

So now we see that our goal is to find θ₀ and θ₁ for our predictor h(x) such that our cost function J(θ) is as small as possible. We call on the power of calculus to accomplish this.

Consider the following plot of a cost function for some particular ML problem:

Here we can see the cost associated with different values of θ₀ and θ₁. We can see the graph has a slight bowl to its shape. The bottom of the bowl represents the lowest cost our predictor can give us based on the given training data. The goal is to “roll down the hill” and find the θ₀ and θ₁ corresponding to this point.

This is where calculus comes into this machine learning tutorial. For the sake of keeping this explanation manageable, I won't write out the equations here, but essentially what we do is take the gradient of J(θ), which is the pair of derivatives of J(θ) (one over θ₀ and one over θ₁). The gradient will be different for every value of θ₀ and θ₁, and it tells us what the “slope of the hill” is and, in particular, “which way is down”, for these particular θs. For example, when we plug our current values of θ into the gradient, it may tell us that adding a little to θ₀ and subtracting a little from θ₁ will take us in the direction of the cost-function valley floor. Therefore, we add a little to θ₀, subtract a little from θ₁, and voilà! We have completed one round of our learning algorithm. Our updated predictor, h(x) = θ₀ + θ₁x, will return better predictions than before. Our machine is now a little bit smarter.

This process of alternating between calculating the current gradient and updating the θs from the results is known as gradient descent.
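To make the loop concrete, here is a minimal R sketch of gradient descent for the univariate predictor h(x) = θ₀ + θ₁x, using made-up salary/satisfaction numbers purely for illustration (not the article's data):

set.seed(1)
salary       <- runif(50, 20, 120)                      # salaries in $k
satisfaction <- 10 + 0.6 * salary + rnorm(50, sd = 8)   # noisy linear pattern

theta0 <- 0; theta1 <- 0   # initial guesses
alpha  <- 0.0001           # learning rate (step size down the hill)
m      <- length(salary)

for (i in 1:1500) {
  h     <- theta0 + theta1 * salary   # current predictions
  error <- h - satisfaction           # per-example "wrongness"
  # partial derivatives of the least-squares cost J
  grad0 <- sum(error) / m
  grad1 <- sum(error * salary) / m
  theta0 <- theta0 - alpha * grad0    # step a little way downhill
  theta1 <- theta1 - alpha * grad1
}
c(theta0, theta1)   # a far better fit than the zero starting point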



That covers the basic theory underlying the majority of supervised Machine Learning systems. But the basic concepts can be applied in a variety of different ways, depending on the problem at hand.

Classification Problems

Under supervised ML, two major subcategories are:

  • Regression machine learning systems: Systems where the value being predicted falls somewhere on a continuous spectrum. These systems help us with questions of “How much?” or “How many?”.

  • Classification machine learning systems: Systems where we seek a yes-or-no prediction, such as “Is this tumor cancerous?”, “Does this cookie meet our quality standards?”, and so on.

As it turns out, the underlying theory is more or less the same. The major differences are the design of the predictor h(x) and the design of the cost function J(θ).

Our examples so far have focused on regression problems, so let’s now also take a look at a classification example.

Here are the results of a cookie quality testing study, where the training examples have all been labeled as either “good cookie” (y = 1) in blue or “bad cookie” (y = 0) in red.

In classification, a regression predictor is not very useful. What we usually want is a predictor that makes a guess somewhere between 0 and 1. In a cookie quality classifier, a prediction of 1 would represent a very confident guess that the cookie is perfect and utterly mouthwatering. A prediction of 0 represents high confidence that the cookie is an embarrassment to the cookie industry. Values falling within this range represent less confidence, so we might design our system such that prediction of 0.6 means “Man, that’s a tough call, but I’m gonna go with yes, you can sell that cookie,” while a value exactly in the middle, at 0.5, might represent complete uncertainty. This isn’t always how confidence is distributed in a classifier but it’s a very common design and works for purposes of our illustration.

It turns out there's a nice function that captures this behavior well. It's called the sigmoid function, g(z) = 1 / (1 + e^(−z)), and its S-shaped curve looks something like this:

z is some representation of our inputs and coefficients, such as:

z = θ₀ + θ₁x

so that our predictor becomes:

h(x) = g(θ₀ + θ₁x)

Notice that the sigmoid function transforms our output into the range between 0 and 1.

The logic behind the design of the cost function is also different in classification. Again we ask “what does it mean for a guess to be wrong?” and this time a very good rule of thumb is that if the correct guess was 0 and we guessed 1, then we were completely and utterly wrong, and vice-versa. Since you can’t be more wrong than absolutely wrong, the penalty in this case is enormous. Alternatively if the correct guess was 0 and we guessed 0, our cost function should not add any cost for each time this happens. If the guess was right, but we weren’t completely confident (e.g. y = 1, but h(x) = 0.8), this should come with a small cost, and if our guess was wrong but we weren’t completely confident (e.g. y = 1 but h(x) = 0.3), this should come with some significant cost, but not as much as if we were completely wrong.

This behavior is captured by the log function, such that:

cost = −log(h(x)) when y = 1, and cost = −log(1 − h(x)) when y = 0

Again, the cost function gives us the average cost over all of our training examples.
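A minimal R sketch of the sigmoid predictor and this log-based cost for a single example, with made-up θ values chosen purely for illustration:

sigmoid <- function(z) 1 / (1 + exp(-z))

theta0 <- -4; theta1 <- 8                       # hypothetical coefficients
h <- function(x) sigmoid(theta0 + theta1 * x)   # predictor between 0 and 1

# log cost for one example: a confident wrong answer is penalized heavily
cost <- function(x, y) {
  if (y == 1) -log(h(x)) else -log(1 - h(x))
}

cost(0.9, 1)   # right and confident -> small cost (~0.04)
cost(0.9, 0)   # wrong and confident -> large cost (~3.2)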

So here we’ve described how the predictor h(x) and the cost function differ between regression and classification, but gradient descent still works fine.

A classification predictor can be visualized by drawing the boundary line; i.e., the barrier where the prediction changes from a “yes” (a prediction greater than 0.5) to a “no” (a prediction less than 0.5). With a well-designed system, our cookie data can generate a classification boundary that looks like this:

Now that’s a machine that knows a thing or two about cookies!

An Introduction to Neural Networks

No discussion of ML would be complete without at least mentioning neural networks. Not only do neural nets offer an extremely powerful tool to solve very tough problems, but they also offer fascinating hints at the workings of our own brains, and intriguing possibilities for one day creating truly intelligent machines.

Neural networks are well suited to machine learning problems where the number of inputs is gigantic. The computational cost of handling such a problem is just too overwhelming for the types of systems we’ve discussed above. As it turns out, however, neural networks can be effectively tuned using techniques that are strikingly similar to gradient descent in principle.

A thorough discussion of neural networks is beyond the scope of this tutorial, but I recommend checking out our previous post on the subject.

Unsupervised Machine Learning

Unsupervised learning is typically tasked with finding relationships within data. There are no training examples used in this process. Instead, the system is given a set of data and tasked with finding patterns and correlations therein. A good example is identifying close-knit groups of friends in social network data.

The algorithms used to do this are very different from those used for supervised learning, and the topic merits its own post. However, for something to chew on in the meantime, take a look at clustering algorithms such as k-means (see the quick example below), and also look into dimensionality reduction systems such as principal component analysis. Our prior post on big data discusses a number of these topics in more detail as well.
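A quick taste of unsupervised learning, clustering some made-up two-dimensional points with base R's k-means:

set.seed(42)
pts <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 4), ncol = 2))   # two loose groups

fit <- kmeans(pts, centers = 2)          # ask for two clusters
table(fit$cluster)                       # how many points landed in each
plot(pts, col = fit$cluster, pch = 19)   # visualize the grouping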

Conclusion

We’ve covered much of the basic theory underlying the field of Machine Learning here, but of course, we have only barely scratched the surface.

Keep in mind that to really apply the theories contained in this introduction to real-life machine learning examples, a much deeper understanding of the topics discussed herein is necessary. There are many subtleties and pitfalls in ML, and many ways to be led astray by what appears to be a perfectly well-tuned thinking machine. Almost every part of the basic theory can be played with and altered endlessly, and the results are often fascinating. Many of these variations grow into whole new fields of study that are better suited to particular problems.

Clearly, Machine Learning is an incredibly powerful tool. In the coming years, it promises to help solve some of our most pressing problems, as well as open up whole new worlds of opportunity. The demand for ML engineers is only going to continue to grow, offering incredible chances to be a part of something big. I hope you will consider getting in on the action!

This article was originally published in Toptal.

Read more…

Guest blog post by Chris Atwood

Recently, I rediscovered a TED Talk by David McCandless, a data journalist, called “The beauty of data visualization.” It’s a great reminder of how charts (though scary to many) can help you tell an actionable story about a topic in a way that bullet points alone usually cannot. If you have not seen the talk, I recommend you take a look for some inspiration about visualizing big ideas.

 

In any social media report you make for the brass, there are several types of data charts to help summarize the performance of your social media channels; the most common ones are bar charts, pie/donut charts and line graphs. They are tried and true but often overused, and are not always the best way to visualize the data to then inform and justify your strategic decisions. Below are some less common charts to help you tell the story about your social media strategy’s ROI.

 

For our examples here, we’ll primarily be examining a brand’s Facebook page for different types of analyses on its owned post performance.

 

Scatter plots

Figure 1: Total engagement vs total reach, colored by post type (Facebook Insights)

What they are: Scatter plots measure two variables against each other to help users determine whether a correlation or relationship exists between those variables.

 

Why they’re useful:  One of the most powerful aspects of a scatter plot is its ability to show nonlinear relationships between variables. They also help users get a sense of the big picture. In the example above, we’re looking for any observable correlations between total engagement (Y axis) and total reach (X axis) that can guide this Facebook page’s strategy. The individual dots are colored by the post type — status update (green), photo (blue) or video (red).

 

This scatter plot suggests that engagement and reach have a direct relationship for photo posts, because the photo points form a fairly clear straight line from the bottom left to the upper right. For other types of posts the relationships are less clear, although it can be noted that video posts have extremely high reach even though their engagement is typically low.
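A minimal ggplot2 sketch of this kind of chart, assuming a posts data frame exported from Facebook Insights with total_reach, total_engagement, and post_type columns (hypothetical names):

library(ggplot2)

ggplot(posts, aes(x = total_reach, y = total_engagement, colour = post_type)) +
  geom_point(alpha = 0.7) +
  labs(x = "Total reach", y = "Total engagement", colour = "Post type")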

 

Box plots

Figure 2: Total reach benchmark by post type (Facebook Insights)

 

What they are: Box plots show the statistical distribution of different categories in your data, and let you compare them against one another and establish a benchmark for a certain variable. They are not commonly used because they’re not always pretty, and sometimes can be a bit confusing to read without the right context.

 

Why they're useful: Box plots are excellent ways to display key performance indicators. Each category (with more than one post) will show a series of lines and rectangles; the box shows what's called the interquartile range (IQR), and the whiskers show the spread beyond it. When you look at all the posts, you can split the values into groups called quartiles (or percentiles) based on the distribution of the values. You can use the median, the boundary of the second quartile, as a benchmark for “average” performance.

 

In this example, we’re once again looking at different post types on a brand’s Facebook page, and seeing what the total reach is like for each. For videos (red), you can see that the lower boundary for the reach is higher than the majority of photo posts, and that it doesn’t have any outliers. Photos, however, tell a different story. The first quartile is very short, while the fourth quartile is much longer. Since most of the posts fall above the second quartile, you know that many of these posts are performing above average. The dots above the whisker indicate outliers — i.e., these posts do not fall within the normal distribution. You should take a closer look at outliers to see what you can learn based on what they have in common (seasonality/timing, imagery, topic, audience targeting, or word choices).
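A ggplot2 sketch of the same idea, using the hypothetical posts data frame from the scatter plot example:

library(ggplot2)

ggplot(posts, aes(x = post_type, y = total_reach, fill = post_type)) +
  geom_boxplot() +
  labs(x = "Post type", y = "Total reach")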

Heat maps

Figure 3: Average total engagement per day by post type (Facebook Insights)

 

What they are: Heat maps are a great way to determine things like which posts have the highest engagement or impressions, on average, on a given day. Heat maps take two categorical variables and compare a single quantitative variable (average total reach, average total engagement, etc.) across them.

 

Why they're useful: Differences in shading show how the values in each column compare. If the shades are all light, there is not a large difference in the values from category to category; if a column mixes light and dark cells, the values differ widely from each other (more interesting!).

 

You could run a similar analysis to see what times of day your posts get the highest engagement or reach, and find the answer to the classic question, “When should I post for the highest results?” You can also track competitors this way, to see how their content performs throughout the day or on particular days of the week. You can time your own posts around when you think shared audiences may be paying less attention to competitors, or make a splash during times with the best performance.

 

In the above example, you can see that three post types from a brand's Facebook page have been categorized by their average total engagement on a given day of the week. Based on the chart, photo performance does not vary much from day to day. Looking closer at the data from the previous box plot, we know that photos are the most common post type and make up a large share of the data set; we can conclude that users are accustomed to seeing those posts, so they perform about the same from day to day. We also see that video posts either perform far above or far below average, and that the best day to post videos for this brand appears to be Thursday.
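A geom_tile sketch of this heat map, again using the hypothetical posts data frame plus a weekday column:

library(dplyr)
library(ggplot2)

posts %>%
  group_by(weekday, post_type) %>%
  summarise(avg_engagement = mean(total_engagement)) %>%
  ggplot(aes(x = post_type, y = weekday, fill = avg_engagement)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(x = "Post type", y = "Day of week", fill = "Avg engagement")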

Tree maps

Figure 4: Average total engagement by content pillar and post type (Facebook Insights)

 

What they are: Tree maps use qualitative information, usually represented as a diagram that grows from a single trunk and ends with many leaves. Tree maps typically have three main components that help you tell what's going on: the size of each rectangle, its relative color, and the hierarchy it sits in.

 

Why they’re useful: Tree maps are a fantastic way to get a high-level look at your social data and figure out where you want to dig in for further analysis. In this example, we’re able to compare the average total engagement between different post types, broken out by content pillar.

For our brand's Facebook page, we have trellised the data by post type (figure 4); in other words, we created a visualization that comprises three smaller visualizations, so we can see how the post type impacts the average total engagement for each content pillar. It answers the question, “Do my videos in category X perform differently than my photos in the same category?” You can also see that the rectangles vary in size from content pillar to content pillar; they are sized by the number of posts in each subset. Finally, they are colored by the average total engagement for that content pillar's subset of the post type. The darker the color, the higher the engagement.

 

We immediately learn that posts in the status trellis aren't performing anywhere near the other post types (it only has one post), and that photos have the greatest number of content pillars, or the greatest variety in topic. You can see from the visualization that you want to spend more of your energy digging into why posts in the Timely, Education and Event categories perform well in both photos and videos.
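A hedged sketch of this trellised tree map using the treemapify package (an assumption; the author does not name a tool), with a summarised data frame pillars holding post_type, pillar, n_posts, and avg_engagement columns (hypothetical names):

library(ggplot2)
library(treemapify)

ggplot(pillars, aes(area = n_posts, fill = avg_engagement, label = pillar)) +
  geom_treemap() +
  geom_treemap_text(colour = "white", reflow = TRUE) +
  facet_wrap(~ post_type)   # one small tree map per post type (the trellis)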

 

TL;DR: Better Presentations are made with Better Charts

In your next analysis, you shouldn't disregard the tried-and-true bar charts, pie graphs and line charts. However, these four less common visualizations may offer a more succinct way to summarize your data and help you explain the performance of your campaigns. They'll also make your reports and wrap-ups look distinctive when they're used correctly. Although other chart types are also useful for better analyses and presentations, the ones discussed here are fairly simple to build, and nearly all of them can be created in Microsoft Excel or in visualization/analysis software such as TIBCO's Spotfire.

Read more…

Top 5 graph visualisation tools

Data visualisation is the process of displaying data in visual formats such as charts, graphs or maps. It is commonly used to extract more meaning from a snapshot of data that would otherwise require sorting through piles of spreadsheets and a great quantity of reports. With the amount of data growing rapidly, it is more important than ever to interpret all of this information correctly and quickly in order to make well-informed business decisions.

Graph visualisation takes a fairly similar approach, just with more diverse and complex sets of data. A graph is a representation of objects (nodes), some of which are connected by links (edges). Graph visualisation is the process of displaying this data graphically to maximise readability and allow us to gain more insight; the small example below illustrates the idea.
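A tiny illustration of the node/link idea in R with the igraph package:

library(igraph)

edges <- data.frame(from = c("A", "A", "B", "C"),
                    to   = c("B", "C", "C", "D"))
g <- graph_from_data_frame(edges, directed = FALSE)

plot(g)     # each object is drawn as a node, each link as an edge
degree(g)   # number of links attached to each node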

Here is the list of top graph visualisation tools that Data to Value found useful.

 

Gephi

 

Gephi is an interactive visualisation and exploration solution that supports dynamic and hierarchical graphs. Gephi's powerful OpenGL engine allows for real-time visualisation of networks with up to 50,000 nodes and 1,000,000 edges. It also provides the user with cutting-edge layout algorithms, including force-based and multi-level algorithms, to make the experience that much more efficient. Moreover, the software is completely free to use; the company only charges for private repositories.

 

 

Tom Sawyer Perspectives

 

Tom Sawyer Perspectives is a graphics-based tool for building data relationship visualisation and analysis applications. The software provides two graphic modules: the designer and the previewer. The designer helps users define schemas, data sources, rules and searches, while the previewer can be used to iteratively view the application design without the need to recompile. Using the two together can drastically increase the speed of application development. Other features of Tom Sawyer Perspectives include data integration for structured, semi-structured and unstructured data, multiple view support, and advanced graph analytic capabilities.

 

Keylines

 

Keylines is a JavaScript toolkit that lets you create custom network visualisations quickly and easily. Keylines puts more freedom into the user's hands because it is not a pre-built application; the developer can change nodes, links and menus, and add entire functions, with a few lines of code. The solution also includes geospatial integration, time bars, various layout patterns and filtering options.

 

 

 

 

 

Linkurious

 

An intuitive graph visualisation solution that offers an easy, out-of-the-box set-up with no configuration needed. This allows people who are not particularly tech-savvy to start analysing and discovering insights in complex data. Linkurious includes a robust search engine that helps users find text in edges and nodes. Moreover, it includes advanced analytic capabilities that let users combine filters to answer complex questions. Linkurious can also identify complex patterns using Cypher, a query language designed specifically for graph analytics.

 

 

GraphX

 

GraphX is advanced graph visualisation software; it is an open-source project and part of the Apache Spark engine. As it is open source, there is a lot of room for customisation, from special functions to custom animations. It also utilises Spark's computing technology, which allows visual and data graphs to be captured and stored in memory. It has built-in default support for layout algorithms, advanced graph edges and vertex features. GraphX also includes a visual preview function for all controls, as well as rich usability documentation and user support.

 

 

 

About us

 

Data to Value are a specialist data consultancy based in London. We apply graph technology to a variety of data requirements as part of next-generation data strategies. Contact us for more details if you are interested in finding out how we can help your organisation leverage this approach.

Originally posted on Data Science Central

Read more…

Guest blog post by Divya Parmar

To once again demonstrate the power of MySQL (download), MySQL Workbench (download), and Tableau Desktop (free trial version can be downloaded here), I wanted to walk through another data analysis example. This time, I found a Medicare dataset publicly available on Data.gov and imported it using the Import Wizard as seen below.

 

 

Let's take a look at the data: it has hospital location information, measure name (payment for heart attack patients, pneumonia patients, etc.), and payment information.

 

I decided to look at the difference in lower and higher payment estimates for heart attack patients for each state to get a sense of variance in treatment cost. I created a query and saved it as a view.

 

One of the convenient features of Tableau Desktop is the ability to connect directly to MySQL, so I used that connection to load my view directly into Tableau.

  

I wanted to see how the difference between the lower and higher payment estimates varies by state. Using Tableau's maps and geographic recognition of the state column, I used a few drag-and-drop moves and a color fill to complete the visualization.

You can copy the image itself to use elsewhere, choosing to add labels and legends if necessary. Enjoy. 

About: Divya Parmar is a recent college graduate working in IT consulting. For more posts every week, and to subscribe to his blog, please click here. He can also be found on LinkedIn and Twitter.

Read more…
